👋 Welcome to JX’s log

Hi, this is JJ.

  • I’m directing my logger to put down my learning notes, thoughts, and updates here.

Estimating Statistical Properties in Grouped Measurements

Intro

In this post, we’ll explore a statistical problem: estimating population means and their uncertainties from hierarchically structured data, or so-called grouped measurements. While averaging measurements may seem straightforward, the presence of natural groupings in the data introduces important statistical considerations that require careful treatment. To illustrate the practical significance of this problem, let’s examine how hierarchically structured measurements, where individual observations are naturally clustered into groups, arise across diverse real-world applications. Multiple-Instance Learning (MIL), for instance, is an important machine learning paradigm specifically designed for analyzing such grouped or clustered data structures. The following scenarios showcase a few situations where measurements naturally organize into clusters of varying sizes: ...
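As a rough illustration of why grouping matters, here is a minimal sketch (with made-up data; not necessarily the estimator the full post develops) of one common approach: average within each group first, then quantify uncertainty from the between-group spread, so that groups, not individual measurements, are treated as the independent units:

```python
import numpy as np

# Toy grouped measurements with varying group sizes (illustrative only).
groups = [
    np.array([1.2, 0.9, 1.1]),
    np.array([0.7, 0.8]),
    np.array([1.5, 1.4, 1.6, 1.3]),
]

# Collapse each group to its mean so every group contributes one value.
group_means = np.array([g.mean() for g in groups])
k = len(group_means)

# Population-mean estimate: unweighted mean of the group means.
pop_mean = group_means.mean()

# Standard error from the between-group variance (k - 1 degrees of freedom).
se = group_means.std(ddof=1) / np.sqrt(k)

print(f"estimated mean = {pop_mean:.3f} ± {se:.3f}")
```

Pooling all measurements into one flat average would instead let the largest groups dominate and understate the uncertainty whenever measurements within a group are correlated.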

2025-03-01 · 19 min · Jiajie Xiao

回留 Revisiting Khalil Fong

Time waits for no one. 大同 was such a good person. In him I saw the person my younger self hoped to become: humble, humorous, optimistic, devoted, resilient, open-minded, talented, visionary, his eyes lighting up whenever he spoke of the people and things he loved… ...

2025-02-28 · 1 min · Jiajie Xiao

Biomedical LLMs (2): Genomics

[Updated in Jan 2025]: Added HELM. [Updated in Dec 2024]: Added MethylGPT. In the previous post, we introduced Large Language Models (LLMs) and discussed how they are constructed, trained, and utilized. Beginning with this post in the Biomedical LLMs series, we will explore their applications in biomedical domains. This post concentrates on a few LLMs for genomics (e.g., DNA and RNA).

DNA Language Models

DNABERT

DNABERT (Ji et al., 2021) is designed to encode genomic DNA sequences by adapting the Bidirectional Encoder Representations from Transformers (BERT) model. DNABERT uses a Transformer encoder architecture whose attention mechanisms effectively capture both local and long-range dependencies in DNA sequences and offer contextual representations of the input sequences. The encoder-only architecture is identical to the BERT base model, comprising 12 transformer layers, each with 768 hidden units and 12 attention heads. ...
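To make the described architecture concrete, here is a minimal sketch of a BERT-base-shaped encoder built with the Hugging Face transformers library; the vocab_size is my assumption (roughly $4^6 = 4096$ 6-mer tokens plus a few special tokens), not a value quoted from the paper:

```python
from transformers import BertConfig, BertModel

# Encoder-only model shaped like BERT base, as the excerpt describes:
# 12 transformer layers, 768 hidden units, 12 attention heads.
config = BertConfig(
    vocab_size=4101,            # assumed: 4^6 6-mers + special tokens
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)
model = BertModel(config)

# Roughly BERT-base scale (smaller embedding table than English BERT).
print(sum(p.numel() for p in model.parameters()))
```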

2024-05-12 · 44 min · Jiajie Xiao

Biomedical LLMs (1): Intro

The rapid advancements in Natural Language Processing (NLP) have showcased the versatility and efficacy of Large Language Models (LLMs). These models have demonstrated significant capabilities in compressing vast amounts of information through unsupervised or self-supervised training, enabling impressive few-shot and zero-shot learning performance. These attributes make LLMs particularly attractive for domains where generating extensive task-specific datasets is challenging, such as in biomedical applications. Recent attempts to apply LLMs in biomedical contexts have yielded promising results, highlighting their potential to address complex problems where data scarcity is a significant barrier. Starting with this post, I plan to write a series on Biomedical LLMs. ...

2024-05-10 · 12 min · Jiajie Xiao

What a large p for small n

“Large p small n” describes a scenario where the number of features ($p$) is much greater than the number of observations ($n$) available for model training. While it is not a new problem, it continues to pose significant challenges in real-world applications of machine learning, especially in domains lacking rich data or fast, cheap data generation processes. In this blog post, I’ll document my recent thoughts on the “large p small n” problem. ...
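As one concrete illustration (regularization is a standard remedy here, though the post’s own discussion is truncated above; the synthetic data and penalty strength are purely illustrative), here is a sketch with scikit-learn comparing plain least squares against ridge regression when $p \gg n$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, p = 30, 500                      # far more features than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                      # only a handful of features matter
y = X @ beta + rng.normal(scale=0.5, size=n)

# With p > n, unregularized least squares can fit the training data
# exactly but tends to generalize poorly.
ols = LinearRegression().fit(X, y)

# Ridge's L2 penalty trades a little bias for a large variance reduction.
ridge = Ridge(alpha=10.0).fit(X, y)

X_test = rng.normal(size=(1000, p))
y_test = X_test @ beta
print("OLS   test R^2:", round(ols.score(X_test, y_test), 3))
print("Ridge test R^2:", round(ridge.score(X_test, y_test), 3))
```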

2024-04-29 · 17 min · Jiajie Xiao

Toward Robust AI (2): How To Achieve Robust AI

In my previous post, I highlighted the growing influence and adoption of Artificial Intelligence (AI) and machine learning (ML) systems, discussing how they attain “intelligence” through a careful “data diet.” However, a fundamental challenge arises from out-of-distribution (OOD) data, which poses barriers to robust performance and reliable deployment. In particular, covariate shift (eq 1) and concept drift (eq 2) are two major types of OOD frequently encountered in practice, and both demand mitigation for robust model deployment. ...
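For quick reference, these are the standard textbook definitions, which I assume the equation numbers above correspond to in the full post. Covariate shift means the input distribution changes while the labeling rule stays fixed:

$$P_{\text{train}}(x) \neq P_{\text{test}}(x), \qquad P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x)$$

Concept drift means the input–output relationship itself changes:

$$P_{\text{train}}(y \mid x) \neq P_{\text{test}}(y \mid x)$$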

2024-01-06 · 26 min · Jiajie Xiao

Toward Robust AI (1): Why Robustness Matters

Brilliant AI/ML Models Remain Brittle

Artificial intelligence (AI) and machine learning (ML) have garnered significant attention for their potential to emulate, and sometimes surpass, human capabilities across diverse domains such as vision, translation, and planning. The popularity of groundbreaking models like ChatGPT and Stable Diffusion has fueled optimism, with many speculating not if, but when, Artificial General Intelligence (AGI) will emerge. Yet, beneath the in silico surface, AI/ML systems remain, at their core, parametrized mathematical models. They are trained to transform inputs into predictive outputs for tasks like classification, regression, media generation, data clustering, and action planning. Despite the awe-inspiring results, the deployment of even the most sophisticated models reveals a fundamental fragility. ...

2023-12-17 · 11 min · Jiajie Xiao

Hello World

Greetings! This is JJ, and I am thrilled to welcome you to my corner of the internet! Taking inspiration from Lilian Weng, whose blog has been an invaluable resource during my studies and work in AI/ML, I’ve decided to share my learning notes, thoughts, and updates here.

Why Blogging?

While I wrote in a diary constantly during my childhood, I have to admit that I haven’t done so for quite a while. For me, blogging on this site may be more than just a digital diary: I hope to crystallize my thoughts, document my learning experiences, and engage in meaningful conversations with readers. ...

2023-12-03 · 2 min · Jiajie Xiao