Biomedical LLMs (2): Genomics

[Updated in Jan 2025]: Added HELM. [Updated in Dec 2024]: Added MethylGPT. In the previous post, we covered an introduction to Large Language Models (LLMs) and how they are constructed, trained, and utilized. Beginning with this post in the Biomedical LLMs series, we will explore their applications in biomedical domains. This post will concentrate on a few LLMs for genomics (e.g., DNA and RNA).

DNA Language Models

DNABERT

DNABERT (Ji et al., 2021) is designed to encode genomic DNA sequences by adapting the Bidirectional Encoder Representations from Transformers (BERT) model. DNABERT uses a Transformer encoder architecture built on attention mechanisms, which capture both local and long-range dependencies in DNA sequences and provide contextual representations of the input. The encoder-only architecture is identical to the BERT base model, comprising 12 transformer layers, each with 768 hidden units and 12 attention heads. ...
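To make the architecture description concrete, here is a minimal sketch (not the DNABERT authors' code) of a BERT-base-sized encoder in Hugging Face transformers. The vocabulary size is an illustrative assumption based on 6-mer tokenization of the four nucleotides plus a handful of special tokens.

```python
# Minimal sketch: a BERT-base-sized encoder configured like DNABERT.
# Assumption: 6-mer tokenization, i.e. 4^6 = 4096 k-mers + 5 special tokens.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=4101,              # assumed: 4096 6-mers + 5 special tokens
    hidden_size=768,              # hidden units per layer, as in BERT base
    num_hidden_layers=12,         # 12 transformer encoder layers
    num_attention_heads=12,       # 12 attention heads per layer
    max_position_embeddings=512,  # maximum input length in tokens
)

model = BertForMaskedLM(config)   # masked-token pretraining objective, as in BERT
print(f"Parameters: {model.num_parameters():,}")
```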

2024-05-12 · 44 min · Jiajie Xiao

Biomedical LLMs (1): Intro

The rapid advancements in Natural Language Processing (NLP) have showcased the versatility and efficacy of Large Language Models (LLMs). These models have demonstrated significant capabilities in compressing vast amounts of information through unsupervised or self-supervised training, enabling impressive few-shot and zero-shot learning performance. These attributes make LLMs particularly attractive for domains where generating extensive task-specific datasets is challenging, such as in biomedical applications. Recent attempts to apply LLMs in biomedical contexts have yielded promising results, highlighting their potential to address complex problems where data scarcity is a significant barrier. Starting with this post, I plan to write a series on Biomedical LLMs. ...

2024-05-10 · 12 min · Jiajie Xiao

What a large p for small n

“Large p small n” describes a scenario where the number of features ($p$) is much greater than the number of observations ($n$) for model training. While it is not a new problem, it continues to pose significant challenges in real-world applications of machine learning, especially for domains lacking rich data or fast and cheap data generation processes. In this blog post, I’ll document my recent thoughts on the “large p small n” problem. ...
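As a quick illustration of why this regime is hard, the sketch below (not from the post; all sizes and values are assumed) builds a dataset with $n = 50$ samples and $p = 5{,}000$ features, of which only a few carry signal. Unregularized least squares interpolates the training data yet generalizes poorly, while a sparsity-inducing penalty recovers most of the signal.

```python
# Illustrative "large p, small n" setup: n = 50 samples, p = 5,000 features,
# only a handful of which are truly informative (all values are assumptions).
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p, k_informative = 50, 5000, 5
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:k_informative] = 2.0                       # sparse ground-truth signal
y = X @ beta + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)         # interpolates the training set
lasso = LassoCV(cv=5).fit(X_tr, y_tr)            # L1 regularization enforces sparsity

print("OLS   train R^2:", ols.score(X_tr, y_tr), " test R^2:", ols.score(X_te, y_te))
print("Lasso train R^2:", lasso.score(X_tr, y_tr), " test R^2:", lasso.score(X_te, y_te))
```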

2024-04-29 · 17 min · Jiajie Xiao

Toward Robust AI (2): How To Achieve Robust AI

In my previous post, I highlighted the growing influence and adoption of Artificial Intelligence (AI) and machine learning (ML) systems, discussing how they attain “intelligence” through a careful “data diet.” However, a fundamental challenge arises from out-of-distribution (OOD) data, which poses barriers to robust performance and reliable deployment. In particular, covariate shift (eq 1) and concept drift (eq 2) are two major types of OOD frequently encountered in practice, and both demand mitigation for robust model deployment. ...
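For orientation, these two shifts are commonly formalized through the factorization $P(X, Y) = P(Y \mid X)\,P(X)$. The display below gives the standard textbook formulations; they may differ in notation from the post's numbered equations (eq 1, eq 2).

$$
\text{Covariate shift:}\quad P_{\text{train}}(X) \neq P_{\text{test}}(X), \qquad P_{\text{train}}(Y \mid X) = P_{\text{test}}(Y \mid X)
$$

$$
\text{Concept drift:}\quad P_{\text{train}}(Y \mid X) \neq P_{\text{test}}(Y \mid X)
$$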

2024-01-06 · 26 min · Jiajie Xiao

Toward Robust AI (1): Why Robustness Matters

Brilliant AI/ML Models Remain Brittle

Artificial intelligence (AI) and machine learning (ML) have garnered significant attention for their potential to emulate, and sometimes surpass, human capabilities across diverse domains such as vision, translation, and planning. The popularity of groundbreaking models like ChatGPT and Stable Diffusion has fueled optimism, with many speculating not if, but when, Artificial General Intelligence (AGI) will emerge. Yet, beneath the in silico surface, AI/ML systems remain, at their core, parametrized mathematical models. They are trained to transform inputs into predictive outputs for tasks such as classification, regression, media generation, data clustering, and action planning. Despite the awe-inspiring results, the deployment of even the most sophisticated models reveals a fundamental fragility. ...

2023-12-17 · 11 min · Jiajie Xiao