Note: Above is a podcast I created using NotebookLM about the blog below; it's honestly way cooler than I thought it would be :-)
In 2010, a groundbreaking study by Leek et al. demonstrated something alarming about applying machine learning to biological data: simply shuffling the sequencing run order of RNA samples dropped a disease classifier's accuracy from 91% to barely better than random chance (52%). This isn't merely an academic concern—it means patients could receive entirely different diagnoses depending on which lab processed their samples or when the samples were run. Imagine training an image classifier to identify photos of dogs versus cats, only to discover it was actually detecting whether the photos were taken indoors or outdoors rather than identifying the animals themselves. Similarly, the biological classifier hadn't learned patterns related to disease; it had instead captured the fingerprint of how the samples were processed, latching onto meaningless laboratory variations rather than the actual biology.
Fast forward to today, where despite remarkable advances in AI algorithms and the dawn of large language models and sophisticated computer vision, this problem persists. A 2024 review calls batch effects—systematic differences in measurements caused by technical variables like which lab processed the samples or even what time of day experiments were run—"the single largest barrier to reliable multi-omics machine learning." Another result emphasizes that even with advanced data standardization and correction algorithms, addressing batch effects remains dependent on rigorous clinical research design. The algorithms have improved dramatically, but the fundamental nature of biological data hasn't changed.
Why does this matter now more than ever? Because we're in an era where AI's dominant paradigm—"just add more data"—has delivered unprecedented breakthroughs in language and image processing. The spectacular successes of models like ChatGPT, Claude, and MidJourney have reinforced a compelling narrative: scale your dataset, and performance will follow. This approach has become so seductive that it's being applied across domains with minimal adaptation, often with a shiny "GPT" suffix attached to signal its presumed transformative power.
But biology presents a unique challenge to this paradigm. Unlike text or images, where more diverse data tends to improve models, biological datasets carry hidden signatures that can mislead AI in fundamental ways. The question we need to ask isn't just whether we can apply language model techniques to biology, but whether biology's data structure allows these techniques to work at all.
The False Promise of Scale: Biology's Unique Data Challenge
To understand why biology presents such unique challenges, we must examine how biological data fundamentally differs from the text and images that have fueled AI's recent successes. The "just add more data" approach that works brilliantly for language and vision involves gathering massive amounts of unlabeled data and applying self-supervised learning, where algorithms first learn general patterns before being fine-tuned for specific tasks. This two-step process has transformed AI capabilities in many domains—but falters dramatically when applied to biology.
The issue begins with how information is captured. When you write a sentence, each word is represented by the same digital code regardless of who typed it or where—the word "photosynthesis" is encoded identically whether typed in Tokyo or Toronto. Similarly, in photography, while subtle differences between cameras exist, the subject remains recognizable—a cat is visibly a cat despite variations in resolution or lighting. The signal dominates the noise.
In biological data, however, this fundamental relationship breaks down. A measurement of the same gene's expression in the same cell type can vary dramatically depending on sample preparation, reagent batch, and instrument calibration. Each step in the process—from tissue collection to final measurement—introduces variability that often overwhelms the biological signal of interest.
Consider RNA sequencing, where a sample undergoes dozens of processing steps before yielding data. The same blood sample analyzed by different labs will produce substantially different results due to variations in extraction methods, library preparation protocols, and sequencing platforms. More advanced technologies further amplify these problems: single-cell transcriptomics—which measures gene activity in individual cells rather than cell mixtures, revealing differences between seemingly identical cells—introduces new sources of variability through cell dissociation methods. Similarly, spatial transcriptomics—which preserves information about where each cell was located within a tissue, mapping cellular function to position—adds complexity through variations in fixation time and tissue sectioning. One work cataloged over 40 routine sources of batch effects in genomics data alone, demonstrating that these technical variations can completely overshadow the biological differences researchers actually care about.
Imagine teaching an AI to identify great recipes by showing it photos of thousands of dishes without telling it which ones come from restaurants versus home kitchens. Restaurant dishes are professionally photographed with perfect lighting and plating, while homemade meals are captured with smartphones in dim kitchens. The AI would likely learn that 'good food' means professional lighting and elegant plating rather than actual culinary quality. When shown a brilliantly photographed bowl of mediocre pasta, it would rate it higher than a poorly photographed but exquisitely prepared home-cooked meal. This is precisely what happens with biological data—models eagerly learn the distinctive "fingerprints" of different laboratories, equipment, or experimental runs instead of the underlying biology.
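To make the recipe analogy concrete, here is a deliberately simplified simulation; the gene counts, effect sizes, and batch offsets are invented for illustration and are not drawn from any real study. A logistic regression trained on batch-confounded expression data looks excellent under a random train/test split, then collapses when the batch-label pairing changes at a new site:

```python
# Toy illustration (invented numbers): a classifier trained on batch-confounded
# expression data scores well on a random split but fails on a new site.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_group, n_genes = 200, 500

def simulate(batch_offset, disease):
    """Expression = weak disease signal in 10 genes + large batch shift + noise."""
    x = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
    x[:, :10] += 0.3 * disease   # small biological signal
    x += batch_offset            # strong technical fingerprint
    return x

# Training data: batch A ran mostly disease samples, batch B mostly healthy ones.
X_train = np.vstack([simulate(+2.0, 1), simulate(-2.0, 0)])
y_train = np.array([1] * n_per_group + [0] * n_per_group)

# A new site where the batch/label pairing is reversed.
X_new = np.vstack([simulate(-2.0, 1), simulate(+2.0, 0)])
y_new = y_train.copy()

clf = LogisticRegression(max_iter=2000)
X_tr, X_te, y_tr, y_te = train_test_split(X_train, y_train, random_state=0)
clf.fit(X_tr, y_tr)
print("random split, same batches:", clf.score(X_te, y_te))   # looks excellent
print("held-out site:             ", clf.score(X_new, y_new))  # collapses
```

The model has learned the batch fingerprint rather than the disease, which is exactly the failure mode the analogy above describes.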
Several key factors allow text and image models to succeed despite potential artifacts:
Standardized encoding in text: When you type "photosynthesis," it becomes the exact same UTF-8 sequence regardless of who typed it, what keyboard they used, or what time of day it was. This uniform representation means language models can focus on patterns in the text itself, not artifacts from its generation (a small sketch of this contrast appears after this list).
Information-to-noise ratio in images: While camera artifacts exist in photos, the relevant visual information strongly dominates these artifacts. In biological data, a subtle but crucial gene expression change can be completely overshadowed by batch effects.
Massive scale and diversity: Text and image datasets often contain billions of examples from countless sources, effectively averaging out source-specific artifacts. In biology, experiments often come from a handful of labs using similar protocols, amplifying rather than diluting batch effects.
Domain-specific normalization: Image and text processing pipelines include techniques specifically designed to counteract artifacts. In biology, these normalization techniques are still evolving and often can't fully address the complexity of batch effects, mainly due to the absence of clear ground truth.
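A tiny sketch of the contrast in the first two points above; the lab-specific scale and shift factors are made up purely for illustration:

```python
# A word is byte-for-byte identical no matter who typed it or where.
assert "photosynthesis".encode("utf-8") == "photosynthesis".encode("utf-8")

# A measurement of the same underlying expression level is not: each lab applies
# its own (here, invented) scale and shift before noise is even considered.
import numpy as np
rng = np.random.default_rng(1)
true_expression = 100.0
lab_a = true_expression * 1.4 + rng.normal(0, 5)  # different kit, calibration
lab_b = true_expression * 0.7 + rng.normal(0, 5)
print(lab_a, lab_b)  # same biology, very different numbers
```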
Crucially, in image and text domains, we invariably know the "ground truth"—we can easily verify that a picture contains a cat or that a sentence is grammatically correct. In biology, we rarely know what gene expression levels "should" be, making it nearly impossible to separate real biological signals from experimental noise.
This technical noise isn't merely theoretical—it's a primary contributor to biology's well-documented reproducibility crisis, where "you can't reproduce my cell-line result in your lab" is a common frustration. Work such as Molania et al. (2022) is devoted entirely to removing unwanted variation from large-scale RNA-seq data, while metadata compliance with minimum information standards remains below 50% for public datasets.
Without explicit correction, biological data offers no salvation through scale. While adding more text data from diverse sources tends to cancel out source-specific quirks, adding more biological data from different labs often amplifies batch effects. Each additional dataset brings its own unique experimental fingerprint, making the noise more complex rather than averaging it out. Without proper controls for these technical variations, more data can actually make biological models less accurate rather than more.
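One way to see why scale alone doesn't rescue us: independent noise shrinks as you average more samples, but a technical offset that is confounded with the condition of interest does not shrink at all. A minimal numerical sketch with invented effect sizes:

```python
# The estimated "disease effect" stays biased no matter how many samples we add,
# because every disease sample carries the disease lab's technical offset.
import numpy as np
rng = np.random.default_rng(2)

true_disease_effect = 0.5   # what we want to recover
lab_shift = 3.0             # technical offset of the lab that ran the disease samples

for n in [100, 10_000, 1_000_000]:
    disease = true_disease_effect + lab_shift + rng.normal(0, 1, n)  # lab A, all disease
    healthy = rng.normal(0, 1, n)                                    # lab B, all healthy
    print(n, round(disease.mean() - healthy.mean(), 3))  # converges to 3.5, not 0.5
```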
Wait—What About AI In Genomics, Pathology and AlphaFold?
Given the pervasive impact of batch effects described above, a natural question arises: how do we explain the remarkable successes of AI in certain biological domains? Skeptics often point to impressive achievements in DNA language models and protein-structure prediction as counterexamples, asking: If batch effects are truly so problematic, why do these models work so well? This apparent contradiction deserves careful examination.
There are several key reasons:
DNA sequencing is nearly digital. Unlike other biological measurements, DNA sequencing produces relatively standardized outputs. Once sequenced, the basic units (A,C,G,T) remain fixed and universal across labs. While technical variables like read quality and coverage depth introduce some noise, the fundamental data representation doesn't substantially change between labs or instruments. Lab protocols affect counts and coverage, not the underlying base calls. Generative tasks—like predicting a promoter motif—therefore feel almost text-like in their consistency.
Protein-structure models exploit evolution, not lab spectra. AlphaFold learns residue-residue couplings from multiple-sequence alignments derived from genomic data, not from laboratory measurements. These alignments—arrangements of protein or DNA sequences that highlight evolutionary similarities and shared functional elements across species—represent relationships that remain constant regardless of which lab analyzes them. This evolutionary information is far less tainted by batch noise than laboratory measurements.
Computational pathology shows promise. Digital pathology represents another area where AI has shown success, particularly in cancer diagnosis. While there are variations in tissue processing, staining techniques, and imaging equipment, the visual features of disease states are often robust enough to these variations that deep learning models can effectively learn diagnostic patterns. The key advantage here is that the signal (cellular morphology and tissue architecture) is visually dominant over the noise (staining variation), similar to how general image recognition works.
These success stories share a common thread: they all operate in domains where either the data representation is inherently standardized (DNA sequences), the signal is derived from evolution rather than direct measurement (protein alignments), or the signal-to-noise ratio favors the biological signal (pathology images).
However, these examples, while impressive, only capture static snapshots or probabilistic views of biology rather than its dynamic, system-level behavior. The genome tells us what might happen, and protein structures show us potential forms, but neither reveals the dynamic reality of what is actually happening in a living system at a particular moment. The true intersection of bio-AI and healthcare will emerge from technologies that capture this dynamic reality—RNA-seq showing which genes are active, proteomics revealing which proteins are being expressed, metabolomics tracking biochemical processes, and single-cell analyses observing heterogeneity within tissues. These are precisely the areas most affected by batch effects today, yet they hold the greatest potential for transformative healthcare applications like precise disease monitoring, therapy selection, and drug development. Solving the batch effect challenge in these domains isn't just an academic exercise—it's the key to unlocking the next generation of predictive and personalized medicine.
Two Ways Out
Looking at the challenge of batch effects in biological data, I see only two viable strategies that can fundamentally change how we approach biological AI. These aren't hypothetical possibilities—they represent the only realistic paths forward if we want to realize the promise of machine learning in biology.
Strategy 1: Radical Transparency—EXIF for Biology
For researchers working with diverse public datasets, I propose a transparency revolution analogous to what EXIF data did for photography. For context, EXIF (Exchangeable Image File Format) revolutionized digital photography by embedding crucial metadata—camera settings, date/time information, and even GPS coordinates—directly into image files. This standardized approach to capturing technical context transformed photography, enabling software to automatically correct for specific camera quirks, standardize colors across different devices, and adjust for lighting conditions. In biological data, a similar revolution would mean comprehensive documentation of experimental conditions that could allow algorithms to separate technical artifacts from genuine biological signals. This would require:
Exposing everything: Biological datasets need metadata standards as rigorous as EXIF data for photographs. Every experimental detail—from reagent lot numbers to equipment calibration records—must accompany the raw data. We need side-car files with all provenance information: instrument firmware versions, ambient humidity, even operator IDs (a sketch of such a sidecar follows this list).
Design for invariance: Rather than ignoring batch effects, we should explicitly model them. Training strategies should include augmentation across reagent lots, instruments, and operators to force models to learn biology, not artifacts. Strategies like domain-adversarial networks—machine learning techniques designed to identify features useful for the main task while remaining invariant to differences between data sources—can be employed to learn to ignore these nuisance factors (sketched in code after this list).
Cross-lab validation: The gold standard for biological AI isn't random cross-validation but holding out entire laboratories or experimental sites as test data (see the leave-one-lab-out example after this list).
Causal objectives: New training approaches can incorporate causal reasoning or domain-adversarial objectives that actively penalize models for learning batch-specific patterns.
Protocols-as-code: Version-controlled, precisely specified experimental procedures can reduce the hidden entropy in data generation.
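As a rough sketch of what an "EXIF for biology" sidecar could look like, here is a hypothetical JSON file written from Python; the field names and values are illustrative placeholders, not an existing standard:

```python
# Hypothetical provenance sidecar accompanying a raw RNA-seq sample.
# Field names are illustrative, not a community standard.
import json

sidecar = {
    "sample_id": "S-0421",
    "assay": "bulk RNA-seq",
    "extraction_kit_lot": "LOT-7731A",
    "library_prep_protocol": "protocols/rnaseq_prep_v3.2.md",  # protocols-as-code reference
    "instrument": {"model": "example-sequencer", "firmware": "x.y.z"},
    "sequencing_run_id": "RUN-2025-118",
    "run_date": "2025-03-14T09:30:00Z",
    "operator_id": "OP-12",
    "ambient_humidity_pct": 41,
}

with open("S-0421.rnaseq.sidecar.json", "w") as f:
    json.dump(sidecar, f, indent=2)
```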
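For the invariance and causal/adversarial objectives above, a minimal domain-adversarial sketch in PyTorch might look like the following; the architecture, sizes, and loss weighting are arbitrary placeholders rather than a recommended recipe:

```python
# Domain-adversarial sketch: the encoder is rewarded when the disease head
# succeeds and when the batch head fails, pushing it toward batch-invariant features.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

n_genes, n_batches = 500, 4
encoder      = nn.Sequential(nn.Linear(n_genes, 64), nn.ReLU())
disease_head = nn.Linear(64, 2)           # what we want the model to learn
batch_head   = nn.Linear(64, n_batches)   # what we want it to be unable to learn
loss_fn = nn.CrossEntropyLoss()

def training_loss(x, y_disease, y_batch, lam=1.0):
    z = encoder(x)
    disease_loss = loss_fn(disease_head(z), y_disease)
    batch_loss = loss_fn(batch_head(GradReverse.apply(z, lam)), y_batch)
    return disease_loss + batch_loss  # minimized jointly over all three modules

# Dummy forward pass to show the shapes involved.
x = torch.randn(32, n_genes)
print(training_loss(x, torch.randint(0, 2, (32,)), torch.randint(0, n_batches, (32,))))
```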
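And for cross-lab validation, scikit-learn's LeaveOneGroupOut makes "hold out an entire lab" straightforward; X, y, and lab_ids below are random placeholders standing in for an expression matrix, disease labels, and per-sample lab identifiers:

```python
# Leave-one-lab-out evaluation: every fold holds out all samples from one lab.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 500))                        # placeholder expression matrix
y = rng.integers(0, 2, size=300)                       # placeholder disease labels
lab_ids = np.repeat(["lab_A", "lab_B", "lab_C"], 100)  # one group per originating lab

scores = cross_val_score(
    LogisticRegression(max_iter=2000), X, y,
    groups=lab_ids, cv=LeaveOneGroupOut(),
)
print(scores)  # one score per held-out lab; this is the number worth reporting
```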
While this approach is theoretically sound, the reality is more challenging. There are currently no strong incentives for individual labs to adopt these comprehensive metadata practices. Scientific publications are often vague about batch information, and without coordinated efforts across the field, the transparency approach faces significant adoption barriers.
Strategy 2: Manufacturing-Grade Standardization
The second approach, which I see as more immediately practical, has two variations:
Industry-Scale Standardization: End-to-end standard operating procedures (like those from Broad's Terra or 10x Chromium) turn labs into miniature fabrication facilities. The closer samples look like "identical twins," the less the model has to disentangle. However, this approach has limitations. Standardization typically happens only after technologies become commoditized, which is still not the case for many cutting-edge biological assays. Achieving universal standardization requires a level of maturity that biological data generation hasn't yet reached.
The Controlled Data Generation Approach: This solution is being pursued by companies like Tempus AI, Recursion Pharmaceuticals, and my own company, Biostate.ai. Rather than wrestling with the heterogeneity of public datasets, these entities take a fundamentally different approach: generating all data in-house with rigorous standardization. By controlling every aspect of the experimental process—from sample collection to sequencing—one can ensure that batch effects are systematically minimized. When a single entity handles the entire pipeline with consistent protocols, equipment, and personnel, the noise profile becomes predictable and can be effectively managed. The key insight is that when the economics of data generation improve sufficiently, it becomes viable to build standardized datasets where the signal-to-noise ratio strongly favors biological signal over technical artifacts.
Similar approaches are being explored by research institutes like the Broad Institute's Data Sciences Platform, the Chan Zuckerberg Biohub, the New York Genome Center, and emerging initiatives like the Arc Institute. Foundations including the Gates Foundation's Grand Challenges and the Parker Institute for Cancer Immunotherapy are also investing in standardized data generation infrastructure to support their respective missions.
This approach has two significant drawbacks, however. First, it's typically much more expensive than using public datasets, requiring substantial investment in laboratory infrastructure and personnel. At Biostate.ai, we've developed unique technological capabilities that deliver a 10x cost reduction in RNA-seq, whole exome sequencing, and single-cell technologies compared to traditional approaches, making this strategy more economically viable for ourselves, but others might not have the same luxury. Second, this approach often results in proprietary, closed datasets that aren't available to the broader scientific community, potentially limiting scientific progress and reproducibility—a tension that must be acknowledged as we pursue solutions to the batch effect challenge. The ideal future may involve a hybrid approach: standardized data generation with open sharing policies that preserve both data quality and scientific openness.
These are the only two paths forward—there is no third option that magically solves the batch effect problem while maintaining the current fragmented approach to biological data generation. We either radically improve how we document experimental variation, or we must control it by centralizing and standardizing data generation. Neither path is cheap, but both are cheaper than drawing the wrong conclusion from a petabyte of noisy data.
The Future of Biology's Data Revolution
If we want GPT-style breakthroughs in biology that can transform healthcare, we must fundamentally change how we approach biological data. The batch effects that plague our measurements aren't just annoying background noise—they're existential threats to the entire biological AI paradigm. Left unaddressed, they will continue to undermine our models, waste billions in R&D spending, and delay life-saving innovations.
The promise of AI in biology remains profound. Imagine predictive models that track disease progression in real-time, personalize treatments based on an individual's unique biology, or accelerate drug discovery by accurately simulating biological responses. These aren't science fiction—they're achievable goals that hinge on solving the batch effect challenge.
But this will require both technical innovation and cultural change across the biological sciences. Training models on public omics data without rich metadata is like teaching a vision model with lens caps smeared differently for each image and never telling the model which smear came from which cap. Unsurprisingly, it learns smear patterns, not biology.
The path forward will require commitment from diverse stakeholders. Academic researchers must prioritize comprehensive metadata collection and transparent reporting. Funding agencies should incentivize reproducibility and cross-lab validation. Technology companies need to invest in standardization and controlled data generation. And regulatory bodies must develop frameworks that acknowledge the unique challenges of biological AI.
Those who solve this challenge—whether through radical transparency, industrial-grade standardization, or novel approaches we haven't yet imagined—will unlock unprecedented capabilities in medicine, agriculture, and biotechnology. The winners won't be those who simply apply the latest AI techniques to biological data, but those who develop approaches that respect the unique challenges of measuring and modeling life itself.
Biology's complexity isn't just an obstacle—it's also our greatest opportunity. By solving these fundamental data challenges, we can build AI systems that truly understand the generative nature of living systems and harness that understanding to improve health and extend human capabilities. The data revolution that transformed how we interact with information is poised to transform how we understand and interact with life itself—if we can first learn to see beyond the batch effect.