Not All Zeros Are Equal
Sometimes, as one dives into data—whether from sophisticated biological experiments or from straightforward retail logs—one notices an odd pattern: rows upon rows of zeros. These zeros might mean absolutely nothing was there, or they might signal that something slipped past our measurements. That ambiguity is known as zero inflation.
Picture your neighborhood grocery store. You walk in late in the day and see certain shelves completely empty. One possibility is that these products were never stocked at all. Another is that they were put out but sold so quickly that by the time you arrived, there was nothing left. On paper, both scenarios translate to zero items remaining, yet they carry very different realities.
A similar puzzle appears in scientific data. In large-scale biological measurements (often called omics), even tiny amounts of certain molecules can have a big impact on conclusions. If an instrument reports a gene or protein level as zero, researchers might dismiss something that is actually present but merely faint. That could distort conclusions in biology, and analogous issues in business or other fields can be equally misleading.
Why More Data Means More Zeros
Data collection today is more ambitious than ever. We measure thousands—or even millions—of features at once, whether they're gene expressions, protein levels, cellular functions, or shopping behaviors. Ironically, as our scope expands, we run up against the limits of detecting small signals. That gap often appears as a surge of zeros, making us wonder which ones reflect real absences and which ones hide missed presences.
One stark example arises in RNA sequencing (RNA-seq), a method to understand gene activity. At Biostate, where we regularly work with RNA-seq data—and offer it at a distinctly lower cost than many competitors—zero inflation remains a persistent challenge. Even with high-quality data, zeros can still mislead. For bulk RNA-seq (where gene activity is averaged across many cells), these zeros occur often. The problem intensifies in single-cell RNA sequencing (scRNA-seq), which examines each cell individually. The complexity of measuring single cells combined with technical constraints means 70–90% of readings might come back as zero—largely because low-level signals can slip through the cracks.
Such zeros aren't limited to biology. In insurance, zero claims could mask small but unreported incidents. In online retail, zero sales in a region might be real, or it might stem from a website glitch that blocked transactions. Regardless of the domain, the zeros we see can come from two main sources: true nothing, where there really isn't anything there, and hidden something, where something exists but got overlooked by the measurement process.
How We Deal With All Those Zeros
Several sophisticated approaches have been developed to tackle the challenge of zero inflation, each offering unique advantages for different situations.
The most straightforward approach employs what statisticians call zero-inflated distributions. Think of it as a two-step process: first, a model estimates whether each zero is likely genuine or merely missed data; then, conventional count statistics describe everything else. It's similar to how a store manager might look at empty shelves differently during a supply chain crisis versus normal operations. For instance, if neighboring stores show sales of the same item, or if historical patterns suggest there should be activity, that zero might warrant closer inspection. These models add this crucial sorting step before applying traditional statistical methods, helping us identify which zeros deserve skepticism. While this approach offers better accuracy when zeros dominate our data, it does require more complex mathematical machinery.
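The sorting step has a clean probabilistic core. Under a zero-inflated Poisson model (a common choice for count data), Bayes' rule gives the chance that an observed zero is a genuine "true nothing" rather than a missed count. A minimal sketch in Python, with the function name and example numbers invented for illustration:

```python
import math

def prob_structural_zero(pi: float, lam: float) -> float:
    """Posterior probability that an observed zero is a structural
    ("true nothing") zero under a zero-inflated Poisson model with
    mixing weight pi and Poisson rate lam."""
    # A structural zero produces a zero with certainty (prior pi);
    # the Poisson component produces a zero with probability exp(-lam)
    # (prior 1 - pi). Bayes' rule combines the two.
    p_sampling_zero = (1.0 - pi) * math.exp(-lam)
    return pi / (pi + p_sampling_zero)

# A shelf that is never stocked 30% of the time, and otherwise sells
# an average of 4 items: nearly every observed zero is "true nothing".
print(round(prob_structural_zero(0.3, 4.0), 3))  # -> 0.959
```

The intuition matches the grocery story: a busy shelf rarely sells exactly nothing, so when the Poisson rate is high, an observed zero is strong evidence the item was never stocked at all.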
Another creative solution borrows from the world of AI-assisted image manipulation: imputation. Just as photo editing software can fill in scratches or damage in old photographs by analyzing surrounding pixels, imputation algorithms examine patterns in similar data to guess which zeros might be false. Imagine you're tracking customer behavior, and a loyal weekly shopper suddenly shows zero purchases for a month. By looking at their past patterns and those of similar customers, you might infer they were actually shopping but their transactions weren't recorded properly. This method can rescue important signals that might otherwise be lost, though it risks introducing artificial patterns if the guesses are wrong.
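As a toy sketch of the idea (the function, window size, and threshold below are invented for illustration, not drawn from any imputation library): when the weeks surrounding a lone zero are overwhelmingly non-zero, treat that zero as a likely recording gap and fill it with the local mean, while leaving sustained runs of zeros alone.

```python
def impute_suspicious_zeros(history, window=4, threshold=0.75):
    """Fill in isolated zeros that look like recording gaps.

    `history` is a list of weekly purchase counts. A zero is imputed
    only when the surrounding `window` weeks on each side are mostly
    non-zero (fraction >= `threshold`); both parameters are
    illustrative choices, not established defaults."""
    out = list(history)
    for i, v in enumerate(history):
        if v != 0:
            continue
        neighbors = history[max(0, i - window):i] + history[i + 1:i + 1 + window]
        nonzero = [x for x in neighbors if x != 0]
        # Impute only when the neighborhood is overwhelmingly non-zero.
        if neighbors and len(nonzero) / len(neighbors) >= threshold:
            out[i] = sum(nonzero) / len(nonzero)
    return out

weeks = [3, 4, 3, 0, 4, 3, 4, 0, 0, 0]
print(impute_suspicious_zeros(weeks))  # -> [3, 4, 3, 3.5, 4, 3, 4, 0, 0, 0]
```

Note the asymmetry: the lone zero in week four is filled in, but the month-long run of zeros at the end is preserved, since its own neighborhood is mostly zero and may reflect a real absence.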
A more hands-on approach involves deliberately adding known quantities to the system—often called spike-ins. This is like adding traceable items to a shipping container to verify the tracking system works. If these known additions consistently show up as zero in the results, it's concrete evidence that the measurement system is missing real signals. While this provides direct proof of measurement gaps, it requires careful experimental design and might not catch every type of error.
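The bookkeeping for spike-ins is simple to sketch. Assuming controls in the style of the ERCC set commonly used in RNA-seq (the ids and counts below are made up for illustration), one can report what fraction of the deliberately added features came back as zero:

```python
def spike_in_dropout_rate(counts, spike_in_ids):
    """Fraction of known spike-in controls measured as zero.

    `counts` maps feature id -> measured count. A high dropout rate is
    direct evidence that the pipeline is missing real signal, since
    every spike-in was physically present by construction."""
    spiked = [counts.get(s, 0) for s in spike_in_ids]
    zeros = sum(1 for v in spiked if v == 0)
    return zeros / len(spike_in_ids)

# Hypothetical run: four controls added, two come back as zero.
measured = {"ERCC-1": 12, "ERCC-2": 0, "ERCC-3": 5, "ERCC-4": 0}
print(spike_in_dropout_rate(measured, ["ERCC-1", "ERCC-2", "ERCC-3", "ERCC-4"]))  # -> 0.5
```

A dropout rate like this can then calibrate how skeptically to treat zeros among the features that were not spiked in.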
For some broad analyses, a simpler route is to convert everything into a yes/no format—a process called binarization. Any non-zero value, no matter how small, becomes a "yes," while zeros remain "no." This approach is like checking attendance in a classroom: you might not care how long each student stayed, just whether they showed up at all. While this can make large-scale patterns more apparent, such as whether a gene is active at all or if a product has any market presence, it sacrifices information about intensity that might be crucial for understanding subtle differences.
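Binarization itself is nearly a one-liner. A small sketch, with an optional noise floor added as an illustrative extra parameter:

```python
def binarize(values, floor=0.0):
    """Collapse measurements to presence/absence: anything above
    `floor` counts as detected. The `floor` parameter (an illustrative
    addition) lets callers ignore readings below a noise threshold."""
    return [1 if v > floor else 0 for v in values]

print(binarize([0.0, 0.02, 5.1, 0.0, 0.3]))  # -> [0, 1, 1, 0, 1]
```

Raising the floor trades sensitivity for robustness: tiny readings that might be instrument noise get folded back into the "no" category.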
Each of these methods represents a different compromise between accuracy, complexity, and practicality. The choice often depends on the specific context: what kind of data you're dealing with, what questions you're trying to answer, and what resources you have available. What unites them all is the recognition that not every zero in our data tells the same story, and sometimes we need sophisticated tools to hear the whispers hidden in the silence.
If a Tree Falls in a Forest...
Zero inflation brings to mind a classic question: if a tree falls in a forest and nobody's around to hear it, do we record "no sound"? In data terms, if a measurement reads zero, does that definitively mean nothing was there, or might it have been something that we simply missed? This dilemma speaks to the larger debate of "absence of evidence" versus "evidence of absence." Seeing a zero only tells us that nothing was detected, which is different from knowing that nothing exists.
For example, a gene might be active at very low levels, or only during certain phases, and if our instrument fails to pick up those fleeting signals, we see a zero. The same pattern can hold in business or ecology—failing to detect an event doesn't mean it never occurred.
Did We Truly Find Nothing, or Did We Fail to Detect It?
Much like we once assumed dinosaurs had vanished completely, only to discover they left traces in modern birds, zeros in our data may conceal important details behind imperfect tools. Recognizing that possibility—and choosing the right strategy for handling it—could open the door to critical insights, whether by recovering an overlooked molecule or by not discarding a sleeper hit.