Scientists merge statistics, biology to produce important new computational tool for studying gene expression
The cells in our bodies express themselves in different ways. One cell might put a chunk of genetic code to work, while another cell ignores the same information entirely. Understanding why could spur new stem cell therapies, or lead to a more fundamental understanding of how organisms develop. But zeroing in on these cell-to-cell differences can be challenging.
Now, two UCLA researchers have come up with a computational tool that increases the reliability of measuring how strongly genes are expressed in an individual cell, even when the cell is barely reading certain genes. The research was published last month in the journal Nature Communications.
"The DNA sequence is the same in a brain cell, a liver cell and a heart cell," said Jingyi "Jessica" Li, the study's corresponding author and a UCLA assistant professor of statistics. "Why do those cells look so different? The key thing is gene expression."
DNA encodes the information needed to create and operate an organism. But the task of reading and acting on that information falls to RNA, long strands of mobile molecules that transport genetic instructions to other parts of a cell. By tallying the various RNA molecules in a cell, researchers can tell which genes are active—or "expressed"—and to what degree.
However, if RNA molecules are present only in trace amounts, analysis tools can be fooled into thinking that the corresponding genes aren't active at all. Unless corrected for, these "dropouts" can paint a misleading picture about actual differences between cells.
"If you want to obtain useful biological information at the individual cell level, then you need to do some statistical inferences," said Li, who is also head of the Junction of Statistics and Biology laboratory. "Otherwise your conclusions may be wrong."
Li and Wei "Vivian" Li, a doctoral candidate in the UCLA department of statistics, have designed statistical analysis software for handling dropouts in RNA sequencing. Their tool, called "scImpute," estimates which genes in a cell are most likely to drop out based on studying all individual cells in an experiment. The tool then uses information from similar cells to make an educated guess about what the level of gene expression should be.
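The two-step idea described above (flag likely dropouts, then borrow from similar cells) can be illustrated with a toy sketch. This is not scImpute's actual code or statistical model. It swaps in a crude rule: a zero is treated as a likely dropout only when most of a cell's nearest neighbors express that gene, and it is then replaced with the neighbors' average. The function names, the parameters k and min_fraction, and the example numbers are all hypothetical.

```python
from math import sqrt


def distance(a, b, genes):
    """Euclidean distance between two cells, using only the given genes."""
    return sqrt(sum((a[g] - b[g]) ** 2 for g in genes))


def impute_dropouts(cells, k=2, min_fraction=0.5):
    """Toy dropout imputation (illustrative only, not the scImpute algorithm).

    cells: list of per-cell expression lists, same gene order in every cell.
    k: number of similar cells to borrow from (hypothetical parameter).
    min_fraction: share of neighbors that must express a gene before a
        zero is treated as a dropout (hypothetical parameter).
    """
    n_genes = len(cells[0])
    result = [list(cell) for cell in cells]  # leave the input untouched
    for i, cell in enumerate(cells):
        # Judge similarity only on genes this cell actually detected.
        observed = [g for g in range(n_genes) if cell[g] > 0]
        others = [j for j in range(len(cells)) if j != i]
        neighbors = sorted(
            others, key=lambda j: distance(cell, cells[j], observed)
        )[:k]
        if not neighbors:
            continue
        for g in range(n_genes):
            if cell[g] > 0:
                continue  # observed value: never overwritten
            values = [cells[j][g] for j in neighbors if cells[j][g] > 0]
            # Likely dropout: similar cells mostly express this gene.
            if values and len(values) >= min_fraction * len(neighbors):
                result[i][g] = sum(values) / len(values)
    return result


# Made-up data: rows are cells, columns are genes 0..3.
cells = [
    [5.0, 3.0, 0.0, 0.0],  # cell A: gene 2 reads zero
    [5.0, 3.0, 2.0, 0.0],  # cell B: same type as A, gene 2 detected
    [5.0, 3.0, 4.0, 0.0],  # cell C: same type as A, gene 2 detected
    [0.0, 0.0, 0.0, 6.0],  # cells D, E, F: a different cell type
    [0.0, 0.0, 0.0, 6.0],
    [0.0, 0.0, 0.0, 7.0],
]

imputed = impute_dropouts(cells, k=2)
print(imputed[0])  # gene 2 of cell A is filled in with 3.0
```

In this example only one value changes. Cell A's zero for gene 2 is replaced because its two closest cells both express that gene, while zeros shared by all similar cells are left alone as genuinely unexpressed. That selectivity is the point of the sketch: only suspect values are touched, and everything that was actually measured stays as it was.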
Using estimates to fill in missing values isn't new. But available tools are either too broad, replacing all of one cell's gene expression values with another's, or hyper-specialized for a particular type of study. The advantages of scImpute are "flexibility and universality," Jessica Li said. The tool acts with surgical precision, replacing only the abundances that have most likely dropped out, and it can be used in any type of single-cell gene-expression analysis.
In Vivian Li's comprehensive tests on both simulated and actual data (some of which provide empirical evidence for actual levels of gene expression), scImpute was more accurate than other methods. The software reliably distinguished dropout genes from those that weren't expressed at all, and it provided accurate estimates of the actual abundances.
The open-source software is available for free online as an add-on package for R, a widely used programming environment for statistical analysis.
The two researchers have shown that scImpute works well in small groups of cells when dropout rates are low. But in large populations, dropouts can affect more than 90 percent of the genes. Their next goal is to make the tool just as reliable in those situations. They believe that by borrowing information from other genes, not just other cells, and from online databases, scImpute can become a robust tool for all situations.