Deciphering the language of transcription factors

Sep 11, 2012 by Larry Hardesty

Transcription factors are proteins that bind to DNA to promote or suppress protein production. Since almost all diseases involve disruption of the protein-production process, transcription factors are promising biological targets for drugs—and could even serve as drugs themselves.

But there are likely thousands of transcription factors in humans, each of which might bind to the genome at tens of thousands of different locations. Previously, there was no cost-effective way to figure out exactly where transcription factors bind: which exact DNA letters in a given stretch of genome each of them attaches to. Researchers thus relied on approximate methods to identify the general vicinity of binding sites.

In the August issue of the online journal PLoS Computational Biology, a team of researchers from MIT's Computer Science and Artificial Intelligence Laboratory presented a new analytic technique that identifies binding sites with much greater accuracy. As a consequence, the researchers were able to infer previously unknown relationships among transcription factors, which could provide clues to the roles they play in gene regulation.

The researchers initially tested their technique on two sets of experimental data, which they say represent both "relatively easy and difficult cases" for analysis. In the easy case, their new technique identified the precise locations at which transcription factors bound to the genome with more than 90 percent accuracy, while the accuracy of existing techniques was about 10 percent or less. In the difficult case, the new method was more than 55 percent accurate, compared to about 5 percent for existing techniques.

The leading method for determining how transcription factors behave in living cells is to chop up the DNA from millions of cells and use a protein called an antibody to extract the fragments that have a particular transcription factor attached to them. While the DNA sequence that a transcription factor binds to consists of only about six to 12 DNA letters, the fragment extracted by the antibody could be a couple of hundred letters long. Sequencing the fragments can determine where in the genome they came from, but it offers little information about where on the fragment the transcription factor is attached.

Feedback loop

David Gifford, a professor of electrical engineering and computer science and director of the Computational Genomics Group, his graduate student Yuchun Guo, and Shaun Mahony, a research scientist in the group, developed a new algorithm for analyzing millions of experimentally identified fragments and inferring the precise locations at which transcription factors bind to them.

Previous methods would compare the fragments to try to identify sequences they had in common. But that's just the first step in the MIT researchers' method. They then use that initial, rough guess about common sequences to predict where, throughout the entire genome, the transcription factor would bind, then compare those predictions to the experimental data on where the factor actually did bind. On the basis of that comparison, they then refine their estimate of the specific binding sequence and repeat the whole process.

"We iterate between estimating where proteins bind and using that information to discover the sequences that they bind to," Gifford says, "and then we go backward and use the sequences they bind to to improve the estimate of where they're binding."
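The back-and-forth Gifford describes can be illustrated with a toy sketch. This is not the researchers' algorithm: it ignores the experimental read data entirely and only alternates between locating the best match to the current sequence estimate in each fragment and re-estimating the sequence from those matches. The function names, the four short fragments, the pseudocount, and the uniform background are all assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(kmers, pseudocount=0.25):
    """Per-column log-odds scores (vs. a uniform background) from aligned k-mers."""
    total = len(kmers) + 4 * pseudocount
    pwm = []
    for i in range(len(kmers[0])):
        counts = Counter(kmer[i] for kmer in kmers)
        pwm.append({b: math.log2(((counts[b] + pseudocount) / total) / 0.25)
                    for b in BASES})
    return pwm

def score(pwm, window):
    """Sum of column scores for one candidate window."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_position(pwm, fragment):
    """Start index of the highest-scoring window in a fragment."""
    k = len(pwm)
    return max(range(len(fragment) - k + 1),
               key=lambda i: score(pwm, fragment[i:i + k]))

def refine(fragments, k=6, max_rounds=20):
    # Rough first guess: the k-mer that occurs most often across all fragments.
    kmer_counts = Counter(f[i:i + k] for f in fragments
                          for i in range(len(f) - k + 1))
    seed = kmer_counts.most_common(1)[0][0]
    pwm = build_pwm([seed])
    positions = None
    for _ in range(max_rounds):
        # Step 1: use the current sequence estimate to locate binding positions.
        new_positions = [best_position(pwm, f) for f in fragments]
        if new_positions == positions:
            break  # positions stopped changing
        positions = new_positions
        # Step 2: use the located positions to re-estimate the sequence.
        pwm = build_pwm([f[p:p + k] for f, p in zip(fragments, positions)])
    consensus = "".join(max(BASES, key=lambda b: col[b]) for col in pwm)
    return positions, consensus

# Hypothetical fragments; the last one carries a one-letter variant of the motif.
fragments = [
    "ACGTTGACTCAGGT",
    "GGTGACTCATCCGA",
    "CATAGCCTGACTCG",
    "TTCGTGAGTCACGG",
]
positions, consensus = refine(fragments)
print(positions, consensus)
```

In this toy case the loop settles after two rounds. A real analysis would also weigh each candidate position against where the sequenced fragments actually pile up, which is the comparison with experimental data that the article describes.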

But determining transcription factors' precise binding sites is just the first step in understanding their role in gene regulation. For a single transcription factor, that role can vary according to both the type of cell in which it's active and its interactions with other transcription factors. It's the second of these elements that the MIT researchers are shedding light on, by identifying spatial relationships between binding locations that imply a functional relationship between the corresponding transcription factors.

The genome's language

That approach, Gifford says, is similar to the statistical analysis of language, which artificial-intelligence researchers have used to build language-interpreting computer systems. Indeed, Gifford says, the sequences that transcription factors bind to can be thought of as words and their spacing as the "syntax" of the genome.

"If you did an analysis of the English language, you would find a lot of relationships between words that were highly significant, because they co-occur," Gifford says. "You would not necessarily understand from the analysis what their meaning was, but you would know that they were highly significant and did carry meaning." The same is true of the DNA "words" that constitute the transcription-factor binding sites. "If you look at a null model, which would posit random occurrence of words, then you ask how unlikely it is that you would see these things together," Gifford says. "And we're testing everything against a random model."
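A minimal sketch of what testing against a random model can look like, under assumptions the article does not spell out: suppose that, by chance, the spacing between two factors' binding sites would be equally likely to fall at any offset in a fixed window, so the number of site pairs at one particular offset follows a binomial distribution. The counts below are invented, and the researchers' actual statistical test may differ.

```python
from math import comb

def binomial_tail(observed, trials, p):
    """P(X >= observed) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, i) * p**i * (1 - p)**(trials - i)
               for i in range(observed, trials + 1))

# Hypothetical numbers: 200 nearby site pairs for two factors, 101 possible
# offsets (-50 to +50 letters), and 30 pairs observed at one specific offset.
n_pairs = 200
n_offsets = 101
observed_at_offset = 30

# Under the random model, each pair lands on a given offset with chance 1/101,
# so only about two pairs would be expected there.
p_value = binomial_tail(observed_at_offset, n_pairs, 1 / n_offsets)
print(f"chance of seeing {observed_at_offset} or more pairs at one offset: {p_value:.3g}")
```

A very small probability here says only that the spacing is unlikely to be an accident, not what it means biologically, which is the distinction Gifford draws. Testing many offsets and many factor pairs at once would also call for a multiple-testing correction, which this sketch omits.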

The MIT researchers' analysis identified a handful of relationships between transcription factors that were already known, but it also identified 390 more statistically significant relationships between binding sites. Some of those may be red herrings, but many of them could turn out to indicate previously unsuspected relationships between transcription factors, which could help biologists unravel the mysteries of genetic expression.

"I think it's beautiful work. I really like it," says Michael Snyder, a professor of genetics at Stanford University and one of the lead investigators on the National Human Genome Research Institute's massive ENCODE project to characterize all the functional elements of the human genome. Just last week, Nature published a paper called "Architecture of the human regulatory network derived from ENCODE data," which had dozens of authors, including Snyder and a number of other ENCODE researchers.

"In much higher detail than we had done in our paper, Gifford's work could really let us understand much better about how these proteins are functioning together," Snyder says. "This is really going to be critical for us to understand the basic biological pathways for how you develop a human being and, of course, ultimately, for what goes wrong in human disease."

More information: Paper: "High Resolution Genome Wide Binding Event Finding and Motif Discovery Reveals Transcription Factor Spatial Binding Constraints"
