40-year-old algorithm proven the best possible

40-year-old algorithm proven the best possible
Credit: iStock

Comparing the genomes of different species—or different members of the same species—is the basis of a great deal of modern biology. DNA sequences that are conserved across species are likely to be functionally important, while variations between members of the same species can indicate different susceptibilities to disease.

The basic algorithm for determining how much two sequences of symbols have in common—the "edit distance" between them—is now more than 40 years old. And for more than 40 years, researchers have been trying to improve upon it, without much success.

At the ACM Symposium on Theory of Computing (STOC) next week, MIT researchers will report that, in all likelihood, that's because the algorithm is as good as it gets. If a widely held assumption about computational complexity is correct, then the problem of measuring the difference between two genomes—or texts, or speech samples, or anything else that can be represented as a string of symbols—can't be solved more efficiently.

In a sense, that's disappointing, since a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes. But it also means that computer scientists can stop agonizing about whether they can do better.

"This edit distance is something that I've been trying to get better algorithms for since I was a graduate student, in the mid-'90s," says Piotr Indyk, a professor of computer science and engineering at MIT and a co-author of the STOC paper. "I certainly spent lots of late nights on that—without any progress whatsoever. So at least now there's a feeling of closure. The problem can be put to sleep."

Moreover, Indyk says, even though the paper hasn't officially been presented yet, it's already spawned two follow-up papers, which apply its approach to related problems. "There is a technical aspect of this paper, a certain gadget construction, that turns out to be very useful for other purposes as well," Indyk says.

Squaring off

Edit distance is the minimum number of edits—deletions, insertions, and substitutions—required to turn one string into another. The standard algorithm for determining edit distance, known as the Wagner-Fischer algorithm, assigns each symbol of one string to a column in a giant grid and each symbol of the other string to a row. Then, starting in the upper left-hand corner and flooding diagonally across the grid, it fills in each square with the number of edits required to turn the string ending with the corresponding column into the string ending with the corresponding row.

Computer scientists measure algorithmic efficiency as computation time relative to the number of elements the algorithm manipulates. Since the Wagner-Fischer algorithm has to fill in every square of its grid, its running time is proportional to the product of the lengths of the two strings it's considering. Double the lengths of the strings, and the running time quadruples. In computer parlance, the algorithm runs in quadratic time.

That may not sound terribly efficient, but quadratic time is much better than exponential time, which means that running time is proportional to N2, where N is the number of elements the algorithm manipulates. If on some machine a quadratic-time algorithm took, say, a hundredth of a second to process 100 elements, an exponential-time algorithm would take about 100 quintillion years.

Theoretical computer science is particularly concerned with a class of problems known as NP-complete. Most researchers believe that NP-complete problems take exponential time to solve, but no one's been able to prove it. In their STOC paper, Indyk and his student Artūrs Bačkurs demonstrate that if it's possible to solve the edit-distance problem in less-than-quadratic time, then it's possible to solve an NP-complete problem in less-than-exponential time. Most researchers in the computational-complexity community will take that as strong evidence that no subquadratic solution to the edit-distance problem exists.

Can't get no satisfaction

The core NP-complete problem is known as the "satisfiability problem": Given a host of logical constraints, is it possible to satisfy them all? For instance, say you're throwing a dinner party, and you're trying to decide whom to invite. You may face a number of constraints: Either Alice or Bob will have to stay home with the kids, so they can't both come; if you invite Cindy and Dave, you'll have to invite the rest of the book club, or they'll know they were excluded; Ellen will bring either her husband, Fred, or her lover, George, but not both; and so on. Is there an invitation list that meets all those constraints?

In Indyk and Bačkurs' proof, they propose that, faced with a satisfiability problem, you split the variables into two groups of roughly equivalent size: Alice, Bob, and Cindy go into one, but Walt, Yvonne, and Zack go into the other. Then, for each group, you solve for all the pertinent constraints. This could be a massively complex calculation, but not nearly as complex as solving for the group as a whole. If, for instance, Alice has a restraining order out on Zack, it doesn't matter, because they fall in separate subgroups: It's a constraint that doesn't have to be met.

At this point, the problem of reconciling the solutions for the two subgroups—factoring in constraints like Alice's restraining order—becomes a version of the edit-distance problem. And if it were possible to solve the edit-distance problem in subquadratic time, it would be possible to solve the satisfiability problem in subexponential time.

"This is really nice work," says Barna Saha, an assistant professor of computer science at the University of Massachusetts atAmherst. "There are lots of people who have been working on this problem, because it has a big practical impact. But they won't keep trying to develop a subquadratic , because that seems very unlikely to happen, given the result of this paper."

As for the conjecture that the MIT researchers' proof depends on—that NP-complete problems can't be solved in subexponential time—"It's a very widely believed conjecture," Saha says. "And there are many other results in this low-polynomial-time complexity domain that rely on this conjecture.


Explore further

Algorithm reduces size of data sets while preserving their mathematical properties

More information: "Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false)" arxiv.org/abs/1412.0348

This story is republished courtesy of MIT News (web.mit.edu/newsoffice/), a popular site that covers news about MIT research, innovation and teaching.

Citation: 40-year-old algorithm proven the best possible (2015, June 11) retrieved 16 July 2019 from https://phys.org/news/2015-06-year-old-algorithm-proven.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
2906 shares

Feedback to editors

User comments

JVK
Jun 11, 2015
Excerpt: Edit distance is the minimum number of edits—deletions, insertions, and substitutions—required to turn one string into another.

My comment: "...deletions, insertions, and substitution..." are RNA-mediated.

Jun 11, 2015
My comment: ... RNA-mediated.
@jk
yes, yes, yes... you have repeatedly flooded this site with your SPAM telling us all about this and your conjectures about RNA mediated...

however, not ALL mutations are RNA mediated
just like not all mutations are deleterious (and even from your own links we've found out that deleterious mutations can actually be beneficial ! )

so what is your point with the above?
other than to open the door to try to draw attention to your personal site and religious dogma?


Jun 11, 2015
That may not sound terribly efficient, but quadratic time is much better than exponential time, which means that running time is proportional to N^2, where N is the number of elements the algorithm manipulates.


No, exponential would be 2^N, not N^2.


Jun 12, 2015
Guess what?

Quantum computers won't solve either of those problems any faster, contrary to popular notions of "infinite computing power".

Jun 12, 2015
@JVK - Since RNA is part of the DNA copying processes and the DNA repair processes, almost anything to do with maintaining or changing the DNA sequence could be called RNA-mediated, so used in that general sense the term "RNA-mediated" adds no new information.

But you seem to be using the term RNA-mediated in a stronger sense than that.
In your model, are ALL deletions, insertions, and substitutions "RNA-mediated" as you are using the term?
How about inversions and frame shifts - are they also all RNA-mediated as well in your model?
And does being RNA-mediated make these changes always beneficial, always harmful, or sometimes beneficial and sometimes harmful in your model?

JVK
Jun 13, 2015
But you seem to be using the term RNA-mediated in a stronger sense than that.


http://phys.org/n...ial.html

"When you explain that the myth is false, you pluck out that cog, leaving a gap in their mental model."

RNA-mediated amino acid substitutions link ecological variation to ecological adaptation via metabolic networks and genetic networks without the pseudoscientific nonsense about mutations and evolution. The amino acid substitutions that stabilize organized genomes are fixed via the physiology of reproduction.

You can either accept the biologically-based facts or continue flaunting your ignorance.


JVK
Jun 13, 2015
...so what is your point with the above?
other than to open the door to try to draw attention to your personal site...


I'm trying to help serious scientists put an end to the pseudoscientific nonsense by encouraging theorists to examine what is currently known about physics, chemistry, and conserved molecular mechanisms of cell type differentiation in all genera.

Jun 13, 2015
You can either accept the biologically-based facts or continue flaunting your ignorance
@jk
this is good advice, you should listen to it
especially when you make blatantly fallacious claims like this
the pseudoscientific nonsense about mutations and evolution
Your attempts to prove Evolution is pseudoscience have been:
argue Creationist dogma,
ignore actual proof and validated experimental evidence
completely misinterpret actual science and claim it supports you
claim argument from self-authority
push your Dunning-Kruger
OR
blatantly LIE and post libelous comments about actual scientists when they refute your claims about their science

this evidence can be seen from:
http://phys.org/n...ols.html

to
http://phys.org/n...sts.html

plus your TROLLING of ScienceMag is NOT the same thing as being published there
so it doesn't validate your claims

Jun 13, 2015
I'm trying to help serious scientists put an end to the pseudoscientific nonsense
@jk
no, you're not
you've already claimed to support Creationist dogma, which has been proven to be anti-science and shown to not contain science in it at all...
so you are simply pushing your own agenda and religious beliefs, not trying to end pseudoscience at all

shall i quote and link all your Creationist claims?
you can't even comprehend your own model, let alone the basic biology which debunked it

published by the same journal which published your model, btw
want that link again?
Jones destroys your model and you still can't figure out why

you've even falsely claimed to have linked LIGHT and pressure from light to de novo creation of amino acids!

where is the evidence in that?

that is called PSEUDOSCIENCE, not science
claims made with techno-jargon sounding verbiage while failing to meet the basic requirements of the scientific method and refusing empirical evidence

JVK
Jun 13, 2015
you've even falsely claimed to have linked LIGHT and pressure from light to de novo creation of amino acids!

where is the evidence in that?


Thanks for asking. See:

Common origins of RNA, protein and lipid precursors in a cyanosulfidic protometabolism
http://dx.doi.org...hem.2202

See also: http://www.the-sc...of-Life/

Jun 13, 2015
You didn't answer any of the questions JVK (and here I had hoped that you were actually "moving forward"), so I'll ask them again.

In your model, are ALL deletions, insertions, and substitutions "RNA-mediated" as you are using the term?
How about inversions and frame shifts - are they also all RNA-mediated as well in your model?
And does being RNA-mediated make these changes always beneficial, always harmful, or sometimes beneficial and sometimes harmful in your model?

And while we are at it how about whole chromosome duplications and whole genome duplications - are those "RNA-mediated" in your model, and are they ever beneficial?

JVK
Jun 14, 2015
Given the amount of information that anyone can find via a search for "RNA mediated," or at my domain RNA-mediated.com, questions about specific aspects of what I have detailed may be answered if I decide to waste more time attempting to explain biologically-based cause and effect to biologically uninformed science idiots.

I will not respond to foolish attempts to get me to make claims that cannot be supported, or those that will be misrepresented. My published works include all the required citations.

Jun 14, 2015
What JVK is really saying is that he was put on probation for spamming.

Jun 15, 2015
What JVK is really saying is that he was put on probation for spamming.


It is a good start. JVK is still spamming, but he has been reduced to baiting people via Google searches or cut-and-paste rather than clickable links to his site. He's so desperate to get visitors that I suspect he is charging some sucker advertiser by the site visit.

Jun 15, 2015
Given the amount of information that anyone can find via a search for "RNA mediated,"


JVK, on June 3rd claim that you
have repeatedly established the fact that biologically uninformed science idiots...are not able to perform a simple Google search to find information on RNA-mediated...

http://phys.org/n...ins.html

When challenged, you failed to provide ANY evidence that you have ever established this even ONCE for ANY commenter that you have called a "biologically uninformed science idiot", thus showing how little credence should be given to claims of yours that something is a fact.

But now saying that ANYONE can can find information via a Google search on "RNA Mediated", so you are effectively ADMITTING that you were wrong.

LMAO!


Please sign in to add a comment. Registration is free, and takes less than a minute. Read more