Writing programs using ordinary language

Jul 11, 2013 by Larry Hardesty
A new algorithm can automatically convert natural-language specifications into "regular expressions" — special-purpose combinations of symbols that allow very flexible searches of digital files. Credit: CHRISTINE DANILOFF

In a pair of recent papers, researchers at MIT's Computer Science and Artificial Intelligence Laboratory have demonstrated that, for a few specific tasks, it's possible to write computer programs using ordinary language rather than special-purpose programming languages.

The work may be of some help to programmers, and it could let nonprogrammers manipulate common types of files—like word-processing documents and spreadsheets—in ways that previously required familiarity with programming languages. But the researchers' methods could also prove applicable to other programming tasks, expanding the range of contexts in which programmers can specify functions using ordinary language.

"I don't think that we will be able to do this for everything in programming, but there are areas where there are a lot of examples of how humans have done translation," says Regina Barzilay, an associate professor of and electrical engineering and a co-author on both papers. "If the information is available, you may be able to learn how to translate this language to code."

In other cases, Barzilay says, programmers may already be in the practice of writing specifications that describe, in precise and formal language, what their programs should do. "Even though they're written in natural language, and they do exhibit some variability, they're not exactly Shakespeare," Barzilay says. "So again, you can translate them."

The researchers' recent papers demonstrate both approaches. In work presented in June at the annual Conference of the North American Chapter of the Association for Computational Linguistics, Barzilay and graduate student Nate Kushman used examples harvested from the Web to train a computer system to convert natural-language descriptions into so-called "regular expressions": combinations of symbols that enable file searches that are far more flexible than the standard search functions available in most software.

In a paper being presented at the Association for Computational Linguistics' annual conference in August, Barzilay and another of her graduate students, Tao Lei, team up with professor of electrical engineering and computer science Martin Rinard and his graduate student Fan Long to describe a system that automatically learned how to handle data stored in different file formats, based on specifications prepared for a popular programming competition.

Regular irregularities

As Kushman explains, computer science researchers have had some success with systems that translate questions written in natural language into special-purpose formal languages—languages used to specify database searches, for instance. "Usually, the way those techniques work is that they're finding some fairly direct mapping between the natural language and this formal representation," Kushman says. "In general, the logical forms are handwritten so that they have this nice mapping."

Unfortunately, Kushman says, that approach doesn't work with regular expressions, strings of symbols that can describe the data contained in a file with great specificity. A regular expression could indicate, say, just those numerical entries in a spreadsheet that are three columns over from a cell containing a word of any length whose final three letters are "BOS."
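As an illustration of that specificity, the word-matching half of such a description maps directly onto a pattern. A plain regular expression cannot express spreadsheet geometry like "three columns over," but the "word of any length whose final three letters are 'BOS'" part can be sketched (this example pattern is ours, not from the paper):

```python
import re

# A hedged sketch: plain regular expressions match text, not spreadsheet
# geometry, but "a word of any length whose final three letters are 'BOS'"
# maps directly onto a pattern like this one.
pattern = re.compile(r"\b\w*BOS\b")

cells = ["LOGBOS", "BOSTON", "ABOS", "BOS"]
matches = [c for c in cells if pattern.fullmatch(c)]
print(matches)
```

Here `\b` marks a word boundary and `\w*` matches any run of word characters, so `BOSTON` is rejected while `LOGBOS`, `ABOS`, and bare `BOS` are accepted.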

But regular expressions, as ordinarily written, don't map well onto natural language. For example, Kushman explains, the regular expression used to search for a three-letter word starting with "a" would contain a symbol indicating the start of a word, another indicating the letter "a," a set of symbols matching any single letter, and a set of symbols indicating that the previous operation should be repeated twice. "If I'm trying to do the same syntactic mapping that I would normally do," Kushman says, "I can't pull out any sub-chunk of this that means 'three-letter.'"
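The structure Kushman describes can be sketched piece by piece (the pattern below is our illustration, not one from the paper):

```python
import re

# Hedged illustration of the symbol-by-symbol structure described above:
# a word boundary, the literal "a", a letter class, and a repetition count.
# Note that no sub-chunk of this pattern by itself means "three-letter".
pattern = re.compile(
    r"\b"        # start of a word
    r"a"         # the letter "a"
    r"[a-z]{2}"  # any letter, repeated twice
    r"\b"        # end of the word
)

words = ["ant", "apple", "arc", "bat"]
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)
```

The notion "three letters" is smeared across the literal `a`, the class `[a-z]`, and the count `{2}`, which is exactly why a direct phrase-to-symbol mapping fails.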

What Kushman and Barzilay determined, however, is that any regular expression has an equivalent that does map nicely to natural language—although it may not be very succinct or, for a programmer, very intuitive. Moreover, using a mathematical construct known as a graph, it's possible to represent all equivalent versions of a regular expression at once. Kushman and Barzilay's system thus has to learn only one straightforward way of mapping natural language to symbols; then it can use the graph to find a more succinct version of the same expression.
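A minimal sketch of the equivalence idea (not the authors' actual graph construction): a verbose pattern in which each natural-language phrase maps onto one piece, alongside the succinct equivalent a programmer would normally write.

```python
import re

# Two equivalent regular expressions for "a three-letter word starting with
# 'a'". The verbose form maps phrase-by-phrase onto the description; the
# succinct form is the conventional one. Both accept exactly the same strings.
verbose = re.compile(r"\ba[a-z][a-z]\b")   # "a", then a letter, then a letter
succinct = re.compile(r"\ba[a-z]{2}\b")    # "a", then two repetitions of a letter

words = ["ant", "arc", "area", "bat"]
agree = all(bool(verbose.fullmatch(w)) == bool(succinct.fullmatch(w)) for w in words)
print(agree)
```

The system only needs to learn the straightforward verbose mapping; finding the succinct equivalent is then a search over equivalent forms.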

When Kushman presented the paper he co-authored with Barzilay, he asked the roomful of computer scientists to write down the regular expression corresponding to a fairly simple text search. When he revealed the answer and asked how many had gotten it right, only a few hands went up. So the system could be of use to accomplished programmers, but it could also allow casual users of, say, spreadsheet and word-processing programs to specify elaborate searches using natural language.

Opening gambit

The system that Barzilay, Rinard, Lei and Long developed is one that can automatically write what are called input-parsing programs, essential components of all software applications. Every application has an associated file type—.doc for word processors, .pdf for document viewers, .mp3 for music players, and so on. And every file type organizes data differently. An image file, for instance, might begin with a few bits indicating the file type, a few more indicating the width and height of the image, and a few more indicating the number of bits assigned to each pixel, before proceeding to the bits that actually represent pixel colors.

Input parsers figure out which parts of a file contain which types of data: Without an input parser, a file is just a random string of zeroes and ones.

The MIT researchers' system can write an input parser based on specifications written in natural language. They tested it on more than 100 examples culled from the Association for Computing Machinery's International Collegiate Programming Contest, which includes file specifications for every programming challenge it poses. The system was able to produce working input parsers for about 80 percent of the specifications. And in the remaining cases, changing just a word or two of the specification usually yielded a working parser.

"This could be used as an interactive tool for the developer," Long says. "The developer could look at those cases and see what kind of changes they need to make to the natural language—maybe some word is hard for the system to figure out."

The system begins with minimal information about how written specifications might correspond to parser programs. It knows a handful of words that should consistently refer to particular data types—the word "integer," for instance—and it knows that the specification will probably describe some data structures that are nested in others: An image file, for instance, could consist of multiple chunks, and each chunk would be headed by a few bytes indicating how big it is.
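The kind of parser the system produces can be sketched for a hypothetical binary image format (this layout is our invention for illustration, not a format from the paper): a type tag, then width, height, and bits-per-pixel fields, followed by length-prefixed chunks.

```python
import struct

# Hedged sketch of an input parser for a made-up binary image format:
# a 2-byte type tag, 2-byte width, 2-byte height, 1-byte bits-per-pixel,
# then a sequence of chunks, each headed by a 2-byte length prefix.
def parse_image(data: bytes) -> dict:
    tag, width, height, bpp = struct.unpack_from("<HHHB", data, 0)
    offset = 7  # size of the fixed header above
    chunks = []
    while offset < len(data):
        (size,) = struct.unpack_from("<H", data, offset)  # chunk length prefix
        offset += 2
        chunks.append(data[offset:offset + size])
        offset += size
    return {"tag": tag, "width": width, "height": height,
            "bpp": bpp, "chunks": chunks}

# Build a sample file and parse it back.
sample = struct.pack("<HHHB", 0xABCD, 4, 2, 8) + struct.pack("<H", 3) + b"\x01\x02\x03"
result = parse_image(sample)
print(result["width"], result["chunks"])
```

Without such a parser, the byte string is opaque; with it, each field and each nested, size-prefixed chunk has a meaning.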

Otherwise, the system just tries lots of different interpretations of the specification on a few sample files; in the researchers' experiments, the samples, too, were provided on the competition website. If the resulting parser doesn't seem to work on some of the samples, the system varies its interpretation of the specification slightly. Moreover, as it builds more and more working parsers, it becomes more adept at recognizing regularities in the way that parsers are specified. It took only about 10 minutes of calculation on an ordinary laptop for the system to produce its candidate parsers for all 100-odd specifications.
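The generate-and-test loop described above can be sketched as follows (a simplification of ours, not the authors' actual learning algorithm): try candidate interpretations of a specification against sample files and keep the first parser that handles all of them.

```python
# Hedged sketch of a generate-and-test loop: each candidate is a parser
# derived from one interpretation of the spec; a candidate survives only
# if it handles every sample file without failing.
def find_working_parser(candidates, samples):
    for parser in candidates:
        try:
            if all(parser(s) is not None for s in samples):
                return parser
        except Exception:
            continue  # this interpretation fails on some sample; try the next
    return None

# Two toy interpretations of "a list of integers": comma-separated vs.
# whitespace-separated. Only the second fits the samples below.
candidates = [
    lambda s: [int(v) for v in s.split(",")],
    lambda s: [int(v) for v in s.split()],
]
samples = ["1 2 3", "10 20"]
winner = find_working_parser(candidates, samples)
print(winner("1 2 3"))
```

The real system additionally refines its interpretation when a candidate fails, rather than discarding it outright, and reuses regularities learned from earlier specifications.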

"This is a big first step toward allowing everyday users to program their computers without requiring any knowledge of programming language," says Luke Zettlemoyer, an assistant professor of computer science and engineering at the University of Washington. "The techniques they have developed should definitely generalize to other related programming tasks."

The two papers are titled "From natural language specifications to program input parsers" and "Using semantic unification to generate regular expressions from natural language."


User comments : 14


antialias_physorg
5 / 5 (3) Jul 11, 2013
Even though they're written in natural language, and they do exhibit some variability, they're not exactly Shakespeare

Then he should try out the Shakespeare programming language (which is kinda useless, but fun to read)
http://en.wikiped...guage%29

When he revealed the answer and asked how many had gotten it right,

I can believe that. RegExp are a nightmare for anything but the most trivial examples. Especially if you have to read someone else's.
grondilu
not rated yet Jul 11, 2013
With Perl 6, there are new kinds of regular expressions available now. They are not at all comparable to a natural language, but they are arguably much easier to write and to understand than the original, Perl 5 compatible version.

Check them out on: http://www.perlca...S05.html
Gmr
2.7 / 5 (7) Jul 11, 2013
Programmers already are translation engines between what is desired and what the machine can do. Even with this, some people are still not going to be able to communicate in a clear and succinct manner what they want the machine to do. If full on natural language programming is available, you'll still need programmers - people who can translate messy ideas into digestible chunks for programming.
komone
not rated yet Jul 11, 2013
Awesome. Wonder if I wrote "some word with three letters in it that starts with x"... would it output the same?
_ilbud
5 / 5 (1) Jul 11, 2013
This line of garbage has been trotted out since CoBOL. This is the most obfuscated pseudo babble version of it I've ever seen.
LarryD
not rated yet Jul 11, 2013
I know very little about the subject but I am curious. The 'old' BASIC was certainly understandable & useable by people like me. Whatever happened to that? What were it's limitations as far as common language is concerned...I know it had mathematical limitations. Did BASIC evolve into something else? If I recall correctly Fortran and Cobol were used in more complicated calculations but I never knew how they differed from BASIC. I'd appreciate a comment or two chaps.
packrat
1 / 5 (2) Jul 12, 2013
@LarryD
Basic is still around... search for libertybasic for a windows machine. (best version I know of) I don't know if it runs on a apple or linux but probably does at least on linux. there is probably some versions for Apple machines too. There are even versions for android. I have a setup on a little cheap 7" android pad that works fine. It's grown a bit though. The modern versions let you use more capabilities of your machine and can use just about any of the input/output options available plus the graphics capabilities.
Code_Warrior
5 / 5 (1) Jul 12, 2013
If I recall correctly Fortran and Cobol were used in more complicated calculations but I never knew how they differed from BASIC. I'd appreciate a comment or two chaps.

COBOL (COmmon Business Oriented Language) was geared toward business and administrative applications and featured a language syntax that was closer to English. FORTRAN (FORmula TRANslating) is a language geared toward engineering and scientific mathematical computing applications. BASIC (Beginners All-purpose Symbolic Instruction Code) was a general purpose language that was easy to learn. All 3 languages are still in active use today and have evolved into Object Oriented versions, with Fortran also having development in concurrent computing on multiple processors. These days BASIC is in the MS Office suite where VBA (Visual Basic for Applications) is used to record macros and automate tasks. Windows has VB Script for automating tasks in Windows. A VB Compiler can make stand-alone VB applications.
pietrocecchi
not rated yet Jul 12, 2013
great!
an effort of encoding natural language in regular expression language in order to ease string searches
beleg
1 / 5 (3) Jul 12, 2013
Language translation is never synonymous with equivalency. All languages humans use harbor interpretation when an application for that language is found.
All maps overlap. This gives rise to the origins of equivalency.

"for exact understanding exact language is necessary."

(Gurdjieff to Ouspensky)

Eliminate interpretation (application) and you fulfill the necessity and sufficiency stated above.

That is equivalent to asking if any language has limits.

LarryD
not rated yet Jul 12, 2013
packrat & Code_Warrior, thanks for the info. I was working in a Schweppes QC lab when R&D introduced a Sinclair (or was it Sinclare?) that used BASIC. Its first program was to calculate the amount of 'free space' on the overlapping joins after a can of drink had the lid sealed on. Of course we had to 'can strip' and take the various measurements of the overlap first, then either do the calculation by hand or with the later introduction of a 64K computer. Ha! 64K, can you believe that...wouldn't even be enough for one photo on my mobile phone these days.
rsklyar
1 / 5 (5) Jul 12, 2013
Some else results which never be published in MIT News. They are about stealing of both the ideas and money of taxpayers by numerous swindlers from David H. Koch Institute for Integrative Cancer Research and Department of Chemical Engineering, also with Department of Chemistry and Chemical Biology and School of Engineering and Applied Science of Harvard University at http://issuu.com/...vard_mit and http://issuu.com/...llsens12 .
Their plagiaristic "masterpieces" titled Macroporous nanowire nanoelectronic scaffolds for synthetic tissues (DOI: 10.1038/NMAT3404) and Outside Looking In: Nanotube Transistor Intracellular Sensors (dx.doi.org/10.1021/nl301623p) were funded by NIH Director's Pioneer Award (1DP1OD003900) and a McKnight Foundation Technological Innovations in Neurosciences Award, also a Biotechnology Research Endowment from the Dep. of Anesthesiology at Children's Hospital Boston and NIH grant GM073626, and NIH grants DE013023 and DE016516.
antialias_physorg
5 / 5 (1) Jul 12, 2013
If I recall correctly Fortran and Cobol were used in more complicated calculations but I never knew how they differed from BASIC.

You can do most anything with any language. When you get right down to it any language is broken down by the compiler into the same units eventually.

Languages are more about style (i.e. what kind of applications their syntax is geared towards). But you can write a word processor in COBOL or in Basic or in Fortran if you feel like it (with various amounts of effort) - so the language itself doesn't limit you per se.

Some languages compile more efficiently for some task types, though. That's why professional programmers like to pick the language best suited for a task (instead of writing everything in one language).

This may seem like you have to learn a lot of languages. But once you've learned two or three you find that they are all fundamentally very similar. Picking up a new one doesn't take long (around a week or so).
alfie_null
not rated yet Jul 12, 2013
Learning how to describe a solution in some computer language isn't, or shouldn't be, the focus of, say, a computer science curriculum. There's no correlation between being able to speak English and being able to solve problems. Having the capability of translating English into some action a computer takes doesn't guarantee that expressed English represents a useful action.