AI for code encourages collaborative, open scientific discovery

August 16, 2018 by Kush Varshney, IBM
Semantic flow graph representation produced automatically from an analysis of rheumatoid arthritis data. Credit: IBM

We have seen significant recent progress in pattern analysis and machine intelligence applied to images, audio and video signals, and natural language text, but not as much applied to another artifact produced by people: computer program source code. In a paper to be presented at the FEED Workshop at KDD 2018, we showcase a system that makes progress towards the semantic analysis of code. By doing so, we provide the foundation for machines to truly reason about program code and learn from it.

The work, also recently demonstrated at IJCAI 2018, was conceived and led by IBM Science for Social Good fellow Evan Patterson and focuses specifically on data science software. Data science programs are a special kind of computer code: often fairly short, but full of semantically rich content that specifies a sequence of data transformation, analysis, modeling, and interpretation operations. Our technique executes a data analysis (imagine an R or Python script) and captures all of the functions that are called during the analysis. It then connects those functions to a data science ontology we have created, performs several simplification steps, and produces a semantic flow graph representation of the program. As an example, the flow graph pictured with this article was produced automatically from an analysis of rheumatoid arthritis data.
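To make the pipeline described above concrete, here is a minimal sketch in Python, not the authors' implementation. It assumes a toy ontology (the `ONTOLOGY` dictionary), toy analysis functions (`load_table`, `drop_missing`, `fit_clusters`), and a much cruder simplification step than the real system: it records function calls with the standard library's `sys.setprofile`, keeps only calls that have an ontology entry, and links consecutive concepts into a linear chain rather than tracking how data actually flows between calls.

```python
import sys

# Hypothetical mini-ontology: concrete function name -> abstract concept.
ONTOLOGY = {
    "load_table": "read-tabular-data",
    "drop_missing": "drop-missing-values",
    "fit_clusters": "fit-clustering-model",
}


def load_table():
    """Toy stand-in for reading a dataset."""
    return [(1.0, 2.0), (None, 3.0), (4.0, 5.0)]


def drop_missing(rows):
    """Toy stand-in for a cleaning step."""
    return [row for row in rows if None not in row]


def _mean(values):
    """Helper with no ontology entry; it is filtered out of the graph."""
    return sum(values) / len(values)


def fit_clusters(rows):
    """Toy stand-in for a modeling step: returns the column-wise mean."""
    return tuple(_mean(column) for column in zip(*rows))


def analysis():
    """The 'script' being analyzed."""
    return fit_clusters(drop_missing(load_table()))


def trace_calls(func):
    """Run func and return the names of the Python functions called, in order."""
    calls = []

    def profiler(frame, event, arg):
        if event == "call":  # ignore returns and C-level events
            calls.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        func()
    finally:
        sys.setprofile(None)
    return calls


def semantic_flow(calls):
    """Map calls to concepts, drop unannotated calls, and chain the rest."""
    concepts = [ONTOLOGY[name] for name in calls if name in ONTOLOGY]
    return list(zip(concepts, concepts[1:]))


if __name__ == "__main__":
    for source, target in semantic_flow(trace_calls(analysis)):
        print(source, "->", target)
```

The raw trace also contains `analysis`, `_mean`, and interpreter-generated frames; only the three annotated functions survive the mapping step, which is the sense in which the graph is "semantic" rather than a plain call log.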

The technique is applicable across choices of programming language and package. The three code snippets below are written in R, Python with the NumPy and SciPy packages, and Python with the Pandas and Scikit-learn packages. All produce exactly the same semantic flow graph.

[Figures omitted. Credit: IBM]
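The language-independence claim can be illustrated with a small sketch. The annotation table below is hypothetical and is not taken from the Data Science Ontology; the function names (`read.csv` and `kmeans` in R, `genfromtxt` in NumPy, `kmeans2` in SciPy, `read_csv` in Pandas, `KMeans.fit` in Scikit-learn) are real, but the concept labels and the recorded call lists are assumptions made up for this example. Once each call is replaced by its concept, three differently written load-then-cluster analyses collapse to one graph.

```python
# Hypothetical annotation table: (package, function) -> ontology concept.
ANNOTATIONS = {
    ("r", "read.csv"): "read-tabular-file",
    ("r", "kmeans"): "k-means",
    ("numpy", "genfromtxt"): "read-tabular-file",
    ("scipy", "kmeans2"): "k-means",
    ("pandas", "read_csv"): "read-tabular-file",
    ("sklearn", "KMeans.fit"): "k-means",
}


def to_semantic_graph(recorded_calls):
    """Replace each annotated call by its concept and chain the concepts.

    Calls without an annotation are dropped. The result is a set of
    (source concept, target concept) edges.
    """
    concepts = [ANNOTATIONS[call] for call in recorded_calls if call in ANNOTATIONS]
    return frozenset(zip(concepts, concepts[1:]))


# Assumed call records for three versions of the same analysis.
r_calls = [("r", "read.csv"), ("r", "kmeans")]
scipy_calls = [("numpy", "genfromtxt"), ("builtins", "print"), ("scipy", "kmeans2")]
sklearn_calls = [("pandas", "read_csv"), ("sklearn", "KMeans.fit")]

graphs = [to_semantic_graph(c) for c in (r_calls, scipy_calls, sklearn_calls)]

if __name__ == "__main__":
    print(graphs[0] == graphs[1] == graphs[2])
```

The design choice that matters is that equality is checked on concepts, never on function names, so adding support for another package only means adding rows to the annotation table.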
We can think of the semantic flow graph we extract as a single data point, just like an image or a paragraph of text, on which to perform further higher-level tasks. With the representation we have developed, we can enable several useful functionalities for practicing data scientists, including intelligent search and auto-completion of analyses, recommendation of similar or complementary analyses, visualization of the space of all analyses conducted on a particular problem or dataset, translation or style transfer, and even machine generation of novel data analyses (that is, computational creativity). All of these are predicated on a semantic understanding of what the code does.
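As one illustration of treating a flow graph as a data point, the sketch below ranks stored analyses by their similarity to a query analysis. Everything here is an assumption made for the example: the corpus, the concept names, and the choice of Jaccard similarity over edge sets, which is a generic set-overlap measure and not a method described by the authors.

```python
def jaccard(edges_a, edges_b):
    """Jaccard similarity of two edge sets: |intersection| / |union|."""
    union = edges_a | edges_b
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(edges_a & edges_b) / len(union)


# Hypothetical corpus of previously analyzed scripts, each stored as a set
# of (source concept, target concept) edges.
CORPUS = {
    "analysis-a": frozenset({
        ("read-tabular-file", "drop-missing-values"),
        ("drop-missing-values", "k-means"),
    }),
    "analysis-b": frozenset({
        ("read-tabular-file", "k-means"),
    }),
    "analysis-c": frozenset({
        ("read-tabular-file", "drop-missing-values"),
        ("drop-missing-values", "linear-regression"),
    }),
}


def rank_similar(query, corpus):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, jaccard(query, edges)) for name, edges in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


QUERY = frozenset({
    ("read-tabular-file", "drop-missing-values"),
    ("drop-missing-values", "k-means"),
    ("k-means", "plot"),
})

if __name__ == "__main__":
    for name, score in rank_similar(QUERY, CORPUS):
        print(f"{name}: {score:.2f}")
```

A recommendation feature could surface the top-ranked entries; a real system would likely need a richer graph comparison than edge overlap.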

The Data Science Ontology is written in a new ontology language we have developed named Monoidal Ontology and Computing Language (Monocl). This line of work was initiated in 2016 in partnership with the Accelerated Cure Project for Multiple Sclerosis.

More information: E. Patterson et al. Dataflow representation of data analyses: Toward a platform for collaborative data science, IBM Journal of Research and Development (2017). DOI: 10.1147/JRD.2017.2736278
