August 16, 2018

AI for code encourages collaborative, open scientific discovery

by Kush Varshney, IBM

Semantic flow graph representation produced automatically from an analysis of rheumatoid arthritis data. Credit: IBM

We have seen significant recent progress in pattern analysis and machine intelligence applied to images, audio and video signals, and natural language text, but not as much applied to another artifact produced by people: computer program source code. In a paper to be presented at the FEED Workshop at KDD 2018, we showcase a system that makes progress towards the semantic analysis of code. By doing so, we provide the foundation for machines to truly reason about program code and learn from it.

The work, also recently demonstrated at IJCAI 2018, was conceived and led by IBM Science for Social Good fellow Evan Patterson and focuses specifically on data science software. Data science programs are a special kind of computer code: often fairly short, but full of semantically rich content that specifies a sequence of data transformation, analysis, modeling, and interpretation operations. Our technique executes a data analysis (imagine an R or Python script) and captures all of the functions that are called in the analysis. It then connects those functions to a data science ontology we have created, performs several simplification steps, and produces a semantic flow graph representation of the program. As an example, the flow graph pictured at the top of this article was produced automatically from an analysis of rheumatoid arthritis data.
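To make the pipeline described above more concrete, here is a minimal Python sketch, not the authors' implementation. It runs a toy analysis, records the functions that are called, maps them to concepts, and links the concepts into a graph. The `ONTOLOGY` table, the concept names, and all function names are hypothetical, and the sketch uses call order as a crude stand-in for real data flow.

```python
import sys

# Hypothetical mini-ontology: concrete function name -> abstract concept.
# These concept names are illustrative, not taken from the real ontology.
ONTOLOGY = {
    "load_table": "read-tabular-data",
    "standardize_values": "standardize",
    "summarize": "descriptive-statistics",
}


def _mean(values):
    # Helper with no ontology entry; it is dropped during simplification.
    return sum(values) / len(values)


def load_table():
    return [2.0, 4.0, 6.0, 8.0]


def standardize_values(values):
    m = _mean(values)
    spread = (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [(v - m) / spread for v in values]


def summarize(values):
    return {"mean": _mean(values), "n": len(values)}


def analysis():
    """A toy data analysis: load, transform, summarize."""
    return summarize(standardize_values(load_table()))


def record_calls(func):
    """Execute func and return the names of Python functions called, in order."""
    calls = []

    def profiler(frame, event, arg):
        if event == "call":  # ignore returns and C-level events
            calls.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        func()
    finally:
        sys.setprofile(None)
    return calls


def semantic_flow_graph(calls):
    """Keep only calls known to the ontology and link consecutive concepts."""
    concepts = [ONTOLOGY[name] for name in calls if name in ONTOLOGY]
    return list(zip(concepts, concepts[1:]))


graph = semantic_flow_graph(record_calls(analysis))
for source, target in graph:
    print(source, "->", target)
```

Filtering out calls that have no ontology entry plays the role of a simplification step here. A real system would track how values flow between calls rather than rely on the order in which they happen.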

The technique is applicable across choices of programming language and package. The three code snippets below are written in R, Python with the NumPy and SciPy packages, and Python with the Pandas and Scikit-learn packages. All produce exactly the same semantic flow graph.

Equivalent code snippets in R, Python with NumPy and SciPy, and Python with Pandas and Scikit-learn. Credit: IBM
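The original snippets are not reproduced in this text. As a rough illustration of why different languages and packages can yield the same graph, the sketch below maps package-specific function names to shared concepts. The choice of k-means clustering as the analysis is an assumption, as are the concept names and the `ANNOTATIONS` and `TRACES` tables. The function names are used only as lookup keys and are never called.

```python
# Hypothetical annotations: (language, package, function) -> shared concept.
ANNOTATIONS = {
    ("r", "utils", "read.csv"): "read-table",
    ("r", "stats", "kmeans"): "k-means-clustering",
    ("python", "numpy", "genfromtxt"): "read-table",
    ("python", "scipy.cluster.vq", "kmeans2"): "k-means-clustering",
    ("python", "pandas", "read_csv"): "read-table",
    ("python", "sklearn.cluster", "KMeans"): "k-means-clustering",
}

# Assumed call traces for three versions of the same analysis.
TRACES = {
    "R": [("r", "utils", "read.csv"), ("r", "stats", "kmeans")],
    "NumPy/SciPy": [
        ("python", "numpy", "genfromtxt"),
        ("python", "scipy.cluster.vq", "kmeans2"),
    ],
    "Pandas/Scikit-learn": [
        ("python", "pandas", "read_csv"),
        ("python", "sklearn.cluster", "KMeans"),
    ],
}


def to_flow_graph(trace):
    """Translate a call trace into a tuple of concept-to-concept edges."""
    concepts = [ANNOTATIONS[call] for call in trace]
    return tuple(zip(concepts, concepts[1:]))


graphs = {name: to_flow_graph(trace) for name, trace in TRACES.items()}

# All three traces collapse to one and the same graph.
all_equal = len(set(graphs.values())) == 1
print(all_equal)
```

Because the graph is built from concepts rather than function names, the language and package used disappear from the result.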

We can think of the semantic flow graph we extract as a single data point, just like an image or a paragraph of text, on which to perform further higher-level tasks. With the representation we have developed, we can enable several useful functionalities for practicing data scientists: intelligent search and auto-completion of analyses, recommendation of similar or complementary analyses, visualization of the space of all analyses conducted on a particular problem or dataset, translation or style transfer, and even machine generation of novel data analyses (i.e., computational creativity). All of these are predicated on a truly semantic understanding of what the code does.
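One way to picture a graph as a data point is a similarity search over stored analyses. The sketch below is an assumption-laden illustration, not the method used in the paper: it represents each graph as a set of concept edges and recommends the stored analysis with the highest Jaccard similarity. The library contents and concept names are hypothetical.

```python
def jaccard(graph_a, graph_b):
    """Jaccard similarity between two graphs viewed as sets of edges."""
    a, b = set(graph_a), set(graph_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical library of previously analyzed scripts.
LIBRARY = {
    "clustering-study": [
        ("read-table", "standardize"),
        ("standardize", "k-means-clustering"),
    ],
    "regression-study": [
        ("read-table", "fit-linear-model"),
        ("fit-linear-model", "predict"),
    ],
    "scaled-regression-study": [
        ("read-table", "standardize"),
        ("standardize", "fit-linear-model"),
    ],
}


def most_similar(query, library):
    """Return the name of the stored analysis closest to the query graph."""
    return max(library, key=lambda name: jaccard(query, library[name]))


query = [
    ("read-table", "standardize"),
    ("standardize", "k-means-clustering"),
    ("k-means-clustering", "plot"),
]

recommendation = most_similar(query, LIBRARY)
print(recommendation)
```

Edge overlap is only one possible measure. It ignores graph structure beyond single edges, so a production system would likely need something richer.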

The Data Science Ontology is written in a new ontology language we have developed named Monoidal Ontology and Computing Language (Monocl). This line of work was initiated in 2016 in partnership with the Accelerated Cure Project for Multiple Sclerosis.

More information: E. Patterson et al. Dataflow representation of data analyses: Toward a platform for collaborative data science, IBM Journal of Research and Development (2017). DOI: 10.1147/JRD.2017.2736278

Provided by IBM

This story is republished courtesy of IBM Research.
