Guidelines for a standardized data format for use in cross-linguistic studies

October 16, 2018, Max Planck Society
A world map showing data points for which the researchers plan to gather unified data (i.e., data that is directly comparable) using the guidelines given in the paper. Credit: OpenStreetMap. Forkel et al. 2018. Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data.

An international team of researchers, members of the Cross-Linguistic Data Formats Initiative (CLDF) led by the Max Planck Institute for the Science of Human History, has proposed new guidelines on cross-linguistic data formats to make it easier to share and compare data across the growing number of large linguistic databases worldwide. The proposal comes with a software package, a basic ontology and usage examples.

There is an increasing number of linguistic databases worldwide, raising the possibility of a vast network for potential comparative studies. However, these databases are generally created independently of each other, and often have a unique and narrow focus. This means that the formats used for encoding the data are often different, creating difficulties in comparing data across databases.

The Cross-Linguistic Data Formats Initiative (CLDF) is an effort to resolve these issues. In a paper published in Scientific Data, the CLDF sets out proposed guidelines for a standardized format for linguistic databases, and also supplies a software package, a basic ontology and usage examples of best practices. The goal of this effort is to facilitate sharing and re-use of data in comparative linguistics.

The CLDF provides a data model underlying its recommendations that aims to be simple yet expressive, and is based on the data model previously developed for the Cross-Linguistic Linked Data (CLLD) project. This model has four main entities: (a) languages; (b) parameters; (c) values; and (d) sources. In the model, each value is related to one parameter and one language, and can be based on multiple sources. A value points to its sources through references, and each reference can carry a context (for a printed source, for example, the page numbers).
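The four entities and their relationships can be sketched in a few lines of Python. This is only an illustration of the model as described above, not the CLDF software package: the class names, field names and sample data are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Language:
    id: str
    name: str


@dataclass(frozen=True)
class Parameter:
    id: str
    name: str


@dataclass(frozen=True)
class Source:
    id: str
    citation: str


@dataclass(frozen=True)
class Reference:
    """Points at a source, optionally with a context such as page numbers."""
    source: Source
    context: str = ""


@dataclass
class Value:
    """One observation: a parameter's value in one language, backed by sources."""
    language: Language
    parameter: Parameter
    value: str
    references: list = field(default_factory=list)


# Hypothetical sample data, invented for illustration only.
german = Language("deu", "German")
hand = Parameter("hand", "HAND")
wordlist = Source("example2018", "Hypothetical wordlist (2018)")
observation = Value(german, hand, "Hand", [Reference(wordlist, "p. 12")])
```

Here the value "Hand" is tied to exactly one language and one parameter, while the list of references leaves room for several sources, each with its own context.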

Basic rules of data coding included in the guidelines, taking cognate coding in wordlists as an example. (a) illustrates why long tables should be favored throughout all applications. (b) underlines the importance of anticipating multiple tables along with metadata indicating how they should be linked. Credit: Forkel et al. 2018. Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data.
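The preference for long tables can be illustrated with a small Python sketch. A wide table has one column per language, so adding a language changes the table's shape; a long table has one row per observation, so new languages only add rows. The column names and the two sample words are assumptions made up for this example.

```python
# A wide table: one row per concept, one column per language (hypothetical data).
wide = [
    {"concept": "HAND", "German": "Hand", "English": "hand"},
    {"concept": "FOOT", "German": "Fuß", "English": "foot"},
]


def to_long(rows, key="concept"):
    """Turn each (concept, language) cell of a wide table into its own row."""
    long_rows = []
    for row in rows:
        for column, form in row.items():
            if column != key:
                long_rows.append(
                    {"language": column, "concept": row[key], "form": form}
                )
    return long_rows


long_table = to_long(wide)
for entry in long_table:
    print(entry["language"], entry["concept"], entry["form"])
```

In the long form, extra information about a single word, such as a cognate judgment, can be added as one more column without touching the other rows.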

CLDF is specified as a package format: a dataset is made up of a set of data files containing tables, plus a descriptive file that defines the relationships between the tables. Each linguistic data type has a CLDF module, along with additional components that capture aspects of the data recurring across multiple data types. The CLDF modules also contain terms from the CLDF ontology, a vocabulary of objects and properties with well-known semantics in comparative linguistics. This makes it possible for users to reference these terms in a uniform way.
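The idea of data tables plus a descriptive file can be sketched with Python's standard library. The layout of the descriptive file below (the "tables" and "links" keys) is invented for illustration and is not the metadata schema defined by the CLDF specification; the file names, column names and helper functions are likewise assumptions.

```python
import csv
import io
import json

# Two data files, held in memory as strings to keep the sketch self-contained.
files = {
    "languages.csv": "ID,Name\ndeu,German\neng,English\n",
    "values.csv": (
        "ID,Language_ID,Parameter_ID,Value\n"
        "1,deu,hand,Hand\n"
        "2,eng,hand,hand\n"
    ),
}

# A hypothetical descriptive file: which tables exist and how they are linked.
metadata_text = json.dumps({
    "tables": {"languages": "languages.csv", "values": "values.csv"},
    "links": [{"from": "values.Language_ID", "to": "languages.ID"}],
})


def load(metadata_text, files):
    """Parse the descriptive file and read every table it lists."""
    meta = json.loads(metadata_text)
    tables = {
        name: list(csv.DictReader(io.StringIO(files[filename])))
        for name, filename in meta["tables"].items()
    }
    return meta, tables


def check_links(meta, tables):
    """Return (table, key) pairs whose key has no match in the linked table."""
    problems = []
    for link in meta["links"]:
        src_table, src_col = link["from"].split(".")
        dst_table, dst_col = link["to"].split(".")
        known = {row[dst_col] for row in tables[dst_table]}
        for row in tables[src_table]:
            if row[src_col] not in known:
                problems.append((src_table, row[src_col]))
    return problems


meta, tables = load(metadata_text, files)
problems = check_links(meta, tables)
print("dangling links:", problems)
```

Because the links live in the descriptive file rather than in the code, the same check works for any dataset that follows the same layout.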

A software package to enable validation and manipulation

The CLDF specifications use common, widely supported file formats, such as CSV, JSON and BibTeX, so that the files can easily be read and written on many platforms. Even more importantly, the standardized format will allow researchers without programming skills to access and manipulate the data with preexisting tools, rather than limiting its use to those able to write their own. To facilitate this, the CLDF has created a "cookbook" repository of scripts for use with the CLDF specifications.

"We want to bring access to these data and the ability to compare them to as many researchers as possible," says Johann-Mattis List of the Max Planck Institute for the Science of Human History. Robert Forkel, one of the driving forces behind the CLDF initiative, also notes that the CLDF format is not limited to linguistic data alone, but can also incorporate databases of cultural and geographic data, for example. "CLDF may drastically facilitate the testing of questions regarding the interaction between linguistic, cultural, and environmental factors in linguistic and cultural evolution."

More information: Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics, Scientific Data, DOI: 10.1038/sdata.2018.205
