Data may not compute: Program stores older Web research files, left at risk by technological leaps

September 19, 2011 By Alvin Powell

“Data is not like a book. If you get a 300-year-old book and you know the language, you can usually read it,” said Gary King, the Albert J. Weatherhead III University Professor and head of Harvard’s Institute for Quantitative Social Science (IQSS). “Data changes formats. If it’s from even five years ago, you might not be able to read it at all.” Credit: Kris Snibbe/Harvard Staff Photographer

Modern scholars are wrestling with a problem that ancient monks and early authors managed to master: how to keep their work accessible to future generations.

The books, papers, and journals of early authors remain readable to anyone who can lay hands on them and knows the language, but that is not the case for those whose work is stored on early computer media, just a few decades old.

The breakneck pace of technology’s advance has left data in its dust, stored on tapes, floppy disks, and other media now unreadable by newer computers. And it’s not just the nature of storage media that is rapidly changing. File formats change as new programs are developed, rendering older programs obsolete even while giving powerful new tools.

“Data is not like a book. If you get a 300-year-old book and you know the language, you can usually read it,” said Gary King, the Albert J. Weatherhead III University Professor and head of Harvard’s Institute for Quantitative Social Science (IQSS). “Data changes formats. If it’s from even five years ago, you might not be able to read it at all.”

King has watched those changes since he arrived at Harvard in 1987. As head of the Harvard Data Center, then the Harvard-MIT Data Center, and now the institute, King realized long ago that efforts had to be made to ensure access to digital data for future scholars.

While publication in academic and scientific journals provides summaries of research, King said those articles are like advertisements for the underlying work, the reams of data gathered during exhaustive social science surveys, years of field observations, and long nights in the lab. Further, he said, today more grant-making agencies and journals require researchers to make their data available to others as a condition of a grant or of publication.

“It’s very important in science and social science to share research data,” King said.

One solution to the problem already exists, on computers at Harvard and in a growing network of corporations, universities, and other institutions. Called the Dataverse Network Project and spearheaded by the IQSS, the effort provides archival storage for research projects, initially in the social sciences but recently expanding to the physical sciences and humanities.

The Dataverse project solves problems that plagued the two most common previous data storage strategies, King said. The first is that researchers sometimes use major archives to hold their data. The problem with that, King said, involves loss of control over the data and, potentially, a loss of credit for gathering it, because the archive is sometimes cited as the source. The second commonly used strategy is to store the data on personal computers or servers, making it available on the Web through a researcher’s Web page. The problem there, King said, is that Web pages don’t endure for long. Researchers change institutions, links are lost, and access to data is gone as well.

“The average age of a link on the Web is very short,” King said. “Servers under the desk break or are replaced; the data can disappear.”

The Dataverse project is designed to solve both problems, King said. First, the IQSS employs professional archiving standards that ensure access to data long into the future. Once a researcher’s data is put into the system, it is converted from its original file format into a basic one that ensures the information will remain readable for decades to come. When that format becomes obsolete, King said, the system will automatically convert it to a new format, also designed to endure for decades. To guard against loss, the data is backed up on servers at different locations.
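
The two safeguards described above, conversion to a basic format and backup copies in several places, can be illustrated with a minimal Python sketch. It is not the Dataverse software's actual method, which the article does not detail. The function names, the semicolon-delimited Latin-1 input, and the use of SHA-256 checksums to compare copies are all assumptions made for illustration.

```python
import csv
import hashlib
import io


def to_plain_csv(raw_bytes, source_encoding="latin-1", delimiter=";"):
    """Rewrite delimited text as comma-separated UTF-8 (hypothetical helper).

    The input encoding and delimiter are assumptions; a real archive would
    need a dedicated reader for each legacy format it accepts.
    """
    text = raw_bytes.decode(source_encoding)
    rows = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue().encode("utf-8")


def fingerprint(data):
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def copies_match(original, copies):
    """Check that every backup copy is byte-identical to the original."""
    expected = fingerprint(original)
    return all(fingerprint(copy) == expected for copy in copies)
```

Calling `to_plain_csv` on a legacy file's bytes yields a plain-text version that any text editor can open, and `copies_match` reports whether the copies held at different locations still agree with it.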

Instead of being locked away somewhere, the data remains accessible to the researcher through a Web interface designed to look like just another page — holding a list of datasets — on the researcher’s website. Instead of bringing visitors who click on a page to a researcher’s server, though, it links directly to a Dataverse server. The data sets, like the articles that result from them, have their own citations so that, if they are used by other scientists, a researcher gets credit for the work.

“As a researcher, I don’t need to do anything. It looks like it’s mine, but it’s preserved in the background,” King said.

There are Dataverses at several different levels, including the Dataverse Network Project, which has developed and distributed the software; the IQSS’s Dataverse Network, which is the Harvard-centered network, holding the data of Harvard researchers; the Dataverse networks of other institutions; and the Dataverses of individual researchers, which are individual archives from their specific projects and which reside on the networks at specific institutions.

Mercè Crosas, director of product development at IQSS, led the development efforts of the Dataverse Network software. She said IQSS currently hosts more than 350 individual researchers’ Dataverses. Those Dataverses hold about 40,000 studies, made up of 665,000 files. Although Dataverse has so far mainly been used by social scientists, Crosas said some groups in the sciences, including the Harvard-Smithsonian Center for Astrophysics, are beginning to explore Dataverse options.

She expects the size of the files stored there to double in the next five years, as more researchers seek solutions to the problem of storing data in perpetuity. To help that expansion, she said, the Dataverse software is open source, meaning that the code is open to others to download and edit. Among the institutions that have adopted the Dataverse approach are the University of North Carolina, the University of Michigan, and several campuses of the University of California.

The software’s open-source nature means that other institutions can have their own programmers add features that can then be shared with the community of users.

Of course, preserving anything in perpetuity is a tall order, and King acknowledged that will be a central challenge as people and institutions change. The advantage of a place like Harvard, though, is that it is stable and likely to endure.

“You need the community to persist,” King said. “That’s the kind of thing Harvard does best.”


This story is published courtesy of the Harvard Gazette, Harvard University’s official newspaper. For additional university news, visit Harvard.edu.

More information: http://dvn.iq.harvard.edu/dvn/

Provided by Harvard University


randolmj
Sep 19, 2011

Rank: not rated yet
It seems to me that "old files" should be very easy to decrepit.
DiverseByDesign
Sep 19, 2011

Rank: not rated yet
"should be", but what if someone stored most of all their research on 5-1/4" floppies? That being the case, I don't know anyone anymore who has a floppy drive that size. And I am a techie/repair type. It is a valid and serious issue and I am happy to see something being done about it. However, I still have some doubts that this system is completely a viable solution.
tscati
Sep 19, 2011

Rank: not rated yet
It seems to me that "old files" should be very easy to decrepit.

What does this mean? Decrepit is not a verb. Many old files are decrepit. Perhaps meant to decipher? or to decode?

de·crep·it Adjective
1. (of a person) Elderly and infirm.
2. Worn out or ruined because of age or neglect.

Either way it's a very real problem - files on 3inch Amstrad WP disks in some strange compressed format are as good as lost!
El_Nose
Sep 19, 2011

Rank: not rated yet
Another issue is general file format --- I challenge you to find the original source code of db - a fully hashable database program. It's difficult to find online, as it took me days just to find one source -- but now, having that source, it is easier to understand how a file made by that program will look.

XML was the hope to make this an issue of the past -- but good documentation once again could solve much of this issue, mainly how to read and interpret the data in a given file.
that_guy
Sep 19, 2011

Rank: not rated yet
I think the article touched only briefly on the file type issue, but you guys (El Nose and Randolmj) should take note that:
Once a researcher's data is put into the system, it is converted from its original file format into a basic one that ensures the information will remain readable for decades to come

just like a .txt file from 1985 is still readable today. Granted, it's probably a little more sophisticated than that - but it is probably all stored in some kind of markup language and scalable image type.

randolm...you are absolutely incorrect. There are plenty of archaic filetypes that are not universal or standard that would be a pain in the ass to find software that is compatible with modern hardware and can decode it.
As for the physical media, they have their own server cluster - presumably that would replicate across the network as they upgrade the nodes.

and @ diverse...why don't we just use our comprehension skills and assume that randolmj meant 'decript'.
that_guy
Sep 19, 2011

Rank: not rated yet
lmao, I mean 'decrypt' whoops.