Big Data trips up scientists

Scientists have always worked with large amounts of data, but an emphasis on collaboration has led to problems with Big Data. 

Big Data
Courtesy of Defense Advanced Research Projects Agency

“Data availability is a topic very important to science,” said Clifford Spiegelman, who led a panel discussion Saturday at the AAAS Annual Meeting in Chicago. “It allows checks on computations and allows other uses for data other than what was initially thought.”

However, Big Data poses many challenges, including the development of complicated, immense data sets using specialized formats that may be hard to read for other scientists, according to David Reitze. Reitze is a researcher at the Laser Interferometer Gravitational-Wave Observatory (LIGO), which produces about a petabyte of data each year. (A petabyte is one million gigabytes.)

LIGO receives public money to study gravitational waves, which means it is required by the National Science Foundation to make its data publicly available. Reitze said his group strives to meet those requirements, but there are some hurdles to sharing the data.

“Making data publicly available is really an entirely new concept,” Reitze said.

LIGO and its European sister organization VIRGO comprise about 1,300 scientists pooling data. The data that looks most interesting is double-checked by a special team of LIGO researchers. If the data still looks promising, it is checked again.

“The checking that we do at LIGO is the most thorough that I’ve ever been involved with,” Reitze said. “We’re paranoid about data—we have to be.”

LIGO collects not only data about gravitational waves but also data about how that data is collected. This allows future researchers to determine whether there were any flaws in the data collection. Collecting all this data comes at a cost: It takes a team of 50 staffers to maintain LIGO’s side of the database.

Other costs of making data available include converting data to standardized file formats, cleaning data of simulated events and keeping data on compatible devices. But a scientist’s hubris can also lead to less data being made publicly available.

Reitze said scientists’ excuses for not releasing data publicly range from being too busy to compile it for public consumption to believing that they are the only ones who can comprehend the data.

Another question in relaying data to the public is who bears the cost of data storage and upkeep. Researchers are loath to spend grant money on storage, and the National Science Foundation does not yet provide funding explicitly for data storage and distribution.

As more researchers look to the data others have produced, some of these problems will be ironed out. Astronomers have pioneered a culture of data sharing, and Reitze hopes that mindset will penetrate all areas of the scientific community.