Many years ago, I attended a talk Jim Gray gave at Stanford in which he outlined the challenges of processing the huge datasets accumulated in scientific fields like astronomy, cosmology, and medicine.
In those days, the greatest concerns were 1) cleaning the datasets and 2) transporting them. The processing of these datasets, surprisingly, was of little concern. Data manipulation was processor-limited and modeling tools were few; hence, success depended on the skill of the researchers in sifting through the results for meaning.
Jim lived in a world of specialized, expensive hardware platforms for stylized processing, painstaking manual cleaning of data, and elaborate databases to manipulate and store information. As a result, large academic projects were beholden to the generosity of a few large corporations. This, to say the least, meant that any research project requiring large resources would likely languish.
In the decades since Jim first broached the huge-dataset problem (and twelve years after his passing), the open source disruption that started with operating systems (of which I was a part) and new languages has in turn spawned data tools, processing technologies, and methods that Jim, a corporate enterprise technologist, could not have imagined. Beginning with open source projects like Hadoop and Spark (the latter originally from UC Berkeley, just like 386BSD), on-demand databases and tools now provide (relatively speaking) economical and efficient capabilities. And one of the biggest of big data projects ever recently demonstrated just that.
The latest results in Physical Review Letters by Loureiro et al. used the motion of galaxies and cosmological data to infer the upper bound on the mass of the lightest neutrino, and it is quite small: 0.086 eV (electron volts). For something so very tiny, the work required to obtain this upper bound was massive. According to Andrei Cuceu, speaking to Live Science, processing the data required “half a million computing hours…equivalent to almost 60 years on a single processor. This project pushed the limits for big data analysis in cosmology.”
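As a quick back-of-the-envelope check of that quoted figure (my own arithmetic, not a calculation from the paper), half a million CPU-hours really does work out to roughly sixty years of nonstop computing on a single processor:

```python
# Back-of-the-envelope check: convert "half a million computing hours"
# into equivalent years on one processor running around the clock.
cpu_hours = 500_000              # quoted total compute
hours_per_year = 24 * 365        # one processor, nonstop, for a year
years = cpu_hours / hours_per_year
print(f"about {years:.0f} years")  # about 57 years, i.e. "almost 60 years"
```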
Jim would have been thrilled to see these results – and how they were achieved. His concern back then was how to physically get those datasets into people's hands in the first place, and how to make sure the data was clean enough to work from. Now people are incorporating these huge datasets into massive computational models that use information on the largest galactic structures to constrain the smallest particles known. What was inconceivable two decades ago is now reality. I think Jim would have liked this — a lot.