Many years ago, I attended a talk Jim Gray gave at Stanford in which he outlined the challenges of processing the huge datasets accumulated in scientific fields like astronomy, cosmology, and medicine.
In those days, the greatest concerns were 1) cleaning the data sets and 2) transporting the data sets. The processing of these data sets, surprisingly, was of little concern: data manipulation was processor-limited and modeling tools were few. Hence, success depended on the skill of researchers in delving through the results for meaning.
Jim lived in a world of specialized expensive hardware platforms for stylized processing, painstaking manual cleaning of data, and elaborate databases to manipulate and store information. As such, large academic projects were beholden to the generosity of a few large corporations. This, to say the least, meant that any research project requiring large resources would likely languish.
In the decades since Jim first broached the huge data set problem (and twelve years after his passing), the open source disruption that started with operating systems (of which I was a part) and new languages spawned in turn the creation of data tools, processing technologies, and methods that Jim, a corporate enterprise technologist, could not have imagined. Beginning with open source projects like Hadoop and Spark (originally from UC Berkeley, just like 386BSD), on-demand databases and tools can now provide (relatively speaking) economical and efficient capabilities. And one of the biggest big data projects ever recently demonstrated that success.