Big Data and Many Things Trying to Get Along
by Ken Wood on Mar 8, 2013
With three or more bodies acting on each other through gravity, the n-body problem generally becomes chaotic and is “nearly impossible” to predict accurately with simulations, especially if the bodies differ in composition. While this is admittedly an exaggerated stretch of an analogy, big data (at least the way I adjust everyone’s definition) is similarly an n-body problem, but with data.
There are many industry and vendor definitions of what big data is. Ultimately, there is some truth in just about all of them. My colleague Michael Hay posted the HDS definition of “Big Data of the Future” in a recent blog post. What I like to add to these definitions, and what I like to describe in any of my conversations with people on big data, is that big data is also the complex, interacting relationship between different types of data that forms a single thread of information. Huh? That’s right, stick this in your definition, and you’ll sound like me.
This phrasing applies both to the different data itself and to the different sources of data, whether it is retrieved from a persistent store, in flight from sensors, from social networks and so forth. Each of these data sources is composed of and/or derived from potentially complex systems, or possibly from combinations of dissimilar data sources. At one end, big data can force simple atomic data with complex relationships together. At the other end, it uses the results of a collection of complex systems mashing data together, yielding outcomes that are in turn used as input to another complex process. Obviously, the difference is that we want orderly and predictable results sooner than the n-body problem might suggest. Maybe in big data research, this should be called the n-data problem, but with unpredictable yet useful results.
Think about it: information innovation has evolved from a single database as the source of data to a wide variety of data sources arriving at different velocities. This data can be from different eras, all trying to interact with each other, force attractions, and produce complex associations to derive meaning and, ultimately, produce a result that is unexpected yet useful. This combination of different types of data, in some cases seemingly unrelated types of data, is the foundation of science fiction and marketing commercials.
This concept hit home with me recently when talking to a European energy customer. As you know, one of the hats I wear is in the high performance computing space, or high performance “anything” space. The conversation turned to seismic data processing using HPC systems. Did you know that in most cases, oil companies aren’t “looking” for oil in historic seismic data? They already know that there’s energy there. What they are now analyzing is whether or not it is “economically justifiable” to extract this oil or gas. I use “economically justifiable” as a heavily loaded term. It means that at the time of the survey, oil was maybe $10 a barrel, and the amount and the conditions surrounding the discovery made it too expensive to extract. Those conditions could include quantity (not enough detail available to determine the amount of oil), environment (the oil is under a nature preserve or city), and situation (the oil is too deep or the ground is too hard to drill, and so forth).
One of the reasons certain customers keep data forever, especially in the oil and gas industry, is that the analysis processes and tools continue to get better over time. This can be through better and faster hardware, new software improvements, new mathematical algorithms or improved times to analyze data. Historically, this has been the case, and this is how I once described this industry’s use of its seismic data archives and HPC systems: the continuous cycle of applying new tools to old data, looking under every shale rock for oil or gas by recomputing historic data at ever-increasing resolution.
However, the oil and gas industry is probably one of the most successful users of the big data concept (with my definition addendum) in the world today. Analysis now includes more than just brute force processing of seismic data. They combine current situational data with the results of a seismic run. New drilling techniques, hydraulic fracturing (“fracking”), horizontal drilling, new refinery processes, regulatory policies and taxes, climate conditions, social media sentiment analysis, national, political and monetary policies, and other parameters all combine into their big data analytics. Each of these data sources is in itself a complex system yielding results as a piece of this process.
With oil hovering today at $100 a barrel and gas pump prices threatening $5.00 a gallon, there may now be “economic justification” to extract energy sources that were previously ignored for economic reasons or considered environmentally or ecologically undesirable. In fact, if you look closely, the world is experiencing a recent energy resurgence in new fields.
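The “economically justifiable” test can be sketched as a simple break-even check. This is only a minimal illustration, not how any oil company actually models the decision: the function name, the single per-barrel extraction cost, and the $40 figure are all hypothetical assumptions. Only the $10 and $100 prices come from the discussion above.

```python
def is_economically_justifiable(price_per_barrel, extraction_cost_per_barrel):
    """Return True when the market price exceeds the assumed cost to extract one barrel."""
    return price_per_barrel > extraction_cost_per_barrel


# Hypothetical all-in extraction cost for a hard-to-reach field (an assumption,
# not a figure from the article).
HYPOTHETICAL_COST = 40.0

# Prices mentioned in the article: roughly $10 at survey time, roughly $100 today.
for label, price in [("at survey time", 10.0), ("today", 100.0)]:
    verdict = is_economically_justifiable(price, HYPOTHETICAL_COST)
    print(f"{label}: ${price:.0f}/barrel -> justifiable: {verdict}")
```

In practice the cost side would itself be the output of many of the data sources listed earlier (drilling techniques, regulation, taxes and so on), which is exactly what makes this an n-data problem rather than a one-line comparison.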
When I state that the n-body problem is “nearly impossible” to predict, at least accurately, what I mean is that the result, no matter what it is, is useful. Similarly with big data, the complex relationships between different data types and the correlated orchestration of combining them in this n-data problem may not result in a predictable outcome. But why would you want that? What you should be looking for is something unpredictably useful.