Is Big Data at a tipping point?
(5/18/09 update - included an overdue reference to linked data!)
Stuart Kauffman, in one of his books about complexity, discusses tipping points in networks (what he calls phase transitions) by way of buttons. Suppose you’re sitting on a floor strewn with 400 buttons, and you begin tying them together with pieces of string at random. At first, you have just pairs of buttons. Then, you have clusters of threes, which in turn get tied into ever larger clumps. The question is: How long until picking any button off the floor pulls them all off together, in one connected mass?
It turns out that this supercluster of buttons doesn’t build gradually as we tie more threads; it emerges suddenly. This rapid phase transition, from relatively unconnected to mostly connected, occurs right around the point where we have about half as many threads as buttons (see figure). This is the tipping point of the system: where a few threads make a big difference.
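The button experiment is easy to try for yourself. Below is a minimal sketch in Python, not Kauffman's own procedure: it ties random pairs of buttons together and records the size of the largest cluster after each thread. The function name `largest_cluster_sizes`, the fixed random seed, and the use of a union-find structure to track clusters are all my own choices for illustration.

```python
import random


def largest_cluster_sizes(n_buttons=400, n_threads=400, seed=0):
    """Tie random pairs of buttons; return the largest cluster size after each thread."""
    rng = random.Random(seed)        # fixed seed (an arbitrary choice) for repeatable runs
    parent = list(range(n_buttons))  # union-find: each button starts as its own cluster
    size = [1] * n_buttons           # cluster size, valid at each cluster's root

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    largest = 1
    history = []
    for _ in range(n_threads):
        # Pick two distinct buttons at random. The same pair may be tied twice,
        # in which case the thread is simply redundant.
        a, b = rng.sample(range(n_buttons), 2)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merge the smaller cluster into the larger one.
            if size[root_a] < size[root_b]:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a
            size[root_a] += size[root_b]
            largest = max(largest, size[root_a])
        history.append(largest)
    return history


if __name__ == "__main__":
    history = largest_cluster_sizes()
    for threads in (100, 200, 300, 400):
        print(f"{threads} threads: largest cluster holds {history[threads - 1]} of 400 buttons")
```

Running it with a few different seeds should show the largest cluster staying small for the first hundred or so threads, then growing quickly once the thread count passes roughly half the number of buttons. Exact numbers will vary from run to run.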
A similar phase transition has already occurred with regard to data inside business ecosystems. For the past several decades, an increasing number of business processes (sales, customer service, shipping) have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action (sales lead, mouse click, shipping update) is stored. The result: organizations are overwhelmed by what feels like a tsunami of data.
The same trend is occurring in the larger universe of data that these organizations inhabit: Big Data is being unleashed by the “Industrial Revolution of Data”, whether from public agencies, non-profit institutes, or forward-thinking private firms.
At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It’s frozen because incompatible format and meta-data standards make it hard for data to flow from one place to another: comparing the SEC’s financial data with that of Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist. Data is “underwater” when, whether for reasons of competitiveness, privacy, or sheer incompetence, it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling studies with huge cohorts).
Yet there’s a slow thaw underway, as evidenced by a number of initiatives: Aaron Swartz’s theinfo.org, Flip Kromer’s infochimps, Carl Malamud’s bulk.resource.org, the Tim Berners-Lee-inspired LinkedData.org, as well as Numbrary, Swivel, Freebase, and Amazon’s public data sets. These are all ambitious projects, but the challenge of weaving these data sets together is still greater.
How far are we from the tipping point of Big Data? When will the world’s icebergs of data melt into one sea? More importantly, when it happens, will we be ready to do something useful with it all?
17 Responses to “Is Big Data at a tipping point?”
I was glad to hear you mention the problems with format standards and meta-data. I cringe whenever I run across someone who says that the problem with integrating medical data is only that the data is on paper. Some people believe if we could somehow get all this data into one enormous Oracle database then magic would happen.
The meta-data problem is enormous. Say you’ve got a field called “diagnosis date” for some disease. It’s in a database with a date type, so there’s no format issue. What exactly does that date mean? First appointment with a family doctor? First appointment with a specialist? Is it self-reported? Has the date always meant the same thing, or did the meaning of the field change over time as personnel changed? Those are all problems with interpreting data that is all sitting in one institution’s private database. It’s hard to “integrate” data that is supposedly already integrated.
While there are efforts underway to expose more data through publicly accessible APIs, those efforts are child’s play compared to the problem of integrating the data to make it more than a grab bag of iceberg chips. I’m reminded of a discussion over at my blog:
http://thenoisychannel.com/2008/12/14/is-soa-enabling-intelligent-agents/
You wrote: “Comparing the SEC’s financial data with that of Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist.”
On the one hand, it requires more than just “common formats and labels”. They are reporting DIFFERENT THINGS, and even what is “common” has different measurement rules. What is measured at historical cost in one place is measured at market value in another - common formats and labels don’t help that; they can actually confuse the comparison.
On the other hand:
1. The SEC is mandating/has rules to mandate XBRL over the next 2.5 years for all public filers over whom they have authority, with a ruling finalizing it on Dec. 17, 2008.
- http://www.sec.gov/news/press/2008/2008-300.htm
2. The SEC also has published their roadmap to IFRS convergence (with XBRL as important foundation #3) to help get rid of those measurement and reporting differences.
- http://www.sec.gov/rules/proposed/2008/33-8982.pdf
3. This is a continuation of the collaboration between the US FASB and the IASB (at the rules, not XBRL, level) to bring more harmonization.
4. The SEC, the IASB and even the Japanese FSA are working together on top of that for more XBRL harmonization.
5. There is a US GAAP taxonomy
- http://www.xbrl.us/Pages/US-GAAP.aspx
and an IFRS taxonomy
- http://www.iasb.org/XBRL/IFRS+Taxonomy/Latest+Taxonomy/Latest+taxonomy.htm
and nothing to stop any group, commercial or otherwise, from preparing a mapping between them (ontological, XLink linkbase, virtual)
So if this big data is important to you, support the XBRL effort, get busy understanding and mapping the taxonomies, support convergence efforts, or otherwise work with the groups already active to make this a reality.
[…] are questions asked by Michael Driscoll in is big data at a tipping point. The post came to my attention via Paul Kedrosky and talks about a potential tipping point for Big […]
Thanks for the shout-out. As you note, the movement is just getting started, but seeing the energy behind it, I’m hopeful.
@ XMeister, I agree with you that common labels don’t necessarily mean common practices. But even within the US GAAP taxonomy, putting firms on equal footing isn’t always easy (hence Benjamin Graham’s Security Analysis).
As we data geeks know, there’s always a bit of ontological hand-wringing that goes on when working closely with any data set: the real world is messy, and no schema fits perfectly. But the trade-off of precision for clarity is usually worth it.
Thanks all for the comments, and keep up the good fight!
[…] Is Big Data at a tipping point? : Data Evolution (tags: datamining analysis privacy) […]
[…] that we are on our way to really understanding and maximizing the data commons. Recent posts by Mike Driscoll and Paul Miller speak to this path we seem to be going down. The Semantic Web is one core resource, […]
Four short links: 4 Feb 2009…
Data, climate change, and location: Details on Yahoo’s Distributed Database (Greg Linden) — summary of Yahoo!’s PNUTS, “a massively parallel and geographically distributed database system for Yahoo!’s web applications.” Greg keeps up with the pap…
[…] Is Data at a tipping point? In the blog post he says: “[…]A similar phase transition has already occurred with regards to data inside business ecosystems. For the past several decades, an increasing number of business processes– from sales, customer service, shipping - have come online, along with the data they throw off. […]
[…] Is Big Data at a Tipping Point? "For the past several decades, an increasing number of business processes– from sales, customer service, shipping - have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action — sales lead, mouse click, and shipping update — is stored. The result: organizations are overwhelmed by what feels like a tsunami of data." Also see Big Data, Big Data Infrastructure, Big Data, Big Data and Big Data. It’s geeky now with big implications for HR functions and websites. This is the trend in 2012. […]
[…] Data Evolution blog has an interesting post that asks Is Big Data at a tipping point?. It suggests that we may be approaching a tipping point in which large amounts of online […]
[…] point, even though the buzz seems to have slowed down in the echo chamber. There’s a great little analogy I came across where data are compared to buttons being threaded together from one to the next, […]
This is an important topic to pay attention to. We need to see what could happen in the future, because otherwise it will be too late: lots of data will be wasted, along with lots of money and information that is valuable to both current and future society.
It is clear that networks do not grow linearly; they grow at least exponentially, on the order of 2 to the X. Therefore, within just a couple of days, once the data passes a certain volume, its growth becomes impossible to control. As this post mentions, the same problem happens with company data.
As a result, many different kinds of solutions are being investigated, such as the ones mentioned in the article. But I have the feeling these are just momentary patches, not real solutions. No real innovation is being thought of.
That is probably because most investors do not want to take risks and stick with the traditional solutions. But I can think of much more interesting initiatives for solving the problem, which of course would be riskier and would need more money. I would also involve AI in the research. But I think the real need is the emergence of a new paradigm for dealing with this issue.
Big Data: SSD’s, R, and Linked Data Streams…
The Solid State Storage Revolution: If you haven’t seen it, I recommend you watch Andy Bechtolsheim’s keynote at the recent Mysqlconf. We covered SSD’s in our just published report on Big Data management technologies. Since then, we’ve gotten addit…
[…] talked about statistics and R, our main focus was Big Data. Mike is particularly excited about the growing number of open data sources, and the potential for linking them together to create interesting applications. The growing […]
You cannot take the issue much better.