Is Big Data at a tipping point?

by Michael E. Driscoll | January 9, 2009

(5/18/09 update - included an overdue reference to linked data!) 

Stuart Kauffman, in one of his books about complexity, discusses tipping points in networks (what he calls phase transitions) by way of buttons. Suppose you’re sitting on a floor strewn with 400 buttons, and you begin tying them together with pieces of string at random. At first, you have just pairs of buttons. Then, you have clusters of threes, which in turn get tied into ever larger clumps. The question is: How long until picking any button off the floor pulls them all off together, in one connected mass?

It turns out that this supercluster of buttons doesn’t build gradually as we tie more threads; it emerges suddenly. This rapid phase transition, from relatively unconnected to mostly connected, occurs right around where we have about half as many threads as buttons (see figure). This is the tipping point of the system: where a few threads make a big difference.
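The button experiment is easy to try numerically. What follows is a minimal sketch, not Kauffman's own procedure: it assumes each thread ties two distinct buttons chosen uniformly at random, and it tracks clusters with a union-find structure. The function name `largest_cluster_sizes` and the `seed` parameter are made up for this illustration.

```python
import random


def largest_cluster_sizes(n_buttons, n_threads, seed=0):
    """Tie random pairs of buttons; return the largest cluster size after each thread."""
    rng = random.Random(seed)
    parent = list(range(n_buttons))  # union-find forest: each button starts alone
    size = [1] * n_buttons           # cluster size, valid at each root

    def find(x):
        # Walk up to the root of x's cluster, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    largest = 1
    history = []
    for _ in range(n_threads):
        a, b = rng.sample(range(n_buttons), 2)  # two distinct buttons
        ra, rb = find(a), find(b)
        if ra != rb:
            # Merge the smaller cluster into the larger one.
            if size[ra] < size[rb]:
                ra, rb = rb, ra
            parent[rb] = ra
            size[ra] += size[rb]
            largest = max(largest, size[ra])
        history.append(largest)
    return history


if __name__ == "__main__":
    n = 400
    history = largest_cluster_sizes(n, n_threads=n)
    for threads in (50, 100, 150, 200, 250, 300, 400):
        share = history[threads - 1] / n
        print(f"{threads:3d} threads: largest cluster holds {share:.0%} of buttons")
```

Running it with a few different seeds should show the largest cluster staying small while threads are scarce and then growing quickly once the thread count passes roughly half the number of buttons, though the exact numbers vary from run to run.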

A similar phase transition has already occurred with regard to data inside business ecosystems. For the past several decades, an increasing number of business processes (from sales to customer service to shipping) have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action (sales lead, mouse click, shipping update) is stored. The result: organizations are overwhelmed by what feels like a tsunami of data.

The same trend is occurring in the larger universe of data that these organizations inhabit. Big Data is being unleashed by the “Industrial Revolution of Data,” whether from public agencies, non-profit institutes, or forward-thinking private firms.

At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It’s frozen because the lack of shared format and meta-data standards makes it hard for data to flow from one place to another: comparing the SEC’s financial data with that of Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist. Data is “underwater” when, whether for reasons of competitiveness, privacy, or sheer incompetence, it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling studies with huge cohorts).

Yet there’s a slow thaw underway, as evidenced by a number of initiatives: Aaron Swartz’s theinfo.org, Flip Kromer’s infochimps, Carl Malamud’s bulk.resource.org, the Tim Berners-Lee-inspired LinkedData.org, as well as Numbrary, Swivel, Freebase, and Amazon’s public data sets. These are all ambitious projects, but the challenge of weaving these data sets together is still greater.

How far are we from the tipping point of Big Data? When will the world’s icebergs of data melt into one sea? More importantly, when it happens, will we be ready to do something useful with it all?

comments

17 Responses to “Is Big Data at a tipping point?”

  1. John on January 9th, 2009

    I was glad to hear you mention the problems with format standards and meta-data. I cringe whenever I run across someone who says that the problem with integrating medical data is only that the data is on paper. Some people believe if we could somehow get all this data into one enormous Oracle database then magic would happen.

    The meta-data problem is enormous. Say you’ve got a field called “diagnosis date” for some disease. It’s in a database with a date type, so there’s no format issue. What exactly does that date mean? First appointment with a family doctor? First appointment with a specialist? Is it self-reported? Has the date always meant the same thing, or did the meaning of the field change over time as personnel changed? Those are all problems with interpreting data that is all sitting in one institution’s private database. It’s hard to “integrate” data that is supposedly already integrated.

  2. Daniel Tunkelang on January 9th, 2009

    While there are efforts underway to expose more data through publicly accessible APIs, those efforts are child’s play compared to the problem of integrating the data to make it more than a grab bag of iceberg chips. I’m reminded of a discussion over at my blog:

    http://thenoisychannel.com/2008/12/14/is-soa-enabling-intelligent-agents/

  3. XMeister on January 10th, 2009

    You wrote: “Comparing the SEC’s financial data with that of Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist.”

    On the one hand, it requires more than just “common formats and labels”. They are reporting DIFFERENT THINGS, and even what is “common” has different measurement rules. What is measured at historical cost in one place is measured at market value in another - common formats and labels don’t help that; they can actually confuse the comparison.

    On the other hand:

    1. The SEC is mandating/has rules to mandate XBRL over the next 2.5 years for all public filers over whom they have authority, with a ruling finalizing it on Dec. 17, 2008.
    - http://www.sec.gov/news/press/2008/2008-300.htm

    2. The SEC also has published their roadmap to IFRS convergence (with XBRL as important foundation #3) to help get rid of those measurement and reporting differences.
    - http://www.sec.gov/rules/proposed/2008/33-8982.pdf

    3. This is a continuation of the collaboration between the US FASB and the IASB (at the rules, not XBRL, level) to bring more harmonization.

    4. The SEC, the IASB and even the Japanese FSA are working together on top of that for more XBRL harmonization.

    5. There is a US GAAP taxonomy
    - http://www.xbrl.us/Pages/US-GAAP.aspx

    and an IFRS taxonomy
    - http://www.iasb.org/XBRL/IFRS+Taxonomy/Latest+Taxonomy/Latest+taxonomy.htm

    and nothing to stop any group, commercial or otherwise, from preparing a mapping between them (ontological, XLink linkbase, virtual)

    So if this big data is important to you, support the XBRL effort, get busy understanding and mapping the taxonomies, support convergence efforts, or otherwise work with the groups already active to make this a reality.

    <XM

  4. Connected data and the tipping point : business|bytes|genes|molecules on January 10th, 2009

    […] are questions asked by Michael Driscoll in is big data at a tipping point. The post came to my attention via Paul Kedrosky and talks about a potential tipping point for Big […]

  5. Aaron Swartz on January 10th, 2009

    Thanks for the shout-out. As you note, the movement is just getting started, but seeing the energy behind it, I’m hopeful.

  6. Michael E. Driscoll on January 10th, 2009

    @ XMeister, I agree with you that common labels don’t necessarily mean common practices. But even within the US GAAP taxonomy, putting firms on equal footing isn’t always easy (hence Benjamin Graham’s Security Analysis).

    As we data geeks know, there’s always a bit of ontological hand-wringing that goes on when working closely with any data set: the real world is messy, and no schema fits perfectly. But the trade-off of precision for clarity is usually worth it.

    Thanks all for the comments, and keep up the good fight! :)

  7. links for 2009-01-11 « Amy G. Dala on January 11th, 2009

    […] Is Big Data at a tipping point? : Data Evolution (tags: datamining analysis privacy) […]

  8. download, mirror, fork : business|bytes|genes|molecules on January 19th, 2009

    […] that we are on our way to really understanding and maximizing the data commons. Recent posts by Mike Driscoll and Paul Miller speak to this path we seem to be going down. The Semantic Web is one core resource, […]

  9. O'Reilly Radar on February 4th, 2009

    Four short links: 4 Feb 2009…

    Data, climate change, and location: Details on Yahoo’s Distributed Database (Greg Linden) — summary of Yahoo!’s PNUTS, “a massively parallel and geographically distributed database system for Yahoo!’s web applications.” Greg keeps up with the pap…

  10. Pitos Blog » Blog Archive – Government Data, Apis, and the tipping point on February 4th, 2009

    […] Is Data at a tipping point? In the blog post he says: “[…]A similar phase transition has already occurred with regards to data inside business ecosystems. For the past several decades, an increasing number of business processes– from sales, customer service, shipping - have come online, along with the data they throw off.  […]

  11. 090206 New Tools Links | johnsumser.com: Recruiting News and Views on February 6th, 2009

    […] Is Big Data at a Tipping Point?"For the past several decades, an increasing number of business processes– from sales, customer service, shipping - have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action — sales lead, mouse click, and shipping update — is stored. The result: organizations are overwhelmed by what feels like a tsunami of data." Also see Big Data, Big Data Infrastructure, Big Data, Big Data and Big Data. It’s geeky now with big implications for HR functions and websites. This is the trend in 2012.   […]

  12. Big (linked?) data on February 8th, 2009

    […] Data Evolution blog has an interesting post that asks Is Big Data at a tipping point?. It suggests that we may be approaching a tipping point in which large amounts of online […]

  13. Semantic Web is Taking Forever, Right? | andrewapeterson.com on February 9th, 2009

    […] point, even though the buzz seems to have slowed down in the echo chamber.  There’s a great little analogy I came across where data are compared to buttons being threaded together from one to the next, […]

  14. mariana on April 13th, 2009

    Important topic to pay attention to. We need to see what could happen in the future, because otherwise it will be too late: lots of data will be wasted, along with lots of money and valuable information for current and future society.
    It is clear that networks do not grow linearly; at the least they grow exponentially, and mostly at higher orders such as 2 to the X. Therefore, within just a couple of days of the data passing a certain volume, its growth becomes impossible to control. As this post mentions, the same problem happens with company data.
    Therefore there are many different kinds of solutions being investigated, such as the ones mentioned in the article. But I have the feeling these are just momentary patches, not real solutions. No real innovation is being thought of.
    That is probably because most investors do not want to take risks and stick with the traditional solutions. But I can think of much more interesting initiatives for solving the problem, which of course would be riskier and would need more money. I would also involve AI in the research. But I think the real need is the emergence of a new paradigm for dealing with this issue.

  15. O'Reilly Radar on May 4th, 2009

    Big Data: SSD’s, R, and Linked Data Streams…

    The Solid State Storage Revolution: If you haven’t seen it, I recommend you watch Andy Bechtolsheim’s keynote at the recent Mysqlconf. We covered SSD’s in our just published report on Big Data management technologies. Since then, we’ve gotten addit…

  16. Big Data: SSD’s, R, and Linked Data Streams | Test Blog on May 5th, 2009

    […] talked about statistics and R, our main focus was Big Data. Mike is particularly excited about the growing number of open data sources, and the potential for linking them together to create interesting applications. The growing […]

  17. Nicole Beichel on November 5th, 2009

    You cannot take the issue much better.
