The Data Singularity is Here

by Michael E. Driscoll | March 8, 2010

In the next two blog posts I’ll attempt to sketch the forces behind what I’m calling, somewhat sensationally, the Data Singularity, and then (in a following post) discuss what I see as its consequences.

In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around us, and frequently we aren’t even at the terminal node of action. International cargo shipments, high-frequency stock trades, and genetic diagnoses are all made without us.

Absent humans, these data and decision loops have far less friction; they become constrained only by the costs of bandwidth, computation, and storage, all of which are dropping exponentially.

The result is an explosion of data thrown off from these machine-mediated pipelines, along with data about those flows (and data about that data, and so on). The machines all around us — our smart phones, smart cars, and fee-happy bank accounts — are talking, and increasingly we’re being left out of the conversation.

So whether or not the Singularity is Near, the Data Singularity is here, and its consequences are being felt.

But before I discuss these consequences, I’d like to expand on the premise. The world wasn’t always drowning in this data deluge, so how did we get here?

I. Data at the Speed of Speech

For most of human history, information traveled no faster than the sound of the human voice. The origin of human language was the original singularity: it marked the birth of a non-biological information channel, distinct from our DNA.

But despite this achievement, the production of information, whether farmers’ almanacs or merchants’ ledgers, was still constrained by the costs of ink and parchment and the write-speed of the human hand.

All 70,000 volumes of the Library of Alexandria, the collected body of human knowledge in antiquity, could fit on two thumb drives today.

Thus the transmission and production of data, when it was done at all, was painstaking in form, small in scale, and occurred between people.

People --> People

II. Data at the Speed of Light

With the telegraph, for the first time, data flowed at the speed of light.

In the late 18th century, the first substantive telegraph line connected Paris to Lille, roughly 210 kilometers to its north, using optical semaphores rather than electrical currents to communicate. Yet while data hopped between stations at light speed, it had to be routed by human operators at each station.

Centuries earlier, the printing press dramatically reduced the production costs of information. Still, human authors transmitted their hand-drafted manuscripts to typesetters, who set type with fonts optimally designed for human eyes.

III. Programmable Looms and Reading Machines

Punch cards represented the movement of data away from human-readable, anthropocentric substrates, onto a medium designed principally for consumption by machines.

Punch cards were developed in France in the early 18th century to control industrial looms.

Now, machines were the final terminus of data transmission. This act of communicating with our machines, programming them, was at the heart of Charles Babbage’s Analytical Engine, which came more than a century later.

People --> Machines

IV. Phonographs and Recording Machines

Developing on the other side of the communication spectrum were machines that excelled at writing and storing data.

The modern rotating disk drive feels inspired less by punch cards than by Thomas Edison’s cylinder machines, better known as phonographs.

The human voice was a natural data format, and if early pioneers had a vision for the modern human-machine interface, I imagine it would have been to program machines by voice. It’s a vision that still eludes us.

By the middle of the 20th century, a slew of semiconductor technologies emerged to close the loop of data generation: we had machines that produced digital data, and machines that continuously consumed it, without human intervention.

Machines --> Machines

These technologies also sparked the beginning of a less-celebrated, but equally important exponential curve: the falling cost of data storage.


V. Listening to the Pulse of the Planet

The exponential drop in data storage costs has meant that logging historical data about a process, or billions of processes, is economically feasible.

I conjecture that the largest share of data on the planet sits in log files; these are the EKGs of the server farms that manage our cell phones, our e-mail accounts, and every other facet of our online existence, and which consume 3% of the US energy budget.

Ubiquitous networking and cheap bandwidth have meant these pools of storage are no longer isolated on individual sensors, phones, or servers, but form the tributaries feeding an ocean of data in the Cloud.

And yet, funneling these massive volumes of data creates enormous technological pressures, against which companies struggle. So why keep the data?

Because inside these log files, amidst the myriad conversations recorded between machines, lies the pulse of their customers.

Collectively, these logs reveal the pulse of the planet — flight delays, package shipments, job losses, and human sentiments.

And as I’ll discuss in my next post, those who can extract a meaningful signal from this thunderous cacophony — the analysts, statisticians, and data scientists — are uniquely positioned to change the world.

SQL is Dead. Long Live SQL!

by Michael E. Driscoll | November 25, 2009

“The adoption of a relational model of data, as described above, permits the development of a universal data sub-language.” – E.F. Codd, 1969

“Database research has produced a number of good results, but the relational database is not one of them.” – Henry Baker, 1991

Outside of programming language flame wars, few questions raise the hackles of hackers more than: “how should I store my data?”

I will argue here that, as in many such debates, the answer is: it depends on what you’re doing.

While the rise of non-relational data stores serves a much-needed niche, the death of SQL and relational databases has been much exaggerated.  E.F. Codd may be dead, but SQL is alive and well as a simple yet powerful data query language.

3NF Crusaders vs NoSQL Rebels

While the current critique of relational databases shares features of earlier debates (such as in the 1990s, when object-oriented databases were heralded as the next big thing), it has some new twists. So, to review the players and their positions:

On our right are the relational curmudgeons, the kind of folks who pen manifestos and crusade against NULL values. They have converted nearly all of big business to their ministry, and have billions of dollars in their coffers to show for it. They insist that data should be stored in terms of its relations, to protect its integrity and facilitate its analysis. Ideally that means third-normal form, but more liberal branches of the church exist.


How XML Threatens Big Data

by Michael E. Driscoll | August 22, 2009

Credit: http://www.flickr.com/photos/digitalart/2101765353

Confessions from a Massive, Nightmarish Data Project

Back in 2000, I went to France to build a genomics platform. A biotech hired me to combine their in-house genome data with that of public repositories like GenBank. The problem was that the repositories, all with millions of records, each had their own format. It sounded like a massive, nightmarish data interoperability project. And an ideal fit for a hot new technology: XML.

So I dove in, spending my days designing DTDs, writing parsers, tweaking tags (“taxon” or “species”? attribute or element?). At night I dreamt in ontologies. It was perfect.

Then reality struck. The pipeline was slow: Oracle loaded XML at a crawl. And it was a memory hog, since XSLT required putting full document trees in RAM.

We had a deadline to meet (and, mon dieu, a 35 hour work-week). So we changed course. We hacked our Perl scripts to emit a flat tab-delimited format — “TabML” — which was bulk loaded into Oracle. It wasn’t elegant, but it was fast and it worked.
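To make the contrast concrete, here is a minimal Python sketch of the flat-file approach (the original pipeline used Perl scripts, which are not shown here). The record layout, the field names, and the function name xml_to_tsv are all hypothetical, invented for illustration. The sketch reads XML one element at a time with the standard library's iterparse instead of building the full document tree, and writes tab-delimited rows of the kind a bulk loader can ingest.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input; element and field names are invented for illustration.
SAMPLE_XML = """<records>
  <record><id>g1</id><taxon>Homo sapiens</taxon><length>1200</length></record>
  <record><id>g2</id><taxon>Mus musculus</taxon><length>950</length></record>
</records>"""

FIELDS = ["id", "taxon", "length"]


def xml_to_tsv(xml_source, tsv_sink, fields=FIELDS):
    """Stream <record> elements out as tab-delimited rows; return the row count."""
    writer = csv.writer(tsv_sink, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    rows = 0
    # iterparse yields each element as its end tag is read, so the whole
    # document tree never has to sit in memory at once.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "record":
            writer.writerow([elem.findtext(f, default="") for f in fields])
            rows += 1
            elem.clear()  # drop the record's children once the row is written
    return rows


out = io.StringIO()
count = xml_to_tsv(io.BytesIO(SAMPLE_XML.encode("utf-8")), out)
print(out.getvalue())
```

The design choice mirrors the story above: once each record is a single line of tab-separated fields, loading becomes a sequential scan with no tree to hold in RAM.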

Yet looking back, I realize that XML was the wrong format from the start. And as I’ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including initiatives like Data.gov.

In the next sections, I discuss how XML fails for Big Data because of its unnatural form, bulk, and complexity. Finally, I generalize to three rules that advocate a more liberal approach to data.


The Rise of the Data Web

by Michael E. Driscoll | August 20, 2009

The future of the web is data, not documents. The web has evolved from Tim Berners-Lee’s original vision of “some big, virtual documentation system in the sky” into a vibrant ecosystem of data where documents, and human actors, will play an ever smaller role.

As others have noted, we’ve reached a tipping point in history: more data is being manufactured by machines — servers, cell phones, GPS-enabled cars — than by people. The early, document-centric web was populated by hand-coded hypertext files; today, a hand-coded web page is as rare as hand-woven clothing.

Through web frameworks, wikis, and blogs, we have industrialized the creation of hypertext. Similarly, we’ve also industrialized the collection of data, and spliced out the human steps in many data flows, such that data entry clerks may soon be as rare as typesetters.

The web we experience will continue to be dominated by documents: e-mail, blogs, and news. And while many sites are data-centric (Google Maps, Weather.com, and Yahoo Finance), it’s the web that we can’t see that is surging with data. It’s not about us, it’s about servers in the cloud mediating entire pipelines of data, only occasionally surfacing in a browser.

But the web’s data architecture is fractious and in flux: many competing standards exist for serializing, parsing, and describing data. As we build out the data web, we ought to embrace standards that mirror data’s form in its natural habitats — as programmatic data structures, relational tables, or key-value pairs — while taking advantage of data’s stream-like nature. Mark-up languages like HTML and XML are ideal for documents, but they are poor containers for data, especially Big Data.
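As a rough illustration of the "poor containers" point, the Python sketch below serializes the same three records as XML, as one JSON object per line, and as tab-delimited text, then compares their lengths. The records, field names, and element names are hypothetical, and character count is only a crude proxy for bulk; it says nothing about parsing cost.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"station": "SFO", "temp_c": "14", "wind_kph": "22"},
    {"station": "OAK", "temp_c": "15", "wind_kph": "18"},
    {"station": "SJC", "temp_c": "17", "wind_kph": "11"},
]
fields = ["station", "temp_c", "wind_kph"]


def as_xml(rows):
    """Wrap every value in its own element, document-style."""
    root = ET.Element("observations")
    for row in rows:
        obs = ET.SubElement(root, "observation")
        for name in fields:
            ET.SubElement(obs, name).text = row[name]
    return ET.tostring(root, encoding="unicode")


def as_json_lines(rows):
    """One key-value object per line, so a reader can stream record by record."""
    return "\n".join(json.dumps(row) for row in rows)


def as_tsv(rows):
    """A relational-style table: one header line, then one line per record."""
    lines = ["\t".join(fields)]
    lines += ["\t".join(row[name] for name in fields) for row in rows]
    return "\n".join(lines)


sizes = {
    "xml": len(as_xml(records)),
    "json_lines": len(as_json_lines(records)),
    "tsv": len(as_tsv(records)),
}
for fmt, n in sizes.items():
    print(f"{fmt}: {n} characters")
```

The XML version repeats every field name twice per value, the key-value version once, and the table only once in the header, which is why the gap widens as records accumulate.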


The Three Sexy Skills of Data Geeks

by Michael E. Driscoll | May 27, 2009

Hal Varian, Google’s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly:
“The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”

In prepping for tonight’s talk at the Google I/O Ignite event, this quote inspired me to muse about how sex appeal and statistics might go together, so I chose to mash up a few scatter plots with Andy Warhol’s Marilyn Monroe.

Statisticians’ sex appeal has little to do with their lascivious leanings (ahem, BedPost), and more with the scarcity of their skills.  I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas:  statistics, data munging, and data visualization.  (In parentheses next to each, I’ve put the salient character trait needed to acquire it).

Skill #1: Statistics (Studying). Statistics is perhaps the most important skill and the hardest to learn.

Dataviz Salon SF #2: Maps, Grammars, & Models

by Michael E. Driscoll | May 8, 2009

A few nights ago the talented folks at Stamen Design hosted us at their studios for our second dataviz salon in San Francisco.  (Special thanks to Tom Carden and Michal Migurski for inviting us).  Four talks were given, which I’ll review in turn.

Stamen: Reaching through Maps

Eric Rodenbeck (Stamen) started by highlighting several mapping visualizations that Stamen has been hacking on recently and in the past, including Cabspotting in San Francisco, Crimespotting in Oakland, and Olympic Stadium spotting in London.

Eric showed how Stamen has attempted to move away from what Schuyler Erle has dubbed “red dot fever”, whereby the overlaid data can overwhelm our visual attention, and toward allowing various data layers to “reach through” the maps.

For example, the London Olympic maps provide a mixture of schematic, satellite, and webcam images.  These various drill-downs of detail are not all exposed, but rather collaged.  Even more interesting was a movable ‘lens’ that, as it is moved over regions of a map, reveals another layer (reminiscent of a polarized-light based mural at Boston’s MoS).  In these ways, additional layers of data are only selectively brought into focus (echoing a design pattern in Japanese gardening, mie gakure, meaning “seen and unseen”).

Color: The Cinderella of dataviz

by Michael E. Driscoll | March 13, 2009

“Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.”  — Envisioning Information, Edward Tufte, Graphics Press, 1990   

Color is one of the most abused and neglected tools in data visualization. It is abused when we make poor color choices; it is neglected when we rely on poor software defaults. Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.

Most of us think twice before walking outside in fluorescent red underoos. If only we were as cautious in choosing colors for infographics. The difference is that few of us design our own clothes. But until good palettes (like ColorBrewer) are commonplace, to get colors that fit our purposes, we must be our own tailors.

While obsessing about how to implement color on the Dataspora Labs’ PitchFX viewer, I began with a basic motivating question:

People who love scatter plots & connecting dots

by Michael E. Driscoll | February 19, 2009


We hosted the first Dataviz Salon SF on Tuesday night, with lightning talks by boredom cop Shane Booth, dataviz wiz Lee Byron, computational journalist Brad Stenger, data wrangler Pete Skomoroch, and any/all data enthusiast Brendan O’Connor.

I was going to blog all about it — but Tom Carden of Stamen Design already has a great write-up.

… Dataspora invited a few people to a Dataviz Salon yesterday evening. Mike and I went along and huddled in a brick-built basement in SoMa to listen to the following:


How Google and Facebook are using R

by Michael E. Driscoll | February 19, 2009


(March 26th Update: Video now available)
Last night, I moderated our Bay Area R Users Group kick-off event with a panel discussion entitled “The R and Science of Predictive Analytics”, co-located with the Predictive Analytics World conference here in SF.

The panel comprised four recognized R users from industry:

  • Bo Cowgill, Google
  • Itamar Rosenn, Facebook
  • David Smith, Revolution Computing
  • Jim Porzak, The Generations Network (and Co-Chair of our R Users Group)

The panelists were asked to explain how they use R for predictive analytics within their firms, to describe its strengths and weaknesses as a tool, and to provide a case study. What follows is my summary with comments.


Is Big Data at a tipping point?

by Michael E. Driscoll | January 9, 2009

(5/18/09 update - included an overdue reference to linked data!) 

Stuart Kauffman, in one of his books about complexity, discusses tipping points in networks (what he calls phase transitions) by way of buttons. Suppose you’re sitting on a floor strewn with 400 buttons, and you begin tying them together with pieces of string at random. At first, you have just pairs of buttons. Then, you have clusters of threes, which in turn get tied into ever larger clumps. The question is: How long until picking any button off the floor pulls them all off together, in one connected mass?

It turns out that this supercluster of buttons doesn’t build gradually as we tie more threads; it emerges suddenly. This rapid phase transition, from relatively unconnected to mostly connected, occurs right around where we have about half as many threads as buttons (see figure). This is the tipping point of the system: where a few threads make a big difference.
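The button experiment is easy to try in code. The following is a minimal Python sketch, not Kauffman's own procedure: it assumes each thread ties two distinct buttons chosen uniformly at random (repeat pairs allowed), and it tracks the largest cluster with a union-find structure. The function name, the seed, and the checkpoints are arbitrary choices for illustration.

```python
import random


def largest_cluster_sizes(n_buttons=400, n_threads=400, seed=0):
    """Tie random pairs of buttons; return the largest cluster size after each thread."""
    rng = random.Random(seed)
    parent = list(range(n_buttons))  # each button starts as its own cluster
    size = [1] * n_buttons

    def find(x):
        # Follow parent links to the cluster's root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    largest = 1
    history = []
    for _ in range(n_threads):
        a, b = rng.sample(range(n_buttons), 2)  # two distinct buttons
        ra, rb = find(a), find(b)
        if ra != rb:
            # Merge the smaller cluster into the larger one.
            if size[ra] < size[rb]:
                ra, rb = rb, ra
            parent[rb] = ra
            size[ra] += size[rb]
            largest = max(largest, size[ra])
        history.append(largest)
    return history


history = largest_cluster_sizes()
for threads in (100, 200, 300, 400):
    share = history[threads - 1] / 400
    print(f"{threads} threads: largest cluster holds {share:.0%} of buttons")
```

Running it with different seeds and comparing the largest cluster's share on either side of 200 threads (half as many threads as buttons) is a quick way to see how abruptly the connected mass appears.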

A similar phase transition has already occurred with regard to data inside business ecosystems. For the past several decades, an increasing number of business processes (sales, customer service, shipping) have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action (sales lead, mouse click, or shipping update) is stored. The result: organizations are overwhelmed by what feels like a tsunami of data.

The same trend is occurring in the larger universe of data that these organizations inhabit. Big Data is being unleashed by the “Industrial Revolution of Data”, whether from public agencies, non-profit institutes, or forward-thinking private firms.

At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It’s frozen because the lack of common format and meta-data standards makes it hard for data to flow from one place to another: comparing the SEC’s financial data with Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist. Data is “underwater” when, whether for reasons of competitiveness, privacy, or sheer incompetence, it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling studies with huge cohorts).

Yet there’s a slow thaw underway as evidenced by a number of initiatives:  Aaron Swartz’s theinfo.org, Flip Kromer’s infochimps, Carl Malamud’s bulk.resource.org, the Tim-Berners-Lee-inspired LinkedData.org, as well as Numbrary, Swivel, Freebase, and Amazon’s public data sets.  These are all ambitious projects, but the challenge of weaving these data sets together is still greater.

How far are we from the tipping point of Big Data? When will the world’s icebergs of data melt into one sea? More importantly, when it happens, will we be ready to do something useful with it all?
