How XML Threatens Big Data

by Michael E. Driscoll | August 22, 2009

Credit: http://www.flickr.com/photos/digitalart/2101765353

Confessions from a Massive, Nightmarish Data Project

Back in 2000, I went to France to build a genomics platform. A biotech hired me to combine their in-house genome data with that of public repositories like GenBank. The problem was that the repositories, each holding millions of records, all had their own formats. It sounded like a massive, nightmarish data interoperability project. And an ideal fit for a hot new technology: XML.

So I dove in, spending my days designing DTDs, writing parsers, tweaking tags (“taxon” or “species”? attribute or element?). At night I dreamt in ontologies. It was perfect.

Then reality struck. The pipeline was slow: Oracle loaded XML at a crawl. And it was a memory hog, since XSLT required putting full document trees in RAM.

We had a deadline to meet (and, mon dieu, a 35 hour work-week). So we changed course. We hacked our Perl scripts to emit a flat tab-delimited format — “TabML” — which was bulk loaded into Oracle. It wasn’t elegant, but it was fast and it worked.
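
To make the contrast concrete, here is a minimal sketch (in Python rather than the Perl we used) of flattening XML records into tab-delimited rows while streaming, so the full document tree never sits in RAM. The record layout, the field names, the sample values, and the `xml_to_tab` helper are all assumptions made up for illustration; they are not the schema or code from the original project.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical record layout and made-up sample values: the tag names and
# fields are assumptions for illustration, not the original project's schema.
SAMPLE_XML = b"""<records>
  <record><accession>AB000001</accession><taxon>Homo sapiens</taxon><length>1204</length></record>
  <record><accession>AB000002</accession><taxon>Mus musculus</taxon><length>873</length></record>
</records>"""

FIELDS = ["accession", "taxon", "length"]


def xml_to_tab(xml_stream, out, record_tag="record", fields=FIELDS):
    """Stream record elements and write one tab-delimited row per record."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    # iterparse yields each element when its closing tag is read, so the
    # parser does not need the whole document before the first row is written.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == record_tag:
            writer.writerow([elem.findtext(f, default="") for f in fields])
            count += 1
            elem.clear()  # drop the record's children once the row is written
    return count


if __name__ == "__main__":
    buf = io.StringIO()
    n = xml_to_tab(io.BytesIO(SAMPLE_XML), buf)
    print(buf.getvalue(), end="")
```

The resulting rows are the kind of flat file a database bulk loader can ingest directly, which is roughly the trade we made: less structure, far less overhead.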

Yet looking back, I realize that XML was the wrong format from the start. And as I’ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including initiatives like Data.gov.

In the next sections, I discuss how XML fails for Big Data because of its unnatural form, bulk, and complexity. Finally, I generalize to three rules that advocate a more liberal approach to data.

Read more

The Rise of the Data Web

by Michael E. Driscoll | August 20, 2009

The future of the web is data, not documents. The web has evolved from Tim Berners-Lee’s original vision of “some big, virtual documentation system in the sky” into a vibrant ecosystem of data where documents, and human actors, will play an ever smaller role.

As others have noted, we’ve reached a tipping point in history: more data is being manufactured by machines — servers, cell phones, GPS-enabled cars — than by people. The early, document-centric web was populated by hand-coded hypertext files; today, a hand-coded web page is as rare as hand-woven clothing.

Through web frameworks, wikis, and blogs, we have industrialized the creation of hypertext. Similarly, we’ve also industrialized the collection of data, and spliced out the human steps in many data flows, such that data entry clerks may soon be as rare as typesetters.

The web we experience will continue to be dominated by documents: e-mail, blogs, and news. And while many sites are data-centric (Google Maps, Weather.com, and Yahoo Finance), it’s the web that we can’t see that is surging with data. It’s not about us; it’s about servers in the cloud mediating entire pipelines of data, only occasionally surfacing in a browser.

But the web’s data architecture is fractious and in flux: many competing standards exist for serializing, parsing, and describing data. As we build out the data web, we ought to embrace standards that mirror data’s form in its natural habitats — as programmatic data structures, relational tables, or key-value pairs — while taking advantage of data’s stream-like nature. Mark-up languages like HTML and XML are ideal for documents, but they are poor containers for data, especially Big Data.

Read more

Color: The Cinderella of dataviz

by Michael E. Driscoll | March 13, 2009

“Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.”  — Envisioning Information, Edward Tufte, Graphics Press, 1990   

[Image: multivariate color strip plot]

Color is one of the most abused and neglected tools in data visualization. It is abused when we make poor color choices; it is neglected when we rely on poor software defaults. Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.

Most of us think twice before walking outside in fluorescent red underoos. If only we were as cautious in choosing colors for infographics. The difference is that few of us design our own clothes. But until good palettes (like ColorBrewer) are commonplace, to get colors that fit our purposes, we must be our own tailors.

While obsessing about how to implement color on the Dataspora Labs PitchFX viewer, I began with a basic motivating question:

Read more

The cloud is a fungible, elastic computing infrastructure

by Michael E. Driscoll | June 27, 2008

Over the last few days I’ve attended a couple of events here in San Francisco discussing the promise of cloud computing. I believe there are several reasons why this technology represents a paradigm shift (and one that does justice to Kuhn’s original meaning).

First, what is cloud computing? From the ten-thousand-foot view, it is technology that uncouples web servers from their underlying hardware; it re-conceives them not as physical machines with plugs but as “instances” running on top of the hardware, bundles of bits that can be moved and multiplied as easily as the software on our desktops. The “cloud”, like the “web”, is an abstraction whose physical reality (data centers with thousands of softly humming servers) we need not care about.

This shift has far-reaching consequences for the economics of computing, among them:

Read more

Google App Engine and the read-write-execute web

by Jason Morton | April 8, 2008

Google’s App Engine cloud service has launched, and it reflects a very different philosophy from Amazon’s. First of all, as we are used to with Google, the first bit is free; this should get a lot of small users over the hump who were hesitant to start the flow of funds (and the $75/month minimum for a persistent presence) to Amazon. For many purposes, the free account will be enough.

Secondly, it is very much in the direction of the next generation of the web: the read-write-execute web. The read-write web made data first-class, enabling two-way communication; the rwx-web makes functions first-class. In this environment, not just the data displayed but also the code executed on an rwx-web site is user-contributed.

Google provides a sandboxed platform, a development environment, a Python runtime, and APIs to link to persistent storage and Google services (authentication, mail, etc.). This removes a lot of complexity for developers and allows many scalability issues to be dealt with by Google engineers at the platform level.

I was delighted to see their choice of Python: it is a remarkably clear language, and well suited for web applications. Google’s endorsement of Python through employing the BDFL helped put some corporate power behind the language, but if App Engine catches on it could really vault Python to the next level in terms of acceptance. It also strikes me as particularly suited to the kind of abstraction they are offering.