What I’ll be presenting at O’Reilly Money Tech 2009

by mike | October 21, 2008

I’ve been invited to speak at O’Reilly’s Money Tech conference this coming February 4-6 in New York City and thought I’d share the abstract for my talk here.  I’ll likely be in New York for several days; if you’d like to get together to chat about data, drop me a line!

Open Source Analytics: Visualization and Predictive Modeling of
Big Data with the R Programming Language

ABSTRACT

Just as the explosion of online data catalyzed the development of
storage technologies such as Hadoop, new challenges in data analytics
(turning terabytes into actionable insights) demand new tools.  R,
an open-source language for statistical computing and graphics, is an
extensible, embeddable, and industrial-strength solution for analytics.
In this session, I showcase R’s power by building predictive models
for Brazilian soybean harvests and baseball slugger salaries.

DESCRIPTION

The economics of data aggregation and analysis are being disrupted by
falling costs for storage and CPU power, the continuing shift of
business processes online, and the deluge of data that is being
generated as a consequence.

Satellite images, SEC filings, supply chain data (RFID data streams),
online prices, and newsgroup content represent just a few of the data
sources that hold potential for predictive modeling of markets.

Much of this data does not fit within existing paradigms for business
analysis: either its size overwhelms traditional desktop tools such as
Excel, or else its unique dimensions (such as geocodes) prevent its
being pipelined into more powerful, but narrowly designed, analysis
tools.  Finally, closed-source tools cannot keep pace with the leading
edge of innovation in statistical and machine-learning algorithms.

Enter the open source programming language R.  R has been dubbed the
lingua franca for statistical computing and graphical analysis, with a
pedigree tracing back several decades at Bell Labs.  Though its
million-plus users are concentrated within academia, R is gaining
currency within several high-profile quantitative analysis groups,
including Google’s Customer Insights team and Barclays Global
Investors.  In addition, R’s extensibility via user-contributed
packages has spawned an active developer community.

In this session, I will focus on applying R’s powerful visualization
tools to guide the construction of predictive models, using the kind
of large, multidimensional data sets that increasingly confront
quantitative analysts.  Along the way, I will highlight R’s packages
for inferential statistics, its compact modeling syntax, and its ease
of connectivity with persistent data stores.

The two specific examples I will discuss are:

- an analysis of NASA’s Landsat imagery of Brazil’s center-west
agricultural regions to detect correlates for soybean harvest yields,
and a derived predictor of the Brazilian soybean market based in part
on these correlates.

- a validation of Bill James’ sabermetrics approach to batting
performance using 30 years of Major League Baseball statistics, and a
derived predictor for batters’ salaries.

For all of its strengths, R has an admittedly steep learning curve.
While source code for the examples will be provided online, this talk
will emphasize techniques and working examples over technical details.
The goal of this session is to give quantitative analysts the courage
to invest in learning the R language, by showcasing R’s power,
highlighting its features, and providing examples of its use for
innovative applications.

How do you measure a major league slugger?

by mike | September 1, 2008

I gave a talk last month at SAP Labs in Palo Alto, along with Jim Porzak of Responsys, introducing the R Statistical Language to a Business Intelligence interest group.  The goal was to highlight how open source tools, like R, can be used to build predictive models.  The example I gave centered on baseball and a simple question:  how do you measure a baseball slugger?

Michael Lewis, in Moneyball, described how the baseball analyst Bill James was frustrated by the fact that major league hitters were consistently rated by their batting averages. James wrote:

“a hitter should be measured by his success in that which he is trying to do, … create runs. It is startling, when you think about it, how much confusion there is about this.”

- Bill James, 1979 Baseball Abstract

However, since teams create runs, not batters, the only way to connect batting statistics with runs is to use team averages. The idea is that if we know which statistics predict runs at the team level, these statistics could be used to measure individual hitters.

I decided to test the value of three batting statistics myself (batting average, slugging percentage, and OPS, or on-base plus slugging) and see how well they predicted team runs, using MLB team data for the years 2000-2005 (available from baseball-databank.org). The results are shown in the three scatter plots below, and, no surprise, Bill James is right:  a team’s overall batting average (top-most chart) is a comparatively poor predictor of how many runs it will score in an average game.  Slugging percentage (middle plot) is a slightly better predictor, and OPS (bottom plot) is the best of the three statistics I looked at:  it has a 0.95 correlation with runs scored. (The r shown in the upper right corner of the plots is the Pearson correlation coefficient; the red lines represent least-squares fits to the points.)

I highlighted a couple of interesting outliers in the top batting average plot: teams that achieved a high level of scoring with a comparatively low team batting average.  Who were these teams?  None other than Billy Beane’s 2000 and 2001 Oakland Athletics.  This suggests that the A’s management may have found excess value in fielding players who, despite having slightly lower batting averages, were capable of generating runs.

These results show what Bill James and others already know: that a baseball slugger should not be measured by his batting average, but by OPS or other hybrid statistics that best correlate with his success at generating runs. There is nothing novel about the results of my analysis.  But what I hope is novel is showing how it can be done using open source tools (R and MySQL), free data (baseball-databank.org), and a few lines of code (see the sabermetrics using R page).
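The comparison described above can be sketched in a few lines. The example below is a minimal illustration in Python, not the R and MySQL code used for the analysis. The team totals are invented placeholder numbers (not values from baseball-databank.org), and the function names are illustrative. It computes batting average, slugging percentage, and OPS from counting statistics using their standard definitions, then the Pearson correlation of each with runs per game.

```python
from math import sqrt

def batting_average(hits, at_bats):
    return hits / at_bats

def slugging(singles, doubles, triples, home_runs, at_bats):
    # Total bases divided by at-bats.
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats

def on_base(hits, walks, hit_by_pitch, at_bats, sac_flies):
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length sequences.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# INVENTED team-season totals, for illustration only:
# (singles, doubles, triples, HR, BB, HBP, SF, AB, runs, games)
TOY_TEAMS = [
    (1000, 280, 30, 150, 500, 50, 45, 5500, 750, 162),
    (950, 300, 25, 200, 600, 60, 50, 5550, 850, 162),
    (1050, 260, 35, 120, 450, 40, 40, 5600, 700, 162),
    (980, 310, 20, 230, 650, 55, 50, 5500, 900, 162),
]

rows = []
for s, d, t, hr, bb, hbp, sf, ab, runs, games in TOY_TEAMS:
    hits = s + d + t + hr
    avg = batting_average(hits, ab)
    slg = slugging(s, d, t, hr, ab)
    ops = on_base(hits, bb, hbp, ab, sf) + slg
    rows.append((avg, slg, ops, runs / games))

runs_per_game = [row[3] for row in rows]
for name, col in (("AVG", 0), ("SLG", 1), ("OPS", 2)):
    r = pearson([row[col] for row in rows], runs_per_game)
    print(f"{name}: r = {r:.2f}")
```

With real team data in place of the toy rows, the three printed coefficients correspond to the r values in the scatter plots.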

the case for open source data visualization

by mike | July 24, 2008

When I was in graduate school, the most closely studied part of the scientific publications we read was not the results, but the methods sections. (It was also, incidentally, often the hardest section to write for one’s own publications.) Methods sections are wonderful because they allow you to verify that someone else’s work is correct — by reproducing it yourself. But more importantly, methods sections allow you to build upon the work of others. They are the open source code of science.

Unfortunately, for all but a small fraction of data visualizations on the web, there are no methods sections being published. This is a shame, because it slows the free flow of ideas and prevents the creative extension of other people’s work.

Three conditions must be met for a data visualization to be considered open and reproducible.

  • Open Tools: The software tool used for the visualization must be freely available. Thankfully, many of the most powerful visualization software tools, languages, and frameworks are now open source, such as Processing, Prefuse, ActionScript, and R.
  • Open Code (or Methods): The actual code, script, and/or series of steps taken to generate the visualization must be published. (For example, Lee Byron released his code for a walkability heatmap of San Francisco.)
  • Open Data: The data which is visualized should also be available in the same washed and scrubbed format that was used for the visualization. Ideally any code used to clean up the data might also be shared.

I grade some of the web’s existing data visualization sites using these criteria.

  • The New York Times routinely creates stunning graphics (like a visualization of 22 years of box office receipts), but we are left to guess how they were created. Grade: D
  • VisualComplexity, a graphics gallery of mostly complex networks (like genome networks), has pretty images but neither data nor visualization code. Grade: D
  • IBM’s ManyEyes has gorgeous visualizations (some of which are made with the Prefuse toolkit), and while the data is made available, the source code for the visualization is not. Grade: C
  • Processing’s exhibition page highlights several extraordinary visualizations created with its open-source framework. But unfortunately, no source code is available from the visual artists. Grade: C
  • The R Graphics Gallery does make source code for graphics available, but in more than half of the cases, no data is available. Grade: B

data visualization and the curse of dimensionality

by mike | July 24, 2008

“To achieve great things, we must be self-confined:
Mastery is revealed in limitation
And law alone can set us free again.”
– Goethe, Natur und Kunst

Data visualization suffers from the curse of dimensionality. Visualizing data is hard because there are so many ways to do it, and so few ways to do it right. There are orders of magnitude more arrangements of five points on a grid than there are five-word sentences. In his book, Information Visualization, Colin Ware enumerates several dimensions in visualization, including form, color, and position, which can be combined to create this vast space of possibility. The graphs below all visualize the same five data points but with different forms.

Written language has a grammar, and only a minute fraction of possible five-word sentences are valid. From this perspective, data visualization is deceptively easy. To a first approximation, anything goes. A fluorescently colored Excel ‘97 pie chart may reflect poor design choices, but it doesn’t break any rules.

Constraints — whether formal grammars or accepted conventions — force choices upon creators, freeing them to train their expressive powers elsewhere. Constraints also guide readers and viewers. A sonnet has a defined length and rhyme scheme: a writer works within the form and a reader knows what to expect.

Data visualization requires making a huge number of choices with few conventions to guide those choices. To these add another: that after the choices of form, color, and position are made in the mind, you face the challenge of realizing them with software tools.

Several texts, such as Edward Tufte’s The Visual Display of Quantitative Information and Stephen Few’s Information Dashboard Design, contain recommendations for data visualizations. Some of these emerge from studies of visual perception and theories of color, such as Few’s statement that our eyes can distinguish between at most five intensities of a single color. However, beyond the theoretical, discovering good form is often a process of trial and error. Historically, sets of conventions, whether musical notation or typographical style, have emerged from a community of practitioners.

A bottom-up evolution of standards, embodied by blogs such as Junk Charts, is beginning. And with time, it may raise the overall quality of information graphics, while relieving the burden of choice facing those who wish to visualize data. Yet one obstacle I see to this process is the closed-source nature of most visualizations, something I’ll discuss in my next post.

cloud computing and the rise of a fungible, elastic computing infrastructure

by mike | June 27, 2008

Over the last few days I’ve attended a couple of events here in San Francisco discussing the promise of cloud computing. I believe there are several reasons why this technology represents a paradigm shift (and one that does justice to Kuhn’s original meaning).

First, what is cloud computing? From the ten-thousand foot view, it is technology that uncouples web servers from their underlying hardware; it re-conceives them from being physical machines with plugs into “instances” running on top of the hardware, bundles of bits that can be moved and multiplied as easily as the software on our desktops. The “cloud”, like the “web”, is an abstraction whose physical reality (data centers with thousands of softly humming servers) we need not care about.

This shift has far-reaching consequences for the economics of computing, among them:

  • Fungibility of computing power. When the hardware that powers servers is indistinguishable and interchangeable, it becomes a fungible commodity. This fungibility binds together the entire market for computing infrastructure into one of massive scale (an estimated 1.5% of all energy use in the U.S. is due to servers). Servers can now run on hardware the way cars run on gasoline. This creates competition and market opportunities that didn’t previously exist. Companies like Amazon and Google are recognizing the opportunity to become the dominant providers of this new commodity.
  • Computing power as a leasable (and releasable) commodity. As Werner Vogels, the CTO of Amazon.com, has observed, cloud computing allows firms to shift infrastructure costs from being capital expenditures (owned) to operating costs (leased). Most servers are idle most of the time — this is because firms have traditionally invested in computing infrastructure to meet their peak capacity needs. Leasing computing power allows firms to more efficiently fit their demand, leasing more capacity on the cloud when needed, and releasing it when a peak period has passed.

Given these two features of the cloud computing marketplace, one might wonder where the competitive advantages lie for firms in this space outside of cost competition. In other words, what’s to stop users from seeking only the lowest prices? There is room for differentiation in service offerings, most notably in terms of reliability and security.

But a less conspicuous competitive edge that accrues to the first-movers in this space owes to the weight of data. Because of the relatively high cost of bandwidth, data needs to live close to the computing power that operates on it. But this same high cost of bandwidth makes moving data warehouses from one cloud provider to another, unlike migrating server “instances”, an expensive proposition. Indeed, one of the feature requests for the Amazon Web Services folks at June 24th’s CloudCamp was to accept data uploads mailed in on optical disks, showing that snail mail still lives as a cheap, fast way to move data.

For data analytics applications, many of which require short-lived but intense bursts of computation (performing daily or monthly trend analysis, for example), cloud computing offers cheap access to vast CPU power. It also provides a compelling incentive to parallelize these analytics algorithms, as leasing 100 servers for one hour is cost-equivalent to having one server for 100 hours.
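The cost equivalence above is simple arithmetic, sketched below in Python. The hourly rate is a hypothetical figure chosen for illustration, not a quoted price from any provider, and the wall-clock estimate assumes the ideal case where the work divides evenly across servers with no coordination overhead.

```python
RATE_PER_SERVER_HOUR = 0.10  # hypothetical price in dollars, for illustration only

def lease_cost(servers, hours, rate=RATE_PER_SERVER_HOUR):
    # Cost depends only on the product servers * hours.
    return servers * hours * rate

def wall_clock_hours(total_server_hours, servers):
    # Ideal parallelism: no overhead, work splits evenly.
    return total_server_hours / servers

job_size = 100  # server-hours of computation

for servers in (1, 10, 100):
    hours = wall_clock_hours(job_size, servers)
    cost = lease_cost(servers, hours)
    print(f"{servers:>3} servers: {hours:6.1f} hours, ${cost:.2f}")
```

Under these assumptions the bill is the same in every row while the waiting time shrinks, which is the incentive to parallelize; real workloads with imperfect scaling would pay some premium for the speedup.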

Cloud computing promises to elevate our computing infrastructure to the level of a utility, like water, gas, and electricity: something we take for granted in the best sense.

Information is more valuable than what it describes

by mike | June 1, 2008

“Information about money has become almost as important as money itself.” — Walter Wriston, former Chairman of Citicorp

“Some firms believe that in 10 years half their business will come from moving information about goods, rather than moving the goods themselves.” — The 20-Ton Packet, Wired 7.10

Information has always been valuable, but it’s only in the last decade that it has become so dramatically cheap: to store, to move, and to process. But it’s still not “too cheap to meter” (Stewart Brand’s phrase) and probably never will be.

Yet despite this dramatic drop in cost, the real value of information, by any measure, has not diminished. And because the costs of other goods — whether shipping containers, materials, or home furnishings — have fallen more slowly, information contributes an ever larger fraction of a firm’s profits.

A corollary to the falling cost of information, and its persistent value, is that as more kinds of information come online, more of this data is worth keeping. Even data whose value is metered in cents per terabyte is increasingly worth storing, and eventually analyzing, as it may yield several cents profit.

As industries evolve and their products become commodities, raw goods prices are driven down, tipping the balance further in favor of information as a source of profit. Firms become information enterprises first, with actual goods being only ancillary.

Nowhere is this more apparent than with the evolution of Amazon.com in the last decade. Amazon began as a retailer of books, became a retailer of consumer goods (abstracting away books), then expanded to be a retailer of retailers (abstracting away goods), and now with its S3 initiative, Amazon is attempting to become a seller of information services (abstracting away retailing altogether). Perhaps Amazon’s goal is to become a purely abstract entity, managing information (software) about information (marketplaces) for retailers who sell goods.

Wriston’s quote was prescient because he foresaw this change occurring in the consumer banking and finance world, long before Capital One took advantage of vast troves of consumer credit information to individually tailor its credit cards for maximum profit. But his observation is true today in its more general form: information about goods is almost as important as goods themselves.

Visualizing Tim Wakefield’s knuckleball

by mike | May 2, 2008

“Back in 1980, STATS Inc. … sent its own scorekeepers [to record] play-by-play information about the games that had never before been systematically collected: the pitch count at the end of at bats, pitch types and locations, the direction and distance of batted balls. They broke the field down into twenty-six wedges radiating out from home plate.”

Michael Lewis, Moneyball, p. 84

A friend recently sent me a blog post (originally from Josh Kalk) visualizing the differences in two Red Sox pitchers’ styles by using a data set, called MLB Extended Game Log, which catalogs over a dozen attributes of each pitch thrown. This got me wondering about why baseball has attracted such interest from statisticians, and also about ways in which this pitching data, in particular, might be better visualized.

Among the major professional sports in America, baseball is the game most amenable to statistical analysis. Whereas in football an offensive drive can be any range of yards, in baseball a hit comes in just four sizes. In football, hockey, and basketball, a clock continuously draws us toward the game’s end; in baseball, innings and outs mark the stepwise progress of the game. Baseball is, in many ways, a lovable finite-state machine: a game composed of a countable number of states (innings, outs, the count, base runner configurations).
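The finite-state view can be made concrete by enumerating the states within a half-inning. The Python sketch below is a deliberate simplification: it tracks only outs, the count, and which bases are occupied, and its `after_pitch` transition handles only called balls and strikes (no fouls, no balls in play). The names are illustrative.

```python
from itertools import product

OUTS = range(3)      # 0, 1, or 2 outs
BALLS = range(4)     # 0 to 3 balls
STRIKES = range(3)   # 0 to 2 strikes
# Each base is either empty (False) or occupied (True): 2**3 = 8 configurations.
BASES = list(product((False, True), repeat=3))

# Every combination of outs, count, and base runners.
half_inning_states = list(product(OUTS, BALLS, STRIKES, BASES))
print(len(half_inning_states))  # 3 * 4 * 3 * 8 = 288

def after_pitch(balls, strikes, pitch):
    """Advance the count for a called 'ball' or 'strike' (simplified)."""
    if pitch == "ball":
        return "walk" if balls == 3 else (balls + 1, strikes)
    if pitch == "strike":
        return "strikeout" if strikes == 2 else (balls, strikes + 1)
    raise ValueError(f"unsupported pitch outcome: {pitch}")
```

Because the state space is this small, each state can be observed many times over a season, which is part of what makes per-state statistics tractable.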

It’s not surprising, then, that statisticians have had a long love affair with baseball. Yet only in recent years have sabermetricians like Bill James finally gained the esteem of the front office and even the dugouts, as Michael Lewis’ Moneyball recounts.

(Aside: baseball is to professional sports what Wall Street is to the business world; a field dominated by numbers and consequently the leading adopter of quantitative techniques among its peers. It’s perhaps no coincidence that Michael Lewis’ earlier book, Liar’s Poker, centered on Wall Street).

To get back to the original motivation for this post, the MLB’s Extended Game Log introduces detailed data about pitches; for this data to be used effectively, it must be presented effectively. Unfortunately, data visualization is a hard problem: the choices of layout, color, point shape and size are nearly infinite. I have thus taken the same 2007 data (data for every pitch thrown in the MLB, courtesy of Josh Kalk’s blog), and generated a per-pitcher visualization. However, rather than overlaying all pitches on a single graph, I have created a mini-plot for each kind of pitch (fastball, slider, etc.). The result is that each pitcher has a panel that characterizes his pitching style, based on all pitches in the 2007 season.

In my visualization shown below, color depth is used to illustrate pitch count (deep blues represent more pitches) and pitch distribution is evidenced through the small multiples of the mini-graphs. One can see, for example, that Wakefield throws almost nothing but knuckleballs, and those knuckleballs break in every direction. We can also see how Matsuzaka’s and Sabathia’s choices of pitches are similar, but that Matsuzaka lacks a splitter. What is not preserved, but should be, is a fixed axis for the horizontal and vertical breaks (each sub-graph is centered on its own axes) that would better demonstrate how different pitchers and pitches use the strike zone.

A tale of three pitchers
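The data preparation behind a panel of small multiples like this can be sketched as follows. This is a minimal Python illustration, not the code used for the figure: the pitch tuples are invented, the bin width is an arbitrary choice, and the function names are hypothetical. Each pitch type gets its own grid of counts (the counts would drive color depth), and every grid uses the same bin edges, which gives the fixed shared axis discussed above.

```python
import math
from collections import Counter, defaultdict

BIN_WIDTH = 2.0  # inches of break per grid cell; an arbitrary choice

# INVENTED sample pitches: (pitch_type, horizontal_break, vertical_break)
SAMPLE_PITCHES = [
    ("knuckleball", -3.2, 1.5),
    ("knuckleball", 4.1, -2.7),
    ("knuckleball", 0.3, 0.9),
    ("fastball", -5.5, 9.8),
    ("fastball", -5.1, 9.1),
]

def bin_index(value, width=BIN_WIDTH):
    # Same bin edges for every panel, so panels are directly comparable.
    return math.floor(value / width)

def panel_counts(pitches):
    """Group pitches by type and count them per (x, y) grid cell."""
    panels = defaultdict(Counter)
    for pitch_type, h_break, v_break in pitches:
        cell = (bin_index(h_break), bin_index(v_break))
        panels[pitch_type][cell] += 1
    return panels

for pitch_type, cells in panel_counts(SAMPLE_PITCHES).items():
    print(pitch_type, dict(cells))
```

Each `Counter` maps directly onto one mini-plot; a plotting library would then shade each cell by its count.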

What’s happening in baseball is a harbinger of what’s to come in other sports, and in other industries, as more data is generated and available for analysis. A few lessons emerge from baseball. First, data analysis will not replace human decision-makers; a manager’s domain expertise and tacit knowledge cannot be replaced by an automated algorithm. However, data analysis can support and augment decision-makers’ instincts and reasoning abilities. Second, because decision-makers are people, and because, given our powerful visual system, seeing is believing, the way in which data is presented is critical.

Wikipedia as a human-writable, machine-readable database

by mike | April 10, 2008

Yesterday the blogosphere (or the corner of it I read) was buzzing about Bret Taylor’s post entitled We Need a Wikipedia for data. He makes some great points about how the world would be a better place if structured data were more easily available on the web somewhere, and he lists example data sets (stock data, sports scores, and mapping data) that he wishes he could more readily access (as his readers pointed out, historical stock data is available via Yahoo Finance’s API).

There are already a few groups striving to become the Wikipedia of Data — DBpedia is a distillation of Wikipedia, and Freebase is making a similar play (but going beyond just Wikipedia data).

Wikipedia and publishing software like Wordpress dramatically lowered the costs of putting content on the web. A high school chemistry teacher could edit the Wikipedia page on sodium or post an alkali metals lesson plan without being versed in HTML or server technology. The result was an explosion of content on the web.

However, just as we have lowered the barriers for content providers, many semantic web visionaries are pushing to raise them for data providers, hoping they will conform to a slew of specifications and ontologies when they publish their data. This is not a realistic hope. We should be happy that the Bureau of Labor Statistics is even publishing its data, never mind how. If the semantic web is to succeed, the only burden for data providers should be to publish data in a consistent, machine-readable format. The remaining burden should be shouldered by the data aggregators.

This is a challenge that DBpedia is addressing directly. Though there may be no ontology that describes the structure of a Wikipedia page, rules about document structure are enforced (often by automated bots), and many pages contain sidebars with semi-structured data. DBpedia recognizes these sidebars, typically consisting of key-value pairs with controlled vocabularies (“boiling point = 100”, or “population = 250 million”), and parses them into a well-structured database.
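The idea of turning sidebar key-value pairs into structured records can be sketched as below. This Python example is a toy illustration of the general technique, not DBpedia’s actual extraction code; the sample markup is a simplified, made-up sidebar, and the regular expression ignores nested templates, links, and multi-line values.

```python
import re

# Simplified, made-up sidebar markup for illustration.
SAMPLE = """{{Infobox example
| name = Water
| boiling point = 100
}}"""

# A line of the form "| key = value".
PAIR = re.compile(r"^\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")

def parse_infobox(text):
    """Collect '| key = value' lines into a dictionary."""
    record = {}
    for line in text.splitlines():
        match = PAIR.match(line)
        if match:
            record[match.group(1)] = match.group(2)
    return record

print(parse_infobox(SAMPLE))
```

Once many pages are parsed this way, the resulting records can be loaded into a database and queried, which is the burden the aggregator takes on instead of the publisher.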

We need more efforts like DBpedia which are willing to capitalize on the inherent structure in much freely available content and transform it into queryable data sets. By shifting the costs of formally structuring content away from the data-providers, perhaps we can catalyze an explosion of machine-readable data on the web, similar to what Wikipedia and blogging software have done for content publishing.

Why the web should pay more attention to you

by mike | April 9, 2008

Last night I attended a meet-up of the Bay Area Attention Data group, down at one of Google’s sprawling office parks in Mountain View.

The presenters were two folks, Kent Brewster of MyBlogLog and Peter Berger of Alitora Systems, both discussing ways of using semantic data to improve users’ web experiences.

The idea is to use information about you to better provide you with web services: in essence, it’s an extension of what web advertisers are already doing (Google’s AdSense does a very profitable job figuring out relevancy for advertising).

What does this have to do with the semantic web or attention economics? On the semantic web side, it’s important because in the present, who we are as individual web users is poorly structured; our click-streams and cookies may reveal a lot about us (and more than we realize), but it would be far better to have structured data about who we are (age, zip code, marital status, favorite movies, etc.) than to have to guess. Semantic web technologies could allow this “who am I?” data, either generated by me, by some trusted information broker, or by the sites that I visit (Amazon.com, NetFlix, etc.) to be shared and used by other services.

This relates to attention economics because services that know more about me can pay better attention to me. Search query terms can be disambiguated, spam mail better identified (no more ED ads for women), and our interactions with the web could consist of more signal and less noise, in theory.

To some extent, this is already happening inside sites, but whether it can happen across sites, and whether we can carry our well-structured personal data around with us, will depend on two things: (i) the emergence of new (and adoption of existing) open standards for personal data, and (ii) pressure from consumers to force private firms like LinkedIn, Facebook, and NetFlix, whose business models benefit from a lack of portability, to work with these open standards.

Google app engine and the read-write-execute web

by jason | April 8, 2008

Google’s App Engine cloud service has launched, and it reflects a very different philosophy than Amazon’s. First of all, as we are used to with Google, the first bit is free; this should get a lot of small users over the hump who were hesitant to start the flow of funds (and $75/month minimum for a persistent presence) to Amazon. For many purposes, the free account will be enough.

Secondly, it is very much in the direction of the next generation of the web: the read-write-execute web. The read-write web made data first-class, enabling two-way communication; the rwx-web makes functions first class. In this environment, not just the data displayed but the code executed on a rwx-web site is user-contributed.

Google provides a sandboxed platform, development environment, a Python runtime, and APIs to link to persistent storage and Google services (authentication, mail, etc.). This removes a lot of complexity for developers and allows many scalability issues to be dealt with by Google engineers, at the platform level.

I was delighted to see their choice of Python: it is a remarkably clear language, and well suited for web applications. Google’s endorsement of Python through employing the BDFL helped put some corporate power behind the language, but if App Engine catches on it could really vault Python to the next level in terms of acceptance. It also strikes me as particularly suited to the kind of abstraction they are offering.
