Now that you’ve met the fundamental analytic machinery — in both its map/reduce and table-operation forms — it’s time to put it to work.
This part of the book will equip you to think tactically: to think in terms of the changes you would like to make to the data. Each section introduces a repeatedly useful data transformation pattern, demonstrated in Pig (and, where we’d like to reinforce the record-by-record action, in Python as well).
One of this book’s principles is to center demonstrations around an interesting and realistic problem from some domain. And whenever possible, we endeavor to indicate how the approach would extend to other domains, especially ones with an obvious business focus. The tactical patterns, however, are exactly those tools that crop up in nearly every domain: think of them as the screwdriver, torque wrench, lathe and so forth of your toolkit. Now, if this book were called "Big Mechanics for Chimps", we might introduce those tools by repairing and rebuilding a Volkswagen Beetle engine, or by building another lathe from scratch. Those lessons would carry over to anywhere machine tools apply: air conditioner repair, fixing your kid’s bike, building a rocket ship to Mars.
So we will center this part of the book on the dataset we just introduced, what Nate Silver calls "the perfect data set": the sea of numbers surrounding the sport of baseball. The members of the Retrosheet and Baseball Databank projects have provided an extraordinary resource: comprehensive statistics from the birth of the game in the late 1800s until the present day, freely available and redistributable. The previous chapter gives an overview of the stats we’ll use. And even if you’re not a baseball fan, don’t worry: we’ve minimized the number of baseball concepts you’ll need to learn.
In particular, we will be hopping in and out of two main storylines as each pattern is introduced. One is a graphical biography of each player and team — the data tables for a website that can display timelines, maps and charts of the major events and people in the history of a team or player. This is explanatory analytics, where the goal is to summarize the answers to well-determined questions for presentation. We will demonstrate finding the geographic coordinates of each stadium and assembling the events of a player’s career in ways you can apply any time you want to show things on a map or display a timeline. When we demonstrate the self-join by listing each player’s teammates, we’re showing you how to list all other products purchased in the same shopping cart as a given product, or all pages co-visited by a user during a website session, or to handle any other occasion where you want to extend a relationship by one degree.
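To give a flavor of that pattern before we meet it properly, here is a minimal Pig sketch of the teammates self-join. The relation and field names (a hypothetical `player_seasons` table of `player_id`, `team_id`, `year_id`) are placeholder assumptions for illustration, not the exact tables we’ll load later.

[source,pig]
----
-- Sketch only: 'player_seasons' and its fields are hypothetical stand-ins
-- for the Retrosheet/Baseball Databank tables introduced later.
player_seasons = LOAD 'player_seasons' AS (player_id:chararray, team_id:chararray, year_id:int);

-- Pig needs two distinct aliases to join a relation with itself,
-- so make a second copy with the player field renamed.
roster_copy    = FOREACH player_seasons GENERATE player_id AS teammate_id, team_id, year_id;

-- Pair every player with everyone who shared a (team, year) with them ...
paired         = JOIN player_seasons BY (team_id, year_id), roster_copy BY (team_id, year_id);
-- ... drop each player's pairing with themselves ...
others         = FILTER paired BY player_seasons::player_id != roster_copy::teammate_id;
-- ... and keep each (player, teammate) combination once.
pairs_only     = FOREACH others GENERATE
                   player_seasons::player_id AS player_id,
                   roster_copy::teammate_id  AS teammate_id;
teammates      = DISTINCT pairs_only;
----

Swap order IDs in for team-seasons and products in for players, and the same script lists co-purchased products; the pattern, not the domain, is the point.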
The other storyline is to find indicators of exceptional performance, supplying a quantitative basis for the age-old question "Who are the greatest players in the game?". This is exploratory analytics, where the work is as much to determine the questions as to assemble the answers. We will quantify answers to questions like "how many great seasons did this player have?" or "how many great players did this team have, and in which eras?". As we pursue this exploration, you should recognize not just a way for fantasy baseball players to get an edge, but strategies for quantifying the behavior of any sort of outlier. Here, it’s baseball players, but similar questions will apply when examining agents posing security threats, load factors on public transit routes, factors causing manufacturing defects, cell strains with a significantly positive response, and many other topics of importance.
In many cases, though, a pattern has no natural demonstration in service of those primary stories, and so we’ll find questions that could support investigations of their own: "How can we track changes in each team’s roster over time?", "Is the stereotypical picture of the big brawny home-run hitter true?" For these we will usually just show the setup and stop at the trailhead. But when the data comes forth with a story so compelling it demands investigation ("Does God really hate Cleveland?", "Why are baseball players more likely to die in January and be born in August?") we will take a brief side trip to follow the tale.
This means, however, that you may find yourself looking at a pattern and saying "geez, I don’t see how this would apply to my work in \[quantitative finance|manufacturing|basketweaving|etc\]". It might be the case that it doesn’t apply; a practicing air conditioner repair person generally has little use for a lathe. In many other cases it does apply, but you won’t see how until some late night when your back’s against the wall and you remember that one section that covered "Splitting a Table into Uniform Chunks", and an hour later you tweet "No doubt about it, I sure am glad I purchased 'Big Data for Chimps'". Our belief, and our goal, is that the second scenario is by far the more common one.
Each pattern is followed by a "pattern in use" synopsis that suggests alternative business contexts for the pattern, lists important caveats and associated patterns, and explains how to reason about its performance. These will become most useful once you’ve read the book a first time and (we hope) begun using it as your go-to reference. If you find their level of detail a bit intense, skim past them on the first reading.
What’s most important is that you learn the mechanics of each pattern, ignoring the story if you must. The best thing you can do is to grab a data set of your own — from your work or research — and translate the patterns to that domain. Don’t worry about finding an overarching theme like our performance-outliers storyline; just get a feel for the craft of using Hadoop at scale.
In the next chapter, we will introduce Apache Pig and its language, Pig Latin. Thereafter we’ll use Pig to introduce the analytic patterns, one after the other, building your toolkit of techniques as we go.