Data Modeling — Usual Musing

This morning I got back to one of my high-priority questions regarding my NFL data analysis -- if statistically there's a significant advantage to scoring first, why do teams almost universally defer to the second half after winning the coin toss?

The most prevalent theory seems to be that there's a bigger advantage in a 1-2 punch: scoring at the end of the first half, then scoring again immediately after receiving the opening kick of the second half. Good news -- with my handy data set from ArmchairAnalysis.com, we can put that theory to the test.

Time to augment the dataset!* I added a number of fields, tracking which team was on offense for the opening drive, which team was on offense for the final drive of the first half, and which team was on offense for the first drive of the second half. Next, I added how many points were scored on the final drive of the first half (H1) and how many were scored on the opening drive of the second half (H2). This gives me all the data I need to test the hypothesis.
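
For the curious, here's a rough sketch of how fields like these could be derived in Python. The Drive record, its field names, and the sample game are all placeholders I made up for illustration. They are not the actual ArmchairAnalysis schema, and the real [Drive] table may be laid out differently.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    # Hypothetical layout, not the real ArmchairAnalysis columns.
    game_id: int
    half: int      # 1 or 2
    seq: int       # order of the drive within the game
    offense: str   # team abbreviation
    points: int    # points scored on the drive


def half_boundary_fields(drives):
    """Derive the five added fields from one game's drives."""
    ordered = sorted(drives, key=lambda d: d.seq)
    h1 = [d for d in ordered if d.half == 1]
    h2 = [d for d in ordered if d.half == 2]
    return {
        "opening_offense": h1[0].offense,
        "h1_final_offense": h1[-1].offense,
        "h2_opening_offense": h2[0].offense,
        "h1_final_points": h1[-1].points,
        "h2_opening_points": h2[0].points,
    }


# Made-up game: BBB scores to end H1, then scores again to open H2.
game = [
    Drive(1, 1, 1, "AAA", 0),
    Drive(1, 1, 2, "BBB", 0),
    Drive(1, 1, 3, "AAA", 0),
    Drive(1, 1, 4, "BBB", 3),
    Drive(1, 2, 5, "BBB", 7),
    Drive(1, 2, 6, "AAA", 0),
]
fields = half_boundary_fields(game)
print(fields)
```

In this made-up game, BBB lands the 1-2 punch: a field goal on the last drive of H1, then a touchdown on the first drive of H2.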

But wait! I compiled these new fields for ten games, just a small set for a quick check of the data logic. Good thing, because game #3 in my dataset showed the bane of the business intelligence professional: an anomaly that seems to contradict the business model.

Specifically, in Week 1 of the 2000 season, the Eagles somehow were on offense for the first drive of EACH half. According to the business model, this can't happen. The team that kicks off first in the first half will receive the first kickoff of the second half.**

So, what's going on here? Is my business model incorrect? Or is it a data issue?*** Fortunately, this dataset has the mother of all sports data: play by play information for all 4,523 games represented. It's a virtual ton of data. Analyst heaven.

And it provides the answer. Those crafty Eagles actually kicked and recovered an onside kick to start the game. Thus, in the dataset's [Drive] table, Philly is on offense for the first drive of the game. They still get to receive the ball first in the second half, and since Dallas did not try the same maneuver, Philly ended up on offense for the opening drive of both halves. Go figure.

Is this an error in the business model or the data model? Neither, in my opinion. The business rule is that each team must make one of the opening kicks. The data definition of a "drive" is a series of events demarcated by one team's possession of the ball. The kickoff is a specific business process event, and that event does not constitute a "drive."

To prevent confusion, our documentation (our data dictionary, information model definition, white papers, etc.) should include a thorough explanation of this important point. Analysts need to understand it so they handle the scenario correctly in calculated fields. Report consumers need to understand it so they don't make misguided decisions. Remember that documentation blog that I posted last week? (It's the one you skipped.)

But wait...this isn't the only scenario where the team receiving the opening kick doesn't also own the first drive, according to the [Drive] table. In the 2010 season, Miami kicked off to Oakland, and Jacoby Ford returned it 101 yards for a touchdown. In the data model, no actual drive occurred -- only the kickoff event. The first drive of the game occurred after Oakland kicked, and Miami started an actual drive from their own 1-yard line. Miami did go on to win, despite taking the more common route to scoring.

All right, I've discovered something crucial about my dataset. I've explained it to my stakeholders. But...the job isn't done. Before I can move on to my question about the 1-2 Punch, I must decide how to handle these scenarios during analysis. First step -- determine how often they occur.

Turns out there are 159 instances of the same team seemingly owning both "opening" drives. That's 3.5% of my 4,523 total games, not an insignificant number. If this were a professional project, I'd have a governance process defined for an official decision on how to account for the scenario in reporting. Since it's my personal project, though, I only have to agree with myself. Best stakeholder ever.
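
A quick sanity check on that percentage, using the two counts quoted in this post (the variable names are just mine):

```python
anomalous_games = 159   # same team owns both "opening" drives
total_games = 4523      # games in the dataset

share = anomalous_games / total_games
print(f"{share:.1%}")   # prints 3.5%
```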

The solution: add a flag to conveniently filter these games out of the analysis when I (finally) get to that question of the 1-2 Punch. But since today's blog was hijacked by my seeming anomaly, that work will have to wait until tomorrow. Such is BI.
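
Here's a minimal sketch of what that flag might look like in Python. The field names (opening_offense, h2_opening_offense, exclude_from_punch_analysis) and the three sample games are hypothetical, made up for illustration rather than taken from the dataset.

```python
def same_team_opens_both_halves(game):
    """True when one team is on offense for the first drive of each half."""
    return game["opening_offense"] == game["h2_opening_offense"]


# Made-up rows; game 2 mimics the onside-kick and kick-return scenarios.
games = [
    {"game_id": 1, "opening_offense": "AAA", "h2_opening_offense": "BBB"},
    {"game_id": 2, "opening_offense": "CCC", "h2_opening_offense": "CCC"},
    {"game_id": 3, "opening_offense": "EEE", "h2_opening_offense": "FFF"},
]

# Add the flag to every game, then filter on it.
for g in games:
    g["exclude_from_punch_analysis"] = same_team_opens_both_halves(g)

clean_games = [g for g in games if not g["exclude_from_punch_analysis"]]
print([g["game_id"] for g in clean_games])
```

Keeping the flag as a stored field, rather than dropping the rows, means the flagged games are still there if I ever want to study them on their own.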

* Data analysts get so excited about adding calculated fields to the dataset. I think that's our version of creating the CGI fire from the dragons in Game of Thrones.

** Unless the kickoff rules in 2000 were as unclear as the current day catch rules are, that is...

*** Dear BI Analysts: you may as well assume it's a data issue, until you prove otherwise. Your stakeholders certainly will. Low customer satisfaction? Bad resource utilization? Blew through the budget in one quarter? Prove it isn't a data error, then we'll talk.****

**** Keep in mind that, as irritating as that attitude can be, it should also be the attitude of the BI professional. You need to be 100% confident in the data before you tell your stakeholder that the data's fine.