Friday, October 1, 2010

Soccer Sabermetrics - Challenges

The Challenges of Soccer Sabermetrics
To continue on from yesterday's post about soccer sabermetrics, there are a few things to consider before I dig into the actual equations. I mentioned this briefly in the previous post when I said:

Obviously the key difficulty is whereas measuring a single player's performance in baseball is relatively easy (each player performs, in essence, individually), it's harder to do for true team sports like soccer.

And I want to expand on that. In baseball, a batter performs on his own when he's hitting. Yes, he's reacting to the pitches, but it's his swing, his power, and his strategy that lead to hits or outs. The pitcher on the mound is relatively isolated as well. He works with the catcher, but the average catcher's efforts provide only a small amount of support to the overall effort.

That's not true on the soccer field. It's uncommon for a single player to act wholly on their own, except, of course, the goalie, who is a special case (which we will deal with later). Absolutely, there are players who elevate the play of the whole team, and there are players, Lionel Messi for example, who can make fantastic plays on their own. But they're the exception, not the norm. And the whole point of sabermetrics is to calculate reliable, unbiased statistics that measure individual players in their sport.

So two things need to happen:
  1. We need to find a way to isolate an individual player's efforts from those of their peers.
  2. We need to develop a set of statistics that indicate the overall ability of the team as well.
Measuring the Individual Player
A good starting point for measuring an individual player's skill would be efficiency statistics, such as shooting efficiency. How many goals does a player score for every shot on goal? How many times do they have the ball stolen from them when they have possession? How many shots on goal does a player take relative to their total time of possession?

These types of statistics relate as closely as possible to what a single player can do on the field. Obviously most of these examples are specific to offensive players, but they give the clearest picture of a player's contribution to the team. This is one of the challenging aspects of applying statistical models to individual players in a team sport. Relating it back to baseball again, and leaving the whole DH vs. pitchers hitting discussion aside for now, every player plays offense, and every player plays defense. Not so in sports like soccer, American football, basketball, and so on.
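
To make those questions concrete, here is a minimal sketch of how the three ratios might be computed from a player's raw totals. The field names (goals, shots_on_goal, times_dispossessed, possessions, possession_minutes) and the functions are my own placeholders for illustration, not a settled set of statistics, and the numbers in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass
class PlayerTotals:
    # Hypothetical raw counts for one player over some set of matches.
    goals: int
    shots_on_goal: int
    times_dispossessed: int
    possessions: int
    possession_minutes: float


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def individual_rates(p):
    """Compute the three per-player ratios discussed above."""
    return {
        # Goals scored for every shot on goal.
        "goals_per_shot_on_goal": safe_ratio(p.goals, p.shots_on_goal),
        # How often the ball is stolen, per possession.
        "dispossessions_per_possession": safe_ratio(
            p.times_dispossessed, p.possessions
        ),
        # Shots on goal per minute the player has the ball.
        "shots_on_goal_per_possession_minute": safe_ratio(
            p.shots_on_goal, p.possession_minutes
        ),
    }
```

For example, a player with 3 goals on 12 shots on goal has a goals-per-shot ratio of 0.25. Returning None instead of zero when a player has no shots or no possessions keeps "no data" distinct from "bad performance".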

Measuring the Team as a Whole
For overall team statistics, we can ask similar questions. How long does the ball stay in a specific zone on the field? Does a team play 70% of their matches in the midfield? Then maybe their forwards are underpowered. Do they play most of the match near their opponent's goal? Maybe they've stacked most of their powerful players forward, and you need to adopt a strategy of long balls towards their goal.
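
As a rough sketch of the zone question, assume someone has already logged how many seconds the ball spent in each zone during a match. The zone names and the function below are hypothetical, and the numbers in the usage note are made up.

```python
def zone_shares(seconds_by_zone):
    """Convert seconds of ball time per zone into fractions of the total.

    seconds_by_zone is a dict such as {"midfield": 3780, ...}.
    Returns a dict with the same keys and values that sum to 1.0,
    or all zeros if no time was recorded.
    """
    total = sum(seconds_by_zone.values())
    if total == 0:
        return {zone: 0.0 for zone in seconds_by_zone}
    return {zone: secs / total for zone, secs in seconds_by_zone.items()}
```

With 900 seconds in the defensive third, 3780 in the midfield, and 720 in the attacking third, the midfield share works out to 0.7, which is the "70% in the midfield" case described above.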

The biggest concern with this class of statistics is that they can depend very heavily on the opponent. If an average team faces four of the best teams in the league at the start of its season, then its stats may be skewed unfairly in a negative direction.
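
One possible way to soften that skew, and this is purely an assumption on my part rather than a method I've settled on, is to compare what a team did against each opponent with what that opponent usually allows. The sketch below uses goals scored as the example stat; the function name and both inputs are hypothetical.

```python
def opponent_adjusted_scoring(results, avg_conceded):
    """Average a team's scoring relative to what each opponent usually allows.

    results: list of (opponent, goals_scored) pairs, one per match.
    avg_conceded: dict mapping opponent -> that opponent's average goals
        conceded per match (assumed to be computed elsewhere).
    Returns the mean of goals_scored / avg_conceded[opponent], where 1.0
    means "scored exactly what this opponent typically concedes".
    Matches against opponents with a zero average are skipped.
    """
    ratios = [
        goals / avg_conceded[opponent]
        for opponent, goals in results
        if avg_conceded[opponent] > 0
    ]
    return sum(ratios) / len(ratios) if ratios else None
```

Under this scheme, scoring one goal against a stingy defense that concedes 0.5 per match counts for more (2.0) than scoring one against a leaky defense that concedes 2.0 per match (0.5).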

Furthermore, as teams change, these statistics become invalid. Last year's "Total Defensive Power" may not apply when the keeper retires and the team sweeper gets traded. So each year you have to redo these numbers. Perhaps you could apply a modifier against the calculated number to reflect the number of personnel changes, but that seems like a poor practice right on the surface.

Handling the Challenges
Looking at these two classes of statistics, a few other problems become apparent:
  1. Not all players play in the same position all the time. You may have a player who moves around into several different positions. The statistics need to be tailored to the positions the player is in.
  2. Not all players play the whole game. You may have a player who scores all of the goals in the first half and then needs to sit out the second. The statistics you generate should not be solely based on the overall outcome of the game, but zero in more tightly on the player themselves. Of course, if you have a player who routinely loses possession of the ball which leads to more losses, you should probably have a way to track that as well.
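
Both problems can be sketched with one small calculation: record each stretch a player spends in a position along with the minutes played, then scale to a per-90-minute rate for each position. Per-90 scaling is a normalization I'm assuming here for illustration, and the tuple layout and function name are hypothetical.

```python
from collections import defaultdict


def goals_per_90_by_position(stints):
    """Compute goals per 90 minutes, separately for each position played.

    stints: iterable of (position, minutes_played, goals) tuples, one per
        stretch of play in a single position.
    Returns a dict mapping position -> goals per 90 minutes. Positions
    with zero recorded minutes are left out.
    """
    minutes = defaultdict(float)
    goals = defaultdict(int)
    for position, mins, scored in stints:
        minutes[position] += mins
        goals[position] += scored
    return {
        position: goals[position] * 90 / minutes[position]
        for position in minutes
        if minutes[position] > 0
    }
```

A player who scores twice in a 45-minute half as a forward and then sits out gets credit for that half only, and the same player's minutes in the midfield are rated separately rather than blended in.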
Unfortunately, or maybe the better word is realistically, this effort is a first draft. These methods and algorithms will be honed over time. I think it's better to keep them simple this first go-round. The more complex they are, the more likely they are to contain the errors or biases we want to avoid. But as we apply the methods over time, we will start to see what needs to be tweaked where.

Alright, now I think we're ready to start talking actual algorithms.