Beyond Baseball

Saturday, December 16, 2006

The Park Factor Conundrum

In the world of sabermetrics, Park Factors are an important tool. For those who don't know, a Park Factor is a metric that estimates the effect a specific ballpark has on an event (run scoring, for example). Originally it was an easy metric to calculate. Using our runs scored example, we'd simply take the number of runs scored in a team's home games (including the opposition's runs) and divide it by the number of runs scored in the team's road games (again including the opposition). If the event occurred at a greater rate (more runs) at home than on the road, then the factor was greater than one. If the event occurred at a lesser rate, then the factor was less than one. Pretty simple and effective.
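To make the traditional calculation concrete, here is a minimal Python sketch. It assumes the baseline is the same team's road games and that runs are compared per game; the function name and the season totals are made up for illustration.

```python
def simple_park_factor(home_runs, home_games, road_runs, road_games):
    """Runs per game at home divided by runs per game on the road.

    Run totals include both the team and its opponents.
    (Hypothetical helper, not taken from any published method.)
    """
    home_rate = home_runs / home_games
    road_rate = road_runs / road_games
    return home_rate / road_rate


# Invented season: 810 total runs in 81 home games, 729 in 81 road games.
factor = simple_park_factor(810, 81, 729, 81)
print(round(factor, 2))  # 1.11
```

A value above one means more scoring at home than on the road; a value below one means less.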

That simple way of calculating a park's factor is no longer valid. The introduction of interleague play and unbalanced scheduling has made the old way useless. The reason is that the mix of opponents a team faces at home is no longer the same as the mix it faces on the road. For example, the Yankees do not see the same teams the same number of times in Yankee Stadium as they do away from the Bronx. This makes it very difficult to separate the effect of the park from the effect of the teams playing in it. It is my hope to provide an improved method that correctly calculates a specific park's factor.

The simplest possible model would be a league of only two teams. The teams' rosters don't change significantly from game to game, so their individual offensive abilities should be about the same from game to game. Therefore, comparing the combined run scoring of both teams in each park will give us a good measure of the park factors. To illustrate, let's form a theoretical league. We'll start out with just two teams. For simplicity, we'll play a 100-game schedule with 50 games at each park. If 250 runs were scored in Park A and 200 runs were scored in Park B, the average is 225 runs per park, and the park factors would be 1.11 (250/225=1.11) for Park A and 0.89 (200/225=0.89) for Park B. Pretty simple when the schedule is balanced.

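What if we unbalance the schedule? Let's play 40 games in Park A and 60 in Park B. If we assume that the teams score at the same rate in each park as before (5 runs/game in Park A and 4 runs/game in Park B), we should get 200 runs scored in Park A and 240 runs scored in Park B. If we apply the balanced-schedule calculation, the average is 220 runs per park and we get factors of 0.91 (200/220=0.91) for Park A and 1.09 (240/220=1.09) for Park B. Clearly this is wrong: Park A now looks like the lower-scoring park simply because fewer games were played there.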
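The arithmetic for the two-team league can be checked with a short Python sketch. The function name is hypothetical; it just divides each park's run total by the average total per park, which is the aggregate calculation described above.

```python
def aggregate_park_factors(runs_by_park):
    """Each park's total runs divided by the mean total across parks.

    Only meaningful when every park hosts the same number of games.
    """
    mean_runs = sum(runs_by_park.values()) / len(runs_by_park)
    return {park: runs / mean_runs for park, runs in runs_by_park.items()}


# Balanced schedule: 50 games in each park.
balanced = aggregate_park_factors({"A": 250, "B": 200})
print({park: round(f, 2) for park, f in balanced.items()})
# {'A': 1.11, 'B': 0.89}

# Unbalanced schedule: 40 games in Park A, 60 in Park B, same scoring rates.
unbalanced = aggregate_park_factors({"A": 200, "B": 240})
print({park: round(f, 2) for park, f in unbalanced.items()})
# {'A': 0.91, 'B': 1.09}
```

Nothing about the parks changed between the two calls, yet the factors flipped, because raw totals mix up scoring rate with the number of games played.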

One solution is to use per-game rates rather than aggregate totals. Regardless of how the games are distributed between the two parks, Park A will average 5 runs/game and Park B will average 4 runs/game, ceteris paribus. If we average the two rates with equal weight, we get 4.5 runs/game, with park factors of 1.11 (5/4.5=1.11) for Park A and 0.89 (4/4.5=0.89) for Park B. We're back to the correct answer!
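Here is a minimal Python sketch of the per-game version, run on both the 50/50 and the 40/60 schedules. The function name is hypothetical, and the parks' rates are averaged with equal weight, as in the example above.

```python
def rate_park_factors(runs_by_park, games_by_park):
    """Each park's runs per game divided by the unweighted mean rate across parks."""
    rates = {
        park: runs_by_park[park] / games_by_park[park]
        for park in runs_by_park
    }
    mean_rate = sum(rates.values()) / len(rates)
    return {park: rate / mean_rate for park, rate in rates.items()}


# 50 games in each park.
balanced = rate_park_factors({"A": 250, "B": 200}, {"A": 50, "B": 50})
# 40 games in Park A, 60 in Park B.
unbalanced = rate_park_factors({"A": 200, "B": 240}, {"A": 40, "B": 60})

print({park: round(f, 2) for park, f in balanced.items()})
# {'A': 1.11, 'B': 0.89}
print({park: round(f, 2) for park, f in unbalanced.items()})
# {'A': 1.11, 'B': 0.89}
```

Both schedules give the same factors, since dividing by games played removes the schedule from the comparison.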

If we apply this method to a larger league, we should be able to calculate park factors for every park. The most immediate concern and potential source of error is the confounding effect of the individual offensive capabilities of specific teams. I believe this can be controlled by determining how many "neutral park runs" would be scored for each matchup and then creating a simple balanced schedule for each team. Once that is done, we can compute new aggregate runs scored in a specific park and away from that park to determine the correct park factor. I will post later on this subject using raw data to see if this idea works in practice.
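As a rough illustration of the balanced-schedule idea, here is a Python sketch under some loud assumptions. It does not estimate "neutral park runs" at all; as a simpler stand-in, it computes runs per game for each opponent separately and then gives every opponent equal weight, at home and on the road. The function name, the input layout, and the three-team game log are all invented for the example.

```python
from collections import defaultdict
from statistics import mean


def balanced_park_factor(games, team):
    """Equal-weight-per-opponent park factor for `team`'s home park.

    games: iterable of (home_team, away_team, total_runs), one tuple per game.
    This is a stand-in for a true neutral-park adjustment, not the real thing.
    """
    home = defaultdict(list)  # opponent -> runs in each game at team's park
    road = defaultdict(list)  # opponent -> runs in each game at opponent's park
    for home_team, away_team, runs in games:
        if home_team == team:
            home[away_team].append(runs)
        elif away_team == team:
            road[home_team].append(runs)

    # Use only opponents faced both at home and on the road.
    opponents = sorted(set(home) & set(road))
    if not opponents:
        raise ValueError("no opponent was faced both at home and on the road")

    # Average the per-opponent rates so each opponent counts once.
    home_rate = mean(mean(home[opp]) for opp in opponents)
    road_rate = mean(mean(road[opp]) for opp in opponents)
    return home_rate / road_rate


# Invented, deliberately unbalanced game log for team "A".
games = (
    [("A", "B", 10)] * 3    # A hosts B three times
    + [("A", "C", 12)]      # A hosts C once
    + [("B", "A", 8)]       # A visits B once
    + [("C", "A", 12)] * 3  # A visits C three times
)

factor = balanced_park_factor(games, "A")
print(round(factor, 2))  # 1.1
```

On this toy log, plain home and road totals give 42 runs in 4 home games against 44 runs in 4 road games, a ratio below one, while the equal-weight version comes out above one. The road games are still played in parks with factors of their own, so this only addresses the schedule imbalance, not the full confounding problem.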
