Beyond Baseball

Saturday, December 16, 2006

The Park Factor Conundrum

In the world of sabermetrics, park factors are an important tool. For those who don't know, a park factor is a metric that estimates the effect a specific ballpark has on an event (run scoring, for example). Originally it was an easy metric to calculate. Using our runs scored example, we'd simply take the number of runs scored in a team's home games (including the opposition's runs) and divide it by the average number of runs scored. If the event occurred at a greater rate (more runs) at home than on the road, the factor was greater than one. If the event occurred at a lesser rate, the factor was less than one. Pretty simple and effective.

Now that simple way of calculating a park's factor isn't valid. The introduction of interleague play and unbalanced scheduling has made the old way useless. The reason is that the mix of opponents a team faces at home is no longer the same as the mix it faces on the road. For example, the Yankees do not see the same teams the same number of times in Yankee Stadium as they do away from the Bronx. This makes it very difficult to separate the effect of the parks from the effect of the teams playing the games. It is my hope to provide an improved method that correctly calculates a specific park's factor.

The simplest possible model would be a league of only 2 teams. The teams' rosters don't change significantly from game to game, so their individual offensive abilities should be about the same from game to game. Therefore, comparing the combined run scoring of both teams in each park should give us a good measure of the park factors. To illustrate, let's form a theoretical league. We'll start out with just two teams. For simplicity, we'll play a 100 game schedule with 50 games at each park. If 250 runs were scored in Park A and 200 runs were scored in Park B, the average is 225 runs per park, so the park factors would be 1.11 (250/225=1.11) for Park A and 0.89 (200/225=0.89) for Park B. Pretty simple when the schedule is balanced.
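The two-team arithmetic above can be sketched in a few lines of Python. This is only an illustration of the balanced-schedule calculation; the function name `aggregate_park_factors` and the dictionary layout are assumptions made for this example.

```python
def aggregate_park_factors(runs_by_park):
    """Divide each park's total runs by the average total across all parks.

    This only makes sense when every park hosts the same number of games.
    """
    average_runs = sum(runs_by_park.values()) / len(runs_by_park)
    return {park: runs / average_runs for park, runs in runs_by_park.items()}


# Balanced two-team league from the example: 50 games in each park.
factors = aggregate_park_factors({"A": 250, "B": 200})
for park, factor in factors.items():
    print(f"Park {park}: {factor:.2f}")  # Park A: 1.11, Park B: 0.89
```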

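What if we unbalance the schedule? Let's play 40 games in Park A and 60 in Park B. If we assume that the teams score at the same rate in each park as before (5 runs per game in Park A and 4 in Park B), we should get 200 runs scored in Park A and 240 runs scored in Park B. If we use the calculation for balanced schedules, the average is now 220 runs per park and we get factors of 0.91 (200/220=0.91) for Park A and 1.09 (240/220=1.09) for Park B. Clearly this is an error: Park A now looks like the lower-scoring park even though nothing about either park has changed.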
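To see the distortion directly, here is a small sketch that holds the per-game scoring rates fixed and changes only the number of games in each park. The helper names `totals_from_schedule` and `aggregate_factors` are hypothetical, chosen for this example.

```python
def totals_from_schedule(runs_per_game, games):
    """Total runs in each park given a scoring rate and a number of games."""
    return {park: runs_per_game[park] * games[park] for park in runs_per_game}


def aggregate_factors(totals):
    """The balanced-schedule method: park total divided by the average total."""
    average = sum(totals.values()) / len(totals)
    return {park: total / average for park, total in totals.items()}


# Scoring rates implied by the example: 250/50 and 200/50 runs per game.
RUNS_PER_GAME = {"A": 5.0, "B": 4.0}

balanced = aggregate_factors(
    totals_from_schedule(RUNS_PER_GAME, {"A": 50, "B": 50})
)
unbalanced = aggregate_factors(
    totals_from_schedule(RUNS_PER_GAME, {"A": 40, "B": 60})
)

print("balanced:  ", {p: round(f, 2) for p, f in balanced.items()})
print("unbalanced:", {p: round(f, 2) for p, f in unbalanced.items()})
```

The parks did not change between the two runs, yet the ranking of the two parks flips, which shows that the aggregate method is really measuring the schedule.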

One solution is to use the rate per game numbers rather than the aggregate statistics. Regardless of the distribution of games between the two parks, Park A will average 5 runs/game and Park B will average 4 runs/game, ceteris paribus. If we weight the two rates equally, we get an average of 4.5 runs/game, with park factors of 1.11 (5/4.5=1.11) for Park A and 0.89 (4/4.5=0.89) for Park B. We're back to the correct answer!
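A minimal sketch of the rate-based version follows. The name `rate_park_factors` and its two-dictionary input are assumptions for illustration; the only idea taken from the text is to convert to runs per game first and then weight each park equally.

```python
def rate_park_factors(runs_by_park, games_by_park):
    """Park factor from per-game scoring rates, each park weighted equally."""
    rates = {
        park: runs_by_park[park] / games_by_park[park] for park in runs_by_park
    }
    average_rate = sum(rates.values()) / len(rates)
    return {park: rate / average_rate for park, rate in rates.items()}


# Same two parks under the balanced and the unbalanced schedule.
balanced = rate_park_factors({"A": 250, "B": 200}, {"A": 50, "B": 50})
unbalanced = rate_park_factors({"A": 200, "B": 240}, {"A": 40, "B": 60})

print("balanced:  ", {p: round(f, 2) for p, f in balanced.items()})
print("unbalanced:", {p: round(f, 2) for p, f in unbalanced.items()})
```

Both schedules give 1.11 for Park A and 0.89 for Park B, because the number of games cancels out once everything is expressed per game.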

If we apply this method to a larger league, we should be able to calculate park factors for every park. The most immediate concern and potential source of error is the confounding effect of the individual offensive capabilities of specific teams. I believe this can be controlled by determining how many "neutral park runs" would be scored for each matchup and then creating a simple balanced schedule for each team. Once that is done, we can compute new aggregate totals for runs scored in a specific park and away from that park to determine the correct park factor. I will post later on this subject using raw data to see if this idea works in practice.
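One possible reading of the balanced-schedule idea can be sketched as follows. This is not the full method described above: it does not estimate "neutral park runs" at all, and the road rate still carries the effects of the other parks. It only shows the re-weighting step, where each opponent counts equally no matter how many games were actually played against it. The function name, the `(home_team, away_team, total_runs)` tuple layout, and the game log are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def balanced_park_factor(games, team):
    """Home vs. road scoring for `team`, weighting every opponent equally.

    games: iterable of (home_team, away_team, total_runs) tuples.
    """
    home_runs = defaultdict(list)  # opponent -> runs in games at team's park
    road_runs = defaultdict(list)  # opponent -> runs in games at opponent's park
    for home, away, runs in games:
        if home == team:
            home_runs[away].append(runs)
        elif away == team:
            road_runs[home].append(runs)

    # Only opponents faced both at home and on the road can be compared.
    opponents = sorted(set(home_runs) & set(road_runs))
    if not opponents:
        raise ValueError("no opponent was played both at home and on the road")

    home_rate = mean([mean(home_runs[opp]) for opp in opponents])
    road_rate = mean([mean(road_runs[opp]) for opp in opponents])
    return home_rate / road_rate


# Hypothetical, deliberately unbalanced log for team A.
sample_games = (
    [("A", "B", 6)] * 3 + [("A", "C", 4)]      # A's home games
    + [("B", "A", 5)] + [("C", "A", 3)] * 3    # A's road games
)
factor = balanced_park_factor(sample_games, "A")
print(f"balanced factor for A's park: {factor:.2f}")
```

In this made-up log, a plain runs-per-game comparison would give 5.5 at home against 3.5 on the road, because the high-scoring matchup with B is overrepresented at home. Weighting B and C equally gives 5 against 4 instead.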

Saturday, December 09, 2006

Independence of Runs Scored and Runs Allowed.

In the February 2006 issue of By the Numbers, the SABR statistical publication, Ray Ciccolella published an article titled "Are Runs Scored and Runs Allowed Independent?" In the article he referenced a separate article in the same issue written by Steven Miller. Miller's article provided a theoretical framework for the Pythagorean Theorem (the baseball one) and concluded that Runs Scored (RS) and Runs Allowed (RA) are independent. Ciccolella found that conclusion to be counter-intuitive because he thought "environmental factors" like the "ballpark, the weather conditions, and the home plate umpire" are the same for both teams in a game. Ciccolella then performed his own experiments to test the independence. He tried three methods.

Method 1 compared the actual margin of victory with a randomly generated margin of victory. The random RA values were drawn from the team's actual RA distribution. He ran this on 5 seasons' worth of data for each team. He found that the random margin of victory was larger than the actual one, and that it also resulted in fewer 1-run games (which makes sense if the margin of victory is larger). Although this seems to suggest that RS and RA are not independent, Ciccolella was surprised that the difference between the two wasn't that big.
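The kind of comparison described in Method 1 can be sketched roughly like this. The details of Ciccolella's procedure are not given here, so this is an assumed version: it re-pairs a team's runs scored with its own runs allowed by shuffling the RA column, then compares the average margin and the share of 1-run games. The function names and the tiny game log are invented for the example.

```python
import random


def margin_summary(games):
    """Return (mean absolute margin, share of one-run games).

    games: list of (runs_scored, runs_allowed) pairs.
    """
    margins = [abs(rs - ra) for rs, ra in games]
    mean_margin = sum(margins) / len(margins)
    one_run_share = sum(1 for m in margins if m == 1) / len(margins)
    return mean_margin, one_run_share


def randomly_paired(games, rng):
    """Keep each game's RS but pair it with an RA taken from a random game."""
    scored = [rs for rs, _ in games]
    allowed = [ra for _, ra in games]
    rng.shuffle(allowed)  # breaks any game-level link between RS and RA
    return list(zip(scored, allowed))


# Made-up game log, not real data.
actual_games = [(5, 4), (3, 2), (7, 6), (1, 0), (9, 8), (2, 3)]
rng = random.Random(2006)  # fixed seed so the run is repeatable
shuffled_games = randomly_paired(actual_games, rng)

print("actual:", margin_summary(actual_games))
print("random:", margin_summary(shuffled_games))
```

A single shuffle is noisy, so a real study would repeat it many times and average the results. Note also that random pairing can produce a margin of zero, which never happens in a completed game.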

Method 2 found the expected number of wins given that the team scored X runs. He found that when teams scored 0-2 or 6+ runs in a game, they won less often than expected. Again, this suggests that RS and RA are not independent.
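How the expected wins were computed isn't spelled out here, but one way to get an expectation under independence is to ask how often the team's runs allowed fall below X across all of its games. The sketch below does that, dropping games where RA equals X since a game can't end tied. Both the tie handling and the function name `expected_win_probability` are assumptions of this example, and the RA log is made up.

```python
def expected_win_probability(runs_scored, runs_allowed_log):
    """Chance of winning with `runs_scored` if RA is independent of RS.

    Games where RA equals `runs_scored` are excluded (assumed tie handling).
    """
    decided = [ra for ra in runs_allowed_log if ra != runs_scored]
    if not decided:
        raise ValueError("every game in the log would be a tie")
    wins = sum(1 for ra in decided if ra < runs_scored)
    return wins / len(decided)


# Hypothetical runs-allowed log for one team.
runs_allowed_log = [2, 3, 4, 5, 6, 4, 1, 7]

for x in (0, 4, 8):
    p = expected_win_probability(x, runs_allowed_log)
    print(f"scoring {x} runs: expected win probability {p:.2f}")
```

Comparing these expected values with the team's actual record in games where it scored X runs is the sort of check the method describes.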

Method 3 was similar to previous work on this question and produced similar results.

Ciccolella concluded that RS and RA couldn't be independent, but that their correlation was weaker than he expected. Not by coincidence, Miller reached a similar conclusion but took a different path to get there. Miller did say that RS and RA couldn't be strictly independent because there are no tie games in baseball. If one team scores 5 runs, the other team can't score 5 runs. Nevertheless, Miller concluded that RS and RA behave as if they were independent once you correct for ties.

Personally, I find that Miller is more correct on this issue than Ciccolella. By choosing RA randomly from the team's distribution and comparing it to its RS distribution, you bring in the "environmental factors" Ciccolella talked about at the beginning of his article. Individual ballparks alone have been shown to increase or decrease scoring (which is why sabermetricians have developed a park effect statistic). When you randomly combine these two distributions, you introduce the possibility of pairing an RS from Coors Field (lots of runs) with an RA from PETCO Park (not so many runs), or vice versa. Ciccolella would argue that this scenario proves that RS and RA are not independent of each other because they do depend somewhat on the environment they were scored or allowed in. However, I see it as Ciccolella proving that RS and RA are not independent of their environment, not necessarily that they are dependent on each other. The nature of baseball doesn't preclude you from scoring runs whether or not the other team is scoring. Perhaps this study needs to be redone with the "environmental factors" controlled for, and it may yield different results.