Beyond Baseball

Tuesday, August 21, 2007

Jenks v. Webb: Which Feat is more Impressive?

On August 20th, Bobby Jenks' streak of 41 consecutive batters retired came to an end. Jenks' streak tied Jim Barr for the Major League record. Currently, Arizona's Brandon Webb has pitched 42 consecutive scoreless innings. Both are tremendous accomplishments, but which is more impressive? To answer that question we'll look at the probability of each feat using the binomial distribution. For those who aren't as statistically inclined, the binomial distribution allows us to calculate the probability of getting a given number of successes over a specific number of trials, given the probability of success on each trial. Confused? Think of flipping a coin. If we designate flipping heads as a success, we can calculate the probability of getting X heads in Y flips without actually performing multiple sets of flips. On to the comparison!
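For readers who want to see the coin-flip idea in code, here is a minimal Python sketch of the binomial probability. The function name `binomial_pmf` and the choice of 10 flips are mine, purely for illustration.

```python
from math import comb


def binomial_pmf(successes: int, trials: int, p_success: float) -> float:
    """Probability of exactly `successes` successes in `trials` independent trials."""
    ways = comb(trials, successes)  # number of orderings that give this many successes
    return ways * p_success**successes * (1 - p_success) ** (trials - successes)


# Fair coin, heads counts as a success (illustrative numbers only).
five_heads = binomial_pmf(5, 10, 0.5)
no_heads = binomial_pmf(0, 10, 0.5)

print(f"P(exactly 5 heads in 10 flips) = {five_heads:.4f}")
print(f"P(no heads in 10 flips)        = {no_heads:.6f}")
```

Both streaks in this post are the "zero successes" case, where the formula collapses to (1 - p) raised to the number of trials.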

First we'll look at Jenks:

Jenks' streak came against Cleveland, Boston, Detroit, Toronto, New York (A), and Seattle. To determine Jenks' probability of retiring 41 consecutive batters, I used the teams' on-base percentage as a proxy for their ability to avoid an out (thus getting on base). Using a weighted average to adjust for the varying numbers of batters faced, the OBP of Jenks' opponents is .344. Setting that as the probability of success, I calculated the probability of running 41 trials (ABs) and getting zero successes (no baserunners). The result? 0.0000000302 (3.02e-8).

Now let's look at Webb:

Webb's current streak has come against the bats of Baltimore, New York (N), Washington, Houston, San Diego, Los Angeles, Colorado, and San Francisco. Again, we need the probability of success (in this case the probability that a team will score) and the event (an inning). To approximate this I averaged the teams' runs scored per game. After some grueling spreadsheet calculations, I got 4.5. Taking that number and dividing by 9, I get a proxy for the probability of scoring a run in an inning. The probability of zero successes in 42 trials? 0.000000000000233 (2.33e-13).

Both feats are, statistically speaking, next to impossible. Webb's is the more impressive of the two: it is about 100,000 times less likely than Jenks'.
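The arithmetic above can be roughly reproduced in a few lines of Python. This is a sketch using the rounded inputs quoted in the post (.344 OBP and 4.5 runs per game), so it will not match the spreadsheet to the last digit. The function name `prob_no_successes` is my own, and the calculation assumes every batter or inning is an independent trial with the same probability.

```python
def prob_no_successes(p_success: float, trials: int) -> float:
    """Binomial probability of zero successes in `trials` independent trials."""
    return (1.0 - p_success) ** trials


# Jenks: a "success" is a batter reaching base; 41 batters faced.
jenks = prob_no_successes(0.344, 41)

# Webb: a "success" is a run-scoring inning; 4.5 runs per game over 9 innings.
webb = prob_no_successes(4.5 / 9, 42)

print(f"Jenks: {jenks:.3g}")
print(f"Webb:  {webb:.3g}")
print(f"Jenks' streak is roughly {jenks / webb:,.0f} times more likely than Webb's")
```

With the rounded inputs, the results land within a few percent of the figures quoted above, and the ratio comes out above 100,000.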

Sunday, January 21, 2007

Park Factors Solved!

Park factors are one of many sabermetric tools. They allow sabermetricians to adjust a player's raw statistics to what they would be in a league average environment. Simple park factors can be calculated very easily. For example, to find the run park factor (or just run factor) of Team A we'd use the following formula:

((RShome+RAhome)/Ghome)
----------------------------
((RSroad+RAroad)/Groad)
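As a quick sketch, the formula above translates directly into Python. The season totals below are made up for illustration and do not belong to any real team.

```python
def run_factor(rs_home, ra_home, g_home, rs_road, ra_road, g_road):
    """Simple run factor: runs per game at home divided by runs per game on the road."""
    home_rate = (rs_home + ra_home) / g_home  # both teams' runs in home games
    road_rate = (rs_road + ra_road) / g_road  # both teams' runs in road games
    return home_rate / road_rate


# Hypothetical Team A: 400 RS / 380 RA in 81 home games, 360 RS / 350 RA in 81 road games.
team_a_factor = run_factor(400, 380, 81, 360, 350, 81)
print(f"Run factor: {team_a_factor:.3f}")
```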

The bigger the number, the "easier" it is to score runs in Team A's home park. Using this basic formula, park factors for other events (such as singles, doubles, triples, and home runs) can easily be determined. Unfortunately, this simple formula has become inadequate. John Shea, an economics professor at the University of Maryland, wrote a paper illustrating the problems this formula is ill-equipped to address. In the paper Shea outlines how the unbalanced schedule, interleague play, and differences in the overall quality of opponents introduce confounding effects. Under a balanced schedule teams play the same proportion of teams at home and on the road. That allows the formula to factor out those confounding effects. However, in today's baseball that is impossible because the denominator isn't the same for each team: the road schedule no longer mirrors the home schedule.

I believe that I have found a solution to this problem. My calculations are a bit more intensive, but I think they factor out the confounding effects that Shea outlined. To begin, I start with a basic formula:

E(Event)=LA*OF*DF*PF

Here the expected rate of an event (Runs Scored, H/BIP, etc.) is dependent on four factors. LA is the League Average, the baseline rate of that event. OF and DF are multipliers accounting for the offensive or defensive prowess of a specific team. PF is the park factor.

Looking at this formula we can see that if the teams are average and playing in an average park (OF, DF, and PF are all equal to 1) then we see the League Average rate. Factors greater (lesser) than one will increase (decrease) the observed rate. Within this framework we can isolate the PF variable to determine the appropriate park factor.

First, I determined the OF and DF multipliers for each team. Using game data from those great people at Retrosheet, I was able to calculate estimates by isolating a particular variable. If I look only at Twins home games (all played in the Metrodome), I find the OF and DF multipliers for all Twins opponents by finding the average rate of an event for each team (e.g. White Sox or Tigers) and dividing it by the average rate of all Twins opponents (thus controlling for the different number of games per opponent). I repeat this process for all 30 teams, which gives me an array of OF and DF multipliers for each team. Average those numbers and voila! The OF and DF factors are estimated.
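Here is one way the multiplier step might look in Python. It is a sketch under my own assumptions, not the exact spreadsheet: the "average rate of all opponents" is taken as the unweighted mean of each opponent's own average, the park names, team codes, and rates are invented, and the same function would be run once for offensive rates and once for defensive rates.

```python
from collections import defaultdict
from statistics import mean


def multipliers_in_park(rates_by_opponent):
    """Visiting teams' multipliers within one park.

    rates_by_opponent maps a visiting team to its per-game event rates in that park.
    """
    opponent_means = {team: mean(rates) for team, rates in rates_by_opponent.items()}
    baseline = mean(list(opponent_means.values()))  # average over opponents, not games
    return {team: rate / baseline for team, rate in opponent_means.items()}


def average_multipliers(parks):
    """Average each team's multiplier over every park it visited."""
    collected = defaultdict(list)
    for rates_by_opponent in parks.values():
        for team, mult in multipliers_in_park(rates_by_opponent).items():
            collected[team].append(mult)
    return {team: mean(mults) for team, mults in collected.items()}


# Invented rates for two visiting teams in two parks.
parks = {
    "Park A": {"CHW": [0.12, 0.10], "DET": [0.09]},
    "Park B": {"CHW": [0.11], "DET": [0.11, 0.09]},
}
team_multipliers = average_multipliers(parks)
print(team_multipliers)
```

Because each park's multipliers are scaled against that park's own opponents, the park effect cancels out within each park before the averaging step.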

Using the appropriate OF and DF for each game, I adjust the raw rates by dividing by OF*DF. The resultant rates are now functions of just LA and PF. Since LA remains a constant for all games, I simply divide the average adjusted rate for games in a specific park by the average adjusted rate for all Major League games. The result is a park factor!
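The adjustment step can be sketched in Python as follows. To check that the logic recovers a known answer, the toy games below are generated straight from E(Event)=LA*OF*DF*PF with made-up values for every factor. The dictionary keys and park names are my own, and the two parks host the same number of games so the all-games average stays at the league baseline.

```python
from collections import defaultdict
from statistics import mean


def park_factors(games):
    """games: list of dicts with keys 'park', 'rate', 'of', 'df'."""
    adjusted_by_park = defaultdict(list)
    for game in games:
        # Strip out the team effects, leaving only LA * PF.
        adjusted = game["rate"] / (game["of"] * game["df"])
        adjusted_by_park[game["park"]].append(adjusted)

    all_adjusted = [a for rates in adjusted_by_park.values() for a in rates]
    overall = mean(all_adjusted)  # average adjusted rate over every game
    return {park: mean(rates) / overall for park, rates in adjusted_by_park.items()}


# Toy data built from the model itself (all values hypothetical).
LEAGUE_AVERAGE = 0.10
TRUE_PF = {"Park A": 1.2, "Park B": 0.8}
MATCHUPS = [(1.10, 0.90), (0.95, 1.05)]  # (OF, DF) pairs

games = [
    {"park": park, "of": of, "df": df, "rate": LEAGUE_AVERAGE * of * df * pf}
    for park, pf in TRUE_PF.items()
    for of, df in MATCHUPS
]

estimated = park_factors(games)
print(estimated)
```

In this toy setup the estimated factors come back as the same 1.2 and 0.8 that were used to build the data. Real game data is noisy, so the recovery would only be approximate.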

I calculated park factors for singles, doubles, triples, and home runs on balls in play.

Team	1B/BIP	2B/BIP	3B/BIP	HR/BIP
ANA	0.94	1.02	0.98	0.95
ARZ	0.98	0.99	1.40	1.07
ATL	1.06	1.10	1.26	0.88
BAL	1.02	0.83	0.68	0.98
BOS	0.90	1.25	1.00	0.97
CHW	0.97	0.94	0.77	1.28
CHC	0.95	1.07	0.95	1.11
CIN	0.97	1.18	0.54	1.20
CLE	0.97	1.01	0.48	0.89
COL	1.15	0.97	1.10	1.07
DET	1.03	0.87	1.64	0.98
FLA	1.02	0.94	1.43	0.84
HOU	1.06	0.89	1.00	1.22
KC	1.01	1.09	1.03	0.75
LAD	0.92	1.00	0.54	1.09
MIL	0.90	1.00	0.90	1.16
MIN	1.03	0.95	0.86	0.96
NYY	1.08	0.89	0.90	1.03
NYM	1.04	0.94	0.76	0.90
OAK	0.98	1.08	0.68	0.84
PHI	1.02	1.06	1.31	1.17
PIT	1.04	1.17	1.09	0.92
SD	1.00	0.90	1.47	0.78
SF	1.00	0.88	1.21	0.91
SEA	1.06	0.98	0.71	0.81
STL	1.01	1.05	0.64	1.15
TB	1.01	0.91	1.31	0.95
TEX	0.96	1.02	1.54	1.17
TOR	1.01	1.06	0.84	1.16
WSH	0.91	1.00	0.98	0.81

Saturday, December 16, 2006

The Park Factor Conundrum

In the world of sabermetrics, Park Factors are an important tool. For those who don't know, a Park Factor is a metric that estimates the effect a specific ballpark has on an event (run scoring, for example). Originally it was an easy metric to calculate. Using our runs scored example, we'd simply take the number of runs scored per game in a team's home games (including the opposition's runs) and divide it by the number of runs scored per game in its road games. If the event occurred at a greater rate (more runs) at home than on the road, then the factor was greater than one. If the event occurred at a lesser rate, then the factor was less than one. Pretty simple and effective.

That simple way of calculating a park's factor is no longer valid. The introduction of interleague play and unbalanced scheduling has made the old way useless. The reason is that the mix of opponents a team faces in its home park is no longer the same as the mix it faces on the road. For example, the Yankees do not see the same teams the same number of times in Yankee Stadium as they do away from the Bronx. This makes it very difficult to separate the effect of the parks from the effect of the teams that are playing the games. My hope is to provide an improved method that correctly calculates a specific park's factor.

The simplest possible model would be a league of only 2 teams. The teams' rosters don't change significantly from game to game, so their individual offensive abilities should be about the same from game to game. Therefore, comparing the relative offensive levels of both teams for games in each park will give us a good measure of the park factors. To illustrate, let's form a theoretical league. We'll start out with just two teams. For simplicity, we'll play a 100-game schedule with 50 games at each park. If 250 runs were scored in Park A and 200 runs were scored in Park B, the park factors would be 1.11 (250/225=1.11) for Park A and 0.89 (200/225=0.89) for Park B. Pretty simple when the schedule is balanced.

What if we unbalance the schedule? Let's play 40 games in Park A and 60 in Park B. If we assume that the teams score at the same rate in each park as before, we should get 200 runs scored in Park A and 240 runs scored in Park B. If we use the calculation for balanced schedules, we get factors of 0.91 for Park A and 1.09 for Park B. Clearly an error has been made: the hitter-friendly park now looks like the pitcher-friendly one.

One solution is to use the rate per game numbers rather than the aggregate statistics. Regardless of the distribution of games between the two teams in each park, Park A will average 5 runs/game and Park B will average 4 runs/game, ceteris paribus. If we look at the rates and weight the two parks evenly, we'd see an average of 4.5 runs/game with park factors of 1.11 (5/4.5=1.11) for Park A and 0.89 (4/4.5=0.89) for Park B. We're back to the correct answer!
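The two-team example is easy to check in Python. This sketch uses the numbers from the unbalanced schedule above, and the function names are mine.

```python
def factors_from_totals(runs):
    """Old method: each park's total runs over the average total per park."""
    average_total = sum(runs.values()) / len(runs)
    return {park: total / average_total for park, total in runs.items()}


def factors_from_rates(runs, games):
    """Rate method: each park's runs per game over the average runs per game."""
    rates = {park: runs[park] / games[park] for park in runs}
    average_rate = sum(rates.values()) / len(rates)
    return {park: rate / average_rate for park, rate in rates.items()}


# Unbalanced schedule: 40 games in Park A, 60 in Park B.
runs = {"Park A": 200, "Park B": 240}
games = {"Park A": 40, "Park B": 60}

by_totals = factors_from_totals(runs)
by_rates = factors_from_rates(runs, games)

print({park: round(f, 2) for park, f in by_totals.items()})
print({park: round(f, 2) for park, f in by_rates.items()})
```

The totals method gives the misleading 0.91 and 1.09, while the rate method returns 1.11 and 0.89 no matter how the 100 games are split.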

If we apply this method to a larger league, we should be able to calculate park factors for every park. The most immediate concern and potential source of error is the confounding effect of the individual offensive capabilities of specific teams. I believe this can be controlled by determining how many "neutral park runs" would be scored for each matchup and then creating a simple balanced schedule for each team. Once that is done, we can compute new aggregate runs scored in a specific park and away from that park to determine the correct park factor. I will post later on this subject using raw data to see if this idea works in practice.

Saturday, December 09, 2006

Independence of Runs Scored and Runs Allowed.

In the February 2006 issue of By the Numbers, the SABR statistical publication, Ray Ciccolella published an article titled "Are Runs Scored and Runs Allowed Independent?" In the article he referenced a separate article in the same issue written by Steven Miller. Miller's article provided a theoretical framework for the Pythagorean Theorem (the baseball one) and concluded that Runs Scored (RS) and Runs Allowed (RA) are independent. Ciccolella found that conclusion to be counter-intuitive because he thought "environmental factors" like the "ballpark, the weather conditions, and the home plate umpire" are the same for each team. Ciccolella then performed his own experiments to determine the independence. He tried three methods.

Method 1 compared the actual margin of victory with a randomly generated margin of victory and looked at the difference. The distribution for RA was the team's actual distribution for RA. He did this with 5 seasons' worth of data for each team. He found that the random margin of victory was larger than the actual one, and it also resulted in fewer 1-run games (which makes sense if your margin of victory is larger). Although this seems to suggest that RS and RA are not independent, Ciccolella was surprised that the difference between the two wasn't that big.

Method 2 found the expected number of wins given that the team scored X runs. He found that when teams score 0-2 or 6+ runs per game, the team won less often than expected. Again this suggests that RS and RA are not independent.

Method 3 was a similar experiment to previous work regarding this question and ended with similar results.
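To make the idea behind Method 1 concrete, here is a small Python sketch. It is not Ciccolella's actual procedure. Instead of drawing RA at random, it pairs every RS with every RA, which gives the average margin a random draw would produce in the long run. The six games are invented and deliberately built so that high-scoring games are high-scoring for both teams. Tied pairings are left in, which a real study would have to deal with.

```python
from itertools import product
from statistics import mean

# Invented (RS, RA) results for one team; not real data.
games = [(5, 4), (2, 3), (7, 6), (1, 2), (4, 3), (8, 7)]

runs_scored = [rs for rs, _ in games]
runs_allowed = [ra for _, ra in games]

# Actual margin: RS and RA come from the same game.
actual_margin = mean(abs(rs - ra) for rs, ra in games)

# "Random" margin: each RS matched against every RA in the team's distribution.
random_margin = mean(abs(rs - ra) for rs, ra in product(runs_scored, runs_allowed))

print(f"Actual average margin: {actual_margin:.2f}")
print(f"Random average margin: {random_margin:.2f}")
```

In this made-up sample the random margin comes out larger than the actual one, which is the direction Ciccolella reported.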

Ciccolella concluded that RS and RA couldn't be independent but that their correlation was weaker than he expected. Not by coincidence, Miller reached similar conclusions but took a different path to get there. Miller did say that RS and RA couldn't be independent because there are no tie games in baseball. If one team scores 5 runs, the other team can't score 5 runs. Nevertheless, Miller concluded that RS and RA behave as if they were independent once you correct for ties.

Personally, I find that Miller is more correct on this issue than Ciccolella. By choosing RA randomly from the team's distribution and comparing it to its RS distribution, you bring in the "environmental factors" Ciccolella talked about in the beginning of his article. Individual ballparks alone have been shown to increase or decrease scoring (which is why sabermetricians have developed a park effect statistic). When you randomly combine these two distributions you introduce the possibility of an RS in Coors Field (lots of runs) and an RA in PETCO Park (not so many runs) or vice versa. Ciccolella would argue that this scenario proves that RS and RA are not independent of each other because they do depend somewhat on the environment they were scored/allowed in. However, I see it as Ciccolella proving that RS and RA are not independent of their environment, not that they are necessarily dependent on each other. The nature of baseball doesn't preclude you from scoring runs whether the other team is scoring or not. Perhaps this study needs to be redone with the "environmental factors" controlled for, and it may yield different results.

Sunday, October 08, 2006

The Starting Lineup

I've got some work to do before I'll be able to post some "real" articles. In the meantime you can salivate at the thought of what's coming next:

I'm going to test the Pythagorean Win Expectancy for the 2006 season. I'd like to expand it for other seasons as well, but I don't have the data...yet. I'll update it whenever.
Team Reviews for the 2006 season.
Team Previews for the 2007 season. (I'll probably do that sometime in April. Season previews make more sense that way.)

I've also got something special on the way. I need to do a bit more research on the subject, but I promise, it'll be really cool. Maybe just in time for the 2007 previews. *wink wink* *nudge nudge*

Saturday, October 07, 2006

Beyond Baseball: The Beginning

Welcome to Beyond Baseball. If you're wondering what you're reading about or what this blog is about, maybe you should hit the back button.

Wait!

I have a better idea. Keep reading. Why? Because there may be cookies and other sugary treats for you at the end of this article. And who wouldn't want to read something for the chance at sugary treats?

Anyway, since this is an introductory article, I'll tell you what Beyond Baseball is about. I'm a baseball fan and a new student of sabermetrics. I'm starting this blog to publish my sabermetric studies and insights about baseball. Beyond Baseball isn't going to be just another sabermetric blog.

Well...it is...but not really...

Most baseball blogs deal with the Major Leagues. A great majority of the sabermetric community studies Major League Baseball. Beyond Baseball is going to be...well...beyond baseball. I'll be focusing my studies on independent baseball, primarily the Northern League. Perhaps later on, Beyond Baseball will expand to cover other independent baseball leagues, but for now that is beyond what I can do. You've read this far, and I know I can't hold your attention much longer because you're thinking about those cookies. So get up, grab a cookie and taste that sweet, sugary reward. Mmm....sugary.