Saturday, February 16, 2008

Equivalent Average Unmasked

Runs Batted In was created in the late 1800s. A few teams created the statistic to show how good they were. In fact, some sportswriters of the day recognized its inherent bias towards hitters in the middle of the order and disregarded it. The little guys with pointy hats and horse-drawn carriages knew what they were talking about. RBI would not surface as a widely accepted statistic until after the dead ball era was over. Eventually it became THE way to grade an offensive player's "production." We all know why it's a bad statistic.

Batting average has its flaws as well. If you go out on the street and ask someone what batting average is, they will respond with something like this: How often a player gets a hit. Wrong. Batting average does not tell us how often a player gets a hit. It tells us how often a player gets a hit after we decide to throw out some of the times he goes up to the plate, for no reason other than that we feel like it. It also fails to tell us what type of hit the player got. A single is not worth the same as a double. This is why we use on base average and slugging average. Then again, is slugging average really any better? Well, yes and no. It tells you the type of hit, but it still has the first problem of batting average. We're partitioning the times the player comes up to bat and excluding some for inherently biased reasons. Is on base average any better? It fixes the first problem, but fails to solve the second problem of batting average. It acknowledges all plate appearances, but it makes a walk and a home run equal.
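To make the two problems concrete, here is a minimal Python sketch of the three rate stats. It uses the standard textbook definitions, which this post does not spell out, so the function names and the stat line are hypothetical and only there for illustration.

```python
def batting_average(hits, ab):
    # Hits per at-bat: walks, HBP and sacrifices are thrown out of the denominator.
    return hits / ab

def on_base_average(hits, bb, hbp, ab, sf):
    # Times on base per (nearly every) plate appearance; a walk counts the same as a homer.
    return (hits + bb + hbp) / (ab + bb + hbp + sf)

def slugging_average(singles, doubles, triples, hr, ab):
    # Total bases per at-bat: knows the type of hit, but still uses the AB denominator.
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab

# Hypothetical season line (made-up numbers, not a real player).
singles, doubles, triples, hr = 100, 30, 5, 20
ab, bb, hbp, sf = 500, 60, 5, 5
hits = singles + doubles + triples + hr

avg = batting_average(hits, ab)
oba = on_base_average(hits, bb, hbp, ab, sf)
slg = slugging_average(singles, doubles, triples, hr, ab)
print(f"AVG {avg:.3f}  OBA {oba:.3f}  SLG {slg:.3f}")
```

Swapping one of the walks for a home run leaves the numerator and denominator of OBA unchanged, which is exactly the second problem described above.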

We can sum on base average and slugging average for OPS, but then again, who says that relationship is any better? Instead we can try to develop a system that solves both problems. Enter equivalent average. This post is going to describe anything and everything about EqA so you can come up with the exact EqAs Baseball Prospectus comes up with. One of the criticisms of EqA is that BP develops it in a black box. No one knows how they arrive at it. They do spell out the method here. You can do all the things they do. If you follow it, you'll find that your league leaders in EqA come out around .300, while BP's EqA leaders are generally around .350 or so. You can play around with the stuff in that article for days and never come up with anything remotely close to their EqA. Sorry. As TangoTiger put it: Opening up the black box will not cause a single dent on [Baseball Prospectus's] bottom line.

What I am going to tell you is everything Baseball Prospectus does and why they do it. It's rather simple. In fact, it comes down to the very thing people say mathematicians criticize sabermetricians for: Units. People who dislike sabermetrics generally say real mathematicians would hate their "work" because they shed units completely. This really isn't true. Everything in EqA is measured in relatively precise units that in the end cancel out, leaving an answer in runs.

Now let's go on and attack the two major problems with oba, slg, and avg. We need to create some sort of rate statistic that includes getting on base and hitting for extra bases as well as stealing a base efficiently. The first thing that is calculated answers all of these problems in what they feel is the best way. We'll call this Raw:

Raw = (SF + SH + 1.5*BB + 1.5*HBP + 1.5*SB + 2*1B + 3*2B + 4*3B + 5*HR)/(SF+SH+BB+HBP+SB+CS+AB)
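The Raw formula above can be sketched in Python as follows. The function name, argument names and the stat line are hypothetical; only the weights and the denominator come from the equation itself.

```python
def raw_eqa(sf, sh, bb, hbp, sb, cs, singles, doubles, triples, hr, ab):
    # Numerator: "scaled bases", using the weights from the Raw equation above.
    scaled_bases = (sf + sh
                    + 1.5 * (bb + hbp + sb)
                    + 2 * singles + 3 * doubles + 4 * triples + 5 * hr)
    # Denominator: opportunities, as written in the Raw equation above.
    opportunities = sf + sh + bb + hbp + sb + cs + ab
    return scaled_bases / opportunities

# Hypothetical season line (made-up numbers, not a real player).
raw = raw_eqa(sf=5, sh=0, bb=60, hbp=5, sb=10, cs=4,
              singles=100, doubles=30, triples=5, hr=20, ab=500)
print(round(raw, 4))
```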

What is Raw measuring? It's essentially scaled bases per opportunity to move up a base. Intuitively, the idea that walks are worth more than sacs, but not quite as much as singles, is good. Raw EqA addresses our two problems effectively, and adds in SB and CS, whose absence can be described as a third problem with each of oba, slg and avg. So in the end what does Raw measure? Scaled Bases per PA+CS. It gives a numeric value of production. Now we can use Raw and convert it to runs. For a team we do this with this equation:

EqR = (Raw/LgRaw)^2 * PA * LgR/LgPA

So what is EqR doing? It takes the team's production relative to what an average team does, squares it, and multiplies it by the team's PA and by the runs per PA an average team scores. The squared term is based on the idea that the relationship between Raw/LgRaw and runs is not linear. This makes sense because when you add good hitters, your other good hitters get more guys on base and each of their hits causes more runs. Now, since we're looking at EqR on a team level and we want it on the player level, let's look at that.
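A quick sketch of the team equation in Python, with hypothetical league totals (a 16-team league, 6200 PA per team), shows the effect of the squared term: a team 10% better than league in Raw comes out about 21% better in runs.

```python
def team_eqr(raw, lg_raw, pa, lg_r, lg_pa):
    # (Raw/LgRaw)^2 * PA * LgR/LgPA, as in the team equation above.
    return (raw / lg_raw) ** 2 * pa * lg_r / lg_pa

# Hypothetical league: all numbers are made up for illustration.
LG_RAW, LG_R, LG_PA, TEAM_PA = 0.75, 12000, 99200, 6200

average_team = team_eqr(LG_RAW, LG_RAW, TEAM_PA, LG_R, LG_PA)
good_team = team_eqr(1.10 * LG_RAW, LG_RAW, TEAM_PA, LG_R, LG_PA)
print(round(average_team, 1), round(good_team, 1))
```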

First, an assumption: The player in question is being analyzed on an average team in his home park. This assumption is needed to derive the equation most people see for EqR. Now, to look at the change in EqR for some change in Raw, take the derivative of EqR with respect to the ratio Raw/LgRaw. We get this equation:

dEqR = 2*Raw/LgRaw*PA*LgR/LgPA

Now we're adding some guy to this team, but a team only has nine slots it can play. So what are we doing? We're replacing an average player on this team and adding this player's production. So basically we have our runs minus an average player's runs in the same PA. We're NOT measuring runs over an average player. We're measuring all of the runs created by a player. So our equation becomes:

dEqR = 2*Raw/LgRaw*PA*LgR/LgPA - PA*LgR/LgPA

Now we can factor out PA*LgR/LgPA resulting in the equation for EqR for a player you'll see at BP, only they drop the dEqR and call it EqR.

EqR = (2*Raw/LgRaw - 1) * PA * LgR/LgPA
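One way to read the player equation: 2x - 1 is the tangent line to x^2 at x = 1, so the player version is the team curve straightened out around league average. Here is a minimal Python sketch of that comparison; the function names and league numbers are hypothetical.

```python
def player_eqr(raw, lg_raw, pa, lg_r, lg_pa):
    # (2*Raw/LgRaw - 1) * PA * LgR/LgPA, as in the player equation above.
    return (2 * raw / lg_raw - 1) * pa * lg_r / lg_pa

def squared_eqr(raw, lg_raw, pa, lg_r, lg_pa):
    # The team-level form, for comparison only.
    return (raw / lg_raw) ** 2 * pa * lg_r / lg_pa

# Hypothetical inputs: 600 PA in a league scoring 0.12 runs per PA.
LG_RAW, PA, LG_R, LG_PA = 0.75, 600, 12000, 100000

for ratio in (0.8, 1.0, 1.05, 1.2):
    linear = player_eqr(ratio * LG_RAW, LG_RAW, PA, LG_R, LG_PA)
    squared = squared_eqr(ratio * LG_RAW, LG_RAW, PA, LG_R, LG_PA)
    print(f"ratio {ratio:.2f}: player form {linear:6.1f}, squared form {squared:6.1f}")
```

The two forms agree exactly for an average hitter and drift apart the further the hitter is from league average.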

Generally people look at that and say, what the heck are they doing? Now you know why you're subtracting 1 and multiplying the ratio by two. Here is where we can multiply this by our park factor to normalize for parks, if desired. Now we want to scale EqR to some rate statistic. What should we use? Outs, of course. Why? Outs are the stopclock in baseball. We have 9 sets of 3 outs. We can bat as long as we want as long as we don't make those outs. So we decide to make our rate something close to runs per out used. So then we get this equation, which you can find at BP, albeit not in the article I linked to regarding how to compute EqA (lol).

EqA = (EqR/Out/5)^.4

First let's analyze the "units". We have runs divided by outs, which is what we wanted. Pay no attention to the .4 right now. The thing that should cross your mind is what crosses everyone's mind: Why the hell do they divide by five? WHY? This is where everyone gets lost. In fact, if you follow the calculations done in this thread and divide by five you won't get the EqA BP computes. This is the black box, so to speak. Remember, average EqA is supposed to be .260. If you plug all this in you'll get the league average to be about .266 or so, depending on the season. IT DOESN'T WORK. The 5 is really a constant, close to five, that forces the average to be equal to .260. How do we do that?

Well, the league average is going to be (LgR/LgOut/C)^.4. Since we want to "force" EqA to be equal to .260 for an average player, simply set that equation equal to .260 and solve for C. So C = (LgR/LgOut)/.260^2.5. This number tends to be around 5, ranging anywhere from 4.6 (Japan Central League) to about 5.6 (2007 AL). The 2007 National League was about 5.2.
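Solving for C and plugging it back in can be sketched like this. The league totals are hypothetical round numbers (not any real league's), and the function names are mine, not BP's.

```python
def league_constant(lg_r, lg_out, target=0.260):
    # Solve (LgR/LgOut/C)^.4 = target for C.
    return (lg_r / lg_out) / target ** 2.5

def eqa(eqr, outs, c):
    # EqA = (EqR/Out/C)^.4, with the league-specific C in place of the literal 5.
    return (eqr / outs / c) ** 0.4

# Hypothetical league totals, made up for illustration.
LG_R, LG_OUT = 12000, 66000

C = league_constant(LG_R, LG_OUT)
print(round(C, 2))                        # somewhere around 5
print(round(eqa(LG_R, LG_OUT, C), 3))     # league average is forced to .260
print(round(eqa(LG_R, LG_OUT, 5.0), 3))   # with a literal 5 it drifts off .260
```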

And there, with the above information you can get the exact answers that BP gets for EqA and puts on their player cards. In fact, if you want to you can find out the park factors to extra digits. I've gotten to the point where the average "error" on the EqA I come up with is .000226 compared to theirs. Remember that their EqA is the ring of integers divided by 1000. In other words: It's rounded to three digits. Theoretically, the average error from rounding alone will be .00025, which is actually greater than the error I come up with.
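The .00025 figure can be checked with a short sketch. It assumes the unrounded EqA is equally likely to fall anywhere between two adjacent thousandths, which is my assumption for the illustration and not something BP states.

```python
# Distance, in millionths, from an unrounded value to the nearest thousandth.
# Under the uniform assumption every offset from -500 to +499 is equally likely.
offsets = range(-500, 500)

mean_abs_error = sum(abs(k) for k in offsets) / len(offsets) / 1_000_000
print(mean_abs_error)
```

So an average miss of .000226 is in the range you would expect from the rounding on the published numbers alone.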

So there you have it. EqA perfectly. Now go look up EqR on BP and you'll see this:

EqR = 5*Out*EqA^2.5

Oh, and 1/2.5 = .4, so solving that equation for EqA gives us EqA = (EqR/Out/5)^.4. Look familiar? Oh, but now we're all smart enough to realize that the five isn't five.

And yes, in case you noticed, LgR gets canceled out. If you plug in everything you get:

EqA = ((2*Raw/LgRaw - 1) * PA * LgR/LgPA / Out * LgOut/LgR * .26^2.5)^.4
EqA = ((2*Raw/LgRaw - 1) * PA/Out * LgOut/LgPA * .26^2.5)^.4

Whenever you want to scale EqA to some league average production based on runs, it's going to cancel out... which of course makes sense.
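The cancellation is easy to confirm numerically. This sketch runs the full chain, EqR first and then EqA = (EqR/Out/C)^.4 with C = (LgR/LgOut)/.26^2.5, for two very different league run totals, and compares it with the form that has no LgR in it at all. All inputs and function names are hypothetical.

```python
def eqa_full(raw, lg_raw, pa, outs, lg_pa, lg_out, lg_r):
    # Step by step: player EqR, then the league constant C, then EqA.
    eqr = (2 * raw / lg_raw - 1) * pa * lg_r / lg_pa
    c = (lg_r / lg_out) / 0.26 ** 2.5
    return (eqr / outs / c) ** 0.4

def eqa_simplified(raw, lg_raw, pa, outs, lg_pa, lg_out):
    # Same thing after LgR cancels.
    return ((2 * raw / lg_raw - 1) * pa / outs * lg_out / lg_pa * 0.26 ** 2.5) ** 0.4

# Hypothetical player and league, made up for illustration.
player = dict(raw=0.90, lg_raw=0.75, pa=650, outs=400, lg_pa=100000, lg_out=66000)

low_scoring = eqa_full(lg_r=10000, **player)
high_scoring = eqa_full(lg_r=14000, **player)
no_lg_r = eqa_simplified(**player)
print(round(low_scoring, 3), round(high_scoring, 3), round(no_lg_r, 3))
```

All three numbers come out the same, whatever the league scored.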

Sunday, February 10, 2008

Building A Projection System

Rk   Name         Pos  Actual  PECOTA  Mine   Err (PECOTA)  Err (Mine)
1.   Rodriguez    3b   .340    .319    .310   .021          .030
2.   Ramirez      ss   .315    .277    .302   .038          .013
3.   Renteria     ss   .297    .262    .259   .035          .038
4.   Rollins      ss   .290    .274    .272   .016          .018
5.   Jeter        ss   .285    .305    .277   .020          .008
6.   Guillen      ss   .283    .306    .291   .023          .008
7.   Reyes        ss   .278    .276    .266   .002          .012
8.   Tejada       ss   .271    .296    .289   .025          .018
9.   Young        ss   .270    .286    .275   .016          .005
10.  Wilson       ss   .269    .247    .246   .022          .023
11.  Eckstein     ss   .266    .247    .257   .019          .009
12.  Greene       ss   .263    .272    .266   .009          .003
13.  Hardy        ss   .261    .254    .241   .007          .020
14.  Sea Bass     ss   .260    .247    .233   .013          .027
15.  Cabrera      ss   .260    .260    .253   .000          .007
16.  Peralta      ss   .259    .281    .265   .022          .006
17.  Loretta      ss   .254    .252    .250   .002          .004
18.  Bartlett     ss   .253    .269    .252   .016          .001
19.  Betancourt   ss   .248    .251    .244   .003          .004
20.  Scutaro      ss   .246    .253    .246   .007          .000
21.  Furcal       ss   .244    .268    .278   .024          .034
22.  Lopez        ss   .239    .272    .268   .033          .029
23.  Drew         ss   .236    .276    .289   .040          .053
24.  Durham       2b   .227    .295    .269   .068          .042
25.  Lugo         ss   .225    .269    .261   .044          .036
26.  Uribe        ss   .222    .263    .228   .041          .006
27.  Vizquel      ss   .221    .264    .242   .043          .021
28.  Crosby       ss   .219    .265    .247   .046          .028
29.  McDonald     ss   .211    .215    .215   .004          .004
30.  Izturis      ss   .210    .234    .221   .024          .011
     Average      ss   .257    .269    .260   .023          .017

As I sit here working on a simple projection system to evaluate translations from Japan to the United States, I beta tested one of the simple methods I came up with. The method is based on Marcel and I was only trying to project Equivalent Average. I looked at most middle infielders from the 1990s and developed a simplistic general age curve for all of them. I fitted that using a weighted-season process similar to the one Marcel uses. I then looked at the set of 2007 SSs with a large number of PAs and compared the projections versus the actual results for PECOTA and the simplistic method I came up with. Surprisingly, the method I devised was more accurate. Weird. In case you're interested, the results are in the table above.
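The comparison boils down to a mean absolute error over the 30 players. A minimal sketch that recomputes it from the table's actual and projected EqA columns (values in thousandths, so .340 is 340; the variable and function names are just for illustration):

```python
# (actual, PECOTA, mine) for the 30 players, in table order, in thousandths.
ROWS = [
    (340, 319, 310), (315, 277, 302), (297, 262, 259), (290, 274, 272),
    (285, 305, 277), (283, 306, 291), (278, 276, 266), (271, 296, 289),
    (270, 286, 275), (269, 247, 246), (266, 247, 257), (263, 272, 266),
    (261, 254, 241), (260, 247, 233), (260, 260, 253), (259, 281, 265),
    (254, 252, 250), (253, 269, 252), (248, 251, 244), (246, 253, 246),
    (244, 268, 278), (239, 272, 268), (236, 276, 289), (227, 295, 269),
    (225, 269, 261), (222, 263, 228), (221, 264, 242), (219, 265, 247),
    (211, 215, 215), (210, 234, 221),
]

def mean_abs_error(rows, column):
    # Average absolute miss between the actual EqA (column 0) and a projection.
    return sum(abs(row[0] - row[column]) for row in rows) / len(rows) / 1000

pecota_mae = mean_abs_error(ROWS, 1)
my_mae = mean_abs_error(ROWS, 2)
print(f"PECOTA: {pecota_mae:.3f}  Mine: {my_mae:.3f}")  # PECOTA: 0.023  Mine: 0.017
```

These match the Average row of the table. Keep in mind this is one position in one season, so it is a small sample.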

Saturday, February 09, 2008

Shortstop Rankings

Rk.  Name               Pos  R    HR  RBI  SB  AVG
1.   Hanley Ramirez     ss   112  22  78   41  .309
2.   Jose Reyes         ss   111  14  66   63  .288
3.   Jimmy Rollins      ss   110  21  76   31  .286
4.   Troy Tulowitzki    ss   99   22  92   9   .286
5.   Derek Jeter        ss   102  12  75   15  .306
6.   Carlos Guillen     ss   86   17  83   12  .295
7.   Rafael Furcal      ss   99   10  56   28  .281
8.   Miguel Tejada      ss   77   19  88   4   .298
9.   Michael Young      ss   81   12  80   9   .299
10.  Jhonny Peralta     ss   92   21  84   4   .272
11.  Yunel Escobar      ss   85   8   70   12  .297
12.  JJ Hardy           ss   85   23  84   3   .270
13.  Orlando Cabrera    ss   88   9   70   18  .273
14.  Stephen Drew       ss   77   18  78   8   .264
15.  Edgar Renteria     ss   82   10  61   10  .287
16.  Khalil Greene      ss   74   22  82   5   .252
17.  Julio Lugo         ss   75   7   56   26  .269
18.  Brendan Harris     ss   77   14  74   5   .270
19.  Asdrubal Cabrera   ss   78   9   57   18  .266
20.  David Eckstein     ss   82   4   49   10  .280
     Average            ss   93   16  77   18  .287
     Replacement Level  ss   75   9   58   12  .272

Who does not love fantasy baseball? This is the first entry in a series of posts that will rank fantasy players based on their projections for the 2008 season. The projection systems used to come up with a player's projections are PECOTA, Bill James' and ZiPS. An estimate was made based on depth charts to see how many plate appearances can be expected from each player, injury likelihood included. This cuts out projections where the PT is low because of flukish injuries, like Derrek Lee's. Shortstop happens to be the position that has the least depth this year, but it's also quite top heavy. Three shortstops are going in the first round. They have elite status and are top 12 players.

I have several strategies I like to employ at short. The two guys I like to target are Derek Jeter and Stephen Drew. When I draft Jeter I usually do it for his batting average. Drafting his average allows you to invest in guys who are good power hitters but do not hit for a great average. The four guys that immediately come to mind are Ryan Howard, Adam Dunn, Josh Fields and Chris Young. Jeter also adds some runs and some steals on the side. Drew's a bit of the opposite. He has nice power upside, although the projections are not the greatest in the world. He probably won't hit higher than .290, but he's a good gamble late.

I don't advise investing a top five pick on Jose Reyes. I have him rated #10 overall and I just don't see the point of investing in a two category player in the first round. You're limiting yourself way too much.

Reboot #2

Let's see, last time I decided to promise my zero readers that I was going to reboot this blog and make more posts, it lasted all of one post after that one. Maybe this time it will last longer. I would not bet on it, but it is worth a try. We'll see...