Shades of Wrigley: February 2008

Saturday, February 16, 2008

Equivalent Average Unmasked

Runs Batted In was created in the late 1800s. A few teams created the statistic to show how good they were. In fact, some sportswriters of the day recognized its inherent bias towards hitters in the middle of the order and disregarded it. The little guys with pointy hats and horse-drawn carriages knew what they were talking about. RBI would not surface as a widely accepted statistic until after the dead ball era was over. Eventually it became THE way to grade an offensive player's "production." We all know why it's a bad statistic.

Batting average has its flaws as well. If you go out on the street and ask someone what batting average is, they will respond with something like this: how often a player gets a hit. Wrong. Batting average does not tell us how often a player gets a hit. It tells us how often a player gets a hit after we decide to throw out some of the times he goes up to the plate, for no reason other than that we feel like it. It also fails to tell us what type of hit the player got. A single is not worth the same as a double. This is why we use on base average and slugging average. Then again, is slugging average really any better? Well, yes and no. It tells you the type of hit, but it still has the first problem of batting average: we're partitioning the times the player comes up to bat and excluding some for inherently biased reasons. Is on base average any better? It fixes the first problem, but fails to solve the second problem of batting average. It acknowledges all plate appearances, but it makes a walk and a home run equal.
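To make the two problems concrete, here is a minimal sketch using the standard definitions of AVG, SLG and OBA. The season line is invented purely for illustration and does not belong to any real player.

```python
# Hypothetical season line, made up for illustration only.
line = {"AB": 550, "H": 160, "2B": 30, "3B": 5, "HR": 20,
        "BB": 60, "HBP": 5, "SF": 5}

singles = line["H"] - line["2B"] - line["3B"] - line["HR"]
total_bases = singles + 2 * line["2B"] + 3 * line["3B"] + 4 * line["HR"]

# AVG and SLG both divide by AB, so walks, HBP and sac flies
# simply disappear from the denominator (problem one).
avg = line["H"] / line["AB"]
slg = total_bases / line["AB"]

# OBA keeps those plate appearances, but every way of reaching base
# counts as exactly 1, so a walk equals a home run (problem two).
times_on_base = line["H"] + line["BB"] + line["HBP"]
oba = times_on_base / (line["AB"] + line["BB"] + line["HBP"] + line["SF"])

print(f"AVG {avg:.3f}  SLG {slg:.3f}  OBA {oba:.3f}")
```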

We can sum on base average and slugging average for OPS, but then again, who says that simply adding the two is the right relationship? Instead we can try to develop a system that solves both problems. Enter equivalent average. This post is going to describe anything and everything about EqA so you can come up with the exact EqAs Baseball Prospectus comes up with. One of the criticisms of EqA is that BP develops it in a black box. No one knows how they arrive at it. They do spell out the method here. You can do all the things they do. You'll find that the league leaders in the EqA you compute are generally around .300, while BP's EqA leaders are generally around .350 or so. You can play around with the stuff in that article for days and never come up with anything remotely close to their EqA. Sorry. As TangoTiger put it: Opening up the black box will not cause a single dent on [Baseball Prospectus's] bottom line.

What I am going to tell you is everything, including why Baseball Prospectus does what it does. It's rather simple. In fact, it comes down to the very thing people say mathematicians criticize sabermetricians for: units. People who dislike sabermetrics generally say real mathematicians would hate their "work" because they shed units completely. This really isn't true. Everything in EqA is measured in relatively precise units that in the end cancel out, leaving an answer in runs.

Now let's go on and attack the two major problems with oba, slg, and avg. We need to create some sort of rate statistic that includes getting on base and hitting for extra bases as well as stealing a base efficiently. The first thing that is calculated answers all of these problems in what they feel is the best way. We'll call this Raw:

Raw = (SF + SH + 1.5*BB + 1.5*HBP + 1.5*SB + 2*1B + 3*2B + 4*3B + 5*HR)/(SF+SH+BB+HBP+SB+CS+AB)
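As a minimal sketch, the Raw formula above translates directly into a small function. The stat line passed in is hypothetical, and the dictionary keys are just names chosen for this example.

```python
def raw_eqa(s):
    """Raw = scaled bases / (SF+SH+BB+HBP+SB+CS+AB), as in the formula above."""
    numerator = (s["SF"] + s["SH"]
                 + 1.5 * (s["BB"] + s["HBP"] + s["SB"])
                 + 2 * s["1B"] + 3 * s["2B"] + 4 * s["3B"] + 5 * s["HR"])
    denominator = (s["SF"] + s["SH"] + s["BB"] + s["HBP"]
                   + s["SB"] + s["CS"] + s["AB"])
    return numerator / denominator

# Invented stat line, for illustration only.
hitter = {"AB": 550, "1B": 105, "2B": 30, "3B": 5, "HR": 20,
          "BB": 60, "HBP": 5, "SB": 10, "CS": 4, "SF": 5, "SH": 1}

raw = raw_eqa(hitter)
print(f"Raw = {raw:.3f}")
```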

What is Raw measuring? It's essentially scaled bases per opportunity to move up a base. Intuitively, the idea that walks are worth more than sacrifices, but not quite as much as singles, is sound. Raw EqA addresses our two problems effectively and also adds in SB and CS, whose absence can be described as a third problem with each of OBA, SLG and AVG. So in the end what does Raw measure? Scaled bases per PA+SB+CS. It gives a numeric value of production. Now we can use Raw and convert it to runs. For a team we do this with this equation:

EqR = (Raw/LgRaw)^2 * PA * LgR/LgPA
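A quick numeric sketch of the team equation, assuming made-up league totals (0.12 runs per PA and a league Raw of .770). None of these numbers come from a real season.

```python
def team_eqr(raw, lg_raw, pa, lg_r, lg_pa):
    """Team EqR = (Raw/LgRaw)^2 * PA * LgR/LgPA."""
    return (raw / lg_raw) ** 2 * pa * lg_r / lg_pa

# Hypothetical league totals, invented for this example.
lg_raw, lg_r, lg_pa = 0.770, 12000, 100000

# A team with exactly league-average Raw scores at the league rate.
avg_team = team_eqr(0.770, lg_raw, 6200, lg_r, lg_pa)

# A team whose Raw is 5% better gains about 10% in runs, not 5%,
# because the ratio is squared.
good_team = team_eqr(0.770 * 1.05, lg_raw, 6200, lg_r, lg_pa)

print(f"average team: {avg_team:.1f} runs, good team: {good_team:.1f} runs")
```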

So what is EqR doing? It takes the team's production relative to what an average team does, squares it, and multiplies it by PA and by the runs per PA an average team scores. The squared term is based on the idea that the relationship between Raw/LgRaw and runs is not linear. This makes sense because when you add good hitters, your other good hitters come up with more guys on base and each of their hits drives in more runs. Now, since we're looking at EqR on a team level and we want it on the player level, let's look at that.

First, an assumption: the player in question is being analyzed on an average team in his home park. This assumption is needed to derive the equation most people see for EqR. Now, to look at the change in EqR for some change in Raw, take the derivative of EqR with respect to the ratio Raw/LgRaw. We get this equation:

dEqR = 2*Raw/LgRaw*PA*LgR/LgPA

Now we're adding some guy to this team, but a team only has nine slots it can play. So what are we doing? We're replacing an average player on this team and adding this player's production. So basically we have our runs minus an average player's runs in the same PA. We're NOT measuring runs over an average player. We're measuring all of the runs created by a player. So our equation becomes:

dEqR = 2*Raw/LgRaw*PA*LgR/LgPA - PA*LgR/LgPA

Now we can factor out PA*LgR/LgPA resulting in the equation for EqR for a player you'll see at BP, only they drop the dEqR and call it EqR.

EqR = (2*Raw/LgRaw - 1) * PA * LgR/LgPA
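One way to sanity-check the player equation: with x = Raw/LgRaw, the factor 2x - 1 is also the tangent line to x^2 at x = 1, so the player formula agrees with the squared team formula for an average hitter and drifts from it by (x - 1)^2 elsewhere. The sketch below is only a numeric check of that observation, with invented league totals.

```python
def player_eqr(raw, lg_raw, pa, lg_r, lg_pa):
    """Player EqR = (2*Raw/LgRaw - 1) * PA * LgR/LgPA."""
    return (2 * raw / lg_raw - 1) * pa * lg_r / lg_pa

# Compare the squared (team) factor with the linear (player) factor.
rows = []
for ratio in (0.8, 0.9, 1.0, 1.1, 1.3):
    squared = ratio ** 2
    linear = 2 * ratio - 1
    rows.append((ratio, squared, linear))
    print(f"ratio {ratio:.1f}  squared {squared:.3f}  linear {linear:.3f}")

# Invented league totals (0.12 runs per PA); an exactly average hitter
# over 650 PA gets the league-average run total for those PA.
average_hitter_runs = player_eqr(0.770, 0.770, 650, 12000, 100000)
print(f"average hitter: {average_hitter_runs:.1f} runs")
```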

Generally people look at that and say, what the heck are they doing? Now you know why you're subtracting 1 and multiplying the ratio by two. Here is where we can multiply this by our park factor to normalize for parks, if desired. Now we want to scale EqR to some rate statistic. What should we use? Outs, of course. Why? Outs are the clock in baseball. We have 9 sets of 3 outs. We can bat as long as we want as long as we don't make those outs. So we decide to make our rate something close to runs per out used. Then we get this equation, which you can find at BP, albeit not in the article I linked to regarding how to compute EqA (lol).

EqA = (EqR/Out/5)^.4

First let's analyze the "units". We have runs divided by outs, which is what we wanted. Pay no attention to the .4 right now. The thing that should cross your mind is what crosses everyone's mind: Why the hell do they divide by five? WHY? This is where everyone gets lost. In fact, if you follow the calculations done in this thread and divide by five you won't get the EqA BP computes. This is the black box, so to speak. Remember, average EqA is supposed to be .260. If you plug all this in you'll get the league average to be about .266 or so, depending on the season. IT DOESN'T WORK. The 5 is more or less standing in for a constant that forces the average to be equal to .260. How do we do that?

Well, the league average is going to be (LgR/LgOut/C)^.4. Since we want to "force" EqA to be equal to .260 for an average player, simply set that equation equal to .260 and solve for C. So C = (LgR/LgOut)/.260^2.5. This number tends to be around 5, ranging anywhere from 4.6 (Japan Central League) to about 5.6 (2007 AL). The 2007 National League was about 5.2.
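Here is a minimal sketch of solving for C and checking that it really pins the league average at .260. The league totals are invented, and the function names are just labels for this example.

```python
def scaling_constant(lg_r, lg_out, target=0.260):
    """C = (LgR/LgOut) / target^2.5, the 'five' that isn't five."""
    return (lg_r / lg_out) / target ** 2.5

def eqa(eqr, outs, c):
    """EqA = (EqR/Out/C)^.4."""
    return (eqr / outs / c) ** 0.4

# Hypothetical league totals, not from any real season.
lg_r, lg_out = 12000, 67000

c = scaling_constant(lg_r, lg_out)

# The league as a whole, run through the formula with the solved C
# and with a literal 5.
league_eqa_with_c = eqa(lg_r, lg_out, c)
league_eqa_with_5 = eqa(lg_r, lg_out, 5)

print(f"C = {c:.3f}")
print(f"league EqA with C: {league_eqa_with_c:.4f}")
print(f"league EqA with 5: {league_eqa_with_5:.4f}")
```

With the solved C the league comes out at .260 by construction. With a literal 5 it lands wherever that season's run environment puts it.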

And there, with the above information you can get the exact answers that BP gets for EqA and puts on their player cards. In fact, if you want to, you can find out the park factors to extra digits. I've gotten to the point where the average "error" on the EqA I come up with is .000226 compared to theirs. Remember that their EqA is the ring of integers divided by 1000. In other words: it's rounded to three digits. Theoretically, the average error from rounding will then be .00025, which is actually greater than the error I come up with.
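The .00025 figure can be checked with a few lines, assuming the unrounded values are spread evenly between thousandths. The sketch works in millionths so the arithmetic stays exact.

```python
# Every possible offset (in millionths) from a multiple of 0.001.
# The rounding error is the distance to the nearest thousandth.
errors = [min(offset, 1000 - offset) for offset in range(1000)]

# Average distance, converted from millionths back to EqA points.
mean_abs_error = sum(errors) / len(errors) / 1_000_000

print(f"mean absolute rounding error: {mean_abs_error:.5f}")
```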

So there you have it. EqA perfectly. Now go look up EqR on BP and you'll see this:

EqR = 5*Out*EqA^2.5

Oh and 1/2.5 = .4, so solving that equation for EqA gives us EqA = (EqR/Out/5)^.4. Look familiar? Oh, but now we're all smart enough to realize that the five isn't five.

And yes, in case you noticed, LgR gets canceled out. If you plug in everything you get:

EqA = ((2*Raw/LgRaw - 1) * PA * LgR/LgPA / Out * LgOut/LgR * .26^2.5)^.4
EqA = ((2*Raw/LgRaw - 1) * PA/Out * LgOut/LgPA * .26^2.5)^.4

Whenever you want to scale EqA to some league average production based on runs, it's going to cancel out... which of course makes sense.
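To see the cancellation numerically, here is a sketch that substitutes C = (LgR/LgOut)/.260^2.5 into EqA = (EqR/Out/C)^.4 and evaluates it under two different league run totals. All inputs are invented, and the function names are just labels for this example.

```python
def eqa_full(raw, lg_raw, pa, outs, lg_r, lg_pa, lg_out, target=0.260):
    """Player EqR, then the scaling constant C, then EqA."""
    eqr = (2 * raw / lg_raw - 1) * pa * lg_r / lg_pa
    c = (lg_r / lg_out) / target ** 2.5
    return (eqr / outs / c) ** 0.4

def eqa_cancelled(raw, lg_raw, pa, outs, lg_pa, lg_out, target=0.260):
    """Same thing after LgR has been cancelled out algebraically."""
    return ((2 * raw / lg_raw - 1) * pa / outs
            * lg_out / lg_pa * target ** 2.5) ** 0.4

# Hypothetical player and league, for illustration only.
player = dict(raw=0.848, lg_raw=0.770, pa=621, outs=400)
league = dict(lg_pa=100000, lg_out=67000)

low_scoring = eqa_full(**player, lg_r=12000, **league)
high_scoring = eqa_full(**player, lg_r=15000, **league)
no_lg_r = eqa_cancelled(**player, **league)

print(f"{low_scoring:.6f} {high_scoring:.6f} {no_lg_r:.6f}")
```

All three values agree, so the league run total never needs to enter the final EqA at all.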

Sunday, February 10, 2008

Building A Projection System

Rk	Name	Pos	Actual	PECOTA	Mine	Err PECOTA	Err Mine
1.	Rodriguez	3b	.340	.319	.310	.021	.030
2.	Ramirez	ss	.315	.277	.302	.038	.013
3.	Renteria	ss	.297	.262	.259	.035	.038
4.	Rollins	ss	.290	.274	.272	.016	.018
5.	Jeter	ss	.285	.305	.277	.020	.008
6.	Guillen	ss	.283	.306	.291	.023	.008
7.	Reyes	ss	.278	.276	.266	.002	.012
8.	Tejada	ss	.271	.296	.289	.025	.018
9.	Young	ss	.270	.286	.275	.016	.005
10.	Wilson	ss	.269	.247	.246	.022	.023
11.	Eckstein	ss	.266	.247	.257	.019	.009
12.	Greene	ss	.263	.272	.266	.009	.003
13.	Hardy	ss	.261	.254	.241	.007	.020
14.	Sea Bass	ss	.260	.247	.233	.013	.027
15.	Cabrera	ss	.260	.260	.253	.000	.007
16.	Peralta	ss	.259	.281	.265	.022	.006
17.	Loretta	ss	.254	.252	.250	.002	.004
18.	Bartlett	ss	.253	.269	.252	.016	.001
19.	Betancourt	ss	.248	.251	.244	.003	.004
20.	Scutaro	ss	.246	.253	.246	.007	.000
21.	Furcal	ss	.244	.268	.278	.024	.034
22.	Lopez	ss	.239	.272	.268	.033	.029
23.	Drew	ss	.236	.276	.289	.040	.053
24.	Durham	2b	.227	.295	.269	.068	.042
25.	Lugo	ss	.225	.269	.261	.044	.036
26.	Uribe	ss	.222	.263	.228	.041	.006
27.	Vizquel	ss	.221	.264	.242	.043	.021
28.	Crosby	ss	.219	.265	.247	.046	.028
29.	McDonald	ss	.211	.215	.215	.004	.004
30.	Izturis	ss	.210	.234	.221	.024	.011
	Average	ss	.257	.269	.260	.023	.017

As I sit here working on a simple projection system to evaluate translations from Japan to the United States, I did a trial run of one of the simple methods I came up with. The method is based on Marcel, and I was only trying to project Equivalent Average. I looked at most middle infielders from the 1990s and developed a simplistic general age curve for all of them. I fitted that using a weighted-season process similar to the one Marcel uses. I then looked at the set of 2007 SSs with a large number of PAs and compared the projections against the actual results for PECOTA and the simplistic method I came up with. Surprisingly, the method I devised was more accurate. Weird. In case you're interested, the results are in the table above.
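The comparison itself boils down to mean absolute error. Here is a minimal sketch using only the first five rows of the table in this post, so the averages it prints will not match the full Average row.

```python
# (name, actual EqA, PECOTA projection, my projection)
# First five rows of the table only.
rows = [
    ("Rodriguez", 0.340, 0.319, 0.310),
    ("Ramirez",   0.315, 0.277, 0.302),
    ("Renteria",  0.297, 0.262, 0.259),
    ("Rollins",   0.290, 0.274, 0.272),
    ("Jeter",     0.285, 0.305, 0.277),
]

def mean_abs_error(pairs):
    """Average of |actual - projected| over (actual, projected) pairs."""
    pairs = list(pairs)
    return sum(abs(actual - projected) for actual, projected in pairs) / len(pairs)

pecota_mae = mean_abs_error((act, pec) for _, act, pec, _ in rows)
mine_mae = mean_abs_error((act, mine) for _, act, _, mine in rows)

print(f"PECOTA: {pecota_mae:.4f}  mine: {mine_mae:.4f}")
```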

Saturday, February 09, 2008

Shortstop Rankings

Rk.	Name	Pos	R	HR	RBI	SB	AVG
1.	Hanley Ramirez	ss	112	22	78	41	.309
2.	Jose Reyes	ss	111	14	66	63	.288
3.	Jimmy Rollins	ss	110	21	76	31	.286
4.	Troy Tulowitzki	ss	99	22	92	9	.286
5.	Derek Jeter	ss	102	12	75	15	.306
6.	Carlos Guillen	ss	86	17	83	12	.295
7.	Rafael Furcal	ss	99	10	56	28	.281
8.	Miguel Tejada	ss	77	19	88	4	.298
9.	Michael Young	ss	81	12	80	9	.299
10.	Jhonny Peralta	ss	92	21	84	4	.272
11.	Yunel Escobar	ss	85	8	70	12	.297
12.	JJ Hardy	ss	85	23	84	3	.270
13.	Orlando Cabrera	ss	88	9	70	18	.273
14.	Stephen Drew	ss	77	18	78	8	.264
15.	Edgar Renteria	ss	82	10	61	10	.287
16.	Khalil Greene	ss	74	22	82	5	.252
17.	Julio Lugo	ss	75	7	56	26	.269
18.	Brendan Harris	ss	77	14	74	5	.270
19.	Asdrubal Cabrera	ss	78	9	57	18	.266
20.	David Eckstein	ss	82	4	49	10	.280
	Average	ss	93	16	77	18	.287
	Replacement Level	ss	75	9	58	12	.272

Who does not love fantasy baseball? This is the first entry in a series of posts that will rank fantasy players based on their projections for the 2008 season. The projection systems used to come up with a player's projections are PECOTA, Bill James' and ZiPS. An estimate was made based on depth charts to see how many plate appearances can be expected from each player, injury likelihood included. This cuts out projections where the playing time is low because of flukish injuries, like Derrek Lee's. Shortstop happens to be the position that has the least depth this year, but it's also quite top heavy. Three shortstops are going in the first round. They have elite status and are top 12 players.

I have several strategies I like to employ at short. The two guys I like to target are Derek Jeter and Stephen Drew. When I draft Jeter I usually do it for his batting average. Drafting his average allows you to invest in guys who are good power hitters but do not hit for a great average. The four guys that immediately come to mind are Ryan Howard, Adam Dunn, Josh Fields and Chris Young. Jeter also adds some runs and some steals on the side. Drew's a bit of the opposite. He has nice power upside, although the projections are not the greatest in the world. He probably won't hit higher than .290, but he's a good gamble late.

I don't advise investing a top five pick on Jose Reyes. I have him rated #10 overall and I just don't see the point of investing in a two category player in the first round. You're limiting yourself way too much.

Reboot #2

Let's see, last time I decided to promise my zero readers that I was going to reboot this blog and make more posts, it lasted all of one post after that one. Maybe this time it will last longer. I would not bet on it, but it is worth a try. We'll see...
