Wednesday, November 25, 2009

Run Estimators: Avg, OBP, SLG, OPS vs. OOPS

The fits below were obtained by fitting total runs for the 14 AL teams for the 2009 season.
2009 League averages:
      AB   Runs   BA    OBP   SLG   OPS  OOPS  R1_est  OOPS2  R2_est
AL   5569   781  .266  .335  .428  .763  1.098   776   1.206    779
NL   5493   718  .259  .330  .409  .739  1.069   734   1.143    738

OOPS == 2*OBP + SLG;   OOPS2== OOPS^2
R1_est = -910 + 1350 * (2.45 * OBP + SLG)
R2_est = 646 * OOPS^2
OOPS2 has the advantage of zero offset (Runs=0 at x=0) with only 1 parameter.  The fractional change in OOPS2 is equal to the fractional change in predicted runs. That is not true for the common run estimators, which predict negative runs for very low values of the independent variables.
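One could rescale OOPS2 to a batting-average-like scale, OOPS2_BA = 0.266 * OOPS^2/1.206 for the AL (0.266 is the AL BA and 1.206 the AL OOPS2 from the table above), or to an index where league average is 100, OOPS2+ = 100 * OOPS^2/1.206.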
2009 stats:
              AVG    OBP    SLG    OPS   OOPS   OOPS2   OOPS2+   OOPS2_BA
Guerrero     .295   .334   .460   .794   1.128  1.272    106       .281
Dunn         .267   .398   .529   .928   1.325  1.756    146(AL)   .387
Pujols       .327   .443   .658  1.101   1.544  2.384    198       .526
Mauer        .365   .444   .587  1.031   1.475  2.176    180       .480
Youkilis     .306   .413   .548   .961   1.374  1.888    157       .416
Teixeira     .292   .383   .565   .948   1.331  1.772    147       .391
A-Rod        .286   .402   .532   .933   1.336  1.785    148       .394
Swisher      .249   .371   .498   .869   1.240  1.538    127       .339
Abreu        .293   .390   .435   .825   1.215  1.476    122       .326
Damon        .282   .365   .489   .854   1.219  1.486    123       .328
Matsui       .274   .367   .509   .876   1.243  1.545    128       .341
Jeter        .334   .406   .465   .871   1.277  1.631    135       .360
V_Mart       .303   .381   .480   .861   1.242  1.543    128       .340
Posada       .285   .363   .522   .885   1.248  1.558    129       .344
J Molina     .217   .292   .268   .560   0.852  0.728     60       .160
Cervelli     .298   .309   .372   .682   0.990  0.980     81       .216
Varitek      .209   .313   .390   .703   1.016  1.032     86       .228
Melky        .274   .336   .416   .752   1.088  1.184     98       .261
Gardner      .270   .345   .379   .724   1.069  1.143     95       .252
Granderson   .249   .327   .453   .780   1.107  1.225    102       .270
Cameron(NL)  .250   .342   .452   .795   1.136  1.290    107(AL)   .285
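
As a rough check, the derived columns and the two run estimates can be recomputed from OBP and SLG alone. This is only a sketch using the formulas and AL baselines quoted in this post; the function names are made up for illustration.

```python
# Sketch: recompute OOPS, OOPS2, OOPS2+, OOPS2_BA and the two run estimates.
# Baselines are the 2009 AL values quoted in the post; names are illustrative.
AL_BA = 0.266      # 2009 AL batting average
AL_OOPS2 = 1.206   # 2009 AL average OOPS^2

def oops(obp, slg):
    return 2 * obp + slg

def oops2(obp, slg):
    return oops(obp, slg) ** 2

def oops2_plus(obp, slg):
    # Index scale: AL average = 100
    return 100 * oops2(obp, slg) / AL_OOPS2

def oops2_ba(obp, slg):
    # Batting-average-like scale: AL average = .266
    return AL_BA * oops2(obp, slg) / AL_OOPS2

def r1_est(obp, slg):
    # Two-stat linear fit, eq(1)
    return -910 + 1350 * (2.45 * obp + slg)

def r2_est(obp, slg):
    # One-parameter fit through the origin
    return 646 * oops2(obp, slg)

for name, obp, slg in [("AL avg", 0.335, 0.428), ("Pujols", 0.443, 0.658)]:
    print(f"{name:8s} OOPS={oops(obp, slg):.3f} OOPS2={oops2(obp, slg):.3f} "
          f"OOPS2+={oops2_plus(obp, slg):.0f} OOPS2_BA={oops2_ba(obp, slg):.3f}")
print("AL team runs: R1_est =", round(r1_est(0.335, 0.428)),
      " R2_est =", round(r2_est(0.335, 0.428)))
```

The run estimates only make sense for team-level OBP and SLG, since the fits were made to team season totals.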


==================================================
KaleidaGraph Results:

KG:  RUNS_OPS.qpc:

y = a + b * x  
       Value   Error
a      -811.8  160.7
b      2087.2  210.33
Chisq  8089.1  NA
R      0.9441  NA

RUNS = -810 + 2090 * OPS
CHI2 = 8090  ==>  sig = 25 runs;  R = 0.944

Note that the "Chi2" reported by both KG and ProFit basically assumes sig=1 for every point;
i.e., it is just the sum of squared residuals:
"Chi2" = SUM [ (y-y_fit)^2 ]
VAR = "Chi2"  /  (N-1)
sig = sqrt(VAR) = sqrt(  "Chi2" / (N-1) ) = sqrt( "Chi2"/13 )

SS_err = "Chi2"
SS_tot = SUM [ (y-y_avg)^2 ]
SS_reg = SUM [ (y_fit-y_avg)^2 ]
R^2 = 1 - SS_err/SS_tot  = 1 - "Chi2"/SS_tot
standard error in R is sqrt[ (1-R^2) / (N-2) ]

For RUNS_OPS.qpc:    R^2 = 0.8914 = 1 - 8089/SS_tot, or SS_tot = 74,460
(check:  KG gives Variance=5728;  5728 x 13 = 74464)
for this data set, R = SQRT( 1 - "Chi2"/74460 )
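
The relations above are enough to turn any reported "Chi2" into sig and R for this data set. A minimal sketch, assuming N = 14 teams and SS_tot = 74460 as stated above (function names are illustrative only):

```python
import math

N = 14          # AL teams in the fit
SS_TOT = 74460  # total sum of squares for 2009 AL team runs (from the post)

def sigma(chi2, n=N):
    # RMS scatter as defined in the post: sqrt("Chi2" / (N-1))
    return math.sqrt(chi2 / (n - 1))

def r_from_chi2(chi2, ss_tot=SS_TOT):
    # R = sqrt(1 - SS_err/SS_tot)
    return math.sqrt(1 - chi2 / ss_tot)

def r_std_error(r, n=N):
    # Standard error in R: sqrt((1-R^2)/(N-2))
    return math.sqrt((1 - r * r) / (n - 2))

for stat, chi2 in [("AVG", 21914), ("OPS", 8089), ("OOPS", 6836)]:
    r = r_from_chi2(chi2)
    print(f"{stat:5s} R={r:.3f}  sig={sigma(chi2):.1f}  se(R)={r_std_error(r):.3f}")
```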

From Kaleidagraph fits for AL 2009:
Stat      R  Chisq   sig
AVG    0.840 21914   41.1
SLG    0.867 18497   37.7
OBP    0.915 12078   30.5
OPS    0.944  8089   24.9
OOPS   0.953  6836   22.9
OOPS2  0.948  7486   24.0
wOBA   0.953  6834   22.9
ProFit:
eq(1)  0.954  6740   22.8
eq(2)  0.959  5930   21.4

===================================================
ProFit Results:

function Fred(OBP, SLG,  b, c,d : real);
begin;
y := b*(c*OBP + SLG) + d;
end;

iterations: 19
------------------------
Chi squared        = 6743.2299

Parameters:        Standard deviations:
b   = 1353.3799    ∆b = 458.7910
c   =    2.4512    ∆c =   1.3064
d   = -910.4042    ∆d = 166.6370

RUNS = -910 + 1350 * (2.45 * OBP + SLG)        eq(1)
CHI2 = 6740  ==>  sig = 23 runs;  R = 0.954
===================================================
=================================================
function Fred(BB_PA, S_PA, D_PA, T_PA, HR_PA,  b,c,d,e,f : real);
begin;
y := b + c * (d*BB_PA + S_PA + e*D_PA + 1.6*T_PA + f*HR_PA)
end;

Iterations: 14
-------------------------------------------
Chi squared        = 5932.8988

Parameters:           Standard deviations:
b     = -1024.8696    ∆b =  241.3729
c     =  5185.8310    ∆c = 1007.9237
d     =     0.7774    ∆d =    0.2233
e     =     1.4131    ∆e =    0.3643
f     =     1.7033    ∆f =    0.3541

RUNS = -1020 + 5200 * (0.77*BB_PA + 1B_PA + 1.4*2B_PA + 1.6*3B_PA + 1.7*HR_PA) eq(2)
CHI2 = 5930  ==>  sig = 21.4 runs;  R = 0.959
=================================================

For comparison, http://www.insidethebook.com/woba.shtml lists
HR 1.70, 3B 1.37, 2B 1.08, 1B 0.77, NIBB 0.62, equivalent (dividing by the 1B weight) to:
BB 0.81,  1B 1.00,  2B 1.40,  3B 1.78, HR 2.21
---------------------------------------------------
from http://www.hardballtimes.com/main/statpages/glossary/
GPA= Gross Production Average, a variation of OPS, but more accurate and easier to interpret. The exact formula is (1.8*OBP + SLG)/4, adjusted for ballpark factor. The scale of GPA is similar to BA: .200 is lousy, .265 is around average and .300 is a star. A simple formula for converting GPA to runs is PA*1.356*(GPA^1.77).
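
A quick sketch of the GPA formulas quoted above, ignoring the ballpark adjustment (which the glossary does not spell out). The function names and the 600-PA figure are assumptions for illustration, not from the source.

```python
def gpa(obp, slg):
    # Gross Production Average, without the ballpark adjustment
    return (1.8 * obp + slg) / 4

def gpa_runs(pa, gpa_value):
    # Glossary's conversion from GPA to runs: PA * 1.356 * GPA^1.77
    return pa * 1.356 * gpa_value ** 1.77

al_avg = gpa(0.335, 0.428)   # 2009 AL average OBP and SLG
print(f"AL average GPA = {al_avg:.3f}")
# 600 PA is a hypothetical full season for one batter
print(f"Runs for a league-average batter over 600 PA = {gpa_runs(600, al_avg):.0f}")
```

The AL average comes out near the ".265 is around average" mark in the glossary's description.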
---------------------------------------------------
from  http://www.baseball-fever.com/showthread.php?t=66363
"the best correlation with runs comes from (1.8*OBA + SLG), or something in that range"
---------------------------------------------------
from  http://www.tangotiger.net/wiki/index.php?title=Linear_Weights
R = .49S + .61D + 1.14T + 1.50HR + .33W + .14SB + .73SF, roughly:
BB 0.67,  1B 1.00,  2B 1.24,  3B 2.33,  HR 3.06
---------------------------------------------------
from  http://www.tangotiger.net/wiki/index.php?title=Batting_Runs
BR = .47S + .85D + 1.02T + 1.40HR + .33(W + HB),  roughly:
BB 0.70,  1B 1.00,  2B 1.81,  3B 2.17,  HR 2.98
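
The "equivalent to" and "roughly" lines in the comparisons above come from dividing each set of weights by its singles weight, so they can be compared directly with eq(2). A small sketch (the dictionary and function names are illustrative; NIBB and W+HB are simply treated as BB, as in the lines above):

```python
def relative_to_single(weights):
    # Rescale event weights so that a single is worth exactly 1.00
    single = weights["1B"]
    return {event: round(w / single, 2) for event, w in weights.items()}

woba           = {"BB": 0.62, "1B": 0.77, "2B": 1.08, "3B": 1.37, "HR": 1.70}
linear_weights = {"BB": 0.33, "1B": 0.49, "2B": 0.61, "3B": 1.14, "HR": 1.50}
batting_runs   = {"BB": 0.33, "1B": 0.47, "2B": 0.85, "3B": 1.02, "HR": 1.40}

for name, w in [("wOBA", woba), ("Linear Weights", linear_weights),
                ("Batting Runs", batting_runs)]:
    print(f"{name:15s}", relative_to_single(w))
```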
---------------------------------------------------
from  http://www.baseballmusings.com/archives/005962.php
RG = -5.84 + 22.92 (OBP) + 7.21 (SLG) + e,  roughly:
RUNS = -950 + 1170 * (3.18 * OBP + SLG)    R^2 = 0.92
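
The "roughly" line follows from multiplying the runs-per-game regression by 162 games and factoring out the SLG coefficient. A sketch of that arithmetic (variable names are illustrative):

```python
GAMES = 162

# Per-game regression: RG = -5.84 + 22.92*OBP + 7.21*SLG
intercept, b_obp, b_slg = -5.84, 22.92, 7.21

season_intercept = intercept * GAMES   # season-total offset
scale = b_slg * GAMES                  # multiplier on (w*OBP + SLG)
obp_weight = b_obp / b_slg             # weight of OBP relative to SLG

print(f"RUNS = {season_intercept:.0f} + {scale:.0f} * ({obp_weight:.2f} * OBP + SLG)")
```

The post rounds the first two numbers to -950 and 1170.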
---------------------------------------------------
from  http://cyrilmorong.com/Havoc.htm
The table below summarizes the correlation and r-squared that various stats had
with team runs from 2001-03:
  Stat  Correlation  R-squared
  AVG      0.858       0.736
  SLG      0.917       0.842
  OBP      0.891       0.794
  OPS      0.950       0.903
  SB/G    -0.032       0.001
  Net SB/G 0.136       0.018
  SB%      0.303       0.092

Friday, October 9, 2009

Rating the MLB Divisions for 2009

Two simple ways to rate the strength of the 6 MLB Divisions are to add up the games over .500 (call this "G500") and the Run-Differentials ("RunDiff") for each Division. For the 2009 season:
          G500  RunDiff
AL-East    32     239
AL-Cent   -48    -206
AL-West    40     112

NL-East   -20     -39
NL-Cent   -34    -217
NL-West    30     111

AL-TOTAL   24     145
NL-TOTAL  -24    -145

So, based *solely* on these numbers, one would strongly expect the NL champ to come from the NL-West, and somewhat less strongly expect the AL champ to come from the AL-East.


One can use the G500 numbers to tweak each team's W-L record:  divide the Division G500 by the number of teams in that division, divide by 2, and add the result to the team's actual number of wins.  
Then the actual # wins in the AL:
    NYY=103, LAA=97, BOS=95, Min=87
become   NYY=106, LAA=102, BOS=98, Min=82

For the NL:  LAD=95, Phi=93, Col=92, STL=91
become:  LAD=98, Phi=91, Col=95, STL=88
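
The adjustment rule can be written out as a short sketch. The G500 values are from the table above; the number of teams per division is not listed in this post and is filled in here from the 2009 alignment (4 in the AL-West, 6 in the NL-Cent, 5 elsewhere). Names are illustrative.

```python
G500 = {"AL-East": 32, "AL-Cent": -48, "AL-West": 40,
        "NL-East": -20, "NL-Cent": -34, "NL-West": 30}

# Teams per division in 2009 (assumption added here, not from the post)
TEAMS = {"AL-East": 5, "AL-Cent": 5, "AL-West": 4,
         "NL-East": 5, "NL-Cent": 6, "NL-West": 5}

def adjusted_wins(wins, division):
    # Division G500 per team, halved, added to the team's win total
    return wins + G500[division] / TEAMS[division] / 2

for team, wins, div in [("LAA", 97, "AL-West"), ("Min", 87, "AL-Cent"),
                        ("LAD", 95, "NL-West"), ("STL", 91, "NL-Cent")]:
    print(f"{team}: {wins} -> {adjusted_wins(wins, div):.1f}")
```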

We can also try to combine the two predictors (G500 and RunDiff) into a single "Strength" number for each Division.   The overall ratio of the absolute magnitudes (i.e., ignoring the sign) of RunDiff to G500 is about 4.5, so weighting the 2 predictors equally would mean multiplying G500 by 4.5.  Making the assumption that G500 has more luck involved than RunDiff, we choose to multiply G500 by only 2.  Then:
             Strength = 2*G500 + RunDiff
          Strength
AL-East     303
AL-Cent    -302
AL-West     192

NL-East     -79
NL-Cent    -285
NL-West     171

AL-TOTAL    193
NL-TOTAL   -193
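
For anyone who wants to try a different weighting, the Strength numbers and the 4.5 ratio can be reproduced with a few lines. This is a sketch using the G500 and RunDiff values from the first table; the names are illustrative.

```python
# (G500, RunDiff) per division, from the table earlier in this post
DIVISIONS = {
    "AL-East": (32, 239), "AL-Cent": (-48, -206), "AL-West": (40, 112),
    "NL-East": (-20, -39), "NL-Cent": (-34, -217), "NL-West": (30, 111),
}

def strength(g500, run_diff, weight=2):
    # weight=2 is the post's choice; about 4.5 would weight the two equally
    return weight * g500 + run_diff

# Ratio of summed absolute RunDiff to summed absolute G500
ratio = (sum(abs(rd) for _, rd in DIVISIONS.values())
         / sum(abs(g) for g, _ in DIVISIONS.values()))
print(f"|RunDiff| / |G500| = {ratio:.2f}")

for div, (g, rd) in DIVISIONS.items():
    print(f"{div}  {strength(g, rd):5d}")
```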
-------------------------------------- 
Yet another predictor is the 'expected' W-L record based on Runs scored and Runs allowed.  There are various flavors of this 'Pythagorean' predictor; we use the one from mlb.com.  Then the Games-above-.500 become:

          G500 
AL-East    44 
AL-Cent   -40 
AL-West    18 

NL-East     4 
NL-Cent   -42 
NL-West    24 

AL-TOTAL   22 
NL-TOTAL  -14 
 
We can use these new G500 numbers to once again tweak teams' *expected* W-L record:  divide the Division G500 by the number of teams in that division, divide by 2, and add the result to the *expected* number of wins.  
Expected # wins in the AL:
    NYY=95, LAA=92, BOS=93, Min=86
become   NYY=99, LAA=94, BOS=97, Min=82

For the NL:  LAD=99, Phi=92, Col=90, STL=91
become:  LAD=101, Phi=92, Col=92, STL=87

Finally, a rough correction for AL vs. NL:
AL - NL = 22 - (-14) = 36 expected games above .500; divide by 2 ==> 18
# interleague games ~ 18 per AL team * 14 teams = 252
Normalizing to a 162-game season ==> 18 * (162/252) = 12 Wins
Then the AL estimate becomes   NYY=111, LAA=106, BOS=109, Min=94
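
The league correction is just the arithmetic above strung together. A sketch with illustrative variable names:

```python
AL_TOTAL, NL_TOTAL = 22, -14       # expected G500 totals from the table above
INTERLEAGUE_GAMES = 18 * 14        # ~18 interleague games for each of 14 AL teams
SEASON_GAMES = 162

half_gap = (AL_TOTAL - NL_TOTAL) / 2                      # 18
correction = half_gap * SEASON_GAMES / INTERLEAGUE_GAMES  # about 11.6, rounded to 12

# Division-adjusted expected wins from the previous step
adjusted = {"NYY": 99, "LAA": 94, "BOS": 97, "Min": 82}
final = {team: wins + round(correction) for team, wins in adjusted.items()}
print(f"correction = {correction:.1f} wins")
print(final)
```

This scales the interleague gap as if the AL played the NL all season, so it is probably best read as an upper bound on the AL advantage.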

Wednesday, October 7, 2009

Approximate odds of Yanks winning the 2009 World Series

The following ignores several effects, just a simple estimate!

Yanks win % over Minn = 0.600 in one game (based on regular season).
In a best-of-3 series, add up P for the 4 winning sequences out of 8, get 0.650
In a best-of-5 series, add up P for the 16 winning sequences out of 32, get 0.683
(binomial distribution ==> P=0.710   for a best-of-7 series)
Add a bit for 5th game at home ==> 0.700
Strength of 3 starters ==> maybe 0.750.
AL-East is stronger, ==> make it 0.800

In 2nd and 3rd (7-game) series, crudely estimate 0.60 and 0.70

0.80 * 0.60 * 0.70 = 0.34 (34% chance of winning WS).
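
The series probabilities above follow from the binomial distribution with a fixed per-game win probability. A minimal sketch, assuming independent games and no home-field or pitching effects (the later adjustments in the estimate are judgment calls and are not modeled here):

```python
from math import comb

def series_win_prob(p, games):
    # Probability of winning a majority of `games` independent games,
    # each won with probability p. Counting all sequences as if every
    # game were played gives the same answer as stopping at the clincher.
    need = games // 2 + 1
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(need, games + 1))

for n in (3, 5, 7):
    print(f"best-of-{n}: {series_win_prob(0.600, n):.3f}")

# Chaining the three rounds with the post's adjusted guesses
print(f"WS estimate: {0.80 * 0.60 * 0.70:.2f}")
```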
