Smart Baseball

by Keith Law

BABIP may be new to many of you, but it has become increasingly useful in the sieve major-league teams use to try to separate a pitcher’s performance from the effects of defense and luck. It’s equal to the number of hits a pitcher allowed divided by the total number of balls he allowed batters to put into play, or BABIP = (H – HR) / (AB – K – HR + SF).* If the variation in a pitcher’s BABIP from one year to the next is largely out of his control, it stands to reason that we’ll have better results predicting his future ERA (or RA) using a neutral number instead of his current BABIP.
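As a minimal sketch, the BABIP formula above translates directly into a few lines of Python. The function name and the season line in the example are hypothetical, made up purely to show the arithmetic, and do not belong to any real pitcher.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play, so BABIP is undefined")
    return (hits - home_runs) / balls_in_play


# Hypothetical season line (illustrative numbers only).
example = babip(hits=180, home_runs=20, at_bats=750, strikeouts=200, sac_flies=5)
print(f"{example:.3f}")  # 160 hits on 535 balls in play -> 0.299
```

Home runs come out of both the numerator and the denominator because no fielder has a chance to make a play on them, while sacrifice flies go back into the denominator because they are balls in play that do not count as at bats.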

New data streams have helped us refine such predictions, but for a quick back-of-the-envelope look at a pitcher’s performance, FIP, or Fielding Independent Pitching, is useful because it’s on the same scale as ERA but does what I just described—swaps in a league-average BABIP for the pitcher’s actual BABIP. It’s like saying, “Hey, this guy had great help from his defense turning lots of balls in play into outs, so what would he have pitched like had he pitched in front of an average defense?”
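The text does not spell out the FIP calculation, so the sketch below uses the form in which it is commonly published: weights of 13, 3, and 2 on home runs, walks plus hit batters, and strikeouts, plus a league constant that puts the result on the ERA scale. The constant changes from season to season; the 3.10 default here is an assumed placeholder, and the pitcher line is hypothetical.

```python
def fip(home_runs, walks, hit_by_pitch, strikeouts, innings, constant=3.10):
    """Fielding Independent Pitching in its commonly published form.

    `innings` must be a real number (200 and 1/3 innings is 200.333...,
    not the box-score shorthand 200.1). `constant` is an assumed value;
    the real one is recomputed for each league season.
    """
    if innings <= 0:
        raise ValueError("innings must be positive")
    weighted = 13 * home_runs + 3 * (walks + hit_by_pitch) - 2 * strikeouts
    return weighted / innings + constant


# Hypothetical pitcher line (illustrative numbers only).
example_fip = fip(home_runs=18, walks=45, hit_by_pitch=5, strikeouts=210, innings=200.0)
print(f"{example_fip:.2f}")
```

Hits on balls in play never appear in the calculation, which is how this form ends up treating every pitcher as though he had a neutral BABIP.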

The concept is to strip the noise away from the signal in the pitcher’s ERA; DIPS’s introduction was, and to some extent still is, controversial, because many people, including pitchers, did not like the idea that they lacked control over the results of balls in play. (One popular trend of the early 2000s was for mischievous sportswriters to present pitchers with low BABIPs with the conclusions of McCracken’s studies. Hilarity ensued.) It flies in the face of what we always heard growing up—a pitcher induced that groundball, he got that weak contact, the hitter didn’t pop up but the pitcher “popped him up” (as if the hitter was merely a kernel of corn)—and that kind of shift in how we perceive the game is uncomfortable. A pitcher has the most control over three things: striking hitters out, avoiding walks, and avoiding home runs. He has some control over the type of contact he allows, such as a groundball or a flyball, but once the ball’s in play, whether it’s fielded cleanly or not has nothing to do with what the pitcher did before contact. It’s the baseball equivalent of learning that all life on earth evolved from a single common ancestor: whether it’s what we want to hear doesn’t matter because the evidence says it’s true.

Defense-Independent Pitching Statistics (DIPS) first emerged on the old Usenet discussion board rec.sport.baseball and came to wider attention when Voros McCracken consolidated his findings into an article on Baseball Prospectus titled “Pitching and Defense: How Much Control Do Hurlers Have?” The shocking answer was “not that much.” Whether a ball put into play became a hit or was converted into an out appeared to be out of the pitcher’s control almost completely:

The pitchers who are the best at preventing hits on balls in play one year are often the worst at it the next. In 1998, Greg Maddux had one of the best rates in baseball, then in 1999 he had one of the worst. In 2000, he had one of the better ones again. In 1999, Pedro Martinez had one of the worst; in 2000, he had the best. This happens a lot. There is little correlation between what a pitcher does one year in the stat and what he will do the next. . . . This is not true in the other significant stats (walks, strikeouts, home runs).

This shouldn’t happen at all if the then-conventional wisdom that pitchers could prevent hits on balls in play were true. (This conventional wisdom hasn’t died entirely in the fifteen-plus years since Voros’s piece appeared; you will often see references to a pitcher’s batting average against or hits allowed per nine innings, both of which are almost entirely functions of his BABIP.)

McCracken argued that great pitchers were great not because they had some sort of special woo-woo to prevent hits on balls in play, but because they allowed fewer balls in play to begin with. You’d be better at predicting a pitcher’s performance in the following season by assuming his BABIP allowed would regress to the league average than by using his actual BABIP allowed in the previous season. He was positing that BABIP made us dumber when it came to predicting pitcher stats, and by and large he was correct. (As an aside, it turns out that the converse is not true for hitters: There are absolutely hitters who can make better or worse quality contact on a consistent basis and thus regularly post BABIPs that vary from the league average. Mike Trout, the best player in baseball as of this writing, has a career .356 BABIP; Bryce Harper, the other best player in baseball as of this writing, has a career .331 BABIP. The league averages for their careers are just under .300.)

The issue of untangling a pitcher’s own contributions to run prevention from his teammates’ efforts, especially on defense, isn’t going away anytime soon. As the 2016 season drew to a close, the Cubs were posting the lowest BABIP allowed by any team in a full season in forty years, and one of the best relative to the league BABIP in MLB history. The 2016 Cubs were a very good defensive unit, with plus defenders at multiple positions, including Jason Heyward, one of the best right fielders in recent years, playing right and occasionally center. They were also one of the most analytically driven teams, including how they used data to position their fielders for each batter. So how much do we want to credit their pitchers for work that might have been the result of great defense?

All five of the Cubs’ starters made at least 29 starts in 2016, and every one had an ERA at least 0.40 runs lower than his FIP. Kyle Hendricks led the Cubs and the NL with a 2.13 ERA but his FIP was 3.20—the largest gap between ERA and FIP of any of the Cubs’ starters. He allowed a .300 BABIP in a full season in 2015, but allowed just a .250 BABIP in 2016. It’s almost certain that the drop in his rate of hits allowed on balls in play was the result of the Cubs’ best-in-class defense and some good fortune, with little if any of the improvement the work of anything Hendricks specifically did, especially since his other key rates—strikeout rate, walk rate, home runs allowed—have barely changed year over year. Hendricks’s season is an extreme example, but it shows that several things can be true at the same time:

• Hendricks had a very good season. He was an above-average starter for the Cubs.

• Opposing teams barely scored when Hendricks was on the mound for the Cubs.

• Some portion of the credit for the second point goes to the Cubs as a team rather than Hendricks specifically.

• Going forward, Hendricks is likely to allow more hits and more runs, especially if the Cubs’ defense should change for the worse or Hendricks should end up pitching for another team.

These two challenges, separating the descriptive (what happened) from the predictive (what will happen) and splitting credit for something among different players, are at the heart of many sabermetric debates and define what team analysts do with the piles of data coming their way each day. You can’t discuss the future without understanding what happened in the past, and understanding what happened includes figuring out which parts of the past really tell us something about each player’s true ability or skill. That’s why looking at a pitcher’s ERA or, better, his RA, tells us something, but not everything. We can see from his RA how many runs opposing teams scored while he was responsible for the runners on base, but the picture RA or ERA gives us about the pitcher’s true skill, and thus about how he’s likely to pitch in the future, is too blurry to be all that useful. That’s why, when we’re talking about value produced or making projections for the future, we want to use something that’s a little more precise at carving out what the pitcher did himself from what his teammates did to help or hurt him.

FIP, originally developed by the sabermetrician Tom Tango, predicts next-year ERA better than ERA does, but while it tells us quite a bit about a pitcher’s performance in a specific year, FIP too is imperfect. It replaces a noisy number (the pitcher’s BABIP) with a dead-neutral one, and that may indeed throw out the baby with the bathwater. (My wife and I always sent the bathwater down the drain, but that’s just us.) Over long enough samples, slight abilities by certain pitchers to suppress hits on balls in play can emerge, although the quantity of innings required to confirm this is so large that it’s not terribly useful for baseball decisions. Clayton Kershaw is one such pitcher, with a career BABIP allowed of .271, coming in below the league average in every full season of his career through 2015; Mariano Rivera, who lived by throwing one pitch, a cut fastball, that hitters found very difficult to hit squarely, is another, finishing his illustrious career with a BABIP of .263. Such pitchers will be underrated by FIP, because this skill they have at limiting hits is deleted from the equation entirely.

On the other hand, FIP can also fake us out on pitchers who truly can’t limit hits on balls in play. If you took someone from your local beer league who could throw 80 mph and sent him to pitch in the majors, he’d allow a lot of balls in play, and most of them would be hit hard enough to send fielders running for cover. DIPS theory does not apply to him, and it doesn’t take examples quite that extreme to break with this idea of BABIP regressing to the mean. Glendon Rusch had a short career as a durable fifth starter who threw a ton of strikes but always seemed to give up too many hits, finishing with a career ERA (5.04) well above his FIP (4.29) because his career BABIP, .326, was about 30 points above the league average for the years in which he pitched. In both of these cases, ERA might actually tell us more about the pitcher than FIP or other such stats, called ERA estimators or component ERA metrics, could.

What ERA, or RA, or FIP gives us—and to be honest, I look at all of them when trying to evaluate a pitcher’s performance or consider what his future performance might be—is a rate stat: how much damage did a pitcher allow per nine innings pitched. It was a sensible framework in the days when pitchers finished most of their games, but now nine innings is more of an accepted standard of measurement than a meaningful unit on its own. Perhaps someone should introduce a “metric ERA” that shows runs allowed every ten innings rather than every nine. (Don’t do that. Please.) But regardless of which runs-allowed average you use, it’s still just a rate stat. It’s more useful than simply looking at the bulk total of runs allowed, but ERA or FIP can’t tell us how valuable the pitcher’s performance was because neither includes his innings pitched. To figure out a pitcher’s total value, we’ll have to bring those two variables together to figure out how many runs the pitcher prevented compared to a set standard of performance.
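One way to picture bringing the rate and the innings together is sketched below. The baseline of 4.50 runs per nine innings, the function name, and both pitcher lines are assumptions chosen only to illustrate the idea; this is not the specific method the book goes on to develop.

```python
def runs_prevented(pitcher_ra9, baseline_ra9, innings):
    """Runs saved relative to a baseline pitcher over the same innings.

    Both rates are runs allowed per nine innings; `innings` is a real
    number of innings pitched.
    """
    return (baseline_ra9 - pitcher_ra9) * innings / 9


BASELINE_RA9 = 4.50  # assumed "set standard of performance"

# Two hypothetical pitchers with the same rate but very different workloads.
reliever = runs_prevented(pitcher_ra9=3.00, baseline_ra9=BASELINE_RA9, innings=60)
starter = runs_prevented(pitcher_ra9=3.00, baseline_ra9=BASELINE_RA9, innings=210)
print(f"reliever: {reliever:.1f} runs, starter: {starter:.1f} runs")
```

Identical rate stats, yet the pitcher who threw three and a half times as many innings prevented three and a half times as many runs, which is exactly the information a rate stat alone leaves out.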

More refined data on balls put into play are now making their way to teams from MLB via their Statcast product, which I’ll discuss in a later chapter; these data should help improve our understanding of how much control pitchers might have over the way a ball is put into play, and thus the likelihood that the ball is fielded for an out.

This discrepancy between a pitcher’s ERA and how he might have pitched given neutral support from his defense or his bullpen became particularly clear in the debate over who should win the 2015 NL Cy Young Award, which came down to three candidates having historically great seasons: Jake Arrieta of the Chicago Cubs and Zack Greinke and Clayton Kershaw of the Los Angeles Dodgers. Arrieta won the award, but who you think deserved it depends a lot on how you view ERA and the idea of judging a pitcher on his peripherals rather than his more traditional stats.

There’s a lot going on here, so let me start from the left. You know what the pitcher’s name means, and what ERA is. The opponents’ OBP is just what it implies: the aggregate on-base percentage of all batters that pitcher faced. Expressed as a percentage (.231 = 23.1 percent), it says what percent of opposing batters reached base safely against that pitcher.

K% is strikeout rate, just strikeouts divided by total batters faced. This is better than strikeouts per inning or strikeouts per nine innings because strikeouts are measured in batters, while innings can contain any number of batters from three up.
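A small sketch makes the difference concrete. The two pitchers below are hypothetical: each strikes out nine batters in nine innings, but one of them needs ten more batters to get through those innings.

```python
def strikeouts_per_nine(strikeouts, innings):
    """Traditional K/9: strikeouts scaled to nine innings."""
    return 9 * strikeouts / innings


def strikeout_rate(strikeouts, batters_faced):
    """K%: share of all batters faced who struck out."""
    return strikeouts / batters_faced


# Hypothetical lines: same strikeouts and innings, different batters faced.
efficient = {"strikeouts": 9, "innings": 9.0, "batters_faced": 30}
laborious = {"strikeouts": 9, "innings": 9.0, "batters_faced": 40}

for name, line in (("efficient", efficient), ("laborious", laborious)):
    k9 = strikeouts_per_nine(line["strikeouts"], line["innings"])
    k_pct = strikeout_rate(line["strikeouts"], line["batters_faced"])
    print(f"{name}: K/9 = {k9:.1f}, K% = {k_pct:.1%}")
```

Both pitchers show a K/9 of 9.0, but the one who let more batters reach base struck out a smaller share of the hitters he faced (22.5 percent against 30 percent), which K/9 hides.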

IP is innings pitched, included here just to show that all three pitchers had about the same workload and thus the same opportunities to add value to their teams.

Which of the three pitchers pitched the best (that is, contributed the most value to his team by preventing runs) in that year? The old answer would likely have been Greinke, whose ERA of 1.66 was the lowest in MLB in twenty years, the lowest in a nonstrike season in thirty years. And Greinke undoubtedly pitched well, as even his defense-independent statistics tell us, thanks to his stinginess with walks and his especially low BABIP.

Arrieta pitched almost as well by ERA standards, thanks to a second-half run where he pitched about as well as any pitcher has thrown in any half season in history, posting a 0.75 ERA in 107.1 innings while allowing 55 hits, 23 walks, and 2 home runs against 113 strikeouts.
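To get a feel for how extreme that line is, the sketch below works backward from the figures quoted above. It assumes the usual box-score convention that the digit after the point counts outs, so 107.1 means 107 and one-third innings; the helper name is hypothetical, and because the quoted ERA is rounded, the earned-run figure it implies is approximate.

```python
def innings_from_notation(ip_text):
    """Convert box-score innings such as "107.1" (107 and 1/3) to a real number."""
    whole, _, outs = ip_text.partition(".")
    extra_outs = int(outs) if outs else 0
    if extra_outs not in (0, 1, 2):
        raise ValueError("the digit after the point must be 0, 1, or 2 outs")
    return int(whole) + extra_outs / 3


innings = innings_from_notation("107.1")

# ERA = 9 * earned runs / innings, so earned runs = ERA * innings / 9.
implied_earned_runs = 0.75 * innings / 9
baserunners_per_inning = (55 + 23) / innings  # hits plus walks, per inning
strikeouts_per_walk = 113 / 23

print(f"implied earned runs: {implied_earned_runs:.1f}")
print(f"hits plus walks per inning: {baserunners_per_inning:.2f}")
print(f"strikeouts per walk: {strikeouts_per_walk:.2f}")
```

The implied total comes to roughly nine earned runs across more than a hundred innings, with well under one baserunner per inning.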

But when it comes down to the things we know a pitcher is responsible for, Kershaw wasn’t just the best in the National League in 2015, but one of the best in history. His FIP—again, a rough look at how well a pitcher fared when focusing on strikeouts, walks, and homers—was the ninth-best FIP by any pitcher who qualified for the ERA title since baseball’s live-ball era began in 1921. The fourth-best FIP on that list was also Clayton Kershaw, just the year before. Kershaw led the NL by a huge margin in strikeout rate, allowed fewer walks (on a rate basis) than Arrieta while almost dead even with Greinke, and allowed five more homers than Arrieta and one more than Greinke in a few more innings pitched. Kershaw had a very good case to win this award; Greinke had a solid case; and Arrieta ended up winning the award, with Greinke finishing second and Kershaw finishing third. Arrieta wasn’t a bad choice, but he wasn’t clearly the best one . . . and one has to wonder if his MLB-leading 22 wins played a part in the voting.

That particular Cy Young debate is a convenient example of the complexities of evaluating pitcher performance once you’ve moved beyond pitcher wins and have recognized that ERA doesn’t give us the perfect, single answer. Working with what we know about BABIP, about defense, about luck and randomness, we end up with a story around the pitcher’s performance that’s more nuanced than any single number can express. Teams continue to develop new metrics to try to isolate pitching performance and project pitchers’ values going forward, meaning this is a topic we’ll continue to hear about for quite a few more years to come.

12

WPA: Measuring Clutch, If You Must

I want to detour for a moment from the various stats I’ve presented that help us isolate a player’s production by stripping out context, or stats that help us better project a player’s performance going forward, to look at one stat that is all-descriptive—it tells us more about what happened, and actively considers the context of the player’s performance. The stat, Win Probability Added, is all about context—was the player’s team better or worse off in that particular game based on what the player did?

There’s no existing stat that WPA tries to replace; it grew out of a desire to identify “clutch” performers by looking at which players did the most to swing outcomes of specific games. Such players don’t exist—if you can hit, you can hit in the clutch and the unclutch and everything in between—but the effort gave us a new set of stats, led by WPA, that at least allows us to determine the impact of a specific hit or out on each team’s chances of winning that specific game, and to add up such opportunities for each player over the course of a season.

The good thing about WPA, as opposed to the typical codswallop claims of players who are clutch or who “smell an RBI” or “know when to bear down,” is that WPA is deaf to your excuses. WPA accepts no rationalizations like luck or bad defense or anything of the sort. WPA does not care for your explanations of context or your what-ifs. If a reliever comes in with men on base and gets a couple of hard-hit outs without letting any of those runners score, then his WPA for the game is going to be positive—the closer the game, the more positive it will be. His team had a certain chance of winning the game before he entered, and had a better chance after he was done. You might argue that his defense bailed him out, or that he was lucky those screaming line drives were hit right at the fielders, but WPA just looks at whether his team’s odds of winning the game increased or decreased.

The earliest public recording of a stat like WPA was Player Win Averages, described by brothers Eldon and Harlan Mills and covered in The Hidden Game of Baseball,* although their system involved some arbitrary scoring of win probabilities that we can now calculate more precisely. Win Probability Added, unlike a lot of sabermetric stats, tells you in its name exactly what it’s trying to measure: how much did a player add to his team’s probability of winning a game, or all the games he appeared in over a week or a month or a season?

This turns on its head the sabermetric dogma of distilling the signal of player contributions from the noise of everything else happening on the field: for WPA, we want that context. A home run when your team is down 8–0 only very slightly increases your chances of winning the game, and if it comes with two outs in the ninth, the change is going to be effectively nil. A home run that breaks a tie game increases your team’s chances of winning the game, and the later in the game that this happens, the bigger the increase will be.
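The bookkeeping behind WPA can be sketched in a few lines. Every win probability below is invented for illustration; a real calculation would look each value up in a win-expectancy table keyed on inning, score, outs, and baserunners, and the player labels and function name are hypothetical.

```python
from collections import defaultdict

# (player, his team's win probability before the play, and after it)
# All probabilities are made-up illustrative values.
events = [
    ("Batter A", 0.02, 0.03),    # solo homer while trailing 8-0
    ("Batter B", 0.50, 0.85),    # late homer that breaks a tie
    ("Reliever C", 0.60, 0.80),  # strands inherited runners in a close game
    ("Batter A", 0.55, 0.48),    # makes an out in a different, closer game
]


def win_probability_added(plays):
    """Sum each player's change in his team's win probability across plays."""
    totals = defaultdict(float)
    for player, before, after in plays:
        totals[player] += after - before
    return dict(totals)


wpa = win_probability_added(events)
for player, value in sorted(wpa.items()):
    print(f"{player}: {value:+.2f}")
```

The blowout homer barely registers while the tie-breaking one is worth about a third of a win, even though both count the same in the home run column. A fuller version would also charge the opposite swing to the pitcher or batter on the other side of each play.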

If this sounds like an attempt to measure clutchiness, you’re right—the Mills brothers conceived it that way when looking for evidence of clutch hitting ability. That has not panned out, as WPA has very little predictive value; a player’s WPA in one period of time isn’t a good predictor of what it will be in a subsequent period of time, such as from one year to the next. But as a descriptive statistic—hey, this is what actually happened, so get your head out of the spreadsheet and watch the game—WPA is about as good a measure of a clutch hit, at bat, pitch, or inning as you can find.
