February 3, 2004

Don't do this

Yesterday, I had the unpleasant task of reviewing a study that had taken months and months of hard work by several professors and a bundle of students and that wound up, after all this effort, with less than nothing. I hate to lose a day trying to explain to people that their work for the past year is unpublishable. This problem crops up again and again, and I'm tired of breaking the bad news. It's easy to understand the problem.

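Let's say you're working on a new software widget, and you want to demonstrate that it's useful. A very common approach is to run a small experiment: recruit 20 students, give the widget to half of them, and compare how the two groups fare.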

Now, it's good to actually do the experiment to confirm the widget's value. But this experiment is probably doomed: if the result is going to be statistically significant with 20 subjects, the widget is so good that its value will be obvious from one or two subjects -- or from simple inspection.

Let's say that, over the course of the experiment, we expect about half the students to have a good outcome without the widget: the control group of ten should have about 5 wins, and we hope the group with the widget has lots more. How many wins do we need to be confident the widget is good?

Every single subject needs to win! As a rule of thumb, if you expect n wins, pure chance accounts for a swing of about sqrt(n) wins in either direction. The square root of 5 is a little over 2, so if the control group gets 5 wins this time, next time it might well get 3, or 7. The widget group's count is just as noisy, so to be confident the widget is better, you need about ten wins in the widget group.
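To check the rule of thumb against an exact calculation, here is a minimal Python sketch. It assumes two groups of ten and uses a one-sided Fisher exact test, which is my choice of test for illustration, not something the study in question used. The function name chance_of_gap is made up for this example.

```python
from math import comb


def chance_of_gap(widget_wins, control_wins, group_size):
    """One-sided Fisher exact p-value for two equal-sized groups.

    Returns the probability that, if the widget made no difference, the
    widget group would collect at least `widget_wins` of the wins actually
    observed across both groups.
    """
    total_wins = widget_wins + control_wins
    all_splits = comb(2 * group_size, total_wins)
    p_value = 0.0
    # Sum over every split at least as lopsided as the one observed.
    for k in range(widget_wins, min(group_size, total_wins) + 1):
        p_value += comb(group_size, k) * comb(group_size, total_wins - k) / all_splits
    return p_value


if __name__ == "__main__":
    # Control group: 5 wins out of 10, as in the example above.
    for widget_wins in (7, 8, 9, 10):
        p = chance_of_gap(widget_wins, control_wins=5, group_size=10)
        print(f"widget {widget_wins}/10 vs control 5/10: p = {p:.3f}")
```

Under those assumptions, nine wins out of ten gives p of about 0.07, short of the usual 0.05 cutoff, while ten out of ten gives about 0.016. The exact test and the rule of thumb agree.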

Many people have excellent statistical intuition in sports. Take baseball: imagine a team in spring training that's a solid contender, a team with a good chance to make the playoffs. In other words, you're envisioning a team that you expect to win about 90 games. The Oakland A's, or the San Francisco Giants.

Now, if this team turns out to go 95-67, are we shocked? Not at all! It's a long season; that's why they play so many games. On the other hand, if this team that you expected to be competitive actually winds up having one of the best seasons in history, going 110-52, then you are surprised -- something you didn't expect happened. The square root of 90 is about 9.5: a swing of 5 or 10 wins, we know, might well be luck in a 162-game season; a swing of 20 needs an explanation.

Take a veteran left-fielder who usually hits .300 with 25 home runs. You expect him to get a hit about three times in every ten official at bats; if he bats 500 times in a season, you expect about 150 hits. The square root of 150 is about 12. If he finishes next year with 138 hits (.276) and 20 HR, you say 'He had an off year, but it's probably just bad breaks.' But if he finishes with 126 hits (.252) and 15 HR, you're pretty sure something is wrong. And if he gets 174 hits (.348) with 35 homers, everyone is going to be amazed. Same thing with pitchers: if a #3 starter who usually has nine or ten wins every year gets twelve wins, it could easily be good luck. If he gets 15, you're pretty sure something changed.
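The left-fielder's arithmetic can be sketched in a few lines of Python. The helper name swing_in_sqrt_units is invented for this illustration; it just measures the gap between expected and observed counts in units of sqrt(n).

```python
from math import sqrt


def swing_in_sqrt_units(expected, observed):
    """How far `observed` is from `expected`, measured in sqrt(expected)."""
    return (observed - expected) / sqrt(expected)


if __name__ == "__main__":
    expected_hits = 0.300 * 500  # a .300 hitter over 500 at bats
    for hits in (138, 126, 174):
        units = swing_in_sqrt_units(expected_hits, hits)
        print(f"{hits} hits ({hits / 500:.3f}): {units:+.1f} x sqrt(n)")
```

The 138-hit season comes out about one sqrt(n) below expectation, which is the 'bad breaks' range. The 126-hit and 174-hit seasons are each about two sqrt(n) away, which is where people start looking for an explanation.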

Can Senegal win a World Cup game against, say, France? Sure. Do you fancy them to win it all? Not likely.

Same thing with American football: if you expect n wins, then a difference of sqrt(n) wins is not remarkable. Last year, the Patriots went 9-7. This year, people expected them to be a little better, so pencil in 10-6. They actually went 14-2 (and won the Super Bowl). Next year, if they win 10 games instead of 14, their fans will be sad but not shocked. If they go 7-9, people will all be asking, 'what happened?'

When you set out to prove that your widget is best, the sqrt(n) rule is a terrific rule of thumb to keep in mind. An empirical study is a terrible thing to waste: be sure you have enough subjects.
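For a rough sense of what 'enough' means, the rule can be turned around to estimate a group size. The Python sketch below is a back-of-the-envelope estimate, not a proper power analysis. It assumes the expected gap in wins should be at least two sqrt(n) swings of the control group's count. That multiplier of 2 is my assumption, picked to match the ten-subject example, and the function name subjects_per_group is hypothetical.

```python
from math import ceil


def subjects_per_group(control_rate, widget_rate, swings=2.0):
    """Smallest group size whose expected gap in wins clears the noise.

    Rule-of-thumb sketch: the control group expects n * control_rate wins,
    give or take sqrt(n * control_rate). We ask for the expected gap,
    n * (widget_rate - control_rate), to be at least `swings` times that.
    """
    gap = widget_rate - control_rate
    if gap <= 0:
        raise ValueError("widget_rate must exceed control_rate")
    # n * gap >= swings * sqrt(n * control_rate), solved for n.
    return ceil(swings ** 2 * control_rate / gap ** 2)


if __name__ == "__main__":
    print(subjects_per_group(0.5, 1.0))   # widget makes everyone win
    print(subjects_per_group(0.5, 0.75))  # widget lifts wins from 50% to 75%
```

With a 50% control rate, a widget that makes everyone win needs about 8 subjects per group, in line with the example above. A widget that lifts the win rate to 75% needs about 32 per group, far more than a 20-subject study provides.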