
Statistical Methods for Research Workers, 1925
You've probably never heard of Ronald Fisher, one of the founders of modern research methods, but his work and ideas have affected nearly every statistical study -- and the way we understand the world.
Transcript
What does it take for you to be sure? Nobody likes to admit they鈥檙e wrong, particularly on any matter of real importance, though no matter how certain you are about something, in the background there鈥檚 often that private calculation and the tiniest of doubt.
Eventually, however, there does come a point when it鈥檚 just obvious. In lots of cases, our decisions have minor impacts; deciding when you think you鈥檝e made a wrong turn and have to double back, or that maybe that shirt doesn鈥檛 go with that jacket after all, are usually forgotten pretty quickly.
For really major stuff 鈥 like deciding whether or not to accept the result of a scientific research project 鈥 you鈥檇 imagine that there was a much more structured and rigorous process to make these decisions. And you鈥檇 be right, though the origin of one of the critical components of that process is a bit on the murky side, emerging decades ago in a surprisingly offhand manner. It was likely crystallized, unintentionally, by one of the founders of modern research methods, whose work and ideas you鈥檝e probably never heard of but who has had an effect on nearly every statistical study, and the way we understand the way we understand the world, for nearly a century.
I鈥檓 Joe Janes of the University of Washington Information School. Ah yes, statistics, everybody鈥檚 favoritest and most fascinating topic. For those of you who haven鈥檛 already run screaming, there鈥檚 an important story here and I鈥檒l spare you most of the gory numerical details.
Ronald Fisher, by all accounts, was a man of great personal charm and a loyal friend, and also quick to anger especially when he perceived error or misrepresentation in an argument, which led to famous feuds with other early statisticians. He was nearly blind from a young age, married in secret because his 16-year-old wife鈥檚 family disapproved, and taught in public schools for a while, though as one source puts it, 鈥渘either he nor his pupils enjoyed the experience.鈥 His early work, typical for the emerging world of experimental design, was in agriculture, leading him to interest in genetics and, eventually, eugenics, co-founding the Cambridge University Eugenics Society in 1911.
Here鈥檚 the basic idea. Let鈥檚 say you wanted to find out whether people have better memory of things they read on paper or on a tablet. You recruit a bunch of people, and have half of them read a chapter of a physical book, and the other half read the same thing on an iPad. You then test their memories for facts in the chapter. Whichever group has the higher score gives you your answer. Simple, yes?
Well, maybe. The higher score might indicate that people in general remembered better in one mode or the other. Or, it might be that the people in one group happened to be better readers or have better memories, or had a natural preference (or hatred) of the format you gave them, or were just having a crummy day. How could you separate any real difference from these other factors? One way would be to randomly assign people to the two groups to mix all those things up. OK, that鈥檚 fine, so now you get to this question: how much higher would you want scores for one group to be to be sure that the difference was really because of the format and not just random chance?
That question was very much on the minds of Fisher and his colleagues. A number of ideas were swirling around, as the field was coalescing, using a variety of techniques and guidelines to try to identify what seemed to be the right level of certainty that a result was real. The underlying concepts of statistics are embedded in probability theory, first explored seriously in the 16th and 17th centuries to better understand games of chance, though early concepts date back to ancient dice games using animal bones.
Fisher鈥檚 book wasn鈥檛 intended to be a textbook, at least not in a classroom setting; it was meant rather to collect, popularize and spread the considerable number of threads of work in statistics which had been developing in earnest for a few decades. It emerged in the same year as The New Yorker, The Great Gatsby, and Mein Kampf, the Scopes Trial, Art Deco, Mount Rushmore, and an August march of 40,000 KKK members down Pennsylvania Avenue in Washington.
Ultimately this question was framed as how much likelihood, what probability of a result being wrong, we should be willing to live with. In the midst of this, along comes Fisher, who seemingly tosses this off on page 79: 鈥渋n practice we鈥ant to know鈥hether or not the observed value is open to suspicion鈥. We shall not often be astray if we draw a conventional line at .05鈥.鈥 A year later, in another paper, he reinforces this: 鈥淧ersonally, the writer prefers to set a low standard of significance at the 5 per cent. point, and ignore entirely all results which fail to reach this level.鈥 Modern researchers would refer to this as a p level of .05, a 5% probability that a research result doesn鈥檛 indicate a real effect but rather comes from some random source.
If that sounds a bit casual to be the lodestone of the means by which we decide what studies to believe and which ones we don鈥檛, you鈥檙e not alone. There鈥檚 no evidence that Fisher intended this dictum to be universal or definitive or exclusive, but there it is. Indeed, if you look a bit more deeply, he does provide a bit more context. The 1926 paper continues: 鈥淎 scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.鈥 Yet here we are. In the vast majority of settings, if you want a quantitatively based science or social science research study published, and you don鈥檛 hit that 5 percent mark, it鈥檚 an uphill battle. Moreover, many won鈥檛 even submit such results for publication, since who wants to be seen as a failure?
Important consequences emerge from all this. First of all, if you have a hard and fast rule, it鈥檚 hard and fast, so a study result that just makes it across the line, winding up with a p value of .0499, gets to be called 鈥渟tatistically significant,鈥 and one that falls just short, with .0501, doesn鈥檛. It鈥檚 quite likely that, without Fisher鈥檚 writings, some similar setup would have been developed, but it鈥檚 safe to say that this standard has impacted profoundly the development of quantitative research, and who can say how many potentially interesting and valuable findings never saw the light of day because they just missed the mark?
Also, consider this. If each statistically significant result has a 5 percent chance of being spurious, then it follows that 5 percent of all the 鈥渟tatistically significant鈥 results that have been published aren鈥檛 real, they鈥檙e due to random chance. I hasten to point out that studies in the medical and pharmaceutical fields often have considerably more stringent thresholds, 1 percent or even a tenth of a percent, since health and safety are involved. Still, this means that some small fraction of findings鈥ren鈥檛 findings at all.
Think of that 5 percent as the likelihood of a false positive on a test, or as one often hears in beginning stats classes, the likelihood of convicting an innocent person (鈥渂eyond a reasonable doubt鈥). We know these happen, and we want to minimize those chances, and yet you don鈥檛 want to require so much certainty that you wind up missing something potentially interesting. This balance between screening false positives and false negatives is very important to researchers, and the focus on p values reveals how they feel about it. There鈥檚 a firm line on false positives, and yet in many cases there鈥檚 no calculation or even consideration of the probability of a false negative, of failing to see a real effect. The scholarly community wants to be sure that results in the literature are credible, sometimes to the exclusion of results that might otherwise be found.
It could, however, go the other way around. Today, some researchers are increasingly of the opinion that this significance level business is less important than a focus on how big an observed effect is; if people remembered things twice as well when reading from a book than an e-book, that might be worth paying attention to, rather than a smaller effect we鈥檙e more sure about. That would give us a different kind of scholarly record, and would require a different way of reading and thinking about and navigating the literature, perhaps with a greater desire and necessity for replication, and the accompanying effort and money and time that would require.
Neither approach is inherently better; they鈥檙e just different. You could even imagine combining the two somehow, factoring in the size of effects and certainty in some hybrid way. Is that possible? Could it happen? Getting to agreement on something like that would likely take more than just a line or two in a book, so it would appear that here, as in so many cases, the only thing we can truly be certain of is鈥aybe.