Friday, April 15, 2016

Lunchtime Stats: p-values and Type II error, Part 1

It should not be so hard to publish an effect estimate that is not statistically significant.

Even stronger: Do you think that a non-significant result is unimportant, uninteresting, not actionable, or not worth consideration because it is non-significant? Then you are wrong. And you are contributing to a very large, very preventable problem.

There's been a hubbub about p-values and significance testing for quite a while. It ebbs and flows, but it's always at least a murmur in the background of every social and medical science. Recently, the American Statistical Association's publication of a formal statement about p-values turned up the volume to a dull roar. But that statement is in no way new, or novel, or even more interesting than previous attempts to get scientists to stop using statistical significance as a proxy for reliability, strength, or importance.

Most researchers know enough to ask, "Sure, it's statistically significant, but is it practically significant?" because they understand that the two are different. And they are different.

But that wisdom focuses on only one kind of error, Type I error: the possibility of treating a tiny or even non-existent effect as something important or even real.

The real monster lurking beneath the boat is Type II error: the possibility of rejecting evidence of a real effect because it did not meet your criteria for good evidence.
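
To make that concrete, here's a minimal sketch in R. The effect size, sample size, and alpha are made-up numbers, purely for illustration:

# a real effect of half a standard deviation, studied with 20 subjects per group
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)

The power comes out to roughly a third, which means roughly two out of every three such studies will fail to reach p < .05 even though the effect is real. Every one of those non-significant results is a Type II error waiting to happen.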

Can someone put a dollar amount on the resources that generate research results that never get published? And then express that as a percentage of resources that are spent on research generally? What would that percentage be, do you think? And what should it be?

I contend that if you think that percentage should be less than 80%, then we have a huge problem on our hands, and null hypothesis significance testing is to blame.

Wednesday, April 13, 2016

The Importance of Scripting

I have lost so many analyses to the wind. In the flow state of working through a dataset and finding patterns, it's easy to forget to document what you're doing. Documenting requires breaking the flow. It feels like a bad kind of slowing down.

The first time I discovered Sweave in R, it was like finding a new continent. Suddenly there was a way to work quickly, to save the work and the results, and to not have to document as you go: you can go back and document later, when you check it all for errors.

(I've since started using R Markdown for everything, for a few different reasons, but Sweave gave me my revelation about the importance of scripting...)

If you don't know R or Sweave, here's what I'm talking about:

You want to test a correlation between X and Y so you use your pull-down menu in SPSS or JMP or what have you and specify your variables and execute it. You get a scatterplot with a trendline and a Pearson's r with degrees of freedom and a t-test of that r with a p-value.

In R you do the same thing by typing this at the R command line:

set <- read.csv("data.csv")   # read the data
X <- set$X                    # the two variables to correlate
Y <- set$Y
plot(Y~X)                     # scatterplot
abline(lm(Y~X))               # add the least-squares trendline
cor.test(X,Y)                 # Pearson's r, df, t-test, and p-value

This also gives you the scatterplot (the plot command), the trendline (the abline command) and the correlation with a statistical test (the cor.test command).

Using Sweave, you make a text file (called "cor.Rnw") and type that same code into it. The Sweave document is a LaTeX document, so it needs a few LaTeX markup commands to get it going, but it looks like this:

\documentclass{article}
\begin{document}
<<cor, fig=TRUE>>=
set <- read.csv("data.csv")
X <- set$X
Y <- set$Y
plot(Y~X)
abline(lm(Y~X))
cor.test(X,Y)
@
\end{document}

Now what this does is amazing. Run this file through Sweave: in R type Sweave("cor.Rnw"). This will make a file called cor.tex. Now run cor.tex through LaTeX (type "pdflatex cor" at the terminal). This will make a formatted pdf with your scatterplot and the results of your correlation. And it will include the commands you typed to get those results.
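
If you'd rather not leave R for the LaTeX step, something like this should also work (a sketch, assuming you have a LaTeX installation that R can find):

Sweave("cor.Rnw")            # cor.Rnw -> cor.tex
tools::texi2pdf("cor.tex")   # cor.tex -> cor.pdf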

So now you can go back and edit your Sweave file:


\documentclass{article}
\begin{document}
I wanted to correlate X and Y to see if they were related.
<<cor, fig=TRUE>>=
set <- read.csv("data.csv")
X <- set$X
Y <- set$Y
plot(Y~X)
abline(lm(Y~X))
cor.test(X,Y)
@
\end{document}

And voilà! You've documented your analysis, and you can re-run it any time. You've made a note to yourself and others about why you did what you did. The results are formatted and presented with the code that generated them. If the data change, you can rerun the script and get new output without having to redo anything. And so on and so on and so on.
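
For the curious: the R Markdown version I mentioned above looks much the same. Here's a rough sketch (the YAML header and chunk label are just one way to set it up):

---
title: "Correlation of X and Y"
output: pdf_document
---

I wanted to correlate X and Y to see if they were related.

```{r cor}
set <- read.csv("data.csv")
X <- set$X
Y <- set$Y
plot(Y~X)
abline(lm(Y~X))
cor.test(X,Y)
```

Save that as cor.Rmd and run rmarkdown::render("cor.Rmd") to get the same kind of PDF.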

Why anyone still uses their drop-downs and saves only the results is beyond me. Your analysis is going to get lost in the wind. And in 6 months when someone asks you where that graph came from and can you try it on a new dataset, you'll have to scratch your head and do it all over again.

Tuesday, April 12, 2016

Lunchtime Stats: Cluster Analysis

Cluster analysis falls easily into the "dark arts" category of statistics. Analytic judgement will influence your results. So be wise and be wary, think hard and question everything.

There are lots of types of cluster analyses, but they all promise the same thing: to group your observations meaningfully. The frustrating part is that your mind does cluster analysis effortlessly, but getting an algorithm to do it is very hard. Plot your data first and choose an algorithm that matches what your eye sees.

Until the computers take over, getting it right will require a lot of back and forth between human and algorithm. For the algorithm, that mostly means making mistakes, and algorithms are all too happy to oblige. For the human, it means learning which types of cluster analysis make which types of mistakes, getting good at spotting those mistakes, and picking a better algorithm. After a few rounds of that back and forth, we can get good at clustering.

Today I'm thinking about k-means clustering. This is a pretty simple model. Give your algorithm multivariate data and a number of clusters. It will pick random starting points for the center of each cluster, let each cluster "capture" all the data points closest to it, compute the distances between each data point and its closest cluster center, then move each center to the middle (the mean) of the points it captured, repeating the capture-and-move steps until those distances can't get any smaller.

All this means that a k-means cluster analysis can give you a somewhat different solution every time (because of the random starting points). It will always give you the number of clusters you ask for (no matter what the data look like). It tends to produce clusters of roughly similar size (because every point simply goes to its nearest center). And it can really only find clusters that are roughly round clouds of data points with dense centers and diffuse edges.

This guy wrote a good post on the topic:
http://varianceexplained.org/r/kmeans-free-lunch/

Always plot your data before deciding what kind of cluster analysis to run. If your data look diffuse and clumpy, k-means might be right for you, especially if the clusters are all about the same size. But remember to seed your random number generator and try it a few times with different seeds to make sure you aren't treating a particular happenstance as the best model of your data.
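
Here's a minimal sketch of that whole workflow in R, with made-up data (the two clusters, their locations, and the nstart value are all invented for illustration):

set.seed(1)                                    # seed the random number generator
fake <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
              matrix(rnorm(100, mean = 4), ncol = 2))
colnames(fake) <- c("X", "Y")
plot(fake)                                     # look at the data before clustering
fit <- kmeans(fake, centers = 2, nstart = 25)  # nstart = 25 keeps the best of 25 random starts
plot(fake, col = fit$cluster)                  # color points by assigned cluster
points(fit$centers, pch = 8, cex = 2)          # mark the fitted centers
fit$tot.withinss                               # total within-cluster sum of squares

Change the seed (or bump up nstart) and run it again; if the clusters move around much between runs, don't trust them.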