Wednesday, April 13, 2016

The Importance of Scripting

I have lost so many analyses to the wind. In the flow state of working through a dataset and finding patterns, it's easy to forget to document what you're doing. Documenting requires breaking the flow. It feels like a bad kind of slowing down.

When I first discovered Sweave in R, it was like finding a new continent. Suddenly there was a way to work quickly, to save the work and the results, and to not have to document as you go: you can go back and document when you check it all for errors later.

(I have since started using R Markdown for everything, for a few different reasons, but Sweave gave me my revelation about the importance of scripting...)

If you don't know R or Sweave, here's what I'm talking about:

You want to test a correlation between X and Y, so you use the pull-down menu in SPSS or JMP or what have you, specify your variables, and execute it. You get a scatterplot with a trendline, plus a Pearson's r with degrees of freedom and a t-test of that r with a p-value.

In R you do the same thing by typing this at the R command line:

set <- read.csv("data.csv")
X <- set$X
Y <- set$Y
plot(Y~X)
abline(lm(Y~X))
cor.test(X,Y)

This also gives you the scatterplot (the plot command), the trendline (the abline command) and the correlation with a statistical test (the cor.test command).
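A script also lets you keep the test result as an object and pull out the individual numbers later. Here is a minimal sketch of that idea. The simulated data frame (sim) is my stand-in for data.csv so the example runs anywhere, and the names sim and result are just for illustration:

```r
# Simulated stand-in for data.csv (an assumption, so the example runs anywhere)
set.seed(42)
sim <- data.frame(X = rnorm(50))
sim$Y <- 0.5 * sim$X + rnorm(50)

# Keep the test result as an object instead of only printing it
result <- cor.test(sim$X, sim$Y)

result$estimate    # Pearson's r
result$parameter   # degrees of freedom (n - 2)
result$statistic   # t statistic for the test of r
result$p.value     # p-value
```

These are the same pieces the menu-driven output shows you, except now they are saved and can be reused in later steps.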

Using Sweave, you make a text file (called "cor.Rnw") and type those same commands into it. The Sweave document is a LaTeX document, so it needs some LaTeX markup commands to get it going, but it looks like this:

\documentclass{article}
\begin{document}
<<cor, fig=TRUE>>=
set <- read.csv("data.csv")
X <- set$X
Y <- set$Y
plot(Y~X)
abline(lm(Y~X))
cor.test(X,Y)
@
\end{document}

Now what this does is amazing. Run this file through Sweave: in R, type Sweave("cor.Rnw"). This makes a file called cor.tex. Now run cor.tex through LaTeX (type "pdflatex cor" at the terminal). This makes a formatted PDF with your scatterplot and the results of your correlation, along with the commands you typed to get those results.
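If you'd rather not leave R for the LaTeX step, the two steps can be wrapped in one function. This is only a sketch: build_report is a hypothetical helper name of mine, not part of R or Sweave. It assumes the .Rnw file sits in the working directory and, for the PDF step, that a LaTeX installation is available:

```r
# build_report() is a hypothetical helper, not part of R or Sweave.
# It assumes the .Rnw file is in the current working directory.
build_report <- function(rnw, pdf = TRUE) {
  Sweave(rnw)                                   # run the R code and write the .tex file
  tex <- sub("\\.Rnw$", ".tex", basename(rnw))  # e.g. "cor.Rnw" becomes "cor.tex"
  if (pdf) {
    tools::texi2pdf(tex)                        # run LaTeX on the .tex file (needs LaTeX installed)
  }
  tex                                           # return the name of the .tex file
}
```

With that defined, build_report("cor.Rnw") should leave cor.tex and cor.pdf in the working directory. Passing pdf = FALSE stops after the Sweave step.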

So now you can go back and edit your Sweave file:


\documentclass{article}
\begin{document}
I wanted to correlate X and Y to see if they were related.
<<cor, fig=TRUE>>=
set <- read.csv("data.csv")
X <- set$X
Y <- set$Y
plot(Y~X)
abline(lm(Y~X))
cor.test(X,Y)
@
\end{document}

And voilà! You've documented your analysis, and you can rerun it any time. You've made a note to yourself and others about why you did what you did. The results are formatted and presented with the code that generated them. If the data changes, you can rerun the script and get new output without having to redo anything. And so on and so on and so on.

Why anyone still uses their drop-downs and saves only the results is beyond me. Your analysis is going to get lost in the wind. And in six months, when someone asks you where that graph came from and whether you can try it on a new dataset, you'll have to scratch your head and do it all over again.
