Tuesday, April 12, 2016

Lunchtime Stats: Cluster Analysis

Cluster analysis falls easily into the "dark arts" category of statistics. Analytic judgement will influence your results. So be wise and be wary, think hard and question everything.

There are lots of types of cluster analysis, but they all promise the same thing: to group your observations meaningfully. The frustrating part is that your eye can pick out clusters almost instantly, while getting an algorithm to do the same is very hard. Plot your data first and choose an algorithm that matches what your eye sees.

Until the computers take over, the process will require a lot of back and forth between humans and algorithms to get it right. For the algorithm, of course, that means making mistakes. Algorithms are all too happy to oblige. But for the human it means figuring out which types of cluster analysis make which types of mistakes and getting good at detecting those mistakes and picking a better algorithm. After a few tries back and forth, we can get good at clustering.

Today I'm thinking about k-means clustering. This is a pretty simple model. Give your algorithm multivariate data and a number of clusters. It will pick random starting points for the center of each cluster, let each center "capture" all the data points closest to it, then move each center to the mean of the points it captured. It repeats those two steps (capture, then move) until the centers stop moving, at which point neither step can make the total squared distance between the data points and their closest centers any smaller.
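Here is a minimal sketch of that capture-and-move loop in plain Python, assuming the standard textbook version of the algorithm (often called Lloyd's algorithm). The names kmeans, assign, squared_distance, and inertia are my own choices for illustration, and the six data points are made up. A real analysis would normally use a library implementation instead.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Capture step: label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points]

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random starting centers drawn from the data
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Move step: the center becomes the mean of its captured points.
                new_centers.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centers.append(centers[j])  # an empty cluster keeps its old center
        if new_centers == centers:  # nothing moved, so the loop has converged
            break
        centers = new_centers
    labels = assign(points, centers)
    # Inertia: total squared distance from each point to its own center.
    inertia = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, inertia

# Made-up example data: two small, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels, inertia = kmeans(points, k=2, seed=0)
print(centers, labels, inertia)
```

The only randomness is in the starting centers. Everything after that is deterministic, which is why fixing the seed makes a run repeatable.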

All this means that a k-means cluster analysis can give you a somewhat different solution every time (because of the random starting points). It will always give you the number of clusters you ask for (no matter what the data look like). It tends to give you clusters that take up about the same amount of space (because every point simply goes to its nearest center, no matter how spread out each group really is). And k-means works best on clusters that are roughly round clouds of data points with dense centers and diffuse edges; it struggles with elongated, ring-shaped, or nested groups.
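The "about the same size" problem is easy to see with a single capture step. The sketch below uses made-up one-dimensional data and a hypothetical helper, nearest_center. Even with each center placed exactly on its group's true mean, the edge of the wide group is closer to the tight group's center and gets captured by it.

```python
def nearest_center(x, centers):
    """Return the index of the center closest to x (the k-means capture rule)."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))

# Made-up 1-D data: a wide group spread around 0 and a tight group near 10.
wide = [-6, -4, -2, 0, 2, 4, 6]
tight = [9.8, 10.0, 10.2]

# Put each center at its group's mean, which is the friendliest possible start.
centers = [sum(wide) / len(wide), sum(tight) / len(tight)]

labels_wide = [nearest_center(x, centers) for x in wide]
labels_tight = [nearest_center(x, centers) for x in tight]
print(labels_wide)   # the point at 6 is captured by the tight group's center
print(labels_tight)
```

The boundary between two centers always sits halfway between them, so a spread-out group next to a compact one loses its edge points to the neighbor.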

David Robinson wrote a good post on the topic:
http://varianceexplained.org/r/kmeans-free-lunch/

Always plot your data before deciding what kind of cluster analysis to run. If your data look diffuse and clumpy, k-means might be right for you, especially if the clusters are all about the same size. But remember to seed your random number generator and try it a few times with different seeds to make sure you aren't treating a particular happenstance as the best model of your data.
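A rough sketch of the seed advice, assuming a stripped-down one-dimensional k-means written just for this illustration (kmeans_1d and the nine data values are made up). It runs the same analysis under ten seeds and compares the total squared distance of each run, which is usually called the inertia.

```python
import random

def kmeans_1d(values, k, seed, max_iter=100):
    """Tiny 1-D k-means: returns (sorted centers, inertia) for one seeded run."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # the seed fixes the random starting centers
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Move each center to its group's mean; an empty group keeps its center.
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min((v - c) ** 2 for c in centers) for v in values)
    return sorted(centers), inertia

# Made-up data with three obvious groups.
values = [0, 1, 2, 10, 11, 12, 20, 21, 22]

# One run per seed, so every result can be reproduced later.
runs = {seed: kmeans_1d(values, k=3, seed=seed) for seed in range(10)}
best_seed = min(runs, key=lambda s: runs[s][1])

for seed, (centers, inertia) in runs.items():
    print(seed, centers, round(inertia, 2))
print("lowest inertia at seed", best_seed)
```

If the runs disagree, the one with the lowest inertia is the best fit by k-means' own yardstick, but it is still worth plotting to see whether it matches what your eye sees.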
