For all those struggling to understand the mean shift algorithm, it's much easier to understand in pictures. Here's an article that explains it with some diagrams: http://eric-yuan.me/continuously-adaptive-shift/
Basically, it goes like this:
1. Select a data point of interest.
2. Draw a circle of a specified radius around the point of interest.
3. Collect all data points within the circle and compute their mean.
4. Move the center of the circle to the mean.
5. Repeat 3 & 4 until convergence. Each iteration will move "uphill" on the density gradient of the data distribution until it reaches the top of the hill (a local maximum).
6. Repeat 1-5 for all data points. Points that converge to the same local maximum are members of the same cluster. The number of clusters is the number of local maxima.
For higher dimensions, replace "circle" with sphere (3-D) or hypersphere (4 & higher dimensions). Obviously, this algorithm depends on a choice of radius, which determines the granularity of the search for local maxima.
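The steps above can be sketched in a few lines of Python. This is a flat-kernel version for a single starting point; the function name, the radius, and the two-blob test data are just for illustration:

```python
import numpy as np

def mean_shift_point(seed, data, radius, max_iter=100, tol=1e-6):
    """Walk one seed point uphill on the data's density (flat kernel)."""
    center = np.asarray(seed, dtype=float)
    for _ in range(max_iter):
        # Step 3: collect all data points within `radius` of the center
        dists = np.linalg.norm(data - center, axis=1)
        inside = data[dists <= radius]
        if len(inside) == 0:
            break  # empty window: nowhere to move
        # Step 4: move the center of the circle to the mean
        new_center = inside.mean(axis=0)
        # Step 5: stop once the center stops moving (a local maximum)
        if np.linalg.norm(new_center - center) < tol:
            break
        center = new_center
    return center

# Two well-separated blobs; a seed started near either blob should
# converge to that blob's density peak, not the other's.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (50, 2)),
                  rng.normal(5, 0.3, (50, 2))])
mode = mean_shift_point([0.5, 0.5], data, radius=1.0)
```

Running this for every seed and grouping seeds whose converged centers coincide (step 6) gives you the clusters.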
Do I understand correctly that you run the converging process on every single point?
Couldn't you have a situation where points really far from a cluster would still converge to that cluster by this method? Something like a long string of points slowly getting closer and closer?
Does the mean shift algorithm have any guarantees on running time and/or the quality of the clustering it finds?
> Do I understand correctly that you run the converging process on every single point?
I don't think you have to run it on every point in the data set. If your data set is very large, you could run it on a random sample of points, or define a regular grid of starting positions at the resolution you require.
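For what it's worth, scikit-learn's MeanShift has a `seeds` argument, so you can start the hill-climb from a subsample rather than from every point. A sketch (the blob positions, sample sizes, and quantile are arbitrary):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.4, (200, 2)),
                  rng.normal(6, 0.4, (200, 2))])

# Start the hill-climb from only 40 randomly chosen points, not all 400.
seeds = data[rng.choice(len(data), size=40, replace=False)]
bandwidth = estimate_bandwidth(data, quantile=0.3)  # the "radius"
ms = MeanShift(bandwidth=bandwidth, seeds=seeds).fit(data)
# ms.cluster_centers_ holds the local maxima the seeds converged to;
# every data point is then labeled by its nearest center.
```

It also has a `bin_seeding` option that does the regular-grid idea for you.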
> Couldn't you have a situation where points really far from a cluster would still converge to that cluster by this method? Something like a long string of points slowly getting closer and closer?
Yes, that's the idea. For each point, you're basically asking "If I start here and keep walking up the density gradient from here until I hit a maximum, where do I end up?" If the shape of the probability density function has a very long ridge, you could end up walking the entire length of the ridge until you hit the highest point. This means that you can have arbitrary-shaped clusters, within the smoothness bounds imposed by your chosen radius. This feature is considered a potential advantage over k-means clustering, which can only produce convex clusters.
> Does the mean shift algorithm have any guarantees on running time and/or the quality of the clustering it finds?
I haven't actually used it in practice, so I don't know.
So I guess the purpose is to find center-of-mass of a large collection of points, without using all those points for calculation (instead using only the ones inside a pre-defined radius)?
Not the center of mass, the point with the greatest mass nearby. Think of a thin stick with a ball on one end. The center of mass, i.e. the point where the stick balances, might be somewhere along the stick, closer to the end with the ball on it. But the point of maximum local density will be at the center of the ball, no matter how long the stick is.
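A small numeric version of the stick-and-ball picture (the positions and radius are made up): the center of mass lands partway along the stick, while the mean-shift walk ends at the middle of the ball.

```python
import numpy as np

# "Thin stick with a ball on one end": 11 sparse points along [0, 10]
# (the stick) plus 100 tightly packed points near x = 0 (the ball).
rng = np.random.default_rng(2)
stick = np.linspace(0.0, 10.0, 11)
ball = rng.normal(0.0, 0.1, 100)
data = np.concatenate([stick, ball])

center_of_mass = data.mean()  # pulled part-way along the stick

# Flat-kernel mean shift started on the stick walks into the ball.
center = 1.5
for _ in range(100):
    window = data[np.abs(data - center) <= 2.0]
    new_center = window.mean()
    if abs(new_center - center) < 1e-9:
        break
    center = new_center
```

Here `center_of_mass` comes out around 0.5 (on the stick), while `center` converges essentially to 0, the middle of the ball.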
Based on this plot, my naive reaction would be to decide that DBSCAN is the best algorithm, as it recovers the (qualitatively) best clusters in each case. Does anybody have experience with DBSCAN? What are the downsides?
This sounds similar to k-means - pick some points "at random" and iteratively move them to the average location of all nearest neighbors until you stabilize. But mean shift goes in the opposite direction, giving each point a window and grouping them on overlaps. Can someone who understands both confirm this is the case, or correct my understanding?
At a high level, we can specify Mean Shift as follows:
1. Fix a window around each data point.
2. Compute the mean of data within the window.
3. Shift the window to the mean and repeat till convergence.
Given an estimate of the mean, for each data point Mean Shift defines a window around it and computes a new estimated mean, weighting each point by the probability density at the previous estimated mean, calculated using the window.
The (weighted) 'mean of the data points within the window' makes sense if you use the other perspective of looking at the window around the current estimated mean. You'll get the same answer, and to me this explanation is easier to grasp: the PDF only depends on the distance between the point and the estimated mean, so you can think of either as the 'center'. But saying you calculate the mean of the data points within the window for each data point mixes up two perspectives and makes no sense.
Nope. It would be correct if it said 'For each estimate of the mean, Mean shift defines a window around it and computes the mean of the data points within the window, then repeats the process with this new estimate.'
Suppose we had 500 data points. Daveguy's process calculates a separate mean for a window around each data point. Now we have 500 means...and?
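For what it's worth, the weighted update being discussed looks something like this with a Gaussian kernel (a sketch; the bandwidth, data, and starting point are arbitrary). There is one running mean estimate, updated repeatedly, not one mean per data point:

```python
import numpy as np

def gaussian_mean_shift_step(mean, data, bandwidth):
    """One update: weight every data point by a Gaussian kernel centered
    on the current mean estimate, then take the weighted average."""
    sq_dists = np.sum((data - mean) ** 2, axis=1)
    weights = np.exp(-sq_dists / (2 * bandwidth ** 2))
    return (weights[:, None] * data).sum(axis=0) / weights.sum()

# A single blob; the estimate starts away from it and walks in.
rng = np.random.default_rng(3)
data = rng.normal([2.0, 2.0], 0.5, (200, 2))
mean = np.array([0.0, 0.0])
for _ in range(50):
    mean = gaussian_mean_shift_step(mean, data, bandwidth=1.0)
```

Nearby points dominate the weighted average, so each step moves the estimate toward the densest region; with a flat kernel the weights are just 0/1 and you recover the "mean of points within the window" version.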