### ABOUT UNSUPERVISED LEARNING

In supervised learning problems we deal with labeled data: during training we give our machine both the Xs and the Ys, and by training we ‘forge’ a system that, given new Xs, is able to guess the output Ys. By using better algorithms and more sophisticated methods, we can make the forged system more precise, and make it faster.

However, in many circumstances what we have is unlabeled data: the data has only Xs but no Ys, and this is why we need unsupervised learning. In unsupervised learning, our goal is to find the structure of the data, or in the data, because sometimes that structure is not obvious.

Clustering and Hidden Markov Models are among the best-known unsupervised learning methods, and here we are going to talk about one of the clustering methods: K-Means.

### ALGORITHM

Compared with the algorithms in the previous several articles, K-Means is remarkably easy: all it does is split the data into **K** parts (or clusters), and make sure every single point is not too far from the centroid of the cluster it belongs to.

The objective is the Distortion Function

**J(c, μ) = (1/m) · Σ_{i=1..m} ‖x^(i) − μ_(c^(i))‖²**

in which **m** is the number of samples; **c** denotes the cluster assignments, so **c^(i)** is the cluster that the **i**-th sample belongs to; **μ_k** is the centroid of the **k**-th cluster, so **μ_(c^(i))** is the centroid of the cluster that sample **i** was assigned to. What we need to do is find the assignments **c** and centroids **μ** that minimize **J**.
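To make the formula concrete, here is a minimal sketch of the computation in plain C++ (the function name `distortion` is my own, and it works on 1-D samples for simplicity):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Distortion J for 1-D samples: the mean squared distance from each
// sample x[i] to the centroid mu[c[i]] of the cluster it belongs to.
double distortion(const std::vector<double>& x,
                  const std::vector<int>& c,
                  const std::vector<double>& mu) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double d = x[i] - mu[c[i]];
        sum += d * d;
    }
    return sum / x.size();
}
```

With two tight clusters centered at 0.5 and 9.5, the samples {0, 1, 9, 10} give J = 0.25; a worse assignment (or worse centroids) makes J larger.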

Things are pretty simple and clear enough now, so the algorithm is something like:

**Randomly initialize K centroids.**

**Loop until convergence {**

**1. For every sample, find which cluster it belongs to, by comparing the distance between it and each cluster's centroid.**

**2. For every cluster, re-calculate its centroid.**

**}**

Here, convergence means that none of the centroids moves any more, and the K-Means algorithm is guaranteed to converge.

### CODE

Here is a C++ implementation using Armadillo.

```cpp
#include <cstdlib>
#include <vector>
#include <armadillo>

using namespace std;
using namespace arma;

// Convert a vector-of-vectors into an Armadillo matrix (one sample per row).
mat vec2mat(vector<vector<double> > &vec){
    int rows = vec.size();
    int cols = vec[0].size();
    mat A(rows, cols);
    for(int i = 0; i < rows; i++){
        for(int j = 0; j < cols; j++){
            A(i, j) = vec[i][j];
        }
    }
    return A;
}

// Euclidean distance between two row vectors.
double getDistance(const rowvec &a, const rowvec &b){
    return norm(a - b, 2);
}

// Index of the centroid that is nearest to pt.
int which_is_nearest(const vector<rowvec> &centroids, const rowvec &pt){
    double minDistance = 0;
    int minLabel = 0;
    for(int i = 0; i < (int)centroids.size(); i++){
        double tempDistance = getDistance(centroids[i], pt);
        if(i == 0 || tempDistance < minDistance){
            minDistance = tempDistance;
            minLabel = i;
        }
    }
    return minLabel;
}

// Distortion: sum of squared distances from every sample to the
// centroid of the cluster it belongs to.
double getDistortionFunction(const mat &data, vector<vector<int> > &cluster,
                             vector<rowvec> &centroids){
    double sumDistortion = 0.0;
    for(int k = 0; k < (int)cluster.size(); k++){
        for(int j = 0; j < (int)cluster[k].size(); j++){
            double temp = getDistance(data.row(cluster[k][j]), centroids[k]);
            sumDistortion += temp * temp;
        }
    }
    return sumDistortion;
}

double kmeans(vector<vector<double> > &input_vectors, int K){
    mat data = vec2mat(input_vectors);
    int nSamples = data.n_rows;
    int nDim = data.n_cols;
    // Randomly pick K samples as the initial centroids.
    // (Note: rand() % nSamples may pick the same sample twice.)
    vector<rowvec> centroids;
    for(int k = 0; k < K; k++){
        int rand_int = rand() % nSamples;
        centroids.push_back(data.row(rand_int));
    }
    // Iteratively find better centroids.
    vector<vector<int> > cluster(K);
    while(true){
        for(int k = 0; k < K; k++) cluster[k].clear();
        bool converge = true;
        // Step 1: for every sample, find which cluster it belongs to,
        // by comparing the distance between it and each cluster's centroid.
        for(int m = 0; m < nSamples; m++){
            int which = which_is_nearest(centroids, data.row(m));
            cluster[which].push_back(m);
        }
        // Step 2: for every cluster, re-calculate its centroid.
        for(int k = 0; k < K; k++){
            int cluster_size = cluster[k].size();
            if(cluster_size == 0) continue;   // keep the old centroid for an empty cluster
            rowvec vectemp = zeros<rowvec>(nDim);
            for(int i = 0; i < cluster_size; i++){
                vectemp += data.row(cluster[k][i]) / (double)cluster_size;
            }
            if(getDistance(centroids[k], vectemp) >= 1e-6) converge = false;
            centroids[k] = vectemp;
        }
        if(converge) break;
    }
    return getDistortionFunction(data, cluster, centroids);
}
```

### WHAT’S MORE

1. Because the initial values of the K centroids are chosen randomly, sometimes we choose them poorly, and that makes our result look funny: after convergence, we may find that some of the clusters have very, very few points, or are even empty. One reason may be outliers in our data; however, this can also happen simply because we chose the initial values poorly. So just in case, we can run the algorithm several times, randomly re-initializing the centroids each time, and keep the run with the lowest distortion.
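The multiple-restart idea above can be sketched as a small wrapper; `best_of_n_runs` is my own name, and it takes any callable that performs one full K-Means run and returns its final distortion (e.g. a lambda wrapping the `kmeans` function from the CODE section):

```cpp
#include <cassert>
#include <functional>
#include <limits>

// Run a clustering routine n_runs times (each run re-initializes its
// centroids randomly) and keep the smallest distortion seen.
double best_of_n_runs(const std::function<double()>& run_once, int n_runs) {
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < n_runs; ++r) {
        double d = run_once();
        if (d < best) best = d;
    }
    return best;
}
```

For example, `best_of_n_runs([&]{ return kmeans(data, 6); }, 10)` would report the lowest distortion over ten random initializations.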

2. For the K-Means algorithm, how to choose K is a big issue. In Andrew Ng’s online course, he said that we can run the algorithm several times with different K values, and decide what K value to use in the particular circumstance after seeing the **k-distortion** curve. Let me explain it using an example.

I used a dataset of 3,000 samples, each a randomly generated 3-d point. I ran the algorithm with K values from 1 to 51, and these are the corresponding distortion values:

We can see that the distortion decreases as K increases (when K equals 1 there is only one cluster, meaning its centroid is just the centroid of all the points, so the distortion must be huge); this figure is the **k-distortion curve**.

What Andrew Ng said was to find the ‘elbow’ of this curve, and use the K value at the ‘elbow’ point. In this case, I chose 6 as my K value. This is the result of my K-Means clustering with K = 6.
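Picking the ‘elbow’ by eye is the usual advice; if you want to automate it, one simple heuristic (my own sketch, not from the course) is to pick the K where the k-distortion curve bends the most, i.e. where its second difference is largest:

```cpp
#include <cassert>
#include <vector>

// One simple "elbow" heuristic: return the K whose second difference
// (discrete curvature) of the k-distortion curve is largest.
// distortions[j] holds the distortion for K = j + 1.
int elbow_k(const std::vector<double>& distortions) {
    int best_k = 2;           // K = 2 is the first K with a second difference
    double best_curve = -1.0;
    for (std::size_t j = 1; j + 1 < distortions.size(); ++j) {
        double curve = (distortions[j - 1] - distortions[j])
                     - (distortions[j] - distortions[j + 1]);
        if (curve > best_curve) { best_curve = curve; best_k = (int)(j + 1); }
    }
    return best_k;
}
```

This only works when the curve has a clear bend; on smooth curves, eyeballing the plot (as Andrew Ng suggests) is still the practical choice.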

These figures were made using GNU Octave.

🙂