This post is about a new clustering algorithm published by Alex Rodriguez and Alessandro Laio in the latest issue of Science. The method is short and efficient; I implemented it in only about 100 lines of C++ code.

### BASIC METHOD

There are two key quantities in this method: the local density of each point, and its minimum distance to any point with higher density.

**Rho** is the local density:

$\rho_i = \sum_j \chi(d_{ij} - d_c)$, where $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise,

and **dc** is a cutoff distance. **Rho** is basically equal to the number of points that are closer than **dc** to point **i**. The algorithm is sensitive only to the relative magnitude of **rho** at different points, implying that, for large data sets, the results of the analysis are robust with respect to the choice of **dc**.

The authors say that we can choose **dc** so that the average number of neighbors is around **1** to **2%** of the total number of points in the data set (I used 1.6% in my code).
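One way to implement this rule directly, instead of growing dc step by step, is to sort all pairwise distances and take the one at the desired quantile q, so that each point ends up with roughly q·N neighbors within dc on average. This is only a sketch under that assumption; `pickDcByQuantile` and its 2D signature are my own, not from the paper:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Pick dc as the q-quantile of all pairwise distances (2D points given as
// parallel x/y arrays). With q around 0.016, each point then has on average
// about 1.6% of the other points within distance dc.
double pickDcByQuantile(const std::vector<double>& xs,
                        const std::vector<double>& ys,
                        double q) {
    std::vector<double> dists;
    for (size_t i = 0; i + 1 < xs.size(); ++i)
        for (size_t j = i + 1; j < xs.size(); ++j)
            dists.push_back(std::hypot(xs[i] - xs[j], ys[i] - ys[j]));
    std::sort(dists.begin(), dists.end());
    size_t k = static_cast<size_t>(q * dists.size());
    if (k >= dists.size()) k = dists.size() - 1;  // clamp for large q
    return dists[k];
}
```

This avoids the incremental `dc += 0.03` search entirely, at the cost of sorting N(N-1)/2 distances.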

**Delta** is measured by computing the minimum distance between the point **i** and any other point with higher density:

$\delta_i = \min_{j : \rho_j > \rho_i} d_{ij}$

and for the point with the highest density, we simply let:

$\delta_i = \max_j d_{ij}$

The following figure illustrates the basic algorithm:

The left-hand figure shows the dataset we use (28 points in a 2D space). Most of the points belong to two clusters, blue and red, and there are 3 outliers marked with black circles. By calculating rho and delta for each point in the dataset, we obtain the right-hand figure (the decision graph): the x-axis is the local density, and the y-axis is the minimum distance between a point and any other point with higher density.

Our target is to find points that have both high rho and high delta (in this example, points 1 and 10); they are the cluster centers. We can see that the three outliers (26, 27, 28) have high delta but very low rho.
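The paper also hints that centers can be picked programmatically by ranking points by gamma_i = rho_i * delta_i in decreasing order. A minimal sketch of that ranking (the function name `rankByGamma` is mine, not from the paper):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Return point indices sorted by gamma_i = rho_i * delta_i, descending.
// The indices at the front (largest gamma) are the cluster-center candidates.
std::vector<int> rankByGamma(const std::vector<int>& rho,
                             const std::vector<double>& delta) {
    std::vector<int> idx(rho.size());
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., n-1
    std::sort(idx.begin(), idx.end(), [&](int a, int b) {
        return rho[a] * delta[a] > rho[b] * delta[b];
    });
    return idx;
}
```

How many of the top-ranked points to accept as centers is still a manual choice, as the comments below discuss.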

### CODE

    // Clustering by fast search and find of density peaks
    // Science 27 June 2014: Vol. 344 no. 6191 pp. 1492-1496
    // DOI: 10.1126/science.1242072
    // http://www.sciencemag.org/content/344/6191/1492.full
    //
    // Code Author: Eric Yuan
    // Blog: http://eric-yuan.me
    // You are FREE to use the following code for ANY purpose.
    //
    // Have fun with it

    #include <cmath>
    #include <cstdio>
    #include <ctime>
    #include <iostream>
    #include <vector>
    using namespace std;

    #define DIM 3

    struct Point3d {
        double x;
        double y;
        double z;
        Point3d(double xin, double yin, double zin) : x(xin), y(yin), z(zin) {}
    };

    void dataPro(vector<vector<double> > &src, vector<Point3d> &dst){
        for(int i = 0; i < src.size(); i++){
            Point3d pt(src[i][0], src[i][1], src[i][2]);
            dst.push_back(pt);
        }
    }

    double getDistance(Point3d &pt1, Point3d &pt2){
        double tmp = pow(pt1.x - pt2.x, 2) + pow(pt1.y - pt2.y, 2) + pow(pt1.z - pt2.z, 2);
        return pow(tmp, 0.5);
    }

    // Search for a cutoff distance dc such that the total neighbor count falls
    // between nLow and nHigh (i.e. the average neighbor rate is 1.6%-2%).
    double getdc(vector<Point3d> &data, double neighborRateLow, double neighborRateHigh){
        int nSamples = data.size();
        int nLow = neighborRateLow * nSamples * nSamples;
        int nHigh = neighborRateHigh * nSamples * nSamples;
        double dc = 0.0;
        int neighbors = 0;
        cout<<"nLow = "<<nLow<<", nHigh = "<<nHigh<<endl;
        while(neighbors < nLow || neighbors > nHigh){
            neighbors = 0;
            for(int i = 0; i < nSamples - 1; i++){
                for(int j = 0; j < nSamples; j++){
                    if(i == j) continue;
                    if(getDistance(data[i], data[j]) <= dc) ++neighbors;
                    if(neighbors > nHigh) goto DCPLUS; // break out of both loops
                }
            }
            DCPLUS:
            dc += 0.03;
            cout<<"dc = "<<dc<<", neighbors = "<<neighbors<<endl;
        }
        return dc;
    }

    // rho[i]: number of points closer than dc to point i.
    vector<int> getLocalDensity(vector<Point3d> &points, double dc){
        int nSamples = points.size();
        vector<int> rho(nSamples, 0);
        for(int i = 0; i < nSamples - 1; i++){
            for(int j = i + 1; j < nSamples; j++){
                if(getDistance(points[i], points[j]) < dc){
                    ++rho[i];
                    ++rho[j];
                }
            }
            //cout<<"getting rho. Processing point No."<<i<<endl;
        }
        return rho;
    }

    // delta[i]: minimum distance from point i to any point with higher density;
    // for the point with the highest density, the maximum distance to any point.
    vector<double> getDistanceToHigherDensity(vector<Point3d> &points, vector<int> &rho){
        int nSamples = points.size();
        vector<double> delta(nSamples, 0.0);
        for(int i = 0; i < nSamples; i++){
            double dist = 0.0;
            bool flag = false;
            for(int j = 0; j < nSamples; j++){
                if(i == j) continue;
                if(rho[j] > rho[i]){
                    double tmp = getDistance(points[i], points[j]);
                    if(!flag){
                        dist = tmp;
                        flag = true;
                    }else dist = tmp < dist ? tmp : dist;
                }
            }
            if(!flag){ // point with the highest density
                for(int j = 0; j < nSamples; j++){
                    double tmp = getDistance(points[i], points[j]);
                    dist = tmp > dist ? tmp : dist;
                }
            }
            delta[i] = dist;
            //cout<<"getting delta. Processing point No."<<i<<endl;
        }
        return delta;
    }

    int main(int argc, char** argv){
        long start, end;
        FILE *input = fopen("dataset.txt", "r");
        vector<vector<double> > data;
        double tpdouble;
        int counter = 0;
        while(1){
            if(fscanf(input, "%lf", &tpdouble) == EOF) break;
            if(counter / DIM >= data.size()){
                vector<double> tpvec;
                data.push_back(tpvec);
            }
            data[counter / DIM].push_back(tpdouble);
            ++counter;
        }
        fclose(input);
        //random_shuffle(data.begin(), data.end());

        start = clock();
        cout<<"********"<<endl;
        vector<Point3d> points;
        dataPro(data, points);
        double dc = getdc(points, 0.016, 0.020);
        vector<int> rho = getLocalDensity(points, dc);
        vector<double> delta = getDistanceToHigherDensity(points, rho);
        // saveToTxt(rho, delta);
        // now u get the cluster centers
        end = clock();
        cout<<"used time: "<<((double)(end - start)) / CLOCKS_PER_SEC<<endl;
        return 0;
    }

PostScript: I found that the result differs a lot when using different values of dc. If you have any idea of how to calculate dc precisely, please let me know.

### Newly Updated: Jul. 1 9:14 pm

Ryan raised an interesting idea yesterday and I tried it.

Here’s the new function to calculate “Local Density”:

    vector<double> getAverageKnnDistance(vector<Point2d> &points){
        double ratio = 0.015;
        int nSamples = points.size();
        int M = nSamples * ratio;
        if(M < 1) M = 1; // guard against tiny datasets
        vector<double> rho(nSamples, 0.0);
        for(int i = 0; i < nSamples; i++){
            // tmp keeps the M smallest distances seen so far, sorted descending
            vector<double> tmp;
            for(int j = 0; j < nSamples; j++){
                if(i == j) continue;
                double dis = getDistance(points[i], points[j]);
                if(tmp.empty()){
                    tmp.push_back(dis);
                }else if(tmp.size() < M){
                    if(dis <= tmp[tmp.size() - 1]) tmp.push_back(dis);
                    else{
                        for(int k = 0; k < tmp.size(); k++){
                            if(tmp[k] <= dis){
                                tmp.insert(tmp.begin() + k, dis);
                                break;
                            }
                        }
                    }
                }else{
                    if(dis >= tmp[0]){
                        ; // do nothing
                    }else if(dis <= tmp[tmp.size() - 1]){
                        tmp.erase(tmp.begin());
                        tmp.push_back(dis);
                    }else{
                        for(int k = 0; k < tmp.size(); k++){
                            if(tmp[k] <= dis){
                                tmp.insert(tmp.begin() + k, dis);
                                tmp.erase(tmp.begin());
                                break;
                            }
                        }
                    }
                }
            }
            double res = 0.0;
            for(int j = 0; j < tmp.size(); j++){
                res += tmp[j];
            }
            rho[i] = 0 - res / tmp.size(); // negate: smaller mean distance means higher density
        }
        return rho;
    }

Because the mean distance to the nearest M neighbors is inversely proportional to density, I negate it (apply y = -x) at the end.
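The same average-distance-to-M-nearest-neighbors idea can be sketched more compactly with `std::nth_element` instead of a hand-rolled insertion sort. `knnDensity` is a hypothetical helper of my own, taking the precomputed distances from one point to all the others:

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Density surrogate for one point: the negated mean of its M smallest
// distances to the other points (assumes distToOthers is non-empty).
// Smaller mean distance to the M nearest neighbors => higher density.
double knnDensity(const std::vector<double>& distToOthers, int M) {
    std::vector<double> d(distToOthers);
    if (M < 1) M = 1;
    if (M > (int)d.size()) M = (int)d.size();
    // Partition so the M smallest distances occupy d[0..M-1] (any order).
    std::nth_element(d.begin(), d.begin() + M, d.end());
    double mean = std::accumulate(d.begin(), d.begin() + M, 0.0) / M;
    return -mean;  // negate so that denser points get larger values
}
```

`std::nth_element` runs in linear time on average, so this drops the per-point cost from O(N·M) to O(N).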

I tested this idea using four datasets:

- **Aggregation:** 788 vectors, 2 dimensions, 7 clusters.
- **Spiral:** 312 vectors, 2 dimensions, 3 clusters.
- **Jain:** 373 vectors, 2 dimensions, 2 clusters.
- **Flame:** 240 vectors, 2 dimensions, 2 clusters.

And the results are as follows (Ryan’s idea in red circles, the method in the paper in blue circles):

Spiral:

Flame:

Aggregation:

Actually, Ryan’s method works well, but just as I replied to him, how to get a proper “M” is still black magic.

Hey, **HERE**’s the link to Ryan’s blog post about this algorithm.

If anyone has any idea, please let me know.

Thanks Ryan.

## 63 Comments

Hi Yuan,

Thanks for sharing your code. I have some problems compiling it under Ubuntu 14.04. Sorry, I am not familiar with C++.

$ g++ fsfdp.cpp

fsfdp.cpp: In function ‘int main(int, char**)’:

fsfdp.cpp:117:37: error: ‘fopen’ was not declared in this scope

input = fopen("dataset.txt", "r");

^

fsfdp.cpp:122:42: error: ‘fscanf’ was not declared in this scope

if(fscanf(input, "%lf", &tpdouble)==EOF) break;

^

fsfdp.cpp:122:45: error: ‘EOF’ was not declared in this scope

if(fscanf(input, "%lf", &tpdouble)==EOF) break;

^

fsfdp.cpp:130:17: error: ‘fclose’ was not declared in this scope

fclose(input);

^

fsfdp.cpp:131:44: error: ‘random_shuffle’ was not declared in this scope

random_shuffle(data.begin(), data.end());

^

Hi Simon,

pls try to include “stdio.h”

Thanks Eric.

I have also included ‘algorithm’ for random_shuffle. It is compiled successfully but has ‘Segmentation fault’ after running.

$ g++ fsfdp.cpp

$ ./a.out

Segmentation fault (core dumped)

$ g++ -v

Using built-in specs.

COLLECT_GCC=g++

COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper

Target: x86_64-linux-gnu

Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.2-19ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu

Thread model: posix

gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)

Oh, sorry about that. It’s OK to just delete that random_shuffle line. I added it because my own test dataset was a set of ordered points, and I just wanted to make them randomly distributed.

Hi Eric, there is still a Segmentation fault (core dumped) after deleting the random_shuffle line.

I have solved it. Thanks for sharing your code.

Thank you very much.

It works like a charm.

Well, I ran your code. However, the result didn’t seem satisfactory. Did I do anything wrong?

Yeah, it’s true that for different datasets the result differs a lot. I tried different “dc” values and I think that causes the difference. Alex & Alessandro didn’t give details about calculating dc; they only said one “can choose dc so that the average number of neighbors is around 1 to 2% of the total number of points in the data set”.

I’m thinking about it and if you have any idea, please let me know.

btw, you can get some cluster datasets here : http://cs.joensuu.fi/sipu/datasets/ including the shape sets used in the paper.

Hi, Eric Yuan

I’m sort of new to this paper. Could you explain “dc is a cutoff distance”? And does dc depend on the data set distribution? Can you give me an example I can learn from?

Hi Aspirin,

I don’t think the value of dc depends on how the dataset is distributed, because there’s an AVERAGE process. We can see this dc as the radius of a circle (in 2D) or a sphere (in 3D). For every point in the dataset, we take it as the centre of the circle/sphere and count how many points are inside. As the author said, the target is to find a proper dc that makes the AVERAGE number of points in the circles/spheres between 1% and 2% of the total number of points.

For example, say we have 200 samples in our dataset and we want a dc such that the average number of neighbors is 1.5% of the total number of points in the data set. That means we want the average number of points in each circle/sphere to be 3, so over all points there will be 3 * 200 = 600 neighbors in total. So in my code, I just try to find a value such that when dc equals this value, the total neighbor count is 600, which makes the average 3.

I don’t know whether I explained clearly enough, if you have any question, feel free to let me know.

Thanks for your explanation! It gives a really nice overview of dc. It may seem a bit anticlimactic, but what it does is quite powerful.

Thinking about the dc threshold choosing problem… what if local density were computed as the mean distance to the nearest M neighbors, where M is set to some percentage of the total number of points? There would still be a threshold to choose, but it seems like that should reduce the sensitivity to the threshold.

Hey Ryan,

It might work, but I think it is still hard to get a proper M. For example, there are two points A and B: A has only 3 neighbors and the distance between A and each of its neighbors is 3; B has a lot of neighbors but the distance between B and each of its neighbors is 3.5. Then it is hard to say which point has the bigger local density, and the problem is exactly the same as with “dc” in the paper, that is, HOW LOCAL is the “local density”.

I like your idea anyway; I’ll try it tomorrow and reply then with test results.

Thanks.

I tried your idea; please check the newly updated part of this post.

Thanks for the update. Just to confirm, did you play with the value of “double ratio = 0.015” and decide that the results are still too sensitive to that value? Is there a threshold (ratio) where the mean of M threshold works well for all your data sets?

I guess the formal way to evaluate might be something like:

1) Define (or given) the truth for each point in a data set (human determined clustering).

2) Define an accuracy metric (how well the alg does vs truth). I guess just the obvious nCorrect / nSamples.

3) Sweep the threshold parameter (dc for the original one, ratio for the new one): meaning repeatedly run the alg and vary the parameter.

4) Evaluate how sensitive the accuracy is vs the threshold. I guess just look at a graph?

5) Evaluate how the best threshold changes across the data sets. Maybe plot all the curves from 4 on the same graph?

If a method was more sensitive to the threshold I would think the accuracy metric would vary quickly relative to that threshold, and the best threshold for one data set might not be the best for other data sets. If a method were less sensitive I would think the accuracy metric would vary more slowly relative to that threshold, and maybe there would be a threshold that worked across more data sets.

Another thought, might want a minimum number for the M nearest points, like:

int M = max(5, (int)(nSamples * ratio));

(though I don’t know what a good minimum would be, maybe that could be determined from more analysis…)

I might get some time in the next few days to help work on this.

Oh, probably makes more sense to have an intermediate metric, not accuracy metric. Because you’re not doing final classification yet.

So maybe a “separation metric” where:

1) Define truth as the set of peak density points. (Points 10 and 1 in the first example above.)

2) Define a separation metric that measures how well those truth peak density points are separated from the rest of the points (the non-peak-density points) on the rho/delta graph. Maybe distance from centroid of rest of points / stddev of distance of rest of points?

3) Same parameter sweep, varying the threshold, to see how the separation metric varies with threshold.

Because I guess the only thing a good thresholding scheme can do is provide good separation of the peak density points from the rest of the points, without having to choose a threshold for each data set.

I posted a comment with the results of my investigation on the sciencemag site:

http://comments.sciencemag.org/content/10.1126/science.1242072

Cool!

Also I created a blog post that includes the R source and the plot of sensitivity of separation score vs mRatio here: Ryan Seghers Blog

Nice, I’m gonna put your link in the post.

Hi Eric,

Could you share the Matlab version of your code.

Thanks

Hi Pardeep,

Sorry I don’t have a Matlab version of this code, however the algorithm is simple, you can try to write one.

Check this: http://www.sciencemag.org/content/344/6191/1492/suppl/DC1

There’s a matlab version of the whole algorithm on ScienceMag website.

Click “Data S1” under “Additional Data”.

Hi Eric,

Thanks for posting the Matlab code link. The current Matlab code does not support new data sets, e.g. the iris data set. Could you please post a new link for Matlab code of your algorithm that can be applied to the iris data set or any other multidimensional data set?

One more doubt about R version of your algorithm: http://cran.r-project.org/web/packages/densityClust/

Suppose we divide the iris data set into two parts, Calibration and Verification, and apply the R code to get cluster information for the Calibration data.

How to get the cluster information for Verification data set without running R code?

How to store rho and delta values for centroid points of Calibration clusters so that we can use it for Verification data set?

Hi,Eric,

Thanks for sharing your code, and now I also need the Matlab code for this algorithm. Could you give me your flow chart for this C++ version, or help me find the friend named Pardeep? Maybe he has already got the Matlab one. Thank you.

Hi Eric,

Thanks for the sharing. I am trying to implement the paper in LabVIEW (if you know it) and I found it confusing to define the border region. I think you didn’t do that part in your code. Do you have any thoughts? Thanks.

Regards,

Bo

Hey Bo,

Sorry, I did not do that part because I don’t think I fully understand the dc-calculation part, so the result is not good enough; I’m still trying to find a better way to get dc.

Regards,

Hi Yuan,

Thanks for sharing your code. Can you tell how you got the figures in your testing of Ryan’s method, using Matlab or other tools? Can you share the code for plotting the figures? Thank you!

Regards,

Li

Hi Li,

After getting rho and delta, you can simply use a function like the following example to save them into a .txt file:

    void saveToTxt(vector<double> &rho){
        FILE *p = fopen("rho.txt", "w");
        for(int i = 0; i < rho.size(); i++){
            fprintf(p, "%lf\n", rho[i]);
        }
        fclose(p);
    }

And yes, I used Matlab; you can use Matlab or Octave. Load in rho and delta, then draw the figure like this:

    rho = load('rho.txt');
    delta = load('delta.txt');
    figure;
    scatter(rho, delta, 'r');

It’s simple.

Regards,

Thank you for your kind reply and detailed explanation.

Hi Yuan,

Thanks for sharing your code. Can you tell how to do the clustering after getting rho and delta?

Hey Huiming,

Check this: http://www.sciencemag.org/content/344/6191/1492/suppl/DC1

There’s a matlab version of the whole algorithm on ScienceMag website.

Click “Data S1” under “Additional Data”.

Thank you for your kind reply. You are so nice!^_^

Hi Eric,

Thanks for your sharing. From my point of view, the method of this paper finds the initial cluster centers; once I get the cluster centers, iterative computing is still needed until convergence.

In addition, I have one question. I find the time complexity is O(n^2). Compared with k-means++ or k-means||, it costs much more; what’s more, the clustering result is not always better than k-means++ or k-means||.

Hi Lanmbert,

There’s a matlab version of the whole algorithm on ScienceMag website.

http://www.sciencemag.org/content/344/6191/1492/suppl/DC1

Click “Data S1″ under “Additional Data”.

And I think the author didn’t mean to use this method for initial centers, but as a substitute for the whole k-means algorithm. From this point of view, the O(n^2) cost is arguably a fair price compared with k-means (whose exact optimization is NP-hard).

Hi Eric, I think the time complexity of k-means is O(n*k*t), where t is the number of iterations; it is far less than this paper’s method, O(n^2).

Can you explain or elaborate what this means?

“for the point with highest density, we simply let delta_i = max_j(d_ij)”

I’ve read this in the article itself as well but I don’t understand how this is applied. In other words, is this condition only applied to the single point with the highest overall density while all other points have delta as “minimum distance between the point i and any other point with higher density”? It would be great if you could clarify this for me.

Yes, this only applies to the single point with the highest density, because there is no other point with higher density. In my opinion, the purpose is, for the highest-density sample, to make this distance as large as possible (but not too large, so that the distance is still an acceptable value for this particular dataset).

Hi Eric. The author said “A hint for choosing the number of centers is provided by the plot of r_i = rho_i * delta_i sorted in decreasing order.” But how to implement it without plotting?

Hi Amber,

I don’t think this algorithm is automatic enough; it seems the authors failed to provide another way to deal with that problem.

Hi, Eric,

thanks for sharing both your code and your understanding of the newly proposed clustering algorithm.

Many guys have mentioned and questioned the O(n^2) complexity, but I do think that is not a big deal because the clustering algorithm can easily be implemented in parallel.

BTW, I’m wondering whether it can also work well in the high-dimension cases, since most of the numerical results are of the dimension 2 or 3. Any idea about this?

Thanks.

In the function getdc, you compute nLow = neighborRateLow * nSamples * nSamples. Why?

Hey Tink,

I didn’t calculate the average nLow but the total amount: each point should have Rate * nSamples neighbors on average, so all nSamples points together will have Rate * nSamples * nSamples neighbors.

I also have some questions. How do you determine the cluster centers, and how do you assign the remaining points?

Hello, I have a question about the getdc() function.

dc == 0.0 when it is initialized. Then, in the loop, when getDistance(data[i], data[j]) <= dc we do ++neighbors, and if(neighbors > nHigh) we goto DCPLUS.

However, I think that because dc == 0.0 when it is initialized, the distance between two points will never be <= dc, so neighbors will always be 0 and the DCPLUS branch won’t run.

Please give me some help. Thank you.

Hey Albert,

Thanks for asking. The reason I used “goto” was just to break out of the two for-loops; note that the DCPLUS label is also reached by normal fall-through after the loops finish, so in each iteration of the while-loop the dc += 0.03 executes either way.

Thank you~

Hello, I have this question too, but I did not understand it even after reading the reply. Could you explain it? Thanks.

Hi Eric, I’m new here. Could you please tell me: after running the code you posted above, I saved rho and delta into a txt file. How can I plot the figure with the different shapes?

Hi Kimi,

It’s OK to just use Matlab plot functions to draw those shapes, or, of course, use GNU Octave for free. Thanks.

OK, I see. After running your program, we have rho and delta; if I want to choose the centers, I just pick points that have high delta and relatively high rho?

Yes exactly.

A method for automatically selecting dc has been proposed:

http://arxiv.org/abs/1501.04267

After using Ryan’s method for getting the density, what do you do to define the border region? That method defines rho as the average distance over the M closest points and doesn’t use a distance cutoff (dc). However, the paper defines the border region using dc, so it’s unclear how to resolve this. Any comments would be helpful!

Hello, the function getdc() has triple nested loops, which makes the method time-consuming when the dataset is large-scale. Is there any optimization?

I think the function getdc() can not stop.

Hello. I have a question about this algorithm. In “if(getDistance(data[i], data[j]) <= dc) ++neighbors;”, because dc’s initial value is zero, getDistance(data[i], data[j]) will always be greater than dc. So this loop will never stop.

Hi,

You mean the while loop? But dc is increased by 0.03 in every iteration, so it gets bigger and bigger…

Great implementation! We built an R package which has automatic bandwidth selection.

Here is the link:

https://cran.rstudio.com/web/packages/ADPclust/index.html

As already said in the comments, I really find the idea of using the M nearest neighbors very powerful. However, there is still the need for a dc for the assignment of points to clusters. So how can the two approaches be combined, so that Ryan’s idea is also used for the assignment? Use a criterion like

if dist < (mean(rho) +/- k * std(rho)) ?

I don’t know if I’m being clear, but in the end this is crucial for getting not only the centroids but also the assignments.

Hi Eric! Thanks a ton for your making your code available for free! You are great!

I have one tiny problem. You did not mention what the format of the dataset should be. In my dataset, I have 3 columns: (1) object 1, (2) object 2, (3) distance between objects 1 and 2. Is this the correct format? When I compile it there is no error, but when I run it, it doesn’t show anything. How can I get the colourful graphs like the ones in the post?

## One Trackback

[…] At first I was taken in, thinking this idea was so brilliant that it even got published in Science! Then I enthusiastically set out to implement it, and found that an expert had already done so (heh, actually only as far as the decision graph). […]