• Simon

    Hi Yuan,
Thanks for sharing your code. I have some problems compiling it under Ubuntu 14.04. Sorry, I am not familiar with C++.

    $ g++ fsfdp.cpp
    fsfdp.cpp: In function ‘int main(int, char**)’:
    fsfdp.cpp:117:37: error: ‘fopen’ was not declared in this scope
input = fopen("dataset.txt", "r");
    ^
    fsfdp.cpp:122:42: error: ‘fscanf’ was not declared in this scope
if(fscanf(input, "%lf", &tpdouble)==EOF) break;
    ^
    fsfdp.cpp:122:45: error: ‘EOF’ was not declared in this scope
if(fscanf(input, "%lf", &tpdouble)==EOF) break;
    ^
    fsfdp.cpp:130:17: error: ‘fclose’ was not declared in this scope
    fclose(input);
    ^
    fsfdp.cpp:131:44: error: ‘random_shuffle’ was not declared in this scope
    random_shuffle(data.begin(), data.end());
    ^

    • Eric

      Hi Simon,
Please try to include "stdio.h".

      • Simon

Thanks, Eric.
I have also included <algorithm> for random_shuffle. It compiles successfully now, but gives a "Segmentation fault" when running.

        $ g++ fsfdp.cpp
        $ ./a.out
        Segmentation fault (core dumped)

        • Simon

          $ g++ -v
          Using built-in specs.
          COLLECT_GCC=g++
          COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
          Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.2-19ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
          Thread model: posix
          gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)

        • Eric

Oh, sorry about that. It's OK to just delete that random_shuffle line. I added it because my own test dataset is a set of ordered points, and I just wanted to make them randomly distributed.

          • lxp

Hi Eric, there is still a "Segmentation fault (core dumped)"
after deleting the random_shuffle line.

          • lxp

I have solved it. Thanks for sharing your code.

          • Simon

            Thank you very much.
            It works like a charm.

  • ahui

Well, I ran your code. However, the result didn't seem satisfactory. Am I doing anything wrong?

    • Eric

Yeah, it's true that the result differs a lot between datasets. I tried different "dc" values and I think that causes the difference. Alex & Alessandro didn't give the details about calculating dc; they only said one "can choose dc so that the average number of neighbors is around 1 to 2% of the total number of points in the data set".

      I’m thinking about it and if you have any idea, please let me know.

      btw, you can get some cluster datasets here : http://cs.joensuu.fi/sipu/datasets/ including the shape sets used in the paper.

  • Aspirin Vagrant

    Hi, Eric Yuan
I'm sort of new to this paper. Could you explain "dc is a cutoff distance"? And does dc depend on the dataset distribution? Can you give me an example that I can learn from?

    • Eric

      Hi Aspirin,

I don't think the value of dc depends on how the dataset is distributed, because there's an AVERAGE process. We can see this dc as the radius of a circle (in 2D) or a sphere (in 3D). For every point in the dataset, we let it be the centre of the circle/sphere, and count how many points are inside. As the author said, the target is to find a proper dc that makes the AVERAGE number of points in the circles/spheres between 1% and 2% of the total number of points.

For example, say we have 200 samples in our dataset and we want a dc such that the average number of neighbors is 1.5% of the total number of points. That means we want the average number of points in each circle/sphere to be 3, so over all points there will be 3 * 200 = 600 neighbors in total. So in my code, I just look for a value such that, when dc equals it, the total neighbor count is 600, which makes the average number of neighbors 3.

I don't know whether I explained this clearly enough; if you have any questions, feel free to let me know.
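The search described above can be sketched like this (a minimal version, not the exact getdc() from the post; the Point type, getDistance(), and the 0.01 step are assumptions based on the thread):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

double getDistance(const Point &a, const Point &b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

// Grow dc until the TOTAL number of ordered neighbor pairs reaches
// rate * n * n, i.e. until the average neighbor count is rate * n per point.
double findDc(const std::vector<Point> &data, double rate, double step = 0.01) {
    const std::size_t n = data.size();
    const double target = rate * (double)n * (double)n;
    double dc = 0.0;
    while (true) {
        std::size_t neighbors = 0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (i != j && getDistance(data[i], data[j]) <= dc)
                    ++neighbors;
        if ((double)neighbors >= target) return dc;
        dc += step;  // same spirit as the dc += 0.01 mentioned in the thread
    }
}
```

With 200 samples and rate = 0.015, this stops once the total neighbor count reaches 600, matching the 3-neighbors-per-point example above.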

      • Aspirin Vagrant

Thanks for your explanation! It gives a really nice overview of dc. dc may seem a bit anticlimactic, but what it does is quite powerful.

  • Ryan

Thinking about the dc threshold-choosing problem… what if local density were computed as the mean distance to the nearest M neighbors, where M is set to some percentage of the total number of points? There would still be a threshold to choose, but it seems like that should reduce the sensitivity to it.
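Ryan's suggestion could be sketched like this (my own reading of it, not code from the post; the function name and the use of a precomputed distance row are assumptions):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Local density from the mean distance to the M nearest neighbors:
// a SMALLER mean distance means a DENSER neighborhood. distRow holds the
// distances from one point to every point (including 0.0 to itself).
double meanDistToNearestM(const std::vector<double> &distRow, int M) {
    std::vector<double> d(distRow);
    std::sort(d.begin(), d.end());  // d[0] == 0.0 is the point itself
    double sum = 0.0;
    for (int k = 1; k <= M && k < (int)d.size(); ++k) sum += d[k];
    return sum / M;
}
```

M would be set to some percentage of nSamples, as Ryan says; points with a smaller value are the denser ones, so one could use rho = -meanDistToNearestM(...) to keep "bigger rho means denser".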

    • Eric

      Hey Ryan,
It might work, but I think it is still hard to get a proper M. For example, take two points A and B: A has only 3 neighbors, and the distance between A and each of its neighbors is 3; B has a lot of neighbors, but the distance between B and each of its neighbors is 3.5. It is hard to say which point has the bigger local density, and the problem is exactly the same as with "dc" in the paper, that is, HOW LOCAL the "local density" should be.
I like your idea anyway; I'll try it tomorrow and reply with the test results.
      Thanks.

    • Eric

I tried your idea; please check the newly updated part of this post.

      • Ryan

        Thanks for the update. Just to confirm, did you play with the value of “double ratio = 0.015” and decide that the results are still too sensitive to that value? Is there a threshold (ratio) where the mean of M threshold works well for all your data sets?

        I guess the formal way to evaluate might be something like:
        1) Define (or given) the truth for each point in a data set (human determined clustering).
        2) Define an accuracy metric (how well the alg does vs truth). I guess just the obvious nCorrect / nSamples.
        3) Sweep the threshold parameter (dc for the original one, ratio for the new one): meaning repeatedly run the alg and vary the parameter.
        4) Evaluate how sensitive the accuracy is vs the threshold. I guess just look at a graph?
        5) Evaluate how the best threshold changes across the data sets. Maybe plot all the curves from 4 on the same graph?

        If a method was more sensitive to the threshold I would think the accuracy metric would vary quickly relative to that threshold, and the best threshold for one data set might not be the best for other data sets. If a method were less sensitive I would think the accuracy metric would vary more slowly relative to that threshold, and maybe there would be a threshold that worked across more data sets.
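Steps 3 and 4 could be sketched like this, with a hypothetical accuracyAt() standing in for "run the algorithm with this threshold and score it against the truth" (the peak at 0.3 is made up purely for illustration):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for steps 1-2: run the algorithm with this
// threshold and compute nCorrect / nSamples against the human truth.
// Here it just has an artificial best value at 0.3.
double accuracyAt(double threshold) {
    return 1.0 - std::fabs(threshold - 0.3);
}

// Step 3: sweep the threshold. Steps 4-5 would graph the accuracies;
// here we simply return the best threshold found.
double sweepThreshold(double lo, double hi, double step) {
    double best = lo, bestAcc = -1.0;
    for (double t = lo; t <= hi + 1e-12; t += step) {
        double acc = accuracyAt(t);
        if (acc > bestAcc) { bestAcc = acc; best = t; }
    }
    return best;
}
```

A less sensitive method would show a flat, wide peak in this sweep; a more sensitive one a sharp, narrow peak that moves between datasets.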

        Another thought, might want a minimum number for the M nearest points, like:
int M = max(5, (int)(nSamples * ratio));
        (though I don’t know what a good minimum would be, maybe that could be determined from more analysis…)

        I might get some time in the next few days to help work on this.

        • Ryan

Oh, it probably makes more sense to have an intermediate metric rather than an accuracy metric, because you're not doing the final classification yet.

          So maybe a “separation metric” where:
          1) Define truth as the set of peak density points. (Points 10 and 1 in the first example above.)
          2) Define a separation metric that measures how well those truth peak density points are separated from the rest of the points (the non-peak-density points) on the rho/delta graph. Maybe distance from centroid of rest of points / stddev of distance of rest of points?
          3) Same parameter sweep, varying the threshold, to see how the separation metric varies with threshold.

          Because I guess the only thing a good thresholding scheme can do is provide good separation of the peak density points from the rest of the points, without having to choose a threshold for each data set.

  • Pardeep

    Hi Eric,
    Could you share the Matlab version of your code.

    Thanks

    • Eric

      Hi Pardeep,

Sorry, I don't have a Matlab version of this code, but the algorithm is simple; you can try writing one yourself.

    • Eric

      Check this: http://www.sciencemag.org/content/344/6191/1492/suppl/DC1
      There’s a matlab version of the whole algorithm on ScienceMag website.
      Click “Data S1” under “Additional Data”.

      • Pardeep

        Hi Eric,
Thanks for posting the Matlab code link. The current Matlab code does not support new data sets, e.g. the iris data set. Could you please post a new link to Matlab code for your algorithm that can be applied to the iris data set or any other multidimensional data set?

        One more doubt about R version of your algorithm: http://cran.r-project.org/web/packages/densityClust/

Suppose we divide the iris data set into two parts, Calibration and Verification, and apply the R code to get cluster information for the Calibration data.
How can we get the cluster information for the Verification data set without running the R code?
How can we store the rho and delta values for the centroid points of the Calibration clusters so that we can use them for the Verification data set?

    • aldercy

      Hi,Eric,
Thanks for sharing your code. I also need the Matlab code for this algorithm. Could you give me a flow chart for this C++ version, or help me find Pardeep? Maybe he has already got the Matlab one. Thank you.

  • Bo

    Hi Eric,

Thanks for sharing. I am trying to implement the paper in LabVIEW (if you know it), and I found it confusing to define the border region. I think you didn't do that part in your code. Do you have any thoughts? Thanks.

    Regards,
    Bo

    • Eric

      Hey Bo,
Sorry, I did not do that part. I don't think I fully understand the dc-calculation part, so the result is not good enough, and I'm still trying to find a better way to get dc.
      Regards,

  • li

    Hi Yuan,
Thanks for sharing your code. Can you tell how you got the figures in your testing of Ryan's method, using Matlab or other tools? Can you share the code for plotting the figures? Thank you!

    Regards,
    Li

    • Eric

      Hi Li,
After getting rho and delta, you can use a function like the following to save them into a .txt file:

void saveToTxt(vector<double> &rho){
    FILE *p = fopen("rho.txt", "w");
    for(int i = 0; i < rho.size(); i++){
        fprintf(p, "%lf\n", rho[i]);
    }
    fclose(p);
}

And yes, I used Matlab. You can use Matlab or Octave: load in rho and delta, then draw the figure like this:

rho = load('rho.txt');
delta = load('delta.txt');
figure;
scatter(rho, delta, 'r');

It's simple.
Regards,

      • Li

        Thank you for your kind reply and detailed explanation.

  • Huiming Xie

    Hi Yuan,
Thanks for sharing your code. Can you tell how to do the clustering after getting rho and delta?

  • lanmbert

    Hi Eric,

Thanks for sharing. From my point of view, the method of this paper finds the initial cluster centers; once I get the cluster centers, iterative computation is still needed until convergence.
In addition, I have one question. I find the time complexity is O(n^2). Compared with k-means++ or k-means||, it costs much more; what's more, the clustering result is not always better than k-means++ or k-means||.

    • Eric

      Hi Lanmbert,

      There’s a matlab version of the whole algorithm on ScienceMag website.
      http://www.sciencemag.org/content/344/6191/1492/suppl/DC1
Click “Data S1” under “Additional Data”.
And I think the author didn't mean to use this method for initial centers, but as a substitute for the whole k-means algorithm. From this point of view, the O(n^2) cost is arguably acceptable, since finding the exact k-means optimum is NP-hard (though the usual iterative heuristic is much cheaper in practice).

      • Lei Zhao

Hi Eric, I think the time complexity of k-means is O(n*k*t), where t is the number of iterations; that is far less than this paper's method, which is O(n^2).

  • Sean

    Can you explain or elaborate what this means?

    “for the point with highest density, we simply let delta(i) = maxj (dij)”

    I’ve read this in the article itself as well but I don’t understand how this is applied. In other words, is this condition only applied to the single point with the highest overall density while all other points have delta as “minimum distance between the point i and any other point with higher density”? It would be great if you could clarify this for me.

    • Eric

Yes, this only applies to the single point with the highest density, because there is no other point with higher density. In my opinion, the purpose is, for the highest-density sample, to make this distance as long as possible (but not too long, so that it is still an acceptable value for this particular dataset).
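The rule could be sketched like this (my own minimal version, not the code from the post; a precomputed symmetric distance matrix is assumed):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// For every point, delta is the minimum distance to any point of higher
// density; for the single highest-density point, delta is the maximum
// distance to any other point, as quoted from the paper above.
std::vector<double> computeDelta(const std::vector<std::vector<double>> &dist,
                                 const std::vector<double> &rho) {
    const std::size_t n = rho.size();
    std::vector<double> delta(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double best = -1.0;
        for (std::size_t j = 0; j < n; ++j)
            if (rho[j] > rho[i])  // only points of strictly higher density
                if (best < 0.0 || dist[i][j] < best) best = dist[i][j];
        if (best < 0.0)           // i is the highest-density point
            best = *std::max_element(dist[i].begin(), dist[i].end());
        delta[i] = best;
    }
    return delta;
}
```

This guarantees the density peak gets the largest delta of all, which is what makes it stand out in the rho/delta plot.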

  • Amber Yang

    Hi Eric. The author said “A hint for choosing the number of centers is provided by the plot of ri = rhoi * deltai sorted in decreasing order.” But how to implement it without plotting?

    • Eric

      Hi Amber,
I don't think this algorithm is automatic enough; it seems the authors didn't provide any other way to deal with that problem.
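Sorting by gamma can at least be done without a plot (a sketch; the names are mine, and how many of the top points to keep as centers still has to be decided by eye or by some heuristic):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns point indices sorted by gamma = rho * delta, largest first,
// implementing the paper's hint of sorting gamma in decreasing order.
std::vector<int> rankByGamma(const std::vector<double> &rho,
                             const std::vector<double> &delta) {
    std::vector<int> idx(rho.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;
    std::sort(idx.begin(), idx.end(), [&](int a, int b) {
        return rho[a] * delta[a] > rho[b] * delta[b];
    });
    return idx;
}
```

The first few indices are the candidate centers; one could, for instance, look for the largest gap in the sorted gamma values instead of eyeballing the plot, though that rule is not from the paper.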

  • peghoty

    Hi, Eric,

    thanks for the sharing of both your code and your understanding on the newly proposed clustering algorithm.

Many guys have mentioned and questioned the O(n^2) complexity, but I don't think that is a big deal, because the clustering algorithm can easily be implemented in parallel.

BTW, I'm wondering whether it also works well in high-dimensional cases, since most of the numerical results are in dimension 2 or 3. Any idea about this?

    Thanks.

  • tink

In the function getdc(), you set nLow = neighborRateLow * nSamples * nSamples. Why?

    • Eric

      Hey Tink,

I didn't calculate the average nLow but the total amount: each point has Rate * nSamples neighbors, so all nSamples points together have Rate * nSamples * nSamples neighbors.

  • tink

I also have some questions. How do you determine the cluster centers, and how do you assign the remaining points?

  • Albert

Hello, I have a question about the getdc() function.
dc == 0.0 when it is initialized. Then, in the calculation, when getDistance(data[i], data[j]) <= dc, neighbors is incremented, and if neighbors > nHigh it does goto DCPLUS.
However, I think that because dc == 0.0 when it is initialized, the distance between two points will not be <= dc, so neighbors will always be 0 and DCPLUS won't run.
Please give me some help. Thank you.

    • Eric

      Hey Albert,

Thanks for asking. The reason I used "goto" was just to break out of the two nested for-loops, so that in each iteration of the while-loop, dc += 0.01 executes.
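For anyone unfamiliar with the pattern, here is a minimal, hypothetical illustration (not code from the post) of using goto to leave two nested loops at once, which is all that goto DCPLUS does:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Find the first pair (i, j) with grid[i][j] == target. A break would only
// leave the inner loop; the goto leaves BOTH loops immediately.
std::pair<int, int> findPair(const std::vector<std::vector<int>> &grid, int target) {
    int fi = -1, fj = -1;  // stays (-1, -1) if target is absent
    for (std::size_t i = 0; i < grid.size(); ++i) {
        for (std::size_t j = 0; j < grid[i].size(); ++j) {
            if (grid[i][j] == target) {
                fi = (int)i;
                fj = (int)j;
                goto DONE;  // jump past both loops at once
            }
        }
    }
DONE:
    return {fi, fj};
}
```

In getdc() the jump target sits just before the dc increment, so every pass of the while-loop grows dc and recounts the neighbors.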

      • Albert

        Thank you~

    • XueFei·Xia

Hello, I have this question too, but I did not understand it even after seeing the reply. Could you explain it? Thanks.

  • Kimi

Hi Eric, I'm new here. Could you please tell me: after running the code you posted above, I saved rho and delta into a txt file. How can I plot the figure with the different shapes?

    • Eric

      Hi Kimi,

It's OK to just use Matlab's plot functions to draw those figures, or, of course, use GNU Octave for free. Thanks.

      • Kimi

OK, I see. After running your program we have rho and delta. If I want to choose the centers, I just pick some points which have high delta and relatively high rho?

        • Eric

          Yes exactly.


  • Sean

    A method for automatically selecting dc has been proposed:

    http://arxiv.org/abs/1501.04267

  • Sean

After using Ryan's method to get the density, what do you do to define the border region? That method defines rho as the average distance over the M closest points and doesn't use a distance cutoff (dc). However, the paper defines the border region using dc, so it's unclear how to resolve this. Any comments would be helpful!

  • nixiejun

Hello, the function getdc() has triple loops, which makes the method time-consuming when the dataset is large. Is there any optimization?

    • XueFei·Xia

I think the function getdc() may never stop.

  • XueFei·Xia

Hello. I have a question about this algorithm. In "if(getDistance(data[i], data[j]) <= dc) ++neighbors;", because dc's initial value is zero, I think getDistance(data[i], data[j]) <= dc will never hold, so this loop will never stop.

    • Eric

      Hi,

You mean the while loop? But dc increases by 0.03 in every iteration, so it gets bigger and bigger…

  • Ethan

    Great implementation! We built an R package which has automatic bandwidth selection.
    Here is the link:
    https://cran.rstudio.com/web/packages/ADPclust/index.html

  • Piere

As already said in the comments, I really find the idea of using the M nearest neighbors very powerful. However, there is still the need for a dc for the assignment of points to clusters. So how can the two approaches be combined, so that Ryan's idea is also used for the assignment? Use a criterion like
if dist < (mean(rho) +/- k * std(rho))? I don't know if I'm being clear, but at the end this is crucial: not only to get the centroids, but also the assignments.

  • Prashant

Hi Eric! Thanks a ton for making your code available for free! You are great!

I have one tiny problem. You did not mention what the format of the dataset should be. In my dataset, I have 3 columns: (1) object 1, (2) object 2, (3) the distance between objects 1 and 2. Is this the correct format? When I compile the code there is no error, but when I run it, it doesn't show anything. How can I get colourful graphs like the ones in the post?