However, DBSCAN can consume a lot of memory when the input is very large. And since I work with Big Data, my inputs overwhelm my MacBook Pro very quickly. I know Hadoop MapReduce fairly well, and MR has been around for quite some time, so I decided to see how other folks implemented such a solution in a distributed, shared-nothing environment. I came across this paper, which was very inspiring, and found out that IrvingC also used it as a reference implementation. So I decided to implement my own DBSCAN on Spark as a way to further my education in Scala. And boy, did I learn a lot about immutable data structures, type aliasing, and collection folding. BTW, I highly recommend the Twitter Scala School.
As usual, all the source code can be found here, and make sure to check out the “How It Works?” section.
[Update] After posting, I saw this post. Very nice video too!