Monday, March 4, 2013

BigData meet Raspberry Pi

The Raspberry Pi is a very neat little machine - I bootstrapped it with Soft-Float Debian "Wheezy" so I can run Java Embedded on it - I download and install ejre-7u10-fcs-b18-linux-arm-vfp-client_headless-28_nov_2012.tar, and now can run java programs on the Pi. Wrote the ubiquitous "Hello World" to make sure it works (I should not have doubted the write once run anywhere motto) and jumped directly to running a Spring container with beans. So, the next thing to do is to run the Hadoop client from my previous post on it.  Now remember, there is not a lot of RAM on the Model B Rev 2 Pi (512MB) and the post-action runs on the client machine, so I reduced the cell size and the search radius, worked like a charm. Now, why did I do that ? because... I can :-)
Next, is to come up with something to do with the generated data.

BigData: Kernel Density Analysis on Hadoop MapReduce

The storage of BigData has been democratized by Hadoop. And with that, rather than bringing the data to the program for analysis, we send the program to the data. That twist comes with a challenge to take existing serial algorithms and parallelizing them. This post is about such a process. I want to perform a Kernel Density analysis on billions of records loaded into HDFS. The data is fairly simple; a set of records separated by carriage return, and the fields in each record are tab separated. The fields' values contain a latitude, a longitude and a weight.
If you do not know how Kernel Density works, check out this ArcGIS reference.
To take advantage of the distributed HDFS datanodes, I have to transform the traditional state-full sequential implementation into a model that supports stateless parallelism. MapReduce is such a model, and Hadoop can distribute a MapReduce implementation to read data from the "local" HDFS datanodes and reduces back the result into HDFS.
A Kernel Density analysis expects 3 parameters when processing spatial data; an output extent, a cell size and a search radius. From the above reference description, a input value influences the cells in its search area, and the level of influence is proportional to the input weight and inversely proportional to the distance of the cell from the input. That inverse proportionality is computed using a Kernel function.
So, in the MapReduce programming paradigm, a map function expects a key(k1),value(v1) tuple and emits zero or more new key(k2),value(v2) tuples.

map(k1,v1) -> List(k2,v2)

The map portion of the parallel Kernel Density analysis implementation will emit all the influenced cells and their associated weights for a given input. A cell is defined as a string concatenating its row and column values with a slash character.
After a shuffle and sort intermediate step, the reduce function expects a key(k2),List(v2) tuple and emits zero or more new key(k3),value(v3) tuples.

reduce(k2,list(v2)) -> List(k3,v3)

The implementation will emit the sum of all the weights of a cell
To execute the kernel density map and reduce functions, I decided to use the newly released Spring Data - Apache Hadoop project. As you know, I love the Spring Framework and have been tracking the Spring Data Hadoop project for a while now and glad that it is finally released.
The following Spring application context bootstraps an application and runs the defined job. The job declaratively reference a mapper class, a reducer class and a combiner class. In addition, the job runner defines a pre-action and a post-action. The pre-action tests the existence of the HDFS output path and removes it if it exits. The post-action references a class that converts the content of the output path into a local float raster file. You can use the Float to Raster conversion tool in ArcMap to visualize the kernel density output. And like usual, all the source code can be found here.