Thursday, April 9, 2015

Calculating weighted variance using Spark

Spark provides a stats() function on a DoubleRDD that computes the count, mean and standard deviation of the double values in a numerically robust way. The inputs are unweighted (or, put differently, all carry a weight of 1), and I ran into a situation where I needed to compute these statistics on weighted values. The use case is to find the density area of locations with an associated number of incidents. Here is the result on a map:
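For reference, the unweighted call looks like the following minimal sketch; the app name, master and sample values are my own illustrations, not from the original post:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the unweighted case: stats() on an RDD of doubles
// returns a StatCounter exposing count, mean, variance and stdev.
val conf = new SparkConf().setAppName("StatsDemo").setMaster("local[*]")
val sc = new SparkContext(conf)

val values = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))
val stats = values.stats()
println(s"count=${stats.count} mean=${stats.mean} stdev=${stats.stdev}")
```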
So, after re-reading the "Weighted Incremental Algorithm" section on Wikipedia, I ripped the original StatCounter code and implemented a WeightedStatCounter class that calculates the statistics for an RDD of WeightedValues. The above use case is based on the Standard Distance geoprocessing function and generates a WKT polygon that is converted to a feature through an ArcPy toolbox.
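I won't reproduce the actual class here (see the source link below), but the following is a minimal sketch of the idea, assuming a simple WeightedValue(value, weight) case class; the field and method names are my own and will differ from the real code:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical container for a value and its weight (e.g. incident count at a location)
case class WeightedValue(value: Double, weight: Double)

// Sketch of a weighted StatCounter, modeled on org.apache.spark.util.StatCounter
// and the "Weighted incremental algorithm" (West, 1979) from the Wikipedia article.
class WeightedStatCounter extends Serializable {
  private var wSum = 0.0 // running sum of weights
  private var mu = 0.0   // running weighted mean
  private var s = 0.0    // running sum of w * (x - oldMean) * (x - newMean)

  // Fold a single weighted value into this counter
  def merge(wv: WeightedValue): WeightedStatCounter = {
    wSum += wv.weight
    val delta = wv.value - mu
    mu += (wv.weight / wSum) * delta
    s += wv.weight * delta * (wv.value - mu)
    this
  }

  // Combine two counters (needed to merge per-partition results in parallel)
  def merge(other: WeightedStatCounter): WeightedStatCounter = {
    if (other.wSum > 0.0) {
      val delta = other.mu - mu
      val newWSum = wSum + other.wSum
      mu += delta * other.wSum / newWSum
      s += other.s + delta * delta * wSum * other.wSum / newWSum
      wSum = newWSum
    }
    this
  }

  def mean: Double = mu
  // Population (frequency-weighted) variance: S divided by the sum of weights
  def variance: Double = if (wSum == 0.0) Double.NaN else s / wSum
  def stdev: Double = math.sqrt(variance)
}

object WeightedStatCounter {
  // Convenience: reduce an RDD of weighted values into a single counter
  def apply(rdd: RDD[WeightedValue]): WeightedStatCounter =
    rdd.aggregate(new WeightedStatCounter)(
      (counter, wv) => counter.merge(wv),
      (c1, c2) => c1.merge(c2))
}
```

The two-argument merge uses the pairwise-combine form of the algorithm, which is what lets the counter be driven by RDD.aggregate across partitions, just like Spark's own StatCounter.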

Works like a charm, and as usual, all the source code can be found here.