Thunderhead Explorer: January 2015

Sunday, January 18, 2015

Spark, Cassandra, Tessellation and ArcGIS

If you do BigData and have not heard of or used Spark, then… you are living under a rock!
When executing a Spark job, you can read data from all kinds of sources with URI schemes like file, hdfs and s3, and write data to all kinds of sinks with schemes like file and hdfs.
One BigData repository that I’ve been exploring is Cassandra. The DataStax folks released a Cassandra connector to Spark enabling the reading and writing of data from and to Cassandra.
I’ve posted on GitHub a sample project that reads the NYC trip data from a local file and tessellates a hexagonal mosaic with aggregates of pickup locations. That aggregation is persisted to Cassandra.
To visualize the aggregated mosaic, I extended ArcMap with an ArcPy toolbox that fetches the content of a Cassandra table and converts it to a set of features in a FeatureClass. The resulting FeatureClass is rendered with graduated symbology and added as a layer on the map, as follows:

As usual, all the source code is here.

Saturday, January 17, 2015

Scala Hexagon Tessellation

I've committed myself for 2015 to learn Scala, and I wish I had done that earlier, after 20 years of Java (wow, that makes me sound old :-). I've placed on GitHub a simple Scala-based library to compute the row/column pair of a planar x/y value on a hexagonal grid.
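
To give a feel for what mapping a planar x/y value to a hexagonal cell involves, here is a minimal sketch. It is not the code from the library: it assumes pointy-top hexagons, a `size` equal to the center-to-vertex distance, and axial coordinates where `q` plays the role of the column and `r` the row. The names `HexGrid`, `Cell`, `toCell` and `center` are made up for illustration.

```scala
object HexGrid {
  // Assumption: axial coordinates, q = column, r = row.
  final case class Cell(q: Int, r: Int)

  // Map a planar (x, y) point to the hexagon that contains it.
  // Assumption: pointy-top layout, size = distance from center to vertex.
  def toCell(x: Double, y: Double, size: Double): Cell = {
    // Fractional axial coordinates of the point.
    val qf = (math.sqrt(3.0) / 3.0 * x - y / 3.0) / size
    val rf = (2.0 / 3.0 * y) / size
    // Third cube coordinate, so that qf + rf + sf == 0.
    val sf = -qf - rf
    // Round each coordinate, then repair the one with the largest
    // rounding error so the three still sum to zero.
    val (rq, rr, rs) = (math.round(qf), math.round(rf), math.round(sf))
    val (dq, dr, ds) = (math.abs(rq - qf), math.abs(rr - rf), math.abs(rs - sf))
    if (dq > dr && dq > ds) Cell((-rr - rs).toInt, rr.toInt)
    else if (dr > ds) Cell(rq.toInt, (-rq - rs).toInt)
    else Cell(rq.toInt, rr.toInt)
  }

  // Planar center of a cell, the inverse of toCell for cell centers.
  def center(cell: Cell, size: Double): (Double, Double) =
    (size * math.sqrt(3.0) * (cell.q + cell.r / 2.0), size * 1.5 * cell.r)
}
```

With a `size` of 10, `HexGrid.toCell(8.0, -44.0, 10.0)` lands in `Cell(2, -3)`, whose center is at roughly (8.66, -45).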

I will be using that library in upcoming posts...
In the meantime, as usual, all the source code is available here.

Friday, January 2, 2015

Spark SQL DBF Library

Happy new year, all… It’s been a while. I was crazy busy from May until mid-December of last year implementing BigData geospatial solutions at client sites all over the world. I was in Japan a couple of times, Singapore, Malaysia and the UK, and I do not recall how many times I was in Redlands, Texas and DC. In addition, I’ve been investing heavily in Spark and Scala. I do not recall the last time I implemented a Hadoop MapReduce job!

One of the resolutions for the new year (in addition to the usual eating right, exercising more and the never-off-the-bucket-list biking Mt Ventoux) is to blog more. One post per month as a minimum.

So… to kick the year off right, I’ve implemented a library to query DBF files using Spark SQL. With the advent of Spark 1.2, a custom relation (table) can be defined as a SchemaRDD. A sample implementation is demonstrated by Databricks’ spark-avro: Avro files embed both schema and data, so it is relatively easy to convert them to a SchemaRDD. We in the geo community have such an “old” format that encapsulates schema and data: the DBF format. Using the Shapefile project, I was able to create an RDD using the Spark context Hadoop file API and an implementation of a DBFInputFormat. Then, using the DBFHeader field information, each record was mapped onto a Row to be processed by Spark SQL. This is mostly a work in progress and is far from being optimized, but it works!
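
To show why a DBF file lends itself to this, here is a rough, Spark-free sketch of how the header carries enough schema to decode each record. It is not the Shapefile project's DBFHeader or the library's code: the names `DbfSketch`, `DbfField`, `DbfHeader`, `readHeader` and `readRecord` are made up, and the byte offsets assume the classic dBASE III layout (32-byte file header, 32-byte field descriptors ended by 0x0D, fixed-width ASCII records with a leading deletion flag). In the Spark version, the resulting values would go into a Row instead of a Map.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets.US_ASCII

object DbfSketch {
  final case class DbfField(name: String, fieldType: Char, length: Int, decimals: Int)
  final case class DbfHeader(numRecords: Int, headerLength: Int,
                             recordLength: Int, fields: Vector[DbfField])

  // Parse the file header and the field descriptors that follow it.
  def readHeader(bytes: Array[Byte]): DbfHeader = {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val numRecords = buf.getInt(4)                // bytes 4-7
    val headerLength = buf.getShort(8) & 0xFFFF   // bytes 8-9
    val recordLength = buf.getShort(10) & 0xFFFF  // bytes 10-11
    // Descriptors start at offset 32 and stop at the 0x0D terminator.
    val fields = Iterator.iterate(32)(_ + 32)
      .takeWhile(off => off + 32 <= bytes.length && bytes(off) != 0x0D)
      .map { off =>
        // The name is 11 bytes, padded with NUL characters.
        val name = new String(bytes, off, 11, US_ASCII).takeWhile(_ != '\u0000').trim
        DbfField(name, (bytes(off + 11) & 0xFF).toChar,
                 bytes(off + 16) & 0xFF, bytes(off + 17) & 0xFF)
      }.toVector
    DbfHeader(numRecords, headerLength, recordLength, fields)
  }

  // Decode the record at `index` into field name -> value.
  // Only character and numeric types are handled in this sketch.
  def readRecord(header: DbfHeader, bytes: Array[Byte], index: Int): Map[String, Any] = {
    // The first byte of every record is the deletion flag, so skip it.
    val start = header.headerLength + index * header.recordLength + 1
    // Running offsets of each field within the record.
    val offsets = header.fields.scanLeft(start)(_ + _.length)
    header.fields.zip(offsets).map { case (f, off) =>
      val text = new String(bytes, off, f.length, US_ASCII).trim
      val value: Any = f.fieldType match {
        case 'N' if text.nonEmpty && f.decimals == 0 => text.toLong
        case 'N' | 'F' if text.nonEmpty => text.toDouble
        case _ => text
      }
      f.name -> value
    }.toMap
  }
}
```

Given the bytes of a whole file, `DbfSketch.readHeader(bytes)` yields the field list, and `DbfSketch.readRecord(header, bytes, 0)` decodes the first record against it.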

As usual, all the source code can be downloaded from here. Happy new year, all.