Sunday, May 26, 2013

Creating Spatial Crunch Pipelines

Josh Wills (@Josh_wills) introduced me to Apache Crunch, which is now a top-level project within the Apache Software Foundation. Crunch simplifies the coding of data aggregation pipelines on MapReduce.

Here is a proof-of-concept project that spatially enables a Crunch pipeline with a Point-In-Polygon function, joining a very large set of static point data against a small set of dynamic polygons.

Crunch simplifies the process so much that it comes down to a single line of code:

final PTable<Long, Long> counts = pipeline
        .readTextFile(args[0])
        .parallelDo(new PointInPolygon(), Writables.longs())
        .count();
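For reference, here is a minimal sketch of what a PointInPolygon DoFn could look like. It is not the implementation from the linked project (which takes no constructor arguments and presumably loads its polygons elsewhere, for example from the distributed cache); this version assumes each input line is an "x,y" pair, that the small polygon set is handed in as WKT strings, and that everything is in WGS84.

import java.util.ArrayList;
import java.util.List;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

import com.esri.core.geometry.Geometry;
import com.esri.core.geometry.GeometryEngine;
import com.esri.core.geometry.Point;
import com.esri.core.geometry.SpatialReference;

public class PointInPolygon extends DoFn<String, Long> {

    private final List<String> polygonWkts;     // small polygon set, shipped with the job
    private transient List<Geometry> polygons;  // parsed once per task in initialize()
    private transient SpatialReference sr;

    public PointInPolygon(List<String> polygonWkts) {
        this.polygonWkts = new ArrayList<String>(polygonWkts);
    }

    @Override
    public void initialize() {
        sr = SpatialReference.create(4326); // assuming WGS84
        polygons = new ArrayList<Geometry>();
        for (String wkt : polygonWkts) {
            polygons.add(GeometryEngine.geometryFromWkt(wkt, 0, Geometry.Type.Polygon));
        }
    }

    @Override
    public void process(String line, Emitter<Long> emitter) {
        // Assumes each input line is "x,y"; the real project may use a different format.
        String[] tokens = line.split(",");
        Point point = new Point(Double.parseDouble(tokens[0]), Double.parseDouble(tokens[1]));
        for (int i = 0; i < polygons.size(); i++) {
            if (GeometryEngine.contains(polygons.get(i), point, sr)) {
                emitter.emit((long) i); // emit the index of the containing polygon
                break;
            }
        }
    }
}

Because the value emitted per point is the index of its containing polygon, the count() call at the end of the pipeline yields the number of points per polygon.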

Crunch's strength is in processing BigData that cannot be stored by "traditional means", such as time series and graphs. It will be interesting to perform some kind of spatial and temporal analysis with it in a follow-up post.

As usual, all the source code can be found here.

Monday, May 20, 2013

Export FeatureClass to Hadoop, Run MapReduce, Visualize in ArcMap

In the previous post we launched a CDH cluster on EC2 in under 5 minutes. In this post, we will use that cluster to perform geospatial analytics in the form of MapReduce and visualize the result in ArcMap. ArcMap is one of the desktop tools that a geo-data scientist will use when working with and visualizing spatial data.  The use case I have in mind is something like the following: point data is streamed through, for example, GeoEventProcessor into Amazon S3. The user has a set of polygons in ArcGIS that needs to be spatially joined with that big data point content. The result of the big data join is linked back to the polygon set for symbol classification and visualization.

After editing the polygons in ArcMap, the user exports the feature class into HDFS using the ExportToHDFSTool.
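Under the hood, an export of this kind boils down to opening a stream on the cluster's file system and writing one JSON document per feature. This is only an illustrative sketch, not the ExportToHDFSTool itself; the HDFS URI, the target path, and the pre-serialized JSON features are all placeholders (the real tool uses the Spatial Framework for Hadoop serialization mentioned further down).

import java.io.IOException;
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class HdfsExportSketch {
    // featureJsonDocs would be produced by walking the feature class and
    // serializing each feature to JSON before calling this method.
    static void export(Iterable<String> featureJsonDocs, String hdfsUri, String targetPath) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", hdfsUri); // e.g. hdfs://<namenode-ip>:8020, taken from the cluster properties
        FileSystem fs = FileSystem.get(conf);
        PrintWriter writer = new PrintWriter(fs.create(new Path(targetPath)));
        try {
            for (String json : featureJsonDocs) {
                writer.println(json); // one JSON feature per line
            }
        } finally {
            writer.close();
        }
    }
}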

Using the new Esri Geometry API for Java, a MapReduce job is written as a GeoProcessing extension, such that it can be executed directly from within ArcMap. The result of the job is converted directly back into a feature class.

The tool expects the following parameters:

  • A Hadoop configuration in the form of a properties file.
  • A user name whose credentials and privileges Hadoop will use when executing the job.
  • The big data input to use as the source - in the above use case that will be the S3 data.
  • The small data polygon set to spatially join and aggregate.
  • The remote output folder - where the MapReduce job will put its results.
  • A list of jar files that will be used by the DistributedCache to augment the MapReduce classpath. That is because the Esri Geometry API for Java jar is not part of the Hadoop distribution, and we use the distributed cache mechanism to "push" it to each node (see the sketch below).
  • The output feature class to create when returning back the MapReduce job result from HDFS.
This code borrows resources from Spatial Framework for Hadoop for the JSON serialization.
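As for the distributed cache mechanism mentioned in the jar-list parameter above, pushing those jars to the task nodes looks roughly like this; the HDFS staging folder is a placeholder, and the real tool reads the jar list from the tool parameter.

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class ClasspathSketch {
    static void addJarsToClasspath(Configuration conf, Iterable<String> localJars) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        for (String localJar : localJars) {
            // Copy the local jar (e.g. the Esri Geometry API for Java) into HDFS,
            // then register it so every mapper and reducer gets it on its classpath.
            Path remoteJar = new Path("/tmp/libs/" + new File(localJar).getName());
            fs.copyFromLocalFile(new Path(localJar), remoteJar);
            DistributedCache.addFileToClassPath(remoteJar, conf);
        }
    }
}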

For the ArcGIS Java GeoProcessing Extension developers out there - I would like to point you to a couple of tricks that will make your development a bit easier and hopefully will make you bang your head a little bit less when dealing with ArcObjects - LOL!

I strongly recommend that you use Apache Maven as your build process. In addition to a good code structure and a unified repository, it comes with a great set of plugins to assist in the deployment process.  The first plugin is the maven-dependency-plugin. It copies all the runtime dependencies into a specified folder.  I learned the hard way that when ArcMap starts up, it introspects all the classes in all the jars in the extension folder for the @ArcGISExtension annotation declaration.  Now this is fine if you are writing a nice "hello world" GP extension, but in our case, where we depend on 30+ jars, well….ArcMap never starts. So, the solution is to put all the jars in a separate subfolder and declare a Class-Path entry in the main jar's manifest file that references all the dependencies. This is where the second plugin comes to the rescue. The maven-jar-plugin can be configured to automatically generate a manifest file containing a class path entry that references all the dependencies declared in the Maven POM.
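For reference, the relevant pom.xml fragments look roughly like this; the lib/ folder name and the package-phase binding are my own choices here, not necessarily what the linked project uses.

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>copy-dependencies</id>
      <phase>package</phase>
      <goals>
        <goal>copy-dependencies</goal>
      </goals>
      <configuration>
        <!-- copy all runtime jars into target/lib -->
        <outputDirectory>${project.build.directory}/lib</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <configuration>
    <archive>
      <manifest>
        <!-- generate a Class-Path manifest entry pointing at the lib/ subfolder -->
        <addClasspath>true</addClasspath>
        <classpathPrefix>lib/</classpathPrefix>
      </manifest>
    </archive>
  </configuration>
</plugin>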

Speaking of the classpath: so you think all is well and good with the above setup. There is one more thing you have to do to force the correct classpath when executing the GP task, and that is to set the current thread's context class loader to the system class loader, as follows:
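In code this is a single line; I put it at the top of the GP tool's execute method, before any Hadoop or job classes are touched (the exact placement is my own habit, not a requirement):

// Force the GP task to resolve classes through the system class loader.
Thread.currentThread().setContextClassLoader(ClassLoader.getSystemClassLoader());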


It took me 3 weeks to figure this out - so hopefully somebody will find this useful.

As usual, all the source code can be found here.

Saturday, May 18, 2013

BigData: Launch CDH on EC2 from ArcMap in under 5 minutes

well....after you get all the necessary software and certificates and... set up everything correctly :-)

Update: Regarding the above comment - you can download a zip file containing all the necessary jars and the toolbox, so that you do not have to package the project from scratch.

The idea here is that I would like an ArcGIS user to just push a button from within ArcMap and have a Cloudera-based Hadoop cluster started on Amazon EC2. From there on, the user can edit features in ArcMap and export them into that cluster to be used as input to a MapReduce job. The output of the MapReduce job is imported back into ArcMap for further analysis. This combination of SmallData (GeoDatabase) and BigData (Hadoop) is a great fusion in a geo-data scientist's arsenal. When done with the analysis, the user again pushes a button to destroy the cluster, thus paying only for what he/she used while having had access to elastic resources.

The following is a sequence of prerequisite steps that you need to execute to get going:

Create an Amazon Web Services account and set up your AWS security credentials. You need to remember your 'Access Key ID' and 'Secret Access Key'. They are ugly long strings that we will need later to create our Hadoop cluster on EC2.

Download the PuTTY Windows installer and run it. This will add two additional programs to your system: putty.exe and puttygen.exe. We will use puttygen.exe to generate an OpenSSH-formatted public and private key pair to enable secure tunneled communication with the soon-to-be-created AWS instances.

Here are the steps to create the keys:

  • Launch PuTTYGen.
  • Set the number of bits in generated keys to 2048.
  • Click the 'Generate' button and move your mouse randomly in the 'blank' area.
  • To save your private key, click the 'Conversions' menu option and select the 'Export OpenSSH key' menu item. An alert will be displayed, warning you that you are exporting this key without a passphrase. Press 'Yes' to save it.
  • To save your public key, select and copy the content of the text area. Open notepad, paste the content and save the file.
  • Finally, save this key in a PuTTY format by clicking on the 'Save private key' button.

Run the 'JavaConfigTool' located in C:\Program Files (x86)\ArcGIS\Desktop10.1\bin as administrator and adjust the Minimum heap size to 512, the Maximum heap size to 999, and the Thread stack size to 512.

The deployment of the Cloudera Distribution of Hadoop (CDH) on Amazon EC2 is executed by Apache Whirr. You can find great documentation here on how to run it from the command line. However, in my case, I wanted to execute the process from within ArcMap. So, I wrote a GeoProcessing extension that you can add to ArcCatalog, enabling the invocation of the deployment tool from within ArcMap.

Whirr depends on a set of configuration properties to launch the cluster. The values of some of the parameters depend on the previous two steps. Here is a snippet of my cluster configuration file:

whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,3 hadoop-datanode+hadoop-tasktracker
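The rest of the file ties back to the earlier steps - the AWS credentials and the OpenSSH key pair. It looks something along these lines (all values are placeholders for your own):

whirr.cluster-name=hadoopcluster
whirr.provider=aws-ec2
whirr.identity=<your Access Key ID>
whirr.credential=<your Secret Access Key>
whirr.private-key-file=<path to the OpenSSH private key exported from PuTTYgen>
whirr.public-key-file=<path to the matching public key>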

Please follow the instructions on Github to install the WhirrToolbox in ArcMap.

To launch the cluster, start ArcMap, show the catalog window, expand the 'WhirrToolbox' and double-click the 'LaunchClusterTool'. Select a cluster configuration file and click the 'Ok' button.

Now, this will take a little bit less than 5 minutes.

To see the Whirr progress, disable background processing in the GeoProcessing Options dialog by unchecking the 'Enable' checkbox.

Once the cluster is launched successfully, Whirr will create a folder named '.whirr' in your home folder. In addition, underneath that folder, a subfolder will be created and will be named after the value of 'whirr.cluster-name' property in the configuration file.

The file 'hadoop-site.xml' contains the Hadoop cluster properties.

I've written yet another GP tool (ClusterPropertiesTool) that converts this XML file into a properties file. This properties file will be used by the subsequent tools when exporting and importing feature classes to and from the cluster and when running MapReduce jobs.
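The conversion itself is straightforward. Here is a sketch of the idea (the cluster name and the output file name are placeholders, not necessarily what the tool uses): load the XML with Hadoop's Configuration class and dump every entry into a flat properties file.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public final class ClusterPropertiesSketch {
    public static void main(String[] args) throws IOException {
        // Read the Whirr-generated Hadoop site file without loading the default resources.
        Configuration conf = new Configuration(false);
        conf.addResource(new Path(System.getProperty("user.home") + "/.whirr/hadoopcluster/hadoop-site.xml"));

        // Copy every key/value pair into a flat properties file for the other GP tools.
        Properties props = new Properties();
        for (Map.Entry<String, String> entry : conf) {
            props.setProperty(entry.getKey(), entry.getValue());
        }
        OutputStream out = new FileOutputStream("hadoopcluster.properties");
        try {
            props.store(out, "Converted from hadoop-site.xml");
        } finally {
            out.close();
        }
    }
}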

Now, all communication with the cluster is secure and should be tunneled through an SSH connection. To test this, we will use PuTTY. But before starting PuTTY, we need the name node IP. This can be found in the 'instances' file in the '.whirr' subfolder. Make sure to get the first IP in the 'namenode' row, not the second, private one.

Start PuTTY and enter the name node IP in the 'Host Name (or IP address)' field.

In the Connection / SSH / Auth form, browse for the PuTTY-formatted private key that we saved previously.

In the Connection / SSH / Tunnels form, select the 'Dynamic' radio button, enter '6666' in the 'Source port' field, and click the 'Add' button.

Click the 'Open' button, and that should connect you to your cluster.

Enter your username (this is your Windows login username) and hit enter, and you should see a welcome banner and a bash shell prompt.

Once you are done with the cluster, the DestroyClusterTool GP tool will destroy it.

In the next post, we will use that cluster to export and import feature classes and to run MapReduce jobs.

As usual, all the source code can be found here.