Code Graphs of 5 Top Open Source Data Projects

12 Nov 2014

Hadoop
Spark
Cassandra
Neo4j
Elasticsearch

Intro

As a software engineer, I have spent (I figure) more than 10,000 hours writing code, deconstructing simple to complex problems into pieces then weaving them carefully together to create a program. It occurred to me that any piece of code that is broken into different parts which call each other is itself a graph or network. As my software career has also been firmly rooted in open-source, I wondered what these graphs would look like for some of the most popular open-source projects. Being also a data engineer, and someone who has focused on development in the Java language, I decided to look at projects centered around data written in Java.

There are really two parts to doing this. First, find a way to parse Java code to extract its network of logic. And second, to visualize then analyze it.

Credit

Fortunately, the difficult part of #1 has already been done. The github project java-callgraph by Dr. Georgios Gousios will read through a jar file, and output a file describing its class and method references. Great! I also want to thank the tremendous amount of effort that has been put into these open-source projects. Without them, most of what I do in my job would not be possible.

Extracting The Network

Once I compiled the java-callgraph project and ran it against a jar file, I then had to write code which could parse that output and save it in a graph file format that Gephi could read for the visualization. A couple hours of coding/testing later and I had a nice flexible tool that could do this (as well as store it in any graph database I want that's supported by the Tinkerpop Framework).

Simply loading the entire graph leads to two main network components (two separated graph sections). That's because java-callgraph maps two types of connections: class dependencies, and method calls. For simplicity reasons, I chose to just focus on the class dependency aspect of the code graph where a more pronounced topology appears. Another post sometime looking at the method call graphs would be interesting. I also am filtering the nodes to only those authored by the project itself rather than dependencies it has to the native Java classes or other libraries.

Visualization

I'm doing a pretty simple visualization: coloring based on simple community detection, force separation layout, and sizing the nodes based on degree (in this case, in-degree to highlight the classes most frequently referenced).

Results

Here are the results.

Hadoop is a wildly popular framework that provides reliable, scalable, distributed computing. It has had a very disruptive effect on how data is used across diverse fields and industries. Often when people use the buzzword "Big Data" the technology they are describing includes Hadoop in some way.

For this one, I used the hadoop-core v4.3.7 library.

hadoop code graph network

The top ten most referenced classes are:

org.apache.hadoop.conf.Configuration
org.apache.hadoop.fs.Path
org.apache.hadoop.fs.FileSystem
org.apache.hadoop.util.StringUtils
org.apache.hadoop.mapred.JobConf
org.apache.hadoop.io.Writable
org.apache.hadoop.io.Text
org.apache.hadoop.security.UserGroupInformation
org.apache.hadoop.classification.InterfaceAudience
org.apache.hadoop.util.ReflectionUtils

Prominent communities:

The pink-purple community on the left roughly correlates to the HDFS, datanode/namenode code.
The green community on the right has much of the map-reduce server code.
The yellow community at the top is mostly the map-reduce client and tracking code.
The red community at the top has a lot of map-reduce lib code.
The maroon community in the lower-left is the record.compiler package which is actually it's own isolated component.
And the violet community between the pink and maroon communities contains the HDFS web classes.

Spark brings many of the same concepts of reliable, scalable, distributed computing and adapts it to the reality of falling RAM prices. As a result you can experience 10-100x performance improvements in the latency of distributed jobs which before worked primarily against disk in a Hadoop environment. It isn't a competitor to Hadoop as much as it is an augmentation of its capabilities; and it is showing tremendous potential to be just as transformative as Hadoop was itself.

Here's the class import graph for spark-core v2.10-1.1.0:

spark code graph network

This has some unique characteristics in its graph with nodes which fan out to many other nodes. I believe this can be attributed to its being written in Scala and the way that Scala code is packaged in jars.

Here are the top ten most referenced classes:

org.apache.spark.Logging
org.apache.spark.rdd.RDD (Note: Spark is an implementation of the RDD specification)
org.apache.spark.util.Utils
org.apache.spark.util.JsonProtocol
org.apache.spark.storage.BlockManager
org.apache.spark.SparkConf
org.apache.spark.scheduler.DAGScheduler
org.apache.spark.ui.jobs.UIData
org.apache.spark.SparkContext
org.apache.spark.network.ConnectionManager

Some prominent communities:

The green community on the right (not the top) generally has most of the RDD classes.
The pink/purple community below it is mostly made up of the spark.deploy package with master/worker code.
The purple community below it is made up of spark.ui package.
The yellow community in the upper left is mostly the spark.storage package.
The pink community on the left has the spark.scheduler package.
The darker purple community on the bottom has a lot of classes involved in tracking/reporting job progress and task execution.

Cassandra is a distributed, reliable, and highly scalable columnar storage engine which grew out of the Amazon Dynamo specification. It too has been transformative and is used by many companies and institutions in many different industries and fields.

I looked at Cassandra v2.1.1 to create this class import graph:

cassandra code graph network

The top ten most referenced classes are:

org.apache.cassandra.exceptions.InvalidRequestException
org.apache.cassandra.utils.ByteBufferUtil
org.apache.cassandra.config.CFMetaData
org.apache.cassandra.db.ColumnFamilyStore
org.apache.cassandra.db.marshal.AbstractType
org.apache.cassandra.db.ColumnFamily
org.apache.cassandra.utils.FBUtilities
org.apache.cassandra.config.DatabaseDescriptor
org.apache.cassandra.db.composites.CellNameType
org.apache.cassandra.db.Cell

And a few of its communities:

The blue community in the upper-left, is made up of cassandra.transport and cassandra.cql.
The green community in the center contains cassandra.thrift, cassandra.hadoop, and cassandra.cli.
The yellow-ish orange community in the lower right is mostly cassandra.db, cassandra.io, and cassandra.utils.memory.
The orange community on the right is a mixture of packages involving SSTables and compaction.
The red community on the bottom is mostly NodeTool classes.
The reddish pink community on the left has cassandra.cql3 classes.

Neo4j is a leading graph database. Graph databases have gained popularity recently as highly interconnected datasets and use cases to understand those connections have grown. Big drivers for this are the rise of social media, increasingly interconnected datasets, and advances in systems sciences. This is the "problem of organized complexity" described by Warren Weaver. I go more into this in my about page. I've also done an intro to graph databases which talks about what graph databases, gives examples of how they've been used, and frames the history of graph/network science (see first part of that video).

For Neo4j, I used v2.1.4 of it's kernel library:

neo4j code graph network

The top most referenced classes are:

org.neo4j.kernel.impl.util.StringLogger
org.neo4j.helpers.Function
org.neo4j.helpers.Predicate
org.neo4j.helpers.collection.IteratorUtil
org.neo4j.helpers.collection.Iterables
org.neo4j.kernel.impl.nioneo.store.StoreChannel
org.neo4j.kernel.impl.nioneo.store.Record
org.neo4j.kernel.impl.nioneo.store.FileSystemAbstraction
org.neo4j.kernel.impl.transaction.xaframework.LogEntry
org.neo4j.graphdb.Path

Some of its communities are:

The green community on the bottom has much of the kernel.api and kernel.impl packages.
The red community on the right has neo4j.unsafe.impl.batchimport, kernel.impl.nioneo, and neo4j.helpers.
The orange community above that has a lot of kernel.impl.storemigration, kernel.impl.transaction, and kernel.impl.nioneo.store
The blue community in the upper-right has kernel.impl.transaction and kernel.impl.nioneo.xa.
Moving to the left, the yellow community has much of the graphdb.traversal classes.
Below that, the sea-green community has classes in the top-level neo4j.graphdb package, also has some in graphdb.traversal, and contains kernel.impl.core.

Elasticsearch is a distributed, real-time search and analytics engine. It has become quite popular for real-time log analysis and visualization and is commonly combined with logstash and kibana (making the ELK stack).

This is what the class import graph looks like for Elasticsearch 1.4.0:

elasticsearch code graph network

Top classes are:

org.elasticsearch.common.io.stream.StreamOutput
org.elasticsearch.common.io.stream.StreamInput
org.elasticsearch.common.xcontent.XContentBuilder
org.elasticsearch.common.logging.ESLogger
org.elasticsearch.common.xcontent.ToXContent
org.elasticsearch.common.xcontent.ToXContent$Params
org.elasticsearch.common.settings.Settings
org.elasticsearch.ElasticsearchException
org.elasticsearch.common.base.Preconditions
org.elasticsearch.ElasticsearchIllegalArgumentException

Some of its communities:

The yellow community on the bottom is mostly elasticsearch.common.netty.
The orange community on the left is elasticsearch.search.
The blue community is a combination of elasticsearch.action, elasticsearch.cluster, elasticsearch.discovery, elasticsearch.gateway, elasticsearch.index/indecies, elasticsearch.river, elasticsearch.snapshots, and elasticsearch.transport (a higher resolution of community detection would probably break these apart).
The red community on the right has the client and REST interface classes.
The purple community on the right is elasticsearch.common.
The green community in the middle is mixture of elsticsearch.index/indecies, elasticsearch.action, elasticsearch.common, elasticsearch.monitor, and elasticsearch.search.

The Elasticsearch communities are less segmented than many of these other code bases.

Conclusion

Doing this visualization and analysis was fun. What's interesting to think about is the people working on these projects having a portion of these graphs in their mind as they design and work with the code. This is, in my opinion, why programming requires so much concentration. Maintaining a picture in your mind of a part of these networks while you change it and understand how that will affect these relationships is very demanding.

I've done this with 5 other popular open-source Java libraries (though those aren't databases). I'll post those results in the future sometime.

AllThingsGraphed.com Exploring the graphs that surround us