
Getting Started


Faunus requires access to a Hadoop cluster. If a Hadoop cluster is readily available, then Faunus is easy to get up and running. If not, the provided Whirr recipe can be leveraged to spawn a Hadoop cluster on Amazon EC2 (see the following tutorial), or a local instance of Hadoop can be run in pseudo-cluster mode (see the following tutorial). This section discusses the pseudo-cluster approach for users just getting started with Hadoop (and Faunus). More experienced Hadoop users can easily adapt the examples below to work with a non-local Hadoop cluster (e.g. by pointing the Hadoop configuration at the accessible cluster via $HADOOP_CONF_DIR).
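For example, assuming the client configuration for an existing cluster lives at a hypothetical /etc/hadoop/conf, exporting the variable before invoking Hadoop (and thus Faunus) is sufficient:

export HADOOP_CONF_DIR=/etc/hadoop/conf  # hypothetical path to the cluster's client configuration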

Installing Hadoop

Faunus has been written and tested with Hadoop 1.0.3. For using Faunus locally on a single machine, a Hadoop pseudo-cluster can be used. Instructions to set up a pseudo-cluster are provided in the Hadoop documentation. Once the pseudo-cluster has been set up, it can be started at any time using $HADOOP_HOME/bin/start-all.sh. Once running, Hadoop can be issued jobs (e.g. Faunus jobs).
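For reference, pseudo-distributed mode boils down to a few entries in $HADOOP_HOME/conf plus a formatted namenode. The snippets below mirror the single-node setup section of the Hadoop 1.x documentation and are not Faunus-specific; consult that documentation for the authoritative steps.

<!-- conf/core-site.xml: point the default filesystem at a local HDFS instance -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: a single-node cluster needs only one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run the JobTracker locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

Before the first start, format HDFS with hadoop namenode -format. After formatting, the daemons can be started as shown below.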

~$ start-all.sh
starting namenode, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-namenode-markolaptop.local.out
localhost: starting datanode, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-datanode-markolaptop.local.out
localhost: starting secondarynamenode, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-secondarynamenode-markolaptop.local.out
starting jobtracker, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-jobtracker-markolaptop.local.out
localhost: starting tasktracker, logging to /Applications/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-marko-tasktracker-markolaptop.local.out

To ensure that the installation is running properly, do an ls on Hadoop’s distributed file system, HDFS.

faunus$ hadoop fs -ls /
Found 2 items
drwxr-xr-x   - marko supergroup          0 2012-07-26 11:55 /tmp
drwxr-xr-x   - marko supergroup          0 2012-07-26 11:55 /user

Installing Faunus

Faunus can be downloaded from the downloads section of this project. Faunus provides a Gremlin implementation that is used to construct a Faunus job (a chain of one or more MapReduce jobs). The Gremlin REPL can be started with bin/gremlin.sh.

faunus$ bin/gremlin.sh 

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin>

Graph of the Gods Examples

Faunus distributes with a toy graph called The Graph of the Gods (represented in Faunus’ GraphSON format). This graph has 12 vertices and 17 edges and denotes people, places, monsters, and their various relationships to one another. The raw GraphSON representation is provided below.

{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"father","_id":12,"_outV":1}]}
{"name":"jupiter","type":"god","_id":1,"_outE":[{"_label":"lives","_id":13,"_inV":4},{"_label":"brother","_id":16,"_inV":3},{"_label":"brother","_id":14,"_inV":2},{"_label":"father","_id":12,"_inV":0}],"_inE":[{"_label":"brother","_id":17,"_outV":3},{"_label":"brother","_id":15,"_outV":2},{"_label":"father","_id":24,"_outV":7}]}
{"name":"neptune","type":"god","_id":2,"_outE":[{"_label":"lives","_id":20,"_inV":5},{"_label":"brother","_id":19,"_inV":3},{"_label":"brother","_id":15,"_inV":1}],"_inE":[{"_label":"brother","_id":18,"_outV":3},{"_label":"brother","_id":14,"_outV":1}]}
{"name":"pluto","type":"god","_id":3,"_outE":[{"_label":"pet","_id":23,"_inV":11},{"_label":"lives","_id":21,"_inV":6},{"_label":"brother","_id":17,"_inV":1},{"_label":"brother","_id":18,"_inV":2}],"_inE":[{"_label":"brother","_id":19,"_outV":2},{"_label":"brother","_id":16,"_outV":1}]}
{"name":"sky","type":"location","_id":4,"_inE":[{"_label":"lives","_id":13,"_outV":1}]}
{"name":"sea","type":"location","_id":5,"_inE":[{"_label":"lives","_id":20,"_outV":2}]}
{"name":"tartarus","type":"location","_id":6,"_inE":[{"_label":"lives","_id":21,"_outV":3},{"_label":"lives","_id":22,"_outV":11}]}
{"name":"hercules","type":"demigod","_id":7,"_outE":[{"_label":"mother","_id":25,"_inV":8},{"time":1,"_label":"battled","_id":26,"_inV":9},{"time":2,"_label":"battled","_id":27,"_inV":10},{"time":12,"_label":"battled","_id":28,"_inV":11},{"_label":"father","_id":24,"_inV":1}]}
{"name":"alcmene","type":"human","_id":8,"_inE":[{"_label":"mother","_id":25,"_outV":7}]}
{"name":"nemean","type":"monster","_id":9,"_inE":[{"time":1,"_label":"battled","_id":26,"_outV":7}]}
{"name":"hydra","type":"monster","_id":10,"_inE":[{"time":2,"_label":"battled","_id":27,"_outV":7}]}
{"name":"cerberus","type":"monster","_id":11,"_outE":[{"_label":"lives","_id":22,"_inV":6}],"_inE":[{"_label":"pet","_id":23,"_outV":3},{"time":12,"_label":"battled","_id":28,"_outV":7}]}

To use this sample graph in the examples that follow, it must first be placed into HDFS. This can be done using Gremlin or the standard Hadoop CLI.

Gremlin REPL

gremlin> hdfs.copyFromLocal('data/graph-of-the-gods.json','graph-of-the-gods.json')
==>null
gremlin> hdfs.ls()          
==>rw-r--r-- marko supergroup 2028 graph-of-the-gods.json
gremlin> 

Hadoop CLI

faunus$ hadoop fs -put data/graph-of-the-gods.json graph-of-the-gods.json
faunus$ hadoop fs -ls
Found 1 item
-rw-r--r--   1 marko supergroup       2028 2012-07-26 11:55 /user/marko/graph-of-the-gods.json

A Graph Statistic and a Graph Derivation

Vertex Property Value Distribution (Statistic)

With the graph in HDFS, a reference is made to it in bin/faunus.properties. Faunus already has all the properties configured to point to graph-of-the-gods.json, so it is possible to simply run the desired Gremlin traversal.
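For orientation, the relevant entries in bin/faunus.properties look roughly like the following. The key names and class names are recalled from the Faunus distribution and should be verified against the bundled file rather than taken as authoritative.

# graph input: the GraphSON file previously copied into HDFS
faunus.graph.input.format=com.thinkaurelius.faunus.formats.graphson.GraphSONInputFormat
faunus.input.location=graph-of-the-gods.json
# graph and side-effect output
faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
faunus.output.location=output
faunus.output.location.overwrite=true

To determine the distribution of vertex types (a simple graph statistic), run the following Gremlin traversal.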

gremlin> g = FaunusFactory.open('bin/faunus.properties')
==>faunusgraph[graphsoninputformat]
gremlin> g.V.type.groupCount
12/09/16 14:04:16 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
12/09/16 14:04:16 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.sideeffect.ValueGroupCountMapReduce.Map, com.thinkaurelius.faunus.mapreduce.sideeffect.ValueGroupCountMapReduce.Reduce]
12/09/16 14:04:16 INFO mapreduce.FaunusCompiler: Job data location: output/job-0
12/09/16 14:04:17 INFO input.FileInputFormat: Total input paths to process : 1
12/09/16 14:04:17 INFO mapred.JobClient: Running job: job_201209160849_0033
12/09/16 14:04:18 INFO mapred.JobClient:  map 0% reduce 0%
...
==>demigod	1
==>god	3
==>human	1
==>location	3
==>monster	3
==>titan	1

Let’s examine the command more closely.

g.V.type.groupCount
  • g: the graph pointed to by bin/faunus.properties
  • V: for all the vertices in the graph
  • type: get the type property value of those vertices
  • groupCount: count the number of times each unique type is seen
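
The same pattern extends to other properties and to edges. For instance, assuming the label step behaves on edges the way type behaved on vertices above (as it does in other Gremlin implementations), the distribution of edge labels could be computed with:

g.E.label.groupCount
  • E: for all the edges in the graph
  • label: get the label of those edges
  • groupCount: count the number of times each unique label is seen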

When the Faunus job completes (which could entail many MapReduce jobs), the results are output to the terminal and are available in the output directory (the location is set in bin/faunus.properties).

gremlin> hdfs.ls()
==>rw-r--r-- marko supergroup 2028 graph-of-the-gods.json
==>rwxr-xr-x marko supergroup 0 (D) output
gremlin> hdfs.ls('output')
==>rwxr-xr-x marko supergroup 0 (D) job-0
gremlin> hdfs.ls('output/job-0')
==>rw-r--r-- marko supergroup 0 _SUCCESS
==>rwxr-xr-x marko supergroup 0 (D) _logs
==>rw-r--r-- marko supergroup 435 graph-m-00000.bz2
==>rw-r--r-- marko supergroup 80 sideeffect-m-00000.bz2
gremlin> hdfs.head('output/job-0/sideeffect*')
==>demigod	1
==>god	3
==>human	1
==>location	3
==>monster	3
==>titan	1
gremlin>

Inference using Paths (Derivation)

A derivation is some mutation of the graph, whether as simple as removing vertices or edges or as complex as inferring new edges from the explicit edges in the graph. With The Graph of the Gods, grandfather edges can be derived from father edges. This type of derivation is known as an inference.

From the Gremlin REPL, enter the following traversal.

g.V.as('x').out('father').out('father').linkIn('x','grandfather')
  • g: the graph pointed to by bin/faunus.properties
  • V: for all the vertices in the graph
  • as('x'): name the current elements “x”
  • out('father'): traverse out over father edges
  • out('father'): traverse out over father edges
  • linkIn('x','grandfather'): create incoming grandfather edges from vertices at step “x”

The derived graph can be pulled to local disk or analyzed in HDFS.
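For example, using the standard Hadoop CLI (mirroring the -put used earlier), the compressed result file can be pulled down for local inspection; the path below assumes the job layout shown in the next listing:

faunus$ hadoop fs -get output/job-2/part-r-00000.bz2 graph-of-the-gods-derived.json.bz2
faunus$ bzcat graph-of-the-gods-derived.json.bz2

Alternatively, the output can be inspected directly in HDFS: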

gremlin> hdfs.ls('output')                                                
==>rwxr-xr-x marko supergroup 0 (D) job-0
==>rwxr-xr-x marko supergroup 0 (D) job-1
==>rwxr-xr-x marko supergroup 0 (D) job-2
gremlin> hdfs.ls('output/job-2')
==>rw-r--r-- marko supergroup 0 _SUCCESS
==>rwxr-xr-x marko supergroup 0 (D) _logs
==>rw-r--r-- marko supergroup 449 part-r-00000.bz2
gremlin> hdfs.head('output/job-2')
==>{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"grandfather","_id":-1,"_outV":7},{"_label":"father","_id":12,"_outV":1}]}
==>{"name":"jupiter","type":"god","_id":1,"_outE":[{"_label":"lives","_id":13,"_inV":4},{"_label":"brother","_id":16,"_inV":3},{"_label":"brother","_id":14,"_inV":2},{"_label":"father","_id":12,"_inV":0}],"_inE":[{"_label":"brother","_id":17,"_outV":3},{"_label":"brother","_id":15,"_outV":2},{"_label":"father","_id":24,"_outV":7}]}
==>{"name":"neptune","type":"god","_id":2,"_outE":[{"_label":"lives","_id":20,"_inV":5},{"_label":"brother","_id":19,"_inV":3},{"_label":"brother","_id":15,"_inV":1}],"_inE":[{"_label":"brother","_id":18,"_outV":3},{"_label":"brother","_id":14,"_outV":1}]}
==>{"name":"pluto","type":"god","_id":3,"_outE":[{"_label":"pet","_id":23,"_inV":11},{"_label":"lives","_id":21,"_inV":6},{"_label":"brother","_id":17,"_inV":1},{"_label":"brother","_id":18,"_inV":2}],"_inE":[{"_label":"brother","_id":19,"_outV":2},{"_label":"brother","_id":16,"_outV":1}]}
==>{"name":"sky","type":"location","_id":4,"_inE":[{"_label":"lives","_id":13,"_outV":1}]}
==>{"name":"sea","type":"location","_id":5,"_inE":[{"_label":"lives","_id":20,"_outV":2}]}
==>{"name":"tartarus","type":"location","_id":6,"_inE":[{"_label":"lives","_id":21,"_outV":3},{"_label":"lives","_id":22,"_outV":11}]}
==>{"name":"hercules","type":"demigod","_id":7,"_outE":[{"_label":"mother","_id":25,"_inV":8},{"_label":"grandfather","_id":-1,"_inV":0},{"time":1,"_label":"battled","_id":26,"_inV":9},{"time":2,"_label":"battled","_id":27,"_inV":10},{"time":12,"_label":"battled","_id":28,"_inV":11},{"_label":"father","_id":24,"_inV":1}]}
==>{"name":"alcmene","type":"human","_id":8,"_inE":[{"_label":"mother","_id":25,"_outV":7}]}
==>{"name":"nemean","type":"monster","_id":9,"_inE":[{"time":1,"_label":"battled","_id":26,"_outV":7}]}
==>{"name":"hydra","type":"monster","_id":10,"_inE":[{"time":2,"_label":"battled","_id":27,"_outV":7}]}
==>{"name":"cerberus","type":"monster","_id":11,"_outE":[{"_label":"lives","_id":22,"_inV":6}],"_inE":[{"_label":"pet","_id":23,"_outV":3},{"time":12,"_label":"battled","_id":28,"_outV":7}]}

To conclude, further computations can be run over the last derivation using g.getNextGraph(). This method returns a new graph whose input is the output of the previous graph (with the respective input/output formats and HDFS directories carried over).

gremlin> g                                                 
==>faunusgraph[graphsoninputformat]
gremlin> h = g.getNextGraph()
==>faunusgraph[graphsoninputformat]
gremlin> h.E.has('label','grandfather').keep.count()
12/09/16 14:24:42 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
12/09/16 14:24:42 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.EdgesMap.Map, com.thinkaurelius.faunus.mapreduce.filter.PropertyFilterMap.Map, com.thinkaurelius.faunus.mapreduce.sideeffect.CommitEdgesMap.Map, com.thinkaurelius.faunus.mapreduce.util.CountMapReduce.Map, com.thinkaurelius.faunus.mapreduce.util.CountMapReduce.Reduce]
...
==>1

The traversal above removes all edges except grandfather edges and then counts the remaining edges. As demonstrated, there is only 1 grandfather edge.

h.E.has('label','grandfather').keep.count()
  • h: the graph pointing to the output of g
  • E: for all the edges in the graph
  • has('label','grandfather'): allow only edges labeled grandfather to pass
  • keep: keep the elements at the current step and delete all others
  • count: count the number of elements at the current step

gremlin> hdfs.head('output_/*/graph*')    
==>{"name":"saturn","type":"titan","_id":0,"_inE":[{"_label":"grandfather","_id":-1,"_outV":7}]}
==>{"name":"jupiter","type":"god","_id":1}
==>{"name":"neptune","type":"god","_id":2}
==>{"name":"pluto","type":"god","_id":3}
==>{"name":"sky","type":"location","_id":4}
==>{"name":"sea","type":"location","_id":5}
==>{"name":"tartarus","type":"location","_id":6}
==>{"name":"hercules","type":"demigod","_id":7,"_outE":[{"_label":"grandfather","_id":-1,"_inV":0}]}
==>{"name":"alcmene","type":"human","_id":8}
==>{"name":"nemean","type":"monster","_id":9}
==>{"name":"hydra","type":"monster","_id":10}
==>{"name":"cerberus","type":"monster","_id":11}