
Graph OLAP with Neo4j

Alternative title: "The Graph Kimball Project" (sorry, I could not resist).

An attempt to model an OLAP cube with Neo4j.

As a former OLAP developer, and a recent adopter of graph databases, I was curious to see how Neo4j could handle Online Analytical Processing (OLAP) structures dedicated to reporting.

We will experiment with these concepts by creating an OLAP cube, using only the Cypher language.

Made with Neo4j and Cypher.

Table of Contents:

  • Dataset
  • Cube Structure
  • Storage Implementation
  • Facts Consolidation
  • Data Import
  • Implementation Details
  • Querying
  • Aggregates Store
  • Querying Aggregates
  • Conclusion

Dataset

We will use a sample dataset stored in tab-delimited files named sales.2016.tsv and sales.2017.tsv.

Top and bottom lines:

Region         Country    Year  Month  Product   Category  Sales    Units
North America  USA        2016  1      Piano     Strings   4251.00  693
North America  USA        2016  1      Violin    Strings   2789.00  571
North America  USA        2016  1      Cello     Strings   3908.00  311
North America  USA        2016  1      Guitar    Strings   5034.00  284
...
South America  Argentina  2017  12     Clarinet  Winds     4576.00  329
South America  Argentina  2017  12     Trumpet   Brass     3540.00  327
South America  Argentina  2017  12     Trombone  Brass     3646.00  599
South America  Argentina  2017  12     Tuba      Brass     6971.00  465
South America  Argentina  2017  12     Sax       Winds     5718.00  236

In order to make it easy to understand, some explanations will be based on a tiny dataset (sales.small.tsv):

Region  Country  Year  Month  Product  Category  Sales  Units
Europe  France   2016  1      Piano    Strings   1000   3
Europe  France   2016  1      Piano    Strings   2000   5
Europe  France   2016  1      Violin   Strings   100    1
Europe  France   2016  1      Violin   Strings   200    2

Cube Structure

Our cube will be made of three dimensions and two measures.

Object   Type       Components                  Description
Product  Dimension  Levels = Category, Product  Products sold.
Place    Dimension  Levels = Region, Country    Locations where products were sold.
Time     Dimension  Levels = Year, Month        Dates when events (sales) occurred.
Measure  Measure    Measures = Sales, Units     Sales in Euros, units sold.

The Star Schema model of our cube:

Storage Implementation

Each component (levels and measures) will be stored as nodes.

Nodes will be linked by relationships to represent how they are hierarchically connected.

Dimension  Level     Label
Product    Category  :Category
Product    Product   :Product
Place      Region    :Region
Place      Country   :Country
Time       Year      :Year
Time       Month     :Month
Measures   -         :Measure

This gives us the following meta-model:

Example:

Graph generated by importing the first 3 lines (including the header) of the tiny dataset.

  • For dimensions, there will be one node per member of each level.
  • For facts, there will be one node per event for a given product, place and time (i.e. leaf level members of corresponding dimensions).

In the following screenshot, the Strings product category appears once, and is connected to two products (Piano and Violin). The same principle applies to the Time and Place dimensions.

Graph generated by importing the tiny dataset.

There are two fact nodes, corresponding to two events which occurred in France in January 2016, for Piano and Violin.

Facts Consolidation

Multiple events can occur at the same time, for the same product and the same country (see the two Piano rows and the two Violin rows of our tiny dataset).

At import time, facts must be aggregated into a single :Measure node containing cumulated values (using the Sum aggregation operator).
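For instance, the two Piano rows of the tiny dataset consolidate into a single node, (:Measure {sales: 3000.0, units: 8}), since 1000 + 2000 = 3000 and 3 + 5 = 8.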

Data Import

We will use the LOAD CSV clause to import our data.

Before starting, let's initialize our database by creating a few indexes:

// Dimensions
CREATE INDEX ON :Country(country);
CREATE INDEX ON :Product(product);
CREATE INDEX ON :Month(month);
// Measures
CREATE INDEX ON :Measure(mid);
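On the Neo4j 3.x versions this CREATE INDEX syntax belongs to, the resulting indexes can be listed with the built-in procedure:

// List the indexes and check they are online (Neo4j 3.x)
CALL db.indexes();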

Our import script must support:

  • Incremental data import, as data can come from multiple data sources, at different times, with overlapping events.
  • Slowly changing dimensions: we will keep it simple by overwriting property values.

First part of the import script:

UNWIND ["sales.2016.tsv", "sales.2017.tsv"] AS sourceFile
//USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS
    FROM "file:///sourceFile" + sourceFile
    AS row
    FIELDTERMINATOR '\t'

The second line is commented out, as it doesn't seem possible to combine the UNWIND or FOREACH clauses with the USING PERIODIC COMMIT query hint.

Dimension Import

For each dimension level, we will create one node per member. Hierarchically related nodes will be connected by a relationship.

Example for Europe and France (respectively members of Region and Country levels of Place dimension): (:Country {country: 'France'})-[:IN_REGION]->(:Region {region: 'Europe'}).

Place dimension:

MERGE (r:Region {region: row.Region})
MERGE (c:Country {country: row.Country})
MERGE (c)-[:IN_REGION]->(r)

Product dimension:

MERGE (cat:Category {category: row.Category})
MERGE (prod:Product {product: row.Product})
MERGE (prod)-[:IN_CATEGORY]->(cat)

Time dimension:

MERGE (y:Year {year: toInteger(row.Year)})
MERGE (m:Month {year: toInteger(row.Year), month: toInteger(row.Month)})
MERGE (m)-[:OF_YEAR]->(y)

For the Month level (:Month node), a year property is added in order to disambiguate months from different years (March 2016 is different from March 2017).
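As a quick illustration (assuming the full dataset has been imported), months from different years resolve to distinct nodes:

// March 2016 and March 2017 are two different :Month nodes
MATCH (m:Month {month: 3})
RETURN m.year AS Year, m.month AS Month;
// Expected output: one row per year (2016 and 2017)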

Facts Import

The import script continues with the import of measures:

WITH
    c, m, prod, row,
    c.country + '_' + toString(m.year) + '_' + toString(m.month) + '_' + prod.product AS MeasureID
// Facts
MERGE (meas:Measure {mid: MeasureID})
ON CREATE
    // Create new fact node
    SET meas.sales = toFloat(row.Sales),
    meas.units = toInteger(row.Units)
ON MATCH
    // Update fact node
    SET meas.sales = meas.sales + toFloat(row.Sales),
    meas.units = meas.units + toInteger(row.Units)
// Foreign keys
MERGE (meas)-[:IN_PLACE]->(c)
MERGE (meas)-[:AT_TIME]->(m)
MERGE (meas)-[:FOR_PRODUCT]->(prod)
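For the first row of the tiny dataset, MeasureID evaluates to France_2016_1_Piano. Both Piano rows therefore resolve to the same :Measure node: the first one triggers the ON CREATE branch, the second one the ON MATCH branch, which cumulates the values.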

The full import Cypher script can be found in the create.v2.cypher file.

Output after execution:

Added 4285 labels, created 4285 nodes, set 12757 properties, created 12723 relationships, completed after 4277 ms.
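As a minimal sanity check (the exact figures depend on the dataset), the consolidated facts can be counted after the import:

// Count the consolidated fact nodes
MATCH (meas:Measure)
RETURN COUNT(meas) AS Facts;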

Implementation Details

Slowly changing dimensions are handled with MERGE clauses, which update or insert level members as needed.

Measures present an extra challenge, as their properties can't simply be overwritten: if a node already exists for a given event, imported values must be added to the existing ones.

A first version of the script relied on the OPTIONAL MATCH clause to distinguish existing from non-existing measure nodes, based on the presence of one relationship per dimension of the event to import:

WITH c, m, prod, row
// Facts
OPTIONAL MATCH
    (meas:Measure)-[:IN_PLACE]->(c),
    (meas:Measure)-[:AT_TIME]->(m),
    (meas:Measure)-[:FOR_PRODUCT]->(prod)
// Create new fact node
FOREACH (x IN CASE WHEN meas IS NULL THEN [1] ELSE [] END |
    CREATE (meas:Measure {sales: toFloat(row.Sales), units: toInteger(row.Units)})
    // Foreign keys
    CREATE (meas)-[:IN_PLACE]->(c)
    CREATE (meas)-[:AT_TIME]->(m)
    CREATE (meas)-[:FOR_PRODUCT]->(prod)
)
// Update fact node
FOREACH (x IN CASE WHEN meas IS NULL THEN [] ELSE [1] END |
    SET meas.sales = meas.sales + toFloat(row.Sales)
    SET meas.units = meas.units + toInteger(row.Units)
)

For some reason I couldn't explain, existing measures were never detected, leading to systematic measure node creation (i.e. the last part of the script was never executed).

This option was initially preferred for its relationship-oriented nature. As I wasn't successful in making it work, an alternative solution was adopted, as seen previously, based on an extra property named mid (for Measure IDentifier), which duplicates information from the related dimension members.

The full non-working import Cypher script can be found in the create.v1.cypher file.

The execution of the data import script gives the following meta-model:

Querying

At this stage, we have an OLAP-like graph structure containing consolidated facts.

We are now ready to run a few queries.

Query 1 - Sum of Sales for Strings in 2016

MATCH (cat:Category {category: 'Strings'})<-[*]-(meas:Measure)
MATCH (year:Year {year: 2016})<-[*]-(meas:Measure)
RETURN
    SUM(meas.sales) AS Sales,
    SUM(meas.units) AS Units;

Output (from tiny dataset):

Sales  Units
3300   11
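As a cross-check against the tiny dataset: Piano contributes 3000 in sales and 8 units, Violin 300 and 3, summing to 3300 and 11.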

Query 2 - Sum of Sales for Strings in January 2016

MATCH (cat:Category {category: 'Strings'})<-[*]-(meas:Measure)
MATCH (year:Year {year: 2016})<-[]-(month:Month {month: 1})<-[]-(meas:Measure)
RETURN
    SUM(meas.sales) AS Sales,
    SUM(meas.units) AS Units;

Alternatively:

MATCH (cat:Category {category: 'Strings'})<-[*]-(meas:Measure)
MATCH (month:Month {year: 2016, month: 1})<-[]-(meas:Measure)
RETURN
    SUM(meas.sales) AS Sales,
    SUM(meas.units) AS Units;

Output (from tiny dataset):

Sales  Units
3300   11

Query 3 - Detail of Sales for Strings in January 2016

Returning the detail of an aggregated measure is called Drill-through.

MATCH (cat:Category {category: 'Strings'})<-[]-(prod:Product)<-[]-(meas:Measure)
MATCH (year:Year {year: 2016})<-[]-(month:Month {month: 1})<-[]-(meas:Measure)
RETURN
    year.year AS Year,
    month.month AS Month,
    cat.category AS Category,
    prod.product AS Product,
    meas.sales AS Sales,
    meas.units AS Units
LIMIT 10;

Output (from tiny dataset):

Year  Month  Category  Product  Sales  Units
2016  1      Strings   Piano    3000   8
2016  1      Strings   Violin   300    3

Aggregates Store

The queries we wrote previously relied only on facts to return a result. Once retrieved, facts must be aggregated with the SUM aggregation function.

This does the job, but it is far from the OLAP philosophy, where aggregates can be pre-calculated in order to improve query response time.

Aggregate Strategy

Aggregates will be stored as nodes, with a label name composed of two parts: a constant Aggregate_ prefix, and an aggregate bitmask (described in the next section).

Example: (:Aggregate_000000).

This concept is called MOLAP when applied to a multi-dimensional database store, ROLAP when applied to a relational one. Let's call it GOLAP (for Graph OLAP), as we are using a graph database store (we could also call it NOLAP for Neo4j OLAP).

Aggregate Bitmask

An aggregate bitmask is a sequence of characters that identifies which part of a sub-cube an aggregate is calculated on.

The sequence is composed of N positions (one for each level), each position taking one of the following values:

Value  Description
x      Aggregate is calculated for all members of the level.
0      No aggregate is calculated for the level.
1      Aggregate is calculated for each member of the level.

The order of the positions is important and must remain stable. If the cube structure is modified (by adding extra dimensions or levels), the aggregate strategy must be redefined.

For our sample cube, the aggregate sequence will be composed of six positions, as the cube contains six levels (Category, Product, Region, Country, Year, Month).

Aggregate bitmask construction:

Product.Category  Product.Product  Place.Region  Place.Country  Time.Year  Time.Month  Description                         Scope
1                 0                x             0              x          0           By Product.Category                 One-level
0                 1                x             0              x          0           By Product.Product                  One-level
0                 1                0             1              x          0           By Product.Product x Place.Country  Two-level

Aggregate Creation

There are six one-level aggregates for our sample cube:

Aggregate         Description
Aggregate_10x0x0  By Product.Category
Aggregate_01x0x0  By Product.Product
Aggregate_x010x0  By Place.Region
Aggregate_x001x0  By Place.Country
Aggregate_x0x010  By Time.Year
Aggregate_x0x001  By Time.Month

By Place.Region

MATCH (r:Region)<-[*2]-(meas:Measure)
WITH DISTINCT r, SUM(meas.sales) AS SumSales, SUM(meas.units) AS SumUnits
CREATE (a:Aggregate_x010x0 {sales: SumSales, units: SumUnits})-[:AGGREGATE_OF]->(r);

Output:

Added 5 labels, created 5 nodes, set 10 properties, created 5 relationships, completed after 36 ms.

Here is how the Aggregate_x010x0 aggregate fits into the graph for the Europe region:

The aggregate node is coloured in grey.

By Place.Country

MATCH (ct:Country)<-[]-(meas:Measure)
WITH DISTINCT ct, SUM(meas.sales) AS SumSales, SUM(meas.units) AS SumUnits
CREATE (a:Aggregate_x001x0 {sales: SumSales, units: SumUnits})-[:AGGREGATE_OF]->(ct);

Output:

Added 16 labels, created 16 nodes, set 32 properties, created 16 relationships, completed after 28 ms.

Here is how the Aggregate_x010x0 (by Place.Region) and Aggregate_x001x0 (by Place.Country) aggregates fit into the graph for Europe and France:

The higher the level, the greater the performance benefit when querying aggregate nodes (coloured in grey), as they avoid visiting individual (i.e. detail) measure nodes (coloured in red).

The creation of the one-level aggregates gives the following meta-model:

One-level aggregate nodes are coloured in grey.

By Product.Product x Place.Country

This is a two-level aggregate, as it is calculated over a combination of two levels.

MATCH (prod:Product)<-[]-(meas:Measure)
MATCH (ct:Country)<-[]-(meas:Measure)
WITH DISTINCT prod, ct, SUM(meas.sales) AS SumSales, SUM(meas.units) AS SumUnits
CREATE (prod)<-[:AGGREGATE_OF]-(a:Aggregate_0101x0 {sales: SumSales, units: SumUnits})-[:AGGREGATE_OF]->(ct);

The other aggregate creation commands can be found in the aggregates.cypher file.
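For illustration, here is a sketch of the By Time.Year aggregate (Aggregate_x0x010), following the same pattern as the By Place.Region command above (the canonical version lives in aggregates.cypher):

// :Measure reaches :Year through :Month, hence the depth-2 path
MATCH (y:Year)<-[*2]-(meas:Measure)
WITH DISTINCT y, SUM(meas.sales) AS SumSales, SUM(meas.units) AS SumUnits
CREATE (a:Aggregate_x0x010 {sales: SumSales, units: SumUnits})-[:AGGREGATE_OF]->(y);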

The creation of the two-level aggregates gives the following meta-model:

One-level aggregate nodes are coloured in grey, two-level ones in green.

All commands are consolidated in a single file, full_script.cypher.

Querying Aggregates

Aggregates can now be used in queries.

Query 1 - Sum of Sales for Violin in France

MATCH (prod:Product {product: 'Violin'})<-[]-(a:Aggregate_0101x0)-[]->(ct:Country {country: 'France'})
RETURN
    a.sales AS Sales,
    a.units AS Units;

Query 2 - Sum of Sales for Violin in 2016

MATCH (prod:Product {product: 'Violin'})<-[]-(a:Aggregate_01x010)-[]->(y:Year {year: 2016})
RETURN
    a.sales AS Sales,
    a.units AS Units;
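The aggregate figures can be cross-checked against the detail measures; a minimal sketch using the relationship types defined earlier:

// Same result computed from the detail :Measure nodes
MATCH (prod:Product {product: 'Violin'})<-[:FOR_PRODUCT]-(meas:Measure),
      (meas)-[:AT_TIME]->(:Month)-[:OF_YEAR]->(:Year {year: 2016})
RETURN SUM(meas.sales) AS Sales, SUM(meas.units) AS Units;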

It doesn't necessarily make sense to create all possible aggregates.

Aggregate creation should be based on a good understanding of business needs (i.e. how data is queried). Some OLAP engines, such as Microsoft SQL Server Analysis Services, provide tooling to help select good aggregate candidates based on previously executed queries. We could imagine implementing such a feature based on the Neo4j query log, but this is out of scope for this workshop.

Conclusion

There are many other features that could be explored.

This workshop was an opportunity to challenge (and learn) Neo4j and Cypher in an area they weren't primarily designed for (even if we are dealing with lots of relationships).

Each data engine is designed for specific purposes, so we can't expect one to be as feature-complete as another in fields as different as OLAP and graphs.
