GetMatch query

A. Overview

SPADE provides a client for querying stored provenance graphs for forensic analysis. The complete list of queries that SPADE supports is documented here. This page documents the getMatch query and illustrates its usage by extracting cross namespace data flows from a CamFlow provenance graph.

B. GetMatch

In provenance graph querying, a use case often arises where the user wants to find the vertices (or edges) that two sets have in common, i.e. their intersection. SPADE supports the operator & for intersection, for example, $common = $set1 & $set2. The intersection operator works by comparing the hashes of all annotations of the graph elements in the given sets, and returns the graph elements with matching hashes as the intersection set. While this intersection operation is useful on its own, it does not provide a way to intersect based on only a subset of the annotations of graph elements. That is what the getMatch query supports.
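
The behaviour described above can be modelled with a small Python sketch. This is an illustrative toy, not SPADE's implementation: the helper names `annotation_hash` and `intersect`, the use of SHA-256, and the canonical string form are all assumptions made for the example. It only shows why two elements must agree on every annotation to land in the intersection.

```python
import hashlib

def annotation_hash(annotations):
    """Hash every key/value pair of an element, in a stable key order.

    SHA-256 over a 'k=v,k=v' string is an assumption for this sketch;
    SPADE's actual hashing scheme may differ.
    """
    canonical = ",".join(f"{k}={v}" for k, v in sorted(annotations.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def intersect(set1, set2):
    """Keep the elements of set1 whose full-annotation hash also occurs in set2."""
    hashes2 = {annotation_hash(v) for v in set2}
    return [v for v in set1 if annotation_hash(v) in hashes2]

# Two toy vertex sets: the task vertices are identical, while the
# process_memory vertices differ in a single annotation (ipcns).
set1 = [
    {"object_type": "task", "pid": "1917"},
    {"object_type": "process_memory", "ipcns": "4026531839"},
]
set2 = [
    {"object_type": "task", "pid": "1917"},
    {"object_type": "process_memory", "ipcns": "4026532201"},
]

common = intersect(set1, set2)
print(common)
```

Only the task vertex survives, because a difference in any single annotation changes the hash of the whole element.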

The getMatch query operates on a subject graph, an object graph, and a variable number of annotation key names. It compares only the specified annotations of the graph elements in the subject graph and the object graph, and returns the graph elements whose values for those annotations match. An example query is given below:

-> $matching_pids = $set1.getMatch($set2, 'pid')

In the query above, the result $matching_pids contains the vertices from $set1 and $set2 that have the same pid (process id) annotation. If a simple intersection query had been used instead, such as $matching_pids = $set1 & $set2, then $matching_pids would have contained only the vertices from $set1 and $set2 that agree on all of their annotations (not just pid).
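
To make the contrast concrete, here is a minimal Python sketch of the two behaviours. It is a model under stated assumptions, not SPADE's code: the functions `get_match` and `intersect_all` are hypothetical, the `version` annotation is invented for the example, returning matching vertices from both sets is inferred from the description above, and treating a vertex that lacks a requested key as a non-match is a guess.

```python
def key_tuple(vertex, keys):
    """Return the values of the requested keys, or None if any key is absent."""
    if any(k not in vertex for k in keys):
        return None  # assumption: a vertex missing a key never matches
    return tuple(vertex[k] for k in keys)

def get_match(subject, obj, *keys):
    """Model of getMatch: vertices from either set whose values for `keys`
    also occur in the other set."""
    subject_values = {key_tuple(v, keys) for v in subject} - {None}
    object_values = {key_tuple(v, keys) for v in obj} - {None}
    shared = subject_values & object_values
    return [v for v in subject + obj if key_tuple(v, keys) in shared]

def intersect_all(set1, set2):
    """Model of '&': vertices of set1 that agree with a vertex of set2 on every annotation."""
    return [v for v in set1 if v in set2]

# 'version' is a hypothetical annotation used only to make the vertices differ.
set1 = [{"object_type": "task", "pid": "1917", "version": "1"}]
set2 = [
    {"object_type": "task", "pid": "1917", "version": "2"},
    {"object_type": "task", "pid": "1919", "version": "1"},
]

matching_pids = get_match(set1, set2, "pid")
full_intersection = intersect_all(set1, set2)
print(len(matching_pids), len(full_intersection))
```

In this toy data the two pid 1917 vertices match on pid but differ in another annotation, so the key-based match finds them while the full intersection is empty.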

C. Example

The following describes a subset of CamFlow provenance captured for a given scenario, the cross namespace data flow in that scenario, and the SPADE queries (including getMatch) to extract the data flow in that scenario.

C.1. Scenario

CamFlow is used to collect provenance for system activity where:

A process (called writer) in IPC namespace X writes to a file
A process (called reader) in IPC namespace Y reads the same file that the writer process wrote to

In this scenario, there is a data flow from the writer process to the reader process because the same file was written by one and then read by the other. Since the two processes exist in different IPC namespaces, the data flow is considered a cross namespace data flow. This scenario is shown in the CamFlow provenance graph below:

In the image above, the activity of the scenario that can be seen is:

a. The writer process (pid:1917), and its IPC namespace (ipcns:4026531839). The writer process is represented by a vertex with the annotation object_type:task, and its IPC namespace can be found in a vertex (connected to the process vertex) with the annotation object_type:process_memory

b. The reader process (pid:1919), and its IPC namespace (ipcns:4026532201). The reader process is represented by a vertex with the annotation object_type:task, and its IPC namespace can be found in a vertex (connected to the process vertex) with the annotation object_type:process_memory

c. The path /tmp/testfile that is written, and read in the scenario. This is represented by the vertex with the annotation object_type:path

d. The write of the path /tmp/testfile. The write on a path is represented by an edge incident on the inode vertex with annotation object_type:file of the path. The edge itself has the annotation relation_type:write

e. The read of the path /tmp/testfile is represented the same way as the write except that the edge has the annotation relation_type:read

C.2. Query Approach

Given the provenance graph, the goal is to construct a series of queries that can be used to extract the cross namespace data flow caused by a write and a read of a path. The approach taken can be broken down into the following steps:

a. Find the inodes that were written to

b. Find the tasks (processes) that wrote to the inodes

c. Find the process memory vertices (contains namespace identifiers) of the tasks in (b)

d. Find the inodes that were read from

e. Find the common inodes between (a), and (d)

f. Find the tasks that read from the common inodes in (e)

g. Find the process memory vertices (contains namespace identifiers) of the tasks in (f)

h. Find the process memory vertices which match on namespace identifiers between (c), and (g)

i. Find the process memory vertices that were either in (c), or (g), but not both

j. Construct a graph using only the vertices found above to get a subgraph which represents a cross namespace data flow because of inodes
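
Steps (a) through (i) can be traced on a toy model of the scenario in C.1. The sketch below is plain Python over an in-memory graph, not SPADE queries. The vertex ids are hypothetical, only the ipcns identifier is modelled, and the edge directions (file to task for a write, task to file for a read, task to process memory) are assumptions chosen to agree with how the queries in C.3 use getEdgeSource and getEdgeDestination. Step (j), the subgraph construction, is left out.

```python
# Toy graph; vertex ids and edge directions are assumptions for illustration.
vertices = {
    "writer": {"object_type": "task", "pid": "1917"},
    "reader": {"object_type": "task", "pid": "1919"},
    "writer_mem": {"object_type": "process_memory", "ipcns": "4026531839"},
    "reader_mem": {"object_type": "process_memory", "ipcns": "4026532201"},
    "inode": {"object_type": "file"},
}
# (source, destination, relation_type); None marks a relation that is not modelled.
edges = [
    ("inode", "writer", "write"),
    ("reader", "inode", "read"),
    ("writer", "writer_mem", None),
    ("reader", "reader_mem", None),
]
NS_KEYS = ("ipcns",)  # C.3 also compares cgroupns, mntns, netns, pidns, utsns

def of_type(ids, object_type):
    """Keep only the vertex ids with the given object_type annotation."""
    return {i for i in ids if vertices[i]["object_type"] == object_type}

def memories_of(tasks):
    """Process memory vertices directly reachable from the given tasks."""
    return of_type({dst for src, dst, _ in edges if src in tasks}, "process_memory")

def namespaces(vertex_id):
    """Namespace identifiers of a process memory vertex, as a comparable tuple."""
    return tuple(vertices[vertex_id].get(key) for key in NS_KEYS)

written_inodes = of_type({src for src, _, rel in edges if rel == "write"}, "file")  # (a)
writing_tasks = of_type({dst for _, dst, rel in edges if rel == "write"}, "task")   # (b)
writing_memories = memories_of(writing_tasks)                                       # (c)
read_inodes = of_type({dst for _, dst, rel in edges if rel == "read"}, "file")      # (d)
common_inodes = written_inodes & read_inodes                                        # (e)
reading_tasks = of_type(
    {src for src, dst, rel in edges if rel == "read" and dst in common_inodes},
    "task",
)                                                                                   # (f)
reading_memories = memories_of(reading_tasks)                                       # (g)
shared = {namespaces(m) for m in writing_memories} & {
    namespaces(m) for m in reading_memories
}
all_memories = writing_memories | reading_memories
matching_memories = {m for m in all_memories if namespaces(m) in shared}            # (h)
differing_memories = all_memories - matching_memories                               # (i)

print(sorted(common_inodes), sorted(differing_memories))
```

A non-empty set of common inodes together with a non-empty set of differing process memory vertices is what indicates a cross namespace data flow in this model.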

C.3. Queries

The following are the queries for the approach described in C.2. Each query is preceded by comments (lines starting with #) that describe it. The queries below can also be copied to a file and executed in the SPADE query client by using the command load <path to file with queries>.

# Group all types of relevant vertices into respective variables for convenience.
# This comes in handy when only a particular type of vertex is required from another variable.

# Group all process memory vertices which contain the namespace identifiers.
$memorys = $base.getVertex(object_type = 'process_memory')
# Group all task vertices which contain the process identifiers. Tasks are connected to process memory vertices.
$tasks = $base.getVertex(object_type = 'task')
# Group all file vertices which represent an inode. Tasks are connected to files by 'relation_type'='read' or 'relation_type'='write'.
$files = $base.getVertex(object_type = 'file')
# Group all path vertices which contain the path of an inode in the filesystem. Files are connected to paths.
$paths = $base.getVertex(object_type = 'path')

# Finding the subgraph representing the writing of a file
# Get all edges that represent a write
$write_edges = $base.getEdge(relation_type = 'write')
# Get all the written files.
# Note: the '&' operation with $files is a convenient way of getting only files from $write_edges.getEdgeSource().
$written_files = $write_edges.getEdgeSource() & $files
# Get all the paths for the found files.
# Note: If an empty result is returned then it means the file was written to more than 10 times. Progressively increase the value '10' until the result is not empty.
$written_files_to_paths = $base.getPath($written_files, $files, 10, $paths, 1)

# Get the tasks that wrote to the file
$writing_tasks = $write_edges.getEdgeDestination() & $tasks
# Get the process memory vertices (these contain the namespace identifiers) for all the writing tasks
$writing_tasks_to_memorys = $base.getPath($writing_tasks, $memorys, 1)
# Get only the process memory vertices
$writing_memorys = $writing_tasks_to_memorys.getEdgeDestination() & $memorys

# Get the read edges to find files which were read
$read_edges = $base.getEdge(relation_type = 'read')
$read_files = $read_edges.getEdgeDestination() & $files

# Find common files i.e. which were written to and read from. If this is empty then there was no data flow through files
$common_files = $written_files & $read_files
# Find the paths for the files that were common (above)
$common_files_to_paths = $base.getPath($common_files, $files, 10, $paths, 1)
$common_paths = $common_files_to_paths & $paths

# Find the tasks that read the common files, and the process memory vertices of those tasks
$reading_tasks_to_common_files = $base.getPath($tasks, $common_files, 1)
$reading_tasks = $reading_tasks_to_common_files & $tasks
$reading_tasks_to_memorys = $base.getPath($reading_tasks, $memorys, 1)
$reading_memorys = $reading_tasks_to_memorys.getEdgeDestination() & $memorys

# Now do a match between the memory vertices of the reading tasks, and that of the writing tasks to find the ones with the same values for namespace identifiers: 'cgroupns', 'ipcns', 'mntns', 'netns', 'pidns', 'utsns'
$common_memorys = $reading_memorys.getMatch($writing_memorys, 'cgroupns', 'ipcns', 'mntns', 'netns', 'pidns', 'utsns')

# Divide the result from getMatch into two groups. The first group is the result i.e. the common ones. The second group is the group with namespaces that didn't match.
$group1_memorys = $common_memorys
# If the following query returns an empty result then that means there were no process memory vertices with different namespaces
$group2_memorys = $reading_memorys + $writing_memorys - $common_memorys

# Cross namespace data flow subgraph construction
# Get the tasks of the process memory vertices in both groups
$group1_tasks = $base.getPath($tasks, $group1_memorys, 1) & $tasks
$group2_tasks = $base.getPath($tasks, $group2_memorys, 1) & $tasks
# Use the files that were read from and written to as the starting point of the construction
$subgraph = $common_files
# Find the paths of the files involved
$subgraph = $subgraph + $base.getPath($common_files, $files, 10, $common_paths, 1)
# Find the edges from tasks to files in both groups
$subgraph = $subgraph + $base.getPath($group1_tasks, $common_files, 1)
$subgraph = $subgraph + $base.getPath($group2_tasks, $common_files, 1)
# Find the edges from files to tasks in both groups
$subgraph = $subgraph + $base.getPath($common_files, $group1_tasks, 1)
$subgraph = $subgraph + $base.getPath($common_files, $group2_tasks, 1)
# Find the edges from tasks to memory vertices in both groups
$subgraph = $subgraph + $base.getPath($group1_tasks, $group1_memorys, 1)
$subgraph = $subgraph + $base.getPath($group2_tasks, $group2_memorys, 1)

# Discard intermediate graph variables
erase $memorys $tasks $files $paths $write_edges $written_files $written_files_to_paths
erase $writing_tasks_to_memorys $read_edges $read_files $common_files_to_paths
erase $reading_tasks_to_common_files $reading_tasks_to_memorys $common_memorys

The graph variable $subgraph would contain the result graph for the scenario mentioned in C.1 (shown below).

This material is based upon work supported by the National Science Foundation under Grants OCI-0722068, IIS-1116414, and ACI-1547467. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
