GetMatch query
SPADE's query client can be used to retrieve stored provenance for subsequent analysis. The complete list of queries that the QuickGrail query surface supports is documented here.
This page documents usage of the query getMatch. It illustrates this by identifying cross-namespace data flows in a CamFlow provenance graph.
When querying provenance graphs, an analyst may be interested in finding common vertices (or edges) in two different graphs, that is, their intersection. SPADE query support provides the operator & for this. For example, $common = $set1 & $set2 will extract elements present in both $set1 and $set2 and store them in $common. The intersection operator works by comparing all annotations of graph (vertex or edge) elements in the given sets. It only returns graph elements for which all the annotations match.
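The all-annotations comparison can be pictured with a small Python model. This is only a sketch of the behavior described above, not SPADE's implementation: vertices are modeled as plain dictionaries, the helper name intersect is hypothetical, and the uid annotation is made up for illustration.

```python
# Hypothetical model: each vertex is a dict of annotation key -> value.
def intersect(set1, set2):
    """Keep the elements of set1 whose complete annotation set also occurs in set2."""
    seen = {frozenset(v.items()) for v in set2}
    return [v for v in set1 if frozenset(v.items()) in seen]


set1 = [
    {"object_type": "task", "pid": "1917"},
    {"object_type": "task", "pid": "1919"},
]
set2 = [
    {"object_type": "task", "pid": "1917"},
    # Same pid as the second vertex in set1, but one extra (made-up) annotation.
    {"object_type": "task", "pid": "1919", "uid": "0"},
]

common = intersect(set1, set2)
print(common)
```

In this model the vertex with pid 1919 drops out of the result because its annotations differ in one key, even though the pid values agree.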
In some contexts, even a partial match of annotations suffices. The getMatch query provides support for this. More specifically, getMatch operates on a subject graph ($set1 in the example below), an object graph ($set2 below), and a list of annotation keys ('pid' below). It works by comparing only the specified annotations of graph elements in the subject and object graphs. The result of getMatch contains the elements for which the specified annotation keys had matching values in both graphs. Annotations with keys that were not specified are ignored.
An example query:
-> $matching_pids = $set1.getMatch($set2, 'pid')
Above, each vertex in $set1 will only be added to $matching_pids if there is a vertex in $set2 that has the same value for the pid annotation key. Similarly, the only vertices in $set2 that will appear in $matching_pids are those with a value for the pid annotation key that is also present in some vertex in $set1.
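The partial-match behavior can be sketched in Python as follows. This is an illustrative model rather than SPADE's code: the function name get_match and the label annotation are hypothetical, and two points are assumptions not stated on this page: a vertex that lacks one of the requested keys never matches, and when several keys are given all of them must agree.

```python
# Hypothetical model: each vertex is a dict of annotation key -> value.
def get_match(subject, obj, *keys):
    """Return vertices from both inputs whose values for `keys` occur on both sides."""

    def signature(vertex):
        # Assumption: a vertex missing any requested key cannot match.
        if all(k in vertex for k in keys):
            return tuple(vertex[k] for k in keys)
        return None

    subject_sigs = {signature(v) for v in subject} - {None}
    object_sigs = {signature(v) for v in obj} - {None}
    shared = subject_sigs & object_sigs

    result = []
    for v in list(subject) + list(obj):
        if signature(v) in shared and v not in result:
            result.append(v)
    return result


# Made-up data: only the pid values matter for the match.
set1 = [{"pid": "1917", "label": "s1"}, {"pid": "1919", "label": "s2"}]
set2 = [{"pid": "1917", "label": "o1"}, {"pid": "2000", "label": "o2"}]

matching_pids = get_match(set1, set2, "pid")
print(matching_pids)
```

Here the two vertices with pid 1917 are returned, one from each input, even though their label annotations differ. A full-annotation intersection of the same inputs would be empty.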
A scenario is described below where data flows from one process to another while the two are in different IPC namespaces. After that, the use of getMatch is described to illustrate how the resulting cross-namespace provenance can be identified.
CamFlow is used to collect provenance for system activity where:
- A process (called writer) in IPC namespace X writes to a file.
- A process (called reader) in IPC namespace Y reads the same file that the writer process wrote to.
In the above scenario, there is a data flow to the reader process from the writer process. Since the two processes exist in different IPC namespaces, cross-namespace provenance arises. It can be seen in this CamFlow graph:
The figure above depicts the following vertices and edges:
- writer process (pid:1917) and its IPC namespace (ipcns:4026531839): The writer process is represented by a vertex with the annotation object_type:task. Its IPC namespace is in the vertex with annotation object_type:process_memory (and is connected to the process vertex).
- reader process (pid:1919) and its IPC namespace (ipcns:4026532201): The reader process is represented by a vertex with the annotation object_type:task. Its IPC namespace is in the vertex with annotation object_type:process_memory (and is connected to the process vertex).
- file at path /tmp/testfile: This is written to and read from. It is represented by the vertex with the annotation object_type:path.
- write to path /tmp/testfile: The write is represented by an edge incident on an inode vertex with the annotation object_type:file. The edge itself has the annotation relation_type:write.
- read of path /tmp/testfile: It is represented similarly to the write, except that the edge has the annotation relation_type:read.
The CamFlow provenance log for the scenario above can be found here.
Given the provenance graph, the goal is to construct a series of queries that extract the cross-namespace data flow that arises when one process writes to a path and another process reads from the same path. The approach can be broken down into the following steps:
1. Find the inodes that were written to
2. Find the tasks (processes) that wrote to the inodes
3. Find the process memory vertices (which contain namespace identifiers) of the tasks in (2)
4. Find the inodes that were read from
5. Find the common inodes between (1) and (4)
6. Find the tasks that read from the common inodes in (5)
7. Find the process memory vertices (which contain namespace identifiers) of the tasks in (6)
8. Find the process memory vertices which match on namespace identifiers between (3) and (7)
9. Find the process memory vertices that were in either (3) or (7), but not in both. This is the set of process memory vertices where the namespace identifiers did not match
10. Construct a graph using only the vertices and edges found above to get a subgraph which represents a cross-namespace data flow through inodes
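The two namespace-comparison steps (matching on namespace identifiers, then collecting whatever did not match) can be sketched in Python. This is a simplified model under stated assumptions: vertices are dictionaries, the helper names are hypothetical, and only ipcns is compared because it is the only namespace value given for the scenario on this page.

```python
# Assumption: compare ipcns only; the scenario gives no values for the other namespaces.
NS_KEYS = ("ipcns",)


def signature(vertex, keys):
    """Tuple of the vertex's values for `keys`, or None if any key is absent."""
    if any(k not in vertex for k in keys):
        return None
    return tuple(vertex[k] for k in keys)


def match_on(subject, obj, keys):
    """Vertices from either list whose values for `keys` occur in both lists."""
    shared = ({signature(v, keys) for v in subject}
              & {signature(v, keys) for v in obj}) - {None}
    return [v for v in subject + obj if signature(v, keys) in shared]


def difference(a, b):
    """Elements of a that are not in b."""
    return [v for v in a if v not in b]


# Process memory vertices of the writer and reader, using the ipcns values from the scenario.
writing_memorys = [{"object_type": "process_memory", "ipcns": "4026531839"}]
reading_memorys = [{"object_type": "process_memory", "ipcns": "4026532201"}]

# Vertices whose namespace identifiers match.
group1_memorys = match_on(reading_memorys, writing_memorys, NS_KEYS)
# Vertices on either side that did not match.
group2_memorys = difference(reading_memorys + writing_memorys, group1_memorys)

print(group1_memorys)
print(group2_memorys)
```

For this scenario the matched group is empty and both process memory vertices land in the unmatched group, which is what signals a cross-namespace flow.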
Following are the queries for the approach mentioned in C.II. Queries are preceded by comments (lines starting with #) that describe the query that follows. The queries below can also be copied to a file and executed in the SPADE query client by using the command load <path to file with queries>.
# Group all types of relevant vertices into respective variables for convenience.
# This comes in handy when only a particular type of vertex is required from another variable.
# Group all process memory vertices which contain the namespace identifiers.
$memorys = $base.getVertex(object_type = 'process_memory')
# Group all task vertices which contain the process identifiers. Tasks are connected to process memory vertices.
$tasks = $base.getVertex(object_type = 'task')
# Group all file vertices, each of which represents an inode. Tasks are connected to files by 'relation_type'='read' or 'relation_type'='write'.
$files = $base.getVertex(object_type = 'file')
# Group all path vertices which contain the path of an inode in the filesystem. Files are connected to paths.
$paths = $base.getVertex(object_type = 'path')
# Finding the subgraph representing the writing of a file
# Get all edges that represent a write
$write_edges = $base.getEdge(relation_type = 'write')
# Get all the written files.
# Note: the '&' operation with $files is a convenient way of getting only files from $write_edges.getEdgeSource().
$written_files = $write_edges.getEdgeSource() & $files
# Get all the paths for the found files.
# Note: If an empty result is returned, the file was written to more than 10 times. Progressively increase the value '10' until the result is not empty.
$written_files_to_paths = $base.getPath($written_files, $files, 10, $paths, 1)
# Get the tasks that wrote to the file
$writing_tasks = $write_edges.getEdgeDestination() & $tasks
# Get the process memory vertices (these contain the namespace identifiers) for all the writing tasks
$writing_tasks_to_memorys = $base.getPath($writing_tasks, $memorys, 1)
# Get only the process memory vertices
$writing_memorys = $writing_tasks_to_memorys.getEdgeDestination() & $memorys
# Get the read edges to find files which were read
$read_edges = $base.getEdge(relation_type = 'read')
$read_files = $read_edges.getEdgeDestination() & $files
# Find common files i.e. which were written to and read from. If this is empty then there was no data flow through files
$common_files = $written_files & $read_files
# Find the paths for the files that were common (above)
$common_files_to_paths = $base.getPath($common_files, $files, 10, $paths, 1)
$common_paths = $common_files_to_paths & $paths
# Find the tasks and their process memory vertices of the tasks that read the common files
$reading_tasks_to_common_files = $base.getPath($tasks, $common_files, 1)
$reading_tasks = $reading_tasks_to_common_files & $tasks
$reading_tasks_to_memorys = $base.getPath($reading_tasks, $memorys, 1)
$reading_memorys = $reading_tasks_to_memorys.getEdgeDestination() & $memorys
# Now do a match between the memory vertices of the reading tasks, and that of the writing tasks to find the ones with the same values for namespace identifiers: 'cgroupns', 'ipcns', 'mntns', 'netns', 'pidns', 'utsns'
$common_memorys = $reading_memorys.getMatch($writing_memorys, 'cgroupns', 'ipcns', 'mntns', 'netns', 'pidns', 'utsns')
# Divide the result from getMatch into two groups. The first group is the result i.e. the common ones. The second group is the group with namespaces that didn't match.
$group1_memorys = $common_memorys
# If the following query returns an empty result, there were no process memory vertices with different namespaces
$group2_memorys = $reading_memorys + $writing_memorys - $common_memorys
# Cross namespace data flow subgraph construction
# Get the tasks of the process memorys in both groups
$group1_tasks = $base.getPath($tasks, $group1_memorys, 1) & $tasks
$group2_tasks = $base.getPath($tasks, $group2_memorys, 1) & $tasks
# Use the files that were read from and written to as the starting point of the construction
$subgraph = $common_files
# Find the paths of the files involved
$subgraph = $subgraph + $base.getPath($common_files, $files, 10, $common_paths, 1)
# Find the edges from tasks to files in both groups
$subgraph = $subgraph + $base.getPath($group1_tasks, $common_files, 1)
$subgraph = $subgraph + $base.getPath($group2_tasks, $common_files, 1)
# Find the edges from files to tasks in both groups
$subgraph = $subgraph + $base.getPath($common_files, $group1_tasks, 1)
$subgraph = $subgraph + $base.getPath($common_files, $group2_tasks, 1)
# Find the edges from tasks to memory vertices in both groups
$subgraph = $subgraph + $base.getPath($group1_tasks, $group1_memorys, 1)
$subgraph = $subgraph + $base.getPath($group2_tasks, $group2_memorys, 1)
# Discard intermediate graph variables
erase $memorys $tasks $files $paths $write_edges $written_files $written_files_to_paths
erase $writing_tasks_to_memorys $read_edges $read_files $common_files_to_paths
erase $reading_tasks_to_common_files $reading_tasks_to_memorys $common_memorys
The graph variable $subgraph would contain the result graph for the scenario mentioned in C.I (shown below).
This material is based upon work supported by the National Science Foundation under Grants OCI-0722068, IIS-1116414, and ACI-1547467. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.