docs(datasources): make datasource docs more complete (#3568)

- add containers data source documentation
- add process tree data source documentation

commit: 2a51125 (main), cherry-pick
aquasecurity · Oct 18, 2023 · 48beb05 · 48beb05
1 parent 73e4041
commit 48beb05
Showing 4 changed files with 444 additions and 27 deletions.
diff --git a/docs/docs/data-sources/containers.md b/docs/docs/data-sources/containers.md
@@ -0,0 +1,99 @@
+# Containers Data Source
+
+The [container enrichment](../integrating/container-engines.md) feature gives Tracee the ability to extract details about active containers and link this information to the events it captures.
+
+The [data source](./overview.md) feature makes the information gathered from active containers accessible to signatures. When an event is captured and triggers a signature, that signature can retrieve information about the container using its container ID, which is bundled with the event being analyzed by the signature.
+
+## Enabling the Feature
+
+The data source does not need to be enabled, but it requires the `container enrichment` feature to be enabled. To enable the enrichment feature, execute tracee with `--containers`. For more information, read the [container enrichment](../integrating/container-engines.md) page.
+
+## Internal Data Organization
+
+From the [data-sources documentation](../data-sources/overview.md), you'll see that searches use keys. It's a bit like looking up information with a specific tag (or a key=value storage).
+
+The `containers data source` operates straightforwardly. Using `string` keys, which represent the container IDs, you can fetch `map[string]string` values as shown below:
+
+```go
+    schemaMap := map[string]string{
+        "container_id":      "string",
+        "container_name":    "string",
+        "container_image":   "string",
+        "k8s_pod_id":        "string",
+        "k8s_pod_name":      "string",
+        "k8s_pod_namespace": "string",
+        "k8s_pod_sandbox":   "bool",
+    }
+```
+
+From the structure above, using the container ID lets you access details like the originating Kubernetes pod name or the image utilized by the container.
+
+## Using the Containers Data Source
+
+> Make sure to read [Golang Signatures](../events/custom/golang.md) first.
+
+### Signature Initialization
+
+During the signature initialization, get the containers data source instance:
+
+```go
+type e2eContainersDataSource struct {
+    cb             detect.SignatureHandler
+    containersData detect.DataSource
+}
+
+func (sig *e2eContainersDataSource) Init(ctx detect.SignatureContext) error {
+    sig.cb = ctx.Callback
+    containersData, ok := ctx.GetDataSource("tracee", "containers")
+    if !ok {
+        return fmt.Errorf("containers data source not registered")
+    }
+    sig.containersData = containersData
+    return nil
+}
+```
+
+Then, for each event being handled, call `Get()` on the data source to retrieve the information needed.
+
+### On Events
+
+Given the following example:
+
+```go
+func (sig *e2eContainersDataSource) OnEvent(event protocol.Event) error {
+    eventObj, ok := event.Payload.(trace.Event)
+    if !ok {
+        return fmt.Errorf("failed to cast event's payload")
+    }
+
+    switch eventObj.EventName {
+    case "sched_process_exec":
+        containerId := eventObj.Container.ID
+        if containerId == "" {
+            return fmt.Errorf("received non container event")
+        }
+
+        container, err := sig.containersData.Get(containerId)
+        if err != nil {
+            return fmt.Errorf("failed to find container in data source: %v", err)
+        }
+
+        containerImage, ok := container["container_image"].(string)
+        if !ok {
+            return fmt.Errorf("failed to obtain the container image name")
+        }
+
+        m, _ := sig.GetMetadata()
+
+        sig.cb(detect.Finding{
+            SigMetadata: m,
+            Event:       event,
+            Data:        map[string]interface{}{"container_image": containerImage},
+        })
+    }
+
+    return nil
+}
+```
+
+As the example shows, the container ID carried by the event object is the key used to query the data source. The returned map then exposes the container image, the container name, or any other field listed in the schema above.
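The lookup flow documented in `containers.md` above can be tried outside Tracee with a small stand-in. The sketch below is only an illustration under stated assumptions: `fakeContainers`, `ErrNotFound`, and `imageOf` are hypothetical names invented here, not part of Tracee's API, and the stand-in only imitates the "string key in, map of fields out" behavior that the docs describe.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel error for unknown container IDs.
var ErrNotFound = errors.New("container not found")

// fakeContainers is a hypothetical in-memory stand-in for the containers
// data source: container ID -> field name -> value.
type fakeContainers map[string]map[string]interface{}

// Get imitates the documented lookup: a string key goes in, and a map of
// fields (or an error) comes out.
func (f fakeContainers) Get(key interface{}) (map[string]interface{}, error) {
	id, ok := key.(string)
	if !ok {
		return nil, fmt.Errorf("key must be a string, got %T", key)
	}
	c, ok := f[id]
	if !ok {
		return nil, ErrNotFound
	}
	return c, nil
}

// imageOf resolves a container ID to its image name, checking both the
// lookup error and the type assertion on the returned field.
func imageOf(ds fakeContainers, id string) (string, error) {
	c, err := ds.Get(id)
	if err != nil {
		return "", fmt.Errorf("failed to find container in data source: %w", err)
	}
	image, ok := c["container_image"].(string)
	if !ok {
		return "", fmt.Errorf("container_image missing or not a string")
	}
	return image, nil
}

func main() {
	ds := fakeContainers{
		"abc123": {
			"container_id":    "abc123",
			"container_name":  "web",
			"container_image": "nginx:1.25",
		},
	}

	image, err := imageOf(ds, "abc123")
	fmt.Println(image, err)

	_, err = imageOf(ds, "missing")
	fmt.Println(err)
}
```

Note that the error returned by `Get` is checked with `err != nil`, which is the condition the `OnEvent` example needs after calling `sig.containersData.Get(containerId)`.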
diff --git a/docs/docs/data-sources/overview.md b/docs/docs/data-sources/overview.md
@@ -1,44 +1,76 @@
 # Data Sources (Experimental)
 
-Data sources are a new feature, which will be the base of allowing access to dynamic data stores in signature writing (currently only available in golang).  
-Data sources are currently an experimental feature and in active development, and usage is opt-in.
+Data sources are a new feature that will serve as the basis for giving
+signatures access to dynamic data stores (currently only available in golang).
+
+> Data sources are currently an experimental feature and in active development,
+> and usage is opt-in.
 
 ## Why use data sources?
 
-Data sources should be used when a signature requires access to data not available to it from the events it receives.  
-For example, a signature may need access to additional data about a container where an event was generated. Using tracee's builtin container data source it can do so without additionally tracking container lifecycle events.
+Signatures should opt for data sources when they need access to data beyond what
+is provided by the events they process.
+
+For instance, a signature may need access to data about the container where the
+event being processed was generated. With Tracee's integrated container data
+source, this can be achieved without the signature having to separately monitor
+container lifecycle events.
 
 ## What data sources can I use
 
-Currently, only builtin data sources from tracee are available.  
-Initially only a data source for containers will be available, but the list will be expanded as this and other features are further developed.  
+For now, only the built-in data sources from Tracee are at your disposal.
+Looking ahead, there are plans to enable integration of data sources into Tracee
+either as plugins or extensions.
+
+Currently, two primary data sources exist:
+
+1. Containers: Provides metadata about containers given a container ID.
+1. Process Tree: Provides access to a tree of all processes and threads that have ever existed.
+
+This list will be expanded as other features are developed.
 
 ## How to use data sources
-In order to use a data source in a signature you must request access to it in the `Init` stage. This can be done through the `SignatureContext` passed at that stage as such:
+
+In order to use a data source in a signature you must request access to it in
+the `Init` stage. This can be done through the `SignatureContext` passed at that
+stage as such:
+
 ```golang
 func (sig *mySig) Init(ctx detect.SignatureContext) error {
     ...
     containersData, ok := ctx.GetDataSource("tracee", "containers")
-	if !ok {
-		return fmt.Errorf("containers data source not registered")
-	}
+    if !ok {
+        return fmt.Errorf("containers data source not registered")
+    }
     if containersData.Version() > 1 {
-		return fmt.Errorf("containers data source version not supported, please update this signature")
-	}
-	sig.containersData = containersData
+        return fmt.Errorf("containers data source version not supported, please update this signature")
+    }
+    sig.containersData = containersData
 }
 ```
 
-As you can see we have requested access to the data source through two keys, a namespace, and a data source ID. Namespaces are used to avoid name conflicts in the future when custom data sources can be integrated. All of tracee's builtin data sources will be available under the "tracee" namespace.  
-After checking the data source is available, we suggest to add a version check against the data source. Doing so will let you avoid running a signature which was not updated to run with a new data source schema.  
+As you can see, access to the data source has been requested using two keys: a
+namespace and a data source ID. Namespaces are employed to prevent name
+conflicts in the future when integrating custom data sources. All built-in data
+sources from Tracee will be available under the "tracee" namespace.
+
+After verifying the data source's availability, it's suggested to include a
+version check against the data source. This approach ensures that outdated
+signatures aren't run with a newer data source schema.
+
+Now, in the `OnEvent` function, you may use the data source like so:
 
-Now, in the `OnEvent` function, you may use the data source like so:  
 ```golang
 container, err := sig.containersData.Get(containerId)
 if err != nil {
     return fmt.Errorf("failed to find container in data source: %v", err)
 }
 
 containerName := container["container_name"].(string)
-``` 
-Each Data source comes with one querying method `Get(key any) map[string]any`. In the above example, omitting the type validation when checking the key, which was safe to do by following the schema (given through the `Schema()` method), a json representation of the returned map, and initially checking the data source version.
+```
+
+Each data source provides a querying method
+`Get(key any) (map[string]any, error)`. In the example above, the type
+assertion on the returned value is not checked. Skipping that check is safe
+only when the signature follows the schema (a JSON representation of the
+returned map, provided by the `Schema()` method) and has already checked the
+data source version.
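
The version gate and the type assertion discussed in `overview.md` can also be written defensively. The following is a minimal sketch, assuming a hypothetical `versioned` interface and `stubSource` type in place of a real Tracee data source; `checkVersion`, `stringField`, and `maxSupportedVersion` are illustrative names, not Tracee identifiers.

```go
package main

import "fmt"

// versioned is a hypothetical stand-in for the part of a data source that
// reports its schema version.
type versioned interface {
	Version() uint
}

// stubSource is a hypothetical data source used only for this sketch.
type stubSource struct{ version uint }

func (s stubSource) Version() uint { return s.version }

// maxSupportedVersion is the newest schema version this (imaginary)
// signature was written against.
const maxSupportedVersion = 1

// checkVersion rejects data sources whose schema is newer than the
// signature understands, mirroring the check suggested for Init.
func checkVersion(ds versioned) error {
	if v := ds.Version(); v > maxSupportedVersion {
		return fmt.Errorf("data source version %d not supported (max %d), please update this signature",
			v, maxSupportedVersion)
	}
	return nil
}

// stringField reads a string field from a lookup result using the checked
// form of the type assertion, so a schema mismatch becomes an error
// instead of a panic.
func stringField(entry map[string]interface{}, name string) (string, error) {
	raw, present := entry[name]
	if !present {
		return "", fmt.Errorf("field %q missing", name)
	}
	s, ok := raw.(string)
	if !ok {
		return "", fmt.Errorf("field %q has type %T, want string", name, raw)
	}
	return s, nil
}

func main() {
	fmt.Println(checkVersion(stubSource{version: 1}))
	fmt.Println(checkVersion(stubSource{version: 2}))

	entry := map[string]interface{}{"container_name": "web", "k8s_pod_sandbox": false}
	fmt.Println(stringField(entry, "container_name"))
	fmt.Println(stringField(entry, "k8s_pod_sandbox"))
}
```

The unchecked form `container["container_name"].(string)` panics if the field is absent or has another type, so the checked form is the safer default whenever the version check is skipped or the schema may change.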