ENG-13633: Initial discovery and classification implementation #51

Merged: ccampo133 merged 15 commits into main from ENG-13633 on Apr 8, 2024

@ccampo133 ccampo133 commented Mar 27, 2024

Description of the change

Initial implementation of Dmap's discovery and classification feature.

Includes a dmap CLI which can be used as follows:

$ dmap --help             
Usage: dmap <command> [flags]

Assess your data security posture in AWS.

Flags:
  -h, --help                 Show context-sensitive help.
      --log-level="info"     Set the logging level (trace|debug|info|warn|error|fatal)
      --log-format="text"    Set the logging format (text|json)
      --version              Print version information and quit

Commands:
  repo-scan    Perform data discovery and classification on a data repository.

Run "dmap <command> --help" for more information on a command.

The repo-scan sub-command performs the data discovery and classification. Currently it just prints the output to stdout as JSON, for example:

$ dmap repo-scan --type postgres --database postgres --host ... --port ...  --user ... --password ...
{
    "labels": [
        {
            "name": "ADDRESS",
            "description": "Address",
            "tags": [
                "PII"
            ]
        },
        ...
    ],
    "classifications": [
        {
            "attributePath": [
                "postgres",
                "public",
                "doctors",
                "address2"
            ],
            "labels": [
                "ADDRESS"
            ]
        },
        ...
    ]
}

Note that some of the details like command name, parameters, etc. are subject to change until the first stable version is released.
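
For anyone consuming the JSON above from Go, here is a minimal sketch of decoding it. The type names (Label, Classification, ScanOutput) and the parseScanOutput helper are hypothetical and are not taken from the Dmap packages; only the JSON field names come from the example output, and those may change before the first stable release.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Label mirrors one entry of the "labels" array in the example output.
// The Go type itself is an assumption made for illustration.
type Label struct {
	Name        string   `json:"name"`
	Description string   `json:"description"`
	Tags        []string `json:"tags"`
}

// Classification mirrors one entry of the "classifications" array.
type Classification struct {
	AttributePath []string `json:"attributePath"`
	Labels        []string `json:"labels"`
}

// ScanOutput is a hypothetical container for the whole document.
type ScanOutput struct {
	Labels          []Label          `json:"labels"`
	Classifications []Classification `json:"classifications"`
}

// parseScanOutput decodes the JSON that repo-scan prints to stdout.
func parseScanOutput(data []byte) (ScanOutput, error) {
	var out ScanOutput
	if err := json.Unmarshal(data, &out); err != nil {
		return ScanOutput{}, fmt.Errorf("decoding scan output: %w", err)
	}
	return out, nil
}

// sample is a trimmed copy of the example output, without the elided entries.
const sample = `{
  "labels": [
    {"name": "ADDRESS", "description": "Address", "tags": ["PII"]}
  ],
  "classifications": [
    {"attributePath": ["postgres", "public", "doctors", "address2"], "labels": ["ADDRESS"]}
  ]
}`

func main() {
	out, err := parseScanOutput([]byte(sample))
	if err != nil {
		panic(err)
	}
	// Print each classified attribute as a dotted path with its labels.
	for _, c := range out.Classifications {
		fmt.Printf("%s: %s\n", strings.Join(c.AttributePath, "."), strings.Join(c.Labels, ", "))
	}
}
```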

Additionally, most of the code that powers the CLI has been added as public packages to the main module, enhancing the API of the existing Dmap library. Users can use these packages to implement their own discovery and classification tooling if desired. There are two new top-level packages added to the public API:

  • classification - provides an API to perform data classification on arbitrary string data.
  • sql - provides an API to introspect, sample, and scan (which is introspect + sample + classify) SQL data repositories.

A new RepoScanner interface was also added to the scan package.

Type of change

  • Bug fix (non-breaking change that fixes an issue).
  • New feature (non-breaking change that adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).

Checklists

Development

  • Lint rules pass locally.
  • The code changed/added as part of this pull request has been covered with tests.
  • All tests related to the changed code pass.

Code review

  • This pull request has a descriptive title and information useful to a reviewer. There may be a screenshot or screencast attached.
  • Jira issue referenced in commit message and/or PR title.

Testing

Unit and manual testing.

@ccampo133 ccampo133 changed the title Initial discovery and classification CLI implementation ENG-13633: Initial discovery and classification CLI implementation Mar 27, 2024
@ccampo133 ccampo133 force-pushed the ENG-13633 branch 5 times, most recently from 3f0d8fe to 72dd28a Compare March 27, 2024 16:30
@ccampo133 ccampo133 force-pushed the ENG-13633 branch 5 times, most recently from 5118b3e to 27aea6b Compare March 29, 2024 16:17
@ccampo133 ccampo133 changed the title ENG-13633: Initial discovery and classification CLI implementation ENG-13633: Initial discovery and classification implementation Mar 29, 2024
Refactor
@ccampo133 ccampo133 force-pushed the ENG-13633 branch 4 times, most recently from 54baeac to 9b629dd Compare March 29, 2024 17:33
@yoursnerdly yoursnerdly left a comment

A big PR, thanks for the effort @ccampo133 - I've taken a quick initial pass, skipping over repo-specific code that I know is from an existing implementation.

Dockerfile Outdated
COPY . .

# Build.
RUN CGO_ENABLED=0 go build -ldflags="-X main.version=$(git rev-parse HEAD)" -o dmap cmd/*.go

What's the benefit of building within the Dockerfile?

Contributor Author

I thought that it simplified the build process, since we just need to run a single command and don't need to do any sort of relative path copying, and it also makes the build somewhat self-contained. However, if we plan to release a binary in addition to the Docker image (we probably should), then we should change this to pull the binary from that build step and include it in the image.

Comment on lines 41 to 43
if err != nil {
return nil, fmt.Errorf("error evaluating query for label %s: %w", lbl.Name, err)
}

Maybe you should log the error and move on - if there's a mistake in one of the rego classifiers, we can still use the others.
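
A minimal sketch of the "log and move on" behavior suggested here, assuming each label owns an evaluation function. The classifier type, its eval field, and demoClassifiers are hypothetical stand-ins for the compiled Rego queries and are not the PR's actual types.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// classifier is a hypothetical stand-in for one label's compiled query.
type classifier struct {
	label string
	eval  func(input map[string]any) (bool, error)
}

// classify evaluates every classifier against input. A classifier that
// fails is logged and skipped so that the remaining ones still run.
func classify(classifiers []classifier, input map[string]any) []string {
	var matched []string
	for _, c := range classifiers {
		ok, err := c.eval(input)
		if err != nil {
			log.Printf("skipping label %s: error evaluating query: %v", c.label, err)
			continue
		}
		if ok {
			matched = append(matched, c.label)
		}
	}
	return matched
}

// demoClassifiers returns one deliberately broken classifier and one
// working classifier, purely for illustration.
func demoClassifiers() []classifier {
	return []classifier{
		{label: "BROKEN", eval: func(map[string]any) (bool, error) {
			return false, errors.New("malformed rule")
		}},
		{label: "ADDRESS", eval: func(in map[string]any) (bool, error) {
			_, ok := in["address2"]
			return ok, nil
		}},
	}
}

func main() {
	input := map[string]any{"address2": "1 Main St"}
	fmt.Println(classify(demoClassifiers(), input))
}
```

The trade-off is that a broken classifier becomes a log line rather than a hard failure, so the logs need to be watched.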

Contributor Author

@ccampo133 ccampo133 Apr 1, 2024

Good call. I tried to find a way of not doing this stuff at runtime and instead at compile time, but came up blank. If you have any ideas, LMK. We will need runtime parsing of classifier code to support custom labels in any case, but I'd prefer not having the possibility of us releasing a binary with potentially broken classifiers.

Well at compile time you could do a "test" run with some input and check for the output format. That still doesn't rule out the possibility (in theory) though that the rego code will give output in some other format for some other inputs.
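
The "test run" idea could be sketched roughly as below and wired into a unit test that runs in CI before release. Everything here is an assumption made for illustration: the rawClassifier type, and in particular the output shape (one boolean per input attribute), are not taken from the PR. As noted above, passing for one input does not prove the format holds for every input.

```go
package main

import (
	"errors"
	"fmt"
)

// rawClassifier is a hypothetical stand-in for an embedded rule. It takes
// attribute name -> sample value and returns loosely typed output.
type rawClassifier func(input map[string]any) (any, error)

// checkOutput verifies the assumed output shape: a map holding one
// boolean result for every attribute in the input.
func checkOutput(input map[string]any, output any) error {
	m, ok := output.(map[string]any)
	if !ok {
		return fmt.Errorf("output is %T, want map[string]any", output)
	}
	for attr := range input {
		if _, ok := m[attr].(bool); !ok {
			return fmt.Errorf("attribute %q: missing or non-boolean result", attr)
		}
	}
	return nil
}

// smokeTest runs every classifier once against a fixed input and joins
// all failures into a single error (nil when everything passes).
func smokeTest(classifiers map[string]rawClassifier, input map[string]any) error {
	var errs []error
	for name, c := range classifiers {
		out, err := c(input)
		if err == nil {
			err = checkOutput(input, out)
		}
		if err != nil {
			errs = append(errs, fmt.Errorf("label %s: %w", name, err))
		}
	}
	return errors.Join(errs...)
}

// demoGood and demoBad build illustrative classifier sets.
func demoGood() map[string]rawClassifier {
	return map[string]rawClassifier{
		"ADDRESS": func(in map[string]any) (any, error) {
			return map[string]any{"address2": true}, nil
		},
	}
}

func demoBad() map[string]rawClassifier {
	return map[string]rawClassifier{
		"BROKEN": func(in map[string]any) (any, error) {
			return "yes", nil // wrong shape on purpose
		},
	}
}

func main() {
	input := map[string]any{"address2": "1 Main St"}
	fmt.Println("good:", smokeTest(demoGood(), input))
	fmt.Println("bad:", smokeTest(demoBad(), input))
}
```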

@ccampo133 ccampo133 force-pushed the ENG-13633 branch 7 times, most recently from c9dbf4b to 46cf074 Compare April 3, 2024 23:52
@ccampo133 ccampo133 force-pushed the ENG-13633 branch 2 times, most recently from 03e79f6 to 36885f3 Compare April 4, 2024 16:13
ccampo133 commented Apr 4, 2024

@VictorGFM @yoursnerdly after a bunch of churn, I believe this is good to go, at least for the initial implementation. I put a lot of effort into minimizing the public API surface through a bunch of refactoring, but the code is largely the same. The main differences are that attributes are now represented as a path array (e.g. [db, schema, table, column]) and that the labels are reported along with the classifications.

Contributor

@VictorGFM VictorGFM left a comment

@ccampo133 The package organization and type definitions look really good! I left a few comments below for your consideration. I'll leave the approval to @yoursnerdly since he got the chance to take a look at the implementation in more detail.

@@ -0,0 +1,93 @@
package sql
Contributor

@VictorGFM VictorGFM Apr 4, 2024

What do you think about moving the Repository implementations to a separate package? Maybe a package within sql named repository. It seems easier to tell the repo implementations apart from the type definitions and utilities if they're in separate packages.

Contributor Author

That was the original design actually. If you think it makes it clearer, I am happy to do that. I thought just having a single package was easier from an API consumer perspective, but if you don't think so, let's change it.

Contributor Author

@ccampo133 ccampo133 Apr 4, 2024

Actually what ends up happening is you get a circular dependency, which is difficult to avoid. For example, if we have the following package layout:

sql/
  scanner.go
  sample.go
  repository/
    mysql.go

The scanner depends on the repository package, but the repository package will depend on the sql package to use the Sample type, and thus you get the circular dependency. What ends up happening is that the only thing that can live in the sql package by itself is the Scanner type. Everything else needs to go in the repository package, which is annoying and sort of defeats the purpose.

WDYT?
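
To make the trade-off concrete, here is a hedged single-file sketch of the one-package arrangement. The Sample fields, the Repository interface and its SampleTable method, and fakeMySQL are all simplified, hypothetical shapes; only the names Sample and Scanner appear in the discussion above. In the split layout, sql would import sql/repository for the implementations while sql/repository would import sql for Sample, which the Go compiler rejects as an import cycle. Inside one package, all three pieces can reference each other.

```go
package main

import (
	"context"
	"fmt"
)

// Sample is a hypothetical, simplified sample type shared by the
// scanner and every repository implementation.
type Sample struct {
	TablePath []string
	Rows      []map[string]any
}

// Repository is a hypothetical interface each database type implements.
type Repository interface {
	SampleTable(ctx context.Context, table string) (Sample, error)
}

// fakeMySQL stands in for a real implementation such as mysql.go.
type fakeMySQL struct{}

func (fakeMySQL) SampleTable(_ context.Context, table string) (Sample, error) {
	return Sample{
		TablePath: []string{"db", table},
		Rows:      []map[string]any{{"id": 1}},
	}, nil
}

// Scanner depends on Repository, and Repository depends on Sample. With
// everything in one package there is no import edge to form a cycle.
type Scanner struct {
	Repo Repository
}

// Scan samples each named table through the configured repository.
func (s Scanner) Scan(ctx context.Context, tables ...string) ([]Sample, error) {
	samples := make([]Sample, 0, len(tables))
	for _, t := range tables {
		smp, err := s.Repo.SampleTable(ctx, t)
		if err != nil {
			return nil, fmt.Errorf("sampling table %s: %w", t, err)
		}
		samples = append(samples, smp)
	}
	return samples, nil
}

// demoScan wires the fake repository into the scanner for illustration.
func demoScan() ([]Sample, error) {
	return Scanner{Repo: fakeMySQL{}}.Scan(context.Background(), "doctors", "patients")
}

func main() {
	samples, err := demoScan()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(samples), "tables sampled")
}
```

A common alternative is a third package that holds only the shared types, which both sides import, at the cost of one more package in the public API.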

Contributor

Oh, I see the problem with circular dependency now. In that case I think it's fine to keep the way it is on the sql package.

@yoursnerdly yoursnerdly left a comment

This is looking really good @ccampo133, thanks for the huge PR. Please look at my comments (mostly nitpicks).

sql/scanner.go Outdated
// "databases", therefore a single repository instance will always scan the
// entire database.
if s.Config.RepoConfig.Database != "" || s.Config.RepoType == RepoTypeOracle {
samples, err = s.sampleDb(ctx, s.Config.RepoConfig.Database)

Oracle does support multiple databases (in a very confusing way) since version 12c, there is a root CDB and then multiple PDBs within that. I think we need to support those scenarios as well - perhaps in a future PR.

For now, we can tell the UI to expect no database name for Oracle (just schema, table, column), but if we do later support PDBs etc., the attribute path may have 3 or 4 entries depending on the version. When there are 4 entries, the first entry should be interpreted as the database name.
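
A consumer of the attribute path could handle both lengths along these lines. This is a sketch only: the Attribute type and parseAttributePath function are hypothetical names, and the rule it encodes is just the one described above (3 entries means no database name, 4 entries means the first is the database).

```go
package main

import "fmt"

// Attribute is a hypothetical decoded form of an attribute path.
type Attribute struct {
	Database string // empty when the path carries no database entry
	Schema   string
	Table    string
	Column   string
}

// parseAttributePath interprets a 3-entry path as [schema, table, column]
// and a 4-entry path as [database, schema, table, column].
func parseAttributePath(path []string) (Attribute, error) {
	switch len(path) {
	case 3:
		return Attribute{Schema: path[0], Table: path[1], Column: path[2]}, nil
	case 4:
		return Attribute{Database: path[0], Schema: path[1], Table: path[2], Column: path[3]}, nil
	default:
		return Attribute{}, fmt.Errorf("attribute path has %d entries, want 3 or 4", len(path))
	}
}

func main() {
	for _, p := range [][]string{
		{"postgres", "public", "doctors", "address2"},
		{"hr", "employees", "email"},
	} {
		attr, err := parseAttributePath(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%+v\n", attr)
	}
}
```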

Contributor Author

Thanks - we should add support for this in a future PR then. I will need to research how Oracle works to add support for it, or delegate it to somebody more experienced with Oracle.

@ccampo133 ccampo133 force-pushed the ENG-13633 branch 2 times, most recently from 854ac64 to 87676f2 Compare April 5, 2024 21:32
@yoursnerdly yoursnerdly left a comment

Thanks Chris, this looks good to me. Just a couple of nitpicks.

}
rule, err := parseRego(string(b))
rule, err := readLabelRule(ruleFname, ruleFs)

Does this work if the path begins with ... Based on the documentation, it looks like it should but just checking.

We should document somewhere that the relative paths in the yaml file are relative to the directory containing the yaml file itself (and not the current directory).
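
One way to get "relative to the yaml file" semantics with an fs.FS is sketched below. The resolveRulePath and readRule helpers and the file names in demoFS are hypothetical; the PR's readLabelRule may work differently. The sketch relies on path.Join cleaning parent-directory segments, because fs.FS implementations reject unclean paths.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// resolveRulePath interprets rulePath relative to the directory that
// contains the YAML file, not the process's current directory.
// path.Join also cleans the result, collapsing any ".." segments.
func resolveRulePath(yamlPath, rulePath string) string {
	return path.Join(path.Dir(yamlPath), rulePath)
}

// readRule reads a rule file referenced from the YAML file at yamlPath.
// A path that climbs above the filesystem root stays invalid after
// cleaning, so fs.ReadFile returns an error for it.
func readRule(fsys fs.FS, yamlPath, rulePath string) ([]byte, error) {
	full := resolveRulePath(yamlPath, rulePath)
	b, err := fs.ReadFile(fsys, full)
	if err != nil {
		return nil, fmt.Errorf("reading rule %q (resolved to %q): %w", rulePath, full, err)
	}
	return b, nil
}

// demoFS builds an in-memory filesystem with illustrative file names.
func demoFS() fs.FS {
	return fstest.MapFS{
		"rules/labels.yaml":       {Data: []byte("# label definitions")},
		"rules/rego/address.rego": {Data: []byte("package address")},
		"shared/ip.rego":          {Data: []byte("package ip")},
	}
}

func main() {
	fsys := demoFS()
	for _, rel := range []string{"rego/address.rego", "../shared/ip.rego"} {
		b, err := readRule(fsys, "rules/labels.yaml", rel)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", rel, b)
	}
}
```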

Contributor Author

Thanks - it works now but there were actually some bugs around this. I added some test cases to cover them. The part about the path being relative to the file is documented in the embedded labels.yaml file as a header comment. I will also ensure this is documented in the public README, when it is updated.

// successfully loaded.
var errs classification.InvalidLabelsError
if errors.As(err, &errs) {
log.WithError(errs).Warnf("%s: some labels were not loaded", errMsg)

Maybe we should return an error if len(lbls) == 0, since there is no point scanning the db if there are no labels.
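
A rough sketch of how the warning path and the empty-labels check might fit together. The invalidLabelsError type below is a hypothetical stand-in for classification.InvalidLabelsError, whose real definition is not shown in this thread, and loadLabels, labelDef, and errNoLabels are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// invalidLabelsError is a hypothetical stand-in for
// classification.InvalidLabelsError: it collects per-label failures.
type invalidLabelsError struct {
	Errs []error
}

func (e invalidLabelsError) Error() string {
	return fmt.Sprintf("%d invalid label(s): %v", len(e.Errs), e.Errs)
}

// labelDef is an illustrative label definition; valid simulates whether
// its rule parses.
type labelDef struct {
	name  string
	valid bool
}

// loadLabels returns every label that loaded, plus an invalidLabelsError
// describing the ones that did not.
func loadLabels(defs []labelDef) ([]string, error) {
	var lbls []string
	var bad invalidLabelsError
	for _, d := range defs {
		if !d.valid {
			bad.Errs = append(bad.Errs, fmt.Errorf("label %s: invalid rule", d.name))
			continue
		}
		lbls = append(lbls, d.name)
	}
	if len(bad.Errs) > 0 {
		return lbls, bad
	}
	return lbls, nil
}

var errNoLabels = errors.New("no labels were loaded; nothing to scan for")

// labelsForScan warns about partially loaded labels but refuses to
// continue when nothing at all was loaded.
func labelsForScan(defs []labelDef) ([]string, error) {
	lbls, err := loadLabels(defs)
	var errs invalidLabelsError
	if errors.As(err, &errs) {
		log.Printf("some labels were not loaded: %v", errs)
	} else if err != nil {
		return nil, err
	}
	if len(lbls) == 0 {
		return nil, errNoLabels
	}
	return lbls, nil
}

func main() {
	lbls, err := labelsForScan([]labelDef{{"ADDRESS", true}, {"BROKEN", false}})
	fmt.Println(lbls, err)
	_, err = labelsForScan([]labelDef{{"BROKEN", false}})
	fmt.Println(err)
}
```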

@yoursnerdly

golangci-lint seems to be running into some internal errors - will need to look into that.

@ccampo133 ccampo133 merged commit f1d5fc0 into main Apr 8, 2024
3 checks passed
@ccampo133 ccampo133 deleted the ENG-13633 branch April 8, 2024 16:28
3 participants