add-vpc and remove-vpc use Events to create/destroy per-ENI mirroring #42

chelma · 2023-05-01T19:31:05Z

Description

The first part of the work to listen for ECS EventBridge events and automatically create/destroy per-ENI Mirroring for them.
Created a per-Cluster EventBridge Bus that we can send events through
Created per-VPC EventBridge Rules and Lambdas to listen for when an ENI should be created/destroyed and perform that action
Updated the add-vpc and remove-vpc CLI commands to emit events to the Cluster Bus to create/destroy the per-ENI mirroring components rather than perform that work themselves. This was basically copy/pasting the existing code into Lambdas with a few minor tweaks.
Added code to the Create/Destroy Lambdas to emit CloudWatch Metrics on their outcome. This will help users track what's happening in the system now that we're moving to an asynchronous model. Additionally, CloudWatch Logs contains the logs from our Lambda invocations, and these can be searched manually or using CloudWatch Insights.
REMAINING WORK TO COMPLETE TASK: Create EventBridge Rules to listen on the default Event Bus for AWS ECS Events and kick off a Lambda to take the raw, default Event and convert it into an event understandable to our Create/Destroy Lambdas
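
As a rough illustration of the CLI side of this change, the sketch below builds one event per ENI and puts it on the Cluster Bus. It is not the code from this PR: the source name, detail types, and detail field names are assumptions made up for the example, and `events_client` is assumed to behave like `boto3.client("events")`.

```python
import json
from typing import Any, Dict

# All names below are hypothetical placeholders, not taken from the PR's code.
EVENT_SOURCE = "arkime"
CREATE_DETAIL_TYPE = "CreateEniMirror"
DESTROY_DETAIL_TYPE = "DestroyEniMirror"


def build_eni_event(bus_name: str, detail_type: str, cluster_name: str,
                    vpc_id: str, eni_id: str) -> Dict[str, Any]:
    """Build one PutEvents entry asking a per-VPC Lambda to act on an ENI."""
    detail = {"cluster_name": cluster_name, "vpc_id": vpc_id, "eni_id": eni_id}
    return {
        "EventBusName": bus_name,
        "Source": EVENT_SOURCE,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),  # EventBridge takes Detail as a JSON string
    }


def request_eni_action(events_client: Any, entry: Dict[str, Any]) -> bool:
    """Send one entry to the bus; return True if EventBridge accepted it.

    `events_client` is expected to behave like boto3.client("events").
    """
    response = events_client.put_events(Entries=[entry])
    return response.get("FailedEntryCount", 0) == 0
```

A `True` result only means the event reached the bus, not that the mirroring session exists yet, which is why the Lambdas report their own outcome through metrics.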

Issues

Event-Based Mirroring: Listen for ECS Service Events #37

Testing

Added unit tests
Ran add-vpc and remove-vpc and confirmed the mirroring resources were successfully created/destroyed by the event-driven Lambdas, that the Lambdas emitted the expected metrics to CloudWatch, and that our test traffic still showed up in the Arkime Dashboard for the Cluster

(.venv) chelma@3c22fba4e266 cloud-demo % ./manage_arkime.py add-vpc --cluster-name MyCluster --vpc-id vpc-085d9c6085d49263e
2023-05-01 14:02:20 - Debug-level logs save to file: /Users/chelma/workspace/Arkime/cloud-demo/manage_arkime.log
2023-05-01 14:02:20 - Using AWS Credential Profile: default
2023-05-01 14:02:20 - Using AWS Region: default from AWS Config settings
2023-05-01 14:02:22 - Deploying shared mirroring components via CDK...
2023-05-01 14:02:22 - Executing command: deploy MyCluster-vpc-085d9c6085d49263e-Mirror
2023-05-01 14:02:22 - NOTE: This operation can take a while.  You can 'tail -f' the logfile to track the status.
2023-05-01 14:03:17 - Deployment succeeded
2023-05-01 14:03:18 - Initiating creation of mirroring session for ENI eni-030672d37883575bc
2023-05-01 14:03:19 - Initiating creation of mirroring session for ENI eni-063bf223f6eba509b
2023-05-01 14:03:19 - Initiating creation of mirroring session for ENI eni-03a4e617291b1a7c5
2023-05-01 14:03:20 - Initiating creation of mirroring session for ENI eni-0d5cd6940daa297e1

(.venv) chelma@3c22fba4e266 cloud-demo % ./manage_arkime.py remove-vpc --cluster-name MyCluster --vpc-id vpc-085d9c6085d49263e
2023-05-01 14:05:31 - Debug-level logs save to file: /Users/chelma/workspace/Arkime/cloud-demo/manage_arkime.log
2023-05-01 14:05:31 - Using AWS Credential Profile: default
2023-05-01 14:05:31 - Using AWS Region: default from AWS Config settings
2023-05-01 14:05:33 - Initiating teardown of mirroring session for ENI eni-063bf223f6eba509b
2023-05-01 14:05:34 - Tearing down shared mirroring components via CDK...
2023-05-01 14:05:35 - ================================================================================
2023-05-01 14:05:35 - USER ACTION REQUIRED:
2023-05-01 14:05:35 - --------------------------------------------------------------------------------
Your command will result in the the following CloudFormation stacks being destroyed in AWS Account 968674222892 and Region us-east-2: ['MyCluster-vpc-085d9c6085d49263e-Mirror']

Do you wish to proceed (y/yes or n/no)? y
2023-05-01 14:05:36 - Executing command: destroy --force MyCluster-vpc-085d9c6085d49263e-Mirror
2023-05-01 14:05:36 - NOTE: This operation can take a while.  You can 'tail -f' the logfile to track the status.
2023-05-01 14:07:58 - Destruction succeeded

Signed-off-by: Chris Helma <chelma+github@amazon.com>

* Given the move towards an event-based architecture, we needed a way to more easily keep track of what is happening in our system. I added CloudWatch Metrics to the lambda to indicate when each possible outcome occurred. Signed-off-by: Chris Helma <chelma+github@amazon.com>

* Like the CreateEniMirror Lambda, the Destroy Lambda emits metrics to CloudWatch on each possible outcome to help evaluate what's happening in the system. Signed-off-by: Chris Helma <chelma+github@amazon.com>
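
One common way to report "each possible outcome" is to emit a datapoint for every outcome on each invocation, with 1 for the outcome that occurred and 0 for the rest, so that every metric always has data. The sketch below assumes that approach; the namespace, outcome names, and dimensions are hypothetical, and `cloudwatch_client` is assumed to behave like `boto3.client("cloudwatch")`.

```python
from enum import Enum
from typing import Any, Dict, List

METRIC_NAMESPACE = "Arkime"  # hypothetical namespace


class CreateOutcome(Enum):
    # Hypothetical outcome names, for illustration only
    SUCCESS = "Success"
    ALREADY_EXISTS = "AlreadyExists"
    FAILURE = "Failure"


def build_outcome_metrics(cluster_name: str, vpc_id: str,
                          actual: CreateOutcome) -> List[Dict[str, Any]]:
    """Return one datapoint per outcome: 1 for the actual outcome, 0 otherwise."""
    dimensions = [
        {"Name": "ClusterName", "Value": cluster_name},
        {"Name": "VpcId", "Value": vpc_id},
    ]
    return [
        {
            "MetricName": outcome.value,
            "Dimensions": dimensions,
            "Value": 1 if outcome is actual else 0,
            "Unit": "Count",
        }
        for outcome in CreateOutcome
    ]


def emit_outcome(cloudwatch_client: Any, cluster_name: str, vpc_id: str,
                 actual: CreateOutcome) -> None:
    """Publish the outcome datapoints in a single PutMetricData call."""
    cloudwatch_client.put_metric_data(
        Namespace=METRIC_NAMESPACE,
        MetricData=build_outcome_metrics(cluster_name, vpc_id, actual),
    )
```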

awick · 2023-05-01T20:23:30Z

In the future, if possible, when doing a reorg/rename (manage_arkime. => cdk_interactions., for example), do that in its own PR so there are fewer changes per PR.
It looks like this PR doesn't do the subnet event, just an ENI event? I just want to make sure we will eventually still have the subnet or VPC events. (My concern is: if I add a VPC with N subnets and 1000s of instances, is that better as 1 VPC event, N subnet events, or 1000s of ENI events?)
Eventually I assume we will have multiple versions of some of these events as we learn new args and stuff. Do you want to prepare for that now, or does eventbridge make it easy to handle?
Going to test locally now

chelma · 2023-05-01T20:41:41Z

In the future, if possible, when doing a reorg/rename (manage_arkime. => cdk_interactions., for example), do that in its own PR so there are fewer changes per PR.

I agree; it created a bit of a mess. I unfortunately discovered it was necessary right in the middle of a bunch of changes in order for the namespacing inside the Lambda containers to work correctly.

It looks like this PR doesn't do the subnet event, just an ENI event? I just want to make sure we will eventually still have the subnet or VPC events. (My concern is: if I add a VPC with N subnets and 1000s of instances, is that better as 1 VPC event, N subnet events, or 1000s of ENI events?)

So, we need to have a per-ENI Event/Lambda in order to handle the EC2/ECS autoscaling events which themselves operate on a per-ENI level.
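
To make the per-ENI mapping concrete, here is a minimal sketch of turning a raw ECS event into per-ENI actions. The field names (`attachments`, `details`, `networkInterfaceId`, `lastStatus`) reflect my understanding of the "ECS Task State Change" event shape and should be checked against the ECS documentation; the action names and the RUNNING/STOPPED mapping are assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple

# Attachment type strings seen for task ENIs; accepting both is a hedge.
ENI_ATTACHMENT_TYPES = {"eni", "ElasticNetworkInterface"}


def extract_eni_ids(ecs_event: Dict[str, Any]) -> List[str]:
    """Collect the ENI IDs listed in the event's ENI attachments."""
    eni_ids = []
    for attachment in ecs_event.get("detail", {}).get("attachments", []):
        if attachment.get("type") not in ENI_ATTACHMENT_TYPES:
            continue
        for item in attachment.get("details", []):
            if item.get("name") == "networkInterfaceId":
                eni_ids.append(item["value"])
    return eni_ids


def translate_task_event(ecs_event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Map a task state change to (action, eni_id) pairs; empty if not relevant."""
    if ecs_event.get("detail-type") != "ECS Task State Change":
        return []
    status = ecs_event.get("detail", {}).get("lastStatus")
    if status == "RUNNING":
        action = "CreateEniMirror"   # hypothetical action name
    elif status == "STOPPED":
        action = "DestroyEniMirror"  # hypothetical action name
    else:
        return []  # ignore intermediate states such as PENDING
    return [(action, eni_id) for eni_id in extract_eni_ids(ecs_event)]
```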

I'm planning on making another Event/Lambda that operates at the subnet level to detect changes in the ENIs we have mirroring set up for. This is what I was (initially) planning to have perform our automated scheduled scans (see #36). I think there are pros and cons to having that Lambda directly manipulate the ENIs, but my current inclination is to have it bulk-put Create/Destroy events to the Cluster Bus for our per-ENI Lambdas to act on, instead of doing the operations itself. We have a pre-built distributed event system that is designed to operate at scale; we might as well use it. The alternative would be having that subnet Lambda tackle them single-threaded.
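
A sketch of what that subnet-level scan could look like, under stated assumptions: it diffs the ENIs currently in the subnet against the ones already mirrored, then bulk-puts the resulting entries. The function names are hypothetical. The batch size of 10 reflects the PutEvents limit of 10 entries per request, and `events_client` is assumed to behave like `boto3.client("events")`.

```python
from typing import Any, Dict, Iterator, List, Set, Tuple

MAX_ENTRIES_PER_PUT = 10  # PutEvents accepts at most 10 entries per request


def plan_eni_changes(current_enis: Set[str],
                     mirrored_enis: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (ENIs needing a mirror, ENIs whose mirror should be torn down)."""
    to_create = sorted(current_enis - mirrored_enis)
    to_destroy = sorted(mirrored_enis - current_enis)
    return to_create, to_destroy


def chunked(items: List[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_put(events_client: Any, entries: List[Dict[str, Any]]) -> int:
    """Put all entries in batches; return how many EventBridge rejected."""
    failed = 0
    for batch in chunked(entries, MAX_ENTRIES_PER_PUT):
        response = events_client.put_events(Entries=batch)
        failed += response.get("FailedEntryCount", 0)
    return failed
```

A non-zero return value would be a natural thing to retry or surface as a metric, since rejected entries never reach the per-ENI Lambdas.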

I can see a per-VPC Event/Lambda existing and it's one piece I was planning on using to make the refresh-cluster CLI command work (see #32). The premise would be that the per-VPC Lambda would find new/removed Subnets and create the per-Subnet configuration we're currently making using the CDK, then kick off the per-Subnet Lambdas to look for new/removed ENIs, and so on.

Eventually I assume we will have multiple versions of some of these events as we learn new args and stuff. Do you want to prepare for that now, or does eventbridge make it easy to handle?

Hmm, could use a bit more context to understand what you mean here. I would say that, in general, wiring up EventBridge Rules to look for Events and trigger a Lambda is really easy.

The one thing I have a question about is how much to atomize our Lambdas. This PR proposes a model where we have separate Create and Destroy Lambdas for each VPC. This makes it easier to track what our system is doing (imo) because they generate separate metrics and logs this way. However, it is more things to keep track of. We may end up combining some Lambda-based responsibilities into unified Lambdas.

Going to test locally now

Cool!

awick · 2023-05-01T20:52:30Z

I can see a per-VPC Event/Lambda existing and it's one piece I was planning on using to make the refresh-cluster CLI command work (see #32). The premise would be that the per-VPC Lambda would find new/removed Subnets and create the per-Subnet configuration we're currently making using the CDK, then kick off the per-Subnet Lambdas to look for new/removed ENIs, and so on.

Ok. So are you saying in the future add-vpc would just use that? If that's not what you are saying, then we will want to do some testing of onboarding large VPCs/subnets creating 1000s of ENI events.

Hmm, could use a bit more context to understand what you mean here. I would say that, in general, wiring up EventBridge Rules to look for Events and trigger a Lambda is really easy.

Whenever I'm using an event bus, I like to at least discuss up front how I'm going to do event versioning for when I discover I need more/less/different parameters in the messages. The two most common solutions are either a version field in every message or changing the name of the event (such as appending _v2, _v3, etc.). Then the discussion is whether the initial implementation should carry this version marker or not, such as version: 1 or _v1. The general issue is that eventually you'll have either a newer version of the Lambda or of the CLI, depending on upgrade order.
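
For illustration, the version-field option could look roughly like the sketch below on the Lambda side. Everything here is hypothetical (the field names, the `traffic_filter` parameter added in "v2", and treating unversioned events as v1); it only shows the dispatch pattern, not a decided design.

```python
from typing import Any, Callable, Dict


def _parse_v1(detail: Dict[str, Any]) -> Dict[str, Any]:
    """v1 events carry only the ENI and VPC IDs."""
    return {"eni_id": detail["eni_id"], "vpc_id": detail["vpc_id"],
            "traffic_filter": None}


def _parse_v2(detail: Dict[str, Any]) -> Dict[str, Any]:
    """v2 adds an optional, hypothetical `traffic_filter` parameter."""
    return {"eni_id": detail["eni_id"], "vpc_id": detail["vpc_id"],
            "traffic_filter": detail.get("traffic_filter")}


PARSERS: Dict[int, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    1: _parse_v1,
    2: _parse_v2,
}


def parse_detail(event: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize an EventBridge event's detail based on its version field."""
    detail = event["detail"]
    version = detail.get("version", 1)  # assumption: unversioned means v1
    parser = PARSERS.get(version)
    if parser is None:
        # A newer CLI talking to an older Lambda lands here.
        raise ValueError(f"Unsupported event version: {version}")
    return parser(detail)
```

The explicit error for unknown versions covers the upgrade-order problem: an older Lambda fails loudly on a newer event instead of silently ignoring parameters it does not understand.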

chelma · 2023-05-01T21:01:40Z

Ok. So are you saying in the future add-vpc would just use that? If that's not what you are saying, then we will want to do some testing of onboarding large VPCs/subnets creating 1000s of ENI events.

Yeah, once we've built up to having a VPC-level Lambda, we'll have our add-vpc call use it to do all the dynamic configuration. And I definitely agree we need to do testing at scale for, like, this entire repo. I'm hoping to do some of that for this task to ensure our Capture Node auto-scaling is working (#31).

Whenever I'm using an event bus, I like to at least discuss up front how I'm going to do event versioning for when I discover I need more/less/different parameters in the messages.

Great point. Let me create a follow-up task for this, and we can discuss the best strategy and document it there.

chelma · 2023-05-01T21:10:55Z

Created follow-up task to add event versioning: #43

awick

LGTM (except for small conflict :) )

chelma added 4 commits April 28, 2023 16:54

add-vpc uses events to create Traffic Mirroring Sessions

6d5a4d5

Signed-off-by: Chris Helma <chelma+github@amazon.com>

remove-vpc uses events to destroy Traffic Mirroring Sessions

b2fc43a

DestroyEniMirror Lambda now emits outcome metrics

c902953

chelma added the Capture Resilience label (work to make traffic capture more resilient to changes in load, configuration, and sources) May 1, 2023

chelma requested a review from awick May 1, 2023 19:33

chelma mentioned this pull request May 1, 2023

Event-Based Mirroring: Listen for ECS Service Events #37

Closed

chelma requested a review from 31453 May 1, 2023 19:44

chelma mentioned this pull request May 1, 2023

Add State & Event Versioning #43

Open

awick approved these changes May 1, 2023

Merge branch 'main' into ecs-events

4cfed07

chelma merged commit 2d17c1e into main May 2, 2023

chelma deleted the ecs-events branch May 2, 2023 12:16
