add-vpc and remove-vpc use Events to create/destroy per-ENI mirroring #42

Merged 5 commits on May 2, 2023

Conversation

chelma
Collaborator

@chelma chelma commented May 1, 2023

Description

  • The first part of the work to listen for ECS EventBridge events and automatically create/destroy per-ENI Mirroring for them.
  • Created a per-Cluster EventBridge Bus that we can send events through
  • Created per-VPC EventBridge Rules and Lambdas to listen for when an ENI should be created/destroyed and perform that action
  • Updated the add-vpc and remove-vpc CLI commands to emit events to the Cluster Bus that trigger creation/destruction of the per-ENI mirroring components, rather than performing that work themselves. This was basically copy/pasting the existing code into Lambdas with a few minor tweaks.
  • Added code to the Create/Destroy Lambdas to emit CloudWatch Metrics on their outcome. This will help users track what's happening in the system now that we're moving to an asynchronous model. Additionally, CloudWatch Logs contains the logs from our Lambda invocations, and these can be searched manually or using CloudWatch Insights.
  • REMAINING WORK TO COMPLETE TASK: Create EventBridge Rules to listen on the default Event Bus for AWS ECS Events and kick off a Lambda to take the raw, default Event and convert it into an event understandable to our Create/Destroy Lambdas
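
As a rough illustration of the "emit an event to the Cluster Bus" step described above, here is a minimal sketch. The source string, detail-type, detail fields, and function names are all assumptions made up for this example; the PR text does not show the real ones. The client is passed in so the sketch has no AWS dependency; in practice it would be `boto3.client("events")`.

```python
import json

# Hypothetical values: the PR does not show the real source or detail-type.
EVENT_SOURCE = "arkime"
CREATE_DETAIL_TYPE = "CreateEniMirror"


def build_create_entry(bus_name, cluster_name, vpc_id, subnet_id, eni_id):
    """Build one PutEvents entry asking for mirroring to be set up on an ENI."""
    detail = {
        "cluster_name": cluster_name,
        "vpc_id": vpc_id,
        "subnet_id": subnet_id,
        "eni_id": eni_id,
    }
    return {
        "EventBusName": bus_name,
        "Source": EVENT_SOURCE,
        "DetailType": CREATE_DETAIL_TYPE,
        "Detail": json.dumps(detail),  # PutEvents expects Detail as a JSON string
    }


def emit(events_client, entry):
    """Send one entry; events_client would be boto3.client("events")."""
    response = events_client.put_events(Entries=[entry])
    # put_events reports per-entry failures in the response rather than raising
    return response.get("FailedEntryCount", 0) == 0
```

Because the CLI only learns that the event was accepted by the bus, not that the mirroring session was created, the metrics and logs emitted by the Lambdas become the place to check the actual outcome.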

Issues

Testing

  • Added unit tests
  • Ran add-vpc and remove-vpc and confirmed the mirroring resources were successfully created/destroyed by the event-driven Lambdas, that the Lambdas emitted the expected metrics to CloudWatch, and that our test traffic still showed up in the Arkime Dashboard for the Cluster
(.venv) chelma@3c22fba4e266 cloud-demo % ./manage_arkime.py add-vpc --cluster-name MyCluster --vpc-id vpc-085d9c6085d49263e
2023-05-01 14:02:20 - Debug-level logs save to file: /Users/chelma/workspace/Arkime/cloud-demo/manage_arkime.log
2023-05-01 14:02:20 - Using AWS Credential Profile: default
2023-05-01 14:02:20 - Using AWS Region: default from AWS Config settings
2023-05-01 14:02:22 - Deploying shared mirroring components via CDK...
2023-05-01 14:02:22 - Executing command: deploy MyCluster-vpc-085d9c6085d49263e-Mirror
2023-05-01 14:02:22 - NOTE: This operation can take a while.  You can 'tail -f' the logfile to track the status.
2023-05-01 14:03:17 - Deployment succeeded
2023-05-01 14:03:18 - Initiating creation of mirroring session for ENI eni-030672d37883575bc
2023-05-01 14:03:19 - Initiating creation of mirroring session for ENI eni-063bf223f6eba509b
2023-05-01 14:03:19 - Initiating creation of mirroring session for ENI eni-03a4e617291b1a7c5
2023-05-01 14:03:20 - Initiating creation of mirroring session for ENI eni-0d5cd6940daa297e1
(.venv) chelma@3c22fba4e266 cloud-demo % ./manage_arkime.py remove-vpc --cluster-name MyCluster --vpc-id vpc-085d9c6085d49263e
2023-05-01 14:05:31 - Debug-level logs save to file: /Users/chelma/workspace/Arkime/cloud-demo/manage_arkime.log
2023-05-01 14:05:31 - Using AWS Credential Profile: default
2023-05-01 14:05:31 - Using AWS Region: default from AWS Config settings
2023-05-01 14:05:33 - Initiating teardown of mirroring session for ENI eni-063bf223f6eba509b
2023-05-01 14:05:34 - Tearing down shared mirroring components via CDK...
2023-05-01 14:05:35 - ================================================================================
2023-05-01 14:05:35 - USER ACTION REQUIRED:
2023-05-01 14:05:35 - --------------------------------------------------------------------------------
Your command will result in the the following CloudFormation stacks being destroyed in AWS Account 968674222892 and Region us-east-2: ['MyCluster-vpc-085d9c6085d49263e-Mirror']

Do you wish to proceed (y/yes or n/no)? y
2023-05-01 14:05:36 - Executing command: destroy --force MyCluster-vpc-085d9c6085d49263e-Mirror
2023-05-01 14:05:36 - NOTE: This operation can take a while.  You can 'tail -f' the logfile to track the status.
2023-05-01 14:07:58 - Destruction succeeded

Signed-off-by: Chris Helma <chelma+github@amazon.com>
* Given the move towards an event-based architecture, we needed a
  way to more easily keep track of what is happening in our system.
  I added CloudWatch Metrics to the lambda to indicate when each
  possible outcome occurred.

Signed-off-by: Chris Helma <chelma+github@amazon.com>
* Like the CreateEniMirror Lambda, the Destroy Lambda emits
  metrics to CloudWatch on each possible outcome to help evaluate
  what's happening in the system.

Signed-off-by: Chris Helma <chelma+github@amazon.com>
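
The "metric per possible outcome" idea in the commit messages above could be sketched roughly as follows. The outcome names, dimension names, and namespace are hypothetical; the PR text does not list the real ones. The sketch also emits a 0 for the outcomes that did not occur, which is one possible design choice rather than something the PR confirms. The client would be `boto3.client("cloudwatch")`.

```python
# Hypothetical outcome names; the PR does not list the real ones.
OUTCOMES = ["Success", "AbortedExists", "Failure"]


def build_outcome_metrics(cluster_name, vpc_id, outcome):
    """Return MetricData with value 1 for the outcome that occurred, 0 for the rest."""
    if outcome not in OUTCOMES:
        raise ValueError(f"Unknown outcome: {outcome}")
    dimensions = [
        {"Name": "ClusterName", "Value": cluster_name},
        {"Name": "VpcId", "Value": vpc_id},
    ]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Value": 1 if name == outcome else 0,
            "Unit": "Count",
        }
        for name in OUTCOMES
    ]


def publish(cloudwatch_client, namespace, metric_data):
    """Publish the data points; cloudwatch_client would be boto3.client("cloudwatch")."""
    cloudwatch_client.put_metric_data(Namespace=namespace, MetricData=metric_data)
```

Emitting every outcome on every invocation means each metric has a data point whenever the Lambda ran, so a flat line at zero can be told apart from missing data.
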
@chelma chelma added the "Capture Resilience" label (Work to make traffic capture more resilient to changes in load, configuration, and sources) May 1, 2023
@chelma chelma requested a review from awick May 1, 2023 19:33
@chelma chelma requested a review from 31453 May 1, 2023 19:44
@awick
Contributor

awick commented May 1, 2023

  • In the future if possible when doing a reorg/rename (manage_arkime. => cdk_interactions. for example) do that in its own PR so there are fewer changes per PR.
  • It looks like this PR doesn't do the subnet event, but just an eni event? I just want to make sure eventually we will still have the subnet or vpc events? (My concern is if I add vpc with N subnets and 1000s of instances is that better as 1 vpc event, N subnet events, or 1000s of eni events?)
  • Eventually I assume we will have multiple versions of some of these events as we learn new args and stuff. Do you want to prepare for that now, or does eventbridge make it easy to handle?
  • Going to test locally now

@chelma
Collaborator Author

chelma commented May 1, 2023

> In the future if possible when doing a reorg/rename (manage_arkime. => cdk_interactions. for example) do that in its own PR so there are fewer changes per PR.

I agree; it created a bit of a mess. I unfortunately discovered it was necessary right in the middle of a bunch of changes in order for the namespacing inside the Lambda containers to work correctly.

> It looks like this PR doesn't do the subnet event, but just an eni event? I just want to make sure eventually we will still have the subnet or vpc events? (My concern is if I add vpc with N subnets and 1000s of instances is that better as 1 vpc event, N subnet events, or 1000s of eni events?)

So, we need to have a per-ENI Event/Lambda in order to handle the EC2/ECS autoscaling events which themselves operate on a per-ENI level.
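
For context on why the per-ENI level is the natural fit: an ECS task state change carries the task's ENI in its attachment details, so a raw event maps directly to per-ENI actions. The sketch below is not part of this PR (the translation Lambda is listed as remaining work). It assumes the event shape AWS documents for awsvpc-mode tasks (`detail.lastStatus`, and `detail.attachments[].details[]` entries named `networkInterfaceId`), and the "create"/"destroy" action names are hypothetical.

```python
def extract_eni_ids(ecs_event):
    """Return ENI IDs found in the attachments of an ECS Task State Change event."""
    eni_ids = []
    for attachment in ecs_event.get("detail", {}).get("attachments", []):
        for item in attachment.get("details", []):
            if item.get("name") == "networkInterfaceId":
                eni_ids.append(item["value"])
    return eni_ids


def translate(ecs_event):
    """Map a raw ECS event to (action, eni_id) pairs; action names are hypothetical."""
    status = ecs_event.get("detail", {}).get("lastStatus")
    if status == "RUNNING":
        action = "create"
    elif status == "STOPPED":
        action = "destroy"
    else:
        return []  # ignore intermediate states such as PROVISIONING or PENDING
    return [(action, eni_id) for eni_id in extract_eni_ids(ecs_event)]
```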

I'm planning on making another Event/Lambda that operates at the subnet level to detect changes in the ENIs we have mirroring set up for. This is what I was planning on (initially) having perform our automated scheduled scans (see #36). I think there are pros and cons to having that Lambda directly manipulate the ENIs, but my current inclination is to have it bulk-put Create/Destroy events to the Cluster Bus for our per-ENI Lambdas to action instead of doing the operations itself. We have a pre-built distributed event system that is designed to operate at scale; we might as well use it. The alternative would be having that Subnet Lambda tackle them single-threaded.
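
A minimal sketch of that subnet-level "diff, then bulk-put" idea, under stated assumptions: the function names are hypothetical and the entries are whatever PutEvents entries the per-ENI Lambdas expect. The one hard constraint reflected here is that the EventBridge PutEvents API accepts at most 10 entries per request, so a large subnet has to be sent in batches.

```python
def plan_changes(current_enis, mirrored_enis):
    """Return (to_create, to_destroy) by comparing ENIs present vs. ENIs mirrored."""
    current, mirrored = set(current_enis), set(mirrored_enis)
    return sorted(current - mirrored), sorted(mirrored - current)


def chunk(entries, size=10):
    """Split entries into batches; PutEvents allows at most 10 entries per call."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def bulk_put(events_client, entries):
    """Send all entries in batches and return how many the bus rejected."""
    failed = 0
    for batch in chunk(entries):
        response = events_client.put_events(Entries=batch)
        failed += response.get("FailedEntryCount", 0)
    return failed
```

With this split, the subnet Lambda only decides what should change; the per-ENI Lambdas, fanned out by the bus, do the work in parallel.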

I can see a per-VPC Event/Lambda existing and it's one piece I was planning on using to make the refresh-cluster CLI command work (see #32). The premise would be that the per-VPC Lambda would find new/removed Subnets and create the per-Subnet configuration we're currently making using the CDK, then kick off the per-Subnet Lambdas to look for new/removed ENIs, and so on.

> Eventually I assume we will have multiple versions of some of these events as we learn new args and stuff. Do you want to prepare for that now, or does eventbridge make it easy to handle?

Hmm, could use a bit more context to understand what you mean here. I would say that, in general, wiring up EventBridge Rules to look for Events and trigger a Lambda is really easy.

The one thing I have a question about is how much to atomize our Lambdas. This PR proposes a model where we have separate Create and Destroy Lambdas for each VPC. This makes it easier to track what our system is doing (imo) because they generate separate metrics and logs this way. However, it is more things to keep track of. We may end up combining some Lambda-based responsibilities into unified Lambdas.
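
To make the "separate Create and Destroy Lambdas for each VPC" model concrete, here is a rough sketch of what per-VPC rule patterns could look like. The source, detail-type, and `vpc_id` field names are assumptions for illustration, and `matches` is a deliberately simplified stand-in for EventBridge's own matching (it handles only exact-value lists and nested objects).

```python
# Hypothetical source name; the PR does not show the real one.
EVENT_SOURCE = "arkime"


def build_rule_pattern(detail_type, vpc_id):
    """Event pattern selecting one event type for one VPC."""
    return {
        "source": [EVENT_SOURCE],
        "detail-type": [detail_type],
        "detail": {"vpc_id": [vpc_id]},
    }


def matches(pattern, event):
    """Simplified matcher: exact-value lists and nested dicts only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True
```

Each VPC would get two such rules, one per detail-type, each targeting its own Lambda, which is what keeps the metrics and logs separate and also what multiplies the number of resources to track.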

> Going to test locally now

Cool!

@awick
Contributor

awick commented May 1, 2023

> I can see a per-VPC Event/Lambda existing and it's one piece I was planning on using to make the refresh-cluster CLI command work (see #32). The premise would be that the per-VPC Lambda would find new/removed Subnets and create the per-Subnet configuration we're currently making using the CDK, then kick off the per-Subnet Lambdas to look for new/removed ENIs, and so on.

Ok. So are you saying in the future add-vpc would just use that? If that's not what you are saying then we will want to do some testing of onboarding large vpcs/subnets creating 1000s of ENI events.

> Hmm, could use a bit more context to understand what you mean here. I would say that, in general, wiring up EventBridge Rules to look for Events and trigger a Lambda is really easy.

Whenever I'm using an event bus, I like to discuss up front how I'm going to do event versioning once I discover I need more/less/different parameters in the messages. The two most common solutions are either a version field in every message or changing the name of the event (such as appending _v2, _v3, etc.). Then the discussion is whether the initial implementation should carry this version marker or not, such as version: 1 or _v1. The general issue is that eventually you'll have either a newer version of the Lambda or a newer version of the CLI, depending on upgrade order.
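
A minimal sketch of the "version field in every message" option, assuming hypothetical field names (`version`, `eni_id`, `new_arg`) and treating events without a version as v1. It is an illustration of the trade-off being discussed, not what the project adopted; the strategy was deferred to a follow-up task.

```python
def handle_v1(detail):
    """v1 events carry only the ENI ID (hypothetical schema)."""
    return {"eni_id": detail["eni_id"], "new_arg": None}


def handle_v2(detail):
    """v2 events add a hypothetical extra parameter."""
    return {"eni_id": detail["eni_id"], "new_arg": detail["new_arg"]}


HANDLERS = {1: handle_v1, 2: handle_v2}


def dispatch(detail):
    """Route an event's detail to the handler for its declared version."""
    version = detail.get("version", 1)  # events without a marker are treated as v1
    handler = HANDLERS.get(version)
    if handler is None:
        # An older Lambda receiving an event from a newer CLI ends up here
        raise ValueError(f"Unsupported event version: {version}")
    return handler(detail)
```

Defaulting unmarked events to v1 lets an initial release ship without the marker, at the cost of never being able to tell "v1" from "forgot to set the version".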

@chelma
Collaborator Author

chelma commented May 1, 2023

> Ok. So are you saying in the future add-vpc would just use that? If that's not what you are saying then we will want to do some testing of onboarding large vpcs/subnets creating 1000s of ENI events.

Yeah - once we've built up to having a VPC-level Lambda, we'll have our add-vpc call use it to do all the dynamic configuration. And I definitely agree we need to do testing at scale, for this entire repo. I'm hoping to do some of that for this task to ensure our Capture Node auto-scaling is working (#31).

> Whenever I'm using an event bus I like to upfront at least discuss how I'm going to do event versioning when I discover I need more/less/different parameters in the messages.

Great point. Let me create a follow-up task for this and we can discuss the best strategy and document in there.

@chelma
Collaborator Author

chelma commented May 1, 2023

Created follow-up task to add event versioning: #43

Contributor

@awick awick left a comment

LGTM (except for small conflict :) )

@chelma chelma merged commit 2d17c1e into main May 2, 2023
@chelma chelma deleted the ecs-events branch May 2, 2023 12:16
Labels
Capture Resilience Work to make traffic capture more resilient to changes in load, configuration, and sources
2 participants