
[Design Proposal] Event-Based Traffic Mirroring Setup/Teardown #35

Closed
chelma opened this issue Apr 26, 2023 · 9 comments
Assignees
Labels
Capture Resilience Work to make traffic capture more resilient to changes in load, configuration, and sources Design Proposal

Comments

@chelma
Collaborator

chelma commented Apr 26, 2023

Proposal

It is proposed to add event-based setup of traffic mirroring to the Arkime Cloud tooling. The capability would be rolled out in phases. The first phase is a scheduled event that triggers updates to the per-ENI Mirroring configuration on a regular cadence. The second phase adds rules that listen for the AWS EventBridge events natively fired by AWS Services (e.g. EC2 Autoscaling) on state changes in order to update the per-ENI Mirroring configuration more proactively. The third phase would automatically manage the per-Subnet Mirroring configuration as well, instead of just the per-ENI configuration.

Background - Existing Solution

The existing solution uses VPC Traffic Mirroring [1] to send a copy of the user's traffic from the User VPC through a Gateway Load Balancer [2] to Capture Nodes in the Capture VPC. The Capture Nodes are running a copy of the Arkime Capture process. The source of mirrored traffic must currently be a Network Interface [3] in the user's VPC.

There are three levels of configuration required to make this work.

  • Per-VPC:

    • Traffic Mirroring Filter: every User VPC has a set of filtering rules used to govern which traffic is sent to the Capture VPC
  • Per-Subnet:

    • VPC Endpoint: every subnet in the User's VPC has a VPC Endpoint created in it as a funnel to the GWLB in the Capture VPC.
    • Traffic Mirroring Target: A conceptual resource that provides a mapping to a specific destination that traffic can be mirrored to; in this case, the subnet's VPC Endpoint
  • Per-ENI:

    • Traffic Mirroring Session: Maps an ENI (source) to a Mirroring Target, subject to a Mirroring Filter Rule

Currently, the add-vpc CLI operation creates these resources based on a point-in-time understanding of the User VPC's subnets and ENIs, and the remove-vpc CLI operation tears them down. The Per-VPC and Per-Subnet resources are managed using CDK/CloudFormation and the Per-ENI resources are managed using the Python SDK (boto).

[1] https://docs.aws.amazon.com/vpc/latest/mirroring/what-is-traffic-mirroring.html
[2] https://docs.aws.amazon.com/elasticloadbalancing/latest/gateway/introduction.html
[3] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html
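For illustration, the per-ENI step above boils down to one boto3 EC2 call per ENI. A minimal sketch, assuming the helper name and resource IDs are hypothetical (the kwargs match the real `create_traffic_mirror_session` API):

```python
def build_mirror_session_params(eni_id: str, target_id: str, filter_id: str,
                                session_number: int) -> dict:
    """Kwargs for EC2's create_traffic_mirror_session, which maps an ENI
    (source) to a Mirroring Target, subject to a Mirroring Filter."""
    return {
        "NetworkInterfaceId": eni_id,
        "TrafficMirrorTargetId": target_id,
        "TrafficMirrorFilterId": filter_id,
        "SessionNumber": session_number,
    }

# Usage (requires AWS credentials and real resource IDs):
# import boto3
# ec2 = boto3.client("ec2")
# ec2.create_traffic_mirror_session(
#     **build_mirror_session_params("eni-0abc", "tmt-0abc", "tmf-0abc", 1))
```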

Background - AWS EventBridge

AWS EventBridge [1] is a service that provides an event bus that both AWS Services and user-defined applications can use to communicate state changes. A sample event might be that an EC2 Autoscaling Group successfully launched a new EC2 Instance. Users can set up rules to listen for specific event types on a given bus, perform transformations of the event messages, and direct the messages to a target which can take action (e.g. a Lambda function). All rules that apply to an event fire, and each rule can send the event to multiple targets. Each target can be configured with a different retry policy, and events that fail to be actioned can be sent to a dead letter queue. The delivery guarantee for a given target is at-least-once. AWS EventBridge supports both cross-region and cross-account operation.

[1] https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html
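As a concrete sketch of the producer side, a custom event is published to a bus with a single `put_events` call. The `Source` and `DetailType` strings below are illustrative placeholders, not the project's actual schema:

```python
import json

def build_eni_event(bus_name: str, vpc_id: str, eni_id: str, action: str) -> dict:
    """One entry for events.put_events; Detail must be a JSON string."""
    return {
        "EventBusName": bus_name,
        "Source": "arkime.cli",                  # hypothetical source name
        "DetailType": f"EniMirroring{action}",   # e.g. EniMirroringCreate
        "Detail": json.dumps({"vpcId": vpc_id, "eniId": eni_id}),
    }

# Usage (requires AWS credentials and an existing bus):
# import boto3
# boto3.client("events").put_events(
#     Entries=[build_eni_event("MyClusterBus", "vpc-0a", "eni-0b", "Create")])
```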

Phase 1 Proposal

It is proposed to create an EventBridge Bus for each Arkime Cluster. This will be accomplished using CDK, and bundled with the existing Capture VPC resource(s). We create the Bus per-Cluster because the default quota for Buses per account/region is fairly low; the default TPS throttling rate per bus is fairly high; it should be easy to distinguish between events meant for each User VPC; and it should be possible to move to per-VPC Buses later if necessary.

It is proposed that we create per-VPC Rules and Lambda Functions to watch for events on the per-Cluster Bus and action them, and bundle them with the other per-VPC resources deployed via CDK. We make these resources per-VPC because we want our Lambda functions to have as few permissions as possible. While we currently have the entire system operate within a single AWS Account/Region, in the future we would like to enable a single Arkime Cluster to monitor traffic in many User VPCs spread across multiple AWS Accounts and Regions. In that scenario, we don't want our Lambda Functions to have access across all User accounts/regions; just the ones required to action a specific VPC. There's also no apparent downside to beginning this segregation now.

The Lambda Functions we create will effectively just run the Python code that our add-vpc and remove-vpc CLI operations currently execute to set up/tear down ENI-specific mirroring configuration (excluding all the CDK-related behavior). The add-vpc and remove-vpc CLI commands will be updated to emit events to the Cluster bus to trigger the Lambda functions.

Additionally, we will have scheduled events fire every minute to continuously scan for changes in the ENIs and trigger the add/remove lambdas.

At the end of Phase 1, we will have a system in place that ensures that the per-ENI configuration is checked and updated at least once per minute.
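The scheduled scan is essentially a reconciliation loop: diff the ENIs that currently exist in the User VPC against the ENIs we already have mirroring state for. A minimal sketch of that diff (the function name is illustrative):

```python
def plan_eni_updates(current_enis: set, mirrored_enis: set):
    """Return (ENIs needing a new Mirroring Session created,
    ENIs whose stale Mirroring Session should be torn down)."""
    to_create = sorted(current_enis - mirrored_enis)
    to_delete = sorted(mirrored_enis - current_enis)
    return to_create, to_delete
```

The add/remove Lambdas would then be triggered once per entry in each list.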

Phase 2 Proposal

It is proposed that we build on Phase 1 by beginning to listen to the existing events continuously emitted by AWS Services on their state changes. Examples include the events AWS EC2 Autoscaling emits to EventBridge when new instances start/stop running, and those AWS ECS emits when containers start/stop running. These events are natively emitted to the default EventBridge bus that exists in every AWS Account, without requiring any user action.

Creating EventBridge Rules to listen to these events would enable us to create/destroy mirroring configuration at the moment that the state-change occurs rather than waiting for the next scheduled scan. The downside is that every AWS Service emits different events with different formats, so rules will need to be created for each scenario. We would add these Rules to the per-VPC CDK configuration, as the rules are inherently tied to a specific AWS Account/Region via the default EventBridge they listen to, and we want to enable an easy transition to multi-Region/multi-Account setups in the future.

Starting with high-value event types (such as EC2/ECS changes) seems reasonable, with additional event types added incrementally as they are identified.
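As an example of one such rule, the EC2 Auto Scaling lifecycle events use the documented `aws.autoscaling` source and detail-type strings below; the matcher is a toy subset of EventBridge's actual matching semantics, included only to show how a pattern selects events:

```python
# EventBridge rule pattern for EC2 Auto Scaling lifecycle events
# (source and detail-type values are the documented ones)
ASG_PATTERN = {
    "source": ["aws.autoscaling"],
    "detail-type": [
        "EC2 Instance Launch Successful",
        "EC2 Instance Terminate Successful",
    ],
}

def matches(pattern: dict, event: dict) -> bool:
    """Toy subset of EventBridge matching: every pattern key must appear
    in the event with a value listed in the pattern."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())
```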

Phase 3 Proposal

It is proposed that we automatically update our per-Subnet mirroring configuration using either scheduled scans for changes or emitted VPC events, similar to how we update per-ENI configuration. This likely means changing our per-Subnet configuration from being managed by CDK/CloudFormation to being managed by direct SDK invocations. Currently, a human needs to handle when subnets within a User VPC change (see [1]).

[1] #32

FAQS

Why use EventBridge instead of using SNS/SQS directly?

SNS/SQS are general queuing and notification solutions; EventBridge is specifically designed for handling AWS State changes. AWS services have out-of-the-box integration [1] into EventBridge to make it easy to take action when changes occur, and emit events following standardized schemas. For the example of a change in EC2 Autoscaling Capacity, the EC2 service already emits an event to EventBridge without any effort on our part [2]. If we used SNS/SQS, we'd have to detect the state change and emit the event ourselves.

[1] https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-service-event-list.html
[2] https://docs.aws.amazon.com/autoscaling/ec2/userguide/automating-ec2-auto-scaling-with-eventbridge.html

How can users initiate creation of mirroring resources themselves?

An example use-case might be that a user wants to ensure that their EC2 instances are having their traffic captured before instance spin-up is allowed to complete.

In this case, the user can emit our standardized event for creating per-ENI resources as part of their User Data Script (or ECS startup command) and wait for the expected entry to appear in our data store for it (currently, SSM Parameter Store).
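A sketch of that wait, with the data-store lookup injected as a callable so the polling logic stands alone (the parameter name, timeouts, and function names are illustrative):

```python
import time

def wait_for_mirroring(get_param, param_name: str,
                       timeout_s: float = 60.0, interval_s: float = 5.0):
    """Poll the data store (e.g. SSM Parameter Store) until the per-ENI
    metadata entry appears, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        value = get_param(param_name)
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"mirroring not confirmed for {param_name}")
        time.sleep(interval_s)

# In a real User Data Script, get_param would wrap
# boto3.client("ssm").get_parameter(...) and swallow ParameterNotFound.
```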

How will the system behave for longer-lived (EC2 instances) and shorter-lived (Lambda containers) resources?

The system is designed for longer-lived resources, such as EC2 instances and ECS containers, and will behave well for them starting in Phase 1, with capture beginning no later than 1 minute after the resource is created. Phase 2 probably brings the delay down to ~1 second.

Shorter-lived and ephemeral resources (such as AWS Lambda Functions) are tricky to deal with given the underlying constraints imposed by VPC Traffic Mirroring. The traffic of Lambda Functions is hypothetically mirrorable (though I have not tested this), and longer-lived Functions should be caught by the scheduled scans, provided the Function in question is executing inside the User VPC. Given current constraints, very short-lived resources would likely be best addressed by having the user manually trigger an event to set up mirroring for the resource before continuing with the compute operation, but this necessarily imposes additional latency on the operation they are trying to perform (it is unclear what that latency value is).

Ideally, VPC Traffic Mirroring would be improved in a manner that obviates the need for per-ENI configuration.

How will the system handle multiple, concurrent, and/or conflicting instructions?

The actual operations being performed at the per-ENI level appear fairly simple and reasonably easy to make idempotent. The actual resources/state for each ENI are as follows:

  • The Traffic Mirroring Session for the ENI
  • An AWS Systems Manager Parameter Store value containing some metadata about the ENI's setup

Multiple creation attempts for the same ENI would only ever create a Mirroring Session with the same configuration, and write the same metadata to the same Parameter Store key. Multiple deletions would only ever delete the same Mirroring Session and Parameter Store key. If later duplicates of the same operation fail, there shouldn't be user impact. As a result, the at-least-once delivery guaranteed by EventBridge to the Target Lambda functions should not cause problems.

EventBridge does not guarantee ordering for simultaneous events, so it's possible that a Create and Delete operation could have indeterminate ordering, but that should be resolved either way during the next scheduled scan of the User VPC (i.e. the Phase 1 deliverable).
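The idempotency argument above can be sketched as guard-before-write handlers; the store stands in for the SSM Parameter Store metadata, and all names are illustrative:

```python
def handle_create(store: dict, eni_id: str, create_session) -> str:
    """Idempotent create: a duplicate delivery of the same event is a no-op."""
    if eni_id in store:
        return store[eni_id]          # at-least-once redelivery: same result
    session_id = create_session(eni_id)
    store[eni_id] = session_id        # metadata write (SSM-like)
    return session_id

def handle_delete(store: dict, eni_id: str, delete_session) -> None:
    """Idempotent delete: deleting an already-removed ENI entry is a no-op."""
    session_id = store.pop(eni_id, None)
    if session_id is not None:
        delete_session(session_id)
```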

@chelma chelma added the Capture Resilience Work to make traffic capture more resilient to changes in load, configuration, and sources label Apr 26, 2023
@awick
Contributor

awick commented Apr 26, 2023

Overall seems like a good plan; a few comments:

  • Would like a sentence about why EventBus over SNS/SQS. I know EventBus supports more targets, but if we are using Lambda I don't think that matters.
  • In phase 1 I like that manage_arkime.py add/remove vpc will now just publish the event. It is unclear to me if the lambda function will actually be running CDK or if you are just going to be doing aws cli or boto script stuff?
  • phase 1 "There's also no apparent downside to beginning this segregation now." - debugging will be harder right? Since no longer a script on your computer.
  • phase 2 "We would add these Rules to the per-VPC CDK configuration" not sure what that means since I thought we weren't using CDK in the lambda function.

@chelma
Collaborator Author

chelma commented Apr 27, 2023

Thanks for the review! In order:

  • SNS/SQS are general queuing and notification solutions; EventBridge is specifically designed for handling AWS State changes. AWS services have out-of-the-box integration [1] into EventBridge to make it easy to take action when changes occur, and emit events following standardized schemas. For the example of a change in EC2 Autoscaling Capacity, the EC2 service already emits an event to EventBridge without any effort on our part [2]. If we used SNS/SQS, we'd have to detect the state change and emit the event ourselves. Will update the proposal above.
  • For the foreseeable future, I don't envision the Lambda Functions invoking CDK/CloudFormation. In short, CDK/CloudFormation is great for static collections of resources, when there aren't too many of them. If you need to have dynamic resource creation/deletion, especially if you need a lot of resources and they're not too complex, you should probably keep your own state and just use the SDK to create/delete them. We're using Lambda Functions here specifically to handle dynamic creation/deletion of large quantities of simple resources (Traffic Mirroring Sessions).
  • Ah, good point - debugging will be harder, though that's really a side effect of having the Lambdas do the work than segregating the Lambdas in a per-VPC manner. We might be able to do fancy things like pull logs from CloudWatch or something to make this a bit easier to follow.
  • So, add-vpc currently has two parts. We use a CDK invocation to spin up the per-VPC and per-Subnet AWS Resources and then, once those are in place, use the Python SDK to create the per-ENI AWS Resources. After Phase 1, add-vpc will still make a CDK invocation to create the per-VPC and per-Subnet AWS Resources, but will kick off an EventBridge Event to create the per-ENI AWS Resources rather than creating them directly. The phrase "We would add these Rules to the per-VPC CDK configuration" refers to bundling the EventBridge Rule Resources that will listen for the built-in AWS Service events (such as EC2 instance start/stop) on the default EventBridge Bus into the CloudFormation template we instantiate when we do an add-vpc.

Hopefully that all makes sense?

[1] https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-service-event-list.html
[2] https://docs.aws.amazon.com/autoscaling/ec2/userguide/automating-ec2-auto-scaling-with-eventbridge.html

@chelma
Collaborator Author

chelma commented Apr 27, 2023

Updated the FAQ to address the question about using SNS/SQS instead

@awick
Contributor

awick commented Apr 27, 2023

Ok I think I understand.

The only other feedback is priority. Unless it's relatively "free", I think the Phase 1 item "Additionally, we will have scheduled events fire every minute to continuously scan for changes in the ENIs and trigger the add/remove lambdas" should come AFTER the Phase 2 watching of ASG events. While both are important, I would rather see the ASG event first since that is the more common use case, while the scan seems to be useful if we miss events or the customer changes things in the console. Thoughts?

@chelma
Collaborator Author

chelma commented Apr 27, 2023

The auto-firing is (hypothetically) extremely easy to set up; should just be a few lines of CDK once the rest of phase 1 is in place.

Additionally and philosophically, I'd argue it's actually higher priority than listening for specific event types because it serves as a backstop that hypothetically functions for all valid traffic sources. If we never implemented Phase 2, the automated scan would provide basic (if not ideal) capability for all resource types, while if the reverse were true we'd only have support for a few specific resource types, and wouldn't have an automated way to "catch" failures if something wacky happened to the configuration process for a specific ENI (i.e. listening for the AWS Service Events is an edge-triggered system). Obviously, we could begin listening to the dead letter queue and actioning failed AWS Service Events... but the automated scan gives us an equivalent capability that also works for all resource types.

@awick
Contributor

awick commented Apr 27, 2023

Ya, doing the scheduling part is easy; I didn't know if you were already going to be doing the scanning part. Since it won't be CDK anymore I'm assuming there is a lot of code to write for the scanning? Maybe I still don't understand the design.

If we never implement phase 2 we have failed.

My philosophy is opposite, if you implement the scan first you might miss cases that the scan fixes for you and you might never do phase 2.

For a MVP phase 2 is much more important. If I go to a customer and say you will have to wait on avg 30 seconds for each event before we detect it, we've failed.

@chelma
Collaborator Author

chelma commented Apr 27, 2023

Fair enough, though it's probably all academic because we'll do both Phase 1 and Phase 2. We're already doing the scan part of this in our CLI's client-side code; I'm just going to move it to a Lambda.

@awick
Contributor

awick commented Apr 27, 2023

ah ok! awesome :)

@chelma
Collaborator Author

chelma commented Apr 27, 2023

You may find this follow-up task for the work helpful to understand what's being proposed: #36
