
Conversation

@emilyhuaa

Issue #, if available:
aws/amazon-network-policy-controller-k8s#201

Description of changes:
When a NetworkPolicy with many endpoints is chunked into multiple PolicyEndpoint resources, the agent was incorrectly applying default deny during pod startup. This occurred because isolation was checked per PolicyEndpoint chunk rather than after aggregating rules from all chunks.

For example, a NetworkPolicy with both ingress and egress rules might be split into:

  • Chunk 1: podIsolation=[Ingress,Egress], only ingress rules
  • Chunk 2: podIsolation=[Ingress,Egress], only egress rules

The agent would process Chunk 1, see Egress in podIsolation but len(egressRules)==0, and incorrectly enable default deny on egress until Chunk 2 was processed.

Fix: Move the isolation check outside the PolicyEndpoint aggregation loop so it evaluates the total aggregated rules rather than individual chunks (see the sketch below).
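
A minimal, self-contained sketch of the before/after described above. It is illustrative only: the Rule and PolicyEndpoint types, the chunks slice, and the deriveDefaultPodIsolation signature are simplified stand-ins, not the agent's actual code.

```go
package main

import "fmt"

// Simplified stand-ins for the agent's types (illustrative only).
type Rule struct{ Port int }

type PolicyEndpoint struct {
	PodIsolation []string // e.g. ["Ingress", "Egress"]
	IngressRules []Rule
	EgressRules  []Rule
}

// deriveDefaultPodIsolation mirrors the behavior discussed in this PR: it logs
// and flags default deny for a direction that is isolated but has zero rules.
// The real helper's signature may differ.
func deriveDefaultPodIsolation(podIsolation []string, ingressCount, egressCount int) (bool, bool) {
	ingressIsolated, egressIsolated := false, false
	for _, iso := range podIsolation {
		if iso == "Ingress" && ingressCount == 0 {
			fmt.Println("Default Deny enabled on Ingress")
			ingressIsolated = true
		}
		if iso == "Egress" && egressCount == 0 {
			fmt.Println("Default Deny enabled on Egress")
			egressIsolated = true
		}
	}
	return ingressIsolated, egressIsolated
}

func main() {
	// Two chunks from the same NetworkPolicy, as in the example above.
	chunks := []PolicyEndpoint{
		{PodIsolation: []string{"Ingress", "Egress"}, IngressRules: []Rule{{Port: 80}}},
		{PodIsolation: []string{"Ingress", "Egress"}, EgressRules: []Rule{{Port: 443}}},
	}

	var ingressRules, egressRules []Rule
	var podIsolation []string
	for _, pe := range chunks {
		// Aggregate rules from every chunk first. The old code called the
		// isolation check here, so after chunk 1 (egress count still 0) it
		// logged "Default Deny enabled on Egress" prematurely.
		ingressRules = append(ingressRules, pe.IngressRules...)
		egressRules = append(egressRules, pe.EgressRules...)
		podIsolation = pe.PodIsolation
	}

	// Fixed: decide (and log) isolation once, from the aggregated totals.
	ingIsolated, egIsolated := deriveDefaultPodIsolation(podIsolation, len(ingressRules), len(egressRules))
	fmt.Printf("ingressIsolated=%v egressIsolated=%v\n", ingIsolated, egIsolated) // false false
}
```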

Testing:
Tested on an EKS cluster with NP_CONTROLLER_ENDPOINT_CHUNK_SIZE=50. Created chunked PolicyEndpoints where one chunk had podIsolation=[Ingress,Egress] but only ingress rules. Confirmed the bug by observing the "Default Deny enabled on Egress" log message. After applying the fix, the same scenario no longer produces the incorrect default deny message, and egress traffic is properly allowed.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@emilyhuaa requested a review from a team as a code owner on November 24, 2025 at 23:42

Inline review comment on this part of the aggregation loop:

```go
lastPE = currentPE
}
log().Infof("Total no.of - ingressRules %d egressRules %d", len(ingressRules), len(egressRules))
// Check isolation after aggregating all rules from all PolicyEndpoints
```

Contributor:

What issue was this creating?

The final decision is only made once we iterate over all the PEs for a PodIdentifier. It seems the current behavior of doing isIngressIsolated = isIngressIsolated || ingressIsolated and this updated change both result in the same final behavior.

Author:

The issue was the log message. deriveDefaultPodIsolation() logs "Default Deny enabled on Egress" when called with egressRulesCount=0 inside the loop, even though other chunks have egress rules. The final boolean values are correct due to the || aggregation, but the premature log message was confusing to customers running in standard mode, who don't expect default deny for policies with explicit rules.
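
For concreteness, with the two chunks from the description (counts illustrative): after chunk 1 is aggregated, the running counts are ingressRulesCount>0 and egressRulesCount=0, so the in-loop call logs "Default Deny enabled on Egress" even though chunk 2, not yet aggregated, carries the egress rules; once chunk 2 is aggregated, both counts are non-zero. With the fix, the helper runs once with the final totals and no premature message is emitted.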

Contributor:

Got it, yeah, the log message fix makes sense. The CR title "Fix incorrect default deny when NetworkPolicy is chunked" threw me off; probably edit it to "Fix incorrect Default Deny log when NP is chunked".

Contributor:

The actual issue could be when the other chunks haven't made it to the agent yet (because they are still being processed by the NPC); in that case, Default Deny will still show up.

Author:

You're right, this fixes the case where all chunks already exist and are being aggregated by the agent. Would it be better to check whether the PolicyEndpoint name indicates it's part of a chunked set (e.g. contains a chunk suffix) and suppress the log message for chunked policies? Or is the sequential-arrival case acceptable, since it's transient during controller processing?

Contributor:

Today there is no way to check whether a PolicyEndpoint has any siblings, either directly from the same NetworkPolicy just by looking at its name, or indirectly from other NetworkPolicies.

Detecting a chunk set (based on some deterministic chunk naming) makes sense from a single-NetworkPolicy perspective, but pods can be targeted by multiple NetworkPolicies, and in this loop we iterate over all PolicyEndpoints belonging to all NetworkPolicies targeting that pod before making the isolation decision.

@emilyhuaa changed the title from "Fix incorrect default deny when NetworkPolicy is chunked" to "Fix incorrect default deny log when NetworkPolicy is chunked" on Nov 25, 2025

@oliviassss (Contributor):

LGTM. Just to confirm: this only happens in standard mode, and it's just an annoying log that may mislead users during startup? Since the agent should only act after processing all relevant PEs, right?
