
Conversation

jotak
Member

@jotak jotak commented Sep 5, 2025

Description

This is an intermediate alternative between the Kafka mode and the Direct mode, better suited to a quick install on large clusters (Kafka mode requires a more complex setup, whereas Direct mode isn't suitable on large clusters because of FLP's memory consumption).

To use it, set deploymentModel to Service.
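For illustration, a minimal FlowCollector sketch enabling this mode could look as follows (the apiVersion and metadata.name follow the usual NetObserv conventions and are assumptions, not taken from this PR):

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  # New deployment model introduced by this PR, alongside Direct and Kafka:
  # FLP runs behind a Service instead of per-node or behind Kafka.
  deploymentModel: Service
```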

There's a caveat with conversation tracking because of the absence of sticky sessions: since the agents don't always talk to the same FLP instance, this feature, which requires statefulness, can't work reliably. For that reason, enabling conntrack results in an error from the validation webhook.

I'd also like to deprecate the embedded HPA configuration and replace it with a simple flag telling whether the number of replicas has to be enforced. That way, users can set up their own HPA as they see fit, and we no longer need to deal with the HPA API.
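With replicas left unenforced, users could then pair FLP with a standard autoscaling/v2 HorizontalPodAutoscaler of their own; a sketch of what that might look like (the target Deployment name `flowlogs-pipeline` and the `netobserv` namespace are assumptions, not confirmed by this PR):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flp
  namespace: netobserv
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flowlogs-pipeline  # assumed FLP Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory           # scale on memory, FLP's main constraint
      target:
        type: Utilization
        averageUtilization: 80
```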

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).


openshift-ci bot commented Sep 5, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Sep 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign oliviercazade for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jotak jotak changed the title Deploy FLP as a service NETOBSERV-2247: Deploy FLP as a service Sep 9, 2025
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Sep 9, 2025

@jotak: This pull request references NETOBSERV-2247 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target either version "4.21." or "openshift-4.21.", but it targets "netobserv-1.11" instead.

In response to this:

Description

This is an intermediate alternative between the Kafka mode and the Direct mode, better suited to a quick install on large clusters (Kafka mode requires a more complex setup, whereas Direct mode isn't suitable on large clusters because of FLP's memory consumption).

To use it, set deploymentModel to Service.

There are potential caveats to check:

  • Without sticky sessions, there is no guarantee that the agents always talk to the same FLP instance. I don't think it's an issue in the nominal case, but it might be a problem for conversation tracking?

I'd also like to deprecate the embedded HPA configuration and replace it with a simple flag telling whether the number of replicas has to be enforced. That way, users can set up their own HPA as they see fit, and we no longer need to deal with the HPA API.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@OlivierCazade OlivierCazade left a comment


One small comment; the PR is mostly LGTM.

I really like your approach to HPA and enforcing or not the number of replicas.

About connection tracking, if I remember correctly, we need both flows captured from the source and the destination to go to the same FLP pod. If it is the case, this will break it.

@jotak jotak changed the title NETOBSERV-2247: Deploy FLP as a service NETOBSERV-2429: Deploy FLP as a service Sep 30, 2025
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Sep 30, 2025

@jotak: This pull request references NETOBSERV-2429 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Description

This is an intermediate alternative between the Kafka mode and the Direct mode, better suited to a quick install on large clusters (Kafka mode requires a more complex setup, whereas Direct mode isn't suitable on large clusters because of FLP's memory consumption).

To use it, set deploymentModel to Service.

There are potential caveats to check:

  • Without sticky sessions, there is no guarantee that the agents always talk to the same FLP instance. I don't think it's an issue in the nominal case, but it might be a problem for conversation tracking?

I'd also like to deprecate the embedded HPA configuration and replace it with a simple flag telling whether the number of replicas has to be enforced. That way, users can set up their own HPA as they see fit, and we no longer need to deal with the HPA API.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


codecov bot commented Sep 30, 2025

Codecov Report

❌ Patch coverage is 51.80723% with 120 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.12%. Comparing base (69b2475) to head (4bb705a).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
internal/controller/flp/flp_monolith_objects.go 8.47% 54 Missing ⚠️
internal/controller/flp/flp_monolith_reconciler.go 27.02% 26 Missing and 1 partial ⚠️
internal/controller/networkpolicy/np_objects.go 78.78% 10 Missing and 4 partials ⚠️
internal/controller/ebpf/agent_controller.go 60.00% 9 Missing and 1 partial ⚠️
api/flowcollector/v1beta2/helper.go 57.14% 8 Missing and 1 partial ⚠️
internal/controller/flp/flp_common_objects.go 45.45% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1953      +/-   ##
==========================================
- Coverage   71.72%   71.12%   -0.61%     
==========================================
  Files          80       80              
  Lines       10723    10787      +64     
==========================================
- Hits         7691     7672      -19     
- Misses       2626     2704      +78     
- Partials      406      411       +5     
Flag Coverage Δ
unittests 71.12% <51.80%> (-0.61%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
api/flowcollector/v1beta2/flowcollector_types.go 100.00% <ø> (ø)
...lector/v1beta2/flowcollector_validation_webhook.go 75.70% <100.00%> (+0.31%) ⬆️
api/flowcollector/v1beta2/zz_generated.deepcopy.go 38.03% <100.00%> (+0.30%) ⬆️
...ntroller/consoleplugin/consoleplugin_reconciler.go 68.78% <100.00%> (-0.18%) ⬇️
internal/controller/flp/flp_transfo_objects.go 86.46% <100.00%> (ø)
internal/controller/flp/flp_transfo_reconciler.go 58.38% <100.00%> (-0.26%) ⬇️
internal/controller/reconcilers/reconcilers.go 76.47% <100.00%> (+2.47%) ⬆️
internal/pkg/helper/comparators.go 76.52% <100.00%> (ø)
internal/pkg/helper/flowcollector.go 73.23% <100.00%> (+2.75%) ⬆️
internal/controller/flp/flp_common_objects.go 89.13% <45.45%> (-1.67%) ⬇️
... and 5 more

... and 3 files with indirect coverage changes


@openshift-ci-robot
Collaborator

openshift-ci-robot commented Sep 30, 2025

@jotak: This pull request references NETOBSERV-2429 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Description

This is an intermediate alternative between the Kafka mode and the Direct mode, better suited to a quick install on large clusters (Kafka mode requires a more complex setup, whereas Direct mode isn't suitable on large clusters because of FLP's memory consumption).

To use it, set deploymentModel to Service.

There's a caveat with conversation tracking because of the absence of sticky sessions: since the agents don't always talk to the same FLP instance, this feature, which requires statefulness, can't work reliably. For that reason, enabling conntrack results in an error from the validation webhook.

I'd also like to deprecate the embedded HPA configuration and replace it with a simple flag telling whether the number of replicas has to be enforced. That way, users can set up their own HPA as they see fit, and we no longer need to deal with the HPA API.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak jotak marked this pull request as ready for review September 30, 2025 12:40
@jotak
Member Author

jotak commented Sep 30, 2025

One small comment; the PR is mostly LGTM.

I really like your approach to HPA and enforcing or not the number of replicas.

🎉

About connection tracking, if I remember correctly, we need both flows captured from the source and the destination to go to the same FLP pod. If it is the case, this will break it.

I added a validation check to forbid this combination.

@jotak jotak added the ok-to-test label Sep 30, 2025

New images:

  • quay.io/netobserv/network-observability-operator:1ef2f81
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-1ef2f81
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-1ef2f81

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:1ef2f81 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-1ef2f81

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-1ef2f81
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@github-actions github-actions bot removed the ok-to-test label Sep 30, 2025
@jotak jotak force-pushed the service-deployment branch from 8beb2d8 to 3536cf8 Compare October 8, 2025 14:32
jotak added 6 commits October 9, 2025 14:25
This is an intermediate alternative between the Kafka mode and the
Direct mode, that is more suitable for quick install on large clusters
(Kafka mode is a more complex setup, whereas Direct mode isn't suitable
on large clusters due to the memory consumption of FLP)

To use it, set `deploymentModel` to `Service`.

There are potential caveats to check:
- Without sticky session, no guarantee that the agents talk to the same
  FLP instance. I don't think it's an issue in the nominal case, but
might be a problem for conversation tracking?
Replicas were being reconciled despite being unmanaged, when something
else was triggering the reconcile event
@jotak jotak force-pushed the service-deployment branch from 3536cf8 to 1d83851 Compare October 10, 2025 07:09
@jotak jotak added the ok-to-test label Oct 10, 2025

New images:

  • quay.io/netobserv/network-observability-operator:2fb9bd2
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-2fb9bd2
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-2fb9bd2

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:2fb9bd2 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-2fb9bd2

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-2fb9bd2
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

- New rule, for Service deploymentModel, allow port 2055/hostnetwork
- Remove most rules regarding communication with agents: since they use
  hostNetwork, there is no need to allow traffic between the main and
  privileged namespaces.
  As a result, the netpol in the privileged namespace denies pretty much
  everything (except the Prometheus scraper)
- Traffic with openshift-console only needed for ingress, not egress
- Traffic to the apiserver restricted to namespaces openshift-apiserver and
  openshift-kube-apiserver (note that an OVN bug requires setting an empty
  pod selector to make it work, see https://issues.redhat.com/browse/OSDOCS-14395)
- Rule to apiserver was duplicated, removed one
- Remove the rule to loki when it's installed in netobserv namespace
- Optimize rules by merging similar rules affecting several namespaces
@github-actions github-actions bot removed the ok-to-test label Oct 10, 2025
@jotak
Member Author

jotak commented Oct 10, 2025

@OlivierCazade in the last commit 3b71182 I took the opportunity to further restrict the netpol rules; I found a couple of areas to harden, see the commit description for the details

Here are the policies we get as a result (here with Loki in the same namespace):

Main ns

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netobserv
  namespace: netobserv
spec:
  egress:
  - to:
    - podSelector: {} #inside traffic
  - ports:
    - port: 6443 # API server
      protocol: TCP
    to:
    - namespaceSelector:
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - openshift-apiserver
          - openshift-kube-apiserver
      podSelector: {}
  - to:
    - namespaceSelector: # other namespaces
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - openshift-dns
          - openshift-monitoring
      podSelector: {}
  ingress:
  - from:
    - podSelector: {} # inside traffic
  - from:
    - namespaceSelector: # openshift-console/9001
        matchLabels:
          kubernetes.io/metadata.name: openshift-console
    ports:
    - port: 9001
      protocol: TCP
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/host-network: ""
    ports:
    - port: 9443 # operator webhook
      protocol: TCP
    - port: 2055 # FLP collector (only needed with the new Service deployment model)
      protocol: TCP
  - from:
    - namespaceSelector: # other namespaces
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - openshift-monitoring
      podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Agent ns

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netobserv
  namespace: netobserv-privileged
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - openshift-monitoring # prometheus scraper
      podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

@jotak
Member Author

jotak commented Oct 10, 2025

(I may split this in a different PR if needed)
