
Conversation

jotak
Member

@jotak jotak commented Sep 5, 2025

Description

This is an intermediate alternative between the Kafka mode and the Direct mode, better suited to a quick install on large clusters (Kafka mode requires a more complex setup, whereas Direct mode isn't suitable on large clusters because of FLP's memory consumption).

To use it, set deploymentModel to Service.
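For illustration, a minimal FlowCollector sketch enabling this mode could look as follows (the apiVersion and metadata.name follow the usual NetObserv conventions and are assumptions, not taken from this PR):

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  # New deployment model introduced by this PR, alongside Direct and Kafka:
  # FLP runs behind a Service instead of per-node or behind Kafka.
  deploymentModel: Service
```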

There's a caveat with conversation tracking because of the absence of sticky sessions: since the agents don't always talk to the same FLP instance, this feature, which requires statefulness, can't work reliably. For that reason, enabling conntrack results in an error from the validation webhook.

I'd also like to deprecate the embedded HPA configuration and replace it with a simple flag telling whether the number of replicas has to be enforced. That way, users can set up their own HPA as they see fit, and we no longer need to deal with the HPA API.
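With replicas left unenforced, users could then pair FLP with a standard autoscaling/v2 HorizontalPodAutoscaler of their own; a sketch of what that might look like (the target Deployment name `flowlogs-pipeline` and the `netobserv` namespace are assumptions, not confirmed by this PR):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flp
  namespace: netobserv
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flowlogs-pipeline  # assumed FLP Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory           # scale on memory, FLP's main constraint
      target:
        type: Utilization
        averageUtilization: 80
```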

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).


openshift-ci bot commented Sep 5, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Sep 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign oliviercazade for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jotak jotak changed the title Deploy FLP as a service NETOBSERV-2247: Deploy FLP as a service Sep 9, 2025
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Sep 9, 2025

@jotak: This pull request references NETOBSERV-2247 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target either version "4.21." or "openshift-4.21.", but it targets "netobserv-1.11" instead.

In response to this:

Description

This is an intermediate alternative between the Kafka mode and the Direct mode, better suited to a quick install on large clusters (Kafka mode requires a more complex setup, whereas Direct mode isn't suitable on large clusters because of FLP's memory consumption).

To use it, set deploymentModel to Service.

There are potential caveats to check:

  • Without sticky sessions, there is no guarantee that the agents always talk to the same FLP instance. I don't think it's an issue in the nominal case, but it might be a problem for conversation tracking?

I'd also like to deprecate the embedded HPA configuration and replace it with a simple flag telling whether the number of replicas has to be enforced. That way, users can set up their own HPA as they see fit, and we no longer need to deal with the HPA API.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@OlivierCazade OlivierCazade left a comment


One small comment; the PR is mostly LGTM.

I really like your approach to HPA and enforcing or not the number of replicas.

About connection tracking, if I remember correctly, we need both flows captured from the source and the destination to go to the same FLP pod. If it is the case, this will break it.

@jotak jotak changed the title NETOBSERV-2247: Deploy FLP as a service NETOBSERV-2429: Deploy FLP as a service Sep 30, 2025
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Sep 30, 2025

@jotak: This pull request references NETOBSERV-2429 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Description

This is an intermediate alternative between the Kafka mode and the Direct mode, better suited to a quick install on large clusters (Kafka mode requires a more complex setup, whereas Direct mode isn't suitable on large clusters because of FLP's memory consumption).

To use it, set deploymentModel to Service.

There are potential caveats to check:

  • Without sticky sessions, there is no guarantee that the agents always talk to the same FLP instance. I don't think it's an issue in the nominal case, but it might be a problem for conversation tracking?

I'd also like to deprecate the embedded HPA configuration and replace it with a simple flag telling whether the number of replicas has to be enforced. That way, users can set up their own HPA as they see fit, and we no longer need to deal with the HPA API.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


codecov bot commented Sep 30, 2025

Codecov Report

❌ Patch coverage is 51.80723% with 120 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.12%. Comparing base (69b2475) to head (4bb705a).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
internal/controller/flp/flp_monolith_objects.go 8.47% 54 Missing ⚠️
internal/controller/flp/flp_monolith_reconciler.go 27.02% 26 Missing and 1 partial ⚠️
internal/controller/networkpolicy/np_objects.go 78.78% 10 Missing and 4 partials ⚠️
internal/controller/ebpf/agent_controller.go 60.00% 9 Missing and 1 partial ⚠️
api/flowcollector/v1beta2/helper.go 57.14% 8 Missing and 1 partial ⚠️
internal/controller/flp/flp_common_objects.go 45.45% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1953      +/-   ##
==========================================
- Coverage   71.72%   71.12%   -0.61%     
==========================================
  Files          80       80              
  Lines       10723    10787      +64     
==========================================
- Hits         7691     7672      -19     
- Misses       2626     2704      +78     
- Partials      406      411       +5     
Flag Coverage Δ
unittests 71.12% <51.80%> (-0.61%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
api/flowcollector/v1beta2/flowcollector_types.go 100.00% <ø> (ø)
...lector/v1beta2/flowcollector_validation_webhook.go 75.70% <100.00%> (+0.31%) ⬆️
api/flowcollector/v1beta2/zz_generated.deepcopy.go 38.03% <100.00%> (+0.30%) ⬆️
...ntroller/consoleplugin/consoleplugin_reconciler.go 68.78% <100.00%> (-0.18%) ⬇️
internal/controller/flp/flp_transfo_objects.go 86.46% <100.00%> (ø)
internal/controller/flp/flp_transfo_reconciler.go 58.38% <100.00%> (-0.26%) ⬇️
internal/controller/reconcilers/reconcilers.go 76.47% <100.00%> (+2.47%) ⬆️
internal/pkg/helper/comparators.go 76.52% <100.00%> (ø)
internal/pkg/helper/flowcollector.go 73.23% <100.00%> (+2.75%) ⬆️
internal/controller/flp/flp_common_objects.go 89.13% <45.45%> (-1.67%) ⬇️
... and 5 more

... and 3 files with indirect coverage changes


@openshift-ci-robot
Collaborator

openshift-ci-robot commented Sep 30, 2025

@jotak: This pull request references NETOBSERV-2429 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Description

This is an intermediate alternative between the Kafka mode and the Direct mode, better suited to a quick install on large clusters (Kafka mode requires a more complex setup, whereas Direct mode isn't suitable on large clusters because of FLP's memory consumption).

To use it, set deploymentModel to Service.

There's a caveat with conversation tracking because of the absence of sticky sessions: since the agents don't always talk to the same FLP instance, this feature, which requires statefulness, can't work reliably. For that reason, enabling conntrack results in an error from the validation webhook.

I'd also like to deprecate the embedded HPA configuration and replace it with a simple flag telling whether the number of replicas has to be enforced. That way, users can set up their own HPA as they see fit, and we no longer need to deal with the HPA API.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak jotak marked this pull request as ready for review September 30, 2025 12:40
@jotak
Member Author

jotak commented Sep 30, 2025

One small comment; the PR is mostly LGTM.

I really like your approach to HPA and enforcing or not the number of replicas.

🎉

About connection tracking, if I remember correctly, we need both flows captured from the source and the destination to go to the same FLP pod. If it is the case, this will break it.

I added a validation check to forbid this combination.

@jotak jotak added the ok-to-test label Sep 30, 2025

New images:

  • quay.io/netobserv/network-observability-operator:1ef2f81
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-1ef2f81
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-1ef2f81

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:1ef2f81 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-1ef2f81

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-1ef2f81
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@github-actions github-actions bot removed the ok-to-test label Sep 30, 2025
@jotak jotak force-pushed the service-deployment branch from 8beb2d8 to 3536cf8 Compare October 8, 2025 14:32
jotak added 6 commits October 9, 2025 14:25
This is an intermediate alternative between the Kafka mode and the
Direct mode, that is more suitable for quick install on large clusters
(Kafka mode is a more complex setup, whereas Direct mode isn't suitable
on large clusters due to the memory consumption of FLP)

To use it, set `deploymentModel` to `Service`.

There are potential caveats to check:
- Without sticky session, no guarantee that the agents talk to the same
  FLP instance. I don't think it's an issue in the nominal case, but
might be a problem for conversation tracking?
Replicas were being reconciled despite being unmanaged, when something
else was triggering the reconcile event
@jotak jotak force-pushed the service-deployment branch from 3536cf8 to 1d83851 Compare October 10, 2025 07:09
@jotak jotak added the ok-to-test label Oct 10, 2025

New images:

  • quay.io/netobserv/network-observability-operator:2fb9bd2
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-2fb9bd2
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-2fb9bd2

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:2fb9bd2 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-2fb9bd2

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-2fb9bd2
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

- New rule, for Service deploymentModel, allow port 2055/hostnetwork
- Remove most rules regarding communication with agents: since they use
  hostNetwork, there is no need to allow traffic between the main and
  privileged namespaces.
  As a result, the netpol in the privileged namespace denies pretty much
  everything (except the Prometheus scraper)
- Traffic with openshift-console only needed for ingress, not egress
- Traffic to the apiserver restricted to namespaces openshift-apiserver and
  openshift-kube-apiserver (note that an OVN bug requires setting an empty
  pod selector to make it work, see https://issues.redhat.com/browse/OSDOCS-14395)
- Rule to apiserver was duplicated, removed one
- Remove the rule to loki when it's installed in netobserv namespace
- Optimize rules by merging similar rules affecting several namespaces
@github-actions github-actions bot removed the ok-to-test label Oct 10, 2025
@jotak
Member Author

jotak commented Oct 10, 2025

@OlivierCazade in the last commit 3b71182 I took the opportunity to further restrict the netpol rules; I found a couple of areas to harden, see the commit description for the details

Here are the policies we get as a result (here with Loki in the same namespace):

Main ns

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netobserv
  namespace: netobserv
spec:
  egress:
  - to:
    - podSelector: {} #inside traffic
  - ports:
    - port: 6443 # API server
      protocol: TCP
    to:
    - namespaceSelector:
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - openshift-apiserver
          - openshift-kube-apiserver
      podSelector: {}
  - to:
    - namespaceSelector: # other namespaces
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - openshift-dns
          - openshift-monitoring
      podSelector: {}
  ingress:
  - from:
    - podSelector: {} # inside traffic
  - from:
    - namespaceSelector: # openshift-console/9001
        matchLabels:
          kubernetes.io/metadata.name: openshift-console
    ports:
    - port: 9001
      protocol: TCP
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/host-network: ""
    ports:
    - port: 9443 # operator webhook
      protocol: TCP
    - port: 2055 # FLP collector (only needed with the new Service deployment model)
      protocol: TCP
  - from:
    - namespaceSelector: # other namespaces
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - openshift-monitoring
      podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Agent ns

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netobserv
  namespace: netobserv-privileged
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - openshift-monitoring # prometheus scraper
      podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

@jotak
Member Author

jotak commented Oct 10, 2025

(I may split this in a different PR if needed)
