Conversation

@mjiao mjiao commented Jul 7, 2025

No description provided.

@mjiao mjiao force-pushed the rosa-validation-eic831 branch 4 times, most recently from 231931e to e10daaf Compare July 8, 2025 14:11
@mjiao mjiao changed the title [WIP] ARO validation automation [WIP] ARO/ROSA validation automation Jul 10, 2025
@mjiao mjiao changed the title [WIP] ARO/ROSA validation automation [WIP] ARO validation automation Jul 11, 2025
@mjiao
Copy link
Contributor Author

mjiao commented Jul 11, 2025

[Screenshot attached: 2025-07-12 00:52]

@mjiao mjiao force-pushed the rosa-validation-eic831 branch from 9b42e7e to afd5d7e Compare July 11, 2025 22:56
@mjiao mjiao changed the title [WIP] ARO validation automation ARO validation automation Jul 11, 2025
@mjiao mjiao force-pushed the rosa-validation-eic831 branch from f5554be to d0c05bf Compare July 14, 2025 08:41
@mjiao mjiao requested review from RishabhKodes and kksat July 14, 2025 08:42
@mjiao mjiao force-pushed the rosa-validation-eic831 branch from d0c05bf to 17abde5 Compare July 16, 2025 11:52
- name: publicDNS
value: "false"
- name: jiraSecretName
value: "jira-secret-mj"
- name: jiraIssueKey
value: "SAPOCP-1587"
Collaborator

I do not quite understand this: do we plan to update these values for each new Jira ticket? Maybe create a copy instead, so each Jira ticket gets its own workflow that we can run later and independently?
We certainly can and should delete workflows for outdated or no-longer-supported versions and configurations.

Contributor Author

Thank you for the great question! Let me clarify our workflow strategy for JIRA ticket management.

Our Current Approach:
We create a new JIRA ticket for each validation task, with approximately 3-4 validation tasks per month on average. Since our clusters (especially ARO and ROSA) are quite dynamic, we typically create new clusters for each validation cycle - particularly for cloud-based ones where we focus on testing major releases.

Workflow Management Strategy:

  • New tickets = New workflows: Each JIRA ticket gets its own workflow file (like the current aro-endpoint-test-run.yaml for SAPOCP-1590)
  • No parallel execution: We don't run old workflows alongside new ones - each validation task is independent
  • Proactive cleanup: We clean up workflow files once the cluster is deleted or the validation task is completed

Why This Works for Us:
This approach aligns well with our validation cycle where we're constantly testing new configurations and major releases. The workflow files serve as a snapshot of what was tested for each specific ticket, and we maintain a clean repository by removing outdated workflows.

Your suggestion about creating separate workflow files for each ticket is exactly what we're doing! The current file structure will evolve as we create new validation tasks, and we'll maintain a clean slate by removing completed workflows.

* Kubeconfig and service access configured ✅
* All connectivity tests passed ✅
Ready for manual teardown approval. The pipeline will proceed with infrastructure cleanup once approved.
Collaborator

Why a manual teardown? Are we expected to do some manual steps before the teardown?

Contributor Author

The manual approval before teardown serves two purposes in our validation workflow.

  1. EIC Uninstallation Testing: Similar to the installation process, we need to test the uninstallation of EIC through the web interface. Since EIC doesn't support API-based uninstallation yet, this requires manual steps that need to be performed before the cluster is torn down.

  2. Demo/Reference Scenarios: Sometimes we may want to keep the cluster running for demo purposes or as a reference environment. The manual approval step gives us the flexibility to decide whether to proceed with teardown immediately or keep the infrastructure running for a longer period.

Once the manual steps are completed and approved, the pipeline proceeds with the automated cluster teardown.
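A minimal sketch of how such a gate can sit between validation and teardown, assuming the openshift-pipelines manual-approval-gate custom task; the task names, apiVersion, and params below are assumptions, not necessarily what this PR uses:

```yaml
tasks:
  - name: aro-validate
    taskRef:
      name: aro-validate-task
  - name: wait-for-teardown-approval
    runAfter: [aro-validate]
    taskRef:
      apiVersion: openshift-pipelines.org/v1alpha1  # custom task, assumed
      kind: ApprovalTask
    params:
      - name: approvers
        value: ["mjiao"]  # illustrative approver
  - name: aro-teardown
    runAfter: [wait-for-teardown-approval]
    taskRef:
      name: aro-teardown-task
```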

@mjiao mjiao force-pushed the rosa-validation-eic831 branch 10 times, most recently from 84c05b7 to 96572aa Compare August 11, 2025 20:25

mjiao commented Aug 11, 2025

[Screenshot attached]

@mjiao mjiao requested a review from kksat August 14, 2025 14:34
@mjiao mjiao force-pushed the rosa-validation-eic831 branch 4 times, most recently from 53b3cd5 to e67583d Compare August 28, 2025 14:11
mjiao added 23 commits October 24, 2025 11:25
Configure explicit one-week timeout to override default 1-hour limit
and prevent pipeline failures during long-running operations.

This gives sufficient time for ARO deployment, manual approvals,
endpoint testing, and teardown operations.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
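In Tekton v1 syntax, such a configuration looks roughly like this; resource names are placeholders, and the 168h/120m values follow the timeouts described later in this commit history:

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: aro-validation-run  # placeholder
spec:
  pipelineRef:
    name: aro-validation-pipeline  # placeholder
  timeouts:
    pipeline: 168h  # one week, overriding the 1h default
    tasks: 120m     # cumulative budget for the pipeline's tasks
```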

Fix PostgreSQL and Redis deletion commands in ARO teardown task

Remove unsupported --no-wait flag from az postgres flexible-server delete
and az redis delete commands to prevent teardown failures.

The --no-wait flag is not supported by these specific Azure CLI commands
and was causing 'unrecognized arguments: --no-wait' errors during cleanup.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
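The corrected invocations presumably look like the following; both subcommands accept --yes to skip the confirmation prompt, which keeps the task non-interactive (server and cache names are placeholders):

```sh
az postgres flexible-server delete \
  --name "$POSTGRES_SERVER_NAME" \
  --resource-group "$ARO_RESOURCE_GROUP" \
  --yes
az redis delete \
  --name "$REDIS_CACHE_NAME" \
  --resource-group "$ARO_RESOURCE_GROUP" \
  --yes
```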

enhance: add explicit PostgreSQL and Redis cleanup to teardown

- Add explicit PostgreSQL flexible server deletion
- Add explicit Redis cache deletion with proper name matching
- Keep existing generic resource cleanup as fallback
- Ensure all Azure services are properly cleaned up

This makes the teardown more robust and explicit about
cleaning up PostgreSQL and Redis services, while maintaining
the existing cleanup logic for other ARO-related resources.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: handle null tags in Azure resource cleanup query

- Add null checks for tags and tags.cluster before using contains()
- Fixes 'Invalid jmespath query' error in teardown task
- Query now safely handles resources without tags or cluster tags

The issue was that some Azure resources don't have tags or
have null tags.cluster values, causing the contains() function
to fail. Now we check for existence before using contains().

Signed-off-by: mjiao <manjun.jiao@gmail.com>
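One null-safe formulation relies on JMESPath's short-circuiting && rather than guarding with explicit null literals; the resource group variable and tag value here are illustrative:

```sh
# tags.cluster is null for untagged resources; && short-circuits on null,
# so contains() only runs when the tag actually exists.
az resource list \
  --resource-group "$ARO_RESOURCE_GROUP" \
  --query "[?tags.cluster && contains(tags.cluster, 'sapeic')].id" \
  --output tsv
```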

fix: use direct az command with --file parameter for kubeconfig generation

- Replace 'make aro-kubeconfig > kubeconfig' with direct az command
- Use 'az aro get-admin-kubeconfig --file kubeconfig' to avoid file conflicts
- Fixes 'File kubeconfig already exists' error

The issue was that az aro get-admin-kubeconfig creates a kubeconfig
file by default, and redirecting output to the same filename caused
a conflict. Using --file parameter directly avoids this issue.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
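That is, something along these lines (cluster and resource group names are placeholders):

```sh
# Let the CLI write the file itself instead of redirecting stdout, which
# created the kubeconfig file before az could write to it.
az aro get-admin-kubeconfig \
  --name "$ARO_CLUSTER_NAME" \
  --resource-group "$ARO_RESOURCE_GROUP" \
  --file kubeconfig
```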

fix: correct kubeconfig generation command syntax

- Replace 'make aro-kubeconfig --file kubeconfig' with 'make aro-kubeconfig > kubeconfig'
- Fixes 'No rule to make target kubeconfig' error
- Use proper output redirection instead of invalid --file parameter

The aro-kubeconfig makefile target doesn't accept --file parameter,
it just runs the az aro get-admin-kubeconfig command and outputs
to stdout, which we redirect to the kubeconfig file.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: filter out az aro list-credentials command line from JSON output

- Add 'grep -v "az aro list-credentials"' to filter out the command line
- Fixes 'Invalid numeric literal at line 2, column 3' error
- Now only the actual JSON object will be passed to jq

The issue was that make aro-credentials was outputting both the
command line and the JSON result, causing jq to try to parse
the command line as JSON.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

debug: add detailed logging to see raw credentials output

- Add debug output to see what make aro-credentials actually returns
- Show both raw output and filtered JSON before jq parsing
- Help identify what's causing 'Invalid numeric literal' error
- Will help determine the exact content being passed to jq

This will show us the actual output structure and help
identify why the JSON parsing is still failing.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: use grep to filter out info messages from aro-credentials output

- Replace 'tail -n +2' with 'grep -v' to filter out specific info messages
- Filter out 'Variable is not defined' and 'Not all required variables are defined'
- More robust approach to handle variable output from required-environment-variables
- Fixes 'Invalid numeric literal' jq parsing error

This approach is more reliable than line-based filtering since
the number of info messages can vary.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
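Roughly, the step then becomes something like this; the filtered strings follow the commit above, and the jq keys assume the usual az aro list-credentials output shape:

```sh
# Strip the informational lines so only the JSON object reaches jq.
CREDENTIALS_JSON=$(make aro-credentials \
  | grep -v "Variable is not defined" \
  | grep -v "Not all required variables are defined")
KUBEADMIN_USER=$(echo "$CREDENTIALS_JSON" | jq -r '.kubeadminUsername')
KUBEADMIN_PASS=$(echo "$CREDENTIALS_JSON" | jq -r '.kubeadminPassword')
```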

fix: handle JSON output from aro-credentials properly

- Use 'tail -n +2' to skip the first line (info message) but keep all JSON lines
- Store full JSON in CREDENTIALS_JSON variable before parsing with jq
- Fixes 'parse error: Unmatched }' when trying to parse incomplete JSON

The issue was that 'tail -1' only kept the last line of the JSON,
breaking the JSON structure. Now we skip the first line but keep
the complete JSON object for proper jq parsing.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: apply tail -1 fix to all make commands with required-environment-variables

- Fix aro-deploy-task.yaml: make aro-cluster-exists, make aro-cluster-status,
  make postgres-exists, make redis-exists
- Fix aro-teardown-task.yaml: make aro-cluster-exists (2 instances)
- Fix aro-validate-task.yaml: make aro-cluster-url, make aro-credentials,
  make postgres-exists, make redis-exists

This resolves the issue where required-environment-variables function
was printing info messages that got captured in command substitution,
causing string comparisons to fail and infinite loops to occur.

All make commands that use required-environment-variables now use
'tail -1' to extract only the actual output, not the info messages.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: handle extra output from required-environment-variables in cluster status

- Use 'tail -1' to get only the last line from make aro-cluster-status
- Fixes issue where required-environment-variables function was printing
  info messages that got captured in CLUSTER_STATUS variable
- Removes hex dump debugging since xxd command is not available
- Resolves infinite loop where 'Succeeded' status was not being detected

The issue was that CLUSTER_STATUS contained:
'az aro show --name "sapeic" --resource-group "manjun" --query "provisioningState" -o tsv\nSucceeded'
instead of just 'Succeeded'

Signed-off-by: mjiao <manjun.jiao@gmail.com>
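A sketch of the resulting pattern; the make wrapper echoes the az command line before its output, so tail -1 keeps only the actual value:

```sh
CLUSTER_STATUS=$(make aro-cluster-status | tail -1)
echo "Cluster status: ${CLUSTER_STATUS}"
```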

debug: add detailed status debugging to validate task

- Add quotes around status output to see exact string
- Add status length and hex dump for debugging
- Help identify why 'Succeeded' status comparison is failing

This will help debug the infinite loop issue where
cluster shows 'Succeeded' but script continues waiting.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: use runtime evaluation for azure-set-subscription target

- Change parse-time $(shell az account show) evaluation to runtime $$(az account show) in azure-set-subscription
- Fixes issue where subscription ID was empty due to makefile parse-time evaluation
- Ensures subscription is set after Azure login, not before

This resolves the 'subscription of '' doesn't exist' error
in the validate-and-get-access step.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
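A minimal sketch of the difference, assuming the subscription ID was previously assigned with $(shell ...):

```makefile
# Before (assumed): expanded when make parses the file, i.e. before az login,
# so the value is empty inside the pipeline.
# SUBSCRIPTION_ID := $(shell az account show --query id -o tsv)

.PHONY: azure-set-subscription
azure-set-subscription:
	# After: $$(...) defers the command substitution to recipe run time,
	# after the pipeline has logged in to Azure.
	az account set --subscription "$$(az account show --query id -o tsv)"
```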

refactor: remove duplicate checks and use makefile targets consistently

- Remove duplicate Azure CLI checks and use makefile targets as primary solution
- Remove redundant service checks in waiting loop
- Simplify cluster status checking logic
- Use make aro-cluster-exists, make aro-cluster-status, make postgres-exists, make redis-exists
- Clean up debugging output while maintaining essential logging

This eliminates redundant API calls and maintains the clean
makefile-based approach we established earlier.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: correct cluster status checking logic to prevent infinite loops

- Fix logic bug where Succeeded status was entering waiting loop
- Add explicit break statements when cluster is ready
- Only enter waiting loop for Creating/Updating states
- Prevent infinite waiting when cluster is already Succeeded

This fixes the issue where the pipeline was stuck waiting
for a cluster that was already in Succeeded state.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
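A sketch of the corrected loop, reusing the tail -1 pattern from the earlier commits:

```sh
while true; do
  CLUSTER_STATUS=$(make aro-cluster-status | tail -1)
  case "$CLUSTER_STATUS" in
    Succeeded)
      echo "Cluster ready"
      break ;;                      # explicit break: stop waiting
    Creating|Updating)
      echo "Cluster is $CLUSTER_STATUS, waiting..."
      sleep 60 ;;
    *)
      echo "Unexpected cluster state: $CLUSTER_STATUS" >&2
      exit 1 ;;
  esac
done
```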

fix: add comprehensive debugging and safety checks for cluster existence

- Add direct Azure CLI cluster existence check for debugging
- Add makefile target result comparison
- Add final safety check before deployment to prevent conflicts
- Log cluster name and resource group for debugging
- Compare direct Azure CLI vs makefile target results

This will help identify why cluster existence detection is failing
and prevent PropertyChangeNotAllowed errors by double-checking
before attempting deployment.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: improve cluster existence detection and prevent duplicate deployments

- Add debug logging for cluster existence check results
- Restructure deployment logic to be more explicit about cluster existence
- Add deployment decision logging to help troubleshoot issues
- Prevent full ARO deployment when cluster already exists
- Fixes PropertyChangeNotAllowed error when trying to modify existing cluster

This ensures that if a cluster is already running, we only deploy
missing services instead of trying to recreate the entire cluster.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: disable private link service network policies on subnets

- Add privateLinkServiceNetworkPolicies: 'Disabled' to both master and worker subnets
- Fixes Azure deployment error: PrivateLinkServiceNetworkPoliciesCannotBeEnabledOnPrivateLinkServiceSubnet
- Prevents conflicts when Azure automatically configures subnets as private link service subnets

This resolves the network deployment failure in ARO cluster creation.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
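In the Bicep template this amounts to one extra subnet property; names, address prefixes, and API version here are illustrative:

```bicep
resource masterSubnet 'Microsoft.Network/virtualNetworks/subnets@2023-04-01' = {
  parent: vnet
  name: 'master-subnet'
  properties: {
    addressPrefix: '10.0.0.0/23'
    // ARO reconfigures these subnets as private link service subnets during
    // deployment, which fails unless this policy is disabled up front.
    privateLinkServiceNetworkPolicies: 'Disabled'
  }
}
```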

fix: add missing export keywords and remove redundant makefile references

- Add export keywords to environment variables in aro-teardown-task.yaml
- Remove redundant -f bicep.makefile references in aro-deploy-task.yaml
- Ensure consistent environment variable export pattern across all tasks
- Fixes potential issues where environment variables might not be available to makefile targets

This ensures all Tekton tasks properly export environment variables
and use the correct makefile syntax for consistency.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
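For illustration, inside a Tekton step script, where $(params.…) is substituted by Tekton before the shell runs (the param names are assumptions):

```sh
# Without export, these variables stay local to the step's shell and are
# invisible to the make child process.
export ARO_CLUSTER_NAME="$(params.clusterName)"
export ARO_RESOURCE_GROUP="$(params.resourceGroup)"
make aro-cluster-exists
```
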
standardize environment variables

- Standardize secret key names to use UPPER_CASE environment variables
  - CLIENT_ID, CLIENT_SECRET, TENANT_ID, ARO_RESOURCE_GROUP, ARO_DOMAIN, PULL_SECRET
  - Remove manual export conversions from Tekton tasks (aro-deploy, aro-validate, aro-teardown)

- Add 11 new Makefile targets for Azure operations
  - aro-get-kubeconfig: get ARO kubeconfig with insecure TLS settings
  - redis-get-info: get Redis cache connection information
  - postgres-delete, redis-delete: individual service cleanup
  - aro-resources-cleanup: clean up ARO-related resources
  - aro-cleanup-all-services: comprehensive service cleanup
  - aro-resource-group-create/exists: resource group management
  - aro-services-deploy-only/with-retry: granular service deployment
  - aro-final-safety-check: pre-deployment validation

- Update Tekton tasks to use centralized Makefile targets
  - Replace ~70 lines of inline Azure CLI commands with make calls
  - Improve maintainability and enable local development workflows

- Fix Tekton timeout configurations
  - Update PipelineRun to use new timeouts syntax (pipeline: 168h, tasks: 120m)
  - Add explicit step timeouts (aro-deploy: 120m, aro-validate: 30m, aro-teardown: 60m)

- Update documentation (README.md, CLAUDE.md) with new secret format and Makefile targets

This enables developers to run the same Azure operations locally that are
used in CI/CD pipelines, following infrastructure-as-code best practices
with centralized command definitions.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
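With standardized UPPER_CASE keys, a step can mount the secret directly and skip any renaming; the secret and image names below are placeholders, while aro-deploy-test is one of the consolidated targets mentioned later in this history:

```yaml
steps:
  - name: aro-deploy
    image: registry.example.com/azure-cli:latest  # placeholder image
    envFrom:
      - secretRef:
          name: aro-credentials  # keys: CLIENT_ID, CLIENT_SECRET, TENANT_ID, ...
    script: |
      #!/usr/bin/env bash
      set -euo pipefail
      # Secret keys arrive as environment variables with the exact names
      # the Makefile targets expect.
      make aro-deploy-test
```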

test further

Signed-off-by: mjiao <manjun.jiao@gmail.com>

test further the auto-approval

Signed-off-by: mjiao <manjun.jiao@gmail.com>
Clean up duplicate and redundant Makefile targets while enhancing
Bicep templates with cost-optimized testing configurations.

Changes:
- Remove duplicate targets from bicep.makefile:
  - aro-deploy → consolidated into aro-deploy-test
  - azure-services-deploy & aro-services-deploy-only → aro-services-deploy-test
  - aro-url → use existing aro-cluster-url
- Move bicep-related targets from main Makefile to bicep.makefile
- Enhance Bicep templates with testing-focused improvements:
  - Add cost-optimized parameter validation and defaults
  - Create test-specific parameter files
  - Add comprehensive testing outputs and metadata
  - Include testing tags for resource management
- Update documentation and references to use simplified targets
- Add consistent deployment names for better tracking

This aligns the tooling with the testing-only use case while
eliminating redundancy and improving cost optimization.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
@mjiao mjiao force-pushed the rosa-validation-eic831 branch from 217effe to 9eb9c98 Compare October 24, 2025 09:25
mjiao added 2 commits October 24, 2025 11:50
@mjiao mjiao force-pushed the rosa-validation-eic831 branch from f66e8d6 to a6aaffd Compare October 24, 2025 10:11
mjiao added 3 commits October 24, 2025 12:14