Conversation

@mjiao mjiao commented Jul 7, 2025

No description provided.

@mjiao mjiao force-pushed the rosa-validation-eic831 branch 4 times, most recently from 231931e to e10daaf Compare July 8, 2025 14:11
@mjiao mjiao changed the title [WIP] ARO validation automation [WIP] ARO/ROSA validation automation Jul 10, 2025
@mjiao mjiao changed the title [WIP] ARO/ROSA validation automation [WIP] ARO validation automation Jul 11, 2025
@mjiao
Copy link
Contributor Author

mjiao commented Jul 11, 2025

[Screenshot attached: 2025-07-12 00:52]

@mjiao mjiao force-pushed the rosa-validation-eic831 branch from 9b42e7e to afd5d7e Compare July 11, 2025 22:56
@mjiao mjiao changed the title [WIP] ARO validation automation ARO validation automation Jul 11, 2025
@mjiao mjiao force-pushed the rosa-validation-eic831 branch from f5554be to d0c05bf Compare July 14, 2025 08:41
@mjiao mjiao requested review from RishabhKodes and kksat July 14, 2025 08:42
@mjiao mjiao force-pushed the rosa-validation-eic831 branch from d0c05bf to 17abde5 Compare July 16, 2025 11:52
- name: publicDNS
value: "false"
- name: jiraSecretName
value: "jira-secret-mj"
- name: jiraIssueKey
value: "SAPOCP-1587"
Collaborator

I do not quite understand this: do we plan to update these values for each new Jira ticket? Maybe create a copy instead, so each Jira ticket gets its own workflow that we can run later and independently?
We certainly can and should delete workflows for outdated or no-longer-supported versions and configurations.

Contributor Author

Thank you for the great question! Let me clarify our workflow strategy for JIRA ticket management.

Our Current Approach:
We create a new JIRA ticket for each validation task, with approximately 3-4 validation tasks per month on average. Since our clusters (especially ARO and ROSA) are quite dynamic, we typically create new clusters for each validation cycle - particularly for cloud-based ones where we focus on testing major releases.

Workflow Management Strategy:

  • New tickets = New workflows: Each JIRA ticket gets its own workflow file (like the current aro-endpoint-test-run.yaml for SAPOCP-1590)
  • No parallel execution: We don't run old workflows alongside new ones - each validation task is independent
  • Proactive cleanup: We clean up workflow files once the cluster is deleted or the validation task is completed

Why This Works for Us:
This approach aligns well with our validation cycle where we're constantly testing new configurations and major releases. The workflow files serve as a snapshot of what was tested for each specific ticket, and we maintain a clean repository by removing outdated workflows.

Your suggestion about creating separate workflow files for each ticket is exactly what we're doing! The current file structure will evolve as we create new validation tasks, and we'll maintain a clean slate by removing completed workflows.

* Kubeconfig and service access configured ✅
* All connectivity tests passed ✅
Ready for manual teardown approval. The pipeline will proceed with infrastructure cleanup once approved.
Collaborator

Why a manual teardown? Are we expected to do some manual steps before the teardown?

Contributor Author

The manual approval before teardown serves two purposes in our validation workflow.

  1. EIC Uninstallation Testing: Similar to the installation process, we need to test the uninstallation of EIC through the web interface. Since EIC doesn't support API-based uninstallation yet, this requires manual steps that need to be performed before the cluster is torn down.

  2. Demo/Reference Scenarios: Sometimes we may want to keep the cluster running for demo purposes or as a reference environment. The manual approval step gives us the flexibility to decide whether to proceed with teardown immediately or keep the infrastructure running for a longer period.

Once the manual steps are completed and approved, the pipeline proceeds with the automated cluster teardown.
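A minimal sketch of how such a gate can sit between validation and teardown, assuming the openshift-pipelines manual-approval-gate custom task; the task names, apiVersion, and params below are assumptions, not necessarily what this PR uses:

```yaml
tasks:
  - name: aro-validate
    taskRef:
      name: aro-validate-task
  - name: wait-for-teardown-approval
    runAfter: [aro-validate]
    taskRef:
      apiVersion: openshift-pipelines.org/v1alpha1  # custom task, assumed
      kind: ApprovalTask
    params:
      - name: approvers
        value: ["mjiao"]  # illustrative approver
  - name: aro-teardown
    runAfter: [wait-for-teardown-approval]
    taskRef:
      name: aro-teardown-task
```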

@mjiao mjiao force-pushed the rosa-validation-eic831 branch 10 times, most recently from 84c05b7 to 96572aa Compare August 11, 2025 20:25

mjiao commented Aug 11, 2025

[Screenshot attached]

@mjiao mjiao requested a review from kksat August 14, 2025 14:34
@mjiao mjiao force-pushed the rosa-validation-eic831 branch 4 times, most recently from 53b3cd5 to e67583d Compare August 28, 2025 14:11
mjiao added 23 commits October 24, 2025 11:25
Configure explicit one-week timeout to override default 1-hour limit
and prevent pipeline failures during long-running operations.

This gives sufficient time for ARO deployment, manual approvals,
endpoint testing, and teardown operations.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
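In Tekton v1 syntax, such a configuration looks roughly like this; resource names are placeholders, and the 168h/120m values follow the timeouts described later in this commit history:

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: aro-validation-run  # placeholder
spec:
  pipelineRef:
    name: aro-validation-pipeline  # placeholder
  timeouts:
    pipeline: 168h  # one week, overriding the 1h default
    tasks: 120m     # cumulative budget for the pipeline's tasks
```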

Fix PostgreSQL and Redis deletion commands in ARO teardown task

Remove unsupported --no-wait flag from az postgres flexible-server delete
and az redis delete commands to prevent teardown failures.

The --no-wait flag is not supported by these specific Azure CLI commands
and was causing 'unrecognized arguments: --no-wait' errors during cleanup.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
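The corrected invocations presumably look like the following; both subcommands accept --yes to skip the confirmation prompt, which keeps the task non-interactive (server and cache names are placeholders):

```sh
az postgres flexible-server delete \
  --name "$POSTGRES_SERVER_NAME" \
  --resource-group "$ARO_RESOURCE_GROUP" \
  --yes
az redis delete \
  --name "$REDIS_CACHE_NAME" \
  --resource-group "$ARO_RESOURCE_GROUP" \
  --yes
```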

enhance: add explicit PostgreSQL and Redis cleanup to teardown

- Add explicit PostgreSQL flexible server deletion
- Add explicit Redis cache deletion with proper name matching
- Keep existing generic resource cleanup as fallback
- Ensure all Azure services are properly cleaned up

This makes the teardown more robust and explicit about
cleaning up PostgreSQL and Redis services, while maintaining
the existing cleanup logic for other ARO-related resources.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: handle null tags in Azure resource cleanup query

- Add null checks for tags and tags.cluster before using contains()
- Fixes 'Invalid jmespath query' error in teardown task
- Query now safely handles resources without tags or cluster tags

The issue was that some Azure resources don't have tags or
have null tags.cluster values, causing the contains() function
to fail. Now we check for existence before using contains().

Signed-off-by: mjiao <manjun.jiao@gmail.com>
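One null-safe formulation relies on JMESPath's short-circuiting && rather than guarding with explicit null literals; the resource group variable and tag value here are illustrative:

```sh
# tags.cluster is null for untagged resources; && short-circuits on null,
# so contains() only runs when the tag actually exists.
az resource list \
  --resource-group "$ARO_RESOURCE_GROUP" \
  --query "[?tags.cluster && contains(tags.cluster, 'sapeic')].id" \
  --output tsv
```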

fix: use direct az command with --file parameter for kubeconfig generation

- Replace 'make aro-kubeconfig > kubeconfig' with direct az command
- Use 'az aro get-admin-kubeconfig --file kubeconfig' to avoid file conflicts
- Fixes 'File kubeconfig already exists' error

The issue was that az aro get-admin-kubeconfig creates a kubeconfig
file by default, and redirecting output to the same filename caused
a conflict. Using --file parameter directly avoids this issue.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
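That is, something along these lines (cluster and resource group names are placeholders):

```sh
# Let the CLI write the file itself instead of redirecting stdout, which
# created the kubeconfig file before az could write to it.
az aro get-admin-kubeconfig \
  --name "$ARO_CLUSTER_NAME" \
  --resource-group "$ARO_RESOURCE_GROUP" \
  --file kubeconfig
```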

fix: correct kubeconfig generation command syntax

- Replace 'make aro-kubeconfig --file kubeconfig' with 'make aro-kubeconfig > kubeconfig'
- Fixes 'No rule to make target kubeconfig' error
- Use proper output redirection instead of invalid --file parameter

The aro-kubeconfig makefile target doesn't accept --file parameter,
it just runs the az aro get-admin-kubeconfig command and outputs
to stdout, which we redirect to the kubeconfig file.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: filter out az aro list-credentials command line from JSON output

- Add 'grep -v "az aro list-credentials"' to filter out the command line
- Fixes 'Invalid numeric literal at line 2, column 3' error
- Now only the actual JSON object will be passed to jq

The issue was that make aro-credentials was outputting both the
command line and the JSON result, causing jq to try to parse
the command line as JSON.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

debug: add detailed logging to see raw credentials output

- Add debug output to see what make aro-credentials actually returns
- Show both raw output and filtered JSON before jq parsing
- Help identify what's causing 'Invalid numeric literal' error
- Will help determine the exact content being passed to jq

This will show us the actual output structure and help
identify why the JSON parsing is still failing.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: use grep to filter out info messages from aro-credentials output

- Replace 'tail -n +2' with 'grep -v' to filter out specific info messages
- Filter out 'Variable is not defined' and 'Not all required variables are defined'
- More robust approach to handle variable output from required-environment-variables
- Fixes 'Invalid numeric literal' jq parsing error

This approach is more reliable than line-based filtering since
the number of info messages can vary.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
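Roughly, the step then becomes something like this; the filtered strings follow the commit above, and the jq keys assume the usual az aro list-credentials output shape:

```sh
# Strip the informational lines so only the JSON object reaches jq.
CREDENTIALS_JSON=$(make aro-credentials \
  | grep -v "Variable is not defined" \
  | grep -v "Not all required variables are defined")
KUBEADMIN_USER=$(echo "$CREDENTIALS_JSON" | jq -r '.kubeadminUsername')
KUBEADMIN_PASS=$(echo "$CREDENTIALS_JSON" | jq -r '.kubeadminPassword')
```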

fix: handle JSON output from aro-credentials properly

- Use 'tail -n +2' to skip the first line (info message) but keep all JSON lines
- Store full JSON in CREDENTIALS_JSON variable before parsing with jq
- Fixes 'parse error: Unmatched }' when trying to parse incomplete JSON

The issue was that 'tail -1' only kept the last line of the JSON,
breaking the JSON structure. Now we skip the first line but keep
the complete JSON object for proper jq parsing.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: apply tail -1 fix to all make commands with required-environment-variables

- Fix aro-deploy-task.yaml: make aro-cluster-exists, make aro-cluster-status,
  make postgres-exists, make redis-exists
- Fix aro-teardown-task.yaml: make aro-cluster-exists (2 instances)
- Fix aro-validate-task.yaml: make aro-cluster-url, make aro-credentials,
  make postgres-exists, make redis-exists

This resolves the issue where required-environment-variables function
was printing info messages that got captured in command substitution,
causing string comparisons to fail and infinite loops to occur.

All make commands that use required-environment-variables now use
'tail -1' to extract only the actual output, not the info messages.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: handle extra output from required-environment-variables in cluster status

- Use 'tail -1' to get only the last line from make aro-cluster-status
- Fixes issue where required-environment-variables function was printing
  info messages that got captured in CLUSTER_STATUS variable
- Removes hex dump debugging since xxd command is not available
- Resolves infinite loop where 'Succeeded' status was not being detected

The issue was that CLUSTER_STATUS contained:
'az aro show --name "sapeic" --resource-group "manjun" --query "provisioningState" -o tsv\nSucceeded'
instead of just 'Succeeded'

Signed-off-by: mjiao <manjun.jiao@gmail.com>
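A sketch of the resulting pattern; the make wrapper echoes the az command line before its output, so tail -1 keeps only the actual value:

```sh
CLUSTER_STATUS=$(make aro-cluster-status | tail -1)
echo "Cluster status: ${CLUSTER_STATUS}"
```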

debug: add detailed status debugging to validate task

- Add quotes around status output to see exact string
- Add status length and hex dump for debugging
- Help identify why 'Succeeded' status comparison is failing

This will help debug the infinite loop issue where
cluster shows 'Succeeded' but script continues waiting.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: use runtime evaluation for azure-set-subscription target

- Change parse-time $(shell az account show) evaluation to runtime $$(az account show) in azure-set-subscription
- Fixes issue where subscription ID was empty due to makefile parse-time evaluation
- Ensures subscription is set after Azure login, not before

This resolves the 'subscription of '' doesn't exist' error
in the validate-and-get-access step.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
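A minimal sketch of the difference, assuming the subscription ID was previously assigned with $(shell ...):

```makefile
# Before (assumed): expanded when make parses the file, i.e. before az login,
# so the value is empty inside the pipeline.
# SUBSCRIPTION_ID := $(shell az account show --query id -o tsv)

.PHONY: azure-set-subscription
azure-set-subscription:
	# After: $$(...) defers the command substitution to recipe run time,
	# after the pipeline has logged in to Azure.
	az account set --subscription "$$(az account show --query id -o tsv)"
```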

refactor: remove duplicate checks and use makefile targets consistently

- Remove duplicate Azure CLI checks and use makefile targets as primary solution
- Remove redundant service checks in waiting loop
- Simplify cluster status checking logic
- Use make aro-cluster-exists, make aro-cluster-status, make postgres-exists, make redis-exists
- Clean up debugging output while maintaining essential logging

This eliminates redundant API calls and maintains the clean
makefile-based approach we established earlier.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: correct cluster status checking logic to prevent infinite loops

- Fix logic bug where Succeeded status was entering waiting loop
- Add explicit break statements when cluster is ready
- Only enter waiting loop for Creating/Updating states
- Prevent infinite waiting when cluster is already Succeeded

This fixes the issue where the pipeline was stuck waiting
for a cluster that was already in Succeeded state.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
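A sketch of the corrected loop, reusing the tail -1 pattern from the earlier commits:

```sh
while true; do
  CLUSTER_STATUS=$(make aro-cluster-status | tail -1)
  case "$CLUSTER_STATUS" in
    Succeeded)
      echo "Cluster ready"
      break ;;                      # explicit break: stop waiting
    Creating|Updating)
      echo "Cluster is $CLUSTER_STATUS, waiting..."
      sleep 60 ;;
    *)
      echo "Unexpected cluster state: $CLUSTER_STATUS" >&2
      exit 1 ;;
  esac
done
```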

fix: add comprehensive debugging and safety checks for cluster existence

- Add direct Azure CLI cluster existence check for debugging
- Add makefile target result comparison
- Add final safety check before deployment to prevent conflicts
- Log cluster name and resource group for debugging
- Compare direct Azure CLI vs makefile target results

This will help identify why cluster existence detection is failing
and prevent PropertyChangeNotAllowed errors by double-checking
before attempting deployment.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: improve cluster existence detection and prevent duplicate deployments

- Add debug logging for cluster existence check results
- Restructure deployment logic to be more explicit about cluster existence
- Add deployment decision logging to help troubleshoot issues
- Prevent full ARO deployment when cluster already exists
- Fixes PropertyChangeNotAllowed error when trying to modify existing cluster

This ensures that if a cluster is already running, we only deploy
missing services instead of trying to recreate the entire cluster.

Signed-off-by: mjiao <manjun.jiao@gmail.com>

fix: disable private link service network policies on subnets

- Add privateLinkServiceNetworkPolicies: 'Disabled' to both master and worker subnets
- Fixes Azure deployment error: PrivateLinkServiceNetworkPoliciesCannotBeEnabledOnPrivateLinkServiceSubnet
- Prevents conflicts when Azure automatically configures subnets as private link service subnets

This resolves the network deployment failure in ARO cluster creation.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
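In the Bicep template this amounts to one extra subnet property; names, address prefixes, and API version here are illustrative:

```bicep
resource masterSubnet 'Microsoft.Network/virtualNetworks/subnets@2023-04-01' = {
  parent: vnet
  name: 'master-subnet'
  properties: {
    addressPrefix: '10.0.0.0/23'
    // ARO reconfigures these subnets as private link service subnets during
    // deployment, which fails unless this policy is disabled up front.
    privateLinkServiceNetworkPolicies: 'Disabled'
  }
}
```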

fix: add missing export keywords and remove redundant makefile references

- Add export keywords to environment variables in aro-teardown-task.yaml
- Remove redundant -f bicep.makefile references in aro-deploy-task.yaml
- Ensure consistent environment variable export pattern across all tasks
- Fixes potential issues where environment variables might not be available to makefile targets

This ensures all Tekton tasks properly export environment variables
and use the correct makefile syntax for consistency.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
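For illustration, inside a Tekton step script, where $(params.…) is substituted by Tekton before the shell runs (the param names are assumptions):

```sh
# Without export, these variables stay local to the step's shell and are
# invisible to the make child process.
export ARO_CLUSTER_NAME="$(params.clusterName)"
export ARO_RESOURCE_GROUP="$(params.resourceGroup)"
make aro-cluster-exists
```
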
standardize environment variables

- Standardize secret key names to use UPPER_CASE environment variables
  - CLIENT_ID, CLIENT_SECRET, TENANT_ID, ARO_RESOURCE_GROUP, ARO_DOMAIN, PULL_SECRET
  - Remove manual export conversions from Tekton tasks (aro-deploy, aro-validate, aro-teardown)

- Add 11 new Makefile targets for Azure operations
  - aro-get-kubeconfig: get ARO kubeconfig with insecure TLS settings
  - redis-get-info: get Redis cache connection information
  - postgres-delete, redis-delete: individual service cleanup
  - aro-resources-cleanup: clean up ARO-related resources
  - aro-cleanup-all-services: comprehensive service cleanup
  - aro-resource-group-create/exists: resource group management
  - aro-services-deploy-only/with-retry: granular service deployment
  - aro-final-safety-check: pre-deployment validation

- Update Tekton tasks to use centralized Makefile targets
  - Replace ~70 lines of inline Azure CLI commands with make calls
  - Improve maintainability and enable local development workflows

- Fix Tekton timeout configurations
  - Update PipelineRun to use new timeouts syntax (pipeline: 168h, tasks: 120m)
  - Add explicit step timeouts (aro-deploy: 120m, aro-validate: 30m, aro-teardown: 60m)

- Update documentation (README.md, CLAUDE.md) with new secret format and Makefile targets

This enables developers to run the same Azure operations locally that are
used in CI/CD pipelines, following infrastructure-as-code best practices
with centralized command definitions.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
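With standardized UPPER_CASE keys, a step can mount the secret directly and skip any renaming; the secret and image names below are placeholders, while aro-deploy-test is one of the consolidated targets mentioned later in this history:

```yaml
steps:
  - name: aro-deploy
    image: registry.example.com/azure-cli:latest  # placeholder image
    envFrom:
      - secretRef:
          name: aro-credentials  # keys: CLIENT_ID, CLIENT_SECRET, TENANT_ID, ...
    script: |
      #!/usr/bin/env bash
      set -euo pipefail
      # Secret keys arrive as environment variables with the exact names
      # the Makefile targets expect.
      make aro-deploy-test
```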

test further

Signed-off-by: mjiao <manjun.jiao@gmail.com>

test further the auto-approval

Signed-off-by: mjiao <manjun.jiao@gmail.com>
Clean up duplicate and redundant Makefile targets while enhancing
Bicep templates with cost-optimized testing configurations.

Changes:
- Remove duplicate targets from bicep.makefile:
  - aro-deploy → consolidated into aro-deploy-test
  - azure-services-deploy & aro-services-deploy-only → aro-services-deploy-test
  - aro-url → use existing aro-cluster-url
- Move bicep-related targets from main Makefile to bicep.makefile
- Enhance Bicep templates with testing-focused improvements:
  - Add cost-optimized parameter validation and defaults
  - Create test-specific parameter files
  - Add comprehensive testing outputs and metadata
  - Include testing tags for resource management
- Update documentation and references to use simplified targets
- Add consistent deployment names for better tracking

This aligns the tooling with the testing-only use case while
eliminating redundancy and improving cost optimization.

Signed-off-by: mjiao <manjun.jiao@gmail.com>
@mjiao mjiao force-pushed the rosa-validation-eic831 branch from 217effe to 9eb9c98 Compare October 24, 2025 09:25
mjiao added 2 commits October 24, 2025 11:50
@mjiao mjiao force-pushed the rosa-validation-eic831 branch from f66e8d6 to a6aaffd Compare October 24, 2025 10:11
mjiao added 3 commits October 24, 2025 12:14