From c073e565a9e832266ad76fb284f632b431de6ffe Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Mon, 24 Nov 2025 15:51:41 -0800 Subject: [PATCH 01/19] Add Copilot instruction files for repository technologies (#30) * Initial plan * Add comprehensive Copilot instruction files Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Add README and validation script for Copilot instructions Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --- .github/instructions/README.md | 127 ++++ .github/instructions/bash.instructions.md | 464 ++++++++++++++ .github/instructions/docker.instructions.md | 533 ++++++++++++++++ .../helm-kubernetes.instructions.md | 209 +++++++ .github/instructions/python.instructions.md | 577 ++++++++++++++++++ .github/scripts/validate-instructions.sh | 92 +++ 6 files changed, 2002 insertions(+) create mode 100644 .github/instructions/README.md create mode 100644 .github/instructions/bash.instructions.md create mode 100644 .github/instructions/docker.instructions.md create mode 100644 .github/instructions/helm-kubernetes.instructions.md create mode 100644 .github/instructions/python.instructions.md create mode 100755 .github/scripts/validate-instructions.sh diff --git a/.github/instructions/README.md b/.github/instructions/README.md new file mode 100644 index 00000000..02b5de26 --- /dev/null +++ b/.github/instructions/README.md @@ -0,0 +1,127 @@ +# GitHub Copilot Instructions + +This directory contains instruction files that help GitHub Copilot provide better, more contextual assistance when working with this repository. These files follow the [GitHub Copilot coding agent best practices](https://gh.io/copilot-coding-agent-tips). 
+ +## Overview + +Each instruction file provides guidelines, conventions, and best practices for specific technologies or file types used in this repository. GitHub Copilot uses these instructions to understand the project's coding standards and provide more accurate suggestions. + +## Instruction Files + +### Core Technologies + +- **[helm-kubernetes.instructions.md](helm-kubernetes.instructions.md)** - Comprehensive guide for Helm chart development and Kubernetes manifest creation + - Applies to: `**/*.yaml`, `**/*.yml`, `**/Chart.yaml`, `**/values.yaml`, `**/templates/**` + - Covers: Helm best practices, chart structure, template development, RBAC, Argo-specific patterns + +- **[python.instructions.md](python.instructions.md)** - General Python development guidelines + - Applies to: `**/*.py`, `**/requirements*.txt`, `**/setup.py`, `**/pyproject.toml` + - Covers: PEP 8 compliance, type hints, testing, Flask patterns, error handling + +- **[bash.instructions.md](bash.instructions.md)** - Bash scripting and Makefile best practices + - Applies to: `**/*.sh`, `**/Makefile` + - Covers: Script structure, error handling, Kubernetes patterns, security, testing + +- **[docker.instructions.md](docker.instructions.md)** - Docker and containerization guidelines + - Applies to: `**/Dockerfile`, `**/Dockerfile.*`, `**/.dockerignore` + - Covers: Multi-stage builds, security, optimization, health checks, Alpine patterns + +### Specialized Technologies + +- **[go.instructions.md](go.instructions.md)** - Go development following idiomatic practices + - Applies to: `**/*.go`, `**/go.mod`, `**/go.sum` + - Covers: Idiomatic Go, naming conventions, error handling, concurrency + +- **[python-mcp-server.instructions.md](python-mcp-server.instructions.md)** - Model Context Protocol (MCP) server development + - Applies to: `**/*.py`, `**/pyproject.toml`, `**/requirements.txt` + - Covers: FastMCP patterns, tool development, resource management, HTTP/stdio transports + +## How It Works + 
+GitHub Copilot automatically reads and applies these instructions based on the file patterns specified in each instruction file's frontmatter. When you're working on a file that matches one or more patterns, Copilot considers the relevant guidelines when providing suggestions. + +### Frontmatter Format + +Each instruction file starts with YAML frontmatter: + +```yaml +--- +description: 'Brief description of what this file covers' +applyTo: 'file pattern(s) that trigger these instructions' +--- +``` + +### File Pattern Examples + +- `**/*.py` - All Python files +- `**/Dockerfile` - All Dockerfiles +- `helm/*/templates/**` - All Helm templates +- `**/*.{yaml,yml}` - All YAML files + +## Contributing + +When adding new technologies or updating existing guidelines: + +1. Create or update the appropriate instruction file +2. Include proper frontmatter with description and file patterns +3. Follow the established structure and format +4. Include practical examples and common patterns +5. Document common pitfalls and security considerations +6. Update this README with any new instruction files + +## Repository-Specific Patterns + +This repository focuses on: +- **Argo Workflows** - Kubernetes-native workflow engine +- **Argo CD** - GitOps continuous delivery +- **Authorization Adapter** - Flask-based RBAC service +- **Helm Charts** - Kubernetes package management +- **Multi-tenancy** - Namespace isolation and RBAC + +The instruction files are tailored to these specific use cases while following industry best practices. 
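For contributors adding a new file, a minimal skeleton that satisfies the contributing checklist above might look like this (the `<technology>` name and glob pattern are placeholders to replace):

```markdown
---
description: 'Instructions for working with <technology>'
applyTo: '**/*.<ext>'
---

# <Technology> Instructions

## General Principles

- ...

## Common Patterns

- ...

## Common Pitfalls to Avoid

- ...
```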
+ +## Best Practices + +### When Writing Instructions + +- **Be specific** - Provide concrete examples and patterns +- **Be practical** - Focus on what developers actually need +- **Be current** - Keep up with best practices and tool updates +- **Be consistent** - Follow the established format and style +- **Be comprehensive** - Cover common scenarios and edge cases + +### Testing Instructions + +After adding or updating instruction files, verify they work correctly by: + +1. Opening files that match the patterns +2. Checking that Copilot provides contextually appropriate suggestions +3. Ensuring suggestions follow the documented guidelines +4. Testing with different file types and scenarios + +## Resources + +- [GitHub Copilot Documentation](https://docs.github.com/en/copilot) +- [Best practices for Copilot coding agent](https://gh.io/copilot-coding-agent-tips) +- [Repository CONTRIBUTING.md](../../CONTRIBUTING.md) +- [Repository README.md](../../README.md) + +## Maintenance + +These instruction files should be reviewed and updated: +- When introducing new technologies or patterns +- When updating dependencies or frameworks +- When best practices evolve +- When team conventions change +- At least quarterly for general maintenance + +## Questions? + +If you have questions about these instructions or suggestions for improvements, please: +- Open an issue in the repository +- Submit a pull request with proposed changes +- Reach out to the maintainers + +--- + +**Note**: These instructions are designed to assist GitHub Copilot in providing better suggestions. They represent our team's coding standards and should be followed by all contributors, whether using Copilot or not. 
diff --git a/.github/instructions/bash.instructions.md b/.github/instructions/bash.instructions.md
new file mode 100644
index 00000000..beacf4e0
--- /dev/null
+++ b/.github/instructions/bash.instructions.md
@@ -0,0 +1,464 @@
+---
+description: 'Instructions for writing Bash scripts following best practices and conventions'
+applyTo: '**/*.sh, **/Makefile'
+---
+
+# Bash Scripting Instructions
+
+## General Principles
+
+- Write portable, readable, and maintainable shell scripts
+- Follow POSIX standards where possible, use Bash-specific features when beneficial
+- Include error handling and validation
+- Make scripts idempotent when possible
+- Document script usage and requirements
+
+## Script Structure
+
+### Shebang and Options
+
+- Always start scripts with `#!/bin/bash` (or `#!/usr/bin/env bash` for portability)
+- Use `set -e` to exit on errors (or `set -euo pipefail` for stricter error handling)
+- Consider `set -u` to treat unset variables as errors
+- Use `set -x` for debugging when needed (or enable via DEBUG environment variable)
+
+Example:
+```bash
+#!/bin/bash
+set -euo pipefail
+
+# Optional debugging
+[[ "${DEBUG:-}" == "true" ]] && set -x
+```
+
+### Script Organization
+
+- Start with a header comment describing the script's purpose
+- Define all functions before the main script logic
+- Include a usage/help function
+- Place main execution logic at the bottom
+- Use clear section separators
+
+Example:
+```bash
+#!/bin/bash
+# Description: Deploy Argo stack to Kubernetes
+# Usage: ./deploy.sh [options]
+
+set -euo pipefail
+
+#################
+# Configuration #
+#################
+
+DEFAULT_NAMESPACE="argocd"
+TIMEOUT="10m"
+
+#############
+# Functions #
+#############
+
+usage() {
+    cat <<EOF
+Usage: $0 [options]
+
+Options:
+  -n NAMESPACE   Target namespace (default: ${DEFAULT_NAMESPACE})
+  -h             Show this help message
+EOF
+}
+
+check_prerequisites() {
+    command -v kubectl >/dev/null 2>&1 || { echo "kubectl is required"; exit 1; }
+}
+
+main() {
+    check_prerequisites
+    # Main logic here
+}
+
+########
+# Main #
+########
+
+main "$@"
+```
+
+## Error Handling
+
+### Exit Codes
+
+- Use meaningful exit codes (0 for success, 
non-zero for errors)
+- Document exit codes in help text for complex scripts
+- Use consistent exit codes across scripts
+
+### Validation
+
+- Validate required environment variables early:
+```bash
+: "${REQUIRED_VAR:?Error: REQUIRED_VAR must be set}"
+```
+
+- Check for required commands:
+```bash
+command -v kubectl >/dev/null 2>&1 || {
+    echo "Error: kubectl is required but not installed"
+    exit 1
+}
+```
+
+- Validate file existence:
+```bash
+[[ -f "${CONFIG_FILE}" ]] || {
+    echo "Error: Config file not found: ${CONFIG_FILE}"
+    exit 1
+}
+```
+
+### Cleanup and Traps
+
+- Use `trap` for cleanup operations:
+```bash
+cleanup() {
+    rm -f "${TEMP_FILE}"
+}
+trap cleanup EXIT INT TERM
+```
+
+## Variables and Quoting
+
+### Variable Naming
+
+- Use UPPER_CASE for environment variables and constants
+- Use lower_case for local variables
+- Use descriptive names (avoid single letters except for loop counters)
+
+### Quoting
+
+- Always quote variables unless you explicitly want word splitting: `"${var}"`
+- Quote command substitutions: `"$(command)"`
+- Use arrays for lists instead of space-separated strings
+- Inside `[[ ]]`, no word splitting occurs, so the left-hand side may be left unquoted; still quote the right-hand side of `==`/`!=` unless you intend glob-style pattern matching
+
+### Arrays
+
+- Use arrays for lists of items:
+```bash
+namespaces=("argo" "argocd" "security")
+for ns in "${namespaces[@]}"; do
+    echo "${ns}"
+done
+```
+
+## Conditionals and Loops
+
+### If Statements
+
+- Use `[[ ]]` instead of `[ ]` for better error handling and features
+- Prefer explicit comparisons:
+```bash
+if [[ "${STATUS}" == "ready" ]]; then
+    echo "Ready"
+fi
+
+if [[ -n "${VAR}" ]]; then  # Check if variable is not empty
+    echo "VAR is set"
+fi
+
+if [[ -z "${VAR}" ]]; then  # Check if variable is empty
+    echo "VAR is not set"
+fi
+```
+
+### Loops
+
+- Use `for` loops for iterating over arrays
+- Use `while read` for processing lines:
+```bash
+while IFS= read -r line; do
+    echo "${line}"
+done < file.txt
+```
+
+- Break long loops into functions for readability
+
+## 
Functions + +### Function Definition + +- Define functions before use +- Use clear, descriptive function names +- Add comments describing parameters and return values +- Use `local` for function-scoped variables + +```bash +# Deploy a Helm chart +# Arguments: +# $1 - chart name +# $2 - namespace +# $3 - values file (optional) +# Returns: +# 0 on success, 1 on failure +deploy_chart() { + local chart_name="${1}" + local namespace="${2}" + local values_file="${3:-}" + + local helm_args=( + upgrade --install + "${chart_name}" + "./charts/${chart_name}" + --namespace "${namespace}" + --create-namespace + ) + + if [[ -n "${values_file}" ]]; then + helm_args+=(--values "${values_file}") + fi + + helm "${helm_args[@]}" +} +``` + +## Command Execution + +### Command Substitution + +- Use `$(command)` instead of backticks +- Check command success: +```bash +if output=$(kubectl get pods 2>&1); then + echo "Success: ${output}" +else + echo "Failed to get pods" + exit 1 +fi +``` + +### Pipelines + +- Use `set -o pipefail` to catch errors in pipelines +- Consider breaking complex pipelines into steps + +### Background Jobs + +- Track background processes: +```bash +kubectl port-forward svc/myservice 8080:80 & +PF_PID=$! + +# Later, clean up +kill "${PF_PID}" 2>/dev/null || true +``` + +## Output and Logging + +### User Feedback + +- Use descriptive output messages with emoji when appropriate: +```bash +echo "โœ… Deployment successful" +echo "โŒ Error: Deployment failed" +echo "๐Ÿ” Checking prerequisites..." 
+echo "โš ๏ธ Warning: Resource limits not set" +``` + +### Debugging + +- Use meaningful debug output: +```bash +if [[ "${DEBUG:-false}" == "true" ]]; then + echo "DEBUG: Variable value: ${VAR}" +fi +``` + +### Error Messages + +- Write errors to stderr: +```bash +echo "Error: Something went wrong" >&2 +exit 1 +``` + +## Kubernetes-Specific Patterns + +### Waiting for Resources + +- Use `kubectl wait` instead of sleep loops: +```bash +kubectl wait --for=condition=Ready pod \ + -l app=myapp \ + --timeout=120s \ + -n "${namespace}" +``` + +### Namespace Operations + +- Always specify namespace explicitly: +```bash +kubectl get pods -n "${namespace}" +``` + +- Check if namespace exists: +```bash +if kubectl get namespace "${namespace}" >/dev/null 2>&1; then + echo "Namespace exists" +fi +``` + +### Safe Deletions + +- Use `|| true` for delete operations that might not find resources: +```bash +kubectl delete namespace "${namespace}" --ignore-not-found=true +# or +kubectl delete pod mypod 2>/dev/null || true +``` + +## Makefile Conventions + +### Targets + +- Use `.PHONY` for non-file targets +- Provide a `help` target as default +- Use descriptive target names +- Add comments explaining what each target does + +### Variables + +- Define configurable variables with defaults +- Use `?=` for variables that can be overridden +- Document required environment variables + +Example: +```makefile +.PHONY: help deploy clean + +NAMESPACE ?= default +TIMEOUT ?= 10m + +help: + @echo "Available targets:" + @echo " deploy - Deploy the application" + @echo " clean - Clean up resources" + +deploy: + @echo "๐Ÿš€ Deploying to namespace: $(NAMESPACE)" + helm upgrade --install myapp ./charts/myapp \ + --namespace $(NAMESPACE) \ + --timeout $(TIMEOUT) + +clean: + @echo "๐Ÿงน Cleaning up..." 
+ helm uninstall myapp -n $(NAMESPACE) || true +``` + +## Security Considerations + +### Secrets and Sensitive Data + +- Never hardcode secrets in scripts +- Use environment variables or secret management tools +- Don't echo sensitive variables (they'll appear in logs) +- Be careful with `set -x` when handling secrets + +### Input Validation + +- Validate all external inputs +- Sanitize user-provided values +- Be cautious with `eval` (avoid if possible) + +## Testing + +### Dry Runs + +- Support dry-run mode where applicable: +```bash +DRY_RUN="${DRY_RUN:-false}" + +run_command() { + if [[ "${DRY_RUN}" == "true" ]]; then + echo "Would run: $*" + else + "$@" + fi +} +``` + +### ShellCheck + +- Run `shellcheck` on all shell scripts before committing +- Address or suppress warnings with justification +- Add shellcheck directives when needed: +```bash +# shellcheck disable=SC2034 # VAR appears unused +VAR="value" +``` + +## Common Patterns + +### Checking Command Availability + +```bash +has_command() { + command -v "$1" >/dev/null 2>&1 +} + +if ! has_command kubectl; then + echo "kubectl not found" + exit 1 +fi +``` + +### Retry Logic + +```bash +retry() { + local max_attempts=$1 + shift + local cmd=("$@") + local attempt=1 + + while (( attempt <= max_attempts )); do + if "${cmd[@]}"; then + return 0 + fi + echo "Attempt ${attempt}/${max_attempts} failed, retrying..." 
+ ((attempt++)) + sleep 2 + done + + return 1 +} + +retry 3 kubectl get pods +``` + +### Temporary Files + +```bash +# Create temp file safely +TEMP_FILE=$(mktemp) +trap 'rm -f "${TEMP_FILE}"' EXIT + +# Use it +echo "data" > "${TEMP_FILE}" +``` + +## Common Pitfalls to Avoid + +- Don't use `cd` without error checking or in subshells +- Don't parse `ls` output (use globs or `find` instead) +- Don't use `cat file | grep` (use `grep pattern file`) +- Don't ignore command failures with `;` (use `&&` for chaining) +- Don't use `echo` for complex output (use `printf` or heredocs) +- Don't assume scripts run from a specific directory (use absolute paths or `cd "$(dirname "$0")"`) +- Avoid `which` (use `command -v` instead) + +## Documentation + +- Include usage information in scripts (help function) +- Document required environment variables +- Add examples in comments +- Keep comments up to date with code diff --git a/.github/instructions/docker.instructions.md b/.github/instructions/docker.instructions.md new file mode 100644 index 00000000..b5d630f1 --- /dev/null +++ b/.github/instructions/docker.instructions.md @@ -0,0 +1,533 @@ +--- +description: 'Instructions for writing Dockerfiles and working with containers' +applyTo: '**/Dockerfile, **/Dockerfile.*, **/.dockerignore' +--- + +# Docker and Containerization Instructions + +## General Principles + +- Write secure, efficient, and maintainable Dockerfiles +- Optimize for small image sizes and fast build times +- Follow Docker best practices and security guidelines +- Use multi-stage builds when appropriate +- Keep images minimal and focused + +## Dockerfile Best Practices + +### Base Image Selection + +- Use official base images from Docker Hub +- Prefer specific version tags over `latest` +- Choose appropriate base images for your use case: + - `alpine` for minimal size (use musl libc compatible packages) + - `slim` variants for balance between size and compatibility + - Full images when you need all system utilities 
+- Use multi-stage builds to keep final images small + +```dockerfile +# Good: Specific version +FROM python:3.11-slim + +# Bad: Using latest +FROM python:latest + +# Good: Alpine for minimal size +FROM python:3.11-alpine + +# Good: Multi-stage build +FROM python:3.11 AS builder +# Build steps + +FROM python:3.11-slim +# Copy artifacts from builder +``` + +### Image Structure + +- Order instructions from least to most frequently changing +- Combine related RUN commands to reduce layers +- Use `.dockerignore` to exclude unnecessary files +- Clean up in the same layer where you create files + +```dockerfile +FROM python:3.11-slim + +# Set working directory early +WORKDIR /app + +# Install system dependencies (changes rarely) +RUN apt-get update && \ + apt-get install -y --no-install-recommends \ + curl \ + ca-certificates && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* + +# Copy requirements first (changes less often than code) +COPY requirements.txt . + +# Install Python dependencies +RUN pip install --no-cache-dir -r requirements.txt + +# Copy application code (changes most frequently) +COPY . . 
+
+# Set runtime configuration
+ENV FLASK_APP=app.py
+ENV PYTHONUNBUFFERED=1
+
+# Expose port
+EXPOSE 8080
+
+# Run as non-root user
+USER nobody
+
+# Define entrypoint
+CMD ["python", "app.py"]
+```
+
+### Layer Optimization
+
+- Minimize the number of layers (combine RUN commands)
+- Put frequently changing instructions at the end
+- Use build cache effectively by ordering instructions properly
+- Clean up temporary files in the same RUN instruction
+
+```dockerfile
+# Good: Combined into one layer with cleanup
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        gcc \
+        build-essential && \
+    pip install --no-cache-dir -r requirements.txt && \
+    apt-get remove -y gcc build-essential && \
+    apt-get autoremove -y && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+# Bad: Multiple layers, no cleanup
+RUN apt-get update
+RUN apt-get install -y gcc
+RUN pip install -r requirements.txt
+```
+
+### Multi-stage Builds
+
+- Use multi-stage builds to separate build and runtime environments
+- Copy only necessary artifacts to the final image
+- Keep the final image minimal
+
+```dockerfile
+# Build stage
+FROM python:3.11 AS builder
+
+WORKDIR /app
+
+# Install build dependencies
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        gcc \
+        build-essential
+
+# Install Python packages
+COPY requirements.txt .
+RUN pip install --user --no-cache-dir -r requirements.txt
+
+# Runtime stage
+FROM python:3.11-slim
+
+WORKDIR /app
+
+# Create the runtime user first so copied files get the right owner
+# (/root/.local from the builder is not readable by a non-root user)
+RUN useradd -m -u 1000 appuser
+
+# Copy Python packages from builder into the runtime user's home
+COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
+
+# Copy application
+COPY --chown=appuser:appuser . .
+
+# Make sure scripts are in PATH
+ENV PATH=/home/appuser/.local/bin:$PATH
+ENV PYTHONUNBUFFERED=1
+
+USER appuser
+
+CMD ["python", "app.py"]
+```
+
+## Security Best Practices
+
+### User Management
+
+- Don't run containers as root
+- Create a dedicated non-root user if needed
+- Use numeric user IDs for better Kubernetes compatibility
+
+```dockerfile
+# Option 1: Use the nobody user (already exists in base images)
+USER nobody
+
+# Option 2: Create a dedicated user
+RUN groupadd -r appuser && \
+    useradd -r -g appuser -u 1000 appuser && \
+    chown -R appuser:appuser /app
+
+USER appuser
+
+# Option 3: Use a numeric UID (better for Kubernetes)
+USER 1000:1000
+```
+
+### Minimize Attack Surface
+
+- Install only necessary packages
+- Remove package manager caches
+- Use specific package versions
+- Scan images for vulnerabilities regularly
+
+```dockerfile
+# Install only what's needed, clean up after
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        ca-certificates \
+        curl && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
+
+# Pin Python package versions
+RUN pip install --no-cache-dir \ + flask==3.0.0 \ + requests==2.31.0 +``` + +### Secrets and Sensitive Data + +- Never include secrets in Docker images +- Use build arguments for build-time configuration (not secrets) +- Use Docker secrets or environment variables for runtime secrets +- Don't commit .env files with secrets + +```dockerfile +# Good: Use ARG for build-time values (not secrets) +ARG APP_VERSION=1.0.0 +ENV APP_VERSION=${APP_VERSION} + +# Good: Expect secrets via environment at runtime +ENV API_KEY="" + +# Bad: Hardcoded secret +ENV API_KEY="secret-key-12345" +``` + +## Python-Specific Patterns + +### Python Dockerfiles + +- Use `PYTHONUNBUFFERED=1` for real-time logging +- Install packages with `--no-cache-dir` to save space +- Use `pip install --user` in multi-stage builds +- Consider using `uv` or `pip-tools` for faster installs + +```dockerfile +FROM python:3.11-slim + +WORKDIR /app + +# Prevent Python from writing pyc files and buffering stdout/stderr +ENV PYTHONDONTWRITEBYTECODE=1 \ + PYTHONUNBUFFERED=1 \ + PIP_NO_CACHE_DIR=1 \ + PIP_DISABLE_PIP_VERSION_CHECK=1 + +# Install dependencies +COPY requirements.txt . +RUN pip install --no-cache-dir -r requirements.txt + +# Copy application +COPY . . + +# Create non-root user +RUN useradd -m -u 1000 appuser && \ + chown -R appuser:appuser /app + +USER appuser + +EXPOSE 8080 + +CMD ["python", "-m", "flask", "run", "--host=0.0.0.0", "--port=8080"] +``` + +### Flask/Web Application Pattern + +```dockerfile +FROM python:3.11-slim + +WORKDIR /app + +ENV PYTHONUNBUFFERED=1 \ + FLASK_APP=app.py \ + FLASK_ENV=production + +# Install dependencies +COPY requirements.txt . +RUN pip install --no-cache-dir -r requirements.txt + +# Copy application code +COPY app.py . 
+COPY templates/ templates/
+COPY static/ static/
+
+# Health check (python:3.11-slim does not ship curl, so use the standard library)
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/healthz')" || exit 1
+
+# Run as non-root
+USER nobody
+
+EXPOSE 8080
+
+CMD ["python", "-m", "flask", "run", "--host=0.0.0.0", "--port=8080"]
+```
+
+## .dockerignore
+
+- Always include a `.dockerignore` file
+- Exclude unnecessary files to speed up builds and reduce context size
+- Follow patterns similar to `.gitignore`
+
+```
+# .dockerignore
+.git
+.gitignore
+.github
+README.md
+LICENSE
+.venv
+venv/
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+.pytest_cache/
+.coverage
+htmlcov/
+.mypy_cache/
+.tox/
+dist/
+build/
+*.egg-info/
+.DS_Store
+.env
+.env.local
+*.log
+tests/
+docs/
+examples/
+```
+
+## Health Checks
+
+- Include HEALTHCHECK instructions for containerized services
+- Implement a health endpoint in your application
+- Set appropriate intervals and timeouts
+
+```dockerfile
+# Simple health check using curl (requires curl to be installed in the image)
+HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
+    CMD curl -f http://localhost:8080/healthz || exit 1
+
+# Health check without additional tools (Python standard library only)
+HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/healthz')" || exit 1
+```
+
+## Labels and Metadata
+
+- Add labels for better organization and documentation
+- Follow OCI annotation conventions
+- Include version, build info, and maintainer
+
+```dockerfile
+LABEL org.opencontainers.image.title="Argo AuthZ Adapter" \
+      org.opencontainers.image.description="Authorization adapter for Argo Workflows" \
+      org.opencontainers.image.version="1.0.0" \
+      org.opencontainers.image.authors="Your Team" \
+      org.opencontainers.image.source="https://github.com/calypr/argo-helm" \
+      org.opencontainers.image.licenses="Apache-2.0"
+```
+
+## Build Arguments
+
+- Use ARG for configurable build-time values
+- Provide 
sensible defaults
+- Document arguments in comments
+
+```dockerfile
+# Build arguments with defaults
+ARG PYTHON_VERSION=3.11
+ARG APP_VERSION=latest
+
+FROM python:${PYTHON_VERSION}-slim
+
+# Re-declare after FROM to use in this stage
+ARG APP_VERSION
+ENV APP_VERSION=${APP_VERSION}
+
+LABEL version="${APP_VERSION}"
+```
+
+## Entrypoint vs CMD
+
+- Use ENTRYPOINT for executable containers
+- Use CMD for default arguments to ENTRYPOINT or standalone commands
+- Use JSON array format for proper signal handling
+
+```dockerfile
+# Good: ENTRYPOINT + CMD for flexibility
+ENTRYPOINT ["python"]
+CMD ["app.py"]
+# Can override CMD: docker run myimage script.py
+
+# Good: ENTRYPOINT as executable
+ENTRYPOINT ["python", "-m", "flask"]
+CMD ["run", "--host=0.0.0.0"]
+
+# Good: Simple CMD
+CMD ["python", "app.py"]
+
+# Bad: Shell form (doesn't handle signals properly)
+CMD python app.py
+```
+
+## Working with Alpine
+
+- Install Python packages that need compilation with build dependencies
+- Use Alpine's package manager (apk)
+- Add and remove build dependencies in the same RUN instruction; deleting them in a later layer leaves them in the earlier layer and does not shrink the image
+
+```dockerfile
+FROM python:3.11-alpine
+
+WORKDIR /app
+
+COPY requirements.txt .
+
+# Install build and runtime dependencies, compile Python packages, and
+# remove the build dependencies in a single layer
+RUN apk add --no-cache --virtual .build-deps \
+        gcc \
+        musl-dev \
+        python3-dev && \
+    apk add --no-cache \
+        ca-certificates \
+        curl && \
+    pip install --no-cache-dir -r requirements.txt && \
+    apk del .build-deps
+
+COPY . . 
+
+USER nobody
+
+CMD ["python", "app.py"]
+```
+
+## Volume Management
+
+- Use VOLUME for data that should persist or be shared
+- Document expected volumes in comments
+- Don't include VOLUME for application code
+
+```dockerfile
+# Create directory for data
+RUN mkdir -p /data && chown appuser:appuser /data
+
+# Declare volume for persistent data
+VOLUME ["/data"]
+
+# Document in comment
+# Expected volumes:
+#   /data - Application data and logs
+```
+
+## Common Patterns
+
+### Development vs Production
+
+Create separate Dockerfiles or use build targets:
+
+```dockerfile
+# Dockerfile
+FROM python:3.11-slim AS base
+
+WORKDIR /app
+
+COPY requirements.txt .
+
+FROM base AS development
+# Dev dependencies live in requirements-dev.txt, so copy it into this stage
+COPY requirements-dev.txt .
+RUN pip install --no-cache-dir -r requirements.txt -r requirements-dev.txt
+COPY . .
+CMD ["python", "-m", "flask", "run", "--host=0.0.0.0", "--debug"]
+
+FROM base AS production
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+USER nobody
+CMD ["gunicorn", "--bind", "0.0.0.0:8080", "app:app"]
+```
+
+### Building specific targets
+
+```bash
+# Build development image
+docker build --target development -t myapp:dev .
+
+# Build production image
+docker build --target production -t myapp:prod .
+```
+
+## Testing Docker Images
+
+- Test images locally before pushing
+- Verify non-root user execution
+- Check image size
+- Scan for vulnerabilities
+
+```bash
+# Build image
+docker build -t myapp:test . 
+ +# Check image size +docker images myapp:test + +# Run security scan (example with trivy) +trivy image myapp:test + +# Test the container +docker run --rm -p 8080:8080 myapp:test + +# Verify non-root +docker run --rm myapp:test id +``` + +## Common Pitfalls to Avoid + +- Don't use `apt-get upgrade` in Dockerfiles (use newer base image instead) +- Don't store secrets in images +- Don't run containers as root +- Don't use `latest` tag for base images in production +- Don't ignore .dockerignore (slows builds and increases context size) +- Don't install unnecessary packages +- Don't create unnecessary layers +- Don't leave package manager caches +- Don't use shell form for ENTRYPOINT/CMD (breaks signal handling) +- Don't copy everything with `COPY . .` too early (breaks layer caching) + +## Documentation + +- Document build arguments and their defaults +- Document exposed ports and their purpose +- Document required environment variables +- Document expected volumes +- Include example run commands in README diff --git a/.github/instructions/helm-kubernetes.instructions.md b/.github/instructions/helm-kubernetes.instructions.md new file mode 100644 index 00000000..7fbfaeed --- /dev/null +++ b/.github/instructions/helm-kubernetes.instructions.md @@ -0,0 +1,209 @@ +--- +description: 'Instructions for developing and maintaining Helm charts and Kubernetes manifests' +applyTo: '**/*.yaml, **/*.yml, **/Chart.yaml, **/values.yaml, **/templates/**' +--- + +# Helm and Kubernetes Development Instructions + +## General Principles + +- Follow Helm best practices and Kubernetes manifest conventions +- Write clean, maintainable, and reusable chart templates +- Ensure backward compatibility when making changes to existing charts +- Test all changes thoroughly before committing +- Document configuration options clearly in values.yaml and README files + +## Helm Chart Development + +### Chart Structure + +- Organize charts following the standard Helm chart structure: + - `Chart.yaml`: 
Chart metadata and dependencies + - `values.yaml`: Default configuration values with comprehensive comments + - `templates/`: Kubernetes manifest templates + - `templates/_helpers.tpl`: Helper templates for reusable snippets + - `README.md`: Chart documentation with usage examples + +### Chart.yaml + +- Follow semantic versioning for chart versions +- Increment `version` for chart changes, `appVersion` for application version changes +- List all dependencies with specific version constraints +- Include maintainer information and useful metadata +- Add keywords and home/source URLs for discoverability + +### values.yaml + +- Provide sensible defaults that work out of the box +- Document each configuration option with inline comments +- Group related settings logically with clear section headers +- Use consistent naming conventions (camelCase recommended) +- Mark required values clearly and provide example values +- Consider backward compatibility when adding or modifying values +- Use nested structures to organize complex configurations + +### Templates + +- Use consistent indentation (2 spaces) +- Include helpful comments explaining complex logic +- Use `{{- ` and ` -}}` to control whitespace appropriately +- Leverage `_helpers.tpl` for common patterns and labels +- Always quote string values in templates to prevent type issues +- Use `.Values`, `.Chart`, `.Release` objects appropriately +- Validate required values with `required` function +- Use `toYaml` and `nindent` for clean YAML output +- Include resource limits and requests for all containers +- Add health checks (liveness and readiness probes) where appropriate + +### Template Best Practices + +- Use `include` instead of `template` for better error messages +- Define common labels in `_helpers.tpl` and reuse them +- Use consistent naming for Kubernetes resources: `{{ include "chart.fullname" . 
}}` +- Implement conditional resource creation with `if` statements +- Validate inputs using the `required` and `fail` functions +- Use `lookup` function carefully (not available in `helm template`) +- Handle list values properly with `toYaml` and proper indentation + +### Testing and Validation + +- Run `helm lint` to check for issues before committing +- Use `helm template` to render manifests and verify output +- Test with `ct lint` (chart-testing tool) for comprehensive validation +- Use `kubeconform` or similar tools to validate Kubernetes manifests +- Test installation with `helm install` in a test cluster +- Verify upgrades work correctly with `helm upgrade` +- Test with different values files to ensure flexibility + +## Kubernetes Manifest Best Practices + +### Resource Specifications + +- Always specify resource requests and limits +- Set appropriate security contexts (runAsNonRoot, readOnlyRootFilesystem, etc.) +- Use namespaces for resource isolation +- Apply proper RBAC (Roles, RoleBindings, ServiceAccounts) +- Add meaningful labels and annotations +- Use selectors consistently + +### ConfigMaps and Secrets + +- Use ConfigMaps for non-sensitive configuration +- Use Secrets for sensitive data +- Reference Secrets securely in pod specs +- Consider using external secret management solutions +- Document which secrets need to be created before installation + +### Networking + +- Define Services with appropriate types (ClusterIP, NodePort, LoadBalancer) +- Configure Ingress resources with proper annotations for your ingress controller +- Use NetworkPolicies for network segmentation when needed +- Document external dependencies and endpoints + +### High Availability and Scaling + +- Support replica configuration for stateless applications +- Use PodDisruptionBudgets for critical services +- Configure HorizontalPodAutoscaler when appropriate +- Consider anti-affinity rules for better pod distribution +- Use StatefulSets for stateful applications + +### 
Observability + +- Include health check endpoints for all services +- Add Prometheus annotations for metric scraping when applicable +- Configure proper logging (stdout/stderr) +- Add readiness and liveness probes with appropriate thresholds + +## Argo-Specific Patterns + +### Argo Workflows + +- Follow Argo Workflows best practices for WorkflowTemplate definitions +- Use proper artifact repository configuration +- Configure service accounts with appropriate RBAC +- Use templates for reusable workflow components +- Document workflow parameters and usage + +### Argo CD + +- Structure Application manifests with proper sync policies +- Use automated sync with caution (prune and selfHeal options) +- Configure proper health checks for custom resources +- Use Projects for multi-tenancy when appropriate +- Document repository requirements and access patterns + +### Argo Events + +- Define EventSources with proper authentication +- Create Sensors with clear trigger conditions +- Use proper RBAC for event processing +- Document webhook configurations and expected payloads + +## Multi-Tenancy and RBAC + +- Create proper namespace isolation +- Define clear RBAC roles (viewer, runner, admin) +- Use RoleBindings and ClusterRoleBindings appropriately +- Document permission requirements +- Test RBAC policies with `kubectl auth can-i` + +## Documentation + +- Keep README.md up to date with: + - Prerequisites and dependencies + - Installation instructions + - Configuration examples + - Upgrade procedures + - Troubleshooting tips +- Document breaking changes in CHANGELOG +- Provide example values files for common scenarios +- Include mermaid diagrams for architecture when helpful + +## Common Pitfalls to Avoid + +- Don't hardcode values that should be configurable +- Don't ignore backward compatibility in existing charts +- Don't skip testing with different values combinations +- Don't forget to update Chart.yaml version +- Don't use deprecated Kubernetes API versions +- Don't omit 
resource limits (can cause cluster issues) +- Don't expose secrets in logs or status outputs +- Avoid creating breaking changes without major version bump + +## Validation Commands + +Always run these commands before committing: + +```bash +# Add Helm dependencies +helm repo add argo https://argoproj.github.io/argo-helm +helm repo update + +# Build dependencies +helm dependency build helm/argo-stack + +# Lint the chart +helm lint helm/argo-stack --values helm/argo-stack/values.yaml + +# Render templates +helm template argo-stack helm/argo-stack \ + --values helm/argo-stack/values.yaml \ + --namespace argocd > rendered.yaml + +# Validate manifests +kubeconform -strict -ignore-missing-schemas \ + -skip 'CustomResourceDefinition|Application|Workflow|WorkflowTemplate' \ + -summary rendered.yaml + +# Test with ct (if available) +ct lint --config .ct.yaml +``` + +## Version Compatibility + +- Target Kubernetes 1.20+ unless specific compatibility is needed +- Use stable API versions (avoid alpha/beta in production) +- Test with multiple Kubernetes versions when possible +- Document minimum required versions in Chart.yaml and README diff --git a/.github/instructions/python.instructions.md b/.github/instructions/python.instructions.md new file mode 100644 index 00000000..42682c73 --- /dev/null +++ b/.github/instructions/python.instructions.md @@ -0,0 +1,577 @@ +--- +description: 'Instructions for Python development following best practices and conventions' +applyTo: '**/*.py, **/requirements*.txt, **/setup.py, **/pyproject.toml' +--- + +# Python Development Instructions + +## General Principles + +- Write clear, readable, and maintainable Python code +- Follow PEP 8 style guidelines +- Use Python 3.9+ features and best practices +- Write comprehensive tests for all functionality +- Document code with docstrings and type hints +- Prefer explicit over implicit + +## Code Style and Formatting + +### PEP 8 Compliance + +- Use 4 spaces for indentation (never tabs) +- Limit lines 
to 88-100 characters (prefer 88 for Black compatibility)
+- Use blank lines to separate logical sections
+- Follow naming conventions:
+  - `snake_case` for functions and variables
+  - `PascalCase` for classes
+  - `UPPER_CASE` for constants
+  - `_leading_underscore` for private/internal
+
+### Imports
+
+- Group imports in order: standard library, third-party, local
+- Use absolute imports over relative imports
+- Sort imports alphabetically within groups
+- One import per line for clarity:
+
+```python
+# Standard library
+import os
+import sys
+from typing import Dict, List, Optional
+
+# Third-party
+import flask
+from flask import Flask, request
+
+# Local
+from .config import Config
+from .utils import helper_function
+```
+
+### Type Hints
+
+- Always use type hints for function signatures
+- Use `Optional[T]` for values that can be None
+- Import types from the `typing` module
+- Use `-> None` for functions that don't return values
+
+```python
+from typing import Dict, List, Optional
+
+def process_data(
+    data: List[str],
+    config: Optional[Dict[str, str]] = None
+) -> Dict[str, int]:
+    """Process data and return results."""
+    result: Dict[str, int] = {}
+    # Implementation
+    return result
+```
+
+## Documentation
+
+### Docstrings
+
+- Use docstrings for all public modules, classes, and functions
+- Follow Google or NumPy style for multi-line docstrings
+- Include parameter descriptions and return values
+- Document exceptions that can be raised
+
+```python
+def validate_token(token: str, fence_base: str) -> Dict[str, Any]:
+    """
+    Validate an authentication token against Fence.
+
+    Args:
+        token: The authentication token to validate
+        fence_base: Base URL for the Fence authentication service
+
+    Returns:
+        Dictionary containing user information and authorization data
+
+    Raises:
+        ValueError: If token is empty or has an invalid format
+        requests.HTTPError: If the Fence API request fails
+    """
+    if not token:
+        raise ValueError("Token cannot be empty")
+    # Implementation
+```
+
+### Comments
+
+- Write comments for complex logic, not obvious code
+- Keep comments up to date with code changes
+- Use `#` for inline comments; prefer docstrings for functions
+- Explain "why", not "what", when the code is self-documenting
+
+## Functions and Classes
+
+### Function Design
+
+- Keep functions small and focused (single responsibility)
+- Use descriptive function names that indicate purpose
+- Limit function parameters (consider using dataclasses for many params)
+- Return early to reduce nesting
+
+```python
+def decide_groups(user_doc: Dict[str, Any]) -> List[str]:
+    """Determine user's authorization groups."""
+    if not user_doc.get("active"):
+        return []
+
+    groups = []
+    authz = user_doc.get("authz", {})
+
+    # Check for admin privileges
+    if _is_admin(user_doc):
+        groups.extend(["argo-admin", "argo-runner", "argo-viewer"])
+        return groups
+
+    # Check for runner privileges
+    if _has_workflow_access(authz):
+        groups.append("argo-runner")
+
+    return groups
+```
+
+### Classes
+
+- Use classes for stateful objects and related functionality
+- Implement `__init__`, `__repr__`, and other dunder methods as needed
+- Use properties for computed attributes
+- Consider dataclasses for simple data containers
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+@dataclass
+class WorkflowConfig:
+    """Configuration for workflow execution."""
+    name: str
+    namespace: str
+    service_account: Optional[str] = None
+    timeout: int = 300
+
+    def __post_init__(self):
+        """Validate configuration after initialization."""
+        if self.timeout <= 0:
+            raise 
ValueError("Timeout must be positive")
+```
+
+## Error Handling
+
+### Exceptions
+
+- Use specific exception types, not a bare `except:`
+- Catch exceptions at the appropriate level
+- Log errors before re-raising or returning error responses
+- Create custom exceptions for domain-specific errors
+
+```python
+class AuthorizationError(Exception):
+    """Raised when user is not authorized for an action."""
+    pass
+
+def check_authorization(user: str, resource: str) -> bool:
+    """Check if user can access resource."""
+    try:
+        return validate_access(user, resource)
+    except requests.RequestException as e:
+        logger.error(f"Authorization check failed: {e}")
+        raise AuthorizationError(f"Cannot verify access for {user}") from e
+    except Exception:
+        logger.exception("Unexpected error in authorization check")
+        raise
+```
+
+### Validation
+
+- Validate inputs early
+- Use descriptive error messages
+- Consider using libraries like `pydantic` for complex validation
+
+```python
+def process_request(data: Dict[str, Any]) -> Dict[str, Any]:
+    """Process incoming request."""
+    # Validate required fields
+    required_fields = ["user_id", "action", "resource"]
+    missing = [f for f in required_fields if f not in data]
+    if missing:
+        raise ValueError(f"Missing required fields: {', '.join(missing)}")
+
+    # Validate field values
+    if not data["user_id"].strip():
+        raise ValueError("user_id cannot be empty")
+
+    # Process data
+    return perform_action(data)
+```
+
+## Flask/Web Application Patterns
+
+### Application Structure
+
+- Use the application factory pattern for Flask apps
+- Separate configuration, routes, and business logic
+- Use blueprints for modular organization
+- Configure proper logging
+
+```python
+from flask import Flask
+import logging
+
+def create_app(config: Optional[Dict] = None) -> Flask:
+    """Create and configure Flask application."""
+    app = Flask(__name__)
+
+    # Configure logging
+    logging.basicConfig(level=logging.INFO)
+
+    # Load 
configuration + if config: + app.config.update(config) + + # Register routes + register_routes(app) + + return app +``` + +### Route Handlers + +- Keep route handlers thin (delegate to service layer) +- Validate inputs +- Return appropriate HTTP status codes +- Use consistent response format + +```python +from flask import Flask, request, jsonify + +app = Flask(__name__) + +@app.route("/check", methods=["GET"]) +def check_authorization(): + """Check if user is authorized.""" + try: + # Extract headers + auth_header = request.headers.get("Authorization", "") + if not auth_header: + return jsonify({"error": "Missing Authorization header"}), 401 + + # Validate token + token = auth_header.replace("Bearer ", "") + user_info = validate_token(token) + + # Check authorization + groups = decide_groups(user_info) + if not groups: + return jsonify({"error": "Unauthorized"}), 403 + + # Success response + return jsonify({ + "authorized": True, + "groups": groups + }), 200 + + except ValueError as e: + return jsonify({"error": str(e)}), 400 + except Exception as e: + app.logger.exception("Authorization check failed") + return jsonify({"error": "Internal server error"}), 500 +``` + +### Health Checks + +- Implement health check endpoints +- Check dependencies (database, external services) +- Return appropriate status codes + +```python +@app.route("/healthz", methods=["GET"]) +def health_check(): + """Health check endpoint.""" + try: + # Check if critical services are accessible + check_external_dependencies() + return jsonify({"status": "healthy"}), 200 + except Exception as e: + app.logger.error(f"Health check failed: {e}") + return jsonify({"status": "unhealthy", "error": str(e)}), 503 +``` + +## Testing + +### Test Structure + +- Use `pytest` for testing +- Organize tests to mirror source structure +- Use descriptive test names that explain what is being tested +- Group related tests in classes + +```python +import pytest +from app import decide_groups + +class 
TestDecideGroups: + """Tests for decide_groups function.""" + + def test_inactive_user_returns_empty_list(self): + """Inactive users should have no groups.""" + user_doc = {"active": False} + assert decide_groups(user_doc) == [] + + def test_admin_user_gets_all_groups(self): + """Admin users should get all permission groups.""" + user_doc = { + "active": True, + "email": "admin@example.com", + "authz": {} + } + groups = decide_groups(user_doc) + assert "argo-admin" in groups + assert "argo-runner" in groups + assert "argo-viewer" in groups +``` + +### Fixtures + +- Use pytest fixtures for common test setup +- Keep fixtures focused and reusable +- Use `conftest.py` for shared fixtures + +```python +# conftest.py +import pytest +from app import create_app + +@pytest.fixture +def app(): + """Create Flask app for testing.""" + app = create_app({"TESTING": True}) + return app + +@pytest.fixture +def client(app): + """Create test client.""" + return app.test_client() + +@pytest.fixture +def sample_user(): + """Sample user document for testing.""" + return { + "active": True, + "email": "test@example.com", + "authz": { + "/workflows/submit": [{"method": "create"}] + } + } +``` + +### Test Coverage + +- Aim for high test coverage (80%+ for critical code) +- Test edge cases and error conditions +- Use `pytest-cov` to measure coverage +- Don't just aim for coverage, ensure meaningful tests + +### Mocking + +- Use `unittest.mock` or `pytest-mock` for external dependencies +- Mock network calls, file I/O, and external services +- Keep mocks simple and focused + +```python +from unittest.mock import Mock, patch +import pytest + +def test_token_validation_with_mock(): + """Test token validation with mocked HTTP call.""" + mock_response = Mock() + mock_response.json.return_value = { + "active": True, + "email": "user@example.com" + } + + with patch("requests.get", return_value=mock_response): + result = validate_token("test-token", "https://fence.example.com") + assert 
result["active"] is True +``` + +## Dependencies and Environment + +### Requirements Files + +- Use `requirements.txt` for production dependencies +- Use `requirements-dev.txt` for development dependencies +- Pin versions for reproducibility +- Keep dependencies minimal and up to date + +``` +# requirements.txt +flask==3.0.0 +requests==2.31.0 + +# requirements-dev.txt +pytest==7.4.3 +pytest-cov==4.1.0 +black==23.12.0 +flake8==6.1.0 +``` + +### Virtual Environments + +- Always use virtual environments +- Document setup in README +- Consider using `venv`, `virtualenv`, or `uv` + +## Logging + +### Logger Configuration + +- Use Python's `logging` module +- Configure appropriate log levels +- Include context in log messages +- Don't log sensitive information + +```python +import logging + +logger = logging.getLogger(__name__) + +def process_authorization(user_id: str, resource: str) -> bool: + """Process authorization request.""" + logger.info(f"Checking authorization for user {user_id} on {resource}") + + try: + result = check_access(user_id, resource) + logger.info(f"Authorization check completed: {result}") + return result + except Exception as e: + logger.error(f"Authorization check failed for {user_id}: {e}") + raise +``` + +## Security Considerations + +### Input Validation + +- Validate all external inputs +- Sanitize data before use +- Use parameterized queries for databases +- Validate file paths to prevent path traversal + +### Secrets Management + +- Never hardcode secrets +- Use environment variables or secret management systems +- Don't log sensitive data +- Use secure random number generation for tokens + +```python +import os +import secrets + +def get_api_key() -> str: + """Get API key from environment.""" + api_key = os.environ.get("API_KEY") + if not api_key: + raise ValueError("API_KEY environment variable not set") + return api_key + +def generate_token() -> str: + """Generate secure random token.""" + return secrets.token_urlsafe(32) +``` + +## 
Performance Considerations + +### Efficient Code + +- Use list comprehensions for simple transformations +- Use generators for large datasets +- Cache expensive computations when appropriate +- Profile before optimizing + +```python +# Good: List comprehension +active_users = [u for u in users if u.get("active")] + +# Good: Generator for large datasets +def process_large_file(filename: str): + """Process large file line by line.""" + with open(filename) as f: + for line in f: + yield process_line(line) + +# Good: Caching with functools +from functools import lru_cache + +@lru_cache(maxsize=128) +def expensive_computation(n: int) -> int: + """Cached expensive operation.""" + return sum(i * i for i in range(n)) +``` + +## Common Patterns + +### Context Managers + +- Use context managers for resource management +- Implement `__enter__` and `__exit__` for custom context managers + +```python +from contextlib import contextmanager + +@contextmanager +def managed_resource(resource_name: str): + """Manage resource lifecycle.""" + resource = acquire_resource(resource_name) + try: + yield resource + finally: + release_resource(resource) + +# Usage +with managed_resource("my-resource") as res: + res.do_something() +``` + +## Common Pitfalls to Avoid + +- Don't use mutable default arguments (`def func(items=[]):`) +- Don't modify lists while iterating over them +- Don't catch `Exception` without logging or re-raising +- Don't use `eval()` or `exec()` with untrusted input +- Don't ignore return values from functions +- Don't use global variables when class attributes would work +- Avoid circular imports (reorganize code structure) + +## Code Quality Tools + +### Linting and Formatting + +- Use `black` for code formatting +- Use `flake8` or `pylint` for linting +- Use `mypy` for type checking +- Configure tools in `pyproject.toml` or setup.cfg + +```bash +# Format code +black . + +# Check style +flake8 . 
+
+# Type checking
+mypy app.py
+```
+
+### Pre-commit Hooks
+
+- Set up pre-commit hooks for automatic checks
+- Run formatters and linters before committing
+- Ensure tests pass before pushing
diff --git a/.github/scripts/validate-instructions.sh b/.github/scripts/validate-instructions.sh
new file mode 100755
index 00000000..c3dc7e02
--- /dev/null
+++ b/.github/scripts/validate-instructions.sh
@@ -0,0 +1,92 @@
+#!/bin/bash
+# Validate GitHub Copilot instruction files
+# This script checks that all instruction files have proper frontmatter
+
+set -euo pipefail
+
+INSTRUCTIONS_DIR=".github/instructions"
+ERRORS=0
+
+echo "🔍 Validating Copilot instruction files..."
+echo ""
+
+# Check if instructions directory exists
+if [[ ! -d "${INSTRUCTIONS_DIR}" ]]; then
+    echo "❌ Error: Instructions directory not found: ${INSTRUCTIONS_DIR}"
+    exit 1
+fi
+
+# Find all .instructions.md files
+shopt -s nullglob
+instruction_files=("${INSTRUCTIONS_DIR}"/*.instructions.md)
+
+if [[ ${#instruction_files[@]} -eq 0 ]]; then
+    echo "⚠️ Warning: No instruction files found in ${INSTRUCTIONS_DIR}"
+    exit 0
+fi
+
+echo "Found ${#instruction_files[@]} instruction file(s)"
+echo ""
+
+# Validate each instruction file
+for file in "${instruction_files[@]}"; do
+    filename=$(basename "$file")
+    echo "Checking ${filename}..."
+
+    # Check if file starts with frontmatter
+    if ! head -n 1 "$file" | grep -q "^---$"; then
+        echo "  ❌ Missing frontmatter opening delimiter"
+        # Not ((ERRORS++)): under set -e, that exits when ERRORS is 0
+        ERRORS=$((ERRORS + 1))
+        continue
+    fi
+
+    # Check for description field
+    if ! head -n 10 "$file" | grep -q "^description:"; then
+        echo "  ❌ Missing 'description' field in frontmatter"
+        ERRORS=$((ERRORS + 1))
+    else
+        echo "  ✅ Has description field"
+    fi
+
+    # Check for applyTo field
+    if ! head -n 10 "$file" | grep -q "^applyTo:"; then
+        echo "  ❌ Missing 'applyTo' field in frontmatter"
+        ERRORS=$((ERRORS + 1))
+    else
+        echo "  ✅ Has applyTo field"
+    fi
+
+    # Check for closing frontmatter delimiter
+    if ! 
head -n 10 "$file" | grep -A 1 "^description:" | tail -n +2 | grep -q "^---$" && \
+       ! head -n 10 "$file" | grep -A 1 "^applyTo:" | tail -n +2 | grep -q "^---$"; then
+        # Check more lines for the closing delimiter
+        if ! head -n 15 "$file" | tail -n +2 | grep -q "^---$"; then
+            echo "  ❌ Missing frontmatter closing delimiter"
+            ERRORS=$((ERRORS + 1))
+        else
+            echo "  ✅ Has frontmatter closing delimiter"
+        fi
+    else
+        echo "  ✅ Has frontmatter closing delimiter"
+    fi
+
+    # Check file has content beyond frontmatter
+    line_count=$(wc -l < "$file")
+    if [[ $line_count -lt 20 ]]; then
+        echo "  ⚠️ Warning: File seems very short (${line_count} lines)"
+    else
+        echo "  ✅ Has substantial content (${line_count} lines)"
+    fi
+
+    echo ""
+done
+
+# Summary
+echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+if [[ $ERRORS -eq 0 ]]; then
+    echo "✅ All instruction files are valid!"
+    exit 0
+else
+    echo "❌ Found ${ERRORS} error(s) in instruction files"
+    exit 1
+fi
From 055fe704c2a9b75582caebdeb37a8eb6c0faa734 Mon Sep 17 00:00:00 2001
From: Copilot <198982749+Copilot@users.noreply.github.com>
Date: Mon, 24 Nov 2025 16:35:19 -0800
Subject: [PATCH 02/19] Add ingress-authz-overlay for unified path-based routing with centralized auth (#93)

* Initial plan

* Implement ingress-authz-overlay chart per issue #91

Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com>

* Address code review: remove unused helpers, add container security context

Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com>

* Add Let's Encrypt / cert-manager documentation to user guide

Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com>
---
 .../overlays/ingress-authz-overlay/Chart.yaml |  16 +
 .../overlays/ingress-authz-overlay/README.md  |  65 ++++
.../docs/authz-ingress-user-guide.md | 367 ++++++++++++++++++ .../templates/_helpers.tpl | 72 ++++ .../templates/authz-adapter.yaml | 107 +++++ .../templates/ingress-authz.yaml | 63 +++ .../tests/authz-ingress.feature | 72 ++++ .../ingress-authz-overlay/values.yaml | 147 +++++++ helm/argo-stack/values.yaml | 53 +++ 9 files changed, 962 insertions(+) create mode 100644 helm/argo-stack/overlays/ingress-authz-overlay/Chart.yaml create mode 100644 helm/argo-stack/overlays/ingress-authz-overlay/README.md create mode 100644 helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md create mode 100644 helm/argo-stack/overlays/ingress-authz-overlay/templates/_helpers.tpl create mode 100644 helm/argo-stack/overlays/ingress-authz-overlay/templates/authz-adapter.yaml create mode 100644 helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml create mode 100644 helm/argo-stack/overlays/ingress-authz-overlay/tests/authz-ingress.feature create mode 100644 helm/argo-stack/overlays/ingress-authz-overlay/values.yaml diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/Chart.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/Chart.yaml new file mode 100644 index 00000000..75de8fe6 --- /dev/null +++ b/helm/argo-stack/overlays/ingress-authz-overlay/Chart.yaml @@ -0,0 +1,16 @@ +apiVersion: v2 +name: ingress-authz-overlay +description: Authz-aware ingress overlay providing unified path-based routing with centralized authorization for multi-tenant UIs and APIs +type: application +version: 0.1.0 +appVersion: "1.0.0" +keywords: + - ingress + - authorization + - multi-tenant + - nginx + - argo +home: https://github.com/calypr/argo-helm +maintainers: + - name: calypr + url: https://github.com/calypr diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/README.md b/helm/argo-stack/overlays/ingress-authz-overlay/README.md new file mode 100644 index 00000000..02a89464 --- /dev/null +++ 
b/helm/argo-stack/overlays/ingress-authz-overlay/README.md @@ -0,0 +1,65 @@ +# Ingress AuthZ Overlay + +A Helm overlay chart providing unified, path-based ingress with centralized authorization for multi-tenant Argo Stack deployments. + +## Overview + +This overlay provides a **single host, path-based ingress** for all major UIs and APIs: + +| Path | Service | Description | +|------|---------|-------------| +| `/workflows` | Argo Workflows Server | Workflow UI (port 2746) | +| `/applications` | Argo CD Server | GitOps applications UI (port 8080) | +| `/registrations` | GitHub EventSource | Repository registration events (port 12000) | +| `/api` | Calypr API | Platform API service (port 3000) | +| `/tenants` | Calypr Tenants | Tenant portal (port 3001) | + +All endpoints are protected by the `authz-adapter` via NGINX external authentication. + +## Quick Start + +```bash +# Install the overlay +helm upgrade --install ingress-authz-overlay \ + helm/argo-stack/overlays/ingress-authz-overlay \ + --namespace argo-stack \ + --create-namespace + +# With custom host +helm upgrade --install ingress-authz-overlay \ + helm/argo-stack/overlays/ingress-authz-overlay \ + --namespace argo-stack \ + --set ingressAuthzOverlay.host=my-domain.example.com +``` + +## Configuration + +See [`values.yaml`](values.yaml) for all configurable options. + +Key settings: + +```yaml +ingressAuthzOverlay: + enabled: true + host: calypr-demo.ddns.net + tls: + enabled: true + secretName: calypr-demo-tls + clusterIssuer: letsencrypt-prod +``` + +## Documentation + +- [User Guide](docs/authz-ingress-user-guide.md) - Complete installation and configuration guide +- [Acceptance Tests](tests/authz-ingress.feature) - Gherkin-style test scenarios + +## Architecture + +See the [User Guide](docs/authz-ingress-user-guide.md) for architecture diagrams and detailed flow descriptions. 
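+
+At a glance, every route is guarded by NGINX external auth before traffic reaches a backend. The annotation names below are standard ingress-nginx annotations; the adapter URL, sign-in path, and header list are illustrative assumptions, not this chart's literal output — inspect the rendered Ingress for the actual values:
+
+```yaml
+# Sketch of the external-auth wiring (adapter service name and paths assumed)
+nginx.ingress.kubernetes.io/auth-url: "http://authz-adapter.argo-stack.svc.cluster.local/check"
+# Where unauthenticated users are sent, per the flow above
+nginx.ingress.kubernetes.io/auth-signin: "https://$host/tenants/login"
+# Identity headers forwarded from the adapter to backend services
+nginx.ingress.kubernetes.io/auth-response-headers: "X-User, X-Email, X-Groups"
+```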
+ +## Requirements + +- Kubernetes 1.19+ +- Helm 3.x +- NGINX Ingress Controller +- cert-manager (for TLS) diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md b/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md new file mode 100644 index 00000000..a54d5aed --- /dev/null +++ b/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md @@ -0,0 +1,367 @@ +# Authz-Aware Ingress Overlay User Guide + +## Overview + +The `ingress-authz-overlay` is a Helm overlay chart that provides a unified, path-based ingress layer for all major UIs and APIs in the Argo Stack. It centralizes authorization through the `authz-adapter` service, ensuring consistent access control across all endpoints. + +## Features + +- **Single Host**: All services exposed on one HTTPS hostname +- **Path-Based Routing**: Clean URL structure (`/workflows`, `/applications`, `/api`, etc.) +- **Centralized Authorization**: All routes protected by `authz-adapter` via NGINX external auth +- **TLS via cert-manager**: Automatic Let's Encrypt certificate management +- **Multi-Tenant Support**: User, email, and group headers passed to backend services +- **Drop-In Deployment**: Simple Helm overlay that can be enabled or disabled per environment + +## Architecture + +```mermaid +sequenceDiagram + participant User + participant Ingress as NGINX Ingress + participant AuthzAdapter as authz-adapter + participant Workflows as Argo Workflows + participant Applications as Argo CD + participant Registrations as Event Source + participant Api as Calypr API + participant Tenants as Calypr Tenants + + User->>Ingress: HTTPS GET /path + Ingress->>AuthzAdapter: auth-url check + AuthzAdapter-->>Ingress: Allow or Deny + alt Allowed + Note over Ingress: Route based on path + Ingress->>Workflows: /workflows... + Ingress->>Applications: /applications... + Ingress->>Registrations: /registrations... + Ingress->>Api: /api... 
+ Ingress->>Tenants: /tenants... + else Denied + Ingress-->>User: Redirect to /tenants/login + end +``` + +## Routes + +| Path | Service | Port | Namespace | Description | +|------|---------|------|-----------|-------------| +| `/workflows` | `argo-stack-argo-workflows-server` | 2746 | `argo-stack` | Argo Workflows UI | +| `/applications` | `argo-stack-argocd-server` | 8080 | `argo-stack` | Argo CD Applications UI | +| `/registrations` | `github-repo-registrations-eventsource-svc` | 12000 | `argo-stack` | GitHub Repo Registration Events | +| `/api` | `calypr-api` | 3000 | `calypr-api` | Calypr API Service | +| `/tenants` | `calypr-tenants` | 3001 | `calypr-tenants` | Calypr Tenant Portal | + +## TLS with Let's Encrypt and cert-manager + +This overlay uses [cert-manager](https://cert-manager.io/) to automatically provision and renew TLS certificates from [Let's Encrypt](https://letsencrypt.org/). + +### How It Works + +```mermaid +sequenceDiagram + participant Ingress as Ingress Resource + participant CM as cert-manager + participant LE as Let's Encrypt + participant DNS as DNS Provider + + Note over Ingress: Created with annotation:
cert-manager.io/cluster-issuer: letsencrypt-prod + Ingress->>CM: Ingress triggers Certificate request + CM->>LE: Request certificate for domain + LE->>CM: ACME challenge (HTTP-01 or DNS-01) + CM->>DNS: Prove domain ownership + DNS-->>LE: Challenge verified + LE-->>CM: Issue certificate + CM->>Ingress: Store cert in TLS Secret + Note over Ingress: HTTPS now available +``` + +### ClusterIssuer: letsencrypt-prod + +The `letsencrypt-prod` ClusterIssuer is a cluster-wide cert-manager resource that defines how to obtain certificates from Let's Encrypt's production API. + +**Prerequisites**: You must create the ClusterIssuer before deploying this overlay: + +```yaml +apiVersion: cert-manager.io/v1 +kind: ClusterIssuer +metadata: + name: letsencrypt-prod +spec: + acme: + # Let's Encrypt production API endpoint + server: https://acme-v02.api.letsencrypt.org/directory + + # Email for certificate expiration notifications + email: your-email@example.com + + # Secret to store the ACME account private key + privateKeySecretRef: + name: letsencrypt-prod-account-key + + # HTTP-01 challenge solver using ingress + solvers: + - http01: + ingress: + class: nginx +``` + +**Apply the ClusterIssuer**: + +```bash +kubectl apply -f cluster-issuer.yaml +``` + +### Configuration Options + +| Setting | Description | Default | +|---------|-------------|---------| +| `tls.enabled` | Enable TLS for ingress | `true` | +| `tls.secretName` | Name of the TLS Secret (auto-created by cert-manager) | `calypr-demo-tls` | +| `tls.clusterIssuer` | Name of the ClusterIssuer to use | `letsencrypt-prod` | + +### Using letsencrypt-staging (for Testing) + +For testing, use the staging issuer to avoid Let's Encrypt rate limits: + +```yaml +apiVersion: cert-manager.io/v1 +kind: ClusterIssuer +metadata: + name: letsencrypt-staging +spec: + acme: + server: https://acme-staging-v02.api.letsencrypt.org/directory + email: your-email@example.com + privateKeySecretRef: + name: letsencrypt-staging-account-key + solvers: 
+ - http01: + ingress: + class: nginx +``` + +Then configure the overlay to use it: + +```yaml +ingressAuthzOverlay: + tls: + clusterIssuer: letsencrypt-staging +``` + +### Verifying Certificate Status + +Check if the certificate was issued successfully: + +```bash +# Check Certificate resource +kubectl get certificate -n argo-stack + +# Check certificate details +kubectl describe certificate -n argo-stack + +# Check the TLS secret +kubectl get secret calypr-demo-tls -n argo-stack +``` + +### Troubleshooting Certificates + +If the certificate is not being issued: + +```bash +# Check cert-manager logs +kubectl logs -n cert-manager -l app=cert-manager + +# Check Certificate status +kubectl describe certificate -n argo-stack + +# Check CertificateRequest +kubectl get certificaterequest -n argo-stack + +# Check ACME challenges +kubectl get challenges -A +``` + +Common issues: +- **Domain not reachable**: Ensure your domain's DNS points to the ingress controller's external IP +- **Rate limited**: Use `letsencrypt-staging` for testing to avoid production rate limits +- **Challenge failed**: Check that port 80 is accessible for HTTP-01 challenges + +## Installation + +### Prerequisites + +- Kubernetes cluster with NGINX Ingress Controller +- cert-manager installed and configured with a ClusterIssuer (e.g., `letsencrypt-prod`) +- Helm 3.x + +### Install the Overlay + +```bash +# Install with default values +helm upgrade --install ingress-authz-overlay \ + helm/argo-stack/overlays/ingress-authz-overlay \ + --namespace argo-stack \ + --create-namespace + +# Install with custom host +helm upgrade --install ingress-authz-overlay \ + helm/argo-stack/overlays/ingress-authz-overlay \ + --namespace argo-stack \ + --set ingressAuthzOverlay.host=my-domain.example.com \ + --set ingressAuthzOverlay.tls.secretName=my-domain-tls +``` + +### Integrate with Parent Chart + +Alternatively, add the values to your main `argo-stack` deployment: + +```bash +helm upgrade --install argo-stack \ + 
helm/argo-stack \ + --values helm/argo-stack/values.yaml \ + --set ingressAuthzOverlay.enabled=true +``` + +## Configuration + +### Basic Configuration + +```yaml +ingressAuthzOverlay: + enabled: true + host: calypr-demo.ddns.net + tls: + enabled: true + secretName: calypr-demo-tls + clusterIssuer: letsencrypt-prod +``` + +### AuthZ Adapter Configuration + +```yaml +ingressAuthzOverlay: + authzAdapter: + # Disable if authz-adapter is deployed separately + deploy: true + + # Service location + serviceName: authz-adapter + namespace: argo-stack + port: 8080 + path: /check + + # Sign-in redirect URL + signinUrl: https://calypr-demo.ddns.net/tenants/login + + # Headers passed from auth response to backends + responseHeaders: "X-User,X-Email,X-Groups" + + # Environment configuration + env: + fenceBase: "https://calypr-dev.ohsu.edu/user" +``` + +### Custom Routes + +Add or modify routes as needed: + +```yaml +ingressAuthzOverlay: + routes: + # Custom route example + myservice: + enabled: true + namespace: my-namespace + service: my-service + port: 8000 + pathPrefix: /myservice + useRegex: true + rewriteTarget: /$2 +``` + +### Disabling a Route + +```yaml +ingressAuthzOverlay: + routes: + registrations: + enabled: false +``` + +## Authorization Flow + +1. **User Request**: Client sends HTTPS request to the ingress host +2. **External Auth**: NGINX Ingress calls the `authz-adapter` `/check` endpoint +3. **Token Validation**: `authz-adapter` validates the Authorization header against Fence/OIDC +4. **Group Assignment**: User is assigned groups based on their permissions (e.g., `argo-runner`, `argo-viewer`) +5. **Response Headers**: On success, user info headers are added to the request +6. **Routing**: Request is forwarded to the appropriate backend service +7. 
**Denial**: On failure, user is redirected to the sign-in URL + +### Auth Response Headers + +The following headers are passed to backend services on successful authentication: + +| Header | Description | +|--------|-------------| +| `X-Auth-Request-User` | Username or email of the authenticated user | +| `X-Auth-Request-Email` | Email address of the user | +| `X-Auth-Request-Groups` | Comma-separated list of groups | +| `X-User` | Alias for X-Auth-Request-User | +| `X-Email` | Alias for X-Auth-Request-Email | +| `X-Groups` | Alias for X-Auth-Request-Groups | + +## Troubleshooting + +### Check Ingress Status + +```bash +kubectl get ingress -A -l app.kubernetes.io/name=ingress-authz-overlay +``` + +### Check AuthZ Adapter + +```bash +# Logs +kubectl logs -n argo-stack -l app=authz-adapter + +# Test health endpoint +kubectl port-forward -n argo-stack svc/authz-adapter 8080:8080 & +curl http://localhost:8080/healthz +``` + +### Test Authentication + +```bash +# Should redirect to login +curl -I https://calypr-demo.ddns.net/workflows + +# With valid token (should return 200) +curl -I -H "Authorization: Bearer $TOKEN" https://calypr-demo.ddns.net/workflows +``` + +### Common Issues + +1. **502 Bad Gateway**: AuthZ adapter not reachable + - Check authz-adapter deployment is running + - Verify service selector matches pod labels + +2. **503 Service Unavailable**: Backend service not available + - Check target service exists in the specified namespace + - Verify service port matches configuration + +3. 
**Redirect Loop**: Auth signin URL misconfigured + - Ensure `/tenants/login` path is accessible + - Check signinUrl matches actual login endpoint + +## Uninstall + +```bash +helm uninstall ingress-authz-overlay -n argo-stack +``` + +## Related Documentation + +- [Argo Stack User Guide](../../docs/user-guide.md) +- [Tenant Onboarding Guide](../../docs/tenant-onboarding.md) +- [Repo Registration Guide](../../docs/repo-registration-guide.md) diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/templates/_helpers.tpl b/helm/argo-stack/overlays/ingress-authz-overlay/templates/_helpers.tpl new file mode 100644 index 00000000..e8f2468d --- /dev/null +++ b/helm/argo-stack/overlays/ingress-authz-overlay/templates/_helpers.tpl @@ -0,0 +1,72 @@ +{{/* +Expand the name of the chart. +*/}} +{{- define "ingress-authz-overlay.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Create a default fully qualified app name. +*/}} +{{- define "ingress-authz-overlay.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{/* +Create chart name and version as used by the chart label. +*/}} +{{- define "ingress-authz-overlay.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Common labels +*/}} +{{- define "ingress-authz-overlay.labels" -}} +helm.sh/chart: {{ include "ingress-authz-overlay.chart" . }} +{{ include "ingress-authz-overlay.selectorLabels" . 
}} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + +{{/* +Selector labels +*/}} +{{- define "ingress-authz-overlay.selectorLabels" -}} +app.kubernetes.io/name: {{ include "ingress-authz-overlay.name" . }} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- end }} + +{{/* +Create the auth-url for NGINX ingress external auth. +*/}} +{{- define "ingress-authz-overlay.authUrl" -}} +{{- $adapter := .Values.ingressAuthzOverlay.authzAdapter -}} +http://{{ $adapter.serviceName }}.{{ $adapter.namespace }}.svc.cluster.local:{{ $adapter.port }}{{ $adapter.path }} +{{- end }} + +{{/* +Create common ingress annotations for NGINX external auth. +*/}} +{{- define "ingress-authz-overlay.authAnnotations" -}} +nginx.ingress.kubernetes.io/auth-url: {{ include "ingress-authz-overlay.authUrl" . | quote }} +nginx.ingress.kubernetes.io/auth-method: "GET" +nginx.ingress.kubernetes.io/auth-signin: {{ .Values.ingressAuthzOverlay.authzAdapter.signinUrl | quote }} +nginx.ingress.kubernetes.io/auth-response-headers: {{ .Values.ingressAuthzOverlay.authzAdapter.responseHeaders | quote }} +nginx.ingress.kubernetes.io/auth-snippet: | + proxy_set_header Authorization $http_authorization; + proxy_set_header X-Original-URI $request_uri; + proxy_set_header X-Original-Method $request_method; + proxy_set_header X-Forwarded-Host $host; +{{- end }} diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/templates/authz-adapter.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/templates/authz-adapter.yaml new file mode 100644 index 00000000..194c1ea8 --- /dev/null +++ b/helm/argo-stack/overlays/ingress-authz-overlay/templates/authz-adapter.yaml @@ -0,0 +1,107 @@ +{{/* +AuthZ Adapter Deployment and Service for the ingress-authz-overlay. +The authz-adapter provides external authentication for NGINX Ingress, +validating tokens and returning user/group information. 
+*/}} +{{- if and .Values.ingressAuthzOverlay.enabled .Values.ingressAuthzOverlay.authzAdapter.deploy }} +{{- $config := .Values.ingressAuthzOverlay }} +{{- $adapter := $config.authzAdapter }} +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ $adapter.serviceName }} + namespace: {{ $adapter.namespace }} + labels: + {{- include "ingress-authz-overlay.labels" . | nindent 4 }} + app.kubernetes.io/component: authz-adapter + app: {{ $adapter.serviceName }} + annotations: + meta.helm.sh/release-name: {{ .Release.Name }} + meta.helm.sh/release-namespace: {{ .Release.Namespace }} +spec: + replicas: {{ $adapter.replicas | default 2 }} + selector: + matchLabels: + app: {{ $adapter.serviceName }} + app.kubernetes.io/instance: {{ .Release.Name }} + template: + metadata: + labels: + app: {{ $adapter.serviceName }} + {{- include "ingress-authz-overlay.selectorLabels" . | nindent 8 }} + app.kubernetes.io/component: authz-adapter + spec: + {{- with $adapter.securityContext }} + securityContext: + {{- toYaml . 
| nindent 8 }} + {{- end }} + containers: + - name: authz-adapter + image: {{ $adapter.image }} + imagePullPolicy: IfNotPresent + securityContext: + allowPrivilegeEscalation: false + readOnlyRootFilesystem: true + capabilities: + drop: + - ALL + ports: + - name: http + containerPort: {{ $adapter.port }} + protocol: TCP + env: + - name: FENCE_BASE + value: {{ $adapter.env.fenceBase | quote }} + - name: TENANT_LOGIN_PATH + value: {{ $adapter.env.tenantLoginPath | default "/tenants/login" | quote }} + - name: HTTP_TIMEOUT + value: {{ $adapter.env.httpTimeout | default "3.0" | quote }} + {{- if $adapter.env.gitappBaseUrl }} + - name: GITAPP_BASE_URL + value: {{ $adapter.env.gitappBaseUrl | quote }} + {{- end }} + livenessProbe: + httpGet: + path: /healthz + port: http + initialDelaySeconds: 5 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 3 + readinessProbe: + httpGet: + path: /healthz + port: http + initialDelaySeconds: 3 + periodSeconds: 5 + timeoutSeconds: 2 + failureThreshold: 2 + {{- with $adapter.resources }} + resources: + {{- toYaml . | nindent 12 }} + {{- end }} +--- +apiVersion: v1 +kind: Service +metadata: + name: {{ $adapter.serviceName }} + namespace: {{ $adapter.namespace }} + labels: + {{- include "ingress-authz-overlay.labels" . 
| nindent 4 }} + app.kubernetes.io/component: authz-adapter + app: {{ $adapter.serviceName }} + annotations: + meta.helm.sh/release-name: {{ .Release.Name }} + meta.helm.sh/release-namespace: {{ .Release.Namespace }} +spec: + type: ClusterIP + selector: + app: {{ $adapter.serviceName }} + app.kubernetes.io/instance: {{ .Release.Name }} + ports: + - name: http + port: {{ $adapter.port }} + targetPort: http + protocol: TCP +{{- end }} diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml new file mode 100644 index 00000000..6f81fae3 --- /dev/null +++ b/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml @@ -0,0 +1,63 @@ +{{/* +Ingress resources for each route in the ingress-authz-overlay. +Each route creates a separate Ingress resource in its respective namespace, +all sharing the same host and TLS configuration. +All routes are protected by the authz-adapter via NGINX external auth. +*/}} +{{- if .Values.ingressAuthzOverlay.enabled }} +{{- $root := . 
}} +{{- $config := .Values.ingressAuthzOverlay }} +{{- range $routeName, $route := $config.routes }} +{{- if $route.enabled }} +--- +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: ingress-authz-{{ $routeName }} + namespace: {{ $route.namespace }} + labels: + {{- include "ingress-authz-overlay.labels" $root | nindent 4 }} + app.kubernetes.io/component: ingress + ingress-authz-overlay.calypr.io/route: {{ $routeName | quote }} + annotations: + # Helm release tracking + meta.helm.sh/release-name: {{ $root.Release.Name }} + meta.helm.sh/release-namespace: {{ $root.Release.Namespace }} + # NGINX external auth annotations + {{- include "ingress-authz-overlay.authAnnotations" $root | nindent 4 }} + {{- if $config.tls.enabled }} + # Let's Encrypt / cert-manager integration + cert-manager.io/cluster-issuer: {{ $config.tls.clusterIssuer | quote }} + {{- end }} + {{- if $route.useRegex }} + # Path rewriting for subpath support + nginx.ingress.kubernetes.io/use-regex: "true" + nginx.ingress.kubernetes.io/rewrite-target: {{ $route.rewriteTarget | default "/$2" }} + {{- end }} +spec: + ingressClassName: {{ $config.ingressClassName | default "nginx" }} + {{- if $config.tls.enabled }} + tls: + - hosts: + - {{ $config.host | quote }} + secretName: {{ $config.tls.secretName | quote }} + {{- end }} + rules: + - host: {{ $config.host | quote }} + http: + paths: + {{- if $route.useRegex }} + - path: {{ $route.pathPrefix }}(/|$)(.*) + pathType: ImplementationSpecific + {{- else }} + - path: {{ $route.pathPrefix }} + pathType: Prefix + {{- end }} + backend: + service: + name: {{ $route.service }} + port: + number: {{ $route.port }} +{{- end }} +{{- end }} +{{- end }} diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/tests/authz-ingress.feature b/helm/argo-stack/overlays/ingress-authz-overlay/tests/authz-ingress.feature new file mode 100644 index 00000000..e84c7a8c --- /dev/null +++ b/helm/argo-stack/overlays/ingress-authz-overlay/tests/authz-ingress.feature 
@@ -0,0 +1,72 @@ +Feature: Authz ingress overlay + + Background: + Given the ingress-authz-overlay is installed + And the hostname "calypr-demo.ddns.net" resolves to the ingress endpoint + + Scenario: Unauthenticated user is redirected to login + When I send a GET request to "https://calypr-demo.ddns.net/workflows" + Then the response status should be 302 or 303 + And the "Location" header should contain "/tenants/login" + + Scenario: Authenticated user can access workflows + Given I have a valid session recognized by authz-adapter + When I send a GET request to "https://calypr-demo.ddns.net/workflows" + Then the response status should be 200 + + Scenario: All paths are protected by authz-adapter + When I send a GET request to "https://calypr-demo.ddns.net/applications" without credentials + Then I should be redirected to "/tenants/login" + + When I send a GET request to "https://calypr-demo.ddns.net/registrations" without credentials + Then I should be redirected to "/tenants/login" + + When I send a GET request to "https://calypr-demo.ddns.net/api" without credentials + Then I should be redirected to "/tenants/login" + + When I send a GET request to "https://calypr-demo.ddns.net/tenants" without credentials + Then I should be redirected to "/tenants/login" or served only public content as configured + + Scenario: TLS certificate is valid + When I connect to "https://calypr-demo.ddns.net" + Then the TLS certificate should be issued by "Let's Encrypt" + And the certificate subject alt name should include "calypr-demo.ddns.net" + + Scenario: Routing sends requests to the correct services + Given I am authenticated + When I send a GET request to "https://calypr-demo.ddns.net/workflows" + Then the response should contain an HTML title for the workflows UI + + When I send a GET request to "https://calypr-demo.ddns.net/applications" + Then the response should contain an HTML title for the applications UI + + When I send a GET request to 
"https://calypr-demo.ddns.net/api/health" + Then I should receive a 200 response with a JSON health object from the API + + When I send a GET request to "https://calypr-demo.ddns.net/tenants" + Then I should see the tenant portal landing page or login as configured + + Scenario: Auth response headers are passed to backend + Given I am authenticated with user "test@example.com" in groups "argo-runner,argo-viewer" + When I send a GET request to "https://calypr-demo.ddns.net/api/whoami" + Then the backend should receive header "X-Auth-Request-User" with value "test@example.com" + And the backend should receive header "X-Auth-Request-Groups" with value "argo-runner,argo-viewer" + + Scenario: Path rewriting works correctly + Given I am authenticated + When I send a GET request to "https://calypr-demo.ddns.net/workflows/workflow-details/my-workflow" + Then the Argo Workflows server should receive path "/workflow-details/my-workflow" + + When I send a GET request to "https://calypr-demo.ddns.net/api/v1/users" + Then the Calypr API should receive path "/v1/users" + + Scenario: Health check endpoint is accessible + When I send a GET request to "http://authz-adapter.argo-stack.svc.cluster.local:8080/healthz" + Then the response status should be 200 + And the response body should be "ok" + + Scenario: Multiple simultaneous requests are handled + Given I am authenticated + When I send 10 concurrent GET requests to "https://calypr-demo.ddns.net/workflows" + Then all responses should have status 200 + And the average response time should be less than 500ms diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml new file mode 100644 index 00000000..ed3b71a4 --- /dev/null +++ b/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml @@ -0,0 +1,147 @@ +# ============================================================================ +# Ingress AuthZ Overlay Configuration +# 
============================================================================ +# This overlay provides a single host, path-based ingress layer for all +# major UIs and APIs, protected by a centralized authz-adapter. +# +# Usage: +# helm upgrade --install ingress-authz-overlay \ +# helm/argo-stack/overlays/ingress-authz-overlay \ +# --set ingressAuthzOverlay.enabled=true +# ============================================================================ + +ingressAuthzOverlay: + # Enable or disable the overlay + enabled: true + + # ============================================================================ + # Host and TLS Configuration + # ============================================================================ + # Single host for all path-based routes + host: calypr-demo.ddns.net + + # TLS configuration using cert-manager + tls: + enabled: true + secretName: calypr-demo-tls + clusterIssuer: letsencrypt-prod + + # ============================================================================ + # Ingress Controller Configuration + # ============================================================================ + ingressClassName: nginx + + # ============================================================================ + # AuthZ Adapter Configuration + # ============================================================================ + authzAdapter: + # Enable deployment of authz-adapter (set to false if deployed separately) + deploy: true + + # Service discovery settings + serviceName: authz-adapter + namespace: argo-stack + port: 8080 + + # Auth endpoint path + path: /check + + # Sign-in URL for unauthenticated requests + signinUrl: https://calypr-demo.ddns.net/tenants/login + + # Headers to pass back from auth response + responseHeaders: "X-User,X-Email,X-Groups,X-Auth-Request-User,X-Auth-Request-Email,X-Auth-Request-Groups" + + # Container image for authz-adapter + image: ghcr.io/calypr/argo-helm:latest + + # Number of replicas + replicas: 2 + + # Environment 
configuration for the adapter
+    env:
+      # GitApp/Fence base URL for user info
+      fenceBase: "https://calypr-dev.ohsu.edu/user"
+      # Tenant login path
+      tenantLoginPath: "/tenants/login"
+      # HTTP timeout for auth calls
+      httpTimeout: "3.0"
+
+    # Resource limits and requests
+    resources:
+      requests:
+        cpu: 50m
+        memory: 64Mi
+      limits:
+        cpu: 200m
+        memory: 128Mi
+
+    # Pod security context
+    securityContext:
+      runAsNonRoot: true
+      runAsUser: 1000
+
+  # ============================================================================
+  # Route Definitions
+  # ============================================================================
+  # Each route creates a separate Ingress resource in the specified namespace.
+  # All routes share the same host and TLS configuration.
+  # All routes are protected by the authz-adapter via NGINX external auth.
+  routes:
+    # Argo Workflows UI
+    workflows:
+      enabled: true
+      namespace: argo-stack
+      service: argo-stack-argo-workflows-server
+      port: 2746
+      pathPrefix: /workflows
+      # Use regex path matching for subpaths
+      useRegex: true
+      # Rewrite path to remove prefix
+      rewriteTarget: /$2
+
+    # Argo CD Applications UI
+    applications:
+      enabled: true
+      namespace: argo-stack
+      service: argo-stack-argocd-server
+      port: 8080
+      pathPrefix: /applications
+      useRegex: true
+      rewriteTarget: /$2
+
+    # GitHub Repository Registrations EventSource
+    registrations:
+      enabled: true
+      namespace: argo-stack
+      service: github-repo-registrations-eventsource-svc
+      port: 12000
+      pathPrefix: /registrations
+      useRegex: true
+      rewriteTarget: /$2
+
+    # Calypr API Service
+    api:
+      enabled: true
+      namespace: calypr-api
+      service: calypr-api
+      port: 3000
+      pathPrefix: /api
+      useRegex: true
+      rewriteTarget: /$2
+
+    # Calypr Tenants Service
+    tenants:
+      enabled: true
+      namespace: calypr-tenants
+      service: calypr-tenants
+      port: 3001
+      pathPrefix: /tenants
+      useRegex: true
+      rewriteTarget: /$2
+      # Optional: Allow public access to login endpoints
+      # publicPaths is a list of path prefixes that skip external auth, such as 
/tenants/login + publicPaths: + - /tenants/login + - /tenants/logout + - /tenants/callback diff --git a/helm/argo-stack/values.yaml b/helm/argo-stack/values.yaml index 4215afd3..76275e6b 100644 --- a/helm/argo-stack/values.yaml +++ b/helm/argo-stack/values.yaml @@ -215,6 +215,59 @@ ingressAuth: authURL: "http://authz-adapter.security.svc.cluster.local:8080/check" passAuthorization: true +# ============================================================================ +# Ingress AuthZ Overlay - Unified Path-Based Routing with Centralized Auth +# ============================================================================ +# Enable this overlay to provide a single host, path-based ingress for all +# major UIs and APIs, protected by the authz-adapter. +# See: helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md +# +# To use the overlay, install it separately: +# helm upgrade --install ingress-authz-overlay \ +# helm/argo-stack/overlays/ingress-authz-overlay \ +# --values helm/argo-stack/values.yaml \ +# --set ingressAuthzOverlay.enabled=true + +ingressAuthzOverlay: + enabled: false + host: calypr-demo.ddns.net + tls: + secretName: calypr-demo-tls + clusterIssuer: letsencrypt-prod + authzAdapter: + serviceName: authz-adapter + namespace: argo-stack + port: 8080 + path: /check + signinUrl: https://calypr-demo.ddns.net/tenants/login + responseHeaders: X-User, X-Email, X-Groups + routes: + workflows: + namespace: argo-stack + service: argo-stack-argo-workflows-server + port: 2746 + pathPrefix: /workflows + applications: + namespace: argo-stack + service: argo-stack-argocd-server + port: 8080 + pathPrefix: /applications + registrations: + namespace: argo-stack + service: github-repo-registrations-eventsource-svc + port: 12000 + pathPrefix: /registrations + api: + namespace: calypr-api + service: calypr-api + port: 3000 + pathPrefix: /api + tenants: + namespace: calypr-tenants + service: calypr-tenants + port: 3001 + pathPrefix: /tenants + # 
============================================================================ # Argo CD Applications - Multi-Application Support (REMOVED) # ============================================================================ From c3c31c3576cd2f3d5122f2ad9c8bf774d34d2418 Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Mon, 24 Nov 2025 17:39:23 -0800 Subject: [PATCH 03/19] Document Let's Encrypt ACME account key secrets for cert-manager (#95) * Initial plan * Add documentation for Let's Encrypt ACME account key secrets Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Add cert-manager installation instructions and troubleshooting Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Fix installation order list formatting Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Add troubleshooting for Helm ownership conflict with ClusterIssuer Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Improve Helm ownership conflict documentation clarity Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --- .../overlays/ingress-authz-overlay/README.md | 21 ++- .../docs/authz-ingress-user-guide.md | 138 +++++++++++++++++- 2 files changed, 155 insertions(+), 4 deletions(-) diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/README.md b/helm/argo-stack/overlays/ingress-authz-overlay/README.md index 02a89464..d07fe2c1 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/README.md +++ b/helm/argo-stack/overlays/ingress-authz-overlay/README.md @@ -62,4 +62,23 @@ See the [User Guide](docs/authz-ingress-user-guide.md) for architecture diagrams - Kubernetes 1.19+ - Helm 3.x - NGINX Ingress Controller -- cert-manager (for TLS) +- cert-manager (for TLS) - **must be installed before deploying this overlay** + +### Installing 
cert-manager + +If you see `no matches for kind "ClusterIssuer"`, cert-manager is not installed: + +```bash +# Install cert-manager +helm repo add jetstack https://charts.jetstack.io +helm repo update +helm install cert-manager jetstack/cert-manager \ + --namespace cert-manager \ + --create-namespace \ + --set crds.enabled=true + +# Wait for cert-manager to be ready +kubectl wait --for=condition=Ready pods --all -n cert-manager --timeout=120s +``` + +See the [User Guide](docs/authz-ingress-user-guide.md) for complete setup instructions including ClusterIssuer configuration. diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md b/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md index a54d5aed..4e57dbe7 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md +++ b/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md @@ -55,6 +55,36 @@ sequenceDiagram This overlay uses [cert-manager](https://cert-manager.io/) to automatically provision and renew TLS certificates from [Let's Encrypt](https://letsencrypt.org/). +### Installing cert-manager + +**cert-manager must be installed before creating ClusterIssuers or deploying this overlay.** + +If you see an error like: +``` +no matches for kind "ClusterIssuer" in version "cert-manager.io/v1" +``` +This means cert-manager is not installed. 
Install it first: + +```bash +# Add the Jetstack Helm repository +helm repo add jetstack https://charts.jetstack.io +helm repo update + +# Install cert-manager with CRDs +helm install cert-manager jetstack/cert-manager \ + --namespace cert-manager \ + --create-namespace \ + --set crds.enabled=true + +# Verify cert-manager is running +kubectl get pods -n cert-manager +``` + +Wait for all cert-manager pods to be `Running` before proceeding: +```bash +kubectl wait --for=condition=Ready pods --all -n cert-manager --timeout=120s +``` + ### How It Works ```mermaid @@ -111,6 +141,58 @@ spec: kubectl apply -f cluster-issuer.yaml ``` +### Understanding the ACME Account Key Secret + +The `privateKeySecretRef` (e.g., `letsencrypt-prod-account-key` or `letsencrypt-staging-account-key`) specifies where cert-manager stores the ACME account private key. **You do NOT need to create this secret manually** โ€” cert-manager handles it automatically. + +#### How It Works + +1. **First-Time Setup**: When you create the ClusterIssuer, cert-manager: + - Generates a new RSA private key + - Registers a new account with Let's Encrypt using your email + - Stores the private key in the specified secret (in the `cert-manager` namespace) + +2. **Secret Location**: The secret is created in the same namespace as cert-manager (typically `cert-manager`): + ```bash + # View the account key secret + kubectl get secret letsencrypt-prod-account-key -n cert-manager + + # Describe to see metadata + kubectl describe secret letsencrypt-prod-account-key -n cert-manager + ``` + +3. **Account Persistence**: The account key persists across cert-manager restarts. As long as the secret exists, cert-manager will reuse the same Let's Encrypt account. 
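+
+For reference, the auto-generated account key secret is an ordinary `Opaque` Secret holding the PEM-encoded private key. A sketch of its shape (the `tls.key` data key matches current cert-manager releases, but exact metadata may vary by version):
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: letsencrypt-prod-account-key
+  namespace: cert-manager
+type: Opaque
+data:
+  # Base64-encoded PEM private key, generated and managed by cert-manager
+  tls.key: <base64-encoded PEM key>
+```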
+ +#### Backing Up the ACME Account Key + +For disaster recovery, you may want to back up the account key: + +```bash +# Export the account key secret +kubectl get secret letsencrypt-prod-account-key -n cert-manager -o yaml > letsencrypt-account-backup.yaml + +# To restore in a new cluster (before creating ClusterIssuer) +kubectl apply -f letsencrypt-account-backup.yaml +``` + +> **Note**: Keep the backup secure โ€” this key provides access to your Let's Encrypt account and all its certificates. + +#### Troubleshooting Account Key Issues + +If the account key secret is not being created: + +```bash +# Check cert-manager controller logs +kubectl logs -n cert-manager -l app.kubernetes.io/component=controller + +# Check ClusterIssuer status +kubectl describe clusterissuer letsencrypt-prod +``` + +Common issues: +- **ACME Registration Failed**: Check your email address is valid and you can reach Let's Encrypt's API +- **Secret Not Found in Expected Namespace**: The secret is created in the cert-manager namespace, not your application namespace + ### Configuration Options | Setting | Description | Default | @@ -140,6 +222,8 @@ spec: class: nginx ``` +> **Note**: The `letsencrypt-staging-account-key` secret is also auto-generated by cert-manager, just like the production key. Staging and production use separate accounts and secrets. + Then configure the overlay to use it: ```yaml @@ -182,17 +266,65 @@ kubectl get challenges -A ``` Common issues: +- **cert-manager not installed**: If you see `no matches for kind "ClusterIssuer"`, install cert-manager first (see [Installing cert-manager](#installing-cert-manager)) +- **Helm ownership conflict**: If you see `invalid ownership metadata; label validation error`, the ClusterIssuer was created outside of Helm. See [Helm Ownership Conflict](#helm-ownership-conflict) below. 
- **Domain not reachable**: Ensure your domain's DNS points to the ingress controller's external IP - **Rate limited**: Use `letsencrypt-staging` for testing to avoid production rate limits - **Challenge failed**: Check that port 80 is accessible for HTTP-01 challenges +### Helm Ownership Conflict + +If you get an error like: +``` +Error: UPGRADE FAILED: Unable to continue with update: ClusterIssuer "letsencrypt-prod" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by" +``` + +This happens when: +1. The ClusterIssuer was created manually with `kubectl apply` +2. A Helm chart template tries to create/manage the same ClusterIssuer + +**Solution**: ClusterIssuers should be managed **outside** of this Helm chart: + +```bash +# Option 1: Keep the manually created ClusterIssuer (recommended) +# Simply don't include cluster-issuer templates in the chart +# This overlay already follows this pattern - it references the ClusterIssuer +# via annotation but doesn't create it + +# Option 2: If you have a local cluster-issuer template file, remove it +rm helm/argo-stack/overlays/ingress-authz-overlay/templates/cluster-issuer*.yaml + +# Option 3: To adopt an existing resource into Helm (advanced) +# Add Helm labels and annotations to the existing ClusterIssuer: +kubectl annotate clusterissuer letsencrypt-prod \ + meta.helm.sh/release-name=ingress-authz-overlay \ + meta.helm.sh/release-namespace=argo-stack +kubectl label clusterissuer letsencrypt-prod \ + app.kubernetes.io/managed-by=Helm +``` + +**Why ClusterIssuers are managed separately**: ClusterIssuers are cluster-scoped resources that affect the entire cluster, not just one namespace. 
Including them in application-specific Helm charts causes conflicts when: +- Multiple applications need the same ClusterIssuer +- The ClusterIssuer already exists (created by a previous deployment or another chart) +- Different teams deploy applications that reference the same issuer + +This chart references the ClusterIssuer via annotation (`cert-manager.io/cluster-issuer`) but leaves its lifecycle management to cluster administrators. + ## Installation ### Prerequisites -- Kubernetes cluster with NGINX Ingress Controller -- cert-manager installed and configured with a ClusterIssuer (e.g., `letsencrypt-prod`) -- Helm 3.x +Before installing this overlay, ensure you have: + +1. **Kubernetes cluster** (1.19+) with NGINX Ingress Controller installed +2. **cert-manager installed** (see [Installing cert-manager](#installing-cert-manager) above) +3. **ClusterIssuer created** (see [ClusterIssuer: letsencrypt-prod](#clusterissuer-letsencrypt-prod)) +4. **Helm 3.x** installed locally + +**Installation Order**: +1. Install cert-manager +2. Create ClusterIssuer +3. 
Install this overlay ### Install the Overlay From 367efe08240e1dd67f84d2da4a7f19ba20b3f6b0 Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Mon, 24 Nov 2025 20:40:05 -0800 Subject: [PATCH 04/19] Fix cert-manager certificate ownership conflict for multi-ingress overlay (#96) * Initial plan * Fix cert-manager certificate ownership conflict by using primary route flag Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --- .../overlays/ingress-authz-overlay/README.md | 21 +++++++++++ .../docs/authz-ingress-user-guide.md | 35 +++++++++++++++++++ .../templates/ingress-authz.yaml | 9 +++-- .../ingress-authz-overlay/values.yaml | 9 ++++- 4 files changed, 71 insertions(+), 3 deletions(-) diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/README.md b/helm/argo-stack/overlays/ingress-authz-overlay/README.md index d07fe2c1..85f90f4d 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/README.md +++ b/helm/argo-stack/overlays/ingress-authz-overlay/README.md @@ -64,6 +64,27 @@ See the [User Guide](docs/authz-ingress-user-guide.md) for architecture diagrams - NGINX Ingress Controller - cert-manager (for TLS) - **must be installed before deploying this overlay** +### TLS Certificate Ownership + +When using cert-manager's ingress-shim, only **one** ingress resource can "own" a Certificate. +This overlay uses a `primary: true` flag on routes to designate which ingress should have the +`cert-manager.io/cluster-issuer` annotation. + +By default, the `workflows` route is set as primary. Other ingresses reference the same TLS +secret but without the cluster-issuer annotation, avoiding the "certificate resource is not +owned by this object" error. 
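A quick way to check this invariant is to count the `cert-manager.io/cluster-issuer` annotations in the rendered manifests and confirm there is exactly one. A minimal sketch (the inline sample stands in for real `helm template` output; the ingress names are illustrative):

```shell
# Exactly one Ingress should carry the cluster-issuer annotation.
# The sample below stands in for rendered chart output.
manifests='
kind: Ingress
metadata:
  name: ingress-authz-workflows
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
---
kind: Ingress
metadata:
  name: ingress-authz-applications
'
# Count annotation occurrences across all rendered Ingress resources
count=$(printf '%s\n' "$manifests" | grep -c 'cert-manager.io/cluster-issuer')
echo "routes with cluster-issuer annotation: $count"   # expect exactly 1
```

Against the real chart, the same count can be taken with `helm template . | grep -c 'cert-manager.io/cluster-issuer'`.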
+ +To change the primary route: + +```yaml +ingressAuthzOverlay: + routes: + workflows: + primary: false # Remove primary from workflows + applications: + primary: true # Make applications the primary +``` + ### Installing cert-manager If you see `no matches for kind "ClusterIssuer"`, cert-manager is not installed: diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md b/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md index 4e57dbe7..e5e919ba 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md +++ b/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md @@ -193,6 +193,41 @@ Common issues: - **ACME Registration Failed**: Check your email address is valid and you can reach Let's Encrypt's API - **Secret Not Found in Expected Namespace**: The secret is created in the cert-manager namespace, not your application namespace +### TLS Certificate Ownership + +When using multiple ingress resources with the same TLS secret and cert-manager's ingress-shim, you may encounter an error: + +``` +certificate resource is not owned by this object. refusing to update non-owned certificate resource +``` + +This happens because **cert-manager only allows one Ingress to own a Certificate**. When multiple ingresses have the `cert-manager.io/cluster-issuer` annotation pointing to the same certificate, a conflict occurs. + +**Solution**: This overlay uses a `primary: true` flag on routes. Only the primary route's Ingress gets the `cert-manager.io/cluster-issuer` annotation. Other ingresses reference the TLS secret but don't trigger certificate creation. + +```yaml +ingressAuthzOverlay: + routes: + workflows: + enabled: true + primary: true # Only this route has cert-manager.io/cluster-issuer annotation + # ... 
+ applications: + enabled: true + # primary: false (default) - uses the TLS secret but doesn't trigger cert creation +``` + +By default, the `workflows` route is primary. To change: + +```yaml +ingressAuthzOverlay: + routes: + workflows: + primary: false + api: + primary: true # Move certificate ownership to /api route +``` + ### Configuration Options | Setting | Description | Default | diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml index 6f81fae3..a056aef2 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml +++ b/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml @@ -3,6 +3,11 @@ Ingress resources for each route in the ingress-authz-overlay. Each route creates a separate Ingress resource in its respective namespace, all sharing the same host and TLS configuration. All routes are protected by the authz-adapter via NGINX external auth. + +NOTE: Only the route with primary: true should have the cert-manager.io/cluster-issuer +annotation. Other routes just reference the TLS secret without the annotation to avoid +cert-manager ownership conflicts. If no route has primary: true, no ingress will have +the cluster-issuer annotation (the Certificate must be created manually or by another means). */}} {{- if .Values.ingressAuthzOverlay.enabled }} {{- $root := . 
}} @@ -25,8 +30,8 @@ metadata: meta.helm.sh/release-namespace: {{ $root.Release.Namespace }} # NGINX external auth annotations {{- include "ingress-authz-overlay.authAnnotations" $root | nindent 4 }} - {{- if $config.tls.enabled }} - # Let's Encrypt / cert-manager integration + {{- if and $config.tls.enabled $route.primary }} + # Let's Encrypt / cert-manager integration (only on primary route to avoid ownership conflicts) cert-manager.io/cluster-issuer: {{ $config.tls.clusterIssuer | quote }} {{- end }} {{- if $route.useRegex }} diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml index ed3b71a4..301e2056 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml +++ b/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml @@ -87,10 +87,17 @@ ingressAuthzOverlay: # Each route creates a separate Ingress resource in the specified namespace. # All routes share the same host and TLS configuration. # All routes are protected by the authz-adapter via NGINX external auth. + # + # IMPORTANT: Only ONE route should have the cert-manager.io/cluster-issuer annotation + # to avoid "certificate resource is not owned by this object" errors. + # Set `primary: true` on exactly one route to designate it as the certificate owner. 
routes: - # Argo Workflows UI + # Argo Workflows UI (primary route - manages TLS certificate) workflows: enabled: true + # Set primary: true to designate this route as the certificate owner + # Only the primary route gets the cert-manager.io/cluster-issuer annotation + primary: true namespace: argo-stack service: argo-stack-argo-workflows-server port: 2746 From 0a59706545714b83cae6451b1b83b762fb8c047a Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Mon, 24 Nov 2025 22:05:09 -0800 Subject: [PATCH 05/19] Configure ingress-authz-overlay to use centralized authz-adapter (#97) * Initial plan * Configure overlay to use centralized authz-adapter in security namespace Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --- .../overlays/ingress-authz-overlay/README.md | 34 +++++++++++++++++++ .../docs/authz-ingress-user-guide.md | 23 ++++++++++--- .../ingress-authz-overlay/values.yaml | 10 ++++-- helm/argo-stack/values.yaml | 4 ++- 4 files changed, 62 insertions(+), 9 deletions(-) diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/README.md b/helm/argo-stack/overlays/ingress-authz-overlay/README.md index 85f90f4d..0b0ead6e 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/README.md +++ b/helm/argo-stack/overlays/ingress-authz-overlay/README.md @@ -16,6 +16,40 @@ This overlay provides a **single host, path-based ingress** for all major UIs an All endpoints are protected by the `authz-adapter` via NGINX external authentication. +## AuthZ Adapter Configuration + +**Important**: By default, this overlay does **not** deploy its own authz-adapter. Instead, it reuses the centralized authz-adapter deployed by the main `argo-stack` chart in the `security` namespace. 
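When debugging, it helps to know the exact auth URL these values imply, since NGINX resolves the adapter through in-cluster DNS. A small sketch (shell variable names are illustrative; the values mirror the defaults above):

```shell
# Values mirroring the authzAdapter defaults shown above
serviceName=authz-adapter
namespace=security
port=8080
path=/check

# The NGINX auth-url annotation resolves to the in-cluster DNS name:
auth_url="http://${serviceName}.${namespace}.svc.cluster.local:${port}${path}"
echo "$auth_url"
```

The resulting URL should match the `auth-url` value NGINX reports; from a debug pod you can curl the adapter's `/healthz` endpoint at the same host and port to confirm it is reachable.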
+ +### Default Configuration (Recommended) + +The overlay is configured to use the existing authz-adapter in the `security` namespace: + +```yaml +ingressAuthzOverlay: + authzAdapter: + deploy: false # Do NOT deploy a separate adapter + namespace: security # Point to security namespace + serviceName: authz-adapter + port: 8080 +``` + +This ensures a single, centralized authz-adapter handles authentication for all ingress routes. + +### Deploying a Separate Adapter (Advanced) + +If you need the overlay to deploy its own authz-adapter instance: + +```yaml +ingressAuthzOverlay: + authzAdapter: + deploy: true # Deploy a separate adapter + namespace: argo-stack # In the overlay's namespace + serviceName: authz-adapter + port: 8080 +``` + +**Note**: Having multiple authz-adapter instances may cause configuration drift and is not recommended. + ## Quick Start ```bash diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md b/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md index e5e919ba..ab324591 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md +++ b/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md @@ -405,15 +405,17 @@ ingressAuthzOverlay: ### AuthZ Adapter Configuration +By default, this overlay does **not** deploy its own authz-adapter. 
It reuses the centralized authz-adapter deployed by the main `argo-stack` chart in the `security` namespace: + ```yaml ingressAuthzOverlay: authzAdapter: - # Disable if authz-adapter is deployed separately - deploy: true + # Use centralized adapter from security namespace (recommended) + deploy: false - # Service location + # Service location (points to main argo-stack adapter) serviceName: authz-adapter - namespace: argo-stack + namespace: security port: 8080 path: /check @@ -422,8 +424,19 @@ ingressAuthzOverlay: # Headers passed from auth response to backends responseHeaders: "X-User,X-Email,X-Groups" +``` + +If you need to deploy a separate authz-adapter instance (not recommended): + +```yaml +ingressAuthzOverlay: + authzAdapter: + deploy: true # Deploy a separate adapter + namespace: argo-stack # In overlay's namespace + serviceName: authz-adapter + port: 8080 - # Environment configuration + # Environment configuration (only used when deploy: true) env: fenceBase: "https://calypr-dev.ohsu.edu/user" ``` diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml index 301e2056..40df467d 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml +++ b/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml @@ -35,12 +35,16 @@ ingressAuthzOverlay: # AuthZ Adapter Configuration # ============================================================================ authzAdapter: - # Enable deployment of authz-adapter (set to false if deployed separately) - deploy: true + # Enable deployment of authz-adapter (set to false if using centralized adapter) + # NOTE: By default, the main argo-stack chart deploys authz-adapter to the + # 'security' namespace. Set deploy: false to reuse that instance. + deploy: false # Service discovery settings + # NOTE: When deploy: false, ensure these point to the existing authz-adapter + # deployed by the main argo-stack chart in the 'security' namespace. 
  serviceName: authz-adapter
-  namespace: argo-stack
+  namespace: security
   port: 8080
 
   # Auth endpoint path
diff --git a/helm/argo-stack/values.yaml b/helm/argo-stack/values.yaml
index 301e2056..40df467d 100644
--- a/helm/argo-stack/values.yaml
+++ b/helm/argo-stack/values.yaml
@@ -235,8 +235,10 @@ ingressAuthzOverlay:
       secretName: calypr-demo-tls
       clusterIssuer: letsencrypt-prod
   authzAdapter:
+    # Use centralized adapter from security namespace
+    deploy: false
     serviceName: authz-adapter
-    namespace: argo-stack
+    namespace: security
     port: 8080
     path: /check
     signinUrl: https://calypr-demo.ddns.net/tenants/login

From f599ffa68c543d0df8405bdb07626029591ebca5 Mon Sep 17 00:00:00 2001
From: Brian Walsh
Date: Tue, 25 Nov 2025 06:13:31 -0800
Subject: [PATCH 06/19] expected logs

---
 .../docs/authz-ingress-user-guide.md          | 39 +++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md b/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md
index ab324591..d97c5989 100644
--- a/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md
+++ b/helm/argo-stack/overlays/ingress-authz-overlay/docs/authz-ingress-user-guide.md
@@ -540,6 +540,45 @@ curl -I -H "Authorization: Bearer $TOKEN" https://calypr-demo.ddns.net/workflows
 helm uninstall ingress-authz-overlay -n argo-stack
 ```
 
+## Transient startup logs
+
+### Issue
+When the stack starts up, you may see transient log messages like this one from the External Secrets Operator:
+
+```
+external-secrets-system external-secrets-cert-controller-5f8b8994d5-vrzmj cert-controller {"level":"error","ts":1764050879.7977011,"logger":"controllers.webhook-certs-updater","msg":"could not update webhook config","Webhookconfig":{"name":"secretstore-validate"},"error":"ca cert not yet ready","stacktrace":"github.com/external-secrets/external-secrets/pkg/controllers/webhookconfig.(*Reconciler).Reconcile\n\t/home/runner/work/external-secrets/external-
+```
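When scanning startup logs, it can help to filter this known-transient message out so that genuine errors stand out. A minimal sketch (the sample line abbreviates the log entry above):

```shell
# Sample line abbreviating the transient ESO error shown above
log='{"level":"error","logger":"controllers.webhook-certs-updater","error":"ca cert not yet ready"}'

# Drop the known-transient message; anything left over is worth investigating.
remaining=$(printf '%s\n' "$log" | grep -v 'ca cert not yet ready' || true)
if [ -z "$remaining" ]; then
  echo "only the known transient startup error was present"
fi
```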
+
+### Short Answer
+
+**Yes, this is a transient and harmless startup condition.**
+It occurs when the **cert-controller** tries to update the validating/mutating webhook configuration *before* the internal CA bundle has been generated.
+
+ESO keeps retrying until the CA is ready, then the message disappears.
+
+### Why This Happens
+
+External Secrets Operator uses an **internal self-signed CA** to secure:
+
+* The validating webhook
+* The mutating webhook
+* The admission controller
+
+On startup, the control plane usually initializes in this order:
+
+1. Pod starts
+2. Cert controller initializes
+3. Webhook server generates or fetches CA bundle
+4. Cert controller tries to patch webhook config
+5. **If the CA is not yet ready, it logs "ca cert not yet ready"**
+6. The retry loop resolves it once the CA is created
+
+ESO's cert controller reconciles every few seconds until successful.
+
 ## Related Documentation
 
 - [Argo Stack User Guide](../../docs/user-guide.md)

From 3bee6a317edae8b27dcda0ee3915ed79e76b6ece Mon Sep 17 00:00:00 2001
From: Brian Walsh
Date: Tue, 25 Nov 2025 04:40:56 +0000
Subject: [PATCH 07/19] WIP: testing

---
 Makefile                                      | 26 ++++++++++++-------
 .../cluster-issuer-letsencrypt.yaml           | 21 +++++++++++++++
 helm/argo-stack/values.yaml                   |  6 +++--
 3 files changed, 42 insertions(+), 11 deletions(-)
 create mode 100644 helm/argo-stack/overlays/ingress-authz-overlay/cluster-issuer-letsencrypt.yaml

diff --git a/Makefile b/Makefile
index add12e94..54a766d7 100644
--- a/Makefile
+++ b/Makefile
@@ -143,7 +143,7 @@ argo-stack:
 	argo-stack ./helm/argo-stack -n argocd --create-namespace \
 	--wait --atomic \
 	--set-string events.github.webhook.ingress.hosts[0]=${ARGO_HOSTNAME} \
-	--set-string events.github.webhook.url=http://${ARGO_HOSTNAME}:12000 \
+	--set-string events.github.webhook.url=https://${ARGO_HOSTNAME}/registrations \
 	--set-string s3.enabled=${S3_ENABLED} \
 	--set-string s3.bucket=${S3_BUCKET} \
 	--set-string s3.pathStyle=true \
@@ -154,14
+154,22 @@ argo-stack:
 
 deploy: init argo-stack docker-install ports
 
 ports:
-	echo waiting for pods
-	sleep 10
-	kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=argocd-server --timeout=120s -n argocd
-	echo starting port forwards
-	kubectl port-forward svc/argo-stack-argo-workflows-server 2746:2746 --address=0.0.0.0 -n argo-workflows &
-	kubectl port-forward svc/argo-stack-argocd-server 8080:443 --address=0.0.0.0 -n argocd &
-	kubectl port-forward svc/github-repo-registrations-eventsource-svc 12000:12000 --address=0.0.0.0 -n argo-events &
-	echo UIs available on port 2746 and port 8080, event exposed on 12000
+	# Add the Jetstack Helm repository
+	helm repo add jetstack https://charts.jetstack.io
+	helm repo update
+	# Install cert-manager with CRDs
+	helm install cert-manager jetstack/cert-manager \
+		--namespace cert-manager \
+		--create-namespace \
+		--set crds.enabled=true
+	# Wait for them to come up
+	kubectl wait --for=condition=Ready pods --all -n cert-manager --timeout=120s
+	#
+	kubectl apply -f helm/argo-stack/overlays/ingress-authz-overlay/cluster-issuer-letsencrypt.yaml
+	helm upgrade --install ingress-authz-overlay \
+		helm/argo-stack/overlays/ingress-authz-overlay \
+		--namespace argo-stack \
+		--set ingressAuthzOverlay.host=${ARGO_HOSTNAME}
 
 adapter:
 	cd authz-adapter && python3 -m pip install -r requirements.txt pytest && pytest -q
diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/cluster-issuer-letsencrypt.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/cluster-issuer-letsencrypt.yaml
new file mode 100644
index 00000000..67be9c36
--- /dev/null
+++ b/helm/argo-stack/overlays/ingress-authz-overlay/cluster-issuer-letsencrypt.yaml
@@ -0,0 +1,21 @@
+apiVersion: cert-manager.io/v1
+kind: ClusterIssuer
+metadata:
+  name: letsencrypt-staging
+spec:
+  acme:
+    # Let's Encrypt staging API endpoint
+    server: https://acme-staging-v02.api.letsencrypt.org/directory
+
+    # Email for certificate expiration notifications
+    email:
brian@bwalsh.com + + # Secret to store the ACME account private key + privateKeySecretRef: + name: letsencrypt-staging-account-key + + # HTTP-01 challenge solver using ingress + solvers: + - http01: + ingress: + class: nginx diff --git a/helm/argo-stack/values.yaml b/helm/argo-stack/values.yaml index 07b6d70b..cc880317 100644 --- a/helm/argo-stack/values.yaml +++ b/helm/argo-stack/values.yaml @@ -8,9 +8,11 @@ namespaces: argo: argo-workflows argocd: argocd - tenant: wf-poc + calypr-tenants: calypr-tenants security: security argo-events: argo-events + argo-stack: argo-stack + calypr-api: calypr-api # ============================================================================ # External Secrets Operator (ESO) and Vault Integration @@ -143,7 +145,7 @@ argo-workflows: - --auth-mode=server controller: workflowNamespaces: - - wf-poc + - argo-workflows # Ensure controller uses correct namespace namespaceInstallMode: true # Enable log archiving for all workflows From 48f28cd03f60dce46af1139d22b3537b326bcefd Mon Sep 17 00:00:00 2001 From: Brian Walsh Date: Tue, 25 Nov 2025 15:04:06 +0000 Subject: [PATCH 08/19] local address pool --- Makefile | 17 ++++++++++++++++- helm/argo-stack/overlays/ip-address-pool.yaml | 17 +++++++++++++++++ 2 files changed, 33 insertions(+), 1 deletion(-) create mode 100644 helm/argo-stack/overlays/ip-address-pool.yaml diff --git a/Makefile b/Makefile index 54a766d7..510b0b15 100644 --- a/Makefile +++ b/Makefile @@ -154,6 +154,15 @@ argo-stack: deploy: init argo-stack docker-install ports ports: + # MetalLB provides LoadBalancer functionality for bare metal clusters. 
For now, we are not using AWS load balancer + kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.12/config/manifests/metallb-native.yaml + # Wait for MetalLB pods to be ready + kubectl wait --namespace metallb-system \ + --for=condition=ready pod \ + --selector=app=metallb \ + --timeout=90s + # Configure IP Address Pool + kubectl apply -f helm/argo-stack/overlays/ip-address-pool.yaml # Add the Jetstack Helm repository helm repo add jetstack https://charts.jetstack.io helm repo update @@ -164,12 +173,18 @@ ports: --set crds.enabled=true # Wait for them to come up kubectl wait --for=condition=Ready pods --all -n cert-manager --timeout=120s - # + # Start letsencrypt kubectl apply -f helm/argo-stack/overlays/ingress-authz-overlay/cluster-issuer-letsencrypt.yaml helm upgrade --install ingress-authz-overlay \ helm/argo-stack/overlays/ingress-authz-overlay \ --namespace argo-stack \ --set ingressAuthzOverlay.host=${ARGO_HOSTNAME} + # start nginx + helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx + helm repo update + helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \ + -n ingress-nginx --create-namespace \ + --set controller.service.type=LoadBalancer adapter: cd authz-adapter && python3 -m pip install -r requirements.txt pytest && pytest -q diff --git a/helm/argo-stack/overlays/ip-address-pool.yaml b/helm/argo-stack/overlays/ip-address-pool.yaml new file mode 100644 index 00000000..beae18e5 --- /dev/null +++ b/helm/argo-stack/overlays/ip-address-pool.yaml @@ -0,0 +1,17 @@ +apiVersion: metallb.io/v1beta1 +kind: IPAddressPool +metadata: + name: default-pool + namespace: metallb-system +spec: + addresses: + - 100.22.124.96-100.22.124.96 # Adjust to your available IP range +--- +apiVersion: metallb.io/v1beta1 +kind: L2Advertisement +metadata: + name: default + namespace: metallb-system +spec: + ipAddressPools: + - default-pool From fecdd62b48077c29a7a8fce8bb0232f98a0706ba Mon Sep 17 00:00:00 2001 From: Copilot 
<198982749+Copilot@users.noreply.github.com> Date: Tue, 25 Nov 2025 07:29:34 -0800 Subject: [PATCH 09/19] Add Ingress connectivity troubleshooting and environment-specific configuration documentation (#98) * Initial plan * Add Ingress and Connectivity troubleshooting section to docs Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Add AWS EKS and on-premises ingress configuration docs Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Fix code review issues in ingress docs Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --- docs/troubleshooting.md | 504 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 504 insertions(+) diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index 61bc810c..9e7a3fcb 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -13,6 +13,10 @@ Data managers, developers, and platform administrators using the Argo Stack for ## Table of Contents - [General Troubleshooting](#general-troubleshooting) +- [Ingress and Connectivity Troubleshooting](#ingress-and-connectivity-troubleshooting) +- [Environment-Specific Ingress Configuration](#environment-specific-ingress-configuration) + - [AWS EKS Configuration](#aws-eks-configuration) + - [On-Premises / Bare Metal Configuration](#on-premises--bare-metal-configuration) - [Workflow Troubleshooting](#workflow-troubleshooting) - [Argo Events Issues](#argo-events-issues) - [Secret and Vault Issues](#secret-and-vault-issues) @@ -117,6 +121,506 @@ kubectl get eventsources -A --- +## Ingress and Connectivity Troubleshooting + +### Issue: Connection Refused on Port 443 + +**Error:** +``` +curl: (7) Failed to connect to calypr-demo.ddns.net port 443 after 2 ms: Could not connect to server +``` + +**Cause:** The NGINX Ingress Controller is not accessible. 
This can happen for several reasons:
+- Ingress Controller is not running
+- LoadBalancer service has no external IP
+- Firewall/Security Group blocking port 443
+- Wrong ingress class configured
+
+**Solution - Step-by-Step Debugging:**
+
+#### 1. Check NGINX Ingress Controller Status
+
+```bash
+# Check if ingress-nginx pods are running
+kubectl get pods -n ingress-nginx
+
+# Check ingress-nginx service and external IP
+kubectl get svc -n ingress-nginx
+
+# Expected output should show EXTERNAL-IP (not <pending>)
+# NAME                       TYPE           CLUSTER-IP   EXTERNAL-IP     PORT(S)
+# ingress-nginx-controller   LoadBalancer   10.100.x.x   <external-ip>   80:30080/TCP,443:30443/TCP
+```
+
+If `EXTERNAL-IP` shows `<pending>`, the LoadBalancer hasn't been provisioned:
+
+```bash
+# Check events for the service
+kubectl describe svc ingress-nginx-controller -n ingress-nginx
+
+# Check cloud provider logs for LoadBalancer issues
+```
+
+#### 2. Verify Ingress Controller is Installed
+
+```bash
+# Check if ingress-nginx namespace exists
+kubectl get ns ingress-nginx
+
+# If not installed, install with:
+helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
+helm repo update
+helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
+  -n ingress-nginx --create-namespace
+```
+
+#### 3. Check Ingress Resources
+
+```bash
+# List all ingress resources in relevant namespaces
+kubectl get ingress -A
+
+# Describe a specific ingress to check configuration
+kubectl describe ingress ingress-authz-workflows -n argo-stack
+```
+
+Look for:
+- Correct host matching your domain
+- IngressClass set correctly (usually `nginx`)
+- TLS secret exists
+- Backend service exists
+
+#### 4. Verify TLS Certificate
+
+```bash
+# Check if certificate is ready
+kubectl get certificate -n argo-stack
+
+# Check certificate status
+kubectl describe certificate calypr-demo-tls -n argo-stack
+
+# Check if TLS secret exists
+kubectl get secret calypr-demo-tls -n argo-stack
+```
+
+#### 5.
Check Ingress Controller Logs
+
+```bash
+# View ingress controller logs for errors
+kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100
+
+# Look for errors related to:
+# - Certificate loading
+# - Backend connection
+# - Configuration reloads
+```
+
+#### 6. Verify Network Connectivity
+
+```bash
+# Test from inside the cluster
+kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
+  curl -v http://argo-stack-argo-workflows-server.argo-stack:2746/
+
+# Test the ingress controller service directly
+kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
+  curl -v http://ingress-nginx-controller.ingress-nginx:80/
+```
+
+#### 7. Check Security Groups / Firewalls (Cloud-specific)
+
+**AWS:**
+```bash
+# Check the LoadBalancer security group allows inbound 443
+aws ec2 describe-security-groups --group-ids <security-group-id>
+```
+
+**GCP:**
+```bash
+# Check firewall rules
+gcloud compute firewall-rules list --filter="name~ingress"
+```
+
+**Azure:**
+```bash
+# Check network security group
+az network nsg rule list --resource-group <resource-group> --nsg-name <nsg-name>
+```
+
+### Issue: 404 Not Found on Ingress Paths
+
+**Error:**
+```
+{"level":"error","ts":...,"msg":"route not found"...}
+```
+
+**Cause:** The ingress path doesn't match any backend or the service doesn't exist.
+
+**Solution:**
+
+1. Verify backend service exists:
+```bash
+kubectl get svc -n argo-stack argo-stack-argo-workflows-server
+```
+
+2. Check ingress path configuration matches service expectations
+3. Verify the service ports match ingress configuration
+
+### Issue: 503 Service Unavailable
+
+**Error:**
+```
+HTTP/1.1 503 Service Temporarily Unavailable
+```
+
+**Cause:** Backend service has no healthy endpoints.
+
+**Solution:**
+
+```bash
+# Check endpoints for the service
+kubectl get endpoints argo-stack-argo-workflows-server -n argo-stack
+
+# Check backend pods are running
+kubectl get pods -n argo-stack -l app.kubernetes.io/name=argo-workflows-server
+
+# Check pod health
+kubectl describe pod <pod-name> -n argo-stack
+```
+
+### Issue: authz-adapter External Auth Failure
+
+**Error:**
+```
+auth-url: http://authz-adapter.security.svc.cluster.local:8080/check failed
+```
+
+**Cause:** The authz-adapter service is not responding.
+
+**Solution:**
+
+```bash
+# Check authz-adapter is running
+kubectl get pods -n security -l app=authz-adapter
+
+# Check authz-adapter service exists
+kubectl get svc authz-adapter -n security
+
+# Test authz-adapter from within cluster
+kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
+  curl -v http://authz-adapter.security:8080/healthz
+
+# Check authz-adapter logs
+kubectl logs -n security -l app=authz-adapter --tail=100
+```
+
+### Ingress Debugging Cheat Sheet
+
+| Check | Command |
+|-------|---------|
+| Ingress controller pods | `kubectl get pods -n ingress-nginx` |
+| Ingress controller service | `kubectl get svc -n ingress-nginx` |
+| All ingress resources | `kubectl get ingress -A` |
+| Ingress details | `kubectl describe ingress <ingress-name> -n <namespace>` |
+| TLS certificates | `kubectl get certificate -A` |
+| Certificate status | `kubectl describe certificate <certificate-name> -n <namespace>` |
+| Controller logs | `kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx` |
+| authz-adapter status | `kubectl get pods -n security -l app=authz-adapter` |
+| Test internal connectivity | `kubectl run debug --image=curlimages/curl --rm -it -- curl -v <url>` |
+
+---
+
+## Environment-Specific Ingress Configuration
+
+This section covers ingress setup and troubleshooting for different deployment environments.
+
+### AWS EKS Configuration
+
+#### Prerequisites for AWS EKS
+
+1.
**AWS Load Balancer Controller** (recommended) or use the default in-tree cloud provider
+2. **IAM permissions** for creating/managing Elastic Load Balancers
+3. **Subnet tags** for automatic subnet discovery
+
+#### Installing NGINX Ingress on AWS EKS
+
+```bash
+# Add the ingress-nginx repository
+helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
+helm repo update
+
+# Install with AWS-specific settings
+helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
+  -n ingress-nginx --create-namespace \
+  --set controller.service.type=LoadBalancer \
+  --set controller.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-type"=nlb \
+  --set controller.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-scheme"=internet-facing
+```
+
+#### AWS-Specific Annotations
+
+For Network Load Balancer (NLB) - recommended for production:
+```yaml
+service:
+  annotations:
+    service.beta.kubernetes.io/aws-load-balancer-type: nlb
+    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
+    # For internal-only access:
+    # service.beta.kubernetes.io/aws-load-balancer-scheme: internal
+```
+
+For Application Load Balancer (ALB) - requires AWS Load Balancer Controller:
+
+⚠️ **Note:** When using ALB with the AWS Load Balancer Controller, you configure the Ingress resource (not the Service). The Service should use `ClusterIP` or `NodePort` type.
+
+```yaml
+# Ingress annotations for ALB (on the Ingress resource, not Service):
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  annotations:
+    kubernetes.io/ingress.class: alb
+    alb.ingress.kubernetes.io/scheme: internet-facing
+    alb.ingress.kubernetes.io/target-type: ip
+```
+
+#### Troubleshooting AWS LoadBalancer Pending
+
+If `EXTERNAL-IP` stays `<pending>`:
+
+1.
**Check service events:**
+```bash
+kubectl describe svc ingress-nginx-controller -n ingress-nginx
+```
+
+Look for events like:
+- `Error syncing load balancer` - IAM permission issues
+- `could not find any suitable subnets` - subnet tagging issues
+
+2. **Verify IAM permissions:**
+
+The node IAM role or service account needs these permissions:
+```json
+{
+  "Effect": "Allow",
+  "Action": [
+    "elasticloadbalancing:CreateLoadBalancer",
+    "elasticloadbalancing:DeleteLoadBalancer",
+    "elasticloadbalancing:DescribeLoadBalancers",
+    "elasticloadbalancing:ModifyLoadBalancerAttributes",
+    "elasticloadbalancing:CreateTargetGroup",
+    "elasticloadbalancing:DeleteTargetGroup",
+    "elasticloadbalancing:DescribeTargetGroups",
+    "elasticloadbalancing:RegisterTargets",
+    "elasticloadbalancing:DeregisterTargets",
+    "ec2:DescribeSecurityGroups",
+    "ec2:DescribeSubnets",
+    "ec2:DescribeVpcs",
+    "ec2:CreateSecurityGroup",
+    "ec2:AuthorizeSecurityGroupIngress"
+  ],
+  "Resource": "*"
+}
+```
+
+3. **Check subnet tags:**
+
+Public subnets need this tag for internet-facing LBs:
+```
+kubernetes.io/role/elb = 1
+```
+
+Private subnets need this tag for internal LBs:
+```
+kubernetes.io/role/internal-elb = 1
+```
+
+4. **Verify cluster tag on subnets:**
+```
+kubernetes.io/cluster/<cluster-name> = shared (or owned)
+```
+
+5.
**Check AWS Load Balancer Controller (if using ALB):**
+```bash
+kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
+kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
+```
+
+#### AWS Security Group Configuration
+
+After the LoadBalancer is created, verify its security group allows traffic:
+
+```bash
+# Get the LoadBalancer DNS name
+LB_DNS=$(kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
+echo $LB_DNS
+
+# Find associated security group (from AWS Console or CLI)
+# Replace the DNS name in the query with your actual LoadBalancer DNS
+aws elbv2 describe-load-balancers --query "LoadBalancers[?DNSName=='${LB_DNS}'].SecurityGroups"
+
+# Verify inbound rules allow 80 and 443
+aws ec2 describe-security-groups --group-ids <sg-id> --query "SecurityGroups[].IpPermissions"
+```
+
+Required inbound rules:
+- Port 80 (HTTP) from 0.0.0.0/0 (or your IP range)
+- Port 443 (HTTPS) from 0.0.0.0/0 (or your IP range)
+
+---
+
+### On-Premises / Bare Metal Configuration
+
+#### Option 1: MetalLB (Recommended for On-Premises)
+
+MetalLB provides LoadBalancer functionality for bare metal clusters.
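MetalLB hands out addresses from a pool you define, so it is worth sanity-checking a candidate range before committing it — the range must sit outside your DHCP scope, and its size limits how many LoadBalancer services you can expose. A minimal sketch (pure bash, no cluster required; the example range is an assumption to replace with your own):

```shell
#!/usr/bin/env bash
# Count the usable addresses in a MetalLB-style range like A.B.C.D-A.B.C.E.
# Pure shell arithmetic on dotted quads; adjust the range to match your network.
ip_to_int() {
  local IFS=.
  local a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

range="192.168.1.240-192.168.1.250"   # example range; replace with yours
start="${range%-*}"
end="${range#*-}"
count=$(( $(ip_to_int "$end") - $(ip_to_int "$start") + 1 ))
echo "Pool $range contains $count addresses"
# prints: Pool 192.168.1.240-192.168.1.250 contains 11 addresses
```

Eleven addresses means at most eleven LoadBalancer services unless you enable IP sharing; size the pool accordingly.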
+
+**Install MetalLB:**
+```bash
+# Install MetalLB
+kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.12/config/manifests/metallb-native.yaml
+
+# Wait for MetalLB pods to be ready
+kubectl wait --namespace metallb-system \
+  --for=condition=ready pod \
+  --selector=app=metallb \
+  --timeout=90s
+```
+
+**Configure IP Address Pool:**
+```bash
+cat <<'YAML' | kubectl apply -f -
+apiVersion: metallb.io/v1beta1
+kind: IPAddressPool
+metadata:
+  name: default-pool
+  namespace: metallb-system
+spec:
+  addresses:
+  - 192.168.1.240-192.168.1.250  # Adjust to your available IP range
+---
+apiVersion: metallb.io/v1beta1
+kind: L2Advertisement
+metadata:
+  name: default
+  namespace: metallb-system
+spec:
+  ipAddressPools:
+  - default-pool
+YAML
+```
+
+**Then install NGINX Ingress with LoadBalancer:**
+```bash
+helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
+  -n ingress-nginx --create-namespace \
+  --set controller.service.type=LoadBalancer
+```
+
+#### Option 2: NodePort (Simple, No External Dependencies)
+
+Use NodePort when you don't have a LoadBalancer solution:
+
+```bash
+helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
+  -n ingress-nginx --create-namespace \
+  --set controller.service.type=NodePort \
+  --set controller.service.nodePorts.http=30080 \
+  --set controller.service.nodePorts.https=30443
+```
+
+Access via any node IP on the configured ports:
+```bash
+# Get node IPs
+kubectl get nodes -o wide
+
+# Access ingress
+curl http://<node-ip>:30080/
+curl -k https://<node-ip>:30443/
+```
+
+**To use standard ports (80/443)**, set up an external load balancer or reverse proxy (HAProxy, NGINX) pointing to the NodePorts.
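One way to put standard ports in front of the NodePorts is a small NGINX `stream` (TCP) proxy on an external host. A hedged sketch that only generates the config file; `NODE_IP`, the NodePorts, and the file name are assumptions to replace with your values:

```shell
#!/usr/bin/env bash
# Generate an NGINX stream (TCP) proxy config forwarding host ports 80/443
# to the ingress controller's NodePorts. NODE_IP is a placeholder assumption
# (192.0.2.10 is a TEST-NET documentation address).
NODE_IP="${NODE_IP:-192.0.2.10}"

cat > nodeport-proxy.conf <<EOF
stream {
    server {
        listen 80;
        proxy_pass ${NODE_IP}:30080;   # HTTP NodePort
    }
    server {
        listen 443;
        proxy_pass ${NODE_IP}:30443;   # HTTPS NodePort (TLS passthrough)
    }
}
EOF

echo "Wrote nodeport-proxy.conf"
```

Because `stream` is a top-level NGINX context (not inside `http`), include the file from the top level of `nginx.conf` and reload; TLS stays terminated at the ingress controller, so the proxy just forwards raw TCP.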
+
+#### Option 3: HostNetwork (Direct Node Access)
+
+For single-node clusters or when you need direct port 80/443 access:
+
+```bash
+helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
+  -n ingress-nginx --create-namespace \
+  --set controller.hostNetwork=true \
+  --set controller.service.type=ClusterIP \
+  --set controller.kind=DaemonSet
+```
+
+Access directly via node IP on ports 80 and 443.
+
+⚠️ **Note:** Only one ingress controller pod can run per node with hostNetwork.
+
+#### Troubleshooting On-Premises Ingress
+
+1. **MetalLB not assigning IPs:**
+```bash
+# Check MetalLB speaker pods
+kubectl get pods -n metallb-system
+
+# Check MetalLB logs
+kubectl logs -n metallb-system -l component=speaker
+
+# Verify IPAddressPool is configured
+kubectl get ipaddresspool -n metallb-system
+```
+
+2. **NodePort not accessible:**
+```bash
+# Verify service has NodePort assigned
+kubectl get svc ingress-nginx-controller -n ingress-nginx
+
+# Check if port is open on the node
+nc -zv <node-ip> 30443
+
+# Check firewall (iptables/firewalld)
+sudo iptables -L -n | grep 30443
+sudo firewall-cmd --list-ports
+```
+
+3. **Network connectivity from outside the cluster:**
+```bash
+# Test from an external machine
+telnet <node-ip> 30443
+
+# Check if traffic reaches the node
+sudo tcpdump -i any port 30443
+```
+
+4. 
**Firewall configuration (if using firewalld):** +```bash +# Option 1: Allow only the specific ports you're using (recommended for security) +sudo firewall-cmd --permanent --add-port=30080/tcp # HTTP NodePort +sudo firewall-cmd --permanent --add-port=30443/tcp # HTTPS NodePort +sudo firewall-cmd --reload + +# Option 2: Allow entire NodePort range (less secure, but convenient for development) +# sudo firewall-cmd --permanent --add-port=30000-32767/tcp +# sudo firewall-cmd --reload +``` + +--- + +### Environment Comparison Quick Reference + +| Feature | AWS EKS | On-Premises (MetalLB) | On-Premises (NodePort) | +|---------|---------|----------------------|------------------------| +| LoadBalancer type | NLB/ALB | L2/BGP | N/A | +| External IP | Automatic | From IP pool | Node IP + port | +| Standard ports (80/443) | โœ… Yes | โœ… Yes | โŒ No (30000-32767) | +| TLS termination | Ingress or ALB | Ingress | Ingress | +| Health checks | AWS-managed | MetalLB | Manual | +| HA setup | Multi-AZ | Multiple speakers | External LB needed | +| Setup complexity | Medium | Medium | Low | + +--- + ## Workflow Troubleshooting ### ๐Ÿงญ Overview From 337139c4c45915e97f0cc2cd5ff1937b8f357e10 Mon Sep 17 00:00:00 2001 From: Brian Walsh Date: Tue, 25 Nov 2025 15:30:21 +0000 Subject: [PATCH 10/19] ingress..host --- Makefile | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Makefile b/Makefile index 510b0b15..5c4a252e 100644 --- a/Makefile +++ b/Makefile @@ -150,6 +150,8 @@ argo-stack: --set-string s3.insecure=true \ --set-string s3.region=${S3_REGION} \ --set-string s3.hostname=${S3_HOSTNAME} \ + --set-string ingress.argoWorkflows.host=${ARGO_HOSTNAME} \ + --set-string ingress.argocd.host=${ARGO_HOSTNAME} \ -f - deploy: init argo-stack docker-install ports From 540c3d73be82251dcf5cf3ac169c449748c7e160 Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Tue, 25 Nov 2025 08:01:57 -0800 Subject: [PATCH 11/19] Fix Ingress class detection for 
github-repo-registrations-eventsource (#99) * Initial plan * Fix github-repo-registrations-eventsource ingress to use spec.ingressClassName instead of deprecated annotation Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Add quote to ingressClassName for proper YAML parsing Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --- .../ingress-authz-overlay/templates/ingress-authz.yaml | 2 +- .../events/eventsource-github-from-repo-registrations.yaml | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml index a056aef2..66e3cb57 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml +++ b/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml @@ -40,7 +40,7 @@ metadata: nginx.ingress.kubernetes.io/rewrite-target: {{ $route.rewriteTarget | default "/$2" }} {{- end }} spec: - ingressClassName: {{ $config.ingressClassName | default "nginx" }} + ingressClassName: {{ $config.ingressClassName | default "nginx" | quote }} {{- if $config.tls.enabled }} tls: - hosts: diff --git a/helm/argo-stack/templates/events/eventsource-github-from-repo-registrations.yaml b/helm/argo-stack/templates/events/eventsource-github-from-repo-registrations.yaml index 973b9276..fd47b9f1 100644 --- a/helm/argo-stack/templates/events/eventsource-github-from-repo-registrations.yaml +++ b/helm/argo-stack/templates/events/eventsource-github-from-repo-registrations.yaml @@ -78,9 +78,8 @@ metadata: namespace: {{ .Values.events.namespace | default "argo-events" }} labels: source: repo-registration - annotations: - kubernetes.io/ingress.class: {{ .Values.events.github.webhook.ingress.className | default 
"nginx" | quote }} spec: + ingressClassName: {{ .Values.events.github.webhook.ingress.className | default "nginx" | quote }} rules: {{- range $h := (.Values.events.github.webhook.ingress.hosts | default list) }} - host: {{ $h }} From 9a64f936c6fe9ee2fde82997223d6db7931acb26 Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Tue, 25 Nov 2025 08:58:57 -0800 Subject: [PATCH 12/19] Add cross-namespace routing support via ExternalName services (#100) * Initial plan * Add cross-namespace routing support via ExternalName services Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Add cross-namespace routing troubleshooting documentation Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --- docs/troubleshooting.md | 85 +++++++++++++++++++ .../overlays/ingress-authz-overlay/README.md | 49 +++++++++-- .../templates/externalname-services.yaml | 43 ++++++++++ .../templates/ingress-authz.yaml | 17 +++- .../ingress-authz-overlay/values.yaml | 13 +++ 5 files changed, 199 insertions(+), 8 deletions(-) create mode 100644 helm/argo-stack/overlays/ingress-authz-overlay/templates/externalname-services.yaml diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index 9e7a3fcb..2e6a4bb3 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -266,6 +266,91 @@ kubectl get svc -n argo-stack argo-stack-argo-workflows-server 2. Check ingress path configuration matches service expectations 3. Verify the service ports match ingress configuration +### Issue: 404 Due to Cross-Namespace Service Routing + +**Error:** +NGINX ingress returns 404 for all paths (`/workflows`, `/applications`, `/registrations`) even though the backend pods are running and responding correctly when accessed directly within the cluster. 
+ +**Symptoms:** +```bash +# Direct service access works: +kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \ + curl -v http://argo-stack-argo-workflows-server.argo-workflows:2746/ +# Returns expected HTML + +# But ingress returns 404: +curl https://calypr-demo.ddns.net/workflows +# Returns 404 Not Found +``` + +**Cause:** Kubernetes Ingress resources can **only route to services in the same namespace** as the Ingress. If your ingress is in `argo-stack` namespace but the actual service is in `argo-workflows` namespace, NGINX cannot route to it directly. + +Common cross-namespace scenarios: +- Ingress in `argo-stack` โ†’ Service in `argo-workflows` (Argo Workflows Server) +- Ingress in `argo-stack` โ†’ Service in `argocd` (Argo CD Server) +- Ingress in `argo-stack` โ†’ Service in `argo-events` (EventSource Service) + +**Solution - Use ExternalName Services:** + +The `ingress-authz-overlay` chart supports cross-namespace routing via ExternalName services. Configure each route with both `namespace` (where ingress lives) and `serviceNamespace` (where service actually exists): + +```yaml +# helm/argo-stack/overlays/ingress-authz-overlay/values.yaml +ingressAuthzOverlay: + routes: + workflows: + namespace: argo-stack # Where the ingress is created + serviceNamespace: argo-workflows # Where the actual service exists + service: argo-stack-argo-workflows-server + port: 2746 + applications: + namespace: argo-stack + serviceNamespace: argocd # ArgoCD server is in argocd namespace + service: argo-stack-argocd-server + port: 8080 + registrations: + namespace: argo-stack + serviceNamespace: argo-events # EventSource is in argo-events namespace + service: github-repo-registrations-eventsource-svc + port: 12000 +``` + +When `serviceNamespace` differs from `namespace`, the chart automatically creates: +1. **ExternalName Service** (e.g., `argo-stack-argo-workflows-server-proxy`) in the ingress namespace +2. 
This service acts as a DNS proxy pointing to the actual service FQDN +3. The ingress routes to the proxy service, which forwards to the actual service + +**Verify ExternalName Services:** +```bash +# Check ExternalName services were created +kubectl get svc -n argo-stack -l app.kubernetes.io/component=externalname-proxy + +# Verify ExternalName targets +kubectl get svc argo-stack-argo-workflows-server-proxy -n argo-stack -o yaml | grep externalName +# Should show: externalName: argo-stack-argo-workflows-server.argo-workflows.svc.cluster.local +``` + +**Redeploy the overlay:** +```bash +helm upgrade --install ingress-authz-overlay \ + helm/argo-stack/overlays/ingress-authz-overlay \ + --namespace argo-stack +``` + +**Debug cross-namespace routing:** +```bash +# 1. Verify direct service access works +kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \ + curl -v http://argo-stack-argo-workflows-server.argo-workflows:2746/ + +# 2. Verify ExternalName proxy service resolves correctly +kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \ + curl -v http://argo-stack-argo-workflows-server-proxy.argo-stack:2746/ + +# 3. 
Check ingress configuration +kubectl describe ingress ingress-authz-workflows -n argo-stack | grep -A5 "backend" +``` + ### Issue: 503 Service Unavailable **Error:** diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/README.md b/helm/argo-stack/overlays/ingress-authz-overlay/README.md index 0b0ead6e..20a12cb3 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/README.md +++ b/helm/argo-stack/overlays/ingress-authz-overlay/README.md @@ -6,16 +6,51 @@ A Helm overlay chart providing unified, path-based ingress with centralized auth This overlay provides a **single host, path-based ingress** for all major UIs and APIs: -| Path | Service | Description | -|------|---------|-------------| -| `/workflows` | Argo Workflows Server | Workflow UI (port 2746) | -| `/applications` | Argo CD Server | GitOps applications UI (port 8080) | -| `/registrations` | GitHub EventSource | Repository registration events (port 12000) | -| `/api` | Calypr API | Platform API service (port 3000) | -| `/tenants` | Calypr Tenants | Tenant portal (port 3001) | +| Path | Service | Namespace | Description | +|------|---------|-----------|-------------| +| `/workflows` | Argo Workflows Server | argo-workflows | Workflow UI (port 2746) | +| `/applications` | Argo CD Server | argocd | GitOps applications UI (port 8080) | +| `/registrations` | GitHub EventSource | argo-events | Repository registration events (port 12000) | +| `/api` | Calypr API | calypr-api | Platform API service (port 3000) | +| `/tenants` | Calypr Tenants | calypr-tenants | Tenant portal (port 3001) | All endpoints are protected by the `authz-adapter` via NGINX external authentication. +## Cross-Namespace Routing + +This overlay supports **cross-namespace routing** for services that exist in different namespaces than the ingress resource. This is achieved using **ExternalName services** as proxies. + +### How It Works + +When a route's `serviceNamespace` differs from its `namespace`: + +1. 
An **ExternalName Service** is created in the ingress namespace +2. This service acts as a DNS proxy pointing to the actual service in the target namespace +3. The ingress routes to the proxy service, which forwards to the actual service + +### Configuration + +Each route can specify both the ingress namespace and the actual service namespace: + +```yaml +ingressAuthzOverlay: + routes: + workflows: + # Where the ingress is created + namespace: argo-stack + # Where the actual service lives + serviceNamespace: argo-workflows + service: argo-stack-argo-workflows-server + port: 2746 +``` + +When `serviceNamespace` differs from `namespace`, an ExternalName service is automatically created: + +- **Service Name**: `-proxy` +- **ExternalName**: `..svc.cluster.local` + +The ingress also adds the `nginx.ingress.kubernetes.io/upstream-vhost` annotation to ensure the correct Host header is sent to the backend service. + ## AuthZ Adapter Configuration **Important**: By default, this overlay does **not** deploy its own authz-adapter. Instead, it reuses the centralized authz-adapter deployed by the main `argo-stack` chart in the `security` namespace. diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/templates/externalname-services.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/templates/externalname-services.yaml new file mode 100644 index 00000000..7796a3f5 --- /dev/null +++ b/helm/argo-stack/overlays/ingress-authz-overlay/templates/externalname-services.yaml @@ -0,0 +1,43 @@ +{{/* +ExternalName Services for cross-namespace routing. +When the ingress namespace differs from the service namespace, we create an +ExternalName service in the ingress namespace that points to the actual service +in its original namespace. This enables NGINX Ingress to route traffic correctly. +*/}} +{{- if .Values.ingressAuthzOverlay.enabled }} +{{- $root := . 
}} +{{- $config := .Values.ingressAuthzOverlay }} +{{- range $routeName, $route := $config.routes }} +{{- if $route.enabled }} +{{- $serviceNamespace := $route.serviceNamespace | default $route.namespace }} +{{- if ne $route.namespace $serviceNamespace }} +--- +# ExternalName service to enable cross-namespace routing for {{ $routeName }} +# Routes from {{ $route.namespace }} to {{ $route.service }}.{{ $serviceNamespace }} +apiVersion: v1 +kind: Service +metadata: + name: {{ $route.service }}-proxy + namespace: {{ $route.namespace }} + labels: + {{- include "ingress-authz-overlay.labels" $root | nindent 4 }} + app.kubernetes.io/component: externalname-proxy + ingress-authz-overlay.calypr.io/route: {{ $routeName | quote }} + ingress-authz-overlay.calypr.io/target-namespace: {{ $serviceNamespace | quote }} + ingress-authz-overlay.calypr.io/target-service: {{ $route.service | quote }} + annotations: + # Helm release tracking + meta.helm.sh/release-name: {{ $root.Release.Name }} + meta.helm.sh/release-namespace: {{ $root.Release.Namespace }} +spec: + type: ExternalName + externalName: {{ $route.service }}.{{ $serviceNamespace }}.svc.cluster.local + ports: + - port: {{ $route.port }} + targetPort: {{ $route.port }} + protocol: TCP + name: http +{{- end }} +{{- end }} +{{- end }} +{{- end }} diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml index 66e3cb57..85343ce8 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml +++ b/helm/argo-stack/overlays/ingress-authz-overlay/templates/ingress-authz.yaml @@ -8,12 +8,19 @@ NOTE: Only the route with primary: true should have the cert-manager.io/cluster- annotation. Other routes just reference the TLS secret without the annotation to avoid cert-manager ownership conflicts. 
If no route has primary: true, no ingress will have the cluster-issuer annotation (the Certificate must be created manually or by another means). + +For cross-namespace routing (when serviceNamespace differs from namespace), we use +an ExternalName service as a proxy. The ExternalName service is created in the +externalname-services.yaml template. */}} {{- if .Values.ingressAuthzOverlay.enabled }} {{- $root := . }} {{- $config := .Values.ingressAuthzOverlay }} {{- range $routeName, $route := $config.routes }} {{- if $route.enabled }} +{{- $serviceNamespace := $route.serviceNamespace | default $route.namespace }} +{{- $isCrossNamespace := ne $route.namespace $serviceNamespace }} +{{- $serviceName := ternary (printf "%s-proxy" $route.service) $route.service $isCrossNamespace }} --- apiVersion: networking.k8s.io/v1 kind: Ingress @@ -24,6 +31,10 @@ metadata: {{- include "ingress-authz-overlay.labels" $root | nindent 4 }} app.kubernetes.io/component: ingress ingress-authz-overlay.calypr.io/route: {{ $routeName | quote }} + {{- if $isCrossNamespace }} + ingress-authz-overlay.calypr.io/cross-namespace: "true" + ingress-authz-overlay.calypr.io/target-namespace: {{ $serviceNamespace | quote }} + {{- end }} annotations: # Helm release tracking meta.helm.sh/release-name: {{ $root.Release.Name }} @@ -39,6 +50,10 @@ metadata: nginx.ingress.kubernetes.io/use-regex: "true" nginx.ingress.kubernetes.io/rewrite-target: {{ $route.rewriteTarget | default "/$2" }} {{- end }} + {{- if $isCrossNamespace }} + # Cross-namespace routing via ExternalName service + nginx.ingress.kubernetes.io/upstream-vhost: {{ $route.service }}.{{ $serviceNamespace }}.svc.cluster.local + {{- end }} spec: ingressClassName: {{ $config.ingressClassName | default "nginx" | quote }} {{- if $config.tls.enabled }} @@ -60,7 +75,7 @@ spec: {{- end }} backend: service: - name: {{ $route.service }} + name: {{ $serviceName }} port: number: {{ $route.port }} {{- end }} diff --git 
a/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml index 40df467d..bcf15376 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml +++ b/helm/argo-stack/overlays/ingress-authz-overlay/values.yaml @@ -102,8 +102,13 @@ ingressAuthzOverlay: # Set primary: true to designate this route as the certificate owner # Only the primary route gets the cert-manager.io/cluster-issuer annotation primary: true + # Namespace where the ingress will be created namespace: argo-stack + # Service name to route to service: argo-stack-argo-workflows-server + # Namespace where the actual service exists (for cross-namespace routing) + # If different from namespace, an ExternalName service will be created + serviceNamespace: argo-workflows port: 2746 pathPrefix: /workflows # Use regex path matching for subpaths @@ -116,6 +121,8 @@ ingressAuthzOverlay: enabled: true namespace: argo-stack service: argo-stack-argocd-server + # ArgoCD server is in argocd namespace + serviceNamespace: argocd port: 8080 pathPrefix: /applications useRegex: true @@ -126,6 +133,8 @@ ingressAuthzOverlay: enabled: true namespace: argo-stack service: github-repo-registrations-eventsource-svc + # EventSource is in argo-events namespace + serviceNamespace: argo-events port: 12000 pathPrefix: /registrations useRegex: true @@ -136,6 +145,8 @@ ingressAuthzOverlay: enabled: true namespace: calypr-api service: calypr-api + # Service is in same namespace as ingress + serviceNamespace: calypr-api port: 3000 pathPrefix: /api useRegex: true @@ -146,6 +157,8 @@ ingressAuthzOverlay: enabled: true namespace: calypr-tenants service: calypr-tenants + # Service is in same namespace as ingress + serviceNamespace: calypr-tenants port: 3001 pathPrefix: /tenants useRegex: true From 7f820b52c5234ea74e0303eecdf7449b8755c9ca Mon Sep 17 00:00:00 2001 From: Brian Walsh Date: Tue, 25 Nov 2025 09:07:25 -0800 Subject: [PATCH 13/19] tweak pod names --- 
docs/troubleshooting.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index 2e6a4bb3..989f4f60 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -219,8 +219,7 @@ kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100 ```bash # Test from inside the cluster -kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \ - curl -v http://argo-stack-argo-workflows-server.argo-stack:2746/ +kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl -v http://argo-stack-argo-workflows-server.argo-workflows:2746/ # Test the ingress controller service directly kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \ @@ -260,7 +259,7 @@ az network nsg rule list --resource-group --nsg-name 1. Verify backend service exists: ```bash -kubectl get svc -n argo-stack argo-stack-argo-workflows-server +kubectl get svc -n argo-workflows argo-stack-argo-workflows-server ``` 2. Check ingress path configuration matches service expectations From 91d277f6d704a8dfb6f4a2f9d17be3c96460f8dd Mon Sep 17 00:00:00 2001 From: Brian Walsh Date: Tue, 25 Nov 2025 20:52:25 +0000 Subject: [PATCH 14/19] host-networking --- Makefile | 6 ++++-- kind-config.yaml | 11 +++++++++++ 2 files changed, 15 insertions(+), 2 deletions(-) create mode 100644 kind-config.yaml diff --git a/Makefile b/Makefile index 5c4a252e..c136def7 100644 --- a/Makefile +++ b/Makefile @@ -85,7 +85,7 @@ show-limits: kind: kind delete cluster || true - kind create cluster + kind create cluster --config kind-config.yaml minio: @echo "๐Ÿ—„๏ธ Installing MinIO in cluster..." 
@@ -186,7 +186,9 @@ ports: helm repo update helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \ -n ingress-nginx --create-namespace \ - --set controller.service.type=LoadBalancer + --set controller.service.type=NodePort + # Solution - Use NodePort instead of LoadBalancer in kind + kubectl patch svc ingress-nginx-controller -n ingress-nginx -p '{"spec":{"type":"NodePort","ports":[{"port":80,"nodePort":30080},{"port":443,"nodePort":30443}]}}' adapter: cd authz-adapter && python3 -m pip install -r requirements.txt pytest && pytest -q diff --git a/kind-config.yaml b/kind-config.yaml new file mode 100644 index 00000000..b7bf67de --- /dev/null +++ b/kind-config.yaml @@ -0,0 +1,11 @@ +kind: Cluster +apiVersion: kind.x-k8s.io/v1alpha4 +networking: + kubeProxyMode: "iptables" # Explicit mode +nodes: +- role: control-plane + extraPortMappings: + - containerPort: 30080 + hostPort: 80 + - containerPort: 30443 + hostPort: 443 From 79f895959262583f72945da00998ffb5a4e234ee Mon Sep 17 00:00:00 2001 From: Brian Walsh Date: Tue, 25 Nov 2025 22:13:43 +0000 Subject: [PATCH 15/19] letsencrypt dns01 --- Makefile | 4 ++++ acmedns.json.example | 9 +++++++++ .../cluster-issuer-letsencrypt.yaml | 20 ++++++++----------- 3 files changed, 21 insertions(+), 12 deletions(-) create mode 100644 acmedns.json.example diff --git a/Makefile b/Makefile index c136def7..c0b7607e 100644 --- a/Makefile +++ b/Makefile @@ -173,6 +173,10 @@ ports: --namespace cert-manager \ --create-namespace \ --set crds.enabled=true + # install letsencrypt secret see **Create ClusterIssuer with acme-dns** + kubectl create secret generic acme-dns-credentials \ + -n cert-manager \ + --from-file=acmedns.json # Wait for them to come up kubectl wait --for=condition=Ready pods --all -n cert-manager --timeout=120s # Start letsencrypt diff --git a/acmedns.json.example b/acmedns.json.example new file mode 100644 index 00000000..7bc96121 --- /dev/null +++ b/acmedns.json.example @@ -0,0 +1,9 @@ +{ + 
"calypr-demo.ddns.net": { + "username": "XXXX", + "password": "XXXX", + "fulldomain": "XXXX", + "subdomain": "XXXX", + "allowfrom": [] + } +} diff --git a/helm/argo-stack/overlays/ingress-authz-overlay/cluster-issuer-letsencrypt.yaml b/helm/argo-stack/overlays/ingress-authz-overlay/cluster-issuer-letsencrypt.yaml index 67be9c36..d64390ac 100644 --- a/helm/argo-stack/overlays/ingress-authz-overlay/cluster-issuer-letsencrypt.yaml +++ b/helm/argo-stack/overlays/ingress-authz-overlay/cluster-issuer-letsencrypt.yaml @@ -1,21 +1,17 @@ apiVersion: cert-manager.io/v1 kind: ClusterIssuer metadata: - name: letsencrypt-staging + name: letsencrypt-prod spec: acme: - # Let's Encrypt production API endpoint server: https://acme-v02.api.letsencrypt.org/directory - - # Email for certificate expiration notifications email: brian@bwalsh.com - - # Secret to store the ACME account private key privateKeySecretRef: - name: letsencrypt-staging-account-key - - # HTTP-01 challenge solver using ingress + name: letsencrypt-prod-account-key solvers: - - http01: - ingress: - class: nginx + - dns01: + acmeDNS: + host: https://auth.acme-dns.io + accountSecretRef: + name: acme-dns-credentials + key: acmedns.json From 6b98e98c761ce423117293eebf6204d30cf3d8ab Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Tue, 25 Nov 2025 14:15:50 -0800 Subject: [PATCH 16/19] Add troubleshooting docs for ingress connectivity, kind clusters, and cert-manager issues (#101) * Initial plan * Add troubleshooting docs for external connection issues when internal services work Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Add kind cluster and cert-manager troubleshooting to docs Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Add DNS-01 challenge docs for No-IP.com and kind clusters Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> 
Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --- docs/troubleshooting.md | 520 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 520 insertions(+) diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index 989f4f60..0dbcb544 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -123,6 +123,526 @@ kubectl get eventsources -A ## Ingress and Connectivity Troubleshooting +### Issue: Connection Refused but Internal Services Work + +**Symptoms:** +Internal cluster connectivity works perfectly, but external access fails: + +```bash +# โœ… Internal service access works: +kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \ + curl -v http://argo-stack-argo-workflows-server.argo-workflows:2746/ +# Returns 200 OK with HTML content + +# โœ… ExternalName proxy also works: +kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \ + curl -v http://argo-stack-argo-workflows-server-proxy.argo-stack:2746/ +# Returns 200 OK + +# โŒ But external access fails: +curl https://calypr-demo.ddns.net/workflows +# curl: (7) Failed to connect to calypr-demo.ddns.net port 443 after 2 ms: Could not connect to server +``` + +**Cause:** This "Connection refused" error at the network level means the **ingress-nginx controller's LoadBalancer service** is not exposing ports to the external network. This is distinct from a 404 error (which would mean the ingress is reachable but routing is misconfigured). + +Common causes: +- LoadBalancer service is pending (no external IP provisioned) +- NodePort is not exposed in firewall/security groups +- DNS is not pointing to the correct IP +- Cloud provider LoadBalancer controller is not configured + +**Solution - Step-by-Step Diagnosis:** + +#### 1. 
Check the ingress-nginx LoadBalancer Service
+
+```bash
+# Check the service type and external IP
+kubectl get svc -n ingress-nginx
+
+# Expected output for LoadBalancer type:
+# NAME                       TYPE           CLUSTER-IP   EXTERNAL-IP     PORT(S)
+# ingress-nginx-controller   LoadBalancer   10.100.x.x   <external-ip>   80:30080/TCP,443:30443/TCP
+
+# Expected output for NodePort type:
+# NAME                       TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)
+# ingress-nginx-controller   NodePort       10.100.x.x   <none>        80:30080/TCP,443:30443/TCP
+```
+
+#### 2. If EXTERNAL-IP is `<pending>`
+
+This means the cloud LoadBalancer hasn't been provisioned:
+
+```bash
+# Check service events for errors
+kubectl describe svc ingress-nginx-controller -n ingress-nginx
+
+# Common causes:
+# - AWS Load Balancer Controller not installed (EKS)
+# - Insufficient IAM permissions for LB creation
+# - Subnet/VPC configuration issues
+# - Quota exceeded for load balancers
+```
+
+**For AWS EKS:** See [Troubleshooting AWS LoadBalancer Pending](#troubleshooting-aws-loadbalancer-pending) for detailed AWS-specific steps including IAM permissions, subnet tagging, and AWS Load Balancer Controller setup.
+
+Quick check:
+```bash
+# Check if AWS Load Balancer Controller is installed
+kubectl get deployment -n kube-system aws-load-balancer-controller
+
+# If not installed, the Kubernetes service will stay in <pending>
+```
+
+**For bare metal / on-premises clusters:**
+
+LoadBalancer type won't work without a load balancer controller. Options:
+- Use MetalLB: https://metallb.universe.tf/
+- Switch to NodePort and configure external LB manually
+- Use HostPort on specific nodes
+
+#### 3. If using NodePort, check external access
+
+```bash
+# Get the NodePort for port 443
+kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.spec.ports[?(@.port==443)].nodePort}'
+# Example output: 30443
+
+# Get node external IP
+kubectl get nodes -o wide
+# Note the EXTERNAL-IP of your nodes
+
+# Verify firewall allows traffic on the NodePort
+# Then test: curl https://<node-ip>:<node-port>/
+```
+
+#### 4. 
Verify DNS Resolution + +```bash +# Check that your domain resolves to the correct IP +nslookup calypr-demo.ddns.net + +# This should return the LoadBalancer external IP or Node external IP +# If it returns an incorrect IP, update your DNS +``` + +#### 5. Test Direct Access to the LoadBalancer IP + +```bash +# Get the LoadBalancer IP +LB_IP=$(kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}') +echo "LoadBalancer IP: $LB_IP" + +# If AWS NLB (uses hostname instead of IP): +LB_HOSTNAME=$(kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].hostname}') +echo "LoadBalancer Hostname: $LB_HOSTNAME" + +# Test direct access +curl -v -k https://$LB_IP/workflows + +# If this works but your domain doesn't, the issue is DNS +``` + +#### 6. AWS-Specific: Check Security Groups + +See [AWS Security Group Configuration](#aws-security-group-configuration) for detailed security group verification. + +The LoadBalancer security group must allow: +- Inbound 443 from 0.0.0.0/0 (or your IP range) +- Inbound 80 from 0.0.0.0/0 (for HTTP-01 ACME challenges) + +#### 7. Verify ingress-nginx Controller is Healthy + +```bash +# Check pods are running +kubectl get pods -n ingress-nginx + +# Check controller logs for errors +kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50 + +# Look for: +# - "successfully synced" messages (good) +# - Error loading certificate (TLS issue) +# - Backend connection errors +``` + +#### 8. kind Cluster Specific Issues + +If you're using **kind** (Kubernetes IN Docker), the networking works differently: + +**Problem:** MetalLB's external IP only exists inside the Docker network, not accessible from your host machine. + +**Solution for kind:** + +1. 
**Access via localhost** using the port mappings defined in your kind config: +```bash +# If you configured extraPortMappings for ports 80/443 +curl -k https://localhost/workflows + +# Update /etc/hosts to use localhost for your domain +echo "127.0.0.1 calypr-demo.ddns.net" | sudo tee -a /etc/hosts +curl -k https://calypr-demo.ddns.net/workflows +``` + +2. **Use NodePort instead of LoadBalancer** with kind: +```yaml +# kind-config.yaml +kind: Cluster +apiVersion: kind.x-k8s.io/v1alpha4 +networking: + kubeProxyMode: "iptables" +nodes: +- role: control-plane + extraPortMappings: + - containerPort: 30080 # NodePort for HTTP + hostPort: 80 + - containerPort: 30443 # NodePort for HTTPS + hostPort: 443 +``` + +Then patch the ingress-nginx service: +```bash +kubectl patch svc ingress-nginx-controller -n ingress-nginx -p '{"spec":{"type":"NodePort","ports":[{"name":"http","port":80,"nodePort":30080},{"name":"https","port":443,"nodePort":30443}]}}' +``` + +3. **Check iptables rules inside the kind container** (not on host): +```bash +# Rules exist inside the kind node container, not on the host +docker exec -it kind-control-plane bash + +# Inside the container +iptables-save | grep KUBE-SERVICES +iptables-save | grep ingress-nginx +``` + +4. **Let's Encrypt certificates won't work in kind** - use self-signed certs instead: + +kind clusters aren't accessible from the internet, so Let's Encrypt HTTP-01 challenges will fail. You'll see "Kubernetes Ingress Controller Fake Certificate" in your browser. 
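Before wiring a self-signed certificate into a TLS secret, it helps to confirm what the generated certificate actually contains. A minimal local sketch (the hostname and file locations are illustrative assumptions from this guide; requires `openssl` on your workstation):

```shell
# Work in a throwaway directory (paths are illustrative)
workdir=$(mktemp -d)
cd "$workdir"

# Generate a self-signed certificate for the demo hostname used in this guide
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=calypr-demo.ddns.net/O=calypr-demo" 2>/dev/null

# Inspect the subject and expiry before loading the files into a secret;
# the ingress default certificate would instead show
# "Kubernetes Ingress Controller Fake Certificate" as its subject
openssl x509 -in tls.crt -noout -subject -enddate
```

If the subject printed here is not your domain, the browser warning you are seeing comes from the wrong certificate, not from the self-signed one you intended to install.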
+ +**Solution - Use self-signed certificates for kind:** + +```bash +# Create a self-signed certificate +openssl req -x509 -nodes -days 365 -newkey rsa:2048 \ + -keyout tls.key -out tls.crt \ + -subj "/CN=calypr-demo.ddns.net/O=calypr-demo" + +# Create the TLS secret +kubectl create secret tls calypr-demo-tls \ + -n argo-stack \ + --cert=tls.crt \ + --key=tls.key + +# Delete the Certificate resource (stop cert-manager from managing it) +kubectl delete certificate calypr-demo-tls -n argo-stack + +# Remove cert-manager annotation from ingress +kubectl annotate ingress ingress-authz-workflows -n argo-stack cert-manager.io/cluster-issuer- +``` + +Your browser will show a security warning (expected for self-signed certs), but you can proceed. + +--- + +### Issue: kube-proxy Not Creating iptables/nftables Rules + +**Symptoms:** +- NodePort connections fail (Connection refused) +- Testing `curl localhost:<node-port>` fails +- No KUBE-* chains in iptables/nftables output + +**Cause:** kube-proxy is configured for iptables mode but the system uses nftables, and no rules are being created. + +**Diagnosis:** + +1. **Check if kube-proxy rules exist:** +```bash +# On systems using iptables-nft backend +sudo nft list ruleset | grep KUBE-SERVICES + +# On systems using iptables-legacy +sudo iptables-save | grep KUBE-SERVICES + +# If you get "incompatible, use 'nft' tool" error: +# Your system uses nftables but you're trying to use iptables commands +``` + +2. **Verify which iptables backend is active:** +```bash +sudo update-alternatives --display iptables +# Look for: link currently points to /usr/sbin/iptables-nft +``` + +3.
**Check kube-proxy configuration:** +```bash +kubectl get cm kube-proxy -n kube-system -o yaml | grep "mode:" +# Should show: mode: iptables or mode: nft +``` + +**Solution:** + +**For kind clusters:** +- kube-proxy runs inside the kind container +- Check rules from inside: `docker exec -it kind-control-plane iptables-save` +- The host's iptables/nftables are separate from the kind node's + +**For bare metal/VM clusters with nftables:** + +If your system uses iptables-nft and kube-proxy shows "Using iptables Proxier" but creates no rules: + +1. **Verify kube-proxy mode in ConfigMap:** +```bash +kubectl edit cm kube-proxy -n kube-system +``` + +Ensure `mode: iptables` is set (it should work with iptables-nft). + +2. **Restart kube-proxy:** +```bash +kubectl delete pod -n kube-system -l k8s-app=kube-proxy +``` + +3. **Verify rules are created:** +```bash +# Wait 30 seconds, then check +sudo nft list ruleset | grep KUBE-SERVICES +``` + +4. **If still no rules, check kube-proxy logs:** +```bash +kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=100 +# Look for errors about iptables/nftables initialization +``` + +--- + +### Issue: Let's Encrypt Certificate Not Issuing (Fake Certificate Shown) + +**Symptoms:** +- Browser shows "Kubernetes Ingress Controller Fake Certificate" +- Certificate status shows `Ready: False` with reason `DoesNotExist` +- CertificateRequest or Challenge resources stuck in pending state + +**Diagnosis:** + +1. **Check Certificate status:** +```bash +kubectl describe certificate calypr-demo-tls -n argo-stack + +# Look for conditions showing why it's not ready +# Common reasons: DoesNotExist, Pending, Failed +``` + +2. **Check CertificateRequest:** +```bash +kubectl get certificaterequest -n argo-stack +kubectl describe certificaterequest -n argo-stack + +# Check for failure reasons +``` + +3. 
**Check ACME Challenge (for Let's Encrypt):** +```bash +kubectl get challenges -A +kubectl describe challenge -n argo-stack + +# Look for HTTP-01 or DNS-01 challenge status +``` + +4. **Check cert-manager logs:** +```bash +kubectl logs -n cert-manager -l app=cert-manager --tail=100 +kubectl logs -n cert-manager -l app=webhook --tail=100 +``` + +**Common Causes and Solutions:** + +#### Cause 1: Domain Not Accessible from Internet (kind/local clusters) + +**For kind or local development clusters**, Let's Encrypt cannot reach your domain to verify ownership via HTTP-01 challenge. + +**Solution:** Use self-signed certificates (see [kind Cluster Specific Issues](#8-kind-cluster-specific-issues) section). + +#### Cause 2: HTTP-01 Challenge Fails - Port 80 Not Reachable + +Let's Encrypt needs to reach `http://your-domain/.well-known/acme-challenge/` on port 80. + +**Check:** +```bash +# Verify ingress responds on port 80 +curl -v http://calypr-demo.ddns.net/.well-known/acme-challenge/test + +# Check if port 80 is open in firewall/security groups +# AWS: Check security group allows inbound port 80 from 0.0.0.0/0 +# On-prem: Check firewall allows port 80 from Let's Encrypt IPs +``` + +**Solution:** +```bash +# Ensure LoadBalancer/NodePort exposes port 80 +kubectl get svc ingress-nginx-controller -n ingress-nginx + +# Should show: 80:xxxxx/TCP in PORT(S) column +``` + +#### Cause 3: Use DNS-01 Challenge for kind/Local Clusters + +**For kind clusters or when using dynamic DNS providers like No-IP.com**, HTTP-01 challenges won't work because: +- kind clusters aren't publicly accessible from the internet +- Dynamic DNS IPs may not route directly to your cluster + +**Solution: Use DNS-01 challenge with webhook solver** + +cert-manager doesn't have native No-IP.com support, but you can use the generic **webhook solver with custom scripts** or **acme-dns**: + +**Option A: Use acme-dns (Recommended for No-IP.com)** + +1. 
**Set up acme-dns server** (one-time setup): +```bash +# Deploy acme-dns in your cluster +kubectl apply -f https://raw.githubusercontent.com/joohoi/acme-dns/master/k8s/acme-dns-deployment.yaml + +# Or use the public acme-dns service at auth.acme-dns.io +``` + +2. **Install cert-manager acme-dns webhook**: +```bash +helm repo add cert-manager-webhook-acme-dns https://k8s-at-home.github.io/charts +helm install acme-dns-webhook cert-manager-webhook-acme-dns/cert-manager-webhook-acme-dns \ + -n cert-manager +``` + +3. **Register your domain with acme-dns** (follow prompts): +```bash +curl -X POST https://auth.acme-dns.io/register +# Returns: {"username":"xxx","password":"xxx","fulldomain":"xxx.auth.acme-dns.io","subdomain":"xxx"} +``` + +4. **Add CNAME record in No-IP.com**: +``` +_acme-challenge.calypr-demo.ddns.net CNAME <fulldomain> +``` + +5. **Create ClusterIssuer with acme-dns**: +```yaml +apiVersion: cert-manager.io/v1 +kind: ClusterIssuer +metadata: + name: letsencrypt-prod +spec: + acme: + server: https://acme-v02.api.letsencrypt.org/directory + email: your-email@example.com + privateKeySecretRef: + name: letsencrypt-prod-account-key + solvers: + - dns01: + acmeDNS: + host: https://auth.acme-dns.io + accountSecretRef: + name: acme-dns-credentials + key: acmedns.json +``` + +6. **Create the credentials secret**: +```bash +cat > acmedns.json <<EOF +{ + "calypr-demo.ddns.net": { + "username": "<username>", + "password": "<password>", + "fulldomain": "<fulldomain>", + "subdomain": "<subdomain>", + "allowfrom": [] + } +} +EOF + +kubectl create secret generic acme-dns-credentials \ + -n cert-manager \ + --from-file=acmedns.json +``` + +**Option B: Manual DNS-01 (Not recommended - use acme-dns instead)** + +Manual verification requires adding TXT records to No-IP.com each time a certificate renews. This is not practical for automated renewals. + +If you still want manual control, you'll need to: +1. Create a Certificate with manual approval +2. Check the Challenge resource for the required TXT record +3.
Add the TXT record `_acme-challenge.calypr-demo.ddns.net` to No-IP.com +4. Wait for validation + +For automated renewals, use Option A (acme-dns) instead. + +#### Cause 4: ClusterIssuer Not Ready (HTTP-01) + +```bash +kubectl get clusterissuer +kubectl describe clusterissuer letsencrypt-prod + +# Check status shows Ready: True +``` + +If ClusterIssuer doesn't exist or isn't ready: +```bash +# Create ClusterIssuer for HTTP-01 (production clusters only) +kubectl apply -f - < Date: Tue, 25 Nov 2025 22:44:44 +0000 Subject: [PATCH 17/19] assign external ip --- Makefile | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/Makefile b/Makefile index c0b7607e..8ede3d0a 100644 --- a/Makefile +++ b/Makefile @@ -156,15 +156,15 @@ argo-stack: deploy: init argo-stack docker-install ports ports: - # MetalLB provides LoadBalancer functionality for bare metal clusters. For now, we are not using AWS load balancer - kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.12/config/manifests/metallb-native.yaml - # Wait for MetalLB pods to be ready - kubectl wait --namespace metallb-system \ - --for=condition=ready pod \ - --selector=app=metallb \ - --timeout=90s - # Configure IP Address Pool - kubectl apply -f helm/argo-stack/overlays/ip-address-pool.yaml + # # MetalLB provides LoadBalancer functionality for bare metal clusters. 
For now, we are not using AWS load balancer + # kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.12/config/manifests/metallb-native.yaml + # # Wait for MetalLB pods to be ready + # kubectl wait --namespace metallb-system \ + # --for=condition=ready pod \ + # --selector=app=metallb \ + # --timeout=90s + # # Configure IP Address Pool + # kubectl apply -f helm/argo-stack/overlays/ip-address-pool.yaml # Add the Jetstack Helm repository helm repo add jetstack https://charts.jetstack.io helm repo update @@ -191,6 +191,8 @@ ports: helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \ -n ingress-nginx --create-namespace \ --set controller.service.type=NodePort + # Assign external address + kubectl patch svc ingress-nginx-controller -n ingress-nginx -p '{ "spec": { "type": "NodePort", "externalIPs": ["100.22.124.96"] } }' # Solution - Use NodePort instead of LoadBalancer in kind kubectl patch svc ingress-nginx-controller -n ingress-nginx -p '{"spec":{"type":"NodePort","ports":[{"port":80,"nodePort":30080},{"port":443,"nodePort":30443}]}}' From 11234c241db2492bf6ad37db9b885ee15a9d05c8 Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Tue, 25 Nov 2025 16:43:51 -0800 Subject: [PATCH 18/19] Add DNS-01 challenge debugging guide, manual certificate installation, and NodePort external IP configuration to troubleshooting docs (#102) * Initial plan * Add comprehensive DNS-01 challenge debugging guide with propagation error fixes Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> * Add manual certificate installation guide to troubleshooting docs Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com> --- docs/troubleshooting.md | 583 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 583 insertions(+) diff --git 
a/docs/troubleshooting.md b/docs/troubleshooting.md index 0dbcb544..46a0c144 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -575,6 +575,414 @@ If you still want manual control, you'll need to: For automated renewals, use Option A (acme-dns) instead. +#### Debugging DNS-01 Challenge Flow + +If you've configured DNS-01 challenges but still see self-signed certificates, follow these debugging steps: + +**Step 1: Verify ClusterIssuer Configuration** + +```bash +# Check ClusterIssuer exists and is ready +kubectl get clusterissuer +kubectl describe clusterissuer letsencrypt-prod + +# Look for status: Ready: True +# If not ready, check the status conditions for error messages +``` + +**Step 2: Check Certificate Resource Status** + +```bash +# Check certificate status +kubectl get certificate -n argo-stack +kubectl describe certificate calypr-demo-tls -n argo-stack + +# Look for: +# - Ready: False (certificate not issued) +# - Status conditions showing the reason (e.g., "Issuing", "NotReady") +# - Last transition time (stuck?) 
+``` + +**Step 3: Inspect CertificateRequest** + +```bash +# List certificate requests +kubectl get certificaterequest -n argo-stack + +# Describe the most recent one +kubectl describe certificaterequest -n argo-stack | head -50 + +# Look for: +# - Approved: True +# - Ready: False +# - Status message indicating DNS-01 challenge state +``` + +**Step 4: Check Challenge Resources (DNS-01 specific)** + +```bash +# List all challenges +kubectl get challenges -A + +# Describe the challenge +kubectl describe challenge -n argo-stack + +# Look for: +# - Type: DNS-01 +# - State: pending, valid, invalid, or errored +# - Reason field with specific error messages +# - Presented: True (DNS record was created) +``` + +**Step 5: Verify acme-dns Credentials Secret** + +If using acme-dns: + +```bash +# Check secret exists +kubectl get secret acme-dns-credentials -n cert-manager + +# Verify the secret has the correct key +kubectl get secret acme-dns-credentials -n cert-manager -o jsonpath='{.data.acmedns\.json}' | base64 -d | jq . 
+ +# Should return JSON with your domain configuration: +# { +# "calypr-demo.ddns.net": { +# "username": "...", +# "password": "...", +# "fulldomain": "xxx.auth.acme-dns.io", +# "subdomain": "xxx", +# "allowfrom": [] +# } +# } +``` + +**Step 6: Verify CNAME Record** + +```bash +# Check if CNAME record exists for _acme-challenge subdomain +nslookup -type=CNAME _acme-challenge.calypr-demo.ddns.net + +# Or use dig +dig _acme-challenge.calypr-demo.ddns.net CNAME +short + +# Should return something like: xxx.auth.acme-dns.io +``` + +If the CNAME is missing, add it to your DNS provider (e.g., No-IP.com): +``` +_acme-challenge.calypr-demo.ddns.net CNAME <fulldomain> +``` + +**Step 7: Check acme-dns TXT Record** + +```bash +# Get the fulldomain from your secret +FULLDOMAIN=$(kubectl get secret acme-dns-credentials -n cert-manager -o jsonpath='{.data.acmedns\.json}' | base64 -d | jq -r '."calypr-demo.ddns.net".fulldomain') + +echo "Full domain: $FULLDOMAIN" + +# Check if TXT record is created on acme-dns +dig @auth.acme-dns.io $FULLDOMAIN TXT +short + +# During a challenge, you should see a TXT record with the validation token +``` + +**Step 8: Check cert-manager Logs** + +```bash +# cert-manager controller logs (handles Certificate resources) +kubectl logs -n cert-manager -l app=cert-manager --tail=100 --follow + +# cert-manager webhook logs (handles DNS-01 challenge creation) +kubectl logs -n cert-manager -l app=webhook --tail=100 + +# Look for: +# - "DNS record created" or "DNS propagation check" +# - acme-dns API call logs +# - Authentication errors +# - "challenge not ready" or timeout messages +``` + +**Step 9: Verify acme-dns Webhook (if installed)** + +If you installed the acme-dns webhook: + +```bash +# Check webhook pod is running +kubectl get pods -n cert-manager | grep acme-dns + +# Check webhook logs +kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager-webhook-acme-dns --tail=50 + +# Test webhook connectivity to acme-dns server +kubectl run -it --rm debug
--image=curlimages/curl --restart=Never -- \ + curl -v https://auth.acme-dns.io/health +``` + +**Step 10: Manual Challenge Validation** + +Test if Let's Encrypt can validate your DNS-01 challenge: + +```bash +# Get the challenge token from the Challenge resource +kubectl get challenge -n argo-stack -o yaml + +# Look for spec.key (the validation token) +# The TXT record should be: _acme-challenge.calypr-demo.ddns.net -> CNAME -> fulldomain.auth.acme-dns.io + +# Verify Let's Encrypt can resolve it +dig _acme-challenge.calypr-demo.ddns.net TXT +short + +# This should resolve through the CNAME to the acme-dns TXT record +``` + +**Step 11: Force Certificate Reissue** + +After fixing configuration issues: + +```bash +# Delete failed challenges and certificate requests +kubectl delete challenges -n argo-stack --all +kubectl delete certificaterequest -n argo-stack --all + +# Optionally delete the certificate to trigger fresh issuance +kubectl delete certificate calypr-demo-tls -n argo-stack + +# cert-manager will automatically recreate them +# Watch the new challenge +kubectl get challenges -n argo-stack -w +``` + +**Common DNS-01 Issues and Solutions:** + +| Symptom | Likely Cause | Solution | +|---------|--------------|----------| +| "DNS record not yet propagated" | CNAME not configured or DNS cache | See detailed fix below | +| Challenge stuck in "pending" | CNAME not configured | Add `_acme-challenge.your-domain CNAME fulldomain.auth.acme-dns.io` | +| "invalid credentials" | Wrong acme-dns credentials | Re-register with acme-dns and update secret | +| "DNS record not found" | CNAME propagation delay | Wait 5-10 minutes for DNS propagation | +| "acme-dns: unauthorized" | Incorrect username/password | Verify credentials in secret match registration | +| Challenge "invalid" after 60s | DNS propagation too slow | Use longer `--dns01-self-check-period` flag on cert-manager | +| Certificate stays "Issuing" | Previous challenge failed | Delete old challenges: `kubectl delete 
challenges -A` | + +**Detailed Fix for "DNS record not yet propagated" Error:** + +If you see this error in cert-manager logs: +``` +"propagation check failed" err="DNS record for \"calypr-demo.ddns.net\" not yet propagated" +``` + +This means cert-manager is checking for the TXT record but can't find it. Follow these steps: + +**1. Verify the CNAME Record Exists:** + +```bash +# Check if CNAME exists +dig _acme-challenge.calypr-demo.ddns.net CNAME +short + +# Should return: xxx.auth.acme-dns.io +# If empty, the CNAME is missing - add it to your DNS provider +``` + +**2. Check DNS Resolution Path:** + +```bash +# Follow the full resolution chain +dig _acme-challenge.calypr-demo.ddns.net TXT +trace + +# This should show: +# 1. Query to root servers +# 2. Query to .net servers +# 3. Query to ddns.net servers (No-IP.com) +# 4. CNAME pointing to auth.acme-dns.io +# 5. TXT record on auth.acme-dns.io +``` + +**3. Verify acme-dns Has Created the TXT Record:** + +```bash +# Get the fulldomain from your acme-dns credentials +FULLDOMAIN=$(kubectl get secret acme-dns-credentials -n cert-manager \ + -o jsonpath='{.data.acmedns\.json}' | base64 -d | \ + jq -r '."calypr-demo.ddns.net".fulldomain') + +echo "Checking TXT record on: $FULLDOMAIN" + +# Query acme-dns directly +dig @auth.acme-dns.io $FULLDOMAIN TXT +short + +# Should return a TXT record like: "abc123def456..." +# If empty, acme-dns hasn't created the record yet +``` + +**4. Check cert-manager's View:** + +cert-manager uses specific DNS resolvers. Check what it sees: + +```bash +# Get the cert-manager pod name +CERT_MGR_POD=$(kubectl get pods -n cert-manager -l app=cert-manager -o jsonpath='{.items[0].metadata.name}') + +# Check DNS resolution from cert-manager's perspective +kubectl exec -n cert-manager $CERT_MGR_POD -- nslookup -type=TXT _acme-challenge.calypr-demo.ddns.net + +# If this fails but your local dig works, cert-manager is using different DNS servers +``` + +**5. 
Wait for DNS Propagation:** + +DNS changes can take time to propagate: + +```bash +# Watch the challenge status +kubectl get challenges -n argo-stack -w + +# cert-manager retries every 60 seconds by default +# Wait up to 10 minutes for DNS propagation +``` + +**6. Check for DNS Caching Issues:** + +```bash +# Flush local DNS cache (on your machine, not cluster) +# Linux: +sudo systemd-resolve --flush-caches + +# macOS: +sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder + +# Then retest +dig _acme-challenge.calypr-demo.ddns.net TXT +short +``` + +**7. Verify CNAME Configuration in No-IP.com:** + +Log into your No-IP.com account and verify: + +1. Go to **Dynamic DNS** → **Hostnames** +2. Click **Modify** on `calypr-demo.ddns.net` +3. Check if there's a **DNS Records** or **Advanced** section +4. Add CNAME record: + - **Subdomain**: `_acme-challenge` + - **Record Type**: CNAME + - **Target**: `<subdomain>.auth.acme-dns.io` (from your acme-dns registration) + +**8. If CNAME is Correct but Still Failing:** + +The issue might be cert-manager's DNS resolver configuration: + +```bash +# Check cert-manager deployment for custom DNS settings +kubectl get deployment -n cert-manager cert-manager -o yaml | grep -A 5 dnsPolicy + +# If using ClusterFirst (default), it uses cluster DNS (CoreDNS/kube-dns) +# Try using public DNS resolvers by adding flags to cert-manager: +kubectl set env deployment/cert-manager -n cert-manager \ + --containers=cert-manager \ + DNS01_RECURSIVE_NAMESERVERS=8.8.8.8:53,1.1.1.1:53 +``` + +**9.
Increase DNS Propagation Check Period:** + +If your DNS propagates slowly: + +```bash +# Edit cert-manager deployment to increase check period +kubectl edit deployment cert-manager -n cert-manager + +# Add to container args: +# - --dns01-recursive-nameservers-only=true +# - --dns01-self-check-period=10m + +# Or use kubectl set: +kubectl patch deployment cert-manager -n cert-manager --type='json' \ + -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--dns01-self-check-period=10m"}]' +``` + +**10. Verify acme-dns API Accessibility:** + +```bash +# Test from within the cluster +kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \ + curl -v https://auth.acme-dns.io/health + +# Should return: {"ok":true} + +# If this fails, check network policies or firewall rules blocking acme-dns +``` + +**11. Check acme-dns Registration:** + +Verify your acme-dns registration is correct: + +```bash +# View the credentials +kubectl get secret acme-dns-credentials -n cert-manager \ + -o jsonpath='{.data.acmedns\.json}' | base64 -d | jq . + +# Test the credentials directly +curl -X POST https://auth.acme-dns.io/update \ + -H "X-Api-User: <username>" \ + -H "X-Api-Key: <password>" \ + -d '{"subdomain":"<subdomain>","txt":"test123"}' + +# Should return: {"txt":"test123"} +``` + +**12. Monitor cert-manager Logs in Real-Time:** + +```bash +# Watch cert-manager process the DNS-01 challenge +kubectl logs -n cert-manager -l app=cert-manager --tail=100 -f | grep -i "dns\|propagation\|challenge" + +# Look for: +# - "Calling DNS01 Update" (acme-dns API call) +# - "Waiting for DNS-01 propagation" (checking DNS) +# - "DNS record propagated" (success!) +# - Specific error messages +``` + +After fixing the issue, the challenge should transition from "pending" to "valid", and cert-manager will issue the certificate.
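Many stuck challenges trace back to a malformed `acmedns.json` rather than DNS itself. Before (re)creating the secret, a quick local structural check can rule that out. A sketch with illustrative placeholder values (not real credentials; the domain key matches this guide's example):

```shell
# Write an illustrative credentials file (placeholders, not real values)
cat > /tmp/acmedns.json <<'EOF'
{
  "calypr-demo.ddns.net": {
    "username": "00000000-0000-0000-0000-000000000000",
    "password": "example-password",
    "fulldomain": "example.auth.acme-dns.io",
    "subdomain": "example",
    "allowfrom": []
  }
}
EOF

# Verify every field cert-manager needs is present for the domain before
# running: kubectl create secret generic ... --from-file=acmedns.json
python3 - <<'EOF'
import json, sys

creds = json.load(open("/tmp/acmedns.json"))
entry = creds.get("calypr-demo.ddns.net", {})
missing = [k for k in ("username", "password", "fulldomain", "subdomain") if not entry.get(k)]
if missing:
    sys.exit(f"acmedns.json is missing fields: {missing}")
print("acmedns.json looks structurally valid")
EOF
```

Run the same check against the file you actually feed to `kubectl create secret`; a missing or empty field here will surface later as an opaque "unauthorized" or propagation failure.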
+ +**Using Staging for DNS-01 Testing:** + +To avoid Let's Encrypt rate limits while debugging: + +```bash +# Create staging ClusterIssuer with DNS-01 +kubectl apply -f - <&1 | grep "subject:\|issuer:\|expire" +``` + +#### Step 4: Update Multiple Namespaces (if needed) + +If you have ingress resources in multiple namespaces using the same certificate: + +```bash +# Create the same secret in other namespaces +kubectl create secret tls calypr-demo-tls \ + -n calypr-api \ + --cert=/etc/letsencrypt/live/calypr-demo.ddns.net/fullchain.pem \ + --key=/etc/letsencrypt/live/calypr-demo.ddns.net/privkey.pem + +kubectl create secret tls calypr-demo-tls \ + -n calypr-tenants \ + --cert=/etc/letsencrypt/live/calypr-demo.ddns.net/fullchain.pem \ + --key=/etc/letsencrypt/live/calypr-demo.ddns.net/privkey.pem +``` + +#### Important Notes + +**Certificate Expiration:** +- Let's Encrypt certificates are valid for **90 days** +- You must manually renew before expiration +- Set a calendar reminder for 30 days before expiration + +**Check expiration date:** +```bash +kubectl get secret calypr-demo-tls -n argo-stack -o jsonpath='{.data.tls\.crt}' | \ + base64 -d | openssl x509 -noout -enddate +``` + +**Manual Renewal Process:** + +When the certificate is about to expire: + +```bash +# 1. Renew on the server where you originally obtained it +certbot renew + +# 2. Update the Kubernetes secret +kubectl create secret tls calypr-demo-tls \ + -n argo-stack \ + --cert=/etc/letsencrypt/live/calypr-demo.ddns.net/fullchain.pem \ + --key=/etc/letsencrypt/live/calypr-demo.ddns.net/privkey.pem \ + --dry-run=client -o yaml | kubectl apply -f - + +# 3. Restart ingress controller to pick up new certificate (optional) +kubectl rollout restart deployment ingress-nginx-controller -n ingress-nginx +``` + +#### Re-enabling Automated cert-manager Management + +If you later fix your DNS-01 or HTTP-01 setup and want to return to automated certificate management: + +```bash +# 1. 
Delete the manual secret +kubectl delete secret calypr-demo-tls -n argo-stack + +# 2. Re-add the cert-manager annotation to your ingress +kubectl annotate ingress ingress-authz-workflows -n argo-stack \ + cert-manager.io/cluster-issuer=letsencrypt-prod + +# 3. cert-manager will automatically create a new Certificate resource +# and obtain a certificate from Let's Encrypt + +# 4. Verify certificate is being issued +kubectl get certificate -n argo-stack +kubectl describe certificate calypr-demo-tls -n argo-stack +``` + +#### Alternative: Using cert-manager with Manual Certificates + +If you want cert-manager to manage the Certificate resource but provide your own cert: + +```bash +# Create a Certificate resource pointing to an existing secret +kubectl apply -f - < Date: Wed, 26 Nov 2025 01:21:34 +0000 Subject: [PATCH 19/19] Initial plan