[CI/CD Assessment] CI/CD Pipelines and Integration Tests Gap Assessment #990

@github-actions

📊 Current CI/CD Pipeline Status

The repository has a mature and layered CI/CD setup combining standard GitHub Actions workflows with agentic (AI-driven) workflows. Overall, the pipeline is healthy: all standard PR checks are passing at 100%, and most agentic build-test workflows succeed.

Workflow inventory (43 total):

| Category | Count | Trigger |
|---|---|---|
| Standard PR quality checks | 10 | pull_request |
| Agentic build-test integrations | 8 | pull_request |
| Agentic smoke tests (AI engines) | 5 | pull_request + schedule |
| Security scans | 3 | pull_request + schedule |
| Documentation / release / maintenance | 17 | push / schedule / dispatch |

Notable status from recent runs:

  • All standard PR workflows: ✅ 100% pass rate
  • Smoke Gemini: ❌ 0% pass rate (1 of 1 run failed) on recent PRs
  • Smoke Chroot: ❌ 0% pass rate (1 of 1 run failed) on recent PRs
  • Smoke Codex: ⚠️ 50% pass on recent PRs
  • CI Doctor (scheduled): ❌ 0% (0/6 runs passing)

✅ Existing Quality Gates

The following checks run on every PR:

| Check | Workflow | What It Validates |
|---|---|---|
| PR Title Format | pr-title.yml | Conventional commits (type, lowercase, allowed scopes) |
| Build Verification | build.yml | TypeScript compiles on Node 20 & 22 (matrix) |
| Linting | lint.yml | ESLint rules |
| TypeScript Type Check | test-integration.yml¹ | tsc --noEmit strict check |
| Test Coverage | test-coverage.yml | Unit tests + coverage regression gate vs base branch |
| CodeQL | codeql.yml | Static analysis for JS/TS and Actions |
| Dependency Audit | dependency-audit.yml | npm audit --audit-level=high on main + docs packages |
| Examples Test | test-examples.yml | Runs 4 real awf shell examples end-to-end |
| Action Self-Test | test-action.yml | Tests the setup action (4 scenarios incl. invalid version) |
| Chroot Integration | test-chroot.yml | 4 parallel chroot test suites (languages, pkg managers, procfs, edge cases) |
| Agentic Security Guard | security-guard.lock.yml | AI-powered PR review for security-weakening changes (Claude) |
| Agentic Build Tests | 8 build-test-*.lock.yml | Real build+test of downstream projects using each runtime through the firewall |
| Smoke Tests | 5 smoke-*.lock.yml | Full AI agent runs (Claude, Codex, Copilot, Gemini, Chroot) |
| Container Scan | container-scan.yml² | Trivy CRITICAL/HIGH CVE scan of agent + squid images |

¹ File is named test-integration.yml but only contains a TypeScript type check — misleading name.
² Only runs when containers/** files change.


🔍 Identified Gaps

🔴 High Priority

1. Integration test suite is not triggered on PRs (only chroot tests are)

The repository has 26 integration test files covering critical functionality: blocked domains, network security, credential hiding, API proxy, DNS, IPv6, exit codes, volume mounts, and more. Apart from the chroot subset, none of them run on PRs, so a PR that breaks the behavior covered by blocked-domains.test.ts or network-security.test.ts could be merged without CI detecting it.

Affected files:

  • tests/integration/blocked-domains.test.ts
  • tests/integration/network-security.test.ts
  • tests/integration/credential-hiding.test.ts
  • tests/integration/api-proxy.test.ts
  • tests/integration/dns-servers.test.ts
  • tests/integration/ipv6.test.ts
  • …and 20 more

2. Unit test coverage is critically low for the most important files

Coverage of the most security-relevant files, and the thresholds that gate it, are far below acceptable levels for a security tool:

| File | Lines | Branches | Functions | Risk |
|---|---|---|---|---|
| docker-manager.ts (250 lines, core logic) | 18% | 22% | 4% | 🔴 Critical |
| cli.ts (69 lines, entry point) | 0% | 0% | 0% | 🔴 Critical |
| domain-patterns.ts | Unknown | Unknown | Unknown | ⚠️ |

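The configured thresholds are 38% lines, 30% branches, and 35% functions, which is far too low for a firewall. With only 18% of its lines covered, most of docker-manager.ts is never exercised by a unit test, so a security bypass introduced there is unlikely to be caught at the unit-test level.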
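
One way to track docker-manager.ts and cli.ts specifically is a per-file floor checked against the coverage report. The sketch below is illustrative only: it assumes the coverage tool can emit an Istanbul-style json-summary object (file path mapped to `lines`/`branches`/`functions`, each with a `pct` field), and the `criticalFiles` map and the function name are hypothetical. The 50/40/50 floors are the 90-day targets from the recommendations table, used here as placeholders.

```typescript
// Shape of one entry in an Istanbul-style json-summary report (assumed format).
interface Metric { pct: number }
interface FileCoverage { lines: Metric; branches: Metric; functions: Metric }
type CoverageSummary = Record<string, FileCoverage>;

interface Thresholds { lines: number; branches: number; functions: number }

// Hypothetical per-file floors, keyed by path suffix. Placeholder numbers.
const criticalFiles: Record<string, Thresholds> = {
  "/docker-manager.ts": { lines: 50, branches: 40, functions: 50 },
  "/cli.ts": { lines: 50, branches: 40, functions: 50 },
};

// Returns one message per metric that falls below its floor.
function findCoverageViolations(
  summary: CoverageSummary,
  floors: Record<string, Thresholds>,
): string[] {
  const violations: string[] = [];
  for (const [suffix, floor] of Object.entries(floors)) {
    // Report keys are often absolute paths, so match on the path suffix.
    const key = Object.keys(summary).find((k) => k.endsWith(suffix));
    if (key === undefined) {
      violations.push(`${suffix}: no coverage data found`);
      continue;
    }
    const actual = summary[key];
    for (const metric of ["lines", "branches", "functions"] as const) {
      if (actual[metric].pct < floor[metric]) {
        violations.push(`${key}: ${metric} ${actual[metric].pct}% < ${floor[metric]}%`);
      }
    }
  }
  return violations;
}
```

A CI step could print the returned messages and exit non-zero when the list is not empty, which keeps the global thresholds and the per-file floors independent of each other.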

3. security-guard.md and build-test-java.md are not compiled

Both show compiled: No in gh aw status. An uncompiled agentic workflow means the .lock.yml is stale and may not reflect the .md source. Any change to security-guard.md will not take effect until recompiled, creating a silent gap in the security review gate.
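
A freshness check for the compiled files could recompile in CI and then fail if any .lock.yml differs from what is committed. The sketch below covers only the last step, under stated assumptions: it expects the text output of `git status --porcelain` (captured after running `gh aw compile`) and returns the lock files that changed. The function name is hypothetical, and paths that git quotes because of special characters are not handled.

```typescript
// Extracts changed .lock.yml paths from `git status --porcelain` output.
// Each porcelain (v1) line is two status characters, a space, then the path.
// Lines are not trimmed because a leading space is part of the status field.
function findStaleLockFiles(porcelainOutput: string): string[] {
  return porcelainOutput
    .split("\n")
    .filter((line) => line.length > 3)
    .map((line) => line.slice(3))
    .filter((filePath) => filePath.endsWith(".lock.yml"));
}
```

If the returned list is not empty, the committed lock files no longer match their .md sources and the check should fail with the list of affected files.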

4. CI Doctor workflow is consistently failing (0% over 6 runs)

The ci-doctor workflow — which monitors for failures across all other workflows — has a 0% success rate in recent runs. This means the automated failure-detection watchdog is itself broken, leaving CI issues undetected and unresolved.


🟡 Medium Priority

5. Container security scan is path-gated — misses logic changes

container-scan.yml only triggers when containers/** files change. A change to src/squid-config.ts or src/docker-manager.ts that modifies the generated container configuration will not trigger a fresh Trivy scan on that PR. The scan runs weekly on a schedule, but not in PR context for source changes.
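
The effect of the path gate can be illustrated with a simplified model. This is a sketch, not how GitHub Actions evaluates `paths` filters: it treats each filter as a plain directory prefix rather than a glob, and the function and variable names are hypothetical.

```typescript
// Simplified stand-in for a workflow `paths` filter: a change matches when it
// sits under one of the given directory prefixes (real filters use globs).
function triggersScan(changedFiles: string[], watchedPrefixes: string[]): boolean {
  return changedFiles.some((file) =>
    watchedPrefixes.some((prefix) => file.startsWith(prefix)),
  );
}

// A PR that only touches the code generating the container configuration.
const sourceOnlyChange = ["src/squid-config.ts"];

const currentFilter = ["containers/"];         // models containers/**
const widenedFilter = ["containers/", "src/"]; // models containers/** plus src/**

const scannedToday = triggersScan(sourceOnlyChange, currentFilter);
const scannedIfWidened = triggersScan(sourceOnlyChange, widenedFilter);
```

Under this model the source-only change is skipped by the current filter and picked up once src/ is added to the watched paths.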

6. No shell script linting (shellcheck)

The repository contains critical shell scripts (containers/agent/setup-iptables.sh, containers/agent/entrypoint.sh, containers/squid/entrypoint.sh, all scripts/ci/*.sh) that implement the core iptables rules and security hardening. None are validated by shellcheck in CI. Shell script bugs in iptables setup could silently break the firewall.

7. Coverage thresholds need significant upward revision

Current thresholds (38% lines, 30% branches) are appropriate for a bootstrap stage, but given the security-critical nature of this codebase they should be progressively raised. test-coverage.yml compares each PR against the base branch, but nothing raises the thresholds automatically as coverage improves, so the fixed floor stays at its bootstrap level.
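
A ratchet can be as small as a function that lifts each threshold to the measured value and never lowers it. The following is a minimal sketch under assumptions: the `Thresholds` shape, the function name, and the optional `margin` parameter are hypothetical, and reading the measured values from the coverage report and writing the result back to the test configuration are left out.

```typescript
interface Thresholds { lines: number; branches: number; functions: number }

// Raises each threshold to the measured value (minus an optional safety
// margin, rounded down) and never lowers an existing threshold.
function ratchetThresholds(
  current: Thresholds,
  measured: Thresholds,
  margin = 0,
): Thresholds {
  const bump = (floor: number, actual: number): number =>
    Math.max(floor, Math.floor(actual - margin));
  return {
    lines: bump(current.lines, measured.lines),
    branches: bump(current.branches, measured.branches),
    functions: bump(current.functions, measured.functions),
  };
}
```

With the branch figures reported in this assessment (31.78% measured against a 30% threshold), this would raise the branch threshold to 31. Rounding down and the optional margin are there to avoid failing builds on small run-to-run fluctuations.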

8. Smoke tests have significant reliability issues

Three of five smoke test workflows show failures or instability on recent PRs:

  • Smoke Gemini: 0% pass rate (1 of 1 recent run failed)
  • Smoke Chroot: 0% pass rate on recent PRs
  • Smoke Codex: 50% pass rate

Flaky smoke tests train developers to ignore CI failures ("it's probably just flaky"), which reduces the signal value of the entire CI system.
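
Explicit flakiness tracking could start from a simple classification of each workflow's recent run conclusions. The sketch below is illustrative: the type names, the function name, and the three health categories are assumptions, and fetching the run history from the GitHub API is out of scope.

```typescript
type Conclusion = "success" | "failure";
type Health = "reliable" | "flaky" | "failing" | "no-data";

// Classifies a workflow from its recent run conclusions:
// all passing is reliable, all failing is failing, anything mixed is flaky.
function classifyWorkflow(runs: Conclusion[]): { passRate: number; health: Health } {
  if (runs.length === 0) return { passRate: 0, health: "no-data" };
  const passed = runs.filter((run) => run === "success").length;
  const passRate = passed / runs.length;
  const health: Health =
    passRate === 1 ? "reliable" : passRate === 0 ? "failing" : "flaky";
  return { passRate, health };
}
```

Separating "failing" from "flaky" matters for triage: a workflow that never passes points to a broken setup, while a mixed history points to nondeterminism that retries might mask.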

9. No dist/ artifact size monitoring

There is no check on the size of the compiled dist/ output or packaged binaries. A dependency accidentally bundled or a large file added to dist/ would pass all current checks. For a security tool distributed as pre-built binaries (4 platform targets), artifact size is a meaningful integrity signal.
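
A size gate needs little more than a recursive directory walk and a budget. The sketch below uses only Node's built-in fs and path modules. The function names are hypothetical and the budget value is left to the caller, since this assessment does not state what size dist/ currently has.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Recursively sums the size in bytes of every regular file under `dir`.
// Symbolic links and other special entries are skipped.
function directorySize(dir: string): number {
  let total = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) total += directorySize(full);
    else if (entry.isFile()) total += fs.statSync(full).size;
  }
  return total;
}

// Throws when the directory exceeds the byte budget, so a CI step running
// this would fail. Returns the measured size otherwise.
function assertWithinBudget(dir: string, maxBytes: number): number {
  const size = directorySize(dir);
  if (size > maxBytes) {
    throw new Error(`${dir} is ${size} bytes, over the ${maxBytes}-byte budget`);
  }
  return size;
}
```

Logging the returned size on every run also gives a history to pick a sensible budget from before the check is made blocking.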


🟢 Low Priority

10. No mutation testing

The test suite validates behavior, but there is no mutation testing (e.g., Stryker) to verify that tests would actually catch regressions. This is especially relevant given the low branch coverage in docker-manager.ts — tests may pass even when logic is wrong.

11. No code complexity / maintainability gate

There is no cyclomatic complexity check or maintainability index in CI. docker-manager.ts, at 250+ lines with 4% function coverage and complex branching logic, is a natural first target for such a gate.

12. Coverage badge not published

The test-coverage.yml uploads reports to artifacts and posts PR comments, but no coverage badge is published to the README or an external service (Codecov, Coveralls). This reduces visibility of coverage health for contributors.

13. test-integration.yml is misnamed

The file named test-integration.yml actually runs only a TypeScript type check — the same check that runs in build.yml and elsewhere. This creates confusion about what "integration" means in this repository's CI and makes it harder for contributors to understand the test matrix.


📋 Actionable Recommendations

| Gap | Recommended Solution | Complexity | Impact |
|---|---|---|---|
| #1 Integration tests not on PRs | Add test-integration-core.yml running blocked-domains, network-security, credential-hiding, exit-code-propagation, dns-servers on PRs | Medium | 🔴 High |
| #2 Low coverage on critical files | Set a 90-day coverage improvement plan: raise thresholds to 50/40/50 within 90 days; track docker-manager.ts & cli.ts specifically | Low | 🔴 High |
| #3 Uncompiled workflows | Run gh aw compile security-guard build-test-java && npx tsx scripts/ci/postprocess-smoke-workflows.ts and add a CI check that verifies all .lock.yml files are up-to-date | Low | 🔴 High |
| #4 CI Doctor always failing | Investigate and fix the ci-doctor workflow; it is the monitoring system for all other workflows | Medium | 🔴 High |
| #5 Path-gated container scan | Add src/** to the container-scan.yml path triggers so source changes also trigger a fresh Trivy scan | Low | 🟡 Medium |
| #6 No shellcheck | Add a shellcheck step (using ludeeus/action-shellcheck) in build.yml or a dedicated shell-lint.yml covering containers/**/*.sh and scripts/**/*.sh | Low | 🟡 Medium |
| #7 Low coverage thresholds | Add a coverage ratchet script that reads the current coverage and updates thresholds upward automatically | Medium | 🟡 Medium |
| #8 Flaky smoke tests | Add retry logic or mark smoke tests as continue-on-error: true with explicit flakiness tracking; investigate root causes for Gemini/Chroot failures | Medium | 🟡 Medium |
| #9 No artifact size check | Add a step in build.yml that fails if dist/ exceeds a configured size threshold | Low | 🟢 Low |
| #13 Misnamed workflow | Rename test-integration.yml to type-check.yml and update all references | Low | 🟢 Low |

📈 Metrics Summary

| Metric | Value |
|---|---|
| Total workflows | 43 (29 agentic .md + 14 standard .yml) |
| Workflows running on PRs | ~25 |
| Standard PR workflow pass rate (recent) | 100% |
| Agentic smoke test reliability (recent PRs) | 2 of 5 workflows reliable (Gemini and Chroot at 0%, Codex at 50%) |
| CI Doctor health | 0% (broken watchdog) |
| Unit test coverage, statements | 38.39% (threshold: 38%) |
| Unit test coverage, branches | 31.78% (threshold: 30%) |
| Unit test coverage, docker-manager.ts | 18% |
| Unit test coverage, cli.ts | 0% |
| Integration test files | 26 total; ~4 suites on PRs (chroot only) |
| Shell scripts without linting | ~12 files |
| Uncompiled agentic workflows | 2 (security-guard, build-test-java) |

The repository has an excellent foundation with diverse quality gates, real end-to-end smoke tests, AI-powered security review, and per-PR coverage comparison. The primary gaps are: running the broader integration test suite on PRs, improving unit test coverage for the critical docker-manager.ts and cli.ts files, fixing the broken CI Doctor, and compiling the two stale agentic workflows.


Note: This was intended to be a discussion, but discussions could not be created due to permissions issues. This issue was created as a fallback.

Tip: Discussion creation may fail if the specified category is not announcement-capable. Consider using the "Announcements" category or another announcement-capable category in your workflow configuration.

Generated by CI/CD Pipelines and Integration Tests Gap Assessment

  • expires on Feb 27, 2026, 10:19 PM UTC
