@krusche krusche commented Feb 8, 2026

Summary

Add a cached Docker availability flag to the build agent that prevents thousands of SocketException stack traces from polluting CI server logs when Docker is not available (e.g. on CI servers without /var/run/docker.sock). The existing updateDockerVersion() scheduled task is extended to also probe Docker availability every 60 seconds. All Docker-touching code paths check this flag before attempting operations, failing fast with clean WARN/DEBUG messages instead of ERROR with full stack traces.

Checklist

General

Server

Changes affecting Programming Exercises

  • High priority: I tested all changes and their related features with all corresponding user types on a test server configured with the integrated lifecycle setup (LocalVC and LocalCI).

Motivation and Context

When running server tests on CI, Docker is not available (/var/run/docker.sock does not exist). This causes ~2,794 SocketException: No such file or directory stack traces in CI logs per test run. Multiple services independently discover Docker is unreachable and each logs full stack traces, massively polluting the output and making it harder to identify real test failures.

The existing dockerClientNotAvailable() check only verifies if the Java DockerClient object is null — which it never is, because client creation always succeeds regardless of whether Docker is actually running.

Description

Core change: A volatile boolean dockerAvailable field in BuildAgentConfiguration with getter/setter. This flag is maintained by the existing updateDockerVersion() scheduled task in BuildAgentInformationService, which now runs every 60 seconds using fixedDelay (prevents overlap if Docker is slow to respond). The versionCmd().exec() call already performed by this task implicitly serves as the availability ping.
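The core change can be sketched without Spring roughly as follows. This is a minimal illustration, not the Artemis code: `AvailabilityFlag`, `AvailabilityProbe`, and the `versionProbe` supplier are hypothetical names. The supplier stands in for the `versionCmd().exec()` call and is assumed to throw when Docker is unreachable, and `scheduleWithFixedDelay` stands in for Spring's `fixedDelay` scheduling.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical stand-in for the flag holder (not the real BuildAgentConfiguration).
final class AvailabilityFlag {

    // volatile: written by the probe thread, read by worker threads without locking
    private volatile boolean dockerAvailable;

    boolean isDockerAvailable() {
        return dockerAvailable;
    }

    void setDockerAvailable(boolean dockerAvailable) {
        this.dockerAvailable = dockerAvailable;
    }
}

// Hypothetical probe; versionProbe is assumed to throw when Docker is unreachable.
final class AvailabilityProbe {

    private final AvailabilityFlag flag;

    private final Supplier<String> versionProbe;

    AvailabilityProbe(AvailabilityFlag flag, Supplier<String> versionProbe) {
        this.flag = flag;
        this.versionProbe = versionProbe;
    }

    // One probe: a successful version call doubles as the availability ping.
    void probeOnce() {
        try {
            versionProbe.get();
            flag.setDockerAvailable(true);
        }
        catch (RuntimeException e) {
            flag.setDockerAvailable(false);
        }
    }

    // Fixed delay: the next probe is scheduled only after the previous one has finished.
    ScheduledExecutorService start(long delaySeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::probeOnce, 0, delaySeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Docker-touching code would read `flag.isDockerAvailable()` before each operation and return early when it is false. The scheduler returned by `start(60)` should be shut down with `shutdownNow()` when the agent stops.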

Guarded code paths:

  • BuildAgentDockerService: Enhanced dockerClientNotAvailable() to also check isDockerAvailable(). Added guards in pullDockerImage() and deleteOldDockerImages(). Changed cleanUpContainers() catch blocks from ERROR to DEBUG for Docker unavailability.
  • BuildJobContainerService: Added isDockerAvailable() guard in getContainerForName(). Added Docker unavailability checks in archive retrieval/copy error handlers.
  • BuildJobManagementService: Simplified Docker unavailability error to WARN without stack trace.
  • SharedQueueProcessingService: Added DockerUtil.isDockerNotAvailable(ex) check in the exceptionally handler to log WARN with message only instead of ERROR with full stack trace.
  • DockerUtil: Fixed isDockerNotAvailable() to traverse the full cause chain instead of only checking one level deep. Fixed isDockerConnectionRefused() to check the throwable directly (consistent with isDockerSocketNotAvailable()). Added cycle protection.
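The cause-chain traversal with cycle protection mentioned for DockerUtil could look roughly like the sketch below. `CauseChainCheck` and its message-based `matches` predicate are assumptions made for illustration; the actual checks in DockerUtil are not shown in this PR description and may differ.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class CauseChainCheck {

    private CauseChainCheck() {
        // static helpers only
    }

    // Assumed single-throwable predicate; the real DockerUtil checks may differ.
    static boolean matches(Throwable throwable) {
        String message = throwable.getMessage();
        return message != null && (message.contains("No such file or directory") || message.contains("Connection refused"));
    }

    // Walks the whole cause chain; the identity set stops the loop if the chain is cyclic.
    static boolean matchesAnywhere(Throwable throwable) {
        Set<Throwable> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        Throwable current = throwable;
        while (current != null && visited.add(current)) {
            if (matches(current)) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }
}
```

An identity-based set is used so that two distinct exceptions that happen to be `equals` are still both visited, while a chain that loops back to an earlier throwable terminates.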

State transition logging:

  • WARN when Docker becomes unavailable (first detection)
  • DEBUG for repeated unavailability checks
  • INFO when Docker becomes available again
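The three transitions listed above amount to a small decision table. The sketch below is illustrative only: `AvailabilityTransition` and its `Level` enum are hypothetical names, and staying available is assumed to produce no log line, which the description does not state explicitly.

```java
import java.util.Optional;

final class AvailabilityTransition {

    enum Level {
        WARN, DEBUG, INFO
    }

    private AvailabilityTransition() {
        // static helper only
    }

    // Maps (previous state, current state) to the log level for the current probe result.
    static Optional<Level> levelFor(boolean wasAvailable, boolean isAvailable) {
        if (!isAvailable) {
            // First detection is loud, repeated detections are quiet.
            return Optional.of(wasAvailable ? Level.WARN : Level.DEBUG);
        }
        // Assumption: staying available is not logged at all.
        return wasAvailable ? Optional.empty() : Optional.of(Level.INFO);
    }
}
```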

Test updates: Both AbstractArtemisBuildAgentTest and AbstractProgrammingIntegrationLocalCILocalVCTestBase are updated to set dockerAvailable = true since tests use mocked Docker clients.

Steps for Testing

Prerequisites:

  • Access to a build agent test environment
  1. Run build agent tests: ./gradlew test --tests "de.tum.cit.aet.artemis.buildagent.*" --tests "de.tum.cit.aet.artemis.programming.icl.BuildAgentDockerServiceTest" -x webapp — all 97 tests should pass
  2. Run full server tests without Docker: ./gradlew test -x webapp and check logs for SocketException — should see ~0 ERROR-level occurrences (down from ~2,794)
  3. Start a build agent with Docker running, verify builds execute normally
  4. Stop Docker, verify the build agent logs one WARN about Docker becoming unavailable, and subsequent operations skip with DEBUG logs
  5. Restart Docker, verify the build agent logs INFO about Docker becoming available again, and operations resume

Testserver States

You can manage test servers using Helios. Check environment statuses in the environment list. To deploy to a test server, go to the CI/CD page, find your PR or branch, and trigger the deployment.

Review Progress

Performance Review

  • I (as a reviewer) confirm that the server changes (in particular related to database calls) are implemented with a very good performance even for very large courses with more than 2000 students.

Code Review

  • Code Review 1
  • Code Review 2

Manual Tests

  • Test 1
  • Test 2

Test Coverage

Server

Class/File | Line Coverage | Lines
BuildAgentConfiguration.java | 85.71% | 186
BuildAgentDockerService.java | 61.68% | 360
BuildAgentInformationService.java | 81.31% | 205
BuildJobContainerService.java | 72.96% | 531
BuildJobManagementService.java | 72.55% | 284
DockerUtil.java | 80.00% | 25
SharedQueueProcessingService.java | 58.44% | 642

Last updated: 2026-02-08 12:48:09 UTC

…g noise

Add a cached dockerAvailable flag to BuildAgentConfiguration that is
maintained by the existing scheduled updateDockerVersion task (now every
60s with fixedDelay). All Docker-touching code paths check this flag
before attempting operations, preventing thousands of SocketException
stack traces when Docker is not available (e.g. on CI servers without
Docker). Error logging for Docker unavailability is rationalized from
ERROR with full stack traces to WARN/DEBUG with concise messages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@krusche krusche requested a review from a team as a code owner February 8, 2026 12:22
@krusche krusche added this to the 8.8.1 milestone Feb 8, 2026
@krusche krusche self-assigned this Feb 8, 2026
@github-project-automation github-project-automation bot moved this to Work In Progress in Artemis Development Feb 8, 2026
@github-actions github-actions bot added the tests, server, buildagent, and programming labels Feb 8, 2026

coderabbitai bot commented Feb 8, 2026

Walkthrough

Adds a dockerAvailable flag to BuildAgentConfiguration, optimistically sets it true on client init/open, and propagates Docker availability checks across services; services and utilities guard operations, adjust logging when Docker is unreachable, and BuildAgentInformationService updates version/availability on a fixedDelay schedule.

Changes

Cohort / File(s) Summary
Core Docker Availability State
src/main/java/de/tum/cit/aet/artemis/buildagent/BuildAgentConfiguration.java
Adds dockerAvailable field with public isDockerAvailable() and setDockerAvailable(...); lifecycle hooks (app ready, open, close) set availability (optimistically true on init/open, false on close).
Docker Utilities
src/main/java/de/tum/cit/aet/artemis/buildagent/service/DockerUtil.java
Adds isDockerNotAvailable(Throwable) that traverses causes (with cycle protection) to detect socket absence or connection-refused patterns; simplifies/refactors connection-refused detection.
Docker Service Operations
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java
Adds pre-checks/early returns when Docker unavailable; reduces noisy error logs (debug/warn for Docker-not-available), guards cleanup/pull/delete flows, and throws LocalCIException when pull attempted while unavailable.
Information & Scheduling
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentInformationService.java
Changes scheduled check from fixedRate to fixedDelay; tracks previous availability, marks dockerAvailable on successful version retrieval, updates dockerVersion and calls updateLocalBuildAgentInformation(false) when version changes; marks unavailable and logs accordingly on failures.
Container & Job Services
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildJobContainerService.java, src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildJobManagementService.java, src/main/java/de/tum/cit/aet/artemis/buildagent/service/SharedQueueProcessingService.java
Guard operations against Docker unavailability: container lookups may return early, retries and repeated failures log warnings for Docker-not-available cases (errors retained for other failures), and some exception/log checks simplified.
Tests & Docs
src/test/java/de/tum/cit/aet/artemis/programming/AbstractProgrammingIntegrationLocalCILocalVCTestBase.java, src/test/java/de/tum/cit/aet/artemis/shared/base/AbstractArtemisBuildAgentTest.java
Test stubs leniently return isDockerAvailable() = true; comment clarifies openBuildAgentServices() sets dockerAvailable = true.

Sequence Diagram

sequenceDiagram
    participant Info as BuildAgentInformationService
    participant Config as BuildAgentConfiguration
    participant Docker as BuildAgentDockerService
    participant Container as BuildJobContainerService
    participant Job as BuildJobManagementService

    Note over Config: onAppReady / open -> setDockerAvailable(true)
    Info->>Config: request docker version
    alt version retrieved (Docker available)
        Config-->>Info: docker client + version
        Info->>Config: setDockerAvailable(true) if changed
        Info->>Docker: request image pull / cleanup
        Docker->>Container: list/create/delete containers/images
        Container->>Job: execute container operations (copy, archive, run)
        Job-->>Info: report job state
    else cannot retrieve (Docker unavailable)
        Config-->>Info: indicate not available
        Info->>Config: setDockerAvailable(false)
        Note over Docker,Container: guard checks -> early return / skip ops (debug/warn)
        Job-->>Info: log warning, handle gracefully
    end
    Note over Info: periodic re-check (fixedDelay) -> update version/availability

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 48.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and specifically describes the main change: adding Docker availability checks to reduce CI log noise.
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java (1)

486-491: ⚠️ Potential issue | 🟡 Minor

Pre-existing: getFirst() can throw NoSuchElementException.

Not introduced by this PR, but at Line 489, after mutableSortedImagesByLastBuildDate.remove(oldestImage), getFirst() is called without checking if the list is empty. If the last element was just removed, this throws NoSuchElementException. The outer while loop condition never re-evaluates before this call.

Suggested fix
                 mutableSortedImagesByLastBuildDate.remove(oldestImage);
-                oldestImage = mutableSortedImagesByLastBuildDate.getFirst();
+                oldestImage = mutableSortedImagesByLastBuildDate.isEmpty() ? null : mutableSortedImagesByLastBuildDate.getFirst();
                 totalAttempts--;
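The guard in the suggested fix can be tried in isolation with the sketch below. It is not the Artemis method: `OldestFirstDrain` and `drain` are hypothetical names, and `LinkedList` is used as a stand-in because its `getFirst()` is documented to throw `NoSuchElementException` on an empty list.

```java
import java.util.LinkedList;

final class OldestFirstDrain {

    private OldestFirstDrain() {
        // static helper only
    }

    // Removes entries oldest-first until the list is empty or the attempts run out.
    static int drain(LinkedList<String> sortedImages, int totalAttempts) {
        int removed = 0;
        String oldestImage = sortedImages.isEmpty() ? null : sortedImages.getFirst();
        while (oldestImage != null && totalAttempts > 0) {
            sortedImages.remove(oldestImage);
            removed++;
            // Guard: getFirst() throws on an empty list, so check before calling it.
            oldestImage = sortedImages.isEmpty() ? null : sortedImages.getFirst();
            totalAttempts--;
        }
        return removed;
    }
}
```

With the guard in place, the `oldestImage != null` loop condition becomes the actual exit path once the last image has been removed.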
🤖 Fix all issues with AI agents
In
`@src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java`:
- Around line 340-345: The catch block in BuildAgentDockerService (inside the
method that inspects/pulls images) always throws a LocalCIException with the
message "Docker is not available..." even when the root cause is different;
update the catch to distinguish cases using DockerUtil.isDockerNotAvailable(ex):
if true, log a warn and then throw a LocalCIException with a message stating
Docker is not available (including ex.getMessage()); otherwise throw a
LocalCIException with a generic, descriptive message about failing to pull/inspect
the image (e.g., "Failed to pull/inspect image <imageName>") and include the
original exception so the real cause (auth, timeout, manifest) is
preserved—adjust the code paths around the catch in BuildAgentDockerService to
conditionally set the thrown message accordingly.
🧹 Nitpick comments (3)
src/main/java/de/tum/cit/aet/artemis/buildagent/service/DockerUtil.java (1)

9-9: Consider adding a private constructor to this utility class.

DockerUtil is a final class with only static methods but no explicit constructor, so Java generates a default public one. A private constructor would prevent accidental instantiation.

♻️ Proposed fix
 public final class DockerUtil {
 
+    private DockerUtil() {
+        // Utility class
+    }
+
     public static boolean isDockerSocketNotAvailable(Throwable throwable) {
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentInformationService.java (1)

90-99: Docker recovery without version change won't update distributed map.

When Docker transitions from unavailable → available but the version hasn't changed (e.g., brief network blip), updateLocalBuildAgentInformation(false) on line 98 is not called. The dockerAvailable flag is set locally (line 92), so local services behave correctly, but the distributed map won't reflect the recovery until the next periodic updateBuildAgentInformation() call (every 10s in SharedQueueProcessingService). This seems acceptable given the 10s update interval, but if immediate distributed visibility matters, consider also calling updateLocalBuildAgentInformation(false) after setDockerAvailable(true).

♻️ Optional: propagate recovery to distributed map immediately
             if (!wasAvailable) {
                 log.info("Docker is now available (version: {})", newVersion);
                 buildAgentConfiguration.setDockerAvailable(true);
+                updateLocalBuildAgentInformation(false);
             }
-            if (!Objects.equals(newVersion, dockerVersion)) {
+            if (!Objects.equals(newVersion, dockerVersion)) {
                 log.info("Docker version: {}", newVersion);
                 dockerVersion = newVersion;
-                // Update the build agent information in the distributed map
                 updateLocalBuildAgentInformation(false);
             }

Note: This may cause a double updateLocalBuildAgentInformation call when both recovery and version change happen simultaneously. If that's a concern, you could use an else if or a flag to coalesce the calls.
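One way to coalesce the two calls, as the note suggests, is to compute a single decision first and update the distributed map once. The helper below is a hypothetical sketch (`PublishDecision` and `shouldPublish` are not part of the PR) and assumes it is evaluated only in the branch where the version probe succeeded.

```java
import java.util.Objects;

final class PublishDecision {

    private PublishDecision() {
        // static helper only
    }

    // True when the distributed map should be refreshed: on recovery, on a version change, or both.
    static boolean shouldPublish(boolean wasAvailable, String oldVersion, String newVersion) {
        boolean recovered = !wasAvailable;
        boolean versionChanged = !Objects.equals(newVersion, oldVersion);
        return recovered || versionChanged;
    }
}
```

The scheduled task would then call `updateLocalBuildAgentInformation(false)` at most once per probe, guarded by this result.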

src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildJobContainerService.java (1)

667-672: Double logging when Docker is unavailable after retries are exhausted.

executeWithRetry logs a WARN at Line 668 when Docker is not available, then throws lastException. The callers (getArchiveFromContainer at Line 323 and copyToContainer at Line 718) catch the same exception, perform the same isDockerNotAvailable check, and log another WARN with a similar message. This produces duplicate WARN entries for a single failure.

Consider either removing the Docker-unavailability log in executeWithRetry (lines 667-672) and letting callers handle it, or removing the caller-side Docker-availability check and always using log.error in the callers since executeWithRetry already differentiates.

Also applies to: 322-327, 717-722
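The first option (no logging inside the retry helper, one log line in the caller) could be structured as in this sketch. All names here are hypothetical: `SingleLogRetry`, `runAndLog`, the in-memory `logLines` list standing in for a logger, and the injected `dockerUnavailable` predicate standing in for `DockerUtil.isDockerNotAvailable`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

final class SingleLogRetry {

    // Stand-in for a logger so the number of emitted lines can be inspected.
    final List<String> logLines = new ArrayList<>();

    private final Predicate<Throwable> dockerUnavailable;

    SingleLogRetry(Predicate<Throwable> dockerUnavailable) {
        this.dockerUnavailable = dockerUnavailable;
    }

    // Retries silently and rethrows the last failure; logging is left to the caller.
    <T> T executeWithRetry(Callable<T> operation, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastException = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            }
            catch (Exception e) {
                lastException = e;
            }
        }
        throw lastException;
    }

    // Caller-side handler: the only place that logs, so each failure yields one line.
    <T> T runAndLog(Callable<T> operation, int maxAttempts) {
        try {
            return executeWithRetry(operation, maxAttempts);
        }
        catch (Exception e) {
            String level = dockerUnavailable.test(e) ? "WARN" : "ERROR";
            logLines.add(level + " operation failed: " + e.getMessage());
            return null;
        }
    }
}
```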

@github-project-automation github-project-automation bot moved this from Work In Progress to Ready For Review in Artemis Development Feb 8, 2026

github-actions bot commented Feb 8, 2026

@krusche Test coverage has been automatically updated in the PR description.

coderabbitai bot previously approved these changes Feb 8, 2026
Split the catch block in pullDockerImage to throw Docker-specific
exception message only when Docker is actually unavailable, and a
generic message for other errors (auth, timeout, manifest issues).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

⚠️ Outside diff range comments (1)
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java (1)

486-491: ⚠️ Potential issue | 🟠 Major

getFirst() on a potentially empty list will throw NoSuchElementException.

After removing the last element on Line 489, if mutableSortedImagesByLastBuildDate is empty, Line 490 throws. The while guard on Line 477 checks oldestImage != null, but getFirst() never returns null—it throws on an empty list.

This is pre-existing code, but the surrounding method was touched in this PR. A simple guard would prevent the crash:

Proposed fix
                 mutableSortedImagesByLastBuildDate.remove(oldestImage);
-                oldestImage = mutableSortedImagesByLastBuildDate.getFirst();
+                oldestImage = mutableSortedImagesByLastBuildDate.isEmpty() ? null : mutableSortedImagesByLastBuildDate.getFirst();
                 totalAttempts--;
🤖 Fix all issues with AI agents
In
`@src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java`:
- Around line 264-267: The pullDockerImage method in BuildAgentDockerService
only checks buildAgentConfiguration.isDockerAvailable() and then calls
buildAgentConfiguration.getDockerClient(), which can be null for a paused agent;
replace that guard with the same check used elsewhere by invoking
dockerClientNotAvailable() (or equivalently check both dockerClient != null and
isDockerAvailable()) before using DockerClient to avoid an NPE on dockerClient;
update references in pullDockerImage to obtain the DockerClient only after that
guard and keep behavior consistent with other methods in this class.
🧹 Nitpick comments (1)
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java (1)

504-508: Redundant availability check inside getUnusedDockerImages.

Both callers (deleteOldDockerImages at Line 408 and checkUsableDiskSpaceThenCleanUp at Line 443) already call dockerClientNotAvailable() before reaching this method. The re-check on Line 506 is harmless but adds unnecessary overhead and a second getDockerClient() call. Consider removing it or documenting why it's needed (e.g., for standalone use).

coderabbitai bot previously approved these changes Feb 8, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot commented Feb 8, 2026

End-to-End (E2E) Test Results Summary

Test Strategy: Two-phase execution

  • Phase 1 (Relevant): e2e/Login.spec.ts e2e/Logout.spec.ts e2e/SystemHealth.spec.ts e2e/exercise/programming/
  • Phase 2 (Remaining): e2e/atlas/ e2e/course/ e2e/exam/ e2e/lecture/

Status: E2E tests encountered an error

Detailed Results: Check the individual phase reports in the workflow run for test counts, timing, and results.

@coderabbitai coderabbitai bot left a comment

⚠️ Outside diff range comments (1)
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java (1)

489-491: ⚠️ Potential issue | 🟡 Minor

Pre-existing bug: getFirst() on potentially empty list throws NoSuchElementException.

When the last element is removed at Line 489, Line 490 calls getFirst() on an empty list, which throws NoSuchElementException. The outer catch at Line 494 swallows it, so it doesn't crash the app, but it does log a misleading error and skips the totalAttempts-- decrement.

Proposed fix
                 mutableSortedImagesByLastBuildDate.remove(oldestImage);
-                oldestImage = mutableSortedImagesByLastBuildDate.getFirst();
+                oldestImage = mutableSortedImagesByLastBuildDate.isEmpty() ? null : mutableSortedImagesByLastBuildDate.getFirst();
                 totalAttempts--;

github-actions bot commented Feb 8, 2026

Phase 1: E2E Tests ✅

Status: Phase 1 passed

Tests Run: e2e/Login.spec.ts e2e/Logout.spec.ts e2e/SystemHealth.spec.ts e2e/exercise/programming/

Details: Check the Phase 1 Test Report for detailed results.


This is an automated comment for Phase 1 of the E2E test execution.


github-actions bot commented Feb 8, 2026

Phase 2: E2E Tests ✅

Status: Phase 2 passed

Tests Run: e2e/atlas/ e2e/course/ e2e/exam/ e2e/lecture/

Details: Check the Phase 2 Test Report for detailed results.


This is an automated comment for Phase 2 of the E2E test execution.


github-actions bot commented Feb 8, 2026

End-to-End (E2E) Test Results Summary

Test Strategy: Two-phase execution

  • Phase 1 (Relevant): e2e/Login.spec.ts e2e/Logout.spec.ts e2e/SystemHealth.spec.ts e2e/exercise/programming/
  • Phase 2 (Remaining): e2e/atlas/ e2e/course/ e2e/exam/ e2e/lecture/

Status: All E2E tests passed (both phases)

Detailed Results: Check the individual phase reports in the workflow run for test counts, timing, and results.

Labels

buildagent, programming, server, tests

Projects

Status: Ready For Review
