-
Notifications
You must be signed in to change notification settings - Fork 357
Integrated code lifecycle: Add Docker availability check to reduce CI log noise
#12119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…g noise Add a cached dockerAvailable flag to BuildAgentConfiguration that is maintained by the existing scheduled updateDockerVersion task (now every 60s with fixedDelay). All Docker-touching code paths check this flag before attempting operations, preventing thousands of SocketException stack traces when Docker is not available (e.g. on CI servers without Docker). Error logging for Docker unavailability is rationalized from ERROR with full stack traces to WARN/DEBUG with concise messages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WalkthroughAdds a dockerAvailable flag to BuildAgentConfiguration, optimistically sets it true on client init/open, and propagates Docker availability checks across services; services and utilities guard operations, adjust logging when Docker is unreachable, and BuildAgentInformationService updates version/availability on a fixedDelay schedule. Changes
Sequence DiagramsequenceDiagram
participant Info as BuildAgentInformationService
participant Config as BuildAgentConfiguration
participant Docker as BuildAgentDockerService
participant Container as BuildJobContainerService
participant Job as BuildJobManagementService
Note over Config: onAppReady / open -> setDockerAvailable(true)
Info->>Config: request docker version
alt version retrieved (Docker available)
Config-->>Info: docker client + version
Info->>Config: setDockerAvailable(true) if changed
Info->>Docker: request image pull / cleanup
Docker->>Container: list/create/delete containers/images
Container->>Job: execute container operations (copy, archive, run)
Job-->>Info: report job state
else cannot retrieve (Docker unavailable)
Config-->>Info: indicate not available
Info->>Config: setDockerAvailable(false)
Note over Docker,Container: guard checks -> early return / skip ops (debug/warn)
Job-->>Info: log warning, handle gracefully
end
Note over Info: periodic re-check (fixedDelay) -> update version/availability
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java (1)
486-491:⚠️ Potential issue | 🟡 MinorPre-existing:
getFirst()can throwNoSuchElementException.Not introduced by this PR, but at Line 489, after
mutableSortedImagesByLastBuildDate.remove(oldestImage),getFirst()is called without checking if the list is empty. If the last element was just removed, this throwsNoSuchElementException. The outerwhileloop condition never re-evaluates before this call.Suggested fix
mutableSortedImagesByLastBuildDate.remove(oldestImage); - oldestImage = mutableSortedImagesByLastBuildDate.getFirst(); + oldestImage = mutableSortedImagesByLastBuildDate.isEmpty() ? null : mutableSortedImagesByLastBuildDate.getFirst(); totalAttempts--;
🤖 Fix all issues with AI agents
In
`@src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java`:
- Around line 340-345: The catch block in BuildAgentDockerService (inside the
method that inspects/pulls images) always throws a LocalCIException with the
message "Docker is not available..." even when the root cause is different;
update the catch to distinguish cases using DockerUtil.isDockerNotAvailable(ex):
if true, log a warn and then throw a LocalCIException with a message stating
Docker is not available (including ex.getMessage()); otherwise throw a
LocalCIException with a generic/predictive message about failing to pull/inspect
the image (e.g., "Failed to pull/inspect image <imageName>") and include the
original exception so the real cause (auth, timeout, manifest) is
preserved—adjust the code paths around the catch in BuildAgentDockerService to
conditionally set the thrown message accordingly.
🧹 Nitpick comments (3)
src/main/java/de/tum/cit/aet/artemis/buildagent/service/DockerUtil.java (1)
9-9: Consider adding a private constructor to this utility class.
DockerUtilis afinalclass with only static methods but no explicit constructor, so Java generates a default public one. A private constructor would prevent accidental instantiation.♻️ Proposed fix
public final class DockerUtil { + private DockerUtil() { + // Utility class + } + public static boolean isDockerSocketNotAvailable(Throwable throwable) {src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentInformationService.java (1)
90-99: Docker recovery without version change won't update distributed map.When Docker transitions from unavailable → available but the version hasn't changed (e.g., brief network blip),
updateLocalBuildAgentInformation(false)on line 98 is not called. ThedockerAvailableflag is set locally (line 92), so local services behave correctly, but the distributed map won't reflect the recovery until the next periodicupdateBuildAgentInformation()call (every 10s inSharedQueueProcessingService). This seems acceptable given the 10s update interval, but if immediate distributed visibility matters, consider also callingupdateLocalBuildAgentInformation(false)aftersetDockerAvailable(true).♻️ Optional: propagate recovery to distributed map immediately
if (!wasAvailable) { log.info("Docker is now available (version: {})", newVersion); buildAgentConfiguration.setDockerAvailable(true); + updateLocalBuildAgentInformation(false); } - if (!Objects.equals(newVersion, dockerVersion)) { + if (!Objects.equals(newVersion, dockerVersion)) { log.info("Docker version: {}", newVersion); dockerVersion = newVersion; - // Update the build agent information in the distributed map updateLocalBuildAgentInformation(false); }Note: This may cause a double
updateLocalBuildAgentInformationcall when both recovery and version change happen simultaneously. If that's a concern, you could use anelse ifor a flag to coalesce the calls.src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildJobContainerService.java (1)
667-672: Double logging when Docker is unavailable after retries are exhausted.
executeWithRetrylogs a WARN at Line 668 when Docker is not available, then throwslastException. The callers (getArchiveFromContainerat Line 323 andcopyToContainerat Line 718) catch the same exception, perform the sameisDockerNotAvailablecheck, and log another WARN with a similar message. This produces duplicate WARN entries for a single failure.Consider either removing the Docker-unavailability log in
executeWithRetry(lines 667-672) and letting callers handle it, or removing the caller-side Docker-availability check and always usinglog.errorin the callers sinceexecuteWithRetryalready differentiates.Also applies to: 322-327, 717-722
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java
Show resolved
Hide resolved
|
@krusche Test coverage has been automatically updated in the PR description. |
Split the catch block in pullDockerImage to throw Docker-specific exception message only when Docker is actually unavailable, and a generic message for other errors (auth, timeout, manifest issues). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java (1)
486-491:⚠️ Potential issue | 🟠 Major
getFirst()on a potentially empty list will throwNoSuchElementException.After removing the last element on Line 489, if
mutableSortedImagesByLastBuildDateis empty, Line 490 throws. Thewhileguard on Line 477 checksoldestImage != null, butgetFirst()never returnsnull—it throws on an empty list.This is pre-existing code, but the surrounding method was touched in this PR. A simple guard would prevent the crash:
Proposed fix
mutableSortedImagesByLastBuildDate.remove(oldestImage); - oldestImage = mutableSortedImagesByLastBuildDate.getFirst(); + oldestImage = mutableSortedImagesByLastBuildDate.isEmpty() ? null : mutableSortedImagesByLastBuildDate.getFirst(); totalAttempts--;
🤖 Fix all issues with AI agents
In
`@src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java`:
- Around line 264-267: The pullDockerImage method in BuildAgentDockerService
only checks buildAgentConfiguration.isDockerAvailable() and then calls
buildAgentConfiguration.getDockerClient(), which can be null for a paused agent;
replace that guard with the same check used elsewhere by invoking
dockerClientNotAvailable() (or equivalently check both dockerClient != null and
isDockerAvailable()) before using DockerClient to avoid an NPE on dockerClient;
update references in pullDockerImage to obtain the DockerClient only after that
guard and keep behavior consistent with other methods in this class.
🧹 Nitpick comments (1)
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java (1)
504-508: Redundant availability check insidegetUnusedDockerImages.Both callers (
deleteOldDockerImagesat Line 408 andcheckUsableDiskSpaceThenCleanUpat Line 443) already calldockerClientNotAvailable()before reaching this method. The re-check on Line 506 is harmless but adds unnecessary overhead and a secondgetDockerClient()call. Consider removing it or documenting why it's needed (e.g., for standalone use).
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java
Outdated
Show resolved
Hide resolved
|
@krusche Test coverage has been automatically updated in the PR description. |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
End-to-End (E2E) Test Results SummaryTest Strategy: Two-phase execution
Status: E2E tests encountered an error Detailed Results: Check the individual phase reports in the workflow run for test counts, timing, and results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main/java/de/tum/cit/aet/artemis/buildagent/service/BuildAgentDockerService.java (1)
489-491:⚠️ Potential issue | 🟡 MinorPre-existing bug:
getFirst()on potentially empty list throwsNoSuchElementException.When the last element is removed at Line 489, Line 490 calls
getFirst()on an empty list, which throwsNoSuchElementException. The outercatchat Line 494 swallows it, so it doesn't crash the app, but it does log a misleading error and skips thetotalAttempts--decrement.Proposed fix
mutableSortedImagesByLastBuildDate.remove(oldestImage); - oldestImage = mutableSortedImagesByLastBuildDate.getFirst(); + oldestImage = mutableSortedImagesByLastBuildDate.isEmpty() ? null : mutableSortedImagesByLastBuildDate.getFirst(); totalAttempts--;
|
@krusche Test coverage has been automatically updated in the PR description. |
Phase 1: E2E Tests ✅Status: Phase 1 passed Tests Run: e2e/Login.spec.ts e2e/Logout.spec.ts e2e/SystemHealth.spec.ts e2e/exercise/programming/ Details: Check the Phase 1 Test Report for detailed results. This is an automated comment for Phase 1 of the E2E test execution. |
Phase 2: E2E Tests ✅Status: Phase 2 passed Tests Run: e2e/atlas/ e2e/course/ e2e/exam/ e2e/lecture/ Details: Check the Phase 2 Test Report for detailed results. This is an automated comment for Phase 2 of the E2E test execution. |
End-to-End (E2E) Test Results SummaryTest Strategy: Two-phase execution
Status: All E2E tests passed (both phases) Detailed Results: Check the individual phase reports in the workflow run for test counts, timing, and results. |
Summary
Add a cached Docker availability flag to the build agent that prevents thousands of
SocketExceptionstack traces from polluting CI server logs when Docker is not available (e.g. on CI servers without/var/run/docker.sock). The existingupdateDockerVersion()scheduled task is extended to also probe Docker availability every 60 seconds. All Docker-touching code paths check this flag before attempting operations, failing fast with clean WARN/DEBUG messages instead of ERROR with full stack traces.Checklist
General
Server
Changes affecting Programming Exercises
Motivation and Context
When running server tests on CI, Docker is not available (
/var/run/docker.sockdoes not exist). This causes ~2,794SocketException: No such file or directorystack traces in CI logs per test run. Multiple services independently discover Docker is unreachable and each logs full stack traces, massively polluting the output and making it harder to identify real test failures.The existing
dockerClientNotAvailable()check only verifies if the JavaDockerClientobject is null — which it never is, because client creation always succeeds regardless of whether Docker is actually running.Description
Core change: A
volatile boolean dockerAvailablefield inBuildAgentConfigurationwith getter/setter. This flag is maintained by the existingupdateDockerVersion()scheduled task inBuildAgentInformationService, which now runs every 60 seconds usingfixedDelay(prevents overlap if Docker is slow to respond). TheversionCmd().exec()call already performed by this task implicitly serves as the availability ping.Guarded code paths:
BuildAgentDockerService: EnhanceddockerClientNotAvailable()to also checkisDockerAvailable(). Added guards inpullDockerImage(),deleteOldDockerImages(). ChangedcleanUpContainers()catch blocks from ERROR to DEBUG for Docker unavailability.BuildJobContainerService: AddedisDockerAvailable()guard ingetContainerForName(). Added Docker unavailability checks in archive retrieval/copy error handlers.BuildJobManagementService: Simplified Docker unavailability error to WARN without stack trace.SharedQueueProcessingService: AddedDockerUtil.isDockerNotAvailable(ex)check in theexceptionallyhandler to log WARN with message only instead of ERROR with full stack trace.DockerUtil: FixedisDockerNotAvailable()to traverse the full cause chain instead of only checking one level deep. FixedisDockerConnectionRefused()to check the throwable directly (consistent withisDockerSocketNotAvailable()). Added cycle protection.State transition logging:
Test updates: Both
AbstractArtemisBuildAgentTestandAbstractProgrammingIntegrationLocalCILocalVCTestBaseare updated to setdockerAvailable = truesince tests use mocked Docker clients.Steps for Testing
Prerequisites:
./gradlew test --tests "de.tum.cit.aet.artemis.buildagent.*" --tests "de.tum.cit.aet.artemis.programming.icl.BuildAgentDockerServiceTest" -x webapp— all 97 tests should pass./gradlew test -x webappand check logs forSocketException— should see ~0 ERROR-level occurrences (down from ~2,794)Testserver States
You can manage test servers using Helios. Check environment statuses in the environment list. To deploy to a test server, go to the CI/CD page, find your PR or branch, and trigger the deployment.
Review Progress
Performance Review
Code Review
Manual Tests
Test Coverage
Server
Last updated: 2026-02-08 12:48:09 UTC