
fix: replace premature media file cleanup with background MediaCleaner #620

Closed
ex-takashima wants to merge 5 commits into sipeed:main from ex-takashima:fix/media-cleanup-race-condition

Conversation

@ex-takashima
Contributor

Summary

Fixes #619

This addresses the media file race condition originally identified by @DevEverything01 in #543 — channel handlers delete downloaded temp files via defer os.Remove() before the async agent loop consumer can read them, causing silent failures in voice/image/document processing.

Rather than simply removing the cleanup (which would cause unbounded temp file accumulation), this PR:

  • Removes the premature defer os.Remove() from the 4 affected channel handlers (telegram, line, slack, onebot)
  • Adds a background MediaCleaner goroutine that scans os.TempDir()/picoclaw_media/ every 5 minutes and removes files older than 30 minutes
  • Extracts the MediaDir constant to avoid hardcoded strings

discord.go is intentionally unchanged — it processes voice files synchronously before returning, so there is no race condition.

Changes

  • pkg/utils/media.go: Add MediaDir const, MediaCleaner struct with Start()/Stop()
  • pkg/utils/media_test.go: Tests for old-file removal and Start/Stop idempotency
  • pkg/channels/telegram.go: Remove localFiles tracking and defer os.Remove block
  • pkg/channels/line.go: Remove localFiles tracking and defer os.Remove block
  • pkg/channels/slack.go: Remove localFiles tracking and defer os.Remove block
  • pkg/channels/onebot.go: Remove LocalFiles field, localFiles tracking, and defer os.Remove block
  • cmd/picoclaw/cmd_gateway.go: Start/stop MediaCleaner with gateway lifecycle

Test plan

  • go vet ./pkg/utils/ ./pkg/channels/ — clean
  • go build ./pkg/utils/ ./pkg/channels/ — clean
  • go test ./pkg/utils/... -run TestMediaCleaner -v — 2/2 PASS

🤖 Generated with Claude Code

@Zhaoyikaiii
Collaborator

I have some concerns about introducing a global background MediaCleaner goroutine as the primary mechanism for file cleanup.

From a resource lifecycle perspective, it’s generally safer to follow an ownership principle: the component that creates a resource should also be responsible for cleaning it up. In the previous implementation, cleanup was explicit (e.g. via defer os.Remove(...)) and tightly coupled to the request/event lifecycle. That made ownership and timing clear.

With the current approach:

  • Cleanup is now time-based (e.g. “older than 30 minutes”) rather than usage-based.
  • The deletion logic is detached from the code that creates and consumes the file.
  • Lifecycle becomes implicit and harder to reason about.

Time-based cleanup is not equivalent to “no longer in use.” If any processing chain legitimately exceeds the configured threshold (slow downstream services, retries, backpressure, long-running tasks), files could be deleted while still logically in use. Even if this avoids premature deletion in some race conditions, it replaces a deterministic lifecycle with a heuristic one.

I understand the motivation behind avoiding premature cleanup, but I think we can achieve that without shifting ownership to a global janitor goroutine.

Some possible alternatives:

  • Have DownloadFile (or equivalent) return a cleanup func() along with the path. The caller invokes it once processing is truly complete. This keeps ownership explicit while still allowing flexible timing.
  • Organize temp files under a per-request/per-session directory, and remove the entire directory once the request lifecycle ends. This keeps cleanup scoped and easier to reason about.
  • Keep MediaCleaner only as a safety net (e.g. for orphaned files after crashes), rather than as the primary cleanup mechanism. Explicit cleanup should remain the normal path; the background cleaner should be a fallback.

In short, I think we should prefer deterministic, ownership-based cleanup over time-based sweeping. The latter can be a useful safeguard, but it shouldn’t replace explicit lifecycle management.

@ex-takashima
Contributor Author

Thanks for the thoughtful review, @Zhaoyikaiii! These are valid concerns.

I agree that ownership-based cleanup is generally preferable. However, the challenge here is the async boundary — HandleMessage publishes the message (with file paths) to a message bus via PublishInbound, and the agent loop consumes it in a separate goroutine. There is currently no completion callback or signal from the consumer side, so the channel handler has no way to know when the agent is actually done with the file.

This means the alternatives run into the same fundamental issue:

  • Cleanup func: The handler could receive one, but it doesn't know when to call it — the agent loop's completion is invisible to the handler.
  • Per-request directory: Same problem — there's no signal to trigger the directory removal.
  • Explicit cleanup + safety net: Without a "processing complete" signal across the async bus, there's nothing to drive explicit cleanup.

Adding such a signal (e.g., a completion callback in the message bus, or reference counting on media files) would be a more significant architectural change and is probably beyond the scope of this bug fix.

That said, the 30-minute threshold is intentionally conservative — typical message processing completes in seconds. But I understand the concern about it being heuristic rather than deterministic. Happy to discuss further or hold this PR if a broader architectural approach is preferred.

@Zhaoyikaiii
Collaborator

Zhaoyikaiii commented Feb 22, 2026

Thanks for the clarification — I agree the missing “processing complete” signal across PublishInbound/agent loop makes deterministic cleanup hard right now.

Given that, could we make the MediaCleaner behavior configurable so deployments can choose their trade-off explicitly?

Suggestions:

  • Add config:
    • media_cleanup_enabled (default: true)
    • media_cleanup_max_age (default: 30m)
    • media_cleanup_interval (default: 5m)
    • (optional) media_cleanup_dir
  • Log the effective config at startup (helps with debugging “why was my file deleted?”).
  • Consider making the default max_age more conservative for slow/embedded setups.

This keeps the current approach as a pragmatic fix, while letting operators tune/disable it if their workloads have longer async processing times or they prefer external temp management.

@ex-takashima
Contributor Author

Good suggestion, @Zhaoyikaiii — just pushed an update that makes MediaCleaner fully configurable via tools.media_cleanup in the config:

{
  "tools": {
    "media_cleanup": {
      "enabled": true,
      "max_age_minutes": 30,
      "interval_minutes": 5
    }
  }
}

Also supported via environment variables: PICOCLAW_MEDIA_CLEANUP_ENABLED, PICOCLAW_MEDIA_CLEANUP_MAX_AGE, PICOCLAW_MEDIA_CLEANUP_INTERVAL.

The effective config is now logged at startup for easier debugging. Operators can disable cleanup entirely or tune the thresholds for their specific deployment.

(Note: the remaining CI lint failure is a pre-existing gofmt issue in registry_test.go, unrelated to this PR.)

@alexhoshina
Collaborator

Perhaps you could try rebasing to resolve the CI issue; the latest code in the repository should all pass CI.

ex-takashima and others added 5 commits February 23, 2026 07:18
Channel handlers delete downloaded media files via defer os.Remove()
before the async agent loop consumer can read them, causing silent
failures in voice/image/document processing.

Remove the immediate defer cleanup from telegram, line, slack, and
onebot handlers. Introduce a background MediaCleaner goroutine that
periodically removes files older than 30 minutes from the temp media
directory, preventing unbounded accumulation.

discord.go is unchanged because it processes files synchronously.

Closes sipeed#619

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix gci errors caused by lost indentation when removing localFiles
lines, and use 0o700/0o600 octal syntax for gofumpt compliance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add MediaCleanupConfig with enabled, max_age_minutes, and
interval_minutes fields. MediaCleaner is only started when enabled
(default: true) and logs its effective settings at startup.

Defaults: interval=5m, max_age=30m. Operators can tune or disable
cleanup for deployments with longer async processing times.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ex-takashima force-pushed the fix/media-cleanup-race-condition branch from 6e499d6 to 0e5d30a on February 22, 2026 22:18
@ex-takashima
Contributor Author

Thanks @alexhoshina — just rebased onto latest main. The pre-existing registry_test.go lint issue should be resolved now. CI should go green.


@nikolasdehor left a comment


Good fix for a real race condition. The approach is sound: remove the premature defer os.Remove() from channel handlers and replace it with a background cleaner that uses file age as the deletion criterion. This ensures the agent loop has time to consume the file.

Code quality observations (non-blocking):

  1. Cleanup on first tick only: The cleaner runs cleanup() only on ticker ticks, not immediately at start. This means files from a previous crash could sit for up to interval minutes before the first sweep. Consider running mc.cleanup() once at the beginning of loop() before entering the ticker loop. Minor, since the default interval is only 5 minutes.

  2. Symlink safety: cleanup() iterates directory entries and calls os.Remove() on regular files. If a symlink were placed in the media directory (e.g., by a malicious tool call or skill), os.Remove would remove the symlink itself (not the target), which is safe. Good.

  3. No recursive directory cleanup: The cleaner skips entry.IsDir(), which is correct since DownloadFile only creates flat files. But if a future change nests media in subdirs, those would accumulate. Acceptable for now.

  4. Config placement: MediaCleanupConfig is nested under ToolsConfig, which is slightly odd since media cleanup is a cross-cutting concern not specific to tools. But it matches the existing ToolsConfig as a grab-bag for non-channel config, so it is fine.

  5. Test coverage is appropriate for the cleaner logic. The channel handler diffs are purely mechanical removals, which is clean.

LGTM. This correctly fixes the race condition from #619 without introducing unbounded temp file growth.

@ex-takashima
Contributor Author

Thanks for the thorough review, @nikolasdehor!

Heads-up: since this PR, the approach has evolved. @alexhoshina asked me to integrate the cleanup directly into FileMediaStore on the refactor/channel-system branch instead of keeping it as a standalone goroutine. That's now up as #728 — TTL cleanup happens atomically under the same mutex as the ref/scope maps, so no dangling references are possible.

This PR (#620) will likely be superseded by #728 once it's merged. Your observations about first-tick cleanup and config placement are noted and still relevant there.

@alexhoshina alexhoshina added this to the Refactor Channel milestone Feb 26, 2026
@ex-takashima
Contributor Author

Closing in favor of #728, which was merged into refactor/channel-system. The TTL cleanup is now integrated directly into FileMediaStore rather than as a separate MediaCleaner, avoiding dangling ref issues.



Development

Successfully merging this pull request may close these issues.

Bug: downloaded media files deleted before async consumer reads them
