
Conversation

@jzunigax2
Contributor

Cleanup script to remove duplicated backup root folders

@jzunigax2 jzunigax2 requested a review from sg-gs January 8, 2026 01:12
@jzunigax2 jzunigax2 self-assigned this Jan 8, 2026
@jzunigax2
Contributor Author

EXPLAIN ANALYZE result, @sg-gs:

QUERY PLAN
Hash Right Anti Join  (cost=43.08..57.72 rows=1 width=4) (actual time=13.238..13.332 rows=62 loops=1)
  Hash Cond: (files.folder_id = f.id)
  ->  Seq Scan on files  (cost=0.00..13.93 rows=186 width=4) (actual time=0.941..1.956 rows=186 loops=1)
        Filter: (NOT deleted)
        Rows Removed by Filter: 7
  ->  Hash  (cost=43.07..43.07 rows=1 width=4) (actual time=9.164..9.201 rows=84 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 11kB
        ->  Hash Join  (cost=21.47..43.07 rows=1 width=4) (actual time=8.736..8.959 rows=84 loops=1)
              Hash Cond: (((f.plain_name)::text = (folders.plain_name)::text) AND ((f.bucket)::text = (folders.bucket)::text) AND (f.user_id = folders.user_id))
              Join Filter: (f.id <> (min(folders.id)))
              Rows Removed by Join Filter: 21
              ->  Seq Scan on folders f  (cost=0.00..19.57 rows=257 width=48) (actual time=0.303..0.468 rows=257 loops=1)
              ->  Hash  (cost=21.37..21.37 rows=6 width=48) (actual time=8.141..8.146 rows=21 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 10kB
                    ->  Limit  (cost=21.13..21.37 rows=6 width=48) (actual time=7.834..7.865 rows=21 loops=1)
                          ->  HashAggregate  (cost=21.13..21.37 rows=6 width=48) (actual time=7.789..7.815 rows=21 loops=1)
                                Group Key: folders.plain_name, folders.bucket, folders.user_id
                                Filter: (count(*) > 1)
                                Batches: 1  Memory Usage: 24kB
                                Rows Removed by Filter: 30
                                ->  Seq Scan on folders  (cost=0.00..20.86 rows=22 width=48) (actual time=0.049..7.255 rows=73 loops=1)
                                      Filter: ((parent_id IS NULL) AND (parent_uuid IS NULL) AND (NOT deleted) AND (NOT removed) AND (plain_name IS NOT NULL) AND (created_at >= '2025-12-17 14:16:00-06'::timestamp with time zone) AND (created_at <= '2026-01-05 21:50:00-06'::timestamp with time zone))
                                      Rows Removed by Filter: 184
Planning Time: 84.248 ms
Execution Time: 27.053 ms
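
For reproducibility, this plan corresponds roughly to stitching the two CTEs quoted later in this thread into a single statement, as in the sketch below (LIMIT 1000 is a stand-in for the script's ${BATCH_SIZE}; this is the pre-review version that still uses the deprecated deleted flag):

EXPLAIN ANALYZE
WITH duplicate_groups AS (
    SELECT plain_name, bucket, user_id, MIN(id) AS id_to_keep
    FROM folders
    WHERE created_at >= '2025-12-17 14:16:00'
      AND created_at <= '2026-01-05 21:50:00'
      AND parent_id IS NULL
      AND parent_uuid IS NULL
      AND deleted = false
      AND removed = false
      AND plain_name IS NOT NULL
    GROUP BY plain_name, bucket, user_id
    HAVING COUNT(*) > 1
    LIMIT 1000  -- stand-in for ${BATCH_SIZE}
),
folders_to_delete AS (
    SELECT f.id
    FROM folders f
    INNER JOIN duplicate_groups dg
        ON f.plain_name = dg.plain_name
        AND f.bucket = dg.bucket
        AND f.user_id = dg.user_id
    WHERE f.id != dg.id_to_keep
      AND NOT EXISTS (
          SELECT 1
          FROM files
          WHERE folder_id = f.id
            AND deleted = false
      )
)
SELECT id FROM folders_to_delete;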

@jzunigax2 jzunigax2 force-pushed the fix/cleanup-duplicate-folders branch from 0c3ca0a to 47ce568 Compare January 12, 2026 15:10
@jzunigax2 jzunigax2 force-pushed the fix/cleanup-duplicate-folders branch from 47ce568 to bbd6962 Compare January 12, 2026 15:10
@jzunigax2 jzunigax2 force-pushed the fix/cleanup-duplicate-folders branch from bbd6962 to 4ae1045 Compare January 12, 2026 16:19
@jzunigax2 jzunigax2 force-pushed the fix/cleanup-duplicate-folders branch from 4ae1045 to 37c9be2 Compare January 14, 2026 21:58
@jzunigax2 jzunigax2 force-pushed the fix/cleanup-duplicate-folders branch from 37c9be2 to ccf78bc Compare January 14, 2026 21:59
Member

@sg-gs sg-gs left a comment


Remember to add the tests to cover the case that produced this issue in the first place.
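
A minimal fixture for that test could create the duplicate state directly, along the lines of this sketch (hypothetical values; the real folders table almost certainly has more required columns):

-- Hypothetical fixture: two backup root folders sharing
-- (plain_name, bucket, user_id). The cleanup should keep the older
-- row (MIN(id)) and delete the other, since neither has files.
INSERT INTO folders
    (plain_name, bucket, user_id, parent_id, parent_uuid,
     deleted, removed, created_at)
VALUES
    ('Backup', 'test-bucket', 42, NULL, NULL, false, false, '2025-12-20 10:00:00'),
    ('Backup', 'test-bucket', 42, NULL, NULL, false, false, '2025-12-21 10:00:00');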

SELECT 1
FROM files
WHERE folder_id = f.id
AND deleted = false
Member


Use the status != 'DELETED' field instead; deleted and removed are deprecated.
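
Applied to the quoted check above, that would look something like this sketch (assuming files exposes the status column):

AND NOT EXISTS (
    SELECT 1
    FROM files
    WHERE folder_id = f.id
      AND status != 'DELETED'  -- replaces the deprecated deleted = false
)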

Contributor Author


addressed

Comment on lines +20 to +38
WITH duplicate_groups AS (
SELECT
plain_name,
bucket,
user_id,
MIN(id) as id_to_keep
FROM folders
WHERE
created_at >= '2025-12-17 14:16:00'
AND created_at <= '2026-01-05 21:50:00'
AND parent_id IS NULL
AND parent_uuid IS NULL
AND deleted = false
AND removed = false
AND plain_name IS NOT NULL
GROUP BY plain_name, bucket, user_id
HAVING COUNT(*) > 1
LIMIT ${BATCH_SIZE}
),
Contributor Author


This keeps the oldest folder (MIN(id)) from each duplicated group, so we should get 905 rows from this CTE, one per group.
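
One way to sanity-check the 905 figure before running the script is to count the groups under the same filters but without the batch LIMIT, e.g.:

-- Sanity check (a sketch): total duplicate groups under the CTE's filters.
SELECT COUNT(*) AS duplicate_groups
FROM (
    SELECT plain_name, bucket, user_id
    FROM folders
    WHERE created_at >= '2025-12-17 14:16:00'
      AND created_at <= '2026-01-05 21:50:00'
      AND parent_id IS NULL
      AND parent_uuid IS NULL
      AND deleted = false
      AND removed = false
      AND plain_name IS NOT NULL
    GROUP BY plain_name, bucket, user_id
    HAVING COUNT(*) > 1
) g;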

Comment on lines 39 to 54
folders_to_delete AS (
SELECT f.id
FROM folders f
INNER JOIN duplicate_groups dg
ON f.plain_name = dg.plain_name
AND f.bucket = dg.bucket
AND f.user_id = dg.user_id
WHERE
f.id != dg.id_to_keep
AND NOT EXISTS (
SELECT 1
FROM files
WHERE folder_id = f.id
AND deleted = false
)
)
Contributor Author


This joins all folders back to their duplicate group by matching (plain_name, bucket, user_id). So if a group has 3 duplicates, the JOIN produces 3 rows, each carrying that group's id_to_keep. Then we filter to keep only the folders where:

  • id != id_to_keep (not the oldest/keeper)
  • no direct files (status != 'DELETED')
  • no child folders (deleted = false)

This should select 5863 folders.
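
Presumably the script then consumes the CTE chain with a DELETE; as a sketch (not the script's literal tail), the statement would end like:

-- Follows the WITH chain quoted above; run the SELECT form first as a dry run.
DELETE FROM folders
WHERE id IN (SELECT id FROM folders_to_delete);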

… and enhance test coverage for folder creation

github-actions bot commented Feb 7, 2026

This PR is stale because it has been open for more than 15 days with no activity.

@github-actions github-actions bot added stalled and removed stalled labels Feb 7, 2026
