
per-team + monthly log partitioning #2

Merged
9cb14c1ec0 merged 2 commits into master from user-id on Feb 19, 2026
Conversation

@9cb14c1ec0 (Owner) commented on Feb 19, 2026

Two-level PostgreSQL declarative partitioning for the logs table:

  • Level 1: LIST by team_id (isolates each team's data)
  • Level 2: RANGE by created_at monthly (instant retention via partition drops)
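For orientation, here is a minimal sketch of the resulting layout, following the logs_<hex>_YYYY_MM naming the review uses below. The team UUID is invented for illustration; the authoritative DDL lives in backend/migrations/002_partition_logs_by_team.sql.

# Hedged sketch only; the real DDL is in backend/migrations/002_partition_logs_by_team.sql.
TWO_LEVEL_LAYOUT = """
CREATE TABLE logs (
    id         BIGINT      NOT NULL,
    team_id    UUID        NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    message    TEXT
) PARTITION BY LIST (team_id);

-- Level 1: one LIST partition per team (hypothetical team UUID below),
-- itself partitioned by month.
CREATE TABLE logs_0a1b2c3d0a1b2c3d0a1b2c3d0a1b2c3d PARTITION OF logs
    FOR VALUES IN ('0a1b2c3d-0a1b-2c3d-0a1b-2c3d0a1b2c3d')
    PARTITION BY RANGE (created_at);

-- Level 2: monthly RANGE partitions; retention becomes a metadata-only DROP.
CREATE TABLE logs_0a1b2c3d0a1b2c3d0a1b2c3d0a1b2c3d_2026_02
    PARTITION OF logs_0a1b2c3d0a1b2c3d0a1b2c3d0a1b2c3d
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
"""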

Summary by CodeRabbit

  • New Features

    • Implemented automatic log partitioning by team for improved query performance and data organization.
    • Added scheduled daily maintenance to ensure partition structure remains current and optimized.
  • Chores

    • Refactored data retention mechanism to use partition-based cleanup for enhanced efficiency (see the sketch after this list).
    • Migrated database to support team-based log partitioning structure.
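To make the partition-based cleanup concrete, the sketch below shows the shape such a pass can take, assuming the logs_<hex>_YYYY_MM partition naming and the Tortoise connection methods (execute_query_dict, execute_script, execute_query) referenced in the review. The function name and control flow are illustrative, not the PR's code.

import re
from datetime import datetime, timezone

# Partition names follow logs_<32-hex-team-id>_YYYY_MM (per the review below).
PARTITION_RE = re.compile(r"^logs_[0-9a-f]{32}_(\d{4})_(\d{2})$")

async def cleanup_expired_partitions_sketch(conn, cutoff: datetime) -> None:
    """Hedged sketch: drop monthly partitions that end before `cutoff`."""
    rows = await conn.execute_query_dict(
        "SELECT c.relname FROM pg_class c "
        "JOIN pg_namespace n ON n.oid = c.relnamespace "
        "WHERE n.nspname = current_schema() AND c.relname LIKE 'logs\\_%'"
    )
    for row in rows:
        m = PARTITION_RE.match(row["relname"])
        if not m:
            continue
        year, month = int(m.group(1)), int(m.group(2))
        start = datetime(year, month, 1, tzinfo=timezone.utc)
        end = (datetime(year + 1, 1, 1, tzinfo=timezone.utc) if month == 12
               else datetime(year, month + 1, 1, tzinfo=timezone.utc))
        if end <= cutoff:
            # Entire month expired: dropping the partition is instant.
            await conn.execute_script(f"DROP TABLE {row['relname']};")
        elif start < cutoff < end:
            # Boundary month: row-level delete, binding the datetime directly.
            await conn.execute_query(
                f"DELETE FROM {row['relname']} WHERE created_at < $1", [cutoff]
            )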

@coderabbitai (bot) commented on Feb 19, 2026

📝 Walkthrough

The PR introduces PostgreSQL partition management for logs with team-based LIST partitioning at Level 1 and monthly RANGE partitioning at Level 2. Partition lifecycle operations are integrated into team creation and deletion, and maintenance jobs are scheduled during startup to ensure upcoming partitions and clean up expired data.

Changes

• Partition Service Implementation (backend/app/services/partitions.py)
  New module providing six async functions for partition management: create_team_partition(), create_monthly_partition(), drop_team_partition(), drop_monthly_partition(), ensure_upcoming_partitions(), and cleanup_expired_partitions(). Handles team-based LIST partitioning and monthly RANGE partitioning with early-exit checks and UTC timestamp handling (see the month-boundary sketch after this list).

• Database Migration (backend/migrations/002_partition_logs_by_team.sql)
  Converts the logs table to a two-level partitioned structure with an idempotency check. Creates per-team partitions with a default sub-partition and monthly range partitions, migrates data from the temporary logs_old table, resets the sequence, and creates performance indexes.

• Team Lifecycle Integration (backend/app/api/admin.py)
  Adds partition creation after team creation and partition deletion before team deletion via create_team_partition() and drop_team_partition() imports and calls.

• Application Startup & Scheduling (backend/app/main.py)
  Imports ensure_upcoming_partitions and calls it once during the startup lifespan. Schedules a new daily "partition_maintenance" job to run ensure_upcoming_partitions() every 24 hours.

• Retention Service Refactoring (backend/app/services/retention.py)
  Simplifies cleanup_expired_logs() to delegate to cleanup_expired_partitions() from the partitions module, removing the per-team retention loop and direct Log model deletions.
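The "current and next month" bookkeeping in partitions.py comes down to month-boundary arithmetic in UTC. A hedged sketch of that arithmetic (the helper name is mine, not the module's):

from datetime import datetime, timezone

def month_starts(now: datetime | None = None) -> tuple[datetime, datetime]:
    """Start of the current and next month in UTC: the RANGE bounds a
    maintenance job needs when ensuring current + next monthly partitions."""
    now = now or datetime.now(timezone.utc)
    current = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if current.month == 12:
        nxt = current.replace(year=current.year + 1, month=1)
    else:
        nxt = current.replace(month=current.month + 1)
    return current, nxt

Consecutive boundaries then feed FOR VALUES FROM ... TO ..., start inclusive and end exclusive, matching PostgreSQL RANGE semantics.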

Sequence Diagram(s)

sequenceDiagram
    participant API as Admin API
    participant App as Application<br/>(main.py)
    participant PartSvc as Partition<br/>Service
    participant DB as PostgreSQL
    
    Note over API,DB: Team Creation Workflow
    API->>API: create_team()
    API->>PartSvc: create_team_partition(team_id)
    PartSvc->>DB: Create LIST partition<br/>for team_id
    PartSvc->>DB: Create default<br/>sub-partition
    PartSvc->>DB: Create current & next<br/>month partitions
    DB-->>PartSvc: Partitions created
    PartSvc-->>API: Partition ready
    API-->>API: Team created

    Note over App,DB: Startup & Maintenance Workflow
    App->>PartSvc: ensure_upcoming_partitions()<br/>(on startup)
    PartSvc->>DB: Check all teams
    PartSvc->>DB: Ensure current &<br/>next month<br/>partitions exist
    DB-->>PartSvc: Partitions verified
    PartSvc-->>App: Startup partition<br/>check complete
    
    App->>App: Schedule daily<br/>partition_maintenance job
    
    Note over App,DB: Team Deletion Workflow
    API->>API: delete_team()
    API->>PartSvc: drop_team_partition(team_id)
    PartSvc->>DB: Drop team partition<br/>(cascades to<br/>sub-partitions)
    DB-->>PartSvc: Partition dropped
    PartSvc-->>API: Cleanup complete
    API-->>API: Team deleted

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A rabbit's ode to organized logs:
Partitions spring forth, by team and by time,
Monthly buckets arranged in a line,
Logs sorted swift as a hop through the clover,
Old data swept away as seasons roll over,
Maintenance hops through the garden each day! 🌙

🚥 Pre-merge checks | ✅ 3 passed

• Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
• Title Check: ✅ Passed. The title directly and concisely summarizes the main change (two-level log partitioning by team and by month) and aligns with all modified files.
• Docstring Coverage: ✅ Passed. Coverage is 81.25%, above the required 80.00% threshold.




@coderabbitai (bot) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/api/admin.py (1)

113-121: ⚠️ Potential issue | 🟡 Minor

Partition creation failure leaves the team record committed without its partition.

If create_team_partition throws after Team.create succeeds, the client gets a 500 but the team already exists. Subsequent log ingestion falls into logs_default until ensure_upcoming_partitions self-heals. Consider either wrapping both operations in a transaction or catching the partition error and attempting cleanup (delete the team) to keep the API response consistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/admin.py` around lines 113 - 121, The Team.create call can
succeed while create_team_partition fails, leaving a team without its partition;
wrap the team creation and partition setup in a single transactional flow or, if
transactions aren’t possible, catch errors from create_team_partition after
Team.create and roll back by deleting the created team record (use Team.delete
or equivalent) before rethrowing/returning the error. Specifically modify the
code surrounding Team.create and create_team_partition to either use a DB
transaction scope (ensuring both operations commit atomically) or add a
try/catch around create_team_partition that calls the Team deletion/cleanup for
the newly created team.id and then surfaces the original error to the caller.
Ensure you reference Team.create, create_team_partition, and the cleanup call
(e.g., Team.delete or destroy method) so the change is easy to locate.
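A hedged sketch of the compensating-cleanup option described above. The wrapper function and the model import path are illustrative; Team and create_team_partition are the names the comment references.

from app.models import Team                          # assumed model path
from app.services.partitions import create_team_partition

async def create_team_with_partition(name: str) -> Team:
    # Illustrative wrapper, not the PR's endpoint code.
    team = await Team.create(name=name)
    try:
        await create_team_partition(team.id)
    except Exception:
        # Partition DDL failed after the team row committed:
        # undo the team so the API response stays consistent.
        await team.delete()
        raise
    return team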
🧹 Nitpick comments (8)
backend/app/services/partitions.py (6)

196-201: Pass cutoff as a datetime object, not a formatted string.

cutoff is already a timezone-aware datetime. Tortoise ORM (and the underlying asyncpg driver) can bind datetime objects directly to $1. Formatting to a string (cutoff_str) loses precision and relies on PostgreSQL to re-parse it.

Proposed fix
             elif partition_start < cutoff < partition_end:
                 # Partial month — delete rows before cutoff
-                cutoff_str = cutoff.strftime("%Y-%m-%d %H:%M:%S+00")
                 result = await conn.execute_query(
                     f"DELETE FROM {relname} WHERE created_at < $1",
-                    [cutoff_str],
+                    [cutoff],
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/partitions.py` around lines 196 - 201, The DELETE
currently formats cutoff to a string before binding; instead, pass the
timezone-aware datetime directly to conn.execute_query so asyncpg/Tortoise can
bind it natively: replace the cutoff_str usage and pass cutoff as the parameter
to the f"DELETE FROM {relname} WHERE created_at < $1" execute_query call (update
the code around relname, cutoff and conn.execute_query to remove the strftime
conversion).

122-141: ensure_upcoming_partitions issues one query per team — consider batching the existence checks.

For deployments with many teams, this function executes at least one _table_exists query per team (plus two create_monthly_partition calls, each with their own existence check). A single catalog query to fetch all existing partition names and then computing the diff in Python would reduce round-trips significantly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/partitions.py` around lines 122 - 141,
ensure_upcoming_partitions currently calls _table_exists per team causing N
queries; instead run a single catalog query (via the same conn from _get_conn)
to fetch all existing partition names for the teams and target months, compute
missing team partitions and missing monthly partitions in Python using
_partition_name(team.id) and the month names (now and next_month), then call
create_team_partition(team.id) only for teams whose team partition is missing
and call create_monthly_partition(team.id, year, month) only for missing month
partitions; reuse Team.all(), _partition_name, create_team_partition,
create_monthly_partition and _get_conn to locate code and ensure you preserve
existing semantics (create full structure when team partition missing, otherwise
only ensure current+next month).
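A hedged sketch of the batched existence check suggested above, assuming the logs_<hex>_YYYY_MM naming; the helper name and parameters are illustrative.

async def missing_month_partitions(conn, team_hexes: list[str],
                                   months: list[str]) -> set[str]:
    """Hedged sketch: one catalog round-trip instead of one _table_exists per team.

    `team_hexes` are 32-char team UUID hexes; `months` like ["2026_02", "2026_03"].
    """
    rows = await conn.execute_query_dict(
        "SELECT c.relname FROM pg_class c "
        "JOIN pg_namespace n ON n.oid = c.relnamespace "
        "WHERE n.nspname = current_schema() AND c.relname LIKE 'logs\\_%'"
    )
    existing = {row["relname"] for row in rows}
    wanted = {f"logs_{h}_{m}" for h in team_hexes for m in months}
    return wanted - existing  # create only these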

194-194: Replace print() with structured logging.

Production services should use the logging module for proper log-level control, formatting, and integration with observability tools.

Proposed fix
+import logging
+
+logger = logging.getLogger(__name__)
+
 ...
-                await conn.execute_script(f"DROP TABLE {relname};")
-                print(f"Dropped expired partition {relname}")
+                await conn.execute_script(f"DROP TABLE {relname};")
+                logger.info("Dropped expired partition %s", relname)
             elif partition_start < cutoff < partition_end:
 ...
-                if deleted:
-                    print(f"Deleted {deleted} rows from boundary partition {relname}")
+                if deleted:
+                    logger.info("Deleted %s rows from boundary partition %s", deleted, relname)

Also applies to: 204-204

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/partitions.py` at line 194, Replace the bare print
statements that log dropped partition names with structured logging: add or use
a module-level logger (logger = logging.getLogger(__name__)) in
backend.app.services.partitions and replace print(f"Dropped expired partition
{relname}") (and the similar occurrence around lines 204) with an appropriate
log call such as logger.info("Dropped expired partition %s", relname) so
messages respect log levels, formatting, and observability integration.

34-39: _table_exists doesn't filter by schema — could match tables in other schemas.

pg_class.relname is not schema-qualified. If the search path includes multiple schemas, a stale or unrelated table with the same name could cause a false positive.

Schema-qualified check
 async def _table_exists(conn: BaseDBAsyncClient, table_name: str) -> bool:
     rows = await conn.execute_query_dict(
-        "SELECT 1 FROM pg_class WHERE relname = $1 AND relkind IN ('r', 'p')",
-        [table_name],
+        "SELECT 1 FROM pg_class c "
+        "JOIN pg_namespace n ON n.oid = c.relnamespace "
+        "WHERE c.relname = $1 AND c.relkind IN ('r', 'p') "
+        "AND n.nspname = current_schema()",
+        [table_name],
     )
     return len(rows) > 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/partitions.py` around lines 34 - 39, _table_exists
currently queries pg_class by relname only, which can return tables from other
schemas; update the SQL used in _table_exists (the call to
conn.execute_query_dict) to join pg_namespace (pg_class.relnamespace =
pg_namespace.oid) and filter by the intended schema (e.g., WHERE
pg_namespace.nspname = current_schema() AND pg_class.relname = $1) or add a
schema parameter to _table_exists and compare against that schema name, so the
existence check is schema-qualified rather than name-only.

52-57: SQL built via f-strings — safe here but fragile pattern.

part_name and team_id_str are derived from UUID.hex / str(UUID), so injection isn't exploitable today. However, the same f-string pattern is repeated across all DDL functions (lines 53–57, 61–63, 94–97, 108, 119). Since PostgreSQL parameterized queries can't bind identifiers or DDL clauses, consider at least adding an assertion on the input format (e.g., assert re.fullmatch(r'[a-f0-9]{32}', _team_hex(team_id))) to harden against future misuse.

Example: add a validation helper
import re

_HEX_RE = re.compile(r"[a-f0-9]{32}\Z")

def _team_hex(team_id: UUID) -> str:
    h = team_id.hex
    assert _HEX_RE.match(h), f"unexpected hex: {h}"
    return h
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/partitions.py` around lines 52 - 57, The f-string DDL
builds (creating partitions) use values derived from UUIDs without validation
(symbols: part_name, team_id_str) — add a small validator helper (e.g., _HEX_RE
and _team_hex(team_id)) that asserts the UUID hex matches r"[a-f0-9]{32}\Z" and
use _team_hex(...) when composing part_name and team_id_str; also validate any
identifier derived from that hex (the partition name) before interpolation so
future non-UUID inputs cannot slip in.

170-204: relname from pg_class used unsanitized in DROP/DELETE — add a name-pattern guard.

Ruff S608 flags line 199 for string-based query construction. While relname comes from a trusted system catalog, applying a regex check before interpolation adds defense-in-depth and silences the linter. This also covers the DROP TABLE on line 193.

Proposed guard
+        import re
+        _PART_NAME_RE = re.compile(r"^logs_[a-f0-9]{32}_\d{4}_\d{2}$")
+
         for row in rows:
             relname = row["relname"]
 
+            if not _PART_NAME_RE.match(relname):
+                continue
+
             # Parse partition name: logs_<hex>_YYYY_MM
             parts = relname.rsplit("_", 2)

(Move the compiled regex to module level.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/partitions.py` around lines 170 - 204, Add a defensive
name-pattern guard for relname before using it in string-interpolated SQL:
compile an allowed identifier regex at module level (e.g., ALLOWED_PARTITION_RE
= re.compile(r"^logs_[0-9a-f]+_\d{4}_\d{2}$")) and in the loop that defines
relname (the block that parses parts and computes partition_start/partition_end)
skip any relname that does not match ALLOWED_PARTITION_RE; only after the match
use conn.execute_script(f"DROP TABLE {relname};") and
conn.execute_query(f"DELETE FROM {relname} WHERE created_at < $1", ...) so the
linter/S608 is satisfied and the operations are restricted to expected partition
names.
backend/migrations/002_partition_logs_by_team.sql (2)

104-107: Bulk INSERT into a fully indexed partitioned table can be slow on large datasets.

The data copy runs after all indexes (5 on the parent, plus PK) are already created on lines 57–61. For a large logs_old table, building those indexes during INSERT is expensive. Consider creating indexes after the data copy, or at minimum, document that this migration may require a maintenance window.

Proposed reorder: move index creation after data copy
     ALTER SEQUENCE logs_id_seq OWNED BY logs.id;
 
-    -- Create indexes on the parent (inherited by all partitions)
-    CREATE INDEX idx_logs_team_id_id ON logs (team_id, id DESC);
-    CREATE INDEX idx_logs_timestamp ON logs ("timestamp");
-    CREATE INDEX idx_logs_level ON logs (level);
-    CREATE INDEX idx_logs_source ON logs (source);
-    CREATE INDEX idx_logs_created_at ON logs (created_at);
-
     -- 3. Create top-level default partition ...
     ...
     -- 5. Copy all data from old table
     INSERT INTO logs (id, team_id, "timestamp", level, message, metadata, source, created_at)
     SELECT id, team_id, "timestamp", level, message, metadata, source, created_at
     FROM logs_old;
 
+    -- Create indexes after bulk load for better performance
+    CREATE INDEX idx_logs_team_id_id ON logs (team_id, id DESC);
+    CREATE INDEX idx_logs_timestamp ON logs ("timestamp");
+    CREATE INDEX idx_logs_level ON logs (level);
+    CREATE INDEX idx_logs_source ON logs (source);
+    CREATE INDEX idx_logs_created_at ON logs (created_at);
+
     -- 6. Reset the sequence ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/migrations/002_partition_logs_by_team.sql` around lines 104 - 107,
The migration currently builds all indexes on the partitioned parent table
(including the primary key and the five indexes created earlier) before copying
data from logs_old into logs, which makes the bulk INSERT very slow for large
datasets; to fix, reorder the migration so the INSERT INTO logs ... SELECT ...
FROM logs_old runs before creating the parent/index definitions (i.e., defer
creation of the PK and the five indexes on logs until after the data copy), or
if reordering is not acceptable, add an explicit comment and documentation in
this migration noting it requires a maintenance window and may be slow for large
logs_old tables; reference the INSERT INTO logs (id, team_id, "timestamp", ...)
SELECT ... FROM logs_old and the subsequent parent/index creation steps when
making the change.

29-30: No explicit table lock before rename — concurrent writes will fail during migration.

ALTER TABLE logs RENAME TO logs_old will acquire an ACCESS EXCLUSIVE lock, but there's a window between the rename and the recreation where inserts targeting logs will error. If this runs on a live system, consider wrapping with an advisory lock or documenting the required maintenance window.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/migrations/002_partition_logs_by_team.sql` around lines 29 - 30, The
migration performs ALTER TABLE logs RENAME TO logs_old with no session lock,
leaving a window where concurrent inserts to "logs" will fail; to fix, acquire a
session-level advisory lock (e.g. via pg_advisory_lock with a chosen stable key)
before running ALTER TABLE logs RENAME TO logs_old and hold it until after the
new "logs" table is created and any necessary triggers/schemas are swapped, then
release the advisory lock (pg_advisory_unlock) — alternatively, if you cannot
modify the SQL, document that this migration requires a maintenance window where
writes are paused; reference the ALTER TABLE logs RENAME TO logs_old and the new
logs/logs_old table creation steps when applying the lock.

A follow-up commit addressed the injection-hardening feedback: "Validate UUID hex and partition names against strict regexes before string-interpolating into SQL, preventing injection from unexpected inputs."

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@9cb14c1ec0 merged commit c0df8de into master on Feb 19, 2026 (2 checks passed).
@9cb14c1ec0 deleted the user-id branch on February 19, 2026 at 14:56.