@jackschultz
Owner

Summary

Complete diagnostic suite for production PostgreSQL performance analysis:

Phase 3a: Query Performance Diagnostics

  • pgcrate queries: Top queries from pg_stat_statements

    • Sort by total time, mean time, or call count
    • Cache hit ratio per query
    • Graceful degradation when extension not installed
  • pgcrate connections: Connection usage analysis

    • Usage vs max_connections with percentage
    • Group by user, database, or application
    • Idle-in-transaction detection

Phase 3b: Infrastructure Diagnostics

  • pgcrate bloat: Table and index bloat estimation

    • Statistical estimation (no extensions required)
    • VACUUM FULL / REINDEX recommendations
  • pgcrate replication: Streaming replication health

    • Primary/standby role detection
    • Replica lag monitoring (write, flush, replay)
    • Replication slot status

Bug Fixes

  • Fix UTF-8 string slicing in bloat/sequences (prevents panic on non-ASCII)
  • Fix XID age type overflow in triage (i32 → i64)
  • Fix triage sequences percentage calculation alignment

UX Improvements

  • migration alias for migrate command
  • create alias for migrate new command

Test plan

  • All 391 unit tests pass
  • Clippy clean
  • pgcrate queries gracefully handles missing pg_stat_statements
  • pgcrate connections shows accurate usage vs max_connections
  • pgcrate bloat estimates without extensions
  • pgcrate replication detects server role correctly
  • JSON output follows schema patterns
  • Capabilities correctly report availability

- Statistical index bloat estimation using pg_class/pg_stats (ioguix-style)
- Table bloat from dead tuple ratios via pg_stat_user_tables
- Status thresholds: 20% warning, 50% critical
- Recommendations for VACUUM FULL and REINDEX when critical
- Full JSON support with pgcrate.diagnostics.bloat schema
- Added diagnostics.bloat capability check
- Integration tests for empty DB, tables, JSON structure, limit option
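
The table-bloat thresholds above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Status` enum, the function names, the `dead / (live + dead)` denominator, and the inclusive `>=` comparisons are all assumptions.

```rust
/// Hypothetical severity levels; pgcrate's real type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Healthy,
    Warning,
    Critical,
}

/// Dead-tuple ratio as a percentage, from counters such as
/// n_live_tup / n_dead_tup in pg_stat_user_tables.
/// Assumption: the ratio is dead / (live + dead).
pub fn dead_tuple_pct(live: i64, dead: i64) -> f64 {
    let total = live + dead;
    if total <= 0 {
        return 0.0; // empty table: nothing to report
    }
    dead as f64 * 100.0 / total as f64
}

/// Maps a bloat percentage onto the 20% / 50% thresholds.
/// Assumption: the boundaries are inclusive.
pub fn bloat_status(pct: f64) -> Status {
    if pct >= 50.0 {
        Status::Critical
    } else if pct >= 20.0 {
        Status::Warning
    } else {
        Status::Healthy
    }
}
```

Under these assumptions, a table with 80 live and 20 dead tuples sits exactly at the warning boundary.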

Use chars().count() and chars().take() instead of byte slicing
to avoid panic on multi-byte UTF-8 characters in table/index names.

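
A minimal sketch of the character-based truncation described here. The helper name `truncate_name` and the `...` suffix are hypothetical and not taken from bloat.rs or sequences.rs.

```rust
/// Truncates `name` to at most `max_chars` characters for display,
/// ending with "..." when something was cut off.
///
/// Byte slicing such as `&name[..n]` panics when `n` falls inside a
/// multi-byte UTF-8 character; counting and taking `char`s avoids that.
/// Assumption: `max_chars` is at least 3, so the suffix always fits.
pub fn truncate_name(name: &str, max_chars: usize) -> String {
    if name.chars().count() <= max_chars {
        return name.to_string();
    }
    let keep = max_chars.saturating_sub(3);
    let mut out: String = name.chars().take(keep).collect();
    out.push_str("...");
    out
}
```

For example, `truncate_name("über_große_tabelle", 8)` keeps the first five characters and appends the suffix, regardless of how many bytes each character occupies.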
- Server role detection (primary vs standby via pg_is_in_recovery())
- Replica lag monitoring from pg_stat_replication (write, flush, replay lag)
- Replication slot status from pg_replication_slots (active, wal_status, retained)
- WAL receiver info on standby servers from pg_stat_wal_receiver
- Status thresholds:
  - Warning: replay_lag >30s or inactive slot retaining >1GB
  - Critical: replay_lag >5min or wal_status='lost' or slot retaining >10GB
- Full JSON support with pgcrate.diagnostics.replication schema
- Added diagnostics.replication capability (degraded if no pg_stat_replication access)
- Integration tests for standalone server (no replica) scenarios
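
The replication thresholds listed above could be expressed along these lines. This is a sketch under stated assumptions: the `Status` enum and both function signatures are hypothetical, and "GB" is treated as GiB (1024³ bytes), which the source does not specify.

```rust
use std::time::Duration;

/// Hypothetical severity levels; pgcrate's real type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Healthy,
    Warning,
    Critical,
}

/// Assumption: "GB" in the thresholds means GiB.
pub const GIB: u64 = 1024 * 1024 * 1024;

/// Classifies a replica by its replay lag (pg_stat_replication.replay_lag).
/// `None` models a NULL lag value and is treated as healthy here.
pub fn replica_status(replay_lag: Option<Duration>) -> Status {
    match replay_lag {
        Some(lag) if lag > Duration::from_secs(5 * 60) => Status::Critical,
        Some(lag) if lag > Duration::from_secs(30) => Status::Warning,
        _ => Status::Healthy,
    }
}

/// Classifies a replication slot from pg_replication_slots fields.
pub fn slot_status(active: bool, wal_status: Option<&str>, retained_bytes: u64) -> Status {
    if wal_status == Some("lost") || retained_bytes > 10 * GIB {
        Status::Critical
    } else if !active && retained_bytes > GIB {
        Status::Warning
    } else {
        Status::Healthy
    }
}
```

Note that, as listed, the 1GB warning applies only to inactive slots, while the 10GB critical threshold applies to any slot.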

Bug fixes:
- Fix UTF-8 slicing in sequences.rs display (same issue as bloat.rs)
- Fix xid_age type overflow in triage.rs (i32 -> i64 for large XID ages)
- Add better error context in xid.rs for empty database handling
- Align triage sequences check with sequences.rs (consistent float calculation)
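
To illustrate why widening the XID age to `i64` helps, here is a small sketch. It does not reproduce triage.rs; the constant and function names are hypothetical, and the percentage is measured against a limit of 2^31 as an assumption.

```rust
/// Assumption: progress is measured against roughly 2^31 transactions.
pub const XID_WRAPAROUND_LIMIT: i64 = 2_147_483_648;

/// Returns true if the age would still fit in the old `i32` field.
pub fn fits_in_i32(xid_age: i64) -> bool {
    i32::try_from(xid_age).is_ok()
}

/// Percentage of the way to the assumed limit, computed in f64
/// so that large ages cannot overflow an integer intermediate.
pub fn xid_age_pct(xid_age: i64) -> f64 {
    xid_age as f64 * 100.0 / XID_WRAPAROUND_LIMIT as f64
}
```

An age of 2,147,483,648 no longer fits in an `i32`, while an `i64` holds it with room to spare.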

UX improvements:
- Add 'migration' as visible alias for 'migrate' command
- Add 'create' as visible alias for 'migrate new' command

These fixes address issues reported in 25+ agent feedback sessions.

Bug Fixes:
- UTF-8 string slicing in bloat/sequences (prevents panic on non-ASCII)
- XID age type overflow in triage (i32 → i64)
- Triage sequences percentage calculation alignment
- Better error context for XID on empty databases

UX Improvements:
- Add 'migration' as alias for 'migrate' command
- Add 'create' as alias for 'migrate new' command

Completes the "why is prod slow?" workflow with:

pgcrate queries:
- Top queries from pg_stat_statements by total time, mean time, or calls
- Cache hit ratio per query
- Status thresholds: warning >1s mean, critical >5s mean
- Graceful degradation when extension not installed
- JSON output with pgcrate.diagnostics.queries schema
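
The per-query metrics above might be derived as in the following sketch. The function names and `Status` enum are hypothetical. The hit ratio uses the `shared_blks_hit` and `shared_blks_read` columns of pg_stat_statements, and the mean time is assumed to be in milliseconds, as in `mean_exec_time`.

```rust
/// Hypothetical severity levels; pgcrate's real type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Healthy,
    Warning,
    Critical,
}

/// Cache hit ratio as a percentage: hit / (hit + read).
/// Returns None when the query touched no shared blocks at all.
pub fn cache_hit_pct(shared_blks_hit: i64, shared_blks_read: i64) -> Option<f64> {
    let total = shared_blks_hit + shared_blks_read;
    if total <= 0 {
        None
    } else {
        Some(shared_blks_hit as f64 * 100.0 / total as f64)
    }
}

/// Applies the ">1s mean" warning and ">5s mean" critical thresholds.
/// Assumption: `mean_exec_ms` is in milliseconds.
pub fn query_status(mean_exec_ms: f64) -> Status {
    if mean_exec_ms > 5000.0 {
        Status::Critical
    } else if mean_exec_ms > 1000.0 {
        Status::Warning
    } else {
        Status::Healthy
    }
}
```

Returning `Option` for the ratio keeps "no block activity" distinct from "0% hit rate", which is one reasonable way to render it in output.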

pgcrate connections:
- Connection usage vs max_connections with percentage
- Breakdown by state (active, idle, idle in transaction)
- Group by user, database, or application
- Status thresholds: warning >75%, critical >90%
- JSON output with pgcrate.diagnostics.connections schema
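
A sketch of the usage calculation behind the connection thresholds. The names are hypothetical, and the strict `>` comparisons follow the ">75%" and ">90%" wording above.

```rust
/// Hypothetical severity levels; pgcrate's real type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Healthy,
    Warning,
    Critical,
}

/// Connection usage as a percentage of max_connections.
/// Returns None if max_connections is reported as zero.
pub fn usage_pct(current: u32, max_connections: u32) -> Option<f64> {
    if max_connections == 0 {
        None
    } else {
        Some(f64::from(current) * 100.0 / f64::from(max_connections))
    }
}

/// Applies the warning (>75%) and critical (>90%) thresholds.
pub fn connection_status(pct: f64) -> Status {
    if pct > 90.0 {
        Status::Critical
    } else if pct > 75.0 {
        Status::Warning
    } else {
        Status::Healthy
    }
}
```

With `max_connections = 100`, 76 open connections would be reported as a warning and 91 as critical under these assumptions.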

Updated capabilities:
- diagnostics.queries: available when pg_stat_statements installed
- diagnostics.connections: always available (uses pg_stat_activity)
@jackschultz jackschultz merged commit 72fbd32 into main Jan 19, 2026
2 of 4 checks passed
@jackschultz jackschultz deleted the feature/phase3b-bloat-replication branch January 19, 2026 23:24