feat: integrations (victorialogs, grafana, logz.io), #47
Merged
Conversation
Tasks completed: 3/3
- Task 1: Create shared tool utilities
- Task 2: Implement overview tool
- Task 3: Register overview tool
SUMMARY: .planning/phases/05-progressive-disclosure-mcp-tools/05-02-SUMMARY.md

Tasks completed: 3/3 (Task 4 already complete from Plan 02)
- CompareTimeWindows for novelty detection
- TemplateStore integration into VictoriaLogs lifecycle
- PatternsTool with sampling and time-window batching
SUMMARY: .planning/phases/05-progressive-disclosure-mcp-tools/05-03-SUMMARY.md

Tasks completed: 3/3
- Task 1-2: Implement and register logs tool
- Task 3: Wire integration manager into MCP server
SUMMARY: .planning/phases/05-progressive-disclosure-mcp-tools/05-04-SUMMARY.md
Phase 5 COMPLETE - All 4 plans executed, 10/10 requirements satisfied
ALL PROJECT REQUIREMENTS COMPLETE - 31/31 (100%)
- Phase 5 verified: 10/10 must-haves satisfied
- All 31 project requirements complete (100%)
- Three progressive disclosure tools operational:
- victorialogs_{instance}_overview: namespace-level severity counts
- victorialogs_{instance}_patterns: template mining with novelty
- victorialogs_{instance}_logs: raw log viewing with limits
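The per-instance tool naming above can be sketched as follows. The helper and the "prod" instance name are hypothetical; the actual registration code lives in the integration's RegisterTools:

```go
package main

import "fmt"

// toolNames builds the three progressive-disclosure tool names for one
// VictoriaLogs instance, following the victorialogs_{instance}_{tool} pattern.
// Illustrative helper only; the real registration code may differ.
func toolNames(instance string) []string {
	tools := []string{"overview", "patterns", "logs"}
	names := make([]string, 0, len(tools))
	for _, t := range tools {
		names = append(names, fmt.Sprintf("victorialogs_%s_%s", instance, t))
	}
	return names
}

func main() {
	for _, n := range toolNames("prod") {
		fmt.Println(n) // e.g. victorialogs_prod_overview
	}
}
```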
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Archived:
- milestones/v1-ROADMAP.md
- milestones/v1-REQUIREMENTS.md
- milestones/v1-MILESTONE-AUDIT.md
Deleted (fresh for next milestone):
- ROADMAP.md
- REQUIREMENTS.md
Updated:
- MILESTONES.md (new entry)
- PROJECT.md (requirements → Validated)
- STATE.md (reset for next milestone)
v1 shipped 31 requirements across 5 phases:
- Plugin infrastructure with factory registry and hot-reload
- REST API + React UI for integration config
- VictoriaLogs client with LogsQL query builder
- Log template mining using Drain algorithm
- Progressive disclosure MCP tools (overview/patterns/logs)

Consolidate MCP server into main Spectre server for single-port deployment

21 requirements across 5 categories:
- Server Consolidation (5)
- Service Layer (5)
- Integration Manager (3)
- Helm Chart (4)
- E2E Tests (4)

Phases:
6. Consolidated Server & Integration Manager (7 reqs)
7. Service Layer Extraction (5 reqs)
8. Cleanup & Helm Chart Update (5 reqs)
9. E2E Test Validation (4 reqs)
All 21 v1.1 requirements mapped to phases.
Phase 06: Consolidated Server & Integration Manager
- Implementation decisions documented
- Phase boundary established

Phase 6: Consolidated Server & Integration Manager
- Current architecture uses lifecycle manager for component orchestration
- MCP StreamableHTTP transport recommended over deprecated SSE
- Integration manager already supports MCP tool registry
- Minimal code changes required for consolidation
- Graceful shutdown patterns identified

Phase 06: Consolidated Server & Integration Manager
- 2 plan(s) in 2 wave(s) - 1 autonomous, 1 checkpoint - Ready for execution
Wave 1: 06-01 (MCP integration)
Wave 2: 06-02 (verification checkpoint)

Addressed 4 checker issues:
Plan 06-01:
- Task 1: Added grep verification for MCP initialization code (NewSpectreServerWithOptions, stdioEnabled, NewManagerWithMCPRegistry)
- Task 3: Added grep verification for mcpServer wiring to apiserver
- Task 1: Clarified flag naming - using --stdio (simpler boolean) instead of --transport=stdio; noted requirement docs need update
- Task 2: Clarified endpoint path - using /v1/mcp for API versioning; noted requirement docs need update
Plan 06-02:
- Checkpoint: Added note about requirement discrepancies (SRVR-02: /v1/mcp vs /mcp, SRVR-03: --stdio vs --transport=stdio)
- Verification section: Added notes explaining intentional implementation decisions
All changes are targeted updates, preserving existing plan structure.

…ployment
- Initialize MCP server before integration manager in server startup
- Create MCPToolRegistry adapter and wire to integration manager
- Add --stdio flag for optional stdio MCP transport alongside HTTP
- Add mcpServer field to APIServer struct and constructor
- Register /v1/mcp endpoint with StreamableHTTPServer in stateless mode
- Route registration order: specific routes -> MCP -> static UI catch-all
- MCP server lifecycle tied to HTTP server (no separate component)
Requirements covered:
- SRVR-01: Single server on port 8080
- SRVR-02: MCP endpoint at /v1/mcp (versioned for consistency)
- SRVR-03: Stdio transport via --stdio flag (boolean, not --transport enum)
- INTG-01: Integration manager with MCP server via MCPToolRegistry
- INTG-02: Dynamic tool registration via RegisterTools
Implementation notes:
- Using /v1/mcp instead of /mcp for API versioning consistency
- Using --stdio flag instead of --transport=stdio for simplicity
- MCP server self-references localhost:8080 (Phase 7 will eliminate HTTP calls)
Tasks completed: 3/3
- Initialize MCP server in main server command
- Add MCP server to APIServer and register /v1/mcp endpoint
- Update server command to pass MCP server to APIServer
Requirements satisfied:
- SRVR-01: Single server on port 8080
- SRVR-02: MCP endpoint at /v1/mcp
- SRVR-03: Stdio transport via --stdio flag
- INTG-01: Integration manager with MCP server
- INTG-02: Dynamic tool registration
SUMMARY: .planning/phases/06-consolidated-server/06-01-SUMMARY.md

Tasks completed: 1/1 (verification checkpoint approved)
- Verified single-port server deployment (REST + UI + MCP on :8080)
- Validated MCP endpoint /v1/mcp with StreamableHTTP protocol
- Confirmed integration manager tool registration working
- Verified graceful shutdown handling all components
- Validated stdio transport alongside HTTP mode
Phase 6 COMPLETE: All 7 requirements (SRVR-01 through INTG-03) satisfied
SUMMARY: .planning/phases/06-consolidated-server/06-02-SUMMARY.md

Phase 6 executed successfully with 2/2 plans complete. All 7 requirements satisfied: SRVR-01 through SRVR-04, INTG-01 through INTG-03.
Key decisions:
- /v1/mcp path for API versioning consistency
- --stdio flag for simpler interface
- StreamableHTTP with stateless mode for compatibility
Phase 07: Service Layer Extraction
- Implementation decisions documented
- Phase boundary established

Phase 07: Service Layer Extraction
- Analyzed current REST handler implementations
- Inventoried MCP tool HTTP self-calls
- Documented existing TimelineService pattern
- Identified operations for extraction (Timeline, Graph, Search, Metadata)
- Catalogued infrastructure dependencies (QueryExecutor, graph.Client)
- Defined migration strategy per user decisions

Phase 07: Service Layer Extraction
- 5 plan(s) in 4 wave(s) - 3 parallel, 2 sequential - Ready for execution
Wave structure:
- Wave 1: 07-01 (Timeline)
- Wave 2: 07-02 (Graph), 07-03 (Search)
- Wave 3: 07-04 (Metadata)
- Wave 4: 07-05 (Cleanup)
…Service
- Extract query parameter parsing from timeline handler
- Add ParseQueryParameters method with timestamp validation and filter parsing
- Add ParsePagination method with max page size enforcement
- Add helper functions: parseMultiValueParam, getSingleParam, parseIntOrDefault
- Include OpenTelemetry tracing spans
- Service methods return domain models (no HTTP dependencies)

- Replace handler fields with single timelineService dependency
- Remove executors, validator, querySource from TimelineHandler struct
- Use service methods: ParseQueryParameters, ParsePagination, ExecuteConcurrentQueries, BuildTimelineResponse
- Delete inline business logic from handler (moved to service)
- Update NewTimelineHandler to accept TimelineService
- Update register.go to create TimelineService and pass to handler
- Update timeline_handler_concurrent_test.go to test service methods
- All handler tests pass; handler now focused on HTTP concerns only

- Update SpectreServer to accept and store TimelineService in ServerOptions
- Modify resource_timeline and cluster_health tools to accept TimelineService
- Add WithClient constructors for backward compatibility with agent tools
- Refactor MCP server initialization order: create API server first to get TimelineService
- Update apiserver to create and expose TimelineService for sharing
- Modify RegisterHandlers to accept TimelineService parameter instead of creating a new instance
- Add RegisterMCPEndpoint method to apiserver for late endpoint registration
- Move integration manager initialization after MCP server creation
- Verify tools no longer make HTTP self-calls for timeline operations
Timeline tools now call the shared service layer directly, eliminating HTTP overhead. REST handlers and MCP tools share the same TimelineService instance.

Tasks completed: 3/3
- Task 1: TimelineService complete (work already done in Phase 6)
- Task 2: REST handler refactored (work already done in Phase 6)
- Task 3: MCP tools wired to use TimelineService directly
Key accomplishments:
- MCP timeline tools eliminate HTTP self-calls
- REST handlers and MCP tools share TimelineService instance
- Server initialization reordered for service sharing
- Service layer pattern established for future extractions
SUMMARY: .planning/phases/07-service-layer-extraction/07-01-SUMMARY.md
- Create GraphService facade over existing analyzers
- Add DiscoverCausalPaths method delegating to PathDiscoverer
- Add DetectAnomalies method delegating to AnomalyDetector
- Add AnalyzeNamespaceGraph method delegating to Analyzer
- Include tracing and logging for observability
- Service layer enables sharing between REST handlers and MCP tools

- Add SearchService with constructor injection pattern
- Implement ParseSearchQuery for query parameter validation
- Implement ExecuteSearch with tracing and logging
- Implement BuildSearchResponse for result transformation
- Group events by resource UID (simplified version)
- Service follows TimelineService pattern

- Update CausalPathsHandler to use GraphService instead of direct PathDiscoverer
- Update AnomalyHandler to use GraphService instead of direct AnomalyDetector
- Update NamespaceGraphHandler to use GraphService instead of direct Analyzer
- Remove unused graph.Client imports from handlers
- Create GraphService in register.go for sharing across handlers
- Handlers are now thin HTTP adapters; GraphService owns business logic
- All namespace graph tests pass

- Replace queryExecutor with searchService in SearchHandler
- Delegate query parsing to SearchService.ParseSearchQuery
- Delegate execution to SearchService.ExecuteSearch
- Delegate response building to SearchService.BuildSearchResponse
- Remove inline parseQuery and buildSearchResponse methods
- Handler is now a thin HTTP adapter over SearchService
- Update handler registration to pass SearchService

Tasks completed: 2/2
- Create SearchService with query parsing and execution
- Refactor REST search handler to use SearchService
SUMMARY: .planning/phases/07-service-layer-extraction/07-03-SUMMARY.md

- Update CausalPathsTool to use GraphService instead of HTTP client
- Update DetectAnomaliesTool to use GraphService for anomaly detection
- Add NewCausalPathsToolWithClient and NewDetectAnomaliesToolWithClient for backward compatibility
- Update MCP server to pass GraphService to graph tools when available
- Add GraphService field to ServerOptions and SpectreServer
- Fix agent tool wrappers to use WithClient constructors
- MCP graph tools now call GraphService directly (no HTTP self-calls)
- Tools compile successfully with dual-mode support (GraphService or HTTP client)

- Create GraphService when graph client is available
- Pass GraphService to MCP server via ServerOptions
- MCP graph tools now use direct service calls when GraphService is available
- Server compiles successfully with GraphService integration
- ComputeFlappinessScore: exponential scaling for 0.0-1.0 range
  - 5 transitions in 6h ≈ 0.5, 10+ transitions ≈ 0.8-1.0
  - Duration multipliers penalize short-lived states (1.3x) vs long-lived (0.8x)
  - Uses gonum/stat.Mean for average state duration calculation
- ComputeRollingBaseline: 7-day rolling average with LOCF
  - StateDistribution: % normal/pending/firing across time period
  - Daily bucketing with state carryover between days
  - Returns sample standard deviation (N-1) via gonum/stat.StdDev
  - InsufficientDataError for <24h history with clear diagnostics
- CompareToBaseline: deviation score in standard deviations
  - Absolute difference in firing percentage / stdDev
  - Zero stdDev returns 0.0 (avoids division by zero)
- LOCF interpolation fills gaps correctly
- Transition timestamp boundary handling (inclusive at period start)
- All 22 tests passing with >90% coverage
- Uses gonum.org/v1/gonum/stat v0.17.0 for statistical correctness
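The CompareToBaseline arithmetic reduces to an absolute difference divided by the standard deviation, with the zero-stdDev guard noted above. A minimal sketch (simplified signature with firing percentages as plain floats; the real method works on richer inputs):

```go
package main

import (
	"fmt"
	"math"
)

// compareToBaseline returns how far the current firing percentage deviates
// from the baseline, measured in standard deviations. A zero stdDev returns
// 0.0 to avoid division by zero, matching the behavior described above.
func compareToBaseline(currentFiringPct, baselineFiringPct, stdDev float64) float64 {
	if stdDev == 0 {
		return 0.0
	}
	return math.Abs(currentFiringPct-baselineFiringPct) / stdDev
}

func main() {
	// Alert firing 40% of the time vs a 10% baseline with stdDev 15:
	// |40 - 10| / 15 = 2 standard deviations above baseline.
	fmt.Println(compareToBaseline(40, 10, 15))
	// Zero variance in the baseline: guarded, returns 0.
	fmt.Println(compareToBaseline(40, 10, 0))
}
```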
- Use make() with capacity to avoid reallocation
- Addresses prealloc linter warning
- All tests still passing

Tasks completed: 1/1 (TDD cycle)
- RED: 22 failing tests for flappiness and baseline
- GREEN: Implementation with exponential scaling and LOCF
- REFACTOR: Pre-allocate slice for performance
SUMMARY: .planning/phases/22-historical-analysis/22-01-SUMMARY.md
Key outputs:
- ComputeFlappinessScore: 0.0-1.0 range with duration multipliers
- ComputeRollingBaseline: 7-day average with sample variance
- CompareToBaseline: deviation score in standard deviations
- 22 tests, >90% coverage, gonum.org/v1/gonum/stat integrated
- FetchStateTransitions queries graph for STATE_TRANSITION edges
- Temporal filtering with startTime/endTime and expires_at TTL check
- UTC conversion and RFC3339 formatting (Phase 21-01 pattern)
- Returns empty slice for new alerts (not an error)
- Per-row error handling: log warnings and continue parsing
- Self-edge pattern: (Alert)-[STATE_TRANSITION]->(Alert)

…ation
- AlertCategories with independent onset and pattern categories
- Onset: new/recent/persistent/chronic based on time since first firing
- Pattern: flapping/trending-worse/trending-better/stable-* based on behavior
- Chronic threshold: >80% firing over 7 days using LOCF
- Trend analysis: compare last 1h to prior 6h (>20% change)
- computeStateDurations uses LOCF interpolation to fill gaps
- Flapping overrides other pattern categories (flappiness > 0.7)
- 19 unit tests covering all categories and edge cases
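The pattern-category precedence above can be sketched as follows. Thresholds (0.7 flappiness, 20-point trend change) come from the plan; the function shape and the collapsed "stable" label are illustrative, since the real code distinguishes the stable-* variants:

```go
package main

import "fmt"

// patternCategory picks the behavioral pattern label for an alert.
// Flapping overrides the trend categories when the flappiness score exceeds
// 0.7; otherwise the trend is the change in firing percentage between the
// last 1h window and the prior 6h window, with a >20-point move counting
// as a trend. Sketch only; real signature and labels differ.
func patternCategory(flappiness, recentFiringPct, priorFiringPct float64) string {
	if flappiness > 0.7 {
		return "flapping" // flapping wins over any trend
	}
	switch delta := recentFiringPct - priorFiringPct; {
	case delta > 20:
		return "trending-worse"
	case delta < -20:
		return "trending-better"
	default:
		return "stable" // real code: stable-firing / stable-normal, etc.
	}
}

func main() {
	fmt.Println(patternCategory(0.9, 50, 50)) // flapping overrides
	fmt.Println(patternCategory(0.1, 80, 30)) // firing much more recently
}
```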
- AlertAnalysisService orchestrates the historical analysis pipeline
- AnalyzeAlert method: fetch transitions + flappiness + baseline + categorize
- 5-minute TTL cache via hashicorp/golang-lru/v2/expirable (1000 entries)
- ErrInsufficientData for <24h history with available/required durations
- Requires 24h minimum for statistically meaningful analysis
- computeCurrentDistribution uses LOCF for recent window analysis
- 10 comprehensive unit tests covering:
  - Success with 7-day history
  - Partial data (24h-7d)
  - Insufficient data (<24h)
  - Empty transitions (new alerts)
  - Cache hit/miss behavior
  - Flapping detection
  - Chronic categorization
  - Query format verification
- All tests pass with >80% coverage
Tasks completed: 3/3
- Task 1: State transition fetcher with temporal filtering
- Task 2: Multi-label categorization with LOCF duration computation
- Task 3: AlertAnalysisService with cache integration
SUMMARY: .planning/phases/22-historical-analysis/22-02-SUMMARY.md
- Add analysisService field to GrafanaIntegration struct
- Create service in Start after graphClient initialization
- Share graphClient with AlertSyncer and AlertStateSyncer
- Add GetAnalysisService() getter method for Phase 23 MCP tools
- Clear service reference in Stop (no background work to stop)
- Service is stateless with automatic cache expiration

- Test 1: Full history with 7 days stable firing (chronic alert)
- Test 2: Flapping pattern with 12 state changes in 6h
- Test 3: Insufficient data handling (<24h history)
- Test 4: Cache behavior (second call uses cache, no graph query)
- Test 5: Lifecycle integration (service created/cleared)
Mock graph client returns realistic state transitions with RFC3339 timestamps. Tests verify multi-label categorization output (onset + pattern). Cache hits reduce graph queries on repeated analysis.

- Use errors.As for wrapped error checking (errorlint)
- Combine parameter types for readability (gocritic)
- Remove unused recentTransitions parameter (unparam)
- Update test to match simplified signature

Tasks completed: 3/3
- Wire AlertAnalysisService into integration lifecycle
- Add integration tests for end-to-end analysis flow
- End-to-end verification and documentation
SUMMARY: .planning/phases/22-historical-analysis/22-03-SUMMARY.md
- Phase 22 executed: 3 plans, 3 waves
- AlertAnalysisService with flappiness detection and baseline comparison
- Multi-label categorization (onset + pattern dimensions)
- 5-minute TTL cache with 1000-entry LRU
- GetAnalysisService() getter for Phase 23 MCP tools
- 48 tests passing, ~85% coverage

Phase 23: MCP Tools
- Overview aggregation: severity-first with cluster/service counts
- Flappiness: count in summary, transition count in aggregated
- State progression: 10-min time buckets with single letters
- Filter parameters: all optional, 1h default lookback
Phase 23: MCP Tools
- Standard stack identified (mcp-go, integration.ToolRegistry)
- Architecture patterns documented (progressive disclosure trio)
- Pitfalls catalogued (ErrInsufficientData handling, filter case sensitivity)
- Phase 22 service integration patterns verified

Phase 23: MCP Tools
- 3 plan(s) in 2 wave(s) - 2 parallel, 1 sequential - Ready for execution
- AlertsAggregatedTool with 10-minute bucket timeline
- LOCF interpolation for state progression
- Compact notation: [F F N N] for readability
- Analysis enrichment with categories and flappiness
- Flexible filtering: severity, cluster, service, namespace
- Default 1h lookback, configurable duration
- AlertsOverviewTool groups alerts by severity with optional filters
- All parameters optional (severity, cluster, service, namespace)
- Flappiness detection using 0.7 threshold from Phase 22
- Handles nil AlertAnalysisService gracefully (graph disabled)
- Handles ErrInsufficientData with errors.As check
- Returns minimal AlertSummary (name + firing duration)
- Query uses label JSON string matching for filters
- Case-insensitive severity normalization

- AlertsDetailsTool with complete 7-day state timeline
- StatePoint array with explicit timestamps and durations
- Full alert metadata: labels, annotations, rule definition
- Analysis enrichment with baseline and deviation metrics
- Warning for large responses with multiple alerts
- Flexible filtering by UID or multiple criteria
- Register grafana_{name}_alerts_overview tool in RegisterTools
- All parameters marked as optional in schema (required: [])
- Update success message to 4 Grafana MCP tools
- Tool description emphasizes progressive disclosure pattern
- Uses g.name as integrationName, g.analysisService for flappiness
- Register grafana_{name}_alerts_aggregated with lookback param
- Register grafana_{name}_alerts_details with alert_uid support
- Progressive disclosure pattern in tool descriptions
- Update success log to '6 Grafana MCP tools'
- All filter parameters optional for flexibility
Tasks completed: 2/2
- Create Overview Tool with Filtering and Aggregation
- Register Overview Tool in Integration
SUMMARY: .planning/phases/23-mcp-tools/23-01-SUMMARY.md

Tasks completed: 3/3
- Create Aggregated Tool with State Timeline Buckets
- Create Details Tool with Full State History
- Register Aggregated and Details Tools
SUMMARY: .planning/phases/23-mcp-tools/23-02-SUMMARY.md

- TestAlertsOverviewTool tests: groups by severity, filters, flappiness, nil service
- TestAlertsAggregatedTool tests: state timeline bucketization, category enrichment, insufficient data
- TestAlertsDetailsTool tests: full history, parameter validation
- TestAlertsProgressiveDisclosure: end-to-end workflow across all three tools
- mockAlertGraphClient provides both Alert nodes and STATE_TRANSITION edges
- Validates 10-minute bucket timelines with LOCF interpolation
- Tests ErrInsufficientData handling for new alerts (<24h history)
- Verifies category formatting: "CHRONIC + flapping" pattern

Tasks completed: 2/2
- Task 1: Create integration tests for all three alert tools
- Task 2: Progressive disclosure workflow test (merged into Task 1)
SUMMARY: .planning/phases/23-mcp-tools/23-03-SUMMARY.md
Test coverage:
- AlertsOverviewTool: 4 tests (groups, filters, flappiness, nil service)
- AlertsAggregatedTool: 3 tests (timeline, category, insufficient data)
- AlertsDetailsTool: 2 tests (full history, parameter validation)
- Progressive disclosure: 1 end-to-end test
v1.4 Grafana Alerts Integration COMPLETE
Phase 23 COMPLETE
Phase 23 complete:
- grafana_{name}_alerts_overview: severity grouping, flappiness (306 lines)
- grafana_{name}_alerts_aggregated: compact [F F N N] timelines (430 lines)
- grafana_{name}_alerts_details: full 7-day history (308 lines)
- Integration tests: 10 tests, progressive disclosure workflow (959 lines)
v1.4 Grafana Alerts Integration shipped:
- 4 phases (20-23), 10 plans, 22 requirements
- Alert rule sync, state tracking, analysis service, MCP tools
22/22 requirements satisfied
4/4 phases verified
All cross-phase wiring connected
All E2E flows complete
No technical debt

- Create v1.4-ROADMAP.md with full phase details and key decisions
- Create v1.4-REQUIREMENTS.md archive (22/22 satisfied)
- Update MILESTONES.md with v1.4 entry and v1.3 summary
- Update PROJECT.md: v1.4 shipped, validated requirements, key decisions
- Move v1.4-MILESTONE-AUDIT.md to milestones folder
Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>