The Insight
Most code quality issues aren't about specific strings—they're about structural anomalies relative to the codebase's own norms. What if the scanner learned what "normal" looks like and flagged deviations?
Language-Agnostic Pattern Detection Approaches
1. Statistical Baseline Anomaly Detection
Instead of "does this match a bad pattern," ask "does this deviate from how this codebase usually does things?"
// Pseudocode concept
const baseline = analyzeCodebase(files);
// baseline = {
// avgFunctionLength: 24,
// stdDevFunctionLength: 12,
// loopNestingDistribution: [0.7, 0.2, 0.08, 0.02], // depth 1,2,3,4+
// queryPatterns: { withLimit: 0.85, withoutLimit: 0.15 },
// ...
// }
function detectAnomalies(file, baseline) {
const metrics = extractMetrics(file);
return metrics.filter(m => zscore(m.value, baseline[m.type]) > 2.5);
}

A function 3 standard deviations longer than your codebase average? Flag it. A file with 15 database calls when your norm is 2? Worth a look.
Language-agnostic because: you're measuring universal structural properties—length, nesting, repetition—not language-specific syntax.
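To make that concrete, here's a tiny runnable sketch of the z-score check, assuming function lengths have already been extracted somewhere upstream (the input shape and the flagLengthOutliers name are made up for illustration):

// Profile one metric (function length) and flag statistical outliers.
function mean(xs) { return xs.reduce((a, b) => a + b, 0) / xs.length; }
function stdDev(xs) {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map(x => (x - m) ** 2)));
}
function zscore(value, values) {
  const sd = stdDev(values);
  return sd === 0 ? 0 : Math.abs(value - mean(values)) / sd;
}

// functions: e.g. [{ id: 'UserService.getOrders', length: 45 }, ...]
function flagLengthOutliers(functions, threshold = 2.5) {
  const lengths = functions.map(f => f.length);
  return functions.filter(f => zscore(f.length, lengths) > threshold);
}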
2. Structural Fingerprinting
Reduce code to abstract "shapes" and look for unusual ones.
// Transform code to structure tokens
"if (x) { foo(); bar(); }" → "COND { CALL CALL }"
"if (y) { baz(); }" → "COND { CALL }"
"if (z) { a();b();c();d();e();f();g();h(); }" → "COND { CALL CALL CALL CALL CALL CALL CALL CALL }"
// The third one is structurally anomalous—8 sequential calls in a conditional

You'd define structure extractors per language (simple regex or lightweight parsing), but the anomaly detection engine is universal.
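As a sketch of how cheap that extractor can be, here's a naive regex-based fingerprinter for JS-ish code (the token set and regexes are illustrative only, nothing like a real parser):

// Reduce source to structure tokens; fingerprints that occur rarely across
// the codebase are the ones worth a look.
function fingerprint(source) {
  return source
    .replace(/"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/g, 'STR') // collapse string literals
    .replace(/\b(if|switch|case)\b[^\n{]*/g, 'COND ')
    .replace(/\b(for|while|do)\b[^\n{]*/g, 'LOOP ')
    .replace(/\b\w+\s*\([^)]*\)/g, 'CALL')
    .replace(/[^A-Z{}]+/g, ' ')   // drop everything except tokens and braces
    .replace(/([{}])/g, ' $1 ')   // pad braces so tokens stay separated
    .replace(/\s+/g, ' ')
    .trim();
}

// fingerprint("if (x) { foo(); bar(); }") → "COND { CALL CALL }"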
3. Entropy and Repetition Analysis
Borrowed from compression theory:
- High local entropy = possibly complex/clever code worth review
- Low entropy repeated blocks = copy-paste that should be abstracted
- Sudden entropy shifts = style inconsistency, possibly merged from different sources
function entropySignature(code, windowSize = 500) {
const windows = slidingWindow(code, windowSize);
return windows.map(w => ({
position: w.start,
entropy: shannonEntropy(w.text),
compression: gzipRatio(w.text)
}));
}
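// One way the assumed helpers could look (illustrative sketches, not part of
// the original pseudocode; gzipRatio assumes Node's zlib):
function slidingWindow(text, size) {
  const windows = [];
  for (let start = 0; start < text.length; start += size) {
    windows.push({ start, text: text.slice(start, start + size) });
  }
  return windows;
}

function shannonEntropy(text) {
  const counts = {};
  for (const ch of text) counts[ch] = (counts[ch] || 0) + 1;
  return Object.values(counts).reduce((h, c) => {
    const p = c / text.length;
    return h - p * Math.log2(p);
  }, 0);
}

function gzipRatio(text) {
  const zlib = require('zlib');
  return zlib.gzipSync(Buffer.from(text)).length / Buffer.byteLength(text);
}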
// Flag: "Lines 450-500 have entropy 2.3 std devs below codebase mean"
// Translation: "This looks like boilerplate/copy-paste"

4. Graph-Based Pattern Detection
Model code as a graph of relationships and look for unusual topologies:
File A imports: [B, C]
File B imports: [C]
File C imports: [A] ← circular dependency, structural anomaly
Function X calls: [Y, Z, DB, DB, DB, DB, DB] ← fanout anomaly to DB layer
The detection is universal: "nodes with unusually high fanout," "cycles in directed graph," "clusters with no external connections."
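A sketch of those two checks over a plain adjacency-list graph (the graph shape and thresholds here are assumptions, not anything from the toolkit):

// graph: { 'FileA': ['FileB', 'FileC'], 'FileB': ['FileC'], 'FileC': ['FileA'] }
function highFanoutNodes(graph, threshold = 10) {
  return Object.entries(graph)
    .filter(([, targets]) => targets.length > threshold)
    .map(([node, targets]) => ({ node, fanout: targets.length }));
}

function findCycle(graph) {
  const visiting = new Set();
  const done = new Set();
  const visit = (node, path) => {
    if (visiting.has(node)) return [...path, node]; // back edge: path ends in a cycle
    if (done.has(node)) return null;
    visiting.add(node);
    for (const next of graph[node] || []) {
      const cycle = visit(next, [...path, node]);
      if (cycle) return cycle;
    }
    visiting.delete(node);
    done.add(node);
    return null;
  };
  for (const node of Object.keys(graph)) {
    const cycle = visit(node, []);
    if (cycle) return cycle;
  }
  return null;
}

// findCycle({ A: ['B', 'C'], B: ['C'], C: ['A'] }) → ['A', 'B', 'C', 'A']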
5. Temporal/Historical Anomaly
If you have git history:
// Files that change together but aren't colocated = hidden coupling
// Files that changed 50 times in 6 months = hotspot/instability
// Functions that grow 10% every sprint = unbounded complexity creep
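A rough sketch of the hotspot check, assuming a git checkout and Node's child_process (the flags are standard git log options; the threshold is made up):

const { execSync } = require('child_process');

// Count how often each file changed recently; frequent changers are hotspots.
function changeHotspots(repoPath, minChanges = 20) {
  const log = execSync('git log --since="6 months ago" --name-only --pretty=format:', {
    cwd: repoPath,
    encoding: 'utf8'
  });
  const counts = {};
  for (const file of log.split('\n').filter(Boolean)) {
    counts[file] = (counts[file] || 0) + 1;
  }
  return Object.entries(counts)
    .filter(([, n]) => n >= minChanges)
    .sort((a, b) => b[1] - a[1]); // [file, changeCount] pairs, most-changed first
}

The Abstraction Layer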
Here's how I'd architect it language-agnostically:
┌─────────────────────────────────────────────────────┐
│                  DETECTION ENGINE                   │
│ (statistical models, graph analysis, entropy calc)  │
│          Knows nothing about any language           │
└─────────────────────┬───────────────────────────────┘
                      │
       Consumes "Universal Code Model"
                      │
┌─────────────────────┴───────────────────────────────┐
│                    ADAPTER LAYER                    │
│     Transforms language → Universal Code Model      │
│                                                     │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐    │
│  │   PHP   │ │   JS    │ │ Python  │ │  YAML   │    │
│  │ Adapter │ │ Adapter │ │ Adapter │ │ Adapter │    │
│  └─────────┘ └─────────┘ └─────────┘ └─────────┘    │
└─────────────────────────────────────────────────────┘
The Universal Code Model might look like:
{
units: [ // functions, methods, blocks
{
id: 'UserService.getOrders',
type: 'function',
metrics: {
length: 45,
cyclomaticComplexity: 8,
nestingDepth: 4,
paramCount: 3
},
calls: ['DB.query', 'Logger.info', 'Cache.get'],
contains: ['loop', 'conditional', 'try-catch']
}
],
dependencies: [ /* import/require graph */ ],
structure: [ /* AST-lite: just shapes */ ]
}

Adapters can be as simple as regex-based heuristics or as sophisticated as tree-sitter parsers—the detection engine doesn't care.
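For concreteness, an adapter can start as a throwaway heuristic like this (naive, regex-based, JS-only; the field names mirror the schema above, but the extraction is nowhere near complete):

// Emits Universal Code Model units from a single source file.
function naiveJsAdapter(source, fileId) {
  const units = [];
  const fnRegex = /function\s+(\w+)\s*\(([^)]*)\)/g;
  let match;
  while ((match = fnRegex.exec(source)) !== null) {
    const [, name, params] = match;
    const body = source.slice(match.index); // crude: treats the rest of the file as the body
    units.push({
      id: `${fileId}.${name}`,
      type: 'function',
      metrics: {
        length: body.split('\n').length,
        paramCount: params.trim() ? params.split(',').length : 0
      },
      calls: [...body.matchAll(/(\w+\.\w+)\s*\(/g)].map(m => m[1]),
      contains: [
        [/\b(for|while)\b/, 'loop'],
        [/\bif\b/, 'conditional'],
        [/\btry\b/, 'try-catch']
      ].filter(([re]) => re.test(body)).map(([, label]) => label)
    });
  }
  return { units, dependencies: [], structure: [] };
}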
Why This Is Interesting For Your Toolkit
You're already categorizing rules into layers (regex → PHPStan → runtime). This adds another dimension:
| Approach | What It Catches |
|---|---|
| Explicit rules | Known antipatterns |
| Statistical anomaly | Unknown antipatterns unique to this codebase |
| Structural analysis | Architectural drift, hidden coupling |
| Historical analysis | Risk hotspots, unstable code |
The explicit rules tell you "this is bad." The anomaly detection tells you "this is weird—maybe investigate."
Concrete Next Step?
I could sketch out either:
- A minimal "anomaly baseline" scanner that profiles a codebase and flags statistical outliers—maybe 200 lines of JS
- A Universal Code Model schema that could become the interchange format between your shell layer, JS tools, and potentially PHPStan output
Which direction interests you more?