Ideas - Agnostic Code Scanner #57

@noelsaw1

The Insight

Most code quality issues aren't about specific strings—they're about structural anomalies relative to the codebase's own norms. What if the scanner learned what "normal" looks like and flagged deviations?

Agnostic Pattern Detection Approaches

1. Statistical Baseline Anomaly Detection

Instead of "does this match a bad pattern," ask "does this deviate from how this codebase usually does things?"

// Pseudocode concept: analyzeCodebase, extractMetrics, and zscore are placeholders
const baseline = analyzeCodebase(files);
// baseline = {
//   functionLength: { mean: 24, stdDev: 12 },
//   loopNestingDistribution: [0.7, 0.2, 0.08, 0.02], // depth 1,2,3,4+
//   queryPatterns: { withLimit: 0.85, withoutLimit: 0.15 },
//   ...
// }

function detectAnomalies(file, baseline) {
  const metrics = extractMetrics(file); // e.g. [{ type: 'functionLength', value: 61 }, ...]
  // Flag deviations in either direction from the codebase norm
  return metrics.filter(m => Math.abs(zscore(m.value, baseline[m.type])) > 2.5);
}

A function 3 standard deviations longer than your codebase average? Flag it. A file with 15 database calls when your norm is 2? Worth a look.

Language agnostic because: You're measuring universal structural properties—length, nesting, repetition—not language-specific syntax.
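
To make the baseline idea a bit more concrete, here is a rough sketch of what the profiling and flagging steps could look like. Everything here is an assumption on my part: the helper names (summarize, buildBaseline, zscore, detectAnomalies), the flat observation shape, and the use of a population standard deviation are illustrative choices, not an existing API.

```javascript
// Hypothetical helpers: none of these names come from an existing library.

// Mean and population standard deviation of a list of numbers.
function summarize(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Build a baseline keyed by metric type from observations such as
// { unit: "fn0", type: "functionLength", value: 24 }.
function buildBaseline(observations) {
  const byType = new Map();
  for (const { type, value } of observations) {
    if (!byType.has(type)) byType.set(type, []);
    byType.get(type).push(value);
  }
  const baseline = {};
  for (const [type, values] of byType) baseline[type] = summarize(values);
  return baseline;
}

// Signed z-score; 0 when the metric never varies in this codebase.
function zscore(value, { mean, stdDev }) {
  return stdDev === 0 ? 0 : (value - mean) / stdDev;
}

// Return observations whose |z| exceeds the threshold, with the score attached.
function detectAnomalies(observations, baseline, threshold = 2.5) {
  return observations
    .map((o) => ({ ...o, z: zscore(o.value, baseline[o.type]) }))
    .filter((o) => Math.abs(o.z) > threshold);
}

// Usage: twenty ordinary functions and one very long one.
const observations = Array.from({ length: 20 }, (_, i) => ({
  unit: `fn${i}`,
  type: "functionLength",
  value: 20 + (i % 5),
}));
observations.push({ unit: "giantFn", type: "functionLength", value: 200 });

const baseline = buildBaseline(observations);
const anomalies = detectAnomalies(observations, baseline);
console.log(anomalies.map((a) => a.unit)); // only "giantFn" is flagged
```

One caveat with this design: with a population standard deviation, no value can sit more than sqrt(n - 1) deviations from the mean, so a metric with fewer than 8 samples can never cross a 2.5 threshold. Small codebases would need a lower threshold or a different outlier test.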

2. Structural Fingerprinting

Reduce code to abstract "shapes" and look for unusual ones.

// Transform code to structure tokens
"if (x) { foo(); bar(); }"  →  "COND { CALL CALL }"
"if (y) { baz(); }"         →  "COND { CALL }"
"if (z) { a();b();c();d();e();f();g();h(); }"  →  "COND { CALL CALL CALL CALL CALL CALL CALL CALL }"

// The third one is structurally anomalous—8 sequential calls in a conditional

You'd define structure extractors per language (simple regex or lightweight parsing), but the anomaly detection engine is universal.
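
As a rough illustration of the "simple regex" end of that spectrum, a shape extractor for C-like syntax might look like the sketch below. The function names (toShape, longestCallRun) and the token vocabulary beyond COND and CALL are my own assumptions, and the regex ignores strings and comments, so treat it as a toy rather than a real adapter.

```javascript
// Hypothetical extractor for C-like syntax; a real adapter would use a parser.
// Strings and comments are not handled, so braces or calls inside them
// will be tokenized like real code.
function toShape(code) {
  const tokenPattern =
    /\b(if|else|for|while|switch)\b|([A-Za-z_$][\w$.]*)\s*\(|([{}])/g;
  const tokens = [];
  for (const match of code.matchAll(tokenPattern)) {
    const [, keyword, callee, brace] = match;
    if (keyword === "if" || keyword === "switch") tokens.push("COND");
    else if (keyword === "else") tokens.push("ELSE");
    else if (keyword) tokens.push("LOOP"); // for / while
    else if (callee) tokens.push("CALL");
    else tokens.push(brace);
  }
  return tokens.join(" ");
}

// Longest run of consecutive CALL tokens in a shape string.
function longestCallRun(shape) {
  let longest = 0;
  let current = 0;
  for (const token of shape.split(" ")) {
    current = token === "CALL" ? current + 1 : 0;
    longest = Math.max(longest, current);
  }
  return longest;
}

// Usage with the three snippets above.
const snippets = [
  "if (x) { foo(); bar(); }",
  "if (y) { baz(); }",
  "if (z) { a();b();c();d();e();f();g();h(); }",
];
for (const snippet of snippets) {
  const shape = toShape(snippet);
  console.log(shape, "| longest call run:", longestCallRun(shape));
}
```

The run lengths (2, 1, 8) are exactly the kind of per-unit number the statistical engine could compare against the codebase norm, without knowing which language produced them.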

3. Entropy and Repetition Analysis

Borrowed from compression theory:

  • High local entropy = possibly complex/clever code worth review
  • Low entropy repeated blocks = copy-paste that should be abstracted
  • Sudden entropy shifts = style inconsistency, possibly merged from different sources

function entropySignature(code, windowSize = 500) {
  const windows = slidingWindow(code, windowSize);
  return windows.map(w => ({
    position: w.start,
    entropy: shannonEntropy(w.text),
    compression: gzipRatio(w.text)
  }));
}

// Flag: "Lines 450-500 have entropy 2.3 std devs below codebase mean"
// Translation: "This looks like boilerplate/copy-paste"
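
The helpers that entropySignature relies on are not defined above, so here is one possible way to fill in two of them. This is a sketch under my own assumptions: entropy is measured per character in bits, windows do not overlap by default, and the compression ratio is left out to keep the example dependency-free.

```javascript
// Shannon entropy in bits per character.
function shannonEntropy(text) {
  const counts = new Map();
  let total = 0;
  for (const ch of text) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
    total++;
  }
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / total;
    entropy -= p * Math.log2(p);
  }
  return entropy; // 0 for empty or single-character text
}

// Split text into windows; `step` defaults to non-overlapping windows
// (an assumption, the original does not say whether windows overlap).
function slidingWindow(code, windowSize, step = windowSize) {
  const windows = [];
  for (let start = 0; start < code.length; start += step) {
    windows.push({ start, text: code.slice(start, start + windowSize) });
  }
  return windows;
}

// Same shape as entropySignature above, minus the gzip-based ratio.
function entropyProfile(code, windowSize = 500) {
  return slidingWindow(code, windowSize).map((w) => ({
    position: w.start,
    entropy: shannonEntropy(w.text),
  }));
}

// Usage: a repetitive block followed by a denser line of code.
const repetitive = "x = 1;\n".repeat(10);
const dense =
  "const total = items.reduce((sum, item) => sum + item.price * qty, 0);\n";
const profile = entropyProfile(repetitive + dense, 70);
console.log(profile);
```

In this toy input the first window (the repeated assignment) scores noticeably lower than the second, which is the "looks like boilerplate" signal described above. Real thresholds would have to come from the codebase's own distribution.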

4. Graph-Based Pattern Detection

Model code as a graph of relationships and look for unusual topologies:

File A imports: [B, C]
File B imports: [C]
File C imports: [A]  ← circular dependency, structural anomaly

Function X calls: [Y, Z, DB, DB, DB, DB, DB]  ← fanout anomaly to DB layer

The detection is universal: "nodes with unusually high fanout," "cycles in directed graph," "clusters with no external connections."
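
Two of those checks are easy to sketch once the graph is a plain adjacency list. The function names (findCycle, fanoutByTarget) and the object-of-arrays representation are assumptions for illustration; the cycle search is an ordinary depth-first search that tracks which nodes are on the current path.

```javascript
// Find one cycle in a directed graph given as { node: [neighbors] }.
// Returns the cycle as a list of nodes (first node repeated at the end),
// or null when the graph is acyclic.
function findCycle(graph) {
  const state = new Map(); // node -> "visiting" | "done"
  const stack = []; // nodes on the current DFS path

  function visit(node) {
    state.set(node, "visiting");
    stack.push(node);
    for (const next of graph[node] || []) {
      if (state.get(next) === "visiting") {
        // Back edge: everything from `next` to the top of the stack is a cycle.
        return [...stack.slice(stack.indexOf(next)), next];
      }
      if (!state.has(next)) {
        const cycle = visit(next);
        if (cycle) return cycle;
      }
    }
    stack.pop();
    state.set(node, "done");
    return null;
  }

  for (const node of Object.keys(graph)) {
    if (!state.has(node)) {
      const cycle = visit(node);
      if (cycle) return cycle;
    }
  }
  return null;
}

// Count how often each target appears in a unit's call list.
function fanoutByTarget(calls) {
  const counts = {};
  for (const target of calls) counts[target] = (counts[target] || 0) + 1;
  return counts;
}

// Usage with the examples above.
const imports = { A: ["B", "C"], B: ["C"], C: ["A"] };
console.log(findCycle(imports)); // [ 'A', 'B', 'C', 'A' ]

const callsFromX = ["Y", "Z", "DB", "DB", "DB", "DB", "DB"];
console.log(fanoutByTarget(callsFromX)); // { Y: 1, Z: 1, DB: 5 }
```

The recursive search is fine for a sketch but could hit the call-stack limit on very deep dependency chains; an iterative version would be safer for large repositories.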

5. Temporal/Historical Anomaly

If you have git history:

// Files that change together but aren't colocated = hidden coupling
// Files that changed 50 times in 6 months = hotspot/instability
// Functions that grow 10% every sprint = unbounded complexity creep
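
Here is a rough sketch of the first two signals. It assumes the history has already been reduced to one array of changed file paths per commit (something like the output of git log --name-only --pretty=format: split on blank lines would do). The function names, the "different directory" test for colocation, and the minimum co-change count are all my own assumptions.

```javascript
// Directory part of a path; "" for files at the repository root.
function dirOf(file) {
  const i = file.lastIndexOf("/");
  return i === -1 ? "" : file.slice(0, i);
}

// Number of commits touching each file (hotspot candidates).
function changeCounts(commits) {
  const counts = new Map();
  for (const files of commits) {
    for (const file of new Set(files)) {
      counts.set(file, (counts.get(file) || 0) + 1);
    }
  }
  return counts;
}

// Pairs of files in different directories that changed together
// at least `minCoChanges` times (hidden coupling candidates).
function hiddenCoupling(commits, minCoChanges = 2) {
  const pairs = new Map();
  for (const files of commits) {
    const unique = [...new Set(files)].sort();
    for (let i = 0; i < unique.length; i++) {
      for (let j = i + 1; j < unique.length; j++) {
        if (dirOf(unique[i]) === dirOf(unique[j])) continue; // colocated
        const key = `${unique[i]} <-> ${unique[j]}`;
        pairs.set(key, (pairs.get(key) || 0) + 1);
      }
    }
  }
  return [...pairs]
    .filter(([, count]) => count >= minCoChanges)
    .map(([pair, count]) => ({ pair, count }));
}

// Usage with a tiny, made-up history.
const commits = [
  ["src/cart/total.js", "src/billing/tax.js"],
  ["src/cart/total.js", "src/billing/tax.js", "README.md"],
  ["src/cart/total.js", "src/cart/items.js"],
];
const counts = changeCounts(commits);
const coupled = hiddenCoupling(commits);
console.log(counts.get("src/cart/total.js")); // 3
console.log(coupled); // total.js and tax.js changed together twice
```

The pair loop is quadratic in the number of files per commit, so very large commits (mass renames, formatting sweeps) would probably need to be skipped or capped.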

The Abstraction Layer

Here's how I'd architect it language-agnostically:

┌─────────────────────────────────────────────────────┐
│                   DETECTION ENGINE                   │
│  (statistical models, graph analysis, entropy calc)  │
│         Knows nothing about any language             │
└─────────────────────┬───────────────────────────────┘
                      │
          Consumes "Universal Code Model"
                      │
┌─────────────────────┴───────────────────────────────┐
│                  ADAPTER LAYER                       │
│   Transforms language → Universal Code Model         │
│                                                      │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐   │
│  │   PHP   │ │   JS    │ │ Python  │ │  YAML   │   │
│  │ Adapter │ │ Adapter │ │ Adapter │ │ Adapter │   │
│  └─────────┘ └─────────┘ └─────────┘ └─────────┘   │
└─────────────────────────────────────────────────────┘

The Universal Code Model might look like:

{
  units: [  // functions, methods, blocks
    { 
      id: 'UserService.getOrders',
      type: 'function',
      metrics: {
        length: 45,
        cyclomaticComplexity: 8,
        nestingDepth: 4,
        paramCount: 3
      },
      calls: ['DB.query', 'Logger.info', 'Cache.get'],
      contains: ['loop', 'conditional', 'try-catch']
    }
  ],
  dependencies: [ /* import/require graph */ ],
  structure: [ /* AST-lite: just shapes */ ]
}

Adapters can be as simple as regex-based heuristics or as sophisticated as tree-sitter parsers—the detection engine doesn't care.
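
To show how small the regex end of that range can be, here is a hypothetical adapter for JavaScript-like source that emits a subset of the model above. The function name adaptSource, the keyword filter, and the brace-counting approach are assumptions for illustration; it only sees classic function declarations, and braces inside strings or comments will throw the counts off.

```javascript
// Hypothetical regex-based adapter: source text -> Universal Code Model subset.
function adaptSource(source) {
  const keywords = ["if", "for", "while", "switch", "catch"];
  const header = /function\s+([A-Za-z_$][\w$]*)\s*\(([^)]*)\)\s*\{/g;
  const units = [];

  for (const match of source.matchAll(header)) {
    // Walk forward from just past the opening brace until it is closed.
    let end = match.index + match[0].length;
    let depth = 1;
    let maxDepth = 1;
    while (end < source.length && depth > 0) {
      if (source[end] === "{") maxDepth = Math.max(maxDepth, ++depth);
      else if (source[end] === "}") depth--;
      end++;
    }

    const text = source.slice(match.index, end);
    const body = text.slice(match[0].length);
    const params = match[2].trim();

    units.push({
      id: match[1],
      type: "function",
      metrics: {
        length: text.split("\n").length, // lines, header included
        nestingDepth: maxDepth - 1, // blocks nested inside the body
        paramCount: params === "" ? 0 : params.split(",").length,
      },
      // Anything that looks like name( and is not a control keyword.
      calls: [...body.matchAll(/([A-Za-z_$][\w$.]*)\s*\(/g)]
        .map((m) => m[1])
        .filter((name) => !keywords.includes(name)),
    });
  }

  // Dependencies and structure are left empty in this sketch.
  return { units, dependencies: [], structure: [] };
}

// Usage
const sample = `function getOrders(userId, limit) {
  for (const id of ids) {
    if (id) {
      DB.query(id);
    }
  }
  Logger.info("done");
}`;
const model = adaptSource(sample);
console.log(JSON.stringify(model.units[0], null, 2));
```

A tree-sitter adapter would replace the body of adaptSource entirely but could return the same object, which is the point of keeping the model as the only contract between the layers.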

Why This Is Interesting For Your Toolkit

You're already categorizing rules into layers (regex → PHPStan → runtime). This adds another dimension:

| Approach            | What It Catches                              |
|---------------------|----------------------------------------------|
| Explicit rules      | Known antipatterns                           |
| Statistical anomaly | Unknown antipatterns unique to this codebase |
| Structural analysis | Architectural drift, hidden coupling         |
| Historical analysis | Risk hotspots, unstable code                 |

The explicit rules tell you "this is bad." The anomaly detection tells you "this is weird—maybe investigate."

Concrete Next Step?

I could sketch out either:

  1. A minimal "anomaly baseline" scanner that profiles a codebase and flags statistical outliers—maybe 200 lines of JS
  2. A Universal Code Model schema that could become the interchange format between your shell layer, JS tools, and potentially PHPStan output

Which direction interests you more?
