# Ideas - Agnostic Code Scanner (#57)

## The Insight

Most code quality issues aren't about specific strings—they're about **structural anomalies** relative to the codebase's own norms. What if the scanner learned what "normal" looks like and flagged deviations?

## Agnostic Pattern Detection Approaches

### 1. Statistical Baseline Anomaly Detection

Instead of "does this match a bad pattern," ask "does this deviate from how this codebase usually does things?"

```javascript
// Pseudocode concept: analyzeCodebase, extractMetrics and zscore are not defined here
const baseline = analyzeCodebase(files);
// baseline = {
//   functionLength: { mean: 24, stdDev: 12 },
//   loopNestingDistribution: [0.7, 0.2, 0.08, 0.02], // depth 1,2,3,4+
//   queryPatterns: { withLimit: 0.85, withoutLimit: 0.15 },
//   ...
// }

function detectAnomalies(file, baseline, threshold = 2.5) {
  const metrics = extractMetrics(file); // [{ type, value }, ...]
  // Two-sided check: unusually small values are deviations too
  return metrics.filter(m => Math.abs(zscore(m.value, baseline[m.type])) > threshold);
}
```

A function 3 standard deviations longer than your codebase average? Flag it. A file with 15 database calls when your norm is 2? Worth a look.

**Language agnostic because:** You're measuring universal structural properties—length, nesting, repetition—not language-specific syntax.
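To make the idea concrete, here is a minimal runnable sketch of the baseline-plus-z-score check. The helper names (`buildBaseline`, `detectAnomalies`), the `{ mean, stdDev }` baseline shape, and the sample numbers are all assumptions for illustration, not an existing API.

```javascript
// Hypothetical helpers; sample data is made up for illustration.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function stdDev(values) {
  const m = mean(values);
  const variance = values.reduce((sum, v) => sum + (v - m) ** 2, 0) / values.length;
  return Math.sqrt(variance); // population standard deviation
}

// samples: { metricName: [values observed across the codebase] }
function buildBaseline(samples) {
  const baseline = {};
  for (const [type, values] of Object.entries(samples)) {
    baseline[type] = { mean: mean(values), stdDev: stdDev(values) };
  }
  return baseline;
}

function zscore(value, stats) {
  // Zero spread means every sample was identical; this sketch reports no deviation
  return stats.stdDev === 0 ? 0 : (value - stats.mean) / stats.stdDev;
}

function detectAnomalies(metrics, baseline, threshold = 2.5) {
  return metrics
    .filter(m => baseline[m.type] !== undefined)
    .map(m => ({ ...m, z: zscore(m.value, baseline[m.type]) }))
    .filter(m => Math.abs(m.z) > threshold);
}

const baseline = buildBaseline({
  functionLength: [10, 20, 30, 20, 10, 30, 20, 20],
});
const flagged = detectAnomalies(
  [
    { unit: 'parseConfig', type: 'functionLength', value: 22 },
    { unit: 'renderReport', type: 'functionLength', value: 95 },
  ],
  baseline
);
console.log(flagged.map(m => m.unit)); // [ 'renderReport' ]
```

Function lengths are rarely normally distributed, so a real scanner might prefer a median-based measure; the z-score is used here only because it matches the pseudocode above.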

### 2. Structural Fingerprinting

Reduce code to abstract "shapes" and look for unusual ones.

```
// Transform code to structure tokens
"if (x) { foo(); bar(); }" → "COND { CALL CALL }"
"if (y) { baz(); }"        → "COND { CALL }"
"if (z) { a();b();c();d();e();f();g();h(); }" → "COND { CALL CALL CALL CALL CALL CALL CALL CALL }"

// The third one is structurally anomalous—8 sequential calls in a conditional
```

You'd define structure extractors per language (simple regex or lightweight parsing), but the anomaly detection engine is universal.
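A rough sketch of what a regex-based extractor for C-like syntax could look like. The token names follow the example above; the `TOKEN` regex, `fingerprint`, and `longestRun` are hypothetical. It does not strip strings or comments, and it would also read declarations and `catch (` as `CALL`, so treat it as a starting point only.

```javascript
// One alternation, tried left to right at each position:
// conditional keyword | loop keyword | identifier followed by "(" | brace
const TOKEN = /\b(if|switch)\b|\b(for|while)\b|([A-Za-z_$][\w$.]*)\s*\(|([{}])/g;

function fingerprint(code) {
  const shape = [];
  for (const match of code.matchAll(TOKEN)) {
    if (match[1]) shape.push('COND');
    else if (match[2]) shape.push('LOOP');
    else if (match[3]) shape.push('CALL');
    else shape.push(match[4]); // "{" or "}"
  }
  return shape.join(' ');
}

// Length of the longest uninterrupted run of one token in a shape string
function longestRun(shape, token) {
  let best = 0;
  let current = 0;
  for (const t of shape.split(' ')) {
    current = t === token ? current + 1 : 0;
    best = Math.max(best, current);
  }
  return best;
}

console.log(fingerprint('if (x) { foo(); bar(); }')); // COND { CALL CALL }
console.log(longestRun(fingerprint('if (z) { a();b();c();d();e();f();g();h(); }'), 'CALL')); // 8
```

The run length (or the shape string itself) can then be fed to the same statistical check as any other metric.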

### 3. Entropy and Repetition Analysis

Borrowed from compression theory:

- **High local entropy** = possibly complex/clever code worth review
- **Low entropy repeated blocks** = copy-paste that should be abstracted
- **Sudden entropy shifts** = style inconsistency, possibly merged from different sources

```javascript
function entropySignature(code, windowSize = 500) {
  const windows = slidingWindow(code, windowSize);
  return windows.map(w => ({
    position: w.start,
    entropy: shannonEntropy(w.text),
    compression: gzipRatio(w.text)
  }));
}

// Flag: "Lines 450-500 have entropy 2.3 std devs below codebase mean"
// Translation: "This looks like boilerplate/copy-paste"
```
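The helpers used above are left undefined, so here is one possible way to fill in `shannonEntropy` and `slidingWindow`. This sketch measures entropy per character and assumes character-based windows; the `gzipRatio` field is omitted to keep the example dependency-free, and the sample text is invented.

```javascript
// Shannon entropy in bits per character
function shannonEntropy(text) {
  const counts = new Map();
  let total = 0;
  for (const ch of text) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
    total += 1;
  }
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / total;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Windows of windowSize characters; step defaults to non-overlapping windows
function slidingWindow(code, windowSize, step = windowSize) {
  const windows = [];
  for (let start = 0; start < code.length; start += step) {
    windows.push({ start, text: code.slice(start, start + windowSize) });
  }
  return windows;
}

function entropySignature(code, windowSize = 500) {
  return slidingWindow(code, windowSize).map(w => ({
    position: w.start,
    entropy: shannonEntropy(w.text),
  }));
}

// Repetitive boilerplate followed by a denser line of code
const sample = 'x = 1;\n'.repeat(10) + 'return cache.get(key) ?? await db.query(sql, [id]);';
const signature = entropySignature(sample, 50);
console.log(signature);
```

In this sample the first window (pure repetition) scores lower than the last one, which is the signal the "boilerplate/copy-paste" flag relies on.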

### 4. Graph-Based Pattern Detection

Model code as a graph of relationships and look for unusual topologies:

```
File A imports: [B, C]
File B imports: [C]
File C imports: [A]  ← circular dependency, structural anomaly

Function X calls: [Y, Z, DB, DB, DB, DB, DB]  ← fanout anomaly to DB layer
```

The detection is universal: "nodes with unusually high fanout," "cycles in directed graph," "clusters with no external connections."
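Both checks from the diagram can be sketched in a few lines. This assumes the dependency graph is a plain object mapping each node to the list of nodes it points at; `findCycle` and `callCounts` are hypothetical names.

```javascript
// Depth-first search with three states; reaching an in-progress node means a cycle
function findCycle(graph) {
  const UNVISITED = 0, IN_PROGRESS = 1, DONE = 2;
  const state = new Map();
  const stack = [];

  function visit(node) {
    state.set(node, IN_PROGRESS);
    stack.push(node);
    for (const next of graph[node] || []) {
      const s = state.get(next) || UNVISITED;
      if (s === IN_PROGRESS) {
        // Return the path from the first occurrence of `next` back to itself
        return [...stack.slice(stack.indexOf(next)), next];
      }
      if (s === UNVISITED) {
        const cycle = visit(next);
        if (cycle) return cycle;
      }
    }
    stack.pop();
    state.set(node, DONE);
    return null;
  }

  for (const node of Object.keys(graph)) {
    if (!state.has(node)) {
      const cycle = visit(node);
      if (cycle) return cycle;
    }
  }
  return null;
}

// How often each target appears in one unit's call list
function callCounts(calls) {
  const counts = {};
  for (const target of calls) counts[target] = (counts[target] || 0) + 1;
  return counts;
}

const imports = { A: ['B', 'C'], B: ['C'], C: ['A'] };
console.log(findCycle(imports)); // [ 'A', 'B', 'C', 'A' ]
console.log(callCounts(['Y', 'Z', 'DB', 'DB', 'DB', 'DB', 'DB'])); // { Y: 1, Z: 1, DB: 5 }
```

`findCycle` returns only the first cycle it encounters; per-target counts like `DB: 5` can be compared against the codebase baseline rather than a fixed limit.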

### 5. Temporal/Historical Anomaly

If you have git history:

```javascript
// Files that change together but aren't colocated = hidden coupling
// Files that changed 50 times in 6 months = hotspot/instability
// Functions that grow 10% every sprint = unbounded complexity creep
```
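A small sketch of the first two signals, assuming the history has already been parsed into one array of file paths per commit (for example from `git log --name-only`). The function names and the sample commits are invented for illustration.

```javascript
function dirname(path) {
  const i = path.lastIndexOf('/');
  return i === -1 ? '' : path.slice(0, i);
}

// commits: array of commits, each an array of changed file paths
function analyzeHistory(commits) {
  const changeCounts = new Map(); // file -> number of commits touching it
  const pairCounts = new Map();   // "a <-> b" -> co-changes across directories

  for (const files of commits) {
    const unique = [...new Set(files)].sort();
    for (const f of unique) changeCounts.set(f, (changeCounts.get(f) || 0) + 1);

    for (let i = 0; i < unique.length; i++) {
      for (let j = i + 1; j < unique.length; j++) {
        // Colocated files changing together is expected, so skip them
        if (dirname(unique[i]) === dirname(unique[j])) continue;
        const key = `${unique[i]} <-> ${unique[j]}`;
        pairCounts.set(key, (pairCounts.get(key) || 0) + 1);
      }
    }
  }
  return { changeCounts, pairCounts };
}

const history = analyzeHistory([
  ['src/api/orders.js', 'src/db/schema.sql'],
  ['src/api/orders.js', 'src/db/schema.sql', 'README.md'],
  ['src/api/orders.js', 'src/api/users.js'],
]);
console.log(history.changeCounts.get('src/api/orders.js')); // 3
console.log(history.pairCounts.get('src/api/orders.js <-> src/db/schema.sql')); // 2
```

Large commits produce a quadratic number of pairs, so a real tool would probably cap commit size or ignore bulk changes such as formatting sweeps.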

## The Abstraction Layer

Here's how I'd architect it language-agnostically:

```
┌─────────────────────────────────────────────────────┐
│                   DETECTION ENGINE                   │
│  (statistical models, graph analysis, entropy calc)  │
│         Knows nothing about any language             │
└─────────────────────┬───────────────────────────────┘
                      │
          Consumes "Universal Code Model"
                      │
┌─────────────────────┴───────────────────────────────┐
│                  ADAPTER LAYER                       │
│   Transforms language → Universal Code Model         │
│                                                      │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐   │
│  │   PHP   │ │   JS    │ │ Python  │ │  YAML   │   │
│  │ Adapter │ │ Adapter │ │ Adapter │ │ Adapter │   │
│  └─────────┘ └─────────┘ └─────────┘ └─────────┘   │
└─────────────────────────────────────────────────────┘
```

The Universal Code Model might look like:

```javascript
const universalCodeModel = {
  units: [  // functions, methods, blocks
    {
      id: 'UserService.getOrders',
      type: 'function',
      metrics: {
        length: 45,
        cyclomaticComplexity: 8,
        nestingDepth: 4,
        paramCount: 3
      },
      calls: ['DB.query', 'Logger.info', 'Cache.get'],
      contains: ['loop', 'conditional', 'try-catch']
    }
  ],
  dependencies: [ /* import/require graph */ ],
  structure: [ /* AST-lite: just shapes */ ]
};
```

Adapters can be as simple as regex-based heuristics or as sophisticated as tree-sitter parsers—the detection engine doesn't care.
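As an illustration of the "simple regex-based heuristics" end of that spectrum, here is a hypothetical adapter that turns the source of one brace-delimited function into a unit. `toUnit` and its definition of `nestingDepth` (brace depth below the function body) are assumptions, and only a subset of the metrics is filled in.

```javascript
const KEYWORDS = new Set(['if', 'for', 'while', 'switch', 'catch', 'function', 'return']);

// id: qualified name such as 'UserService.getOrders'; source: the function's text
function toUnit(id, source) {
  let depth = 0;
  let maxDepth = 0;
  for (const ch of source) {
    if (ch === '{') {
      depth += 1;
      maxDepth = Math.max(maxDepth, depth);
    } else if (ch === '}') {
      depth -= 1;
    }
  }

  const ownName = id.split('.').pop();
  const calls = [...source.matchAll(/([A-Za-z_$][\w$.]*)\s*\(/g)]
    .map(m => m[1])
    // Drop keywords and the declaration itself (this also hides direct recursion)
    .filter(name => !KEYWORDS.has(name) && name !== ownName);

  const contains = [];
  if (/\b(for|while)\b/.test(source)) contains.push('loop');
  if (/\b(if|switch)\b/.test(source)) contains.push('conditional');
  if (/\btry\b/.test(source)) contains.push('try-catch');

  return {
    id,
    type: 'function',
    metrics: {
      length: source.split('\n').length,
      nestingDepth: Math.max(0, maxDepth - 1), // exclude the function's own braces
    },
    calls: [...new Set(calls)],
    contains,
  };
}

const source = [
  'function getOrders(userId) {',
  '  for (const id of ids) {',
  '    if (Cache.get(id)) continue;',
  '    DB.query(sql, id);',
  '  }',
  '}',
].join('\n');
const unit = toUnit('UserService.getOrders', source);
console.log(unit.calls);    // [ 'Cache.get', 'DB.query' ]
console.log(unit.contains); // [ 'loop', 'conditional' ]
```

A tree-sitter-based adapter could emit the same object with more accurate values; the detection engine would not need to change.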

## Why This Is Interesting For Your Toolkit

You're already categorizing rules into layers (regex → PHPStan → runtime). This adds another dimension:

| Approach | What It Catches |
|----------|-----------------|
| Explicit rules | Known antipatterns |
| Statistical anomaly | Unknown antipatterns unique to *this* codebase |
| Structural analysis | Architectural drift, hidden coupling |
| Historical analysis | Risk hotspots, unstable code |

The explicit rules tell you "this is bad." The anomaly detection tells you "this is *weird*—maybe investigate."

## Concrete Next Step?

I could sketch out either:

1. **A minimal "anomaly baseline" scanner** that profiles a codebase and flags statistical outliers—maybe 200 lines of JS
2. **A Universal Code Model schema** that could become the interchange format between your shell layer, JS tools, and potentially PHPStan output

Which direction interests you more?
