Isolation Forest for Semiconductor Wafer Defect Detection

Implementation of the Isolation Forest algorithm by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou, applied to semiconductor manufacturing quality control.

Overview

This project applies unsupervised anomaly detection to identify defective semiconductor wafers from electrical test data. The Isolation Forest algorithm learns patterns from normal wafers and flags deviations as potential defects.

Key Features:

🎯 92.3% defect detection rate
✅ 94% specificity (low false alarms)
📊 19 engineered spatial features from wafer maps
🔬 Per-defect-type performance analysis
🧪 Comprehensive test suite (95% pass rate)
📚 Production-ready with full documentation

Algorithm Approach

Isolation Forest scores each sample by how easily random splits isolate it from the rest of the data:

Isolation principle: Anomalies are few and different, thus easier to isolate
Random partitioning: Build trees that isolate samples via random splits
Path length: Anomalies require fewer splits to isolate (shorter paths)
Normal-only training: Fit on normal samples without using defect labels, then flag deviations as anomalies (a novelty-detection setup)
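
As a rough illustration of the points above, the sketch below follows one sample through a series of random splits and averages the number of splits needed to isolate it. It is not the repository's `iforest.py`: the function names, the toy 2-D data, and the depth cap are assumptions made for illustration only.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count random splits until `point` sits alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))              # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                                 # simplification: stop if this feature cannot split
        return depth
    split = rng.uniform(lo, hi)                  # pick a random split value
    # Keep only the samples that fall on the same side as `point`.
    side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def mean_depth(point, data, trials=200, seed=0):
    """Average isolation depth over many independent random paths."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(trials)) / trials

# Toy data (hypothetical): a tight cluster plus one far-away point.
_rng = random.Random(42)
cluster = [(_rng.gauss(0, 1), _rng.gauss(0, 1)) for _ in range(200)]
outlier = (10.0, 10.0)
data = cluster + [outlier]

outlier_depth = mean_depth(outlier, data)
inlier_depth = mean_depth(cluster[0], data)
```

On this toy data the far-away point should need noticeably fewer splits than a point inside the cluster, which is the "shorter path" behaviour the algorithm relies on.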

Why it works for semiconductors:

Defective wafers have distinct spatial failure patterns
Normal wafers cluster tightly in feature space
Efficient for high-dimensional manufacturing data
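
For reference, the original Isolation Forest paper turns the average path length E[h(x)] into a score s = 2^(-E[h(x)] / c(n)), where c(n) is the average path length of an unsuccessful binary-search-tree lookup over n samples. The sketch below assumes the anomaly scores reported in this README follow that definition; the repository's `iforest.py` may differ in details, and the function names and the subsample size of 256 are illustrative assumptions.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA     # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Shorter average paths give scores closer to 1; longer paths give lower scores."""
    return 2.0 ** (-mean_path_length / c(n))

def relative_path_length(score):
    """Invert the score: E[h(x)] / c(n) = -log2(score)."""
    return -math.log2(score)

n = 256                                          # assumed subsample size
average_case = anomaly_score(c(n), n)            # a path exactly as long as c(n)
normal_ratio = relative_path_length(0.43)        # typical normal-wafer score in this README
defect_ratio = relative_path_length(0.58)        # typical defective-wafer score in this README
```

Under this assumption, a score of 0.5 corresponds to an average-length path, a score of 0.43 to a path about 1.22 times c(n), and a score of 0.58 to a path about 0.79 times c(n).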

Quick Start

Option 1: CLI Interface (Recommended)

```bash
# 1. Clone and setup
git clone https://github.com/Anudeepreddynarala/semiconductor-yield-analyzer.git
cd semiconductor-yield-analyzer
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# 2. Run complete pipeline
python wafer_cli.py --pipeline

# Or run individual steps:
python wafer_cli.py --extract-features  # Extract features
python wafer_cli.py --train             # Train model
python wafer_cli.py --analyze           # Analyze defects
python wafer_cli.py --metrics           # Calculate Cpk & Yield
python wafer_cli.py --test              # Run tests

# View help
python wafer_cli.py --help
```

Option 2: Manual Steps

```bash
# 1. Setup
git clone https://github.com/Anudeepreddynarala/semiconductor-yield-analyzer.git
cd semiconductor-yield-analyzer
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# 2. Download dataset
# Place Mixed-Type Wafer Defect Dataset in archive/ directory
# From: https://www.kaggle.com/datasets/co1d7era/mixedtype-wafer-defect-datasets

# 3. Extract features
python extract_wafer_features.py

# 4. Train and evaluate
python wafer_anomaly_detection_v2.py

# 5. Analyze per-defect-type
python analyze_defect_types.py

# 6. Calculate quality metrics
python calculate_metrics.py

# 7. Run tests
pytest tests/ -v
```

Dataset

Mixed-Type Wafer Defect Dataset (Kaggle)

38,015 wafer maps (52×52 pixel grids)
1,000 normal wafers (2.6%)
37,015 defective wafers (97.4%)
8 basic defect types: Center, Donut, Edge_Loc, Edge_Ring, Loc, Near_Full, Scratch, Random
29 mixed-type defects (combinations of 2-4 basic types)

Approach

Since wafer maps are images (not direct sensor readings), we extract 19 spatial features from each wafer:

Feature Categories:

Basic counts: total dies, failed dies, blank spots
Failure metrics: failure rate, failure concentration
Spatial distribution: center of mass, distance from wafer center
Spread metrics: mean/std/max distance between failures
Radial features: radius statistics, edge vs center concentration
Clustering: number of failure clusters, average cluster size
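
A minimal sketch of how a few of these feature categories could be computed from one wafer map follows. It is not the code in `extract_wafer_features.py`: the die encoding (0 = blank, 1 = pass, 2 = fail), the feature names, and the 4-connected cluster definition are all assumptions.

```python
import math
from collections import deque

BLANK, PASS, FAIL = 0, 1, 2  # assumed die encoding

def count_clusters(fail_cells):
    """Count 4-connected groups of failed dies with a breadth-first flood fill."""
    remaining = set(fail_cells)
    clusters = 0
    while remaining:
        clusters += 1
        queue = deque([remaining.pop()])
        while queue:
            r, c = queue.popleft()
            for neighbor in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    queue.append(neighbor)
    return clusters

def spatial_features(wafer):
    """Compute a handful of illustrative features from a 2-D wafer map."""
    rows, cols = len(wafer), len(wafer[0])
    fails = [(r, c) for r in range(rows) for c in range(cols) if wafer[r][c] == FAIL]
    blanks = sum(row.count(BLANK) for row in wafer)
    total_dies = rows * cols - blanks
    feats = {
        "total_dies": total_dies,
        "failed_dies": len(fails),
        "blank_spots": blanks,
        "failure_rate": len(fails) / total_dies if total_dies else 0.0,
        "n_clusters": count_clusters(fails),
    }
    if fails:
        # Center of mass of the failed dies and its distance from the map center.
        com_r = sum(r for r, _ in fails) / len(fails)
        com_c = sum(c for _, c in fails) / len(fails)
        feats["com_dist_from_center"] = math.hypot(com_r - (rows - 1) / 2,
                                                   com_c - (cols - 1) / 2)
    else:
        feats["com_dist_from_center"] = 0.0
    return feats

# Tiny hypothetical 4x4 map (real maps in the dataset are 52x52).
wafer = [
    [0, 1, 1, 0],
    [1, 2, 2, 1],
    [1, 1, 1, 1],
    [0, 1, 2, 0],
]
feats = spatial_features(wafer)
```

For this toy map the function reports 12 dies, 3 failures in 2 clusters, and a failure rate of 0.25.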

Training Strategy

Key insight: With defects being 97.4% of data, we train on normal wafers only to learn what "good" looks like, then detect defects as anomalies.

```bash
# Train on 700 normal wafers → Test on 300 normal + 37,015 defective
python wafer_anomaly_detection_v2.py
```
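
The split described above can be sketched as follows. This is an assumption about the logic, not the code in `wafer_anomaly_detection_v2.py`; the function name, the fixed seed, and the boolean label list are hypothetical.

```python
import random

def split_normal_only(is_defect, n_train=700, seed=0):
    """Return (train_idx, test_idx): train on normal wafers only, test on the rest."""
    normal = [i for i, flag in enumerate(is_defect) if not flag]
    defect = [i for i, flag in enumerate(is_defect) if flag]
    rng = random.Random(seed)
    rng.shuffle(normal)                           # random subset of normal wafers for training
    train_idx = normal[:n_train]
    test_idx = normal[n_train:] + defect          # held-out normal wafers plus every defect
    return train_idx, test_idx

# Label layout matching the dataset sizes quoted in this README.
is_defect = [False] * 1_000 + [True] * 37_015
train_idx, test_idx = split_normal_only(is_defect)
```

No defective wafer ever reaches the training set, so the model only sees what "good" looks like.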

Results

Overall Performance (100 trees, 90% TPR target):

✅ 92.3% of defects detected (34,155 / 37,015)
✅ 94% specificity (282 / 300 normal wafers correctly identified)
⚠️ 6% false positive rate (18 good wafers flagged)
❌ 2,860 defects missed (underkill - high cost risk!)
F1 Score: 0.9596
Average Precision: 0.9998
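
The rates above follow directly from the four confusion-matrix counts. The sketch below recomputes them; the function name is illustrative, and Average Precision is left out because it needs the per-wafer scores rather than counts.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Derive the headline rates from confusion-matrix counts (defect = positive class)."""
    recall = tp / (tp + fn)                       # defect detection rate
    specificity = tn / (tn + fp)                  # normal wafers correctly passed
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
        "precision": precision,
        "f1": f1,
    }

# Counts taken from the results listed above.
metrics = confusion_metrics(tp=34_155, fn=2_860, tn=282, fp=18)
```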

Manufacturing Impact:

Overkill: 18 good wafers scrapped unnecessarily (~$90K cost)
Underkill: 2,860 bad wafers shipped (~$143M cost - ⚠️ HIGH RISK)
Cost consideration: Underkill >> Overkill in semiconductor industry
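
The dollar figures above imply a flat cost of roughly $5K per good wafer scrapped and $50K per bad wafer shipped. The sketch below makes that arithmetic explicit, assuming those per-wafer costs are constant; the function name and the alternative operating point are hypothetical and are not results from this project.

```python
# Per-wafer costs implied by the figures above (assumed constant per wafer).
OVERKILL_COST_PER_WAFER = 90_000 / 18             # about $5K per good wafer scrapped
UNDERKILL_COST_PER_WAFER = 143_000_000 / 2_860    # about $50K per bad wafer shipped

def misclassification_cost(false_positives, false_negatives):
    """Total cost of overkill plus underkill for one operating point."""
    return (false_positives * OVERKILL_COST_PER_WAFER
            + false_negatives * UNDERKILL_COST_PER_WAFER)

reported = misclassification_cost(18, 2_860)

# Hypothetical operating point: a lower threshold with more false alarms, fewer escapes.
alternative = misclassification_cost(60, 1_000)

cost_ratio = UNDERKILL_COST_PER_WAFER / OVERKILL_COST_PER_WAFER
```

With a tenfold cost ratio, accepting a few more false alarms to catch more defects can reduce the total cost, which is why the threshold choice matters more than the headline accuracy.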

Quality Metrics (Cpk & Yield)

Yield Analysis:

Actual Yield: 2.63% (1,000/38,015 normal wafers)
Predicted Yield: 10.17%
Overkill Rate: 8.80% (88 false alarms)
Underkill Rate: 7.98% (2,860 escapes)

Cpk (Process Capability Index):

```bash
python calculate_metrics.py
```

Generates comprehensive report including:

Cpk calculations for failure rate and defect counts
Sigma level assessment
Process capability interpretation
Cost impact analysis
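
For readers unfamiliar with Cpk, the standard definition is Cpk = min((USL - mean) / 3σ, (mean - LSL) / 3σ), using whichever specification limits exist. The sketch below implements that textbook formula; it is not the code in `calculate_metrics.py`, and the sample values and specification limits are hypothetical.

```python
import statistics

def cpk(values, usl=None, lsl=None):
    """Process capability index for one- or two-sided specification limits."""
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)              # sample standard deviation
    margins = []
    if usl is not None:
        margins.append((usl - mean) / (3 * sigma))
    if lsl is not None:
        margins.append((mean - lsl) / (3 * sigma))
    if not margins:
        raise ValueError("at least one specification limit is required")
    return min(margins)

# Hypothetical measurements and limits, chosen so the arithmetic is easy to follow.
measurements = [9.0, 10.0, 11.0]                  # mean 10, sample sigma 1
two_sided = cpk(measurements, usl=14.0, lsl=6.0)  # limits sit 4 sigma from the mean
one_sided = cpk(measurements, usl=13.0)           # e.g. an upper limit on failure rate
sigma_level = 3 * two_sided
```

Here the nearest limit is four standard deviations from the mean, giving Cpk of about 1.33.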

Note: Dataset intentionally skewed toward defects (97.4%) for detection research. Production processes would target Cpk ≥ 1.33 (Four Sigma) with <5% defect rates.

Per-Defect-Type Performance

| Defect Type | Count | Avg Anomaly Score | Detection @ 0.50 threshold |
|---|---|---|---|
| Random | 866 | 0.5894 | 100.0% |
| Center | 13,000 | 0.5865 | 98.0% |
| Scratch | 19,000 | 0.5845 | 94.7% |
| Loc | 18,000 | 0.5838 | 97.0% |
| Edge_Loc | 13,000 | 0.5836 | 97.7% |
| Edge_Ring | 12,000 | 0.5772 | 96.5% |
| Donut | 12,000 | 0.5742 | 94.8% |
| Near_Full | 149 | 0.5663 | 100.0% |
| NORMAL | 300 | 0.4314 | N/A |

Key Findings:

All defect types score significantly higher than normal (0.43)
Random and Center defects are most distinguishable
Near_Full has the fewest samples (149) and the lowest average defect score (0.5663), yet still reaches a 100% detection rate
At threshold 0.50, every defect type achieves 94.7-100% detection
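
A per-type detection rate like the one tabulated above can be computed along these lines. The sketch assumes that a mixed-type wafer counts toward each basic type it contains (the per-type counts sum to more than the number of defective wafers) and that a wafer is flagged when its score is at or above the threshold; the function name and toy data are hypothetical, and `analyze_defect_types.py` may do this differently.

```python
from collections import defaultdict

def detection_rate_by_type(scores, defect_types, threshold=0.50):
    """Fraction of wafers flagged (score >= threshold) for each basic defect type."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for score, kinds in zip(scores, defect_types):
        for kind in kinds:                        # a mixed-type wafer counts once per type
            total[kind] += 1
            flagged[kind] += int(score >= threshold)
    return {kind: flagged[kind] / total[kind] for kind in total}

# Toy example (hypothetical scores, not values from the dataset).
scores = [0.59, 0.52, 0.48, 0.61]
defect_types = [("Center",), ("Center", "Scratch"), ("Scratch",), ("Donut",)]
rates = detection_rate_by_type(scores, defect_types)
```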

Visualizations

1. Anomaly Score Distributions

Trained on normal wafers only, the model clearly separates normal from defective wafers (plot: wafer_trained_on_normal-100trees-90TPR.svg):

Key Observations:

Normal wafers cluster around score 0.43 (green)
Defective wafers cluster around 0.58+ (red)
Clear separation with 90% TPR threshold
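
One plausible way to derive a threshold from a TPR target is to take the highest score cut-off that still flags the target fraction of defective wafers. The sketch below shows that idea; how this project actually picks its 90% TPR threshold is not stated here, so the function name, the selection rule, and the toy scores are assumptions.

```python
import math

def threshold_for_tpr(defect_scores, target_tpr=0.90):
    """Highest threshold such that at least `target_tpr` of defects score at or above it."""
    ranked = sorted(defect_scores, reverse=True)
    must_catch = math.ceil(target_tpr * len(ranked))   # number of defects to flag
    return ranked[must_catch - 1]

# Toy defect scores (hypothetical).
defect_scores = [0.60, 0.59, 0.58, 0.57, 0.56, 0.55, 0.54, 0.53, 0.52, 0.51]
threshold = threshold_for_tpr(defect_scores)
achieved_tpr = sum(s >= threshold for s in defect_scores) / len(defect_scores)
```

In practice the threshold would be chosen on one set of wafers and then evaluated on another, so the achieved detection rate can differ from the target.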

2. Per-Defect-Type Performance

Key Observations:

All defect types score higher than normal baseline (0.43)
Random defects most detectable (0.589)
Consistent performance across defect types (94-100% detection)

Running Analysis

Generate visualizations yourself:

```bash
python analyze_defect_types.py
```

Outputs:

Bar chart of average anomaly scores by defect type
Box plots showing score distributions
Detection rate tables at different thresholds

Files

Data Processing:

extract_wafer_features.py - Extract spatial features from wafer maps
wafer_features.csv - 19 engineered features for 38,015 wafers
wafer_labels.csv - Defect type labels (one-hot encoded)
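
Because mixed-type wafers carry more than one defect, each label row can have several flags set. A minimal sketch of decoding one row follows, assuming eight columns in the order the basic defect types are listed in the Dataset section and an all-zero row for normal wafers; the actual column order in `wafer_labels.csv` should be checked before relying on this.

```python
# Assumed column order; verify against wafer_labels.csv.
BASIC_TYPES = ["Center", "Donut", "Edge_Loc", "Edge_Ring",
               "Loc", "Near_Full", "Scratch", "Random"]

def decode_label(row):
    """Map an 8-flag label row to the list of basic defect types it contains."""
    names = [name for name, flag in zip(BASIC_TYPES, row) if int(flag) == 1]
    return names or ["Normal"]                    # no flags set means a normal wafer

normal = decode_label([0, 0, 0, 0, 0, 0, 0, 0])
mixed = decode_label([1, 0, 0, 0, 0, 0, 1, 0])
```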

Analysis Scripts:

wafer_anomaly_detection_v2.py - Main detection script (train on normal only)
analyze_defect_types.py - Per-defect-type performance analysis

Visualizations:

wafer_trained_on_normal-100trees-90TPR.svg - Score distributions
defect_type_analysis.svg - Per-defect-type comparison

Citation

Dataset: Wang et al., "Deformable Convolutional Networks for Efficient Mixed-type Wafer Defect Pattern Recognition," IEEE Transactions on Semiconductor Manufacturing, 2020.
