A machine learning service for detecting duplicate articles in knowledge bases using semantic similarity search and GPT-based classification.
This service implements a two-stage ML pipeline for automated duplicate detection:
- Semantic Similarity Search: Uses OpenAI embeddings and Approximate Nearest Neighbors (ANN) to find candidate duplicate pairs
- GPT Classification: Uses GPT models (gpt-4o-mini, gpt-4o, gpt-5) for zero-shot classification of candidate pairs as duplicate or non-duplicate
- Data Loading: JSON file loading with schema validation and hierarchical data transformation
- Text Preprocessing: HTML/Markdown cleaning, Unicode normalization, and intelligent chunking
- Vectorization: OpenAI embedding generation with cost tracking
- ANN Indexing: Fast similarity search using Annoy indexes
- Search Engine: Multi-level similarity search across titles, solutions, and text chunks
- Classification: GPT-based duplicate classification with structured output
- Metrics: Performance tracking, cost monitoring, and reporting
- Load Base (`load-base`): Load, preprocess, vectorize, and build indexes for the knowledge base
- Scan Base (`scan-base`): Find duplicates within the existing knowledge base
- Check New (`check-new`): Check new solutions against the existing knowledge base
```bash
# Install dependencies
pip install -r requirements.txt

# Set up environment
export OPENAI_API_KEY="your-api-key-here"

# Optional: set a custom config path
export KB_DEDUP_CONFIG_PATH="path/to/your/config.json"
```

All configuration is managed through `options/config.json` by default.
You can specify a custom config file using the KB_DEDUP_CONFIG_PATH environment variable.
No command-line arguments are needed.
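For illustration, the config-path resolution described above might look like the sketch below. This is a minimal, hypothetical helper (`load_config` is not necessarily the service's actual function name); it only shows the precedence: explicit argument, then `KB_DEDUP_CONFIG_PATH`, then the default.

```python
import json
import os
from typing import Optional

DEFAULT_CONFIG_PATH = "options/config.json"

def load_config(path: Optional[str] = None) -> dict:
    """Resolve the config file path from an explicit argument, the
    KB_DEDUP_CONFIG_PATH environment variable, or the default location,
    then parse it as JSON."""
    resolved = path or os.environ.get("KB_DEDUP_CONFIG_PATH", DEFAULT_CONFIG_PATH)
    with open(resolved, encoding="utf-8") as fh:
        return json.load(fh)
```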
```bash
# Load knowledge base (uses config.json settings)
python main.py load-base

# Scan existing knowledge base for duplicates
python main.py scan-base

# Check new solutions against existing base
python main.py check-new

# View detailed configuration options
python main.py config-help

# View current configuration values
python main.py help-config
```

The service uses `options/config.json` for all configuration parameters. Here's a comprehensive breakdown of each section:
```jsonc
{
  "pipeline": {
    "log_level": "INFO",    // Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL
    "output_dir": "./data"  // Directory where all artifacts are saved
  }
}
```

```jsonc
{
  "preprocessing": {
    "chunk_size": 400,             // Maximum characters per text chunk
    "chunk_overlap": 50,           // Characters to overlap between chunks
    "max_title_length": 1000,      // Maximum length for article titles
    "max_solution_length": 200000, // Maximum length for article content
    "remove_html": true,           // Strip HTML tags from content
    "normalize_unicode": true      // Normalize Unicode characters
  }
}
```

```jsonc
{
  "vectorization": {
    "embedding_model": "text-embedding-3-large", // OpenAI embedding model
    "batch_size": 100,                           // Texts per API call
    "api_timeout": 30,                           // API timeout in seconds
    "max_retries": 3,                            // Retry attempts for failed calls
    "levels": ["title", "solution", "chunks"]    // Text levels to vectorize
  }
}
```

```jsonc
{
  "indexing": {
    "metric": "angular", // Distance metric: angular, euclidean, manhattan, hamming, dot
    "n_trees": 100       // Number of trees (higher = more accurate)
  }
}
```

```jsonc
{
  "search": {
    "default_similarity_threshold": 0.8, // Minimum cosine similarity (0.0-1.0)
    "default_top_k": 5,                  // Number of nearest neighbors
    "default_levels": ["solution"]       // Default text levels to search
  }
}
```

```jsonc
{
  "classification": {
    "enabled": true,    // Enable GPT classification
    "model": "gpt-5",   // OpenAI model: gpt-4o-mini, gpt-4o, gpt-5
    "batch_size": 10,   // Candidate pairs per API call
    "max_retries": 3,   // Retry attempts for failed calls
    "temperature": 0.0, // GPT temperature (0.0 = deterministic)
    "max_tokens": 1000  // Maximum tokens for responses
  }
}
```

```jsonc
{
  "commands": {
    "load_base": {
      "data_path": "sample_kb.json",               // Path to knowledge base JSON file
      "raw_input": false,                          // Whether data needs hierarchical transformation
      "embedding_model": "text-embedding-3-large", // Override embedding model
      "chunk_size": 500,                           // Override chunk size
      "levels": ["title", "solution", "chunks"]    // Override levels to vectorize
    },
    "scan_base": {
      "threshold": 0.8,     // Similarity threshold for candidates
      "top_k": 5,           // Number of nearest neighbors
      "levels": ["chunks"], // Text levels to search
      "overwrite": false    // Overwrite existing result files (true) or add a timestamp (false)
    },
    "check_new": {
      "input": "new_items.json",       // Path to new solutions JSON file
      "threshold": 0.8,                // Similarity threshold for candidates
      "top_k": 5,                      // Number of nearest neighbors
      "levels": ["solution", "title"], // Text levels to search
      "overwrite": true                // Overwrite existing result files (true) or add a timestamp (false)
    }
  }
}
```

Input data in flat format:

```json
[
  {
    "solution_id": "FD-001",
    "title": "How to reset your password",
    "description_text": "Steps to reset password in settings...",
    "category": "Accounts",
    "folder": "Passwords"
  }
]
```

Input data in hierarchical format (see the `raw_input` setting):

```json
[
  {
    "name": "Accounts",
    "solutions": [
      {
        "solutionId": 43000643547,
        "title": "How to reset your password",
        "description_text": "Steps to reset password...",
        "description": "<p>Steps to reset password...</p>"
      }
    ],
    "folders": [
      {
        "name": "Passwords",
        "solutions": [
          {
            "solutionId": 43000643548,
            "title": "Password security tips",
            "description_text": "Best practices for password security..."
          }
        ]
      }
    ]
  }
]
```

The output directory contains the following artifacts:

- `load_base_results.json`: Preprocessed knowledge base data
- `embeddings/`: Vector embeddings for all levels
- `indexes/`: ANN indexes for fast similarity search
- `*_results.json`: Search and classification results
- `*_final_report.json`: Comprehensive processing reports
- `logs/`: Structured logging directory with service-specific logs

Log files:

- `pipeline.log`: Main orchestration and coordination
- `data_loading.log`: Data loading, validation, and transformation
- `preprocessing.log`: Text cleaning, chunking, and normalization
- `vectorization.log`: OpenAI embedding generation and cost tracking
- `indexing.log`: ANN index building and validation
- `search.log`: Similarity search and candidate processing
- `classification.log`: GPT classification and duplicate detection
- `metrics.log`: Performance metrics and cost tracking
- `load-base.log`: Load-base command execution and results
- `check-new.log`: Check-new command execution and results
- `scan-base.log`: Scan-base command execution and results
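The hierarchical input format shown above is flattened into per-solution records before preprocessing. The real logic lives in `src/data_transform.py`; the sketch below is hypothetical and only illustrates the shape of that transformation, using the field names from the sample data.

```python
from typing import Any, Dict, List

def flatten_kb(categories: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten the hierarchical category/folder structure into flat
    per-solution records (illustrative sketch, not the service's code)."""
    flat: List[Dict[str, Any]] = []
    for category in categories:
        # Solutions attached directly to the category (no folder)
        for sol in category.get("solutions", []):
            flat.append({
                "solution_id": str(sol["solutionId"]),
                "title": sol["title"],
                "description_text": sol.get("description_text", ""),
                "category": category["name"],
                "folder": None,
            })
        # Solutions nested inside folders
        for folder in category.get("folders", []):
            for sol in folder.get("solutions", []):
                flat.append({
                    "solution_id": str(sol["solutionId"]),
                    "title": sol["title"],
                    "description_text": sol.get("description_text", ""),
                    "category": category["name"],
                    "folder": folder["name"],
                })
    return flat
```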
All search operations (scan-base, check-new) produce results in the following format:
```jsonc
[
  {
    "id1": "FD-001",                              // First solution ID
    "id2": "FD-002",                              // Second solution ID
    "title1": "How to reset your password",       // First solution title
    "title2": "Password Reset Guide",             // Second solution title
    "category1": "Accounts",                      // First solution category
    "category2": "Accounts",                      // Second solution category
    "folder1": "Passwords",                       // First solution folder
    "folder2": "Passwords",                       // Second solution folder
    "solution1_text": "Steps to reset password...", // First solution content
    "solution2_text": "Guide for password reset...", // Second solution content
    "found_in": ["solution", "title"],            // Text levels where similarity was found
    "similarity": 0.8538628292794584,             // Maximum cosine similarity score (0.0-1.0)
    "duplicate": 1,                               // GPT classification: 1=duplicate, 0=not duplicate, null=not classified
    "reason": "Both articles describe the same password reset process with similar steps and content", // GPT explanation
    "sim_level": "gpt_classification"             // Detection method: "sim_search" or "gpt_classification"
  }
]
```

Field reference:

- `id1`, `id2`: Unique identifiers for the solution pair being compared
- `title1`, `title2`: Article titles for quick identification
- `category1`, `category2`: Solution categories (e.g., "Accounts", "Charts")
- `folder1`, `folder2`: Solution folders for organization
- `solution1_text`, `solution2_text`: Full text content of both solutions
- `found_in`: Array of text levels where similarity was detected
  - `"title"`: Similarity found in article titles
  - `"solution"`: Similarity found in full article content
  - `"chunks"`: Similarity found in text chunks
- `similarity`: Maximum cosine similarity score across all levels (0.0-1.0)
  - `0.0`: No similarity; `1.0`: identical content
  - Values above the threshold (default 0.8) indicate potential duplicates
- `duplicate`: GPT classification result
  - `1`: Confirmed duplicate by GPT
  - `0`: Not a duplicate according to GPT
  - `null`: Classification failed or not performed
- `reason`: GPT's explanation for the classification
  - Empty string if `duplicate` is 0
  - Detailed explanation if `duplicate` is 1
  - Error message if classification failed
- `sim_level`: Indicates how the duplicate was detected
  - `"sim_search"`: Found by similarity search but not confirmed by GPT
    - Used when GPT classifies as non-duplicate, when GPT classification fails, and for error cases
  - `"gpt_classification"`: Found by similarity search AND confirmed by GPT
    - Used when GPT classifies as duplicate
    - Indicates highest-confidence detection
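For quick analysis of a results file in this format, a small stdlib-only helper can tally pairs by `sim_level` and compute the confirmed-duplicate rate. This is an illustrative utility, not part of the service:

```python
import json
from collections import Counter

def summarize_results(path: str) -> dict:
    """Summarize a *_results.json file: counts by sim_level and the
    confirmed-duplicate rate (illustrative helper)."""
    with open(path, encoding="utf-8") as fh:
        rows = json.load(fh)
    by_level = Counter(r.get("sim_level") for r in rows)
    confirmed = sum(1 for r in rows if r.get("duplicate") == 1)
    return {
        "total_candidates": len(rows),
        "by_sim_level": dict(by_level),
        "duplicates_confirmed": confirmed,
        "duplicate_rate": confirmed / len(rows) if rows else 0.0,
    }
```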
Each command generates a comprehensive report with the following structure:
Example `scan-base` report:

```jsonc
{
  "summary": {
    "total_candidates": 133,              // Total candidate pairs found
    "duplicates_confirmed": 10,           // Duplicates confirmed by GPT
    "duplicate_rate": 0.07518796992481203 // Fraction of candidates confirmed as duplicates
  },
  "metrics": {
    "processing_time_seconds": 674.528,   // Total processing time
    "stage_times_seconds": {              // Time breakdown by stage
      "load_base_artifacts": 0.204,
      "similarity_search": 64.105,
      "candidate_processing": 0.003,
      "classification": 610.216
    },
    "duplicates_found": {                 // Duplicate counts by stage
      "similarity_search": 133,           // Found by similarity search
      "gpt_classification": 10            // Confirmed by GPT
    },
    "costs_dollars": {                    // API cost breakdown
      "vectorization": 0.0,               // Embedding costs (0 for scan-base)
      "classification": 0.033684,         // GPT classification costs
      "total": 0.033684                   // Total costs
    },
    "performance": {                      // Performance metrics
      "memory_usage_mb": 209.62,          // Peak memory usage
      "efficiency": {
        "duplicates_per_second": 0.197175,       // Processing speed
        "cost_per_duplicate_dollars": 0.000253   // Cost efficiency
      }
    },
    "timestamp_unix": 1761563577.1429791,        // Unix timestamp
    "timestamp_human": "2025-10-27 14:12:57 UTC" // Human-readable timestamp
  }
}
```

Example `check-new` report:

```jsonc
{
  "summary": {
    "total_candidates": 1,       // Total candidate pairs found
    "duplicates_confirmed": 1,   // Duplicates confirmed by GPT
    "duplicate_rate": 1.0        // Fraction of candidates confirmed as duplicates
  },
  "metrics": {
    "processing_time_seconds": 4.99, // Total processing time
    "stage_times_seconds": {         // Time breakdown by stage
      "load_base_artifacts": 0.189,
      "prepare_new_items": 1.685,    // New items preparation time
      "similarity_search": 0.129,
      "candidate_processing": 0.001,
      "classification": 2.987
    },
    "duplicates_found": {            // Duplicate counts by stage
      "similarity_search": 1,        // Found by similarity search
      "gpt_classification": 1        // Confirmed by GPT
    },
    "costs_dollars": {               // API cost breakdown
      "vectorization": 2.3e-5,       // Embedding costs for new items
      "classification": 7.4e-5,      // GPT classification costs
      "total": 9.7e-5                // Total costs
    },
    "performance": {                 // Performance metrics
      "memory_usage_mb": 563.05,     // Peak memory usage
      "efficiency": {
        "duplicates_per_second": 0.200387,     // Processing speed
        "cost_per_duplicate_dollars": 9.7e-5   // Cost efficiency
      }
    },
    "timestamp_unix": 1761562885.236172,         // Unix timestamp
    "timestamp_human": "2025-10-27 14:01:25 UTC" // Human-readable timestamp
  }
}
```

Example `load-base` report:

```jsonc
{
  "summary": {
    "solutions_count": 2188,      // Total solutions processed
    "processing_time": 45.2,      // Total processing time
    "vectorization_cost": 0.0234, // Embedding generation cost
    "embedding_statistics": {     // Detailed embedding metrics
      "embedding_model": "text-embedding-3-large",
      "levels": ["title", "solution", "chunks"],
      "costs": {
        "total_cost": 0.0234,
        "api_calls": 15,
        "tokens_used": 45678,
        "vectorization_time": 42.1
      },
      "title_count": 2188,
      "title_dimension": 3072,
      "solution_count": 2188,
      "solution_dimension": 3072,
      "chunks_count": 3245,
      "chunks_dimension": 3072
    },
    "index_statistics": {         // ANN index metrics
      "total_indexes": 3,
      "levels": ["title", "solution", "chunks"],
      "metric": "angular",
      "n_trees": 100,
      "title_vectors": 2188,
      "solution_vectors": 2188,
      "chunks_vectors": 3245
    },
    "chunk_statistics": {         // Text chunking metrics
      "total_items": 2188,
      "total_chunks": 3245,
      "avg_chunks_per_item": 1.48,
      "avg_chunk_length": 286.4,
      "min_chunk_length": 142,
      "max_chunk_length": 418
    },
    "results_path": "data/load_base_results.json",
    "report_path": "data/load_base_final_report.json",
    "embeddings_path": "data/embeddings",
    "indexes_path": "data/indexes"
  }
}
```

The `load_base_results.json` file contains the preprocessed knowledge base data:

```jsonc
[
  {
    "solution_id": "43000643547",                          // Unique solution identifier
    "title": "What is economic data?",                     // Original article title
    "description_text": "Economic data are data points...", // Original content
    "category": "Data, Exchanges & Data-related Issues",   // Solution category
    "folder": "General information about economic data",   // Solution folder
    "title_cleaned": "What is economic data?",             // Cleaned title (HTML removed, normalized)
    "description_cleaned": "Economic data are data points...", // Cleaned content
    "chunks": [                                            // Text chunks for detailed analysis
      {
        "chunk_id": "43000643547_chunk_0",           // Unique chunk identifier
        "text": "Economic data are data points...",  // Chunk text content
        "start_pos": 0,   // Start position in original text
        "end_pos": 304,   // End position in original text
        "chunk_index": 0  // Chunk index within the solution
      }
    ]
  }
]
```

Field reference:

- `solution_id`: Unique identifier for the solution
- `title`: Original article title from input data
- `description_text`: Original article content from input data
- `category`: Solution category for organization
- `folder`: Solution folder for organization
- `title_cleaned`: Title after HTML removal and Unicode normalization
- `description_cleaned`: Content after HTML removal and Unicode normalization
- `chunks`: Array of text chunks created by intelligent splitting
  - `chunk_id`: Unique identifier combining the solution ID and chunk index
  - `text`: The actual chunk text content
  - `start_pos`: Character position where the chunk starts in the original text
  - `end_pos`: Character position where the chunk ends in the original text
  - `chunk_index`: Sequential index of the chunk within the solution
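To make the chunk fields concrete, here is a simplified sliding-window chunker that emits records in the same shape. It is a sketch only: the actual pipeline uses `RecursiveCharacterTextSplitter`, which respects sentence boundaries, so real chunk positions will differ.

```python
from typing import Dict, List

def chunk_solution(solution_id: str, text: str,
                   chunk_size: int = 400, overlap: int = 50) -> List[Dict]:
    """Naive sliding-window chunker producing records shaped like the
    `chunks` entries in load_base_results.json (illustrative sketch)."""
    chunks: List[Dict] = []
    step = chunk_size - overlap
    for i, start in enumerate(range(0, max(len(text), 1), step)):
        piece = text[start:start + chunk_size]
        if not piece:
            break
        chunks.append({
            "chunk_id": f"{solution_id}_chunk_{i}",
            "text": piece,
            "start_pos": start,
            "end_pos": start + len(piece),
            "chunk_index": i,
        })
        if start + chunk_size >= len(text):
            break
    return chunks
```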
```text
kb-deduplication/
├── src/                    # Source code
│   ├── dataload.py         # Data loading and validation
│   ├── data_transform.py   # Hierarchical data transformation
│   ├── preprocessing.py    # Text cleaning and chunking
│   ├── vectorization.py    # OpenAI embedding generation
│   ├── indexing.py         # ANN index building
│   ├── search.py           # Similarity search engine
│   ├── classification.py   # GPT-based classification
│   ├── metrics.py          # Performance metrics
│   ├── pipeline.py         # Main orchestrator
│   ├── models.py           # Pydantic data models
│   └── utils.py            # Utility functions
├── options/
│   └── config.json         # Configuration file
├── main.py                 # CLI entry point
├── requirements.txt        # Dependencies
└── README.md               # Documentation
```
The service supports similarity search across multiple text levels:
- Title Level: Article titles for quick matching
- Solution Level: Full article content for comprehensive matching
- Chunk Level: Text chunks for detailed content analysis
When candidates are found on multiple levels, the service:
- Combines all matching levels in the `found_in` field
- Takes the maximum similarity score across levels
- Provides comprehensive duplicate detection
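That merging step can be sketched as follows. This is a hypothetical helper: field names follow the results format above, and a per-candidate `level` field is assumed for raw search hits.

```python
from typing import Dict, List, Tuple

def merge_candidates(candidates: List[Dict]) -> List[Dict]:
    """Merge candidate pairs found on multiple levels: union the levels
    into `found_in` and keep the maximum similarity (illustrative sketch)."""
    merged: Dict[Tuple[str, str], Dict] = {}
    for cand in candidates:
        # Normalize pair order so (A, B) and (B, A) merge into one entry
        key = tuple(sorted((cand["id1"], cand["id2"])))
        entry = merged.setdefault(key, {
            "id1": key[0], "id2": key[1],
            "found_in": [], "similarity": 0.0,
        })
        if cand["level"] not in entry["found_in"]:
            entry["found_in"].append(cand["level"])
        entry["similarity"] = max(entry["similarity"], cand["similarity"])
    return list(merged.values())
```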
Uses `RecursiveCharacterTextSplitter` for optimal text segmentation:
- Respects sentence and paragraph boundaries
- Reduces chunk count by 90%+ compared to naive splitting
- Improves semantic coherence of text chunks
Real-time monitoring of API costs:
- Vectorization costs (OpenAI embeddings)
- Classification costs (GPT API calls)
- Detailed cost breakdowns in reports
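As a rough illustration of how such dollar figures are derived from token counts (the per-million-token price below is an assumption for `text-embedding-3-large`; check current OpenAI pricing before relying on it):

```python
def embedding_cost_dollars(tokens: int, price_per_million: float = 0.13) -> float:
    """Estimate embedding cost: tokens consumed times a per-million-token
    price (0.13 USD/1M tokens is an assumed price, not authoritative)."""
    return tokens / 1_000_000 * price_per_million
```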
Comprehensive progress indicators:
- Vectorization progress for each level
- Search progress for candidate processing
- Classification progress for each pair
- Real-time speed and ETA information
Flexible file overwrite behavior:
- load-base: Always overwrites existing files (no timestamping)
- scan-base: Uses the `overwrite` setting: `true` overwrites, `false` adds a timestamp
- check-new: Uses the `overwrite` setting: `true` overwrites, `false` adds a timestamp
- Timestamped files use the format `filename_YYYY-MM-DD_HH-MM-SS.json`
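The overwrite behavior can be pictured with a small helper (hypothetical; the service's actual naming logic may differ):

```python
from datetime import datetime
from pathlib import Path

def resolve_output_path(path: str, overwrite: bool) -> str:
    """Return `path` unchanged when overwrite=True; otherwise append a
    timestamp in the filename_YYYY-MM-DD_HH-MM-SS.json style."""
    if overwrite:
        return path
    p = Path(path)
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return str(p.with_name(f"{p.stem}_{stamp}{p.suffix}"))
```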
The `sim_level` field indicates the confidence level of duplicate detection:

- Similarity Search Stage:
  - Uses OpenAI embeddings and ANN indexes to find candidate pairs
  - Filters by similarity threshold (default 0.8)
  - All candidates found here get `sim_level: "sim_search"`
- GPT Classification Stage:
  - GPT analyzes candidate pairs for semantic similarity
  - Provides binary classification (duplicate/not duplicate) with reasoning
  - Updates `sim_level` based on GPT's decision

The two values mean:

- `"sim_search"`: Lower-confidence detection
  - Found by similarity search but GPT classified as non-duplicate
  - GPT classification failed due to API errors
  - Classification parsing errors occurred
  - These are potential false positives from similarity search
- `"gpt_classification"`: Higher-confidence detection
  - Found by similarity search AND confirmed by GPT
  - GPT provided detailed reasoning for the duplicate classification
  - These have the highest confidence and should be prioritized for review
When analyzing results:

- High Priority: `sim_level: "gpt_classification"` with `duplicate: 1`
  - These are confirmed duplicates with GPT reasoning
  - Review the `reason` field for GPT's explanation
- Medium Priority: `sim_level: "sim_search"` with `duplicate: 0`
  - These were flagged by similarity search but GPT disagreed
  - May indicate similar but distinct content
  - Good candidates for manual review
- Investigation Needed: `sim_level: "sim_search"` with `duplicate: null`
  - Classification failed due to technical issues
  - Manual review recommended
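The review guidance above can be captured in a tiny triage helper (illustrative only, not part of the service):

```python
def triage(result: dict) -> str:
    """Bucket a result row according to the review priorities above."""
    sim_level = result.get("sim_level")
    duplicate = result.get("duplicate")
    if sim_level == "gpt_classification" and duplicate == 1:
        return "high_priority"        # confirmed duplicate with GPT reasoning
    if sim_level == "sim_search" and duplicate == 0:
        return "medium_priority"      # flagged by search, GPT disagreed
    if sim_level == "sim_search" and duplicate is None:
        return "needs_investigation"  # classification failed
    return "other"
```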
- The current approach does not support batch processing for the GPT API
- Python 3.8+
- OpenAI API key
- Required packages: `openai`, `pandas`, `numpy`, `annoy`, `tqdm`, `click`, `tiktoken`, `psutil`, `langchain`, `pydantic`
- Missing API Key: Set the `OPENAI_API_KEY` environment variable
- Memory Issues: Reduce `batch_size` or use smaller embedding models for large datasets
- Slow Processing: Reduce `batch_size` or increase `n_trees`
- High Costs: Use smaller embedding models or reduce search levels
The service uses structured logging similar to monitoring systems like Zabbix. All logs are saved in the logs/ subdirectory within the output directory.
Each log entry follows a structured key-value format:
Service start example:

```text
timestamp=2025-10-27 17:35:10 | level=INFO | service=data_loading | message=Starting data_loading service | action=service_start | description=Data loading, validation, and transformation | log_file=data_loading.log | total_items=2188
```

Performance metrics example:

```text
timestamp=2025-10-27 18:06:14 | level=INFO | service=metrics | message=Performance metrics for load-base | action=metrics | metrics={"total_time": 1.344118356704712, "stage_times": {"data_loading": 0.0002949237823486328, "preprocessing": 0.0035767555236816406, "vectorization": 1.3333508968353271, "indexing": 0.006895780563354492}, "duplicates_sim_found": 0, "duplicates_gpt_found": 0, "total_processing_cost": 8.814e-05, "memory_usage": 178.78125}
```

Endpoint completion example:

```text
timestamp=2025-10-27 18:06:14 | level=INFO | service=load-base | message=Completed load-base endpoint | action=endpoint_end | endpoint=load-base | solutions_count=5 | processing_time=1.344118356704712 | vectorization_cost=8.814e-05 | results_path=data/load_base_results.json | embeddings_path=data/embeddings | indexes_path=data/indexes | report_path=data/load_base_final_report.json
```
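A formatter producing this pipe-separated key=value shape might look like the sketch below (this is not the service's actual logger):

```python
from datetime import datetime

def format_log_line(level: str, service: str, message: str, **fields) -> str:
    """Render a structured log line in the pipe-separated key=value
    format shown in the examples above (illustrative sketch)."""
    parts = [
        f"timestamp={datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        f"level={level}",
        f"service={service}",
        f"message={message}",
    ]
    # Extra context fields are appended in the order given
    parts.extend(f"{key}={value}" for key, value in fields.items())
    return " | ".join(parts)
```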
- `pipeline.log`: Main orchestration, pipeline start/end, high-level operations
- `data_loading.log`: Data loading, validation, transformation operations
- `preprocessing.log`: Text cleaning, chunking, normalization operations
- `vectorization.log`: Embedding generation, API calls, cost tracking
- `indexing.log`: ANN index building, validation, performance metrics
- `search.log`: Similarity search, candidate processing, search operations
- `classification.log`: GPT classification, duplicate detection, reasoning
- `metrics.log`: Performance metrics with operation-specific messages (e.g., "Performance metrics for load-base"), cost tracking, efficiency measurements
- `load-base.log`: Complete load-base command execution with configuration, results, and artifacts
- `check-new.log`: Complete check-new command execution with input, processing, and results
- `scan-base.log`: Complete scan-base command execution with parameters, processing, and results
The logging system uses different message types for different purposes:
- Service Messages: `Starting/Completed [service] service` - internal pipeline stages
- Operation Messages: `[operation]` - specific operations within services
- Metrics Messages: `Performance metrics for [operation]` - performance data with operation context
- Endpoint Messages: `Starting/Completed [endpoint] endpoint` - command-level execution tracking
- Error Messages: `Error in [service]: [error]` - error handling with context
Enable detailed logging by setting `log_level: "DEBUG"` in the configuration. Available levels:

- `DEBUG`: Detailed debugging information
- `INFO`: General information about operations
- `WARNING`: Warning messages for potential issues
- `ERROR`: Error messages with context
- `CRITICAL`: Critical errors that may stop execution