Knowledge Base Deduplication Service

A machine learning service for detecting duplicate articles in knowledge bases using semantic similarity search and GPT-based classification.

Overview

This service implements a two-stage ML pipeline for automated duplicate detection:

  1. Semantic Similarity Search: Uses OpenAI embeddings and Approximate Nearest Neighbors (ANN) to find candidate duplicate pairs
  2. GPT Classification: Uses GPT models (gpt-4o-mini, gpt-4o, gpt-5) for zero-shot classification of candidate pairs as duplicate or non-duplicate
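The two stages above can be sketched in a few lines. This is a simplified illustration, not the service's code: it uses brute-force cosine similarity with NumPy as a stand-in for the Annoy ANN index, and the `find_candidates` helper name is made up for the example.

```python
import numpy as np

def find_candidates(embeddings, ids, threshold=0.8, top_k=5):
    """Stage 1 sketch: brute-force cosine similarity as a stand-in
    for the ANN index. Each surviving pair would then be sent to
    stage 2 (GPT classification)."""
    # Normalize rows so the dot product equals cosine similarity
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = norms @ norms.T
    candidates = []
    for i in range(len(ids)):
        # Top-k most similar items, excluding the item itself
        order = np.argsort(-sims[i])
        neighbors = [j for j in order if j != i][:top_k]
        for j in neighbors:
            # j > i keeps each pair once
            if j > i and sims[i, j] >= threshold:
                candidates.append((ids[i], ids[j], float(sims[i, j])))
    return candidates

# Toy embeddings: first two rows nearly identical, third unrelated
vecs = np.array([[1.0, 0.0, 0.0],
                 [0.99, 0.1, 0.0],
                 [0.0, 0.0, 1.0]])
pairs = find_candidates(vecs, ["FD-001", "FD-002", "FD-003"])
```

Only the (FD-001, FD-002) pair clears the 0.8 threshold; in the real pipeline that pair would be handed to the GPT classifier for a duplicate/non-duplicate verdict.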

Architecture

Core Components

  • Data Loading: JSON file loading with schema validation and hierarchical data transformation
  • Text Preprocessing: HTML/Markdown cleaning, Unicode normalization, and intelligent chunking
  • Vectorization: OpenAI embedding generation with cost tracking
  • ANN Indexing: Fast similarity search using Annoy indexes
  • Search Engine: Multi-level similarity search across titles, solutions, and text chunks
  • Classification: GPT-based duplicate classification with structured output
  • Metrics: Performance tracking, cost monitoring, and reporting

Pipeline Workflows

  1. Load Base (load-base): Load, preprocess, vectorize, and build indexes for knowledge base
  2. Scan Base (scan-base): Find duplicates within existing knowledge base
  3. Check New (check-new): Check new solutions against existing knowledge base

Installation

# Install dependencies
pip install -r requirements.txt

# Set up environment
export OPENAI_API_KEY="your-api-key-here"

# Optional: Set custom config path
export KB_DEDUP_CONFIG_PATH="path/to/your/config.json"

Usage

CLI Commands

All configuration is managed through options/config.json by default. You can specify a custom config file using the KB_DEDUP_CONFIG_PATH environment variable. No command-line arguments are needed.

# Load knowledge base (uses config.json settings)
python main.py load-base

# Scan existing knowledge base for duplicates
python main.py scan-base

# Check new solutions against existing base
python main.py check-new

# View detailed configuration options
python main.py config-help

# View current configuration values
python main.py help-config

Configuration

The service uses options/config.json for all configuration parameters. Here's a comprehensive breakdown of each section:

Pipeline Configuration

{
  "pipeline": {
    "log_level": "INFO", // Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL
    "output_dir": "./data" // Directory where all artifacts are saved
  }
}

Preprocessing Configuration

{
  "preprocessing": {
    "chunk_size": 400, // Maximum characters per text chunk
    "chunk_overlap": 50, // Characters to overlap between chunks
    "max_title_length": 1000, // Maximum length for article titles
    "max_solution_length": 200000, // Maximum length for article content
    "remove_html": true, // Strip HTML tags from content
    "normalize_unicode": true // Normalize Unicode characters
  }
}

Vectorization Configuration

{
  "vectorization": {
    "embedding_model": "text-embedding-3-large", // OpenAI embedding model
    "batch_size": 100, // Texts per API call
    "api_timeout": 30, // API timeout in seconds
    "max_retries": 3, // Retry attempts for failed calls
    "levels": ["title", "solution", "chunks"] // Text levels to vectorize
  }
}

Indexing Configuration

{
  "indexing": {
    "metric": "angular", // Distance metric: angular, euclidean, manhattan, hamming, dot
    "n_trees": 100 // Number of trees (higher = more accurate)
  }
}

Search Configuration

{
  "search": {
    "default_similarity_threshold": 0.8, // Minimum cosine similarity (0.0-1.0)
    "default_top_k": 5, // Number of nearest neighbors
    "default_levels": ["solution"] // Default text levels to search
  }
}
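Note that the similarity threshold is expressed as cosine similarity, while an index built with the angular metric reports distances. Annoy's angular distance is defined as d = sqrt(2 · (1 − cos)), so converting back is straightforward; the helper name below is just for illustration:

```python
import math

def angular_to_cosine(d):
    """Convert Annoy's 'angular' distance to cosine similarity.
    Annoy defines d = sqrt(2 * (1 - cos)), so cos = 1 - d**2 / 2."""
    return 1.0 - d * d / 2.0

# Identical vectors:  d = 0        -> similarity 1.0
# Orthogonal vectors: d = sqrt(2)  -> similarity 0.0
```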

Classification Configuration

{
  "classification": {
    "enabled": true, // Enable GPT classification
    "model": "gpt-5", // OpenAI model: gpt-4o-mini, gpt-4o, gpt-5
    "batch_size": 10, // Candidate pairs per API call
    "max_retries": 3, // Retry attempts for failed calls
    "temperature": 0.0, // GPT temperature (0.0 = deterministic)
    "max_tokens": 1000 // Maximum tokens for responses
  }
}

Command-Specific Configuration

{
  "commands": {
    "load_base": {
      "data_path": "sample_kb.json", // Path to knowledge base JSON file
      "raw_input": false, // Whether data needs hierarchical transformation
      "embedding_model": "text-embedding-3-large", // Override embedding model
      "chunk_size": 500, // Override chunk size
      "levels": ["title", "solution", "chunks"] // Override levels to vectorize
    },
    "scan_base": {
      "threshold": 0.8, // Similarity threshold for candidates
      "top_k": 5, // Number of nearest neighbors
      "levels": ["chunks"], // Text levels to search
      "overwrite": false // Whether to overwrite existing result files (true) or add timestamp (false)
    },
    "check_new": {
      "input": "new_items.json", // Path to new solutions JSON file
      "threshold": 0.8, // Similarity threshold for candidates
      "top_k": 5, // Number of nearest neighbors
      "levels": ["solution", "title"], // Text levels to search
      "overwrite": true // Whether to overwrite existing result files (true) or add timestamp (false)
    }
  }
}

Data Formats

Input Data

Flat Format (Standard)

[
  {
    "solution_id": "FD-001",
    "title": "How to reset your password",
    "description_text": "Steps to reset password in settings...",
    "category": "Accounts",
    "folder": "Passwords"
  }
]

Hierarchical Format (Raw Input)

[
  {
    "name": "Accounts",
    "solutions": [
      {
        "solutionId": 43000643547,
        "title": "How to reset your password",
        "description_text": "Steps to reset password...",
        "description": "<p>Steps to reset password...</p>"
      }
    ],
    "folders": [
      {
        "name": "Passwords",
        "solutions": [
          {
            "solutionId": 43000643548,
            "title": "Password security tips",
            "description_text": "Best practices for password security..."
          }
        ]
      }
    ]
  }
]

Output Formats and Results

Generated Artifacts

  • load_base_results.json: Preprocessed knowledge base data
  • embeddings/: Vector embeddings for all levels
  • indexes/: ANN indexes for fast similarity search
  • *_results.json: Search and classification results
  • *_final_report.json: Comprehensive processing reports
  • logs/: Structured logging directory with service-specific logs
    • pipeline.log: Main orchestration and coordination
    • data_loading.log: Data loading, validation, and transformation
    • preprocessing.log: Text cleaning, chunking, and normalization
    • vectorization.log: OpenAI embedding generation and cost tracking
    • indexing.log: ANN index building and validation
    • search.log: Similarity search and candidate processing
    • classification.log: GPT classification and duplicate detection
    • metrics.log: Performance metrics and cost tracking
    • load-base.log: Load-base command execution and results
    • check-new.log: Check-new command execution and results
    • scan-base.log: Scan-base command execution and results

Search Results Format

All search operations (scan-base, check-new) produce results in the following format:

[
  {
    "id1": "FD-001", // First solution ID
    "id2": "FD-002", // Second solution ID
    "title1": "How to reset your password", // First solution title
    "title2": "Password Reset Guide", // Second solution title
    "category1": "Accounts", // First solution category
    "category2": "Accounts", // Second solution category
    "folder1": "Passwords", // First solution folder
    "folder2": "Passwords", // Second solution folder
    "solution1_text": "Steps to reset password...", // First solution content
    "solution2_text": "Guide for password reset...", // Second solution content
    "found_in": ["solution", "title"], // Text levels where similarity was found
    "similarity": 0.8538628292794584, // Maximum cosine similarity score (0.0-1.0)
    "duplicate": 1, // GPT classification: 1=duplicate, 0=not duplicate, null=not classified
    "reason": "Both articles describe the same password reset process with similar steps and content", // GPT explanation
    "sim_level": "gpt_classification" // Detection method: "sim_search" or "gpt_classification"
  }
]

Result Field Explanations

Core Identification Fields

  • id1, id2: Unique identifiers for the solution pair being compared
  • title1, title2: Article titles for quick identification
  • category1, category2: Solution categories (e.g., "Accounts", "Charts")
  • folder1, folder2: Solution folders for organization

Content Fields

  • solution1_text, solution2_text: Full text content of both solutions
  • found_in: Array of text levels where similarity was detected
    • "title": Similarity found in article titles
    • "solution": Similarity found in full article content
    • "chunks": Similarity found in text chunks

Similarity Analysis Fields

  • similarity: Maximum cosine similarity score across all levels (0.0-1.0)
    • 0.0: No similarity
    • 1.0: Identical content
    • Values above threshold (default 0.8) indicate potential duplicates

Classification Fields

  • duplicate: GPT classification result

    • 1: Confirmed duplicate by GPT
    • 0: Not a duplicate according to GPT
    • null: Classification failed or not performed
  • reason: GPT's explanation for the classification

    • Empty string if duplicate is 0
    • Detailed explanation if duplicate is 1
    • Error message if classification failed

Detection Method Field

  • sim_level: Indicates how the duplicate was detected
    • "sim_search": Found by similarity search but not confirmed by GPT
      • Used when GPT classifies as non-duplicate
      • Used when GPT classification fails
      • Used for error cases
    • "gpt_classification": Found by similarity search AND confirmed by GPT
      • Used when GPT classifies as duplicate
      • Indicates highest confidence detection

Processing Reports Format

Each command generates a comprehensive report with the following structure:

Scan Base Report (scan_base_final_report.json)

{
  "summary": {
    "total_candidates": 133, // Total candidate pairs found
    "duplicates_confirmed": 10, // Duplicates confirmed by GPT
    "duplicate_rate": 0.07518796992481203 // Percentage of confirmed duplicates
  },
  "metrics": {
    "processing_time_seconds": 674.528, // Total processing time
    "stage_times_seconds": {
      // Time breakdown by stage
      "load_base_artifacts": 0.204,
      "similarity_search": 64.105,
      "candidate_processing": 0.003,
      "classification": 610.216
    },
    "duplicates_found": {
      // Duplicate counts by stage
      "similarity_search": 133, // Found by similarity search
      "gpt_classification": 10 // Confirmed by GPT
    },
    "costs_dollars": {
      // API cost breakdown
      "vectorization": 0.0, // Embedding costs (0 for scan-base)
      "classification": 0.033684, // GPT classification costs
      "total": 0.033684 // Total costs
    },
    "performance": {
      // Performance metrics
      "memory_usage_mb": 209.62, // Peak memory usage
      "efficiency": {
        "duplicates_per_second": 0.197175, // Processing speed
        "cost_per_duplicate_dollars": 0.000253 // Cost efficiency
      }
    },
    "timestamp_unix": 1761563577.1429791, // Unix timestamp
    "timestamp_human": "2025-10-27 14:12:57 UTC" // Human-readable timestamp
  }
}

Check New Report (check_new_final_report.json)

{
  "summary": {
    "total_candidates": 1, // Total candidate pairs found
    "duplicates_confirmed": 1, // Duplicates confirmed by GPT
    "duplicate_rate": 1.0 // Percentage of confirmed duplicates
  },
  "metrics": {
    "processing_time_seconds": 4.99, // Total processing time
    "stage_times_seconds": {
      // Time breakdown by stage
      "load_base_artifacts": 0.189,
      "prepare_new_items": 1.685, // New items preparation time
      "similarity_search": 0.129,
      "candidate_processing": 0.001,
      "classification": 2.987
    },
    "duplicates_found": {
      // Duplicate counts by stage
      "similarity_search": 1, // Found by similarity search
      "gpt_classification": 1 // Confirmed by GPT
    },
    "costs_dollars": {
      // API cost breakdown
      "vectorization": 2.3e-5, // Embedding costs for new items
      "classification": 7.4e-5, // GPT classification costs
      "total": 9.7e-5 // Total costs
    },
    "performance": {
      // Performance metrics
      "memory_usage_mb": 563.05, // Peak memory usage
      "efficiency": {
        "duplicates_per_second": 0.200387, // Processing speed
        "cost_per_duplicate_dollars": 9.7e-5 // Cost efficiency
      }
    },
    "timestamp_unix": 1761562885.236172, // Unix timestamp
    "timestamp_human": "2025-10-27 14:01:25 UTC" // Human-readable timestamp
  }
}

Load Base Report (load_base_final_report.json)

{
  "summary": {
    "solutions_count": 2188, // Total solutions processed
    "processing_time": 45.2, // Total processing time
    "vectorization_cost": 0.0234, // Embedding generation cost
    "embedding_statistics": {
      // Detailed embedding metrics
      "embedding_model": "text-embedding-3-large",
      "levels": ["title", "solution", "chunks"],
      "costs": {
        "total_cost": 0.0234,
        "api_calls": 15,
        "tokens_used": 45678,
        "vectorization_time": 42.1
      },
      "title_count": 2188,
      "title_dimension": 3072,
      "solution_count": 2188,
      "solution_dimension": 3072,
      "chunks_count": 3245,
      "chunks_dimension": 3072
    },
    "index_statistics": {
      // ANN index metrics
      "total_indexes": 3,
      "levels": ["title", "solution", "chunks"],
      "metric": "angular",
      "n_trees": 100,
      "title_vectors": 2188,
      "solution_vectors": 2188,
      "chunks_vectors": 3245
    },
    "chunk_statistics": {
      // Text chunking metrics
      "total_items": 2188,
      "total_chunks": 3245,
      "avg_chunks_per_item": 1.48,
      "avg_chunk_length": 286.4,
      "min_chunk_length": 142,
      "max_chunk_length": 418
    },
    "results_path": "data/load_base_results.json",
    "report_path": "data/load_base_final_report.json",
    "embeddings_path": "data/embeddings",
    "indexes_path": "data/indexes"
  }
}

Load Base Results Format

The load_base_results.json contains preprocessed knowledge base data:

[
  {
    "solution_id": "43000643547", // Unique solution identifier
    "title": "What is economic data?", // Original article title
    "description_text": "Economic data are data points...", // Original content
    "category": "Data, Exchanges & Data-related Issues", // Solution category
    "folder": "General information about economic data", // Solution folder
    "title_cleaned": "What is economic data?", // Cleaned title (HTML removed, normalized)
    "description_cleaned": "Economic data are data points...", // Cleaned content
    "chunks": [
      // Text chunks for detailed analysis
      {
        "chunk_id": "43000643547_chunk_0", // Unique chunk identifier
        "text": "Economic data are data points...", // Chunk text content
        "start_pos": 0, // Start position in original text
        "end_pos": 304, // End position in original text
        "chunk_index": 0 // Chunk index within the solution
      }
    ]
  }
]

Load Base Results Field Explanations

  • solution_id: Unique identifier for the solution
  • title: Original article title from input data
  • description_text: Original article content from input data
  • category: Solution category for organization
  • folder: Solution folder for organization
  • title_cleaned: Title after HTML removal and Unicode normalization
  • description_cleaned: Content after HTML removal and Unicode normalization
  • chunks: Array of text chunks created by intelligent splitting
    • chunk_id: Unique identifier combining solution_id and chunk index
    • text: The actual chunk text content
    • start_pos: Character position where chunk starts in original text
    • end_pos: Character position where chunk ends in original text
    • chunk_index: Sequential index of chunk within the solution

Codebase Structure

kb-deduplication/
├── src/                          # Source code
│   ├── dataload.py              # Data loading and validation
│   ├── data_transform.py        # Hierarchical data transformation
│   ├── preprocessing.py         # Text cleaning and chunking
│   ├── vectorization.py         # OpenAI embedding generation
│   ├── indexing.py              # ANN index building
│   ├── search.py                # Similarity search engine
│   ├── classification.py        # GPT-based classification
│   ├── metrics.py               # Performance metrics
│   ├── pipeline.py              # Main orchestrator
│   ├── models.py                # Pydantic data models
│   └── utils.py                 # Utility functions
├── options/
│   └── config.json              # Configuration file
├── main.py                      # CLI entry point
├── requirements.txt             # Dependencies
└── README.md                    # Documentation

Key Features

Multi-Level Search

The service supports similarity search across multiple text levels:

  • Title Level: Article titles for quick matching
  • Solution Level: Full article content for comprehensive matching
  • Chunk Level: Text chunks for detailed content analysis

Cross-Level Aggregation

When candidates are found on multiple levels, the service:

  • Combines all matching levels in the found_in field
  • Takes the maximum similarity score
  • Provides comprehensive duplicate detection
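The aggregation rule can be sketched as follows; the `aggregate` helper and the tuple input shape are invented for the example, not the service's internal API:

```python
def aggregate(matches):
    """Merge per-level hits for the same pair: union of levels into
    found_in, maximum similarity score across levels."""
    merged = {}
    for pair, level, sim in matches:
        entry = merged.setdefault(pair, {"found_in": [], "similarity": 0.0})
        if level not in entry["found_in"]:
            entry["found_in"].append(level)
        entry["similarity"] = max(entry["similarity"], sim)
    return merged

# The same pair found at two levels collapses into one candidate
hits = [(("FD-001", "FD-002"), "title", 0.91),
        (("FD-001", "FD-002"), "solution", 0.85)]
merged = aggregate(hits)
```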

Intelligent Chunking

Uses RecursiveCharacterTextSplitter for optimal text segmentation:

  • Respects sentence and paragraph boundaries
  • Reduces chunk count by 90%+ compared to naive splitting
  • Improves semantic coherence of text chunks
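The recursive idea is: try the coarsest separator first and only fall back to finer ones for pieces that are still too long. This is a simplified pure-Python sketch of that strategy, not the LangChain implementation (which additionally handles chunk overlap and merging):

```python
def recursive_split(text, chunk_size=400, separators=("\n\n", ". ", " ")):
    """Simplified sketch of recursive splitting: split on the
    coarsest separator that applies, merge parts up to chunk_size,
    and recurse into any piece that is still too long."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                piece = part + sep
                if len(current) + len(piece) > chunk_size and current:
                    chunks.append(current.strip())
                    current = ""
                current += piece
            if current.strip():
                chunks.append(current.strip())
            # Recurse into any chunk that is still over the limit
            return [c for chunk in chunks
                      for c in recursive_split(chunk, chunk_size, separators)]
    # No separator found: hard character cut as a last resort
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Because boundaries fall on paragraphs and sentences first, the resulting chunks stay semantically coherent instead of cutting mid-sentence.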

Cost Tracking

Real-time monitoring of API costs:

  • Vectorization costs (OpenAI embeddings)
  • Classification costs (GPT API calls)
  • Detailed cost breakdowns in reports

Progress Tracking

Comprehensive progress indicators:

  • Vectorization progress for each level
  • Search progress for candidate processing
  • Classification progress for each pair
  • Real-time speed and ETA information

File Management

Flexible file overwrite behavior:

  • load-base: Always overwrites existing files (no timestamping)
  • scan-base: Uses overwrite setting - true overwrites, false adds timestamp
  • check-new: Uses overwrite setting - true overwrites, false adds timestamp
  • Timestamped files use format: filename_YYYY-MM-DD_HH-MM-SS.json
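The overwrite behavior above amounts to a small naming rule; the `result_path` helper is a sketch for illustration, not a function in the codebase:

```python
from datetime import datetime
from pathlib import Path

def result_path(base, overwrite):
    """Keep the filename as-is when overwriting; otherwise append a
    timestamp in the filename_YYYY-MM-DD_HH-MM-SS.json format."""
    path = Path(base)
    if overwrite:
        return path
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return path.with_name(f"{path.stem}_{stamp}{path.suffix}")

p = result_path("scan_base_results.json", overwrite=True)
p2 = result_path("scan_base_results.json", overwrite=False)
```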

Understanding Detection Confidence

The sim_level field indicates the confidence level of duplicate detection:

Two-Stage Detection Pipeline

  1. Similarity Search Stage:

    • Uses OpenAI embeddings and ANN indexes to find candidate pairs
    • Filters by similarity threshold (default 0.8)
    • All candidates found here get sim_level: "sim_search"
  2. GPT Classification Stage:

    • GPT analyzes candidate pairs for semantic similarity
    • Provides binary classification (duplicate/not duplicate) with reasoning
    • Updates sim_level based on GPT's decision

Sim Level Values

  • "sim_search": Lower confidence detection

    • Found by similarity search but GPT classified as non-duplicate
    • GPT classification failed due to API errors
    • Classification parsing errors occurred
    • These are potential false positives from similarity search
  • "gpt_classification": Higher confidence detection

    • Found by similarity search AND confirmed by GPT
    • GPT provided detailed reasoning for the duplicate classification
    • These have the highest confidence and should be prioritized for review

Interpreting Results

When analyzing results:

  1. High Priority: sim_level: "gpt_classification" with duplicate: 1

    • These are confirmed duplicates with GPT reasoning
    • Review the reason field for GPT's explanation
  2. Medium Priority: sim_level: "sim_search" with duplicate: 0

    • These were flagged by similarity search but GPT disagreed
    • May indicate similar but distinct content
    • Good candidates for manual review
  3. Investigation Needed: sim_level: "sim_search" with duplicate: null

    • Classification failed due to technical issues
    • Manual review recommended
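The triage scheme above can be applied directly to the results JSON. A minimal sketch (the `triage` helper is hypothetical, and the sample records show only the fields the logic reads):

```python
def triage(results):
    """Bucket search results by the priority scheme above."""
    confirmed, review, failed = [], [], []
    for r in results:
        if r["sim_level"] == "gpt_classification" and r["duplicate"] == 1:
            confirmed.append(r)   # high priority: GPT-confirmed duplicate
        elif r["duplicate"] == 0:
            review.append(r)      # medium: similarity hit, GPT disagreed
        else:
            failed.append(r)      # duplicate is None: classification failed
    return confirmed, review, failed

sample = [
    {"id1": "FD-001", "sim_level": "gpt_classification", "duplicate": 1},
    {"id1": "FD-003", "sim_level": "sim_search", "duplicate": 0},
    {"id1": "FD-005", "sim_level": "sim_search", "duplicate": None},
]
confirmed, review, failed = triage(sample)
```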

Limitations

  • Batch processing for the GPT API is not currently supported; candidate pairs are classified with individual synchronous API calls

Requirements

  • Python 3.8+
  • OpenAI API key
  • Required packages: openai, pandas, numpy, annoy, tqdm, click, tiktoken, psutil, langchain, pydantic

Troubleshooting

Common Issues

  1. Missing API Key: Set OPENAI_API_KEY environment variable
  2. Memory Issues: Reduce batch_size or use smaller embedding models for large datasets
  3. Slow Processing: Reduce batch_size or increase n_trees
  4. High Costs: Use smaller embedding models or reduce search levels

Logging

The service uses structured logging similar to monitoring systems like Zabbix. All logs are saved in the logs/ subdirectory within the output directory.

Log Structure

Each log entry follows a structured key-value format:

Service Start Example:

timestamp=2025-10-27 17:35:10 | level=INFO | service=data_loading | message=Starting data_loading service | action=service_start | description=Data loading, validation, and transformation | log_file=data_loading.log | total_items=2188

Performance Metrics Example:

timestamp=2025-10-27 18:06:14 | level=INFO | service=metrics | message=Performance metrics for load-base | action=metrics | metrics={"total_time": 1.344118356704712, "stage_times": {"data_loading": 0.0002949237823486328, "preprocessing": 0.0035767555236816406, "vectorization": 1.3333508968353271, "indexing": 0.006895780563354492}, "duplicates_sim_found": 0, "duplicates_gpt_found": 0, "total_processing_cost": 8.814e-05, "memory_usage": 178.78125}

Endpoint Completion Example:

timestamp=2025-10-27 18:06:14 | level=INFO | service=load-base | message=Completed load-base endpoint | action=endpoint_end | endpoint=load-base | solutions_count=5 | processing_time=1.344118356704712 | vectorization_cost=8.814e-05 | results_path=data/load_base_results.json | embeddings_path=data/embeddings | indexes_path=data/indexes | report_path=data/load_base_final_report.json
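Because every entry is a flat ` | `-separated list of key=value fields, the logs are easy to post-process. A sketch of a parser (the `parse_log_line` helper is illustrative, not part of the service):

```python
def parse_log_line(line):
    """Parse the ' | '-separated key=value log format shown above.
    Values may themselves contain '=' (e.g. JSON metrics), so each
    field is split on the first '=' only."""
    fields = {}
    for part in line.split(" | "):
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields

line = ("timestamp=2025-10-27 17:35:10 | level=INFO | "
        "service=data_loading | message=Starting data_loading service")
entry = parse_log_line(line)
```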

Service-Specific Logs

  • pipeline.log: Main orchestration, pipeline start/end, high-level operations
  • data_loading.log: Data loading, validation, transformation operations
  • preprocessing.log: Text cleaning, chunking, normalization operations
  • vectorization.log: Embedding generation, API calls, cost tracking
  • indexing.log: ANN index building, validation, performance metrics
  • search.log: Similarity search, candidate processing, search operations
  • classification.log: GPT classification, duplicate detection, reasoning
  • metrics.log: Performance metrics with operation-specific messages (e.g., "Performance metrics for load-base"), cost tracking, efficiency measurements

Endpoint-Specific Logs

  • load-base.log: Complete load-base command execution with configuration, results, and artifacts
  • check-new.log: Complete check-new command execution with input, processing, and results
  • scan-base.log: Complete scan-base command execution with parameters, processing, and results

Log Message Types

The logging system uses different message types for different purposes:

  • Service Messages: Starting/Completed [service] service - Internal pipeline stages
  • Operation Messages: [operation] - Specific operations within services
  • Metrics Messages: Performance metrics for [operation] - Performance data with operation context
  • Endpoint Messages: Starting/Completed [endpoint] endpoint - Command-level execution tracking
  • Error Messages: Error in [service]: [error] - Error handling with context

Log Levels

Enable detailed logging by setting log_level: "DEBUG" in configuration. Available levels:

  • DEBUG: Detailed debugging information
  • INFO: General information about operations
  • WARNING: Warning messages for potential issues
  • ERROR: Error messages with context
  • CRITICAL: Critical errors that may stop execution
