A machine learning service for detecting duplicate articles in knowledge bases using semantic similarity search and GPT-based classification.
This service implements a two-stage ML pipeline for automated duplicate detection:
- Semantic Similarity Search: Uses OpenAI embeddings and Approximate Nearest Neighbors (ANN) to find candidate duplicate pairs
- GPT Classification: Uses GPT models (gpt-4o-mini, gpt-4o, gpt-5) for zero-shot classification of candidate pairs as duplicate or non-duplicate
- Data Loading: JSON file loading with schema validation and hierarchical data transformation
- Text Preprocessing: HTML/Markdown cleaning, Unicode normalization, and intelligent chunking
- Vectorization: OpenAI embedding generation with cost tracking
- ANN Indexing: Fast similarity search using Annoy indexes
- Search Engine: Multi-level similarity search across titles, solutions, and text chunks
- Classification: GPT-based duplicate classification with structured output
- Metrics: Performance tracking, cost monitoring, and reporting
- Load Base (`load-base`): Load, preprocess, vectorize, and build indexes for the knowledge base
- Scan Base (`scan-base`): Find duplicates within the existing knowledge base
- Check New (`check-new`): Check new solutions against the existing knowledge base
```bash
# Install dependencies
pip install -r requirements.txt

# Set up environment
export OPENAI_API_KEY="your-api-key-here"

# Optional: set a custom config path
export KB_DEDUP_CONFIG_PATH="path/to/your/config.json"
```

All configuration is managed through `options/config.json` by default.
You can specify a custom config file using the KB_DEDUP_CONFIG_PATH environment variable.
No command-line arguments are needed.
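For illustration, the config-path resolution described above might look like the sketch below. This is a minimal, hypothetical helper (`load_config` is not necessarily the service's actual function name); it only shows the precedence: explicit argument, then `KB_DEDUP_CONFIG_PATH`, then the default.

```python
import json
import os
from typing import Optional

DEFAULT_CONFIG_PATH = "options/config.json"

def load_config(path: Optional[str] = None) -> dict:
    """Resolve the config file path from an explicit argument, the
    KB_DEDUP_CONFIG_PATH environment variable, or the default location,
    then parse it as JSON."""
    resolved = path or os.environ.get("KB_DEDUP_CONFIG_PATH", DEFAULT_CONFIG_PATH)
    with open(resolved, encoding="utf-8") as fh:
        return json.load(fh)
```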
```bash
# Load knowledge base (uses config.json settings)
python main.py load-base

# Scan existing knowledge base for duplicates
python main.py scan-base

# Check new solutions against existing base
python main.py check-new

# View detailed configuration options
python main.py config-help

# View current configuration values
python main.py help-config
```

The service uses `options/config.json` for all configuration parameters. Here's a comprehensive breakdown of each section:
```jsonc
{
  "pipeline": {
    "log_level": "INFO",    // Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL
    "output_dir": "./data"  // Directory where all artifacts are saved
  }
}
```

```jsonc
{
  "preprocessing": {
    "chunk_size": 400,             // Maximum characters per text chunk
    "chunk_overlap": 50,           // Characters to overlap between chunks
    "max_title_length": 1000,      // Maximum length for article titles
    "max_solution_length": 200000, // Maximum length for article content
    "remove_html": true,           // Strip HTML tags from content
    "normalize_unicode": true      // Normalize Unicode characters
  }
}
```

```jsonc
{
  "vectorization": {
    "embedding_model": "text-embedding-3-large", // OpenAI embedding model
    "batch_size": 100,                           // Texts per API call
    "api_timeout": 30,                           // API timeout in seconds
    "max_retries": 3,                            // Retry attempts for failed calls
    "levels": ["title", "solution", "chunks"]    // Text levels to vectorize
  }
}
```

```jsonc
{
  "indexing": {
    "metric": "angular", // Distance metric: angular, euclidean, manhattan, hamming, dot
    "n_trees": 100       // Number of trees (higher = more accurate)
  }
}
```

```jsonc
{
  "search": {
    "default_similarity_threshold": 0.8, // Minimum cosine similarity (0.0-1.0)
    "default_top_k": 5,                  // Number of nearest neighbors
    "default_levels": ["solution"]       // Default text levels to search
  }
}
```

```jsonc
{
  "classification": {
    "enabled": true,    // Enable GPT classification
    "model": "gpt-5",   // OpenAI model: gpt-4o-mini, gpt-4o, gpt-5
    "batch_size": 10,   // Candidate pairs per API call
    "max_retries": 3,   // Retry attempts for failed calls
    "temperature": 0.0, // GPT temperature (0.0 = deterministic)
    "max_tokens": 1000  // Maximum tokens for responses
  }
}
```

```jsonc
{
  "commands": {
    "load_base": {
      "data_path": "sample_kb.json",               // Path to knowledge base JSON file
      "raw_input": false,                          // Whether data needs hierarchical transformation
      "embedding_model": "text-embedding-3-large", // Override embedding model
      "chunk_size": 500,                           // Override chunk size
      "levels": ["title", "solution", "chunks"]    // Override levels to vectorize
    },
    "scan_base": {
      "threshold": 0.8,     // Similarity threshold for candidates
      "top_k": 5,           // Number of nearest neighbors
      "levels": ["chunks"], // Text levels to search
      "overwrite": false    // Overwrite existing result files (true) or add a timestamp (false)
    },
    "check_new": {
      "input": "new_items.json",       // Path to new solutions JSON file
      "threshold": 0.8,                // Similarity threshold for candidates
      "top_k": 5,                      // Number of nearest neighbors
      "levels": ["solution", "title"], // Text levels to search
      "overwrite": true                // Overwrite existing result files (true) or add a timestamp (false)
    }
  }
}
```

Input data in flat format:

```json
[
  {
    "solution_id": "FD-001",
    "title": "How to reset your password",
    "description_text": "Steps to reset password in settings...",
    "category": "Accounts",
    "folder": "Passwords"
  }
]
```

Input data in hierarchical format (see the `raw_input` setting):

```json
[
  {
    "name": "Accounts",
    "solutions": [
      {
        "solutionId": 43000643547,
        "title": "How to reset your password",
        "description_text": "Steps to reset password...",
        "description": "<p>Steps to reset password...</p>"
      }
    ],
    "folders": [
      {
        "name": "Passwords",
        "solutions": [
          {
            "solutionId": 43000643548,
            "title": "Password security tips",
            "description_text": "Best practices for password security..."
          }
        ]
      }
    ]
  }
]
```

The output directory contains the following artifacts:

- `load_base_results.json`: Preprocessed knowledge base data
- `embeddings/`: Vector embeddings for all levels
- `indexes/`: ANN indexes for fast similarity search
- `*_results.json`: Search and classification results
- `*_final_report.json`: Comprehensive processing reports
- `logs/`: Structured logging directory with service-specific logs

Log files:

- `pipeline.log`: Main orchestration and coordination
- `data_loading.log`: Data loading, validation, and transformation
- `preprocessing.log`: Text cleaning, chunking, and normalization
- `vectorization.log`: OpenAI embedding generation and cost tracking
- `indexing.log`: ANN index building and validation
- `search.log`: Similarity search and candidate processing
- `classification.log`: GPT classification and duplicate detection
- `metrics.log`: Performance metrics and cost tracking
- `load-base.log`: Load-base command execution and results
- `check-new.log`: Check-new command execution and results
- `scan-base.log`: Scan-base command execution and results
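The hierarchical input format shown above is flattened into per-solution records before preprocessing. The real logic lives in `src/data_transform.py`; the sketch below is hypothetical and only illustrates the shape of that transformation, using the field names from the sample data.

```python
from typing import Any, Dict, List

def flatten_kb(categories: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten the hierarchical category/folder structure into flat
    per-solution records (illustrative sketch, not the service's code)."""
    flat: List[Dict[str, Any]] = []
    for category in categories:
        # Solutions attached directly to the category (no folder)
        for sol in category.get("solutions", []):
            flat.append({
                "solution_id": str(sol["solutionId"]),
                "title": sol["title"],
                "description_text": sol.get("description_text", ""),
                "category": category["name"],
                "folder": None,
            })
        # Solutions nested inside folders
        for folder in category.get("folders", []):
            for sol in folder.get("solutions", []):
                flat.append({
                    "solution_id": str(sol["solutionId"]),
                    "title": sol["title"],
                    "description_text": sol.get("description_text", ""),
                    "category": category["name"],
                    "folder": folder["name"],
                })
    return flat
```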
All search operations (scan-base, check-new) produce results in the following format:
```jsonc
[
  {
    "id1": "FD-001",                              // First solution ID
    "id2": "FD-002",                              // Second solution ID
    "title1": "How to reset your password",       // First solution title
    "title2": "Password Reset Guide",             // Second solution title
    "category1": "Accounts",                      // First solution category
    "category2": "Accounts",                      // Second solution category
    "folder1": "Passwords",                       // First solution folder
    "folder2": "Passwords",                       // Second solution folder
    "solution1_text": "Steps to reset password...", // First solution content
    "solution2_text": "Guide for password reset...", // Second solution content
    "found_in": ["solution", "title"],            // Text levels where similarity was found
    "similarity": 0.8538628292794584,             // Maximum cosine similarity score (0.0-1.0)
    "duplicate": 1,                               // GPT classification: 1=duplicate, 0=not duplicate, null=not classified
    "reason": "Both articles describe the same password reset process with similar steps and content", // GPT explanation
    "sim_level": "gpt_classification"             // Detection method: "sim_search" or "gpt_classification"
  }
]
```

Field reference:

- `id1`, `id2`: Unique identifiers for the solution pair being compared
- `title1`, `title2`: Article titles for quick identification
- `category1`, `category2`: Solution categories (e.g., "Accounts", "Charts")
- `folder1`, `folder2`: Solution folders for organization
- `solution1_text`, `solution2_text`: Full text content of both solutions
- `found_in`: Array of text levels where similarity was detected
  - `"title"`: Similarity found in article titles
  - `"solution"`: Similarity found in full article content
  - `"chunks"`: Similarity found in text chunks
- `similarity`: Maximum cosine similarity score across all levels (0.0-1.0)
  - `0.0`: No similarity; `1.0`: identical content
  - Values above the threshold (default 0.8) indicate potential duplicates
- `duplicate`: GPT classification result
  - `1`: Confirmed duplicate by GPT
  - `0`: Not a duplicate according to GPT
  - `null`: Classification failed or not performed
- `reason`: GPT's explanation for the classification
  - Empty string if `duplicate` is 0
  - Detailed explanation if `duplicate` is 1
  - Error message if classification failed
- `sim_level`: Indicates how the duplicate was detected
  - `"sim_search"`: Found by similarity search but not confirmed by GPT
    - Used when GPT classifies as non-duplicate, when GPT classification fails, and for error cases
  - `"gpt_classification"`: Found by similarity search AND confirmed by GPT
    - Used when GPT classifies as duplicate
    - Indicates highest-confidence detection
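For quick analysis of a results file in this format, a small stdlib-only helper can tally pairs by `sim_level` and compute the confirmed-duplicate rate. This is an illustrative utility, not part of the service:

```python
import json
from collections import Counter

def summarize_results(path: str) -> dict:
    """Summarize a *_results.json file: counts by sim_level and the
    confirmed-duplicate rate (illustrative helper)."""
    with open(path, encoding="utf-8") as fh:
        rows = json.load(fh)
    by_level = Counter(r.get("sim_level") for r in rows)
    confirmed = sum(1 for r in rows if r.get("duplicate") == 1)
    return {
        "total_candidates": len(rows),
        "by_sim_level": dict(by_level),
        "duplicates_confirmed": confirmed,
        "duplicate_rate": confirmed / len(rows) if rows else 0.0,
    }
```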
Each command generates a comprehensive report with the following structure:
Example `scan-base` report:

```jsonc
{
  "summary": {
    "total_candidates": 133,              // Total candidate pairs found
    "duplicates_confirmed": 10,           // Duplicates confirmed by GPT
    "duplicate_rate": 0.07518796992481203 // Fraction of candidates confirmed as duplicates
  },
  "metrics": {
    "processing_time_seconds": 674.528,   // Total processing time
    "stage_times_seconds": {              // Time breakdown by stage
      "load_base_artifacts": 0.204,
      "similarity_search": 64.105,
      "candidate_processing": 0.003,
      "classification": 610.216
    },
    "duplicates_found": {                 // Duplicate counts by stage
      "similarity_search": 133,           // Found by similarity search
      "gpt_classification": 10            // Confirmed by GPT
    },
    "costs_dollars": {                    // API cost breakdown
      "vectorization": 0.0,               // Embedding costs (0 for scan-base)
      "classification": 0.033684,         // GPT classification costs
      "total": 0.033684                   // Total costs
    },
    "performance": {                      // Performance metrics
      "memory_usage_mb": 209.62,          // Peak memory usage
      "efficiency": {
        "duplicates_per_second": 0.197175,       // Processing speed
        "cost_per_duplicate_dollars": 0.000253   // Cost efficiency
      }
    },
    "timestamp_unix": 1761563577.1429791,        // Unix timestamp
    "timestamp_human": "2025-10-27 14:12:57 UTC" // Human-readable timestamp
  }
}
```

Example `check-new` report:

```jsonc
{
  "summary": {
    "total_candidates": 1,       // Total candidate pairs found
    "duplicates_confirmed": 1,   // Duplicates confirmed by GPT
    "duplicate_rate": 1.0        // Fraction of candidates confirmed as duplicates
  },
  "metrics": {
    "processing_time_seconds": 4.99, // Total processing time
    "stage_times_seconds": {         // Time breakdown by stage
      "load_base_artifacts": 0.189,
      "prepare_new_items": 1.685,    // New items preparation time
      "similarity_search": 0.129,
      "candidate_processing": 0.001,
      "classification": 2.987
    },
    "duplicates_found": {            // Duplicate counts by stage
      "similarity_search": 1,        // Found by similarity search
      "gpt_classification": 1        // Confirmed by GPT
    },
    "costs_dollars": {               // API cost breakdown
      "vectorization": 2.3e-5,       // Embedding costs for new items
      "classification": 7.4e-5,      // GPT classification costs
      "total": 9.7e-5                // Total costs
    },
    "performance": {                 // Performance metrics
      "memory_usage_mb": 563.05,     // Peak memory usage
      "efficiency": {
        "duplicates_per_second": 0.200387,     // Processing speed
        "cost_per_duplicate_dollars": 9.7e-5   // Cost efficiency
      }
    },
    "timestamp_unix": 1761562885.236172,         // Unix timestamp
    "timestamp_human": "2025-10-27 14:01:25 UTC" // Human-readable timestamp
  }
}
```

Example `load-base` report:

```jsonc
{
  "summary": {
    "solutions_count": 2188,      // Total solutions processed
    "processing_time": 45.2,      // Total processing time
    "vectorization_cost": 0.0234, // Embedding generation cost
    "embedding_statistics": {     // Detailed embedding metrics
      "embedding_model": "text-embedding-3-large",
      "levels": ["title", "solution", "chunks"],
      "costs": {
        "total_cost": 0.0234,
        "api_calls": 15,
        "tokens_used": 45678,
        "vectorization_time": 42.1
      },
      "title_count": 2188,
      "title_dimension": 3072,
      "solution_count": 2188,
      "solution_dimension": 3072,
      "chunks_count": 3245,
      "chunks_dimension": 3072
    },
    "index_statistics": {         // ANN index metrics
      "total_indexes": 3,
      "levels": ["title", "solution", "chunks"],
      "metric": "angular",
      "n_trees": 100,
      "title_vectors": 2188,
      "solution_vectors": 2188,
      "chunks_vectors": 3245
    },
    "chunk_statistics": {         // Text chunking metrics
      "total_items": 2188,
      "total_chunks": 3245,
      "avg_chunks_per_item": 1.48,
      "avg_chunk_length": 286.4,
      "min_chunk_length": 142,
      "max_chunk_length": 418
    },
    "results_path": "data/load_base_results.json",
    "report_path": "data/load_base_final_report.json",
    "embeddings_path": "data/embeddings",
    "indexes_path": "data/indexes"
  }
}
```

The `load_base_results.json` file contains the preprocessed knowledge base data:

```jsonc
[
  {
    "solution_id": "43000643547",                          // Unique solution identifier
    "title": "What is economic data?",                     // Original article title
    "description_text": "Economic data are data points...", // Original content
    "category": "Data, Exchanges & Data-related Issues",   // Solution category
    "folder": "General information about economic data",   // Solution folder
    "title_cleaned": "What is economic data?",             // Cleaned title (HTML removed, normalized)
    "description_cleaned": "Economic data are data points...", // Cleaned content
    "chunks": [                                            // Text chunks for detailed analysis
      {
        "chunk_id": "43000643547_chunk_0",           // Unique chunk identifier
        "text": "Economic data are data points...",  // Chunk text content
        "start_pos": 0,   // Start position in original text
        "end_pos": 304,   // End position in original text
        "chunk_index": 0  // Chunk index within the solution
      }
    ]
  }
]
```

Field reference:

- `solution_id`: Unique identifier for the solution
- `title`: Original article title from input data
- `description_text`: Original article content from input data
- `category`: Solution category for organization
- `folder`: Solution folder for organization
- `title_cleaned`: Title after HTML removal and Unicode normalization
- `description_cleaned`: Content after HTML removal and Unicode normalization
- `chunks`: Array of text chunks created by intelligent splitting
  - `chunk_id`: Unique identifier combining the solution ID and chunk index
  - `text`: The actual chunk text content
  - `start_pos`: Character position where the chunk starts in the original text
  - `end_pos`: Character position where the chunk ends in the original text
  - `chunk_index`: Sequential index of the chunk within the solution
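To make the chunk fields concrete, here is a simplified sliding-window chunker that emits records in the same shape. It is a sketch only: the actual pipeline uses `RecursiveCharacterTextSplitter`, which respects sentence boundaries, so real chunk positions will differ.

```python
from typing import Dict, List

def chunk_solution(solution_id: str, text: str,
                   chunk_size: int = 400, overlap: int = 50) -> List[Dict]:
    """Naive sliding-window chunker producing records shaped like the
    `chunks` entries in load_base_results.json (illustrative sketch)."""
    chunks: List[Dict] = []
    step = chunk_size - overlap
    for i, start in enumerate(range(0, max(len(text), 1), step)):
        piece = text[start:start + chunk_size]
        if not piece:
            break
        chunks.append({
            "chunk_id": f"{solution_id}_chunk_{i}",
            "text": piece,
            "start_pos": start,
            "end_pos": start + len(piece),
            "chunk_index": i,
        })
        if start + chunk_size >= len(text):
            break
    return chunks
```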
```text
kb-deduplication/
├── src/                    # Source code
│   ├── dataload.py         # Data loading and validation
│   ├── data_transform.py   # Hierarchical data transformation
│   ├── preprocessing.py    # Text cleaning and chunking
│   ├── vectorization.py    # OpenAI embedding generation
│   ├── indexing.py         # ANN index building
│   ├── search.py           # Similarity search engine
│   ├── classification.py   # GPT-based classification
│   ├── metrics.py          # Performance metrics
│   ├── pipeline.py         # Main orchestrator
│   ├── models.py           # Pydantic data models
│   └── utils.py            # Utility functions
├── options/
│   └── config.json         # Configuration file
├── main.py                 # CLI entry point
├── requirements.txt        # Dependencies
└── README.md               # Documentation
```
The service supports similarity search across multiple text levels:
- Title Level: Article titles for quick matching
- Solution Level: Full article content for comprehensive matching
- Chunk Level: Text chunks for detailed content analysis
When candidates are found on multiple levels, the service:
- Combines all matching levels in the `found_in` field
- Takes the maximum similarity score across levels
- Provides comprehensive duplicate detection
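That merging step can be sketched as follows. This is a hypothetical helper: field names follow the results format above, and a per-candidate `level` field is assumed for raw search hits.

```python
from typing import Dict, List, Tuple

def merge_candidates(candidates: List[Dict]) -> List[Dict]:
    """Merge candidate pairs found on multiple levels: union the levels
    into `found_in` and keep the maximum similarity (illustrative sketch)."""
    merged: Dict[Tuple[str, str], Dict] = {}
    for cand in candidates:
        # Normalize pair order so (A, B) and (B, A) merge into one entry
        key = tuple(sorted((cand["id1"], cand["id2"])))
        entry = merged.setdefault(key, {
            "id1": key[0], "id2": key[1],
            "found_in": [], "similarity": 0.0,
        })
        if cand["level"] not in entry["found_in"]:
            entry["found_in"].append(cand["level"])
        entry["similarity"] = max(entry["similarity"], cand["similarity"])
    return list(merged.values())
```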
Uses `RecursiveCharacterTextSplitter` for optimal text segmentation:
- Respects sentence and paragraph boundaries
- Reduces chunk count by 90%+ compared to naive splitting
- Improves semantic coherence of text chunks
Real-time monitoring of API costs:
- Vectorization costs (OpenAI embeddings)
- Classification costs (GPT API calls)
- Detailed cost breakdowns in reports
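As a rough illustration of how such dollar figures are derived from token counts (the per-million-token price below is an assumption for `text-embedding-3-large`; check current OpenAI pricing before relying on it):

```python
def embedding_cost_dollars(tokens: int, price_per_million: float = 0.13) -> float:
    """Estimate embedding cost: tokens consumed times a per-million-token
    price (0.13 USD/1M tokens is an assumed price, not authoritative)."""
    return tokens / 1_000_000 * price_per_million
```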
Comprehensive progress indicators:
- Vectorization progress for each level
- Search progress for candidate processing
- Classification progress for each pair
- Real-time speed and ETA information
Flexible file overwrite behavior:
- load-base: Always overwrites existing files (no timestamping)
- scan-base: Uses the `overwrite` setting: `true` overwrites, `false` adds a timestamp
- check-new: Uses the `overwrite` setting: `true` overwrites, `false` adds a timestamp
- Timestamped files use the format `filename_YYYY-MM-DD_HH-MM-SS.json`
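The overwrite behavior can be pictured with a small helper (hypothetical; the service's actual naming logic may differ):

```python
from datetime import datetime
from pathlib import Path

def resolve_output_path(path: str, overwrite: bool) -> str:
    """Return `path` unchanged when overwrite=True; otherwise append a
    timestamp in the filename_YYYY-MM-DD_HH-MM-SS.json style."""
    if overwrite:
        return path
    p = Path(path)
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return str(p.with_name(f"{p.stem}_{stamp}{p.suffix}"))
```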
The `sim_level` field indicates the confidence level of duplicate detection:

- Similarity Search Stage:
  - Uses OpenAI embeddings and ANN indexes to find candidate pairs
  - Filters by similarity threshold (default 0.8)
  - All candidates found here get `sim_level: "sim_search"`
- GPT Classification Stage:
  - GPT analyzes candidate pairs for semantic similarity
  - Provides binary classification (duplicate/not duplicate) with reasoning
  - Updates `sim_level` based on GPT's decision

The two values mean:

- `"sim_search"`: Lower-confidence detection
  - Found by similarity search but GPT classified as non-duplicate
  - GPT classification failed due to API errors
  - Classification parsing errors occurred
  - These are potential false positives from similarity search
- `"gpt_classification"`: Higher-confidence detection
  - Found by similarity search AND confirmed by GPT
  - GPT provided detailed reasoning for the duplicate classification
  - These have the highest confidence and should be prioritized for review
When analyzing results:

- High Priority: `sim_level: "gpt_classification"` with `duplicate: 1`
  - These are confirmed duplicates with GPT reasoning
  - Review the `reason` field for GPT's explanation
- Medium Priority: `sim_level: "sim_search"` with `duplicate: 0`
  - These were flagged by similarity search but GPT disagreed
  - May indicate similar but distinct content
  - Good candidates for manual review
- Investigation Needed: `sim_level: "sim_search"` with `duplicate: null`
  - Classification failed due to technical issues
  - Manual review recommended
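The review guidance above can be captured in a tiny triage helper (illustrative only, not part of the service):

```python
def triage(result: dict) -> str:
    """Bucket a result row according to the review priorities above."""
    sim_level = result.get("sim_level")
    duplicate = result.get("duplicate")
    if sim_level == "gpt_classification" and duplicate == 1:
        return "high_priority"        # confirmed duplicate with GPT reasoning
    if sim_level == "sim_search" and duplicate == 0:
        return "medium_priority"      # flagged by search, GPT disagreed
    if sim_level == "sim_search" and duplicate is None:
        return "needs_investigation"  # classification failed
    return "other"
```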
- The current approach does not support batch processing for the GPT API
- Python 3.8+
- OpenAI API key
- Required packages: `openai`, `pandas`, `numpy`, `annoy`, `tqdm`, `click`, `tiktoken`, `psutil`, `langchain`, `pydantic`
- Missing API Key: Set the `OPENAI_API_KEY` environment variable
- Memory Issues: Reduce `batch_size` or use smaller embedding models for large datasets
- Slow Processing: Reduce `batch_size` or increase `n_trees`
- High Costs: Use smaller embedding models or reduce search levels
The service uses structured logging similar to monitoring systems like Zabbix. All logs are saved in the logs/ subdirectory within the output directory.
Each log entry follows a structured key-value format:
Service start example:

```text
timestamp=2025-10-27 17:35:10 | level=INFO | service=data_loading | message=Starting data_loading service | action=service_start | description=Data loading, validation, and transformation | log_file=data_loading.log | total_items=2188
```

Performance metrics example:

```text
timestamp=2025-10-27 18:06:14 | level=INFO | service=metrics | message=Performance metrics for load-base | action=metrics | metrics={"total_time": 1.344118356704712, "stage_times": {"data_loading": 0.0002949237823486328, "preprocessing": 0.0035767555236816406, "vectorization": 1.3333508968353271, "indexing": 0.006895780563354492}, "duplicates_sim_found": 0, "duplicates_gpt_found": 0, "total_processing_cost": 8.814e-05, "memory_usage": 178.78125}
```

Endpoint completion example:

```text
timestamp=2025-10-27 18:06:14 | level=INFO | service=load-base | message=Completed load-base endpoint | action=endpoint_end | endpoint=load-base | solutions_count=5 | processing_time=1.344118356704712 | vectorization_cost=8.814e-05 | results_path=data/load_base_results.json | embeddings_path=data/embeddings | indexes_path=data/indexes | report_path=data/load_base_final_report.json
```
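A formatter producing this pipe-separated key=value shape might look like the sketch below (this is not the service's actual logger):

```python
from datetime import datetime

def format_log_line(level: str, service: str, message: str, **fields) -> str:
    """Render a structured log line in the pipe-separated key=value
    format shown in the examples above (illustrative sketch)."""
    parts = [
        f"timestamp={datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
        f"level={level}",
        f"service={service}",
        f"message={message}",
    ]
    # Extra context fields are appended in the order given
    parts.extend(f"{key}={value}" for key, value in fields.items())
    return " | ".join(parts)
```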
- `pipeline.log`: Main orchestration, pipeline start/end, high-level operations
- `data_loading.log`: Data loading, validation, transformation operations
- `preprocessing.log`: Text cleaning, chunking, normalization operations
- `vectorization.log`: Embedding generation, API calls, cost tracking
- `indexing.log`: ANN index building, validation, performance metrics
- `search.log`: Similarity search, candidate processing, search operations
- `classification.log`: GPT classification, duplicate detection, reasoning
- `metrics.log`: Performance metrics with operation-specific messages (e.g., "Performance metrics for load-base"), cost tracking, efficiency measurements
- `load-base.log`: Complete load-base command execution with configuration, results, and artifacts
- `check-new.log`: Complete check-new command execution with input, processing, and results
- `scan-base.log`: Complete scan-base command execution with parameters, processing, and results
The logging system uses different message types for different purposes:
- Service Messages: `Starting/Completed [service] service` - internal pipeline stages
- Operation Messages: `[operation]` - specific operations within services
- Metrics Messages: `Performance metrics for [operation]` - performance data with operation context
- Endpoint Messages: `Starting/Completed [endpoint] endpoint` - command-level execution tracking
- Error Messages: `Error in [service]: [error]` - error handling with context
Enable detailed logging by setting `log_level: "DEBUG"` in the configuration. Available levels:

- `DEBUG`: Detailed debugging information
- `INFO`: General information about operations
- `WARNING`: Warning messages for potential issues
- `ERROR`: Error messages with context
- `CRITICAL`: Critical errors that may stop execution