An AI-powered log classification engine that intelligently categorizes system logs using a hybrid approach, combining three complementary methods to handle varying levels of complexity in log patterns. Together, these methods provide the flexibility to process predictable, complex, and poorly-labeled data alike.
Features • Quick Start • Architecture • API Documentation • Contributing
- Overview
- Features
- System Architecture
- Installation
- Quick Start
- API Documentation
- Configuration
- Development
- Contributing
- License
## Overview

LogEngine is an intelligent log classification system designed to automatically categorize system logs into predefined categories using a hybrid approach that combines regex-based pattern matching, BERT-based semantic analysis, and Large Language Models (LLMs).
The system processes CSV files containing log messages from multiple sources (e.g., CRM systems, billing systems, analytics engines) and produces enriched output with intelligent classification labels. This enables organizations to:
- Automate log categorization at scale
- Reduce manual review time by up to 80%
- Identify critical issues faster through semantic understanding
- Support legacy systems with specialized LLM processing
- Handle edge cases with fallback mechanisms
### Use Cases

- Enterprise Log Management: Automatically categorize logs from distributed systems
- Security Monitoring: Identify anomalies and security-related events
- Incident Response: Fast-track critical issues for immediate action
- Compliance Reporting: Categorize logs for audit and compliance purposes
- System Health Monitoring: Track system notifications and operational events
## Features

- Regex-Based Matching: Ultra-fast pattern recognition for known log formats
- BERT Embeddings: Deep semantic understanding for complex log messages
- LLM Integration: Advanced reasoning for ambiguous or legacy system logs
- Intelligent Fallback: Graceful degradation when primary method fails
- CSV Processing: Batch process multiple logs in one operation
- RESTful API: Real-time classification via FastAPI
- Extensible Categories: Easily add new classification categories
- Confidence Scoring: Built-in confidence metrics (via BERT probabilities)
- Pre-trained Models: Uses lightweight sentence transformers for fast inference
- Efficient Caching: Model loading optimization
- Scalable Design: Built on FastAPI for async processing
- Low Latency: < 100ms per log message on average
- Error Handling: Comprehensive exception management
- Input Validation: CSV schema validation
- Logging: Detailed processing logs for debugging
- Resource Efficient: Minimal memory footprint
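The three classification methods above are typically arranged as a cascade: regex first, then BERT with a confidence threshold, then the LLM for legacy sources. A minimal sketch of that routing logic (illustrative only — the real processors live in `processor_regex.py`, `processor_bert.py`, and `processor_llm.py`, and the stub patterns, labels, and threshold here are assumptions, not the project's actual values):

```python
import re

# Illustrative stand-ins for the three processors; patterns, labels,
# and the threshold are assumptions for demonstration purposes.
REGEX_PATTERNS = {
    r"User \d+ logged in": "User Action",
    r"Backup completed": "System Notification",
}
CONFIDENCE_THRESHOLD = 0.5

def regex_classify(message):
    """Stage 1: ultra-fast pattern matching for known formats."""
    for pattern, label in REGEX_PATTERNS.items():
        if re.search(pattern, message):
            return label
    return None

def bert_classify(message):
    """Stage 2 stand-in: would return (label, probability) from the BERT model."""
    return ("System Notification", 0.42)  # dummy low-confidence result

def llm_classify(message):
    """Stage 3 stand-in: would query the LLM for ambiguous/legacy logs."""
    return "Deprecation Warning"

def classify_log(source, message):
    label = regex_classify(message)
    if label is not None:
        return label                       # known format: done
    if source == "LegacyCRM":
        return llm_classify(message)       # legacy logs go straight to the LLM
    label, confidence = bert_classify(message)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                       # confident semantic match
    return "Unclassified"                  # graceful fallback

print(classify_log("ModernCRM", "User 12345 logged in."))  # → User Action
```

The ordering matters: regex is cheapest, so it filters the bulk of predictable traffic before any model is invoked.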
## Installation

### Prerequisites

- Python 3.9+
- pip or conda
- Groq API Key (for LLM features)
- 8GB RAM minimum (for BERT model loading)
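A quick way to sanity-check the first two prerequisites from a shell (hypothetical one-liners, not part of the project):

```shell
# Verify the interpreter meets the 3.9+ requirement
python -c 'import sys; assert sys.version_info >= (3, 9), "Python 3.9+ required"; print("Python OK")'

# Verify pip is available
pip --version
```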
```bash
# Clone the repository
git clone https://github.com/yourusername/logengine.git
cd logengine
```

```bash
# Create a virtual environment using venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using conda
conda create -n logengine python=3.9
conda activate logengine
```

```bash
# Install dependencies
pip install -r requirements.txt
```

```bash
# BERT models will auto-download on first run
# Alternatively, pre-download:
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
```

```bash
# Create .env file
cat > .env << EOF
GROQ_API_KEY=your_groq_api_key_here
LOG_LEVEL=INFO
EOF
```

```bash
# Test imports
python -c "from classify import classify; print('✓ LogEngine installed successfully')"
```
```bash
# Test API server
python server.py
# Should output: INFO: Uvicorn running on http://0.0.0.0:8000
```

## Quick Start

### REST API

```bash
# Start the server
python server.py

# In another terminal, upload a CSV file
curl -X POST "http://localhost:8000/classify/" \
  -F "file=@resources/test.csv"
# Response: CSV file with target_label column
```

### Command Line

```bash
# Classify logs from a CSV file
python classify.py
# Input: resources/test.csv
# Output: resources/output.csv
```

### Python API

```python
from classify import classify_log, classify

# Single log classification
label = classify_log("ModernCRM", "User 12345 logged in.")
print(label)  # Output: "User Action"

# Batch classification
logs = [
    ("ModernCRM", "User 12345 logged in."),
    ("BillingSystem", "Payment received successfully"),
    ("LegacyCRM", "The ReportGenerator module will be retired in v4.0"),
]
labels = classify(logs)
for log, label in zip(logs, labels):
    print(f"{log[0]} → {label}")
```

## API Documentation

### POST /classify/

Classify log messages from a CSV file.
**Request:**

```http
POST /classify/ HTTP/1.1
Host: localhost:8000
Content-Type: multipart/form-data

file: (binary CSV file)
```

**Example input CSV:**

```csv
source,log_message
ModernCRM,User 12345 logged in.
BillingSystem,Payment received for invoice INV-001
AnalyticsEngine,File data_6957.csv uploaded successfully by user User265
LegacyCRM,The ReportGenerator module will be retired in version 4.0
```

**Response:**

```http
HTTP/1.1 200 OK
Content-Type: text/csv
```

```csv
source,log_message,target_label
ModernCRM,User 12345 logged in.,User Action
BillingSystem,Payment received for invoice INV-001,System Notification
AnalyticsEngine,File data_6957.csv uploaded successfully by user User265,System Notification
LegacyCRM,The ReportGenerator module will be retired in version 4.0,Deprecation Warning
```

**Error response (400):**

```json
{
  "detail": "CSV must contain 'source' and 'log_message' columns."
}
```

**Status codes:**

| Code | Description |
|---|---|
| 200 | Classification successful |
| 400 | Invalid file format or missing columns |
| 500 | Server error during processing |
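Clients can avoid the 400 response by checking the schema before uploading. A minimal pre-flight check (hypothetical helper, not part of the LogEngine API):

```python
import csv
import io

# Columns the /classify/ endpoint requires, per the error message above
REQUIRED_COLUMNS = {"source", "log_message"}

def has_required_columns(csv_text):
    """Return True if the CSV header contains the columns /classify/ expects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return REQUIRED_COLUMNS.issubset(set(reader.fieldnames or []))

print(has_required_columns("source,log_message\nModernCRM,User 12345 logged in.\n"))  # → True
print(has_required_columns("message\nhello\n"))  # → False
```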
## Configuration

### Environment Variables

```bash
# Required
GROQ_API_KEY=your_groq_api_key_here

# Optional
LOG_LEVEL=INFO            # DEBUG, INFO, WARNING, ERROR
CONFIDENCE_THRESHOLD=0.5  # BERT confidence threshold
API_HOST=0.0.0.0          # FastAPI host
API_PORT=8000             # FastAPI port
```

### Model Configuration

Edit the processor files to customize models:

```python
# processor_bert.py
model_embedding = SentenceTransformer('all-MiniLM-L6-v2')

# processor_llm.py
model = "deepseek-r1-distill-llama-70b"
temperature = 0.5
```

### Custom Regex Patterns

Edit processor_regex.py:
```python
regex_patterns = {
    r"Your pattern here": "Your Category",
    r"Another pattern": "Another Category",
}
```

## Development

### Project Structure

```
logengine/
├── server.py                    # FastAPI server
├── classify.py                  # Classification orchestrator
├── processor_regex.py           # Regex-based processor
├── processor_bert.py            # BERT-based processor
├── processor_llm.py             # LLM-based processor
├── models/
│   ├── log_classifier.joblib    # Pre-trained BERT classifier
│   └── log_classifier_model.pkl # Backup model
├── resources/
│   ├── test.csv                 # Test dataset
│   └── output.csv               # Output results
├── .env                         # Environment variables
├── requirements.txt             # Python dependencies
└── README.md                    # This file
```
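A unit test for the regex processor might live under a `tests/` directory. The sketch below is hypothetical: it assumes `processor_regex.py` exposes a `regex_patterns` dict mapping raw patterns to labels (as shown in the Configuration section), and re-implements the matching loop locally so it runs standalone:

```python
# tests/test_processor_regex.py (illustrative sketch)
import re

def match_patterns(patterns, message):
    """Mirror of the regex stage: first matching pattern wins, else None."""
    for pattern, label in patterns.items():
        if re.search(pattern, message):
            return label
    return None

def test_known_pattern_is_labeled():
    patterns = {r"User \d+ logged in": "User Action"}
    assert match_patterns(patterns, "User 12345 logged in.") == "User Action"

def test_unknown_message_falls_through():
    patterns = {r"User \d+ logged in": "User Action"}
    assert match_patterns(patterns, "completely novel text") is None
```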
### Testing

```bash
# Unit tests
python -m pytest tests/ -v

# Integration tests
python -m pytest tests/integration/ -v

# Test classification on sample data
python classify.py
```

### Profiling

```bash
# Profile classification speed
python -m cProfile -s cumtime classify.py

# Memory usage
python -m memory_profiler classify.py
```

### Debugging

```python
# Add debug logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Log classification steps
from classify import classify_log
label = classify_log("ModernCRM", "User 12345 logged in.", debug=True)
```

## Contributing

We welcome contributions! Please follow these guidelines:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Write tests for new functionality
5. Commit changes (`git commit -m 'Add amazing feature'`)
6. Push to branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
### Code Standards

- Follow PEP 8 guidelines
- Add docstrings to all functions
- Include type hints
- Write unit tests (minimum 80% coverage)
- Update README for new features
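For reference, the docstring and type-hint style the guidelines above ask for might look like this (illustrative function with a placeholder body, not the project's real implementation):

```python
def classify_log(source: str, log_message: str) -> str:
    """Classify a single log message into a category.

    Args:
        source: Originating system, e.g. "ModernCRM".
        log_message: Raw log line to classify.

    Returns:
        The predicted category label.
    """
    return "System Notification"  # placeholder body for illustration
```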
```bash
# Must pass all tests
pytest tests/ --cov=. --cov-report=term-missing

# Code quality
flake8 . --max-line-length=100
black . --check
```

## Troubleshooting

**Issue:** "ModuleNotFoundError: No module named 'groq'"
```bash
# Solution: Install missing dependency
pip install groq python-dotenv
```

**Issue:** "GROQ_API_KEY environment variable not found"

```bash
# Solution: Create .env file with valid key
echo "GROQ_API_KEY=your_key_here" > .env
```

**Issue:** BERT model slow to load

```bash
# Solution: Model downloads on first use; subsequent runs are fast
# To pre-download:
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
```

**Issue:** CSV processing fails

```
# Solution: Ensure CSV has the required columns
# Required: source, log_message
# Check file encoding (should be UTF-8)
```

## Performance Tips

- Batch Processing: Use script mode for 100+ logs
- Model Caching: First run may be slow (model download)
- API Optimization: Use connection pooling for multiple requests
- GPU Support: BERT supports CUDA for faster inference
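To act on the batch-processing tip, large inputs can be chunked so the model is loaded once and reused across calls. A generic sketch (the helper name and chunk size are arbitrary, not part of LogEngine):

```python
def in_batches(items, batch_size=100):
    """Yield successive fixed-size chunks of a list of (source, message) logs."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Each batch would then be passed to classify() in one call
logs = [("ModernCRM", f"User {i} logged in.") for i in range(250)]
batch_sizes = [len(batch) for batch in in_batches(logs)]
print(batch_sizes)  # → [100, 100, 50]
```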
## License

This project is licensed under the MIT License. See the LICENSE file for details.
- Documentation: Full API Docs
- Examples: Example Scripts
- Issues: GitHub Issues
- Discussions: GitHub Discussions
## Acknowledgments

- Groq for providing fast LLM inference
- Hugging Face for sentence transformers
- Scikit-learn for ML classification
- FastAPI for the web framework
If you have any questions or feedback, please feel free to contact me at pranjal360agarwal@gmail.com. You can also connect with me on LinkedIn or Twitter. Thank you for visiting my project!
