A comprehensive real-time knowledge graph system for academic papers with MongoDB Atlas sharding, Apache Kafka streaming, and interactive Streamlit dashboard. This project implements a complete pipeline for building and visualizing a knowledge graph of academic papers using NoSQL technologies.
Key features:

- Real-time Data Ingestion: Fetches papers from ArXiv, PubMed, and CrossRef APIs
- Knowledge Graph Construction: Automatically builds relationships between papers, authors, institutions, and concepts
- Interactive Dashboard: Real-time visualization of knowledge graph with insights and metrics
- Sharding Analysis: Implements and benchmarks different MongoDB sharding strategies
- Streaming Pipeline: Kafka-based data processing for scalable ingestion
- Docker Integration: Complete containerized setup for easy deployment
Architecture:

```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│     API     │      │    Kafka    │      │  Knowledge  │
│   Sources   │  ->  │   Stream    │  ->  │    Graph    │
│ (ArXiv,etc) │      │  Pipeline   │      │   Builder   │
└─────────────┘      └─────────────┘      └─────────────┘
                                                 │
                                                 ▼
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  Streamlit  │      │   MongoDB   │      │  Sharding   │
│  Dashboard  │  <-  │    Atlas    │  <-  │ Strategies  │
└─────────────┘      └─────────────┘      └─────────────┘
```
Prerequisites:

- Docker and Docker Compose
- MongoDB Atlas account (for production) or local MongoDB
- Python 3.8+ (for local development)
Clone the repository:

```bash
git clone <repository-url>
cd NoSQL_Project
```

Create your environment file from the template:

```bash
cp .env.example .env
```

Edit the `.env` file with your MongoDB Atlas credentials:

```
MONGODB_USER=your_atlas_username
MONGODB_PASS=your_atlas_password
MONGODB_CLUSTER=your_atlas_cluster.mongodb.net
MONGODB_DB=NOSQL
```

For local development, you can use the provided MongoDB container by leaving the default settings.
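For illustration only (the project's actual connection logic lives in `api/`), the variables above might be assembled into an Atlas connection string roughly like this:

```python
import os
from pymongo import MongoClient

# Hypothetical sketch: build the URI from the .env values above.
user = os.getenv("MONGODB_USER")
password = os.getenv("MONGODB_PASS")
cluster = os.getenv("MONGODB_CLUSTER")  # e.g. your_atlas_cluster.mongodb.net
db_name = os.getenv("MONGODB_DB", "NOSQL")

uri = f"mongodb+srv://{user}:{password}@{cluster}/{db_name}"
client = MongoClient(uri)
client.admin.command("ping")  # fails fast if the credentials are wrong
```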
Start everything with the provided script:

```bash
./run.sh
```

This will start all services:
- MongoDB (local) or connect to Atlas
- Apache Kafka with Zookeeper
- Kafka Producer (API fetcher)
- Kafka Consumer (KG builder)
- Streamlit Dashboard
- Kafka UI
- MongoDB Express
Once the services are up, access them at:

- Streamlit Dashboard: http://localhost:8501
- Kafka UI: http://localhost:8080
- MongoDB Express: http://localhost:8081
For production deployment with MongoDB Atlas:
```bash
python enable_atlas_sharding.py --connection-string "mongodb+srv://username:password@cluster.mongodb.net/NOSQL"
```

This script enables sharding for the database and collections with appropriate shard keys.
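Under the hood, enabling sharding with PyMongo comes down to two admin commands. This is a minimal sketch with an illustrative hashed shard key, not necessarily the keys the script chooses:

```python
from pymongo import MongoClient

# Connect through mongos / Atlas; sharding commands run against admin.
client = MongoClient("mongodb+srv://username:password@cluster.mongodb.net/")

# Step 1: enable sharding on the database.
client.admin.command("enableSharding", "NOSQL")

# Step 2: shard a collection. The hashed _id key is illustrative only;
# the script picks its own shard keys per collection.
client.admin.command("shardCollection", "NOSQL.papers", key={"_id": "hashed"})
```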
To benchmark different sharding strategies:
```bash
python benchmark_sharding.py --connection-string "mongodb+srv://username:password@cluster.mongodb.net/NOSQL"
```

This generates performance metrics for various operations across different collections and sharding configurations.
To test the complete pipeline integration:
```bash
python test_pipeline.py --connection-string "mongodb+srv://username:password@cluster.mongodb.net/NOSQL" --bootstrap-servers "localhost:9092"
```

This runs the producer and consumer, monitors the process, and verifies that data is flowing correctly through the system.
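For a quick manual check that data reached MongoDB, you can count documents in the three collections (a minimal sketch; the local URI is an assumption, and the collection names match the configuration section below):

```python
from pymongo import MongoClient

# Assumes a local MongoDB; swap in your Atlas URI for production.
client = MongoClient("mongodb://localhost:27017")
db = client["NOSQL"]

# If the pipeline is flowing, all three collections should be growing.
for name in ("papers", "nodes", "edges"):
    print(f"{name}: {db[name].count_documents({})} documents")
```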
The dashboard's knowledge graph view provides:

- Interactive network graph showing papers, authors, institutions, and concepts
- Real-time updates as new data is ingested
- Configurable node limits, layout types, and display options
- Node filtering by type and relation filtering
- Dynamic node sizing based on connections
- Detailed hover information including abstracts and keywords
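The dashboard is built on NetworkX and Plotly; as a rough sketch of the underlying idea (with made-up node IDs, not the dashboard's actual code), node sizes can be derived from connection counts like this:

```python
import networkx as nx

# Toy graph: one paper connected to an author and a concept.
G = nx.Graph()
G.add_node("paper:2401.01234", type="paper")
G.add_node("author:Ada Lovelace", type="author")
G.add_node("concept:knowledge graphs", type="concept")
G.add_edge("paper:2401.01234", "author:Ada Lovelace", relation="authored_by")
G.add_edge("paper:2401.01234", "concept:knowledge graphs", relation="mentions")

# "Dynamic node sizing based on connections": scale size by degree.
sizes = {n: 10 + 20 * G.degree(n) for n in G.nodes}

# One of several possible layouts; the dashboard makes this configurable.
pos = nx.spring_layout(G, seed=42)
print(sizes)
```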
The sharding analysis view provides:

- Comparison of sharding strategies:
  - Modulo Hashing
  - Consistent Hashing
  - Range-based Partitioning
- Performance metrics and benchmarks
- Load balancing analysis
Pipeline controls include:

- Manual API fetching
- Real-time pipeline monitoring
- System status indicators
- Manual refresh button for immediate data updates
Live statistics include:

- Paper, node, and edge counts
- Recent papers table
- Growth metrics
To run the pipeline components manually for development:

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Start local services:

  ```bash
  docker-compose up -d mongo kafka zookeeper
  ```

- Start the Kafka producer:

  ```bash
  KAFKA_BOOTSTRAP_SERVERS=localhost:9092 python ingestion/kafka_producer.py
  ```

- Start the Kafka consumer:

  ```bash
  KAFKA_BOOTSTRAP_SERVERS=localhost:9092 python ingestion/kafka_consumer_kg.py
  ```

- Start the dashboard:

  ```bash
  streamlit run ui/app.py
  ```
The project implements and benchmarks three sharding strategies:
Modulo Hashing:

- Simple hash-based distribution
- Good for uniform data distribution
- Fast routing decisions

Consistent Hashing:

- Virtual nodes for better load balancing
- Minimal data movement when adding or removing shards
- Better handling of hot spots

Range-based Partitioning:

- Partitions documents by a property (e.g., publication year)
- Good for range queries
- Natural data organization
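For intuition, here is a compact sketch of how the first two strategies map a key to a shard; the actual implementation in `benchmarks/` may differ in detail:

```python
import bisect
import hashlib

def modulo_shard(key: str, num_shards: int = 3) -> int:
    """Modulo hashing: shard = hash(key) mod N. Fast routing, but
    adding a shard remaps almost every key."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_shards

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: only ~1/N of keys move
    when a shard is added or removed, and vnodes smooth out hot spots."""
    def __init__(self, shards, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in shards for v in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard0", "shard1", "shard2"])
print(modulo_shard("paper:2401.01234"), ring.shard_for("paper:2401.01234"))
```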
Run sharding benchmarks:
```bash
python benchmarks/sharding_bench.py
```

Or use the dashboard's benchmark feature for interactive analysis.
| Service | Port | Description |
|---|---|---|
| nosql-app | 8501 | Main Streamlit application |
| kafka | 9092 | Apache Kafka broker |
| kafka-ui | 8080 | Kafka management UI |
| mongo | 27017 | MongoDB database |
| mongo-express | 8081 | MongoDB web interface |
| zookeeper | 2181 | Kafka coordination |
```
NoSQL_Project/
├── api/                  # Database connection and API routes
├── benchmarks/           # Sharding performance benchmarks
├── ingestion/            # Kafka producers and consumers
├── kg_builder/           # Knowledge graph construction
├── mongo-init-scripts/   # MongoDB initialization
├── ui/                   # Streamlit dashboard
├── docker-compose.yml    # Complete Docker setup
├── Dockerfile            # Application container
├── requirements.txt      # Python dependencies
└── run.sh                # Startup script
```
View logs:

```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f nosql-app
docker-compose logs -f kafka-producer
docker-compose logs -f kafka-consumer
```

Check service status:

```bash
docker-compose ps
```

Stop all services:

```bash
docker-compose down
```

Kafka configuration:

- Bootstrap servers (local host apps): `localhost:9092`
- Bootstrap servers (inside Docker containers): `kafka:29092`
- Topic: `raw_papers`
- Auto-commit: enabled
Set `KAFKA_BOOTSTRAP_SERVERS` accordingly for producers and consumers: the compose file sets `kafka:29092` for in-container services, while processes running locally use `localhost:9092`.
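The real producer and consumer live in `ingestion/`; the following is a minimal sketch of the pattern using the kafka-python client, with illustrative message fields:

```python
import json
import os
from kafka import KafkaConsumer, KafkaProducer

servers = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")

# Producer side: publish one paper record to the raw_papers topic.
producer = KafkaProducer(
    bootstrap_servers=servers,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw_papers", {"id": "2401.01234", "title": "Example paper"})
producer.flush()

# Consumer side: read records with auto-commit enabled, as configured above.
consumer = KafkaConsumer(
    "raw_papers",
    bootstrap_servers=servers,
    enable_auto_commit=True,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # hand off to the KG builder here
    break
```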
MongoDB configuration:

- Database: `NOSQL`
- Collections: `papers`, `nodes`, `edges`
- Connection: Atlas or local container
Sharding configuration:

- Default shards: 3
- Strategies: modulo, consistent, range
- Benchmark iterations: 2000
Common issues:

- Docker not starting: ensure Docker is running and has sufficient resources
- MongoDB connection failed: check the Atlas credentials in the `.env` file
- Kafka connection timeout: wait for Kafka to fully initialize (30-60 seconds)
- Port conflicts: check if ports 8501, 9092, and 27017 are available
To reset everything:

```bash
docker-compose down -v
docker system prune -f
./run.sh
```

Data sources:

- ArXiv: Academic preprints in physics, mathematics, and computer science
- PubMed: Biomedical literature database
- CrossRef: Scholarly publication metadata
| Variable | Description | Default |
|---|---|---|
| `MONGODB_URI` | MongoDB connection string | Atlas or local |
| `KAFKA_BOOTSTRAP_SERVERS` | Kafka brokers | `localhost:9092` |
| `NUM_SHARDS` | Number of shards for benchmarking | 3 |
| `BATCH_SIZE` | API fetch batch size | 10 |
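A typical way to read these variables with the defaults from the table (the local MongoDB URI shown as a fallback is an assumption, not the project's exact default):

```python
import os

# Fall back to the defaults listed in the table above.
MONGODB_URI = os.getenv("MONGODB_URI", "mongodb://localhost:27017/NOSQL")
KAFKA_BOOTSTRAP_SERVERS = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
NUM_SHARDS = int(os.getenv("NUM_SHARDS", "3"))
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "10"))
```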
The system tracks:
- Query response times
- Throughput (operations/second)
- Load balancing efficiency
- Data distribution patterns
- Real-time ingestion rates
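These metrics reduce to timing batches of operations; a minimal, generic sketch of the idea:

```python
import statistics
import time

def benchmark(op, iterations: int = 2000):
    """Run `op` repeatedly; return (ops/sec, mean latency in ms)."""
    latencies = []
    start = time.perf_counter()
    for _ in range(iterations):
        t0 = time.perf_counter()
        op()  # e.g. a find_one() or insert_one() against a shard
        latencies.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start
    return iterations / elapsed, statistics.mean(latencies)

ops_per_sec, mean_ms = benchmark(lambda: sum(range(1000)))
print(f"{ops_per_sec:,.0f} ops/s, {mean_ms:.3f} ms mean latency")
```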
To contribute:

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the troubleshooting section
- Review Docker and service logs
- Ensure all prerequisites are met
- Verify environment configuration
Built with: Python, Streamlit, MongoDB Atlas, Apache Kafka, Docker, NetworkX, Plotly