A comprehensive real-time knowledge graph system for academic papers with MongoDB Atlas sharding, Apache Kafka streaming, and interactive Streamlit dashboard. This project implements a complete pipeline for building and visualizing a knowledge graph of academic papers using NoSQL technologies.
Key features:

- Real-time Data Ingestion: Fetches papers from ArXiv, PubMed, and CrossRef APIs
- Knowledge Graph Construction: Automatically builds relationships between papers, authors, institutions, and concepts
- Interactive Dashboard: Real-time visualization of knowledge graph with insights and metrics
- Sharding Analysis: Implements and benchmarks different MongoDB sharding strategies
- Streaming Pipeline: Kafka-based data processing for scalable ingestion
- Docker Integration: Complete containerized setup for easy deployment
Architecture:

```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│     API     │      │    Kafka    │      │  Knowledge  │
│   Sources   │  ->  │   Stream    │  ->  │    Graph    │
│ (ArXiv,etc) │      │  Pipeline   │      │   Builder   │
└─────────────┘      └─────────────┘      └─────────────┘
                                                 │
                                                 ▼
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  Streamlit  │      │   MongoDB   │      │  Sharding   │
│  Dashboard  │  <-  │    Atlas    │  <-  │ Strategies  │
└─────────────┘      └─────────────┘      └─────────────┘
```
Prerequisites:

- Docker and Docker Compose
- MongoDB Atlas account (for production) or local MongoDB
- Python 3.8+ (for local development)
Clone the repository:

```bash
git clone <repository-url>
cd NoSQL_Project
```

Create your environment file from the template:

```bash
cp .env.example .env
```

Edit the `.env` file with your MongoDB Atlas credentials:

```
MONGODB_USER=your_atlas_username
MONGODB_PASS=your_atlas_password
MONGODB_CLUSTER=your_atlas_cluster.mongodb.net
MONGODB_DB=NOSQL
```

For local development, you can use the provided MongoDB container by leaving the default settings.
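For illustration only (the project's actual connection logic lives in `api/`), the variables above might be assembled into an Atlas connection string roughly like this:

```python
import os
from pymongo import MongoClient

# Hypothetical sketch: build the URI from the .env values above.
user = os.getenv("MONGODB_USER")
password = os.getenv("MONGODB_PASS")
cluster = os.getenv("MONGODB_CLUSTER")  # e.g. your_atlas_cluster.mongodb.net
db_name = os.getenv("MONGODB_DB", "NOSQL")

uri = f"mongodb+srv://{user}:{password}@{cluster}/{db_name}"
client = MongoClient(uri)
client.admin.command("ping")  # fails fast if the credentials are wrong
```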
Start everything with the provided script:

```bash
./run.sh
```

This will start all services:
- MongoDB (local) or connect to Atlas
- Apache Kafka with Zookeeper
- Kafka Producer (API fetcher)
- Kafka Consumer (KG builder)
- Streamlit Dashboard
- Kafka UI
- MongoDB Express
Once the services are up, access them at:

- Streamlit Dashboard: http://localhost:8501
- Kafka UI: http://localhost:8080
- MongoDB Express: http://localhost:8081
For production deployment with MongoDB Atlas:
```bash
python enable_atlas_sharding.py --connection-string "mongodb+srv://username:password@cluster.mongodb.net/NOSQL"
```

This script enables sharding for the database and collections with appropriate shard keys.
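Under the hood, enabling sharding with PyMongo comes down to two admin commands. This is a minimal sketch with an illustrative hashed shard key, not necessarily the keys the script chooses:

```python
from pymongo import MongoClient

# Connect through mongos / Atlas; sharding commands run against admin.
client = MongoClient("mongodb+srv://username:password@cluster.mongodb.net/")

# Step 1: enable sharding on the database.
client.admin.command("enableSharding", "NOSQL")

# Step 2: shard a collection. The hashed _id key is illustrative only;
# the script picks its own shard keys per collection.
client.admin.command("shardCollection", "NOSQL.papers", key={"_id": "hashed"})
```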
To benchmark different sharding strategies:
```bash
python benchmark_sharding.py --connection-string "mongodb+srv://username:password@cluster.mongodb.net/NOSQL"
```

This generates performance metrics for various operations across different collections and sharding configurations.
To test the complete pipeline integration:
```bash
python test_pipeline.py --connection-string "mongodb+srv://username:password@cluster.mongodb.net/NOSQL" --bootstrap-servers "localhost:9092"
```

This runs the producer and consumer, monitors the process, and verifies that data is flowing correctly through the system.
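For a quick manual check that data reached MongoDB, you can count documents in the three collections (a minimal sketch; the local URI is an assumption, and the collection names match the configuration section below):

```python
from pymongo import MongoClient

# Assumes a local MongoDB; swap in your Atlas URI for production.
client = MongoClient("mongodb://localhost:27017")
db = client["NOSQL"]

# If the pipeline is flowing, all three collections should be growing.
for name in ("papers", "nodes", "edges"):
    print(f"{name}: {db[name].count_documents({})} documents")
```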
The dashboard's knowledge graph view provides:

- Interactive network graph showing papers, authors, institutions, and concepts
- Real-time updates as new data is ingested
- Configurable node limits, layout types, and display options
- Node filtering by type and relation filtering
- Dynamic node sizing based on connections
- Detailed hover information including abstracts and keywords
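The dashboard is built on NetworkX and Plotly; as a rough sketch of the underlying idea (with made-up node IDs, not the dashboard's actual code), node sizes can be derived from connection counts like this:

```python
import networkx as nx

# Toy graph: one paper connected to an author and a concept.
G = nx.Graph()
G.add_node("paper:2401.01234", type="paper")
G.add_node("author:Ada Lovelace", type="author")
G.add_node("concept:knowledge graphs", type="concept")
G.add_edge("paper:2401.01234", "author:Ada Lovelace", relation="authored_by")
G.add_edge("paper:2401.01234", "concept:knowledge graphs", relation="mentions")

# "Dynamic node sizing based on connections": scale size by degree.
sizes = {n: 10 + 20 * G.degree(n) for n in G.nodes}

# One of several possible layouts; the dashboard makes this configurable.
pos = nx.spring_layout(G, seed=42)
print(sizes)
```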
The sharding analysis view provides:

- Comparison of sharding strategies:
  - Modulo Hashing
  - Consistent Hashing
  - Range-based Partitioning
- Performance metrics and benchmarks
- Load balancing analysis
Pipeline controls include:

- Manual API fetching
- Real-time pipeline monitoring
- System status indicators
- Manual refresh button for immediate data updates
Live statistics include:

- Paper, node, and edge counts
- Recent papers table
- Growth metrics
To run the pipeline components manually for development:

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Start local services:

  ```bash
  docker-compose up -d mongo kafka zookeeper
  ```

- Start the Kafka producer:

  ```bash
  KAFKA_BOOTSTRAP_SERVERS=localhost:9092 python ingestion/kafka_producer.py
  ```

- Start the Kafka consumer:

  ```bash
  KAFKA_BOOTSTRAP_SERVERS=localhost:9092 python ingestion/kafka_consumer_kg.py
  ```

- Start the dashboard:

  ```bash
  streamlit run ui/app.py
  ```
The project implements and benchmarks three sharding strategies:
Modulo Hashing:

- Simple hash-based distribution
- Good for uniform data distribution
- Fast routing decisions

Consistent Hashing:

- Virtual nodes for better load balancing
- Minimal data movement when adding or removing shards
- Better handling of hot spots

Range-based Partitioning:

- Partitions documents by a property (e.g., publication year)
- Good for range queries
- Natural data organization
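For intuition, here is a compact sketch of how the first two strategies map a key to a shard; the actual implementation in `benchmarks/` may differ in detail:

```python
import bisect
import hashlib

def modulo_shard(key: str, num_shards: int = 3) -> int:
    """Modulo hashing: shard = hash(key) mod N. Fast routing, but
    adding a shard remaps almost every key."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_shards

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: only ~1/N of keys move
    when a shard is added or removed, and vnodes smooth out hot spots."""
    def __init__(self, shards, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in shards for v in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard0", "shard1", "shard2"])
print(modulo_shard("paper:2401.01234"), ring.shard_for("paper:2401.01234"))
```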
Run sharding benchmarks:
```bash
python benchmarks/sharding_bench.py
```

Or use the dashboard's benchmark feature for interactive analysis.
| Service | Port | Description |
|---|---|---|
| nosql-app | 8501 | Main Streamlit application |
| kafka | 9092 | Apache Kafka broker |
| kafka-ui | 8080 | Kafka management UI |
| mongo | 27017 | MongoDB database |
| mongo-express | 8081 | MongoDB web interface |
| zookeeper | 2181 | Kafka coordination |
```
NoSQL_Project/
├── api/                  # Database connection and API routes
├── benchmarks/           # Sharding performance benchmarks
├── ingestion/            # Kafka producers and consumers
├── kg_builder/           # Knowledge graph construction
├── mongo-init-scripts/   # MongoDB initialization
├── ui/                   # Streamlit dashboard
├── docker-compose.yml    # Complete Docker setup
├── Dockerfile            # Application container
├── requirements.txt      # Python dependencies
└── run.sh                # Startup script
```
View logs:

```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f nosql-app
docker-compose logs -f kafka-producer
docker-compose logs -f kafka-consumer
```

Check service status:

```bash
docker-compose ps
```

Stop all services:

```bash
docker-compose down
```

Kafka configuration:

- Bootstrap servers (local host apps): `localhost:9092`
- Bootstrap servers (inside Docker containers): `kafka:29092`
- Topic: `raw_papers`
- Auto-commit: enabled
Set `KAFKA_BOOTSTRAP_SERVERS` accordingly for producers and consumers: the compose file sets `kafka:29092` for in-container services, while processes running locally use `localhost:9092`.
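The real producer and consumer live in `ingestion/`; the following is a minimal sketch of the pattern using the kafka-python client, with illustrative message fields:

```python
import json
import os
from kafka import KafkaConsumer, KafkaProducer

servers = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")

# Producer side: publish one paper record to the raw_papers topic.
producer = KafkaProducer(
    bootstrap_servers=servers,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw_papers", {"id": "2401.01234", "title": "Example paper"})
producer.flush()

# Consumer side: read records with auto-commit enabled, as configured above.
consumer = KafkaConsumer(
    "raw_papers",
    bootstrap_servers=servers,
    enable_auto_commit=True,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # hand off to the KG builder here
    break
```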
MongoDB configuration:

- Database: `NOSQL`
- Collections: `papers`, `nodes`, `edges`
- Connection: Atlas or local container
Sharding configuration:

- Default shards: 3
- Strategies: modulo, consistent, range
- Benchmark iterations: 2000
Common issues:

- Docker not starting: ensure Docker is running and has sufficient resources
- MongoDB connection failed: check the Atlas credentials in the `.env` file
- Kafka connection timeout: wait for Kafka to fully initialize (30-60 seconds)
- Port conflicts: check if ports 8501, 9092, and 27017 are available
To reset everything:

```bash
docker-compose down -v
docker system prune -f
./run.sh
```

Data sources:

- ArXiv: Academic preprints in physics, mathematics, and computer science
- PubMed: Biomedical literature database
- CrossRef: Scholarly publication metadata
| Variable | Description | Default |
|---|---|---|
| `MONGODB_URI` | MongoDB connection string | Atlas or local |
| `KAFKA_BOOTSTRAP_SERVERS` | Kafka brokers | `localhost:9092` |
| `NUM_SHARDS` | Number of shards for benchmarking | 3 |
| `BATCH_SIZE` | API fetch batch size | 10 |
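A typical way to read these variables with the defaults from the table (the local MongoDB URI shown as a fallback is an assumption, not the project's exact default):

```python
import os

# Fall back to the defaults listed in the table above.
MONGODB_URI = os.getenv("MONGODB_URI", "mongodb://localhost:27017/NOSQL")
KAFKA_BOOTSTRAP_SERVERS = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
NUM_SHARDS = int(os.getenv("NUM_SHARDS", "3"))
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "10"))
```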
The system tracks:
- Query response times
- Throughput (operations/second)
- Load balancing efficiency
- Data distribution patterns
- Real-time ingestion rates
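These metrics reduce to timing batches of operations; a minimal, generic sketch of the idea:

```python
import statistics
import time

def benchmark(op, iterations: int = 2000):
    """Run `op` repeatedly; return (ops/sec, mean latency in ms)."""
    latencies = []
    start = time.perf_counter()
    for _ in range(iterations):
        t0 = time.perf_counter()
        op()  # e.g. a find_one() or insert_one() against a shard
        latencies.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start
    return iterations / elapsed, statistics.mean(latencies)

ops_per_sec, mean_ms = benchmark(lambda: sum(range(1000)))
print(f"{ops_per_sec:,.0f} ops/s, {mean_ms:.3f} ms mean latency")
```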
To contribute:

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the troubleshooting section
- Review Docker and service logs
- Ensure all prerequisites are met
- Verify environment configuration
Built with: Python, Streamlit, MongoDB Atlas, Apache Kafka, Docker, NetworkX, Plotly