NoSQL Knowledge Graph Project

A comprehensive real-time knowledge graph system for academic papers with MongoDB Atlas sharding, Apache Kafka streaming, and an interactive Streamlit dashboard. This project implements a complete pipeline for building and visualizing a knowledge graph of academic papers using NoSQL technologies.

🎯 Features

  • Real-time Data Ingestion: Fetches papers from ArXiv, PubMed, and CrossRef APIs
  • Knowledge Graph Construction: Automatically builds relationships between papers, authors, institutions, and concepts
  • Interactive Dashboard: Real-time visualization of knowledge graph with insights and metrics
  • Sharding Analysis: Implements and benchmarks different MongoDB sharding strategies
  • Streaming Pipeline: Kafka-based data processing for scalable ingestion
  • Docker Integration: Complete containerized setup for easy deployment

πŸ—οΈ Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   API       │    │    Kafka    │    │  Knowledge  │
│  Sources    │ -> │   Stream    │ -> │   Graph     │
│ (ArXiv,etc) │    │  Pipeline   │    │  Builder    │
└─────────────┘    └─────────────┘    └─────────────┘
                           │
                           ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Streamlit   │    │  MongoDB    │    │  Sharding   │
│ Dashboard   │ <- │   Atlas     │ <- │ Strategies  │
└─────────────┘    └─────────────┘    └─────────────┘

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose
  • MongoDB Atlas account (for production) or local MongoDB
  • Python 3.8+ (for local development)

1. Clone and Setup

git clone <repository-url>
cd NoSQL_Project
cp .env.example .env

2. Configure Environment

Edit .env file with your MongoDB Atlas credentials:

MONGODB_USER=your_atlas_username
MONGODB_PASS=your_atlas_password
MONGODB_CLUSTER=your_atlas_cluster.mongodb.net
MONGODB_DB=NOSQL

For local development, you can use the provided MongoDB container by leaving the default settings.
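
As an illustration, a connection string can be assembled from these variables along the following lines (a minimal sketch assuming python-dotenv; the actual application may build its URI differently):

import os
from dotenv import load_dotenv

load_dotenv()  # read the MONGODB_* values from .env
uri = (
    f"mongodb+srv://{os.environ['MONGODB_USER']}:{os.environ['MONGODB_PASS']}"
    f"@{os.environ['MONGODB_CLUSTER']}/{os.environ.get('MONGODB_DB', 'NOSQL')}"
)
print(uri)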

3. Start the Complete System

./run.sh

This will start all services:

  • MongoDB (local) or connect to Atlas
  • Apache Kafka with Zookeeper
  • Kafka Producer (API fetcher)
  • Kafka Consumer (KG builder)
  • Streamlit Dashboard
  • Kafka UI
  • MongoDB Express

4. Access the Dashboard

Once all services are up, open the dashboard at http://localhost:8501. Kafka UI is available at http://localhost:8080 and MongoDB Express at http://localhost:8081 (see the Docker Services table below).

🧪 Testing and Benchmarking

MongoDB Atlas Sharding

For production deployment with MongoDB Atlas:

python enable_atlas_sharding.py --connection-string "mongodb+srv://username:password@cluster.mongodb.net/NOSQL"

This script enables sharding for the database and collections with appropriate shard keys.
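
For reference, the underlying MongoDB admin commands look roughly like this (a hedged sketch using pymongo; the shard key fields shown are illustrative, not necessarily the ones the script chooses):

from pymongo import MongoClient

client = MongoClient("mongodb+srv://username:password@cluster.mongodb.net/NOSQL")

# Enable sharding on the database, then shard each collection on a hashed key.
# The key fields (paper_id, node_id, source) are assumptions for illustration.
client.admin.command("enableSharding", "NOSQL")
for coll, key_field in [("papers", "paper_id"), ("nodes", "node_id"), ("edges", "source")]:
    client.admin.command("shardCollection", f"NOSQL.{coll}", key={key_field: "hashed"})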

Sharding Benchmarks

To benchmark different sharding strategies:

python benchmark_sharding.py --connection-string "mongodb+srv://username:password@cluster.mongodb.net/NOSQL"

This generates performance metrics for various operations across different collections and sharding configurations.
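
Conceptually, the benchmark reduces to timing repeated operations, along the lines of this simplified sketch (not the actual benchmark_sharding.py code):

import time
from pymongo import MongoClient

papers = MongoClient("mongodb://localhost:27017")["NOSQL"]["papers"]

start = time.perf_counter()
for i in range(2000):  # matches the default benchmark iteration count
    papers.find_one({"_id": f"paper-{i}"})
elapsed = time.perf_counter() - start
print(f"{2000 / elapsed:.0f} lookups/sec")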

Pipeline Integration Test

To test the complete pipeline integration:

python test_pipeline.py --connection-string "mongodb+srv://username:password@cluster.mongodb.net/NOSQL" --bootstrap-servers "localhost:9092"

This runs the producer and consumer, monitors the process, and verifies that data is flowing correctly through the system.
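
A hand-rolled smoke test of the same flow might look like this (a sketch assuming kafka-python and the topic/collection names from the configuration sections below; the consumer's exact output collections may differ):

import json
import time
from kafka import KafkaProducer
from pymongo import MongoClient

papers = MongoClient("mongodb://localhost:27017")["NOSQL"]["papers"]
before = papers.count_documents({})

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw_papers", {"title": "Pipeline smoke test"})
producer.flush()

time.sleep(10)  # give the consumer time to process the message
assert papers.count_documents({}) > before, "no new documents ingested"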

📊 Dashboard Features

Real-time Knowledge Graph Visualization

  • Interactive network graph showing papers, authors, institutions, and concepts
  • Real-time updates as new data is ingested
  • Configurable node limits, layout types, and display options
  • Node filtering by type and relation filtering
  • Dynamic node sizing based on connections
  • Detailed hover information including abstracts and keywords
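
Under assumed field names (source, target, relation, type — not necessarily the app's real schema), assembling such a graph with NetworkX looks roughly like this:

import networkx as nx
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["NOSQL"]

G = nx.Graph()
for node in db.nodes.find(limit=200):  # cap the view for readability
    G.add_node(node["_id"], type=node.get("type"))
for edge in db.edges.find(limit=500):
    if edge["source"] in G and edge["target"] in G:
        G.add_edge(edge["source"], edge["target"], relation=edge.get("relation"))

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")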

Sharding Performance Analysis

  • Comparison of different sharding strategies:
    • Modulo Hashing
    • Consistent Hashing
    • Range-based Partitioning
  • Performance metrics and benchmarks
  • Load balancing analysis

Data Ingestion Control

  • Manual API fetching controls
  • Real-time pipeline monitoring
  • System status indicators
  • Manual refresh button for immediate data updates

Database Insights

  • Live statistics (papers, nodes, edges)
  • Recent papers table
  • Growth metrics

🔧 Manual Setup (Development)

Install Dependencies

pip install -r requirements.txt

Start Individual Components

  1. Start Local Services:

    docker-compose up -d mongo kafka zookeeper
  2. Start Kafka Producer:

    KAFKA_BOOTSTRAP_SERVERS=localhost:9092 python ingestion/kafka_producer.py
  3. Start Kafka Consumer:

    KAFKA_BOOTSTRAP_SERVERS=localhost:9092 python ingestion/kafka_consumer_kg.py
  4. Start Dashboard:

    streamlit run ui/app.py

πŸ—‚οΈ Sharding Strategies

The project implements and benchmarks three sharding strategies:

1. Modulo Hashing

  • Simple hash-based distribution
  • Good for uniform data distribution
  • Fast routing decisions

2. Consistent Hashing

  • Virtual nodes for better load balancing
  • Minimal data movement when adding/removing shards
  • Better handling of hot spots

3. Range-based Partitioning

  • Partitions based on document properties (e.g., publication year)
  • Good for range queries
  • Natural data organization
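
The routing logic behind the three strategies can be sketched as follows (illustrative code, not the project's exact implementation):

import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def modulo_shard(key: str, num_shards: int = 3) -> int:
    # 1. Modulo hashing: hash the key, take the remainder.
    return _hash(key) % num_shards

class ConsistentHashRing:
    # 2. Consistent hashing: virtual nodes placed on a ring smooth the load,
    #    and adding or removing a shard only remaps neighboring keys.
    def __init__(self, shards, vnodes: int = 100):
        self._ring = sorted(
            (_hash(f"{shard}:{v}"), shard) for shard in shards for v in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

def range_shard(year: int, boundaries=(2010, 2020)) -> int:
    # 3. Range partitioning: route by publication year boundaries.
    return bisect.bisect(boundaries, year)

For example, ConsistentHashRing(["shard0", "shard1", "shard2"]).shard_for("paper-42") returns the shard whose ring point follows the key's hash, which is why adding a shard only remaps the keys between two ring points.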

📈 Performance Benchmarks

Run sharding benchmarks:

python benchmarks/sharding_bench.py

Or use the dashboard's benchmark feature for interactive analysis.

🐳 Docker Services

Service        Port   Description
nosql-app      8501   Main Streamlit application
kafka          9092   Apache Kafka broker
kafka-ui       8080   Kafka management UI
mongo          27017  MongoDB database
mongo-express  8081   MongoDB web interface
zookeeper      2181   Kafka coordination

πŸ“ Project Structure

NoSQL_Project/
├── api/                   # Database connection and API routes
├── benchmarks/            # Sharding performance benchmarks
├── ingestion/             # Kafka producers and consumers
├── kg_builder/            # Knowledge graph construction
├── mongo-init-scripts/    # MongoDB initialization
├── ui/                    # Streamlit dashboard
├── docker-compose.yml     # Complete Docker setup
├── Dockerfile             # Application container
├── requirements.txt       # Python dependencies
└── run.sh                 # Startup script

πŸ” Monitoring

View Logs

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f nosql-app
docker-compose logs -f kafka-producer
docker-compose logs -f kafka-consumer

Check Service Status

docker-compose ps

Stop Services

docker-compose down

πŸ› οΈ Configuration

Kafka Configuration

  • Bootstrap servers (processes on the host): localhost:9092
  • Bootstrap servers (inside Docker containers): kafka:29092
  • Topic: raw_papers
  • Auto-commit: enabled

Set KAFKA_BOOTSTRAP_SERVERS accordingly for producers and consumers: docker-compose.yml already provides kafka:29092 to in-container services, while processes started directly on the host should use localhost:9092.
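
For example, a consumer picking up this setting could look like the following sketch (assuming kafka-python; the group id shown is hypothetical):

import json
import os
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "raw_papers",
    bootstrap_servers=os.getenv("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
    group_id="kg-builder",    # hypothetical group id
    enable_auto_commit=True,  # matches the auto-commit setting above
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value.get("title"))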

MongoDB Configuration

  • Database: NOSQL
  • Collections: papers, nodes, edges
  • Connection: Atlas or local container

Sharding Configuration

  • Default shards: 3
  • Strategies: modulo, consistent, range
  • Benchmark iterations: 2000

🚨 Troubleshooting

Common Issues

  1. Docker not starting: Ensure Docker is running and has sufficient resources
  2. MongoDB connection failed: Check Atlas credentials in .env file
  3. Kafka connection timeout: Wait for Kafka to fully initialize (30-60 seconds)
  4. Port conflicts: Check if ports 8501, 9092, 27017 are available

Reset Everything

docker-compose down -v
docker system prune -f
./run.sh

📚 API Data Sources

  • ArXiv: Academic preprints in physics, mathematics, and computer science
  • PubMed: Biomedical literature database
  • CrossRef: Scholarly publication metadata

πŸŽ›οΈ Environment Variables

Variable                 Description                        Default
MONGODB_URI              MongoDB connection string          Atlas or local
KAFKA_BOOTSTRAP_SERVERS  Kafka brokers                      localhost:9092
NUM_SHARDS               Number of shards for benchmarking  3
BATCH_SIZE               API fetch batch size               10

📊 Performance Metrics

The system tracks:

  • Query response times
  • Throughput (operations/second)
  • Load balancing efficiency
  • Data distribution patterns
  • Real-time ingestion rates

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review Docker and service logs
  3. Ensure all prerequisites are met
  4. Verify environment configuration

Built with: Python, Streamlit, MongoDB Atlas, Apache Kafka, Docker, NetworkX, Plotly
