A distributed, high-throughput article generation system that produces Markdown-based news stories about U.S. congressional bills using only structured data from the Congress.gov API.
This system implements a Retrieval-Augmented Generation (RAG) pipeline that:
- Fetches data for 10 specific congressional bills from the Congress.gov API
- Answers 7 fixed questions for each bill using open-source LLMs
- Generates short, news-style articles in Markdown format with hyperlinks
- Outputs a structured JSON file containing all generated articles
- Uses a Kafka-like distributed task system for fast, scalable article creation
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  Congress.gov    │────▶│  Redpanda/Kafka  │────▶│   Redis State    │
│       API        │     │   Message Bus    │     │     Manager      │
└──────────────────┘     └──────────────────┘     └──────────────────┘
                                   │
                                   ▼
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│ Question Workers │◀────│   Message Bus    │────▶│  Link Checkers   │
│   (Answer Q&A)   │     │                  │     │ (Validate URLs)  │
└──────────────────┘     └──────────────────┘     └──────────────────┘
         │
         ▼
┌──────────────────┐
│Article Generator │
│ (Final Article)  │
└──────────────────┘
         │
         ▼
┌──────────────────┐
│   Output JSON    │
│  (10 Articles)   │
└──────────────────┘
- Controller: Main orchestrator that manages the pipeline
- Question Workers: Answer the 7 required questions for each bill
- Link Checkers: Validate all URLs return HTTP 200
- Article Generator: Assembles final Markdown articles
- State Manager: Tracks task completion using Redis
- Congress API Client: Fetches and caches data from Congress.gov
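As an illustration of what the Link Checkers do, a URL-liveness check can be as simple as the sketch below. The function name and signature are hypothetical, not the actual `link_checker.py` API:

```python
import urllib.error
import urllib.request

def url_is_live(url: str, timeout: float = 10.0) -> bool:
    """Return True only if the URL answers an HTTP HEAD request with 200."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, ValueError):
        # DNS failures, timeouts, non-2xx responses, and malformed URLs
        # all count as a broken link
        return False
```

The real worker additionally needs retry logic and rate limiting, since Congress.gov links are checked in bulk.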
- Python 3.12+
- Docker & Docker Compose
- 8GB+ RAM recommended
- Congress.gov API key
- Local LLM runtime (Ollama) with the qwen2.5:7b model pulled and running
Use the provided script to set up everything (venv, dependencies, Docker services) and run the pipeline end-to-end:
./Run_Me.sh

Note: if needed, make it executable first with chmod +x Run_Me.sh.
1. Clone and set up the environment

   git clone https://github.com/ronak4/Ronak-Raisingani-RAG
   cd newsGen
   python -m venv venv
   source venv/bin/activate   # On Windows: venv\Scripts\activate
   pip install -r requirements.txt

2. Configure the environment

   cp config.example.env .env
   # Edit .env with your API keys

3. Start the infrastructure

   docker-compose up -d

4. Run the pipeline (manual alternative to the script above)

   python run_integrated_pipeline.py
The system will process all 10 bills and generate articles in approximately 9-10 minutes:
RAG News Generation - Integrated Pipeline
================================================================================
================================================================================
Initializing integrated pipeline...
Created 10 workers
- 8 question workers
- 1 link checker
- 1 article generator
Starting all workers...
All workers started successfully
================================================================================
Starting news generation pipeline...
Processing 10 bills with 70 total questions...
================================================================================
================================================================================
PROGRESS: Articles 0/10 (0.0%) | Tasks: 0/70 | Speed: 0.00/s | Time Elapsed: 0m 18s
H.R.1 0/7
H.R.5371 0/7
H.R.5401 0/7
S.2296 0/7
S.24 0/7
S.2882 0/7
S.499 0/7
S.RES.412 0/7
H.RES.353 0/7
H.R.1968 0/7
================================================================================
- Target Bills: 10 congressional bills
- Questions per Bill: 7 required questions
- Total Tasks: 70 question-answer pairs
- Expected Completion: ~9-10 minutes
- Throughput: ~0.11 tasks/second
- Success Rate: 100% (with retry logic)
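These figures are mutually consistent; a quick sanity check on the numbers above:

```python
total_tasks = 70     # 10 bills x 7 questions
throughput = 0.11    # tasks per second, as reported
expected_seconds = total_tasks / throughput
print(f"~{expected_seconds / 60:.1f} minutes")
```

This comes out near the ~9-10 minute estimate; the small gap is rounding in the throughput figure (the measured 9m 47s run implies roughly 0.12 tasks/second).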
newsGen/
├── src/
│   ├── controller.py              # Main pipeline controller
│   ├── services/
│   │   ├── congress_api.py        # Congress.gov API client
│   │   ├── ai_service.py          # AI LLM service integration
│   │   └── state_manager.py       # Redis state management
│   ├── workers/
│   │   ├── question_worker.py     # Question answering worker
│   │   ├── link_checker.py        # URL validation worker
│   │   └── article_generator.py   # Article assembly worker
│   └── utils/
│       ├── schemas.py             # Data models and schemas
│       ├── kafka_client_simple.py # Kafka client utilities
│       └── performance_monitor.py # Performance tracking
├── output/
│   └── articles.json              # Generated articles output
├── cache/                         # API response cache
├── docker-compose.yml             # Infrastructure setup
├── Dockerfile                     # Container configuration
├── requirements.txt               # Python dependencies
└── run_integrated_pipeline.py     # Main execution script
# Required
CONGRESS_API_KEY=your_congress_api_key
# Optional
REDIS_HOST=localhost
REDIS_PORT=6379
KAFKA_BOOTSTRAP_SERVERS=localhost:19092

The system is optimized for stability and performance:
- Question Workers: 8 workers (parallel processing)
- Link Checkers: 1 worker
- Article Generator: 1 worker
- Concurrent Tasks per Worker: up to 12
- Timeout: 180 seconds per LLM call
- Retry Logic: 3 attempts with exponential backoff
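The retry behavior described above can be sketched as follows (illustrative; the actual worker code may structure this differently):

```python
import random
import time

def with_retries(call, attempts: int = 3, base_delay: float = 1.0):
    """Run `call`, retrying on failure with exponential backoff.

    Delays grow as base_delay * 2**attempt (1s, 2s, 4s by default),
    plus a little jitter to avoid synchronized retry storms.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

In this design the 180-second per-call timeout is enforced inside `call()` itself, so a hung LLM request counts as one failed attempt.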
Each bill is analyzed for these 7 questions:
- What does this bill do? Where is it in the process?
- What committees is this bill in?
- Who is the sponsor?
- Who cosponsored this bill? Are any of the cosponsors on the committee that the bill is in?
- Have any hearings happened on the bill? If so, what were the findings?
- Have any amendments been proposed on the bill? If so, who proposed them and what do they do?
- Have any votes happened on the bill? If so, was it a party-line vote or a bipartisan one?
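Conceptually, the 70-task work queue is just the cross product of the 10 bills and these 7 questions. A sketch (the real task schema lives in src/utils/schemas.py):

```python
BILLS = ["H.R.1", "H.R.5371", "H.R.5401", "S.2296", "S.24",
         "S.2882", "S.499", "S.RES.412", "H.RES.353", "H.R.1968"]

QUESTIONS = [
    "What does this bill do? Where is it in the process?",
    "What committees is this bill in?",
    "Who is the sponsor?",
    "Who cosponsored this bill? Are any of the cosponsors on the "
    "committee that the bill is in?",
    "Have any hearings happened on the bill? If so, what were the findings?",
    "Have any amendments been proposed on the bill? If so, who proposed "
    "them and what do they do?",
    "Have any votes happened on the bill? If so, was it a party-line vote "
    "or a bipartisan one?",
]

# Fan out one task per (bill, question) pair
tasks = [{"bill_id": b, "question": q} for b in BILLS for q in QUESTIONS]
print(len(tasks))  # 70
```

Each dict becomes one message on the bus, which is what lets the 8 question workers drain the queue in parallel.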
Articles are saved to output/articles.json with this schema:
[
{
"bill_id": "H.R.1",
"bill_title": "Lower Energy Costs Act",
"sponsor_bioguide_id": "S001176",
"bill_committee_ids": ["hsii00", "hsif00", "hspw00", "hsbu00", "hsag00"],
"article_content": "In the ongoing debate over energy costs, Rep. Steve Scalise [R-LA-1] has introduced H.R. 1, the Lower Energy Costs Act, aiming to alleviate financial burdens on American families and businesses..."
}
]

Run the smoke test to verify system functionality:

python -m pytest tests/ -v

The system includes full Docker containerization:
# Build and run
docker-compose up --build
# Run in container
docker run -it --rm newsgen python run_integrated_pipeline.py

The system provides real-time progress monitoring:
- Progress Updates: Every 15 seconds
- Task Tracking: Redis-based state management
- Performance Metrics: Throughput and completion times
- Error Handling: Automatic retry with exponential backoff
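The Redis-based task tracking boils down to per-bill completion counters. In this sketch a plain dict stands in for Redis, and the key names are illustrative; with redis-py the same layout would use `r.incr(key)` and `r.get(key)`:

```python
state: dict[str, int] = {}  # stand-in for Redis counters

def mark_answered(bill_id: str) -> int:
    """Record one completed question for a bill; return the new count."""
    key = f"bill:{bill_id}:answered"
    state[key] = state.get(key, 0) + 1
    return state[key]

def bill_complete(bill_id: str, questions_per_bill: int = 7) -> bool:
    """A bill's article can be generated once all 7 answers are in."""
    return state.get(f"bill:{bill_id}:answered", 0) >= questions_per_bill

for _ in range(7):
    mark_answered("H.R.1")
print(bill_complete("H.R.1"))  # True
```

Keeping these counters in Redis rather than in worker memory is what allows the progress display and gives the pipeline fault tolerance: a restarted worker sees the same counts.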
- API Rate Limits: System includes built-in rate limiting and retry logic
- Memory Usage: Optimized for 8GB+ RAM systems
- Network Timeouts: 180-second timeout with retry logic
- Docker Issues: Ensure Docker Desktop is running
Enable verbose logging:
python run_integrated_pipeline.py --verbose

Latest Run: 9m 47s for 10 articles
- Throughput: 0.11 tasks/second
- Success Rate: 100%
- Memory Usage: ~8GB peak
- API Calls: 70+ Congress.gov requests (cached)
The system is optimized for both speed and accuracy through:
- Parallel Processing: Multiple workers handle different tasks simultaneously
- Intelligent Caching: API responses cached to avoid rate limits
- Balanced Concurrency: Optimized worker count prevents API overload
- Robust Error Handling: Automatic retry with exponential backoff
- State Management: Redis tracks progress for fault tolerance
This approach ensures reliable completion of all 10 articles in under 10 minutes while maintaining high accuracy and proper hyperlink validation.
Author: Ronak Raisingani
Project: RAG News Generation Challenge
Completion: 10/10 articles generated successfully