siddhamapple/YoutubeSummarizer

🎬 YouTube Transcript Q&A & Summarizer

An AI-powered application that transforms YouTube videos into searchable, interactive knowledge sources through advanced Large Language Models (LLMs).

🚀 Overview

YouTube Transcript Q&A & Summarizer is a modular, end-to-end pipeline that:

  • Retrieves transcripts from YouTube videos automatically
  • Dynamically chunks long content for optimal LLM processing
  • Offers a Streamlit web interface for seamless user experience
  • Transforms videos into searchable, interactive knowledge sources
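As a rough illustration of the chunking step, here is a minimal sketch that splits a transcript into overlapping word-based chunks. The function name `chunk_text` and the `max_words` and `overlap` parameters are assumptions made for this example; the project's own chunker in `preprocessor.py` may size chunks differently (for instance, by tokens or by transcript length).

```python
from typing import List


def chunk_text(text: str, max_words: int = 300, overlap: int = 30) -> List[str]:
    """Split text into word-based chunks that share `overlap` words with the previous chunk.

    Hypothetical helper: the sizes are illustrative defaults, not the project's settings.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    chunks: List[str] = []
    step = max_words - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current window already reached the end of the text
    return chunks
```

The overlap is a design choice: it keeps a sentence that straddles a chunk boundary visible to both neighbouring chunks, at the cost of sending a few words to the model twice.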

Project Architecture

Here is a visual overview of the application's pipeline, from user input to final output.

graph TD
    subgraph "User Interface (Streamlit)"
        A[User provides YouTube URL] --> B{Streamlit App};
    end

    subgraph "Backend Processing Pipeline"
        B --> C[Transcript Retriever];
        C -->|Raw Transcript| D[Preprocessor];
        D -->|Cleaned Text| E[Dynamic Chunker];
        E -->|Text Chunks| F[QA Engine];
        F -->|Map: Process each chunk| G((LLM API));
        G -->|Chunk Answers| F;
        F -->|Reduce: Synthesize final answer| G;
    end

    subgraph "Final Output"
        G -->|Unified Answer/Summary| B;
        B --> H[Display to User];
        H --> I[Download Output];
    end

    style F fill:#bbb,stroke:#333,stroke-width:2px
    style G fill:#fff,stroke:#333,stroke-width:2px

✨ Features

🎯 Instant Transcript Retrieval

Extracts subtitles automatically from YouTube videos with available captions.
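Transcript lookups are typically keyed by video ID rather than by full URL, so a retriever usually has to parse the link first. The sketch below shows one way to do that with the standard library only. The function name `extract_video_id` and the set of supported URL shapes are assumptions for illustration, not the project's actual code.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from common YouTube URL shapes, or None if not recognised.

    Hypothetical helper covering watch, youtu.be, embed, and shorts links.
    """
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        candidate = parsed.path.lstrip("/").split("/")[0]
        return candidate or None

    if host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        parts = parsed.path.split("/")
        # /embed/<id> and /shorts/<id>
        if len(parts) >= 3 and parts[1] in ("embed", "shorts"):
            return parts[2] or None

    return None
```

Returning `None` for unrecognised input lets the caller show a friendly message instead of failing deep inside the pipeline.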

🤖 LLM-Powered Summarization

Converts long video transcripts into concise, informative summaries using the Google Gemini API.

❓ Ask Any Question

Enter natural language queries and get accurate answers based on video content.

🔄 Map-Reduce Aggregation

Efficiently processes long videos by chunking content and combining answers or summaries.
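The map-reduce idea can be sketched in a few lines: ask the model about each chunk separately (map), then ask it once more to merge the partial answers (reduce). In this sketch the model is passed in as a plain callable named `llm`, which is an assumption made so the example runs without any API key; the prompt wording is also illustrative and not taken from the project.

```python
from typing import Callable, List


def map_reduce_answer(chunks: List[str], question: str, llm: Callable[[str], str]) -> str:
    """Answer `question` over several chunks with one call per chunk plus one merge call.

    `llm` is any function that maps a prompt string to a response string
    (a hypothetical stand-in for a real model client).
    """
    # Map: one independent call per chunk.
    partials = [
        llm(
            "Answer the question using only this excerpt.\n"
            f"Question: {question}\nExcerpt: {chunk}"
        )
        for chunk in chunks
    ]
    if not partials:
        return ""
    if len(partials) == 1:
        return partials[0]  # nothing to merge

    # Reduce: a single call that synthesises the partial answers.
    joined = "\n".join(f"- {p}" for p in partials)
    return llm(
        "Combine these partial answers into one answer.\n"
        f"Question: {question}\nPartial answers:\n{joined}"
    )
```

Injecting the model as a callable also makes the aggregation logic easy to unit test with a stub, independent of network access.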

🌐 Streamlit Web App

A simple, interactive interface for pasting video links, asking questions, and viewing results.

💾 Downloadable Results

Export generated summaries or answers as .txt files for offline use.
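One possible way to prepare a result for download is to build a safe file name and a UTF-8 payload first. The helper name `build_download` and the naming scheme below are assumptions for illustration only.

```python
import re
from typing import Tuple


def build_download(video_id: str, mode: str, text: str) -> Tuple[str, bytes]:
    """Return a (file_name, payload) pair for a .txt export.

    Hypothetical helper: `mode` is expected to be a short label such as
    "summary" or "answer".
    """
    # Replace anything that is awkward in file names with an underscore.
    safe_id = re.sub(r"[^A-Za-z0-9_-]", "_", video_id)
    file_name = f"{safe_id}_{mode}.txt"
    return file_name, text.encode("utf-8")
```

In a Streamlit app, the pair could then be handed to `st.download_button(label="Download", data=payload, file_name=file_name, mime="text/plain")`.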

🧪 Robust Testing

Unit-tested modular pipeline using pytest with comprehensive coverage.

📊 Logging & Error Handling

Centralized logging system with graceful failure mechanisms and detailed error tracking.
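A centralized logger paired with a custom exception might look roughly like the sketch below. The names `PipelineError`, `get_logger`, and `run_stage` are hypothetical, and the example logs to stderr so it runs anywhere; the project itself keeps runtime logs under `logs/`, so its handler setup will differ.

```python
import logging
import sys
from typing import Any, Callable


class PipelineError(Exception):
    """Hypothetical error type that records which pipeline stage failed."""

    def __init__(self, stage: str, message: str) -> None:
        super().__init__(f"[{stage}] {message}")
        self.stage = stage


def get_logger(name: str = "yt_summarizer") -> logging.Logger:
    """Return a shared logger, attaching a handler only on first use."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def run_stage(stage: str, func: Callable[..., Any], *args: Any) -> Any:
    """Run one stage, log any failure, and re-raise it as a PipelineError."""
    try:
        return func(*args)
    except Exception as exc:
        get_logger().error("stage %s failed: %s", stage, exc)
        raise PipelineError(stage, str(exc)) from exc
```

Wrapping every stage the same way means the UI only has to catch one exception type to show a readable error message.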

📁 Project Structure

project-root/
├── app.py                       # Streamlit frontend
├── yt_logo.png                  # Logo for branding
├── requirements.txt             # Dependencies
├── requirements-dev.txt         # Dev dependencies (pytest, etc.)
├── pytest.ini                   # Pytest configuration
├── .env                         # API keys and configs (not committed)
├── .gitignore                   # Git ignore rules
├── README.md                    # Project documentation
├── setup.py                     # Packaging script (written, not yet published)
├── logs/                        # Runtime logs
├── src/
│   ├── __init__.py
│   ├── components/
│   │   ├── __init__.py
│   │   ├── main.py                  # Main orchestrator
│   │   ├── preprocessor.py          # Text cleaning & chunking
│   │   ├── qa_engine.py             # LLM interface
│   │   ├── transcript_retriever.py  # YouTube transcript fetching
│   │   └── internalTesting/         # Optional submodules
│   ├── logger.py                # Centralized logging
│   ├── exception.py             # Custom error handling
│   └── utils.py                 # Helper utilities
└── tests/
    ├── test_preprocessor.py
    ├── test_qa_engine.py
    └── test_transcript_retriever.py

🛠️ Prerequisites

  • Python 3.8+
  • YouTube video with captions
  • Google Gemini API key (or any supported LLM)
  • Internet connection for API calls

📥 Installation

1. Clone the Repository

git clone https://github.com/siddhamapple/YoutubeSummarizer
cd YoutubeSummarizer

2. Create Virtual Environment

python -m venv venv

Activate (Linux/Mac)

source venv/bin/activate 

Activate (Windows)

venv\Scripts\activate 

3. Install Dependencies

For regular usage:

pip install -r requirements.txt

For development (includes testing tools):

pip install -r requirements-dev.txt

4. Environment Configuration

Create a .env file in the project root:

GEMINI_API_KEY=your_gemini_api_key_here
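At startup the key then has to be read back from the environment. The sketch below shows one way to do that and to fail early with a clear message when the key is missing or still set to the placeholder. The function names are assumptions; `load_dotenv` comes from the python-dotenv package, and the import is guarded so the example still runs if it is not installed.

```python
import os
from typing import Mapping, Optional


def load_env_file() -> None:
    """Load variables from a .env file if python-dotenv is available."""
    try:
        from dotenv import load_dotenv  # provided by the python-dotenv package
    except ImportError:
        return  # fall back to variables already present in the environment
    load_dotenv()


def get_gemini_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Gemini API key or raise if it is missing or a placeholder.

    Hypothetical helper: `env` defaults to os.environ and exists mainly for testing.
    """
    source = os.environ if env is None else env
    key = source.get("GEMINI_API_KEY", "").strip()
    if not key or key == "your_gemini_api_key_here":
        raise RuntimeError(
            "GEMINI_API_KEY is not set; add it to .env or export it in the environment"
        )
    return key
```

A typical call order would be `load_env_file()` once at startup, followed by `get_gemini_api_key()` wherever the client is created.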

Usage

Quick Start

streamlit run app.py

Visit: http://localhost:8501

App Workflow

  1. Paste YouTube Link → Enter any YouTube video URL
  2. Choose Mode → Select "Summarize" or "Ask a Question"
  3. Process Transcript → App fetches and processes video transcript
  4. LLM Processing → AI runs on each chunk (Map-Reduce approach)
  5. View Results → Final answer/summary displayed
  6. Download → Optional: Save output as .txt file

🧪 Running Tests

Execute the complete test suite:

pytest

Test Coverage

The testing suite covers:

  • ✅ TranscriptRetriever: Verifies transcript fetch accuracy from YouTube
  • ✅ Preprocessor: Validates chunking, cleaning, and formatting logic
  • ✅ QnAEngine: Checks consistency and quality of LLM-generated answers/summaries

tests/
├── test_preprocessor.py
├── test_qa_engine.py
└── test_transcript_retriever.py
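For readers new to pytest, a test module in this style is just a set of `test_*` functions containing plain `assert` statements. The sketch below is self-contained and uses a hypothetical `clean_transcript` function defined inline; it is not the project's preprocessor, only an illustration of the test shape.

```python
import re


def clean_transcript(text: str) -> str:
    """Hypothetical cleaner: drop bracketed cues like [Music] and collapse whitespace."""
    without_cues = re.sub(r"\[[^\]]*\]", " ", text)
    return re.sub(r"\s+", " ", without_cues).strip()


def test_removes_bracketed_cues():
    assert clean_transcript("hello [Music] world") == "hello world"


def test_collapses_whitespace():
    assert clean_transcript("  a \n\n b\t c ") == "a b c"


def test_empty_input():
    assert clean_transcript("") == ""
```

Saved under `tests/` with a file name starting with `test_`, functions like these are collected automatically when `pytest` runs.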

🔧 Core Technologies

| Tool/Library | Description |
| --- | --- |
| Streamlit | Frontend UI for interacting with the app |
| Google Gemini API | Large Language Model for Q&A and summarization |
| python-dotenv | Manages .env configurations securely |
| Pytest | Testing framework for backend logic verification |
| Custom Modules | Modular Python architecture under src/components/ |

Modular Design Components

  • transcript_retriever.py: Fetches captions from YouTube videos
  • preprocessor.py: Cleans and chunks transcript content
  • qa_engine.py: Interfaces with LLM for Q&A/summaries
  • main.py: Orchestrates the entire pipeline

🗺️ Roadmap

🔮 Upcoming Features

  • 🌐 Multilingual Support: Auto-detect video language and translate
  • ⚡ Asynchronous Processing: Faster summarization with async LLM calls
  • ☁️ Deployment Options: Push to Streamlit Cloud / Hugging Face Spaces
  • ⌨️ CLI Utilities: Command-line support for batch summarization
  • 🎨 UI Polish: Improve layout, theme, and mobile responsiveness
  • 📈 Analytics: Usage tracking and performance metrics

👨‍💻 Author

Siddham Jain
🎓 B.Tech in Electrical and Computer Engineering | Shiv Nadar IOE
📧 siddhamjainn@gmail.com
📱 +919625208689

🤝 Contributing

We welcome contributions! Follow these steps:

  1. Fork this repository
  2. Create a new feature branch (git checkout -b feature-name)
  3. Commit your changes (git commit -am 'Add feature')
  4. Push to the branch (git push origin feature-name)
  5. Open a pull request 🚀

Development Guidelines

  • Follow PEP 8 style guidelines
  • Write comprehensive tests for new features
  • Update documentation for any API changes
  • Use meaningful commit messages

⭐ If you find this project helpful, please give it a star!

For questions, issues, or feature requests, please open an issue on GitHub.
