62 changes: 54 additions & 8 deletions README.md
@@ -1,22 +1,37 @@
# paperweight

This project automatically retrieves, filters, and summarizes recent academic papers from arXiv based on user-specified categories, then sends notifications to the user.
A scalable system for retrieving, filtering, and summarizing academic papers from arXiv based on user preferences, with customizable notifications.

## Features

- **ArXiv Integration**: Fetches recent papers from arXiv using their API, ensuring up-to-date access to the latest research.
- **Customizable Filtering**: Filters papers based on user-defined preferences, including keywords, categories, and exclusion criteria.
- **Intelligent Summarization** (BETA): Generates concise summaries or extracts abstracts, providing quick insights into paper content. Note: This feature is currently in beta and may have some limitations.
- **Intelligent Summarization** (BETA): Generates concise summaries or extracts abstracts, providing quick insights into paper content.
- **Flexible Notification System**: Notifies users via email, with potential for expansion to other notification methods.
- **Configurable Settings**: Allows users to fine-tune the application's behavior through a YAML configuration file.

## System Architecture

```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│    SCRAPER    │────▶│   PROCESSOR   │────▶│   ANALYZER    │────▶│   NOTIFIER    │
└───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘
        │                     │                     │                     │
        ▼                     ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  arXiv API &  │     │   Scoring &   │     │   Abstract    │     │    Email &    │
│ PDF Processing│     │   Filtering   │     │  Extraction   │     │  Templating   │
└───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘
```
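
In code, the four stages above compose into a simple linear pipeline. The sketch below is illustrative only: every function in it is a stub standing in for the real module (`get_abstracts` and `summarize_paper` exist in `analyzer.py`; the other names here are assumptions, not paperweight's actual API).

```python
# Illustrative glue code for the four-stage pipeline shown above.
# All functions are stubs; they are not paperweight's real interfaces.

def fetch_papers(cfg: dict) -> list[dict]:
    """SCRAPER stage stub: would call the arXiv API and process PDFs."""
    return [{"title": "Example Paper", "abstract": "An abstract.", "score": 1.0}]

def process_papers(papers: list[dict], cfg: dict) -> list[dict]:
    """PROCESSOR stage stub: keep papers above a relevance threshold."""
    return [p for p in papers if p["score"] >= cfg.get("min_score", 0.0)]

def analyze_papers(papers: list[dict], cfg: dict) -> list[str]:
    """ANALYZER stage stub: extract abstracts (or summarize via an LLM)."""
    return [p["abstract"] for p in papers]

def notify(papers: list[dict], summaries: list[str], cfg: dict) -> None:
    """NOTIFIER stage stub: would render and send an email digest."""
    for paper, summary in zip(papers, summaries):
        print(f"{paper['title']}: {summary}")

def run_pipeline(config: dict) -> None:
    papers = fetch_papers(config.get("scraper", {}))
    relevant = process_papers(papers, config.get("processor", {}))
    summaries = analyze_papers(relevant, config.get("analyzer", {}))
    notify(relevant, summaries, config.get("notifier", {}))
```

Each stage takes the previous stage's output plus its own slice of the configuration, which keeps the modules independently testable.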

## Table of Contents
- [Getting Started](#getting-started)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Usage](#usage)
- [Configuration](#configuration)
- [FAQ and Troubleshooting](#faq-and-troubleshooting)
- [Technical Details](#technical-details)
- [Roadmap](#roadmap)
- [Glossary](#glossary)
- [License](#license)
@@ -29,11 +44,13 @@ This project automatically retrieves, filters, and summarizes recent academic pa

- Python 3.10 or higher
- Required Python packages:
- pypdf
- python-dotenv
- PyYAML
- requests
- simplerllm
- pypdf - For PDF document processing
- python-dotenv - For environment variable management
- PyYAML - For configuration parsing
- requests - For API communication
- simplerllm - For LLM integration
- tenacity - For resilient API interactions
- tiktoken - For token counting

## Installation

@@ -98,14 +115,41 @@ For a comprehensive list of frequently asked questions, including setup instruct

If you can't find an answer to your question or solution to your problem in the FAQ, please [open an issue](https://github.com/seanbrar/paperweight/issues) on GitHub.

## Technical Details

### Processing Pipeline

paperweight processes papers through four main stages:

1. **Scraping** (`scraper.py`): Fetches recent papers from arXiv's API based on user-defined categories and processes the PDF/LaTeX content.

2. **Processing** (`processor.py`): Calculates relevance scores based on keyword matching, with weights for title, abstract, and content matches, plus handling of exclusion keywords.

3. **Analysis** (`analyzer.py`): Either extracts the abstract or generates a summary using an LLM (OpenAI or Gemini), with configurable options.

4. **Notification** (`notifier.py`): Formats the filtered papers and sends them via email, with options for sorting by relevance, date, or title.

### Resilience Features

- **Retry Logic**: Uses the `tenacity` library to implement exponential backoff for API calls
- **Error Handling**: Comprehensive error catching and logging throughout the codebase
- **State Persistence**: Maintains processing state between runs using the `last_processed_date.txt` file
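
The state-persistence pattern could be implemented along these lines. The filename `last_processed_date.txt` comes from the repo; the helper names, ISO-8601 format, and fallback window are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

STATE_FILE = Path("last_processed_date.txt")  # filename used by paperweight

def load_last_processed(default_days: int = 7) -> datetime:
    """Read the last-run timestamp, falling back to a recent default window."""
    try:
        return datetime.fromisoformat(STATE_FILE.read_text().strip())
    except (FileNotFoundError, ValueError):
        # First run, or corrupt state: re-fetch the last `default_days` days.
        return datetime.now(timezone.utc) - timedelta(days=default_days)

def save_last_processed(when: datetime) -> None:
    """Persist the timestamp so the next run can pick up where this one ended."""
    STATE_FILE.write_text(when.isoformat())
```

Writing the timestamp only after a successful run means a failed run is simply retried from the same point next time.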

### Performance Considerations

- **Token Counting**: Uses `tiktoken` to accurately count tokens for LLM context management
- **Configurable Limits**: Allows setting maximum papers per category to control processing time
- **Incremental Processing**: Only fetches papers published since the last run

## Roadmap

Key upcoming features:
- Implement machine learning-based paper recommendations
- Add support for additional academic paper sources
- Expand notification methods
- Enhance batch processing capabilities

For a full list of proposed features and known issues, see the [open issues](https://github.com/seanbrar/paperweight/issues) page or the detailed [roadmap](docs/ROADMAP.md).
For a full list of proposed features and planned enhancements, see the detailed [roadmap](docs/ROADMAP.md).

## Glossary

@@ -114,6 +158,8 @@ For a full list of proposed features and known issues, see the [open issues](htt
- **YAML**: A human-readable data serialization format used for configuration files.
- **SMTP**: Simple Mail Transfer Protocol; used for sending emails.
- **LLM**: Large Language Model; an AI model used for text generation and analysis.
- **Embedding**: A numerical representation of text that captures semantic meaning.
- **Token**: A unit of text processed by language models, roughly corresponding to four characters of English text.

## License

101 changes: 66 additions & 35 deletions docs/ROADMAP.md
@@ -1,56 +1,87 @@
# paperweight roadmap

This document outlines the planned features and improvements for the paperweight project. Please note that this roadmap is subject to change based on user feedback and project priorities.
This document outlines planned features and improvements for the paperweight project. The roadmap is organized into focused development areas to create a scalable, efficient academic paper processing system.

## Short-term Goals
## Core System Enhancements

### General Improvements
- [ ] Implement general code cleanup and optimization
- [ ] Increase overall speed through asynchronous operations
- [ ] Create a web-hosted demo of the program
### Performance & Efficiency
- [ ] Implement asynchronous processing for paper fetching and analysis
- [ ] Add configurable batch processing with adjustable batch sizes
- [ ] Create memory usage tracking and optimization for large document sets
- [ ] Implement benchmarking tools to measure and optimize performance

### Context Management
- [ ] Develop intelligent document chunking for papers exceeding token limits
- [ ] Implement hierarchical summarization for extremely long papers
- [ ] Create a context window awareness system that optimizes token usage
- [ ] Add semantic sectioning to prioritize important paper components

### Caching Infrastructure
- [ ] Implement persistent caching for paper embeddings and metadata
- [ ] Create smart cache invalidation strategies based on paper updates
- [ ] Develop a disk-based storage system for embeddings to reduce API costs
- [ ] Add cache statistics reporting for optimization insights

## Module-Specific Improvements

### Scraper Module
- [ ] Build and implement PDF extraction evaluations
- [ ] Add retry logic in API/scraper (possibly using tenacity)
- [ ] Revisit and improve date checking logic
- [ ] Develop comprehensive testing suite with dummy papers
- [ ] Parse out unnecessary content (e.g., references, LaTeX preambles)
- [ ] Add support for extracting and handling images from papers
- [ ] Enhance PDF extraction precision with specialized academic paper handling
- [ ] Add support for extracting and processing figures and tables
- [ ] Expand retry logic in API interactions using advanced backoff strategies
- [ ] Improve date-based paper filtering with precise version tracking

### Processor Module
- [ ] Refine and expand the normalization score system for papers
- [ ] Develop enhanced scoring algorithms for more accurate paper relevance
- [ ] Implement sliding window analysis for sequential context processing
- [ ] Create adaptive keyword weighting based on document section importance
- [ ] Add citation network analysis for evaluating paper significance

### Analyzer Module
- [ ] Conduct additional testing of LLM integration
- [ ] Implement rate limits for API calls
- [ ] Explore and potentially add support for a wider selection of models
- [ ] Refine and optimize summarization prompts
- [ ] Expand LLM provider support with a unified interface
- [ ] Implement streaming responses for long paper summarization
- [ ] Create domain-specific summarization templates for different fields
- [ ] Add comparative analysis between related papers

### Notifier Module
- [ ] Improve handling of scenarios where all papers are discarded
- [ ] Revisit and potentially expand the fields included in notifications (e.g., authors)
- [ ] Add more options for paper ordering and field selection in email notifications
- [ ] Develop a modular notification system supporting multiple channels
- [ ] Create customizable templates for notification formatting
- [ ] Implement digest mode for batched notifications
- [ ] Add interactive elements to notifications for user feedback

## Strategic Directions

## Medium-term Goals
### Machine Learning Integration
- [ ] Replace keyword-based filtering with embedding similarity scoring
- [ ] Implement personalized paper recommendations based on user interests
- [ ] Develop citation impact prediction for emerging papers
- [ ] Create a feedback loop to improve future recommendations
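
Embedding-similarity scoring, once implemented, could rank papers by cosine similarity against a user-interest vector. The sketch below uses plain NumPy with placeholder vectors standing in for real embeddings; names and the threshold are illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_by_interest(paper_vecs: list[np.ndarray],
                       interest_vec: np.ndarray,
                       threshold: float = 0.7) -> list[int]:
    """Return indices of papers whose embedding is close to the user profile."""
    return [i for i, vec in enumerate(paper_vecs)
            if cosine_similarity(vec, interest_vec) >= threshold]
```

Unlike keyword matching, this catches papers that discuss a topic in different words, at the cost of computing (and ideally caching) an embedding per paper.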

- [ ] Replace current static keyword-based filtering with a machine learning recommendation engine
- [ ] Ensure interface compatibility is maintained
- [ ] Expand notification methods beyond email
- [ ] Investigate possibilities like desktop notifications or a desktop agent
- [ ] Rethink the notification system to make SMTP configuration less cumbersome for users
### Expanded Data Sources
- [ ] Add support for multiple academic repositories (PubMed, IEEE, etc.)
- [ ] Implement unified metadata schema across different sources
- [ ] Create source-specific optimizations for each repository
- [ ] Develop cross-repository deduplication

## Long-term Goals
### User Experience
- [ ] Create a simple web interface for configuration and monitoring
- [ ] Develop a local dashboard for visualizing paper recommendations
- [ ] Add personalized preference learning from user interactions
- [ ] Implement saved searches and automated monitoring

- [ ] Add support for additional academic paper sources beyond arXiv
- [ ] Implement machine learning-based paper recommendations
- [ ] Continuously improve and refine the LLM-based summarization feature
## Development Infrastructure

## Ongoing Tasks
### Testing & Quality
- [ ] Expand test coverage with more integration tests
- [ ] Develop performance regression testing
- [ ] Create automated benchmark suites for optimization
- [ ] Implement continuous profiling for memory and CPU usage

- [ ] Maintain and update documentation
- [ ] Address bugs and issues reported by users
- [ ] Optimize performance and resource usage
### Documentation
- [ ] Expand API documentation for extensibility
- [ ] Create visual architecture diagrams
- [ ] Develop advanced configuration guides for specific use cases
- [ ] Add code examples for common extension patterns

We welcome contributions and suggestions from the community. If you have ideas for new features or improvements, please open an issue on the [GitHub repository](https://github.com/seanbrar/paperweight/issues).
We welcome contributions and suggestions from the community. If you have ideas for features or improvements, please open an issue on the [GitHub repository](https://github.com/seanbrar/paperweight/issues).

For information on how to contribute to paperweight, please see the [contributing guide](docs/CONTRIBUTING.md).
2 changes: 1 addition & 1 deletion setup.py
@@ -3,7 +3,7 @@

setup(
name="paperweight",
version="0.1.1",
version="0.1.2",
package_dir={"": "src"},
packages=find_packages(where="src"),
install_requires=[
88 changes: 72 additions & 16 deletions src/paperweight/analyzer.py
@@ -1,3 +1,10 @@
"""Module for analyzing and summarizing academic papers.

This module provides functionality for analyzing paper content using LLMs (Large Language Models)
and extracting relevant information. It supports different analysis types including abstract
extraction and paper summarization using various LLM providers.
"""

import logging
from typing import Any, Dict

@@ -11,29 +18,61 @@

logger = logging.getLogger(__name__)


def get_abstracts(processed_papers, config):
analysis_type = config.get('type', 'abstract')
"""Extract abstracts or summaries from processed papers based on configuration.

Args:
processed_papers: List of dictionaries containing paper data.
config: Configuration dictionary specifying analysis type and parameters.

if analysis_type == 'abstract':
return [paper['abstract'] for paper in processed_papers]
elif analysis_type == 'summary':
Returns:
List of strings containing either abstracts or summaries based on config type.

Raises:
ValueError: If an unknown analysis type is specified in config.
"""
analysis_type = config.get("type", "abstract")

if analysis_type == "abstract":
return [paper["abstract"] for paper in processed_papers]
elif analysis_type == "summary":
return [summarize_paper(paper, config) for paper in processed_papers]
else:
raise ValueError(f"Unknown analysis type: {analysis_type}")


@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def summarize_paper(paper: Dict[str, Any], config: Dict[str, Any]) -> str:
llm_provider = config.get('analyzer', {}).get('llm_provider', 'openai').lower()
api_key = config.get('analyzer', {}).get('api_key')
"""Generate a summary of a paper using an LLM.

if llm_provider not in ['openai', 'gemini'] or not api_key:
logger.warning(f"No valid LLM provider or API key available for {llm_provider}. Falling back to abstract.")
return paper['abstract']
Args:
paper: Dictionary containing paper data including content and metadata.
config: Configuration dictionary containing LLM settings.

Returns:
A string containing the generated summary.

Raises:
ValueError: If no valid LLM provider or API key is available.
"""
llm_provider = config.get("analyzer", {}).get("llm_provider", "openai").lower()
api_key = config.get("analyzer", {}).get("api_key")

if llm_provider not in ["openai", "gemini"] or not api_key:
logger.warning(
f"No valid LLM provider or API key available for {llm_provider}. Falling back to abstract."
)
return paper["abstract"]

try:
provider = LLMProvider[llm_provider.upper()]
model_name = 'gpt-4o-mini' if provider == LLMProvider.OPENAI else 'gemini-1.5-flash'
llm_instance = LLM.create(provider=provider, model_name=model_name, api_key=api_key)
model_name = (
"gpt-4o-mini" if provider == LLMProvider.OPENAI else "gemini-1.5-flash"
)
llm_instance = LLM.create(
provider=provider, model_name=model_name, api_key=api_key
)
prompt = f"Write a concise, accurate summary of the following paper's content in about 3-5 sentences:\n\n```{paper['content']}```"

input_tokens = count_tokens(prompt)
@@ -47,12 +86,29 @@ def summarize_paper(paper: Dict[str, Any], config: Dict[str, Any]) -> str:
return response
except Exception as e:
logger.error(f"Error summarizing paper: {e}", exc_info=True)
return paper['abstract']
return paper["abstract"]


def create_llm_instance(provider: str, api_key: str) -> LLM:
if provider == 'openai':
return LLM.create(provider=LLMProvider.OPENAI, model_name="gpt-4o-mini", api_key=api_key)
elif provider == 'gemini':
return LLM.create(provider=LLMProvider.GEMINI, model_name="gemini-1.5-flash", api_key=api_key)
"""Create an instance of the specified LLM provider.

Args:
provider: The name of the LLM provider ('openai' or 'gemini').
api_key: API key for the specified provider.

Returns:
An initialized LLM instance.

Raises:
ValueError: If an unsupported provider is specified.
"""
if provider == "openai":
return LLM.create(
provider=LLMProvider.OPENAI, model_name="gpt-4o-mini", api_key=api_key
)
elif provider == "gemini":
return LLM.create(
provider=LLMProvider.GEMINI, model_name="gemini-1.5-flash", api_key=api_key
)
else:
raise ValueError(f"Unsupported LLM provider: {provider}")