62 changes: 54 additions & 8 deletions README.md
@@ -1,22 +1,37 @@
# paperweight

This project automatically retrieves, filters, and summarizes recent academic papers from arXiv based on user-specified categories, then sends notifications to the user.
A scalable system for retrieving, filtering, and summarizing academic papers from arXiv based on user preferences, with customizable notifications.

## Features

- **ArXiv Integration**: Fetches recent papers from arXiv using their API, ensuring up-to-date access to the latest research.
- **Customizable Filtering**: Filters papers based on user-defined preferences, including keywords, categories, and exclusion criteria.
- **Intelligent Summarization** (BETA): Generates concise summaries or extracts abstracts, providing quick insights into paper content. Note: This feature is currently in beta and may have some limitations.
- **Intelligent Summarization** (BETA): Generates concise summaries or extracts abstracts, providing quick insights into paper content.
- **Flexible Notification System**: Notifies users via email, with potential for expansion to other notification methods.
- **Configurable Settings**: Allows users to fine-tune the application's behavior through a YAML configuration file.

## System Architecture

```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│    SCRAPER    │────▶│   PROCESSOR   │────▶│   ANALYZER    │────▶│   NOTIFIER    │
└───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘
        │                     │                     │                     │
        ▼                     ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  arXiv API &  │     │   Scoring &   │     │   Abstract    │     │    Email &    │
│ PDF Processing│     │   Filtering   │     │  Extraction   │     │  Templating   │
└───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘
```
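
In code, the four stages above compose into a simple linear pipeline. The sketch below is illustrative only: every function in it is a stub standing in for the real module (`get_abstracts` and `summarize_paper` exist in `analyzer.py`; the other names here are assumptions, not paperweight's actual API).

```python
# Illustrative glue code for the four-stage pipeline shown above.
# All functions are stubs; they are not paperweight's real interfaces.

def fetch_papers(cfg: dict) -> list[dict]:
    """SCRAPER stage stub: would call the arXiv API and process PDFs."""
    return [{"title": "Example Paper", "abstract": "An abstract.", "score": 1.0}]

def process_papers(papers: list[dict], cfg: dict) -> list[dict]:
    """PROCESSOR stage stub: keep papers above a relevance threshold."""
    return [p for p in papers if p["score"] >= cfg.get("min_score", 0.0)]

def analyze_papers(papers: list[dict], cfg: dict) -> list[str]:
    """ANALYZER stage stub: extract abstracts (or summarize via an LLM)."""
    return [p["abstract"] for p in papers]

def notify(papers: list[dict], summaries: list[str], cfg: dict) -> None:
    """NOTIFIER stage stub: would render and send an email digest."""
    for paper, summary in zip(papers, summaries):
        print(f"{paper['title']}: {summary}")

def run_pipeline(config: dict) -> None:
    papers = fetch_papers(config.get("scraper", {}))
    relevant = process_papers(papers, config.get("processor", {}))
    summaries = analyze_papers(relevant, config.get("analyzer", {}))
    notify(relevant, summaries, config.get("notifier", {}))
```

Each stage takes the previous stage's output plus its own slice of the configuration, which keeps the modules independently testable.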

## Table of Contents
- [Getting Started](#getting-started)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Usage](#usage)
- [Configuration](#configuration)
- [FAQ and Troubleshooting](#faq-and-troubleshooting)
- [Technical Details](#technical-details)
- [Roadmap](#roadmap)
- [Glossary](#glossary)
- [License](#license)
@@ -29,11 +44,13 @@ This project automatically retrieves, filters, and summarizes recent academic pa

- Python 3.10 or higher
- Required Python packages:
- pypdf
- python-dotenv
- PyYAML
- requests
- simplerllm
- pypdf - For PDF document processing
- python-dotenv - For environment variable management
- PyYAML - For configuration parsing
- requests - For API communication
- simplerllm - For LLM integration
- tenacity - For resilient API interactions
- tiktoken - For token counting

## Installation

@@ -98,14 +115,41 @@ For a comprehensive list of frequently asked questions, including setup instruct

If you can't find an answer to your question or solution to your problem in the FAQ, please [open an issue](https://github.com/seanbrar/paperweight/issues) on GitHub.

## Technical Details

### Processing Pipeline

paperweight processes papers through four main stages:

1. **Scraping** (`scraper.py`): Fetches recent papers from arXiv's API based on user-defined categories and processes the PDF/LaTeX content.

2. **Processing** (`processor.py`): Calculates relevance scores based on keyword matching, with weights for title, abstract, and content matches, plus handling of exclusion keywords.

3. **Analysis** (`analyzer.py`): Either extracts the abstract or generates a summary using an LLM (OpenAI or Gemini), with configurable options.

4. **Notification** (`notifier.py`): Formats the filtered papers and sends them via email, with options for sorting by relevance, date, or title.

### Resilience Features

- **Retry Logic**: Uses the `tenacity` library to implement exponential backoff for API calls
- **Error Handling**: Comprehensive error catching and logging throughout the codebase
- **State Persistence**: Maintains processing state between runs using the `last_processed_date.txt` file
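
The state-persistence pattern could be implemented along these lines. The filename `last_processed_date.txt` comes from the repo; the helper names, ISO-8601 format, and fallback window are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

STATE_FILE = Path("last_processed_date.txt")  # filename used by paperweight

def load_last_processed(default_days: int = 7) -> datetime:
    """Read the last-run timestamp, falling back to a recent default window."""
    try:
        return datetime.fromisoformat(STATE_FILE.read_text().strip())
    except (FileNotFoundError, ValueError):
        # First run, or corrupt state: re-fetch the last `default_days` days.
        return datetime.now(timezone.utc) - timedelta(days=default_days)

def save_last_processed(when: datetime) -> None:
    """Persist the timestamp so the next run can pick up where this one ended."""
    STATE_FILE.write_text(when.isoformat())
```

Writing the timestamp only after a successful run means a failed run is simply retried from the same point next time.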

### Performance Considerations

- **Token Counting**: Uses `tiktoken` to accurately count tokens for LLM context management
- **Configurable Limits**: Allows setting maximum papers per category to control processing time
- **Incremental Processing**: Only fetches papers published since the last run

## Roadmap

Key upcoming features:
- Implement machine learning-based paper recommendations
- Add support for additional academic paper sources
- Expand notification methods
- Enhance batch processing capabilities

For a full list of proposed features and known issues, see the [open issues](https://github.com/seanbrar/paperweight/issues) page or the detailed [roadmap](docs/ROADMAP.md).
For a full list of proposed features and planned enhancements, see the detailed [roadmap](docs/ROADMAP.md).

## Glossary

@@ -114,6 +158,8 @@ For a full list of proposed features and known issues, see the [open issues](htt
- **YAML**: A human-readable data serialization format used for configuration files.
- **SMTP**: Simple Mail Transfer Protocol; used for sending emails.
- **LLM**: Large Language Model; an AI model used for text generation and analysis.
- **Embedding**: A numerical representation of text that captures semantic meaning.
- **Token**: A unit of text processed by language models, roughly corresponding to four characters of English text.

## License

101 changes: 66 additions & 35 deletions docs/ROADMAP.md
@@ -1,56 +1,87 @@
# paperweight roadmap

This document outlines the planned features and improvements for the paperweight project. Please note that this roadmap is subject to change based on user feedback and project priorities.
This document outlines planned features and improvements for the paperweight project. The roadmap is organized into focused development areas to create a scalable, efficient academic paper processing system.

## Short-term Goals
## Core System Enhancements

### General Improvements
- [ ] Implement general code cleanup and optimization
- [ ] Increase overall speed through asynchronous operations
- [ ] Create a web-hosted demo of the program
### Performance & Efficiency
- [ ] Implement asynchronous processing for paper fetching and analysis
- [ ] Add configurable batch processing with adjustable batch sizes
- [ ] Create memory usage tracking and optimization for large document sets
- [ ] Implement benchmarking tools to measure and optimize performance

### Context Management
- [ ] Develop intelligent document chunking for papers exceeding token limits
- [ ] Implement hierarchical summarization for extremely long papers
- [ ] Create a context window awareness system that optimizes token usage
- [ ] Add semantic sectioning to prioritize important paper components

### Caching Infrastructure
- [ ] Implement persistent caching for paper embeddings and metadata
- [ ] Create smart cache invalidation strategies based on paper updates
- [ ] Develop a disk-based storage system for embeddings to reduce API costs
- [ ] Add cache statistics reporting for optimization insights

## Module-Specific Improvements

### Scraper Module
- [ ] Build and implement PDF extraction evaluations
- [ ] Add retry logic in API/scraper (possibly using tenacity)
- [ ] Revisit and improve date checking logic
- [ ] Develop comprehensive testing suite with dummy papers
- [ ] Parse out unnecessary content (e.g., references, LaTeX preambles)
- [ ] Add support for extracting and handling images from papers
- [ ] Enhance PDF extraction precision with specialized academic paper handling
- [ ] Add support for extracting and processing figures and tables
- [ ] Expand retry logic in API interactions using advanced backoff strategies
- [ ] Improve date-based paper filtering with precise version tracking

### Processor Module
- [ ] Refine and expand the normalization score system for papers
- [ ] Develop enhanced scoring algorithms for more accurate paper relevance
- [ ] Implement sliding window analysis for sequential context processing
- [ ] Create adaptive keyword weighting based on document section importance
- [ ] Add citation network analysis for evaluating paper significance

### Analyzer Module
- [ ] Conduct additional testing of LLM integration
- [ ] Implement rate limits for API calls
- [ ] Explore and potentially add support for a wider selection of models
- [ ] Refine and optimize summarization prompts
- [ ] Expand LLM provider support with a unified interface
- [ ] Implement streaming responses for long paper summarization
- [ ] Create domain-specific summarization templates for different fields
- [ ] Add comparative analysis between related papers

### Notifier Module
- [ ] Improve handling of scenarios where all papers are discarded
- [ ] Revisit and potentially expand the fields included in notifications (e.g., authors)
- [ ] Add more options for paper ordering and field selection in email notifications
- [ ] Develop a modular notification system supporting multiple channels
- [ ] Create customizable templates for notification formatting
- [ ] Implement digest mode for batched notifications
- [ ] Add interactive elements to notifications for user feedback

## Strategic Directions

## Medium-term Goals
### Machine Learning Integration
- [ ] Replace keyword-based filtering with embedding similarity scoring
- [ ] Implement personalized paper recommendations based on user interests
- [ ] Develop citation impact prediction for emerging papers
- [ ] Create a feedback loop to improve future recommendations
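
Embedding-similarity scoring, once implemented, could rank papers by cosine similarity against a user-interest vector. The sketch below uses plain NumPy with placeholder vectors standing in for real embeddings; names and the threshold are illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_by_interest(paper_vecs: list[np.ndarray],
                       interest_vec: np.ndarray,
                       threshold: float = 0.7) -> list[int]:
    """Return indices of papers whose embedding is close to the user profile."""
    return [i for i, vec in enumerate(paper_vecs)
            if cosine_similarity(vec, interest_vec) >= threshold]
```

Unlike keyword matching, this catches papers that discuss a topic in different words, at the cost of computing (and ideally caching) an embedding per paper.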

- [ ] Replace current static keyword-based filtering with a machine learning recommendation engine
- [ ] Ensure interface compatibility is maintained
- [ ] Expand notification methods beyond email
- [ ] Investigate possibilities like desktop notifications or a desktop agent
- [ ] Rethink the notification system to make SMTP configuration less cumbersome for users
### Expanded Data Sources
- [ ] Add support for multiple academic repositories (PubMed, IEEE, etc.)
- [ ] Implement unified metadata schema across different sources
- [ ] Create source-specific optimizations for each repository
- [ ] Develop cross-repository deduplication

## Long-term Goals
### User Experience
- [ ] Create a simple web interface for configuration and monitoring
- [ ] Develop a local dashboard for visualizing paper recommendations
- [ ] Add personalized preference learning from user interactions
- [ ] Implement saved searches and automated monitoring

- [ ] Add support for additional academic paper sources beyond arXiv
- [ ] Implement machine learning-based paper recommendations
- [ ] Continuously improve and refine the LLM-based summarization feature
## Development Infrastructure

## Ongoing Tasks
### Testing & Quality
- [ ] Expand test coverage with more integration tests
- [ ] Develop performance regression testing
- [ ] Create automated benchmark suites for optimization
- [ ] Implement continuous profiling for memory and CPU usage

- [ ] Maintain and update documentation
- [ ] Address bugs and issues reported by users
- [ ] Optimize performance and resource usage
### Documentation
- [ ] Expand API documentation for extensibility
- [ ] Create visual architecture diagrams
- [ ] Develop advanced configuration guides for specific use cases
- [ ] Add code examples for common extension patterns

We welcome contributions and suggestions from the community. If you have ideas for new features or improvements, please open an issue on the [GitHub repository](https://github.com/seanbrar/paperweight/issues).
We welcome contributions and suggestions from the community. If you have ideas for features or improvements, please open an issue on the [GitHub repository](https://github.com/seanbrar/paperweight/issues).

For information on how to contribute to paperweight, please see the [contributing guide](docs/CONTRIBUTING.md).
2 changes: 1 addition & 1 deletion setup.py
@@ -3,7 +3,7 @@

setup(
name="paperweight",
version="0.1.1",
version="0.1.2",
package_dir={"": "src"},
packages=find_packages(where="src"),
install_requires=[
88 changes: 72 additions & 16 deletions src/paperweight/analyzer.py
@@ -1,3 +1,10 @@
"""Module for analyzing and summarizing academic papers.

This module provides functionality for analyzing paper content using LLMs (Large Language Models)
and extracting relevant information. It supports different analysis types including abstract
extraction and paper summarization using various LLM providers.
"""

import logging
from typing import Any, Dict

@@ -11,29 +18,61 @@

logger = logging.getLogger(__name__)


def get_abstracts(processed_papers, config):
analysis_type = config.get('type', 'abstract')
"""Extract abstracts or summaries from processed papers based on configuration.

Args:
processed_papers: List of dictionaries containing paper data.
config: Configuration dictionary specifying analysis type and parameters.

if analysis_type == 'abstract':
return [paper['abstract'] for paper in processed_papers]
elif analysis_type == 'summary':
Returns:
List of strings containing either abstracts or summaries based on config type.

Raises:
ValueError: If an unknown analysis type is specified in config.
"""
analysis_type = config.get("type", "abstract")

if analysis_type == "abstract":
return [paper["abstract"] for paper in processed_papers]
elif analysis_type == "summary":
return [summarize_paper(paper, config) for paper in processed_papers]
else:
raise ValueError(f"Unknown analysis type: {analysis_type}")


@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def summarize_paper(paper: Dict[str, Any], config: Dict[str, Any]) -> str:
llm_provider = config.get('analyzer', {}).get('llm_provider', 'openai').lower()
api_key = config.get('analyzer', {}).get('api_key')
"""Generate a summary of a paper using an LLM.

if llm_provider not in ['openai', 'gemini'] or not api_key:
logger.warning(f"No valid LLM provider or API key available for {llm_provider}. Falling back to abstract.")
return paper['abstract']
Args:
paper: Dictionary containing paper data including content and metadata.
config: Configuration dictionary containing LLM settings.

Returns:
A string containing the generated summary.

Raises:
ValueError: If no valid LLM provider or API key is available.
"""
llm_provider = config.get("analyzer", {}).get("llm_provider", "openai").lower()
api_key = config.get("analyzer", {}).get("api_key")

if llm_provider not in ["openai", "gemini"] or not api_key:
logger.warning(
f"No valid LLM provider or API key available for {llm_provider}. Falling back to abstract."
)
return paper["abstract"]

try:
provider = LLMProvider[llm_provider.upper()]
model_name = 'gpt-4o-mini' if provider == LLMProvider.OPENAI else 'gemini-1.5-flash'
llm_instance = LLM.create(provider=provider, model_name=model_name, api_key=api_key)
model_name = (
"gpt-4o-mini" if provider == LLMProvider.OPENAI else "gemini-1.5-flash"
)
llm_instance = LLM.create(
provider=provider, model_name=model_name, api_key=api_key
)
prompt = f"Write a concise, accurate summary of the following paper's content in about 3-5 sentences:\n\n```{paper['content']}```"

input_tokens = count_tokens(prompt)
@@ -47,12 +86,29 @@ def summarize_paper(paper: Dict[str, Any], config: Dict[str, Any]) -> str:
return response
except Exception as e:
logger.error(f"Error summarizing paper: {e}", exc_info=True)
return paper['abstract']
return paper["abstract"]


def create_llm_instance(provider: str, api_key: str) -> LLM:
if provider == 'openai':
return LLM.create(provider=LLMProvider.OPENAI, model_name="gpt-4o-mini", api_key=api_key)
elif provider == 'gemini':
return LLM.create(provider=LLMProvider.GEMINI, model_name="gemini-1.5-flash", api_key=api_key)
"""Create an instance of the specified LLM provider.

Args:
provider: The name of the LLM provider ('openai' or 'gemini').
api_key: API key for the specified provider.

Returns:
An initialized LLM instance.

Raises:
ValueError: If an unsupported provider is specified.
"""
if provider == "openai":
return LLM.create(
provider=LLMProvider.OPENAI, model_name="gpt-4o-mini", api_key=api_key
)
elif provider == "gemini":
return LLM.create(
provider=LLMProvider.GEMINI, model_name="gemini-1.5-flash", api_key=api_key
)
else:
raise ValueError(f"Unsupported LLM provider: {provider}")