diff --git a/README.md b/README.md index cd1bfec..59a8fe2 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,29 @@ # paperweight -This project automatically retrieves, filters, and summarizes recent academic papers from arXiv based on user-specified categories, then sends notifications to the user. +A scalable system for retrieving, filtering, and summarizing academic papers from arXiv based on user preferences, with customizable notifications. ## Features - **ArXiv Integration**: Fetches recent papers from arXiv using their API, ensuring up-to-date access to the latest research. - **Customizable Filtering**: Filters papers based on user-defined preferences, including keywords, categories, and exclusion criteria. -- **Intelligent Summarization** (BETA): Generates concise summaries or extracts abstracts, providing quick insights into paper content. Note: This feature is currently in beta and may have some limitations. +- **Intelligent Summarization** (BETA): Generates concise summaries or extracts abstracts, providing quick insights into paper content. - **Flexible Notification System**: Notifies users via email, with potential for expansion to other notification methods. - **Configurable Settings**: Allows users to fine-tune the application's behavior through a YAML configuration file. 
+## System Architecture + +``` +┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ +│ SCRAPER │────▶│ PROCESSOR │────▶│ ANALYZER │────▶│ NOTIFIER │ +└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘ + │ │ │ │ + ▼ ▼ ▼ ▼ +┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ +│ arXiv API & │ │ Scoring & │ │ Abstract │ │ Email & │ +│ PDF Processing│ │ Filtering │ │ Extraction │ │ Templating │ +└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘ +``` + ## Table of Contents - [Getting Started](#getting-started) - [Installation](#installation) @@ -17,6 +31,7 @@ This project automatically retrieves, filters, and summarizes recent academic pa - [Usage](#usage) - [Configuration](#configuration) - [FAQ and Troubleshooting](#faq-and-troubleshooting) +- [Technical Details](#technical-details) - [Roadmap](#roadmap) - [Glossary](#glossary) - [License](#license) @@ -29,11 +44,13 @@ This project automatically retrieves, filters, and summarizes recent academic pa - Python 3.10 or higher - Required Python packages: - - pypdf - - python-dotenv - - PyYAML - - requests - - simplerllm + - pypdf - For PDF document processing + - python-dotenv - For environment variable management + - PyYAML - For configuration parsing + - requests - For API communication + - simplerllm - For LLM integration + - tenacity - For resilient API interactions + - tiktoken - For token counting ## Installation @@ -98,14 +115,41 @@ For a comprehensive list of frequently asked questions, including setup instruct If you can't find an answer to your question or solution to your problem in the FAQ, please [open an issue](https://github.com/seanbrar/paperweight/issues) on GitHub. +## Technical Details + +### Processing Pipeline + +paperweight processes papers through four main stages: + +1. **Scraping** (`scraper.py`): Fetches recent papers from arXiv's API based on user-defined categories and processes the PDF/LaTeX content. + +2. 
**Processing** (`processor.py`): Calculates relevance scores based on keyword matching, with weights for title, abstract, and content matches, plus handling of exclusion keywords. + +3. **Analysis** (`analyzer.py`): Either extracts the abstract or generates a summary using an LLM (OpenAI or Gemini), with configurable options. + +4. **Notification** (`notifier.py`): Formats the filtered papers and sends them via email, with options for sorting by relevance, date, or title. + +### Resilience Features + +- **Retry Logic**: Uses the `tenacity` library to implement exponential backoff for API calls +- **Error Handling**: Comprehensive error catching and logging throughout the codebase +- **State Persistence**: Maintains processing state between runs using the `last_processed_date.txt` file + +### Performance Considerations + +- **Token Counting**: Uses `tiktoken` to accurately count tokens for LLM context management +- **Configurable Limits**: Allows setting maximum papers per category to control processing time +- **Incremental Processing**: Only fetches papers published since the last run + ## Roadmap Key upcoming features: - Implement machine learning-based paper recommendations - Add support for additional academic paper sources - Expand notification methods +- Enhance batch processing capabilities -For a full list of proposed features and known issues, see the [open issues](https://github.com/seanbrar/paperweight/issues) page or the detailed [roadmap](docs/ROADMAP.md). +For a full list of proposed features and planned enhancements, see the detailed [roadmap](docs/ROADMAP.md). ## Glossary @@ -114,6 +158,8 @@ For a full list of proposed features and known issues, see the [open issues](htt - **YAML**: A human-readable data serialization format used for configuration files. - **SMTP**: Simple Mail Transfer Protocol; used for sending emails. - **LLM**: Large Language Model; an AI model used for text generation and analysis. 
+- **Embedding**: A numerical representation of text that captures semantic meaning. +- **Token**: A unit of text processed by language models, roughly corresponding to 4 characters. ## License diff --git a/docs/ROADMAP.md b/docs/ROADMAP.md index cbc0025..8fc9760 100644 --- a/docs/ROADMAP.md +++ b/docs/ROADMAP.md @@ -1,56 +1,87 @@ # paperweight roadmap -This document outlines the planned features and improvements for the paperweight project. Please note that this roadmap is subject to change based on user feedback and project priorities. +This document outlines planned features and improvements for the paperweight project. The roadmap is organized into focused development areas to create a scalable, efficient academic paper processing system. -## Short-term Goals +## Core System Enhancements -### General Improvements -- [ ] Implement general code cleanup and optimization -- [ ] Increase overall speed through asynchronous operations -- [ ] Create a web-hosted demo of the program +### Performance & Efficiency +- [ ] Implement asynchronous processing for paper fetching and analysis +- [ ] Add configurable batch processing with adjustable batch sizes +- [ ] Create memory usage tracking and optimization for large document sets +- [ ] Implement benchmarking tools to measure and optimize performance + +### Context Management +- [ ] Develop intelligent document chunking for papers exceeding token limits +- [ ] Implement hierarchical summarization for extremely long papers +- [ ] Create a context window awareness system that optimizes token usage +- [ ] Add semantic sectioning to prioritize important paper components + +### Caching Infrastructure +- [ ] Implement persistent caching for paper embeddings and metadata +- [ ] Create smart cache invalidation strategies based on paper updates +- [ ] Develop a disk-based storage system for embeddings to reduce API costs +- [ ] Add cache statistics reporting for optimization insights + +## Module-Specific Improvements ### Scraper 
Module -- [ ] Build and implement PDF extraction evaluations -- [ ] Add retry logic in API/scraper (possibly using tenacity) -- [ ] Revisit and improve date checking logic - - [ ] Develop comprehensive testing suite with dummy papers -- [ ] Parse out unnecessary content (e.g., references, LaTeX preambles) -- [ ] Add support for extracting and handling images from papers +- [ ] Enhance PDF extraction precision with specialized academic paper handling +- [ ] Add support for extracting and processing figures and tables +- [ ] Expand retry logic in API interactions using advanced backoff strategies +- [ ] Improve date-based paper filtering with precise version tracking ### Processor Module -- [ ] Refine and expand the normalization score system for papers +- [ ] Develop enhanced scoring algorithms for more accurate paper relevance +- [ ] Implement sliding window analysis for sequential context processing +- [ ] Create adaptive keyword weighting based on document section importance +- [ ] Add citation network analysis for evaluating paper significance ### Analyzer Module -- [ ] Conduct additional testing of LLM integration -- [ ] Implement rate limits for API calls -- [ ] Explore and potentially add support for a wider selection of models -- [ ] Refine and optimize summarization prompts +- [ ] Expand LLM provider support with a unified interface +- [ ] Implement streaming responses for long paper summarization +- [ ] Create domain-specific summarization templates for different fields +- [ ] Add comparative analysis between related papers ### Notifier Module -- [ ] Improve handling of scenarios where all papers are discarded -- [ ] Revisit and potentially expand the fields included in notifications (e.g., authors) -- [ ] Add more options for paper ordering and field selection in email notifications +- [ ] Develop a modular notification system supporting multiple channels +- [ ] Create customizable templates for notification formatting +- [ ] Implement digest mode for 
batched notifications +- [ ] Add interactive elements to notifications for user feedback + +## Strategic Directions -## Medium-term Goals +### Machine Learning Integration +- [ ] Replace keyword-based filtering with embedding similarity scoring +- [ ] Implement personalized paper recommendations based on user interests +- [ ] Develop citation impact prediction for emerging papers +- [ ] Create a feedback loop to improve future recommendations -- [ ] Replace current static keyword-based filtering with a machine learning recommendation engine - - [ ] Ensure interface compatibility is maintained -- [ ] Expand notification methods beyond email - - [ ] Investigate possibilities like desktop notifications or a desktop agent -- [ ] Rethink the notification system to make SMTP configuration less cumbersome for users +### Expanded Data Sources +- [ ] Add support for multiple academic repositories (PubMed, IEEE, etc.) +- [ ] Implement unified metadata schema across different sources +- [ ] Create source-specific optimizations for each repository +- [ ] Develop cross-repository deduplication -## Long-term Goals +### User Experience +- [ ] Create a simple web interface for configuration and monitoring +- [ ] Develop a local dashboard for visualizing paper recommendations +- [ ] Add personalized preference learning from user interactions +- [ ] Implement saved searches and automated monitoring -- [ ] Add support for additional academic paper sources beyond arXiv -- [ ] Implement machine learning-based paper recommendations -- [ ] Continuously improve and refine the LLM-based summarization feature +## Development Infrastructure -## Ongoing Tasks +### Testing & Quality +- [ ] Expand test coverage with more integration tests +- [ ] Develop performance regression testing +- [ ] Create automated benchmark suites for optimization +- [ ] Implement continuous profiling for memory and CPU usage -- [ ] Maintain and update documentation -- [ ] Address bugs and issues reported by users -- 
[ ] Optimize performance and resource usage +### Documentation +- [ ] Expand API documentation for extensibility +- [ ] Create visual architecture diagrams +- [ ] Develop advanced configuration guides for specific use cases +- [ ] Add code examples for common extension patterns -We welcome contributions and suggestions from the community. If you have ideas for new features or improvements, please open an issue on the [GitHub repository](https://github.com/seanbrar/paperweight/issues). +We welcome contributions and suggestions from the community. If you have ideas for features or improvements, please open an issue on the [GitHub repository](https://github.com/seanbrar/paperweight/issues). For information on how to contribute to paperweight, please see the [contributing guide](docs/CONTRIBUTING.md). \ No newline at end of file diff --git a/setup.py b/setup.py index e322c42..ff22c80 100644 --- a/setup.py +++ b/setup.py @@ -3,7 +3,7 @@ setup( name="paperweight", - version="0.1.1", + version="0.1.2", package_dir={"": "src"}, packages=find_packages(where="src"), install_requires=[ diff --git a/src/paperweight/analyzer.py b/src/paperweight/analyzer.py index c71b2ba..789f121 100644 --- a/src/paperweight/analyzer.py +++ b/src/paperweight/analyzer.py @@ -1,3 +1,10 @@ +"""Module for analyzing and summarizing academic papers. + +This module provides functionality for analyzing paper content using LLMs (Large Language Models) +and extracting relevant information. It supports different analysis types including abstract +extraction and paper summarization using various LLM providers. +""" + import logging from typing import Any, Dict @@ -11,29 +18,61 @@ logger = logging.getLogger(__name__) + def get_abstracts(processed_papers, config): - analysis_type = config.get('type', 'abstract') + """Extract abstracts or summaries from processed papers based on configuration. + + Args: + processed_papers: List of dictionaries containing paper data.
+ config: Configuration dictionary specifying analysis type and parameters. - if analysis_type == 'abstract': - return [paper['abstract'] for paper in processed_papers] - elif analysis_type == 'summary': + Returns: + List of strings containing either abstracts or summaries based on config type. + + Raises: + ValueError: If an unknown analysis type is specified in config. + """ + analysis_type = config.get("type", "abstract") + + if analysis_type == "abstract": + return [paper["abstract"] for paper in processed_papers] + elif analysis_type == "summary": return [summarize_paper(paper, config) for paper in processed_papers] else: raise ValueError(f"Unknown analysis type: {analysis_type}") + @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10)) def summarize_paper(paper: Dict[str, Any], config: Dict[str, Any]) -> str: - llm_provider = config.get('analyzer', {}).get('llm_provider', 'openai').lower() - api_key = config.get('analyzer', {}).get('api_key') + """Generate a summary of a paper using an LLM. - if llm_provider not in ['openai', 'gemini'] or not api_key: - logger.warning(f"No valid LLM provider or API key available for {llm_provider}. Falling back to abstract.") - return paper['abstract'] + Args: + paper: Dictionary containing paper data including content and metadata. + config: Configuration dictionary containing LLM settings. + + Returns: + A string containing the generated summary, or the paper's abstract if no + valid LLM provider or API key is available or if summarization fails. + """ + llm_provider = config.get("analyzer", {}).get("llm_provider", "openai").lower() + api_key = config.get("analyzer", {}).get("api_key") + + if llm_provider not in ["openai", "gemini"] or not api_key: + logger.warning( + f"No valid LLM provider or API key available for {llm_provider}. Falling back to abstract."
+ ) + return paper["abstract"] try: provider = LLMProvider[llm_provider.upper()] - model_name = 'gpt-4o-mini' if provider == LLMProvider.OPENAI else 'gemini-1.5-flash' - llm_instance = LLM.create(provider=provider, model_name=model_name, api_key=api_key) + model_name = ( + "gpt-4o-mini" if provider == LLMProvider.OPENAI else "gemini-1.5-flash" + ) + llm_instance = LLM.create( + provider=provider, model_name=model_name, api_key=api_key + ) prompt = f"Write a concise, accurate summary of the following paper's content in about 3-5 sentences:\n\n```{paper['content']}```" input_tokens = count_tokens(prompt) @@ -47,12 +86,29 @@ def summarize_paper(paper: Dict[str, Any], config: Dict[str, Any]) -> str: return response except Exception as e: logger.error(f"Error summarizing paper: {e}", exc_info=True) - return paper['abstract'] + return paper["abstract"] + def create_llm_instance(provider: str, api_key: str) -> LLM: - if provider == 'openai': - return LLM.create(provider=LLMProvider.OPENAI, model_name="gpt-4o-mini", api_key=api_key) - elif provider == 'gemini': - return LLM.create(provider=LLMProvider.GEMINI, model_name="gemini-1.5-flash", api_key=api_key) + """Create an instance of the specified LLM provider. + + Args: + provider: The name of the LLM provider ('openai' or 'gemini'). + api_key: API key for the specified provider. + + Returns: + An initialized LLM instance. + + Raises: + ValueError: If an unsupported provider is specified. 
+ """ + if provider == "openai": + return LLM.create( + provider=LLMProvider.OPENAI, model_name="gpt-4o-mini", api_key=api_key + ) + elif provider == "gemini": + return LLM.create( + provider=LLMProvider.GEMINI, model_name="gemini-1.5-flash", api_key=api_key + ) else: raise ValueError(f"Unsupported LLM provider: {provider}") diff --git a/src/paperweight/logging_config.py b/src/paperweight/logging_config.py index 4ed82c2..79d9ee1 100644 --- a/src/paperweight/logging_config.py +++ b/src/paperweight/logging_config.py @@ -1,44 +1,63 @@ +"""Module for configuring logging in the paperweight application. + +This module provides functionality for setting up logging with both file and console +handlers, configurable log levels, and standardized formatting. It ensures log directories +exist and handles invalid logging level configurations gracefully. +""" + import logging import logging.config import os def setup_logging(logging_config): - valid_levels = {'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'} - logging_level = logging_config.get('level', 'INFO').upper() + """Set up logging configuration for the application. + + Args: + logging_config: Dictionary containing logging configuration parameters including + 'level' and 'file' settings. 
+ + The function configures both file and console handlers with the following features: + - Console handler with WARNING and above levels + - File handler with the configured level (defaults to INFO) + - Standard format: timestamp - logger_name - level - message + - Automatic creation of log directory if it doesn't exist + """ + valid_levels = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"} + logging_level = logging_config.get("level", "INFO").upper() if logging_level not in valid_levels: - logging_level = 'INFO' + logging_level = "INFO" - log_file = logging_config['file'] + log_file = logging_config["file"] log_dir = os.path.dirname(log_file) if log_dir and not os.path.exists(log_dir): os.makedirs(log_dir, exist_ok=True) logging_config = { - 'version': 1, - 'disable_existing_loggers': False, - 'formatters': { - 'standard': { - 'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s', - 'datefmt': '%Y-%m-%d %H:%M:%S' + "version": 1, + "disable_existing_loggers": False, + "formatters": { + "standard": { + "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s", + "datefmt": "%Y-%m-%d %H:%M:%S", }, }, - 'handlers': { - 'console': { - 'class': 'logging.StreamHandler', - 'formatter': 'standard', - 'level': 'WARNING', + "handlers": { + "console": { + "class": "logging.StreamHandler", + "formatter": "standard", + "level": "WARNING", }, - 'file': { - 'class': 'logging.FileHandler', - 'filename': log_file, - 'formatter': 'standard', - 'level': logging_level, + "file": { + "class": "logging.FileHandler", + "filename": log_file, + "formatter": "standard", + "level": logging_level, }, }, - 'root': { - 'handlers': ['console', 'file'], - 'level': logging_level, + "root": { + "handlers": ["console", "file"], + "level": logging_level, }, } logging.config.dictConfig(logging_config) diff --git a/src/paperweight/main.py b/src/paperweight/main.py index 9cc8970..1cc152b 100644 --- a/src/paperweight/main.py +++ b/src/paperweight/main.py @@ -1,3 +1,10 @@ +"""Main 
module for the paperweight application. + +This module serves as the entry point for the paperweight application, coordinating +the paper fetching, processing, analysis, and notification processes. It handles +configuration loading, logging setup, and the main execution flow of the application. +""" + import argparse import logging import traceback @@ -14,9 +21,20 @@ logger = logging.getLogger(__name__) + def setup_and_get_papers(force_refresh): + """Set up the application and fetch papers. + + Args: + force_refresh: Boolean indicating whether to ignore the last processed date + and fetch all papers within the configured time window. + + Returns: + Tuple of (papers, config) where papers is a list of paper dictionaries and + config is the loaded configuration dictionary. + """ config = load_config() - setup_logging(config['logging']) + setup_logging(config["logging"]) logger.info("Configuration loaded successfully") if force_refresh: @@ -25,27 +43,54 @@ def setup_and_get_papers(force_refresh): else: return get_recent_papers(), config + def process_and_summarize_papers(recent_papers, config): + """Process and analyze papers based on configured criteria. + + Args: + recent_papers: List of paper dictionaries to process. + config: Configuration dictionary containing processing parameters. + + Returns: + List of processed papers with relevance scores and summaries. + """ if not recent_papers: logger.info("No new papers to process. Exiting.") return None - processed_papers = process_papers(recent_papers, config['processor']) + processed_papers = process_papers(recent_papers, config["processor"]) logger.info(f"Processed {len(processed_papers)} papers") if not processed_papers: logger.info("No papers met the relevance criteria. 
Exiting.") return None - summaries = get_abstracts(processed_papers, config['analyzer']) + summaries = get_abstracts(processed_papers, config["analyzer"]) for paper, summary in zip(processed_papers, summaries): - paper['summary'] = summary if summary else paper.get('abstract', 'No summary available') + paper["summary"] = ( + summary if summary else paper.get("abstract", "No summary available") + ) return processed_papers + def main(): - parser = argparse.ArgumentParser(description="paperweight: Fetch and process arXiv papers") - parser.add_argument('--force-refresh', action='store_true', help='Force refresh papers regardless of last processed date') + """Main entry point for the paperweight application. + + This function parses command line arguments, coordinates the paper processing + pipeline, and handles any errors that occur during execution. + + Returns: + 0 on successful execution, 1 on error. + """ + parser = argparse.ArgumentParser( + description="paperweight: Fetch and process arXiv papers" + ) + parser.add_argument( + "--force-refresh", + action="store_true", + help="Force refresh papers regardless of last processed date", + ) args = parser.parse_args() try: @@ -53,7 +98,9 @@ def main(): processed_papers = process_and_summarize_papers(recent_papers, config) if processed_papers: - notification_sent = compile_and_send_notifications(processed_papers, config['notifier']) + notification_sent = compile_and_send_notifications( + processed_papers, config["notifier"] + ) if notification_sent: logger.info("Notifications compiled and sent successfully") else: @@ -69,6 +116,7 @@ def main(): except Exception as e: logger.error(f"An unexpected error occurred: {e}") + if __name__ == "__main__": try: main() diff --git a/src/paperweight/notifier.py b/src/paperweight/notifier.py index 30277fb..6db93b2 100644 --- a/src/paperweight/notifier.py +++ b/src/paperweight/notifier.py @@ -1,3 +1,10 @@ +"""Module for sending email notifications about processed papers. 
+ +This module handles the creation and sending of email notifications about relevant papers +that have been processed. It includes functionality for composing email content and +sending emails through SMTP servers. +""" + import logging import smtplib from email.mime.multipart import MIMEMultipart @@ -5,20 +12,31 @@ logger = logging.getLogger(__name__) + def send_email_notification(subject, body, config): - from_email = config['email']['from'] - from_password = config['email']['password'] - to_email = config['email']['to'] - smtp_server = config['email']['smtp_server'] - smtp_port = config['email']['smtp_port'] + """Send an email notification using the configured SMTP server. + + Args: + subject: The subject line of the email. + body: The body text of the email. + config: Configuration dictionary containing email settings. + + Raises: + smtplib.SMTPException: If there is an error sending the email. + """ + from_email = config["email"]["from"] + from_password = config["email"]["password"] + to_email = config["email"]["to"] + smtp_server = config["email"]["smtp_server"] + smtp_port = config["email"]["smtp_port"] # Create the email msg = MIMEMultipart() - msg['From'] = from_email - msg['To'] = to_email - msg['Subject'] = subject + msg["From"] = from_email + msg["To"] = to_email + msg["Subject"] = subject - msg.attach(MIMEText(body, 'plain')) + msg.attach(MIMEText(body, "plain")) # Send the email try: @@ -28,23 +46,32 @@ def send_email_notification(subject, body, config): text = msg.as_string() server.sendmail(from_email, to_email, text) server.quit() - logger.info("Email sent successfully") - return True + logger.info("Email notification sent successfully") except Exception as e: - logger.error(f"Failed to send email: {e}") - return False + logger.error(f"Failed to send email notification: {e}", exc_info=True) + raise + def compile_and_send_notifications(papers, config): + """Compile paper information and send email notifications. 
+ + Args: + papers: List of dictionaries containing paper data. + config: Configuration dictionary containing email and notification settings. + + Returns: + True if notifications were sent successfully; None when there are no + papers to send. + """ if not papers: logger.info("No papers to send notifications for.") return - sort_order = config.get('email', {}).get('sort_order', 'relevance') + sort_order = config.get("email", {}).get("sort_order", "relevance") - if sort_order == 'alphabetical': - papers = sorted(papers, key=lambda x: x['title'].lower()) - elif sort_order == 'publication_time': - papers = sorted(papers, key=lambda x: x['date'], reverse=True) + if sort_order == "alphabetical": + papers = sorted(papers, key=lambda x: x["title"].lower()) + elif sort_order == "publication_time": + papers = sorted(papers, key=lambda x: x["date"], reverse=True) # For 'relevance' or any other value, we keep the existing order (already sorted by relevance) subject = "New Papers from ArXiv" diff --git a/src/paperweight/processor.py b/src/paperweight/processor.py index 3a01ea0..8c698e7 100644 --- a/src/paperweight/processor.py +++ b/src/paperweight/processor.py @@ -1,3 +1,10 @@ +"""Module for processing and scoring academic papers. + +This module handles the processing of papers including scoring based on relevance criteria, +keyword matching, and importance weighting. It provides functionality for filtering papers +based on minimum score thresholds and normalizing scores across multiple papers. +""" + import logging import math import re @@ -6,80 +13,152 @@ logger = logging.getLogger(__name__) -def process_papers(papers: List[Dict[str, Any]], processor_config: Dict[str, Any]) -> List[Dict[str, Any]]: + +def process_papers( + papers: List[Dict[str, Any]], processor_config: Dict[str, Any] +) -> List[Dict[str, Any]]: + """Process and score a list of papers based on configured criteria. + + Args: + papers: List of dictionaries containing paper data.
+ processor_config: Configuration dictionary containing scoring parameters and thresholds. + + Returns: + List of processed papers with relevance scores, sorted by normalized score. + """ processed_papers = [] for paper in papers: score, score_breakdown = calculate_paper_score(paper, processor_config) logger.debug(f"Paper '{paper['title']}' scored {score}") - if score >= processor_config['min_score']: - paper['relevance_score'] = score - paper['score_breakdown'] = score_breakdown + if score >= processor_config["min_score"]: + paper["relevance_score"] = score + paper["score_breakdown"] = score_breakdown processed_papers.append(paper) else: - logger.debug(f"Paper '{paper['title']}' filtered out. Score {score} < min_score {processor_config['min_score']}") + logger.debug( + f"Paper '{paper['title']}' filtered out. Score {score} < min_score {processor_config['min_score']}" + ) logger.debug(f"Processed {len(processed_papers)} papers out of {len(papers)}") processed_papers = normalize_scores(processed_papers) - return sorted(processed_papers, key=lambda x: x['normalized_score'], reverse=True) + return sorted(processed_papers, key=lambda x: x["normalized_score"], reverse=True) + def normalize_scores(papers: List[Dict[str, Any]]) -> List[Dict[str, Any]]: + """Normalize relevance scores across all papers to a 0-1 scale. + + Args: + papers: List of dictionaries containing paper data with relevance scores. + + Returns: + List of papers with added normalized_score field. 
+ """ if not papers: return papers - max_score = max(paper['relevance_score'] for paper in papers) - min_score = min(paper['relevance_score'] for paper in papers) + max_score = max(paper["relevance_score"] for paper in papers) + min_score = min(paper["relevance_score"] for paper in papers) for paper in papers: if max_score != min_score: - paper['normalized_score'] = (paper['relevance_score'] - min_score) / (max_score - min_score) + paper["normalized_score"] = (paper["relevance_score"] - min_score) / ( + max_score - min_score + ) else: - paper['normalized_score'] = 1.0 + paper["normalized_score"] = 1.0 logger.debug("Normalized scores calculated") return papers + def calculate_paper_score(paper, config): + """Calculate a relevance score for a paper based on configured criteria. + + Args: + paper: Dictionary containing paper data including content and metadata. + config: Configuration dictionary containing scoring parameters. + + Returns: + Tuple of (total_score, score_breakdown) where score_breakdown is a dictionary + containing individual component scores. 
+ """ score = 0 score_breakdown = {} # Keyword matching - title_keywords = count_keywords(paper['title'], config['keywords']) - abstract_keywords = count_keywords(paper['abstract'], config['keywords']) - content_keywords = count_keywords(paper['content'], config['keywords']) + title_keywords = count_keywords(paper["title"], config["keywords"]) + abstract_keywords = count_keywords(paper["abstract"], config["keywords"]) + content_keywords = count_keywords(paper["content"], config["keywords"]) max_title_score = 50 max_abstract_score = 50 max_content_score = 25 - title_score = min(title_keywords * config['title_keyword_weight'], max_title_score) - abstract_score = min(abstract_keywords * config['abstract_keyword_weight'], max_abstract_score) - content_score = min(content_keywords * config['content_keyword_weight'], max_content_score) + title_score = min(title_keywords * config["title_keyword_weight"], max_title_score) + abstract_score = min( + abstract_keywords * config["abstract_keyword_weight"], max_abstract_score + ) + content_score = min( + content_keywords * config["content_keyword_weight"], max_content_score + ) score += title_score + abstract_score + content_score - score_breakdown['keyword_matching'] = { - 'title': round(title_score, 2), - 'abstract': round(abstract_score, 2), - 'content': round(content_score, 2) + score_breakdown["keyword_matching"] = { + "title": round(title_score, 2), + "abstract": round(abstract_score, 2), + "content": round(content_score, 2), } # Exclusion list - exclusion_count = count_keywords(paper['content'], config['exclusion_keywords']) - exclusion_score = min(exclusion_count * config['exclusion_keyword_penalty'], max_content_score) + exclusion_count = count_keywords(paper["content"], config["exclusion_keywords"]) + exclusion_score = min( + exclusion_count * config["exclusion_keyword_penalty"], max_content_score + ) score -= exclusion_score - score_breakdown['exclusion_penalty'] = -round(exclusion_score, 2) + 
score_breakdown["exclusion_penalty"] = -round(exclusion_score, 2) # Simple text analysis - important_word_count = count_important_words(paper['content'], config['important_words']) - important_word_score = min(important_word_count * config['important_words_weight'], max_content_score) + important_word_count = count_important_words( + paper["content"], config["important_words"] + ) + important_word_score = min( + important_word_count * config["important_words_weight"], max_content_score + ) score += important_word_score - score_breakdown['important_words'] = round(important_word_score, 2) + score_breakdown["important_words"] = round(important_word_score, 2) + + return max(score, 0), score_breakdown # Ensure score is not negative - return max(score, 0), score_breakdown # Ensure score is not negative def count_keywords(text, keywords): - return sum(math.log(text.lower().count(keyword.lower()) + 1) for keyword in keywords) + """Score keyword occurrences in text on a log scale. + + Args: + text: The text to search in. + keywords: List of keywords to count. + + Returns: + The sum of log-scaled occurrence counts across all keywords. + """ + return sum( + math.log(text.lower().count(keyword.lower()) + 1) for keyword in keywords + ) + def count_important_words(text, important_words): - words = re.findall(r'\w+', text.lower()) + """Score occurrences of important words in text on a log scale. + + Args: + text: The text to search in. + important_words: List of important words to count. + + Returns: + The sum of log-scaled occurrence counts for the important words found in the text.
+ """ + words = re.findall(r"\w+", text.lower()) word_counts = Counter(words) - return sum(math.log(word_counts[word.lower()] + 1) for word in important_words if word.lower() in word_counts) + return sum( + math.log(word_counts[word.lower()] + 1) + for word in important_words + if word.lower() in word_counts + ) diff --git a/src/paperweight/scraper.py b/src/paperweight/scraper.py index 6d69f0d..1e57dc1 100644 --- a/src/paperweight/scraper.py +++ b/src/paperweight/scraper.py @@ -1,3 +1,10 @@ +"""Module for fetching and processing arXiv papers. + +This module handles all interactions with the arXiv API, including fetching paper metadata, +downloading PDFs, and extracting text content. It includes retry mechanisms for robust +API interactions and various methods for processing paper content. +""" + import gzip import io import logging @@ -26,12 +33,29 @@ logger = logging.getLogger(__name__) + @retry( stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), - retry=retry_if_exception_type((requests.ConnectionError, requests.Timeout)) + retry=retry_if_exception_type((requests.ConnectionError, requests.Timeout)), ) -def fetch_arxiv_papers(category: str, start_date: date, max_results: Optional[int] = None) -> List[Dict[str, Any]]: +def fetch_arxiv_papers( + category: str, start_date: date, max_results: Optional[int] = None +) -> List[Dict[str, Any]]: + """Fetch papers from arXiv API for a specific category and date range. + + Args: + category: The arXiv category to fetch papers from (e.g., 'cs.AI'). + start_date: The date from which to start fetching papers. + max_results: Optional maximum number of results to return. + + Returns: + List of dictionaries containing paper metadata. + + Raises: + requests.ConnectionError: If connection to arXiv API fails. + requests.Timeout: If the request times out. + """ logger.debug(f"Fetching arXiv papers for category '{category}' since {start_date}") base_url = "http://export.arxiv.org/api/query?" 
query = f"cat:{category}" @@ -39,7 +63,7 @@ def fetch_arxiv_papers(category: str, start_date: date, max_results: Optional[in "search_query": query, "start": 0, "sortBy": "submittedDate", - "sortOrder": "descending" + "sortOrder": "descending", } if max_results is not None and max_results > 0: params["max_results"] = max_results @@ -49,8 +73,12 @@ def fetch_arxiv_papers(category: str, start_date: date, max_results: Optional[in response.raise_for_status() except HTTPError as http_err: if response.status_code == 400 and "Invalid field: cat" in response.text: - logger.error(f"Invalid arXiv category: {category}. Please check your configuration.") - raise ValueError(f"Invalid arXiv category: {category}. Please check your configuration.") from http_err + logger.error( + f"Invalid arXiv category: {category}. Please check your configuration." + ) + raise ValueError( + f"Invalid arXiv category: {category}. Please check your configuration." + ) from http_err else: logger.error(f"HTTP error occurred: {http_err}") raise @@ -58,13 +86,18 @@ def fetch_arxiv_papers(category: str, start_date: date, max_results: Optional[in root = ET.fromstring(response.content) papers = [] - for entry in root.findall('{http://www.w3.org/2005/Atom}entry'): - title_elem = entry.find('{http://www.w3.org/2005/Atom}title') - link_elem = entry.find('{http://www.w3.org/2005/Atom}id') - published_elem = entry.find('{http://www.w3.org/2005/Atom}published') - summary_elem = entry.find('{http://www.w3.org/2005/Atom}summary') - - if title_elem is None or link_elem is None or published_elem is None or summary_elem is None: + for entry in root.findall("{http://www.w3.org/2005/Atom}entry"): + title_elem = entry.find("{http://www.w3.org/2005/Atom}title") + link_elem = entry.find("{http://www.w3.org/2005/Atom}id") + published_elem = entry.find("{http://www.w3.org/2005/Atom}published") + summary_elem = entry.find("{http://www.w3.org/2005/Atom}summary") + + if ( + title_elem is None + or link_elem is None + or 
published_elem is None + or summary_elem is None + ): logger.warning("Skipping entry due to missing required elements") continue @@ -82,27 +115,37 @@ def fetch_arxiv_papers(category: str, start_date: date, max_results: Optional[in logger.debug(f"Paper '{title}' submitted on {submitted_date}") if submitted_date < start_date: - logger.debug(f"Stopping fetch: paper date {submitted_date} is before start date {start_date}") + logger.debug( + f"Stopping fetch: paper date {submitted_date} is before start date {start_date}" + ) break - papers.append({ - "title": title, - "link": link, - "date": submitted_date, - "abstract": abstract - }) + papers.append( + {"title": title, "link": link, "date": submitted_date, "abstract": abstract} + ) if max_results is not None and max_results > 0 and len(papers) >= max_results: logger.debug(f"Reached max_results limit of {max_results}") break - logger.info(f"Successfully fetched {len(papers)} papers for category '{category}' since {start_date}") + logger.info( + f"Successfully fetched {len(papers)} papers for category '{category}' since {start_date}" + ) return papers + def fetch_recent_papers(start_days=1): + """Fetch papers published within the last specified number of days. + + Args: + start_days: Number of days to look back for papers. + + Returns: + List of dictionaries containing paper metadata. 
+ """ config = load_config() - categories = config['arxiv']['categories'] - max_results = config['arxiv'].get('max_results', 0) # Default to 0 if not set + categories = config["arxiv"]["categories"] + max_results = config["arxiv"].get("max_results", 0) # Default to 0 if not set end_date = datetime.now().date() start_date = end_date - timedelta(days=start_days) @@ -114,9 +157,19 @@ def fetch_recent_papers(start_days=1): for category in categories: logger.info(f"Processing category: {category}") try: - papers = fetch_arxiv_papers(category, start_date, max_results=max_results if max_results > 0 else None) - new_papers = [paper for paper in papers if paper['link'].split('/abs/')[-1] not in processed_ids] - processed_ids.update(paper['link'].split('/abs/')[-1] for paper in new_papers) + papers = fetch_arxiv_papers( + category, + start_date, + max_results=max_results if max_results > 0 else None, + ) + new_papers = [ + paper + for paper in papers + if paper["link"].split("/abs/")[-1] not in processed_ids + ] + processed_ids.update( + paper["link"].split("/abs/")[-1] for paper in new_papers + ) if max_results > 0: new_papers = new_papers[:max_results] @@ -130,22 +183,38 @@ def fetch_recent_papers(start_days=1): logger.info(f"Fetched a total of {len(all_papers)} papers") return all_papers + @retry( stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), - retry=retry_if_exception_type((requests.ConnectionError, requests.Timeout, requests.RequestException)) + retry=retry_if_exception_type( + (requests.ConnectionError, requests.Timeout, requests.RequestException) + ), ) def fetch_paper_content(paper_id): + """Fetch the content of a specific paper from arXiv. + + Args: + paper_id: The arXiv ID of the paper to fetch. + + Returns: + Tuple of (content, method) where method indicates the source type. + + Raises: + requests.ConnectionError: If connection to arXiv fails. + requests.Timeout: If the request times out. 
+ requests.RequestException: For other request-related errors. + """ logger.debug(f"Fetching content for paper ID: {paper_id}") - source_url = f'http://export.arxiv.org/e-print/{paper_id}' - pdf_url = f'https://export.arxiv.org/pdf/{paper_id}' + source_url = f"http://export.arxiv.org/e-print/{paper_id}" + pdf_url = f"https://export.arxiv.org/pdf/{paper_id}" try: # Try to fetch source first response = requests.get(source_url, timeout=30) response.raise_for_status() logger.debug(f"Successfully fetched source for paper ID: {paper_id}") - return response.content, 'source' + return response.content, "source" except requests.RequestException as e: logger.warning(f"Failed to fetch source for paper ID: {paper_id}. Error: {e}") @@ -154,14 +223,23 @@ def fetch_paper_content(paper_id): response = requests.get(pdf_url, timeout=30) response.raise_for_status() logger.debug(f"Successfully fetched PDF for paper ID: {paper_id}") - return response.content, 'pdf' + return response.content, "pdf" except requests.RequestException as e: logger.warning(f"Failed to fetch PDF for paper ID: {paper_id}. Error: {e}") logger.error(f"Failed to fetch content for paper ID: {paper_id}") return None, None + def extract_text_from_pdf(pdf_content): + """Extract text content from a PDF file. + + Args: + pdf_content: Binary content of the PDF file. + + Returns: + Extracted text as a string. + """ pdf_file = io.BytesIO(pdf_content) pdf_reader = PdfReader(pdf_file) text = "" @@ -169,11 +247,21 @@ def extract_text_from_pdf(pdf_content): text += page.extract_text() return text + def extract_text_from_source(content, method): - if method not in ['pdf', 'source']: + """Extract text from various source formats. + + Args: + content: The content to extract text from. + method: The method to use for extraction ('pdf' or 'source'). + + Returns: + Extracted text as a string. 
+ """ + if method not in ["pdf", "source"]: raise ValueError(f"Invalid source type: {method}") - if method == 'pdf': + if method == "pdf": return extract_text_from_pdf(content) # Try to decompress gzip content @@ -190,11 +278,11 @@ def extract_text_from_source(content, method): for member in tar.getmembers(): if member.isfile(): _, ext = os.path.splitext(member.name) - if ext.lower() in ['.tex', '.txt', '.log']: + if ext.lower() in [".tex", ".txt", ".log"]: f = tar.extractfile(member) if f: - text += f.read().decode('utf-8', errors='ignore') - elif ext.lower() in ['.png', '.jpg', '.jpeg']: + text += f.read().decode("utf-8", errors="ignore") + elif ext.lower() in [".png", ".jpg", ".jpeg"]: # Optionally log the presence of image files logger.debug(f"Skipping image file: {member.name}") else: @@ -202,9 +290,18 @@ def extract_text_from_source(content, method): return text else: # If it's not a tar file, assume it's a single file - return decompressed.decode('utf-8', errors='ignore') + return decompressed.decode("utf-8", errors="ignore") + def fetch_paper_contents(paper_ids): + """Fetch contents for multiple papers sequentially, pausing between requests to respect arXiv rate limits. + + Args: + paper_ids: List of arXiv paper IDs to fetch. + + Returns: + List of (content, method) tuples, one per paper ID. + """ contents = [] total_papers = len(paper_ids) logger.info(f"Fetching content for {total_papers} papers") @@ -218,7 +315,9 @@ def fetch_paper_contents(paper_ids): if (i + 1) % 4 == 0: time.sleep(1) - logger.debug(f"Processed {i + 1}/{total_papers} papers. Waiting 1 second...") + logger.debug( + f"Processed {i + 1}/{total_papers} papers. Waiting 1 second..." + ) if (i + 1) % 20 == 0: logger.info(f"Processed {i + 1}/{total_papers} papers") @@ -226,7 +325,16 @@ def fetch_paper_contents(paper_ids): logger.info(f"Finished fetching content for all {total_papers} papers") return contents + def get_recent_papers(force_refresh=False): + """Get recent papers, either from cache or by fetching new ones.
+ + Args: + force_refresh: If True, ignore cache and fetch new papers. + + Returns: + List of dictionaries containing paper metadata. + """ last_processed_date = get_last_processed_date() logger.info(f"Last processed date: {last_processed_date}") current_date = datetime.now().date() @@ -244,12 +352,14 @@ def get_recent_papers(force_refresh=False): elif days > 7: # If more than a week has passed, limit to 7 days to avoid overload days = 7 - logger.warning(f"More than a week since last run. Limiting fetch to last {days} days.") + logger.warning( + f"More than a week since last run. Limiting fetch to last {days} days." + ) logger.info(f"Fetching papers for the last {days} days") recent_papers = fetch_recent_papers(days) logger.info(f"Fetched {len(recent_papers)} recent papers") - paper_ids = [paper['link'].split('/abs/')[-1] for paper in recent_papers] + paper_ids = [paper["link"].split("/abs/")[-1] for paper in recent_papers] contents = fetch_paper_contents(paper_ids) @@ -259,22 +369,25 @@ def get_recent_papers(force_refresh=False): logger.debug(f"Extracting text for paper ID: {paper_id}") text = extract_text_from_source(content, method) - papers_with_content.append({ - "id": paper_id, - "title": paper['title'], - "link": paper['link'], - "date": paper['date'], - "abstract": paper['abstract'], - "content": text, - "content_type": method - }) + papers_with_content.append( + { + "id": paper_id, + "title": paper["title"], + "link": paper["link"], + "date": paper["date"], + "abstract": paper["abstract"], + "content": text, + "content_type": method, + } + ) if papers_with_content: save_last_processed_date(current_date) - logger.info(f"Processed {len(papers_with_content)} papers. Last processed date updated to {current_date}") + logger.info( + f"Processed {len(papers_with_content)} papers. 
Last processed date updated to {current_date}" + ) else: logger.info("No new papers found.") logger.info(f"Returning {len(papers_with_content)} papers with content") return papers_with_content - diff --git a/src/paperweight/utils.py b/src/paperweight/utils.py index f7728c9..9df91ea 100644 --- a/src/paperweight/utils.py +++ b/src/paperweight/utils.py @@ -1,3 +1,11 @@ +"""Utility functions for the paperweight application. + +This module provides various utility functions for configuration management, +environment variable handling, date tracking, and token counting. It includes +functions for loading and validating configuration, expanding environment variables, +and managing the last processed date for paper fetching. +""" + import logging import os import re @@ -11,7 +19,16 @@ logger = logging.getLogger(__name__) + def expand_env_vars(config): + """Recursively expand environment variables in configuration values. + + Args: + config: Configuration object (dict, list, or scalar value). + + Returns: + Configuration object with environment variables expanded. + """ if isinstance(config, dict): return {k: expand_env_vars(v) for k, v in config.items()} elif isinstance(config, list): @@ -21,8 +38,20 @@ def expand_env_vars(config): else: return config + def override_with_env(config): - env_prefix = 'PAPERWEIGHT_' + """Override configuration values with environment variables. + + Args: + config: Configuration dictionary to override. + + Returns: + Configuration dictionary with values overridden by environment variables. + + Environment variables should be prefixed with 'PAPERWEIGHT_' and use uppercase. + Nested configuration keys are joined with underscores. 
+ """ + env_prefix = "PAPERWEIGHT_" for key, value in config.items(): env_var = f"{env_prefix}{key.upper()}" if isinstance(value, dict): @@ -30,7 +59,7 @@ def override_with_env(config): elif env_var in os.environ: env_value = os.environ[env_var] if isinstance(value, bool): - config[key] = env_value.lower() in ('true', '1', 'yes') + config[key] = env_value.lower() in ("true", "1", "yes") elif isinstance(value, int): config[key] = int(env_value) elif isinstance(value, float): @@ -39,11 +68,25 @@ def override_with_env(config): config[key] = env_value return config -def load_config(config_path='config.yaml'): + +def load_config(config_path="config.yaml"): + """Load and validate the application configuration. + + Args: + config_path: Path to the YAML configuration file. + + Returns: + Dictionary containing the validated configuration. + + Raises: + FileNotFoundError: If the configuration file doesn't exist. + yaml.YAMLError: If the configuration file is invalid YAML. + ValueError: If the configuration is invalid. 
+ """ try: load_dotenv() - with open(config_path, 'r') as config_file: + with open(config_path, "r") as config_file: config = yaml.safe_load(config_file) if config is None: raise ValueError("Empty configuration file") @@ -52,23 +95,23 @@ config = override_with_env(config) # Handle API keys - if config['analyzer']['type'] == 'summary': - llm_provider = config['analyzer'].get('llm_provider') + if config["analyzer"]["type"] == "summary": + llm_provider = config["analyzer"].get("llm_provider") if not llm_provider: raise ValueError("LLM provider not specified for summary analyzer type") - api_key_from_config = config['analyzer'].get('api_key') - api_key_from_env = os.getenv(f'{llm_provider.upper()}_API_KEY') + api_key_from_config = config["analyzer"].get("api_key") + api_key_from_env = os.getenv(f"{llm_provider.upper()}_API_KEY") api_key = api_key_from_config or api_key_from_env if api_key: - config['analyzer']['api_key'] = api_key + config["analyzer"]["api_key"] = api_key else: raise ValueError(f"Missing API key for {llm_provider}") else: pass - if 'arxiv' in config and 'max_results' in config['arxiv']: - config['arxiv']['max_results'] = int(config['arxiv']['max_results']) + if "arxiv" in config and "max_results" in config["arxiv"]: + config["arxiv"]["max_results"] = int(config["arxiv"]["max_results"]) check_config(config) logger.info("Configuration loaded and validated successfully") @@ -89,84 +132,175 @@ logger.error(f"Exception in load_config: {str(e)}") raise + def check_config(config): + """Check if the configuration is valid. + + Args: + config: Configuration dictionary to validate. + + Raises: + ValueError: If any required configuration is missing or invalid.
+ """ if not isinstance(config, dict): raise ValueError("Configuration must be a dictionary") try: _check_required_sections(config) - _check_arxiv_section(config['arxiv']) - _check_analyzer_section(config['analyzer']) - _check_notifier_section(config['notifier']) - _check_logging_section(config['logging']) + _check_arxiv_section(config["arxiv"]) + _check_analyzer_section(config["analyzer"]) + _check_notifier_section(config["notifier"]) + _check_logging_section(config["logging"]) except KeyError as e: raise ValueError(f"Missing required section or key: {e}") + def _check_required_sections(config): - required_sections = ['arxiv', 'processor', 'analyzer', 'notifier', 'logging'] + """Check if all required configuration sections are present. + + Args: + config: Configuration dictionary to check. + + Raises: + ValueError: If any required section is missing. + """ + required_sections = ["arxiv", "processor", "analyzer", "notifier", "logging"] for section in required_sections: if section not in config: raise ValueError(f"Missing required section: '{section}'") + def _check_arxiv_section(arxiv): - if 'categories' not in arxiv: + """Validate the arXiv section of the configuration. + + Args: + arxiv: arXiv configuration dictionary. + + Raises: + ValueError: If arXiv configuration is invalid. 
+ """ + if "categories" not in arxiv: raise ValueError("Missing required subsection: 'categories' in 'arxiv'") - invalid_categories = [cat for cat in arxiv['categories'] if not is_valid_arxiv_category(cat)] + invalid_categories = [ + cat for cat in arxiv["categories"] if not is_valid_arxiv_category(cat) + ] if invalid_categories: raise ValueError(f"Invalid arXiv category: {', '.join(invalid_categories)}") - if 'max_results' in arxiv: + if "max_results" in arxiv: try: - max_results = int(arxiv['max_results']) + max_results = int(arxiv["max_results"]) except ValueError: raise ValueError("'max_results' in 'arxiv' section must be a valid integer") if max_results < 0: - raise ValueError("'max_results' in 'arxiv' section must be a non-negative integer") + raise ValueError( + "'max_results' in 'arxiv' section must be a non-negative integer" + ) + def _check_analyzer_section(analyzer): - valid_analyzer_types = ['abstract', 'summary'] - if analyzer.get('type') not in valid_analyzer_types: + """Validate the analyzer section of the configuration. + + Args: + analyzer: Analyzer configuration dictionary. + + Raises: + ValueError: If analyzer configuration is invalid. + """ + valid_analyzer_types = ["abstract", "summary"] + if analyzer.get("type") not in valid_analyzer_types: raise ValueError(f"Invalid analyzer type: '{analyzer.get('type')}'") - if analyzer.get('type') == 'summary': - valid_llm_providers = ['openai', 'gemini'] - if analyzer.get('llm_provider') not in valid_llm_providers: + if analyzer.get("type") == "summary": + valid_llm_providers = ["openai", "gemini"] + if analyzer.get("llm_provider") not in valid_llm_providers: raise ValueError(f"Invalid LLM provider: '{analyzer.get('llm_provider')}'") + def _check_notifier_section(notifier): - if 'email' not in notifier: + """Validate the notifier section of the configuration. + + Args: + notifier: Notifier configuration dictionary. + + Raises: + ValueError: If notifier configuration is invalid. 
+ """ + if "email" not in notifier: raise ValueError("Missing required subsection: 'email' in 'notifier'") - required_email_fields = ['to', 'from', 'password', 'smtp_server', 'smtp_port'] + required_email_fields = ["to", "from", "password", "smtp_server", "smtp_port"] for field in required_email_fields: - if field not in notifier['email']: + if field not in notifier["email"]: raise ValueError(f"Missing required email field: '{field}'") + def _check_logging_section(logging): - valid_logging_levels = ['DEBUG', 'INFO', 'WARNING', 'ERROR'] - if logging.get('level') not in valid_logging_levels: + """Validate the logging section of the configuration. + + Args: + logging: Logging configuration dictionary. + + Raises: + ValueError: If logging configuration is invalid. + """ + valid_logging_levels = ["DEBUG", "INFO", "WARNING", "ERROR"] + if logging.get("level") not in valid_logging_levels: raise ValueError(f"Invalid logging level: '{logging.get('level')}'") + def is_valid_arxiv_category(category): + """Check if an arXiv category string is valid. + + Args: + category: arXiv category string to validate. + + Returns: + bool: True if the category format is valid. + """ # A simple method to catch obviously invalid categories - pattern = r'^[a-z]+\.[A-Z]{2,}$' + pattern = r"^[a-z]+\.[A-Z]{2,}$" return bool(re.match(pattern, category)) + def get_last_processed_date(): + """Get the date when papers were last processed. + + Returns: + datetime.date: The last processed date if available, None otherwise. + """ try: if os.path.exists(LAST_PROCESSED_DATE_FILE): - with open(LAST_PROCESSED_DATE_FILE, 'r') as f: + with open(LAST_PROCESSED_DATE_FILE, "r") as f: date_str = f.read().strip() return datetime.strptime(date_str, "%Y-%m-%d").date() except (IOError, ValueError) as e: logger.error(f"Error reading last processed date: {e}") return None + def save_last_processed_date(date): + """Save the date when papers were last processed. + + Args: + date: datetime.date object to save. 
+ """ try: - with open(LAST_PROCESSED_DATE_FILE, 'w') as f: + with open(LAST_PROCESSED_DATE_FILE, "w") as f: f.write(date.strftime("%Y-%m-%d")) logger.info(f"Saved last processed date: {date}") except IOError as e: logger.error(f"Error saving last processed date: {e}") + def count_tokens(text): + """Count the number of tokens in a text string using tiktoken. + + Args: + text: String to count tokens in. + + Returns: + int: Number of tokens in the text. + """ encoding = tiktoken.encoding_for_model("gpt-3.5-turbo") - return len(encoding.encode(text, allowed_special={'<|endoftext|>'})) + return len(encoding.encode(text, allowed_special={"<|endoftext|>"}))