This repository is an extended version of Taylor Denouden's bb-ai-oceanography project. While the original repository focused on analyzing paper abstracts using Large Language Models (LLMs), this fork expands the scope to full-text analysis and fully cited report generation for scientific papers on machine learning applications in oceanography. Report generation could also build on Taylor's earlier work identifying topic areas: for example, report queries could focus on specific topic areas identified in the abstract analysis.
## Table of Contents

- Project Overview
- Key Features
- Getting Started
- Report Configuration
- Usage
- Project Structure
- Acknowledgments
## Project Overview

The project aims to provide a comprehensive analysis of the current state of machine learning in oceanography by:
- Downloading full-text scientific papers from various sources
- Processing and extracting relevant information from the papers using paperetl
- Analyzing the content using natural language processing techniques with paperai
- Generating reports to identify trends and patterns in the field
## Key Features

- Multi-source paper retrieval (Elsevier, Springer, Wiley, arXiv, Unpaywall)
- Full-text extraction from PDFs and XMLs using GROBID and paperetl
- Robust error handling and logging
- Rate-limited API calls to respect usage policies
- Machine learning-based analysis and report generation using paperai
Changes made to paperetl:
- Extracting paragraphs instead of sentences for improved summary quality
- Improved error handling
- Added support for more XML formats
Changes made to paperai:
- Enhanced citation system with paragraph-level precision (e.g., [1.2] refers to article 1, paragraph 2)
- Hover text on citations now shows the exact paragraph being cited along with article metadata
- Support for Google's Gemini API with optimized configurations for both summary generation and QA tasks
One of the major improvements to paperai's report generation made in this repository is the addition of a properly cited summary section. An LLM (accessed through an API or run locally) generates the summary for a given query. More specifically, the query is embedded by an embedding model (which turns text into vectors of fixed length), and the `topn` most similar paragraphs in the database (embedded by the same model) are retrieved and used as context for the summary generation. See below for an example of what the summary looks like given the query "emerging trends or future directions in machine learning for ocean sciences":
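The following is a minimal sketch of that retrieval step, not paperai's actual implementation (paperai builds on txtai embeddings); the model name and paragraphs are illustrative:

```python
# Minimal sketch of embedding-based retrieval (illustrative, not paperai's code).
# Assumes the sentence-transformers package; paperai itself uses txtai embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any fixed-length embedding model

# In the real pipeline, every paragraph in the database is embedded once and stored.
paragraphs = [
    "Machine learning improves Southern Ocean circulation forecasts...",
    "Image-based methods accelerate processing of marine imagery...",
]
paragraph_vectors = model.encode(paragraphs, convert_to_tensor=True)

# Embed the query with the same model, then take the topn nearest paragraphs.
query = "emerging trends or future directions in machine learning for ocean sciences"
query_vector = model.encode(query, convert_to_tensor=True)

topn = 2
hits = util.semantic_search(query_vector, paragraph_vectors, top_k=topn)[0]
context = [paragraphs[hit["corpus_id"]] for hit in hits]
# `context` is then passed to the LLM as grounding for the cited summary.
```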
### Example Summary
Machine learning is increasingly becoming an essential tool in ocean sciences, offering unprecedented solutions for managing complex and high-dimensional data challenges. Its ability to efficiently tackle "high-dimensional, complicated, non-linear, and big-data problems" makes it particularly promising for oceanic applications (Sunkara et al., 2023). This capability allows machine learning to address issues that traditional methods find difficult or unfeasible. For example, machine learning has been employed to enhance prediction models of Southern Ocean circulation patterns beyond the capabilities of conventional approaches (Sonnewald et al., 2023). In marine ecology, these advanced tools are used for classifying dynamic oceanic features through various datasets like images and optical spectra (Phillips et al., 2020) (Rubbens et al., 2023), leading to significant advancements in understanding ecological systems by integrating diverse marine data sources.

The adaptation of machine learning also shows great promise in addressing critical issues such as climate change impacts on oceans. Despite its extensive global use in areas like climate analysis and ecological environments (Sunkara et al., 2023), its application within specific regions like the Gulf of Mexico (GOM) remains limited even though there are abundant data resources available (Sunkara et al., 2023) (Sunkara et al., 2023). Emerging technologies such as image-based machine learning methods offer transformative capabilities by accelerating crucial image processing tasks for marine research (Belcher et al., 2022), but technical complexities still present substantial barriers to their adoption.

Furthermore, integrating physical models into machine learning algorithms can significantly improve predictions. This approach has already demonstrated potential by refining surface ocean pCO2 estimates when combined with outputs from global biogeochemical models (Gloege et al., 2021). As computational power continues to increase alongside advances in sensor technology that enhance data collection across the world's oceans (Sunkara et al., 2023), future directions suggest more sophisticated simulations involving multiscale phenomena within coastal environments (Tang et al., 2021) (Tang et al., 2021).

Overall, these trends underscore an exciting trajectory where interdisciplinary collaborations could lead toward intelligent autonomous systems capable of comprehensive ocean monitoring. Such advancements would not only benefit scientific exploration but also practical applications relevant to societal needs such as energy security or environmental sustainability initiatives surrounding our planet's vast aquatic ecosystems (Lermusiaux et al., 2017) (Yang et al., 2019).
The enhanced citation system now provides more precise references to source material:
- Citations use the format `[article_num.paragraph_num]` (e.g., `[1.2]` refers to article 1, paragraph 2)
- Hovering over a citation reveals:
  - Article title
  - Section name (if available)
  - The exact paragraph being cited
  - DOI link for easy reference
- Multiple citations can be combined (e.g., `[1.2, 1.3, 2.1]`)
- Citations are automatically formatted as clickable links with hover text in the generated markdown
Example citation in the output:

> "Deep learning models have enabled unprecedented accuracy in predicting ocean temperature patterns" [1.2]

When rendered, this citation becomes a clickable link with hover text showing the exact quoted text and its context.
## Getting Started

Prerequisites:

- Python 3.7+
- Docker (recommended)
- NVIDIA GPU (optional, for local models)
- One of the following:
  - OpenAI API key (recommended)
  - Google API key (for Gemini API)
  - Hugging Face account (for API or local models)
  - Local GPU for running models (optional)
There are three main ways to use this project, listed in order of recommendation:
### Option 1: Docker with OpenAI API (Recommended)

1. Clone the repository:

   ```bash
   git clone https://github.com/Spiffical/bb-ai-oceanography.git
   cd bb-ai-oceanography
   ```

2. Create a `.env` file in the project root with your OpenAI API key:

   ```
   OPENAI_API_KEY=your_openai_api_key
   ```

3. Build the Docker image (ensure Docker is installed and running):

   ```bash
   docker build -f docker/Dockerfile.api -t paperai-api .
   ```
### Option 2: Docker with Local Models

Choose this option if you want to run models locally without API costs.

1. Clone and enter the repository as shown above
2. Choose your preferred local model provider:

   **A. Using Ollama (Easier)**

   - Build the Docker image:

     ```bash
     docker build -f docker/Dockerfile.gpu -t paperai-gpu .
     ```

   **B. Using Hugging Face (More flexible)**

   - Create a Hugging Face account and get your access token
   - Add to your `.env` file:

     ```
     HUGGING_FACE_HUB_TOKEN=your_token
     ```

   - Build the Docker image:

     ```bash
     docker build -f docker/Dockerfile.gpu -t paperai-gpu .
     ```
### Option 3: Local Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Spiffical/bb-ai-oceanography.git
   cd bb-ai-oceanography
   ```

2. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Install paperetl and paperai:

   ```bash
   cd paperetl
   pip install .
   cd ../paperai
   pip install .
   cd ..
   ```

4. Set up environment variables if you are downloading papers from the Elsevier, Springer, and Wiley APIs. Create a `.env` file in the project root and add your API keys and email address for Unpaywall:

   ```
   ELSEVIER_API_KEY=your_elsevier_api_key
   SPRINGER_API_KEY=your_springer_api_key
   WILEY_API_KEY=your_wiley_api_key
   UNPAYWALL_EMAIL=your_email_address
   GOOGLE_API_KEY=your_google_api_key  # If using Gemini API
   ```
## Report Configuration

Reports are configured through YAML files in the `reports/` directory. Each configuration file defines how the report should be generated and what content to analyze.
Here's an example of a report configuration file:
```yaml
name: Your_Report_Name

options:
  # General settings
  topn: 100
  render: md
  qa: "deepset/roberta-base-squad2"
  generate_summary: true

  # Choose your mode
  llm_mode: "api"  # "api" or "local"

  # API Settings (if llm_mode is "api")
  api:
    provider: "gemini"  # "openai", "huggingface", or "gemini"
    model: "gemini-2.5-flash-preview-05-20"  # Default model for summaries
    gemini_summary_model: "gemini-2.5-flash-preview-05-20"  # Model for summary generation
    gemini_qa_model: "gemini-2.0-flash"  # Model for QA tasks
    gemini_summary_temperature: 0.3  # Controls randomness in summary generation
    gemini_summary_max_tokens: 20000  # Maximum tokens for summaries
    gemini_qa_temperature: 0.1  # Lower temperature for more focused QA
    gemini_qa_max_tokens: 1000  # Maximum tokens for QA responses
    gemini_summary_thinking_budget: 1000  # For Gemini 2.5 models (0-24576)
    gemini_qa_thinking_budget: 0  # For Gemini 2.5 models (0-24576)

  # Local Settings (if llm_mode is "local")
  local:
    provider: "ollama"  # "ollama" or "huggingface"
    model: "mistral:instruct"  # See supported models below
    gpu_strategy: "auto"  # For HuggingFace models

sections:
  Your_Section_Name:
    query: your search query here
    columns:
      - name: Date
      - name: Study
      - {name: Custom_Column, query: specific search terms, question: what specific information to extract}
```

Key components:
- `name`: The name of the report
- `options`: General settings for the report
  - `topn`: Number of results (paragraphs from the database) to use
  - `render`: Output format (`md` for markdown)
  - `qa`: Model to use for question answering
  - `generate_summary`: Whether to include an LLM-generated summary
  - `llm_mode`: Choose between API or local models
  - Mode-specific settings for API or local model usage
  - Gemini-specific settings for controlling model behavior and performance
- `sections`: Define what content to analyze
  - `query`: Main search query for this section
  - `columns`: What information to extract and how to organize it
    - Simple columns just need a name
    - Complex columns can include specific queries and questions
Supported Models:
- OpenAI API: gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo, etc. (see OpenAI API Models)
- Ollama: mistral:instruct, gemma:7b, llama2:7b, etc. (see Ollama Models)
- Hugging Face: google/gemma-2-9b-it, mistralai/Mistral-7B-Instruct-v0.2, etc. (see Hugging Face Models)
Example sections from the default report include:
- ML Applications in ocean sciences
- Research gaps and challenges
- Emerging trends
See the `reports/` directory for complete examples.
## Usage

To generate a report using a pre-processed paperetl embeddings model:
### Using OpenAI API

```bash
# On Linux/Mac
docker run --rm --env-file ".env" -v "$(pwd):/work" paperai-api -m paperai.report /work/reports/report_file.yml /work/path/to/your/model

# On Windows PowerShell
docker run --rm --env-file ".env" -v "${PWD}:/work" paperai-api -m paperai.report /work/reports/report_file.yml /work/path/to/your/model
```

### Using Local Models with GPU

```bash
# On Linux/Mac
docker run --rm --env-file ".env" --gpus all -v "$(pwd):/work" paperai-gpu -m paperai.report /work/reports/report_file.yml /work/path/to/your/model

# On Windows PowerShell
docker run --rm --env-file ".env" --gpus all -v "${PWD}:/work" paperai-gpu -m paperai.report /work/reports/report_file.yml /work/path/to/your/model
```

### Using Local Installation

```bash
python -m paperai.report reports/report_file.yml path/to/your/model
```

Replace:

- `path/to/your/model` with the path to your embeddings model directory
- `reports/report_file.yml` with the path to your report configuration file
For example, if you want to use the OpenAI API, your embeddings model is in `paperetl/models/pdf-oceanai`, your report configuration file is `reports/report_oceans_gaps.yml`, and you are currently in the bb-ai-oceanography directory:

```bash
docker run --rm \
    -v "$(pwd):/work" \
    --env-file .env \
    paperai-api -m paperai.report /work/reports/report_oceans_gaps.yml /work/paperetl/models/pdf-oceanai
```

If you need to process new papers or build the database from scratch, follow these additional steps:
1. Download papers:

   ```bash
   python scripts/download_papers.py --csv path/to/your/doi_list.csv output_path
   ```

   Or for a single DOI:

   ```bash
   python scripts/download_papers.py --doi 10.1016/j.example.2023.123456 output_path
   ```

2. Start GROBID with the custom configuration:

   ```bash
   sudo docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 -v /path/to/bb-ai-oceanography/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.8.0
   ```

   Replace `/path/to/bb-ai-oceanography` with the actual path to your project directory; e.g. if you're currently in the project directory, you can use `$PWD`:

   ```bash
   sudo docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 -v $PWD/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.8.0
   ```

3. In a separate terminal, run paperetl to extract content and create an SQLite database:

   ```bash
   python -m paperetl.file /path/to/pdfs /path/to/output
   ```

   Replace `/path/to/pdfs` with the directory containing your downloaded PDFs, and `/path/to/output` with the desired output directory for the SQLite database.

4. Index the extracted content:

   ```bash
   python -m paperai.index /path/to/output
   ```

   Use the same `/path/to/output` as in the paperetl step.

5. Generate a report using a YAML configuration file (see Report Configuration for details):

   ```bash
   python -m paperai.report /path/to/report_config.yml 100 md /path/to/output
   ```

   Replace:

   - `/path/to/report_config.yml` with the path to your report configuration file (examples can be found in the `reports` directory)
   - `/path/to/output` with the same `/path/to/output` as in the paperetl step
## Project Structure

- `scripts/`: Contains the main scripts for downloading papers
- `utils/`: Utility functions for API calls, file handling, and PDF processing
- `notebooks/`: Jupyter notebooks for data analysis and visualization
- `config/`: Configuration files, including the custom GROBID configuration
- `paperetl/`: The modified paperetl package for extracting content from PDFs
- `paperai/`: The modified paperai package for analyzing and generating reports
- `docker/`: Docker configuration files for building and running the project
- `reports/`: Report configuration files for generating reports
## Acknowledgments

- This work builds upon the initial analysis by Leland McInnes at the Tutte Institute and Taylor Denouden's bb-ai-oceanography project
- Thanks to all the publishers and platforms providing access to scientific literature
- GROBID for PDF content extraction
- paperetl and paperai for content processing and analysis