Commit

Merge pull request #19 from aozalevsky/Alex-3
Documentation added to scraper + READMEs
Alexander-Aghili authored Nov 12, 2023
2 parents 4ca6d96 + baa1c8a commit 12e78e8
Showing 3 changed files with 110 additions and 54 deletions.
65 changes: 42 additions & 23 deletions README.md
@@ -1,37 +1,56 @@
# Lantern Vector Database
# StructHunt

## Installation
## Overview

Run `initialize_database.sh` to:
1. Setup Postgres
2. Create databases
3. Install dependencies
StructHunt is a program designed to scrape scientific articles from bioRxiv, parse them, convert them into embeddings, and analyze whether they employ certain methodologies. The resulting information is organized and stored in a CSV file. The program consists of several components that work together to achieve this functionality.

## Classes
The `Fragment` and `Publication` classes provide Python representations of rows from the database tables.
## Components

## Database Structure
### 1. `scraper.py`

Lantern creates the following two tables in the database:
`scraper.py` is responsible for scraping bioRxiv to obtain scientific articles in PDF format. It uses external libraries and APIs to download these articles and then applies the necessary parsing logic to extract relevant information.
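
A minimal sketch of driving the scraper programmatically is shown below. It assumes `scraper.py` is importable from the working directory and that an `OPENAI_API_KEY` is exported in the environment; it simply wraps the `scrapeBiorxiv()` call that the script's command-line interface (`--start-date`, `--end-date`, `--outfile`) also uses.

```python
# Sketch: scrape one day of bioRxiv metadata and process the resulting PDFs.
# scrapeBiorxiv(start, end, out_file) is defined in scraper.py; it also triggers
# PDF download, text extraction, and embedding.
from scraper import scrapeBiorxiv

scrapeBiorxiv("2023-10-30", "2023-10-31", "bio.jsonl")
```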

1. `fragments` table:
- Columns: pdbid (text), header (text), content (text), vector (real[])
- Used to store information about molecular fragments, including their PDB ID, header, content, and associated vector data.
### 2. `VectorDatabase.py`

2. `publications` table:
- Columns: pdbid (text, primary key), title (text), pmc (text), pubmed (text), doi (text)
- Used to store information about publications related to the fragments, including their PDB ID, title, PMC, PubMed, and DOI.
`VectorDatabase.py` contains the `Lantern` class, which is used to interact with a PostgreSQL database. The embeddings generated from the articles are inserted into the database and associated with the corresponding publications.

## Usage
### 3. `hackathon_runner.py`

The VectorDatabase module, which contains the `Lantern` class, provides the main functionality for the vector database. For example, you can insert an embedding with `insertEmbedding()`.
`hackathon_runner.py` is the script responsible for managing the overall flow of the program. It identifies publications that haven't been processed, retrieves their IDs, and triggers subsequent processing steps.

## Dumping/restoring the database
### 4. `chatgpt`

To dump the database for backup or transfer, use the built-in Postgres command [`pg_dump`](https://www.postgresql.org/docs/current/backup-dump.html):
The `chatgpt` component interacts with OpenAI's GPT-based language model, using prompts generated by the `updated_prompt.py` script together with the embeddings retrieved in the previous step. The goal is to analyze whether the publications implement certain methodologies.
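
As a hedged illustration of this step, the sketch below uses only the langchain classes that `scraper.py` already imports (`ChatOpenAI`, `RetrievalQA`, `OpenAIEmbeddings`, `FAISS`); the function name, model choice, and question text are assumptions, not the repository's actual code.

```python
# Illustrative sketch: answer a methodology question against fragment text via langchain.
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

def ask_methodology_question(fragment_texts, question):
    # Build an in-memory FAISS index over the fragments and query it with the prompt.
    vectorstore = FAISS.from_texts(fragment_texts, OpenAIEmbeddings())
    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
        chain_type="stuff",
        retriever=vectorstore.as_retriever(),
    )
    return qa.run(question)

answer = ask_methodology_question(
    ["...fragment text from a publication..."],
    "Does this paper use X-ray crystallography? Answer yes or no.",
)
```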

`sudo -u postgres pg_dump structdb > structdb.sql`
### 5. `prompts.py`

To restore the database from a dump:
`prompts.py` generates prompts that are used to query the GPT model. These prompts are crafted based on the specific characteristics of the publications being analyzed.
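
As an illustration only (the real templates live in `prompts.py`), a prompt of this kind can be expressed with langchain's `PromptTemplate`, which `scraper.py` already imports; the variable names and wording below are assumptions.

```python
# Illustrative sketch of a methodology prompt; prompts.py defines the actual templates.
from langchain import PromptTemplate

methodology_prompt = PromptTemplate(
    input_variables=["title", "methodology"],
    template=(
        "Does the publication titled '{title}' describe the use of {methodology}? "
        "Answer 'yes' or 'no' and briefly justify the answer."
    ),
)

print(methodology_prompt.format(title="Example preprint", methodology="cryo-EM"))
```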

`sudo -u postgres psql structdb < structdb.sql`
### 6. `CSV Output`

The program populates a CSV file with the analysis results. This file contains information on whether the publications employ certain methodologies, providing a structured output for easy interpretation and further analysis.
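
The exact column layout is determined by the program; as a rough sketch of the output step, result rows could be written with Python's standard `csv` module (the field names below are assumptions).

```python
# Illustrative sketch of writing analysis results; the field names are placeholders.
import csv

results = [
    {"doi": "10.1101/example", "methodology": "cryo-EM", "uses_methodology": "yes"},
]

with open("results.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["doi", "methodology", "uses_methodology"])
    writer.writeheader()
    writer.writerows(results)
```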

## Getting Started

1. **Environment Setup:**
- Ensure that you have Python installed.
- Install the required PostgreSQL database and Python packages using `initialize_database.sh`.

```bash
sudo ./initialize_database.sh
```

2. **Run the Program:**
- Execute `runner.py` to initiate the structured hunting process.

```bash
python runner.py
```

## Contributing

Feel free to contribute to the development of StructHunt by submitting issues, feature requests, or pull requests. Your feedback and contributions are highly appreciated.

## License

This project is licensed under the [MIT License](LICENSE).
27 changes: 27 additions & 0 deletions VectorDatabase.md
@@ -0,0 +1,27 @@
# Lantern Vector Database

## Installation

Run `initialize_database.sh` to:
1. Setup Postgres
2. Create databases
3. Install dependencies

## Classes
The `Fragment` and `Publication` classes provide Python representations of rows from the database tables below.

## Database Structure

Lantern creates the following two tables in the database (a sketch of the corresponding table definitions follows the list):

1. `fragments` table:
- Columns: id (text), header (text), content (text), vector (real[])
- Used to store information about molecular fragments, including their ID (DOI), header, content, and associated vector data.

2. `publications` table:
- Columns: id (text, primary key), title (text), pmc (text), pubmed (text), doi (text)
- Used to store information about publications related to the fragments, including their ID (DOI), title, and links to PMC, PubMed, and DOI.
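
For reference, a sketch of these definitions issued through `psycopg2` is shown below. It assumes the `structdb` database name used in the `pg_dump`/`psql` examples earlier on this page; `Lantern` and `initialize_database.sh` remain the authoritative sources for the actual schema and connection details.

```python
# Sketch only: restates the two tables described above.
import psycopg2

conn = psycopg2.connect(dbname="structdb", user="postgres")  # assumed connection details
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fragments (
            id      text,
            header  text,
            content text,
            vector  real[]
        );
        CREATE TABLE IF NOT EXISTS publications (
            id     text PRIMARY KEY,
            title  text,
            pmc    text,
            pubmed text,
            doi    text
        );
    """)
conn.close()
```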

## Usage

The VectorDatabase module, which contains the `Lantern` class, provides the main functionality for the vector database. For example, you can insert an embedding with `insertEmbedding()`.
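
A minimal usage sketch follows; `insertEmbedding()` is the method named above, but its exact signature lives in `VectorDatabase.py`, so the argument shown here is only a placeholder.

```python
# Minimal sketch; the argument passed to insertEmbedding() is a placeholder,
# not the method's documented signature.
from VectorDatabase import Lantern

lantern = Lantern()
embedding = [0.12, -0.07, 0.33]  # stand-in for a vector produced by the embedding step
lantern.insertEmbedding(embedding)
```
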
72 changes: 41 additions & 31 deletions scrapper.py → scraper.py
@@ -1,45 +1,53 @@
# File: scraper.py
# Description: This script defines functions for scraping and processing scientific papers from bioRxiv,
# extracting text and embeddings, and storing the information in a custom database.
# It also performs a keyword search on the obtained data.

# Importing necessary libraries
import os
import pandas as pd
import PyPDF2
import argparse, datetime
from paperscraper.pdf import save_pdf
from paperscraper.get_dumps import biorxiv
from paperscraper.load_dumps import QUERY_FN_DICT
from paperscraper.xrxiv.xrxiv_query import XRXivQuery


import openai
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain import PromptTemplate
import PyPDF2

from VectorDatabase import Lantern
from fragment import Fragment
from publication import Publication
from VectorDatabase import Lantern, Fragment, Publication


# OpenAI Setup
# openai.api_key = os.getenv(openai_api_key)
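# NOTE: OPENAI_API_KEY must already be exported in the environment; the assignment below fails if it is unset.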
os.environ['OPENAI_API_KEY'] = os.getenv("OPENAI_API_KEY")


"""
Scrapes papers from bioRxiv between the specified dates and saves the metadata in a JSON file.
:param start: Start date for the scraping (format: "YYYY-MM-DD").
:param end: End date for the scraping (format: "YYYY-MM-DD").
:param out_file: Output file to save the metadata in JSON Lines format.
:return: None
"""
def scrapeBiorxiv(start, end, out_file):
filepath = out_file
biorxiv(begin_date=start, end_date=end, save_path=out_file)
retreiveTextFromPdf(filepath)

"""
Retrieves text embeddings from a given text file using OpenAI's language model.
:param fname: Path to the input text file.
:return: A tuple containing text embeddings and the OpenAIEmbeddings instance.
"""
def get_embeddings(fname):
"""
"""
loader = TextLoader(fname)
documents = loader.load()
text_splitter = CharacterTextSplitter(
@@ -54,17 +62,18 @@ def get_embeddings(fname):
    return text_embeddings, emb


"""
Retrieves text from PDF files, extracts embeddings, and stores information in a custom database.
:param inp_file: Path to the input JSON file containing paper metadata.
:return: None
"""
def retreiveTextFromPdf(inp_file):

json = pd.read_json(path_or_buf=inp_file, lines=True)
lantern = Lantern()

for n, doi in enumerate(json['doi']):
print(n, doi)

# NOTE: This is for example purpose only
if n > 10:
break

paper_data = {'doi': doi}
doi = doi.replace("/", "-")
@@ -111,18 +120,19 @@ def retreiveTextFromPdf(inp_file):
        os.remove(pdfsavefile)


start_date = "2023-10-30"
end_date = "2023-10-31"
out_file = "bio.jsonl"

scrapeBiorxiv(start_date, end_date, out_file)
if __name__ == "__main__":
    # Adding command line arguments for start_date and end_date with default values as the current date
    parser = argparse.ArgumentParser(description="Scrape and process scientific papers from bioRxiv.")
    parser.add_argument("--start-date", default=str(datetime.date.today()), help="Start date for the scraping (format: 'YYYY-MM-DD').")
    parser.add_argument("--end-date", default=str(datetime.date.today()), help="End date for the scraping (format: 'YYYY-MM-DD').")
    parser.add_argument("--outfile", default="bio.jsonl", help="Output file to save the metadata in JSON Lines format.")
    args = parser.parse_args()

    # Calling the scrapeBiorxiv function with command line arguments
    scrapeBiorxiv(args.start_date, args.end_date, args.outfile)

querier = XRXivQuery('bio.jsonl')
biology = [
'Bioinformatics',
'Molecular Biology',
'Bioengineering',
'Biochemistry']
query = [biology]
querier.search_keywords(query, output_filepath='bio_key.jsonl')
    # Additional keyword search over the scraped metadata
    querier = XRXivQuery(args.outfile)
    biology = ['Bioinformatics', 'Molecular Biology', 'Bioengineering', 'Biochemistry']
    query = [biology]
    querier.search_keywords(query, output_filepath='bio_key.jsonl')
