Commit

Merge pull request #19 from aozalevsky/Alex-3
Documentation added to scraper + READMEs
Alexander-Aghili authored Nov 12, 2023
2 parents 4ca6d96 + baa1c8a commit 12e78e8
Showing 3 changed files with 110 additions and 54 deletions.
65 changes: 42 additions & 23 deletions README.md
@@ -1,37 +1,56 @@
# Lantern Vector Database
# StructHunt

## Installation
## Overview

Run `initialize_database.sh` to:
1. Setup Postgres
2. Create databases
3. Install dependencies
StructHunt is a program designed to scrape scientific articles from bioRxiv, parse them, convert them into embeddings, and analyze whether they employ certain methodologies. The resulting information is organized and stored in a CSV file. The program consists of several components that work together to achieve this functionality.

## Classes
The `Fragment` and `Publication` classes provide Python representations of rows from the database tables.
## Components

## Database Structure
### 1. `scraper.py`

Lantern creates the following two tables in the database:
`scraper.py` is responsible for scraping bioRxiv to obtain scientific articles in PDF format. It uses external libraries and APIs to download these articles and then applies the necessary parsing logic to extract relevant information.
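
A minimal sketch of driving the scraper programmatically is shown below. It assumes `scraper.py` is importable from the working directory and that an `OPENAI_API_KEY` is exported in the environment; it simply wraps the `scrapeBiorxiv()` call that the script's command-line interface (`--start-date`, `--end-date`, `--outfile`) also uses.

```python
# Sketch: scrape one day of bioRxiv metadata and process the resulting PDFs.
# scrapeBiorxiv(start, end, out_file) is defined in scraper.py; it also triggers
# PDF download, text extraction, and embedding.
from scraper import scrapeBiorxiv

scrapeBiorxiv("2023-10-30", "2023-10-31", "bio.jsonl")
```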

1. `fragments` table:
- Columns: pdbid (text), header (text), content (text), vector (real[])
- Used to store information about molecular fragments, including their PDB ID, header, content, and associated vector data.
### 2. `VectorDatabase.py`

2. `publications` table:
- Columns: pdbid (text, primary key), title (text), pmc (text), pubmed (text), doi (text)
- Used to store information about publications related to the fragments, including their PDB ID, title, PMC, PubMed, and DOI.
`VectorDatabase.py` contains the `Lantern` class, which is used to interact with a PostgreSQL database. The embeddings generated from the articles are inserted into the database and associated with the corresponding publications.

## Usage
### 3. `hackathon_runner.py`

The VectorDatabase module, which contains the `Lantern` class, provides the main functionality for the vector database. For example, you can insert an embedding with `insertEmbedding()`.
`hackathon_runner.py` is the script responsible for managing the overall flow of the program. It identifies publications that haven't been processed, retrieves their IDs, and triggers subsequent processing steps.

## Dumping/restoring the database
### 4. `chatgpt`

To dump the database for backup or transfer, use the built-in Postgres command [`pg_dump`](https://www.postgresql.org/docs/current/backup-dump.html):
The `chatgpt` component interacts with OpenAI's GPT-based language model, using prompts generated by the `updated_prompt.py` script together with the embeddings retrieved in the previous step. The goal is to analyze whether the publications implement certain methodologies.
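
As a hedged illustration of this step, the sketch below uses only the langchain classes that `scraper.py` already imports (`ChatOpenAI`, `RetrievalQA`, `OpenAIEmbeddings`, `FAISS`); the function name, model choice, and question text are assumptions, not the repository's actual code.

```python
# Illustrative sketch: answer a methodology question against fragment text via langchain.
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

def ask_methodology_question(fragment_texts, question):
    # Build an in-memory FAISS index over the fragments and query it with the prompt.
    vectorstore = FAISS.from_texts(fragment_texts, OpenAIEmbeddings())
    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
        chain_type="stuff",
        retriever=vectorstore.as_retriever(),
    )
    return qa.run(question)

answer = ask_methodology_question(
    ["...fragment text from a publication..."],
    "Does this paper use X-ray crystallography? Answer yes or no.",
)
```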

`sudo -u postgres pg_dump structdb > structdb.sql`
### 5. `prompts.py`

To restore the database from a dump:
`prompts.py` generates prompts that are used to query the GPT model. These prompts are crafted based on the specific characteristics of the publications being analyzed.
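
As an illustration only (the real templates live in `prompts.py`), a prompt of this kind can be expressed with langchain's `PromptTemplate`, which `scraper.py` already imports; the variable names and wording below are assumptions.

```python
# Illustrative sketch of a methodology prompt; prompts.py defines the actual templates.
from langchain import PromptTemplate

methodology_prompt = PromptTemplate(
    input_variables=["title", "methodology"],
    template=(
        "Does the publication titled '{title}' describe the use of {methodology}? "
        "Answer 'yes' or 'no' and briefly justify the answer."
    ),
)

print(methodology_prompt.format(title="Example preprint", methodology="cryo-EM"))
```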

`sudo -u postgres psql structdb < structdb.sql`
### 6. `CSV Output`

The program populates a CSV file with the analysis results. This file contains information on whether the publications employ certain methodologies, providing a structured output for easy interpretation and further analysis.
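
The exact column layout is determined by the program; as a rough sketch of the output step, result rows could be written with Python's standard `csv` module (the field names below are assumptions).

```python
# Illustrative sketch of writing analysis results; the field names are placeholders.
import csv

results = [
    {"doi": "10.1101/example", "methodology": "cryo-EM", "uses_methodology": "yes"},
]

with open("results.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["doi", "methodology", "uses_methodology"])
    writer.writeheader()
    writer.writerows(results)
```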

## Getting Started

1. **Environment Setup:**
- Ensure that you have Python installed.
- Install the required PostgreSQL database and Python packages using `initialize_database.sh`.

```bash
sudo ./initialize_database.sh
```

2. **Run the Program:**
- Execute `runner.py` to initiate the structured hunting process.

```bash
python runner.py
```

## Contributing

Feel free to contribute to the development of StructHunt by submitting issues, feature requests, or pull requests. Your feedback and contributions are highly appreciated.

## License

This project is licensed under the [MIT License](LICENSE).
27 changes: 27 additions & 0 deletions VectorDatabase.md
@@ -0,0 +1,27 @@
# Lantern Vector Database

## Installation

Run `initialize_database.sh` to:
1. Setup Postgres
2. Create databases
3. Install dependencies

## Classes
The `Fragment` and `Publication` classes provide Python representations of rows from the database tables below.

## Database Structure

Lantern creates the following two tables in the database (a sketch of the corresponding table definitions follows the list):

1. `fragments` table:
- Columns: id (text), header (text), content (text), vector (real[])
- Used to store information about molecular fragments, including their ID (DOI), header, content, and associated vector data.

2. `publications` table:
- Columns: id (text, primary key), title (text), pmc (text), pubmed (text), doi (text)
- Used to store information about publications related to the fragments, including their ID (DOI), title, and links to PMC, PubMed, and DOI.
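
For reference, a sketch of these definitions issued through `psycopg2` is shown below. It assumes the `structdb` database name used in the `pg_dump`/`psql` examples earlier on this page; `Lantern` and `initialize_database.sh` remain the authoritative sources for the actual schema and connection details.

```python
# Sketch only: restates the two tables described above.
import psycopg2

conn = psycopg2.connect(dbname="structdb", user="postgres")  # assumed connection details
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fragments (
            id      text,
            header  text,
            content text,
            vector  real[]
        );
        CREATE TABLE IF NOT EXISTS publications (
            id     text PRIMARY KEY,
            title  text,
            pmc    text,
            pubmed text,
            doi    text
        );
    """)
conn.close()
```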

## Usage

The VectorDatabase module, which contains the `Lantern` class, provides the main functionality for the vector database. For example, you can insert an embedding with `insertEmbedding()`.
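
A minimal usage sketch follows; `insertEmbedding()` is the method named above, but its exact signature lives in `VectorDatabase.py`, so the argument shown here is only a placeholder.

```python
# Minimal sketch; the argument passed to insertEmbedding() is a placeholder,
# not the method's documented signature.
from VectorDatabase import Lantern

lantern = Lantern()
embedding = [0.12, -0.07, 0.33]  # stand-in for a vector produced by the embedding step
lantern.insertEmbedding(embedding)
```
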
72 changes: 41 additions & 31 deletions scrapper.py → scraper.py
@@ -1,45 +1,53 @@
# File: scraper.py
# Description: This script defines functions for scraping and processing scientific papers from bioRxiv,
# extracting text and embeddings, and storing the information in a custom database.
# It also performs a keyword search on the obtained data.

# Importing necessary libraries
import os
import pandas as pd
import PyPDF2
import argparse, datetime
from paperscraper.pdf import save_pdf
from paperscraper.get_dumps import biorxiv
from paperscraper.load_dumps import QUERY_FN_DICT
from paperscraper.xrxiv.xrxiv_query import XRXivQuery


import openai
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain import PromptTemplate
import PyPDF2

from VectorDatabase import Lantern
from fragment import Fragment
from publication import Publication
from VectorDatabase import Lantern, Fragment, Publication


# OpenAI Setup
# openai.api_key = os.getenv(openai_api_key)
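# NOTE: OPENAI_API_KEY must already be exported in the environment; the assignment below fails if it is unset.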
os.environ['OPENAI_API_KEY'] = os.getenv("OPENAI_API_KEY")


"""
Scrapes papers from bioRxiv between the specified dates and saves the metadata in a JSON file.
:param start: Start date for the scraping (format: "YYYY-MM-DD").
:param end: End date for the scraping (format: "YYYY-MM-DD").
:param out_file: Output file to save the metadata in JSON Lines format.
:return: None
"""
def scrapeBiorxiv(start, end, out_file):
filepath = out_file
biorxiv(begin_date=start, end_date=end, save_path=out_file)
retreiveTextFromPdf(filepath)

"""
Retrieves text embeddings from a given text file using OpenAI's language model.
:param fname: Path to the input text file.
:return: A tuple containing text embeddings and the OpenAIEmbeddings instance.
"""
def get_embeddings(fname):
"""
"""
loader = TextLoader(fname)
documents = loader.load()
text_splitter = CharacterTextSplitter(
@@ -54,17 +62,18 @@ def get_embeddings(fname):
    return text_embeddings, emb


"""
Retrieves text from PDF files, extracts embeddings, and stores information in a custom database.
:param inp_file: Path to the input JSON file containing paper metadata.
:return: None
"""
def retreiveTextFromPdf(inp_file):

json = pd.read_json(path_or_buf=inp_file, lines=True)
lantern = Lantern()

for n, doi in enumerate(json['doi']):
print(n, doi)

# NOTE: This is for example purpose only
if n > 10:
break

paper_data = {'doi': doi}
doi = doi.replace("/", "-")
@@ -111,18 +120,19 @@ def retreiveTextFromPdf(inp_file):
        os.remove(pdfsavefile)


start_date = "2023-10-30"
end_date = "2023-10-31"
out_file = "bio.jsonl"

scrapeBiorxiv(start_date, end_date, out_file)
if __name__ == "__main__":
    # Adding command line arguments for start_date and end_date with default values as the current date
    parser = argparse.ArgumentParser(description="Scrape and process scientific papers from bioRxiv.")
    parser.add_argument("--start-date", default=str(datetime.date.today()), help="Start date for the scraping (format: 'YYYY-MM-DD').")
    parser.add_argument("--end-date", default=str(datetime.date.today()), help="End date for the scraping (format: 'YYYY-MM-DD').")
    parser.add_argument("--outfile", default="bio.jsonl", help="Output file to save the metadata in JSON Lines format.")
    args = parser.parse_args()

    # Calling the scrapeBiorxiv function with command line arguments
    scrapeBiorxiv(args.start_date, args.end_date, args.outfile)

querier = XRXivQuery('bio.jsonl')
biology = [
'Bioinformatics',
'Molecular Biology',
'Bioengineering',
'Biochemistry']
query = [biology]
querier.search_keywords(query, output_filepath='bio_key.jsonl')
    # Additional keyword search over the scraped metadata
    querier = XRXivQuery(args.outfile)
    biology = ['Bioinformatics', 'Molecular Biology', 'Bioengineering', 'Biochemistry']
    query = [biology]
    querier.search_keywords(query, output_filepath='bio_key.jsonl')
