Merge pull request #164 from trigaten/cleaning_up
Cleaning up
trigaten authored Jun 9, 2024
2 parents 1909def + 984e0d8 commit ac6384c
Showing 57 changed files with 460 additions and 38,447 deletions.
1 change: 0 additions & 1 deletion Prompt_Systematic_Review_Dataset
Submodule Prompt_Systematic_Review_Dataset deleted from 7d8eb4
87 changes: 67 additions & 20 deletions README.md
@@ -1,45 +1,92 @@
# Prompt Engineering Survey
Generative Artificial Intelligence (GenAI) systems are increasingly deployed across industry and research
settings. Developers and end users interact with these systems through prompting, or prompt engineering.
While prompting is a widespread and highly researched concept, the area’s nascency has led to conflicting
terminology and a poor ontological understanding of what constitutes a prompt. This repository contains the
code for The Prompt Report, our research that establishes a structured understanding of prompts by assembling
a taxonomy of prompting techniques and analyzing their use. The code allows for the automated review of
papers, the collection of data, and the running of experiments. Our dataset is available on
[Hugging Face](https://huggingface.co/datasets/PromptSystematicReview/ThePromptReport).

## Table of Contents
- [Prompt Engineering Survey](#prompt-engineering-survey)
- [Table of Contents](#table-of-contents)
- [Install requirements](#install-requirements)
- [Setting up API keys](#setting-up-api-keys)
- [Setting up keys for running tests](#setting-up-keys-for-running-tests)
- [Structure of the Repository](#structure-of-the-repository)
- [Running the code](#running-the-code)
- [TLDR;](#tldr)
- [Notes](#notes)

## Install requirements

After cloning, run `pip install -r requirements.txt` from the repository root.

## Setting up API keys

Make a file at root called `.env`.

For OpenAI: https://platform.openai.com/docs/quickstart <br>
For Hugging Face: https://huggingface.co/docs/hub/security-tokens, also run `huggingface-cli login` <br>
For Semantic Scholar: https://www.semanticscholar.org/product/api#api-key <br>

Use the reference `example.env` file as a template and fill in your API keys/tokens in `.env`:
```
OPENAI_API_KEY=sk-...
SEMANTIC_SCHOLAR_API_KEY=...
HF_TOKEN=...
```
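
Once `.env` is filled in, a quick check that all three keys are set can save a failed run later. The sketch below is not part of the repository: `parse_env` and `missing_keys` are hypothetical helpers, and the parser only handles plain `KEY=value` lines (no quoting or `export` prefixes).

```python
from pathlib import Path

# Key names taken from the example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "SEMANTIC_SCHOLAR_API_KEY", "HF_TOKEN")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumed to sit at the repository root
    if not env_path.exists():
        print("No .env file found at root")
    else:
        missing = missing_keys(env_path.read_text())
        print("Missing keys:", ", ".join(missing) if missing else "none")
```

Run it from root before `main.py`; it only reports which keys are missing and never prints their values.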

## Setting up keys for running tests
To have pytest load the `.env` file, install the `pytest-dotenv` plugin: <br>
`pip install pytest-dotenv`

You can also point pytest at a specific env file: <br>
`py.test --envfile path/to/.env`

If you have several .env files, add an `env_files` entry to your pytest configuration file and list them:

```
env_files =
    .env
    .test.env
    .deploy.env
```
## Structure of the Repository
The script `main.py` calls the necessary functions to download all the papers, deduplicate and filter them, and then run all the experiments.

The core of the repository is in `src/prompt_systematic_review`. The `config_data.py` script contains configurations that are important for running experiments and saving time. You can see in `main.py` how some of these options are used.

The source folder is divided into 4 main sections: 3 scripts (`automated_review.py`, `collect_papers.py`, `config_data.py`) that deal with collecting the data and running the automated review, the `utils` folder that contains utility functions used throughout the repository, the `get_papers` folder that contains the scripts to download the papers, and the `experiments` folder that contains the scripts to run the experiments.

At the root, there is a `data` folder. It comes pre-loaded with some data that is used for the experiments; however, the bulk of the dataset can either be generated by running `main.py` or downloaded from Hugging Face. The results of the experiments are saved in `data/experiments_output`.

Notably, the keywords used in the automated review/scraping process are in `src/prompt_systematic_review/utils/keywords.py`. Anyone who wishes to run the automated review can adjust these keywords to their liking in that file.

## Running the code

### TLDR;
```bash
git clone https://github.com/trigaten/Prompt_Systematic_Review.git && cd Prompt_Systematic_Review
pip install -r requirements.txt
# create a .env file with your API keys
nano .env
git lfs install
git clone https://huggingface.co/datasets/PromptSystematicReview/ThePromptReport
mv ThePromptReport/* data/
python main.py
```

Running `main.py` will download the papers, run the automated review, and run the experiments.
However, if you wish to save time and only run the experiments, you can download the data from [Hugging Face](https://huggingface.co/datasets/PromptSystematicReview/ThePromptReport) and move the papers folder and all the CSV files in the dataset into the `data` folder (the result should look like `data/papers/*.pdf`, `data/master_papers.csv`, etc.). Then adjust `main.py` accordingly.
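
Before running only the experiments, it can help to confirm that the downloaded data landed where the paragraph above says it should. This is a minimal sketch, not code from the repository: `check_data_layout` is a hypothetical helper, and it checks only the two paths named above (`data/papers/*.pdf` and `data/master_papers.csv`).

```python
from pathlib import Path


def check_data_layout(data_dir: Path) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks usable."""
    problems = []
    if not (data_dir / "master_papers.csv").is_file():
        problems.append("missing master_papers.csv")
    papers_dir = data_dir / "papers"
    if not papers_dir.is_dir():
        problems.append("missing papers/ folder")
    elif not any(papers_dir.glob("*.pdf")):
        problems.append("papers/ contains no PDF files")
    return problems


if __name__ == "__main__":
    issues = check_data_layout(Path("data"))  # run from the repository root
    print("Data layout OK" if not issues else "\n".join(issues))
```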

Every experiment script has a `run_experiment` function that is called in `main.py`. The `run_experiment` function is responsible for running the experiment and saving the results. However, each script can also be run individually with `python src/prompt_systematic_review/experiments/<experiment_name>.py` from root.

One experiment, `graph_internal_references`, is better run from root as an individual script because of issues with parallelism. To keep it from interfering with other experiments, it is placed at the bottom of the list in `experiments/__init__.py` so that it runs last.
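
The pattern described above, an ordered list of experiments that each expose `run_experiment`, can be sketched roughly as follows. This is an illustration only and not the repository's code: the `Experiment` class, the `run_all` helper, and the `count_papers` name are hypothetical, and continuing after a failure is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    run_experiment: Callable[[], None]


def run_all(experiments: list[Experiment]) -> list[str]:
    """Run experiments in list order; return the names that completed."""
    completed = []
    for experiment in experiments:
        try:
            experiment.run_experiment()
        except Exception as error:  # keep going so one failure does not stop the rest
            print(f"{experiment.name} failed: {error}")
        else:
            completed.append(experiment.name)
    return completed


if __name__ == "__main__":
    experiments = [
        Experiment("count_papers", lambda: print("counting papers")),  # hypothetical
        # Placed last so that its parallelism issues cannot affect the others.
        Experiment("graph_internal_references", lambda: print("building graph")),
    ]
    print(run_all(experiments))
```

Keeping the order in a single list makes the "run last" rule explicit: moving an experiment is a one-line change.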

## blacklist.csv

Papers we did not include because they are poorly written, AI-generated, or simply irrelevant.

## Notes

- Sometimes a paper title may appear differently on the arXiv API. For example, "Visual Attention-Prompted Prediction and Learning" (arXiv:2310.08420) is titled "A visual encoding model based on deep neural networks and transfer learning" according to the arXiv API.

- When testing APIs, there may be latency and aborted connections.

- Publication dates of papers from IEEE are missing the day about half the time. They may also come in any of the following formats:
  - "April 1988"
  - "2-4 April 2002"
  - "29 Nov.-2 Dec. 2022"
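
One way to normalize the three IEEE date formats listed above is to extract the start date and leave the day empty when it is missing. The sketch below is an assumption about how this could be handled, not the repository's implementation; `parse_ieee_date` is a hypothetical helper that covers only these three shapes.

```python
import re
from typing import Optional

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Optional start day (with an optional "-end day"), then a month name
# that may be full or abbreviated with a trailing dot.
START = re.compile(r"^\s*(?:(\d{1,2})\s*(?:-\s*\d{1,2})?\s+)?([A-Za-z]+)\.?")
YEAR = re.compile(r"(\d{4})\s*$")


def parse_ieee_date(text: str) -> tuple[int, int, Optional[int]]:
    """Return (year, month, day) for the start of the date; day is None if absent."""
    start, year = START.search(text), YEAR.search(text)
    if not start or not year:
        raise ValueError(f"Unrecognized date format: {text!r}")
    month = MONTHS.get(start.group(2)[:3].lower())
    if month is None:
        raise ValueError(f"Unknown month in: {text!r}")
    day = int(start.group(1)) if start.group(1) else None
    return int(year.group(1)), month, day


if __name__ == "__main__":
    for sample in ["April 1988", "2-4 April 2002", "29 Nov.-2 Dec. 2022"]:
        print(sample, "->", parse_ieee_date(sample))
```

Returning `None` for a missing day keeps the gap visible instead of silently defaulting to the first of the month.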
29 changes: 0 additions & 29 deletions data/model_citation_counts.csv

This file was deleted.
