
Commit d115b09

Merge pull request #114 from trigaten/topic-model
Add topic modeling code
2 parents: 5059651 + 113e919

16 files changed: +2776, -3 lines

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -14,5 +14,7 @@ scripts/arxiv_papers_with_ai_labels.csv
 papers_output/*
 data/arxiv_papers_for_human_review.csv
 papers
+scripts/master_papers.csv
+scripts/t.py
 /RP_eval_results_*.json
 scripts/master_papers.csv

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -15,5 +15,6 @@ tika
 tqdm
 openai
 load_dotenv
+tomotopy
 wordcloud
 -e .

scripts/download_and_process_pdfs.py

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@ def filter_and_save_pdfs(folder_path, csv_path, output_csv_path):
         papList.append(
             Paper(
                 row["title"],
-                row["firstAuthor"],
+                row["authors"],
                 row["url"],
                 row["dateSubmitted"],
                 row["keywords"],

scripts/download_data_pipeline.ipynb

Lines changed: 2 additions & 2 deletions
@@ -106,7 +106,7 @@
     "source": [
      "blacklist = pd.read_csv(\"../data/blacklist.csv\")\n",
      "blacklist[\"title\"] = blacklist[\"title\"].apply(lambda x: process_paper_title(x))\n",
-     "blacklist"
+     "len(blacklist)"
     ]
    },
    {
@@ -391,7 +391,7 @@
     "\n",
     "df_combined.to_csv(\"master_papers.csv\")\n",
     "\n",
-    "auto_pipeline(\"master_papers.csv\", \"papers\")"
+    "auto_pipeline(\"master_papers.csv\", \"papers/\")"
     ]
    }
   ],

src/prompt_systematic_review/pipeline.py

Lines changed: 1 addition & 0 deletions
@@ -85,6 +85,7 @@ def upload_folder(self, folderName):
         self.api.upload_folder(
             repo_id=self.repo_name,
             folder_path=folderName,
+            path_in_repo=folderName,
             commit_message=f"Add {folderName}",
             repo_type="dataset",
         )
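
The one-line addition matters because `HfApi.upload_folder` defaults to uploading the folder's contents at the repo root; passing `path_in_repo` mirrors the local folder name inside the dataset repo. A minimal sketch of the resulting call, using the dataset repo named elsewhere in this PR as an example:

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes an HF token is already configured (e.g. via HF_TOKEN)
api.upload_folder(
    repo_id="PromptSystematicReview/Prompt_Systematic_Review_Dataset",  # example from this PR
    folder_path="papers",        # local folder to upload
    path_in_repo="papers",       # the new argument: keep files under papers/ in the repo
    commit_message="Add papers",
    repo_type="dataset",
)
```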

topic-model/README.md

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Running a topic model on the data

## Installation

First, [install poetry](https://python-poetry.org/docs), then install this package with `poetry install`. NB: it may be possible to install directly with `pip install -e .`, but I haven't tested this.

Type `soup-nuts --help` to make sure the preprocessing package was installed correctly. If it wasn't, clone [this repo](https://github.com/ahoho/topics) and try running `poetry install` there, then `poetry add tomotopy`.

## Process data

Download the CSV of papers and abstracts:

```console
curl https://huggingface.co/datasets/PromptSystematicReview/Prompt_Systematic_Review_Dataset/resolve/main/master_papers.csv -o master_papers.csv
```

Optionally, learn common phrases (e.g., `prompt_engineering`):

```bash
mkdir ./detected-phrases

soup-nuts detect-phrases \
    master_papers.csv \
    ./detected-phrases \
    --input-format csv \
    --text-key abstract \
    --id-key paperId \
    --lowercase \
    --min-count 15 \
    --token-regex wordlike \
    --no-detect-entities
```

Preprocess the data. Feel free to play around with these parameters (see `soup-nuts preprocess --help` for details):

```bash
soup-nuts preprocess \
    master_papers.csv \
    ./processed \
    --text-key abstract \
    --id-key paperId \
    --lowercase \
    --input-format csv \
    --detect-entities \
    --phrases ./detected-phrases/phrases.json \
    --max-doc-freq 0.9 \
    --min-doc-freq 2 \
    --output-text \
    --metadata-keys abstract,title,url \
    --stopwords stopwords.txt
```

## Run the topic model

```
python run_tomotopy.py --num_topics 25 --iterations 1000
```

You can view the outputs in `topic_outputs-<num_topics>.html`.
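
`run_tomotopy.py` itself isn't shown in this diff, but the run step above is, roughly, a plain tomotopy LDA over the preprocessed tokens. A hedged sketch of that loop; the `processed/train.txt` path and the plain-text printout are assumptions, not the script's actual interface:

```python
import tomotopy as tp

# Assumes soup-nuts wrote one whitespace-tokenized document per line
# (the --output-text flag above) to processed/train.txt.
model = tp.LDAModel(k=25, seed=42)
with open("processed/train.txt", encoding="utf-8") as f:
    for line in f:
        tokens = line.split()
        if tokens:
            model.add_doc(tokens)

# Train in chunks so convergence is visible via per-word log-likelihood.
for _ in range(10):
    model.train(100)
    print(f"iter={model.global_step} ll/word={model.ll_per_word:.3f}")

# Top words per topic, a plain-text version of what the HTML report shows.
for k in range(model.k):
    top = ", ".join(word for word, _ in model.get_topic_words(k, top_n=10))
    print(f"topic {k}: {top}")
```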
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
{
"output_dir": "detected-phrases",
"input_format": "csv",
"passes": 1,
"lowercase": true,
"detect_entities": false,
"detect_noun_chunks": false,
"token_regex": "re.compile('^[\\\\w-]*[a-zA-Z][\\\\w-]*$')",
"min_count": 15,
"threshold": 10.0,
"max_vocab_size": 40000000.0,
"connector_words": "frozenset({'hereby', 'any', '\u2018re', 'when', 'only', 'nobody', 'this', 'whatever', 'whereby', 'ourselves', '\u2019ve', 'his', 'himself', 'afterwards', 'along', \"'ll\", 'mine', 'whenever', 'again', \"'s\", 'latter', \"'d\", 'across', 'five', 'n\u2018t', 'somewhere', 'may', 'themselves', 'did', 'nothing', 'whither', 'her', 'is', 'get', 'can', 'ours', 'could', 'keep', 'for', 'just', 'in', 'quite', 'no', 'such', 'hereafter', 'due', 'really', 'therefore', 'you', 'per', 'give', 'anything', 'using', 'whose', 'anyone', 'as', 'i', 'besides', 'therein', 'anywhere', '\u2018m', 'be', 'sixty', 'wherein', 'amount', 'name', 'whereafter', 'then', 'whereupon', 'still', 'your', \"'re\", 'less', 'make', 'one', 'hereupon', 'please', '\u2018d', 'of', 'yet', 'someone', 'while', 'without', 'how', 'here', 'does', 'whereas', '\u2019m', 'n\u2019t', 'fifty', 'once', 'but', 'empty', '\u2019s', 'thereupon', 'sometimes', 'regarding', 'itself', 'seems', 'front', 'with', '\u2019d', 'there', 'all', 'might', 'our', 'ever', 'were', 'why', 'done', 'many', 'nowhere', 'around', 'otherwise', 'upon', 'made', 'latterly', 'perhaps', 'forty', 'hers', 'these', 'him', 'something', 'namely', 'are', 'other', 'unless', 'until', 'doing', 'nevertheless', 'full', 'become', 'else', 'more', 'meanwhile', 'see', 'beyond', 'further', 'whence', 'among', 'behind', 'former', 'move', 'rather', 'that', 'seem', 'both', 'sometime', 'where', 'on', 'since', 'out', 'however', 'throughout', 'or', 'whole', \"'m\", 'also', 'than', 'few', 'well', 'me', 'often', 'own', '\u2019ll', 'my', 'except', 'wherever', 'least', 'twelve', \"'ve\", 'three', 'another', 'mostly', 'became', 'indeed', 'he', 're', 'always', 'beside', 'by', 'first', 'enough', 'whoever', 'serious', 'everything', 'thence', 'from', 'neither', 'if', 'under', 'anyhow', 'back', 'anyway', 'already', 'whom', 'above', 'us', 'put', 'it', 'onto', 'being', 'everywhere', 'twenty', 'thereby', 'even', 'thus', 'hundred', 'go', 'because', 'over', 'very', 'so', 'four', 'have', 'bottom', 'up', 'used', 'hence', 'seeming', 'everyone', 'each', 'show', 'yourselves', 'nine', 'elsewhere', 'ten', 'its', 'she', 'noone', 'about', 'off', 'never', 'not', 'too', 'next', 'into', 'becoming', 'thereafter', 'we', 'none', 'down', 'every', 'which', '\u2018ll', 'do', 'almost', 'top', 'via', 'several', 'nor', 'much', 'am', 'what', '\u2018ve', 'a', 'them', 'whether', 'their', 'and', 'they', 'side', 'thru', 'before', 'amongst', 'will', 'most', 'either', \"n't\", 'an', 'to', 'between', 'part', 'has', 'alone', 'below', 'together', 'some', 'becomes', 'formerly', 'beforehand', 'though', 'various', 'say', 'after', 'should', 'towards', 'now', 'through', 'must', 'somehow', 'although', 'herein', 'fifteen', 'eight', 'take', 'same', 'eleven', 'last', 'cannot', 'been', 'yours', 'at', 'third', 'had', 'was', 'call', 'others', 'would', 'moreover', 'who', 'those', 'ca', 'against', '\u2018s', '\u2019re', 'yourself', 'the', 'toward', 'two', 'six', 'herself', 'within', 'during', 'seemed', 'myself'})",
"phrases": null,
"max_phrase_len": null,
"n_process": -1,
"encoding": "utf-8",
"id_key": "paperId",
"input_path": "master_papers.csv",
"lines_are_documents": true,
"max_doc_size": null,
"text_key": "abstract"
}
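
The `token_regex` recorded in this dump is the compiled form of the `wordlike` preset passed on the command line: it keeps tokens built from word characters and hyphens that contain at least one ASCII letter, so pure numbers and punctuation are dropped. A quick illustration (the sample tokens are mine):

```python
import re

# The "wordlike" pattern from the params dump, unescaped.
wordlike = re.compile(r"^[\w-]*[a-zA-Z][\w-]*$")

for token in ["prompt", "gpt-4", "f_1", "2023", "--"]:
    verdict = "keep" if wordlike.match(token) else "drop"
    print(f"{token!r}: {verdict}")  # "2023" and "--" are dropped; the rest kept
```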
Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
wide_range
benchmark_datasets
multiple_choice
parameter_efficient
knowledge_intensive
experiments_demonstrate
future_research
computer_vision
promising_results
find_relation
contrastive_learning
results_demonstrate
black_box
downstream_tasks
open_source
task_specific
external_knowledge
existing_methods
f1_score
world_scenarios
paper_we_propose
test_cases
recent_work
test_time
fine_tune
resource_languages
code_generation
publicly_available
world_applications
sentiment_analysis
high_quality
great_potential
remarkable_capabilities
end_to_end
image_generation
machine_learning
prompting_technique
superior_performance
retrieval_augmented
paper_we_introduce
training_data
language_models
competitive_performance
prompt_optimization
hand_crafted
point_cloud
meta_learning
shot_learning
style_transfer
fine_tuned
lingual_transfer
prompt_template
find_event
context_examples
reinforcement_learning
pre_trained
ood_nlp
human_like
cot_prompting
shot_setting
code_is_available
multi_task
rosgpt_vision
model_size
work_we_propose
named_entity
chain_of_thought
source_code
high_level
pre_training
prior_work
generalization_ability
language_model
multi_modal
fully_supervised
multi_hop
zero_shot
semantic_parsing
results_indicate
low_resource
instruction_tuning
relation_extraction
f_1
prompt_injection
natural_language
pretrained_language
consistently_outperforms
task_oriented
text_to_sql
step_by_step
models_plms
neural_networks
data_augmentation
propose_a_novel
models_lms
input_output
proposed_method
machine_translation
cross_lingual
intelligence_ai
training_examples
shot_settings
instruction_following
extensive_experiments
processing_nlp
significantly_outperforms
findings_suggest
propose_a_new
language_processing
general_purpose
ground_truth
prompting_techniques
fact_checking
achieves_state
demonstrated_remarkable
prompt_engineering
shot_prompting
new_paradigm
text_classification
novel_approach
jailbreak_prompts
multi_step
real_world
nlp_tasks
generative_ai
r_score
domain_specific
small_number
labeled_data
big_bench
text_to_image
large_scale
human_written
recent_advances
paper_presents
experimental_results
fewshot_lama
vision_language
e_commerce
foundation_models
large_languagemodels
large_language
prompt_based
information_extraction
stable_diffusion
shown_impressive
mental_health
thought_prompting
knowledge_graph
method_achieves
complex_reasoning
paper_proposes
self_supervised
time_consuming
inthis_paper
recent_years
like_chatgpt
reasoning_steps
conduct_extensive
trained_language
fine_grained
prompt_templates
llm_articulated_object_manipulation
fine_tuning
success_rate
entity_recognition
et_al
models_llms
artificial_intelligence
annotated_data
state_of_the_art
context_learning
social_media
learning_icl
knowledge_distillation
thought_cot
demonstrate_the_effectiveness
question_answering
largelanguage_models
decision_making
open_domain
paper_we_present
object_detection
