
Pacify linter, add coverage to requirements
jmelot committed Jan 22, 2024
1 parent 2c1fdc3 commit a4436c8
Showing 5 changed files with 24 additions and 10 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -5,7 +5,7 @@ This repository contains a description and supporting code for CSET's current me
cross-dataset article linking. Note that we use "article" very loosely, although in a way that to our knowledge
is fairly consistent across corpora. Books, for example, are included.

-For each article in arXiv, WOS, Papers With Code, Semantic Scholar, The Lens, and OpenAlex
+For each article in arXiv, WOS, Papers With Code, Semantic Scholar, The Lens, and OpenAlex
we normalized titles, abstracts, and author last names. For the purpose of matching, we filtered out
titles, abstracts, and DOIs that occurred more than 10 times in the corpus. We then considered each group of articles
within or across datasets that shared at least one of the following (non-null) metadata fields:
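The matching procedure the README describes (normalize metadata, drop values that occur more than 10 times, then group articles sharing any remaining non-null field) can be sketched as a small union-find pass. This is a toy illustration under assumed field names, not CSET's actual implementation:

```python
from collections import defaultdict


def group_articles(articles, fields=("title", "abstract", "doi"), max_freq=10):
    """Group article dicts that share any non-null metadata value.

    Values occurring more than max_freq times are ignored, mirroring the
    filtering step described in the README. Toy sketch only.
    """
    # Count how often each (field, value) pair occurs across the corpus
    counts = defaultdict(int)
    for art in articles:
        for f in fields:
            if art.get(f):
                counts[(f, art[f])] += 1

    # Union-find over article indices, with path halving
    parent = list(range(len(articles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Link articles that share a sufficiently rare value
    first_seen = {}
    for idx, art in enumerate(articles):
        for f in fields:
            val = art.get(f)
            if val and counts[(f, val)] <= max_freq:
                if (f, val) in first_seen:
                    union(idx, first_seen[(f, val)])
                else:
                    first_seen[(f, val)] = idx

    groups = defaultdict(list)
    for idx in range(len(articles)):
        groups[find(idx)].append(idx)
    return list(groups.values())
```

Two records sharing a DOI (or any other rare field value) end up in one group; a record sharing nothing stays alone.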
27 changes: 20 additions & 7 deletions linkage_dag.py
@@ -6,7 +6,6 @@
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
-from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.providers.google.cloud.operators.bigquery import (
BigQueryCheckOperator,
@@ -20,6 +19,7 @@
DataflowCreatePythonJobOperator,
)
from airflow.providers.google.cloud.operators.gcs import GCSDeleteObjectsOperator
+from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.providers.google.cloud.transfers.bigquery_to_bigquery import (
BigQueryToBigQueryOperator,
)
@@ -372,7 +372,7 @@
task_id="wait_for_simhash_index",
bucket=DATA_BUCKET,
object=f"{tmp_dir}/done_files/simhash_is_done",
-deferrable=True
+deferrable=True,
)

create_cset_ids = BashOperator(
@@ -384,7 +384,7 @@
task_id="wait_for_cset_ids",
bucket=DATA_BUCKET,
object=f"{tmp_dir}/done_files/ids_are_done",
-deferrable=True
+deferrable=True,
)

push_to_gcs = BashOperator(
@@ -632,12 +632,25 @@
>> wait_for_combine
)

-(last_combination_query >> heavy_compute_inputs >> gce_instance_start >> prep_environment >>
-update_simhash_index >> wait_for_simhash_index >> create_cset_ids >> wait_for_cset_ids >> push_to_gcs >>
-gce_instance_stop)
+(
+    last_combination_query
+    >> heavy_compute_inputs
+    >> gce_instance_start
+    >> prep_environment
+    >> update_simhash_index
+    >> wait_for_simhash_index
+    >> create_cset_ids
+    >> wait_for_cset_ids
+    >> push_to_gcs
+    >> gce_instance_stop
+)

gce_instance_start >> run_lid >> gce_instance_stop

-gce_instance_stop >> [import_id_mapping, import_lid] >> start_final_transform_queries
+(
+    gce_instance_stop
+    >> [import_id_mapping, import_lid]
+    >> start_final_transform_queries
+)

last_transform_query >> check_queries >> start_production_cp
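The reformatted chains above use Airflow's `>>` dependency operator, which works because operators implement `__rshift__` (and `__rrshift__`, so that a plain list of tasks can appear mid-chain). A self-contained toy mimic of that mechanic, not Airflow itself:

```python
class Task:
    """Minimal stand-in for an Airflow operator's dependency API."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def _link(self, other):
        # Record the dependency edge(s) and return the right-hand side
        # so chains like a >> b >> c keep flowing left to right.
        if isinstance(other, list):
            self.downstream.extend(other)
        else:
            self.downstream.append(other)
        return other

    def __rshift__(self, other):
        # t1 >> t2 makes t2 downstream of t1
        return self._link(other)

    def __rrshift__(self, other):
        # [t1, t2] >> t3: Python falls back to the right operand
        # because list has no __rshift__ of its own.
        if isinstance(other, list):
            for t in other:
                t._link(self)
        return self


# Mirrors the shape of the chains in the DAG above (hypothetical tasks)
extract, transform, load = Task("extract"), Task("transform"), Task("load")
extract >> transform >> load  # load now depends on transform, transform on extract
```

The fan-out form `stop >> [a, b] >> final` works the same way: the list step links both middle tasks downstream of `stop`, and `__rrshift__` links `final` downstream of each list member.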
1 change: 1 addition & 0 deletions requirements.txt
@@ -46,3 +46,4 @@ typing-extensions==3.7.4.1
wcwidth==0.1.8
zipp==3.0.0
pre-commit
+coverage
2 changes: 1 addition & 1 deletion utils/run_ids_scripts.sh
@@ -5,4 +5,4 @@ python3 create_merge_ids.py --match_dir usable_ids --prev_id_mapping_dir prev_id
/snap/bin/gsutil -m cp new_simhash_indexes/* gs://airflow-data-exchange/article_linkage/simhash_indexes/
/snap/bin/gsutil -m cp new_simhash_indexes/* gs://airflow-data-exchange/article_linkage/simhash_indexes_archive/$(date +%F)/
touch ids_are_done
-gsutil cp ids_are_done gs://airflow-data-exchange/article_linkage/tmp/done_files/
+gsutil cp ids_are_done gs://airflow-data-exchange/article_linkage/tmp/done_files/
2 changes: 1 addition & 1 deletion utils/run_simhash_scripts.sh
@@ -3,4 +3,4 @@ python3 run_simhash.py simhash_input simhash_results --simhash_indexes simhash_i
cp -r article_pairs usable_ids
cp simhash_results/* article_pairs/
touch simhash_is_done
-gsutil cp simhash_is_done gs://airflow-data-exchange/article_linkage/tmp/done_files/
+gsutil cp simhash_is_done gs://airflow-data-exchange/article_linkage/tmp/done_files/
