This repository contains the accompanying code, dataset and online appendix of:
Timo Breuer, Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, Philipp Schaer, and Ian Soboroff. 2020. How to Measure the Reproducibility of System-oriented IR Experiments. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’20).
Replicability and reproducibility of experimental results are primary concerns in all areas of science, and IR is no exception. Besides the problem of moving the field towards more reproducible experimental practices and protocols, we also face a severe methodological issue: we do not have any means to assess when reproduced is reproduced. Moreover, we lack any reproducibility-oriented dataset that would allow us to develop such methods.

To address these issues, we compare several measures that objectively quantify to what extent a system-oriented IR experiment has been replicated or reproduced. These measures operate at different levels of granularity, from the fine-grained comparison of ranked lists to the more general comparison of the obtained effects and significant differences. Moreover, we also develop a reproducibility-oriented dataset, which allows us to validate our measures and can be used to develop future measures.
- `appendix/`: online appendix with additional Tables and Figures
- `config/`: configurations of each run constellation
- `core/`: core modules of the reimplementation
- `dataset/`: replicated and reproduced results of `wcrobust04` and `wcrobust0405` with 200 runs in total
- `evaluation/`: scripts for the evaluation of the experimental setup
- `replicability/`: scripts for producing replicated results
- `reproducibility/`: scripts for producing reproduced results
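Since the runs are scored with trec_eval (see the setup below), the run files in `dataset/` are presumably stored in the standard TREC run format. The sketch below is only an illustration of how such a file could be read; the helper function and the file path are assumptions, not part of the repository.

```python
# Minimal sketch for reading a run file in standard TREC run format
# (qid Q0 docid rank score tag). This assumes the runs in dataset/ use this
# format because they are evaluated with trec_eval; the path below is a
# placeholder and does not name an actual file in the repository.
from collections import defaultdict

def read_trec_run(path):
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _, docid, rank, score, tag = line.split()
            run[qid].append((docid, int(rank), float(score)))
    return run

# run = read_trec_run("dataset/some_run_file.txt")  # placeholder path
```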
- Install requirements with pip:
  `pip install -r requirements.txt`
- Download English stopwords for nltk:
  `python -m nltk.downloader stopwords`
- Clone trec_eval and compile it in this directory:
  `git clone https://github.com/usnistgov/trec_eval.git && make -C trec_eval`
- Edit `config/config.py` by adding the paths of the four test collections to the parameters `robust04`, `robust05`, `core17`, and `core18`.
- Specify one of the 50 run constellations with the help of `config/settings.py`. Set the parameter `num_con` to the number of the chosen constellation. If the preprocessing has already been done for a previous run, it can be omitted by setting the parameter `data_prep` to `False` (a configuration sketch follows this list).
- Run the commands from the table below to produce the respective run.
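For orientation, the two configuration edits above might look as follows. The parameter names are those mentioned in the steps, while the paths, values, and the surrounding layout of `config/config.py` and `config/settings.py` are placeholders that may differ from the actual files.

```python
# config/config.py -- illustrative sketch, paths are placeholders
robust04 = "/path/to/robust04"
robust05 = "/path/to/robust05"
core17 = "/path/to/core17"
core18 = "/path/to/core18"

# config/settings.py -- illustrative sketch
num_con = 45       # number of the chosen run constellation (see the tables below)
data_prep = True   # set to False if preprocessing was already done in a previous run
```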
| | Replicability | Reproducibility |
|---|---|---|
| WCRobust04 | `python -m replicability.wcrobust04` | `python -m reproducibility.wcrobust04` |
| WCRobust0405 | `python -m replicability.wcrobust0405` | `python -m reproducibility.wcrobust0405` |
| run name | constellation number |
|---|---|
| rpl/rpd_tf_1 | 45 |
| rpl/rpd_tf_2 | 46 |
| rpl/rpd_tf_3 | 47 |
| rpl/rpd_tf_4 | 48 |
| rpl/rpd_tf_5 | 49 |
| rpl/rpd_df_1 | 14 |
| rpl/rpd_df_2 | 15 |
| rpl/rpd_df_3 | 16 |
| rpl/rpd_df_4 | 17 |
| rpl/rpd_df_5 | 18 |
| rpl/rpd_tol_1 | 39 |
| rpl/rpd_tol_2 | 38 |
| rpl/rpd_tol_3 | 37 |
| rpl/rpd_tol_4 | 36 |
| rpl/rpd_tol_5 | 35 |
| rpl/rpd_C_1 | 44 |
| rpl/rpd_C_2 | 43 |
| rpl/rpd_C_3 | 42 |
| rpl/rpd_C_4 | 41 |
| rpl/rpd_C_5 | 40 |
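For example, the run `rpl_tf_1` corresponds to constellation 45: set `num_con = 45` in `config/settings.py` and then run one of the commands from the table above, e.g. `python -m replicability.wcrobust04`.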
The evaluation scripts require Matlab and Matters. Alternatively, some evaluation measures are already precomputed and stored as csv files in `evaluation/matlab/results`.
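As a shortcut for inspecting these precomputed results, here is a minimal sketch with pandas, assuming pandas is installed and that the files are plain comma-separated tables; the glob pattern is only a placeholder.

```python
# Minimal sketch: list and preview the precomputed evaluation results.
# Assumes the files in evaluation/matlab/results/ are plain csv tables.
import glob
import pandas as pd

for path in sorted(glob.glob("evaluation/matlab/results/*.csv")):
    df = pd.read_csv(path)
    print(path, df.shape)
```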