The following change log details commits to regression tests that alter effectiveness and the addition of new regression tests. This documentation is useful for figuring out why results may have changed over time.
- commit 445bb45 (10/11/2019)
Add regressions for NTCIR-8 ACLIA (IR4QA subtask, Chinese monolingual).
- commit e88b931 (9/5/2019)
As it turns out, the entry below (commit 2f1b665) was incorrect: regression numbers did change slightly after the BM25prf fix.
- commit 2f1b665 (8/14/2019)
Resolves inconsistent tie-breaking for BM25prf that leads to non-deterministic results, per #774. Note that regression numbers did not change.
Added new Doc2query regression car17v2.0-doc2query to replicate Nogueira et al. (arXiv 2019) on the TREC 2017 Complex Answer Retrieval (CAR) section-level passage retrieval task (v2.0).
Added +Ax and +PRF regressions with both tuned and default BM25 parameters for the MS MARCO passage ranking task.
- commit 80c5447 (8/5/2019)
Added +Ax and +PRF regressions with both tuned and default BM25 parameters for the MS MARCO document ranking task.
Added new Doc2query regression msmarco-passage-doc2query to replicate Nogueira et al. (arXiv 2019) on the MS MARCO passage ranking task.
Added tuned BM25 parameters to the msmarco-doc regression.
Associated documentation updated.
- commit 75e36f9 (6/9/2019)
Upgrade to Lucene 8: minor changes to all regression experiments. JDIQ 2018 experiments are no longer maintained.
Added regressions for MS MARCO passage and document ranking tasks.
Fixed bug in topic reader for CAR. Better parsing of New York Times documents. Regression numbers in both cases improved slightly.
- commit 27493ed (5/31/2019)
Per #658: fixed broken regression in Core18 introduced by commit c4ab6b (4/18/2019).
CAR regression refactoring: added v2.0 regression and renamed existing regression to v1.5. Both use benchmarkY1-test to support consistent comparisons.
- commit 407f308 (1/2/2019)
Added fine-tuning results (i.e., SIGIR Forum article experiments) for axiomatic semantic term matching.
- commit 1aa3970 (12/24/2018)
Changed RM3 defaults to match settings in Indri.
- commit e71df7a (12/20/2018)
Added Axiomatic F2Exp and F2Log ranking models back into Anserini (previously, we were using the default Lucene implementation as part of the Lucene 7.6 upgrade).
- commit e71df7a (12/18/2018)
Upgrade to Lucene 7.6.
- commit e5b87f0 (11/30/2018)
Added default regressions for TREC 2018 Common Core Track.
- commit 2c8cd7a (11/16/2018)
This is the commit id referenced in the SIGIR Forum 2018 article.
Note that commit 18c3211 (12/9/2018) contains minor fixes to the code.
- commit 10255e0 (10/22/2018)
Fixed incorrect implementation of -rm3.fbTerms.
- commit 7c882d3 (9/26/2018)
Fixed bug as part of #429: cw12 and mb13 regression tests changed slightly in effectiveness.
- commit d4b3272 (8/8/2018)
Added regression tests for CAR17.
- commit c0da510 (8/5/2018)
This commit adds effectiveness verification testing for the JDIQ 2018 paper.
These three commits establish the new regression testing infrastructure with the following tests:
- Experiments on Disks 1 & 2: {BM25, QL} ⨯ {RM3, Ax} ⨯ {AP, P30}
- Experiments on Disks 4 & 5 (Robust04): {BM25, QL} ⨯ {RM3, Ax} ⨯ {AP, P30}
- Experiments on AQUAINT (Robust05): {BM25, QL} ⨯ {RM3, Ax} ⨯ {AP, P30}
- Experiments on New York Times (Core17): {BM25, QL} ⨯ {RM3, Ax} ⨯ {AP, P30}
- Experiments on Wt10g: {BM25, QL} ⨯ {RM3, Ax} ⨯ {AP, P30}
- Experiments on Gov2: {BM25, QL} ⨯ {RM3, Ax} ⨯ {AP, P30}
- Experiments on ClueWeb09 (Category B): {BM25, QL} ⨯ {RM3, Ax} ⨯ {AP, P30, NDCG@20, ERR@20}
- Experiments on ClueWeb12-B13: {BM25, QL} ⨯ {RM3, Ax} ⨯ {AP, P30, NDCG@20, ERR@20}
- Experiments on ClueWeb12: {BM25, QL} ⨯ {RM3} ⨯ {AP, P30, NDCG@20, ERR@20}
- Experiments on Tweets2011 (MB11 & MB12): {BM25, QL} ⨯ {RM3, Ax} ⨯ {AP, P30}
- Experiments on Tweets2013 (MB13 & MB14): {BM25, QL} ⨯ {RM3, Ax} ⨯ {AP, P30}
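The ⨯ notation in the list above can be read as a Cartesian product of ranking models, query expansion methods, and evaluation metrics. The sketch below enumerates the combinations for one collection under that reading. The `expand` helper and the `model+expansion / metric` label format are hypothetical, chosen only for illustration; the actual regression configurations may also cover runs that this plain product does not, such as unexpanded baselines.

```python
from itertools import product


def expand(models, expansions, metrics):
    """Enumerate every (model, expansion, metric) combination as a label.

    Hypothetical helper: it only illustrates the set-product notation
    and does not reflect how the regression scripts are organized.
    """
    return [f"{m}+{e} / {k}" for m, e, k in product(models, expansions, metrics)]


# Disks 4 & 5 (Robust04): {BM25, QL} x {RM3, Ax} x {AP, P30}
robust04 = expand(["BM25", "QL"], ["RM3", "Ax"], ["AP", "P30"])

# ClueWeb12: {BM25, QL} x {RM3} x {AP, P30, NDCG@20, ERR@20}
clueweb12 = expand(["BM25", "QL"], ["RM3"], ["AP", "P30", "NDCG@20", "ERR@20"])

for label in robust04:
    print(label)
```

Under this reading, each of the two collections shown yields eight combinations: 2 × 2 × 2 for Robust04 and 2 × 1 × 4 for ClueWeb12.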