Fixed "neural hype" tuning experiments after Lucene 8.0 upgrade (cast…

+ retuned retrieval models for Lucene 8.0

Refactored tuning script:
+ made command-line parameters more consistent
+ broke fold settings into external config files for greater generality
+ removed unintuitive distinction between "model" and "basemodel": there's just "model" now.
lintool authored Jul 3, 2019
1 parent 42047ca commit 64bae9c
Showing 34 changed files with 298 additions and 94,923 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -77,12 +77,12 @@ Note that these regressions capture the "out of the box" experience, based on [_

Other experiments:

+ [Replicating "Neural Hype" Experiments](docs/experiments-forum2018.md)
+ [Guide to running BM25 baselines on the MS MARCO Passage Task](docs/experiments-msmarco-passage.md)
+ [Guide to running BM25 baselines on the MS MARCO Document Task](docs/experiments-msmarco-doc.md)
+ [Guide to replicating document expansion by query prediction (Doc2query) results](docs/experiments-doc2query.md)
+ [Guide to running experiments on the AI2 Open Research Corpus](docs/experiments-openresearch.md)
+ [Experiments from Yang et al. (JDIQ 2018)](docs/experiments-jdiq2018.md)
+ [Experiments from Lin (SIGIR Forum 2018)](docs/experiments-forum2018.md)
+ Runbooks for TREC 2018: [[Anserini group](docs/runbook-trec2018-anserini.md)] [[h2oloo group](docs/runbook-trec2018-h2oloo.md)]
+ Runbook for [ECIR 2019 paper on axiomatic semantic term matching](docs/runbook-ecir2019-axiomatic.md)
+ Runbook for [ECIR 2019 paper on cross-collection relevance feedback](docs/runbook-ecir2019-ccrf.md)
101 changes: 73 additions & 28 deletions docs/experiments-forum2018.md
@@ -1,74 +1,119 @@
# Anserini: SIGIR Forum 2018 Experiments
# Anserini: "Neural Hype" Baseline Experiments

This page documents code for replicating results from the following article:
This page provides documentation for replicating results from two "neural hype" papers, which questioned whether neural ranking models actually represent improvements in _ad hoc_ retrieval effectiveness over well-tuned "competitive baselines" in limited data scenarios:

+ Jimmy Lin. [The Neural Hype and Comparisons Against Weak Baselines.](http://sigir.org/wp-content/uploads/2019/01/p040.pdf) SIGIR Forum, 52(2):40-51, 2018.
+ Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. [Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models.](https://cs.uwaterloo.ca/~jimmylin/publications/Yang_etal_SIGIR2019.pdf) _SIGIR 2019_.

Note that the commit [2c8cd7a](https://github.com/castorini/Anserini/commit/2c8cd7a550faca0fc450e4159a4a874d4795ac25) referenced in the article is out of date with respect to the latest experimental results.
See "History" section below.
The "competitive baseline" referenced in the two above papers is BM25+RM3, with proper parameter tuning, on the test collection from the TREC 2004 Robust Track (Robust04).
Scripts referenced on this page encode automated regressions that allow users to recreate and verify the results reported below.

**Requirements**: Python>=2.6 or Python>=3.5 `pip install -r src/main/python/requirements.txt`
The SIGIR Forum article references commit [`2c8cd7a`](https://github.com/castorini/Anserini/commit/2c8cd7a550faca0fc450e4159a4a874d4795ac25) (11/16/2018), the results of which changed slightly with an upgrade to Lucene 7.6 at commit [`e71df7a`](https://github.com/castorini/Anserini/commit/e71df7aee42c7776a63b9845600a4075632fa11c) (12/18/2018).
The SIGIR 2019 paper contains experiments performed after this upgrade.

Folds:
The Anserini upgrade to Lucene 8.0 at commit [`75e36f9`](https://github.com/castorini/anserini/commit/75e36f97f7037d1ceb20fa9c91582eac5e974131) (6/12/2019) broke the regression tests; this was later fixed at commit [`xxxxxxx`](https://github.com/castorini/anserini/commit/xxxxxxx) (x/x/xxx).
This commit represents the latest state of the code and the results that can currently be replicated.
See the summary in the "History" section below.


## Expected Results

Retrieval models are tuned with respect to following fold definitions:

+ [Folds for 2-fold cross-validation used in "paper 1"](../src/main/resources/fine_tuning/robust04-paper1-folds.json)
+ [Folds for 5-fold cross-validation used in "paper 2"](../src/main/resources/fine_tuning/robust04-paper2-folds.json)

Here are expected results for various retrieval models:

AP | Paper 1 | Paper 2 |
:------------------|---------|---------|
BM25 (default) | 0.2531 | 0.2531 |
BM25 (tuned) | 0.2539 | 0.2531 |
QL (default) | 0.2467 | 0.2467 |
QL (tuned) | 0.2520 | 0.2499 |
BM25+RM3 (default) | 0.2903 | 0.2903 |
BM25+RM3 (tuned) | 0.3043 | 0.3021 |
BM25+Ax (default) | 0.2896 | 0.2896 |
BM25+Ax (tuned) | 0.2940 | 0.2950 |


## Parameter Tuning

First, change the index path at `src/main/resources/fine_tuning/collections.yaml`.
The script will go through the `index_roots` and concatenate with the collection's `index_path` and take the first match as the index path.
Before starting, modify the index path at `src/main/resources/fine_tuning/collections.yaml`.
The tuning script will go through the `index_roots`, concatenate with the collection's `index_path`, and take the first match as the location of the index.
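
As a rough sketch of that resolution logic (the exact schema of `collections.yaml` is not reproduced here, so the field names and the `resolve_index_path` helper below are assumptions for illustration):

```
import os
import yaml  # requires PyYAML

def resolve_index_path(config_file, collection):
    """Return the first <root>/<index_path> combination that exists on disk.

    The config layout assumed here: a top-level `index_roots` list and a
    per-collection `index_path` entry.
    """
    with open(config_file) as f:
        config = yaml.safe_load(f)
    index_path = config['collections'][collection]['index_path']
    for root in config['index_roots']:
        candidate = os.path.join(root, index_path)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(f'no index found for collection {collection}')
```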

BM25 Robust04 (runs + eval + print results):
Tuning BM25:

```
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25 --n 44 --run --use_drr_fold
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25 --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```

QL Robust04 (runs + eval + print results):
The first command runs the parameter sweeps and prints general statistics.
The second and third commands use a specific fold setting to perform cross-validation and print out model parameters.
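
For intuition, here is a minimal sketch of the cross-validation step, assuming the folds file is a JSON list of topic-ID lists and that per-topic AP scores from the parameter sweep are already available (the `sweep_results` layout is hypothetical):

```
import json

def cross_validate_map(folds_file, sweep_results):
    """sweep_results: {param_setting: {topic_id: AP}} -- assumed output of the sweep above."""
    with open(folds_file) as f:
        folds = json.load(f)  # assumed: list of lists of topic IDs

    total_ap, num_topics = 0.0, 0
    for i, test_fold in enumerate(folds):
        # Train on all other folds: pick the setting with the best mean AP there.
        train_topics = [t for j, fold in enumerate(folds) if j != i for t in fold]
        best = max(sweep_results,
                   key=lambda p: sum(sweep_results[p][t] for t in train_topics) / len(train_topics))
        # Score the held-out fold with that setting.
        total_ap += sum(sweep_results[best][t] for t in test_fold)
        num_topics += len(test_fold)
    return total_ap / num_topics  # cross-validated MAP
```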

Tuning QL (commands similarly organized):

```
python src/main/python/fine_tuning/run_batch.py --collection robust04 --basemodel ql --model ql --n 44 --run --use_drr_fold
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model ql --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model ql --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model ql --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```

BM25+RM3 Robust04 (runs + eval + print results):
Tuning BM25+RM3 (commands similarly organized):

```
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+rm3 --n 44 --run --use_drr_fold
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+rm3 --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+rm3 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+rm3 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```

BM25+AxiomaticReranking Robust04 (runs + eval + print results):
Tuning BM25+Ax (commands similarly organized):

```
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+axiom --n 44 --run --use_drr_fold
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+axiom --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+axiom --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+axiom --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```

## Tuned Run

Tuned parameter values:
## Tuned Runs

Tuned parameter values for BM25+RM3:

+ [For the 2-fold cross-validation used in "paper 1", in terms of MAP](../src/main/resources/fine_tuning/robust04-paper1-folds-map-params.json)
+ [For tor 5-fold cross-validation used in "paper 2", in terms of MAP](../src/main/resources/fine_tuning/robust04-paper2-folds-map-params.json)
+ [For the 2-fold cross-validation used in "paper 1", in terms of MAP](../src/main/resources/fine_tuning/params/params.map.robust04-paper1-folds.bm25+rm3.json)
+ [For the 5-fold cross-validation used in "paper 2", in terms of MAP](../src/main/resources/fine_tuning/params/params.map.robust04-paper2-folds.bm25+rm3.json)

To be clear, these are the tuned parameters on _that_ fold, trained on the remaining folds.

The follow script will reconstruct the tuned runs for BM25 + RM3:
The following script will reconstruct the tuned runs for BM25+RM3:

```
python src/main/python/fine_tuning/reconstruct_robus04_tuned_run.py \
--index lucene-index.robust04.pos+docvectors+rawdocs \
--folds src/main/resources/fine_tuning/robust04-paper2-folds.json \
--params src/main/resources/fine_tuning/robust04-paper2-folds-map-params.json
--folds src/main/resources/fine_tuning/robust04-paper1-folds.json \
--params src/main/resources/fine_tuning/params/params.map.robust04-paper1-folds.bm25+rm3.json \
--output run.robust04.bm25+rm3.paper1.txt
```

Change `paper2` to `paper1` to reconstruct using the folds in paper 1.
Change `paper1` to `paper2` to reconstruct using the folds in paper 2.

To reconstruct runs from other retrieval models, use the parameter definitions in [`src/main/resources/fine_tuning/params/`](../src/main/resources/fine_tuning/params/), plugging them into the above command as appropriate.

Note that applying `trec_eval` to these reconstructed runs might yield AP values slightly different from those reported above (by at most 0.0001).
This difference arises from rounding when averaging across the folds.
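
A toy illustration of the rounding effect (the per-fold AP values are made up, and equal-size folds are assumed for simplicity):

```
# Made-up per-fold AP means from a hypothetical 5-fold cross-validation run.
fold_map = [0.295331, 0.297131, 0.298931, 0.294731, 0.296631]

# Figure as reported in the table: average of per-fold values, each rounded to 4 decimals.
reported = round(sum(round(m, 4) for m in fold_map) / len(fold_map), 4)

# Figure trec_eval reports on the concatenated run: average over all topics at once
# (identical to averaging the unrounded per-fold means when folds are the same size).
pooled = round(sum(fold_map) / len(fold_map), 4)

print(reported, pooled)  # 0.2965 0.2966 -- a 0.0001 gap caused purely by rounding
```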


## History

+ commit [407f308](https://github.com/castorini/Anserini/commit/407f308cc543286e39701caf0acd1afab39dde2c) (2019/1/2) - Added results for axiomatic semantic term matching.
+ commit [e71df7a](https://github.com/castorini/Anserini/commit/e71df7aee42c7776a63b9845600a4075632fa11c) (2018/12/18) - Upgrade to Lucene 7.6.
+ commit [18c3211](https://github.com/castorini/Anserini/commit/18c3211117f35f72cbc1019c125ff885f51056ea) (2018/12/9) - minor fixes.
+ commit [2c8cd7a](https://github.com/castorini/Anserini/commit/2c8cd7a550faca0fc450e4159a4a874d4795ac25) (2018/11/16) - commit id referenced in SIGIR Forum article.
The following documents commits that have altered effectiveness figures:


+ commit [`xxxxxxx`](https://github.com/castorini/anserini/commit/xxxxxxx) (x/xx/xxxx) - Regression experiments here fixed.
+ commit [`75e36f9`](https://github.com/castorini/anserini/commit/75e36f97f7037d1ceb20fa9c91582eac5e974131) (6/12/2019) - Upgrade to Lucene 8.0 breaks regression experiments here.
+ commit [`407f308`](https://github.com/castorini/Anserini/commit/407f308cc543286e39701caf0acd1afab39dde2c) (1/2/2019) - Added results for axiomatic semantic term matching.
+ commit [`e71df7a`](https://github.com/castorini/Anserini/commit/e71df7aee42c7776a63b9845600a4075632fa11c) (12/18/2018) - Upgrade to Lucene 7.6.
+ commit [`2c8cd7a`](https://github.com/castorini/Anserini/commit/2c8cd7a550faca0fc450e4159a4a874d4795ac25) (11/16/2018) - commit id referenced in SIGIR Forum article.


12 changes: 4 additions & 8 deletions src/main/python/fine_tuning/effectiveness.py
@@ -48,11 +48,8 @@ def gen_output_effectiveness_params(self, output_root):
all_results = {}
for metric_dir in os.listdir(os.path.join(output_root, self.eval_files_root)):
for fn in os.listdir(os.path.join(output_root, self.eval_files_root, metric_dir)):
if len(fn.split('_')) == 3:
basemodel, model, model_params = fn.split('_')
elif len(fn.split('_')) == 2:
basemodel, model = fn.split('_')
output_fn = basemodel+'_'+model+'_'+metric_dir
model, model_params = fn.split('_')
output_fn = model+'_'+metric_dir
if not os.path.exists(output_fn):
if output_fn not in all_results:
all_results[output_fn] = []
@@ -85,7 +82,7 @@ def read_eval_file(self, fn):
return {qid: {metric: [(value, para), ...]}}
"""
split_fn = os.path.basename(fn).split('_')
params = split_fn[-1] if len(split_fn) == 3 else ''
params = split_fn[1]
res = {}
with open(fn) as _in:
for line in _in:
@@ -133,15 +130,14 @@ def load_optimal_effectiveness(self, output_root):
per_topic_oracle = {} # per topic optimal across all kinds of methods
effectiveness_root = os.path.join(output_root, self.effectiveness_root)
for fn in os.listdir(effectiveness_root):
basemodel, model, metric = fn.split('_')
model, metric = fn.split('_')
with open(os.path.join(effectiveness_root, fn)) as f:
for real_metric, all_performance in json.load(f).items():
if real_metric not in per_topic_oracle:
per_topic_oracle[real_metric] = {}
all_optimal = self.add_up_all_optimal(all_performance, per_topic_oracle[real_metric])
res = {
'model': model,
'basemodel': basemodel,
'metric': real_metric,
'best_avg': all_performance['all'],
'oracles_per_topic': all_optimal[0]
8 changes: 4 additions & 4 deletions src/main/python/fine_tuning/evaluation.py
@@ -62,13 +62,13 @@ def output_all_evaluations(self, qrel_programs, qrel_file_path, result_file_path
if process.returncode == 0:
try:
if i == 0:
o = open( output_path, 'w')
o = open(output_path, 'w')
else:
o = open( output_path, 'a')
o = open(output_path, 'a')
if 'trec_eval' in qrel_program:
o.write(stdout)
o.write(stdout.decode("utf-8"))
elif 'gdeval' in qrel_program:
for line in stdout.split('\n')[1:-1]:
for line in stdout.decode("utf-8").split('\n')[1:-1]:
line = line.strip()
if line:
row = line.split(',')
7 changes: 4 additions & 3 deletions src/main/python/fine_tuning/reconstruct_robus04_tuned_run.py
@@ -30,6 +30,7 @@
parser.add_argument("--index", type=str, help='index', required=True)
parser.add_argument("--folds", type=str, help='folds file', required=True)
parser.add_argument("--params", type=str, help='params file', required=True)
parser.add_argument("--output", type=str, help='output run file', required=True)

args = parser.parse_args()
index = args.index
@@ -70,11 +71,11 @@
folds_run_files = []
for i in range(len(folds)):
os.system(f'target/appassembler/bin/SearchCollection -topicreader Trec -index {index} '
f'-topics topics.robust04.fold{i} -output run.robust04.bm25+rm3.fold{i}.txt -hits 1000 {params[i]}')
folds_run_files.append(f'run.robust04.bm25+rm3.fold{i}.txt')
f'-topics topics.robust04.fold{i} -output {args.output}.fold{i} -hits 1000 {params[i]}')
folds_run_files.append(f'{args.output}.fold{i}')

# Concatenate all partial run files together.
with open('run.robust04.bm25+rm3.txt', 'w') as outfile:
with open(args.output, 'w') as outfile:
for fname in folds_run_files:
with open(fname) as infile:
outfile.write(infile.read())