Fixed "neural hype" tuning experiments after Lucene 8.0 upgrade (cast…

+ retuned retrieval models for Lucene 8.0

Refactored tuning script:
+ made command-line parameters more consistent
+ broke fold settings into external config files for greater generality
+ removed unintuitive distinction between "model" and "basemodel": there's just "model" now.
lintool authored Jul 3, 2019
1 parent 42047ca commit 64bae9c
Showing 34 changed files with 298 additions and 94,923 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -77,12 +77,12 @@ Note that these regressions capture the "out of the box" experience, based on [_

Other experiments:

+ [Replicating "Neural Hype" Experiments](docs/experiments-forum2018.md)
+ [Guide to running BM25 baselines on the MS MARCO Passage Task](docs/experiments-msmarco-passage.md)
+ [Guide to running BM25 baselines on the MS MARCO Document Task](docs/experiments-msmarco-doc.md)
+ [Guide to replicating document expansion by query prediction (Doc2query) results](docs/experiments-doc2query.md)
+ [Guide to running experiments on the AI2 Open Research Corpus](docs/experiments-openresearch.md)
+ [Experiments from Yang et al. (JDIQ 2018)](docs/experiments-jdiq2018.md)
+ [Experiments from Lin (SIGIR Forum 2018)](docs/experiments-forum2018.md)
+ Runbooks for TREC 2018: [[Anserini group](docs/runbook-trec2018-anserini.md)] [[h2oloo group](docs/runbook-trec2018-h2oloo.md)]
+ Runbook for [ECIR 2019 paper on axiomatic semantic term matching](docs/runbook-ecir2019-axiomatic.md)
+ Runbook for [ECIR 2019 paper on cross-collection relevance feedback](docs/runbook-ecir2019-ccrf.md)
101 changes: 73 additions & 28 deletions docs/experiments-forum2018.md
@@ -1,74 +1,119 @@
# Anserini: SIGIR Forum 2018 Experiments
# Anserini: "Neural Hype" Baseline Experiments

This page documents code for replicating results from the following article:
This page provides documentation for replicating results from two "neural hype" papers, which questioned whether neural ranking models actually represent improvements in _ad hoc_ retrieval effectiveness over well-tuned "competitive baselines" in limited data scenarios:

+ Jimmy Lin. [The Neural Hype and Comparisons Against Weak Baselines.](http://sigir.org/wp-content/uploads/2019/01/p040.pdf) SIGIR Forum, 52(2):40-51, 2018.
+ Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. [Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models.](https://cs.uwaterloo.ca/~jimmylin/publications/Yang_etal_SIGIR2019.pdf) _SIGIR 2019_.

Note that the commit [2c8cd7a](https://github.com/castorini/Anserini/commit/2c8cd7a550faca0fc450e4159a4a874d4795ac25) referenced in the article is out of date with respect to the latest experimental results.
See "History" section below.
The "competitive baseline" referenced in the two above papers is BM25+RM3, with proper parameter tuning, on the test collection from the TREC 2004 Robust Track (Robust04).
Scripts referenced on this page encode automated regressions that allow users to recreate and verify the results reported below.

**Requirements**: Python>=2.6 or Python>=3.5 `pip install -r src/main/python/requirements.txt`
The SIGIR Forum article references commit [`2c8cd7a`](https://github.com/castorini/Anserini/commit/2c8cd7a550faca0fc450e4159a4a874d4795ac25) (11/16/2018), the results of which changed slightly with an upgrade to Lucene 7.6 at commit [`e71df7a`](https://github.com/castorini/Anserini/commit/e71df7aee42c7776a63b9845600a4075632fa11c) (12/18/2018).
The SIGIR 2019 paper contains experiments performed after this upgrade.

Folds:
The Anserini upgrade to Lucene 8.0 at commit [`75e36f9`](https://github.com/castorini/anserini/commit/75e36f97f7037d1ceb20fa9c91582eac5e974131) (6/12/2019) broke the regression tests; this was later fixed at commit [`xxxxxxx`](https://github.com/castorini/anserini/commit/xxxxxxx) (x/x/xxx).
This commit represents the latest state of the code and the results that can currently be replicated.
See the summary in the "History" section below.


## Expected Results

Retrieval models are tuned with respect to following fold definitions:

+ [Folds for 2-fold cross-validation used in "paper 1"](../src/main/resources/fine_tuning/robust04-paper1-folds.json)
+ [Folds for 5-fold cross-validation used in "paper 2"](../src/main/resources/fine_tuning/robust04-paper2-folds.json)

Here are expected results for various retrieval models:

AP | Paper 1 | Paper 2 |
:------------------|---------|---------|
BM25 (default) | 0.2531 | 0.2531 |
BM25 (tuned) | 0.2539 | 0.2531 |
QL (default) | 0.2467 | 0.2467 |
QL (tuned) | 0.2520 | 0.2499 |
BM25+RM3 (default) | 0.2903 | 0.2903 |
BM25+RM3 (tuned) | 0.3043 | 0.3021 |
BM25+Ax (default) | 0.2896 | 0.2896 |
BM25+Ax (tuned) | 0.2940 | 0.2950 |


## Parameter Tuning

First, change the index path at `src/main/resources/fine_tuning/collections.yaml`.
The script will go through the `index_roots` and concatenate with the collection's `index_path` and take the first match as the index path.
Before starting, modify the index path at `src/main/resources/fine_tuning/collections.yaml`.
The tuning script will go through the `index_roots`, concatenate with the collection's `index_path`, and take the first match as the location of the index.
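
As a rough sketch of that resolution logic (the exact schema of `collections.yaml` is not reproduced here, so the field names and the `resolve_index_path` helper below are assumptions for illustration):

```
import os
import yaml  # requires PyYAML

def resolve_index_path(config_file, collection):
    """Return the first <root>/<index_path> combination that exists on disk.

    The config layout assumed here: a top-level `index_roots` list and a
    per-collection `index_path` entry.
    """
    with open(config_file) as f:
        config = yaml.safe_load(f)
    index_path = config['collections'][collection]['index_path']
    for root in config['index_roots']:
        candidate = os.path.join(root, index_path)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(f'no index found for collection {collection}')
```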

BM25 Robust04 (runs + eval + print results):
Tuning BM25:

```
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25 --n 44 --run --use_drr_fold
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25 --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```

QL Robust04 (runs + eval + print results):
The first command runs the parameter sweeps and prints general statistics.
The second and third commands use a specific fold setting to perform cross-validation and print out model parameters.
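
For intuition, here is a minimal sketch of the cross-validation step, assuming the folds file is a JSON list of topic-ID lists and that per-topic AP scores from the parameter sweep are already available (the `sweep_results` layout is hypothetical):

```
import json

def cross_validate_map(folds_file, sweep_results):
    """sweep_results: {param_setting: {topic_id: AP}} -- assumed output of the sweep above."""
    with open(folds_file) as f:
        folds = json.load(f)  # assumed: list of lists of topic IDs

    total_ap, num_topics = 0.0, 0
    for i, test_fold in enumerate(folds):
        # Train on all other folds: pick the setting with the best mean AP there.
        train_topics = [t for j, fold in enumerate(folds) if j != i for t in fold]
        best = max(sweep_results,
                   key=lambda p: sum(sweep_results[p][t] for t in train_topics) / len(train_topics))
        # Score the held-out fold with that setting.
        total_ap += sum(sweep_results[best][t] for t in test_fold)
        num_topics += len(test_fold)
    return total_ap / num_topics  # cross-validated MAP
```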

Tuning QL (commands similarly organized):

```
python src/main/python/fine_tuning/run_batch.py --collection robust04 --basemodel ql --model ql --n 44 --run --use_drr_fold
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model ql --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model ql --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model ql --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```

BM25+RM3 Robust04 (runs + eval + print results):
Tuning BM25+RM3 (commands similarly organized):

```
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+rm3 --n 44 --run --use_drr_fold
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+rm3 --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+rm3 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+rm3 --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```

BM25+AxiomaticReranking Robust04 (runs + eval + print results):
Tuning BM25+Ax (commands similarly organized):

```
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+axiom --n 44 --run --use_drr_fold
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+axiom --threads 18 --run
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+axiom --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper1-folds.json --verbose
python src/main/python/fine_tuning/run_batch.py --collection robust04 --model bm25+axiom --threads 18 --run --fold_settings src/main/resources/fine_tuning/robust04-paper2-folds.json --verbose
```

## Tuned Run

Tuned parameter values:
## Tuned Runs

Tuned parameter values for BM25+RM3:

+ [For the 2-fold cross-validation used in "paper 1", in terms of MAP](../src/main/resources/fine_tuning/robust04-paper1-folds-map-params.json)
+ [For tor 5-fold cross-validation used in "paper 2", in terms of MAP](../src/main/resources/fine_tuning/robust04-paper2-folds-map-params.json)
+ [For the 2-fold cross-validation used in "paper 1", in terms of MAP](../src/main/resources/fine_tuning/params/params.map.robust04-paper1-folds.bm25+rm3.json)
+ [For the 5-fold cross-validation used in "paper 2", in terms of MAP](../src/main/resources/fine_tuning/params/params.map.robust04-paper2-folds.bm25+rm3.json)

To be clear, these are the tuned parameters on _that_ fold, trained on the remaining folds.

The follow script will reconstruct the tuned runs for BM25 + RM3:
The following script will reconstruct the tuned runs for BM25+RM3:

```
python src/main/python/fine_tuning/reconstruct_robus04_tuned_run.py \
--index lucene-index.robust04.pos+docvectors+rawdocs \
--folds src/main/resources/fine_tuning/robust04-paper2-folds.json \
--params src/main/resources/fine_tuning/robust04-paper2-folds-map-params.json
--folds src/main/resources/fine_tuning/robust04-paper1-folds.json \
--params src/main/resources/fine_tuning/params/params.map.robust04-paper1-folds.bm25+rm3.json \
--output run.robust04.bm25+rm3.paper1.txt
```

Change `paper2` to `paper1` to reconstruct using the folds in paper 1.
Change `paper1` to `paper2` to reconstruct using the folds in paper 2.

To reconstruct runs from other retrieval models, use the parameter definitions in [`src/main/resources/fine_tuning/params/`](../src/main/resources/fine_tuning/params/), plugging them into the above command as appropriate.

Note that applying `trec_eval` to these reconstructed runs might yield AP values slightly different from those reported above (by at most 0.0001).
This difference arises from rounding when averaging across the folds.
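
A toy illustration of the rounding effect (the per-fold AP values are made up, and equal-size folds are assumed for simplicity):

```
# Made-up per-fold AP means from a hypothetical 5-fold cross-validation run.
fold_map = [0.295331, 0.297131, 0.298931, 0.294731, 0.296631]

# Figure as reported in the table: average of per-fold values, each rounded to 4 decimals.
reported = round(sum(round(m, 4) for m in fold_map) / len(fold_map), 4)

# Figure trec_eval reports on the concatenated run: average over all topics at once
# (identical to averaging the unrounded per-fold means when folds are the same size).
pooled = round(sum(fold_map) / len(fold_map), 4)

print(reported, pooled)  # 0.2965 0.2966 -- a 0.0001 gap caused purely by rounding
```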


## History

+ commit [407f308](https://github.com/castorini/Anserini/commit/407f308cc543286e39701caf0acd1afab39dde2c) (2019/1/2) - Added results for axiomatic semantic term matching.
+ commit [e71df7a](https://github.com/castorini/Anserini/commit/e71df7aee42c7776a63b9845600a4075632fa11c) (2018/12/18) - Upgrade to Lucene 7.6.
+ commit [18c3211](https://github.com/castorini/Anserini/commit/18c3211117f35f72cbc1019c125ff885f51056ea) (2018/12/9) - minor fixes.
+ commit [2c8cd7a](https://github.com/castorini/Anserini/commit/2c8cd7a550faca0fc450e4159a4a874d4795ac25) (2018/11/16) - commit id referenced in SIGIR Forum article.
The following documents commits that have altered effectiveness figures:


+ commit [`xxxxxxx`](https://github.com/castorini/anserini/commit/xxxxxxx) (x/xx/xxxx) - Regression experiments here fixed.
+ commit [`75e36f9`](https://github.com/castorini/anserini/commit/75e36f97f7037d1ceb20fa9c91582eac5e974131) (6/12/2019) - Upgrade to Lucene 8.0 breaks regression experiments here.
+ commit [`407f308`](https://github.com/castorini/Anserini/commit/407f308cc543286e39701caf0acd1afab39dde2c) (1/2/2019) - Added results for axiomatic semantic term matching.
+ commit [`e71df7a`](https://github.com/castorini/Anserini/commit/e71df7aee42c7776a63b9845600a4075632fa11c) (12/18/2018) - Upgrade to Lucene 7.6.
+ commit [`2c8cd7a`](https://github.com/castorini/Anserini/commit/2c8cd7a550faca0fc450e4159a4a874d4795ac25) (11/16/2018) - commit id referenced in SIGIR Forum article.


12 changes: 4 additions & 8 deletions src/main/python/fine_tuning/effectiveness.py
@@ -48,11 +48,8 @@ def gen_output_effectiveness_params(self, output_root):
all_results = {}
for metric_dir in os.listdir(os.path.join(output_root, self.eval_files_root)):
for fn in os.listdir(os.path.join(output_root, self.eval_files_root, metric_dir)):
if len(fn.split('_')) == 3:
basemodel, model, model_params = fn.split('_')
elif len(fn.split('_')) == 2:
basemodel, model = fn.split('_')
output_fn = basemodel+'_'+model+'_'+metric_dir
model, model_params = fn.split('_')
output_fn = model+'_'+metric_dir
if not os.path.exists(output_fn):
if output_fn not in all_results:
all_results[output_fn] = []
@@ -85,7 +82,7 @@ def read_eval_file(self, fn):
return {qid: {metric: [(value, para), ...]}}
"""
split_fn = os.path.basename(fn).split('_')
params = split_fn[-1] if len(split_fn) == 3 else ''
params = split_fn[1]
res = {}
with open(fn) as _in:
for line in _in:
@@ -133,15 +130,14 @@ def load_optimal_effectiveness(self, output_root):
per_topic_oracle = {} # per topic optimal across all kinds of methods
effectiveness_root = os.path.join(output_root, self.effectiveness_root)
for fn in os.listdir(effectiveness_root):
basemodel, model, metric = fn.split('_')
model, metric = fn.split('_')
with open(os.path.join(effectiveness_root, fn)) as f:
for real_metric, all_performance in json.load(f).items():
if real_metric not in per_topic_oracle:
per_topic_oracle[real_metric] = {}
all_optimal = self.add_up_all_optimal(all_performance, per_topic_oracle[real_metric])
res = {
'model': model,
'basemodel': basemodel,
'metric': real_metric,
'best_avg': all_performance['all'],
'oracles_per_topic': all_optimal[0]
8 changes: 4 additions & 4 deletions src/main/python/fine_tuning/evaluation.py
@@ -62,13 +62,13 @@ def output_all_evaluations(self, qrel_programs, qrel_file_path, result_file_path
if process.returncode == 0:
try:
if i == 0:
o = open( output_path, 'w')
o = open(output_path, 'w')
else:
o = open( output_path, 'a')
o = open(output_path, 'a')
if 'trec_eval' in qrel_program:
o.write(stdout)
o.write(stdout.decode("utf-8"))
elif 'gdeval' in qrel_program:
for line in stdout.split('\n')[1:-1]:
for line in stdout.decode("utf-8").split('\n')[1:-1]:
line = line.strip()
if line:
row = line.split(',')
7 changes: 4 additions & 3 deletions src/main/python/fine_tuning/reconstruct_robus04_tuned_run.py
@@ -30,6 +30,7 @@
parser.add_argument("--index", type=str, help='index', required=True)
parser.add_argument("--folds", type=str, help='folds file', required=True)
parser.add_argument("--params", type=str, help='params file', required=True)
parser.add_argument("--output", type=str, help='output run file', required=True)

args = parser.parse_args()
index = args.index
@@ -70,11 +71,11 @@
folds_run_files = []
for i in range(len(folds)):
os.system(f'target/appassembler/bin/SearchCollection -topicreader Trec -index {index} '
f'-topics topics.robust04.fold{i} -output run.robust04.bm25+rm3.fold{i}.txt -hits 1000 {params[i]}')
folds_run_files.append(f'run.robust04.bm25+rm3.fold{i}.txt')
f'-topics topics.robust04.fold{i} -output {args.output}.fold{i} -hits 1000 {params[i]}')
folds_run_files.append(f'{args.output}.fold{i}')

# Concatenate all partial run files together.
with open('run.robust04.bm25+rm3.txt', 'w') as outfile:
with open(args.output, 'w') as outfile:
for fname in folds_run_files:
with open(fname) as infile:
outfile.write(infile.read())