A Lucene search engine for the document collection from the 8th Text REtrieval Conference (TREC-8), consisting of documents from the Los Angeles Times (latimes), the Foreign Broadcast Information Service (fbis), the Federal Register 1994 (fr94) and the Financial Times (ft) collections.
The run_search_engine.sh script will run the search engine and output results in ./output/multi-custom_A-results.txt.
It will also generate the trec-eval results file, which can be found in ./output/trec-eval-multi-custom_A-results.txt.
Arguments required:
- $1 = qrels file
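For example (the qrels path below is only a placeholder; substitute the location of your qrels file):

    ./run_search_engine.sh /path/to/qrels_file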
Note: it is also possible to experiment with the other analyzers and scoring models that were explored during the project, using the run_custom_search_engine.sh script.
These can be seen by running 'run_custom_search_engine.sh' without specifying any parameters, which will display the usage information.
The search engine was implemented using the Lucene 7.2.1 API in Java.
A combination of different scoring models was used, bundled into Lucene's MultiSimilarity type; this includes the LMJelinekMercerSimilarity and LMDirichletSimilarity models.
A custom analyzer, com.lucene_in_the_sky_with_diamonds.analysis.CustomAnalyzer, was used as the analyzer for text processing operations. It comprises a number of filters (a sketch of the chain is given after the list):
- StandardFilter;
- LowerCaseFilter;
- StopFilter, using the StandardAnalyzer's default English stop-word set (StandardAnalyzer.STOP_WORDS_SET);
- SnowballFilter, with an EnglishStemmer;
- PorterStemFilter;
- TrimFilter.
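As a rough illustration, such a filter chain could be assembled as follows against the Lucene 7.2.1 API; this is a minimal sketch, and the project's actual CustomAnalyzer may differ in detail:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.miscellaneous.TrimFilter;
    import org.apache.lucene.analysis.snowball.SnowballFilter;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.tartarus.snowball.ext.EnglishStemmer;

    public class CustomAnalyzerSketch extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new StandardFilter(source);           // standard token normalisation
        stream = new LowerCaseFilter(stream);                      // lower-case all tokens
        // Default English stop-word set shipped with StandardAnalyzer
        stream = new StopFilter(stream, StandardAnalyzer.STOP_WORDS_SET);
        stream = new SnowballFilter(stream, new EnglishStemmer()); // Snowball (English) stemming
        stream = new PorterStemFilter(stream);                     // Porter stemming
        stream = new TrimFilter(stream);                           // trim surrounding whitespace
        return new TokenStreamComponents(source, stream);
      }
    }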
A number of other analyzers were experimented with but not used in the final search engine, including two other custom analyzers.
The optimal value of lambda for the LMJelinekMercerSimilarity was found to be 0.75. The default parameters were assumed for the LMDirichletSimilarity model.
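For illustration, the combined similarity with these parameter values can be constructed as follows in Lucene 7.2.1; the exact wiring in this project is not shown here, but the same Similarity instance must be applied at both index and search time:

    import org.apache.lucene.search.similarities.LMDirichletSimilarity;
    import org.apache.lucene.search.similarities.LMJelinekMercerSimilarity;
    import org.apache.lucene.search.similarities.MultiSimilarity;
    import org.apache.lucene.search.similarities.Similarity;

    // Jelinek-Mercer smoothing with the experimentally chosen lambda = 0.75;
    // Dirichlet smoothing with Lucene's default parameter (mu = 2000).
    Similarity similarity = new MultiSimilarity(new Similarity[] {
        new LMJelinekMercerSimilarity(0.75f),
        new LMDirichletSimilarity()
    });

    // Applied consistently at index and search time:
    // indexWriterConfig.setSimilarity(similarity);
    // indexSearcher.setSimilarity(similarity);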
A number of fields were common across all of the documents, in particular the 'Headline' and 'Text' fields. These two fields were used to deliver the most performant search engine for the datasets.
Date fields were also explored as part of the project; however, they degraded the overall performance of the search engine and so were omitted from the final implementation.
Field boosts were used on the 'Text' and 'Headline' fields of the documents, with a weighting of 0.2 on the Headline and 0.8 on the Text.
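Since index-time field boosts were removed in Lucene 7, per-field weights like these are normally applied at query time. A minimal sketch using MultiFieldQueryParser is given below; whether the project applied the boosts this way or via explicit BoostQuery wrapping is an assumption:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.search.Query;

    // Build a query over both fields, weighting matches in 'Text' higher than 'Headline'.
    static Query buildBoostedQuery(Analyzer analyzer, String queryText) throws ParseException {
      Map<String, Float> boosts = new HashMap<>();
      boosts.put("Headline", 0.2f);
      boosts.put("Text", 0.8f);
      MultiFieldQueryParser parser = new MultiFieldQueryParser(
          new String[] {"Headline", "Text"}, analyzer, boosts);
      return parser.parse(queryText);
    }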
The queries were constructed from a topics file, using the 'Title', 'Description' and 'Narrative' fields of each topic.
The 'Narrative' field in particular was parsed based on the occurrence of marker phrases: e.g. 'will discuss' and 'must cite' for relevant documents, and 'is not relevant', 'are irrelevant', etc. for irrelevant documents.
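A minimal sketch of this heuristic is shown below; the phrase lists and the naive sentence split are illustrative assumptions rather than the project's exact implementation:

    import java.util.List;

    // Partition the sentences of a topic's Narrative field into those describing
    // relevant documents and those describing irrelevant ones, using marker phrases.
    static void partitionNarrative(String narrative, List<String> relevant, List<String> irrelevant) {
      String[] negativeMarkers = {"is not relevant", "are irrelevant", "not relevant"}; // assumed list
      for (String sentence : narrative.split("(?<=[.!?])\\s+")) { // naive sentence split
        String lower = sentence.toLowerCase();
        boolean negative = false;
        for (String marker : negativeMarkers) {
          if (lower.contains(marker)) {
            negative = true;
            break;
          }
        }
        // Sentences with a negative marker describe irrelevant material; everything else
        // (e.g. sentences containing 'will discuss' or 'must cite') is kept as relevant.
        (negative ? irrelevant : relevant).add(sentence);
      }
    }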
Query expansion was used to improve the search engine results. This expansion was performed only on the 'Text' field of the documents.
Based on experimentation, the optimal number of feedback documents for query expansion was found to be 4.
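This matches a standard pseudo-relevance-feedback loop. A minimal sketch is given below, assuming term vectors were indexed on the 'Text' field and that the most frequent feedback terms are OR'd back into the original query; the project's actual term-selection criteria are not stated, so those details are assumptions:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.util.BytesRef;

    // Expand the original query with frequent terms drawn from the top 4 results
    // (4 feedback documents, per the experimentation described above).
    static Query expandQuery(IndexSearcher searcher, Query original) throws IOException {
      TopDocs feedback = searcher.search(original, 4);
      Map<String, Long> termFreqs = new HashMap<>();
      for (ScoreDoc sd : feedback.scoreDocs) {
        // Requires term vectors to have been stored for the 'Text' field at index time.
        Terms vector = searcher.getIndexReader().getTermVector(sd.doc, "Text");
        if (vector == null) continue;
        TermsEnum terms = vector.iterator();
        BytesRef term;
        while ((term = terms.next()) != null) {
          // For a term vector, totalTermFreq() is the within-document frequency.
          termFreqs.merge(term.utf8ToString(), terms.totalTermFreq(), Long::sum);
        }
      }
      BooleanQuery.Builder expanded = new BooleanQuery.Builder();
      expanded.add(original, BooleanClause.Occur.SHOULD);
      termFreqs.entrySet().stream()
          .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
          .limit(10) // assumed number of expansion terms
          .forEach(e -> expanded.add(
              new TermQuery(new Term("Text", e.getKey())), BooleanClause.Occur.SHOULD));
      return expanded.build();
    }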