Author: Michael McCandless / Erik Hatcher / Otis Gospodnetic
Publisher: Manning Publications
Subtitle: Covers Apache Lucene 3.0
Publication date: 2010-7-28
Pages: 475
Price: USD 49.99
Binding: Paperback
ISBN: 9781933988177
- Meet Lucene
- Building a search index
- Index File Formats
- Adding search to your application
- Implementing a simple search feature
- Using IndexSearcher
- Understanding Lucene scoring
- Lucene's diverse queries
- Searching by term: TermQuery
- Searching within a term range: TermRangeQuery
- Searching within a numeric range: NumericRangeQuery
- Searching on a string: PrefixQuery
- Combining queries: BooleanQuery
- Searching by phrase: PhraseQuery
- Searching by wildcard: WildcardQuery
- Searching for similar terms: FuzzyQuery
- Matching all documents: MatchAllDocsQuery
- Parsing query expressions: QueryParser
- Lucene’s analysis process
Lucene is a high-performance, scalable information retrieval (IR) library. IR refers to the process of searching for documents, information within documents, or metadata about documents.
Figure 1.4. Typical components of a search application; the shaded components show which parts Lucene handles.
- Acquire Content: This step, which may involve using a crawler or spider, gathers and scopes the content that needs to be indexed.
- Build Document: Once you have the raw content that needs to be indexed, you must translate the content into the units (usually called documents) used by the search engine. The document typically consists of several separately named fields with values, such as title, body, abstract, author, and url.
- Analyze Document: No search engine indexes text directly; rather, the text must be broken into a series of individual atomic elements called tokens.
- Index Document: During the indexing step, the document is added to the index.
- Search User Interface: The user interface is what users actually see, in the web browser, desktop application, or mobile device, when they interact with your search application.
- Build Query: You must then translate the request into the search engine's Query object; we call this the Build Query step. Lucene provides a powerful package, called QueryParser, to process the user's text into a query object according to a common search syntax.
- Search Query: This is the process of consulting the search index and retrieving the documents matching the Query, sorted in the requested sort order.
There are three common theoretical models of search:
- Pure Boolean model — Documents either match or don't match the provided query, and no scoring is done. In this model there are no relevance scores associated with matching documents, and the matching documents are unordered; a query simply identifies a subset of the overall corpus as matching the query.
- Vector space model — Both queries and documents are modeled as vectors in a high-dimensional space, where each unique term is a dimension. Relevance, or similarity, between a query and a document is computed by a vector distance measure between these vectors.
- Probabilistic model — In this model, you compute the probability that a document is a good match to a query using a full probabilistic approach.
- Render Results: Once you have the raw set of documents that match the query, sorted in the right order, you then render them to the user in an intuitive, consumable manner.
- Administration Interface: tune the size of the RAM buffer, how many segments to merge at once, how often to commit changes, or when to optimize and purge deletes from the index.
- Analytics Interface: Lucene-specific metrics that could feed the analytics interface include:
- How often which kinds of queries (single term, phrase, Boolean queries, etc.) are run
- Queries that hit low relevance
- Queries where the user didn’t click on any results (if your application tracks click-throughs)
- How often users are sorting by specified fields instead of relevance
- The breakdown of Lucene’s search time
- Scaling: There are two dimensions to scaling: net amount of content and net query throughput.
Listing 1.1 shows the Indexer command-line program, originally written for Erik’s introductory Lucene article on java.net. It takes two arguments:
- A path to a directory where we store the Lucene index
- A path to a directory that contains the files we want to index
Listing 1.1. Indexer, which indexes .txt files
Go ahead and type ant Indexer, and you should see output like this:
% ant Indexer
Index *.txt files in a directory into a Lucene index.
Use the Searcher target to search this index.
Indexer is covered in the "Meet Lucene" chapter.
Press return to continue...
Directory for new Lucene index: [indexes/MeetLucene]
Directory with .txt files to index: [src/lia/meetlucene/data]
Overwrite indexes/MeetLucene? (y, n) y
Running lia.meetlucene.Indexer...
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/apache1.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/apache1.1.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/apache2.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/cpl1.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/epl1.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/freebsd.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/gpl1.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/gpl2.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/gpl3.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/lgpl2.1.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/lgpl3.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/lpgl2.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/mit.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/mozilla1.1.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/mozilla_eula_firefox3.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/mozilla_eula_thunderbird2.txt
Indexing 16 files took 757 milliseconds
BUILD SUCCESSFUL
The Searcher program, originally written for Erik’s introductory Lucene article on java.net, complements Indexer and provides command-line searching capability. Listing 1.2 shows Searcher in its entirety. It takes two command-line arguments:
- The path to the index created with Indexer
- A query to use to search the index
Listing 1.2. Searcher, which searches a Lucene index
Let’s run Searcher and find documents in our index using the query ‘patent’:
% ant Searcher
Search an index built using Indexer.
Searcher is described in the "Meet Lucene" chapter.
Press return to continue...
Directory of existing Lucene index built by
Indexer: [indexes/MeetLucene]
Query: [patent]
Running lia.meetlucene.Searcher...
Found 8 document(s) (in 11 milliseconds) that
matched query 'patent':
/Users/mike/lia2e/src/lia/meetlucene/data/cpl1.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/mozilla1.1.txt
/Users/mike/lia2e/src/lia/meetlucene/data/epl1.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/gpl3.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/apache2.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/lpgl2.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/gpl2.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/lgpl2.1.txt
BUILD SUCCESSFUL
Total time: 4 seconds
As you saw in our Indexer class, you need the following classes to perform the simplest indexing procedure:
- IndexWriter
- Directory
- Analyzer
- Document
- Field
Figure 1.5. Classes used when indexing documents with Lucene
IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn’t let you read or search it. IndexWriter needs somewhere to store its index, and that’s what Directory is for.
The Directory class represents the location of a Lucene index. It’s an abstract class that allows its subclasses to store the index as they see fit. In our Indexer example, we used FSDirectory.open to get a suitable concrete FSDirectory implementation that stores real files in a directory on the file system, and passed that in turn to IndexWriter’s constructor.
Before text is indexed, it’s passed through an analyzer. The analyzer, specified in the IndexWriter constructor, is in charge of extracting those tokens out of text that should be indexed and eliminating the rest. Analyzer is an abstract class, but Lucene comes with several implementations of it.
- Some of them deal with skipping stop words (frequently used words that don’t help distinguish one document from the other, such as a, an, the, in, and on);
- some deal with conversion of tokens to lowercase letters, so that searches aren’t case sensitive;
- and so on.
The Document class represents a collection of fields. Think of it as a virtual document—a chunk of data, such as a web page, an email message, or a text file—that you want to make retrievable at a later time.
Each document in an index contains one or more named fields, embodied in a class called Field. Each field has a name and corresponding value, and a bunch of options, described in section 2.4, that control precisely how Lucene will index the field’s value.
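Putting these five classes together, indexing a single document might look like the following minimal sketch (the path and field names are hypothetical, and error handling is omitted):
Directory dir = FSDirectory.open(new File("/tmp/index"));  // hypothetical index location
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
IndexWriter writer = new IndexWriter(dir, analyzer, true,  // create=true starts a new index
                                     IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("contents", "Lucene in Action", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();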
The basic search interface that Lucene provides is as straightforward as the one for indexing. Only a few classes are needed to perform the basic search operation:
- IndexSearcher
- Term
- Query
- TermQuery
- TopDocs
IndexSearcher is to searching what IndexWriter is to indexing: the central link to the index that exposes several search methods. You can think of IndexSearcher as a class that opens an index in a read-only mode. It requires a Directory instance, holding the previously created index, and then offers a number of search methods, some of which are implemented in its abstract parent class Searcher; the simplest takes a Query object and an int topN count as parameters and returns a TopDocs object. A typical use of this method looks like this:
Directory dir = FSDirectory.open(new File("/tmp/index"));
IndexSearcher searcher = new IndexSearcher(dir);
Query q = new TermQuery(new Term("contents", "lucene"));
TopDocs hits = searcher.search(q, 10);
searcher.close();
A Term is the basic unit for searching. Similar to the Field object, it consists of a pair of string elements: the name of the field and the word (text value) of that field.
During searching, you may construct Term objects and use them together with TermQuery:
Query q = new TermQuery(new Term("contents", "lucene"));
TopDocs hits = searcher.search(q, 10);
Lucene comes with a number of concrete Query subclasses. So far in this chapter we’ve mentioned only the most basic Lucene Query: TermQuery. Other Query types are BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, TermRangeQuery, NumericRangeQuery, FilteredQuery, and SpanQuery.
TermQuery is the most basic type of query supported by Lucene, and it’s one of the primitive query types. It’s used for matching documents that contain fields with specific values, as you’ve seen in the last few paragraphs.
The TopDocs class is a simple container of pointers to the top N ranked search results—documents that match a given query. For each of the top N results, TopDocs records the int docID (which you can use to retrieve the document) as well as the float score.
At a high level, there are three things Lucene can do with each field:
- The value may be indexed (or not).
- If it’s indexed, the field may also optionally store term vectors, which are collectively a miniature inverted index for that one field, allowing you to retrieve all of its tokens.
- Separately, the field’s value may be stored, meaning a verbatim copy of the unanalyzed value is written away in the index so that it can later be retrieved. This is useful for fields you’d like to present unchanged to the user, such as the document’s title or abstract.
Unlike a database, Lucene has no notion of a fixed global schema.
The second major difference between Lucene and databases is that Lucene requires you to flatten, or denormalize, your content when you index it.
Lucene documents are flat. Such recursion and joins must be denormalized when creating your documents.
Figure 2.1. Indexing with Lucene breaks down into three main operations: extracting text from source documents, analyzing it, and saving it to the index.
- Extracting text and creating the document
- Analysis
- Adding to the index
Figure 2.2. Segmented structure of a Lucene inverted index
There are two methods for adding documents:
- addDocument(Document)—Adds the document using the default analyzer, which you specified when creating the IndexWriter, for tokenization.
- addDocument(Document, Analyzer)—Adds the document using the provided analyzer for tokenization. But be careful! In order for searches to work correctly, you need the analyzer used at search time to “match” the tokens produced by the analyzers at indexing time.
Listing 2.1. Adding documents to an index
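The listing body isn't reproduced in these notes; the essential calls, assuming a writer and doc built as in the sketch above, boil down to:
writer.addDocument(doc);                            // tokenize with the writer's default analyzer
writer.addDocument(doc, new WhitespaceAnalyzer());  // or override the analyzer for this document only
writer.commit();                                    // make the added documents visible to new readers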
In the getWriter method, we create the IndexWriter with three arguments:
- Directory, where the index is stored.
- The analyzer to use when indexing tokenized fields.
- MaxFieldLength.UNLIMITED, a required argument that tells IndexWriter to index all tokens in the document.
public static int hitCount(IndexSearcher searcher, Query query)
throws IOException {
return searcher.search(query, 1).totalHits;
}
This method runs the search and returns the total number of hits that matched.
IndexWriter provides various methods to remove documents from an index:
- deleteDocuments(Term) deletes all documents containing the provided term.
- deleteDocuments(Term[]) deletes all documents containing any of the terms in the provided array.
- deleteDocuments(Query) deletes all documents matching the provided query.
- deleteDocuments(Query[]) deletes all documents matching any of the queries in the provided array.
- deleteAll() deletes all documents in the index. This is exactly the same as closing the writer and opening a new writer with create=true, without having to close your writer.
Listing 2.2. Deleting documents from an index
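The listing itself is omitted here; a minimal sketch of deleting by term, reusing the hypothetical id field from earlier, is:
writer.deleteDocuments(new Term("id", "42"));  // queue deletion of every document containing this term
writer.commit();                               // the deletion becomes visible to newly opened readers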
- maxDoc() returns the total number of documents in the index, whether deleted or not.
- numDocs() returns only the number of non-deleted documents.
IndexWriter provides two convenience methods to replace a document in the index:
- updateDocument(Term, Document) first deletes all documents containing the provided term and then adds the new document using the writer’s default analyzer.
- updateDocument(Term, Document, Analyzer) does the same but uses the provided analyzer instead of the writer’s default analyzer.
Listing 2.3. Updating indexed Documents
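A minimal sketch of the update pattern, with the same hypothetical id field:
Document newDoc = new Document();
newDoc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
newDoc.add(new Field("contents", "replacement text", Field.Store.YES, Field.Index.ANALYZED));
writer.updateDocument(new Term("id", "42"), newDoc);  // delete-by-term, then add, in one call
writer.commit();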
The options for indexing (Field.Index.*) control how the text in the field will be made searchable via the inverted index. Here are the choices:
- Index.ANALYZED — Use the analyzer to break the field’s value into a stream of separate tokens and make each token searchable. This option is useful for normal text fields (body, title, abstract, etc.).
- Index.NOT_ANALYZED — Do index the field, but don’t analyze the String value. Instead, treat the Field’s entire value as a single token and make that token searchable. This option is useful for fields that you’d like to search on but that shouldn’t be broken up, such as URLs, file system paths, dates, personal names, Social Security numbers, and telephone numbers. This option is especially useful for enabling “exact match” searching.
- Index.ANALYZED_NO_NORMS — A variant of Index.ANALYZED that doesn’t store norms information in the index. Norms record index-time boost information in the index but can be memory consuming when you’re searching.
- Index.NOT_ANALYZED_NO_NORMS — Just like Index.NOT_ANALYZED, but also doesn’t store norms. This option is frequently used to save index space and memory usage during searching, because single-token fields don’t need the norms information unless they’re boosted.
- Index.NO — Don’t make this field’s value available for searching.
The options for stored fields (Field.Store.*) determine whether the field’s exact value should be stored away so that you can later retrieve it during searching:
- Store.YES —Stores the value. When the value is stored, the original String in its entirety is recorded in the index and may be retrieved by an IndexReader. This option is useful for fields that you’d like to use when displaying the search results (such as a URL, title, or database primary key). Try not to store very large fields, if index size is a concern, as stored fields consume space in the index.
- Store.NO —Doesn’t store the value. This option is often used along with Index.ANALYZED to index a large text field that doesn’t need to be retrieved in its original form, such as bodies of web pages, or any other type of text document.
Term vectors are a mix between an indexed field and a stored field. They’re similar to a stored field because you can quickly retrieve all term vector fields for a given document: term vectors are keyed first by document ID. But then, they’re keyed secondarily by term, meaning they store a miniature inverted index for that one document. Unlike a stored field, where the original String content is stored verbatim, term vectors store the actual separate terms that were produced by the analyzer, allowing you to retrieve all terms for each field, and the frequency of their occurrence within the document, sorted in lexicographic order.
You can choose separately whether these details are also stored in your term vectors by passing these constants as the fourth argument to the Field constructor:
- TermVector.YES—Records the unique terms that occurred, and their counts, in each document, but doesn’t store any positions or offsets information
- TermVector.WITH_POSITIONS—Records the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets
- TermVector.WITH_OFFSETS—Records the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions
- TermVector.WITH_POSITIONS_OFFSETS—Stores unique terms and their counts, along with positions and offsets
- TermVector.NO—Doesn’t store any term vector information
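Passing one of these constants as the fourth argument looks like this (field name and value are hypothetical):
doc.add(new Field("title", "Lucene in Action",
                  Field.Store.YES,
                  Field.Index.ANALYZED,
                  Field.TermVector.WITH_POSITIONS_OFFSETS));  // term vector with positions and offsets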
There are a few other constructors for the Field object that allow you to use values other than String:
- Field(String name, Reader value, TermVector termVector) uses a Reader instead of a String to represent the value. In this case, the value can’t be stored (the option is hardwired to Store.NO) and is always analyzed and indexed (Index.ANALYZED). This can be useful when holding the full String in memory might be too costly or inconvenient—for example, for very large values.
- Field(String name, Reader value), like the previous constructor, uses a Reader instead of a String to represent the value but defaults termVector to TermVector.NO.
- Field(String name, TokenStream tokenStream, TermVector termVector) allows you to preanalyze the field value into a TokenStream. Likewise, such fields aren’t stored and are always analyzed and indexed.
- Field(String name, TokenStream tokenStream), like the previous constructor, allows you to preanalyze the field value into a TokenStream but defaults termVector to TermVector.NO.
- Field(String name, byte[] value, Store store) is used to store a binary field. Such fields are never indexed (Index.NO) and have no term vectors (TermVector.NO). The store argument must be Store.YES.
- Field(String name, byte[] value, int offset, int length, Store store), like the previous constructor, stores a binary field but allows you to reference a sub-slice of the bytes starting at offset and running for length bytes.
Table 2.1. A summary of various field characteristics, showing you how fields are created, along with common usage examples
Index | Store | TermVector | Example usage |
---|---|---|---|
NOT_ANALYZED_NO_NORMS | YES | NO | Identifiers (filenames, primary keys), telephone and Social Security numbers, URLs, personal names, dates, and textual fields for sorting |
ANALYZED | YES | WITH_POSITIONS_OFFSETS | Document title, document abstract |
ANALYZED | NO | WITH_POSITIONS_OFFSETS | Document body |
NO | YES | NO | Document type, database primary key (if not used for searching) |
NOT_ANALYZED | NO | NO | Hidden keywords |
- If the field is numeric, use NumericField.
- If the field is textual, such as the sender’s name in an email message, you must add it as a Field that’s indexed but not analyzed, using Field.Index.NOT_ANALYZED.
- If you aren’t doing any boosting for the field, you should index it without norms, to save disk space and memory, using Field.Index.NOT_ANALYZED_NO_NORMS:
new Field("author", "Arthur C. Clark", Field.Store.YES,
Field.Index.NOT_ANALYZED_NO_NORMS);
Fields used for sorting must be indexed and must contain one token per document. Typically this means using Field.Index.NOT_ANALYZED or Field.Index.NOT_ANALYZED_NO_NORMS (if you’re not boosting documents or fields), but if your analyzer will always produce only one token, such as KeywordAnalyzer, Field.Index.ANALYZED or Field.Index.ANALYZED_NO_NORMS will work as well.
Adding multiple Field instances with the same name to a document is perfectly acceptable and encouraged, as it’s a natural way to represent a field that legitimately has multiple values.
By changing a document’s boost factor, you can instruct Lucene to consider it more or less important with respect to other documents in the index when computing relevance.
Listing 2.4. Selectively boosting documents and fields
Just as you can boost documents, you can also boost individual fields.
To achieve this behavior, we use the setBoost(float) method of the Field class:
Field subjectField = new Field("subject", subject,
Field.Store.YES,
Field.Index.ANALYZED);
subjectField.setBoost(1.2F);
During searching, norms for any field being searched are loaded into memory, decoded back into a floating-point number, and used when computing the relevance score.
doc.add(new NumericField("price").setDoubleValue(19.99));
doc.add(new NumericField("timestamp").setLongValue(new Date().getTime()));
doc.add(new NumericField("day").setIntValue((int) (new Date().getTime()/24/3600)));
Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth").setIntValue(cal.get(Calendar.DAY_OF_MONTH)));
IndexWriter allows you to truncate per-Field indexing so that only the first N terms are indexed for an analyzed field. When you instantiate IndexWriter, you must pass in a MaxFieldLength instance expressing this limit. MaxFieldLength provides two convenient default instances: MaxFieldLength.UNLIMITED, which means no truncation will take place, and MaxFieldLength.LIMITED, which means fields are truncated at 10,000 terms. You can also instantiate MaxFieldLength with your own limit.
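For example, a writer that indexes at most the first 5,000 terms of each analyzed field could be created like this (a sketch; dir and analyzer as in the earlier example):
IndexWriter writer = new IndexWriter(dir, analyzer, true,
                                     new IndexWriter.MaxFieldLength(5000));  // custom truncation limit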
IndexReader getReader()
This method immediately flushes any buffered added or deleted documents, and then creates a new read-only IndexReader that includes those documents.
Optimizing only improves searching speed, not indexing speed.
IndexWriter exposes four methods to optimize:
- optimize() reduces the index to a single segment, not returning until the operation is finished.
- optimize(int maxNumSegments), also known as partial optimize, reduces the index to at most maxNumSegments segments. Because the final merge down to one segment is the most costly, optimizing to, say, five segments should be quite a bit faster than optimizing down to one segment, trading somewhat slower search speed for less optimization time.
- optimize(boolean doWait) is just like optimize, except if doWait is false then the call returns immediately while the necessary merges take place in the background. Note that doWait=false only works for a merge scheduler that runs merges in background threads, such as the default ConcurrentMergeScheduler.
- optimize(int maxNumSegments, boolean doWait) is a partial optimize that runs in the background if doWait is false.
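For instance, a partial optimize that runs in the background might look like this (a sketch; assumes the default ConcurrentMergeScheduler):
writer.optimize(5, false);  // merge down to at most 5 segments, return immediately
// ... continue indexing or searching while merges run on background threads ...
writer.close();             // close() waits for still-running merges by default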
Table 2.2. Lucene’s several core Directory implementations
Directory | Description |
---|---|
SimpleFSDirectory | A simplistic Directory that stores files in the file system, using java.io.* APIs. It doesn’t scale well with many threads. |
NIOFSDirectory | A Directory that stores files in the file system, using java.nio.* APIs. This does scale well with threads on all platforms except Microsoft Windows, due to a longstanding issue with Sun’s Java Runtime Environment (JRE). |
MMapDirectory | A Directory that uses memory-mapped I/O to access files. This is a good choice on 64-bit JREs, or on 32-bit JREs where the size of the index is relatively small. |
RAMDirectory | A Directory that stores all files in RAM. |
FileSwitchDirectory | A Directory that takes two directories in, and switches between these directories based on file extension. |
Figure 2.3. A single IndexWriter can be shared by multiple threads.
Table 2.3. Issues related to accessing a Lucene index across remote file systems
Remote file system | Notes |
---|---|
Samba/CIFS 1.0 | The standard remote file system for Windows computers. Sharing a Lucene index works fine. |
Samba/CIFS 2.0 | The new version of Samba/CIFS that’s the default for Windows Server 2008 and Windows Vista. Lucene has trouble due to incoherent client-side caches. |
Networked File System (NFS) | The standard remote file system for most Unix OSs. Lucene has trouble due to both incoherent client-side caches as well as how NFS handles deletion of files that are held open by another computer. |
Apple File Protocol (AFP) | Apple’s standard remote file system protocol. Lucene has trouble due to incoherent client-side caches. |
Table 2.4. Locking implementations provided by Lucene
Locking class name | Description |
---|---|
NativeFSLockFactory | This is the default locking for FSDirectory, using java.nio native OS locking, which will never leave leftover lock files when the JVM exits. But this locking implementation may not work correctly over certain shared file systems, notably NFS. |
SimpleFSLockFactory | Uses Java’s File.createNewFile API, which may be more portable across different file systems than NativeFSLockFactory. Be aware that if the JVM crashes or IndexWriter isn’t closed before the JVM exits, this may leave a leftover write.lock file, which you must manually remove. |
SingleInstanceLockFactory | Creates a lock entirely in memory. This is the default locking implementation for RAMDirectory. Use this when you know all IndexWriters will be instantiated in a single JVM. |
NoLockFactory | Disables locking entirely. Be careful! Only use this when you are absolutely certain that Lucene’s normal locking safeguard isn’t necessary—for example, when using a private RAMDirectory with a single IndexWriter instance. |
Listing 2.5. Using file-based locks to enforce a single writer at a time
http://lucene.apache.org/core/3_5_0/fileformats.html
- An index contains a sequence of documents.
- A document is a sequence of fields.
- A field is a named sequence of terms.
- A term is a string.
In Lucene, fields may be stored, in which case their text is stored in the index literally, in a non-inverted manner. Fields that are inverted are called indexed. A field may be both stored and indexed.
The text of a field may be tokenized into terms to be indexed, or the text of a field may be used literally as a term to be indexed. Most fields are tokenized, but sometimes it is useful for certain identifier fields to be indexed literally.
Lucene indexes may be composed of multiple sub-indexes, or segments. Each segment is a fully independent index, which could be searched separately. Indexes evolve by:
- Creating new segments for newly added documents.
- Merging existing segments.
Internally, Lucene refers to documents by an integer document number. The first document added to an index is numbered zero, and each subsequent document added gets a number one greater than the previous.
In particular, numbers may change in the following situations:
- The numbers stored in each segment are unique only within the segment, and must be converted before they can be used in a larger context. The standard technique is to allocate each segment a range of values, based on the range of numbers used in that segment. To convert a document number from a segment to an external value, the segment's base document number is added. To convert an external value back to a segment-specific value, the segment is identified by the range that the external value is in, and the segment's base value is subtracted. For example two five document segments might be combined, so that the first segment has a base value of zero, and the second of five. Document three from the second segment would have an external value of eight.
- When documents are deleted, gaps are created in the numbering. These are eventually removed as the index evolves through merging. Deleted documents are dropped when segments are merged. A freshly-merged segment thus has no gaps in its numbering.
Each segment index maintains the following:
- Field names. This contains the set of field names used in the index.
- Stored Field values. This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields is what is returned for each hit when searching. This is keyed by document number.
- Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term’s frequency and proximity data.
- Term Frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document, unless frequencies are omitted (IndexOptions.DOCS_ONLY).
- Term Proximity data. For each term in the dictionary, the positions that the term occurs in each document. Note that this will not exist if all fields in all documents omit position data.
- Normalization factors. For each field in each document, a value is stored that is multiplied into the score for hits on that field.
- Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency.
- Deleted documents. An optional file indicating which documents are deleted.
All files belonging to a segment have the same name with varying extensions. The extensions correspond to the different file formats described below. When using the Compound File format (default in 1.4 and greater) these files are collapsed into a single .cfs file (see below for details).
Typically, all segments in an index are stored in a single directory, although this is not required.
As of version 2.1 (lock-less commits), file names are never re-used (there is one exception, "segments.gen", see below). That is, when any file is saved to the Directory it is given a never-before-used filename. This is achieved using a simple generations approach. For example, the first segments file is segments_1, then segments_2, etc. The generation is a sequential long integer represented in alpha-numeric (base 36) form.
Table 3.1. Lucene’s primary searching API
Class | Purpose |
---|---|
IndexSearcher | Gateway to searching an index. All searches come through an IndexSearcher instance using any of the several overloaded search methods. |
Query (and subclasses) | Concrete subclasses encapsulate logic for a particular query type. Instances of Query are passed to an IndexSearcher’s search method. |
QueryParser | Processes a human-entered (and readable) expression into a concrete Query object. |
TopDocs | Holds the top scoring documents, returned by IndexSearcher.search. |
ScoreDoc | Provides access to each search result in TopDocs. |
Listing 3.1. Simple searching with TermQuery
Figure 3.1. QueryParser translates a textual expression from the end user into an arbitrarily complex query for searching.
Listing 3.2. QueryParser, which makes it trivial to translate search text into a Query
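The listing isn't copied here; at its core it is a parse-then-search sequence like the following sketch (field name and expression are illustrative):
QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                                     new StandardAnalyzer(Version.LUCENE_30));
Query query = parser.parse("+lucene +\"information retrieval\"");  // required term plus required phrase
TopDocs hits = searcher.search(query, 10);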
Table 3.2. Expression examples that QueryParser handles
Query expression | Matches documents that... |
---|---|
java | Contain the term java in the default field |
java junit | Contain the term java or junit, or both, in the default field |
java OR junit | Same as java junit (OR is QueryParser’s default operator) |
+java +junit | Contain both java and junit in the default field |
java AND junit | Same as +java +junit |
title:ant | Contain the term ant in the title field |
title:extreme -subject:sports | Contain extreme in the title field and don’t have sports in the subject field |
title:extreme AND NOT subject:sports | Same as title:extreme -subject:sports |
(agile OR extreme) AND methodology | Contain methodology and must also contain agile and/or extreme, all in the default field |
title:"junit in action" | Contain the exact phrase “junit in action” in the title field |
title:"junit action"~5 | Contain the terms junit and action within five positions of one another, in the title field |
java* | Contain terms that begin with java, like javaspaces, javaserver, java.net, and the exact term java itself |
java~ | Contain terms that are close to the word java, such as lava |
lastmodified: [1/1/09 TO 12/31/09] | Have lastmodified field values between the dates January 1, 2009 and December 31, 2009 |
Figure 3.2. The relationship between the common classes used for searching
Table 3.3. Primary IndexSearcher search methods
IndexSearcher.search method signature | When to use |
---|---|
TopDocs search(Query query, int n) | Straightforward searches. The int n parameter specifies how many top-scoring documents to return. |
TopDocs search(Query query, Filter filter, int n) | Searches constrained to a subset of available documents, based on filter criteria. |
TopFieldDocs search(Query query, Filter filter, int n, Sort sort) | Searches constrained to a subset of available documents based on filter criteria, and sorted by a custom Sort object |
void search(Query query, Collector results) | Used when you have custom logic to implement for each document visited, or you’d like to collect a different subset of documents than the top N by the sort criteria. |
void search(Query query, Filter filter, Collector results) | Same as previous, except documents are only accepted if they pass the filter criteria. |
Table 3.4. TopDocs methods for efficiently accessing search results
TopDocs method or attribute | Return value |
---|---|
totalHits | Number of documents that matched the search |
scoreDocs | Array of ScoreDoc instances that contains the results |
getMaxScore() | Returns best score of all matches, if scoring was done while searching (when sorting by field, you separately control whether scores are computed) |
You can choose from a couple of implementation approaches:
- Gather multiple pages’ worth of results on the initial search and keep the resulting ScoreDocs and IndexSearcher instances available while the user is navigating the search results.
- Requery each time the user navigates to a new page.
Listing 3.3. Near-real-time search
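The listing is omitted in these notes; the near-real-time pattern it demonstrates is roughly the following sketch, built on the IndexWriter.getReader() method described earlier:
IndexReader reader = writer.getReader();             // flushes buffered adds/deletes; read-only reader
IndexSearcher searcher = new IndexSearcher(reader);
// ... more addDocument()/deleteDocuments() calls on the writer ...
IndexReader newReader = reader.reopen();             // cheap if nothing has changed
if (newReader != reader) {
  searcher.close();
  reader.close();
  reader = newReader;
  searcher = new IndexSearcher(reader);              // searches now see the recent changes
}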
Figure 3.3. Lucene uses this formula to determine a document score based on a query.
Table 3.5. Factors in the scoring formula
Factor | Description |
---|---|
tf(t in d) | Term frequency factor for the term (t) in the document (d)—how many times the term t occurs in the document. |
idf(t) | Inverse document frequency of the term: a measure of how “unique” the term is. Very common terms have a low idf; very rare terms have a high idf. |
boost(t.field in d) | Field and document boost, as set during indexing (see section 2.5). You may use this to statically boost certain fields and certain documents over others. |
lengthNorm(t.field in d) | Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index norms. Shorter fields (fewer tokens) get a bigger boost from this factor. |
coord(q, d) | Coordination factor, based on the number of query terms the document contains. The coordination factor gives an AND-like boost to documents that contain more of the search terms than other documents. |
queryNorm(q) | Normalization value for a query, given the sum of the squared weights of each of the query terms. |
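Roughly, these factors combine as follows (a paraphrase of the conceptual formula in figure 3.3 and the DefaultSimilarity javadocs, not an exact transcription):
score(q, d) = coord(q, d) * queryNorm(q) * SUM over each term t in q of [ tf(t in d) * idf(t)^2 * boost(t.field in d) * lengthNorm(t.field in d) ]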
Listing 3.4. The explain() method
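The listing isn't reproduced; at its core it calls IndexSearcher.explain, roughly as in this sketch:
Explanation explanation = searcher.explain(query, hits.scoreDocs[0].doc);
System.out.println(explanation.toString());  // nested breakdown of tf, idf, boosts, and norms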
public void testKeyword() throws Exception {
Directory dir = TestUtil.getBookIndexDirectory();
IndexSearcher searcher = new IndexSearcher(dir);
Term t = new Term("isbn", "9781935182023");
Query query = new TermQuery(t);
TopDocs docs = searcher.search(query, 10);
assertEquals("JUnit in Action, Second Edition",
1, docs.totalHits);
searcher.close();
dir.close();
}
public void testTermRangeQuery() throws Exception {
Directory dir = TestUtil.getBookIndexDirectory();
IndexSearcher searcher = new IndexSearcher(dir);
TermRangeQuery query = new TermRangeQuery("title2", "d", "j",
true, true);
TopDocs matches = searcher.search(query, 100);
assertEquals(3, matches.totalHits);
searcher.close();
dir.close();
}
public void testInclusive() throws Exception {
Directory dir = TestUtil.getBookIndexDirectory();
IndexSearcher searcher = new IndexSearcher(dir);
// pub date of TTC was September 2006
NumericRangeQuery query = NumericRangeQuery.newIntRange("pubmonth",
200605,
200609,
true,
true);
TopDocs matches = searcher.search(query, 10);
assertEquals(1, matches.totalHits);
searcher.close();
dir.close();
}
public void testExclusive() throws Exception {
Directory dir = TestUtil.getBookIndexDirectory();
IndexSearcher searcher = new IndexSearcher(dir);
// pub date of TTC was September 2006
NumericRangeQuery query = NumericRangeQuery.newIntRange("pubmonth",
200605,
200609,
false,
false);
TopDocs matches = searcher.search(query, 10);
assertEquals(0, matches.totalHits);
searcher.close();
dir.close();
}
Listing 3.5. PrefixQuery
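The listing is omitted; the gist is a sketch like this (the hierarchical category value is hypothetical):
Term term = new Term("category", "/technology/computers/programming");
Query query = new PrefixQuery(term);           // matches the term itself plus anything that starts with it
TopDocs matches = searcher.search(query, 10);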
Listing 3.6. Using BooleanQuery to combine required subqueries
public static boolean hitsIncludeTitle(IndexSearcher searcher,
TopDocs hits, String title)
throws IOException {
for (ScoreDoc match : hits.scoreDocs) {
Document doc = searcher.doc(match.doc);
if (title.equals(doc.get("title"))) {
return true;
}
}
System.out.println("title '" + title + "' not found");
return false;
}
Listing 3.7. Using BooleanQuery to combine optional subqueries.
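The listing body isn't copied here; combining optional clauses reduces to a sketch like this:
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("contents", "brown")), BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("contents", "fox")), BooleanClause.Occur.SHOULD);
TopDocs matches = searcher.search(query, 10);  // either term matches; documents with both score higher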
Listing 3.8. PhraseQuery
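Again only the caption is kept in these notes; the essential PhraseQuery usage is a sketch like:
PhraseQuery query = new PhraseQuery();
query.setSlop(1);                              // allow one position of slack between the terms
query.add(new Term("contents", "quick"));
query.add(new Term("contents", "fox"));
TopDocs matches = searcher.search(query, 10);  // matches "quick fox" and "quick brown fox"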
Listing 3.9. WildcardQuery
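And the WildcardQuery listing reduces to a sketch like:
Query query = new WildcardQuery(new Term("contents", "?ild*"));  // ? matches one character, * any run
TopDocs matches = searcher.search(query, 10);                    // e.g. wild, mild, wildcard, mildew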
public void testFuzzy() throws Exception {
indexSingleFieldDocs(new Field[] { new Field("contents",
"fuzzy",
Field.Store.YES,
Field.Index.ANALYZED),
new Field("contents",
"wuzzy",
Field.Store.YES,
Field.Index.ANALYZED)
});
IndexSearcher searcher = new IndexSearcher(directory);
Query query = new FuzzyQuery(new Term("contents", "wuzza"));
TopDocs matches = searcher.search(query, 10);
assertEquals("both close enough", 2, matches.totalHits);
assertTrue("wuzzy closer than fuzzy",
matches.scoreDocs[0].score != matches.scoreDocs[1].score);
Document doc = searcher.doc(matches.scoreDocs[0].doc);
assertEquals("wuzza bear", "wuzzy", doc.get("contents"));
searcher.close();
}
Query query = new MatchAllDocsQuery(field);
public void testToString() throws Exception {
BooleanQuery query = new BooleanQuery();
query.add(new FuzzyQuery(new Term("field", "kountry")),
BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("title", "western")),
BooleanClause.Occur.SHOULD);
assertEquals("both kinds", "+kountry~0.5 title:western",
query.toString("field"));
}
public void testTermQuery() throws Exception {
QueryParser parser = new QueryParser(Version.LUCENE_30,
"subject", analyzer);
Query query = parser.parse("computers");
System.out.println("term: " + query);
}
produces this output:
term: subject:computers
Listing 3.10. Creating a TermRangeQuery using QueryParser
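The listing isn't reproduced; the key point is that QueryParser's bracket syntax produces a TermRangeQuery, roughly as in this sketch (field name reused from the tests above):
QueryParser parser = new QueryParser(Version.LUCENE_30, "subject", analyzer);
Query query = parser.parse("title2:[Q TO V]");  // inclusive range; {Q TO V} would be exclusive
// note: QueryParser lowercases range endpoints unless setLowercaseExpandedTerms(false) is called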
QueryParser does include certain built-in logic for parsing dates when they appear as part of a range query, but the logic doesn’t work when you’ve indexed your dates using NumericField.
public void testLowercasing() throws Exception {
Query q = new QueryParser(Version.LUCENE_30,
"field", analyzer).parse("PrefixQuery*");
assertEquals("lowercased",
"prefixquery*", q.toString("field"));
QueryParser qp = new QueryParser(Version.LUCENE_30,
"field", analyzer);
qp.setLowercaseExpandedTerms(false);
q = qp.parse("PrefixQuery*");
assertEquals("not lowercased",
"PrefixQuery*", q.toString("field"));
}
QueryParser parser = new QueryParser(Version.LUCENE_30,
"contents", analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
Table 3.6. Boolean query operator shortcuts
Verbose syntax | Shortcut syntax |
---|---|
a AND b | +a +b |
a OR b | a b |
a AND NOT b | +a -b |
public void testPhraseQuery() throws Exception {
Query q = new QueryParser(Version.LUCENE_30,
"field",
new StandardAnalyzer(
Version.LUCENE_30))
.parse("\"This is Some Phrase*\"");
assertEquals("analyzed",
"\"? ? some phrase\"", q.toString("field"));
q = new QueryParser(Version.LUCENE_30,
"field", analyzer)
.parse("\"term\"");
assertTrue("reduced to TermQuery", q instanceof TermQuery);
}
public void testSlop() throws Exception {
Query q = new QueryParser(Version.LUCENE_30,
"field", analyzer)
.parse("\"exact phrase\"");
assertEquals("zero slop",
"\"exact phrase\"", q.toString("field"));
QueryParser qp = new QueryParser(Version.LUCENE_30,
"field", analyzer);
qp.setPhraseSlop(5);
q = qp.parse("\"sloppy phrase\"");
assertEquals("sloppy, implicitly",
"\"sloppy phrase\"~5", q.toString("field"));
}
public void testFuzzyQuery() throws Exception {
QueryParser parser = new QueryParser(Version.LUCENE_30,
"subject", analyzer);
Query query = parser.parse("kountry~");
System.out.println("fuzzy: " + query);
query = parser.parse("kountry~0.7");
System.out.println("fuzzy 2: " + query);
}
This produces the following output:
fuzzy: subject:kountry~0.5
fuzzy 2: subject:kountry~0.7
QueryParser produces the MatchAllDocsQuery when you enter *:*.
public void testGrouping() throws Exception {
Query query = new QueryParser(
Version.LUCENE_30,
"subject",
analyzer).parse("(agile OR extreme) AND methodology");
TopDocs matches = searcher.search(query, 10);
assertTrue(TestUtil.hitsIncludeTitle(searcher, matches,
"Extreme Programming Explained"));
assertTrue(TestUtil.hitsIncludeTitle(searcher,
matches,
"The Pragmatic Programmer"));
}
Figure 3.7. A Query can have an arbitrary nested structure, easily expressed with QueryParser’s grouping. This query is achieved by parsing the expression (+"brown fox" +quick) "red dog".
A caret (^) followed by a floating-point number sets the boost factor for the preceding query. For example, the query expression junit^2.0 testing sets the junit TermQuery to a boost of 2.0 and leaves the testing TermQuery at the default boost of 1.0.
Analyzing "The quick brown fox jumped over the lazy dog"
WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog]
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog]
Analyzing "XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
In the meantime, here’s a summary of each of these analyzers:
- WhitespaceAnalyzer, as the name implies, splits text into tokens on whitespace characters and makes no other effort to normalize the tokens. It doesn’t lowercase each token.
- SimpleAnalyzer first splits tokens at nonletter characters, then lowercases each token. Be careful! This analyzer quietly discards numeric characters but keeps all other characters.
- StopAnalyzer is the same as SimpleAnalyzer, except it removes common words. By default, it removes common words specific to the English language (the, a, etc.), though you can pass in your own set.
- StandardAnalyzer is Lucene’s most sophisticated core analyzer. It has quite a bit of logic to identify certain kinds of tokens, such as company names, email addresses, and hostnames. It also lowercases each token and removes stop words and punctuation.
Figure 4.1. Analysis process during indexing. Fields 1 and 2 are analyzed, producing a sequence of tokens; Field 3 is unanalyzed, causing its entire value to be indexed as a single token.
QueryParser parser = new QueryParser(Version.LUCENE_30,
"contents", analyzer);
Query query = parser.parse(expression);
- Analyzers are used to analyze a specific field at a time and break things into tokens only within that field.
- Analyzers don’t help in field separation because their scope is to deal with a single field at a time. Instead, parsing these documents prior to analysis is required.
public final class SimpleAnalyzer extends Analyzer {
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
return new LowerCaseTokenizer(reader);
}
@Override
public TokenStream reusableTokenStream(String fieldName, Reader reader)
throws IOException {
Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
if (tokenizer == null) {
tokenizer = new LowerCaseTokenizer(reader);
setPreviousTokenStream(tokenizer);
} else
tokenizer.reset(reader);
return tokenizer;
}
}
Figure 4.2. A token stream with positional and offset information
Figure 4.3. The hierarchy of classes used to produce tokens: TokenStream is the abstract base class; Tokenizer creates tokens from a Reader; and TokenFilter filters any other TokenStream.
Figure 4.4. An analyzer chain starts with a Tokenizer, to produce initial tokens from the characters read from a Reader, then modifies the tokens with any number of chained TokenFilters.
Figure 4.5. TokenFilter and Tokenizer class hierarchy
To illustrate the analyzer chain in code, here’s a simple example analyzer:
public TokenStream tokenStream(String fieldName, Reader reader) {
return new StopFilter(true,
new LowerCaseTokenizer(reader),
stopWords);
}
Listing 4.1. AnalyzerDemo: seeing analysis in action
Listing 4.2. AnalyzerUtils: delving into an analyzer
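The AnalyzerUtils listing is omitted; its simplest method, displayTokens, is essentially the following sketch built on the TokenStream/TermAttribute API:
public static void displayTokens(Analyzer analyzer, String text) throws IOException {
  TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
  TermAttribute term = stream.addAttribute(TermAttribute.class);
  while (stream.incrementToken()) {              // advance to the next token
    System.out.print("[" + term.term() + "] ");  // print each token's text in brackets
  }
  System.out.println();
}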
The AnalyzerDemo application lets you specify one or more strings from the command line to be analyzed instead of the embedded example ones:
%java lia.analysis.AnalyzerDemo "No Fluff, Just Stuff"
Analyzing "No Fluff, Just Stuff"
org.apache.lucene.analysis.WhitespaceAnalyzer:
[No] [Fluff,] [Just] [Stuff]
org.apache.lucene.analysis.SimpleAnalyzer:
[no] [fluff] [just] [stuff]
org.apache.lucene.analysis.StopAnalyzer:
[fluff] [just] [stuff]
org.apache.lucene.analysis.standard.StandardAnalyzer:
[fluff] [just] [stuff]
Listing 4.3. Seeing the term, offsets, type, and position increment of each token
We display all token information on the example phrase using SimpleAnalyzer:
public static void main(String[] args) throws IOException {
AnalyzerUtils.displayTokensWithFullDetails(new SimpleAnalyzer(),
"The quick brown fox....");
}
Here’s the output:
1: [the:0->3:word]
2: [quick:4->9:word]
3: [brown:10->15:word]
4: [fox:16->19:word]
Analyzing the phrase “I’ll email you at xyz@example.com” with StandardAnalyzer produces this interesting output:
1: [i'll:0->4:<APOSTROPHE>]
2: [email:5->10:<ALPHANUM>]
3: [you:11->14:<ALPHANUM>]
5: [xyz@example.com:18->33:<EMAIL>]
Table 4.3. Primary analyzers available in Lucene
Analyzer | Steps taken |
---|---|
WhitespaceAnalyzer | Splits tokens at whitespace. |
SimpleAnalyzer | Divides text at nonletter characters and lowercases. |
StopAnalyzer | Divides text at nonletter characters, lowercases, and removes stop words. |
KeywordAnalyzer | Treats entire text as a single token. |
StandardAnalyzer | Tokenizes based on a sophisticated grammar that recognizes email addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more. It also lowercases and removes stop words. |
We chose the Metaphone algorithm as an example, but other algorithms are available, such as Soundex.
Listing 4.4. Searching for words that sound like one another
The trick lies in the MetaphoneReplacementAnalyzer:
public class MetaphoneReplacementAnalyzer extends Analyzer {
public TokenStream tokenStream(String fieldName, Reader reader) {
return new MetaphoneReplacementFilter(
new LetterTokenizer(reader));
}
}
Listing 4.5. TokenFilter that replaces tokens with their metaphone equivalents
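The listing isn't reproduced in these notes; a minimal sketch of such a filter, assuming the Metaphone encoder from Apache Commons Codec, looks like this:
public class MetaphoneReplacementFilter extends TokenFilter {
  public static final String METAPHONE = "metaphone";
  private final Metaphone metaphoner = new Metaphone();  // org.apache.commons.codec.language.Metaphone
  private final TermAttribute termAttr;
  private final TypeAttribute typeAttr;

  public MetaphoneReplacementFilter(TokenStream input) {
    super(input);
    termAttr = addAttribute(TermAttribute.class);
    typeAttr = addAttribute(TypeAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken())                          // no more tokens from the underlying stream
      return false;
    String encoded = metaphoner.encode(termAttr.term());  // replace the term with its Metaphone code
    termAttr.setTermBuffer(encoded);
    typeAttr.setType(METAPHONE);                          // mark the token type for debugging
    return true;
  }
}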
Using our AnalyzerUtils, two phrases that sound similar yet are spelled differently are tokenized and displayed:
public static void main(String[] args) throws IOException {
MetaphoneReplacementAnalyzer analyzer =
new MetaphoneReplacementAnalyzer();
AnalyzerUtils.displayTokens(analyzer,
"The quick brown fox jumped over the lazy dog");
System.out.println("");
AnalyzerUtils.displayTokens(analyzer,
"Tha quik brown phox jumpd ovvar tha lazi dag");
}
The output gives a feel for what the Metaphone encoder produces:
[0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]
[0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]
Listing 4.6. Testing the synonym analyzer
Listing 4.7. SynonymAnalyzer implementation
public class SynonymAnalyzer extends Analyzer {
private SynonymEngine engine;
public SynonymAnalyzer(SynonymEngine engine) {
this.engine = engine;
}
public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new SynonymFilter(
new StopFilter(true,
new LowerCaseFilter(
new StandardFilter(
new StandardTokenizer(
Version.LUCENE_30, reader))),
StopAnalyzer.ENGLISH_STOP_WORDS_SET),
engine
);
return result;
}
}
Listing 4.8. SynonymFilter: buffering tokens and emitting one at a time
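The listing body is omitted here; the buffering idea, written against the SynonymEngine interface shown just below, is roughly this sketch (not the book's exact code):
public class SynonymFilter extends TokenFilter {
  private Stack<State> synonymStack = new Stack<State>();  // synonyms waiting to be emitted
  private SynonymEngine engine;
  private TermAttribute termAtt;
  private PositionIncrementAttribute posIncrAtt;

  public SynonymFilter(TokenStream in, SynonymEngine engine) {
    super(in);
    this.engine = engine;
    termAtt = addAttribute(TermAttribute.class);
    posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (synonymStack.size() > 0) {         // emit any buffered synonyms first
      restoreState(synonymStack.pop());
      posIncrAtt.setPositionIncrement(0);  // same position as the original token
      return true;
    }
    if (!input.incrementToken())           // read the next original token
      return false;
    addAliasesToStack();                   // buffer its synonyms, if any
    return true;                           // emit the original token now
  }

  private void addAliasesToStack() throws IOException {
    String[] synonyms = engine.getSynonyms(termAtt.term());
    if (synonyms == null) return;
    State current = captureState();        // remember the original token's attributes
    for (String synonym : synonyms) {
      termAtt.setTermBuffer(synonym);      // overwrite the term text with the synonym
      synonymStack.push(captureState());
    }
    restoreState(current);                 // restore the original token before returning it
  }
}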
The design of SynonymAnalyzer allows for pluggable SynonymEngine implementations. SynonymEngine is a one-method interface:
public interface SynonymEngine {
String[] getSynonyms(String s) throws IOException;
}
public class TestSynonymEngine implements SynonymEngine {
private static HashMap<String, String[]> map =
new HashMap<String, String[]>();
static {
map.put("quick", new String[] {"fast", "speedy"});
map.put("jumps", new String[] {"leaps", "hops"});
map.put("over", new String[] {"above"});
map.put("lazy", new String[] {"apathetic", "sluggish"});
map.put("dog", new String[] {"canine", "pooch"});
}
public String[] getSynonyms(String s) {
return map.get(s);
}
}
Listing 4.9. SynonymAnalyzerTest: showing that synonym queries work
Listing 4.10. Testing SynonymAnalyzer with QueryParser
The test produces the following output:
With SynonymAnalyzer, "fox jumps" parses to "fox (jumps hops leaps)"
With StandardAnalyzer, "fox jumps" parses to "fox jumps"
Listing 4.11. Visualizing the position increment of each token
public static void displayTokensWithPositions
(Analyzer analyzer, String text) throws IOException {
TokenStream stream = analyzer.tokenStream("contents",
new StringReader(text));
TermAttribute term = stream.addAttribute(TermAttribute.class);
PositionIncrementAttribute posIncr =
stream.addAttribute(PositionIncrementAttribute.class);
int position = 0;
while(stream.incrementToken()) {
int increment = posIncr.getPositionIncrement();
if (increment > 0) {
position = position + increment;
System.out.println();
System.out.print(position + ": ");
}
System.out.print("[" + term.term() + "] ");
}
System.out.println();
}
We wrote a quick piece of code to see what our SynonymAnalyzer is doing:
public class SynonymAnalyzerViewer {
public static void main(String[] args) throws IOException {
SynonymEngine engine = new TestSynonymEngine();
AnalyzerUtils.displayTokensWithPositions(
new SynonymAnalyzer(engine),
"The quick brown fox jumps over the lazy dog");
}
}
And we can now visualize the synonyms placed in the same positions as the original words:
2: [quick] [speedy] [fast]
3: [brown]
4: [fox]
5: [jumps] [hops] [leaps]
6: [over] [above]
8: [lazy] [sluggish] [apathetic]
9: [dog] [pooch] [canine]
This is illustrated from the output of AnalyzerUtils.displayTokensWithPositions:
2: [quick]
3: [brown]
4: [fox]
5: [jump]
6: [over]
8: [lazi]
9: [dog]
Listing 4.12. PositionalPorterStopAnalyzer: stemming and stop word removal
public class PositionalPorterStopAnalyzer extends Analyzer {
private Set stopWords;
public PositionalPorterStopAnalyzer() {
this(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
}
public PositionalPorterStopAnalyzer(Set stopWords) {
this.stopWords = stopWords;
}
public TokenStream tokenStream(String fieldName, Reader reader) {
StopFilter stopFilter = new StopFilter(true,
new LowerCaseTokenizer(reader),
stopWords);
stopFilter.setEnablePositionIncrements(true);
return new PorterStemFilter(stopFilter);
}
}