
ElasticSearch (Basics)

Gaurav Chandak edited this page Jul 23, 2016 · 2 revisions

#ElasticSearch (Basics)

  • Elasticsearch is a highly scalable open-source search engine with a REST API.
  • Elasticsearch is a document-oriented database. The entire object graph you want to search needs to be indexed, so documents must be denormalized before they are indexed. Denormalization increases retrieval performance (since no query-time joins are necessary), but it uses more space (because the same data is stored several times) and makes it harder to keep things consistent and up to date (because any change must be applied to every copy). Denormalized documents are therefore an excellent fit for write-once-read-many workloads.
  • Elasticsearch is commonly used alongside another database. A database system with a stronger focus on constraints, correctness and robustness, and on being readily and transactionally updatable, holds the master record, which is then asynchronously pushed to Elasticsearch.
  • Its feature set also allows it to function much like a schema-less JSON datastore that can be accessed using both search queries and regular CRUD commands.
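
The denormalization trade-off described above can be sketched in a few lines of Python. This is only an illustration: the `authors` and `posts` records and the helper names `denormalize` and `apply_author_change` are hypothetical, and nothing here talks to Elasticsearch.

```python
# Hypothetical normalized records, as a relational master store might hold them.
authors = {1: {"name": "John Smith", "city": "London"}}
posts = [
    {"id": 123, "author_id": 1, "title": "Hello"},
    {"id": 124, "author_id": 1, "title": "Hello again"},
]


def denormalize(post, authors):
    """Build a self-contained search document with the author data copied in."""
    author = authors[post["author_id"]]
    return {
        "title": post["title"],
        "author_id": post["author_id"],
        "author": dict(author),  # each document stores its own copy
    }


def apply_author_change(docs, author_id, changes):
    """Propagate a change to every document that embeds this author.

    Returns the number of documents that had to be rewritten.
    """
    touched = 0
    for doc in docs.values():
        if doc["author_id"] == author_id:
            doc["author"].update(changes)
            touched += 1
    return touched


# One document per post; no join is needed at query time.
docs = {post["id"]: denormalize(post, authors) for post in posts}

# The price: a single change to the author touches every copy.
rewritten = apply_author_change(docs, author_id=1, changes={"city": "Paris"})
print(rewritten)  # 2
```

Reads stay cheap because each document is complete on its own, while the update cost grows with the number of copies.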

Here are the main disadvantages:

  • Security - Elasticsearch does not provide any authentication or access control functionality.
  • Transactions - There is no support for transactions or for transactional processing of data manipulation.
  • Maturity of tools - ES is still relatively new, and its client libraries and third-party tools have not had time to mature, which can make development much harder.
  • Large Computations - Commands for searching data are not suited to "large" scans of data or to advanced computation on the database side.
  • Data Availability - ES makes data available in "near real-time", which may require additional considerations in your application (e.g., on a comments page, when a user adds a new comment and then refreshes the page, the new comment might not show up yet because the index is still updating).
  • Durability - ES is distributed and fairly stable, but backups and durability are not as high a priority as in other data stores. This is probably the most important point if you're going to make ES the primary store, since losing your data is never good.
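
The "near real-time" point can be pictured with a toy model in Python. It assumes a much simplified view in which newly indexed documents wait in a buffer until a refresh makes them visible to search (in Elasticsearch a refresh runs about once a second by default). The class `ToyNearRealTimeIndex` is hypothetical and does not reflect how Elasticsearch is implemented internally.

```python
class ToyNearRealTimeIndex:
    """Toy model: writes become searchable only after a refresh."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet visible to search
        self._searchable = []  # visible to search

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        # Make everything written so far visible to search.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, predicate):
        return [doc for doc in self._searchable if predicate(doc)]


comments = ToyNearRealTimeIndex()
comments.index({"user": "john", "text": "First!"})

# Searching immediately after the write does not see the new comment.
before = comments.search(lambda doc: doc["user"] == "john")

comments.refresh()
after = comments.search(lambda doc: doc["user"] == "john")

print(len(before), len(after))  # 0 1
```

An application that must show a user's own write straight away therefore has to handle the gap itself, for example by displaying the comment it just submitted rather than relying on a fresh search.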

A document does not consist only of its data. It also has metadata—information about the document. The three required metadata elements are as follows:

  • _index: An index is like a database in a relational database system; it’s the place where we store and index related data.

  • _type: The class of object that the document represents. In a relational database, we usually store objects of the same class in the same table, because they share the same data structure. For the same reason, in Elasticsearch we use the same type for documents that represent the same class of thing.

  • _id: The ID is a string that, when combined with the _index and _type, uniquely identifies a document in Elasticsearch.
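
As a small illustration of how the three metadata elements identify a document, the following Python sketch joins them into the `/index/type/id` path used by the requests on this page. The helper name `document_path` is hypothetical.

```python
from urllib.parse import quote


def document_path(index, doc_type, doc_id):
    """Join _index, _type and _id into a URL path, percent-encoding each part."""
    parts = (index, doc_type, doc_id)
    return "/" + "/".join(quote(str(part), safe="") for part in parts)


print(document_path("website", "blog", 123))  # /website/blog/123
```

Percent-encoding each part keeps an ID that contains a slash or a space from changing the structure of the path.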

##Commands:

  • If our index is called website, our type is called blog, and we choose the ID 123, then the index request looks like this:
    • curl -XPUT "http://localhost:9200/website/blog/123" -d '{"name": "John Smith","age":42,"confirmed":true,"join_date":"2014-06-01","home": {"lat":51.5,"lon":0.1},"accounts": [{"type": "facebook","id": "johnsmith"},{"type": "twitter","id": "johnsmith"}]}'
  • If our data doesn’t have a natural ID, we can let Elasticsearch autogenerate one for us. The structure of the request changes: instead of using the PUT verb (“store this document at this URL”), we use the POST verb (“store this document under this URL”).
    • curl -XPOST "http://localhost:9200/website/blog" -d '{"name": "John Smith","age":42,"confirmed":true,"join_date":"2014-06-01","home": {"lat":51.5,"lon":0.1},"accounts": [{"type": "facebook","id": "johnsmith"},{"type": "twitter","id": "johnsmith"}]}'
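  • Retrieve a document together with its metadata:
    • curl -XGET "http://localhost:9200/website/blog/123?pretty"
  • Retrieve only the _source of a document:
    • curl -XGET "http://localhost:9200/website/blog/123/_source?pretty"
  • Retrieve only selected fields of the _source:
    • curl -XGET "http://localhost:9200/website/blog/123?_source=name,accounts"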
  • ES updates a document by marking the previous version as deleted and indexing a new document with an incremented version number. The old version is removed later in the background.
  • To create a new document and be sure not to overwrite an existing one, either use POST with an autogenerated ID, or use PUT with '?op_type=create' or the _create endpoint:
    • curl -XPUT "http://localhost:9200/website/blog/123/_create" -d '{"name": "John Smith"}' or
    • curl -XPUT "http://localhost:9200/website/blog/123?op_type=create" -d '{"name": "John Smith"}' or
    • curl -XPOST "http://localhost:9200/website/blog" -d '{"name": "John Smith"}'
  • Delete: ES only marks the document as deleted and removes it later in the background.
    • curl -XDELETE "http://localhost:9200/website/blog/123"
  • Partial update:
    • curl -XPOST "http://localhost:9200/website/blog/123/_update" -d '{"doc" : { "tags" : [ "testing" ], "views": 0 }}' or
    • curl -XPOST "http://localhost:9200/website/blog/123/_update" -d '{ "script" : "ctx._source.views+=1" }'
  • Search
    • curl -XGET "http://localhost:9200/_search?pretty"
    • curl -XGET "http://localhost:9200/website/_search?pretty"
    • curl -XGET "http://localhost:9200/website/blog/_search?pretty"
    • curl -XGET "http://localhost:9200/_all/_search?pretty"
    • curl -XGET "http://localhost:9200/w*/_search?pretty"
    • curl -XGET "http://localhost:9200/index1,website/_search?pretty"
    • curl -XGET "http://localhost:9200/index1,website/blog,type1/_search?pretty"
    • curl -XGET "http://localhost:9200/_all/blog,type1/_search?pretty"
    • curl -XGET "http://localhost:9200/_search?size=5&from=0&pretty"
    • Search Lite: curl -XGET "http://localhost:9200/_search?q=age:44&pretty"
    • Search Lite with required (+) and excluded (-) terms; the + prefix is URL-encoded as %2B because a literal + in a query string is decoded as a space: curl -XGET "http://localhost:9200/_search?q=%2Bage:44+-name:mary&pretty"
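
The create, update and delete behaviour described in the commands above can be modelled with a toy in-memory store in Python. This is a sketch under simplifying assumptions, not Elasticsearch's implementation: the class `ToyVersionedStore` and the exception `VersionConflict` are hypothetical, the partial update is a shallow merge, and the version counter restarts after a delete.

```python
class VersionConflict(Exception):
    """Raised when a create-only write hits an existing document."""


class ToyVersionedStore:
    def __init__(self):
        self._live = {}        # key -> (version, source)
        self._tombstones = []  # old copies marked as deleted, not yet purged

    def index(self, key, source, create_only=False):
        """Store a whole document; with create_only, refuse to overwrite."""
        current = self._live.get(key)
        if current is not None:
            if create_only:
                raise VersionConflict(key)
            # The previous copy is only marked as deleted for now.
            self._tombstones.append((key, current))
        version = current[0] + 1 if current is not None else 1
        self._live[key] = (version, dict(source))
        return version

    def update(self, key, partial):
        """Partial update: merge into the current source, then reindex it."""
        _, source = self._live[key]
        return self.index(key, {**source, **partial})

    def delete(self, key):
        """Mark the live copy as deleted; space is reclaimed by purge()."""
        self._tombstones.append((key, self._live.pop(key)))

    def get(self, key):
        return self._live.get(key)

    def purge(self):
        """Stand-in for the background cleanup of deleted copies."""
        purged = len(self._tombstones)
        self._tombstones.clear()
        return purged


store = ToyVersionedStore()
key = ("website", "blog", "123")

v1 = store.index(key, {"name": "John Smith"}, create_only=True)  # like _create
v2 = store.update(key, {"views": 0})                             # like _update

try:
    store.index(key, {"name": "Someone Else"}, create_only=True)
    conflict = False
except VersionConflict:
    conflict = True  # the existing document was not overwritten

store.delete(key)
print(v1, v2, conflict, store.purge())  # 1 2 True 2
```

The final `purge()` reports two copies: the one superseded by the update and the one removed by the delete.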

##Inverted Index

Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.

You can find only terms that exist in your index, so both the indexed text and the query string must be normalized into the same form (for example by lowercasing, removing stopwords, stemming, and mapping synonyms to the same word).
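
A toy inverted index in Python makes the idea concrete. This sketch assumes a very small normalization step (lowercasing plus a hypothetical stopword list, with no stemming or synonym handling) and is not how Elasticsearch or Lucene store their indices.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "in"}


def normalize(text):
    """Lowercase, split into alphanumeric tokens and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def build_inverted_index(docs):
    """Map each unique term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term of the query."""
    terms = normalize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], ()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "The quick brown fox",
    2: "Quick brown dogs",
    3: "The lazy dog",
}
index = build_inverted_index(docs)

print(sorted(search(index, "QUICK")))    # [1, 2]
print(sorted(search(index, "the dog")))  # [3]
```

Because the query goes through the same `normalize` function as the documents, "QUICK" finds "quick". Without stemming, "dog" does not find "dogs" in document 2, which is the kind of mismatch that stemming is meant to remove.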

##Full-Text Search

curl -XGET "http://localhost:9200/_all/_search?pretty" -d '{ "query": QUERY, "from": 0, "size": 10 }'
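
The from and size parameters in the request body control paging. As a minimal sketch, the following Python helper (the name `search_body` and its page numbering are assumptions) wraps a query and turns a zero-based page number into the matching from offset.

```python
import json


def search_body(query, page=0, page_size=10):
    """Wrap a query in a search request body for the given zero-based page."""
    if page < 0 or page_size < 1:
        raise ValueError("page must be >= 0 and page_size must be >= 1")
    return {"query": query, "from": page * page_size, "size": page_size}


third_page = search_body({"match_all": {}}, page=2)
print(json.dumps(third_page))
# {"query": {"match_all": {}}, "from": 20, "size": 10}
```

The printed JSON is what would be passed to curl with -d.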

###QUERY:

  • { "match_all": {} }
  • { "match": { "tweet": "elasticsearch" } }
  • term is used for exact-term queries, whereas match is used for full-text queries and works based on the analyzer used:
    • { "term": { "date": "2014-09-01" }}
  • The terms filter is the same as the term filter, but allows you to specify multiple values to match. If the field contains any of the specified values, the document matches:
    • { "terms": { "tag": [ "search", "full_text", "nosql" ] }}
  • The range filter matches values that fall within the given bounds:
    • { "range": { "age": { "gte": 20, "lt": 30 } } }
  • The exists filter matches documents that have a value for the given field:
    • { "exists": { "field": "title" } }
  • The multi_match query runs the same match query on multiple fields:
    • { "multi_match": { "query": "full text search", "fields": [ "title", "body" ] } }
  • The bool query combines other queries: must clauses have to match, must_not clauses must not match, and should clauses raise the relevance score when they match:
    • { "bool": { "must_not": { something }, "should": { something }, "must": { something }}}
  • The filtered query combines a query with a filter (deprecated as of Elasticsearch 2.0 in favor of a bool query with a filter clause):
    • { "filtered": { "query": { "match_all": {}}, "filter": { "term": { "folder": "inbox" }} } }
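
The query fragments above are plain JSON, so they can be composed programmatically. The following Python sketch uses hypothetical helper functions (`term`, `match`, `range_query`, `bool_query`) to build a bool query from smaller clauses; the field names come from the examples on this page, and the combination itself is only an illustration.

```python
import json


def term(field, value):
    return {"term": {field: value}}


def match(field, text):
    return {"match": {field: text}}


def range_query(field, **bounds):
    # bounds are range operators such as gte=20, lt=30
    return {"range": {field: bounds}}


def bool_query(must=None, should=None, must_not=None):
    """Combine clause lists into a bool query, leaving out empty sections."""
    sections = {"must": must, "should": should, "must_not": must_not}
    return {"bool": {name: clauses for name, clauses in sections.items() if clauses}}


query = bool_query(
    must=[match("tweet", "elasticsearch")],
    should=[range_query("age", gte=20, lt=30)],
    must_not=[term("folder", "inbox")],
)

# This is the value that takes the place of QUERY in the request body.
print(json.dumps(query, indent=2))
```

Keeping each clause as a small function makes it harder to nest one wrapper inside another by accident, which is an easy mistake when writing the JSON by hand.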