Chapter 6 is done

ayende committed Aug 31, 2014
1 parent b9f2a8b commit 43af156

Showing 5 changed files with 176 additions and 6 deletions.

171 changes: 167 additions & 4 deletions Ch06/Ch06.md
```{.cs}
while (databaseIsRunning) {
		// (elided: if there is nothing new to index, wait for work, then)
		continue;
	}
	var docsToIndex = LoadDocumentsAfter(lastIndexEtag)
		.Take(autoTuner.BatchSize);
	foreach (var indexEntry in indexingFunc(docsToIndex)) {
		StoreInIndex(indexEntry);
	}
	SetLastIndexEtag("Products/ByName", docsToIndex.Last().Etag);
}
```

I'll repeat: Listing 6.4 shows the _conceptual_ model; the actual implementation is very different. But the overall process is similar in intention, if not in practice.

Indexing works by pulling a batch of documents from the document storage, applying the indexing function to them, and then writing the resulting index entries to the index. We update the last indexed etag and repeat the whole process. When we run out of documents to index, we wait until a document is added or updated, and the whole process starts anew.

This means that when we index a batch of documents, we only need to update the index with their changes; there is no need to redo all the work from scratch. You can also see in the code that we process the documents in batches (which are auto-tuned for best performance). Even though the code in Listing 6.4 is far away from how it actually works, it should give you a good idea of the overall process.

## The indexing process

When there isn't a lot of work to do, the batch sizes are going to be small, favoring low latency over high throughput.

When we have a large number of items to index, we will slowly move to a larger batch size and process more items per batch. This is the high throughput (but high latency) indexing strategy that we talked about. When the load goes down, we'll decrease the batch size back to its initial level. This seemingly simple change has dramatic implications for the way RavenDB can handle spikes in traffic. We'll automatically (and transparently) adjust to the new system conditions. That is much nicer than waking your administrators at 3 AM.
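
As a minimal sketch (the class name, limits and growth factor here are all hypothetical, not RavenDB's actual tuner), the auto-tuning logic amounts to growing the batch when we keep finding full batches of work and shrinking it back once the load drops:

```{.cs}
using System;

// A sketch of auto-tuned batch sizing; the limits and the doubling
// strategy are assumed values, not RavenDB's actual tuner.
public class BatchSizeAutoTuner
{
    private const int InitialBatchSize = 512;
    private const int MaxBatchSize = 16 * 1024;

    public int BatchSize { get; private set; } = InitialBatchSize;

    public void RecordBatch(int documentsIndexed)
    {
        if (documentsIndexed >= BatchSize)
        {
            // We filled the whole batch, so more work is likely waiting:
            // move toward high throughput (and higher latency).
            BatchSize = Math.Min(BatchSize * 2, MaxBatchSize);
        }
        else
        {
            // We drained the pending documents: fall back toward the
            // small, low-latency batches used when the system is quiet.
            BatchSize = Math.Max(BatchSize / 2, InitialBatchSize);
        }
    }
}
```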

### What gets indexed?

This is an interesting question, and it often trips people up. Let us assume that we have the two indexes shown in Listing 6.5.

```{caption="{A couple of index definitions}" .cs}
// users/ByNameAndEmail
from user in docs.Users
select new { user.Name, user.Email }

// orders/ByOrderIdAndCustomerEmail
from order in docs.Orders
select new { order.OrderId, order.Customer.Email }
```

The first indexes users by name and email, and the second allows us to query orders by order id or by the customer's email. What would happen when we create a new `Order` document?

Well, we need to execute all the relevant indexes on it, and at first glance, it appears that we only need to index this document using the second index. The problem is that we don't know that yet. More to the point, we don't have a good way of _knowing_ it. We determine whether an index has work to do by checking its last indexed etag and comparing it to the latest document etag. That information doesn't take into account which collection each document belongs to.
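
Conceptually, that check is nothing more than comparing two monotonically increasing values. A sketch (real etags are 128-bit values; a `long` stands in for them here):

```{.cs}
// Sketch only: RavenDB etags are 128-bit, monotonically increasing
// values; a long stands in for them in this illustration.
public static class IndexWorkCheck
{
    public static bool HasWorkToDo(long lastIndexedEtag, long lastDocumentEtag)
    {
        // A document written after our last indexed position means there is
        // *something* to look at -- but it tells us nothing about which
        // collection the newer documents belong to.
        return lastDocumentEtag > lastIndexedEtag;
    }
}
```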

Because of that, all indexes need to process all documents, if only to verify that they don't care about them. At least, that is how it works in theory.

> Remember, this is a deep dive into implementation details. Please understand that the following details are just that, implementation details, and are subject to change in the future.

In practice, we take quite a bit of advantage of the nature of RavenDB to optimize how this works. We still need to read the documents, but we can figure out ahead of time that a certain document isn't a match for a specific index, so we can avoid even handing the document to the index. That means we can simply update the last indexed etag of that index past any document that doesn't match the collections that the index operates on.
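
A sketch of that optimization (all names here are hypothetical; the real code is considerably more involved):

```{.cs}
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical document shape for this sketch.
public class Doc
{
    public long Etag;
    public string EntityName; // the Raven-Entity-Name metadata value
}

public static class CollectionFiltering
{
    // Hand the index only documents from collections it covers, but move
    // its last indexed etag past *every* document in the batch, so the
    // non-matching documents are never offered to this index again.
    public static long IndexBatch(
        List<Doc> batch,
        HashSet<string> forEntityNames,
        Action<Doc> indexDocument,
        long lastIndexedEtag)
    {
        foreach (var doc in batch.Where(d => forEntityNames.Contains(d.EntityName)))
            indexDocument(doc);

        return batch.Count == 0 ? lastIndexedEtag : batch.Max(d => d.Etag);
    }
}
```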

That works quite efficiently in reducing the amount of work required when we are indexing only new and updated documents, but things get more complex when we need to deal with the creation of a new index.

### Introducing a new index

I mentioned already that in RavenDB, the process of adding a new index is both expected and streamlined. While this is indeed the case, that does not mean that it is a _simple_ process. A new index requires that we index all the existing documents in the database. And we don't know, at this time, which collections they actually belong to. So if we have ten million documents in the database and we introduce a new index, we'll need to read them all, send the matching documents to the index, and discard the rest. That is the case even if your index only covers a collection that has a mere hundred documents.

As you can imagine, this is quite expensive, and not something that we are very willing to do. Because of that, there are several optimizations applied during this process. The first is to use the system index `Raven/DocumentsByEntityName` to find out exactly how many documents we have to cover. If the number is small (under 131,072 documents, by default), we'll just load all the relevant documents and index them on the spot. This gives us a great advantage when creating new indexes on small collections, because we can catch up very quickly.

However, that doesn't really help us in the case of bigger collections. What happens then? At this point, one of two strategies comes into play, depending on the exact situation. If there aren't enough resources, the database will split the work between the new index and the existing indexes. The new index will get a chance to run and index a batch of documents, then the existing indexes will get a chance to run over any new documents that came in, and so on. RavenDB is biased in this situation toward the existing indexes, because we don't want to stall them too much. That might mean that the new index will take time to build completely, but that isn't generally an issue; it is a new index, and it is expected to take some time.

If there is a wealth of resources to exploit, however, RavenDB will choose a different strategy. It will create a dedicated indexing task that will run in the background, separate from the normal indexing process and in parallel to it. This indexing task will try to get as many documents indexed for the new index as it possibly can, as fast as it can. This generally requires more resources (memory, CPU & I/O), so even though this is the preferred strategy, we can't always apply it.
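
Putting those pieces together, the decision looks roughly like this. A sketch under assumed names; only the 131,072 default threshold comes from the text, everything else is illustrative:

```{.cs}
using System;

// A sketch of how a new index gets filled; the type, method and
// parameter names are hypothetical.
public static class NewIndexStrategy
{
    private const int SmallIndexThreshold = 131072; // documented default

    public static void IntroduceNewIndex(
        int matchingDocCount,        // from Raven/DocumentsByEntityName
        bool hasSpareResources,      // memory, CPU & I/O headroom
        Action indexAllMatchingDocsNow,
        Action interleaveWithExistingIndexes,
        Action runDedicatedParallelIndexingTask)
    {
        if (matchingDocCount < SmallIndexThreshold)
        {
            // Small collection: load and index everything on the spot.
            indexAllMatchingDocsNow();
        }
        else if (hasSpareResources)
        {
            // Preferred for big collections: a background task that races
            // through existing documents, in parallel with normal indexing.
            runDedicatedParallelIndexingTask();
        }
        else
        {
            // Constrained resources: alternate between the new index and
            // the existing ones, biased toward the existing indexes.
            interleaveWithExistingIndexes();
        }
    }
}
```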

At any rate, introducing a new index is a well-oiled process by now, even on large databases. It is safe enough that we let an automated system decide when we need a new index.

### I/O Considerations

I mentioned that successfully running Lucene in production is somewhat of a hassle.

All of that requires quite a bit of expertise. We've talked about how RavenDB achieves safety with indexes in the previous section. The other issues are also handled for you by RavenDB. I know that the previous list can make Lucene look scary, but I think that Lucene is a great library, and it is a great solution for handling search.

## Transformations on the indexing function

So far, we have gone through the details of how RavenDB indexes work (transforming a set of input documents into Lucene index entries), and we spent a considerable amount of time diving into the details behind the actual indexing process itself. Now I want to focus primarily on what RavenDB does with the index definitions you create. Let us look at Listing 6.6 and Listing 6.7, which show a simple index definition and what RavenDB does with it, respectively.

```{caption="{A simple index definition}" .cs}
// users/ByNameAndEmail
from user in docs.Users
select new { user.Name, user.Email }
```

What would happen when we create such an index in RavenDB? The RavenDB engine will transform the simple index definition in Listing 6.6 into the class shown in Listing 6.7. You can look at that and see the actual indexing work that is being done by RavenDB^[You can always get the source for each index definition by going to the following url: http://localhost:8080/databases/Northwind/users/ByNameAndEmail?source=yes (naturally, you'll need to change the host, database and index names according to your own system).].
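
For example, you could pull that generated source with a few lines of client code. A sketch; the URL is the one from the footnote, everything else is plain `HttpClient` usage:

```{.cs}
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Fetch the generated C# source of an index over HTTP. Change the
// host, database and index names to match your own system.
class FetchIndexSource
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            var source = await client.GetStringAsync(
                "http://localhost:8080/databases/Northwind" +
                "/users/ByNameAndEmail?source=yes");
            Console.WriteLine(source);
        }
    }
}
```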

```{caption="{The generated index class in RavenDB}" .cs}
public class Index_users_ByNameAndEmail :
    Raven.Database.Linq.AbstractViewGenerator
{
    public Index_users_ByNameAndEmail()
    {
        this.ViewText = @"
from user in docs.Users
select new { user.Name, user.Email } ";
        this.ForEntityNames.Add("Users");
        this.AddMapDefinition(docs =>
            from user in ((IEnumerable<dynamic>)docs)
            where string.Equals(
                user["@metadata"]["Raven-Entity-Name"], "Users",
                System.StringComparison.InvariantCultureIgnoreCase)
            select new {
                user.Name,
                user.Email,
                __document_id = user.__document_id
            });
        this.AddField("__document_id");
        this.AddField("Name");
        this.AddField("Email");
        this.AddQueryParameterForMap("__document_id");
        this.AddQueryParameterForMap("Name");
        this.AddQueryParameterForMap("Email");
        this.AddQueryParameterForReduce("__document_id");
        this.AddQueryParameterForReduce("Name");
        this.AddQueryParameterForReduce("Email");
    }
}
```

All indexes inherit from the `AbstractViewGenerator` class, and the actual indexing work is done in the lambda passed to the `AddMapDefinition` call. You can see how we changed the index definition. The `docs.Users` call, which looks like a collection reference, was changed to the more accurate `where` statement, which filters out items from other collections. You can also see that in addition to the properties that we want to index, we also include the `__document_id` property.

Note that we keep the original index definition in the `ViewText` property (mostly for debug purposes), and that we keep track of the entity names each index covers. The latter is very important for optimizations, since we can make decisions based on this information (such as which documents we do _not_ need to send to this index).

> **The RavenDB Indexing Language**
>
> On the surface, it looks like the indexing language RavenDB uses is C# LINQ expressions. And that is true, up to a point. In practice, we have taken the C# language prisoner and made it jump through many hoops to make our indexing story easy and seamless.
>
> The result isn't actually plain C#. For example, there are no nulls; instead, we use the Null Object pattern to avoid dealing with NullReferenceExceptions. Another change is that the language we use isn't strongly typed and won't error on missing members.
>
> All of that said, you can actually debug a RavenDB index in Visual Studio, because however much we twist the language's arm, we end up compiling to C#.
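
To see what the "no nulls" point above means in practice, here is a tiny stand-in for that behavior. A sketch built on `DynamicObject`; RavenDB's actual null object implementation differs:

```{.cs}
using System.Dynamic;

// A sketch of the Null Object pattern: accessing a member of "null"
// yields another null object instead of throwing.
public class NullObject : DynamicObject
{
    public static readonly NullObject Instance = new NullObject();

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        result = Instance; // user.Missing.AlsoMissing never throws
        return true;
    }

    public override string ToString() => string.Empty;
}
```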

The rest of the calls (`AddField`, `AddQueryParameterForReduce`, `AddQueryParameterForMap`) are mostly there for bookkeeping purposes, and are used by the query optimizer to decide whether an index should get to handle a specific query.

## Error handling

We try very hard to ensure that an index can't actually generate errors, but in the real world, that isn't an attainable goal. So the question becomes: what is going to happen when an index runs into an error? Such errors can be divided into several categories.

The indexing function can run into trouble. The easiest way to reproduce that is to have a division in the indexing function and have the denominator set to zero. That obviously causes a DivideByZeroException. What happens then? The indexing process will terminate the indexing of the document that caused the issue, and an error will be logged. You can see the indexing errors in the studio and in the index statistics.
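
For instance, a hypothetical index definition like the following, in the same style as the earlier listings, will throw during indexing for any order whose `Discount` is zero:

```{.cs}
// A hypothetical index that errors on some documents: any order with
// Discount == 0 triggers a DivideByZeroException while being indexed.
from order in docs.Orders
select new { PricePerDiscountPoint = order.Total / order.Discount }
```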

Along with the actual error message, you'll get the faulting index and the document that caused the problem. In general, errors in indexes aren't a problem, but because an error stops a document from being indexed (only by the index that actually raised the error, mind you), it can be hard to understand why. A query on the index won't fail if some documents failed to be indexed; you need to explicitly check the stats page (or the database statistics) to see the actual error.

If a large percentage of the documents are causing errors (over 15%, once we are past some initial number of documents), however, we'll mark that index as faulty. At that point, it will no longer be active and won't take part in indexing. Any queries made to the index will result in an exception being thrown. You'll need to fix the index definition so it stops throwing so many errors before it can resume standard operation.
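
A sketch of that rule; the 15% figure is from the text above, while the minimum-attempts floor is an assumed value:

```{.cs}
// Decide whether an index has crossed the failure threshold. The 15%
// figure comes from the text; the grace period of 100 attempts is assumed.
public static class IndexFailureRule
{
    public static bool IsFaulty(long attempts, long errors)
    {
        const long minimumAttempts = 100; // assumed initial grace period
        if (attempts < minimumAttempts)
            return false;
        return errors * 100 / attempts > 15;
    }
}
```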

Another type of indexing error relates to actual problems in the indexing environment. For example, the indexing disk might be full. This will cause the indexing process to fail, although that won't count against the 15% failure quota for the index. You'll be able to see the warnings about those failures in the log.

## What about deletes?

So far, we have talked a lot about how indexing works for new or updated documents. But how do indexes work when we _delete_ documents? The short answer is that this is a much more complicated process, and it was quite annoying to have to deal with it.

The basic process goes like this: whenever a document is deleted, we check all the indexes for those that would cover this particular document. We then generate a `RemoveFromIndexTask` background task for each of those indexes. We save that background task in the same transaction as the document deletion. The indexing process will check for any pending tasks as part of its normal operation, and it will load and execute those tasks.

In this case, the work this task does is to remove the relevant documents from the indexes in question. The process is quite optimized, and we'll merge similar tasks into a single execution to reduce the overall cost. That said, mass deletion in particular is a costly operation in RavenDB.

Note that as soon as the document is actually deleted from the document store, we won't be returning it from any queries; the purpose of the `RemoveFromIndexTask` is to clean up the indexes more than anything else.
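
The merging optimization, in sketch form. All names here are illustrative except `RemoveFromIndexTask`, which the text mentions:

```{.cs}
using System.Collections.Generic;
using System.Linq;

// Illustrative shape of a pending cleanup task.
public class RemoveFromIndexTask
{
    public string IndexName;
    public List<string> DocumentKeys = new List<string>();
}

public static class DeleteCleanup
{
    // Merge pending tasks for the same index into a single execution,
    // so a mass delete doesn't hit the index once per document.
    public static IEnumerable<RemoveFromIndexTask> MergePendingTasks(
        IEnumerable<RemoveFromIndexTask> pending)
    {
        return pending
            .GroupBy(t => t.IndexName)
            .Select(g => new RemoveFromIndexTask
            {
                IndexName = g.Key,
                DocumentKeys = g.SelectMany(t => t.DocumentKeys)
                                .Distinct()
                                .ToList()
            });
    }
}
```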

## Indexing priorities

Users frequently request better ways to control the indexing process priorities: to decide that a particular index is very important and should be given priority above all other indexes. While this seems to be a reasonable request, it opens up a lot of very complex issues. In particular, how do you prevent starvation of the other indexes if the very important index is running all the time?

So instead of implementing a `ThisIsVeryImportantIndex` flag, we switched things around and allow you to indicate that a particular index isn't that important. The following indexing priorities are supported:

* Normal - The default, execute this index as fast as possible.
* Idle - Only execute this index when there is no other work to be done.
* Abandoned - Only execute this index when there is no other work to be done, there hasn't been any work for a while, and it has been a long time since we last ran it.
* Disabled - Don't index at all.
* Error - This index has too many errors and has failed.

`Idle` indexing will happen only when RavenDB doesn't have anything else to do. `Abandoned` is very similar to `Idle`, but it won't trigger just because we have nothing to do right now. It will only trigger if we haven't had anything to do for a _long_ while. The expectation is that abandoned indexes will run whenever you have a long idle period, for example, at night or over the weekend.
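
In sketch form, the per-index scheduling decision might look like this. The priority values match the list above; the timing thresholds and all names are assumed for illustration:

```{.cs}
using System;

// The priority values come from the list above; the idle thresholds
// are assumed, not RavenDB's actual values.
public enum IndexingPriority { Normal, Idle, Abandoned, Disabled, Error }

public static class IndexingScheduler
{
    public static bool ShouldIndexNow(
        IndexingPriority priority,
        TimeSpan timeSinceAnyWork)   // how long the database has been quiet
    {
        switch (priority)
        {
            case IndexingPriority.Normal:
                return true;
            case IndexingPriority.Idle:
                // Run only when there is no other work to do.
                return timeSinceAnyWork > TimeSpan.FromMinutes(1);
            case IndexingPriority.Abandoned:
                // Run only after a *long* quiet period (nights, weekends).
                return timeSinceAnyWork > TimeSpan.FromHours(3);
            default:
                // Disabled and Error indexes never run.
                return false;
        }
    }
}
```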

> **Why not have a Priority level?**
>
> Any priority scheme has to deal with starvation issues. And while it seems like a technical detail, there is a big difference in expectations between having one index set to idle and another set to normal, and having one index set to normal and the other set to high priority.
>
> In the first case, it is easy to understand that the idle index won't run as long as the normal index has work to do. In the second case, you probably want both to run, just with the high priority index running more often or with greater frequency.
>
> The problem with starvation prevention is that you have to punish the important index at some point, by blocking it and running the other indexes. At that point, they have a lot of work to do, so they can take a long time to run, and you have defeated the whole point of priorities.
>
> It might be a semantic difference, but I feel that this way clearly states what is going to happen, and reduces surprises down the road.

Note that the query optimizer will play with those options for the auto indexes that it creates, but it won't interfere with indexes that were created explicitly by the user.

## Summary

We've gone over the details of how RavenDB indexing actually _works_. Hopefully not in mind-numbing detail. Those details are not important during your work with RavenDB, all of that is very well hidden under the covers, but it is good to know how RavenDB will respond to changing conditions.

We started by talking about the logical view of indexing in RavenDB: how an indexing function outputs index entries that will be stored in a Lucene index, and how queries go against that index to find the matching document ids, which are then pulled from document storage. We then talked about incremental indexing and the conceptual process by which RavenDB actually indexes documents.

From the conceptual level, we moved to the actual implementation details, including the set of tradeoffs that we have to make in indexing between I/O, CPU and memory usage. We looked at how we deal with each of those issues: optimizing I/O by prefetching documents and batching writes, optimizing memory by auto-tuning the batch size, and optimizing CPU usage by parallelizing work (but not too much).

We also talked about what actually gets indexed, and how we optimize things so an index doesn't have to go through all documents, only those relevant to it. Then we talked about the new index creation strategies, and how we try to make sure that this is as efficient as possible while still letting the system operate normally.

We then talked a bit about Lucene: how we actually manage the index, safeguard it from corruption and handle recovery, in particular by managing the state of the index outside of Lucene and checking the recovered state in case of a crash.

We concluded the chapter by talking about the actual code that gets run as part of the index, error handling and recovery during indexing, and the details of index priorities and why they are set up the way they are.

I hope that this peek behind the curtain doesn't make you lose any faith in the magical properties of RavenDB; pay no attention to the man behind the screen, as the Wizard said. Even after knowing how everything works, it still seems magical to me. And one of the most magical features in RavenDB is the topic of the next chapter: how RavenDB allows ad-hoc queries by using automatic indexing and the query optimizer.
4 changes: 4 additions & 0 deletions Ch07/Ch07.md

# Automatic Indexes & the query optimizer


3 changes: 3 additions & 0 deletions Part2.md

In this part, we'll learn about indexing and querying:

* Deep dive into RavenDB Indexing implementation
* Ad-hoc queries, automatic indexes and the query optimizer

* Why do we need indexes?
* Dynamic & static indexes
* Simple (map only) indexes