
[BUG] NoSuchFileException in CompositeDirectory.sync #19658

@himkak

Description


Describe the bug

Hi Team,

While indexing into a warm index for a long duration, I get the below error when cache eviction happens:

...[WARN ][org.opensearch.index.engine.Engine][flush][T#1]] failed engine [lucene commit failed]
java.nio.file.NoSuchFileException: ...\nodes\0\indices\m2BcHMJvRQWemyt7A7fNZQ\0\index_clt.si
at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:85)
at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:108)
at java.base/sun.nio.fs.WindowsFileSystemProvider.newFileChannel(WindowsFileSystemProvider.java:119)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:309)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:369)
at org.apache.lucene.util.IOUtils.fsync(IOUtils.java:427)
at org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:298)
at org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:250)
at org.opensearch.index.store.CompositeDirectory.sync(CompositeDirectory.java:296)
at org.apache.lucene.store.FilterDirectory.sync(FilterDirectory.java:86)
at org.apache.lucene.store.FilterDirectory.sync(FilterDirectory.java:86)
at org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5657)
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3784)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4125)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4087)
at org.opensearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2585)
at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1916)
at org.opensearch.index.shard.IndexShard.flush(IndexShard.java:1629)
at org.opensearch.index.shard.IndexShard$8.doRun(IndexShard.java:4824)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:975)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)

Sharing below the exact line where the error occurred. I added some logs to debug, so the line number will not match the 3.2 code:

CompositeDirectory.sync(CompositeDirectory.java:296) : localDirectory.sync(fullFilesToSync);

This is a race condition issue.
It happens for a file (F1) that is being synced as part of an IndexWriter commit, but was created after the most recent remote metadata file (M1). F1 has already been uploaded to remote and deleted locally. However, before the next metadata file (which would contain F1's name) was created, CompositeDirectory.sync was invoked; it read M1 to determine which files exist in remote, and M1 had no entry for F1.

Order of events (identified from logs, plus some assumptions):

  1. As part of RemoteStoreRefreshListener, metadata file M1 got created. File F1 did not exist yet, so M1 has no entry for it.
  2. Segment file F1 got created.
  3. File F1 got uploaded to remote.
  4. An IndexWriter commit got triggered, which in prepareCommitInternal identified F1 as one of the files to be synced.
  5. CompositeDirectory.sync got called to sync F1 along with the other files.
  6. File F1 was unpinned from the cache and got evicted.
  7. In CompositeDirectory.sync, metadata M1 was read to identify the files in remote, so F1 was never filtered out.
  8. Metadata M2 got created, which has an entry for file F1.
  9. In CompositeDirectory.sync, localDirectory.sync -> fsync got executed for file F1. As the file is no longer present locally, NoSuchFileException is thrown.
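The interleaving above can be sketched with a simplified model of the filtering step (the names below are illustrative, not the real CompositeDirectory API): files already listed in the latest remote metadata are assumed durable and skipped, and everything else is handed to localDirectory.sync for an fsync. With a stale M1, F1 survives the filter even though it no longer exists locally:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SyncRaceSketch {
    // Simplified, hypothetical model of CompositeDirectory.sync's filtering:
    // names present in the remote metadata are skipped; the rest get fsynced.
    static List<String> filesToFsync(Collection<String> requested, Set<String> remoteMetadata) {
        List<String> result = new ArrayList<>();
        for (String name : requested) {
            if (!remoteMetadata.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // M1 was written before F1 ("_1.si") existed, so it has no entry for it.
        Set<String> m1 = Set.of("_0.si", "_0.cfs");
        Set<String> localFiles = new HashSet<>(Set.of("_0.si", "_0.cfs", "_1.si"));

        // Steps 3 and 6: F1 is uploaded to remote, then evicted from local.
        localFiles.remove("_1.si");

        // Step 7: sync filters against the stale M1, so F1 survives the filter.
        List<String> toSync = filesToFsync(List.of("_0.si", "_0.cfs", "_1.si"), m1);

        // Step 9: an fsync of a file that no longer exists locally throws
        // NoSuchFileException, which fails the Lucene commit.
        for (String name : toSync) {
            if (!localFiles.contains(name)) {
                System.out.println("fsync(" + name + ") would throw NoSuchFileException");
            }
        }
    }
}
```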

Proposed solutions:

  1. localDirectory.sync can take time when there are a lot of files, so the fsync for F1 can happen well after F1 got deleted.
    We could do the sync operations in batches, so that we keep pulling the most recent metadata file between batches.
  2. If NoSuchFileException comes, do a retry.
  3. A file should be uploaded only after the sync has happened, and only after that should it be unpinned.
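A minimal sketch of solution 2, assuming a hypothetical retry wrapper (LocalSync, syncWithRetry, and latestRemoteMetadata are illustrative names, not the real OpenSearch API). On NoSuchFileException it re-reads the latest remote metadata, drops names that are now covered by it (they are durable in remote and need no local fsync), and retries:

```java
import java.io.IOException;
import java.nio.file.NoSuchFileException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Set;
import java.util.function.Supplier;

public class RetrySyncSketch {

    @FunctionalInterface
    interface LocalSync {
        void sync(Collection<String> names) throws IOException;
    }

    // Hypothetical sketch: retry the local sync, filtering against fresh
    // remote metadata after each NoSuchFileException.
    static void syncWithRetry(Collection<String> names,
                              LocalSync localSync,
                              Supplier<Set<String>> latestRemoteMetadata,
                              int maxRetries) throws IOException {
        Collection<String> remaining = new ArrayList<>(names);
        for (int attempt = 0; ; attempt++) {
            try {
                localSync.sync(remaining);
                return;
            } catch (NoSuchFileException e) {
                if (attempt >= maxRetries) {
                    throw e; // still missing and not covered by metadata: real failure
                }
                // Files now listed in the latest metadata are durable in remote.
                remaining.removeIf(latestRemoteMetadata.get()::contains);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Set<String> localFiles = Set.of("_0.si", "_0.cfs"); // "_1.si" already evicted
        LocalSync localSync = names -> {
            for (String name : names) {
                if (!localFiles.contains(name)) {
                    throw new NoSuchFileException(name);
                }
            }
        };
        // By the time we retry, metadata M2 (which lists "_1.si") is available.
        Supplier<Set<String>> latestMetadata = () -> Set.of("_1.si");

        syncWithRetry(java.util.List.of("_0.si", "_0.cfs", "_1.si"),
                localSync, latestMetadata, 3);
        System.out.println("sync succeeded after retry");
    }
}
```

Solution 3 avoids the race entirely but changes the upload/unpin ordering; the retry keeps the current ordering at the cost of an extra metadata read on the failure path.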

Tagging @rayshrey, as you have written most of the code involved in this flow. It would be great if you could help with this issue.

Related component

Storage:Remote

To Reproduce

  1. Create a warm index
  2. Set node.search.cache.size to a smaller value, like 100 GB. Let's assume the perSegmentCache capacity is 5 GB.
  3. Keep indexing, so that eviction keeps happening from the cache but we don't hit the cache capacity too soon.
  4. At some point this issue will be observed and the index will go red.

Expected behavior

This error should not happen: each file being synced should either be present locally or have an entry in the remote metadata file.
NoSuchFileException should not occur, and indexing should keep happening seamlessly.

Additional Details

Plugins
repository-azure plugin

Host/Environment (please complete the following information):

  • OS: Windows
  • Version : 3.2
