
[Feature Request] Allow the OpenSearch source plugin to shut down the Data Prepper pipeline #2944

Open
kartg opened this issue Jun 27, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@kartg
Member

kartg commented Jun 27, 2023

Is your feature request related to a problem? Please describe.
Hello! We're working on changing the opensearch-migrations tooling so our data migration implementation uses Data Prepper instead of Logstash. We will be leveraging the newly minted OpenSearch/ElasticSearch source plugin for this.

Since this is a pull-based plugin, there is a finite set of data that needs to be ingested. Once all of the data has been processed, our expectation is that the Data Prepper pipeline would shut itself down based on a signal from the source plugin. This is similar to how pull-based plugins function in Logstash.

However, Data Prepper does not currently operate this way. The pipeline/process continues to stay alive (though the source plugin is not pulling any more data) until the caller terminates it or shuts it down via APIs.

Describe the solution you'd like
Once a pull-based source plugin has completed ingesting all data, it should signal to the Data Prepper pipeline, and the pipeline should shut itself down.
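One way the signal could be modeled (purely illustrative; these interfaces are not existing Data Prepper APIs) is a completion callback that the pipeline hands to a pull-based source, which the source invokes once all data has been ingested:

```java
import java.util.List;

// Hypothetical callback interface: the pipeline would hand this to a
// pull-based source, and the source invokes it when ingestion is done.
interface SourceCompletionListener {
    void onSourceComplete();
}

// Hypothetical finite source that signals completion after the last record.
class FiniteSource {
    private final SourceCompletionListener listener;

    FiniteSource(final SourceCompletionListener listener) {
        this.listener = listener;
    }

    void run(final List<String> records) {
        for (final String record : records) {
            process(record);
        }
        listener.onSourceComplete(); // all data ingested; ask the pipeline to stop
    }

    private void process(final String record) {
        // In a real source this would write the record to the pipeline buffer.
    }
}
```

The pipeline would implement the listener and begin its normal shutdown sequence when notified.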

Describe alternatives you've considered (Optional)
An alternative approach would be to have the pipeline / source plugin signal externally (to the caller) that all data has been processed. The caller can then invoke the Data Prepper shutdown API to stop the process.

Additional context
N/A

@graytaylor0
Member

graytaylor0 commented Jun 27, 2023

Thanks for creating this issue @kartg!

While the mechanism for shutting down pipelines based on notifications from sources would be abstracted away from the OpenSearch source itself, what do you think the appropriate behavior would be from the OpenSearch source's perspective?

My proposal would be to have a configurable value for the number of consecutive times the OpenSearch source has attempted to acquire an index with nothing being returned. The current behavior is to wait 30 seconds if no indices are found before attempting to acquire an index again, which would also run the partition supplier to pick up new indices. The location where this happens is here.

The config could be added to the scheduling config (the parameter naming needs some work):

scheduling:
   shutdown_after_no_indices_found_count: 3

This would mean that if the OpenSearch source does not get any indices to process after 3 attempts (roughly 1 min 30 seconds), then it would issue a shutdown of the pipeline with whatever mechanism is supported for that.
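The counter logic could look something like this minimal sketch (class and method names are hypothetical, not Data Prepper APIs): each empty acquisition attempt increments a counter, any successful acquisition resets it, and reaching the configured threshold means the source should request a pipeline shutdown.

```java
// Hypothetical sketch (not a Data Prepper API): counts consecutive
// attempts that acquired no index and reports when the configured
// shutdown threshold has been reached.
class NoIndicesShutdownTracker {
    private final int shutdownAfterNoIndicesFoundCount;
    private int consecutiveEmptyAttempts = 0;

    NoIndicesShutdownTracker(final int shutdownAfterNoIndicesFoundCount) {
        this.shutdownAfterNoIndicesFoundCount = shutdownAfterNoIndicesFoundCount;
    }

    // Called after each acquisition attempt; returns true once the
    // source should request a pipeline shutdown.
    boolean recordAttempt(final boolean indexAcquired) {
        if (indexAcquired) {
            consecutiveEmptyAttempts = 0; // progress was made, so reset
        } else {
            consecutiveEmptyAttempts++;
        }
        return consecutiveEmptyAttempts >= shutdownAfterNoIndicesFoundCount;
    }
}
```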

We could also consider nesting the shutdown for future additions, such as

scheduling:
   shutdown:
       after_no_indices_found_count: 3
       # conditionally shut down with conditional expressions
       when: "/some_key == DONE"
       # shut down at a certain time
       shutdown_time: "2023-06-28T22:01:30.00Z"

@asifsmohammed asifsmohammed added enhancement New feature or request and removed untriaged labels Jun 28, 2023
@kartg
Member Author

kartg commented Jun 28, 2023

My proposal would be to have a configurable value for the number of consecutive times the OpenSearch source has attempted to acquire an index with nothing being returned.

@graytaylor0 If I understand this correctly, this would require the source indices to have a field that can be used for temporal sorting, so the approach cannot be applied more generally.

For example, consider an index that holds letters of the alphabet - (a, c, d). It's reasonable to assume that such documents would be sorted alphabetically. If b is later inserted into the index, there is no query that would return only b, so it wouldn't be possible to fetch the index with nothing being returned.

A version of this to consider is a simple run count, where the pipeline is shut down after the OpenSearch source has queried its cluster a certain number of times:

scheduling:
   shutdown:
       after_query_count: 3

I do think that after_no_indices_found_count is a good idea, as are the other ideas you've listed (when and shutdown_time). It may just be easier to implement the simplest solution first. Wdyt?

@graytaylor0
Member

graytaylor0 commented Jun 30, 2023

For example, consider an index that holds letters of the alphabet - (a, c, d). It's reasonable to assume that such documents would be sorted alphabetically. If b is later inserted into the index, there is no query that would return only b, so it wouldn't be possible to fetch the index with nothing being returned.

@kartg I’m not sure I understand what you mean here. The after_no_indices_found_count has nothing to do with the data in the indices. I think this is the best general indicator that nothing else needs to be processed.

A version of this to consider is a simple run count where the pipeline is shutdown after the OpenSearch source has queried its cluster a certain number of times:

As in put a limit on the number of indices a source node can process before shutting down?

@kartg
Member Author

kartg commented Jul 5, 2023

The after_no_indices_found_count has nothing to do with the data in the indices. I think this is the best general indicator that nothing else needs to be processed

Ah, my bad - I misunderstood the phrase "acquire an index with nothing being returned" as referring to the data within the index rather than the index itself. @graytaylor0 for my understanding (and after looking at the code you've linked above) - when would the SourceCoordinator return nothing? If some indices are already processed, are these filtered out by state maintained in the SourceCoordinationStore?

@graytaylor0
Member

for my understanding (and after looking at the code you've linked above) - when would the SourceCoordinator return nothing? If some indices are already processed, are these filtered out by state maintained in the SourceCoordinationStore ?

Nothing would be returned in lines 144-159 here:

ownedPartitions = sourceCoordinationStore.tryAcquireAvailablePartition(sourceIdentifierWithPartitionType, ownerId, DEFAULT_LEASE_TIMEOUT);

First, a call is made to attempt to acquire a partition. This returns empty if all of the indices are COMPLETED, ASSIGNED without the partition ownership timeout being reached, or CLOSED without the reOpenAt timestamp being reached. If nothing is returned on the first call to tryAcquireAvailablePartition, the supplier is run (which would create new partitions if new indices were created in the source cluster since the last time the supplier was run). After the supplier is run, another call to tryAcquireAvailablePartition is made; if there is still no available partition, the method returns empty to the source that calls getNextPartition.
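That two-phase flow can be sketched roughly as follows (names are illustrative, not the actual Data Prepper signatures): try to acquire, run the supplier if nothing was available, then try exactly once more before reporting empty.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch of the two-phase acquisition flow described above
// (names are illustrative, not the actual Data Prepper signatures).
class PartitionAcquirer<T> {
    Optional<T> getNextPartition(final Supplier<Optional<T>> tryAcquire,
                                 final Runnable partitionSupplier) {
        Optional<T> partition = tryAcquire.get();
        if (partition.isEmpty()) {
            // Nothing acquirable: every partition is COMPLETED, still ASSIGNED,
            // or CLOSED with reOpenAt in the future. Run the supplier to pick
            // up any indices created since the last run...
            partitionSupplier.run();
            // ...then try once more before reporting empty to the source.
            partition = tryAcquire.get();
        }
        return partition;
    }
}
```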

@dlvenable
Member

I like this idea. But let's be sure that the source only shuts down its own pipeline. Data Prepper core has support for shutting down the entire Data Prepper application when either a single pipeline terminates or after all pipelines terminate. So, depending on the configuration, this may or may not shut down Data Prepper entirely.
