
@DevinTDHa
Member

Description

This PR introduces further optimizations for NerDLApproach:

  1. Data can now be fed through a threaded dataloader when setEnableMemoryOptimizer(true) and setPrefetchBatches(int) are used.
  2. Dataframe partitioning is now optimized for NerDLApproach training by default; it can be disabled with setOptimizePartitioning(false).
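The threaded-dataloader idea can be sketched as follows. This is a minimal, illustrative sketch in plain Python, not Spark NLP's actual implementation: a background thread fills a bounded queue with batches so the training loop rarely waits on data loading. The names `prefetch` and `prefetch_batches` are hypothetical stand-ins for what setPrefetchBatches(int) controls.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the batch stream

def prefetch(batches, prefetch_batches=2):
    """Yield batches while a background thread loads ahead.

    The queue's maxsize bounds how many batches are buffered,
    which caps memory use while still overlapping loading with
    training on the consumer side.
    """
    q = queue.Queue(maxsize=prefetch_batches)

    def loader():
        for batch in batches:
            q.put(batch)      # blocks when the buffer is full
        q.put(_SENTINEL)      # signal that no more batches follow

    threading.Thread(target=loader, daemon=True).start()
    while True:
        batch = q.get()
        if batch is _SENTINEL:
            break
        yield batch

# Usage: batches arrive in order; loading overlaps consumption.
result = list(prefetch(iter([[1, 2], [3, 4], [5, 6]]), prefetch_batches=2))
```

The bounded queue is the key design choice: an unbounded buffer would reintroduce the memory pressure the optimizer is meant to avoid.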

Motivation and Context

  1. Training is slow on clusters, and the threaded dataloader improves training times.
  2. When using large partitions, the driver node risks running out of memory. The optimized partitioning prevents this.
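The intuition behind the partitioning optimization can be illustrated with a small sketch. This is a hedged, stdlib-only illustration, not Spark NLP's actual heuristic: if the driver processes one partition at a time, capping the rows per partition bounds peak driver memory, and the partition count follows from that cap. The function name and the row-cap parameter are hypothetical.

```python
import math

def target_partitions(total_rows: int, max_rows_per_partition: int) -> int:
    """Smallest partition count keeping every partition under the row cap.

    With this count, collecting one partition at a time never brings
    more than max_rows_per_partition rows onto the driver at once.
    """
    return max(1, math.ceil(total_rows / max_rows_per_partition))

# Example: 10 million rows capped at 50k rows per partition.
n = target_partitions(10_000_000, 50_000)  # -> 200 partitions
```

A single huge partition would instead force the driver to materialize all rows at once, which is exactly the out-of-memory failure mode described above.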

How Has This Been Tested?

Existing and newly added tests pass.

The threaded NerDLDataLoader fetches batches in the background while NerDLApproach is training, reducing idle time in the driver thread.

NerDLApproach can now repartition the input dataset, so the driver does not run out of memory when training on large partitions.