Fixes #1086 remove group_by to optimize computation and suggest parallel #1130
Conversation
pchelle commented on Oct 23, 2023
- If parallel is not installed, the regular code is used.
- The regular code avoids running group_by, which slows down the computation considerably (see the sketch after this list).
- The number of cores is taken from the SimulationSettings already set up for the PK parameter calculation task.
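As a rough illustration of the group_by point, here is a minimal sketch (not the actual package code) comparing a dplyr grouped summary with a vectorized base-R aggregation; the pkData columns and sizes are made up for the example.

library(dplyr)

# Minimal sketch, not the actual package code: a long PK data.frame with
# made-up Parameter/Value columns
pkData <- data.frame(
  Parameter = rep(c("C_max", "AUC_inf"), each = 1000),
  Value = rlnorm(2000)
)

# Grouped summary via group_by, which is what the regular code now avoids
summaryGrouped <- pkData %>%
  group_by(Parameter) %>%
  summarise(median = median(Value), .groups = "drop")

# Vectorized base-R alternative computing the same per-parameter summary
summaryVector <- tapply(pkData$Value, pkData$Parameter, median)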
@Yuri05 Could you test and let me know to what extent this update reduces the computation time of your test workflow?
@pchelle The execution time did not change - still 13.5 minutes
useParallel <- all(
  requireNamespace("parallel", quietly = TRUE),
  settings$numberOfCores > 1
)
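For context, a hypothetical sketch of how such a flag could switch between a parallel cluster and the regular serial code; the runRepetitions() and mcSample() names and the placeholder workload are illustrative, not the package implementation.

# Hypothetical sketch; runRepetitions() and mcSample() are illustrative names,
# not the package implementation
runRepetitions <- function(repetitions, useParallel, numberOfCores) {
  mcSample <- function(repetition) median(rlnorm(1000))  # placeholder workload
  if (useParallel) {
    cluster <- parallel::makeCluster(numberOfCores)
    on.exit(parallel::stopCluster(cluster))
    return(parallel::parSapply(cluster, seq_len(repetitions), mcSample))
  }
  # Regular (serial) code when parallel is not installed or only 1 core is set
  sapply(seq_len(repetitions), mcSample)
}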
From which settings does the number of cores come?
Are those the settings of the plotPKRatio task?
OK, it seems the number of cores is taken from the settings of calculatePKAnalysisTask.
If I set workflow$calculatePKParameters$settings$numberOfCores to the number of available cores on my machine (8), the execution time becomes 4.9 minutes, which is still slightly slower than the 4.7 minutes before the fix.
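For reference, a minimal sketch of that setting, assuming an existing workflow object from ospsuite.reportingengine:

# Assumes `workflow` is an existing ospsuite.reportingengine workflow object
availableCores <- parallel::detectCores()
workflow$calculatePKParameters$settings$numberOfCores <- availableCores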
The test case scenario is as follows:
- 8 PK parameters:
  pkParameters <- c('C_max', 'C_max_norm', 't_max', 'C_tEnd', 'AUC_tEnd', 'AUC_tEnd_norm', 'AUC_inf', 'AUC_inf_norm')
- 2 simulation sets
- 1000 individuals in the reference population
- 1000 individuals in the non-reference population
- 1 output per simulation set
> If I set workflow$calculatePKParameters$settings$numberOfCores to the number of available cores on my machine (8), the execution time becomes 4.9 minutes, which is still slightly slower than the 4.7 minutes before the fix.
Yes, those are the correct settings for the number of cores.
> The test case scenario is as follows:
> * 8 PK parameters
> * 2 simulation sets
> * 1000 individuals in the reference population
> * 1000 individuals in the non-reference population
> * 1 output per simulation set
That explains why mine was shorter: I used only 1 PK parameter and 1 output, so the data.frame to compare was only 1000 rows for me versus 8000 rows for you.
But I am surprised the time only decreased from the regular ~15 minutes to 13.5 minutes, since in my tests with system.time the change seemed to decrease the computation time significantly.
If I remember correctly, using vectors directly is faster than handling data.frames, so I'll try using those instead (see the sketch below).
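A rough sketch of the vector-versus-data.frame point, assuming the repeated row subsetting done inside a Monte Carlo loop; the sizes are illustrative and timings are machine-dependent.

# Rough illustration only: repeated subsetting of a data.frame column vs a
# plain numeric vector, as done inside a Monte Carlo loop
values <- rlnorm(8000)
valuesDf <- data.frame(Value = values)
sampledIds <- replicate(1000, sample.int(8000, 1000), simplify = FALSE)

system.time(for (ids in sampledIds) median(valuesDf[ids, "Value"]))
system.time(for (ids in sampledIds) median(values[ids]))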
I am also wondering why the parallelization factor is less than 3 with 8 cores... And I can see that 8 processes are indeed created in parallel.
Only the Monte Carlo sampling portion is parallelized, so the setup and PK parameter plotting times are not shortened.
Additionally, setting up the parallel environment with a big data.frame of 8000 rows may also add to the computation time: "However, it is rare that we would achieve this optimal speed-up since there is always communication between threads" (from https://csgillespie.github.io/efficientR/performance.html).
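To illustrate that fixed cost, here is a small sketch with a made-up workload, where the cluster setup and the export of a big data.frame are paid up front before any speed-up can show; this is not package code.

# Made-up workload to illustrate the fixed parallel overhead; not package code
bigData <- data.frame(IndividualId = 1:8000, Value = rlnorm(8000))

cluster <- parallel::makeCluster(8)
# Exporting the data to every worker is part of the up-front communication cost
parallel::clusterExport(cluster, "bigData", envir = environment())
timeParallel <- system.time(
  parallel::parSapply(cluster, 1:10000, function(i) median(sample(bigData$Value, 1000)))
)
parallel::stopCluster(cluster)

timeSerial <- system.time(
  sapply(1:10000, function(i) median(sample(bigData$Value, 1000)))
)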
@Yuri05
I just updated to another method which should make things much faster even without parallelization.
Briefly, before the calculation I pivoted the data.frame and translated it into a matrix object.
The benefit of numeric matrix objects is that computations can be vectorized.
If the build passes, let me know whether this updated method works as expected and is efficient.
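A rough sketch of that pivot-to-matrix idea, assuming only two PK parameters and made-up reference values; this is not the actual getPKRatioSummaryFromMCSampling internals.

# Rough sketch of the pivot + vectorization idea, not the actual
# getPKRatioSummaryFromMCSampling internals
longData <- data.frame(
  IndividualId = rep(1:1000, 2),
  Parameter = rep(c("C_max", "AUC_inf"), each = 1000),
  Value = rlnorm(2000)
)

# Pivot to a wide numeric matrix: one row per individual, one column per parameter
pkMatrix <- matrix(
  longData$Value, nrow = 1000,
  dimnames = list(NULL, c("C_max", "AUC_inf"))
)
referenceMatrix <- matrix(
  rlnorm(2000), nrow = 1000,
  dimnames = list(NULL, c("C_max", "AUC_inf"))
)

# Ratios and their summary statistics are then fully vectorized
ratioSummary <- apply(pkMatrix / referenceMatrix, 2, quantile, probs = c(0.05, 0.5, 0.95))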
Here is a sample test using populations of 1000 individuals and 8 PK parameters.
With 1 core, Monte Carlo takes 0.7 min:
popSize <- 1e3
referenceParams <- list(meanlog = 0, sdlog = 1)
comparisonParams <- list(meanlog = 1, sdlog = 2)
set.seed(1111)
# Creates a random data.frame of 8000 rows
referenceData <- data.frame(
IndividualId = rep(1:popSize, 8),
QuantityPath = "Test|Path",
Parameter = rep(
c('C_max', 'C_max_norm', 't_max', 'C_tEnd', 'AUC_tEnd', 'AUC_tEnd_norm', 'AUC_inf', 'AUC_inf_norm'),
each = popSize
),
Value = rlnorm(
n = 8*popSize,
meanlog = referenceParams$meanlog,
sdlog = referenceParams$sdlog
)
)
# Creates another random data.frame of 8000 rows
pkData <- data.frame(
IndividualId = rep(1:popSize, 8),
QuantityPath = "Test|Path",
Parameter = rep(
c('C_max', 'C_max_norm', 't_max', 'C_tEnd', 'AUC_tEnd', 'AUC_tEnd_norm', 'AUC_inf', 'AUC_inf_norm'),
each = popSize
),
Value = rlnorm(
n = 8*popSize,
meanlog = referenceParams$meanlog,
sdlog = referenceParams$sdlog
)
)
mcSolution <- ospsuite.reportingengine:::getPKRatioSummaryFromMCSampling(
pkData = pkData,
referencePKData = referenceData,
simulationSetName = "test",
settings = list(showProgress = FALSE, numberOfCores = 1, mcRepetitions = 1e4, mcRandomSeed = 3333)
)
#> 25/10/2023 - 04:32:20
#> i Info Monte Carlo Sampling for test will use 10000 repetitions with Random Seed 3333
#> 25/10/2023 - 04:33:02
#> i Info Monte Carlo Sampling completed in 0.7 min
knitr::kable(mcSolution, row.names = FALSE)
Created on 2023-10-25 with reprex v2.0.2
With 6 cores, Monte Carlo takes 0.3 min:
popSize <- 1e3
referenceParams <- list(meanlog = 0, sdlog = 1)
comparisonParams <- list(meanlog = 1, sdlog = 2)
set.seed(1111)
# Creates a random data.frame of 8000 rows
referenceData <- data.frame(
IndividualId = rep(1:popSize, 8),
QuantityPath = "Test|Path",
Parameter = rep(
c('C_max', 'C_max_norm', 't_max', 'C_tEnd', 'AUC_tEnd', 'AUC_tEnd_norm', 'AUC_inf', 'AUC_inf_norm'),
each = popSize
),
Value = rlnorm(
n = 8*popSize,
meanlog = referenceParams$meanlog,
sdlog = referenceParams$sdlog
)
)
# Creates another random data.frame of 8000 rows
pkData <- data.frame(
IndividualId = rep(1:popSize, 8),
QuantityPath = "Test|Path",
Parameter = rep(
c('C_max', 'C_max_norm', 't_max', 'C_tEnd', 'AUC_tEnd', 'AUC_tEnd_norm', 'AUC_inf', 'AUC_inf_norm'),
each = popSize
),
Value = rlnorm(
n = 8*popSize,
meanlog = referenceParams$meanlog,
sdlog = referenceParams$sdlog
)
)
mcSolution <- ospsuite.reportingengine:::getPKRatioSummaryFromMCSampling(
pkData = pkData,
referencePKData = referenceData,
simulationSetName = "test",
settings = list(showProgress = FALSE, numberOfCores = 6, mcRepetitions = 1e4, mcRandomSeed = 3333)
)
#> 25/10/2023 - 04:35:17
#> i Info Monte Carlo Sampling for test will use 10000 repetitions with Random Seed 3333
#> 25/10/2023 - 04:35:37
#> i Info Monte Carlo Sampling completed in 0.3 min
knitr::kable(mcSolution, row.names = FALSE)
Created on 2023-10-25 with reprex v2.0.2
Finally 😃