Fixes #1086 remove group_by to optimize computation and suggest parallel #1130

pchelle · 2023-10-23T15:40:36Z

If parallel is not installed, the regular code is used.
Regular code tries not to run group_by that slow down the computation by a lot
The number of cores is identified from the SimulationSettings already set up for the PK parameter calculation task

…utation and suggest parallel - If parallel is not installed, the regular code is used. - Regular code tries not to run group_by that slow down the computation by a lot - The number of cores is identified from the SimulationSettings already set up for the PK parameter calculation task

pchelle · 2023-10-24T06:29:53Z

@Yuri05 could you test and let me know to what extent this update reduces the computation time of your test workflow

Yuri05 · 2023-10-24T08:58:39Z

@pchelle The execution time did not change - still 13.5 minutes

Yuri05 · 2023-10-24T09:00:22Z

R/utilities-ratio-comparison.R

+  )
+  useParallel <- all(
+    requireNamespace("parallel", quietly = TRUE),
+    settings$numberOfCores > 1


From which settings does the number of cores come from?
Are those the settings of plotPKRatio task?

ok, seems to be the settings of calculatePKAnalysisTask where the number of cores is taken from

if I set workflow$calculatePKParameters$settings$numberOfCores to the number of available cores on my machine (8), the execution time becomes 4.9 minutes - which is still slightly slower than 4.7 minutes before the fix.

The test case scenario is as following:

8 PK parameters:

pkParameters <- c('C_max', 'C_max_norm', 't_max', 'C_tEnd', 'AUC_tEnd', 'AUC_tEnd_norm', 'AUC_inf', 'AUC_inf_norm')

2 simulation sets

1000 individuals in the reference population

1000 individuals in the non-reference population

1 Output per simulation set

Also wondering that the parallelization factor is less than 3 with 8 cores...
And I can see that 8 processes are indeed created in parallel.

if I set workflow$calculatePKParameters$settings$numberOfCores to the number of available cores on my machine (8), the execution time becomes 4.9 minutes - which is still slightly slower than 4.7 minutes before the fix.

Yes, it is the correct settings for the number of cores

The test case scenario is as following:
* 8 PK parameters:
* 2 simulation sets
* 1000 individuals in the reference population
* 1000 individuals in the non-reference population
* 1 Output per simulation set

That explains why mine was shorter, I did only 1 PK parameter and output.
Thus the data.frame to compare was only 1000 rows for me and 8000 rows for you.

But I am surprised that time did not decrease by much for the regular ~15 to 13.5 minutes. Since in my tests with system.time it seemed to decrease the computation time significantly.

If I remember using vectors of directly is faster than handling data.frames, so I I'll try using something these instead

Also wondering that the parallelization factor is less than 3 with 8 cores... And I can see that 8 processes are indeed created in parallel.

Only the Monte Carlo sampling portion is parallelized, so all the set up and PK parameter plots time are not shortened.
Additionally, setting up the parallel environment with a big data.frame of 8000 rows may also add to the computation time. "However, it is rare that we would achieve this optimal speed-up since there is always communication between threads" (from https://csgillespie.github.io/efficientR/performance.html)

@Yuri05
I just updated to another method which should make things much faster even without parallelization.
Briefly, before the calculation I pivoted the data.frame and translated it into a matrix object.
The benefit from numeric matrix objects is that computations can be vectorized.

If the build pass, let me know if this updated method works as expected and is efficient.

…o matrix

pchelle · 2023-10-25T08:38:41Z

Here is a sample test using populations of 1000 individuals and 8 pk parameters

With 1 core Monte Carlo takes 0.7 min

popSize <- 1e3
referenceParams <- list(meanlog = 0, sdlog = 1)
comparisonParams <- list(meanlog = 1, sdlog = 2)
set.seed(1111)


# Creates a random data.frame of 8000 rows
referenceData <- data.frame(
  IndividualId = rep(1:popSize, 8),
  QuantityPath = "Test|Path",
  Parameter = rep(
    c('C_max', 'C_max_norm', 't_max', 'C_tEnd', 'AUC_tEnd', 'AUC_tEnd_norm', 'AUC_inf', 'AUC_inf_norm'),
    each = popSize
    ),
  Value = rlnorm(
    n = 8*popSize,
    meanlog = referenceParams$meanlog,
    sdlog = referenceParams$sdlog
  )
)

# Creates another random data.frame of 8000 rows
pkData <- data.frame(
  IndividualId = rep(1:popSize, 8),
  QuantityPath = "Test|Path",
  Parameter = rep(
    c('C_max', 'C_max_norm', 't_max', 'C_tEnd', 'AUC_tEnd', 'AUC_tEnd_norm', 'AUC_inf', 'AUC_inf_norm'),
    each = popSize
  ),
  Value = rlnorm(
    n = 8*popSize,
    meanlog = referenceParams$meanlog,
    sdlog = referenceParams$sdlog
  )
)

mcSolution <- ospsuite.reportingengine:::getPKRatioSummaryFromMCSampling(
  pkData = pkData, 
  referencePKData = referenceData, 
  simulationSetName = "test",
  settings = list(showProgress = FALSE, numberOfCores = 1, mcRepetitions = 1e4, mcRandomSeed = 3333)
)
#> 25/10/2023 - 04:32:20
#> i Info   Monte Carlo Sampling for test will use 10000 repetitions with Random Seed 3333
#> 25/10/2023 - 04:33:02
#> i Info   Monte Carlo Sampling completed in 0.7 min

knitr::kable(mcSolution, row.names = FALSE)

SimulationSetName	QuantityPath	Parameter	N	Perc5	Perc25	Perc50	Perc75	Perc95	Mean	SD	GeoMean	GeoSD
test	Test\|Path	C_max	1000	0.0987764	0.3890666	1.0276187	2.703265	10.60052	2.763742	5.807231	1.0257840	4.159598
test	Test\|Path	C_max_norm	1000	0.0985152	0.3884350	1.0000157	2.595312	10.27465	2.686598	5.751125	1.0032445	4.109656
test	Test\|Path	t_max	1000	0.1003968	0.3795418	0.9727001	2.537205	10.38319	2.807066	7.147038	0.9898864	4.128164
test	Test\|Path	C_tEnd	1000	0.1031553	0.3952829	1.0258425	2.684064	10.55342	2.768582	5.980251	1.0325755	4.092747
test	Test\|Path	AUC_tEnd	1000	0.0974862	0.3784937	0.9847710	2.570482	10.13589	2.726136	6.443166	0.9883422	4.140941
test	Test\|Path	AUC_tEnd_norm	1000	0.0879475	0.3666436	0.9764895	2.583932	10.39213	2.725995	6.171252	0.9691336	4.277946
test	Test\|Path	AUC_inf	1000	0.0967520	0.3905456	1.0399128	2.767019	11.19236	2.911609	6.481359	1.0399910	4.246863
test	Test\|Path	AUC_inf_norm	1000	0.0919560	0.3844425	1.0292274	2.716473	10.91008	2.838040	6.212406	1.0181034	4.276537

^{Created on 2023-10-25 with reprex v2.0.2}

With 6 cores Monte Carlo takes 0.3 min

popSize <- 1e3
referenceParams <- list(meanlog = 0, sdlog = 1)
comparisonParams <- list(meanlog = 1, sdlog = 2)
set.seed(1111)


# Creates a random data.frame of 8000 rows
referenceData <- data.frame(
  IndividualId = rep(1:popSize, 8),
  QuantityPath = "Test|Path",
  Parameter = rep(
    c('C_max', 'C_max_norm', 't_max', 'C_tEnd', 'AUC_tEnd', 'AUC_tEnd_norm', 'AUC_inf', 'AUC_inf_norm'),
    each = popSize
    ),
  Value = rlnorm(
    n = 8*popSize,
    meanlog = referenceParams$meanlog,
    sdlog = referenceParams$sdlog
  )
)

# Creates another random data.frame of 8000 rows
pkData <- data.frame(
  IndividualId = rep(1:popSize, 8),
  QuantityPath = "Test|Path",
  Parameter = rep(
    c('C_max', 'C_max_norm', 't_max', 'C_tEnd', 'AUC_tEnd', 'AUC_tEnd_norm', 'AUC_inf', 'AUC_inf_norm'),
    each = popSize
  ),
  Value = rlnorm(
    n = 8*popSize,
    meanlog = referenceParams$meanlog,
    sdlog = referenceParams$sdlog
  )
)

mcSolution <- ospsuite.reportingengine:::getPKRatioSummaryFromMCSampling(
  pkData = pkData, 
  referencePKData = referenceData, 
  simulationSetName = "test",
  settings = list(showProgress = FALSE, numberOfCores = 6, mcRepetitions = 1e4, mcRandomSeed = 3333)
)
#> 25/10/2023 - 04:35:17
#> i Info   Monte Carlo Sampling for test will use 10000 repetitions with Random Seed 3333
#> 25/10/2023 - 04:35:37
#> i Info   Monte Carlo Sampling completed in 0.3 min

knitr::kable(mcSolution, row.names = FALSE)

SimulationSetName	QuantityPath	Parameter	N	Perc5	Perc25	Perc50	Perc75	Perc95	Mean	SD	GeoMean	GeoSD
test	Test\|Path	C_max	1000	0.0987764	0.3890666	1.0276187	2.703265	10.60052	2.763742	5.807231	1.0257840	4.159598
test	Test\|Path	C_max_norm	1000	0.0985152	0.3884350	1.0000157	2.595312	10.27465	2.686598	5.751125	1.0032445	4.109656
test	Test\|Path	t_max	1000	0.1003968	0.3795418	0.9727001	2.537205	10.38319	2.807066	7.147038	0.9898864	4.128164
test	Test\|Path	C_tEnd	1000	0.1031553	0.3952829	1.0258425	2.684064	10.55342	2.768582	5.980251	1.0325755	4.092747
test	Test\|Path	AUC_tEnd	1000	0.0974862	0.3784937	0.9847710	2.570482	10.13589	2.726136	6.443166	0.9883422	4.140941
test	Test\|Path	AUC_tEnd_norm	1000	0.0879475	0.3666436	0.9764895	2.583932	10.39213	2.725995	6.171252	0.9691336	4.277946
test	Test\|Path	AUC_inf	1000	0.0967520	0.3905456	1.0399128	2.767019	11.19236	2.911609	6.481359	1.0399910	4.246863
test	Test\|Path	AUC_inf_norm	1000	0.0919560	0.3844425	1.0292274	2.716473	10.91008	2.838040	6.212406	1.0181034	4.276537

^{Created on 2023-10-25 with reprex v2.0.2}

Yuri05 · 2023-10-25T10:46:31Z

Finally 😃
My test workflow takes now 2.9 min (MC sampling: 1.5 min) in sequential and 1.8 min (MC: 0.9 min) in parallel.
Nice!

pchelle added 2 commits October 23, 2023 11:37

Remove ave which did not work as expected

d540956

pchelle marked this pull request as ready for review October 24, 2023 06:28

Yuri05 reviewed Oct 24, 2023

View reviewed changes

pchelle added 2 commits October 25, 2023 04:10

Fixes Open-Systems-Pharmacology#1086 ratios data.frame is pivoted int…

9e9ff2b

…o matrix

Remove simulation set name from test data

0820502

Yuri05 merged commit c862b2c into Open-Systems-Pharmacology:develop Oct 25, 2023
2 checks passed

pchelle deleted the 1086_parallel_monte_carlo branch February 15, 2024 16:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes #1086 remove group_by to optimize computation and suggest parallel #1130

Fixes #1086 remove group_by to optimize computation and suggest parallel #1130

pchelle commented Oct 23, 2023

pchelle commented Oct 24, 2023

Yuri05 commented Oct 24, 2023

Yuri05 Oct 24, 2023

Yuri05 Oct 24, 2023

Yuri05 Oct 24, 2023 •

edited

Loading

Yuri05 Oct 24, 2023

Yuri05 Oct 24, 2023 •

edited

Loading

pchelle Oct 24, 2023

pchelle Oct 24, 2023

pchelle Oct 24, 2023

pchelle Oct 25, 2023

pchelle commented Oct 25, 2023

Yuri05 commented Oct 25, 2023

Fixes #1086 remove group_by to optimize computation and suggest parallel #1130

Fixes #1086 remove group_by to optimize computation and suggest parallel #1130

Conversation

pchelle commented Oct 23, 2023

pchelle commented Oct 24, 2023

Yuri05 commented Oct 24, 2023

Yuri05 Oct 24, 2023

Choose a reason for hiding this comment

Yuri05 Oct 24, 2023

Choose a reason for hiding this comment

Yuri05 Oct 24, 2023 • edited Loading

Choose a reason for hiding this comment

Yuri05 Oct 24, 2023

Choose a reason for hiding this comment

Yuri05 Oct 24, 2023 • edited Loading

Choose a reason for hiding this comment

pchelle Oct 24, 2023

Choose a reason for hiding this comment

pchelle Oct 24, 2023

Choose a reason for hiding this comment

pchelle Oct 24, 2023

Choose a reason for hiding this comment

pchelle Oct 25, 2023

Choose a reason for hiding this comment

pchelle commented Oct 25, 2023

Yuri05 commented Oct 25, 2023

Yuri05 Oct 24, 2023 •

edited

Loading

Yuri05 Oct 24, 2023 •

edited

Loading