Suitability for large datasets #22
-
Hi there! I've been following along from the sidelines for a while, on the fence about whether I need to take the plunge into HMMs - now I think the time is ripe. :-) I'm working on some very large data sets: videos recorded at 30 fps for 24 h, so roughly 2,500,000 data points per individual animal, which have been tracked to extract coordinates. From those I'm hoping to find a good (semi-)automated method for identifying passive/active periods based on velocity. Would this package be suitable for that? This is also a question about compute time: have you had experience with data sets of this magnitude? Looking forward to hearing back!
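For concreteness, here is a minimal sketch of how per-frame speed could be derived from tracked coordinates before any modelling. Everything here is illustrative: the `track` data frame, its `x`/`y` columns, and the simulated values are placeholders, not part of hmmTMB or the original post.

```r
# Hypothetical tracked coordinates: one row per video frame (30 fps), columns x and y.
# Simulated here only so the snippet runs end to end.
set.seed(1)
fps <- 30
n <- 30 * 60 * 5                              # five minutes of frames as a toy example
track <- data.frame(x = cumsum(rnorm(n, sd = 0.01)),
                    y = cumsum(rnorm(n, sd = 0.01)))

dt <- 1 / fps                                 # time step in seconds
# per-frame speed = Euclidean distance between consecutive positions / time step
speed <- sqrt(diff(track$x)^2 + diff(track$y)^2) / dt
track$speed <- c(NA, speed)                   # pad the first frame so lengths match
```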
-
Hi, here are a few thoughts about using hmmTMB on large data sets.

We wrote the package with speed in mind; for example, the likelihood function is implemented in C++ using TMB. That said, we also wanted hmmTMB to be flexible (e.g., many observation distributions, general covariate dependence), so we had to sacrifice speed in some places.

I just ran a simulation fitting a 2-state Gaussian HMM to 2.5 million data points, and it took around 3 minutes on my laptop. I was fitting the true model there, so a real data analysis would likely take somewhat longer, but it should be feasible.

Computing time will depend very strongly on the model formulation, and model fitting can…

Anecdotally, my impression is that very high-frequency data like yours are often pre-processed before applying a hidden Markov model. This is done both to make things faster and to extract the variables we expect to be most informative for identifying the states of interest (e.g., the mean or variance of the velocity over 30-second windows might be more useful than the raw 30 Hz observations). But of course, this depends a lot on the specific application.

I hope this helps, and I would be interested to hear how your analysis goes!
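To illustrate the pre-processing idea above, here is a rough sketch of summarising speed over non-overlapping 30-second windows, continuing the toy `track` object from the earlier sketch. The window length and column names are assumptions; the resulting summary series is the kind of input that could then be modelled with hmmTMB (see the package vignettes for the actual model-specification syntax).

```r
# 30-second windows at 30 fps: 30 * 30 = 900 frames per window.
window_len <- 30 * 30
window_id <- (seq_len(nrow(track)) - 1) %/% window_len

# Mean and standard deviation of speed within each window.
summaries <- data.frame(
  mean_speed = tapply(track$speed, window_id, mean, na.rm = TRUE),
  sd_speed   = tapply(track$speed, window_id, sd,   na.rm = TRUE)
)
head(summaries)
# For a full 24-hour recording this gives 86400 / 30 = 2880 window summaries,
# a far smaller series than the ~2.5 million raw frames.
```

Reducing 2.5 million frames to a few thousand window summaries also makes the HMM fitting step itself much cheaper.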
-
Hi @TheoMichelot! I just came across this poster/paper from 2007 implementing a faster Viterbi algorithm, and it comes with C++ code too. Maybe it wouldn't be too much hassle to implement at some point? It might at least make things easier. 😉 https://cs.idc.ac.il/~smozes/hmmspeedup/hmmspeedup.html
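For context, the recursion that such a speedup would target is the standard Viterbi dynamic program, which costs O(TK^2) for T observations and K states. Below is a generic log-space sketch with Gaussian emissions; it is only an illustration of the textbook algorithm, not hmmTMB's internal code, and all parameter values are made up.

```r
# Generic Viterbi decoding for an HMM with Gaussian emissions (log-space).
# Standard O(T * K^2) recursion; parameters below are illustrative only.
viterbi_gauss <- function(y, delta, Gamma, mu, sigma) {
  K <- length(delta)                       # number of states
  n <- length(y)                           # number of observations
  # log emission densities: n x K matrix
  log_dens <- sapply(1:K, function(k) dnorm(y, mu[k], sigma[k], log = TRUE))
  lv <- matrix(-Inf, n, K)                 # log-probability of best partial path
  bp <- matrix(0L, n, K)                   # back-pointers
  lv[1, ] <- log(delta) + log_dens[1, ]
  for (t in 2:n) {
    for (k in 1:K) {
      cand <- lv[t - 1, ] + log(Gamma[, k])
      bp[t, k] <- which.max(cand)
      lv[t, k] <- max(cand) + log_dens[t, k]
    }
  }
  states <- integer(n)
  states[n] <- which.max(lv[n, ])
  for (t in (n - 1):1) states[t] <- bp[t + 1, states[t + 1]]
  states
}

# Toy example: two states with low and high mean speed.
set.seed(1)
y <- c(rnorm(100, 0.1, 0.05), rnorm(100, 1, 0.3))
states <- viterbi_gauss(y,
                        delta = c(0.5, 0.5),
                        Gamma = matrix(c(0.95, 0.05, 0.05, 0.95), 2, 2, byrow = TRUE),
                        mu = c(0.1, 1), sigma = c(0.05, 0.3))
table(states)
```

Note that even this plain recursion scales linearly in the number of observations and quadratically in the number of states.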