Suitability for large datasets #22
-
Hi there! I've been following along from the sidelines for a while, on the fence about whether I need to take the plunge into HMMs - now I think the time is ripe. :-) I'm working on some very large data sets: videos recorded at 30 fps for 24 h, so roughly 2,500,000 data points per individual animal, which have been tracked to extract coordinates. From those I'm hoping to find a good (semi-)automated method for identifying passive/active periods based on velocity. Would this package be suitable for that? This is also a question about compute time: have you had experience with data sets of this magnitude? Looking forward to hearing back!
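For concreteness, here is a minimal sketch of how per-frame speed could be derived from tracked coordinates before any modelling. Everything here is illustrative: the `track` data frame, its `x`/`y` columns, and the simulated values are placeholders, not part of hmmTMB or the original post.

```r
# Hypothetical tracked coordinates: one row per video frame (30 fps), columns x and y.
# Simulated here only so the snippet runs end to end.
set.seed(1)
fps <- 30
n <- 30 * 60 * 5                              # five minutes of frames as a toy example
track <- data.frame(x = cumsum(rnorm(n, sd = 0.01)),
                    y = cumsum(rnorm(n, sd = 0.01)))

dt <- 1 / fps                                 # time step in seconds
# per-frame speed = Euclidean distance between consecutive positions / time step
speed <- sqrt(diff(track$x)^2 + diff(track$y)^2) / dt
track$speed <- c(NA, speed)                   # pad the first frame so lengths match
```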
-
Hi, here are a few thoughts about using hmmTMB on large data sets.

We wrote the package with speed in mind; for example, the likelihood function is implemented in C++ using TMB. That said, we also wanted hmmTMB to be flexible (e.g., many observation distributions, general covariate dependence), so we had to sacrifice speed in some places.

I just ran a simulation fitting a 2-state Gaussian HMM to 2.5 million data points, and it took around 3 minutes on my laptop. I was fitting the true model there, so a real data analysis would likely take somewhat longer, but it should be feasible.

Computing time will depend very strongly on the model formulation, and model fitting can…

Anecdotally, my impression is that very high-frequency data like yours are often pre-processed before applying a hidden Markov model. This is done both to make things faster and to extract the variables we expect to be most informative for identifying the states of interest (e.g., the mean or variance of the velocity over 30-second windows might be more useful than the raw 30 Hz observations). But of course, this depends a lot on the specific application.

I hope this helps, and I would be interested to hear how your analysis goes!
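To illustrate the pre-processing idea above, here is a rough sketch of summarising speed over non-overlapping 30-second windows, continuing the toy `track` object from the earlier sketch. The window length and column names are assumptions; the resulting summary series is the kind of input that could then be modelled with hmmTMB (see the package vignettes for the actual model-specification syntax).

```r
# 30-second windows at 30 fps: 30 * 30 = 900 frames per window.
window_len <- 30 * 30
window_id <- (seq_len(nrow(track)) - 1) %/% window_len

# Mean and standard deviation of speed within each window.
summaries <- data.frame(
  mean_speed = tapply(track$speed, window_id, mean, na.rm = TRUE),
  sd_speed   = tapply(track$speed, window_id, sd,   na.rm = TRUE)
)
head(summaries)
# For a full 24-hour recording this gives 86400 / 30 = 2880 window summaries,
# a far smaller series than the ~2.5 million raw frames.
```

Reducing 2.5 million frames to a few thousand window summaries also makes the HMM fitting step itself much cheaper.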
-
Hi @TheoMichelot! I just came across this poster/paper from 2007 implementing a faster Viterbi algorithm, and it comes with C++ code too. Maybe it wouldn't be too much hassle to implement at some point? It might at least make things easier. 😉 https://cs.idc.ac.il/~smozes/hmmspeedup/hmmspeedup.html
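For context, the recursion that such a speedup would target is the standard Viterbi dynamic program, which costs O(TK^2) for T observations and K states. Below is a generic log-space sketch with Gaussian emissions; it is only an illustration of the textbook algorithm, not hmmTMB's internal code, and all parameter values are made up.

```r
# Generic Viterbi decoding for an HMM with Gaussian emissions (log-space).
# Standard O(T * K^2) recursion; parameters below are illustrative only.
viterbi_gauss <- function(y, delta, Gamma, mu, sigma) {
  K <- length(delta)                       # number of states
  n <- length(y)                           # number of observations
  # log emission densities: n x K matrix
  log_dens <- sapply(1:K, function(k) dnorm(y, mu[k], sigma[k], log = TRUE))
  lv <- matrix(-Inf, n, K)                 # log-probability of best partial path
  bp <- matrix(0L, n, K)                   # back-pointers
  lv[1, ] <- log(delta) + log_dens[1, ]
  for (t in 2:n) {
    for (k in 1:K) {
      cand <- lv[t - 1, ] + log(Gamma[, k])
      bp[t, k] <- which.max(cand)
      lv[t, k] <- max(cand) + log_dens[t, k]
    }
  }
  states <- integer(n)
  states[n] <- which.max(lv[n, ])
  for (t in (n - 1):1) states[t] <- bp[t + 1, states[t + 1]]
  states
}

# Toy example: two states with low and high mean speed.
set.seed(1)
y <- c(rnorm(100, 0.1, 0.05), rnorm(100, 1, 0.3))
states <- viterbi_gauss(y,
                        delta = c(0.5, 0.5),
                        Gamma = matrix(c(0.95, 0.05, 0.05, 0.95), 2, 2, byrow = TRUE),
                        mu = c(0.1, 1), sigma = c(0.05, 0.3))
table(states)
```

Note that even this plain recursion scales linearly in the number of observations and quadratically in the number of states.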