DASCore for edge processing #215
-
I see your point, but the indexing is actually quite efficient. DASCore's indexing was mostly stolen from ObsPlus, and at NIOSH we have ObsPlus banks with hundreds of thousands of files and it still does alright (last I checked, it updates the indexes in less than 30 seconds). The key is that only files with time stamps after the last update time are indexed when you call `spool.update()`. The only cost of having more files is then separating the ones that need updating from the ones that don't based on their time stamps, which most operating systems are pretty good at (using `os.walk`). Do you have such a case you can profile?
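For reference, a minimal sketch of that pattern (the directory path is a placeholder):

```python
import dascore as dc

# Create a spool over a directory of DAS files. The first update builds
# the index; later calls only parse files whose mtime is newer than the
# time of the last index update.
spool = dc.spool("/path/to/das_data")
spool = spool.update()
```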
-
`spool.update()` is exactly what I need! Can you elaborate more on "indexing files with time stamps after the last update"? After updating, accessing the newly added file still fails:

```
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[25], line 1
----> 1 print(sp[10].data)

File ~/anaconda3/envs/py10/lib/python3.10/site-packages/dascore/core/spool.py:125, in DataFrameSpool.__getitem__(self, item)
    124 def __getitem__(self, item):
--> 125     out = self._get_patches_from_index(item)
    126     # a single index was used, should return a single patch
    127     if not isinstance(item, slice):

File ~/anaconda3/envs/py10/lib/python3.10/site-packages/dascore/core/spool.py:160, in DataFrameSpool._get_patches_from_index(self, df_ind)
    158 if df1.empty:
    159     msg = f"index of [{df_ind}] is out of bounds for spool."
--> 160     raise IndexError(msg)
    161 joined = df1.join(source.drop(columns=df1.columns, errors="ignore"))
    162 return self._patch_from_instruction_df(joined)

IndexError: index of [10] is out of bounds for spool.
```
-
I can use the following to re-index while an index file is already present:

```python
index_path = data_path + '/.dascore_index.h5'  # path to the existing index

sp.indexer.index_path.unlink()  # delete the index file
sp.update()                     # rebuild the index from scratch
```
-
Yes, so the time stamp on the file must be after the time the update was run. Try using `touch` on the 11th file to reset its mtime, then run update again.
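A quick sketch of that workaround (the file name here is hypothetical):

```python
from pathlib import Path

# Bump the file's mtime so it post-dates the last index update,
# then re-scan the directory to pick it up.
Path(data_path, "das_file_011.h5").touch()
sp = sp.update()
```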
-
Does it? It should use the same index file and not recreate it. Just make sure the timestamp stuff is sorted out.
-
To run a DASCore function regularly (let's say every 5 minutes) on a data directory to which we are continuously saving data in real time, we need to index the spool each time we call the function. However, this would not be efficient when the directory contains hundreds or thousands of files. I wonder, is there a quick way to index only part of the data in a directory? Also, how can we re-index a directory while a ".dascore_index" file is already present there? A rough sketch of the loop I have in mind is below.
I appreciate any thoughts.
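For concreteness, a minimal sketch of the intended processing loop (the directory path, interval, and `process` function are placeholders):

```python
import time

import dascore as dc

spool = dc.spool("/path/to/das_data").update()

while True:
    spool = spool.update()  # ideally this only indexes newly written files
    for patch in spool:     # iterate over the (growing) set of patches
        process(patch)      # placeholder for the actual processing function
    time.sleep(300)         # wait five minutes before the next pass
```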