DASCore for edge processing #215
-
I see your point, but the indexing is actually quite efficient. DASCore's indexing was mostly stolen from ObsPlus, and at NIOSH we have ObsPlus banks with hundreds of thousands of files and it still does alright (last I checked, it updates the indexes in less than 30 seconds). The key is that only files with time stamps after the last update time are indexed when you call `spool.update()`. The only cost of having more files is then separating the ones that need updating from the ones that don't based on their time stamps, which most operating systems are pretty good at (using `os.walk`). Do you have such a case you can profile?
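For reference, a minimal sketch of that pattern (the directory path is a placeholder):

```python
import dascore as dc

# Create a spool over a directory of DAS files. The first update builds
# the index; later calls only parse files whose mtime is newer than the
# time of the last index update.
spool = dc.spool("/path/to/das_data")
spool = spool.update()
```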
-
`spool.update()` is exactly what I need! Can you elaborate more on "indexing files with time stamps after the last update"? After updating, accessing the newly added file still fails:

```
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[25], line 1
----> 1 print(sp[10].data)

File ~/anaconda3/envs/py10/lib/python3.10/site-packages/dascore/core/spool.py:125, in DataFrameSpool.__getitem__(self, item)
    124 def __getitem__(self, item):
--> 125     out = self._get_patches_from_index(item)
    126     # a single index was used, should return a single patch
    127     if not isinstance(item, slice):

File ~/anaconda3/envs/py10/lib/python3.10/site-packages/dascore/core/spool.py:160, in DataFrameSpool._get_patches_from_index(self, df_ind)
    158 if df1.empty:
    159     msg = f"index of [{df_ind}] is out of bounds for spool."
--> 160     raise IndexError(msg)
    161 joined = df1.join(source.drop(columns=df1.columns, errors="ignore"))
    162 return self._patch_from_instruction_df(joined)

IndexError: index of [10] is out of bounds for spool.
```
-
I can use the following to re-index while an index file is already present:

```python
index_path = data_path + '/.dascore_index.h5'  # path to the existing index

sp.indexer.index_path.unlink()  # delete the index file
sp.update()                     # rebuild the index from scratch
```
-
Yes, so the time stamp on the file must be after the time the update was run. Try using `touch` on the 11th file to reset its mtime, then run update again.
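A quick sketch of that workaround (the file name here is hypothetical):

```python
from pathlib import Path

# Bump the file's mtime so it post-dates the last index update,
# then re-scan the directory to pick it up.
Path(data_path, "das_file_011.h5").touch()
sp = sp.update()
```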
-
Does it? It should use the same index file and not recreate it. Just make sure the timestamp stuff is sorted out.
-
To run a DASCore function regularly (let's say every 5 minutes) on a data directory to which we are continuously saving data in real time, we need to index the spool each time we call the function. However, this would not be efficient when the directory contains hundreds or thousands of files. I wonder, is there a quick way to index only part of the data in a directory? Also, how can we re-index a directory while a ".dascore_index" file is already present there? A rough sketch of the loop I have in mind is below.
I appreciate any thoughts.
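For concreteness, a minimal sketch of the intended processing loop (the directory path, interval, and `process` function are placeholders):

```python
import time

import dascore as dc

spool = dc.spool("/path/to/das_data").update()

while True:
    spool = spool.update()  # ideally this only indexes newly written files
    for patch in spool:     # iterate over the (growing) set of patches
        process(patch)      # placeholder for the actual processing function
    time.sleep(300)         # wait five minutes before the next pass
```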