Processing segmented/sampled data #378
Comments
This is a great idea. Would you keep the indices in csv files, one day per file or how? I assume that the FCS would still be in 24-hour images?
That's an interesting question. But yes, I think that by default any files processed by #191 or this issue should be stored in a day-long format. That format could be tricky to make, though. The simplest approach would be to insert null values (not default values) for missing data. To save space, we should read, write, and use sparse matrices. It would also require all data to be aligned to absolute minutes.
Was this ever put into place? I am currently trying to manage some segmented data and process the indices outputs in R, but I am finding it tricky to organize the matrices without the null values that would keep each matrix the same length. We'd like to average different indices across days that have recordings at different times, and while it's no problem adding null values to fill gaps at the end of the day, it's more difficult to do this when recording gaps exist at the beginning of a 24-hour period.
@meperra, this was a feature request. It's something we intend to do but it is not yet done. While it is inefficient, we've certainly had more than one person complete analyses like this without this feature inbuilt.
I'd like to learn more about why this is an issue. Naively, you'd allocate a vector 1440 elements in size, per day, per index. For any minutes where there are data, you fill in the values. Then you're left with a properly structured vector (where that vector could be a slice of a larger matrix). Note: I do not recommend filling in or zeroing cells where data is missing with a default value; you'll bias your calculation. You need to properly skip missing cells and accept that different minutes can have different population sizes in the final aggregate day. Also note: averaging indices is not straightforward. @towsey did you have a white paper on averaging indices?
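The allocate-then-fill approach described above can be sketched as follows. This is a minimal illustration in Python rather than R, and the names `empty_day` and `mean_by_minute` are hypothetical helpers, not part of AP. It uses a plain arithmetic mean purely to show how missing cells are skipped; as noted above, averaging indices is not straightforward, so the mean itself is an assumption.

```python
import math

MINUTES_PER_DAY = 1440


def empty_day():
    """One vector per day, per index, with every minute marked as missing."""
    return [math.nan] * MINUTES_PER_DAY


def mean_by_minute(days):
    """Average each minute across days, skipping missing cells.

    Returns (means, counts); counts[m] is the number of days that
    actually had data for minute m (the population size for that minute).
    """
    means, counts = [], []
    for minute in range(MINUTES_PER_DAY):
        present = [d[minute] for d in days if not math.isnan(d[minute])]
        counts.append(len(present))
        means.append(sum(present) / len(present) if present else math.nan)
    return means, counts


# Two made-up days with values at different minutes.
day_a = empty_day()
day_b = empty_day()
day_a[60] = 0.2
day_b[60] = 0.4
day_b[61] = 0.9

means, counts = mean_by_minute([day_a, day_b])
```

Keeping `counts` alongside `means` makes the differing population sizes explicit, so minutes backed by only one or two days can be treated with caution.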
The issue I run into is just that the start time of the first file for a specific day is not specified, so the time before the first recording is not populated with NAs or 0s. Instead, the first minute of the first recording appears as the start of the day in the CSV files (index 0 in the CSV is always the first minute of the first recording, rather than the first minute of that day). The issue is not having different population sizes for different minutes in the aggregate day, but rather organizing it so that each minute is in the correct place within its vector, so that I am averaging the same minutes across different days. E.g. if my first recording starts at 1AM, the hour at the start of that day is not recognized and the length of the vector is 1380 elements, with the first minute (1:00-1:01AM) labelled as Index 0. Ideally, that minute would be labelled as 60 instead, and the minutes that precede it would be populated with 0s (which are then changed to NAs), just like the time between two nonconsecutive recordings currently is, and then the vector would be the same length as a full 24-hour recording. It's not a huge issue to look at the timestamp and identify whether discrepancies in length are happening at the end or the beginning of a concatenated day, and I think there is a workaround we can figure out in R, but I just wanted to see if there was a simpler solution. If the time between the true beginning of that day and the first recording could be recognized, and those cells could be populated, then any discrepancies in vector length would be at the end of the day, and those are easily fixed by adding elements to the vector in R (these elements would be NAs that are omitted when averages are taken).
I guess the short version of this is just that my vector lengths vary because of discontinuous recordings, and I would like to use missing values (NAs) to arrange my data appropriately within a 1440-element vector so that each element in each vector is associated with the same minute in a 24-hour day. These missing values will be omitted when averages are taken, but I'm under the impression that I need to make sure the actual values are in the right place. Hopefully that makes some sense?
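The placement described above can be sketched like this. It is a rough Python illustration, not AP functionality: `place_in_day` is a hypothetical helper, and the recording start time is assumed to be already known (for example, taken from the file's date stamp, whose format is not covered here). With zero-based indexing, a recording that starts at 1:00 AM lands at minute 60 of the day.

```python
import math
from datetime import datetime

MINUTES_PER_DAY = 1440


def place_in_day(day, start, values):
    """Copy recording-relative per-minute values into a day-long vector.

    `start` is the (assumed known) start time of the recording;
    values[0] is the first minute of that recording.
    """
    offset = start.hour * 60 + start.minute  # minutes since midnight
    for i, value in enumerate(values):
        minute = offset + i
        if minute >= MINUTES_PER_DAY:
            break  # anything past midnight belongs to the next day's vector
        day[minute] = value
    return day


# Every minute starts out missing; only recorded minutes are filled in.
day = [math.nan] * MINUTES_PER_DAY

# Hypothetical recording: starts at 01:00 with three minutes of index values.
start = datetime(2020, 1, 1, 1, 0)
place_in_day(day, start, [0.5, 0.6, 0.7])
```

Calling `place_in_day` once per recording fills gaps at the start, middle, and end of the day in the same way, so every day's vector has 1440 elements regardless of when recording began.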
Your files should have a date stamp in them, right? All AP does is produce results for input files. This is the simplest way for it to operate and thus the most powerful (since it can be used for a variety of cases). All results it outputs are relative to the recording it is processing (not the start of the day). Even if we move to processing multiple results, it's highly likely we will continue to produce results that are relative to the input recording. There are a large number of formats and non-trivial problems involved with using datestamps from filenames, which is why we leave that problem as an exercise for the reader. Without knowing all the complexities of your format, it's impossible for us to know what the right thing to do is (don't get me wrong, I have plans to make all this easier... but the point holds). For example: we process a single 1-minute file. What is the context here? Are there other files in the day (in a different folder)? Do we produce a massive day-size matrix of results with only one one-minute slice filled? What if you only wanted results from that minute? What if the datestamp in the filename is wrong (happens frequently)? What if you're doing a sampling experiment where you take every 5th recording and concatenate the results? So, the basic process you should take is:
That does make sense re: the fact that not everyone is doing the same analysis as me. I selfishly wanted everything easy and catered to my needs (haha, I apologize for being so lazy), but the basic process you outlined is not too tricky to figure out using the filenames! Thank you for your help, AP is a great program that has been incredibly helpful thus far.
Is your feature request related to a problem? Please describe.
We've had a couple of datasets so far that consist of segmented or sampled data.
While we can produce indices and FCS for this data, the process is very inefficient and convoluted.
Describe the solution you'd like
We should be able to process multiple files from a sensor as if they were one large file, while also supporting both regular and irregular gaps in data.
Describe alternatives you've considered
The old process involves:
concatenateindexfiles to stitch the results together
However, creating a full result set for many (many!) small, short files is extremely inefficient.
Additional context
Related to #191