Part 1 improving code readability #1027
Conversation
This is minor, but with `&&`, conditions are evaluated only until the first failure, so this avoids unnecessary computation for the blocks after the first one.
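A minimal base-R sketch of the short-circuit behaviour described above (the counter and function name are made up for illustration):

```r
# `&&` evaluates its operands left to right and stops at the first FALSE,
# so an expensive check on the right is skipped when the cheap one fails.
calls <- 0L
expensive_check <- function() {
  calls <<- calls + 1L  # count how often the expensive branch actually runs
  TRUE
}
result <- FALSE && expensive_check()  # right-hand side is never evaluated
calls  # still 0
```

Putting the cheapest (or most likely to fail) condition first therefore saves the later evaluations.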
The comment got re-worded incorrectly sometime in the past. Correcting it to restore the original meaning; it came from commit e61538b.
This was from commit 0caaf70.
dototcomma() was only called for Actigraph csv files. That was my mistake, from commit 0caaf70.
stringsAsFactors = TRUE was set in response to R changing the default for this parameter from TRUE to FALSE. But we don't expect any non-numeric values in these data files, and if there's something wrong with the file and it contains a character string, it's better to end up with a character column than a factor column. Factors can be mistaken for numbers, and such a conversion could go undetected even though it would produce very unexpected numeric values.
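A small base-R illustration of the factor pitfall described here (toy values, not from any actual data file):

```r
# If a numeric column is accidentally read as a factor, as.numeric()
# returns the internal level codes, not the original values.
x <- c("1.5", "2.5", "oops")
f <- factor(x)
as.numeric(f)                    # 1 2 3 -- silently wrong "numbers"
# A character column fails loudly instead, which is much easier to detect:
suppressWarnings(as.numeric(x))  # 1.5 2.5 NA
```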
extract_params() defaults params_general[["desiredtz"]] to "", which means the system timezone is used.
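In base R an empty tz string means "use the system timezone", which is presumably what this default relies on; a small sketch:

```r
# tz = "" interprets the timestamp in the system timezone, while an
# explicit tz pins it to a specific one:
t_sys <- as.POSIXct("2024-06-01 12:00:00", tz = "")     # system timezone
t_utc <- as.POSIXct("2024-06-01 12:00:00", tz = "UTC")
attr(t_utc, "tzone")  # "UTC"
attr(t_sys, "tzone")  # "" -- resolved via the system setting when formatted
```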
The csv files were generated by Open Movement's OmGui from the GGIRread test files testfiles/ax3_testfile.cwa and testfiles/ax6_testfile.cwa using "Export Raw CSV", with "Gravity(g)" selected as the accelerometer unit and "Seconds (Unix epoch)" as the timestamp format.
Not just at the very beginning: the name of any intermediate folder in the path can contain a dot. This happens, for example, in the path unisensR-0.3.4/tests/unisensExample/acc.bin that we get when we download a Movisens data example from https://github.com/Unisens
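A small base-R sketch of why splitting on the first dot of the full path goes wrong (the regexes here are illustrative, not GGIR's actual code):

```r
p <- "unisensR-0.3.4/tests/unisensExample/acc.bin"
# Fragile: stripping from the first dot in the whole path hits the
# dot inside "unisensR-0.3.4" and mangles the result:
sub("\\..*$", "", p)               # "unisensR-0"
# Robust: operate on the basename only, or use tools::file_ext():
sub("\\.[^.]*$", "", basename(p))  # "acc"
tools::file_ext(p)                 # "bin"
```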
…sDataFrame = TRUE
This doesn't actually matter in the code, but cleaning up anyway.
1. Always skip (rmc.firstrow.acc - 1) rows, not rmc.firstrow.acc like we did when length(rmc.firstrow.header) == 0, because otherwise the first row of data is lost.
2. Set the header parameter of the data.table::fread call to "auto" (this way, iff every non-empty field on the first data line is of type character, that line is read in as column names). Don't just assume that there is a header with column names; people might have set rmc.firstrow.acc to point to the row after that header. And especially if rmc.firstrow.header is set, rmc.firstrow.acc had better point to the actual first row of acc data, since data.table::fread won't know how to deal with a header block of arbitrary length and structure.

For the case of rmc.firstrow.acc == 2, we used to mistakenly skip 2 rows of data, because we were setting skip = rmc.firstrow.acc (== 2) and were also calling data.table::fread with header = TRUE (so using a row of data as if it were column names), even though the column names were most likely on row 1, which we had skipped anyway. And when rmc.firstrow.header was set, we were most likely losing one row of data, because rmc.firstrow.acc was then most likely pointing to the first row of the actual acc data, not to any header info. And if rmc.firstrow.acc does point to a row of column names, header = "auto" can handle that just fine.
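A small data.table sketch of the off-by-one described above, assuming a file where column names are on row 1 and acc data starts on row 2 (so rmc.firstrow.acc = 2); the toy csv content is made up:

```r
library(data.table)
txt <- "x,y,z\n0.1,0.2,0.3\n0.4,0.5,0.6"
rmc.firstrow.acc <- 2  # data starts on row 2

# Skipping rmc.firstrow.acc rows loses the first data row:
lost <- fread(text = txt, skip = rmc.firstrow.acc, header = FALSE)
nrow(lost)  # 1 -- the row "0.1,0.2,0.3" is gone

# Skipping (rmc.firstrow.acc - 1) rows with header = "auto" keeps it;
# the first unskipped line is all-numeric, so it is treated as data:
ok <- fread(text = txt, skip = rmc.firstrow.acc - 1, header = "auto")
nrow(ok)  # 2
```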
@vincentvanhees thank you for the review and the testing! There were a few (very small) fixes that I expect may have a minor effect on the computed metrics. I don't want to guess, though, so what I will do is prepare a branch where I undo only those fixes that I expect to have an effect on metrics. There should only be a small handful of them. I'll do this on Monday. This way we can test and make sure that any changes in computed metrics between this PR and the master branch are fully explained by this small subset of changes, and we can double-check that these changes are acceptable.
Force-pushed from 54462fa to b11c16f.
@vincentvanhees I created a branch that is based on this PR's branch, and in it I undid all those changes that aren't cosmetic but cause differences in computed metrics. This way you can take a look at them in isolation and make sure these are reasonable changes. I did some testing with this new branch, and so far I'm getting the same results, though I do still get a difference between the two branches.

I can't quite put my finger on what exactly is causing the rounding issue, though. I'll see if I can figure it out tomorrow. And I'll do some more testing tomorrow with different file types.
Hi! Thanks @l-k- for all the work on this. In case it helps, I have tested this branch on some Movisens files that I have on my computer. I found it 30% faster than the master branch on these files, with identical metashort and metalong output in both branches. I have also checked that the new branch identifies when the acc.bin file is missing in a participant's Movisens folder and returns a warning in this regard; everything works as expected.
Thanks @l-k- for preparing the branch. I think I figured out the cause of the issue, but maybe it's good to share the whole process of how I got there:

Observation 1

I am getting different results when running …

Comparing branch …
Comparing branch …
Comparing branch …

The above is from a single file. When I repeat this for four files, one of them shows very tiny differences similar to yours. This gives the impression that the differences I observe are triggered by a characteristic of the recording. It is not clippingscore or nonwearscore, because all four recordings are identical on those. However, when looking at the timing of the outlier I see a pattern that correlates with the chunks being read (approx. 24 hours).

Preliminary conclusions:
Observation 2

Then I looked at whether the … In 1 -> 24874, while for the master branch I see: …

This is explained by the -1 you added in line 52 of g.readaccfile. However, when I undo it I still get the differences for anglez as described above, so the -1 is not the cause of the issue.

Observation 3

One level higher: objects … For object …
Observation 4

I then dived into …

Conclusion

It is the rounding of the decimal places in the seconds of the start time that time-shifted the data by 0.5 seconds, and subsequently the window over which metrics and nonwear are derived as well, which explains why values can be different.
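A toy base-R illustration of the mechanism in this conclusion (made-up numbers, not the actual recording): shifting the start time by 0.5 s reassigns samples near epoch boundaries to different windows, which in turn changes the windows over which metrics are derived.

```r
# 10 Hz samples over ~10 s; the recording starts at t = 0.5 s
samples <- (5 + 0:99) / 10  # 0.5, 0.6, ..., 10.4 seconds
# 5-second epoch windows anchored at the truncated vs. the exact start time:
w_trunc <- findInterval(samples, seq(0,   10,   by = 5))
w_exact <- findInterval(samples, seq(0.5, 10.5, by = 5))
sum(w_trunc != w_exact)  # 10 samples land in a different window
```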
Great, so now we know that the change in values is expected!
I can confirm that the minor gt3x discrepancy is clarified by the
Just to confirm: these tiny differences between the branches are specific to GENEActiv recordings being started at a full second. In that case I see differences in ENMO ranging from -3.000e-04 to 3.000e-04. I also cannot think of an explanation for now... As far as I am concerned this PR can be merged, but let's leave it open till the end of the week in case we come up with additional tests we want to run prior to merging.
Oh wow, thank you for the analysis, and for figuring this out! I just tested with this file that was shared with us on the Google group, and
@vincentvanhees @jhmigueles thank you both for testing! I agree with keeping this PR open till the end of the week; I'll be testing more. |
Force-pushed from a558678 to f1f6f06.
Force-pushed from 5881578 to 47fdd94.
One more thing that is missing is:
@vincentvanhees I found the cause of the tiny differences I was seeing in metalong$EN, metashort$anglez, and metashort$ENMO for GENEActiv bin and Movisens files. In the master branch, for these file types, mean temperature was calculated before the data was truncated by get_starttime_weekday_meantemp_truncdata(), so it was calculated on a slightly longer chunk of temperature data. This caused XYZ data to be scaled using a slightly different meantemp value. I added one more commit to the test-only-dont-merge branch, and I now get identical metalong and metashort results between that branch and master for the GENEActiv bin and Movisens files I was using. I'll keep testing on other data I can dig up.
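A toy numeric sketch of this effect (the temperature coefficient and values are hypothetical, not GGIR's actual calibration code): computing the mean temperature before vs. after truncation yields slightly different means, and hence slightly different rescaled accelerations.

```r
temp <- c(21.0, 21.2, 21.4, 21.6, 26.0)  # last sample gets truncated
meantemp_before <- mean(temp)            # 22.24, includes the truncated tail
meantemp_after  <- mean(temp[1:4])       # 21.3, truncation applied first
coef_temp <- 1e-3                        # hypothetical temperature coefficient
x <- 0.98                                # one acceleration sample (g)
x_before <- x * (1 + coef_temp * (temp[1] - meantemp_before))
x_after  <- x * (1 + coef_temp * (temp[1] - meantemp_after))
abs(x_before - x_after)                  # tiny but nonzero difference
```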
I will add a NEWS.md entry on Thursday. |
Another small issue that got accidentally fixed by this PR: for Axivity .cwa files, when XYZ values were scaled in g.getmeta() using calibration coefficients, the temperature parameters weren't used. Temperature was used to calculate calibration coefficients in g.calibrate(), but later in g.getmeta() these temperature-related coefficients weren't used for re-scaling.
Force-pushed from 95238ea to ae35547.
@vincentvanhees I just bumped into an issue with an Axivity csv file. I'll get it resolved on Friday. |
Force-pushed from 351c34b to 4b4f330.
Additional fixes look good. Thanks again for all the work Lena, I will now merge this PR. |


Addresses #883
The main focus of this PR was on improving the readability of `g.calibrate` and `g.getmeta` by moving most monitor-specific code out of these two modules and into `g.readaccfile`. Now `g.calibrate` and `g.getmeta` treat input data uniformly, regardless of what monitor was used to collect it. Readability of some other part 1 modules was also improved (e.g. `g.getstarttime`, `get_starttime_weekday_truncdata`).

The output of `g.readaccfile` is now a lot more standardized, and the downstream methods can expect that this method returns a dataframe `P$data` with columns of at least `c("x", "y", "z")`, and possibly of a wider subset of `c("time", "x", "y", "z", "light", "temperature", "wear")`. `g.readaccfile` now also ensures that timestamps have been read in the correct timezone, `configtz`.

This PR also includes two speed improvements: `g.readtemp_movisens()` is now a lot faster because it only resamples the temperature data needed for that particular block (as opposed to resampling all temperature data every time), and resampling is expensive. Overall, part 1 is ~50% faster for this file type. However, part 1 is now ~10% slower for gt3x files, because of the added timestamp conversion code.
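The column contract described above could be checked downstream with something like this sketch (the `check_data` helper is hypothetical, not a GGIR function):

```r
required <- c("x", "y", "z")
allowed  <- c("time", "x", "y", "z", "light", "temperature", "wear")

# Hypothetical validator for the standardized P$data dataframe:
check_data <- function(data) {
  all(required %in% names(data)) && all(names(data) %in% allowed)
}

check_data(data.frame(time = 1:3, x = 0, y = 0, z = 1))  # TRUE
check_data(data.frame(x = 0, y = 0))                     # FALSE: "z" missing
```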
Below is the list of changes that affected more than just code readability or processing speed. These are minor.
- g.getmeta now uses `meantempcal`, as calculated by g.calibrate, as the mean temperature for rescaling data for all monitor types. It used to use `meantempcal` for the (deprecated) GENEActiv csv files and for ad-hoc csv files, but for GENEActiv bin and Movisens files it was using the mean temperature calculated by g.getmeta for that particular data chunk. So in that second case each chunk of data was recalibrated slightly differently, as a slightly different meantemp value was used.
- For ad-hoc files not containing a column header, we used to always skip the very first line of the file anyway. Now we only skip the first line if it contains column names.
- For Axivity cwa files (but not other monitor or format types), if the very last block was shorter than `(sf * ws * 2 + 1)`, g.readaccfile() trimmed this data off. Now we keep it, like we do for other types.
- For Movisens files, we didn't account for the fact that `readUnisensSignalEntry(..., startIndex, endIndex)` reads up to and including `endIndex` itself. So we were reading the next block starting from that last point, and this point was being read twice.
- For monitor types for which the file is read till endpage, including the endpage, we used to read `(blocksize + 1)` samples at a time, instead of `blocksize` samples as was requested.
- For Movisens files, the file path can now have a dot anywhere, not just at the very beginning of the path, for example something like `unisensR-0.3.4/tests/unisensExample/acc.bin` that we get when we download a Movisens data example from https://github.com/Unisens.
- In g.imputeTimegaps(), raw timestamp imputation results were lost, and the last timestamp before the gap was carried over as the timestamp for every sample inside the gap.
- For Axivity .cwa files, while temperature was used to calculate calibration coefficients in g.calibrate(), it actually wasn't used to re-scale data in g.getmeta().
- g.getstarttime() used to calculate the start time incorrectly for Axivity csv files (it prepended the start date to the full timestamp, which led to the start date being there twice), which led to the starting timestamps of the metalong and metashort metrics being truncated to midnight.
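The inclusive-endIndex off-by-one described above can be sketched in a few lines (a toy reader standing in for readUnisensSignalEntry, which reads up to and including endIndex):

```r
signal <- 1:10
read_block <- function(start, end) signal[start:end]  # inclusive on both ends

# Buggy chunking: the next block starts at the previous endIndex,
# so sample 5 is read twice:
buggy <- c(read_block(1, 5), read_block(5, 10))
length(buggy)  # 11

# Fixed chunking: the next block starts at endIndex + 1:
fixed <- c(read_block(1, 5), read_block(6, 10))
length(fixed)  # 10
```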
I will be doing more testing in the upcoming days, but I would appreciate any suggestions on how to test more thoroughly.
Checklist before merging:

- `inst/NEWS.Rd` with a user-readable summary. Please include references to relevant issues or PR discussions.
- `DESCRIPTION` file, if you think you made a significant contribution.