
l-k- (Collaborator) commented Jan 31, 2024

Addresses #883

The main focus of this PR was on improving readability of g.calibrate and g.getmeta by moving most monitor-specific code out of these two modules and into g.readaccfile. Now g.calibrate and g.getmeta treat input data uniformly, regardless of what monitor was used to collect it.

Readability of some other part 1 modules was also improved (e.g. g.getstarttime, get_starttime_weekday_truncdata).

The output of g.readaccfile is now much more standardized: downstream methods can expect it to return a dataframe P$data with at least the columns c("x", "y", "z"), and possibly more columns from c("time", "x", "y", "z", "light", "temperature", "wear").

g.readaccfile now also ensures that timestamps have been read in the correct timezone, configtz.
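Downstream code can rely on that column contract with a simple check. The sketch below is illustrative only: P and its values are made up; only the column names come from this PR.

```r
# Illustrative check of the standardized g.readaccfile output (not GGIR code).
P <- list(data = data.frame(time = Sys.time() + 0:2,
                            x = c(0.01, 0.02, 0.01),
                            y = c(-0.98, -0.99, -0.98),
                            z = c(0.05, 0.04, 0.05)))
required <- c("x", "y", "z")
optional <- c("time", "light", "temperature", "wear")
stopifnot(all(required %in% colnames(P$data)))                  # always present
stopifnot(all(setdiff(colnames(P$data), required) %in% optional))  # nothing unexpected
```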

This PR also includes two speed improvements:

  • for movisens files, g.readtemp_movisens() is now a lot faster because it resamples only the temperature data needed for the current block (instead of resampling all temperature data every time), and resampling is expensive. Overall, part 1 is ~50% faster for this file type.
  • For Axivity cwa and GENEActiv .bin files, g.calibrate is now ~25% faster because of switching from data.frame to numeric matrix for storing data. Overall, part 1 is ~10% faster for these file types.

However, part 1 is now ~10% slower for gt3x files, because of the added timestamp conversion code.
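For context on the data.frame-to-matrix switch: repeated numeric subsetting is cheaper on a matrix because it skips data.frame method dispatch and per-column bookkeeping. A rough standalone illustration (not GGIR code; timings are machine-dependent):

```r
# Compare repeated numeric access on a matrix vs. an equivalent data.frame.
n <- 1e5
m  <- matrix(rnorm(n * 3), ncol = 3, dimnames = list(NULL, c("x", "y", "z")))
df <- as.data.frame(m)
system.time(for (i in 1:200) s1 <- colMeans(m[1:1000, ]))   # matrix subsetting
system.time(for (i in 1:200) s2 <- colMeans(df[1:1000, ]))  # typically slower
stopifnot(isTRUE(all.equal(s1, s2)))  # same numbers either way
```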

Below is the list of changes that affected more than just code readability or processing speed. These are minor.

  • g.getmeta now uses meantempcal, as calculated by g.calibrate, as the mean temperature for rescaling data for all monitor types. It used to use meantempcal for the (deprecated) GENEActiv csv files and for ad-hoc csv files, but for GENEActiv bin and Movisens files it used the mean temperature calculated by g.getmeta for that particular data chunk. In that second case, each chunk of data was recalibrated slightly differently, as a slightly different meantemp value was used.

  • for ad-hoc files not containing a column header, we used to always skip the very first line of the file anyway. Now we only skip the first line if it contains column names.

  • For Axivity cwa files (but not other monitor or format types), if the very last block was shorter than (sf * ws * 2 + 1), g.readaccfile() trimmed this data off. Now we keep it, like we do for other types.

  • for movisens files, we didn't account for the fact that readUnisensSignalEntry(..., startIndex, endIndex) reads up to and including endIndex itself. So we were reading the next block starting from that last point, which was therefore read twice.

  • for monitor types where the file is read up to and including endpage, we used to read (blocksize + 1) samples at a time, instead of the blocksize samples that were requested.

  • for movisens files, the file path can now contain a dot anywhere, not just at the very beginning, for example unisensR-0.3.4/tests/unisensExample/acc.bin, which we get when downloading a movisens data example from https://github.com/Unisens

  • in g.imputeTimegaps(), raw timestamp imputation results were lost, and the last timestamp before the gap was carried over as the timestamp for every sample inside the gap.

  • for Axivity .cwa files, temperature was used to calculate calibration coefficients in g.calibrate(), but it actually wasn't used to re-scale data in g.getmeta().

  • g.getstarttime() used to calculate the start time incorrectly for Axivity csv files (it prepended the start date to the full timestamp, which led to the start date appearing twice), which caused the starting timestamps of the metalong and metashort metrics to be truncated to midnight.
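Two of the items above (the readUnisensSignalEntry one and the blocksize one) are instances of the same inclusive-range off-by-one pattern. A hypothetical sketch, not actual GGIR code:

```r
# Hypothetical block iteration with a reader whose read(start, end) is
# inclusive of `end` (as readUnisensSignalEntry is).
blocksize <- 100
start <- 1
end <- start + blocksize - 1   # start..end inclusive is exactly blocksize samples
# Old behaviour: reading start..(start + blocksize) returned blocksize + 1
# samples, and the next block started at `end`, reading that sample twice.
next_start <- end + 1          # fixed: continue right after the last sample read
stopifnot(length(start:end) == blocksize, next_start == 101)
```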

I will be doing more testing in the upcoming days, but I would appreciate any suggestions on how to test more thoroughly.

Checklist before merging:

  • Existing tests still work (check by running the test suite, e.g. from RStudio).
  • Added tests (if you added functionality) or fixed existing tests (if you fixed a bug).
  • Updated or expanded the documentation.
  • Updated release notes in inst/NEWS.Rd with a user-readable summary. Please, include references to relevant issues or PR discussions.
  • Added your name to the contributors lists in the DESCRIPTION file, if you think you made a significant contribution.

l-k- added 30 commits January 30, 2024 07:38
this is minor, but with &&, conditions are only evaluated until the first failure, so this gets rid of unnecessary computation for blocks after the 1st one.
The comment got re-worded incorrectly sometime in the past. Correcting to restore the original meaning. It came from this commit: e61538b
dototcomma() was only called for Actigraph csv files.
That was my mistake from this commit 0caaf70
stringsAsFactors=TRUE was set in response to R changing the default for this parameter from TRUE to FALSE.

But we don't expect any non-numeric values in these data files, and if there's something wrong with the file and it contains a character string, it's better to end up with a character column than a factor column. Factors can be mistaken for numbers, and this conversion could go undetected even though we'd end up with very unexpected numeric values.
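The danger described above is the classic factor-to-numeric gotcha, easy to demonstrate in base R:

```r
# A stray character value turns the whole column into a factor, and
# as.numeric() on a factor returns level codes, not the printed values.
f <- factor(c("20", "10", "oops"))
as.numeric(f)                                   # 2 1 3 -- plausible but wrong
suppressWarnings(as.numeric(as.character(f)))   # 20 10 NA -- the NA flags the problem
```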
extract_params() defaults params_general[["desiredtz"]] to "", which means to use the system timezone
The csv files were generated by Open Movement's OmGui, from GGIRread test files testfiles/ax3_testfile.cwa and testfiles/ax6_testfile.cwa using "Export Raw CSV", with "Gravity(g)" selected as accelerometer unit and "Seconds (Unix epoch)" as timestamp format
not just in the very beginning. The name of any intermediate folder in the path can contain a dot. This happens for example in the path unisensR-0.3.4/tests/unisensExample/acc.bin that we get when we download a movisens data example from https://github.com/Unisens
This doesn't actually matter in the code, but cleaning up anyway.
1. Always skip (rmc.firstrow.acc - 1) rows, not rmc.firstrow.acc like we did when length(rmc.firstrow.header) == 0, because otherwise the first row of data is lost.

2. Set the header parameter of the data.table::fread call to "auto" (this way, iff every non-empty field on the first data line is of type character, that line will be read in as column names). Don't just assume that there is a header with column names; people might have set rmc.firstrow.acc to point to the row after the header. And especially if rmc.firstrow.header is set, then rmc.firstrow.acc should point to the actual first row of acc data, since data.table::fread won't know how to deal with an arbitrary-length/structure header block.

For the case of rmc.firstrow.acc == 2, we used to mistakenly skip 2 rows of data because we were setting skip = rmc.firstrow.acc (== 2), and were also calling data.table::fread with header = TRUE (so using a row of data as if it were column names), even though the column names were most likely on row 1 which we had skipped anyway.

And for the case when rmc.firstrow.header was set, we were most likely losing one row of data, because then rmc.firstrow.acc was most likely pointing to the first row of the actual acc data, not to any header info.

And if rmc.firstrow.acc does point to a row of column names, then header = "auto" can handle that just fine.
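The header = "auto" behaviour described above can be demonstrated directly with data.table::fread (requires the data.table package):

```r
library(data.table)

# With header = "auto" (the default), the first line becomes column names
# only if every non-empty field on it is of type character.
d1 <- fread(text = "x,y,z\n1,2,3\n4,5,6", header = "auto")
names(d1)  # "x" "y" "z"

d2 <- fread(text = "1,2,3\n4,5,6", header = "auto")
names(d2)  # "V1" "V2" "V3" -- the first line is kept as data
nrow(d2)   # 2
```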
l-k- commented Feb 12, 2024

@vincentvanhees thank you for the review and the testing!

There were a few (very small) fixes that I expect to possibly have a minor effect on the computed metrics. I don't want to have to guess though, so what I will do is I'll prepare a branch where I'll undo only those fixes that I expect to have an effect on metrics. There should only be a small handful of them. I'll do this on Monday. This way we can test and make sure that any changes in computed metrics between this PR and the master branch are fully explained by this small subset of changes, and we can double-check that these changes are acceptable.

l-k- force-pushed the part1-simplification-v2 branch from 54462fa to b11c16f on February 13, 2024 04:15
l-k- commented Feb 13, 2024

@vincentvanhees I created a branch that is based on this PR's branch, and in there I undid all those changes that aren't cosmetic but cause differences in computed metrics. This way you can take a look at them in isolation and make sure these are reasonable changes.

I did some testing with this new branch, and so far I'm getting the same metalong values as I do in the master branch.

I do still get a difference between metashort values. But these seem to be a rounding error:

[screenshot: table of metashort value differences]

I can't quite put my finger on what exactly is causing the rounding issue though. I'll see if I can figure it out tomorrow.

And I'll do some more testing tomorrow with different file types.

@jhmigueles (Collaborator) commented:
Hi! Thanks @l-k- for all the work on this.

In case it helps, I have tested this branch on some movisens files that I have on my computer. I found it 30% faster than the master branch on these files, with identical metashort and metalong output in both branches.

I have also checked that the new branch detects when the acc.bin file is missing from a participant's movisens folder and returns a warning; everything works as expected.

vincentvanhees commented Feb 13, 2024

Thanks @l-k- for preparing the branch. I think I figured out the cause of the issue, but maybe good to share the whole process of how I got there:

Observation 1

I am getting different results when running summary() on the difference in M$metashort$anglez between branches:

Comparing branch part1-simplification-v2 with test-only-dont-merge:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.0521  0.0615  0.1048  0.1073  0.1519  0.2853 

Comparing branch master with test-only-dont-merge:

  Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-13.755700  -0.120200  -0.001700  -0.001436   0.106200  14.932200 

Comparing branch master with part1-simplification-v2:

   Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-13.78220  -0.24330  -0.10020  -0.10871   0.02173  14.81740

The above is from a single file. When I repeat this for four files, one of them shows very tiny differences similar to yours. This gives the impression that the differences I observe are triggered by a characteristic of the recording: it is not clippingscore or nonwearscore, because all four recordings are identical on those. However, when looking at the timing of the outliers, I see a pattern that correlates with the chunks being read (approx. 24 hours):

[screenshot: timing of anglez outliers, correlating with the ~24-hour chunk boundaries]

Preliminary conclusions:

  • The changes you suspect to drive the difference are not driving the difference I observe here.
  • The differences I observe may relate to how time is accounted for or time segments are selected.

Observation 2

Then I looked at whether the pagestart and pageend for each chunk are different between branches:

In part1-simplification-v2 I get following page indices for GENEActiv:

1 -> 24874
24875 -> 49748
49749 -> 74622
...

while for master branch I see:
1 -> 24875
24876 -> 49750
49751 -> 74625

This is explained by the -1 you added in line 52 of g.readaccfile. However, when I undo it I still get the differences for anglez as described above, so the -1 is not the cause of the issue.

Observation 3

One level higher: objects LD and use in g.getmeta are consistent between part1-simplification-v2 (without the -1) and the master branch.

For object data I compared its values:

  • Right before it is appended with the object S => at this point they are identical.
  • Right after get_starttime_weekday_truncdata => there is a time lag of less than a second; ccf gives a maximum correlation at a lag of 42 samples for one axis, which for this study is half a second.

Observation 4

I then dived into get_starttime_weekday_meantemp_truncdata.R and this led to the observation that previously the decimal places in the seconds of the starttime got dropped while your revised code accounts for them.

Conclusion

It is the rounding of the decimal places in the seconds of the start time that time-shifted the data by 0.5 seconds, and with it the windows over which metrics and nonwear are derived, which explains why values can differ.

  • The reason I did not find the difference in one of the four recordings earlier in this comment is that that recording starts exactly on the second, without decimal places. This may also explain why you did not find the issue in your test recording.
  • The fact that the difference did not disappear in your test branch is because you did not reverse this part of the code.
  • I have not tested ActiGraph gt3x again, but I suspect that the exact same problem occurs there too, because both data types followed the same steps in get_starttime_weekday_meantemp_truncdata.R.
  • The temporal relationship in the timing of the large anglez differences I found (plot above) may in hindsight have misled me: we expect larger differences when there is more rotation, and rotation follows a 24-hour cycle, so this was not an indication of chunk-selection errors.

Great, so now we know that the change in values is expected!

@vincentvanhees
Copy link
Member

vincentvanhees commented Feb 13, 2024

I can confirm that the minor gt3x discrepancy is clarified by the test-only-dont-merge branch:

  • The gt3x file I am testing comes with time gaps, which are handled differently. When comparing M$QClog I see that we now (as expected) identify/impute a slightly longer time gap, less than a second of difference.
  • Slightly different window selection for nonwear and clippingscore: +2 samples longer, if I understand correctly.

> I can't quite put my finger on what exactly is causing the rounding issue though. I'll see if I can figure it out tomorrow.
> And I'll do some more testing tomorrow with different file types.

Just to confirm: these tiny differences between the branches are specific to a GENEActiv recording started at a full second. In that case I see differences in ENMO ranging from -3.000e-04 to 3.000e-04. I also cannot think of an explanation for now...

As far as I am concerned this PR can be merged but let's leave it open till the end of the week in case we come up with additional tests we want to run prior to merging.

l-k- commented Feb 14, 2024

> I then dived into get_starttime_weekday_meantemp_truncdata.R and this led to the observation that previously the decimal places in the seconds of the starttime got dropped while your revised code accounts for them.

Oh wow, thank you for the analysis, and for figuring this out!

I just tested with a file that was shared with us on the Google group; starttime$sec there was 36.5, and you are right: in the master branch the 0.5 got trimmed off, which caused quite large discrepancies in metalong$nonwearscore, metashort$anglez and metashort$ENMO, all explained by rounding the seconds to an integer.
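The truncation can be reproduced in a couple of lines of base R (the date below is made up; only the 36.5-second value comes from the file discussed above):

```r
# Fractional seconds parse fine, but truncating to whole seconds (the old
# behaviour) shifts every subsequent sample by the dropped fraction.
t_full  <- as.POSIXct("2024-01-31 10:15:36.5", tz = "UTC")  # starttime$sec == 36.5
t_trunc <- as.POSIXct(trunc(t_full, units = "secs"))         # 36.5 -> 36
as.numeric(t_full) - as.numeric(t_trunc)  # 0.5 second shift
```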

l-k- commented Feb 14, 2024

@vincentvanhees @jhmigueles thank you both for testing!

I agree with keeping this PR open till the end of the week; I'll be testing more.

l-k- force-pushed the part1-simplification-v2 branch from a558678 to f1f6f06 on February 14, 2024 05:01
l-k- force-pushed the part1-simplification-v2 branch from 5881578 to 47fdd94 on February 14, 2024 05:20
l-k- mentioned this pull request on Feb 14, 2024
@vincentvanhees (Member) commented:
One more thing that is missing is:

  • Update to the changelog in NEWS.md now that we know that some changes impact consistency with previous versions.

l-k- commented Feb 14, 2024

@vincentvanhees I found the cause of the tiny differences I was seeing in metalong$EN, metashort$anglez and metashort$ENMO for GENEActiv bin and for Movisens files.

In the master branch for these file types, mean temperature was calculated before the data was truncated by get_starttime_weekday_meantemp_truncdata(), so it was calculated on a slightly longer chunk of temperature. And this caused XYZ data to be scaled using a slightly different meantemp value.

I added one more commit to the test-only-dont-merge branch, and I now get identical metalong and metashort results using that branch and master, for GENEActiv bin and Movisens files I was using.

I'll keep testing on other data I can dig up.

l-k- commented Feb 15, 2024

I will add a NEWS.md entry on Thursday.

l-k- commented Feb 16, 2024

Another small issue that got accidentally fixed by this PR: for Axivity .cwa files, when XYZ values were scaled in g.getmeta() using calibration coefficients, the temperature parameters weren't used. Temperature was used to calculate calibration coefficients in g.calibrate(), but later in g.getmeta() these temperature-related coefficients weren't applied for re-scaling.

l-k- force-pushed the part1-simplification-v2 branch from 95238ea to ae35547 on February 16, 2024 05:05
l-k- commented Feb 16, 2024

@vincentvanhees I just bumped into an issue with an Axivity csv file. I'll get it resolved on Friday.

l-k- force-pushed the part1-simplification-v2 branch from 351c34b to 4b4f330 on February 16, 2024 12:45
@vincentvanhees (Member) commented:
Additional fixes look good. Thanks again for all the work Lena, I will now merge this PR.
