
l-k- (Collaborator) commented Jan 31, 2024

Addresses #883

The main focus of this PR was on improving readability of g.calibrate and g.getmeta by moving most monitor-specific code out of these two modules and into g.readaccfile. Now g.calibrate and g.getmeta treat input data uniformly, regardless of what monitor was used to collect it.

Readability of some other part 1 modules was also improved (e.g. g.getstarttime, get_starttime_weekday_truncdata).

The output of g.readaccfile is now much more standardized: downstream methods can expect it to return a dataframe P$data with at least the columns c("x", "y", "z"), and possibly more columns from c("time", "x", "y", "z", "light", "temperature", "wear").

g.readaccfile now also ensures that timestamps have been read in the correct timezone, configtz.
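Downstream code can rely on that column contract with a simple check. The sketch below is illustrative only: P and its values are made up; only the column names come from this PR.

```r
# Illustrative check of the standardized g.readaccfile output (not GGIR code).
P <- list(data = data.frame(time = Sys.time() + 0:2,
                            x = c(0.01, 0.02, 0.01),
                            y = c(-0.98, -0.99, -0.98),
                            z = c(0.05, 0.04, 0.05)))
required <- c("x", "y", "z")
optional <- c("time", "light", "temperature", "wear")
stopifnot(all(required %in% colnames(P$data)))                  # always present
stopifnot(all(setdiff(colnames(P$data), required) %in% optional))  # nothing unexpected
```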

This PR also includes two speed improvements:

  • for movisens files, g.readtemp_movisens() is now a lot faster because it resamples only the temperature data needed for the current block (instead of resampling all temperature data every time), and resampling is expensive. Overall, part 1 is ~50% faster for this file type.
  • For Axivity cwa and GENEActiv .bin files, g.calibrate is now ~25% faster because of switching from data.frame to numeric matrix for storing data. Overall, part 1 is ~10% faster for these file types.

However, part 1 is now ~10% slower for gt3x files, because of the added timestamp conversion code.
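For context on the data.frame-to-matrix switch: repeated numeric subsetting is cheaper on a matrix because it skips data.frame method dispatch and per-column bookkeeping. A rough standalone illustration (not GGIR code; timings are machine-dependent):

```r
# Compare repeated numeric access on a matrix vs. an equivalent data.frame.
n <- 1e5
m  <- matrix(rnorm(n * 3), ncol = 3, dimnames = list(NULL, c("x", "y", "z")))
df <- as.data.frame(m)
system.time(for (i in 1:200) s1 <- colMeans(m[1:1000, ]))   # matrix subsetting
system.time(for (i in 1:200) s2 <- colMeans(df[1:1000, ]))  # typically slower
stopifnot(isTRUE(all.equal(s1, s2)))  # same numbers either way
```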

Below is the list of changes that affected more than just code readability or processing speed. These are minor.

  • g.getmeta now uses meantempcal, as calculated by g.calibrate, as the mean temperature for rescaling data for all monitor types. It used to use meantempcal for the (deprecated) GENEActiv csv files and for ad-hoc csv files, but for GENEActiv bin and Movisens files it used the mean temperature calculated by g.getmeta for that particular data chunk. In that second case, each chunk of data was recalibrated slightly differently, as a slightly different meantemp value was used.

  • for ad-hoc files not containing a column header, we used to always skip the very first line of the file anyway. Now we only skip the first line if it contains column names.

  • For Axivity cwa files (but not other monitor or format types), if the very last block was shorter than (sf * ws * 2 + 1), g.readaccfile() trimmed this data off. Now we keep it, like we do for other types.

  • for movisens files, we didn't account for the fact that readUnisensSignalEntry(..., startIndex, endIndex) reads up to and including endIndex itself. So we were reading the next block starting from that last point, which was therefore read twice.

  • for monitor types where the file is read up to and including endpage, we used to read (blocksize + 1) samples at a time, instead of the blocksize samples that were requested.

  • for movisens files, the file path can now contain a dot anywhere, not just at the very beginning, for example unisensR-0.3.4/tests/unisensExample/acc.bin, which we get when downloading a movisens data example from https://github.com/Unisens

  • in g.imputeTimegaps(), raw timestamp imputation results were lost, and the last timestamp before the gap was carried over as the timestamp for every sample inside the gap.

  • for Axivity .cwa files, temperature was used to calculate calibration coefficients in g.calibrate(), but it actually wasn't used to re-scale data in g.getmeta().

  • g.getstarttime() used to calculate the start time incorrectly for Axivity csv files (it prepended the start date to the full timestamp, which led to the start date appearing twice), which caused the starting timestamps of the metalong and metashort metrics to be truncated to midnight.
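Two of the items above (the readUnisensSignalEntry one and the blocksize one) are instances of the same inclusive-range off-by-one pattern. A hypothetical sketch, not actual GGIR code:

```r
# Hypothetical block iteration with a reader whose read(start, end) is
# inclusive of `end` (as readUnisensSignalEntry is).
blocksize <- 100
start <- 1
end <- start + blocksize - 1   # start..end inclusive is exactly blocksize samples
# Old behaviour: reading start..(start + blocksize) returned blocksize + 1
# samples, and the next block started at `end`, reading that sample twice.
next_start <- end + 1          # fixed: continue right after the last sample read
stopifnot(length(start:end) == blocksize, next_start == 101)
```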

I will be doing more testing in the upcoming days, but I would appreciate any suggestions on how to test more thoroughly.

Checklist before merging:

  • Existing tests still work (check by running the test suite, e.g. from RStudio).
  • Added tests (if you added functionality) or fixed existing tests (if you fixed a bug).
  • Updated or expanded the documentation.
  • Updated release notes in inst/NEWS.Rd with a user-readable summary. Please, include references to relevant issues or PR discussions.
  • Added your name to the contributors lists in the DESCRIPTION file, if you think you made a significant contribution.

l-k- added 30 commits January 30, 2024 07:38
this is minor, but with &&, conditions are only evaluated until the first failure, so this gets rid of unnecessary computation for blocks after the 1st one.
The comment got re-worded incorrectly sometime in the past. Correcting to restore the original meaning. It came from this commit: e61538b
dototcomma() was only called for Actigraph csv files.
That was my mistake from this commit 0caaf70
stringsAsFactors=TRUE was set in response to R changing the default for this parameter from TRUE to FALSE.

But we don't expect any non-numeric values in these data files, and if there's something wrong with the file and it contains a character string, it's better to end up with a character column than a factor column. Factors can be mistaken for numbers, and this conversion could go undetected even though we'd end up with very unexpected numeric values.
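The danger described above is the classic factor-to-numeric gotcha, easy to demonstrate in base R:

```r
# A stray character value turns the whole column into a factor, and
# as.numeric() on a factor returns level codes, not the printed values.
f <- factor(c("20", "10", "oops"))
as.numeric(f)                                   # 2 1 3 -- plausible but wrong
suppressWarnings(as.numeric(as.character(f)))   # 20 10 NA -- the NA flags the problem
```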
extract_params() defaults params_general[["desiredtz"]] to "", which means to use the system timezone
The csv files were generated by Open Movement's OmGui, from GGIRread test files testfiles/ax3_testfile.cwa and testfiles/ax6_testfile.cwa using "Export Raw CSV", with "Gravity(g)" selected as accelerometer unit and "Seconds (Unix epoch)" as timestamp format
not just in the very beginning. The name of any intermediate folder in the path can contain a dot. This happens for example in the path unisensR-0.3.4/tests/unisensExample/acc.bin that we get when we download a movisens data example from https://github.com/Unisens
This doesn't actually matter in the code, but cleaning up anyway.
1. Always skip (rmc.firstrow.acc - 1) rows, not rmc.firstrow.acc like we did when length(rmc.firstrow.header) == 0, because otherwise the first row of data is lost.

2. Set the header parameter of the data.table::fread call to "auto" (this way, iff every non-empty field on the first data line is of type character, that line will be read in as column names). Don't just assume that there is a header with column names; people might have set rmc.firstrow.acc to point to the row after the header. And especially if rmc.firstrow.header is set, then rmc.firstrow.acc should point to the actual first row of acc data, since data.table::fread won't know how to deal with an arbitrary-length/structure header block.

For the case of rmc.firstrow.acc == 2, we used to mistakenly skip 2 rows of data because we were setting skip = rmc.firstrow.acc (== 2), and were also calling data.table::fread with header = TRUE (so using a row of data as if it were column names), even though the column names were most likely on row 1 which we had skipped anyway.

And for the case when rmc.firstrow.header was set, we were most likely losing one row of data, because then rmc.firstrow.acc was most likely pointing to the first row of the actual acc data, not to any header info.

And if rmc.firstrow.acc does point to a row of column names, then header = "auto" can handle that just fine.
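The header = "auto" behaviour described above can be demonstrated directly with data.table::fread (requires the data.table package):

```r
library(data.table)

# With header = "auto" (the default), the first line becomes column names
# only if every non-empty field on it is of type character.
d1 <- fread(text = "x,y,z\n1,2,3\n4,5,6", header = "auto")
names(d1)  # "x" "y" "z"

d2 <- fread(text = "1,2,3\n4,5,6", header = "auto")
names(d2)  # "V1" "V2" "V3" -- the first line is kept as data
nrow(d2)   # 2
```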
l-k- commented Feb 12, 2024

@vincentvanhees thank you for the review and the testing!

There were a few (very small) fixes that I expect to possibly have a minor effect on the computed metrics. I don't want to have to guess though, so what I will do is I'll prepare a branch where I'll undo only those fixes that I expect to have an effect on metrics. There should only be a small handful of them. I'll do this on Monday. This way we can test and make sure that any changes in computed metrics between this PR and the master branch are fully explained by this small subset of changes, and we can double-check that these changes are acceptable.

l-k- force-pushed the part1-simplification-v2 branch from 54462fa to b11c16f on February 13, 2024 04:15
l-k- commented Feb 13, 2024

@vincentvanhees I created a branch that is based on this PR's branch, and in there I undid all those changes that aren't cosmetic but cause differences in computed metrics. This way you can take a look at them in isolation and make sure these are reasonable changes.

I did some testing with this new branch, and so far I'm getting the same metalong values as I do in the master branch.

I do still get a difference between metashort values. But these seem to be a rounding error:

[screenshot: table of metashort value differences]

I can't quite put my finger on what exactly is causing the rounding issue though. I'll see if I can figure it out tomorrow.

And I'll do some more testing tomorrow with different file types.

@jhmigueles (Collaborator) commented:
Hi! Thanks @l-k- for all the work on this.

In case it helps, I have tested this branch on some movisens files that I have on my computer. I found it 30% faster than the master branch on these files, with identical metashort and metalong output in both branches.

I have also checked that the new branch detects when the acc.bin file is missing from a participant's movisens folder and returns a warning; everything works as expected.

vincentvanhees commented Feb 13, 2024

Thanks @l-k- for preparing the branch. I think I figured out the cause of the issue, but maybe good to share the whole process of how I got there:

Observation 1

I am getting different results when running summary() on the difference in M$metashort$anglez between branches:

Comparing branch part1-simplification-v2 with test-only-dont-merge:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.0521  0.0615  0.1048  0.1073  0.1519  0.2853 

Comparing branch master with test-only-dont-merge:

  Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-13.755700  -0.120200  -0.001700  -0.001436   0.106200  14.932200 

Comparing branch master with part1-simplification-v2:

   Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-13.78220  -0.24330  -0.10020  -0.10871   0.02173  14.81740

The above is from a single file. When I repeat this for four files, one of them shows very tiny differences similar to yours. This gives the impression that the differences I observe are triggered by a characteristic of the recording: it is not clippingscore or nonwearscore, because all four recordings are identical on those. However, when looking at the timing of the outliers, I see a pattern that correlates with the chunks being read (approx. 24 hours):

[screenshot: timing of anglez outliers, correlating with the ~24-hour chunk boundaries]

Preliminary conclusions:

  • The changes you suspect to drive the difference are not driving the difference I observe here.
  • The differences I observe may relate to how time is accounted for or time segments are selected.

Observation 2

Then I looked at whether the pagestart and pageend for each chunk are different between branches:

In part1-simplification-v2 I get following page indices for GENEActiv:

1 -> 24874
24875 -> 49748
49749 -> 74622
...

while for master branch I see:
1 -> 24875
24876 -> 49750
49751 -> 74625

This is explained by the -1 you added in line 52 of g.readaccfile. However, when I undo it I still get the differences for anglez as described above, so the -1 is not the cause of the issue.

Observation 3

One level higher: objects LD and use in g.getmeta are consistent between part1-simplification-v2 (without the -1) and the master branch.

For object data I compared its values:

  • Right before it is appended with the object S => at this point they are identical.
  • Right after get_starttime_weekday_truncdata => there is a time lag of less than a second; ccf gives a maximum correlation at a lag of 42 samples for one axis, which for this study is half a second.

Observation 4

I then dived into get_starttime_weekday_meantemp_truncdata.R and this led to the observation that previously the decimal places in the seconds of the starttime got dropped while your revised code accounts for them.

Conclusion

It is the rounding of the decimal places in the seconds of the start time that time-shifted the data by 0.5 seconds, and with it the windows over which metrics and nonwear are derived, which explains why values can differ.

  • The reason I did not find the difference in one of the four recordings earlier in this comment is that that recording starts exactly on the second, without decimal places. This may also explain why you did not find the issue in your test recording.
  • The fact that the difference did not disappear in your test branch is because you did not reverse this part of the code.
  • I have not tested ActiGraph gt3x again, but I suspect that the exact same problem occurs there too, because both data types followed the same steps in get_starttime_weekday_meantemp_truncdata.R.
  • The temporal relationship in the timing of the large anglez differences I found (plot above) may in hindsight have misled me: we expect larger differences when there is more rotation, and rotation follows a 24-hour cycle, so this was not an indication of chunk-selection errors.

Great, so now we know that the change in values is expected!

@vincentvanhees
Copy link
Member

vincentvanhees commented Feb 13, 2024

I can confirm that the minor gt3x discrepancy is clarified by the test-only-dont-merge branch:

  • The gt3x file I am testing comes with time gaps, which are handled differently. When comparing M$QClog I see that we now (as expected) identify/impute a slightly longer time gap, less than a second of difference.
  • Slightly different window selection for nonwear and clippingscore: +2 samples longer, if I understand correctly.

> I can't quite put my finger on what exactly is causing the rounding issue though. I'll see if I can figure it out tomorrow.
> And I'll do some more testing tomorrow with different file types.

Just to confirm: these tiny differences between the branches are specific to a GENEActiv recording started at a full second. In that case I see differences in ENMO ranging from -3.000e-04 to 3.000e-04. I also cannot think of an explanation for now...

As far as I am concerned this PR can be merged but let's leave it open till the end of the week in case we come up with additional tests we want to run prior to merging.

l-k- commented Feb 14, 2024

> I then dived into get_starttime_weekday_meantemp_truncdata.R and this led to the observation that previously the decimal places in the seconds of the starttime got dropped while your revised code accounts for them.

Oh wow, thank you for the analysis, and for figuring this out!

I just tested with a file that was shared with us on the Google group; starttime$sec there was 36.5, and you are right: in the master branch the 0.5 got trimmed off, which caused quite large discrepancies in metalong$nonwearscore, metashort$anglez and metashort$ENMO, all explained by rounding the seconds to an integer.
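The truncation can be reproduced in a couple of lines of base R (the date below is made up; only the 36.5-second value comes from the file discussed above):

```r
# Fractional seconds parse fine, but truncating to whole seconds (the old
# behaviour) shifts every subsequent sample by the dropped fraction.
t_full  <- as.POSIXct("2024-01-31 10:15:36.5", tz = "UTC")  # starttime$sec == 36.5
t_trunc <- as.POSIXct(trunc(t_full, units = "secs"))         # 36.5 -> 36
as.numeric(t_full) - as.numeric(t_trunc)  # 0.5 second shift
```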

l-k- commented Feb 14, 2024

@vincentvanhees @jhmigueles thank you both for testing!

I agree with keeping this PR open till the end of the week; I'll be testing more.

l-k- force-pushed the part1-simplification-v2 branch from a558678 to f1f6f06 on February 14, 2024 05:01
l-k- force-pushed the part1-simplification-v2 branch from 5881578 to 47fdd94 on February 14, 2024 05:20
l-k- mentioned this pull request on Feb 14, 2024
@vincentvanhees (Member) commented:
One more thing that is missing is:

  • Update to the changelog in NEWS.md now that we know that some changes impact consistency with previous versions.

l-k- commented Feb 14, 2024

@vincentvanhees I found the cause of the tiny differences I was seeing in metalong$EN, metashort$anglez and metashort$ENMO for GENEActiv bin and for Movisens files.

In the master branch for these file types, mean temperature was calculated before the data was truncated by get_starttime_weekday_meantemp_truncdata(), so it was calculated on a slightly longer chunk of temperature. And this caused XYZ data to be scaled using a slightly different meantemp value.

I added one more commit to the test-only-dont-merge branch, and I now get identical metalong and metashort results using that branch and master, for GENEActiv bin and Movisens files I was using.

I'll keep testing on other data I can dig up.

l-k- commented Feb 15, 2024

I will add a NEWS.md entry on Thursday.

l-k- commented Feb 16, 2024

Another small issue that got accidentally fixed by this PR: for Axivity .cwa files, when XYZ values were scaled in g.getmeta() using calibration coefficients, the temperature parameters weren't used. Temperature was used to calculate calibration coefficients in g.calibrate(), but later in g.getmeta() these temperature-related coefficients weren't applied for re-scaling.

l-k- force-pushed the part1-simplification-v2 branch from 95238ea to ae35547 on February 16, 2024 05:05
l-k- commented Feb 16, 2024

@vincentvanhees I just bumped into an issue with an Axivity csv file. I'll get it resolved on Friday.

l-k- force-pushed the part1-simplification-v2 branch from 351c34b to 4b4f330 on February 16, 2024 12:45
@vincentvanhees (Member) commented:
Additional fixes look good. Thanks again for all the work Lena, I will now merge this PR.
