Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow scientific notation in written numbers #70

Closed
wants to merge 6 commits into from

Conversation

ericphanson
Copy link
Member

Update to #30: attempts to allow writing out scientific notation where legal (i.e. not onset or duration fields).

motivated by encountering

ErrorException(\"failed to fit number into EDF's 8 ASCII character limit: 4.180821e-7\")

when writing out an EDF. We don't know why or which field this was in though.

This PR changes the semantics:

  • if allow_scientific=false, which I have set by default, then underflow to 0 is allowed. This was formerly disallowed in [Bugfix] add option to truncate values with size larger than spec #30 (although I did not see a discussion why). The former semantics were a bit weird in that rounding/truncation was OK except when the result was truncated to 0, in which case it errored.
  • if allow_scientific=true, then we use %G in sprintf to allow either of scientific notation or decimal representation. This shouldn't really underflow nor with any kind of Float64 representable number, but could if somehow one got a BigFloat or such in there.

I also set the code to allow scientific notation with all floating-point fields of the headers (but with onset/durations of annotations).

Would appreciate thoughts on this & careful review

@ericphanson
Copy link
Member Author

ericphanson commented Oct 13, 2023

Is this too slow?

  • the allow_scientific=true path is only used for headers, so should not be a perf concern
  • the allow_scientific=false path is used for annotations, so could be a perf concern, if it is slow to call for several million numbers

Let us benchmark.

Benchmark

Try every Float16 50 times, which is 1.5 million numbers

function all_floats(; n=50)
    out = Char[]
    sizehint!(out, 30720 * n)
    for _ = 1:n
        x = floatmin(Float16)
        while x <= floatmax(Float16)
            str = EDF._edf_repr(Float64(x))
            push!(out, str[1])
            x = nextfloat(x)
        end
    end
    return out
end

PR

julia> @time length(all_floats())
  0.731282 seconds (9.98 M allocations: 1.442 GiB, 7.50% gc time)
1536000

julia> @time length(all_floats())
  0.727198 seconds (9.98 M allocations: 1.442 GiB, 7.72% gc time)
1536000

julia> @time length(all_floats())
  0.736517 seconds (9.98 M allocations: 1.442 GiB, 7.93% gc time)
1536000

Main

julia> @time length(all_floats())
  0.344242 seconds (12.52 M allocations: 889.876 MiB, 6.08% gc time)
1536000

julia> @time length(all_floats())
  0.345693 seconds (12.52 M allocations: 889.876 MiB, 5.95% gc time)
1536000

julia> @time length(all_floats())
  0.347998 seconds (12.52 M allocations: 889.876 MiB, 6.80% gc time)
1536000

Conclusion

So it is ~2x slower, but still fast enough that it doesn't seem like a big issue.

@ericphanson
Copy link
Member Author

Is allowing underflow to 0 dangerous?

  • We allow this now when allow_scientific=false. This is used for onsets & annotations, and for non-floating-point types.
    • for onsets and annotations, a sufficeintly small time like 1e-7 does seem OK to round down to 0. If the onset is right at the start of the recording, marking that as time 0 seems OK. If an annotation has very small duration, since EDF does allow 0 or even missing duration, rounding that down to 0 does seem OK too.
  • when allow_scientific=true, we can represent even floatmin(Float64) with a few digits of precision. This is used for floating point fields in the headers.

Therefore I think it is probably OK, given the we can see all the usages and check them individually.

@ericphanson
Copy link
Member Author

Is scientific notation even allowed in the EDF+ spec?

from https://www.edfplus.info/specs/guidelines.html#:~:text=Numbers%20in%20EDF(%2B)%20headers,3%20(Okt%202004) they say

2a, E-notation (May 2009). Numbers in EDF(+) headers may have the scientific E notation as in 1E2345, +012E+34, -1.34E09 and +1.234E-5. Note, though, that the 8 characters of some EDF(+) numberfields are used more efficiently by -123.456 uV than by -1.23E-4 V.

So it seems it is. We do not use it for non-header fields such as annotation durations and onsets.

src/write.jl Outdated
Comment on lines 5 to 9
# `allow_scientific` is only meaningful for `value::Number`. We allow passing it though,
# so `edf_write` can be more generic.
_edf_repr(value::Union{String,Char}; allow_scientific=nothing) = value
_edf_repr(date::Date; allow_scientific=nothing) = uppercase(Dates.format(date, dateformat"dd-u-yyyy"))
_edf_repr(date::DateTime; allow_scientific=nothing) = Dates.format(date, dateformat"dd\.mm\.yyHH\.MM\.SS")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually keep the existing methods as they are and simply add a dispatch layer like so:

_edf_repr(value::Number, allow_scientific::Bool) = # implementation
_edf_repr(value, ::Any) = _edf_repr(value)

That way the methods that don't care about scientific notation don't need to include it in their signatures.

src/write.jl Outdated
_edf_repr(date::DateTime; allow_scientific=nothing) = Dates.format(date, dateformat"dd\.mm\.yyHH\.MM\.SS")

"""
sprintf_G_under_8(x) -> String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name should be all lowercase per the style guide, though it's generally rather odd; perhaps there's something more descriptive of what it does rather than how it's implemented?

src/write.jl Outdated Show resolved Hide resolved
@kleinschmidt
Copy link
Member

Is scientific notation even allowed in the EDF+ spec?

from https://www.edfplus.info/specs/guidelines.html#:~:text=Numbers%20in%20EDF(%2B)%20headers,3%20(Okt%202004) they say

2a, E-notation (May 2009). Numbers in EDF(+) headers may have the scientific E notation as in 1E2345, +012E+34, -1.34E09 and +1.234E-5. Note, though, that the 8 characters of some EDF(+) numberfields are used more efficiently by -123.456 uV than by -1.23E-4 V.

So it seems it is. We do not use it for non-header fields such as annotation durations and onsets.

Whether it's mentioend in the spec and whether it's supported by other readers is a different thing. Maybe we don't care about compatibility with say MNE and friends but we should at least try scientific notation exports and see if our preferred viewers can view the generated files

Copy link
Member

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally not opposed to this if we can check whether or not we're generating un-readable output, both for standard viewers (persyst and EDFBrowser) and for the python tools (MNE).

src/types.jl Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
Comment on lines -249 to +305
# It is similiar to a `EDF Annotations` file except that
# It is similiar to a `EDF Annotations` file except that
# The `ANNOTATIONS_SIGNAL_LABEL` is `BDF Annotations`.
# The test data has 1081 trigger events, and
# has 180 trials in total, and
# The test data has 1081 trigger events, and
# has 180 trials in total, and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert these whitesapce changes to keep blame clean

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've configured vscode to remove trailing whitespace, so I will need to do this at the end (otherwise if I touch the file again it will re-introduce it). So your comment is noted but I won't address it until the PR is ready to merge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there no autoformatting set up on this repo? it's possible given that it's kind of ancient...

@ericphanson
Copy link
Member Author

One thing I realized is that in TAL, there isn't the same 8 character limits (as far as I can tell in the spec). And that is the only place we currently force the decimal printing with floating-point numbers. So we could try raising the limit there (or just do a regular @sprintf("%f", x)).

@ericphanson
Copy link
Member Author

This actually does fix my internal bug, which now has an MWE at #74, and results in EDFs that are loadable in EDFBrowser at least (and which pass it's validation check) but I don't think this approach is actually necessary because we should just allow rounding to 0 for fields like physical_min. So I think a different fix would be more direct here as well as less concerning about portability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants