Add size-limited strings and varying bit-width integer Value_Types to in-memory backend and check for ArithmeticOverflow in LongStorage #7557
Conversation
So I was not completely sure what to do with the case of arithmetic overflow. We essentially have two cases: 64-bit integers and smaller types.

For 64-bit overflow we cannot do much, as we don't have a bigger type yet. The current behaviour is the standard 'modular' wrap-around, but I think that is wrong as it can lead to silent data corruption, so I replace the value instead of wrapping it around.

Now, what if I add two 16-bit values together and they overflow the 16-bit type? There I have a few more options - I could widen the storage type to 32-bit or 64-bit to make the result fit. However, there is no clear heuristic for when to do so. Do I widen 'on demand'? That would make the result type unpredictable - I start with two 16-bit columns and can end up with anything. Or do I widen always - maybe all operations should just return 64-bit integers? But then I very quickly 'lose' the smaller bit-widths: even if I had small columns, they would be 'upcast' after any operation is performed on them, and to get them back to a small type I would need to cast again.

I think that always up-converting to 64-bit has some merit. It is definitely simpler to implement, we will encounter overflows less often this way, and the user can always re-cast afterwards. On the other hand, the smaller types then become 'second-class citizens', because we immediately 'escape' them.

What do you think @jdunkerley @GregoryTravis? I'm sure that for 64-bit overflows we want warnings and so on. But for the smaller types - do we also check for overflow, or do we up-cast them to 64-bit on every operation? I imagine I will implement the 64-bit logic first as it's simpler, but I'm wondering what to do with these smaller types.
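(For illustration only, a minimal sketch of what the 64-bit overflow check could look like at the Java level, using the JDK's Math.addExact; this is just a sketch of the idea, not the actual LongStorage code, and the helper name checkedAdd is made up.)

    public class OverflowCheckSketch {
        // Adds two longs; on overflow, returns null (standing in for a missing value
        // plus an attached warning) instead of silently wrapping around.
        static Long checkedAdd(long a, long b) {
            try {
                return Math.addExact(a, b);
            } catch (ArithmeticException e) {
                // Here a real implementation would report an overflow problem/warning.
                return null;
            }
        }

        public static void main(String[] args) {
            System.out.println(checkedAdd(1, 2));              // 3
            System.out.println(checkedAdd(Long.MAX_VALUE, 1)); // null instead of wrapping to Long.MIN_VALUE
        }
    }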
I agree that widening to 64-bit by default is the better move. The most common use case is that the original data uses a narrow integer type, but after it's read, the user doesn't need it to stay that way. The second most common use case is that the user needs to compute something with narrow types and write the results back to a column with a narrow type. In that case they'll get a clear warning/error that they tried to write 64-bit integers to a narrow column, and they'll have to cast.
I think you are right. I will amend the tests tomorrow. Amended the tests in commit 66508b1, and then in abfbf0b I added that even the
Force-pushed from 2a66410 to 66508b1 (compare)
if non_trivial_types_supported then
    src = source_table_builder [["X", [1, 2, 3]], ["Y", ["a", "xyz", "abcdefghijkl"]], ["Z", ["a", "pqrst", "abcdefghijkl"]]]
    ## TODO [RW] figure out what semantics we want here; I think the current one may be OK but it is going to
       be slightly painful, so IMO an auto-conversion could be useful. We could make it so that we do
       auto-conversion (cast), but in a more strict mode such that if anything does not fit
       (even just string padding required) we fail hard and tell the user to fix this.
    Test.specify "fails if the target type is more restrictive than source" <|
        result = src.update_database_table dest update_action=Update_Action.Insert key_columns=[]
        result.should_fail_with Column_Type_Mismatch

    but_maybe="I'm not sure if we want this automatic restriction. If anything, we should probably report situations like abcdefghijkl being truncated."
    Test.specify "should warn if the target type is more restrictive than source and truncation may occur" pending=but_maybe <|
        result = src.update_database_table dest update_action=Update_Action.Insert key_columns=[]
        IO.println (Problems.get_attached_warnings result)

        result.column_names . should_equal ["X", "Y", "Z"]
        result.at "X" . to_vector . should_contain_the_same_elements_as [1, 2, 3]
        result.at "Y" . to_vector . should_contain_the_same_elements_as ["a", "xyz", "abc"]
        result.at "Z" . to_vector . should_contain_the_same_elements_as ["a ", "pqrst", "abcde"]
So this is not really for this PR, hence I'm not going to block this PR on it.
But I just realised that if we have a target table with e.g. 16-bit integers, a table created in Enso will be 64-bit by default, so, as shown in the test above, uploading it will raise an error and require the user to cast.
I'm wondering if it would not be better to try casting automatically: if the cast produces any warnings, error hard and get the user to resolve the situation, but if the types fit, I imagine we could do the cast automatically for convenience.
That will only work in-memory, though, where we can easily check whether the cast succeeded without warnings. In a database we would risk losing data, so there we surely cannot do this automatic conversion.
@jdunkerley do you think such a convenience auto-cast on upload is worth it? If so, I would appreciate a ticket being created for it - or just let me know and I'll create one once I'm back from vacation.
I think this is worth doing; if the user has a target table with a narrow type, they are likely trying to write data that they believe will fit, so this is a common case.
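(Again, just an illustration of the idea rather than the real Table code: the hypothetical helper below shows the kind of bound check an automatic narrowing cast could perform on 64-bit values before deciding whether to fail hard.)

    import java.util.List;

    public class NarrowingCheckSketch {
        // Returns true only if every (non-null) value fits the signed range of the
        // given bit-width (e.g. 16 for Integer 16-bit, 32 for Integer 32-bit).
        static boolean fitsInBits(List<Long> values, int bits) {
            long min = -(1L << (bits - 1));
            long max = (1L << (bits - 1)) - 1;
            for (Long v : values) {
                if (v != null && (v < min || v > max)) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            List<Long> column = List.of(1L, 2L, 40_000L);
            // 40_000 does not fit a signed 16-bit column, so an automatic narrowing
            // cast should fail hard here instead of silently truncating the data.
            System.out.println(fitsInBits(column, 16)); // false
            System.out.println(fitsInBits(column, 32)); // true
        }
    }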
test/Table_Tests/src/Common_Table_Operations/Join/Union_Spec.enso (outdated; resolved)
...table/src/main/java/org/enso/table/data/column/operation/map/MapOperationProblemBuilder.java (resolved)
...bits/table/src/main/java/org/enso/table/data/column/storage/numeric/AbstractLongStorage.java (resolved)
Force-pushed from b710f25 to 90f44a9 (compare)
I amended the default iteration counts as I was seeing insufficient warmup leading to inconsistent benchmark results. Results afterwards:
We can see that as long as there is no overflow, both approaches have comparable performance. Of course, in case of overflow the new approach is slower. In my setup I set it so that 20% of the rows overflow, and with that the slowdown is about 2.6x. It will likely depend on the percentage of rows that overflow (<1% will likely cause little overhead, whereas 100% will increase it significantly) and on other factors, like the base stack size, which influences how costly it is to throw the exception. Still, here we are comparing against essentially 'incorrect' behaviour.

Raw results (develop and this PR) are not reproduced here.
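(A standalone toy sketch, not the project's benchmark harness, just to illustrate why the overhead scales with the fraction of overflowing rows: the checked loop pays the exception cost only for elements that actually overflow.)

    public class OverflowCostSketch {
        // Toy comparison of wrap-around addition vs. checked addition (Math.addExact).
        // Timings are only indicative; a real benchmark needs proper warmup and forking.
        public static void main(String[] args) {
            int n = 10_000_000;
            long[] xs = new long[n];
            for (int i = 0; i < n; i++) {
                // Every 5th element (20% of rows) is set up to overflow when 1 is added.
                xs[i] = (i % 5 == 0) ? Long.MAX_VALUE : i;
            }

            long t0 = System.nanoTime();
            long acc1 = 0;
            for (long x : xs) acc1 ^= (x + 1); // silent wrap-around, never throws

            long t1 = System.nanoTime();
            long acc2 = 0;
            for (long x : xs) {
                try {
                    acc2 ^= Math.addExact(x, 1);
                } catch (ArithmeticException e) {
                    // Overflowing row: a real implementation would record a warning
                    // and store a missing value instead.
                }
            }
            long t2 = System.nanoTime();

            System.out.printf("wrap-around: %d ms, checked: %d ms (acc %d %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, acc1, acc2);
        }
    }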
Force-pushed from a8cf52f to 2659f97 (compare)
Pull Request Description
The smaller integer types are for now still stored as longs; we just check their bounds. Changing the underlying storage for memory efficiency may come in the future: In-Memory Table: Store Integer 32-bit and Integer 16-bit more efficiently. Add support for Float 32-bit. #6109
Linked: union of two tables reports an "unexpected storage implementations" error when add_row_number precedes the union #7565

Important Notes
Checklist
Please ensure that the following checklist has been satisfied before submitting the PR:
- All code follows the Scala, Java, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
- Enso GUI was tested when built using ./run ide build.