Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix overflow when calling parsec_data_create #646

Merged
merged 1 commit into from
May 10, 2024

Conversation

QingleiCao
Copy link
Contributor

@QingleiCao QingleiCao commented Mar 15, 2024

@QingleiCao QingleiCao requested a review from a team as a code owner March 15, 2024 16:47
Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think that the right approach is to make bsiz size_t.

Also, the build failure is legit (stemming from the added assert):

/tmp/parsec/parsec/parsec/data_dist/matrix/matrix.c: In function 'parsec_tiled_matrix_init':
/tmp/parsec/parsec/parsec/data_dist/matrix/matrix.c:98:31: error: 'INT_MAX' undeclared (first use in this function)
   98 |     assert((size_t)mb * nb <= INT_MAX);
      |                               ^~~~~~~
/tmp/parsec/parsec/parsec/data_dist/matrix/matrix.c:19:1: note: 'INT_MAX' is defined in header '<limits.h>'; did you forget to '#include <limits.h>'?
   18 | #include "parsec/data_dist/matrix/two_dim_tabular.h"
  +++ |+#include <limits.h>

@QingleiCao
Copy link
Contributor Author

I still think that the right approach is to make bsiz size_t.

Also, the build failure is legit (stemming from the added assert):

/tmp/parsec/parsec/parsec/data_dist/matrix/matrix.c: In function 'parsec_tiled_matrix_init':
/tmp/parsec/parsec/parsec/data_dist/matrix/matrix.c:98:31: error: 'INT_MAX' undeclared (first use in this function)
   98 |     assert((size_t)mb * nb <= INT_MAX);
      |                               ^~~~~~~
/tmp/parsec/parsec/parsec/data_dist/matrix/matrix.c:19:1: note: 'INT_MAX' is defined in header '<limits.h>'; did you forget to '#include <limits.h>'?
   18 | #include "parsec/data_dist/matrix/two_dim_tabular.h"
  +++ |+#include <limits.h>

I'm not sure whether there is potential side-effect as @abouteiller pointed out.

@bosilca
Copy link
Contributor

bosilca commented Mar 15, 2024

INT_MAX is not a size_t so even with the correct header the compiler will generate a warning and the comparison will be done in the lesser precisions. You really intend to create data that is over 2GB ?

@devreal
Copy link
Contributor

devreal commented Mar 16, 2024

I'm not sure whether there is potential side-effect as @abouteiller pointed out.

Unless we pass a pointer to bsiz somewhere that expects an int* there should be no side-effect. Compiling with -Wall should help finding any place where that happens, or simply a grep.

INT_MAX is not a size_t so even with the correct header the compiler will generate a warning and the comparison will be done in the lesser precisions.

The int of INT_MAX will be promoted to size_t as per the C integer promotion rules (64bit integer has higher rank than 32bit integer). That assert is legit, albeit not very helpful unless applications run with a debug build of PaRSEC.

@devreal
Copy link
Contributor

devreal commented Mar 16, 2024

A quick grep showed no place where we pass a pointer to bsiz. There are places where we cast bsiz to size_t already:

parsec/data_dist/matrix/two_dim_rectangle_cyclic.c:            pos *= (size_t)dc->super.bsiz;
parsec/data_dist/matrix/two_dim_rectangle_cyclic.c:            pos *= (size_t)dc->super.bsiz;

and in matrix.c we store the result of fread/fwrite in an int and compare against bsiz. Those functions return size_t though, so that could be adjusted as well.

@bosilca
Copy link
Contributor

bosilca commented Mar 16, 2024

Rule 6.3.1.1 applies here:

Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.

The conversion will always work here because we are dealing with a constant with a positive value.

@QingleiCao
Copy link
Contributor Author

You really intend to create data that is over 2GB ?

Yeah, each timeslot includes L^2/2 ZGEMV. the size of each vector is L. Those L^2/2 number of vectors are stored in a file, so I'm loading them together. L can be 720 or 1440. If L = 720, then 720^2/2 * 720 * sizeof(complex double) = 2,985,984,000

@QingleiCao
Copy link
Contributor Author

A quick grep showed no place where we pass a pointer to bsiz. There are places where we cast bsiz to size_t already:

parsec/data_dist/matrix/two_dim_rectangle_cyclic.c:            pos *= (size_t)dc->super.bsiz;
parsec/data_dist/matrix/two_dim_rectangle_cyclic.c:            pos *= (size_t)dc->super.bsiz;

and in matrix.c we store the result of fread/fwrite in an int and compare against bsiz. Those functions return size_t though, so that could be adjusted as well.

A quick grep showed no place where we pass a pointer to bsiz. There are places where we cast bsiz to size_t already:

parsec/data_dist/matrix/two_dim_rectangle_cyclic.c:            pos *= (size_t)dc->super.bsiz;
parsec/data_dist/matrix/two_dim_rectangle_cyclic.c:            pos *= (size_t)dc->super.bsiz;

and in matrix.c we store the result of fread/fwrite in an int and compare against bsiz. Those functions return size_t though, so that could be adjusted as well.

Yes, others directly related to bsiz are already protected by size_t.

@QingleiCao
Copy link
Contributor Author

Any suggestions? Changing bsiz to size_t or this is fine as mb * nb > INT_MAX is rare?

@devreal
Copy link
Contributor

devreal commented Mar 17, 2024

Rule 6.3.1.1 applies here:

Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.

The conversion will always work here because we are dealing with a constant with a positive value.

Let's run through this text: INT_MAX is a signed integer. size_t is an unsigned integer with rank higher than int (32 vs 64b on 64bit archs).

if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand

So yes, this rule applies: size_t is the unsigned interger type that has rank greater than int, which is the signed integer type with lower rank.

then the operand with signed integer type is converted to the type of the operand with unsigned integer type.

This requires that INT_MAX (int) is promoted to the type of the operand with the unsigned type (size_t).

Now, if we were to compare size_t against INT_MIN we would be in trouble because promoting a negative number to an unsigned type of higher rank would create a larger number, because 2's complement. But that's not happening here and it would be a dubious thing to do in the first place.

Any suggestions? Changing bsiz to size_t or this is fine as mb * nb > INT_MAX is rare?

Things that encode counts should be size_t to avoid future overflows. The check we're talking about then becomes irrelevant and we can all go home happy :) there is of course the issue of sending such a tile via MPI, which can be solved via the big count API that hopefully we an use soon. My understand though is that in @QingleiCao's case the data remains local.

@bosilca
Copy link
Contributor

bosilca commented Mar 18, 2024

@devreal I'm confused by your comment, because it started as a stark contradiction to my statement in the end it basically states exactly the same thing. 😕

Anyway, this PR looks good as it is.

@devreal
Copy link
Contributor

devreal commented Mar 18, 2024

@bosilca You said:

INT_MAX is not a size_t so even with the correct header the compiler will generate a warning and the comparison will be done in the lesser precisions.

I don't think that is what the standard says and I hope I did not convey that in my comment :)

@bosilca
Copy link
Contributor

bosilca commented Mar 18, 2024

If the right-hand side was of a different signedness and not a constant, where the compiler can verify that the conversion is safe, then the compiler should have warned that the conversion is unsafe.

@QingleiCao
Copy link
Contributor Author

Rule 6.3.1.1 applies here:

Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.

The conversion will always work here because we are dealing with a constant with a positive value.

Let's run through this text: INT_MAX is a signed integer. size_t is an unsigned integer with rank higher than int (32 vs 64b on 64bit archs).

if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand

So yes, this rule applies: size_t is the unsigned interger type that has rank greater than int, which is the signed integer type with lower rank.

then the operand with signed integer type is converted to the type of the operand with unsigned integer type.

This requires that INT_MAX (int) is promoted to the type of the operand with the unsigned type (size_t).

Now, if we were to compare size_t against INT_MIN we would be in trouble because promoting a negative number to an unsigned type of higher rank would create a larger number, because 2's complement. But that's not happening here and it would be a dubious thing to do in the first place.

Any suggestions? Changing bsiz to size_t or this is fine as mb * nb > INT_MAX is rare?

Things that encode counts should be size_t to avoid future overflows. The check we're talking about then becomes irrelevant and we can all go home happy :) there is of course the issue of sending such a tile via MPI, which can be solved via the big count API that hopefully we an use soon. My understand though is that in @QingleiCao's case the data remains local.

Yeah, in my case, it's local without communication involved.

Well, INT_MAX instead of INT_MIN, right? @devreal

@abouteiller
Copy link
Contributor

there is of course the issue of sending such a tile via MPI, which can be solved via the big count API that hopefully we an use soon.

Indeed, that was my first reaction reading all this. The drawback is that we have to use the large count variants for everything, not just the types.

@abouteiller
Copy link
Contributor

You really intend to create data that is over 2GB ?

Yeah, each timeslot includes L^2/2 ZGEMV. the size of each vector is L. Those L^2/2 number of vectors are stored in a file, so I'm loading them together. L can be 720 or 1440. If L = 720, then 720^2/2 * 720 * sizeof(complex double) = 2,985,984,000

I'm not sure how the patch fixes your issue (at best it delays it to when you want to load a matrix that is 8x larger?), bsiz is still in int and will still be truncated for some (apparently, since less than 1 order of magnitude to the one you use now) reasonable sizes. What is our reasoning for not promoting bsiz to a size_t overall?

@QingleiCao
Copy link
Contributor Author

QingleiCao commented Apr 10, 2024

I will change bsiz to size_t. I think we also need to change the matrix size as well. I spent quite a while debugging an issue because the matrix (1*N) is larger that INT_MAX (N > INT_MAX).

@QingleiCao QingleiCao force-pushed the qinglei/issue645 branch 5 times, most recently from d609ada to b7acf0f Compare May 3, 2024 23:26
@QingleiCao
Copy link
Contributor Author

The matrix size is already cast in the high-level distribution API, e.g., parsec_matrix_sym_block_cyclic_init, so this PR is only about bsiz.

@QingleiCao QingleiCao requested a review from devreal May 3, 2024 23:32
@abouteiller abouteiller merged commit 0612582 into ICLDisco:master May 10, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants