This memo intends to illustrate problems associated with the type system in C, provide advices on avoiding various pitfalls, and provide advices on having a consistent coding style.
In the following text, all C codes are compiled using clang
with the flags
-Wall -Wextra
under both ILP32 and LP64 compilation environments.
According to the 2nd edition of "The C Programming Language":
There are only a few basic data types in C:
char
: a single byte, capable of holding one character in the local character set.
int
: an integer, typically reflecting the natural size of integers on the host machine,
float
: single-precision floating point,
double
: double-precision floating point.In addition, there are a number of qualifiers* that can be applied to these basic types.
short
andlong
apply to integers ... The intent is thatshort
andlong
should provide different lengths of integers where practical;int
will normall be the natural size for a particular machine.short
is often 16 bits,long
32 bits, and int either 16 or 32 bits. Each compiler is free to choose appropriate sizes for its own hardware subject only to the restriction thatshort
s andint
s are at least 16 bits,long
are at least 32 bits andshort
is no longer thanint
, which is no longer thanlong
.The qualifier
signed
orunsigned
may be applied tochar
or any integer....
The type
long double
specifies extended-precision floating point. ...
*: not in the specification sense of qualifiers
which refers to const
, volatile
, etc.
Over the course of standard development, new types had been added,
such as long long
and _Complex
in C99, and the _Decimal*
types in C2X.
Floating point numbers will not be the subject of this memo, interested readers are recommended to purchase "Handbook of Floating Point Arithmetic" by J-M Muller et.al. to learn more.
Obviously, as these integer types have implementation-defined sizes, they
cannot be used portably in definitions of data structures intended for
exchange with foreign computer systems. Thus C99 introduced the <stdint.h>
header for new type definitions, and <inttypes.h>
which "extends it with
additional facilities provided by hosted implementations".
The "usual arithmetic conversions" governs how operands' types are determined in arithmetic operations, and in the case of integers, also bitwise operations.
The ruleset for floating point and pointer operands are not the concern of this memo.
The ruleset (as of C17) for integers can be described as follow:
- after integer promotionn
(converting lesser types to
int
orunsigned int
), - identical types are preferred; over
- types of same signedness with the higher rank; over
- unsigned type of same or higher rank; over
- signed type wider than the unsigned type; over
- unsigned type with rank corresponding to the signed type.
Now a bit of my own interpretation: bullet points 3-5 are a bit of convoluted but in simple terms it just means "unsigned types are preferred unless that other signed type can represent all my values", condition being "if the signed type have higher rank than the unsigned type".
Let's illustrate this with an example. Suppose we have a long long
and
an unsigned long
, on a ILP32 system, they'd be converted to long long
because it's 64-bit and can represent all the values of both types;
but on a LP64 system, both types are of same width, so they'd result in
unsigned long long
.
**Table of usual arithmetic conversoins applied to integer types under
various data models**
Type Abbreviations:
{s,u}{c,s,i,l,ll}: {signed,unsigned}{char,short,int,long,long long}
Where signedness and rank differs from ILP32 are highlighted in upper case.
sc ss si sl sll uc us ui ul ull
SI16:
sc si si si sl sll si Ui ui ul ull
ss si si si sl sll si Ui ui ul ull
si si si si sl sll si Ui ui ul ull
sl sl sl sl sl sll sl sl sl ul ull
sll sll sll sll sll sll sll sll sll sll ull
uc si si si sl sll si Ui ui ul ull
us Ui Ui Ui sl sll Ui Ui ui ul ull
ui ui ui ui sl sll ui ui ui ul ull
ul ul ul ul ul sll ul ul ul ul ull
ull ull ull ull ull ull ull ull ull ull ull
ILP32:
sc si si si sl sll si si ui ul ull
ss si si si sl sll si si ui ul ull
si si si si sl sll si si ui ul ull
sl sl sl sl sl sll sl sl ul ul ull
sll sll sll sll sll sll sll sll sll sll ull
uc si si si sl sll si si ui ul ull
us si si si sl sll si si ui ul ull
ui ui ui ui ul sll ui ui ui ul ull
ul ul ul ul ul sll ul ul ul ul ull
ull ull ull ull ull ull ull ull ull ull ull
LP64:
sc si si si sl sll si si ui ul ull
ss si si si sl sll si si ui ul ull
si si si si sl sll si si ui ul ull
sl sl sl sl sl sll sl sl Sl ul ull
sll sll sll sll sll sll sll sll sll ull ull
uc si si si sl sll si si ui ul ull
us si si si sl sll si si ui ul ull
ui ui ui ui Sl ull ui ui ui ul ull
ul ul ul ul ul ull ul ul ul ul ull
ull ull ull ull ull ull ull ull ull ull ull
The rules used to be simpler in ANSI C (C89), as there was no long long
integer rank. In the 2nd edition of "The C Programming Language"", it is
summarized in Appendix C:
The "usual arithmetic conversions" are changed [from the 1st edition], essentially from "for integers,
unsigned
always wins; for floating point, always use double" to "promote to the smallest capacious-enough type." ...
And in Appendix A section 6.5:
There are two changes [from the 1st edition] here. First, arithmetic on
float
operands ... Second, shorter unsigned types, when combined with a larger signed type, do not propagate the unsigned property to the result type; ... . The new rules are slightly more complicated, but reduce somewhat the surprises that may occur when an unsigned entity meets signed. Unexpected results may still occur when an unsigned expression is compared to a signed expression of the same size.
Consider the following example:
Example 3-1
_Bool lcmp(unsigned long a, long long b)
{
return a > b;
}
On ILP32 systems, this code compiles just fine, on LP64 environment, the compiler gives the following warning:
warning: comparison of integers of different signs: 'unsigned long' and 'long long' [-Wsign-compare]
...
1 warning generated.
More often, the warning is issued for a
with a typedef
type such as
size_t
. In order to avoid the warning and possible erroneous behaviors,
we must get back to what was originally intended.
- Possible Intention 1:
uint32_t
was intended fora
.
If this was an old codebase, assumptions might have been made for the
size of types, as such, when the new exact-width types are introduced in C,
those new types should be used. The function may therefore use uint32_t
and
int64_t
respectively for a
and b
.
- Possible Intention 2:
size_t
was intended fora
, andintmax_t
forb
.
If b
is meant to hold all possible values for some predicate, and a
is the
size of some object, then this can reasonably occur.
For finite values, it's reasonable to expect b
be converted to unsigned.
But there is also some plausible edge cases, like (a=0) > (b=-1)
.
As mentioned before, if this is a ILP32 system, such edge case would be safe, but on LP64 systems, it's best if both are converted to signed 64-bit types. This is recommended for 2 reason:
-
Even 2^63 is a very big number, it's unlikely any number above it will ever be needed;
-
It's more likely to catch unintentional overflow this way.
It should be noted firstmost that this is not the correct way to make sure application logics are correct, if the application ever depends on such comparison, it's a clear sign that there's some inconsistency. It's strongly advised the components relying on it be redesigned and rewritten.
But, when overly large a
occur, converting it into signed will most likely
cause the lcmp
to return false, thus preventing some incorrect behavior
in some limited situations.
However, this is just my personal opinion, if you find exception cases to it, you're more than welcome to tell me about it, and I'll pay you my appreciation.
- Possible Intention 3: still trying to learn C programming.
If so, they're recommended to read and learn more.
Consider the following example:
Example 3-2
_Bool icmp(unsigned int a, long b)
{
return a > b;
}
Unlike the previous case, this code fragment doesn't give off warning in LP64, but in ILP32; also, the types are 32-bit now, so the suggestions from the previous case cannot apply. So what could be possibly intended?
- Possible Intention 1: The code is intended for both ILP32 and LP64.
In this case, it can be expected that b
is actually something like ssize_t
.
If so, the suggestion from the previous section can apply, convert both to a
signed type.
- Possible Intention 2: The code is an adaption from SI16 to ILP32.
If this is the case, the range of possible values to be compared are indeed
[-2^31, 2^32)
, then we have to add a special condition to make it work:
_Bool icmp(unsigned int a, long b)
{
return b < 0 || a > b;
}
Also, to silence warnings, explicit type casting should be added.
Because the standard leaves the order of bit allocation in bitfield members unspecified, it's often to see applications sensitive to the representation of type values use integer words, bitwise operators, and endian conversion functions to build bitfield data portably.
Consider the following example:
Example 3-3
uint64_t set(uint64_t a, uint16_t b)
{
a &= ~(UINT64_C(0xffff) << 24;
a |= b << 24;
return a;
}
The intention in this example is clear - to set the 2 middle bytes of a
to
b
. While the compiler doesn't give off warnings on this code, there's
something not quite right with it.
On ILP32 and LP64 systems, b
is first promoted to int
as the 32-bit int
can represent all values of uint16_t
, then, after left-shifting 24 bits,
bits 32-39 are truncated, and bits 32-63 are sign-extended from bit 31 as b
now have the signed type int
.
So, it's a surprise to see that the value of the expression set(0, -1)
is
0xffffffffff000000
The fix is simple though. Simply add a type cast to b
, or if it's going to be
used multiple times, assign it to a local variable of type uint64_t
,
like this:
uint64_t set_v1(uint64_t a, uint16_t b)
{
a &= ~(UINT64_C(0xffff) << 24);
a |= (uint64_t)b << 24;
return a;
}
uint64_t set_v2(uint64_t a, uint16_t b)
{
uint64_t w = b;
a &= ~(UINT64_C(0xffff) << 24);
a |= w << 24;
...
return a;
}
-
Built-in types are the foundation from which type conversion rules are defined on top of, and is not particularly useful for portability.
-
Types with concrete names such as
size_t
andtime_t
from C, anduid_t
,gid_t
,pid_t
,id_t
, andmode_t
,off_t
, etc. from POSIX are more useful. These should be the preferential choice, and ought to be used correctly - IDs should be compared only for equality and not for relation; bitmasks should only be applied bitwise operators; multiplication should not be applied to types representing absolute positions. -
Only types with exact widths (
uint8_t
,uint16_t
, etc.) should be used in definitions for data structures intended for exchange with foreign systems.