[Tensor] Int4QTensor with quantized 4-bit integer data type #2895
base: main
Conversation
Glad to have many negative unit-test TCs as well. All good!
nntrainer/tensor/int4_tensor.cpp
Outdated
/// @todo this func should be template function
void Int4QTensor::addValue(unsigned int b, unsigned int c, unsigned int h,
                           unsigned int w, float value, float beta) {
  auto const &idx = getIndex(b, c, h, w);
  float output = getValue(idx);
  output *= beta;
  output += value;

  // if result value is out of range, clamp to max/min value
  int8_t val = std::trunc(std::clamp((int)output, -8, 7));

  // encode result value to int8 data
  ((int8_t *)getData())[idx / 2] =
    (idx % 2 == 0) ? (val << 4) | (((int8_t *)getData())[idx / 2] & 0x0f)
                   : (((int8_t *)getData())[idx / 2] << 4) | (val & 0x0f);
}
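getValue is not shown in this excerpt; as a minimal sketch of the read side, here is how a signed 4-bit value could be decoded under the same layout (even indices in the high nibble, odd indices in the low nibble). This is a hypothetical free function for illustration, not the actual nntrainer API:

```cpp
#include <cstdint>

// Hypothetical helper (not part of Int4QTensor): decode the signed 4-bit
// value stored at logical index idx in a packed int8 buffer.
int8_t decode_int4(const int8_t *data, unsigned int idx) {
  int8_t byte = data[idx / 2];
  // Move the target nibble into the high four bits: even indices are
  // already there; odd indices are shifted up from the low nibble.
  int8_t shifted = (idx % 2 == 0) ? byte : static_cast<int8_t>(byte << 4);
  // Arithmetic right shift sign-extends the nibble back into [-8, 7].
  return static_cast<int8_t>(shifted >> 4);
}
```

The arithmetic right shift is what recovers negative values: a nibble of 0b1000 placed in the high bits sign-extends to -8 rather than staying at 8.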
Quick question: do we just expect the user to account for the scale factor in the input float value and float beta?
I am curious how basic math on an Int4Q tensor works...
Thanks for asking! Currently, no. This is to modify the quantized value directly.
nntrainer/tensor/int4_tensor.cpp
Outdated
// encode result value to int8 data
((int8_t *)getData())[idx / 2] =
  (idx % 2 == 0) ? (val << 4) | (((int8_t *)getData())[idx / 2] & 0x0f)
                 : (((int8_t *)getData())[idx / 2] << 4) | (val & 0x0f);
As I understood, the computation should be:

- : (((int8_t *)getData())[idx / 2] << 4) | (val & 0x0f);
+ : (((int8_t *)getData())[idx / 2] & 0xf0) | (val & 0x0f);

I'm quite confused by it. Please let me know if I'm wrong :)
You're right! Thanks for pointing it out :)
nntrainer/tensor/int4_tensor.cpp
Outdated
(idx % 2 == 0) ? (val << 4) | ((int8_t *)getData())[idx / 2]
               : ((int8_t *)getData())[idx / 2] | (val & 0x0f);
I think we need to clear out the bits where we want to write the new value:

- (idx % 2 == 0) ? (val << 4) | ((int8_t *)getData())[idx / 2]
-                : ((int8_t *)getData())[idx / 2] | (val & 0x0f);
+ (idx % 2 == 0) ? (val << 4) | (((int8_t *)getData())[idx / 2] & 0x0f)
+                : (((int8_t *)getData())[idx / 2] & 0xf0) | (val & 0x0f);
makes sense 👍
This pull request presents the Int4QTensor class, a solution for efficiently storing quantized 4-bit integer data. By packing two 4-bit integers into each 8-bit memory slot, we use memory effectively: the first four bits hold the first 4-bit value and the last four bits hold the second.

1. Build test: [X] Passed [ ] Failed [ ] Skipped
2. Run test: [X] Passed [ ] Failed [ ] Skipped

Signed-off-by: Donghyeon Jeong <dhyeon.jeong@samsung.com>
Force-pushed from d60db7a to f526f3a.