[Wait for #2568][Mixed] Mixed Precision Layer update #2579
Conversation
We will add a Var32 tensor if the variable weight is not full precision (FP32). This enables the weight update in full precision, and only the apply-gradient process uses this tensor; therefore, the lifespan of this tensor should be "ApplyGradient". Modify TensorPool to generate the weight considering mixed precision. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <jijoong.moon@samsung.com>
This PR creates the fp32 variable tensor when we create the Weight and the optimizer weights. It updates the manager to create Weights with the var32 tensor requested from the weight pool, updates the weight requests with the Weight Spec and the var, grad, and var32 tensors that were already created, and adds cloning a Tensor with a specific type in tensor.h. Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <jijoong.moon@samsung.com>
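The scheme above can be pictured with a minimal sketch: when the requested weight type is not FP32, the weight carries an extra fp32 master tensor (var32) next to its fp16 variable and gradient. The struct below is only an illustration of that idea, not nntrainer's actual Weight/TensorPool code, and it assumes a toolchain with the `__fp16` type (e.g. AArch64 GCC/Clang).

```cpp
#include <cstddef>
#include <vector>

// Illustrative container: fp16 working tensors plus an fp32 master copy.
// The var32 buffer is only needed while gradients are applied
// (lifespan "ApplyGradient" in the description above).
struct MixedPrecisionWeight {
  std::vector<__fp16> var;   // working copy used in forward/backward
  std::vector<__fp16> grad;  // gradient, same type as var
  std::vector<float> var32;  // full-precision master copy for the update

  explicit MixedPrecisionWeight(std::size_t len)
    : var(len), grad(len), var32(len, 0.0f) {}
};
```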
This PR enables FP16 support for the layers below: the input layer and the MSE loss layer. Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <jijoong.moon@samsung.com>
This PR includes a mixed precision test case: Input - FC - MSE with "batch_size=2", "model_tensor_type=FP16-FP16", and "loss_scale=128". **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <jijoong.moon@samsung.com>
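As a side note on the `loss_scale=128` setting in this test, the toy program below shows why scaling matters in FP16: a gradient value below fp16's representable range flushes to zero unless the loss (and hence the gradient) is multiplied by the scale first. It is a standalone illustration, not part of the test case, and assumes `__fp16` support.

```cpp
#include <cstdio>

int main() {
  // 2e-8 is below fp16's smallest subnormal (~5.96e-8), so it rounds to 0.
  float tiny_grad = 2e-8f;
  __fp16 lost = static_cast<__fp16>(tiny_grad);          // underflows to 0
  __fp16 kept = static_cast<__fp16>(tiny_grad * 128.0f); // survives when scaled
  std::printf("unscaled: %g, rescaled back: %g\n",
              static_cast<float>(lost),
              static_cast<float>(kept) / 128.0f);
  return 0;
}
```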
This commit modifies apply gradient in the optimizer. We do not need to keep the optimizer variables in the weight type: only the optimizer needs them, and we should update the weight in full precision to maintain accuracy. Therefore, remove the var32 tensors for the optimizer variables. Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <jijoong.moon@samsung.com>
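A minimal sketch of the apply-gradient step this commit describes: the gradient is unscaled, accumulated into the fp32 master copy, and the fp16 working copy is refreshed from it. This is a plain-SGD simplification with illustrative names, not nntrainer's Optimizer code, and again assumes `__fp16`.

```cpp
#include <cstddef>

// Apply an fp16 gradient to the fp32 master weights (var32), then sync the
// fp16 working copy. `loss_scale` undoes the scaling applied to the loss.
void apply_gradient_fp32(float *var32, __fp16 *var, const __fp16 *grad,
                         std::size_t len, float lr, float loss_scale) {
  for (std::size_t i = 0; i < len; ++i) {
    float g = static_cast<float>(grad[i]) / loss_scale; // unscale in fp32
    var32[i] -= lr * g;                                 // full-precision update
    var[i] = static_cast<__fp16>(var32[i]);             // refresh fp16 copy
  }
}
```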
📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2579. Please follow the 1 commit/1 PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before the review by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and the wiki page. In order to monitor the progress status of your PR in more detail, visit http://ci.nnstreamer.ai/.
Edited build instructions for the Resnet18 test. **Fixing the meson build option** Resolves: an error when building the test example where it says `-c is an un-recognized option`; the meson documentation uses -C, so it seems to be a typo. **Self evaluation:** 1. Build test: [ ]Passed [ ]Failed [X]Skipped 2. Run test: [ ]Passed [ ]Failed [X]Skipped Signed-off-by: Udit Jain <udit.jain@samsung.com>
- Implement a 4x4 GEMM kernel that performs f16-f32 partial accumulation **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- Now Hgemm supports a 4x4 f16-f32 partial accumulation strategy **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <ss.kong@samsung.com>
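For reference, a scalar sketch of the f16-f32 partial accumulation idea: products are summed in fp16 for a limited number of steps and periodically flushed into an fp32 accumulator, trading a little accuracy for fp16 arithmetic throughput. This is not the NEON 4x4 kernel itself, just the accumulation strategy, and it assumes `__fp16` support.

```cpp
#include <cstddef>

// C[m][n] = sum_k A[m][k] * B[k][n], with fp16 partial sums flushed to fp32
// every `partial_k` steps (row-major A: MxK, B: KxN, C: MxN).
void hgemm_partial_acc_ref(const __fp16 *A, const __fp16 *B, float *C,
                           std::size_t M, std::size_t N, std::size_t K,
                           std::size_t partial_k = 16) {
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float acc32 = 0.0f; // full-precision accumulator
      __fp16 acc16 = 0;   // short-lived half-precision partial sum
      for (std::size_t k = 0; k < K; ++k) {
        acc16 += A[m * K + k] * B[k * N + n];
        if ((k + 1) % partial_k == 0) { // flush to limit fp16 error growth
          acc32 += static_cast<float>(acc16);
          acc16 = 0;
        }
      }
      C[m * N + n] = acc32 + static_cast<float>(acc16); // flush the tail
    }
  }
}
```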
- With macro-defined code, the function latency is expected to be optimized by the compiler more easily **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <ss.kong@samsung.com>
…16 kernel - With more digits computed in fp16 (in this case 1024 -> 2048), I could observe a latency improvement at the cost of accuracy loss. However, according to the current accuracy measurement criteria, it is still acceptable. Note that it is highly desirable to verify this with model output once more. - With a variety of partial-sum kernels, we can adaptively apply internal macro kernels without being constrained by K-divisibility w.r.t. 4, 8, 16. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <ss.kong@samsung.com>
…8x8 kernel - Apply a similar change to the one made in commit #52a3c734, but to the 8x8 kernel **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <ss.kong@samsung.com>
- To avoid the 4/8 divisibility constraint w.r.t. K, loop adaptively along the K direction. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <ss.kong@samsung.com>
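A small sketch of the adaptive K loop described here: consume K in the widest block a partial-sum kernel supports, then fall through to narrower ones and a scalar tail, so K no longer has to be divisible by 16, 8, or 4. The kernel callables are placeholders, not nntrainer's hgemm functions; each one would compute the corresponding slice of the partial sums for the current tile.

```cpp
#include <cstddef>

// Dispatch over the K dimension in blocks of 16, 8, 4, then 1.
template <typename K16, typename K8, typename K4, typename K1>
void run_over_k(std::size_t K, K16 kernel16, K8 kernel8, K4 kernel4,
                K1 kernel1) {
  std::size_t k = 0;
  for (; k + 16 <= K; k += 16) kernel16(k); // widest partial-sum kernel
  for (; k + 8 <= K; k += 8)   kernel8(k);
  for (; k + 4 <= K; k += 4)   kernel4(k);
  for (; k < K; ++k)           kernel1(k);  // scalar remainder
}
```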
- I found there was repeated matrix initialization before the fused multiply-add operations. - With separate initialization code, we gain: 1. cleaner code that is reusable for both the f16 and f16-f32 kernels, and 2. a minimized redundant init process for the f16 kernel, i.e., better latency with the SAME accuracy. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <ss.kong@samsung.com>
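The refactor can be pictured as splitting the per-tile accumulator initialization out of the multiply-add body, so one init helper is shared by the f16 and f16-f32 kernels and runs only once per output tile. The helpers below are illustrative, not the actual hgemm code, and assume `__fp16` support.

```cpp
#include <cstddef>

// Shared accumulator initialization, done once per output tile.
template <typename AccT>
inline void init_tile(AccT *acc, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i)
    acc[i] = static_cast<AccT>(0);
}

// Pure multiply-add body: no initialization mixed in, so it can be called
// repeatedly over K blocks with either an fp16 or fp32 accumulator type.
template <typename AccT>
inline void fma_tile(AccT *acc, const __fp16 *a, const __fp16 *b,
                     std::size_t len) {
  for (std::size_t i = 0; i < len; ++i)
    acc[i] += static_cast<AccT>(a[i]) * static_cast<AccT>(b[i]);
}
```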
- Due to adaptive macro kernel usage, the previous comment is no longer needed. **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: skykongkong8 <ss.kong@samsung.com>
Added a naive OpenCL implementation for the FC layer. Incorporated separate kernels for the ops used. Added a unit test for fc_layer_cl. Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Added incremental forwarding as an option for unit testing layers Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Added blas_kernels to enhance reusability of the common blas kernels. Used the FullyConnected interface for both CPU and GPU calls. Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Renamed global variables in unittest_layers_fully_connected_cl.cpp to fix duplicate declaration error Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Fixed kernel argument bug for dot_cl kernel Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
Used the proper size while creating OpenCL buffers. Optimized the SGEMM kernel with a 2D global work size. Modified the function docs. Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
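For orientation, here is a naive SGEMM kernel of the kind described above, written to be launched with a 2D global work size of (M, N) so that each work-item produces one output element. This is an illustrative kernel string, not the actual blas_kernels source; it would be enqueued via `clEnqueueNDRangeKernel` with `work_dim = 2` and `global_work_size = {M, N}`.

```cpp
// OpenCL C source embedded as a C++ raw string literal.
static const char *sgemm_naive_src = R"CLC(
__kernel void sgemm_naive(__global const float *A, __global const float *B,
                          __global float *C, const unsigned int M,
                          const unsigned int N, const unsigned int K) {
  const unsigned int row = get_global_id(0); // 2D NDRange: dim0 = M
  const unsigned int col = get_global_id(1); // dim1 = N
  if (row >= M || col >= N)
    return;
  float acc = 0.0f;
  for (unsigned int k = 0; k < K; ++k)
    acc += A[row * K + k] * B[k * N + col];
  C[row * N + col] = acc;
}
)CLC";
```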
Update the modeling part of YOLO v2 (update some hyperparameter values). - update the YOLO v2 PyTorch (Python) script - update the YOLO v2 nntrainer (C++) script * Issue: the activation function (in this case, leaky ReLU) of nntrainer needs to support setting the negative slope via a parameter... **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: Seungbaek Hong <sb92.hong@samsung.com>
Added code stub to generate Swiglu layer's golden test data. Signed-off-by: Debadri Samaddar <s.debadri@samsung.com>
It adds test cases for conv2d fp16. Signed-off-by: Jiho Chu <jiho.chu@samsung.com>
It is assumed that activations and weights are fully compatible, so no conversion is necessary. The input layer and loss layers are different, because input data and label data are assumed to always be float32 for now. Signed-off-by: Jiho Chu <jiho.chu@samsung.com>
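The boundary handling described here can be sketched as explicit casts only at the edges of the graph: input and label data arrive as float32, get converted to the model tensor type by the input layer, and are converted back at the loss so the comparison happens in float32. The helpers below are illustrative only, not nntrainer's layer code, and assume `__fp16` support.

```cpp
#include <cstddef>
#include <vector>

// float32 -> fp16 at the input layer (data and labels are always float32).
std::vector<__fp16> cast_to_fp16(const std::vector<float> &src) {
  std::vector<__fp16> dst(src.size());
  for (std::size_t i = 0; i < src.size(); ++i)
    dst[i] = static_cast<__fp16>(src[i]);
  return dst;
}

// fp16 -> float32 before computing the loss against float32 labels.
std::vector<float> cast_to_fp32(const std::vector<__fp16> &src) {
  std::vector<float> dst(src.size());
  for (std::size_t i = 0; i < src.size(); ++i)
    dst[i] = static_cast<float>(src[i]);
  return dst;
}
```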
This PR updates the mixed precision layer. - integrate nnstreamer#2568 & nnstreamer#2455 - more tests will be added **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: Donghak PARK <donghak.park@samsung.com>
@DonghakPark, 💯 All CI checkers are successfully verified. Thanks.
Will update layers with a new PR.
This PR updates the conv2D layer to support mixed precision (FP16). It is based on PR #2579. Resolves: **Self evaluation:** 1. Build test: [X]Passed [ ]Failed [ ]Skipped 2. Run test: [X]Passed [ ]Failed [ ]Skipped Signed-off-by: jijoong.moon <jijoong.moon@samsung.com>