Releases: tenstorrent/tt-metal
v0.58.0-rc24
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not the versions on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/14821991576
📦 Uncategorized
- #21060: Update unary documentation
- PR: #21057
- [skip ci] Add TG demo to choose your own TG pipeline
- PR: #21066
- #0: (MINOR) The REAL bump to v0.58.0
- PR: #21068
- add 6u specific full mesh bandwidth tests
- PR: #21037
- Add multicore support for argmax for any rank and shape
- PR: #20730
- Filter out wheel without the + because that seems to be an older thing, I guess
- PR: #21072
- tt_transformets/tt/attention: Fix typo
- PR: #21081
- [skip ci] Move FD tests to CIv2 (WH only)
- PR: #21079
- #0: [skip ci] Raise blackhole post commit fd test timeout to 40m
- PR: #21090
- Bump UMD
- PR: #21080
- Split go message from device command sequence
- PR: #21006
- A Docker image for package validation
- PR: #21095
- Enable some compiler warnings
- PR: #21032
- EXTERNAL PR: TTNN Support for stack [adding new ttnn op]
- PR: #20500
- Add ProgramDescriptor for future use in TTNN Generic OP
- PR: #21031
- Update device perf margins
- PR: #21107
- Stability script for Resnet50
- PR: #20973
- #17138: Add vae midblock and upblocks
- PR: #20988
- #20979: Add uint16 support for ttnn.add
- PR: #20910
- fix SDXL bias for split conv
- PR: #21110
- Apply clang format to generic_pools.cpp
- PR: #21111
- Jrock/device perf apr24 2
- PR: #21117
- Add fix for argmax in demo causing bad outputs
- PR: #21121
- Fixing TopK L1 limitations, currently for single core implementation, otherwise default to multi-core where changes are not needed
- PR: #21029
- #0: Don't fail generate-system-logs action
- PR: #21092
- Remove superfluous 'static'
- PR: #20967
- [DM] #19268: Testing for "One to one" primitive
- PR: #20722
- Watcher to catch noc_inline_dw_write's to DRAM
- PR: #21093
- add yml version of support cloud request template
- PR: #20954
- Remove dispatch profiler test for BH until CI HW setup stabilizes
- PR: #21123
- #21082: Disable TensixInlineWriteDynamicNoc on BH
- PR: #21129
- TM Stress test + 2 tiny fixes
- PR: #21126
- Fix typo
- PR: #21136
- Restore torch to pyproject.toml
- PR: #21138
- Fusing pre and post rms norm
- PR: #21018
- Update the support request template to be yml
- PR: #21139
- Nest import of torch inside test function
- PR: #21130
- [skip ci] Move CODEOWNERS to .github/
- PR: #21151
- Add support for uneven shards in ttnn.upsample for mode=nearest
- PR: #21063
- Update improved performance for convnet_mnist
- PR: #21164
- #0: Propagate runner label for blackhole demo tests on BH post commit
- PR: #21166
- [skip ci] Cleanup codeowner file a bit
- PR: #21158
- Start of -dev package
- PR: #21035
- [skip ci] Cleanup codeowners
- PR: #21170
- Disable shlibdeps for now -- one of the builds is failing
- PR: #21174
- [skip ci] fix lack of extra-tag in tg demo
- PR: #21176
- Add support for RM input for all_gather_concat + implicit tilize for its output
- PR: #20991
- Fix and deduplicate reduce scatter code that computes receiver/sender IDs around cluster axis
- PR: #21171
- #16015: Support new op ttnn.experimental.broadcast_to
- PR: #19759
- #21140: updated SDXL conv2d tests
- PR: #21152
- #17138: Add VAE decoder
- PR: #21133
- #0: Updated DRAM Slice Size calc logic
- PR: #20857
- Revert "Fusing pre and post rms norm"
- PR: #21191
- #20905: Add int support for relational ops
- PR: #21048
- device perf dispatch margin updates
- PR: #21193
- #0: [skip ci] Change produce data workflow owner to William
- PR: #21204
- Use device fixture instead of MeshDevice for test_matmul_1d_ring_llama_perf
- PR: #21180
- Add int support for eq LLK
- PR: #21039
- #21140: SDXL ttnn.group_norm tests updated
- PR: #21184
- Update test_llama_ops_perf_TG_llama.py
- PR: #21213
- Include enough Git info for Git Describe to do its job
- PR: #21217
- Fixed failing eth profiler tests in metal microbenchmarks workflow
- PR: #21211
- Add support for 2D torus at device init for 6u
- PR: #21118
- Add 0D, 1D and 0V support for matmul ops
- PR: #21003
- [skip ci] Add small delay before calling APIs in produce_data
- PR: #21216
- #19918 bcast test CB assert
- PR: #21168
- Increase lower tolerance for Falcon7b t3k/tg targets since CI varies too much
- PR: #21232
- Add test_system_health binary to run on 6U/T3K
- PR: #21020
- Remove legacy Async Mode APIs
- PR: #21094
- Resolve AllGatherAsyncMinimal segfault
- PR: #21229
- Fix import logic to not import funcs from test files
- PR: #21141
- Get rid of unused MULTI_DEVICE storage type enum
- PR: #21165
- [skip ci] Slightly update perf target for Falcon7b-t3k-2048
- PR: #21241
- Add missing noc selecting opt to 1d fabric device init
- PR: #21230
- Revert "#17138: Add VAE decoder (#21133)"
- PR: #21203
- #21169: Resolve 1D Fabric Microbenchmark Failures
- PR: #21178
- #0: Optimize Llama SDPA decode by using 16x32 tiles and removing copy_blocks
- PR: #20912
- [skip ci] Produce data: Downgrade arch-label exception to warning
- PR: #21249
- Split launch message from device command sequence
- PR: #21033
- [skip ci] Docker image updates needed for packaging
- PR: #21253
- Add FORCE_PUSH_TO_TRACY option to DumpDeviceProfileResults
- PR: #20742
- Add support of Mistral-7B into TT-Transformers
- PR: #19995
- Revert "Add support of Mistral-7B into TT-Transformers (#19995)"
- PR: #21263
- update codeowners
- PR: #21254
v0.58.0-rc23
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/14816455401
📦 Uncategorized
- #21060: Update unary documentation
- PR: #21057
- [skip ci] Add TG demo to choose your own TG pipeline
- PR: #21066
- #0: (MINOR) The REAL bump to v0.58.0
- PR: #21068
- add 6u specific full mesh bandwidth tests
- PR: #21037
- Add multicore support for argmax for any rank and shape
- PR: #20730
- Filter out wheel without the + because that seems to be an older thing, I guess
- PR: #21072
- tt_transformets/tt/attention: Fix typo
- PR: #21081
- [skip ci] Move FD tests to CIv2 (WH only)
- PR: #21079
- #0: [skip ci] Raise blackhole post commit fd test timeout to 40m
- PR: #21090
- Bump UMD
- PR: #21080
- Split go message from device command sequence
- PR: #21006
- A Docker image for package validation
- PR: #21095
- Enable some compiler warnings
- PR: #21032
- EXTERNAL PR: TTNN Support for stack [adding new ttnn op]
- PR: #20500
- Add ProgramDescriptor for future use in TTNN Generic OP
- PR: #21031
- Update device perf margins
- PR: #21107
- Stability script for Resnet50
- PR: #20973
- #17138: Add vae midblock and upblocks
- PR: #20988
- #20979: Add uint16 support for ttnn.add
- PR: #20910
- fix SDXL bias for split conv
- PR: #21110
- Apply clang format to generic_pools.cpp
- PR: #21111
- Jrock/device perf apr24 2
- PR: #21117
- Add fix for argmax in demo causing bad outputs
- PR: #21121
- Fixing TopK L1 limitations, currently for single core implementation, otherwise default to multi-core where changes are not needed
- PR: #21029
- #0: Don't fail generate-system-logs action
- PR: #21092
- Remove superfluous 'static'
- PR: #20967
- [DM] #19268: Testing for "One to one" primitive
- PR: #20722
- Watcher to catch noc_inline_dw_write's to DRAM
- PR: #21093
- add yml version of support cloud request template
- PR: #20954
- Remove dispatch profiler test for BH until CI HW setup stabilizes
- PR: #21123
- #21082: Disable TensixInlineWriteDynamicNoc on BH
- PR: #21129
- TM Stress test + 2 tiny fixes
- PR: #21126
- Fix typo
- PR: #21136
- Restore torch to pyproject.toml
- PR: #21138
- Fusing pre and post rms norm
- PR: #21018
- Update the support request template to be yml
- PR: #21139
- Nest import of torch inside test function
- PR: #21130
- [skip ci] Move CODEOWNERS to .github/
- PR: #21151
- Add support for uneven shards in ttnn.upsample for mode=nearest
- PR: #21063
- Update improved performance for convnet_mnist
- PR: #21164
- #0: Propagate runner label for blackhole demo tests on BH post commit
- PR: #21166
- [skip ci] Cleanup codeowner file a bit
- PR: #21158
- Start of -dev package
- PR: #21035
- [skip ci] Cleanup codeowners
- PR: #21170
- Disable shlibdeps for now -- one of the builds is failing
- PR: #21174
- [skip ci] fix lack of extra-tag in tg demo
- PR: #21176
- Add support for RM input for all_gather_concat + implicit tilize for its output
- PR: #20991
- Fix and deduplicate reduce scatter code that computes receiver/sender IDs around cluster axis
- PR: #21171
- #16015: Support new op ttnn.experimental.broadcast_to
- PR: #19759
- #21140: updated SDXL conv2d tests
- PR: #21152
- #17138: Add VAE decoder
- PR: #21133
- #0: Updated DRAM Slice Size calc logic
- PR: #20857
- Revert "Fusing pre and post rms norm"
- PR: #21191
- #20905: Add int support for relational ops
- PR: #21048
- device perf dispatch margin updates
- PR: #21193
- #0: [skip ci] Change produce data workflow owner to William
- PR: #21204
- Use device fixture instead of MeshDevice for test_matmul_1d_ring_llama_perf
- PR: #21180
- Add int support for eq LLK
- PR: #21039
- #21140: SDXL ttnn.group_norm tests updated
- PR: #21184
- Update test_llama_ops_perf_TG_llama.py
- PR: #21213
- Include enough Git info for Git Describe to do its job
- PR: #21217
- Fixed failing eth profiler tests in metal microbenchmarks workflow
- PR: #21211
- Add support for 2D torus at device init for 6u
- PR: #21118
- Add 0D, 1D and 0V support for matmul ops
- PR: #21003
- [skip ci] Add small delay before calling APIs in produce_data
- PR: #21216
- #19918 bcast test CB assert
- PR: #21168
- Increase lower tolerance for Falcon7b t3k/tg targets since CI varies too much
- PR: #21232
- Add test_system_health binary to run on 6U/T3K
- PR: #21020
- Remove legacy Async Mode APIs
- PR: #21094
- Resolve AllGatherAsyncMinimal segfault
- PR: #21229
- Fix import logic to not import funcs from test files
- PR: #21141
- Get rid of unused MULTI_DEVICE storage type enum
- PR: #21165
- [skip ci] Slightly update perf target for Falcon7b-t3k-2048
- PR: #21241
- Add missing noc selecting opt to 1d fabric device init
- PR: #21230
- Revert "#17138: Add VAE decoder (#21133)"
- PR: #21203
- #21169: Resolve 1D Fabric Microbenchmark Failures
- PR: #21178
- #0: Optimize Llama SDPA decode by using 16x32 tiles and removing copy_blocks
- PR: #20912
- [skip ci] Produce data: Downgrade arch-label exception to warning
- PR: #21249
- Split launch message from device command sequence
- PR: #21033
- [skip ci] Docker image updates needed for packaging
- PR: #21253
- Add FORCE_PUSH_TO_TRACY option to DumpDeviceProfileResults
- PR: #20742
- Add support of Mistral-7B into TT-Transformers
- PR: #19995
- Revert "Add support of Mistral-7B into TT-Transformers (#19995)"
- PR: #21263
- update codeowners
- PR: #21254
v0.58.0-rc22
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/14811590331
📦 Uncategorized
- Fix dangling reference in ElfFile::Impl constructor
- PR: #21002
- #21060: Update unary documentation
- PR: #21057
- [skip ci] Add TG demo to choose your own TG pipeline
- PR: #21066
- #0: (MINOR) The REAL bump to v0.58.0
- PR: #21068
- add 6u specific full mesh bandwidth tests
- PR: #21037
- Add multicore support for argmax for any rank and shape
- PR: #20730
- Filter out wheel without the + because that seems to be an older thing, I guess
- PR: #21072
- tt_transformets/tt/attention: Fix typo
- PR: #21081
- [skip ci] Move FD tests to CIv2 (WH only)
- PR: #21079
- #0: [skip ci] Raise blackhole post commit fd test timeout to 40m
- PR: #21090
- Bump UMD
- PR: #21080
- Split go message from device command sequence
- PR: #21006
- A Docker image for package validation
- PR: #21095
- Enable some compiler warnings
- PR: #21032
- EXTERNAL PR: TTNN Support for stack [adding new ttnn op]
- PR: #20500
- Add ProgramDescriptor for future use in TTNN Generic OP
- PR: #21031
- Update device perf margins
- PR: #21107
- Stability script for Resnet50
- PR: #20973
- #17138: Add vae midblock and upblocks
- PR: #20988
- #20979: Add uint16 support for ttnn.add
- PR: #20910
- fix SDXL bias for split conv
- PR: #21110
- Apply clang format to generic_pools.cpp
- PR: #21111
- Jrock/device perf apr24 2
- PR: #21117
- Add fix for argmax in demo causing bad outputs
- PR: #21121
- Fixing TopK L1 limitations, currently for single core implementation, otherwise default to multi-core where changes are not needed
- PR: #21029
- #0: Don't fail generate-system-logs action
- PR: #21092
- Remove superfluous 'static'
- PR: #20967
- [DM] #19268: Testing for "One to one" primitive
- PR: #20722
- Watcher to catch noc_inline_dw_write's to DRAM
- PR: #21093
- add yml version of support cloud request template
- PR: #20954
- Remove dispatch profiler test for BH until CI HW setup stabilizes
- PR: #21123
- #21082: Disable TensixInlineWriteDynamicNoc on BH
- PR: #21129
- TM Stress test + 2 tiny fixes
- PR: #21126
- Fix typo
- PR: #21136
- Restore torch to pyproject.toml
- PR: #21138
- Fusing pre and post rms norm
- PR: #21018
- Update the support request template to be yml
- PR: #21139
- Nest import of torch inside test function
- PR: #21130
- [skip ci] Move CODEOWNERS to .github/
- PR: #21151
- Add support for uneven shards in ttnn.upsample for mode=nearest
- PR: #21063
- Update improved performance for convnet_mnist
- PR: #21164
- #0: Propagate runner label for blackhole demo tests on BH post commit
- PR: #21166
- [skip ci] Cleanup codeowner file a bit
- PR: #21158
- Start of -dev package
- PR: #21035
- [skip ci] Cleanup codeowners
- PR: #21170
- Disable shlibdeps for now -- one of the builds is failing
- PR: #21174
- [skip ci] fix lack of extra-tag in tg demo
- PR: #21176
- Add support for RM input for all_gather_concat + implicit tilize for its output
- PR: #20991
- Fix and deduplicate reduce scatter code that computes receiver/sender IDs around cluster axis
- PR: #21171
- #16015: Support new op ttnn.experimental.broadcast_to
- PR: #19759
- #21140: updated SDXL conv2d tests
- PR: #21152
- #17138: Add VAE decoder
- PR: #21133
- #0: Updated DRAM Slice Size calc logic
- PR: #20857
- Revert "Fusing pre and post rms norm"
- PR: #21191
- #20905: Add int support for relational ops
- PR: #21048
- device perf dispatch margin updates
- PR: #21193
- #0: [skip ci] Change produce data workflow owner to William
- PR: #21204
- Use device fixture instead of MeshDevice for test_matmul_1d_ring_llama_perf
- PR: #21180
- Add int support for eq LLK
- PR: #21039
- #21140: SDXL ttnn.group_norm tests updated
- PR: #21184
- Update test_llama_ops_perf_TG_llama.py
- PR: #21213
- Include enough Git info for Git Describe to do its job
- PR: #21217
- Fixed failing eth profiler tests in metal microbenchmarks workflow
- PR: #21211
- Add support for 2D torus at device init for 6u
- PR: #21118
- Add 0D, 1D and 0V support for matmul ops
- PR: #21003
- [skip ci] Add small delay before calling APIs in produce_data
- PR: #21216
- #19918 bcast test CB assert
- PR: #21168
- Increase lower tolerance for Falcon7b t3k/tg targets since CI varies too much
- PR: #21232
- Add test_system_health binary to run on 6U/T3K
- PR: #21020
- Remove legacy Async Mode APIs
- PR: #21094
- Resolve AllGatherAsyncMinimal segfault
- PR: #21229
- Fix import logic to not import funcs from test files
- PR: #21141
- Get rid of unused MULTI_DEVICE storage type enum
- PR: #21165
- [skip ci] Slightly update perf target for Falcon7b-t3k-2048
- PR: #21241
- Add missing noc selecting opt to 1d fabric device init
- PR: #21230
- Revert "#17138: Add VAE decoder (#21133)"
- PR: #21203
- #21169: Resolve 1D Fabric Microbenchmark Failures
- PR: #21178
- #0: Optimize Llama SDPA decode by using 16x32 tiles and removing copy_blocks
- PR: #20912
- [skip ci] Produce data: Downgrade arch-label exception to warning
- PR: #21249
- Split launch message from device command sequence
- PR: #21033
- [skip ci] Docker image updates needed for packaging
- PR: #21253
- Add FORCE_PUSH_TO_TRACY option to DumpDeviceProfileResults
- PR: #20742
- Add support of Mistral-7B into TT-Transformers
- PR: #19995
- Revert "Add support of Mistral-7B into TT-Transformers (#19995)"
- PR: #21263
- update codeowners
- PR: #21254
v0.58.0-rc21
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/14806160762
📦 Uncategorized
- #19891: ttnn.sort-single-core-implementation
- PR: #20514
- #18973: Deduplicate ttnn_test_fixtures.hpp code
- PR: #20863
- Fix dangling reference in ElfFile::Impl constructor
- PR: #21002
- #21060: Update unary documentation
- PR: #21057
- [skip ci] Add TG demo to choose your own TG pipeline
- PR: #21066
- #0: (MINOR) The REAL bump to v0.58.0
- PR: #21068
- add 6u specific full mesh bandwidth tests
- PR: #21037
- Add multicore support for argmax for any rank and shape
- PR: #20730
- Filter out wheel without the + because that seems to be an older thing, I guess
- PR: #21072
- tt_transformets/tt/attention: Fix typo
- PR: #21081
- [skip ci] Move FD tests to CIv2 (WH only)
- PR: #21079
- #0: [skip ci] Raise blackhole post commit fd test timeout to 40m
- PR: #21090
- Bump UMD
- PR: #21080
- Split go message from device command sequence
- PR: #21006
- A Docker image for package validation
- PR: #21095
- Enable some compiler warnings
- PR: #21032
- EXTERNAL PR: TTNN Support for stack [adding new ttnn op]
- PR: #20500
- Add ProgramDescriptor for future use in TTNN Generic OP
- PR: #21031
- Update device perf margins
- PR: #21107
- Stability script for Resnet50
- PR: #20973
- #17138: Add vae midblock and upblocks
- PR: #20988
- #20979: Add uint16 support for ttnn.add
- PR: #20910
- fix SDXL bias for split conv
- PR: #21110
- Apply clang format to generic_pools.cpp
- PR: #21111
- Jrock/device perf apr24 2
- PR: #21117
- Add fix for argmax in demo causing bad outputs
- PR: #21121
- Fixing TopK L1 limitations, currently for single core implementation, otherwise default to multi-core where changes are not needed
- PR: #21029
- #0: Don't fail generate-system-logs action
- PR: #21092
- Remove superfluous 'static'
- PR: #20967
- [DM] #19268: Testing for "One to one" primitive
- PR: #20722
- Watcher to catch noc_inline_dw_write's to DRAM
- PR: #21093
- add yml version of support cloud request template
- PR: #20954
- Remove dispatch profiler test for BH until CI HW setup stabilizes
- PR: #21123
- #21082: Disable TensixInlineWriteDynamicNoc on BH
- PR: #21129
- TM Stress test + 2 tiny fixes
- PR: #21126
- Fix typo
- PR: #21136
- Restore torch to pyproject.toml
- PR: #21138
- Fusing pre and post rms norm
- PR: #21018
- Update the support request template to be yml
- PR: #21139
- Nest import of torch inside test function
- PR: #21130
- [skip ci] Move CODEOWNERS to .github/
- PR: #21151
- Add support for uneven shards in ttnn.upsample for mode=nearest
- PR: #21063
- Update improved performance for convnet_mnist
- PR: #21164
- #0: Propagate runner label for blackhole demo tests on BH post commit
- PR: #21166
- [skip ci] Cleanup codeowner file a bit
- PR: #21158
- Start of -dev package
- PR: #21035
- [skip ci] Cleanup codeowners
- PR: #21170
- Disable shlibdeps for now -- one of the builds is failing
- PR: #21174
- [skip ci] fix lack of extra-tag in tg demo
- PR: #21176
- Add support for RM input for all_gather_concat + implicit tilize for its output
- PR: #20991
- Fix and deduplicate reduce scatter code that computes receiver/sender IDs around cluster axis
- PR: #21171
- #16015: Support new op ttnn.experimental.broadcast_to
- PR: #19759
- #21140: updated SDXL conv2d tests
- PR: #21152
- #17138: Add VAE decoder
- PR: #21133
- #0: Updated DRAM Slice Size calc logic
- PR: #20857
- Revert "Fusing pre and post rms norm"
- PR: #21191
- #20905: Add int support for relational ops
- PR: #21048
- device perf dispatch margin updates
- PR: #21193
- #0: [skip ci] Change produce data workflow owner to William
- PR: #21204
- Use device fixture instead of MeshDevice for test_matmul_1d_ring_llama_perf
- PR: #21180
- Add int support for eq LLK
- PR: #21039
- #21140: SDXL ttnn.group_norm tests updated
- PR: #21184
- Update test_llama_ops_perf_TG_llama.py
- PR: #21213
- Include enough Git info for Git Describe to do its job
- PR: #21217
- Fixed failing eth profiler tests in metal microbenchmarks workflow
- PR: #21211
- Add support for 2D torus at device init for 6u
- PR: #21118
- Add 0D, 1D and 0V support for matmul ops
- PR: #21003
- [skip ci] Add small delay before calling APIs in produce_data
- PR: #21216
- #19918 bcast test CB assert
- PR: #21168
- Increase lower tolerance for Falcon7b t3k/tg targets since CI varies too much
- PR: #21232
- Add test_system_health binary to run on 6U/T3K
- PR: #21020
- Remove legacy Async Mode APIs
- PR: #21094
- Resolve AllGatherAsyncMinimal segfault
- PR: #21229
- Fix import logic to not import funcs from test files
- PR: #21141
- Get rid of unused MULTI_DEVICE storage type enum
- PR: #21165
- [skip ci] Slightly update perf target for Falcon7b-t3k-2048
- PR: #21241
- Add missing noc selecting opt to 1d fabric device init
- PR: #21230
- Revert "#17138: Add VAE decoder (#21133)"
- PR: #21203
- #21169: Resolve 1D Fabric Microbenchmark Failures
- PR: #21178
- #0: Optimize Llama SDPA decode by using 16x32 tiles and removing copy_blocks
- PR: #20912
- [skip ci] Produce data: Downgrade arch-label exception to warning
- PR: #21249
- Split launch message from device command sequence
- PR: #21033
- [skip ci] Docker image updates needed for packaging
- PR: #21253
- Add FORCE_PUSH_TO_TRACY option to DumpDeviceProfileResults
- PR: #20742
- Add support of Mistral-7B into TT-Transformers
- PR: #19995
- Revert "Add support of Mistral-7B into TT-Transformers (#19995)"
- PR: #21263
- update codeowners
- PR: #21254
v0.58.0-rc20
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/14796816833
📦 Uncategorized
- Clear set tracking device ids in database between test runs
- PR: #20999
- In Place Halo Multicasting on WH/BH
- PR: #20878
- Trace support for yolov8x
- PR: #20884
- #0: Fix for ttnn.CreateDevice with multiple N150s
- PR: #20897
- #18922: Add int support for zero comparison ops
- PR: #19337
- Update perf margin
- PR: #21045
- Add fixes to LM Head unit test
- PR: #21028
- #20717: Generate per core op to op time csv
- PR: #20855
- Bump UMD
- PR: #21043
- #19891: ttnn.sort-single-core-implementation
- PR: #20514
- #18973: Deduplicate ttnn_test_fixtures.hpp code
- PR: #20863
- Fix dangling reference in ElfFile::Impl constructor
- PR: #21002
- #21060: Update unary documentation
- PR: #21057
- [skip ci] Add TG demo to choose your own TG pipeline
- PR: #21066
- #0: (MINOR) The REAL bump to v0.58.0
- PR: #21068
- add 6u specific full mesh bandwidth tests
- PR: #21037
- Add multicore support for argmax for any rank and shape
- PR: #20730
- Filter out wheel without the + because that seems to be an older thing, I guess
- PR: #21072
- tt_transformets/tt/attention: Fix typo
- PR: #21081
- [skip ci] Move FD tests to CIv2 (WH only)
- PR: #21079
- #0: [skip ci] Raise blackhole post commit fd test timeout to 40m
- PR: #21090
- Bump UMD
- PR: #21080
- Split go message from device command sequence
- PR: #21006
- A Docker image for package validation
- PR: #21095
- Enable some compiler warnings
- PR: #21032
- EXTERNAL PR: TTNN Support for stack [adding new ttnn op]
- PR: #20500
- Add ProgramDescriptor for future use in TTNN Generic OP
- PR: #21031
- Update device perf margins
- PR: #21107
- Stability script for Resnet50
- PR: #20973
- #17138: Add vae midblock and upblocks
- PR: #20988
- #20979: Add uint16 support for ttnn.add
- PR: #20910
- fix SDXL bias for split conv
- PR: #21110
- Apply clang format to generic_pools.cpp
- PR: #21111
- Jrock/device perf apr24 2
- PR: #21117
- Add fix for argmax in demo causing bad outputs
- PR: #21121
- Fixing TopK L1 limitations, currently for single core implementation, otherwise default to multi-core where changes are not needed
- PR: #21029
- #0: Don't fail generate-system-logs action
- PR: #21092
- Remove superfluous 'static'
- PR: #20967
- [DM] #19268: Testing for "One to one" primitive
- PR: #20722
- Watcher to catch noc_inline_dw_write's to DRAM
- PR: #21093
- add yml version of support cloud request template
- PR: #20954
- Remove dispatch profiler test for BH until CI HW setup stabilizes
- PR: #21123
- #21082: Disable TensixInlineWriteDynamicNoc on BH
- PR: #21129
- TM Stress test + 2 tiny fixes
- PR: #21126
- Fix typo
- PR: #21136
- Restore torch to pyproject.toml
- PR: #21138
- Fusing pre and post rms norm
- PR: #21018
- Update the support request template to be yml
- PR: #21139
- Nest import of torch inside test function
- PR: #21130
- [skip ci] Move CODEOWNERS to .github/
- PR: #21151
- Add support for uneven shards in ttnn.upsample for mode=nearest
- PR: #21063
- Update improved performance for convnet_mnist
- PR: #21164
- #0: Propagate runner label for blackhole demo tests on BH post commit
- PR: #21166
- [skip ci] Cleanup codeowner file a bit
- PR: #21158
- Start of -dev package
- PR: #21035
- [skip ci] Cleanup codeowners
- PR: #21170
- Disable shlibdeps for now -- one of the builds is failing
- PR: #21174
- [skip ci] fix lack of extra-tag in tg demo
- PR: #21176
- Add support for RM input for all_gather_concat + implicit tilize for its output
- PR: #20991
- Fix and deduplicate reduce scatter code that computes receiver/sender IDs around cluster axis
- PR: #21171
- #16015: Support new op ttnn.experimental.broadcast_to
- PR: #19759
- #21140: updated SDXL conv2d tests
- PR: #21152
- #17138: Add VAE decoder
- PR: #21133
- #0: Updated DRAM Slice Size calc logic
- PR: #20857
- Revert "Fusing pre and post rms norm"
- PR: #21191
- #20905: Add int support for relational ops
- PR: #21048
- device perf dispatch margin updates
- PR: #21193
- #0: [skip ci] Change produce data workflow owner to William
- PR: #21204
- Use device fixture instead of MeshDevice for test_matmul_1d_ring_llama_perf
- PR: #21180
- Add int support for eq LLK
- PR: #21039
- #21140: SDXL ttnn.group_norm tests updated
- PR: #21184
- Update test_llama_ops_perf_TG_llama.py
- PR: #21213
- Include enough Git info for Git Describe to do its job
- PR: #21217
- Fixed failing eth profiler tests in metal microbenchmarks workflow
- PR: #21211
- Add support for 2D torus at device init for 6u
- PR: #21118
- Add 0D, 1D and 0V support for matmul ops
- PR: #21003
- [skip ci] Add small delay before calling APIs in produce_data
- PR: #21216
- #19918 bcast test CB assert
- PR: #21168
- Increase lower tolerance for Falcon7b t3k/tg targets since CI varies too much
- PR: #21232
- Add test_system_health binary to run on 6U/T3K
- PR: #21020
- Remove legacy Async Mode APIs
- PR: #21094
- Resolve AllGatherAsyncMinimal segfault
- PR: #21229
- Fix import logic to not import funcs from test files
- PR: #21141
- Get rid of unused MULTI_DEVICE storage type enum
- PR: #21165
- [skip ci] Slightly update perf target for Falcon7b-t3k-2048
- PR: #21241
- Add missing noc selecting opt to 1d fabric device init
- PR: #21230
- Revert "#17138: Add VAE decoder (#21133)"
- PR: #21203
- #21169: Resolve 1D Fabric Microbenchmark Failures
- PR: #21178
- #0: Optimize Llama SDPA decode by using 16x32 tiles and removing copy_blocks
- PR: #20912
- [skip ci] Produce data: Downgrade arch-label exception to warning
- PR: #21249
- Split launch message from device command sequence
- PR: #21033
- [skip ci] Docker image updates needed for packaging
- PR: #21253
- Add FORCE_PUSH_TO_TRACY option to DumpDeviceProfileResults
- PR: #20742
- Add support of Mistral-7B into TT-Transformers
- PR: #19995
- Revert "Add support of Mistral-7B into TT-Transformers (#19995)"
- PR: #21263
- update codeowners
- PR: #21254
v0.58.0-rc19
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/14786913862
📦 Uncategorized
- Limit scope of xtensor-blas dependency
- PR: #21021
- Clear set tracking device ids in database between test runs
- PR: #20999
- In Place Halo Multicasting on WH/BH
- PR: #20878
- Trace support for yolov8x
- PR: #20884
- #0: Fix for ttnn.CreateDevice with multiple N150s
- PR: #20897
- #18922: Add int support for zero comparison ops
- PR: #19337
- Update perf margin
- PR: #21045
- Add fixes to LM Head unit test
- PR: #21028
- #20717: Generate per core op to op time csv
- PR: #20855
- Bump UMD
- PR: #21043
- #19891: ttnn.sort-single-core-implementation
- PR: #20514
- #18973: Deduplicate ttnn_test_fixtures.hpp code
- PR: #20863
- Fix dangling reference in ElfFile::Impl constructor
- PR: #21002
- #21060: Update unary documentation
- PR: #21057
- [skip ci] Add TG demo to choose your own TG pipeline
- PR: #21066
- #0: (MINOR) The REAL bump to v0.58.0
- PR: #21068
- add 6u specific full mesh bandwidth tests
- PR: #21037
- Add multicore support for argmax for any rank and shape
- PR: #20730
- Filter out wheel without the + because that seems to be an older thing, I guess
- PR: #21072
- tt_transformets/tt/attention: Fix typo
- PR: #21081
- [skip ci] Move FD tests to CIv2 (WH only)
- PR: #21079
- #0: [skip ci] Raise blackhole post commit fd test timeout to 40m
- PR: #21090
- Bump UMD
- PR: #21080
- Split go message from device command sequence
- PR: #21006
- A Docker image for package validation
- PR: #21095
- Enable some compiler warnings
- PR: #21032
- EXTERNAL PR: TTNN Support for stack [adding new ttnn op]
- PR: #20500
- Add ProgramDescriptor for future use in TTNN Generic OP
- PR: #21031
- Update device perf margins
- PR: #21107
- Stability script for Resnet50
- PR: #20973
- #17138: Add vae midblock and upblocks
- PR: #20988
- #20979: Add uint16 support for ttnn.add
- PR: #20910
- fix SDXL bias for split conv
- PR: #21110
- Apply clang format to generic_pools.cpp
- PR: #21111
- Jrock/device perf apr24 2
- PR: #21117
- Add fix for argmax in demo causing bad outputs
- PR: #21121
- Fixing TopK L1 limitations, currently for single core implementation, otherwise default to multi-core where changes are not needed
- PR: #21029
- #0: Don't fail generate-system-logs action
- PR: #21092
- Remove superfluous 'static'
- PR: #20967
- [DM] #19268: Testing for "One to one" primitive
- PR: #20722
- Watcher to catch noc_inline_dw_write's to DRAM
- PR: #21093
- add yml version of support cloud request template
- PR: #20954
- Remove dispatch profiler test for BH until CI HW setup stabilizes
- PR: #21123
- #21082: Disable TensixInlineWriteDynamicNoc on BH
- PR: #21129
- TM Stress test + 2 tiny fixes
- PR: #21126
- Fix typo
- PR: #21136
- Restore torch to pyproject.toml
- PR: #21138
- Fusing pre and post rms norm
- PR: #21018
- Update the support request template to be yml
- PR: #21139
- Nest import of torch inside test function
- PR: #21130
- [skip ci] Move CODEOWNERS to .github/
- PR: #21151
- Add support for uneven shards in ttnn.upsample for mode=nearest
- PR: #21063
- Update improved performance for convnet_mnist
- PR: #21164
- #0: Propagate runner label for blackhole demo tests on BH post commit
- PR: #21166
- [skip ci] Cleanup codeowner file a bit
- PR: #21158
- Start of -dev package
- PR: #21035
- [skip ci] Cleanup codeowners
- PR: #21170
- Disable shlibdeps for now -- one of the builds is failing
- PR: #21174
- [skip ci] fix lack of extra-tag in tg demo
- PR: #21176
- Add support for RM input for all_gather_concat + implicit tilize for its output
- PR: #20991
- Fix and deduplicate reduce scatter code that computes receiver/sender IDs around cluster axis
- PR: #21171
- #16015: Support new op ttnn.experimental.broadcast_to
- PR: #19759
- #21140: updated SDXL conv2d tests
- PR: #21152
- #17138: Add VAE decoder
- PR: #21133
- #0: Updated DRAM Slice Size calc logic
- PR: #20857
- Revert "Fusing pre and post rms norm"
- PR: #21191
- #20905: Add int support for relational ops
- PR: #21048
- device perf dispatch margin updates
- PR: #21193
- #0: [skip ci] Change produce data workflow owner to William
- PR: #21204
- Use device fixture instead of MeshDevice for test_matmul_1d_ring_llama_perf
- PR: #21180
- Add int support for eq LLK
- PR: #21039
- #21140: SDXL ttnn.group_norm tests updated
- PR: #21184
- Update test_llama_ops_perf_TG_llama.py
- PR: #21213
- Include enough Git info for Git Describe to do its job
- PR: #21217
- Fixed failing eth profiler tests in metal microbenchmarks workflow
- PR: #21211
- Add support for 2D torus at device init for 6u
- PR: #21118
- Add 0D, 1D and 0V support for matmul ops
- PR: #21003
- [skip ci] Add small delay before calling APIs in produce_data
- PR: #21216
- #19918 bcast test CB assert
- PR: #21168
- Increase lower tolerance for Falcon7b t3k/tg targets since CI varies too much
- PR: #21232
- Add test_system_health binary to run on 6U/T3K
- PR: #21020
- Remove legacy Async Mode APIs
- PR: #21094
- Resolve AllGatherAsyncMinimal segfault
- PR: #21229
- Fix import logic to not import funcs from test files
- PR: #21141
- Get rid of unused MULTI_DEVICE storage type enum
- PR: #21165
- [skip ci] Slightly update perf target for Falcon7b-t3k-2048
- PR: #21241
- Add missing noc selecting opt to 1d fabric device init
- PR: #21230
- Revert "#17138: Add VAE decoder (#21133)"
- PR: #21203
- #21169: Resolve 1D Fabric Microbenchmark Failures
- PR: #21178
- #0: Optimize Llama SDPA decode by using 16x32 tiles and removing copy_blocks
- PR: #20912
- [skip ci] Produce data: Downgrade arch-label exception to warning
- PR: #21249
- Split launch message from device command sequence
- PR: #21033
- [skip ci] Docker image updates needed for packaging
- PR: #21253
- Add FORCE_PUSH_TO_TRACY option to DumpDeviceProfileResults
- PR: #20742
- Add support of Mistral-7B into TT-Transformers
- PR: #19995
- Revert "Add support of Mistral-7B into TT-Transformers (#19995)"
- PR: #21263
- update codeowners
- PR: #21254
v0.58.0-rc18
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/14776759745
📦 Uncategorized
- Removing the 6u limit on main
- PR: #21025
- Add support for a performance mode in DRAM Prefetcher
- PR: #20942
- Demo for yolov8s_world model
- PR: #19956
- #17700: Yolov9c model trace perf bringup
- PR: #20747
- Remove DispatchMemMap singleton, move ownership to MetalContext
- PR: #20974
- #20895: Restrict using zero copy chunking to contiguous outermost dim
- PR: #21015
- Fix shifted RISCV_SOFT_RESET_0_BRISC value
- PR: #21024
- Remove dealloc of persistent buffer tt_stats in RMS
- PR: #21022
- TTNN x TT-Mesh Integration: Exposing a Native MultiDevice Backend to TTNN
- PR: #18067
- Limit scope of xtensor-blas dependency
- PR: #21021
- Clear set tracking device ids in database between test runs
- PR: #20999
- In Place Halo Multicasting on WH/BH
- PR: #20878
- Trace support for yolov8x
- PR: #20884
- #0: Fix for ttnn.CreateDevice with multiple N150s
- PR: #20897
- #18922: Add int support for zero comparison ops
- PR: #19337
- Update perf margin
- PR: #21045
- Add fixes to LM Head unit test
- PR: #21028
- #20717: Generate per core op to op time csv
- PR: #20855
- Bump UMD
- PR: #21043
- #19891: ttnn.sort-single-core-implementation
- PR: #20514
- #18973: Deduplicate ttnn_test_fixtures.hpp code
- PR: #20863
- Fix dangling reference in ElfFile::Impl constructor
- PR: #21002
- #21060: Update unary documentation
- PR: #21057
- [skip ci] Add TG demo to choose your own TG pipeline
- PR: #21066
- #0: (MINOR) The REAL bump to v0.58.0
- PR: #21068
- add 6u specific full mesh bandwidth tests
- PR: #21037
- Add multicore support for argmax for any rank and shape
- PR: #20730
- Filter out wheel without the + because that seems to be an older thing, I guess
- PR: #21072
- tt_transformets/tt/attention: Fix typo
- PR: #21081
- [skip ci] Move FD tests to CIv2 (WH only)
- PR: #21079
- #0: [skip ci] Raise blackhole post commit fd test timeout to 40m
- PR: #21090
- Bump UMD
- PR: #21080
- Split go message from device command sequence
- PR: #21006
- A Docker image for package validation
- PR: #21095
- Enable some compiler warnings
- PR: #21032
- EXTERNAL PR: TTNN Support for stack [adding new ttnn op]
- PR: #20500
- Add ProgramDescriptor for future use in TTNN Generic OP
- PR: #21031
- Update device perf margins
- PR: #21107
- Stability script for Resnet50
- PR: #20973
- #17138: Add vae midblock and upblocks
- PR: #20988
- #20979: Add uint16 support for ttnn.add
- PR: #20910
- fix SDXL bias for split conv
- PR: #21110
- Apply clang format to generic_pools.cpp
- PR: #21111
- Jrock/device perf apr24 2
- PR: #21117
- Add fix for argmax in demo causing bad outputs
- PR: #21121
- Fixing TopK L1 limitations, currently for single core implmentation, otherwise default to multi-core where changes are not needed
- PR: #21029
- #0: Don't fail generate-system-logs action
- PR: #21092
- Remove superfluous 'static'
- PR: #20967
- [DM] #19268: Testing for "One to one" primitive
- PR: #20722
- Watcher to catch noc_inline_dw_write's to DRAM
- PR: #21093
- add yml version of support cloud request template
- PR: #20954
- Remove dispatch profiler test for BH until CI HW setup stabilizes
- PR: #21123
- #21082: Disable TensixInlineWriteDynamicNoc on BH
- PR: #21129
- TM Stress test + 2 tiny fixes
- PR: #21126
- Fix typo
- PR: #21136
- Restore torch to pyproject.toml
- PR: #21138
- Fusing pre and post rms norm
- PR: #21018
- Update the support request template to be yml
- PR: #21139
- Nest import of torch inside test function
- PR: #21130
- [skip ci] Move CODEOWNERS to .github/
- PR: #21151
- Add support for uneven shards in ttnn.upsample for mode=nearest
- PR: #21063
- Update improved performance for convnet_mnist
- PR: #21164
- #0: Propagate runner label for blackhole demo tests on BH post commit
- PR: #21166
- [skip ci] Cleanup codeowner file a bit
- PR: #21158
- Start of -dev package
- PR: #21035
- [skip ci] Cleanup codeowners
- PR: #21170
- Disable shlibdeps for now -- one of the builds is failing
- PR: #21174
- [skip ci] fix lack of extra-tag in tg demo
- PR: #21176
- Add support for RM input for all_gather_concat + implicit tilize for its output
- PR: #20991
- Fix and deduplicate reduce scatter code that computes receiver/sender IDs around cluster axis
- PR: #21171
- #16015: Support new op ttnn.experimental.broadcast_to
- PR: #19759
- #21140: updated SDXL conv2d tests
- PR: #21152
- #17138: Add VAE decoder
- PR: #21133
- #0: Updated DRAM Slice Size calc logic
- PR: #20857
- Revert "Fusing pre and post rms norm"
- PR: #21191
- #20905: Add int support for relational ops
- PR: #21048
- device perf dispatch margin updates
- PR: #21193
- #0: [skip ci] Change produce data workflow owner to William
- PR: #21204
- Use device fixture instead of MeshDevice for test_matmul_1d_ring_llama_perf
- PR: #21180
- Add int support for eq LLK
- PR: #21039
- #21140: SDXL ttnn.group_norm tests updated
- PR: #21184
- Update test_llama_ops_perf_TG_llama.py
- PR: #21213
- Include enough Git info for Git Describe to do its job
- PR: #21217
- Fixed failing eth profiler tests in metal microbenchmarks workflow
- PR: #21211
- Add support for 2D torus at device init for 6u
- PR: #21118
- Add 0D, 1D and 0V support for matmul ops
- PR: #21003
- [skip ci] Add small delay before calling APIs in produce_data
- PR: #21216
- #19918 bcast test CB assert
- PR: #21168
- Increase lower tolerance for Falcon7b t3k/tg targets since CI varies too much
- PR: #21232
- Add test_system_health binary to run on 6U/T3K
- PR: #21020
- Remove legacy Async Mode APIs
- PR: #21094
- Resolve AllGatherAsyncMinimal segfault
- PR: #21229
- Fix import logic to not import funcs from test files
- PR: #21141
- Get rid of unused MULTI_DEVICE storage type enum
- PR: #21165
- [skip ci] Slightly update perf target for Falcon7b-t3k-2048
- PR: #21241
- Add missing noc selecting opt to 1d fabric device init
- PR: #21230
- Revert "#17138: Add VAE decoder (#21133)"
- PR: #21203
- #21169: Resolve 1D Fabric Microbenchmark Failures
- PR: #21178
- #0: Optimize Llama SDPA decode by using 16x32 tiles and removing copy_blocks
- PR: #20912
- [skip ci] Produce data: Downgrade arch-label exception to warning
- PR: #21249
- Split launch message from device command sequence
- PR: #21033
- [skip ci] Docker image updates needed for packaging
- PR: #21253
- Add FORCE_PUSH_TO_TRACY option to DumpDeviceProfileResults
- PR: #20742
- Add support of Mistral-7B into TT-Transformers
- PR: #19995
- Revert "Add support of Mistral-7B into TT-Transformers (#19995)"
- PR: #21263
- update codeowners
- PR: #21254
v0.58.0-rc17
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/14767852291
📦 Uncategorized
- Removing the 6u limit on main
- PR: #21025
- Add support for a performance mode in DRAM Prefetcher
- PR: #20942
- Demo for yolov8s_world model
- PR: #19956
- #17700: Yolov9c model trace perf bringup
- PR: #20747
- Remove DispatchMemMap singleton, move ownership to MetalContext
- PR: #20974
- #20895: Restrict using zero copy chunking to contiguous outermost dim
- PR: #21015
- Fix shifted RISCV_SOFT_RESET_0_BRISC value
- PR: #21024
- Remove dealloc of persistent buffer tt_stats in RMS
- PR: #21022
- TTNN x TT-Mesh Integration: Exposing a Native MultiDevice Backend to TTNN
- PR: #18067
- Limit scope of xtensor-blas dependency
- PR: #21021
- Clear set tracking device ids in database between test runs
- PR: #20999
- In Place Halo Multicasting on WH/BH
- PR: #20878
- Trace support for yolov8x
- PR: #20884
- #0: Fix for ttnn.CreateDevice with multiple N150s
- PR: #20897
- #18922: Add int support for zero comparison ops
- PR: #19337
- Update perf margin
- PR: #21045
- Add fixes to LM Head unit test
- PR: #21028
- #20717: Generate per core op to op time csv
- PR: #20855
- Bump UMD
- PR: #21043
- #19891: ttnn.sort-single-core-implementation
- PR: #20514
- #18973: Deduplicate ttnn_test_fixtures.hpp code
- PR: #20863
- Fix dangling reference in ElfFile::Impl constructor
- PR: #21002
- #21060: Update unary documentation
- PR: #21057
- [skip ci] Add TG demo to choose your own TG pipeline
- PR: #21066
- #0: (MINOR) The REAL bump to v0.58.0
- PR: #21068
- add 6u specific full mesh bandwidth tests
- PR: #21037
- Add multicore support for argmax for any rank and shape
- PR: #20730
- Filter out wheel without the + because that seems to be an older thing, I guess
- PR: #21072
- tt_transformets/tt/attention: Fix typo
- PR: #21081
- [skip ci] Move FD tests to CIv2 (WH only)
- PR: #21079
- #0: [skip ci] Raise blackhole post commit fd test timeout to 40m
- PR: #21090
- Bump UMD
- PR: #21080
- Split go message from device command sequence
- PR: #21006
- A Docker image for package validation
- PR: #21095
- Enable some compiler warnings
- PR: #21032
- EXTERNAL PR: TTNN Support for stack [adding new ttnn op]
- PR: #20500
- Add ProgramDescriptor for future use in TTNN Generic OP
- PR: #21031
- Update device perf margins
- PR: #21107
- Stability script for Resnet50
- PR: #20973
- #17138: Add vae midblock and upblocks
- PR: #20988
- #20979: Add uint16 support for ttnn.add
- PR: #20910
- fix SDXL bias for split conv
- PR: #21110
- Apply clang format to generic_pools.cpp
- PR: #21111
- Jrock/device perf apr24 2
- PR: #21117
- Add fix for argmax in demo causing bad outputs
- PR: #21121
- Fixing TopK L1 limitations, currently for single core implmentation, otherwise default to multi-core where changes are not needed
- PR: #21029
- #0: Don't fail generate-system-logs action
- PR: #21092
- Remove superfluous 'static'
- PR: #20967
- [DM] #19268: Testing for "One to one" primitive
- PR: #20722
- Watcher to catch noc_inline_dw_write's to DRAM
- PR: #21093
- add yml version of support cloud request template
- PR: #20954
- Remove dispatch profiler test for BH until CI HW setup stabilizes
- PR: #21123
- #21082: Disable TensixInlineWriteDynamicNoc on BH
- PR: #21129
- TM Stress test + 2 tiny fixes
- PR: #21126
- Fix typo
- PR: #21136
- Restore torch to pyproject.toml
- PR: #21138
- Fusing pre and post rms norm
- PR: #21018
- Update the support request template to be yml
- PR: #21139
- Nest import of torch inside test function
- PR: #21130
- [skip ci] Move CODEOWNERS to .github/
- PR: #21151
- Add support for uneven shards in ttnn.upsample for mode=nearest
- PR: #21063
- Update improved performance for convnet_mnist
- PR: #21164
- #0: Propagate runner label for blackhole demo tests on BH post commit
- PR: #21166
- [skip ci] Cleanup codeowner file a bit
- PR: #21158
- Start of -dev package
- PR: #21035
- [skip ci] Cleanup codeowners
- PR: #21170
- Disable shlibdeps for now -- one of the builds is failing
- PR: #21174
- [skip ci] fix lack of extra-tag in tg demo
- PR: #21176
- Add support for RM input for all_gather_concat + implicit tilize for its output
- PR: #20991
- Fix and deduplicate reduce scatter code that computes receiver/sender IDs around cluster axis
- PR: #21171
- #16015: Support new op ttnn.experimental.broadcast_to
- PR: #19759
- #21140: updated SDXL conv2d tests
- PR: #21152
- #17138: Add VAE decoder
- PR: #21133
- #0: Updated DRAM Slice Size calc logic
- PR: #20857
- Revert "Fusing pre and post rms norm"
- PR: #21191
- #20905: Add int support for relational ops
- PR: #21048
- device perf dispatch margin updates
- PR: #21193
- #0: [skip ci] Change produce data workflow owner to William
- PR: #21204
- Use device fixture instead of MeshDevice for test_matmul_1d_ring_llama_perf
- PR: #21180
- Add int support for eq LLK
- PR: #21039
- #21140: SDXL ttnn.group_norm tests updated
- PR: #21184
- Update test_llama_ops_perf_TG_llama.py
- PR: #21213
- Include enough Git info for Git Describe to do its job
- PR: #21217
- Fixed failing eth profiler tests in metal microbenchmarks workflow
- PR: #21211
- Add support for 2D torus at device init for 6u
- PR: #21118
- Add 0D, 1D and 0V support for matmul ops
- PR: #21003
- [skip ci] Add small delay before calling APIs in produce_data
- PR: #21216
- #19918 bcast test CB assert
- PR: #21168
- Increase lower tolerance for Falcon7b t3k/tg targets since CI varies too much
- PR: #21232
- Add test_system_health binary to run on 6U/T3K
- PR: #21020
- Remove legacy Async Mode APIs
- PR: #21094
- Resolve AllGatherAsyncMinimal segfault
- PR: #21229
- Fix import logic to not import funcs from test files
- PR: #21141
- Get rid of unused MULTI_DEVICE storage type enum
- PR: #21165
- [skip ci] Slightly update perf target for Falcon7b-t3k-2048
- PR: #21241
- Add missing noc selecting opt to 1d fabric device init
- PR: #21230
- Revert "#17138: Add VAE decoder (#21133)"
- PR: #21203
- #21169: Resolve 1D Fabric Microbenchmark Failures
- PR: #21178
- #0: Optimize Llama SDPA decode by using 16x32 tiles and removing copy_blocks
- PR: #20912
- [skip ci] Produce data: Downgrade arch-label exception to warning
- PR: #21249
- Split launch message from device command sequence
- PR: #21033
- [skip ci] Docker image updates needed for packaging
- PR: #21253
- Add FORCE_PUSH_TO_TRACY option to DumpDeviceProfileResults
- PR: #20742
- Add support of Mistral-7B into TT-Transformers
- PR: #19995
- Revert "Add support of Mistral-7B into TT-Transformers (#19995)"
- PR: #21263
- update codeowners
- PR: #21254
v0.58.0-rc16
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/14756585688
📦 Uncategorized
- Removing the 6u limit on main
- PR: #21025
- Add support for a performance mode in DRAM Prefetcher
- PR: #20942
- Demo for yolov8s_world model
- PR: #19956
- #17700: Yolov9c model trace perf bringup
- PR: #20747
- Remove DispatchMemMap singleton, move ownership to MetalContext
- PR: #20974
- #20895: Restrict using zero copy chunking to contiguous outermost dim
- PR: #21015
- Fix shifted RISCV_SOFT_RESET_0_BRISC value
- PR: #21024
- Remove dealloc of persistent buffer tt_stats in RMS
- PR: #21022
- TTNN x TT-Mesh Integration: Exposing a Native MultiDevice Backend to TTNN
- PR: #18067
- Limit scope of xtensor-blas dependency
- PR: #21021
- Clear set tracking device ids in database between test runs
- PR: #20999
- In Place Halo Multicasting on WH/BH
- PR: #20878
- Trace support for yolov8x
- PR: #20884
- #0: Fix for ttnn.CreateDevice with multiple N150s
- PR: #20897
- #18922: Add int support for zero comparison ops
- PR: #19337
- Update perf margin
- PR: #21045
- Add fixes to LM Head unit test
- PR: #21028
- #20717: Generate per core op to op time csv
- PR: #20855
- Bump UMD
- PR: #21043
- #19891: ttnn.sort-single-core-implementation
- PR: #20514
- #18973: Deduplicate ttnn_test_fixtures.hpp code
- PR: #20863
- Fix dangling reference in ElfFile::Impl constructor
- PR: #21002
- #21060: Update unary documentation
- PR: #21057
- [skip ci] Add TG demo to choose your own TG pipeline
- PR: #21066
- #0: (MINOR) The REAL bump to v0.58.0
- PR: #21068
- add 6u specific full mesh bandwidth tests
- PR: #21037
- Add multicore support for argmax for any rank and shape
- PR: #20730
- Filter out wheel without the + because that seems to be an older thing, I guess
- PR: #21072
- tt_transformets/tt/attention: Fix typo
- PR: #21081
- [skip ci] Move FD tests to CIv2 (WH only)
- PR: #21079
- #0: [skip ci] Raise blackhole post commit fd test timeout to 40m
- PR: #21090
- Bump UMD
- PR: #21080
- Split go message from device command sequence
- PR: #21006
- A Docker image for package validation
- PR: #21095
- Enable some compiler warnings
- PR: #21032
- EXTERNAL PR: TTNN Support for stack [adding new ttnn op]
- PR: #20500
- Add ProgramDescriptor for future use in TTNN Generic OP
- PR: #21031
- Update device perf margins
- PR: #21107
- Stability script for Resnet50
- PR: #20973
- #17138: Add vae midblock and upblocks
- PR: #20988
- #20979: Add uint16 support for ttnn.add
- PR: #20910
- fix SDXL bias for split conv
- PR: #21110
- Apply clang format to generic_pools.cpp
- PR: #21111
- Jrock/device perf apr24 2
- PR: #21117
- Add fix for argmax in demo causing bad outputs
- PR: #21121
- Fixing TopK L1 limitations, currently for single core implmentation, otherwise default to multi-core where changes are not needed
- PR: #21029
- #0: Don't fail generate-system-logs action
- PR: #21092
- Remove superfluous 'static'
- PR: #20967
- [DM] #19268: Testing for "One to one" primitive
- PR: #20722
- Watcher to catch noc_inline_dw_write's to DRAM
- PR: #21093
- add yml version of support cloud request template
- PR: #20954
- Remove dispatch profiler test for BH until CI HW setup stabilizes
- PR: #21123
- #21082: Disable TensixInlineWriteDynamicNoc on BH
- PR: #21129
- TM Stress test + 2 tiny fixes
- PR: #21126
- Fix typo
- PR: #21136
- Restore torch to pyproject.toml
- PR: #21138
- Fusing pre and post rms norm
- PR: #21018
- Update the support request template to be yml
- PR: #21139
- Nest import of torch inside test function
- PR: #21130
- [skip ci] Move CODEOWNERS to .github/
- PR: #21151
- Add support for uneven shards in ttnn.upsample for mode=nearest
- PR: #21063
- Update improved performance for convnet_mnist
- PR: #21164
- #0: Propagate runner label for blackhole demo tests on BH post commit
- PR: #21166
- [skip ci] Cleanup codeowner file a bit
- PR: #21158
- Start of -dev package
- PR: #21035
- [skip ci] Cleanup codeowners
- PR: #21170
- Disable shlibdeps for now -- one of the builds is failing
- PR: #21174
- [skip ci] fix lack of extra-tag in tg demo
- PR: #21176
- Add support for RM input for all_gather_concat + implicit tilize for its output
- PR: #20991
- Fix and deduplicate reduce scatter code that computes receiver/sender IDs around cluster axis
- PR: #21171
- #16015: Support new op ttnn.experimental.broadcast_to
- PR: #19759
- #21140: updated SDXL conv2d tests
- PR: #21152
- #17138: Add VAE decoder
- PR: #21133
- #0: Updated DRAM Slice Size calc logic
- PR: #20857
- Revert "Fusing pre and post rms norm"
- PR: #21191
- #20905: Add int support for relational ops
- PR: #21048
- device perf dispatch margin updates
- PR: #21193
- #0: [skip ci] Change produce data workflow owner to William
- PR: #21204
- Use device fixture instead of MeshDevice for test_matmul_1d_ring_llama_perf
- PR: #21180
- Add int support for eq LLK
- PR: #21039
- #21140: SDXL ttnn.group_norm tests updated
- PR: #21184
- Update test_llama_ops_perf_TG_llama.py
- PR: #21213
- Include enough Git info for Git Describe to do its job
- PR: #21217
- Fixed failing eth profiler tests in metal microbenchmarks workflow
- PR: #21211
- Add support for 2D torus at device init for 6u
- PR: #21118
- Add 0D, 1D and 0V support for matmul ops
- PR: #21003
- [skip ci] Add small delay before calling APIs in produce_data
- PR: #21216
- #19918 bcast test CB assert
- PR: #21168
- Increase lower tolerance for Falcon7b t3k/tg targets since CI varies too much
- PR: #21232
- Add test_system_health binary to run on 6U/T3K
- PR: #21020
- Remove legacy Async Mode APIs
- PR: #21094
- Resolve AllGatherAsyncMinimal segfault
- PR: #21229
- Fix import logic to not import funcs from test files
- PR: #21141
- Get rid of unused MULTI_DEVICE storage type enum
- PR: #21165
- [skip ci] Slightly update perf target for Falcon7b-t3k-2048
- PR: #21241
- Add missing noc selecting opt to 1d fabric device init
- PR: #21230
- Revert "#17138: Add VAE decoder (#21133)"
- PR: #21203
- #21169: Resolve 1D Fabric Microbenchmark Failures
- PR: #21178
- #0: Optimize Llama SDPA decode by using 16x32 tiles and removing copy_blocks
- PR: #20912
- [skip ci] Produce data: Downgrade arch-label exception to warning
- PR: #21249
- Split launch message from device command sequence
- PR: #21033
- [skip ci] Docker image updates needed for packaging
- PR: #21253
- Add FORCE_PUSH_TO_TRACY option to DumpDeviceProfileResults
- PR: #20742
- Add support of Mistral-7B into TT-Transformers
- PR: #19995
- Revert "Add support of Mistral-7B into TT-Transformers (#19995)"
- PR: #21263
- update codeowners
- PR: #21254
v0.58.0-rc15
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/14744816028
📦 Uncategorized
- Removing the 6u limit on main
- PR: #21025
- Add support for a performance mode in DRAM Prefetcher
- PR: #20942
- Demo for yolov8s_world model
- PR: #19956
- #17700: Yolov9c model trace perf bringup
- PR: #20747
- Remove DispatchMemMap singleton, move ownership to MetalContext
- PR: #20974
- #20895: Restrict using zero copy chunking to contiguous outermost dim
- PR: #21015
- Fix shifted RISCV_SOFT_RESET_0_BRISC value
- PR: #21024
- Remove dealloc of persistent buffer tt_stats in RMS
- PR: #21022
- TTNN x TT-Mesh Integration: Exposing a Native MultiDevice Backend to TTNN
- PR: #18067
- Limit scope of xtensor-blas dependency
- PR: #21021
- Clear set tracking device ids in database between test runs
- PR: #20999
- In Place Halo Multicasting on WH/BH
- PR: #20878
- Trace support for yolov8x
- PR: #20884
- #0: Fix for ttnn.CreateDevice with multiple N150s
- PR: #20897
- #18922: Add int support for zero comparison ops
- PR: #19337
- Update perf margin
- PR: #21045
- Add fixes to LM Head unit test
- PR: #21028
- #20717: Generate per core op to op time csv
- PR: #20855
- Bump UMD
- PR: #21043
- #19891: ttnn.sort-single-core-implementation
- PR: #20514
- #18973: Deduplicate ttnn_test_fixtures.hpp code
- PR: #20863
- Fix dangling reference in ElfFile::Impl constructor
- PR: #21002
- #21060: Update unary documentation
- PR: #21057
- [skip ci] Add TG demo to choose your own TG pipeline
- PR: #21066
- #0: (MINOR) The REAL bump to v0.58.0
- PR: #21068
- add 6u specific full mesh bandwidth tests
- PR: #21037
- Add multicore support for argmax for any rank and shape
- PR: #20730
- Filter out wheel without the + because that seems to be an older thing, I guess
- PR: #21072
- tt_transformets/tt/attention: Fix typo
- PR: #21081
- [skip ci] Move FD tests to CIv2 (WH only)
- PR: #21079
- #0: [skip ci] Raise blackhole post commit fd test timeout to 40m
- PR: #21090
- Bump UMD
- PR: #21080
- Split go message from device command sequence
- PR: #21006
- A Docker image for package validation
- PR: #21095
- Enable some compiler warnings
- PR: #21032
- EXTERNAL PR: TTNN Support for stack [adding new ttnn op]
- PR: #20500
- Add ProgramDescriptor for future use in TTNN Generic OP
- PR: #21031
- Update device perf margins
- PR: #21107
- Stability script for Resnet50
- PR: #20973
- #17138: Add vae midblock and upblocks
- PR: #20988
- #20979: Add uint16 support for ttnn.add
- PR: #20910
- fix SDXL bias for split conv
- PR: #21110
- Apply clang format to generic_pools.cpp
- PR: #21111
- Jrock/device perf apr24 2
- PR: #21117
- Add fix for argmax in demo causing bad outputs
- PR: #21121
- Fixing TopK L1 limitations, currently for single core implmentation, otherwise default to multi-core where changes are not needed
- PR: #21029
- #0: Don't fail generate-system-logs action
- PR: #21092
- Remove superfluous 'static'
- PR: #20967
- [DM] #19268: Testing for "One to one" primitive
- PR: #20722
- Watcher to catch noc_inline_dw_write's to DRAM
- PR: #21093
- add yml version of support cloud request template
- PR: #20954
- Remove dispatch profiler test for BH until CI HW setup stabilizes
- PR: #21123
- #21082: Disable TensixInlineWriteDynamicNoc on BH
- PR: #21129
- TM Stress test + 2 tiny fixes
- PR: #21126
- Fix typo
- PR: #21136
- Restore torch to pyproject.toml
- PR: #21138
- Fusing pre and post rms norm
- PR: #21018
- Update the support request template to be yml
- PR: #21139
- Nest import of torch inside test function
- PR: #21130
- [skip ci] Move CODEOWNERS to .github/
- PR: #21151
- Add support for uneven shards in ttnn.upsample for mode=nearest
- PR: #21063
- Update improved performance for convnet_mnist
- PR: #21164
- #0: Propagate runner label for blackhole demo tests on BH post commit
- PR: #21166
- [skip ci] Cleanup codeowner file a bit
- PR: #21158
- Start of -dev package
- PR: #21035
- [skip ci] Cleanup codeowners
- PR: #21170
- Disable shlibdeps for now -- one of the builds is failing
- PR: #21174
- [skip ci] fix lack of extra-tag in tg demo
- PR: #21176
- Add support for RM input for all_gather_concat + implicit tilize for its output
- PR: #20991
- Fix and deduplicate reduce scatter code that computes receiver/sender IDs around cluster axis
- PR: #21171
- #16015: Support new op ttnn.experimental.broadcast_to
- PR: #19759
- #21140: updated SDXL conv2d tests
- PR: #21152
- #17138: Add VAE decoder
- PR: #21133
- #0: Updated DRAM Slice Size calc logic
- PR: #20857
- Revert "Fusing pre and post rms norm"
- PR: #21191
- #20905: Add int support for relational ops
- PR: #21048
- device perf dispatch margin updates
- PR: #21193
- #0: [skip ci] Change produce data workflow owner to William
- PR: #21204
- Use device fixture instead of MeshDevice for test_matmul_1d_ring_llama_perf
- PR: #21180
- Add int support for eq LLK
- PR: #21039
- #21140: SDXL ttnn.group_norm tests updated
- PR: #21184
- Update test_llama_ops_perf_TG_llama.py
- PR: #21213
- Include enough Git info for Git Describe to do its job
- PR: #21217
- Fixed failing eth profiler tests in metal microbenchmarks workflow
- PR: #21211
- Add support for 2D torus at device init for 6u
- PR: #21118
- Add 0D, 1D and 0V support for matmul ops
- PR: #21003
- [skip ci] Add small delay before calling APIs in produce_data
- PR: #21216
- #19918 bcast test CB assert
- PR: #21168
- Increase lower tolerance for Falcon7b t3k/tg targets since CI varies too much
- PR: #21232
- Add test_system_health binary to run on 6U/T3K
- PR: #21020
- Remove legacy Async Mode APIs
- PR: #21094
- Resolve AllGatherAsyncMinimal segfault
- PR: #21229
- Fix import logic to not import funcs from test files
- PR: #21141
- Get rid of unused MULTI_DEVICE storage type enum
- PR: #21165
- [skip ci] Slightly update perf target for Falcon7b-t3k-2048
- PR: #21241
- Add missing noc selecting opt to 1d fabric device init
- PR: #21230
- Revert "#17138: Add VAE decoder (#21133)"
- PR: #21203
- #21169: Resolve 1D Fabric Microbenchmark Failures
- PR: #21178
- #0: Optimize Llama SDPA decode by using 16x32 tiles and removing copy_blocks
- PR: #20912
- [skip ci] Produce data: Downgrade arch-label exception to warning
- PR: #21249
- Split launch message from device command sequence
- PR: #21033
- [skip ci] Docker image updates needed for packaging
- PR: #21253
- Add FORCE_PUSH_TO_TRACY option to DumpDeviceProfileResults
- PR: #20742
- Add support of Mistral-7B into TT-Transformers
- PR: #19995
- Revert "Add support of Mistral-7B into TT-Transformers (#19995)"
- PR: #21263
- update codeowners
- PR: #21254