Releases: sophgo/tpu-mlir

v1.11-beta.0

18 Sep 09:02
[soc_dump] add doc

Change-Id: Icaf313113415a9bf0ad9c75abdcb609d661c815b

TPU-MLIR v1.10 Release

15 Aug 05:02

Release Note

Enhancements:

  • Added CUDA support for various operations like conv2d, MatMul, dwconv, pool2d, and more.
  • Improved performance for operations like MeanStdScale and softmax.
  • Enhanced multi-core batch mm and added support for bm168x with CUDA.
  • Refined CUDA code style and adjusted interfaces for various operations.
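The MeanStdScale preprocessing mentioned above can be illustrated in plain NumPy. This is a minimal sketch of the op's per-channel semantics only, not tpu-mlir's implementation; the function name and signature are hypothetical:

```python
import numpy as np

def mean_std_scale(x, mean, std, scale=1.0):
    """Per-channel normalization on NCHW input: (x - mean) / std, then scale."""
    mean = np.asarray(mean).reshape(1, -1, 1, 1)  # broadcast over N, H, W
    std = np.asarray(std).reshape(1, -1, 1, 1)
    return (x - mean) / std * scale

# typical ImageNet mean/std values, applied to a dummy NCHW tensor
x = np.full((1, 3, 2, 2), 128.0)
y = mean_std_scale(x, mean=[123.675, 116.28, 103.53],
                   std=[58.395, 57.12, 57.375])
```

Fusing this normalization into one op (rather than separate sub, div, and mul ops) is what makes it a candidate for the performance work noted above.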

Bug Fixes:

  • Fixed matmul issues, calibration failures, conv padding problems, and various performance regressions.
  • Addressed bugs in model transformations, calibration, and various pattern issues.
  • Resolved bugs in different model backends like ssd, vit, detr, and yolov5.

New Features:

  • Added support for new models like resnet50, mobilenet_v2, shufflenet_v2, and yolox_s/alphapose_res50.
  • Introduced new operations like RequantIntAxisOp and Depth2Space with CUDA support.
  • Implemented new functionalities for better model inference and compilation.
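The Depth2Space op added above rearranges channel blocks into spatial positions; a NumPy sketch of the standard DCR layout (illustrative only, independent of the CUDA implementation):

```python
import numpy as np

def depth2space(x, block):
    """Rearrange (N, C*block^2, H, W) -> (N, C, H*block, W*block), DCR order."""
    n, c, h, w = x.shape
    assert c % (block * block) == 0
    x = x.reshape(n, block, block, c // (block * block), h, w)
    x = x.transpose(0, 3, 4, 1, 5, 2)  # -> N, C', H, block_h, W, block_w
    return x.reshape(n, c // (block * block), h * block, w * block)

x = np.arange(16, dtype=np.float32).reshape(1, 4, 2, 2)
y = depth2space(x, 2)  # shape (1, 1, 4, 4)
```

Each output 2x2 tile interleaves one pixel from each of the four input channels, which is the inverse of the Space2Depth transform.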

Documentation Updates:

  • Updated weight.md, calibration sections, and user interface details.
  • Improved documentation for quick start, developer manual, and various tpulang interfaces.
  • Enhanced documentation for model transformation parameters and tensor data arrangements.

Miscellaneous:

  • Added new npz tools, modelzoo regression, and support for bmodel encryption.
  • Fixed issues with model performance, shape inference, and CUDA backend optimizations.
  • Restored performance for models such as yolov5s-6 and bm1690 swin multicore.
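The npz tools mentioned above revolve around diffing .npz tensor dumps between reference and deployed models. A minimal comparison sketch in NumPy, to illustrate the idea only; this is not the actual npz tool interface, and `compare_npz` is a hypothetical name:

```python
import numpy as np
import tempfile, os

def compare_npz(ref_path, got_path, tol=1e-5):
    """Report (max_abs_diff, passed) for each tensor name shared by two .npz files."""
    ref, got = np.load(ref_path), np.load(got_path)
    report = {}
    for name in sorted(set(ref.files) & set(got.files)):
        diff = float(np.max(np.abs(ref[name].astype(np.float64)
                                   - got[name].astype(np.float64))))
        report[name] = (diff, diff <= tol)
    return report

# demo: two dumps that agree on "a" but differ on "b"
d = tempfile.mkdtemp()
ref_f, got_f = os.path.join(d, "ref.npz"), os.path.join(d, "got.npz")
a = np.ones((2, 2))
np.savez(ref_f, a=a, b=np.zeros(3))
np.savez(got_f, a=a, b=np.full(3, 0.5))
report = compare_npz(ref_f, got_f)
```

Per-tensor tolerances like this are what make it possible to localize which layer of a quantized model diverges from the float reference.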

TPU-MLIR v1.9 Release

15 Jul 14:40

Release Note

Enhancements:

  • Implemented output order preservation in converters like ONNX, Caffe, Torch, and TFLite.
  • Added support for resnet50-v2 bm1690 f8 regression.
  • Improved ILP group mlir file sequences for resnet50 training.
  • Updated chip libraries and the PerfAI profiling tool for A2 profiling.
  • Added a new dump mode "COMB" and refined abs/relu conversions.

Bug Fixes:

  • Fixed issues with preprocess when source layout differs from target layout.
  • Addressed bugs in various operations like softmax, concat, and weight reorder in conv2d.
  • Resolved bugs in model training, model transformation, and various pattern issues.
  • Fixed bugs related to CUDA inference, matmul with bias, and multi-output calibration.

New Features:

  • Added support for multi-graph in TPULang.
  • Introduced new options in TPULang for inference and model deployment.
  • Implemented various optimizations and enhancements for dynamic operations and model transformations.

Documentation Updates:

  • Refined documentation for quick start quantization and user interface sections.
  • Updated backend information, docker image download methods, and model deployment details in the documentation.

Miscellaneous:

  • Improved performance for models such as vit and yolov5s, and on the bm1690 chip.
  • Introduced new functionalities like embedding multi-device slice and groupnorm train operations.
  • Added support for adaptive_avgpool inference and multiple Einsum modes.
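The multiple Einsum modes mentioned above correspond to different contraction patterns expressed in one notation; NumPy's einsum illustrates two common ones (batched MatMul and contraction against a transposed weight), independent of how the compiler lowers them:

```python
import numpy as np

a = np.random.rand(2, 3, 4)
b = np.random.rand(2, 4, 5)
bmm = np.einsum("bik,bkj->bij", a, b)   # batched MatMul mode
assert np.allclose(bmm, a @ b)

w = np.random.rand(5, 4)
proj = np.einsum("bik,jk->bij", a, w)   # contract with transposed weight
assert np.allclose(proj, a @ w.T)
```

Recognizing which mode an einsum equation falls into is what lets a compiler map it onto an existing MatMul or transpose+MatMul kernel instead of a generic loop.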

TPU-MLIR v1.8.1

12 Jul 09:27

Full Changelog: v1.8...v1.8.1

TPU-MLIR v1.8 Release

29 May 11:15

Highlights:

  • Enhancements:

    • Added support for dynamic shape inference in various operations.
    • Optimized core operations for better performance on specific models.
    • Improved backend support for multiple models like BM1684X, BM1688, BM1690, SG2380, etc.
    • Introduced new operations and patterns for more efficient model processing.
    • Updated documentation for better clarity and user guidance.
  • Bug Fixes:

    • Resolved issues related to input/output handling, kernel configurations, and model-specific bugs.
    • Fixed bugs in dynamic compilation, core parallel processing, and various backend operations.
    • Addressed errors in specific model post-processing steps like YOLOv5, EfficientNet, etc.
  • Performance Improvements:

    • Optimized cycle calculations for multi-core models.
    • Enhanced bandwidth usage statistics for better resource management.
    • Accelerated compilation processes for training models using a new layer-group scheme.
  • New Features:

    • Introduced new operations like attention quant block, prelu op, and various dynamic compile features.
    • Added support for additional operations, weight location, and dynamic compile enhancements.

Documentation Updates:

  • Updated developer manuals, quick start guides, and model-specific documentation for better understanding.

Miscellaneous:

  • Streamlined workflows for faster commit checks and improved debugging processes.
  • Added new test cases for regression testing and script-based model evaluations.
  • Fine-tuned backend operations for improved model performance and accuracy.

TPU-MLIR v1.7 Release

19 Apr 09:58

Change Log

New Features

  • Added support for new operations including flash attention, custom op dynamic compile, and tpulang ops.
  • Enabled AttnReorder and added support for dynamic indices in ops like onehot, scatterelements, and cumsum.
  • Added --dump_dataframe option for bmodel_checker and support for transpose with order [1, 2, 3, 0].
  • Introduced Watchpoint feature to TDB and added support for mixed-precision networks.
  • Implemented optimizations for dma efficiency of flash attention and optimized backend for various models.
  • Added support for local memory dump in pcie mode and added various quantization features like eva quant, swin quant, and detr quant.
  • Enhanced multi-core support including support for LayerNorm and GroupNorm in coreParallel, and multi-core data slice in tensorLocation.
  • Added new patterns for Cswin and Einsum operations.
  • Improved support for LLM (Large Language Models) in bm1688.
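For reference, these are the LayerNorm semantics that coreParallel now splits across cores. This NumPy sketch shows the math of the op only; the multi-core slicing itself is handled by the compiler and is not shown here:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5, axis=-1):
    """Normalize over the given axis, then apply the affine gamma/beta."""
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

x = np.random.rand(2, 8).astype(np.float32)
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```

Because the reduction runs over the normalized axis only, rows are independent of one another, which is what makes splitting the batch across cores straightforward.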

Bug Fixes

  • Fixed various bugs including kernel_module msg_id, SAM-VIT-encoder regression, and attention accuracy problems.
  • Addressed logical issues in AddToScale pattern and issues in fp_forward.
  • Resolved bugs in model info core dump, op's liveRange in coreParallel, and DevParallel bugs.
  • Fixed issues in model combine with io alone and bugs in various ops like interp, RotaryPosEmbPattern, and efficient-lite4 permute.

Performance Improvements

  • Improved the performance of TDB and the bmodel_checker for 1684x pcie.
  • Optimized facenet and fixed performance issues of 1688 multicore.
  • Enabled single-core mode optimizations where necessary.

Documentation and Testing

  • Updated documentation, refined custom chapters, and ensured consistency in quick start docs.
  • Added test cases for custom tpulang, multi-core with subnets, and custom cpuop.
  • Fixed various documentation errors and updated the release note.

Other Changes

  • Added restrictions to tpulang ops and net test cases.
  • Adjusted descriptions and refined interfaces for better user experience.
  • Updated backend .so files and addressed sensitive words in the codebase.
  • Added support for int4 dtype in tpu_profile and ensured tool/scripts work in Python virtual environments.

Technical Preview

01 Apr 09:16
Pre-release

Features

  • Added support for LLM Decoding by utilizing multi-cores to enhance processing efficiency.
  • Introduced fx2mlir, a new functionality for enhanced MLIR conversion.
  • Implemented nnvlc2.0 and nnvlc1.0 compression for local activations and weights, respectively, for improved neural network performance.
  • Enabled TPULANG support for operations like sort, argsort, and additional ops, enhancing the language's functionality and flexibility.
  • Added cv186x support in run_sensitive_layer.py and for the TDB, expanding compatibility and debugging capabilities.
  • Introduced new ops and features like Watchpoint in TDB and activation ops support for scale & zero_point, broadening the range of functionalities available in the tpu-mlir project.
  • Added support for BM1690.
  • Enabled L2 memory (L2mem) for intermediate data exchange of active tensors.

Bug Fixes

  • Resolved a variety of bugs affecting backend processes, including issues with the 1684x backend, permutefuse2, permutemulconstswap, and more, improving overall stability and performance.
  • Fixed several critical issues across tpulang, including errors in sort_by_key operation, reshape operations, where operation, and more, enhancing the language's reliability for developers.
  • Addressed bugs in model processing, including fixes for concat logic, scale2conv, scale2conv3d, instance norm, and several more, ensuring smoother model optimization and execution.
  • Corrected errors in the documentation, providing clearer and more accurate information for users and developers.

Documentation Updates

  • Updated tpulang documentation to include new functionalities and optimizations, making it easier for users to understand and utilize the language effectively.

Performance Improvements

  • Optimized TDB and bmodel_checker for 1684x pcie mode, significantly reducing processing times and enhancing efficiency for model analysis.
  • Improved the efficiency of DMA in flash attention operations, ensuring faster data handling and processing.
  • Enabled IO tag mode and refined address mode for better memory management and operational flexibility.

TPU-MLIR v1.6.1

27 Mar 12:25

TPU-MLIR v1.6 release

23 Feb 16:56

Change Log

Bug Fixes

  • Fixed documentation errors and added checks for documentation errors during build.
  • Set workaround for ar.copy cycle issue to 0, avoiding potential data overwriting in inplacing operations.
  • Addressed a bug in Caffe DetectionOutput and fixed a hang in cv186x.
  • Corrected Mul buffer size alignment issues and various other buffer size corrections.
  • Fixed issues with attention accuracy, RotaryPosEmbPattern, and op status validation before the matching process.
  • Addressed a series of backend bugs, including daily build errors, performance declines, and incorrect return values.
  • Fixed data_checker issues, api_conv bug, and a local slice calculation bug.
  • Resolved incorrect affineMap for Pooling buffer and fixed reshape bug for inner products.
  • Corrected Mul&Div dynamic support for local operations and fixed issues with Conv2d buffer size calculations.
  • Addressed various matmul bugs, including fp8 support issues and quantization inconsistencies.

Features

  • Enabled multicore optimizations and added support for multi-core model tests.
  • Updated libbackend_1688.so and various backend updates for better performance and compatibility.
  • Introduced groupParallel operation, support for dynamic input data generation.
  • Added support for new patterns such as Permute fuse pattern and splitQuantizedMLP pattern.
  • Implemented npz compare visualizer tool and added support for bm1688 backend.
  • Added MatMul weight split case and improved permute performance.
  • Added support for img2col pattern, attention interface, and several dialects for SG2260 operations.
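The img2col pattern added above lowers convolution to matrix multiplication by unfolding input patches into columns. A minimal NumPy sketch (stride 1, no padding; illustrative of the technique, not the compiler pass):

```python
import numpy as np

def img2col(x, kh, kw):
    """Unfold (C, H, W) into (C*kh*kw, out_h*out_w) patch columns."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # each row holds one (channel, kernel-offset) slice, flattened
                cols[idx] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                idx += 1
    return cols

# convolution as GEMM: (F, C*kh*kw) @ (C*kh*kw, out_h*out_w)
x = np.random.rand(3, 5, 5).astype(np.float32)
w = np.random.rand(2, 3, 3, 3).astype(np.float32)
out = w.reshape(2, -1) @ img2col(x, 3, 3)  # shape (2, 9)
```

The payoff is that a hardware MatMul unit can then execute the convolution directly, at the cost of duplicating overlapping patch data.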

Documentation Updates

  • Updated release notes and resolved issues with document formatting.
  • Standardized expression terminology and replaced sensitive words in documentation.

Performance Improvements

  • Improved local softmax performance and optimized dataFlow checking in coreMatch.
  • Enhanced performance for Vit L i8 4 batch operations and refined conv multi-core handling.
  • Optimized VIT-B concurrency and addressed performance issues with MaxPool buffer sizes.

v1.6-beta.0

29 Jan 13:39
Pre-release

New Features

  • Implemented the SG2260 structureOp interface and structured transform, including a solver for finding transforms.
  • Added a OneHot converter and fp8 support in the debugger.
  • Supported MatMulOp broadcast in batch dims for special cases and added an interface for attention.
  • Provided "decompose linalg op" and "tile+fuse" passes so that parallel MatMul supports more batch patterns.
  • Added a UNet single-block test.
  • Implemented fp8 support for MatMul and other ops, including addconst, subconst, mul, add, sub, and abs.
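The "tile+fuse" idea above splits a large MatMul into independent tiles that can be scheduled in parallel. A NumPy sketch of batch-dimension tiling, to illustrate the decomposition only; the actual pass operates on linalg ops, and `tiled_batch_matmul` is a hypothetical name:

```python
import numpy as np

def tiled_batch_matmul(a, b, tile=2):
    """Compute the batched a @ b by processing `tile` batches at a time."""
    out = np.empty((a.shape[0], a.shape[1], b.shape[2]), dtype=a.dtype)
    for start in range(0, a.shape[0], tile):
        # each tile is an independent sub-MatMul; tiles could run on
        # separate cores since they touch disjoint slices of `out`
        out[start:start + tile] = a[start:start + tile] @ b[start:start + tile]
    return out

a = np.random.rand(6, 3, 4)
b = np.random.rand(6, 4, 5)
assert np.allclose(tiled_batch_matmul(a, b), a @ b)
```

Fusing a consumer op into each tile's loop body (the "fuse" half) then keeps the tile's intermediate result local instead of round-tripping it through memory.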

Performance Improvements

  • Improved MatMul fp8 performance with new backend support.
  • Enabled distributed MLP and attention with improved performance for cascade_net input/output names and order.
  • Refactored TDB to improve disassembler serialization and resolve a BM1688 decoding issue.
  • Improved weight reorder for ConvOp and optimized the permute of attention MatMul.

Bug Fixes

  • Resolved various bugs in MatMul, Conv, and other ops across multiple chipsets, including SG2260, BM1688, and CV18xx.
  • Fixed bugs related to ReduceOp, ArgOp, SliceOp, and others for better operation and tensor handling.
  • Addressed issues in SAM, the daily test, and TDB related to core operations and functionality.
  • Fixed memory and data handling bugs for more accurate and stable model execution.

Documentation Updates

  • Updated documentation to remove sensitive words and improve clarity and comprehensiveness.

Miscellaneous

  • Enhanced various backend libraries and supported new ops and patterns for more efficient and versatile model handling.
  • Improved scatterE and reduce dynamic shape_value handling for better model optimization.
  • Refined graph optimization, permute parallel indexMapping, and related areas for improved model processing.