Apache TVM v0.9.0
Introduction
The TVM community has worked since the v0.8 release to deliver many exciting features and improvements. v0.9.0 is the first release on the new quarterly release schedule and includes many highlights, such as:
- MetaSchedule's full implementation
- ARM cascading scheduler for Arm Ethos(TM)-U NPUs
- Collage, which brings tuning to BYOC
- Several microTVM improvements
- New tvm.relay.build parameters - runtime=, executor=
- AOT - Support for the C++ runtime (with llvm and c targets only) and support for host-driven AOT in the C runtime
- Hexagon RPC support
  - Testing via Hexagon SDK simulator and on device via Snapdragon-based HDK boards and phones
  - AOT and USMP support
  - Threading
  - Initial op support
- MLF - Support for multiple modules in a single MLF artifact
- Several TIR schedule primitives and transforms including (abridged):
  - schedule.transform_layout - Applies a layout transformation to a buffer as specified by an IndexMap.
  - schedule.transform_block_layout - Applies a schedule transformation to a block as specified by an IndexMap.
  - schedule.set_axis_separators - Sets axis separators in a buffer to lower to multi-dimensional memory (e.g. texture memory).
  - transform.InjectSoftwarePipeline - Transforms annotated loop nest into a pipeline prologue, body and epilogue where producers and consumers are overlapped.
  - transform.CommonSubexprElimTIR - Implements common-subexpression elimination for TIR.
  - transform.InjectPTXAsyncCopy - Rewrites global to shared memory copies in CUDA with async copy when annotated tir::attr::async_scope.
  - transform.LowerCrossThreadReduction - Enables support for reductions across threads on GPUs.
- And many more! See the list of RFCs and PRs included in v0.9.0 for a complete list, as well as the full change list.
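As a rough illustration of what an index-map-driven layout transformation such as schedule.transform_layout means conceptually, the plain-Python sketch below moves every element of a small buffer to the position given by an index map. It does not call the TVM API: the transform_layout helper, the dict-backed buffer, and the tiling map (i, j) -> (j // 4, i, j % 4) are all hypothetical stand-ins chosen for this example.

```python
from itertools import product


def transform_layout(buffer, shape, index_map):
    """Return a new dict-backed buffer keyed by index_map(old_index).

    Hypothetical helper for illustration only; this is not TVM code.
    """
    out = {}
    for idx in product(*(range(n) for n in shape)):
        new_idx = index_map(*idx)
        # A layout transformation must not send two elements to one slot.
        assert new_idx not in out, "index map must be injective"
        out[new_idx] = buffer[idx]
    return out


# A 4x8 row-major buffer; each element stores its own flat offset.
shape = (4, 8)
buffer = {(i, j): i * shape[1] + j for i, j in product(range(4), range(8))}

# Assumed index map: split the inner axis into blocks of 4 and hoist
# the block index outward, i.e. (i, j) -> (j // 4, i, j % 4).
tiled = transform_layout(buffer, shape, lambda i, j: (j // 4, i, j % 4))
```

In this sketch the element originally at (2, 7) ends up at (1, 2, 3). The data is unchanged and only its addressing differs, which is the property a layout transformation needs to preserve.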
RFCs
These RFCs have been merged in apache/tvm-rfcs since the last release.
- [RFC] TUNIP: TVMScript Unified Printer (#74) (48d47c5)
- [RFC][Backend] RFC-CSI-NN2-Integration (#75) (cfcf114)
- [RFC] Introducing DeclBuffer (#70) (87ff1fa)
- [RFC][MLF] Model Library Format with Multiple Modules (#76) (f47c6ad)
- [RFC] UMA Universal Modular Accelerator Interface (#60) (6990e13)
- [RFC] DietCode: An Auto-Scheduler for Dynamic Tensor Programs (#72) (a518000)
- [RFC] Quarterly Releases (#67) (70293c7)
- RFC-BYOC-DNNL-Integration (#73) (7aed0ca)
- [RFC] Relay Next Roadmap (#69) (ac15f2a)
- RFC: clarifying buffer declaration and access (#63) (de4fe97)
- Inclusive Language RFC (#68) (#68) (4203bd2)
- [USMP] Adding U4 usecase (#65) (b9e246f)
- Collage RFC (#62) (23250f5)
- Replace codeowners with more relevant automation (#58) (540c1f8)
- [RFC][TIR] Layout transformations on buffer access (#39) (b675ef8)
- Module Based Model Runtime for AOT (#46) (d9dd6eb)
- @slow test RFC (#55) (9b6203a)
- [RFC][Roadmap] TVM Continuous Integration & Testing Roadmap (#54) (41e5ba0)
- Bring PackedFunc into TVM Object System (#51) (2e0de6c)
- [RFC][OpenCLML] OpenCLML integration as BYOC (#52) (f5ef65f)
- Introduce the Arm(R) Ethos(TM)-U Cascading Scheduler (#37) (f9fa824)
- [RFC][Roadmap] microTVM roadmap (#53) (1b14456)
- Add Managed Jenkins Infrastructure for TVM RFC (#49) (a3a7d2c)
- TVM Roadmap RFC (#50) (263335f)
- [RFC] Integrate LIBXSMM with TVM. (#47) (1a3d4f1)
- [RELAY][AST] Add virtual device as a first class field to Relay expressions (#45) (67c39d2)
What's Changed
Note that this list is not comprehensive of all PRs and discussions since v0.8. Please visit the full listing of commits for a complete view: v0.8.0...v0.9.0.rc0.
AOT
- #11208 - Calculate used memory at the callsite of primitive functions
- #11365 - Fix function number datatype from char to uint16_t
- #11091 - Enable A-Normal Form in the AOT executor
- #10753 - Support LLVM backend with C++ runtime
- #10518 - Use python temporary directory for AOT tests
- #10337 - BugFix of workspace calculation
- #10282 - [runtime] Add Metadata classes for AOTExecutor
- #9501 - [3/3][DeviceAPI] Wire up cpacked Device API context
- #9500 - [2/3][DeviceAPI] Add Hooks for Activate/Deactivate/Open/Close
- #9395 - [1/3][DeviceAPI] Connecting devices structure to relevant operators
BYOC
- #11474 - Two helper passes for external codegen using RelayToTIR custom pass machinery
- #11144 - Remove support for run-time linked-params from codegen
- #10590 - Add order to functions in C Codegen
- #11638 - [DNNL][CBLAS] Unifies all MKLDNN/DNNL to DNNL
- #11619 - RelayToTIR custom codegen passes can still depend on dynamic shape functions
- DNNL - #11902, #11642, #11513, #11571, #11560, #11345, #11111, #10837, #10421, #9995, #9797
- TensorRT - #11923, #11203, #10759, #10772, #10388
- CMSIS-NN - #11732, #11625, #10939, #11013, #10817, #10563, #10224, #10148, #10100, #9338, #9531, #9409, #9331
- OpenCLML - #10243
- CUTLASS - #11631, #10185, #10177, #10110, #10036, #9899, #9820, #9800, #9795, #9746, #9737, #9698, #9595, #9571
- CUDNN - #10997, #9986, #9948
- ACL - #10801
- PTX - #10855, #10339, #9909
- CUBLAS - #10826, #10820
CI
- #11313 - Refactor of tvm.testing.requires_* annotations
- #11666 - Enable pylint for tests/python/ci
- #11657 - Apply linting rules to AOT tests
- #11380 - Restructure Jenkinsfile
- Automation - #11813, #11775, #11480, #11437, #10833, #10056, #9973, #9934
- User experience improvements - #11470, #11329, #11553, #11497, #11051, #10933, #10960, #10525, #10425, #10322, #10121, #9971, #9554, #9752, #9556
- Reduce CI runtime - #11402, #11349, #11258, #11132, #10946, #10743, #10359
- Code cleanups - #10968, #10740
Frontends
- PaddlePaddle - #11537, #9724, #9564
- TFLite - #10915, #10566
- Oneflow - #11321, #11036, #8790
- PyTorch - #11190, #10504, #10184, #10091
- ONNX - #10949, #9438, #9186, #9493, #9475
- Keras - #7006
Hexagon
- #11549 - Initial clip operator for Hexagon
- #11834 - Add op resize2d for hexagon
- #11559 - Softmax slice op initial version
- #11529 - Slice ops added - add, subtract, multiply
- #11720 - [testing] add max_pool2d benchmark
- #11417 - Implement avg_pool2d slice op
- #11653 - Add HexagonThreadManager
- #11547 - Run single RPC server on Android in each testing session
- #11490 - [testing] add TVMScript elemwise-add
- #11400 - [testing] refactor benchmark-table code
- #11277 - moves conftest.py to tvm.contrib.hexagon so outside repos can access the testing fixtures
- #11319 - Add unit tests for Hexagon Device API
- #11279 - Add USMP tests
- #11283 - Update Readme
- #11239 - capture gtest output and return over FFI
- #11175 - Add schedule and test for conv2d_transpose_nchw
- #11018 - [Runtime] Add QuRT thread pool backend
- #11145 - Add support for on-device unit testing using gtest
- #11138 - Add test for depthwise conv2d schedule
- #11016 - Add test for registered schedules
- #11104 - Add mobilenet test
- #11090 - Delete offload runtime, move files to right places
- #11065 - AoT with LLVM Codegen on Hexagon
- #11025 - Deprecate USE_HEXAGON_DEVICE, introduce USE_HEXAGON
- #10604 - HVX scheduling and bench-marking of TE element-wise add
- #10905 - [LLVM] Enable/test tensorized Hexagon DMA on 2d transformed layout
- #10907 - Move aot/graph_executor interactions into launcher
- #10919 - Register basic strategies and schedules for common operators
- #10904 - Add unit tests executing 2-d VTCM usage
- #10910 - Refactor to keep HexagonBuffer private to the device api
- #10908 - [LLVM][CodeGen] Make CodeGenHexagon a subclass of CodeGenCPU
- #10878 - Generalized HexagonBuffer::CopyTo/CopyFrom
- #10846 - Support both 1-d and 2-d VTCM allocations
- #10581 - Improved ergonomics of HexagonLauncher in unit tests.
- #10616 - Refactor tvm.contrib.hexagon, NFC
- #10612 - Deprecate SDK 3.x, rewrite HexagonSDK.cmake
- #10586 - Codegen for 2d Load/Store
- #10558 - Generalize builtin for Nd memory alloc with storage scope and add lowering for VTCM / Hexagon
- #10543 - [Runtime][PipelineExecutor] Add the pipeline internal forwarding logic.
- #10507 - Add doc on TVM - Hexagon RPC flow
- #10520 - Resolve breakage in test_hexagon/test_cache_read_write
- #10311 - [runtime]AOTExecutor implementation for C Codegen
- #10454 - Allow execution on target or simulator from HexagonLauncher
- #10365 - Lower cache_read and cache_write to Hexagon DMA via tensorize
- #10361 - RPC server/client for simulator
- #10302 - [CI]Add Hexagon Tests to pipeline
- #10263 - [Docker]Add docker file and scripts
- #10227 - Refactor Hexagon.cmake
- #10217 - Adding support for Hexagon User DMA Engine
- #10068 - Update hexagon API build instruction and cleanup hexagon_proxy_rpc
- #9970 - Do not auto-build apps when building TVM
- #9736 - Add unit tests for HexagonBuffer
- #9525 - Add Hexagon VTCM and discontiguous allocation support
- #9631 - Add RPC Mechanism for Hexagon
- #9473 - cleanup Hexagon conv2d tests
MetaSchedule
- #11884 - Postproc: Rewrite-Layout
- #11848 - [OpStrategy] Support MetaSchedule Layout
- #11845 - [Relay][Pass] Meta-Schedule-Layout-Rewrite
- #11758 - [Runtime] Enhance Runner RandomFill
- #11683 - Distributed Measurement
- #11751 - [Minor] Organize Testing Scripts
- #11735 - Modify Profiler Timers
- #11727 - Developer Ergonomics Enhancement II
- #11692 - Apply-History-Best Task Filtering
- #11486 - Add Profiler Support For Tuning Efficiency Optimization
- #11680 - JSONDatabase Utilities
- #11641 - Generate MetaSchedule Dataset
- #11622 - Developer Ergonomics Enhancement
- #11604 - Resolve dependencies between header files
- #11587 - Add Testing Script with ONNX Support
- #11590 - Evo Independence from TaskScheduler
- #11534 - No explicit unrolling for spatial PrimFunc
- #11512 - Enable Task Filtering
- #11177 - AutoBind rule and MutateThreadBinding
- #11157 - Logging Interface Unification
- #11088 - Auto tensorization for CPU / GPU dot product
- #10986 - [Refactor] Introduce TuneConfig
- #11020 - [Metaschedule, Refactor] Move MultiLevelTilingNode decl to a header
- #10927 - [Refactor] Clarify Integration Logic
- #10876 - Add utility API to ease using manual schedules
- #10885 - [BugFix] Fix skipped tests
- #10366 - Add Gradient Based Task Scheduler
- #10823 - Fine-Grained Rewrite Unbound Block
- #10793 - Add demonstration of selectively tuning relay ops with TIR schedules
- #10811 - Support grouping in the cost model
- #10810 - Extract task weights during task extraction
- #10782 - [TIR]Estimate TIR FLOPs
- #10776 - Misc updates for tuning end-to-end workloads
- #10689 - Upstream the leftover changes
- #10648 - [Meta Schedule] Refactor meta schedule testing utils
- #10578 - New relay backend for meta schedule task extraction
- #10534 - Bug Fix for Relay Integration
- #10501 - Update scripts for subgraph tuning
- #10497 - Refactor testing workloads
- #10461 - Enable AutoTVM-style template-based search space
- #10368 - Fix Cyclic Dependency in PyClass Family
- #10403 - Arithmetic analysis
- #10367 - Update Tuning Interfaces.
- #10079 - [M4a] User-API: Tune-TE/TIR/Relay
- #10081 - [M4a] Rewrite-Cooperative-Fetch
- #10055 - [M4b] Testcases for TensorRT builder/runner
- #10092 - [M4a] Mutator: Mutate-Tile-Size
- #10096 - [M4a] Mutator: Mutate Parallel
- #10071 - [M4a] PostProcessor: Rewrite-Parallel-Vectorize-Unroll
- #10043 - [M4a] Schedule Rule: Multi-Level-Tiling
- #10045 - Mutator: Mutate-Unroll
- #10033 - [M4a] Schedule Rule: Parallelize-Vectorize-Unroll
- #10027 - [M4a] PostProcessor: Rewrite-Unbound-Block
- #10028 - Mutator: Mutate-Compute-Location
- #9997 - [M4a] PostProcessor: Disallow-Dynamic-Loop
- #9994 - [M4a] Schedule Rule: Cross-Thread-Reduction
- #10013 - [M4a] PostProcessor: Rewrite Reduction Block
- #9975 - [M4a] Schedule Rule: Add-RFactor
- #9945 - [M4a] PostProcessor: Verify-GPU-Code
- #9940 - [M4a] Schedule Rule: Random-Compute-Location
- #9943 - [M4a] Schedule Rule: Auto-Inline
- #9860 - [M3c] Add Per-Store-Feature
- #9859 - [M3c] XGB-based Cost Model
- #9836 - [M4a] Add EvolutionarySearch Search Strategy
- #9799 - [M4a] Add ReplayFunc Search Strategy
- #9789 - [M3c] Update TuneContext, TaskScheduler & Search Strategy Design
- #9780 - [M3c] Add More Measure Callbacks
- #9761 - [M4a] Add ScheduleRule class & PostOrderApply space generator
- #9760 - [M3c] Random Feature Extractor
MicroTVM
- #11741 - Refactor RVM scripts and fix DNS network issue
- #11472 - [ARM]Add tests for arm schedules
- #11634 - Update pyproject to python3.7
- Zephyr support - #11650
- RPC - #11227, #10967
Relay
- #11825 - [relay][pass] add split infer shape with convert op layout pass
- #11674 - Finish implementations of WithFields
- #11481 - IndexedGraph improvements in preparation for Collage
- #11432 - Plumb external codegen target via Target.current()
- #11494 - [Pass] Add MaxPool, AvgPool to FoldExplicitPadding
- #11183 - Add unidirectional sequence lstm
- #11442 - Add 'static_library' runtime::Module
- #11413 - [Topi]Support for FP16 ERF on CPU.
- #11382 - Finish support for list-of-targets
- #11386 - [Tests] Replace the Relay interpreter with the VM in the op tests
- #11224 - Support i16, f16 scalars in Relay text
- #11337 - Fix eltwise alter op layout for broadcast axis
- #11199 - Flexible shape dispatch transformation
- #11173 - Support 'external codegen targets'.
- #10996 - Add FlattenAtrousConv transformation
- #10871 - [CUDNN] Add cuDNN as a Relay partitioning target (BYOC)
- #10787 - [Pass][Bugfix] Disable re-use of non-flat buffers in StorageRewrite.
- #10378 - [FQ2I] Add leaky relu to FQ2I
- #10400 - RelayViz graphviz renderer
- #10352 - [VIRTUALDEVICE] Change syntax for device planning and store parameter virtual devices in virtual_device_ field
- #10310 - [ARM_CPU] Conv2d int8 intrinsic for cortex-A72
- #10085 - RelayViz interface and terminal ast-dump
- #10239 - Add a conversion of individual operations in FQ2I pass.
- #10236 - [Refactor] Clean up type relations that are declared as template for no reason
- #10156 - Fix broadcast InferCorrectLayout
- #10026 - [VM] Relay VM memory liveness/lifetime analysis
- #10089 - [Pass] Add a relay pass to extract fake quantized ops
- #9690 - Change function constructors to WithFields
- #10069 - [DefuseOps pass] bug fix: To support function body types other…
- #9954 - Add conv2d_backward_weight op (without topi)
- #9838 - [FoldScaleAxis] Support dense and bias_add op in fold scale axis
- #9816 - Add sliding_window operator
- #9874 - Add a JSON converter for 0.7 -> 0.8 and 0.8 -> 0.9
- #9735 - [AMP][Pass][Typing] Add faster type inference
- #9723 - [Frontend] Add Span filling for frontends to Relay
- #9749 - Fix invalid shape function for "copy" operator
- #9759 - s/SEScope/VirtualDevice/g
- #9734 - Support large constants saved/loaded outside of VM executable
- #9613 - Re-run PlanDevices after LowerTE to flow new memory scope constraints.
- #9693 - PlanDevices supports 'free' on_device annotations
- #9641 - [AST] Add virtual_device as a first class field in Relay
- #9483 - Switch the VM to use the LowerTE pass instead of TECompiler::{Lower,LowerShapeFunc}.
- #9569 - WithFields method for Call, Function, Var, TupleGetItem, If, Let, RefCreate, RefRead, RefWrite, Match, and Clause
- #9533 - WithFields for Tuples
- #9550 - Prepare for switching VM to LowerTEPass.
- #9542 - Prepare DeadCodeElimination for running post LowerTEPass/ManifestAlloc.
- #9352 - [TVMC]Introduce executor and runtime parameters
- #9457 - Add the Arm(R) Ethos(TM)-U NPU identity operator
- #9326 - Switch PlanDevices pass to be w.r.t. SEScopes instead of DLDeviceTypes.
- QNN - #11228, #10718, #10086, #10053, #9637, #9982
Runtime
- #11334 - [PipelineExecutor] Add graph manually splitting logic into the unit test.
- #11133 - [PipelineExecutor] Refactor PipelineExecutor.py and Add cross compile support for pipeline executor.
- #11172 - Move WrapTimeEvaluator from RPC to profiling, NFC
- #10990 - [PipelineExecutor]Add forwarding queue logic for set input.
- #10953 - [Vulkan] Add RGP support to TVM for vulkan device
- #10723 - [PipelineExecutor] Getting the asynchronous output
- #10283 - AOTExecutor implementation and c target code-generator
- #9802 - [ThreadPool]Refactor affinity function and support CPU affinity list setting.
- #10234 - [Pipeline Executor] multiple threads management and the data forwarding notification mechanism.
- #10326 - Improved log information with function signature
- #10032 - [PackedFunc] Bring PackedFunc into TVM Object System
- #10082 - [PipelineExecutor] Pipeline Executor Sequential execution
- #10010 - [PipelineExecutor] Add Pipeline Executor Interface
- #9846 - [Pipeline executor] Global parameters group name and runtime modules parameters map.
- #9889 - [GraphExecutor] Add API get_input_info to graph_executor
- #9751 - [Pipeline Executor] Add the map logic of global input and subgraph input.
TE
- #11589 - Support schedulable TIR compute definitions in TOPI
- #11341 - Optimized version of concatenation layer
- #10561 - [TECompiler] Decouple TE compute and schedule lowering in ScheduleBuilder
TIR
- #11592 - HoistExpression, generalization of HoistIfThenElse
- #11870 - [Pass] Remove-Weight-Layout-Rewrite-Block
- #11740 - [TIR, analysis] Add GetAutoTensorizeMappingInfo to generate transforms for auto tensorization
- #11585 - Add preserve-unit-iters
- #11677 - Register CUDA WMMA tensor intrinsics
- #11658 - [TIR, CUDA] Add pass to replace global to shared memory copy with cp.async
- #11624 - [Schedule] Allow named block and buffer arguments in Schedule
- #11628 - [PASS] Refactor a couple of TIR passes - BindTarget, AnnotateEntryFunc, Filter, LowerInitBlock
- #11574 - CSE pass : Restrict the equivalence to be decided by a normal form - avoids comparison of terms
- #11575 - Schedule Primitive: Add-Unit-Loop
- #11515 - Add schedule primitive ReIndex
- #11524 - [Arith] Additional Simplifications Inside Conditionals
- #11485 - Add schedule primitive TransformBlockLayout
- #11495 - [Software pipeline] Fix hardcoded index in access_ptr rewriting, add a GPU test with depth 4
- #11269 - [Schedule] Transform layout quality of life
- #11355 - Support tensorization using ldmatrix + MMA
- #11289 - [Schedule] Allowed typing.Tuple in tir.schedule._type_checker
- #11317 - Support affine expressions as indices in reverse compute inline
- #11235 - [Arith] Implemented padded inverses in IndexMap
- #11238 - [ROOFLINE] Calculate roofline from existing TIR PrimFunc
- #11225 - Add schedule primitive SetAxisSeparator
- #11110 - Get read/write access precisely for opaque access.
- #11106 - Enhance software pipeline validation and fix predicate of epilogue
- #10843 - StmtFunctor RenewDefs
- #11075 - Add function to tile a block according to a given tensor intrinsic
- #11050 - Utility function to decide loop mapping for auto tensorization
- #11009 - [ROCM] DP4A intrinsic support for TE/TIR
- #10925 - VNNI and ARM dot product intrinsic for tensorization
- #10887 - [Schedule] Relax reorder primitive's affine binding check
- #10732 - [Analysis] Add SuggestIndexMap for layout rewriting
- #10538 - [Schedule] Transform layout
- #10638 - Change the behavior of read/write region analysis for reduction blocks.
- #10705 - Use local complete block and local reduction block to identify compact dataflow
- #10671 - Tuple Reduction Support in CreatePrimFunc
- #9727 - [TE]Implement layout transformations, non-flat memory buffers
- #10405 - [TensorIR] Update VerifyGPU
- #10401 - [TensorIR] Renormalize split pattern
- #10112 - [TIR, Relay] improve bfloat16 support
- #8509 - Tir constants integration into compilation pipeline
- #9996 - add support for multi-blocking layout and their transformation
- #10066 - Add software pipelining
- #10207 - Support sub warp reduction for CUDA target.
- #9482 - Implementation of Common Subexpression Elimination for TIR
- #9527 - Allow compute_at create block predicate for non-trivial bounds and support floordiv pattern
- #10158 - [Schedule] Update compact_dataflow constraint
- #9871 - [Schedule] Blockize and Tensorize
- #10016 - [BugFix]Fix cross-thread reduction when single reduction loop with predicate
- #9880 - Encode conditional accesses info into block read/write regions
- #9699 - Affine utility support iter lowerbound and diagnostics
- #9742 - [Schedule] Add Annotate/Unannotate primitive
- #9738 - [TensorIR] Primitive "SetScope"
- #9743 - [Schedule] Analysis functions to check if compute_inline and com…
- #9689 - Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs
- #9559 - [TensorIR][UX] Type annotation-based runtime type checking
- #9444 - Add a 'rolling_buffer' scheduling primitive
- #9360 - [TensorIR] Cross-Thread Reduction
TOPI
- #11531 - TE implementation of LSTM using scan
- #11161 - Add Adreno GPU target and topi supporting textures with dynamically allocated textures
- #10332 - VNNI support for batch matmul
- #9873 - Add support for grouped conv3d
- #10230 - VNNI support for int8 dense
- #10098 - [Op]5 ops can accept unsigned integers as indices
- #9832 - Support grouped conv1d
- #9694 - Add generic batch norm
- #9233 - Cortex-M DSP support
TVMScript
- #11308 - Represent ramp as index slice
- #10099 - Support T.buffer_decl using data pointer from Let/Allocate
- #9680 - Improve printer for TIR syntax sugar
- #9492 - Add syntax sugar for T.handle and T.match_buffer
- #9620 - Add for loop syntax sugar
- #9543 - Misc error message improvements
- #9505 - [Fix] Add type hints for more uncovered cases
USMP
- #11015 - U3 use case
- #10189 - Adding support for U1 usecase for constant pools
- #10785 - Adding support for U4 usecase
- #10193 - adding support for U2 and U3 usecases
- #10005 - Add performance characteristics to PoolInfo
- #9565 - [TIR]Integrating USMP to AoT Executor
- #9704 - Hill Climb allocator
- #9418 - [TIR]adding the pass to convert to pool offsets
- #9649 - [TIR]Augmenting the algo interface with memory pressure
- #9214 - [TIR]Greedy memory planning algorithm
- #8468 - [TIR]Added buffer info extraction pass
microNPU
- #11468 - Optimize separate padding operation for conv2d
- #11453 - Add transform matrices and part matcher to identity op
- #11410 - add E2E tests with cascader without striping
- #11288 - Expose compute cycle annotations to TIR lowering
- #10959 - Add a pass to reorder copy and compute nodes
- #10509 - Add various options to the cascader
- #11263 - Adding an option to enable striping
- #10251 - Add support for conv2d running on two cores on U65
- #10862 - Integrate the cascader
- #10344 - Integrate rolling buffers in Arm(R) Ethos(TM)-U
- #10824 - Some housekeeping in the test_ethosu folder
- #10763 - Tweak a layout transform matrix
- #10725 - Add a pass to move allocate nodes to the outer scope
- #10695 - Determine block configs using the cascader
- #10599 - Refactor Relay to TIR hook
- #10508 - Improve cascader memory transfer estimates
- #10345 - Add support for TFLite FULLY_CONNECTED
- #10254 - Introduce a pass to remove redundant identity operations
- #10062 - [5] Convert Proposals to te.Schedules
- #9959 - [4] Add the cascader Proposal generator
- #10022 - enable USMP
- #10127 - Add support for LeakyReLU
- #10004 - Add FreeRTOS variant of NPU demo
- #10060 - Refactor type inference data type checks
- #9960 - Add support for pack and unpack
- #10143 - Fix layout assignment in layout optimizer pass
- #9890 - [3] Plan generation for the cascader
- #9855 - Add support for transpose convolution
- #9841 - Add support for nearest neighbor and bilinear upsampling
- #9951 - Removing constant args from PrimFunc
- #9929 - Refactor base address determination to codegen
- #9910 - Add support for requantize
- #9831 - Move optimization passes to be a module pass and ensure they are running
- #9785 - [2d] Add more Part matchers to cascader
- #9778 - [2c] Add performance modelling to cascader
- #9471 - [2b] Create CascaderGraphs from TE graphs
- #9469 - [2a] Add CascaderGraph for cascading analysis
- #9621 - Add support for SPLIT and SPLIT_V
- #9508 - Update Conv2D Tests to Use TF API to Gen Test Cases
- #9627 - Add support for SIGMOID
- #9589 - Add support for TFLite concatenate
- #9623 - Refactor codegen tests
- #9561 - Add NHWC -> NHCWB16 layout transformation pass
- #9576 - Mean legalization support
- #9597 - Move the compilation to use Target Hooks.
- #9458 - [1] Add affine analysis structures for the cascader
- #9547 - Add the infrastructure for lookup table and TANH
- #9521 - Support binary elementwise with non-4D inputs
- #9560 - Fix incorrectly calculated stride when converting NHWC to NHCWB16
- #9530 - Add unary elementwise operator infrastructure with ABS
- #9514 - Adding rounding mode attribute to operators
- #9515 - Allow constants to be given as input to an operator
microTVM
- #11250 - [ARM] Add Relay tests for conv2d registered schedules
- #11232 - [rpc] Implemented rpc logging
- #11044 - Add support for host-driven AoT Executor
- #11043 - Better version handling for Arduino
- #10555 - Enable micro tvmc tutorial testing in CI
- #10194 - [RVM] Add scripts for automated build and testing
- #10144 - TVMCon 2021 Zephyr Demo with CMSIS-NN
- #10024 - [tvmc] Add TVMC Micro tutorial for Zephyr
- #9684 - Fix zephyr/test_zephyr_armv7m test
- #9584 - [TVMC] Add TVMC test for Arduino and Zephyr
- #9526 - Add minimal forwarding RPC server for host driven python execution on Hexagon
- Zephyr support - #11362, #10138
Misc
- #11465 - Add cooldown interval logic for the profiling functional
- #11888 - [LLVM] Include LLVM headers in files that use them, not in llvm_common.h
- #11646 - [Arith] Simplification of ceil, log2, and left_shift
- #11464 - [MLF] Add support for multiple modules in Model Library Format
- #11632 - [AutoTVM][Autoscheduler] Default build funcs inherit PassContext
- #11543 - [OpenCL] Implement conv2d_winograd algorithm for Adreno
- #11287 - [Arith] Merge surjective/non-surjective iter mapping detections
- #11393 - Add utility to replace direct call to pytest.main
- #11252 - [ROOFLINE] Roofline analysis over RPC
- #11000 - [Graph Debugger] Expose way to benchmark individual nodes.
- #10794 - bump PyTorch version to 1.11
- #10821 - [REFACTOR] Remove legacy nnvm folder
- #10798 - [Arith] Remove diagnostic ctx argument from DetectIterMap
- #10567 - [Refactor] Reduced repetition in CodeGenLLVM's buffer access
- #10455 - [AUTO_SCHEDULER] Add feature extraction directly from PrimFunc
- #7401 - RFC: initial stab at TorchScript fallback
- #10391 - [vulkan] Add integer dot product (4xint8, 4xuint8) tensorization for the vulkan SPIR-V target.
- #10293 - [VirtualMachine] new method allowing to set one input tensor by its index or name
- #10191 - Generate correct output tensor names in C Interface API
- #9276 - Parameterize test_link_params
- #9808 - [Rust] Update Rust bindings
- #9553 - [PROFILING] Add ability to profile a single function_profiling
- #9611 - [CMAKE] Automatically detect newly added source files
- #9544 - [Target] enable -arch=sm_xx for assigning cuda target arch and deprecate autotvm.measure.set_cuda_target_arch api
- Profiler - #11530, #11066
- Docs - #10921, #11403, #10774, #10912, #9633, #9906, #9534, #9307, #9654, #9580
- Android - #11241
- ETHOSN - #11261, #10486, #10018, #9596
- TVMC - #11012, #10962, #10722, #9817, #9529, #9229