
Rework iree-compile pipeline for ONNX #18803

Closed
zjgarvey opened this issue Oct 17, 2024 · 11 comments
Labels: enhancement ➕ New feature or request

Comments

@zjgarvey (Contributor)

Request description

The current ONNX-specific intake logic in PluginRegistration.cpp is insufficient for iree-compile: this single pass is not enough to get the torch IR into an acceptable state to hand off to the torch-to-iree pipeline that follows.

I'd like to request that we do the following.

Use the following passes when lowering torch-onnx-to-torch (a combined invocation is sketched after this list):

  • "--convert-torch-onnx-to-torch"
  • take the relevant bits out of "--torch-lower-to-backend-contract". In particular, "--torch-shape-refinement-pipeline" and "--torch-decompose-complex-ops".
  • "--torch-scalarize-shapes": This pass de-tensorizes shape computations, which are horrifically profuse in models coming from ONNX.
  • Re-apply "--torch-shape-refinement-pipeline", since shape scalarization might have unblocked shape inference so that it can do what it is supposed to. Yes, it is unfortunately necessary to do shape inference both before and after shape scalarization: for some models that fail to get any meaningful shape information from ONNX's shape inference tools, cycling between shape inference and scalarization seems to be the only reliable way to untangle a mess of frontend IR.
  • On the off-chance that the return type of a func op is still a static info cast op after shape inference, "--torch-refine-public-return" and "--canonicalize" can clean up the boundary.
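
For concreteness, here is a rough sketch of that sequence as a single torch-mlir-opt invocation. The file names are placeholders, and the exact flag spellings should be checked against the torch-mlir build in use:

    # Sketch only: lower ONNX-flavored torch IR toward the torch backend contract.
    # Input/output file names are hypothetical.
    torch-mlir-opt model.onnx.torch.mlir \
      --convert-torch-onnx-to-torch \
      --torch-shape-refinement-pipeline \
      --torch-decompose-complex-ops \
      --torch-scalarize-shapes \
      --torch-shape-refinement-pipeline \
      --torch-refine-public-return \
      --canonicalize \
      -o model.torch.mlir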

Remove the following pass from torch-to-iree pipeline:

  • "--torch-decompose-complex-ops". Why remove this here? Well, since this pass pipeline is already missing shape and dtype refinement passes, I'm going to run under the assumption that any "torch" input IR for the torch-to-IREE pipeline satisfies the loosely defined torch-backend contract. Any pipeline like "-to-(torch-)backend-" already includes decompose complex ops. If we actually do need this for the ONNX side because some cycling between shape scalarization and inference unblocks decomposition patterns we need, then it should be included in the onnx-to-torch pipeline.

Recommended:

I'd recommend rolling the "torch-onnx-to-torch-backend passes" into a pass pipeline in torch-mlir; it would then be much easier to include that pipeline in the iree-compile pipeline for ONNX models. Maybe call it --torch-onnx-to-torch-backend-pipeline.
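
If such a pipeline existed, the ONNX input conversion could be exercised locally with a single flag. Note that the flag below is only the proposed name and does not exist yet:

    # Hypothetical: assumes the proposed pipeline has been registered in
    # torch-mlir under the name suggested above.
    torch-mlir-opt model.onnx.torch.mlir \
      --torch-onnx-to-torch-backend-pipeline \
      -o model.torch.mlir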

If reasonable, it would be a good idea to re-apply cycles of shape refinement, shape scalarization, and possibly decompose-complex-ops until the IR reaches a fixed state (or a maximum number of iterations). If doing this, please disable the unflatten.int and flatten.using_ints decompositions, since these might create a cycle with the view op.
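
As a sketch of the cycling idea (purely illustrative; a real implementation would live inside the pass pipeline rather than in a shell loop, and would also need the flatten/unflatten decompositions disabled as noted above):

    # Hypothetical fixed-point driver: rerun refinement + scalarization until the
    # IR stops changing or a maximum iteration count is reached.
    cp model.torch.mlir prev.mlir
    for i in 1 2 3 4 5; do
      torch-mlir-opt prev.mlir \
        --torch-shape-refinement-pipeline \
        --torch-scalarize-shapes \
        --torch-decompose-complex-ops \
        -o next.mlir
      if cmp -s prev.mlir next.mlir; then break; fi
      mv next.mlir prev.mlir
    done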

What component(s) does this issue relate to?

Frontends, Compiler, MLIR

Additional context

We are currently pre-applying torch-mlir passes in our test suite before handing off linalg IR to iree-compile. Making the suggested changes would mean we could just run iree-compile on the ONNX IR and get the same results (or better, since torch-to-iree might handle the lowering to linalg more carefully).

The removal of the additional pre-compile processing could also resolve some compilation issues. For example, the affine.apply ops that seem to be relevant to issue #18386 (comment) evidently are not created when using torch-to-iree, but are created when using torch-backend-to-linalg-on-tensors-backend-pipeline.

zjgarvey added the enhancement ➕ New feature or request label Oct 17, 2024
@ScottTodd (Member)

> We are currently pre-applying torch-mlir passes in our test suite before handing off linalg IR to iree-compile.

Oof. That's not good. Test suites should be using the same tools that we recommend to users. If tests fail when using those tools, the tools should be fixed instead of changing the tests.

> The current ONNX-specific intake logic in PluginRegistration.cpp is insufficient for iree-compile: this single pass is not enough to get the torch IR into an acceptable state to hand off to the torch-to-iree pipeline that follows.

Generally:

  1. iree-import-onnx should handle .onnx --> .mlir that iree-compile can use.
  2. iree-compile with --iree-input-type=onnx (or the default auto) should be able to compile any .mlir produced by iree-import-onnx for all backends (see the sketch after this list).
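
As a concrete sketch of that flow (the target backend and file names are illustrative):

    # Illustrative two-step flow; backend choice and file names are placeholders.
    iree-import-onnx model.onnx -o model.mlir
    iree-compile model.mlir \
      --iree-input-type=onnx \
      --iree-hal-target-backends=llvm-cpu \
      -o model.vmfb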

We can put options in the torch (onnx) plugin if there are decisions for users to make, like the existing --iree-torch-use-strict-symbolic-shapes option.

If the passes and pipelines in torch-mlir change, IREE should adapt as well. See also #17021.


I don't have specific feedback about the proposed concrete changes. They all look pretty standard for converting from one representation to another.

@vivekkhandelwal1 (Member)

I have added the WIP patch here: llvm/torch-mlir#3801 for adding the torch-onnx-to-torch-backend pipeline in Torch-MLIR.

@zjgarvey (Contributor, Author)

> Oof. That's not good. Test suites should be using the same tools that we recommend to users. If tests fail when using those tools, the tools should be fixed instead of changing the tests.

I only disagree based on my incompetence. It's been some trial and error to figure out what passes we actually need, and where to put them in the pipeline. It would have taken several iterations of sloppy IREE commits to converge on something meaningful. Instead, we tried some different combinations of pre-processing passes in the test suite, and now have a good understanding of what passes are actually necessary to apply to onnx-importer-generated IR. It's time to upstream the correct pipeline.

@ScottTodd (Member)

Fair points. It's definitely the end state that matters, not necessarily the journey. I think there are some process / development workflow changes we could make that would make that convergence less sloppy. Manually specifying pass pipelines and adding extra flags is a crutch that we've been leaning on in many places for far too long. The tools need to work (correctly and with high performance) out of the box.

@vivekkhandelwal1 (Member)

For the first proposed change, I have added a PR here: llvm/torch-mlir#3801

CC: @zjgarvey @ScottTodd

@ScottTodd (Member)

ScottTodd commented Dec 6, 2024

What's the status of this issue? I'm comparing test coverage between https://github.com/nod-ai/SHARK-TestSuite/tree/main/alt_e2eshark/onnx_tests and https://github.com/iree-org/iree-test-suites/tree/main/onnx_models and I'm wondering if the higher test pass rate in SHARK-TestSuite is due to some of the extra torch-mlir-opt passes being run here: https://github.com/nod-ai/SHARK-TestSuite/blob/22a000a685ede0764b51d234ccad6b973e2724d9/alt_e2eshark/e2e_testing/test_configs/onnxconfig.py#L148-L223. Can that test suite be switched to using the user-facing tools (iree-import-onnx) now?

zjgarvey closed this as completed Dec 7, 2024
@zjgarvey (Contributor, Author)

zjgarvey commented Dec 7, 2024

> What's the status of this issue? I'm comparing test coverage between https://github.com/nod-ai/SHARK-TestSuite/tree/main/alt_e2eshark/onnx_tests and https://github.com/iree-org/iree-test-suites/tree/main/onnx_models and I'm wondering if the higher test pass rate in SHARK-TestSuite is due to some of the extra torch-mlir-opt passes being run here: https://github.com/nod-ai/SHARK-TestSuite/blob/22a000a685ede0764b51d234ccad6b973e2724d9/alt_e2eshark/e2e_testing/test_configs/onnxconfig.py#L148-L223. Can that test suite be switched to using the user-facing tools (iree-import-onnx) now?

We aren't running torch-mlir preprocessing in the CI anymore, so we can close this issue.

@ScottTodd (Member)

Are you sure about that? The code I linked is using torch-mlir directly in alt_e2eshark/, while I only see references to iree-import-onnx in e2eshark/: https://github.com/nod-ai/SHARK-TestSuite/blob/22a000a685ede0764b51d234ccad6b973e2724d9/e2eshark/run.py#L197

@zjgarvey (Contributor, Author)

zjgarvey commented Dec 7, 2024

Yeah, the preprocessing stage is only used if the flag --torchtolinalg is passed. We turned that off in the CI after updating the torch-to-iree pipeline. We will soon use iree-import-onnx when adding some larger models to the CI. So right now the only place torch-mlir is being used in the CI is the import phase.

@ScottTodd (Member)

Ah. I think the import phase should be moved to iree-import-onnx too. Users (and test suites for IREE) shouldn't be using https://github.com/llvm/torch-mlir directly at all.

I specifically was looking at this test failure earlier and was wondering if the torch-mlir import was affected by the same bug:

https://github.com/iree-org/iree-test-suites/actions/runs/12202034018/job/34041791439#step:6:180

ERROR    onnx_models.conftest:conftest.py:140 artifacts/model_zoo/validated/vision/classification/rcnn-ilsvrc13-9.mlir:13:82: error: expected '>'
    %10 = torch.operator "onnx.Constant"() {torch.onnx.value = dense_resource<_fc-rcnn_b_0> : tensor<200xf32>} : () -> !torch.vtensor<[200],f32>

(maybe some invalid symbol in a resource name?)

@zjgarvey (Contributor, Author)

zjgarvey commented Dec 9, 2024

Yeah, this "-" issue should get fixed by llvm/torch-mlir#3901.

When that lands and gets bumped into IREE, the importer there should be fixed too.
