
Conversation

@sdvillal commented Nov 12, 2025

Summary

Adds a modern conda environment following best practices to improve the quality of life of conda users.

The environment is self-contained, including a sane toolchain to build extensions fully compatible with the rest of the dependencies, and with batteries included (inference, bioinformatics, fast kernels, dev dependencies).

We maintain a pixi workspace and an automatically generated conda environment for non-pixi users.

We still need to iron out four known problems (see overcomments in pixi.toml and upcoming issues) and add documentation.

From here, creating a conda-forge openfold3 package and a bioconda openfold3-extra package should be simple enough.

Changes

Related Issues

TBC

Testing

The current environment passes all tests and produces sensible predictions.

Other Notes

This is exploratory at the moment. Will clean up the commit history or open a clean PR when we are done.

@Emrys-Merlin

Thank you for the draft!

@Emrys-Merlin

DeepSpeed accepted our first upstream fix regarding the ninja detection (deepspeedai/DeepSpeed#7687). Once a new version is released, this should allow us to get rid of the PyPI ninja dependency. Of course, this fix will only come into play if we decide against the vendoring approach.
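For context, the distinction matters because the pypi ninja package is a Python shim bundling the binary, while the conda-forge package ships just the binary. A quick way to check which one an environment picked up (just a sketch, not tied to any particular environment):

# The ninja binary actually on PATH
python -c "import shutil; print(shutil.which('ninja'))"
# This import only succeeds if the pypi ninja shim is installed
python -c "import ninja; print(ninja.__file__)"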

@sdvillal (Author) commented Dec 1, 2025

As of 2025/12/01, these packages are still installed from pypi after installing openfold3 in devel/editable mode:

To investigate

  • aria2, both from CF (v 1.37.0) and pypi (v 0.0.1b0)

Proposed solution: remove aria2 from the pypi dependencies, as it is currently unused in the OF3 codebase. The pypi package is an old convenience and should in general not be used to install aria2; it is not even that much of a convenience.

Because of cuequivariance_ops_torch_cu12

  • cuequivariance_ops_torch_cu12
  • cuequivariance_ops_cu12
  • nvidia_cublas_cu12 both from CF (libcublas 12.9.1.4) and pypi (v12.9.1.4)

These should at the very least be aligned with the CF version, but it is likely best to just install them all from pypi until we understand how to deal with the license. The key question is what to do with libcublas; maybe we should add synonyms to parselmouth in pixi, although I am not 100% sure these two packages are fully binary compatible.

Currently the biggest blocker to having a conda package for these is their LICENSE.

See also: NVIDIA/cuEquivariance#218

It could be interesting to see if openequivariance could be a viable alternative:
https://github.com/PASSIONLab/OpenEquivariance

Because of mkl

  • mkl both from CF (2025.3.0) and pypi (2025.3.0)
  • intel_openmp
  • onemkl_license
  • tbb both from CF (2022.3.0) and pypi (2022.3.0)
  • tcmlib
  • umf
  • intel_cmplr_lib_url

Proposed solution: remove mkl from the pypi dependencies, as it is actually unused (pytorch links it statically; numpy and scipy are not built against it and do not dynamically dispatch).
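For reference, a quick way to see which packages end up coming from pypi in a given environment (the environment name below is only an example, and this assumes pixi list keeps reporting the package kind in its output):

# List all packages in an environment and keep only the pypi-sourced entries
pixi list -e openfold3-cuda12 | grep -i pypi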

@jandom (Collaborator) commented Dec 17, 2025

Hi there @sdvillal, thanks for this – we're currently working on a bunch of related issues:
#70
#75

Hopefully we can combine it all together with your PR after the holidays?

@sdvillal (Author) commented Dec 23, 2025

> Hi there @sdvillal, thanks for this – we're currently working on a bunch of related issues: #70 #75
>
> Hopefully we can combine it all together with your PR after the holidays?

Also:
#79
sdvillal#1 (now merged)

@sdvillal (Author)

Coming back to this after the end of year "hiatus". Current state and TODOs:

  • Need to catch up with changes and PRs upstream.
  • Fully isolating/vendoring the evoformer extension, to completely get rid of the deepspeed dependency, is proving very hairy, so we will likely need to open a PR upstream to also fix CUTLASS detection.
  • Take care of all the open issues above.

all tests pass, predictions seem to be correct
corresponds to a modernized conda environment following best practices
Comments
Overcommenting issues
incomplete, we might not need the native sources (from upstream commit df59f203f40c8a292dd019ae68c9e6c88f107026)
Use vendored deepspeed in the attention primitives
@sdvillal (Author) commented Jan 18, 2026

Slowly cooking, hopefully this will taste good.

While we still need to take care of multiple loose ends, everything is usable. So feel free to try it @jnwei @jandom @Emrys-Merlin @dwhswenson. Feedback, suggestions, and bug reports are very welcome! Happy to meet again sometime soon if useful.

Try

Install pixi (you need >=0.63.1), then:

# Clone or switch to the branch
git clone -b modernize-conda-environment https://github.com/sdvillal/openfold-3.git
cd openfold-3

# We could instruct pixi not to update the environment
# export PIXI_FROZEN=1

# See what is in store
pixi info

# List dependencies in all environments
pixi list

# Run tests in the CPU environment (which is also the default and works everywhere)...
pixi run -e openfold3-cpu test
# ...this will save the test log in test-results

# Run openfold3 using CUDA12
pixi run -e openfold3-cuda12 run_openfold --help

# Enter a shell with an activated environment
pixi shell -e openfold3-cuda13-pypi
run_openfold --help
# then exit
exit

# Export to good old conda environment yamls (I have to test these)
pixi run export-conda openfold3-cuda12 linux-aarch64
# ...or export all in bulk
pixi run export-conda-all
# ...files will be saved to the environments directory

# Same but generating conda-lock environments
pixi run export-conda-lock openfold3-cuda12
# ...or export all in bulk
pixi run export-conda-lock-all
# ...files will be saved to the environments directory
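For non-pixi users, the exported files should then be consumable with plain conda or mamba. Untested sketch; the exact file name under the environments directory is an assumption:

# Create and activate a classic conda environment from an exported yaml
conda env create -f environments/openfold3-cuda12.yml -n openfold3-cuda12
conda activate openfold3-cuda12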

Status

  • Broad support. The provided environments can run openfold-3 on linux-64, linux-aarch64, osx-64 and osx-arm64, with CUDA12 and CUDA13, from Ampere to Blackwell. All runnable tests pass, and predictions seem correct in my tests on a subset of contexts:

    • linux-64 x {CPU, CUDA12-A100, CUDA13-A100, CUDA12-B300, CUDA13-B300}
    • osx-arm64 x CPU&Metal (sounds frivolous, but debugging inference on my mac felt productive)
  • Best practices and isolation for one-command setup. We provide two types of environments:

    • conda-centric (recommended, minimizes pypi dependencies): best practice, getting everything from conda-forge and ensuring binary and toolchain compatibility.
    • pypi-centric (discouraged, minimizes conda dependencies): python, the CUDA libraries and the compiler toolchain come from conda, everything else from pypi, to enable using OF3 against packages not yet in conda-forge. No more pain with compiling.
  • lockfiles: current pixi.lock files represent a working environment

  • pytorch-gpu in conda: The pure-conda CUDA13 environment is waiting for pytorch+cuda13 to be supported in CF. In this scenario we currently get pytorch from pypi and everything else from conda. The best strategy here is patience: it feels like CF will soon have pytorch 2.9 and 2.10 with CUDA 13 support, and then a cascade of updates in dependent libraries like deepspeed will mature the ecosystem.

  • DeepSpeed: Vendoring will not be necessary.

    • We managed to push forward the publishing of the ninja detection patch for deepspeed on pypi and conda-forge.
    • The fixed CUTLASS detection logic is now merged into main and awaiting a new version to be cut and published on CF and pypi. Until then, we can keep the vendored version here. We might anyway want to carve out a package with a (hopefully updated) version of the Evoformer attention. The broader and really useful question is, as proposed by Jennifer, whether we should remove deepspeed's EvoformerAttention altogether; I would be all for it once we carefully evaluate how to stay on top of the efficiency question in combination with providing pre-compiled extensions. That would clean up multiple code branches and workarounds.
  • cuEquivariance: We can still only install it from pypi. We have received no answer so far to our request for a conda package, or at least a change or exception for the non-redistributable part of the closed-source kernels' license. To my mind, we should consider whether there are viable fully open-source alternatives.

  • Biotite is at the moment the only missing conda dependency for linux-aarch64 (it can be source-installed, but that is not ideal), and a source of complexity in the environment. As ARM is becoming more and more common, I started work on also publishing precompiled packages for aarch64 ([1], [2]) and fixing some problems with newer numpy versions. We should consider eventually upgrading it.

  • Other improvements: We have changed pyproject.toml, added some example Dockerfiles (which I feel we should remove before merging) and touched code here and there to make things work in all contexts. Happy to review together.

  • An OF3 conda package?: It should be quite easy to do now. And yes, it should be possible to build precompiled packages.

  • Merging soon? The PR is up to speed with main. It is quite large by now (there are tons of moving pieces, and the yak-shaving has been and remains of epic proportions), but we could consider merging it already and taking smaller improvements one at a time. We could also split it, but I do not think it is worth it; rather, we should just squash and merge.

  • Pixi: we need to spread the word; pixi is so frequently in the love-region (of the love/hate relationship one will always have with a package manager) that it would be selfish not to proselytize.

  • A maturing ecosystem: as is always the case, it takes time until things become easy when new hardware is released, so for some things relating to the newer Blackwell GPUs or CUDA 13 we just need to wait a little bit.

Some more loose ends

  • Build isolation (extensions get built in the environment directory rather than in default locations in the user's home) is handled by setting environment variables, which might not play well with advanced users' expectations. Document this (see the sketch after this list).

  • Updating biotite and cuEquivariance breaks tests (as documented in pixi.toml). But we will eventually need to update them, as they bring interesting features, fixes and support for newer libraries.

  • OpenFold-3 gets installed in editable mode in all environments at the moment. We should make that optional.

  • We should provide pixi-pack tasks to allow sending full environments around.

  • We need to document better - one should not expect users to read pixi.toml.

  • Running inference on osx with default parameters is not possible: OF3 tries to use the deepspeed evoformer attention and still asks for CUDA devices; these things should have better autodetection of actual capabilities at runtime. So it is not the most user-friendly at the moment, but it works by explicitly disabling deepspeed in the runner config.

  • We should benchmark a bit more systematically the differences between all these setups and different compilation flags. Is running OF3 in a conda environment more efficient, less efficient or the same as running in another environment? Should we specialise some dependencies with aggressive compilation optimizations?

  • We should test the exported conda environments. We also need to push for needed pixi and pixi-to-conda-lock features (support for exporting environment variables), but we can work around that for the moment.

  • Remember to remove uv.lock and .python-version, and figure out whether Dockerfile.pixi and Dockerfile.pypi are generally useful.
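Regarding the build-isolation point above, this is the kind of redirection involved. The exact variables set in pixi.toml may differ; the ones below are standard knobs for torch/triton JIT builds, listed only as an illustration (PIXI_PROJECT_ROOT is set by pixi during activation):

# Keep JIT-built extensions and kernel caches inside the project instead of the user's home
export TORCH_EXTENSIONS_DIR="$PIXI_PROJECT_ROOT/.pixi/torch-extensions"  # torch.utils.cpp_extension / deepspeed JIT builds
export TRITON_CACHE_DIR="$PIXI_PROJECT_ROOT/.pixi/triton-cache"          # triton kernel cache
export XDG_CACHE_HOME="$PIXI_PROJECT_ROOT/.pixi/cache"                   # generic caches honoring XDG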

@jandom (Collaborator) commented Jan 18, 2026

@sdvillal legend – this is huge, this is an insane amount of work! Thank you – instant tier-1 contributor status 🏆

Will take a look at this next week, really looking forward to it!

@jandom (Collaborator) commented Jan 18, 2026

Ah, any chance we can remove the vendored code? I think you mentioned it's not needed any more.

@sdvillal (Author)

> Ah, any chance we can remove the vendored code? I think you mentioned it's not needed any more.

Not yet. It depends on how long we want to wait: we need the last changes in deepspeed to make it into a release, and then we should run a quick round of testing. I have just checked and it seems deepspeed has a reasonably fast release cadence (every 2-4 weeks). They replied quickly to our two pull requests too, and the CF people were also fast at merging our package update request. So I feel it should not take long.

I too see potential problems with vendoring: a large diff for a merely temporary solution and the risk of bugs, although we have not experienced any so far. I do not think there is any license concern: deepspeed is Apache-2.0 and every file contains the LICENSE stanza. But if we decide to wait, then I will remove all traces of the deepspeed code from the branch history before merging.

I suggest letting you guys try it a bit, seeing how many problems and suggestions arise, and then deciding if and when we merge. Another idea would be to point people who are struggling with environment setup, or who have access to different untested hardware, to this PR, to get some more eyes on it.

Happy with anything; in any case, I am now out of time to work much on this until around the first week of February.

@jandom (Collaborator) commented Jan 18, 2026

Ah, makes sense – I mostly meant splitting up the auto-generated code into a separate PR. I'm particularly interested in the Blackwell build; I struggled massively with this – I was ultimately able to build, but to support Blackwell fully I needed to go with the latest-and-greatest CUDA, and that wasn't fully supported downstream.

tohtana added a commit to deepspeedai/DeepSpeed that referenced this pull request Jan 18, 2026
`EvoformerAttnBuilder` has some problems that preclude compiling the extension in several scenarios (e.g., an isolated conda environment with a CUDA toolchain, see aqlaboratory/openfold-3#34, or a system without GPU hardware) and breaks some standard DeepSpeed configuration of target capabilities.

*Changes*

- Fix evoformer CUTLASS detection:
  - Allow skipping it, useful when CUTLASS is already correctly set up (e.g., in a conda environment with CUTLASS and the CUDA toolchain).
  - Fix the misleading use of the deprecated nvidia-cutlass pypi package by actually using the provided bindings, while discouraging this route as these bindings are not maintained anymore (NVIDIA/cutlass#2119).
- Fix evoformer compilation when no GPU is present:
  - This is handled correctly and more generally by builder.compute_capability_args.
  - Allow cross-compilation on systems without a GPU.
  - Allow compilation against all available virtual architectures and binary outputs.
  - See e.g., #5308.
- Make all these changes configurable and explicit through documented environment variables.

Tested in all scenarios.

---------

Signed-off-by: Santi Villalba <sdvillal@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
@sdvillal force-pushed the modernize-conda-environment branch from eb192fd to 23a7f0c on January 19, 2026.
@sdvillal (Author)

Note: I force-pushed & merged main.

@sdvillal (Author)

> Ah, makes sense – I mostly meant splitting up the auto-generated code into a separate PR. I'm particularly interested in the Blackwell build; I struggled massively with this – I was ultimately able to build, but to support Blackwell fully I needed to go with the latest-and-greatest CUDA, and that wasn't fully supported downstream.

There is no autogenerated code, but rather a simple copy from deepspeed plus some patching to change the namespace and simplify. I believe it is all well isolated in a few commits, so it would not be hard to take it apart.

From my tests, things work on Blackwell, but they might not be taking full advantage of the new architecture yet. Just give it a try following the instructions above and let me know what fails :-)

@Emrys-Merlin

Thanks Santi :-) regarding:

> Remember to remove uv.lock and .python-version, and figure out whether Dockerfile.pixi and Dockerfile.pypi are generally useful.

I think it should be perfectly fine to remove all of them. Some context on why I created the Dockerfiles originally: I had trouble isolating the separate environments on the same machine. In particular, even the "pure python" environment needed at least one conda-installed package (kalign2). The cleanest idea I could come up with was to isolate everything into docker containers. The added benefit was that for the pure-python environment I could avoid installing the cudatoolkit etc. via conda, because I outsourced that to the base image.

I think it's important to stress that I only created the images to test the environments. These are definitely not production-ready images that should be used for inference runs or the like. If there is any chance of misuse, we should definitely remove the files.

@sdvillal (Author) commented Jan 19, 2026

> Thanks Santi :-) regarding:
>
> > Remember to remove uv.lock and .python-version, and figure out whether Dockerfile.pixi and Dockerfile.pypi are generally useful.
>
> I think it should be perfectly fine to remove all of them. Some context on why I created the Dockerfiles originally: I had trouble isolating the separate environments on the same machine. In particular, even the "pure python" environment needed at least one conda-installed package (kalign2). The cleanest idea I could come up with was to isolate everything into docker containers. The added benefit was that for the pure-python environment I could avoid installing the cudatoolkit etc. via conda, because I outsourced that to the base image.
>
> I think it's important to stress that I only created the images to test the environments. These are definitely not production-ready images that should be used for inference runs or the like. If there is any chance of misuse, we should definitely remove the files.

As an intermediate step, I moved them to their own directory in 9d0e7f1 (which will break building them, but that can be easily fixed by manipulating paths here and there). A key question is whether these conda environments could help build the docker images, to ease our maintenance burden, or whether it is better to depend on NVIDIA's official images, which bring the latest and greatest dependencies that are supposed to work well together.

@sdvillal (Author)

After a new round of yak shaving, things on secondary features are cleaner now:

  • Ensured that we pull the right pytorch in pypi-style environments (6970215). This highlights the need to change pyproject.toml to include variants for all our supported CUDA versions, and to likely move to uv (or maybe wheel variants will soon become standard and adopted); see the sketch after this list. In any case, maybe we should go all in and make uv openfold-3's tool of choice for managing the pypi ecosystem.
  • Fixed a bug upstream in pixi-to-conda-lock (Fix handling of PyPI packages with missing hash, basnijholt/pixi-to-conda-lock#12), and now all conda exports work (2797a12). Note that I am not testing the exported environments at the moment.
  • Hit a weird pixi bug we should report (430e600).
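To illustrate what the CUDA variants mean on the pypi side (purely illustrative; the index URLs and CUDA version are examples, not what pyproject.toml currently pins):

# Pull a CUDA-specific torch wheel explicitly from the PyTorch index with uv...
uv pip install torch --index-url https://download.pytorch.org/whl/cu126
# ...versus the CPU-only variant
uv pip install torch --index-url https://download.pytorch.org/whl/cpu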

@sdvillal marked this pull request as ready for review on January 20, 2026.
@jandom (Collaborator) commented Jan 20, 2026

Hey there @sdvillal – this is fantastic (should really be in all caps). Some ideas and comments:

  1. The pixi cheatsheet you gave in Modernize conda environment #34 (comment) – could that become a pixi/README.md or something similar? I often find that the most insightful comments exist in PRs and would really benefit people if included in the repo code – that's one of them: create DOI for repo #19

  2. I'm really uneasy about vendoring this, especially because we know that both cuEquivariance and deepspeed cause instabilities in training. We don't have the ability to test these changes outside of inference for now.

Here is some further good and bad news from when I checked out your code.

First, some background on my box:

nvidia-smi 
Tue Jan 20 04:16:26 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   37C    P8              3W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
  • CPU tests ran and passed without a problem ✅
  • Inference tests with CUDA12 and CUDA13 both failed, albeit at different points
CUDA12
pixi run -e openfold3-cuda12 \
  run_openfold predict \
    --query_json=./examples/example_inference_inputs/query_ubiquitin.json \
    --output_dir=./output/openfold3-cuda12
/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/torch/cuda/__init__.py:283: UserWarning: 
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (5.0) - (12.0)
    
  warnings.warn(
Traceback (most recent call last):
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/bin/run_openfold", line 10, in <module>
    sys.exit(cli())
             ~~~^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/click/core.py", line 1406, in main
    rv = self.invoke(ctx)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
  File "/home/jandom/workspace/openfold-3/openfold3/run_openfold.py", line 145, in predict
    from openfold3.entry_points.experiment_runner import (
        InferenceExperimentRunner,
    )
  File "/home/jandom/workspace/openfold-3/openfold3/entry_points/experiment_runner.py", line 38, in <module>
    from openfold3.core.data.framework.data_module import (
    ...<3 lines>...
    )
  File "/home/jandom/workspace/openfold-3/openfold3/core/data/framework/__init__.py", line 26, in <module>
    _import_all_py_files_from_dir(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        Path(__file__).parent.parent / Path("framework/single_datasets")
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/jandom/workspace/openfold-3/openfold3/core/data/framework/__init__.py", line 22, in _import_all_py_files_from_dir
    __import__(".".join(list(path.parts[-6:-1]) + [path.parts[-1].split(".")[0]]))
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jandom/workspace/openfold-3/openfold3/core/data/framework/single_datasets/inference.py", line 36, in <module>
    from openfold3.core.data.pipelines.featurization.conformer import (
        featurize_reference_conformers_of3,
    )
  File "/home/jandom/workspace/openfold-3/openfold3/core/data/pipelines/featurization/conformer.py", line 29, in <module>
    from openfold3.core.model.structure.diffusion_module import centre_random_augmentation
  File "/home/jandom/workspace/openfold-3/openfold3/core/model/structure/diffusion_module.py", line 25, in <module>
    from openfold3.core.model.layers.diffusion_conditioning import DiffusionConditioning
  File "/home/jandom/workspace/openfold-3/openfold3/core/model/layers/diffusion_conditioning.py", line 20, in <module>
    from openfold3.core.model.feature_embedders.input_embedders import FourierEmbedding
  File "/home/jandom/workspace/openfold-3/openfold3/core/model/feature_embedders/input_embedders.py", line 28, in <module>
    from openfold3.core.model.layers.sequence_local_atom_attention import (
        AtomAttentionEncoder,
    )
  File "/home/jandom/workspace/openfold-3/openfold3/core/model/layers/sequence_local_atom_attention.py", line 25, in <module>
    from openfold3.core.model.layers.diffusion_transformer import DiffusionTransformer
  File "/home/jandom/workspace/openfold-3/openfold3/core/model/layers/diffusion_transformer.py", line 26, in <module>
    from .attention_pair_bias import AttentionPairBias, CrossAttentionPairBias
  File "/home/jandom/workspace/openfold-3/openfold3/core/model/layers/attention_pair_bias.py", line 23, in <module>
    from openfold3.core.model.primitives import (
    ...<4 lines>...
    )
  File "/home/jandom/workspace/openfold-3/openfold3/core/model/primitives/__init__.py", line 16, in <module>
    from .attention import (
    ...<4 lines>...
    )
  File "/home/jandom/workspace/openfold-3/openfold3/core/model/primitives/attention.py", line 51, in <module>
    from cuequivariance_ops_torch.triangle_attention import (
        CUEQ_TRIATTN_FALLBACK_THRESHOLD,
    )
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/cuequivariance_ops_torch/__init__.py", line 39, in <module>
    from cuequivariance_ops_torch.fused_layer_norm_torch import layer_norm_transpose
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/cuequivariance_ops_torch/fused_layer_norm_torch.py", line 17, in <module>
    from cuequivariance_ops.triton import (
    ...<5 lines>...
    )
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/cuequivariance_ops/triton/__init__.py", line 24, in <module>
    from .tuning_decorator import autotune_aot
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/cuequivariance_ops/triton/tuning_decorator.py", line 17, in <module>
    from .cache_manager import get_cache_manager
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/cuequivariance_ops/triton/cache_manager.py", line 255, in <module>
    cache_manager = CacheManager()
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/cuequivariance_ops/triton/cache_manager.py", line 110, in __init__
    self.gpu_information = get_gpu_information()
                           ~~~~~~~~~~~~~~~~~~~^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/cuequivariance_ops/triton/cache_manager.py", line 66, in get_gpu_information
    power_limit = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/pynvml.py", line 3648, in nvmlDeviceGetPowerManagementLimit
    _nvmlCheckReturn(ret)
    ~~~~~~~~~~~~~~~~^^^^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda12/lib/python3.13/site-packages/pynvml.py", line 1076, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
CUDA13
pixi run -e openfold3-cuda13 \
  run_openfold predict \
    --query_json=./examples/example_inference_inputs/query_ubiquitin.json \
    --output_dir=./output/openfold3-cuda13
/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/torch/cuda/__init__.py:283: UserWarning: 
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    
  warnings.warn(
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
WARNING:openfold3.core.data.tools.colabfold_msa_server:Using output directory: /tmp/of3_colabfold_msas for ColabFold MSAs.
Submitting 1 sequences to the Colabfold MSA server for main MSAs...
COMPLETE: 100%|███████████████████████████████████████| 150/150 [elapsed: 00:02 remaining: 00:00]
/home/jandom/workspace/openfold-3/openfold3/core/data/tools/colabfold_msa_server.py:331: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
  tar_gz.extractall(path)
No complexes found for paired MSA generation. Skipping...
Preprocessing templates...
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/multiprocessing/popen_fork.py:67: DeprecationWarning: This process (pid=236107) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
ERROR: Unexpected segmentation fault encountered in worker.

Traceback (most recent call last):
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1275, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/multiprocessing/queues.py", line 111, in get
    if not self._poll(timeout):
           ~~~~~~~~~~^^^^^^^^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
           ~~~~~~~~~~^^^^^^^^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/multiprocessing/connection.py", line 440, in _poll
    r = wait([self], timeout)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/multiprocessing/connection.py", line 1148, in wait
    ready = selector.select(timeout)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/selectors.py", line 398, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
    _error_if_any_worker_fails()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^
RuntimeError: DataLoader worker (pid 236266) is killed by signal: Segmentation fault. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/bin/run_openfold", line 10, in <module>
    sys.exit(cli())
             ~~~^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/click/core.py", line 1406, in main
    rv = self.invoke(ctx)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
  File "/home/jandom/workspace/openfold-3/openfold3/run_openfold.py", line 175, in predict
    expt_runner.run(query_set)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/jandom/workspace/openfold-3/openfold3/entry_points/experiment_runner.py", line 659, in run
    self.trainer.predict(
    ~~~~~~~~~~~~~~~~~~~~^
        model=self.lightning_module,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        datamodule=self.lightning_data_module,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        return_predictions=False,
        ^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 941, in predict
    return call._call_and_handle_interrupt(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        self,
        ^^^^^
    ...<6 lines>...
        weights_only,
        ^^^^^^^^^^^^^
    )
    ^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/pytorch_lightning/trainer/call.py", line 49, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _predict_impl
    results = self._run(model, ckpt_path=ckpt_path, weights_only=weights_only)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
    results = self._run_stage()
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 1118, in _run_stage
    return self.predict_loop.run()
           ~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/pytorch_lightning/loops/utilities.py", line 179, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/pytorch_lightning/loops/prediction_loop.py", line 122, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
                                       ~~~~^^^^^^^^^^^^^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/pytorch_lightning/loops/fetchers.py", line 134, in __next__
    batch = super().__next__()
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/pytorch_lightning/loops/fetchers.py", line 61, in __next__
    batch = next(self.iterator)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
    data = self._next_data()
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1482, in _next_data
    idx, data = self._get_data()
                ~~~~~~~~~~~~~~^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1444, in _get_data
    success, data = self._try_get_data()
                    ~~~~~~~~~~~~~~~~~~^^
  File "/home/jandom/workspace/openfold-3/.pixi/envs/openfold3-cuda13/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1288, in _try_get_data
    raise RuntimeError(
        f"DataLoader worker (pid(s) {pids_str}) exited unexpectedly"
    ) from e
RuntimeError: DataLoader worker (pid(s) 236266) exited unexpectedly

@sdvillal (Author)

Hey @jandom

  1. Absolutely, let's create good docs, save them to pixi/README.md, and reference them from the main README? Will do at some point when I have a bit of quiet time.

  2. The vendored code will have the same problems as upstream, no more, no less. But let's wait until we have the new deepspeed cut, so that we simplify the payload of this PR enormously.

Indeed, at the moment we are targeting only inference in this PR. When you say there is no way to test these changes in training, do you mean that our test coverage is not useful enough? There will, of course, always be effects that can only be seen in long training runs.

@sdvillal (Author) commented Jan 21, 2026

@jandom DGX Spark, right? That is aarch64 + GB10 (= compute capability 12.1), possibly the least mature combo? Let's make it work and add it to our support matrix :-). Since I do not have access to the hardware, I will need a bit of help; I hope it does not become too hairy!

Two notes first:

  • deepspeed does not officially support aarch64, but I have seen reports that it works
  • pytorch 2.10 should be released today, and I hope it will help with these newer/less common setups

Let's start with the CUDA12 environment; that sounds like cuequivariance/triton not supporting the GPU. It seems to query something that is not queryable on that machine, like the power management limit, whose "Not Supported" you can also see in the nvidia-smi output.

Off the top of my head, we can try:

  • Comment out cuequivariance here and see if that already fixes that error.

  • Bump cuequivariance here and see if that also fixes the error - I suspect not. Note that bumping cuequivariance to >=0.7 will make tests fail - we need to figure that out.

  • Play with cuequivariance autotuning, which seems to be the culprit triggering the crash. I am not an expert, but I have just seen these variables; it is not clear to me how exactly to disable autotune. If this works, it might mean we can still use cuequivariance, at reduced performance.

Once we know, we can create yet another environment variant (e.g., -no-acceleration or -reduced-acceleration).

I suspect this is a problem we should report to cuequivariance upstream, like here
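To confirm the root cause before reporting upstream, a quick, isolated reproduction of the failing NVML query could look like this (assuming pynvml is importable in the cuda12 environment; the environment name is the one from the instructions above):

# Reproduce the failing NVML query outside of cuequivariance
pixi run -e openfold3-cuda12 python -c "import pynvml; pynvml.nvmlInit(); h = pynvml.nvmlDeviceGetHandleByIndex(0); print(pynvml.nvmlDeviceGetPowerManagementLimit(h))"

If this also raises NVMLError_NotSupported on the GB10, the bug is simply that cuequivariance's cache manager assumes the power-limit query always succeeds.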

Let's tackle the CUDA13 bit, which to me is where the real prize is, later, maybe already against pytorch 2.10. I have some suspicions.
