Conversation

@jandom jandom commented Jan 7, 2026

Summary

Goal: ideally we can have a single process (single base image?) for building the docker image that also works on Blackwell.

Context: the current state is a Dockerfile that uses conda (one base image), and a Blackwell image that installs everything through the system Python and comes with PyTorch included (another base image).

We've had a number of contributions around this.

Mike Henry has donated his DGX for testing, and I was able to complete the Blackwell build and get some simple inferences running.

Bad news: because our environment.yml uses pytorch-cuda, which is not actually available for aarch64/arm, we basically install all the deps manually, both the apt-get packages and the pip packages. I also used a completely different base image (in line with both contributions, but different from our standard base).

Good news: I was able to simplify the Dockerfile significantly because all the tools have been upgraded to handle sm121. The performance looks comparable to what was previously reported.

For ubiquitin

  • cold-start 0:02:26
  • warm-start 0:00:05

Changes

Upgraded the base image, removed some duplicate package installs that were not needed, and improved the layering.

Related Issues

  • Training on Blackwell (out of scope)
  • Pre-compile the triton extension via docker commit (out of scope)
  • Run multiple benchmark runs to get a full picture
  • Visually confirm that the predictions look sane
  • Unify the Blackwell and 'main' docker image

Testing

Other Notes

@jandom jandom requested a review from jnwei January 7, 2026 17:43
@jandom jandom self-assigned this Jan 7, 2026
```
PyTorch: 2.7.0a0+ecf3bae40a.nv25.02
CUDA: 13.1
```

This is important: with CUDA 12.9+ we get sm121 support out of the box
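A minimal runtime check, as a sketch (it assumes the image's PyTorch was built with CUDA enabled), that sm_121 actually made it into the compiled arch list:

```python
# Sketch: confirm the PyTorch build in the image actually targets Blackwell
# (sm_121). torch.cuda.get_arch_list() reports the architectures it was
# compiled for; it returns an empty list on a CPU-only build.
import torch

print("torch", torch.__version__, "| CUDA", torch.version.cuda)
arch_list = torch.cuda.get_arch_list()
assert any("121" in arch for arch in arch_list), f"sm_121 missing from {arch_list}"
```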

Comment on lines -18 to -22

```
RUN git clone https://github.com/aqlaboratory/openfold-3.git && \
    cd openfold-3 && \
    cp -p environments/production-linux-64.yml environments/production.yml.backup && \
    grep -v "pytorch::pytorch" environments/production.yml > environments/production.yml.tmp && \
    mv environments/production.yml.tmp environments/production.yml
```

This was completely unused: everything is installed via the system python+pip

Comment on lines +25 to +30

```
# Set environment variables including CUDA architecture for Blackwell
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    KMP_AFFINITY=none \
    CUTLASS_PATH=/opt/cutlass \
    TORCH_CUDA_ARCH_LIST="12.1"
```

I think we can still remove some of these – all of them could be provided at runtime, and they are quite specific to the use case here (see the sketch below).
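As a sketch of why baking these in isn't strictly needed: TORCH_CUDA_ARCH_LIST is only consulted when CUDA extensions get (re)built, and to the best of my understanding recent PyTorch falls back to the architectures of the GPUs it detects when the variable is unset, so a per-run override would do:

```python
# Sketch: these settings only matter at extension-build time, so they can be
# supplied per-run (e.g. `docker run -e TORCH_CUDA_ARCH_LIST=12.1 ...`) rather
# than baked into the image.
import os
import torch

print("TORCH_CUDA_ARCH_LIST:", os.environ.get("TORCH_CUDA_ARCH_LIST", "<unset>"))
if torch.cuda.is_available():
    # On the DGX used for this PR this should report (12, 1), i.e. sm_121.
    print("detected capability:", torch.cuda.get_device_capability(0))
```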

Comment on lines -50 to -51

```
    "nvidia-cutlass<4" \
    "cuda-python<12.9.1"
```

  • We get cuda-python with the image, no need to duplicate that
  • We also only need the cutlass headers, no need to install the package (quick sanity check below)
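A quick sanity check along those lines, as a sketch (the bare `cuda` import and the /opt/cutlass fallback are assumptions based on the snippets above):

```python
# Sketch: the base image already ships cuda-python, and CUTLASS is only needed
# as a headers checkout pointed to by CUTLASS_PATH, not as a pip package.
import os
import cuda  # assumption: provided by the NVIDIA base image

cutlass_include = os.path.join(os.environ.get("CUTLASS_PATH", "/opt/cutlass"), "include")
assert os.path.isdir(cutlass_include), f"CUTLASS headers not found at {cutlass_include}"
print("cuda-python:", getattr(cuda, "__version__", "unknown"))
```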


```
# Pre-compile DeepSpeed operations for Blackwell GPUs to avoid runtime compilation
# Create necessary cache directories
RUN python3 -c "import os; os.makedirs('/root/.triton/autotune', exist_ok=True)"
```

This is empirically needed in my tests, which is a bit odd
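For reference, the same workaround written so it also works for non-root users, as a sketch (it assumes the autotune cache lives under the invoking user's home directory, as the hard-coded /root path suggests):

```python
# Sketch of the same cache-directory workaround, parameterised on $HOME instead
# of hard-coding /root, so it also behaves when the container runs as non-root.
import os

autotune_dir = os.path.join(os.path.expanduser("~"), ".triton", "autotune")
os.makedirs(autotune_dir, exist_ok=True)
print("created", autotune_dir)
```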

Comment on lines -74 to -78

```
RUN mkdir -p /usr/local/lib/python3.12/site-packages && \
    echo 'import os' > /usr/local/lib/python3.12/site-packages/sitecustomize.py && \
    echo 'os.environ.setdefault("TORCH_CUDA_ARCH_LIST", "12.0")' >> /usr/local/lib/python3.12/site-packages/sitecustomize.py && \
    echo 'os.environ.setdefault("CUTLASS_PATH", "/opt/cutlass")' >> /usr/local/lib/python3.12/site-packages/sitecustomize.py && \
    echo 'os.environ.setdefault("KMP_AFFINITY", "none")' >> /usr/local/lib/python3.12/site-packages/sitecustomize.py
```

All of this can be removed

Lots and lots of ENV magic and overrides... not great
Comment on lines +50 to +51

```
  - --extra-index-url https://download.pytorch.org/whl/cu130
  - torch>=2.9.0
```

This is important to get a sufficiently high version of torch (quick check below). A couple of things got removed or moved:

  • the biotite conda package only exists for linux-64, but the pip package does the job better
  • mkl removed
  • pytorch-cuda removed, again linux-64 only
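A quick check, as a sketch, that the resolved wheel really is a new-enough cu130 build (the version floors mirror the pins above):

```python
# Sketch: verify pip actually resolved torch >= 2.9.0 from the cu130 extra index.
import torch

major, minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert (major, minor) >= (2, 9), torch.__version__
assert torch.version.cuda and torch.version.cuda.startswith("13"), torch.version.cuda
print("ok:", torch.__version__, "| CUDA", torch.version.cuda)
```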

Comment on lines +4 to +13

```
CUDA_HOME: /usr/local/cuda
PATH: /usr/local/cuda/bin:${PATH}
LD_LIBRARY_PATH: /usr/local/cuda/lib64:${LD_LIBRARY_PATH}
# Triton bundles its own ptxas, which does not support sm_121
# This forces Triton to use the system ptxas compiler, which is aware of sm_121
TRITON_PTXAS_PATH: /usr/local/cuda/bin/ptxas
# Requires: git clone https://github.com/NVIDIA/cutlass --branch v3.6.0 --depth 1 ~/workspace/cutlass
CUTLASS_PATH: /home/jandom/workspace/cutlass
# Note: OMP_NUM_THREADS=1 is required to avoid threading conflicts
OMP_NUM_THREADS: "1"
```

This is the really ugly part, especially the hard-coded paths specific to my box or $HOME – all of this gets taken care of when using the docker image from NVIDIA with torch pre-installed.
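Something along these lines could resolve the same settings at startup without machine-specific paths, as a sketch (the variable names come from the compose snippet above; the fallback locations are assumptions):

```python
# Sketch: derive the Triton/CUTLASS settings from CUDA_HOME and $HOME at startup
# instead of hard-coding per-machine paths. Fallback locations are assumptions.
import os

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
os.environ.setdefault("TRITON_PTXAS_PATH", os.path.join(cuda_home, "bin", "ptxas"))
os.environ.setdefault("CUTLASS_PATH", os.path.expanduser("~/workspace/cutlass"))
os.environ.setdefault("OMP_NUM_THREADS", "1")
print({k: os.environ[k] for k in ("TRITON_PTXAS_PATH", "CUTLASS_PATH", "OMP_NUM_THREADS")})
```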
