Add XCCL Backend Support for Intel GPU in TorchComms #52
Conversation
This reverts commit 2ce3df2.
Expand XPU memory info with free memory query:
* Check for sycl::aspect::ext_intel_free_memory capability
* Use Intel SYCL extension to query actual free memory when available
* Fall back to total memory with warning when extension unsupported
* Add TorchCommLogging.hpp include for warning messages

This provides more accurate memory reporting on Intel XPU devices that support the free memory extension, while maintaining backward compatibility with devices that don't.
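The fallback logic described above can be sketched as follows. This is a minimal mock, not the PR's code: `MockDevice`, `memGetInfoSketch`, and its fields are hypothetical stand-ins for the real `sycl::device` aspect check and the Intel `ext_intel_free_memory` info query.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical mock of a SYCL device. In the real backend this would be a
// sycl::device, checked with device.has(sycl::aspect::ext_intel_free_memory)
// and queried via the Intel free-memory info extension.
struct MockDevice {
  bool has_free_memory_aspect;  // stands in for the aspect check
  uint64_t free_bytes;          // stands in for the extension's free-memory query
  uint64_t total_bytes;         // stands in for global_mem_size
};

// Sketch of the fallback pattern: report actual free memory when the
// extension is supported, otherwise fall back to total memory with a warning.
void memGetInfoSketch(const MockDevice& dev, uint64_t* free_mem,
                      uint64_t* total_mem) {
  *total_mem = dev.total_bytes;
  if (dev.has_free_memory_aspect) {
    *free_mem = dev.free_bytes;  // accurate free-memory report
  } else {
    std::fprintf(stderr,
                 "warning: free-memory extension unsupported; "
                 "reporting total memory\n");
    *free_mem = dev.total_bytes;  // conservative fallback
  }
}
```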
Move the free memory extension warning from memGetInfo() to getDeviceProperties() to provide early notification during device initialization:
* Add Intel free memory extension check in getDeviceProperties()
* Remove duplicate warning from memGetInfo() to avoid log spam
* Add device properties verification call in TorchCommXCCL::init()

This ensures users are warned once during device setup rather than repeatedly during memory queries, while maintaining the same functionality and error handling.
Add the [[maybe_unused]] attribute to the device_prop variable in TorchCommXCCL::init() to prevent compiler warnings: the variable only serves to trigger device properties validation during initialization and is never read afterwards.
Add [[likely]] and [[unlikely]] attributes to optimize branch prediction for Intel SYCL free memory extension checks:
* Mark extension availability as [[likely]] in memGetInfo()
* Mark extension unavailability as [[unlikely]] in getDeviceProperties() and memGetInfo()

This optimizes the common case where Intel XPU devices support the free memory extension, improving performance in the hot path while maintaining compatibility with devices that lack the extension.
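The annotation pattern described above looks roughly like this (a standalone C++20 sketch, not the backend's actual code; `queryFreeMemory` and its parameters are hypothetical):

```cpp
#include <cassert>

// Sketch of the branch-prediction hints: most Intel XPU devices support the
// free-memory extension, so that branch is marked as the expected path.
long queryFreeMemory(bool extensionSupported, long freeBytes, long totalBytes) {
  if (extensionSupported) [[likely]] {
    return freeBytes;   // common case: accurate free-memory query
  } else [[unlikely]] {
    return totalBytes;  // rare fallback for devices lacking the extension
  }
}
```

The [[likely]]/[[unlikely]] attributes (standard since C++20) only advise the compiler's code layout; they change no observable behavior.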
Wrap the XPU memory free operation in a try-catch block within the TorchCommXCCLBootstrap destructor:
* Catch exceptions during barrier buffer deallocation
* Log errors instead of allowing exceptions to escape
* Ensure safe object destruction even if memory freeing fails

This prevents potential program termination: in C++, an exception escaping a destructor during stack unwinding calls std::terminate.
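The destructor pattern above can be illustrated with a self-contained sketch. `freeBarrierBuffer` and `BootstrapSketch` are hypothetical stand-ins for the real XPU free call and the bootstrap class:

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>

// Hypothetical stand-in for the XPU memory free call, which may throw.
void freeBarrierBuffer(bool shouldThrow) {
  if (shouldThrow) throw std::runtime_error("device free failed");
}

// Never let an exception escape a destructor: if one propagates during
// stack unwinding, the program is terminated. Catch and log instead.
struct BootstrapSketch {
  bool fail_on_free = false;
  ~BootstrapSketch() noexcept {
    try {
      freeBarrierBuffer(fail_on_free);
    } catch (const std::exception& e) {
      std::fprintf(stderr, "error freeing barrier buffer: %s\n", e.what());
    }
  }
};
```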
The AllReduce.py unit test fails when an empty tensor is provided as input. This patch adds the missing check and returns without crashing.
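The guard might look like the following sketch. This is a plain-C++ mock for illustration only: `Tensor` stands in for an at::Tensor, and `allReduceSketch` for the XCCL allreduce entry point.

```cpp
#include <cassert>
#include <vector>

// Mock "tensor" for illustration; the real check operates on an at::Tensor
// inside the XCCL allreduce path.
using Tensor = std::vector<float>;

// Sketch of the fix: an empty input returns immediately as a completed
// no-op instead of being handed to the collective, which previously crashed.
bool allReduceSketch(Tensor& t) {
  if (t.empty()) {
    return true;  // nothing to reduce; report immediate completion
  }
  // ... the real XCCL allreduce would be launched here ...
  for (auto& v : t) v += 0.0f;  // placeholder for the collective
  return true;
}
```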
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Add Intel XPU to prerequisites
- Document XCCL backend setup with Intel oneAPI CCL environment
- Add USE_XCCL flag to build configuration options
Implement getMemAllocator() in TorchCommXCCL as a placeholder that throws a runtime error. This satisfies the interface requirement from TorchCommBackend.
Replace hardcoded CUDA device detection with PyTorch's accelerator API for better hardware abstraction:
* Use torch.accelerator.current_accelerator() for device detection
* Improve variable naming (device -> device_str) for clarity
* Add return type annotation for better type safety
* Fall back to CPU when no accelerator is available or device_count is 0
Hi, @d4l3k This is our initial PR to integrate the XCCL backend into TorchComms. Could you please help review? cc @pkourdis @siju-samuel @newtdms

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Motivation:
As illustrated in the design in RFC #51, we would like to add XCCL (Intel GPU Collective Communications Library) backend support to TorchComms.
In this PR, we enable allreduce in the XCCL backend as an entry point. Full collectives support will come later.

PR explanation:
Core Files
- TorchCommXCCL.cpp/.hpp
- TorchWorkXCCL.cpp/.hpp
- TorchWorkXCCLQueue.cpp
- TorchCommXCCLBootstrap.cpp/.hpp
- TorchCommXCCLUtils.cpp/.hpp
- TorchCommXCCLPy.cpp

API Abstraction Layers
- XcclApi.cpp/.hpp
- device/XpuApi.cpp/.hpp

Build System
- xccl/CMakeLists.txt
- setup.py

Examples: