Conversation


@zhangxiaoli73 zhangxiaoli73 commented Nov 10, 2025

Motivation:

As illustrated in the design in RFC #51, we would like to add an XCCL (Intel GPU Collective Communications Library) backend to TorchComms.

This PR enables all_reduce in the XCCL backend as an entry point; support for the full set of collectives will follow in later PRs.

PR explanation:

Core Files

| File | Purpose |
| --- | --- |
| `TorchCommXCCL.cpp/.hpp` | Main collectives implementation |
| `TorchWorkXCCL.cpp/.hpp` | Asynchronous work tracking |
| `TorchWorkXCCLQueue.cpp` | Work queue management |
| `TorchCommXCCLBootstrap.cpp/.hpp` | Communicator initialization |
| `TorchCommXCCLUtils.cpp/.hpp` | Utility functions (type conversion) |
| `TorchCommXCCLPy.cpp` | Python bindings |

API Abstraction Layers

| File | Purpose |
| --- | --- |
| `XcclApi.cpp/.hpp` | oneCCL API wrapper |
| `device/XpuApi.cpp/.hpp` | SYCL/XPU runtime abstraction |

Build System

| File | Purpose |
| --- | --- |
| `xccl/CMakeLists.txt` | CMake configuration for the XCCL backend |
| `setup.py` | Python package build (`USE_XCCL` flag) |

Examples:

```python
import torch
import torchcomms
from torchcomms import new_comm

# Initialize communicator
device = torch.device("xpu")
comm = new_comm("xccl", device, name="main_comm")
rank = torchcomms.get_rank()
current_device = torch.device(f"xpu:{rank}")

# Perform a collective operation
tensor = torch.randn(1024, device=current_device)
work = comm.all_reduce(tensor, torchcomms.ReduceOp.SUM, async_op=True)
work.wait()

# Cleanup
comm.finalize()
```


meta-cla bot commented Nov 10, 2025

Hi @zhangxiaoli73!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@Chao1Han force-pushed the cherry/xccl-allreduce branch from d1ee478 to a19186e on December 10, 2025, 06:16
@zhangxiaoli73 force-pushed the cherry/xccl-allreduce branch 2 times, most recently from 805be37 to 3fd9dfc on December 12, 2025, 03:40
pkourdis and others added 15 commits December 15, 2025 14:56
Expand XPU memory info with free memory query:

* Check for sycl::aspect::ext_intel_free_memory capability
* Use Intel SYCL extension to query actual free memory when available
* Fall back to total memory with warning when extension unsupported
* Add TorchCommLogging.hpp include for warning messages

This provides more accurate memory reporting on Intel XPU devices
that support the free memory extension, while maintaining backward
compatibility with devices that don't.
Move the free memory extension warning from memGetInfo() to getDeviceProperties()
to provide early notification during device initialization:

* Add Intel free memory extension check in getDeviceProperties()
* Remove duplicate warning from memGetInfo() to avoid log spam
* Add device properties verification call in TorchCommXCCL::init()

This ensures users are warned once during device setup rather than
repeatedly during memory queries, while maintaining the same
functionality and error handling.
Add the [[maybe_unused]] attribute to the device_prop variable in TorchCommXCCL::init()
to prevent compiler warnings: the variable is assigned only so that device
properties are validated during initialization and is never read afterwards.
Add [[likely]] and [[unlikely]] attributes to optimize branch prediction
for Intel SYCL free memory extension checks:

* Mark extension availability as [[likely]] in memGetInfo()
* Mark extension unavailability as [[unlikely]] in getDeviceProperties()
  and memGetInfo()

This optimizes the common case where Intel XPU devices support the
free memory extension, improving performance in the hot path while
maintaining compatibility with devices that lack the extension.
Wrap XPU memory free operation in try-catch block within the
TorchCommXCCLBootstrap destructor:

* Catch exceptions during barrier buffer deallocation
* Log errors instead of allowing exceptions to escape
* Ensure safe object destruction even if memory freeing fails

This prevents abrupt program termination: destructors are implicitly noexcept
in modern C++, so an exception escaping a destructor (in particular during
stack unwinding) results in a call to std::terminate.
The AllReduce.py unit test fails when an empty tensor is provided as
input. This patch adds the missing check and returns early instead of crashing.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
siju-samuel and others added 3 commits December 15, 2025 15:23
- Add Intel XPU to prerequisites
- Document XCCL backend setup with Intel oneAPI CCL environment
- Add USE_XCCL flag to build configuration options
Implement getMemAllocator() in TorchCommXCCL as a placeholder that
throws a runtime error. This satisfies the interface requirement from
TorchCommBackend.
Replace hardcoded CUDA device detection with PyTorch's accelerator API
for better hardware abstraction:

* Use torch.accelerator.current_accelerator() for device detection
* Improve variable naming (device -> device_str) for clarity
* Add return type annotation for better type safety
* Fall back to CPU when no accelerator is available or device_count is 0
@zhangxiaoli73 zhangxiaoli73 changed the title [Draft] Add XCCL Backend Support for Intel GPU in TorchComms Add XCCL Backend Support for Intel GPU in TorchComms Dec 17, 2025
@zhangxiaoli73
Author

Hi @d4l3k, this is our initial PR to integrate the XCCL backend into TorchComms. Could you please help review? cc @pkourdis @siju-samuel @newtdms

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 17, 2025

meta-cla bot commented Dec 17, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
