Conversation

@Aidyn-A (Contributor) commented Feb 2, 2026

In pytorch/pytorch#171482, `PlacementClassVariable` and `PlacementVariable` were removed from TorchDynamo, so an explicit registration for `_ScaledPartial` is now required; otherwise compilation fails with:

torch._dynamo.exc.InternalTorchDynamoError: AsPythonConstantNotImplementedError: 
UserDefinedObjectVariable(_ScaledPartial) is not a constant

This PR adds the registration for the `_ScaledPartial` placement to simple FSDP.
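
For context, a minimal sketch of the registration pattern this PR adds, reconstructed from the diff excerpts quoted below. The import module for `register_opaque_type` / `MemberType`, the full member list, and the call form are assumptions rather than the PR's exact code:

    # Reconstruction of the registration pattern; imports are elided because the
    # module providing register_opaque_type / MemberType is not shown in this
    # excerpt, and _ScaledPartial is defined in simple_fsdp itself.
    allowed_members = {
        "is_shard": MemberType.USE_REAL,      # only "is_replicate" and "__eq__"
        "is_partial": MemberType.USE_REAL,    # are visible in the diff below;
        "is_replicate": MemberType.USE_REAL,  # the other entries are guesses
        "__eq__": MemberType.USE_REAL,
    }
    # Register _ScaledPartial as an opaque type with the listed members, so that
    # Dynamo no longer fails when it cannot treat a _ScaledPartial instance as a
    # Python constant (the failure mode in the error above).
    register_opaque_type(_ScaledPartial, allowed_members)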

@meta-cla meta-cla bot added the CLA Signed label Feb 2, 2026
"is_replicate": MemberType.USE_REAL,
"__eq__": MemberType.USE_REAL,
}
register_opaque_type(
A Contributor commented on the diff hunk above:

Please add a comment explaining:

  • what this function is doing
  • why we need it
  • what allowed_members is

Thanks!

@Aidyn-A (Contributor Author) replied:

Thanks for the review, I have added a comment. However, pytorch/pytorch#171482 got reverted; if it lands again, this PR should land as well.

"__eq__": MemberType.USE_REAL,
}
register_opaque_type(
_ScaledPartial,
A Contributor commented on the second diff hunk above:

@wwwjn We introduced `_ScaledPartial` to mimic FSDP2's `set_gradient_divide_factor`. Now that we handle gradient scaling ourselves, I feel we can deprecate this field. WDYT?
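
(For reference, a hedged sketch of the FSDP2 knob mentioned above. This is not code from the PR; it assumes a recent PyTorch where `fully_shard` exposes `set_gradient_divide_factor`, run under torchrun with an initialized default process group.)

    import torch.nn as nn
    from torch.distributed.fsdp import fully_shard

    model = nn.Linear(16, 16)
    fully_shard(model)  # shards the module in place; `model` now behaves as an FSDPModule
    # Divide reduced gradients by a custom factor instead of the default
    # data-parallel world size.
    model.set_gradient_divide_factor(4.0)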

A Contributor replied:

As long as simple FSDP doesn't rescale gradients, we should be good. Does `_ScaledPartial` mean the gradients on each rank are scaled (divided by the FSDP degree) while the information in the tensor is still partial / unreduced?

@tianyu-l (Contributor) commented Feb 6, 2026:

Yeah, good point. It does the scaling here https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/simple_fsdp/simple_fsdp.py#L204 by using `Partial(avg)` instead of `P(sum)`.

@Aidyn-A could you help change it to `P(sum)` and deprecate `_ScaledPartial`? We also need to compare the loss curve with FSDP2's to make sure the change is correct.

If it sounds too involved, we can do it.
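
(For readers less familiar with DTensor placements, a minimal illustration of the two reduce ops being discussed; this is not code from the PR:)

    from torch.distributed.tensor import Partial

    # A Partial placement marks a DTensor whose local shards are pending reduction.
    # reduce_op="sum" (the default) reduces with a plain all-reduce sum, while
    # reduce_op="avg" also divides by the number of ranks along that mesh
    # dimension, which is the implicit gradient scaling discussed above.
    pending_sum = Partial()                 # same as Partial(reduce_op="sum")
    pending_avg = Partial(reduce_op="avg")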

@Aidyn-A (Contributor Author) replied:

Sure, do you want me to add a warning that `_ScaledPartial` is deprecated, or remove `_ScaledPartial` immediately? In any case, what should I do with `reduction_divide_factor`? Where will that go?

A Contributor replied:

No need to worry about a warning. For `reduction_divide_factor`, both the definition and the usage should go away. Thanks!

@tianyu-l tianyu-l requested a review from wwwjn February 5, 2026 09:04
tianyu-l pushed a commit that referenced this pull request Feb 9, 2026
A follow-up on #2313 (comment). This PR removes the `_ScaledPartial` placement in favor of the `Partial(reduce_op="sum")` placement.

cc @tianyu-l, @wwwjn