@Yiozolm Yiozolm commented Sep 1, 2025

  • New Features
    1. CUDA backend
    2. Add FB/F2B cache
    3. Broadcasting replaces repeat
    4. splits→permute→view→mean replaced by a block-mean CUDA kernel
    5. STy s-fold upsampler CUDA kernel
    6. R2C/C2R (Real FFT) replaces C2C
  • Tests
    1. Speed tests for forward/backward passes
    2. Precision comparison between PyTorch and CUDA
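
To illustrate why feature 6 (R2C/C2R replacing C2C) can save work, here is a minimal standard-library sketch. It is not the code from this PR: `dft`, `rdft`, and `full_from_half` are hypothetical helper names, and the naive O(n²) transform stands in for a real FFT library. It only demonstrates the known property that the DFT of a real signal is Hermitian-symmetric, so `n // 2 + 1` bins carry all the information.

```python
import cmath


def dft(x):
    """Naive O(n^2) complex-to-complex DFT, for illustration only."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def rdft(x):
    """Real-to-complex DFT: keep only the n // 2 + 1 non-redundant bins."""
    return dft(x)[: len(x) // 2 + 1]


def full_from_half(half, n):
    """Rebuild the full spectrum using X[n - k] = conj(X[k]) for real input."""
    full = list(half)
    for k in range(len(half), n):
        full.append(half[n - k].conjugate())
    return full


signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
full = dft(signal)                           # 8 complex bins (C2C)
half = rdft(signal)                          # 5 complex bins (R2C)
rebuilt = full_from_half(half, len(signal))

# The rebuilt spectrum should match the full one up to rounding error.
max_err = max(abs(a - b) for a, b in zip(full, rebuilt))
print(len(full), len(half), max_err)
```

In a real-FFT pipeline roughly half of the spectrum is stored and processed, which is presumably where the saving in this PR comes from; the actual gain depends on the kernels involved.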

@Yiozolm
Author

Yiozolm commented Sep 1, 2025

Further tests (dev branch):

  1. v7 may cause numerical issues during training; we recommend using it only during inference.
  2. On devices with limited processing power, such as the RTX 2080 Ti, v1 appears to be the best choice.

@Yiozolm Yiozolm closed this Sep 4, 2025
@csgeekhuang
Collaborator

Thank you for your hard work on this! We’ve conducted efficiency tests, and based on the results, we’re planning to merge your code once all the remaining ToDo items are completed—this should help significantly boost the overall speed.

@Yiozolm
Author

Yiozolm commented Sep 12, 2025

Appreciate your positive feedback. I believe this project is a foundational and extremely meaningful piece of work, and I feel very honored to have the opportunity to contribute.

I'm currently quite busy with my Fall 2026 PhD applications, so my free time is limited. However, I will work on fixing the numerical issues as soon as possible. So far I have found that the numerical errors appear when scale≥2, while all versions seem to be correct when scale=1. I will prioritize addressing this to ensure the code's accuracy.

@csgeekhuang
Collaborator

I hope you find a good PhD position, and we look forward to your wonderful CUDA optimizations!

@Yiozolm Yiozolm reopened this Sep 14, 2025
@Yiozolm
Author

Yiozolm commented Sep 14, 2025

I recommend merging the current version for now.
Any further optimization appears to introduce floating-point reordering errors, which I also find frustrating.
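
For readers unfamiliar with the term, the sketch below shows the general effect. It assumes nothing about the kernels in this PR: `sequential_sum` and `tree_sum` are hypothetical helpers, and the pairwise order is just one order a parallel reduction might use. Floating-point addition is not associative, so summing the same values in a different order can change the last bits of the result.

```python
import math


def sequential_sum(xs):
    """Left-to-right accumulation, as a simple serial loop would do."""
    total = 0.0
    for x in xs:
        total += x
    return total


def tree_sum(xs):
    """Pairwise (tree) reduction: adds neighbours level by level."""
    xs = list(xs)
    if not xs:
        return 0.0
    while len(xs) > 1:
        paired = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:  # carry the unpaired last element to the next level
            paired.append(xs[-1])
        xs = paired
    return xs[0]


# The classic three-term case: same values, different grouping.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False

# Larger input: the two orders may disagree in the last bits,
# so results should be compared with a tolerance, not with ==.
data = [0.1 * i for i in range(1, 1001)]
seq, tree = sequential_sum(data), tree_sum(data)
print(seq, tree, math.isclose(seq, tree, rel_tol=1e-9))
```

This is why a comparison between a reference implementation and an optimized kernel is usually done against a tolerance rather than by exact equality.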

@Yiozolm Yiozolm closed this Sep 14, 2025
@Yiozolm
Author

Yiozolm commented Sep 14, 2025

The current branch may contain too many unnecessary commits.
You may want to use a squash merge.

@Yiozolm Yiozolm reopened this Sep 14, 2025

2 participants