Add tile-based API #145

Open
wants to merge 12 commits into
base: main

@Binyang2014 Binyang2014 commented Jul 26, 2023

Provide a tile-based API:

void put2D(uint64_t dstOffset, uint64_t srcOffset, uint32_t width, uint32_t height)
void put2DWithSignal(uint64_t dstOffset, uint64_t srcOffset, uint32_t width, uint32_t height)

To support this, add a new structure, fields2D, to ChannelTrigger. In this structure, the 64-bit size is replaced by two 32-bit fields (a 32-bit width and a 32-bit height). A new flag, multiDimensionFlag, is also added to fields2D to distinguish it from the fields structure.
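As an illustration of the idea, here is a minimal sketch of how a width/height pair can share the 64-bit slot that a 1D size would occupy, with one flag bit marking the trigger as 2D. The names `Trigger128`, `makeTrigger1D`, `makeTrigger2D`, and `kMultiDimFlagBit`, and the exact bit positions, are assumptions made for this example; they are not the layout used by ChannelTrigger in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 128-bit trigger made of two 64-bit words.
// The real ChannelTrigger layout may differ; this only shows the packing idea.
struct Trigger128 {
  uint64_t fst;  // 1D: 64-bit size. 2D: width (low 32 bits) and height (high 32 bits).
  uint64_t snd;  // remaining fields plus the multi-dimension flag (assumed position)
};
static_assert(sizeof(Trigger128) == 16, "trigger must stay 128 bits");

// Assumed position of the flag that distinguishes 2D triggers from 1D ones.
constexpr uint64_t kMultiDimFlagBit = 1ULL << 63;

// Build a 1D trigger: the whole first word is the size, the flag is cleared.
inline Trigger128 makeTrigger1D(uint64_t size, uint64_t rest) {
  return Trigger128{size, rest & ~kMultiDimFlagBit};
}

// Build a 2D trigger: the first word is split into width and height, the flag is set.
inline Trigger128 makeTrigger2D(uint32_t width, uint32_t height, uint64_t rest) {
  uint64_t packed = (static_cast<uint64_t>(height) << 32) | width;
  return Trigger128{packed, rest | kMultiDimFlagBit};
}

inline bool is2D(const Trigger128& t) { return (t.snd & kMultiDimFlagBit) != 0; }
inline uint32_t widthOf(const Trigger128& t) { return static_cast<uint32_t>(t.fst & 0xFFFFFFFFULL); }
inline uint32_t heightOf(const Trigger128& t) { return static_cast<uint32_t>(t.fst >> 32); }
```

The consumer of the trigger would check the flag first and only then decide whether to read the first word as a single size or as a width/height pair.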

Example of using the tile-based API:
When setting up the connections, call channelService->addPitch first:

  for (int r = 0; r < worldSize; r++) {
    if (r == rank) {
      continue;
    }
    std::shared_ptr<mscclpp::Connection> conn;
    if ((rankToNode(r) == rankToNode(gEnv->rank)) && !useIbOnly) {
      conn = communicator->connectOnSetup(r, 0, mscclpp::Transport::CudaIpc);
    } else {
      conn = communicator->connectOnSetup(r, 0, ibTransport);
    }
    connections[r] = conn;
    communicator->sendMemoryOnSetup(recvBufRegMem, r, 0);
    auto remoteMemory = communicator->recvMemoryOnSetup(r, 0);
    communicator->setup();

    mscclpp::SemaphoreId cid = channelService->add2DChannel(conn, std::pair<size_t, size_t>(dstPitch, srcPitch));
    communicator->setup();
  }

Then use the put2D API:

if (threadIdx.x == 0) proxyChan.put2DWithSignal(offset, width * sizeof(int), height);
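To make the role of width, height, and the two pitches concrete, the following host-only sketch copies a tile row by row. It assumes the pitches follow the same convention as cudaMemcpy2D (byte distance between the starts of consecutive rows) and that width is given in bytes, as the `width * sizeof(int)` argument above suggests. The function `copy2D` is a hypothetical helper for this example, not part of the PR, and its contiguous fast path is an assumption of the sketch rather than something the PR is known to do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side model of a tile copy.
// Copies `height` rows of `widthBytes` bytes; row i starts at base + i * pitch.
inline void copy2D(uint8_t* dst, size_t dstPitch, const uint8_t* src, size_t srcPitch,
                   size_t widthBytes, size_t height) {
  assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
  if (widthBytes == dstPitch && widthBytes == srcPitch) {
    // Rows are back to back on both sides, so the tile is one contiguous 1D range.
    std::memcpy(dst, src, widthBytes * height);
    return;
  }
  for (size_t row = 0; row < height; ++row) {
    std::memcpy(dst + row * dstPitch, src + row * srcPitch, widthBytes);
  }
}
```

For example, with a 4x4 byte matrix as the source (srcPitch = 4) and a 3-byte-pitch destination, `copy2D(dst, 3, src + 5, 4, 2, 2)` moves the 2x2 tile whose top-left element is at row 1, column 1 into the top-left corner of the destination and leaves the rest of each destination row untouched.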

@chhwang chhwang marked this pull request as ready for review July 26, 2023 12:47
@Binyang2014
Contributor Author

cudaMemcpy2DAsync seems to be slower than cudaMemcpyAsync for 1D data. This needs investigation.

@@ -29,6 +29,16 @@ MSCCLPP_API_CPP SemaphoreId ProxyService::buildAndAddSemaphore(Communicator& com
return semaphores_.size() - 1;
}

MSCCLPP_API_CPP SemaphoreId ProxyService::buildAndAddSemaphore(Communicator& communicator,
Contributor

This doesn't make much sense to me. Why do we need an extra way of building a semaphore? We only need to provide a 2D write over 1D arrays, so a 2D write alone should be enough, right?

Contributor Author

We need it because the pitch/stride has to be set for the channel (the name "semaphore" is not accurate). The reason we don't pass the stride in the put2D API is that our trigger is only 128 bits; there are no spare bits for it.
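One way to picture this design is a small host-side table that stores one (dstPitch, srcPitch) pair per channel at setup time, so that the trigger only has to carry the channel id. The class `PitchTable` and its methods are hypothetical names used for this sketch; they are not the implementation in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-channel pitch registry kept on the host side.
// The id returned by add() plays the role of the channel id carried in the trigger.
class PitchTable {
 public:
  // Register a (dstPitch, srcPitch) pair once, at connection setup time.
  uint32_t add(std::pair<size_t, size_t> pitch) {
    pitches_.push_back(pitch);
    return static_cast<uint32_t>(pitches_.size() - 1);
  }

  // Look up the pitches when a 2D trigger for `chanId` is processed.
  // Throws std::out_of_range for an unknown id.
  const std::pair<size_t, size_t>& lookup(uint32_t chanId) const { return pitches_.at(chanId); }

 private:
  std::vector<std::pair<size_t, size_t>> pitches_;
};
```

The trade-off in this sketch is that every tile sent over a channel shares the same pitches; sending tiles from matrices with different row strides would need a separate channel per stride pair.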

2 participants