BlockingMode._fd_reader_callback asyncio task not end #1072
Hi @luweizheng, thanks for the report. I wasn't familiar with xorbits/xoscar, and it's nice to see you've been using UCX-Py for your projects!

Those errors are generally nothing to worry about. I agree it isn't nice to have them; they occur because UCX-Py attempts to do as much as possible for the user, which in this particular case means we launch an asynchronous task to keep progressing the worker without having to pass that responsibility to the user. I'd also like to point out that they've always been there; some proof is in a 1.5-year-old PR where I attempted to resolve this but have so far failed. The problem is that there's no good way to stop that task when UCX-Py doesn't control the event loop but the application (xoscar in this case) does: when the event loop closes, UCX-Py doesn't know about it and can't do anything to stop the task, which can no longer be progressed. On the application end you should be able to fix it, though, by running […].

As for performance, I don't think that is in any way related to this. It is possible, though, that either your application or […] changed. Another change that may have had an impact on performance is in UCX itself: UCX v1.16, which only recently became supported by UCX-Py, switches to protov2 as the default […].

Finally, I'd like to point out that UCX-Py is going to be archived some time in the future in favor of UCXX, which has an almost identical API, so hopefully simply changing the install requirements and moving […] will be enough for you.
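The inline reference to what should be run was dropped from this page; a minimal sketch of that kind of fix, assuming the suggestion was an explicit teardown such as UCX-Py's public `ucp.reset()` (the exact call is not recoverable here), might look like:

```python
# Hedged sketch: tear down UCX-Py while the application's event loop is
# still alive, so the internal progress task is stopped cleanly. Assumes
# the public ucp.reset() API; the application structure is illustrative.
import asyncio
import ucp

async def main():
    ...  # application code using UCX-Py endpoints

loop = asyncio.new_event_loop()
try:
    loop.run_until_complete(main())
finally:
    ucp.reset()   # drop the global UCX context and its progress task
    loop.close()  # only close the loop after UCX-Py is torn down
```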
Thank you for your long response. I have tried all the methods mentioned above, except for using […]. The good news is that after changing […]. The bad news is that using […]. Actually, I installed ucxx with […].
You mention that "UCX may now select a different set of transports that may have different performance characteristics". For InfiniBand on a CPU node, now my […].
And does […]?
UCXX also has a different default progress mode than UCX-Py: instead of being run as an asynchronous task, it is actually a separate C++ thread that notifies Python futures. It may also have different performance characteristics depending on the workload; hopefully it will perform slightly better for the majority of cases. The blocking progress mode, which is the current default in UCX-Py, is being worked on in rapidsai/ucxx#116.
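Progress-mode selection in both libraries is environment-driven; a minimal sketch, assuming the `UCXPY_PROGRESS_MODE` variable that UCX-Py reads (and which, to my understanding, UCXX also honors), would be:

```python
# Hedged sketch: choose a progress mode before the library initializes.
# The variable name and value are assumptions based on UCX-Py's
# configuration; "thread" corresponds to the C++ progress thread
# described above.
import os
os.environ.setdefault("UCXPY_PROGRESS_MODE", "thread")

import ucxx  # the mode is read when the library initializes
```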
Can you provide details on how much slower it is today compared to before? Also, what error are you seeing?
I think we currently don't provide a CPU-only UCXX wheel package; I can't recall whether that's just because nobody was using one or because of other technical limitations. Could you please file an issue at https://github.com/rapidsai/ucxx/issues about that, and include the same details about it being used by xorbits/xoscar? I'll then make sure the relevant people respond on what can be done.
In the past we used to suggest setting […].
I have not attempted […]. I tried recreating a conda environment, installing UCX version 1.15, and setting the environment variable […]. The error encountered with […].
That would be just before your event loop closes, which is often either before the end of […].
You still haven't provided any reference for how much slower we're talking about. Do you have numbers you can share?
Yes, please do so.
Hi @pentschev, since I have moved to […]. To show the numbers, I ran some TPC-H queries, a data analysis benchmark. For Query 3, the UCX backend takes 53 seconds, while UNIX sockets take only 35 seconds. I've also noticed that when using UCX in our software, CPU usage is high when I check […]. So maybe we need to optimize how our code uses UCX?
Is there an easy way to reproduce that? Note that establishing UCX endpoints can be more costly than regular sockets, depending on what transports are used. Therefore, depending on what exactly you're timing, the transports UCX selects, how many times endpoints get created during the workflow, and whether the transfers are "large enough" or just many small transfers, I wouldn't be surprised if sockets can outperform UCX. With that said, it would be useful to have more information about those details, such as what your system looks like w.r.t. network interfaces, how many endpoints get created, and the sizes and numbers of messages, etc.
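To make the endpoint-creation point concrete, here is a minimal, hedged UCX-Py sketch (port, message size, and count are made-up illustration values) in which a single endpoint is reused for many small transfers, so the relatively expensive wireup is paid only once; a workload that re-created endpoints per message could easily be dominated by that cost instead:

```python
# Hedged sketch: one UCX-Py endpoint reused for many small messages.
import asyncio
import numpy as np
import ucp

PORT = 13337
N_MSGS, MSG_BYTES = 1000, 4096

async def main():
    done = asyncio.Event()

    async def handler(ep):
        buf = np.empty(MSG_BYTES, dtype="u1")
        for _ in range(N_MSGS):
            await ep.recv(buf)  # many small receives on one endpoint
        await ep.close()
        done.set()

    lf = ucp.create_listener(handler, PORT)
    ep = await ucp.create_endpoint(ucp.get_address(), PORT)
    msg = np.ones(MSG_BYTES, dtype="u1")
    for _ in range(N_MSGS):  # endpoint wireup happens once; sends are cheap
        await ep.send(msg)
    await done.wait()
    await ep.close()
    lf.close()

asyncio.run(main())
```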
It seems like you're running in polling mode. Do you happen to specify […]?
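If polling mode is indeed in effect, it busy-spins a core while waiting for work, which would match the high CPU usage reported above. A minimal, hedged sketch of requesting blocking progress instead (the `BlockingMode` from this issue's title), assuming the `UCXPY_PROGRESS_MODE` environment variable:

```python
# Hedged sketch: ask for blocking progress so the worker wakes up only
# when the UCX worker's file descriptor signals readiness, instead of
# spinning a core. Must be set before the library is imported.
import os
os.environ["UCXPY_PROGRESS_MODE"] = "blocking"

import ucp  # the variable is read during initialization
```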
Hi @pentschev, as I moved from ucx-py to ucxx, I conducted some benchmark tests based on the README page of ucxx. Can you help me analyze them? I installed ucxx with […].

```
python -m ucxx.benchmarks.send_recv \
    --backend ucxx-async \
    --object_type rmm \
    --server-dev 0 \
    --client-dev 3 \
    --n-iter 10 \
    --n-bytes 1Gb \
    --n-buffers 2
```

Result: […]
Results with the same `object_type` across two compute nodes. One compute node acts as the server:

```
python -m ucxx.benchmarks.send_recv \
    --backend ucxx-async \
    --object_type rmm \
    --n-iter 3 \
    --n-bytes 1Gb \
    --server-only \
    --server-dev 0
```

Another compute node acts as the client:

```
python -m ucxx.benchmarks.send_recv \
    --backend ucxx-async \
    --object_type rmm \
    --n-iter 3 \
    --n-bytes 1Gb \
    --client-only \
    --server-address 192.168.1.64 \
    --port 40295 \
    --client-dev 0
```

I checked that the […]. Result: […]
And I use […].

Server: […]

Client: […]

I get an average bandwidth of 19905.54 MB/s.
With `--object_type numpy`, on the server:

```
python -m ucxx.benchmarks.send_recv \
    --n-bytes 1Gb \
    --server-only \
    --object_type numpy \
    --n-iter 3
```

and on the client:

```
python -m ucxx.benchmarks.send_recv \
    --backend ucxx-async \
    --object_type numpy \
    --n-iter 3 \
    --n-bytes 1Gb \
    --client-only \
    --server-address 192.168.1.64 \
    --port 58129
```

Result: […]
Shouldn't the bandwidth be 10+ GiB/s? I also ran ucx_perftest. On the server side:

```
ucx_perftest -t tag_bw -s 1000000000 -n 20 -p 9999
```

On the client side:

```
ucx_perftest 192.168.1.65 -t tag_bw -s 1000000000 -n 20 -p 9999
```

I get: bandwidth (MB/s) average 46830.10, overall 46830.10. That seems better than the Python benchmark code?
Unfortunately I can't see anything obviously wrong, nor can I reproduce what you're observing. For me, this is what I get on a DGX-1 with ConnectX-4:

- ucx_perftest: […]
- send_recv --backend ucxx-async: […]
- send_recv --backend ucxx-core: […]

As you can see, the async backend is indeed a bit slower, but […]. Could you also try with […]?
Hi @pentschev, the issue I've found is that the intra-node communication bandwidth is much lower than expected. My hardware should be a bit more powerful than what you listed; I'm using ConnectX-6 200 Gbps network cards, yet the result I got with ucxx is only 482.24 MB/s while yours is 9.9 GB/s. My question is: is there a problem with the way I've installed it, or am I missing some necessary packages? I installed with […].
UPDATE: I created a new conda environment, installed the precompiled ucxx using conda/mamba, and tested it. The bandwidth is much higher than what I got when installing with pip. Maybe the conda/mamba version shows the expected performance for my hardware?

conda/mamba version: […]

pip version: […]
Can you post […] when you have your pip environment active?
I think there may be some issue with the UCX pip package. I normally use conda, so I don't see it often, and I think the user base of the UCX pip package is slim to none, so there may be issues with it we need to fix; what I've asked for above will help us understand more.
Thanks for reporting back. After discussing internally, I now realize the performance aspect is indeed expected: pip packages are NOT built with verbs/rdmacm support, because no rdma-core package is available for pip, whereas rdma-core is available for conda. Unfortunately, you won't be able to get better performance with the default pip install. However, the UCX pip package is intentionally built so that it picks up a UCX system install if one is available, and only falls back to its own binaries otherwise. With that in mind, you may still be able to resolve your problem if you have UCX installed on your system built with either rdma-core or MOFED support, but again, you'll need to provide a build with that support yourself.
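One way to verify which build is being picked up is to inspect the available transports; a hedged sketch using UCX's own `ucx_info -d` diagnostic tool (which may not even be on PATH with the default pip wheels, itself a useful hint):

```python
# Hedged sketch: list the transports visible to whichever UCX build is
# actually in use. Without verbs/rdmacm support, InfiniBand transports
# (rc/dc/ud) will be absent from the listing.
import subprocess

out = subprocess.run(["ucx_info", "-d"], capture_output=True, text=True).stdout
transports = sorted({line.split()[-1] for line in out.splitlines()
                     if "Transport:" in line})
print("available transports:", transports)
```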
Thanks for all your replies. But the entire end-to-end frameworks (xorbits for distributed dataframes, and xoscar for actors and communication) still face performance issues, even with the conda package that has rdma-core. I will investigate and may open new issues in the ucx-py or ucxx repos.
I'm glad you are now unblocked. If you're still seeing performance issues, I think the easiest way to determine whether this is due to some regression in UCXX would be to run xorbits/xoscar with both UCX-Py and UCXX. From this thread I had the impression there's no record of the performance you obtained previously, and as we've seen, some of the new packages may lack features you were previously using. With that in mind, we want to know whether you see any performance regression due to the move from UCX-Py to UCXX and/or to the latest UCX packages. Keep in mind UCX pip packages have only been available since June, so if you were previously using UCX-Py with pip, you must have provided your own UCX build with all the capabilities for the system where you built it.
Hi there,
I am the maintainer of xoscar and xorbits. xoscar is a lightweight actor programming framework that enables inter-process and inter-node communication, and we use ucx-py to accelerate communication. There were no issues before, but recently ucx-py has consistently been reporting the following error: […]
It seems that some asyncio tasks do not end?

As this is only an `assert` statement, I deleted that line. After commenting out the assert line, the entire program can run, but it reports another error.

In terms of performance for communication and computation across computing nodes, ucx-py is now slightly slower than UNIX sockets. Previously, when there was no such error, ucx-py was faster than UNIX sockets.
This part feels difficult to debug. Are there any clues to help with debugging?