Description
While running ingestd against a fairly small FTS3 server today, we observed that it was possible to overload the server in such a way that
all FTS3 commands returned error code 500.
A few transfers would complete, but the status check would return a 500, and so ingestd would try again.
As a result, we had 3700 transfers pending in FTS from an ingestd dropbox that contained only 173 data files to process.
Each file was being submitted for transfer multiple times: the status check timed out, so another copy of the file
was submitted for transfer while previous copies were still pending.
This was an extreme situation with FTS trying to run 130 simultaneous copies in a single-core container pod in OKD,
each of those copies being of a 10GB file.
We can get the logs from duneingestgpvm01 if necessary.
It may be worthwhile to include a backoff if ingestd gets an error code 500 from FTS3, so that it does not submit new transfers until
it has successfully talked to the server again, and does not give up on a transfer on the first 500 error.
Related to that, it would be handy to have a configuration option to tell FTS3 to use a longer timeout before it gives up on a transfer; the default appears to be 3 hours.
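
To make the backoff suggestion concrete, here is a minimal sketch of what it could look like. It is not taken from the ingestd code: `FTSGate`, `next_action`, the delay values, and the set of job states checked are all hypothetical and only illustrate the idea of holding new submissions after a 500 and not treating a 500 on a status check as a failed transfer.

```python
import time


class FTSGate:
    """Tracks whether the FTS3 server looks healthy.

    Hypothetical helper for illustration; not part of ingestd or the FTS3 client.
    """

    def __init__(self, base_delay=30.0, max_delay=1800.0):
        self.base_delay = base_delay  # seconds to wait after the first 5xx (assumed value)
        self.max_delay = max_delay    # upper bound on the wait (assumed value)
        self.failures = 0             # consecutive 5xx responses seen
        self.retry_at = 0.0           # earliest time to contact the server again

    def record(self, status_code, now=None):
        """Record the HTTP status of any request made to the server."""
        now = time.time() if now is None else now
        if 500 <= status_code < 600:
            self.failures += 1
            # Double the wait on each consecutive error, capped at max_delay.
            delay = min(self.max_delay, self.base_delay * 2 ** (self.failures - 1))
            self.retry_at = now + delay
        else:
            # Any successful exchange clears the backoff.
            self.failures = 0
            self.retry_at = 0.0

    def may_contact(self, now=None):
        """Status checks are allowed again once the backoff delay has passed."""
        now = time.time() if now is None else now
        return now >= self.retry_at

    def may_submit(self, now=None):
        """New transfers are held until the server has answered successfully."""
        return self.failures == 0 and self.may_contact(now)


def next_action(status_code, job_state):
    """Decide what to do with one transfer after a status poll."""
    if 500 <= status_code < 600:
        # A server error says nothing about the transfer itself: keep waiting
        # instead of submitting another copy of the same file.
        return "keep_waiting"
    if job_state == "FINISHED":
        return "done"
    if job_state in ("FAILED", "CANCELED"):
        return "resubmit"
    return "keep_waiting"
```

With something like this in place, the scan loop would call `may_submit()` before sending new files and `record()` after every request, so a run of 500 responses pauses submissions rather than multiplying them.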