What determines when a replication job is considered 'crashing'? #3550

QualityControll · 2021-05-07T14:38:53Z

Hello,

I have a question about how an active, continuously running replication job transitions from state 'running' to 'crashing'.

I have the following scenario:
2 hosts, A, B
Host A has a replication 'pull' where Host B is the source, and Host A is the target.

Replication works fine. However, let's assume that the network is disconnected between A and B (for instance, I pull the network cable).

'Host A' continues to show a 'running' state in /_scheduler/jobs/_replicator for anywhere from 2.5 minutes to 10 minutes, until the job finally transitions to 'crashing'.

I have configured the local.ini with the following:

[replicator]
max_history = 2
retries_per_request = 1
connection_timeout = 10000
checkpoint_interval = 5000
interval = 10000

Despite 'errors' in the couchdb log file due to 'req_timedout', the job takes several minutes to report that it's 'crashing'.

How can I make the _scheduler/jobs/_replicator report a 'crashing' state sooner? Say within 1 minute?

Thanks!

nickva · 2021-05-07T15:29:02Z

nickva
May 7, 2021
Collaborator

With remote connections, unless there is a periodic ping or a timeout involved, the socket might not know that the cable was pulled. If the documents have all replicated, for example, we'd only find out that the connection is broken when the _changes feed times out. The timeout on the changes feed is derived from the connection_timeout config parameter, and since you set it to 10000 (10 seconds), it seems you should find out in well under a minute. It's a good idea to lower retries_per_request too.

I think you meant _scheduler/docs/_replicator? Maybe monitor the logs to see when the first errors appear, and poll _scheduler/jobs or _scheduler/docs to see when the first state change shows up there.

Also, max_history = 2 seems a bit low; you might not get much backoff when the system is disconnected. Maybe make it a bit higher so you can see when the error first happens, as the low value may be masking the transition from running to crashing (since the job will just restart as running right away).
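
One way to see when the error first happens is to read the job's history as returned by _scheduler/jobs. A rough Python sketch, assuming each history entry is an object with a "timestamp" (an ISO 8601 UTC string) and a "type" such as "started" or "crashed". The helper name first_crash is hypothetical, and fetching the JSON is left out:

```python
def first_crash(history):
    """Return the oldest 'crashed' event in a job's history, or None.

    Assumes timestamps are ISO 8601 UTC strings in one fixed format,
    so plain string comparison orders them chronologically.
    """
    crashes = [event for event in history if event.get("type") == "crashed"]
    if not crashes:
        return None
    return min(crashes, key=lambda event: event["timestamp"])
```

With a very small max_history the oldest entries are dropped quickly, so the event this returns may not be the first crash that actually occurred.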


QualityControll May 7, 2021
Author

Thanks for the response.

Yes - sorry, I did mean _scheduler/docs/_replicator. I have a Python script that polls _scheduler/docs/_replicator once per second for the replicator status.
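
A polling script of this kind might look roughly like the sketch below. It is not the script used in this thread: BASE_URL, the function names, and the lack of authentication are assumptions for illustration (a real setup would need admin credentials and, here, HTTPS on port 6984). It relies on the response containing a "docs" list whose entries carry "doc_id" and "state".

```python
import json
import time
import urllib.request

# Assumed local endpoint without authentication; adjust for your setup.
BASE_URL = "http://127.0.0.1:5984"


def extract_states(payload):
    """Map each replication doc_id to its scheduler state."""
    return {doc["doc_id"]: doc["state"] for doc in payload.get("docs", [])}


def fetch_states(base_url=BASE_URL, timeout=5):
    """GET _scheduler/docs/_replicator and return {doc_id: state}."""
    url = base_url + "/_scheduler/docs/_replicator"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_states(json.load(resp))


def watch(poll_seconds=1.0, fetch=fetch_states):
    """Loop forever, printing a timestamped line on every state change."""
    last = {}
    while True:
        current = fetch()
        for doc_id, state in current.items():
            if last.get(doc_id) != state:
                stamp = time.strftime("%H:%M:%S")
                print(stamp, doc_id, last.get(doc_id), "->", state)
        last = current
        time.sleep(poll_seconds)
```

Calling watch() and comparing its timestamps against couchdb.log shows how long the reported state lags behind the first logged error.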

I have another script that 'disconnects' the network using iptables rules; I run this script on hostA (this is just to simulate a network failure):
sudo iptables -A INPUT -s <hostB-address> -j DROP
sudo iptables -A OUTPUT -d <hostB-address> -j DROP

I tried changing my config to max_history = 5.

The behavior that I'm seeing follows:

Run my 'disconnect' script
Immediately run my python script which loops forever and polls the _scheduler/docs/_replicator status, returned state is 'running'.
I notice fairly quickly after running the 'disconnect' script in the couchdb.log file:

[error] 2021-05-07T16:05:20.451856Z couchdb@127.0.0.1 <0.597.0> -------- Replicator, request GET to "https://hostB:6984/test/_changes?feed=continuous&style=all_docs&since=%2214260-g1AAAACheJzLYWBgYMpgTmEQTM4vTc5ISXIwNDLXMwBCwxyQVCJDUv3___-zMpiTGBhkuHKBYuxJBokploap2PTgMSmPBUgyNACp_3ADpVdBDDQ1NTNNMcOmNQsA5swpag%22&timeout=3333" failed due to error {error,req_timedout}
[notice] 2021-05-07T16:05:20.452009Z couchdb@127.0.0.1 <0.597.0> -------- Retrying changes request to source database https://hostB:6984/test/ with since=<<"14260-g1AAAACheJzLYWBgYMpgTmEQTM4vTc5ISXIwNDLXMwBCwxyQVCJDUv3__-zMpiTGBhkuHKBYuxJBokploap2PTgMSmPBUgyNACp_3ADpVdBDDQ1NTNNMcOmNQsA5swpag">> in 0.25 seconds
[notice] 2021-05-07T16:05:29.020046Z couchdb@127.0.0.1 <0.520.0> -------- Retrying GET request to https://hostB:6984/test/ in 0.25 seconds due to error req_timedout

About 2 minutes later (note that the still running python script continues to return 'running'):

[error] 2021-05-07T16:07:27.904296Z couchdb@127.0.0.1 <0.597.0> -------- Replicator, request GET to "https://hostB:6984/test/_changes?feed=continuous&style=all_docs&since=%2214260-g1AAAACheJzLYWBgYMpgTmEQTM4vTc5ISXIwNDLXMwBCwxyQVCJDUv3___-zMpiTGBhkuHKBYuxJBokploap2PTgMSmPBUgyNACp_3ADpVdBDDQ1NTNNMcOmNQsA5swpag%22&timeout=3333" failed due to error {error,{conn_failed,{error,etimedout}}}
[error] 2021-05-07T16:07:36.352910Z couchdb@127.0.0.1 <0.520.0> -------- Replicator, request GET to "https://hostB:6984/test/" failed due to error {error,{conn_failed,{error,etimedout}}}
[error] 2021-05-07T16:07:36.353039Z couchdb@127.0.0.1 <0.520.0> -------- ChangesReader process died with reason: {changes_req_failed,{error,{conn_failed,{error,etimedout}}}}
[error] 2021-05-07T16:07:36.353110Z couchdb@127.0.0.1 <0.520.0> -------- Replication b93eed31854a5cddead8f0e8a1819f5a+continuous (https://hostB:6984/test/ -> https://hostA:6984/test/) failed: {changes_req_failed,{error,{conn_failed,{error,etimedout}}}}
[notice] 2021-05-07T16:07:36.353527Z couchdb@127.0.0.1 <0.380.0> -------- couch_replicator_scheduler: Job {"b93eed31854a5cddead8f0e8a1819f5a","+continuous"} started as <0.3373.0>

Finally I see this, and the python script begins returning 'crashing' (this is almost 2.5 minutes since the first error reported in the couchdb log file):

[error] 2021-05-07T16:07:50.447907Z couchdb@127.0.0.1 <0.3373.0> -------- couch_replicator_httpc: auth plugin initialization failed "https://hostB:6984/test/" {session_request_failed,"https://hostB:6984/_session","couchdb",req_timedout}
[error] 2021-05-07T16:07:50.448007Z couchdb@127.0.0.1 <0.3373.0> -------- throw:{replication_auth_error,{session_request_failed,"https://hostB:6984/_session","couchdb",req_timedout}}: Replication b93eed31854a5cddead8f0e8a1819f5a+continuous failed to start "https://hostB:6984/test/" -> "https://hostA:6984/test/" doc <<"shards/80000000-ffffffff/_replicator.1607122044">>:<<"hostB_to_hostA">> stack:[{couch_replicator_httpc,setup,1,[{file,"src/couch_replicator_httpc.erl"},{line,59}]},{couch_replicator_api_wrap,db_open,3,[{file,"src/couch_replicator_api_wrap.erl"},{line,74}]}]

The state seems to only consistently transition from 'running' -> 'crashing' when I see this in the log file, which seems to happen several minutes after I run the disconnect script:
couch_replicator_httpc: auth plugin initialization failed "https://hostB:6984/test/" {session_request_failed,"https://hostB:6984/_session","couchdb",req_timedout}

The rest of my local.ini are just the defaults (I only changed the [replicator] settings and I enabled SSL), but everything else was left alone.
This is with CouchDB 3.1.1 on RHEL 7.6

nickva May 8, 2021
Collaborator

I see a retry happening (Retrying changes request to source database https://hostB:6984/test/ with since=<<"14260-..), so I wonder what happens if you try retries_per_request = 0.

Perhaps, depending on how iptables and TCP sockets interact, the connection doesn't get an immediate failure but keeps waiting for a response. Try curl, for example, to see whether it behaves the same way with those iptables DROP rules. Does it also get stuck sometimes, or does it fail immediately?

QualityControll May 10, 2021
Author

Using curl from hostB to get the '_changes' feed from hostA does time out correctly, even with the iptables rules set (provided --connect-timeout 10 is passed).
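
The same kind of check can be sketched in Python by timing a single request with a hard timeout. This is only an approximation of the curl test: the timeout passed to urlopen applies to the connect and to each blocking socket operation, which is not identical to curl's --connect-timeout. The function name time_request and the opener parameter are illustrative assumptions.

```python
import time
import urllib.request


def time_request(url, timeout_seconds, opener=urllib.request.urlopen):
    """Return (outcome, elapsed_seconds) for one GET with a socket timeout."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout_seconds) as resp:
            resp.read(1)  # read one byte so a stalled response is also caught
        outcome = "ok"
    except OSError as exc:  # URLError and socket timeouts subclass OSError
        outcome = type(exc).__name__
    return outcome, time.monotonic() - start
```

If the elapsed time stays close to the requested timeout while the DROP rules are active, the client side is timing out as expected and the delay lies elsewhere.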

I did manage to figure out how to shorten this timeout. There appears to be an undocumented [replicator] configuration option:

case get_value(path, Params) == "_changes" of
true ->
    Timeout = infinity;
false ->
    Timeout = case config:get("replicator", "request_timeout", "infinity") of
        "infinity" -> infinity;
        Milliseconds -> list_to_integer(Milliseconds)
    end
end,

It seems setting "request_timeout" is necessary to shorten the timeout.

There are still some strange things going on based on the couchdb.log file that I can't really explain. I tried setting 'retries_per_request' to 0, and it still seems to cycle from 0.25, 0.5, 1.0 seconds, and even with 'retries_per_request' set to 1, it does the same thing, so I'm not sure why it seems to be ignoring that setting. Either way, I can now see replicator jobs transition from 'running' -> 'crashing' in about 30 seconds, which is greatly improved, and the desired behavior that I wanted.

I ended up with the following settings:
[replicator]
max_history = 2 (I understand that I won't have much back-off here, for my use-case that's fine)
retries_per_request = 1 (this seems to be ignored...)
connection_timeout = 10000
request_timeout = 5000
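
For anyone who prefers not to edit local.ini by hand, the same values could in principle be set through the node-local configuration endpoint (PUT /_node/_local/_config/{section}/{key}). The sketch below only builds the requests. The base URL, the helper name build_config_requests, and the omission of admin credentials are assumptions, and I have not checked that every one of these keys takes effect without a restart.

```python
import json
import urllib.request

# Values copied from the settings listed above (timeouts in milliseconds).
REPLICATOR_SETTINGS = {
    "max_history": "2",
    "retries_per_request": "1",
    "connection_timeout": "10000",
    "request_timeout": "5000",
}


def build_config_requests(base_url, settings, node="_local"):
    """Build one PUT request per key in the [replicator] section."""
    requests = []
    for key, value in settings.items():
        url = f"{base_url}/_node/{node}/_config/replicator/{key}"
        body = json.dumps(value).encode("utf-8")  # the body is a JSON string
        req = urllib.request.Request(url, data=body, method="PUT")
        req.add_header("Content-Type", "application/json")
        requests.append(req)
    return requests
```

Each request would then be sent with urllib.request.urlopen after adding authentication.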

I think I have everything configured the way it needs to be for my application.

Thanks for your help.

nickva May 10, 2021
Collaborator

Np. Thank you for debugging and sharing the info! I had completely forgotten about the "request_timeout" setting; it's too bad it's not documented.

I can see how we sometimes don't respect the retries setting. In both cases, when we read the _changes feed and when we periodically fetch pending changes, we either override the retry count to 3 or hit failures that we always retry.
