Channel is not unlocked after processing payment #553

Open
vsbogd opened this issue Oct 6, 2021 · 7 comments
Labels
bug Something isn't working

Comments

vsbogd (Member) commented Oct 6, 2021

This issue started happening frequently on the sound-spleeter service instance. It happens about twice a week.

In the client logs one can see that the client first receives a "cannot get mutex for channel: {ID: 14}" error, and after this the client receives "another transaction on channel: {ID: 14} is in progress".

In the server logs one can see "rpc error: code = Internal desc = cannot get mutex for channel: {ID: 14}" before the payment is completed.

Daemon version:

# snetd version
INFO[0000] Cobra initialized                            
INFO[0000] Configuration file is not set, using default configuration 
version tag: v.5.0.1
built on: 2021-05-16_19:16:27
sha1 revision: b0217231faf2620ab0e6510dcfb122e8e7f453b1

The client and server logs are shown below.
Client log:

2021-10-01 20:03:16,983 [grpc-default-executor-27500] ERROR i.s.soundspleeter.Proxy - request[117402] completed with error
io.grpc.StatusRuntimeException: INTERNAL: cannot get mutex for channel: {ID: 14}
        at io.grpc.Status.asRuntimeException(Status.java:533)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:449)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
        at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
...
2021-10-01 20:04:46,616 [grpc-default-executor-27500] ERROR i.s.soundspleeter.Proxy - request[117406] completed with error
io.grpc.StatusRuntimeException: FAILED_PRECONDITION: another transaction on channel: {ID: 14} is in progress
        at io.grpc.Status.asRuntimeException(Status.java:533)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:449)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
        at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Server log:

time="2021-10-01T20:03:13.847393841Z" level=debug msg="New gRPC call received" context="{MD: map [:authority:[sound-spleeter.singularitynet.io:7068] content-type:[application/grpc] grpc-accept-encoding:[gzip] snet-payment-channel-amount:[5805900] snet-payment-channel-id:[14] snet-payment-channel-nonce:[0] snet-payment-channel-signature-bin:[...] snet-payment-type:[escrow] user-agent:[grpc-java-netty/1.28.0]], Info: {/SoundSpleeter/spleeter true true}"
time="2021-10-01T20:03:13.84754259Z" level=debug msg="Return payment handler by type" paymentType=escrow
time="2021-10-01T20:03:16.98287053Z" level=debug msg="etcd Transaction took 3.000481805s"
time="2021-10-01T20:03:16.983087638Z" level=debug msg=payload payload="{\"type\":\"response\",\"registry_address_key\":\"0x247DEbEBB766E4fA99667265A158060018D5f4F8\",\"ethereum_json_rpc_endpoint\":\"https://mainnet.infura.io/v3/09027f4a13e841d48dbfefc67e7685d5\",\"request_id\":\"c5bmj0bbg458pmbl6s50\",\"organization_id\":\"snet\",\"service_id\":\"sound-spleeter\",\"group_id\":\"nZdFbyUlpWfOuTn0WpJCpKtQATrU6gxz6Wn9zAC2mno=\",\"service_method\":\"/SoundSpleeter/spleeter\",\"response_sent_time\":\"2021-10-01 20:03:16.982996817\",\"request_received_time\":\"2021-10-01 20:03:13.847349713\",\"response_time\":\"3.1356\",\"response_code\":\"Internal\",\"error_message\":\"rpc error: code = Internal desc = cannot get mutex for channel: {ID: 14}\",\"version\":\"v.5.0.1\",\"client_type\":\"\",\"user_details\":\"\",\"user_agent\":\"grpc-java-netty/1.28.0\",\"channel_id\":\"14\",\"username\":\"\",\"operation\":\"read\",\"usage_type\":\"apicall\",\"status\":\"failed\",\"start_time\":\"2021-10-01 20:03:13.847349713\",\"end_time\":\"2021-10-01 20:03:16.982996817\",\"usage_value\":1,\"time_zone\":\"\",\"payment_mode\":\"escrow\",\"user_address\":\"\"}"
time="2021-10-01T20:03:17.533955042Z" level=warning msg="Service call failed with status code : 500 "
time="2021-10-01T20:03:17.534002203Z" level=warning msg="Unable to publish metrics" payload="{\"type\":\"response\",\"registry_address_key\":\"0x247DEbEBB766E4fA99667265A158060018D5f4F8\",\"ethereum_json_rpc_endpoint\":\"https://mainnet.infura.io/v3/09027f4a13e841d48dbfefc67e7685d5\",\"request_id\":\"c5bmj0bbg458pmbl6s50\",\"organization_id\":\"snet\",\"service_id\":\"sound-spleeter\",\"group_id\":\"nZdFbyUlpWfOuTn0WpJCpKtQATrU6gxz6Wn9zAC2mno=\",\"service_method\":\"/SoundSpleeter/spleeter\",\"response_sent_time\":\"2021-10-01 20:03:16.982996817\",\"request_received_time\":\"2021-10-01 20:03:13.847349713\",\"response_time\":\"3.1356\",\"response_code\":\"Internal\",\"error_message\":\"rpc error: code = Internal desc = cannot get mutex for channel: {ID: 14}\",\"version\":\"v.5.0.1\",\"client_type\":\"\",\"user_details\":\"\",\"user_agent\":\"grpc-java-netty/1.28.0\",\"channel_id\":\"14\",\"username\":\"\",\"operation\":\"read\",\"usage_type\":\"apicall\",\"status\":\"failed\",\"start_time\":\"2021-10-01 20:03:13.847349713\",\"end_time\":\"2021-10-01 20:03:16.982996817\",\"usage_value\":1,\"time_zone\":\"\",\"payment_mode\":\"escrow\",\"user_address\":\"\"}" url="https://marketplace-mt-v2.singularitynet.io/metering/usage"
time="2021-10-01T20:03:24.188823457Z" level=debug msg="Payment completed"
time="2021-10-01T20:03:24.428001054Z" level=debug msg="etcd Transaction took 175.435209ms"
time="2021-10-01T20:03:24.428045194Z" level=debug msg="Channel unlocked"
anandrgitnirman (Contributor) commented Oct 6, 2021

@vsbogd

  • Are you trying to get a new token?
  • Are you manually unlocking the channel?
  • Any rough idea of when you first saw this?

vsbogd (Member, Author) commented Oct 6, 2021

Are you trying to get a new token?

No.

Are you manually unlocking the channel?

Yes, that is what I need to do each time to make the application work again.

Any rough idea of when you first saw this?

The first time I saw this issue in the log was on Sept 27 at 04:48:49,077 GMT. Since then it has happened three times, AFAIK.

anandrgitnirman (Contributor)

#507, we need to get to this soon; it has been pending a while now.
We haven't done a daemon release in a very long time. I see the Java SDK still uses pay-per-use and not the concurrency/token feature; either way, the daemon should handle channel locks gracefully.
I will try to reproduce this scenario.

anandrgitnirman (Contributor)

@vsbogd, can you also point me to the client code? It might give more leads on possibly simulating the error.

vsbogd (Member, Author) commented Oct 7, 2021

@vsbogd, can you also point me to the client code? It might give more leads on possibly simulating the error.

Client code is here: https://github.com/vsbogd/sound-spleeter-proxy (README.md explains how to run it), but the issue happens about once per 8000 client requests, so I don't think it is easily reproducible.

I think it is simpler to find the root cause by code analysis; then one could write a unit test to reproduce the behavior.
I see the error is raised in the following line:

return nil, fmt.Errorf("cannot get mutex for channel: %v", key)

The first issue is that the underlying error is neither logged nor wrapped into the returned error, so we don't have enough information to understand what happened. I would fix that first.
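A minimal sketch of such a fix, assuming a call site shaped like the line quoted above (the Locker and Lock shapes here are illustrative stand-ins, not the daemon's real types):

package escrow

import "fmt"

// Illustrative shapes, inferred from the Lock signature quoted in this comment.
type Lock interface{ Unlock() error }

type Locker interface {
    Lock(name string) (lock Lock, ok bool, err error)
}

// getChannelMutex sketches the proposed fix: wrap the underlying cause with
// %w instead of dropping it, so the log tells us *why* the lock failed.
func getChannelMutex(locker Locker, key string) (Lock, error) {
    lock, ok, err := locker.Lock(key)
    if err != nil {
        return nil, fmt.Errorf("cannot get mutex for channel: %v, cause: %w", key, err)
    }
    if !ok {
        return nil, fmt.Errorf("cannot get mutex for channel: %v: channel is already locked", key)
    }
    return lock, nil
}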

There are two places inside Lock which can lead to the error:

func (locker *etcdLocker) Lock(name string) (lock Lock, ok bool, err error) {
    value, ok, err := locker.storage.Get(name)
    if err != nil {
        return
    }
    if ok {
        if value == locked {
            return nil, false, nil
        }
        ok, err = locker.storage.CompareAndSwap(name, unlocked, locked)
    } else {
        ok, err = locker.storage.PutIfAbsent(name, locked)
    }
    if err != nil || !ok {
        return
    }
    return &lockType{
        name:   name,
        locker: locker,
    }, true, nil
}

Taking into account that the lock actually happens, the most suspicious place is the CompareAndSwap call, which is essentially an ExecuteTransaction call:
func (client *EtcdClient) ExecuteTransaction(request storage.CASRequest) (ok bool, err error) {
    transaction, err := client.StartTransaction(request.ConditionKeys)
    if err != nil {
        return false, err
    }
    // We should also have a configuration on how many times you try this (say 100 times)
    for {
        oldValues, err := transaction.GetConditionValues()
        if err != nil {
            return false, err
        }
        newValues, ok, err := request.Update(oldValues)
        if err != nil {
            return false, err
        }
        if !ok {
            return false, nil
        }
        if ok, err = client.CompleteTransaction(transaction, newValues); err != nil {
            return false, err
        }
        if ok {
            return true, nil
        }
        if !request.RetryTillSuccessOrError {
            return false, nil
        }
    }
}
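Note how an error from CompleteTransaction exits this loop immediately; only a failed compare is retried. A tiny self-contained illustration of that propagation (the function shapes are simplified stand-ins, not the daemon's code):

package main

import (
    "context"
    "fmt"
)

func completeTransaction() (bool, error) {
    return false, context.DeadlineExceeded // simulate the 3s commit timeout
}

func executeTransaction() (bool, error) {
    for {
        ok, err := completeTransaction()
        if err != nil {
            return false, err // a timeout is not retried; it propagates up
        }
        if ok {
            return true, nil
        }
        // only a failed compare loops back and retries
    }
}

func main() {
    ok, err := executeTransaction()
    fmt.Println(ok, err) // false, context deadline exceeded
}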

Judging by the "Transaction took" entry in the log, we can say CompleteTransaction was actually called. Btw, the transaction took 3 seconds, which looks like the timeout value:

func (client *EtcdClient) CompleteTransaction(_transaction storage.Transaction, update []storage.KeyValueData) (
    ok bool, err error) {
    var transaction *etcdTransaction = _transaction.(*etcdTransaction)
    ctx, cancel := context.WithTimeout(context.Background(), client.timeout)
    defer cancel()
    defer ctx.Done()
    startime := time.Now()
    txn := client.etcdv3.KV.Txn(ctx)
    var txnResp *clientv3.TxnResponse
    conditionKeys := transaction.ConditionKeys
    ifCompares := make([]clientv3.Cmp, len(transaction.ConditionValues))
    for i, cmp := range transaction.ConditionValues {
        if cmp.Present {
            ifCompares[i] = clientv3.Compare(clientv3.ModRevision(cmp.Key), "=", cmp.Version)
        } else {
            ifCompares[i] = clientv3.Compare(clientv3.CreateRevision(cmp.Key), "=", 0)
        }
    }
    thenOps := make([]clientv3.Op, len(update))
    for index, op := range update {
        // TODO: add OpDelete if op.Present is false
        thenOps[index] = clientv3.OpPut(op.Key, op.Value)
    }
    elseOps := make([]clientv3.Op, len(conditionKeys))
    for index, key := range conditionKeys {
        elseOps[index] = clientv3.OpGet(key)
    }
    txnResp, err = txn.If(ifCompares...).Then(thenOps...).Else(elseOps...).Commit()
    endtime := time.Now()
    log.Debugf("etcd Transaction took %v", endtime.Sub(startime))
    if err != nil {
        return false, err
    }
    if txnResp == nil {
        return false, fmt.Errorf("transaction response is nil")
    }
    if txnResp.Succeeded {
        return true, nil
    }
    var latestValues []keyValueVersion
    if latestValues, err = client.checkTxnResponse(conditionKeys, txnResp); err != nil {
        return false, err
    }
    transaction.ConditionValues = latestValues
    // Succeeded is set to true if the compare evaluated to true, false otherwise.
    return txnResp.Succeeded, nil
}

As the transaction itself looks safe, I would suppose the transaction was sent to the etcd server and received by it, but the response was not received by the client because of the timeout. It looks like etcd finished the transaction, but the client got a timeout error, and this led to the inconsistency.
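To make the suspected race concrete, here is a standalone sketch against a local etcd (the endpoint and key are hypothetical, and an unrealistically small timeout stands in for the 3 s deadline). The context deadline bounds the whole Commit round trip, so the client can time out after etcd has already applied the Put, leaving the key "locked" with nobody holding the lock object:

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"localhost:2379"},
        DialTimeout: 2 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    // Unrealistically small timeout to provoke the client-side deadline.
    ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
    defer cancel()

    _, err = cli.Txn(ctx).
        If(clientv3.Compare(clientv3.CreateRevision("/lock/channel/14"), "=", 0)).
        Then(clientv3.OpPut("/lock/channel/14", "locked")).
        Commit()
    fmt.Println("client saw:", err) // context deadline exceeded

    // etcd may nevertheless have applied the Put: check with a fresh context.
    resp, err := cli.Get(context.Background(), "/lock/channel/14")
    if err == nil && len(resp.Kvs) > 0 {
        fmt.Println("but etcd stored:", string(resp.Kvs[0].Value)) // possibly "locked"
    }
}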

What do you think? Do the etcd logs for this moment in time contain anything interesting?
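Also, a first cut at the unit test mentioned above, with the hypothesis baked into a fake storage (the interface is a simplified stand-in for the daemon's atomic storage, not the real type):

package etcddb_test

import (
    "context"
    "testing"
)

const (
    locked   = "locked"
    unlocked = "unlocked"
)

type fakeStorage struct {
    values     map[string]string
    timeoutCAS bool // simulate a client-side timeout on CompareAndSwap
}

func (s *fakeStorage) CompareAndSwap(key, oldValue, newValue string) (bool, error) {
    if s.values[key] != oldValue {
        return false, nil
    }
    s.values[key] = newValue // the "server" applies the swap...
    if s.timeoutCAS {
        return false, context.DeadlineExceeded // ...but the caller times out
    }
    return true, nil
}

func TestTimedOutSwapLeavesChannelLocked(t *testing.T) {
    s := &fakeStorage{
        values:     map[string]string{"channel/14": unlocked},
        timeoutCAS: true,
    }

    // First attempt: the swap "succeeds" in storage but the caller gets an
    // error, which surfaces as "cannot get mutex for channel: {ID: 14}".
    if _, err := s.CompareAndSwap("channel/14", unlocked, locked); err == nil {
        t.Fatal("expected a timeout error")
    }

    // All later attempts fail too: the key is already "locked" and no one
    // holds a lock object to unlock it. This is the reported symptom.
    s.timeoutCAS = false
    if ok, err := s.CompareAndSwap("channel/14", unlocked, locked); ok || err != nil {
        t.Fatalf("expected ok=false, err=nil, got ok=%v err=%v", ok, err)
    }
}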

vsbogd (Member, Author) commented Oct 7, 2021

Btw, I am not sure about the priority of this. Are we going to completely decommission escrow payment transactions, given that they are already deprecated?

If so, maybe we can simply migrate the code to the tokens approach and finish the decommissioning. I cannot say when that will be possible though, because we don't have the resources for this work.

anandrgitnirman (Contributor)

To me this is a production issue, and I don't want anybody to have to do a manual workaround.
Let's do this for now:

  • fix the logging
  • increase the timeout to 5 seconds on the blockchain

We can try simulating it with a test case as suggested and deep dive; I will post anything seen on etcd.

anandrgitnirman pushed a commit to anandrgitnirman/snet-daemon that referenced this issue Oct 7, 2021
added better logging on to debug the channel lock issue
anandrgitnirman pushed a commit to anandrgitnirman/snet-daemon that referenced this issue Oct 7, 2021
added better logging on to debug the channel lock issue
anandrgitnirman pushed a commit to anandrgitnirman/snet-daemon that referenced this issue Oct 8, 2021
more debug statements to resolve channel lock
anandrgitnirman pushed a commit to anandrgitnirman/snet-daemon that referenced this issue Oct 13, 2021
Use License ID to track license as this will be used to upgrade / downgrade and keep track of any Audit trail
semyon-dev added the bug label on Apr 19, 2024