Skip to content

Commit

Permalink
[KYUUBI #333] Fix hanging due to disordered service stopping
Browse files Browse the repository at this point in the history
![yaooqinn](https://badgen.net/badge/Hello/yaooqinn/green) [![Closes #333](https://badgen.net/badge/Preview/Closes%20%23333/blue)](https://github.com/yaooqinn/kyuubi/pull/333) ![10](https://badgen.net/badge/%2B/10/red) ![6](https://badgen.net/badge/-/6/green) ![1](https://badgen.net/badge/commits/1/yellow) ![Target Issue](https://badgen.net/badge/Missing/Target%20Issue/ff0000) [&#10088;?&#10089;](https://pullrequestbadge.com/?utm_medium=github&utm_source=yaooqinn&utm_campaign=badge_info)<!-- PR-BADGE: PLEASE DO NOT REMOVE THIS COMMENT -->

<!--
Thanks for sending a pull request!

Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
  2. If the PR is related to an issue in https://github.com/yaooqinn/kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
-->

### _Why are the changes needed?_
<!--
Please clarify why the changes are needed. For instance,
  1. If you add a feature, you can talk about the use case of it.
  2. If you fix a bug, you can clarify why it is a bug.
-->

the service discovery should stop before the embedded zk server.

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [x] Add screenshots for manual tests if appropriate

```before
2021-01-31 11:26:15.800 ERROR server.KyuubiServer: RECEIVED SIGNAL 15: TERM
2021-01-31 11:26:15.801 INFO server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:58707 which had sessionid 0x1000fbbfad70002
2021-01-31 11:26:15.801 INFO server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:58490 which had sessionid 0x1000fbbfad70000
2021-01-31 11:26:15.801 INFO server.NIOServerCnxnFactory: NIOServerCnxn factory exited run method
2021-01-31 11:26:15.801 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x1000fbbfad70000, likely server has closed socket, closing socket connection and attempting reconnect
2021-01-31 11:26:15.801 INFO server.ZooKeeperServer: shutting down
2021-01-31 11:26:15.801 INFO server.SessionTrackerImpl: Shutting down
2021-01-31 11:26:15.801 INFO server.PrepRequestProcessor: Shutting down
2021-01-31 11:26:15.802 INFO server.KyuubiServer: Service: [ServiceDiscovery] is stopping.
2021-01-31 11:26:15.802 INFO server.SyncRequestProcessor: Shutting down
2021-01-31 11:26:15.802 INFO server.PrepRequestProcessor: PrepRequestProcessor exited loop!
2021-01-31 11:26:15.802 INFO server.SyncRequestProcessor: SyncRequestProcessor exited!
2021-01-31 11:26:15.802 INFO server.FinalRequestProcessor: shutdown of request processor complete
2021-01-31 11:26:15.804 INFO server.EmbeddedZkServer: Service[EmbeddedZkServer] is stopped.
2021-01-31 11:26:15.906 INFO state.ConnectionStateManager: State change: SUSPENDED
2021-01-31 11:26:15.907 INFO client.ServiceDiscovery: Zookeeper client connection state changed to: SUSPENDED
2021-01-31 11:26:17.615 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-01-31 11:26:17.615 INFO zookeeper.ClientCnxn: Socket error occurred: localhost/127.0.0.1:2181: Connection refused
2021-01-31 11:26:18.793 INFO server.SessionTrackerImpl: SessionTrackerImpl exited loop!
2021-01-31 11:26:19.126 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-01-31 11:26:19.126 INFO zookeeper.ClientCnxn: Socket error occurred: localhost/127.0.0.1:2181: Connection refused
2021-01-31 11:26:20.815 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-01-31 11:26:20.815 INFO zookeeper.ClientCnxn: Socket error occurred: localhost/127.0.0.1:2181: Connection refused
2021-01-31 11:26:21.947 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-01-31 11:26:21.948 INFO zookeeper.ClientCnxn: Socket error occurred: localhost/127.0.0.1:2181: Connection refused
2021-01-31 11:26:23.288 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-01-31 11:26:23.289 INFO zookeeper.ClientCnxn: Socket error occurred: localhost/127.0.0.1:2181: Connection refused
2021-01-31 11:26:25.075 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-01-31 11:26:25.075 INFO zookeeper.ClientCnxn: Socket error occurred: localhost/127.0.0.1:2181: Connection refused
2021-01-31 11:26:26.965 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-01-31 11:26:26.965 INFO zookeeper.ClientCnxn: Socket error occurred: localhost/127.0.0.1:2181: Connection refused
2021-01-31 11:26:28.412 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-01-31 11:26:28.412 INFO zookeeper.ClientCnxn: Socket error occurred: localhost/127.0.0.1:2181: Connection refused
2021-01-31 11:26:30.356 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-01-31 11:26:30.356 INFO zookeeper.ClientCnxn: Socket error occurred: localhost/127.0.0.1:2181: Connection refused
2021-01-31 11:26:31.751 ERROR curator.ConnectionState: Connection timed out for connection string (127.0.0.1:2181) and timeout (15000) / elapsed (15841)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
```
```after
2021-01-31 11:49:15.070 ERROR server.KyuubiServer: RECEIVED SIGNAL 15: TERM
2021-01-31 11:49:15.071 INFO server.KyuubiServer: Service: [ServiceDiscovery] is stopping.
2021-01-31 11:49:15.074 INFO server.PrepRequestProcessor: Processed session termination for sessionid: 0x1000fd1db370000
2021-01-31 11:49:15.074 WARN client.ServiceDiscovery: This Kyuubi instance localhost:10009 is now de-registered from ZooKeeper. The server will be shut down after the last client session completes.
2021-01-31 11:49:15.074 INFO client.ServiceDiscovery: Service[ServiceDiscovery] is stopped.
2021-01-31 11:49:15.075 INFO server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:63940 which had sessionid 0x1000fd1db370000
2021-01-31 11:49:15.075 INFO zookeeper.ZooKeeper: Session: 0x1000fd1db370000 closed
2021-01-31 11:49:15.076 WARN client.ServiceDiscovery: Service[ServiceDiscovery] is not started(STOPPED) yet.
2021-01-31 11:49:15.076 INFO server.KyuubiServer: Service: [FrontendService] is stopping.
2021-01-31 11:49:15.078 INFO service.FrontendService: FrontendService has stopped
2021-01-31 11:49:15.078 INFO service.FrontendService: Service[FrontendService] is stopped.
2021-01-31 11:49:15.078 INFO server.KyuubiServer: Service: [KyuubiBackendService] is stopping.
2021-01-31 11:49:15.078 INFO server.KyuubiBackendService: Service: [KyuubiSessionManager] is stopping.
2021-01-31 11:49:15.078 INFO session.KyuubiSessionManager: Service: [KyuubiOperationManager] is stopping.
2021-01-31 11:49:15.078 INFO operation.KyuubiOperationManager: Service[KyuubiOperationManager] is stopped.
2021-01-31 11:49:15.078 INFO session.KyuubiSessionManager: Service[KyuubiSessionManager] is stopped.
2021-01-31 11:49:15.079 INFO server.KyuubiBackendService: Service[KyuubiBackendService] is stopped.
2021-01-31 11:49:15.079 INFO server.KyuubiServer: Service: [KinitAuxiliaryService] is stopping.
2021-01-31 11:49:15.079 INFO service.KinitAuxiliaryService: Service[KinitAuxiliaryService] is stopped.
2021-01-31 11:49:15.079 INFO server.KyuubiServer: Service[KyuubiServer] is stopped.
2021-01-31 11:49:15.079 INFO server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:64039 which had sessionid 0x1000fd1db370002
2021-01-31 11:49:15.079 INFO server.NIOServerCnxnFactory: NIOServerCnxn factory exited run method
2021-01-31 11:49:15.079 INFO server.ZooKeeperServer: shutting down
2021-01-31 11:49:15.079 INFO server.SessionTrackerImpl: Shutting down
2021-01-31 11:49:15.079 INFO server.PrepRequestProcessor: Shutting down
2021-01-31 11:49:15.080 INFO server.SyncRequestProcessor: Shutting down
2021-01-31 11:49:15.080 INFO server.PrepRequestProcessor: PrepRequestProcessor exited loop!
2021-01-31 11:49:15.080 INFO server.SyncRequestProcessor: SyncRequestProcessor exited!
2021-01-31 11:49:15.080 INFO server.FinalRequestProcessor: shutdown of request processor complete
2021-01-31 11:49:15.081 INFO server.EmbeddedZkServer: Service[EmbeddedZkServer] is stopped.
```

- [x] [Run test](https://kyuubi.readthedocs.io/en/latest/tools/testing.html#running-tests) locally before make a pull request

Closes #333 from yaooqinn/hang.

945e1ba [jhx1008] Fix hanging due to disordered service stopping

Authored-by: jhx1008 <jhx1008@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 9db0088)
Signed-off-by: Kent Yao <yao@apache.org>
  • Loading branch information
jhx1008 authored and yaooqinn committed Jan 31, 2021
1 parent 45f6bb3 commit 1932f6a
Show file tree
Hide file tree
Showing 2 changed files with 10 additions and 6 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -47,13 +47,18 @@ abstract class Serverable(name: String) extends CompositeService(name) {

override def stop(): Unit = synchronized {
try {
stopServer()
if (started.getAndSet(false)) {
super.stop()
}
} catch {
case t: Throwable =>
warn(s"Error stopping spark ${t.getMessage}", t)
warn(s"Error stopping $name ${t.getMessage}", t)
} finally {
if (started.getAndSet(false)) {
super.stop()
try {
stopServer()
} catch {
case t: Throwable =>
warn(s"Error stopping spark ${t.getMessage}", t)
}
}
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,6 @@ object KyuubiServer extends Logging {
if (!ServiceDiscovery.supportServiceDiscovery(conf)) {
zkServer.initialize(conf)
zkServer.start()
sys.addShutdownHook(zkServer.stop())
conf.set(HA_ZK_QUORUM, zkServer.getConnectString)
conf.set(HA_ZK_ACL_ENABLED, false)
}
Expand Down Expand Up @@ -82,7 +81,7 @@ class KyuubiServer(name: String) extends Serverable(name) {
override private[kyuubi] val backendService: AbstractBackendService = new KyuubiBackendService()
private val discoveryService = new ServiceDiscovery(this)

override def initialize(conf: KyuubiConf): Unit = {
override def initialize(conf: KyuubiConf): Unit = synchronized {
val kinit = new KinitAuxiliaryService()
addService(kinit)
super.initialize(conf)
Expand Down

0 comments on commit 1932f6a

Please sign in to comment.