tony-core runtime error #675
Comments
If you submit a TonY app to a secured cluster, the submitting machine must be authenticated, which means a keytab or principal must be provided. You could first use this machine to submit a Spark app as a test. If that works, the TonY app can also be submitted to the cluster.
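Before submitting, it can help to confirm that the machine actually holds a valid Kerberos ticket. The following is a minimal sketch, assuming the MIT Kerberos `klist` tool is on the `PATH`; the helper name `has_valid_ticket` is hypothetical and not part of TonY.

```python
import shutil
import subprocess


def has_valid_ticket(klist_path: str = "klist") -> bool:
    """Return True if `klist -s` reports a readable, unexpired credentials cache."""
    if shutil.which(klist_path) is None:
        # Kerberos client tools are not installed (or not on PATH) here.
        return False
    # `klist -s` prints nothing; exit status 0 means a usable ticket exists.
    result = subprocess.run([klist_path, "-s"], check=False)
    return result.returncode == 0


if __name__ == "__main__":
    if has_valid_ticket():
        print("Kerberos ticket found")
    else:
        print("No valid ticket; run kinit with your principal or keytab first")
```

If this reports no ticket, obtaining one with `kinit` before running the submit command is the first thing to try.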
Thanks for your reply. The cluster is Hadoop 3.2.2 with Kerberos, and I ran the Spark example successfully. I tried the mnist-tensorflow example according to the guide, https://github.com/tony-framework/TonY/tree/master/tony-examples/mnist-tensorflow, but it failed. Do I need any other settings or configuration for this task?
Please attach the detailed error log, the CLI command with its arguments, your tony.xml, and so on.
CLI command: error logs as below:
Is this the same problem as #672? It looks like the NodeManager machine doesn't have the complete Hadoop environment.
You should package your pyenv zip on a Linux machine running the same system as the NM. @tonywang-sh
My pyenv was packaged on an Ubuntu 18.04 system with Anaconda, according to the guide https://github.com/tony-framework/TonY/tree/master/tony-examples/mnist-tensorflow. Do you have another guide on setting up a packaging environment that matches the NM machine's system, so I can build this pyenv zip? Thanks.
Conda is also OK. To check whether the env is OK, you could launch it on your local machine.
I used Anaconda to package a virtualenv Python environment and obtained a pyenv zip, but this zip does not work on the worker nodes. Is this the right method?
Can this pyenv be used on your local machine? You'd better pre-check it.
It worked on the local machine using the "ven/bin/python " command line, but failed on the remote worker node when the task was submitted with the TonY script.
I guess this is because your local machine's env is not consistent with the NodeManager's.
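One concrete way such an inconsistency can show up: the interpreter inside a virtualenv is often a symlink to the Python installation that created it, so it may not resolve on another host. The sketch below inspects a packaged zip for `bin/python*` entries and reports which ones were stored as symlinks. The helper name `find_interpreters` and the archive name in the usage are hypothetical, and this only checks the archive layout; it does not prove the environment will run on the NodeManager.

```python
import stat
import zipfile
from typing import List, Tuple


def find_interpreters(archive_path: str) -> List[Tuple[str, bool]]:
    """List (entry name, is_symlink) for every bin/python* entry in the zip."""
    found = []
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            parts = info.filename.split("/")
            if len(parts) >= 2 and parts[-2] == "bin" and parts[-1].startswith("python"):
                # The upper 16 bits of external_attr hold the Unix file mode.
                mode = info.external_attr >> 16
                found.append((info.filename, stat.S_ISLNK(mode)))
    return found


if __name__ == "__main__":
    # "venv.zip" is a placeholder for your own packaged environment.
    for name, is_link in find_interpreters("venv.zip"):
        kind = "symlink (may not resolve on other hosts)" if is_link else "regular file"
        print(f"{name}: {kind}")
```

An empty result means the archive has no `bin/python*` entry at all, which would also explain a failure on the worker.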
If the pyenv is packaged with virtualenv or Anaconda, does it need to be activated on the worker node, for example with 'venv/bin/activate', before the task starts on the worker? I didn't find this "activate" operation in the TonY project.
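On the activation question: for environments created with `python -m venv` or recent virtualenv releases, `activate` mainly adjusts shell variables such as `PATH`, and running the environment's own interpreter directly also uses that environment. A small diagnostic run with the packaged interpreter can confirm this. The sketch below is illustrative only; the function names are hypothetical, and the `sys.base_prefix` check does not detect conda environments or very old virtualenv versions.

```python
import sys


def in_virtual_env() -> bool:
    """True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment and sys.base_prefix
    # at the installation it was created from; outside one they are equal.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def describe_interpreter() -> dict:
    """Collect the facts worth comparing between the local and worker runs."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "in_virtual_env": in_virtual_env(),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Running the same script locally and as the worker task makes it easy to compare which interpreter each side actually picked up.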
tony.TonyClient reports errors when running a TonY task on YARN with Hadoop 3.2.2. The error messages are below. How should I deal with these errors?
2022-07-26 06:35:41,245 WARN ipc.Client: Exception encountered while connecting to the server
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
at org.apache.hadoop.ipc.Client.call(Client.java:1452)
at org.apache.hadoop.ipc.Client.call(Client.java:1405)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy17.getTaskInfos(Unknown Source)
at com.linkedin.tony.rpc.impl.pb.client.TensorFlowClusterPBClientImpl.getTaskInfos(TensorFlowClusterPBClientImpl.java:77)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy18.getTaskInfos(Unknown Source)
at com.linkedin.tony.rpc.impl.ApplicationRpcClient.getTaskInfos(ApplicationRpcClient.java:82)
at com.linkedin.tony.TonyClient.updateTaskInfoAndReturn(TonyClient.java:1192)
at com.linkedin.tony.TonyClient.monitorApplication(TonyClient.java:1046)
at com.linkedin.tony.TonyClient.run(TonyClient.java:225)
at com.linkedin.tony.TonyClient.start(TonyClient.java:1293)
at java.lang.Thread.run(Thread.java:748)