Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NumberFormatException while getting sender sequence number #601

Closed
ajinkyatankhiwale1510 opened this issue Jan 12, 2023 · 18 comments
Closed
Labels

Comments

@ajinkyatankhiwale1510
Copy link

Describe the bug
After restarting fix acceptor, we saw an unusual issue, where config was not loaded as expected and saw below error in server logs

11-01-2023 21:14:04,920 INFO [quickfix.ThreadedSocketAcceptor] (ServerService Thread Pool -- 79) Logging out all sessions
11-01-2023 21:14:04,921 ERROR [org.apache.camel.impl.engine.AbstractCamelContext] (ServerService Thread Pool -- 79) Error starting CamelContext (camel-1) due to exception thrown: quickfix.ConfigError: error during session initialization: org.apache.camel.RuntimeCamelException: quickfix.ConfigError: error during session initialization
at org.apache.camel.RuntimeCamelException.wrapRuntimeException(RuntimeCamelException.java:66) [camel-api-3.9.0.jar:3.9.0]
at org.apache.camel.support.service.BaseService.doFail(BaseService.java:409) [camel-api-3.9.0.jar:3.9.0]
at org.apache.camel.support.service.BaseService.fail(BaseService.java:338) [camel-api-3.9.0.jar:3.9.0]
at org.apache.camel.support.service.BaseService.start(BaseService.java:128) [camel-api-3.9.0.jar:3.9.0]
at org.apache.camel.component.quickfixj.QuickfixjComponent.startQuickfixjEngine(QuickfixjComponent.java:147) [camel-quickfix-3.9.0.jar:3.9.0]
at org.apache.camel.component.quickfixj.QuickfixjComponent.onCamelContextStarted(QuickfixjComponent.java:226) [camel-quickfix-3.9.0.jar:3.9.0]
at org.apache.camel.impl.engine.InternalRouteStartupManager.safelyStartRouteServices(InternalRouteStartupManager.java:196) [camel-base-engine-3.9.0.jar:3.9.0]
at org.apache.camel.impl.engine.InternalRouteStartupManager.doStartOrResumeRoutes(InternalRouteStartupManager.java:147) [camel-base-engine-3.9.0.jar:3.9.0]
at org.apache.camel.impl.engine.AbstractCamelContext.doStartCamel(AbstractCamelContext.java:3177) [camel-base-engine-3.9.0.jar:3.9.0]
at org.apache.camel.impl.engine.AbstractCamelContext.doStartContext(AbstractCamelContext.java:2861) [camel-base-engine-3.9.0.jar:3.9.0]
at org.apache.camel.impl.engine.AbstractCamelContext.doStart(AbstractCamelContext.java:2812) [camel-base-engine-3.9.0.jar:3.9.0]
at org.apache.camel.support.service.BaseService.start(BaseService.java:115) [camel-api-3.9.0.jar:3.9.0]
at org.apache.camel.impl.engine.AbstractCamelContext.start(AbstractCamelContext.java:2507) [camel-base-engine-3.9.0.jar:3.9.0]
at org.apache.camel.impl.DefaultCamelContext.start(DefaultCamelContext.java:174) [camel-core-engine-3.9.0.jar:3.9.0]
at com.eurexrepo.ntp.fix.acceptor.StartAcceptor.startQuickFIxCamelContext(StartAcceptor.java:141) [classes:]
at com.eurexrepo.ntp.fix.acceptor.StartAcceptor.initInstance(StartAcceptor.java:78) [classes:]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_322]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_322]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_322]
at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_322]
at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:389) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:333) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:157) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyBeanPostProcessorsBeforeInitialization(AbstractAutowireCapableBeanFactory.java:422) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1778) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:602) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:524) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:335) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:333) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:208) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:944) [spring-beans-5.3.5.jar:5.3.5]
at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:918) [spring-context-5.3.5.jar:5.3.5]
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:583) [spring-context-5.3.5.jar:5.3.5]
at org.springframework.web.context.ContextLoader.configureAndRefreshWebApplicationContext(ContextLoader.java:401) [spring-web-5.2.22.RELEASE.jar:5.2.22.RELEASE]
at org.springframework.web.context.ContextLoader.initWebApplicationContext(ContextLoader.java:292) [spring-web-5.2.22.RELEASE.jar:5.2.22.RELEASE]
at org.springframework.web.context.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:103) [spring-web-5.2.22.RELEASE.jar:5.2.22.RELEASE]
at io.undertow.servlet.core.ApplicationListeners.contextInitialized(ApplicationListeners.java:187) [undertow-servlet-2.2.5.Final-redhat-00001.jar:2.2.5.Final-redhat-00001]
at io.undertow.servlet.core.DeploymentManagerImpl$1.call(DeploymentManagerImpl.java:219) [undertow-servlet-2.2.5.Final-redhat-00001.jar:2.2.5.Final-redhat-00001]
at io.undertow.servlet.core.DeploymentManagerImpl$1.call(DeploymentManagerImpl.java:187) [undertow-servlet-2.2.5.Final-redhat-00001.jar:2.2.5.Final-redhat-00001]
at io.undertow.servlet.core.ServletRequestContextThreadSetupAction$1.call(ServletRequestContextThreadSetupAction.java:42) [undertow-servlet-2.2.5.Final-redhat-00001.jar:2.2.5.Final-redhat-00001]
at io.undertow.servlet.core.ContextClassLoaderSetupAction$1.call(ContextClassLoaderSetupAction.java:43) [undertow-servlet-2.2.5.Final-redhat-00001.jar:2.2.5.Final-redhat-00001]
at org.wildfly.extension.undertow.security.SecurityContextThreadSetupAction.lambda$create$0(SecurityContextThreadSetupAction.java:105)
at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1530)
at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1530)
at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1530)
at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1530)
at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1530)
at io.undertow.servlet.core.DeploymentManagerImpl.deploy(DeploymentManagerImpl.java:255) [undertow-servlet-2.2.5.Final-redhat-00001.jar:2.2.5.Final-redhat-00001]
at org.wildfly.extension.undertow.deployment.UndertowDeploymentService.startContext(UndertowDeploymentService.java:96)
at org.wildfly.extension.undertow.deployment.UndertowDeploymentService$1.run(UndertowDeploymentService.java:78)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_322]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_322]
at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1990)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1486)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1377)
at java.lang.Thread.run(Thread.java:750) [rt.jar:1.8.0_322]
at org.jboss.threads.JBossThread.run(JBossThread.java:513)
Caused by: quickfix.ConfigError: error during session initialization
at quickfix.mina.acceptor.AbstractSocketAcceptor.createSessions(AbstractSocketAcceptor.java:251) [quickfixj-core-2.3.0.jar:2.3.0]
at quickfix.mina.acceptor.AbstractSocketAcceptor.startAcceptingConnections(AbstractSocketAcceptor.java:98) [quickfixj-core-2.3.0.jar:2.3.0]
at quickfix.ThreadedSocketAcceptor.start(ThreadedSocketAcceptor.java:103) [quickfixj-core-2.3.0.jar:2.3.0]
at org.apache.camel.component.quickfixj.QuickfixjEngine.doStart(QuickfixjEngine.java:237) [camel-quickfix-3.9.0.jar:3.9.0]
at org.apache.camel.support.service.BaseService.start(BaseService.java:115) [camel-api-3.9.0.jar:3.9.0]
... 55 more
Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: For input string: ""
at quickfix.FileStoreFactory.create(FileStoreFactory.java:82) [quickfixj-core-2.3.0.jar:2.3.0]
at quickfix.Session.(Session.java:526) [quickfixj-core-2.3.0.jar:2.3.0]
at quickfix.DefaultSessionFactory.create(DefaultSessionFactory.java:224) [quickfixj-core-2.3.0.jar:2.3.0]
at quickfix.mina.acceptor.AbstractSocketAcceptor.setupSession(AbstractSocketAcceptor.java:214) [quickfixj-core-2.3.0.jar:2.3.0]
at quickfix.mina.acceptor.AbstractSocketAcceptor.createSessions(AbstractSocketAcceptor.java:245) [quickfixj-core-2.3.0.jar:2.3.0]
... 59 more
Caused by: java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) [rt.jar:1.8.0_322]
at java.lang.Integer.parseInt(Integer.java:592) [rt.jar:1.8.0_322]
at java.lang.Integer.parseInt(Integer.java:615) [rt.jar:1.8.0_322]
at quickfix.FileStore.initializeSequenceNumbers(FileStore.java:160) [quickfixj-core-2.3.0.jar:2.3.0]
at quickfix.FileStore.initializeCache(FileStore.java:118) [quickfixj-core-2.3.0.jar:2.3.0]
at quickfix.FileStore.initialize(FileStore.java:112) [quickfixj-core-2.3.0.jar:2.3.0]
at quickfix.FileStore.(FileStore.java:96) [quickfixj-core-2.3.0.jar:2.3.0]
at quickfix.FileStoreFactory.create(FileStoreFactory.java:80) [quickfixj-core-2.3.0.jar:2.3.0]
... 63 more

Log suggests that issue might be during initializing sequence store, below is the snippet from where exception was thrown.
In statement cache.setNextSenderMsgSeqNum(Integer.parseInt(s));

private void initializeSequenceNumbers() throws IOException {
senderSequenceNumberFile.seek(0);
if (senderSequenceNumberFile.length() > 0) {
final String s = senderSequenceNumberFile.readUTF();
cache.setNextSenderMsgSeqNum(Integer.parseInt(s));
}
targetSequenceNumberFile.seek(0);
if (targetSequenceNumberFile.length() > 0) {
final String s = targetSequenceNumberFile.readUTF();
cache.setNextTargetMsgSeqNum(Integer.parseInt(s));
}
}

To Reproduce
Happened once, not able to reproduce.

Expected behavior
When server is started, sequence should be reset as expected and should not throw numberformat exception

system information:

  • OS: Red Hat Enterprise Linux 8.6 (Ootpa)
  • Java version openjdk 8
  • QFJ Version 2.3.0

fix-acceptor-settings.txt

@chrjohn
Copy link
Member

chrjohn commented Jan 12, 2023

Looks like the file was existing but empty? Could this be related to #540 ? Are you sure that the access to the store is only done by one thread, I.e. could it be that you started up two services for the same session?

@ajinkyatankhiwale1510
Copy link
Author

Yes, file is empty for the members who have not yet done session login. I can't relate this issue with #540, in that issue during session reset StreamClosed would occur and server itself might be running as expected. Switching to other available session would solve our problem.

In this case, fix engine itself did not start correctly (Happened only once until now)

We have a failover mechanism which is running in active-active mode. In case, one node goes down. users request will be redirected to node2 via WAF where they can do session logon again and continue. So, yes, they share common store files. But application itself is deployed on different machines.

@chrjohn
Copy link
Member

chrjohn commented Jan 12, 2023

Do you happen to have logs from both instances during the time of session test and startup?

@ajinkyatankhiwale1510
Copy link
Author

Attached are logs from both instances, for security reasons I have masked SenderCompId
server-node1.log
server-node2.log

@gbirchmeier
Copy link
Contributor

gbirchmeier commented Jan 17, 2023

Hi @chrjohn - are the logs from @ajinkyatankhiwale1510 what you need to see? His firm is working with Connamara -- if you need something different, we can help him get what is needed to see to progress the issue.

@chrjohn
Copy link
Member

chrjohn commented Jan 18, 2023

Hi @gbirchmeier , I will take another look tomorrow but I think the logs are the correct ones. I'm just unsure at the moment how the problem could happen. To me it looks like a concurrency issue.
The services are deployed on different machines but share the same file store. How? NFS mount?

@ajinkyatankhiwale1510
Copy link
Author

Hi @chrjohn , you are correct, both this machines have a central storage maintained on NFS file share.
Server and NFS

@chrjohn
Copy link
Member

chrjohn commented Jan 19, 2023

Actually I think I cannot give you a recommendation on this. I don't have any experience using the FileStore on a distributed file system. I'd go for a database in such environments.
As I commented in #601 (comment) it looks like the file was there with a size greater than 0 but contained an empty String. To me it looks like the file was tried to be initialized concurrently by two threads.
How do you make sure that only one instance initializes the store?

Is #540 also happening on this infrastructure? That might explain the Stream Closed Exceptions. Never saw those on a regular file system AFAIR.

@ajinkyatankhiwale1510
Copy link
Author

For both nodes, start time and end time is 00:00:00, at this time only initialization should happen. Apart from that, in case initiator sends sequence reset flag then store gets re initialized or system is restarted to load new cfg file other wise during the normal operation store continues as it should be.

In fact, node2(erps0jb02) is just a failover node, request comes at this node only in case node 1(erps0jb01) goes down. In that case too, user session is terminated and they have to make session login again on node 2. For performance reasons we decided to go to FileStore.

You are correct, #540 is happening on the same node. For Stream Closed, we have custom implementation of Message Store in place currently which is under test. I would update in #540 with the results to close the discussion.

@chrjohn
Copy link
Member

chrjohn commented Jan 19, 2023

For both nodes, start time and end time is 00:00:00, at this time only initialization should happen.

Do you have a log file from both nodes at that time at the problematic day?

For performance reasons we decided to go to FileStore.

Just out of own interest: NFS FileStore was faster than using data base?

@ajinkyatankhiwale1510
Copy link
Author

Yes, I have uploaded logs for both nodes with my earlier comment
server-node1.log
server-node2.log

Just out of own interest: NFS FileStore was faster than using data base?
We haven't tried database, but we think it would be slower considering we have 1000's of TradeCaptureReport and QuoteStatusReport being sent in few seconds timeframe. As far, we don't see anyone from our customers complaining about performance of report being sent. They are handled pretty fast enough.

@chrjohn
Copy link
Member

chrjohn commented Jan 19, 2023

Yes, I have uploaded logs for both nodes with my earlier comment

I have asked for a log around the time of 00:00. There should be something like "session state not current" in the log.

@ajinkyatankhiwale1510
Copy link
Author

server-node 1.log
server-node 2.log
Sorry, this logs would help you

@chrjohn
Copy link
Member

chrjohn commented Jan 19, 2023

I did not find any occurrence of "session state not current". Could you check with grep or similar if it appears at some time?

@ajinkyatankhiwale1510
Copy link
Author

I checked event and messages logs but could not find this statement. Any suggestion how we can mitigate this problem? We are anyways implementing MessageStore to handle StreamClosed error with our upcoming release.

@chrjohn
Copy link
Member

chrjohn commented Jan 24, 2023

Without knowing exactly what caused the problem, it is impossible to know how to prevent it.
My guess is that one or all of the store files seemed to be corrupt. IMHO this could happen on concurrent access or if the process that wrote the files beforehand was stopped or killed. Or specific to your environment using NFS: maybe there was a network problem when syncing the files. Don't know if this could happen, am no NFS expert.

I asked already here #601 (comment) : how do you prevent concurrent access to the file store when you have two instances that could possibly initialize the store concurrently? Even when only one node has an active connection, on startup the store is initializing its files.

@ajinkyatankhiwale1510
Copy link
Author

Until now, we don't have any mechanism which would prevent concurrent file store access. I got your point, that after restarting both nodes will individually will initialize the file store. There would be couple of scenario (restart, session start/end etc) where concurrency would happen and we will have to live with it. In any case, our expectation is that, file store should not get corrupt.

@chrjohn
Copy link
Member

chrjohn commented Jan 24, 2023

The FileStore is not thread-safe (rule-of-thumb usually is that classes are not thread-safe except it is explicitly stated) and contains no mechanism to prevent concurrent access. Your scenario is even more complicated since the same files could be accessed from different machines at the same time, so thread-safety inside one JVM would not help.
Under this premises, why do you expect that the store should not get corrupt? :)
I would advise you to create your own implementation of the FileStore where you cater for your custom setup. You could e.g. have the store write a file (which could contain the hostname) which both instances use to synchronise access to the store so that only one instance at a time accesses the store. This would be a very simple solution, which of course needs to be fine-tuned. E.g. there also could be a timestamp of the last access so that one node knows when it should take over (when the timestamp is too old).

@chrjohn chrjohn closed this as not planned Won't fix, can't repro, duplicate, stale Apr 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants