BlockReplicas API returns not what was requested #1117
Comments
Findings:
|
Possible solutions:
|
Second case of misbehavior caught by Unified on Sat Nov 10 at 04:06. |
Hi @nataliaratnikova - What's the error message that was received? Brian |
I forwarded it to you in email. |
For reference, the sanitized version is:
One surprising thing I've recently learned is that Apache worker processes never reload CRLs -- meaning busy servers will never reload the CRLs and, in extreme cases, eventually start spewing expiration errors (only fix is to have the parent forcibly reload the child every N requests). I'm not sure this is the case here - but if we're fairly certain the certificate itself isn't expiring, we should look at the average age of the process and compare it to the CRL ages. |
Can someone explain how that expired cert led to the query being changed and answered with something different from what was initially requested? Sorry, but I am missing this point. The call was for https://cmsweb.cern.ch/phedex/datasvc/xml/prod/blockreplicas?dataset=/MonoHbb_ZpBaryonic_MZp-500_MChi2-1_MChi1-0p5_ctau-100000_13TeV-madgraph/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM and returned block locations for [u'/DoubleMuon/Run2018C-PromptReco-v1/AOD#7057f19b-f644-4fd4-a779-047e63f70ab4', u'/DoubleMuon/Run2018C-PromptReco-v1/AOD#c49a7c8b-525e-460e-b5e4-674ba073369c', u'/DoubleMuon/Run2018C-PromptReco-v1/AOD#fe9bb7ab-0072-4c2c-9c36-995683f9736c', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#018492c6-d733-47e5-ac80-1d63d8b0af5b', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#134e9b0a-a1d5-4587-82e2-22aedf4c997f', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#30d60b8f-0344-4f65-b27b-f6a5368272c0', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#3538588b-5380-4f6f-90e1-523a5d83aec8', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#56bd71ef-4378-4b8e-a826-a867b6f1b51d', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#5f5dc434-6a7a-4a28-9957-2d0febe19430', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#7e78af93-4724-44ab-a976-6a187b3142c4', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#844e11f0-9ef6-4ab3-b1d7-ca5fcd7e52b1', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#93598e28-2ece-4db4-afc1-32e8ed002592', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#b97e9a54-06b7-4b20-b24f-b5f319d79056', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#c1ee19f5-1c11-4bdc-bea6-691de32764a1', u'/EGamma/Run2018A-HcalCalIsoTrkFilter-PromptReco-v3/ALCARECO#de35cfea-d12c-4ca9-98fa-ebccbe78d0b9'], for example. |
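Until the root cause is understood, a calling application could guard against this kind of mismatch on its own side. The following is only a minimal sketch, assuming block names always have the form `<dataset>#<block id>` as in the list above; the function name `blocks_outside_dataset` is made up for illustration and is not part of PhEDEx or Unified.

```python
def blocks_outside_dataset(dataset, block_names):
    """Return the block names that do not belong to the requested dataset.

    Assumes a block name is the dataset path, a '#', and a block id.
    """
    prefix = dataset + "#"
    return [name for name in block_names if not name.startswith(prefix)]


# Requested dataset and one of the blocks that came back in the report above.
requested = (
    "/MonoHbb_ZpBaryonic_MZp-500_MChi2-1_MChi1-0p5_ctau-100000_13TeV-madgraph"
    "/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1"
    "/MINIAODSIM"
)
returned = [
    "/DoubleMuon/Run2018C-PromptReco-v1/AOD#7057f19b-f644-4fd4-a779-047e63f70ab4",
]

unexpected = blocks_outside_dataset(requested, returned)
if unexpected:
    # The reply does not match the request; treat it as an error, not as data.
    print("reply contains %d block(s) from other datasets" % len(unexpected))
```

A non-empty result means the reply should be discarded and the call retried or reported, rather than interpreted as the locations of the requested dataset.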
@vlimant - my working theory would be Apache has a bug and is spewing crap. @nataliaratnikova - to make sure that I understand your message correctly, is the |
Hi Brian, I do not know much about frontend/backend authentication; I hoped you would know what this DN is for. I will ask via the e-group for cmsweb developers and operators; hopefully we get an answer next week. |
That certificate is used "internally" by CMSWEB services for inter-communication, through the frontend though. |
@amaltaro - do we have monitoring for the age of the certificates? Is it safe to assume that these did not really expire? Back to my earlier comment:
The only way I know to test this theory is to have someone log in to the frontend and look at the age of the Apache worker processes. If they are regularly a few days old, we could be hitting this bug. |
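For whoever ends up looking at the frontend, the age check could be scripted roughly as below. This is a sketch under assumptions: it expects text with one `<pid> <elapsed seconds>` pair per line, such as `ps -C httpd -o pid=,etimes=` prints on systems whose `ps` supports the `etimes` field; the process name `httpd` and the one-day threshold are guesses, and the threshold should really be taken from the lifetime of the CRLs in use.

```python
def parse_ps_etimes(ps_output):
    """Parse lines of '<pid> <elapsed seconds>' into a {pid: seconds} dict.

    Lines that do not consist of exactly two integers (for example a
    header line) are skipped.
    """
    ages = {}
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            pid, seconds = int(fields[0]), int(fields[1])
        except ValueError:
            continue
        ages[pid] = seconds
    return ages


def workers_older_than(ages, max_age_seconds):
    """Return the sorted pids whose elapsed time exceeds max_age_seconds."""
    return sorted(pid for pid, age in ages.items() if age > max_age_seconds)


# Hypothetical sample; real input would come from ps on the frontend.
sample = "  101  90000\n  102   3600\n"
ONE_DAY = 24 * 3600  # placeholder threshold, not a measured CRL lifetime
print(workers_older_than(parse_ps_etimes(sample), ONE_DAY))
```

If the pids reported this way regularly include most of the workers, that would be consistent with the CRL theory, though it would not prove it.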
I think Lina is going to reply to this thread, but there is no monitoring. The service certificate gets copied during the deployment to its own service auth area. That means either it works all the time, without any room for glitches or very short windows of expiration, OR... I wouldn't discard the chance of an unwanted backend serving requests with an expired certificate (e.g., reqmgr2 on vocms0742, which does not serve any requests, but if it had to, it would have expired certs). I'll let Lina comment further; we should have the backend IP address in the frontend logs and be able to rule out that theory. |
Gotcha. Since we're not seeing massive failures across the board, I'm going to guess that it's not really expired. I feel like my CRL theory may be going down the right track. @h4d4 - can you do
So the majority of processes are from yesterday ( Anyhow, we could potentially test this by having a custom frontend with one child and doing a steady stream of requests that require authentication for a week or so. Alternately, we could set:
in the frontend configuration and see if it lowers the average lifetime of the child process -- and see if this bug "magically" disappears. |
@nataliaratnikova @bbockelm @amaltaro CN=dmwm/cmsweb.cern.ch is a grid host certificate (dmwm/cmsweb.cern.ch) used for authenticating some CMSWeb services. A copy of the certificate therefore needs to be in the authentication area of the service on every node where the service should run. To renew the certificate, the CMSWeb operator receives an email about one month before it expires. Regarding the ssl:error Certificate Verification messages in the frontend logs: they are in fact showing up there, and they come from vocms0741 and vocms0742. On both nodes, the dmwm certificates for all the services listed before are from Dec 14 2016 (meaning they have in fact expired). Now the question: why are those old certificates there? The answer is that the last time I renewed the certificate, I had to propagate it by hand, since that day did not coincide with a production upgrade, the procedure that includes a step to propagate it. "Propagated" means copied from the AFS private area to the specific authentication area of the node where the service should run. [1] https://github.com/dmwm/deployment/blob/master/t0_reqmon/deploy#L49 |
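Since there is currently no monitoring of certificate age, a small check on the propagated copies might catch this earlier. The sketch below assumes the expiry date is available as the `notAfter=...` line that `openssl x509 -enddate -noout -in <cert>` prints; the helper names and the sample date are invented for illustration, and no real certificate path or date is implied.

```python
import ssl
from datetime import datetime, timezone


def parse_enddate_line(line):
    """Extract the date from a 'notAfter=Dec 15 00:00:00 2016 GMT' line."""
    key, _, value = line.strip().partition("=")
    if key != "notAfter" or not value:
        raise ValueError("not a notAfter line: %r" % line)
    return value


def seconds_until_expiry(not_after, now=None):
    """Return seconds left before not_after; negative if already expired."""
    if now is None:
        now = datetime.now(timezone.utc)
    # ssl.cert_time_to_seconds converts the certificate date string
    # (GMT) into seconds since the epoch.
    return ssl.cert_time_to_seconds(not_after) - now.timestamp()


# Hypothetical expiry line; a real one would come from openssl on each node.
line = "notAfter=Dec 15 00:00:00 2016 GMT"
checked_at = datetime(2018, 11, 10, tzinfo=timezone.utc)
if seconds_until_expiry(parse_enddate_line(line), checked_at) < 0:
    print("certificate copy has expired")
```

Run against each service's authentication area, a check like this would flag copies that were missed during a manual propagation.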
@bbockelm
|
Apparently this was an intentional feature: PHEDEX/perl_lib/PHEDEX/Web/Util.pm Line 230 in d2602ab
|
The API intended logic is
PHEDEX/perl_lib/PHEDEX/Web/API/BlockReplicas.pm
Lines 288 to 291 in 0702117
that is, if none of block, dataset, node, or create_since is set, create_since defaults to 1 day ago. If any of them is set, create_since takes whatever value the parameter has, or is left open. Respective code:
Sometimes, however, the query returns all datasets subscribed in the last 24 hours even when that was not intended, which confuses the calling application.
Affected version: 2_4_0-pre1 and earlier releases.
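For readers who do not have the Perl source at hand, the intended defaulting rule described above can be restated as a short Python sketch. This is an illustration of the rule as worded in this issue, not a translation of the actual BlockReplicas.pm code; the function name and the use of `None` for "open" are assumptions.

```python
import time

FILTER_KEYS = ("block", "dataset", "node", "create_since")
ONE_DAY = 24 * 3600


def resolve_create_since(params, now=None):
    """Return the effective create_since in epoch seconds, or None for open.

    If none of block, dataset, node, or create_since is set, default to
    one day before 'now'. Otherwise use create_since as given, or leave
    it open (None) when it was not supplied.
    """
    if now is None:
        now = time.time()
    if not any(params.get(key) is not None for key in FILTER_KEYS):
        return now - ONE_DAY
    return params.get("create_since")


# No filters at all: falls back to the last 24 hours.
print(resolve_create_since({}, now=1000000))
# A dataset filter only: create_since stays open.
print(resolve_create_since({"dataset": "/A/B/C"}, now=1000000))
```

Under this rule, a request whose dataset parameter is lost before the defaulting step would look like a request with no filters and would get the 24 hour default, which matches the symptom reported here.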