
Cholesky decomposition fails for ATLAS_2JET_7TEV_R06 in a closure test #2043

Closed
andreab1997 opened this issue Apr 10, 2024 · 8 comments · Fixed by #2045

Comments

@andreab1997 (Contributor) commented Apr 10, 2024

When running a level-2 closure test that contains the dataset ATLAS_2JET_7TEV_R06, the fit fails (during n3fit execution) with the error
numpy.linalg.LinAlgError: 65-th leading minor of the array is not positive definite
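For context, a minimal sketch (plain numpy, not the NNPDF code) of how a Cholesky decomposition fails when a symmetric covariance matrix is not positive definite; the "leading minor" wording above is how the underlying LAPACK-based routines report which principal submatrix first fails:

import numpy as np

# Toy symmetric matrix with a negative eigenvalue, mimicking a covariance
# matrix spoiled by inconsistent correlated systematics.
cov = np.array([[1.0, 2.0],
                [2.0, 1.0]])

try:
    np.linalg.cholesky(cov)
except np.linalg.LinAlgError as err:
    print("Cholesky failed:", err)

# The eigenvalues show why the decomposition cannot succeed:
print(np.linalg.eigvalsh(cov))  # -> [-1.  3.]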

Reproduce

A minimal runcard to reproduce this bug is the following:

#
# Configuration file for NNPDF++
#

############################################################
description: NNPDF4.0 NNLO Global, closure test

############################################################
# frac: training fraction
# ewk: apply ewk k-factors
# sys: systematics treatment (see systypes)
dataset_inputs:
- {dataset: ATLAS_2JET_7TEV_R06, frac: 0.75, cfac: []} 

############################################################
datacuts:  
  t0pdfset: 231223-ab-baseline-nnlo-global-NNLOcuts_iterated     # PDF set to generate t0 covmat
  q2min: 3.49                       # Q2 minimum
  w2min: 12.5
  use_cuts: internal
############################################################
theory:
  theoryid: 708         # database id

############################################################
trvlseed: 1727532335
nnseed: 1737785873
mcseed: 2123509817
genrep: true          # true = generate MC replicas, false = use real data

parameters: # This defines the parameter dictionary that is passed to the Model Trainer
  nodes_per_layer: [25, 20, 8]
  activation_per_layer: [tanh, tanh, linear]
  initializer: glorot_normal
  optimizer:
    clipnorm: 6.073e-6
    learning_rate: 2.621e-3
    optimizer_name: Nadam
  epochs: 17000
  positivity:
    initial: 184.8
    multiplier:
  integrability:
    initial: 10
    multiplier:
  stopping_patience: 0.1
  layer_type: dense
  dropout: 0.0
  threshold_chi2: 3.5

fitting:
  fitbasis: EVOL  # EVOL (7), EVOLQED (8), etc.
  basis:
  - {fl: sng, trainable: false, smallx: [1.091, 1.119], largex: [1.471, 3.021]}
  - {fl: g, trainable: false, smallx: [0.7795, 1.095], largex: [2.742, 5.547]}
  - {fl: v, trainable: false, smallx: [0.472, 0.7576], largex: [1.571, 3.559]}
  - {fl: v3, trainable: false, smallx: [0.07483, 0.4501], largex: [1.714, 3.467]}
  - {fl: v8, trainable: false, smallx: [0.5731, 0.779], largex: [1.555, 3.465]}
  - {fl: t3, trainable: false, smallx: [-0.5498, 1.0], largex: [1.778, 3.5]}
  - {fl: t8, trainable: false, smallx: [0.5469, 0.857], largex: [1.555, 3.391]}
  - {fl: t15, trainable: false, smallx: [1.081, 1.142], largex: [1.491, 3.092]}

############################################################
positivity:
  posdatasets:
  - {dataset: POSF2U, maxlambda: 1e6}        # Positivity Lagrange Multiplier
  - {dataset: POSF2DW, maxlambda: 1e6}
  - {dataset: POSF2S, maxlambda: 1e6}
  - {dataset: POSFLL_19PTS, maxlambda: 1e6}
  - {dataset: POSDYU, maxlambda: 1e10}
  - {dataset: POSDYD, maxlambda: 1e10}
  - {dataset: POSDYS, maxlambda: 1e10}
  - {dataset: POSF2C_17PTS, maxlambda: 1e6}
  - {dataset: POSXUQ, maxlambda: 1e6}        # Positivity of MSbar PDFs
  - {dataset: POSXUB, maxlambda: 1e6}
  - {dataset: POSXDQ, maxlambda: 1e6}
  - {dataset: POSXDB, maxlambda: 1e6}
  - {dataset: POSXSQ, maxlambda: 1e6}
  - {dataset: POSXSB, maxlambda: 1e6}
  - {dataset: POSXGL, maxlambda: 1e6}

############################################################
integrability:
  integdatasets:
  - {dataset: INTEGXT8, maxlambda: 1e2}
  - {dataset: INTEGXT3, maxlambda: 1e2}

############################################################

closuretest:
  filterseed: 3345348918 # Random seed to be used in filtering data partitions
  fakedata: true     # true = to use FAKEPDF to generate pseudo-data
  fakepdf: 231223-ab-baseline-nnlo-global-NNLOcuts_iterated      # Theory input for pseudo-data
  errorsize: 1.0    # uncertainties rescaling
  fakenoise: true    # true = to add random fluctuations to pseudo-data
  rancutprob: 1.0   # Fraction of data to be included in the fit
  rancutmethod: 0   # Method to select rancutprob data fraction
  rancuttrnval: false # 0(1) to output training(validation) chi2 in report
  printpdf4gen: false # To print info on PDFs during minimization

############################################################
debug: false
maxcores: 8

Things to note

  • The experimental covariance matrix generated during vp-setupfit to generate the level-1 data seems to be ok. In fact, vp-setupfit does not crash.
  • In order to reproduce this bug in master, another bug needs to be solved first (already fixed in Closure with same level1 #2007; see coredata.py).

I tried to debug this myself, but so far I have not been able to figure out the problem.

CC: @scarlehoff @RoyStegeman @comane

@scarlehoff (Member)

The experimental covariance matrix generated during vp-setupfit to generate the level-1 data seems to be ok

This didn't fail before, but it does now.
Do you have the old covariance matrix as well (so that you can check whether there are any changes)?
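If both matrices can be exported (the file names below are hypothetical), a quick numpy comparison along these lines would show whether and where they differ:

import numpy as np

# Hypothetical exports of the ATLAS_2JET_7TEV_R06 covariance matrix
# built with the 4.0.9 tag and with current master.
old_cov = np.load("covmat_409.npy")
new_cov = np.load("covmat_master.npy")

print("shapes:", old_cov.shape, new_cov.shape)
if old_cov.shape == new_cov.shape:
    print("max abs difference:", np.abs(old_cov - new_cov).max())
    print("equal within tolerance:", np.allclose(old_cov, new_cov))

# Positive definiteness check via the smallest eigenvalue of each matrix.
for label, cov in (("old", old_cov), ("new", new_cov)):
    print(label, "min eigenvalue:", np.linalg.eigvalsh(cov).min())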

@andreab1997 (Contributor, Author)

The experimental covariance matrix generated during vp-setupfit to generate the level-1 data seems to be ok

What do you mean by "before"? Before the new commondata?

This didn't fail before, but it does now. Do you have the old covariance matrix as well (so that you can check whether there are any changes)?

I do not have the old one, and I am not sure how I can reproduce it now.

@scarlehoff (Member)

What do you mean by "before"? Before the new commondata?

Yes.

I do not have the old one, and I am not sure how I can reproduce it now.

git checkout 4.0.9

In order to reproduce this bug in master, another bug needs to be solved

Just to make sure, did you try doing this in master with only that bug solved?

@andreab1997 (Contributor, Author)

Just to make sure, did you try doing this in master with only that bug solved?

Yes, I am debugging in master now.

@andreab1997 (Contributor, Author)

UPDATE: It seems that in the function dataset_inputs_covmat_from_systematics, during vp-setupfit I have
CommonData(setname='ATLAS_2JET_7TEV_R06_M12Y', ndata=90, commondataproc='DIJET', nkin=3, nsys=474, legacy=False, legacy_name='ATLAS_2JET_7TEV_R06', kin_variables=['ystar', 'm_jj', 'sqrts']), while during n3fit I have
[CommonData(setname='ATLAS_2JET_7TEV_R06_M12Y', ndata=90, commondataproc='DIJET', nkin=3, nsys=224, legacy=False, legacy_name='ATLAS_2JET_7TEV_R06', kin_variables=['ystar', 'm_jj', 'sqrts'])].

So nsys changes between vp-setupfit and n3fit, which I believe might be the problem.

@scarlehoff (Member) commented Apr 10, 2024

The difference should be the number of SKIP systematics. vp-setupfit is reading all of them. Then those are not written down (your changes to coredata.py ensure this is done consistently). After that, n3fit can only read the ones that were written down.

So that difference is correct, but if that is the problem it means SKIP is having an effect somewhere (and it shouldn't).
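Schematically (a toy sketch with made-up names, not the validphys implementation), the covariance matrix should be built only from the non-SKIP systematics, so dropping the SKIP columns beforehand must not change it:

import numpy as np
import pandas as pd

# Toy inputs: one column per systematic, one row per data point, and a map
# from each systematic to its treatment. All names and values are made up.
rng = np.random.default_rng(0)
ndata, nsys = 5, 8
sys_table = pd.DataFrame(rng.normal(size=(ndata, nsys)),
                         columns=[f"sys_{i}" for i in range(nsys)])
sys_types = {f"sys_{i}": ("SKIP" if i >= 5 else "ADD") for i in range(nsys)}

def covmat_from_systematics(table, types, stat_error):
    """Diagonal statistical part plus S S^T from the non-SKIP systematics."""
    keep = [name for name, treatment in types.items() if treatment != "SKIP"]
    s = table[keep].to_numpy()
    return np.diag(stat_error**2) + s @ s.T

stat = np.full(ndata, 0.1)
full = covmat_from_systematics(sys_table, sys_types, stat)

# Dropping the SKIP columns up front must give exactly the same matrix.
dropped = covmat_from_systematics(
    sys_table.drop(columns=["sys_5", "sys_6", "sys_7"]),
    {k: v for k, v in sys_types.items() if v != "SKIP"},
    stat,
)
np.testing.assert_allclose(full, dropped)
print("SKIP systematics have no effect on the covmat, as expected.")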

@andreab1997 (Contributor, Author)

The difference should be the number of SKIP systematics. vp-setupfit is reading all of them. Then those are not written down (your changes to coredata.py ensure this is done consistently). After that, n3fit can only read the ones that were written down.

So that difference is correct, but if that is the problem it means SKIP is having an effect somewhere (and it shouldn't).

Ok, but if this is the case, maybe it is linked to my solution of the SKIP problem (the one I wrote about at the beginning of this issue)?

@scarlehoff (Member)

No, that should be correct. The SKIP systematics should not have an effect anywhere (so their definition should also not be written down).
You should be seeing the same mismatch in (old) master.
