
Commit

Add helper functions for notebooks, add paper folder to put the software paper.
jasonfan1997 committed Oct 2, 2024
1 parent 2ddbe68 commit 8baf62e
Showing 109 changed files with 7,317 additions and 7,042 deletions.
Binary file modified docs/build/doctrees/environment.pickle
Binary file modified docs/build/doctrees/index.doctree
22 changes: 11 additions & 11 deletions docs/build/doctrees/nbsphinx/notebooks/cox.ipynb
@@ -48,8 +48,8 @@
"Dep. Variable: y No. Observations: 5000\n",
"Model: Logit Df Residuals: 4998\n",
"Method: MLE Df Model: 1\n",
"Date: Thu, 26 Sep 2024 Pseudo R-squ.: 0.4438\n",
"Time: 09:40:42 Log-Likelihood: -1927.5\n",
"Date: Wed, 02 Oct 2024 Pseudo R-squ.: 0.4438\n",
"Time: 15:53:27 Log-Likelihood: -1927.5\n",
"converged: True LL-Null: -3465.6\n",
"Covariance Type: nonrobust LLR p-value: 0.000\n",
"==============================================================================\n",
@@ -82,7 +82,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 2,
"metadata": {},
"outputs": [
{
@@ -97,8 +97,8 @@
"Dep. Variable: y No. Observations: 5000\n",
"Model: Logit Df Residuals: 4999\n",
"Method: MLE Df Model: 0\n",
"Date: Thu, 26 Sep 2024 Pseudo R-squ.: 0.4436\n",
"Time: 09:51:09 Log-Likelihood: -1928.1\n",
"Date: Wed, 02 Oct 2024 Pseudo R-squ.: 0.4436\n",
"Time: 15:53:27 Log-Likelihood: -1928.1\n",
"converged: False LL-Null: -3465.6\n",
"Covariance Type: nonrobust LLR p-value: nan\n",
"==============================================================================\n",
@@ -126,7 +126,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 3,
"metadata": {},
"outputs": [
{
@@ -138,8 +138,8 @@
"Dep. Variable: y No. Observations: 5000\n",
"Model: Logit Df Residuals: 4998\n",
"Method: MLE Df Model: 1\n",
"Date: Thu, 26 Sep 2024 Pseudo R-squ.: 0.4438\n",
"Time: 10:14:45 Log-Likelihood: -1927.5\n",
"Date: Wed, 02 Oct 2024 Pseudo R-squ.: 0.4438\n",
"Time: 15:53:27 Log-Likelihood: -1927.5\n",
"converged: True LL-Null: -3465.6\n",
"Covariance Type: nonrobust LLR p-value: 0.000\n",
"==============================================================================\n",
@@ -162,7 +162,7 @@
" 'COX ICI': 0.005610391483826338}"
]
},
"execution_count": 9,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
@@ -218,7 +218,7 @@
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 5,
"metadata": {},
"outputs": [
{
@@ -273,7 +273,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 7,
"metadata": {},
"outputs": [
{
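For context, the hunks above re-run Cox's calibration analysis: a logistic regression of the observed outcome on the logit of the predicted probability, where a slope near 1 and an intercept near 0 indicate good calibration. A minimal sketch of the technique follows, assuming numpy/statsmodels and hypothetical variable names; it is illustrative only and does not use calzone's API.

```python
# Sketch of Cox's calibration analysis (illustrative; variable names assumed).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
p_pred = rng.uniform(0.01, 0.99, 5000)  # hypothetical predicted probabilities
y_true = rng.binomial(1, p_pred)        # outcomes drawn from those probabilities

logits = np.log(p_pred / (1 - p_pred))  # logit transform of the predictions
fit = sm.Logit(y_true, sm.add_constant(logits)).fit(disp=0)
print(fit.summary())  # expect intercept ~ 0 and slope ~ 1 for calibrated scores
```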
22 changes: 11 additions & 11 deletions docs/build/doctrees/nbsphinx/notebooks/ece_mce.ipynb


2 changes: 1 addition & 1 deletion docs/build/doctrees/nbsphinx/notebooks/hl_test.ipynb
@@ -151,7 +151,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We will show HL having the wrong size. The power of HL test will be demostrated in the Power section of the next notebook. We will need to generate fake data."
"We will show the size of HL test. Notice that the size of HL test had been shown to depend on sample size, number of bin and binning scheme (Hosmer et. al. 1997). We will generate fake data to show the size of HL test."
]
},
{
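The revised cell examines the size (type I error rate) of the HL test rather than its power. For readers unfamiliar with the statistic, here is a minimal sketch under the standard definition, with equal-count binning and the conventional df = n_bins - 2; this is an editor's illustration, not code from the commit or from calzone.

```python
# Sketch of the Hosmer-Lemeshow statistic (illustrative; not calzone's API).
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, p_pred, n_bins=10):
    # Sort by predicted probability and split into roughly equal-count bins.
    order = np.argsort(p_pred)
    y, p = y_true[order], p_pred[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(p)), n_bins):
        n_g = len(idx)
        obs = y[idx].sum()  # observed events in the bin
        exp = p[idx].sum()  # expected events in the bin
        denom = exp * (1 - exp / n_g)
        if denom > 0:  # guard against degenerate bins
            stat += (obs - exp) ** 2 / denom
    return stat, chi2.sf(stat, df=n_bins - 2)

# Well-calibrated fake data should, on average, yield a non-significant p-value.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 5000)
y = rng.binomial(1, p)
print(hosmer_lemeshow(y, p))
```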
20 changes: 10 additions & 10 deletions docs/build/doctrees/nbsphinx/notebooks/ici.ipynb


175 changes: 55 additions & 120 deletions docs/build/doctrees/nbsphinx/notebooks/metrics_summary.ipynb
@@ -4,170 +4,105 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Summary and guide for calibration metrics\n",
"# Summary and guide for calzone\n",
"\n",
"We provide a summary of the calibration metrics provides by calzone, including the pros and cons of each metrics. For a more detailed explanation of each metrics and how to calculate them using calzone, please refer to the specific notebook."
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 11,
"metadata": {
"nbsphinx": "hidden"
},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/kwoklung.fan/anaconda3/envs/uq/lib/python3.12/site-packages/dataframe_image/converter/matplotlib_table.py:147: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.\n",
" if not thead and not tbody:\n"
]
}
],
"source": [
"import pandas as pd\n",
"from IPython.display import display, HTML\n",
"\n",
"data = {\n",
" 'Metrics': ['ECE', 'MCE', 'Hosmer-Lemeshow test', \"Spiegelhalter's z test\", \"Cox's analysis\", 'Integrated calibration index (ICI)'],\n",
" 'Metrics': ['ECE', 'MCE', 'Hosmer-Lemeshow test', \"Spiegelhalter's z test\", \"Cox's analysis\", 'Integrated calibration index<br> (ICI)'],\n",
" 'Description': [\n",
-" 'Using binned reliability diagram (equal-width or equal-count binning), sum of absolute difference, weighted by bin count.',\n",
-" 'Using binned reliability diagram (equal-width or equal-count binning), Maximum absolute difference.',\n",
-" 'Using binned reliability diagram (equal-width or equal-count binning), Chi-squared based test using expected and observed.',\n",
-" 'Decomposition of brier score. Normal distributed',\n",
-" 'Logistic regression of the logits',\n",
-" 'Similar to ECE, using smooth fit (usually losse) instead of binning to get the calibration curve'\n",
+" '<div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>sum of absolute difference, weighted by bin count.</div>',\n",
+" '<div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>Maximum absolute difference.</div>',\n",
+" '<div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>Chi-squared based test using expected and observed.</div>',\n",
+" '<div>Decomposition of Brier score.<br>Normally distributed<br> </div>',\n",
+" '<div>Logistic regression of the logits<br> <br> </div>',\n",
+" '<div>Similar to ECE, using a smooth fit (usually LOESS)<br>instead of binning to get<br>the calibration curve</div>'\n",
" ],\n",
" 'Pros': [\n",
-" '• Intuitive<br>• Easy to calculate',\n",
-" '• Intuitive<br>• Easy to calculate',\n",
-" '• Intuitive<br>• Statistical meaning',\n",
-" '• Doesn\'t rely on binning<br>• Statistical meaning',\n",
-" '• Doesn\'t rely on binning<br>• Its value shows the how the calibration is off',\n",
-" '• Doesn\'t rely on binning<br>• Capture all kind of miscalibration'\n",
+" '<div>• Intuitive<br>• Easy to calculate</div>',\n",
+" '<div>• Intuitive<br>• Easy to calculate</div>',\n",
+" '<div>• Intuitive<br>• Statistical meaning</div>',\n",
+" '<div>• Doesn\'t rely on binning<br>• Statistical meaning</div>',\n",
+" '<div>• Doesn\'t rely on binning<br>• Hints at miscalibration type</div>',\n",
+" '<div>• Doesn\'t rely on binning<br>• Captures all kinds of miscalibration</div>'\n",
" ],\n",
" 'Cons': [\n",
-" '• Depend on binning <br>• Depend on class-by-class or top-class',\n",
-" '• Depend on binning <br>• Depend on class-by-class or top-class',\n",
-" '• Depend on binning <br>• Low power<br>• Wrong coverage',\n",
-" '• Doesn\'t detect prevalence shift',\n",
-" '• Failed to capture some cases of miscalibration',\n",
-" '• Depend on the choice of curve fitting<br>• Depend on fitting parameters'\n",
+" '<div>• Depends on binning <br>• Depends on class-by-class/top-class</div>',\n",
+" '<div>• Depends on binning <br>• Depends on class-by-class/top-class</div>',\n",
+" '<div>• Depends on binning <br>• Low power<br>• Wrong coverage</div>',\n",
+" '<div>• Doesn\'t detect prevalence shift</div>',\n",
+" '<div>• Fails to capture some miscalibration</div>',\n",
+" '<div>• Depends on the choice of curve fitting<br>• Depends on fitting parameters</div>'\n",
" ],\n",
" 'Meaning': [\n",
-" 'Average deviation from true probability',\n",
-" 'Maximum deviation from true probability',\n",
-" 'Test of calibration',\n",
-" 'Test of calibration',\n",
-" 'A logit fit to the calibration curve',\n",
-" 'Average deviation from true probability'\n",
+" '<div>Average deviation from<br>true probability</div>',\n",
+" '<div>Maximum deviation from<br>true probability</div>',\n",
+" '<div>Test of<br>calibration</div>',\n",
+" '<div>Test of<br>calibration</div>',\n",
+" '<div>A logit fit to the<br>calibration curve</div>',\n",
+" '<div>Average deviation from<br>true probability</div>'\n",
" ]\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n",
"\n",
"# Apply custom styling\n",
"styled_df = df.style.set_properties(**{'text-align': 'left', 'white-space': 'pre-wrap'})\n",
"styled_df = styled_df.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])\n",
"\n",
"styled_df = styled_df.hide(axis=\"index\")\n",
"\n",
"# Display the styled dataframe\n",
"#display(HTML(styled_df.to_html(escape=False)))"
"#display(HTML(styled_df.to_html(escape=False)))\n",
"import dataframe_image as dfi\n",
"\n",
"dfi.export(styled_df,\"mytable.png\",table_conversion = 'matplotlib',dpi=300)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"cell_type": "markdown",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style type=\"text/css\">\n",
"#T_0ef8b th {\n",
" text-align: center;\n",
"}\n",
"#T_0ef8b_row0_col0, #T_0ef8b_row0_col1, #T_0ef8b_row0_col2, #T_0ef8b_row0_col3, #T_0ef8b_row0_col4, #T_0ef8b_row1_col0, #T_0ef8b_row1_col1, #T_0ef8b_row1_col2, #T_0ef8b_row1_col3, #T_0ef8b_row1_col4, #T_0ef8b_row2_col0, #T_0ef8b_row2_col1, #T_0ef8b_row2_col2, #T_0ef8b_row2_col3, #T_0ef8b_row2_col4, #T_0ef8b_row3_col0, #T_0ef8b_row3_col1, #T_0ef8b_row3_col2, #T_0ef8b_row3_col3, #T_0ef8b_row3_col4, #T_0ef8b_row4_col0, #T_0ef8b_row4_col1, #T_0ef8b_row4_col2, #T_0ef8b_row4_col3, #T_0ef8b_row4_col4, #T_0ef8b_row5_col0, #T_0ef8b_row5_col1, #T_0ef8b_row5_col2, #T_0ef8b_row5_col3, #T_0ef8b_row5_col4 {\n",
" text-align: left;\n",
" white-space: pre-wrap;\n",
"}\n",
"</style>\n",
"<table id=\"T_0ef8b\">\n",
" <thead>\n",
" <tr>\n",
" <th class=\"blank level0\" >&nbsp;</th>\n",
" <th id=\"T_0ef8b_level0_col0\" class=\"col_heading level0 col0\" >Metrics</th>\n",
" <th id=\"T_0ef8b_level0_col1\" class=\"col_heading level0 col1\" >Description</th>\n",
" <th id=\"T_0ef8b_level0_col2\" class=\"col_heading level0 col2\" >Pros</th>\n",
" <th id=\"T_0ef8b_level0_col3\" class=\"col_heading level0 col3\" >Cons</th>\n",
" <th id=\"T_0ef8b_level0_col4\" class=\"col_heading level0 col4\" >Meaning</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th id=\"T_0ef8b_level0_row0\" class=\"row_heading level0 row0\" >0</th>\n",
" <td id=\"T_0ef8b_row0_col0\" class=\"data row0 col0\" >ECE</td>\n",
" <td id=\"T_0ef8b_row0_col1\" class=\"data row0 col1\" >Using binned reliability diagram (equal-width or equal-count binning), sum of absolute difference, weighted by bin count.</td>\n",
" <td id=\"T_0ef8b_row0_col2\" class=\"data row0 col2\" >• Intuitive<br>• Easy to calculate</td>\n",
" <td id=\"T_0ef8b_row0_col3\" class=\"data row0 col3\" >• Depend on binning <br>• Depend on class-by-class or top-class</td>\n",
" <td id=\"T_0ef8b_row0_col4\" class=\"data row0 col4\" >Average deviation from true probability</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_0ef8b_level0_row1\" class=\"row_heading level0 row1\" >1</th>\n",
" <td id=\"T_0ef8b_row1_col0\" class=\"data row1 col0\" >MCE</td>\n",
" <td id=\"T_0ef8b_row1_col1\" class=\"data row1 col1\" >Using binned reliability diagram (equal-width or equal-count binning), Maximum absolute difference.</td>\n",
" <td id=\"T_0ef8b_row1_col2\" class=\"data row1 col2\" >• Intuitive<br>• Easy to calculate</td>\n",
" <td id=\"T_0ef8b_row1_col3\" class=\"data row1 col3\" >• Depend on binning <br>• Depend on class-by-class or top-class</td>\n",
" <td id=\"T_0ef8b_row1_col4\" class=\"data row1 col4\" >Maximum deviation from true probability</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_0ef8b_level0_row2\" class=\"row_heading level0 row2\" >2</th>\n",
" <td id=\"T_0ef8b_row2_col0\" class=\"data row2 col0\" >Hosmer-Lemeshow test</td>\n",
" <td id=\"T_0ef8b_row2_col1\" class=\"data row2 col1\" >Using binned reliability diagram (equal-width or equal-count binning), Chi-squared based test using expected and observed.</td>\n",
" <td id=\"T_0ef8b_row2_col2\" class=\"data row2 col2\" >• Intuitive<br>• Statistical meaning</td>\n",
" <td id=\"T_0ef8b_row2_col3\" class=\"data row2 col3\" >• Depend on binning <br>• Low power<br>• Wrong coverage</td>\n",
" <td id=\"T_0ef8b_row2_col4\" class=\"data row2 col4\" >Test of calibration</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_0ef8b_level0_row3\" class=\"row_heading level0 row3\" >3</th>\n",
" <td id=\"T_0ef8b_row3_col0\" class=\"data row3 col0\" >Spiegelhalter's z test</td>\n",
" <td id=\"T_0ef8b_row3_col1\" class=\"data row3 col1\" >Decomposition of brier score. Normal distributed</td>\n",
" <td id=\"T_0ef8b_row3_col2\" class=\"data row3 col2\" >• Doesn't rely on binning<br>• Statistical meaning</td>\n",
" <td id=\"T_0ef8b_row3_col3\" class=\"data row3 col3\" >• Doesn't detect prevalence shift</td>\n",
" <td id=\"T_0ef8b_row3_col4\" class=\"data row3 col4\" >Test of calibration</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_0ef8b_level0_row4\" class=\"row_heading level0 row4\" >4</th>\n",
" <td id=\"T_0ef8b_row4_col0\" class=\"data row4 col0\" >Cox's analysis</td>\n",
" <td id=\"T_0ef8b_row4_col1\" class=\"data row4 col1\" >Logistic regression of the logits</td>\n",
" <td id=\"T_0ef8b_row4_col2\" class=\"data row4 col2\" >• Doesn't rely on binning<br>• Its value shows the how the calibration is off</td>\n",
" <td id=\"T_0ef8b_row4_col3\" class=\"data row4 col3\" >• Failed to capture some cases of miscalibration</td>\n",
" <td id=\"T_0ef8b_row4_col4\" class=\"data row4 col4\" >A logit fit to the calibration curve</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_0ef8b_level0_row5\" class=\"row_heading level0 row5\" >5</th>\n",
" <td id=\"T_0ef8b_row5_col0\" class=\"data row5 col0\" >Integrated calibration index (ICI)</td>\n",
" <td id=\"T_0ef8b_row5_col1\" class=\"data row5 col1\" >Similar to ECE, using smooth fit (usually losse) instead of binning to get the calibration curve</td>\n",
" <td id=\"T_0ef8b_row5_col2\" class=\"data row5 col2\" >• Doesn't rely on binning<br>• Capture all kind of miscalibration</td>\n",
" <td id=\"T_0ef8b_row5_col3\" class=\"data row5 col3\" >• Depend on the choice of curve fitting<br>• Depend on fitting parameters</td>\n",
" <td id=\"T_0ef8b_row5_col4\" class=\"data row5 col4\" >Average deviation from true probability</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(HTML(styled_df.to_html(escape=False)))"
"![alt text](mytable.png \"Title\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Guide to Calibration Metrics\n",
"## Guide to calzone and calibration metrics\n",
"\n",
"calzone aims to access whether a model achieves moderate calibration, meaning whether $\\mathbb{P}(\\hat{Y}=Y|\\hat{P}=p)=p$ for all $p\\in[0,1]$.\n",
"\n",
"To accurately assess the calibration of machine learning models, it is essential to have a comprehensive and reprensative dataset with sufficient coverage of the prediction space. The calibration metrics is not meaningful if the dataset is not representative of true intended population.\n",
"\n",
"calzone takes in a csv dataset which contains the probability of each class and the true label. Most metrics in calzone only work with binary classification and which transforms the problem into 1-vs-rest when calcualte the metrics. Therefore, you need to specify the class-of-interest when using the metrics. The only exception is the $ECE_{top}$ and $MCE_{top}$ metrics which works for multi-class problems. See the corresponding documentation for more details.\n",
"\n",
"\n",
"We recommend visualizing calibration using reliability diagrams. If you observe general over- or under-estimation of probabilities for a given class, consider applying a prevalence adjustment to determine if it's solely due to prevalence shift. After prevalence adjustment, plot the reliability diagrams again and examine the results of calibration metrics.\n",
"\n",
"For a general sense of average probability deviation, we recommend using the Cox and Loess integrated calibration index (ICI). To test for calibration, we suggest using Spiegelhalter's z-test. Other metrics such as Expected Calibration Error (ECE), Cox slope/intercept, and Hosmer-Lemeshow (HL) test depends strongly on binning and should be used with caution.\n",
"\n",
"Please refer to the following sections for detailed descriptions of each metric."
"Please refer to the notebooks for detailed descriptions of each metric."
]
}
],
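The guide above recommends Spiegelhalter's z test for testing calibration and notes that ECE depends strongly on binning. To make both concrete, here is a short numpy/scipy sketch using the standard definitions; it is an editor's illustration and deliberately avoids guessing at calzone's API.

```python
# Sketch of Spiegelhalter's z test and equal-width ECE (illustrative).
import numpy as np
from scipy.stats import norm

def spiegelhalter_z(y_true, p_pred):
    # Standardized calibration component of the Brier score;
    # approximately standard normal when the model is calibrated.
    num = np.sum((y_true - p_pred) * (1 - 2 * p_pred))
    var = np.sum((1 - 2 * p_pred) ** 2 * p_pred * (1 - p_pred))
    z = num / np.sqrt(var)
    return z, 2 * norm.sf(abs(z))  # two-sided p-value

def ece_equal_width(y_true, p_pred, n_bins=10):
    # Bin-count-weighted mean absolute gap between mean predicted
    # probability and observed event rate, using equal-width bins.
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(p_pred[mask].mean() - y_true[mask].mean())
    return ece

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 5000)
y = rng.binomial(1, p)
print(spiegelhalter_z(y, p))  # z near 0 for calibrated scores
print(ece_equal_width(y, p))  # small but nonzero: finite-sample binning noise
```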
@@ -111,7 +111,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 2,
"metadata": {},
"outputs": [
{
@@ -140,7 +140,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -153,7 +153,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 4,
"metadata": {},
"outputs": [
{
