
Commit 41ccd26

Remove markdown preprint, instead redirect to updated arXiv (#176)
* Move important figures from /preprint to landing page via new landing-page-figs.md to use markdown syntax
  - Adjust margins in cumulative_metrics.py for improved layout
  - Improve hull_dist_box_plot.py by matching x-axis label colors to the box plot colors for clarity
* Remove outdated text and files in site/src/routes/preprint, instead redirect from site to arXiv
  - contributing.md and readme.md: update links to the preprint on arXiv
  - svelte.config.js: render markdown fig ID refs for landing-page-figs.md
1 parent 85f65b6 commit 41ccd26

27 files changed: +113 −3541 lines

contributing.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -246,7 +246,7 @@ And you're done! Once tests pass and the PR is merged, your model will be added
 - the exact code in the script that launched the run, and
 - which versions of dependencies were installed in the environment your model ran in.

-This information can be useful for others looking to reproduce your results or compare their model to yours i.t.o. computational cost. We therefore strongly recommend tracking all runs that went into a model submission with WandB so that the runs can be copied over to our WandB project at <https://wandb.ai/janosh/matbench-discovery> for everyone to inspect. This also allows us to include your model in more detailed analysis (see [SI](https://matbench-discovery.materialsproject.org/preprint#supplementary-information)).
+This information can be useful for others looking to reproduce your results or compare their model to yours i.t.o. computational cost. We therefore strongly recommend tracking all runs that went into a model submission with WandB so that the runs can be copied over to our WandB project at <https://wandb.ai/janosh/matbench-discovery> for everyone to inspect. This also allows us to include your model in more detailed analysis (see the SI in the [preprint](https://arxiv.org/abs/2308.14920)).

 ## 😵‍💫 &thinsp; Troubleshooting
```

data/wbm/readme.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -142,6 +142,6 @@ The MP training set consists of 154,719 `ComputedStructureEntries`

 ## 📊 &thinsp; Symmetry Statistics

-These sunburst diagrams show the spacegroup distribution of MP on the left and WBM on the right. Both have good coverage of all 7 crystal systems, the only exception being triclinic crystals which are just 1% of WBM but well represented in MP (15%). The 3 largest systems in MP are monoclinic, orthorhombic and triclinic vs orthorhombic, tetragonal and cubic in WBM. So WBM structures have overall higher symmetry which can benefit some models more than others. Wrenformer in particular uses symmetries as a coarse-grained description of the underlying structure. Its representations basically degrades to composition only on symmetry-less P1 structures. Given this spacegroup distribution, it should fare well on the WBM test set. The fact that Wrenformer is still outperformed by all interatomic potentials and some single-shot GNNs indicates the underlying methodology is unable to compete. See [SI](/preprint#spacegroup-prevalence-in-wrenformer-failure-cases) for a specific Wrenformer failure case.
+These sunburst diagrams show the spacegroup distribution of MP on the left and WBM on the right. Both have good coverage of all 7 crystal systems, the only exception being triclinic crystals which are just 1% of WBM but well represented in MP (15%). The 3 largest systems in MP are monoclinic, orthorhombic and triclinic vs orthorhombic, tetragonal and cubic in WBM. So WBM structures have overall higher symmetry which can benefit some models more than others. Wrenformer in particular uses symmetries as a coarse-grained description of the underlying structure. Its representations basically degrades to composition only on symmetry-less P1 structures. Given this spacegroup distribution, it should fare well on the WBM test set. The fact that Wrenformer is still outperformed by all interatomic potentials and some single-shot GNNs indicates the underlying methodology is unable to compete.

 <slot name="spacegroup-sunbursts" />
```

readme.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -29,6 +29,6 @@ If you'd like to refer to Matbench Discovery in a publication, please cite the [

 We welcome new models additions to the leaderboard through GitHub PRs. See the [contributing guide](https://janosh.github.io/matbench-discovery/contribute) for details and ask support questions via [GitHub discussion](https://github.com/janosh/matbench-discovery/discussions).

-For detailed results and analysis, check out the [preprint](https://janosh.github.io/matbench-discovery/preprint).
+For detailed results and analysis, check out the [preprint](https://arxiv.org/abs/2308.14920).

 > Disclaimer: We evaluate how accurately ML models predict solid-state thermodynamic stability. Although this is an important aspect of high-throughput materials discovery, the ranking cannot give a complete picture of a model's general applicability to materials. A high ranking does not constitute endorsement by the Materials Project.
```

scripts/analyze_model_failure_cases.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -107,7 +107,7 @@
 fig.show()

-# pmv.save_fig(fig, f"{FIGS}/hist-largest-each-errors-fp-diff-models.svelte")
+# pmv.save_fig(fig, f"{SITE_FIGS}/hist-largest-each-errors-fp-diff-models.svelte")


 # %%
```

scripts/model_figs/cumulative_metrics.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -59,7 +59,7 @@
 for key in filter(lambda key: key.startswith("yaxis"), fig.layout):
     fig.layout[key].range = range_y

-fig.layout.margin.update(l=60, r=10, t=30, b=60)
+fig.layout.margin.update(l=0, r=0, t=20, b=0)
 # use annotation for x-axis label
 fig.add_annotation(
     **dict(x=0.5, y=-0.15, xref="paper", yref="paper"),
```

scripts/model_figs/hull_dist_box_plot.py

Lines changed: 16 additions & 10 deletions
```diff
@@ -1,4 +1,5 @@
 # %%
+import plotly.express as px
 import plotly.graph_objects as go
 import pymatviz as pmv

@@ -30,32 +31,37 @@
 fig.layout.yaxis.title = Quantity.e_above_hull_error
 fig.layout.margin = dict(l=0, r=0, b=0, t=0)

+# Get the default Plotly colors that will be used for the boxes
+color_seq = px.colors.qualitative.Plotly
+
 for idx, model in enumerate(models_to_plot):
     ys = [df_each_err[model].quantile(quant) for quant in (0.05, 0.25, 0.5, 0.75, 0.95)]

-    fig.add_box(y=ys, name=model, width=0.8)
+    # Use the same color for both box and label
+    color = color_seq[idx % len(color_seq)]
+    fig.add_box(y=ys, name=model, width=0.8, marker_color=color)

     # annotate median with numeric value
     median = ys[2]
     fig.add_annotation(
-        x=idx,
-        y=median,
-        text=f"{median:.2}",
-        showarrow=False,
-        # bgcolor="rgba(0, 0, 0, 0.2)",
+        x=idx, y=median, text=f"{median:.2}", showarrow=False, font_size=9
     )

 fig.layout.showlegend = False
-# use line breaks to offset every other x-label
+# use line breaks to offset every other x-label and color them
 x_labels_with_offset = [
-    f"{'<br>' * (idx % 2)}{label}" for idx, label in enumerate(models_to_plot)
+    f"{'<br>' * (idx % 3)}<span style='color: {color_seq[idx % len(color_seq)]}'>"
+    f"{label}</span>"
+    for idx, label in enumerate(models_to_plot)
 ]
+
 # prevent x-labels from rotating
 fig.layout.xaxis.range = [-0.7, len(models_to_plot) - 0.3]
 fig.layout.xaxis.update(
-    tickangle=0, tickvals=models_to_plot, ticktext=x_labels_with_offset
+    tickangle=0,
+    tickvals=models_to_plot,
+    ticktext=x_labels_with_offset,
 )
-fig.layout.width = 70 * len(models_to_plot)
 fig.show()
```
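The modulo-cycling idiom in the hunk above (box color and HTML tick label drawn from the same palette, with `<br>` prefixes staggering every third label) can be sketched in isolation. The short palette and model names below are stand-ins for illustration only; the actual script uses `px.colors.qualitative.Plotly` and its own `models_to_plot` list:

```python
# stand-in palette (the script uses px.colors.qualitative.Plotly, a 10-color list)
color_seq = ["#636efa", "#EF553B", "#00cc96"]
# stand-in model names
models = ["CHGNet", "M3GNet", "MACE", "MEGNet", "CGCNN"]

# idx % len(color_seq) cycles the palette so box and tick-label colors stay in
# sync however many models there are; '<br>' * (idx % 3) staggers labels over
# three rows so long names don't overlap
labels = [
    f"{'<br>' * (idx % 3)}<span style='color: {color_seq[idx % len(color_seq)]}'>"
    f"{model}</span>"
    for idx, model in enumerate(models)
]
```

Passing these strings as `ticktext` (with the model names as `tickvals`) is what lets Plotly render colored, multi-row category labels, since tick text is interpreted as a restricted subset of HTML.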

scripts/per_element_errors.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -183,7 +183,7 @@
 fig.layout.legend.update(x=1, y=1, xanchor="right", yanchor="top", title="")
 fig.show()

-# pmv.save_fig(fig, f"{FIGS}/element-prevalence-vs-error.svelte")
+pmv.save_fig(fig, f"{SITE_FIGS}/element-prevalence-vs-error.svelte")
 pmv.save_fig(fig, f"{PDF_FIGS}/element-prevalence-vs-error.pdf")
```

site/src/figs/bar-element-counts-mp+wbm-normalized=False.svelte

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

site/src/figs/box-hull-dist-errors-only-compliant.svelte

Lines changed: 1 addition & 1 deletion

site/src/figs/box-hull-dist-errors.svelte

Lines changed: 1 addition & 1 deletion

site/src/figs/cumulative-precision-recall-only-compliant.svelte

Lines changed: 1 addition & 1 deletion

site/src/figs/cumulative-precision-recall.svelte

Lines changed: 1 addition & 1 deletion

site/src/figs/element-prevalence-vs-error.svelte

Lines changed: 1 addition & 1 deletion

site/src/figs/hist-clf-pred-hull-dist-models-9x2.svelte

Lines changed: 1 addition & 1 deletion

site/src/routes/+layout.svelte

Lines changed: 5 additions & 3 deletions
```diff
@@ -24,8 +24,6 @@
   '/api': `API docs for the Matbench Discovery PyPI package.`,
   '/contribute': `Steps for contributing a new model to the benchmark.`,
   '/models': `Details on each model sortable by metrics.`,
-  '/preprint': `The preprint released with the Matbench Discovery benchmark.`,
-  '/preprint/iclr-ml4mat': `Extended abstract submitted to the ICLR ML4Materials workshop.`,
 }[url ?? ``]
 if (url && !description) console.warn(`No description for url=${url}`)
 $: title = url == `/` ? `` : `${url} • `
@@ -75,7 +73,11 @@
 <GitHubCorner href={repository} />

 <Nav
-  routes={[[`/home`, `/`], ...routes.filter((route) => route != `/changelog`)]}
+  routes={[
+    [`/home`, `/`],
+    ...routes.filter((route) => route != `/changelog`),
+    [`/preprint`, `https://arxiv.org/abs/2308.14920`],
+  ]}
   style="padding: 0 var(--main-padding);"
 />
```

site/src/routes/+page.svelte

Lines changed: 4 additions & 1 deletion
```diff
@@ -3,6 +3,7 @@
   import { DiscoveryMetricsTable, model_is_compliant, MODEL_METADATA } from '$lib'
   import Readme from '$root/readme.md'
   import KappaNote from '$site/src/routes/kappa-note.md'
+  import LandingPageFigs from '$site/src/routes/landing-page-figs.md'
   import Icon from '@iconify/svelte'
   import { pretty_num } from 'elementari'
   import { Toggle, Tooltip } from 'svelte-zoo'
@@ -77,7 +78,7 @@
 <figure style="margin-top: 4em;" slot="metrics-table">
   <div class="discovery-set-toggle">
     {#each Object.entries(discovery_set_labels) as [key, { title, tooltip }]}
-      <Tooltip text={tooltip} tip_style="z-index: 2; font-size: 0.8em;" max_width="3em">
+      <Tooltip text={tooltip} tip_style="z-index: 2; font-size: 0.8em;">
        <button
          class:active={discovery_set === key}
          on:click={() => (discovery_set = key)}
@@ -181,6 +182,8 @@
 </Readme>
 <KappaNote />

+<LandingPageFigs />
+
 <style>
   figure {
     margin: 0;
```
margin: 0;

site/src/routes/landing-page-figs.md

Lines changed: 66 additions & 0 deletions
New file:

```svelte
<script lang="ts">
  import { onMount } from 'svelte'
  import BoxHullDistErrors from '$figs/box-hull-dist-errors.svelte'
  import CumulativePrecisionRecall from '$figs/cumulative-precision-recall.svelte'
  import EachParityModels from '$figs/each-parity-models-9x2.svelte'
  import HistClfPredHullDistModels from '$figs/hist-clf-pred-hull-dist-models-9x2.svelte'
  import RocModels from '$figs/roc-models.svelte'
  import RollingMaeVsHullDistModels from '$figs/rolling-mae-vs-hull-dist-models.svelte'

  let mounted: boolean = false
  onMount(() => (mounted = true))
</script>

{#if mounted}
  <CumulativePrecisionRecall />

> @label:fig:cumulative-precision-recall Model precision and recall for thermodynamic stability classification as a function of number of materials ranked from most to least stable by each model.
> CHGNet initially achieves the highest cumulative precision and recall.
> Simulates materials discovery efforts of different sizes since a typical campaign will rank hypothetical materials by model-predicted hull distance from most to least stable and validate the most stable predictions first.
> A higher fraction of correct stable predictions corresponds to higher precision and fewer stable materials overlooked correspond to higher recall.
> This figure highlights how different models perform better or worse depending on the length of the discovery campaign.
> The UIPs (CHGNet, M3GNet, MACE) are seen to offer significantly improved precision on shorter campaigns of ~20k or less materials validated as they are less prone to false positive predictions among highly stable materials.

{/if}

{#if mounted}
  <RocModels />

> @label:fig:roc-models Receiver operating characteristic (ROC) curve for each model. TPR/FPR = true/false positive rate. FPR on the $x$ axis is the fraction of unstable structures classified as stable. TPR on the $y$ axis is the fraction of stable structures classified as stable.

{/if}

{#if mounted}
  <BoxHullDistErrors />

> @label:fig:box-hull-dist-errors Box plot of interquartile ranges (IQR) of hull distance errors for each model. The whiskers extend to the 5th and 95th percentiles. The horizontal line inside the box shows the median. BOWSR has the highest median error, while Voronoi RF has the highest IQR. Note that MEGNet and CGCNN are the only models with a positive median. Their hull distance errors are biased towards more frequently predicting thermodynamic instability, explaining why they are closest to getting the overall number of stable structures in the test set right (see cumulative precision/recall in @fig:cumulative-precision-recall).

{/if}

{#if mounted}
  <RollingMaeVsHullDistModels style="place-self: center;" />

> @label:fig:rolling-mae-vs-hull-dist-models Universal potentials are more reliable classifiers because they exit the red triangle earliest.
> These lines show the rolling MAE on the WBM test set as the energy to the convex hull of the MP training set is varied.
> Lower is better.
> Inside the large red 'triangle of peril', models are most likely to misclassify structures.
> As long as a model's rolling MAE remains inside the triangle, its mean error is larger than the distance to the convex hull.
> If the model's error for a given prediction happens to point towards the stability threshold at $E$<sub>above MP hull</sub> = 0, its average error will change the stability classification from true positive/negative to false negative/positive.
> The width of the 'rolling window' box indicates the width over which prediction errors were averaged.

{/if}

{#if mounted}
  <EachParityModels />

> @label:fig:each-parity-models Parity plots of model-predicted energy distance to the convex hull (based on their formation energy predictions) vs DFT ground truth, color-coded by log density of points.
> Models are sorted left to right and top to bottom by MAE.

{/if}

{#if mounted}
  <HistClfPredHullDistModels />

> @label:fig:hist-clf-pred-hull-dist-models Distribution of model-predicted hull distance colored by stability classification. Models are sorted from top to bottom by F1 score. The thickness of the red and yellow bands shows how often models misclassify as a function of how far away from the convex hull they place a material. While CHGNet and M3GNet perform almost equally well overall, these plots reveal that they do so via different trade-offs. M3GNet commits fewer false negatives but more false positives predictions compared to CHGNet. In a real discovery campaign, false positives have a higher opportunity cost than false negatives, since they result in wasted DFT relaxations or even synthesis time in the lab. A false negative by contrast is just one missed opportunity out of many. For this reason, models with high true positive rate (TPR) even at the expense of lower true negative rate (TNR) are generally preferred.

{/if}
```

site/src/routes/models/tmi/+page.svelte

Lines changed: 8 additions & 6 deletions
```diff
@@ -20,12 +20,14 @@ Stuff that didn't make the cut into the&nbsp;<a href="/models">model page</a>.

 <h2>Does error correlate with element prevalence in training set?</h2>

-Answer: not much. You might (or might not) expect the more examples of structures
-containing a certain element models have seen in the training set, the smaller their
-average error on test set structures containing that element. That's not what we see in
-this plot. E<sub>above hull</sub> is all over the place as a function of elemental
-training set prevalence. Could be because the error is dominated by the least abundant
-element in composition or the model errors are more dependent on geometry than chemistry.
+Answer: not much. You might expect the more examples of structures containing a certain
+element models have seen in the training set, the smaller their average error on test set
+structures containing that element. That's not what we see in this plot. E<sub
+>above hull</sub
+>
+is all over the place as a function of elemental training set prevalence. Could be because
+the error is dominated by the least abundant element in composition or the model errors
+are more dependent on geometry than chemistry.

 {#if browser}
   <ElementPrevalenceVsErr style="margin: 2em 0;" />
```

site/src/routes/preprint/+layout.server.ts

Lines changed: 0 additions & 10 deletions
This file was deleted.

site/src/routes/preprint/+layout.svelte

Lines changed: 0 additions & 91 deletions
This file was deleted.
