Implement the codespell pre-commit hook #403

Closed
4 changes: 4 additions & 0 deletions .codespell/codespell-whitelist.txt
@@ -0,0 +1,4 @@
nD
CACE
compliers
complier
71 changes: 71 additions & 0 deletions .codespell/notebook_to_markdown.py
@@ -0,0 +1,71 @@
# Copyright 2024 The PyMC Labs Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This is a simple script that converts the Jupyter notebooks into markdown
for easier (and cleaner) parsing for the codespell check. Whitelisted words
are maintained within this directory in `codespell-whitelist.txt`. For
more information on this pre-commit hook please visit the GitHub homepage
for the project: https://github.com/codespell-project/codespell.
"""

import argparse
import os
from glob import glob

import nbformat
from nbconvert import MarkdownExporter


def notebook_to_markdown(pattern: str, output_dir: str) -> None:
    """
    Utility to convert jupyter notebook to markdown files.

    :param pattern:
        str that is a glob appropriate pattern to search
    :param output_dir:
        str directory to save the markdown files to
    """
    for f_name in glob(pattern, recursive=True):
        with open(f_name, "r", encoding="utf-8") as f:
            nb = nbformat.read(f, as_version=4)

        markdown_exporter = MarkdownExporter()
        (body, _) = markdown_exporter.from_notebook_node(nb)

        os.makedirs(output_dir, exist_ok=True)

        output_file = os.path.join(
            output_dir, os.path.splitext(os.path.basename(f_name))[0] + ".md"
        )

        with open(output_file, "w", encoding="utf-8") as f:
            f.write(body)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-p",
        "--pattern",
        help="the glob appropriate pattern to search for jupyter notebooks",
        default="docs/**/*.ipynb",
    )
    parser.add_argument(
        "-t",
        "--tempdir",
        help="temporary directory to save the converted notebooks",
        default="tmp_markdown",
    )
    args = parser.parse_args()
    notebook_to_markdown(args.pattern, args.tempdir)
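
One detail of the conversion script is worth spelling out: the output name is built from the notebook's base name only, so the directory part of the path is dropped. The sketch below isolates that naming rule using only the standard library. `markdown_path` is a hypothetical helper name and the third path is an invented example; neither appears in the PR.

```python
import os
from collections import Counter


def markdown_path(notebook_path: str, output_dir: str) -> str:
    """Apply the script's naming rule: keep the base name, swap the extension for .md."""
    stem = os.path.splitext(os.path.basename(notebook_path))[0]
    return os.path.join(output_dir, stem + ".md")


notebooks = [
    "docs/source/notebooks/did_pymc.ipynb",
    "docs/source/knowledgebase/quasi_dags.ipynb",
    "docs/source/archive/did_pymc.ipynb",  # hypothetical path, for illustration only
]

targets = [markdown_path(path, "tmp_markdown") for path in notebooks]

# Any target that appears more than once would be overwritten by a later notebook.
collisions = sorted(t for t, n in Counter(targets).items() if n > 1)
print(collisions)
```

If two notebooks in different directories shared a file name, the later conversion would replace the earlier one, so only one of them would reach codespell.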
38 changes: 37 additions & 1 deletion .pre-commit-config.yaml
@@ -25,7 +25,7 @@ repos:
        exclude: &exclude_pattern 'iv_weak_instruments.ipynb'
        args: ["--maxkb=1500"]
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.2
    rev: v0.6.3
    hooks:
      # Run the linter
      - id: ruff
@@ -41,3 +41,39 @@ repos:
        # needed to make excludes in pyproject.toml work
        # see here https://github.com/econchick/interrogate/issues/60#issuecomment-735436566
        pass_filenames: false
  - repo: local

Hmmm, design choices! (i.e. no right or wrong here, but would love to double-check ideas)

May I ask, codespell appears to behave more like a software test than a linter with the setup (convert-notebooks) and teardown (remove-temp-directory-notebooks), so would it be better for this to be implemented inside CI/CD instead of in pre-commit hooks?

Not suggesting that we do so, but I just wanted to see whether there's a strong(er) rationale for leaving it in a pre-commit hook than within a GitHub action independently. Is the intent for it to be run locally? Also, might there be a more compact way of configuring this?

To be clear, definitely not suggesting that we move away from what's implemented. Just asking these questions to make sure the rationale is strong.

The only ask I'd have here is to document this design choice in the documentation directory. (My criteria for documentation is that if a topic has been asked and the answers are not in the docs already, then it should be documented.)

Author

this is probably more of a philosophical question for you (@ericmjl ) and @drbenvincent

initially this was just checking spelling outside of the jupyter notebooks, which to me definitely feels like a pre-commit check, but now the check is including the notebook to markdown conversion to find spelling mistakes in these notebooks. maybe this highlights a bit of scope creep for this one PR / issue because purely as a pre-commit check I think it makes sense to just look at the .py and .md files and then the notebook spelling check is more of a CI check

so, it's up to you two but i'm happy to keep plugging away at this PR with the updated doc changes and rationale changes but maybe it makes sense to prune back this PR to just the base codespell checks then cherry-pick out the commits to a new PR for an actual CI check for jupyter notebooks since those should be more static and don't need to be checked on every commit -- thoughts?

Collaborator

Hi @westford14. Yeah, I was thinking the same thing - happy for you to do this.

    hooks:
      - id: convert-notebooks
        name: Convert Notebooks to Markdown
        entry: python ./.codespell/notebook_to_markdown.py
        language: python
        pass_filenames: false
        always_run: false
        additional_dependencies: ["nbconvert", "nbformat"]
        args: ["--tempdir", "tmp_markdown"]
  - repo: https://github.com/codespell-project/codespell
    rev: v2.3.0
    hooks:
      - id: codespell
        args: [
@ericmjl ericmjl Sep 10, 2024

Noticing that codespell can be configured by pyproject.toml. I think it's worth standardizing on pyproject.toml as the place for configuration. Establishing the pattern will be good for the long-term health of the package. @drbenvincent what are your thoughts here?

I'm mostly thinking of the -S flags btw, just to see if we can compact down the args list.

Collaborator

Agree - pyproject.toml should do as much of the project config work as possible.

Author

yea that's an easy fix

            "-S",
            "*.csv",
            "-S",
            "pyproject.toml",
            "-S",
            "*.svg",
            "-S",
            "*.ipynb",
            "--ignore-words=./.codespell/codespell-whitelist.txt",
          ]
        additional_dependencies:
          # Support pyproject.toml configuration
          - tomli
  - repo: local
    hooks:
      - id: remove-temp-directory-notebooks
        name: Remove temporary directory for codespell
        entry: bash -c 'rm -rf tmp_markdown && exit 0'
        language: system
        always_run: true
        pass_filenames: false
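
The review thread on the codespell hook suggests moving the `-S` skip patterns into `pyproject.toml`. Below is a minimal sketch of what that could look like. It assumes codespell's `[tool.codespell]` table, with `skip` and `ignore-words` keys that mirror the command-line options. The values are derived from the args list in this diff and are an assumption, not the configuration that was eventually merged. The snippet only parses the fragment with the standard library (`tomllib`, Python 3.11+, falling back to `tomli`) to confirm it is well formed.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # older interpreters: use the tomli backport if installed
    import tomli as tomllib

# Assumed pyproject.toml fragment; the values mirror the hook's args list.
PYPROJECT_FRAGMENT = """
[tool.codespell]
skip = "*.csv,pyproject.toml,*.svg,*.ipynb"
ignore-words = "./.codespell/codespell-whitelist.txt"
"""

config = tomllib.loads(PYPROJECT_FRAGMENT)["tool"]["codespell"]

# The comma-separated skip string corresponds to the repeated -S flags.
skip_patterns = config["skip"].split(",")
print(skip_patterns)
print(config["ignore-words"])
```

With settings like these in `pyproject.toml`, the hook's `args` list could shrink accordingly, which is the compaction the reviewer was asking about.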
2 changes: 1 addition & 1 deletion README.md
@@ -89,7 +89,7 @@ This is appropriate when you have multiple units, one of which is treated. You b
> The data (treated and untreated units), pre-treatment model fit, and counterfactual (i.e. the synthetic control) are plotted (top). The causal impact is shown as a blue shaded region. The Bayesian analysis shows shaded Bayesian credible regions of the model fit and counterfactual. Also shown is the causal impact (middle) and cumulative causal impact (bottom).

### Geographical lift (Geolift)
We can also use synthetic control methods to analyse data from geographical lift studies. For example, we can try to evaluate the causal impact of an intervention (e.g. a marketing campaign) run in one geographical area by using control geographical areas which are similar to the intervention area but which did not recieve the specific marketing intervention.
We can also use synthetic control methods to analyse data from geographical lift studies. For example, we can try to evaluate the causal impact of an intervention (e.g. a marketing campaign) run in one geographical area by using control geographical areas which are similar to the intervention area but which did not receive the specific marketing intervention.

### ANCOVA

4 changes: 2 additions & 2 deletions causalpy/data/simulate_data.py
@@ -291,7 +291,7 @@ def generate_ancova_data(
N=200, pre_treatment_means=np.array([10, 12]), treatment_effect=2, sigma=1
):
"""
Generate ANCOVA eample data
Generate ANCOVA example data

Example
--------
@@ -440,7 +440,7 @@ def generate_seasonality(n=12, amplitude=1, length_scale=0.5):


def periodic_kernel(x1, x2, period=1, length_scale=1, amplitude=1):
"""Generate a periodic kernal for gaussian process"""
"""Generate a periodic kernel for gaussian process"""
return amplitude**2 * np.exp(
-2 * np.sin(np.pi * np.abs(x1 - x2) / period) ** 2 / length_scale**2
)
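
For context, the `periodic_kernel` function in this hunk returns the squared amplitude times the exponential of minus two times the squared sine of pi times the absolute input distance over the period, divided by the squared length scale. The sketch below copies the function body from the diff and assumes NumPy is available. The parameter values are arbitrary, chosen only to illustrate the defining property that inputs one full period apart are treated as identical.

```python
import numpy as np


def periodic_kernel(x1, x2, period=1, length_scale=1, amplitude=1):
    """Generate a periodic kernel for gaussian process"""
    return amplitude**2 * np.exp(
        -2 * np.sin(np.pi * np.abs(x1 - x2) / period) ** 2 / length_scale**2
    )


# Illustrative values only: period of 2 and amplitude of 3.
same_point = periodic_kernel(0.0, 0.0, period=2.0, amplitude=3.0)
one_period_apart = periodic_kernel(0.0, 2.0, period=2.0, amplitude=3.0)
half_period_apart = periodic_kernel(0.0, 1.0, period=2.0, amplitude=3.0)

# Covariance peaks at amplitude**2 for identical inputs and for inputs a full
# period apart, and is smallest for inputs half a period apart.
print(same_point, one_period_apart, half_period_apart)
```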
2 changes: 1 addition & 1 deletion causalpy/experiments/instrumental_variable.py
@@ -44,7 +44,7 @@ class InstrumentalVariable(BaseExperiment):
:param model: A PyMC model
:param priors: An optional dictionary of priors for the
mus and sigmas of both regressions. If priors are not
specified we will substitue MLE estimates for the beta
specified we will substitute MLE estimates for the beta
coefficients. Greater control can be achieved
by specifying the priors directly e.g. priors = {
"mus": [0, 0],
2 changes: 1 addition & 1 deletion causalpy/experiments/inverse_propensity_weighting.py
@@ -195,7 +195,7 @@ def make_doubly_robust_adjustment(self, ps):
m1 = sk_lin_reg().fit(X[t == 1].astype(float), self.y[t == 1])
m0_pred = m0.predict(X)
m1_pred = m1.predict(X)
## Compromise between outcome and treatement assignment model
## Compromise between outcome and treatment assignment model
weighted_outcome0 = (1 - t) * (self.y - m0_pred) / (1 - X["ps"]) + m0_pred
weighted_outcome1 = t * (self.y - m1_pred) / X["ps"] + m1_pred
return weighted_outcome0, weighted_outcome1, None, None
4 changes: 2 additions & 2 deletions causalpy/experiments/prepostfit.py
@@ -311,7 +311,7 @@ class InterruptedTimeSeries(PrePostFit):
:param data:
A pandas dataframe
:param treatment_time:
The time when treatment occured, should be in reference to the data index
The time when treatment occurred, should be in reference to the data index
:param formula:
A statistical model formula
:param model:
@@ -352,7 +352,7 @@ class SyntheticControl(PrePostFit):
:param data:
A pandas dataframe
:param treatment_time:
The time when treatment occured, should be in reference to the data index
The time when treatment occurred, should be in reference to the data index
:param formula:
A statistical model formula
:param model:
2 changes: 1 addition & 1 deletion causalpy/plot_utils.py
@@ -73,7 +73,7 @@ def plot_xY(
ax=ax,
**plot_hdi_kwargs,
)
# Return handle to patch. We get a list of the childen of the axis. Filter for just
# Return handle to patch. We get a list of the children of the axis. Filter for just
# the PolyCollection objects. Take the last one.
h_patch = list(
filter(lambda x: isinstance(x, PolyCollection), ax_hdi.get_children())
2 changes: 1 addition & 1 deletion causalpy/pymc_models.py
@@ -27,7 +27,7 @@


class PyMCModel(pm.Model):
"""A wraper class for PyMC models. This provides a scikit-learn like interface with
"""A wrapper class for PyMC models. This provides a scikit-learn like interface with
methods like `fit`, `predict`, and `score`. It also provides other methods which are
useful for causal inference.

2 changes: 1 addition & 1 deletion causalpy/tests/test_pymc_models.py
@@ -142,7 +142,7 @@ def test_idata_property():
@pytest.mark.parametrize("seed", seeds)
def test_result_reproducibility(seed):
"""Test that we can reproduce the results from the model. We could in theory test
this with all the model and experiment types, but what is being targetted is
this with all the model and experiment types, but what is being targeted is
the PyMCModel.fit method, so we should be safe testing with just one model. Here
we use the DifferenceInDifferences experiment class."""
# Load the data
6 changes: 3 additions & 3 deletions docs/source/_static/interrogate_badge.svg
2 changes: 1 addition & 1 deletion docs/source/index.md
@@ -98,7 +98,7 @@ This is appropriate when you have multiple units, one of which is treated. You b
![Synthetic Control](./_static/synthetic_control_pymc.svg)

### Geographical Lift / Geolift
We can also use synthetic control methods to analyse data from geographical lift studies. For example, we can try to evaluate the causal impact of an intervention (e.g. a marketing campaign) run in one geographical area by using control geographical areas which are similar to the intervention area but which did not recieve the specific marketing intervention.
We can also use synthetic control methods to analyse data from geographical lift studies. For example, we can try to evaluate the causal impact of an intervention (e.g. a marketing campaign) run in one geographical area by using control geographical areas which are similar to the intervention area but which did not receive the specific marketing intervention.

### ANCOVA

10 changes: 5 additions & 5 deletions docs/source/knowledgebase/glossary.rst
@@ -9,11 +9,11 @@ Glossary

Average treatment effect
ATE
The average treatement effect across all units.
The average treatment effect across all units.

Average treatment effect on the treated
ATT
The average effect of the treatment on the units that recieved it. Also called Treatment on the treated.
The average effect of the treatment on the units that received it. Also called Treatment on the treated.

Change score analysis
A statistical procedure where the outcome variable is the difference between the posttest and pretest scores.
@@ -48,7 +48,7 @@ Glossary

Local Average Treatment effect
LATE
Also known asthe complier average causal effect (CACE), is the effect of a treatment for subjects who comply with the experimental treatment assigned to their sample group. It is the quantity we're estimating in IV designs.
Also known as the complier average causal effect (CACE), is the effect of a treatment for subjects who comply with the experimental treatment assigned to their sample group. It is the quantity we're estimating in IV designs.

Non-equivalent group designs
NEGD
@@ -76,7 +76,7 @@ Glossary
Where units are assigned to conditions at random.

Randomized experiment
An emprical comparison used to estimate the effects of treatments where units are assigned to treatment conditions randomly.
An empirical comparison used to estimate the effects of treatments where units are assigned to treatment conditions randomly.

Regression discontinuity design
RDD
@@ -96,7 +96,7 @@ Glossary

Treatment on the treated effect
TOT
The average effect of the treatment on the units that recieved it. Also called the average treatment effect on the treated (ATT).
The average effect of the treatment on the units that received it. Also called the average treatment effect on the treated (ATT).

Treatment effect
The difference in outcomes between what happened after a treatment is implemented and what would have happened (see Counterfactual) if the treatment had not been implemented, assuming everything else had been the same.
2 changes: 1 addition & 1 deletion docs/source/knowledgebase/quasi_dags.ipynb
@@ -104,7 +104,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This leads us to Randomized Controlled Trials (RCTs) which are considered the gold standard for estimating causal effects. One reason for this is that we (as experimenters) intervene in the system by assigning units to treatment by {term}`random assignment`. Because of this intervention, any causal influence of the confounders upon the treatment $\\mathbf{X} \\rightarrow Z$ is broken - treamtent is now soley determined by the randomisation process, $R \\rightarrow T$. The following causal DAG illustrates the structure of an RCT."
"This leads us to Randomized Controlled Trials (RCTs) which are considered the gold standard for estimating causal effects. One reason for this is that we (as experimenters) intervene in the system by assigning units to treatment by {term}`random assignment`. Because of this intervention, any causal influence of the confounders upon the treatment $\\mathbf{X} \\rightarrow Z$ is broken - treatment is now solely determined by the randomisation process, $R \\rightarrow T$. The following causal DAG illustrates the structure of an RCT."
]
},
{
2 changes: 1 addition & 1 deletion docs/source/notebooks/ancova_pymc.ipynb
@@ -222,7 +222,7 @@
"## Run the analysis\n",
"\n",
":::{note}\n",
"The `random_seed` keyword argument for the PyMC sampler is not neccessary. We use it here so that the results are reproducible.\n",
"The `random_seed` keyword argument for the PyMC sampler is not necessary. We use it here so that the results are reproducible.\n",
":::"
]
},
2 changes: 1 addition & 1 deletion docs/source/notebooks/did_pymc.ipynb
@@ -148,7 +148,7 @@
"## Run the analysis\n",
"\n",
":::{note}\n",
"The `random_seed` keyword argument for the PyMC sampler is not neccessary. We use it here so that the results are reproducible.\n",
"The `random_seed` keyword argument for the PyMC sampler is not necessary. We use it here so that the results are reproducible.\n",
":::"
]
},
4 changes: 2 additions & 2 deletions docs/source/notebooks/did_pymc_banks.ipynb
@@ -329,7 +329,7 @@
"* $\\mu_i$ is the expected value of the outcome (number of banks in business) for the $i^{th}$ observation.\n",
"* $\\beta_0$ is an intercept term to capture the baseline number of banks in business of the control group, in the pre-intervention period.\n",
"* `district` is a dummy variable, so $\\beta_{d}$ will represent a main effect of district, that is any offset of the treatment group relative to the control group.\n",
"* `post_treatment` is also a dummy variable which captures any shift in the outcome after the treatment time, regardless of the recieving treatment or not.\n",
"* `post_treatment` is also a dummy variable which captures any shift in the outcome after the treatment time, regardless of the receiving treatment or not.\n",
"* the interaction of the two dummy variables `district:post_treatment` will only take on values of 1 for the treatment group after the intervention. Therefore $\\beta_{\\Delta}$ will represent our estimated causal effect."
]
},
@@ -515,7 +515,7 @@
"source": [
"## Analysis 2 - DiD with multiple pre/post observations\n",
"\n",
"Now we'll do a difference in differences analysis of the full dataset. This approach has similarities to {term}`CITS` (Comparative Interrupted Time-Series) with a single control over time. Although slightly abitrary, we distinguish between the two techniques on whether there is enough time series data for CITS to capture the time series patterns."
"Now we'll do a difference in differences analysis of the full dataset. This approach has similarities to {term}`CITS` (Comparative Interrupted Time-Series) with a single control over time. Although slightly arbitrary, we distinguish between the two techniques on whether there is enough time series data for CITS to capture the time series patterns."
]
},
{
2 changes: 1 addition & 1 deletion docs/source/notebooks/geolift1.ipynb
@@ -269,7 +269,7 @@
"We can use `CausalPy`'s API to run this procedure, but using Bayesian inference methods as follows:\n",
"\n",
":::{note}\n",
"The `random_seed` keyword argument for the PyMC sampler is not neccessary. We use it here so that the results are reproducible.\n",
"The `random_seed` keyword argument for the PyMC sampler is not necessary. We use it here so that the results are reproducible.\n",
":::"
]
},