diff --git a/dev/changelog/index.html b/dev/changelog/index.html index 297f353..f06ae2d 100644 --- a/dev/changelog/index.html +++ b/dev/changelog/index.html @@ -72,7 +72,7 @@
For changes since the latest tagged release, please refer to the git commit log.
New features and enhancements:
+Formula.differentiate()
is now considered stable, with
+ ModelMatrix.differentiate()
to follow in a future release.Bugfixes and cleanups:
+Breaking changes:
Formulaic is a high-performance implementation of Wilkinson formulas for Python, which are very useful for transforming dataframes into a form suitable for ingestion into various modelling frameworks (especially linear regression).
It provides:
pandas.DataFrame
pyarrow.Table
pandas.DataFrame
numpy.ndarray
scipy.sparse.CSCMatrix
with more to come!
"},{"location":"changelog/","title":"Changelog","text":"For changes since the latest tagged release, please refer to the git commit log.
"},{"location":"changelog/#110-15-december-2024","title":"1.1.0 (15 December 2024)","text":"Breaking changes:
Formula
is no longer always \"structured\" with special cases to handle the case where it has no structure. Legacy shims have been added to support old patterns, with DeprecationWarning
s raised when they are used. It is not expected to break anyone not explicitly checking whether the Formula.root
is a list instance (which formerly should have been simply assumed) [it is a now SimpleFormula
instance that acts like an ordered sequence of Term
instances].feature[T.A]
, whether nor not the encoding will result in that term acting as a contrast. Now, in keeping with patsy
, we only add the prefix if the categorical factor is encoded with reduced rank. Otherwise, feature[A]
will be used instead.formulaic.parsers.types.structured
has been promoted to formulaic.utils.structured
.New features and enhancements:
Formula
now instantiates to SimpleFormula
or StructuredFormula
, the latter being a tree-structure of SimpleFormula
instances (as compared to List[Term]
) previously. This simplifies various internal logic and makes the propagation of formula metadata more explicit.dict
and recarray
types are no associated with the pandas
materializer by default (rather than raising), simplifying some user workflows..
operator (which is replaced with all variables not used on the left-hand-side of formulae).[ ... ~ ... ]
. This is useful for (e.g.) generating formulae for IV 2SLS.ModelSpec[s]
based on an arbitrary strictly reduced FormulaSpec
.Formula.required_variables
to more easily surface the expected data requirements of the formula.cc
) and natural (cr
). See formulaic.materializers.transforms.cubic_spline.cubic_spline
for more details.lag()
transform.LinearConstraints
can now be done from a list of strings (for increased parity with patsy
).T.
when they actully describe contrasts (i.e. when they are encoded with reduced rank).encode_categorical
; which is surfaced via ModelSpec.factor_contrasts
.Operator
instances now received context
which is optionally specified by the user during formula parsing, and updated by the parser. This is what makes the .
implementation possible.Structured
, it has been promoted to formulaic.utils
.Bugfixes and cleanups:
Formula
instance.Structured
instances.ModelMatrix
and FactorValues
instances (whenever wrapped objects are picklable).basis_spline
: Fixed evaluation involving datasets with null values, and disallow out-of-bounds knots.ruff
for linting, and updated mypy
and pre-commit
tooling.ruff
are automatically applied when using hatch run lint:format
.Documentation:
Bugfixes and cleanups:
pandas
>=3.mypy
type inference in materializer subclasses.Documentation:
sklearn
integration example.Bugfixes and cleanups:
Breaking changes:
Formula.terms
, ModelSpec.feature_names
, and ModelSpec.feature_indices
.New features and enhancements:
patsy
.hashed
transform for categorically encoding deterministically hashed representations of a dataset.Bugfixes and cleanups:
astor
for Python 3.9 and newer.Documentation:
This is minor release with one important bugfix.
Bugfixes and cleanups:
This is a minor release with several important bugfixes.
Bugfixes and cleanups:
poly()
transforms operating on datasets that include null values.hatch run tests
.This is a minor release with several new features and cleanups.
New features and enhancements:
ModelSpec
documentation for more details.Bugfixes and cleanups:
OrderedDict
usage, since Python guarantees the orderedness of dictionaries in Python 3.7+.None
.This is a minor release with a bugfix.
Bugfixes and cleanups:
This is a minor release with several bugfixes.
Bugfixes and cleanups:
This is a minor release with one new feature.
New features and enhancements:
This is a major release with some important consistency and completeness improvements. It should be treated as almost being the first release candidate of 1.0.0, which will land after some small amount of further feature extensions and documentation improvements. All users are recommended to upgrade.
Breaking changes:
Although there are some internal changes to API, as documented below, there are no breaking changes to user-facing APIs.
New features and enhancements:
_ordering
keyword to the Formula
constructor.standardize
, Q
and treatment contrasts shims.cluster_by='numerical_factors
option to ModelSpec
to enable patsy style clustering of output columns by involved numerical factors.^
and %in%
.Structured
instances, and use this functionality during AST evaluation where relevant.ModelSpec.term_indices
is now a list rather than a tuple, to allow direct use when indexing pandas and numpy model matrices.Bugfixes and cleanups:
Structured
instances for non-sequential iterable values.poly
unit tests.PandasMaterializer
.Factor.EvalMethod.UNKNOWN
was removed, defaulting instead to LOOKUP
.sympy
version constraint now that a bug has been fixed upstream.Documentation:
This is a minor patch releases that fixes one bug.
Bugfixes and cleanups:
Structured
instance and iteration over this instance (including Formula
instances). Formerly the length would only count the number of keys in its structure, rather than the number of objects that would be yielded during iteration.This is a minor patch release that fixes two bugs.
Bugfixes and cleanups:
Formula
objects.formulaic.__version__
during package build.This is a major new release with some minor API changes, some ergonomic improvements, and a few bug fixes.
Breaking changes:
Formula
objects (e.g. formula.lhs
) no longer returns a list of terms; but rather a Formula
object, so that the helper methods can remain accessible. You can access the raw terms by iterating over the formula (list(formula)
) or looking up the root node (formula.root
).New features and improvements:
ModelSpec
object is now the source of truth in all ModelMatrix
generations, and can be constructed directly from any supported specification using ModelSpec.from_spec(...)
. Supported specifications include formula strings, parsed formulae, model matrices and prior model specs..get_model_matrix()
helper methods across Formula
, FormulaMaterializer
, ModelSpec
and model_matrix
objects/helpers functions are now consistent, and all use ModelSpec
directly under the hood.Formula
objects (e.g. formula.lhs
), the term lists will be wrapped as trivial Formula
instances rather than returned as raw lists (so that the helper methods like .get_model_matrix()
can still be used).FormulaSpec
is now exported from the top-level module.Bugfixes and cleanups:
ModelSpec
specifications being overriden by default arguments to FormulaMaterializer.get_model_matrix
.Structured._flatten()
now correctly flattens unnamed substructures.This is a major new release with some new features, greatly improved ergonomics for structured formulae, matrices and specs, and a few small breaking changes (most with backward compatibility shims). All users are encouraged to upgrade.
Breaking changes:
include_intercept
is no longer an argument to FormulaParser.get_terms
; and is instead an argument of the DefaultFormulaParser
constructor. If you want to modify the include_intercept
behaviour, please use: Formula(\"y ~ x\", _parser=DefaultFormulaParser(include_intercept=False))\n
Formula.terms
is deprecated since Formula
became a subclass of Structured[List[Terms]]
. You can directly iterate over, and/or access nested structure on the Formula
instance itself. Formula.terms
has a deprecated property which will return a reference to itself in order to support legacy use-cases. This will be removed in 1.0.0.ModelSpec.feature_names
and ModelSpec.feature_columns
are deprecated in favour of ModelSpec.column_names
and ModelSpec.column_indices
. Deprecated properties remain in-place to support legacy use-cases. These will be removed in 1.0.0.New features and enhancements:
Formula
has been refactored as a subclass of Structured[List[Terms]]
, and can be incrementally built and modified. The matrix and spec outputs now have explicit subclasses of Structured
(ModelMatrices
and ModelSpecs
respectively) to expose convenience methods that allow these objects to be largely used interchangeably with their singular counterparts.ModelMatrices
and ModelSpecs
arenow surfaced as top-level exports of the formulaic
module.Structured
(and its subclasses) gained improved integration of nested tuple structure, as well as support for flattened iteration, explicit mapping output types, and lots of cleanups.ModelSpec
was made into a dataclass, and gained several new properties/methods to support better introspection and mutation of the model spec.FormulaParser
was renamed DefaultFormulaParser
, and made a subclass of the new formula parser interface FormulaParser
. In this process include_intercept
was removed from the API, and made an instance attribute of the default parser implementation.Bugfixes and cleanups:
ModelSpec
s are provided by the user during materialization, they are updated to reflect the output-type chosen by the user, as well as whether to ensure full rank/etc.pylint
was added to the CI testing.Documentation:
.materializer
submodule, most code now has inline documentation and annotations.This is a backward compatible major release that adds several new features.
New features and enhancements:
ModelMatrix
instances (see ModelMatrix.model_spec.get_linear_constraints
).ModelMatrix
, ModelSpec
and other formula-like objects to the model_matrix
sugar method so that pre-processed formulae can be used.0
with -1
to avoid substitutions in quoted contexts.Bugfixes and cleanups:
bs(`my|feature%is^cool`)
.C(x, {\"a\": [1,2,3]})
.astor
to >=0.8 to fix issues with ast-generation in Python 3.8+ when numerical constants are present in the parsed python expression (e.g. \"bs(x, df=10)\").This is a minor patch release that migrates the package tooling to poetry; solving a version inconsistency when packaging for conda
.
This is a minor patch release that fixes an attempt to import numpy.typing
when numpy is not version 1.20 or later.
This is a minor patch release that fixes the maintaining of output types, NA-handling, and assurance of full-rank for factors that evaluate to pre-encoded columns when constructing a model matrix from a pre-defined ModelSpec. The benchmarks were also updated.
"},{"location":"changelog/#030-14-march-2022","title":"0.3.0 (14 March 2022)","text":"This is a major new release with many new features, and a few small breaking changes. All users are encouraged to upgrade.
Breaking changes:
formulaic.materializers.transforms
to the top-level formulaic.transforms
module, and ported all existing transforms to output FactorValues
types rather than dictionaries. FactorValues
is an object proxy that allows output types like pandas.DataFrame
s to be used as they normally would, with some additional metadata for formulaic accessible via the __formulaic_metadata__
attribute. This makes non-formula direct usage of these transforms much more pleasant.~
is no longer a generic formula separator, and can only be used once in a formula. Please use the newly added |
operator to separate a formula into multiple parts.New features and enhancements:
~
operator to use them. Structured formulas can have named substructures, for example: lhs
and rhs
for the ~
operator. The representation of formulas has been updated to show this structure.|
operator which splits formulas into multiple parts).formulaic.model_matrix
syntactic sugar function now accepts ModelSpec
and ModelMatrix
instances as the \"formula\" spec, making generation of matrices with the same form as previously generated matrices more convenient.poly
transform (compatible with R and patsy).numpy
is now always available in formulas via np
, allowing formulas like np.sum(x)
. For convenience, log
, log10
, log2
, exp
, exp10
and exp2
are now exposed as transforms independent of user context.formulaic.utils.context.capture_context()
. This can be used by libraries that wrap Formulaic to capture the variables and/or transforms available in a users' environment where appropriate.Bugfixes and cleanups:
Documentation:
.parser
and .utils
modules of Formulaic are now inline documented and annotated.This is a minor release that fixes an issue whereby the ModelSpec instances attached to ModelMatrix objects would keep reference to the original data, greatly inflating the size of the ModelSpec.
"},{"location":"changelog/#023-4-february-2021","title":"0.2.3 (4 February 2021)","text":"This release is identical to v0.2.2, except that the source distribution now includes the docs, license, and tox configuration.
"},{"location":"changelog/#022-4-february-2021","title":"0.2.2 (4 February 2021)","text":"This is a minor release with one bugfix.
This is a minor patch release that brings in some valuable improvements.
DataFrame
.This is major release that brings in a large number of improvements, with a huge number of commits. Some API breakage from the experimental 0.1.x series is likely in various edge-cases.
Highlights include:
Performance improvements around the encoding of categorical features.
Matthew Wardrop (1):\n Improve the performance of encoding operations.\n
"},{"location":"changelog/#011-31-october-2019","title":"0.1.1 (31 October 2019)","text":"No code changes here, just a verification that GitHub CI integration was working.
Matthew Wardrop (1):\n Update Github workflow triggers.\n
"},{"location":"changelog/#010-31-october-2019","title":"0.1.0 (31 October 2019)","text":"This release added support for keeping track of encoding choices during model matrix generation, so that they can be reused on similarly structured data. It also added comprehensive unit testing and CI integration using GitHub actions.
Matthew Wardrop (5):\n Add support for stateful transforms (including encoding).\n Fix tokenizing of nested Python function calls.\n Add support for nested transforms that return multiple columns, as well as passing through of materializer config through to transforms.\n Add comprehensive unit testing along with several small miscellaneous bug fixes and improvements.\n Add GitHub actions configuration.\n
"},{"location":"changelog/#001-1-september-2019","title":"0.0.1 (1 September 2019)","text":"Initial open sourcing of formulaic
.
Matthew Wardrop (1):\n Initial (mostly) working implementation of Wilkinson formulas.\n
"},{"location":"formulas/","title":"What are formulas?","text":"This section introduces the basic notions and origins of formulas. If you are already familiar with formulas from another context, you might want to skip forward to the Formula Grammar or other User Guides.
"},{"location":"formulas/#origins","title":"Origins","text":"Formulas were originally proposed by Wilkinson et al.1 to aid in the description of ANOVA problems, but were popularised by the S language (and then R, as an implementation of S) in the context of linear regression. Since then they have been extended in R, and implemented in Python (by patsy), in MATLAB, in Julia, and quite conceivably elsewhere. Each implementation has its own nuances and grammatical extensions, including Formulaic's which are described more completely in the Formula Grammar section of this manual.
"},{"location":"formulas/#why-are-they-useful","title":"Why are they useful?","text":"Formulas are useful because they provide a concise and explicit specification for how data should be prepared for a model. Typically, the raw input data for a model is stored in a dataframe, but the actual implementations of various statistical methodologies (e.g. linear regression solvers) act on two-dimensional numerical matrices that go by several names depending on the prevailing nomenclature of your field, including \"model matrices\", \"design matrices\" and \"regressor matrices\" (within Formulaic, we refer to them as \"model matrices\"). A formula provides the necessary information required to automate much of the translation of a dataframe into a model matrix suitable for ingestion into a statistical model.
Suppose, for example, that you have a dataframe with \\(N\\) rows and three numerical columns labelled: y
, a
and b
. You would like to construct a linear regression model for y
based on a
, b
and their interaction: \\[ y = \\alpha + \\beta_a a + \\beta_b b + \\beta_{ab} ab + \\varepsilon \\] with \\(\\varepsilon \\sim \\mathcal{N}(0, \\sigma^2)\\). Rather than manually constructing the required matrices to pass to the regression solver, you could specify a formula of form:
y ~ a + b + a:b\n
When furnished with this formula and the dataframe, Formulaic (or indeed any other formula implementation) would generate two model matrix objects: an \\( N \\times 1 \\) matrix \\(Y\\) for the response variable y
, and an \\( N \\times 4 \\) matrix \\(X\\) for the input columns intercept
, a
, b
, and a * b
. You can then directly pass these matrices to your regression solver, which internally will solve for \\(\\beta\\) in: \\[ Y = X\\beta + \\varepsilon. \\] The true value of formulas becomes more apparent as model complexity increases, where they can be a huge time-saver. For example:
~ (f1 + f2 + f3) * (x1 + x2 + scale(x3))\n
tells the formula interpreter to consider 16 fields of input data, corresponding to an intercept (1), each of the f*
fields (3), each of the x*
fields (3), and the combination of each f
with each x
(9). It also instructs the materializer to ensure that the x3
column is rescaled during the model matrix materialization phase such that it has mean zero and standard error of 1. If any of these columns is categorical in nature, they would by default also be one-hot/dummy encoded. Depending on the formula interpreter (including Formulaic), extra steps would also be taken to ensure that the resulting model matrix is structurally full-rank. As an added bonus, some formula implementations (including Formulaic) can remember any choices made during the materialization process, and apply them to consistently to new data, making it possible to easily generate new data that conforms to the same structure as the training data. For example, the scale(...)
transform in the example above makes use of the mean and variance of the column to be scaled. Any future data should, however, should not undergo scaling based on its own mean and variance, but rather on the mean and variance that was measured for the training data set (otherwise the new dataset will not be consistent with the expectations of the trained model which will be interpreting it).
Formulas are a very flexible tool, and can be augmented with arbitrary user-defined transforms. However, some transformations required by certain models may be more elegantly defined via a pre-formula dataframe operation or post-formula model matrix operation. Another consideration is that the default encoding and materialization choices for data are aligned with linear regression. If you are using a tree model, for example, you may not be interested in dummy encoding of \"categorical\" features, and this type of transform would have to be explicitly noted in the formula. Nevertheless, even in these cases, formulas are an excellent tool, and can often be used to greatly simplify data preparation workflows.
"},{"location":"formulas/#where-to-from-here","title":"Where to from here?","text":"To learn about the full set of features supported by the formula language as implemented by Formulaic, please review the Formula Grammar. To get a feel for how you can use formulaic
to transform your dataframes into model matrices, please review the Quickstart.
Wilkinson, G. N., and C. E. Rogers. Symbolic description of factorial models for analysis of variance. J. Royal Statistics Society 22, pp. 392\u2013399, 1973.\u00a0\u21a9
The latest release of formulaic
is always published to the Python Package Index (PyPI), from which it is available to download @ https://pypi.org/project/formulaic/.
If your Python environment is provisioned with pip
, installing formulaic
from the PyPI is as simple as running:
$ pip install formulaic\n
Note
If you have a non-standard setup, ensure that pip
above are replaced with the executables corresponding to the environment for which you are interested in installing formulaic
. This is done automatically if you are using a virtual environment.
You are ready to use Formulaic. To get introduced to the concepts underpinning Formulaic, please review the Concepts documentation, or to jump straight to how to use Formulaic, please review the User Guides documentation.
"},{"location":"installation/#installing-for-development","title":"Installing for development","text":"If you are interested in developing formulaic
, you should clone the source code repository, and install in editable mode from there (allowing your changes to be instantly available to all new Python sessions).
To clone the source code, run:
$ git clone git@github.com:matthewwardrop/formulaic.git\n
Note
This requires you to have a GitHub account set up. If you do not have an account you can replace the SSH url above with https://github.com/matthewwardrop/formulaic.git
. Also, if you are planning to submit your work upstream, you may wish to fork the repository into your own namespace first, and clone from there.
To install in editable mode, run:
$ pip install -e <path_to_cloned_formulaic_repo>\n
You will need pip>=21.3
in order for this to work. You can then make any changes you like to the repo, and have them be reflected in your local Python sessions. Happy hacking, and I look forward to your contributions!
"},{"location":"migration/","title":"Migrating from Patsy/R","text":"The default Formulaic parser and materialization configuration is designed to be highly compatibly with existing Wilkinson formula implementations in R and Python; however there are some differences which are highlighted here. If you find other differences, feel free to submit a PR to update this documentation.
"},{"location":"migration/#migrating-from-patsy","title":"Migrating frompatsy
","text":"Patsy has been the go-to implementation of Wilkinson formulae for Python use-cases for many years, and Formulaic should be largely a drop-in replacement, while bringing order of magnitude improvements in runtime performance and greater extensibility. Being written in the same language (Python) there are two separate migration concerns: input/output and API migrations, which will be explored separately below.
"},{"location":"migration/#inputoutput-changes","title":"Input/Output changes","text":"The primary inputs to patsy
are a formula string, and pandas dataframe from which features referenced in the formula are drawn. The output is a model matrix (called a design matrix in patsy
). We focus here on any potentially breaking behavioural differences here, rather than ways in which Formulaic extends the functionality available in patsy
.
^
operator is interpreted as exponentiation, rather than Python's XOR binary operator.C(x, contr.treatment)
. For greater compatibility with patsy
we add to the transform namespace Treatment
, Poly
, Sum
, Helmert
and Diff
, allowing formulae like C(x, Poly)
or C(x, Treatment(reference='x'))
to work as expected, with the following caveats:C
is C(data, contrasts=None, *, levels=None)
as compared to C(data, contrast=None, levels=None)
from patsy
.Sum
contrast does not offer an omit
option to specify the index of the omitted column.scale(x)
, but compatibility shims for standardize(x)
are added for greater compatibility with patsy
. Note that the standardize
shim follows patsy argument kwarg naming conventions, but scale
uses scale
instead of rescale
, following R.cluster_by=\"numerical_factors\"
to model_matrix
or any of the .get_model_matrix(...)
methods.cr
and cc
) or tensor smoothing (te
) stateful transforms.patsy
offers two high-level user-facing entrypoints: patsy.dmatrix
and patsy.dmatrices
, depending on whether you have both left- and right-hand sides present. In formulaic
, we offer a single entrypoint for both cases: model_matrix
.
In the vast majority of cases, a simple substitution of dmatrix
or dmatrices
with model_matrix
will achieve the desired the result; however there are some differences in signature that could trip up a naive copy and replace. Patsy's dmatrix
signature is:
patsy.dmatrix(\n formula_like,\n data={},\n eval_env=0,\n NA_action='drop',\n return_type='matrix',\n)\n
whereas model_matrix
has a signature of: formulaic.model_matrix(\n spec: FormulaSpec, # accepts any formula-like spec (include model matrices and specs)\n data: Any, # accepts any supported data structure (include pandas DataFrames)\n *,\n context: Union[int, Mapping[str, Any]] = 0, # equivalent to `eval_env`\n **spec_overrides, # Additional overrides for generated `ModelSpec`, including `na_action` and `output` (similar to `return_type`).\n)\n
If you are integrating Formulaic into your library, it is highly recommended to use the Formula()
API directly rather than model_matrix
, which by default will add all variables in the local context into the evaluation environment (just like dmatrix
). This allows you to better isolate and control the behaviour of the Formula parsing.
Most formulae that work in R will work without modification, including those written against the enhanced R Formula package that supports multi-part formulae. However, there are a few caveats that are worth calling out:
^
operator within an I
transform; e.g. I(x^2)
. This is because this is treated as Python code, and so you should use I(x**2)
or {x**2}
instead.patsy
. In particular, order of operations are respected when evaluating intercept directives, and so: 1 + (b - 1)
would result in the intercept remaining (since (b-1)
would be evaluated first to b
, resulting in 1 + b
), whereas in R the intercept would have been dropped.patsy
. Using capital letters to represent categorical variables, and lower-case letters to represent numerical ones, the difference from R will become apparent in two cases:1 + A:B
. In this case, R does not account for the fact that A:B
spans the intercept, and so does not rank reduce the product, and thus generates an over-specified matrix. This affects higher-order interactions also.0 + A:x + B:C
. Here we use 0 +
to avoid the previous bug, but unfortunately when R is checking whether to reduce the rank of the categorical features during encoding, it assumes that all involved features are categorical, and thus unnecessarily reduces the rank of C
, resulting in an under-specified matrix. This affects higher-order interactions also.offset(...)
.For more details, refer to the Formula Grammar.
"},{"location":"dev/","title":"Introduction","text":"This section of the documentation focuses on providing guidance to developers of libraries that are integrating Formulaic, users who want to extend its behavior, or for those interested in directly contributing.
If you are looking to directly work with Formulaic as an end-user, please review the User Guides instead.
This portion of the documentation is less complete than user-facing documentation, and you are encouraged to reach out via the Issue Tracker if you need any help.
"},{"location":"dev/extensions/","title":"Extensions","text":"Formulaic was designed to be extensible from day one, and nearly all of its core functionality is implemented as \"plugins\"/\"modules\" that you can use as examples for how extensions could be written. In this document we will provide a basic high-level overview of the basic components of Formulaic that can extended.
An important consideration is that while Formulaic offers extensible APIs, and effort will be made not to break extension APIs without reason (and never in patch releases), the safest place for you extensions is in Formulaic itself, where they can be kept up to date and maintained (assuming the extension is not overly bespoke). If you think your extensions might help others, feel free to reach out via the issue tracker and/or open a pull request.
"},{"location":"dev/extensions/#transforms","title":"Transforms","text":"Transforms are likely the most commonly extended feature of Formulaic, and also likely the least valuable to upstream (since transforms are often domain specific). Documentation for implementing transforms is described in detail in the Transforms user guide.
"},{"location":"dev/extensions/#materializers","title":"Materializers","text":"Materializers are responsible for translating formulae into model matrices as documented in the How it works user guide. You need to implement a new materializer if you want to add support for new input and/or output types.
Implementing a new materializer is as simple as subclassing the abstract class formulaic.materializers.FormulaMaterializer
(or one of its subclasses). This base class defines the API expected by the rest of the Formulaic system. Example implementations include pandas and pyarrow.
During subclassing, the new class is registered according to the various REGISTER_*
attributes if REGISTER_NAME
is specified. This registration allows looking up of the materializer by name through the model_matrix()
and .get_model_matrix()
functions. You can always manually pass in your materializer class explicitly without this registration.
Parsers translate a formula string to a set of terms and factors that are then evaluated and assembled into the model matrix, as documented in the How it works user guide. This is unlikely to be necessary very often, but can be used to add additional formula operators, or change the behavior of existing ones.
Formula parsers are expected to implement the API of formulaic.parser.types.FormulaParser
. The default implementation can be seen here. You can pass in custom parsers to Formula()
via the parser
and nested_parser
options (see inline documentation for more details).
If you are considering extending the parser, please do reach out via the issue tracker.
"},{"location":"dev/integration/","title":"Integration","text":"If you are looking to enrich your existing Python project with support for formulae, you have come to the right place. Formulaic is designed with simple APIs that should make it straightforward to integrate into any project.
In this document we provide several general recommendations for developers integrating Formulaic, and then some more specific guidance for developers looking to migrate existing formula functionality from patsy
. As you are working on integrating Formulaic, if you come across anything not mentioned here that really ought to be, please report it to our Issue Tracker.
For the most part, Formulaic should \"just work\". However, here are a couple of recommendations that might make your integration work easier.
model_matrix
. This is a simple wrapper around lower-level APIs that automatically includes variables from users' local namespaces. This is convenient when running in a notebook, but can lead to unexpected interactions with your library code that are hard to debug. Called naively in your library it will treat the frame in which it was run as the user context, which may include somewhat sensitive internal state and may override transforms normally available to formulae. Instead, use Formula(...).get_model_matrix(...)
.formulaic.utils.context.capture_context()
function and pass the result as context
to the .get_model_matrix()
methods. It is easiest to use in the outermost user-facing entrypoints so that you do not need to figure out exactly how many frames removed you are from user-context. You may also manually construct a dictionary from the user's context if you want to do additional filtering.eval()
function may be called to invoke the indicated Python functions. Since this is user-specified code, it is possible that the formula had some malicious code in it (such as sys.exit()
or shutil.rmtree()
). If you are integrating Formulaic into server-side code, it is highly recommended not to pass in any user-specified context, but instead to curate the set of additional functions that are available and pass that in instead. If you are writing a user-facing library, this should not be as concerning.pandas.DataFrame
-> PandasMaterializer
). Different materializers may have different output (and other) options. It may make sense to hard-code your choice of materializer by passing materializer=
to the .get_model_matrix()
methods.output='sparse'
to .get_model_matrix()
, assuming the materializer of your datatype supports this).ModelSpec
instances to work across different major versions of Formulaic. It may be tempting to serialize them to disk and then reuse them in newer versions of Formulaic. Most of the time this will work fine, but the stored encoder and transform states are considered implementation details of stateful transforms and are subject to change between major versions. Patch releases should never result in changes to this state.formulaic.parser.DefaultFormulaParser.FeatureFlags
. You can pass these flags, or a set of (case-insensitive) strings corresponding to these enums, to DefaultFormulaParser(feature_flags=...)
.If you are migrating a library that previous used patsy
to formulaic
, you should first review the general user-facing migration notes, which describes differences in API and formula grammars. Then, in addition to the recommendations above, the following notes might be helpful.
patsy
, such as manually assembling Term
instances, this code will need to be rewritten to use Formulaic
classes instead. Generally speaking, this will likely be transparent to your users and so should be a relatively small lift.This section of the documentation focuses on guiding end-users through the various aspects of Formulaic likely to be useful in day-to-day workflows. Feel free to pick and choose which modules you peruse, but note that later modules may assume knowledge of content described in earlier modules.
If you are a developer of another library looking to leverage Formulaic internally to your code, or are looking to contribute directly to Formulaic, it is recommended to also review the Developer Guides.
"},{"location":"guides/contrasts/","title":"Categorical Encoding","text":"Categorical data (also known as \"factors\") is encoded in model matrices using \"contrast codings\" that transform categorical vectors into a collection of numerical vectors suitable for use in regression models. In this guide we begin with some basic examples, before introducing the concepts behind contrast codings, how to select and/or design your own coding, and (for more advanced readers) describe how we guarantee structural full-rankness of formulae with complex interactions between categorical and other features.
In\u00a0[1]: Copied!from pandas import Categorical, DataFrame\n\nfrom formulaic import model_matrix\n\ndf = DataFrame(\n {\n \"letters\": [\"a\", \"b\", \"c\"],\n \"numbers\": Categorical([1, 2, 3]),\n \"values\": [20, 200, 30],\n }\n)\n\nmodel_matrix(\"letters + numbers + values\", df)\nfrom pandas import Categorical, DataFrame from formulaic import model_matrix df = DataFrame( { \"letters\": [\"a\", \"b\", \"c\"], \"numbers\": Categorical([1, 2, 3]), \"values\": [20, 200, 30], } ) model_matrix(\"letters + numbers + values\", df) Out[1]: Intercept letters[T.b] letters[T.c] numbers[T.2] numbers[T.3] values 0 1.0 0 0 0 0 20 1 1.0 1 0 1 0 200 2 1.0 0 1 0 1 30
Here letters
was identified as a categorical variable because of it consisted of strings, numbers
was identified as categorical because of its data type, and values
was treated as a vector of numerical values. The categorical data was encoded using the default encoding of \"Treatment\" (aka. \"Dummy\", see below for more details).
If we wanted to force formulaic to treat a column as categorical, we can use the C()
transform (just as in patsy and R). For example:
model_matrix(\"C(values)\", df)\nmodel_matrix(\"C(values)\", df) Out[2]: Intercept C(values)[T.30] C(values)[T.200] 0 1.0 0 0 1 1.0 0 1 2 1.0 1 0
The C()
transform tells Formulaic that the column should be encoded as categorical data, and allows you to customise how the encoding is performed. For example, we could use polynomial coding (detailed below) and explicitly specify the categorical levels and their order using:
model_matrix(\"C(values, contr.poly, levels=[10, 20, 30])\", df)\nmodel_matrix(\"C(values, contr.poly, levels=[10, 20, 30])\", df)
/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:124: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {200}. They are being cast to nan, which will likely skew the results of your analyses.\n warnings.warn(\nOut[3]: Intercept C(values, contr.poly, levels=[10, 20, 30]).L C(values, contr.poly, levels=[10, 20, 30]).Q 0 1.0 0.000000 -0.816497 1 1.0 0.000000 0.000000 2 1.0 0.707107 0.408248
Where possible, as you can see above, we also provide warnings when a categorical encoding does not reflect the structure of the data.
In\u00a0[4]: Copied!from formulaic.transforms.contrasts import C, TreatmentContrasts\n\nTreatmentContrasts(base=\"B\").get_coding_matrix([\"A\", \"B\", \"C\", \"D\"])\nfrom formulaic.transforms.contrasts import C, TreatmentContrasts TreatmentContrasts(base=\"B\").get_coding_matrix([\"A\", \"B\", \"C\", \"D\"]) Out[4]: A C D A 1.0 0.0 0.0 B 0.0 0.0 0.0 C 0.0 1.0 0.0 D 0.0 0.0 1.0 In\u00a0[5]: Copied!
TreatmentContrasts(base=\"B\").get_coefficient_matrix([\"A\", \"B\", \"C\", \"D\"])\nTreatmentContrasts(base=\"B\").get_coefficient_matrix([\"A\", \"B\", \"C\", \"D\"]) Out[5]: A B C D B 0.0 1.0 0.0 0.0 A-B 1.0 -1.0 -0.0 -0.0 C-B 0.0 -1.0 1.0 0.0 D-B 0.0 -1.0 0.0 1.0 In\u00a0[6]: Copied!
model_matrix(\"C(letters, contr.treatment)\", df)\nmodel_matrix(\"C(letters, contr.treatment)\", df) Out[6]: Intercept C(letters, contr.treatment)[T.b] C(letters, contr.treatment)[T.c] 0 1.0 0 0 1 1.0 1 0 2 1.0 0 1 In\u00a0[7]: Copied!
model_matrix(\"C(letters, contr.SAS)\", df)\nmodel_matrix(\"C(letters, contr.SAS)\", df) Out[7]: Intercept C(letters, contr.SAS)[T.a] C(letters, contr.SAS)[T.b] 0 1.0 1 0 1 1.0 0 1 2 1.0 0 0 In\u00a0[8]: Copied!
model_matrix(\"C(letters, contr.sum)\", df)\nmodel_matrix(\"C(letters, contr.sum)\", df) Out[8]: Intercept C(letters, contr.sum)[S.a] C(letters, contr.sum)[S.b] 0 1.0 1.0 0.0 1 1.0 0.0 1.0 2 1.0 -1.0 -1.0 In\u00a0[9]: Copied!
model_matrix(\"C(letters, contr.helmert)\", df)\nmodel_matrix(\"C(letters, contr.helmert)\", df) Out[9]: Intercept C(letters, contr.helmert)[H.b] C(letters, contr.helmert)[H.c] 0 1.0 -1.0 -1.0 1 1.0 1.0 -1.0 2 1.0 0.0 2.0 In\u00a0[10]: Copied!
model_matrix(\"C(letters, contr.diff)\", df)\nmodel_matrix(\"C(letters, contr.diff)\", df) Out[10]: Intercept C(letters, contr.diff)[D.b] C(letters, contr.diff)[D.c] 0 1.0 -0.666667 -0.333333 1 1.0 0.333333 -0.333333 2 1.0 0.333333 0.666667 In\u00a0[11]: Copied!
model_matrix(\"C(letters, contr.poly)\", df)\nmodel_matrix(\"C(letters, contr.poly)\", df) Out[11]: Intercept C(letters, contr.poly).L C(letters, contr.poly).Q 0 1.0 -0.707107 0.408248 1 1.0 0.000000 -0.816497 2 1.0 0.707107 0.408248 In\u00a0[12]: Copied!
my_letters = C(df.letters, TreatmentContrasts(base=\"b\"))\nmodel_matrix(\"my_letters\", df)\nmy_letters = C(df.letters, TreatmentContrasts(base=\"b\")) model_matrix(\"my_letters\", df) Out[12]: Intercept my_letters[T.a] my_letters[T.c] 0 1.0 1 0 1 1.0 0 0 2 1.0 0 1 In\u00a0[13]: Copied!
import numpy\n\nZ = numpy.array(\n [\n [1, 0, 0, 0], # A\n [-1, 1, 0, 0], # B - A\n [0, -1, 1, 0], # C - B\n [-1, 0, 0, 1], # D - A\n ]\n)\ncoding = numpy.linalg.inv(Z)[:, 1:]\ncoding\nimport numpy Z = numpy.array( [ [1, 0, 0, 0], # A [-1, 1, 0, 0], # B - A [0, -1, 1, 0], # C - B [-1, 0, 0, 1], # D - A ] ) coding = numpy.linalg.inv(Z)[:, 1:] coding Out[13]:
array([[0., 0., 0.],\n [1., 0., 0.],\n [1., 1., 0.],\n [0., 0., 1.]])In\u00a0[14]: Copied!
model_matrix(\n \"C(letters, contr.custom(coding))\", DataFrame({\"letters\": [\"A\", \"B\", \"C\", \"D\"]})\n)\nmodel_matrix( \"C(letters, contr.custom(coding))\", DataFrame({\"letters\": [\"A\", \"B\", \"C\", \"D\"]}) ) Out[14]: Intercept C(letters, contr.custom(coding))[1] C(letters, contr.custom(coding))[2] C(letters, contr.custom(coding))[3] 0 1.0 0.0 0.0 0.0 1 1.0 1.0 0.0 0.0 2 1.0 1.0 1.0 0.0 3 1.0 0.0 0.0 1.0
The model matrices generated from formulae are often consumed directly by linear regression algorithms. In these cases, if your model matrix is not full rank, then the features in your model are not linearly independent, and the resulting coefficients (assuming they can be computed at all) cannot be uniquely determined. While there are ways to handle this, such as regularization, it is usually easiest to put in a little more effort during the model matrix creation process, and make the incoming vectors in your model matrix linearly independent from the outset. As noted in the text above, categorical coding requires consideration about the overlap of the coding with the intercept in order to remain full rank. The good news is that Formulaic will do most of the heavy lifting for you, and does so by default.
It is important to note at this point that Formulaic does not protect against all forms of linear dependence, only structural linear dependence; i.e. linear dependence that results from multiple categorical variables overlapping in vectorspace. If you have two identical numerical vectors called by two different names in your model matrix, Formulaic will happily build the model matrix you requested, and you're on your own. This is intentional. While Formulaic strives to make the model matrix generation process as painless as possible, it also doesn't want to make more assumptions about the use of the data than is necessary. Note that you can also disable Formulaic's structural full-rankness algorithms by passing ensure_full_rank=False
to model_matrix()
or .get_model_matrix()
methods; and can bypass the reducing of the rank of a single categorical term in a formula using C(..., spans_intercept=False)
(this is especially useful, for example, if your model includes regularization and you would prefer to use the over-specified model to ensure fairer shrinkage).
The algorithm that Formulaic uses was heavily inspired by patsy
^1. The basic idea is to recognize that all categorical codings span the intercept[^2]; and then to break that coding up into two pieces: a single column that can be dropped to avoid spanning the intercept, and the remaining body of the coding that will always be present. You expand associatively the categorical factors, and then greedily recombine the components, omitting any that would lead to structural linear dependence. The result is a set of categorical codings that only spans the intercept when it is safe to do so, guaranteeing structural full rankness. The patsy documentation goes into this in much more detail if this is interesting to you.
[^2]: This assumes that categories are \"complete\", in that each unit has been assigned a category. You can \"complete\" categories by treating all those unassigned as being a member of an imputed \"null\" category.
"},{"location":"guides/contrasts/#basic-usage","title":"Basic usage\u00b6","text":"Formulaic follows in the stead of R and Patsy by automatically inferring from the data whether a feature needs to categorically encoded. For example:
"},{"location":"guides/contrasts/#how-does-contrast-coding-work","title":"How does contrast coding work?\u00b6","text":"As you have seen, contrast coding transforms categorical vectors into a matrix of numbers that can be used during modeling. If your data has $K$ mutually exclusive categories, these matrices typically consist of $K-1$ columns. This reduction in dimensionality reflects the fact that membership of the $K$th category could be inferred from the lack of membership in any other category, and so is redundant in the presence of a global intercept. You can read more about this in the full rankness discussion below.
The first step toward generating numerical vectors from categorical data is to dummy encode it. This transforms the single vector of $K$ categories into $K$ boolean vectors, each having a $1$ only in rows that are members of the corresponding category. If you do not have a global intercept, you can directly use this dummy encoding with the full $K$ columns and contrasts are unnecessary. This is not always the case, which requires you to reduce the rank of your coding by thinking about contrasts (or differences) between the levels.
In practice, this dimension reduction using \"contrasts\" looks like constructing a $K \\times (K-1)$ \"coding matrix\" that describes the contrasts of interest. You can then post-multiply your dummy-encoded columns by it. That is: $$ E = DC $$ where $E \\in \\mathbb{R}^{N \\times (K-1)}$ is the contrast coded categorical data, $D \\in \\{0, 1\\}^{N \\times K}$ is the dummy encoded data, and $C \\in \\mathbb{R}^{K \\times (K-1)}$ is the coding matrix.
The easiest way to construct a coding matrix is to start with a \"coefficient matrix\" $Z \\in \\mathbb{R}^{K \\times K}$ which describes the contrasts that you want the coefficients of a trained linear regression model to represent (with columns representing the untransformed levels, and rows representing the transforms). For a consistently chosen set of contrasts, this matrix will be full-rank, and the inverse of this matrix will have a constant column representing the global intercept. Removing this column results in the $K \\times (K-1)$ coding matrix that should be apply to the dummy encoded data in order for the coefficients to have the desired interpretation.
For example, if we wanted all of the levels to be compared to the first level, we would build a matrix $Z$ as: $$ \\begin{align} Z =& \\left(\\begin{array}{c c c c} 1 & 0 & 0 & 0 \\\\ -1 & 1 & 0 & 0\\\\ -1 & 0 & 1 & 0\\\\ -1 & 0 & 0 & 1 \\end{array}\\right)\\\\ \\therefore Z^{-1} =& \\left(\\begin{array}{c c c c} 1 & 0 & 0 & 0 \\\\ 1 & 1 & 0 & 0\\\\ 1 & 0 & 1 & 0\\\\ 1 & 0 & 0 & 1 \\end{array}\\right)\\\\ \\implies C =& \\left(\\begin{array}{c c c} 0 & 0 & 0 \\\\ 1 & 0 & 0\\\\ 0 & 1 & 0\\\\ 0 & 0 & 1 \\end{array}\\right) \\end{align} $$ This is none other than the default \"treatment\" coding described below, which applies one-hot coding to the categorical data.
It is important to note that while your choice of contrast coding will change the interpretation and values of your coefficients, all contrast encodings ultimately result in equivalent regressions, and it is possible to restrospectively infer any other set of interesting contrasts given the regression covariance matrix. The task is therefore to find the most useful representation, not the \"correct\" one.
For those interested in reading more, the R Documentation on Coding Matrices covers this in more detail.
"},{"location":"guides/contrasts/#contrast-codings","title":"Contrast codings\u00b6","text":"This section introduces the contrast encodings that are shipped as part of Formulaic. These implementations live in formulaic.transforms.contrasts
, and are surfaced by default in formulae as an attribute of contr
(e.g. contr.treatment
, in order to be consistent with R). You can always implement your own contrasts if the need arises.
If you would like to dig deeper and see the actual contrast/coefficient matrices for various parameters you can directly import these contrast implementations and play with them in a Python shell, but otherwise for brevity we will not exhaustively show these in the following documentation. For example:
"},{"location":"guides/contrasts/#treatment-aka-dummy","title":"Treatment (aka. dummy)\u00b6","text":"This contrast coding compares each level with some reference level. If not specified, the reference level is taken to be the first level. The reference level can be specified as the first argument to the TreatmentContrasts
/contr.treatment
constructor.
Example formulae:
~ X
: Assuming X
is categorical, the treatment encoding will be used by default.~ C(X)
: You can also explicitly flag a feature to be encoded as categorical, whereupon the default is treatment encoding.~ C(X, contr.treatment)
: Explicitly indicate that the treatment encoding should be used.~ C(X, contr.treatment(\"x\"))
: Indicate that the reference treatment should be \"x\" instead of the first index.~ C(X, contr.treatment(base=\"x\"))
: As above.This constrasts generated by this class are the same as the above, but with the reference level defaulting to the last level (the default in SAS).
Example formulae:
~ C(X, contr.SAS)
: Basic use-case.~ C(X, contr.SAS(\"x\"))
: Same as treatment encoding case above.~ C(X, contr.SAS(base=\"x\"))
: Same as treatment encoding case above.These contrasts compare each level (except the last, which is redundant) to the global average of all levels.
Example formulae:
~ C(X, contr.sum)
: Encode categorical data using the sum coding.These contrasts compare each successive level to the average all previous/subsequent levels. It has two configurable parameters: reverse
which controls the direction of comparison, and scale
which controls whether to scale the encoding to simplify interpretation of coefficients (results in a floating point model matrix instead of an integer one). When reverse
is True
, the contrasts compare a level to all previous levels; and when False
, it compares it to all subsequent levels.
The default parameter values are chosen to match the R implementation, which corresponds to a reversed and unscaled Helmert coding.
Example formulae:
~ C(X, contr.helmert)
: Unscaled reverse coding.~ C(X, contr.helmert(reverse=False))
: Unscaled forward coding.~ C(X, contr.helmert(scale=True))
: Scaled reverse coding.~ C(X, contr.helmert(scale=True, reverse=False))
: Scaled forward coding.These contrasts take the difference of each level with the previous level. It has one parameter, forward
, which indicates that the difference should be inverted such the difference is taken between the previous level and the current level. The default attribute values are chosen to match the R implemention, and correspond to a backward difference coding.
Example formulae:
~ C(X, contr.diff)
: Backward coding.~ C(X, contr.diff(forward=True))
: Forward coding.The \"contrasts\" represent a categorical variable that is assumed to equal (or known) spacing/scores, and allow us to model non-linear polynomial behaviour of the dependent variable with respect to the ordered levels by projecting the spacing onto a basis of orthogonal polynomials. It has one parameter, scores
which indicates the spacing of the categories. It must have the same length as the number of levels. If not provided, the categories are assumed equidistant and spaced by 1.
The feature names of categorical variables can become quite unwieldy, as you may have noticed. Fortunately this is easily remedied by aliasing the variable outside of your formula (and then making it available via formula context). This is done automatically if you use the model_matrix
function. For example:
It may be useful to define your own coding matrices in some contexts. This is readily achieved using the CustomContrasts
class directly or via the contr.custom
alias. In these cases, you are responsible for providing the coding matrix ($C$ from above). For example, if you had four levels: A, B, C and D, and wanted to compute the contrasts: B - A, C - B, and D - A, you could write:
This section of the documentation is intended to provide a high-level overview of the way in which formulae are interpreted and materialized by Formulaic.
Recall that the goal of a formula is to act as a recipe for building a \"model matrix\" (also known as a \"design matrix\") from an existing dataset. Following the recipe should result in a dataset that consists only of numerical columns that can be linearly combined to model an outcome/response of interest (the coefficients of which typically being estimated using linear regression). As such, this process will bake in any desired non-linearity via interactions or transforms, and will encode nominal/categorical/factor data as a collection of numerical contrasts.
The ingredients of each formula are the columns of the original dataset, and each operator acting on these columns in the formula should be thought of as inclusion/exclusion of the column in the resulting model matrix, or as a transformation on the column(s) prior to inclusion. Thus, a +
operator does not act in its usual algebraic manner, but rather acts as set union, indicating that both the left- and right-hand arguments should be included in the model matrix; a -
operator acts like a set difference; and so on.
Formulas in Formulaic are represented by (subclasses of) the Formula
class. Instances of Formula
subclasses are a ultimately containers for sets of Term
instances, which in turn are a container for a set of Factor
instances. Let's start our dissection at the bottom, and work our way up.
from formulaic.parser.types import Factor\n\nFactor(\n \"1\", eval_method=\"literal\"\n) # a factor that represents the numerical constant of 1\nFactor(\"a\") # a factor that will be looked up from the data context\nFactor(\n \"a + b\", eval_method=\"python\"\n) # a factor that will return the sum of `a` and `b`\nfrom formulaic.parser.types import Factor Factor( \"1\", eval_method=\"literal\" ) # a factor that represents the numerical constant of 1 Factor(\"a\") # a factor that will be looked up from the data context Factor( \"a + b\", eval_method=\"python\" ) # a factor that will return the sum of `a` and `b` Out[1]:
a + bIn\u00a0[2]: Copied!
from formulaic.parser.types import Term\n\nTerm(factors=[Factor(\"b\"), Factor(\"a\"), Factor(\"c\")])\nfrom formulaic.parser.types import Term Term(factors=[Factor(\"b\"), Factor(\"a\"), Factor(\"c\")]) Out[2]:
b:a:c
Note that to ensure uniqueness in the representation, the factor instances are sorted.
In\u00a0[3]: Copied!from formulaic import Formula\n\n# Unstructured formula (a simple list of terms)\nf = Formula(\n [\n Term(factors=[Factor(\"c\"), Factor(\"d\"), Factor(\"e\")]),\n Term(factors=[Factor(\"a\"), Factor(\"b\")]),\n ]\n)\nf\nfrom formulaic import Formula # Unstructured formula (a simple list of terms) f = Formula( [ Term(factors=[Factor(\"c\"), Factor(\"d\"), Factor(\"e\")]), Term(factors=[Factor(\"a\"), Factor(\"b\")]), ] ) f Out[3]:
a:b + c:d:e
Note that unstructured formulae are actually instances of SimpleFormula
(a Formula
subclass that acts like a mutable list of Term
instances):
type(f), list(f)\ntype(f), list(f) Out[4]:
(formulaic.formula.SimpleFormula, [a:b, c:d:e])
Also note that in its standard representation, the terms are separated by \"+\" which is interpreted as the set union in this context, and that (as we have seen for Term
instances) Formula
instances are sorted (the default is to sort terms only by interaction order, but this can be customized and/or disabled, as described below).
Structured formula are constructed similary:
In\u00a0[5]: Copied!f = Formula(\n [\n Term(factors=[Factor(\"root_col\")]),\n ],\n my_substructure=[\n Term(factors=[Factor(\"sub_col\")]),\n ],\n nested=Formula(\n [\n Term(factors=[Factor(\"nested_col\")]),\n Term(factors=[Factor(\"another_nested_col\")]),\n ],\n really_nested=[\n Term(factors=[Factor(\"really_nested_col\")]),\n ],\n ),\n)\nf\nf = Formula( [ Term(factors=[Factor(\"root_col\")]), ], my_substructure=[ Term(factors=[Factor(\"sub_col\")]), ], nested=Formula( [ Term(factors=[Factor(\"nested_col\")]), Term(factors=[Factor(\"another_nested_col\")]), ], really_nested=[ Term(factors=[Factor(\"really_nested_col\")]), ], ), ) f Out[5]:
root:\n root_col\n.my_substructure:\n sub_col\n.nested:\n root:\n nested_col + another_nested_col\n .really_nested:\n really_nested_col
Structured formulae are instances of StructuredFormula
:
type(f)\ntype(f) Out[6]:
formulaic.formula.StructuredFormula
And the sub-formula can be selected using:
In\u00a0[7]: Copied!f.root\nf.root Out[7]:
root_colIn\u00a0[8]: Copied!
f.nested\nf.nested Out[8]:
root:\n nested_col + another_nested_col\n.really_nested:\n really_nested_col
Formulae can also have different ordering conventions applied to them. By default, Formulaic follows R conventions around ordering, whereby terms are sorted by their interaction degree (number of factors) and then by the order in which they appeared in the term list. This behaviour can be modified to perform no ordering or full lexical sorting of terms and factors by passing _ordering=\"none\"
or _ordering=\"sort\"
respectively to the Formula
constructor. The default ordering is equivalent to passing _ordering=\"degree\"
. For example:
{\n \"degree\": Formula(\"z + z:a + z:b:a + g\"),\n \"none\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"none\"),\n \"sort\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"sort\"),\n}\n{ \"degree\": Formula(\"z + z:a + z:b:a + g\"), \"none\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"none\"), \"sort\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"sort\"), } Out[9]:
{'degree': 1 + z + g + z:a + z:b:a,\n 'none': 1 + z + z:a + z:b:a + g,\n 'sort': 1 + g + z + a:z + a:b:z}
Formulaic intentionally makes the tokenization phase as unopinionated and unstructured as possible. This allows formula grammars to be extended via plugins using only high-level APIs (usually Operator
s).
The tokenizer's role is to take an arbitrary string representation of a formula and convert it into a series of Token
instances. The tokenization phase knows very little about formula grammar except that whitespace doesn't matter and that non-word characters should be treated as operators or context indicators. Interpretation of these tokens is left to the AST generation phase. There are five different kinds of tokens: operator, name, value, python, and context. Context tokens indicate the opening and closing of nested contexts, including parentheses ()
and square brackets []
.The tokenizer treats text quoted with `
characters as a name token, and {}
are used to quote Python operations.
An example of the tokens generated can be seen below:
In [10]:
from formulaic.parser import DefaultFormulaParser

[
    f"{token.token} : {token.kind.value}"
    for token in (
        DefaultFormulaParser(include_intercept=False).get_tokens(
            "y ~ 1 + b:log(c) | `d$in^df` + {e + f}"
        )
    )
]

Out[10]:
['y : name',
 '~ : operator',
 '1 : value',
 '+ : operator',
 'b : name',
 ': : operator',
 'log(c) : python',
 '| : operator',
 'd$in^df : name',
 '+ : operator',
 'e + f : python']
The next phase is to assemble an abstract syntax tree (AST) from the tokens output above which, when evaluated, will generate the Term
instances we need to build a formula. This is done using an enriched shunting yard algorithm which determines how to interpret each operator token based on the symbol used, the number and position of the non-operator arguments, and the current context (i.e. how many parentheses deep we are). This allows us to disambiguate between, for example, unary and binary addition operators. The available operators and their implementations are described in more detail in the Formula Grammar section of this documentation. It is worth noting that the available operators can easily be modified at runtime, and this is typically all that needs to be modified in order to add new formula grammars.
The result is an AST that looks something like:
In [11]:
DefaultFormulaParser().get_ast("y ~ a + b:c")

Out[11]:
<ASTNode ~: [y, <ASTNode +: [<ASTNode +: [1, a]>, <ASTNode :: [b, c]>]>]>
Now that we have the AST, we can readily evaluate it to generate the Term
instances we need to pass to our Formula
constructor. For example:
terms = DefaultFormulaParser(include_intercept=False).get_terms("y ~ a + b:c")
terms

Out[12]:
.lhs:
    {y}
.rhs:
    {a, b:c}

In [13]:
Formula(terms)

Out[13]:
.lhs:
    y
.rhs:
    a + b:c
Of course, manually building the terms and passing them to the formula constructor is a bit annoying, and so you can instead pass the string directly to the Formula
constructor, and override the default parser if you so desire (though 99.9% of the time this wouldn't be necessary).
Thus, we can generate the same formula from above using:
In [14]:
Formula("y ~ a + b:c", _parser=DefaultFormulaParser(include_intercept=False))

Out[14]:
.lhs:
    y
.rhs:
    a + b:c
Once you have a Formula
instance, the next logical step is to use it to materialize a model matrix. This is usually as simple as passing the raw data as an argument to .get_model_matrix()
:
import pandas

data = pandas.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9], "A": ["a", "b", "c"]}
)
Formula("a + b:c").get_model_matrix(data)

Out[15]:
   Intercept  a  b:c
0        1.0  1   28
1        1.0  2   40
2        1.0  3   54
Just as for formulae, the model matrices can be structured, and will be structured in the same way as the original formula. For example:
In\u00a0[16]: Copied!Formula(\"a\", group=\"b+c\").get_model_matrix(data)\nFormula(\"a\", group=\"b+c\").get_model_matrix(data) Out[16]:
root:\n Intercept a\n 0 1.0 1\n 1 1.0 2\n 2 1.0 3\n.group:\n b c\n 0 4 7\n 1 5 8\n 2 6 9
Under the hood, both of these calls have looked at the type of the data (pandas.DataFrame
here) and then looked up the FormulaMaterializer
associated with that type (PandasMaterializer
here), and then passed the formula and data along to the materializer for materialization. It is also possible to request a specific output type that varies by materializer (PandasMaterializer
supports \"pandas\", \"numpy\", and \"sparse\"). If one is not selected, the first available output type is selected for you. Thus, the above code is equivalent to:
from formulaic.materializers import PandasMaterializer

PandasMaterializer(data).get_model_matrix(Formula("a + b:c"), output="pandas")

Out[17]:
   Intercept  a  b:c
0        1.0  1   28
1        1.0  2   40
2        1.0  3   54
The return type of .get_model_matrix()
is either a ModelMatrix
instance if the original formula was unstructured, or a ModelMatrices
instance that is just a structured container for ModelMatrix
instances. However, ModelMatrix
is an ObjectProxy subclass, and so it also acts like the type of object requested. For example:
import numpy

from formulaic import ModelMatrix

mm = Formula("a + b:c").get_model_matrix(data, output="numpy")
isinstance(mm, ModelMatrix), isinstance(mm, numpy.ndarray)

Out[18]:
(True, True)
The main purpose of this additional proxy layer is to expose the ModelSpec
instance associated with the materialization, which retains all of the encoding choices made during materialization (for reuse in subsequent materializations), as well as metadata about the feature names of the current model matrix (which is very useful when your model matrix output type doesn't have column names, like numpy or sparse arrays). This ModelSpec
instance is always available via .model_spec
, and is introduced in more detail in the Model Specs section of this documentation.
mm.model_spec

Out[19]:
ModelSpec(formula=1 + a + b:c, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b:c, scoped_terms=[b:c], columns=['b:c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})
It is sometimes convenient to have the columns in the final model matrix be clustered by numerical factors included in the terms. This means that in regression reports, for example, all of the columns related to a particular feature of interest (including its interactions with various categorical features) are contiguously clustered. This is the default behaviour in patsy. You can perform this clustering in Formulaic by passing the cluster_by=\"numerical_factors\"
argument to model_matrix
or any of the .get_model_matrix(...)
methods. For example:
Formula(\"a + b + a:A + A:b\").get_model_matrix(data, cluster_by=\"numerical_factors\")\nFormula(\"a + b + a:A + A:b\").get_model_matrix(data, cluster_by=\"numerical_factors\") Out[20]: Intercept a a:A[T.b] a:A[T.c] b A[T.b]:b A[T.c]:b 0 1.0 1 0 0 4 0 0 1 1.0 2 2 0 5 5 0 2 1.0 3 0 3 6 0 6"},{"location":"guides/formulae/#anatomy-of-a-formula","title":"Anatomy of a Formula\u00b6","text":""},{"location":"guides/formulae/#factor","title":"Factor\u00b6","text":"
Factor
instances are the atomic unit of a formula, and represent the output of a single expression evaluation. Typically this will be one vector of data, but could also be more than one column (especially common with categorically encoded data).
A Factor
instance's expression can be evaluated in one of three ways: as a literal (e.g. a numerical or string constant), as a lookup (the name of a column to be retrieved from the data context), or as a python expression (arbitrary Python code to be evaluated against the data context).
Note: Factor instances act as metadata only, and are not directly responsible for doing the evaluation. This is handled in a backend-specific way by the appropriate Materializer
instance.
In code, instantiating a factor looks like:
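from formulaic.parser.types import Factor

Factor("1", eval_method="literal")  # the numerical constant 1
Factor("a")  # looked up from the data context
Factor("a + b", eval_method="python")  # evaluates the sum of `a` and `b`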
"},{"location":"guides/formulae/#term","title":"Term\u00b6","text":"Term
instances are a thin wrapper around a set of Factor
instances, and represent the Cartesian (or Kronecker) product of the factors. If all of the Factor
instances evaluate to single columns, then the Term
represents the product of all of the factor columns.
Instantiating a Term
looks like:
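from formulaic.parser.types import Factor, Term

Term(factors=[Factor("b"), Factor("a"), Factor("c")])  # displays as: b:a:c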
Formula
instances are (potentially nested) wrappers around collections of Term
instances. During materialization into a model matrix, each Term
instance will have its columns independently inserted into the resulting matrix.
Formula
instances can consist of a single \"list\" of Term
instances, or may be \"structured\"; for example, we may want a separate collection of terms for the left- and right-hand side of a formula; or to simultaneously construct multiple model matrices for different parts of our modeling process.
For example, an unstructured formula might look like:
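from formulaic import Formula
from formulaic.parser.types import Factor, Term

Formula(
    [
        Term(factors=[Factor("c"), Factor("d"), Factor("e")]),
        Term(factors=[Factor("a"), Factor("b")]),
    ]
)  # displays as: a:b + c:d:e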
"},{"location":"guides/formulae/#parsed-formulae","title":"Parsed Formulae\u00b6","text":"While it would be possible to always manually construct Formula
instances in this way, it would quickly grow tedious. As you might have guessed from reading the quickstart or from using other implementations, this is where Wilkinson formulae come in. Formulaic has a rich, extensible formula parser that converts string expressions into the formula structures you see above. Where functionality and grammar overlap, it tries to conform to existing patterns found in R and patsy.
Formula parsing happens in three phases: tokenization of the formula string; assembly of an abstract syntax tree (AST) from the tokens; and evaluation of that AST into the Term
instances needed to build a formula. In the sections below these phases are described in more detail.
"},{"location":"guides/formulae/#tokenization","title":"Tokenization\u00b6","text":""},{"location":"guides/formulae/#abstract-syntax-tree-ast","title":"Abstract Syntax Tree (AST)\u00b6","text":""},{"location":"guides/formulae/#evaluation","title":"Evaluation\u00b6","text":""},{"location":"guides/formulae/#materialization","title":"Materialization\u00b6","text":""},{"location":"guides/grammar/","title":"Formula Grammar","text":"This section of the documentation describes the formula grammar used by Formulaic. It is almost identical that used by patsy and R, and so most formulas should work without modification. However, there are some differences, which are called out below.
"},{"location":"guides/grammar/#operators","title":"Operators","text":"In this section, we introduce a complete list of the grammatical operators that you can use by default in your formulas. They are listed such that each section (demarcated by \"-----\") has higher precedence then the block that follows. When you write a formula involving several operators of different precedence, those with higher precedence will be resolved first. \"Arity\" is the number of arguments the operator takes. Within operators of the same precedence, all binary operators are evaluated from left to right (they are left-associative). To highlight differences in grammar betweeh formulaic, patsy and R, we highlight any differences below. If there is a checkmark the Formulaic, Patsy and R columns, then the grammar is consistent across all three, unless otherwise indicated.
Operator Arity Description Formulaic Patsy R\"...\"
1 1 String literal. \u2713 \u2713 \u2717 [0-9]+\\.[0-9]+
1 1 Numerical literal. \u2713 \u2717 \u2717 `...`
1 1 Quotes fieldnames within the incoming dataframe, allowing the use of special characters, e.g. `my|special$column!`
\u2713 \u2717 \u2713 {...}
1 1 Quotes python operations, as a more convenient way to do Python operations than I(...)
, e.g. {`my|col`**2}
\u2713 \u2717 \u2717 <function>(...)
1 1 Python transform on column, e.g. my_func(x)
which is equivalent to {my_func(x)}
\u27132 \u2713 \u2717 ----- (...)
1 Groups operations, overriding normal precedence rules. All operations with the parentheses are performed before the result of these operations is permitted to be operated upon by its peers. \u2713 \u2717 \u2713 ----- .
9 0 Stands in as a wild-card for the sum of variables in the data not used on the left-hand side of a formula. \u2713 \u2717 \u2713 ----- **
2 Includes all n-th order interactions of the terms in the left operand, where n is the (integral) value of the right operand, e.g. (a+b+c)**2
is equivalent to a + b + c + a:b + a:c + b:c
. \u2713 \u2713 \u2713 ^
2 Alias for **
. \u2713 \u27173 \u2713 ----- :
2 Adds a new term that corresponds to the interaction of its operands (i.e. their elementwise product). \u27134 \u2713 \u2713 ----- *
2 Includes terms for each of the additive and interactive effects of the left and right operands, e.g. a * b
is equivalent to a + b + a:b
. \u2713 \u2713 \u2713 /
2 Adds terms describing nested effects. It expands to the addition of a new term for the left operand and the interaction of all left operand terms with the right operand, i.e. a / b
is equivalent to a + a:b
, (a + b) / c
is equivalent to a + b + a:b:c
, and a/(b+c)
is equivalent to a + a:b + a:c
.5 \u2713 \u2713 \u2713 %in%
2 As above, but with arguments inverted: e.g. b %in% a
is equivalent to a / b
. \u2713 \u2717 \u2713 ----- +
2 Adds a new term to the set of features. \u2713 \u2713 \u2713 -
2 Removes a term from the set of features (if present). \u2713 \u2713 \u2713 +
1 Returns the current term unmodified (not very useful). \u2713 \u2713 \u2713 -
1 Negates a term (only implemented for 0, in which case it is replaced with 1
). \u2713 \u2713 \u2713 ----- \\|
2 Splits a formula into multiple parts, allowing the simultaneous generation of multiple model matrices. When on the right-hand-side of the ~
operator, all parts will attract an additional intercept term by default. \u2713 \u2717 \u27136 ----- ~
1,2 Separates the target features from the input features. If absent, it is assumed that we are considering only the input features. Unless otherwise indicated, it is assumed that the input features implicitly include an intercept. \u2713 \u2713 \u2713 [ . ~ . ]
2 [Experimental] Multi stage formula notation, which is useful in (e.g.) IV contexts. Requires the MULTISTAGE
feature flag to be passed to the parser. \u2713 \u2717 \u2717"},{"location":"guides/grammar/#transforms","title":"Transforms","text":"Formulaic supports arbitrary transforms, any of which can also preserve state so that new data can undergo the same transformation as that used during modelling. The currently implemented transforms are shown below. Commonly used transforms that have not been implemented by formulaic
are explicitly noted also.
I(...)
Identity transform, allowing arbitrary Python/R operations, e.g. I(x+y)
. Note that in formulaic
, it is more idiomatic to use {x+y}
. \u2713 \u2713 \u2713 Q('<column_name>')
Look up feature by potentially exotic name, e.g. Q('wacky name!')
. Note that in formulaic
, it is more idiomatic to use `wacky name!`
. \u2713 \u2713 \u2717 C(...)
Categorically encode a column, e.g. C(x)
\u2713 \u2713 \u2713 center(...)
Shift column data so mean is zero. \u2713 \u2713 \u2717 scale(...)
Shift column so mean is zero and variance is 1. \u2713 \u27137 \u2713 standardize(...)
Alias of scale
. \u27138 \u2713 \u2717 lag(...[, <k>])
Generate lagging or leading columns (useful for datasets collected at regular intervals). \u2713 \u2717 \u2713 poly(...)
Generates a polynomial basis, allowing non-linear fits. \u2713 \u2717 \u2713 bs(...)
Generates a B-Spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 cs(...)
Generates a natural cubic spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 cr(...)
Alias for cs
above. \u2713 \u2717 \u2713 cc(...)
Generates a cyclic cubic spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 te(...)
Generates a tensor product smooth. \u2717 \u2713 \u2713 hashed(...)
Categorically encode a deterministic hash of a column. \u2713 \u2717 \u2717 ... Others? Contributions welcome! ? ? ? Tip
Any function available in the context
dictionary will also be available as a transform, along with some commonly used functions imported from numpy: log
, log10
, log2
, exp
, exp10
, and exp2
. In addition the numpy
module is always available as np
. Thus, formulas like: log(y) ~ x + 10
will always do the right thing, even when these functions have not been made available in the user namespace.
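For example (a minimal sketch, assuming only that pandas is installed alongside formulaic):

import pandas
from formulaic import model_matrix

df = pandas.DataFrame({"y": [1.0, 2.0, 4.0], "x": [0.1, 0.2, 0.3]})
# `log` and `center` are available without importing anything into the user namespace:
model_matrix("log(y) ~ center(x)", df)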
Note
Formulaic does not (yet) support including extra terms in the formula that will not result in additions to the dataframe, for example model annotations like R's offset(...)
.
Beyond the formula operator grammar itself there are some differing behaviours and conventions of which you should be aware.
Formula
R package in that both sides of the ~
operator are considered to be using the formula grammar, with the only difference being that the right-hand side attracts an intercept by default. In vanilla R, the left hand side is treated as R code (and so x + y ~ z
would result in a single column on the left-hand-side). You can recover vanilla R's behaviour by nesting the operations in a Python operator block (as described in the operator table): {y1 + y2} ~ a + b
.formulaic
with the same set of fields will always generate the same model matrix.In R, b-1
and (b-1)
both do not have an intercept, whereas in Formulaic and Patsy the parentheses are resolved first, and so the first does not have an intercept and the second does (because '1 +' is implicitly prepended to the right-hand side of the formula).This \"operator\" is actually part of the tokenisation process.\u00a0\u21a9\u21a9\u21a9\u21a9\u21a9
Formulaic additionally supports quoted fields with special characters, e.g. my_func(`my|special+column`)
.\u00a0\u21a9
The caret operator is not supported, but will not cause an error. It is ignored by the patsy formula parser, and treated as a Python XOR operation on the column.\u00a0\u21a9
Note that Formulaic also allows you to use this to scale columns, for example: 2.5:a
(this scaling happens after factor coding).\u00a0\u21a9
This somewhat confusing operator is useful when you want to include hierarchical features in your data, and where certain interaction terms do not make sense (particularly in ANOVA contexts). For example, if a
represents countries, and b
represents cities, then the full product of terms from a * b === a + b + a:b
does not make sense, because any value of b
is guaranteed to coincide with a value in a
, and does not independently add value. Thus, the operation a / b === a + a:b
results in a more sensible dataset. As a result, the /
operator is right-distributive, since if b
and c
were both nested in a
, you would want a/(b+c) === a + a:b + a:c
. Likewise, the operator is not left-distributive, since if c
is nested under both a
and b
separately, then you want (a + b)/c === a + b + a:b:c
. Lastly, if c
is nested in b
, and b
is nested in a
, then you would want a/b/c === a + a:(b/c) === a + a:b + a:b:c
.\u00a0\u21a9
Implemented by an R package called Formula that extends the default formula syntax.\u00a0\u21a9
Patsy uses the rescale
keyword rather than scale
, but provides the same functionality.\u00a0\u21a9
For increased compatibility with patsy, we use patsy's signature for standardize
.\u00a0\u21a9
Requires additional context to be passed in when directly using the Formula
constructor. e.g. Formula(\"y ~ .\", context={\"__formulaic_variables_available__\": [\"x\", \"y\", \"z\"]})
; or you can use model_matrix
, ModelSpec.get_model_matrix()
, or FormulaMaterializer.get_model_matrix()
without further specification.\u00a0\u21a9
As Formulaic matures it is expected that it will be integrated directly into downstream projects where formula parsing is required. This is known to have already happened in the following high-profile projects:
Where direct integration has not yet happened, you can still use Formulaic in conjunction with other commonly used libraries. On this page, we will add various examples of how to achieve this. If you have done some integration work, please feel free to submit a PR that extends this documentation!
statsmodels is a popular toolkit hosting many different statistical models, tests, and exploration tools. The formula API in statsmodels
is currently based on patsy
. If you need the features found in Formulaic, you can use it directly to generate the model matrices, and use the regular API. For example:
import pandas
from statsmodels.api import OLS

from formulaic import model_matrix

data = pandas.DataFrame({"y": [0.1, 0.4, 3], "a": [1, 2, 3], "b": ["A", "B", "C"]})
y, X = model_matrix("y ~ a + b", data)
model = OLS(y, X)
results = model.fit()
print(results.summary())
                     OLS Regression Results
==============================================================================
Dep. Variable:                y        R-squared:             1.000
Model:                      OLS        Adj. R-squared:          nan
Method:           Least Squares        F-statistic:             nan
Date:          Fri, 14 Apr 2023        Prob (F-statistic):      nan
Time:                  21:15:59        Log-Likelihood:       102.01
No. Observations:             3        AIC:                  -198.0
Df Residuals:                 0        BIC:                  -200.7
Df Model:                     2
Covariance Type:      nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.7857        inf         -0        nan         nan         nan
a              0.8857        inf          0        nan         nan         nan
b[T.B]        -0.5857        inf         -0        nan         nan         nan
b[T.C]         1.1286        inf          0        nan         nan         nan
==============================================================================
Omnibus:            nan        Durbin-Watson:       0.820
Prob(Omnibus):      nan        Jarque-Bera (JB):    0.476
Skew:            -0.624        Prob(JB):            0.788
Kurtosis:         1.500        Cond. No.             6.94
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.

/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 3 samples were given.
  warn("omni_normtest is not valid with less than 8 observations; %i "
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: divide by zero encountered in divide
  return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: invalid value encountered in scalar multiply
  return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1687: RuntimeWarning: divide by zero encountered in scalar divide
  return np.dot(wresid, wresid) / self.df_resid

In [2]:
from typing import Iterable, List, Optional

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

from formulaic import Formula, FormulaSpec, ModelSpec


class FormulaicTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, formula: FormulaSpec):
        self.formula: Formula = Formula.from_spec(formula)
        self.model_spec: Optional[ModelSpec] = None
        if self.formula._has_structure:
            raise ValueError(
                f"Formula specification {repr(formula)} results in a structured formula, which is not supported."
            )

    def fit(self, X, y=None):
        """
        Generate the initial model spec by which subsequent X's will be
        transformed.
        """
        self.model_spec = self.formula.get_model_matrix(X).model_spec
        return self

    def transform(self, X, y=None):
        """
        Transform `X` by generating a model matrix from it based on the fit
        model spec.
        """
        if self.model_spec is None:
            raise RuntimeError(
                "`FormulaicTransformer.fit()` must be called before `.transform()`."
            )
        X_ = self.model_spec.get_model_matrix(X)
        return X_

    def get_feature_names_out(
        self, input_features: Optional[Iterable[str]] = None
    ) -> List[str]:
        """
        Expose model spec column names to scikit learn to allow column
        transforms later in the pipeline.
        """
        if self.model_spec is None:
            raise RuntimeError(
                "`FormulaicTransformer.fit()` must be called before columns can be assigned names."
            )
        return self.model_spec.column_names


pipe = Pipeline(
    [("formula", FormulaicTransformer("x1 + x2 + x3")), ("model", LinearRegression())]
)
pipe_fit = pipe.fit(
    pandas.DataFrame({"x1": [1, 2, 3], "x2": [2, 3.4, 6], "x3": [7, 3, 1]}),
    y=pandas.Series([1, 3, 5]),
)
pipe_fit
# Note: You could optionally serialize `pipe_fit` here.
# Then: Use the pipe to predict outcomes for new data.

Out[2]:
Pipeline(steps=[('formula', FormulaicTransformer(formula=1 + x1 + x2 + x3)),
                ('model', LinearRegression())])
"},{"location":"guides/integration/#statsmodels","title":"StatsModels\u00b6","text":""},{"location":"guides/integration/#scikit-learn","title":"Scikit-Learn\u00b6","text":"
scikit-learn is a very popular machine learning toolkit for Python. You can use Formulaic directly, as for statsmodels
, or as a module in scikit-learn pipelines along the lines of:
Sooner or later, you will encounter datasets with null values, and it is important to know how their presence will impact your modeling. Formulaic model matrix materialization procedures allow you to specify how you want nulls to be handled. You can either: drop rows containing null values (the default), ignore the nulls and allow them to propagate into the model matrix, or raise an exception when null values are encountered.
You can specify the desired behaviour by passing an NAAction
enum value (or string value thereof) to the materialization methods (model_matrix
, and *.get_model_matrix()
). Examples of each of these approaches are shown below.
from pandas import Categorical, DataFrame

from formulaic import model_matrix
from formulaic.materializers import NAAction

df = DataFrame(
    {
        "c": [1, 2, None, 4, 5],
        "C": Categorical(
            ["a", "b", "c", None, "e"], categories=["a", "b", "c", "d", "e"]
        ),
    }
)

model_matrix("c + C", df, na_action=NAAction.DROP)
# Equivalent to:
# * model_matrix("c + C", df)
# * model_matrix("c + C", df, na_action="drop")

Out[36]:
   Intercept    c  C[T.b]  C[T.c]  C[T.d]  C[T.e]
0        1.0  1.0       0       0       0       0
1        1.0  2.0       1       0       0       0
4        1.0  5.0       0       0       0       1
You can also specify additional rows to drop using the drop_rows
argument:
model_matrix(\"c + C\", df, drop_rows={0, 4})\nmodel_matrix(\"c + C\", df, drop_rows={0, 4}) Out[24]: Intercept c C[T.b] C[T.c] C[T.d] C[T.e] 1 1.0 2.0 1 0 0 0
Note that the set passed to drop_rows
is expected to be mutable, as it will also be updated with the indices of any rows dropped automatically; this can be useful if you need to keep track of this information outside of the materialization procedure.
drop_rows = {0, 4}
model_matrix("c + C", df, drop_rows=drop_rows)
drop_rows

Out[25]:
{0, np.int64(2), np.int64(3), 4}

In [31]:
model_matrix(\"c + C\", df, na_action=\"ignore\")\nmodel_matrix(\"c + C\", df, na_action=\"ignore\") Out[31]: Intercept c C[T.b] C[T.c] C[T.d] C[T.e] 0 1.0 1.0 0 0 0 0 1 1.0 2.0 1 0 0 0 2 1.0 NaN 0 1 0 0 3 1.0 4.0 0 0 0 0 4 1.0 5.0 0 0 0 1
Note the NaN
in the c
column, and that NaN
does NOT appear in the dummy coding of C on row 3, consistent with standard implementations of dummy coding. This could result in misleading model estimates, so care should be taken.
You can combine this with drop_rows
, as described above, to manually filter out the null values you are concerned about.
try:
    model_matrix("c + C", df, na_action="raise")
except Exception as e:
    print(e)
Error encountered while checking for nulls in `C`: `C` contains null values after evaluation.\n
As with ignoring nulls above, you can combine this raising behaviour with drop_rows
to manually filter out the null values that you feel you can safely ignore, and then raise if any additional null values make it into your data.
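For example (a minimal sketch, continuing with the df defined above):

# Pre-approve dropping rows 2 and 3 (whose nulls we know about), and raise
# only if any *other* null values are encountered:
model_matrix("c + C", df, drop_rows={2, 3}, na_action="raise")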
NAAction.DROP
, or \"drop\"
)\u00b6","text":"This is the default behaviour, and will result in any row with a null in any column that is being used by the materialization being dropped from the resulting dataset. For example:
"},{"location":"guides/missing_data/#ignore-nulls-naactionignore-or-ignore","title":"Ignore nulls (NAAction.IGNORE
, or \"ignore\"
)\u00b6","text":"If your modeling toolkit can handle the presence of nulls, or you otherwise want to keep them in the dataset, you can pass na_action = \"ignore\"
to the materialization methods. This will allow null values to remain in columns, and take no action to prevent the propagation of nulls.
NAAction.RAISE
or \"raise\"
)\u00b6","text":"If you are unwilling to risk the perils of dropping or ignoring null values, you can instead opt to raise an exception whenever a null value is found. This can prevent yourself from accidentally biasing your model, but also makes your code more brittle. For example:
"},{"location":"guides/model_specs/","title":"Model Specs","text":"While Formula
instances (discussed in How it works) are the source of truth for abstract user intent, ModelSpec
instances are the source of truth for the materialization process; and bundle a Formula
instance with explicit metadata about the encoding choices that were made (or should be made) when a formula was (or will be) materialized. As soon as materialization begins, Formula
instances are upgraded into ModelSpec
instances, and any missing metadata is attached as decisions are made during the materialization process.
Besides acting as runtime state during materialization, it serves two main purposes: it provides rich metadata about the generated model matrices, and it allows the materialization process to be replayed on new data. Once a Formula
has been materialized, you can use the generated ModelSpec
instance to repeat the process on similar datasets, confident that the encoding choices will be identical. This is especially useful during out-of-sample prediction, where you need to prepare the out-of-sample data in exactly the same way as the training data for the predictions to be valid. In the remainder of this portion of the documentation, we will introduce how to leverage the metadata stored inside ModelSpec
instances derived from materializations, and for more advanced programmatic use-cases, how to manually build a ModelSpec
.
# Let's get ourselves a simple `ModelMatrix` instance to play with.
from pandas import DataFrame

from formulaic import model_matrix

mm = model_matrix("center(a) + b", DataFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]}))
mm

Out[1]:
   Intercept  center(a)  b[T.B]  b[T.C]
0        1.0       -1.0       0       0
1        1.0        0.0       1       0
2        1.0        1.0       0       1

In [2]:
# And extract the model spec from it
ms = mm.model_spec
ms

Out[2]:
ModelSpec(formula=1 + center(a) + b, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=center(a), scoped_terms=[center(a)], columns=['center(a)']), EncodedTermStructure(term=b, scoped_terms=[b-], columns=['b[T.B]', 'b[T.C]'])], transform_state={'center(a)': {'ddof': 1, 'center': np.float64(2.0), 'scale': None}}, encoder_state={'center(a)': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C'], 'contrasts': ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])})})In\u00a0[3]: Copied!
# We can now interrogate it for various column, factor, term, and variable related metadata
{
    "column_names": ms.column_names,
    "column_indices": ms.column_indices,
    "terms": ms.terms,
    "term_indices": ms.term_indices,
    "term_slices": ms.term_slices,
    "term_factors": ms.term_factors,
    "term_variables": ms.term_variables,
    "factors": ms.factors,
    "factor_terms": ms.factor_terms,
    "factor_variables": ms.factor_variables,
    "factor_contrasts": ms.factor_contrasts,
    "variables": ms.variables,
    "variable_terms": ms.variable_terms,
    "variable_indices": ms.variable_indices,
    "variables_by_source": ms.variables_by_source,
}

Out[3]:
{'column_names': ('Intercept', 'center(a)', 'b[T.B]', 'b[T.C]'),
 'column_indices': {'Intercept': 0, 'center(a)': 1, 'b[T.B]': 2, 'b[T.C]': 3},
 'terms': [1, center(a), b],
 'term_indices': {1: [0], center(a): [1], b: [2, 3]},
 'term_slices': {1: slice(0, 1, None),
  center(a): slice(1, 2, None),
  b: slice(2, 4, None)},
 'term_factors': {1: {1}, center(a): {center(a)}, b: {b}},
 'term_variables': {1: set(), center(a): {'a', 'center'}, b: {'b'}},
 'factors': {1, b, center(a)},
 'factor_terms': {1: {1}, center(a): {center(a)}, b: {b}},
 'factor_variables': {b: {'b'}, 1: set(), center(a): {'a', 'center'}},
 'factor_contrasts': {b: ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])},
 'variables': {'a', 'b', 'center'},
 'variable_terms': {'center': {center(a)}, 'a': {center(a)}, 'b': {b}},
 'variable_indices': {'center': [1], 'a': [1], 'b': [2, 3]},
 'variables_by_source': {'transforms': {'center'}, 'data': {'a', 'b'}}}

In [4]:
# And use it to select out various parts of the model matrix; here the columns
# produced by the `b` term.
mm.iloc[:, ms.term_indices["b"]]

Out[4]:
   b[T.B]  b[T.C]
0       0       0
1       1       0
2       0       1
Some of this metadata may seem redundant at first, but this kind of metadata is essential when the generated model matrix does not natively support indexing by names; for example:
In [5]:
mm_numpy = model_matrix(
    "center(a) + b", DataFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]}), output="numpy"
)
mm_numpy

Out[5]:
array([[ 1., -1.,  0.,  0.],
       [ 1.,  0.,  1.,  0.],
       [ 1.,  1.,  0.,  1.]])

In [6]:
ms_numpy = mm_numpy.model_spec
mm_numpy[:, ms_numpy.term_indices["b"]]

Out[6]:
array([[0., 0.],
       [1., 0.],
       [0., 1.]])

In [7]:
ms.get_model_matrix(DataFrame({"a": [4, 5, 6], "b": ["A", "B", "D"]}))

/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:169: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {'D'}. They are being cast to nan, which will likely skew the results of your analyses.
  warnings.warn(

Out[7]:
   Intercept  center(a)  b[T.B]  b[T.C]
0        1.0        2.0       0       0
1        1.0        3.0       1       0
2        1.0        4.0       0       0
Notice that when the assumptions of the stateful transforms are violated, warnings and/or exceptions will be generated.
You can also just pass the ModelSpec
directly to model_matrix
, for example:
model_matrix(ms, data=DataFrame({"a": [4, 5, 6], "b": ["A", "A", "A"]}))

Out[8]:
   Intercept  center(a)  b[T.B]  b[T.C]
0        1.0        2.0       0       0
1        1.0        3.0       0       0
2        1.0        4.0       0       0

In [9]:
from formulaic import ModelSpec

ms = ModelSpec("a+b+c", output="numpy", ensure_full_rank=False)
ms

Out[9]:
ModelSpec(formula=1 + a + b + c, materializer=None, materializer_params=None, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})In\u00a0[10]: Copied!
import pandas

mm = ms.get_model_matrix(
    pandas.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
)
mm

Out[10]:
array([[1., 1., 4., 7.],
       [1., 2., 5., 8.],
       [1., 3., 6., 9.]])

In [11]:
mm.model_spec

Out[11]:
ModelSpec(formula=1 + a + b + c, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=c, scoped_terms=[c], columns=['c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})
Notice that any missing fields not provided by the user are imputed automatically.
In [12]:
from formulaic import Formula, ModelSpecs

ModelSpecs(
    ModelSpec("a"), substructure=ModelSpec("b"), another_substructure=ModelSpec("c")
)

Out[12]:
root:
    ModelSpec(formula=1 + a, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.substructure:
    ModelSpec(formula=1 + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.another_substructure:
    ModelSpec(formula=1 + c, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})

In [13]:
ModelSpec.from_spec(Formula(lhs="y", rhs="a + b"))

Out[13]:
.lhs:
    ModelSpec(formula=y, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.rhs:
    ModelSpec(formula=a + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
Some operations, such as ModelSpec.subset(...)
are also accessible in a mapped way (e.g. via ModelSpecs.subset(...)
). You can find documentation for the complete set of available methods using help(ModelSpecs)
.
ModelSpec
and ModelSpecs
instances have been designed to support serialization via the standard pickling process offered by Python. This allows model specs to be persisted into storage and reloaded at a later time, or used in multiprocessing scenarios.
Serialized model specs are not guaranteed to work between different versions of formulaic. While things will work in the vast majority of cases, the internal state of transforms is free to change from version to version, and may invalidate previously serialized model specs. Efforts will be made to reduce the likelihood of this, and when it happens it should be indicated in the changelogs.
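For example, a minimal sketch of a pickle round-trip:

import pickle

from pandas import DataFrame
from formulaic import model_matrix

spec = model_matrix("center(a)", DataFrame({"a": [1, 2, 3]})).model_spec

# Persist and restore the spec (e.g. across processes or sessions), then
# re-use the stored transform state on new data:
restored = pickle.loads(pickle.dumps(spec))
restored.get_model_matrix(DataFrame({"a": [4, 5, 6]}))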
"},{"location":"guides/model_specs/#anatomy-of-a-modelspec-instance","title":"Anatomy of aModelSpec
instance.\u00b6","text":"As noted above, a ModelSpec
is the complete specification and record of the materialization process, combining all user-specified parameters with the runtime state of the materializer. In particular, ModelSpec
instances have the following explicitly specifiable attributes: formula, materializer, materializer_params, ensure_full_rank, na_action, output, cluster_by, structure, transform_state, and encoder_state (as seen in the ModelSpec reprs shown on this page).
Often, only formula
is explicitly specified, and the rest is inferred on the user's behalf.
ModelSpec
instances also have derived properties and methods that you can use to introspect the structure of generated model matrices. These derived methods assume that the ModelSpec
has been fully populated, and thus usually only make sense to consider on ModelSpec
instances that are attached to a ModelMatrix
. They are:
.column_indices
.Term
instances that were used to generate this model matrix.Term
instances to the generated column indices..term_indices
using formulae.Term
instances to a slice that when used on the columns of the model matrix will subsample the model matrix down to those corresponding to each term.Term
instances to the set of factors used by that term.Term
instances to Variable
instances (a string subclass with addition attributes of roles
and source
), indicating the variables used by that term.Factor
instances used in the entire formula.Factor
instances to the Term
instances that used them.Factor
instances to Variable
instances, corresponding to the variables used by that factor.Factor
instances to ContrastsState
instances that can be used to reproduce the coding matrices used during materialization.Variable
instances describing the variables used in the entire formula.term_variables
.Variable
instance to the indices of the columns in the model matrix associated with that variable..variable_indices
.\"data\"
, \"context\"
, or \"transforms\"
) to the variables derived from that source.Term
instance, its string representation, a column name, or pre-specified ints/slices.ModelSpec
instance with the nominated attributes mutated.ModelSpec
instance with its structure subset to correspond to a strict subset of the terms indicated by a formula specification.
for more details.
ModelSpec
as metadata\u00b6","text":"One of the most common use-cases for ModelSpec
instances is as metadata to describe a generated model matrix. This metadata can be used to programmatically access the appropriate features in the model matrix in order (e.g.) to assign sensible names to the coefficients fit during a regression.
Another common use-case for ModelSpec
instances is replaying the same materialization process used to prepare a training dataset on a new dataset. Since the ModelSpec
instance stores all relevant choices made during materialization achieving this is a simple as using using the ModelSpec
to generate the new model matrix.
By way of example, recall from above section that we used the formula
center(a) + b
where a
was a numerical vector, and b
was a categorical vector. When generating model matrices for subsequent datasets it is very important to use the same centering used during the initial model matrix generation, and not just center the incoming data again. Likewise, b
should be aware of which categories were present during the initial training, and ensure that the same columns are created during subsequent materializations (otherwise the model matrices will not be of the same form, and cannot be used for predictions/etc). These kinds of transforms that require memory are called \"stateful transforms\" in Formulaic, and are described in more detail in the Transforms documentation.
We can see this in action below:
"},{"location":"guides/model_specs/#directly-constructing-modelspec-instances","title":"Directly constructingModelSpec
instances\u00b6","text":"It is possible to directly construct Model Matrices, and to prepopulate them with various choices (e.g. output types, materializer, etc). You could even, in principle, populate them with state information (but this is not recommended; it is easy to make mistakes here, and is likely better to encode these choices into the formula itself where possible). For example:
"},{"location":"guides/model_specs/#structured-modelspecs","title":"StructuredModelSpecs
\u00b6","text":"As discussed in How it works, formulae can be arbitrarily structured, resulting in a similarly structured set of model matrices. ModelSpec
instances can also be arranged into a structured collection using ModelSpecs
, allowing different choices to be made at different levels of the structure. You can either create these structures yourself, or inherit the structure from a formula. For example:
This document provides high-level documentation on how to get started using Formulaic.
In [1]:
import pandas

from formulaic import model_matrix

df = pandas.DataFrame(
    {
        "y": [0, 1, 2],
        "a": ["A", "B", "C"],
        "b": [0.3, 0.1, 0.2],
    }
)

y, X = model_matrix("y ~ a + b + a:b", df)
# This is short-hand for:
# y, X = formulaic.Formula('y ~ a + b + a:b').get_model_matrix(df)

In [2]:
y

Out[2]:
   y
0  0
1  1
2  2

In [3]:
X

Out[3]:
   Intercept  a[T.B]  a[T.C]    b  a[T.B]:b  a[T.C]:b
0        1.0       0       0  0.3       0.0       0.0
1        1.0       1       0  0.1       0.1       0.0
2        1.0       0       1  0.2       0.0       0.2
You will notice that the categorical values for a
have been one-hot (aka dummy) encoded, and to ensure structural full-rankness of X
[^1], one level has been dropped from a
. For more details about how this guarantees that the matrix is full-rank, please refer to the excellent patsy documentation. If you are not using the model matrices for regression, and don't care if the matrix is not full-rank, you can pass ensure_full_rank=False
:
X = model_matrix(\"a + b + a:b\", df, ensure_full_rank=False)\nX\nX = model_matrix(\"a + b + a:b\", df, ensure_full_rank=False) X Out[4]: Intercept a[T.A] a[T.B] a[T.C] b a[T.A]:b a[T.B]:b a[T.C]:b 0 1.0 1 0 0 0.3 0.3 0.0 0.0 1 1.0 0 1 0 0.1 0.0 0.1 0.0 2 1.0 0 0 1 0.2 0.0 0.0 0.2
Note that the dropped level in a
has been restored.
There is a rich trove of information about the columns and structure of the model matrix stored in the ModelSpec
instance attached to the model matrix, for example:
X.model_spec

Out[5]:
ModelSpec(formula=1 + a + b + a:b, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a[T.A]', 'a[T.B]', 'a[T.C]']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=a:b, scoped_terms=[a:b], columns=['a[T.A]:b', 'a[T.B]:b', 'a[T.C]:b'])], transform_state={}, encoder_state={'a': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C']}), 'b': (<Kind.NUMERICAL: 'numerical'>, {})})
You can read more about the model specs in the Model Specs documentation.
In\u00a0[6]: Copied!X = model_matrix(\"a + b + a:b\", df, output=\"sparse\")\nX\nX = model_matrix(\"a + b + a:b\", df, output=\"sparse\") X Out[6]:
<3x6 sparse matrix of type '<class 'numpy.float64'>'\n\twith 10 stored elements in Compressed Sparse Column format>
In this example, X
is a $ 3 \\times 6 $ scipy.sparse.csc_matrix
instance.
Since sparse matrices do not have labels for columns, you can look these up from the model spec described above; for example:
In\u00a0[7]: Copied!X.model_spec.column_names\nX.model_spec.column_names Out[7]:
('Intercept', 'a[T.B]', 'a[T.C]', 'b', 'a[T.B]:b', 'a[T.C]:b')
[^1]: X
must be full-rank in order for the regression algorithm to invert a matrix derived from X
.
In formulaic
, the simplest way to build your model matrices is to use the high-level model_matrix
function:
By default, the generated model matrices are dense. In some cases, particularly in large datasets with many categorical features, dense model matrices become hugely memory inefficient (since most entries of the data will be zero). Formulaic allows you to directly generate sparse model matrices using:
"},{"location":"guides/splines/","title":"Spline Encoding","text":"Formulaic offers several spline encoding transforms that allow you to model non-linear responses to continuous variables using linear models. They are:
poly
: projection onto orthogonal polynomial basis functions.basis_spline
(bs
in formulae): projection onto a basis spline (B-spline) basis.cubic_spline (cc and cr in formulae): projection onto cyclic or natural cubic spline bases.These are all implemented as stateful transforms and described in more detail below.
In [1]:
import matplotlib.pyplot as plt
import numpy
import pandas
from statsmodels.api import OLS

from formulaic import model_matrix

# Build some data, and hard-code "y" as a quartic function with some Gaussian noise.
data = pandas.DataFrame(
    {
        "x": numpy.linspace(0.0, 1.0, 100),
    }
).assign(
    y=lambda df: df.x
    + 0.2 * df.x**2
    - 0.7 * df.x**3
    + 3 * df.x**4
    + 0.1 * numpy.random.randn(100)
)

# Generate a model matrix with a polynomial coding in "x".
y, X = model_matrix("y ~ poly(x, degree=3)", data, output="numpy")

# Fit coefficients for the intercept and polynomial basis.
coeffs = OLS(y[:, 0], X).fit().params

# Plot the polynomial basis functions, each weighted by its fit coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit curve itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit spline")

plt.legend();

In [11]:
# Generate a model matrix with a basis spline in "x".
y, X = model_matrix("y ~ bs(x, df=4, degree=3)", data, output="numpy")

# Fit coefficients for the intercept and basis spline.
coeffs = OLS(y[:, 0], X).fit().params

# Plot the B-spline basis functions, each weighted by its fit coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit spline itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit spline")

plt.legend();

In [18]:
x = numpy.linspace(0.0, 2 * numpy.pi, 100)

data = pandas.DataFrame(
    {
        "x": x,
    }
).assign(
    y=lambda df: 2
    + numpy.sin(x)
    - x * numpy.sin(2 * x)
    + 4 * numpy.sin(x / 7)
    + 0.1 * numpy.random.randn(100)
)

# Generate a model matrix with a cyclic cubic spline coding in "x".
y, X = model_matrix("y ~ 1 + cc(x, df=4, constraints='center')", data, output="numpy")

# Fit coefficients for the intercept and spline basis.
coeffs = OLS(y[:, 0], X).fit().params

# Plot the cubic spline basis functions, each weighted by its fitted coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit spline itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit spline")

plt.legend();
"},{"location":"guides/splines/#poly","title":"
poly
\u00b6","text":"The simplest way to generate a non-linear response in a linear model is to include higher order powers of numerical variables. For example, you might want to include: $x$, $x^2$, $x^3$, $\\ldots$, $x^n$. However, these features are not orthogonal, and so adding each term one by one in a regression model will lead to all previously trained coefficients changing. Especially in exploratory analysis, this can be frustrating, and that's where poly
comes in. By default, poly
iteratively builds orthogonal polynomial features up to the order specified.
poly
has two parameters:
- x: the numerical vector for which the polynomial basis should be generated.
- degree: the maximum degree of polynomial features to generate (e.g. poly(x, degree=3) as in the example above).
For those who are mathematically inclined, this transform is an implementation of the \"three-term recurrence relation\" for monic orthogonal polynomials. There are many good introductions to these recurrence relations, including: (at the time of writing) https://dec41.user.srcf.net/h/IB_L/numerical_analysis/2_3. Another common approach is QR factorisation where the columns of Q are the orthogonal basis vectors. A pre-existing implementation of this can be found in numpy
; however, our implementation outperforms numpy's QR decomposition, and does not require the needless computation of the R matrix. It should also be noted that orthogonal polynomial bases are unique up to the choice of inner-product and scaling, and so all methods will result in the same set of polynomials.
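For reference, the textbook form of this recurrence (a sketch in standard notation, not necessarily the exact conventions used internally) is: with \(p_{-1}(x) = 0\) and \(p_0(x) = 1\), the monic orthogonal polynomials satisfy \[ p_{k+1}(x) = (x - \alpha_k)\, p_k(x) - \beta_k\, p_{k-1}(x), \qquad \alpha_k = \frac{\langle x\, p_k, p_k \rangle}{\langle p_k, p_k \rangle}, \qquad \beta_k = \frac{\langle p_k, p_k \rangle}{\langle p_{k-1}, p_{k-1} \rangle}, \] where the inner products are evaluated empirically over the observed data points.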
Note
When used as a stateful transform, we retain the coefficients that uniquely define the polynomials; and so new data will be evaluated against the same polynomial bases as the original dataset. However, the polynomial basis will almost certainly *not* be orthogonal for the new data. This is because changing the incoming dataset is equivalent to changing your choice of inner product.
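To make this concrete, here is a minimal sketch (reusing the data and imports from the example cells above) of how the fitted state is carried over to new data:
# Fit the orthogonal polynomial basis on the original data...
mm = model_matrix("poly(x, degree=2)", data)

# ...and reuse the stored coefficients to encode new data: the same
# polynomial bases are applied, though they will generally not be
# orthogonal over the new sample.
new_data = pandas.DataFrame({"x": numpy.linspace(0.5, 1.5, 10)})
mm_new = mm.model_spec.get_model_matrix(new_data)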
Example:
"},{"location":"guides/splines/#basis_spline-or-bs","title":"basis_spline
(or bs
)\u00b6","text":"If you were to attempt to fit a complex function over a large domain using poly
, it is highly likely that you would need to use a very large degree for the polynomial basis. However, this can lead to overfitting and/or high-frequency oscillations (see Runge's phenomenon). An alternative approach is to use piece-wise polynomial curves of lower degree, with smoothness conditions on the \"knots\" between each of the polynomial pieces. This limits overfitting while still offering the flexibility required to model very complex non-linearities.
Basis-splines (or B-splines) are a popular choice for generating a basis for such polynomials, with many attractive features such as maximal smoothness around each of the knots, and minimal support given such smoothness.
Formulaic has its own implementation of basis_spline
that is API compatible (where features overlap) with R, and is more performant than existing Python implementations for our use-cases (such as splev
from scipy
). For compatibility with R
and patsy
, basis_spline
is available as bs
in formulae.
basis_spline
(or bs
) has eight parameters:
- x: The numerical vector to be transformed.
- df: If specified, the number of degrees of freedom to use for the spline. In this case, knots will be automatically generated such that they are df - degree (minus one if include_intercept is True) equally spaced quantiles. You cannot specify both df and knots.
- knots: The internal breakpoints of the B-Spline. If not specified, they default to the empty set (unless df is specified), in which case the ordinary polynomial (Bezier) basis is generated.
- degree: The degree of the piecewise polynomial components (default: 3).
- include_intercept: Whether to return a complete (full-rank) basis. Note that if ensure_full_rank=True is passed to the materializer, then the intercept will (depending on context) nevertheless be omitted.
- lower_bound: The lower bound of the domain of the B-Spline. If not specified, this is determined from x.
- upper_bound: The upper bound of the domain of the B-Spline. If not specified, this is determined from x.
- extrapolation: Selects how extrapolation should be performed when values in x extend beyond the lower and upper bounds. Valid values are:
  - 'raise' (the default): Raises a ValueError if there are any values in x outside the B-Spline domain.
  - 'clip': Any values above/below the domain are set to the upper/lower bounds.
  - 'na': Any values outside of bounds are set to numpy.nan.
  - 'zero': Any values outside of bounds are set to 0.
  - 'extend': Any values outside of bounds are computed by extending the polynomials of the B-Spline (this is the same as the default in R).
The algorithm used to generate the basis splines is a slightly generalised version of the "Cox-de Boor" algorithm, extended by this author to allow for extrapolations (although this author doubts this is terribly novel). If you would like to learn more about B-Splines, the primer put together by Jeffrey Racine is an excellent resource.
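As a quick illustration of the extrapolation options, a minimal sketch (reusing the data and imports from the example cells above):
# Fit the spline basis over the observed domain of `x`...
mm = model_matrix("bs(x, df=4, extrapolation='extend')", data)

# ...then evaluate the same basis on out-of-domain values; with 'extend',
# the polynomial pieces are extrapolated rather than raising an error.
wider = pandas.DataFrame({"x": numpy.linspace(-0.5, 1.5, 9)})
mm_wider = mm.model_spec.get_model_matrix(wider)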
As a stateful transform, we only keep track of knots
, lower_bound
and upper_bound
, which are sufficient given that all other information must be explicitly specified.
Example, reusing the data and imports from above:
"},{"location":"guides/splines/#cubic_spline-crcs-and-cc","title":"cubic_spline
(cr
/cs
and cc
)\u00b6","text":"While the basis_spline
transform above is capable of generating cubic splines, it is sometimes helpful to be able to generate cubic splines that satisfy various additional constraints (including direct constraints on the parameters of the spline, or indirect ones such as cyclicity). To that end, Formulaic implements direct support for generating natural and cyclic cubic splines with constraints (via the cr
/cs
and cc
transforms respectively), borrowing much of the implementation from patsy
. These splines are compatible with R's mgcv
, and share the nice features of the basis-splines above, including continuous first and second derivatives, and general applicability to interpolation/smoothing. Note that cr
and cs
generate identical splines, but are both included for compatibility with R.
In practice, the reason that we focus on cubic (as compared to quadratic or quartic) splines is that they offer a nice compromise: they are the lowest-degree splines with continuous first and second derivatives (discontinuities in higher derivatives are generally imperceptible to the human eye), while remaining flexible enough for most smoothing applications and cheap to compute.
All of cr
, cs
, and cc
are configurations of the cubic_spline
transform, and have seven parameters:
- x: The numerical vector to be transformed, as passed to cc or cr. If using an existing model_spec with a fitted transformation, the x values are only used to produce the locations for the fitted values of the spline.
- df: If specified, the number of degrees of freedom to use for the spline. In this case, knots will be automatically generated such that they are df - degree equally spaced quantiles. You cannot specify both df and knots.
- knots: The internal breakpoints of the spline. If not specified, they default to the empty set (unless df is specified).
- lower_bound: The lower bound of the domain of the spline. If not specified, this is determined from x.
- upper_bound: The upper bound of the domain of the spline. If not specified, this is determined from x.
- constraints: Either a 2-d array defining general linear constraints (that is, np.dot(constraints, betas) is zero, where betas denotes the array of initial parameters, corresponding to the initial unconstrained model matrix), or the string 'center' indicating that we should apply a centering constraint (this constraint will be computed from the input data, remembered and re-used for prediction from the fitted model). The constraints are absorbed in the resulting design matrix, which means that the model is actually rewritten in terms of unconstrained parameters.
- extrapolation: Selects how extrapolation should be performed when values in x extend beyond the lower and upper bounds. Valid values are:
  - 'raise' (the default): Raises a ValueError if there are any values in x outside the spline domain.
  - 'clip': Any values above/below the domain are set to the upper/lower bounds.
  - 'na': Any values outside of bounds are set to numpy.nan.
  - 'zero': Any values outside of bounds are set to 0.
  - 'extend': Any values outside of bounds are computed by extending the polynomials of the spline (this is the same as the default in R).
As a stateful transform, we only keep track of knots
, lower_bound
, upper_bound
, constraints
, and cyclic
which are sufficient given that all other information must be explicitly specified.
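For a natural (rather than cyclic) cubic spline the usage is identical; a minimal sketch, reusing the data and imports from the example cells above:
# Natural cubic spline basis with a centering constraint.
y, X = model_matrix("y ~ cr(x, df=4, constraints='center')", data, output="numpy")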
Example, reusing the data and imports from above:
"},{"location":"guides/transforms/","title":"Transforms","text":"A transform in Formulaic is any function that is called to modify factor values during the evaluation of a Factor
(see the How it works documentation). Any function can be used as a transform, so long as it is present in the evaluation context (see below).
There are two types of transform:
- Regular transforms: ordinary functions that are applied to the data anew each time, with no memory of previous invocations; for example, you could apply the numpy.cumsum function to any vector being fed into the model matrix materialization procedure.
- Stateful transforms: transforms that record any state generated during their first evaluation (such as the mean of a column), so that exactly the same transformation can be replayed on new data.
In the below we describe how to make a function available for use as a transform during materialization, demonstrate this for regular transforms, and then introduce how to use already implemented stateful transforms and/or write your own.
In [1]:
import pandas

from formulaic import Formula, model_matrix


def my_transform(col: pandas.Series) -> pandas.Series:
    return col**2
In [2]:
# Local context is automatically added
model_matrix("a + my_transform(a)", pandas.DataFrame({"a": [1, 2, 3]}))
Out[2]:
   Intercept  a  my_transform(a)
0        1.0  1                1
1        1.0  2                4
2        1.0  3                9
In [3]:
# Manually add `my_transform` to the context
Formula("a + my_transform(a)").get_model_matrix(
    pandas.DataFrame({"a": [1, 2, 3]}),
    context={"my_transform": my_transform},  # could also use: context=locals()
)
Out[3]:
   Intercept  a  my_transform(a)
0        1.0  1                1
1        1.0  2                4
2        1.0  3                9
In [4]:
from formulaic.transforms import center, scale

scale(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8]))
Out[4]:
array([-1.42886902, -1.02062073, -0.61237244, -0.20412415,  0.20412415,
        0.61237244,  1.02062073,  1.42886902])
In [5]:
center(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8]))
Out[5]:
array([-3.5, -2.5, -1.5, -0.5,  0.5,  1.5,  2.5,  3.5])
In [6]:
import numpy

from formulaic.transforms import stateful_transform


@stateful_transform
def center(data, _state=None, _metadata=None, _spec=None):
    print("state", _state)
    print("metadata", _metadata)
    print("spec", _spec)
    if "mean" not in _state:
        _state["mean"] = numpy.mean(data)
    return data - _state["mean"]


state = {}
center(pandas.Series([1, 2, 3]), _state=state)
state {}
metadata None
spec ModelSpec(formula=, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, structure=None, transform_state={}, encoder_state={})
Out[6]:
0   -1.0
1    0.0
2    1.0
dtype: float64
In [7]:
state
Out[7]:
{'mean': 2.0}
The mutated state object is then automatically stored by Formulaic into the right context in the appropriate ModelSpec
instance for reuse as necessary.
If you wanted to leverage the single dispatch functionality, you could do something like:
In [8]:
from formulaic.transforms import stateful_transform


@stateful_transform
def center(data, _state=None, _metadata=None, _spec=None):
    raise ValueError(f"No implementation for data of type {repr(type(data))}")


@center.register(pandas.Series)
def _(data, _state=None, _metadata=None, _spec=None):
    if "mean" not in _state:
        _state["mean"] = numpy.mean(data)
    return data - _state["mean"]
Note
If taking advantage of the single dispatch functionality, it is important that the top-level function has exactly the same signature as the type specific implementations.
"},{"location":"guides/transforms/#adding-transforms-to-the-evaluation-context","title":"Adding transforms to the evaluation context\u00b6","text":"The only requirement for using a transform in formula is making it available in the execution context. The evaluation context is always pre-seeded with:
- the numpy module (available as np).
- log: numpy.log.
- log10: numpy.log10.
- log2: numpy.log2.
- exp: numpy.exp.
- exp10: numpy.exp10.
- exp2: numpy.exp2.
- inline Python expressions (quoted using the {<expr>} syntax).
The evaluation context can be extended to include arbitrary additional functions. If you are using the top-level model_matrix
function then the local context in which model_matrix
is called is automatically added to the execution context, otherwise you need to manually specify this context. For example:
In Formulaic, a stateful transform is just a regular callable object (typically a function) that has an attribute __is_stateful_transform__
that is set to True
. Such callables will be passed up to three additional arguments by formulaic if they are present in the callable signature:
_state
: The existing state or an empty dictionary that should be mutated to record any additional state._metadata
: An additional metadata dictionary about the factor, or None
. Will typically only be present if the Factor
metadata is populated._spec
: The current model spec being evaluated (or an empty ModelSpec
if being called outside of Formulaic's materialization routines).Only _state
is required; _metadata
and _spec
will only be passed in by Formulaic if they are present in the callable signature.
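As a minimal sketch of implementing this interface directly (without the stateful_transform decorator introduced below):
import numpy


def demean(data, _state):
    # Record the mean on the first evaluation, and reuse it thereafter.
    if "mean" not in _state:
        _state["mean"] = numpy.mean(data)
    return data - _state["mean"]


# Mark the callable so that Formulaic treats it as a stateful transform.
demean.__is_stateful_transform__ = True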
Formulaic comes preloaded with some useful stateful transforms, which are outlined below.
"},{"location":"guides/transforms/#scaling-and-centering","title":"Scaling and Centering\u00b6","text":"There are two provided scaling transforms: scale(...)
and center(...)
.
scale
rescales the data such that it is centered around zero with a standard deviation of 1. The centering and variance standardisation can be independently disabled as necessary. center
is a simple wrapper around scale
that only does the centering. For more details, refer to inline documentation: help(scale)
.
Example usage is shown below:
"},{"location":"guides/transforms/#categorical-encoding","title":"Categorical Encoding\u00b6","text":"Formulaic provides a rich family of categorical stateful transforms. These are perhaps the most commonly used transforms, and are used to encode categorical/factor data into a form suitable for numerical analysis. Use of these transforms is separately documented in the Categorical Encoding section.
"},{"location":"guides/transforms/#spline-encoding","title":"Spline Encoding\u00b6","text":"Spline coding is used to enable non-linear dependence on numerical features in linear models. Formulaic currently provides two spline transforms: bs
for basis splines, and poly
for polynomial splines, along with the natural and cyclic cubic spline transforms (cr and cc). These are separately documented in the Spline Encoding section.
You can either implement the above interface directly, or leverage the stateful_transform
decorator provided by Formulaic, which then also updates your function into a single dispatch function, allowing multiple implementations that depend on the currently materialized type. A simple centering example is explored below.
Formulaic is a high-performance implementation of Wilkinson formulas for Python, which are very useful for transforming dataframes into a form suitable for ingestion into various modelling frameworks (especially linear regression).
It provides:
pandas.DataFrame
pyarrow.Table
pandas.DataFrame
numpy.ndarray
scipy.sparse.CSCMatrix
with more to come!
"},{"location":"changelog/","title":"Changelog","text":"For changes since the latest tagged release, please refer to the git commit log.
"},{"location":"changelog/#111-20-december-2024","title":"1.1.1 (20 December 2024)","text":"New features and enhancements:
Formula.differentiate()
is now considered stable, with ModelMatrix.differentiate()
to follow in a future release.Bugfixes and cleanups:
Breaking changes:
Formula
is no longer always \"structured\" with special cases to handle the case where it has no structure. Legacy shims have been added to support old patterns, with DeprecationWarning
s raised when they are used. This is not expected to break anyone who was not explicitly checking whether the Formula.root
is a list instance (which formerly could have been safely assumed) [it is now a SimpleFormula
instance that acts like an ordered sequence of Term
instances].feature[T.A]
, whether or not the encoding will result in that term acting as a contrast. Now, in keeping with patsy
, we only add the prefix if the categorical factor is encoded with reduced rank. Otherwise, feature[A]
will be used instead.formulaic.parsers.types.structured
has been promoted to formulaic.utils.structured
.New features and enhancements:
Formula
now instantiates to SimpleFormula
or StructuredFormula
, the latter being a tree-structure of SimpleFormula
instances (as compared to List[Term]
) previously. This simplifies various internal logic and makes the propagation of formula metadata more explicit.dict
and recarray
types are now associated with the pandas
materializer by default (rather than raising), simplifying some user workflows..
operator (which is replaced with all variables not used on the left-hand-side of formulae).[ ... ~ ... ]
. This is useful for (e.g.) generating formulae for IV 2SLS.ModelSpec[s]
based on an arbitrary strictly reduced FormulaSpec
.Formula.required_variables
to more easily surface the expected data requirements of the formula.cc
) and natural (cr
). See formulaic.materializers.transforms.cubic_spline.cubic_spline
for more details.lag()
transform.LinearConstraints
can now be done from a list of strings (for increased parity with patsy
).T.
when they actually describe contrasts (i.e. when they are encoded with reduced rank).encode_categorical
; which is surfaced via ModelSpec.factor_contrasts
.Operator
instances now receive a context
which is optionally specified by the user during formula parsing, and updated by the parser. This is what makes the .
implementation possible.Structured
, it has been promoted to formulaic.utils
.Bugfixes and cleanups:
Formula
instance.Structured
instances.ModelMatrix
and FactorValues
instances (whenever wrapped objects are picklable).basis_spline
: Fixed evaluation involving datasets with null values, and disallowed out-of-bounds knots.ruff
for linting, and updated mypy
and pre-commit
tooling.ruff
are automatically applied when using hatch run lint:format
.Documentation:
Bugfixes and cleanups:
pandas
>=3.mypy
type inference in materializer subclasses.Documentation:
sklearn
integration example.Bugfixes and cleanups:
Breaking changes:
Formula.terms
, ModelSpec.feature_names
, and ModelSpec.feature_indices
.New features and enhancements:
patsy
.hashed
transform for categorically encoding deterministically hashed representations of a dataset.Bugfixes and cleanups:
astor
for Python 3.9 and newer.Documentation:
This is minor release with one important bugfix.
Bugfixes and cleanups:
This is a minor release with several important bugfixes.
Bugfixes and cleanups:
poly()
transforms operating on datasets that include null values.hatch run tests
.This is a minor release with several new features and cleanups.
New features and enhancements:
ModelSpec
documentation for more details.Bugfixes and cleanups:
OrderedDict
usage, since Python guarantees the orderedness of dictionaries in Python 3.7+.None
.This is a minor release with a bugfix.
Bugfixes and cleanups:
This is a minor release with several bugfixes.
Bugfixes and cleanups:
This is a minor release with one new feature.
New features and enhancements:
This is a major release with some important consistency and completeness improvements. It should be treated as almost being the first release candidate of 1.0.0, which will land after some small amount of further feature extensions and documentation improvements. All users are recommended to upgrade.
Breaking changes:
Although there are some internal changes to API, as documented below, there are no breaking changes to user-facing APIs.
New features and enhancements:
_ordering
keyword to the Formula
constructor.standardize
, Q
and treatment contrasts shims.cluster_by='numerical_factors
option to ModelSpec
to enable patsy style clustering of output columns by involved numerical factors.^
and %in%
.Structured
instances, and use this functionality during AST evaluation where relevant.ModelSpec.term_indices
is now a list rather than a tuple, to allow direct use when indexing pandas and numpy model matrices.Bugfixes and cleanups:
Structured
instances for non-sequential iterable values.poly
unit tests.PandasMaterializer
.Factor.EvalMethod.UNKNOWN
was removed, defaulting instead to LOOKUP
.sympy
version constraint now that a bug has been fixed upstream.Documentation:
This is a minor patch releases that fixes one bug.
Bugfixes and cleanups:
Structured
instance and iteration over this instance (including Formula
instances). Formerly the length would only count the number of keys in its structure, rather than the number of objects that would be yielded during iteration.This is a minor patch release that fixes two bugs.
Bugfixes and cleanups:
Formula
objects.formulaic.__version__
during package build.This is a major new release with some minor API changes, some ergonomic improvements, and a few bug fixes.
Breaking changes:
Formula
objects (e.g. formula.lhs
) no longer returns a list of terms; but rather a Formula
object, so that the helper methods can remain accessible. You can access the raw terms by iterating over the formula (list(formula)
) or looking up the root node (formula.root
).New features and improvements:
ModelSpec
object is now the source of truth in all ModelMatrix
generations, and can be constructed directly from any supported specification using ModelSpec.from_spec(...)
. Supported specifications include formula strings, parsed formulae, model matrices and prior model specs..get_model_matrix()
helper methods across Formula
, FormulaMaterializer
, ModelSpec
and model_matrix
objects/helpers functions are now consistent, and all use ModelSpec
directly under the hood.Formula
objects (e.g. formula.lhs
), the term lists will be wrapped as trivial Formula
instances rather than returned as raw lists (so that the helper methods like .get_model_matrix()
can still be used).FormulaSpec
is now exported from the top-level module.Bugfixes and cleanups:
ModelSpec
specifications being overriden by default arguments to FormulaMaterializer.get_model_matrix
.Structured._flatten()
now correctly flattens unnamed substructures.This is a major new release with some new features, greatly improved ergonomics for structured formulae, matrices and specs, and a few small breaking changes (most with backward compatibility shims). All users are encouraged to upgrade.
Breaking changes:
include_intercept
is no longer an argument to FormulaParser.get_terms
; and is instead an argument of the DefaultFormulaParser
constructor. If you want to modify the include_intercept
behaviour, please use: Formula(\"y ~ x\", _parser=DefaultFormulaParser(include_intercept=False))\n
Formula.terms
is deprecated since Formula
became a subclass of Structured[List[Terms]]
. You can directly iterate over, and/or access nested structure on the Formula
instance itself. Formula.terms
has a deprecated property which will return a reference to itself in order to support legacy use-cases. This will be removed in 1.0.0.ModelSpec.feature_names
and ModelSpec.feature_columns
are deprecated in favour of ModelSpec.column_names
and ModelSpec.column_indices
. Deprecated properties remain in-place to support legacy use-cases. These will be removed in 1.0.0.New features and enhancements:
Formula
has been refactored as a subclass of Structured[List[Terms]]
, and can be incrementally built and modified. The matrix and spec outputs now have explicit subclasses of Structured
(ModelMatrices
and ModelSpecs
respectively) to expose convenience methods that allow these objects to be largely used interchangeably with their singular counterparts.ModelMatrices
and ModelSpecs
arenow surfaced as top-level exports of the formulaic
module.Structured
(and its subclasses) gained improved integration of nested tuple structure, as well as support for flattened iteration, explicit mapping output types, and lots of cleanups.ModelSpec
was made into a dataclass, and gained several new properties/methods to support better introspection and mutation of the model spec.FormulaParser
was renamed DefaultFormulaParser
, and made a subclass of the new formula parser interface FormulaParser
. In this process include_intercept
was removed from the API, and made an instance attribute of the default parser implementation.Bugfixes and cleanups:
ModelSpec
s are provided by the user during materialization, they are updated to reflect the output-type chosen by the user, as well as whether to ensure full rank/etc.pylint
was added to the CI testing.Documentation:
.materializer
submodule, most code now has inline documentation and annotations.This is a backward compatible major release that adds several new features.
New features and enhancements:
ModelMatrix
instances (see ModelMatrix.model_spec.get_linear_constraints
).ModelMatrix
, ModelSpec
and other formula-like objects to the model_matrix
sugar method so that pre-processed formulae can be used.0
with -1
to avoid substitutions in quoted contexts.Bugfixes and cleanups:
bs(`my|feature%is^cool`)
.C(x, {\"a\": [1,2,3]})
.astor
to >=0.8 to fix issues with ast-generation in Python 3.8+ when numerical constants are present in the parsed python expression (e.g. \"bs(x, df=10)\").This is a minor patch release that migrates the package tooling to poetry; solving a version inconsistency when packaging for conda
.
This is a minor patch release that fixes an attempt to import numpy.typing
when numpy is not version 1.20 or later.
This is a minor patch release that fixes the maintaining of output types, NA-handling, and assurance of full-rank for factors that evaluate to pre-encoded columns when constructing a model matrix from a pre-defined ModelSpec. The benchmarks were also updated.
"},{"location":"changelog/#030-14-march-2022","title":"0.3.0 (14 March 2022)","text":"This is a major new release with many new features, and a few small breaking changes. All users are encouraged to upgrade.
Breaking changes:
formulaic.materializers.transforms
to the top-level formulaic.transforms
module, and ported all existing transforms to output FactorValues
types rather than dictionaries. FactorValues
is an object proxy that allows output types like pandas.DataFrame
s to be used as they normally would, with some additional metadata for formulaic accessible via the __formulaic_metadata__
attribute. This makes non-formula direct usage of these transforms much more pleasant.~
is no longer a generic formula separator, and can only be used once in a formula. Please use the newly added |
operator to separate a formula into multiple parts.New features and enhancements:
~
operator to use them. Structured formulas can have named substructures, for example: lhs
and rhs
for the ~
operator. The representation of formulas has been updated to show this structure.|
operator which splits formulas into multiple parts).formulaic.model_matrix
syntactic sugar function now accepts ModelSpec
and ModelMatrix
instances as the \"formula\" spec, making generation of matrices with the same form as previously generated matrices more convenient.poly
transform (compatible with R and patsy).numpy
is now always available in formulas via np
, allowing formulas like np.sum(x)
. For convenience, log
, log10
, log2
, exp
, exp10
and exp2
are now exposed as transforms independent of user context.formulaic.utils.context.capture_context()
. This can be used by libraries that wrap Formulaic to capture the variables and/or transforms available in a users' environment where appropriate.Bugfixes and cleanups:
Documentation:
.parser
and .utils
modules of Formulaic are now inline documented and annotated.This is a minor release that fixes an issue whereby the ModelSpec instances attached to ModelMatrix objects would keep reference to the original data, greatly inflating the size of the ModelSpec.
"},{"location":"changelog/#023-4-february-2021","title":"0.2.3 (4 February 2021)","text":"This release is identical to v0.2.2, except that the source distribution now includes the docs, license, and tox configuration.
"},{"location":"changelog/#022-4-february-2021","title":"0.2.2 (4 February 2021)","text":"This is a minor release with one bugfix.
This is a minor patch release that brings in some valuable improvements.
DataFrame
.This is major release that brings in a large number of improvements, with a huge number of commits. Some API breakage from the experimental 0.1.x series is likely in various edge-cases.
Highlights include:
Performance improvements around the encoding of categorical features.
Matthew Wardrop (1):\n Improve the performance of encoding operations.\n
"},{"location":"changelog/#011-31-october-2019","title":"0.1.1 (31 October 2019)","text":"No code changes here, just a verification that GitHub CI integration was working.
Matthew Wardrop (1):\n Update Github workflow triggers.\n
"},{"location":"changelog/#010-31-october-2019","title":"0.1.0 (31 October 2019)","text":"This release added support for keeping track of encoding choices during model matrix generation, so that they can be reused on similarly structured data. It also added comprehensive unit testing and CI integration using GitHub actions.
Matthew Wardrop (5):\n Add support for stateful transforms (including encoding).\n Fix tokenizing of nested Python function calls.\n Add support for nested transforms that return multiple columns, as well as passing through of materializer config through to transforms.\n Add comprehensive unit testing along with several small miscellaneous bug fixes and improvements.\n Add GitHub actions configuration.\n
"},{"location":"changelog/#001-1-september-2019","title":"0.0.1 (1 September 2019)","text":"Initial open sourcing of formulaic
.
Matthew Wardrop (1):\n Initial (mostly) working implementation of Wilkinson formulas.\n
"},{"location":"formulas/","title":"What are formulas?","text":"This section introduces the basic notions and origins of formulas. If you are already familiar with formulas from another context, you might want to skip forward to the Formula Grammar or other User Guides.
"},{"location":"formulas/#origins","title":"Origins","text":"Formulas were originally proposed by Wilkinson et al.1 to aid in the description of ANOVA problems, but were popularised by the S language (and then R, as an implementation of S) in the context of linear regression. Since then they have been extended in R, and implemented in Python (by patsy), in MATLAB, in Julia, and quite conceivably elsewhere. Each implementation has its own nuances and grammatical extensions, including Formulaic's which are described more completely in the Formula Grammar section of this manual.
"},{"location":"formulas/#why-are-they-useful","title":"Why are they useful?","text":"Formulas are useful because they provide a concise and explicit specification for how data should be prepared for a model. Typically, the raw input data for a model is stored in a dataframe, but the actual implementations of various statistical methodologies (e.g. linear regression solvers) act on two-dimensional numerical matrices that go by several names depending on the prevailing nomenclature of your field, including \"model matrices\", \"design matrices\" and \"regressor matrices\" (within Formulaic, we refer to them as \"model matrices\"). A formula provides the necessary information required to automate much of the translation of a dataframe into a model matrix suitable for ingestion into a statistical model.
Suppose, for example, that you have a dataframe with \\(N\\) rows and three numerical columns labelled: y
, a
and b
. You would like to construct a linear regression model for y
based on a
, b
and their interaction: \\[ y = \\alpha + \\beta_a a + \\beta_b b + \\beta_{ab} ab + \\varepsilon \\] with \\(\\varepsilon \\sim \\mathcal{N}(0, \\sigma^2)\\). Rather than manually constructing the required matrices to pass to the regression solver, you could specify a formula of form:
y ~ a + b + a:b\n
When furnished with this formula and the dataframe, Formulaic (or indeed any other formula implementation) would generate two model matrix objects: an \\( N \\times 1 \\) matrix \\(Y\\) for the response variable y
, and an \\( N \\times 4 \\) matrix \\(X\\) for the input columns intercept
, a
, b
, and a * b
. You can then directly pass these matrices to your regression solver, which internally will solve for \\(\\beta\\) in: \\[ Y = X\\beta + \\varepsilon. \\] The true value of formulas becomes more apparent as model complexity increases, where they can be a huge time-saver. For example:
~ (f1 + f2 + f3) * (x1 + x2 + scale(x3))\n
tells the formula interpreter to consider 16 fields of input data, corresponding to an intercept (1), each of the f*
fields (3), each of the x*
fields (3), and the combination of each f
with each x
(9). It also instructs the materializer to ensure that the x3
column is rescaled during the model matrix materialization phase such that it has mean zero and a standard deviation of 1. If any of these columns is categorical in nature, they would by default also be one-hot/dummy encoded. Depending on the formula interpreter (including Formulaic), extra steps would also be taken to ensure that the resulting model matrix is structurally full-rank. As an added bonus, some formula implementations (including Formulaic) can remember any choices made during the materialization process, and apply them consistently to new data, making it possible to easily generate new data that conforms to the same structure as the training data. For example, the scale(...)
transform in the example above makes use of the mean and variance of the column to be scaled. Any future data, however, should not undergo scaling based on its own mean and variance, but rather on the mean and variance that was measured for the training data set (otherwise the new dataset will not be consistent with the expectations of the trained model which will be interpreting it).
Formulas are a very flexible tool, and can be augmented with arbitrary user-defined transforms. However, some transformations required by certain models may be more elegantly defined via a pre-formula dataframe operation or post-formula model matrix operation. Another consideration is that the default encoding and materialization choices for data are aligned with linear regression. If you are using a tree model, for example, you may not be interested in dummy encoding of \"categorical\" features, and this type of transform would have to be explicitly noted in the formula. Nevertheless, even in these cases, formulas are an excellent tool, and can often be used to greatly simplify data preparation workflows.
"},{"location":"formulas/#where-to-from-here","title":"Where to from here?","text":"To learn about the full set of features supported by the formula language as implemented by Formulaic, please review the Formula Grammar. To get a feel for how you can use formulaic
to transform your dataframes into model matrices, please review the Quickstart.
Wilkinson, G. N., and C. E. Rogers. Symbolic description of factorial models for analysis of variance. J. Royal Statistics Society 22, pp. 392\u2013399, 1973.\u00a0\u21a9
The latest release of formulaic
is always published to the Python Package Index (PyPI), from which it is available to download @ https://pypi.org/project/formulaic/.
If your Python environment is provisioned with pip
, installing formulaic
from the PyPI is as simple as running:
$ pip install formulaic\n
Note
If you have a non-standard setup, ensure that pip
above are replaced with the executables corresponding to the environment for which you are interested in installing formulaic
. This is done automatically if you are using a virtual environment.
You are ready to use Formulaic. To get introduced to the concepts underpinning Formulaic, please review the Concepts documentation, or to jump straight to how to use Formulaic, please review the User Guides documentation.
"},{"location":"installation/#installing-for-development","title":"Installing for development","text":"If you are interested in developing formulaic
, you should clone the source code repository, and install in editable mode from there (allowing your changes to be instantly available to all new Python sessions).
To clone the source code, run:
$ git clone git@github.com:matthewwardrop/formulaic.git\n
Note
This requires you to have a GitHub account set up. If you do not have an account you can replace the SSH url above with https://github.com/matthewwardrop/formulaic.git
. Also, if you are planning to submit your work upstream, you may wish to fork the repository into your own namespace first, and clone from there.
To install in editable mode, run:
$ pip install -e <path_to_cloned_formulaic_repo>\n
You will need pip>=21.3
in order for this to work. You can then make any changes you like to the repo, and have them be reflected in your local Python sessions. Happy hacking, and I look forward to your contributions!
"},{"location":"migration/","title":"Migrating from Patsy/R","text":"The default Formulaic parser and materialization configuration is designed to be highly compatibly with existing Wilkinson formula implementations in R and Python; however there are some differences which are highlighted here. If you find other differences, feel free to submit a PR to update this documentation.
"},{"location":"migration/#migrating-from-patsy","title":"Migrating frompatsy
","text":"Patsy has been the go-to implementation of Wilkinson formulae for Python use-cases for many years, and Formulaic should be largely a drop-in replacement, while bringing order of magnitude improvements in runtime performance and greater extensibility. Being written in the same language (Python) there are two separate migration concerns: input/output and API migrations, which will be explored separately below.
"},{"location":"migration/#inputoutput-changes","title":"Input/Output changes","text":"The primary inputs to patsy
are a formula string, and pandas dataframe from which features referenced in the formula are drawn. The output is a model matrix (called a design matrix in patsy
). We focus here on any potentially breaking behavioural differences here, rather than ways in which Formulaic extends the functionality available in patsy
.
^
operator is interpreted as exponentiation, rather than Python's XOR binary operator.C(x, contr.treatment)
. For greater compatibility with patsy
we add to the transform namespace Treatment
, Poly
, Sum
, Helmert
and Diff
, allowing formulae like C(x, Poly)
or C(x, Treatment(reference='x'))
to work as expected, with the following caveats:C
is C(data, contrasts=None, *, levels=None)
as compared to C(data, contrast=None, levels=None)
from patsy
.Sum
contrast does not offer an omit
option to specify the index of the omitted column.scale(x)
, but compatibility shims for standardize(x)
are added for greater compatibility with patsy
. Note that the standardize
shim follows patsy argument kwarg naming conventions, but scale
uses scale
instead of rescale
, following R.cluster_by=\"numerical_factors\"
to model_matrix
or any of the .get_model_matrix(...)
methods.cr
and cc
) or tensor smoothing (te
) stateful transforms.patsy
offers two high-level user-facing entrypoints: patsy.dmatrix
and patsy.dmatrices
, depending on whether you have both left- and right-hand sides present. In formulaic
, we offer a single entrypoint for both cases: model_matrix
.
In the vast majority of cases, a simple substitution of dmatrix
or dmatrices
with model_matrix
will achieve the desired result; however, there are some differences in signature that could trip up a naive copy and replace. Patsy's dmatrix
signature is:
patsy.dmatrix(\n formula_like,\n data={},\n eval_env=0,\n NA_action='drop',\n return_type='matrix',\n)\n
whereas model_matrix
has a signature of: formulaic.model_matrix(\n spec: FormulaSpec, # accepts any formula-like spec (include model matrices and specs)\n data: Any, # accepts any supported data structure (include pandas DataFrames)\n *,\n context: Union[int, Mapping[str, Any]] = 0, # equivalent to `eval_env`\n **spec_overrides, # Additional overrides for generated `ModelSpec`, including `na_action` and `output` (similar to `return_type`).\n)\n
If you are integrating Formulaic into your library, it is highly recommended to use the Formula()
API directly rather than model_matrix
, which by default will add all variables in the local context into the evaluation environment (just like dmatrix
). This allows you to better isolate and control the behaviour of the Formula parsing.
Most formulae that work in R will work without modification, including those written against the enhanced R Formula package that supports multi-part formulae. However, there are a few caveats that are worth calling out:
^
operator within an I
transform; e.g. I(x^2)
. This is because this is treated as Python code, and so you should use I(x**2)
or {x**2}
instead.patsy
. In particular, order of operations are respected when evaluating intercept directives, and so: 1 + (b - 1)
would result in the intercept remaining (since (b-1)
would be evaluated first to b
, resulting in 1 + b
), whereas in R the intercept would have been dropped.patsy
. Using capital letters to represent categorical variables, and lower-case letters to represent numerical ones, the difference from R will become apparent in two cases:1 + A:B
. In this case, R does not account for the fact that A:B
spans the intercept, and so does not rank reduce the product, and thus generates an over-specified matrix. This affects higher-order interactions also.0 + A:x + B:C
. Here we use 0 +
to avoid the previous bug, but unfortunately when R is checking whether to reduce the rank of the categorical features during encoding, it assumes that all involved features are categorical, and thus unnecessarily reduces the rank of C
, resulting in an under-specified matrix. This affects higher-order interactions also.offset(...)
.For more details, refer to the Formula Grammar.
"},{"location":"dev/","title":"Introduction","text":"This section of the documentation focuses on providing guidance to developers of libraries that are integrating Formulaic, users who want to extend its behavior, or for those interested in directly contributing.
If you are looking to directly work with Formulaic as an end-user, please review the User Guides instead.
This portion of the documentation is less complete than user-facing documentation, and you are encouraged to reach out via the Issue Tracker if you need any help.
"},{"location":"dev/extensions/","title":"Extensions","text":"Formulaic was designed to be extensible from day one, and nearly all of its core functionality is implemented as \"plugins\"/\"modules\" that you can use as examples for how extensions could be written. In this document we will provide a basic high-level overview of the basic components of Formulaic that can extended.
An important consideration is that while Formulaic offers extensible APIs, and effort will be made not to break extension APIs without reason (and never in patch releases), the safest place for you extensions is in Formulaic itself, where they can be kept up to date and maintained (assuming the extension is not overly bespoke). If you think your extensions might help others, feel free to reach out via the issue tracker and/or open a pull request.
"},{"location":"dev/extensions/#transforms","title":"Transforms","text":"Transforms are likely the most commonly extended feature of Formulaic, and also likely the least valuable to upstream (since transforms are often domain specific). Documentation for implementing transforms is described in detail in the Transforms user guide.
"},{"location":"dev/extensions/#materializers","title":"Materializers","text":"Materializers are responsible for translating formulae into model matrices as documented in the How it works user guide. You need to implement a new materializer if you want to add support for new input and/or output types.
Implementing a new materializer is as simple as subclassing the abstract class formulaic.materializers.FormulaMaterializer
(or one of its subclasses). This base class defines the API expected by the rest of the Formulaic system. Example implementations include pandas and pyarrow.
During subclassing, the new class is registered according to the various REGISTER_*
attributes if REGISTER_NAME
is specified. This registration allows looking up of the materializer by name through the model_matrix()
and .get_model_matrix()
functions. You can always manually pass in your materializer class explicitly without this registration.
Parsers translate a formula string to a set of terms and factors that are then evaluated and assembled into the model matrix, as documented in the How it works user guide. This is unlikely to be necessary very often, but can be used to add additional formula operators, or change the behavior of existing ones.
Formula parsers are expected to implement the API of formulaic.parser.types.FormulaParser
. The default implementation can be seen here. You can pass in custom parsers to Formula()
via the parser
and nested_parser
options (see inline documentation for more details).
If you are considering extending the parser, please do reach out via the issue tracker.
"},{"location":"dev/integration/","title":"Integration","text":"If you are looking to enrich your existing Python project with support for formulae, you have come to the right place. Formulaic is designed with simple APIs that should make it straightforward to integrate into any project.
In this document we provide several general recommendations for developers integrating Formulaic, and then some more specific guidance for developers looking to migrate existing formula functionality from patsy
. As you are working on integrating Formulaic, if you come across anything not mentioned here that really ought to be, please report it to our Issue Tracker.
For the most part, Formulaic should \"just work\". However, here are a couple of recommendations that might make your integration work easier.
model_matrix
. This is a simple wrapper around lower-level APIs that automatically includes variables from users' local namespaces. This is convenient when running in a notebook, but can lead to unexpected interactions with your library code that are hard to debug. Called naively in your library it will treat the frame in which it was run as the user context, which may include somewhat sensitive internal state and may override transforms normally available to formulae. Instead, use Formula(...).get_model_matrix(...)
.formulaic.utils.context.capture_context()
function and pass the result as context
to the .get_model_matrix()
methods. It is easiest to use in the outermost user-facing entrypoints so that you do not need to figure out exactly how many frames removed you are from user-context. You may also manually construct a dictionary from the user's context if you want to do additional filtering.eval()
function may be called to invoke the indicated Python functions. Since this is user-specified code, it is possible that the formula had some malicious code in it (such as sys.exit()
or shutil.rmtree()
). If you are integrating Formulaic into server-side code, it is highly recommended not to pass in any user-specified context, but instead to curate the set of additional functions that are available and pass that in instead. If you are writing a user-facing library, this should not be as concerning.pandas.DataFrame
-> PandasMaterializer
). Different materializers may have different output (and other) options. It may make sense to hard-code your choice of materializer by passing materializer=
to the .get_model_matrix()
methods.output='sparse'
to .get_model_matrix()
, assuming the materializer of your datatype supports this).ModelSpec
instances to work across different major versions of Formulaic. It may be tempting to serialize them to disk and then reuse them in newer versions of Formulaic. Most of the time this will work fine, but the stored encoder and transform states are considered implementation details of stateful transforms and are subject to change between major versions. Patch releases should never result in changes to this state.formulaic.parser.DefaultFormulaParser.FeatureFlags
. You can pass these flags, or a set of (case-insensitive) strings corresponding to these enums, to DefaultFormulaParser(feature_flags=...)
.If you are migrating a library that previous used patsy
to formulaic
, you should first review the general user-facing migration notes, which describes differences in API and formula grammars. Then, in addition to the recommendations above, the following notes might be helpful.
patsy
, such as manually assembling Term
instances, this code will need to be rewritten to use Formulaic
classes instead. Generally speaking, this will likely be transparent to your users and so should be a relatively small lift.This section of the documentation focuses on guiding end-users through the various aspects of Formulaic likely to be useful in day-to-day workflows. Feel free to pick and choose which modules you peruse, but note that later modules may assume knowledge of content described in earlier modules.
If you are a developer of another library looking to leverage Formulaic internally to your code, or are looking to contribute directly to Formulaic, it is recommended to also review the Developer Guides.
"},{"location":"guides/contrasts/","title":"Categorical Encoding","text":"Categorical data (also known as \"factors\") is encoded in model matrices using \"contrast codings\" that transform categorical vectors into a collection of numerical vectors suitable for use in regression models. In this guide we begin with some basic examples, before introducing the concepts behind contrast codings, how to select and/or design your own coding, and (for more advanced readers) describe how we guarantee structural full-rankness of formulae with complex interactions between categorical and other features.
In\u00a0[1]: Copied!from pandas import Categorical, DataFrame\n\nfrom formulaic import model_matrix\n\ndf = DataFrame(\n {\n \"letters\": [\"a\", \"b\", \"c\"],\n \"numbers\": Categorical([1, 2, 3]),\n \"values\": [20, 200, 30],\n }\n)\n\nmodel_matrix(\"letters + numbers + values\", df)\nfrom pandas import Categorical, DataFrame from formulaic import model_matrix df = DataFrame( { \"letters\": [\"a\", \"b\", \"c\"], \"numbers\": Categorical([1, 2, 3]), \"values\": [20, 200, 30], } ) model_matrix(\"letters + numbers + values\", df) Out[1]: Intercept letters[T.b] letters[T.c] numbers[T.2] numbers[T.3] values 0 1.0 0 0 0 0 20 1 1.0 1 0 1 0 200 2 1.0 0 1 0 1 30
Here letters
was identified as a categorical variable because of it consisted of strings, numbers
was identified as categorical because of its data type, and values
was treated as a vector of numerical values. The categorical data was encoded using the default encoding of \"Treatment\" (aka. \"Dummy\", see below for more details).
If we wanted to force formulaic to treat a column as categorical, we can use the C()
transform (just as in patsy and R). For example:
model_matrix(\"C(values)\", df)\nmodel_matrix(\"C(values)\", df) Out[2]: Intercept C(values)[T.30] C(values)[T.200] 0 1.0 0 0 1 1.0 0 1 2 1.0 1 0
The C()
transform tells Formulaic that the column should be encoded as categorical data, and allows you to customise how the encoding is performed. For example, we could use polynomial coding (detailed below) and explicitly specify the categorical levels and their order using:
model_matrix(\"C(values, contr.poly, levels=[10, 20, 30])\", df)\nmodel_matrix(\"C(values, contr.poly, levels=[10, 20, 30])\", df)
/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:124: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {200}. They are being cast to nan, which will likely skew the results of your analyses.\n warnings.warn(\nOut[3]: Intercept C(values, contr.poly, levels=[10, 20, 30]).L C(values, contr.poly, levels=[10, 20, 30]).Q 0 1.0 0.000000 -0.816497 1 1.0 0.000000 0.000000 2 1.0 0.707107 0.408248
Where possible, as you can see above, we also provide warnings when a categorical encoding does not reflect the structure of the data.
In\u00a0[4]: Copied!from formulaic.transforms.contrasts import C, TreatmentContrasts\n\nTreatmentContrasts(base=\"B\").get_coding_matrix([\"A\", \"B\", \"C\", \"D\"])\nfrom formulaic.transforms.contrasts import C, TreatmentContrasts TreatmentContrasts(base=\"B\").get_coding_matrix([\"A\", \"B\", \"C\", \"D\"]) Out[4]: A C D A 1.0 0.0 0.0 B 0.0 0.0 0.0 C 0.0 1.0 0.0 D 0.0 0.0 1.0 In\u00a0[5]: Copied!
TreatmentContrasts(base=\"B\").get_coefficient_matrix([\"A\", \"B\", \"C\", \"D\"])\nTreatmentContrasts(base=\"B\").get_coefficient_matrix([\"A\", \"B\", \"C\", \"D\"]) Out[5]: A B C D B 0.0 1.0 0.0 0.0 A-B 1.0 -1.0 -0.0 -0.0 C-B 0.0 -1.0 1.0 0.0 D-B 0.0 -1.0 0.0 1.0 In\u00a0[6]: Copied!
model_matrix(\"C(letters, contr.treatment)\", df)\nmodel_matrix(\"C(letters, contr.treatment)\", df) Out[6]: Intercept C(letters, contr.treatment)[T.b] C(letters, contr.treatment)[T.c] 0 1.0 0 0 1 1.0 1 0 2 1.0 0 1 In\u00a0[7]: Copied!
model_matrix("C(letters, contr.treatment)", df)
Out[6]:
   Intercept  C(letters, contr.treatment)[T.b]  C(letters, contr.treatment)[T.c]
0        1.0                                 0                                 0
1        1.0                                 1                                 0
2        1.0                                 0                                 1
In [7]:
model_matrix("C(letters, contr.SAS)", df)
Out[7]:
   Intercept  C(letters, contr.SAS)[T.a]  C(letters, contr.SAS)[T.b]
0        1.0                           1                           0
1        1.0                           0                           1
2        1.0                           0                           0
In [8]:
model_matrix("C(letters, contr.sum)", df)
Out[8]:
   Intercept  C(letters, contr.sum)[S.a]  C(letters, contr.sum)[S.b]
0        1.0                         1.0                         0.0
1        1.0                         0.0                         1.0
2        1.0                        -1.0                        -1.0
In [9]:
model_matrix("C(letters, contr.helmert)", df)
Out[9]:
   Intercept  C(letters, contr.helmert)[H.b]  C(letters, contr.helmert)[H.c]
0        1.0                            -1.0                            -1.0
1        1.0                             1.0                            -1.0
2        1.0                             0.0                             2.0
In [10]:
model_matrix("C(letters, contr.diff)", df)
Out[10]:
   Intercept  C(letters, contr.diff)[D.b]  C(letters, contr.diff)[D.c]
0        1.0                    -0.666667                    -0.333333
1        1.0                     0.333333                    -0.333333
2        1.0                     0.333333                     0.666667
In [11]:
my_letters = C(df.letters, TreatmentContrasts(base=\"b\"))\nmodel_matrix(\"my_letters\", df)\nmy_letters = C(df.letters, TreatmentContrasts(base=\"b\")) model_matrix(\"my_letters\", df) Out[12]: Intercept my_letters[T.a] my_letters[T.c] 0 1.0 1 0 1 1.0 0 0 2 1.0 0 1 In\u00a0[13]: Copied!
import numpy\n\nZ = numpy.array(\n [\n [1, 0, 0, 0], # A\n [-1, 1, 0, 0], # B - A\n [0, -1, 1, 0], # C - B\n [-1, 0, 0, 1], # D - A\n ]\n)\ncoding = numpy.linalg.inv(Z)[:, 1:]\ncoding\nimport numpy Z = numpy.array( [ [1, 0, 0, 0], # A [-1, 1, 0, 0], # B - A [0, -1, 1, 0], # C - B [-1, 0, 0, 1], # D - A ] ) coding = numpy.linalg.inv(Z)[:, 1:] coding Out[13]:
array([[0., 0., 0.],\n [1., 0., 0.],\n [1., 1., 0.],\n [0., 0., 1.]])In\u00a0[14]: Copied!
model_matrix(\n \"C(letters, contr.custom(coding))\", DataFrame({\"letters\": [\"A\", \"B\", \"C\", \"D\"]})\n)\nmodel_matrix( \"C(letters, contr.custom(coding))\", DataFrame({\"letters\": [\"A\", \"B\", \"C\", \"D\"]}) ) Out[14]: Intercept C(letters, contr.custom(coding))[1] C(letters, contr.custom(coding))[2] C(letters, contr.custom(coding))[3] 0 1.0 0.0 0.0 0.0 1 1.0 1.0 0.0 0.0 2 1.0 1.0 1.0 0.0 3 1.0 0.0 0.0 1.0
The model matrices generated from formulae are often consumed directly by linear regression algorithms. In these cases, if your model matrix is not full rank, then the features in your model are not linearly independent, and the resulting coefficients (assuming they can be computed at all) cannot be uniquely determined. While there are ways to handle this, such as regularization, it is usually easiest to put in a little more effort during the model matrix creation process, and make the incoming vectors in your model matrix linearly independent from the outset. As noted in the text above, categorical coding requires consideration about the overlap of the coding with the intercept in order to remain full rank. The good news is that Formulaic will do most of the heavy lifting for you, and does so by default.
It is important to note at this point that Formulaic does not protect against all forms of linear dependence, only structural linear dependence; i.e. linear dependence that results from multiple categorical variables overlapping in vectorspace. If you have two identical numerical vectors called by two different names in your model matrix, Formulaic will happily build the model matrix you requested, and you're on your own. This is intentional. While Formulaic strives to make the model matrix generation process as painless as possible, it also doesn't want to make more assumptions about the use of the data than is necessary. Note that you can also disable Formulaic's structural full-rankness algorithms by passing ensure_full_rank=False
to model_matrix()
or .get_model_matrix()
methods; and can bypass the reduction of the rank of a single categorical term in a formula using C(..., spans_intercept=False)
(this is especially useful, for example, if your model includes regularization and you would prefer to use the over-specified model to ensure fairer shrinkage).
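As a minimal sketch of both of these escape hatches (the single-column `df` here is a hypothetical dataset; the parameters themselves are those described above):

```python
import pandas
from formulaic import model_matrix

df = pandas.DataFrame({"letters": ["a", "b", "c"]})

# Disable the structural full-rankness guarantee entirely; every level of
# `letters` gets its own column, so the matrix is over-specified:
model_matrix("letters", df, ensure_full_rank=False)

# Or keep the guarantee globally, but allow this one factor to span the
# intercept (i.e. skip its rank reduction):
model_matrix("C(letters, spans_intercept=False)", df)
```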
The algorithm that Formulaic uses was heavily inspired by patsy
^1. The basic idea is to recognize that all categorical codings span the intercept[^2]; and then to break that coding up into two pieces: a single column that can be dropped to avoid spanning the intercept, and the remaining body of the coding that will always be present. The categorical factors are then expanded associatively, and the components greedily recombined, omitting any that would lead to structural linear dependence. The result is a set of categorical codings that only spans the intercept when it is safe to do so, guaranteeing structural full rankness. The patsy documentation goes into this in much more detail if this is interesting to you.
[^2]: This assumes that categories are \"complete\", in that each unit has been assigned a category. You can \"complete\" categories by treating all those unassigned as being a member of an imputed \"null\" category.
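To see the default behaviour in action, here is a minimal sketch (with a hypothetical dataset of two overlapping categorical columns) in which each factor would individually span the intercept, and so each is encoded with reduced rank:

```python
import pandas
from formulaic import model_matrix

df = pandas.DataFrame({"A": ["a", "a", "b", "b"], "B": ["x", "y", "x", "y"]})

# One redundant column is dropped from each coding, keeping the result
# structurally full-rank: Intercept, A[T.b], B[T.y]
model_matrix("A + B", df)
```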
"},{"location":"guides/contrasts/#basic-usage","title":"Basic usage\u00b6","text":"Formulaic follows in the stead of R and Patsy by automatically inferring from the data whether a feature needs to categorically encoded. For example:
"},{"location":"guides/contrasts/#how-does-contrast-coding-work","title":"How does contrast coding work?\u00b6","text":"As you have seen, contrast coding transforms categorical vectors into a matrix of numbers that can be used during modeling. If your data has $K$ mutually exclusive categories, these matrices typically consist of $K-1$ columns. This reduction in dimensionality reflects the fact that membership of the $K$th category could be inferred from the lack of membership in any other category, and so is redundant in the presence of a global intercept. You can read more about this in the full rankness discussion below.
The first step toward generating numerical vectors from categorical data is to dummy encode it. This transforms the single vector of $K$ categories into $K$ boolean vectors, each having a $1$ only in rows that are members of the corresponding category. If you do not have a global intercept, you can use this dummy encoding directly, with the full $K$ columns, and contrasts are unnecessary. In the presence of a global intercept, however, you need to reduce the rank of your coding by thinking about contrasts (or differences) between the levels.
In practice, this dimension reduction using \"contrasts\" looks like constructing a $K \\times (K-1)$ \"coding matrix\" that describes the contrasts of interest. You can then post-multiply your dummy-encoded columns by it. That is: $$ E = DC $$ where $E \\in \\mathbb{R}^{N \\times (K-1)}$ is the contrast coded categorical data, $D \\in \\{0, 1\\}^{N \\times K}$ is the dummy encoded data, and $C \\in \\mathbb{R}^{K \\times (K-1)}$ is the coding matrix.
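As a small numpy sketch of this arithmetic (the data and the treatment-style coding matrix are illustrative assumptions; the treatment matrix itself is derived below):

```python
import numpy
import pandas

# Dummy encode a categorical vector with K = 3 levels: D is N x K.
D = pandas.get_dummies(pandas.Series(["a", "b", "c", "a"])).to_numpy(dtype=float)

# Treatment coding matrix (drop the first level): C is K x (K-1).
C = numpy.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Contrast coded data: E = DC is N x (K-1).
E = D @ C
```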
The easiest way to construct a coding matrix is to start with a \"coefficient matrix\" $Z \\in \\mathbb{R}^{K \\times K}$ which describes the contrasts that you want the coefficients of a trained linear regression model to represent (with columns representing the untransformed levels, and rows representing the transforms). For a consistently chosen set of contrasts, this matrix will be full-rank, and the inverse of this matrix will have a constant column representing the global intercept. Removing this column results in the $K \\times (K-1)$ coding matrix that should be applied to the dummy encoded data in order for the coefficients to have the desired interpretation.
For example, if we wanted all of the levels to be compared to the first level, we would build a matrix $Z$ as: $$ \\begin{align} Z =& \\left(\\begin{array}{c c c c} 1 & 0 & 0 & 0 \\\\ -1 & 1 & 0 & 0\\\\ -1 & 0 & 1 & 0\\\\ -1 & 0 & 0 & 1 \\end{array}\\right)\\\\ \\therefore Z^{-1} =& \\left(\\begin{array}{c c c c} 1 & 0 & 0 & 0 \\\\ 1 & 1 & 0 & 0\\\\ 1 & 0 & 1 & 0\\\\ 1 & 0 & 0 & 1 \\end{array}\\right)\\\\ \\implies C =& \\left(\\begin{array}{c c c} 0 & 0 & 0 \\\\ 1 & 0 & 0\\\\ 0 & 1 & 0\\\\ 0 & 0 & 1 \\end{array}\\right) \\end{align} $$ This is none other than the default \"treatment\" coding described below, which applies one-hot coding to the categorical data.
It is important to note that while your choice of contrast coding will change the interpretation and values of your coefficients, all contrast encodings ultimately result in equivalent regressions, and it is possible to retrospectively infer any other set of interesting contrasts given the regression covariance matrix. The task is therefore to find the most useful representation, not the \"correct\" one.
For those interested in reading more, the R Documentation on Coding Matrices covers this in more detail.
"},{"location":"guides/contrasts/#contrast-codings","title":"Contrast codings\u00b6","text":"This section introduces the contrast encodings that are shipped as part of Formulaic. These implementations live in formulaic.transforms.contrasts
, and are surfaced by default in formulae as an attribute of contr
(e.g. contr.treatment
, in order to be consistent with R). You can always implement your own contrasts if the need arises.
If you would like to dig deeper and see the actual contrast/coefficient matrices for various parameters, you can directly import these contrast implementations and play with them in a Python shell, but otherwise for brevity we will not exhaustively show these in the following documentation. For example:
"},{"location":"guides/contrasts/#treatment-aka-dummy","title":"Treatment (aka. dummy)\u00b6","text":"This contrast coding compares each level with some reference level. If not specified, the reference level is taken to be the first level. The reference level can be specified as the first argument to the TreatmentContrasts
/contr.treatment
constructor.
Example formulae:
~ X
: Assuming X
is categorical, the treatment encoding will be used by default.~ C(X)
: You can also explicitly flag a feature to be encoded as categorical, whereupon the default is treatment encoding.~ C(X, contr.treatment)
: Explicitly indicate that the treatment encoding should be used.~ C(X, contr.treatment(\"x\"))
: Indicate that the reference treatment should be \"x\" instead of the first index.~ C(X, contr.treatment(base=\"x\"))
: As above.The contrasts generated by this class are the same as the above, but with the reference level defaulting to the last level (the default in SAS).
Example formulae:
~ C(X, contr.SAS)
: Basic use-case.~ C(X, contr.SAS(\"x\"))
: Same as treatment encoding case above.~ C(X, contr.SAS(base=\"x\"))
: Same as treatment encoding case above.These contrasts compare each level (except the last, which is redundant) to the global average of all levels.
Example formulae:
~ C(X, contr.sum)
: Encode categorical data using the sum coding.These contrasts compare each successive level to the average of all previous/subsequent levels. It has two configurable parameters: reverse
which controls the direction of comparison, and scale
which controls whether to scale the encoding to simplify interpretation of coefficients (results in a floating point model matrix instead of an integer one). When reverse
is True
, the contrasts compare a level to all previous levels; and when False
, it compares it to all subsequent levels.
The default parameter values are chosen to match the R implementation, which corresponds to a reversed and unscaled Helmert coding.
Example formulae:
~ C(X, contr.helmert)
: Unscaled reverse coding.~ C(X, contr.helmert(reverse=False))
: Unscaled forward coding.~ C(X, contr.helmert(scale=True))
: Scaled reverse coding.~ C(X, contr.helmert(scale=True, reverse=False))
: Scaled forward coding.These contrasts take the difference of each level with the previous level. It has one parameter, forward
, which indicates that the difference should be inverted such that the difference is taken between the previous level and the current level. The default attribute values are chosen to match the R implementation, and correspond to a backward difference coding.
Example formulae:
~ C(X, contr.diff)
: Backward coding.~ C(X, contr.diff(forward=True))
: Forward coding.These contrasts represent a categorical variable that is assumed to have equal (or known) spacing/scores, and allow us to model non-linear polynomial behaviour of the dependent variable with respect to the ordered levels by projecting the spacing onto a basis of orthogonal polynomials. It has one parameter, scores
which indicates the spacing of the categories. It must have the same length as the number of levels. If not provided, the categories are assumed equidistant and spaced by 1.
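For example, a sketch under the assumption that contr.poly accepts the scores parameter described above (the data, levels, and scores here are hypothetical):

```python
import pandas
from formulaic import model_matrix

df = pandas.DataFrame({"X": ["low", "mid", "high"]})

# One score per level, in level order; here the levels are unequally spaced.
model_matrix("C(X, contr.poly(scores=[1, 2, 5]), levels=['low', 'mid', 'high'])", df)
```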
The feature names of categorical variables can become quite unwieldy, as you may have noticed. Fortunately this is easily remedied by aliasing the variable outside of your formula (and then making it available via formula context). This is done automatically if you use the model_matrix
function. For example:
It may be useful to define your own coding matrices in some contexts. This is readily achieved using the CustomContrasts
class directly or via the contr.custom
alias. In these cases, you are responsible for providing the coding matrix ($C$ from above). For example, if you had four levels: A, B, C and D, and wanted to compute the contrasts: B - A, C - B, and D - A, you could write:
This section of the documentation is intended to provide a high-level overview of the way in which formulae are interpreted and materialized by Formulaic.
Recall that the goal of a formula is to act as a recipe for building a \"model matrix\" (also known as a \"design matrix\") from an existing dataset. Following the recipe should result in a dataset that consists only of numerical columns that can be linearly combined to model an outcome/response of interest (the coefficients of which are typically estimated using linear regression). As such, this process will bake in any desired non-linearity via interactions or transforms, and will encode nominal/categorical/factor data as a collection of numerical contrasts.
The ingredients of each formula are the columns of the original dataset, and each operator acting on these columns in the formula should be thought of as inclusion/exclusion of the column in the resulting model matrix, or as a transformation on the column(s) prior to inclusion. Thus, a +
operator does not act in its usual algebraic manner, but rather acts as set union, indicating that both the left- and right-hand arguments should be included in the model matrix; a -
operator acts like a set difference; and so on.
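A minimal sketch of this set-like behaviour:

```python
from formulaic import Formula

# `+` unions terms into the set and `-` removes them again, so this
# formula reduces to the (implicit) intercept plus `b`:
Formula("a + b - a")  # -> 1 + b
```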
Formulas in Formulaic are represented by (subclasses of) the Formula
class. Instances of Formula
subclasses are ultimately containers for sets of Term
instances, which in turn are containers for sets of Factor
instances. Let's start our dissection at the bottom, and work our way up.
from formulaic.parser.types import Factor\n\nFactor(\n \"1\", eval_method=\"literal\"\n) # a factor that represents the numerical constant of 1\nFactor(\"a\") # a factor that will be looked up from the data context\nFactor(\n \"a + b\", eval_method=\"python\"\n) # a factor that will return the sum of `a` and `b`\nfrom formulaic.parser.types import Factor Factor( \"1\", eval_method=\"literal\" ) # a factor that represents the numerical constant of 1 Factor(\"a\") # a factor that will be looked up from the data context Factor( \"a + b\", eval_method=\"python\" ) # a factor that will return the sum of `a` and `b` Out[1]:
a + bIn\u00a0[2]: Copied!
from formulaic.parser.types import Term\n\nTerm(factors=[Factor(\"b\"), Factor(\"a\"), Factor(\"c\")])\nfrom formulaic.parser.types import Term Term(factors=[Factor(\"b\"), Factor(\"a\"), Factor(\"c\")]) Out[2]:
b:a:c
Note that to ensure uniqueness in the representation, the factor instances are sorted.
In\u00a0[3]: Copied!from formulaic import Formula\n\n# Unstructured formula (a simple list of terms)\nf = Formula(\n [\n Term(factors=[Factor(\"c\"), Factor(\"d\"), Factor(\"e\")]),\n Term(factors=[Factor(\"a\"), Factor(\"b\")]),\n ]\n)\nf\nfrom formulaic import Formula # Unstructured formula (a simple list of terms) f = Formula( [ Term(factors=[Factor(\"c\"), Factor(\"d\"), Factor(\"e\")]), Term(factors=[Factor(\"a\"), Factor(\"b\")]), ] ) f Out[3]:
a:b + c:d:e
Note that unstructured formulae are actually instances of SimpleFormula
(a Formula
subclass that acts like a mutable list of Term
instances):
type(f), list(f)\ntype(f), list(f) Out[4]:
(formulaic.formula.SimpleFormula, [a:b, c:d:e])
Also note that in its standard representation, the terms are separated by \"+\" which is interpreted as the set union in this context, and that (as we have seen for Term
instances) Formula
instances are sorted (the default is to sort terms only by interaction order, but this can be customized and/or disabled, as described below).
Structured formulae are constructed similarly:
In\u00a0[5]: Copied!f = Formula(\n [\n Term(factors=[Factor(\"root_col\")]),\n ],\n my_substructure=[\n Term(factors=[Factor(\"sub_col\")]),\n ],\n nested=Formula(\n [\n Term(factors=[Factor(\"nested_col\")]),\n Term(factors=[Factor(\"another_nested_col\")]),\n ],\n really_nested=[\n Term(factors=[Factor(\"really_nested_col\")]),\n ],\n ),\n)\nf\nf = Formula( [ Term(factors=[Factor(\"root_col\")]), ], my_substructure=[ Term(factors=[Factor(\"sub_col\")]), ], nested=Formula( [ Term(factors=[Factor(\"nested_col\")]), Term(factors=[Factor(\"another_nested_col\")]), ], really_nested=[ Term(factors=[Factor(\"really_nested_col\")]), ], ), ) f Out[5]:
root:\n root_col\n.my_substructure:\n sub_col\n.nested:\n root:\n nested_col + another_nested_col\n .really_nested:\n really_nested_col
Structured formulae are instances of StructuredFormula
:
type(f)\ntype(f) Out[6]:
formulaic.formula.StructuredFormula
And the sub-formula can be selected using:
In\u00a0[7]: Copied!f.root\nf.root Out[7]:
root_colIn\u00a0[8]: Copied!
f.nested\nf.nested Out[8]:
root:\n nested_col + another_nested_col\n.really_nested:\n really_nested_col
Formulae can also have different ordering conventions applied to them. By default, Formulaic follows R conventions around ordering whereby terms are sorted by their interaction degree (number of factors) and then by the order in which they were present in the term list. This behaviour can be modified to perform no ordering or full lexical sorting of terms and factors by passing _ordering=\"none\"
or _ordering=\"sort\"
respectively to the Formula
constructor. The default ordering is equivalent to passing _ordering=\"degree\"
. For example:
{\n \"degree\": Formula(\"z + z:a + z:b:a + g\"),\n \"none\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"none\"),\n \"sort\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"sort\"),\n}\n{ \"degree\": Formula(\"z + z:a + z:b:a + g\"), \"none\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"none\"), \"sort\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"sort\"), } Out[9]:
{'degree': 1 + z + g + z:a + z:b:a,\n 'none': 1 + z + z:a + z:b:a + g,\n 'sort': 1 + g + z + a:z + a:b:z}
Formulaic intentionally makes the tokenization phase as unopinionated and unstructured as possible. This allows formula grammars to be extended via plugins using only high-level APIs (usually Operator
s).
The tokenizer's role is to take an arbitrary string representation of a formula and convert it into a series of Token
instances. The tokenization phase knows very little about formula grammar except that whitespace doesn't matter and that non-word characters should be treated as operators or context indicators. Interpretation of these tokens is left to the AST generation phase. There are five different kinds of tokens: operator, name, value, python, and context, the last of which is emitted for grouping characters such as parentheses () and square brackets []. The tokenizer treats text quoted with ` characters as a name token, and {} are used to quote Python operations.
An example of the tokens generated can be seen below:
In\u00a0[10]: Copied!from formulaic.parser import DefaultFormulaParser\n\n[\n f\"{token.token} : {token.kind.value}\"\n for token in (\n DefaultFormulaParser(include_intercept=False).get_tokens(\n \"y ~ 1 + b:log(c) | `d$in^df` + {e + f}\"\n )\n )\n]\nfrom formulaic.parser import DefaultFormulaParser [ f\"{token.token} : {token.kind.value}\" for token in ( DefaultFormulaParser(include_intercept=False).get_tokens( \"y ~ 1 + b:log(c) | `d$in^df` + {e + f}\" ) ) ] Out[10]:
['y : name',\n '~ : operator',\n '1 : value',\n '+ : operator',\n 'b : name',\n ': : operator',\n 'log(c) : python',\n '| : operator',\n 'd$in^df : name',\n '+ : operator',\n 'e + f : python']
The next phase is to assemble an abstract syntax tree (AST) from the tokens output from the above that when evaluated will generate the Term
instances we need to build a formula. This is done by using an enriched shunting yard algorithm which determines how to interpret each operator token based on the symbol used, the number and position of the non-operator arguments, and the current context (i.e. how many parentheses deep we are). This allows us to disambiguate between, for example, unary and binary addition operators. The available operators and their implementations are described in more detail in the Formula Grammar section of this documentation. It is worth noting that the available operators can be easily modified at runtime, and that this is typically all that needs to be modified in order to add new formula grammars.
The result is an AST that looks something like:
In\u00a0[11]: Copied!DefaultFormulaParser().get_ast(\"y ~ a + b:c\")\nDefaultFormulaParser().get_ast(\"y ~ a + b:c\") Out[11]:
<ASTNode ~: [y, <ASTNode +: [<ASTNode +: [1, a]>, <ASTNode :: [b, c]>]>]>
Now that we have the AST, we can readily evaluate it to generate the Term
instances we need to pass to our Formula
constructor. For example:
terms = DefaultFormulaParser(include_intercept=False).get_terms(\"y ~ a + b:c\")\nterms\nterms = DefaultFormulaParser(include_intercept=False).get_terms(\"y ~ a + b:c\") terms Out[12]:
.lhs:\n {y}\n.rhs:\n {a, b:c}In\u00a0[13]: Copied!
Formula(terms)\nFormula(terms) Out[13]:
.lhs:\n y\n.rhs:\n a + b:c
Of course, manually building the terms and passing them to the formula constructor is a bit annoying, and so instead we allow passing the string directly to the Formula
constructor; and allow you to override the default parser if you so desire (though 99.9% of the time this wouldn't be necessary).
Thus, we can generate the same formula from above using:
In\u00a0[14]: Copied!Formula(\"y ~ a + b:c\", _parser=DefaultFormulaParser(include_intercept=False))\nFormula(\"y ~ a + b:c\", _parser=DefaultFormulaParser(include_intercept=False)) Out[14]:
.lhs:\n y\n.rhs:\n a + b:c
Once you have a Formula
instance, the next logical step is to use it to materialize a model matrix. This is usually as simple as passing the raw data as an argument to .get_model_matrix()
:
import pandas\n\ndata = pandas.DataFrame(\n {\"a\": [1, 2, 3], \"b\": [4, 5, 6], \"c\": [7, 8, 9], \"A\": [\"a\", \"b\", \"c\"]}\n)\nFormula(\"a + b:c\").get_model_matrix(data)\nimport pandas data = pandas.DataFrame( {\"a\": [1, 2, 3], \"b\": [4, 5, 6], \"c\": [7, 8, 9], \"A\": [\"a\", \"b\", \"c\"]} ) Formula(\"a + b:c\").get_model_matrix(data) Out[15]: Intercept a b:c 0 1.0 1 28 1 1.0 2 40 2 1.0 3 54
Just as for formulae, the model matrices can be structured, and will be structured in the same way as the original formula. For example:
In\u00a0[16]: Copied!Formula(\"a\", group=\"b+c\").get_model_matrix(data)\nFormula(\"a\", group=\"b+c\").get_model_matrix(data) Out[16]:
root:\n Intercept a\n 0 1.0 1\n 1 1.0 2\n 2 1.0 3\n.group:\n b c\n 0 4 7\n 1 5 8\n 2 6 9
Under the hood, both of these calls have looked at the type of the data (pandas.DataFrame
here) and then looked up the FormulaMaterializer
associated with that type (PandasMaterializer
here), and then passed the formula and data along to the materializer for materialization. It is also possible to request a specific output type that varies by materializer (PandasMaterializer
supports \"pandas\", \"numpy\", and \"sparse\"). If one is not selected, the first available output type is selected for you. Thus, the above code is equivalent to:
from formulaic.materializers import PandasMaterializer\n\nPandasMaterializer(data).get_model_matrix(Formula(\"a + b:c\"), output=\"pandas\")\nfrom formulaic.materializers import PandasMaterializer PandasMaterializer(data).get_model_matrix(Formula(\"a + b:c\"), output=\"pandas\") Out[17]: Intercept a b:c 0 1.0 1 28 1 1.0 2 40 2 1.0 3 54
The return type of .get_model_matrix()
is either a ModelMatrix
instance if the original formula was unstructured, or a ModelMatrices
instance that is just a structured container for ModelMatrix
instances. However, ModelMatrix
is an ObjectProxy subclass, and so it also acts like the type of object requested. For example:
import numpy\n\nfrom formulaic import ModelMatrix\n\nmm = Formula(\"a + b:c\").get_model_matrix(data, output=\"numpy\")\nisinstance(mm, ModelMatrix), isinstance(mm, numpy.ndarray)\nimport numpy from formulaic import ModelMatrix mm = Formula(\"a + b:c\").get_model_matrix(data, output=\"numpy\") isinstance(mm, ModelMatrix), isinstance(mm, numpy.ndarray) Out[18]:
(True, True)
The main purpose of this additional proxy layer is to expose the ModelSpec
instance associated with the materialization, which retains all of the encoding choices made during materialization (for reuse in subsequent materializations), as well as metadata about the feature names of the current model matrix (which is very useful when your model matrix output type doesn't have column names, like numpy or sparse arrays). This ModelSpec
instance is always available via .model_spec
, and is introduced in more detail in the Model Specs section of this documentation.
mm.model_spec\nmm.model_spec Out[19]:
ModelSpec(formula=1 + a + b:c, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b:c, scoped_terms=[b:c], columns=['b:c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})
It is sometimes convenient to have the columns in the final model matrix be clustered by numerical factors included in the terms. This means that in regression reports, for example, all of the columns related to a particular feature of interest (including its interactions with various categorical features) are contiguously clustered. This is the default behaviour in patsy. You can perform this clustering in Formulaic by passing the cluster_by=\"numerical_factors\"
argument to model_matrix
or any of the .get_model_matrix(...)
methods. For example:
Formula(\"a + b + a:A + A:b\").get_model_matrix(data, cluster_by=\"numerical_factors\")\nFormula(\"a + b + a:A + A:b\").get_model_matrix(data, cluster_by=\"numerical_factors\") Out[20]: Intercept a a:A[T.b] a:A[T.c] b A[T.b]:b A[T.c]:b 0 1.0 1 0 0 4 0 0 1 1.0 2 2 0 5 5 0 2 1.0 3 0 3 6 0 6"},{"location":"guides/formulae/#anatomy-of-a-formula","title":"Anatomy of a Formula\u00b6","text":""},{"location":"guides/formulae/#factor","title":"Factor\u00b6","text":"
Factor
instances are the atomic unit of a formula, and represent the output of a single expression evaluation. Typically this will be one vector of data, but could also be more than one column (especially common with categorically encoded data).
A Factor
instance's expression can be evaluated in one of three ways: as a literal (the expression is treated as a literal value), as a lookup (the expression is looked up directly from the data context; the default), or as python (the expression is evaluated as Python code).
Note: Factor instances act as metadata only, and are not directly responsible for doing the evaluation. This is handled in a backend specific way by the appropriate Materializer
instance.
In code, instantiating a factor looks like:
"},{"location":"guides/formulae/#term","title":"Term\u00b6","text":"Term
instances are a thin wrapper around a set of Factor
instances, and represent the Cartesian (or Kronecker) product of the factors. If all of the Factor
instances evaluate to single columns, then the Term
represents the product of all of the factor columns.
Instantiating a Term
looks like:
Formula
instances are (potentially nested) wrappers around collections of Term
instances. During materialization into a model matrix, each Term
instance will have its columns independently inserted into the resulting matrix.
Formula
instances can consist of a single \"list\" of Term
instances, or may be \"structured\"; for example, we may want a separate collection of terms for the left- and right-hand side of a formula; or to simultaneously construct multiple model matrices for different parts of our modeling process.
For example, an unstructured formula might look like:
"},{"location":"guides/formulae/#parsed-formulae","title":"Parsed Formulae\u00b6","text":"While it would be possible to always manually construct Formula
instances in this way, it would quickly grow tedious. As you might have guessed from reading the quickstart or via other implementations, this is where Wilkinson formulae come in. Formulaic has a rich extensible formula parser that converts string expressions into the formula structures you see above. Where functionality and grammar overlap, it tries to conform to existing patterns found in R and patsy.
Formula parsing happens in three phases: tokenization of the formula string; assembly of the tokens into an abstract syntax tree (AST); and evaluation of that tree into the Term
instances we need. In the sections below these phases are described in more detail.
"},{"location":"guides/formulae/#tokenization","title":"Tokenization\u00b6","text":""},{"location":"guides/formulae/#abstract-syntax-tree-ast","title":"Abstract Syntax Tree (AST)\u00b6","text":""},{"location":"guides/formulae/#evaluation","title":"Evaluation\u00b6","text":""},{"location":"guides/formulae/#materialization","title":"Materialization\u00b6","text":""},{"location":"guides/grammar/","title":"Formula Grammar","text":"This section of the documentation describes the formula grammar used by Formulaic. It is almost identical that used by patsy and R, and so most formulas should work without modification. However, there are some differences, which are called out below.
"},{"location":"guides/grammar/#operators","title":"Operators","text":"In this section, we introduce a complete list of the grammatical operators that you can use by default in your formulas. They are listed such that each section (demarcated by \"-----\") has higher precedence then the block that follows. When you write a formula involving several operators of different precedence, those with higher precedence will be resolved first. \"Arity\" is the number of arguments the operator takes. Within operators of the same precedence, all binary operators are evaluated from left to right (they are left-associative). To highlight differences in grammar betweeh formulaic, patsy and R, we highlight any differences below. If there is a checkmark the Formulaic, Patsy and R columns, then the grammar is consistent across all three, unless otherwise indicated.
Operator Arity Description Formulaic Patsy R\"...\"
1 1 String literal. \u2713 \u2713 \u2717 [0-9]+\\.[0-9]+
1 1 Numerical literal. \u2713 \u2717 \u2717 `...`
1 1 Quotes fieldnames within the incoming dataframe, allowing the use of special characters, e.g. `my|special$column!`
\u2713 \u2717 \u2713 {...}
1 1 Quotes python operations, as a more convenient way to do Python operations than I(...)
, e.g. {`my|col`**2}
\u2713 \u2717 \u2717 <function>(...)
1 1 Python transform on column, e.g. my_func(x)
which is equivalent to {my_func(x)}
\u27132 \u2713 \u2717 ----- (...)
1 Groups operations, overriding normal precedence rules. All operations with the parentheses are performed before the result of these operations is permitted to be operated upon by its peers. \u2713 \u2717 \u2713 ----- .
9 0 Stands in as a wild-card for the sum of variables in the data not used on the left-hand side of a formula. \u2713 \u2717 \u2713 ----- **
2 Includes all n-th order interactions of the terms in the left operand, where n is the (integral) value of the right operand, e.g. (a+b+c)**2
is equivalent to a + b + c + a:b + a:c + b:c
. \u2713 \u2713 \u2713 ^
2 Alias for **
. \u2713 \u27173 \u2713 ----- :
2 Adds a new term that corresponds to the interaction of its operands (i.e. their elementwise product). \u27134 \u2713 \u2713 ----- *
2 Includes terms for each of the additive and interactive effects of the left and right operands, e.g. a * b
is equivalent to a + b + a:b
. \u2713 \u2713 \u2713 /
2 Adds terms describing nested effects. It expands to the addition of a new term for the left operand and the interaction of all left operand terms with the right operand, i.e a / b
is equivalent to a + a:b
, (a + b) / c
is equivalent to a + b + a:b:c
, and a/(b+c)
is equivalent to a + a:b + a:c
.5 \u2713 \u2713 \u2713 %in%
2 As above, but with arguments inverted: e.g. b %in% a
is equivalent to a / b
. \u2713 \u2717 \u2713 ----- +
2 Adds a new term to the set of features. \u2713 \u2713 \u2713 -
2 Removes a term from the set of features (if present). \u2713 \u2713 \u2713 +
1 Returns the current term unmodified (not very useful). \u2713 \u2713 \u2713 -
1 Negates a term (only implemented for 0, in which case it is replaced with 1
). \u2713 \u2713 \u2713 ----- \\|
2 Splits a formula into multiple parts, allowing the simultaneous generation of multiple model matrices. When on the right-hand-side of the ~
operator, all parts will attract an additional intercept term by default. \u2713 \u2717 \u27136 ----- ~
1,2 Separates the target features from the input features. If absent, it is assumed that we are considering only the input features. Unless otherwise indicated, it is assumed that the input features implicitly include an intercept. \u2713 \u2713 \u2713 [ . ~ . ]
2 [Experimental] Multi stage formula notation, which is useful in (e.g.) IV contexts. Requires the MULTISTAGE
feature flag to be passed to the parser. \u2713 \u2717 \u2717"},{"location":"guides/grammar/#transforms","title":"Transforms","text":"Formulaic supports arbitrary transforms, any of which can also preserve state so that new data can undergo the same transformation as that used during modelling. The currently implemented transforms are shown below. Commonly used transforms that have not been implemented by formulaic
are explicitly noted also.
I(...)
Identity transform, allowing arbitrary Python/R operations, e.g. I(x+y)
. Note that in formulaic
, it is more idiomatic to use {x+y}
. \u2713 \u2713 \u2713 Q('<column_name>')
Look up feature by potentially exotic name, e.g. Q('wacky name!')
. Note that in formulaic
, it is more idiomatic to use `wacky name!`
. \u2713 \u2713 \u2717 C(...)
Categorically encode a column, e.g. C(x)
\u2713 \u2713 \u2713 center(...)
Shift column data so mean is zero. \u2713 \u2713 \u2717 scale(...)
Shift column so mean is zero and variance is 1. \u2713 \u27137 \u2713 standardize(...)
Alias of scale
. \u27138 \u2713 \u2717 lag(...[, <k>])
Generate lagging or leading columns (useful for datasets collected at regular intervals). \u2713 \u2717 \u2713 poly(...)
Generates a polynomial basis, allowing non-linear fits. \u2713 \u2717 \u2713 bs(...)
Generates a B-Spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 cs(...)
Generates a natural cubic spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 cr(...)
Alias for cs
above. \u2713 \u2717 \u2713 cc(...)
Generates a cyclic cubic spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 te(...)
Generates a tensor product smooth. \u2717 \u2713 \u2713 hashed(...)
Categorically encode a deterministic hash of a column. \u2713 \u2717 \u2717 ... Others? Contributions welcome! ? ? ? Tip
Any function available in the context
dictionary will also be available as transform, along with some commonly used functions imported from numpy: log
, log10
, log2
, exp
, exp10
, and exp2
. In addition the numpy
module is always available as np
. Thus, formulas like: log(y) ~ x + 10
will always do the right thing, even when these functions have not been made available in the user namespace.
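For instance, a small sketch of this behaviour (the dataset here is hypothetical):

```python
import pandas
from formulaic import model_matrix

df = pandas.DataFrame({"x": [1.0, 2.0, 3.0], "y": [1.0, 10.0, 100.0]})

# `log` resolves to numpy.log even though we never imported it here:
y, X = model_matrix("log(y) ~ x", df)
```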
Note
Formulaic does not (yet) support including extra terms in the formula that will not result in additions to the dataframe, for example model annotations like R's offset(...)
.
Beyond the formula operator grammar itself there are some differing behaviours and conventions of which you should be aware.
Formula
R package in that both sides of the ~
operator are considered to be using the formula grammar, with the only difference being that the right hand side attracts an intercept by default. In vanilla R, the left hand side is treated as R code (and so x + y ~ z
would result in a single column on the left-hand-side). You can recover vanilla R's behaviour by nesting the operations in a Python operator block (as described in the operator table): {y1 + y2} ~ a + b
.formulaic
with the same set of fields will always generate the same model matrix.b-1
and (b-1)
both do not have an intercept, whereas in Formulaic and Patsy the parentheses are resolved first, and so the first does not have an intercept and the second does (because '1 +' is implicitly prepended to the right hand side of the formula). This \"operator\" is actually part of the tokenisation process.
Formulaic additionally supports quoted fields with special characters, e.g. my_func(`my|special+column`)
.\u00a0\u21a9
The caret operator is not supported, but will not cause an error. It is ignored by the patsy formula parser, and treated as XOR Python operation on column.\u00a0\u21a9
Note that Formulaic also allows you to use this to scale columns, for example: 2.5:a
(this scaling happens after factor coding).\u00a0\u21a9
This somewhat confusing operator is useful when you want to include hierarchical features in your data, and where certain interaction terms do not make sense (particularly in ANOVA contexts). For example, if a
represents countries, and b
represents cities, then the full product of terms from a * b === a + b + a:b
does not make sense, because any value of b
is guaranteed to coincide with a value in a
, and does not independently add value. Thus, the operation a / b === a + a:b
results in a more sensible dataset. As a result, the /
operator is right-distributive, since if b
and c
were both nested in a
, you would want a/(b+c) === a + a:b + a:c
. Likewise, the operator is not left-distributive, since if c
is nested under both a
and b
separately, then you want (a + b)/c === a + b + a:b:c
. Lastly, if c
is nested in b
, and b
is nested in a
, then you would want a/b/c === a + a:(b/c) === a + a:b + a:b:c
.\u00a0\u21a9
Implemented by an R package called Formula that extends the default formula syntax.\u00a0\u21a9
Patsy uses the rescale
keyword rather than scale
, but provides the same functionality.\u00a0\u21a9
For increased compatibility with patsy, we use patsy's signature for standardize
.\u00a0\u21a9
Requires additional context to be passed in when directly using the Formula
constructor. e.g. Formula(\"y ~ .\", context={\"__formulaic_variables_available__\": [\"x\", \"y\", \"z\"]})
; or you can use model_matrix
, ModelSpec.get_model_matrix()
, or FormulaMaterializer.get_model_matrix()
without further specification.\u00a0\u21a9
As Formulaic matures it is expected that it will be integrated directly into downstream projects where formula parsing is required. This is known to have already happened in the following high-profile projects:
Where direct integration has not yet happened, you can still use Formulaic in conjunction with other commonly used libraries. On this page, we will add various examples of how to achieve this. If you have done some integration work, please feel free to submit a PR that extends this documentation!
statsmodels is a popular toolkit hosting many different statistical models, tests, and exploration tools. The formula API in statsmodels
is currently based on patsy
. If you need the features found in Formulaic, you can use it directly to generate the model matrices, and use the regular API. For example:
import pandas\nfrom statsmodels.api import OLS\n\nfrom formulaic import model_matrix\n\ndata = pandas.DataFrame({\"y\": [0.1, 0.4, 3], \"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]})\ny, X = model_matrix(\"y ~ a + b\", data)\nmodel = OLS(y, X)\nresults = model.fit()\nprint(results.summary())\nimport pandas from statsmodels.api import OLS from formulaic import model_matrix data = pandas.DataFrame({\"y\": [0.1, 0.4, 3], \"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]}) y, X = model_matrix(\"y ~ a + b\", data) model = OLS(y, X) results = model.fit() print(results.summary())
OLS Regression Results \n==============================================================================\nDep. Variable: y R-squared: 1.000\nModel: OLS Adj. R-squared: nan\nMethod: Least Squares F-statistic: nan\nDate: Fri, 14 Apr 2023 Prob (F-statistic): nan\nTime: 21:15:59 Log-Likelihood: 102.01\nNo. Observations: 3 AIC: -198.0\nDf Residuals: 0 BIC: -200.7\nDf Model: 2 \nCovariance Type: nonrobust \n==============================================================================\n coef std err t P>|t| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -0.7857 inf -0 nan nan nan\na 0.8857 inf 0 nan nan nan\nb[T.B] -0.5857 inf -0 nan nan nan\nb[T.C] 1.1286 inf 0 nan nan nan\n==============================================================================\nOmnibus: nan Durbin-Watson: 0.820\nProb(Omnibus): nan Jarque-Bera (JB): 0.476\nSkew: -0.624 Prob(JB): 0.788\nKurtosis: 1.500 Cond. No. 6.94\n==============================================================================\n\nNotes:\n[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n[2] The input rank is higher than the number of observations.\n
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 3 samples were given.\n warn(\"omni_normtest is not valid with less than 8 observations; %i \"\n/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: divide by zero encountered in divide\n return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)\n/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: invalid value encountered in scalar multiply\n return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)\n/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1687: RuntimeWarning: divide by zero encountered in scalar divide\n return np.dot(wresid, wresid) / self.df_resid\nIn\u00a0[2]: Copied!
from typing import Iterable, List, Optional\n\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn.pipeline import Pipeline\n\nfrom formulaic import Formula, FormulaSpec, ModelSpec\n\n\nclass FormulaicTransformer(TransformerMixin, BaseEstimator):\n def __init__(self, formula: FormulaSpec):\n self.formula: Formula = Formula.from_spec(formula)\n self.model_spec: Optional[ModelSpec] = None\n if self.formula._has_structure:\n raise ValueError(\n f\"Formula specification {repr(formula)} results in a structured formula, which is not supported.\"\n )\n\n def fit(self, X, y=None):\n \"\"\"\n Generate the initial model spec by which subsequent X's will be\n transformed.\n \"\"\"\n self.model_spec = self.formula.get_model_matrix(X).model_spec\n return self\n\n def transform(self, X, y=None):\n \"\"\"\n Transform `X` by generating a model matrix from it based on the fit\n model spec.\n \"\"\"\n if self.model_spec is None:\n raise RuntimeError(\n \"`FormulaicTransformer.fit()` must be called before `.transform()`.\"\n )\n X_ = self.model_spec.get_model_matrix(X)\n return X_\n\n def get_feature_names_out(\n self, input_features: Optional[Iterable[str]] = None\n ) -> List[str]:\n \"\"\"\n Expose model spec column names to scikit learn to allow column transforms later in the pipeline.\n \"\"\"\n if self.model_spec is None:\n raise RuntimeError(\n \"`FormulaicTransformer.fit()` must be called before columns can be assigned names.\"\n )\n return self.model_spec.column_names\n\n\npipe = Pipeline(\n [(\"formula\", FormulaicTransformer(\"x1 + x2 + x3\")), (\"model\", LinearRegression())]\n)\npipe_fit = pipe.fit(\n pandas.DataFrame({\"x1\": [1, 2, 3], \"x2\": [2, 3.4, 6], \"x3\": [7, 3, 1]}),\n y=pandas.Series([1, 3, 5]),\n)\npipe_fit\n# Note: You could optionally serialize `pipe_fit` here.\n# Then: Use the pipe to predict outcomes for new data.\nfrom typing import Iterable, List, Optional from sklearn.base import BaseEstimator, TransformerMixin from sklearn.linear_model import LinearRegression from sklearn.pipeline import Pipeline from formulaic import Formula, FormulaSpec, ModelSpec class FormulaicTransformer(TransformerMixin, BaseEstimator): def __init__(self, formula: FormulaSpec): self.formula: Formula = Formula.from_spec(formula) self.model_spec: Optional[ModelSpec] = None if self.formula._has_structure: raise ValueError( f\"Formula specification {repr(formula)} results in a structured formula, which is not supported.\" ) def fit(self, X, y=None): \"\"\" Generate the initial model spec by which subsequent X's will be transformed. \"\"\" self.model_spec = self.formula.get_model_matrix(X).model_spec return self def transform(self, X, y=None): \"\"\" Transform `X` by generating a model matrix from it based on the fit model spec. \"\"\" if self.model_spec is None: raise RuntimeError( \"`FormulaicTransformer.fit()` must be called before `.transform()`.\" ) X_ = self.model_spec.get_model_matrix(X) return X_ def get_feature_names_out( self, input_features: Optional[Iterable[str]] = None ) -> List[str]: \"\"\" Expose model spec column names to scikit learn to allow column transforms later in the pipeline. 
\"\"\" if self.model_spec is None: raise RuntimeError( \"`FormulaicTransformer.fit()` must be called before columns can be assigned names.\" ) return self.model_spec.column_names pipe = Pipeline( [(\"formula\", FormulaicTransformer(\"x1 + x2 + x3\")), (\"model\", LinearRegression())] ) pipe_fit = pipe.fit( pandas.DataFrame({\"x1\": [1, 2, 3], \"x2\": [2, 3.4, 6], \"x3\": [7, 3, 1]}), y=pandas.Series([1, 3, 5]), ) pipe_fit # Note: You could optionally serialize `pipe_fit` here. # Then: Use the pipe to predict outcomes for new data. Out[2]:
Pipeline(steps=[('formula', FormulaicTransformer(formula=1 + x1 + x2 + x3)),\n ('model', LinearRegression())])"},{"location":"guides/integration/#statsmodels","title":"StatsModels\u00b6","text":""},{"location":"guides/integration/#scikit-learn","title":"Scikit-Learn\u00b6","text":"
scikit-learn is a very popular machine learning toolkit for Python. You can use Formulaic directly, as for statsmodels
, or as a module in scikit-learn pipelines along the lines of:
Sooner or later, you will encounter datasets with null values, and it is important to know how their presence will impact your modeling. Formulaic model matrix materialization procedures allow you to specify how you want nulls to be handled. You can either: drop rows containing null values, ignore the nulls (allowing them to propagate into the model matrix), or raise an exception when they are encountered.
You can specify the desired behaviour by passing an NAAction
enum value (or string value thereof) to the materialization methods (model_matrix
, and *.get_model_matrix()
). Examples of each of these approaches are shown below.
from pandas import Categorical, DataFrame\n\nfrom formulaic import model_matrix\nfrom formulaic.materializers import NAAction\n\ndf = DataFrame(\n {\n \"c\": [1, 2, None, 4, 5],\n \"C\": Categorical(\n [\"a\", \"b\", \"c\", None, \"e\"], categories=[\"a\", \"b\", \"c\", \"d\", \"e\"]\n ),\n }\n)\n\nmodel_matrix(\"c + C\", df, na_action=NAAction.DROP)\n# Equivalent to:\n# * model_matrix(\"c + C\", df)\n# * model_matrix(\"c + C\", df, na_action=\"drop\")\nfrom pandas import Categorical, DataFrame from formulaic import model_matrix from formulaic.materializers import NAAction df = DataFrame( { \"c\": [1, 2, None, 4, 5], \"C\": Categorical( [\"a\", \"b\", \"c\", None, \"e\"], categories=[\"a\", \"b\", \"c\", \"d\", \"e\"] ), } ) model_matrix(\"c + C\", df, na_action=NAAction.DROP) # Equivalent to: # * model_matrix(\"c + C\", df) # * model_matrix(\"c + C\", df, na_action=\"drop\") Out[36]: Intercept c C[T.b] C[T.c] C[T.d] C[T.e] 0 1.0 1.0 0 0 0 0 1 1.0 2.0 1 0 0 0 4 1.0 5.0 0 0 0 1
You can also specify additional rows to drop using the drop_rows
argument:
model_matrix(\"c + C\", df, drop_rows={0, 4})\nmodel_matrix(\"c + C\", df, drop_rows={0, 4}) Out[24]: Intercept c C[T.b] C[T.c] C[T.d] C[T.e] 1 1.0 2.0 1 0 0 0
Note that the set passed to drop_rows
is expected to be mutable, as it will also be updated with the indices of rows dropped automatically, which can be useful if you need to keep track of this information outside of the materialization procedure.
drop_rows = {0, 4}\nmodel_matrix(\"c + C\", df, drop_rows=drop_rows)\ndrop_rows\ndrop_rows = {0, 4} model_matrix(\"c + C\", df, drop_rows=drop_rows) drop_rows Out[25]:
{0, np.int64(2), np.int64(3), 4}In\u00a0[31]: Copied!
model_matrix(\"c + C\", df, na_action=\"ignore\")\nmodel_matrix(\"c + C\", df, na_action=\"ignore\") Out[31]: Intercept c C[T.b] C[T.c] C[T.d] C[T.e] 0 1.0 1.0 0 0 0 0 1 1.0 2.0 1 0 0 0 2 1.0 NaN 0 1 0 0 3 1.0 4.0 0 0 0 0 4 1.0 5.0 0 0 0 1
Note the NaN
in the c
column, and that NaN
does NOT appear in the dummy coding of C on row 3, consistent with standard implementations of dummy coding. This could result in misleading model estimates, so care should be taken.
You can combine this with drop_rows
, as described above, to manually filter out the null values you are concerned about.
try:\n model_matrix(\"c + C\", df, na_action=\"raise\")\nexcept Exception as e:\n print(e)\ntry: model_matrix(\"c + C\", df, na_action=\"raise\") except Exception as e: print(e)
Error encountered while checking for nulls in `C`: `C` contains null values after evaluation.\n
As with ignoring nulls above, you can combine this raising behaviour with drop_rows
to manually filter out the null values that you feel you can safely ignore, and then raise if any additional null values make it into your data.
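For example, a sketch reusing the df from above (whose nulls sit at row indices 2 and 3):

```python
from pandas import Categorical, DataFrame
from formulaic import model_matrix

df = DataFrame(
    {
        "c": [1, 2, None, 4, 5],
        "C": Categorical(["a", "b", "c", None, "e"], categories=["a", "b", "c", "d", "e"]),
    }
)

# Drop the rows whose nulls we already know about, and let `raise` guard
# against any further (unexpected) null values:
drop_rows = {2, 3}
model_matrix("c + C", df, na_action="raise", drop_rows=drop_rows)
```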
NAAction.DROP
, or \"drop\"
)\u00b6","text":"This the default behaviour, and will result in any row with a null in any column that is being used by the materialization being dropped from the resulting dataset. For example:
"},{"location":"guides/missing_data/#ignore-nulls-naactionignore-or-ignore","title":"Ignore nulls (NAAction.IGNORE
, or \"ignore\"
)\u00b6","text":"If your modeling toolkit can handle the presence of nulls, or you otherwise want to keep them in the dataset, you can pass na_action = \"ignore\"
to the materialization methods. This will allow null values to remain in columns, and take no action to prevent the propagation of nulls.
NAAction.RAISE
or \"raise\"
)\u00b6","text":"If you are unwilling to risk the perils of dropping or ignoring null values, you can instead opt to raise an exception whenever a null value is found. This can prevent yourself from accidentally biasing your model, but also makes your code more brittle. For example:
"},{"location":"guides/model_specs/","title":"Model Specs","text":"While Formula
instances (discussed in How it works) are the source of truth for abstract user intent, ModelSpec
instances are the source of truth for the materialization process; and bundle a Formula
instance with explicit metadata about the encoding choices that were made (or should be made) when a formula was (or will be) materialized. As soon as materialization begins, Formula
instances are upgraded into ModelSpec
instances, and any missing metadata is attached as decisions are made during the materialization process.
Besides acting as runtime state during materialization, it serves two main purposes: it allows encoding choices to be reused on new data, and it surfaces metadata about the materialized matrices. Once a Formula
has been materialized, you can use the generated ModelSpec
instance to repeat the process on similar datasets, being confident that the encoding choices will be identical. This is especially useful during out-of-sample prediction, where you need to prepare the out-of-sample data in exactly the same way as the training data for the predictions to be valid. In the remainder of this portion of the documentation, we will introduce how to leverage the metadata stored inside ModelSpec
instances derived from materializations, and for more advanced programmatic use-cases, how to manually build a ModelSpec
.
# Let's get ourselves a simple `ModelMatrix` instance to play with.\nfrom pandas import DataFrame\n\nfrom formulaic import model_matrix\n\nmm = model_matrix(\"center(a) + b\", DataFrame({\"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]}))\nmm\n# Let's get ourselves a simple `ModelMatrix` instance to play with. from pandas import DataFrame from formulaic import model_matrix mm = model_matrix(\"center(a) + b\", DataFrame({\"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]})) mm Out[1]: Intercept center(a) b[T.B] b[T.C] 0 1.0 -1.0 0 0 1 1.0 0.0 1 0 2 1.0 1.0 0 1 In\u00a0[2]: Copied!
# And extract the model spec from it\nms = mm.model_spec\nms\n# And extract the model spec from it ms = mm.model_spec ms Out[2]:
ModelSpec(formula=1 + center(a) + b, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=center(a), scoped_terms=[center(a)], columns=['center(a)']), EncodedTermStructure(term=b, scoped_terms=[b-], columns=['b[T.B]', 'b[T.C]'])], transform_state={'center(a)': {'ddof': 1, 'center': np.float64(2.0), 'scale': None}}, encoder_state={'center(a)': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C'], 'contrasts': ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])})})In\u00a0[3]: Copied!
# We can now interrogate it for various column, factor, term, and variable related metadata\n{\n \"column_names\": ms.column_names,\n \"column_indices\": ms.column_indices,\n \"terms\": ms.terms,\n \"term_indices\": ms.term_indices,\n \"term_slices\": ms.term_slices,\n \"term_factors\": ms.term_factors,\n \"term_variables\": ms.term_variables,\n \"factors\": ms.factors,\n \"factor_terms\": ms.factor_terms,\n \"factor_variables\": ms.factor_variables,\n \"factor_contrasts\": ms.factor_contrasts,\n \"variables\": ms.variables,\n \"variable_terms\": ms.variable_terms,\n \"variable_indices\": ms.variable_indices,\n \"variables_by_source\": ms.variables_by_source,\n}\n# We can now interrogate it for various column, factor, term, and variable related metadata { \"column_names\": ms.column_names, \"column_indices\": ms.column_indices, \"terms\": ms.terms, \"term_indices\": ms.term_indices, \"term_slices\": ms.term_slices, \"term_factors\": ms.term_factors, \"term_variables\": ms.term_variables, \"factors\": ms.factors, \"factor_terms\": ms.factor_terms, \"factor_variables\": ms.factor_variables, \"factor_contrasts\": ms.factor_contrasts, \"variables\": ms.variables, \"variable_terms\": ms.variable_terms, \"variable_indices\": ms.variable_indices, \"variables_by_source\": ms.variables_by_source, } Out[3]:
{'column_names': ('Intercept', 'center(a)', 'b[T.B]', 'b[T.C]'),\n 'column_indices': {'Intercept': 0, 'center(a)': 1, 'b[T.B]': 2, 'b[T.C]': 3},\n 'terms': [1, center(a), b],\n 'term_indices': {1: [0], center(a): [1], b: [2, 3]},\n 'term_slices': {1: slice(0, 1, None),\n center(a): slice(1, 2, None),\n b: slice(2, 4, None)},\n 'term_factors': {1: {1}, center(a): {center(a)}, b: {b}},\n 'term_variables': {1: set(), center(a): {'a', 'center'}, b: {'b'}},\n 'factors': {1, b, center(a)},\n 'factor_terms': {1: {1}, center(a): {center(a)}, b: {b}},\n 'factor_variables': {b: {'b'}, 1: set(), center(a): {'a', 'center'}},\n 'factor_contrasts': {b: ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])},\n 'variables': {'a', 'b', 'center'},\n 'variable_terms': {'center': {center(a)}, 'a': {center(a)}, 'b': {b}},\n 'variable_indices': {'center': [1], 'a': [1], 'b': [2, 3]},\n 'variables_by_source': {'transforms': {'center'}, 'data': {'a', 'b'}}}In\u00a0[4]: Copied!
# And use it to select out various parts of the model matrix; here the columns\n# produced by the `b` term.\nmm.iloc[:, ms.term_indices[\"b\"]]\n# And use it to select out various parts of the model matrix; here the columns # produced by the `b` term. mm.iloc[:, ms.term_indices[\"b\"]] Out[4]: b[T.B] b[T.C] 0 0 0 1 1 0 2 0 1
Some of this metadata may seem redundant at first, but it is essential when the generated model matrix does not natively support indexing by names; for example:
In\u00a0[5]: Copied!mm_numpy = model_matrix(\n \"center(a) + b\", DataFrame({\"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]}), output=\"numpy\"\n)\nmm_numpy\nmm_numpy = model_matrix( \"center(a) + b\", DataFrame({\"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]}), output=\"numpy\" ) mm_numpy Out[5]:
array([[ 1., -1., 0., 0.],\n [ 1., 0., 1., 0.],\n [ 1., 1., 0., 1.]])In\u00a0[6]: Copied!
ms_numpy = mm_numpy.model_spec\nmm_numpy[:, ms_numpy.term_indices[\"b\"]]\nms_numpy = mm_numpy.model_spec mm_numpy[:, ms_numpy.term_indices[\"b\"]] Out[6]:
array([[0., 0.],\n [1., 0.],\n [0., 1.]])In\u00a0[7]: Copied!
ms.get_model_matrix(DataFrame({\"a\": [4, 5, 6], \"b\": [\"A\", \"B\", \"D\"]}))\nms.get_model_matrix(DataFrame({\"a\": [4, 5, 6], \"b\": [\"A\", \"B\", \"D\"]}))
/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:169: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {'D'}. They are being cast to nan, which will likely skew the results of your analyses.\n warnings.warn(\nOut[7]: Intercept center(a) b[T.B] b[T.C] 0 1.0 2.0 0 0 1 1.0 3.0 1 0 2 1.0 4.0 0 0
Notice that when the assumptions of the stateful transforms are violated, warnings and/or exceptions will be generated.
You can also pass the ModelSpec directly to model_matrix, for example:
model_matrix(ms, data=DataFrame({"a": [4, 5, 6], "b": ["A", "A", "A"]}))

Out[8]:
   Intercept  center(a)  b[T.B]  b[T.C]
0        1.0        2.0       0       0
1        1.0        3.0       0       0
2        1.0        4.0       0       0

from formulaic import ModelSpec

ms = ModelSpec("a+b+c", output="numpy", ensure_full_rank=False)
ms

Out[9]:
ModelSpec(formula=1 + a + b + c, materializer=None, materializer_params=None, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})

import pandas

mm = ms.get_model_matrix(
    pandas.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
)
mm

Out[10]:
array([[1., 1., 4., 7.],
       [1., 2., 5., 8.],
       [1., 3., 6., 9.]])

mm.model_spec

Out[11]:
ModelSpec(formula=1 + a + b + c, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=c, scoped_terms=[c], columns=['c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})
Notice that any fields not provided by the user are imputed automatically.
from formulaic import Formula, ModelSpecs

ModelSpecs(
    ModelSpec("a"), substructure=ModelSpec("b"), another_substructure=ModelSpec("c")
)

Out[12]:
root:
    ModelSpec(formula=1 + a, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.substructure:
    ModelSpec(formula=1 + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.another_substructure:
    ModelSpec(formula=1 + c, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})

ModelSpec.from_spec(Formula(lhs="y", rhs="a + b"))

Out[13]:
.lhs:
    ModelSpec(formula=y, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.rhs:
    ModelSpec(formula=a + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
Some operations, such as ModelSpec.subset(...), are also accessible in a mapped way (e.g. via ModelSpecs.subset(...)). You can find documentation for the complete set of available methods using help(ModelSpecs).
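For example, a minimal sketch of subsetting every spec in a structured collection at once (the formula passed to subset here is illustrative only):

from formulaic import ModelSpec, ModelSpecs

specs = ModelSpecs(ModelSpec("a + b"), substructure=ModelSpec("a + c"))

# Mapped operation: applies ModelSpec.subset to each spec in the structure,
# returning a similarly structured collection of subsetted specs.
subsetted = specs.subset("a")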
ModelSpec and ModelSpecs instances have been designed to support serialization via the standard pickling process offered by Python. This allows model specs to be persisted into storage and reloaded at a later time, or used in multiprocessing scenarios.
Serialized model specs are not guaranteed to work between different versions of Formulaic. While they will work in the vast majority of cases, the internal state of transforms is free to change from version to version, and may invalidate previously serialized model specs. Efforts will be made to reduce the likelihood of this, and when it happens it should be indicated in the changelogs.
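For example, a minimal sketch of persisting a populated model spec with the standard library's pickle module:

import pickle

# Persist the model spec generated during training...
with open("model_spec.pkl", "wb") as f:
    pickle.dump(mm.model_spec, f)

# ...and later reload it to materialize new data in exactly the same way.
with open("model_spec.pkl", "rb") as f:
    ms_restored = pickle.load(f)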
"},{"location":"guides/model_specs/#anatomy-of-a-modelspec-instance","title":"Anatomy of aModelSpec
instance.\u00b6","text":"As noted above, a ModelSpec
is the complete specification and record of the materialization process, combining all user-specified parameters with the runtime state of the materializer. In particular, ModelSpec
instances have the following explicitly specifiable attributes:
Often, only formula
is explicitly specified, and the rest is inferred on the user's behalf.
ModelSpec instances also have derived properties and methods that you can use to introspect the structure of generated model matrices. These derived methods assume that the ModelSpec has been fully populated, and thus usually only make sense to consider on ModelSpec instances that are attached to a ModelMatrix. They are:
column_names: The names of the columns generated by this model spec; these are the same names as the keys of .column_indices.
column_indices: An ordered mapping of column names to the indices of the generated columns.
terms: The Term instances that were used to generate this model matrix.
term_indices: An ordered mapping of Term instances to the generated column indices; terms can also be looked up using formulae.
term_slices: An ordered mapping of Term instances to a slice that when used on the columns of the model matrix will subsample the model matrix down to those corresponding to each term.
term_factors: An ordered mapping of Term instances to the set of factors used by that term.
term_variables: An ordered mapping of Term instances to Variable instances (a string subclass with additional attributes of roles and source), indicating the variables used by that term.
factors: The set of Factor instances used in the entire formula.
factor_terms: A mapping of Factor instances to the Term instances that used them.
factor_variables: A mapping of Factor instances to Variable instances, corresponding to the variables used by that factor.
factor_contrasts: A mapping of Factor instances to ContrastsState instances that can be used to reproduce the coding matrices used during materialization.
variables: The set of Variable instances describing the variables used in the entire formula.
variable_terms: The reverse mapping of .term_variables.
variable_indices: A mapping of each Variable instance to the indices of the columns in the model matrix that use that variable.
variables_by_source: A mapping of variable source ("data", "context", or "transforms") to the variables derived from that source.
get_slice(...): Build a slice instance that can be used to subset a matrix down to the columns associated with a Term instance, its string representation, a column name, or pre-specified ints/slices.
update(...): Create a copy of this ModelSpec instance with the nominated attributes mutated.
subset(...): Create a copy of this ModelSpec instance with its structure subset to correspond to the strict subset of terms indicated by a formula specification.
We'll cover some of these attributes and methods in examples below, but you can always refer to help(ModelSpec) for more details.
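For instance, get_slice can stand in for a manual term_indices lookup; a minimal sketch reusing the ms and mm pair from the metadata example above (get_slice accepts a term's string representation, per its description):

b_slice = ms.get_slice("b")  # slice covering the columns of the `b` term
mm.iloc[:, b_slice]          # equivalent to mm.iloc[:, ms.term_indices["b"]]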
ModelSpec
as metadata","text":"One of the most common use-cases for ModelSpec instances is as metadata describing a generated model matrix. This metadata can be used to programmatically access the appropriate features of the model matrix, for example to assign sensible names to the coefficients fit during a regression.
Another common use-case for ModelSpec instances is replaying the same materialization process used to prepare a training dataset on a new dataset. Since the ModelSpec instance stores all relevant choices made during materialization, achieving this is as simple as using the ModelSpec to generate the new model matrix.
By way of example, recall from the section above that we used the formula center(a) + b, where a was a numerical vector and b was a categorical vector. When generating model matrices for subsequent datasets it is very important to use the same centering as was used during the initial model matrix generation, and not just center the incoming data again. Likewise, b should be aware of which categories were present during the initial training, and ensure that the same columns are created during subsequent materializations (otherwise the model matrices will not be of the same form, and cannot be used for predictions, etc.). These kinds of transforms that require memory are called \"stateful transforms\" in Formulaic, and are described in more detail in the Transforms documentation.
We can see this in action below:
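A minimal sketch of this replay pattern (mirroring the example cells shown elsewhere on this page):

import pandas
from formulaic import model_matrix

train = pandas.DataFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]})
mm_train = model_matrix("center(a) + b", train)

# Replaying the spec on new data reuses the remembered mean of `a` and the
# remembered categories of `b`, rather than recomputing them from `new`.
new = pandas.DataFrame({"a": [4, 5, 6], "b": ["A", "B", "C"]})
mm_new = mm_train.model_spec.get_model_matrix(new)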
"},{"location":"guides/model_specs/#directly-constructing-modelspec-instances","title":"Directly constructingModelSpec
instances\u00b6","text":"It is possible to directly construct Model Matrices, and to prepopulate them with various choices (e.g. output types, materializer, etc). You could even, in principle, populate them with state information (but this is not recommended; it is easy to make mistakes here, and is likely better to encode these choices into the formula itself where possible). For example:
"},{"location":"guides/model_specs/#structured-modelspecs","title":"StructuredModelSpecs
\u00b6","text":"As discussed in How it works, formulae can be arbitrarily structured, resulting in a similarly structured set of model matrices. ModelSpec
instances can also be arranged into a structured collection using ModelSpecs
, allowing different choices to be made at different levels of the structure. You can either create these structures yourself, or inherit the structure from a formula. For example:
This document provides high-level documentation on how to get started using Formulaic.
import pandas

from formulaic import model_matrix

df = pandas.DataFrame(
    {
        "y": [0, 1, 2],
        "a": ["A", "B", "C"],
        "b": [0.3, 0.1, 0.2],
    }
)

y, X = model_matrix("y ~ a + b + a:b", df)
# This is short-hand for:
# y, X = formulaic.Formula('y ~ a + b + a:b').get_model_matrix(df)

y

Out[2]:
   y
0  0
1  1
2  2

X

Out[3]:
   Intercept  a[T.B]  a[T.C]    b  a[T.B]:b  a[T.C]:b
0        1.0       0       0  0.3       0.0       0.0
1        1.0       1       0  0.1       0.1       0.0
2        1.0       0       1  0.2       0.0       0.2
You will notice that the categorical values for a have been one-hot (aka dummy) encoded, and to ensure structural full-rankness of X[^1], one level has been dropped from a. For more details about how this guarantees that the matrix is full-rank, please refer to the excellent patsy documentation. If you are not using the model matrices for regression, and don't care if the matrix is not full-rank, you can pass ensure_full_rank=False:
X = model_matrix(\"a + b + a:b\", df, ensure_full_rank=False)\nX\nX = model_matrix(\"a + b + a:b\", df, ensure_full_rank=False) X Out[4]: Intercept a[T.A] a[T.B] a[T.C] b a[T.A]:b a[T.B]:b a[T.C]:b 0 1.0 1 0 0 0.3 0.3 0.0 0.0 1 1.0 0 1 0 0.1 0.0 0.1 0.0 2 1.0 0 0 1 0.2 0.0 0.0 0.2
Note that the dropped level in a has been restored.
There is a rich trove of information about the columns and structure of the model matrix stored in the ModelSpec instance attached to the model matrix, for example:
X.model_spec

Out[5]:
ModelSpec(formula=1 + a + b + a:b, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a[T.A]', 'a[T.B]', 'a[T.C]']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=a:b, scoped_terms=[a:b], columns=['a[T.A]:b', 'a[T.B]:b', 'a[T.C]:b'])], transform_state={}, encoder_state={'a': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C']}), 'b': (<Kind.NUMERICAL: 'numerical'>, {})})
You can read more about the model specs in the Model Specs documentation.
In\u00a0[6]: Copied!X = model_matrix(\"a + b + a:b\", df, output=\"sparse\")\nX\nX = model_matrix(\"a + b + a:b\", df, output=\"sparse\") X Out[6]:
<3x6 sparse matrix of type '<class 'numpy.float64'>'\n\twith 10 stored elements in Compressed Sparse Column format>
In this example, X is a $ 3 \\times 6 $ scipy.sparse.csc_matrix instance.
Since sparse matrices do not have labels for columns, you can look these up from the model spec described above; for example:
In\u00a0[7]: Copied!X.model_spec.column_names\nX.model_spec.column_names Out[7]:
('Intercept', 'a[T.B]', 'a[T.C]', 'b', 'a[T.B]:b', 'a[T.C]:b')
[^1]: X must be full-rank in order for the regression algorithm to invert a matrix derived from X.
In formulaic, the simplest way to build your model matrices is to use the high-level model_matrix function:
By default, the generated model matrices are dense. In some cases, particularly in large datasets with many categorical features, dense model matrices become hugely memory inefficient (since most entries of the data will be zero). Formulaic allows you to directly generate sparse model matrices using:
"},{"location":"guides/splines/","title":"Spline Encoding","text":"Formulaic offers several spline encoding transforms that allow you to model non-linear responses to continuous variables using linear models. They are:
poly: projection onto orthogonal polynomial basis functions.
basis_spline (bs in formulae): projection onto a basis spline (B-spline) basis.
cubic_spline (cr/cs and cc in formulae): projection onto natural or cyclic cubic spline bases.
These are all implemented as stateful transforms and described in more detail below.
import matplotlib.pyplot as plt
import numpy
import pandas
from statsmodels.api import OLS

from formulaic import model_matrix

# Build some data, and hard-code "y" as a quartic function with some Gaussian noise.
data = pandas.DataFrame(
    {
        "x": numpy.linspace(0.0, 1.0, 100),
    }
).assign(
    y=lambda df: df.x
    + 0.2 * df.x**2
    - 0.7 * df.x**3
    + 3 * df.x**4
    + 0.1 * numpy.random.randn(100)
)

# Generate a model matrix with a polynomial coding in "x".
y, X = model_matrix("y ~ poly(x, degree=3)", data, output="numpy")

# Fit coefficients for the intercept and polynomial basis.
coeffs = OLS(y[:, 0], X).fit().params

# Plot each basis function weighted by its fitted coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit polynomial")

plt.legend();
# Generate a model matrix with a basis spline in "x".
y, X = model_matrix("y ~ bs(x, df=4, degree=3)", data, output="numpy")

# Fit coefficients for the intercept and B-spline basis.
coeffs = OLS(y[:, 0], X).fit().params

# Plot each B-spline basis function weighted by its fitted coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit spline itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit spline")

plt.legend();
x = numpy.linspace(0.0, 2 * numpy.pi, 100)

data = pandas.DataFrame(
    {
        "x": x,
    }
).assign(
    y=lambda df: 2
    + numpy.sin(x)
    - x * numpy.sin(2 * x)
    + 4 * numpy.sin(x / 7)
    + 0.1 * numpy.random.randn(100)
)

# Generate a model matrix with a cyclic cubic spline coding in "x".
y, X = model_matrix("y ~ 1 + cc(x, df=4, constraints='center')", data, output="numpy")

# Fit coefficients for the intercept and cyclic cubic spline basis.
coeffs = OLS(y[:, 0], X).fit().params

# Plot each basis function weighted by its fitted coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit spline itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit spline")

plt.legend();
"},{"location":"guides/splines/#poly","title":"
poly
\u00b6","text":"The simplest way to generate a non-linear response in a linear model is to include higher order powers of numerical variables. For example, you might want to include: $x$, $x^2$, $x^3$, $\\ldots$, $x^n$. However, these features are not orthogonal, and so adding each term one by one in a regression model will lead to all previously trained coefficients changing. Especially in exploratory analysis, this can be frustrating, and that's where poly
comes in. By default, poly
iteratively builds orthogonal polynomial features up to the order specified.
poly
has two parameters:
For those who are mathematically inclined, this transform is an implementation of the \"three-term recurrence relation\" for monic orthogonal polynomials. There are many good introductions to these recurrence relations, including (at the time of writing): https://dec41.user.srcf.net/h/IB_L/numerical_analysis/2_3. Another common approach is QR factorisation, where the columns of Q are the orthogonal basis vectors. A pre-existing implementation of this can be found in numpy; however, our implementation outperforms numpy's QR decomposition, and does not require needless computation of the R matrix. It should also be noted that orthogonal polynomial bases are unique up to the choice of inner product and scaling, and so all methods will result in the same set of polynomials.
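This orthogonality is easy to check empirically; a minimal sketch (the - 1 simply drops the intercept so that only the polynomial columns remain; the exact normalisation is implementation-defined):

import numpy
import pandas
from formulaic import model_matrix

P = model_matrix(
    "poly(x, degree=3) - 1",
    pandas.DataFrame({"x": numpy.linspace(0.0, 1.0, 100)}),
    output="numpy",
)

# For an orthogonal basis, the off-diagonal entries of P^T P are ~zero.
print(numpy.round(numpy.asarray(P).T @ numpy.asarray(P), 6))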
Note
When used as a stateful transform, we retain the coefficients that uniquely define the polynomials; and so new data will be evaluated against the same polynomial bases as the original dataset. However, the polynomial basis will almost certainly *not* be orthogonal for the new data. This is because changing the incoming dataset is equivalent to changing your choice of inner product.
Example:
"},{"location":"guides/splines/#basis_spline-or-bs","title":"basis_spline
(or bs)","text":"If you were to attempt to fit a complex function over a large domain using poly, it is highly likely that you would need to use a very large degree for the polynomial basis. However, this can lead to overfitting and/or high-frequency oscillations (see Runge's phenomenon). An alternative approach is to use piece-wise polynomial curves of lower degree, with smoothness conditions on the \"knots\" between each of the polynomial pieces. This limits overfitting while still offering the flexibility required to model very complex non-linearities.
Basis splines (or B-splines) are a popular choice for generating a basis for such polynomials, with many attractive features such as maximal smoothness around each of the knots, and minimal support given such smoothness.
Formulaic has its own implementation of basis_spline that is API compatible (where features overlap) with R, and is more performant than existing Python implementations for our use-cases (such as splev from scipy). For compatibility with R and patsy, basis_spline is available as bs in formulae.
basis_spline (or bs) has eight parameters:
x: The vector to be encoded.
df: The number of degrees of freedom to use for this spline. If specified, knots will be automatically generated such that they are df - degree (minus one if include_intercept is True) equally spaced quantiles. You cannot specify both df and knots.
knots: The internal breakpoints of the B-spline; if not specified, they default to the empty set (unless df is specified), in which case the ordinary polynomial (Bezier) basis is generated.
degree: The degree of the piecewise polynomial components.
include_intercept: Whether to return a complete (full-rank) basis. Note that if ensure_full_rank=True is passed to the materializer, then the intercept will (depending on context) nevertheless be omitted.
lower_bound: The lower bound of the domain of the B-spline; if not specified, it is determined from x.
upper_bound: The upper bound of the domain of the B-spline; if not specified, it is determined from x.
extrapolation: Selects how extrapolation should be performed when values in x extend beyond the lower and upper bounds. Valid values are:
'raise' (the default): Raises a ValueError if there are any values in x outside the B-Spline domain.
'clip': Any values above/below the domain are set to the upper/lower bounds.
'na': Any values outside of bounds are set to numpy.nan.
'zero': Any values outside of bounds are set to 0.
'extend': Any values outside of bounds are computed by extending the polynomials of the B-Spline (this is the same as the default in R).
The algorithm used to generate the basis splines is a slightly generalised version of the \"Cox-de Boor\" algorithm, extended by this author to allow for extrapolations (although this author doubts this is terribly novel). If you would like to learn more about B-Splines, the primer put together by Jeffrey Racine is an excellent resource.
As a stateful transform, we only keep track of knots, lower_bound, and upper_bound, which are sufficient given that all other information must be explicitly specified.
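For instance, a minimal sketch of reusing a fitted spec on out-of-domain data (choosing extrapolation='clip' up front so that new values outside the training domain are clipped rather than raising):

import numpy
import pandas
from formulaic import model_matrix

train = pandas.DataFrame({"x": numpy.linspace(0.0, 1.0, 100)})
mm = model_matrix("bs(x, df=4, extrapolation='clip')", train)

# The knots and bounds remembered from `train` are reapplied to `new`;
# -0.5 and 1.5 are clipped to the [0, 1] training domain.
new = pandas.DataFrame({"x": [-0.5, 0.5, 1.5]})
mm_new = mm.model_spec.get_model_matrix(new)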
Example, reusing the data and imports from above:
"},{"location":"guides/splines/#cubic_spline-crcs-and-cc","title":"cubic_spline
(cr/cs and cc)","text":"While the basis_spline transform above is capable of generating cubic splines, it is sometimes helpful to be able to generate cubic splines that satisfy various additional constraints (including direct constraints on the parameters of the spline, or indirect ones such as cyclicity). To that end, Formulaic implements direct support for generating natural and cyclic cubic splines with constraints (via the cr/cs and cc transforms respectively), borrowing much of the implementation from patsy. These splines are compatible with R's mgcv, and share the nice features of the basis splines above, including continuous first and second derivatives, and general applicability to interpolation/smoothing. Note that cr and cs generate identical splines, but are both included for compatibility with R.
All of cr
, cs
, and cc
are configurations of the cubic_spline
transform, and have seven parameters:
cc
or cr
. If using an existing model_spec
with a fitted transformation, the x
values are only used to produce the locations for the fitted values of the spline.knots
will be automatically generated such that they are df
- degree
equally spaced quantiles. You cannot specify both df
and knots
.df
is specified).x
.x
.np.dot(constraints, betas)
is zero, where betas
denotes the array of initial parameters, corresponding to the initial unconstrained model matrix), or the string 'center'
indicating that we should apply a centering constraint (this constraint will be computed from the input data, remembered and re-used for prediction from the fitted model). The constraints are absorbed in the resulting design matrix which means that the model is actually rewritten in terms of unconstrained parameters.x
extend beyond the lower and upper bounds. Valid values are:'raise'
(the default): Raises a ValueError
if there are any values in x
outside the spline domain.'clip'
: Any values above/below the domain are set to the upper/lower bounds.'na'
: Any values outside of bounds are set to numpy.nan
.'zero'
: Any values outside of bounds are set to 0
.'extend'
: Any values outside of bounds are computed by extending the polynomials of the spline (this is the same as the default in R).As a stateful transform, we only keep track of knots
, lower_bound
, upper_bound
, constraints
, and cyclic
which are sufficient given that all other information must be explicitly specified.
Example, reusing the data and imports from above:
"},{"location":"guides/transforms/","title":"Transforms","text":"A transform in Formulaic is any function that is called to modify factor values during the evaluation of a Factor
(see the How it works documentation). Any function can be used as a transform, so long as it is present in the evaluation context (see below).
There are two types of transform:
numpy.cumsum
function to any vector being fed into the model matrix materialization procedure.In the below we describe how to make a function available for use as a transform during materialization, demonstrate this for regular transforms, and then introduce how to use already implemented stateful transforms and/or write your own.
In\u00a0[1]: Copied!import pandas\n\nfrom formulaic import Formula, model_matrix\n\n\ndef my_transform(col: pandas.Series) -> pandas.Series:\n return col**2\nimport pandas from formulaic import Formula, model_matrix def my_transform(col: pandas.Series) -> pandas.Series: return col**2 In\u00a0[2]: Copied!
# Local context is automatically added\nmodel_matrix(\"a + my_transform(a)\", pandas.DataFrame({\"a\": [1, 2, 3]}))\n# Local context is automatically added model_matrix(\"a + my_transform(a)\", pandas.DataFrame({\"a\": [1, 2, 3]})) Out[2]: Intercept a my_transform(a) 0 1.0 1 1 1 1.0 2 4 2 1.0 3 9 In\u00a0[3]: Copied!
# Manually add `my_transform` to the context\nFormula(\"a + my_transform(a)\").get_model_matrix(\n pandas.DataFrame({\"a\": [1, 2, 3]}),\n context={\"my_transform\": my_transform}, # could also use: context=locals()\n)\n# Manually add `my_transform` to the context Formula(\"a + my_transform(a)\").get_model_matrix( pandas.DataFrame({\"a\": [1, 2, 3]}), context={\"my_transform\": my_transform}, # could also use: context=locals() ) Out[3]: Intercept a my_transform(a) 0 1.0 1 1 1 1.0 2 4 2 1.0 3 9 In\u00a0[4]: Copied!
from formulaic.transforms import center, scale\n\nscale(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8]))\nfrom formulaic.transforms import center, scale scale(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8])) Out[4]:
array([-1.42886902, -1.02062073, -0.61237244, -0.20412415, 0.20412415,\n 0.61237244, 1.02062073, 1.42886902])In\u00a0[5]: Copied!
center(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8]))\ncenter(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8])) Out[5]:
array([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5])In\u00a0[6]: Copied!
import numpy\n\nfrom formulaic.transforms import stateful_transform\n\n\n@stateful_transform\ndef center(data, _state=None, _metadata=None, _spec=None):\n print(\"state\", _state)\n print(\"metadata\", _metadata)\n print(\"spec\", _spec)\n if \"mean\" not in _state:\n _state[\"mean\"] = numpy.mean(data)\n return data - _state[\"mean\"]\n\n\nstate = {}\ncenter(pandas.Series([1, 2, 3]), _state=state)\nimport numpy from formulaic.transforms import stateful_transform @stateful_transform def center(data, _state=None, _metadata=None, _spec=None): print(\"state\", _state) print(\"metadata\", _metadata) print(\"spec\", _spec) if \"mean\" not in _state: _state[\"mean\"] = numpy.mean(data) return data - _state[\"mean\"] state = {} center(pandas.Series([1, 2, 3]), _state=state)
state {}\nmetadata None\nspec ModelSpec(formula=, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, structure=None, transform_state={}, encoder_state={})\nOut[6]:
0 -1.0\n1 0.0\n2 1.0\ndtype: float64In\u00a0[7]: Copied!
state\nstate Out[7]:
{'mean': 2.0}
The mutated state object is then stored by formulaic automatically into the right context in the appropriate ModelSpec
instance for reuse as necessary.
If you wanted to leverage the single dispatch functionality, you could do something like:
In\u00a0[8]: Copied!from formulaic.transforms import stateful_transform\n\n\n@stateful_transform\ndef center(data, _state=None, _metadata=None, _spec=None):\n raise ValueError(f\"No implementation for data of type {repr(type(data))}\")\n\n\n@center.register(pandas.Series)\ndef _(data, _state=None, _metadata=None, _spec=None):\n if \"mean\" not in _state:\n _state[\"mean\"] = numpy.mean(data)\n return data - _state[\"mean\"]\nfrom formulaic.transforms import stateful_transform @stateful_transform def center(data, _state=None, _metadata=None, _spec=None): raise ValueError(f\"No implementation for data of type {repr(type(data))}\") @center.register(pandas.Series) def _(data, _state=None, _metadata=None, _spec=None): if \"mean\" not in _state: _state[\"mean\"] = numpy.mean(data) return data - _state[\"mean\"]
Note
If taking advantage of the single dispatch functionality, it is important that the top-level function has exactly the same signature as the type specific implementations.
"},{"location":"guides/transforms/#adding-transforms-to-the-evaluation-context","title":"Adding transforms to the evaluation context\u00b6","text":"The only requirement for using a transform in formula is making it available in the execution context. The evaluation context is always pre-seeded with:
numpy
module.numpy.log
.numpy.log10
.numpy.log2
.numpy.exp
.numpy.exp10
.numpy.exp2
.{<expr>}
syntax).The evaluation context can be extended to include arbitrary additional functions. If you are using the top-level model_matrix
function then the local context in which model_matrix
is called is automatically added to the execution context, otherwise you need to manually specify this context. For example:
In Formulaic, a stateful transform is just a regular callable object (typically a function) that has an attribute __is_stateful_transform__
that is set to True
. Such callables will be passed up to three additional arguments by formulaic if they are present in the callable signature:
_state
: The existing state or an empty dictionary that should be mutated to record any additional state._metadata
: An additional metadata dictionary passed on about the factor or None
. Will typically only be present if the Factor
metadata is populated._spec
: The current model spec being evaluated (or an empty ModelSpec
if being called outside of Formulaic's materialization routines).Only _state
is required, _metadata
and _spec
will only be passed in by Formulaic if they are present in the callable signature.
Formulaic comes preloaded with some useful stateful transforms, which are outlined below.
"},{"location":"guides/transforms/#scaling-and-centering","title":"Scaling and Centering\u00b6","text":"There are two provided scaling transforms: scale(...)
and center(...)
.
scale
rescales the data such that it is centered around zero with a standard deviation of 1. The centering and variance standardisation can be independently disabled as necessary. center
is a simple wrapper around scale
that only does the centering. For more details, refer to inline documentation: help(scale)
.
Example usage is shown below:
"},{"location":"guides/transforms/#categorical-encoding","title":"Categorical Encoding\u00b6","text":"Formulaic provides a rich family of categorical stateful transforms. These are perhaps the most commonly used transforms, and are used to encode categorical/factor data into a form suitable for numerical analysis. Use of these transforms is separately documented in the Categorical Encoding section.
"},{"location":"guides/transforms/#spline-encoding","title":"Spline Encoding\u00b6","text":"Spline coding is used to enable non-linear dependence on numerical features in linear models. Formulaic currently provides two spline transforms: bs
for basis splines, and poly
for polynomial splines. These are separately documented in the Spline Encoding section.
You can either implement the above interface directly, or leverage the stateful_transform
decorator provided by Formulaic, which then also updates your function into a single dispatch function, allowing multiple implementations that depend on the currently materialized type. A simple centering example is explored below.