diff --git a/dev/changelog/index.html b/dev/changelog/index.html index 297f353..f06ae2d 100644 --- a/dev/changelog/index.html +++ b/dev/changelog/index.html @@ -72,7 +72,7 @@
For changes since the latest tagged release, please refer to the git commit log.
New features and enhancements:
+Formula.differentiate()
is now considered stable, with
+ ModelMatrix.differentiate()
to follow in a future release.Bugfixes and cleanups:
+Breaking changes:
Formulaic is a high-performance implementation of Wilkinson formulas for Python, which are very useful for transforming dataframes into a form suitable for ingestion into various modelling frameworks (especially linear regression).
It provides:
pandas.DataFrame
pyarrow.Table
pandas.DataFrame
numpy.ndarray
scipy.sparse.CSCMatrix
with more to come!
"},{"location":"changelog/","title":"Changelog","text":"For changes since the latest tagged release, please refer to the git commit log.
"},{"location":"changelog/#110-15-december-2024","title":"1.1.0 (15 December 2024)","text":"Breaking changes:
Formula
is no longer always \"structured\" with special cases to handle the case where it has no structure. Legacy shims have been added to support old patterns, with DeprecationWarning
s raised when they are used. It is not expected to break anyone not explicitly checking whether the Formula.root
is a list instance (which formerly should have been simply assumed) [it is a now SimpleFormula
instance that acts like an ordered sequence of Term
instances].feature[T.A]
, whether nor not the encoding will result in that term acting as a contrast. Now, in keeping with patsy
, we only add the prefix if the categorical factor is encoded with reduced rank. Otherwise, feature[A]
will be used instead.formulaic.parsers.types.structured
has been promoted to formulaic.utils.structured
.New features and enhancements:
Formula
now instantiates to SimpleFormula
or StructuredFormula
, the latter being a tree-structure of SimpleFormula
instances (as compared to List[Term]
) previously. This simplifies various internal logic and makes the propagation of formula metadata more explicit.dict
and recarray
types are no associated with the pandas
materializer by default (rather than raising), simplifying some user workflows..
operator (which is replaced with all variables not used on the left-hand-side of formulae).[ ... ~ ... ]
. This is useful for (e.g.) generating formulae for IV 2SLS.ModelSpec[s]
based on an arbitrary strictly reduced FormulaSpec
.Formula.required_variables
to more easily surface the expected data requirements of the formula.cc
) and natural (cr
). See formulaic.materializers.transforms.cubic_spline.cubic_spline
for more details.lag()
transform.LinearConstraints
can now be done from a list of strings (for increased parity with patsy
).T.
when they actully describe contrasts (i.e. when they are encoded with reduced rank).encode_categorical
; which is surfaced via ModelSpec.factor_contrasts
.Operator
instances now received context
which is optionally specified by the user during formula parsing, and updated by the parser. This is what makes the .
implementation possible.Structured
, it has been promoted to formulaic.utils
.Bugfixes and cleanups:
Formula
instance.Structured
instances.ModelMatrix
and FactorValues
instances (whenever wrapped objects are picklable).basis_spline
: Fixed evaluation involving datasets with null values, and disallow out-of-bounds knots.ruff
for linting, and updated mypy
and pre-commit
tooling.ruff
are automatically applied when using hatch run lint:format
.Documentation:
Bugfixes and cleanups:
pandas
>=3.mypy
type inference in materializer subclasses.Documentation:
sklearn
integration example.Bugfixes and cleanups:
Breaking changes:
Formula.terms
, ModelSpec.feature_names
, and ModelSpec.feature_indices
.New features and enhancements:
patsy
.hashed
transform for categorically encoding deterministically hashed representations of a dataset.Bugfixes and cleanups:
astor
for Python 3.9 and newer.Documentation:
This is minor release with one important bugfix.
Bugfixes and cleanups:
This is a minor release with several important bugfixes.
Bugfixes and cleanups:
poly()
transforms operating on datasets that include null values.hatch run tests
.This is a minor release with several new features and cleanups.
New features and enhancements:
ModelSpec
documentation for more details.Bugfixes and cleanups:
OrderedDict
usage, since Python guarantees the orderedness of dictionaries in Python 3.7+.None
.This is a minor release with a bugfix.
Bugfixes and cleanups:
This is a minor release with several bugfixes.
Bugfixes and cleanups:
This is a minor release with one new feature.
New features and enhancements:
This is a major release with some important consistency and completeness improvements. It should be treated as almost being the first release candidate of 1.0.0, which will land after some small amount of further feature extensions and documentation improvements. All users are recommended to upgrade.
Breaking changes:
Although there are some internal changes to API, as documented below, there are no breaking changes to user-facing APIs.
New features and enhancements:
_ordering
keyword to the Formula
constructor.standardize
, Q
and treatment contrasts shims.cluster_by='numerical_factors
option to ModelSpec
to enable patsy style clustering of output columns by involved numerical factors.^
and %in%
.Structured
instances, and use this functionality during AST evaluation where relevant.ModelSpec.term_indices
is now a list rather than a tuple, to allow direct use when indexing pandas and numpy model matrices.Bugfixes and cleanups:
Structured
instances for non-sequential iterable values.poly
unit tests.PandasMaterializer
.Factor.EvalMethod.UNKNOWN
was removed, defaulting instead to LOOKUP
.sympy
version constraint now that a bug has been fixed upstream.Documentation:
This is a minor patch releases that fixes one bug.
Bugfixes and cleanups:
Structured
instance and iteration over this instance (including Formula
instances). Formerly the length would only count the number of keys in its structure, rather than the number of objects that would be yielded during iteration.This is a minor patch release that fixes two bugs.
Bugfixes and cleanups:
Formula
objects.formulaic.__version__
during package build.This is a major new release with some minor API changes, some ergonomic improvements, and a few bug fixes.
Breaking changes:
Formula
objects (e.g. formula.lhs
) no longer returns a list of terms; but rather a Formula
object, so that the helper methods can remain accessible. You can access the raw terms by iterating over the formula (list(formula)
) or looking up the root node (formula.root
).New features and improvements:
ModelSpec
object is now the source of truth in all ModelMatrix
generations, and can be constructed directly from any supported specification using ModelSpec.from_spec(...)
. Supported specifications include formula strings, parsed formulae, model matrices and prior model specs..get_model_matrix()
helper methods across Formula
, FormulaMaterializer
, ModelSpec
and model_matrix
objects/helpers functions are now consistent, and all use ModelSpec
directly under the hood.Formula
objects (e.g. formula.lhs
), the term lists will be wrapped as trivial Formula
instances rather than returned as raw lists (so that the helper methods like .get_model_matrix()
can still be used).FormulaSpec
is now exported from the top-level module.Bugfixes and cleanups:
ModelSpec
specifications being overriden by default arguments to FormulaMaterializer.get_model_matrix
.Structured._flatten()
now correctly flattens unnamed substructures.This is a major new release with some new features, greatly improved ergonomics for structured formulae, matrices and specs, and a few small breaking changes (most with backward compatibility shims). All users are encouraged to upgrade.
Breaking changes:
include_intercept
is no longer an argument to FormulaParser.get_terms
; and is instead an argument of the DefaultFormulaParser
constructor. If you want to modify the include_intercept
behaviour, please use: Formula(\"y ~ x\", _parser=DefaultFormulaParser(include_intercept=False))\n
Formula.terms
is deprecated since Formula
became a subclass of Structured[List[Terms]]
. You can directly iterate over, and/or access nested structure on the Formula
instance itself. Formula.terms
has a deprecated property which will return a reference to itself in order to support legacy use-cases. This will be removed in 1.0.0.ModelSpec.feature_names
and ModelSpec.feature_columns
are deprecated in favour of ModelSpec.column_names
and ModelSpec.column_indices
. Deprecated properties remain in-place to support legacy use-cases. These will be removed in 1.0.0.New features and enhancements:
Formula
has been refactored as a subclass of Structured[List[Terms]]
, and can be incrementally built and modified. The matrix and spec outputs now have explicit subclasses of Structured
(ModelMatrices
and ModelSpecs
respectively) to expose convenience methods that allow these objects to be largely used interchangeably with their singular counterparts.ModelMatrices
and ModelSpecs
arenow surfaced as top-level exports of the formulaic
module.Structured
(and its subclasses) gained improved integration of nested tuple structure, as well as support for flattened iteration, explicit mapping output types, and lots of cleanups.ModelSpec
was made into a dataclass, and gained several new properties/methods to support better introspection and mutation of the model spec.FormulaParser
was renamed DefaultFormulaParser
, and made a subclass of the new formula parser interface FormulaParser
. In this process include_intercept
was removed from the API, and made an instance attribute of the default parser implementation.Bugfixes and cleanups:
ModelSpec
s are provided by the user during materialization, they are updated to reflect the output-type chosen by the user, as well as whether to ensure full rank/etc.pylint
was added to the CI testing.Documentation:
.materializer
submodule, most code now has inline documentation and annotations.This is a backward compatible major release that adds several new features.
New features and enhancements:
ModelMatrix
instances (see ModelMatrix.model_spec.get_linear_constraints
).ModelMatrix
, ModelSpec
and other formula-like objects to the model_matrix
sugar method so that pre-processed formulae can be used.0
with -1
to avoid substitutions in quoted contexts.Bugfixes and cleanups:
bs(`my|feature%is^cool`)
.C(x, {\"a\": [1,2,3]})
.astor
to >=0.8 to fix issues with ast-generation in Python 3.8+ when numerical constants are present in the parsed python expression (e.g. \"bs(x, df=10)\").This is a minor patch release that migrates the package tooling to poetry; solving a version inconsistency when packaging for conda
.
This is a minor patch release that fixes an attempt to import numpy.typing
when numpy is not version 1.20 or later.
This is a minor patch release that fixes the maintaining of output types, NA-handling, and assurance of full-rank for factors that evaluate to pre-encoded columns when constructing a model matrix from a pre-defined ModelSpec. The benchmarks were also updated.
"},{"location":"changelog/#030-14-march-2022","title":"0.3.0 (14 March 2022)","text":"This is a major new release with many new features, and a few small breaking changes. All users are encouraged to upgrade.
Breaking changes:
formulaic.materializers.transforms
to the top-level formulaic.transforms
module, and ported all existing transforms to output FactorValues
types rather than dictionaries. FactorValues
is an object proxy that allows output types like pandas.DataFrame
s to be used as they normally would, with some additional metadata for formulaic accessible via the __formulaic_metadata__
attribute. This makes non-formula direct usage of these transforms much more pleasant.~
is no longer a generic formula separator, and can only be used once in a formula. Please use the newly added |
operator to separate a formula into multiple parts.New features and enhancements:
~
operator to use them. Structured formulas can have named substructures, for example: lhs
and rhs
for the ~
operator. The representation of formulas has been updated to show this structure.|
operator which splits formulas into multiple parts).formulaic.model_matrix
syntactic sugar function now accepts ModelSpec
and ModelMatrix
instances as the \"formula\" spec, making generation of matrices with the same form as previously generated matrices more convenient.poly
transform (compatible with R and patsy).numpy
is now always available in formulas via np
, allowing formulas like np.sum(x)
. For convenience, log
, log10
, log2
, exp
, exp10
and exp2
are now exposed as transforms independent of user context.formulaic.utils.context.capture_context()
. This can be used by libraries that wrap Formulaic to capture the variables and/or transforms available in a users' environment where appropriate.Bugfixes and cleanups:
Documentation:
.parser
and .utils
modules of Formulaic are now inline documented and annotated.This is a minor release that fixes an issue whereby the ModelSpec instances attached to ModelMatrix objects would keep reference to the original data, greatly inflating the size of the ModelSpec.
"},{"location":"changelog/#023-4-february-2021","title":"0.2.3 (4 February 2021)","text":"This release is identical to v0.2.2, except that the source distribution now includes the docs, license, and tox configuration.
"},{"location":"changelog/#022-4-february-2021","title":"0.2.2 (4 February 2021)","text":"This is a minor release with one bugfix.
This is a minor patch release that brings in some valuable improvements.
DataFrame
.This is major release that brings in a large number of improvements, with a huge number of commits. Some API breakage from the experimental 0.1.x series is likely in various edge-cases.
Highlights include:
Performance improvements around the encoding of categorical features.
Matthew Wardrop (1):\n Improve the performance of encoding operations.\n
"},{"location":"changelog/#011-31-october-2019","title":"0.1.1 (31 October 2019)","text":"No code changes here, just a verification that GitHub CI integration was working.
Matthew Wardrop (1):\n Update Github workflow triggers.\n
"},{"location":"changelog/#010-31-october-2019","title":"0.1.0 (31 October 2019)","text":"This release added support for keeping track of encoding choices during model matrix generation, so that they can be reused on similarly structured data. It also added comprehensive unit testing and CI integration using GitHub actions.
Matthew Wardrop (5):\n Add support for stateful transforms (including encoding).\n Fix tokenizing of nested Python function calls.\n Add support for nested transforms that return multiple columns, as well as passing through of materializer config through to transforms.\n Add comprehensive unit testing along with several small miscellaneous bug fixes and improvements.\n Add GitHub actions configuration.\n
"},{"location":"changelog/#001-1-september-2019","title":"0.0.1 (1 September 2019)","text":"Initial open sourcing of formulaic
.
Matthew Wardrop (1):\n Initial (mostly) working implementation of Wilkinson formulas.\n
"},{"location":"formulas/","title":"What are formulas?","text":"This section introduces the basic notions and origins of formulas. If you are already familiar with formulas from another context, you might want to skip forward to the Formula Grammar or other User Guides.
"},{"location":"formulas/#origins","title":"Origins","text":"Formulas were originally proposed by Wilkinson et al.1 to aid in the description of ANOVA problems, but were popularised by the S language (and then R, as an implementation of S) in the context of linear regression. Since then they have been extended in R, and implemented in Python (by patsy), in MATLAB, in Julia, and quite conceivably elsewhere. Each implementation has its own nuances and grammatical extensions, including Formulaic's which are described more completely in the Formula Grammar section of this manual.
"},{"location":"formulas/#why-are-they-useful","title":"Why are they useful?","text":"Formulas are useful because they provide a concise and explicit specification for how data should be prepared for a model. Typically, the raw input data for a model is stored in a dataframe, but the actual implementations of various statistical methodologies (e.g. linear regression solvers) act on two-dimensional numerical matrices that go by several names depending on the prevailing nomenclature of your field, including \"model matrices\", \"design matrices\" and \"regressor matrices\" (within Formulaic, we refer to them as \"model matrices\"). A formula provides the necessary information required to automate much of the translation of a dataframe into a model matrix suitable for ingestion into a statistical model.
Suppose, for example, that you have a dataframe with \\(N\\) rows and three numerical columns labelled: y
, a
and b
. You would like to construct a linear regression model for y
based on a
, b
and their interaction: \\[ y = \\alpha + \\beta_a a + \\beta_b b + \\beta_{ab} ab + \\varepsilon \\] with \\(\\varepsilon \\sim \\mathcal{N}(0, \\sigma^2)\\). Rather than manually constructing the required matrices to pass to the regression solver, you could specify a formula of form:
y ~ a + b + a:b\n
When furnished with this formula and the dataframe, Formulaic (or indeed any other formula implementation) would generate two model matrix objects: an \\( N \\times 1 \\) matrix \\(Y\\) for the response variable y
, and an \\( N \\times 4 \\) matrix \\(X\\) for the input columns intercept
, a
, b
, and a * b
. You can then directly pass these matrices to your regression solver, which internally will solve for \\(\\beta\\) in: \\[ Y = X\\beta + \\varepsilon. \\] The true value of formulas becomes more apparent as model complexity increases, where they can be a huge time-saver. For example:
~ (f1 + f2 + f3) * (x1 + x2 + scale(x3))\n
tells the formula interpreter to consider 16 fields of input data, corresponding to an intercept (1), each of the f*
fields (3), each of the x*
fields (3), and the combination of each f
with each x
(9). It also instructs the materializer to ensure that the x3
column is rescaled during the model matrix materialization phase such that it has mean zero and standard error of 1. If any of these columns is categorical in nature, they would by default also be one-hot/dummy encoded. Depending on the formula interpreter (including Formulaic), extra steps would also be taken to ensure that the resulting model matrix is structurally full-rank. As an added bonus, some formula implementations (including Formulaic) can remember any choices made during the materialization process, and apply them to consistently to new data, making it possible to easily generate new data that conforms to the same structure as the training data. For example, the scale(...)
transform in the example above makes use of the mean and variance of the column to be scaled. Any future data should, however, should not undergo scaling based on its own mean and variance, but rather on the mean and variance that was measured for the training data set (otherwise the new dataset will not be consistent with the expectations of the trained model which will be interpreting it).
Formulas are a very flexible tool, and can be augmented with arbitrary user-defined transforms. However, some transformations required by certain models may be more elegantly defined via a pre-formula dataframe operation or post-formula model matrix operation. Another consideration is that the default encoding and materialization choices for data are aligned with linear regression. If you are using a tree model, for example, you may not be interested in dummy encoding of \"categorical\" features, and this type of transform would have to be explicitly noted in the formula. Nevertheless, even in these cases, formulas are an excellent tool, and can often be used to greatly simplify data preparation workflows.
"},{"location":"formulas/#where-to-from-here","title":"Where to from here?","text":"To learn about the full set of features supported by the formula language as implemented by Formulaic, please review the Formula Grammar. To get a feel for how you can use formulaic
to transform your dataframes into model matrices, please review the Quickstart.
Wilkinson, G. N., and C. E. Rogers. Symbolic description of factorial models for analysis of variance. J. Royal Statistics Society 22, pp. 392\u2013399, 1973.\u00a0\u21a9
The latest release of formulaic
is always published to the Python Package Index (PyPI), from which it is available to download @ https://pypi.org/project/formulaic/.
If your Python environment is provisioned with pip
, installing formulaic
from the PyPI is as simple as running:
$ pip install formulaic\n
Note
If you have a non-standard setup, ensure that pip
above are replaced with the executables corresponding to the environment for which you are interested in installing formulaic
. This is done automatically if you are using a virtual environment.
You are ready to use Formulaic. To get introduced to the concepts underpinning Formulaic, please review the Concepts documentation, or to jump straight to how to use Formulaic, please review the User Guides documentation.
"},{"location":"installation/#installing-for-development","title":"Installing for development","text":"If you are interested in developing formulaic
, you should clone the source code repository, and install in editable mode from there (allowing your changes to be instantly available to all new Python sessions).
To clone the source code, run:
$ git clone git@github.com:matthewwardrop/formulaic.git\n
Note
This requires you to have a GitHub account set up. If you do not have an account you can replace the SSH url above with https://github.com/matthewwardrop/formulaic.git
. Also, if you are planning to submit your work upstream, you may wish to fork the repository into your own namespace first, and clone from there.
To install in editable mode, run:
$ pip install -e <path_to_cloned_formulaic_repo>\n
You will need pip>=21.3
in order for this to work. You can then make any changes you like to the repo, and have them be reflected in your local Python sessions. Happy hacking, and I look forward to your contributions!
"},{"location":"migration/","title":"Migrating from Patsy/R","text":"The default Formulaic parser and materialization configuration is designed to be highly compatibly with existing Wilkinson formula implementations in R and Python; however there are some differences which are highlighted here. If you find other differences, feel free to submit a PR to update this documentation.
"},{"location":"migration/#migrating-from-patsy","title":"Migrating frompatsy
","text":"Patsy has been the go-to implementation of Wilkinson formulae for Python use-cases for many years, and Formulaic should be largely a drop-in replacement, while bringing order of magnitude improvements in runtime performance and greater extensibility. Being written in the same language (Python) there are two separate migration concerns: input/output and API migrations, which will be explored separately below.
"},{"location":"migration/#inputoutput-changes","title":"Input/Output changes","text":"The primary inputs to patsy
are a formula string, and pandas dataframe from which features referenced in the formula are drawn. The output is a model matrix (called a design matrix in patsy
). We focus here on any potentially breaking behavioural differences here, rather than ways in which Formulaic extends the functionality available in patsy
.
^
operator is interpreted as exponentiation, rather than Python's XOR binary operator.C(x, contr.treatment)
. For greater compatibility with patsy
we add to the transform namespace Treatment
, Poly
, Sum
, Helmert
and Diff
, allowing formulae like C(x, Poly)
or C(x, Treatment(reference='x'))
to work as expected, with the following caveats:C
is C(data, contrasts=None, *, levels=None)
as compared to C(data, contrast=None, levels=None)
from patsy
.Sum
contrast does not offer an omit
option to specify the index of the omitted column.scale(x)
, but compatibility shims for standardize(x)
are added for greater compatibility with patsy
. Note that the standardize
shim follows patsy argument kwarg naming conventions, but scale
uses scale
instead of rescale
, following R.cluster_by=\"numerical_factors\"
to model_matrix
or any of the .get_model_matrix(...)
methods.cr
and cc
) or tensor smoothing (te
) stateful transforms.patsy
offers two high-level user-facing entrypoints: patsy.dmatrix
and patsy.dmatrices
, depending on whether you have both left- and right-hand sides present. In formulaic
, we offer a single entrypoint for both cases: model_matrix
.
In the vast majority of cases, a simple substitution of dmatrix
or dmatrices
with model_matrix
will achieve the desired the result; however there are some differences in signature that could trip up a naive copy and replace. Patsy's dmatrix
signature is:
patsy.dmatrix(\n formula_like,\n data={},\n eval_env=0,\n NA_action='drop',\n return_type='matrix',\n)\n
whereas model_matrix
has a signature of: formulaic.model_matrix(\n spec: FormulaSpec, # accepts any formula-like spec (include model matrices and specs)\n data: Any, # accepts any supported data structure (include pandas DataFrames)\n *,\n context: Union[int, Mapping[str, Any]] = 0, # equivalent to `eval_env`\n **spec_overrides, # Additional overrides for generated `ModelSpec`, including `na_action` and `output` (similar to `return_type`).\n)\n
If you are integrating Formulaic into your library, it is highly recommended to use the Formula()
API directly rather than model_matrix
, which by default will add all variables in the local context into the evaluation environment (just like dmatrix
). This allows you to better isolate and control the behaviour of the Formula parsing.
Most formulae that work in R will work without modification, including those written against the enhanced R Formula package that supports multi-part formulae. However, there are a few caveats that are worth calling out:
^
operator within an I
transform; e.g. I(x^2)
. This is because this is treated as Python code, and so you should use I(x**2)
or {x**2}
instead.patsy
. In particular, order of operations are respected when evaluating intercept directives, and so: 1 + (b - 1)
would result in the intercept remaining (since (b-1)
would be evaluated first to b
, resulting in 1 + b
), whereas in R the intercept would have been dropped.patsy
. Using capital letters to represent categorical variables, and lower-case letters to represent numerical ones, the difference from R will become apparent in two cases:1 + A:B
. In this case, R does not account for the fact that A:B
spans the intercept, and so does not rank reduce the product, and thus generates an over-specified matrix. This affects higher-order interactions also.0 + A:x + B:C
. Here we use 0 +
to avoid the previous bug, but unfortunately when R is checking whether to reduce the rank of the categorical features during encoding, it assumes that all involved features are categorical, and thus unnecessarily reduces the rank of C
, resulting in an under-specified matrix. This affects higher-order interactions also.offset(...)
.For more details, refer to the Formula Grammar.
"},{"location":"dev/","title":"Introduction","text":"This section of the documentation focuses on providing guidance to developers of libraries that are integrating Formulaic, users who want to extend its behavior, or for those interested in directly contributing.
If you are looking to directly work with Formulaic as an end-user, please review the User Guides instead.
This portion of the documentation is less complete than user-facing documentation, and you are encouraged to reach out via the Issue Tracker if you need any help.
"},{"location":"dev/extensions/","title":"Extensions","text":"Formulaic was designed to be extensible from day one, and nearly all of its core functionality is implemented as \"plugins\"/\"modules\" that you can use as examples for how extensions could be written. In this document we will provide a basic high-level overview of the basic components of Formulaic that can extended.
An important consideration is that while Formulaic offers extensible APIs, and effort will be made not to break extension APIs without reason (and never in patch releases), the safest place for you extensions is in Formulaic itself, where they can be kept up to date and maintained (assuming the extension is not overly bespoke). If you think your extensions might help others, feel free to reach out via the issue tracker and/or open a pull request.
"},{"location":"dev/extensions/#transforms","title":"Transforms","text":"Transforms are likely the most commonly extended feature of Formulaic, and also likely the least valuable to upstream (since transforms are often domain specific). Documentation for implementing transforms is described in detail in the Transforms user guide.
"},{"location":"dev/extensions/#materializers","title":"Materializers","text":"Materializers are responsible for translating formulae into model matrices as documented in the How it works user guide. You need to implement a new materializer if you want to add support for new input and/or output types.
Implementing a new materializer is as simple as subclassing the abstract class formulaic.materializers.FormulaMaterializer
(or one of its subclasses). This base class defines the API expected by the rest of the Formulaic system. Example implementations include pandas and pyarrow.
During subclassing, the new class is registered according to the various REGISTER_*
attributes if REGISTER_NAME
is specified. This registration allows looking up of the materializer by name through the model_matrix()
and .get_model_matrix()
functions. You can always manually pass in your materializer class explicitly without this registration.
Parsers translate a formula string to a set of terms and factors that are then evaluated and assembled into the model matrix, as documented in the How it works user guide. This is unlikely to be necessary very often, but can be used to add additional formula operators, or change the behavior of existing ones.
Formula parsers are expected to implement the API of formulaic.parser.types.FormulaParser
. The default implementation can be seen here. You can pass in custom parsers to Formula()
via the parser
and nested_parser
options (see inline documentation for more details).
If you are considering extending the parser, please do reach out via the issue tracker.
"},{"location":"dev/integration/","title":"Integration","text":"If you are looking to enrich your existing Python project with support for formulae, you have come to the right place. Formulaic is designed with simple APIs that should make it straightforward to integrate into any project.
In this document we provide several general recommendations for developers integrating Formulaic, and then some more specific guidance for developers looking to migrate existing formula functionality from patsy
. As you are working on integrating Formulaic, if you come across anything not mentioned here that really ought to be, please report it to our Issue Tracker.
For the most part, Formulaic should \"just work\". However, here are a couple of recommendations that might make your integration work easier.
model_matrix
. This is a simple wrapper around lower-level APIs that automatically includes variables from users' local namespaces. This is convenient when running in a notebook, but can lead to unexpected interactions with your library code that are hard to debug. Called naively in your library it will treat the frame in which it was run as the user context, which may include somewhat sensitive internal state and may override transforms normally available to formulae. Instead, use Formula(...).get_model_matrix(...)
.formulaic.utils.context.capture_context()
function and pass the result as context
to the .get_model_matrix()
methods. It is easiest to use in the outermost user-facing entrypoints so that you do not need to figure out exactly how many frames removed you are from user-context. You may also manually construct a dictionary from the user's context if you want to do additional filtering.eval()
function may be called to invoke the indicated Python functions. Since this is user-specified code, it is possible that the formula had some malicious code in it (such as sys.exit()
or shutil.rmtree()
). If you are integrating Formulaic into server-side code, it is highly recommended not to pass in any user-specified context, but instead to curate the set of additional functions that are available and pass that in instead. If you are writing a user-facing library, this should not be as concerning.pandas.DataFrame
-> PandasMaterializer
). Different materializers may have different output (and other) options. It may make sense to hard-code your choice of materializer by passing materializer=
to the .get_model_matrix()
methods.output='sparse'
to .get_model_matrix()
, assuming the materializer of your datatype supports this).ModelSpec
instances to work across different major versions of Formulaic. It may be tempting to serialize them to disk and then reuse them in newer versions of Formulaic. Most of the time this will work fine, but the stored encoder and transform states are considered implementation details of stateful transforms and are subject to change between major versions. Patch releases should never result in changes to this state.formulaic.parser.DefaultFormulaParser.FeatureFlags
. You can pass these flags, or a set of (case-insensitive) strings corresponding to these enums, to DefaultFormulaParser(feature_flags=...)
.If you are migrating a library that previous used patsy
to formulaic
, you should first review the general user-facing migration notes, which describes differences in API and formula grammars. Then, in addition to the recommendations above, the following notes might be helpful.
patsy
, such as manually assembling Term
instances, this code will need to be rewritten to use Formulaic
classes instead. Generally speaking, this will likely be transparent to your users and so should be a relatively small lift.This section of the documentation focuses on guiding end-users through the various aspects of Formulaic likely to be useful in day-to-day workflows. Feel free to pick and choose which modules you peruse, but note that later modules may assume knowledge of content described in earlier modules.
If you are a developer of another library looking to leverage Formulaic internally to your code, or are looking to contribute directly to Formulaic, it is recommended to also review the Developer Guides.
"},{"location":"guides/contrasts/","title":"Categorical Encoding","text":"Categorical data (also known as \"factors\") is encoded in model matrices using \"contrast codings\" that transform categorical vectors into a collection of numerical vectors suitable for use in regression models. In this guide we begin with some basic examples, before introducing the concepts behind contrast codings, how to select and/or design your own coding, and (for more advanced readers) describe how we guarantee structural full-rankness of formulae with complex interactions between categorical and other features.
In\u00a0[1]: Copied!from pandas import Categorical, DataFrame\n\nfrom formulaic import model_matrix\n\ndf = DataFrame(\n {\n \"letters\": [\"a\", \"b\", \"c\"],\n \"numbers\": Categorical([1, 2, 3]),\n \"values\": [20, 200, 30],\n }\n)\n\nmodel_matrix(\"letters + numbers + values\", df)\nfrom pandas import Categorical, DataFrame from formulaic import model_matrix df = DataFrame( { \"letters\": [\"a\", \"b\", \"c\"], \"numbers\": Categorical([1, 2, 3]), \"values\": [20, 200, 30], } ) model_matrix(\"letters + numbers + values\", df) Out[1]: Intercept letters[T.b] letters[T.c] numbers[T.2] numbers[T.3] values 0 1.0 0 0 0 0 20 1 1.0 1 0 1 0 200 2 1.0 0 1 0 1 30
Here letters
was identified as a categorical variable because of it consisted of strings, numbers
was identified as categorical because of its data type, and values
was treated as a vector of numerical values. The categorical data was encoded using the default encoding of \"Treatment\" (aka. \"Dummy\", see below for more details).
If we wanted to force formulaic to treat a column as categorical, we can use the C()
transform (just as in patsy and R). For example:
model_matrix(\"C(values)\", df)\nmodel_matrix(\"C(values)\", df) Out[2]: Intercept C(values)[T.30] C(values)[T.200] 0 1.0 0 0 1 1.0 0 1 2 1.0 1 0
The C()
transform tells Formulaic that the column should be encoded as categorical data, and allows you to customise how the encoding is performed. For example, we could use polynomial coding (detailed below) and explicitly specify the categorical levels and their order using:
model_matrix(\"C(values, contr.poly, levels=[10, 20, 30])\", df)\nmodel_matrix(\"C(values, contr.poly, levels=[10, 20, 30])\", df)
/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:124: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {200}. They are being cast to nan, which will likely skew the results of your analyses.\n warnings.warn(\nOut[3]: Intercept C(values, contr.poly, levels=[10, 20, 30]).L C(values, contr.poly, levels=[10, 20, 30]).Q 0 1.0 0.000000 -0.816497 1 1.0 0.000000 0.000000 2 1.0 0.707107 0.408248
Where possible, as you can see above, we also provide warnings when a categorical encoding does not reflect the structure of the data.
In\u00a0[4]: Copied!from formulaic.transforms.contrasts import C, TreatmentContrasts\n\nTreatmentContrasts(base=\"B\").get_coding_matrix([\"A\", \"B\", \"C\", \"D\"])\nfrom formulaic.transforms.contrasts import C, TreatmentContrasts TreatmentContrasts(base=\"B\").get_coding_matrix([\"A\", \"B\", \"C\", \"D\"]) Out[4]: A C D A 1.0 0.0 0.0 B 0.0 0.0 0.0 C 0.0 1.0 0.0 D 0.0 0.0 1.0 In\u00a0[5]: Copied!
TreatmentContrasts(base=\"B\").get_coefficient_matrix([\"A\", \"B\", \"C\", \"D\"])\nTreatmentContrasts(base=\"B\").get_coefficient_matrix([\"A\", \"B\", \"C\", \"D\"]) Out[5]: A B C D B 0.0 1.0 0.0 0.0 A-B 1.0 -1.0 -0.0 -0.0 C-B 0.0 -1.0 1.0 0.0 D-B 0.0 -1.0 0.0 1.0 In\u00a0[6]: Copied!
model_matrix(\"C(letters, contr.treatment)\", df)\nmodel_matrix(\"C(letters, contr.treatment)\", df) Out[6]: Intercept C(letters, contr.treatment)[T.b] C(letters, contr.treatment)[T.c] 0 1.0 0 0 1 1.0 1 0 2 1.0 0 1 In\u00a0[7]: Copied!
model_matrix(\"C(letters, contr.SAS)\", df)\nmodel_matrix(\"C(letters, contr.SAS)\", df) Out[7]: Intercept C(letters, contr.SAS)[T.a] C(letters, contr.SAS)[T.b] 0 1.0 1 0 1 1.0 0 1 2 1.0 0 0 In\u00a0[8]: Copied!
model_matrix(\"C(letters, contr.sum)\", df)\nmodel_matrix(\"C(letters, contr.sum)\", df) Out[8]: Intercept C(letters, contr.sum)[S.a] C(letters, contr.sum)[S.b] 0 1.0 1.0 0.0 1 1.0 0.0 1.0 2 1.0 -1.0 -1.0 In\u00a0[9]: Copied!
model_matrix(\"C(letters, contr.helmert)\", df)\nmodel_matrix(\"C(letters, contr.helmert)\", df) Out[9]: Intercept C(letters, contr.helmert)[H.b] C(letters, contr.helmert)[H.c] 0 1.0 -1.0 -1.0 1 1.0 1.0 -1.0 2 1.0 0.0 2.0 In\u00a0[10]: Copied!
model_matrix(\"C(letters, contr.diff)\", df)\nmodel_matrix(\"C(letters, contr.diff)\", df) Out[10]: Intercept C(letters, contr.diff)[D.b] C(letters, contr.diff)[D.c] 0 1.0 -0.666667 -0.333333 1 1.0 0.333333 -0.333333 2 1.0 0.333333 0.666667 In\u00a0[11]: Copied!
model_matrix(\"C(letters, contr.poly)\", df)\nmodel_matrix(\"C(letters, contr.poly)\", df) Out[11]: Intercept C(letters, contr.poly).L C(letters, contr.poly).Q 0 1.0 -0.707107 0.408248 1 1.0 0.000000 -0.816497 2 1.0 0.707107 0.408248 In\u00a0[12]: Copied!
my_letters = C(df.letters, TreatmentContrasts(base=\"b\"))\nmodel_matrix(\"my_letters\", df)\nmy_letters = C(df.letters, TreatmentContrasts(base=\"b\")) model_matrix(\"my_letters\", df) Out[12]: Intercept my_letters[T.a] my_letters[T.c] 0 1.0 1 0 1 1.0 0 0 2 1.0 0 1 In\u00a0[13]: Copied!
import numpy\n\nZ = numpy.array(\n [\n [1, 0, 0, 0], # A\n [-1, 1, 0, 0], # B - A\n [0, -1, 1, 0], # C - B\n [-1, 0, 0, 1], # D - A\n ]\n)\ncoding = numpy.linalg.inv(Z)[:, 1:]\ncoding\nimport numpy Z = numpy.array( [ [1, 0, 0, 0], # A [-1, 1, 0, 0], # B - A [0, -1, 1, 0], # C - B [-1, 0, 0, 1], # D - A ] ) coding = numpy.linalg.inv(Z)[:, 1:] coding Out[13]:
array([[0., 0., 0.],\n [1., 0., 0.],\n [1., 1., 0.],\n [0., 0., 1.]])In\u00a0[14]: Copied!
model_matrix(\n \"C(letters, contr.custom(coding))\", DataFrame({\"letters\": [\"A\", \"B\", \"C\", \"D\"]})\n)\nmodel_matrix( \"C(letters, contr.custom(coding))\", DataFrame({\"letters\": [\"A\", \"B\", \"C\", \"D\"]}) ) Out[14]: Intercept C(letters, contr.custom(coding))[1] C(letters, contr.custom(coding))[2] C(letters, contr.custom(coding))[3] 0 1.0 0.0 0.0 0.0 1 1.0 1.0 0.0 0.0 2 1.0 1.0 1.0 0.0 3 1.0 0.0 0.0 1.0
The model matrices generated from formulae are often consumed directly by linear regression algorithms. In these cases, if your model matrix is not full rank, then the features in your model are not linearly independent, and the resulting coefficients (assuming they can be computed at all) cannot be uniquely determined. While there are ways to handle this, such as regularization, it is usually easiest to put in a little more effort during the model matrix creation process, and make the incoming vectors in your model matrix linearly independent from the outset. As noted in the text above, categorical coding requires consideration about the overlap of the coding with the intercept in order to remain full rank. The good news is that Formulaic will do most of the heavy lifting for you, and does so by default.
It is important to note at this point that Formulaic does not protect against all forms of linear dependence, only structural linear dependence; i.e. linear dependence that results from multiple categorical variables overlapping in vectorspace. If you have two identical numerical vectors called by two different names in your model matrix, Formulaic will happily build the model matrix you requested, and you're on your own. This is intentional. While Formulaic strives to make the model matrix generation process as painless as possible, it also doesn't want to make more assumptions about the use of the data than is necessary. Note that you can also disable Formulaic's structural full-rankness algorithms by passing ensure_full_rank=False
to model_matrix()
or .get_model_matrix()
methods; and can bypass the reducing of the rank of a single categorical term in a formula using C(..., spans_intercept=False)
(this is especially useful, for example, if your model includes regularization and you would prefer to use the over-specified model to ensure fairer shrinkage).
The algorithm that Formulaic uses was heavily inspired by patsy
^1. The basic idea is to recognize that all categorical codings span the intercept[^2]; and then to break that coding up into two pieces: a single column that can be dropped to avoid spanning the intercept, and the remaining body of the coding that will always be present. You expand associatively the categorical factors, and then greedily recombine the components, omitting any that would lead to structural linear dependence. The result is a set of categorical codings that only spans the intercept when it is safe to do so, guaranteeing structural full rankness. The patsy documentation goes into this in much more detail if this is interesting to you.
[^2]: This assumes that categories are \"complete\", in that each unit has been assigned a category. You can \"complete\" categories by treating all those unassigned as being a member of an imputed \"null\" category.
"},{"location":"guides/contrasts/#basic-usage","title":"Basic usage\u00b6","text":"Formulaic follows in the stead of R and Patsy by automatically inferring from the data whether a feature needs to categorically encoded. For example:
"},{"location":"guides/contrasts/#how-does-contrast-coding-work","title":"How does contrast coding work?\u00b6","text":"As you have seen, contrast coding transforms categorical vectors into a matrix of numbers that can be used during modeling. If your data has $K$ mutually exclusive categories, these matrices typically consist of $K-1$ columns. This reduction in dimensionality reflects the fact that membership of the $K$th category could be inferred from the lack of membership in any other category, and so is redundant in the presence of a global intercept. You can read more about this in the full rankness discussion below.
The first step toward generating numerical vectors from categorical data is to dummy encode it. This transforms the single vector of $K$ categories into $K$ boolean vectors, each having a $1$ only in rows that are members of the corresponding category. If you do not have a global intercept, you can directly use this dummy encoding with the full $K$ columns and contrasts are unnecessary. This is not always the case, which requires you to reduce the rank of your coding by thinking about contrasts (or differences) between the levels.
In practice, this dimension reduction using \"contrasts\" looks like constructing a $K \\times (K-1)$ \"coding matrix\" that describes the contrasts of interest. You can then post-multiply your dummy-encoded columns by it. That is: $$ E = DC $$ where $E \\in \\mathbb{R}^{N \\times (K-1)}$ is the contrast coded categorical data, $D \\in \\{0, 1\\}^{N \\times K}$ is the dummy encoded data, and $C \\in \\mathbb{R}^{K \\times (K-1)}$ is the coding matrix.
The easiest way to construct a coding matrix is to start with a \"coefficient matrix\" $Z \\in \\mathbb{R}^{K \\times K}$ which describes the contrasts that you want the coefficients of a trained linear regression model to represent (with columns representing the untransformed levels, and rows representing the transforms). For a consistently chosen set of contrasts, this matrix will be full-rank, and the inverse of this matrix will have a constant column representing the global intercept. Removing this column results in the $K \\times (K-1)$ coding matrix that should be apply to the dummy encoded data in order for the coefficients to have the desired interpretation.
For example, if we wanted all of the levels to be compared to the first level, we would build a matrix $Z$ as: $$ \\begin{align} Z =& \\left(\\begin{array}{c c c c} 1 & 0 & 0 & 0 \\\\ -1 & 1 & 0 & 0\\\\ -1 & 0 & 1 & 0\\\\ -1 & 0 & 0 & 1 \\end{array}\\right)\\\\ \\therefore Z^{-1} =& \\left(\\begin{array}{c c c c} 1 & 0 & 0 & 0 \\\\ 1 & 1 & 0 & 0\\\\ 1 & 0 & 1 & 0\\\\ 1 & 0 & 0 & 1 \\end{array}\\right)\\\\ \\implies C =& \\left(\\begin{array}{c c c} 0 & 0 & 0 \\\\ 1 & 0 & 0\\\\ 0 & 1 & 0\\\\ 0 & 0 & 1 \\end{array}\\right) \\end{align} $$ This is none other than the default \"treatment\" coding described below, which applies one-hot coding to the categorical data.
It is important to note that while your choice of contrast coding will change the interpretation and values of your coefficients, all contrast encodings ultimately result in equivalent regressions, and it is possible to restrospectively infer any other set of interesting contrasts given the regression covariance matrix. The task is therefore to find the most useful representation, not the \"correct\" one.
For those interested in reading more, the R Documentation on Coding Matrices covers this in more detail.
"},{"location":"guides/contrasts/#contrast-codings","title":"Contrast codings\u00b6","text":"This section introduces the contrast encodings that are shipped as part of Formulaic. These implementations live in formulaic.transforms.contrasts
, and are surfaced by default in formulae as an attribute of contr
(e.g. contr.treatment
, in order to be consistent with R). You can always implement your own contrasts if the need arises.
If you would like to dig deeper and see the actual contrast/coefficient matrices for various parameters you can directly import these contrast implementations and play with them in a Python shell, but otherwise for brevity we will not exhaustively show these in the following documentation. For example:
"},{"location":"guides/contrasts/#treatment-aka-dummy","title":"Treatment (aka. dummy)\u00b6","text":"This contrast coding compares each level with some reference level. If not specified, the reference level is taken to be the first level. The reference level can be specified as the first argument to the TreatmentContrasts
/contr.treatment
constructor.
Example formulae:
~ X
: Assuming X
is categorical, the treatment encoding will be used by default.~ C(X)
: You can also explicitly flag a feature to be encoded as categorical, whereupon the default is treatment encoding.~ C(X, contr.treatment)
: Explicitly indicate that the treatment encoding should be used.~ C(X, contr.treatment(\"x\"))
: Indicate that the reference treatment should be \"x\" instead of the first index.~ C(X, contr.treatment(base=\"x\"))
: As above.This constrasts generated by this class are the same as the above, but with the reference level defaulting to the last level (the default in SAS).
Example formulae:
~ C(X, contr.SAS)
: Basic use-case.~ C(X, contr.SAS(\"x\"))
: Same as treatment encoding case above.~ C(X, contr.SAS(base=\"x\"))
: Same as treatment encoding case above.These contrasts compare each level (except the last, which is redundant) to the global average of all levels.
Example formulae:
~ C(X, contr.sum)
: Encode categorical data using the sum coding.These contrasts compare each successive level to the average all previous/subsequent levels. It has two configurable parameters: reverse
which controls the direction of comparison, and scale
which controls whether to scale the encoding to simplify interpretation of coefficients (results in a floating point model matrix instead of an integer one). When reverse
is True
, the contrasts compare a level to all previous levels; and when False
, it compares it to all subsequent levels.
The default parameter values are chosen to match the R implementation, which corresponds to a reversed and unscaled Helmert coding.
Example formulae:
~ C(X, contr.helmert)
: Unscaled reverse coding.~ C(X, contr.helmert(reverse=False))
: Unscaled forward coding.~ C(X, contr.helmert(scale=True))
: Scaled reverse coding.~ C(X, contr.helmert(scale=True, reverse=False))
: Scaled forward coding.These contrasts take the difference of each level with the previous level. It has one parameter, forward
, which indicates that the difference should be inverted such the difference is taken between the previous level and the current level. The default attribute values are chosen to match the R implemention, and correspond to a backward difference coding.
Example formulae:
~ C(X, contr.diff)
: Backward coding.~ C(X, contr.diff(forward=True))
: Forward coding.The \"contrasts\" represent a categorical variable that is assumed to equal (or known) spacing/scores, and allow us to model non-linear polynomial behaviour of the dependent variable with respect to the ordered levels by projecting the spacing onto a basis of orthogonal polynomials. It has one parameter, scores
which indicates the spacing of the categories. It must have the same length as the number of levels. If not provided, the categories are assumed equidistant and spaced by 1.
The feature names of categorical variables can become quite unwieldy, as you may have noticed. Fortunately this is easily remedied by aliasing the variable outside of your formula (and then making it available via formula context). This is done automatically if you use the model_matrix
function. For example:
It may be useful to define your own coding matrices in some contexts. This is readily achieved using the CustomContrasts
class directly or via the contr.custom
alias. In these cases, you are responsible for providing the coding matrix ($C$ from above). For example, if you had four levels: A, B, C and D, and wanted to compute the contrasts: B - A, C - B, and D - A, you could write:
This section of the documentation is intended to provide a high-level overview of the way in which formulae are interpreted and materialized by Formulaic.
Recall that the goal of a formula is to act as a recipe for building a \"model matrix\" (also known as a \"design matrix\") from an existing dataset. Following the recipe should result in a dataset that consists only of numerical columns that can be linearly combined to model an outcome/response of interest (the coefficients of which typically being estimated using linear regression). As such, this process will bake in any desired non-linearity via interactions or transforms, and will encode nominal/categorical/factor data as a collection of numerical contrasts.
The ingredients of each formula are the columns of the original dataset, and each operator acting on these columns in the formula should be thought of as inclusion/exclusion of the column in the resulting model matrix, or as a transformation on the column(s) prior to inclusion. Thus, a +
operator does not act in its usual algebraic manner, but rather acts as set union, indicating that both the left- and right-hand arguments should be included in the model matrix; a -
operator acts like a set difference; and so on.
Formulas in Formulaic are represented by (subclasses of) the Formula
class. Instances of Formula
subclasses are a ultimately containers for sets of Term
instances, which in turn are a container for a set of Factor
instances. Let's start our dissection at the bottom, and work our way up.
from formulaic.parser.types import Factor\n\nFactor(\n \"1\", eval_method=\"literal\"\n) # a factor that represents the numerical constant of 1\nFactor(\"a\") # a factor that will be looked up from the data context\nFactor(\n \"a + b\", eval_method=\"python\"\n) # a factor that will return the sum of `a` and `b`\nfrom formulaic.parser.types import Factor Factor( \"1\", eval_method=\"literal\" ) # a factor that represents the numerical constant of 1 Factor(\"a\") # a factor that will be looked up from the data context Factor( \"a + b\", eval_method=\"python\" ) # a factor that will return the sum of `a` and `b` Out[1]:
a + bIn\u00a0[2]: Copied!
from formulaic.parser.types import Term\n\nTerm(factors=[Factor(\"b\"), Factor(\"a\"), Factor(\"c\")])\nfrom formulaic.parser.types import Term Term(factors=[Factor(\"b\"), Factor(\"a\"), Factor(\"c\")]) Out[2]:
b:a:c
Note that to ensure uniqueness in the representation, the factor instances are sorted.
In\u00a0[3]: Copied!from formulaic import Formula\n\n# Unstructured formula (a simple list of terms)\nf = Formula(\n [\n Term(factors=[Factor(\"c\"), Factor(\"d\"), Factor(\"e\")]),\n Term(factors=[Factor(\"a\"), Factor(\"b\")]),\n ]\n)\nf\nfrom formulaic import Formula # Unstructured formula (a simple list of terms) f = Formula( [ Term(factors=[Factor(\"c\"), Factor(\"d\"), Factor(\"e\")]), Term(factors=[Factor(\"a\"), Factor(\"b\")]), ] ) f Out[3]:
a:b + c:d:e
Note that unstructured formulae are actually instances of SimpleFormula
(a Formula
subclass that acts like a mutable list of Term
instances):
type(f), list(f)\ntype(f), list(f) Out[4]:
(formulaic.formula.SimpleFormula, [a:b, c:d:e])
Also note that in its standard representation, the terms are separated by \"+\" which is interpreted as the set union in this context, and that (as we have seen for Term
instances) Formula
instances are sorted (the default is to sort terms only by interaction order, but this can be customized and/or disabled, as described below).
Structured formula are constructed similary:
In\u00a0[5]: Copied!f = Formula(\n [\n Term(factors=[Factor(\"root_col\")]),\n ],\n my_substructure=[\n Term(factors=[Factor(\"sub_col\")]),\n ],\n nested=Formula(\n [\n Term(factors=[Factor(\"nested_col\")]),\n Term(factors=[Factor(\"another_nested_col\")]),\n ],\n really_nested=[\n Term(factors=[Factor(\"really_nested_col\")]),\n ],\n ),\n)\nf\nf = Formula( [ Term(factors=[Factor(\"root_col\")]), ], my_substructure=[ Term(factors=[Factor(\"sub_col\")]), ], nested=Formula( [ Term(factors=[Factor(\"nested_col\")]), Term(factors=[Factor(\"another_nested_col\")]), ], really_nested=[ Term(factors=[Factor(\"really_nested_col\")]), ], ), ) f Out[5]:
root:\n root_col\n.my_substructure:\n sub_col\n.nested:\n root:\n nested_col + another_nested_col\n .really_nested:\n really_nested_col
Structured formulae are instances of StructuredFormula
:
type(f)\ntype(f) Out[6]:
formulaic.formula.StructuredFormula
And the sub-formula can be selected using:
In\u00a0[7]: Copied!f.root\nf.root Out[7]:
root_colIn\u00a0[8]: Copied!
f.nested\nf.nested Out[8]:
root:\n nested_col + another_nested_col\n.really_nested:\n really_nested_col
Formulae can also have different ordering conventions applied to them. By default, Formulaic follows R conventions around ordering, whereby terms are sorted by their interaction degree (number of factors) and then by the order in which they appeared in the term list. This behaviour can be modified to perform no ordering or full lexical sorting of terms and factors by passing _ordering=\"none\"
or _ordering=\"sort\"
respectively to the Formula
constructor. The default ordering is equivalent to passing _ordering=\"degree\"
. For example:
{\n \"degree\": Formula(\"z + z:a + z:b:a + g\"),\n \"none\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"none\"),\n \"sort\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"sort\"),\n}\n{ \"degree\": Formula(\"z + z:a + z:b:a + g\"), \"none\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"none\"), \"sort\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"sort\"), } Out[9]:
{'degree': 1 + z + g + z:a + z:b:a,\n 'none': 1 + z + z:a + z:b:a + g,\n 'sort': 1 + g + z + a:z + a:b:z}
Formulaic intentionally makes the tokenization phase as unopinionated and unstructured as possible. This allows formula grammars to be extended via plugins using only high-level APIs (usually Operator
s).
The tokenizer's role is to take an arbitrary string representation of a formula and convert it into a series of Token
instances. The tokenization phase knows very little about formula grammar except that whitespace doesn't matter and that non-word characters should be treated as operators or context indicators. Interpretation of these tokens is left to the AST generation phase. There are five different kinds of tokens: operator, name, value, python, and context. Context tokens indicate the opening and closing of nested contexts, including parentheses ()
and square brackets []
.The tokenizer treats text quoted with `
characters as a name token, and {}
are used to quote Python operations.
An example of the tokens generated can be seen below:
In [10]:
from formulaic.parser import DefaultFormulaParser

[
    f"{token.token} : {token.kind.value}"
    for token in (
        DefaultFormulaParser(include_intercept=False).get_tokens(
            "y ~ 1 + b:log(c) | `d$in^df` + {e + f}"
        )
    )
]

Out[10]:
['y : name',
 '~ : operator',
 '1 : value',
 '+ : operator',
 'b : name',
 ': : operator',
 'log(c) : python',
 '| : operator',
 'd$in^df : name',
 '+ : operator',
 'e + f : python']
The next phase is to assemble an abstract syntax tree (AST) from the tokens output above which, when evaluated, will generate the Term
instances we need to build a formula. This is done using an enriched shunting yard algorithm which determines how to interpret each operator token based on the symbol used, the number and position of the non-operator arguments, and the current context (i.e. how many parentheses deep we are). This allows us to disambiguate between, for example, unary and binary addition operators. The available operators and their implementations are described in more detail in the Formula Grammar section of this documentation. It is worth noting that the available operators can easily be modified at runtime, and this is typically all that needs to be modified in order to add new formula grammars.
The result is an AST that looks something like:
In [11]:
DefaultFormulaParser().get_ast("y ~ a + b:c")

Out[11]:
<ASTNode ~: [y, <ASTNode +: [<ASTNode +: [1, a]>, <ASTNode :: [b, c]>]>]>
Now that we have the AST, we can readily evaluate it to generate the Term
instances we need to pass to our Formula
constructor. For example:
terms = DefaultFormulaParser(include_intercept=False).get_terms("y ~ a + b:c")
terms

Out[12]:
.lhs:
    {y}
.rhs:
    {a, b:c}

In [13]:
Formula(terms)

Out[13]:
.lhs:
    y
.rhs:
    a + b:c
Of course, manually building the terms and passing them to the formula constructor is a bit annoying, and so you can instead pass the string directly to the Formula
constructor, and override the default parser if you so desire (though 99.9% of the time this wouldn't be necessary).
Thus, we can generate the same formula from above using:
In [14]:
Formula("y ~ a + b:c", _parser=DefaultFormulaParser(include_intercept=False))

Out[14]:
.lhs:
    y
.rhs:
    a + b:c
Once you have a Formula
instance, the next logical step is to use it to materialize a model matrix. This is usually as simple as passing the raw data as an argument to .get_model_matrix()
:
import pandas

data = pandas.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9], "A": ["a", "b", "c"]}
)
Formula("a + b:c").get_model_matrix(data)

Out[15]:
   Intercept  a  b:c
0        1.0  1   28
1        1.0  2   40
2        1.0  3   54
Just as for formulae, the model matrices can be structured, and will be structured in the same way as the original formula. For example:
In\u00a0[16]: Copied!Formula(\"a\", group=\"b+c\").get_model_matrix(data)\nFormula(\"a\", group=\"b+c\").get_model_matrix(data) Out[16]:
root:\n Intercept a\n 0 1.0 1\n 1 1.0 2\n 2 1.0 3\n.group:\n b c\n 0 4 7\n 1 5 8\n 2 6 9
Under the hood, both of these calls have looked at the type of the data (pandas.DataFrame
here) and then looked up the FormulaMaterializer
associated with that type (PandasMaterializer
here), and then passed the formula and data along to the materializer for materialization. It is also possible to request a specific output type that varies by materializer (PandasMaterializer
supports \"pandas\", \"numpy\", and \"sparse\"). If one is not selected, the first available output type is selected for you. Thus, the above code is equivalent to:
from formulaic.materializers import PandasMaterializer

PandasMaterializer(data).get_model_matrix(Formula("a + b:c"), output="pandas")

Out[17]:
   Intercept  a  b:c
0        1.0  1   28
1        1.0  2   40
2        1.0  3   54
The return type of .get_model_matrix()
is either a ModelMatrix
instance if the original formula was unstructured, or a ModelMatrices
instance that is just a structured container for ModelMatrix
instances. However, ModelMatrix
is an ObjectProxy subclass, and so it also acts like the type of object requested. For example:
import numpy

from formulaic import ModelMatrix

mm = Formula("a + b:c").get_model_matrix(data, output="numpy")
isinstance(mm, ModelMatrix), isinstance(mm, numpy.ndarray)

Out[18]:
(True, True)
The main purpose of this additional proxy layer is to expose the ModelSpec
instance associated with the materialization, which retains all of the encoding choices made during materialization (for reuse in subsequent materializations), as well as metadata about the feature names of the current model matrix (which is very useful when your model matrix output type doesn't have column names, like numpy or sparse arrays). This ModelSpec
instance is always available via .model_spec
, and is introduced in more detail in the Model Specs section of this documentation.
mm.model_spec

Out[19]:
ModelSpec(formula=1 + a + b:c, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b:c, scoped_terms=[b:c], columns=['b:c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})
It is sometimes convenient to have the columns in the final model matrix be clustered by numerical factors included in the terms. This means that in regression reports, for example, all of the columns related to a particular feature of interest (including its interactions with various categorical features) are contiguously clustered. This is the default behaviour in patsy. You can perform this clustering in Formulaic by passing the cluster_by=\"numerical_factors\"
argument to model_matrix
or any of the .get_model_matrix(...)
methods. For example:
Formula(\"a + b + a:A + A:b\").get_model_matrix(data, cluster_by=\"numerical_factors\")\nFormula(\"a + b + a:A + A:b\").get_model_matrix(data, cluster_by=\"numerical_factors\") Out[20]: Intercept a a:A[T.b] a:A[T.c] b A[T.b]:b A[T.c]:b 0 1.0 1 0 0 4 0 0 1 1.0 2 2 0 5 5 0 2 1.0 3 0 3 6 0 6"},{"location":"guides/formulae/#anatomy-of-a-formula","title":"Anatomy of a Formula\u00b6","text":""},{"location":"guides/formulae/#factor","title":"Factor\u00b6","text":"
Factor
instances are the atomic unit of a formula, and represent the output of a single expression evaluation. Typically this will be one vector of data, but could also be more than one column (especially common with categorically encoded data).
A Factor
instance's expression can be evaluated in one of three ways: as a literal (e.g. a numerical or string constant), as a lookup (the name of a column to be retrieved from the data context), or as a python expression (arbitrary Python code to be evaluated against the data context).
Note: Factor instances act as metadata only, and are not directly responsible for doing the evaluation. This is handled in a backend-specific way by the appropriate Materializer
instance.
In code, instantiating a factor looks like:
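from formulaic.parser.types import Factor

Factor("1", eval_method="literal")  # the numerical constant 1
Factor("a")  # looked up from the data context
Factor("a + b", eval_method="python")  # evaluates the sum of `a` and `b`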
"},{"location":"guides/formulae/#term","title":"Term\u00b6","text":"Term
instances are a thin wrapper around a set of Factor
instances, and represent the Cartesian (or Kronecker) product of the factors. If all of the Factor
instances evaluate to single columns, then the Term
represents the product of all of the factor columns.
Instantiating a Term
looks like:
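from formulaic.parser.types import Factor, Term

Term(factors=[Factor("b"), Factor("a"), Factor("c")])  # displays as: b:a:c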
Formula
instances are (potentially nested) wrappers around collections of Term
instances. During materialization into a model matrix, each Term
instance will have its columns independently inserted into the resulting matrix.
Formula
instances can consist of a single \"list\" of Term
instances, or may be \"structured\"; for example, we may want a separate collection of terms for the left- and right-hand side of a formula; or to simultaneously construct multiple model matrices for different parts of our modeling process.
For example, an unstructured formula might look like:
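from formulaic import Formula
from formulaic.parser.types import Factor, Term

Formula(
    [
        Term(factors=[Factor("c"), Factor("d"), Factor("e")]),
        Term(factors=[Factor("a"), Factor("b")]),
    ]
)  # displays as: a:b + c:d:e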
"},{"location":"guides/formulae/#parsed-formulae","title":"Parsed Formulae\u00b6","text":"While it would be possible to always manually construct Formula
instances in this way, it would quickly grow tedious. As you might have guessed from reading the quickstart or from using other implementations, this is where Wilkinson formulae come in. Formulaic has a rich, extensible formula parser that converts string expressions into the formula structures you see above. Where functionality and grammar overlap, it tries to conform to existing patterns found in R and patsy.
Formula parsing happens in three phases: tokenization of the formula string; assembly of an abstract syntax tree (AST) from the tokens; and evaluation of that AST into the Term
instances needed to build a formula. In the sections below these phases are described in more detail.
"},{"location":"guides/formulae/#tokenization","title":"Tokenization\u00b6","text":""},{"location":"guides/formulae/#abstract-syntax-tree-ast","title":"Abstract Syntax Tree (AST)\u00b6","text":""},{"location":"guides/formulae/#evaluation","title":"Evaluation\u00b6","text":""},{"location":"guides/formulae/#materialization","title":"Materialization\u00b6","text":""},{"location":"guides/grammar/","title":"Formula Grammar","text":"This section of the documentation describes the formula grammar used by Formulaic. It is almost identical that used by patsy and R, and so most formulas should work without modification. However, there are some differences, which are called out below.
"},{"location":"guides/grammar/#operators","title":"Operators","text":"In this section, we introduce a complete list of the grammatical operators that you can use by default in your formulas. They are listed such that each section (demarcated by \"-----\") has higher precedence then the block that follows. When you write a formula involving several operators of different precedence, those with higher precedence will be resolved first. \"Arity\" is the number of arguments the operator takes. Within operators of the same precedence, all binary operators are evaluated from left to right (they are left-associative). To highlight differences in grammar betweeh formulaic, patsy and R, we highlight any differences below. If there is a checkmark the Formulaic, Patsy and R columns, then the grammar is consistent across all three, unless otherwise indicated.
Operator Arity Description Formulaic Patsy R\"...\"
1 1 String literal. \u2713 \u2713 \u2717 [0-9]+\\.[0-9]+
1 1 Numerical literal. \u2713 \u2717 \u2717 `...`
1 1 Quotes fieldnames within the incoming dataframe, allowing the use of special characters, e.g. `my|special$column!`
\u2713 \u2717 \u2713 {...}
1 1 Quotes python operations, as a more convenient way to do Python operations than I(...)
, e.g. {`my|col`**2}
\u2713 \u2717 \u2717 <function>(...)
1 1 Python transform on column, e.g. my_func(x)
which is equivalent to {my_func(x)}
\u27132 \u2713 \u2717 ----- (...)
1 Groups operations, overriding normal precedence rules. All operations with the parentheses are performed before the result of these operations is permitted to be operated upon by its peers. \u2713 \u2717 \u2713 ----- .
9 0 Stands in as a wild-card for the sum of variables in the data not used on the left-hand side of a formula. \u2713 \u2717 \u2713 ----- **
2 Includes all n-th order interactions of the terms in the left operand, where n is the (integral) value of the right operand, e.g. (a+b+c)**2
is equivalent to a + b + c + a:b + a:c + b:c
. \u2713 \u2713 \u2713 ^
2 Alias for **
. \u2713 \u27173 \u2713 ----- :
2 Adds a new term that corresponds to the interaction of its operands (i.e. their elementwise product). \u27134 \u2713 \u2713 ----- *
2 Includes terms for each of the additive and interactive effects of the left and right operands, e.g. a * b
is equivalent to a + b + a:b
. \u2713 \u2713 \u2713 /
2 Adds terms describing nested effects. It expands to the addition of a new term for the left operand and the interaction of all left operand terms with the right operand, i.e. a / b
is equivalent to a + a:b
, (a + b) / c
is equivalent to a + b + a:b:c
, and a/(b+c)
is equivalent to a + a:b + a:c
.5 \u2713 \u2713 \u2713 %in%
2 As above, but with arguments inverted: e.g. b %in% a
is equivalent to a / b
. \u2713 \u2717 \u2713 ----- +
2 Adds a new term to the set of features. \u2713 \u2713 \u2713 -
2 Removes a term from the set of features (if present). \u2713 \u2713 \u2713 +
1 Returns the current term unmodified (not very useful). \u2713 \u2713 \u2713 -
1 Negates a term (only implemented for 0, in which case it is replaced with 1
). \u2713 \u2713 \u2713 ----- \\|
2 Splits a formula into multiple parts, allowing the simultaneous generation of multiple model matrices. When on the right-hand-side of the ~
operator, all parts will attract an additional intercept term by default. \u2713 \u2717 \u27136 ----- ~
1,2 Separates the target features from the input features. If absent, it is assumed that we are considering only the input features. Unless otherwise indicated, it is assumed that the input features implicitly include an intercept. \u2713 \u2713 \u2713 [ . ~ . ]
2 [Experimental] Multi stage formula notation, which is useful in (e.g.) IV contexts. Requires the MULTISTAGE
feature flag to be passed to the parser. \u2713 \u2717 \u2717"},{"location":"guides/grammar/#transforms","title":"Transforms","text":"Formulaic supports arbitrary transforms, any of which can also preserve state so that new data can undergo the same transformation as that used during modelling. The currently implemented transforms are shown below. Commonly used transforms that have not been implemented by formulaic
are explicitly noted also.
I(...)
Identity transform, allowing arbitrary Python/R operations, e.g. I(x+y)
. Note that in formulaic
, it is more idiomatic to use {x+y}
. \u2713 \u2713 \u2713 Q('<column_name>')
Look up feature by potentially exotic name, e.g. Q('wacky name!')
. Note that in formulaic
, it is more idiomatic to use `wacky name!`
. \u2713 \u2713 \u2717 C(...)
Categorically encode a column, e.g. C(x)
\u2713 \u2713 \u2713 center(...)
Shift column data so mean is zero. \u2713 \u2713 \u2717 scale(...)
Shift column so mean is zero and variance is 1. \u2713 \u27137 \u2713 standardize(...)
Alias of scale
. \u27138 \u2713 \u2717 lag(...[, <k>])
Generate lagging or leading columns (useful for datasets collected at regular intervals). \u2713 \u2717 \u2713 poly(...)
Generates a polynomial basis, allowing non-linear fits. \u2713 \u2717 \u2713 bs(...)
Generates a B-Spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 cs(...)
Generates a natural cubic spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 cr(...)
Alias for cs
above. \u2713 \u2717 \u2713 cc(...)
Generates a cyclic cubic spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 te(...)
Generates a tensor product smooth. \u2717 \u2713 \u2713 hashed(...)
Categorically encode a deterministic hash of a column. \u2713 \u2717 \u2717 ... Others? Contributions welcome! ? ? ? Tip
Any function available in the context
dictionary will also be available as a transform, along with some commonly used functions imported from numpy: log
, log10
, log2
, exp
, exp10
, and exp2
. In addition the numpy
module is always available as np
. Thus, formulas like: log(y) ~ x + 10
will always do the right thing, even when these functions have not been made available in the user namespace.
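For example (a minimal sketch, assuming only that pandas is installed alongside formulaic):

import pandas
from formulaic import model_matrix

df = pandas.DataFrame({"y": [1.0, 2.0, 4.0], "x": [0.1, 0.2, 0.3]})
# `log` and `center` are available without importing anything into the user namespace:
model_matrix("log(y) ~ center(x)", df)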
Note
Formulaic does not (yet) support including extra terms in the formula that will not result in additions to the dataframe, for example model annotations like R's offset(...)
.
Beyond the formula operator grammar itself there are some differing behaviours and conventions of which you should be aware.
Formula
R package in that both sides of the ~
operator are considered to be using the formula grammar, with the only difference being that the right-hand side attracts an intercept by default. In vanilla R, the left hand side is treated as R code (and so x + y ~ z
would result in a single column on the left-hand-side). You can recover vanilla R's behaviour by nesting the operations in a Python operator block (as described in the operator table): {y1 + y2} ~ a + b
.formulaic
with the same set of fields will always generate the same model matrix.In R, b-1
and (b-1)
both do not have an intercept, whereas in Formulaic and Patsy the parentheses are resolved first, and so the first does not have an intercept and the second does (because '1 +' is implicitly prepended to the right-hand side of the formula).This \"operator\" is actually part of the tokenisation process.\u00a0\u21a9\u21a9\u21a9\u21a9\u21a9
Formulaic additionally supports quoted fields with special characters, e.g. my_func(`my|special+column`)
.\u00a0\u21a9
The caret operator is not supported, but will not cause an error. It is ignored by the patsy formula parser, and treated as a Python XOR operation on the column.\u00a0\u21a9
Note that Formulaic also allows you to use this to scale columns, for example: 2.5:a
(this scaling happens after factor coding).\u00a0\u21a9
This somewhat confusing operator is useful when you want to include hierarchical features in your data, and where certain interaction terms do not make sense (particularly in ANOVA contexts). For example, if a
represents countries, and b
represents cities, then the full product of terms from a * b === a + b + a:b
does not make sense, because any value of b
is guaranteed to coincide with a value in a
, and does not independently add value. Thus, the operation a / b === a + a:b
results in a more sensible dataset. As a result, the /
operator is right-distributive, since if b
and c
were both nested in a
, you would want a/(b+c) === a + a:b + a:c
. Likewise, the operator is not left-distributive, since if c
is nested under both a
and b
separately, then you want (a + b)/c === a + b + a:b:c
. Lastly, if c
is nested in b
, and b
is nested in a
, then you would want a/b/c === a + a:(b/c) === a + a:b + a:b:c
.\u00a0\u21a9
Implemented by an R package called Formula that extends the default formula syntax.\u00a0\u21a9
Patsy uses the rescale
keyword rather than scale
, but provides the same functionality.\u00a0\u21a9
For increased compatibility with patsy, we use patsy's signature for standardize
.\u00a0\u21a9
Requires additional context to be passed in when directly using the Formula
constructor. e.g. Formula(\"y ~ .\", context={\"__formulaic_variables_available__\": [\"x\", \"y\", \"z\"]})
; or you can use model_matrix
, ModelSpec.get_model_matrix()
, or FormulaMaterializer.get_model_matrix()
without further specification.\u00a0\u21a9
As Formulaic matures it is expected that it will be integrated directly into downstream projects where formula parsing is required. This is known to have already happened in the following high-profile projects:
Where direct integration has not yet happened, you can still use Formulaic in conjunction with other commonly used libraries. On this page, we will add various examples of how to achieve this. If you have done some integration work, please feel free to submit a PR that extends this documentation!
statsmodels is a popular toolkit hosting many different statistical models, tests, and exploration tools. The formula API in statsmodels
is currently based on patsy
. If you need the features found in Formulaic, you can use it directly to generate the model matrices, and use the regular API. For example:
import pandas
from statsmodels.api import OLS

from formulaic import model_matrix

data = pandas.DataFrame({"y": [0.1, 0.4, 3], "a": [1, 2, 3], "b": ["A", "B", "C"]})
y, X = model_matrix("y ~ a + b", data)
model = OLS(y, X)
results = model.fit()
print(results.summary())
                     OLS Regression Results
==============================================================================
Dep. Variable:                y        R-squared:             1.000
Model:                      OLS        Adj. R-squared:          nan
Method:           Least Squares        F-statistic:             nan
Date:          Fri, 14 Apr 2023        Prob (F-statistic):      nan
Time:                  21:15:59        Log-Likelihood:       102.01
No. Observations:             3        AIC:                  -198.0
Df Residuals:                 0        BIC:                  -200.7
Df Model:                     2
Covariance Type:      nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.7857        inf         -0        nan         nan         nan
a              0.8857        inf          0        nan         nan         nan
b[T.B]        -0.5857        inf         -0        nan         nan         nan
b[T.C]         1.1286        inf          0        nan         nan         nan
==============================================================================
Omnibus:            nan        Durbin-Watson:       0.820
Prob(Omnibus):      nan        Jarque-Bera (JB):    0.476
Skew:            -0.624        Prob(JB):            0.788
Kurtosis:         1.500        Cond. No.             6.94
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.

/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 3 samples were given.
  warn("omni_normtest is not valid with less than 8 observations; %i "
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: divide by zero encountered in divide
  return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: invalid value encountered in scalar multiply
  return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1687: RuntimeWarning: divide by zero encountered in scalar divide
  return np.dot(wresid, wresid) / self.df_resid

In [2]:
from typing import Iterable, List, Optional

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

from formulaic import Formula, FormulaSpec, ModelSpec


class FormulaicTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, formula: FormulaSpec):
        self.formula: Formula = Formula.from_spec(formula)
        self.model_spec: Optional[ModelSpec] = None
        if self.formula._has_structure:
            raise ValueError(
                f"Formula specification {repr(formula)} results in a structured formula, which is not supported."
            )

    def fit(self, X, y=None):
        """
        Generate the initial model spec by which subsequent X's will be
        transformed.
        """
        self.model_spec = self.formula.get_model_matrix(X).model_spec
        return self

    def transform(self, X, y=None):
        """
        Transform `X` by generating a model matrix from it based on the fit
        model spec.
        """
        if self.model_spec is None:
            raise RuntimeError(
                "`FormulaicTransformer.fit()` must be called before `.transform()`."
            )
        X_ = self.model_spec.get_model_matrix(X)
        return X_

    def get_feature_names_out(
        self, input_features: Optional[Iterable[str]] = None
    ) -> List[str]:
        """
        Expose model spec column names to scikit learn to allow column
        transforms later in the pipeline.
        """
        if self.model_spec is None:
            raise RuntimeError(
                "`FormulaicTransformer.fit()` must be called before columns can be assigned names."
            )
        return self.model_spec.column_names


pipe = Pipeline(
    [("formula", FormulaicTransformer("x1 + x2 + x3")), ("model", LinearRegression())]
)
pipe_fit = pipe.fit(
    pandas.DataFrame({"x1": [1, 2, 3], "x2": [2, 3.4, 6], "x3": [7, 3, 1]}),
    y=pandas.Series([1, 3, 5]),
)
pipe_fit
# Note: You could optionally serialize `pipe_fit` here.
# Then: Use the pipe to predict outcomes for new data.

Out[2]:
Pipeline(steps=[('formula', FormulaicTransformer(formula=1 + x1 + x2 + x3)),
                ('model', LinearRegression())])
"},{"location":"guides/integration/#statsmodels","title":"StatsModels\u00b6","text":""},{"location":"guides/integration/#scikit-learn","title":"Scikit-Learn\u00b6","text":"
scikit-learn is a very popular machine learning toolkit for Python. You can use Formulaic directly, as for statsmodels
, or as a module in scikit-learn pipelines along the lines of:
Sooner or later, you will encounter datasets with null values, and it is important to know how their presence will impact your modeling. Formulaic model matrix materialization procedures allow you to specify how you want nulls to be handled. You can either: drop rows containing null values (the default), ignore the nulls and allow them to propagate into the model matrix, or raise an exception when null values are encountered.
You can specify the desired behaviour by passing an NAAction
enum value (or string value thereof) to the materialization methods (model_matrix
, and *.get_model_matrix()
). Examples of each of these approaches are shown below.
from pandas import Categorical, DataFrame

from formulaic import model_matrix
from formulaic.materializers import NAAction

df = DataFrame(
    {
        "c": [1, 2, None, 4, 5],
        "C": Categorical(
            ["a", "b", "c", None, "e"], categories=["a", "b", "c", "d", "e"]
        ),
    }
)

model_matrix("c + C", df, na_action=NAAction.DROP)
# Equivalent to:
# * model_matrix("c + C", df)
# * model_matrix("c + C", df, na_action="drop")

Out[36]:
   Intercept    c  C[T.b]  C[T.c]  C[T.d]  C[T.e]
0        1.0  1.0       0       0       0       0
1        1.0  2.0       1       0       0       0
4        1.0  5.0       0       0       0       1
You can also specify additional rows to drop using the drop_rows
argument:
model_matrix(\"c + C\", df, drop_rows={0, 4})\nmodel_matrix(\"c + C\", df, drop_rows={0, 4}) Out[24]: Intercept c C[T.b] C[T.c] C[T.d] C[T.e] 1 1.0 2.0 1 0 0 0
Note that the set passed to drop_rows
is expected to be mutable, as it will also be updated with the indices of any rows dropped automatically; this can be useful if you need to keep track of this information outside of the materialization procedure.
drop_rows = {0, 4}
model_matrix("c + C", df, drop_rows=drop_rows)
drop_rows

Out[25]:
{0, np.int64(2), np.int64(3), 4}

In [31]:
model_matrix(\"c + C\", df, na_action=\"ignore\")\nmodel_matrix(\"c + C\", df, na_action=\"ignore\") Out[31]: Intercept c C[T.b] C[T.c] C[T.d] C[T.e] 0 1.0 1.0 0 0 0 0 1 1.0 2.0 1 0 0 0 2 1.0 NaN 0 1 0 0 3 1.0 4.0 0 0 0 0 4 1.0 5.0 0 0 0 1
Note the NaN
in the c
column, and that NaN
does NOT appear in the dummy coding of C on row 3, consistent with standard implementations of dummy coding. This could result in misleading model estimates, so care should be taken.
You can combine this with drop_rows
, as described above, to manually filter out the null values you are concerned about.
try:
    model_matrix("c + C", df, na_action="raise")
except Exception as e:
    print(e)
Error encountered while checking for nulls in `C`: `C` contains null values after evaluation.\n
As with ignoring nulls above, you can combine this raising behaviour with drop_rows
to manually filter out the null values that you feel you can safely ignore, and then raise if any additional null values make it into your data.
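For example (a minimal sketch, continuing with the df defined above):

# Pre-approve dropping rows 2 and 3 (whose nulls we know about), and raise
# only if any *other* null values are encountered:
model_matrix("c + C", df, drop_rows={2, 3}, na_action="raise")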
NAAction.DROP
, or \"drop\"
)\u00b6","text":"This is the default behaviour, and will result in any row with a null in any column that is being used by the materialization being dropped from the resulting dataset. For example:
"},{"location":"guides/missing_data/#ignore-nulls-naactionignore-or-ignore","title":"Ignore nulls (NAAction.IGNORE
, or \"ignore\"
)\u00b6","text":"If your modeling toolkit can handle the presence of nulls, or you otherwise want to keep them in the dataset, you can pass na_action = \"ignore\"
to the materialization methods. This will allow null values to remain in columns, and take no action to prevent the propagation of nulls.
NAAction.RAISE
or \"raise\"
)\u00b6","text":"If you are unwilling to risk the perils of dropping or ignoring null values, you can instead opt to raise an exception whenever a null value is found. This can prevent yourself from accidentally biasing your model, but also makes your code more brittle. For example:
"},{"location":"guides/model_specs/","title":"Model Specs","text":"While Formula
instances (discussed in How it works) are the source of truth for abstract user intent, ModelSpec
instances are the source of truth for the materialization process; and bundle a Formula
instance with explicit metadata about the encoding choices that were made (or should be made) when a formula was (or will be) materialized. As soon as materialization begins, Formula
instances are upgraded into ModelSpec
instances, and any missing metadata is attached as decisions are made during the materialization process.
Besides acting as runtime state during materialization, it serves two main purposes: it provides rich metadata about the generated model matrices, and it allows the materialization process to be replayed on new data. Once a Formula
has been materialized, you can use the generated ModelSpec
instance to repeat the process on similar datasets, confident that the encoding choices will be identical. This is especially useful during out-of-sample prediction, where you need to prepare the out-of-sample data in exactly the same way as the training data for the predictions to be valid. In the remainder of this portion of the documentation, we will introduce how to leverage the metadata stored inside ModelSpec
instances derived from materializations, and for more advanced programmatic use-cases, how to manually build a ModelSpec
.
# Let's get ourselves a simple `ModelMatrix` instance to play with.
from pandas import DataFrame

from formulaic import model_matrix

mm = model_matrix("center(a) + b", DataFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]}))
mm

Out[1]:
   Intercept  center(a)  b[T.B]  b[T.C]
0        1.0       -1.0       0       0
1        1.0        0.0       1       0
2        1.0        1.0       0       1

In [2]:
# And extract the model spec from it
ms = mm.model_spec
ms

Out[2]:
ModelSpec(formula=1 + center(a) + b, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=center(a), scoped_terms=[center(a)], columns=['center(a)']), EncodedTermStructure(term=b, scoped_terms=[b-], columns=['b[T.B]', 'b[T.C]'])], transform_state={'center(a)': {'ddof': 1, 'center': np.float64(2.0), 'scale': None}}, encoder_state={'center(a)': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C'], 'contrasts': ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])})})In\u00a0[3]: Copied!
# We can now interrogate it for various column, factor, term, and variable related metadata
{
    "column_names": ms.column_names,
    "column_indices": ms.column_indices,
    "terms": ms.terms,
    "term_indices": ms.term_indices,
    "term_slices": ms.term_slices,
    "term_factors": ms.term_factors,
    "term_variables": ms.term_variables,
    "factors": ms.factors,
    "factor_terms": ms.factor_terms,
    "factor_variables": ms.factor_variables,
    "factor_contrasts": ms.factor_contrasts,
    "variables": ms.variables,
    "variable_terms": ms.variable_terms,
    "variable_indices": ms.variable_indices,
    "variables_by_source": ms.variables_by_source,
}

Out[3]:
{'column_names': ('Intercept', 'center(a)', 'b[T.B]', 'b[T.C]'),
 'column_indices': {'Intercept': 0, 'center(a)': 1, 'b[T.B]': 2, 'b[T.C]': 3},
 'terms': [1, center(a), b],
 'term_indices': {1: [0], center(a): [1], b: [2, 3]},
 'term_slices': {1: slice(0, 1, None),
  center(a): slice(1, 2, None),
  b: slice(2, 4, None)},
 'term_factors': {1: {1}, center(a): {center(a)}, b: {b}},
 'term_variables': {1: set(), center(a): {'a', 'center'}, b: {'b'}},
 'factors': {1, b, center(a)},
 'factor_terms': {1: {1}, center(a): {center(a)}, b: {b}},
 'factor_variables': {b: {'b'}, 1: set(), center(a): {'a', 'center'}},
 'factor_contrasts': {b: ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])},
 'variables': {'a', 'b', 'center'},
 'variable_terms': {'center': {center(a)}, 'a': {center(a)}, 'b': {b}},
 'variable_indices': {'center': [1], 'a': [1], 'b': [2, 3]},
 'variables_by_source': {'transforms': {'center'}, 'data': {'a', 'b'}}}

In [4]:
# And use it to select out various parts of the model matrix; here the columns
# produced by the `b` term.
mm.iloc[:, ms.term_indices["b"]]

Out[4]:
   b[T.B]  b[T.C]
0       0       0
1       1       0
2       0       1
Some of this metadata may seem redundant at first, but this kind of metadata is essential when the generated model matrix does not natively support indexing by names; for example:
In [5]:
mm_numpy = model_matrix(
    "center(a) + b", DataFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]}), output="numpy"
)
mm_numpy

Out[5]:
array([[ 1., -1.,  0.,  0.],
       [ 1.,  0.,  1.,  0.],
       [ 1.,  1.,  0.,  1.]])

In [6]:
ms_numpy = mm_numpy.model_spec
mm_numpy[:, ms_numpy.term_indices["b"]]

Out[6]:
array([[0., 0.],
       [1., 0.],
       [0., 1.]])

In [7]:
ms.get_model_matrix(DataFrame({"a": [4, 5, 6], "b": ["A", "B", "D"]}))

/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:169: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {'D'}. They are being cast to nan, which will likely skew the results of your analyses.
  warnings.warn(

Out[7]:
   Intercept  center(a)  b[T.B]  b[T.C]
0        1.0        2.0       0       0
1        1.0        3.0       1       0
2        1.0        4.0       0       0
Notice that when the assumptions of the stateful transforms are violated, warnings and/or exceptions will be generated.
You can also just pass the ModelSpec
directly to model_matrix
, for example:
model_matrix(ms, data=DataFrame({"a": [4, 5, 6], "b": ["A", "A", "A"]}))

Out[8]:
   Intercept  center(a)  b[T.B]  b[T.C]
0        1.0        2.0       0       0
1        1.0        3.0       0       0
2        1.0        4.0       0       0

In [9]:
from formulaic import ModelSpec

ms = ModelSpec("a+b+c", output="numpy", ensure_full_rank=False)
ms

Out[9]:
ModelSpec(formula=1 + a + b + c, materializer=None, materializer_params=None, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})In\u00a0[10]: Copied!
import pandas

mm = ms.get_model_matrix(
    pandas.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
)
mm

Out[10]:
array([[1., 1., 4., 7.],
       [1., 2., 5., 8.],
       [1., 3., 6., 9.]])

In [11]:
mm.model_spec

Out[11]:
ModelSpec(formula=1 + a + b + c, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=c, scoped_terms=[c], columns=['c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})
Notice that any missing fields not provided by the user are imputed automatically.
In [12]:
from formulaic import Formula, ModelSpecs

ModelSpecs(
    ModelSpec("a"), substructure=ModelSpec("b"), another_substructure=ModelSpec("c")
)

Out[12]:
root:
    ModelSpec(formula=1 + a, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.substructure:
    ModelSpec(formula=1 + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.another_substructure:
    ModelSpec(formula=1 + c, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})

In [13]:
ModelSpec.from_spec(Formula(lhs="y", rhs="a + b"))

Out[13]:
.lhs:
    ModelSpec(formula=y, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.rhs:
    ModelSpec(formula=a + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
Some operations, such as ModelSpec.subset(...)
are also accessible in a mapped way (e.g. via ModelSpecs.subset(...)
). You can find documentation for the complete set of available methods using help(ModelSpecs)
.
ModelSpec
and ModelSpecs
instances have been designed to support serialization via the standard pickling process offered by Python. This allows model specs to be persisted into storage and reloaded at a later time, or used in multiprocessing scenarios.
Serialized model specs are not guaranteed to work between different versions of formulaic. While things will work in the vast majority of cases, the internal state of transforms is free to change from version to version, and may invalidate previously serialized model specs. Efforts will be made to reduce the likelihood of this, and when it happens it should be indicated in the changelogs.
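For example, a minimal sketch of a pickle round-trip:

import pickle

from pandas import DataFrame
from formulaic import model_matrix

spec = model_matrix("center(a)", DataFrame({"a": [1, 2, 3]})).model_spec

# Persist and restore the spec (e.g. across processes or sessions), then
# re-use the stored transform state on new data:
restored = pickle.loads(pickle.dumps(spec))
restored.get_model_matrix(DataFrame({"a": [4, 5, 6]}))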
"},{"location":"guides/model_specs/#anatomy-of-a-modelspec-instance","title":"Anatomy of aModelSpec
instance.\u00b6","text":"As noted above, a ModelSpec
is the complete specification and record of the materialization process, combining all user-specified parameters with the runtime state of the materializer. In particular, ModelSpec
instances have the following explicitly specifiable attributes: formula, materializer, materializer_params, ensure_full_rank, na_action, output, cluster_by, structure, transform_state, and encoder_state (as seen in the ModelSpec reprs shown on this page).
Often, only formula
is explicitly specified, and the rest is inferred on the user's behalf.
ModelSpec
instances also have derived properties and methods that you can use to introspect the structure of generated model matrices. These derived methods assume that the ModelSpec
has been fully populated, and thus usually only make sense to consider on ModelSpec
instances that are attached to a ModelMatrix
. They are:
.column_indices
.Term
instances that were used to generate this model matrix.Term
instances to the generated column indices..term_indices
using formulae.Term
instances to a slice that when used on the columns of the model matrix will subsample the model matrix down to those corresponding to each term.Term
instances to the set of factors used by that term.Term
instances to Variable
instances (a string subclass with addition attributes of roles
and source
), indicating the variables used by that term.Factor
instances used in the entire formula.Factor
instances to the Term
instances that used them.Factor
instances to Variable
instances, corresponding to the variables used by that factor.Factor
instances to ContrastsState
instances that can be used to reproduce the coding matrices used during materialization.Variable
instances describing the variables used in the entire formula.term_variables
.Variable
instance to the indices of the columns in the model matrix associated with that variable..variable_indices
.\"data\"
, \"context\"
, or \"transforms\"
) to the variables derived from that source.Term
instance, its string representation, a column name, or pre-specified ints/slices.ModelSpec
instance with the nominated attributes mutated.ModelSpec
instance with its structure subset to correspond to a strict subset of the terms indicated by a formula specification.
for more details.
ModelSpec
as metadata\u00b6","text":"One of the most common use-cases for ModelSpec
instances is as metadata to describe a generated model matrix. This metadata can be used to programmatically access the appropriate features in the model matrix in order (e.g.) to assign sensible names to the coefficients fit during a regression.
Another common use-case for ModelSpec
instances is replaying the same materialization process used to prepare a training dataset on a new dataset. Since the ModelSpec
instance stores all relevant choices made during materialization achieving this is a simple as using using the ModelSpec
to generate the new model matrix.
By way of example, recall from above section that we used the formula
center(a) + b
where a
was a numerical vector, and b
was a categorical vector. When generating model matrices for subsequent datasets it is very important to use the same centering used during the initial model matrix generation, and not just center the incoming data again. Likewise, b
should be aware of which categories were present during the initial training, and ensure that the same columns are created during subsequent materializations (otherwise the model matrices will not be of the same form, and cannot be used for predictions/etc). These kinds of transforms that require memory are called \"stateful transforms\" in Formulaic, and are described in more detail in the Transforms documentation.
We can see this in action below:
"},{"location":"guides/model_specs/#directly-constructing-modelspec-instances","title":"Directly constructingModelSpec
instances\u00b6","text":"It is possible to directly construct Model Matrices, and to prepopulate them with various choices (e.g. output types, materializer, etc). You could even, in principle, populate them with state information (but this is not recommended; it is easy to make mistakes here, and is likely better to encode these choices into the formula itself where possible). For example:
"},{"location":"guides/model_specs/#structured-modelspecs","title":"StructuredModelSpecs
\u00b6","text":"As discussed in How it works, formulae can be arbitrarily structured, resulting in a similarly structured set of model matrices. ModelSpec
instances can also be arranged into a structured collection using ModelSpecs
, allowing different choices to be made at different levels of the structure. You can either create these structures yourself, or inherit the structure from a formula. For example:
This document provides high-level documentation on how to get started using Formulaic.
In [1]:
import pandas

from formulaic import model_matrix

df = pandas.DataFrame(
    {
        "y": [0, 1, 2],
        "a": ["A", "B", "C"],
        "b": [0.3, 0.1, 0.2],
    }
)

y, X = model_matrix("y ~ a + b + a:b", df)
# This is short-hand for:
# y, X = formulaic.Formula('y ~ a + b + a:b').get_model_matrix(df)

In [2]:
y

Out[2]:
   y
0  0
1  1
2  2

In [3]:
X

Out[3]:
   Intercept  a[T.B]  a[T.C]    b  a[T.B]:b  a[T.C]:b
0        1.0       0       0  0.3       0.0       0.0
1        1.0       1       0  0.1       0.1       0.0
2        1.0       0       1  0.2       0.0       0.2
You will notice that the categorical values for a
have been one-hot (aka dummy) encoded, and to ensure structural full-rankness of X
[^1], one level has been dropped from a
. For more details about how this guarantees that the matrix is full-rank, please refer to the excellent patsy documentation. If you are not using the model matrices for regression, and don't care if the matrix is not full-rank, you can pass ensure_full_rank=False
:
X = model_matrix(\"a + b + a:b\", df, ensure_full_rank=False)\nX\nX = model_matrix(\"a + b + a:b\", df, ensure_full_rank=False) X Out[4]: Intercept a[T.A] a[T.B] a[T.C] b a[T.A]:b a[T.B]:b a[T.C]:b 0 1.0 1 0 0 0.3 0.3 0.0 0.0 1 1.0 0 1 0 0.1 0.0 0.1 0.0 2 1.0 0 0 1 0.2 0.0 0.0 0.2
Note that the dropped level in a
has been restored.
There is a rich trove of information about the columns and structure of the model matrix stored in the ModelSpec
instance attached to the model matrix, for example:
X.model_spec

Out[5]:
ModelSpec(formula=1 + a + b + a:b, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a[T.A]', 'a[T.B]', 'a[T.C]']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=a:b, scoped_terms=[a:b], columns=['a[T.A]:b', 'a[T.B]:b', 'a[T.C]:b'])], transform_state={}, encoder_state={'a': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C']}), 'b': (<Kind.NUMERICAL: 'numerical'>, {})})
You can read more about the model specs in the Model Specs documentation.
In\u00a0[6]: Copied!X = model_matrix(\"a + b + a:b\", df, output=\"sparse\")\nX\nX = model_matrix(\"a + b + a:b\", df, output=\"sparse\") X Out[6]:
<3x6 sparse matrix of type '<class 'numpy.float64'>'\n\twith 10 stored elements in Compressed Sparse Column format>
In this example, X
is a $ 3 \\times 6 $ scipy.sparse.csc_matrix
instance.
Since sparse matrices do not have labels for columns, you can look these up from the model spec described above; for example:
In\u00a0[7]: Copied!X.model_spec.column_names\nX.model_spec.column_names Out[7]:
('Intercept', 'a[T.B]', 'a[T.C]', 'b', 'a[T.B]:b', 'a[T.C]:b')
[^1]: X
must be full-rank in order for the regression algorithm to invert a matrix derived from X
.
In formulaic
, the simplest way to build your model matrices is to use the high-level model_matrix
function:
By default, the generated model matrices are dense. In some cases, particularly in large datasets with many categorical features, dense model matrices become hugely memory inefficient (since most entries of the data will be zero). Formulaic allows you to directly generate sparse model matrices using:
"},{"location":"guides/splines/","title":"Spline Encoding","text":"Formulaic offers several spline encoding transforms that allow you to model non-linear responses to continuous variables using linear models. They are:
poly
: projection onto orthogonal polynomial basis functions.basis_spline
(bs
in formulae): projection onto a basis spline (B-spline) basis.cubic_spline (cc and cr in formulae): projection onto cyclic or natural cubic spline bases.These are all implemented as stateful transforms and described in more detail below.
In [1]:
import matplotlib.pyplot as plt
import numpy
import pandas
from statsmodels.api import OLS

from formulaic import model_matrix

# Build some data, and hard-code "y" as a quartic function with some Gaussian noise.
data = pandas.DataFrame(
    {
        "x": numpy.linspace(0.0, 1.0, 100),
    }
).assign(
    y=lambda df: df.x
    + 0.2 * df.x**2
    - 0.7 * df.x**3
    + 3 * df.x**4
    + 0.1 * numpy.random.randn(100)
)

# Generate a model matrix with a polynomial coding in "x".
y, X = model_matrix("y ~ poly(x, degree=3)", data, output="numpy")

# Fit coefficients for the intercept and polynomial basis.
coeffs = OLS(y[:, 0], X).fit().params

# Plot the polynomial basis functions, each weighted by its fit coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit curve itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit spline")

plt.legend();

In [11]:
# Generate a model matrix with a basis spline in "x".
y, X = model_matrix("y ~ bs(x, df=4, degree=3)", data, output="numpy")

# Fit coefficients for the intercept and basis spline.
coeffs = OLS(y[:, 0], X).fit().params

# Plot the B-spline basis functions, each weighted by its fit coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit spline itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit spline")

plt.legend();

In [18]:
x = numpy.linspace(0.0, 2 * numpy.pi, 100)

data = pandas.DataFrame(
    {
        "x": x,
    }
).assign(
    y=lambda df: 2
    + numpy.sin(x)
    - x * numpy.sin(2 * x)
    + 4 * numpy.sin(x / 7)
    + 0.1 * numpy.random.randn(100)
)

# Generate a model matrix with a cyclic cubic spline coding in "x".
y, X = model_matrix("y ~ 1 + cc(x, df=4, constraints='center')", data, output="numpy")

# Fit coefficients for the intercept and spline basis.
coeffs = OLS(y[:, 0], X).fit().params

# Plot the cubic spline basis functions, each weighted by its fitted coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit spline itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit spline")

plt.legend();
"},{"location":"guides/splines/#poly","title":"
poly
\u00b6","text":"The simplest way to generate a non-linear response in a linear model is to include higher order powers of numerical variables. For example, you might want to include: $x$, $x^2$, $x^3$, $\\ldots$, $x^n$. However, these features are not orthogonal, and so adding each term one by one in a regression model will lead to all previously trained coefficients changing. Especially in exploratory analysis, this can be frustrating, and that's where poly
comes in. By default, poly
iteratively builds orthogonal polynomial features up to the order specified.
poly
has two parameters:
- x: the numerical vector for which the polynomial basis should be generated.
- degree: the maximum degree of polynomial features to generate (e.g. poly(x, degree=3) as in the example above).
For those who are mathematically inclined, this transform is an implementation of the \"three-term recurrence relation\" for monic orthogonal polynomials. There are many good introductions to these recurrence relations, including: (at the time of writing) https://dec41.user.srcf.net/h/IB_L/numerical_analysis/2_3. Another common approach is QR factorisation where the columns of Q are the orthogonal basis vectors. A pre-existing implementation of this can be found in numpy
; however, our implementation outperforms numpy's QR decomposition, and does not require the needless computation of the R matrix. It should also be noted that orthogonal polynomial bases are unique up to the choice of inner-product and scaling, and so all methods will result in the same set of polynomials.
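For reference, the textbook form of this recurrence (a sketch in standard notation, not necessarily the exact conventions used internally) is: with \(p_{-1}(x) = 0\) and \(p_0(x) = 1\), the monic orthogonal polynomials satisfy \[ p_{k+1}(x) = (x - \alpha_k)\, p_k(x) - \beta_k\, p_{k-1}(x), \qquad \alpha_k = \frac{\langle x\, p_k, p_k \rangle}{\langle p_k, p_k \rangle}, \qquad \beta_k = \frac{\langle p_k, p_k \rangle}{\langle p_{k-1}, p_{k-1} \rangle}, \] where the inner products are evaluated empirically over the observed data points.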
Note
When used as a stateful transform, we retain the coefficients that uniquely define the polynomials; and so new data will be evaluated against the same polynomial bases as the original dataset. However, the polynomial basis will almost certainly *not* be orthogonal for the new data. This is because changing the incoming dataset is equivalent to changing your choice of inner product.
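To make this concrete, here is a minimal sketch (reusing the data and imports from the example cells above) of how the fitted state is carried over to new data:
# Fit the orthogonal polynomial basis on the original data...
mm = model_matrix("poly(x, degree=2)", data)

# ...and reuse the stored coefficients to encode new data: the same
# polynomial bases are applied, though they will generally not be
# orthogonal over the new sample.
new_data = pandas.DataFrame({"x": numpy.linspace(0.5, 1.5, 10)})
mm_new = mm.model_spec.get_model_matrix(new_data)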
Example:
"},{"location":"guides/splines/#basis_spline-or-bs","title":"basis_spline
(or bs
)\u00b6","text":"If you were to attempt to fit a complex function over a large domain using poly
, it is highly likely that you would need to use a very large degree for the polynomial basis. However, this can lead to overfitting and/or high-frequency oscillations (see Runge's phenomenon). An alternative approach is to use piece-wise polynomial curves of lower degree, with smoothness conditions on the \"knots\" between each of the polynomial pieces. This limits overfitting while still offering the flexibility required to model very complex non-linearities.
Basis-splines (or B-splines) are a popular choice for generating a basis for such polynomials, with many attractive features such as maximal smoothness around each of the knots, and minimal support given such smoothness.
Formulaic has its own implementation of basis_spline
that is API compatible (where features overlap) with R, and is more performant than existing Python implementations for our use-cases (such as splev
from scipy
). For compatibility with R
and patsy
, basis_spline
is available as bs
in formulae.
basis_spline
(or bs
) has eight parameters:
- x: The numerical vector to be transformed.
- df: If specified, the number of degrees of freedom to use for the spline. In this case, knots will be automatically generated such that they are df - degree (minus one if include_intercept is True) equally spaced quantiles. You cannot specify both df and knots.
- knots: The internal breakpoints of the B-Spline. If not specified, they default to the empty set (unless df is specified), in which case the ordinary polynomial (Bezier) basis is generated.
- degree: The degree of the piecewise polynomial components (default: 3).
- include_intercept: Whether to return a complete (full-rank) basis. Note that if ensure_full_rank=True is passed to the materializer, then the intercept will (depending on context) nevertheless be omitted.
- lower_bound: The lower bound of the domain of the B-Spline. If not specified, this is determined from x.
- upper_bound: The upper bound of the domain of the B-Spline. If not specified, this is determined from x.
- extrapolation: Selects how extrapolation should be performed when values in x extend beyond the lower and upper bounds. Valid values are:
  - 'raise' (the default): Raises a ValueError if there are any values in x outside the B-Spline domain.
  - 'clip': Any values above/below the domain are set to the upper/lower bounds.
  - 'na': Any values outside of bounds are set to numpy.nan.
  - 'zero': Any values outside of bounds are set to 0.
  - 'extend': Any values outside of bounds are computed by extending the polynomials of the B-Spline (this is the same as the default in R).
The algorithm used to generate the basis splines is a slightly generalised version of the "Cox-de Boor" algorithm, extended by this author to allow for extrapolations (although this author doubts this is terribly novel). If you would like to learn more about B-Splines, the primer put together by Jeffrey Racine is an excellent resource.
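As a quick illustration of the extrapolation options, a minimal sketch (reusing the data and imports from the example cells above):
# Fit the spline basis over the observed domain of `x`...
mm = model_matrix("bs(x, df=4, extrapolation='extend')", data)

# ...then evaluate the same basis on out-of-domain values; with 'extend',
# the polynomial pieces are extrapolated rather than raising an error.
wider = pandas.DataFrame({"x": numpy.linspace(-0.5, 1.5, 9)})
mm_wider = mm.model_spec.get_model_matrix(wider)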
As a stateful transform, we only keep track of knots
, lower_bound
and upper_bound
, which are sufficient given that all other information must be explicitly specified.
Example, reusing the data and imports from above:
"},{"location":"guides/splines/#cubic_spline-crcs-and-cc","title":"cubic_spline
(cr
/cs
and cc
)\u00b6","text":"While the basis_spline
transform above is capable of generating cubic splines, it is sometimes helpful to be able to generate cubic splines that satisfy various additional constraints (including direct constraints on the parameters of the spline, or indirect ones such as cyclicity). To that end, Formulaic implements direct support for generating natural and cyclic cubic splines with constraints (via the cr
/cs
and cc
transforms respectively), borrowing much of the implementation from patsy
. These splines are compatible with R's mgcv
, and share the nice features of the basis-splines above, including continuous first and second derivatives, and general applicability to interpolation/smoothing. Note that cr
and cs
generate identical splines, but are both included for compatibility with R.
In practice, the reason that we focus on cubic (as compared to quadratic or quartic) splines is that they offer a nice compromise: they are the lowest-degree splines with continuous first and second derivatives (discontinuities in higher derivatives are generally imperceptible to the human eye), while remaining flexible enough for most smoothing applications and cheap to compute.
All of cr
, cs
, and cc
are configurations of the cubic_spline
transform, and have seven parameters:
- x: The numerical vector to be transformed, as passed to cc or cr. If using an existing model_spec with a fitted transformation, the x values are only used to produce the locations for the fitted values of the spline.
- df: If specified, the number of degrees of freedom to use for the spline. In this case, knots will be automatically generated such that they are df - degree equally spaced quantiles. You cannot specify both df and knots.
- knots: The internal breakpoints of the spline. If not specified, they default to the empty set (unless df is specified).
- lower_bound: The lower bound of the domain of the spline. If not specified, this is determined from x.
- upper_bound: The upper bound of the domain of the spline. If not specified, this is determined from x.
- constraints: Either a 2-d array defining general linear constraints (that is, np.dot(constraints, betas) is zero, where betas denotes the array of initial parameters, corresponding to the initial unconstrained model matrix), or the string 'center' indicating that we should apply a centering constraint (this constraint will be computed from the input data, remembered and re-used for prediction from the fitted model). The constraints are absorbed in the resulting design matrix, which means that the model is actually rewritten in terms of unconstrained parameters.
- extrapolation: Selects how extrapolation should be performed when values in x extend beyond the lower and upper bounds. Valid values are:
  - 'raise' (the default): Raises a ValueError if there are any values in x outside the spline domain.
  - 'clip': Any values above/below the domain are set to the upper/lower bounds.
  - 'na': Any values outside of bounds are set to numpy.nan.
  - 'zero': Any values outside of bounds are set to 0.
  - 'extend': Any values outside of bounds are computed by extending the polynomials of the spline (this is the same as the default in R).
As a stateful transform, we only keep track of knots
, lower_bound
, upper_bound
, constraints
, and cyclic
which are sufficient given that all other information must be explicitly specified.
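For a natural (rather than cyclic) cubic spline the usage is identical; a minimal sketch, reusing the data and imports from the example cells above:
# Natural cubic spline basis with a centering constraint.
y, X = model_matrix("y ~ cr(x, df=4, constraints='center')", data, output="numpy")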
Example, reusing the data and imports from above:
"},{"location":"guides/transforms/","title":"Transforms","text":"A transform in Formulaic is any function that is called to modify factor values during the evaluation of a Factor
(see the How it works documentation). Any function can be used as a transform, so long as it is present in the evaluation context (see below).
There are two types of transform:
- Regular transforms: ordinary functions that are applied to the data anew each time, with no memory of previous invocations; for example, you could apply the numpy.cumsum function to any vector being fed into the model matrix materialization procedure.
- Stateful transforms: transforms that record any state generated during their first evaluation (such as the mean of a column), so that exactly the same transformation can be replayed on new data.
In the below we describe how to make a function available for use as a transform during materialization, demonstrate this for regular transforms, and then introduce how to use already implemented stateful transforms and/or write your own.
In [1]:
import pandas

from formulaic import Formula, model_matrix


def my_transform(col: pandas.Series) -> pandas.Series:
    return col**2
In [2]:
# Local context is automatically added
model_matrix("a + my_transform(a)", pandas.DataFrame({"a": [1, 2, 3]}))
Out[2]:
   Intercept  a  my_transform(a)
0        1.0  1                1
1        1.0  2                4
2        1.0  3                9
In [3]:
# Manually add `my_transform` to the context
Formula("a + my_transform(a)").get_model_matrix(
    pandas.DataFrame({"a": [1, 2, 3]}),
    context={"my_transform": my_transform},  # could also use: context=locals()
)
Out[3]:
   Intercept  a  my_transform(a)
0        1.0  1                1
1        1.0  2                4
2        1.0  3                9
In [4]:
from formulaic.transforms import center, scale

scale(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8]))
Out[4]:
array([-1.42886902, -1.02062073, -0.61237244, -0.20412415,  0.20412415,
        0.61237244,  1.02062073,  1.42886902])
In [5]:
center(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8]))
Out[5]:
array([-3.5, -2.5, -1.5, -0.5,  0.5,  1.5,  2.5,  3.5])
In [6]:
import numpy

from formulaic.transforms import stateful_transform


@stateful_transform
def center(data, _state=None, _metadata=None, _spec=None):
    print("state", _state)
    print("metadata", _metadata)
    print("spec", _spec)
    if "mean" not in _state:
        _state["mean"] = numpy.mean(data)
    return data - _state["mean"]


state = {}
center(pandas.Series([1, 2, 3]), _state=state)
state {}
metadata None
spec ModelSpec(formula=, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, structure=None, transform_state={}, encoder_state={})
Out[6]:
0   -1.0
1    0.0
2    1.0
dtype: float64
In [7]:
state
Out[7]:
{'mean': 2.0}
The mutated state object is then automatically stored by Formulaic into the right context in the appropriate ModelSpec
instance for reuse as necessary.
If you wanted to leverage the single dispatch functionality, you could do something like:
In [8]:
from formulaic.transforms import stateful_transform


@stateful_transform
def center(data, _state=None, _metadata=None, _spec=None):
    raise ValueError(f"No implementation for data of type {repr(type(data))}")


@center.register(pandas.Series)
def _(data, _state=None, _metadata=None, _spec=None):
    if "mean" not in _state:
        _state["mean"] = numpy.mean(data)
    return data - _state["mean"]
Note
If taking advantage of the single dispatch functionality, it is important that the top-level function has exactly the same signature as the type specific implementations.
"},{"location":"guides/transforms/#adding-transforms-to-the-evaluation-context","title":"Adding transforms to the evaluation context\u00b6","text":"The only requirement for using a transform in formula is making it available in the execution context. The evaluation context is always pre-seeded with:
- the numpy module (available as np).
- log: numpy.log.
- log10: numpy.log10.
- log2: numpy.log2.
- exp: numpy.exp.
- exp10: numpy.exp10.
- exp2: numpy.exp2.
- inline Python expressions (quoted using the {<expr>} syntax).
The evaluation context can be extended to include arbitrary additional functions. If you are using the top-level model_matrix
function then the local context in which model_matrix
is called is automatically added to the execution context, otherwise you need to manually specify this context. For example:
In Formulaic, a stateful transform is just a regular callable object (typically a function) that has an attribute __is_stateful_transform__
that is set to True
. Such callables will be passed up to three additional arguments by formulaic if they are present in the callable signature:
_state
: The existing state or an empty dictionary that should be mutated to record any additional state._metadata
: An additional metadata dictionary about the factor, or None
. Will typically only be present if the Factor
metadata is populated._spec
: The current model spec being evaluated (or an empty ModelSpec
if being called outside of Formulaic's materialization routines).Only _state
is required; _metadata
and _spec
will only be passed in by Formulaic if they are present in the callable signature.
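As a minimal sketch of implementing this interface directly (without the stateful_transform decorator introduced below):
import numpy


def demean(data, _state):
    # Record the mean on the first evaluation, and reuse it thereafter.
    if "mean" not in _state:
        _state["mean"] = numpy.mean(data)
    return data - _state["mean"]


# Mark the callable so that Formulaic treats it as a stateful transform.
demean.__is_stateful_transform__ = True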
Formulaic comes preloaded with some useful stateful transforms, which are outlined below.
"},{"location":"guides/transforms/#scaling-and-centering","title":"Scaling and Centering\u00b6","text":"There are two provided scaling transforms: scale(...)
and center(...)
.
scale
rescales the data such that it is centered around zero with a standard deviation of 1. The centering and variance standardisation can be independently disabled as necessary. center
is a simple wrapper around scale
that only does the centering. For more details, refer to inline documentation: help(scale)
.
Example usage is shown below:
"},{"location":"guides/transforms/#categorical-encoding","title":"Categorical Encoding\u00b6","text":"Formulaic provides a rich family of categorical stateful transforms. These are perhaps the most commonly used transforms, and are used to encode categorical/factor data into a form suitable for numerical analysis. Use of these transforms is separately documented in the Categorical Encoding section.
"},{"location":"guides/transforms/#spline-encoding","title":"Spline Encoding\u00b6","text":"Spline coding is used to enable non-linear dependence on numerical features in linear models. Formulaic currently provides two spline transforms: bs
for basis splines, and poly
for polynomial splines, along with the natural and cyclic cubic spline transforms (cr and cc). These are separately documented in the Spline Encoding section.
You can either implement the above interface directly, or leverage the stateful_transform
decorator provided by Formulaic, which then also updates your function into a single dispatch function, allowing multiple implementations that depend on the currently materialized type. A simple centering example is explored below.
Formulaic is a high-performance implementation of Wilkinson formulas for Python, which are very useful for transforming dataframes into a form suitable for ingestion into various modelling frameworks (especially linear regression).
It provides:
pandas.DataFrame
pyarrow.Table
pandas.DataFrame
numpy.ndarray
scipy.sparse.CSCMatrix
with more to come!
"},{"location":"changelog/","title":"Changelog","text":"For changes since the latest tagged release, please refer to the git commit log.
"},{"location":"changelog/#111-20-december-2024","title":"1.1.1 (20 December 2024)","text":"New features and enhancements:
Formula.differentiate()
is now considered stable, with ModelMatrix.differentiate()
to follow in a future release.Bugfixes and cleanups:
Breaking changes:
Formula
is no longer always \"structured\" with special cases to handle the case where it has no structure. Legacy shims have been added to support old patterns, with DeprecationWarning
s raised when they are used. This is not expected to break anyone who was not explicitly checking whether the Formula.root
is a list instance (which formerly could have been safely assumed) [it is now a SimpleFormula
instance that acts like an ordered sequence of Term
instances].feature[T.A]
, whether or not the encoding will result in that term acting as a contrast. Now, in keeping with patsy
, we only add the prefix if the categorical factor is encoded with reduced rank. Otherwise, feature[A]
will be used instead.formulaic.parsers.types.structured
has been promoted to formulaic.utils.structured
.New features and enhancements:
Formula
now instantiates to SimpleFormula
or StructuredFormula
, the latter being a tree-structure of SimpleFormula
instances (as compared to List[Term]
) previously. This simplifies various internal logic and makes the propagation of formula metadata more explicit.dict
and recarray
types are now associated with the pandas
materializer by default (rather than raising), simplifying some user workflows..
operator (which is replaced with all variables not used on the left-hand-side of formulae).[ ... ~ ... ]
. This is useful for (e.g.) generating formulae for IV 2SLS.ModelSpec[s]
based on an arbitrary strictly reduced FormulaSpec
.Formula.required_variables
to more easily surface the expected data requirements of the formula.cc
) and natural (cr
). See formulaic.materializers.transforms.cubic_spline.cubic_spline
for more details.lag()
transform.LinearConstraints
can now be done from a list of strings (for increased parity with patsy
).T.
when they actually describe contrasts (i.e. when they are encoded with reduced rank).encode_categorical
; which is surfaced via ModelSpec.factor_contrasts
.Operator
instances now receive a context
which is optionally specified by the user during formula parsing, and updated by the parser. This is what makes the .
implementation possible.Structured
, it has been promoted to formulaic.utils
.Bugfixes and cleanups:
Formula
instance.Structured
instances.ModelMatrix
and FactorValues
instances (whenever wrapped objects are picklable).basis_spline
: Fixed evaluation involving datasets with null values, and disallowed out-of-bounds knots.ruff
for linting, and updated mypy
and pre-commit
tooling.ruff
are automatically applied when using hatch run lint:format
.Documentation:
Bugfixes and cleanups:
pandas
>=3.mypy
type inference in materializer subclasses.Documentation:
sklearn
integration example.Bugfixes and cleanups:
Breaking changes:
Formula.terms
, ModelSpec.feature_names
, and ModelSpec.feature_indices
.New features and enhancements:
patsy
.hashed
transform for categorically encoding deterministically hashed representations of a dataset.Bugfixes and cleanups:
astor
for Python 3.9 and newer.Documentation:
This is minor release with one important bugfix.
Bugfixes and cleanups:
This is a minor release with several important bugfixes.
Bugfixes and cleanups:
poly()
transforms operating on datasets that include null values.hatch run tests
.This is a minor release with several new features and cleanups.
New features and enhancements:
ModelSpec
documentation for more details.Bugfixes and cleanups:
OrderedDict
usage, since Python guarantees the orderedness of dictionaries in Python 3.7+.None
.This is a minor release with a bugfix.
Bugfixes and cleanups:
This is a minor release with several bugfixes.
Bugfixes and cleanups:
This is a minor release with one new feature.
New features and enhancements:
This is a major release with some important consistency and completeness improvements. It should be treated as almost being the first release candidate of 1.0.0, which will land after some small amount of further feature extensions and documentation improvements. All users are recommended to upgrade.
Breaking changes:
Although there are some internal changes to API, as documented below, there are no breaking changes to user-facing APIs.
New features and enhancements:
_ordering
keyword to the Formula
constructor.standardize
, Q
and treatment contrasts shims.cluster_by='numerical_factors
option to ModelSpec
to enable patsy style clustering of output columns by involved numerical factors.^
and %in%
.Structured
instances, and use this functionality during AST evaluation where relevant.ModelSpec.term_indices
is now a list rather than a tuple, to allow direct use when indexing pandas and numpy model matrices.Bugfixes and cleanups:
Structured
instances for non-sequential iterable values.poly
unit tests.PandasMaterializer
.Factor.EvalMethod.UNKNOWN
was removed, defaulting instead to LOOKUP
.sympy
version constraint now that a bug has been fixed upstream.Documentation:
This is a minor patch releases that fixes one bug.
Bugfixes and cleanups:
Structured
instance and iteration over this instance (including Formula
instances). Formerly the length would only count the number of keys in its structure, rather than the number of objects that would be yielded during iteration.This is a minor patch release that fixes two bugs.
Bugfixes and cleanups:
Formula
objects.formulaic.__version__
during package build.This is a major new release with some minor API changes, some ergonomic improvements, and a few bug fixes.
Breaking changes:
Formula
objects (e.g. formula.lhs
) no longer returns a list of terms; but rather a Formula
object, so that the helper methods can remain accessible. You can access the raw terms by iterating over the formula (list(formula)
) or looking up the root node (formula.root
).New features and improvements:
ModelSpec
object is now the source of truth in all ModelMatrix
generations, and can be constructed directly from any supported specification using ModelSpec.from_spec(...)
. Supported specifications include formula strings, parsed formulae, model matrices and prior model specs..get_model_matrix()
helper methods across Formula
, FormulaMaterializer
, ModelSpec
and model_matrix
objects/helpers functions are now consistent, and all use ModelSpec
directly under the hood.Formula
objects (e.g. formula.lhs
), the term lists will be wrapped as trivial Formula
instances rather than returned as raw lists (so that the helper methods like .get_model_matrix()
can still be used).FormulaSpec
is now exported from the top-level module.Bugfixes and cleanups:
ModelSpec
specifications being overriden by default arguments to FormulaMaterializer.get_model_matrix
.Structured._flatten()
now correctly flattens unnamed substructures.This is a major new release with some new features, greatly improved ergonomics for structured formulae, matrices and specs, and a few small breaking changes (most with backward compatibility shims). All users are encouraged to upgrade.
Breaking changes:
include_intercept
is no longer an argument to FormulaParser.get_terms
; and is instead an argument of the DefaultFormulaParser
constructor. If you want to modify the include_intercept
behaviour, please use: Formula(\"y ~ x\", _parser=DefaultFormulaParser(include_intercept=False))\n
Formula.terms
is deprecated since Formula
became a subclass of Structured[List[Terms]]
. You can directly iterate over, and/or access nested structure on the Formula
instance itself. Formula.terms
has a deprecated property which will return a reference to itself in order to support legacy use-cases. This will be removed in 1.0.0.ModelSpec.feature_names
and ModelSpec.feature_columns
are deprecated in favour of ModelSpec.column_names
and ModelSpec.column_indices
. Deprecated properties remain in-place to support legacy use-cases. These will be removed in 1.0.0.New features and enhancements:
Formula
has been refactored as a subclass of Structured[List[Terms]]
, and can be incrementally built and modified. The matrix and spec outputs now have explicit subclasses of Structured
(ModelMatrices
and ModelSpecs
respectively) to expose convenience methods that allow these objects to be largely used interchangeably with their singular counterparts.ModelMatrices
and ModelSpecs
arenow surfaced as top-level exports of the formulaic
module.Structured
(and its subclasses) gained improved integration of nested tuple structure, as well as support for flattened iteration, explicit mapping output types, and lots of cleanups.ModelSpec
was made into a dataclass, and gained several new properties/methods to support better introspection and mutation of the model spec.FormulaParser
was renamed DefaultFormulaParser
, and made a subclass of the new formula parser interface FormulaParser
. In this process include_intercept
was removed from the API, and made an instance attribute of the default parser implementation.Bugfixes and cleanups:
ModelSpec
s are provided by the user during materialization, they are updated to reflect the output-type chosen by the user, as well as whether to ensure full rank/etc.pylint
was added to the CI testing.Documentation:
.materializer
submodule, most code now has inline documentation and annotations.This is a backward compatible major release that adds several new features.
New features and enhancements:
ModelMatrix
instances (see ModelMatrix.model_spec.get_linear_constraints
).ModelMatrix
, ModelSpec
and other formula-like objects to the model_matrix
sugar method so that pre-processed formulae can be used.0
with -1
to avoid substitutions in quoted contexts.Bugfixes and cleanups:
bs(`my|feature%is^cool`)
.C(x, {\"a\": [1,2,3]})
.astor
to >=0.8 to fix issues with ast-generation in Python 3.8+ when numerical constants are present in the parsed python expression (e.g. \"bs(x, df=10)\").This is a minor patch release that migrates the package tooling to poetry; solving a version inconsistency when packaging for conda
.
This is a minor patch release that fixes an attempt to import numpy.typing
when numpy is not version 1.20 or later.
This is a minor patch release that fixes the maintaining of output types, NA-handling, and assurance of full-rank for factors that evaluate to pre-encoded columns when constructing a model matrix from a pre-defined ModelSpec. The benchmarks were also updated.
"},{"location":"changelog/#030-14-march-2022","title":"0.3.0 (14 March 2022)","text":"This is a major new release with many new features, and a few small breaking changes. All users are encouraged to upgrade.
Breaking changes:
formulaic.materializers.transforms
to the top-level formulaic.transforms
module, and ported all existing transforms to output FactorValues
types rather than dictionaries. FactorValues
is an object proxy that allows output types like pandas.DataFrame
s to be used as they normally would, with some additional metadata for formulaic accessible via the __formulaic_metadata__
attribute. This makes non-formula direct usage of these transforms much more pleasant.~
is no longer a generic formula separator, and can only be used once in a formula. Please use the newly added |
operator to separate a formula into multiple parts.New features and enhancements:
~
operator to use them. Structured formulas can have named substructures, for example: lhs
and rhs
for the ~
operator. The representation of formulas has been updated to show this structure.|
operator which splits formulas into multiple parts).formulaic.model_matrix
syntactic sugar function now accepts ModelSpec
and ModelMatrix
instances as the \"formula\" spec, making generation of matrices with the same form as previously generated matrices more convenient.poly
transform (compatible with R and patsy).numpy
is now always available in formulas via np
, allowing formulas like np.sum(x)
. For convenience, log
, log10
, log2
, exp
, exp10
and exp2
are now exposed as transforms independent of user context.formulaic.utils.context.capture_context()
. This can be used by libraries that wrap Formulaic to capture the variables and/or transforms available in a users' environment where appropriate.Bugfixes and cleanups:
Documentation:
.parser
and .utils
modules of Formulaic are now inline documented and annotated.This is a minor release that fixes an issue whereby the ModelSpec instances attached to ModelMatrix objects would keep reference to the original data, greatly inflating the size of the ModelSpec.
"},{"location":"changelog/#023-4-february-2021","title":"0.2.3 (4 February 2021)","text":"This release is identical to v0.2.2, except that the source distribution now includes the docs, license, and tox configuration.
"},{"location":"changelog/#022-4-february-2021","title":"0.2.2 (4 February 2021)","text":"This is a minor release with one bugfix.
This is a minor patch release that brings in some valuable improvements.
DataFrame
.This is major release that brings in a large number of improvements, with a huge number of commits. Some API breakage from the experimental 0.1.x series is likely in various edge-cases.
Highlights include:
Performance improvements around the encoding of categorical features.
Matthew Wardrop (1):\n Improve the performance of encoding operations.\n
"},{"location":"changelog/#011-31-october-2019","title":"0.1.1 (31 October 2019)","text":"No code changes here, just a verification that GitHub CI integration was working.
Matthew Wardrop (1):\n Update Github workflow triggers.\n
"},{"location":"changelog/#010-31-october-2019","title":"0.1.0 (31 October 2019)","text":"This release added support for keeping track of encoding choices during model matrix generation, so that they can be reused on similarly structured data. It also added comprehensive unit testing and CI integration using GitHub actions.
Matthew Wardrop (5):\n Add support for stateful transforms (including encoding).\n Fix tokenizing of nested Python function calls.\n Add support for nested transforms that return multiple columns, as well as passing through of materializer config through to transforms.\n Add comprehensive unit testing along with several small miscellaneous bug fixes and improvements.\n Add GitHub actions configuration.\n
"},{"location":"changelog/#001-1-september-2019","title":"0.0.1 (1 September 2019)","text":"Initial open sourcing of formulaic
.
Matthew Wardrop (1):\n Initial (mostly) working implementation of Wilkinson formulas.\n
"},{"location":"formulas/","title":"What are formulas?","text":"This section introduces the basic notions and origins of formulas. If you are already familiar with formulas from another context, you might want to skip forward to the Formula Grammar or other User Guides.
"},{"location":"formulas/#origins","title":"Origins","text":"Formulas were originally proposed by Wilkinson et al.1 to aid in the description of ANOVA problems, but were popularised by the S language (and then R, as an implementation of S) in the context of linear regression. Since then they have been extended in R, and implemented in Python (by patsy), in MATLAB, in Julia, and quite conceivably elsewhere. Each implementation has its own nuances and grammatical extensions, including Formulaic's which are described more completely in the Formula Grammar section of this manual.
"},{"location":"formulas/#why-are-they-useful","title":"Why are they useful?","text":"Formulas are useful because they provide a concise and explicit specification for how data should be prepared for a model. Typically, the raw input data for a model is stored in a dataframe, but the actual implementations of various statistical methodologies (e.g. linear regression solvers) act on two-dimensional numerical matrices that go by several names depending on the prevailing nomenclature of your field, including \"model matrices\", \"design matrices\" and \"regressor matrices\" (within Formulaic, we refer to them as \"model matrices\"). A formula provides the necessary information required to automate much of the translation of a dataframe into a model matrix suitable for ingestion into a statistical model.
Suppose, for example, that you have a dataframe with \\(N\\) rows and three numerical columns labelled: y
, a
and b
. You would like to construct a linear regression model for y
based on a
, b
and their interaction: \\[ y = \\alpha + \\beta_a a + \\beta_b b + \\beta_{ab} ab + \\varepsilon \\] with \\(\\varepsilon \\sim \\mathcal{N}(0, \\sigma^2)\\). Rather than manually constructing the required matrices to pass to the regression solver, you could specify a formula of form:
y ~ a + b + a:b\n
When furnished with this formula and the dataframe, Formulaic (or indeed any other formula implementation) would generate two model matrix objects: an \\( N \\times 1 \\) matrix \\(Y\\) for the response variable y
, and an \\( N \\times 4 \\) matrix \\(X\\) for the input columns intercept
, a
, b
, and a * b
. You can then directly pass these matrices to your regression solver, which internally will solve for \\(\\beta\\) in: \\[ Y = X\\beta + \\varepsilon. \\] The true value of formulas becomes more apparent as model complexity increases, where they can be a huge time-saver. For example:
~ (f1 + f2 + f3) * (x1 + x2 + scale(x3))\n
tells the formula interpreter to consider 16 fields of input data, corresponding to an intercept (1), each of the f*
fields (3), each of the x*
fields (3), and the combination of each f
with each x
(9). It also instructs the materializer to ensure that the x3
column is rescaled during the model matrix materialization phase such that it has mean zero and a standard deviation of 1. If any of these columns is categorical in nature, they would by default also be one-hot/dummy encoded. Depending on the formula interpreter (including Formulaic), extra steps would also be taken to ensure that the resulting model matrix is structurally full-rank. As an added bonus, some formula implementations (including Formulaic) can remember any choices made during the materialization process, and apply them consistently to new data, making it possible to easily generate new data that conforms to the same structure as the training data. For example, the scale(...)
transform in the example above makes use of the mean and variance of the column to be scaled. Any future data, however, should not undergo scaling based on its own mean and variance, but rather on the mean and variance that was measured for the training data set (otherwise the new dataset will not be consistent with the expectations of the trained model which will be interpreting it).
Formulas are a very flexible tool, and can be augmented with arbitrary user-defined transforms. However, some transformations required by certain models may be more elegantly defined via a pre-formula dataframe operation or post-formula model matrix operation. Another consideration is that the default encoding and materialization choices for data are aligned with linear regression. If you are using a tree model, for example, you may not be interested in dummy encoding of \"categorical\" features, and this type of transform would have to be explicitly noted in the formula. Nevertheless, even in these cases, formulas are an excellent tool, and can often be used to greatly simplify data preparation workflows.
"},{"location":"formulas/#where-to-from-here","title":"Where to from here?","text":"To learn about the full set of features supported by the formula language as implemented by Formulaic, please review the Formula Grammar. To get a feel for how you can use formulaic
to transform your dataframes into model matrices, please review the Quickstart.
Wilkinson, G. N., and C. E. Rogers. Symbolic description of factorial models for analysis of variance. J. Royal Statistics Society 22, pp. 392\u2013399, 1973.\u00a0\u21a9
The latest release of formulaic
is always published to the Python Package Index (PyPI), from which it is available to download @ https://pypi.org/project/formulaic/.
If your Python environment is provisioned with pip
, installing formulaic
from the PyPI is as simple as running:
$ pip install formulaic\n
Note
If you have a non-standard setup, ensure that pip
above are replaced with the executables corresponding to the environment for which you are interested in installing formulaic
. This is done automatically if you are using a virtual environment.
You are ready to use Formulaic. To get introduced to the concepts underpinning Formulaic, please review the Concepts documentation, or to jump straight to how to use Formulaic, please review the User Guides documentation.
"},{"location":"installation/#installing-for-development","title":"Installing for development","text":"If you are interested in developing formulaic
, you should clone the source code repository, and install in editable mode from there (allowing your changes to be instantly available to all new Python sessions).
To clone the source code, run:
$ git clone git@github.com:matthewwardrop/formulaic.git\n
Note
This requires you to have a GitHub account set up. If you do not have an account you can replace the SSH url above with https://github.com/matthewwardrop/formulaic.git
. Also, if you are planning to submit your work upstream, you may wish to fork the repository into your own namespace first, and clone from there.
To install in editable mode, run:
$ pip install -e <path_to_cloned_formulaic_repo>\n
You will need pip>=21.3
in order for this to work. You can then make any changes you like to the repo, and have them be reflected in your local Python sessions. Happy hacking, and I look forward to your contributions!
"},{"location":"migration/","title":"Migrating from Patsy/R","text":"The default Formulaic parser and materialization configuration is designed to be highly compatibly with existing Wilkinson formula implementations in R and Python; however there are some differences which are highlighted here. If you find other differences, feel free to submit a PR to update this documentation.
"},{"location":"migration/#migrating-from-patsy","title":"Migrating frompatsy
","text":"Patsy has been the go-to implementation of Wilkinson formulae for Python use-cases for many years, and Formulaic should be largely a drop-in replacement, while bringing order of magnitude improvements in runtime performance and greater extensibility. Being written in the same language (Python) there are two separate migration concerns: input/output and API migrations, which will be explored separately below.
"},{"location":"migration/#inputoutput-changes","title":"Input/Output changes","text":"The primary inputs to patsy
are a formula string, and pandas dataframe from which features referenced in the formula are drawn. The output is a model matrix (called a design matrix in patsy
). We focus here on any potentially breaking behavioural differences here, rather than ways in which Formulaic extends the functionality available in patsy
.
^
operator is interpreted as exponentiation, rather than Python's XOR binary operator.C(x, contr.treatment)
. For greater compatibility with patsy
we add to the transform namespace Treatment
, Poly
, Sum
, Helmert
and Diff
, allowing formulae like C(x, Poly)
or C(x, Treatment(reference='x'))
to work as expected, with the following caveats:C
is C(data, contrasts=None, *, levels=None)
as compared to C(data, contrast=None, levels=None)
from patsy
.Sum
contrast does not offer an omit
option to specify the index of the omitted column.scale(x)
, but compatibility shims for standardize(x)
are added for greater compatibility with patsy
. Note that the standardize
shim follows patsy argument kwarg naming conventions, but scale
uses scale
instead of rescale
, following R.cluster_by=\"numerical_factors\"
to model_matrix
or any of the .get_model_matrix(...)
methods.cr
and cc
) or tensor smoothing (te
) stateful transforms.patsy
offers two high-level user-facing entrypoints: patsy.dmatrix
and patsy.dmatrices
, depending on whether you have both left- and right-hand sides present. In formulaic
, we offer a single entrypoint for both cases: model_matrix
.
In the vast majority of cases, a simple substitution of dmatrix
or dmatrices
with model_matrix
will achieve the desired result; however, there are some differences in signature that could trip up a naive copy and replace. Patsy's dmatrix
signature is:
patsy.dmatrix(\n formula_like,\n data={},\n eval_env=0,\n NA_action='drop',\n return_type='matrix',\n)\n
whereas model_matrix
has a signature of: formulaic.model_matrix(\n spec: FormulaSpec, # accepts any formula-like spec (include model matrices and specs)\n data: Any, # accepts any supported data structure (include pandas DataFrames)\n *,\n context: Union[int, Mapping[str, Any]] = 0, # equivalent to `eval_env`\n **spec_overrides, # Additional overrides for generated `ModelSpec`, including `na_action` and `output` (similar to `return_type`).\n)\n
If you are integrating Formulaic into your library, it is highly recommended to use the Formula()
API directly rather than model_matrix
, which by default will add all variables in the local context into the evaluation environment (just like dmatrix
). This allows you to better isolate and control the behaviour of the Formula parsing.
Most formulae that work in R will work without modification, including those written against the enhanced R Formula package that supports multi-part formulae. However, there are a few caveats that are worth calling out:
^
operator within an I
transform; e.g. I(x^2)
. This is because this is treated as Python code, and so you should use I(x**2)
or {x**2}
instead.patsy
. In particular, order of operations are respected when evaluating intercept directives, and so: 1 + (b - 1)
would result in the intercept remaining (since (b-1)
would be evaluated first to b
, resulting in 1 + b
), whereas in R the intercept would have been dropped.patsy
. Using capital letters to represent categorical variables, and lower-case letters to represent numerical ones, the difference from R will become apparent in two cases:1 + A:B
. In this case, R does not account for the fact that A:B
spans the intercept, and so does not rank reduce the product, and thus generates an over-specified matrix. This affects higher-order interactions also.0 + A:x + B:C
. Here we use 0 +
to avoid the previous bug, but unfortunately when R is checking whether to reduce the rank of the categorical features during encoding, it assumes that all involved features are categorical, and thus unnecessarily reduces the rank of C
, resulting in an under-specified matrix. This affects higher-order interactions also.offset(...)
.For more details, refer to the Formula Grammar.
"},{"location":"dev/","title":"Introduction","text":"This section of the documentation focuses on providing guidance to developers of libraries that are integrating Formulaic, users who want to extend its behavior, or for those interested in directly contributing.
If you are looking to directly work with Formulaic as an end-user, please review the User Guides instead.
This portion of the documentation is less complete than user-facing documentation, and you are encouraged to reach out via the Issue Tracker if you need any help.
"},{"location":"dev/extensions/","title":"Extensions","text":"Formulaic was designed to be extensible from day one, and nearly all of its core functionality is implemented as \"plugins\"/\"modules\" that you can use as examples for how extensions could be written. In this document we will provide a basic high-level overview of the basic components of Formulaic that can extended.
An important consideration is that while Formulaic offers extensible APIs, and effort will be made not to break extension APIs without reason (and never in patch releases), the safest place for you extensions is in Formulaic itself, where they can be kept up to date and maintained (assuming the extension is not overly bespoke). If you think your extensions might help others, feel free to reach out via the issue tracker and/or open a pull request.
"},{"location":"dev/extensions/#transforms","title":"Transforms","text":"Transforms are likely the most commonly extended feature of Formulaic, and also likely the least valuable to upstream (since transforms are often domain specific). Documentation for implementing transforms is described in detail in the Transforms user guide.
"},{"location":"dev/extensions/#materializers","title":"Materializers","text":"Materializers are responsible for translating formulae into model matrices as documented in the How it works user guide. You need to implement a new materializer if you want to add support for new input and/or output types.
Implementing a new materializer is as simple as subclassing the abstract class formulaic.materializers.FormulaMaterializer
(or one of its subclasses). This base class defines the API expected by the rest of the Formulaic system. Example implementations include pandas and pyarrow.
During subclassing, the new class is registered according to the various REGISTER_*
attributes if REGISTER_NAME
is specified. This registration allows looking up of the materializer by name through the model_matrix()
and .get_model_matrix()
functions. You can always manually pass in your materializer class explicitly without this registration.
Parsers translate a formula string to a set of terms and factors that are then evaluated and assembled into the model matrix, as documented in the How it works user guide. This is unlikely to be necessary very often, but can be used to add additional formula operators, or change the behavior of existing ones.
Formula parsers are expected to implement the API of formulaic.parser.types.FormulaParser
. The default implementation can be seen here. You can pass in custom parsers to Formula()
via the parser
and nested_parser
options (see inline documentation for more details).
If you are considering extending the parser, please do reach out via the issue tracker.
"},{"location":"dev/integration/","title":"Integration","text":"If you are looking to enrich your existing Python project with support for formulae, you have come to the right place. Formulaic is designed with simple APIs that should make it straightforward to integrate into any project.
In this document we provide several general recommendations for developers integrating Formulaic, and then some more specific guidance for developers looking to migrate existing formula functionality from patsy
. As you are working on integrating Formulaic, if you come across anything not mentioned here that really ought to be, please report it to our Issue Tracker.
For the most part, Formulaic should \"just work\". However, here are a couple of recommendations that might make your integration work easier.
model_matrix
. This is a simple wrapper around lower-level APIs that automatically includes variables from users' local namespaces. This is convenient when running in a notebook, but can lead to unexpected interactions with your library code that are hard to debug. Called naively in your library it will treat the frame in which it was run as the user context, which may include somewhat sensitive internal state and may override transforms normally available to formulae. Instead, use Formula(...).get_model_matrix(...)
.formulaic.utils.context.capture_context()
function and pass the result as context
to the .get_model_matrix()
methods. It is easiest to use in the outermost user-facing entrypoints so that you do not need to figure out exactly how many frames removed you are from user-context. You may also manually construct a dictionary from the user's context if you want to do additional filtering.eval()
function may be called to invoke the indicated Python functions. Since this is user-specified code, it is possible that the formula had some malicious code in it (such as sys.exit()
or shutil.rmtree()
). If you are integrating Formulaic into server-side code, it is highly recommended not to pass in any user-specified context, but instead to curate the set of additional functions that are available and pass that in instead. If you are writing a user-facing library, this should not be as concerning.pandas.DataFrame
-> PandasMaterializer
). Different materializers may have different output (and other) options. It may make sense to hard-code your choice of materializer by passing materializer=
to the .get_model_matrix()
methods.output='sparse'
to .get_model_matrix()
, assuming the materializer of your datatype supports this).ModelSpec
instances to work across different major versions of Formulaic. It may be tempting to serialize them to disk and then reuse them in newer versions of Formulaic. Most of the time this will work fine, but the stored encoder and transform states are considered implementation details of stateful transforms and are subject to change between major versions. Patch releases should never result in changes to this state.formulaic.parser.DefaultFormulaParser.FeatureFlags
. You can pass these flags, or a set of (case-insensitive) strings corresponding to these enums, to DefaultFormulaParser(feature_flags=...)
.If you are migrating a library that previous used patsy
to formulaic
, you should first review the general user-facing migration notes, which describes differences in API and formula grammars. Then, in addition to the recommendations above, the following notes might be helpful.
patsy
, such as manually assembling Term
instances, this code will need to be rewritten to use Formulaic
classes instead. Generally speaking, this will likely be transparent to your users and so should be a relatively small lift.This section of the documentation focuses on guiding end-users through the various aspects of Formulaic likely to be useful in day-to-day workflows. Feel free to pick and choose which modules you peruse, but note that later modules may assume knowledge of content described in earlier modules.
If you are a developer of another library looking to leverage Formulaic internally to your code, or are looking to contribute directly to Formulaic, it is recommended to also review the Developer Guides.
"},{"location":"guides/contrasts/","title":"Categorical Encoding","text":"Categorical data (also known as \"factors\") is encoded in model matrices using \"contrast codings\" that transform categorical vectors into a collection of numerical vectors suitable for use in regression models. In this guide we begin with some basic examples, before introducing the concepts behind contrast codings, how to select and/or design your own coding, and (for more advanced readers) describe how we guarantee structural full-rankness of formulae with complex interactions between categorical and other features.
In\u00a0[1]: Copied!from pandas import Categorical, DataFrame\n\nfrom formulaic import model_matrix\n\ndf = DataFrame(\n {\n \"letters\": [\"a\", \"b\", \"c\"],\n \"numbers\": Categorical([1, 2, 3]),\n \"values\": [20, 200, 30],\n }\n)\n\nmodel_matrix(\"letters + numbers + values\", df)\nfrom pandas import Categorical, DataFrame from formulaic import model_matrix df = DataFrame( { \"letters\": [\"a\", \"b\", \"c\"], \"numbers\": Categorical([1, 2, 3]), \"values\": [20, 200, 30], } ) model_matrix(\"letters + numbers + values\", df) Out[1]: Intercept letters[T.b] letters[T.c] numbers[T.2] numbers[T.3] values 0 1.0 0 0 0 0 20 1 1.0 1 0 1 0 200 2 1.0 0 1 0 1 30
Here letters
was identified as a categorical variable because of it consisted of strings, numbers
was identified as categorical because of its data type, and values
was treated as a vector of numerical values. The categorical data was encoded using the default encoding of \"Treatment\" (aka. \"Dummy\", see below for more details).
If we wanted to force formulaic to treat a column as categorical, we can use the C()
transform (just as in patsy and R). For example:
model_matrix(\"C(values)\", df)\nmodel_matrix(\"C(values)\", df) Out[2]: Intercept C(values)[T.30] C(values)[T.200] 0 1.0 0 0 1 1.0 0 1 2 1.0 1 0
The C()
transform tells Formulaic that the column should be encoded as categorical data, and allows you to customise how the encoding is performed. For example, we could use polynomial coding (detailed below) and explicitly specify the categorical levels and their order using:
model_matrix(\"C(values, contr.poly, levels=[10, 20, 30])\", df)\nmodel_matrix(\"C(values, contr.poly, levels=[10, 20, 30])\", df)
/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:124: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {200}. They are being cast to nan, which will likely skew the results of your analyses.\n warnings.warn(\nOut[3]: Intercept C(values, contr.poly, levels=[10, 20, 30]).L C(values, contr.poly, levels=[10, 20, 30]).Q 0 1.0 0.000000 -0.816497 1 1.0 0.000000 0.000000 2 1.0 0.707107 0.408248
Where possible, as you can see above, we also provide warnings when a categorical encoding does not reflect the structure of the data.
In\u00a0[4]: Copied!from formulaic.transforms.contrasts import C, TreatmentContrasts\n\nTreatmentContrasts(base=\"B\").get_coding_matrix([\"A\", \"B\", \"C\", \"D\"])\nfrom formulaic.transforms.contrasts import C, TreatmentContrasts TreatmentContrasts(base=\"B\").get_coding_matrix([\"A\", \"B\", \"C\", \"D\"]) Out[4]: A C D A 1.0 0.0 0.0 B 0.0 0.0 0.0 C 0.0 1.0 0.0 D 0.0 0.0 1.0 In\u00a0[5]: Copied!
TreatmentContrasts(base=\"B\").get_coefficient_matrix([\"A\", \"B\", \"C\", \"D\"])\nTreatmentContrasts(base=\"B\").get_coefficient_matrix([\"A\", \"B\", \"C\", \"D\"]) Out[5]: A B C D B 0.0 1.0 0.0 0.0 A-B 1.0 -1.0 -0.0 -0.0 C-B 0.0 -1.0 1.0 0.0 D-B 0.0 -1.0 0.0 1.0 In\u00a0[6]: Copied!
model_matrix(\"C(letters, contr.treatment)\", df)\nmodel_matrix(\"C(letters, contr.treatment)\", df) Out[6]: Intercept C(letters, contr.treatment)[T.b] C(letters, contr.treatment)[T.c] 0 1.0 0 0 1 1.0 1 0 2 1.0 0 1 In\u00a0[7]: Copied!
model_matrix("C(letters, contr.treatment)", df)
Out[6]:
   Intercept  C(letters, contr.treatment)[T.b]  C(letters, contr.treatment)[T.c]
0        1.0                                 0                                 0
1        1.0                                 1                                 0
2        1.0                                 0                                 1
In [7]:
model_matrix("C(letters, contr.SAS)", df)
Out[7]:
   Intercept  C(letters, contr.SAS)[T.a]  C(letters, contr.SAS)[T.b]
0        1.0                           1                           0
1        1.0                           0                           1
2        1.0                           0                           0
In [8]:
model_matrix("C(letters, contr.sum)", df)
Out[8]:
   Intercept  C(letters, contr.sum)[S.a]  C(letters, contr.sum)[S.b]
0        1.0                         1.0                         0.0
1        1.0                         0.0                         1.0
2        1.0                        -1.0                        -1.0
In [9]:
model_matrix("C(letters, contr.helmert)", df)
Out[9]:
   Intercept  C(letters, contr.helmert)[H.b]  C(letters, contr.helmert)[H.c]
0        1.0                            -1.0                            -1.0
1        1.0                             1.0                            -1.0
2        1.0                             0.0                             2.0
In [10]:
model_matrix("C(letters, contr.diff)", df)
Out[10]:
   Intercept  C(letters, contr.diff)[D.b]  C(letters, contr.diff)[D.c]
0        1.0                    -0.666667                    -0.333333
1        1.0                     0.333333                    -0.333333
2        1.0                     0.333333                     0.666667
In [11]:
my_letters = C(df.letters, TreatmentContrasts(base=\"b\"))\nmodel_matrix(\"my_letters\", df)\nmy_letters = C(df.letters, TreatmentContrasts(base=\"b\")) model_matrix(\"my_letters\", df) Out[12]: Intercept my_letters[T.a] my_letters[T.c] 0 1.0 1 0 1 1.0 0 0 2 1.0 0 1 In\u00a0[13]: Copied!
import numpy\n\nZ = numpy.array(\n [\n [1, 0, 0, 0], # A\n [-1, 1, 0, 0], # B - A\n [0, -1, 1, 0], # C - B\n [-1, 0, 0, 1], # D - A\n ]\n)\ncoding = numpy.linalg.inv(Z)[:, 1:]\ncoding\nimport numpy Z = numpy.array( [ [1, 0, 0, 0], # A [-1, 1, 0, 0], # B - A [0, -1, 1, 0], # C - B [-1, 0, 0, 1], # D - A ] ) coding = numpy.linalg.inv(Z)[:, 1:] coding Out[13]:
array([[0., 0., 0.],\n [1., 0., 0.],\n [1., 1., 0.],\n [0., 0., 1.]])In\u00a0[14]: Copied!
model_matrix(\n \"C(letters, contr.custom(coding))\", DataFrame({\"letters\": [\"A\", \"B\", \"C\", \"D\"]})\n)\nmodel_matrix( \"C(letters, contr.custom(coding))\", DataFrame({\"letters\": [\"A\", \"B\", \"C\", \"D\"]}) ) Out[14]: Intercept C(letters, contr.custom(coding))[1] C(letters, contr.custom(coding))[2] C(letters, contr.custom(coding))[3] 0 1.0 0.0 0.0 0.0 1 1.0 1.0 0.0 0.0 2 1.0 1.0 1.0 0.0 3 1.0 0.0 0.0 1.0
The model matrices generated from formulae are often consumed directly by linear regression algorithms. In these cases, if your model matrix is not full rank, then the features in your model are not linearly independent, and the resulting coefficients (assuming they can be computed at all) cannot be uniquely determined. While there are ways to handle this, such as regularization, it is usually easiest to put in a little more effort during the model matrix creation process, and make the incoming vectors in your model matrix linearly independent from the outset. As noted in the text above, categorical coding requires consideration about the overlap of the coding with the intercept in order to remain full rank. The good news is that Formulaic will do most of the heavy lifting for you, and does so by default.
It is important to note at this point that Formulaic does not protect against all forms of linear dependence, only structural linear dependence; i.e. linear dependence that results from multiple categorical variables overlapping in vectorspace. If you have two identical numerical vectors called by two different names in your model matrix, Formulaic will happily build the model matrix you requested, and you're on your own. This is intentional. While Formulaic strives to make the model matrix generation process as painless as possible, it also doesn't want to make more assumptions about the use of the data than is necessary. Note that you can also disable Formulaic's structural full-rankness algorithms by passing ensure_full_rank=False
to model_matrix()
or .get_model_matrix()
methods; and can bypass the reduction of the rank of a single categorical term in a formula using C(..., spans_intercept=False)
(this is especially useful, for example, if your model includes regularization and you would prefer to use the over-specified model to ensure fairer shrinkage).
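As a minimal sketch of both of these escape hatches (the single-column `df` here is a hypothetical dataset; the parameters themselves are those described above):

```python
import pandas
from formulaic import model_matrix

df = pandas.DataFrame({"letters": ["a", "b", "c"]})

# Disable the structural full-rankness guarantee entirely; every level of
# `letters` gets its own column, so the matrix is over-specified:
model_matrix("letters", df, ensure_full_rank=False)

# Or keep the guarantee globally, but allow this one factor to span the
# intercept (i.e. skip its rank reduction):
model_matrix("C(letters, spans_intercept=False)", df)
```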
The algorithm that Formulaic uses was heavily inspired by patsy
^1. The basic idea is to recognize that all categorical codings span the intercept[^2]; and then to break that coding up into two pieces: a single column that can be dropped to avoid spanning the intercept, and the remaining body of the coding that will always be present. The categorical factors are then expanded associatively, and the components greedily recombined, omitting any that would lead to structural linear dependence. The result is a set of categorical codings that only spans the intercept when it is safe to do so, guaranteeing structural full rankness. The patsy documentation goes into this in much more detail if this is interesting to you.
[^2]: This assumes that categories are \"complete\", in that each unit has been assigned a category. You can \"complete\" categories by treating all those unassigned as being a member of an imputed \"null\" category.
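To see the default behaviour in action, here is a minimal sketch (with a hypothetical dataset of two overlapping categorical columns) in which each factor would individually span the intercept, and so each is encoded with reduced rank:

```python
import pandas
from formulaic import model_matrix

df = pandas.DataFrame({"A": ["a", "a", "b", "b"], "B": ["x", "y", "x", "y"]})

# One redundant column is dropped from each coding, keeping the result
# structurally full-rank: Intercept, A[T.b], B[T.y]
model_matrix("A + B", df)
```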
"},{"location":"guides/contrasts/#basic-usage","title":"Basic usage\u00b6","text":"Formulaic follows in the stead of R and Patsy by automatically inferring from the data whether a feature needs to categorically encoded. For example:
"},{"location":"guides/contrasts/#how-does-contrast-coding-work","title":"How does contrast coding work?\u00b6","text":"As you have seen, contrast coding transforms categorical vectors into a matrix of numbers that can be used during modeling. If your data has $K$ mutually exclusive categories, these matrices typically consist of $K-1$ columns. This reduction in dimensionality reflects the fact that membership of the $K$th category could be inferred from the lack of membership in any other category, and so is redundant in the presence of a global intercept. You can read more about this in the full rankness discussion below.
The first step toward generating numerical vectors from categorical data is to dummy encode it. This transforms the single vector of $K$ categories into $K$ boolean vectors, each having a $1$ only in rows that are members of the corresponding category. If you do not have a global intercept, you can use this dummy encoding directly, with the full $K$ columns, and contrasts are unnecessary. In the presence of a global intercept, however, you need to reduce the rank of your coding by thinking about contrasts (or differences) between the levels.
In practice, this dimension reduction using \"contrasts\" looks like constructing a $K \\times (K-1)$ \"coding matrix\" that describes the contrasts of interest. You can then post-multiply your dummy-encoded columns by it. That is: $$ E = DC $$ where $E \\in \\mathbb{R}^{N \\times (K-1)}$ is the contrast coded categorical data, $D \\in \\{0, 1\\}^{N \\times K}$ is the dummy encoded data, and $C \\in \\mathbb{R}^{K \\times (K-1)}$ is the coding matrix.
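As a small numpy sketch of this arithmetic (the data and the treatment-style coding matrix are illustrative assumptions; the treatment matrix itself is derived below):

```python
import numpy
import pandas

# Dummy encode a categorical vector with K = 3 levels: D is N x K.
D = pandas.get_dummies(pandas.Series(["a", "b", "c", "a"])).to_numpy(dtype=float)

# Treatment coding matrix (drop the first level): C is K x (K-1).
C = numpy.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Contrast coded data: E = DC is N x (K-1).
E = D @ C
```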
The easiest way to construct a coding matrix is to start with a \"coefficient matrix\" $Z \\in \\mathbb{R}^{K \\times K}$ which describes the contrasts that you want the coefficients of a trained linear regression model to represent (with columns representing the untransformed levels, and rows representing the transforms). For a consistently chosen set of contrasts, this matrix will be full-rank, and the inverse of this matrix will have a constant column representing the global intercept. Removing this column results in the $K \\times (K-1)$ coding matrix that should be applied to the dummy encoded data in order for the coefficients to have the desired interpretation.
For example, if we wanted all of the levels to be compared to the first level, we would build a matrix $Z$ as: $$ \\begin{align} Z =& \\left(\\begin{array}{c c c c} 1 & 0 & 0 & 0 \\\\ -1 & 1 & 0 & 0\\\\ -1 & 0 & 1 & 0\\\\ -1 & 0 & 0 & 1 \\end{array}\\right)\\\\ \\therefore Z^{-1} =& \\left(\\begin{array}{c c c c} 1 & 0 & 0 & 0 \\\\ 1 & 1 & 0 & 0\\\\ 1 & 0 & 1 & 0\\\\ 1 & 0 & 0 & 1 \\end{array}\\right)\\\\ \\implies C =& \\left(\\begin{array}{c c c} 0 & 0 & 0 \\\\ 1 & 0 & 0\\\\ 0 & 1 & 0\\\\ 0 & 0 & 1 \\end{array}\\right) \\end{align} $$ This is none other than the default \"treatment\" coding described below, which applies one-hot coding to the categorical data.
It is important to note that while your choice of contrast coding will change the interpretation and values of your coefficients, all contrast encodings ultimately result in equivalent regressions, and it is possible to retrospectively infer any other set of interesting contrasts given the regression covariance matrix. The task is therefore to find the most useful representation, not the \"correct\" one.
For those interested in reading more, the R Documentation on Coding Matrices covers this in more detail.
"},{"location":"guides/contrasts/#contrast-codings","title":"Contrast codings\u00b6","text":"This section introduces the contrast encodings that are shipped as part of Formulaic. These implementations live in formulaic.transforms.contrasts
, and are surfaced by default in formulae as an attribute of contr
(e.g. contr.treatment
, in order to be consistent with R). You can always implement your own contrasts if the need arises.
If you would like to dig deeper and see the actual contrast/coefficient matrices for various parameters, you can directly import these contrast implementations and play with them in a Python shell, but otherwise for brevity we will not exhaustively show these in the following documentation. For example:
"},{"location":"guides/contrasts/#treatment-aka-dummy","title":"Treatment (aka. dummy)\u00b6","text":"This contrast coding compares each level with some reference level. If not specified, the reference level is taken to be the first level. The reference level can be specified as the first argument to the TreatmentContrasts
/contr.treatment
constructor.
Example formulae:
~ X
: Assuming X
is categorical, the treatment encoding will be used by default.~ C(X)
: You can also explicitly flag a feature to be encoded as categorical, whereupon the default is treatment encoding.~ C(X, contr.treatment)
: Explicitly indicate that the treatment encoding should be used.~ C(X, contr.treatment(\"x\"))
: Indicate that the reference treatment should be \"x\" instead of the first index.~ C(X, contr.treatment(base=\"x\"))
: As above.The contrasts generated by this class are the same as the above, but with the reference level defaulting to the last level (the default in SAS).
Example formulae:
~ C(X, contr.SAS)
: Basic use-case.~ C(X, contr.SAS(\"x\"))
: Same as treatment encoding case above.~ C(X, contr.SAS(base=\"x\"))
: Same as treatment encoding case above.These contrasts compare each level (except the last, which is redundant) to the global average of all levels.
Example formulae:
~ C(X, contr.sum)
: Encode categorical data using the sum coding.These contrasts compare each successive level to the average of all previous/subsequent levels. It has two configurable parameters: reverse
which controls the direction of comparison, and scale
which controls whether to scale the encoding to simplify interpretation of coefficients (results in a floating point model matrix instead of an integer one). When reverse
is True
, the contrasts compare a level to all previous levels; and when False
, it compares it to all subsequent levels.
The default parameter values are chosen to match the R implementation, which corresponds to a reversed and unscaled Helmert coding.
Example formulae:
~ C(X, contr.helmert)
: Unscaled reverse coding.~ C(X, contr.helmert(reverse=False))
: Unscaled forward coding.~ C(X, contr.helmert(scale=True))
: Scaled reverse coding.~ C(X, contr.helmert(scale=True, reverse=False))
: Scaled forward coding.These contrasts take the difference of each level with the previous level. It has one parameter, forward
, which indicates that the difference should be inverted such that the difference is taken between the previous level and the current level. The default attribute values are chosen to match the R implementation, and correspond to a backward difference coding.
Example formulae:
~ C(X, contr.diff)
: Backward coding.~ C(X, contr.diff(forward=True))
: Forward coding.These contrasts represent a categorical variable that is assumed to have equal (or known) spacing/scores, and allow us to model non-linear polynomial behaviour of the dependent variable with respect to the ordered levels by projecting the spacing onto a basis of orthogonal polynomials. It has one parameter, scores
which indicates the spacing of the categories. It must have the same length as the number of levels. If not provided, the categories are assumed equidistant and spaced by 1.
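For example, a sketch under the assumption that contr.poly accepts the scores parameter described above (the data, levels, and scores here are hypothetical):

```python
import pandas
from formulaic import model_matrix

df = pandas.DataFrame({"X": ["low", "mid", "high"]})

# One score per level, in level order; here the levels are unequally spaced.
model_matrix("C(X, contr.poly(scores=[1, 2, 5]), levels=['low', 'mid', 'high'])", df)
```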
The feature names of categorical variables can become quite unwieldy, as you may have noticed. Fortunately this is easily remedied by aliasing the variable outside of your formula (and then making it available via formula context). This is done automatically if you use the model_matrix
function. For example:
It may be useful to define your own coding matrices in some contexts. This is readily achieved using the CustomContrasts
class directly or via the contr.custom
alias. In these cases, you are responsible for providing the coding matrix ($C$ from above). For example, if you had four levels: A, B, C and D, and wanted to compute the contrasts: B - A, C - B, and D - A, you could write:
This section of the documentation is intended to provide a high-level overview of the way in which formulae are interpreted and materialized by Formulaic.
Recall that the goal of a formula is to act as a recipe for building a \"model matrix\" (also known as a \"design matrix\") from an existing dataset. Following the recipe should result in a dataset that consists only of numerical columns that can be linearly combined to model an outcome/response of interest (the coefficients of which are typically estimated using linear regression). As such, this process will bake in any desired non-linearity via interactions or transforms, and will encode nominal/categorical/factor data as a collection of numerical contrasts.
The ingredients of each formula are the columns of the original dataset, and each operator acting on these columns in the formula should be thought of as inclusion/exclusion of the column in the resulting model matrix, or as a transformation on the column(s) prior to inclusion. Thus, a +
operator does not act in its usual algebraic manner, but rather acts as set union, indicating that both the left- and right-hand arguments should be included in the model matrix; a -
operator acts like a set difference; and so on.
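A minimal sketch of this set-like behaviour:

```python
from formulaic import Formula

# `+` unions terms into the set and `-` removes them again, so this
# formula reduces to the (implicit) intercept plus `b`:
Formula("a + b - a")  # -> 1 + b
```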
Formulas in Formulaic are represented by (subclasses of) the Formula
class. Instances of Formula
subclasses are ultimately containers for sets of Term
instances, which in turn are containers for sets of Factor
instances. Let's start our dissection at the bottom, and work our way up.
from formulaic.parser.types import Factor\n\nFactor(\n \"1\", eval_method=\"literal\"\n) # a factor that represents the numerical constant of 1\nFactor(\"a\") # a factor that will be looked up from the data context\nFactor(\n \"a + b\", eval_method=\"python\"\n) # a factor that will return the sum of `a` and `b`\nfrom formulaic.parser.types import Factor Factor( \"1\", eval_method=\"literal\" ) # a factor that represents the numerical constant of 1 Factor(\"a\") # a factor that will be looked up from the data context Factor( \"a + b\", eval_method=\"python\" ) # a factor that will return the sum of `a` and `b` Out[1]:
a + bIn\u00a0[2]: Copied!
from formulaic.parser.types import Term\n\nTerm(factors=[Factor(\"b\"), Factor(\"a\"), Factor(\"c\")])\nfrom formulaic.parser.types import Term Term(factors=[Factor(\"b\"), Factor(\"a\"), Factor(\"c\")]) Out[2]:
b:a:c
Note that to ensure uniqueness in the representation, the factor instances are sorted.
In\u00a0[3]: Copied!from formulaic import Formula\n\n# Unstructured formula (a simple list of terms)\nf = Formula(\n [\n Term(factors=[Factor(\"c\"), Factor(\"d\"), Factor(\"e\")]),\n Term(factors=[Factor(\"a\"), Factor(\"b\")]),\n ]\n)\nf\nfrom formulaic import Formula # Unstructured formula (a simple list of terms) f = Formula( [ Term(factors=[Factor(\"c\"), Factor(\"d\"), Factor(\"e\")]), Term(factors=[Factor(\"a\"), Factor(\"b\")]), ] ) f Out[3]:
a:b + c:d:e
Note that unstructured formulae are actually instances of SimpleFormula
(a Formula
subclass that acts like a mutable list of Term
instances):
type(f), list(f)\ntype(f), list(f) Out[4]:
(formulaic.formula.SimpleFormula, [a:b, c:d:e])
Also note that in its standard representation, the terms are separated by \"+\" which is interpreted as the set union in this context, and that (as we have seen for Term
instances) Formula
instances are sorted (the default is to sort terms only by interaction order, but this can be customized and/or disabled, as described below).
Structured formulae are constructed similarly:
In\u00a0[5]: Copied!f = Formula(\n [\n Term(factors=[Factor(\"root_col\")]),\n ],\n my_substructure=[\n Term(factors=[Factor(\"sub_col\")]),\n ],\n nested=Formula(\n [\n Term(factors=[Factor(\"nested_col\")]),\n Term(factors=[Factor(\"another_nested_col\")]),\n ],\n really_nested=[\n Term(factors=[Factor(\"really_nested_col\")]),\n ],\n ),\n)\nf\nf = Formula( [ Term(factors=[Factor(\"root_col\")]), ], my_substructure=[ Term(factors=[Factor(\"sub_col\")]), ], nested=Formula( [ Term(factors=[Factor(\"nested_col\")]), Term(factors=[Factor(\"another_nested_col\")]), ], really_nested=[ Term(factors=[Factor(\"really_nested_col\")]), ], ), ) f Out[5]:
root:\n root_col\n.my_substructure:\n sub_col\n.nested:\n root:\n nested_col + another_nested_col\n .really_nested:\n really_nested_col
Structured formulae are instances of StructuredFormula
:
type(f)\ntype(f) Out[6]:
formulaic.formula.StructuredFormula
And the sub-formula can be selected using:
In\u00a0[7]: Copied!f.root\nf.root Out[7]:
root_colIn\u00a0[8]: Copied!
f.nested\nf.nested Out[8]:
root:\n nested_col + another_nested_col\n.really_nested:\n really_nested_col
Formulae can also have different ordering conventions applied to them. By default, Formulaic follows R conventions around ordering whereby terms are sorted by their interaction degree (number of factors) and then by the order in which they were present in the term list. This behaviour can be modified to perform no ordering or full lexical sorting of terms and factors by passing _ordering=\"none\"
or _ordering=\"sort\"
respectively to the Formula
constructor. The default ordering is equivalent to passing _ordering=\"degree\"
. For example:
{\n \"degree\": Formula(\"z + z:a + z:b:a + g\"),\n \"none\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"none\"),\n \"sort\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"sort\"),\n}\n{ \"degree\": Formula(\"z + z:a + z:b:a + g\"), \"none\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"none\"), \"sort\": Formula(\"z + z:a + z:b:a + g\", _ordering=\"sort\"), } Out[9]:
{'degree': 1 + z + g + z:a + z:b:a,\n 'none': 1 + z + z:a + z:b:a + g,\n 'sort': 1 + g + z + a:z + a:b:z}
Formulaic intentionally makes the tokenization phase as unopinionated and unstructured as possible. This allows formula grammars to be extended via plugins using only high-level APIs (usually Operator
s).
The tokenizer's role is to take an arbitrary string representation of a formula and convert it into a series of Token
instances. The tokenization phase knows very little about formula grammar except that whitespace doesn't matter and that non-word characters should be treated as operators or context indicators. Interpretation of these tokens is left to the AST generation phase. There are five different kinds of tokens: operator, name, value, python, and context, the last of which is emitted for grouping characters such as parentheses () and square brackets []. The tokenizer treats text quoted with ` characters as a name token, and {} are used to quote Python operations.
An example of the tokens generated can be seen below:
In\u00a0[10]: Copied!from formulaic.parser import DefaultFormulaParser\n\n[\n f\"{token.token} : {token.kind.value}\"\n for token in (\n DefaultFormulaParser(include_intercept=False).get_tokens(\n \"y ~ 1 + b:log(c) | `d$in^df` + {e + f}\"\n )\n )\n]\nfrom formulaic.parser import DefaultFormulaParser [ f\"{token.token} : {token.kind.value}\" for token in ( DefaultFormulaParser(include_intercept=False).get_tokens( \"y ~ 1 + b:log(c) | `d$in^df` + {e + f}\" ) ) ] Out[10]:
['y : name',\n '~ : operator',\n '1 : value',\n '+ : operator',\n 'b : name',\n ': : operator',\n 'log(c) : python',\n '| : operator',\n 'd$in^df : name',\n '+ : operator',\n 'e + f : python']
The next phase is to assemble an abstract syntax tree (AST) from the tokens output from the above that when evaluated will generate the Term
instances we need to build a formula. This is done by using an enriched shunting yard algorithm which determines how to interpret each operator token based on the symbol used, the number and position of the non-operator arguments, and the current context (i.e. how many parentheses deep we are). This allows us to disambiguate between, for example, unary and binary addition operators. The available operators and their implementations are described in more detail in the Formula Grammar section of this documentation. It is worth noting that the available operators can be easily modified at runtime, and that this is typically all that needs to be modified in order to add new formula grammars.
The result is an AST that looks something like:
In\u00a0[11]: Copied!DefaultFormulaParser().get_ast(\"y ~ a + b:c\")\nDefaultFormulaParser().get_ast(\"y ~ a + b:c\") Out[11]:
<ASTNode ~: [y, <ASTNode +: [<ASTNode +: [1, a]>, <ASTNode :: [b, c]>]>]>
Now that we have the AST, we can readily evaluate it to generate the Term
instances we need to pass to our Formula
constructor. For example:
terms = DefaultFormulaParser(include_intercept=False).get_terms(\"y ~ a + b:c\")\nterms\nterms = DefaultFormulaParser(include_intercept=False).get_terms(\"y ~ a + b:c\") terms Out[12]:
.lhs:\n {y}\n.rhs:\n {a, b:c}In\u00a0[13]: Copied!
Formula(terms)\nFormula(terms) Out[13]:
.lhs:\n y\n.rhs:\n a + b:c
Of course, manually building the terms and passing them to the formula constructor is a bit annoying, and so instead we allow passing the string directly to the Formula
constructor; and allow you to override the default parser if you so desire (though 99.9% of the time this wouldn't be necessary).
Thus, we can generate the same formula from above using:
In\u00a0[14]: Copied!Formula(\"y ~ a + b:c\", _parser=DefaultFormulaParser(include_intercept=False))\nFormula(\"y ~ a + b:c\", _parser=DefaultFormulaParser(include_intercept=False)) Out[14]:
.lhs:\n y\n.rhs:\n a + b:c
Once you have a Formula
instance, the next logical step is to use it to materialize a model matrix. This is usually as simple as passing the raw data as an argument to .get_model_matrix()
:
import pandas\n\ndata = pandas.DataFrame(\n {\"a\": [1, 2, 3], \"b\": [4, 5, 6], \"c\": [7, 8, 9], \"A\": [\"a\", \"b\", \"c\"]}\n)\nFormula(\"a + b:c\").get_model_matrix(data)\nimport pandas data = pandas.DataFrame( {\"a\": [1, 2, 3], \"b\": [4, 5, 6], \"c\": [7, 8, 9], \"A\": [\"a\", \"b\", \"c\"]} ) Formula(\"a + b:c\").get_model_matrix(data) Out[15]: Intercept a b:c 0 1.0 1 28 1 1.0 2 40 2 1.0 3 54
Just as for formulae, the model matrices can be structured, and will be structured in the same way as the original formula. For example:
In\u00a0[16]: Copied!Formula(\"a\", group=\"b+c\").get_model_matrix(data)\nFormula(\"a\", group=\"b+c\").get_model_matrix(data) Out[16]:
root:\n Intercept a\n 0 1.0 1\n 1 1.0 2\n 2 1.0 3\n.group:\n b c\n 0 4 7\n 1 5 8\n 2 6 9
Under the hood, both of these calls have looked at the type of the data (pandas.DataFrame
here) and then looked up the FormulaMaterializer
associated with that type (PandasMaterializer
here), and then passed the formula and data along to the materializer for materialization. It is also possible to request a specific output type that varies by materializer (PandasMaterializer
supports \"pandas\", \"numpy\", and \"sparse\"). If one is not selected, the first available output type is selected for you. Thus, the above code is equivalent to:
from formulaic.materializers import PandasMaterializer\n\nPandasMaterializer(data).get_model_matrix(Formula(\"a + b:c\"), output=\"pandas\")\nfrom formulaic.materializers import PandasMaterializer PandasMaterializer(data).get_model_matrix(Formula(\"a + b:c\"), output=\"pandas\") Out[17]: Intercept a b:c 0 1.0 1 28 1 1.0 2 40 2 1.0 3 54
The return type of .get_model_matrix()
is either a ModelMatrix
instance if the original formula was unstructured, or a ModelMatrices
instance that is just a structured container for ModelMatrix
instances. However, ModelMatrix
is an ObjectProxy subclass, and so it also acts like the type of object requested. For example:
import numpy\n\nfrom formulaic import ModelMatrix\n\nmm = Formula(\"a + b:c\").get_model_matrix(data, output=\"numpy\")\nisinstance(mm, ModelMatrix), isinstance(mm, numpy.ndarray)\nimport numpy from formulaic import ModelMatrix mm = Formula(\"a + b:c\").get_model_matrix(data, output=\"numpy\") isinstance(mm, ModelMatrix), isinstance(mm, numpy.ndarray) Out[18]:
(True, True)
The main purpose of this additional proxy layer is to expose the ModelSpec
instance associated with the materialization, which retains all of the encoding choices made during materialization (for reuse in subsequent materializations), as well as metadata about the feature names of the current model matrix (which is very useful when your model matrix output type doesn't have column names, like numpy or sparse arrays). This ModelSpec
instance is always available via .model_spec
, and is introduced in more detail in the Model Specs section of this documentation.
mm.model_spec\nmm.model_spec Out[19]:
ModelSpec(formula=1 + a + b:c, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b:c, scoped_terms=[b:c], columns=['b:c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})
It is sometimes convenient to have the columns in the final model matrix be clustered by numerical factors included in the terms. This means that in regression reports, for example, all of the columns related to a particular feature of interest (including its interactions with various categorical features) are contiguously clustered. This is the default behaviour in patsy. You can perform this clustering in Formulaic by passing the cluster_by=\"numerical_factors\"
argument to model_matrix
or any of the .get_model_matrix(...)
methods. For example:
Formula(\"a + b + a:A + A:b\").get_model_matrix(data, cluster_by=\"numerical_factors\")\nFormula(\"a + b + a:A + A:b\").get_model_matrix(data, cluster_by=\"numerical_factors\") Out[20]: Intercept a a:A[T.b] a:A[T.c] b A[T.b]:b A[T.c]:b 0 1.0 1 0 0 4 0 0 1 1.0 2 2 0 5 5 0 2 1.0 3 0 3 6 0 6"},{"location":"guides/formulae/#anatomy-of-a-formula","title":"Anatomy of a Formula\u00b6","text":""},{"location":"guides/formulae/#factor","title":"Factor\u00b6","text":"
Factor
instances are the atomic unit of a formula, and represent the output of a single expression evaluation. Typically this will be one vector of data, but could also be more than one column (especially common with categorically encoded data).
A Factor
instance's expression can be evaluated in one of three ways: as a literal (the expression is treated as a literal value), as a lookup (the expression is looked up directly from the data context; the default), or as python (the expression is evaluated as Python code).
Note: Factor instances act as metadata only, and are not directly responsible for doing the evaluation. This is handled in a backend specific way by the appropriate Materializer
instance.
In code, instantiating a factor looks like:
"},{"location":"guides/formulae/#term","title":"Term\u00b6","text":"Term
instances are a thin wrapper around a set of Factor
instances, and represent the Cartesian (or Kronecker) product of the factors. If all of the Factor
instances evaluate to single columns, then the Term
represents the product of all of the factor columns.
Instantiating a Term
looks like:
Formula
instances are (potentially nested) wrappers around collections of Term
instances. During materialization into a model matrix, each Term
instance will have its columns independently inserted into the resulting matrix.
Formula
instances can consist of a single \"list\" of Term
instances, or may be \"structured\"; for example, we may want a separate collection of terms for the left- and right-hand side of a formula; or to simultaneously construct multiple model matrices for different parts of our modeling process.
For example, an unstructured formula might look like:
"},{"location":"guides/formulae/#parsed-formulae","title":"Parsed Formulae\u00b6","text":"While it would be possible to always manually construct Formula
instances in this way, it would quickly grow tedious. As you might have guessed from reading the quickstart or via other implementations, this is where Wilkinson formulae come in. Formulaic has a rich extensible formula parser that converts string expressions into the formula structures you see above. Where functionality and grammar overlap, it tries to conform to existing patterns found in R and patsy.
Formula parsing happens in three phases: tokenization of the formula string; assembly of the tokens into an abstract syntax tree (AST); and evaluation of that tree into the Term
instances we need. In the sections below these phases are described in more detail.
"},{"location":"guides/formulae/#tokenization","title":"Tokenization\u00b6","text":""},{"location":"guides/formulae/#abstract-syntax-tree-ast","title":"Abstract Syntax Tree (AST)\u00b6","text":""},{"location":"guides/formulae/#evaluation","title":"Evaluation\u00b6","text":""},{"location":"guides/formulae/#materialization","title":"Materialization\u00b6","text":""},{"location":"guides/grammar/","title":"Formula Grammar","text":"This section of the documentation describes the formula grammar used by Formulaic. It is almost identical that used by patsy and R, and so most formulas should work without modification. However, there are some differences, which are called out below.
"},{"location":"guides/grammar/#operators","title":"Operators","text":"In this section, we introduce a complete list of the grammatical operators that you can use by default in your formulas. They are listed such that each section (demarcated by \"-----\") has higher precedence then the block that follows. When you write a formula involving several operators of different precedence, those with higher precedence will be resolved first. \"Arity\" is the number of arguments the operator takes. Within operators of the same precedence, all binary operators are evaluated from left to right (they are left-associative). To highlight differences in grammar betweeh formulaic, patsy and R, we highlight any differences below. If there is a checkmark the Formulaic, Patsy and R columns, then the grammar is consistent across all three, unless otherwise indicated.
Operator Arity Description Formulaic Patsy R\"...\"
1 1 String literal. \u2713 \u2713 \u2717 [0-9]+\\.[0-9]+
1 1 Numerical literal. \u2713 \u2717 \u2717 `...`
1 1 Quotes fieldnames within the incoming dataframe, allowing the use of special characters, e.g. `my|special$column!`
\u2713 \u2717 \u2713 {...}
1 1 Quotes python operations, as a more convenient way to do Python operations than I(...)
, e.g. {`my|col`**2}
\u2713 \u2717 \u2717 <function>(...)
1 1 Python transform on column, e.g. my_func(x)
which is equivalent to {my_func(x)}
\u27132 \u2713 \u2717 ----- (...)
1 Groups operations, overriding normal precedence rules. All operations with the parentheses are performed before the result of these operations is permitted to be operated upon by its peers. \u2713 \u2717 \u2713 ----- .
9 0 Stands in as a wild-card for the sum of variables in the data not used on the left-hand side of a formula. \u2713 \u2717 \u2713 ----- **
2 Includes all n-th order interactions of the terms in the left operand, where n is the (integral) value of the right operand, e.g. (a+b+c)**2
is equivalent to a + b + c + a:b + a:c + b:c
. \u2713 \u2713 \u2713 ^
2 Alias for **
. \u2713 \u27173 \u2713 ----- :
2 Adds a new term that corresponds to the interaction of its operands (i.e. their elementwise product). \u27134 \u2713 \u2713 ----- *
2 Includes terms for each of the additive and interactive effects of the left and right operands, e.g. a * b
is equivalent to a + b + a:b
. \u2713 \u2713 \u2713 /
2 Adds terms describing nested effects. It expands to the addition of a new term for the left operand and the interaction of all left operand terms with the right operand, i.e a / b
is equivalent to a + a:b
, (a + b) / c
is equivalent to a + b + a:b:c
, and a/(b+c)
is equivalent to a + a:b + a:c
.5 \u2713 \u2713 \u2713 %in%
2 As above, but with arguments inverted: e.g. b %in% a
is equivalent to a / b
. \u2713 \u2717 \u2713 ----- +
2 Adds a new term to the set of features. \u2713 \u2713 \u2713 -
2 Removes a term from the set of features (if present). \u2713 \u2713 \u2713 +
1 Returns the current term unmodified (not very useful). \u2713 \u2713 \u2713 -
1 Negates a term (only implemented for 0, in which case it is replaced with 1
). \u2713 \u2713 \u2713 ----- \\|
2 Splits a formula into multiple parts, allowing the simultaneous generation of multiple model matrices. When on the right-hand-side of the ~
operator, all parts will attract an additional intercept term by default. \u2713 \u2717 \u27136 ----- ~
1,2 Separates the target features from the input features. If absent, it is assumed that we are considering only the input features. Unless otherwise indicated, it is assumed that the input features implicitly include an intercept. \u2713 \u2713 \u2713 [ . ~ . ]
2 [Experimental] Multi stage formula notation, which is useful in (e.g.) IV contexts. Requires the MULTISTAGE
feature flag to be passed to the parser. \u2713 \u2717 \u2717"},{"location":"guides/grammar/#transforms","title":"Transforms","text":"Formulaic supports arbitrary transforms, any of which can also preserve state so that new data can undergo the same transformation as that used during modelling. The currently implemented transforms are shown below. Commonly used transforms that have not been implemented by formulaic
are explicitly noted also.
I(...)
Identity transform, allowing arbitrary Python/R operations, e.g. I(x+y)
. Note that in formulaic
, it is more idiomatic to use {x+y}
. \u2713 \u2713 \u2713 Q('<column_name>')
Look up feature by potentially exotic name, e.g. Q('wacky name!')
. Note that in formulaic
, it is more idiomatic to use `wacky name!`
. \u2713 \u2713 \u2717 C(...)
Categorically encode a column, e.g. C(x)
\u2713 \u2713 \u2713 center(...)
Shift column data so mean is zero. \u2713 \u2713 \u2717 scale(...)
Shift column so mean is zero and variance is 1. \u2713 \u27137 \u2713 standardize(...)
Alias of scale
. \u27138 \u2713 \u2717 lag(...[, <k>])
Generate lagging or leading columns (useful for datasets collected at regular intervals). \u2713 \u2717 \u2713 poly(...)
Generates a polynomial basis, allowing non-linear fits. \u2713 \u2717 \u2713 bs(...)
Generates a B-Spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 cs(...)
Generates a natural cubic spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 cr(...)
Alias for cs
above. \u2713 \u2717 \u2713 cc(...)
Generates a cyclic cubic spline basis, allowing non-linear fits. \u2713 \u2713 \u2713 te(...)
Generates a tensor product smooth. \u2717 \u2713 \u2713 hashed(...)
Categorically encode a deterministic hash of a column. \u2713 \u2717 \u2717 ... Others? Contributions welcome! ? ? ? Tip
Any function available in the context
dictionary will also be available as transform, along with some commonly used functions imported from numpy: log
, log10
, log2
, exp
, exp10
, and exp2
. In addition the numpy
module is always available as np
. Thus, formulas like: log(y) ~ x + 10
will always do the right thing, even when these functions have not been made available in the user namespace.
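For instance, a small sketch of this behaviour (the dataset here is hypothetical):

```python
import pandas
from formulaic import model_matrix

df = pandas.DataFrame({"x": [1.0, 2.0, 3.0], "y": [1.0, 10.0, 100.0]})

# `log` resolves to numpy.log even though we never imported it here:
y, X = model_matrix("log(y) ~ x", df)
```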
Note
Formulaic does not (yet) support including extra terms in the formula that will not result in additions to the dataframe, for example model annotations like R's offset(...)
.
Beyond the formula operator grammar itself there are some differing behaviours and conventions of which you should be aware.
Formula
R package in that both sides of the ~
operator are considered to be using the formula grammar, with the only difference being that the right hand side attracts an intercept by default. In vanilla R, the left hand side is treated as R code (and so x + y ~ z
would result in a single column on the left-hand-side). You can recover vanilla R's behaviour by nesting the operations in a Python operator block (as described in the operator table): {y1 + y2} ~ a + b
.formulaic
with the same set of fields will always generate the same model matrix.b-1
and (b-1)
both do not have an intercept, whereas in Formulaic and Patsy the parentheses are resolved first, and so the first does not have an intercept and the second does (because '1 +' is implicitly prepended to the right hand side of the formula). This \"operator\" is actually part of the tokenisation process.
Formulaic additionally supports quoted fields with special characters, e.g. my_func(`my|special+column`)
.\u00a0\u21a9
The caret operator is not supported, but will not cause an error. It is ignored by the patsy formula parser, and treated as XOR Python operation on column.\u00a0\u21a9
Note that Formulaic also allows you to use this to scale columns, for example: 2.5:a
(this scaling happens after factor coding).\u00a0\u21a9
This somewhat confusing operator is useful when you want to include hierarchical features in your data, and where certain interaction terms do not make sense (particularly in ANOVA contexts). For example, if a
represents countries, and b
represents cities, then the full product of terms from a * b === a + b + a:b
does not make sense, because any value of b
is guaranteed to coincide with a value in a
, and does not independently add value. Thus, the operation a / b === a + a:b
results in a more sensible dataset. As a result, the /
operator is right-distributive, since if b
and c
were both nested in a
, you would want a/(b+c) === a + a:b + a:c
. Likewise, the operator is not left-distributive, since if c
is nested under both a
and b
separately, then you want (a + b)/c === a + b + a:b:c
. Lastly, if c
is nested in b
, and b
is nested in a
, then you would want a/b/c === a + a:(b/c) === a + a:b + a:b:c
.\u00a0\u21a9
Implemented by an R package called Formula that extends the default formula syntax.\u00a0\u21a9
Patsy uses the rescale
keyword rather than scale
, but provides the same functionality.\u00a0\u21a9
For increased compatibility with patsy, we use patsy's signature for standardize
.\u00a0\u21a9
Requires additional context to be passed in when directly using the Formula
constructor. e.g. Formula(\"y ~ .\", context={\"__formulaic_variables_available__\": [\"x\", \"y\", \"z\"]})
; or you can use model_matrix
, ModelSpec.get_model_matrix()
, or FormulaMaterializer.get_model_matrix()
without further specification.\u00a0\u21a9
As Formulaic matures it is expected that it will be integrated directly into downstream projects where formula parsing is required. This is known to have already happened in the following high-profile projects:
Where direct integration has not yet happened, you can still use Formulaic in conjunction with other commonly used libraries. On this page, we will add various examples of how to achieve this. If you have done some integration work, please feel free to submit a PR that extends this documentation!
statsmodels is a popular toolkit hosting many different statistical models, tests, and exploration tools. The formula API in statsmodels
is currently based on patsy
. If you need the features found in Formulaic, you can use it directly to generate the model matrices, and use the regular API. For example:
import pandas\nfrom statsmodels.api import OLS\n\nfrom formulaic import model_matrix\n\ndata = pandas.DataFrame({\"y\": [0.1, 0.4, 3], \"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]})\ny, X = model_matrix(\"y ~ a + b\", data)\nmodel = OLS(y, X)\nresults = model.fit()\nprint(results.summary())\nimport pandas from statsmodels.api import OLS from formulaic import model_matrix data = pandas.DataFrame({\"y\": [0.1, 0.4, 3], \"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]}) y, X = model_matrix(\"y ~ a + b\", data) model = OLS(y, X) results = model.fit() print(results.summary())
OLS Regression Results \n==============================================================================\nDep. Variable: y R-squared: 1.000\nModel: OLS Adj. R-squared: nan\nMethod: Least Squares F-statistic: nan\nDate: Fri, 14 Apr 2023 Prob (F-statistic): nan\nTime: 21:15:59 Log-Likelihood: 102.01\nNo. Observations: 3 AIC: -198.0\nDf Residuals: 0 BIC: -200.7\nDf Model: 2 \nCovariance Type: nonrobust \n==============================================================================\n coef std err t P>|t| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -0.7857 inf -0 nan nan nan\na 0.8857 inf 0 nan nan nan\nb[T.B] -0.5857 inf -0 nan nan nan\nb[T.C] 1.1286 inf 0 nan nan nan\n==============================================================================\nOmnibus: nan Durbin-Watson: 0.820\nProb(Omnibus): nan Jarque-Bera (JB): 0.476\nSkew: -0.624 Prob(JB): 0.788\nKurtosis: 1.500 Cond. No. 6.94\n==============================================================================\n\nNotes:\n[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n[2] The input rank is higher than the number of observations.\n
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 3 samples were given.\n warn(\"omni_normtest is not valid with less than 8 observations; %i \"\n/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: divide by zero encountered in divide\n return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)\n/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: invalid value encountered in scalar multiply\n return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)\n/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1687: RuntimeWarning: divide by zero encountered in scalar divide\n return np.dot(wresid, wresid) / self.df_resid\nIn\u00a0[2]: Copied!
from typing import Iterable, List, Optional\n\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn.pipeline import Pipeline\n\nfrom formulaic import Formula, FormulaSpec, ModelSpec\n\n\nclass FormulaicTransformer(TransformerMixin, BaseEstimator):\n def __init__(self, formula: FormulaSpec):\n self.formula: Formula = Formula.from_spec(formula)\n self.model_spec: Optional[ModelSpec] = None\n if self.formula._has_structure:\n raise ValueError(\n f\"Formula specification {repr(formula)} results in a structured formula, which is not supported.\"\n )\n\n def fit(self, X, y=None):\n \"\"\"\n Generate the initial model spec by which subsequent X's will be\n transformed.\n \"\"\"\n self.model_spec = self.formula.get_model_matrix(X).model_spec\n return self\n\n def transform(self, X, y=None):\n \"\"\"\n Transform `X` by generating a model matrix from it based on the fit\n model spec.\n \"\"\"\n if self.model_spec is None:\n raise RuntimeError(\n \"`FormulaicTransformer.fit()` must be called before `.transform()`.\"\n )\n X_ = self.model_spec.get_model_matrix(X)\n return X_\n\n def get_feature_names_out(\n self, input_features: Optional[Iterable[str]] = None\n ) -> List[str]:\n \"\"\"\n Expose model spec column names to scikit learn to allow column transforms later in the pipeline.\n \"\"\"\n if self.model_spec is None:\n raise RuntimeError(\n \"`FormulaicTransformer.fit()` must be called before columns can be assigned names.\"\n )\n return self.model_spec.column_names\n\n\npipe = Pipeline(\n [(\"formula\", FormulaicTransformer(\"x1 + x2 + x3\")), (\"model\", LinearRegression())]\n)\npipe_fit = pipe.fit(\n pandas.DataFrame({\"x1\": [1, 2, 3], \"x2\": [2, 3.4, 6], \"x3\": [7, 3, 1]}),\n y=pandas.Series([1, 3, 5]),\n)\npipe_fit\n# Note: You could optionally serialize `pipe_fit` here.\n# Then: Use the pipe to predict outcomes for new data.\nfrom typing import Iterable, List, Optional from sklearn.base import BaseEstimator, TransformerMixin from sklearn.linear_model import LinearRegression from sklearn.pipeline import Pipeline from formulaic import Formula, FormulaSpec, ModelSpec class FormulaicTransformer(TransformerMixin, BaseEstimator): def __init__(self, formula: FormulaSpec): self.formula: Formula = Formula.from_spec(formula) self.model_spec: Optional[ModelSpec] = None if self.formula._has_structure: raise ValueError( f\"Formula specification {repr(formula)} results in a structured formula, which is not supported.\" ) def fit(self, X, y=None): \"\"\" Generate the initial model spec by which subsequent X's will be transformed. \"\"\" self.model_spec = self.formula.get_model_matrix(X).model_spec return self def transform(self, X, y=None): \"\"\" Transform `X` by generating a model matrix from it based on the fit model spec. \"\"\" if self.model_spec is None: raise RuntimeError( \"`FormulaicTransformer.fit()` must be called before `.transform()`.\" ) X_ = self.model_spec.get_model_matrix(X) return X_ def get_feature_names_out( self, input_features: Optional[Iterable[str]] = None ) -> List[str]: \"\"\" Expose model spec column names to scikit learn to allow column transforms later in the pipeline. 
\"\"\" if self.model_spec is None: raise RuntimeError( \"`FormulaicTransformer.fit()` must be called before columns can be assigned names.\" ) return self.model_spec.column_names pipe = Pipeline( [(\"formula\", FormulaicTransformer(\"x1 + x2 + x3\")), (\"model\", LinearRegression())] ) pipe_fit = pipe.fit( pandas.DataFrame({\"x1\": [1, 2, 3], \"x2\": [2, 3.4, 6], \"x3\": [7, 3, 1]}), y=pandas.Series([1, 3, 5]), ) pipe_fit # Note: You could optionally serialize `pipe_fit` here. # Then: Use the pipe to predict outcomes for new data. Out[2]:
Pipeline(steps=[('formula', FormulaicTransformer(formula=1 + x1 + x2 + x3)),\n ('model', LinearRegression())])"},{"location":"guides/integration/#statsmodels","title":"StatsModels\u00b6","text":""},{"location":"guides/integration/#scikit-learn","title":"Scikit-Learn\u00b6","text":"
scikit-learn is a very popular machine learning toolkit for Python. You can use Formulaic directly, as for statsmodels
, or as a module in scikit-learn pipelines along the lines of:
Sooner or later, you will encounter datasets with null values, and it is important to know how their presence will impact your modeling. Formulaic model matrix materialization procedures allow you to specify how you want nulls to be handled. You can either: drop rows containing null values, ignore the nulls (allowing them to propagate into the model matrix), or raise an exception when they are encountered.
You can specify the desired behaviour by passing an NAAction
enum value (or string value thereof) to the materialization methods (model_matrix
, and *.get_model_matrix()
). Examples of each of these approaches are shown below.
from pandas import Categorical, DataFrame\n\nfrom formulaic import model_matrix\nfrom formulaic.materializers import NAAction\n\ndf = DataFrame(\n {\n \"c\": [1, 2, None, 4, 5],\n \"C\": Categorical(\n [\"a\", \"b\", \"c\", None, \"e\"], categories=[\"a\", \"b\", \"c\", \"d\", \"e\"]\n ),\n }\n)\n\nmodel_matrix(\"c + C\", df, na_action=NAAction.DROP)\n# Equivalent to:\n# * model_matrix(\"c + C\", df)\n# * model_matrix(\"c + C\", df, na_action=\"drop\")\nfrom pandas import Categorical, DataFrame from formulaic import model_matrix from formulaic.materializers import NAAction df = DataFrame( { \"c\": [1, 2, None, 4, 5], \"C\": Categorical( [\"a\", \"b\", \"c\", None, \"e\"], categories=[\"a\", \"b\", \"c\", \"d\", \"e\"] ), } ) model_matrix(\"c + C\", df, na_action=NAAction.DROP) # Equivalent to: # * model_matrix(\"c + C\", df) # * model_matrix(\"c + C\", df, na_action=\"drop\") Out[36]: Intercept c C[T.b] C[T.c] C[T.d] C[T.e] 0 1.0 1.0 0 0 0 0 1 1.0 2.0 1 0 0 0 4 1.0 5.0 0 0 0 1
You can also specify additional rows to drop using the drop_rows
argument:
model_matrix(\"c + C\", df, drop_rows={0, 4})\nmodel_matrix(\"c + C\", df, drop_rows={0, 4}) Out[24]: Intercept c C[T.b] C[T.c] C[T.d] C[T.e] 1 1.0 2.0 1 0 0 0
Note that the set passed to drop_rows
is expected to be mutable, as it will also be updated with the indices of rows dropped automatically, which can be useful if you need to keep track of this information outside of the materialization procedure.
drop_rows = {0, 4}\nmodel_matrix(\"c + C\", df, drop_rows=drop_rows)\ndrop_rows\ndrop_rows = {0, 4} model_matrix(\"c + C\", df, drop_rows=drop_rows) drop_rows Out[25]:
{0, np.int64(2), np.int64(3), 4}In\u00a0[31]: Copied!
model_matrix(\"c + C\", df, na_action=\"ignore\")\nmodel_matrix(\"c + C\", df, na_action=\"ignore\") Out[31]: Intercept c C[T.b] C[T.c] C[T.d] C[T.e] 0 1.0 1.0 0 0 0 0 1 1.0 2.0 1 0 0 0 2 1.0 NaN 0 1 0 0 3 1.0 4.0 0 0 0 0 4 1.0 5.0 0 0 0 1
Note the NaN
in the c
column, and that NaN
does NOT appear in the dummy coding of C on row 3, consistent with standard implementations of dummy coding. This could result in misleading model estimates, so care should be taken.
You can combine this with drop_rows
, as described above, to manually filter out the null values you are concerned about.
try:\n model_matrix(\"c + C\", df, na_action=\"raise\")\nexcept Exception as e:\n print(e)\ntry: model_matrix(\"c + C\", df, na_action=\"raise\") except Exception as e: print(e)
Error encountered while checking for nulls in `C`: `C` contains null values after evaluation.\n
As with ignoring nulls above, you can combine this raising behaviour with drop_rows
to manually filter out the null values that you feel you can safely ignore, and then raise if any additional null values make it into your data.
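For example, a sketch reusing the df from above (whose nulls sit at row indices 2 and 3):

```python
from pandas import Categorical, DataFrame
from formulaic import model_matrix

df = DataFrame(
    {
        "c": [1, 2, None, 4, 5],
        "C": Categorical(["a", "b", "c", None, "e"], categories=["a", "b", "c", "d", "e"]),
    }
)

# Drop the rows whose nulls we already know about, and let `raise` guard
# against any further (unexpected) null values:
drop_rows = {2, 3}
model_matrix("c + C", df, na_action="raise", drop_rows=drop_rows)
```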
NAAction.DROP
, or \"drop\"
)\u00b6","text":"This the default behaviour, and will result in any row with a null in any column that is being used by the materialization being dropped from the resulting dataset. For example:
"},{"location":"guides/missing_data/#ignore-nulls-naactionignore-or-ignore","title":"Ignore nulls (NAAction.IGNORE
, or \"ignore\"
)\u00b6","text":"If your modeling toolkit can handle the presence of nulls, or you otherwise want to keep them in the dataset, you can pass na_action = \"ignore\"
to the materialization methods. This will allow null values to remain in columns, and take no action to prevent the propagation of nulls.
NAAction.RAISE
or \"raise\"
)\u00b6","text":"If you are unwilling to risk the perils of dropping or ignoring null values, you can instead opt to raise an exception whenever a null value is found. This can prevent yourself from accidentally biasing your model, but also makes your code more brittle. For example:
"},{"location":"guides/model_specs/","title":"Model Specs","text":"While Formula
instances (discussed in How it works) are the source of truth for abstract user intent, ModelSpec
instances are the source of truth for the materialization process; and bundle a Formula
instance with explicit metadata about the encoding choices that were made (or should be made) when a formula was (or will be) materialized. As soon as materialization begins, Formula
instances are upgraded into ModelSpec
instances, and any missing metadata is attached as decisions are made during the materialization process.
Besides acting as runtime state during materialization, it serves two main purposes: it allows encoding choices to be reused on new data, and it surfaces metadata about the materialized matrices. Once a Formula
has been materialized, you can use the generated ModelSpec
instance to repeat the process on similar datasets, being confident that the encoding choices will be identical. This is especially useful during out-of-sample prediction, where you need to prepare the out-of-sample data in exactly the same way as the training data for the predictions to be valid. In the remainder of this portion of the documentation, we will introduce how to leverage the metadata stored inside ModelSpec
instances derived from materializations, and for more advanced programmatic use-cases, how to manually build a ModelSpec
.
# Let's get ourselves a simple `ModelMatrix` instance to play with.\nfrom pandas import DataFrame\n\nfrom formulaic import model_matrix\n\nmm = model_matrix(\"center(a) + b\", DataFrame({\"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]}))\nmm\n# Let's get ourselves a simple `ModelMatrix` instance to play with. from pandas import DataFrame from formulaic import model_matrix mm = model_matrix(\"center(a) + b\", DataFrame({\"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]})) mm Out[1]: Intercept center(a) b[T.B] b[T.C] 0 1.0 -1.0 0 0 1 1.0 0.0 1 0 2 1.0 1.0 0 1 In\u00a0[2]: Copied!
# And extract the model spec from it\nms = mm.model_spec\nms\n# And extract the model spec from it ms = mm.model_spec ms Out[2]:
ModelSpec(formula=1 + center(a) + b, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=center(a), scoped_terms=[center(a)], columns=['center(a)']), EncodedTermStructure(term=b, scoped_terms=[b-], columns=['b[T.B]', 'b[T.C]'])], transform_state={'center(a)': {'ddof': 1, 'center': np.float64(2.0), 'scale': None}}, encoder_state={'center(a)': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C'], 'contrasts': ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])})})In\u00a0[3]: Copied!
# We can now interrogate it for various column, factor, term, and variable related metadata\n{\n \"column_names\": ms.column_names,\n \"column_indices\": ms.column_indices,\n \"terms\": ms.terms,\n \"term_indices\": ms.term_indices,\n \"term_slices\": ms.term_slices,\n \"term_factors\": ms.term_factors,\n \"term_variables\": ms.term_variables,\n \"factors\": ms.factors,\n \"factor_terms\": ms.factor_terms,\n \"factor_variables\": ms.factor_variables,\n \"factor_contrasts\": ms.factor_contrasts,\n \"variables\": ms.variables,\n \"variable_terms\": ms.variable_terms,\n \"variable_indices\": ms.variable_indices,\n \"variables_by_source\": ms.variables_by_source,\n}\n# We can now interrogate it for various column, factor, term, and variable related metadata { \"column_names\": ms.column_names, \"column_indices\": ms.column_indices, \"terms\": ms.terms, \"term_indices\": ms.term_indices, \"term_slices\": ms.term_slices, \"term_factors\": ms.term_factors, \"term_variables\": ms.term_variables, \"factors\": ms.factors, \"factor_terms\": ms.factor_terms, \"factor_variables\": ms.factor_variables, \"factor_contrasts\": ms.factor_contrasts, \"variables\": ms.variables, \"variable_terms\": ms.variable_terms, \"variable_indices\": ms.variable_indices, \"variables_by_source\": ms.variables_by_source, } Out[3]:
{'column_names': ('Intercept', 'center(a)', 'b[T.B]', 'b[T.C]'),\n 'column_indices': {'Intercept': 0, 'center(a)': 1, 'b[T.B]': 2, 'b[T.C]': 3},\n 'terms': [1, center(a), b],\n 'term_indices': {1: [0], center(a): [1], b: [2, 3]},\n 'term_slices': {1: slice(0, 1, None),\n center(a): slice(1, 2, None),\n b: slice(2, 4, None)},\n 'term_factors': {1: {1}, center(a): {center(a)}, b: {b}},\n 'term_variables': {1: set(), center(a): {'a', 'center'}, b: {'b'}},\n 'factors': {1, b, center(a)},\n 'factor_terms': {1: {1}, center(a): {center(a)}, b: {b}},\n 'factor_variables': {b: {'b'}, 1: set(), center(a): {'a', 'center'}},\n 'factor_contrasts': {b: ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])},\n 'variables': {'a', 'b', 'center'},\n 'variable_terms': {'center': {center(a)}, 'a': {center(a)}, 'b': {b}},\n 'variable_indices': {'center': [1], 'a': [1], 'b': [2, 3]},\n 'variables_by_source': {'transforms': {'center'}, 'data': {'a', 'b'}}}In\u00a0[4]: Copied!
# And use it to select out various parts of the model matrix; here the columns\n# produced by the `b` term.\nmm.iloc[:, ms.term_indices[\"b\"]]\n# And use it to select out various parts of the model matrix; here the columns # produced by the `b` term. mm.iloc[:, ms.term_indices[\"b\"]] Out[4]: b[T.B] b[T.C] 0 0 0 1 1 0 2 0 1
Some of this metadata may seem redundant at first, but it is essential when the generated model matrix does not natively support indexing by names; for example:
In\u00a0[5]: Copied!mm_numpy = model_matrix(\n \"center(a) + b\", DataFrame({\"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]}), output=\"numpy\"\n)\nmm_numpy\nmm_numpy = model_matrix( \"center(a) + b\", DataFrame({\"a\": [1, 2, 3], \"b\": [\"A\", \"B\", \"C\"]}), output=\"numpy\" ) mm_numpy Out[5]:
array([[ 1., -1., 0., 0.],\n [ 1., 0., 1., 0.],\n [ 1., 1., 0., 1.]])In\u00a0[6]: Copied!
ms_numpy = mm_numpy.model_spec\nmm_numpy[:, ms_numpy.term_indices[\"b\"]]\nms_numpy = mm_numpy.model_spec mm_numpy[:, ms_numpy.term_indices[\"b\"]] Out[6]:
array([[0., 0.],\n [1., 0.],\n [0., 1.]])In\u00a0[7]: Copied!
ms.get_model_matrix(DataFrame({\"a\": [4, 5, 6], \"b\": [\"A\", \"B\", \"D\"]}))\nms.get_model_matrix(DataFrame({\"a\": [4, 5, 6], \"b\": [\"A\", \"B\", \"D\"]}))
/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:169: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {'D'}. They are being cast to nan, which will likely skew the results of your analyses.\n warnings.warn(\nOut[7]: Intercept center(a) b[T.B] b[T.C] 0 1.0 2.0 0 0 1 1.0 3.0 1 0 2 1.0 4.0 0 0
Notice that when the assumptions of the stateful transforms are violated, warnings and/or exceptions will be generated.
You can also pass the ModelSpec directly to model_matrix, for example:
model_matrix(ms, data=DataFrame({"a": [4, 5, 6], "b": ["A", "A", "A"]}))

Out[8]:
   Intercept  center(a)  b[T.B]  b[T.C]
0        1.0        2.0       0       0
1        1.0        3.0       0       0
2        1.0        4.0       0       0

from formulaic import ModelSpec

ms = ModelSpec("a+b+c", output="numpy", ensure_full_rank=False)
ms

Out[9]:
ModelSpec(formula=1 + a + b + c, materializer=None, materializer_params=None, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})

import pandas

mm = ms.get_model_matrix(
    pandas.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
)
mm

Out[10]:
array([[1., 1., 4., 7.],
       [1., 2., 5., 8.],
       [1., 3., 6., 9.]])

mm.model_spec

Out[11]:
ModelSpec(formula=1 + a + b + c, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=c, scoped_terms=[c], columns=['c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})
Notice that any fields not provided by the user are imputed automatically.
from formulaic import Formula, ModelSpecs

ModelSpecs(
    ModelSpec("a"), substructure=ModelSpec("b"), another_substructure=ModelSpec("c")
)

Out[12]:
root:
    ModelSpec(formula=1 + a, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.substructure:
    ModelSpec(formula=1 + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.another_substructure:
    ModelSpec(formula=1 + c, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})

ModelSpec.from_spec(Formula(lhs="y", rhs="a + b"))

Out[13]:
.lhs:
    ModelSpec(formula=y, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.rhs:
    ModelSpec(formula=a + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
Some operations, such as ModelSpec.subset(...), are also accessible in a mapped way (e.g. via ModelSpecs.subset(...)). You can find documentation for the complete set of available methods using help(ModelSpecs).
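For example, a minimal sketch of subsetting every spec in a structured collection at once (the formula passed to subset here is illustrative only):

from formulaic import ModelSpec, ModelSpecs

specs = ModelSpecs(ModelSpec("a + b"), substructure=ModelSpec("a + c"))

# Mapped operation: applies ModelSpec.subset to each spec in the structure,
# returning a similarly structured collection of subsetted specs.
subsetted = specs.subset("a")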
ModelSpec and ModelSpecs instances have been designed to support serialization via the standard pickling process offered by Python. This allows model specs to be persisted into storage and reloaded at a later time, or used in multiprocessing scenarios.
Serialized model specs are not guaranteed to work between different versions of Formulaic. While they will work in the vast majority of cases, the internal state of transforms is free to change from version to version, and may invalidate previously serialized model specs. Efforts will be made to reduce the likelihood of this, and when it happens it should be indicated in the changelogs.
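For example, a minimal sketch of persisting a populated model spec with the standard library's pickle module:

import pickle

# Persist the model spec generated during training...
with open("model_spec.pkl", "wb") as f:
    pickle.dump(mm.model_spec, f)

# ...and later reload it to materialize new data in exactly the same way.
with open("model_spec.pkl", "rb") as f:
    ms_restored = pickle.load(f)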
"},{"location":"guides/model_specs/#anatomy-of-a-modelspec-instance","title":"Anatomy of aModelSpec
instance.\u00b6","text":"As noted above, a ModelSpec
is the complete specification and record of the materialization process, combining all user-specified parameters with the runtime state of the materializer. In particular, ModelSpec
instances have the following explicitly specifiable attributes:
Often, only formula
is explicitly specified, and the rest is inferred on the user's behalf.
ModelSpec instances also have derived properties and methods that you can use to introspect the structure of generated model matrices. These derived methods assume that the ModelSpec has been fully populated, and thus usually only make sense to consider on ModelSpec instances that are attached to a ModelMatrix. They are:
column_names: The names of the columns generated by this model spec; these are the same names as the keys of .column_indices.
column_indices: An ordered mapping of column names to the indices of the generated columns.
terms: The Term instances that were used to generate this model matrix.
term_indices: An ordered mapping of Term instances to the generated column indices; terms can also be looked up using formulae.
term_slices: An ordered mapping of Term instances to a slice that when used on the columns of the model matrix will subsample the model matrix down to those corresponding to each term.
term_factors: An ordered mapping of Term instances to the set of factors used by that term.
term_variables: An ordered mapping of Term instances to Variable instances (a string subclass with additional attributes of roles and source), indicating the variables used by that term.
factors: The set of Factor instances used in the entire formula.
factor_terms: A mapping of Factor instances to the Term instances that used them.
factor_variables: A mapping of Factor instances to Variable instances, corresponding to the variables used by that factor.
factor_contrasts: A mapping of Factor instances to ContrastsState instances that can be used to reproduce the coding matrices used during materialization.
variables: The set of Variable instances describing the variables used in the entire formula.
variable_terms: The reverse mapping of .term_variables.
variable_indices: A mapping of each Variable instance to the indices of the columns in the model matrix that use that variable.
variables_by_source: A mapping of variable source ("data", "context", or "transforms") to the variables derived from that source.
get_slice(...): Build a slice instance that can be used to subset a matrix down to the columns associated with a Term instance, its string representation, a column name, or pre-specified ints/slices.
update(...): Create a copy of this ModelSpec instance with the nominated attributes mutated.
subset(...): Create a copy of this ModelSpec instance with its structure subset to correspond to the strict subset of terms indicated by a formula specification.
We'll cover some of these attributes and methods in examples below, but you can always refer to help(ModelSpec) for more details.
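For instance, get_slice can stand in for a manual term_indices lookup; a minimal sketch reusing the ms and mm pair from the metadata example above (get_slice accepts a term's string representation, per its description):

b_slice = ms.get_slice("b")  # slice covering the columns of the `b` term
mm.iloc[:, b_slice]          # equivalent to mm.iloc[:, ms.term_indices["b"]]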
ModelSpec
as metadata","text":"One of the most common use-cases for ModelSpec instances is as metadata describing a generated model matrix. This metadata can be used to programmatically access the appropriate features of the model matrix, for example to assign sensible names to the coefficients fit during a regression.
Another common use-case for ModelSpec instances is replaying the same materialization process used to prepare a training dataset on a new dataset. Since the ModelSpec instance stores all relevant choices made during materialization, achieving this is as simple as using the ModelSpec to generate the new model matrix.
By way of example, recall from the section above that we used the formula center(a) + b, where a was a numerical vector and b was a categorical vector. When generating model matrices for subsequent datasets it is very important to use the same centering as was used during the initial model matrix generation, and not just center the incoming data again. Likewise, b should be aware of which categories were present during the initial training, and ensure that the same columns are created during subsequent materializations (otherwise the model matrices will not be of the same form, and cannot be used for predictions, etc.). These kinds of transforms that require memory are called \"stateful transforms\" in Formulaic, and are described in more detail in the Transforms documentation.
We can see this in action below:
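A minimal sketch of this replay pattern (mirroring the example cells shown elsewhere on this page):

import pandas
from formulaic import model_matrix

train = pandas.DataFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]})
mm_train = model_matrix("center(a) + b", train)

# Replaying the spec on new data reuses the remembered mean of `a` and the
# remembered categories of `b`, rather than recomputing them from `new`.
new = pandas.DataFrame({"a": [4, 5, 6], "b": ["A", "B", "C"]})
mm_new = mm_train.model_spec.get_model_matrix(new)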
"},{"location":"guides/model_specs/#directly-constructing-modelspec-instances","title":"Directly constructingModelSpec
instances\u00b6","text":"It is possible to directly construct Model Matrices, and to prepopulate them with various choices (e.g. output types, materializer, etc). You could even, in principle, populate them with state information (but this is not recommended; it is easy to make mistakes here, and is likely better to encode these choices into the formula itself where possible). For example:
"},{"location":"guides/model_specs/#structured-modelspecs","title":"StructuredModelSpecs
\u00b6","text":"As discussed in How it works, formulae can be arbitrarily structured, resulting in a similarly structured set of model matrices. ModelSpec
instances can also be arranged into a structured collection using ModelSpecs
, allowing different choices to be made at different levels of the structure. You can either create these structures yourself, or inherit the structure from a formula. For example:
This document provides high-level documentation on how to get started using Formulaic.
import pandas

from formulaic import model_matrix

df = pandas.DataFrame(
    {
        "y": [0, 1, 2],
        "a": ["A", "B", "C"],
        "b": [0.3, 0.1, 0.2],
    }
)

y, X = model_matrix("y ~ a + b + a:b", df)
# This is short-hand for:
# y, X = formulaic.Formula('y ~ a + b + a:b').get_model_matrix(df)

y

Out[2]:
   y
0  0
1  1
2  2

X

Out[3]:
   Intercept  a[T.B]  a[T.C]    b  a[T.B]:b  a[T.C]:b
0        1.0       0       0  0.3       0.0       0.0
1        1.0       1       0  0.1       0.1       0.0
2        1.0       0       1  0.2       0.0       0.2
You will notice that the categorical values for a have been one-hot (aka dummy) encoded, and to ensure structural full-rankness of X[^1], one level has been dropped from a. For more details about how this guarantees that the matrix is full-rank, please refer to the excellent patsy documentation. If you are not using the model matrices for regression, and don't care if the matrix is not full-rank, you can pass ensure_full_rank=False:
X = model_matrix(\"a + b + a:b\", df, ensure_full_rank=False)\nX\nX = model_matrix(\"a + b + a:b\", df, ensure_full_rank=False) X Out[4]: Intercept a[T.A] a[T.B] a[T.C] b a[T.A]:b a[T.B]:b a[T.C]:b 0 1.0 1 0 0 0.3 0.3 0.0 0.0 1 1.0 0 1 0 0.1 0.0 0.1 0.0 2 1.0 0 0 1 0.2 0.0 0.0 0.2
Note that the dropped level in a has been restored.
There is a rich trove of information about the columns and structure of the model matrix stored in the ModelSpec instance attached to the model matrix, for example:
X.model_spec

Out[5]:
ModelSpec(formula=1 + a + b + a:b, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a[T.A]', 'a[T.B]', 'a[T.C]']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=a:b, scoped_terms=[a:b], columns=['a[T.A]:b', 'a[T.B]:b', 'a[T.C]:b'])], transform_state={}, encoder_state={'a': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C']}), 'b': (<Kind.NUMERICAL: 'numerical'>, {})})
You can read more about the model specs in the Model Specs documentation.
In\u00a0[6]: Copied!X = model_matrix(\"a + b + a:b\", df, output=\"sparse\")\nX\nX = model_matrix(\"a + b + a:b\", df, output=\"sparse\") X Out[6]:
<3x6 sparse matrix of type '<class 'numpy.float64'>'\n\twith 10 stored elements in Compressed Sparse Column format>
In this example, X is a $ 3 \\times 6 $ scipy.sparse.csc_matrix instance.
Since sparse matrices do not have labels for columns, you can look these up from the model spec described above; for example:
In\u00a0[7]: Copied!X.model_spec.column_names\nX.model_spec.column_names Out[7]:
('Intercept', 'a[T.B]', 'a[T.C]', 'b', 'a[T.B]:b', 'a[T.C]:b')
[^1]: X must be full-rank in order for the regression algorithm to invert a matrix derived from X.
In formulaic, the simplest way to build your model matrices is to use the high-level model_matrix function:
By default, the generated model matrices are dense. In some cases, particularly in large datasets with many categorical features, dense model matrices become hugely memory inefficient (since most entries of the data will be zero). Formulaic allows you to directly generate sparse model matrices using:
"},{"location":"guides/splines/","title":"Spline Encoding","text":"Formulaic offers several spline encoding transforms that allow you to model non-linear responses to continuous variables using linear models. They are:
poly: projection onto orthogonal polynomial basis functions.
basis_spline (bs in formulae): projection onto a basis spline (B-spline) basis.
cubic_spline (cr/cs and cc in formulae): projection onto natural or cyclic cubic spline bases.
These are all implemented as stateful transforms and described in more detail below.
import matplotlib.pyplot as plt
import numpy
import pandas
from statsmodels.api import OLS

from formulaic import model_matrix

# Build some data, and hard-code "y" as a quartic function with some Gaussian noise.
data = pandas.DataFrame(
    {
        "x": numpy.linspace(0.0, 1.0, 100),
    }
).assign(
    y=lambda df: df.x
    + 0.2 * df.x**2
    - 0.7 * df.x**3
    + 3 * df.x**4
    + 0.1 * numpy.random.randn(100)
)

# Generate a model matrix with a polynomial coding in "x".
y, X = model_matrix("y ~ poly(x, degree=3)", data, output="numpy")

# Fit coefficients for the intercept and polynomial basis.
coeffs = OLS(y[:, 0], X).fit().params

# Plot each basis function weighted by its fitted coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit polynomial")

plt.legend();
# Generate a model matrix with a basis spline in "x".
y, X = model_matrix("y ~ bs(x, df=4, degree=3)", data, output="numpy")

# Fit coefficients for the intercept and B-spline basis.
coeffs = OLS(y[:, 0], X).fit().params

# Plot each B-spline basis function weighted by its fitted coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit spline itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit spline")

plt.legend();
x = numpy.linspace(0.0, 2 * numpy.pi, 100)

data = pandas.DataFrame(
    {
        "x": x,
    }
).assign(
    y=lambda df: 2
    + numpy.sin(x)
    - x * numpy.sin(2 * x)
    + 4 * numpy.sin(x / 7)
    + 0.1 * numpy.random.randn(100)
)

# Generate a model matrix with a cyclic cubic spline coding in "x".
y, X = model_matrix("y ~ 1 + cc(x, df=4, constraints='center')", data, output="numpy")

# Fit coefficients for the intercept and cyclic cubic spline basis.
coeffs = OLS(y[:, 0], X).fit().params

# Plot each basis function weighted by its fitted coefficient.
plt.plot(
    data.x,
    X * coeffs,
    label=[name + " (weighted)" for name in X.model_spec.column_names],
)
# Plot the raw observations.
plt.scatter(data.x, data.y, marker="x", label="Raw observations")
# Plot the fit spline itself (the sum of the weighted basis functions).
plt.plot(data.x, numpy.dot(X, coeffs), color="k", linewidth=3, label="Fit spline")

plt.legend();
"},{"location":"guides/splines/#poly","title":"
poly
\u00b6","text":"The simplest way to generate a non-linear response in a linear model is to include higher order powers of numerical variables. For example, you might want to include: $x$, $x^2$, $x^3$, $\\ldots$, $x^n$. However, these features are not orthogonal, and so adding each term one by one in a regression model will lead to all previously trained coefficients changing. Especially in exploratory analysis, this can be frustrating, and that's where poly
comes in. By default, poly
iteratively builds orthogonal polynomial features up to the order specified.
poly
has two parameters:
For those who are mathematically inclined, this transform is an implementation of the \"three-term recurrence relation\" for monic orthogonal polynomials. There are many good introductions to these recurrence relations, including (at the time of writing): https://dec41.user.srcf.net/h/IB_L/numerical_analysis/2_3. Another common approach is QR factorisation, where the columns of Q are the orthogonal basis vectors. A pre-existing implementation of this can be found in numpy; however, our implementation outperforms numpy's QR decomposition, and does not require needless computation of the R matrix. It should also be noted that orthogonal polynomial bases are unique up to the choice of inner product and scaling, and so all methods will result in the same set of polynomials.
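This orthogonality is easy to check empirically; a minimal sketch (the - 1 simply drops the intercept so that only the polynomial columns remain; the exact normalisation is implementation-defined):

import numpy
import pandas
from formulaic import model_matrix

P = model_matrix(
    "poly(x, degree=3) - 1",
    pandas.DataFrame({"x": numpy.linspace(0.0, 1.0, 100)}),
    output="numpy",
)

# For an orthogonal basis, the off-diagonal entries of P^T P are ~zero.
print(numpy.round(numpy.asarray(P).T @ numpy.asarray(P), 6))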
Note
When used as a stateful transform, we retain the coefficients that uniquely define the polynomials; and so new data will be evaluated against the same polynomial bases as the original dataset. However, the polynomial basis will almost certainly *not* be orthogonal for the new data. This is because changing the incoming dataset is equivalent to changing your choice of inner product.
Example:
"},{"location":"guides/splines/#basis_spline-or-bs","title":"basis_spline
(or bs)","text":"If you were to attempt to fit a complex function over a large domain using poly, it is highly likely that you would need to use a very large degree for the polynomial basis. However, this can lead to overfitting and/or high-frequency oscillations (see Runge's phenomenon). An alternative approach is to use piece-wise polynomial curves of lower degree, with smoothness conditions on the \"knots\" between each of the polynomial pieces. This limits overfitting while still offering the flexibility required to model very complex non-linearities.
Basis splines (or B-splines) are a popular choice for generating a basis for such polynomials, with many attractive features such as maximal smoothness around each of the knots, and minimal support given such smoothness.
Formulaic has its own implementation of basis_spline that is API compatible (where features overlap) with R, and is more performant than existing Python implementations for our use-cases (such as splev from scipy). For compatibility with R and patsy, basis_spline is available as bs in formulae.
basis_spline (or bs) has eight parameters:
x: The vector to be encoded.
df: The number of degrees of freedom to use for this spline. If specified, knots will be automatically generated such that they are df - degree (minus one if include_intercept is True) equally spaced quantiles. You cannot specify both df and knots.
knots: The internal breakpoints of the B-spline; if not specified, they default to the empty set (unless df is specified), in which case the ordinary polynomial (Bezier) basis is generated.
degree: The degree of the piecewise polynomial components.
include_intercept: Whether to return a complete (full-rank) basis. Note that if ensure_full_rank=True is passed to the materializer, then the intercept will (depending on context) nevertheless be omitted.
lower_bound: The lower bound of the domain of the B-spline; if not specified, it is determined from x.
upper_bound: The upper bound of the domain of the B-spline; if not specified, it is determined from x.
extrapolation: Selects how extrapolation should be performed when values in x extend beyond the lower and upper bounds. Valid values are:
'raise' (the default): Raises a ValueError if there are any values in x outside the B-Spline domain.
'clip': Any values above/below the domain are set to the upper/lower bounds.
'na': Any values outside of bounds are set to numpy.nan.
'zero': Any values outside of bounds are set to 0.
'extend': Any values outside of bounds are computed by extending the polynomials of the B-Spline (this is the same as the default in R).
The algorithm used to generate the basis splines is a slightly generalised version of the \"Cox-de Boor\" algorithm, extended by this author to allow for extrapolations (although this author doubts this is terribly novel). If you would like to learn more about B-Splines, the primer put together by Jeffrey Racine is an excellent resource.
As a stateful transform, we only keep track of knots, lower_bound, and upper_bound, which are sufficient given that all other information must be explicitly specified.
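For instance, a minimal sketch of reusing a fitted spec on out-of-domain data (choosing extrapolation='clip' up front so that new values outside the training domain are clipped rather than raising):

import numpy
import pandas
from formulaic import model_matrix

train = pandas.DataFrame({"x": numpy.linspace(0.0, 1.0, 100)})
mm = model_matrix("bs(x, df=4, extrapolation='clip')", train)

# The knots and bounds remembered from `train` are reapplied to `new`;
# -0.5 and 1.5 are clipped to the [0, 1] training domain.
new = pandas.DataFrame({"x": [-0.5, 0.5, 1.5]})
mm_new = mm.model_spec.get_model_matrix(new)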
Example, reusing the data and imports from above:
"},{"location":"guides/splines/#cubic_spline-crcs-and-cc","title":"cubic_spline
(cr/cs and cc)","text":"While the basis_spline transform above is capable of generating cubic splines, it is sometimes helpful to be able to generate cubic splines that satisfy various additional constraints (including direct constraints on the parameters of the spline, or indirect ones such as cyclicity). To that end, Formulaic implements direct support for generating natural and cyclic cubic splines with constraints (via the cr/cs and cc transforms respectively), borrowing much of the implementation from patsy. These splines are compatible with R's mgcv, and share the nice features of the basis splines above, including continuous first and second derivatives, and general applicability to interpolation/smoothing. Note that cr and cs generate identical splines, but are both included for compatibility with R.
All of cr
, cs
, and cc
are configurations of the cubic_spline
transform, and have seven parameters:
cc
or cr
. If using an existing model_spec
with a fitted transformation, the x
values are only used to produce the locations for the fitted values of the spline.knots
will be automatically generated such that they are df
- degree
equally spaced quantiles. You cannot specify both df
and knots
.df
is specified).x
.x
.np.dot(constraints, betas)
is zero, where betas
denotes the array of initial parameters, corresponding to the initial unconstrained model matrix), or the string 'center'
indicating that we should apply a centering constraint (this constraint will be computed from the input data, remembered and re-used for prediction from the fitted model). The constraints are absorbed in the resulting design matrix which means that the model is actually rewritten in terms of unconstrained parameters.x
extend beyond the lower and upper bounds. Valid values are:'raise'
(the default): Raises a ValueError
if there are any values in x
outside the spline domain.'clip'
: Any values above/below the domain are set to the upper/lower bounds.'na'
: Any values outside of bounds are set to numpy.nan
.'zero'
: Any values outside of bounds are set to 0
.'extend'
: Any values outside of bounds are computed by extending the polynomials of the spline (this is the same as the default in R).As a stateful transform, we only keep track of knots
, lower_bound
, upper_bound
, constraints
, and cyclic
which are sufficient given that all other information must be explicitly specified.
Example, reusing the data and imports from above:
"},{"location":"guides/transforms/","title":"Transforms","text":"A transform in Formulaic is any function that is called to modify factor values during the evaluation of a Factor
(see the How it works documentation). Any function can be used as a transform, so long as it is present in the evaluation context (see below).
There are two types of transform:
numpy.cumsum
function to any vector being fed into the model matrix materialization procedure.In the below we describe how to make a function available for use as a transform during materialization, demonstrate this for regular transforms, and then introduce how to use already implemented stateful transforms and/or write your own.
In\u00a0[1]: Copied!import pandas\n\nfrom formulaic import Formula, model_matrix\n\n\ndef my_transform(col: pandas.Series) -> pandas.Series:\n return col**2\nimport pandas from formulaic import Formula, model_matrix def my_transform(col: pandas.Series) -> pandas.Series: return col**2 In\u00a0[2]: Copied!
# Local context is automatically added\nmodel_matrix(\"a + my_transform(a)\", pandas.DataFrame({\"a\": [1, 2, 3]}))\n# Local context is automatically added model_matrix(\"a + my_transform(a)\", pandas.DataFrame({\"a\": [1, 2, 3]})) Out[2]: Intercept a my_transform(a) 0 1.0 1 1 1 1.0 2 4 2 1.0 3 9 In\u00a0[3]: Copied!
# Manually add `my_transform` to the context\nFormula(\"a + my_transform(a)\").get_model_matrix(\n pandas.DataFrame({\"a\": [1, 2, 3]}),\n context={\"my_transform\": my_transform}, # could also use: context=locals()\n)\n# Manually add `my_transform` to the context Formula(\"a + my_transform(a)\").get_model_matrix( pandas.DataFrame({\"a\": [1, 2, 3]}), context={\"my_transform\": my_transform}, # could also use: context=locals() ) Out[3]: Intercept a my_transform(a) 0 1.0 1 1 1 1.0 2 4 2 1.0 3 9 In\u00a0[4]: Copied!
from formulaic.transforms import center, scale\n\nscale(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8]))\nfrom formulaic.transforms import center, scale scale(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8])) Out[4]:
array([-1.42886902, -1.02062073, -0.61237244, -0.20412415, 0.20412415,\n 0.61237244, 1.02062073, 1.42886902])In\u00a0[5]: Copied!
center(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8]))\ncenter(pandas.Series([1, 2, 3, 4, 5, 6, 7, 8])) Out[5]:
array([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5])In\u00a0[6]: Copied!
import numpy\n\nfrom formulaic.transforms import stateful_transform\n\n\n@stateful_transform\ndef center(data, _state=None, _metadata=None, _spec=None):\n print(\"state\", _state)\n print(\"metadata\", _metadata)\n print(\"spec\", _spec)\n if \"mean\" not in _state:\n _state[\"mean\"] = numpy.mean(data)\n return data - _state[\"mean\"]\n\n\nstate = {}\ncenter(pandas.Series([1, 2, 3]), _state=state)\nimport numpy from formulaic.transforms import stateful_transform @stateful_transform def center(data, _state=None, _metadata=None, _spec=None): print(\"state\", _state) print(\"metadata\", _metadata) print(\"spec\", _spec) if \"mean\" not in _state: _state[\"mean\"] = numpy.mean(data) return data - _state[\"mean\"] state = {} center(pandas.Series([1, 2, 3]), _state=state)
state {}\nmetadata None\nspec ModelSpec(formula=, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, structure=None, transform_state={}, encoder_state={})\nOut[6]:
0 -1.0\n1 0.0\n2 1.0\ndtype: float64In\u00a0[7]: Copied!
state\nstate Out[7]:
{'mean': 2.0}
The mutated state object is then stored by formulaic automatically into the right context in the appropriate ModelSpec
instance for reuse as necessary.
If you wanted to leverage the single dispatch functionality, you could do something like:
In\u00a0[8]: Copied!from formulaic.transforms import stateful_transform\n\n\n@stateful_transform\ndef center(data, _state=None, _metadata=None, _spec=None):\n raise ValueError(f\"No implementation for data of type {repr(type(data))}\")\n\n\n@center.register(pandas.Series)\ndef _(data, _state=None, _metadata=None, _spec=None):\n if \"mean\" not in _state:\n _state[\"mean\"] = numpy.mean(data)\n return data - _state[\"mean\"]\nfrom formulaic.transforms import stateful_transform @stateful_transform def center(data, _state=None, _metadata=None, _spec=None): raise ValueError(f\"No implementation for data of type {repr(type(data))}\") @center.register(pandas.Series) def _(data, _state=None, _metadata=None, _spec=None): if \"mean\" not in _state: _state[\"mean\"] = numpy.mean(data) return data - _state[\"mean\"]
Note
If taking advantage of the single dispatch functionality, it is important that the top-level function has exactly the same signature as the type specific implementations.
"},{"location":"guides/transforms/#adding-transforms-to-the-evaluation-context","title":"Adding transforms to the evaluation context\u00b6","text":"The only requirement for using a transform in formula is making it available in the execution context. The evaluation context is always pre-seeded with:
numpy
module.numpy.log
.numpy.log10
.numpy.log2
.numpy.exp
.numpy.exp10
.numpy.exp2
.{<expr>}
syntax).The evaluation context can be extended to include arbitrary additional functions. If you are using the top-level model_matrix
function then the local context in which model_matrix
is called is automatically added to the execution context, otherwise you need to manually specify this context. For example:
In Formulaic, a stateful transform is just a regular callable object (typically a function) that has an attribute __is_stateful_transform__
that is set to True
. Such callables will be passed up to three additional arguments by formulaic if they are present in the callable signature:
_state
: The existing state or an empty dictionary that should be mutated to record any additional state._metadata
: An additional metadata dictionary passed on about the factor or None
. Will typically only be present if the Factor
metadata is populated._spec
: The current model spec being evaluated (or an empty ModelSpec
if being called outside of Formulaic's materialization routines).Only _state
is required, _metadata
and _spec
will only be passed in by Formulaic if they are present in the callable signature.
Formulaic comes preloaded with some useful stateful transforms, which are outlined below.
"},{"location":"guides/transforms/#scaling-and-centering","title":"Scaling and Centering\u00b6","text":"There are two provided scaling transforms: scale(...)
and center(...)
.
scale
rescales the data such that it is centered around zero with a standard deviation of 1. The centering and variance standardisation can be independently disabled as necessary. center
is a simple wrapper around scale
that only does the centering. For more details, refer to inline documentation: help(scale)
.
Example usage is shown below:
"},{"location":"guides/transforms/#categorical-encoding","title":"Categorical Encoding\u00b6","text":"Formulaic provides a rich family of categorical stateful transforms. These are perhaps the most commonly used transforms, and are used to encode categorical/factor data into a form suitable for numerical analysis. Use of these transforms is separately documented in the Categorical Encoding section.
"},{"location":"guides/transforms/#spline-encoding","title":"Spline Encoding\u00b6","text":"Spline coding is used to enable non-linear dependence on numerical features in linear models. Formulaic currently provides two spline transforms: bs
for basis splines, and poly
for polynomial splines. These are separately documented in the Spline Encoding section.
You can either implement the above interface directly, or leverage the stateful_transform
decorator provided by Formulaic, which then also updates your function into a single dispatch function, allowing multiple implementations that depend on the currently materialized type. A simple centering example is explored below.