@TimCookCountyDS

Closes Issue #416

Note: additional context and attempted data lineages follow. Nicole suggested excluding these from the public-facing repo but saving them somewhere. Any suggestions as to where?

Notes

To predict the expected market price of unsold properties, the residential valuation model relies on recent sales data (within the last decade?) from across the county. To ensure that the model is built on data that is both accurate and representative of true fair-market transactions, we review all sales before including them in the model-training dataset. We remove any properties or sales outliers that may not accurately reflect fair-market transactions (see footnote 1). We do this in three ways: with hard-coded rules (e.g., deed types), statistically (e.g., removing properties whose sale price is a certain number of standard deviations above or below the mean of properties within the same geographic area and class), and through an analyst review process (e.g., checking for non-arm's-length sales).
(Note: statistical or otherwise computationally flagged outliers are often, but not always, further reviewed by human analysts.)

Below is a table of inclusion/exclusion criteria used by the model.
For a brief thematic explanation of "why/how" including or excluding certain sales may alter the model's predictions, see the examples following the table.


| Category | Excluded (not used for model training) | Included (used for model training) | Stage and Source | Notes |
| --- | --- | --- | --- | --- |
| Multiple PINs in sale | multi-PIN | single-PIN | model-ingest; see also this note in sales val | None |
| Single vs. multiple buildings, single PIN | multiple buildings on a single PIN | single building on a single PIN | model-training; see here for more on multiple buildings | None |
| Year of sale | greater than 9 years | most recent 9 years | maybe model ingest? | None |
| Sale within same year | multiple sales of the same PIN within the same year | single sale of a single PIN | model-ingest, data-architecture | this may indicate a flip (and/or may include otherwise unrecorded characteristic updates) |
| Price | less than 10k | over 10k | model-ingest (though see also, for sales-validation) | None |
| Deed type | Quit Claim Deed, Executor Deed, Beneficial Institution, Unknown | Warranty Deeds, Trustee Deeds, Other | model-ingest, or here, data-catalogue, or data-architecture | The deed types correspond to numbers (3, 4, 6, 99) and (1, 2, 5) in the data catalogue |
| Replies to property tax form 203 | reply "yes" on any of questions 10b-10i, 10k on form 203: sale between related individuals or corporate affiliates, transfer of less than 100 percent interest, court-ordered sale, sale in lieu of foreclosure, condemnation, short sale, Bank REO (real estate owned), auction sale, seller/buyer is a financial institution or government agency | Other responses may or may not be included depending on combinations with other outlier markers (see here) | 203 Form from MyDec; location in iAsworld table, default.vw_pin_sale, column is sale_filter_ptax_flag; modified in sales_val ingest scripts, see ptax_flag_original, _sv_individual_ptax_flag; integrate with other outlier criteria if needed in sales_val utils script, ultimately mark all sv_outliers → upload to athena → pull data in final model ingest stage; filter any remaining (203 and other) outliers pre-training | None |
| Statistical outliers | 2 SD above or below the mean for shared geo, class, and timeframe AND the number of similar properties of the same class, geo, and timeframe is greater than 30 | within 2 SD of the mean OR number of similar properties less than 30 | Sales Validation repository readme | None |
| Heuristic outliers (must be combined with a statistical outlier) | non-person sale, form 203 response (more than 10b?), anomaly as flagged by isolation forest, flip (?) | | Sales Validation repository readme | None |
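
To make the last two rows concrete, here is a minimal sketch of how the statistical and heuristic rules could combine. It is only an illustration of the rule as written in the table, not the sales validation code: the record keys (`id`, `group`, `price`, `heuristic_flag`), the function name, and the use of raw prices with a population standard deviation are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

SD_THRESHOLD = 2.0   # "2 SD above or below the mean" (from the table)
MIN_GROUP_SIZE = 30  # rule only applies when the group has more than 30 sales


def flag_outliers(sales):
    """Map sale id -> outlier reason for sales the table would exclude.

    Each sale is a dict with hypothetical keys: id, group (shared geo,
    class, and timeframe), price, and heuristic_flag.
    """
    by_group = defaultdict(list)
    for sale in sales:
        by_group[sale["group"]].append(sale)

    flagged = {}
    for members in by_group.values():
        # Small groups are never flagged, however extreme the price.
        if len(members) <= MIN_GROUP_SIZE:
            continue
        prices = [m["price"] for m in members]
        center, spread = mean(prices), pstdev(prices)
        for m in members:
            if abs(m["price"] - center) > SD_THRESHOLD * spread:
                # A heuristic flag only counts alongside a statistical one.
                reason = "statistical+heuristic" if m["heuristic_flag"] else "statistical"
                flagged[m["id"]] = reason
    return flagged
```

Note that a sale with only a heuristic flag is never returned, which matches the "must be combined with a statistical outlier" condition.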


A final note about sales inclusion criteria. The general data ingestion process looks roughly like this:
The Data Integrity and Valuations teams, along with other teams at the assessor's office, collect permit data (for characteristics, from townships), sales data (from sites like MyDec and the Illinois Department of Revenue), and additional data, then upload it to our server, iAsworld (the assessor's office "source of truth"; see data diagram (link)). → The sales data in iAsworld is then imported into Athena (AWS), and the table that the model actually uses is default.vw_pin_sale (link to table, link to code). → Most of the inclusion/exclusion criteria in the table above are applied either in the ingest scripts (link) or at the start of modeling (link). The table above does NOT include any inclusion/exclusion processes or criteria applied before iAsworld ingestion (e.g., by IDOR/MyDec, various permitting departments, or the Valuations or Data Integrity teams).

Footnotes

  1. Useful examples of how outliers can bias the model: A property that sold for an unusually high price (for its recorded property class, characteristics, and geography) may reflect a recent remodel that increased the sale price but whose updated characteristics are not reflected in CCAO's characteristics data. Conversely, an unusually low price for a property (relative to comparable properties in its geography) may indicate a non-arm's-length transaction (e.g., a sale between family members). Both such properties should be excluded from modeling. In the first case, including the high-priced property with inaccurate characteristics would likely inflate the modeled values of otherwise similar-seeming properties. In the second case, including the non-arm's-length sale, with its non-market-driven low price, would cause the model to artificially lower the predicted values of similar properties.

Member

@jeancochrane left a comment

Nice work here! I think you got the most important points, so my feedback is mostly nitpicky copy edits. I am not normally this nitpicky about copy, but these are some of our highest-impact public-facing docs, so I hold them to a very high standard.

A couple meta-level notes about the PR itself:

  • You don't need to include the suffix "- closes issue 416" in your PR title. The most important thing is to link to the issue in the body of the PR description, as you've done. We squash every PR during merge, so our PR titles form the basis of our commit history, and it's helpful to keep them tightly scoped to a brief summary of the changes contained in the PR for the purposes of skimming the history.
  • I'm not totally sure what the "Notes" section of the body of your PR description is accomplishing, nor the line about additional context and attempted data lineages. It seems like those sections might represent an archive of the raw notes you drafted up during your research process? I thought they were intended to be notes for the reviewer, so I read through them, only to find it to be basically a duplicate of the contents of the diff. To me, the most important function of a PR description is to provide a summary of the changes along with any relevant context that's important to help reviewers (and future code historians) understand the choices you made; I wouldn't go as far as to say raw notes are off limits in a PR description, but I would encourage you to clearly distinguish them from contextual information that you're providing to readers.

@wagnerlmichael may be interested in this PR, as someone who has done a lot of work on our sales validation process. However, I recognize that Michael has a lot on his plate right now, so I don't expect his review and I don't think it should block this PR!

README.md Outdated

#### Types of Sales Excluded

The key objective of the model is to fairly estimate what a home could sell for in a fair, arms-length, open-market transaction. It's important to train the model on high-quality sales that are representative of the market, and to exclude sales that are not representative. For example, a sale price of $3M for an 800 square foot home, in a community where homes of this size tend to sell between $200k - $400k, is not a representative sale. (This may indicate a flip.) If this sale was used to train the model, then the model could learn that _all_ 800 sf homes in this region are worth around $3M, and should be assessed and taxed accordingly.
Member

[Nitpick, optional] Two tiny copy edits, and a thought on correctness:

  • I tend to use a possessive apostrophe in the word "arm's-length", which is also how the IAAO styles it in their standard on sale verification.
  • I think "sf" should be either capitalized to make it clear that it's an acronym ("S.F."), or written out in full ("square foot").
  • I'm unsure about this sentence: "If this sale was used to train the model, then the model could learn that all 800 sf homes in this region are worth around $3M, and should be assessed and taxed accordingly." This seems like a fine explanation for a lay audience, but I worry that it could strike technical readers as underinformed. Here's what I'm thinking: Given a reasonably large sample size, one unrepresentative training observation is almost certainly not going to cause a boosted tree model to predict that sale price for all homes of a similar square footage; it will, however, increase the average predicted value for homes with similar characteristics. Does that seem right to you? Am I overthinking this, or should we tweak this language so that it's more correct for technical readers?
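
The arithmetic behind that last point can be sketched with a plain average. This is a deliberate simplification: it treats the prediction for a group of similar homes as the mean of their training prices, which a boosted tree model only loosely resembles. The count of 99 typical sales and the $300k price (the midpoint of the $200k - $400k range in the README example) are hypothetical.

```python
def average_price(prices):
    """Plain arithmetic mean of a list of sale prices."""
    return sum(prices) / len(prices)


# 99 hypothetical typical sales for 800 square foot homes in the community.
typical_sales = [300_000] * 99
# The same sales plus the single unrepresentative $3M sale.
with_outlier = typical_sales + [3_000_000]

shift = average_price(with_outlier) - average_price(typical_sales)

print(f"Average without outlier:  ${average_price(typical_sales):,.0f}")
print(f"Average with outlier:     ${average_price(with_outlier):,.0f}")
print(f"Shift caused by one sale: ${shift:,.0f}")
```

Under these assumptions, one outlier moves the average from $300,000 to $327,000, far short of $3M but still a 9% increase for every similar home.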

Author

Follow-up:
Corrected arm's length, changed square foot to S.F.
Re: averaging, I strongly agree on technical correctness. The extreme example still seems helpful to set the stage for lay readers, so I added this:

"If this sale was used to train the model, then the model could learn that all 800 S.F. homes in this region are worth around $3M, and should be assessed and taxed accordingly. (More realistically, the inclusion of this single outlier could increase the average estimated sale of otherwise similar 800 S.F. homes in this community by a smaller, but still significant, amount.)"

Comment on lines 567 to 583
We accomplish these exclusions in multiple ways.

**Excluding sales with prices that are statistical outliers.** A "statistical outlier" is a sale price that is statistically higher or lower than sale prices of other similar homes, such as our hypothetical $3M example above. Often, these price outliers tend to indicate substantial characteristics errors. This outlier approach was developed in partnership with the [Mansueto Institute](https://miurban.uchicago.edu/). See the sales validation code at [ccao-data/model-sales-val](https://github.com/ccao-data/model-sales-val).

**Excluding sales based on other transaction data.** In addition to removing sales because they have sale prices that are statistically high or low, other transaction features are useful for identifying sales that should be removed from the model. They are:

| Transaction feature | How transaction feature is used |
| --------------------- | ---------------------------------------------------------------------------------------- |
| Sale price | The model is trained only on sale prices > $10k. Many sale prices below this amount are, for example, deeded parking spots rather than houses. |
| Year of sale | Only sales from the last 9 years are used to train the model. This provides a sufficient amount of transaction data for accurate training and prediction, without including price information that is extremely out-of-date. |
| Back-to-back sales | The model is trained on sales of homes that haven't sold twice within the same year. These back-to-back sales often indicate flips, or other characteristics errors that the model should not be trained on, and are excluded. |
| Number of PINs in the sale | The model is trained using only single-PIN sales. When one sale price is attributed to a sale of two PINs, it's not clear how much each PIN's value individually contributed to the sale price, so multi-PIN sales are excluded. |
| Number of buildings | The model is trained using only single-building PINs. When one sale price is attributed to a sale with two buildings, it's not clear how much each building's value individually contributed to the sale price, so multi-building sales are excluded. |
| Deed Type | The model is trained using Warranty Deeds, Trustee Deeds, and Other deed types. Quit Claim, Executor, and Beneficial Institution may represent non arms-length transactions, and are excluded from the model training. |
| Buyer-seller attributes | The model is trained on sales between non-corporate unrelated buyers and sellers. We exclude sales between corporate affiliates, related individuals, Bank REO (Real Estate Owned), sales involving a financial institution or government agency, and sales in lieu of foreclosure. |

**Analyst review.** Finally, CCAO residential analysts can manually review sales and inform us of sales that we should not use to train the model.
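
The hard-coded rules in the table above can be sketched as a single filter. This is a minimal sketch and not the actual ingest code: the record keys (`pin`, `price`, `sale_date`, `num_pins`, `num_buildings`, `deed_type`) and the function name are hypothetical, the 9-year window is assumed to be measured in calendar years, and the buyer-seller attribute checks are left out.

```python
from collections import Counter
from datetime import date

MIN_PRICE = 10_000          # sales must be above $10k
MAX_SALE_AGE_YEARS = 9      # only the last 9 years of sales
EXCLUDED_DEED_TYPES = {"Quit Claim", "Executor", "Beneficial Institution"}


def filter_training_sales(sales, today):
    """Return the sales that pass every hard-coded rule in the table."""
    # Count sales per (PIN, calendar year) to spot back-to-back sales.
    per_pin_year = Counter((s["pin"], s["sale_date"].year) for s in sales)

    kept = []
    for s in sales:
        if s["price"] <= MIN_PRICE:
            continue  # e.g. deeded parking spots
        if today.year - s["sale_date"].year > MAX_SALE_AGE_YEARS:
            continue  # price information is too out of date
        if per_pin_year[(s["pin"], s["sale_date"].year)] > 1:
            continue  # same PIN sold more than once in one year
        if s["num_pins"] != 1 or s["num_buildings"] != 1:
            continue  # price cannot be attributed to one PIN or building
        if s["deed_type"] in EXCLUDED_DEED_TYPES:
            continue  # may not be an arm's-length transaction
        kept.append(s)
    return kept
```

Keeping each rule as its own `continue` makes it easy to see which row of the table a check corresponds to, and to add a reason code per rule later if that becomes useful.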
Member

@jeancochrane Jan 29, 2026

[Suggestion, optional] This might flow better as a bulleted list, but I don't feel strongly about it:

Suggested change
We accomplish these exclusions in multiple ways:
- **Excluding sales with prices that are statistical outliers.** A "statistical outlier" is a sale price that is statistically higher or lower than sale prices of other similar homes, such as our hypothetical $3M example above. Often, these price outliers tend to indicate substantial characteristics errors. This outlier approach was developed in partnership with the [Mansueto Institute](https://miurban.uchicago.edu/). See the sales validation code at [ccao-data/model-sales-val](https://github.com/ccao-data/model-sales-val).
- **Excluding sales based on other transaction data.** In addition to removing sales because they have sale prices that are statistically high or low, other transaction features are useful for identifying sales that should be removed from the model. They are:
| Transaction feature | How transaction feature is used |
| --------------------- | ---------------------------------------------------------------------------------------- |
| Sale price | The model is trained only on sale prices > $10k. Many sale prices below this amount are, for example, deeded parking spots rather than houses. |
| Year of sale | Only sales from the last 9 years are used to train the model. This provides a sufficient amount of transaction data for accurate training and prediction, without including price information that is extremely out-of-date. |
| Back-to-back sales | The model is trained on sales of homes that haven't sold twice within the same year. These back-to-back sales often indicate flips, or other characteristics errors that the model should not be trained on, and are excluded. |
| Number of PINs in the sale | The model is trained using only single-PIN sales. When one sale price is attributed to a sale of two PINs, it's not clear how much each PIN's value individually contributed to the sale price, so multi-PIN sales are excluded. |
| Number of buildings | The model is trained using only single-building PINs. When one sale price is attributed to a sale with two buildings, it's not clear how much each building's value individually contributed to the sale price, so multi-building sales are excluded. |
| Deed Type | The model is trained using Warranty Deeds, Trustee Deeds, and Other deed types. Quit Claim, Executor, and Beneficial Institution may represent non-arm's-length transactions, and are excluded from the model training. |
| Buyer-seller attributes | The model is trained on sales between non-corporate unrelated buyers and sellers. We exclude sales between corporate affiliates, related individuals, Bank REO (Real Estate Owned), sales involving a financial institution or government agency, and sales in lieu of foreclosure. |
- **Analyst review.** Finally, CCAO residential analysts can manually review sales and inform us of sales that we should not use to train the model.

Author

I don't have strong feelings either way (table or bullet points). That said, I've left it as a table in the most recent push, because that was Nicole's suggested formatting in the original issue.
(Also, for a future state: if we want to add outlinks to cleaning code, DBs, or external sources (form 203?), it would be easy enough to add columns to the table.) Happy to include the change if you'd like, though, and thanks for taking the time to reformat.

@TimCookCountyDS
Author

TimCookCountyDS commented Feb 3, 2026

Thanks @jeancochrane! Quick reply on "Notes": correct, those are mostly an archive of my notes and process. They're a fair bit more detailed than what went into the PR, particularly some of the links to code blobs and sketches of data lineages. @njardine had recommended keeping those somewhere internally in case they're useful later. For future reference, where is a better place to put shared work like that? (I imagine it could be linked from a PR?)

Development

Successfully merging this pull request may close these issues.

Add docs generally explaining which sales are excluded from training_data

2 participants