Add Exclusion Criteria for Sales to data section of README - closes issue #416 #424

TimCookCountyDS · 2026-01-27T02:12:17Z

Closes Issue #416

Note: additional context and attempted data lineages follow. (Nicole suggested excluding these from the public-facing repo but saving them somewhere. Any suggestions as to where?)

Notes

To predict the expected market price of unsold properties, the residential valuation model relies on recent sales data (within the last decade?) from across the county. To ensure that the model is built on data that is both accurate and representative of true fair-market transactions, we review all sales before including them in the model-training dataset. We remove any properties or sales outliers that may not accurately reflect fair-market transactions¹. We do this in three ways: with hard-coded rules (e.g., deed types), statistically (e.g., removing properties whose sale price is a certain number of standard deviations above or below the mean of properties within the same geographic area and class), and through an analyst review process (e.g., checking for non-arm's-length sales).
(Note: statistical or otherwise computationally flagged outliers are often, but not always, further reviewed by human analysts.)
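As a rough illustration of the statistical step described above, here is a minimal sketch in Python. It is not the code from the sales validation repository. The field names (`geo`, `prop_class`, `sale_year`, `price`) and the function name are hypothetical, and the defaults (2 standard deviations, groups larger than 30) are taken from the criteria table in this description. The sketch also works on raw prices, which is an assumption made for simplicity.

```python
from collections import defaultdict
from statistics import mean, stdev


def flag_statistical_outliers(sales, n_sd=2.0, min_group_size=30):
    """Return one boolean per sale, True when the price is a group outlier.

    A sale is flagged only if its (geo, class, year) group has more than
    `min_group_size` sales AND its price is more than `n_sd` sample standard
    deviations away from the group mean. All field names are hypothetical.
    """
    # Collect prices per group of comparable sales.
    prices_by_group = defaultdict(list)
    for sale in sales:
        key = (sale["geo"], sale["prop_class"], sale["sale_year"])
        prices_by_group[key].append(sale["price"])

    # Only groups that are large enough get a mean and standard deviation.
    group_stats = {
        key: (mean(prices), stdev(prices))
        for key, prices in prices_by_group.items()
        if len(prices) > min_group_size
    }

    flags = []
    for sale in sales:
        key = (sale["geo"], sale["prop_class"], sale["sale_year"])
        if key not in group_stats:
            flags.append(False)  # group too small to judge
            continue
        group_mean, group_sd = group_stats[key]
        flags.append(group_sd > 0 and abs(sale["price"] - group_mean) > n_sd * group_sd)
    return flags
```

Small groups are never flagged in this sketch, which mirrors the "number of similar properties less than 30" inclusion condition in the table.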

Below is a table of inclusion/exclusion criteria used by the model.
For a brief thematic explanation of "why/how" including or excluding certain sales may alter the model's predictions, see the examples following the table.

| Category | Excluded (not used for model training) | Included (used for model training) | Stage and Source | Notes |
| --- | --- | --- | --- | --- |
| Multiple PINs in sale | multi-PIN | single-PIN | model-ingest; see also this note in sales val | None |
| Single vs. multiple buildings, single PIN | multiple buildings on a single PIN | single building on a single PIN | model-training; see here for more on multiple buildings | None |
| Year of sale | more than 9 years old | most recent 9 years | maybe model ingest? | None |
| Sale within same year | multiple sales of the same PIN within the same year | single sale of a single PIN | model-ingest, data-architecture | This may indicate a flip (and/or may include otherwise-unrecorded characteristic updates). |
| Price | less than 10k | over 10k | model-ingest (though see also, for sales-validation) | None |
| Deed type | Quit Claim Deed, Executor Deed, Beneficial Institution, Unknown | Warranty Deeds, Trustee Deeds, Other | model-ingest, or here, data-catalogue, or data-architecture | The deed types correspond to numbers (3, 4, 6, 99) and (1, 2, 5) in the data catalogue. |
| Replies to property tax form 203 | Reply "yes" on any of questions 10b-10i, 10k on form 203: sale between related individuals or corporate affiliates, transfer of less than 100 percent interest, court-ordered sale, sale in lieu of foreclosure, condemnation, short sale, bank REO (real estate owned), auction sale, seller/buyer is a financial institution or government agency | Other responses may or may not be included, depending on combinations of other outlier markers (see here) | 203 form from MyDec → location in iAsworld table, default.vw_pin_sale, column is sale_filter_ptax_flag → modified in sales_val ingest scripts, see ptax_flag_original, _sv_individual_ptax_flag → integrate with other outlier criteria if needed in sales_val utils script, ultimately mark all sv_outliers → upload to Athena → pull data in final model ingest stage → filter any remaining (203 and other) outliers pre-training | None |
| Statistical outliers | 2 SD above or below the mean for shared geo, class, and time frame AND the number of similar properties of the same class, geo, and time frame is greater than 30 | within 2 SD of the mean OR number of similar properties less than 30 | Sales Validation repository readme | None |
| Heuristic outliers (must be combined with a statistical outlier) | non-person sale, form 203 response (more than 10b?), anomaly as flagged by isolation forest, flip | (?) | Sales Validation repository readme | None |
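To make the hard-coded rows of the table above concrete, here is a minimal Python sketch that returns the reasons a single sale would be excluded. This is an illustration only and not the ingest or training code. The record fields (`num_pins`, `num_buildings`, `year`, `sales_of_pin_in_year`, `price`, `deed_type`), the function name, and the handling of boundary values (exactly 10k, exactly 9 years) are all assumptions.

```python
# Deed types listed as excluded in the table above.
EXCLUDED_DEED_TYPES = {
    "Quit Claim Deed",
    "Executor Deed",
    "Beneficial Institution",
    "Unknown",
}


def exclusion_reasons(sale, as_of_year, min_price=10_000, max_age_years=9):
    """Return a list of rule-based reasons to exclude `sale` (empty if none).

    `sale` is a dict with hypothetical keys; thresholds come from the table,
    but how ties at the boundaries are treated is assumed here.
    """
    reasons = []
    if sale["num_pins"] > 1:
        reasons.append("multi-PIN sale")
    if sale["num_buildings"] > 1:
        reasons.append("multiple buildings on PIN")
    if as_of_year - sale["year"] > max_age_years:
        reasons.append("sale too old")
    if sale["sales_of_pin_in_year"] > 1:
        reasons.append("multiple sales of PIN in same year")
    if sale["price"] <= min_price:
        reasons.append("price at or below minimum")
    if sale["deed_type"] in EXCLUDED_DEED_TYPES:
        reasons.append("excluded deed type")
    return reasons
```

Returning the list of reasons, rather than a single boolean, keeps it visible which rule removed a sale. The form 203 and heuristic rows are left out because, per the table, they interact with the statistical outlier flags.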

A final note about sales inclusion criteria: the general data ingestion process looks roughly like this:
The Data Integrity and Valuations teams, along with other teams at the assessor's office, collect permit data (for characteristics, from townships), sales data (from sources like MyDec and the Illinois Department of Revenue), and additional data, and upload it to our server (iAsworld, the assessor's office "source of truth"; see data diagram (link)). → The sales data in iAsworld is then imported into Athena (AWS), and the actual table that the model uses is default.vw_pin_sale (link to table, link to code). → (Most of) the inclusion/exclusion filters and criteria in the table above are applied either in the ingest scripts (link) or at the start of modeling (link). That table does NOT include any inclusion/exclusion processes or criteria that happen prior to iAsworld ingestion (e.g., by IDOR/MyDec, various permitting departments, or the Valuations or Data Integrity teams).

¹ Useful examples of how outliers can bias the model: A property that sold for an unusually high value (for its recorded property class, characteristics, and geography) may reflect a recent remodel that increased the sale price but whose updated characteristics are not reflected in CCAO's characteristics data. Conversely, an unusually low sale price for a property (relative to comparable properties in its geography) may indicate a non-arm's-length transaction (e.g., a sale between family members). Both such properties should be excluded from modeling. In the first case, including the high-valued property with inaccurate characteristics would likely inflate the modeled values of otherwise similar-seeming properties; in the second case, including the non-arm's-length sale, with its non-market-driven low price, would cause the model to artificially lower the predicted values of similar properties.

jeancochrane

Nice work here! I think you got the most important points, so my feedback is mostly nitpicky copy edits. I am not normally this nitpicky about copy, but these are some of our highest-impact public-facing docs, so I hold them to a very high standard.

A couple meta-level notes about the PR itself:

- You don't need to include the suffix "- closes issue 416" in your PR title. The most important thing is to link to the issue in the body of the PR description, as you've done. We squash every PR during merge, so our PR titles form the basis of our commit history, and it's helpful to keep them tightly scoped to a brief summary of the changes contained in the PR for the purposes of skimming the history.
- I'm not totally sure what the "Notes" section of the body of your PR description is accomplishing, nor the line about additional context and attempted data lineages. It seems like those sections might represent an archive of the raw notes you drafted up during your research process? I thought they were intended to be notes for the reviewer, so I read through them, only to find it to be basically a duplicate of the contents of the diff. To me, the most important function of a PR description is to provide a summary of the changes along with any relevant context that's important to help reviewers (and future code historians) understand the choices you made; I wouldn't go as far as to say raw notes are off limits in a PR description, but I would encourage you to clearly distinguish them from contextual information that you're providing to readers.

@wagnerlmichael may be interested in this PR, as someone who has done a lot of work on our sales validation process. However, I recognize that Michael has a lot on his plate right now, so I don't expect his review and I don't think it should block this PR!

jeancochrane · 2026-01-29T21:30:55Z

README.md


+#### Types of Sales Excluded
+
+The key objective of the model is to fairly estimate what a home could sell for in a fair, arms-length, open-market transaction. It's important to train the model on high-quality sales that are representative of the market, and to exclude sales that are not representative. For example, a sale price of $3M for an 800 square foot home, in a community where homes of this size tend to sell between $200k - $400k, is not a representative sale. (This may indicate a flip.) If this sale was used to train the model, then the model could learn that _all_ 800 sf homes in this region are worth around $3M, and should be assessed and taxed accordingly.


[Nitpick, optional] Two tiny copy edits, and a thought on correctness:

I tend to use a possessive apostrophe in the word "arm's-length", which is also how the IAAO styles it in their standard on sale verification.

I think "sf" should be either capitalized to make it clear that it's an acronym ("S.F."), or written out in full ("square foot").

I'm unsure about this sentence: "If this sale was used to train the model, then the model could learn that all 800 sf homes in this region are worth around $3M, and should be assessed and taxed accordingly." This seems like a fine explanation for a lay audience, but I worry that it could strike technical readers as underinformed. Here's what I'm thinking: Given a reasonably large sample size, one unrepresentative training observation is almost certainly not going to cause a boosted tree model to predict that sale price for all homes of a similar square footage; it will, however, increase the average predicted value for homes with similar characteristics. Does that seem right to you? Am I overthinking this, or should we tweak this language so that it's more correct for technical readers?

Follow-up:
Corrected arm's-length and changed square foot to S.F.
Re: averaging, strongly agree on technical correctness. The extreme example seems helpful to set the stage for lay readers, so I added this:

"If this sale was used to train the model, then the model could learn that all 800 S.F. homes in this region are worth around $3M, and should be assessed and taxed accordingly. (More realistically, the inclusion of this single outlier could increase the average estimated sale of otherwise similar 800 S.F. homes in this community by a smaller, but still significant, amount.)"
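For a sense of scale on the "smaller, but still significant" effect, here is a back-of-the-envelope sketch in Python. It treats the prediction for a group of similar homes as the plain average of their training sale prices, which is a simplification of how a boosted tree model actually behaves. The function name and the example numbers (99 comparable sales at $300k) are hypothetical.

```python
def mean_shift_from_outlier(typical_prices, outlier_price):
    """Return how much one extra sale moves the average of a group.

    Simplifying assumption: the model's estimate for the group is the
    arithmetic mean of the training sale prices in that group.
    """
    base_mean = sum(typical_prices) / len(typical_prices)
    new_mean = (sum(typical_prices) + outlier_price) / (len(typical_prices) + 1)
    return new_mean - base_mean


# Hypothetical group: 99 comparable 800 S.F. homes that each sold for $300k,
# plus the single $3M sale from the README example.
shift = mean_shift_from_outlier([300_000] * 99, 3_000_000)
print(f"Average moves by ${shift:,.0f}")
```

Under these assumptions the average rises from $300k to $327k, far short of $3M but still a 9% increase for every similar home.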

jeancochrane · 2026-01-29T23:28:57Z

README.md

+We accomplish these exclusions in multiple ways.
+
+**Excluding sales with prices that are statistical outliers.**  A "statistical outlier" is a sale price that is statistically higher or lower than sale prices of other similar homes, such as our hypothetical $3M example above. Often, these price outliers tend to indicate substantial characteristics errors.  This outlier aproach was developed in partnership with the [Mansueto Institute](https://miurban.uchicago.edu/). See the sales validation code at [ccao-data/model-sales-val](https://github.com/ccao-data/model-sales-val). 
+
+**Excluding sales based on other transaction data.** In addition to removing sales because they have sale prices that are statistically high or low, other transaction features are useful for identifying sales that should be removed from the model. They are:
+
+| Transaction feature | How transaction feature is used |
+| --------------------- | ---------------------------------------------------------------------------------------- | 
+| Sale price | The model is trained only on sale prices > $10k. Many sale prices below this amount are, for example, deeded parking spots rather than houses. | 
+| Year of sale | Only sales from the last 9 years are used to train the model. This provides a sufficient amount of transaction data for accurate training and prediction, without including price information that is extremely out-of-date. |
+| Back-to-back sales | The model is trained on sales of homes that haven't had a back-to-back sale of selling twice within the same year. These back-to-back sales often indicate flips, or other characteristics errors that the model should not be trained on, and are excluded. |
+| Number of PINs in the sale | The model is trained using only single-PIN sales. When one sale price is attributed to a sale of two PINs, it's not clear how much each PIN's value individually contributed to the sale price, so multi-PIN sales are excluded. | 
+| Number of buildings | The model is trained using only single-building PINs. When one sale price is attributed to a sale with two buildings, it's not clear how much each building's value individually contributed to the sale price, so multi-building sales are excluded. | 
+| Deed Type | The model is trained using Warranty Deeds, Trustee Deeds, and Other deed types. Quit Claim, Executor, and Beneficial Institution may represent non arms-length transactions, and are excluded from the model training. | 
+| Buyer-seller attributes | The model is trained on sales between non-corporate unrelated buyers and sellers. We exclude sales between corporate affiliates, related individuals, Bank REO (Real Estate Owned), sales involving a financial institution or government agency, and sales in lieu of foreclosure. |
+
+**Analyst review.** Finally, CCAO residential analysts can manually review sales and inform us of sales that we should not use to train the model. 


[Suggestion, optional] This might flow better as a bulleted list, but I don't feel strongly about it:

Suggested change


We accomplish these exclusions in multiple ways:

- **Excluding sales with prices that are statistical outliers.** A "statistical outlier" is a sale price that is statistically higher or lower than sale prices of other similar homes, such as our hypothetical $3M example above. Often, these price outliers tend to indicate substantial characteristics errors. This outlier aproach was developed in partnership with the [Mansueto Institute](https://miurban.uchicago.edu/). See the sales validation code at [ccao-data/model-sales-val](https://github.com/ccao-data/model-sales-val).

- **Excluding sales based on other transaction data.** In addition to removing sales because they have sale prices that are statistically high or low, other transaction features are useful for identifying sales that should be removed from the model. They are:

| Transaction feature | How transaction feature is used |
| --------------------- | ---------------------------------------------------------------------------------------- |
| Sale price | The model is trained only on sale prices > $10k. Many sale prices below this amount are, for example, deeded parking spots rather than houses. |
| Year of sale | Only sales from the last 9 years are used to train the model. This provides a sufficient amount of transaction data for accurate training and prediction, without including price information that is extremely out-of-date. |
| Back-to-back sales | The model is trained on sales of homes that haven't had a back-to-back sale of selling twice within the same year. These back-to-back sales often indicate flips, or other characteristics errors that the model should not be trained on, and are excluded. |
| Number of PINs in the sale | The model is trained using only single-PIN sales. When one sale price is attributed to a sale of two PINs, it's not clear how much each PIN's value individually contributed to the sale price, so multi-PIN sales are excluded. |
| Number of buildings | The model is trained using only single-building PINs. When one sale price is attributed to a sale with two buildings, it's not clear how much each building's value individually contributed to the sale price, so multi-building sales are excluded. |
| Deed Type | The model is trained using Warranty Deeds, Trustee Deeds, and Other deed types. Quit Claim, Executor, and Beneficial Institution may represent non-arm's-length transactions, and are excluded from the model training. |
| Buyer-seller attributes | The model is trained on sales between non-corporate unrelated buyers and sellers. We exclude sales between corporate affiliates, related individuals, Bank REO (Real Estate Owned), sales involving a financial institution or government agency, and sales in lieu of foreclosure. |

- **Analyst review.** Finally, CCAO residential analysts can manually review sales and inform us of sales that we should not use to train the model.

I don't have strong feelings either way (table or bullet points). That said, I've left it as a table in the most recent push, because that was Nicole's suggested formatting in the original issue.
(Also, for a future state, if we want to add outlinks to cleaning code, DBs, or external sources (form 203?), it would be easy enough to add additional columns to the table.) Happy to include the change if you'd like, though, and thanks for taking the time to reformat.

TimCookCountyDS · 2026-02-03T20:40:32Z

Thanks @jeancochrane! Quick reply on "Notes": correct, those are mostly an archive of my notes and process. They're a fair bit more detailed than what went into the PR, particularly some of the links to code blobs and sketches of data lineages. @njardine had recommended keeping those somewhere internally in case they're useful later. For future reference, where is a better place to put shared work like that? (I imagine it could be linked from a PR?)



Add Exclusion Criteria for Sales to data section of README

5642244

TimCookCountyDS requested review from jeancochrane and wrridgeway as code owners January 27, 2026 02:12

TimCookCountyDS linked an issue Jan 27, 2026 that may be closed by this pull request

Add docs generally explaining which sales are excluded from training_data #416

Open

jeancochrane approved these changes Jan 29, 2026


TimCookCountyDS added 2 commits February 6, 2026 18:51

Merge branch 'master' into sales_exclusion_criteria

65bd0d5

Incorporates Jean's copy edits to sales exclusion readme, Clarifies o…

0a5b8ee

…utlier impact
