Add Exclusion Criteria for Sales to data section of README - closes issue #416 #424
Nice work here! I think you got the most important points, so my feedback is mostly nitpicky copy edits. I am not normally this nitpicky about copy, but these are some of our highest-impact public-facing docs, so I hold them to a very high standard.
A couple meta-level notes about the PR itself:
- You don't need to include the suffix "- closes issue 416" in your PR title. The most important thing is to link to the issue in the body of the PR description, as you've done. We squash every PR during merge, so our PR titles form the basis of our commit history, and it's helpful to keep them tightly scoped to a brief summary of the changes contained in the PR for the purposes of skimming the history.
- I'm not totally sure what the "Notes" section of the body of your PR description is accomplishing, nor the line about additional context and attempted data lineages. It seems like those sections might represent an archive of the raw notes you drafted up during your research process? I thought they were intended to be notes for the reviewer, so I read through them, only to find it to be basically a duplicate of the contents of the diff. To me, the most important function of a PR description is to provide a summary of the changes along with any relevant context that's important to help reviewers (and future code historians) understand the choices you made; I wouldn't go as far as to say raw notes are off limits in a PR description, but I would encourage you to clearly distinguish them from contextual information that you're providing to readers.
@wagnerlmichael may be interested in this PR, as someone who has done a lot of work on our sales validation process. However, I recognize that Michael has a lot on his plate right now, so I don't expect his review and I don't think it should block this PR!
README.md
#### Types of Sales Excluded

The key objective of the model is to fairly estimate what a home could sell for in a fair, arms-length, open-market transaction. It's important to train the model on high-quality sales that are representative of the market, and to exclude sales that are not representative. For example, a sale price of $3M for an 800 square foot home, in a community where homes of this size tend to sell between $200k - $400k, is not a representative sale. (This may indicate a flip.) If this sale was used to train the model, then the model could learn that _all_ 800 sf homes in this region are worth around $3M, and should be assessed and taxed accordingly.
[Nitpick, optional] Two tiny copy edits, and a thought on correctness:
- I tend to use a possessive apostrophe in the word "arm's-length", which is also how the IAAO styles it in their standard on sale verification.
- I think "sf" should be either capitalized to make it clear that it's an acronym ("S.F."), or written out in full ("square foot").
- I'm unsure about this sentence: "If this sale was used to train the model, then the model could learn that all 800 sf homes in this region are worth around $3M, and should be assessed and taxed accordingly." This seems like a fine explanation for a lay audience, but I worry that it could strike technical readers as underinformed. Here's what I'm thinking: Given a reasonably large sample size, one unrepresentative training observation is almost certainly not going to cause a boosted tree model to predict that sale price for all homes of a similar square footage; it will, however, increase the average predicted value for homes with similar characteristics. Does that seem right to you? Am I overthinking this, or should we tweak this language so that it's more correct for technical readers?
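To make the averaging point concrete, here is a rough sketch with made-up numbers. It uses a plain group mean as a loose stand-in for what a tree-based model predicts for a set of similar homes; the 49 typical prices and the group size are assumptions for illustration, not real sales data or the model's actual behavior.

```python
from statistics import mean

# Hypothetical sale prices for 800 S.F. homes in one community:
# 49 representative sales spread evenly between $200k and $400k.
typical = [200_000 + i * (200_000 / 48) for i in range(49)]
outlier = 3_000_000  # the unrepresentative $3M sale

# Loose stand-in for a model's prediction for this group of similar homes:
# the average of the training sale prices that fall in the group.
without_outlier = mean(typical)
with_outlier = mean(typical + [outlier])
shift = with_outlier - without_outlier

print(f"mean without outlier: ${without_outlier:,.0f}")
print(f"mean with outlier:    ${with_outlier:,.0f}")
print(f"shift:                ${shift:,.0f}")
```

In this toy setup the single outlier does not pull the estimate anywhere near $3M, but it does raise the group average by roughly $54k on a $300k baseline, which is a large error for an assessment.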
Follow-up:
Corrected arm's length, changed square foot to S.F.
Re averaging: strongly agree on technical correctness. The extreme example seems helpful to set the stage for lay readers, so I added this:
"If this sale was used to train the model, then the model could learn that all 800 S.F. homes in this region are worth around $3M, and should be assessed and taxed accordingly. (More realistically, the inclusion of this single outlier could increase the average estimated sale of otherwise similar 800 S.F. homes in this community by a smaller, but still significant, amount.)"
We accomplish these exclusions in multiple ways.

**Excluding sales with prices that are statistical outliers.** A "statistical outlier" is a sale price that is statistically higher or lower than sale prices of other similar homes, such as our hypothetical $3M example above. Often, these price outliers tend to indicate substantial characteristics errors. This outlier approach was developed in partnership with the [Mansueto Institute](https://miurban.uchicago.edu/). See the sales validation code at [ccao-data/model-sales-val](https://github.com/ccao-data/model-sales-val).

**Excluding sales based on other transaction data.** In addition to removing sales because they have sale prices that are statistically high or low, other transaction features are useful for identifying sales that should be removed from the model. They are:

| Transaction feature | How transaction feature is used |
| --------------------- | ---------------------------------------------------------------------------------------- |
| Sale price | The model is trained only on sale prices > $10k. Many sale prices below this amount are, for example, deeded parking spots rather than houses. |
| Year of sale | Only sales from the last 9 years are used to train the model. This provides a sufficient amount of transaction data for accurate training and prediction, without including price information that is extremely out-of-date. |
| Back-to-back sales | The model is trained on sales of homes that haven't had a back-to-back sale of selling twice within the same year. These back-to-back sales often indicate flips, or other characteristics errors that the model should not be trained on, and are excluded. |
| Number of PINs in the sale | The model is trained using only single-PIN sales. When one sale price is attributed to a sale of two PINs, it's not clear how much each PIN's value individually contributed to the sale price, so multi-PIN sales are excluded. |
| Number of buildings | The model is trained using only single-building PINs. When one sale price is attributed to a sale with two buildings, it's not clear how much each building's value individually contributed to the sale price, so multi-building sales are excluded. |
| Deed Type | The model is trained using Warranty Deeds, Trustee Deeds, and Other deed types. Quit Claim, Executor, and Beneficial Institution may represent non arms-length transactions, and are excluded from the model training. |
| Buyer-seller attributes | The model is trained on sales between non-corporate unrelated buyers and sellers. We exclude sales between corporate affiliates, related individuals, Bank REO (Real Estate Owned), sales involving a financial institution or government agency, and sales in lieu of foreclosure. |

**Analyst review.** Finally, CCAO residential analysts can manually review sales and inform us of sales that we should not use to train the model.
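For readers who think in code, the transaction-feature rules in the table above could be sketched roughly as follows. This is only an illustration: the `Sale` fields, the `exclusion_reasons` helper, and the exact boundary of the 9-year window are all hypothetical, and the real filters live in the ingest and modeling code rather than in anything shaped like this.

```python
from dataclasses import dataclass

# Deed types the table lists as excluded from training.
EXCLUDED_DEED_TYPES = {"Quit Claim", "Executor", "Beneficial Institution"}


@dataclass
class Sale:
    # All field names are hypothetical, chosen to mirror the table rows.
    price: float
    year: int
    sold_twice_same_year: bool
    num_pins: int
    num_buildings: int
    deed_type: str
    arms_length_parties: bool  # False for corporate, related, bank REO, etc.


def exclusion_reasons(sale: Sale, current_year: int) -> list[str]:
    """Return the table rows that would exclude this sale (empty if none)."""
    reasons = []
    if sale.price <= 10_000:
        reasons.append("Sale price")
    if current_year - sale.year > 9:  # exact window boundary is an assumption
        reasons.append("Year of sale")
    if sale.sold_twice_same_year:
        reasons.append("Back-to-back sales")
    if sale.num_pins != 1:
        reasons.append("Number of PINs in the sale")
    if sale.num_buildings != 1:
        reasons.append("Number of buildings")
    if sale.deed_type in EXCLUDED_DEED_TYPES:
        reasons.append("Deed Type")
    if not sale.arms_length_parties:
        reasons.append("Buyer-seller attributes")
    return reasons


example = Sale(5_000, 2022, False, 1, 1, "Quit Claim", True)
print(exclusion_reasons(example, current_year=2024))
```

Returning the list of reasons, rather than a single boolean, keeps it visible which rule removed a sale, which seems useful if analysts later review the exclusions.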
[Suggestion, optional] This might flow better as a bulleted list, but I don't feel strongly about it:
We accomplish these exclusions in multiple ways:

- **Excluding sales with prices that are statistical outliers.** A "statistical outlier" is a sale price that is statistically higher or lower than sale prices of other similar homes, such as our hypothetical $3M example above. Often, these price outliers tend to indicate substantial characteristics errors. This outlier approach was developed in partnership with the [Mansueto Institute](https://miurban.uchicago.edu/). See the sales validation code at [ccao-data/model-sales-val](https://github.com/ccao-data/model-sales-val).
- **Excluding sales based on other transaction data.** In addition to removing sales because they have sale prices that are statistically high or low, other transaction features are useful for identifying sales that should be removed from the model. They are:

| Transaction feature | How transaction feature is used |
| --------------------- | ---------------------------------------------------------------------------------------- |
| Sale price | The model is trained only on sale prices > $10k. Many sale prices below this amount are, for example, deeded parking spots rather than houses. |
| Year of sale | Only sales from the last 9 years are used to train the model. This provides a sufficient amount of transaction data for accurate training and prediction, without including price information that is extremely out-of-date. |
| Back-to-back sales | The model is trained on sales of homes that haven't had a back-to-back sale of selling twice within the same year. These back-to-back sales often indicate flips, or other characteristics errors that the model should not be trained on, and are excluded. |
| Number of PINs in the sale | The model is trained using only single-PIN sales. When one sale price is attributed to a sale of two PINs, it's not clear how much each PIN's value individually contributed to the sale price, so multi-PIN sales are excluded. |
| Number of buildings | The model is trained using only single-building PINs. When one sale price is attributed to a sale with two buildings, it's not clear how much each building's value individually contributed to the sale price, so multi-building sales are excluded. |
| Deed Type | The model is trained using Warranty Deeds, Trustee Deeds, and Other deed types. Quit Claim, Executor, and Beneficial Institution may represent non-arm's-length transactions, and are excluded from the model training. |
| Buyer-seller attributes | The model is trained on sales between non-corporate unrelated buyers and sellers. We exclude sales between corporate affiliates, related individuals, Bank REO (Real Estate Owned), sales involving a financial institution or government agency, and sales in lieu of foreclosure. |

- **Analyst review.** Finally, CCAO residential analysts can manually review sales and inform us of sales that we should not use to train the model.
I don't have strong feelings either way (table or bullet points). That said, I've left it as a table in the most recent push, because that was Nicole's suggested formatting in the original issue.
Also, looking ahead, if we want to add outlinks to cleaning code, DBs, or external sources (form 203?), it would be easy enough to add columns to the table. Happy to include the change if you'd like, though, and thanks for taking the time to reformat.
Thanks @jeancochrane! Quick reply on "Notes": correct, those are mostly an archive of my notes and process. They're a fair bit more detailed than what went into the PR, particularly some of the links to code blobs and sketches of data lineages. @njardine had recommended keeping those somewhere internally in case they're useful later. For future reference, where is a better place to put shared work like that? (I imagine it could then be linked from a PR.)
Closes Issue #416
Note on additional context and attempted data lineages: Nicole suggested excluding these from the public-facing repo but saving them somewhere. Any suggestions as to where?
Notes
To predict the expected market price of unsold properties, the residential valuation model relies on recent sales data (within the last decade?) from across the county. To ensure that the model is built on data that is both accurate and representative of true fair-market transactions, we review all sales before including them in the model-training dataset. We remove any properties or sales outliers that may not accurately reflect fair-market transactions (see footnote 1). We do this with hard-coded rules (e.g. deed types), statistically (e.g. removing properties whose sale price is a certain number of standard deviations above the mean of properties within the same geographic area and class), and through an analyst review process (e.g. checking for non-arm's-length sales).
(Note: statistical or otherwise computationally flagged outliers are often, but not always, further reviewed by human analysts.)
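As a rough illustration of the statistical rule described in these notes, the sketch below groups sales by geographic area and class and flags prices that sit too many standard deviations from the group mean. The function name, the tuple layout, the `max_sd` default of 2.0, and the minimum group size are all assumptions for the example. The notes only mention prices above the mean; flagging unusually low prices as well is a further assumption. The actual approach in ccao-data/model-sales-val may differ in every one of these details.

```python
from collections import defaultdict
from statistics import mean, stdev


def flag_price_outliers(sales, max_sd=2.0, min_group_size=3):
    """Return indices of sales priced more than max_sd standard deviations
    away from the mean of their (area, class) group.

    sales: list of (area, property_class, price) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for idx, (area, prop_class, price) in enumerate(sales):
        groups[(area, prop_class)].append((idx, price))

    flagged = []
    for members in groups.values():
        prices = [price for _, price in members]
        if len(prices) < min_group_size:
            continue  # too few sales to estimate a spread
        mu, sd = mean(prices), stdev(prices)
        if sd == 0:
            continue  # identical prices, nothing to flag
        flagged.extend(i for i, p in members if abs(p - mu) / sd > max_sd)
    return sorted(flagged)


# Ten ordinary sales plus one $3M sale in the same area and class.
sales = [("A", "203", 290_000 + 5_000 * i) for i in range(10)]
sales.append(("A", "203", 3_000_000))
print(flag_price_outliers(sales))
```

One caveat with this kind of rule: an extreme sale inflates the standard deviation of its own group, so a very large outlier can mask smaller ones nearby. That is one reason flagged sales still benefit from analyst review.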
Below is a table of inclusion/exclusion criteria used by the model.
For a brief thematic explanation of "why/how" including or excluding certain sales may alter the model's predictions, see the examples following the table.
A final note about sales inclusion criteria: The general data ingestion process looks roughly something like this:
- Data Integrity, Valuations, and other teams at the assessor's office collect permit data (for characteristics, from townships), sales data (from sources like MyDec and the Illinois Department of Revenue), and additional data, and upload it to our server (iAsworld, the assessor's office "source of truth"; see data diagram (link)).
- The sales data in iAsworld is then imported into Athena (AWS). The actual table that the model uses is default.vw_pin_sale (link to table, link to code).
- Most of the inclusion/exclusion criteria in the table above are applied either in the ingest scripts (link) or at the start of modeling (link).

That table does NOT include any inclusion/exclusion processes or criteria that happen prior to iAsworld ingestion (e.g. by IDOR/MyDec, various permitting departments, or the Valuations or Data Integrity teams).
Footnotes
1. Useful examples of how outliers can bias the model: A property that sold for an unusually high value (for its recorded property class, characteristics, and geography) may reflect a recent remodel that increased the sale price but whose updated characteristics are not reflected in CCAO's characteristics data. Conversely, an unusually low sale price for a property (relative to comparable properties in its geography) may indicate a non-arm's-length transaction (e.g. a sale between family members). Both kinds of sale should be excluded from modeling. In the first case, including the high-priced property with inaccurate characteristics would likely inflate the modeled values of otherwise similar-seeming properties. In the second case, including the non-arm's-length sale, with its non-market-driven low price, would cause the model to artificially lower the predicted values of similar properties.