Commit
added summary fields to data dictionary.
ericrobskyhuntley committed Aug 8, 2024
1 parent 7f1dc19 commit 0a44692
Showing 2 changed files with 52 additions and 15 deletions.
61 changes: 49 additions & 12 deletions DICTIONARY.md
@@ -15,7 +15,7 @@ For any spatial tables listed below (indicated using 🌐), data is stored in [N
Residential properties. Each row represents a property in the assessors table.

| Field | Type | Description |
|-------------------------------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|-----------|-----------|--------------------------------------------------|
| `id` (PK) | Integer | Unique identifier. |
| `fy` | Integer | Fiscal year of assessor's database. |
| `muni_id` (FK to `munis`) | String | Identifier of property municipality. |
@@ -37,7 +37,7 @@ Residential properties Each row represents a property in the assessors table.
Each row represents a unique owner name-address pair.

| Field | Type | Description |
|---------------------------------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|-------------|-------------|-----------------------------------------------|
| `id` (PK) | Integer | Unique identifier. |
| `name` | String | Name of un-deduplicated owner. |
| `inst` | Boolean | Institutional owner. If `TRUE`, we flagged the owner as institutional using keywords unlikely to be identified with individuals. |
@@ -53,9 +53,9 @@ Each row represents either a unique owner name-address pair.
Represents the many-to-many relationship between `owners` and `sites`. All many-to-many relations are induced by splitting non-institutional owners on instances of the word "and" to identify multiple individual owners of a site.
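
The splitting rule described above can be illustrated with a minimal base R sketch. This is not the project's actual implementation: the toy data, the `parts` and `site_owner_pairs` names, and the exact regular expression are assumptions made for illustration.

```r
# Toy inputs; all names and values here are hypothetical.
owners <- data.frame(
  site_id = c(1L, 2L),
  name = c("JOHN SMITH AND JANE SMITH", "SMITH AND JONES LLC"),
  inst = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Split on the standalone word "and" (case-insensitive, whitespace on both sides).
parts <- strsplit(owners$name, "(?i)\\s+and\\s+", perl = TRUE)

# Institutional owners are kept whole rather than split.
parts[owners$inst] <- as.list(owners$name[owners$inst])

# One row per (site, individual owner) pair.
site_owner_pairs <- data.frame(
  site_id = rep(owners$site_id, lengths(parts)),
  name = unlist(parts),
  stringsAsFactors = FALSE
)
```

Here the first site yields two owner rows, while the institutional owner on the second site stays a single row even though its name contains "AND".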

| Field | Type | Description |
|---------------------------------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|-------------|-------------|-----------------------------------------------|
| `id` (PK) | Integer | Unique identifier. |
| `site_id` (FK to `site``)` | Integer | Identifier of property. |
| `site_id` (FK to `sites`) | Integer | Identifier of property. |
| `owner_id` (FK to `owners`) | Integer | Identifier of owner. |
| `cosine_group` (FK to `metacorps_cosine`) | String | Identifier of cosine-deduplicated metacorp. Group assigned by cosine deduplication process. |
| `network_group` (FK to `metacorps_network`) | String | Identifier of network-deduplicated metacorp. Group assigned by either network deduplication process or cosine deduplication process, when there are no network matches. |
@@ -65,7 +65,7 @@ Represents the many-to-many relationship between `owners` and `sites`. All many-
Companies from OpenCorporates matched to at least one row in the assessors table, or present in the networks of those companies.

| Field | Type | Description |
|------------------------------------------|---------|----------------------------------------------|
|---------------------------|---------------|------------------------------|
| `id` (PK) | Integer | Unique identifier. |
| `name` | String | Name of company. |
| `company_type` | String | Type of company, given by OpenCorporates. |
@@ -77,14 +77,22 @@ Companies from OpenCorporates matched to at least one row in the assessors table
Each row represents a unique name-company relationship.

| Field | Type | Description |
|------------------------------------------|---------|---------------------------------------------------------------|
|-----------------------|---------------|-----------------------------------|
| `id` (PK) | Integer | Unique identifier. |
| `name` | String | Name of officer. |
| `positions` | String | Comma-separated list of positions held by officer in company. |
| `company_id` (FK to `companies`) | String | Identifier of company. |
| `addr_id` (FK to `addresses`) | String | Identifier of address. |
| `network_id` (FK to `metacorps_network`) | String | Identifier of network-deduplicated metacorp. |

#### Summary Fields

Currently included only when `load_results()` is run with `summarize=TRUE`.

| Field | Type | Description |
|-----------------------|---------------|-----------------------------------|
| `innetwork_company_count` | Integer | Number of companies the given officer is an officer of *within* the network metacorp. |
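
As a rough illustration of what `innetwork_company_count` measures, the sketch below counts distinct companies per officer name within each network metacorp. It uses base R on toy data and assumes the count is grouped by `name` and `network_id`; the pipeline's actual computation may differ.

```r
# Toy officers table; values are hypothetical.
officers <- data.frame(
  name = c("JANE DOE", "JANE DOE", "JANE DOE", "JOHN ROE"),
  company_id = c("c1", "c2", "c3", "c1"),
  network_id = c("n1", "n1", "n2", "n1"),
  stringsAsFactors = FALSE
)

# For each row, count the distinct companies sharing that row's
# officer name and network metacorp.
officers$innetwork_company_count <- as.integer(ave(
  seq_len(nrow(officers)),
  officers$name, officers$network_id,
  FUN = function(i) length(unique(officers$company_id[i]))
))
```

An officer attached to many companies within one metacorp is a candidate for manual review, since the summary fields are intended for diagnosing over-inclusion in the network analysis.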

### `metacorps_network`

Each row represents a network-identified 'metacorp', or group of companies that we've identified as related.
@@ -94,6 +102,21 @@ Each row represents a network-identified 'metacorp', or group of companies that
| `id` (PK) | Integer | Unique identifier. |
| `name` | String | Most common company name within metacorp. |

#### Summary Fields

Currently included only when `load_results()` is run with `summarize=TRUE`.

| Field | Type | Description |
|---------------------|-----------------|----------------------------------|
| `prop_count` | Integer | Number of properties (i.e., `sites` rows) linked to a given metacorp. |
| `unit_count` | Numeric | Estimated number of units linked to a given metacorp. |
| `area` | Integer | Summed building area held by a particular metacorp (where 'building area' means the larger of `res_area` and `bld_area`). |
| `val` | Integer | Summed building and residential value held by a particular metacorp. |
| `units_per_prop` | Numeric | Total estimated units divided by the property count. This is a measure of what scale of property a given owner invests in. |
| `val_per_prop` | Numeric | Total value divided by property count. A measure of how valuable a given metacorp's properties are. |
| `val_per_area` | Numeric | Value per square foot. Another measure of how valuable a metacorp's properties are. |
| `company_count` | Integer | How many unique companies appear within a given metacorp. |
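
The relationships between these fields can be sketched in base R as follows. This is an illustrative reading of the definitions above, not the code `load_results()` runs: the toy `sites` data and its `network_group`, `units`, and `val` columns are assumptions, and only `res_area` and `bld_area` are named in this dictionary.

```r
# Toy sites table; column names other than res_area/bld_area are assumed.
sites <- data.frame(
  network_group = c("a", "a", "b"),
  units = c(2, 4, 10),
  res_area = c(1000, 0, 5000),
  bld_area = c(800, 3000, 4000),
  val = c(200000, 600000, 1500000),
  stringsAsFactors = FALSE
)

# 'Building area' is the larger of res_area and bld_area for each property.
sites$area <- pmax(sites$res_area, sites$bld_area, na.rm = TRUE)

# Sum units, area, and value within each metacorp.
sums <- rowsum(sites[, c("units", "area", "val")], group = sites$network_group)
counts <- table(sites$network_group)

metacorp_summary <- data.frame(
  id = rownames(sums),
  prop_count = as.integer(counts[rownames(sums)]),
  unit_count = sums$units,
  area = sums$area,
  val = sums$val,
  stringsAsFactors = FALSE
)

# Ratio fields derived from the sums above.
metacorp_summary$units_per_prop <- metacorp_summary$unit_count / metacorp_summary$prop_count
metacorp_summary$val_per_prop <- metacorp_summary$val / metacorp_summary$prop_count
metacorp_summary$val_per_area <- metacorp_summary$val / metacorp_summary$area
```

Note that the ratio fields are computed from the metacorp-level totals, not by averaging per-property ratios, which is how the descriptions above read.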

### `metacorps_cosine`

Each row represents a cosine-deduplication-identified 'metacorp', or group of companies that we've identified as related.
@@ -103,12 +126,26 @@ Each row represents a cosine-deduplication-identified 'metacorp', or group of co
| `id` (PK) | Integer | Unique identifier. |
| `name` | String | Most common company name within metacorp. |

#### Summary Fields

Currently included only when `load_results()` is run with `summarize=TRUE`.

| Field | Type | Description |
|---------------------|------------------|---------------------------------|
| `prop_count` | Integer | Number of properties (i.e., `sites` rows) linked to a given metacorp. |
| `unit_count` | Numeric | Estimated number of units linked to a given metacorp. |
| `area` | Integer | Summed building area held by a particular metacorp (where 'building area' means the larger of `res_area` and `bld_area`). |
| `val` | Integer | Summed building and residential value held by a particular metacorp. |
| `units_per_prop` | Numeric | Total estimated units divided by the property count. This is a measure of what scale of property a given owner invests in. |
| `val_per_prop` | Numeric | Total value divided by property count. A measure of how valuable a given metacorp's properties are. |
| `val_per_area` | Numeric | Value per square foot. Another measure of how valuable a metacorp's properties are. |

### `parcels_point` (🌐)

Each row is a `POINT()` representation of a parcel in the MassGIS parcels database.

| Field | Type | Description |
|-----------------------------------------|----------------|--------------------------------------------------------------------------|
|--------------------|----------------|-------------------------------------|
| `loc_id` (PK) | Integer | Unique identifier. |
| `muni_id` (FK to `munis`) | String | Unique identifier of municipality. |
| `block_group_id` (FK to `block_groups`) | String | Unique identifier of block group that contains parcel. |
@@ -120,7 +157,7 @@ Each row is a `POINT()` representation of a parcel in the MassGIS parcels databa
Each row is a unique address (including parsed ranges) found in any of `assessors`, `sites`, `owners`, `companies`, or `officers`. Constructed, in part, using the statewide and Boston address layers as a reference dataset.

| Field | Type | Description |
|----------------------------------|---------|----------------------------------------------------------------------------------------------------|
|--------------|--------------|---------------------------------------------|
| `loc_id` (PK) | Integer | Unique identifier. |
| `addr` | String | Complete number, street name, type string, often reconstructed from address ranges, PO Boxes, etc. |
| `start` | Number | For ranges, start of address range. For single-number addresses, that single number. |
@@ -139,7 +176,7 @@ These are loaded using `load_results()` if `load_boundaries = TRUE`.
### `munis` (🌐)

| Field | Type | Description |
|----------------|-------------------------|--------------------------------------------------------------------------|
|---------------|---------------|------------------------------------------|
| `muni_id` (PK) | Integer | Unique identifier. |
| `muni` | String | Name of municipality. |
| `hns` | Boolean | If `TRUE`, municipality is one of the Healthy Neighborhoods Study areas. |
@@ -151,7 +188,7 @@ These are loaded using `load_results()` if `load_boundaries = TRUE`.
Each row is a Massachusetts block group from the most recent vintage available in `tigris`. Currently, 2022.

| Field | Type | Description |
|------------|-------------------------|-----------------------------------------------|
|----------------|-------------------|-------------------------------------|
| `id` (PK) | Integer | Unique identifier (i.e., the 12-digit GEOID). |
| `geometry` | Geometry (MultiPolygon) | Block group boundary. |

@@ -160,7 +197,7 @@ Each row is a Massachusetts block group from the most recent vintage available i
Each row is a Massachusetts census tract from the most recent vintage available in `tigris`. Currently, 2022.

| Field | Type | Description |
|------------|-------------------------|-----------------------------------------------|
|----------------|-------------------|-------------------------------------|
| `id` (PK) | Integer | Unique identifier (i.e., the 11-digit GEOID). |
| `geometry` | Geometry (MultiPolygon) | Census tract boundary. |

@@ -169,6 +206,6 @@ Each row is a Massachusetts census tract from the most recent vintage available
Each row is a ZIP code boundary, at least part of which intersects Massachusetts (ZIP codes can cross state lines).

| Field | Type | Description |
|------------|-------------------------|-----------------------------------------------|
|----------------|-------------------|-------------------------------------|
| `id` (PK) | Integer | Unique identifier (i.e., the 11-digit GEOID). |
| `geometry` | Geometry (MultiPolygon) | ZIP code boundary. |
6 changes: 3 additions & 3 deletions README.md
@@ -61,10 +61,10 @@ load_results("your_db_prefix", load_boundaries=TRUE, summarize=TRUE)

This will load `companies`, `munis`, `officers`, `owners`, `sites`, `sites_to_owners`, `parcels_point`, `metacorps_cosine` and `metacorps_network` into your R environment. If `load_boundaries` is true, it will also return `munis`, `zips`, `tracts`, and `block_groups`.

If summarize is `TRUE`, it will return a number of summary fields for `officers`, `metacorps_cosine`, and `metacorps_network` that are useful for diagnosing cases of over-inclusion in the network analysis.

[Please consult the data dictionary for field definitions.](https://github.com/mit-spatial-action/who-owns-mass-processing/blob/main/README.md)

If `summarize` is `TRUE`, it will also return a number of summary fields for `officers`, `metacorps_cosine`, and `metacorps_network` that are useful for diagnosing cases of over-inclusion in the network analysis. These fields are documented in the data dictionary as well.

**This requires that you have `.Renviron` set up with appropriate prefixes (see 'Setting up `.Renviron`', above).**

Note that for statewide results, these are very large tables and therefore it might take 5-10 minutes depending on your network connection/whether you're reading from a local or remote database.
@@ -90,7 +90,7 @@ If the process is run interactively, it automatically outputs results to objects
We expose a large number of configuration variables in `config.R`, which is sourced in `run.R`. In order...

| Variable | Description |
|---------------|---------------------------------------------------------|
|-----------------|-------------------------------------------------------|
| `COMPLETE_RUN` | Default: `FALSE`. A little helper that overrides values such that `ROUTINES=list(load = TRUE, proc = TRUE, dedupe = TRUE)`, `REFRESH=TRUE`, `MUNI_IDS=NULL`, and `COMPANY_TEST=FALSE`. This ensures a fresh, statewide run on complete datasets, not subsets. |
| `REFRESH` | Default: `TRUE`. If `TRUE`, datasets will be reingested regardless of whether results already exist in the database. |
| `PUSH_DBS` | Default: `list(load = "", proc = "", dedupe = "")` Named list with string values. If `""`, looks for `.Renviron` database connection parameters of the format `"DB_NAME"`. If string passed, looks for parameters of the format `"YOURSTRING_DB_NAME"` where `YOURSTRING` can be passed upper or lower case, though parameters must be all uppercase. **Note that whatever `dedupe` is set to is treated as "production", meaning that select intermediate tables from previous subroutines are pushed there as well. Requires that you set `.Renviron` parameters (see section 'Setting Up `.Renviron`' above).** |