APS Within-Set ID Creation - Initial Fuzzy-Match Chunk Generation #50

Open
wants to merge 13 commits into
base: main

corvidfox
Collaborator

Brief Summary of Commits:
QMD Files

  • data_unique_person_01_aps_01_cleaning.qmd
    • Final updates to the preparation and cleaning of the APS subject identifier data file
    • File renamed for organization and to follow the SOP more closely; originally named "data_unique_person_01_within_set_aps.qmd"
  • data_unique_person_01_aps_02_fl_chunk_generation.qmd
    • Benchmarks and explores fastLink processing of the data
    • Because of the size of the data file and hardware limitations for processing, the data were split into 5 chunks
    • Manual review range set to posterior probabilities of 0.10 to 0.80 (approximately 0.027% of the benchmarking chunk)
    • APS data, with the modifications made in this file, were exported for use in further processing of each chunk
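
For reviewers unfamiliar with the manual review range above, the following is a minimal, language-neutral sketch (in Python, not the R code in this PR) of how candidate pairs could be sorted by posterior probability. The `Pair` type, the `triage` function, the inclusive bounds, and the treatment of pairs outside the 0.10 to 0.80 band (below as non-matches, above as matches) are all assumptions made for illustration; the .qmd file may handle these cases differently.

```python
from dataclasses import dataclass

# Review band taken from the PR summary; inclusive bounds are an assumption.
REVIEW_LOW = 0.10
REVIEW_HIGH = 0.80


@dataclass(frozen=True)
class Pair:
    """Hypothetical record for one candidate pair of rows."""
    row_a: int
    row_b: int
    posterior: float


def triage(pairs, low=REVIEW_LOW, high=REVIEW_HIGH):
    """Split pairs into non-match / manual review / match buckets."""
    buckets = {"non_match": [], "review": [], "match": []}
    for p in pairs:
        if p.posterior < low:
            buckets["non_match"].append(p)   # assumed: rejected without review
        elif p.posterior <= high:
            buckets["review"].append(p)      # inside the manual review band
        else:
            buckets["match"].append(p)       # assumed: accepted without review
    return buckets


example = [Pair(0, 1, 0.95), Pair(2, 3, 0.45), Pair(4, 5, 0.02)]
result = triage(example)
```

Keeping the bounds as named constants makes it simple to re-run the split if the review range is revised after benchmarking.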

R Scripts of Helper Functions

  • single_df_fastlink.R
    • Wrapper function that runs fastLink for within-set fuzzy matching without triggering the algorithm's automatic modifications, which would otherwise deduplicate or limit the returned data
  • get_pair_data.R
    • Extracts data from the fastLink output object to facilitate within-row comparison of identifiers in each pair
  • stack_ids.R
    • Modifies the output of get_pair_data.R to assign a unique group ID to each cluster of linked pairs, which acts as a preliminary unique Subject ID
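
To illustrate the idea behind assigning group IDs to clustered pairs, here is a minimal sketch in Python (not the R helper itself) that treats matched pairs as edges and labels each connected group of rows with one ID, using a union-find structure. The function name `assign_group_ids`, the zero-based row indices, and the choice of union-find are assumptions for illustration only; stack_ids.R may cluster pairs in a different way.

```python
def assign_group_ids(n_rows, pairs):
    """Return one group ID per row; rows linked by any chain of pairs share an ID.

    n_rows: number of rows in the data set (rows are indexed 0..n_rows-1).
    pairs:  iterable of (row_a, row_b) tuples judged to be the same person.
    """
    parent = list(range(n_rows))  # each row starts as its own group

    def find(x):
        # Walk up to the group's root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merge the two groups, keeping the smaller index as the root.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Number groups 1, 2, 3, ... in order of first appearance.
    ids = {}
    out = []
    for row in range(n_rows):
        root = find(row)
        if root not in ids:
            ids[root] = len(ids) + 1
        out.append(ids[root])
    return out


# Rows 0-1-2 are chained together, row 3 is unmatched, rows 4-5 are paired.
group_ids = assign_group_ids(6, [(0, 1), (1, 2), (4, 5)])
# group_ids == [1, 1, 1, 2, 3, 3]
```

Note that this kind of transitive grouping links rows 0 and 2 even though they were never directly paired, which is one reason the resulting ID is best treated as preliminary.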

corvidfox and others added 13 commits October 11, 2024 14:39
Completed farm to market standardization to "ftm"
Fixed small typos: one accidentally overwrote address names (instead of modifying the street address appropriately); the others were only in the explanatory text.
Private road standardization completed
Standardization of major street address elements complete.
Homeless entries, general delivery entries. Some point-modifications to prior code for clarity and fixing failed captures.
Final QCs, separation of DOB components, and file export
Updated export file path to reflect new aps data dedicated sub directory
Helper functions for fuzzy matching and review:
1) wrap and force fuzzy match of single data set (within-set matching);
2) pull pair data from fastlink object and source file for raw pair data;
3) convert raw pair data into side-by-side comparison of data in pairs for manual review
APS data set required chunking into 5 chunks due to hardware limitations for processing. Exploration of constraints and chunk generation in file.
Renamed files to reflect content and organization, and more closely align with SOPs
Updated text in Overview summary to be accurate to this file, fill placeholders.