From 21d40eea14095fe73604d9d1e7183f7ba10ea6ee Mon Sep 17 00:00:00 2001
From: jw2249a
Date: Thu, 16 May 2024 11:01:43 -0700
Subject: [PATCH] updating documentation (partial)

---
 README.md | 63 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 49 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index aca16ac..b6cdac0 100644
--- a/README.md
+++ b/README.md
@@ -15,13 +15,51 @@ The basic arguments for the `fastLink` function to run are
 
 - `dfB`: A `DataFrame` table of records to be matched.
 
-- `varnames` A `Vector{String}` that contains the variable names present in both tables.
+- `config`: A `Dict{String, Any}` that specifies how the two dataframes ought to be matched.
 
-- `idvar`: A `Tuple{String,String}` that has the unique ids for records in both tables (should differ from each other). Should be formatted as so that the arguments are ordered in ("dfa","dfb") order.
+### Match Configuration
+
+The match configuration for a FastLink match needs to contain certain keys in the base dictionary (nested dictionaries are discussed later).
+
+The base dictionary needs to contain:
+
+- `link_type`: Either `link_only`, `dedupe_only`, or `link_and_dedupe`.
+
+- `idvar`: A `Vector{String}` of length 2 that specifies the ids of the two dataframes (in `[dfA, dfB]` order).
+
+- `comparisons`: A `Dict{String, Any}` that defines the type of matching to be done and the variables that will be matched.
+
+#### Comparisons dictionary
+The `comparisons` dictionary defined above can sit in the base dictionary or can take the place of a `varname` dictionary in the `variables` vector. A `comparisons` dictionary nested in the `variables` vector is matched first using the fastLink algorithm and then treated as a single variable in the parent `comparisons` dictionary. You can replace multiple `varname` dictionaries with `comparisons` dictionaries at the same level of nesting.
+
+Each `comparisons` dictionary must have:
+- `name`: Should be "total" in the base dictionary; nested dictionaries can use any `name`.
+
+- `variables`: A `Vector{Dict{String, Any}}` that contains the individual variable dictionaries and/or nested `comparisons` dictionaries.
+
+The optional parameters for the `comparisons` dictionary are:
+- `w_lambda::Float64`: Default 0.0.
+
+- `prior_lambda::Float64`: Default 0.0.
+
+- `threshold_match`: Lower bound for the posterior probability that will act as a cutoff for matches. Default `[0.85]`.
+
+- `tol_em`: Convergence tolerance for the EM Algorithm. (default `1e-05`)
+
+- `prior_pi::Float64`: Default 0.0.
+
+- `w_pi::Float64`: Default 0.0.
+
+#### Variables
+Individual variables can be declared in a dictionary and must contain both a `varname` and `method`.
+
+- `varname`: Name of the variable in `dfA` and `dfB` to be compared.
+
+- `method`: The method used to match the variable. The currently accepted methods are `exact`, `fuzzy`, `string`, `numeric`, `float`, `int`, or any of the `distmethod` options.
+
+Each `method` has a number of arguments that can be specified for that matching method.
 
-In addition, there are a number of optional parameters to assist with record linkage. For each of the parameters, you can either specify each variables value or specify a vector with 1 value to apply it to all relevant variables. If irrelevant, like `stringdist_method` for a numeric match method, it will be ignored.
 
-The optional parameters are
 
 - `term_freq_adjustment`: A `Vector{Bool}` that determines whether you want the term frequencies for each comparision for a given variable. Note: does not adjust match weight. Default value `[false]`.
 
@@ -33,25 +71,22 @@ The optional parameters are
 
 - `stringdist_method`: A `Vector{String}` that specifies the desired string distance method ("jw" Jaro-Winkler (Default), "dl" Damerau-Levenshtein, "jaro" Jaro, "lv" Levenshtein, and "ham" Hamming). Default `["jw"]`.
 
-- `cut_a` A `Vector{Float}` that specifies the first lower bound for string distance cutoff for each comparison. Default `[0.92]`.
+- `cut_a`: A `Float` that specifies the first lower bound for string distance cutoff for each comparison. Default `[0.92]`.
+
+- `cut_b`: A `Float` that specifies the second lower bound for string distance (if varnames in partial) for each comparison. Default `[0.88]`.
+
+
+
+
 
-- `cut_p`: A `Vector{Float}` that specifies the second lower bound for string distance (if varnames in partial) for each comparison. Default `[0.88]`.
 
-- `jw_weight`: A `Vector{Float}` that specifies the Winkler weight for jw string distance for each comparison. Default `[0.1]`.
 
-- `address_field`: A `Vector{Bool}` that specifies whether a comparison contains an address field. Default `[false]`.
 
-- `tol_em`: Convergence tolerance for the EM Algorithm. (default `1e-05`)
 
-- `threshold_match`: Lower bound for the posterior probability that will act as a cutoff for matches. Default `[0.85]`.
 
-- `prior_lambda::Float64`: Default 0.0.
 
-- `w_lambda::Float64`: Default 0.0.
 
-- `prior_pi::Float64`: Default 0.0.
 
-- `w_pi::Float64`: Default 0.0.
 
 - `dedupe_matches`: Whether to dedupe the matches within the dataset.
 
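
As a rough illustration of the configuration layout that the updated README describes, a `config` value might be assembled as in the sketch below. It is not taken from the repository: the key names (`link_type`, `idvar`, `comparisons`, `name`, `variables`, `varname`, `method`) come from the README text in this patch, while the column names (`id_a`, `id_b`, `firstname`, `birthyear`) are hypothetical, and the nested `comparisons` entry (its name `name_block` included) reflects one reading of the nesting description rather than confirmed behavior.

```julia
# Sketch only: key names follow the README text in this patch;
# all column names and the nested block's name are hypothetical.
config = Dict{String, Any}(
    "link_type" => "link_only",
    "idvar" => ["id_a", "id_b"],          # [dfA, dfB] order
    "comparisons" => Dict{String, Any}(
        "name" => "total",                # required name for the base dictionary
        "variables" => Dict{String, Any}[
            # A plain variable dictionary: `varname` plus `method`.
            Dict{String, Any}("varname" => "birthyear", "method" => "exact"),
            # Assumed form of a nested `comparisons` dictionary standing in
            # for a variable dictionary.
            Dict{String, Any}(
                "name" => "name_block",
                "variables" => Dict{String, Any}[
                    Dict{String, Any}("varname" => "firstname", "method" => "string"),
                ],
            ),
        ],
    ),
)
```

Optional parameters such as `tol_em` or `threshold_match` are left out here because the README lists their defaults but does not show how they are written inside the dictionary.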