documentation added for output and cleaned up the results object
jw2249a committed Apr 2, 2024
1 parent c18f532 commit c313e8c
Showing 2 changed files with 24 additions and 21 deletions.
43 changes: 23 additions & 20 deletions README.md
@@ -22,7 +22,7 @@ The basic arguments for the `fastLink` function to run are

In addition, there are a number of optional parameters to assist with record linkage. For each parameter, you can either specify a value for each variable or supply a one-element vector that is applied to all relevant variables. A parameter that does not apply to a variable, such as `stringdist_method` for a numeric match method, is ignored.
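
The one-value-or-one-per-variable convention described above can be sketched as follows. `expand_param` is a hypothetical helper written for this illustration and is not taken from the package; the variable names are made up as well.

```julia
# Hypothetical helper, not part of the package: expands a one-element
# parameter vector so that every linkage variable gets a value.
function expand_param(values::Vector{T}, n_vars::Int) where {T}
    if length(values) == 1
        return fill(values[1], n_vars)      # one value applied to all variables
    elseif length(values) == n_vars
        return values                       # already one value per variable
    else
        throw(ArgumentError("expected 1 or $n_vars values, got $(length(values))"))
    end
end

varnames = ["firstname", "lastname", "birthyear"]   # illustrative variable names
cut_a = expand_param([0.92], length(varnames))      # [0.92, 0.92, 0.92]
tf_adj = expand_param([true, false, false], length(varnames))
```

Rejecting any other length is a design choice of this sketch; the package may handle mismatched lengths differently.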

The optional parameters are.
The optional parameters are

- `term_freq_adjustment`: A `Vector{Bool}` that determines whether you want the term frequencies for each comparison for a given variable. Note: does not adjust match weight. Default value `[false]`.

@@ -36,7 +36,7 @@ The optional parameters are.

- `cut_a`: A `Vector{Float}` that specifies the first lower bound for string distance cutoff for each comparison. Default `[0.92]`.

- `cut_p::Float`: A `Vector{Float}` that specifies the second lower bound for string distance (if varnames in partial) for each comparison. Default `[0.88]`.
- `cut_p`: A `Vector{Float}` that specifies the second lower bound for string distance (if varnames in partial) for each comparison. Default `[0.88]`.

- `jw_weight`: A `Vector{Float}` that specifies the Winkler weight for jw string distance for each comparison. Default `[0.1]`.
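
As a rough illustration of how the two cutoffs could partition a similarity score, the sketch below maps a score to the 0/1/2/3 coding that the output section uses for agreement patterns. `agreement_level` is a hypothetical function, and the use of `>=` rather than `>` is an assumption; the package's own comparison logic may differ.

```julia
# Illustrative only: maps a similarity score in [0, 1] to an agreement level
# using the documented defaults. Whether the package compares with >= or >
# is an assumption here.
function agreement_level(score::Union{Missing,Float64}; cut_a::Float64=0.92, cut_p::Float64=0.88)
    ismissing(score) && return 3        # missing comparison
    score >= cut_a && return 2          # exact (or near-exact) agreement
    score >= cut_p && return 1          # partial agreement
    return 0                            # non-match
end

scores = [0.97, 0.90, 0.50, missing]            # made-up similarity scores
levels = [agreement_level(s) for s in scores]   # [2, 1, 0, 3]
```

The partial level only applies to variables that allow partial matches, so a variable without partial matching would effectively use `cut_a` alone.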

@@ -60,38 +60,41 @@ The optional parameters are.
__________________
### `fastLink`'s output

- `NamedTuple` with these vars
- `indices`
A `NamedTuple` with these vars:

- `iter_converge`
- `indices` - a vector with indices in `dfA` and `dfB` that are in each pattern group (see `patterns_w` or `patterns_b`)

- `matched_ids`
- `matched_ids` - same as `indices` but using `idvars` from input parameters

- `obs_a`
- `iter_converge` - number of iterations the expectation-maximization (EM) algorithm took to converge.

- `obs_b`
- `obs_a` - observations in `dfA`

- `p_m`
- `obs_b` - observations in `dfB`

- `p_u`
- `p_m` - posterior match probability

- `patterns_b`
- `p_u` - posterior **non-match** probability

- `patterns_w`
- `patterns_w` - a `DataFrame` of:
- `gamma_` - An `Int64` with the gamma values for each variable (similar to `patterns_b`)
- `counts` - An `Int64` with counts for each agreement pattern
- `weights` - An `Int64` with partial match weights for each agreement pattern
- `p_gamma_jm` - A `Float64` that has the posterior probability that a pair matches for each agreement pattern
- `p_gamma_ju` - A `Float64` that has the posterior probability that a pair **does not** match for each agreement pattern
- `is_match` - A `Bool` that specifies whether the given pattern is above the input parameter `threshold_match`

- `pgamma_jm`
- `patterns_b` - vector of all patterns observed. Each pattern has a score for each variable (0 nonmatch, 1 partial, 2 exact, 3 missing)

- `pgamma_ju`
- `pgamma_km` - A `Vector{Vector{Float64}}` with posterior probabilities for each variable in the EM algorithm. Ordered (0,1,2).

- `pgamma_km`
- `pgamma_ku` - A `Vector{Vector{Float64}}` with posterior probabilities for each variable in the EM algorithm. Ordered (2,1,0).

- `pgamma_ku`
- `tf_adj_table` - A `Vector{DataFrame}` with one DataFrame per match pattern and one row per comparison in that pattern. Term-frequency columns are suffixed with the letter of their corresponding dataset (`_a` or `_b`).

- `tf_adj_table`
- `varnames` - A `Vector{String}` of the input variable names

- `varnames`

- `zeta_j`
- `zeta_j` - A `Vector{Float64}` with the posterior match probabilities for each agreement pattern.
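
A minimal sketch of reading a few of these fields is shown below. The `results` tuple is a mock with made-up values, not real output from `fastLink`; it assumes `patterns_b` is a vector of per-variable score vectors, and the `threshold_match` value is chosen only for illustration.

```julia
# Mock result with made-up values; a real result comes from `fastLink`.
results = (
    varnames = ["firstname", "lastname"],
    iter_converge = 12,
    patterns_b = [[2, 2], [2, 1], [0, 0]],   # 0 nonmatch, 1 partial, 2 exact, 3 missing
    zeta_j = [0.99, 0.72, 0.01],             # posterior match probability per pattern
)

threshold_match = 0.85   # assumed cutoff for this illustration

# Keep the agreement patterns whose posterior match probability clears the threshold.
matched = [p for (p, z) in zip(results.patterns_b, results.zeta_j) if z >= threshold_match]

println("EM converged after ", results.iter_converge, " iterations")
println("Patterns classified as matches: ", matched)
```

Because the output is a `NamedTuple`, every field is reached with dot syntax (`results.zeta_j`), and `keys(results)` lists the available field names.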

# Examples
```julia
2 changes: 1 addition & 1 deletion src/getMatches.jl
@@ -22,7 +22,7 @@ end
# applies term frequency adjustments to table
function tf_adj_table(resultsEM::NamedTuple,varnames::Vector{String},tf_table_x::Vector{Vector{Float16}},tf_table_y::Vector{Vector{Float16}})
tf_vec = [DataFrame() for _ in eachindex(resultsEM.indices)]
new_names=vcat("tf_" .* varnames .* "_x", "tf_" .* varnames .* "_y")
new_names=vcat("tf_" .* varnames .* "_a", "tf_" .* varnames .* "_b")
for i in eachindex(resultsEM.indices)
result_len=length(resultsEM.indices[i])
tf_results=DataFrame(ones(Float16,(result_len, 2*length(varnames))),new_names)
