deffK

CDCgov · Feb 29, 2024 · d9fa92c · d9fa92c
1 parent 1d4bc20
commit d9fa92c
Show file tree

Hide file tree

Showing 7 changed files with 123 additions and 119 deletions.
diff --git a/R/tab.R b/R/tab.R
@@ -149,11 +149,21 @@ tab = function(...
 	}
 
 	##
-	sto = svytotal(frm, design) # , deff = TRUE)
+	sto = svytotal(frm, design) # , deff = "replace")
 	mmcr = data.frame(x = as.numeric(sto)
 		, s = sqrt(diag(attr(sto, "var"))) )
-  mmcr$samp.size = .calc_samp_size(design = design, vr = vr, counts = counts)
-  mmcr$counts = counts
+	mmcr$counts = counts
+
+	# deff = attr(sto, "deff") %>% diag
+	# I am having trouble interpreting this deff.
+	# In some situations, results are unusual.
+	# total(), tab("AGER"), tab("PAYNOCHG")
+	# Using Kish design effect instead.
+
+	mmcr$deff = by(design$prob, design$variables[,vr], deffK) %>% as.numeric
+	mmcr$samp.size = mmcr$counts / mmcr$deff
+	idx.bad = which(mmcr$samp.size > mmcr$counts)
+	mmcr$samp.size[idx.bad] = mmcr$counts[idx.bad]
 
   df1 = degf(design)
   mmcr$degf = df1
@@ -261,21 +271,3 @@ tab = function(...
 	}
   df1
 }
-
-
-.calc_samp_size = function(design, vr, counts) {
-
-  # In svytotal(frm, design, deff = TRUE), DEff sometimes
-  # appears incorrect. If no variability, DEff = Inf.
-  # Calculating "Kish's Effective Sample Size" directly, bypassing DEff
-  #	deff = attr(sto, "deff") %>% diag
-
-  design$wi = 1 / design$prob
-  design$wi[design$prob <= 0] = 0
-  design$wi2 = design$wi^2
-  sum_wi = by(design$wi, design$variables[,vr], sum) %>% as.numeric
-  sum_wi2 = by(design$wi2, design$variables[,vr], sum) %>% as.numeric
-  neff = sum_wi^2 / sum_wi2
-  assert_that(length(neff) == length(counts))
-  pmin(counts, neff)
-}
diff --git a/R/total.R b/R/total.R
@@ -31,12 +31,16 @@ total = function(csv = getOption("surveytable.csv") ) {
   }
 
   ##
-  sto = svytotal(~Total, design) # , deff = TRUE)
+  sto = svytotal(~Total, design) # , deff = "replace")
   mmcr = data.frame(x = as.numeric(sto)
               , s = sqrt(diag(attr(sto, "var"))) )
-  mmcr$samp.size = .calc_samp_size(design = design, vr = "Total", counts = counts)
   mmcr$counts = counts
 
+  mmcr$deff = deffK(design$prob)
+  mmcr$samp.size = mmcr$counts / mmcr$deff
+  idx.bad = which(mmcr$samp.size > mmcr$counts)
+  mmcr$samp.size[idx.bad] = mmcr$counts[idx.bad]
+
   df1 = degf(design)
   mmcr$degf = df1
 

diff --git a/R/z_deffK.R b/R/z_deffK.R
@@ -0,0 +1,22 @@
+# From:
+# Richard Valliant and George Zipf (2023). PracTools: Designing and
+# Weighting Survey Samples. R package version 1.4.2.
+# https://CRAN.R-project.org/package=PracTools
+
+# Kish design effect
+# Kish design effect due to unequal weights
+# w: vector of inverses of selection probabilities for a sample
+
+# deffK = function (w)
+deffK = function(prob)
+{
+  w = 1 / prob
+  assert_that(all(w > 0), all(w < Inf))
+
+  # if (any(w <= 0))
+  #   warning("Some weights are less than or equal to 0.\n")
+  n <- length(w)
+  ret = 1 + sum((w - mean(w))^2)/n/mean(w)^2
+  assert_that(all(ret > 0))
+  ret
+}
diff --git a/docs/articles/Example-National-Ambulatory-Medical-Care-Survey-NAMCS-tables.html b/docs/articles/Example-National-Ambulatory-Medical-Care-Survey-NAMCS-tables.html
diff --git a/docs/pkgdown.yml b/docs/pkgdown.yml
@@ -4,7 +4,7 @@ pkgdown_sha: ~
 articles:
   Example-National-Ambulatory-Medical-Care-Survey-NAMCS-tables: Example-National-Ambulatory-Medical-Care-Survey-NAMCS-tables.html
   surveytable: surveytable.html
-last_built: 2024-02-23T16:05Z
+last_built: 2024-02-29T02:56Z
 urls:
   reference: https://cdcgov.github.io/surveytable/reference
   article: https://cdcgov.github.io/surveytable/articles

diff --git a/docs/search.json b/docs/search.json
diff --git a/vignettes/Example-National-Ambulatory-Medical-Care-Survey-NAMCS-tables.Rmd b/vignettes/Example-National-Ambulatory-Medical-Care-Survey-NAMCS-tables.Rmd
@@ -14,9 +14,9 @@ knitr::opts_chunk$set(
 )
 ```
 
-This example uses the National Ambulatory Medical Care Survey (NAMCS) 2019 Public Use File (PUF) to replicate certain tables from the [National Ambulatory Medical Care Survey: 2019 National Summary Tables](https://www.cdc.gov/nchs/data/ahcd/namcs_summary/2019-namcs-web-tables-508.pdf). NAMCS is "an annual nationally representative sample survey of visits to non-federal office-based patient care physicians, excluding anesthesiologists, radiologists, and pathologists." Note that the unit of observation is visits, not patients – this distinction is important since a single patient can make multiple visits.
+This example uses the National Ambulatory Medical Care Survey (NAMCS) 2019 Public Use File (PUF) to replicate certain tables from the [National Ambulatory Medical Care Survey: 2019 National Summary Tables](https://www.cdc.gov/nchs/data/ahcd/namcs_summary/2019-namcs-web-tables-508.pdf). NAMCS is "an annual nationally representative sample survey of visits to non-federal office-based patient care physicians, excluding anesthesiologists, radiologists, and pathologists." Note that the unit of observation is visits, not patients -- this distinction is important since a single patient can make multiple visits.
 
-Selected variables from NAMCS 2019 come with the `surveytable` package, for use in examples, in an object called `namcs2019sv`. To make the below code more reusable, we'll call the survey by a generic name, namely, `mysurvey`.
+Selected variables from NAMCS 2019 come with the `surveytable` package, for use in examples, in an object called `namcs2019sv`.
 
 # Begin
 
@@ -29,8 +29,7 @@ library(surveytable)
 Now, specify the survey that you'd like to analyze.
 
 ```{r, results='asis'}
-mysurvey = namcs2019sv
-set_survey(mysurvey)
+set_survey(namcs2019sv)
 ```
 
 Check the survey name, survey design variables, and the number of observations to verify that it all looks correct.
@@ -67,15 +66,9 @@ Once we have the overall population estimate, the overall rate is:
 total_rate(uspop2019$total)
 ```
 
-Recall that survey objects have an element called `variables`, which is a data frame that contains the survey variables. Let's examine the levels of `MSA`.
+To calculate the rates for a particular variable, we need to provide a data frame with a variable called `Level` that matches the levels of the variable in the survey, and a variable called `Population` that gives the population size (which is assumed to be a constant rather than a random variable).
 
-```{r}
-levels(mysurvey$variables$MSA)
-```
-
-To calculate the rates for a particular variable, we need to provide a data frame with a variable called `Level` that matches the levels of the variable in the survey. `Population` gives the population estimate.
-
-For example, for `MSA`, we need a data frame as follows:
+For `MSA`, we can see the levels of the variables just by using the `tab()` command, just as we did above. Thus, to calculate rates, we need a data frame as follows:
 
 ```{r}
 uspop2019$MSA