formatting & consistency

ref #68
nlsy-links · Jul 25, 2021 · b942472
1 parent efdb38c
commit b942472

Showing 4 changed files with 141 additions and 136 deletions.
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -3,6 +3,7 @@
         "Mplus",
         "NLSY",
         "NLSYC",
+        "O'Keefe",
         "asis",
         "authorrunning",
         "bibstyle",
@@ -13,13 +14,17 @@
         "d'onofrio",
         "dataset",
         "datasets",
+        "dplyr",
         "eval",
         "extdata",
         "fulker",
         "intergenerational",
+        "kable",
+        "knitr",
         "kohler",
         "lavaan",
         "maes",
+        "magrittr",
         "missingness",
         "monospace",
         "neale",

diff --git a/article/nlsy-links-article.Rmd b/article/nlsy-links-article.Rmd
@@ -66,7 +66,7 @@ Starting in the late 1960s, the NLS has provided multiple longitudinal cohorts t
 
 The Nlsy79 sample refers to both the original 12,686 subjects interviewed in 1979, and their 11,500+ children (termed "Nlsy79 Gen1" and "Nlsy79 Gen2", respectively).  Nlsy79 Gen1 reflects the [original NLSY79 study](https://www.nlsinfo.org/content/cohorts/nlsy79), while Nlsy Gen2 reflects their children, the [NLSY79  Children and Young Adults](https://www.nlsinfo.org/content/cohorts/nlsy79-children).  The NLSY does not interview "Nlsy79 Gen0" (the parents of Gen1) or "Nlsy79 Gen3" (the children of Gen2), though it does contain direct and indirect information about them.
 
-More specifically, the Nlsy79 Gen2 subjects are the biological offspring of the Nlsy79 Gen1 mothers.  Until roughly the age of 14, the "children" completed the NLSY-C survey, and then became "young adults" and completed the NLSY-YA survey.  Though they are the same respondents, different funding mechanisms and different survey items necessitate the distinction.  This cohort is sometimes abbreviated as "NLSY79-C", "NLSY79C", "NLSY-C" or "NLSYC". This package uses '"Nlsy79 Gen2" to refer to subjects of this generation, regardless of their age at the time of the survey.
+More specifically, the Nlsy79 Gen2 subjects are the biological offspring of the Nlsy79 Gen1 mothers.  Until roughly the age of 14, the "children" completed the NLSY-C survey, and then became "young adults" and completed the NLSY-YA survey.  Though they are the same respondents, different funding mechanisms and different survey items necessitate the distinction.  This cohort is sometimes abbreviated as "NLSY79-C", "NLSY79C", "NLSY-C" or "NLSYC". This package uses "Nlsy79 Gen2" to refer to subjects of this generation, regardless of their age at the time of the survey.
 
 The terminology for the Nlsy97 sample is similar yet simpler than the Nlsy79, because the explicit respondents come from a single generation (in a sense the Nlsy97 Gen1).  A few variables reflect Gen0 and Gen2.  In contrast to the Nlsy79, the Nlsy97 contains more information about their housemates, even if the housemates are not subjects themselves.
 
@@ -95,7 +95,7 @@ Retrieving Data with the NLS Investigator
 
 When a researcher pursues a new idea, we suggest starting by exploring what the NLSY can offer through (a) the vast online documentation and (b) the [NLS Investigator](https://www.nlsinfo.org/investigator).  The documentation online (<https://www.nlsinfo.org/content/getting-started>) has general information (*e.g.*, how the nationally representative sample was collected), topical information (*e.g.*, what medical and health information has been collected across survey waves and subject ages), and descriptive summaries (*e.g.*, attrition over time for different race and ethnic groups).  This material has helpful suggestions about which variables are available and appropriate.
 
-With these hints, identify and download the specific variables from the NLS Investigator.  The NLS Investigator is described briefly in the second example.  For more detailed instruction see the NlsyLinks package's [NLS Investigator vignette](https://nlsy-links.github.io/NlsyLinks/articles/nls-investigator.html) or the official [NLS documentation](https://www.nlsinfo.org/content/access-data-investigator/investigator-user-guide).  Researchers new to the NLSY should expect at least a dozen round trips as they iteratively improve and complete their set of variables.  First, select the 'Study', such as 'NLSY79 Child & Young Adult' (which corresponds to 'Nlsy79 Gen2' in our terminology).  Second, select your desired variables from the tens of thousands available.
+With these hints, identify and download the specific variables from the NLS Investigator.  The NLS Investigator is described briefly in the second example.  For more detailed instruction see the NlsyLinks package's [NLS Investigator vignette](https://nlsy-links.github.io/NlsyLinks/articles/nls-investigator.html) or the official [NLS documentation](https://www.nlsinfo.org/content/access-data-investigator/investigator-user-guide).  Researchers new to the NLSY should expect at least a dozen round trips as they iteratively improve and complete their set of variables.  First, select the "Study", such as "NLSY79 Child & Young Adult" (which corresponds to Nlsy79 Gen2 in our terminology).  Second, select your desired variables from the tens of thousands available.
 
 Before starting the examples, first verify that the NlsyLinks and dplyr packages are installed correctly; for the later examples, the lavaan package is required.  If you are using an older version of R, replace the native pipes (`|>`) with pipes from the magrittr package (`%>%`).
 
@@ -123,11 +123,11 @@ A DeFries-Fulker (**DF**) Analysis uses linear regression to estimate the $a^2$,
 1. Use the NLS Investigator to select and download a Gen2 dataset.
 2. Open R and create a new script (see [Appendix: R Scripts](#appendix-creating-and-saving-r-scripts)) and load the NlsyLinks package.
  Within the R script, identify the locations of the downloaded data file, and load it into a data frame.
-3. Within the R script, load the linking dataset.  Then select only Gen2 subjects.  The 'Pair' version of the linking dataset is essentially an upper triangle of a symmetric sparse matrix.
+3. Within the R script, load the linking dataset.  Then select only Gen2 subjects.  The "Pair" version of the linking dataset is essentially an upper triangle of a symmetric sparse matrix.
 4. Load and assign the `ExtraOutcomes79` dataset.
-5. Specify the outcome variable name and filter out all subjects who have a negative value in this variable.  The NLSY typically uses negative values to indicate different types of missingness (see 'Further Information' below).
-6. Create a double-entered file by calling the 'CreatePairLinksDoubleEntered` function.  At minimum, pass the (i) outcome dataset, the (ii) linking dataset, and the (iii) name(s) of the outcome variable(s).  *(There are occasions when a single-entered file is more appropriate for a DF analysis.  See Rodgers & Kohler, 2005, for additional information.)*
-7. Use the 'DeFriesFulkerMethod3` function (*i.e.*, general linear model) to estimate the coefficients of the DF model.
+5. Specify the outcome variable name and filter out all subjects who have a negative value in this variable.  The NLSY typically uses negative values to indicate different types of missingness (see Further Information below).
+6. Create a double-entered file by calling the `CreatePairLinksDoubleEntered()` function.  At minimum, pass the (i) outcome dataset, the (ii) linking dataset, and the (iii) name(s) of the outcome variable(s).  *(There are occasions when a single-entered file is more appropriate for a DF analysis.  See Rodgers & Kohler, 2005, for additional information.)*
+7. Use the `DeFriesFulkerMethod3()` function (*i.e.*, general linear model) to estimate the coefficients of the DF model.
 
 ```{r}
 # Step 2: Load the package containing the linking routines.
@@ -140,7 +140,7 @@ dsLinking <-
 summary(dsLinking) # Notice there are 11k records (one for each pair).
 
 # Step 4: Load the OUTCOMES dataset, and then examine the summary.
-dsOutcomes <- ExtraOutcomes79 #'ds' stands for 'Data Set'
+dsOutcomes <- ExtraOutcomes79 # ds stands for Data Set
 summary(dsOutcomes)
 
 # Step 5: If the negative values (which represent NLSY missing or
@@ -160,7 +160,7 @@ dsDouble <-
   CreatePairLinksDoubleEntered(
     outcomeDataset   = dsOutcomes,
     linksPairDataset = dsLinking,
-    outcomeNames     = c('MathStandardized')
+    outcomeNames     = c("MathStandardized")
   )
 # Notice there are now two records for each unique pair.
 summary(dsDouble)
@@ -184,14 +184,14 @@ The second example differs from the previous example in two ways.  First, the ou
 
 The steps are:
 
-1. Use the NLS Investigator to select and download a Gen2 dataset.  Select the variables 'length of gestation of child in weeks' (`C03280.00`), 'weight of child at birth in ounces' (`C03286.00`), and 'length of child at birth' (`C03288.00`), and then download the *.zip file to your local computer.
+1. Use the NLS Investigator to select and download a Gen2 dataset.  Select the variables "length of gestation of child in weeks" (`C03280.00`), "weight of child at birth in ounces" (`C03286.00`), and "length of child at birth" (`C03288.00`), and then download the *.zip file to your local computer.
 2. [Open R and create a new script](#appendix-creating-and-saving-r-scripts) and load the NlsyLinks package.
 3. Within the R script, load the linking dataset.  Then select only Gen2 subjects.
-4. Read the CSV into R as a `data.frame` using `ReadCsvNlsy79Gen2`.
+4. Read the CSV into R as a `data.frame` using `ReadCsvNlsy79Gen2()`.
 5. Verify the desired outcome column exists, and rename it something meaningful to your project.  In this example, we rename column `C0328600` to `BirthWeightInOunces`.
-6. Filter out all subjects who have a negative `BirthWeightInOunces` value.  See the 'Further Information' note in the previous example.
-7. Create a double-entered file by calling the `CreatePairLinksDoubleEntered` function.  At minimum, pass the (i) outcome dataset, the (ii) linking dataset, and the (iii) name(s) of the outcome variable(s).
-8. Call the `AceUnivariate` function to estimate the coefficients.
+6. Filter out all subjects who have a negative `BirthWeightInOunces` value.  See the Further Information note in the previous example.
+7. Create a double-entered file by calling `CreatePairLinksDoubleEntered()`.  At minimum, pass the (i) outcome dataset, the (ii) linking dataset, and the (iii) name(s) of the outcome variable(s).
+8. Call `AceUnivariate()` to estimate the coefficients.
 
 ```{r}
 # Step 2: Load the package containing the linking routines.
@@ -213,7 +213,7 @@ dsOutcomes <- ReadCsvNlsy79Gen2(filePathOutcomes)
 summary(dsOutcomes)
 
 # Step 5: Verify and rename an existing column.
-VerifyColumnExists(dsOutcomes, "C0328600") # Should return '10' in this example.
+VerifyColumnExists(dsOutcomes, "C0328600") # Should return 10 in this example.
 dsOutcomes <-
   dsOutcomes |>
   dplyr::rename(
@@ -271,7 +271,7 @@ The steps are:
 (Steps 1-5 proceed identically to the first example.)
 
 6. Create a *single*-entered file by calling the `CreatePairLinksSingleEntered` function.  At minimum, pass the (i) outcome dataset, the (ii) linking dataset, and the (iii) name(s) of the outcome variable(s).
-7. Declare the names of the outcome variables corresponding to the two members in each pair.  Assuming the variable is called 'ZZZ' and the preceding steps have been followed, the variable 'ZZZ\_S1' corresponds to the first members and ZZZ\_S2' corresponds to the second members.
+7. Declare the names of the outcome variables corresponding to the two members in each pair.  Assuming the variable is called `ZZZ` and the preceding steps have been followed, the variable `ZZZ\_S1` corresponds to the first members and `ZZZ\_S2` corresponds to the second members.
 8. Create a GroupSummary `data.frame`, which identifies the `R` groups that should be considered by the model.  Inspect the output to see if the groups show unexpected or fishy differences.
 9. Create a `data.frame` with cleaned variables to pass to the SEM function.  This `data.frame` contains only the three necessary rows and columns.
 10. Estimate the SEM with the lavaan package.  The function returns an `S4` object, which shows the basic ACE information.
@@ -299,7 +299,7 @@ dsSingle <-
   CreatePairLinksSingleEntered(
     outcomeDataset   = dsOutcomes,
     linksPairDataset = dsLinking,
-    outcomeNames     = c('MathStandardized')
+    outcomeNames     = c("MathStandardized")
   )
 
 # Step 7: Declare the names for the two outcome variables.
@@ -323,7 +323,7 @@ dsClean <-
 # Step 10: Run the model
 ace <- AceLavaanGroup(dsClean)
 ace
-# Notice the 'CaseCount' is 8.5k instead of 17k.
+# Notice the CaseCount is 8.5k instead of 17k.
 #   This is because (a) one pair with R=.75 was excluded, and
 #   (b) the SEM uses a single-entered dataset (not double).
 
@@ -358,7 +358,7 @@ Links79Pair |>
   knitr::kable(
     format      = "latex",
     format.args = list(big.mark = ","),
-    caption     = "Count of Nlsy79 relationships, by `RelationshipPath`.  (Recall that 'AuntNiece' also contains uncles and nephews.)"
+    caption     = "Count of Nlsy79 relationships, by `RelationshipPath`.  (Recall that AuntNiece also contains uncles and nephews.)"
   )
 ```
 
@@ -388,7 +388,7 @@ dsSingle <-
   CreatePairLinksSingleEntered(
     outcomeDataset   = dsOutcomes,
     linksPairDataset = dsLinking,
-    outcomeNames     = c('HeightZGenderAge')
+    outcomeNames     = c("HeightZGenderAge")
   )
 
 # Step 7: Declare the names for the two outcome variables.
@@ -417,7 +417,7 @@ ace
 
 Notice the ACE estimates are very similar to the previous example, but the number of pairs has increased by 6x --from 4,185 to 24,700.  The number of *subjects* doubles when Gen2 is added, and the number of *relationship pairs* increases combinatorially.  When an extended family's entire pedigree is considered by the model, many more types of links are possible than if just nuclear families are considered.  This increased statistical power is even more important when the population's $a^2$ is small or moderate.
 
-Note that the analysis has `r scales::comma(ace@CaseCount)` relationships instead of the entire `r scales::comma(nrow(Links79Pair))`.  This is primarily because not all subjects have a value for 'adult height' (and that's mostly because a lot of Gen2 subjects are too young).  There are `r scales::comma(sum(!is.na(Links79PairExpanded$RFull)))` pairs with a nonmissing value in `RFull`, meaning that `r round(mean(!is.na(Links79PairExpanded$RFull))*100, 1)` are classified.  We feel comfortable claiming that if a researcher has a phenotype for both members of a pair, there's a 99+% chance we have an `RFull` for it.  For a description of the `R` and `RFull` variables, please see the `Links79Pair` entry in the package [reference manual](https://nlsy-links.github.io/NlsyLinks/).
+Note that the analysis has `r scales::comma(ace@CaseCount)` relationships instead of the entire `r scales::comma(nrow(Links79Pair))`.  This is primarily because not all subjects have a value for adult height (and that's mostly because a lot of Gen2 subjects are too young).  There are `r scales::comma(sum(!is.na(Links79PairExpanded$RFull)))` pairs with a nonmissing value in `RFull`, meaning that `r round(mean(!is.na(Links79PairExpanded$RFull))*100, 1)` are classified.  We feel comfortable claiming that if a researcher has a phenotype for both members of a pair, the pair will likely have an `RFull`.  For a description of the `R` and `RFull` variables, please see the `Links79Pair` entry in the package [reference manual](https://nlsy-links.github.io/NlsyLinks/).
 
 More Advanced ACE Analyses
 ----------------------------------------------------

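Note on the single-entered and double-entered files mentioned in the article steps above: the difference can be illustrated with a minimal base R sketch. This is only an illustration under stated assumptions, not the NlsyLinks implementation. The toy values, the `ds_*` object names, and the `lookup_outcome()` helper are hypothetical; the column names (`SubjectTag_S1`, `MathStandardized_S1`, and so on) are assumed to follow the `_S1`/`_S2` suffix convention the article describes.

```r
# Hypothetical toy data; none of these values come from the NLSY.
ds_outcomes <- data.frame(
  SubjectTag       = c(101, 102, 201, 202),
  MathStandardized = c(105, 98, -3, 110) # -3 mimics a negative missingness code
)
ds_links <- data.frame(
  SubjectTag_S1 = c(101, 201),
  SubjectTag_S2 = c(102, 202),
  R             = c(0.5, 0.25)
)

# Filter out subjects whose outcome is negative (i.e., missing).
ds_outcomes <- ds_outcomes[ds_outcomes$MathStandardized >= 0, ]

# Hypothetical helper: look up each subject's outcome by tag (NA if absent).
lookup_outcome <- function(tags) {
  ds_outcomes$MathStandardized[match(tags, ds_outcomes$SubjectTag)]
}

# Single-entered: one row per pair, like an upper triangle of a matrix.
ds_single <- data.frame(
  SubjectTag_S1       = ds_links$SubjectTag_S1,
  SubjectTag_S2       = ds_links$SubjectTag_S2,
  R                   = ds_links$R,
  MathStandardized_S1 = lookup_outcome(ds_links$SubjectTag_S1),
  MathStandardized_S2 = lookup_outcome(ds_links$SubjectTag_S2)
)
# Keep only pairs in which both members have an outcome.
ds_single <- ds_single[complete.cases(ds_single), ]

# Double-entered: append a copy of each pair with the two members swapped.
ds_swapped <- data.frame(
  SubjectTag_S1       = ds_single$SubjectTag_S2,
  SubjectTag_S2       = ds_single$SubjectTag_S1,
  R                   = ds_single$R,
  MathStandardized_S1 = ds_single$MathStandardized_S2,
  MathStandardized_S2 = ds_single$MathStandardized_S1
)
ds_double <- rbind(ds_single, ds_swapped)

nrow(ds_single) # one record per unique pair
nrow(ds_double) # two records per unique pair
```

In this toy case the second pair is dropped because one member has a negative outcome, so the double-entered file holds exactly twice as many rows as the single-entered one.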
diff --git a/article/nlsy-links-article.pdf b/article/nlsy-links-article.pdf