Developing surveys to gather accurate information about populations involves a more intricate and time-intensive process compared to surveys that use non-random criteria for selecting samples. Researchers can spend months, or even years, developing the study design, questions, and other methods for a single survey to ensure high-quality data is collected.
While this book focuses on the analysis methods of complex surveys, understanding the entire survey life cycle can provide a better insight into what types of analyses should be conducted on the data. The survey life cycle consists of the stages required to successfully execute a survey project. Each stage influences the timing, costs, and feasibility of the survey, consequently impacting the data collected and how it should be analyzed.
The survey life cycle starts with a research topic or question of interest (e.g., what impact does childhood trauma have on health outcomes later in life). Researchers typically review existing data sources to determine if data are already available that can answer this question, as drawing from available resources can result in a reduced burden on respondents, cheaper research costs, and faster research outcomes. However, if existing data cannot answer the nuances of the research question, a survey can be used to capture the exact data that the researcher needs.
To gain a deeper understanding of survey design and implementation, there are many pieces of existing literature that we recommend reviewing in detail (e.g., Dillman, Smyth, and Christian 2014; Groves et al. 2009; Tourangeau, Rips, and Rasinski 2000; Bradburn, Sudman, and Wansink 2004; Valliant, Dever, and Kreuter 2013; Biemer and Lyberg 2003).
When starting a survey, there are multiple things to consider. Errors are the differences between the true values of the variables being studied and the values obtained through the survey. Each step and decision made before the launch of the survey can impact the types of error that are introduced into the data, which in turn impact how to interpret the results.
Generally, survey researchers consider there to be seven main sources of error that fall into two major categories of Representation and Measurement (Groves et al. 2009):
Almost every survey will have some errors. Researchers attempt to conduct a survey that reduces the total survey error, or the accumulation of all errors that may arise throughout the survey life cycle. By assessing these different types of errors together, researchers can seek strategies to maximize the overall survey quality and improve the reliability and validity of results (Biemer 2010). However, attempts to lower individual sources of error (and therefore total survey error) come at the price of time, resources, and money:
Let’s use a simple example where a researcher is interested in the average number of pets in a household. Our researcher will need to consider the target population for this study. Specifically, are they interested in all households in a given country or households in a more local area (e.g., city or state)? Let’s assume our researcher is interested in the number of pets in a U.S. household with at least one adult (18 years old or older). In this case, using a sampling frame of mailing addresses would provide the least coverage error as the frame would closely match our target population. Specifically, our researcher would most likely want to use the Computerized Delivery Sequence File (CDSF), which is a file of mailing addresses that the United States Postal Service (USPS) creates and covers nearly 100% of U.S. households (Harter et al. 2016). To sample these households, for simplicity, we will use a stratified simple random sample design, where we randomly sample households within each state (i.e., we stratify by state).
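To make this design concrete, here is a minimal sketch of stratified simple random sampling in R. The frame, state values, and sample size per stratum are all hypothetical, chosen only for illustration:

```r
library(dplyr)

# Hypothetical sampling frame of mailing addresses with a state identifier
set.seed(2023)
frame <- tibble(
  address_id = 1:1000,
  state = sample(c("NC", "SC", "VA"), size = 1000, replace = TRUE)
)

# Stratified simple random sample: randomly select 10 addresses per state
strat_sample <- frame %>%
  group_by(state) %>%
  slice_sample(n = 10) %>%
  ungroup()

nrow(strat_sample)  # 30 addresses: 10 from each of the 3 strata
```

In a real survey, the number of sampled households per stratum would be set by precision targets and budget rather than a fixed 10 per state.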
Throughout this chapter, we will build on this example research question to plan a survey.
Researchers can use a single mode to collect data or multiple modes (also called mixed modes). Using mixed modes can allow for broader reach and increase response rates depending on the target population (DeLeeuw 2005, 2018; Biemer et al. 2017). For example, researchers could both call households to conduct a CATI survey and send mail with a PAPI survey to the household. Using both of these modes, researchers could gain participation through the mail from individuals who do not pick up the phone to unknown numbers or through the phone from individuals who do not open all of their mail. However, mode effects (where responses differ based on the mode of response) can be present in the data and may need to be considered during analysis.
When selecting which mode, or modes, to use, understanding the unique aspects of the chosen target population and sampling frame will provide insight into how they can best be reached and engaged. For example, if we plan to survey adults aged 18-24 who live in North Carolina, asking them to complete a survey using CATI (i.e., over the phone) would most likely not be as successful as other modes like the web. This age group does not talk on the phone as much as other generations and often does not answer calls from unknown numbers. Additionally, the mode for contacting respondents relies on what information is available on the sampling frame. For example, if our sampling frame includes an email address, we could email our selected sample members to convince them to complete a survey. Or if the sampling frame is a list of mailing addresses, researchers would have to contact sample members with a letter.
It is important to note that there can be a difference between the contact and survey modes. For example, if we have a sampling frame with addresses, we can send a letter to our sample members and provide information on how to complete a web survey. Or we could use mixed-mode surveys and send sample members a paper and pencil survey with our letter and also ask them to complete the survey online. Combining different contact modes and different survey modes can be useful in reducing unit nonresponse error–where the entire unit (e.g., a household) does not respond to the survey at all–as different sample members may respond better to different contact and survey modes. However, when considering which modes to use, it is important to make access to the survey as easy as possible for sample members to reduce burden and unit nonresponse.
Another way to reduce unit nonresponse error is through varying the language of the contact materials (Dillman, Smyth, and Christian 2014). People are motivated by different things, so constantly repeating the same message may not be helpful. Instead, mixing up the messaging and the type of contact material the sample member receives can increase response rates and reduce the unit nonresponse error. For example, instead of only sending standard letters, researchers could consider sending mailings that invoke “urgent” or “important” thoughts by sending priority letters or using other delivery services like FedEx, UPS, or DHL.
A study timeline may also determine the number and types of contacts. If the timeline is long, then there is ample time for follow-ups and varying the message in contact materials. If the timeline is short, then fewer follow-ups can be implemented. Many studies will start with the tailored design method put forth by Dillman, Smyth, and Christian (2014) and implement five contacts:
Researchers can benefit from the work of others by using questions from other surveys. Demographic questions such as race, ethnicity, or education often use questions from a government census or other official surveys. Other survey questions can be found using question banks, which are compilations of questions that have been asked across various surveys, such as the Inter-university Consortium for Political and Social Research (ICPSR) variable search.
If a question does not exist in a question bank, researchers can craft their own. When creating their own questions, researchers should start with the research question or topic and attempt to write questions that match the concept. The closer the question asked is to the overall concept, the better validity there is. For example, if the researcher wants to know how people consume TV series and movies but only asks a question about how many TVs are in the house, then they would be missing other ways that people watch TV series and movies, such as on other devices or at places outside of the home. As mentioned above, researchers can employ techniques to increase the validity of their questionnaire. For example, questionnaire testing involves conducting a pilot of the survey instrument to identify and fix potential issues before the main survey is conducted. Cognitive interviewing is a technique where researchers walk through the survey with participants, encouraging them to speak their thoughts out loud to uncover how they interpret and understand survey questions.
Additionally, when designing questions, researchers should consider the mode for the survey and adjust language appropriately. In self-administered surveys (e.g., web or mail), respondents can see all the questions and response options, but that is not the case in interviewer-administered surveys (e.g., CATI or CAPI). With interviewer-administered surveys, the response options need to be read aloud to the respondents, so the question may need to be adjusted to allow a better flow to the interview. Additionally, with self-administered surveys, because the respondents are viewing the questionnaire, the formatting of the questions is even more important to ensure accurate measurement. Incorrect formatting or wording can result in measurement error, so following best practices or using existing validated questions can reduce error. There are multiple resources to help researchers draft questions for different modes (e.g., Dillman, Smyth, and Christian 2014; Fowler and Mangione 1989; Bradburn, Sudman, and Wansink 2004; Tourangeau, Couper, and Conrad 2004).
As part of our survey on the average number of pets in a household, researchers may want to know what animal most people prefer to have as a pet. Let’s say we have the following question in our survey:
Once the data collection starts, researchers try to stick to the data collection protocol designed during pre-survey planning. However, a good researcher will adjust their plans and adapt as needed based on the current progress of data collection (Schouten, Peytchev, and Wagner 2018). Some extreme examples are natural disasters that prevent mail or interviewers from reaching sample members. Other adjustments are smaller; for example, if something newsworthy occurs that is connected to the survey, researchers could choose to highlight it in communication materials. In addition to these external factors, there could be factors unique to the survey, such as lower response rates for a specific subgroup, in which case the data collection protocol may need to be adjusted to improve response rates for that group.
Post-survey cleaning and imputation is one of the first steps researchers will take to get the survey responses into a dataset for use by analysts. Data cleaning can consist of correcting inconsistent data (e.g., fixing skip pattern errors or reconciling related questions so responses are consistent with each other), editing numeric entries or open-ended responses for grammar and consistency, or recoding open-ended questions into categories for analysis. There is no universal set of fixed rules that every project must adhere to. Instead, each project or research study should establish its own guidelines and procedures for handling various cleaning scenarios based on its specific objectives.
Researchers should use their best judgment to ensure data integrity, and all decisions should be documented and available to those using the data in the analysis. Each decision a researcher makes impacts processing error, so often researchers will have multiple people review these rules or recode open-ended data and adjudicate any differences in an attempt to reduce this error.
Another crucial step in post-survey processing is imputation. Often, there is item nonresponse where respondents do not answer specific questions. If the questions are crucial to analysis efforts or the research question, researchers may implement imputation in an effort to reduce item nonresponse error. Imputation is a technique for replacing missing or incomplete data values with estimated values. However, as imputation is a way of assigning a value to missing data based on an algorithm or model, it can also introduce processing error, so researchers should consider the overall implications of imputing data compared to having item nonresponse. There are multiple ways imputation can be conducted. We recommend reviewing other resources like Kim and Shao (2021) for more information.
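As a minimal sketch of the simplest approach (mean imputation) on a hypothetical variable — real studies typically use more principled hot-deck or model-based methods such as those covered by Kim and Shao (2021):

```r
# Hypothetical responses to "number of pets" with item nonresponse (NA)
num_pets <- c(2, 0, NA, 1, 3, NA)

# Mean imputation: replace each missing value with the observed mean.
# This preserves the mean of the variable but understates its variance,
# which is one reason more sophisticated methods are usually preferred.
num_pets_imputed <- ifelse(is.na(num_pets),
                           mean(num_pets, na.rm = TRUE),
                           num_pets)

num_pets_imputed  # 2.0 0.0 1.5 1.0 3.0 1.5
```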
Let’s return to the question we created to ask about animal preference. The “other specify” invites respondents to specify the type of animal they prefer to have as a pet. If respondents entered answers such as “puppy,” “turtle,” “rabit,” “rabbit,” “bunny,” “ant farm,” “snake,” “Mr. Purr,” then researchers may wish to categorize these write-in responses to help with analysis. In this example, “puppy” could be assumed to be a reference to a “Dog”, and could be recoded there. The misspelling of “rabit” could be coded along with “rabbit” and “bunny” into a single category of “Bunny or Rabbit”. These are relatively standard decisions that a researcher could make. The remaining write-in responses could be categorized in a few different ways. “Mr. Purr,” which may be someone’s reference to their own cat, could be recoded as “Cat”, or it could remain as “Other” or some category that is “Unknown”. Depending on the number of responses related to each of the others, they could all be combined into a single “Other” category, or maybe categories such as “Reptiles” or “Insects” could be created. Each of these decisions may impact the interpretation of the data, so our researcher should document the types of responses that fall into each of the new categories and any decisions made.
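One way to implement this kind of recoding is with `dplyr::case_when()`. The write-in values below are the hypothetical responses from the example, and the category assignments are judgment calls of the sort described above:

```r
library(dplyr)

# Hypothetical write-in responses from the "other specify" option
write_ins <- tibble(
  pet_other = c("puppy", "rabit", "rabbit", "bunny", "Mr. Purr", "ant farm")
)

# Recode write-ins into analysis categories; these choices are judgment
# calls and should be documented alongside the released data
recoded <- write_ins %>%
  mutate(pet_category = case_when(
    pet_other == "puppy" ~ "Dog",
    pet_other %in% c("rabit", "rabbit", "bunny") ~ "Bunny or Rabbit",
    pet_other == "Mr. Purr" ~ "Cat",
    TRUE ~ "Other"
  ))
```

Keeping the recoding rules in code like this, rather than editing values by hand, makes the decisions reproducible and easy for a second coder to review.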
Weighting can typically be used to address some of the error sources identified in the previous sections. For example, weights may be used to address coverage, sampling, and nonresponse errors. Many published surveys will include an “analysis weight” variable that combines these adjustments. However, weighting itself can also introduce adjustment error, so researchers need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book, and researchers should reference other materials if interested in constructing their own (Valliant and Dever 2018). Instead, this book assumes the survey has been completed, weights are constructed, and data is made available for users. We will walk users through how to read the documentation (Chapter 4) and work with the data and analysis weights provided to analyze and interpret survey results correctly.
In the simple example of our survey, we decided to use a stratified sample by state to select our sample members. Knowing this sampling design, our researcher can include selection weights for analysis that account for how the sample members were selected for the survey. Additionally, the sampling frame may have the type of building associated with each address, so we could include the building type as a potential nonresponse weighting variable, along with some interviewer observations that may be related to our research topic of the average number of pets in a household. Combining these weights, we can create an analytic weight that researchers will need to use when analyzing the data.
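A minimal sketch of how these pieces might combine into an analytic weight. The base (selection) weights and response rates below are hypothetical, and real nonresponse adjustments are usually model-based (e.g., response propensity models) rather than simple cell adjustments:

```r
library(dplyr)

# Hypothetical respondents with a selection (base) weight from the
# stratified design and a response rate within their adjustment cell
# (e.g., a cell defined by building type)
respondents <- tibble(
  hh_id = 1:4,
  base_wt = c(120, 120, 300, 300),
  cell_resp_rate = c(0.6, 0.6, 0.5, 0.5)
)

# Analytic weight = base weight x nonresponse adjustment (1 / response rate)
respondents <- respondents %>%
  mutate(analytic_wt = base_wt / cell_resp_rate)

respondents$analytic_wt  # 200 200 600 600
```

The intuition: each respondent's weight is inflated to also represent similar sampled households that did not respond.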
Before data is made publicly available, researchers will need to ensure that individual respondents cannot be identified by the data when confidentiality is required. There are a variety of different methods that can be used, including data swapping, top or bottom coding, coarsening, and perturbation. In data swapping, researchers may swap specific data values across different respondents so that it does not impact insights from the data but ensures that specific individuals cannot be identified. For extreme values, top and bottom coding is sometimes used. For example, researchers may top-code income values such that households with income greater than $99,999,999 are coded into a single category of $99,999,999 or more. Other disclosure methods may include aggregating response categories or location information to avoid having only a few respondents in a given group who could then be identified. For example, researchers may use coarsening to display income in categories instead of as a continuous variable. Data producers may also perturb the data by adding random noise. There is as much art as there is science to the methods used for disclosure, and in documentation, researchers should only provide high-level comments that disclosure was conducted and not specific details to ensure nobody can reverse the disclosure and thus identify individuals. For more information on different disclosure methods, please see Skinner (2009) and AAPOR Standards.
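A minimal sketch of two of these techniques, top-coding and coarsening, on hypothetical income values (the breakpoints and category labels are made up for illustration):

```r
# Hypothetical household incomes, including one extreme value
hhincome <- c(25000, 80000, 150000000)

# Top-coding: extreme values are collapsed to a single maximum value
hhincome_tc <- pmin(hhincome, 99999999)

# Coarsening: report income in categories rather than as a continuous value
hhincome_cat <- cut(hhincome_tc,
                    breaks = c(0, 50000, 100000, Inf),
                    labels = c("Under $50,000", "$50,000-$99,999",
                               "$100,000 or more"))
```

Note that actual disclosure protections combine several techniques and, as the text emphasizes, the specific parameters used are deliberately not published.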
Documentation is a critical step of the survey life cycle. Researchers systematically record all the details, decisions, procedures, and methodologies to ensure transparency, reproducibility, and the overall quality of survey research.
Proper documentation allows analysts to understand, reproduce, and evaluate the study’s methods and findings. Chapter 4 dives into how analysts should use survey data documentation.
Other modes such as using mobile apps or text messaging can also be considered, but at the time of publication, have smaller reach or are better for longitudinal studies (i.e., surveying the same individuals over many time periods of a single study)↩︎
https://www-archive.aapor.org/Standards-Ethics/AAPOR-Code-of-Ethics/Survey-Disclosure-Checklist.aspx↩︎
For this chapter, here are the libraries and helper functions we will need:
data(api)
data(scd)
Additionally, we have created multiple analytic datasets for use in this book on a directory on OSF4. To load any data used in the book that is not included in existing packages, we have created a helper function read_osf(). This chapter uses data from the Residential Energy Consumption Survey (RECS), so we will use the following code to load the RECS data to use later in this chapter:
recs_in <- read_osf("recs_2015.rds")
Replicate weights generally come in groups and are sequentially numbered, such as PWGTP1, PWGTP2, …, PWGTP80 for the person weights in the American Community Survey (ACS) (U.S. Census Bureau 2021) or BRRWT1, BRRWT2, …, BRRWT96 in the 2015 Residential Energy Consumption Survey (RECS) (U.S. Energy Information Administration 2017). This makes it easy to use some of the tidy selection5 functions in R. For example, if a dataset had WT0 for the main weight and 20 BRR weights named WT1, WT2, …, WT20, we can use the following syntax (both are equivalent):
brr_des <- dat %>%
  as_survey_rep(
    weights = WT0,
    repweights = num_range("WT", 1:20),
    type = "BRR"
  )
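The two equivalent tidy selection styles — `num_range()` and `all_of()` with a constructed character vector — can be checked directly with `dplyr::select()`. This sketch uses a small hypothetical data frame with columns WT0 through WT20:

```r
library(dplyr)

# Hypothetical data frame with main weight WT0 and replicates WT1-WT20
dat <- as.data.frame(matrix(1, nrow = 2, ncol = 21))
names(dat) <- paste0("WT", 0:20)

# Both selections pick the same 20 replicate-weight columns
sel1 <- dat %>% select(num_range("WT", 1:20))
sel2 <- dat %>% select(all_of(paste0("WT", 1:20)))

identical(names(sel1), names(sel2))  # TRUE
```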
Example
The American Community Survey releases public use microdata with JK1 weights at the person and household level. This example includes data at the household level where the replicate weights are specified as WGTP1, …, WGTP80, and the main weight is WGTP (U.S. Census Bureau 2023). Using the {tidycensus} package6, data is downloaded from the Census API. For example, the code below has a request to obtain data for each person in each household in two Public Use Microdata Areas (PUMAs) in Durham County, NC7. The variables requested are NP (number of persons in the household), BDSP (number of bedrooms), HINCP (household income), and TYPEHUGQ (type of household). By default, several other variables will come along, including SERIALNO (a unique identifier for each household), SPORDER (a unique identifier for each person within each household), PUMA, ST (state), person weight (PWGTP), and the household weights (WGTP, WGTP1, …, WGTP80). Filtering to records where SPORDER=1 yields only one record per household and TYPEHUGQ=1 filters to only households and not group quarters.
pums_in <- get_pums(
  variables = c("NP", "BDSP", "HINCP"),
  state = "37",
3.4 Understanding survey design documentation
A common method of sampling is to stratify PSUs, select PSUs within the stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis to understand the design of the survey you are using. Good documentation will highlight the variables necessary to specify the design; this information is often found in user’s guides, methodology reports, analysis guides, or technical documentation (see Chapter 4 for more details).
Example
+For example, the 2017-2019 National Survey of Family Growth (NSFG)8 had a stratified multi-stage area probability sample. In the first stage, PSUs are counties or collections of counties and are stratified by Census region/division, size (population), and MSA status. Within each stratum, PSUs were selected via PPS. In the second stage, neighborhoods were selected within the sampled PSUs using PPS selection. In the third stage, housing units were selected within the sampled neighborhoods. In the fourth stage, a person was randomly chosen within the selected housing units among eligible persons using unequal probabilities based on the person’s age and sex. The public use file does not include all these levels of selection and instead has pseudo-strata and pseudo-clusters, which are the variables used in R to specify the design. As specified on page 4 of the documentation, the stratum variable is SEST
, the cluster variable is SECU
, and the weight variable is WGT2017_2019
. Thus, to specify this design in R, use the following syntax:
nsfg_des <- nsfgdata %>%
  as_survey_design(ids = SECU,
                   strata = SEST,
                   weights = WGT2017_2019)
@@ -1272,12 +1272,12 @@ 3.5 Exercises
- The American National Election Studies (ANES) collect data before and after elections approximately every four years around the presidential election cycle. Each year with the data release, a user’s guide is also released9. What is the syntax for specifying the analysis of the full sample post-election data?
anes_des <- anes_data %>%
  as_survey_design(weight)
- The General Social Survey is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook10 provides examples of setting up syntax in SAS and Stata but not R. How would you specify the design in R?
gss_des <- gss_data %>%
  as_survey_design(ids = VPSU_2,
@@ -1324,14 +1324,14 @@ References
https://osf.io/gzbkn/?view_only=8ca80573293b4e12b7f934a0f742b957
dplyr documentation on tidy-select: https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html
tidycensus package: https://walker-data.com/tidycensus/
Public Use Microdata Areas in North Carolina: https://www.census.gov/geographies/reference-maps/2010/geo/2010-pumas/north-carolina.html
2017-2019 National Survey of Family Growth (NSFG): Sample Design Documentation - https://www.cdc.gov/nchs/data/nsfg/NSFG-2017-2019-Sample-Design-Documentation-508.pdf
ANES 2020 User’s Guide: https://electionstudies.org/wp-content/uploads/2022/02/anes_timeseries_2020_userguidecodebook_20220210.pdf
2016-2020 GSS Panel Codebook Release 1a: https://gss.norc.org/Documents/codebook/2016-2020%20GSS%20Panel%20Codebook%20-%20R1a.pdf
diff --git a/c04-understanding-survey-data-documentation.html b/c04-understanding-survey-data-documentation.html
index 326bdaf9..0c9487bb 100644
--- a/c04-understanding-survey-data-documentation.html
+++ b/c04-understanding-survey-data-documentation.html
@@ -185,7 +185,7 @@
- 3 Specifying sample designs and replicate weights in {srvyr}
-- Prerequisites
+- Prerequisites
- 3.1 Introduction
- 3.2 Common sampling designs
@@ -206,7 +206,7 @@
- 4 Understanding survey data documentation
-- Prerequisites
+- Prerequisites
- 4.1 Introduction
- 4.2 Types of survey documentation
@@ -226,7 +226,7 @@
- 5 Descriptive analyses in srvyr
- 6 Statistical testing
-- Prerequisites
+- Prerequisites
- 6.1 Introduction
- 6.2 Comparison of Proportions and Means
@@ -279,7 +279,7 @@
- 7 Modeling
-- Prerequisites
+- Prerequisites
- 7.1 Introduction
- 7.2 Analysis of Variance (ANOVA)
@@ -300,7 +300,7 @@
- 8 Communicating Results
-- Prerequisites
+- Prerequisites
- 8.1 Introduction
- 8.2 Describing Results through Text
@@ -324,7 +324,7 @@
- 9 National Crime Victimization Survey Vignette
-- Prerequisites
+- Prerequisites
- 9.1 Introduction
- 9.2 Data structure
- 9.3 Survey notation
@@ -345,7 +345,7 @@
- 10 AmericasBarometer Vignette
-- Prerequisites
+- Prerequisites
- 10.1 Introduction
- 10.2 Data Structure
- 10.3 Preparing files
@@ -378,8 +378,8 @@
Chapter 4 Understanding survey data documentation
Prerequisites
For this chapter, here are the libraries and helper functions we will need:
@@ -414,15 +414,15 @@ 4.2.1 Technical documentation4.2.2 Questionnaires
A questionnaire is a series of questions asked to obtain information from survey respondents. A questionnaire gathers opinions, behaviors, or demographic data by employing different types of questions, such as closed-ended (e.g., radio button select one or check all that apply), open-ended (e.g., numeric or text), Likert scales, or ranking questions. It may randomize the display order of responses or include instructions to help respondents understand the questions. A survey may have one questionnaire or multiple, depending on its scale and scope.
The questionnaire is an essential resource for understanding and interpreting the survey data (see Section 2.2.3), and we should use it alongside any analysis. It provides details about each of the questions asked in the survey, such as question name, question wording, response options, skip logic, randomizations, display specification, mode differences, and the universe (if only a subset of respondents were asked the question).
Below in Figure 4.1, we show a question from the ANES 2020 questionnaire (American National Election Studies 2021). This figure shows a particular question’s question name (postvote_rvote), description (Did R Vote?), full wording of the question and responses, response order, universe, question logic (if vote_pre = 0), and other specifications. The section also includes the variable name, which we can link to the codebook.
The content and structure of questionnaires vary depending on the specific survey. For instance, question names may be informative (like the ANES example), sequential, or denoted by a code. In some cases, surveys may not use separate names for questions and variables. Figure 4.2 shows a question from the Behavioral Risk Factor Surveillance System (BRFSS) questionnaire that shows a sequential question number and a coded variable name (as opposed to a question name) (Centers for Disease Control and Prevention (CDC) 2021).
4.2.3 Codebooks
While a questionnaire provides information about the questions asked to respondents, the codebook explains how the survey data was coded and recorded. The codebook lists details such as variable names, variable labels, variable meanings, codes for missing data, value labels, and value types (whether categorical or continuous, etc.). In particular, the codebook often includes information on missing data (as opposed to the questionnaire). The codebook enables us to understand and use the variables appropriately in our analysis.
Figure 4.3 is a question from the ANES 2020 codebook (American National Election Studies 2022). This part indicates a particular variable’s name (V202066), question wording, value labels, universe, and associated survey question (postvote_rvote).
Reviewing both questionnaires and codebooks in parallel is important (Figures 4.1 and 4.3), as questions and variables do not always correspond directly to each other in a one-to-one mapping. A single question may have multiple associated variables, or a single variable may summarize multiple questions. Reviewing the codebook clarifies how to interpret the variables.
4.2.4 Errata
@@ -465,7 +465,7 @@ 4.3 Working with missing data
Missing not at random (MNAR): The missing data is related to unobserved data, and the probability of being missing varies for reasons we are not measuring. For example, respondents with depression may not answer a question about depression severity.
The survey documentation, often the codebook, represents the missing data with a code. For example, a survey may have “Yes” responses coded to 1, “No” responses coded to 2, and missing responses coded to -9. Or, the codebook may list different codes depending on why certain data is missing. In the example of variable V202066 from the ANES (Figure 4.3), -9 represents “Refused,” -7 means that the response was deleted due to an incomplete interview, -6 means that there is no response because there was no follow-up interview, and -1 means “Inapplicable” (due to the designed skip pattern).
When running analysis in R, we must handle missing responses as missing data (i.e., NA) and not numeric data. If missing responses are treated as zeros or arbitrary values, they can artificially alter summary statistics or introduce spurious patterns in the analysis. Recoding these values to NA allows us to handle missing data in different ways in R, such as using functions like na.omit(), complete.cases(), or specialized packages like {tidyimpute} or {mice}. These tools let us treat missing responses as missing data so that we can conduct our analysis accurately and obtain valid results.
Visualizing the missing data can also help to inform the types of missing data that are present. The {naniar} package provides many valuable missing data visualizations, such as using gg_miss_var() to see the count or percent of missing data points by variable or gg_miss_fct() to see relationships in missing data across levels of a factor variable. Investigating the relationships and nature of the missing data before running models can ensure that the missing data is accurately accounted for.
diff --git a/c05-descriptive-analysis.html b/c05-descriptive-analysis.html
index b9b0d520..4546c620 100644
--- a/c05-descriptive-analysis.html
+++ b/c05-descriptive-analysis.html
@@ -378,25 +378,8 @@
Chapter 5 Descriptive analyses in srvyr
Prerequisites
For this chapter, here are the libraries and helper functions we will need:
@@ -699,7 +682,7 @@ Examples
The difference between survey_total() and survey_count() is more evident when specifying continuous variables to sum. Let’s compute the total cost of electricity in whole dollars from the variable DOLLAREL11. We also calculate an unweighted estimate using unweighted(). The unweighted() function calculates unweighted summaries from a tbl_svy object, which reflect the summary among the respondents and do not extrapolate to a population estimate.
recs_des %>%
  summarize(
    elec_bill = survey_total(DOLLAREL),
@@ -776,7 +759,7 @@ Examples
Getting proportions by more than one variable is possible. In the next example, we look at the proportion of housing units by Region and whether air-conditioning is used (ACUsed).12
recs_des %>%
  group_by(Region, ACUsed) %>%
  summarize(p = survey_mean())
@@ -1103,7 +1086,7 @@ Syntax
Examples
We can calculate the correlation between total square footage (TOTSQFT_EN)13 and electricity consumption (BTUEL)14.
recs_des %>%
  summarize(
    SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, BTUEL)
@@ -1598,7 +1581,7 @@ across() Example 2
map example
If you want to calculate something again and again, loops are a common tool. The {purrr} package has the map() functions, which, like a loop, allow you to do the same operation many times. In our case, we want to calculate proportions from the same design multiple times. We find an easy way to do this is to think about how you would do it for one outcome, build a function from there, and then iterate.
Suppose we want to create a table that shows the proportion of people that trust in their government (TrustGovernment)15 as well as those that trust in people (TrustPeople)16.
In the example below, we create a table that has the variable name as a column, the answer as a column, and then the percentage and its standard error.
anes_des %>%
  drop_na(TrustGovernment) %>%
@@ -1689,13 +1672,13 @@ References
RECS has two components: a household survey and an energy supplier survey. For each household that responds, their energy provider(s) are contacted to obtain their energy consumption and expenditure. This value reflects the dollars spent on electricity in 2015 according to the energy supplier. See https://www.eia.gov/consumption/residential/reports/2015/methodology/pdf/2015C&EMethodology.pdf for more details.
Question text: Is any air conditioning equipment used in your home?
Question text: What is the square footage of your home?
BTUEL is derived from the supplier-side component of the survey, where BTUEL represents the electricity consumption in British thermal units (Btus) converted from kilowatt hours (kWh) in a year.
Question: How often can you trust the federal government in Washington to do what is right? (Always, most of the time, about half the time, some of the time, or never / Never, some of the time, about half the time, most of the time, or always)?
Question: Generally speaking, how often can you trust other people? (Always, most of the time, about half the time, some of the time, or never / Never, some of the time, about half the time, most of the time, or always)?
diff --git a/c06-statistical-testing.html b/c06-statistical-testing.html
index f0d6af81..976303ca 100644
--- a/c06-statistical-testing.html
+++ b/c06-statistical-testing.html
@@ -378,8 +378,8 @@
Chapter 6 Statistical testing
Prerequisites
For this chapter, here are the libraries and helper functions we will need:
@@ -675,7 +675,7 @@ 6.3.1 Syntax na.rm = TRUE,
... )
There are six statistics that are accepted in this formula. For tests of homogeneity (when comparing cross-tabulations), the F or Chisq statistics should be used.17 The F statistic is the default and uses the Rao-Scott second-order correction. This correction is designed to assist with complicated sampling designs (i.e., those other than a simple random sample)18. The Chisq statistic is an adjusted version of the Pearson \(\chi^2\) statistic. The version of this statistic in the svychisq() function compares the design effect estimate from the provided survey data to what the \(\chi^2\) distribution would have been if the data came from a simple random sample.
For tests of independence, the Wald and adjWald statistics are recommended as they provide a better adjustment for variable comparisons (Lumley 2010). If the data has a small number of primary sampling units (PSUs) compared to the degrees of freedom, then the adjWald statistic should be used to account for this. The lincom and saddlepoint statistics are available for more complicated data structures.
The formula argument will always be one-sided, unlike the svyttest() function. The two variables of interest should be included with a plus sign: formula = ~ var_1 + var_2. As with the svygofchisq() function, the variables entered into the formula should be formatted as either a factor or a character.
Additionally, as with the t-test function, both svygofchisq() and svychisq() have the na.rm argument. This argument defaults to FALSE; however, unlike the t-test function, if any data is missing, the \(\chi^2\) tests will assume that NA is a category and will include it in the calculation. Throughout this chapter, we will always set na.rm = TRUE, but before analyzing the survey data, review the notes provided in Chapter 4 to better understand how to handle missing data.
@@ -685,7 +685,7 @@ 6.3.2 Examples
Example 1: Goodness of Fit Test
ANES asked respondents about their highest education level. Based on the data from the 2020 American Community Survey (ACS) 5-year estimates19, the education distribution of those 18+ in the U.S. is as follows:
- 11% had less than High School degree
- 27% had a High School degree
- 29% had some college or associate’s degree
@@ -2523,10 +2523,10 @@
References
These two statistics can also be used for goodness of fit tests, if the svygofchisq() function is not used.
http://www.asasrms.org/Proceedings/y2007/Files/JSM2007-000874.pdf
Data was pulled from data.census.gov using the S1501 Education Attainment 2020: ACS 5-Year Estimates Subject Tables
diff --git a/c07-modeling.html b/c07-modeling.html
index 748d8d54..4ec5ed1a 100644
--- a/c07-modeling.html
+++ b/c07-modeling.html
@@ -378,8 +378,8 @@
Chapter 7 Modeling
Prerequisites
For this chapter, here are the libraries and helper functions we will need:
@@ -417,7 +417,7 @@ 7.1 Introduction
\[Y_i=\beta_0+\beta_1 X_i+\epsilon_i\]
would be specified in R as y~x, where the intercept is not explicitly included. To fit a model with no intercept, that is,
\[Y_i=\beta_1 X_i+\epsilon_i\]
it can be specified as y~x-1. Formula notation details in R can be found in the help file for formula20. A quick overview of the common formula notation is in the following table:
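The intercept behavior above can be sketched with base R's lm() on simulated data (the variable names and coefficients here are arbitrary illustrations, not from the book's data):

```r
# Simulated data to illustrate formula notation with and without an intercept
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

coef(lm(y ~ x))      # estimates an intercept and a slope
coef(lm(y ~ x - 1))  # estimates the slope only (no intercept term)
```

The same `~` formula conventions carry over to the survey-weighted model functions used later in the chapter.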