diff --git a/README.docx b/README.docx
index 3094af7..0124197 100644
Binary files a/README.docx and b/README.docx differ
diff --git a/README.html b/README.html
index 15f1aa4..ce6d0d7 100644
--- a/README.html
+++ b/README.html
@@ -26,7 +26,7 @@
-INSTRUCTIONS: The typical README in social science journals serves the purpose of guiding a reader through the available material and a route to replicating the results in the research paper. Start by providing a brief overview of the available material and a brief guide as to how to proceed from beginning to end.
-Example: The code in this replication package constructs the analysis file from the three data sources (Ruggles et al, 2018; Inglehart et al, 2019; BEA, 2016) using Stata and Julia. Two master files run all of the code to generate the data for the 15 figures and 3 tables in the paper. The replicator should expect the code to run for about 14 hours.
+Example: The code in this replication package constructs the analysis file from the three data sources (Ruggles et al, 2018; Inglehart et al, 2019; BEA, 2016) using Stata and Julia. Two main files run all of the code to generate the data for the 15 figures and 3 tables in the paper. The replicator should expect the code to run for about 14 hours.
INSTRUCTIONS: Every README should contain a description of the origin (provenance), location and accessibility (data availability) of the data used in the article. These descriptions are generally referred to as “Data Availability Statements” (DAS). However, in some cases, there is no external data used.
@@ -40,21 +40,25 @@ Data Availability and Provenance Statements
INSTRUCTIONS:
- When the authors are secondary data users (they did not generate the data), the provenance and DAS coincide, and should describe the conditions under which (a) the current authors and (b) any future users might access the data.
- When the data were generated (by the authors) in the course of conducting (lab or field) experiments, or were collected as part of surveys, then the description of the provenance should describe the data generating process, i.e., survey or experimental procedures:
  - Experiments: complete sets of experimental instructions, questionnaires, stimuli for all conditions, potentially screenshots, scripts for experimenters or research assistants, as well as subject eligibility criteria (e.g., selection criteria, exclusions), recruitment waves, and demographics of the subject pool used.
  - For lab experiments specifically, a description of any pilot sessions/studies, and computer programs, configuration files, or scripts used to run the experiment.
  - For surveys, the whole questionnaire (code or images/PDF) including survey logic if not linear, interviewer instructions, enumeration lists, and sample selection criteria.
-The information should describe ALL data used, regardless of whether they are provided as part of the replication archive or not, and regardless of size or scope. For instance, if using GDP deflators, the source of the deflators (e.g. at the national statistical office) should also be listed here. If any of this information has been provided in a pre-registration, then a link to that registration may (partially) suffice.
+The information should describe ALL data used, regardless of whether they are provided as part of the replication archive or not, and regardless of size or scope. The DAS should provide enough information that a replicator can obtain the data from the original source, even if the file is provided.
+For instance, if using GDP deflators, the source of the deflators (e.g. at the national statistical office) should also be listed here. If any of this information has been provided in a pre-registration, then a link to that registration may (partially) suffice.
DAS can be complex and varied. Examples are provided here, and below.
Importantly, if providing the data as part of the replication package, authors should be clear about whether they have the rights to distribute the data. Data may be subject to distribution restrictions due to sensitivity, IRB, proprietary clauses in the data use agreement, etc.
-NOTE: DAS do not replace Data Citations (see Guidance). Rather, they augment them. Depending on journal requirements and to some extent stylistic considerations, data citations should appear in the main article, in an appendix, or in the README. However, data citations only provide information where to find the data, not how to access that data. Thus, DAS augment data citations by going into additional detail that allow a researcher to assess cost, complexity, and availability over time of the data used by the original author.
+NOTE: DAS do not replace Data Citations (see Guidance). Rather, they augment them. Depending on journal requirements and to some extent stylistic considerations, data citations should appear in the main article, in an appendix, or in the README. However, data citations only provide information on where to find the data, not how to access those data. Thus, DAS augment data citations by going into additional detail that allows a researcher to assess cost, complexity, and availability over time of the data used by the original author.
-INSTRUCTIONS: Most data repositories provide for a default license, but do not impose a specific license. Authors should actively select a license. This should be provided in a LICENSE.txt file, separately from the README, possibly combined with the license for any code. Some data may be subject to inherited license requirements, i.e., the data provider may allow for redistribution only if the data is licensed under specific rules - authors should check with their data providers. For instance, a data use license might require that users - the current author, but also any subsequent users - cite the data provider. Licensing can be complex. Some non-legal guidance may be found here.
+INSTRUCTIONS: Most data repositories provide for a default license, but do not impose a specific license. Authors should actively select a license. This should be provided in a LICENSE.txt file, separately from the README, possibly combined with the license for any code. Some data may be subject to inherited license requirements, i.e., the data provider may allow for redistribution only if the data is licensed under specific rules - authors should check with their data providers. For instance, a data use license might require that users - the current author, but also any subsequent users - cite the data provider. Licensing can be complex. Some non-legal guidance may be found here. For multiple licenses within a data package, the LICENSE.txt file might contain the concatenation of all the licenses that apply (for instance, a custom license for one file, plus a CC-BY license for another file).
+NOTE: In many cases, it is not up to the creator of the replication package to simply define a license; a license may be sticky and be defined by the original data creator.
-The code is licensed under a Creative Commons/CC-BY-NC/CC0 license. See LICENSE.txt for details.
+Example: The data are licensed under a Creative Commons/CC-BY-NC license. See LICENSE.txt for details.
- Describe the format (open formats preferred, but some software-specific formats are OK if open-source readers are available): .dta, .xlsx, .csv, netCDF, etc.
- Provide a data dictionary, either as part of the archive (list the file name) or at a URL (list the URL). Some formats are self-describing if they have the requisite information (e.g., .dta should have both variable and value labels).
+- List availability within the package.
+- Use proper bibliographic references in addition to a verbose description (and provide a bibliography at the end of the README, expanding those references).
+A summary in tabular form can be useful:
+| Data.Name | Data.Files | Location | Provided | Citation |
+| -- | -- | -- | -- | -- |
+| “Current Population Survey 2018” | cepr_march_2018.dta | data/ | TRUE | CEPR (2018) |
+| “Provincial Administration Reports” | coast_simplepoint2.csv; rivers_simplepoint2.csv; RAIL_dummies.dta; railways_Dissolve_Simplify_point2.csv | Data/maps/ | TRUE | Administration (2017) |
+| “2017 SAT scores” | Not available | data/to_clean/ | FALSE | College Board (2020) |
+where the Data.Name column is then expanded in the subsequent paragraphs, and CEPR (2018) is resolved in the References section of the README.
The [DATA TYPE] data used to support the findings of this study have been deposited in the [NAME] repository ([DOI or OTHER PERSISTENT IDENTIFIER]). [1]. The data were collected by the authors, and are available under a Creative Commons Non-commercial license.
@@ -116,6 +165,15 @@ Dataset list
+INSTRUCTIONS: In some cases, authors will provide one dataset (file) per data source, and the code to combine them. In others, in particular when data access might be restrictive, the replication package may only include derived/analysis data. Every file should be described. This can be provided as an Excel/CSV table, or in the table below.
+INSTRUCTIONS: While it is often most convenient to provide data in the native format of the software used to analyze and process the data, not all formats are “open” or readable by other (free) software. Data should at a minimum be provided in formats that can be read by open-source software (R, Python, others), and ideally in non-proprietary, archival-friendly formats.
+INSTRUCTIONS: All data files should be fully documented: variables/columns should have labels (long-form meaningful names), and values should be explained. This might mean generating a codebook, pointing to a public codebook, or providing data in (non-proprietary) formats that allow for a rich description. This is particularly important for data that are not distributable.
+INSTRUCTIONS: Some journals require, and it is considered good practice, to provide synthetic or simulated data that have some of the key characteristics of the restricted-access data which are not provided. The level of fidelity may vary - it may be useful for debugging only, or it may allow an assessment of the key characteristics of the statistical/econometric procedure or the main conclusions of the paper.
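As an illustration of how a replicator might verify such documentation with open-source tools, the sketch below reads a Stata file and prints its variable and value labels. It is not part of the original template; it assumes the Python pandas package, and the file name (taken from the summary table above) is only an example.

```python
# Minimal sketch: inspect the labels embedded in a .dta file with open-source tools.
# The file name is an example; any Stata file in the package works the same way.
import pandas as pd

reader = pd.read_stata("data/cepr_march_2018.dta", iterator=True)
df = reader.read()                 # the data themselves, as a DataFrame
print(reader.variable_labels())    # long-form variable labels
print(reader.value_labels())       # value-label (code -> label) dictionaries
```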
@@ -157,7 +215,7 @@ Computational requirements
INSTRUCTIONS: In general, the specific computer code used to generate the results in the article will be within the repository that also contains this README. However, other computational requirements - shared libraries or code packages, required software, specific computing hardware - may be important, and are always useful, for the goal of replication. Some example text follows.
-INSTRUCTIONS: We strongly suggest providing setup scripts that install/set up the environment. Sample scripts for Stata, R, Python, Julia are easy to set up and implement.
+INSTRUCTIONS: We strongly suggest providing setup scripts that install/set up the environment. Sample scripts for Stata, R, Julia are easy to set up and implement. Specific software may have more sophisticated tools: Python, Julia.
Software Requirements
@@ -174,7 +232,7 @@ Software Requirements
- Python 3.6.4
  - pandas 0.24.2
  - numpy 1.16.4
- - the file “requirements.txt” lists these dependencies, please run “pip install -r requirements.txt” as the first step. See https://pip.readthedocs.io/en/1.1/requirements.html for further instructions on using the “requirements.txt” file.
+ - the file “requirements.txt” lists these dependencies, please run “pip install -r requirements.txt” as the first step. See https://pip.pypa.io/en/stable/user_guide/#ensuring-repeatability for further instructions on creating and using the “requirements.txt” file. (A minimal example sketch follows this list.)
- Intel Fortran Compiler version 20200104
- Matlab (code was run with Matlab Release 2018a)
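Purely as an illustration (the actual requirements.txt ships with the replication package), a pinned requirements file for the two Python packages listed above could look like the sketch below; the replicator would run pip install -r requirements.txt once before any Python step.

```text
# requirements.txt: pin exact versions so every replicator installs the same stack
pandas==0.24.2
numpy==1.16.4
```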
@@ -187,6 +245,14 @@ Software Requirements
Portions of the code use bash scripting, which may require Linux.
Portions of the code use Powershell scripting, which may require Windows 10 or higher.
+Controlled Randomness
+INSTRUCTIONS: Some estimation code uses random numbers, almost always provided by pseudorandom number generators (PRNGs). For reproducibility purposes, these should be provided with a deterministic seed, so that the sequence of numbers provided is the same for the original author and any replicators. While this is not always possible, it is a requirement of many journals’ policies. The seed should be set once, and should not use a time-stamp. If using parallel processing, special care needs to be taken. If using multiple programs in sequence, care must be taken in how these programs are called, ideally from a main program, so that the sequence is not altered.
+- Random seed is set at line _____ of program ______
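For instance, in a Python component this could be a single explicit seed set once near the top of the main program; the sketch below is illustrative only (the seed value and names are not from the template), and assumes numpy.

```python
import numpy as np

SEED = 12345                        # illustrative fixed value; never derive the seed from the clock
rng = np.random.default_rng(SEED)   # create one generator, once, in the main program

# Pass `rng` to every routine that draws random numbers, so the sequence of
# draws is identical for the original authors and for any replicator.
draws = rng.normal(size=1_000)
```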
+Memory and Runtime Requirements
INSTRUCTIONS: Memory and compute-time requirements may also be relevant or even critical. Some example text follows. It may be useful to break this out by Table/Figure/section of processing. For instance, some estimation routines might run for weeks, but data prep and creating figures might only take a few minutes.
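Reported runtimes are easier to interpret when the machine they were measured on is also recorded. As an optional illustration (not part of the original template), the following Python-standard-library sketch prints basic system information that could be pasted into this section:

```python
# Optional sketch: print basic machine information to report alongside runtimes.
import os
import platform

print("OS:          ", platform.platform())
print("Machine:     ", platform.machine())
print("Processor:   ", platform.processor() or "n/a")
print("Logical CPUs:", os.cpu_count())
```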
@@ -199,7 +265,9 @@ Summary
- 10-60 minutes
-- 1-8 hours
+- 1-2 hours
+- 2-8 hours
- 8-24 hours
@@ -228,10 +296,10 @@
Description of programs/code
INSTRUCTIONS: Give a high-level overview of the program files and their purpose. Remove redundant/obsolete files from the Replication archive.
@@ -239,16 +307,16 @@
-- Programs in programs/01_dataprep will extract and reformat all datasets referenced above. The file programs/01_dataprep/master.do will run them all.
-- Programs in programs/02_analysis generate all tables and figures in the main body of the article. The program programs/02_analysis/master.do will run them all. Each program called from master.do identifies the table or figure it creates (e.g., 05_table5.do). Output files are called appropriate names (table5.tex, figure12.png) and should be easy to correlate with the manuscript.
-- Programs in programs/03_appendix will generate all tables and figures in the online appendix. The program programs/03_appendix/master-appendix.do will run them all.
-- Ado files have been stored in programs/ado and the master.do files set the ADO directories appropriately.
+- Programs in programs/01_dataprep will extract and reformat all datasets referenced above. The file programs/01_dataprep/main.do will run them all.
+- Programs in programs/02_analysis generate all tables and figures in the main body of the article. The program programs/02_analysis/main.do will run them all. Each program called from main.do identifies the table or figure it creates (e.g., 05_table5.do). Output files are called appropriate names (table5.tex, figure12.png) and should be easy to correlate with the manuscript.
+- Programs in programs/03_appendix will generate all tables and figures in the online appendix. The program programs/03_appendix/main-appendix.do will run them all.
+- Ado files have been stored in programs/ado and the main.do files set the ADO directories appropriately.
- The program programs/00_setup.do will populate the programs/ado directory with updated ado packages, but for purposes of exact reproduction, this is not needed. The file programs/00_setup.log identifies the versions as they were last updated.
- The program programs/config.do contains parameters used by all programs, including a random seed. Note that the random seed is set once for each of the two sequences (in 02_analysis and 03_appendix). If running in any order other than the one outlined below, your results may differ.
(Optional, but recommended) License for Code
-INSTRUCTIONS: Most journal repositories provide for a default license, but do not impose a specific license. Authors should actively select a license. This should be provided in a LICENSE.txt file, separately from the README, possibly combined with the license for any data provided. Some code may be subject to inherited license requirements, i.e., the original code author may allow for redistribution only if the code is licensed under specific rules - authors should check with their sources. For instance, some code authors require that their article describing the econometrics of the package be cited. Licensing can be complex. Some non-legal guidance may be found here.
-The code is licensed under a MIT/BSD/GPL/Creative Commons license. See LICENSE.txt for details.
+The code is licensed under an MIT/BSD/GPL [choose one!] license. See LICENSE.txt for details.
Instructions to Replicators
-INSTRUCTIONS: The first two sections ensure that the data and software necessary to conduct the replication have been collected. This section then describes a human-readable instruction to conduct the replication. This may be simple, or may involve many complicated steps. It should be a simple list, no excess prose. Strict linear sequence. If more than 4-5 manual steps, please wrap a master program/Makefile around them, in logical sequences. Examples follow.
+INSTRUCTIONS: The first two sections ensure that the data and software necessary to conduct the replication have been collected. This section then provides human-readable instructions for conducting the replication. This may be simple, or may involve many complicated steps. It should be a simple list, no excess prose. Strict linear sequence. If more than 4-5 manual steps, please wrap a main program/Makefile around them, in logical sequences. Examples follow.
- Edit programs/config.do to adjust the default path
- Run programs/00_setup.do once on a new system to set up the working environment.
- Download the data files referenced above. Each should be stored in the prepared subdirectories of data/, in the format that you download them in. Do not unzip. Scripts are provided in each directory to download the public-use files. Confidential data files requested as part of your FSRDC project will appear in the /data folder. No further action is needed on the replicator’s part.
-- Run programs/01_master.do to run all steps in sequence.
+- Run programs/01_main.do to run all steps in sequence.
@@ -260,14 +328,14 @@
Details
- programs/01_dataprep:
  - These programs were last run at various times in 2018.
  - Order does not matter, all programs can be run in parallel, if needed.
- - A programs/01_dataprep/master.do will run them all in sequence, which should take about 2 hours.
+ - A programs/01_dataprep/main.do will run them all in sequence, which should take about 2 hours.
-- programs/02_analysis/master.do.
+- programs/02_analysis/main.do.
  - If running programs individually, note that ORDER IS IMPORTANT.
  - The programs were last run top to bottom on July 4, 2019.
-- programs/03_appendix/master-appendix.do. The programs were last run top to bottom on July 4, 2019.
+- programs/03_appendix/main-appendix.do. The programs were last run top to bottom on July 4, 2019.
- Figure 1: The figure can be reproduced using the data provided in the folder “2_data/data_map”, and ArcGIS Desktop (Version 10.7.1) by following these (manual) instructions:
- Create a new map document in ArcGIS ArcMap, browse to the folder “2_data/data_map” in the “Catalog”, with files “provinceborders.shp”, “lakes.shp”, and “cities.shp”.
diff --git a/README.md b/README.md index 703e3f7..0a580d8 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,23 @@ +--- +contributors: + - Lars Vilhuber + - Miklos Kóren + - Joan Llull + - Marie Connolly + - Peter Morrow +--- + # Template README and Guidance > INSTRUCTIONS: This README suggests structure and content that have been approved by various journals, see [Endorsers](Endorsers.md). It is available as [Markdown/txt](https://github.com/social-science-data-editors/template_README/blob/master/template-README.md), [Word](templates/README.docx), [LaTeX](templates/README.tex), and [PDF](templates/README.pdf). In practice, there are many variations and complications, and authors should feel free to adapt to their needs. All instructions can (should) be removed from the final README (in Markdown, remove lines starting with `> INSTRUCTIONS`). Please ensure that a PDF is submitted in addition to the chosen native format. -Overview --------- +## Overview > INSTRUCTIONS: The typical README in social science journals serves the purpose of guiding a reader through the available material and a route to replicating the results in the research paper. Start by providing a brief overview of the available material and a brief guide as to how to proceed from beginning to end. -Example: The code in this replication package constructs the analysis file from the three data sources (Ruggles et al, 2018; Inglehart et al, 2019; BEA, 2016) using Stata and Julia. Two master files run all of the code to generate the data for the 15 figures and 3 tables in the paper. The replicator should expect the code to run for about 14 hours. +Example: The code in this replication package constructs the analysis file from the three data sources (Ruggles et al, 2018; Inglehart et al, 2019; BEA, 2016) using Stata and Julia. Two main files run all of the code to generate the data for the 15 figures and 3 tables in the paper. The replicator should expect the code to run for about 14 hours. -Data Availability and Provenance Statements ----------------------------- +## Data Availability and Provenance Statements > INSTRUCTIONS: Every README should contain a description of the origin (provenance), location and accessibility (data availability) of the data used in the article. These descriptions are generally referred to as "Data Availability Statements" (DAS). However, in some cases, there is no external data used. @@ -25,24 +32,29 @@ Data Availability and Provenance Statements > - For lab experiments specifically, a description of any pilot sessions/studies, and computer programs, configuration files, or scripts used to run the experiment. > - For surveys, the whole questionnaire (code or images/PDF) including survey logic if not linear, interviewer instructions, enumeration lists, sample selection criteria. > -> The information should describe ALL data used, regardless of whether they are provided as part of the replication archive or not, and regardless of size or scope. For instance, if using GDP deflators, the source of the deflators (e.g. at the national statistical office) should also be listed here. If any of this information has been provided in a pre-registration, then a link to that registration may (partially) suffice. +> The information should describe ALL data used, regardless of whether they are provided as part of the replication archive or not, and regardless of size or scope. The DAS should provide enough information that a replicator can obtain the data from the original source, even if the file is provided. 
+> +> For instance, if using GDP deflators, the source of the deflators (e.g. at the national statistical office) should also be listed here. If any of this information has been provided in a pre-registration, then a link to that registration may (partially) suffice. > > DAS can be complex and varied. Examples are provided [here](https://social-science-data-editors.github.io/guidance/Requested_information_dcas.html), and below. > > Importantly, if providing the data as part of the replication package, authors should be clear about whether they have the **rights** to distribute the data. Data may be subject to distribution restrictions due to sensitivity, IRB, proprietary clauses in the data use agreement, etc. > -> NOTE: DAS do not replace Data Citations (see [Guidance](Data_citation_guidance.md)). Rather, they augment them. Depending on journal requirements and to some extent stylistic considerations, data citations should appear in the main article, in an appendix, or in the README. However, data citations only provide information **where** to find the data, not **how to access** that data. Thus, DAS augment data citations by going into additional detail that allow a researcher to assess cost, complexity, and availability over time of the data used by the original author. +> NOTE: DAS do not replace Data Citations (see [Guidance](Data_citation_guidance.md)). Rather, they augment them. Depending on journal requirements and to some extent stylistic considerations, data citations should appear in the main article, in an appendix, or in the README. However, data citations only provide information **where** to find the data, not **how to access** those data. Thus, DAS augment data citations by going into additional detail that allow a researcher to assess cost, complexity, and availability over time of the data used by the original author. ### Statement about Rights - [ ] I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript. +- [ ] I certify that the author(s) of the manuscript have documented permission to redistribute/publish the data contained within this replication package. Appropriate permission are documented in the [LICENSE.txt](LICENSE.txt) file. ### (Optional, but recommended) License for Data -> INSTRUCTIONS: Most data repositories provide for a default license, but do not impose a specific license. Authors should actively select a license. This should be provided in a LICENSE.txt file, separately from the README, possibly combined with the license for any code. Some data may be subject to inherited license requirements, i.e., the data provider may allow for redistribution only if the data is licensed under specific rules - authors should check with their data providers. For instance, a data use license might require that users - the current author, but also any subsequent users - cite the data provider. Licensing can be complex. Some non-legal guidance may be found [here](https://social-science-data-editors.github.io/guidance/Licensing_guidance.html). +> INSTRUCTIONS: Most data repositories provide for a default license, but do not impose a specific license. Authors should actively select a license. This should be provided in a LICENSE.txt file, separately from the README, possibly combined with the license for any code. 
Some data may be subject to inherited license requirements, i.e., the data provider may allow for redistribution only if the data is licensed under specific rules - authors should check with their data providers. For instance, a data use license might require that users - the current author, but also any subsequent users - cite the data provider. Licensing can be complex. Some non-legal guidance may be found [here](https://social-science-data-editors.github.io/guidance/Licensing_guidance.html). For multiple licenses within a data package, the `LICENSE.txt` file might contain the concatenation of all the licenses that apply (for instance, a custom license for one file, plus a CC-BY license for another file). +> +> NOTE: In many cases, it is not up to the creator of the replication package to simply define a license, a license may be *sticky* and be defined by the original data creator. -The code is licensed under a Creative Commons/CC-BY-NC/CC0 license. See [LICENSE.txt](LICENSE.txt) for details. +*Example:* The data are licensed under a Creative Commons/CC-BY-NC license. See LICENSE.txt for details. ### Summary of Availability @@ -57,7 +69,18 @@ The code is licensed under a Creative Commons/CC-BY-NC/CC0 license. See [LICENSE > > - Describe the format (open formats preferred, but some software-specific formats OK if open-source readers available): `.dta`, `.xlsx`, `.csv`, `netCDF`, etc. > - Provide a data dictionairy, either as part of the archive (list the file name), or at a URL (list the URL). Some formats are self-describing *if* they have the requisite information (e.g., `.dta` should have both variable and value labels). +> - List availability within the package +> - Use proper bibliographic references in addition to a verbose description (and provide a bibliography at the end of the README, expanding those references) +> +> A summary in tabular form can be useful: +| Data.Name | Data.Files | Location | Provided | Citation | +| -- | -- | -- | -- | -- | +| “Current Population Survey 2018” | cepr_march_2018.dta | data/ | TRUE | CEPR (2018) | +| “Provincial Administration Reports” | coast_simplepoint2.csv; rivers_simplepoint2.csv; RAIL_dummies.dta; railways_Dissolve_Simplify_point2.csv | Data/maps/ | TRUE | Administration (2017) | +| “2017 SAT scores” | Not available | data/to_clean/ | FALSE | College Board (2020) | + +where the `Data.Name` column is then expanded in the subsequent paragraphs, and `CEPR (2018)` is resolved in the References section of the README. ### Example for public use data collected by the authors @@ -104,11 +127,16 @@ You must request the following datasets in your proposal: > Code for data cleaning and analysis is provided as part of the replication package. It is available at https://dropbox.com/link/to/code/XYZ123ABC for review. It will be uploaded to the [JOURNAL REPOSITORY] once the paper has been conditionally accepted. -Dataset list ------------- +## Dataset list > INSTRUCTIONS: In some cases, authors will provide one dataset (file) per data source, and the code to combine them. In others, in particular when data access might be restrictive, the replication package may only include derived/analysis data. Every file should be described. This can be provided as a Excel/CSV table, or in the table below. +> INSTRUCTIONS: While it is often most convenient to provide data in the native format of the software used to analyze and process the data, not all formats are "open" and can be read by other (free) software. 
Data should at a minimum be provided in formats that can be read by open-source software (R, Python, others), and ideally be provided in non-proprietary, archival-friendly formats. + +> INSTRUCTIONS: All data files should be fully documented: variables/columns should have labels (long-form meaningful names), and values should be explained. This might mean generating a codebook, pointing at a public codebook, or providing data in (non-proprietary) formats that allow for a rich description. This is in particular important for data that is not distributable. + +> INSTRUCTIONS: Some journals require, and it is considered good practice, to provide synthetic or simulated data that has some of the key characteristics of the restricted-access data which are not provided. The level of fidelity may vary - it may be useful for debugging only, or it should allow to assess the key characteristics of the statistical/econometric procedure or the main conclusions of the paper. + | Data file | Source | Notes |Provided | |-----------|--------|----------|---------| | `data/raw/lbd.dta` | LBD | Confidential | No | @@ -116,12 +144,11 @@ Dataset list | `data/derived/regression_input.dta`| All listed | Combines multiple data sources, serves as input for Table 2, 3 and Figure 5. | Yes | -Computational requirements ---------------------------- +## Computational requirements > INSTRUCTIONS: In general, the specific computer code used to generate the results in the article will be within the repository that also contains this README. However, other computational requirements - shared libraries or code packages, required software, specific computing hardware - may be important, and is always useful, for the goal of replication. Some example text follows. -> INSTRUCTIONS: We strongly suggest providing setup scripts that install/set up the environment. Sample scripts for [Stata](https://github.com/gslab-econ/template/blob/master/config/config_stata.do), [R](https://github.com/labordynamicsinstitute/paper-template/blob/master/programs/global-libraries.R), [Python](https://pip.readthedocs.io/en/1.1/requirements.html), [Julia](https://github.com/labordynamicsinstitute/paper-template/blob/master/programs/packages.jl) are easy to set up and implement. +> INSTRUCTIONS: We strongly suggest providing setup scripts that install/set up the environment. Sample scripts for [Stata](https://github.com/gslab-econ/template/blob/master/config/config_stata.do), [R](https://github.com/labordynamicsinstitute/paper-template/blob/master/programs/global-libraries.R), [Julia](https://github.com/labordynamicsinstitute/paper-template/blob/master/programs/packages.jl) are easy to set up and implement. Specific software may have more sophisticated tools: [Python](https://pip.pypa.io/en/stable/user_guide/#ensuring-repeatability), [Julia](https://julia.quantecon.org/more_julia/tools_editors.html#Package-Environments). ### Software Requirements @@ -134,7 +161,7 @@ Computational requirements - Python 3.6.4 - `pandas` 0.24.2 - `numpy` 1.16.4 - - the file "`requirements.txt`" lists these dependencies, please run "`pip install -r requirements.txt`" as the first step. See [https://pip.readthedocs.io/en/1.1/requirements.html](https://pip.readthedocs.io/en/1.1/requirements.html) for further instructions on using the "`requirements.txt`" file. + - the file "`requirements.txt`" lists these dependencies, please run "`pip install -r requirements.txt`" as the first step. 
See [https://pip.pypa.io/en/stable/user_guide/#ensuring-repeatability](https://pip.pypa.io/en/stable/user_guide/#ensuring-repeatability) for further instructions on creating and using the "`requirements.txt`" file. - Intel Fortran Compiler version 20200104 - Matlab (code was run with Matlab Release 2018a) - R 3.4.3 @@ -146,7 +173,11 @@ Portions of the code use bash scripting, which may require Linux. Portions of the code use Powershell scripting, which may require Windows 10 or higher. +### Controlled Randomness + +> INSTRUCTIONS: Some estimation code uses random numbers, almost always provided by pseudorandom number generators (PRNGs). For reproducibility purposes, these should be provided with a deterministic seed, so that the sequence of numbers provided is the same for the original author and any replicators. While this is not always possible, it is a requirement by many journals' policies. The seed should be set once, and not use a time-stamp. If using parallel processing, special care needs to be taken. If using multiple programs in sequence, care must be taken on how to call these programs, ideally from a main program, so that the sequence is not altered. +- [ ] Random seed is set at line _____ of program ______ ### Memory and Runtime Requirements @@ -158,7 +189,8 @@ Approximate time needed to reproduce the analyses on a standard (CURRENT YEAR) d - [ ] <10 minutes - [ ] 10-60 minutes -- [ ] 1-8 hours +- [ ] 1-2 hours +- [ ] 2-8 hours - [ ] 8-24 hours - [ ] 1-3 days - [ ] 3-14 days @@ -181,15 +213,14 @@ Portions of the code were last run on a **12-node AWS R3 cluster, consuming 20,0 > - (Linux) see code in [tools/linux-system-info.sh](https://github.com/AEADataEditor/replication-template/blob/master/tools/linux-system-info.sh)` -Description of programs/code ----------------------------- +## Description of programs/code > INSTRUCTIONS: Give a high-level overview of the program files and their purpose. Remove redundant/ obsolete files from the Replication archive. -- Programs in `programs/01_dataprep` will extract and reformat all datasets referenced above. The file `programs/01_dataprep/master.do` will run them all. -- Programs in `programs/02_analysis` generate all tables and figures in the main body of the article. The program `programs/02_analysis/master.do` will run them all. Each program called from `master.do` identifies the table or figure it creates (e.g., `05_table5.do`). Output files are called appropriate names (`table5.tex`, `figure12.png`) and should be easy to correlate with the manuscript. -- Programs in `programs/03_appendix` will generate all tables and figures in the online appendix. The program `programs/03_appendix/master-appendix.do` will run them all. -- Ado files have been stored in `programs/ado` and the `master.do` files set the ADO directories appropriately. +- Programs in `programs/01_dataprep` will extract and reformat all datasets referenced above. The file `programs/01_dataprep/main.do` will run them all. +- Programs in `programs/02_analysis` generate all tables and figures in the main body of the article. The program `programs/02_analysis/main.do` will run them all. Each program called from `main.do` identifies the table or figure it creates (e.g., `05_table5.do`). Output files are called appropriate names (`table5.tex`, `figure12.png`) and should be easy to correlate with the manuscript. +- Programs in `programs/03_appendix` will generate all tables and figures in the online appendix. 
The program `programs/03_appendix/main-appendix.do` will run them all. +- Ado files have been stored in `programs/ado` and the `main.do` files set the ADO directories appropriately. - The program `programs/00_setup.do` will populate the `programs/ado` directory with updated ado packages, but for purposes of exact reproduction, this is not needed. The file `programs/00_setup.log` identifies the versions as they were last updated. - The program `programs/config.do` contains parameters used by all programs, including a random seed. Note that the random seed is set once for each of the two sequences (in `02_analysis` and `03_appendix`). If running in any order other than the one outlined below, your results may differ. @@ -197,17 +228,16 @@ Description of programs/code > INSTRUCTIONS: Most journal repositories provide for a default license, but do not impose a specific license. Authors should actively select a license. This should be provided in a LICENSE.txt file, separately from the README, possibly combined with the license for any data provided. Some code may be subject to inherited license requirements, i.e., the original code author may allow for redistribution only if the code is licensed under specific rules - authors should check with their sources. For instance, some code authors require that their article describing the econometrics of the package be cited. Licensing can be complex. Some non-legal guidance may be found [here](https://social-science-data-editors.github.io/guidance/Licensing_guidance.html). -The code is licensed under a MIT/BSD/GPL/Creative Commons license. See [LICENSE.txt](LICENSE.txt) for details. +The code is licensed under a MIT/BSD/GPL [choose one!] license. See [LICENSE.txt](LICENSE.txt) for details. -Instructions to Replicators ---------------------------- +## Instructions to Replicators -> INSTRUCTIONS: The first two sections ensure that the data and software necessary to conduct the replication have been collected. This section then describes a human-readable instruction to conduct the replication. This may be simple, or may involve many complicated steps. It should be a simple list, no excess prose. Strict linear sequence. If more than 4-5 manual steps, please wrap a master program/Makefile around them, in logical sequences. Examples follow. +> INSTRUCTIONS: The first two sections ensure that the data and software necessary to conduct the replication have been collected. This section then describes a human-readable instruction to conduct the replication. This may be simple, or may involve many complicated steps. It should be a simple list, no excess prose. Strict linear sequence. If more than 4-5 manual steps, please wrap a main program/Makefile around them, in logical sequences. Examples follow. - Edit `programs/config.do` to adjust the default path - Run `programs/00_setup.do` once on a new system to set up the working environment. - Download the data files referenced above. Each should be stored in the prepared subdirectories of `data/`, in the format that you download them in. Do not unzip. Scripts are provided in each directory to download the public-use files. Confidential data files requested as part of your FSRDC project will appear in the `/data` folder. No further action is needed on the replicator's part. -- Run `programs/01_master.do` to run all steps in sequence. +- Run `programs/01_main.do` to run all steps in sequence. 
### Details @@ -216,19 +246,19 @@ Instructions to Replicators - `programs/01_dataprep`: - These programs were last run at various times in 2018. - Order does not matter, all programs can be run in parallel, if needed. - - A `programs/01_dataprep/master.do` will run them all in sequence, which should take about 2 hours. -- `programs/02_analysis/master.do`. + - A `programs/01_dataprep/main.do` will run them all in sequence, which should take about 2 hours. +- `programs/02_analysis/main.do`. - If running programs individually, note that ORDER IS IMPORTANT. - The programs were last run top to bottom on July 4, 2019. -- `programs/03_appendix/master-appendix.do`. The programs were last run top to bottom on July 4, 2019. +- `programs/03_appendix/main-appendix.do`. The programs were last run top to bottom on July 4, 2019. - Figure 1: The figure can be reproduced using the data provided in the folder “2_data/data_map”, and ArcGIS Desktop (Version 10.7.1) by following these (manual) instructions: - Create a new map document in ArcGIS ArcMap, browse to the folder “2_data/data_map” in the “Catalog”, with files "provinceborders.shp", "lakes.shp", and "cities.shp". - Drop the files listed above onto the new map, creating three separate layers. Order them with "lakes" in the top layer and "cities" in the bottom layer. - Right-click on the cities file, in properties choose the variable "health"... (more details) -List of tables and programs ---------------------------- +## List of tables and programs + > INSTRUCTIONS: Your programs should clearly identify the tables and figures as they appear in the manuscript, by number. Sometimes, this may be obvious, e.g. a program called "`table1.do`" generates a file called `table1.png`. Sometimes, mnemonics are used, and a mapping is necessary. In all circumstances, provide a list of tables and figures, identifying the program (and possibly the line number) where a figure is created. > diff --git a/README.pdf b/README.pdf index b3cd94f..646f882 100644 Binary files a/README.pdf and b/README.pdf differ diff --git a/README.tex b/README.tex index 44d001f..e3762b9 100644 --- a/README.tex +++ b/README.tex @@ -92,7 +92,7 @@ \subsection{Overview}\label{overview}} Example: The code in this replication package constructs the analysis file from the three data sources (Ruggles et al, 2018; Inglehart et al, -2019; BEA, 2016) using Stata and Julia. Two master files run all of the +2019; BEA, 2016) using Stata and Julia. Two main files run all of the code to generate the data for the 15 figures and 3 tables in the paper. The replicator should expect the code to run for about 14 hours. @@ -145,11 +145,14 @@ \subsection{Data Availability and Provenance The information should describe ALL data used, regardless of whether they are provided as part of the replication archive or not, and -regardless of size or scope. For instance, if using GDP deflators, the -source of the deflators (e.g.~at the national statistical office) should -also be listed here. If any of this information has been provided in a -pre-registration, then a link to that registration may (partially) -suffice. +regardless of size or scope. The DAS should provide enough information +that a replicator can obtain the data from the original source, even if +the file is provided. + +For instance, if using GDP deflators, the source of the deflators +(e.g.~at the national statistical office) should also be listed here. 
If +any of this information has been provided in a pre-registration, then a +link to that registration may (partially) suffice. DAS can be complex and varied. Examples are provided \href{https://social-science-data-editors.github.io/guidance/Requested_information_dcas.html}{here}, @@ -167,7 +170,7 @@ \subsection{Data Availability and Provenance extent stylistic considerations, data citations should appear in the main article, in an appendix, or in the README. However, data citations only provide information \textbf{where} to find the data, not -\textbf{how to access} that data. Thus, DAS augment data citations by +\textbf{how to access} those data. Thus, DAS augment data citations by going into additional detail that allow a researcher to assess cost, complexity, and availability over time of the data used by the original author. @@ -181,6 +184,12 @@ \subsubsection{Statement about Rights}\label{statement-about-rights}} \item[$\square$] I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript. +\item[$\square$] + I certify that the author(s) of the manuscript have documented + permission to redistribute/publish the data contained within this + replication package. Appropriate permission are documented in the + \href{https://social-science-data-editors.github.io/template_README/LICENSE.txt}{LICENSE.txt} + file. \end{itemize} \hypertarget{optional-but-recommended-license-for-data}{% @@ -199,11 +208,18 @@ \subsubsection{(Optional, but recommended) License for author, but also any subsequent users - cite the data provider. Licensing can be complex. Some non-legal guidance may be found \href{https://social-science-data-editors.github.io/guidance/Licensing_guidance.html}{here}. +For multiple licenses within a data package, the \texttt{LICENSE.txt} +file might contain the concatenation of all the licenses that apply (for +instance, a custom license for one file, plus a CC-BY license for +another file). + +NOTE: In many cases, it is not up to the creator of the replication +package to simply define a license, a license may be \emph{sticky} and +be defined by the original data creator. \end{quote} -The code is licensed under a Creative Commons/CC-BY-NC/CC0 license. See -\href{https://social-science-data-editors.github.io/template_README/LICENSE.txt}{LICENSE.txt} -for details. +\emph{Example:} The data are licensed under a Creative Commons/CC-BY-NC +license. See LICENSE.txt for details. \hypertarget{summary-of-availability}{% \subsubsection{Summary of Availability}\label{summary-of-availability}} @@ -239,9 +255,73 @@ \subsubsection{Details on each Data file name), or at a URL (list the URL). Some formats are self-describing \emph{if} they have the requisite information (e.g., \texttt{.dta} should have both variable and value labels). 
+\item + List availability within the package +\item + Use proper bibliographic references in addition to a verbose + description (and provide a bibliography at the end of the README, + expanding those references) \end{itemize} + +A summary in tabular form can be useful: \end{quote} +\begin{longtable}[]{@{}lllll@{}} +\toprule +\begin{minipage}[b]{0.17\columnwidth}\raggedright +Data.Name\strut +\end{minipage} & \begin{minipage}[b]{0.17\columnwidth}\raggedright +Data.Files\strut +\end{minipage} & \begin{minipage}[b]{0.17\columnwidth}\raggedright +Location\strut +\end{minipage} & \begin{minipage}[b]{0.17\columnwidth}\raggedright +Provided\strut +\end{minipage} & \begin{minipage}[b]{0.17\columnwidth}\raggedright +Citation\strut +\end{minipage}\tabularnewline +\midrule +\endhead +\begin{minipage}[t]{0.17\columnwidth}\raggedright +``Current Population Survey 2018''\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +cepr\_march\_2018.dta\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +data/\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +TRUE\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +CEPR (2018)\strut +\end{minipage}\tabularnewline +\begin{minipage}[t]{0.17\columnwidth}\raggedright +``Provincial Administration Reports''\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +coast\_simplepoint2.csv; rivers\_simplepoint2.csv; RAIL\_dummies.dta; +railways\_Dissolve\_Simplify\_point2.csv\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +Data/maps/\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +TRUE\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +Administration (2017)\strut +\end{minipage}\tabularnewline +\begin{minipage}[t]{0.17\columnwidth}\raggedright +``2017 SAT scores''\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +Not available\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +data/to\_clean/\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +FALSE\strut +\end{minipage} & \begin{minipage}[t]{0.17\columnwidth}\raggedright +College Board (2020)\strut +\end{minipage}\tabularnewline +\bottomrule +\end{longtable} + +where the \texttt{Data.Name} column is then expanded in the subsequent +paragraphs, and \texttt{CEPR\ (2018)} is resolved in the References +section of the README. + \hypertarget{example-for-public-use-data-collected-by-the-authors}{% \subsubsection{Example for public use data collected by the authors}\label{example-for-public-use-data-collected-by-the-authors}} @@ -382,6 +462,33 @@ \subsection{Dataset list}\label{dataset-list}} be provided as a Excel/CSV table, or in the table below. \end{quote} +\begin{quote} +INSTRUCTIONS: While it is often most convenient to provide data in the +native format of the software used to analyze and process the data, not +all formats are ``open'' and can be read by other (free) software. Data +should at a minimum be provided in formats that can be read by +open-source software (R, Python, others), and ideally be provided in +non-proprietary, archival-friendly formats. +\end{quote} + +\begin{quote} +INSTRUCTIONS: All data files should be fully documented: +variables/columns should have labels (long-form meaningful names), and +values should be explained. 
This might mean generating a codebook, +pointing at a public codebook, or providing data in (non-proprietary) +formats that allow for a rich description. This is in particular +important for data that is not distributable. +\end{quote} + +\begin{quote} +INSTRUCTIONS: Some journals require, and it is considered good practice, +to provide synthetic or simulated data that has some of the key +characteristics of the restricted-access data which are not provided. +The level of fidelity may vary - it may be useful for debugging only, or +it should allow to assess the key characteristics of the +statistical/econometric procedure or the main conclusions of the paper. +\end{quote} + \begin{longtable}[]{@{}llll@{}} \toprule \begin{minipage}[b]{0.26\columnwidth}\raggedright @@ -444,9 +551,11 @@ \subsection{Computational install/set up the environment. Sample scripts for \href{https://github.com/gslab-econ/template/blob/master/config/config_stata.do}{Stata}, \href{https://github.com/labordynamicsinstitute/paper-template/blob/master/programs/global-libraries.R}{R}, -\href{https://pip.readthedocs.io/en/1.1/requirements.html}{Python}, \href{https://github.com/labordynamicsinstitute/paper-template/blob/master/programs/packages.jl}{Julia} -are easy to set up and implement. +are easy to set up and implement. Specific software may have more +sophisticated tools: +\href{https://pip.pypa.io/en/stable/user_guide/\#ensuring-repeatability}{Python}, +\href{https://julia.quantecon.org/more_julia/tools_editors.html\#Package-Environments}{Julia}. \end{quote} \hypertarget{software-requirements}{% @@ -489,9 +598,9 @@ \subsubsection{Software Requirements}\label{software-requirements}} the file ``\texttt{requirements.txt}'' lists these dependencies, please run ``\texttt{pip\ install\ -r\ requirements.txt}'' as the first step. See - \url{https://pip.readthedocs.io/en/1.1/requirements.html} for - further instructions on using the ``\texttt{requirements.txt}'' - file. + \url{https://pip.pypa.io/en/stable/user_guide/\#ensuring-repeatability} + for further instructions on creating and using the + ``\texttt{requirements.txt}'' file. \end{itemize} \item Intel Fortran Compiler version 20200104 @@ -518,6 +627,28 @@ \subsubsection{Software Requirements}\label{software-requirements}} Portions of the code use Powershell scripting, which may require Windows 10 or higher. +\hypertarget{controlled-randomness}{% +\subsubsection{Controlled Randomness}\label{controlled-randomness}} + +\begin{quote} +INSTRUCTIONS: Some estimation code uses random numbers, almost always +provided by pseudorandom number generators (PRNGs). For reproducibility +purposes, these should be provided with a deterministic seed, so that +the sequence of numbers provided is the same for the original author and +any replicators. While this is not always possible, it is a requirement +by many journals' policies. The seed should be set once, and not use a +time-stamp. If using parallel processing, special care needs to be +taken. If using multiple programs in sequence, care must be taken on how +to call these programs, ideally from a main program, so that the +sequence is not altered. 
+\end{quote} + +\begin{itemize} +\tightlist +\item[$\square$] + Random seed is set at line \_\_\_\_\_ of program \_\_\_\_\_\_ +\end{itemize} + \hypertarget{memory-and-runtime-requirements}{% \subsubsection{Memory and Runtime Requirements}\label{memory-and-runtime-requirements}} @@ -543,7 +674,9 @@ \subsubsection{Memory and Runtime \item[$\square$] 10-60 minutes \item[$\square$] - 1-8 hours + 1-2 hours +\item[$\square$] + 2-8 hours \item[$\square$] 8-24 hours \item[$\square$] @@ -600,22 +733,22 @@ \subsection{Description of \item Programs in \texttt{programs/01\_dataprep} will extract and reformat all datasets referenced above. The file - \texttt{programs/01\_dataprep/master.do} will run them all. + \texttt{programs/01\_dataprep/main.do} will run them all. \item Programs in \texttt{programs/02\_analysis} generate all tables and figures in the main body of the article. The program - \texttt{programs/02\_analysis/master.do} will run them all. Each - program called from \texttt{master.do} identifies the table or figure - it creates (e.g., \texttt{05\_table5.do}). Output files are called - appropriate names (\texttt{table5.tex}, \texttt{figure12.png}) and - should be easy to correlate with the manuscript. + \texttt{programs/02\_analysis/main.do} will run them all. Each program + called from \texttt{main.do} identifies the table or figure it creates + (e.g., \texttt{05\_table5.do}). Output files are called appropriate + names (\texttt{table5.tex}, \texttt{figure12.png}) and should be easy + to correlate with the manuscript. \item Programs in \texttt{programs/03\_appendix} will generate all tables and figures in the online appendix. The program - \texttt{programs/03\_appendix/master-appendix.do} will run them all. + \texttt{programs/03\_appendix/main-appendix.do} will run them all. \item Ado files have been stored in \texttt{programs/ado} and the - \texttt{master.do} files set the ADO directories appropriately. + \texttt{main.do} files set the ADO directories appropriately. \item The program \texttt{programs/00\_setup.do} will populate the \texttt{programs/ado} directory with updated ado packages, but for @@ -648,7 +781,7 @@ \subsubsection{(Optional, but recommended) License for \href{https://social-science-data-editors.github.io/guidance/Licensing_guidance.html}{here}. \end{quote} -The code is licensed under a MIT/BSD/GPL/Creative Commons license. See +The code is licensed under a MIT/BSD/GPL {[}choose one!{]} license. See \href{https://social-science-data-editors.github.io/template_README/LICENSE.txt}{LICENSE.txt} for details. @@ -662,7 +795,7 @@ \subsection{Instructions to then describes a human-readable instruction to conduct the replication. This may be simple, or may involve many complicated steps. It should be a simple list, no excess prose. Strict linear sequence. If more than 4-5 -manual steps, please wrap a master program/Makefile around them, in +manual steps, please wrap a main program/Makefile around them, in logical sequences. Examples follow. \end{quote} @@ -681,7 +814,7 @@ \subsection{Instructions to part of your FSRDC project will appear in the \texttt{/data} folder. No further action is needed on the replicator's part. \item - Run \texttt{programs/01\_master.do} to run all steps in sequence. + Run \texttt{programs/01\_main.do} to run all steps in sequence. \end{itemize} \hypertarget{details-1}{% @@ -711,11 +844,11 @@ \subsubsection{Details}\label{details-1}} Order does not matter, all programs can be run in parallel, if needed. 
\item - A \texttt{programs/01\_dataprep/master.do} will run them all in + A \texttt{programs/01\_dataprep/main.do} will run them all in sequence, which should take about 2 hours. \end{itemize} \item - \texttt{programs/02\_analysis/master.do}. + \texttt{programs/02\_analysis/main.do}. \begin{itemize} \tightlist @@ -725,7 +858,7 @@ \subsubsection{Details}\label{details-1}} The programs were last run top to bottom on July 4, 2019. \end{itemize} \item - \texttt{programs/03\_appendix/master-appendix.do}. The programs were + \texttt{programs/03\_appendix/main-appendix.do}. The programs were last run top to bottom on July 4, 2019. \item Figure 1: The figure can be reproduced using the data provided in the