---
title: "Validation Report"
output: pdf_document
---
<!-- README -->
<!-- This is the answer key. If you are totally new at Python and R, we recommend you follow -->
<!-- along with this document. We've packed the document full of comments to help you -->
<!-- understand what's going on and to help you learn. Want to give the coding a shot? -->
<!-- Hop on over to the validation-report-guided.Rmd file to give yourself a challenge. -->
```{r rsetup, include=FALSE}
# import needed R packages, these stay imported in future chunks
library(tidyverse)
library(magrittr)
library(knitr)
library(kableExtra)
library(reticulate)
# Set up Python for reticulate
# This path is using the miniconda installation, which is located
# inside of the home directory based on how the setup script installed it
reticulate::use_python("~/miniconda/bin/python3", required = TRUE)
```
```{python pysetup, include=FALSE}
# Import needed python packages, these stay imported in future chunks
# This only has to be done once for the life of your session
# The Python session stays alive between blocks of code
import os
import re
import pandas as pd
```
<!-- begin breakout one -->
<!-- Goals: -->
<!-- - Report the run date/time and user (R or Python - your preference) -->
<!-- - Store the absolute paths to the programs, outputs, and metadata directories in Python -->
<!-- - Report each of these paths inline in bold markdown text -->
```{r usercheck, include=FALSE}
# get system information (user, date, time)
currentuser <- Sys.getenv("USER")
currentdate <- format(Sys.time(), "%Y-%m-%d")
currenttime <- format(Sys.time(), "%H:%M:%S")
```
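<!-- The goal above allows either language. A minimal Python sketch of the same idea follows, -->
<!-- assuming the USER environment variable is set (as the R chunk also assumes). The -->
<!-- "unknown" fallback is an illustrative choice and is not part of the original report. -->
<!--
```python
import os
from datetime import datetime

# Capture the moment once so the date and time strings agree with each other
now = datetime.now()

# Mirror Sys.getenv("USER"); fall back to a placeholder if the variable is unset
currentuser = os.environ.get("USER", "unknown")
currentdate = now.strftime("%Y-%m-%d")  # ISO-style date
currenttime = now.strftime("%H:%M:%S")  # 24-hour clock
```
-->
<!-- If these ran in a Python chunk, the inline text could read them through the py object, -->
<!-- for example py$currentuser. -->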
<!-- Report the user, date, and time -->
This report was run by user **`r currentuser`** on **`r currentdate`** at **`r currenttime`**.
# Report Synopsis
This is a validation report for the CDISC Replication Pilot Project. The project's metadata, programs, and outputs are stored in the folders listed below. The files in those folders are checked against the metadata document for completeness and consistency.
```{python directorychecks, echo=FALSE}
# get current working directory
wd = os.getcwd()
# create paths to the subfolders containing programs, outputs, and metadata
progdir = os.path.join(wd, "programs")
outputdir = os.path.join(wd, "outputs")
metadir = os.path.join(wd, "metadata")
```
# Directory Outline
<!-- Python variables are accessible inside the `py` variable in your R session -->
The metadata files are located in the following folder:
**`r py$metadir`**
The program files are located in the following folder:
**`r py$progdir`**
The output files are located in the following folder:
**`r py$outputdir`**
<!-- end breakout one -->
<!-- begin breakout two -->
<!-- Goals: -->
<!-- - Gather names of files in the programs and outputs folders -->
<!-- - Read in the metadata file metadata/metadata.csv. Review this file and familiarize -->
<!-- yourself with the contents -->
<!-- - Bidirectionally check the metadata and directories for programs and outputs -->
<!-- - Create Boolean variables for -->
<!-- - Any programs missing? -->
<!-- - Programs in folder not in metadata? -->
<!-- - Any outputs missing? -->
<!-- - Outputs in folder not in metadata? -->
<!-- Use one Python code block to do all of this -->
# File Checks
```{python filechecks, echo=FALSE}
# Import metadata file into dataframe
metacsv = os.path.join(metadir,"metadata.csv")
metadf = pd.read_csv(metacsv)
# iterate through the metadata data frame to check if programs and outputs exist
# Remember to review the metadata data frame to familiarize yourself with the contents
# iterrows will loop over each row of the dataframe. SAS Programmer? This is similar
# to a data step. R Programmer? This is similar to dplyr::rowwise()
for index, row in metadf.iterrows():
    # check to see if program exists
    program = os.path.join(progdir, row.ProgramName) # Get the path to the file
    programexist = os.path.exists(program) # Test if it exists
    # check to see if output exists
    output = os.path.join(outputdir, row.OutputName) # Get the path to the file
    outputexist = os.path.exists(output) # Check to see if it exists
    # assign variables created to columns in the data frame
    # DataFrame.at takes the index and a variable name,
    # and assigns a value to that variable at the specified index
    metadf.at[index, 'ProgramExists'] = programexist
    metadf.at[index, 'OutputExists'] = outputexist
# Get contents of programs and outputs subdirectories
proglist = os.listdir(progdir)
outputlist = os.listdir(outputdir)
# Create lists of programs and outputs in the respective directory but not in the metadata file
# List comprehensions are similar to lapply in R, in that they return a list object.
# They're written like for loops. Here we're returning a list of files in the list variable
# proglist if the file is NOT in the metadf column ProgramName, and if they're .R files.
# str.endswith matches the literal extension; in a regex, an unescaped "." matches any character.
progsindirnotmeta = [file for file in proglist if file not in metadf.ProgramName.tolist() and file.endswith(".R")]
# And we'll do the same thing for the output files.
outputsindirnotmeta = [file for file in outputlist if file not in metadf.OutputName.tolist() and file.endswith(".rtf")]
# Create a True/False variable to indicate if all programs in the metadata exist
# The all() function tests if all elements of an object are True
programsallgood = all(metadf.ProgramExists)
# Create a True/False variable to indicate if all outputs in the metadata exist
outputsallgood = all(metadf.OutputExists)
# Create a True/False variable to indicate if there are programs in the directory not in the metadata
# The way we did the list comprehensions above, the list will be empty if all files
# in the directory are in the metadata, and therefore the length will be 0.
programsmetaallgood = (len(progsindirnotmeta) == 0)
# Create a True/False variable to indicate if there are outputs in the directory not in the metadata
outputsmetaallgood = (len(outputsindirnotmeta) == 0)
```
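<!-- The bidirectional check above can also be expressed with set differences. This is a -->
<!-- sketch only: the helper name compare_with_directory and the demo file names are -->
<!-- hypothetical, and the demo runs in a throwaway folder rather than the project folders. -->
<!--
```python
import os
import tempfile

def compare_with_directory(expected_names, directory, suffix):
    """Return (missing, unlisted) as sorted lists of file names.

    missing  : names listed in the metadata but absent from the directory
    unlisted : files in the directory with the given suffix that the metadata omits
    """
    expected = set(expected_names)
    present = set(os.listdir(directory))
    missing = sorted(expected - present)
    unlisted = sorted(name for name in present - expected if name.endswith(suffix))
    return missing, unlisted

# Toy demonstration with made-up file names
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["t_demog.R", "t_extra.R", "notes.txt"]:
        open(os.path.join(demo_dir, name), "w").close()
    missing, unlisted = compare_with_directory(["t_demog.R", "t_ae.R"], demo_dir, ".R")

# Boolean flags in the same spirit as programsallgood and programsmetaallgood
demo_allgood = len(missing) == 0
demo_metaallgood = len(unlisted) == 0
```
-->
<!-- Sets make each direction of the comparison a single expression, at the cost of losing -->
<!-- the per-row ProgramExists and OutputExists columns that the report tables rely on. -->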
<!-- end breakout two -->
<!-- begin breakout three -->
<!-- - Use the 4 variables created earlier to output conditional markdown text -->
<!--   - Some/No programs missing -->
<!--   - Some/No programs present in folder not in metadata -->
<!--   - Some/No outputs missing -->
<!--   - Some/No outputs present in folder not in metadata -->
<!-- - Create code blocks to output the detailed issues in a nicely presented table -->
## Program File Check
The programs in the metadata document were checked against the programs folder. The results are as follows:
<!-- These lines are conditional inline markdown text. If the logical value within the if() condition -->
<!-- is met, the text will populate. Otherwise, nothing will be written -->
<!-- Another way to do this is to use a code block, use the cat() function to write your text, -->
<!-- and use the option results='asis' -->
`r if(py$programsallgood){"**All programs present in the metadata document are present in the programs folder.**"}`
<!-- In this next statement we just want the opposite of the if condition above -->
`r if(!py$programsallgood){"**Some programs present in the metadata document are not present in the programs folder.** The programs listed below are not present in the programs folder."}`
<!-- Display this table only if programs are in the metadata but not directory -->
<!-- Note the eval option in the code block below - you can use variables in your session -->
<!-- to control if the block is evaluated or not. Here, we can control whether or not a -->
<!-- table is written. -->
```{r badprograms, echo=FALSE, eval=!py$programsallgood}
# Nicely write out the names of programs that are in the metadata but not directory
# Note - the python data frames are successfully coerced to R data frames! This
# happens automatically with no extra effort on your part.
# Here we use dplyr::select and dplyr::filter to filter the pandas data frame
kable(select(filter(py$metadf, ProgramExists != TRUE), ProgramName), booktabs = TRUE) %>%
kable_styling(latex_options = "striped", position = "left")
```
<!-- Same idea as above, but now we're reporting programs in the folder that weren't in the metadata -->
`r if(py$programsmetaallgood){"**All programs present in the programs folder are present in the metadata document.**"}`
`r if(!py$programsmetaallgood){"**Some programs present in the programs folder are not present in the metadata document.** The programs listed below are not present in the metadata document."}`
<!-- Display this table only if programs are in the directory but not metadata -->
```{r badprogrammeta, echo=FALSE, eval=!py$programsmetaallgood}
# nicely write out the names of programs that are in the directory but not metadata
# Kable can easily present a list as a table
kable(py$progsindirnotmeta, col.names="ProgramName", booktabs = TRUE) %>%
kable_styling(latex_options = "striped", position = "left")
```
## Output File Check
<!-- The lines below repeat all of the same checks, but now for the outputs folder -->
The output files in the metadata document were checked against the outputs folder. The results are as follows:
`r if(py$outputsallgood){"**All outputs present in the metadata document are present in the outputs folder.**"}`
`r if(!py$outputsallgood){"**Some outputs present in the metadata document are not present in the outputs folder.** The outputs listed below are not present in the outputs folder."}`
<!-- Display this table only if outputs are in the metadata but not directory -->
```{r badoutputs, echo=FALSE, eval=!py$outputsallgood}
# Nicely write out the names of outputs that are in the metadata but not directory
kable(select(filter(py$metadf, OutputExists != TRUE), OutputName), booktabs = TRUE) %>%
kable_styling(latex_options = "striped", position = "left")
```
<!-- Same idea as above, but now we're reporting outputs in the folder that weren't in the metadata -->
`r if(py$outputsmetaallgood){"**All outputs present in the outputs folder are present in the metadata document.**"}`
`r if(!py$outputsmetaallgood){"**Some outputs present in the outputs folder are not present in the metadata document.** The outputs listed below are not present in the metadata document."}`
<!-- Display this table only if outputs are in the directory but not metadata -->
```{r badoutputmeta, echo=FALSE, eval=!py$outputsmetaallgood}
# nicely write out the names of outputs that are in the directory but not metadata
kable(py$outputsindirnotmeta, col.names="OutputName", booktabs = TRUE) %>%
kable_styling(latex_options = "striped", position = "left")
```
<!-- end breakout three -->
<!-- begin breakout four -->
<!-- - Pass over the metadata data frame from before and check if the source file name is in the output file -->
<!-- - Create a new data frame of just records where the output is in both the metadata and the outputs directory -->
<!-- - Create a variable that's True if all the source file names are in their outputs, and False if not -->
<!-- - Create a section that dynamically tells us if something is wrong and outputs issues in a nicely presented data frame -->
# Output Contents Checks
The text of each output file present in both the metadata document and the outputs folder was checked for consistency with the source program file listed in the metadata.
<!-- In this block, we're checking the output RTF document to look for a line containing the -->
<!-- program name that should have created this output document. This is a quick and dirty way -->
<!-- to check if there was a footnote that contained the executing program name. -->
```{python outputsourcecheck, echo=FALSE}
# Iterate through the metadata data frame to check the content of output files
for index, row in metadf.iterrows():
# Get the file path for the output document
output = os.path.join(outputdir, row.OutputName)
# Get the file path of program associated with this output in the metadata
source = "programs/" + row.ProgramName
# Check if the output existed (which we tested earlier)
if row.OutputExists == True:
# Open the file as a variable named `file`
with open(output, 'r') as file:
# Grab the file text as a string
filetext = str(file.read())
# Search the file text for the source program name
sourcecheck = bool(re.search(source, filetext))
# If the file doesn't exist, then just store text saying telling us that
else:
sourcecheck = "Output does not exist"
# Assign variable created to a column in the metadf data frame
metadf.at[index, 'SourceCorrect'] = sourcecheck
# Create a dataframe of all output files both in the metadata and the directory
goodoutput = metadf[metadf.OutputExists == True]
# Create a True/False variable that we'll use to trigger text output depending on if the sources were correct
sourcesallgood = all(metadf['SourceCorrect'])
```
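<!-- The source check can be pulled into a small function and tried on a toy file. This is -->
<!-- a sketch under assumptions: the function name output_mentions_source, the demo file -->
<!-- names, and the footnote text are all hypothetical and not taken from the project. -->
<!--
```python
import os
import re
import tempfile

def output_mentions_source(output_path, program_name):
    """Return True if the output text contains 'programs/<program_name>' literally."""
    source = "programs/" + program_name
    # errors="replace" keeps the read from failing on unexpected bytes in the file
    with open(output_path, "r", errors="replace") as handle:
        filetext = handle.read()
    # re.escape makes the dot in the file name match only a literal dot
    return bool(re.search(re.escape(source), filetext))

# Toy demonstration: one file whose footnote names the expected program
with tempfile.TemporaryDirectory() as demo_dir:
    demo_output = os.path.join(demo_dir, "t_demog.rtf")
    with open(demo_output, "w") as handle:
        handle.write("Table 1 Demographics. Source: programs/t_demog.R")
    found = output_mentions_source(demo_output, "t_demog.R")
    not_found = output_mentions_source(demo_output, "t_ae.R")
```
-->
<!-- A plain substring test (source in filetext) would behave the same way here; the regex -->
<!-- form is kept only to stay close to the chunk above. -->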
## Sources
<!-- Similar to before, we're writing different text based on whether or not there were any problems with the sources -->
`r if(py$sourcesallgood){"**All sources in the outputs match those in the metadata document.**"}`
`r if(!py$sourcesallgood){"**Some sources in the outputs do not match those in the metadata document.** The outputs and sources listed below do not match the values in the metadata document."}`
<!-- Display a table of outputs with issues in the source footnotes -->
```{r badsources, echo=FALSE, eval=!py$sourcesallgood}
# nicely write out the names of program and output pairs where the source is wrong
kable(select(filter(py$goodoutput, SourceCorrect != TRUE), OutputName, ProgramName), booktabs = TRUE) %>%
kable_styling(latex_options = "striped", position = "left")
```
<!-- end breakout four -->