validation-report-answer-key.Rmd

---
title: "Validation Report"
output: pdf_document
---

<!-- README                                                                                  -->
<!-- This is the answer key. If you are totally new at Python and R, we recommend you follow -->
<!-- along with this document. We've packed the document full of comments to help you        -->
<!-- understand what's going on and to help you learn. Want to give the coding a shot?       -->
<!-- hop on over to the validation-report-guided.Rmd file to give yourself a challenge.      -->
```{r rsetup, include=FALSE}
# import needed R packages, these stay imported in future chunks
library(tidyverse)
library(magrittr)
library(knitr)
library(kableExtra)
library(reticulate)
# Set up Python for reticulate
# This path is using the miniconda installation, which is located
# inside of the home directory based on how the setup script installed it
reticulate::use_python("~/miniconda/bin/python3", required = TRUE)
```
```{python pysetup, include=FALSE}
# Import needed python packages, these stay imported in future chunks
# This only has to be done once for the life of your session
# The Python session stays alive between blocks of code
import os
import re
import pandas as pd
```

<!-- begin breakout one                                                                       -->
<!-- Goals:                                                                                   -->
<!--  - Report the run date/time and user (R or Python - your preference)                     -->
<!--  - Store the absolute paths to the programs, outputs, and metadata directories in Python -->
<!--  - Report each of these paths inline in bold markdown text                               -->
```{r usercheck, include=FALSE}
# get system information (user, date, time)
currentuser <- Sys.getenv("USER")
currentdate <- format(Sys.time(), "%Y-%m-%d")
currenttime <- format(Sys.time(), "%H:%M:%S")
```

<!-- Report the user, date, and time -->
This report was run by user **`r currentuser`** on **`r currentdate`** at **`r currenttime`**

# Report Synopsis

This is a validation report for the CDISC Replication Pilot Project.  The directory of the project is laid out as seen below.  These folders are where the different files for this project are held.  In addition, there are checks in place for the files used in the project to check consistency and completeness.

```{python directorychecks, echo=FALSE}
# get current working directory
wd = os.getcwd()
# create paths to the subfolders containing programs, outputs, and metadata 
progdir = os.path.join(wd, "programs")
outputdir = os.path.join(wd, "outputs")
metadir = os.path.join(wd, "metadata")
```

# Directory Outline

<!-- Python variables are accessible inside the `py` variable in your R session -->
The metadata files are located in the following folder:  
**`r py$metadir`**

The program files are located in the following folder:  
**`r py$progdir`**

The output files are located in the following folder:  
**`r py$outputdir`**
<!-- end breakout one -->

<!-- begin breakout two -->
<!-- Goals:                                                                               -->
<!--  - Gather names of files in the programs and outputs folders                         -->
<!--  - Read in the metadata file metadata/metadata.csv. Review this file and familiarize -->
<!--    yourself with the contents                                                        -->
<!--  - Bidirectionally check the metadata and directories for programs and outputs       -->
<!--  - Create Boolean variables for                                                      -->
<!--    - Any programs missing?                                                           -->
<!--    - Programs in folder not in metadata?                                             -->
<!--    - Any outputs missing?                                                            -->
<!--    - Outputs in folder not in metadata?                                              -->
<!--                       Use one Python code block to do all of this                    -->

# File Checks

```{python filechecks, echo=FALSE}
# Import metadata file into dataframe
metacsv = os.path.join(metadir,"metadata.csv")
metadf = pd.read_csv(metacsv)

# iterate through the metadata data frame to check if programs and outputs exist 
# Remember to review the metadata data frame to familiarize yourself with the contents
# iterrows will loop over each row of the dataframe. SAS Programmer? This is similar
# to a data step. R Programmer? This is similar to dplyr::rowwise()
for index, row in metadf.iterrows():

   # check to see if program exists
   program = os.path.join(progdir, row.ProgramName) # Get the path to the file
   programexist = os.path.exists(program) # Test if it exists
   
   # check to see if output exists
   output = os.path.join(outputdir, row.OutputName) # Get the path to the file
   outputexist = os.path.exists(output) # Check to see if it exists
   
   # assign variables created to columns in the data frame
   # DataFrame.at will takes the index and a variable name,
   # and will assign a value at to that variable at the specified index
   metadf.at[index, 'ProgramExists'] = programexist
   metadf.at[index, 'OutputExists'] = outputexist

# Get contents of programs and outputs sub directories
proglist = os.listdir(progdir)
ouputlist = os.listdir(outputdir)

# Create lists of programs and outputs in the respective directory but not in the metadata file
# List comprehensions are similar to lapply in R, in that they return a list object.
# They're written like for loops. Here we're returning a list of files in the list variable
# proglist if the file is NOT in the metadf column ProgramName, and if they're .R files.
progsindirnotmeta = [file for file in proglist if file not in metadf.ProgramName.tolist() and re.search(".R", file)]

# And we'll do the same thing for the output files.
outputsindirnotmeta = [file for file in ouputlist if file not in metadf.OutputName.tolist() and re.search(".rtf", file)]

# Create a True/False variable to indicate if all programs in the metadata exist
# The all() function tests if all elements of an object are True
programsallgood = all(metadf.ProgramExists)

# Create a True/False variable to indicate if all outputs in the metadata exist
outputsallgood = all(metadf.OutputExists)

# Create a True/False variable to indicate if there are programs in the directory not in the metadata
# The way we did the list comprehensions above, the list will be empty if all files 
# in the directory are in the metadata, and therefore the length will be 0.
programsmetaallgood = (len(progsindirnotmeta) == 0)

# Create a True/False variable to indicate if there are outputs in the directory not in the metadata
outputsmetaallgood = (len(outputsindirnotmeta) == 0)
```
<!-- end breakout two -->

<!-- begin breakout three -->
<!-- - Use the 4 variables created earlier to output conditional markdown text
<!--   - Some/No programs missing
<!--   - Some/No programs present in folder not in metadata
<!--   - Some/No outputs missing
<!--   - Some/No outputs present in folder not in metadata
<!-- - Create code blocks to output the detailed issues in a nicely presented table -->

## Program File Check

The programs in the metadata document were checked against the programs folder, results are as follows:
<!-- These lines are conditional inline markdown text. If the logical value within the if() condition -->
<!-- is met, the text will populate. Otherwise, nothing will be written                               -->
<!-- Another way to do this is to use a code block, use the cat() function to write your text,        -->
<!-- and use the option results='asis'                                                                -->
`r if(py$programsallgood){"**All programs present in the metadata document are present in the programs folder.**"}`
<!-- In this next statement we just want the opposite of the if condition above -->
`r if(!py$programsallgood){"**Some programs present in the metadata document are not present in the programs folder.**  The programs listed below are not present in the programs folder."}`

<!-- Display this table only if programs are in the metadata but not directory --> 
<!-- Note in eval option in the code block below - you can use variables in your session -->
<!-- to control if the block is evaluated or not. Here, we can control whether or not a  -->
<!-- table is written.                                                                   -->
```{r badprograms, echo=FALSE, eval=!py$programsallgood}
# Nicely write out the names of programs that are in the metadata but not directory
# Note - the python data frames are successfully coerced to R data frames! This 
# happens automatically with no extra effort on your part.
# Here we use dplyr::select and dplyr::filter to filter the pandas data frame
kable(select(filter(py$metadf, ProgramExists != TRUE), ProgramName), booktabs = TRUE) %>%
   kable_styling(latex_options = "striped", position = "left")
```  

<!-- Same idea as above, but now we're reporting programs in the folder that weren't in the metadata -->
`r if(py$programsmetaallgood){"**All programs present in the programs folder are present in the metadata document.**"}`
`r if(!py$programsmetaallgood){"**Some programs present in the programs folder are not present in the metadata document.**  The programs listed below are not present in the metadata document."}`

<!-- Display this table only if programs are in the directory but not metadata --> 
```{r badprogrammeta, echo=FALSE, eval=!py$programsmetaallgood}
# nicely write out the names of programs that are in the directory but not metadata
# Kable can easily present a list as a table
kable(py$progsindirnotmeta, col.names="ProgramName", booktabs = TRUE) %>%
   kable_styling(latex_options = "striped", position = "left")
```

## Output File Check

<!-- The lines below repeat all of the same checks, but now for the outputs folder -->
The output files in the metadata document were checked against the outputs folder, results are as follows:  

`r if(py$outputsallgood){"**All outputs present in the metadata document are present in the outputs folder.**"}`
`r if(!py$outputsallgood){"**Some outputs present in the metadata document are not present in the outputs folder.**  The outputs listed below are not present in the outputs folder."}`

<!-- Display this table only if outputs are in the metadata but not directory --> 
```{r badoutputs, echo=FALSE, eval=!py$outputsallgood}
# Nicely write out the names of outputs that are in the metadata but not directory
kable(select(filter(py$metadf, OutputExists != TRUE), OutputName), booktabs = TRUE) %>%
   kable_styling(latex_options = "striped", position = "left")
```  

<!-- Same idea as above, but now we're reporting outputs in the folder that weren't in the metadata -->
`r if(py$outputsmetaallgood){"**All outputs present in the outputs folder are present in the metadata document.**"}`
`r if(!py$outputsmetaallgood){"**Some outputs present in the outputs folder are not present in the metadata document.**  The outputs listed below are not present in the metadata document."}`

<!-- Display this table only if outputs are in the directory but not metadata --> 
```{r badoutputmeta, echo=FALSE, eval=!py$outputsmetaallgood}
# nicely write out the names of outputs that are in the directory but not metadata
kable(py$outputsindirnotmeta, col.names="OutputName", booktabs = TRUE) %>%
   kable_styling(latex_options = "striped", position = "left")
```
<!-- end breakout three -->

<!-- begin breakout four -->
<!-- - Pass over the metadata data frame from before and check if the source file name is in the output file                  -->
<!-- - Create a new data frame of just records where the output is in both the metadata and the outputs directory           -->
<!-- - Create a variables that's True if all the source file names are in their outputs, and false if not                   -->
<!-- - Create a section that dynamically tells us if something is wrong and outputs issues in a nicely presented data frame -->

# Output Contents Checks

The text of the output files present in both the metadata document and outputs folder were checked for consistency with the source program files provided.

<!-- In this block, we're checking the output RTF document to look for a line containing the   -->
<!-- program name that should have created this output document. This is a quick and dirty way -->
<!-- to check if there was a footnote the contained the executing program name.                -->
```{python outputsourcecheck, echo=FALSE}
# Iterate through the metadata data frame to check the content of output files 
for index, row in metadf.iterrows():

   # Get the file path for the output document
   output = os.path.join(outputdir, row.OutputName)
      
   # Get the file path of program associated with this output in the metadata
   source = "programs/" + row.ProgramName
   
   # Check if the output existed (which we tested earlier)
   if row.OutputExists == True:
      # Open the file as a variable named `file`
      with open(output, 'r') as file:
         # Grab the file text as a string
         filetext = str(file.read())
         # Search the file text for the source program name
         sourcecheck = bool(re.search(source, filetext))
         
   # If the file doesn't exist, then just store text saying telling us that
   else:
      sourcecheck = "Output does not exist"
      
   # Assign variable created to a column in the metadf data frame
   metadf.at[index, 'SourceCorrect'] = sourcecheck
   
# Create a dataframe of all output files both in the metadata and the directory
goodoutput = metadf[metadf.OutputExists == True]

# Create a True/False variable that we'll use to trigger text output depending on if the sources were correct
sourcesallgood = all(metadf['SourceCorrect'])
```

## Sources
<!-- Similar to before, we're writing different text based on whether or not there were any problems with the sources -->
`r if(py$sourcesallgood){"**All sources in the outputs match those in the metadata document.**"}`
`r if(!py$sourcesallgood){"**Some sources in the outputs do not match those in the metadata document.**  The outputs and sources listed below do not match the values in the metadata document."}`

<!-- Display a table of outputs with issues in the source footnotes -->
```{r badsources, echo=FALSE, eval=!py$sourcesallgood}
# nicely write out the names of program and output pairs where the source is wrong
kable(select(filter(py$goodoutput, SourceCorrect != TRUE), OutputName, ProgramName), booktabs = TRUE) %>%
   kable_styling(latex_options = "striped", position = "left")
``` 
<!-- end breakout four -->