Session 3 has two assignments:
Before Session 3, please complete the reading list and then create four Code Workbooks.
- Reread G2N3C's Section 8.4.3 that focuses on Code Workbooks
- Watch the video N3C Intro to Code Workbook. There's overlap between this video and Session 3. Session 3 focuses more on analysis, and this video focuses more on mechanics of the Foundry workbooks.
- Brush up on your SQL. If you don't have an intro book available, I recommend lessons 1-12 of the interactive SQLBolt.
- Brush up on your R or Python. We won't be programming a lot during class, but you'll need to know enough to follow the rough approach. Common recommendations include:
- Read the OMOP & N3C documentation for a few important tables we'll use in Session 3:
- OMOP's `person`
- OMOP's `observation`
- OMOP's `condition_occurrence`
- N3C's covid patients fact table assembled by the logic liaisons
  - Read the first part carefully (about two pages).
  - For the sake of Session 3, you can skim starting at the variable definitions in the "DATA DICTIONARY" section. But when you use it in a real analysis in a few weeks/months, please read the whole document carefully.
This section resembles the Week 1 Assignment, so refer back to it if you've forgotten some steps.
- Log in to the Enclave with MFA.
- In our class's L0 workspace, open the "Users/" directory and create your personal folder (if it doesn't already exist). For directory & file names, I like kebab case (eg, "will-beasley", "jerrod-anzalone", "james-cheng").
- Make sure you're in the root of your personal directory.
- Create a new "Code Workbook".
- Once the workbook opens, rename it to "manipulation-1". Then click "Rename" when it asks, "Also rename the output folder to manipulation-1?" (Rename the workbook once it's open, so these behind-the-scenes files are adjusted appropriately.)
- Change the environment.
  - In the top center of the screen, click the lightning bolt near "Environment (default-r-3.5)".
  - Click "Configure Environment".
  - In the left-hand Profiles panel, click "profile-high-driver-cores-and-memory-minimal". (Remarks that will make sense later: #1 For workbooks that rely on R, we'll choose "r4-high-driver-memory". #2 We're using small datasets in this session; for real projects that use more memory, we would choose environments such as "profile-high-driver-cores-and-memory".)
  - Click the blue "Update Environment" button.
  - Wait a few minutes, or go back to the directory screen and continue creating workbooks. It takes the servers a few minutes to build the environment.
- Follow the same steps as "manipulation-1", except:
  - rename it "graphs-1", and
  - choose the "r4-high-driver-memory" environment.
- Follow the same steps as "manipulation-1", except:
  - rename it "validation-1", and
  - choose the "r4-high-driver-memory" environment.
- Follow the same steps as "manipulation-1", except:
  - rename it "modeling-1",
  - choose the "r4-high-driver-memory" environment, and
  - before clicking "Update Environment", add one additional package:
    - Click "Customize Profile".
    - Search for "r-emmeans" and click the green plus button.
    - Click the blue "Update Environment" button.

  Note: when you customize an environment, it takes longer to build, and there's a chance the package versions won't resolve correctly. So don't add unnecessary packages.
- In the root of your personal directory (eg, `Users/will-beasley`):
  - Add a new "Notepad document".
  - When it opens, rename it to "README".
  - Add important info and links to help connect different things, like
    - Non-Enclave locations your group uses, like Google Drive & a GitHub repo
    - Link to your group's meeting notes (eg, in Google Docs)
    - Link to your group's manuscript & supplement drafts
    - Concept sets used or discussed
    - Describe the workflow among code workbooks. For Session 3, the order is "manipulation-1", then "graphs-1", then "validation-1", then "modeling-1".
  - For this class, some important resources are
    - Our class's GitHub file repository
    - The dataset simulated for this Session, which is stored in the Enclave. I like including both the
      - direct link: https://unite.nih.gov/workspace/compass/view/ri.compass.main.folder.32f8e987-c58c-44a0-bff8-8b8fb3d15805
      - spelled-out path: N3C Training Area/Group Exercises/Introduction to Real World Data Analysis for COVID-19 Research, Spring 2024/analysis-with-synthetic-data/

Resources
- The Researcher's Guide to N3C
- N3C Office Hours on Tuesdays & Thursdays
- OMOP tables
- Go through the four workbooks we created for Session 3 and make sure each transform is completed.
- You don't have to understand every line of code, but try to understand the big picture.
- For each input dataset and each transform that produces a table, write down the table's grain. (For the "modeling-1" transforms, choose between the unconventional grains "coefficient" and "plotted point".)
- List three areas outside software development where it's advantageous to break up bigger challenges into smaller ones.
- Add a new variable to `pt` from an existing input table.
- Incorporate the `condition_occurrence` table. We suggest something like this (an R sketch follows this list):
  - Import the dataset (from the same location as the session's other simulated datasets).
  - Create a transform that narrows it down to zero or one row per patient.
  - In the `pt` transform, join it to `patient_ll` (along with any existing joins) in the FROM clause.
  - In the `pt` transform, include relevant column(s) in the SELECT clause.
  - In the `prepare_dataset()` function, add any R flavoring that would aid the analysis later. (Remember this function is not defined in any specific transform. Where did we put it?)
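To make the last sub-step concrete, here is a minimal sketch of what the R flavoring could look like. It assumes the column coming out of the SQL join is `condition_source_value` and that `prepare_dataset()` receives the `pt` data frame; the derived flag is hypothetical, so adjust the names to match your own transforms.

```r
library(dplyr)

prepare_dataset <- function(d) {
  d %>%
    mutate(
      # Hypothetical derived flag: did the patient have a qualifying condition row?
      has_condition = !is.na(condition_source_value),
      # Convert to a labeled factor so later graphs & models read nicely.
      has_condition = factor(has_condition, levels = c(FALSE, TRUE), labels = c("no", "yes"))
    )
}
```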
- Color-code the workbooks' transforms. Think about which parts belong to which category.
  - "omop source": medium purple (#7B64FF)
  - "n3c derived": light purple (#AEA1FF)
  - "metadata": olive green (#B0BC00)
  - "intermediate": gray (#999999)
  - "outcome": orange (#FB9E00)
  - "diagnostic": cyan (#73D8FF)
Optional and Superduper Advanced Assignment
Improve the definition of the `event_animal` variable by using a lookup table.
- Upgrade the `event_animal` code to use a codeset or lookup table (a sketch follows below).
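One possible shape for that upgrade, sketched in R with dplyr. The codes in the lookup are placeholders rather than real values, `pt` stands for the session's patient table, and in the Enclave the lookup could just as easily live as its own small transform or be replaced by a concept set.

```r
library(dplyr)

# Placeholder lookup table; substitute the real source codes and labels.
animal_lookup <- data.frame(
  condition_source_value = c("code-for-dog-bite", "code-for-cow-bite", "code-for-pig-bite"),
  event_animal           = c("dog",               "cow",               "pig")
)

# Replace whatever logic currently defines `event_animal` with a join against the lookup.
pt_with_animal <-
  pt %>%
  left_join(animal_lookup, by = "condition_source_value")
```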
- Which `data_partner` has the lowest prevalence of moderate (or worse) covid? How can you tell?
- Which `data_partner` has the highest prevalence of moderate (or worse) covid? How can you tell?
- The `data_partner` with the highest prevalence of moderate covid is not necessarily the one whose providers are doing the worst job taking care of their patients. Why? What are some confounds you can think of? How could you start to control for these confounds?
- Improve the graphs cosmetically (eg, better axis labels with ggplot's `labs()` function; a toy example follows below).
- Improve the graphs in some way, even if it's just a cosmetic improvement (if you can't think of something). What did you change and how did you do it?
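If it helps to see `labs()` in action, here is a self-contained toy example; the data frame and its values are made up, so substitute the real summarized table from "graphs-1".

```r
library(ggplot2)

# Toy placeholder data; swap in the real pt-derived summary from graphs-1.
d <- data.frame(
  data_partner_id = factor(c("100", "200", "300")),
  prevalence      = c(0.12, 0.18, 0.09)
)

ggplot(d, aes(x = data_partner_id, y = prevalence)) +
  geom_col() +
  labs(
    x       = "Data partner",
    y       = "Prevalence of moderate (or worse) covid",
    title   = "Moderate+ covid by data partner",
    caption = "Synthetic Session 3 data"
  )
```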
- Feel free to graph with Python instead of R (or both). If so, we suggest:
  - Starting a new workbook called "graphs-2".
  - Changing the environment to "profile-high-driver-cores-and-memory-minimal".
  - Importing the `pt` Spark DataFrame.
  - Creating a downstream Python transform.
  - Using this Python code as a starting point.
- Describe the relationship between `data_partner_id` and `condition_source_value` (an exploratory sketch follows these questions). What (operation behind the scenes) might explain this pattern?
- Describe the relationship between `data_partner_id`, `condition_start_date`, and `condition_end_date`. What could explain this pattern?
- Inspect `condition_end_date`. What is fishy about it? What are some mitigation approaches for this specific scenario?
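If you want a starting point for these inspections, here is an exploratory sketch in R. It assumes `condition_occurrence` has already been imported into "validation-1" as a data frame with the usual OMOP/N3C column names; it is not the session's required code, only one way to look.

```r
library(dplyr)

# Per-partner summaries that surface the patterns asked about above.
condition_occurrence %>%
  group_by(data_partner_id) %>%
  summarise(
    n_rows                = n(),
    n_source_values       = n_distinct(condition_source_value),
    earliest_start        = min(condition_start_date, na.rm = TRUE),
    latest_start          = max(condition_start_date, na.rm = TRUE),
    prop_missing_end      = mean(is.na(condition_end_date)),
    prop_end_before_start = mean(condition_end_date < condition_start_date, na.rm = TRUE)
  )
```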
- Interpret the coefficients of `m_covid_moderate_1`. Aim for 2 sentences total.
- Interpret the coefficients of `m_covid_moderate_2a`. Aim for 3-6 sentences total.
- Improve the model (for predicting `covid_moderate_plus`) in some way. Consider adding an interaction or a new variable (see the sketch after this list). What did you change and how did you do it?
- Write a hypothesis for a new outcome variable. Develop a model for it and describe the results.
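For the "improve the model" item, one common direction is an interaction term. The sketch below is illustrative only: `ds` stands in for whatever data frame feeds the existing models, and `age_group` and `gender_concept_name` are placeholder predictors; start from the formula already used in `m_covid_moderate_2a`.

```r
# Hypothetical variant of the moderate-covid model with an interaction term.
m_covid_moderate_3 <- glm(
  covid_moderate_plus ~ age_group * gender_concept_name + data_partner_id,
  family = binomial(link = "logit"),
  data   = ds  # placeholder for the modeling data frame
)
summary(m_covid_moderate_3)
```

If you added the r-emmeans package to the "modeling-1" environment earlier, `emmeans::emmeans()` is one convenient way to summarize what an interaction implies on the probability scale.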
If you're following this document outside of class, don't forget all analytical output (like tables, graphs, models, & screenshots) must be approved before it's exported from the Enclave. We'll discuss details later in Session 6.
Remember, Session 3 uses only synthetic/fake data, so no patients can be exposed by this handout.
But when you start working with real Level 2 or Level 3 data, you must follow the procedures described in Session 6 and the Publishing and Sharing Your Work chapter of G2N3C.