updated study tutorial

Bayer-Group · Jan 23, 2025 · 2ec4d6e
1 parent 1d06479
commit 2ec4d6e
Showing 1 changed file with 167 additions and 32 deletions.
diff --git a/docs/tutorials/PhenEx_Study_Tutorial.ipynb b/docs/tutorials/PhenEx_Study_Tutorial.ipynb
@@ -150,19 +150,19 @@
    "id": "f7ca99f7-8ed0-40ac-b6e6-8fd3000989a1",
    "metadata": {},
    "source": [
-    "## Define input data structure"
+    "## Tell PhenEx about the input data structure"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "6b1bd3bb-c3a3-4a71-a4c2-8f1652fdcf12",
    "metadata": {},
    "source": [
-    "PhenEx needs to know a little bit about the structure of the input data in order to help us make phenotypes and cohorts.\n",
+    "PhenEx is designed to be data-model agnostic, i.e. it does not require you to transform your data. However, PhenEx does need to know a little about the structure of the input data in order to help us build phenotypes and cohorts.\n",
     "\n",
-    "What this means is that PhenEx knows in what table and column to find information such as patient id, year of birth, diagnosis events, etc. This information is generally present in all RWD sources, but for each data source, is (1) organized in a different way and (2) can have different column names.\n",
+    "What this means is that PhenEx needs to know in which table and column to find information such as patient id, year of birth, diagnosis events, etc. This information is generally present in all real-world data (RWD) sources, but in each data source it (1) is organized in a different way and (2) can have different column names.\n",
     "\n",
-    "When using a new data source, we need to onboard that database for usage with PhenEx (tell it about table structure and column names). Go to the [tutorial on onboarding a new database](/2_Onboarding_New_Database.ipynb) to learn how to onboard a database.\n",
+    "When using a new data source, we need to onboard that database (once!) for use with PhenEx, i.e. tell it about the table structure and column names. Go to the [tutorial on onboarding a new database](/2_Onboarding_New_Database.ipynb) to learn how to onboard a database.\n",
     "\n",
     "For the purposes of this tutorial, we will be using OMOP data, which is already onboarded and available in the PhenEx library. All we have to do is import the OMOPDomains and then get the mapped tables."
    ]
@@ -175,46 +175,64 @@
    "outputs": [],
    "source": [
     "from phenex.mappers import OMOPDomains\n",
-    "omop_mapped_tables = OMOPDomains.get_mapped_tables(con)\n",
-    "omop_domains = list(omop_mapped_tables.keys())\n",
-    "omop_domains"
+    "omop_mapped_tables = OMOPDomains.get_mapped_tables(con)"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "45fb5bb0-102c-4194-b3e9-aad9c6373615",
+   "cell_type": "markdown",
+   "id": "298a929e-229d-401c-bdcd-f62d3703c5f5",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "omop_mapped_tables['PERSON']"
+    "### Looking at input data\n",
+    "PhenEx bundles all input data from a single data source into a Python dictionary, in this case in the variable called omop_mapped_tables. After this step, we no longer need to deal with the input data directly; everything for this data source is available in the omop_mapped_tables dictionary.\n",
+    "\n",
+    "The dictionary keys are the names of the 'domains' within OMOP. Let's look at the available domains."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "7f361f6c-b698-4b4d-8d48-551976a16b4f",
+   "id": "45fb5bb0-102c-4194-b3e9-aad9c6373615",
    "metadata": {},
    "outputs": [],
    "source": [
-    "omop_mapped_tables['PERSON'].table.select('PERSON_ID')"
+    "list(omop_mapped_tables.keys())"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "298a929e-229d-401c-bdcd-f62d3703c5f5",
+   "id": "dc11097b-7fb8-4074-8b0b-194376d97f54",
    "metadata": {},
    "source": [
-    "### Looking at input data\n",
-    "PhenEx bundles all input data into a dictionary, in this case in the variable called omop_mapped_tables. The keys in this dictionary are known as 'domains'; for example, there is the '"
+    "We see that there are several domains. As we will see later, from now on we tell PhenEx which table to use by passing these keys.\n",
+    "\n",
+    "We can also explore these tables interactively; just access the value (table) of the domain you are interested in."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7f361f6c-b698-4b4d-8d48-551976a16b4f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "omop_mapped_tables['PERSON']"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "3f83ea41-de39-49dc-9692-5d5cf8d73e2e",
    "metadata": {},
    "source": [
-    "# Entry criterion"
+    "# Building a cohort\n",
+    "We are now ready to build a cohort using the OMOP data that we have made available to PhenEx.\n",
+    "\n",
+    "## Step 1: Define an entry criterion\n",
+    "The entry criterion is the phenotype that defines the index date of your cohort.\n",
+    "\n",
+    "**Note on index dates:** The concept of an index date comes from prospective clinical trials; put simply, it is the date on which the patient enters the clinical trial, i.e. day 0 of data collection. With real-world data sources, we are generally performing retrospective studies on data that was collected in the past. Regardless, it is standard practice in observational studies to define an index date for each patient, and the index date is defined by some medical event or phenotypic feature of that patient.\n",
+    "\n",
+    "Here we will create a cohort that has an index date set at the 'date of first instance of atrial fibrillation diagnosis' for each patient. See the CodelistPhenotype tutorial to learn more about how to define codelist phenotypes."
    ]
   },
   {
@@ -237,6 +255,14 @@
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "9aef68ff-c2a4-43de-af98-4784ae226fc3",
+   "metadata": {},
+   "source": [
+    "Once we've created our phenotype, we're ready to move on. However, if you want to, you can already execute a single phenotype and look at its output (though this is not necessary). This is helpful for sanity checking the construction of your phenotypes, or for seeing whether any patients at all fulfill the phenotypic criteria you entered."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -253,7 +279,70 @@
    "id": "06d60576-81bb-4cd8-b549-2cc5941b2b55",
    "metadata": {},
    "source": [
-    "# Inclusions"
+    "## Step 2: Define inclusion criteria (optional)\n",
+    "Next we define additional phenotypic features a patient must have in order to be part of our cohort. We usually see these as a list in study definitions. For example, here we require patients to be 18 or older.\n",
+    "\n",
+    "In PhenEx, we simply create a phenotype for each inclusion criterion using the provided phenotype classes, and then create a list containing each of these phenotypes. We will later pass this list to the cohort."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0e1c43a0-4374-4162-837e-9359bd86db08",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from phenex.phenotypes.age_phenotype import AgePhenotype\n",
+    "from phenex.filters import GreaterThanOrEqualTo\n",
+    "\n",
+    "age_ge18 = AgePhenotype(anchor_phenotype=entry, min_age=GreaterThanOrEqualTo(18))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a6c1a41b-294e-4183-a7e9-d5b469178f22",
+   "metadata": {},
+   "source": [
+    "Remember that we can check the results of a phenotype directly (though this is NOT required)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5b497111-6074-4ea4-9904-a1339bd4ff70",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "age_ge18.execute(omop_mapped_tables)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6844b533-2dd5-4383-aa1b-3cf53a58331e",
+   "metadata": {},
+   "source": [
+    "Finally, create the list of inclusion phenotypes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "947de6f2-aeeb-4153-88b8-81e31e932f7e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "inclusions = [age_ge18]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2147f629-d916-4951-8307-06e08fb39f39",
+   "metadata": {},
+   "source": [
+    "## Step 3: Define exclusion criteria (optional)\n",
+    "Cohort definitions often list things patients should **not** have in order to be considered part of our study cohort. These are exclusion criteria. PhenEx handles these similarly to inclusion criteria; simply create the individual phenotypes that patients should not have and bundle them together in a list called 'exclusions'.\n",
+    "\n",
+    "Here we create a slightly more complicated phenotype: we are excluding all patients who had an emergency room visit for myocardial infarction within the 90 days prior to their index date."
    ]
   },
   {
@@ -268,15 +357,20 @@
     "from phenex.filters.categorical_filter import CategoricalFilter\n",
     "from phenex.filters.relative_time_range_filter import RelativeTimeRangeFilter\n",
     "\n",
+    "# define 'emergency room visit'\n",
     "inpatient = CategoricalFilter(\n",
     "    column_name='VISIT_DETAIL_SOURCE_VALUE', \n",
     "    allowed_values=['22'], \n",
     "    domain='VISIT_DETAIL'\n",
     ")\n",
     "\n",
+    "# define the pre-index time period in which to search for MI codes\n",
     "preindex = RelativeTimeRangeFilter(max_days=Value('<', 90), anchor_phenotype=entry)\n",
     "\n",
+    "# define MI codes of interest\n",
     "mi_codelist = Codelist([49601007])\n",
+    "\n",
+    "# create exclusion phenotype\n",
     "mi_emergency_preindex = CodelistPhenotype(\n",
     "    name='hf',\n",
     "    domain='condition_occurrence'.upper(),\n",
@@ -289,32 +383,30 @@
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a515f2e4-47fc-4976-a8b4-b4d3e1e3281d",
+   "cell_type": "markdown",
+   "id": "4e4ef38e-9da2-4b14-ad40-1b0ab53147e5",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "mi_emergency_preindex.execute(omop_mapped_tables)\n",
-    "mi_emergency_preindex.table.head(5).to_pandas()"
+    "As before, we can execute this phenotype immediately as a sanity check (not required)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "947de6f2-aeeb-4153-88b8-81e31e932f7e",
+   "id": "a515f2e4-47fc-4976-a8b4-b4d3e1e3281d",
    "metadata": {},
    "outputs": [],
    "source": [
-    "inclusions = [mi_emergency_preindex]"
+    "mi_emergency_preindex.execute(omop_mapped_tables)\n",
+    "mi_emergency_preindex.table.head(5).to_pandas()"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "2147f629-d916-4951-8307-06e08fb39f39",
+   "id": "6071eed8-40c1-460e-bad9-1b64fa7c08de",
    "metadata": {},
    "source": [
-    "# Exclusions"
+    "Create the final list of exclusion criteria."
    ]
   },
   {
@@ -324,15 +416,22 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "exclusions = []"
+    "exclusions = [mi_emergency_preindex]"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "eb7c85e0-e4e5-42de-80ce-5abce4b54624",
    "metadata": {},
    "source": [
-    "# Characteristics"
+    "## Step 4: Define baseline characteristics (optional)\n",
+    "We are often interested in characterizing patients at index date. For example, we might want to know how old they are at index.\n",
+    "\n",
+    "In PhenEx, we simply create a list of phenotypes, just as we did for the inclusion and exclusion criteria.\n",
+    "\n",
+    "**Note:** inclusion criteria, exclusion criteria and baseline characteristics are defined at or before the index date; they should not use 'future' data (i.e. data after the index date)! PhenEx does NOT currently check whether future data is being used. It is up to the user to design phenotypes with appropriate relative time range filters that only look at the pre-index period.\n",
+    "\n",
+    "In this example, we are simply interested in the age of patients at index date."
    ]
   },
   {
@@ -348,12 +447,23 @@
     "characteristics = [age]"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3f6a513e-8ed2-4363-8a4b-47e64a924b03",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "age.execute(omop_mapped_tables)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "b0ec0903-d401-48b4-85a5-5fd73747e341",
    "metadata": {},
    "source": [
-    "# Cohort"
+    "## Step 5: Build the cohort\n",
+    "In this step we take all the pieces we defined above and put them together into a cohort. Simply instantiate a cohort, give it a name, and pass it the entry, inclusion, exclusion and baseline characteristic phenotypes defined above."
    ]
   },
   {
@@ -374,6 +484,14 @@
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "5bfa25aa-f3f8-4358-810b-00358c858034",
+   "metadata": {},
+   "source": [
+    "We now execute the full cohort, passing it the required tables."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -384,6 +502,23 @@
     "cohort.execute(omop_mapped_tables)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "c802b0a9-c470-41bf-a3a5-10746126237d",
+   "metadata": {},
+   "source": [
+    "## Viewing cohort summary\n",
+    "After execution of the cohort, PhenEx has created a number of standard tables and can produce some standard readouts.\n",
+    "\n",
+    "### Tables created\n",
+    "1. index table : contains patients that fulfill all in/exclusion criteria, with index date\n",
+    "2. inclusion table : a feature table with only inclusion criteria (rows are patients, columns are inclusion criteria)\n",
+    "3. exclusion table : a feature table with only exclusion criteria (rows are patients, columns are exclusion criteria)\n",
+    "4. characteristics table : a feature table of all baseline characteristics\n",
+    "\n",
+    "### Reports can be used to produce readouts"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,