diff --git a/docs/assets/style.css b/docs/assets/style.css index b26b9ba..443745d 100644 --- a/docs/assets/style.css +++ b/docs/assets/style.css @@ -29,7 +29,7 @@ body { background: linear-gradient(#e32b0e 20%, #71010c 80%); -webkit-background-clip: text; /* Clip the background to the text */ -webkit-text-fill-color: transparent; /* Make the text color transparent */ - font-size: 150px; /* Change size of h1 headings */ + font-size: 90px; /* Change size of h1 headings */ } h1, diff --git a/docs/index.md b/docs/index.md index ce5c870..25b1c3d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -4,21 +4,27 @@ PhenEx ![Alt text](assets/phenex_feather_horizontal.png) -Implementing observational studies using real-world data (RWD) is challenging, requiring expertise in epidemiology, medical practice, and data engineering. Currently, observational studies are often implemented as bespoke software packages by individual data analysts or small teams. While tools exist, such as open-source tools from the OHDSI program in the R language and proprietary tools from vendors like Aetion or Panalgo, there are no open-source tools for Python-based implementation of observational studies using RWD. +Implementing observational studies using real-world data (RWD) is challenging, requiring expertise in epidemiology, medical practice, statistics and data engineering. Observational studies are often implemented as bespoke software packages by individual data analysts or small teams. While tools exist to help, such as open-source tools from the [OHDSI community](https://ohdsi.github.io/Hades/) in the R language and proprietary tools, they are typically bound to a specific data model (e.g. [OMOP](https://ohdsi.github.io/CommonDataModel/cdm54.html)) and limited in their ability to express and implement complicated medical definitions. -PhenEx (Automated Phenotype Extraction) aims to fill this gap. 
PhenEx is a Python-based software package that aims to provide reusuable and end-to-end tested implementations of commonly performed operations in the implementation of observational studies. PhenEx is designed with a focus on ease of writing and reading cohort definitions. Medical domain knowledge should be clear and simple, without requiring an understanding of complex data schemas. Ideally, a cohort definition should read like free text. +PhenEx (Automated Phenotype Extraction) fills this gap. PhenEx is a Python-based software package that provides reusable and end-to-end tested implementations of commonly performed operations in the implementation of observational studies. The main advantages of PhenEx are: + +- **Arbitrarily complex medical definitions**: Build medical definitions that depend on diagnoses, labs, procedures, and encounter context, as well as on other medical definitions. +- **Data-model agnostic**: Work with almost any RWD dataset with only minimal mappings. Only map the data needed for the study execution. Use the ontologies native to your dataset. +- **Portable**: Built on top of [ibis](https://ibis-project.org/), PhenEx works with any backend that ibis supports, including Snowflake, PySpark and many more! +- **Intuitive interface**: Study specification in PhenEx mirrors a plain-language description of the study. +- **High test coverage**: Extensive tests give confidence that results are correct. ## Basics of PhenEx design -### The Phenotype class +### Electronic phenotypes -The most basic concept in PhenEx is the phenotype. A Phenotype is a set of criteria that define a cohort of patients. In a clinical setting, a Phenotype is usually identified by the phrase "patient presents with ...". For example, a phenotype could be "patient presents with diabetes". In the observational setting, we would cacluate the phenotype "patient presents with diabetes" by looking for patients who have a diagnosis of diabetes in their medical record in certain time frame. 
+The most basic concept in PhenEx is the (electronic) [phenotype](https://rethinkingclinicaltrials.org/chapters/conduct/electronic-health-records-based-phenotyping/electronic-health-records-based-phenotyping-introduction/). A phenotype defines a set of patients that share some physiological state. In a clinical setting, a phenotype is usually identified by the phrase "patient presents with ...". For example, a phenotype could be "patient presents with diabetes". In the observational setting, we would calculate the phenotype "patient presents with diabetes" by looking for patients who have a diagnosis of diabetes in their medical record within a certain time frame. -A phenotype can reference other phenotypes. For instance, the phenotype "untreated diabetic patients" might translate to real-world data as "having a diagnosis of diabetes but not having a prescription for insulin or metformin". In this case, the prescription phenotype refers to the diabetes phenotype to build the overall phenotype. In PhenEx, your job is to simply specify these criteria. PhenEx will take care of the rest. +A phenotype can reference other phenotypes. For instance, the phenotype "untreated diabetic patients" might translate to real-world data as "having a diagnosis of diabetes but not having a prescription for insulin or metformin". In this case, we create a medication phenotype that refers to the diabetes phenotype to build the overall phenotype. In PhenEx, your job is to simply specify these criteria. PhenEx will take care of the rest. 
All studies are built through the calculation of various phenotypes: -- entry criterion phenotype +- entry criterion phenotype (defines an index date) - inclusion phenotypes - exclusion phenotypes - baseline characteristic phenotypes, and @@ -26,12 +32,36 @@ All studies are built through the calculation of various phenotypes: After defining the parameters of all these phenotypes in the study definition file, PhenEx will compute the phenotypes and return a cohort table, which contains the set of patients which satisfied all the inclusion / exclusion / entry criteria for the specified study. Additionally, a baseline characteristics table will be computed and reports generated, including a waterfall chart, the distributions of baseline characteristics. +### Phenotype classes + +In PhenEx, the concept of an electronic phenotype is encapsulated by Phenotype classes that expose all relevant parameters to express an electronic phenotype. These classes are designed to be reusable and composable, allowing complex phenotypes to be built from simpler ones. The foundational Phenotype classes include: + +| Phenotype Class | Identify patients using ... | Example | +| --------------------------- | --------------------------------------------------------- | ---------------------------------------------------------------------------------- | +| CodelistPhenotype | Medical code lists (e.g. 
ICD10CM, SNOMED, NDC, RxNorm) | All patients with a diagnosis code for atrial fibrillation one year prior to index | +| MeasurementPhenotype | Numerical values such as lab tests or observation results | All patients with systolic blood pressure greater than 160 | +| ContinuousCoveragePhenotype | Observation coverage data | One year continuous insurance coverage prior to index | +| AgePhenotype | Date of birth data | Age at date of first atrial fibrillation diagnosis | +| DeathPhenotype | Date of death data | Date of death after atrial fibrillation diagnosis | + +These foundational Phenotypes allow you to express complex constraints with just keyword arguments. For example, in CodelistPhenotype, you can specify that the diagnosis must have occurred in the inpatient setting or in the primary position in the outpatient setting. Phenotypes can refer to other Phenotypes in specifying their constraints. For example, "one year preindex" refers to another Phenotype which defines the index date. + +Furthermore, they can be combined using the following derived Phenotypes: + +| Phenotype Class | Identify patients using ... | Example | +| ------------------- | ------------------------------------------------- | ----------------------------------------------------------------------------------- | +| LogicPhenotype | Logical combinations of any other phenotypes | high blood pressure observation OR blood pressure medication in the baseline period | +| ArithmeticPhenotype | Mathematical combinations of any other phenotypes | BMI at index | +| ScorePhenotype | Logical combinations of any other phenotypes | CHADSVASc, CCI, HASBLED | + +Each phenotype class provides methods for defining the criteria and for evaluating the phenotype against a dataset. By using these classes, researchers can define complex phenotypes in a clear and concise manner, without needing to write custom code for each study. 
+ ### Architecture Below is an illustration of the basic design of the PhenEx in the evidence generation ecosystem. ![Architecture](assets/architecture.png) -# Getting started +## Getting started -To get started, please head over to our [tutorials](tutorials.md). +To get started, head over to our [tutorials](tutorials.md) to get a better feel for how the library works. Then, learn how to [install PhenEx](installation.md) and start using it yourself for your own studies. Any questions? Feel free to reach out or create a [github issue](https://github.com/Bayer-Group/PhenEx/issues). diff --git a/docs/installation.md b/docs/installation.md index ee03308..070dbd7 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -16,7 +16,7 @@ Coming soon! To install from source, run the following from within your virtual environment: ``` -git clone git@github.com:Bayer-Group/PhenEx.git && \ +git clone https://github.com/Bayer-Group/PhenEx.git && \ cd PhenEx && \ pip install -r requirements.txt && \ pip install . diff --git a/docs/tutorials/phenotypes/CodelistPhenotype_Tutorial.ipynb b/docs/tutorials/phenotypes/CodelistPhenotype_Tutorial.ipynb index 0115833..6485678 100644 --- a/docs/tutorials/phenotypes/CodelistPhenotype_Tutorial.ipynb +++ b/docs/tutorials/phenotypes/CodelistPhenotype_Tutorial.ipynb @@ -14,22 +14,22 @@ "\n", "\n", "
\n",
- "which patients had an atrial fibrillation diagnosis at any time in the data source?\n",
- "which patients had an ECG procedure performed at any time in the data source?\n",
- "which patients had an atrial fibrillation diagnosis one year prior to index date?\n",
- "which patients had an atrial fibrillation diagnosis one year after index date?\n",
- "which patients had an ECG performed one month prior to atrial fibrillation diagnosis?\n",
- "which patients had an atrial fibrillation diagnosis in the inpatient setting?\n",
- "which patients had an atrial fibrillation diagnosis in the inpatient setting or outpatient setting and primary diagnosis position?\n",
- "which patients had an ECG performed one month prior to an atrial fibrillation diagnosis in the inpatient setting and primary diagnosis position, with the diagnosis occurring one year prior to index date\n",
- "what was the date of the first atrial fibrillation diagnosis for patients that had an atrial fibrillation diagnosis\n",
- "which patients had an ecg performed within one month prior of the first atrial fibrillation diagnosis\n",
+ "Which patients had an atrial fibrillation diagnosis at any time in the data source?\n",
+ "Which patients had an ECG procedure performed at any time in the data source?\n",
+ "Which patients had an atrial fibrillation diagnosis one year prior to index date?\n",
+ "Which patients had an atrial fibrillation diagnosis one year after index date?\n",
+ "Which patients had an ECG performed one month prior to atrial fibrillation diagnosis?\n",
+ "Which patients had an atrial fibrillation diagnosis in the inpatient setting?\n",
+ "Which patients had an atrial fibrillation diagnosis in the inpatient setting or outpatient setting and primary diagnosis position?\n",
+ "Which patients had an ECG performed one month prior to an atrial fibrillation diagnosis in the inpatient setting and primary diagnosis position, with the diagnosis occurring one year prior to index date\n",
+ "What was the date of the first atrial fibrillation diagnosis for patients that had an atrial fibrillation diagnosis\n",
+ "Which patients had an ecg performed within one month prior of the first atrial fibrillation diagnosis\n",
"
\n", "\n", + "- What is the **date of the first** atrial fibrillation diagnosis at **any time** in the data source?\n", + "- What is the **date of the first** atrial fibrillation diagnosis occurring **one year after index date**? -->\n", "\n", "CodelistPhenotype makes it possible to answer all these questions, and many more. Let's see how...\n", "
\n", @@ -86,10 +86,10 @@ "We must first understand : our input data is in a **dictionary** where **keys = domains** and **values = input tables**. \n", "\n", "We need to pass the CodelistPhenotype one of these keys (a domain)! For our examples, we will be working with two domains: \n", - "- for atrial fibrillation, we are interested in diagnosis codes which are stored in the *condition_occurrence* table/domain\n", - "- for ECG's we are interested in procedures, which are stored in the *procedure_occurrence* table/domain\n", + "- for atrial fibrillation, we are interested in diagnosis codes, which are stored in the *condition_occurrence* table/domain\n", + "- for ECGs, we are interested in procedures, which are stored in the *procedure_occurrence* table/domain\n", "\n", - "*Note beyond* The reason these are called domains and not tables is because, in the background, phenx may work on raw tables **or** a subset of the raw tables, depending on the stage of execution.\n", + "*Note beyond* The reason these are called domains and not tables is because, in the background, PhenEx may work on raw tables **or** a subset of the raw tables, depending on the stage of execution.\n", "\n", "
\n", "
\n", @@ -116,14 +116,14 @@ "source": [ "from phenex.phenotypes import CodelistPhenotype\n", "# Ex.1 \n", - "# which patients had an atrial fibrillation diagnosis at **any time** in the data source?\n", + "# Which patients had an atrial fibrillation diagnosis at **any time** in the data source?\n", "af_phenotype = CodelistPhenotype(\n", " codelist = af_codelist,\n", " domain = 'condition_occurrence'\n", ")\n", "\n", "# Ex.2 \n", - "# which patients had an ECG procedure performed at **any time** in the data source?\n", + "# Which patients had an ECG procedure performed at **any time** in the data source?\n", "ecg_phenotype = CodelistPhenotype(\n", " codelist = ecg_codelist,\n", " domain = 'procedure_occurrence'\n", @@ -138,12 +138,12 @@ "These CodelistPhenotypes create tables containing only patients that have one or more occurrences of an atrial fibrillation code of type ICD10CM and ICD9CM at **any time** within the condition occurrence table.\n", "
\n", "
\n", - "#### A note on CodelistPhenotypes' *name_phenotype* argument...\n", - "Every phenotype requires a name in phenx. However, for simplicity, phenx attempts to find a name for phenotypes using information you enter the that phenotype. \n", + "#### A note on CodelistPhenotypes' *name* argument...\n", + "Every phenotype requires a name in PhenEx. However, for simplicity, PhenEx attempts to find a name for phenotypes using the information you enter for that phenotype. \n", "\n", - "For CodelistPhenotype, phenx will name set the *name_phenotype* to the name of the codelist, if the name of the codelist is specified. If the name of the codelist is *not* specified, an error will be thrown. \n", + "For CodelistPhenotype, PhenEx will set the *name* to the name of the codelist, if the name of the codelist is specified. If the name of the codelist is *not* specified, an error will be thrown. \n", "\n", - "As we will see in this tutorial, we will be using the atrial fibrillation and ecg codelists repeatedly; each phenotype that uses them will be identically named and will lead to errors. It is thereforebest practice to always define *name_phenotype* using a unique name! All following examples will specify *name_phenotype*...\n", + "As we will see in this tutorial, we will be using the atrial fibrillation and ECG codelists repeatedly; phenotypes that rely on the default name would all be identically named, which leads to errors. It is therefore best practice to always define *name* using a unique name! All following examples will specify *name*...\n", "
\n", "
\n", "
" @@ -204,7 +204,7 @@ ")\n", "\n", "# Ex.3\n", - "# which patients had an atrial fibrillation diagnosis **one year prior to index date**?\n", + "# Which patients had an atrial fibrillation diagnosis **one year prior to index date**?\n", "one_year_before_index = RelativeTimeRangeFilter(\n", " when=\"before\", \n", " min_days = GreaterThanOrEqualTo(0),\n", @@ -220,7 +220,7 @@ "\n", "\n", "# Ex.4\n", - "# which patients had an atrial fibrillation diagnosis **one year after index date**?\n", + "# Which patients had an atrial fibrillation diagnosis **one year after index date**?\n", "one_year_after_index = RelativeTimeRangeFilter(\n", " when=\"after\", \n", " min_days = GreaterThanOrEqualTo(0),\n", @@ -249,7 +249,7 @@ "\n", "Another common pattern is to define time ranges in relation to other phenotypes. In this case, we explicitely set the anchor to the date returned by some other phenotype.\n", "\n", - "Phenotypes do not return dates by default. It is therefore important to remember to define **which date** the anchor phenotype should return, as this greatly affects your query. The options for return date are 'first', 'last', and 'all'; see the 'return_date' section below for more information.\n", + "Phenotypes do not return dates by default. It is therefore important to remember to define **Which date** the anchor phenotype should return, as this greatly affects your query. The options for return date are 'first', 'last', and 'all'; see the 'return_date' section below for more information.\n", "\n", "**Note :** The components of EntryPhenotype **must** define an anchor phenotype if using RelativeTimeRangeFilter, as no index date is defined. See the tutorial on EntryPhenotype for more details." 
] @@ -277,7 +277,7 @@ ")\n", "\n", "# Ex.5\n", - "# which patients had an ECG performed **one month prior** to atrial fibrillation diagnosis?\n", + "# Which patients had an ECG performed **one month prior** to atrial fibrillation diagnosis?\n", "\n", "# Create the anchor phenotype\n", "af_phenotype = CodelistPhenotype(\n", @@ -342,7 +342,7 @@ ")\n", "\n", "# Ex.6\n", - "# which patients had an atrial fibrillation diagnosis **in the inpatient setting**?\n", + "# Which patients had an atrial fibrillation diagnosis **in the inpatient setting**?\n", "\n", "inpatient_setting = CategoricalFilter(columnname = 'encounter_type', allowed_values = ['inpatient'])\n", "\n", @@ -384,7 +384,7 @@ ")\n", "\n", "# Ex.7\n", - "# which patients had an atrial fibrillation diagnosis **in the inpatient setting and primary diagnosis position**?\n", + "# Which patients had an atrial fibrillation diagnosis **in the inpatient setting and primary diagnosis position**?\n", "\n", "# create all necessary component categorical filters\n", "inpatient_setting = CategoricalFilter(columnname = 'encounter_type', allowed_values = ['inpatient'])\n", @@ -436,7 +436,7 @@ ")\n", "\n", "# Ex.8\n", - "# which patients had an ECG performed **one month prior** to an atrial fibrillation \n", + "# Which patients had an ECG performed **one month prior** to an atrial fibrillation \n", "# diagnosis **in the inpatient setting and primary diagnosis position**, with the diagnosis\n", "# occurring **one year prior to index date**\n", "\n", @@ -507,7 +507,7 @@ "\n", "\n", "# Ex.9\n", - "# what was the date of the first atrial fibrillation diagnosis for patients \n", + "# What was the date of the first atrial fibrillation diagnosis for patients \n", "# that had an atrial fibrillation diagnosis\n", "af_phenotype = CodelistPhenotype(\n", " name = 'af_date_first_diagnosis',\n", @@ -517,7 +517,7 @@ ")\n", "\n", "# Ex.10\n", - "# which patients had an ecg performed within one month prior of the first \n", + "# Which patients 
had an ecg performed within one month prior of the first \n", "# atrial fibrillation diagnosis?\n", "ecg_phenotype = CodelistPhenotype(\n", " codelist = ecg_codelist,\n", diff --git a/phenex/phenotypes/computation_graph_phenotypes.py b/phenex/phenotypes/computation_graph_phenotypes.py index 2d7b281..9f53482 100644 --- a/phenex/phenotypes/computation_graph_phenotypes.py +++ b/phenex/phenotypes/computation_graph_phenotypes.py @@ -26,9 +26,9 @@ class ComputationGraphPhenotype(Phenotype): Attributes: expression (ComputationGraph): The arithmetic expression to be evaluated composed of phenotypes combined by python arithmetic operations. return_date (Union[str, Phenotype]): The date to be returned for the phenotype. Can be "first", "last", or a Phenotype object. - _operate_on (str): The column to operate on. Can be "boolean" or "value". - _populate (str): The column to populate. Can be "boolean" or "value". - _reduce (bool): Whether to reduce the phenotype table to only include rows where the boolean column is True. This is only relevant if _populate is "boolean". + operate_on (str): The column to operate on. Can be "boolean" or "value". + populate (str): The column to populate. Can be "boolean" or "value". + reduce (bool): Whether to reduce the phenotype table to only include rows where the boolean column is True. This is only relevant if populate is "boolean". 
""" def __init__( @@ -37,18 +37,18 @@ def __init__( return_date: Union[str, Phenotype], name: str = None, aggregation_index=["PERSON_ID"], - _operate_on: str = "boolean", - _populate: str = "value", - _reduce: bool = False, + operate_on: str = "boolean", + populate: str = "value", + reduce: bool = False, ): super(ComputationGraphPhenotype, self).__init__() self.computation_graph = expression self.return_date = return_date self.aggregation_index = aggregation_index self._name = name - self._operate_on = _operate_on - self._populate = _populate - self._reduce = _reduce + self.operate_on = operate_on + self.populate = populate + self.reduce = reduce self.children = self.computation_graph.get_leaf_phenotypes() @property @@ -73,16 +73,16 @@ def _execute(self, tables: Dict[str, Table]) -> PhenotypeTable: """ joined_table = hstack(self.children, tables["PERSON"].select("PERSON_ID")) - if self._populate == "value" and self._operate_on == "boolean": + if self.populate == "value" and self.operate_on == "boolean": for child in self.children: column_name = f"{child.name}_BOOLEAN" joined_table = joined_table.mutate( **{column_name: joined_table[column_name].cast(float)} ) - if self._populate == "value": + if self.populate == "value": _expression = self.computation_graph.get_value_expression( - joined_table, operate_on=self._operate_on + joined_table, operate_on=self.operate_on ) joined_table = joined_table.mutate(VALUE=_expression) # Arithmetic operations imply a boolean 'and' of children i.e. child1 + child two implies child1 and child2. if there are any null values in value calculations this is because one of the children is null, so we filter them out as the implied boolean condition is not met. 
@@ -90,7 +90,7 @@ def _execute(self, tables: Dict[str, Table]) -> PhenotypeTable: - elif self._populate == "boolean": + elif self.populate == "boolean": _expression = self.computation_graph.get_boolean_expression( - joined_table, operate_on=self._operate_on + joined_table, operate_on=self.operate_on ) joined_table = joined_table.mutate(BOOLEAN=_expression) @@ -110,7 +110,7 @@ def _execute(self, tables: Dict[str, Table]) -> PhenotypeTable: joined_table = joined_table.mutate(EVENT_DATE=ibis.null(date)) # Reduce the table to only include rows where the boolean column is True - if self._reduce: + if self.reduce: joined_table = joined_table.filter(joined_table.BOOLEAN == True) # Add a null value column if it doesn't exist, for example in the case of a LogicPhenotype @@ -199,8 +199,8 @@ def __init__( super(ScorePhenotype, self).__init__( expression=expression, return_date=return_date, - _operate_on="boolean", - _populate="value", + operate_on="boolean", + populate="value", ) @@ -231,8 +231,8 @@ def __init__( super(ArithmeticPhenotype, self).__init__( expression=expression, return_date=return_date, - _operate_on="value", - _populate="value", + operate_on="value", + populate="value", ) @@ -265,7 +265,7 @@ def __init__( super(LogicPhenotype, self).__init__( expression=expression, return_date=return_date, - _operate_on="boolean", - _populate="boolean", - _reduce=True, + operate_on="boolean", + populate="boolean", + reduce=True, ) diff --git a/phenex/sim.py b/phenex/sim.py new file mode 100644 index 0000000..7de3456 --- /dev/null +++ b/phenex/sim.py @@ -0,0 +1,46 @@ +from typing import Dict +from phenex.mappers import DomainsDictionary +import pandas as pd +import numpy as np +from dataclasses import asdict + + +def generate_mock_mapped_tables(n_patients: int, domains: DomainsDictionary) -> Dict[str, pd.DataFrame]: + """ + Generate fake data for N patients based on the given domains. + + Args: + n_patients (int): The number of patients to generate data for. 
+ domains (DomainsDictionary): The domains dictionary containing the table mappers. + + Returns: + Dict[str, pd.DataFrame]: A dictionary where keys are domain names and values are DataFrames with fake data. + """ + fake_data = {} + for domain, mapper in domains.domains_dict.items(): + columns = [field for field in asdict(mapper).keys() if field != "NAME_TABLE"] + data = {} + for col in columns: + if "DATE" in col: + start_date = pd.to_datetime('2000-01-01') + end_date = pd.to_datetime('2020-12-31') + data[col] = pd.to_datetime(np.random.randint(start_date.value, end_date.value, n_patients)).date + elif "ID" in col: + data[col] = np.arange(1, n_patients + 1) + elif "VALUE" in col: + data[col] = np.random.uniform(0, 100, n_patients) + elif "CODE_TYPE" in col: + if "CONDITION" in domain: + data[col] = np.random.choice(['ICD-10', 'SNOMED'], n_patients) + elif "DRUG" in domain: + data[col] = np.random.choice(['NDC', 'RxNorm'], n_patients) + elif "PROCEDURE" in domain: + data[col] = np.random.choice(['CPT', 'HCPCS'], n_patients) + else: + data[col] = np.random.choice(['TYPE1', 'TYPE2'], n_patients) + elif "CODE" in col: + data[col] = np.random.choice(['A', 'B', 'C', 'D', 'E', 'F', 'G'], n_patients) + else: + data[col] = np.random.choice(range(1000), n_patients) + fake_data[domain] = pd.DataFrame(data) + return fake_data
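
The column-name dispatch used by `generate_mock_mapped_tables` can be tried without PhenEx installed. The sketch below is a stdlib-only approximation, not the PhenEx implementation: `ToyConditionMapper`, `mock_column` and `mock_tables` are hypothetical names introduced for illustration, plain lists stand in for the pandas DataFrames the real function returns, and only the condition branch of the `CODE_TYPE` logic is reproduced.

```python
import random
from dataclasses import asdict, dataclass
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class ToyConditionMapper:
    """Hypothetical stand-in for a PhenEx table mapper (not the real class)."""

    NAME_TABLE: str = "CONDITION_OCCURRENCE"
    PERSON_ID: str = "PERSON_ID"
    EVENT_DATE: str = "EVENT_DATE"
    CODE: str = "CODE"
    CODE_TYPE: str = "CODE_TYPE"


def mock_column(col: str, domain: str, n: int, rng: random.Random) -> List:
    """Choose a generator for one column from substrings of its name."""
    if "DATE" in col:
        start = date(2000, 1, 1)
        span_days = (date(2020, 12, 31) - start).days
        return [start + timedelta(days=rng.randrange(span_days)) for _ in range(n)]
    if "ID" in col:
        # One row per patient, numbered from 1.
        return list(range(1, n + 1))
    if "VALUE" in col:
        return [rng.uniform(0, 100) for _ in range(n)]
    if "CODE_TYPE" in col:
        # Checked before the broader "CODE" test, otherwise it would never match.
        choices = ["ICD-10", "SNOMED"] if "CONDITION" in domain else ["TYPE1", "TYPE2"]
        return [rng.choice(choices) for _ in range(n)]
    if "CODE" in col:
        return [rng.choice("ABCDEFG") for _ in range(n)]
    return [rng.randrange(1000) for _ in range(n)]


def mock_tables(n: int, mappers: Dict[str, object], seed: int = 0) -> Dict[str, Dict[str, List]]:
    """Build one mock table (column name -> list of values) per domain."""
    rng = random.Random(seed)
    tables = {}
    for domain, mapper in mappers.items():
        # As in the patch, every mapper field except NAME_TABLE becomes a column.
        columns = [c for c in asdict(mapper) if c != "NAME_TABLE"]
        tables[domain] = {c: mock_column(c, domain, n, rng) for c in columns}
    return tables


tables = mock_tables(5, {"CONDITION_OCCURRENCE": ToyConditionMapper()})
```

Because the dispatch relies on substring matching, the order of the checks matters: a column such as `CODE_TYPE` has to be handled before the generic `CODE` branch, which is also how the patch orders them.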