This project contains a set of files for processing and managing structured data from various sources. While the example focuses on clinical trials, the system is designed to be flexible and can be adapted for any dictionary-like data, including sports statistics, retail inventory, financial data, and more.
-
sample_ingestion_config.yaml- Contains configuration settings for different data sets (e.g., TrialA, TrialB, TrialC)
- Specifies data file paths, subsets to include, and processing options
-
sample_ingestion_data.yaml- Stores detailed data for each data set, including:
- Names and identifiers
- Subset information (e.g., cohorts, categories)
- Schedules or time-based information
- Data sources
- Stores detailed data for each data set, including:
-
samplecode_dataingestion.py- Main Python script for processing data
- Includes classes for different data types (e.g., TrialA, TrialB, TrialC)
- Implements data loading, subset management, and scheduling functionality
TrialBase: Base class for all data types (can be renamed for different contexts)Cohort: Handles subset-related logic and data processingSchedule: Manages time-based aspects of data setscreate_import: Factory function to create import jobs for different endpoints
To run the data ingestion process:
-
Ensure all YAML files are in the correct locations as specified in the configuration.
-
Run the
samplecode_dataingestion.pyscript:python samplecode_dataingestion.py
This will create and run import jobs for the TrialB configuration, demonstrating both scheduling and cohort functionality.
To process different data sets or modify existing ones:
- Update the
sample_ingestion_config.yamlfile with the desired data set configurations. - Modify the
sample_ingestion_data.yamlfile to include or update your data. - Modify or add a class for each "Trial" (or equivalent data set) which you expect to find in the sample ingestion data. These classes should inherit from
TrialBaseand implement any specific logic needed for that data type. - Adjust the main script (
samplecode_dataingestion.py) to create import jobs for the desired data sets and endpoints. - If needed, organize the code into separate files based on functionality or data types. For example:
data_models.py: Contains the base classes and data type-specific classesdata_processors.py: IncludesCohort,Schedule, and other processing logicimport_jobs.py: Handles the creation and execution of import jobsdata_validation.py: Implements data quality checks and validation logicmain.py: Orchestrates the overall data processing flow
To ensure data integrity and reliability, implement data validation and quality checks throughout the pipeline:
-
Create a
data_validation.pyfile to centralize validation logic:- Define functions for different types of checks (e.g., data type validation, range checks, consistency checks)
- Implement domain-specific validation rules
-
Integrate validation checks at key points in the pipeline:
- During data ingestion: Validate input data format and required fields
- Before processing: Check for data completeness and consistency
- After processing: Verify output data meets expected criteria
-
Implement logging and error handling for validation issues:
- Use Python's logging module to record validation results
- Raise custom exceptions for critical validation failures
- Implement error recovery or fallback mechanisms where appropriate
-
Add configuration options for validation:
- Allow users to specify required vs. optional checks
- Provide options to set threshold values for numeric validations
-
Create summary reports of data quality:
- Generate statistics on data completeness, consistency, and quality
- Provide visualizations of data quality metrics
Example of a basic validation function in data_validation.py:
def validate_cohort_data(cohort_data):
errors = []
if 'patient_count' not in cohort_data:
errors.append("Missing required field: patient_count")
if 'dose' not in cohort_data:
errors.append("Missing required field: dose")
if not isinstance(cohort_data.get('d', []), list):
errors.append("Field 'd' must be a list")
return errorsIntegrate this validation into your data processing pipeline to ensure data quality at every stage.
This framework can be adapted to various domains by adjusting the terminology and implementing domain-specific logic:
- Sports: Replace "Trial" with "League", "Cohort" with "Team", etc.
- Retail: Use "Store" instead of "Trial", "ProductCategory" instead of "Cohort", etc.
- Finance: Adapt to "Portfolio", "AssetClass", "TradingSchedule", etc.
Modify the base classes, processing logic, and validation rules to match the requirements of your specific domain.
This project is a sample implementation and may require additional error handling and feature enhancements for production use. Always ensure proper data validation and error handling when adapting this code for different domains or data sources.