Feat 363 analysis step #371

Closed
wants to merge 14 commits into from
78 changes: 78 additions & 0 deletions src/aind_data_schema/datamanipulation.py
@@ -0,0 +1,78 @@
""" schema for processing """

from __future__ import annotations

from datetime import datetime
from enum import Enum
from typing import Any, Dict, List, Optional

from pydantic import Field

from aind_data_schema.base import AindCoreModel, AindModel
from aind_data_schema.imaging.tile import Tile
from aind_data_schema.processing import ProcessName


class TouchType(Enum):
    ANALYSIS = "Analysis"
    PROCESSING = "Processing"

## These could be subtypes of analysis/processing

class ProcessType(Enum):
Collaborator

how is this different from the ProcessType in processing.py?

Collaborator

sorry, ProcessName

Contributor Author

processing.py currently does not have a ProcessType class, in my implementation at least. ProcessName would have values that should go here, most likely. I just had some examples in so that we could talk about the general structure first.

    CURATION = "Curation"
    SORTING = "Sorting"

class AnalysisType(Enum):
    uhhh = "some analysis"




class DataTouchInfo:
Collaborator

need doc strings for this class

Collaborator

Does this need the code manipulation class? (or code information?)

Contributor Author

DataTouch is imported by AnalysisStep and ProcessStep. There are two structure options here:

DataTouch (overarching "Step" class):

  • class for basic info
  • either process/analysis tag
  • class for relevant step (analysis/processing)
  • class for codemanipulation info

But it seems you prefer separating analysis and processing, so the alternative looks like:

Process:

  • basic info about process (name/desc)
  • class for DataTouchInfo
  • class for codemanipulation info

Analysis:

  • basic info about analysis (name/desc)
  • class for DataTouchInfo
  • class for codemanipulation info

And some file tracking DataManipulations (we should maybe just change to this name, since DataTouch is so weird...) would be a series of nodes (Process + Analysis steps) tracking the history.
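
The second option above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses standard-library dataclasses in place of the pydantic models in the diff, the field lists are trimmed, and `Process`, `Analysis`, and the `DataManipulations` container are hypothetical names taken from the comment rather than classes that exist in the repository.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional, Union


@dataclass
class DataTouchInfo:
    """Timing and location details shared by every step."""

    start_date_time: datetime
    end_date_time: datetime
    input_location: str
    output_location: str
    notes: Optional[str] = None


@dataclass
class CodeManipulation:
    """Information about the code a step used."""

    code_url: str
    input_parameters: Dict[str, Any]
    code_version: Optional[str] = None


@dataclass
class Process:
    """Hypothetical processing node: basic info plus the two shared classes."""

    name: str
    touch_info: DataTouchInfo
    code: CodeManipulation


@dataclass
class Analysis:
    """Hypothetical analysis node with the same shape as Process."""

    name: str
    touch_info: DataTouchInfo
    code: CodeManipulation


@dataclass
class DataManipulations:
    """Hypothetical history container: an ordered series of step nodes."""

    steps: List[Union[Process, Analysis]] = field(default_factory=list)


# Placeholder values, for illustration only.
info = DataTouchInfo(datetime(2023, 1, 1, 9), datetime(2023, 1, 1, 10), "/in", "/out")
code = CodeManipulation(code_url="https://example.com/repo", input_parameters={})
history = DataManipulations(
    steps=[Process("Sorting", info, code), Analysis("some analysis", info, code)]
)
step_kinds = [type(step).__name__ for step in history.steps]
```

In this layout the step kind is carried by the class itself, so no separate process/analysis tag is needed on each node.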

Collaborator

Sorry, I realize I got confused - too many different things swimming in my head at once. We want processing and analysis together - but I think then I missed something. Because the DataTouch is still the step, and we need something that has the list of steps, for both processing and analysis.
(and apologies if you just removed it in response to this)
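
One way to picture "DataTouch is the step, plus something that holds the list of steps" is sketched below. It assumes plain dataclasses rather than the pydantic base classes, keeps only a name and the `TouchType` tag on each step, and uses a hypothetical `DataTouchHistory` container with a hypothetical `of_type` helper; none of these appear in the diff.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TouchType(Enum):
    ANALYSIS = "Analysis"
    PROCESSING = "Processing"


@dataclass
class DataTouch:
    """One step, tagged as either processing or analysis."""

    name: str
    touch_type: TouchType


@dataclass
class DataTouchHistory:
    """Hypothetical container holding every step in the order it ran."""

    steps: List[DataTouch] = field(default_factory=list)

    def of_type(self, touch_type: TouchType) -> List[DataTouch]:
        """Return only the steps carrying the given tag."""
        return [step for step in self.steps if step.touch_type is touch_type]


history = DataTouchHistory(
    steps=[
        DataTouch("Sorting", TouchType.PROCESSING),
        DataTouch("Curation", TouchType.PROCESSING),
        DataTouch("some analysis", TouchType.ANALYSIS),
    ]
)
analysis_names = [step.name for step in history.of_type(TouchType.ANALYSIS)]
```

A single list keeps processing and analysis in one history, and the tag still lets either kind be pulled out on its own.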

Contributor Author

Let me make a Figma doc detailing this organization, and I can show it to you later in the week.

    touch_type: TouchType = Field(..., title="Manipulation type")
Collaborator

Is there a reason the title isn't the same as the name itself? Or vice versa?

Contributor Author

I didn't know if using "touch" as a forward-facing field would make sense to users, even though it makes sense on the backend. I can change one or the other to match.

    start_date_time: datetime = Field(..., title="Start date time")
    end_date_time: datetime = Field(..., title="End date time")
    input_location: str = Field(..., description="Path to data inputs", title="Input location")
    output_location: str = Field(..., description="Path to data outputs", title="Output location")
    notes: Optional[str] = None


class CodeManipulation:
Collaborator

need docstring

Collaborator

Is this for a time when code is used? Is that the idea? So not so much code manipulation as data manipulation? Why is this separate from the data touch? Oh, because there can be data touches that don't use code, e.g. manual annotations... okay. I'm still not sure whether this needs to be its own class: either it's an optional class, or these are optional fields in the data touch class.
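
The two alternatives raised here (an optional nested class versus optional fields on the data touch class) might look like the sketch below. It assumes plain dataclasses instead of the pydantic models, trims the fields to a few, and uses the hypothetical names `DataTouchNested` and `DataTouchFlat` purely to put the two shapes side by side.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodeManipulation:
    """Code details; code_url is required whenever this object exists."""

    code_url: str
    code_version: Optional[str] = None


@dataclass
class DataTouchNested:
    """Alternative A: the code block is one optional nested object."""

    input_location: str
    output_location: str
    code: Optional[CodeManipulation] = None  # None for manual steps


@dataclass
class DataTouchFlat:
    """Alternative B: code details are optional fields on the step itself."""

    input_location: str
    output_location: str
    code_url: Optional[str] = None
    code_version: Optional[str] = None


# A manual annotation step has no code attached in either shape.
manual = DataTouchNested("/raw", "/annotated")
scripted = DataTouchNested(
    "/raw", "/sorted", code=CodeManipulation("https://example.com/repo", "0.1.0")
)
flat_manual = DataTouchFlat("/raw", "/annotated")
```

With the nested shape, `code_url` can stay required whenever code is present; the flat shape has to make every code field optional, so a half-filled record is harder to rule out.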

Contributor Author

Sometimes all of these classes are going to be necessary, but I think dividing them up based on topic is useful for organization, and lets us see exactly what we are using. Additionally if more fields need to be added, there's some clear organization for where they go based on what they are related to.

    code_url: str = Field(..., description="Path to code repository", title="Code URL")
    code_version: Optional[str] = Field(None, description="Version of the code", title="Code version")
    input_parameters: Dict[str, Any] = Field(..., title="Input parameters")
    output_parameters: Optional[Dict[str, Any]] = Field(..., title="Output parameters")
    outputs: Optional[Dict[str, Any]] = Field(None, description="Output parameters", title="Outputs")


class AnalysisStep(AindModel):
    """Description of a single processing step"""

    analysis_name: str = Field(..., description="Name of the analysis method used", title="Analysis name")
    analysis_type: AnalysisType = Field(..., description="Type of analysis performed on dataset", title="Analysis type")
Collaborator

I'm curious if we need an AnalysisType enum - or rather if we can even create it. I think this has to be a description that the scientist provides. Which is sad because they will probably not be good descriptions.

Contributor Author

That's why I thought analysis type might be useful: at least to have some overarching classes that they could then provide further description for.

Collaborator

but we need to have overarching classes of analysis, and I'm not sure we have them...

Contributor Author

If that's the case then that's fine, we won't need an enum. It will be easy to remove in favor of a string.
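
The trade-off discussed in this thread can be shown with a small sketch. It assumes a plain dataclass in place of the pydantic model, reuses the placeholder enum value from the diff, and introduces the hypothetical names `AnalysisStepWithString` and `enum_accepts` for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class AnalysisType(Enum):
    SOME_ANALYSIS = "some analysis"  # placeholder value from the diff


@dataclass
class AnalysisStepWithString:
    """Variant where the scientist supplies the type as free text."""

    analysis_name: str
    analysis_type: str


def enum_accepts(value: str) -> bool:
    """Report whether the enum has a member with this value."""
    try:
        AnalysisType(value)
        return True
    except ValueError:
        return False


# Free text is always accepted; the enum rejects anything not listed up front.
step = AnalysisStepWithString("spike rate comparison", "firing-rate statistics")
known = enum_accepts("some analysis")
unknown = enum_accepts("firing-rate statistics")
```

The enum only helps if the overarching categories can be listed in advance; the string accepts whatever description is provided, at the cost of consistency.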

    touch_info: DataTouchInfo = Field(..., description="General information regarding the data manipulation performed")
    analysis_code: CodeManipulation = Field(..., description="Info regarding code used to manipulate data")
Collaborator

yeah, I think I prefer this to be in the data touch class.

Contributor Author

I'm not sure I follow this comment.

Collaborator

the analysis_code - I think this needs to live in the DataTouch class because it's part of how the data was used.



class DataProcess(AindModel):
    """Description of a single processing step"""

    process_name: ProcessName = Field(..., title="Process name")
    processing_type: ProcessType = Field(..., description="Type of processing performed on dataset", title="Processing type")
    touch_info: DataTouchInfo = Field(..., description="General information regarding the data manipulation performed")
    analysis_code: CodeManipulation = Field(..., description="Info regarding code used to manipulate data")

Collaborator

do we need an Analysis model that creates a list of analysis steps?

Contributor Author

Ah! You've addressed my prior question, lol. It easily can; I would like to briefly discuss what this model is doing first.

Collaborator

sorry - I think I confused myself. Let me take another look.


class Processing(AindCoreModel):
    """Description of all processes run on data"""

    schema_version: str = Field(
        "0.2.1",
        description="Schema version",
        title="Schema version",
        const=True,
    )
    pipeline_version: Optional[str] = Field(None, description="Version of the pipeline", title="Pipeline version")
    pipeline_url: Optional[str] = Field(None, description="URL to the pipeline code", title="Pipeline URL")
    data_processes: List[AnalysisStep] = Field(..., title="Data processing", unique_items=True)

7 changes: 2 additions & 5 deletions src/aind_data_schema/processing.py
@@ -10,6 +10,7 @@

from aind_data_schema.base import AindCoreModel, AindModel
from aind_data_schema.imaging.tile import Tile
from aind_data_schema.datamanipulation import DataTouchInfo
Collaborator

I'm a little confused - are we keeping this schema? How is this different from the processing in the datamanipulation file?

Contributor Author

Relic of an earlier edit. Ideally we could drop the processing.py file after this update, though we could also have two separate files for analysis and processing if we wanted.



class ProcessName(Enum):
@@ -37,16 +38,12 @@ class DataProcess(AindModel):
    """Description of a single processing step"""

    name: ProcessName = Field(..., title="Name")
    touch_info: DataTouchInfo = Field(..., description="General information regarding the data manipulation performed")
    version: str = Field(..., description="Version of the software used", title="Version")
    start_date_time: datetime = Field(..., title="Start date time")
    end_date_time: datetime = Field(..., title="End date time")
    input_location: str = Field(..., description="Path to data inputs", title="Input location")
    output_location: str = Field(..., description="Path to data outputs", title="Output location")
    code_url: str = Field(..., description="Path to code repository", title="Code URL")
    code_version: Optional[str] = Field(None, description="Version of the code", title="Code version")
    parameters: Dict[str, Any] = Field(..., title="Parameters")
    outputs: Optional[Dict[str, Any]] = Field(None, description="Output parameters", title="Outputs")
    notes: Optional[str] = None


class Processing(AindCoreModel):