Prepare OpenML import #99

Open
LizzAlice opened this issue Aug 31, 2023 · 3 comments

LizzAlice commented Aug 31, 2023

Get an overview of the data available via the API. The findings are documented here:

  • datasets: openml.datasets.list_datasets(output_format="dataframe")
    • did: unique dataset ID
    • name: not unique
    • version: int; the combination of name and version seems to be unique in every case but one
    • uploader: int (possibly a user ID?)
    • status: "active" for all of them
    • format: one of ARFF, Sparse_ARFF, arff or sparse_arff
    • MajorityClassSize: number or NaN
    • MaxNominalAttDistinctValues: number or NaN
    • MinorityClassSize: number or NaN
    • NumberOfClasses: number or NaN
    • NumberOfFeatures: number or NaN
    • NumberOfInstances: number or NaN
    • NumberOfInstancesWithMissingValues: number or NaN
    • NumberOfMissingValues: number or NaN
    • NumberOfNumericFeatures: number or NaN
    • NumberOfSymbolicFeatures: number or NaN
  • evaluations (an evaluation function has to be specified)
    • run_id: run id
    • task_id: task id
    • setup_id: setup id
    • flow_id: flow id
    • flow_name: flow name
    • data_id: dataset id?
    • data_name: dataset name?
    • function: evaluation function
    • upload_time: time it was uploaded
    • uploader: uploader number
    • uploader_name: name string
    • value: int
    • values: always None?
    • array_data: always None?
  • flows
    • id: unique id
    • full_name: name with number in parentheses
    • name: name of the Python class or function
    • version: number
    • external_version: None or package versions with package name in the form 'openml==0.14.1,sklearn==1.3.0'
    • uploader: number
  • runs
    • run_id: unique id
    • task_id: task id
    • setup_id: setup id
    • flow_id: flow id
    • uploader: number
    • task_type: instance of task type in the following form: TaskType.LEARNING_CURVE
    • upload_time: time in the format of 2014-04-06 23:30:40
    • error_message: string
  • setups
    • setup_id: unique id
    • flow_id: flow id
    • parameters: dict whose entries are identified by numbers; each entry is itself a dict containing information such as flow information, data_type, default_value, etc.
  • studies: openml.study.list_studies(output_format="dataframe") (it is a bit unclear what these are; only two are returned, although the IDs suggest there are more)
    • id: unique id, only 123 and 226
    • main_entity_type: "run"
    • status: "active"
    • creation_date: time in the format of 2019-02-21 19:55:30
    • creator: number
    • alias: NaN or "amlb"
  • tasks: openml.tasks.list_tasks(output_format="dataframe")
    • tid: unique task id
    • ttid: string with the task type in the form of TaskType.TASK_TYPE_NAME
    • did: dataset id
    • name: should be the task name, but actually looks like the dataset name
    • task_type: task type as in ttid, but in words
    • status: "active" for all of them
    • estimation_procedure: string
    • evaluation_measures: string or NaN
    • source_data: seems to be the same as did
    • target_feature: string
    • MajorityClassSize: number or NaN (is this the value from the dataset?)
    • MaxNominalAttDistinctValues: number or NaN (is this the value from the dataset?)
    • MinorityClassSize: number or NaN (is this the value from the dataset?)
    • NumberOfClasses: number or NaN (is this the value from the dataset?)
    • NumberOfFeatures: number or NaN (is this the value from the dataset?)
    • NumberOfInstances: number or NaN (is this the value from the dataset?)
    • NumberOfInstancesWithMissingValues: number or NaN (is this the value from the dataset?)
    • NumberOfMissingValues: number or NaN (is this the value from the dataset?)
    • NumberOfNumericFeatures: number or NaN (is this the value from the dataset?)
    • NumberOfSymbolicFeatures: number or NaN (is this the value from the dataset?)
    • number_samples: number or NaN
    • cost_matrix: NaN or matrix in list of lists format or string or number
    • source_data_labeled: NaN or '1227' or '1451'
    • target_feature_event: NaN, or 'event' or 'OS_event'
    • target_feature_left: NaN
    • target_feature_right: NaN or "time" or "OS_years"
    • quality_measure: NaN or string
    • target_value: NaN or string
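
Several of the fields listed above arrive as formatted strings, for example external_version on flows and upload_time on runs and evaluations. A minimal sketch of how they could be normalised before import, using only the two formats quoted above. The helper names are hypothetical and not part of the openml package:

```python
from datetime import datetime


def parse_external_version(value):
    """Split 'pkg==1.0,other==2.0' into {'pkg': '1.0', 'other': '2.0'}."""
    if not value:  # None or empty string: no package information given
        return {}
    versions = {}
    for entry in value.split(","):
        name, sep, version = entry.strip().partition("==")
        # Assumption: an entry without '==' is kept, with its version set to None.
        versions[name] = version if sep else None
    return versions


def parse_upload_time(value):
    """Parse timestamps of the form '2014-04-06 23:30:40'."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
```

Other string formats may occur in the real data, so this should be checked against a full listing before relying on it.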

Dependencies: Task depends on Dataset; Run on Task, Setup and Flow; Setup on Flow; Evaluation on Run, Task, Setup, Flow and Dataset
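
The dependencies above imply an order in which the entity types would have to be imported. A rough sketch of deriving that order with the standard library (graphlib, Python 3.9+). The lowercase entity names and the import_order helper are illustrative assumptions, and studies are left out because no dependency is stated for them:

```python
from graphlib import TopologicalSorter

# Each key depends on the entity types in its set, as listed above.
DEPENDENCIES = {
    "dataset": set(),
    "flow": set(),
    "task": {"dataset"},
    "setup": {"flow"},
    "run": {"task", "setup", "flow"},
    "evaluation": {"run", "task", "setup", "flow", "dataset"},
}


def import_order(dependencies):
    """Return the entity types so that each comes after everything it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


order = import_order(DEPENDENCIES)
```

Any order returned this way puts datasets and flows before tasks and setups, runs after those, and evaluations last.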

LizzAlice added the enhancement (New feature or request) label Aug 31, 2023
LizzAlice self-assigned this Aug 31, 2023

LizzAlice commented Nov 3, 2023

Questions:

  • Dataset:
    • why is the dataset name not unique? --> that is just how it works
    • uploader: is this ID unique? --> yes
    • what does status=active mean and why are they all active? --> always active, can be ignored
  • Evaluation:
    • is data_id the dataset ID? --> yes
    • where do I find a list of evaluation functions? --> list_evaluation_measures
    • what is values and when is it not None?
    • what is array_data and when is it not None?
  • Study:
    • what are studies, why are there only two although the IDs suggest there are more, and why are they not linked to the other entities? --> seems to be a bug
  • Task:
    • is name here the dataset name rather than the task name? --> tasks don't have a name
    • what does source_data_labeled mean?
    • target_feature_event: what is the difference between event and OS_event?
    • are only the task types classification and regression important?

LizzAlice commented

  • excluded fields for Dataset:
    • MaxNominalAttDistinctValues: this one is the number of distinct attributes overall, i.e. over several columns; doesn't make sense to show


LizzAlice commented Jan 8, 2024

Potential changes to the prototype:

  • what about a field for the quality? --> verified/not
  • an extra field with just the text from "cites work"
  • rename "cites work" to something that makes it clear that it is an item and has a DOI?
  • are all the entries I get back from the API active? --> yes!
