diff --git a/03_Aggregate_Classifications.ipynb b/03_Aggregate_Classifications.ipynb new file mode 100644 index 0000000..d1ba4d6 --- /dev/null +++ b/03_Aggregate_Classifications.ipynb @@ -0,0 +1,564 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "e3b4daaa-6ff8-467c-8a9b-c56d3e4d9f8b", + "metadata": {}, + "source": [ + "\"Vera \n", + "

Retrieve and Aggregate Zooniverse Output

\n", + "Authors: Becky Nevin, Clare Higgs, and Eric Rosas
\n", + "Contact author: Clare Higgs
\n", + "Last verified to run: 2024-11-07
\n", + "LSST Science Pipelines version: Weekly 2024_42
\n", + "Container size: small or medium
\n", + "Targeted learning level: intermediate" + ] + }, + { + "cell_type": "markdown", + "id": "0f16d57a-9af8-4fc9-bf86-8724024516e6", + "metadata": {}, + "source": [ + "Description: This notebook guides a PI through the process of retrieving and aggregating classification data from Zooniverse.

\n", + "Skills: Query for Zooniverse classification data via the panoptes client; retrieve and aggregate user classifications and retrieve original objectIds or diaobjectIds.\n", + "

\n", + "LSST Data Products: n/a, this notebook demonstrates retrieving classifications from Zooniverse; the notebooks 01 and 02 in the citizen science tutorial series demonstrate working with Rubin data

\n", + "Packages: panoptes_client, panoptes_aggregation, rubin.citsci, utils (citsci plotting and display utilities),

\n", + "Credit: Hayley Roberts' aggregation code https://github.com/astrohayley/SLSN-Aggregation-Example/blob/main/SLSN_batch_aggregation.py

\n", + "Get Support: PIs new to DP0 are encouraged to find documentation and resources at dp0-2.lsst.io. Support for this notebook is available and questions are welcome at cscience@lsst.org." + ] + }, + { + "cell_type": "markdown", + "id": "2faa844a-2bd3-41e4-9058-f5ce85c3fc54", + "metadata": {}, + "source": [ + "## 1. Introduction \n", + "This notebook provides an introduction to how to use the Zooniverse panoptes client and rubin.citsci package to retrieve classifications from Zooniverse and aggregate the results. Data aggregation in this context is collecting classifications across all citizen scientists and summarizing them by subject in terms of classifier majority. In other words, if multiple Zooniverse users or the same Zooniverse users classified a single subject, this notebook demonstrates how to combine these classifications to a consensus." + ] + }, + { + "cell_type": "markdown", + "id": "1de3b241-4e92-49cd-9148-c6b180dc11d8", + "metadata": { + "execution": { + "iopub.execute_input": "2024-06-20T17:44:28.705146Z", + "iopub.status.busy": "2024-06-20T17:44:28.704785Z", + "iopub.status.idle": "2024-06-20T17:44:28.708543Z", + "shell.execute_reply": "2024-06-20T17:44:28.707909Z", + "shell.execute_reply.started": "2024-06-20T17:44:28.705123Z" + } + }, + "source": [ + "### 1.1 Package imports " + ] + }, + { + "cell_type": "markdown", + "id": "59cdc872-8bdf-485b-921f-bb9bd5b597dd", + "metadata": {}, + "source": [ + "#### Install Pipeline Package\n", + "\n", + "First, install the Rubin Citizen Science Pipeline package by doing the following:\n", + "\n", + "1. Open up a New Launcher tab\n", + "2. In the \"Other\" section of the New Launcher tab, click \"Terminal\"\n", + "3. Use `pip` to install the `rubin.citsci` package by entering the following command:\n", + "```\n", + "pip install rubin.citsci\n", + "```\n", + "Note that this package will soon be installed directly on RSP.\n", + "\n", + "If this package is already installed, make sure it is updated:\n", + "```\n", + "pip install --u rubin.citsci\n", + "```\n", + "\n", + "4. Confirm the next cell containing `from rubin.citsci import pipeline` works as expected and does not throw an error\n", + "\n", + "5. Install `panoptes_client`:\n", + "```\n", + "pip install panoptes_client\n", + "pip install panoptes_aggregation\n", + "```\n", + "\n", + "6. 
If the pip install doesn't work for `panoptes_aggregation`:\n", + "```\n", + "pip install -U git+https://github.com/zooniverse/aggregation-for-caesar.git\n", + "```\n", + "Notes about this install: https://www.zooniverse.org/talk/1322/2415041?comment=3969837&page=1" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "7d65d903-8012-43e3-ab19-d94b3f422e83", + "metadata": { + "execution": { + "iopub.execute_input": "2024-11-07T15:59:37.235769Z", + "iopub.status.busy": "2024-11-07T15:59:37.235041Z", + "iopub.status.idle": "2024-11-07T15:59:38.977620Z", + "shell.execute_reply": "2024-11-07T15:59:38.977030Z", + "shell.execute_reply.started": "2024-11-07T15:59:37.235741Z" + } + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import json\n", + "import sys\n", + "from tqdm import tqdm\n", + "# Zooniverse tools\n", + "from panoptes_client import Workflow, panoptes\n", + "from panoptes_aggregation.extractors.utilities import annotation_by_task\n", + "from panoptes_aggregation.extractors import question_extractor\n", + "from panoptes_aggregation.reducers import question_consensus_reducer\n", + "# rubin citizen science tools\n", + "from rubin.citsci import pipeline" + ] + }, + { + "cell_type": "markdown", + "id": "cb65e56b-028b-425f-b3c3-87233734a9d7", + "metadata": {}, + "source": [ + "### 1.2 Define functions and parameters \n", + "There are four relevant functions for retrieval and aggregation:\n", + "- `generate_classification_export`: Submits a request to Zooniverse via panoptes to generate the classification export for a given workflow ID. The export can take up to 24 hours to generate after this function is run. This function is not necessary if the export has previously been generated and doesn't require updating.\n", + "- `download_classifications`: Downloads the (previously generated) classifications for a given workflow ID, returning a dataframe.\n", + "- `extract_data`: Extracts user annotations by task and sorts them by when they were classified. This can be modified for other classification tasks such as drawing; see the Zooniverse documentation.\n", + "- `aggregate_data`: Groups classifications by task and user, selects the most recent classification from each user, and uses the Zooniverse `question_consensus_reducer` function to determine the consensus for each subject ID among all users."
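To make the data flow concrete before the functions are defined, here is a minimal toy sketch of the extract-then-reduce pattern that `extract_data` and `aggregate_data` wrap. The three annotations are invented, and the exact consensus values shown in the comments are illustrative; the output keys follow `panoptes_aggregation`'s question consensus reducer.

```python
# Toy sketch of the extract -> reduce flow; the annotations are made up.
from panoptes_aggregation.extractors.utilities import annotation_by_task
from panoptes_aggregation.extractors import question_extractor
from panoptes_aggregation.reducers import question_consensus_reducer

# three hypothetical classifications of one subject for task T0
raw_annotations = [
    {"task": "T0", "value": "Yes"},
    {"task": "T0", "value": "Yes"},
    {"task": "T0", "value": "No"},
]
extracts = []
for annotation in raw_annotations:
    # group each annotation by task, then pull out the question response
    by_task = annotation_by_task({"annotations": [annotation]})
    extracts.append(question_extractor(by_task))
# extracts now holds per-classification vote counters,
# e.g. [{'yes': 1}, {'yes': 1}, {'no': 1}]
consensus = question_consensus_reducer(extracts)
# expected to report the winning answer, its vote count, and the agreement,
# e.g. {'most_likely': 'yes', 'num_votes': 2, 'agreement': 0.67}
print(consensus)
```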
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "7d951495-9ed7-44ed-bffe-04504679e8bb", + "metadata": { + "execution": { + "iopub.execute_input": "2024-11-07T16:11:11.898187Z", + "iopub.status.busy": "2024-11-07T16:11:11.897331Z", + "iopub.status.idle": "2024-11-07T16:11:11.908930Z", + "shell.execute_reply": "2024-11-07T16:11:11.908262Z", + "shell.execute_reply.started": "2024-11-07T16:11:11.898155Z" + } + }, + "outputs": [], + "source": [ + "def generate_classification_export(workflow_id, client):\n", + " \"\"\"\n", + " Submits a request to Zooniverse to generate the\n", + " classification export report.\n", + "\n", + " Description:\n", + " This function should be run only if the classification\n", + " export has never been generated before or if the\n", + " classification export needs to be updated, i.e., if\n", + " there are new classifications ready for download that\n", + " have never been downloaded before.\n", + "\n", + " Args:\n", + " workflow_id (int): Workflow ID of workflow being aggregated\n", + " client: Logged in Zooniverse client\n", + " \"\"\"\n", + " workflow = Workflow(workflow_id)\n", + " workflow.generate_export('classifications')\n", + "\n", + "\n", + "def download_classifications(workflow_id, client):\n", + " \"\"\"\n", + " Downloads data from Zooniverse\n", + "\n", + " Args:\n", + " workflow_id (int): Workflow ID of workflow being aggregated\n", + " client: Logged in Zooniverse client\n", + "\n", + " Returns:\n", + " classification_data (DataFrame): Raw classifications from Zooniverse\n", + " \"\"\"\n", + " workflow = Workflow(workflow_id)\n", + " try:\n", + " classification_export = workflow.get_export(\n", + " 'classifications', generate=False)\n", + " except panoptes.PanoptesAPIException:\n", + " # This error will be thrown if no classifications exist and\n", + " # it's necessary to first run the generate_classification_export\n", + " # function and then rerun this function\n", + " print(\"The classification export is not ready, please ensure \"\n", + " \"that you have already run the `generate_classification_export` \"\n", + " \"function and that you have received an email from Zooniverse \"\n", + " \"that the classification export is ready.\")\n", + " # bail out so the undefined export is not used below\n", + " return None\n", + " # since it's a partial class, call it to get the DictReader object\n", + " csv_dictreader_instance = classification_export.csv_dictreader()\n", + " classification_rows = [row for row in\n", + " tqdm(csv_dictreader_instance, file=sys.stdout)]\n", + " # convert to pandas dataframe\n", + " classification_data = pd.DataFrame.from_dict(classification_rows)\n", + " return classification_data\n", + "\n", + "\n", + "def extract_data(classification_data, id_type='#objectId'):\n", + " \"\"\"\n", + " Extracts annotations from the classification data\n", + "\n", + " Args:\n", + " classification_data (DataFrame): Raw classifications from Zooniverse\n", + " id_type (str): Name of the id in the extracted classifications; the #\n", + " default that precedes objectId keeps the objectId hidden\n", + " from the user in the information window on Zooniverse.\n", + "\n", + " 
Returns:\n", + " extracted_data (DataFrame): Extracted annotations from raw\n", + " classification data\n", + " \"\"\"\n", + " # set up our list where we will store the extracted data temporarily\n", + " extracted_rows = []\n", + " # iterate through our classification data\n", + " for i in range(len(classification_data)):\n", + " # access the specific row and extract the annotations\n", + " row = classification_data.iloc[i]\n", + " for annotation in json.loads(row.annotations):\n", + " row_annotation = annotation_by_task({'annotations': [annotation]})\n", + " extract = question_extractor(row_annotation)\n", + " subject_id_str = str(row.subject_ids)\n", + " # Check if the subject ID exists and is a dictionary\n", + " subject_data = json.loads(row.subject_data)\n", + " if (\n", + " subject_id_str in subject_data\n", + " and isinstance(\n", + " subject_data[subject_id_str], dict)\n", + " ):\n", + " rubin_id = subject_data[subject_id_str][id_type]\n", + "\n", + " else:\n", + " # flag the missing id instead of silently reusing\n", + " # the value from the previous row\n", + " rubin_id = None\n", + " print(f\"Key '{subject_id_str}' not found in \"\n", + " f\"subject_data or it is not a dictionary.\")\n", + " # add the extracted annotations to our temporary list\n", + " # along with some other additional data\n", + " extracted_rows.append({\n", + " 'classification_id': row.classification_id,\n", + " 'subject_id': row.subject_ids,\n", + " 'user_name': row.user_name,\n", + " 'user_id': row.user_id,\n", + " 'created_at': row.created_at,\n", + " 'rubin_id': rubin_id,\n", + " 'data': json.dumps(extract),\n", + " 'task': annotation['task']\n", + " })\n", + " # convert the extracted data to a pandas dataframe and sort\n", + " extracted_data = pd.DataFrame.from_dict(extracted_rows)\n", + " extracted_data.sort_values(['subject_id', 'created_at'], inplace=True)\n", + " return extracted_data\n", + "\n", + "\n", + "def last_filter(data):\n", + " \"\"\"\n", + " Determines the most recently submitted classifications\n", + " \"\"\"\n", + " last_time = data.created_at.max()\n", + " ldx = data.created_at == last_time\n", + " return data[ldx]\n", + "\n", + "\n", + "def aggregate_data(extracted_data):\n", + " \"\"\"\n", + " Aggregates question data from extracted annotations\n", + "\n", + " Args:\n", + " extracted_data (DataFrame): Extracted annotations from raw\n", + " classifications\n", + "\n", + " Returns:\n", + " aggregated_data (DataFrame): Aggregated data for the given workflow\n", + " \"\"\"\n", + " # generate an array of unique subject ids -\n", + " # these are the ones that we will iterate over\n", + " subject_ids_unique = np.unique(extracted_data.subject_id)\n", + " # Create a dictionary to map subject IDs to their corresponding metadata\n", + " rubin_ids_unique = extracted_data.groupby('subject_id')['rubin_id'].unique()\n", + " # set up a temporary list to store reduced data\n", + " aggregated_rows = []\n", + " # determine the total number of tasks\n", + " tasks = np.unique(extracted_data.task)\n", + " # iterating over each unique subject id\n", + " for i in range(len(subject_ids_unique)):\n", + " # determine the subject_id to work on\n", + " subject_id = subject_ids_unique[i]\n", + " rubin_id = rubin_ids_unique.iloc[i][0]\n", + " # filter the extract_data dataframe for only the subject_id that is being worked on\n", + " extract_data_subject = \\\n", + " extracted_data[extracted_data.subject_id == subject_id].drop_duplicates()\n", + " for task in tasks:\n", + " extract_data_filtered = extract_data_subject[\n", + " extract_data_subject.task == task]\n", + " # if there are fewer unique user submissions than classifications,\n", +
" # filter for the most recently updated classification\n", + " if (len(extract_data_filtered.user_name.unique()) < len(extract_data_filtered)):\n", + " extract_data_filtered = \\\n", + " extract_data_filtered.groupby(\n", + " ['user_name'], group_keys=False).apply(last_filter)\n", + " # iterate through the filtered extract data to prepare for the reducer\n", + " classifications_to_reduce = \\\n", + " [json.loads(extract_data_filtered.iloc[j].data)\n", + " for j in range(len(extract_data_filtered))]\n", + " # use the Zooniverse question_consesus_reducer to get the final consensus\n", + " reduction = question_consensus_reducer(classifications_to_reduce)\n", + " # add the subject id to our reduction data\n", + " reduction['subject_id'] = subject_id\n", + " reduction['task'] = task\n", + " reduction['rubin_id'] = rubin_id\n", + " # add the data to our temporary list\n", + " aggregated_rows.append(reduction)\n", + " # converting the result to a dataframe\n", + " aggregated_data = pd.DataFrame.from_dict(aggregated_rows)\n", + " # drop rows that are nan\n", + " aggregated_data.dropna(inplace=True)\n", + " return aggregated_data" + ] + }, + { + "cell_type": "markdown", + "id": "31a708c0-046f-4634-a22f-a80dd9b613f2", + "metadata": {}, + "source": [ + "## 2. Log into Zooniverse and find the workflow to download classifications from\n", + "If you're running this notebook, you should already have a Zooniverse account with a project with classifications. If you do not yet have an account, please return to notebook `01_Introduction_to_Citsci_Pipeline.ipynb`.\n", + "\n", + "IMPORTANT: Your Zooniverse project must be set to \"public\", a \"private\" project will not work. Select this setting under the \"Visibility\" tab, (it does not need to be set to live). \n", + "\n", + "Supply the email associated with your Zooniverse account, and then follow the instructions in the prompt to log in and select your project by slug name. \n", + "\n", + "A \"slug\" is the string of your Zooniverse username and your project name without the leading forward slash, for instance: \"username/project-name\". [Click here for more details](https://www.zooniverse.org/talk/18/967061?comment=1898157&page=1).\n", + "\n", + "**The `rubin.citsci` package includes a method that creates a Zooniverse project from template. If you wish to use this feature, do not provide a slug_name and run the subsequent cell.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7be9aa4e-4ac6-473c-9660-29afbfa9dc74", + "metadata": {}, + "outputs": [], + "source": [ + "email = \"\"\n", + "cit_sci_pipeline = pipeline.CitSciPipeline()\n", + "client = cit_sci_pipeline.client\n", + "cit_sci_pipeline.login_to_zooniverse(email)" + ] + }, + { + "cell_type": "markdown", + "id": "b3426882-9f5c-4f8e-bdfb-1daed5230721", + "metadata": {}, + "source": [ + "Use the `list_workflows` method to find the workflow ID." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f5808b1-109f-4417-8dd5-003d7d6f83f1", + "metadata": {}, + "outputs": [], + "source": [ + "cit_sci_pipeline.list_workflows()" + ] + }, + { + "cell_type": "markdown", + "id": "f3e2d1d1-de4d-414b-afb9-e5b3f6957b68", + "metadata": {}, + "source": [ + "Copy and paste the above ID into the cell below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "505d46f9-c80f-421c-b8d8-92e3093b4d37", + "metadata": {}, + "outputs": [], + "source": [ + "workflow_id = input(\"Please enter the workflow_id: \")" + ] + }, + { + "cell_type": "markdown", + "id": "a6c56e82-f54c-4864-89ee-326825170e3d", + "metadata": {}, + "source": [ + "## 3. Generate the classification export\n", + "Set `generate = True` and run the two cells below if you need to generate the classification export from Zooniverse. This is necessary if you have never generated the classification export before or if there are new classifications you wish to include since the last time the export was generated.\n", + "\n", + "The classification export generation runs as a background process on Zooniverse's platform and could take up to 24 hours; when the export is ready, Zooniverse will email you. Once you receive the email confirmation, proceed to the next step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3249bd61-93c8-4b5b-8e92-9995d0fbc8c5", + "metadata": {}, + "outputs": [], + "source": [ + "generate = False" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3258b2cc-fd9d-4c31-ad9b-77b1b361986e", + "metadata": {}, + "outputs": [], + "source": [ + "if generate:\n", + " print(\"Generating classification export, wait up to 24 hours \"\n", + " \"to receive an email from Zooniverse that the \"\n", + " \"classification export is ready.\")\n", + " print(\"Once it is ready, run the following cells.\")\n", + " generate_classification_export(workflow_id, client)\n", + "else:\n", + " print(\"Not generating new classification export.\")" + ] + }, + { + "cell_type": "markdown", + "id": "234ff23f-b321-4277-a28e-6a2f33e28b1a", + "metadata": {}, + "source": [ + "## 4. Download the classifications\n", + "Once you have received the email verification that the classification export is ready, you can download the classifications. \n", + "\n", + "This function reads from the exported csv and puts all rows into a dataframe format." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8e0291a-d5e7-4fca-88d0-6930d32f2d4f", + "metadata": {}, + "outputs": [], + "source": [ + "classification_data = download_classifications(workflow_id, client)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87efe634-b609-483f-b3d3-49f7623fc565", + "metadata": {}, + "outputs": [], + "source": [ + "classification_data" + ] + }, + { + "cell_type": "markdown", + "id": "db961c2e-0304-4d9e-8ac1-9e25197a2fc2", + "metadata": {}, + "source": [ + "## 5. Extract annotations by task and sort by subject ID\n", + "The `id_type` argument should be set to either '#objectId' (the default) or '#diaObjectId'. The preceding # denotes that this keyword is a hidden field on Zooniverse, meaning that it is hidden from users.\n", + "\n", + "This function returns all annotations; there are repeated rows for some `subject_id` entries from different users or from the same user re-classifying the same subject. This function will also return the Rubin IDs in a table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f0e8cd49-3975-4dec-accd-ef853385bc1b", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "extracted_data = extract_data(classification_data, id_type='#objectId')\n", + "extracted_data" + ] + }, + { + "cell_type": "markdown", + "id": "6cbe65b0-dc50-4550-92a6-a2954c897c12", + "metadata": {}, + "source": [ + "## 6. 
Aggregate the annotations\n", + "This sorts by unique subject ID and then by unique task, finds the most recent classification from each user, and uses the Zooniverse consensus reducer to look through all user classifications and build a consensus." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cd918a8f-434a-4109-a89a-5f635f71b70a", + "metadata": {}, + "outputs": [], + "source": [ + "aggregated_data = aggregate_data(extracted_data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fab35094-bae0-4883-91ed-6d0215a17135", + "metadata": {}, + "outputs": [], + "source": [ + "aggregated_data" + ] + }, + { + "cell_type": "markdown", + "id": "2b8ad7cd-a9d1-4496-8b70-5cbe26986e2c", + "metadata": {}, + "source": [ + "## 7. Next steps and additional resources\n", + "You are now done! Congratulations!\n", + "Next steps could include joining the above table with other LSST data using the `rubin_id` column, which is either objectId or diaObjectId; a toy join example is sketched below.\n", + "\n", + "Additional resources include the Zooniverse team's panoptes-python-client (https://github.com/zooniverse/panoptes-python-client/tree/master), which provides high-level access to the Zooniverse API in order to manage projects via Python.\n", + "\n", + "For examples of how to work with the data exports, see the Data Digging code repository (https://github.com/zooniverse/Data-digging) or use the Panoptes Aggregation Python package (https://github.com/zooniverse/aggregation-for-caesar)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6f9540c1-ae86-4421-85bd-dbfd7a969219", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "LSST", + "language": "python", + "name": "lsst" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/utils.py b/utils.py index d48ee32..6d6fe0e 100644 --- a/utils.py +++ b/utils.py @@ -321,7 +321,7 @@ def make_manifest_with_calexp_images( ) # and of the diaObjectId figout_data["diaObjectId:image_" + str(i)] = str(star_id) - figout_data[f"metadata:diaObjectId_image_{str(i)}"] = str(star_id) + figout_data[f"metadata:#diaObjectId_image_{str(i)}"] = str(star_id) figout_data["filename"] = str(star_id) + "_" + str(star_ccdid) + ".png" cutouts.append(figout_data) @@ -361,12 +361,12 @@ def make_manifest_with_deepcoadd_images(results_table, butler, batch_dir): if hasattr(row, "objectId"): has_canonical_id = True csv_row["objectId"] = row.objectId - csv_row["metadata:objectId"] = row.objectId + csv_row["metadata:#objectId"] = row.objectId csv_row["objectIdType"] = "DIRECT" if hasattr(row, "diaObjectId"): has_canonical_id = True csv_row["diaObjectId"] = row.diaObjectId - csv_row["metadata:diaObjectId"] = row.diaObjectId + csv_row["metadata:#diaObjectId"] = row.diaObjectId if "objectIdType" not in csv_row: csv_row["objectIdType"] = "INDIRECT"
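Following up on the next steps in Section 7, here is a toy sketch of joining the aggregated consensus table back onto Rubin data via `rubin_id`. Every value below is invented; in practice the catalog side would come from a TAP query as in notebooks 01 and 02. (The utils.py change above applies Zooniverse's `#` hidden-field convention to the manifest metadata columns, which is what lets `extract_data` recover these IDs later.)

```python
import pandas as pd

# invented stand-ins: a one-row consensus table shaped like aggregate_data's
# output, and a hypothetical Rubin catalog table keyed by objectId
aggregated_data = pd.DataFrame({
    "subject_id": [101], "task": ["T0"], "most_likely": ["yes"],
    "num_votes": [3], "agreement": [0.75],
    "rubin_id": ["1250953961339360185"],
})
rubin_catalog = pd.DataFrame({
    "objectId": [1250953961339360185], "g_mag": [21.3],
})
# the IDs come back from the Zooniverse export as strings; cast before joining
aggregated_data["rubin_id"] = aggregated_data["rubin_id"].astype("int64")
joined = aggregated_data.merge(
    rubin_catalog, left_on="rubin_id", right_on="objectId", how="inner")
print(joined)
```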