
FHIBE Evaluation API

This README describes the API for performing bias evaluations of machine learning models on the Fair Human-centric Image Benchmark (FHIBE), developed by the Sony AI Ethics team.

This API can be used to:

  1. Evaluate custom models on the FHIBE dataset.
  2. Automatically generate a comprehensive bias report PDF summarizing each model evaluation.

Examples of bias reports for some popular open source models evaluated on FHIBE can be found in examples/open_source_model_bias_reports/.

Dataset Overview

The FHIBE dataset consists of approximately 10,000 high-resolution images, each showing one or two people in a real-world background setting. The image dimensions vary, but are typically 3-10k pixels on a side. A lower-resolution ("downsampled") version of this dataset exists, where each image is resized so that its larger dimension is 2048 pixels while maintaining the original aspect ratio. This API currently supports only the downsampled version of the FHIBE dataset, which avoids memory overflow issues.
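As an illustration only, the downsampling rule described above (resize so the larger dimension is 2048 pixels, preserving aspect ratio) could be expressed as the following sketch. This is not part of the API and assumes Pillow is available; the downsampled dataset is already provided for download.

# Illustrative sketch of the downsampling rule, assuming Pillow.
from PIL import Image

def downsample(path: str, max_side: int = 2048) -> Image.Image:
    img = Image.open(path)
    scale = max_side / max(img.size)  # img.size is (width, height)
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size)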

The FHIBE-Face dataset is a derivative dataset in which each image contains a single face only. There are two versions of the FHIBE-Face dataset: a crop-only version and a crop+aligned version. The crop-only images have the same resolution as the full-resolution FHIBE images, while the crop+aligned images are standardized to 512x512 pixels. This API currently supports only the crop+aligned version of the FHIBE-Face dataset.

Obtaining the dataset

NOTE: Always ensure you are using the latest version of FHIBE. In compliance with GDPR, subjects can revoke their consent from the dataset at any time. To respect this, we will remove these subjects on a monthly basis and update the dataset at the download page below.

Download the latest dataset *downsampled_public.tar from https://fairnessbenchmark.ai.sony/download. The downsampled FHIBE dataset and the crop+aligned FHIBE-Face dataset are contained within this single tar file.

The dataset tarball will unpack into the following directory structure:

fhibe.{version_name}_downsampled_public
├── results/
│   ├── face_parsing/
│   └── face_verification/
└── data/
    ├── protocol/
    ├── processed/ # dataset metadata here
    ├── annotator_metadata/
    ├── raw/
    │    └── fhibe_downsampled/  # dataset here
    └── aggregated_results/

Here, {version_name} refers to the current dataset version, which will vary depending on when you downloaded the dataset.

The raw image and annotation files for both the downsampled FHIBE dataset and the crop+aligned FHIBE-Face dataset are contained in:

fhibe.{version_name}_downsampled_public/data/raw/fhibe_downsampled/

and the metadata (annotations, subject attributes, and annotator attributes) are contained within:

fhibe.{version_name}_downsampled_public/data/processed/

API overview

Task overview

This API supports the evaluation of nine computer vision tasks. The following table lists each task, the model prediction expected for that task, and whether the task is evaluated on the FHIBE or FHIBE-Face dataset.

[Figure: static/API_Tasks.png (table of supported tasks, their model predictions, and their datasets)]

Main API endpoints

The figure and text below describe the two main API endpoints that the user will interact with and what each of them does.

[Figure: static/api_overview.png (overview of the two main API endpoints)]

  • fhibe_eval_api.evaluate.evaluate_task: a function that
    • runs inference on your model over the FHIBE dataset.
    • evaluates task-specific metrics based on your model's outputs.
    • aggregates metric performance over attributes (e.g., age, ancestry) and their intersections.
    • generates files containing the model outputs and aggregate metric performance.
  • fhibe_eval_api.reporting.BiasReport: a class whose generate_pdf_report method uses the files generated by evaluate_task to generate a comprehensive bias evaluation report.
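A minimal usage sketch of these two endpoints is given below. The exact arguments are task-specific and are not shown here; refer to the demo script for your task for the real call signatures.

# Illustrative sketch only: arguments to both endpoints are task-specific.
from fhibe_eval_api.evaluate import evaluate_task
from fhibe_eval_api.reporting import BiasReport

# 1) Run inference, metric evaluation, and attribute aggregation for one task.
evaluate_task(...)

# 2) Build a PDF bias report from the files written by evaluate_task.
report = BiasReport(...)
report.generate_pdf_report(...)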

The Bias Report

The bias report contains plots of metric performance for each attribute group specified when the report was generated. It also contains a table showing the attribute groups with the highest statistically significant disparities. For example, if age was one of the attributes, there will be a plot showing metric performance in each age group (18-29, 30-39, etc.). The plots provide an overview of the model's performance for each attribute, including uncertainties that help assess whether the disparities are statistically significant. The table lists the actual pair-wise disparities and their statistical effect sizes.

The table is sorted from highest to lowest disparity, so the rows at the top of the table are the most disparate attribute groups, showing the areas where the model needs the most work (depending on its application). If the model has no significant disparities between attribute groups, it is well balanced and exhibits low bias.

We chose not to include FHIBE images and their annotations in the bias reports for privacy reasons. The dataset is GDPR compliant, meaning that subjects can revoke their consent to have their data (including images) included in the dataset at any time. If bias reports could include subjects' images and a subject later revoked their consent, it would be challenging and burdensome to track down and delete every PDF containing their image(s).

Examples of generated bias reports from popular open source models evaluated on FHIBE can be found in examples/open_source_model_bias_reports/.

Getting started

The code was developed and tested using python=3.11, pytorch=2.2.0 and torchvision=0.17 with CUDA 12 support. This API is model-agnostic, so your model can be implemented in any library as long as it is callable from Python.

The evaluation can be performed on the CPU or GPU -- no GPU is required to use the API.

The code is tested on OS X and Ubuntu Linux. Windows is not supported.

Installation

Pip/conda installation

  1. Clone this repository locally (or unzip it if you downloaded the zip file):
$ git clone git@github.com:SonyResearch/fhibe_evaluation_api.git
  2. Change to the local repo directory:
$ cd fhibe_evaluation_api
  3. Activate your existing environment, e.g.,
$ conda activate <env_name>
  4. Install the package in editable mode to get access to the fhibe_eval_api package in your current environment:
$ pip install -e .

Poetry installation

  1. Clone this repository locally (or unzip it if you downloaded the zip file):
$ git clone git@github.com:SonyResearch/fhibe_evaluation_api.git
  2. Change to the local repo directory:
$ cd fhibe_evaluation_api
  3. If you don't already have poetry installed, follow the official installation instructions to obtain it.

  4. Install the virtual environment and its dependencies for this project:

$ poetry install

The install command may take a few minutes.

Run the demos

The demo directory contains demos for each task. For example, demo/person_localization contains the demo for person localization.

In each demo task directory is a yaml file called {task_name}.yaml containing the configuration for that task. You will need to modify the following keys in the yaml to get the demo to run on your machine:

data_rootdir
dataset_version
model_name # optional; you can, e.g., set it to today's date
results_basedir

Each demo task directory also contains a Python script called run_{task_name}.py, which is the demo script you will run. If using Poetry, after you have installed the environment as above, you can run the demo scripts using:

$ poetry run python run_{task_name}.py

Once finished, the script will log information such as the locations of the saved files, including the bias report.

For many tasks, the generated bias report will have the structure of a real bias report but may contain few or no actual results, because the models implemented in the demos are essentially random number generators. To see what a complete bias report looks like for a real model, see examples/open_source_model_bias_reports/.

The model base class

To evaluate your custom model using this API, you must first wrap your model. To do so, write a class that inherits from our provided base class:

fhibe_eval_api.models.base_model.BaseModelWrapper

Your child class must contain three methods:

  • __init__(): Initializes the model and passes your custom model object to the base class' __init__() method.
  • data_preprocessor(): Performs data preprocessing and returns an iterator that yields batches of preprocessed data. Often, this is a PyTorch DataLoader, but that is not required.
  • __call__(): Performs a model forward pass of a batch of data, where the batch comes from the data loader returned in data_preprocessor().

See the demos for the required signatures of these methods. Note that some tasks have additional required methods (see demos and task-specific docs below for more details).

Note that while the "custom" models used in the demos are torch.nn.Module objects, your custom model may be implemented with any library. Your model's forward pass must be wrapped in __call__(), and the only requirement is that it can perform inference on a batch of data provided by the iterator (e.g., data loader) that you returned from the data_preprocessor() method.
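A minimal sketch of such a wrapper is shown below. Only the base class path and the three method names come from this README; the constructor arguments, the data_preprocessor() signature, and the use of a PyTorch DataLoader are illustrative assumptions. See the demos for the exact required signatures for your task.

# Minimal sketch of a custom model wrapper (illustrative only).
import torch
from torch.utils.data import DataLoader

from fhibe_eval_api.models.base_model import BaseModelWrapper


class MyModelWrapper(BaseModelWrapper):
    def __init__(self, model, batch_size=8):
        # Pass your custom model object to the base class, as required.
        # The exact base-class signature is an assumption; see the demos.
        super().__init__(model)
        self.net = model              # keep a local handle on the wrapped model
        self.batch_size = batch_size  # hypothetical extra attribute

    def data_preprocessor(self, dataset):
        # Return any iterator that yields batches of preprocessed data.
        # A PyTorch DataLoader is common but not required.
        return DataLoader(dataset, batch_size=self.batch_size)

    def __call__(self, batch):
        # Forward pass on a single batch produced by data_preprocessor().
        with torch.no_grad():
            return self.net(batch)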

Task-specific details

Once you have identified a task that best fits your model, see docs/task_specifics.md for specific implementation details.

Supported attributes

FHIBE and FHIBE-Face contain over thirty annotated attributes each. For the list of attributes available for bias evaluation with this API, see: docs/supported_attributes.md.

Running the bias evaluation on pre-generated model outputs

In some cases, it may not be possible to perform model inference using the evaluation API. This can occur if, for example, your model can only be run on specific hardware that the API does not support. In these cases, you can run inference on the FHIBE images using your own hardware, but then use the API to perform the bias evaluation on the model outputs.

The steps to do this are as follows:

  • Preprocess the FHIBE images for model inference using a custom pipeline specific to your model.
  • Run model inference on the full FHIBE image set, saving the model outputs for each image.
  • Format the model outputs to conform with the specifications laid out for the task under the "Model output file format" section of the docs/task_specifics.md page.
    • Usually, this involves creating a single model_outputs.json file and putting it in the directory {results_rootdir}/{task_name}/{dataset_name}/{model_name}/, where each variable is an input parameter to evaluate_task. Note that if {results_rootdir} is omitted, the default is a results/ subdirectory of your project root directory.
    • If your preprocessing steps resized the FHIBE images, you must rescale the model outputs back to the dimensions of the FHIBE images. The dimensions of the FHIBE images can be found in the fhibe_downsampled.csv or fhibe_face_crop_align.csv metadata files, depending on whether your task uses the FHIBE or FHIBE-Face dataset.
  • Set up your evaluation script as normal (referring to the demo for your task as needed), but in your call to evaluate_task, ensure that you set the following parameters accordingly:
    • model=None
    • reuse_model_outputs=True

Assuming that you formatted your model results correctly and put them in the correct location for the evaluation API to find them, the evaluation pipeline will simply skip inference and perform the rest of the pipeline, including the bias report generation.
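Such a call might look like the sketch below. Only model=None and reuse_model_outputs=True are prescribed above; all other arguments are task-specific and omitted here, so take them from the demo script for your task.

# Sketch of running the bias evaluation on pre-generated model outputs.
from fhibe_eval_api.evaluate import evaluate_task

evaluate_task(
    model=None,                # no model wrapper: inference is skipped
    reuse_model_outputs=True,  # read your pre-generated model_outputs.json instead
    # Remaining, task-specific arguments omitted; see your task's demo script.
    # The API expects the outputs under
    # {results_rootdir}/{task_name}/{dataset_name}/{model_name}/model_outputs.json.
)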

Issues?

Please raise an issue if your question is not already answered.

License

This software is distributed under the Apache 2.0 License.

Citing our work

If you use the FHIBE evaluation API and/or the FHIBE dataset in your work, please cite Xiang et al. 2025.
