This README describes the API for performing bias evaluations of machine learning models on the Fair Human-centric Image Benchmark (FHIBE), developed by the Sony AI Ethics team.
This API can be used to:
- Evaluate custom models on the FHIBE dataset.
- Automatically generate a comprehensive bias report PDF summarizing each model evaluation.
Examples of bias reports for some popular open source models evaluated on FHIBE can be found in examples/open_source_model_bias_reports/.
The FHIBE dataset consists of approximately 10,000 high-resolution images of one or two people in a real-world background setting. The image dimensions vary, but are typically 3,000-10,000 pixels on a side. A lower-resolution ("downsampled") version of this dataset exists, in which each image is resized so that its larger dimension is 2048 pixels while maintaining the original aspect ratio. To avoid memory overflow issues, this API currently supports only the downsampled version of the FHIBE dataset.
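For illustration, the resizing rule works out as follows (a minimal sketch of the arithmetic, not the dataset's actual preprocessing code):

```python
def downsampled_size(width: int, height: int, max_dim: int = 2048) -> tuple[int, int]:
    """Resize so the larger dimension equals max_dim, preserving aspect ratio."""
    scale = max_dim / max(width, height)
    return round(width * scale), round(height * scale)

# e.g., a 6000 x 4000 original becomes 2048 x 1365
print(downsampled_size(6000, 4000))
```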
The FHIBE-Face dataset is a derivative dataset where each image is of a single face only. There are two versions of the FHIBE-Face dataset: a crop-only version and a crop+aligned version. The crop-only images have the same resolution as the full-resolution FHIBE images, while the crop+aligned images are standardized to 512x512 pixels. This API currently supports only the crop+aligned version of the FHIBE-Face dataset.
NOTE: Always ensure you are using the latest version of FHIBE. In compliance with GDPR, subjects can revoke their consent from the dataset at any time. To respect this, we will remove these subjects on a monthly basis and update the dataset at the download page below.
Download the latest dataset `*downsampled_public.tar` from https://fairnessbenchmark.ai.sony/download. The downsampled FHIBE dataset and the crop+aligned FHIBE-Face dataset are contained within this single tar file.
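For example, to unpack the downloaded archive (substitute the actual file name you downloaded):

```
$ tar -xf fhibe.{version_name}_downsampled_public.tar
```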
The dataset tarball will unpack into the following directory structure:
```
fhibe.{version_name}_downsampled_public
├── results/
│   ├── face_parsing/
│   └── face_verification/
└── data/
    ├── protocol/
    ├── processed/              # dataset metadata here
    ├── annotator_metadata/
    ├── raw/
    │   └── fhibe_downsampled/  # dataset here
    └── aggregated_results/
```
Here, `{version_name}` refers to the current dataset version, which varies depending on when you downloaded the dataset.
The raw image and annotation files for both the FHIBE downsampled dataset and the face crop+align dataset are contained in:

`fhibe.{version_name}_downsampled_public/data/raw/fhibe_downsampled/`

and the metadata (annotations, subject and annotator attributes) are contained within:

`fhibe.{version_name}_downsampled_public/data/processed/`
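As a minimal sketch, you can inspect one of the metadata tables with pandas (this assumes the `fhibe_downsampled.csv` file referenced later in this README lives under `data/processed/`; adjust `version_name` to match your download):

```python
import pandas as pd

version_name = "v1.0"  # placeholder; use your actual dataset version
metadata_path = (
    f"fhibe.{version_name}_downsampled_public/data/processed/fhibe_downsampled.csv"
)
metadata = pd.read_csv(metadata_path)
print(metadata.columns.tolist())  # inspect the available annotation fields
```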
This API supports the evaluation of nine computer vision tasks. The following table lists each task, including what the model prediction is for the task, and whether the task is evaluated on the FHIBE or FHIBE-Face dataset.
The figure and text below describe the two main API endpoints that the user will interact with and what each of them does.
`fhibe_eval_api.evaluate.evaluate_task`: a function that

- runs inference on your model over the FHIBE dataset.
- evaluates task-specific metrics based on your model's outputs.
- aggregates metric performance over attributes (e.g., age, ancestry, etc...) and their intersections.
- generates files containing the model outputs and aggregate metric performance.
`fhibe_eval_api.reporting.BiasReport`: a class containing a method `generate_pdf_report` that uses the files generated by `evaluate_task` to generate a comprehensive bias evaluation report.
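A hypothetical sketch of the end-to-end flow (every argument below is an illustrative placeholder; the demos define the exact signatures):

```python
from fhibe_eval_api.evaluate import evaluate_task
from fhibe_eval_api.reporting import BiasReport

# Step 1: run inference, compute and aggregate metrics, and write result files.
evaluate_task(
    model=my_wrapped_model,           # placeholder: your wrapped model (see below)
    task_name="person_localization",  # placeholder: one of the nine tasks
)

# Step 2: build the PDF bias report from the files evaluate_task wrote.
report = BiasReport(results_dir="results/")  # placeholder constructor argument
report.generate_pdf_report()
```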
The bias report contains plots of metric performance for each attribute group specified when the report was generated. It also contains a table showing the attribute groups with the highest statistically significant disparities. For example, if age was one of the attributes, there will be a plot showing metric performance in each age group (e.g., 18-29, 30-39, etc.). The plots provide an overview of the model's performance for each attribute, including uncertainties to help understand whether the disparities are statistically significant. The table lists the actual pairwise disparities and their statistical effect sizes.
The table is sorted from high to low disparity, so the rows at the top of the table are the most disparate attribute groups, showing the areas where the model needs the most work (depending on its application). If the model has no significant disparities between attribute groups, the model is well balanced and exhibits low bias.
We chose not to include FHIBE images and their annotations in the bias reports for privacy reasons. The dataset is GDPR compliant, meaning that subjects can revoke their consent to have their data (including images) included in the dataset at any time. If bias reports could be generated with subjects' images and a subject later revoked their consent, it would be challenging and burdensome to track down and delete every PDF containing their image(s).
Examples of generated bias reports from popular open source models evaluated on FHIBE can be found in examples/open_source_model_bias_reports/.
The code was developed and tested using python=3.11, pytorch=2.2.0
and torchvision=0.17 with CUDA 12 support. This API is model-agnostic, so your model can be implemented in any library as long as it is callable from Python.
The evaluation can be performed on the CPU or GPU -- no GPU is required to use the API.
The code is tested on macOS and Ubuntu Linux. Windows is not supported.
To install with pip:

- Clone this repository locally (or unzip it if you downloaded the zip file):

  ```
  $ git clone git@github.com:SonyResearch/fhibe_evaluation_api.git
  ```

- Change to the local repo directory:

  ```
  $ cd fhibe_evaluation_api
  ```

- Activate your existing environment, e.g.,

  ```
  $ conda activate <env_name>
  ```

- Install the package:

  ```
  $ pip install -e .
  ```

  This will give you access to the `fhibe_eval_api` package in your current environment.
Alternatively, to install with Poetry:

- Clone this repository locally (or unzip it if you downloaded the zip file):

  ```
  $ git clone git@github.com:SonyResearch/fhibe_evaluation_api.git
  ```

- Change to the local repo directory:

  ```
  $ cd fhibe_evaluation_api
  ```

- If you don't already have `poetry` installed, follow the official installation instructions to obtain it.

- Install the virtual environment and its dependencies for this project:

  ```
  $ poetry install
  ```

  The install command may take a few minutes.
The `demo` directory contains demos for each task. For example, `demo/person_localization` contains the demo for person localization.

In each demo task directory is a YAML file called `{task_name}.yaml` containing the configuration for that task. You will need to modify the following keys in the YAML to get the demo to run on your machine:

```
data_rootdir
dataset_version
model_name        # optional, but you can update to today's date
results_basedir
```
Each demo task directory also contains a Python script called `run_{task_name}.py`, which is the demo script you will run. If using Poetry, after you have installed the environment as above, you can run the demo scripts using:

```
$ poetry run python run_{task_name}.py
```
Once finished, the script will log some information, such as the locations of files saved to disk, including the bias report.
For many tasks, the bias report will contain the structure of a real bias report but may be empty of actual results, because the models implemented in the demos are primarily random number generators. To see what a complete bias report looks like for a real model, see examples/open_source_model_bias_reports/.
To evaluate your custom model using this API, you must first wrap your model. To do so, write a class that inherits from our provided base class:
`fhibe_eval_api.models.base_model.BaseModelWrapper`
Your child class must contain three methods:
- `__init__()`: initializes the model and passes your custom model object to the base class's `__init__()` method.
- `data_preprocessor()`: performs data preprocessing and returns an iterator that yields batches of preprocessed data. Often this is a PyTorch DataLoader, but that is not required.
- `__call__()`: performs a model forward pass on a batch of data, where the batch comes from the data loader returned by `data_preprocessor()`.
See the demos for the required signatures of these methods. Note that some tasks have additional required methods (see demos and task-specific docs below for more details).
Note that while the "custom" models used in the demos are `torch.nn.Module` objects, your custom model may be implemented with any library. Your model's forward pass must be wrapped in `__call__()`, and the only requirement is that it can perform inference on a batch of data provided by the iterator (e.g., data loader) returned from the `data_preprocessor()` method.
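As a rough illustration, a wrapper might look like the following. This is a minimal sketch: the method signatures are simplified (the demos define the exact required signatures), and the `self.model` attribute, the `dataset` parameter, and the batch structure are assumptions.

```python
import torch
from torch.utils.data import DataLoader

from fhibe_eval_api.models.base_model import BaseModelWrapper


class MyModelWrapper(BaseModelWrapper):
    """Hypothetical wrapper around a custom PyTorch model."""

    def __init__(self, model):
        # Pass the custom model object to the base class's __init__().
        super().__init__(model)
        self.model = model  # assumed attribute name

    def data_preprocessor(self, dataset):
        # Return any iterator that yields batches of preprocessed data;
        # a PyTorch DataLoader is a common choice, but not required.
        return DataLoader(dataset, batch_size=32, shuffle=False)

    def __call__(self, batch):
        # Forward pass on one batch yielded by the iterator above
        # (assumes each batch is a tensor of preprocessed images).
        with torch.no_grad():
            return self.model(batch)
```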
Once you have identified a task that best fits your model, see docs/task_specifics.md for specific implementation details.
FHIBE and FHIBE-Face contain over thirty annotated attributes each. For the list of attributes available for bias evaluation with this API, see: docs/supported_attributes.md.
In some cases, it may not be possible to perform model inference using the evaluation API. This can occur if, for example, your model can only be run on specific hardware that the API does not support. In these cases, you can run inference on the FHIBE images using your own hardware, but then use the API to perform the bias evaluation on the model outputs.
The steps to do this are as follows:
- Preprocess the FHIBE images for model inference using a custom pipeline specific to your model.
- Run model inference on the full FHIBE image set, saving the model outputs for each image.
- Format the model outputs to conform with the specifications laid out for the task under the "Model output file format" section of the docs/task_specifics.md page.
  - Usually, this involves creating a single `model_outputs.json` file and putting it in the directory `{results_rootdir}/{task_name}/{dataset_name}/{model_name}/`, where each variable is an input parameter to `evaluate_task`. Note that if `{results_rootdir}` is omitted, the default is a `results/` subdirectory of your project root directory.
  - If your preprocessing steps resized the FHIBE images, you must rescale the model outputs back to the dimensions of the FHIBE images (see the sketch after this list). The dimensions of the FHIBE images can be found in the `fhibe_downsampled.csv` or `fhibe_face_crop_align.csv` metadata files, depending on whether your task uses the FHIBE or FHIBE-Face dataset.
- Set up your evaluation script as normal (referring to the demo for your task as needed), but in your call to `evaluate_task`, ensure that you set the following parameters: `model=None` and `reuse_model_outputs=True`.
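A minimal sketch of such a rescaling step, assuming bounding-box outputs in pixel coordinates (the exact output format depends on your task):

```python
def rescale_box(box, inference_size, original_size):
    """Map a box (x1, y1, x2, y2) predicted at inference_size (w, h)
    back to original_size (w, h) coordinates."""
    sx = original_size[0] / inference_size[0]
    sy = original_size[1] / inference_size[1]
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)

# e.g., a box predicted on a 640x640 resized copy of a 2048x1365 image:
print(rescale_box((100, 50, 320, 240), (640, 640), (2048, 1365)))
```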
Assuming you have formatted your model outputs correctly and placed them where the evaluation API can find them, the evaluation pipeline will skip inference and perform the rest of the pipeline, including bias report generation.
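For example (keyword arguments other than `model` and `reuse_model_outputs` are hypothetical placeholders; the demos define the exact signature):

```python
from fhibe_eval_api.evaluate import evaluate_task

evaluate_task(
    model=None,                        # no model: skip the inference stage
    reuse_model_outputs=True,          # read the saved model_outputs.json instead
    task_name="person_localization",   # hypothetical placeholder
    dataset_name="fhibe_downsampled",  # hypothetical placeholder
    model_name="my_model",             # hypothetical placeholder
    results_rootdir="results/",        # hypothetical placeholder
)
```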
Please raise an issue if your question is not already answered.
This software is distributed under the Apache 2.0 License.
If you use the FHIBE evaluation API and/or the FHIBE dataset in your work, please cite Xiang et al. 2025.

