The different data modalities contained in the dataset. From left to right: pose, semantic segmentation, silhouette and optical flow.
This repository contains the Health&Gait dataset, the first dataset that enables gait analysis from visual information alone, relying solely on cameras rather than specific sensors. The dataset includes multimodal features extracted from the videos, as well as gait parameters and anthropometric measurements for each participant. It is intended for use in health, sports and gait analysis research.
Two examples of the different data types from the dataset for two participants (a) and (b).
Health&Gait consists of 1,564 videos of 398 participants walking in a controlled, closed environment, where each video has the following associated information:
- 2D pose estimation of the joints obtained with AlphaPose (JSON format files; a loading sketch follows this list).
- Semantic segmentation by DensePose (PNG images).
- Optical flow by TVL1 and GMFlow (PNG images).
- Silhouette by YOLOv8 (JPEG images).
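As an illustration of how the pose files might be read, the following is a minimal sketch assuming the standard AlphaPose JSON layout (a list of detections, each with a flat "keypoints" array of x, y, confidence triplets); the file name is a placeholder and the exact layout of the files in the dataset should be checked before use.

import json

import numpy as np

# Placeholder path: adapt it to a pose file from the dataset.
with open("alphapose-results.json", "r") as f:
    detections = json.load(f)

for det in detections:
    # Reshape the flat keypoints list into (num_joints, [x, y, confidence]).
    keypoints = np.array(det["keypoints"]).reshape(-1, 3)
    print(det.get("image_id"), keypoints.shape)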
Moreover, for each subject, the following data has been recorded:
- Anthropometric measurements.
- Gait parameters obtained from OptoGait and MuscleLAB.
- Gait parameters estimated from pose information.
Directory and file scheme of the Health&Gait database.
Attributes in the file participants_measures.csv
Attributes | Description | Unit |
---|---|---|
Sex | Participant sex | 0:female, 1:male |
Age | Participant Age | Years |
PA_level | Level of physical activity | >=3 days: Active, < 3 days: Non active |
Height | Participant height | cm |
Weight | Participant weight | kg |
BMI | Body Mass Index | kg/m2 |
WaistC | Waist circumference | cm |
HipC | Hip circumference | cm |
NeckC | Neck circumference | cm |
Percentage fat mass | The total mass of fat divided by total body mass | % |
Lean mass | The difference between total body weight and body fat weight | kg |
Attributes in the file gait_parameters_estimation.csv.
Attributes | Description | Unit |
---|---|---|
Step_UGS / Step_FGS | The distance between the two toes or heels of the feet in sequence for usual/fast gait speed. | cm |
Stride_UGS / Stride_FGS | The distance between the two toes or heels of sequential strides of the same foot for usual/fast gait speed. | cm |
Cadence_UGS / Cadence_FGS | The number of steps taken per unit of time for usual/fast gait speed. | Steps / min |
MonoSP_UGS / MonoSP_FGS | Time in the swing phase where only one limb is in contact with the ground for usual/fast gait speed. | sec |
BiSP_UGS / BiSP_FGS | Time that both feet are on the ground for usual/fast gait speed. | sec |
Speed_UGS / Speed_FGS | Participant velocity for usual/fast gait speed. | m / s |
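As a minimal illustration, both CSV files can be loaded and joined with pandas; the participant identifier column 'ID' is assumed from the get_data example further below and should be verified against the files.

import pandas as pd

measures = pd.read_csv("participants_measures.csv")    # anthropometric measurements
gait = pd.read_csv("gait_parameters_estimation.csv")   # estimated gait parameters

# Join both tables on the participant identifier (assumed column name: 'ID').
merged = measures.merge(gait, on="ID", how="inner")
print(merged[["ID", "Sex", "Age", "Speed_UGS", "Speed_FGS"]].head())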
The dataset is hosted in the following Zenodo repository.
The first step is to create a Python environment from the requirement.txt file; the use of conda is recommended.
conda create --name <env> --file requirement.txt
conda activate <env>
Check that TensorFlow is installed correctly:
python3 -c "import tensorflow as tf; print(len(tf.config.list_physical_devices('GPU')) > 0)"
The following sections describe how to use the various scripts provided in the repository and recommend a way to load and use the dataset. First, move to the directory within the repository where the scripts are located:
cd scripts/
The create_partitions.sh script partitions the patients into train, validation and test sets in a stratified manner for the sex classification and the weight and age regression tasks.
bash create_partitions.sh -p <patient_measures_file> -o <output_path>
where:
- <patient_measures_file> is the path to the participants_measures.csv file in the dataset.
- <output_path> is the path where to save the partitions.
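For example, a hypothetical invocation (both paths are placeholders) could be:

bash create_partitions.sh -p /path/to/Health_Gait/participants_measures.csv -o ./partitions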
The following script can be used to obtain the results presented in the technical validation sections in the paper:
python technical_validation.py --config '../configfiles/train_configfile.json' --targets {sex, weight, age} --methods {MoviNet, XGBoost, MLP}
where train_configfile.json is a configuration file in which you need to set the values of the following fields (an illustrative example follows this list):
- "data_path": path where you have downloaded the dataset.
- "partitions_path": path used in the previous scripts.
- "save_dir": path where to store the training results.
- "patients_measures": file with the anthropometric data of the participants.
- "gait_parameters": file with the information of the gait parameters.
- "gait_parameters_estimation": file with the information of the estimated gait parameters.
- "hyperparameters_search_space": JSON file where the hyperparameters search space of the XGBoost method is defined.
The --targets and --methods arguments select the target(s) to obtain results for and the methods used to do so.
The following script estimates the gait parameters from the pose information:
bash gait_parameters_estimation.sh -p <patient_measures_file> -s <sensor_bboxes> -e <semantic_segmentation_path> -o <output_csv_path> -k [scale] -f [fps]
where:
- -p is used to indicate the path to the participants_measures.csv file.
- -s is the path to the directory with the bounding boxes of the OptoGait sensors (for more information, contact us).
- -e is the path to the directory with the semantic segmentation.
- -o is the path to the CSV file where we want to store the estimates.
- -k is used to indicate the scale of the scene to obtain real measurements.
- -f is used to indicate the fps rate needed to obtain the gait parameters related to time.
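For example, a hypothetical invocation (all paths, the scale and the fps value are placeholders) could be:

bash gait_parameters_estimation.sh \
    -p /path/to/Health_Gait/participants_measures.csv \
    -s /path/to/sensor_bboxes \
    -e /path/to/Health_Gait/semantic_segmentation \
    -o ./gait_parameters_estimation.csv \
    -k 0.5 \
    -f 30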
Example of a DataGenerator in TensorFlow
To load the multimodal features extracted from the videos (semantic segmentation, silhouette, and optical flow), the use of DataGenerators is recommended.
import os
import random

import cv2
import numpy as np
import tensorflow as tf


class FrameGenerator:
    def __init__(self, videos_names, labels, n_frames, output_size=(224, 224), training=False):
        """Yields sets of frames with their associated label.
        Args:
            videos_names: Paths to the video frame directories.
            labels: Target value associated with each video.
            n_frames: Number of frames to sample per video.
            output_size: Spatial size (height, width) of the returned frames.
            training: Boolean to determine if a training dataset is being created.
        """
        self.videos_names = videos_names
        self.n_frames = n_frames
        self.training = training
        self.labels = labels
        self.output_size = output_size

    def __call__(self):
        pairs = list(zip(self.videos_names, self.labels))
        if self.training:
            # Shuffle the (video, label) pairs so every epoch sees a different order.
            random.shuffle(pairs)
        for video_name, label in pairs:
            video_frames = frames_from_video_file(video_name, self.n_frames, self.output_size)
            yield video_frames, label


def frames_from_video_file(video_file, num_frames, output_size=(224, 224), skip_percent=0.15):
    """Samples num_frames evenly spaced frames from a directory of frame images,
    skipping skip_percent of the frames at the beginning and at the end."""
    frames = sorted(os.listdir(video_file))
    total_frames = len(frames)
    skip_frames = int(total_frames * skip_percent)
    effective_frame_count = total_frames - (2 * skip_frames)
    frame_interval = effective_frame_count // num_frames
    selected_frames = []
    for i in range(num_frames):
        frame_num = skip_frames + int(i * frame_interval)
        frame = frames[frame_num]
        selected_frames.append(format_frames(cv2.imread(os.path.join(video_file, frame)), output_size))
    # OpenCV loads images as BGR; reorder the last axis to RGB.
    frames = np.array(selected_frames)[..., [2, 1, 0]]
    return frames
def format_frames(frame, output_size):
"""
Pad and resize an image from a video.
Args:
frame: Image that needs to be resized and padded.
output_size: Pixel size of the output frame image.
Return:
Formatted frame with padding of specified output size.
"""
frame = tf.image.convert_image_dtype(frame, tf.float32)
frame = tf.image.resize_with_pad(frame, *output_size)
return frame
Get paths and targets for the DataGenerator
To define the DataGenerator, it is first necessary to obtain the paths and targets for the data type to be used:
import json

import pandas as pd


def get_data(PARTITIONS_FILE, DATA_PATH, DATA_TYPE, DATA_CLASS, OPTICAL_FLOW_METHOD, PATIENTS_INFO, TARGET):
    # Load the train/validation/test patient IDs produced by create_partitions.sh.
    with open(PARTITIONS_FILE, 'r') as f:
        partitions_data = json.load(f)
    train_patients = partitions_data["train"]
    validation_patients = partitions_data["validation"]
    test_patients = partitions_data["test"]

    # Anthropometric data; provides the target value (e.g. Sex, Weight, Age) for each patient.
    df = pd.read_csv(PATIENTS_INFO, sep=',')

    data = {
        "train": {"path": [], "class": []},
        "validation": {"path": [], "class": []},
        "test": {"path": [], "class": []}
    }

    def add_data(patient, file_path):
        # Assign the sample to the partition that contains this patient.
        category = ""
        if patient in train_patients:
            category = "train"
        elif patient in validation_patients:
            category = "validation"
        elif patient in test_patients:
            category = "test"
        if category:
            data[category]["path"].append(file_path)
            data[category]["class"].append(df.loc[df['ID'] == patient][TARGET].values[0])

    for patient in os.listdir(DATA_PATH):
        for gait_type in os.listdir(os.path.join(DATA_PATH, patient)):
            directory = os.path.join(DATA_PATH, patient, gait_type)
            # Optical flow is stored in an extra subdirectory named after the method used (TVL1 or GMFlow).
            directory = directory if DATA_TYPE in ['silhouette', 'semantic_segmentation'] else os.path.join(directory, OPTICAL_FLOW_METHOD)
            for class_folder in os.listdir(directory):
                file_path = os.path.join(directory, class_folder)
                if DATA_CLASS == 'both':
                    add_data(patient, file_path)
                else:
                    # Keep only the folders whose name encodes the requested class
                    # (the token after the '-' in the part before the first '_').
                    first_split = class_folder.split("_")
                    second_split = first_split[0].split("-")
                    if second_split[1] == DATA_CLASS:
                        add_data(patient, file_path)
    return data
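For instance, the function could be called as follows; every argument value is an illustrative placeholder that must be adapted to the local dataset layout and the chosen modality:

data = get_data(
    PARTITIONS_FILE="./partitions/sex_partition.json",              # placeholder partition file
    DATA_PATH="/path/to/Health_Gait/semantic_segmentation",         # placeholder data directory
    DATA_TYPE="semantic_segmentation",                               # or 'silhouette' / 'optical_flow'
    DATA_CLASS="both",
    OPTICAL_FLOW_METHOD="GMFlow",                                    # only used for optical flow
    PATIENTS_INFO="/path/to/Health_Gait/participants_measures.csv",
    TARGET="Sex",
)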
Define and use DataGenerators
The following code shows how to define the data generators for the three data splits, making use of the from_generator method:
output_signature = (tf.TensorSpec(shape = (None, None, None, 3), dtype = tf.float32),
tf.TensorSpec(shape = (), dtype = tf.int16))
train_ds = tf.data.Dataset.from_generator(FrameGenerator(data["train"]["path"], data["train"]["class"], NUM_FRAMES, (IMG_SIZE, IMG_SIZE), training = True),
output_signature=output_signature)
train_ds = train_ds.batch(BATCH_SIZE)
val_ds = tf.data.Dataset.from_generator(FrameGenerator(data["validation"]["path"], data["validation"]["class"], NUM_FRAMES, (IMG_SIZE, IMG_SIZE)),
output_signature=output_signature)
val_ds = val_ds.batch(BATCH_SIZE)
test_ds = tf.data.Dataset.from_generator(FrameGenerator(data["test"]["path"], data["test"]["class"], NUM_FRAMES, (IMG_SIZE, IMG_SIZE)),
output_signature=output_signature)
test_ds = test_ds.batch(BATCH_SIZE)
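Optionally, the datasets can also be prefetched so that data loading overlaps with model execution; this is a standard tf.data optimization and not part of the original snippet:

AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.prefetch(AUTOTUNE)
val_ds = val_ds.prefetch(AUTOTUNE)
test_ds = test_ds.prefetch(AUTOTUNE)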
These generators can be used in TensorFlow's fit or predict methods:
results = model.fit(train_ds,
validation_data = val_ds,
epochs = EPOCHS,
validation_freq = 1,
callbacks = callbacks_list,
verbose = 1)
y_preds = model.predict(test_ds)
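A quick way to assess the predictions is to collect the ground-truth labels from the batched test dataset and compare them with the model output. The sketch below assumes a classification head (e.g. sex); for regression targets, compare y_preds directly against y_true instead.

# Gather the ground-truth labels in the same order in which predict consumed them
# (the test generator is not shuffled).
y_true = np.concatenate([labels.numpy() for _, labels in test_ds])
y_pred = np.argmax(y_preds, axis=1)
accuracy = float(np.mean(y_pred == y_true))
print(f"Test accuracy: {accuracy:.3f}")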
The document detailing the computational experiments conducted to perform quality control on the data is below.
Health&Gait Additional Material
Health&Gait is freely available for non-commercial use, but may not be redistributed without the authors' consent. Please see the license for further details.
@article{zafra2024,
author = {Zafra-Palma, Jorge and Marín-Jiménez, Nuria and Castro-Piñero, José and Cuenca-García, Magdalena and Muñoz-Salinas, Rafael and Marín-Jiménez, Manuel J},
title = {Health \& Gait: a dataset for gait-based analysis},
journal = {Scientific Data},
volume = {12},
year = {2025},
number = {1},
issn = {2052-4463},
doi = {10.1038/s41597-024-04327-4}
}
If you have any questions or suggestions, contact us at jzafra@uco.es.