Extract image file paths from ImageNet by matching category keywords. Useful for creating custom subsets of ImageNet for training or evaluation.
- Python 3.8+
- ImageNet dataset (or a subset) with the standard ILSVRC directory structure:
    ImageNet-Subset/
    ├── LOC_synset_mapping.txt
    ├── LOC_val_solution.csv
    └── ILSVRC/
        ├── ImageSets/
        │   └── CLS-LOC/
        │       ├── train_cls.txt
        │       └── val.txt
        └── Data/
            └── CLS-LOC/
                ├── train/
                │   ├── n01440764/
                │   │   ├── n01440764_10026.JPEG
                │   │   └── ...
                │   └── ...
                └── val/
                    ├── ILSVRC2012_val_00000001.JPEG
                    └── ...
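Before pointing the package at a dataset, it can help to confirm that the directory matches the layout above. The following is a minimal sketch, not part of parseimagenet: the name `find_missing_entries` and the `EXPECTED_ENTRIES` list are illustrative assumptions, built only from the entries shown in the tree.

```python
from pathlib import Path
from typing import List

# Entries copied from the directory tree above. This list and the function
# below are illustrative helpers, not part of the parseimagenet package.
EXPECTED_ENTRIES = [
    "LOC_synset_mapping.txt",
    "LOC_val_solution.csv",
    "ILSVRC/ImageSets/CLS-LOC/train_cls.txt",
    "ILSVRC/ImageSets/CLS-LOC/val.txt",
    "ILSVRC/Data/CLS-LOC/train",
    "ILSVRC/Data/CLS-LOC/val",
]


def find_missing_entries(base_path: Path) -> List[str]:
    """Return the expected files or directories that are absent under base_path."""
    return [
        entry for entry in EXPECTED_ENTRIES
        if not (base_path / entry).exists()
    ]
```

Calling `find_missing_entries(Path('/path/to/your/ImageNet-Subset'))` returns an empty list when every expected entry is present, and the relative paths of the missing ones otherwise.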
pip install parseimagenet

For local development:
git clone https://github.com/MrT3313/Parse-ImageNet.git
pip install -e /path/to/ParseImageNet
# ex: pip install -e /Users/mrt/Documents/MrT/code/computer-vision/ParseImageNet
| Parameter | Type | Default | Alternatives | Description |
|---|---|---|---|---|
| `base_path` | `Path` | - | Any valid directory path | Root path to the ImageNet dataset |
| `preset` | `str` or `None` | `None` | `"birds"`, `"dogs"`, ... via `get_available_presets()` | Predefined keyword list. `None` selects all categories |
| `keywords` | `list` or `None` | `None` | Any list of strings | Custom keyword list. Overrides `preset` when provided |
| `num_images` | `int` | `200` | Any positive integer | Max images to return (capped by availability) |
| `source` | `str` | `"train"` | `"val"` | Data split to sample from |
| `silent` | `bool` | `True` | `False` | Suppresses print output when enabled |
from pathlib import Path
from parseimagenet import get_image_paths_by_keywords
# Set the path to your ImageNet directory
base_path = Path('/path/to/your/ImageNet-Subset')
# ex: /Users/mrt/Documents/MrT/code/computer-vision/image-bank/ImageNet-Subset
# Default: no preset, selects from all categories
image_paths = get_image_paths_by_keywords(base_path=base_path)
# image_paths is a list of Path objects
print(f"Found {len(image_paths)} images")
print(image_paths[:5])
Presets are predefined keyword lists for common categories:
from parseimagenet import get_image_paths_by_keywords # main function
from parseimagenet import get_available_presets, KEYWORD_PRESETS # helpers
# See available presets
print(get_available_presets()) # ['birds', 'dogs', 'wild_canids', 'snakes']
# Access preset keywords directly
print(KEYWORD_PRESETS["birds"])
# Use a specific preset
image_paths = get_image_paths_by_keywords(
    base_path=base_path,
    preset="birds",
    num_images=200
)
Custom keywords override the preset:
Important: All applicable category keywords can be found in the `LOC_synset_mapping.txt` file.
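To check which categories a keyword list would touch, the mapping file can be searched directly. This is a minimal sketch under two assumptions: each line of `LOC_synset_mapping.txt` has the form `<synset id> <comma-separated names>` (for example `n01440764 tench, Tinca tinca`), and matching is a case-insensitive substring test. The helper names are hypothetical, and parseimagenet's own matching rule may differ.

```python
from pathlib import Path
from typing import Dict, Iterable


def load_synset_mapping(mapping_file: Path) -> Dict[str, str]:
    """Parse lines like 'n01440764 tench, Tinca tinca' into {synset_id: names}."""
    mapping = {}
    with open(mapping_file, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # The synset id is the first token; the rest is the description.
            synset_id, _, names = line.partition(" ")
            mapping[synset_id] = names
    return mapping


def find_matching_synsets(
    mapping: Dict[str, str], keywords: Iterable[str]
) -> Dict[str, str]:
    """Keep synsets whose names contain any keyword.

    Assumption: case-insensitive substring matching, chosen for illustration.
    """
    lowered = [k.strip().lower() for k in keywords if k.strip()]
    return {
        synset_id: names
        for synset_id, names in mapping.items()
        if any(k in names.lower() for k in lowered)
    }
```

For example, `find_matching_synsets(load_synset_mapping(base_path / "LOC_synset_mapping.txt"), ["hound"])` lists every synset whose description mentions "hound", which makes it easier to spot keywords that match more broadly than intended.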
image_paths = get_image_paths_by_keywords(
    base_path=base_path,
    keywords=['dog', 'puppy', 'hound'],
    num_images=100
)

By default, images are sourced from the training set. Use source="val" to pull from the validation set instead:
Important: Fetching from the test set is not supported, because the Kaggle competition dataset does not provide ground-truth labels for the test data.
image_paths = get_image_paths_by_keywords(
    base_path=base_path,
    preset="birds",
    num_images=100,
    source="val"
)

# Use default preset (birds)
python -m parseimagenet.ParseImageNetSubset --base_path /path/to/ImageNet-Subset
# Use a specific preset
python -m parseimagenet.ParseImageNetSubset --base_path /path/to/ImageNet-Subset --preset birds --num_images 100
# Use custom keywords (overrides preset)
python -m parseimagenet.ParseImageNetSubset --base_path /path/to/ImageNet-Subset --keywords "dog, puppy" --num_images 100
# Use validation data instead of training data
python -m parseimagenet.ParseImageNetSubset --base_path /path/to/ImageNet-Subset --preset birds --source val --num_images 100
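Because `get_image_paths_by_keywords` returns a list of `Path` objects, a custom subset can be written to disk by copying those files. The sketch below is one possible approach, not a feature of the package. The `copy_images` helper is hypothetical, and keeping each image's parent folder name is an assumption made so that training images stay grouped by synset ID.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def copy_images(image_paths: Iterable[Path], destination: Path) -> List[Path]:
    """Copy images into destination/<parent folder name>/ and return the new paths.

    Hypothetical helper: for training images the parent folder is the synset
    id (e.g. n01440764); for validation images it is simply 'val'.
    """
    copied = []
    for src in image_paths:
        target_dir = destination / src.parent.name
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / src.name
        shutil.copy2(src, target)  # copy2 also preserves file metadata
        copied.append(target)
    return copied
```

A call such as `copy_images(image_paths, Path('/path/to/output-subset'))` then produces a directory that can be loaded like any folder-per-class dataset when the images come from the training split.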