Used datasets:
Model building process:
- Download the required datasets.
- Create a catalog file (pandas dataframe) to hold information about the downloaded files.
- Split and re-organize files into train and test sets for each of the following:
  - Species-based splitting: cat / dog classifier
  - Breed-based splitting: cat breed classifier + dog breed classifier
- Train and optimize each of the models described above using neural networks and transfer learning.
- Classify any image using a web app integrated with the saved models.
First, clone the GitHub repo using the following command, and then navigate to the newly created directory:
git clone https://github.com/Shlomigreen/pet-breed-classifier
cd pet-breed-classifier
In your chosen CLI, with Python 3.8 or above installed, run the following command to install the requirements:
pip install -r requirements.txt
The datasets for this project are downloaded from Kaggle. The project includes an automated script that downloads all files, but you will first need to create a Kaggle token:
- Go to Kaggle
- Login or create a new account
- Navigate to Account > Create API Token and save the file somewhere safe on your computer.
- Copy the file path (if working on Google Colab, upload the file first and then copy its path).
Run src/download-files.py with the path to the kaggle.json token as follows:
python3 src/download-files.py <TOKEN PATH>
This will create (by default) a new directory named data/ that holds both datasets in their original file tree structure:
├── data
│   ├── cats-and-dogs-breeds-classification-oxford-dataset
│   │   ├── annotations
│   │   │   └── annotations
│   │   │       ├── trimaps
│   │   │       └── xmls
│   │   └── images
│   │       └── images
│   └── microsoft-catsvsdogs-dataset
│       └── PetImages
│           ├── Cat
│           └── Dog
All scripts in the project work by updating information in what is called a catalog file.
The catalog is simply a .csv file that holds information about the files of the downloaded datasets.
Most of the files are expected to be images, and they were collected into the catalog from a known path (as provided on each dataset's page).
Included information (per file record):
- dataset: the source dataset (int).
- species: whether the file is labeled as dog or cat in the source datasets (str).
- breed: breed id as given in the source dataset (int).
- breed_name: literal name of the breed (string.title format).
- dir_path: relative path to the directory holding the file (str).
- file_name: name of the file, including extension (str).
- full_path: concatenation of dir_path and file_name (str).
By default, the catalog file is created at info/catalog.csv after running the catalog generating script:
python3 src/create-catalog.py
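To make the record layout concrete, here is a minimal sketch of how one catalog row could be assembled and written as CSV. It is not the code in src/create-catalog.py: the project uses a pandas dataframe, while this sketch sticks to the standard library, and the helper names (make_record, to_csv) and the example values are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath

# Column order follows the field list in this README.
FIELDS = ["dataset", "species", "breed", "breed_name",
          "dir_path", "file_name", "full_path"]


def make_record(dataset, species, breed, breed_name, dir_path, file_name):
    """Build one catalog row; full_path is dir_path joined with file_name."""
    return {
        "dataset": dataset,
        "species": species,
        "breed": breed,
        "breed_name": breed_name.title(),  # string.title format
        "dir_path": dir_path,
        "file_name": file_name,
        "full_path": str(PurePosixPath(dir_path) / file_name),
    }


def to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical example values, not taken from the real catalog.
record = make_record(
    dataset=1,
    species="cat",
    breed=1,
    breed_name="abyssinian",
    dir_path="data/cats-and-dogs-breeds-classification-oxford-dataset/images/images",
    file_name="Abyssinian_1.jpg",
)
catalog_csv = to_csv([record])
```

Keeping full_path as a derived column means later scripts can open a file from a single field without re-joining paths.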
After the catalog has been created, a virtual pre-processing step can be run to detect non-images and wrongly classified images. This will add a new column to the catalog file:
- is_image: indicates whether the file can be opened as an image and is correctly labeled (boolean).
python3 src/pre-processing.py
Note that every file is checked, so this process takes a bit of time.
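The is_image check could be approximated as follows. This is a rough sketch under stated assumptions, not the code in src/pre-processing.py: it only inspects the leading bytes of each file for a JPEG or PNG signature using the standard library, and it does not attempt the label check. The names SIGNATURES, looks_like_image and add_is_image are hypothetical.

```python
import tempfile
from pathlib import Path

# Leading bytes ("magic numbers") of two common image formats.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}


def looks_like_image(path):
    """Return True if the file starts with a known image signature."""
    try:
        with open(path, "rb") as handle:
            header = handle.read(8)
    except OSError:
        return False  # missing or unreadable files are treated as non-images
    return any(header.startswith(sig) for sig in SIGNATURES.values())


def add_is_image(records):
    """Return copies of the records with an added boolean is_image field."""
    return [dict(rec, is_image=looks_like_image(rec["full_path"])) for rec in records]


# Demo with two throwaway files: one with a PNG header, one with plain bytes.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "pet.png"
    bad = Path(tmp) / "Thumbs.db"
    good.write_bytes(SIGNATURES["png"] + b"rest of file")
    bad.write_bytes(b"not an image")
    checked = add_is_image([{"full_path": str(good)}, {"full_path": str(bad)}])
```

A signature check is much cheaper than decoding every image, but it will miss truncated or corrupted files that a full decode would catch.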
The split script uses a CLI to split files and reorganize them into train and test sets. The split is done by adding new column(s) to the catalog file that indicate, for each model, whether a record belongs to the train set or the test set.
python3 src/split_data.py [-h] [-b BY] [-s] [-o] [-u] {breed,species}
There are 3 ways to split and organize the files:
- For species classification (cat / dog): a boolean column named species_train will be created.
  - split: python3 src/split_data.py -s species
  - organize: python3 src/split_data.py -o species
  - restore original file locations: python3 src/split_data.py -o -u species
- For cat breed classification: a boolean column named cat_train will be created.
  - split: python3 src/split_data.py -s -b cat breed
  - organize: python3 src/split_data.py -o -b cat breed
  - restore original file locations: python3 src/split_data.py -o -u -b cat breed
- For dog breed classification: a boolean column named dog_train will be created.
  - split: python3 src/split_data.py -s -b dog breed
  - organize: python3 src/split_data.py -o -b dog breed
  - restore original file locations: python3 src/split_data.py -o -u -b dog breed
When running the split and organize commands for either species or breed, new sub-folders will be created under the data directory, including the species name, train, test and a directory for each breed.
Example of the directory tree after running the split and organize commands on cat breeds:
data/
├── cat
│   ├── test
│   │   ├── Abyssinian
│   │   ├── Bengal
│   │   ├── Birman
│   │   ├── Bombay
│   │   ├── British Shorthair
│   │   ├── Egyptian Mau
│   │   ├── Maine Coon
│   │   ├── Persian
│   │   ├── Ragdoll
│   │   ├── Russian Blue
│   │   ├── Siamese
│   │   └── Sphynx
│   └── train
│       ├── Abyssinian
│       ├── Bengal
│       ├── Birman
│       ├── Bombay
│       ├── British Shorthair
│       ├── Egyptian Mau
│       ├── Maine Coon
│       ├── Persian
│       ├── Ragdoll
│       ├── Russian Blue
│       ├── Siamese
│       └── Sphynx
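The layout in this tree can be expressed as a small path-building helper. This is a sketch only, assuming the destination is data/<species>/<train|test>/<breed_name>/<file_name> as the tree suggests; breed_destination and the example rows are hypothetical and are not taken from src/split_data.py.

```python
from pathlib import PurePosixPath


def breed_destination(record, train_column, root="data"):
    """Return the target path: <root>/<species>/<train|test>/<breed_name>/<file_name>."""
    subset = "train" if record[train_column] else "test"
    return PurePosixPath(root, record["species"], subset,
                         record["breed_name"], record["file_name"])


# Hypothetical catalog rows for illustration.
rows = [
    {"species": "cat", "breed_name": "Bengal",
     "file_name": "Bengal_7.jpg", "cat_train": True},
    {"species": "cat", "breed_name": "Maine Coon",
     "file_name": "Maine_Coon_3.jpg", "cat_train": False},
]
targets = [str(breed_destination(row, "cat_train")) for row in rows]
```

Because the destination depends only on catalog columns, restoring the original locations would amount to moving each file back to its recorded full_path.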
Note that when cloning the repo, the split is already done in all three ways, and its information is contained in the catalog file, using the defaults test_size=.1 and random_state=42.
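As an illustration of what a seeded 90/10 split into a boolean train column looks like, here is a minimal standard-library sketch. It is an assumption-laden stand-in: the parameter names test_size and random_state are reused from the defaults quoted in this README, but assign_train_flags is hypothetical and will not reproduce the split stored in the catalog.

```python
import random


def assign_train_flags(n_records, test_size=0.1, random_state=42):
    """Return one boolean per record: True for train, False for test."""
    rng = random.Random(random_state)  # seeded, so the split is reproducible
    indices = list(range(n_records))
    rng.shuffle(indices)
    n_test = round(n_records * test_size)
    test_indices = set(indices[:n_test])
    return [i not in test_indices for i in range(n_records)]


# 200 hypothetical records: 20 land in the test set, 180 in the train set.
flags = assign_train_flags(200)
```

Fixing the seed is what allows the split to be stored once in the catalog and reused by every later step.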
If needed, the functions used by each command can be imported into a notebook or another Python script:
from src.split_data import split_species, split_breed, organize_species, organize_breed