A GitHub/Git repository template for data science projects for my personal use.
Features:
- A logical, reasonably standardardize, but flexible structure for data science projects
- A Makefile to facilitate initial setup, python virtual environment creation, code execution, etc.
- Automatic code linting and reformatting
Built on Cookiecutter Data Science
- On GitHub, navigate to the main page of the template repository
- Above the file list, click Use this template.
- Type a name for your repository, and an optional description.
- Choose repository visibility as Private or Public.
- Select Include all branches to include the directory structure and files from all branches in the template, and not just the default branch.
- Click Create repository from template.
-
Clone the repository to local machine and navigate to the repository
git clone git@github.com:sulibo/<repository_name>.git cd <repository_name>
-
Change the project name in the
Makefile
file to your project name.Replace
data_science_template
in the linePROJECT_NAME = data_science_template
inMakefile
with your project name. -
Commit and push your changes
git add Makefile git commit -m "initial setup" git push origin main
-
Set up the pre-commit hooks for code style check and create the
data
folder.make init
The pre-commit hooks using flake8 and black will be created to enforce PEP8 code style before every commit.
The
data
folder to store the data will be created. But, it is not tracked by Git as the data usually contains large files which are not suitable to be version controlled by Git/GitHub. It is recommended to use DVC - an open source version control system for machine learning projects to manage the data, model and also the experiments.The structure of the constructed repository is shown in the Project Organization section.
Tip: The Makefile alos provides
make create_environment
andmake requirements
to facilitate virtual environment creatation and requirement mangement. Typemake help
for more options.
This workflow is proposed for feature development or bug fixes. It is constructed based on Github flow with Interactive Rebasing.
The main concepts of GitHub flow are:
- Anything in the
main
branch is deployable. - To work on something new or fix a bug, create a descriptively named branch off of
main
. - Commit to that branch locally and regularly push your work to the same named branch on the server.
- When you need feedback or help, or you think the branch is ready for merging, open a pull request.
- After some else has reviewed and signed off on the feature, you can merge it to
main
. - Once it is merged and pushed to
main
, you can and should deploy immedidately.
See more here.
The main steps are as follows:
-
Clone the repository from GitHub. For subsequent features/changes this step should be ignored.
git clone git@github.com:IXP-Team/<repository_name>.git cd <project directory>
-
Create the python virtual environment. For subsequent features/changes this step should be ignored.
make create_environment
A python virtual environment named after your project name will be created. If you have conda installed, the environment is created by Conda, otherwise virtualenv and virtualenvwrapper will be installed to create the environment.
-
Activate the virtual environment
conda activate <project_name>
or
workon <project_name>
-
Install the required Python modules. For subsequent features/changes this step should be ignored, except when the requirements.txt is changed.
make requirements
-
Checkout a new feature/bug-fix branch.
git checkout -b <branchname> main
-
Make changes and add the files.
git add <file1> <file2> ... git commit
Why:
git add <file1> <file2> ...
- you should add only files that make up a small and coherent change.git commit
will start an editor which lets you separate the subject from the body.Tip:
You could use
git add -p
instead, which will give you chance to review all of the introduced changes one by one, and decide whether to include them in the commit or not. -
Sync with remote to get changes you’ve missed.
git checkout main git pull
Why:
This will give you a chance to deal with conflicts on your machine while rebasing (later) rather than creating a Pull Request that contains conflicts.
-
Update your feature branch with latest changes from develop by interactive rebase.
git checkout <branchname> git rebase -i --autosquash main
Why:
You can use --autosquash to squash all your commits to a single commit. Nobody wants many commits for a single feature in develop branch. read more...
-
If you don’t have conflicts, skip this step. If you have conflicts, resolve them and continue rebase.
git add <file1> <file2> ... git rebase --continue
-
Push your branch. Rebase will change history, so you'll have to use
-f
to force changes into the remote branch. If someone else is working on your branch, use the less destructive--force-with-lease
.git push -f -u origin <branchname>
Why:
When you do a rebase, you are changing the history on your feature branch. As a result, Git will reject normal
git push
. Instead, you'll need to use the -f or --force flag. read more...Note:
Before pushing, make sure your feature branch builds successfully and passes all tests (including code style checks).
The
-u
option is only needed for the first push. -
Make a Pull Request.
-
Pull request will be accepted, merged and closed by a reviewer.
-
Remove your local feature branch if you're done.
git branch -D <branchname>
to remove all branches which are no longer on remote
git fetch -p && for branch in `git branch -vv --no-color | grep ': gone]' | awk '{print $1}'`; do git branch -D $branch; done
-
Deactivate virtual environment if all work is done.
Conda:
conda deactivate
or
Virtualenvwrapper:
deactivate
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
│
├── .pre-commit-config.yaml
│ <- Pre-commit configuration file
│
└── tox.ini <- tox file with settings for running tox; see tox.readthedocs.io
Project based on the cookiecutter data science project template. #cookiecutterdatascience