A Template to create better research projects, faster.

dennis-n-schneider/research_template


Research Template

This template repository provides a configuration-management and dataset-reproduction system for research projects.
It revolves around modularizing the software system into its components, making experiments easier to reproduce, manage and log.
The project grew out of pitfalls encountered in past research projects, in particular the difficulty of understanding measurements months after running them.

Installation

Run the following command to install the environment:

.make/install

Setup for Usage

The system can be used after initializing it:

source enter

Getting Started

1. Creating Components

After entering the environment with source enter, we can start building the system's architecture component by component. Assume that components are structured similarly to the following:

├── model
│   ├── attributors
│   │   ├── nlp
│   │   ├── vision
│   ├── gans
│   │   ├── nlp
│   │   ├── vision
├── dataset
│   ├── imagenet

This architecture assumes, as is often the case in a research context, that for a given component of the software architecture we want to compare multiple implementations against each other.
The directory levels are referred to in the following as topic (e.g. model, dataset), optionally type (e.g. attributors, gans) and name (the name of the concrete component implementation).
To initialize this structure, the command-line script add_component can be used to create multiple components with ease. Executing tree on the repository shows the created components, including their configuration files mirroring the source file structure:

├── configs
│   ├── base.yaml
│   ├── config.yaml
│   ├── model
│   │   ├── attributors
│   │   │   ├── base.yaml
│   │   │   ├── nlp.yaml
│   │   │   ├── vision.yaml
├── src
│   ├── __init__.py
│   ├── model
│   │   ├── attributors
│   │   │   ├── __init__.py
│   │   │   ├── base.py
│   │   │   ├── builder.py
│   │   │   ├── nlp.py
│   │   │   ├── vision.py

It immediately becomes apparent that each implementation has a corresponding configuration file (configs/model/attributors/base.yaml for src/model/attributors/base.py). We use this configuration file to instantiate an entire object, by specifying all arguments of the object's constructor in the configuration file.
Since keeping a changing constructor and its configuration file in sync can be cumbersome, executing make or make configs is sufficient to regenerate the configuration files of all source files we have changed. Thus, after specifying a constructor in the generated .py files, the configuration file is generated:

# src/model/attributors/base.py
from torch.nn import Module

class BaseModel(Module):

    def __init__(self, p_dropout, hidden_dim, use_softmax):
        super().__init__()
        self.p_dropout = p_dropout
        self.hidden_dim = hidden_dim
        self.use_softmax = use_softmax
# configs/model/attributors/base.yaml

type: BaseModel
p_dropout: ???
hidden_dim: ???
use_softmax: ???
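The regeneration performed by make configs can be sketched as follows. This is an illustrative reimplementation, not the template's actual code: it reads the constructor signature with inspect.signature and emits one ??? placeholder per argument, matching the generated file above.

```python
import inspect

def config_skeleton(cls):
    """Build a config dict with one '???' placeholder per constructor argument."""
    params = inspect.signature(cls.__init__).parameters
    skeleton = {"type": cls.__name__}
    for name, param in params.items():
        # 'self' and a **kwargs argument never appear in the generated config.
        if name == "self" or param.kind == inspect.Parameter.VAR_KEYWORD:
            continue
        skeleton[name] = "???"
    return skeleton

class BaseModel:
    def __init__(self, p_dropout, hidden_dim, use_softmax):
        pass

print(config_skeleton(BaseModel))
# {'type': 'BaseModel', 'p_dropout': '???', 'hidden_dim': '???', 'use_softmax': '???'}
```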

We can now fill in the default configuration for each component. This is simply the out-of-the-box configuration, which can later be overwritten in the respective experiments.
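For instance, the generated placeholders could be replaced with out-of-the-box defaults (the values below are purely illustrative):

```yaml
# configs/model/attributors/base.yaml
type: BaseModel
p_dropout: 0.1
hidden_dim: 256
use_softmax: true
```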

2. A look into the generated Config-files

Each config file consists of a type, which declares the class to instantiate, and a set of parameters to fill in the constructor. Note that a **kwargs argument will never appear here, while any manually added key will automatically be passed to the constructor via the expected **kwargs behavior.
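A minimal sketch of how such a config can drive instantiation, assuming a simple class registry (the names REGISTRY and instantiate are hypothetical, not the template's API): the type key selects the class, and every remaining key is forwarded to the constructor, so manually added keys land in **kwargs.

```python
class Attributor:
    def __init__(self, p_dropout, **kwargs):
        self.p_dropout = p_dropout
        self.extra = kwargs  # manually added config keys end up here

# Hypothetical registry mapping the 'type' field to classes.
REGISTRY = {"Attributor": Attributor}

def instantiate(config):
    """Instantiate the class named by 'type', passing all other keys to the constructor."""
    conf = dict(config)  # don't mutate the caller's config
    cls = REGISTRY[conf.pop("type")]
    return cls(**conf)

obj = instantiate({"type": "Attributor", "p_dropout": 0.3, "temperature": 2.0})
print(obj.extra)  # {'temperature': 2.0}
```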

3. Bundling Components into stand-alone Experiments

Since different experiments in a research context consist of different architecture combinations, the template offers an easy interface for creating new, independent experiments, which can easily be logged, evaluated and stashed if need be.
The config directory configs/experiment is scanned for YAML files, and any files found are suggested on the command line upon entering run e<TAB><TAB> or run experiment=<TAB>. An experiment configuration is defined in the following way:

# Path to all components used in this experiment.
defaults:
    - model/attributors: nlp.yaml
    - model/gans: nlp.yaml

# Explicit overwriting of certain parameters.
model:
    attributors:
        p_dropout: 0.3

To define a default system configuration, the same approach is recommended in the configs/base.yaml file.

4. Bringing everything together: The complete System

After the entire architecture has been created, the project can be combined in the run script. Note that imports should be performed inside the main function to avoid slowing down Hydra's auto-completion.
The individual components can now conveniently be built from the config using the respective builder.py modules.

from src.model.attributors import build_attributor
from src.model.gans import build_gan

...

model_conf = config["model"]
attributor = build_attributor(model_conf.pop("attributors"))
gan = build_gan(model_conf.pop("gans"))

# Now use these components in a reasonable way. In an ML context, this would
# typically mean concatenating them into a Sequential model and running it
# within a Solver object, itself instantiated from a config. Such a Solver
# would hold a learning rate, a loss function, an optimizer name, ...

5. Running and Logging of results

As already introduced, the entire pipeline can be run using one of the following commands:

run e<TAB><TAB>
# or
run experiment=<TAB>

This will list all available experiments, which can conveniently be selected and run. In the end, the results, logs and configurations are saved to outputs/<Date>/<Time>.
The log level defaults to DEBUG and uses the standard logging module. The configuration that was used can be found in <output_dir>/.hydra/config.yaml.

6. Reproducing your results

After having used this template, the input configurations, the runtime logs and the end results are all saved to their corresponding directories. Have we successfully accomplished full reproducibility? Not quite!
We have not yet talked about processing and retrieving our datasets. This is another feature of this template and is easily explained. After deciding which datasets to use for the research project, add a corresponding directory for each of them to the data directory. Running make or make <directory_name> will automatically create a predefined directory structure for each dataset:

├── data
│   ├── dataset1
│   │   ├── get_original_data.sh
│   │   ├── original
│   │   ├── preprocess.py
│   │   ├── preprocessed

As soon as you change the get_original_data.sh file, running make will execute the script, which should populate the original directory with the raw data files. Afterwards, every file in original that does not yet have a counterpart in preprocessed is piped through the preprocess.py script and saved to preprocessed.
Another execution of make on an already processed dataset will not run anything. Note that preprocess.py by default uses all available hardware cores to process the dataset in parallel.
