---
marp: true
theme: default
size: 16:9
paginate: true
math: katex
---

Reward optimisation

Rosa Sensat internship at CVC - Research Journal

Guim Casadellà Cors
Tutor: Joan Serrat
04-2024


Semantic segmentation


Pixel-wise Classification


Real Datasets

Urban Environments

Cityscapes

Mapillary Vistas


Real datasets

EasyPortrait


Issue: Hard and costly to annotate


Synthetic datasets

Ground Truth is "free" to obtain


UDA vs ADA

Unsupervised domain adaptation: no target labels available, so do self-training (sketched below)

  1. train model with synthetic source
  2. make predictions on real target images
  3. select the "best" predictions as ground truth = pseudo-labels
  4. retrain the model, go to 2.

Active domain adaptation: ask for a few target labels

  1. train model with synthetic source
  2. select a few promising$^*$ target samples to be annotated by a human
  3. retrain the model
  4. if the annotation budget is exhausted, stop; else go to 2.

$^*$promising = with good chances to improve performance (mIoU)
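As a concrete illustration of the self-training loop, a minimal Python sketch (not the project's pipeline; `train_fn`, `predict_fn`, the confidence test and the threshold are hypothetical stand-ins):

```python
def uda_self_training(model, source_set, target_images,
                      train_fn, predict_fn, rounds=3, conf_thresh=0.9):
    """Sketch of UDA self-training; train_fn and predict_fn are supplied by the caller."""
    model = train_fn(model, source_set)                   # 1. train on the synthetic source
    for _ in range(rounds):
        pseudo_set = []
        for x in target_images:
            probs = predict_fn(model, x)                  # 2. per-pixel class probabilities
            labels, conf = probs.argmax(0), probs.max(0)
            if conf.mean() > conf_thresh:                 # 3. keep only confident predictions
                pseudo_set.append((x, labels))            #    and use them as pseudo-labels
        model = train_fn(model, source_set + pseudo_set)  # 4. retrain and repeat
    return model
```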


Goal: Mean Intersection Over Union


  • mIoU: the IoU averaged over all classes

Optimization of Non-Differentiable Functions
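For reference, a small NumPy sketch of how mIoU can be computed from integer label maps; the hard argmax labels are what make the metric non-differentiable:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```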


Tuning Computer Vision Models with Task Rewards



Tuning Computer Vision Models with Task Rewards

Some definitions

  • $y$ and $x$ are the ground truth and input vectors, respectively.
  • Dataset of $N$ training examples, sampled from a distribution $D$.
  • $\theta$ are the parameters describing a neural network.
  • $P(y|x, \theta)$: probability of predicting $y$ from $x$ given that the NN has parameters $\theta$.
  • $R(x, y)$ is the evaluation of the reward function.
  • $\nabla_{\theta}$ denotes the usual gradient computation of a NN.

Tuning Computer Vision Models with Task Rewards

  • Align model predictions and intended usage via reward optimisation.
  • REINFORCE's well-known log-derivative trick.
    • Learn $P(y|x, \theta) \rightarrow \max_{\theta} \mathbb{E}_{x \sim D} \left[ \mathbb{E}_{y \sim P(\cdot | x, \theta)} R(x, y) \right]$

Approach:

  1. Model pretraining with maximum-likelihood estimation.
  2. Model tuning for the task by maximizing a related reward with the REINFORCE algorithm.

Tuning computer vision models with task rewards | Full paper link


Monte Carlo Gradient Estimator: Log Derivative Trick

  • Provides a way to estimate the gradient of the expected reward for a given input x:
    • $\nabla_{\theta} \mathbb{E}_{y \sim P} \left[ R(x, y) \right] = \mathbb{E}_{y \sim P} \left[ R(x, y) \nabla_{\theta} \log P(y|x; \theta) \right]$
  • The right-hand side can be estimated without bias as an average of per-example gradients.
  • Implemented in the model's loss function (toy example below).
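A toy PyTorch sketch of the estimator on a 5-way categorical distribution (the reward here is an arbitrary illustration, not mIoU): weight $\log P$ of each sample by its reward and let autograd produce an unbiased estimate of $\nabla_{\theta} \mathbb{E}[R]$.

```python
import torch

logits = torch.zeros(5, requires_grad=True)          # θ of a toy categorical P(y|θ)
reward = lambda y: (y == 3).float()                  # illustrative reward function

dist = torch.distributions.Categorical(logits=logits)
y = dist.sample((1024,))                             # Monte Carlo samples y ~ P
# Score-function (log-derivative) surrogate: its gradient is an unbiased
# estimate of ∇_θ E[R(y)].
surrogate = (reward(y) * dist.log_prob(y)).mean()
surrogate.backward()
print(logits.grad)                                   # ≈ ∇_θ E[R(y)]
```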

Loss Function Involving Reward Optimization

MLE Optimization Step

function batch_loss(θ, x, y):
    # n is the size of a mini-batch.
    return (1/n) * Σ(log P(yᵢ|xᵢ; θ))
end function

function step_mle(θ, x, y, α):
    G_mle := ∇θ batch_loss(θ, x, y)
    return θ + αG_mle   # gradient ascent on the mean log-likelihood
end function

Reward Optimization Step

function batch_loss(θ, x, y, r):
    # r is the vector of per-example (baseline-corrected) rewards.
    return (1/n) * Σ(rᵢ log P(yᵢ|xᵢ; θ))
end function

function step_reward(θ, x, α):
    y_sample := batch_sample(θ, x)     # sample predictions from the current model
    y_baseline := batch_sample(θ, x)   # second, independent sample used as a baseline
    r := R(x, y_sample) - R(x, y_baseline)
    G_r := ∇θ batch_loss(θ, x, y_sample, r)
    return θ + αG_r   # gradient ascent on the expected reward
end function
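Below, a hedged PyTorch sketch of what `step_reward` can look like for segmentation (an illustration of the algorithm, not the paper's or the project's code): `model` is assumed to return logits of shape (B, C, H, W) and `reward_fn` to return one reward per image, e.g. mIoU against the available ground truth.

```python
import torch

def step_reward(model, optimizer, images, reward_fn):
    """One reward-optimisation step on a batch of images (sketch)."""
    logits = model(images)                                    # (B, C, H, W)
    dist = torch.distributions.Categorical(logits=logits.permute(0, 2, 3, 1))
    y_sample = dist.sample()                                  # (B, H, W) sampled label maps
    y_baseline = dist.sample()                                # second sample, used as baseline
    with torch.no_grad():
        r = reward_fn(images, y_sample) - reward_fn(images, y_baseline)   # (B,)
    # REINFORCE surrogate: reward-weighted mean log-probability per image.
    logp = dist.log_prob(y_sample).flatten(1).mean(dim=1)     # (B,)
    loss = -(r * logp).mean()     # minimising -surrogate = ascending the expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```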

Reduction of Variance

  • The method can suffer from high variance, hurting overall performance.
  • Some variance reduction techniques include:
    • Increasing the number of samples (batch size, nº of GPUs): variance scales as $O(\frac{1}{N})$
    • Subtracting a baseline $B$ independent of the sample $y$: the estimator stays unbiased because $\mathbb{E}_{y \sim P}\left[ B \, \nabla_{\theta} \log P(y|x; \theta) \right] = 0$
    • A rolling-mean baseline? (see the sketch below)

Monte Carlo Gradient Estimation in Machine Learning | Full paper link
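As one possible reading of the "rolling mean" idea (an assumption, not the project's implementation): keep an exponential moving average of past batch rewards and subtract it from the current rewards before weighting the gradient.

```python
class EMABaseline:
    """Exponential moving average of past batch rewards (illustrative)."""
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = None

    def update(self, reward_mean):
        if self.value is None:
            self.value = reward_mean
        else:
            self.value = self.momentum * self.value + (1 - self.momentum) * reward_mean
        return self.value

# usage: r_centred = r - baseline.update(r.mean())
```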



Papers with code


Encoder - Decoder Structure

  • Encoder: Extracts features (e.g., ResNet);
  • Decoder: Reconstructs output using these features, often with upsampling.

DeepLabv3

  • Widespread use in numerous research papers
  • Medium size: fits in the available GPUs' memory ($\le 12$ GB)
  • Atrous Spatial Pyramid Pooling (ASPP)
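For context, a generic usage sketch of DeepLabv3 with a ResNet-50 encoder from torchvision (not the project's MMSegmentation setup; assumes torchvision ≥ 0.13 for the `weights` keywords):

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# DeepLabv3 head (with ASPP) on top of a ResNet-50 encoder; 19 classes as in Cityscapes.
model = deeplabv3_resnet50(weights=None, weights_backbone=None, num_classes=19).eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 512, 512))["out"]   # logits of shape (1, 19, 512, 512)
```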


Implementation overview

  • Implemented a custom MMSegmentation decode-head module which inherits from the base decode head (rough sketch below).

  • Includes reward and baseline computation when performing a loss forward step in train mode.

  • Interacts with the new reward optimisation loss function.
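A rough sketch of what such a module can look like. The registry and base-class names below follow MMSegmentation 1.x conventions (`MODELS`, `ASPPHead`, `loss_by_feat`) and may differ from the actual project code; the reward hook is only schematic.

```python
from mmseg.registry import MODELS
from mmseg.models.decode_heads import ASPPHead  # DeepLabv3's decode head


@MODELS.register_module()
class RewardASPPHead(ASPPHead):      # hypothetical class name
    """Decode head that adds reward and baseline computation to the training loss step."""

    def loss_by_feat(self, seg_logits, batch_data_samples):
        # Let the parent class build the usual loss dict from logits vs. labels...
        losses = super().loss_by_feat(seg_logits, batch_data_samples)
        # ...then the per-image rewards r_x and the baseline r_b would be computed here
        # (see the next slides) and passed to the reward-optimisation loss.
        return losses
```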


Reward and baseline computation

Reward

  • Implemented in MMSegmentation as a new Loss Function
  • $R(x_i, y_i) = 1 - \frac{\sum_c \mathrm{IoU}_c(x_i, y_i)}{n_{\text{classes}}}$, so the loss weight $\to 0$ as $\mathrm{mIoU} \to 1$

Baseline

  • Given a loss forward step on $k$ batched images:
    • $r_x = R(x, y)$: vector of per-image rewards
    • $r_b = \frac{\sum{r_x} - r_x}{k-1}$: baseline, the mean reward of the other images in the batch
    • $r = r_x - 0.1\,r_b$: reward vector fed to the custom loss function (sketch below)
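A small PyTorch sketch of that computation; `rewards` stands for the vector $r_x$ of per-image rewards already obtained from the IoUs:

```python
import torch

def baseline_corrected_rewards(rewards, scale=0.1):
    """rewards: tensor of shape (k,), one R(x_i, y_i) per image in the batch."""
    k = rewards.numel()
    r_b = (rewards.sum() - rewards) / (k - 1)   # mean reward of the *other* images
    return rewards - scale * r_b                # r = r_x - 0.1 * r_b

print(baseline_corrected_rewards(torch.tensor([0.2, 0.4, 0.6, 0.8])))
```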

Custom loss Implementation

Reward optimization step

function batch_loss(θ, x, y, r):
    return (1/n) * Σ(rᵢ log P(yᵢ|xᵢ; θ))
end function

function step_reward(θ, x, α):
    y_sample := batch_sample(θ, x)
    y_baseline := batch_sample(θ, x)
    r := R(x, y_sample) - R(x, y_baseline)
    G_r := ∇θ batch_loss(θ, x, y_sample, r)
    return θ + αG_r
end function

Implementation

function batch_loss(θ, x, y, r):
    return r·cross_entropy(θ, x, y)
end function

where cross_entropy is MMSegmentation's wrapper around torch.nn.functional.cross_entropy (standalone sketch below).
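A minimal standalone sketch of the same idea, using torch.nn.functional.cross_entropy directly instead of the MMSegmentation wrapper:

```python
import torch
import torch.nn.functional as F

def reward_weighted_ce(logits, target, r):
    """logits: (B, C, H, W); target: (B, H, W) integer labels; r: (B,) per-image rewards."""
    ce = F.cross_entropy(logits, target, reduction="none")  # per-pixel CE, shape (B, H, W)
    per_image = ce.flatten(1).mean(dim=1)                   # mean CE per image, shape (B,)
    return (r * per_image).mean()                           # reward-weighted loss

# shape check with random data
loss = reward_weighted_ce(torch.randn(2, 19, 64, 64),
                          torch.randint(0, 19, (2, 64, 64)),
                          torch.tensor([0.9, 0.3]))
```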


Experimentation



DeepLabv3 progress


Degrees of freedom

  • Scheduler: Initial Learning Rate, evolution, ...
  • Model Structure: With or without an auxiliary head.
  • Baseline: Definition of $r_b$.
  • Steps: What checkpoint to use and how many steps to take.
  • Weights: Use of weights in loss computation.

Scheduler

  • Best config: $1e-4 \rightarrow 1e-6$
  • Lower the starting learning rate (LR) to "overcome" the change in the model's structure and loss function
  • Perform some variance reduction by also lowering the final LR (see the config sketch below)
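An illustrative MMSegmentation-style config fragment (assuming the 1.x `param_scheduler` syntax; the optimizer settings and iteration counts are placeholders, only the 1e-4 → 1e-6 range comes from the experiments):

```python
# Placeholder values except for the LR range.
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=1e-4, momentum=0.9, weight_decay=5e-4))
param_scheduler = [
    dict(type='PolyLR',      # polynomial decay from 1e-4 down to 1e-6
         eta_min=1e-6,
         power=0.9,
         begin=0,
         end=40000,
         by_epoch=False),
]
```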

Model Structure

  • Overall better performance than the original trend line
  • The auxiliary-head variant has the roughest start because it faces more changes
  • Both configurations show high variance

Baseline


Steps


Weight computation

  • $C_c$: Class counts
  • $T_c$: Total counts
  • $C_w$: Class weights

$$C_w = \frac{1}{C_c + \text{CLS\_SMOOTH} \cdot T_c}$$

$$C_w \leftarrow \frac{N \cdot C_w}{\sum{C_w}}$$
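A NumPy sketch of the computation above; note the assumptions made here for illustration: the smoothing term uses the total count $T_c$, $N$ is taken as the number of classes, and the `CLS_SMOOTH` value is only an example.

```python
import numpy as np

CLS_SMOOTH = 1e-2                      # example value; the actual hyperparameter may differ

def class_weights(class_counts):
    """class_counts: per-class counts C_c accumulated over the training set."""
    c_c = np.asarray(class_counts, dtype=np.float64)
    t_c = c_c.sum()                               # T_c (assumed smoothing term)
    c_w = 1.0 / (c_c + CLS_SMOOTH * t_c)          # rarer classes get larger weights
    return len(c_w) * c_w / c_w.sum()             # normalise so the weights sum to N classes

print(class_weights([1_000_000, 50_000, 5_000]))
```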


Weights


Additional Results and Future Work

  • We have tried the same configuration on the EasyPortrait dataset, achieving similar results.
  • In the future, we plan to apply the approach to the Mapillary Vistas dataset, which starts from a lower mIoU since it is a more challenging task.
  • Repeat some of the experiments to establish the mean improvement of each result and its variance.

Conclusions

  • This work tries to fine-tune a model that already reaches very high accuracy and competes against extensively tuned baselines. That leaves little room for improvement, and the changes in the model's behaviour make it very hard to re-optimise its hyperparameters for significant gains.
  • The key paper in the project also doesn't help, as it provides no details on the implementation of the new algorithm. This makes it difficult to apply those ideas to a new problem.
  • As seen, variance is a significant problem in this project. However, the proposed approach seems to address it, to some extent.
  • Overall, the general goal has been accomplished, as some configurations perform somewhat better than the original model. However, the lack of time and computing power makes it difficult to further tune the model and try to widen the gap through reward optimization.