
Conversation

@m2kulkarni

No description provided.

@greptile-apps

greptile-apps bot commented Nov 30, 2025

Greptile Overview

Greptile Summary

Implemented per-agent reward conditioning by replacing hardcoded reward values with dynamically sampled weight arrays. The changes enable agents to learn from varied reward functions during training.

Key changes:

  • Replaced static reward_vehicle_collision, reward_offroad_collision, and reward_goal values with per-agent collision_weights[i], offroad_weights[i], and goal_weights[i] arrays (see the sampling sketch after this list)
  • Fixed post-respawn goal reward scaling by multiplying reward_goal_post_respawn constant by goal_weights[i]
  • Corrected configuration bounds so that lb < ub: offroad_weight_lb/offroad_weight_ub now read -0.4/0.0, and discount_weight_lb/discount_weight_ub now read 0.80/0.98
  • Increased num_agents from 512 to 1024 for larger batch training
  • Extended SLURM array to test all 5 conditioning types (0-4 vs 1-2)
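
A minimal sketch of the per-agent sampling described in this list, in C to match the environment code. The Env fields, bound names (collision_weight_lb and friends), and the sample_uniform helper are illustrative assumptions; only the collision_weights[i], offroad_weights[i], and goal_weights[i] arrays come from the diff itself.

```c
#include <stdlib.h>

#define MAX_AGENTS 1024

/* Minimal stand-in for the environment struct; these field names are
 * assumptions made for illustration, not the actual drive.h layout. */
typedef struct {
    int num_agents;
    float collision_weight_lb, collision_weight_ub;
    float offroad_weight_lb,   offroad_weight_ub;
    float goal_weight_lb,      goal_weight_ub;
    float collision_weights[MAX_AGENTS];
    float offroad_weights[MAX_AGENTS];
    float goal_weights[MAX_AGENTS];
} Env;

/* Uniform draw in [lb, ub]; assumes lb < ub, which is why the bound swap in
 * drive.ini (e.g. offroad_weight_lb = -0.4 < offroad_weight_ub = 0.0) matters. */
static float sample_uniform(float lb, float ub) {
    return lb + (ub - lb) * ((float)rand() / (float)RAND_MAX);
}

/* Per-agent conditioning weights, drawn once per reset as in the sequence
 * diagram's c_reset() step. */
void sample_conditioning_weights(Env *env) {
    for (int i = 0; i < env->num_agents; i++) {
        env->collision_weights[i] = sample_uniform(env->collision_weight_lb, env->collision_weight_ub);
        env->offroad_weights[i]   = sample_uniform(env->offroad_weight_lb, env->offroad_weight_ub);
        env->goal_weights[i]      = sample_uniform(env->goal_weight_lb, env->goal_weight_ub);
    }
}
```

Because the weights are drawn at reset, each agent keeps one reward function for the whole episode, which is what allows the policy to condition on them during training.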

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • All changes are internally consistent and mathematically correct. The weight array replacements maintain the same reward computation logic while enabling per-agent conditioning. Configuration bounds were properly corrected to ensure lb < ub. No breaking changes or edge cases detected.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| pufferlib/config/ocean/drive.ini | 5/5 | Updated conditioning weight bounds and increased num_agents from 512 to 1024; swapped lb/ub values correctly for offroad and discount weights |
| pufferlib/ocean/drive/drive.h | 5/5 | Replaced hardcoded reward values with per-agent weight arrays for reward conditioning; correctly scaled post-respawn goal rewards |
| scripts/run.sh | 5/5 | Expanded SLURM array from 1-2 to 0-4 to test all five conditioning types (none, reward, entropy, discount, all) |
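
For reference, a hedged sketch of how the five SLURM array values (0-4) from the scripts/run.sh row could map onto the conditioning types it names. The enum and helper names are assumptions; only the five modes and the 0-4 range appear in the review.

```c
#include <stdbool.h>

/* Illustrative mapping from the SLURM array index (0-4) to a conditioning
 * mode; the enum and helpers are assumptions, only the five mode names and
 * the 0-4 range come from the review above. */
typedef enum {
    COND_NONE     = 0,  /* no conditioning */
    COND_REWARD   = 1,  /* condition on sampled reward weights */
    COND_ENTROPY  = 2,  /* condition on a sampled entropy coefficient */
    COND_DISCOUNT = 3,  /* condition on a sampled discount factor */
    COND_ALL      = 4,  /* condition on all of the above */
} ConditioningType;

static bool conditions_on_reward(ConditioningType t) {
    return t == COND_REWARD || t == COND_ALL;
}

static bool conditions_on_discount(ConditioningType t) {
    return t == COND_DISCOUNT || t == COND_ALL;
}
```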

Sequence Diagram

sequenceDiagram
    participant Config as drive.ini
    participant Reset as c_reset()
    participant Agent as Agent Loop
    participant Step as c_step()
    participant Rewards as Reward Calculation

    Config->>Reset: Load conditioning bounds (lb/ub)
    Reset->>Reset: Sample collision_weights[i]
    Reset->>Reset: Sample offroad_weights[i]
    Reset->>Reset: Sample goal_weights[i]
    Reset->>Agent: Initialize agents with weights
    
    Agent->>Step: Execute agent action
    Step->>Rewards: Check collision state
    alt Vehicle Collision
        Rewards->>Rewards: reward = collision_weights[i]
    else Offroad
        Rewards->>Rewards: reward = offroad_weights[i]
    end
    
    Step->>Rewards: Check goal distance
    alt Goal Reached (Post Respawn)
        Rewards->>Rewards: reward += reward_goal_post_respawn * goal_weights[i]
    else Goal Reached (Generate New)
        Rewards->>Rewards: reward += goal_weights[i]
    else Goal Reached (Stop)
        Rewards->>Rewards: reward = goal_weights[i]
    end
    
    Rewards->>Agent: Return per-agent rewards
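
A simplified C sketch of the branching shown in the diagram. The boolean state flags (hit_vehicle, went_offroad, reached_goal, respawned, stopping) are hypothetical inputs; the weight arrays and the reward_goal_post_respawn constant are the names used in the diagram.

```c
/* Simplified per-agent reward branching, mirroring the alt cases above. */
float compute_reward(const float *collision_weights,
                     const float *offroad_weights,
                     const float *goal_weights,
                     float reward_goal_post_respawn,
                     int i,
                     int hit_vehicle, int went_offroad,
                     int reached_goal, int respawned, int stopping) {
    float reward = 0.0f;

    /* Collision branches: the per-agent sampled weights replace the old
     * hardcoded reward_vehicle_collision / reward_offroad_collision values. */
    if (hit_vehicle) {
        reward = collision_weights[i];
    } else if (went_offroad) {
        reward = offroad_weights[i];
    }

    /* Goal branches, mirroring the three alt cases in the diagram. */
    if (reached_goal) {
        if (respawned) {
            /* post-respawn goal reward is scaled by the agent's goal weight */
            reward += reward_goal_post_respawn * goal_weights[i];
        } else if (stopping) {
            reward = goal_weights[i];
        } else {
            reward += goal_weights[i];
        }
    }
    return reward;
}
```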


@greptile-apps bot left a comment


3 files reviewed, no comments


Collaborator


Could you place the policy in the co-player policies folder?
