A comprehensive MARL environment implementing job market dynamics with information asymmetry, costly screening, and employer learning.
HireRL implements the theoretical framework from "Multi-agent Reinforcement Learning benchmark for job market search and matching" as a PettingZoo Parallel environment. The environment models:
- Strategic Agents: Companies learning hiring policies through RL
- Environment: Worker pool with private abilities and public signals
- Information Asymmetry: Workers know true ability σ_j, firms observe noisy signals σ̂_j,0
- Costly Screening: Firms invest to better estimate worker abilities
- Employer Learning: Firms update beliefs based on performance observations
The benchmark targets two research questions:
- Optimal Screening: How should firms choose their level of investment in screening to infer worker ability?
- Greedy vs. Stable Matching: How does myopic profit maximization affect time-to-match compared to stable matching?
Install the dependencies:

```bash
pip install -r requirements.txt
```

Requirements:
- Python >= 3.8
- numpy >= 1.20.0
- gymnasium >= 0.28.0
- pettingzoo >= 1.23.0
- torch >= 2.0.0 (for PPO training)
- tensorboard >= 2.10.0 (for experiment tracking and visualization)
A minimal usage example:

```python
from pettingzoo.hirerl import JobMarketEnv
from pettingzoo.policies import GreedyPolicy  # baseline policies are also available (see below)

# Create environment
env = JobMarketEnv(
    num_companies=3,
    num_workers=10,
    max_workers_per_company=5,
)

# Run an episode with a placeholder policy (every company plays NO_OP = action 0)
observations, infos = env.reset()
for _ in range(100):
    actions = {agent: 0 for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
    if all(terminations.values()) or all(truncations.values()):
        break
```

Train companies to learn optimal hiring policies using Independent PPO (IPPO):
```bash
# Train with default settings (1M steps, 3 companies, 10 workers)
python train_ppo.py

# View training progress in TensorBoard
tensorboard --logdir=runs
```

The PPO implementation includes CleanRL best practices:
- Action Masking: The policy network masks invalid actions before sampling (a minimal sketch follows this list)
- Orthogonal Initialization: Better weight initialization for stability
- TensorBoard Logging: Track episodic returns, losses, entropy, KL divergence
- Learning Rate Annealing: Linear decay over training
- Explained Variance: Monitor value function prediction quality
- Clipped Value Loss: Prevent value function over-updating
- Unique Run IDs: Each training run gets timestamped directory
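As a rough illustration of the action-masking point above, the snippet below shows the standard masked-categorical pattern used in CleanRL-style PPO implementations; it is a sketch of the technique, not an excerpt from train_ppo.py.

```python
import torch
from torch.distributions import Categorical

def masked_categorical(logits: torch.Tensor, mask: torch.Tensor) -> Categorical:
    """Give invalid actions a huge negative logit so they are effectively never sampled."""
    masked_logits = torch.where(mask.bool(), logits, torch.full_like(logits, -1e8))
    return Categorical(logits=masked_logits)

# 1 + 3*N actions for N = 10 workers
logits = torch.randn(1, 31)        # raw policy-head output
mask = torch.zeros(1, 31)
mask[0, 0] = 1                     # NO_OP is always valid
mask[0, 11] = 1                    # e.g. OFFER to worker 0 is valid
dist = masked_categorical(logits, mask)
action = dist.sample()             # always a valid action index
log_prob = dist.log_prob(action)   # used in the PPO loss
```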
Key metrics to monitor in TensorBoard:
- Episodic Returns: Total reward per episode for each company
- Explained Variance: >0.5 indicates good value function (higher is better)
- Entropy: Should gradually decrease (exploration → exploitation)
- Approx KL: Should stay <0.1 (policy update stability)
- Clip Fraction: 0.1-0.3 indicates healthy PPO clipping
Evaluate trained models on new environments:
```bash
# Evaluate a specific run
python evaluate_policy.py --run_name hirerl_ippo_20250113_143022 --n_episodes 20 --save_plots

# Checkpoints and configs are stored in runs/{run_name}/
```

To run the test suite:

```bash
# Run PettingZoo compliance tests
python tests/test_pettingzoo_compliance.py
# Run basic verification tests
python tests/test_simple.py
# Compare baseline policies
python tests/test_baseline_policies.py
```

The environment passes all official PettingZoo tests:
- ✅ Parallel API compliance
- ✅ Seed determinism
- ✅ Action masking validation
- ✅ Observation space consistency
- ✅ Render functionality
Companies can take four types of actions:
- NO_OP (0): Do nothing
- FIRE (1 to N): Fire worker j
- OFFER (N+1 to 2N): Make wage offer to unemployed worker j
- INTERVIEW (2N+1 to 3N): Screen worker j before hiring
Action encoding: Discrete(1 + 3*N) where N = number of workers
Action Masking: Invalid actions are automatically masked:
- FIRE: Only valid if worker is employed by this company
- OFFER: Only valid if worker is unemployed AND company has capacity
- INTERVIEW: Only valid if worker is unemployed
- NO_OP: Always valid
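For illustration, the flat action index can be decoded back into an (action type, worker) pair as follows; the helper below follows the encoding above but is not necessarily the exact code in hirerl.py.

```python
def decode_action(action: int, num_workers: int):
    """Map a flat Discrete(1 + 3*N) index back to (action type, worker index)."""
    N = num_workers
    if action == 0:
        return ("NO_OP", None)
    if action <= N:
        return ("FIRE", action - 1)           # indices 1..N
    if action <= 2 * N:
        return ("OFFER", action - N - 1)      # indices N+1..2N
    return ("INTERVIEW", action - 2 * N - 1)  # indices 2N+1..3N

# With N = 10: index 5 fires worker 4, index 11 offers to worker 0, index 21 interviews worker 0
assert decode_action(5, 10) == ("FIRE", 4)
assert decode_action(11, 10) == ("OFFER", 0)
assert decode_action(21, 10) == ("INTERVIEW", 0)
```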
Each company observes a dictionary containing:
Observation (Box):
- Public Information:
  - σ̂_j,t: Public ability signals for all workers
  - exp_j,t: Experience levels
  - τ_j,t: Tenure (time employed)
  - Employment status and current wages
- Private Information:
  - Belief about each worker's ability (mean and variance)
  - Own workforce and profit

Action Mask (MultiBinary):
- Binary mask indicating valid actions (1) vs. invalid actions (0)
- Enables PPO agents to avoid invalid actions during training
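A gymnasium space with this structure might look like the sketch below; the key names, per-worker feature count, and flattening order are illustrative assumptions rather than the exact layout in hirerl.py.

```python
import numpy as np
from gymnasium import spaces

N = 10                     # number of workers
FEATURES_PER_WORKER = 7    # assumed: signal, experience, tenure, employed, wage, belief mean, belief var
COMPANY_FEATURES = 2       # assumed: own workforce size, cumulative profit
NUM_ACTIONS = 1 + 3 * N    # NO_OP + FIRE + OFFER + INTERVIEW

observation_space = spaces.Dict({
    "observation": spaces.Box(
        low=-np.inf, high=np.inf,
        shape=(N * FEATURES_PER_WORKER + COMPANY_FEATURES,),
        dtype=np.float32,
    ),
    "action_mask": spaces.MultiBinary(NUM_ACTIONS),
})
```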
Each company i receives a per-step reward:

r_i,t = Σ_{j ∈ E_i,t} (p_ij,t - w_ij,t) - c_fire - c_hire - c_screen
where:
- p_ij,t = σ_j + β*log(1 + exp_j,t): Match-specific profit
- w_ij,t: Wage paid to worker j
- c_fire, c_hire, c_screen: Action costs
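As a worked example of this formula, the sketch below computes one company's per-step reward with NumPy; the variable names, β, and the cost values are placeholders.

```python
import numpy as np

def company_reward(abilities, experience, wages, employed_by_me,
                   beta=0.5, fire_cost=0.0, hire_cost=0.0, screen_cost=0.0):
    """r_i,t = sum over j in E_i,t of (p_ij,t - w_ij,t) minus the costs of actions taken this step."""
    p = abilities + beta * np.log1p(experience)        # p_ij,t = sigma_j + beta*log(1 + exp_j,t)
    margin = np.where(employed_by_me, p - wages, 0.0)  # only this company's employees count
    return margin.sum() - fire_cost - hire_cost - screen_cost

# Example: workers 0 and 2 are employed by company i
r = company_reward(
    abilities=np.array([0.8, -0.2, 1.1]),
    experience=np.array([2.0, 0.0, 5.0]),
    wages=np.array([0.5, 0.0, 0.9]),
    employed_by_me=np.array([True, False, True]),
)
```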
Workers have:
- True ability σ_j ~ N(0, 1): Private, static
- Public signal σ̂_j,0 = σ_j + ε: Noisy resume/CV
- Experience exp_j,t: Grows while employed at rate g(σ_j) = g0 + g1*σ_j
- Tenure τ_j,t: Total time employed
- Public signal update: σ̂_j,t = σ̂_j,0 + γ*τ_j,t
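Using those definitions, a one-step update of the worker state can be sketched as follows (g0, g1, and γ are placeholder values):

```python
import numpy as np

def step_worker_state(sigma, sigma_hat_0, exp, tenure, employed,
                      g0=0.1, g1=0.05, gamma=0.02):
    """Advance experience, tenure, and the public signal by one period."""
    growth = np.where(employed, g0 + g1 * sigma, 0.0)  # experience grows only while employed, at g(sigma_j)
    exp = exp + growth
    tenure = tenure + employed.astype(float)           # tenure counts employed periods
    sigma_hat_t = sigma_hat_0 + gamma * tenure         # public signal drifts with accumulated tenure
    return exp, tenure, sigma_hat_t
```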
Firms can pay a screening cost c to obtain a better estimate of a worker's ability:
σ_estimate = σ̂_j,0 + precision(c) * (σ_j - σ̂_j,0) + ε
where precision(c) ∈ [0, 1] increases with cost
Screening technologies available: SQRT (default), LINEAR, LOGARITHMIC, SIGMOID
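A sketch of the screening estimate is shown below; the four precision curves are illustrative guesses at the named technologies, and the exact functional forms live in screening.py.

```python
import numpy as np

PRECISION = {                                          # precision(c) in [0, 1], increasing in c
    "SQRT":        lambda c: np.sqrt(np.clip(c, 0.0, 1.0)),
    "LINEAR":      lambda c: np.clip(c, 0.0, 1.0),
    "LOGARITHMIC": lambda c: np.log1p(np.clip(c, 0.0, 1.0)) / np.log(2.0),
    "SIGMOID":     lambda c: 1.0 / (1.0 + np.exp(-10.0 * (c - 0.5))),
}

def screen(sigma, sigma_hat_0, cost, tech="SQRT", noise_std=0.1, rng=None):
    """sigma_estimate = sigma_hat_0 + precision(c) * (sigma - sigma_hat_0) + noise."""
    rng = rng or np.random.default_rng()
    p = float(np.clip(PRECISION[tech](cost), 0.0, 1.0))
    return sigma_hat_0 + p * (sigma - sigma_hat_0) + rng.normal(0.0, noise_std)
```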
Firms perform Bayesian updating:
- Prior: Initialize beliefs from public signals
- Screening Update: Incorporate interview results
- Performance Update: Update from observed profits
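A minimal conjugate Gaussian update consistent with this description (the observation variances are assumptions; screening.py may parameterize them differently):

```python
def update_belief(mean, var, observation, obs_var):
    """Conjugate Gaussian update of a belief N(mean, var) given a noisy observation.

    The same update covers screening results (obs_var shrinks with screening cost)
    and performance observations (obs_var set by profit noise).
    """
    gain = var / (var + obs_var)                  # how much to trust the new observation
    new_mean = mean + gain * (observation - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var

# Prior from the public signal, then sharpen it with an interview result
mean, var = 0.3, 1.0
mean, var = update_belief(mean, var, observation=0.9, obs_var=0.25)
```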
Workers quit if they observe comparable workers earning higher wages. This creates wage competition.
The environment also implements Gale-Shapley deferred acceptance as a stable-matching baseline for comparison with greedy matching.
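For reference, a compact company-proposing deferred-acceptance routine with one slot per company is sketched below; matching.py handles multi-slot companies, so treat this as a textbook illustration rather than the repo's exact implementation.

```python
def deferred_acceptance(company_prefs, worker_prefs):
    """Company-proposing Gale-Shapley with one slot per company.

    company_prefs[c] = workers in c's preference order;
    worker_prefs[w]  = companies in w's preference order.
    Returns a stable matching as a dict worker -> company.
    """
    rank = {w: {c: i for i, c in enumerate(prefs)} for w, prefs in worker_prefs.items()}
    next_proposal = {c: 0 for c in company_prefs}
    matched = {}                        # worker -> company
    free = list(company_prefs)
    while free:
        c = free.pop()
        if next_proposal[c] >= len(company_prefs[c]):
            continue                    # c has proposed to every worker
        w = company_prefs[c][next_proposal[c]]
        next_proposal[c] += 1
        if w not in matched:
            matched[w] = c              # w tentatively accepts
        elif rank[w][c] < rank[w][matched[w]]:
            free.append(matched[w])     # w prefers c: previous company becomes free again
            matched[w] = c
        else:
            free.append(c)              # w rejects c: c proposes again later
    return matched

pairs = deferred_acceptance(
    company_prefs={"c1": ["w1", "w2"], "c2": ["w1", "w2"]},
    worker_prefs={"w1": ["c2", "c1"], "w2": ["c1", "c2"]},
)  # -> {"w1": "c2", "w2": "c1"}
```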
Project structure:

```
HireRL/
├── pettingzoo/
│   ├── hirerl.py                     # Main environment (with action masking)
│   ├── workers.py                    # Worker pool management
│   ├── screening.py                  # Screening mechanism & Bayesian beliefs
│   ├── matching.py                   # Stable matching algorithms
│   ├── policies.py                   # Baseline policies (action masking compatible)
│   └── utils.py                      # Logging & visualization
├── tests/
│   ├── test_pettingzoo_compliance.py # PettingZoo API tests
│   ├── test_simple.py                # Basic verification
│   └── test_baseline_policies.py     # Policy comparison
├── train_ppo.py                      # IPPO training with CleanRL best practices
├── requirements.txt                  # Dependencies (includes TensorBoard)
├── README.md
└── runs/                             # TensorBoard logs (generated)
```
Six baseline policies for testing (all support action masking):
- RandomPolicy: Random action selection from valid actions only
- GreedyPolicy: Hire best available, fire worst performers
- NoScreeningPolicy: Greedy strategy without interviews
- HighScreeningPolicy: Always screen workers before hiring
- NeverFirePolicy: Only hire, never fire workers
- HeuristicPolicy: Rule-based strategy with screening threshold
All policies automatically respect action masks and only select valid actions.
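For example, a mask-aware random policy only needs to sample among the indices where the mask is 1; a minimal sketch (the actual RandomPolicy interface in policies.py may differ):

```python
import numpy as np

def random_valid_action(action_mask: np.ndarray, rng=None) -> int:
    """Sample uniformly among the actions the environment currently allows."""
    rng = rng or np.random.default_rng()
    valid = np.flatnonzero(action_mask)   # NO_OP (index 0) is always valid
    return int(rng.choice(valid))
```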
- Information Asymmetry: Signaling (Spence 1973) and Screening (Stiglitz 1975)
- Employer Learning: Altonji & Pierret (2001)
- Stable Matching: Gale-Shapley deferred acceptance
- Search Frictions: Diamond-Mortensen-Pissarides framework
- Wage Determination: Mincer equation with ability and experience
MIT License
If you use this environment in your research, please cite:
```bibtex
@article{hirerl2025,
  title={Multi-agent Reinforcement Learning benchmark for job market search and matching},
  author={Zong, Haijing and Zhou, Boyang},
  year={2025}
}
```