Steerable Imitation Learning Under Perturbations

Generate smooth walking motions for the robot to mimic, then steer it in any direction, even under external forces!

Skills

Reinforcement Learning, Adversarial Imitation Learning, Deep Learning, Curriculum Learning, Motion Retargeting
Frameworks: Isaac Gym, PyTorch
Methods: AMP, ADD, Humanoid Control, Force Robustness

Summary

Can adversarial motion-imitation methods produce humanoid locomotion policies that remain stable under external force perturbations? And what happens when an additional reward for tracking a user-commanded velocity vector is added? This project answers these questions. Using the Unitree G1 in Isaac Gym, we compare Adversarial Motion Priors (AMP) and the Adversarial Differential Discriminator (ADD) under curriculum-based wrist forces that simulate box carrying. ADD is further extended with steering objectives to enable directional control. The results reveal a clear trade-off: ADD learns faster and is more force-robust, while AMP produces higher motion fidelity but degrades under sustained disturbances.

Humanoid locomotion policy demonstrating robust motion under external forces and steerable control.
Project Report · Slides · GitHub

Problem Motivation

Motion imitation produces impressive gaits in simulation, but achieving that smoothness through manual reward tuning is tedious. Moreover, policies must remain robust to the forces they will encounter in the real world.

Contribution

  • Motion retargeting pipeline (AMASS, LAFAN → Unitree G1 kinematics)
  • Custom box-carrying dataset generation with force
  • Force curriculum learning framework
  • AMP vs ADD adversarial discriminator comparison
  • Steerable ADD policy with velocity and heading rewards
  • PPO training with 4096 parallel Isaac Gym environments

Scale: 2–3B samples, RTX 4090, 15–20 GPU hours per training run.
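The force curriculum above can be sketched as a simple stage scheduler. This is a minimal illustration, not the project's actual implementation: the stage semantics and the planar-force sampling are assumptions, with only the [10, 10, 30] stage magnitudes taken from the report.

```python
import numpy as np

def sample_wrist_force(step, total_steps, stages=(10.0, 10.0, 30.0)):
    """Progressive force curriculum: the maximum wrist force (N) steps
    through the curriculum stages as training advances.
    Hypothetical sketch; only the [10, 10, 30] stages come from the report."""
    stage = min(int(3 * step / total_steps), 2)  # current curriculum stage
    max_force = stages[stage]
    # Random planar force direction, magnitude capped by the current stage
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    magnitude = np.random.uniform(0.0, max_force)
    return magnitude * np.array([np.cos(theta), np.sin(theta), 0.0])
```

At each environment reset (or resampling interval), the returned vector would be applied to the wrist bodies, so early training sees gentle pushes and later training the full 30 N disturbance.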

Key Technical Insights

Insight: ADD vs AMP Tradeoff

ADD optimizes faster and tolerates disturbance better due to differential discrimination, while AMP preserves motion style but struggles with root drift under sustained perturbations. This fundamental difference stems from ADD’s explicit tracking of pose differences rather than absolute pose matching.
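The architectural difference boils down to what each discriminator sees. The sketch below is an assumption-level illustration of the input construction only (no discriminator network): AMP scores absolute state transitions, while ADD scores the difference between the policy's pose and the time-aligned reference pose.

```python
import numpy as np

def amp_disc_input(s, s_next):
    """AMP: the discriminator scores absolute state transitions (s, s'),
    so matching the reference style is implicit in the distribution."""
    return np.concatenate([s, s_next], axis=-1)

def add_disc_input(s_policy, s_ref):
    """ADD: the discriminator scores the pose *difference* between the
    policy and the time-aligned reference, making tracking error explicit."""
    return s_policy - s_ref
```

Because ADD's input shrinks toward zero as tracking improves, the learning signal directly penalizes drift; AMP's transition-matching objective leaves root drift under-penalized when the local style still looks plausible.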

Quantitative Results

| Method | Force Curriculum | Episode Length | Body Pos Error |
|--------|------------------|----------------|----------------|
| ADD    | [10, 10, 30]     | 260.5 s        | 0.015          |
| AMP    | [10, 10, 30]     | 265.2 s        | 0.045          |

ADD maintains a markedly lower body position error under force thanks to its explicit differential tracking, at episode lengths comparable to AMP's.

Steerable Policy Extension

We extended ADD with velocity and heading rewards for directional control. The policy successfully tracked commanded velocities but required careful reward balancing to preserve motion style, a concrete instance of the multi-objective optimization challenge in imitation learning, where style preservation and task performance compete.
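A typical form for such steering terms is an exponential tracking reward on commanded velocity plus one on heading. The sketch below is hypothetical: the weights and exact functional form are assumptions, since the report only states that both terms required balancing against the imitation objective.

```python
import numpy as np

def steering_reward(root_vel_xy, heading, cmd_vel_xy, cmd_heading,
                    w_vel=0.6, w_head=0.4):
    """Velocity + heading tracking reward (hypothetical weights and form).
    Both terms peak at 1 when the command is tracked exactly."""
    r_vel = np.exp(-np.sum((cmd_vel_xy - root_vel_xy) ** 2))
    # Wrap the heading error into [-pi, pi] before scoring
    dh = (cmd_heading - heading + np.pi) % (2.0 * np.pi) - np.pi
    r_head = np.exp(-dh ** 2)
    return w_vel * r_vel + w_head * r_head
```

In training, this reward would be summed with the adversarial imitation reward; tilting the weights toward the steering terms improves command tracking at the cost of motion style, which is exactly the trade-off observed.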

Engineering Takeaways

This project demonstrates:

  • Systems Design: Built force curriculum learning framework with progressive difficulty scheduling
  • Algorithm Analysis: Diagnosed discriminator behavior differences between AMP and ADD architectures
  • Scalable Infrastructure: Implemented RL pipeline in Isaac Gym with 4096 parallel environments
  • Failure Analysis: Identified overcompensation failure modes through qualitative analysis
  • Problem Solving: Proposed curriculum and objective scheduling fixes based on empirical findings