Steerable Imitation Learning Under Perturbations

Generate smooth walking motions for the robot to mimic, then steer it in any direction, even under external forces!

Skills

Reinforcement Learning, Adversarial Imitation Learning, Deep Learning, Curriculum Learning, Motion Retargeting
Frameworks: Isaac Gym, PyTorch
Methods: AMP, ADD, Humanoid Control, Force Robustness

Summary

Can adversarial motion-imitation methods produce humanoid locomotion policies that remain stable under external force perturbations? And what happens when an additional reward for tracking a user-commanded velocity vector is added? This project answers these questions. Using the Unitree G1 in Isaac Gym, we compare Adversarial Motion Priors (AMP) and the Adversarial Differential Discriminator (ADD) under curriculum-based wrist forces that simulate box carrying. ADD is further extended with steering objectives to enable directional control. The results reveal a clear trade-off: ADD learns faster and is more force-robust, while AMP produces higher motion fidelity but degrades under sustained disturbances.

Humanoid locomotion policy demonstrating robust motion under external forces and steerable control.
Project Report · Slides · GitHub

Problem Motivation

Motion imitation produces impressive gaits in simulation, but achieving that smoothness through manual reward tuning is tedious. Moreover, policies must remain robust to the forces they will encounter in the real world.

Contribution

  • Motion retargeting pipeline (AMASS, LAFAN → Unitree G1 kinematics)
  • Custom box-carrying dataset generation with force
  • Force curriculum learning framework
  • AMP vs ADD adversarial discriminator comparison
  • Steerable ADD policy with velocity and heading rewards
  • PPO training with 4096 parallel Isaac Gym environments

Scale: 2–3B samples, RTX 4090, 15–20 GPU hours per training run.
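The force curriculum above can be sketched as a simple stage scheduler. This is a minimal illustration, not the project's actual implementation: the stage semantics and the planar-force sampling are assumptions, with only the [10, 10, 30] stage magnitudes taken from the report.

```python
import numpy as np

def sample_wrist_force(step, total_steps, stages=(10.0, 10.0, 30.0)):
    """Progressive force curriculum: the maximum wrist force (N) steps
    through the curriculum stages as training advances.
    Hypothetical sketch; only the [10, 10, 30] stages come from the report."""
    stage = min(int(3 * step / total_steps), 2)  # current curriculum stage
    max_force = stages[stage]
    # Random planar force direction, magnitude capped by the current stage
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    magnitude = np.random.uniform(0.0, max_force)
    return magnitude * np.array([np.cos(theta), np.sin(theta), 0.0])
```

At each environment reset (or resampling interval), the returned vector would be applied to the wrist bodies, so early training sees gentle pushes and later training the full 30 N disturbance.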

Key Technical Insights

Insight: ADD vs AMP Tradeoff

ADD optimizes faster and tolerates disturbance better due to differential discrimination, while AMP preserves motion style but struggles with root drift under sustained perturbations. This fundamental difference stems from ADD’s explicit tracking of pose differences rather than absolute pose matching.
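The architectural difference boils down to what each discriminator sees. The sketch below is an assumption-level illustration of the input construction only (no discriminator network): AMP scores absolute state transitions, while ADD scores the difference between the policy's pose and the time-aligned reference pose.

```python
import numpy as np

def amp_disc_input(s, s_next):
    """AMP: the discriminator scores absolute state transitions (s, s'),
    so matching the reference style is implicit in the distribution."""
    return np.concatenate([s, s_next], axis=-1)

def add_disc_input(s_policy, s_ref):
    """ADD: the discriminator scores the pose *difference* between the
    policy and the time-aligned reference, making tracking error explicit."""
    return s_policy - s_ref
```

Because ADD's input shrinks toward zero as tracking improves, the learning signal directly penalizes drift; AMP's transition-matching objective leaves root drift under-penalized when the local style still looks plausible.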

Quantitative Results

| Method | Force Curriculum | Episode Length | Body Pos Error |
|--------|------------------|----------------|----------------|
| ADD    | [10, 10, 30]     | 260.5 s        | 0.015          |
| AMP    | [10, 10, 30]     | 265.2 s        | 0.045          |

ADD maintains a markedly lower body position error under force thanks to its explicit differential tracking, at episode lengths comparable to AMP's.

Steerable Policy Extension

We extended ADD with velocity and heading rewards for directional control. The policy successfully tracked commanded velocities but required careful reward balancing to preserve motion style, a concrete instance of the multi-objective optimization challenge in imitation learning, where style preservation and task performance compete.
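A typical form for such steering terms is an exponential tracking reward on commanded velocity plus one on heading. The sketch below is hypothetical: the weights and exact functional form are assumptions, since the report only states that both terms required balancing against the imitation objective.

```python
import numpy as np

def steering_reward(root_vel_xy, heading, cmd_vel_xy, cmd_heading,
                    w_vel=0.6, w_head=0.4):
    """Velocity + heading tracking reward (hypothetical weights and form).
    Both terms peak at 1 when the command is tracked exactly."""
    r_vel = np.exp(-np.sum((cmd_vel_xy - root_vel_xy) ** 2))
    # Wrap the heading error into [-pi, pi] before scoring
    dh = (cmd_heading - heading + np.pi) % (2.0 * np.pi) - np.pi
    r_head = np.exp(-dh ** 2)
    return w_vel * r_vel + w_head * r_head
```

In training, this reward would be summed with the adversarial imitation reward; tilting the weights toward the steering terms improves command tracking at the cost of motion style, which is exactly the trade-off observed.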

Engineering Takeaways

This project demonstrates:

  • Systems Design: Built force curriculum learning framework with progressive difficulty scheduling
  • Algorithm Analysis: Diagnosed discriminator behavior differences between AMP and ADD architectures
  • Scalable Infrastructure: Implemented RL pipeline in Isaac Gym with 4096 parallel environments
  • Failure Analysis: Identified overcompensation failure modes through qualitative analysis
  • Problem Solving: Proposed curriculum and objective scheduling fixes based on empirical findings