Robotics: Science and Systems

Learning to Evolve: Multi-modal Interactive Fields for Robust Humanoid Navigation in Dynamic Environments

Peifeng Jiang1, Hong Liu1,*, Wenshuai Wang1, Jin Jin2, Xia Li3

1State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
2Oxford Robotics Institute, University of Oxford
3Institute for Machine Learning, Department of Computer Science, ETH Zurich

MIF teaser showing appearance, spatial, and geometry fields.
Multi-modal Interactive Fields couple appearance, spatial memory, and geometry for robust humanoid navigation.

Video

Humanoid Navigation and Adaptation Demo

Main System Demo

A Unitree G1 executes language-conditioned navigation, detects scene changes, updates its memory, and verifies interaction-pose safety (IPS).

Abstract

Robust Scene Memory for Humanoid Robots

Safe manipulation-oriented navigation for humanoid robots requires scene memory that remains reliable under locomotion-induced perceptual distortion, environmental changes, and interaction-level geometric safety constraints. MIF integrates confidence-aware semantic 3D Gaussian Splatting (3DGS), discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction in a closed-loop perception-adaptation pipeline. On a Unitree G1 humanoid in a real dynamic office, MIF improves relocation success from 12% to 94% compared with static scene-graph memory, while reducing semantic memory footprint by 91.4%.

Method

Three Coupled Fields

MIF treats scene memory as a locally revisable system representation and grounds language queries in spatial memory and interaction-ready geometry.

MIF framework pipeline.

Appearance Field

Builds a confidence-aware semantic 3DGS representation and suppresses gait-corrupted primitives during rendering and graph construction.
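As a rough illustration of confidence-based suppression, the sketch below drops primitives whose confidence falls under a threshold before they reach rendering or graph construction. The `Primitive` class, the `suppress_low_confidence` helper, and the 0.5 threshold are all hypothetical; the page does not describe how MIF computes or thresholds confidence.

```python
from dataclasses import dataclass


@dataclass
class Primitive:
    """Hypothetical stand-in for one semantic Gaussian primitive."""
    label: str
    confidence: float  # assumed to lie in [0, 1]


def suppress_low_confidence(primitives, threshold=0.5):
    """Keep only primitives whose confidence reaches the (assumed) threshold."""
    return [p for p in primitives if p.confidence >= threshold]


primitives = [
    Primitive("chair", 0.92),
    Primitive("chair", 0.18),  # e.g. captured during a blurred gait phase
    Primitive("desk", 0.71),
]
kept = suppress_low_confidence(primitives)
```

A hard threshold is only one option; a system could equally down-weight low-confidence primitives instead of discarding them.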

Spatial Field

Maintains topological scene memory and triggers local updates when persistent multi-modal discrepancies indicate relocated, removed, or newly introduced objects.
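One simple way to require that a discrepancy be persistent before it triggers an update is to count consecutive discrepant observations per memory node. The sketch below assumes a scalar discrepancy score per node and frame; the `DiscrepancyTrigger` class, its thresholds, and the scoring itself are illustrative assumptions, not the mechanism used in MIF.

```python
from collections import defaultdict


class DiscrepancyTrigger:
    """Fire a local update only after several consecutive discrepant frames."""

    def __init__(self, score_threshold=0.6, persistence=3):
        self.score_threshold = score_threshold  # assumed per-frame cutoff
        self.persistence = persistence          # assumed number of frames
        self._streak = defaultdict(int)

    def observe(self, node_id, score):
        """Record one observation; return True when an update should fire."""
        if score >= self.score_threshold:
            self._streak[node_id] += 1
        else:
            # A single consistent frame resets the streak.
            self._streak[node_id] = 0
        if self._streak[node_id] >= self.persistence:
            self._streak[node_id] = 0  # start over after triggering
            return True
        return False


trigger = DiscrepancyTrigger(score_threshold=0.6, persistence=3)
scores = [0.9, 0.8, 0.2, 0.7, 0.9, 0.95]
fired = [trigger.observe("mug", s) for s in scores]
```

Here the update fires only on the last frame, because the low score in the third frame interrupts the first streak.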

Geometry Field

Recovers object-centric meshes on demand and verifies terminal humanoid poses through interaction-pose safety checks.
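A minimal sketch of a geometric safety check on a terminal pose, assuming the pose is summarized by a few sampled body points and the recovered mesh by its vertices. The `pose_is_safe` helper and the 0.05 m margin are hypothetical, and a vertex-only distance test is a simplification: a fuller check would measure distance to the mesh surface.

```python
import math


def pose_is_safe(body_points, mesh_vertices, margin=0.05):
    """Return True if every body point keeps at least `margin` metres
    from every mesh vertex (vertex-only approximation)."""
    return all(
        math.dist(b, v) >= margin
        for b in body_points
        for v in mesh_vertices
    )


# Two vertices of a hypothetical table edge, in metres.
table_vertices = [(1.0, 0.0, 0.7), (1.0, 0.5, 0.7)]

safe_pose = [(0.8, 0.0, 0.7)]      # 0.20 m from the nearest vertex
unsafe_pose = [(0.97, 0.0, 0.7)]   # 0.03 m from the nearest vertex

safe_result = pose_is_safe(safe_pose, table_vertices)
unsafe_result = pose_is_safe(unsafe_pose, table_vertices)
```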

Results

Real-World Dynamic Office Evaluation

The full ROS1 system runs on a centralized RTX 4090 workstation and communicates with a Unitree G1 humanoid during navigation and interaction trials.

94% IPS success
91.4% semantic memory reduction
0% observed collision rate
0.12 m mean terminal error
IPS success evidence from point-cloud and mesh comparison.
IPS Success

Dense geometry makes interaction poses safer

Real-time result showing memory reduction from feature distillation.
Memory Reduction

Feature distillation keeps memory practical

Pure Pursuit navigation error evidence.
Navigation Error

Stable tracking for humanoid navigation
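For reference, the textbook Pure Pursuit rule steers toward a lookahead point with curvature 2y / L², where y is the point's lateral offset in the robot frame and L is the distance to it. The sketch below implements only that standard formula; the function name and the planar `(x, y, yaw)` pose convention are assumptions, and the page does not specify how the controller is configured on the Unitree G1.

```python
import math


def pure_pursuit_curvature(pose, target):
    """Curvature (1/m) of the arc from `pose` through `target`.

    pose   -- (x, y, yaw) in the world frame, yaw in radians
    target -- (x, y) lookahead point in the world frame
    """
    x, y, yaw = pose
    dx, dy = target[0] - x, target[1] - y
    # Rotate the offset into the robot frame (x forward, y left).
    local_x = math.cos(yaw) * dx + math.sin(yaw) * dy
    local_y = -math.sin(yaw) * dx + math.cos(yaw) * dy
    dist_sq = local_x ** 2 + local_y ** 2
    if dist_sq == 0.0:
        return 0.0  # already at the target; no turn required
    return 2.0 * local_y / dist_sq


straight = pure_pursuit_curvature((0.0, 0.0, 0.0), (1.0, 0.0))
left_turn = pure_pursuit_curvature((0.0, 0.0, 0.0), (1.0, 1.0))
right_turn = pure_pursuit_curvature((0.0, 0.0, 0.0), (1.0, -1.0))
```

Multiplying the curvature by the commanded forward speed gives the angular velocity to track the arc.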

Dynamic adaptation result corresponding to collision rate.
Collision Rate

Local memory updates avoid obsolete collision-prone paths

Citation

BibTeX

@inproceedings{jiang2026mif,
  title={Learning to Evolve: Multi-modal Interactive Fields for Robust Humanoid Navigation in Dynamic Environments},
  author={Jiang, Peifeng and Liu, Hong and Wang, Wenshuai and Jin, Jin and Li, Xia},
  booktitle={Robotics: Science and Systems},
  year={2026}
}