Master Academic Journey Overview
Architecture Overview: Semantic-Geometric Memory for Embodied AI
This figure illustrates the proposed framework for an Embodied AI system capable of understanding open-ended human instructions, constructing a hierarchical memory representation, and performing physically robust interactions. The pipeline consists of five interconnected stages:
1. Multimodal Perception & Feature Extraction (Top)
The system takes RGB and Depth streams as input.
- Visual Odometry (VO): Estimates the initial camera pose to establish a spatial reference frame.
- Semantic Extraction: Uses three pretrained vision models, CLIP (Contrastive Language-Image Pre-training), SAM (Segment Anything Model), and DINO, to extract dense, semantically enriched multimodal features from key frames.
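To make the link between the pose estimate and the extracted features concrete, the sketch below lifts a depth pixel into the world frame so that a feature can be tied to a 3D position. It assumes a pinhole camera model and a camera-to-world pose given as a rotation matrix and translation; the intrinsics, the pose, and the function names are illustrative and not taken from the described system.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point, rotation, translation):
    """Apply a camera-to-world pose: p_world = R * p_cam + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical intrinsics and pose, chosen only for illustration.
fx = fy = 500.0
cx, cy = 320.0, 240.0
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

p_cam = backproject(420, 240, 2.0, fx, fy, cx, cy)
p_world = camera_to_world(p_cam, identity, (1.0, 0.0, 0.0))
```

In practice the pose would come from the Visual Odometry estimate and the depth value from the Depth stream; here both are hard-coded.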
2. The Bridge Module & Memory Bank (Center)
The Bridge Module acts as a filter and distributor, processing key frames and their associated poses to construct a comprehensive Memory Bank. This memory is hierarchically organized into three distinct layers:
- Visual Memory: Stores semantic features correlated with spatial positions, enabling the system to “know” what objects look like and what they are.
- Spatial Memory: Generated via a Large Language Model (LLM), this layer abstracts high-level spatial relationships between objects (e.g., “the cup is on the table,” “the chair is next to the desk”).
- Geometry Memory: Utilizes a Feed-Forward Network (FFN) to reconstruct object-level geometric meshes and textures. It incorporates scale information to perform scale correction, ensuring the digital reconstruction matches physical reality.
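One way to picture the three layers is as a single container in which each object carries its visual and geometric data, while relations between objects are kept separately. The sketch below is a minimal illustration under that assumption. All class and field names are hypothetical, and the scale correction assumes a single known metric length (for example, an extent measured from the depth stream) is available for each object.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectEntry:
    """One object as seen by the memory layers (field names are illustrative)."""
    name: str
    position: tuple   # Visual Memory: where the features were observed
    feature: list     # Visual Memory: semantic feature vector
    vertices: list = field(default_factory=list)  # Geometry Memory: mesh vertices


@dataclass
class MemoryBank:
    objects: dict = field(default_factory=dict)    # keyed by object name
    relations: list = field(default_factory=list)  # Spatial Memory: (subject, relation, object)

    def add_object(self, entry):
        self.objects[entry.name] = entry

    def add_relation(self, subject, relation, obj):
        self.relations.append((subject, relation, obj))

    def relations_of(self, name):
        return [r for r in self.relations if name in (r[0], r[2])]


def correct_scale(vertices, reconstructed_length, metric_length):
    """Rescale mesh vertices so a known extent matches its metric size."""
    s = metric_length / reconstructed_length
    return [tuple(s * c for c in v) for v in vertices]


bank = MemoryBank()
cup = ObjectEntry("cup", (1.0, 0.5, 0.8), [0.1, 0.9],
                  vertices=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.5)])
# Assumed: the reconstructed cup is 0.5 units tall, the real one 0.1 m.
cup.vertices = correct_scale(cup.vertices, reconstructed_length=0.5, metric_length=0.1)
bank.add_object(cup)
bank.add_relation("cup", "on", "table")
```

Keeping relations as plain triples mirrors the examples in the text ("the cup is on the table") and makes them easy to pass to a language model as context.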
3. Interaction Pose Regression (Left Pillar)
This module is responsible for planning how to approach an object.
- Based on the interaction object category, it retrieves the specific target from the Memory Bank.
- It combines data from the Visual, Spatial, and Geometry memories to compute candidate interaction viewpoints and screen them for the best one, ensuring the robot approaches the object from a viable angle.
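The screening step can be sketched as "sample, filter, rank". The example below is a simplified 2D stand-in, not the regression module itself: it samples viewpoints on a ring around the object, discards those too close to known obstacles, and picks the one nearest the robot. The ring sampling, the clearance test, and the distance-based ranking are all assumptions made for illustration.

```python
import math


def candidate_viewpoints(center, standoff, count):
    """Sample viewpoints on a horizontal ring around the object, each facing it."""
    cx, cy = center
    views = []
    for k in range(count):
        theta = 2.0 * math.pi * k / count
        x = cx + standoff * math.cos(theta)
        y = cy + standoff * math.sin(theta)
        yaw = math.atan2(cy - y, cx - x)  # heading that points back at the object
        views.append((x, y, yaw))
    return views


def screen_viewpoints(views, obstacles, clearance, robot_xy):
    """Drop viewpoints too close to any obstacle, then prefer the one nearest the robot."""
    free = [
        v for v in views
        if all(math.dist(v[:2], o) >= clearance for o in obstacles)
    ]
    if not free:
        return None
    return min(free, key=lambda v: math.dist(v[:2], robot_xy))


# Hypothetical scene: object at (2, 0), one obstacle at (1, 0), robot at (0, 1).
views = candidate_viewpoints(center=(2.0, 0.0), standoff=1.0, count=4)
best = screen_viewpoints(views, obstacles=[(1.0, 0.0)], clearance=0.5, robot_xy=(0.0, 1.0))
```

A fuller version would also use the Geometry Memory (for example, to reject viewpoints facing a surface that cannot be grasped) and the Spatial Memory (for example, to avoid approaching through a supporting object).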
4. Geometry-Based Force Constraint (Right Pillar)
This module ensures the physical feasibility and stability of the grasp.
- It retrieves the detailed surface geometry (including surface normals) from the Geometry Memory.
- It performs real-time hand localization relative to the object’s surface.
- It applies Force Closure optimization algorithms to refine the grasping pose, so that the grasp is mechanically stable and the object does not slip.
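As a concrete illustration of how surface normals feed a force-closure criterion, the sketch below implements the classic two-contact antipodal test under Coulomb friction: the grasp passes if the line joining the contacts lies inside both friction cones, whose half-angle is atan(mu). This is a check rather than the optimization described above, and the two-finger setting, the friction coefficient, and the function names are assumptions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(v):
    n = math.sqrt(_dot(v, v))
    return tuple(x / n for x in v)


def is_force_closure(p1, n1, p2, n2, mu):
    """Two-contact antipodal test with Coulomb friction coefficient mu.

    p1, p2 are contact points; n1, n2 are outward surface normals.
    """
    half_angle = math.atan(mu)
    axis = _unit(_sub(p2, p1))  # direction from contact 1 to contact 2
    inward1 = tuple(-c for c in _unit(n1))
    inward2 = tuple(-c for c in _unit(n2))
    # Angle between each inward normal and the direction toward the other contact.
    angle1 = math.acos(max(-1.0, min(1.0, _dot(axis, inward1))))
    angle2 = math.acos(max(-1.0, min(1.0, -_dot(axis, inward2))))
    return angle1 <= half_angle and angle2 <= half_angle


# Opposite faces of a box: normals point away from each other along x.
stable = is_force_closure((-0.03, 0.0, 0.0), (-1.0, 0.0, 0.0),
                          (0.03, 0.0, 0.0), (1.0, 0.0, 0.0), mu=0.5)
# Second face tilted by 45 degrees: outside the friction cone for mu = 0.5.
tilted = is_force_closure((-0.03, 0.0, 0.0), (-1.0, 0.0, 0.0),
                          (0.03, 0.0, 0.0), (1.0, 1.0, 0.0), mu=0.5)
```

A refinement step could then search nearby contact pairs and keep the one with the smallest worst-case angle, which is one simple way to turn the check into an optimization.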
5. VLM Inference & Execution (Bottom)
The system is driven by high-level intent through VLM (Vision-Language Model) Inference.
- Instruction Understanding: The VLM processes open-ended Human Instructions.
- Reasoning: It reasons over the Memory Field to identify one or multiple target objects and infers their best interaction poses.
- Execution:
  - Perception: Uses the Gaussian memory field for real-time relocalization.
  - Action: Outputs low-level control commands for Navigation (moving to the target) and Grasping (executing the optimized interaction).
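The step of identifying one or multiple targets can be illustrated with a much simpler stand-in for VLM reasoning: ranking stored object features by cosine similarity to an embedding of the instruction. The sketch below assumes the instruction has already been embedded into the same space as the stored features (as a CLIP-style text encoder would do). The vectors, the threshold, and the function names are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def select_targets(query, memory, threshold):
    """Return names of stored objects whose feature matches the query, best first."""
    scored = [(cosine(query, feat), name) for name, feat in memory.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical 3D feature vectors; real features would be far higher-dimensional.
memory = {
    "cup": [0.9, 0.1, 0.0],
    "mug": [0.8, 0.3, 0.1],
    "chair": [0.0, 0.2, 0.9],
}
query = [1.0, 0.1, 0.0]  # stands in for the embedded instruction
targets = select_targets(query, memory, threshold=0.9)
```

Similarity ranking only covers the lookup. Instructions that depend on relations ("the cup next to the laptop") would also need the Spatial Memory, which is where a VLM or LLM adds value over plain retrieval.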