    Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Giving Gemma 3-4B VLAs 15-Minute Context for Complex Tasks

    March 4, 2026
Current end-to-end robotic policies, specifically Vision-Language-Action (VLA) models, typically operate on a single observation or a very short history. This lack of memory makes long-horizon tasks, such as cleaning a kitchen or following a complex recipe, prone to failure, while naively extending the observation window quickly becomes computationally intractable. To address this, researchers from Physical Intelligence, Stanford, UC Berkeley, and MIT have introduced Multi-Scale Embodied Memory (MEM).

Paper: https://www.pi.website/download/Mem.pdf

    The Dual-Scale Memory Architecture

    MEM factorizes robotic memory into two distinct scales to balance semantic context with real-time control constraints.

    (1) Short-Term Video Memory

    For tasks requiring fine-grained spatial awareness—like resolving self-occlusions or adapting a grasp—dense visual data is required. MEM utilizes an efficient video encoder that extends standard Vision Transformers (ViTs). To maintain real-time inference (the 380ms ‘real-time barrier’), the architecture avoids joint attention over all patches. Instead, it uses Space-Time Separable Attention, interleaving spatial attention within frames with causal-temporal attention across frames every fourth layer.
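The interleaving pattern described above can be sketched as follows. This is a minimal NumPy illustration, not the team's implementation: single-head attention with no projections, normalization, or MLP blocks, and the names `encoder`, `num_layers`, and `temporal_every` are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(tokens):
    # tokens: (K, n, d) — within each frame, the n patches attend only to each other.
    scores = tokens @ tokens.swapaxes(-1, -2) / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def causal_temporal_attention(tokens):
    # Each patch position attends across the K timesteps, masked causally
    # so no frame can attend to a future frame.
    K, n, d = tokens.shape
    x = tokens.transpose(1, 0, 2)                  # (n, K, d): per-patch time series
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)   # (n, K, K)
    mask = np.triu(np.ones((K, K)), k=1).astype(bool)
    scores = np.where(mask, -np.inf, scores)       # block attention to the future
    out = softmax(scores) @ x
    return out.transpose(1, 0, 2)                  # back to (K, n, d)

def encoder(tokens, num_layers=8, temporal_every=4):
    # Interleave: causal-temporal attention every 4th layer, spatial otherwise,
    # each applied as a residual update.
    for layer in range(num_layers):
        if (layer + 1) % temporal_every == 0:
            tokens = tokens + causal_temporal_attention(tokens)
        else:
            tokens = tokens + spatial_attention(tokens)
    return tokens
```

Because the temporal attention is causal and the spatial attention is per-frame, perturbing a later frame never changes the representation of an earlier one.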

The computational complexity is reduced from $O(n^2K^2)$ to $O(Kn^2 + nK^2)$, where n is the number of spatial patches per frame and K is the number of timesteps. By dropping tokens from past timesteps in the upper layers, the model passes only the current observation’s representation to the VLA backbone, keeping the token count invariant compared to single-frame models.
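The saving is easy to see by counting attention scores directly. A small arithmetic check (the values n = 256 and K = 16 are illustrative, not from the paper; constant factors and projection costs are ignored):

```python
def joint_pairs(n, K):
    # Joint attention: all n*K tokens attend to all n*K tokens -> O(n^2 K^2).
    return (n * K) ** 2

def separable_pairs(n, K):
    spatial = K * n ** 2    # n x n scores within each of K frames
    temporal = n * K ** 2   # K x K scores for each of n patch positions
    return spatial + temporal  # O(K n^2 + n K^2)

n, K = 256, 16  # e.g. 256 patches per frame, 16 frames
print(joint_pairs(n, K), separable_pairs(n, K))
```

At these sizes the separable scheme computes roughly 15x fewer attention scores than joint attention over all patches.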


    (2) Long-Term Language Memory

    To handle tasks spanning up to 15 minutes, MEM uses a language-based representation for semantic events. The system decomposes the action prediction as:

$$\pi(a_{t:t+H}, l_{t+1}, m_{t+1} \mid o_{t-T:t}, m_{t}, g) \approx \pi_{LL}(a_{t:t+H} \mid o_{t-K:t}, l_{t+1}, g)\,\pi_{HL}(l_{t+1}, m_{t+1} \mid o_{t}, m_{t}, g)$$

Here, a high-level policy ($\pi_{HL}$) maintains a running language summary ($m_t$) of past events and generates subtask instructions ($l_{t+1}$) for a low-level policy ($\pi_{LL}$). This language memory is trained using LLM-generated summaries that compress information (e.g., ‘I placed three bowls’ instead of individual attributes), reducing the risk of training-inference distribution shift.
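The control flow of this factorization can be sketched as a simple loop. The function names and placeholder bodies below are hypothetical stand-ins; in the real system both policies are served by the VLA backbone:

```python
from dataclasses import dataclass

@dataclass
class MemoryState:
    summary: str  # running language summary m_t of past events

def high_level_policy(obs, memory, goal):
    # pi_HL: reads the current observation and language memory, emits the
    # next subtask instruction l_{t+1} and an updated summary m_{t+1}.
    # (Placeholder logic — the real policy is a learned model.)
    subtask = f"next step toward: {goal}"
    updated = MemoryState(summary=memory.summary + f" | did: {subtask}")
    return subtask, updated

def low_level_policy(obs_window, subtask, goal):
    # pi_LL: maps the recent K observations plus the subtask instruction
    # to an action chunk a_{t:t+H}. Here H = 4 placeholder actions.
    return [f"action for '{subtask}'"] * 4

def control_step(obs_window, memory, goal):
    subtask, memory = high_level_policy(obs_window[-1], memory, goal)
    actions = low_level_policy(obs_window, subtask, goal)
    return actions, memory
```

The key design point the sketch captures: only the compact language summary persists across steps, while dense observations are consumed and discarded each control cycle.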


    Implementation and Performance

    The research team integrated MEM into the π0.6 VLA, which is initialized from a pre-trained Gemma 3-4B model. The model was pre-trained on a diverse mixture of robot demonstrations, vision-language tasks, and internet video data.

    Key Results:

    • In-Context Adaptation: MEM enables robots to adapt manipulation strategies based on recent failures. In evaluation, this yielded a +62% increase in success rate when opening refrigerators with unknown hinge directions and a +11% increase when picking up chopsticks at variable heights.
    • Long-Horizon Tasks: The model successfully performed 15-minute tasks like ‘Recipe Setup’ (retrieving ingredients from multiple locations) and ‘Kitchen Cleaning’ (washing dishes and wiping counters). Memory-less VLAs failed these tasks significantly more often.
    • Efficiency: The video encoder allows the model to process up to 16 observation frames (spanning ~1 minute) while remaining under critical real-time inference thresholds on a single NVIDIA H100 GPU.

    MEM demonstrates that combining dense, short-term visual tokens with compressed, long-term language summaries allows VLAs to scale their ‘working memory’ without incurring prohibitive computational costs.
