SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation

ICRA 2026

Taisei Hanyu*,1       Nhat Chung*,2       Huy Le2       Toan Nguyen2      
Yuki Ikebe1       Anthony Gunderman1       Duy Nguyen Ho Minh3,7,8       Khoa Vo1      
Tung Kieu4       Kashu Yamazaki5       Chase Rainwater1       Anh Nguyen6       Ngan Le1              

*Equal contributions  
1University of Arkansas   2FPT Software AI Center  
3University of Stuttgart   4Aalborg University  
5Carnegie Mellon University   6University of Liverpool  
7German Research Center for Artificial Intelligence (DFKI)  
8Max Planck Research School for Intelligent Systems (IMPRS-IS)  
Scene overview


Object–relation-centric multitask robotic manipulation for efficient, interpretable control.
– LIBERO+: Object-annotated benchmark for structured reasoning.
– SlotVLA: Slot-based model for action generation from compact object–relation representations.

Abstract

Inspired by how humans reason over discrete objects and their relationships, we explore whether compact object-centric and object-relation representations can form a foundation for multitask robotic manipulation. Most existing robotic multitask models rely on dense embeddings that entangle both object and background cues, raising concerns about both efficiency and interpretability. In contrast, we study object–relation-centric representations as a pathway to more structured, efficient, and explainable visuomotor control. Our contributions are two-fold. First, we introduce LIBERO+, a fine-grained benchmark dataset designed to enable and evaluate object-relation reasoning in robotic manipulation. Unlike prior datasets, LIBERO+ provides object-centric annotations that enrich demonstrations with box- and mask-level labels as well as instance-level temporal tracking, supporting compact and interpretable visuomotor representations. Second, we propose SlotVLA, a slot-attention–based framework that captures both objects and their relations for action decoding. It uses a slot-based visual tokenizer to maintain consistent temporal object representations, a relation-centric decoder to produce task-relevant embeddings, and an LLM-driven module that translates these embeddings into executable actions. Experiments on LIBERO+ demonstrate that object-centric slot and object-relation slot representations drastically reduce the number of required visual tokens, while providing competitive generalization. Together, LIBERO+ and SlotVLA provide a compact, interpretable, and effective foundation for advancing object–relation-centric robotic manipulation.

Key Contributions

LIBERO+ augments demonstrations with structured object-centric annotations across RGB and depth:

  • Bounding boxes: 2D spatial anchors for object localization and token extraction.
  • Object masks: Pixel-level segmentation to preserve object boundaries and avoid background entanglement.
  • Temporal instance IDs: Consistent object identities across frames (e.g., plate1, plate2) for long-horizon tracking.
  • Masked depth maps: Per-pixel depth within object regions to encode occlusion and relative distance for gripper–object reasoning.
  • Task-relevant objects: Objects explicitly referenced in the task (e.g., robot, bowl, cabinet).
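To make the annotation types above concrete, the following is a minimal sketch of how one per-frame object record could be held in memory. The class name `ObjectAnnotation`, its field names, the box convention, and the helpers `masked_depth` and `box_from_mask` are all hypothetical and chosen for illustration; they are not the actual LIBERO+ file schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ObjectAnnotation:
    """Hypothetical per-frame record for one object (not the real LIBERO+ schema)."""
    instance_id: str                 # temporally consistent ID, e.g. "plate1"
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), max exclusive
    mask: List[List[bool]]           # H x W binary segmentation mask
    task_relevant: bool              # True if the task instruction references it


def masked_depth(depth, mask, fill=0.0):
    """Keep depth values inside the object mask; set everything else to `fill`."""
    return [
        [d if m else fill for d, m in zip(depth_row, mask_row)]
        for depth_row, mask_row in zip(depth, mask)
    ]


def box_from_mask(mask) -> Optional[Tuple[int, int, int, int]]:
    """Tightest box around the True pixels of a mask, or None if the mask is empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# Toy 3 x 2 frame: the object occupies the right column of the top two rows.
mask = [[False, True], [False, True], [False, False]]
depth = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
plate = ObjectAnnotation("plate1", box_from_mask(mask), mask, task_relevant=True)
plate_depth = masked_depth(depth, plate.mask)
```

Keeping the instance ID on every record means a tracker can join records across frames by ID alone, without re-matching boxes or masks.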

LIBERO+ contains four subsets (Goal, Object, Spatial, and Long), each emphasizing different manipulation challenges while maintaining structured object–relation annotations. We retain the native action labels but remove redundant no-op actions, which reduces idle frames and improves alignment between object dynamics and supervision. The result is a compact, semantically grounded benchmark designed for efficient and interpretable object–relation-centric VLA reasoning. Table I presents the statistical summary of LIBERO+, constructed from four subsets: LIBERO-Object (L-Object), LIBERO-Goal (L-Goal), LIBERO-Spatial (L-Spatial), and LIBERO-Long (L-Long).
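The no-op removal step can be sketched as a simple filter over a demonstration. The criterion below is an assumption made for illustration, since the text does not spell out the exact rule: an action counts as a no-op when all of its motion components are near zero and its gripper command matches the previous action's. The layout (motion components first, gripper command last), the threshold `eps`, and the function names `is_noop` and `drop_noops` are likewise hypothetical.

```python
def is_noop(action, prev_action, eps=1e-4):
    """Assumed criterion: no arm motion and no change in the gripper command."""
    motion_is_zero = all(abs(a) < eps for a in action[:-1])
    if prev_action is None:
        # First step of the episode: there is no earlier gripper command to compare.
        return motion_is_zero
    return motion_is_zero and action[-1] == prev_action[-1]


def drop_noops(frames, actions, eps=1e-4):
    """Return (frames, actions) with no-op steps removed, keeping them aligned."""
    kept_frames, kept_actions = [], []
    prev = None
    for frame, action in zip(frames, actions):
        if not is_noop(action, prev, eps):
            kept_frames.append(frame)
            kept_actions.append(action)
        prev = action  # compare against the previous action of the original sequence
    return kept_frames, kept_actions


# Toy episode: 6 motion components followed by one gripper command.
actions = [
    [0, 0, 0, 0, 0, 0, -1],    # idle at start
    [0.1, 0, 0, 0, 0, 0, -1],  # arm moves
    [0, 0, 0, 0, 0, 0, -1],    # idle
    [0, 0, 0, 0, 0, 0, 1],     # gripper command changes
    [0, 0, 0, 0, 0, 0, 1],     # idle
]
frames = [0, 1, 2, 3, 4]
kept_frames, kept_actions = drop_noops(frames, actions)
```

Steps where only the gripper command changes are kept in this sketch, because they carry supervision even though the arm does not move.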

Novelty of LIBERO


SlotVLA: an object-relation-centric VLA framework that combines object-centric slots with relation-centric tokens. The object slots capture disentangled entities from the environment and are filtered for task relevance, while the relation tokens (e.g., gripper-object interactions) encode task-aware interactions. Together, they yield a compact and interpretable representation for action decoding.
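As a rough illustration of why this representation is compact, the sketch below filters object slots by a relevance score and appends the relation tokens to form the visual token sequence. The top-k rule, the precomputed `relevance` scores, and the function names `filter_slots` and `build_visual_tokens` are assumptions made for this example; the text does not specify how SlotVLA performs its filtering.

```python
def filter_slots(slots, relevance, k):
    """Keep the k slots with the highest relevance, preserving their original order."""
    if len(slots) != len(relevance):
        raise ValueError("need exactly one relevance score per slot")
    top = sorted(range(len(slots)), key=lambda i: relevance[i], reverse=True)[:k]
    return [slots[i] for i in sorted(top)]


def build_visual_tokens(object_slots, relevance, relation_tokens, k):
    """Concatenate the filtered object slots with the relation tokens."""
    return filter_slots(object_slots, relevance, k) + list(relation_tokens)


# Toy example: four 1-D object slots, one relation token, keep the top two slots.
object_slots = [[0.1], [0.2], [0.3], [0.4]]
relevance = [0.9, 0.1, 0.8, 0.2]   # hypothetical task-relevance scores
relation_tokens = [[9.0]]          # e.g. a gripper-object interaction token
tokens = build_visual_tokens(object_slots, relevance, relation_tokens, k=2)
```

Under these assumptions the decoder receives at most k object slots plus the relation tokens, a count that does not depend on image resolution.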

slotvla

Acknowledgements

We borrowed the GitHub page template from HabiCrowd and HyperNeRF. Special thanks to them!