SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation

Inspired by how humans reason over discrete objects and their relationships, we explore whether compact object-centric and object-relation representations can form a foundation for multitask robotic manipulation. Most existing robotic multitask models rely on dense embeddings that entangle both object and background cues, raising concerns about both efficiency and interpretability. In contrast, we study object–relation-centric representations as a pathway to more structured, efficient, and explainable visuomotor control. Our contributions are two-fold. First, we introduce LIBERO+, a fine-grained benchmark dataset designed to enable and evaluate object-relation reasoning in robotic manipulation. Unlike prior datasets, LIBERO+ provides object-centric annotations that enrich demonstrations with box- and mask-level labels as well as instance-level temporal tracking, supporting compact and interpretable visuomotor representations. Second, we propose SlotVLA, a slot-attention–based framework that captures both objects and their relations for action decoding. It uses a slot-based visual tokenizer to maintain consistent temporal object representations, a relation-centric decoder to produce task-relevant embeddings, and an LLM-driven module that translates these embeddings into executable actions. Experiments on LIBERO+ demonstrate that object-centric slot and object-relation slot representations drastically reduce the number of required visual tokens, while providing competitive generalization. Together, LIBERO+ and SlotVLA provide a compact, interpretable, and effective foundation for advancing object–relation-centric robotic manipulation.

LIBERO+ augments demonstrations with structured object-centric annotations across RGB and depth:

Bounding boxes: 2D spatial anchors for object localization and token extraction.
Object masks: Pixel-level segmentation to preserve object boundaries and avoid background entanglement.
Temporal instance IDs: Consistent object identities across frames (e.g., plate1, plate2) for long-horizon tracking.
Masked depth maps: Per-pixel depth within object regions to encode occlusion and relative distance for gripper–object reasoning.
Task-relevant objects: Objects explicitly referenced in the task (e.g., robot, bowl, cabinet).
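To make the annotation types above concrete, the following is a minimal sketch of how one object's per-frame annotation could be held in memory and how a masked depth map could be derived from a mask. The names `ObjectAnnotation`, `masked_depth`, and `box_from_mask` are hypothetical and chosen for illustration; they are not the LIBERO+ file format or API, which is not described here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ObjectAnnotation:
    """Hypothetical per-frame record for one object (not the LIBERO+ schema)."""
    instance_id: str                  # temporally consistent ID, e.g. "plate1"
    box: Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max) in pixels
    mask: List[List[bool]]            # H x W binary segmentation mask
    task_relevant: bool               # True if the task refers to this object

def masked_depth(depth: List[List[float]],
                 mask: List[List[bool]]) -> List[List[Optional[float]]]:
    """Keep depth values inside the object mask; mark everything else as None."""
    return [
        [d if m else None for d, m in zip(depth_row, mask_row)]
        for depth_row, mask_row in zip(depth, mask)
    ]

def box_from_mask(mask: List[List[bool]]) -> Optional[Tuple[int, int, int, int]]:
    """Tightest axis-aligned box around the mask, or None for an empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, m in enumerate(row) if m]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))

# Toy 2 x 2 frame with one object covering the diagonal pixels.
depth = [[1.0, 2.0], [3.0, 4.0]]
mask = [[True, False], [False, True]]
plate = ObjectAnnotation("plate1", box_from_mask(mask), mask, task_relevant=True)
plate_depth = masked_depth(depth, plate.mask)
```

Keeping the instance ID on every record is what would let a loader follow the same object across frames; the masked depth discards background pixels, so only object geometry remains.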

LIBERO+ contains four subsets (Goal, Object, Spatial, and Long), each emphasizing different manipulation challenges while maintaining structured object–relation annotations. We retain native action labels but remove redundant no-op actions to reduce idle frames and improve alignment between object dynamics and supervision. The result is a compact, semantically grounded benchmark designed for efficient and interpretable object–relation-centric VLA reasoning. Table I presents the statistical summary of LIBERO+ across its four subsets: LIBERO-Object (L-Object), LIBERO-Goal (L-Goal), LIBERO-Spatial (L-Spatial), and LIBERO-Long (L-Long).
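The no-op removal step can be sketched as follows. The exact criterion used for LIBERO+ is not stated, so this example assumes that an action is a no-op when all of its motion components are near zero and its gripper command (taken here to be the last entry) is unchanged from the previous step. The function names and the threshold `eps` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

def is_noop(action: Sequence[float],
            prev_action: Optional[Sequence[float]],
            eps: float = 1e-4) -> bool:
    """Assumed criterion: near-zero motion and an unchanged gripper command."""
    motion_is_zero = all(abs(a) < eps for a in action[:-1])
    if prev_action is None:
        return motion_is_zero
    gripper_unchanged = action[-1] == prev_action[-1]
    return motion_is_zero and gripper_unchanged

def drop_noops(frames: List[str],
               actions: List[Sequence[float]],
               eps: float = 1e-4) -> Tuple[List[str], List[Sequence[float]]]:
    """Return only the (frame, action) pairs whose action is not a no-op."""
    kept_frames, kept_actions = [], []
    prev = None
    for frame, action in zip(frames, actions):
        if not is_noop(action, prev, eps):
            kept_frames.append(frame)
            kept_actions.append(action)
        prev = action  # always compare against the previous raw action
    return kept_frames, kept_actions

# Toy episode: two motion dimensions followed by a gripper command.
frames = ["f0", "f1", "f2", "f3"]
actions = [[0.0, 0.0, -1.0], [0.1, 0.0, -1.0], [0.0, 0.0, -1.0], [0.0, 0.0, 1.0]]
kept_frames, kept_actions = drop_noops(frames, actions)
```

In this toy episode the idle frames `f0` and `f2` are dropped, while `f3` is kept because the gripper command changes even though the arm does not move.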

SlotVLA is an object-relation-centric VLA framework that combines object-centric slots with relation-centric tokens. The object slots capture disentangled entities from the environment and are filtered for task relevance, while the relation tokens (e.g., gripper-object interactions) encode task-aware interactions. Together, they yield a compact and interpretable representation for action decoding.
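For readers unfamiliar with slots, the sketch below shows one simplified slot-attention update in plain Python. It is not the SlotVLA tokenizer: it assumes identity query/key/value projections and omits the learned GRU and MLP updates of standard slot attention, keeping only the core idea that slots compete for input features through a softmax taken over the slot axis. All names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def slot_attention_step(slots: List[Vector],
                        inputs: List[Vector],
                        eps: float = 1e-8) -> Tuple[List[Vector], List[List[float]]]:
    """One simplified update: slots compete for inputs, then average them."""
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # attn[i][k]: share of input i claimed by slot k (softmax over slots).
    attn = [softmax([scale * dot(x, s) for s in slots]) for x in inputs]
    new_slots = []
    for k in range(len(slots)):
        weights = [attn[i][k] for i in range(len(inputs))]
        total = sum(weights) + eps
        # Each slot becomes the attention-weighted mean of the inputs.
        new_slots.append([
            sum(w * x[d] for w, x in zip(weights, inputs)) / total
            for d in range(dim)
        ])
    return new_slots, attn

# Four toy features forming two groups, and two slots.
inputs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
slots = [[1.0, 0.0], [0.0, 1.0]]
new_slots, attn = slot_attention_step(slots, inputs)
```

Because the softmax normalizes across slots rather than across inputs, each input feature is shared out among the slots, which is what pushes different slots toward different entities. Here four input features are summarized by two slot vectors, a toy version of the token reduction described above.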

Acknowledgements

This page borrows from the GitHub pages of HabiCrowd and HyperNeRF. Special thanks to them!

ICRA 2026

Object–relation-centric multitask robotic manipulation for efficient, interpretable control.
– LIBERO+: Object-annotated benchmark for structured reasoning.
– SlotVLA: Slot-based model for action generation from compact object–relation representations.
