We study reward models for long-horizon manipulation by learning from action-free videos and language instructions, which we term the vision-instruction correlation (VIC) problem. Existing VIC methods face challenges in learning rewards for long-horizon tasks due to their lack of sub-stage awareness, difficulty in modeling task complexities, and inadequate object state estimation. To address these challenges, we introduce VICtoR, a novel hierarchical VIC reward model capable of providing effective reward signals for long-horizon manipulation tasks. Trained solely on primitive motion demonstrations, VICtoR provides precise reward signals for long-horizon tasks by assessing task progress at various stages with a novel stage detector and motion progress evaluator. We conducted extensive experiments in both simulated and real-world environments. The results show that VICtoR outperforms the best existing methods, achieving a 43% improvement in success rates for long-horizon tasks.
The reward function plays a critical role in the reinforcement learning (RL) framework. However, in practice, we often lack a reward function that provides precise and informative guidance, especially for long-horizon tasks. Recently, methods that leverage vision-instruction correlation (VIC) as reward signals have emerged, offering a more accessible way to specify tasks through language. Specifically, VIC-based methods frame reward modeling as a regression or classification problem and train the reward model on action-free demonstrations and instructions.
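To make this framing concrete, below is a minimal sketch of a generic VIC-style reward computed as the similarity between an observation embedding and an instruction embedding. The class and attribute names (`VICReward`, `frame_encoder`, `text_encoder`), the toy encoders, and the embedding size are illustrative assumptions, not the architecture of any specific prior method or of VICtoR.

```python
# Hedged sketch of a generic VIC-style reward model (not VICtoR itself).
# The encoders below are toy stand-ins for pretrained visual/language models.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VICReward(nn.Module):
    """Scores how well the current observation matches the instruction."""

    def __init__(self, embed_dim: int = 256, vocab_size: int = 10_000):
        super().__init__()
        self.frame_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.text_encoder = nn.Embedding(vocab_size, embed_dim)  # toy token embedding

    def forward(self, frame: torch.Tensor, instruction_tokens: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.frame_encoder(frame), dim=-1)
        txt = F.normalize(self.text_encoder(instruction_tokens).mean(dim=1), dim=-1)
        # Cosine similarity between observation and instruction serves as the reward.
        return (img * txt).sum(dim=-1)


if __name__ == "__main__":
    model = VICReward()
    frames = torch.randn(4, 3, 64, 64)          # batch of image observations
    tokens = torch.randint(0, 10_000, (4, 8))   # tokenized instruction
    print(model(frames, tokens).shape)          # -> torch.Size([4])
```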
Even though existing VIC methods have made several breakthroughs, we observe three limitations when applying them to long-horizon manipulation tasks: (1) No awareness of task decomposition: Failing to divide complex tasks into manageable parts limits adaptability. (2) Confusion from variance in task difficulties: Training a single reward model across tasks of widely varying difficulty impairs the learning of reward signals and fails to yield suitably progressive rewards. (3) Ambiguity from lacking explicit object state estimates: Relying on whole-scale image observations can overlook critical environmental changes. For instance, when training for the task "move the block into the closed drawer," previous VIC models would assign high rewards for moving the block even while the drawer remains closed, misleading the learning process.
Motivated by this, we aim to develop a hierarchical assessment model that decomposes long-horizon tasks into manageable segments. Specifically, the model evaluates overall task progress at three levels: stage, motion (action primitive), and motion progress. With this design, our model can better capture progress changes and the state of the environment.
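As a rough illustration of how such a three-level assessment could be collapsed into a scalar signal, the sketch below orders rewards so that advancing a stage outweighs completing a motion, which in turn outweighs within-motion progress. The weighting scheme is an assumption made for illustration, not the paper's reward formula.

```python
# Hedged sketch: combining stage, motion, and within-motion progress into one
# monotonically increasing scalar reward. The weights are illustrative only.
def hierarchical_reward(stage_idx: int, motion_idx: int, motion_progress: float,
                        w_stage: float = 10.0, w_motion: float = 1.0) -> float:
    assert 0.0 <= motion_progress <= 1.0, "progress is a fraction of the current motion"
    return w_stage * stage_idx + w_motion * (motion_idx + motion_progress)


# Example: second stage, first motion within it, 40% complete.
print(hierarchical_reward(stage_idx=1, motion_idx=0, motion_progress=0.4))  # 10.4
```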
VICtoR employs a hierarchical approach to assess task progress at various levels, including stage, motion, and motion progress. It consists of three main components: (1) a Task Knowledge Generator that decomposes the task into stages and identifies the necessary object states and motions for each stage; (2) a Stage Detector that detects object states to determine the current stage based on the generated knowledge; (3) a Motion Progress Evaluator that assesses motion completion within stages. VICtoR then transforms this information into rewards. Both the Stage Detector and Motion Progress Evaluator are trained on motion-level videos labeled with object states, which are autonomously annotated during video collection. This setup enables VICtoR to deliver precise reward signals for complex, unseen long-horizon tasks composed of these motions in any sequence.
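The skeleton below shows one way the three components described above could be wired together at reward-computation time. The class and function names follow the prose, but their interfaces, internals, and the final weighting are assumptions made for illustration; the actual VICtoR components (e.g., how task knowledge is generated or how object states are classified) are not reproduced here.

```python
# Skeleton of the three-component pipeline described above. Interfaces and the
# final weighting are illustrative assumptions; component internals are stubbed.
from dataclasses import dataclass, field


@dataclass
class StagePlan:
    object_states: dict = field(default_factory=dict)  # states required in this stage
    motions: list = field(default_factory=list)        # ordered action primitives


class TaskKnowledgeGenerator:
    """Decomposes an instruction into stages with required object states and motions."""
    def decompose(self, instruction: str) -> list:
        raise NotImplementedError  # produced once per task from the instruction


class StageDetector:
    """Classifies object states in the frame and maps them to a stage index."""
    def current_stage(self, frame, plans: list) -> int:
        raise NotImplementedError


class MotionProgressEvaluator:
    """Estimates which motion of the stage is active and how far along it is."""
    def progress(self, frame, motions: list) -> tuple:
        raise NotImplementedError  # returns (motion index, progress in [0, 1])


def victor_style_reward(frame, instruction, tkg, detector, evaluator,
                        w_stage: float = 10.0, w_motion: float = 1.0) -> float:
    """Compose stage, motion, and progress estimates into one scalar reward."""
    plans = tkg.decompose(instruction)                    # task knowledge
    stage = detector.current_stage(frame, plans)          # which stage are we in?
    motion, prog = evaluator.progress(frame, plans[stage].motions)
    return w_stage * stage + w_motion * (motion + prog)   # as in the sketch above
```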
@inproceedings{hung2025victor,
  title={VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation},
  author={Kuo-Han Hung and Pang-Chi Lo and Jia-Fong Yeh and Han-Yuan Hsu and Yi-Ting Chen and Winston H. Hsu},
  booktitle={The Thirteenth International Conference on Learning Representations (ICLR)},
  url={https://openreview.net/forum?id=UpQLu9bzAR},
  year={2025}
}