VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation

1 National Taiwan University
2 National Yang Ming Chiao Tung University
ICLR 2025
ARLET (ICML Workshop) 2024

Abstract

We study reward models for long-horizon manipulation that learn from action-free videos and language instructions, a setting we term the vision-instruction correlation (VIC) problem. Existing VIC methods struggle to learn rewards for long-horizon tasks due to their lack of sub-stage awareness, difficulty in modeling task complexities, and inadequate object state estimation. To address these challenges, we introduce VICtoR, a novel hierarchical VIC reward model capable of providing effective reward signals for long-horizon manipulation tasks. Trained solely on primitive motion demonstrations, VICtoR provides precise reward signals for long-horizon tasks by assessing task progress at multiple levels with a novel stage detector and motion progress evaluator. We conducted extensive experiments in both simulated and real-world settings. The results show that VICtoR outperforms the best existing methods, achieving a 43% improvement in success rates for long-horizon tasks.

Vision-Instruction Correlation (VIC) Reward Learning

The reward function plays a critical role in the reinforcement learning (RL) framework. However, in practice, we often lack a reward function that provides precise and informative guidance, especially for long-horizon tasks. Recently, methods that leverage vision-instruction correlation (VIC) as reward signals have emerged, offering a more accessible way to specify tasks through language. Specifically, VIC-based methods frame reward modeling as a regression or classification problem and train the reward model on action-free demonstrations and instructions.
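
To make the formulation concrete, here is a minimal sketch (in PyTorch) of a classification-style VIC reward: the observation and the instruction are embedded, and their agreement is squashed into a score that is used as the RL reward. The architecture, names, and dimensions are illustrative assumptions, not any specific published model.

import torch
import torch.nn as nn

class ToyVICReward(nn.Module):
    def __init__(self, img_dim: int, text_dim: int, hidden: int = 128):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)    # projection of precomputed image features
        self.text_proj = nn.Linear(text_dim, hidden)  # projection of precomputed instruction features

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # Classification-style score: probability that the observation matches the instruction.
        logit = (self.img_proj(img_feat) * self.text_proj(text_feat)).sum(-1)
        return torch.sigmoid(logit)

# Usage with random features standing in for pretrained visual/text encoders;
# the resulting score in (0, 1) replaces the environment-defined reward during RL.
model = ToyVICReward(img_dim=512, text_dim=384)
reward = model(torch.randn(1, 512), torch.randn(1, 384))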

VICtoR overview

Even though existing VIC methods have made several breakthroughs, we observe three limitations when applying them to long-horizon manipulation tasks: (1) No awareness of task decomposition: Failing to divide complex tasks into manageable parts limits adaptability. (2) Confusion from variance in task difficulties: Training a reward model directly on long-horizon tasks impairs the learning of reward signals and fails to generate suitably progressive rewards. (3) Ambiguity from lacking explicit object state estimates: Relying on whole-scene image observations can overlook critical environmental changes. For instance, when training on the task "move the block into the closed drawer", previous VIC models would assign high rewards for moving the block even while the drawer is still closed, misleading the learning process.

Motivated by this, we aim to develop a hierarchical assessment model that decomposes long-horizon tasks into manageable segments. Specifically, the model evaluates overall task progress at three levels: stage, motion (action primitive), and motion progress. With this design, our model can better capture progress changes and environmental status.

Our Method: VICtoR

VICtoR employs a hierarchical approach to assess task progress at various levels, including stage, motion, and motion progress. It consists of three main components: (1) a Task Knowledge Generator that decomposes the task into stages and identifies the necessary object states and motions for each stage; (2) a Stage Detector that detects object states to determine the current stage based on the generated knowledge; (3) a Motion Progress Evaluator that assesses motion completion within stages. VICtoR then converts these assessments into rewards. Both the Stage Detector and Motion Progress Evaluator are trained on motion-level videos labeled with object states, which are autonomously annotated during video collection. This setup enables VICtoR to deliver precise reward signals for complex, unseen long-horizon tasks composed of these motions in any sequence.


VICtoR pipeline
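
To illustrate how the hierarchy can be turned into a scalar signal, below is a minimal sketch that combines the stage index, motion index, and motion progress into a potential whose change serves as the reward. The weights and the potential-difference form are assumptions for illustration, not VICtoR's exact formulation.

from dataclasses import dataclass

@dataclass
class Assessment:
    stage: int       # current stage index from the Stage Detector
    motion: int      # index of the current motion within the task's motion sequence
    progress: float  # motion completion in [0, 1] from the Motion Progress Evaluator

def potential(a: Assessment, w_stage: float = 10.0, w_motion: float = 1.0) -> float:
    # Scalar potential that grows monotonically as the task advances.
    return w_stage * a.stage + w_motion * (a.motion + a.progress)

def reward(prev: Assessment, curr: Assessment) -> float:
    # Shaped reward as the change in potential between consecutive observations.
    return potential(curr) - potential(prev)

# Example: advancing within the same motion yields a small positive reward,
# while reaching a new stage yields a larger jump.
r_step = reward(Assessment(1, 0, 0.2), Assessment(1, 0, 0.5))   # 0.3
r_stage = reward(Assessment(1, 0, 0.9), Assessment(2, 0, 0.0))  # 9.1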

Training Objectives: In VICtoR, two components require training: the Object Status Classifier (in the Stage Detector) and the Motion Progress Evaluator. The former is trained using cross-entropy loss, while the latter is trained with three variants of InfoNCE loss. Below are their details and schematic diagrams:
  • Time Contrastive Loss (L_tcn): It encourages images that are temporally closer to have more similar representations (embeddings) than those that are temporally distant or from different videos. A minimal sketch of this objective appears after the list.
  • Motion Contrastive Loss (L_mcn): The objective aligns each motion's embedding with its relevant language embedding and separates it from unrelated language embeddings.
  • Language-Frame Contrastive Loss (L_lfcn): It brings the progress embeddings of nearly completed steps closer to the instruction embedding of the motion while distancing the progress embeddings of frames from earlier steps.
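
As an illustration of the first objective, here is a minimal InfoNCE-style sketch of a time contrastive loss in PyTorch, assuming anchor/positive frames are temporally close and negatives are temporally distant or drawn from other videos; the sampling scheme and hyperparameters are assumptions, not the exact training recipe.

import torch
import torch.nn.functional as F

def time_contrastive_loss(anchor: torch.Tensor,
                          positive: torch.Tensor,
                          negatives: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    # anchor, positive: (B, D) embeddings of temporally close frames;
    # negatives: (B, K, D) embeddings of distant frames or frames from other videos.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor * positive).sum(-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum('bd,bkd->bk', anchor, negatives)      # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1 + K)

    # The positive pair sits at index 0 of every row.
    labels = torch.zeros(anchor.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Example with random embeddings: batch of 4 anchors, 3 negatives each, 128-dim features.
loss = time_contrastive_loss(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 3, 128))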

training objectives

Experimental Settings

Tasks and Baselines: To assess VICtoR's effectiveness, we train the same RL method with each reward model for every task. We construct nine simulated long-horizon manipulation tasks in CoppeliaSim and additionally evaluate all reward models on the real-world benchmark XSkill. For baselines, we compare VICtoR with the following reward models (the two simplest are sketched in code after the list):
  • Sparse Reward: A binary reward function that assigns a reward only when the task succeeds.
  • Stage Reward: A reward function that assigns a reward equal to the stage number when the agent reaches a new stage.
  • LOReL (CoRL'21): A language-conditioned reward model that learns a classifier to evaluate whether the progression between frames at time 0 and time t aligns with the task instruction.
  • LIV (ICML'23): A vision-language representation for robotics that can be utilized as a reward model by fine-tuning on target-domain data.
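
For concreteness, the two simple baselines admit direct definitions; the sketch below is one reading of the descriptions above, with function names and signatures of our own choosing.

def sparse_reward(task_succeeded: bool) -> float:
    # Binary signal: reward only on overall task success.
    return 1.0 if task_succeeded else 0.0

def stage_reward(current_stage: int, previous_stage: int) -> float:
    # Reward equal to the stage number, given only when a new stage is first reached.
    return float(current_stage) if current_stage > previous_stage else 0.0

# Example: entering stage 2 from stage 1 yields a reward of 2; staying in a stage yields 0.
assert stage_reward(2, 1) == 2.0 and stage_reward(2, 2) == 0.0
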
environments

Main Results

In our main experiment, a PPO policy is trained using the rewards generated by VICtoR and four other reward models for each task. The "S/M" below the task IDs represents the total number of stages and motions required to complete the task. The results clearly show that the RL method trained with VICtoR's rewards can learn to perform more complex and long-horizon tasks compared to those trained with other reward models. Notably, for long-horizon tasks, VICtoR achieves an average performance improvement of 43%.
main results

Reward Visualization

To verify whether VICtoR provides informative rewards, we visualized the potential (reward) curves for two cases: one with videos that match the corresponding instructions and another with incorrect videos paired with the same instructions. For the correct action case (left), the potential curves show that VICtoR effectively identifies task progress, increasing the potential as the agent completes the task, a capability not matched by previous VIC reward models. For the incorrect action case (right), as the agent moves from the right to the left side to close the drawer, VICtoR's potential initially increases. This increase is reasonable, as these movements align with the first stage of the instruction, "open the light". However, as the agent continues toward the drawer, VICtoR recognizes the incorrect task and begins to decrease the reward, effectively discouraging the agent's movement. These two cases highlight VICtoR's ability to analyze agent movement and task progress accurately.

progress visualization


Next, we further visualize the policy execution progress and VICtoR's motion determination at each time step. VICtoR measures the embedding distance between motion descriptions and frame embeddings as a basis for generating rewards. As shown on the right, VICtoR accurately switches motions at the appropriate time step and reduces the embedding distance as the agent approaches each motion's goal.
execution
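
As a rough illustration of this step, the sketch below selects the active motion by comparing a frame embedding against motion-description embeddings; the use of cosine similarity and the interfaces are assumptions, since only the embedding-distance idea is stated here.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pick_motion(frame_emb: np.ndarray, motion_embs: list) -> tuple:
    # Return the index of the closest motion description and its similarity,
    # which can then be mapped to a progress-style reward.
    sims = [cosine(frame_emb, m) for m in motion_embs]
    best = int(np.argmax(sims))
    return best, sims[best]

# Example with random vectors standing in for learned frame/instruction embeddings.
rng = np.random.default_rng(0)
frame = rng.normal(size=128)
motions = [rng.normal(size=128) for _ in range(3)]
idx, sim = pick_motion(frame, motions)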

Citations

If you find our VICtoR helpful and useful for your research, please cite our work as follows:
@inproceedings{hung2025victor,
    title={VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation},
    author={Kuo-Han Hung and Pang-Chi Lo and Jia-Fong Yeh and Han-Yuan Hsu and Yi-Ting Chen and Winston H. Hsu},
    booktitle={The Thirteenth International Conference on Learning Representations (ICLR)},
    url={https://openreview.net/forum?id=UpQLu9bzAR},
    year={2025}
}

If you have any questions, feel free to reach out to Jia-Fong Yeh or raise an issue on VICtoR's GitHub repo.