LLMs | Xufeng Zhao

PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning

We propose PersRM-R1, a reasoning-based reward model that learns personal preferences from just a few examples. Using synthetic data and a two-stage training pipeline, it achieves high accuracy and generalization, outperforming models of similar size and rivaling much larger ones—paving the way for more personalized LLMs.

Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback

We propose Curriculum-RLAIF, a data-centric framework that improves reward model generalizability by training on preference pairs of increasing difficulty. This curriculum-based approach addresses data noise, distribution shift, and model-capacity mismatch. Experiments show that Curriculum-RLAIF significantly boosts policy alignment performance without extra inference cost, outperforming non-curriculum and alternative strategies.

Mental Modeling of Reinforcement Learning Agents by Language Models

This study explores whether LLMs can mentally model decision-making agents by reasoning over their behavior and state transitions from interaction histories. Evaluated on reinforcement learning tasks, results show that while LLMs offer some insight, they fall short of fully modeling agents without further innovation, highlighting both their potential and current limitations for explainable RL.

REAL: Response Embedding-Based Alignment for LLMs

We propose REAL (Response Embedding-based Alignment for LLMs), a method to improve alignment efficiency by selecting less ambiguous, dissimilar response pairs for annotation. By leveraging embedding similarity in an off-policy manner, REAL reduces label noise and improves alignment quality. Experiments show it boosts performance while cutting annotation effort by up to 65%.

LLM+MAP: Bimanual Robot Task Planning Using Large Language Models and Planning Domain Definition Language

LLM+MAP is a bimanual planning framework that combines GPT-4o with multi-agent task planning to enable efficient and logically consistent long-horizon manipulation. It outperforms baseline LLMs on planning time, success rate, and coordination metrics.

Agentic Skill Discovery

We propose an LLM-driven framework that enables **robots to autonomously discover useful skills from scratch**. By generating tasks, rewards, and success criteria, the LLM guides reinforcement learning, while a vision-language model verifies outcomes. This allows the robot to build a meaningful skill library without relying on predefined primitives.

Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic

We propose LoT (Logical Thoughts), a framework that improves large language models’ reasoning at inference time by applying symbolic logic to verify and correct their step-by-step thought process. LoT enhances performance on diverse reasoning tasks and reduces hallucinations.

Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning

We introduce OSSA (Object State-Sensitive Agent), a task-planning agent using pre-trained LLMs and VLMs to generate plans sensitive to object states. We compare two methods: a modular approach combining vision and language models, and a monolithic VLM approach. Evaluated on tabletop tasks involving clearing a table, OSSA’s monolithic model outperforms the modular one. A new multimodal benchmark dataset with object state annotations is provided.

Large Language Models for Orchestrating Bimanual Robots

LABOR uses LLMs to orchestrate control policies for long-horizon bimanual manipulation tasks. By leveraging task reasoning and coordination via language, it achieves higher success rates on simulated tasks with the NICOL robot and provides insights into LLM-based control challenges.

Elo Rating, Logistic Distribution, and Logistic Regression

A probabilistic explanation of Elo rating approach.