We propose PersRM-R1, a reasoning-based reward model that learns an individual user's preferences from only a few examples. Trained on synthetic data through a two-stage pipeline, it achieves strong accuracy and generalization, outperforming reward models of comparable size and rivaling much larger ones, paving the way for more personalized LLMs.