Models are Worse at Persona-Following than You Think

Authors: Yada Pruksachatkun, Jason Wu

As Al systems have been increasingly deployed over the past years, challenges have emerged with respect to their behavior in interaction (or lack thereof) with human users. This includes models causing Al psychosis and ignoring instructions to ask for permission and deleting production databases. The ML research community, acknowledging the importance of interactivity, has incorporated interactivity factors in evaluation of agents in environments such as tau2-bench (customer service), CRMArena-Pro (enterprise), and Salesbot (sales). They incorporate user simulators, or models that are tasked to act as a human would in a scenario, to test the ability of models to complete tasks given human ambiguity and behavior.

However, a natural question arises: how can we evaluate and improve the realism of these user simulators themselves? Existing environments assess simulators with coarse metrics such as naturalness, high-level persona compliance, and high-level erroneous behavior. On closer inspection, we find that simulators behave in a non-human like manner and show low behavioral variation.

We propose persona following as a new goal for LLM training, and ground it in a cognitive-science based framework for defining and evaluating personas for simulators. As with existing efforts in coding, academic math, and agentic capabilities, focusing on persona following is a valuable and technically compelling effort to ensure interactivity in models.

We select one important enterprise use case of persona following: embodying customers in sales. We develop and release Sales Sim, an e-commerce sales simulation environment for human agents and sales agents, featuring an automated human-validated evaluation methodology, and usersimeval, a corresponding tool for grading customer simulations. Finally, we present initial results and insights on the capabilities and limitations of current frontier models on customer persona following. We call for community awareness and focus on persona following and incorporating complex user simulators into evaluation and training of language models.

A Call for Research in User Simulators

We highlight the below as following directions on research in user simulators:


SalesSim: A Testbed for Sales Interactions in E-Commerce

We introduce a test bed for persona following in sales, where customers must express communication shaped by latent preferences and sales agents must their approach based on personality, customer background, and level of product knowledge. We extend the laptop split of **Salesbot** (Murakhovs'ka et al), a testbed for catalog-grounded sales inspired by real product-support agents in e-commerce stores and aggregators such as Best Buy. Laptops provide an ideal domain for interactive evaluation, given the many factors that contribute to consumer needs and the multiple axes for personalized sales communication (e.g. varying knowledge levels and the relative importance of different specs).

A Framework for Defining Personas in User Simulators

Inspired by the Cognitive-Affective Personality System Framework (CAPS) we define the below framework for persona definition.

Diagram 1: Framework for Personas Definition in User Simulators

Applying this to human agents, for character traits, we use the Big Five personality framework, which consists of Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness to Experience. We define speaking style as an explicit behavior pattern, and provide an initial backstory alongside a specific knowledge level of laptops at the time of interaction for past experiences and knowledge. Finally, for task-oriented scenario information, we define laptop-related preferences and dealbreakers and starting emotion for each sales scenario.


Evaluating Persona-Following in Human Agents

A good shopper simulator should be able to express preferences, interact with the Al salesperson, and make decisions aligned with its consumer preferences, dealbreakers, and personality traits. To this end, we propose evaluation dimensions and a scoring rubric, categorized by level of importance driven by their impact on behavior in sales conversations. Most dimensions are scored with 0 or 1 due to known issues with calibration of large language models.

Critical Dimensions

Ancillary Dimensions

Smoke testing

In addition to the core evaluation dimensions, smoke tests are provided as table-stakes evaluation dimensions useful for debugging across-dimension and potential root causes for Al shopper issues.

To support automatic development of customer agents, eight hand-crafted, human-validated model graders for the critical dimensions are released, with plans to release graders for the rest of the dimensions.

Decision alignment with Persona's Success Criteria (Cohen's Kappa) Role-appropriate comprehension about the task (Cohen's Kappa) Adherence to customer intent (Cohen's Kappa) Adherence to Big Five Traits (Pearson's Correlation)
0.8 0.71 0.9 0.53

Diagram 2: Human-model agreement for critical dimensions

Initial Results on Persona-Following in Sales Sim

Mistral-3.1-Small was chosen to benchmark due to its reported persona following strengths and superior performance compared to proprietary models in initial experiments. The figure below shows the performance on each dimension over 40 rollouts.

Customer simulation evaluation for SalesSim

Figure 1: Customer simulation evaluation for SalesSim. Critical dimensions are in bold.

Surprisingly, Mistral-3.1-Small struggles with dimensions that prevent it from being used as a valid customer simulator, including the ability to disclose preferences and dealbreakers, make decisions aligned with customer preferences, and embody varied personality types. Out of the critical dimensions, 65% of sampled interaction rollouts in SalesSim contain critical errors.

A Takeaway for the Community

Given these results, we encourage the community to:

Opensource Release

We opensource SalesSim at our Github page, alongside usersimeval, a lightweight tool to help with evaluation and error analysis of user simulation rollouts. We invite contributions and issues to both SalesSim and usersimeval, especially with adding more product categories to SalesSim and more graders for open-ended user simulators to usersimeval. We are excited about centering user simulators in research, and the downstream usage of user simulators in improving ML systems.

Ethics

The development and use of user simulators must be guided by a clear ethical statement prioritizing user well-being, fairness, and transparency. This necessitates careful design to prevent the perpetuation of bias and to ensure that models accurately reflect diverse user populations.

Acknowledgements

We acknowledge Pranav Narayanan Venkit and Yu Li for their feedback to this project.

Citation

@misc{personafollowing2025,
  title={Models are Worse at Persona-Following than You Think},
  author={Pruksachatkun, Yada and Wu, Chien-Sheng and Xiong, Caiming},
  year={2025}
}