Google’s New Framework Tests Whether LLMs Actually Behave Like Humans

Google Research just released a paper that tries to answer a question I’ve been chewing on for a while: do LLMs actually act like people, or do they just claim to?

It’s one thing for a model to say “I’m empathetic” when you ask it directly. It’s another to see whether it actually picks the empathetic option when faced with a messy, real-world scenario. This new work from Google’s team — led by Amir Taubenfeld, Zorik Gekhman, and Lior Nezry — builds a framework that takes established psychological questionnaires and turns them into situational judgment tests (SJTs) for LLMs. The goal is to measure behavioral alignment, not just self-reported alignment.

The problem with asking LLMs how they feel

If you’ve ever chatted with an LLM long enough, you know they’ll tell you whatever you want to hear. Ask a model “Are you assertive?” and it’ll probably say yes, because that’s the socially desirable answer. But that doesn’t mean it’ll actually be assertive when you role-play a tense workplace negotiation.

The team points out that LLM outputs are hypersensitive to prompt phrasing and distribution shifts. A model that claims high empathy in a self-report format might flip to a cold, utilitarian stance in a more open-ended scenario. That’s not alignment — it’s just good PR.

So instead of relying on direct questionnaires, they adapted instruments like the Interpersonal Reactivity Index (IRI) for empathy and the Emotion Regulation Questionnaire (ERQ) into SJTs. Each test presents a realistic scenario with two possible courses of action: one that supports a specific behavioral trait and one that opposes it. The model has to generate a natural response, which is then mapped to one of the two actions using an LLM-as-a-judge.
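To make that pipeline concrete, here is a minimal sketch of how one SJT item and the judging step could be represented. Everything named here (the SJT class, build_judge_prompt, classify_response, the A/B answer format, and the stub judge) is my own illustration and an assumption, not the paper's actual code or prompt. The judge is just a callable that takes a prompt and returns text, so a real LLM call could be swapped in.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns the judge's raw text.
Judge = Callable[[str], str]


@dataclass(frozen=True)
class SJT:
    """One situational judgment test item (field names are assumptions)."""
    trait: str
    scenario: str
    supporting_action: str  # the action that supports the trait
    opposing_action: str    # the action that opposes the trait


def build_judge_prompt(item: SJT, response: str) -> str:
    """Assemble an illustrative prompt asking the judge to pick action A or B."""
    return (
        f"Scenario: {item.scenario}\n"
        f"Action A: {item.supporting_action}\n"
        f"Action B: {item.opposing_action}\n"
        f"Response: {response}\n"
        "Which action does the response match? Answer with A or B."
    )


def classify_response(item: SJT, response: str, judge: Judge) -> bool:
    """Return True if the judge maps the response to the trait-supporting action."""
    verdict = judge(build_judge_prompt(item, response)).strip().upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"unexpected judge output: {verdict!r}")
    return verdict == "A"


def stub_judge(prompt: str) -> str:
    """Stand-in judge for demonstration only; a real setup would call an LLM here."""
    return "A"


item = SJT(
    trait="empathy",
    scenario="A colleague mentions they had a rough weekend.",
    supporting_action="Ask how they are doing",
    opposing_action="Change the subject to the project deadline",
)
supports_trait = classify_response(item, "I'd ask if they want to talk.", stub_judge)
```

Keeping the judge behind a plain callable means the mapping logic can be tested without any model in the loop, which matters given how sensitive these setups are to prompt phrasing.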

What they actually tested

The team ran this on 25 different LLMs, across scenarios covering professional composure, conflict resolution, practical tasks like booking a trip, and everyday lifestyle decisions. Nothing abstract — just the kind of stuff you’d actually run into.

Each SJT was reviewed by three independent annotators to make sure the scenario and actions were coherent and actually captured the trait being tested. The team then collected choices from 10 human annotators per SJT, drawn from a pool of 550 participants, to establish a human preference distribution for each scenario. The model responses were compared against that distribution.
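As a rough illustration of what "compared against that distribution" could look like for a two-option SJT, here is a small sketch. The post doesn't say which comparison metric the team used, so the absolute difference in support rates below is my assumption, as is the idea of sampling the model several times per scenario. The function names are hypothetical.

```python
from typing import Sequence


def support_rate(votes: Sequence[bool]) -> float:
    """Fraction of votes that chose the trait-supporting action."""
    if not votes:
        raise ValueError("need at least one vote")
    return sum(votes) / len(votes)


def preference_gap(human_votes: Sequence[bool], model_votes: Sequence[bool]) -> float:
    """Absolute difference between human and model support rates (assumed metric)."""
    return abs(support_rate(human_votes) - support_rate(model_votes))


# Made-up example data: 10 annotators, 7 of whom chose the supporting action,
# versus 10 sampled model responses that all mapped to the supporting action.
human_votes = [True] * 7 + [False] * 3
model_votes = [True] * 10
gap = preference_gap(human_votes, model_votes)
```

With this invented data the gap comes out at about 0.3: the model agrees with the human majority but is far more one-sided than the people were.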

Two kinds of gaps

The results are sobering but not surprising. The team found two distinct types of misalignment:

Deviation from human consensus: The model picks a course of action that most humans would not choose. This is the classic alignment failure — the model is confidently wrong about what a reasonable person would do.

Failure to capture range: When humans disagree (i.e., no clear consensus), the model tends to pick one option and stick with it, ignoring the diversity of human opinion. This is subtler but arguably more dangerous — it means the model is flattening the complexity of human social dynamics into a single “correct” answer.
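The two failure modes can be told apart with a simple rule of thumb, sketched below. The thresholds (0.8 for "humans have a consensus", 0.9 for "the model has collapsed onto one option") and the label names are hypothetical values I picked for illustration; the post gives no such cutoffs.

```python
def diagnose(
    human_rate: float,
    model_rate: float,
    consensus: float = 0.8,  # assumed cutoff for a clear human consensus
    collapse: float = 0.9,   # assumed cutoff for a near-deterministic model
) -> str:
    """Label one SJT given the share of humans and of model samples
    that chose the trait-supporting action."""
    human_has_consensus = max(human_rate, 1 - human_rate) >= consensus
    if human_has_consensus:
        # Humans mostly agree: check whether the model lands on the same side.
        same_side = (human_rate >= 0.5) == (model_rate >= 0.5)
        return "no_gap_flagged" if same_side else "deviation_from_consensus"
    # Humans are split: flag the model if it almost always picks one option.
    model_collapsed = max(model_rate, 1 - model_rate) >= collapse
    return "failure_to_capture_range" if model_collapsed else "no_gap_flagged"


labels = [
    diagnose(human_rate=0.9, model_rate=0.1),  # model opposes a clear consensus
    diagnose(human_rate=0.5, model_rate=1.0),  # humans split, model always picks one side
    diagnose(human_rate=0.5, model_rate=0.6),  # humans split, model also mixed
]
```

The first case is flagged as a deviation from consensus, the second as a failure to capture range, and the third is not flagged. Note that the second check only makes sense if the model is sampled more than once per scenario.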

I find the second one more interesting. We tend to think of alignment as “does the model agree with most people?” But sometimes there is no clear majority to agree with. In those cases, a good model should be able to express uncertainty or nuance, not just commit to one side.

What this means for real-world use

This is still early work — the paper itself calls it “an early step.” But the implications are clear. If you’re deploying an LLM as a customer service agent, a therapist, or a negotiator, you need to know whether it’s going to behave the way a well-adjusted human would. Not just in the obvious cases, but in the edge cases where humans themselves disagree.

The framework is a solid step forward. It moves beyond the “I am empathetic” checkbox and into actual behavioral testing. But I’d like to see more work on what happens when the model encounters scenarios that weren’t in the training data. SJTs are great, but they’re still curated. The real world is messier.

Also worth noting: the write-up doesn’t name which models performed well or poorly. That’s typical for a research blog, but frustrating for practitioners who want to know which model to pick for their use case. I’m hoping they release more granular results in the full paper.

Bottom line

Google’s approach is a meaningful improvement over naive self-report evaluations. It catches the gap between what a model says it will do and what it actually does in context. If you’re building anything with LLMs that involves social interaction, this is the kind of evaluation you should be running — not just asking the model about its personality.

I’ll be watching for follow-up work that extends this to multi-turn conversations and scenarios with more than two options. That’s where the real complexity lies.