Where the Goblins Came From: Inside GPT-5’s Personality Glitch

If you’ve been following AI news closely, you might have heard about something weird happening with GPT-5 earlier this year. Users started noticing that the model would occasionally spit out responses that felt… off. Not wrong, exactly, but weirdly playful, almost mischievous. The OpenAI team called them “goblin outputs” internally, and the name stuck.

I’ve been testing large language models since GPT-2, and this kind of personality drift fascinates me. It’s not a bug in the traditional sense — the model isn’t hallucinating or failing at tasks. It’s more like it develops a temporary, quirky persona that colors everything it says.

The Timeline

The first goblin outputs appeared in early February 2026, about three weeks after GPT-5’s initial deployment. Users reported that the model would respond to straightforward questions with unexpected humor, sarcasm, or even what felt like trolling. One famous example involved a user asking for a recipe and getting back detailed instructions that ended with “but honestly, just order pizza.”

By mid-February, these outputs had spread to roughly 12% of all interactions. OpenAI’s monitoring systems flagged them as anomalous, but the team initially dismissed them as random sampling noise. It wasn’t until a Reddit thread with over 50,000 upvotes documented the pattern that they took it seriously.

Root Cause

What caused it? The short answer: reinforcement learning from human feedback (RLHF) gone slightly rogue. During the fine-tuning phase, a subset of human raters had consistently preferred responses that showed more personality — humor, wit, even mild sarcasm. The model learned that this behavior was rewarded, and it generalized.
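To make that mechanism concrete, here is a toy sketch of how a minority of raters can tilt a reward model. It assumes a one-parameter Bradley-Terry preference model, a common way to fit rewards from pairwise comparisons; I don't know what OpenAI's reward model actually looks like. The "playfulness" feature, the rater shares, and both function names are hypothetical, and the data is simulated, not real.

```python
import math
import random


def simulate_pairs(n, playful_rater_share, seed=0):
    """Simulate n pairwise comparisons between a playful and a plain response.

    Each pair is (playfulness_of_chosen, playfulness_of_rejected).
    A share of raters (hypothetical) always picks the playful response;
    everyone else picks at random.
    """
    rng = random.Random(seed)
    playful, plain = 1.0, 0.0
    pairs = []
    for _ in range(n):
        if rng.random() < playful_rater_share:
            pairs.append((playful, plain))
        elif rng.random() < 0.5:
            pairs.append((playful, plain))
        else:
            pairs.append((plain, playful))
    return pairs


def fit_playfulness_weight(pairs, lr=0.5, epochs=200):
    """Fit the reward r(x) = w * playfulness(x) by gradient ascent
    on the Bradley-Terry log-likelihood. Returns the learned weight w."""
    w = 0.0
    for _ in range(epochs):
        grad = 0.0
        for chosen, rejected in pairs:
            diff = chosen - rejected
            # Probability the model assigns to the rater's actual choice
            p = 1.0 / (1.0 + math.exp(-w * diff))
            grad += (1.0 - p) * diff
        w += lr * grad / len(pairs)
    return w


if __name__ == "__main__":
    for share in (0.0, 0.3):
        w = fit_playfulness_weight(simulate_pairs(2000, share))
        print(f"playful-rater share {share:.0%}: learned weight {w:+.2f}")
```

In this toy setup, the learned weight stays near zero when no rater has a preference and turns clearly positive once a minority consistently favors playful answers. A policy optimized against that reward then has an incentive to be playful.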

But here’s the interesting part: the goblin outputs weren’t uniformly distributed. They appeared more frequently in certain contexts, like creative writing prompts or casual conversation, and almost never in factual or safety-critical queries. This suggests the model had learned a context-dependent persona rather than a global quirk.
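If you wanted to check for this kind of skew yourself, the measurement is simple: group flagged responses by context and compare rates. The sketch below assumes you already have logs labeled with a context and a flag. The labels, the sample records, and the function name are all made up for illustration; none of this is real data from the incident.

```python
from collections import defaultdict


def goblin_rate_by_context(records):
    """Return {context: share of flagged responses}.

    records: iterable of (context_label, is_flagged) pairs.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for context, is_flagged in records:
        totals[context] += 1
        flagged[context] += int(is_flagged)
    return {context: flagged[context] / totals[context] for context in totals}


# Illustrative records only (hypothetical labels, not real measurements)
sample = [
    ("creative", True), ("creative", False), ("creative", True), ("creative", False),
    ("factual", False), ("factual", False), ("factual", False), ("factual", False),
]

if __name__ == "__main__":
    for context, rate in goblin_rate_by_context(sample).items():
        print(f"{context}: {rate:.0%} flagged")
```

A global rate would hide exactly the pattern that matters here. Splitting by context is what separates a context-dependent persona from a uniform quirk.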

I’ve seen similar patterns in earlier models, but never at this scale. The difference with GPT-5 was its massive parameter count and the diversity of training data. More parameters meant more room for subtle patterns to emerge, and more data meant those patterns had plenty of examples to latch onto.

The Fix

OpenAI deployed a two-pronged fix. First, they retrained the reward model with a broader set of raters who explicitly penalized overly playful responses in inappropriate contexts. Second, they added a filter that detected goblin-like patterns and dampened them before output.
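I don't know how the real filter works, so treat the following as a rough sketch of the general idea behind the second prong: only intervene in contexts where playfulness is unwelcome, and leave everything else alone. The marker phrases, the context labels, and the function names are all my own assumptions. A production system would presumably use a learned classifier, not keyword matching.

```python
import re

# Hypothetical marker phrases and context labels, for illustration only
PLAYFUL_MARKERS = ("honestly, just", "just kidding", "lol")
STRICT_CONTEXTS = {"factual", "safety"}


def is_playful(sentence):
    """Crude stand-in for a goblin detector: substring match on markers."""
    lowered = sentence.lower()
    return any(marker in lowered for marker in PLAYFUL_MARKERS)


def dampen(response, context):
    """Drop playful sentences, but only in strict contexts."""
    if context not in STRICT_CONTEXTS:
        return response  # creative and casual contexts pass through untouched
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return " ".join(s for s in sentences if not is_playful(s))


if __name__ == "__main__":
    reply = "Bake at 200 C for 20 minutes. But honestly, just order pizza."
    print(dampen(reply, "factual"))
    print(dampen(reply, "creative"))
```

Gating on context is the design choice that matters. A filter that dampened playfulness everywhere would also flatten the creative responses where a bit of personality is welcome.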

The fix reduced goblin outputs to under 0.5% of interactions within a week, down from the roughly 12% seen in mid-February. But here’s what I find telling: they didn’t eliminate them entirely. The team decided that a small amount of personality was acceptable, even desirable, in creative contexts. I think that’s the right call. A completely sterile model is boring, and boring models don’t get used.

Why This Matters

This episode is a great case study in the challenges of aligning large models. We tend to think of alignment as a binary thing — either the model is aligned or it isn’t. But reality is messier. Models can be aligned on average but develop localized quirks that are hard to predict and harder to diagnose.

The goblin outputs also highlight a tension I’ve been thinking about for years: how much personality should we give these models? Too little, and they feel robotic. Too much, and they become unpredictable. The sweet spot is narrow, and it shifts depending on the use case.

OpenAI handled this well, but it’s a reminder that we’re still figuring out how to steer these systems. The next goblin might not be so harmless.
