Testing LLMs on Superconductivity: Who Actually Knows Their Physics?

Google Research just published a paper in PNAS that asks a question I’ve been wondering about for a while: can LLMs actually be useful for real physics research, or are they just fancy autocomplete engines that sound confident while being wrong?

The answer, as usual, is complicated.

The team, led by Subhashini Venugopalan and Eun-ah Kim, took high-temperature superconductivity as their test case. This is a genuinely hard area of condensed matter physics. We’ve known about high-Tc superconductors since 1986, and the discovery won a Nobel Prize the following year, but nobody has a fully agreed-upon theory for why they work. Thousands of papers exist, experimental techniques vary, and competing theories have been pursued by different groups for decades. It’s the kind of messy, unsettled domain where a good literature assistant could save a new grad student months of flailing.

So they grabbed six LLMs and asked them expert-level questions about cuprates, the copper-oxide compounds that superconduct at temperatures as high as about -140°C (roughly 133 K). A panel of physicists then graded the responses on accuracy, comprehensiveness, and how well they handled the unresolved debates in the field.
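To make that grading setup concrete, here is a minimal sketch of how expert ratings on those three criteria could be averaged per system. This is my own toy illustration, not the paper's method: the 0-5 scale, the system names, and every score below are made up, and I don't know how the authors actually aggregated their grades.

```python
from statistics import mean

# Hypothetical rubric: the three criteria named in the post. The 0-5 scale,
# the system names, and every number below are made up for illustration.
CRITERIA = ("accuracy", "comprehensiveness", "handling_of_debates")


def average_scores(grades):
    """Average each criterion over all graders, per system.

    grades: list of (system, criterion, score) tuples, one per expert rating.
    Returns {system: {criterion: mean score}}.
    """
    collected = {}
    for system, criterion, score in grades:
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        collected.setdefault(system, {}).setdefault(criterion, []).append(score)
    return {
        system: {c: mean(scores) for c, scores in by_criterion.items()}
        for system, by_criterion in collected.items()
    }


# Toy ratings from imaginary graders; not results from the study.
toy_grades = [
    ("system_a", "accuracy", 4), ("system_a", "accuracy", 5),
    ("system_a", "handling_of_debates", 2),
    ("system_b", "accuracy", 3), ("system_b", "handling_of_debates", 4),
]
summary = average_scores(toy_grades)
```

Keeping the criteria separate rather than collapsing them into one number matters here: a system can score well on accuracy and still do poorly on the open debates, which is exactly the kind of gap a single overall grade would hide.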

The results? The two systems that pull from a closed ecosystem of curated, quality-controlled sources, NotebookLM and a custom-built system, came out on top. This makes sense to me. When you’re dealing with open research questions, the last thing you want is an LLM hallucinating a paper that doesn’t exist or giving equal weight to a fringe theory and the mainstream consensus. A curated reference set keeps things grounded.

But here’s the kicker: even the best systems had clear weaknesses. They struggled with nuanced questions that required weighing competing evidence rather than just summarizing known facts. They were also prone to oversimplifying complex debates, which is dangerous when you’re trying to understand something as subtle as the pseudogap phase or the role of spin fluctuations in pairing mechanisms.

The paper doesn’t sugarcoat this. They explicitly identify areas where all six systems fell short. That’s refreshing. Too many AI benchmarks feel like marketing exercises where everyone gets an A+.

I’ve been following Google’s work on AI for science for a while. Their earlier CURIE benchmark tested LLMs on basic analytic tasks across six scientific fields, and other teams have explored using AI to interpret figures, solve quantum mechanics equations, and even write scientific software. This new study feels like a natural progression — moving from “can this model regurgitate facts?” to “can this model help someone think?”

The answer is: sort of, but not yet reliably. If you’re a grad student trying to get up to speed on cuprates, a well-designed LLM tool could save you weeks of reading. But you’d still need to double-check everything against primary sources. The models are better than random internet searches, but they’re not ready to replace a human advisor.

What I’d really like to see next is a similar study in a field with less entrenched debate — say, classical electromagnetism or semiconductor physics — to see if the models perform better when the ground truth is more settled. But for now, this paper gives a sobering look at where we actually are. The hype says AI will revolutionize science. The reality says it’s a promising assistant that still needs a lot of supervision.
