EPA 2024 · Symposium

I gave GPT-4 the keys to my lab.

Two experiments. Two null results. One useful lesson about what AI can — and cannot yet — do inside an experimental psychology lab.

Eastern Psychological Association · March 2024 · New York, NY

The question wasn’t whether ChatGPT could write fluent text. We knew that. The question was whether it could do something subtler: design a cognitive prime, a short passage engineered to nudge a particular internal state in a reader. Primes are the connective tissue of social psychology experiments. Get them right and you can ask careful causal questions about how people think and feel. Get them wrong and you have burned a sample.

Slide 1 · Title card. EPA Symposium, March 2024.

The Setup.

I picked self-compassion as the test case. The construct is well-mapped. Neff (2003) laid it out as three pillars (mindfulness, common humanity, self-kindness), and Raes and colleagues (2011) gave us a clean twelve-item Short Form to measure it. Validated theory. Validated instrument. If GPT-4 was going to be a research assistant, this was exactly the kind of work it would need to do.

The Ask.

So I wrote the prompt I would have written for a sharp undergraduate research assistant. Three groups. Three short paragraphs, matched in length and tone. The first should produce the most intense self-compassion possible. The second, about half as much. The third should produce none — same feel, same tone, no movement. And none of them, please, should mention self-compassion directly.

GPT-4, working through Bing, returned three paragraphs. They read well. Genuinely well. I would have published the high-compassion one in a wellness pamphlet without flinching.

Slide 2 · The brief submitted to GPT-4 (via Bing), March 2024.

The Verdict.

We ran it. N = 206, recruited via Prolific, balanced US and UK, mixed religious and political demographics, the works. Random assignment to one of the three primes, then the Self-Compassion Scale Short Form.

Nothing.

The high-compassion group sat at M = 2.89. The half-compassion group at M = 2.91. The neutral garden imagery at M = 2.99. A one-way ANOVA gave us F(2, 203) = 0.23, p = .79. We had not budged self-compassion. We had not even budged self-compassion in the wrong direction.
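The reported test can be checked from the summary numbers alone. The sketch below assumes a standard one-way between-subjects ANOVA with three groups and all 206 participants retained, which is what the reported degrees of freedom imply. The function names are illustrative and not taken from the original analysis. With exactly 2 numerator degrees of freedom, the upper tail of the F distribution has a closed form, so no statistics library is needed. Eta squared was not reported in the talk; it is derived here from F and the degrees of freedom only.

```python
import math


def f_upper_tail_two_df(f_stat: float, df_error: int) -> float:
    """Upper-tail p-value for an F(2, df_error) statistic.

    Valid only when the numerator df is exactly 2, where
    P(F > f) = (1 + 2f / df_error) ** (-df_error / 2).
    """
    if f_stat < 0 or df_error <= 0:
        raise ValueError("f_stat must be >= 0 and df_error must be > 0")
    return (1.0 + 2.0 * f_stat / df_error) ** (-df_error / 2.0)


def eta_squared(f_stat: float, df_effect: int, df_error: int) -> float:
    """Proportion of variance explained, recovered from F and its df."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


# Values reported for Experiment 1.
n_total = 206
n_groups = 3
f_reported = 0.23

df_effect = n_groups - 1        # 2
df_error = n_total - n_groups   # 203

p_value = f_upper_tail_two_df(f_reported, df_error)
effect = eta_squared(f_reported, df_effect, df_error)

print(f"F({df_effect}, {df_error}) = {f_reported:.2f}, p = {p_value:.2f}")
print(f"eta squared = {effect:.4f}")
```

The recomputed p-value rounds to .79, matching the reported figure, and the implied eta squared is roughly 0.002: the primes account for about a fifth of one percent of the variance in SCS-SF scores.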

Sensitivity, I thought. Maybe the SCS-SF is sluggish. So I went back to GPT-4 (this time the direct interface, no Bing in the middle) and asked it to write three new primes from the same brief. We added three single-item probes immediately after the manipulation: “I feel kindness toward myself right now.” “I feel tenderness toward myself right now.” “I feel compassion toward myself right now.” Surely a state measure, taken seconds after the prime, would catch what a trait measure missed.

N = 100. Different primes. Sharper instrument. Same null. Kindness, p = .20. Tenderness, p = .37. Compassion, p = .96. SCS-SF, p = .42. The contrast that came closest to significance, high versus neutral on kindness, limped in at p = .07.

Slide 3 · Both experiments. Both null. The closest contrast did not survive.

The Side Quest.

While I had GPT-4’s attention, I asked it to find peer-reviewed experimental work on self-compassion — papers with actual manipulations. It returned five legitimate articles. Real DOIs. Real journals. I checked.

I asked for more.

It declined. Or — more precisely — it explained, with the unflappable politeness of a junior librarian, that it could not access PubMed, PsycINFO, or Google Scholar directly, and offered to teach me how to search them myself.

That moment was more instructive than the failed experiments. The model was confident enough to produce good output and honest enough to flag the edge of its own capacity. Both halves matter. The research-killing AI mistakes tend to happen in the gap between those two postures, when a model produces output it should not be confident about and does not tell you.

The Coda.

For the symposium, I asked GPT-4 directly: do you think you will be a valuable tool for social psychological research?

Its answer was a careful list of yeses. Generating stimuli. Coding qualitative data. Simulating interactions. Surveying the literature. Helping with outreach.

Reading it now, two years later, that answer was right in shape and wrong in pace. It was already useful for some of those things. It is more useful for some of them today. And for the one I tested in 2024 — generating experimental manipulations sensitive enough to move a real psychological construct — it was not ready. It may still not be ready.

That is the lesson I keep coming back to. AI can do many parts of research work well. It cannot yet replace the part where a researcher sits with a construct long enough to know what would actually move it. When we ask it to do that work, we should expect null results.

That is precisely what we got.

Slide 4 · GPT-4 on its own future. Optimistic, and not entirely wrong.

References.

Neff, K. D. (2003). The development and validation of a scale to measure self-compassion. Self and Identity, 2(3), 223–250. https://doi.org/10.1080/15298860309027

Raes, F., Pommier, E., Neff, K. D., & Van Gucht, D. (2011). Construction and factorial validation of a short form of the Self-Compassion Scale. Clinical Psychology & Psychotherapy, 18(3), 250–255. https://doi.org/10.1002/cpp.702

Michael W. Magee, PhD · Social Cognition Lab · Symposium presented at the Eastern Psychological Association annual meeting, New York, NY, March 2024.