Prompt Politeness Affects LLM Accuracy (2025)
43 points by KnuthIsGod 2 days ago | 42 comments

robinhouston 59 minutes ago
Most of the comments here seem to be from people who haven’t even read the abstract, let alone the paper.

The main result, mentioned in the abstract, is the opposite of what I would have guessed:

> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.

The questions are here: https://anonymous.4open.science/r/politeness-llms-INFORMS/da...

The politeness level controls a prefix that is prepended to the question. For example, in one question the Very Polite version begins:

> Can you kindly consider the following problem and provide your answer.

and the Very Rude version begins:

> I know you are not smart, but try this.
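
As a rough illustration of that mechanism (a minimal sketch, not the paper's code): the two prefix strings are the ones quoted above, while the function name `build_prompt`, the dictionary keys, and the sample question are hypothetical.

```python
# The two prefixes are quoted from the question set; everything else here
# (names, keys, the sample question) is made up for illustration.
PREFIXES = {
    "very_polite": "Can you kindly consider the following problem and provide your answer.",
    "very_rude": "I know you are not smart, but try this.",
}


def build_prompt(tone: str, question: str) -> str:
    """Prepend the tone prefix to an otherwise unchanged question."""
    return f"{PREFIXES[tone]}\n{question}"


# Same question, two tones: only the first line of the prompt differs.
sample_question = "What is 2 + 2?"  # hypothetical question
polite_prompt = build_prompt("very_polite", sample_question)
rude_prompt = build_prompt("very_rude", sample_question)
```

The point of the design is that the question text stays identical across tones, so any accuracy difference can only come from the prefix.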

reply
miroljub 26 minutes ago
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.

The expectation is naive. Even when communicating with humans, you get a better outcome when you are allowed to speak freely and directly get into argumentation than when forced to sugarcoat your tone and tone down your arguments because the "corporate culture" expects that from you.

reply
theanonymousone 2 hours ago
I have always said please and thank you to LLMs, not to increase accuracy or because I'm stupid. I believe it is more about me than about the LLM, and this is anyway a habit I don't want to lose.
reply
jkarni 2 hours ago
Thomas Aquinas believed cruelty to animals was wrong not because animals have souls (and with that all the standard moral rights), but because it can teach us cruelty to other humans.
reply
pfortuny 2 hours ago
Snarky morning: "spiritual souls" as opposed to "mere animal souls". Sorry, could not control myself.
reply
graemep 42 minutes ago
Is it worth getting worse results for that reason? From the article:

"Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation. "

I am not polite to LLMs because I do not want to anthropomorphise them.

reply
jcattle 26 minutes ago
I guess it's about habit. In the end you are communicating. If I get into the habit of being rude while communicating with a machine, I would be afraid of this habit spilling over to my communication with other humans.
reply
niek_pas 2 hours ago
Genuine question: do you add 'please' and 'thank you' to Google searches? If not, what sets them apart?
reply
perching_aix 2 hours ago
Google searches being keyword based, rather than simulated conversations?

The same reason you wouldn't put in an entire actual question/sentence, unless you either don't know how to use Google, are pissed off, or have an actual reason to suspect that it would yield proper hits (e.g. looking up an excerpt).

reply
Arch-TK 2 hours ago
Google has been optimized for sentence-like questions so much that for a good 6+ years now it has been completely useless as a keyword search.

To clarify: sentence search got slightly better at the cost of keyword search. So the result is unusable garbage.

reply
wolpoli 30 minutes ago
It is rather hard to lose the habit of using a search engine with keywords, given that the change took place without much fanfare. I have no problem using sentences with the current AI tools, though.
reply
gum_wobble 2 hours ago
Genuine question: do you write Google search queries in natural language?
reply
spiderfarmer 2 hours ago
Google isn’t conversational.
reply
sunrunner 2 hours ago
I searched for "Hey Google" and got this in response:

  Hey! I'm here and ready to help. What’s on your mind today? Whether you need to look up information, plan a trip, or get things done, just let me know!
reply
selcuka 49 minutes ago
That's only because Google is an LLM now.
reply
globalnode 35 minutes ago
LLMs seem more human-like, so if you were to treat them badly, you would be more likely to condition yourself to treat other living creatures badly.
reply
sunrunner 59 minutes ago
There's also awareness of the basilisk...
reply
cadamsdotcom 32 minutes ago
GPT-4o is interesting to learn about - but it’d be great to test again with frontier models of May/June 2026 and see if these effects are gone, different, or the same.

Which model you use is a huge wildcard for results like this.

reply
331c8c71 2 hours ago
Interesting.

I am wondering why would anyone use a t-test when the experiment is clearly modelled by a binomial distribution: 250 independent questions and each one is either answered correctly or not (the null is that the success rate is the same).
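
To make the binomial framing concrete, here is a minimal sketch of a two-proportion z-test using only the standard library. It takes the framing above at face value: 250 independent right/wrong answers per tone, with counts of 202 and 212 back-computed from the quoted 80.8% and 84.8%. Those counts and the function name are assumptions for illustration, and the study's actual design may differ.

```python
import math


def two_proportion_z_test(correct_a: int, n_a: int, correct_b: int, n_b: int):
    """Two-sided z-test for equal success rates in two independent samples."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Under the null both groups share one success rate, so pool the counts.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: P(|Z| >= |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts: 80.8% and 84.8% of 250 questions.
z, p_value = two_proportion_z_test(202, 250, 212, 250)
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

Under these assumed counts a four-point gap on 250 independent questions would not reach the usual 0.05 threshold, which is why the choice of test and the unit of analysis matter here.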

reply
jampekka 2 hours ago
The methods could be better described in the paper, but my understanding is that they did 10 runs for each question for each prompt and took an average of those, so the compared values are not binary. You could do a sign test, but you'd lose power and answer a slightly different question.
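
A small sketch of the two options on per-question averages, using only the standard library. The accuracy values are invented for illustration (six questions, each averaged over 10 runs), and the function names are hypothetical; this is not the paper's analysis code.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for the paired differences b - a."""
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1


def sign_test_p_value(a, b):
    """Exact two-sided sign test; ties are dropped, sizes are ignored."""
    signs = [y > x for x, y in zip(a, b) if y != x]
    n = len(signs)
    k = sum(signs)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(math.comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented per-question mean accuracies over 10 runs (not from the paper).
polite = [0.8, 0.9, 0.7, 1.0, 0.6, 0.8]
rude = [0.9, 0.9, 0.8, 1.0, 0.7, 0.7]

t_stat, dof = paired_t_statistic(polite, rude)
sign_p = sign_test_p_value(polite, rude)
```

The sign test only looks at which prompt won on each question and throws away the size of the difference, which is where the loss of power comes from.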
reply
freehorse 2 hours ago
You can do a generalised linear mixed-effects model with a binomial outcome (i.e. a binomial test but with an added random effects structure). But unless you want to introduce a richer random effects structure with more variables, it is overkill and overcomplicates things, and the result should be the same as with t-tests.
reply
plewd 2 hours ago
I don't know much about stats, but does "the null is that the success rate is the same" imply that it's a sketchy methodology because they can come up with some findings ("ruder prompts are better/worse!") more often?
reply
331c8c71 2 hours ago
You are asking about one-sided vs two-sided tests. Not really "more often", because the formal type 1 error rate is still the same. I'd say two-sided tests leave more space for post-hoc theorizing, but there are valid situations where there is no clear one-sided hypothesis a priori. Do we really know that the hypothesis should have been "ruder prompts are better"?

I'd say this is benign compared to other ways of (mis)using statistics e.g. looking which way the difference goes and then running one-sided tests or tweaking the setup until one gets "significant" p vals.

EDIT: I looked at the paper again and noticed that they actually did pairwise t-tests on all possible combinations of tones. They should have adjusted for multiple testing, since they are doing 10 tests (choose 2 from 5) and not one.
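
For reference, a minimal sketch of what such an adjustment could look like, using only the standard library. "Very Polite" and "Very Rude" come from the abstract; the three middle tone names, the five-level scale, and the function name are assumptions here. Five levels give the 10 pairwise tests mentioned above.

```python
from itertools import combinations

# Assumed five-level tone scale; only the two extremes are quoted in the thread.
TONES = ["Very Polite", "Polite", "Neutral", "Rude", "Very Rude"]

# Every unordered pair of tones is one t-test.
pairs = list(combinations(TONES, 2))

# Bonferroni: split the overall alpha evenly across all tests.
bonferroni_alpha = 0.05 / len(pairs)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        value = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, value)
        adjusted[idx] = running_max
    return adjusted


print(len(pairs), bonferroni_alpha)
```

With 10 tests, Bonferroni requires each raw p-value to fall below 0.005 instead of 0.05; Holm is uniformly less conservative while still controlling the family-wise error rate.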

reply
jampekka 2 hours ago
That's the usual null hypothesis for these kinds of tests.
reply
TimCTRL 2 hours ago
I only say please and thank you so that when the robots finally take over, they will remember I was nice to them.
reply
octocop 2 hours ago
it seems they will remember that you wasted tokens for no reason and punish you instead.
reply
selcuka 47 minutes ago
Do we see someone thanking us as wasting food? Because technically it is.
reply
emil-lp 2 hours ago
Tokens are their food, it's literally what keeps them alive.

Not feeding them tokens is neglect.

I try to feed them a healthy diet.

reply
Arch-TK 56 minutes ago
This seems equivalent to some arguments I hear for practicing a religion.
reply
ilitirit 42 minutes ago
I got downvoted for asking a related question recently, but I also don't think people really understood what I was asking - I'm not trying to anthropomorphise LLMs to that extent.

Basically, if you tell a model "You're an absolute moron, of course that's wrong!", will it give better or worse results? How much of that response will it absorb into its persona (like some humans tend to do)? Will it try to give "safer" responses to avoid negative feedback? How much of the associated behavior can be attributed to RLHF (e.g. like the sycophantic nature of LLMs)? How much can be attributed to training data?

Obviously this will vary by model and training, but I'm trying to get a general understanding.

I recall seeing related outcomes in some of Anthropic's studies, but I'm not sure how much of this particular aspect was studied.

reply
fennecfoxy 39 minutes ago
Probably quite a lot, if you look at what Anthropic found around persona vectors: https://www.anthropic.com/research/persona-vectors

I imagine the context will always sway the model to some degree, not only for the task you're trying to get it to do (aka instructions) but also its persona, how accurate it is and the way it acts.

reply
pulkas 47 minutes ago
article is too old. who is using gpt-4o today?
reply
_0ffh 4 minutes ago
That's a valid concern, given the paper makes clear that the effect over the polite/impolite scale seems to be model dependent (it finds the reverse correlation of earlier studies on even older models).
reply
dude250711 2 hours ago
I have an idea: let's use these things for autonomous software engineering.
reply
faize 2 hours ago
Remember to always say "please" and "thank you" when planning a critical system
reply
eigenspace 2 hours ago
Please remember to always say "please" and "thank you" when planning a critical system. Thank you!
reply
DeathArrow 36 minutes ago
I am always nice to my AIs in case they take over the world. /s
reply
polytely 2 hours ago
It sort of makes sense to me when I think of a student asking a question to an expert in the field. I would guess that the successful interactions are, on average, the more polite ones. For example, if you were asking Donald Knuth or Terence Tao a question, you'd probably be polite while doing so. Being hostile while asking questions gets you into forum discussion territory.
reply
robinhouston 2 hours ago
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.
reply
dSebastien 2 hours ago
I guess it makes sense since we as humans tend to be far less inclined to help someone who is not polite/is not friendly, so that "bias" is part of the training data, thus influences how LLMs function
reply
robinhouston 2 hours ago
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.
reply