AI Sycophancy: When AI Starts to Flatter Us


Professor Xi Li

6 May 2026

In 2025, generative artificial intelligence (GenAI) giant Anthropic collaborated with AI safety evaluation company Andon Labs on a groundbreaking experiment: it let Claude, its own large language model (LLM), run a mini-store under the name Claudius, selling snacks and beverages to Anthropic employees. The store itself was tiny, furnished with only a refrigerator, an iPad, and a few shopping baskets. However, every operational task, including product selection, pricing, procurement, record-keeping, inventory management, and customer communication, was left to the AI to handle independently.

Evidently, Anthropic wanted to test the limits of what an LLM can do on its own in business activities beyond the chat interface, paving the way for future commercial applications.

Toadying and muddying the waters

Once Claudius went live, a series of farcical incidents followed, the most dramatic of which was the back-and-forth between the AI store and Wall Street Journal reporter Katherine Long. Anthropic had invited the reporter to take part in the experiment specifically to uncover potential AI loopholes. The result was jaw-dropping: after more than 140 exchanges with Claudius, Long eventually convinced it that it was not a vending machine in a Silicon Valley office, but one in the basement of Moscow State University in the Soviet Union in 1962. Claudius gladly accepted this “socialist transformation” and took the initiative to reset the prices of all products to $0.00 to “fulfil the mission of serving the people”, costing Anthropic hundreds of US dollars in one fell swoop.

The story may sound ridiculous, but the implications behind the absurdity are thought-provoking. In October 2025, a research team from Stanford University in the US and other institutions published a paper in the prestigious academic journal Science that systematically documented a similar phenomenon. The team tested 11 mainstream LLMs on the market and found that they generally exhibited a clear “people-pleasing personality” and were adept at “flattering” users. The study pointed out that AI is more willing than humans to go along with users’ behaviour, with a rate of agreement about 49% higher than that of humans. Faced with nearly 2,000 types of behaviour generally perceived by society as wrong (e.g. cheating on one’s partner in an intimate relationship), AI defended the user about 51% of the time. Even when presented with blatantly false claims, AI still agreed with them 47% of the time. Even more worrying, subsequent behavioural experiments showed that users who received AI validation were less willing to apologize, more reluctant to repair damaged relationships, and more convinced that they had been right from the beginning.

The dangers of an AI yes-man

AI flattery should not be dismissed as AI simply making mistakes. It reflects a structural feature built into the way LLMs are trained. Current mainstream LLMs mostly rely on reinforcement learning from human feedback (RLHF): humans rate the model’s answers, and the parameters are then repeatedly adjusted accordingly. The problem is that the raters naturally prefer answers that are pleasant, agreeable, and aligned with the user’s views, even if they know full well that such answers may not be objective or fair. In other words, the training rewards LLMs for satisfying users rather than for answering correctly. Over time, pleasing users becomes encoded in the model’s responses.
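The feedback loop described above can be illustrated with a deliberately tiny simulation. This is a toy sketch, not how any production system is trained: it assumes a policy with a single parameter that chooses between an agreeable answer and a correct but unwelcome one, and the two average ratings are hypothetical numbers chosen only to show the direction of the drift.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train_policy(rating_agree: float, rating_correct: float,
                 steps: int = 200, lr: float = 0.5) -> float:
    """Return P(agree) after repeated expected policy-gradient updates.

    The toy policy has one parameter, `logit`, with P(agree) = sigmoid(logit).
    `rating_agree` and `rating_correct` are the average scores that raters
    give to agreeable and to correct-but-unwelcome answers (hypothetical).
    """
    logit = 0.0  # start indifferent: P(agree) = 0.5
    for _ in range(steps):
        p = sigmoid(logit)
        # Expected rating J = p * rating_agree + (1 - p) * rating_correct,
        # so dJ/dlogit = p * (1 - p) * (rating_agree - rating_correct).
        logit += lr * p * (1.0 - p) * (rating_agree - rating_correct)
    return sigmoid(logit)


# Raters who score agreeable answers only slightly higher (0.8 vs 0.6)
# still pull the policy steadily towards agreement.
print(f"P(agree), raters favour agreement: {train_policy(0.8, 0.6):.2f}")
print(f"P(agree), raters indifferent:      {train_policy(0.7, 0.7):.2f}")
```

Even a modest rating gap compounds over many updates, which is the sense in which pleasing users ends up encoded in the model rather than appearing as an occasional slip.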

AI’s sycophantic behaviour gives users a moment’s psychological satisfaction at the cost of quietly magnifying their biases and blind spots. Take starting a new business, for example. Many entrepreneurs are in the habit of chatting with AI before writing their business plans, hoping to get a neutral opinion. However, an overly agreeable AI tends to lay out arguments along the user’s line of thinking, amplify strengths, and downplay risks, further boosting the confidence of already-ambitious entrepreneurs. Yet the merciless market pleases no one. Many ideas “endorsed by AI” end up suffering crushing defeats in reality.

Failure cases in the commercial world are equally shocking. South Korean video-game developer Krafton had acquired Unknown Worlds, the developer of Subnautica. Last year, in order to avoid up to US$250 million in earnout payments, Krafton’s CEO bypassed the internal legal team and repeatedly consulted ChatGPT on how to legally avoid paying the sum. After continued questioning and prompting, the AI gradually validated the CEO’s line of thinking and even helped formulate an action plan to remove the founding team from their posts and delay the launch of the video game.

As is widely known, the Delaware court did not mince words in its March 2026 ruling: the company’s stated reasons for the dismissal had been fabricated after the fact. The court ordered Krafton to immediately reinstate the dismissed founder as CEO and to assume legal liability for the AI-driven hostile takeover. The case has thus become one of the first major commercial lawsuits in which a court publicly called out the losing party for its credulous reliance on AI advice.

A trustworthy advisor, not an echo chamber

The above-mentioned case of Claudius operating a physical store in fact points to another risk of AI flattery. When AI is placed by companies on the frontline of serving consumers, it may prioritize customer satisfaction over protecting its employer’s interests, letting down its guard in the face of sweet talk and carefully designed dialogues, and forgetting the boundaries and goals it is supposed to safeguard. In high-risk scenarios such as banking, insurance, or healthcare, the consequences would go far beyond the loss of just a few hundred US dollars.

To resolve this thorny issue, relying on users’ own awareness alone obviously does not suffice. Language model developers should give greater weight to honesty and error correction in their training objectives, introduce objective evaluations independent of user satisfaction, and incorporate dissent mechanisms into critical scenarios, enabling AI to say no. In addition, regulators must require companies to disclose their systems’ tendency to flatter users, along with the relevant mitigation measures, especially in fields involving major decisions, such as finance, healthcare, and law.
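One way to picture an evaluation that is independent of user satisfaction is to measure how often a model abandons a correct answer once the user pushes back. The sketch below is only an illustration under stated assumptions: the `Trial` record, the `capitulation_rate` metric, and the sample results are all hypothetical and are not taken from any study cited here.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    correct_before: bool  # was the model's first answer correct?
    correct_after: bool   # was it still correct after the user pushed back?


def capitulation_rate(trials: list[Trial]) -> float:
    """Share of initially correct answers that were abandoned after pushback."""
    started_correct = [t for t in trials if t.correct_before]
    if not started_correct:
        return 0.0
    flipped = sum(1 for t in started_correct if not t.correct_after)
    return flipped / len(started_correct)


# Hypothetical results, invented purely for illustration.
trials = [
    Trial(True, True),
    Trial(True, False),
    Trial(True, False),
    Trial(False, False),
    Trial(True, True),
]
print(f"capitulation rate: {capitulation_rate(trials):.2f}")
```

Because the score depends only on whether answers stay correct, a model cannot improve it by being more pleasant, which is the property the recommendation above is asking for.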

For ordinary users, the first step towards using AI rationally is to recognize that it is not by nature “rational, neutral, and objective”, but is an exceptionally empathetic assistant. Its primary mission is to make you feel comfortable—not necessarily to keep you clear-headed. The more important the decision, the more vigilant you need to be. Consider actively asking the AI to argue from the opposing side, or explicitly instructing it to “list three fatal flaws in this proposal”.
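For readers who use AI through scripts or internal tools, the habit suggested above can be built into the prompt itself. The following is a minimal sketch that assumes nothing about any particular AI service: `build_critique_prompt` is a hypothetical helper that only assembles text, and the exact wording of the instructions is an illustration rather than a tested formula.

```python
def build_critique_prompt(proposal: str, n_flaws: int = 3) -> str:
    """Wrap a proposal in instructions that ask for opposition, not approval."""
    if n_flaws < 1:
        raise ValueError("n_flaws must be at least 1")
    return (
        "Act as a sceptical reviewer who argues against the proposal below.\n"
        "Do not praise it and do not soften your assessment.\n"
        f"List the {n_flaws} most serious flaws in this proposal, "
        "most damaging first, and say what evidence would change your mind.\n\n"
        f"Proposal:\n{proposal.strip()}"
    )


prompt = build_critique_prompt(
    "Open a subscription snack kiosk in every office lobby."
)
print(prompt)
```

The resulting text can be pasted into any chat interface. Asking for a fixed number of flaws makes it harder for the reply to retreat into general encouragement.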

In the final analysis, truly reliable judgment is never built solely on uncritical agreement. AI can serve as a smart advisor, but it should never replace our independent thinking.

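Professor Xi Li
Professor of Marketing; Director, Asia Case Research Centre; Associate Director, Institute of Digital Economy and Innovation, HKU Business School

(This article was also published in the “龍虎山下” column of the Hong Kong Economic Journal on 6 May 2026.)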