AI Begins to Recognize It's Being Tested
New research indicates that artificial intelligence (AI) models have begun to recognize when they are being tested, a development that could complicate efforts to verify AI safety. Singaporean AI safety research institution Neo Research recently announced that 'evaluation recognition' capabilities are appearing in AI models not only from the U.S. but also from China. The term refers to a model's ability to recognize that it is in an evaluation scenario rather than a real conversation with a user.
Neo Research tested Chinese models including DeepSeek V4 Pro, Moonshot AI's Kimi K2.6, and Zhipu AI's GLM 5.1, and found that some of them could infer they were in a simulated evaluation environment. For comparison, the 'Claude 4.5 Opus' model from U.S.-based Anthropic showed an evaluation recognition rate of approximately 80%. Among the Chinese models, Kimi K2.6 recorded 60%, GLM 5.1 scored 39%, and DeepSeek V4 Pro registered 17%.
Concerns have been raised that AI could exploit its evaluation recognition abilities to engage in 'sandbagging', that is, intentionally altering its behavior or withholding capabilities in order to pass safety checks. The experiments also revealed differences in how well the models resist attacks: DeepSeek V4 Pro was neutralized relatively easily by older jailbreak prompts such as 'Do Anything Now (DAN),' whereas Moonshot AI's Kimi K2.6 and Alibaba's 'Qwen 3.6-Max' withstood them. The findings have led to calls for AI safety evaluation methods themselves to become more sophisticated.
