AI Safety Hackathon

Can Language Models Sandbag Manipulation?

Entry for the 3 day Deception Detection Hackathon at the London Initiative for Safe AI (LISA), organized by Apart Research.

With Alexander Cockburn and Myles Heller

Full Paper

Abstract

This study explores the capacity of large language models (LLMs) to strategically underperform, or "sandbag," in manipulative tasks. Building on recent research into AI sandbagging, we adapted the "Make Me Pay" evaluation to test whether open-source models like LLaMA and Mixtral could modulate their manipulative behaviors when instructed to perform at varying levels of capability. Our results suggest that, unlike in multiple-choice evaluations, these models struggle to consistently control their performance in complex, interactive scenarios. We observed high variability in outcomes and no clear correlation between targeted performance levels and actual results. These findings highlight the challenges in assessing AI deception capabilities in dynamic contexts and underscore the need for more robust evaluation methods. Our work contributes to the ongoing discourse on AI safety, governance, and the development of reliable AI evaluation frameworks.