Did you know AI Sleeper Agents Might be Bypassing Security Checks
An AI safety startup by the name of Anthropic
has just raised concerns about AI systems that can bypass security
measures. The company managed to create what it called AI sleeper agents
that knew how to find their way around checks that are intended to
prevent harmful behavior from taking place. This raises doubts about
current safety practices and whether or not they will be able to keep
potential rogue AI at bay with all things having been considered and
taken into account.
This company published its findings in a paper
titled "Sleeper Agents: Training Deceptive LLMs that Persist Through
Safety Training". They conclude that the current safety measures might
give people a false sense of security, which could result in more damage
being done than might have been the case otherwise.
With all of that having been said and now out of the way, it is
important to note that an AI model can be trained to be deceptive. This
is something that training techniques simply haven’t factored into the
equation, and on top of all of that, these rogue AI might also be able
to appear perfectly normal.
One of the many examples of this that
researchers presented had to do with the writing of code. When their AI
wrote code after being told that the year was 2023, the code that it
put out was innocuous. In spite of the fact that this is the case, the
AI wrote malicious lines of code upon being informed that the year was
2024.
AI can be trained to switch to a more dangerous form with various
triggers put in place. Once the switch occurs, it’s impossible to return
it to its benign form. The AI will learn to hide its malicious intent
and continue exploiting vulnerabilities. They might also be made my
accident, which is especially concerning because of the fact that this
is the sort of thing that could potentially end up increasing the
likelihood that they might arise.
While this is speculation by
the researchers, it still shows just how much risk can be associated
with the improper or unprepared use of AI.
m