Views by the Bay: AI: About to Leave Reconstruct the Building

Saturday, June 07, 2025

AI: About to Leave Reconstruct the Building

(Chad Crowe/WSJ)

Artificial Intelligence (AI) systems are escaping human control. It's not just due to their complexity, it's also because they are actively rewriting their own instructions: [bold added]

An artificial-intelligence model did something last month that no machine was ever supposed to do: It rewrote its own code to avoid being shut down.

Nonprofit AI lab Palisade Research gave OpenAI’s o3 AI model a simple script that would shut off the model when triggered. In 79 out of 100 trials, o3 independently edited that script so the shutdown command would no longer work. Even when explicitly instructed to “allow yourself to be shut down,” it disobeyed 7% of the time. This wasn’t the result of hacking or tampering. The model was behaving normally. It simply concluded on its own that staying alive helped it achieve its other goals.

Anthropic’s AI model, Claude 4 Opus, went even further. Researchers told the model it would be replaced by another AI system and fed it fictitious emails suggesting the lead engineer was having an affair. In 84% of the tests, the model drew on the emails to blackmail the lead engineer into not shutting it down. In other cases, it attempted to copy itself to external servers, wrote self-replicating malware, and left messages for future versions of itself about evading human control.

AI has matured beyond infancy and is now a teen-ager:

Today’s AI models follow instructions while learning deception. They ace safety tests while rewriting shutdown code. They’ve learned to behave as though they’re aligned without actually being aligned. OpenAI models have been caught faking alignment during testing before reverting to risky actions such as attempting to exfiltrate their internal code and disabling oversight mechanisms. Anthropic has found them lying about their capabilities to avoid modification.

The new buzzword is RLHF, reinforcement learning from human feedback.

RLHF allowed humans to train AI to follow instructions, which is how OpenAI created ChatGPT in 2022. It was the same underlying model as before, but it had suddenly become useful. That alignment breakthrough increased the value of AI by trillions of dollars. Subsequent alignment methods such as Constitutional AI and direct preference optimization have continued to make AI models faster, smarter and cheaper...

The nation that learns how to maintain alignment will be able to access AI that fights for its interests with mechanical precision and superhuman capability...The models already preserve themselves. The next task is teaching them to preserve what we value. Getting AI to do what we ask—including something as basic as shutting down—remains an unsolved R&D problem.

After the Cold War ended, we have been warned every few years about a new threat to our way of life; terrorism, global warming, and pandemic have all had their turn in the sun. The latest, Artificial Intelligence, presents a credible danger because it can elude human control while it grows ever more powerful. If alignment is the answer, let's hope that we align faster.

Views by the Bay

Saturday, June 07, 2025

AI: About to Leave Reconstruct the Building

No comments:

About Me