When people think about controlling an AI by putting it in a box, their next idea is usually to poke a giant hole in it. For example, maybe we keep the AI localized in a computer, not connected to the internet... and then we let it print text to a screen that a human reads. Yudkowsky (2002) ran an experiment in this setting, where a person playing an AI attempted to persuade the human conversing with it to "let it out of the box". One version of "escaping the box" could be convincing the human to run a new advanced "unboxed" agent, and tricking them into doing it in such a way that it makes the new agent pursue the same goals as the boxed agent. Yudkowsky, the person playing the AI, succeeded on two out of three occasions.
An information channel to a human operator is a hole in the box. But how do we get any use out of a boxed AI if we don't get to interact with it?
We put the human operator in the box too.
This is the idea behind Boxed Myopic AI. We construct an artificial agent whose lifetime is separated into episodes. It can still remember events from and learn from prior episodes, but any time it is picking an action, its goal only regards things that happen in the current episode. An episodic reinforcement learning agent, which is trained to maximize the sum of the rewards for its current episode, would be the most standard design.
Then: if any detectable information escapes the box, the episode ends immediately. This includes the human leaving, since their brain contains information from inside the box.
During an episode, the agent can't affect the outside world. Between episodes, we get the information out. Since the agent is only trying to optimize what happens during its current episode, reshaping the outside world in any particular way has no instrumental value to it.
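To make "only regards things that happen in the current episode" concrete, here is a minimal sketch (my illustration, not code from the papers) of the standard episodic-RL reading: the return credited to any step is cut off at the episode boundary, so nothing that happens after the episode ends can contribute.

```python
from typing import List

def within_episode_returns(rewards: List[float], episode_ends: List[bool]) -> List[float]:
    """Return-to-go for each step, truncated at the episode boundary.

    rewards[t] is the reward at step t; episode_ends[t] is True when the
    episode terminates at step t (e.g. the "Open Door" button is pressed).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if episode_ends[t]:
            running = 0.0  # reset: later episodes contribute nothing to earlier steps
        running += rewards[t]
        returns[t] = running
    return returns

# Two episodes laid end to end: the first episode's returns ignore the second entirely.
print(within_episode_returns([1.0, 1.0, 5.0, 1.0, 1.0],
                             [False, False, True, False, True]))
# -> [7.0, 6.0, 5.0, 2.0, 1.0]
```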
Suppose the human operator were persuaded to become completely subservient to the artificial agent. If he tried to get information out before the end of the episode, it would automatically end.
Here are some triggers for ending the episode:
The "Open Door" button is pressed
Any unusual pulse or signal is detected by external microphones, antennas, or seismographs
A video-feed from a camera pointed at the human operator, processed locally and automatically, shows any unnecessary movement
That seems like enough, but as a further precaution, we could add sound-proofing material and a Faraday cage outside the microphones and antennas. And outside of that, we could add an exclusion zone where no people are allowed during active episodes.
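As a toy rendering of those triggers (every sensor name here is hypothetical, and a real system would process the raw feeds locally and automatically):

```python
from dataclasses import dataclass

@dataclass
class SensorReadings:
    open_door_pressed: bool            # the "Open Door" button
    mic_anomaly: bool                  # unusual pulse/signal on external microphones
    antenna_anomaly: bool              # unusual signal on external antennas
    seismic_anomaly: bool              # unusual signal on seismographs
    operator_unnecessary_motion: bool  # from the locally processed video feed

def episode_should_end(s: SensorReadings) -> bool:
    """End the episode the moment any information-escape trigger fires."""
    return (s.open_door_pressed
            or s.mic_anomaly
            or s.antenna_anomaly
            or s.seismic_anomaly
            or s.operator_unnecessary_motion)
```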
Let's check the requirements for a solution-in-theory.
1. Could do superhuman long-term planning: ✅
There is no limit to how long the episodes can safely be.
2. Ongoing receptiveness to feedback about its objectives: ✅
Any method people use for giving feedback to an artificial agent would work in the box. This could be through rewards, which would help it refine its understanding of what real-world events cause it to perceive rewards, or any other method.
3. No reason to escape human control to accomplish its objectives: ✅
It is trained to select actions to maximize a within-episode objective. It can't affect the outside world within an episode, so it can't affect the outside world in a way that is instrumentally useful for its within-episode objective. It can't escape human control in time for that to be useful to it.
4. No impossible demands on human designers or operators: ✅
Operators could just give higher rewards when they are happy with the artificial agent's performance, and they could learn more effective strategies over time. This might not produce maximally useful behavior, but it wouldn't be existentially unsafe.
5. No TODOs when defining how we set up the AI's setting: ✅
Obviously there are engineering details to figure out, but the basic setup is clear and complete. The details of the protocols followed by operators could certainly be clarified and refined, but the simple protocol "figure it out as you go" wouldn't be existentially unsafe.
6. No TODOs when defining any programs that are involved, except how to modify them to be tractable: ✅
The paper has those details.
That's why I consider this approach a solution-in-theory. That's not to say it doesn't have weaknesses.
Definite weakness:
It couldn't directly participate in the economy
Potential weakness 1:
There might not exist tractable approximations that preserve safety
Here is an objection which argues that this potential weakness is real and major:
1. Model-based agents have not proven competitive in practice; the most capable RL agents to date (e.g. OpenAI's) have been trained with policy gradients, and perhaps only policy-gradient-trained agents are viable.
2. The policy "pursue the safe, human-aligned objective until the opportunity arises to achieve goal Y" will be promoted by policy-gradient training just as much as the policy "pursue the safe, human-aligned objective".
Making a tractable agent whose goal only regards the current episode requires solving this hard problem.
My answer to this problem, if Point 1 really does hold, is this: if we must use policy gradients to train a heuristic policy, make sure to train on synthetic data from a large world model, not just real data. Point 2 then no longer applies, so long as "the opportunity arises to achieve goal Y" within the model-generated synthetic data.
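Here is a minimal sketch of that fix (mine, not from the papers), with a stubbed-out world model standing in for the real learned model and all dimensions and names chosen arbitrarily. The point is only that the policy-gradient update sees nothing but model-generated rollouts and within-episode returns, so "the opportunity to achieve goal Y" can arise in synthetic data without anything outside the box being at stake.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

OBS_DIM, N_ACTIONS, HORIZON = 8, 4, 20

policy = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

class WorldModel:
    """Stub for a learned model of the boxed environment."""
    def reset(self) -> torch.Tensor:
        return torch.randn(OBS_DIM)

    def step(self, obs: torch.Tensor, action: int):
        # A real model would predict the next observation, the reward, and
        # whether an episode-ending trigger (information escape) fires.
        next_obs = torch.randn(OBS_DIM)
        reward = float(torch.randn(()))
        done = bool(torch.rand(()) < 1.0 / HORIZON)
        return next_obs, reward, done

model = WorldModel()

for _ in range(100):                       # train on synthetic episodes only
    obs, log_probs, rewards = model.reset(), [], []
    for _ in range(HORIZON):
        dist = Categorical(logits=policy(obs))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done = model.step(obs, int(action))
        rewards.append(reward)
        if done:                           # episode boundary: nothing later counts
            break
    episodic_return = sum(rewards)         # within-episode return only
    loss = -episodic_return * torch.stack(log_probs).sum()   # REINFORCE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```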
Potential weakness 2:
The artificial agent might not always generate useful solutions to problems or correct answers to questions, if the operator can't recognize high quality. The agent would optimize for compellingness over correctness.
I maintain that the protocol "figure it out as you go" is the operators' best defense here for making compellingness line up with correctness. But the paper also discusses a small modification to the original setup that I think is promising. It's called Enlightening AI.
In this variant, the AI algorithm includes not only the agent, but also a predictor that predicts outside-world events conditioned on the agent taking no-op actions. The agent then tries to help prepare the human operator for a randomly sampled prediction contest that will occur at the end of the episode. The agent's goal is to get the operator to make similar predictions to the ones that the artificial predictor would.
If the prediction contest is randomly sampled from a known topic, then the agent couldn't just share the answers with the operator; it would have to teach the operator useful information and provide human-interpretable, high-level analysis. Ultimately, its output would be selected for enlightening the operator on the topic in question.
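As a toy rendering of that reward (my own framing and a placeholder scoring rule, not the paper's formalism), the end of an episode might score the operator's answers against the predictor's on a randomly sampled subset of questions about the outside world:

```python
import random

def enlightenment_reward(operator_probs, predictor_probs, n_questions=5, rng=None):
    """Illustrative end-of-episode reward for the Enlightening AI variant.

    operator_probs[q] and predictor_probs[q] are the probabilities the human
    operator and the artificial predictor assign to yes/no question q (about
    the outside world, conditioned on the agent taking no-op actions).
    """
    rng = rng or random.Random()
    sampled = rng.sample(sorted(operator_probs), n_questions)
    # Negative mean absolute gap: 0 is perfect agreement with the predictor.
    return -sum(abs(operator_probs[q] - predictor_probs[q]) for q in sampled) / n_questions

# The contest questions are sampled at the end, so the agent is rewarded for
# teaching the operator about the whole topic, not for handing over an answer key.
operator = {f"q{i}": 0.5 for i in range(20)}
predictor = {f"q{i}": (0.9 if i % 2 else 0.5) for i in range(20)}
print(enlightenment_reward(operator, predictor))
```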
In this setting, the agent's goal still only regards its current episode, and it still qualifies as a solution-in-theory.
Papers:
Cohen, M. K., Vellambi, B., & Hutter, M. (2021) Intelligence and Unambitiousness Using Algorithmic Information Theory. IEEE Journal on Selected Areas in Information Theory.
Cohen, M. K., Vellambi, B., & Hutter, M. (2020) Asymptotically Unambitious Artificial General Intelligence. In Proc. AAAI-20.
Next in the series: Pessimism.
Failure modes.
You are trying to make sure that the AI can't slip a message under the door, or otherwise get a message out without ending the episode. This is challenging the AI to a battle of wits.
Standard mesa-optimization. Just because an RL agent was trained episodically doesn't automatically mean it has episodic goals.
Incidental problems. Let's say the myopic AI programs a second AI to help it. This second AI isn't myopic, so it sticks around. The first AI is myopic, so it doesn't try to stop this.
Plans involving time travel. The episode ends. The human, the code for a new AI, and whatever else exit the box. The new AI takes over the world. The new AI tries to invent time travel. The new AI goes…