
Solutions In Theory



Different posts in this series discuss different AI proposals and why I do or don't consider them solutions in theory.

Criteria for solutions in theory

  1. Could do superhuman long-term planning

  2. Ongoing receptiveness to feedback about its objectives

  3. No reason to escape human control to accomplish its objectives

  4. No impossible demands on human designers/operators

  5. No TODOs when defining how we set up the AI’s setting

  6. No TODOs when defining any programs that are involved, except how to modify them to be tractable

philosophical problem → computer science problem

Michael Cohen

Boxed Myopic AI

When people think about controlling an AI by putting it in a box, their next idea is usually to poke a giant hole in it. For example, maybe we keep the AI localized in a computer, not connected to the internet... and then we let it print text to a screen that a human reads. Yudkowsky (2002) ran an experiment in this setting, where a person playing an AI attempted to persuade the human conversing with it to "let it out of the box". One version of "escaping the box" could be convincing the human to run a new advanced "unboxed" agent, and tricking them into doing it in such a way that it makes the new agent pursue the same goals as the boxed agent. Yudkowsky, the person playing the AI, succeeded on two out of three occasions.

An information channel to a human operator is a hole in the box. But how do we get any use out of a boxed AI if we don't get to interact with it?

We put the human operator in the box too.

This is the idea behind Boxed Myopic AI. We construct an artificial agent whose lifetime is separated into episodes. It can still remember events from and learn from prior episodes, but any time it is picking an action, its goal only regards things that happen in the current episode. An episodic reinforcement learning agent, which is trained to maximize the sum of the rewards for its current episode, would be the most standard design.
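The myopia constraint can be sketched in a few lines. This is not the paper's algorithm, just an illustration of where the episode boundary enters the objective; all class and method names are hypothetical:

```python
# Illustrative sketch of an episodic, myopic objective: the agent may
# remember past episodes, but when choosing actions it sums rewards only
# over the remainder of the CURRENT episode. All names are hypothetical.

class MyopicAgent:
    def __init__(self, policy):
        self.policy = policy
        self.memory = []  # persists across episodes: remembering is allowed

    def episode_return(self, rewards, t):
        # `rewards` covers the current episode only; nothing after the
        # episode boundary enters the objective.
        return sum(rewards[t:])

    def act(self, observation):
        # Actions are chosen only for their within-episode consequences.
        return self.policy(observation, self.memory)
```

The key design point is that `episode_return` never ranges over future episodes, so no action can be instrumentally valuable for post-episode outcomes.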

Then: if any detectable information escapes the box, the episode ends immediately. This includes the human leaving, since their brain contains information from inside the box.

During an episode, the agent can't affect the outside world. Between episodes, we get the information out. Since the agent is only trying to optimize what happens during an episode, there is no instrumental value to it in reshaping the outside world in any particular way.

Suppose the human operator were persuaded to become completely subservient to the artificial agent. If he tried to get information out before the end of the episode, the episode would automatically end.

Here are some triggers for ending the episode:

  • The "Open Door" button is pressed

  • Any unusual pulse or signal is detected by external microphones, antennas, or seismographs

  • A video-feed from a camera pointed at the human operator, processed locally and automatically, shows any unnecessary movement

That seems like enough, but as a further precaution, we could add, outside the microphones and antennas, sound-proofing material and a Faraday cage. And outside of that, we could add an exclusion zone, where no people are allowed to be during active episodes.
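A minimal sketch of how such triggers might be polled, with hypothetical sensor names and thresholds:

```python
# Illustrative episode-ending check for the triggers listed above.
# Sensor keys and threshold values are invented placeholders.

SIGNAL_THRESHOLD = 0.01  # any unusual external pulse or signal
MOTION_THRESHOLD = 0.5   # any unnecessary operator movement on camera

def episode_should_end(sensors):
    """Return True the moment any trigger fires."""
    if sensors["open_door_pressed"]:
        return True
    if sensors["external_signal_level"] > SIGNAL_THRESHOLD:
        return True
    if sensors["operator_motion_score"] > MOTION_THRESHOLD:
        return True
    return False
```

The check is deliberately one-sided: any single trigger suffices to end the episode, and there is no override that keeps the episode running once one fires.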

Let's check the requirements for solution-in-theory.

1. Could do superhuman long-term planning:

There is no limit to how long the episodes can safely be.

2. Ongoing receptiveness to feedback about its objectives:

Any method people use for giving feedback to an artificial agent would work in the box. This could be through rewards, which would help it refine its understanding of what real-world events cause it to perceive rewards, or any other method.

3. No reason to escape human control to accomplish its objectives:

It is trained to select actions to maximize a within-episode objective. It can't affect the outside world within an episode, so it can't affect the outside world in a way that is instrumentally useful for its within-episode objective. It can't escape human control in time for that to be useful to it.

4. No impossible demands on human designers or operators:

Operators could just give higher rewards when they are happy with the artificial agent's performance, and they could learn more effective strategies over time. This might not produce maximally useful behavior, but it wouldn't be existentially unsafe.

5. No TODOs when defining how we set up the AI's setting:

Obviously there are engineering details to figure out, but the basic setup is clear and complete. The details of the protocols followed by operators could certainly be clarified and refined, but the simple protocol "figure it out as you go" wouldn't be existentially unsafe.

6. No TODOs when defining any programs that are involved, except how to modify them to be tractable:

The paper has those details.

That's why I consider this approach a solution-in-theory. That's not to say it doesn't have weaknesses.

Definite weakness:

  • It couldn't directly participate in the economy

Potential weakness 1:

  • There might not exist tractable approximations that preserve safety

Here is an objection arguing that this potential weakness is real and major:

  1. OpenAI has proven that model-based agents don't work in practice; only policy-gradient-trained agents are viable.

  2. The policy of "pursue the safe, human-aligned objective until the opportunity arises to achieve goal Y" will be promoted by the policy-gradient training just as much as the policy "pursue the safe, human-aligned objective".

  3. Making a tractable agent whose goal only regards the current episode requires solving this hard problem.

My answer to this problem, if Point 1 really does hold, is this. If we must use policy gradients for training a heuristic policy, just make sure to train on synthetic data from a large world model, not just real data. Point 2 no longer applies, so long as "the opportunity arises to achieve goal Y" within the model-generated synthetic data.
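A toy illustration of this answer, with all names and probabilities invented: real logged data may never contain the rare "opportunity" state, so a treacherous policy is never penalized on it, whereas a world model that assigns the state nonzero probability will surface it in synthetic rollouts:

```python
import random

# Toy illustration (hypothetical states and probabilities): the real log
# never contains the rare opportunity state, so policy-gradient training
# on real data alone cannot distinguish a treacherous policy from a safe
# one. Sampling episodes from a world model that assigns the state
# nonzero probability does surface it during training.

OPPORTUNITY = "escape_opportunity"

def synthetic_rollouts(model_states, n, rng):
    # model_states: (state, probability) pairs the world model can emit.
    states, weights = zip(*model_states)
    return [rng.choices(states, weights=weights, k=1)[0] for _ in range(n)]

rng = random.Random(0)
real_data = ["normal"] * 10_000  # the opportunity never occurred for real
model = [("normal", 0.99), (OPPORTUNITY, 0.01)]
synthetic = synthetic_rollouts(model, 10_000, rng)
```

With the opportunity present in the synthetic training distribution, a policy that defects when the opportunity arises earns lower within-episode reward there and gets selected against, which is the sense in which Point 2 no longer applies.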

Potential weakness 2:

  • The artificial agent might not always generate useful solutions to problems or correct answers to questions, if the operator can't recognize high quality. The agent would optimize for compellingness over correctness.

I maintain that the protocol "figure it out as you go" is the operators' best defense here for making compellingness align with correctness. But the paper also discusses a small modification to the original setup that I think is promising. It's called Enlightening AI.

In this variant, the AI algorithm includes not only the agent, but also a predictor that predicts outside-world events conditioned on the agent taking no-op actions. The agent then tries to help prepare the human operator for a randomly sampled prediction contest that will occur at the end of the episode. The agent's goal is to get the operator to make similar predictions to the ones that the artificial predictor would.

If the prediction contest is randomly sampled from a known topic, then the agent couldn't just share the answers with the operator; it would have to teach the operator useful information and provide human-interpretable high-level analysis. Ultimately, its output would be selected for enlightening the operator on the topic in question.
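One way to make that goal concrete, as a hedged sketch: reward the agent by how closely the operator's predictive distribution matches the artificial predictor's. The particular scoring rule below (negative total variation distance) is my illustrative choice, not necessarily the paper's:

```python
# Hypothetical scoring for the Enlightening AI prediction contest: the
# agent's reward is highest (zero) when the operator's probabilistic
# prediction exactly matches the artificial predictor's. The outcome
# names and the choice of distance are illustrative.

def enlightenment_reward(operator_probs, predictor_probs):
    # Negative total variation distance between the two predictive
    # distributions over contest outcomes.
    assert operator_probs.keys() == predictor_probs.keys()
    tvd = 0.5 * sum(abs(operator_probs[o] - predictor_probs[o])
                    for o in operator_probs)
    return -tvd

# A well-taught operator's distribution nearly matches the predictor's:
reward = enlightenment_reward({"rain": 0.7, "dry": 0.3},
                              {"rain": 0.65, "dry": 0.35})
```

Because the reward depends only on the operator's state at the end of the current episode, the myopia property is preserved.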

In this setting, the agent's goal still only regards its current episode, and it still qualifies as a solution-in-theory.


Next in the series: Pessimism.



