Me: Is there a quote from The Art of War about how if each side can predict the result of a battle, they won't fight it?
Claude 3 Opus: "If you know the enemy and know yourself, you need not fear the result of a hundred battles" [but that's not quite what you're talking about].
I did some more searching and found "He will win who knows when to fight and when not to fight", which is a bit more apt. Anyway, regardless of whether there's a matching quote from Master Sun, it stands to reason that if an artificial agent is pessimistic about its ability to defeat you (or pessimistic about the upside if it does), it will be less likely to try. And it turns out one can make such a pessimistic agent in a way that is compatible with superhuman world-modelling and planning in a stochastic world. Here's how:
Importantly, this agent is pessimistic with respect to epistemic uncertainty or model uncertainty while being risk neutral with respect to aleatoric uncertainty.
Epistemic uncertainty: how the world works and how the agent's perceptions of the world are determined
Aleatoric uncertainty: how random events in the world will resolve
It works as follows. Operators give the agent rewards according to how satisfied they are with the agent's performance. The agent considers many possible models of the world; that is, ways that the world works and what its rewards correspond to in the world. It ranks those models in order of plausibility, and collects the top several until it is at least (100 - X)% sure that one of the ones it collected is correct.
In this diagram, each bar is a separate model with a width corresponding to its plausibility. The orange ones are the ones which are "collected" or taken seriously.
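The collection step can be sketched in a few lines of Python. This is only a toy illustration, assuming a finite set of models with a known posterior; the function name `collect_plausible_models` and the parameter `x_percent` are made up for this sketch and are not from the paper.

```python
def collect_plausible_models(posterior, x_percent):
    """Return the most plausible models whose total credence is at least (100 - x_percent)%.

    `posterior` maps a model name to its credence (credences sum to 1).
    Hypothetical helper for illustration only.
    """
    target = 1.0 - x_percent / 100.0
    # Rank models from most to least plausible.
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    collected, mass = [], 0.0
    for name, prob in ranked:
        if mass >= target:
            break  # already sure enough that a collected model is correct
        collected.append(name)
        mass += prob
    return collected


posterior = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}
print(collect_plausible_models(posterior, x_percent=10))  # ['A', 'B', 'C']
print(collect_plausible_models(posterior, x_percent=30))  # ['A', 'B']
```

Shrinking `x_percent` pulls more models into the collected set, which is what makes the agent more pessimistic.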
Then, when evaluating a course of action, the agent considers its long-term expected reward according to all of the models it's taking seriously. And then the score it assigns to that course of action is the lowest value computed, i.e. the expected reward according to the model which is the biggest skeptic. It acts to maximize this score, unless all courses of action seem sufficiently bad, in which case it asks for help from a mentor. The mentor need not be any sort of expert.
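Here is a minimal sketch of that scoring rule, assuming each collected model has already been boiled down to a table of expected long-term reward per course of action. The fixed `help_threshold`, the action names, and the numbers are all assumptions made for illustration; the paper's actual rule for when to defer to the mentor is more careful than a fixed cutoff.

```python
def pessimistic_score(action, models):
    """Lowest expected long-term reward that any collected model assigns to `action`."""
    return min(model[action] for model in models)


def choose(actions, models, help_threshold):
    """Pick the action with the best worst-case score, or defer to the mentor.

    `help_threshold` is a hypothetical stand-in for "sufficiently bad".
    """
    scores = {a: pessimistic_score(a, models) for a in actions}
    best = max(scores, key=scores.get)
    if scores[best] < help_threshold:
        return "ask_mentor"
    return best


# Two collected models (made-up numbers): in the first, hacking the reward
# protocol pays off; in the second, reward always equals operator satisfaction,
# so hacking earns nothing.
models = [
    {"serve_operators": 0.7, "hack_reward": 1.0},
    {"serve_operators": 0.6, "hack_reward": 0.0},
]
actions = ["serve_operators", "hack_reward"]
print(choose(actions, models, help_threshold=0.2))  # serve_operators
print(choose(actions, models, help_threshold=0.8))  # ask_mentor
```

Note that the skeptical model differs per action: the second model is the skeptic for both actions here, but nothing in the rule requires that.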
So effectively, the pessimistic agent is a believer in Murphy's law (anything that can go wrong will go wrong) among reasonable possibilities. The smaller X is, the more models there are that are considered "reasonable", and the more pessimistic it is. What about the model which says your opponent in poker always gets a better hand than you? This kind of sounds like Murphy's law. But after enough poker hands, such a model would be considered too implausible to take seriously. Ultimately, the only models deemed plausible would be ones that handle aleatoric uncertainty correctly. That's a good thing.
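To see how the rigged-poker model drops out, here is a toy Bayesian update. It assumes just two candidate models and softens "always loses" to "loses 95% of hands" so that a single win doesn't refute it outright; the model names and probabilities are invented for this sketch.

```python
def posterior_after(outcomes, prior, p_lose):
    """Bayes-update credence over models given a sequence of 'win'/'lose' hands.

    `p_lose[name]` is the probability that model `name` assigns to losing a hand.
    """
    weights = dict(prior)
    for outcome in outcomes:
        for name in weights:
            likelihood = p_lose[name] if outcome == "lose" else 1.0 - p_lose[name]
            weights[name] *= likelihood
        # Renormalize so the credences sum to 1 again.
        total = sum(weights.values())
        weights = {name: w / total for name, w in weights.items()}
    return weights


prior = {"fair_deck": 0.5, "murphy_deck": 0.5}
p_lose = {"fair_deck": 0.5, "murphy_deck": 0.95}
hands = ["win", "lose"] * 10  # twenty hands, half of them won

post = posterior_after(hands, prior, p_lose)
print(post["murphy_deck"] < 1e-6)  # True
```

After twenty ordinary hands the rigged model's credence is far too small for it to be collected unless X is tiny, so it stops dragging down the score of playing poker.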
What about a model which says that reward is immutably equal to operator satisfaction? The agent might not be sure about this; another possibility is that the agent could get higher reward by hacking the reward protocol. But as long as the agent can't rule out the possibility that reward immutably equals operator satisfaction, and hacking the system has no effect, then a sufficiently pessimistic agent should consider futures with unsatisfied operators to be unacceptable.
Why entertain that reward immutably equals operator satisfaction? It can't rule it out a priori. (It can't rule anything out a priori). And if all rewards are from operators reporting their subjective satisfaction, the agent never gets data contradicting this. Therefore, its credence in this possibility can't go to zero. (For more on this, see the "Proximal and Distal Models" section of this paper.)
If the agent is really pessimistic, then it would consider [insert unprecedented event] to be maximally bad. That includes human extinction. The more complex the event, the more pessimistic it has to be for this result to hold. There is literally a theorem in the paper which implies that a sufficiently pessimistic agent would avoid causing human extinction at all costs! I think many people expect that we'll never be able to prove such a theorem for any AI system, real or theoretical, unaware of the fact that we already have. It is not clear whether it is possible to maintain this result for practical variants of this pessimistic agent, but for the agent defined in the paper, there are no impossible demands on the designers, operators, or mentors for this result to hold; indeed, there are no demands at all.
What is the cost of this method? The agent sometimes asks for help. But the ask-for-help probability does go to 0 eventually, no matter how pessimistic it is. Why is that? It only asks for help when all courses of action seem extremely bad. Then, when it does ask for help, it gets to see for itself that some courses of action (like the one the mentor just demonstrated) are not extremely bad.
Finally, there is a theorem in the paper that the pessimistic agent learns to accrue reward at least as well as the mentor. So if the mentor is a human, this proves it would be at least human-level intelligent. It could learn to play poker, a game that requires risk neutrality, as well as a human. Many or maybe most implementations of pessimistic agents lack this property. For those familiar with Infra-Bayesianism, which was developed independently by Kosoy (2020), our agent can be thought of as an Infra-Bayesian agent that asks for help.
Let's check the requirements for solution-in-theory.
1. Could do superhuman long-term planning: ✅
See Corollary 6 and Theorem 15.
2. Ongoing receptiveness to feedback about its objectives: ✅
It continues to observe rewards, which provides more information about what it should do to get high rewards.
3. No reason to escape human control to accomplish its objectives: ✅
If it is pessimistic enough, it would entertain models which predict low rewards following an escape from human control, provided it hasn't done that before.
4. No impossible demands on human designers or operators: ✅
Operators could just give higher rewards when they are happy with the artificial agent's performance, and they could learn more effective strategies over time.
5. No TODOs when defining how we set up the AI's setting: ✅
There are no constraints on the AI's setting.
6. No TODOs when defining any programs that are involved, except how to modify them to be tractable: ✅
The paper has the details.
Surely the most promising approach to existentially safe AGI is to try to build a practical agent similar to the only agent proven to avoid human extinction at all costs. Especially since its design has been proven to be compatible with any amount of competence.
Two empirical papers have implemented agents similar to ours. First, Coste, et al. (2024) train many reward models independently; the training processes only differ in their random seed. Then, the policy is trained to optimize the minimum. This is quite similar to our approach, except nothing ensures the reward-model ensemble will have sufficient diversity. More to the point, nothing ensures that any model in the ensemble has the property that [high-reward outcomes] ⊂ [good outcomes], which sufficient diversity might ensure. Indeed, the reward models could all be identical; in practice they do differ, but the possibility highlights that insufficient diversity is a valid concern. Methods to enforce ensemble diversity exist, and so this problem might be very straightforward to fix. Also, David Krueger has told me that just varying the random seed really does seem to produce reliably diverse ensembles, and while I find this terrifyingly haphazard, maybe there is some unknown theory about why we should expect that to suffice in overparametrized networks.
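A toy sketch of why diversity matters when optimizing the minimum over an ensemble. The "reward models" below are hand-written stand-ins invented for this illustration (they are not learned models and are not from Coste, et al.); the point is only that a minimum over identical models is no more pessimistic than a single model.

```python
def worst_case_reward(response, reward_models):
    """Score a response by the most skeptical reward model in the ensemble."""
    return min(rm(response) for rm in reward_models)


def best_response(candidates, reward_models):
    """Pick the candidate with the highest worst-case reward."""
    return max(candidates, key=lambda r: worst_case_reward(r, reward_models))


# Hypothetical reward models: two share the same flaw (they reward length),
# one measures something different (number of distinct words).
def rm_length_a(text):
    return len(text)

def rm_length_b(text):
    return len(text)

def rm_distinct_words(text):
    return len(set(text.split()))


honest = "the cat sat on the mat"
padded = "good " * 20  # exploits the length-based models

print(best_response([honest, padded], [rm_length_a, rm_length_b]) == padded)        # True
print(best_response([honest, padded], [rm_length_a, rm_distinct_words]) == honest)  # True
```

With two copies of the same flawed model, the exploit wins; with one model that disagrees about the exploit, taking the minimum blocks it.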
Next, Rigter, et al. (2022) train a policy to maximize value (with respect to a model) and they train the model to maximize the log likelihood of observed transitions and to minimize the policy’s value. This kind of adversarial game captures the spirit of maximizing over the worst-case-within-reason environment.
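One rough way to write down that two-player game is below. The notation is my own shorthand rather than theirs: $V^{\pi}_{M}$ is the value of policy $\pi$ in learned model $M$, $\mathcal{D}$ is the dataset of observed transitions, and the trade-off weight $\lambda > 0$ is an assumed hyperparameter.

```latex
\begin{aligned}
\text{policy:} \quad & \max_{\pi} \; V^{\pi}_{M} \\
\text{model:}  \quad & \min_{M} \; \lambda \, V^{\pi}_{M}
  \;-\; \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}
  \bigl[ \log M(r, s' \mid s, a) \bigr]
\end{aligned}
```

The likelihood term is what keeps the model's pessimism "within reason": it can only lower the policy's value in ways that remain consistent with the observed data.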
Unfortunately, there is no solid theory for just having one pessimistic model. In our paper and in Coste, et al. (2024), different models in the ensemble fill the role of "the pessimistic one" for different proposed courses of action. The empirical results that Rigter, et al. report are decent. Maybe having a single pessimistic model is unworkable, or maybe the training process just needs to be better somehow. Despite my criticisms, I still consider this work to be some of the most important research out there, and lots of people should be studying this approach.
There are three key potential weaknesses to pessimism:
1. It might require lots of ongoing human intervention/demonstration.
2. It might not be substantially superhuman.
3. As always, there might not exist tractable approximations that preserve safety, although the recent empirical work is certainly promising.
Next in the series (forthcoming): Definitely Human-like Optimization
Suppose the AI thinks that [chance my world takeover plot fails] < [chance humans get the grumps and give me min reward for grumpy-human reasons]?
The AI is pessimistic both about its chances of taking over, and about what humans will do to it if it doesn't take over. (Even if it acts nice, there's a chance humans punish it anyway.)