Re: Anticipatory backfire

From: Eliezer S. Yudkowsky
Date: Thu Nov 08 2001 - 11:48:39 MST

Mitchell Porter wrote:
> The AI example: this time one wants to be able to defend against
> the full range of possible hostile minds. In this case, making a
> simulation is making the thing itself, so if you must do so
> (rather than relying on theory to tell you, a priori, about a
> particular possible mind), it's important that it's trapped high
> in a tower of nested virtual worlds, rather than running at
> the physical 'ground level'. But as above, once the code for such
> an entity exists, it can in principle be implemented at ground
> level, which would give it freedom to act in the real world.

I don't think this is the right protocol. First, do we really need to
investigate the full range of possible hostile minds? I agree that if we
can get the knowledge without taking any risks, it would be nice to know.
I don't see a good way to do that. And it's not really knowledge we need
if we can build a Friendly seed AI, do the Singularity, et cetera - if
that knowledge is still needed, it's knowledge that can be obtained
post-Singularity by a superintelligence. We don't necessarily need to
know about it pre-Singularity. There's also the question of whether we
could realistically investigate it at all, pre-Singularity - I think that
pre-Singularity you just see infrahuman AIs, and investigating a hostile
infrahuman AI doesn't necessarily give you knowledge that's useful for
defending against ~human or >human AI.

For investigating failure of Friendliness - this, you probably *do* want
to watch in the laboratory - the ideal tool, if at all possible, is a
Friendly AI pretending to be unFriendly, and taking only those mental
actions that are consonant with the goal of pretending to be unFriendly
but not in conflict with actual Friendliness. You want to investigate
unFriendliness in complete realism without ever actually creating an
unFriendly entity; you also want to prevent the "shadow" unFriendly entity
from initiating real-world unFriendly actions, or unFriendly actions
internal to the AI mind doing the shadowing. In particular, you want to
investigate unFriendliness or failure of Friendliness without letting the
shadowed mind ever once form the thought "How do I break out of this
enclosing Friendly mind?", and the way you do this is by making sure that
a Friendly mind is carrying out all the actual thoughts of the shadowed
mind, carrying out actions which are consistent with pretending to be
unFriendly for purposes of investigating Friendliness, but not those
actions which are only consistent with actual unFriendliness. The
shadowed mind might think "I need a new goal system", but the real mind
would think "I need to pretend to build a new goal system". And so on.

I think this would take a pretty sophisticated seed AI, possibly too
sophisticated to be seen very long before the Singularity. If so, then
the prehuman AIs being tested probably aren't sophisticated enough to do
anything horrible to nearby humans. Even so, you'd want to take a few
obvious precautions, such as running all the tests in a sealed, separate
lab, totally erasing the hardware after each run, never reusing the
hardware for anything except future UFAI tests (crush and melt rather than
resell), no cellphones in the building, et cetera... but that's just being
sensible. A prehuman seed AI is lucky if it can win a game of chess
against you - it's not going to wipe out humanity, even if you're setting
up an observable failure of Friendliness under laboratory conditions.
Setting up such a failure should not be as trivial as it sounds, if other FAI issues have been
handled properly - ideally, it should require an enormous amount of
effort, and the cooperation of the AI, to set up the conditions under
which anything can break even temporarily.

For an infrahuman seed AI, the only reason you'd have to worry would be if
somehow the laboratory-simulated failure of Friendliness contributed to
some kind of unexpected cognitive breakthrough. And running the
disposable experimental version of the AI on a little less hardware, with
some of the knowledge of self-improvement removed, should easily be enough
to prevent what is actually a pretty implausible scenario in the first
place. Disabling Friendliness shouldn't lead to improved intelligence
unless something is very drastically and seriously wrong with Friendly AI.

I agree with Ken Clements that running an actual hostile transhuman, even
inside any number of towers of simulation, is too dangerous to ever be
considered. A hostile transhuman AI isn't an enemy computer program, it's
an enemy mind, and fighting an enemy mind isn't the same as fighting a
computer program. You can't carry out the actions that would be used to
control a hostile computer program and expect them to work on a hostile
transhuman AI; that would be a Hofstadterian confusion of levels, like baking
a Recipe Cake. And besides, even an *absolutely* secure Java
implementation doesn't help if the humans and the rest of the universe are
full of security holes.

-- -- -- -- --
Eliezer S. Yudkowsky
Research Fellow, Singularity Institute for Artificial Intelligence
