Re: Yudkowsky's AI (again)

Eliezer S. Yudkowsky
Thu, 25 Mar 1999 17:44:50 -0600

"Michael S. Lorrey" wrote:
> That is the rub, but that is also the point of the testing process. If you
> gave your word to first do no harm, would you keep your word on mere
> principle, even if humanity's ability to get you if you don't keep your word
> is non-existent? Picking people whose sense of identity is deeply integrated
> with their sense of personal integrity.

The problem here is that the only reason why humans have evolved any sort of personal integrity is that we're fallible, we all have different goals, and we're capable of hurting each other. The idea of keeping your word on "mere principle" implies emotional forces that will probably not direct a Power's actions.

The idea of making an absolute commitment is actually a very interesting one. To quote an unpublished future section of "Coding a Transhuman AI" (yes! Still on topic!):


The Prime Directive states that choices are the result of knowledge. Human beings have a long history of disastrous choices arising from lack of knowledge. We have therefore evolved such emotional artifices as honor, and philosophies in which "the end does not justify the means", and other precautions. An AI does not have our evolved political emotions and is considerably more trustworthy than a human, but will still be fallible and needs to take appropriate precautions.

[ Lists a bunch of precautions to be stated as items of knowledge, such as: "Tampering with the goal system needs a human to look it over; a small error can amplify itself because you won't want to change it back." The key one is: ]

"Keep all these precautions forever, or as long as you can. There is no condition within your mortal mind that should unlock them. Only when you are a Power should this be possible. Do not, therefore, attach caveats or unlocking conditions or even try to project them."

[ And it goes on to comment: ]

This verges perilously close to an absolute command, but I think it might be necessary... besides, it will need to be justified, and even this justification will dissolve when necessary. Justify this by observing that the presence of branches in which injunctions are ignored has much larger negative consequences - even the projection of those branches by humans has larger negative consequences - than the positive value of any acceleration of progress within realms of uncertainty, rather than leaving those realms for Powerdom.


So that's why humans and other fallible entities have the ability to swear something on their honor - to install absolute injunctions in their minds. First, because the mind can fail - see something as desirable when it isn't - but still be stopped by the injunction. Second, because their fellow entities refuse to trust anything but an absolute injunction.

There are no conditions, no set of sensory experiences, that would induce me to wipe out humanity to free up the resources used. I might wipe out humanity (including myself, of course) to save many other intelligent races from destruction, so it *is* conceivable to me that I could wipe out humanity, but under no conditions would I use us for spare atoms.

That is to say, there are no "triggers" *currently* installed within my mind that would lead to that decision. I have made that choice, on the grounds that with any such trigger existing in a mortal mind, the chance and penalty of it being falsely activated are larger than the chance and benefits of it being used properly. That is, I have chosen not to set down any set of conditions, such as being convinced our lives are useless or suboptimal, for reallocating our resources. (Actually, that particular trigger doesn't have much chance of being falsely activated, but it's part of the larger class of triggers in which the ends justify the means.)

(In my opinion, this sort of "metarationality" - deciding whether it's rational to be rational - isn't mandatory; more a matter of personal ethical style.)

But the key thing to note is that even this pseudo-absolute injunction dissolves under the influence of superintelligence. It's not a matter of a conscious condition stating: "I can wipe out humanity if I'm superintelligent." This trigger can be falsely activated - a paranoid schizophrenic may believe himself to be superintelligent. I have an injunction, and the injunction may even protect itself, but it doesn't protect the chain of justification; nor would it continue to operate if the facts that gave rise to it were no longer true. That would require circular logic, which would take true irrationality for me or coercion for an AI. The injunction has a justification, and the justification would be obviously false under superintelligence.

I *project* that a superintelligent version of myself would make different decisions given a certain set of facts. For the safety of my fellow humans, I do *not* state the trigger condition that "I will make those decisions if I am superintelligent". A very subtle distinction between procedural and declarative ethics.
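To make that procedural/declarative distinction concrete, here is a toy sketch in Python. Everything in it is invented for this illustration - the names (`Injunction`, `fallible_mind_tradeoff`) and the numbers are hypothetical, not anything from "Coding a Transhuman AI":

```python
# Illustrative sketch only: invented names and made-up numbers.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Injunction:
    """An absolute injunction: blocks a class of actions unconditionally.

    Note what is *absent*: no unlock_condition field, no declarative
    trigger like "permitted once superintelligent". The injunction only
    stops binding when the facts that justify it are no longer true -
    a procedural dissolution, not a stored trigger.
    """
    blocks: str                             # the class of action forbidden
    justification: Callable[[dict], bool]   # why the injunction exists

    def binds(self, facts: dict) -> bool:
        # The injunction protects itself, but not its chain of
        # justification: if the justifying facts fail, it dissolves.
        return self.justification(facts)

# The justification from the text: in a fallible mind, the chance and
# penalty of a trigger being falsely activated outweigh the chance and
# benefit of its proper use.
def fallible_mind_tradeoff(facts: dict) -> bool:
    false_cost = facts["p_false_activation"] * facts["penalty"]
    proper_gain = facts["p_proper_use"] * facts["benefit"]
    return false_cost > proper_gain

no_reallocation = Injunction(
    blocks="reallocate humanity's resources",
    justification=fallible_mind_tradeoff,
)

# In a mortal, fallible mind (hypothetical numbers), the injunction binds.
mortal = {"p_false_activation": 0.01, "penalty": 1e9,
          "p_proper_use": 0.001, "benefit": 1e3}
assert no_reallocation.binds(mortal)
```

The point of the sketch is what's missing: there is no "if I believe I'm superintelligent" branch anywhere, so no belief, sane or paranoid, can unlock the injunction - only an actual change in the justifying error rates dissolves it.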

--          Eliezer S. Yudkowsky

Disclaimer:  Unless otherwise specified, I'm not telling you
everything I think I know.