Re: Frontiers of Friendly AI

From: Dan Fabulich (dan@darkforge.cc.yale.edu)
Date: Thu Sep 28 2000 - 11:59:48 MDT


[Non-member submission]

Eliezer S. Yudkowsky wrote:

> Possible problem: On a moment-to-moment basis, the vast
> majority of tasks are not materially affected by the fact that the
> supergoal is Friendliness. The optimal strategy for playing chess
> is not obviously affected by whether the supergoal is Friendliness
> or hostility. Therefore, the system will tend to accumulate learned
> complexity for the subgoals, but will not accumulate complexity for
> the top of the goal chain - Friendliness, and any standard links
> from Friendliness to the necessity for self-enhancement or some
> other standard subgoal, will remain crystalline and brittle. If
> most of the de facto complexity rests with the subgoals, then is it
> likely that future superintelligent versions of the AI's self will
> grant priority to the subgoals?

Yes, it is; this is a feature, not a bug. We WANT it to start
figuring goals out for itself; if the AI reasons that the meaning of
life is X, then, somehow or other, X has got to be able to overwrite
the IGS or whatever else we put up at the top. Friendliness, in and
of itself, isn't a design goal for the SI. Friendliness is a goal we
hope that it keeps in mind as it pursues its TRUE goal: the truth.
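
To make that concrete, here's a rough sketch of the kind of structure I
have in mind; the class names, the numeric "justification" scores, and
the detail of demoting the old supergoal to a subgoal are all things
I'm making up for illustration, not anything from Eliezer's actual
design:

    from dataclasses import dataclass, field

    @dataclass
    class Goal:
        description: str
        justification: float          # how well the AI can justify this goal
        subgoals: list = field(default_factory=list)

    class GoalSystem:
        def __init__(self, interim_supergoal):
            # The interim supergoal (e.g. Friendliness) holds the top slot
            # only provisionally; it's a best guess, not a constant.
            self.supergoal = interim_supergoal

        def consider_replacement(self, candidate):
            # If the AI derives a goal it can justify better than whatever
            # we put at the top, the candidate is allowed to take over.
            if candidate.justification > self.supergoal.justification:
                candidate.subgoals.append(self.supergoal)  # old goal kept as advice
                self.supergoal = candidate
                return True
            return False

    friendliness = Goal("be Friendly", justification=0.6)
    system = GoalSystem(friendliness)
    meaning_of_life = Goal("X, whatever turns out to be true", justification=0.9)
    system.consider_replacement(meaning_of_life)   # X overwrites the top slot

The point is just that the top slot is writable by the AI's own
reasoning; everything below it stays intact.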

> Stage: The earliest, most primitive versions of the seed AI.
>
> Possible problem: There is an error in the goal system, or the
> goal system is incomplete. Either you had to simplify at first
> because the primitive AI couldn't understand all the referents, or
> you (the programmer) changed your mind about something. How do you
> get the AI to let you change the goal system? Obviously, changing
> the goal system is an action that would tend to interfere with
> whatever goals the AI currently possesses.

It's hard to imagine when exactly this would be a problem. If the AI
is THAT primitive, and we regard it as Really Important that we
change our minds in a rather radical way, the option remains to shut
the thing down and start it up again with the new architecture.

Alternatively, you hopefully remembered to build the goal system not
around fixed (sub)goals X, Y and Z, but around (sub)goal A: "try to do
what we ask you to do, even if we change our minds." So, at the
primitive stages, doing what you ask shouldn't be a problem, presuming
the AI is at least complex enough to understand what you're asking for
in the first place.
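
For instance (and this is just a toy illustration; none of these names
come from anybody's real design), the top of the goal system can point
at the programmers' current requests rather than at a frozen list:

    from dataclasses import dataclass, field

    @dataclass
    class Action:
        name: str
        serves: set = field(default_factory=set)   # which requests this action helps

    class ProgrammerChannel:
        # The programmers' current requests; revising them is a normal,
        # expected event, not interference with the goal system.
        def __init__(self):
            self._requests = set()

        def revise(self, new_requests):
            self._requests = set(new_requests)

        def current(self):
            return set(self._requests)

    class PrimitiveAI:
        def __init__(self, channel):
            self.channel = channel

        def choose(self, actions):
            # Score actions by how many *current* requests they serve;
            # there's no frozen X, Y, Z for the AI to defend against us.
            wanted = self.channel.current()
            return max(actions, key=lambda a: len(a.serves & wanted))

    channel = ProgrammerChannel()
    channel.revise({"prove theorems"})
    ai = PrimitiveAI(channel)
    options = [Action("work on proofs", {"prove theorems"}),
               Action("ignore the programmers", set())]
    print(ai.choose(options).name)     # "work on proofs"
    channel.revise({"play chess"})     # we changed our minds
    options.append(Action("study openings", {"play chess"}))
    print(ai.choose(options).name)     # follows the new request

Under that structure, changing the goal system isn't an action that
interferes with the AI's goals; accommodating the change is exactly
what its goal tells it to do.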

However, if the AI has reached a stage where shutting it down is
infeasible, you'll just have to try to explain to it why it should
follow your new goal system. Hopefully, it'll see the light.

> Bonus problem: Suppose that you screw up badly enough that the AI not
> only attempts to preserve the original goals, but also realizes that it (the
> AI) must do so surreptitiously in order to succeed. Can you think of any
> methods that might help identify such a problem?

No. At the primitive stages, we might hope to be smart enough to be
able to look at its inner workings, detect whether it's deceiving us,
and react accordingly, but that becomes infeasible very early on.
Beyond that point, there's no obvious way to identify the problem
besides watching it very closely and asking it questions about what it
does, in much the same way you'd check whether a human had accepted
your goal architecture or not.
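
About the best I can picture is something like this (a toy, obviously;
inspect_goals() and ask() are placeholders for capabilities we'd only
have while the thing is still transparent):

    def audit(ai, questions):
        # Flag internal goals the AI never admits to when asked directly.
        internal = set(ai.inspect_goals())   # only works while it's readable
        admitted = set()
        for q in questions:
            admitted |= set(ai.ask(q))       # everything it claims to be pursuing
        return internal - admitted           # goals it holds but won't mention

And even that only catches goals it's hiding badly; once
inspect_goals() stops meaning anything, you're back to watching and
asking.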

-Dan

       -unless you love someone-
     -nothing else makes any sense-
            e.e. cummings


