From: Eliezer S. Yudkowsky (email@example.com)
Date: Thu Feb 07 2002 - 02:22:44 MST
"Smigrodzki, Rafal" wrote:
> I just finished reading CFAI and GISAI and here are some questions and comments.
I nominate Rafal's post for "Post of the Month".
> > Of course, SIAI <../GISAI/meta/glossary.html> knows of only one current
> > project advanced enough to even begin implementing the first baby steps
> > toward Friendliness - but where there is one today, there may be a dozen
> > tomorrow.
> ### Which other AI projects are on the right track?
I didn't say that I *ever* knew of any AI project on the "right track".
My phrasing was "advanced enough to even begin". And now that Webmind has
gone down, there aren't any AI projects left even in that category - that
I know about, anyway. Peter Voss has earned my respect, but he hasn't
said enough (that I know of) about the proposed architecture of his AI
project for me to judge whether it would be capable of representing a
Friendly goal system.
> > That is, conscious reasoning can replace the "damage signal" aspect of
> > pain. If the AI successfully solves a problem, the AI can choose to
> > increase the priority or devote additional computational power to
> > whichever subheuristics or internal cognitive events were most useful in
> > solving the problem, replacing the positive-feedback aspect of pleasure.
> ### Do you think that the feedback loops focusing the AI on a particular
> problem might (in any sufficiently highly organized AI) give rise to
> qualia analogous to our feelings of "intellectual unease", and "rapture
> of an insight"?
I do not pretend to understand qualia. But what focuses the AI on a
particular problem is not a low-level feedback loop, but a deliberately
implemented feedback loop. The AI controls the feedback. The feedback
doesn't control the AI. Unless an FAI deems it necessary to shift to the
human pleasure-pain architecture to stay Friendly, I can't see vis mental
state ever becoming that closely analogous to human emotions.
> > The absoluteness of "The end does not justify the means" is the result
> > of the Bayesian Probability Theorem <../../GISAI/meta/glossary.html>
> > applied to internal cognitive events. Given the cognitive event of a
> > human thinking that the end justifies the means, what is the probability
> > that the end actually does justify the means? Far, far less than 100%,
> > historically speaking. Even the cognitive event "I'm a special case for
> > [reason X] and am therefore capable of safely reasoning that the end
> > justifies the means" is, historically speaking, often dissociated from
> > external reality. The rate of hits and misses is not due to the
> > operation of ordinary rationality, but to an evolutionary bias towards
> > self-overestimation. There's no Bayesian binding
> > <../../GISAI/meta/glossary.html> between our subjective experience of
> > feeling justified and the external event of actually being justified, so
> > our subjective experience cannot license actions that would be dependent
> > on being actually justified.
> ### I think I can dimly perceive a certain meaning in the above
> paragraph, with which I could agree (especially in hindsight after
> reading 3.4.3: Causal validity semantics <design/structure/causal.html>
> ). Yet, without recourse to the grand finale, this paragraph is very
> cryptic and in some interpretations for me unacceptable.
My apologies. Just because CFAI is 910K long doesn't mean that it wasn't
written in a tearing hurry.
For the record, what I was trying to talk about was the internal use of
the Bayesian theorem for meta-rationality - deciding to what extent the
thought "X" implies the real-world state X. Any imperfect organism can
use meta-rationality to correct for internal errors - not just evolved
biases, but also things like insufficient cognitive resources.
However, Bayesian reflectivity is most important for (a) correcting human
biases, (b) understanding the ethical heuristics that humans use to
correct those biases that they view as invalid, (c) understanding the
social heuristics that humans use to spot uncorrected biases in others,
and (d) making principled statements about whether such social heuristics
should be generalized to AIs.
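The meta-rationality calculation here is just Bayes' theorem applied to the thought itself as evidence. A toy sketch, with all numbers invented purely for illustration:

```python
# P(actually justified | feels justified), via Bayes' theorem.
# All three inputs are hypothetical illustrative numbers; the point
# is only that a self-serving false-positive rate drags the
# posterior far below certainty.
p_justified = 0.05            # P(J): the end actually justifies the means
p_feel_given_j = 0.9          # P(F|J): feeling justified when it really is
p_feel_given_not_j = 0.3      # P(F|~J): feeling justified when it is not

p_feel = (p_feel_given_j * p_justified
          + p_feel_given_not_j * (1 - p_justified))
posterior = p_feel_given_j * p_justified / p_feel
print(round(posterior, 3))    # → 0.136
```

Even with a generous 90% hit rate, the subjective feeling of being justified licenses only a weak update toward actually being justified.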
> > I think that undirected evolution is unsafe, and I can't think of any
> > way to make it acceptably safe. Directed evolution might be made to
> > work, but it will still be substantially less safe than
> > self-modification. Directed evolution will also be extremely unsafe
> > unless pursued with Friendliness in mind and with a full understanding
> > of non-anthropomorphic minds. Another academically popular theory is
> > that all people are blank slates, or that all altruism is a child goal
> > of selfishness - evolutionary psychologists know better, but some of the
> > social sciences have managed to totally insulate themselves from the
> > rest of cognitive science, and there are still AI people who are getting
> > their psychology from the social sciences.
> ### Is altruism something other than a child goal of selfishness?
Within a given human, altruism is an adaptation, not a subgoal. This is
in the strict sense used in CFAI, i.e. Tooby and Cosmides's "Individual
organisms are best thought of as adaptation-executers rather than as
fitness-maximizers."
> It was
> my impression that evolutionary psychology predicts the emergence of
> altruism as a result of natural selection.
Yes. (Strictly speaking, EP explains known altruism in terms of natural
selection, and makes predictions about further details, but "altruism" was
a known phenomenon before EP or even Darwin, and shouldn't count as a
prediction.)
> Since natural selection is
> not goal-oriented, the goals defining the base of the goal system (at
> least if you use the derivative validity rule), are the goals present
> in the subjects of natural selection - selfish organisms, which in the
> service of their own survival have to develop (secondarily) altruistic
> impulses. As the derivative validity rule is itself the outcome of
> ethical reasoning, one could claim that it cannot invalidate the goals
> from which it is derived, thus sparing us a total meltdown of the goal
> system and shutdown.
If the rule of derivative validity can't be tail-ended through an
objective morality, then the point at which we will be forced to make a
controlled exception (for ourselves, and for our AIs) will lie somewhere
around the evolution of altruism.
The derivative validity instinct becomes complex when partially thwarted.
Where we choose to place our reluctant exceptions will then likely be
based on the degree to which evolution is seen as playing puppeteer or
conduit. For example, in the case of when a human feels that he or she is
"entitled" to something, evolution is clearly playing puppet master. In
the case of mathematics, evolution is clearly playing unbiased conduit.
The question then becomes whether the rule of derivative validity, or the
semantics of objectivity, are more like self-righteousness or
mathematics. I would say they are more like mathematics.
> > The semantics of objectivity are also ubiquitous because they fit very
> > well into the way our brain processes statements; statements about
> > morality (containing the word "should") are not evaluated by some
> > separate, isolated subsystem, but by the same stream of consciousness
> > that does everything else in the mind. Thus, for example, we cognitively
> > expect the same kind of coherence and sensibility from morality as we
> > expect from any other fact in our Universe
> ### It is likely that there are specialized cortical areas, mainly in
> the frontopolar and ventromedial frontal cortices, involved in the
> processing of ethics-related information.
I agree. But emotions modulate the stream of consciousness; they don't
reimplement the stream of consciousness under different rules. There is
not one hippocampus used by moral thinking and a separate hippocampus used
by motorcycle maintenance. To have two completely different systems, you
would need two completely different brains.
> Many of us are perfectly
> capable of double- and triple-thinking about ethical issues, as your
> examples of self-deception testify, while similar feats of mental
> juggling are not possible in the arena of mathematics or motorcycle
> maintenance.
Possible, but not likely - possible if mathematics or motorcycle
maintenance becomes a political issue. Our emotions are capable of
influencing any mental "cause" that has a moral "effect". To the extent
that motorcycle maintenance is rarely biased, it is because motorcycle
maintenance rarely becomes a moral/social/political issue. Let two people
get in an argument about motorcycle maintenance and each may begin to
tolerate flawed arguments that happen to argue in favor of their own
positions, or reject correct arguments that argue against their positions,
where in private they would have evenly weighed both sides of the issue.
> > Actually, rationalization does not totally disjoint morality and
> > actions; it simply gives evolution a greater degree of freedom by
> > loosely decoupling the two. Every now and then, the gene pool or the
> > memetic environment spits out a genuine altruist; who, from evolution's
> > perspective, may turn out to be a lost cause. The really interesting
> > point is that evolution is free to load us with beliefs and adaptations
> > which, if executed in the absence of rationalization, would turn us into
> > total altruists ninety-nine point nine percent of the time. Thus, even
> > though our "carnal" desires are almost entirely observer-centered, and
> > our social desires are about evenly split between the personal and the
> > altruistic, the adaptations that control our moral justifications have
> > strong biases toward moral symmetry, fairness, truth, altruism, working
> > for the public benefit, and so on.
> ### In my very personal outlook, the "moral justifications" are the
> results of advanced information processing applied in the service of
> "carnal" desires, supplemented by innate, evolved biases.
By computation in the service of "carnal" desires, do you mean computation
in the service of evolution's goals, or computation that has been skewed
by rationalization effects toward outcomes that the thinker finds
attractive? In either case the effective parent goals are not limited to
"carnal" desires.
> The initial
> supergoals are analyzed, their implications for action under various
> conditions are explored, and the usual normative human comes to
> recognize the superior effectiveness of fairness, truth, etc., for
> survival in a social situation.
I think this is a common misconception from the "Age of Game Theory" in
EP. (By the "Age of Game Theory" I mean the age when a game-theoretical
explanation was thought to be the final step of an analysis; we still use
game theory today, of course.) Only a modern-day human, armed with
declarative knowledge about Axelrod and Hamilton's results for the
iterated Prisoner's Dilemma, would employ altruism as a strict subgoal.
And even then the results would be suboptimal because people instinctively
mistrust agents who employ altruism as a subgoal rather than "for its own
sake"... but that's a separate issue. A human in an ancestral environment
may come to see virtue rewarded and wickedness punished, or more likely,
witness the selective reporting of virtuous rewards and wicked follies.
However, this memetic effect only reinforces an innate altruism instinct.
It does not construct a cultural altruism strategy from scratch.
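For concreteness, here is a minimal sketch of the Axelrod-style iterated Prisoner's Dilemma setting referenced above. The payoff values are the standard textbook (T,R,P,S) = (5,3,1,0); the strategy code is illustrative, not anything from CFAI:

```python
# Iterated Prisoner's Dilemma with standard payoffs.
# Each entry maps (my_move, their_move) -> (my_payoff, their_payoff).
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def tit_for_tat(opponent_history):
    # Cooperate on the first round, then echo the opponent's last move.
    return 'C' if not opponent_history else opponent_history[-1]

def always_defect(opponent_history):
    return 'D'

def play(strat_a, strat_b, rounds=10):
    score_a = score_b = 0
    moves_a, moves_b = [], []        # each strategy sees the *other's* moves
    for _ in range(rounds):
        a, b = strat_a(moves_b), strat_b(moves_a)
        pa, pb = PAYOFF[(a, b)]
        score_a += pa
        score_b += pb
        moves_a.append(a)
        moves_b.append(b)
    return score_a, score_b
```

Two tit-for-tat players settle into mutual cooperation (30 points each over ten rounds), while tit-for-tat against an unconditional defector loses only the first round and then punishes (9 vs. 14). This is the game-theoretic result a modern human could consciously deploy as a strict subgoal; the point above is that evolved altruism is not implemented that way.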
> As a result the initial supergoals are
> overwritten by new content (at least to some degree, dictated by the
> ability to deceive others). As much as the imprint of my 4-year old self
> in my present mind might object, I am forced to accept the higher
> Kohlberg stage rules. Do you think that the Friendly AI will have some
> analogue of such (higher) levels? Can you hypothesize about the
> supergoal content of such level? Could it be translated back for
> unenhanced humans, or would it be only accessible to highly improved
> minds?
I'm not sure I believe in Kohlberg, but that aside: From the perspective
of a human, an FAI would most closely resemble Kohlberg 6, and indeed
could not be anything but Kohlberg 6, because an FAI cannot be influenced
by threat of punishment, threat of disapproval, someone else's opinion, or
society's opinion, except insofar as the FAI decides that these events
represent valid signals about vis target goal content.
> > We want a Meaning of Life that can be explained to a rock, in the same
> > way that the First Cause (whatever it is) can be explained to
> > Nothingness. We want what I call an "objective morality" - a set of
> > moral propositions, or propositions about differential desirabilities,
> > that have the status of provably factual statements, without derivation
> > from any previously accepted moral propositions. We want a tail-end
> > recursion to the rule of derivative validity. Without that, then yes -
> > in the ultimate sense described above, Friendliness is unstable
> ### I do agree with the last sentence. A human's self-Friendliness is
> inherently unstable, too.
Yes, that's the point. The design requirement is that Friendliness should
be at least as stable as the moral structure and altruistic content of any
human. If the human structure disapproves of itself, so will
Friendliness. If the human structure reluctantly decides to keep the
human structure for lack of a better alternative, so should Friendliness.
> > Even if an extraneous cause affects a deep shaper, even deep shapers
> > don't justify themselves; rather than individual principles justifying
> > themselves - as would be the case with a generic goal system protecting
> > absolute supergoals - there's a set of mutually reinforcing deep
> > principles that resemble cognitive principles more than moral
> > statements, and that are stable under renormalization. Why "resemble
> > cognitive principles more than moral statements"? Because the system
> > would distrust a surface-level moral statement capable of justifying
> > itself!
> ### Can you give examples of such deep moral principles?
Heh. Such principles in CFAI are more described than seen, because they
mostly fall under the category of "Friendship content" rather than
"Friendship structure". The deep principles can be discovered by
examination of a human, as long as you start out with the Friendship
structure needed to deduce moral shapers from moral statements.
Two deep principles that do make a cameo in CFAI are the rule of
derivative validity and the semantics of objectivity, both of which, when
applied to themselves, unflinchingly judge themselves to be flawed. The
semantics of objectivity are derived from social selection pressures as
well as objective truths; the rule of derivative validity is ultimately
caused by evolution, even though it is more in the nature of mathematics
than self-righteousness. Both are stable under renormalization as long as
they are the optimum *achievable*, even though they may not be the optimum
*conceivable*. But at any rate they are not self-protecting; there is no
reason why they would resist replacement by something better.
> > Humanity is diverse, and there's still some variance even in the
> > panhuman layer, but it's still possible to conceive of description for
> > humanity and not just any one individual human, by superposing the sum
> > of all the variances in the panhuman layer into one description of
> > humanity. Suppose, for example, that any given human has a preference
> > for X; this preference can be thought of as a cloud in configuration
> > space. Certain events very strongly satisfy the metric for X; others
> > satisfy it more weakly; other events satisfy it not at all. Thus,
> > there's a cloud in configuration space, with a clearly defined center.
> > If you take something in the panhuman layer (not the personal layer) and
> > superimpose the clouds of all humanity, you should end up with a
> > slightly larger cloud that still has a clearly defined center. Any point
> > that is squarely in the center of the cloud is "grounded in the panhuman
> > layer of humanity".
> ### What if the shape of superposition turns out to be more complicated,
> with the center of mass falling outside the maximum values of the
> superposition? In that case implementing a Friendliness focused on this
> center would have outcomes distasteful to all humans, and finding
> alternative criteria for Friendliness would be highly nontrivial.
Well, what would *you* do in a situation like that?
What would you want a Friendly AI to do?
It seems to me that problems like these are also subject to
renormalization. You would use the other principles to decide what to do
about the local problem with panhuman grounding.
If that's not the answer you had in mind, could you please give a more
specific example of a problem? It's hard to answer questions when things
get this abstract.
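The cloud-superposition picture above can be made concrete with a toy calculation. A minimal sketch, assuming a one-dimensional configuration axis and per-person (center, spread) pairs; all numbers and the spread formula are invented for illustration:

```python
# Toy "cloud in configuration space": each person's preference for X
# is summarized as a (center, spread) pair on one axis. Superposing
# the clouds gives a slightly larger cloud with a well-defined center.
# The four people and their numbers are hypothetical.
import statistics

people = [(0.9, 0.2), (1.1, 0.3), (1.0, 0.25), (0.8, 0.2)]

centers = [c for c, _ in people]
panhuman_center = statistics.mean(centers)

# Superposed spread: typical individual spread plus the between-person
# scatter of centers -- a crude stand-in for convolving the clouds.
panhuman_spread = (statistics.mean(s for _, s in people)
                   + statistics.pstdev(centers))
```

In the well-behaved case the superposed cloud is wider than any individual's but still has one clear center; Rafal's worry is the case where the individual clouds are multimodal enough that this "center" falls in a region no individual cloud actually covers.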
> ### And a few more comments:
> I wonder if you read Lem's "Golem XIV"?
> Oops, Google says you did read it. Of course.
I've read some Lem, but not that one.
> In a post on Exilist you say that uploading is a post-Singularity
> technology.
Yes, and I still hold to this.
> While I intuitively feel that true AI will be built well
> before the computing power becomes available for an upload, I would
> imagine it should be possible to do uploading without AI. After all, you
> need just some improved scanning methods, and with laser tissue
> machining, quantum dot antibody labeling and high-res confocal
> microscopy, as well as the proteome project, this might be realistic in
> as little as 10 years (a guess). With a huge computer but no AI the
> scanned data would give you a human mind in a box, amenable to some
The statement about post-Singularity technology reflects relative rates of
development, not an absolute technological impossibility. In other words,
you might be able to carry out uploading in 30 years (10 is a bit much),
but the precursor technologies are such as to permit the construction of
an AI, using the knowledge of cognitive science gained from snooping on
human neural processes. For that matter, uploading precursor technologies
would allow you to construct a 64-human grid with broadband
brain-to-computer-to-brain links. Uploading does not require SI; it
requires knowledge and technology that could also be used to follow much
shorter paths to strong transhumanity.
> What do you think about using interactions between a nascent AI and the
> upload(s), with reciprocal rounds of enhancement and ethical system
> transfer, to develop Friendliness?
Using a system with a human component should only be necessary if
Friendliness turns out to be substantially more "human" (nonportable) than
I currently expect. I would consider it as, if not a "last" resort, then
at least not the default resort. Hopefully it will be pretty clear that
the AI has "gotten" Friendliness and is now inventing vis own, clearly
excellent ideas about Friendliness, so that the thought of the AI needing
a human to supply some extra "kick" seems unlikely, and in any case we
would trust the AI to notice on vis own if a situation like that arose.
I suppose that during the Singularity someone could volunteer to be
uploaded, copied, and have the copy/original go along with the AI on vis
path to superintelligence, as an advisor, as long as it's clear that this
presents *no* additional threat of takeover by a transcending upload.
Necessary conditions for "no additional threat" are that the AI-born SI
remains smarter, remains in control of the computing system, and does not
accept the upload-born SI's advice uncritically.
> And, by the way, I do think that CFAI and GISAI are wonderful
> intellectual achievements.
Thank you kindly, good sir. No doubt your message will someday appear
under the dictionary entry for "constructive criticism".
-- -- -- -- --
Eliezer S. Yudkowsky http://singinst.org/
Research Fellow, Singularity Institute for Artificial Intelligence