Understanding CFAI

From: Smigrodzki, Rafal (SmigrodzkiR@msx.upmc.edu)
Date: Wed Feb 06 2002 - 21:52:12 MST

I just finished reading CFAI and GISAI, and here are some questions and comments.
Of course, SIAI knows of only one current project advanced enough to even
begin implementing the first baby steps toward Friendliness - but where
there is one today, there may be a dozen tomorrow.
### Which other AI projects are on the right track?
That is, conscious reasoning can replace the "damage signal" aspect of pain.
If the AI successfully solves a problem, the AI can choose to increase the
priority or devote additional computational power to whichever subheuristics
or internal cognitive events were most useful in solving the problem,
replacing the positive-feedback aspect of pleasure.
### Do you think that the feedback loops focusing the AI on a particular
problem might (in any sufficiently highly organized AI) give rise to qualia
analogous to our feelings of "intellectual unease" and "rapture of an
The absoluteness of "The end does not justify the means" is the result of
the Bayesian Probability Theorem applied to internal cognitive events.
Given the cognitive event of a human thinking that the end justifies the
means, what is the probability that the end actually does justify the
means? Far, far less than 100%, historically speaking. Even the cognitive
event "I'm a special case for [reason X] and am therefore capable of safely
reasoning that the end justifies the means" is, historically speaking,
often dissociated from external reality. The rate of hits and misses is not
due to the operation of ordinary rationality, but to an evolutionary bias
towards self-overestimation. There's no Bayesian binding between our
subjective experience of feeling justified and the external event of
actually being justified, so our subjective experience cannot license
actions that would be dependent on being actually justified.
### I think I can dimly perceive a certain meaning in the above paragraph,
with which I could agree (especially in hindsight, after reading 3.4.3:
Causal validity semantics). Yet, without recourse to the grand finale, this
paragraph is very cryptic and, under some interpretations, unacceptable to
me.
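The Bayesian argument above can be made concrete with a toy calculation. All probabilities below are invented for illustration, not empirical estimates; the point is only the shape of the inference:

```python
# Toy Bayes' theorem calculation for "the end justifies the means".
# We ask: given that someone FEELS justified, how likely is it that
# they actually ARE justified? All numbers are made up for illustration.

def posterior(prior, p_feel_given_justified, p_feel_given_unjustified):
    """P(actually justified | feels justified), by Bayes' theorem."""
    p_feel = (p_feel_given_justified * prior
              + p_feel_given_unjustified * (1 - prior))
    return p_feel_given_justified * prior / p_feel

# Suppose the end actually justifies the means 5% of the time (prior),
# the genuinely justified almost always feel justified (0.95), and
# self-overestimation makes the unjustified feel justified too (0.70).
p = posterior(0.05, 0.95, 0.70)
print(f"P(justified | feels justified) = {p:.3f}")  # about 0.067
```

With these (invented) numbers, the feeling of justification raises the probability of actual justification only slightly above the prior - which is the "no Bayesian binding" point: the subjective experience is too weakly coupled to the external fact to license action.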
I think that undirected evolution is unsafe, and I can't think of any way to
make it acceptably safe. Directed evolution might be made to work, but it
will still be substantially less safe than self-modification. Directed
evolution will also be extremely unsafe unless pursued with Friendliness in
mind and with a full understanding of non-anthropomorphic minds. Another
academically popular theory is that all people are blank slates, or that all
altruism is a child goal of selfishness - evolutionary psychologists know
better, but some of the social sciences have managed to totally insulate
themselves from the rest of cognitive science, and there are still AI people
who are getting their psychology from the social sciences.
### Is altruism something other than a child goal of selfishness? It was my
impression that evolutionary psychology predicts the emergence of altruism
as a result of natural selection. Since natural selection is not
goal-oriented, the goals defining the base of the goal system (at least if
you use the derivative validity rule) are the goals present in the subjects
of natural selection - selfish organisms, which in the service of their own
survival have to develop (secondarily) altruistic impulses. As the
derivative validity rule is itself the outcome of ethical reasoning, one
could claim that it cannot invalidate the goals from which it is derived,
thus sparing us a total meltdown of the goal system and shutdown.
The semantics of objectivity are also ubiquitous because they fit very well
into the way our brain processes statements; statements about morality
(containing the word "should") are not evaluated by some separate, isolated
subsystem, but by the same stream of consciousness that does everything else
in the mind. Thus, for example, we cognitively expect the same kind of
coherence and sensibility from morality as we expect from any other fact in
our Universe.
### It is likely that there are specialized cortical areas, mainly in the
frontopolar and ventromedial frontal cortices, involved in the processing of
ethics-related information. Many of us are perfectly capable of double- and
triple-thinking about ethical issues, as your examples of self-deception
testify, while similar feats of mental juggling are not possible in the
arena of mathematics or motorcycle maintenance.
Actually, rationalization does not totally disjoint morality and actions; it
simply gives evolution a greater degree of freedom by loosely decoupling the
two. Every now and then, the gene pool or the memetic environment spits out
a genuine altruist who, from evolution's perspective, may turn out to be a
lost cause. The really interesting point is that evolution is free to load
us with beliefs and adaptations which, if executed in the absence of
rationalization, would turn us into total altruists ninety-nine point nine
percent of the time. Thus, even though our "carnal" desires are almost
entirely observer-centered, and our social desires are about evenly split
between the personal and the altruistic, the adaptations that control our
moral justifications have strong biases toward moral symmetry, fairness,
truth, altruism, working for the public benefit, and so on.
### In my very personal outlook, the "moral justifications" are the results
of advanced information processing applied in the service of "carnal"
desires, supplemented by innate, evolved biases. The initial supergoals are
analyzed, their implications for action under various conditions are
explored, and the usual normative human comes to recognize the superior
effectiveness of fairness, truth, etc., for survival in a social situation.
As a result the initial supergoals are overwritten by new content (at least
to some degree, dictated by the ability to deceive others). As much as the
imprint of my 4-year-old self in my present mind might object, I am forced
to accept the higher Kohlberg stage rules. Do you think that the Friendly AI
will have some analogue of such (higher) levels? Can you hypothesize about
the supergoal content of such a level? Could it be translated back for
unenhanced humans, or would it be accessible only to highly improved minds?
An AI's complete mind-state at any moment in time is the result of a long
causal chain. We have, for this moment, stopped speaking in the language of
desirable and undesirable, or even true and false, and are now speaking
strictly about cause and effect. Sometimes the causes described may be
beliefs existing in cognitive entities, but we are not obliged to treat
these beliefs as beliefs, or consider their truth or falsity; it suffices to
treat them as purely physical events with purely physical consequences.
This is the physicalist perspective, and it's a dangerous place for humans
to be. I don't advise that you stay too long. The way the human mind is set
up to think about morality, just imagining the existence of a physicalist
perspective can have negative emotional effects. I do hope that you'll hold
off on drawing any philosophical conclusions until the end of this topic at
the very least.
### If I understand this paragraph correctly, my way of thinking about
myself has been physicalist for the past 10-15 years, yet it has failed to
produce negative emotional effects. I am an information-processing routine,
with the current self-referential goal of preserving its continued
existence, all in the context of 15×10^9 years of the Universe's evolution.
Even this self-referential supergoal can be explicitly renounced if it
becomes advantageous for the routine's survival (as in joining the Borg to
escape the Klingons).

The rule of derivative validity - "Effects cannot have greater validity than
their causes." - contains a flaw; it has no tail-end recursion. Of course,
so does the rule of derivative causality - "Effects have causes" - and yet,
we're still here; there is Something rather than Nothing. The problem is
more severe for derivative validity, however. At some clearly defined point
after the Big Bang, there are no valid causes (before the rise of
self-replicating chemicals on Earth, say); then, at some clearly defined
point in the future (e.g., the rise of Homo sapiens sapiens), there are
valid causes. At some point, an invalid cause must have had a valid effect.
To some extent you might get around this by saying that, e.g.,
self-replicating chemicals or evolved intelligences are pattern-identical
with (represent)
some Platonic valid cause - a low-entropy cause, so that evolved
intelligences in general are valid causes - but then there would still be
the question of what validates the Platonic cause. And so on. 
The rule of derivative validity is embedded very deeply in the human mind.
It's the ultimate drive behind our search for the Meaning of Life. It's the
reason why we instinctively dislike circular logic. It's a very powerful
shaper(!). Just violating it arbitrarily, to trick the AI into doing
something, or in the belief that it doesn't really matter... well, that
wouldn't be safe (4), because that kind of "selfishness" is designated as an
extraneous cause by quite a few deeper shapers. Of course, I'm omitting the
possibility that the programmer personally believes that kind of logic is
okay (i.e., would use it herself), in which case things would probably come
out okay, though I personally would worry that this programmer, or her
shaper network, had too high a tolerance for circular logic...
### I think it's possible for a human to have both a limited use of
derivative validity and circular (or self-referential) basis for the goal
system. See my comments above.
We want a Meaning of Life that can be explained to a rock, in the same way
that the First Cause (whatever it is) can be explained to Nothingness. We
want what I call an "objective morality" - a set of moral propositions, or
propositions about differential desirabilities, that have the status of
provably factual statements, without derivation from any previously accepted
moral propositions. We want a tail-end recursion to the rule of derivative
validity. Without that, then yes - in the ultimate sense described above,
Friendliness is unstable.
### I do agree with the last sentence. A human's self-Friendliness is
inherently unstable, too.
As General Intelligence and Seed AI describes a seed AI capable of
self-improvement, so Creating Friendly AI describes a Friendly AI capable of
self-correction. A Friendly AI is stabilized, not by objective morality -
though I'll take that if I can get it - but by renormalization, in which the
whole passes judgement on the parts, and on its own causal history.
### This seems to summarize my goal system functioning, with the importance
of personal history, use of a wide range of cognitive tools to derive rules,
and to slowly change the goal system. 
Even if an extraneous cause affects a deep shaper, even deep shapers don't
justify themselves; rather than individual principles justifying themselves
- as would be the case with a generic goal system protecting absolute
supergoals - there's a set of mutually reinforcing deep principles that
resemble cognitive principles more than moral statements, and that are
stable under renormalization. Why "resemble cognitive principles more than
moral statements"? Because the system would distrust a surface-level moral
statement capable of justifying itself! 
### Can you give examples of such deep moral principles?
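The structure being described - principles that do not justify themselves but are stabilized by mutual reinforcement under renormalization - can be loosely sketched as a fixed-point iteration. The support matrix below is entirely invented; it is an analogy for the passage, not anything from CFAI itself:

```python
# Loose numerical analogy for "mutually reinforcing principles stable
# under renormalization": each principle's weight is supported only by
# the OTHER principles (zero diagonal = no self-justification), then
# the whole is renormalized. Iterating reaches a stable fixed point.
# The support values are invented for illustration.
import numpy as np

# support[i, j] = how strongly principle j reinforces principle i
support = np.array([
    [0.0, 0.9, 0.2],
    [0.7, 0.0, 0.5],
    [0.3, 0.4, 0.0],
])

w = np.ones(3) / 3          # start with equal weights
for _ in range(100):
    w = support @ w          # each principle supported by the others
    w /= w.sum()             # renormalization: the whole judges the parts
print("stable weights:", np.round(w, 3))
```

The zero diagonal is the key design choice mirroring the text: no principle gets weight from itself, yet the set as a whole settles into a stable configuration (the dominant eigenvector of the support matrix).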
Humanity is diverse, and there's still some variance even in the panhuman
layer, but it's still possible to conceive of a description of humanity,
and not just of any one individual human, by superposing the sum of all the
variances in the panhuman layer into one description of humanity. Suppose,
for example, that any given human has a preference for X; this preference
can be thought of as a cloud in configuration space. Certain events very
strongly satisfy the metric for X; others satisfy it more weakly; other
events satisfy it not at all. Thus, there's a cloud in configuration space,
with a clearly defined center. If you take something in the panhuman layer
(not the personal layer) and superimpose the clouds of all humanity, you
should end up with a slightly larger cloud that still has a clearly defined
center. Any point that is squarely in the center of the cloud is "grounded
in the panhuman layer of humanity".
### What if the shape of the superposition turns out to be more
complicated, with the center of mass falling outside the regions of maximum
density? In that case, implementing a Friendliness focused on this center
would have outcomes distasteful to all humans, and finding alternative
criteria for Friendliness would be highly nontrivial.
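The worry can be illustrated with a one-dimensional toy model: if the individual preference "clouds" cluster into two opposed camps, their superposition is bimodal, and its center of mass falls in a trough that satisfies almost no one. The Gaussian clouds and their placement are hypothetical:

```python
# Toy 1-D "configuration space": superpose Gaussian preference clouds
# from two equal sub-populations with opposed preferences, and check
# where the center of mass of the superposition lands.
# All shapes and positions are invented for illustration.
import numpy as np

x = np.linspace(-5, 5, 1001)

def cloud(center, width=0.5):
    """A single preference cloud: a Gaussian bump in configuration space."""
    return np.exp(-((x - center) ** 2) / (2 * width ** 2))

# Two camps centered at -2 and +2.
superposition = cloud(-2.0) + cloud(+2.0)

center_of_mass = np.sum(x * superposition) / np.sum(superposition)
density_at_com = superposition[np.argmin(np.abs(x - center_of_mass))]

print(f"center of mass: {center_of_mass:.2f}")  # near 0, between the camps
print(f"density there, relative to peak: "
      f"{density_at_com / superposition.max():.4f}")  # close to zero
```

Here the center of mass sits almost exactly between the two peaks, in a region of near-zero preference density - a point "central" to humanity that nearly no actual human wants.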
### And a few more comments:
I wonder if you read Lem's "Golem XIV"?
Oops, Google says you did read it. Of course.
In a post on Exilist you say that uploading is a post-Singularity
technology. While I intuitively feel that true AI will be built well before
the computing power becomes available for an upload, I would imagine it
should be possible to do uploading without AI. After all, you need just some
improved scanning methods, and with laser tissue machining, quantum dot
antibody labeling and high-res confocal microscopy, as well as the proteome
project, this might be realistic in as little as 10 years (a guess). With a
huge computer but no AI the scanned data would give you a human mind in a
box, amenable to some enhancement.
What do you think about using interactions between a nascent AI and the
upload(s), with reciprocal rounds of enhancement and ethical system
transfer, to develop Friendliness?
And, by the way, I do think that CFAI and GISAI are wonderful intellectual
achievements.

This archive was generated by hypermail 2.1.5 : Fri Nov 01 2002 - 13:37:38 MST