[Messy post] against coherent ethics
update (12 march 2022): Here’s a dump of more links I found on stated versus revealed preferences. (As an aside, LessWrong from 10 years ago feels so much smarter than both LW and EA today. Yes, the number of utilitarians present is one reason, but it’s not just that; there’s a bunch of really cool posts I’ve found that are all 10+ years old.)
and a comment of mine:
update (23 feb): This post is like 90% on ethics rather than AI so I’ve changed the title.
If you’re new to my blog please check out some other post first.
tldr update 25 jan: Ethics might be unsolvable; humans may have values and incoherence (in value-pursuit) because of the same neurological modules, such that you can’t have one without the other. I might write this up properly after engaging with more people working in alignment; I'm hesitant to write it up before then.
update (5 feb): crossposting more comments
Thank you for replying!
I don't think "System 1" and "System 2" are real things in a sense that would make "System 2 has no desires" a sensible statement.
Fair enough. I guess I meant our desires have evolved into our neural circuitry as part of System 1. And we can't use thinking alone (System 2 activation) to decide what our goals are; we first need first-hand experiences of pleasure or pain.
I think that humans do something akin to CEV as part of our daily lives
You're right that humans do end up doing this decision + retroactive decision, I just don't think that always leads to one consistent place; different humans can justify different things to themselves (or even the same human at different points in time). There are a bunch of different things that System 1 strongly reacts to (our "core values"), and I don't think our brains naturally have a way of trading them off against each other that doesn't lose pennies. System 2 tries its best to ignore these inconsistencies, but then where it ends up is random, because there's no principled way to decide what to ignore.
We don't often encounter situations where we're forced to make such trades, but we can in theory, as in trolley problems.
How do we justify that?
Just first-hand experience honestly, I don't feel as much empathy for a paperclip maximiser as I do for human suffering. I doubt you can ground ethics in anything other than our experiences. (Like you might find intermediate things to ground ethical theories in, but those things will just further ground themselves in actual experiences.)
That may be what we do care about, but how can we justify that in terms of what we should care about?
I feel like asking "what we should care about" is just our System-2 trying to make sense of the fact that our System-1 cares about a lot of inconsistent things - I'm currently skeptical you can find an objective answer to that question.
I also recently wrote this reply about this, it's where my intuition currently points and I'm super happy to get any feedback (positive or negative).
This is a cross post of 3 comments.
This stuff could be better presented. It could also be obviously wrong or obviously unoriginal to people actually working in the space, I just felt it’s worth posting anyway.
tl;dr why my current intuitions point towards alignment not being solvable in any meaningful sense
This opinion of mine is weak mainly because I see a lot of smart people who believe differently. I’m not sure I have a mechanistic understanding of where exactly these people would clash with my opinion.
I’m hesitating on writing this up further because I want to first interact properly with some alignment theorists to get a better understanding of what they mean by alignment or what utopian world they are imagining they get once they solve alignment.
I read your post - and decided to write up my thoughts anyway. It might be a weird take, but I would really appreciate your opinion on it, if you have the time. I've spent way too much time unsure about the best way to explain it, yet felt the need to explain it in so many places, so it would mean a lot to me if you read it. It also describes why I'm kinda skeptical of AI alignment being solvable.
I just found this from your post.
I’m not going to say “System One” and “System Two”, because that’s cliche, so instead I’ll say “warm giving” and “cool giving” to reflect the fact that giving is driven by a mix of “cool” motivations (an abstract desire to do good, careful calculation of your impact, strategic giving that will make people like you) and “warm” motivations (empathy toward the recipient, personal connections to the charity, a habit of giving a dollar to anyone who asks).
I somehow feel like System 2 has no genuine desires of its own, it simply borrows them from System 1.
System 1 = desires + primitive tools to trade-off different desires
These tools aren't super advanced: they can't do math or formal logic and are mostly heuristics. They often throw up random answers when situations don't cleanly fit into exactly one heuristic, and decisions guided purely by System 1 will lose pennies.
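"Lose pennies" here is the classic money-pump argument: if your pairwise preferences are cyclic, an adversary can charge you a small fee for each "upgrade" and walk you in circles. A toy sketch (the goods, fee, and preference cycle are all invented for illustration):

```python
# Toy money pump: an agent with cyclic (intransitive) pairwise preferences.
# (better, worse) pairs: A > B, B > C, C > A -- a cycle, not an ordering.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}

def will_trade(current: str, offered: str) -> bool:
    """Agent accepts a trade (plus a one-penny fee) iff it prefers the offered good."""
    return (offered, current) in prefers

holding, pennies_lost = "C", 0
for offered in ["B", "A", "C", "B", "A", "C"]:  # adversary cycles the offers
    if will_trade(holding, offered):
        holding = offered
        pennies_lost += 1  # one penny per "upgrade"

print(holding, pennies_lost)  # → C 6: back where it started, six pennies poorer
```

Any agent whose trade-offs between "core values" contain such a cycle is exploitable in exactly this way, which is the sense in which System 1's heuristics "lose pennies".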
System 1 is also super-inflexible - you can't simply choose to rewire all of your System 1, this is beyond the reach of free will. (Maybe neurosurgery can change it.)
System 2 = advanced reasoning tools
System 2 just borrows desires from System 1. You won't do cold-hearted calculation on what saves the most lives unless you've already had System-1 experiences of other people's pain or joy, and System-1 desire to help people.
The problem with System 2 is that no matter how much math or logic it throws at the problem, it can't find a consistent way of trading off different desires that is also consistent with System 1. Why? Because System 1 was never coherent in the first place.
It also can't choose to just ignore System-1 and formulate its all-important theory of ethics because it has no desires (/values/ethics) of its own, nor any objective way to compare them. Ground truth on such matters comes from System-1.
(I see ethics as a subset of desires btw, I don't think we should assume something fundamentally different between the desire to eat sugar and the desire to protect a friend.)
Evolutionarily this could be because System-1 is a bunch of assorted hacks evolved over millions of years in different ways, whereas System-2 is one single evolutionary change between apes and humans, and hence a lot cleaner.
This is also why I'm not convinced CEV exists or that AI alignment is solveable in any meaningful sense.
I'd love to know why I'm wrong or what you feel about this.
Tbvh I'm not sure how to engage with alignment folk, because I feel like I'm missing a lot of the existing knowledge and mental models that they have. Like, I don't get what people think of when they think of "aligned AI", such that they go: you know what, we don't even know how to rigorously define this thing, yet we're convinced it exists and that it's a good thing. But I'll try:
Bostrom mentions stuff like populating the cosmos with digital minds generating positive experiences; that seems unusual, and like one of the things we could possibly want, but not necessarily the only thing we want. If it were the only thing we wanted, we could actually work to explicitly specify that as the AI's goal - that's CEV, and hence problem solved.
Basically, humans want a lot of different things, and we're confused about which things we want more. That doesn't necessarily mean there exists some objective answer as to which of those we want more, or that the question needs to be "solved". Instead it could just be random: there are neurochemicals firing from different portions of the brain, and sometimes one thing wins out, sometimes the other. And if you input a sufficiently persuasive sequence of words into this brain it will prioritise some things more, but if you input a different sequence it will prioritise different things. (An AI with sufficient human-manipulation skills can find this easy, imo.) Turing machines don't have "values" by default; they have behaviours based on their input.
I also wrote up some stuff here a month back but idk how coherent it is: https://www.lesswrong.com/posts/uswN6jyxdkgxHWi7F/samuel-shadrach-s-shortform?commentId=phwddxuNNuumAbSxC
I'm sure we're a finite number of voluntary neurosurgeries away from worshipping paperclip maximisers. I tend to feel we're a hodge-podge of quick heuristic modules and deep strategic modules, and until you delete the heuristic modules via neurosurgery our notion of alignment will always be confused. Our notion of superintelligence / super-rationality is an agent that doesn't use the bad heuristics we do, people have even tried formalising this with Solomonoff / Turing machines / AIXI. But when actually coming face to face with one:
- Either we are informed of the consequences of the agent's thinking and we dislike those, because those don't match our heuristics
- Or the AGI can convince us to become more like it, to the point we can actually agree with its values. The fastest way to get there is neurosurgery, but if we initially feel neurosurgery is too invasive, I'm sure there exists another, much more subtle path the AGI can take - namely one where we want our values to be influenced in ways that eventually end up with us getting closer to the neurosurgery table.
- Or of course the AGI doesn't bother to even get our approval (the default case), but I'm ignoring that and considering far more favourable situations.
We don't actually have "values" in an absolute sense; we have behaviours. Plenty of Turing machines have no notion of "values", they just have behaviour given a certain input. "Values" are a fake variable we create when trying to model ourselves and each other. In other words, the Turing machine has a model of itself inside itself - that's how we think about ourselves (metacognition). So, a mini-Turing machine inside a Turing machine. Of course the mini-machine has some portions deleted; it is a model. First of all this is physically necessitated, but more importantly, you need a simple model so you can do high-level reasoning on it in short amounts of time. So we create this single variable called "values" to point to what is essentially a cluster in thingspace.

Let's say the Turing machine tends to increment its 58th register on only 0.1% of all possible 24-bit-string inputs, and otherwise tends to decrement it a lot more. The mini-Turing machine inside the machine modelling itself will just have some equivalent of the 58th register never incrementing at all, only decrementing. So now the Turing machine incorrectly thinks its 58th register never increments, and hence thinks that decrementing the 58th register is a "value" of the machine.
[Meta note: When I say "value" here, I'd like to still stick to a viewpoint where concepts like "free will", "choice", "desire" and "consciousness" are taboo. Basically I have put on my reductionist hat. If you believe free will and determinism are compatible, you should be okay with this - I'm just consciously restricting the number of tools/concepts/intuitions I wish to use for this particular discussion, not adopting any incorrect ones. You can certainly reintroduce your intuitions in a different discussion, but in a compatibilist world, both our discussions should generally lead to true statements.
Hence in this case, when the machine thinks of "decrementing its 58th register" as its own value, I'm not referring to concepts like "I am driven to decrement my 58th register" or "I desire to decrement my 58th register" but rather "decrementing my 58th register is something I do a lot" - and since "value" is a fake variable that the Turing machine has full liberty to define, it says "'value' is defined by the things I tend to do." By "fake" I mean it exists in the Turing machine's model of itself, the mini-machine.
"Why do I do the things I do?" or "Can I choose what I actually do?" are not questions I'm considering, and I'm for now let's assume the machine doesn't bother itself with such questions (although in practice it certainly may end up asking itself such terribly confused questions, if it is anything like human beings. This doesn't really matter rn.)
I'm gonna assume a single scale called "intelligence" along which all Turing machines can be graded. I'm not sure this scale actually even exists, but I'm gonna assume it anyway. On this scale:
Humans <<< AGI <<< AIXI-like ideal
<<< means "much less intelligent than" or "much further away from reasoning like the AIXI-like ideal", these two are the same thing for now, by definition.
An AGI trying to model human beings won't use such a simple fake variable called "values", it'll be able to build a far richer model of human behaviour. It'll know all about the bad human heuristics that prevent humans from becoming like an AGI or AIXI-like.
Even if the AGI wants us to be aligned, it's just going to do the stuff in the first para. There are different notions of aligned:
Notion 1: "I look at another agent superficially and feel we want the same things." In other words, my fake variable called "my values" is sufficiently close to my fake variable called "this agent's values". I will necessarily be creating such simple fake variables if I'm stupid, i.e., if I'm human, because all humans are stupid relative to the AIXI-like ideal.
An AGI that optimises to satisfy notion 1 can hide what it plans to do to humans and maintain a good appearance until it kills us without telling us.
Notion 2: "I get a full picture of what the agent intends to do and want the same things"
This is very hard, because my heuristics tell me all the consequences the AGI plans are bad. The problem is my heuristics. If I didn't have those heuristics, if I were closer to the AIXI-like ideal, I wouldn't mind. Again, "I wouldn't mind" is from the perspective of machines, not consciousness or desires, so translate it to "the interaction of my outputs will be favourable towards the AGI's outputs in the real world".
So the AGI will find ways to convince us to want our values neurosurgically altered. Eventually we will both be clones and hence perfectly aligned.
Now let's bring stuff like "consciousness" and "desires" and "free will" back into the picture. All this stuff interacts very strongly with the heuristics, exactly the things that make us further away from the AIXI-like ideal.
Simply stated, we don't naturally want to be ideal rational agents. We can't want it, in the sense that we can't get ourselves to truly, consistently want to be ideal rational agents by sheer willpower; we need physical intervention like neurosurgery. Even if free will exists and is a useful concept, it has finite power. I can't delete sections of my brain using free will alone.
So now if intelligence is defined as closer to AIXI-like ideal in internal structure, then intelligence by definition leads to misalignment.
P.S. I should probably also throw some colour on what kinds of "bad heuristics" I am referring to here. In simplest terms: "low-Kolmogorov-complexity behaviours that are very not AIXI-like".
1/ For starters, the entirety of System 1 and sensory processing (see Daniel Kahneman's Thinking, Fast and Slow). We aren't designed to maximise our intelligence; we just happen to have an intelligence module (aka somewhat AIXI-like). Things we care about sufficiently strongly are designed to override System 2, which is more AIXI-like, insofar as evolution has any design for us. So maybe it's not even "bad heuristics" here; it's entire modules in our brain that are not meant for thinking in the first place. It's just neurochemicals firing, with one side winning and the other side losing - this system looks nothing like AIXI. And it's how we deal with most life-and-death situations.
This stuff is beyond the reach of free will; I can't stop reacting to snakes out of sheer will, or hate chocolate, or love to eat shit. Maybe I can train myself on snakes, but I can't train myself to love to eat shit. The closer you are to the sensory apparatus and the further away from the brain, the less the system looks like AIXI - and, simultaneously, the less free will you seem to have.
(P.S. Is that a coincidence or is free will / consciousness really an emergent property of being AIXI-like? I have no clue, it's again the messy free will debates that might get nowhere)
2/ Then of course, at the place where System 1 and System 2 interact, you can actually observe behaviour that takes us further away from the AIXI-like ideal. Things like why we find it difficult to have independent thoughts that go against the crowd. Even if we do have independent thoughts, we need to spend a lot more energy actually developing them further (versus thinking of hypothetical arguments to defend those ideas in society).
This stuff is barely within the reach of our "free will". Large portions of LessWrong are an attempt at training us to be more AIXI-like, by reducing so-called cognitive biases and socially-induced biases.
3/ Then we have proper deep thinking (System 2 and the like), which seems a lot closer to AIXI. This is where we move beyond "bad heuristics", aka "heuristics that AIXI-like agents won't use". But maybe an AGI will find these modules of ours horribly flawed too, who knows.