Buck Shlegeris

I am a software engineer from Australia. I have lived in San Francisco for most of the last three years. I'm interested in data structures, economics, and effective altruism.

Here are the most interesting things I've written. My most notable software project is a search engine for data structures.

If, like many, you prefer to consume content ordered by recency rather than quality, here are some things I wrote recently:

Why are less intelligent people more racist?

(Epistemic status: I think there’s a 75% chance that less intelligent people have at least a 20% higher relative probability of agreeing to racist or segregationist claims like “blacks tend to be lazy”. I think my explanation of this phenomenon is probably roughly right. I don’t think my explanation is very nonobvious to people with the right background knowledge; I just applied primitive evo psych reasoning to the problem and was happy with what came out.)

There seems to be evidence that dumb people are more explicitly racist than smarter people. Research on how political views correlate with intelligence normally finds the following things:

  • Less intelligent people are more likely to agree with explicitly racist statements like “Blacks tend to be unintelligent”, “Blacks tend to be lazy”, or “White people have a right to keep blacks out of their neighborhoods if they want to, and blacks should respect that right.”
    • One paper found that 21%, 43%, and 29% of people in the bottom third of intelligence agree with those statements, while 14%, 33%, and 18% of people in the top third of intelligence do (Table 3, Wodtke 2016). So this effect isn’t totally overpowering, but is pretty strong.
  • Less intelligent people are more likely to endorse that flavor of comment about homosexuality, gender roles, and so on.
  • This effect is detected only in socially conservative attitudes, not other conservative attitudes–Republican voters are probably slightly smarter than Democrat voters on average (Carl 2014), as well as better informed about the world in general and current events. And dumb people are no more likely than smart people to support some racially redistributive policies like tax breaks for businesses in predominantly black areas.

Why do we find that effect of intelligence on social conservatism? I find the answers traditionally offered by researchers and writers pretty condescending and unconvincing. It’s common for them to say that conservatism is more attractive for less intelligent people because it suggests a simpler view of the world, which is comforting for their tiny minds. But this doesn’t really feel sufficient to me. Human behaviors were optimized by evolution. If humans have a tendency to err in a particular direction (by being too mistrustful of weird strangers, or too trustful), then the tendency generally has a fitness advantage, and it’s often productive to try to investigate what that fitness advantage is.

Racism as a strategy

I think racism is an evolved behavior which makes more sense if you’re less intelligent, and our brains are built to adapt our level of racism to many aspects of our situation, including how dumb we are.

Statistical discrimination and broad generalizations make more sense if you’re worse at accurately evaluating things. If you’re good at accurately estimating the qualities of people by interacting with them and using your judgement, then it isn’t very useful to make judgements based on their race. But if you’re not very smart, then it’s a better idea to learn heuristics and use them.

Also, if you’re less intelligent, you should be more inclined to be suspicious of people you don’t know. One risk of making decisions with explicit reasoning is that someone much smarter than you can trick you. A sensible defense against this is to be suspicious of people who might be trying to trick you. People in your outgroup are more likely to be trying to trick you, because you’ll have less social recourse if they scam you. This strategy is more useful if you’re less smart. (I could phrase this hypothesis as “dumb people should be more prone to engage in epistemic learned helplessness”.)

More generally, if you’re less smart, you should be more reluctant to approve of social changes based on verbal arguments. So we should expect less intelligent people to hear the arguments for things like same-sex marriage, and then say “I don’t know, it just seems wrong and your arguments don’t convince me.” (This fits with the evidence that higher IQ people have more Openness to Experience.)


This explanation predicts that social conservatism will generally be unfashionable among intellectuals, because social liberalism will consistently be an effective signal of intelligence. I don’t know how universally this is true.

One final mystery: How are there so many dumb liberals, if dumbness inclines you to social conservatism? My guess is that dumb liberals really don’t see gay people and black people as their outgroup, so these factors don’t affect them. And in their subcultures, gender roles have already shifted in the direction that they want society as a whole to change, so this doesn’t feel scary to them.

Why do I propose that humans adapt our level of racism to fit our situation, instead of thinking that racism is determined by genetics? I’m mostly just trusting “The Moral Animal” here. It argues that choosing strategies based on your experiences while you’re a child is more effective than doing it based on genes, because that’s a more adaptable strategy. I haven’t looked into how heritable explicitly racist views are, but any research on that would obviously have to control for intelligence for it to be worth anything.

“Are Smart People Less Racist” and other papers

Let’s review Wodtke’s 2016 paper “Are Smart People Less Racist?” in detail, then briefly look at a bunch of other papers. It found that people who do better on the GSS vocabulary test, which is often used by social psychologists as a proxy for intelligence, were less likely to agree with explicitly racist statements.

This isn’t just smarter people being more liberal, or smarter people being more inclined to say liberal-sounding things. Smarter people were no more likely to agree with supposedly leftist policy prescriptions:

But, despite these liberalizing effects, whites with higher cognitive ability are no more likely to support open housing laws, special government aid for blacks, tax breaks for businesses to locate in largely black areas, and targeted spending on predominantly black schools, and they are significantly less likely to support school busing programs and preferential hiring policies, compared to their counterparts with lower cognitive ability.

This makes sense to me–I think that both conservatives and liberals can be reasonable people, and smart conservatives who oppose these policies probably do so out of a legitimate disagreement with those proposals, rather than because they’re really racist deep down.

It makes less sense to Wodtke, who writes:

These seemingly paradoxical findings challenge the enlightenment hypothesis that higher cognitive ability promotes a sincere commitment to racial equality.

Principal Skinner: Are my leftist policy ideas really not unambiguously correct? No, my opponents must be secret racists.

Later, he considers the possibility that smarter people are opposed to these policies because they’re more libertarian, instead of just being secretly racist. He notes that in his data, smarter people are more likely than less intelligent people to have environmentalist attitudes. He takes this to be evidence that smarter people are not actually more libertarian. I appreciate that he did this, but I really wish that instead of this he’d investigated people’s feelings towards non-racially-charged redistributive policies–I think that that would have provided much more relevant evidence. As it is, I don’t feel like it’s very strong evidence against my hypothesis that smart people might legitimately think that some of these policy ideas aren’t very good.

I also have a bunch of minor methodological issues with the paper. Respondents answered on a Likert scale, but their answers were recoded as a binary variable. That is, people who were neutral on things like “Giving business and industry special tax breaks for locating in largely black areas” are coded the same as people who are strongly opposed. I don’t know if this is standard practice in social psychology, but it seems foolish to me. Also, social desirability bias could have affected people’s enthusiasm about answering affirmatively to questions like “Would you object to having a coworker of a different race?”. Also, I know I often defend the legitimacy of intelligence tests, but using a vocabulary test as an intelligence test seems kind of sketchy–perhaps it’s picking up on some cultural factors as well as intelligence.

Here are some of the other papers I looked at:

  • Hodson and Busseri (2012), “Bright Minds and Dark Attitudes”. This paper finds, with a methodology I find weak (the intelligence test was done twenty years earlier than the racism test), that lower intelligence predicts social conservatism which predicts explicitly racist beliefs, but intelligence doesn’t predict racism given social conservatism. This matches the results here. This paper doesn’t check how intelligence affects race-related political questions.
  • Dhont and Hodson (2014), “Does Lower Cognitive Ability Predict Greater Prejudice?”. This reviews a bunch of research and, defying Betteridge, says that the consensus is that lower cognitive ability predicts greater prejudice. This paper says that less intelligent people are more prejudiced and authoritarian; it says nothing about their economic beliefs.
  • Carl (2015), “Cognitive ability and political beliefs in the United States”. This paper thinks that intelligence is correlated with fiscal conservatism, but the trend reverses at high IQs. It also finds that intelligence is correlated with social liberalism.
  • Carl (2014), “Cognitive ability and party identity in the United States”. This paper finds that Republican voters are a few IQ points smarter than Democrat voters. The author says “These results are consistent with Carl’s (2014) hypothesis that higher intelligence among classically liberal Republicans compensates for lower intelligence among socially conservative Republicans.”
  • Duarte et al (2015), “Political diversity will improve social psychological science”. This paper, co-written by Haidt, is mostly about other things, but talks about intelligence differences between liberals and conservatives as a hypothesis for why conservatives are underrepresented in academia. The paper refers to some of the other papers I’ve looked at here and treats them as reasonable; coming from critics of political bias in social psychology, this caused a mild positive update for my opinion on the quality of the research.

How legit do I think this is overall? This is social psychology research, so we should always be pretty suspicious that it’s actually just totally wrong for some reason too fundamental for us to notice, and we should treat all results as provisional until they’re replicated.

I guess one point in favor of these studies is that they don’t totally play into the worldview of the liberal academics running them. Their data makes more sense to me than it does to them, I think. This seems true of other research on similar topics too: research on political attitudes often has results that conflict with the biases of the researchers–eg that smarter people are more economically liberal.

Overall, I think that the findings of these researchers seem plausible.

Hash map implementations in practice

algorithms programming

Thanks to Kent Ross for finding several of the details which I list here.

I think hash maps are a really interesting data structure. They’re probably the second most widely used data structure in modern programming languages, behind dynamic arrays, but they involve much more complicated design decisions than dynamic arrays do. So they’re probably the data structure whose real-world implementations have the most to teach us.

(I don’t mean to imply dynamic arrays don’t have any interesting subtleties. To start with, there’s a way to modify dynamic arrays so that their time to append is O(1) worst case instead of O(1) amortized. There’s also a fun way to make them so that they only waste O(√n) space instead of O(n). This paper is about the latter; it explains the former in passing.)

In this post, I’ll briefly describe a few of the key ways that hash map implementations vary, and then I’ll do a review of the hash map implementations in Ruby, Python, Go, Rust, Java, C++, PHP, and C#, as well as some other influential implementations such as those in Google’s SparseHash package.

Key concepts in hash map design

Hash maps differ in their collision resolution strategies. The simplest strategy to implement is probably chaining. There’s a decent explanation of chaining here. When you’re implementing a hash map with chaining, you need to choose a maximum load factor–that is, the number of elements per bucket at which you resize the hash map. Most languages which use chaining seem to use a load factor of about 5x. Another fun detail here is that you can use data structures that aren’t linked lists for your buckets. For example, Java switches buckets to be balanced trees instead of linked lists if they have more than 8 elements, in an effort to make the worst case runtime logarithmic instead of linear.
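
To make the chaining strategy concrete, here’s a minimal sketch in Python (names hypothetical; real implementations are written in C and are far more careful about memory layout). It resizes by doubling once the average chain length exceeds 5, like the load factors mentioned above:

    class ChainedHashMap:
        MAX_LOAD = 5  # average entries per bucket before resizing

        def __init__(self):
            self.buckets = [[] for _ in range(8)]
            self.size = 0

        def set(self, key, value):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)  # overwrite existing entry
                    return
            bucket.append((key, value))
            self.size += 1
            if self.size > self.MAX_LOAD * len(self.buckets):
                self._resize()

        def get(self, key):
            for k, v in self.buckets[hash(key) % len(self.buckets)]:
                if k == key:
                    return v
            raise KeyError(key)

        def _resize(self):
            # Double the bucket array and rehash everything into it.
            old = self.buckets
            self.buckets = [[] for _ in range(2 * len(old))]
            for bucket in old:
                for k, v in bucket:
                    self.buckets[hash(k) % len(self.buckets)].append((k, v))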

A wide variety of different open addressing strategies are also used in practice–linear probing, quadratic probing, double hashing, and others.
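
The differences between these strategies come down to which sequence of slots gets examined after a collision. A sketch of the three probe sequences named above (table sizes and the second hash function are up to the implementation):

    def linear_probe(h, size):
        # h, h+1, h+2, ...: great cache behavior, but prone to clustering.
        for i in range(size):
            yield (h + i) % size

    def quadratic_probe(h, size):
        # h, h+1, h+4, h+9, ...: spreads out clusters of colliding keys.
        for i in range(size):
            yield (h + i * i) % size

    def double_hash_probe(h1, h2, size):
        # The step size is a second hash of the key, so keys that collide
        # on their first slot still follow different paths through the table.
        for i in range(size):
            yield (h1 + i * h2) % size

Note that whether a sequence is guaranteed to visit every slot depends on the table size and step: quadratic probing as written here can cycle without covering the whole table, which is one reason implementations pick their constants carefully.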

One exciting and relatively recent development in hash map implementations is Robin Hood hashing. Here’s a quote from a great article about it:

The clever trick is just this: when you probe for a position to insert a new element, if the probe length for the existing element is less than the current probe length for the element being inserted, swap the two elements and keep going.

That way elements that were inserted early and thus “lucked out” on their probe lengths, will gradually be moved away from their preferred slot as new elements come in that could make better use of that place in the table (hence the name - the insertion “takes from the rich”, i.e. the elements with low probe counts). It leads to an “evening out” of the probe lengths.

Why is low variance better? Well, with modern cache architectures a probe count of 1 isn’t really much faster than a probe count of 3, because the main cost is fetching the cache line, so the ideal scenario is for all elements to have the same probe count, and having that probe count fitting within one cache line.

It turns out that Robin Hood hashing has the same expected probe count as normal open addressing (about 2.55) - it just has substantially less variance, and therefore ends up with far fewer cache misses. Yes, there are fewer elements that can early out after 1 probe, but there also far fewer elements that end up needing to fetch multiple cache lines because of long probe lengths.
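
Here’s a minimal sketch of that insertion rule, storing each entry’s probe distance alongside it (names hypothetical; real implementations are considerably more involved):

    def robin_hood_insert(table, key, value):
        # `table` is a power-of-two-sized list of (key, value, probe_distance)
        # triples or None, assumed to have at least one free slot.
        mask = len(table) - 1
        idx = hash(key) & mask
        dist = 0  # how far we've probed past our ideal slot
        while True:
            slot = table[idx]
            if slot is None:
                table[idx] = (key, value, dist)
                return
            k, v, d = slot
            if k == key:
                table[idx] = (key, value, d)  # update existing key in place
                return
            if d < dist:
                # The incumbent is "richer" (closer to its ideal slot) than we
                # are: take its slot and continue inserting the evicted entry.
                table[idx] = (key, value, dist)
                key, value, dist = k, v, d
            idx = (idx + 1) & mask
            dist += 1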

Currently, Rust is the only mainstream language which uses Robin Hood hashing in its default hash map implementation.

Hash maps in practice

Python

Python’s dictionary implementation uses an open addressing scheme. It resizes when the hash map is 2/3 full. You can read the source code here.

Unlike most programming languages, Python doesn’t try to use a hash function which appears random. To quote the source:

Major subtleties ahead:  Most hash schemes depend on having a "good" hash
function, in the sense of simulating randomness.  Python doesn't:  its most
important hash functions (for ints) are very regular in common
cases:

  >>> [hash(i) for i in range(4)]
  [0, 1, 2, 3]

This isn't necessarily bad!  To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints. So
this gives better-than-random behavior in common cases, and that's very
desirable.

OTOH, when collisions occur, the tendency to fill contiguous slices of the
hash table makes a good collision resolution strategy crucial.  Taking only
the last i bits of the hash code is also vulnerable:  for example, consider
the list [i << 16 for i in range(20000)] as a set of keys.  Since ints are
their own hash codes, and this fits in a dict of size 2**15, the last 15 bits
of every hash code are all 0:  they *all* map to the same table index.

But catering to unusual cases should not slow the usual ones, so we just take
the last i bits anyway.  It's up to collision resolution to do the rest.  If
we *usually* find the key we're looking for on the first try (and, it turns
out, we usually do -- the table load factor is kept under 2/3, so the odds
are solidly in our favor), then it makes best sense to keep the initial index
computation dirt cheap.
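
The collision resolution that goes with this is, roughly, the following recurrence (a Python rendering of the loop documented in that same file; the real code is C, and the details vary a little between versions). The `perturb` term gradually mixes the high bits of the hash into the probe sequence:

    PERTURB_SHIFT = 5

    def probe_sequence(hash_value, table_size):
        mask = table_size - 1   # the table size is always a power of two
        perturb = hash_value    # treated as an unsigned integer in C
        i = hash_value & mask
        while True:
            yield i
            # Mix progressively higher bits of the hash into the index, so
            # keys that collide in their low bits eventually diverge.
            perturb >>= PERTURB_SHIFT
            i = (i * 5 + perturb + 1) & mask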

In Python 3.6, an additional layer of indirection was added, with the following reasoning:

The current memory layout for dictionaries is
unnecessarily inefficient.  It has a sparse table of
24-byte entries containing the hash value, key pointer,
and value pointer.

Instead, the 24-byte entries should be stored in a
dense table referenced by a sparse table of indices.

For example, the dictionary:

    d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}

is currently stored as:

    entries = [['--', '--', '--'],
               [-8522787127447073495, 'barry', 'green'],
               ['--', '--', '--'],
               ['--', '--', '--'],
               ['--', '--', '--'],
               [-9092791511155847987, 'timmy', 'red'],
               ['--', '--', '--'],
               [-6480567542315338377, 'guido', 'blue']]

Instead, the data should be organized as follows:

    indices =  [None, 1, None, None, None, 0, None, 2]
    entries =  [[-9092791511155847987, 'timmy', 'red'],
                [-8522787127447073495, 'barry', 'green'],
                [-6480567542315338377, 'guido', 'blue']]

This significantly reduces memory usage. It also means that Python dictionaries are now ordered, which makes Hacker News (and me) unhappy.
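
A lookup in the new layout goes through one extra hop, sketched here with linear probing for brevity (CPython uses the perturbed probing shown earlier, and stores each entry’s hash so most comparisons are cheap):

    def compact_dict_lookup(indices, entries, key):
        mask = len(indices) - 1
        h = hash(key)
        slot = h & mask
        while True:
            ix = indices[slot]       # probe the small, sparse table
            if ix is None:
                raise KeyError(key)  # empty slot: the key is absent
            stored_hash, stored_key, value = entries[ix]  # dense table holds the data
            if stored_hash == h and stored_key == key:
                return value
            slot = (slot + 1) & mask

With the timmy/barry/guido example above, `entries` keeps insertion order, which is exactly why dicts now iterate in insertion order.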

Since version 3.3, Python dicts double in size when they resize. Before version 3.3, they quadrupled their capacity on resize.

V8

Source here.

This uses open addressing and has a maximum load factor of 80%.

One interesting detail:

// To remove an entry we need to ensure that it does not create an empty
// entry that will cause the search for another entry to stop too soon. If all
// the entries between the entry to remove and the next empty slot have their
// initial position inside this interval, clearing the entry to remove will
// not break the search. If, while searching for the next empty entry, an
// entry is encountered which does not have its initial position between the
// entry to remove and the position looked at, then this entry can be moved to
// the place of the entry to remove without breaking the search for it. The
// entry made vacant by this move is now the entry to remove and the process
// starts over.
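
In other words, deletion repairs the probe chain by shifting later entries back rather than leaving a tombstone. A Python-flavored sketch of the algorithm that comment describes (the idea only, not V8’s actual code):

    def remove_entry(table, idx):
        # `table` is a power-of-two-sized list of (key, value) pairs or None,
        # using linear probing.
        mask = len(table) - 1
        hole, cur = idx, idx
        while True:
            cur = (cur + 1) & mask
            if table[cur] is None:
                break                          # reached an empty slot: done
            home = hash(table[cur][0]) & mask  # this entry's initial position
            # The entry at `cur` can fill the hole only if its probe path from
            # `home` to `cur` passes through the hole (cyclically).
            if home <= cur:
                movable = home <= hole < cur
            else:                              # the probe path wrapped around
                movable = hole >= home or hole < cur
            if movable:
                table[hole] = table[cur]
                hole = cur
        table[hole] = None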

Java

Java’s HashMap class uses chaining, but with a neat twist!

in Java 8 hash elements use balanced trees instead of linked lists after a certain threshold is reached. Which means HashMap starts with storing Entry objects in linked list but after the number of items in a hash becomes larger than a certain threshold, the hash will change from using a linked list to a balanced tree, this will improve the worst case performance from O(n) to O(log n).

According to this page the threshold is 8 elements:

Well, this optimization is described in JEP-180. Basically when a bucket becomes too big (currently: TREEIFY_THRESHOLD = 8), HashMap dynamically replaces it with an ad-hoc implementation of tree map. This way rather than having pessimistic O(n) we get much better O(logn). How does it work? Well, previously entries with conflicting keys were simply appended to linked list, which later had to be traversed. Now HashMap promotes list into binary tree, using hash code as a branching variable. If two hashes are different but ended up in the same bucket, one is considered bigger and goes to the right. If hashes are equal (as in our case), HashMap hopes that the keys are Comparable, so that it can establish some order. This is not a requirement of HashMap keys, but apparently a good practice. If keys are not comparable, don’t expect any performance improvements in case of heavy hash collisions.

So if you implement Comparable properly, hash map retrieval is worst case O(log n)! What an exciting world we live in!

The default load factor in Java hash maps is 0.75, and has been since at least Java version 5.

C++ STL

Chaining. Amusingly enough, this seems to be a direct result of a requirement in the C++ standard:

The Standard effectively mandates std::unordered_set and std::unordered_map implementations that use open hashing, which means an array of buckets, each of which holds the head of a logical (and typically actual) list. That requirement is subtle: it’s a consequence of the default max load factor being 1.0 and the guarantee that the table will not be rehashed unless grown beyond that load factor: that would be impractical without chaining, as the collisions with closed hashing become overwhelming as the load factor approaches 1

Linux hashtable

Linux has a fixed-size hashtable which uses chaining; the kernel uses it internally in a bunch of places.

Google SparseHashMap and DenseHashMap

Both of these use quadratic probing. From the documentation:

This directory contains several hash-map implementations, similar in API to SGI’s hash_map class, but with different performance characteristics. sparse_hash_map uses very little space overhead, 1-2 bits per entry. dense_hash_map is very fast, particulary on lookup. (sparse_hash_set and dense_hash_set are the set versions of these routines.) On the other hand, these classes have requirements that may not make them appropriate for all applications.

All these implementation use a hashtable with internal quadratic probing. This method is space-efficient – there is no pointer overhead – and time-efficient for good hash functions.

The usage of these classes differ from SGI’s hash_map, and other hashtable implementations, in the following major ways:

1) dense_hash_map requires you to set aside one key value as the ‘empty bucket’ value, set via the set_empty_key() method. This MUST be called before you can use the dense_hash_map. It is illegal to insert any elements into a dense_hash_map whose key is equal to the empty-key.

2) For both dense_hash_map and sparse_hash_map, if you wish to delete elements from the hashtable, you must set aside a key value as the ‘deleted bucket’ value, set via the set_deleted_key() method. If your hash-map is insert-only, there is no need to call this method. If you call set_deleted_key(), it is illegal to insert any elements into a dense_hash_map or sparse_hash_map whose key is equal to the deleted-key.

Ruby

According to the source, Ruby used chaining (with a threshold load factor of 5x) until 2.4, when it decided to follow Python’s lead and switch to a similar scheme with open addressing and two separate tables. As my colleague Kent points out, the Ruby hash table is basically the same as the Python one, except Ruby misspells “perturb” as “perterb” for some reason, and in Ruby the hash code is shifted down 11 bits instead of 5 in each perturbation.

Rust

In an exciting change of pace, Rust uses “linear probing with Robin Hood bucket stealing.”

This had a super neat bug which led to a sequence of hash map insertions taking quadratic time.

To quote that Tumblr piece:

Surprisingly to me, the specific dynamics of Robin Hood hashing end up being relatively unimportant here; I believe that vanilla linear probing would exhibit similar behaviors. The key effect of Robin Hood hashing is just that it gives you confidence and/or hubris to push a table to 90% capacity, which greatly exacerbates the problem.

I’m glad that Rust is doing the trailblazing here. I think that long term, more languages should probably switch to some kind of Robin Hood hashing, and it’s nice that we’re working out these kinks now.

C#

C# uses open addressing with double hashing. That article also contains this hilarious tidbit:

In an overloaded form of the Hashtable’s constructor, you can specify a loadFactor value between 0.1 and 1.0. Realize, however, that whatever value you provide, it is scaled down 72%, so even if you pass in a value of 1.0 the Hashtable class’s actual loadFactor will be 0.72. The 0.72 was found by Microsoft to be the optimal load factor, so consider using the default 1.0 load factor value (which gets scaled automatically to 0.72). Therefore, you would be encouraged to use the default of 1.0 (which is really 0.72).

Also, that documentation consistently writes ‘rehasing’ when I am almost sure they mean ‘rehashing’. Apparently Ruby isn’t the only programming language which needs a copy editor.

Golang

Golang uses chaining:

A map is just a hash table. The data is arranged into an array of buckets. Each bucket contains up to 8 key/value pairs. The low-order bits of the hash are used to select a bucket. Each bucket contains a few high-order bits of each hash to distinguish the entries within a single bucket.

If more than 8 keys hash to a bucket, we chain on extra buckets.

When the hashtable grows, we allocate a new array of buckets twice as big. Buckets are incrementally copied from the old bucket array to the new bucket array.

According to the same file, the default average load factor at which a resizing is triggered is 6.5.
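
A sketch of the bucket layout described there (Python structures standing in for the Go runtime’s internals; the 56-bit shift assumes a 64-bit hash and is purely illustrative):

    BUCKET_SIZE = 8

    class Bucket:
        def __init__(self):
            self.tophash  = [None] * BUCKET_SIZE  # high-order hash bits, checked first
            self.keys     = [None] * BUCKET_SIZE
            self.values   = [None] * BUCKET_SIZE
            self.overflow = None                  # extra Bucket chained on when full

    def lookup(buckets, key):
        h = hash(key)
        b = buckets[h % len(buckets)]             # low-order bits pick the bucket
        top = (h >> 56) & 0xFF                    # high-order bits distinguish entries
        while b is not None:
            for i in range(BUCKET_SIZE):
                # Comparing one byte of hash is much cheaper than comparing keys.
                if b.tophash[i] == top and b.keys[i] == key:
                    return b.values[i]
            b = b.overflow                        # walk the overflow chain
        raise KeyError(key)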

PHP

PHP uses chaining, with the additional constraint that PHP semantics require that PHP hash tables be ordered by default, which sparks controversy on HN.

PHP’s hash map implementation uses the identity function as the hash function for integers. This has great performance when your hash keys are a contiguous sequence of integers. However, it also means that if you have a hash map with capacity 64 and you insert a bunch of numbers which all have the same remainder mod 64, they’ll all land in the same hash bucket, and each insert will take time linear in the number of elements. This blog post shows how inserting 65536 evil elements can take 30 seconds.
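
A toy demonstration of the blowup (a chained table with an identity hash and no resizing; everything here is a hypothetical sketch, not PHP’s actual code):

    import time

    def insert_evil_keys(n, capacity=64):
        # Every key is a multiple of `capacity`, so with an identity hash they
        # all land in bucket 0; each insert's duplicate check then scans the
        # whole chain so far, costing O(n^2) comparisons in total.
        buckets = [[] for _ in range(capacity)]
        for key in (i * capacity for i in range(n)):
            chain = buckets[key % capacity]
            if key not in chain:   # this linear scan is where the time goes
                chain.append(key)

    for n in (1000, 2000, 4000):
        start = time.perf_counter()
        insert_evil_keys(n)
        print(n, time.perf_counter() - start)   # roughly 4x slower per doubling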

Other topics in hash maps

If you want to learn more about hash maps, here are some topics to look up:

Wild speculations on the balance of pain and pleasure

Epistemic status: mad speculation, mixed with math that’s so much fun that I find it hard to believe it’s useful. I’m serious about thinking these questions are important, but I’m much less confident in any of my answers.

It’s kind of interesting that living creatures seem to feel both pain and pleasure, which seem like distinctly negative and positive experiences. When you construct a theory of reinforcement learning, you don’t see anything like this distinction between positive and negative sensations: reward is just a real-valued function, and the behavior of agents isn’t affected by any transformation of their reward function which is either a shift (eg you feel +3 happy about every outcome) or a positive scaling (eg your reward function is multiplied by two). Obviously humans aren’t literally reinforcement learners, but it’s surprising that our sense of reward is so different.

So why do humans experience reward as a function centered around zero? Why does it feel meaningful to think about what it would be like for all your sensations to be twice as powerful? And given that we seem to have experiences that work this way, what factors determine where the zero point for an organism is, and how powerful its sensations are?


Suppose I’m trying to train a little robot to pick up trash for me. Every day, it goes out and picks up some trash and brings it home for inspection. I need to give it a reward signal so that it knows how well it’s done. The reward signal that works for it is slightly costly though: I need to either give it small chocolates as a reward, or small pieces of coal as punishment. The cost of a piece of chocolate or coal is $1, and the robot views one piece of coal as exactly as bad as a piece of chocolate is good.

There’s some distribution of robot success which I’m expecting. Let’s say the robot always brings home 10, 11, or 13 pieces of trash. I need to choose how to reward the robot. That is, I need to choose a function from the success of the robot to the reward experienced by the robot.

One really simple option is to just give the robot one chocolate per piece of trash that it brings home. But this is pointlessly inefficient: you’re always going to be giving the robot a bunch of chocolate regardless of its actual success level.

Or you could give the robot a number of chocolates equal to the amount of trash minus 10. This is more efficient—you give out 0, 1, or 3 chocolates.

More efficient still, you could give the robot a punishment of 1 when it brings home 10 pieces of trash, give it nothing for 11 pieces, and give it a reward of 2 for bringing home 13 pieces. This is the best option.

So given a distribution, we need to choose a zero point for the reward function of the robot. What’s the optimal choice? Why, it’s the point which minimizes the expected cost, so it’s the point with minimum expected absolute distance to a randomly chosen point in the distribution. The median of a distribution is the number with this property.

(Why? Suppose that you want to choose a point to minimize the sum of absolute distance to the set [10, 11, 13, 16, 20, 23, 28]. Which point is better out of 20 and 23? Well, if you go from 23 to 20 you’re only making the distance longer for the two points 23 and 28, and you’re making it shorter for all 5 smaller points. And all these distance changes are the same magnitude, so all that matters is how many points you’re getting closer to. So 20 is better than 23. So the sum of absolute distances is minimized if you choose the point in the middle of the list.)
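
Here’s a quick numeric check of that argument:

    from statistics import median

    points = [10, 11, 13, 16, 20, 23, 28]

    def total_cost(z):
        # Chocolates/coal spent with zero point z: one unit per unit of
        # distance between the outcome and the zero point.
        return sum(abs(x - z) for x in points)

    best = min(range(min(points), max(points) + 1), key=total_cost)
    print(best, median(points))   # both print 16, the middle of the list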

Now, I don’t care about my little robot. But suppose you do care about my little robot. You might care about how it likes its reward scheme. You care about its mean reward, which is (-1 + 0 + 2)/3 = 0.33. This is positive, which is great—the robot has a life worth living!

So in this situation, the robot-master will set the zero point to the median amount of trash, and the utilitarian onlooker cares about the mean value of this reward. The mean value of the reward is going to be positive iff the mean of the trash distribution is greater than the median of the trash distribution.

The median of a distribution is greater than the mean if the distribution is skewed to the left: a long left tail drags the mean down while barely moving the median.

So left-skewed trash distributions will lead to overall sad robots.


I’ve been thinking a lot recently about cases like this, where one optimization process chooses the reward function for another system in order to maximize some objective of its own. For example, evolution tries to pick reward functions for animals that maximize the animals’ reproductive fitness. But evolution has a bunch of constraints on this choice of reward function, and it’s also a really stupid optimization process, so the reward function is only a pretty rough approximation of the fitness benefit of various actions.

For example, humans have a tendency to consume an unhealthy amount of sugar: this is because evolution gave us a reward function that tries to motivate us to get sugar, which made sense in the ancestral environment but doesn’t make as much sense now. Other examples are when people are scared of the dark, or when they watch porn and masturbate, or do drugs.

The way that optimization processes choose reward functions is crucially important for utilitarians. Understanding the factors which affect the shape of reward functions is essential to predicting the welfare of beings like: ems in a competitive em world, subroutines in a paperclip maximizer, subroutines used by a galactic human civilization, factory farms if they exist a thousand years in the future, and wild animals.

I have lots of speculations about all this. I’ve been having trouble writing it all up, because there are a lot of different angles to approach these issues from. But with this post, at least I’ve made a start. Let me know if you’re interested in hearing my less well-formed crazy ideas on this topic.


The only academic discussion of this general question I’ve seen is in the 1995 paper “Towards Welfare Biology”, which was one of the first academic papers about the ethical importance of wild animal suffering, where Yew-Kwang Ng writes:

Our common sense recognition of the suffering of a typical non-surviving individual in most species may be supported by a simple argument based on evolution. We start by asking, why do we enjoy eating but suffer in starvation? The answer is that this genetic program provides us with the right incentives to do things favourable to survival. But why suffering? Why not just less enjoyment when starving and more enjoyment when eating? If the difference in the degrees of enjoyment between the two is big enough, we will still do the “right things”. However, the existence of suffering may be explained below.

First, both enjoyment and suffering are costly in terms of energy requirement, tissue maintenance, etc. This is why we feel neutral most of the time when we are not starving, eating, having sex, etc. (It would be nice if we could be programmed to feel ecstatic most of the time.) Secondly, it is likely that the extra (or marginal) costs involved in having an extra unit of enjoyment (or suffering) increases with the amount of enjoyment (suffering). Viewed differently, we have diminishing marginal returns in both enjoyment and suffering per unit of cost. Thirdly, it is likely that the costs (generalized resource costs, not subjective welfare costs) of suffering are unlikely to be significantly less, and maybe actually more, than those of enjoyment.

I used the first and third of these claims in my robot story, and ignored the second.

If we also assume that the metabolic cost of rewards increases superlinearly with their absolute value, then some other things might happen. For example, if the cost of a reward is the square of its magnitude, then the reward distribution will have mean zero. More generally, if the cost of a reward is its magnitude to the power of p, then (as the numeric sketch after this list suggests):

  • if p == 1, then the zero point is the median, and the creature has a life worth living iff the distribution is skewed to the right.
  • if 1 < p < 2, again the creature has a life worth living iff the distribution is skewed to the right.
  • if p == 2, the reward distribution has mean zero.
  • if p > 2, the creature has a life worth living iff the distribution is skewed to the left.
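
Here’s the numeric sketch behind those cases: grid-search the zero point z that minimizes the expected delivery cost E[|x − z|^p], then look at the sign of the creature’s mean reward (names hypothetical; the trash distribution is the one from the robot story, which is skewed to the right):

    from statistics import mean

    def optimal_zero(outcomes, p, steps=20001):
        # The reward for outcome x is x - z, and delivering a reward of
        # magnitude r costs |r|**p; the master picks z to minimize expected cost.
        lo, hi = min(outcomes), max(outcomes)
        grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
        return min(grid, key=lambda z: mean(abs(x - z) ** p for x in outcomes))

    outcomes = [10, 11, 13]   # mean > median, i.e. skewed right
    for p in (1.0, 1.5, 2.0, 3.0):
        z = optimal_zero(outcomes, p)
        # A positive mean reward is a life worth living; the sign flips at p = 2.
        print(p, round(z, 3), round(mean(outcomes) - z, 3))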
