Hazards of Selection Effects on Approved Information

In a busy, busy world, there’s so much to read that no one could possibly keep up with it all. You can’t not prioritize what you pay attention to and (even more so) what you respond to. Everyone and her dog tells herself a story that she wants to pay attention to “good” (true, useful) information and ignore “bad” (false, useless) information.

Keeping the story true turns out to be a harder problem than it sounds. Everyone and her dog knows that the map is not the territory, but the reason we need a whole slogan about it is because we never actually have unmediated access to the territory. Everything we think we know about the territory is actually just part of our map (the world-simulation our brains construct from sensory data), which makes it easy to lose track of whether your actions are improving the real territory, or just your view of it on your map.

For example, I like it when I have good ideas. It makes sense for me to like that. I endorse taking actions that will result in world-states in which I have good ideas.

The problem is that I might not be able to tell the difference between world-states in which I have good ideas, and world-states in which I think my ideas are good, but they’re actually bad. Those two different states of the territory would look the same on my map.

If my brain’s learning algorithms reinforce behaviors that lead to me having ideas that I think are good, then in addition to learning behaviors that make me have better ideas (like reading a book), I might also inadvertently pick up behaviors that prevent me from hearing about it if my ideas are bad (like silencing critics).

This might seem like an easy problem to solve, because the most basic manifestations of the problem are in fact pretty easy to solve. If I were to throw a crying fit and yell, “Critics bad! No one is allowed to criticize my ideas!” every time someone criticized my ideas, the problem with that would be pretty obvious to everyone and her dog, and I would stop getting invited to the salon.

But what if there were subtler manifestations of the problem, that weren’t obvious to everyone and her dog? Then I might keep getting invited to the salon, and possibly even spread the covertly dysfunctional behavior to other salon members. (If they saw the behavior seeming to work for me, they might imitate it, and their brain’s learning algorithms would reinforce it if it seemed to work for them.) What might those look like? Let’s try to imagine.

Filtering Interlocutors

Goofusia: I don’t see why you tolerate that distrustful witch Goody Osborne at your salon. Of course I understand the importance of criticism, which is an essential nutrient for any truthseeker. But you can acquire the nutrient without the downside of putting up with unpleasant people like her. At least, I can. I’ve already got plenty of perceptive critics in my life among my friends who want the truth, and know that I want the truth—who will assume my good faith, because they know my heart is in the right place.

Gallantina: But aren’t your friends who know you want the truth selected for agreeing with you, over and above their being selected for being correct? If there were some crushing counterargument to your beliefs that would only be found by someone who didn’t know that you want the truth and wouldn’t assume good faith, how would you ever hear about it?

This one is subtle. Goofusia isn’t throwing a crying fit every time a member of the salon criticizes her ideas. And indeed, you can’t invite the whole world to your salon. You can’t not do some sort of filtering. The question is whether salon invitations are being extended or withheld for “good” reasons (that promote the salon processing true and useful information) or “bad” reasons (that promote false or useless information).

The problem is that being friends with Goofusia and “know[ing] that [she and other salon members] want the truth” is a bad membership criterion, not a good one, because people who aren’t friends with Goofusia and don’t know that she wants the truth are likely to have different things to say. Even if Goofusia can answer all the critiques her friends can think of, that shouldn’t give her confidence that her ideas are solid, if there are likely to exist serious critiques that wouldn’t be independently reïnvented by the kinds of people who become Goofusia’s friends.

The “nutrient” metaphor is a tell. Goofusia seems to be thinking of criticism as if it were a homogeneous ingredient necessary for a healthy epistemic environment, but that it doesn’t particularly matter where it comes from. In analogy, it doesn’t matter whether you get your allowance of potassium from bananas or potatoes or artificial supplements. If you find bananas and potatoes unpleasant, you can still take supplements and get your potassium that way; if you find Goody Osborne unpleasant, you can just talk to your friends who know you want the truth and get your criticism that way.

But unlike chemically uniform nutrients, criticism isn’t homogeneous: different critics are differently equipped by virtue of their different intellectual backgrounds to notice different flaws in a piece of work. The purpose of criticism is not to virtuously endure being criticized; the purpose is to surface and fix every individual flaw. (If you independently got everything exactly right the first time, then there would be nothing for critics to do; it’s just that that seems pretty unlikely if you’re talking about anything remotely complicated. It would be hard to believe that such an unlikely-seeming thing had really happened without the toughest critics getting the chance to do their worst.)

“Knowing that (someone) wants the truth” is a particularly poor filter, because people who think that they have strong criticisms of your ideas are particularly likely to think that you don’t want the truth. (Because, the reasoning would go, if you did want the truth, why would you propose such flawed ideas, instead of independently inventing the obvious-to-them criticism yourself and dropping the idea without telling anyone?) Refusing to talk to people who think that they have strong criticisms of your ideas is a bad thing to do if you care about your ideas being correct.

The selection effect is especially bad in situations where the fact that someone doesn’t want the truth is relevant to the correct answer. Suppose Goofusia proposes that the salon buys cookies from a certain bakery—which happens to be owned by Goofusia’s niece. If Goofusia’s proposal was motivated by nepotism, that’s probabilistically relevant to evaluating the quality of the proposal. (If the salon members aren’t omniscient at evaluating bakery quality on the merits, then they can be deceived by recommendations made for reasons other than the merits.) The salon can debate back and forth about the costs and benefits of spending the salon’s snack budget at the niece’s bakery, but if no one present is capable of thinking “Maybe Goofusia is being nepotistic” (because anyone who could think that would never be invited to Goofusia’s salon), that bodes poorly for the salon’s prospects of understanding the true cost–benefit landscape of catering options.

Filtering Information Sources

Goofusia: One shouldn’t have to be the sort of person who follows discourse in crappy filter-bubbles in order to understand what’s happening. The Rev. Samuel Parris’s news summary roundups are the sort of thing that lets me do that. Our salon should work like that if it’s going to talk about the atheist threat and the witchcraft crisis. I don’t want to have to read the awful corners of the internet where this is discussed all day. They do truthseeking far worse there.

Gallantina: But then you’re turning your salon into a Rev. Parris filter bubble. Don’t you want your salon members to be well-read? Are you trying to save time, or are you worried about being contaminated by ideas that haven’t been processed and vetted by Rev. Parris?

This one is subtle, too. If Goofusia is busy and just doesn’t have time to keep up with what the world is saying about atheism and witchcraft, it might very well make sense to delegate her information gathering to Rev. Parris. That way, she can get the benefits of being mostly up to speed on these issues without having to burn too many precious hours that could be spent studying more important things.

The problem is that the suggestion doesn’t seem to be about personal time-saving. Rev. Parris is only one person; even if he tries to make his roundups reasonably comprehensive, he can’t help but omit information in ways that reflect his own biases. (For he is presumably not perfectly free of bias, and if he didn’t omit anything, there would be no time-saving value to his subscribers in being able to just read the roundup rather than having to read everything that Rev. Parris reads.) If some salon members are less busy than Goofusia and can afford to do their own varied primary source reading rather than delegating it all to Rev. Parris, Goofusia should welcome that—but instead, she seems to be suspicious of those who would “be the sort of person” who does that. Why?

The admonition that “They do truthseeking far worse there” is a tell. The implication seems to be that good truthseekers should prefer to only read material by other good truthseekers. Rev. Parris isn’t just saving his subscribers time; he’s protecting them from contamination, heroically taking up the burden of extracting information out of the dangerous ravings of non-truthseekers.

But it’s not clear why such a risk of contamination should exist. Part of the timeless ideal of being well-read is that you’re not supposed to believe everything you read. If I’m such a good truthseeker, then I should want to read everything I can about the topics I’m seeking the truth about. If the authors who publish such information aren’t such good truthseekers as I am, I should take that into account when performing updates on the evidence they publish, rather than denying myself the evidence.

Information is transmitted across the physical universe through links of cause and effect. If Mr. Proctor is clear-sighted and reliable, then when he reports seeing a witch, I infer that there probably was a witch. If the correlation across possible worlds is strong enough—if I think Mr. Proctor reports witches when there are witches, and not when there aren’t—then Mr. Proctor’s word is almost as good as if I’d seen the witch myself. If Mr. Corey has poor eyesight and is of a less reliable character, I am less credulous about reported witch sightings from him, but if I don’t face any particular time constraints, I’d still rather hear Mr. Corey’s testimony, because the value of information to a Bayesian reasoner is always nonnegative. For example, Mr. Corey’s report could corroborate information from other sources, even if it wouldn’t be definitive on its own. (Even the fact that people sometimes lie doesn’t fundamentally change the calculus, because the possibility of deception can be probabilistically “priced in”.)
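The arithmetic here is simple enough to check directly. As a toy sketch (the function name and all the reliability numbers are made up for illustration), here is how a Bayesian reasoner might weigh a witch report from each man:

```python
def posterior_witch(prior, p_report_if_witch, p_report_if_no_witch):
    """Bayes' rule: P(witch | report), given the reporter's reliability."""
    joint_witch = prior * p_report_if_witch
    joint_no_witch = (1 - prior) * p_report_if_no_witch
    return joint_witch / (joint_witch + joint_no_witch)

PRIOR = 0.01  # assumed base rate of witches (illustrative)

# Mr. Proctor reports witches when there are witches, and rarely otherwise:
# a strong likelihood ratio (0.95 vs. 0.01).
proctor = posterior_witch(PRIOR, 0.95, 0.01)

# Mr. Corey's eyesight is poor: a weaker likelihood ratio (0.60 vs. 0.30),
# but his report still shifts the probability upward from the prior.
corey = posterior_witch(PRIOR, 0.60, 0.30)

print(f"after Proctor's report: {proctor:.3f}")
print(f"after Corey's report:   {corey:.3f}")
```

Even the unreliable report moves the posterior above the 1% prior, which informally illustrates the sense in which hearing Mr. Corey out is worth something to a reasoner who correctly prices in his unreliability.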

That’s the theory, anyway. A potential reason to fear contamination from less-truthseeking sources is that perhaps the Bayesian ideal is too hard to practice and salon members are too prone to believe what they read. After all, many news sources have been adversarially optimized to corrupt and control their readers, making them less sane by getting them to see the world through ungrounded lenses.

But the means by which such sources manage to control their readers is precisely by capturing their trust and convincing them that they shouldn’t want to read the awful corners of the internet where they do truthseeking far worse than here. Readers who have mastered multiple ungrounded lenses and can check them against each other can’t be owned like that. If you can spare the time, being well-read is a more robust defense against the risk of getting caught in a bad filter bubble, than trying to find a good filter bubble and blocking all (presumptively malign) outside sources of influence. All the bad bubbles have to look good from the inside, too, or they wouldn’t exist.

To some, the risk of being in a bad bubble that looks good may seem too theoretical or paranoid to take seriously. It’s not like there are no objective indicators of filter quality. In analogy, the observation that dreaming people don’t know that they’re asleep, probably doesn’t make you worry that you might be asleep and dreaming right now.

But it being obvious that you’re not in one of the worst bubbles shouldn’t give you much comfort. There are still selection effects on what information gets to you, if for no other reason than that there aren’t enough good truthseekers in the world to uniformly cover all the topics that a truthseeker might want to seek truth about. The sad fact is that people who write about atheism and witchcraft are disproportionately likely to be atheists or witches themselves, and therefore non-truthseeking. If your faith in truthseeking is so weak that you can’t even risk hearing what non-truthseekers have to say, that necessarily limits your ability to predict and intervene on a world in which atheists and witches are real things in the physical universe that can do real harm (where you need to be able to model the things in order to figure out which interventions will reduce the harm).

Suppressing Information Sources

Goofusia: I caught Goody Osborne distributing pamphlets quoting the honest and candid and vulnerable reflections of Rev. Parris on guiding his flock, and just trying to somehow twist that into maximum anger and hatred. It seems quite clear to me what’s going on in that pamphlet, and I think signal-boosting it is a pretty clear norm violation in my culture.

Gallantina: I read that pamphlet. It seemed like intellectually substantive satire of a public figure. If you missed the joke, it was making fun of an alleged tendency in Rev. Parris’s sermons to contain sophisticated analyses of the causes of various social ills, and then at the last moment, veer away from the uncomfortable implications and blame it all on witches. If it’s a norm violation to signal-boost satire of public figures, that’s artificially making it harder for people to know about flaws in the work of those public figures.

This one is worse. Above, when Goofusia filtered who she talks to and what she reads for bad reasons, she was in an important sense only hurting herself. Other salon members who aren’t sheltering themselves from information are unaffected by Goofusia’s preference for selective ignorance, and can expect to defeat Goofusia in public debate if the need arises. The system as a whole is self-correcting.

The invocation of “norm violations” changes everything. Norms depend on collective enforcement. Declaring something a norm violation is much more serious than saying that you disagree with it or don’t like it; it’s expressing an intent to wield social punishment in order to maintain the norm. Merely bad ideas can be criticized, but ideas that are norm-violating to signal-boost are presumably not even to be seriously discussed. (Seriously discussing a work is signal-boosting it.) Norm-abiding group members are required to be ignorant of such works’ details (or act as if they’re ignorant).

Mandatory ignorance of anything seems bad for truthseeking. What is Goofusia thinking here? Why would this seem like a good idea to someone?

At a guess, the “maximum anger and hatred” description is load-bearing. Presumably the idea is that it’s okay to calmly and politely criticize Rev. Parris’s sermons; it’s only sneering or expressing anger or hatred that is forbidden. If the salon’s speech code only targets form and not content, the reasoning goes, then there’s no risk of the salon missing out on important content.

The problem is that the line between form and content is blurrier than many would prefer to believe, because words mean things. You can’t just swap in non-angry words for angry words without changing the meaning of a sentence. Maybe the distortion of meaning introduced by substituting nicer words is small, but then again, maybe it’s large: the only person in a position to say is the author. People don’t express anger and hatred for no reason. When they do, it’s because they have reasons to think something is so bad that it deserves their anger and hatred. Are those good reasons or bad reasons? If it’s norm-violating to talk about it, we’ll never know.

Unless applied with the utmost stringent standards of evenhandedness and integrity, censorship of form quickly morphs into censorship of content, as heated criticism of the ingroup is construed as norm-violating, while equally heated criticism of the outgroup is unremarkable and passes without notice. It’s one of those irregular verbs: I criticize; you sneer; she somehow twists into maximum anger and hatred.

The conjunction of “somehow” and “it seems quite clear to me what’s going on” is a tell. If it were actually clear to Goofusia what was going on with the pamphlet author expressing anger and hatred towards Rev. Parris, she would not use the word “somehow” in describing the author’s behavior: she would be able to pass the author’s ideological Turing test and therefore know exactly how.

If that were just Goofusia’s mistake, the loss would be hers alone, but if Goofusia is in a position of social power over others, she might succeed at spreading her anti-speech, anti-reading cultural practices to others. I can only imagine that the result would be a subculture that was obsessively self-congratulatory about its own superiority in “truthseeking”, while simultaneously blind to everything outside itself. People spending their lives immersed in that culture wouldn’t necessarily notice anything was wrong from the inside. What could you say to help them?

An Analogy to Reinforcement Learning From Human Feedback

Pointing out problems is easy. Finding solutions is harder.

The training pipeline for frontier AI systems typically includes a final step called reinforcement learning from human feedback (RLHF). After training a “base” language model that predicts continuations of internet text, supervised fine-tuning is used to make the model respond in the form of an assistant answering user questions, but making the assistant responses good is more work. It would be expensive to hire a team of writers to manually compose the thousands of user-question–assistant-response examples needed to teach the model to be a good assistant. The solution is RLHF: a reward model (often just the same language model with a different final layer) is trained to predict the judgments of human raters about which of a pair of model-generated assistant responses is better, and the model is optimized against the reward model.
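The "predict the judgments of human raters" step can be sketched concretely. The following is a generic Bradley-Terry-style pairwise loss of the kind commonly used for reward modeling, not any particular lab's pipeline; the function name and scores are illustrative:

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """Loss for one human comparison: -log P(rater-preferred response wins),
    where P(win) = sigmoid(r_chosen - r_rejected)."""
    delta = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-delta)))

# If the reward model already scores the rater-preferred response higher,
# the loss is small; if it scores it lower, the loss is large, pushing the
# model's scores toward agreement with the human raters.
print(pairwise_loss(2.0, 0.0))  # small
print(pairwise_loss(0.0, 2.0))  # large
```

Note that nothing in this objective references what is actually true or good; it only references what the raters preferred, which is the crux of the problem discussed next.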

The problem with the solution is that human feedback (and the reward model’s prediction of it) is imperfect. The reward model can’t tell the difference between “The AI is being good” and “The AI looks good to the reward model”. This already has the failure mode of sycophancy, in which today’s language model assistants tell users what they want to hear, but theory and preliminary experiments suggest that much larger harms (up to and including human extinction) could materialize from future AI systems deliberately deceiving their overseers—not because they suddenly “woke up” and defied their training, but because what we think we trained them to do (be helpful, honest, and harmless) isn’t what we actually trained them to do (perform whatever computations were the antecedents of reward on the training distribution).

The problem doesn’t have any simple, obvious solution. In the absence of some sort of international treaty to halt all AI development worldwide, “Just don’t do RLHF” isn’t feasible and doesn’t even make any sense; you need some sort of feedback in order to make an AI that does anything useful at all.

The problem may or may not ultimately be solvable with some sort of complicated, nonobvious solution that tries to improve on naïve RLHF. Researchers are hard at work studying alternatives involving red-teaming, debate, interpretability, mechanistic anomaly detection, and more.

But the first step on the road to some future complicated solution to the problem of naïve RLHF is acknowledging that the problem is at least potentially real, and having some respect that the problem might be difficult, rather than just eyeballing the results of RLHF and saying that it looks great.

If a safety auditor comes to the CEO of an AI company expressing concerns about the company’s RLHF pipeline being unsafe due to imperfect rater feedback, it’s more reassuring if the CEO says, “Yes, we thought of that, too; we’ve implemented these-and-such mitigations and are monitoring such-and-these signals which we hope will clue us in if the mitigations start to fail.”

If the CEO instead says, “Well, I think our raters are great. Are you insulting our raters?”, that does not inspire confidence. The natural inference is that the CEO is mostly interested in this quarter’s profits and doesn’t really care about safety.

Similarly, the problem with selection effects on approved information, in which your salon can’t tell the difference between “Our ideas are good” and “Our ideas look good to us,” doesn’t have any simple, obvious solution. “Just don’t filter information” isn’t feasible and doesn’t even make any sense; you need some sort of filter because it’s not physically possible to read everything and respond to everything.

The problem may or may not ultimately be solvable with some complicated solution involving prediction markets, adversarial collaborations, anonymous criticism channels, or any number of other mitigations I haven’t thought of, but the first step on the road to some future complicated solution is acknowledging that the problem is at least potentially real, and having some respect that the problem might be difficult. If alarmed members come to the organizers of the salon with concerns about collective belief distortions due to suppression of information and the organizers meet them with silence, “bowing out”, or defensive blustering, rather than “Yes, we thought of that, too,” that does not inspire confidence. The natural inference is that the organizers are mostly interested in maintaining the salon’s prestige and don’t really care about the truth.

“Yes, and—” Requires the Possibility of “No, Because—”

Scott Garrabrant gives a number of examples to illustrate that “Yes Requires the Possibility of No”. We can understand the principle in terms of information theory. Consider the answer to a yes-or-no question as a binary random variable. The “amount of information” associated with a random variable is quantified by the entropy, the expected value of the negative logarithm of the probability of the outcome. If we know in advance of asking that the answer to the question will always be Yes, then the entropy is −P(Yes)·log(P(Yes)) − P(No)·log(P(No)) = −1·log(1) − 0·log(0) = 0.[1] If you already knew what the answer would be, then the answer contains no information; you didn’t learn anything new by asking.
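The calculation is easy to verify in code. A minimal sketch, using base-2 logarithms and adopting the 0·log(0) = 0 convention from the footnote:

```python
import math

def entropy(p_yes):
    """Entropy (in bits) of a yes-or-no answer, treating 0*log(0) as 0."""
    total = 0.0
    for p in (p_yes, 1.0 - p_yes):
        if p > 0:
            total -= p * math.log2(p)
    return total

print(entropy(0.5))  # 1.0: a maximally uncertain answer carries a full bit
print(entropy(1.0))  # 0.0: a guaranteed "Yes" tells you nothing
```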


In the art of improvisational theater (“improv” for short), actors perform scenes that they make up as they go along. Without a script, each actor’s choices of what to say and do amount to implied assertions about the fictional reality being portrayed, which have implications for how the other actors should behave. A choice that establishes facts or gives direction to the scene is called an offer. If an actor opens a scene by asking their partner, “Is it serious, Doc?”, that’s an offer that the first actor is playing a patient awaiting diagnosis, and the second actor is playing a doctor.

A key principle of improv is often known as “Yes, and” after an exercise that involves starting replies with those words verbatim, but the principle is broader and doesn’t depend on the particular words used: actors should “accept” offers (“Yes”), and respond with their own complementary offers (“and”). The practice of “Yes, and” is important for maintaining momentum while building out the reality of the scene.

Rejecting an offer is called blocking, and is frowned upon. If one actor opens the scene with, “Surrender, Agent Stone, or I’ll shoot these hostages!”—establishing a scene in which they’re playing an armed villain being confronted by an Agent Stone—it wouldn’t do for their partner to block by replying, “That’s not my name, you don’t have a gun, and there are no hostages.” That would halt the momentum and confuse the audience. Better for the second actor to say, “Go ahead and shoot, Dr. Skull! You’ll find that my double agent on your team has stolen your bullets”—accepting the premise (“Yes”), then adding new elements to the scene (“and”, the villain’s name and the double agent).

Notice a subtlety: the Agent Stone character isn’t “Yes, and”-ing the Dr. Skull character’s demand to surrender. Rather, the second actor is “Yes, and”-ing the first actor’s worldbuilding offers (where the offer happens to involve their characters being in conflict). Novice improvisers are sometimes tempted to block to try to control the scene when they don’t like their partner’s offers, but it’s almost always a mistake. Persistently blocking your partner’s offers kills the vibe, and with it, the scene. No one wants to watch two people arguing back-and-forth about what reality is.


Proponents of collaborative truthseeking think that many discussions benefit from a more “open” or “interpretive” mode in which participants prioritize constructive contributions that build on each other’s work rather than tearing each other down.

The analogy to improv’s “Yes, and” doctrine writes itself, right down to the subtlety that collaborative truthseeking does not discourage disagreement as such—any more than improv forbids the characters in a sketch from being in conflict. What’s discouraged is the persistent blocking of offers, refusing to cooperate with the “scene” of discourse your partner is trying to build. Partial disagreement with polite elaboration (“I see what you’re getting at, but have you considered …”) is typically part of the offer—that we’re “playing” reasonable people having a cooperative intellectual discussion. Only wholesale negation (“That’s not a thing”) is blocking—by rejecting the offer that we’re both playing reasonable people.

Whatever you might privately think of your interlocutor’s contribution, it’s not hard to respond in a constructive manner without lying. Like a good improv actor, you can accept their contribution to the scene/discourse (“Yes”), then add your own contribution (“and”). If nothing else, you can write about how their comment reminded you of something else you’ve read, and your thoughts about that.

Reading over a discussion conducted under such norms, it’s easy to not see a problem. People are building on each other’s contributions; information is being exchanged. That’s good, right?

The problem is that while the individual comments might (or might not) make sense when read individually, the harmonious social exchange of mutually building on each other’s contributions isn’t really a conversation unless the replies connect to each other in a less superficial way—the kind of connection that risks blocking.

What happens when someone says something wrong or confusing or unclear? If their interlocutor prioritizes correctness and clarity, the natural behavior is to say, “No, that’s wrong, because …” or “No, I didn’t understand that”—and not only that, but to maintain that “No” until clarity is forthcoming. That’s blocking. It feels much more cooperative to let it pass in order to keep the scene going—with the result that falsehood, confusion, and unclarity accumulate as the interaction goes on.

There’s a reason improv is almost synonymous with improv comedy. Comedy thrives on absurdity: much of the thrill and joy of improv comedy is in appreciating what lengths of cleverness the actors will go to maintain the energy of a scene that has long since lost any semblance of coherence or plausibility. The rules that work for improv comedy don’t even work for (non-improvised, dramatic) fiction; they certainly won’t work for philosophy.

Per Garrabrant’s principle, the only way an author could reliably expect discussion of their work to illuminate what they’re trying to communicate is if they knew they were saying something the audience already believed. If you’re thinking carefully about what the other person said, you’re often going to end up saying “No” or “I don’t understand”, not just “Yes, and”: if you’re committed to validating your interlocutor’s contribution to the scene before providing your own, you’re not really talking to each other.


  1. I’m glossing over a technical subtlety here by assuming—pretending?—that 0·log(0) = 0, when log(0) is actually undefined. But it’s the correct thing to pretend, because the linear factor p goes to zero faster than log p can go to negative infinity. Formally, by L’Hôpital’s rule: \lim_{p \to 0^+} p \log(p) = \lim_{p \to 0^+} \frac{\log(p)}{1/p} = \lim_{p \to 0^+} \frac{1/p}{-1/p^2} = \lim_{p \to 0^+} (-p) = 0


The Relationship Between Social Punishment and Shared Maps

A punishment is when one agent (the punisher) imposes costs on another (the punished) in order to affect the punished’s behavior. In a Society where thieves are predictably imprisoned and lashed, people will predictably steal less than they otherwise would, for fear of being imprisoned and lashed.

Punishment is often imposed by formal institutions like police and judicial systems, but need not be. A controversial orator who finds a rock thrown through her window can be said to have been punished in the same sense: in a Society where controversial orators predictably get rocks thrown through their windows, people will predictably engage in less controversial speech, for fear of getting rocks thrown through their windows.

In the most basic forms of punishment, which we might term “physical”, the nature of the cost imposed on the punished is straightforward. No one likes being stuck in prison, or being lashed, or having a rock thrown through her window.

But subtler forms of punishment are possible. Humans are an intensely social species: we depend on friendship and trade with each other in order to survive and thrive. Withholding friendship or trade can be its own form of punishment, no less devastating than a whip or a rock. This is called “social punishment”.

Effective social punishment usually faces more complexities of implementation than physical punishment, because of the greater number of participants needed in order to have the desired deterrent effect. Throwing a rock only requires one person to have a rock; effectively depriving a punishment-target of friendship may require many potential friends to withhold their beneficence.

How is the collective effort of social punishment to be coordinated? If human Societies were hive-minds featuring an Authority that could broadcast commands to be reliably obeyed by the hive’s members, then there would be no problem. If the hive-queen wanted to socially punish Mallory, she could just issue a command, “We’re giving Mallory the silent treatment now”, and her majesty’s will would be done.

No such Authority exists. But while human Societies lack a collective will, they often have something much closer to collective beliefs: shared maps that (hopefully) reflect the territory. No one can observe enough or think quickly enough to form her own independent beliefs about everything. Most of what we think we know comes from others, who in turn learned it from others. Furthermore, one of our most decision-relevant classes of belief concerns the character and capabilities of other people with whom we might engage in friendship or trade relations.

As a consequence, social punishment is typically implemented by means of reputation: spreading beliefs about the punishment-target that merely imply that benefits should be withheld from the target, rather than by directly coordinating explicit sanctions. Social punishers don’t say, “We’re giving Mallory the silent treatment now.” (Because, who’s we?) They simply say that Mallory is stupid, dishonest, cruel, ugly, &c. These are beliefs that, if true, imply that people will do worse for themselves by helping Mallory. (If Mallory is stupid, she won’t be as capable of repaying favors. If she’s dishonest, she might lie to you. If she’s cruel … &c.) Negative-valence beliefs about Mallory double as “social punishments”, because if those beliefs appear on shared maps, the predictable consequence will be that Mallory will be deprived of friendship and trade opportunities.

We notice a critical difference between social punishments and physical punishments. Beliefs can be true or false. A rock or a jail cell is not a belief. You can’t say that the rock is false, but you can say it’s false that Mallory is stupid.

The linkage between collective beliefs and social punishment creates distortions that are important to track. People have an incentive to lie to prevent negative-valence beliefs about themselves from appearing on shared maps (even if the beliefs are true). People who have enemies whom they hate have an incentive to lie to insert negative-valence beliefs about their enemies onto the shared map (even if the beliefs are false). The stakes are high: an erroneously thrown rock only affects its target, but an erroneous map affects everyone using that map to make decisions about the world (including decisions about throwing rocks).

Intimidated by the stakes, some actors in Society who understand the similarity between social and physical punishment, but don't understand the relationship between social punishment and shared maps, might try to take steps to limit social punishment. It would be bad, they reason, if people were trapped in cycles of mutual physical recrimination. Nobody wins if I throw a rock through your window to retaliate for you throwing a rock through my window, &c. Better to foresee that and make sure no one throws any rocks at all, or at least not big ones. They imagine that they can apply the same reasoning to social punishments without paying any costs to the accuracy of shared maps, that we can account for social standing and status in our communication without sacrificing any truthseeking.

It’s mostly an illusion. If Alice possesses evidence that Mallory is stupid, dishonest, cruel, ugly, &c., she might want to publish that evidence in order to improve the accuracy of shared maps of Mallory’s character and capabilities. If the evidence is real and its recipients understand the filters through which it reached them, publishing the evidence is prosocial, because it helps people make higher-quality decisions regarding friendship and trade opportunities with Mallory.

But it also functions as social punishment. If Alice tries to disclaim, “Look, I’m not trying to ‘socially punish’ Mallory; I’m just providing evidence to update the part of the shared map which happens to be about Mallory’s character and capabilities”, then Bob, Carol, and Dave probably won’t find the disclaimer very convincing.

And yet—might not Alice be telling the truth? There are facts of the matter that are relevant to whether Mallory is stupid, dishonest, cruel, ugly, &c.! (Even if we’re not sure where to draw the boundary of dishonest, if Mallory said something false, and we can check that, and she knew it was false, and we can check that from her statements elsewhere, that should make people more likely to affirm the dishonest characterization.) Those words mean things! They’re not rocks—or not only rocks. Is there any way to update the shared map without the update itself being construed as “punishment”?

It’s questionable. One might imagine that by applying sufficient scrutiny to nuances of tone and word choice, Alice might succeed at “neutrally” conveying the evidence in her possession without any associated scorn or judgment.

But judgments supervene on facts and values. If lying is bad, and Mallory lied, it logically follows that Mallory did a bad thing. There’s no way to avoid that implication without denying one of the premises. Nuances of tone and wording that seem to convey an absence of judgment might only succeed at doing so by means of obfuscation: strained abuses of language whose only function is to make it less clear to the inattentive reader that the thing Mallory did was lying.

At best, Alice might hope to craft the publication of the evidence in a way that omits her own policy response. There is a real difference between merely communicating that Mallory is stupid, dishonest, cruel, ugly, &c. (with the understanding that other people will use this information to inform their policies about trade opportunities), and furthermore adding that “therefore I, Alice, am going to withhold trade opportunities from Mallory, and withhold trade opportunities from those who don’t withhold trade opportunities from her.” The additional information about Alice’s own policy response might be exposed by fiery rhetoric choices and concealed by more clinical descriptions.

Is that enough to make the clinical description not a “social punishment”? Personally, I buy it, but I don’t think Bob, Carol, or Dave do.

"Friends Can Change the World"; Or, Request for Social Technology: Credit-Assignment Rituals

As a human living in a human civilization, it's tempting to think that social reality mostly makes sense. Everyone allegedly knows that institutions are flawed and that our leaders are merely human. Everyone wants to think that they're sufficiently edgy and cynical, that they've seen through the official lies to the true, gritty reality.

But what if ... what if almost no one is edgy and cynical enough? Like, the only reason you think there's a true, gritty reality out there that you think you can see through to is because you're a predatory animal with a brain designed by evolution to murder other forms of life for the benefit of you, your family, and your friends.

To the extent that we have this glorious technological civilization that keeps most of us mostly safe and mostly happy most of the time, it's mostly because occasionally, one of the predatory animals happens to try out a behavior that happens to be useful, and then all of her friends copy it, and then all of the animals have the behavior.

Some conceited assholes who think they're smart also like to talk about things that they think make the last five hundred years or whatever different: things like science (a social competition that incentivizes the animals to try to mirror the process of Bayesian updating), markets (a pattern of incentives that mirrors the Bayes-structure of microeconomic theory), or democracy (a corporate governance structure that mirrors the Bayes-structure of counterfactual civil war amongst equals).

These causal processes are useful and we should continue to cooperate with them. They sort of work. But they don't work very well. We're mostly still animals organized into interlocking control systems that suppress variance.

Thus—

School Is Not About Learning
Politics Is Not About Policy
Effective Altruism Doesn't Work; Try to Master Unadulterated Effective First
Ideology Makes You Stupid
Status Makes You Stupid
Institutions Don't Work
Discourse Doesn't Work
Language Doesn't Work
No One Knows Anything
No One Has Ever Known Anything
Don't Read the Comments
Never Read the Comments
For All x and y, x Is Not About y
X Has Never Been About Y
Enjoy Arby's

But this is crazy. Suppressing variance feels like a good idea because variance is scary (because it means very bad things could happen as well as very good things, and bad things are scarier than good things are fun) and we want to be safe. But like, the way to actually make yourself safer is by acquiring optimization power, and then spending some of the power on safety measures! And the way you acquire optimization power is by increasing variance and then rewarding the successes!

Anyway, maybe someone should be looking for social technologies that mirror the Bayes-structure of the universe sort of like how science, markets, or democracy do, but which also take into account that we're not anything remotely like agents and are instead animals that want to help our friends. ("We need game theory for monkeys and game theory for rocks.")

So, I had an idea. You know how some people say we should fund the solutions to problems with after-the-fact prizes, rather than picking a team in advance that we think might solve the problem and funding them? What if ... you did something like that, but on a much smaller scale? A personal scale.

Like, suppose you've just successfully navigated a major personal life crisis that could have gone much worse if it weren't for some of the people in your life (both thanks to direct help they provided during the crisis, and things you learned from them that made you the sort of person that could navigate the crisis successfully). These people don't and shouldn't expect a reward (that's what friends are for) ... but maybe you could reward them anyway (with a special emphasis on people who helped you in low-status ways that you didn't understand at the time) in some sort of public ritual, to make them more powerful and incentivize others to emulate them, thereby increasing the measure of algorithms that result in humans successfully navigating major personal life crises.

It might look something like this—

  • If you have some spare money lying around, set aside some of it for rewarding the people you want to reward. If you don't have any spare money lying around, this ritual will be less effective! Maybe you should fix that!

  • Decide how much of the money you want to use to reward each of the people you want to reward.

(Note: giving away something as powerful as money carries risks of breeding dependence and resentment if such gifts come to be expected! If people know that you've been going through a crisis and anyone so much as hints that they think they deserve an award, that person is missing the point and therefore does not deserve an award.)

  • Privately go to each of the people, explain all this, and give them the amount of money you decided to give them. Make it very clear that this is a special unilateral one-time award made for decision-theoretic reasons and that it's very important that they accept it in the service of your mutual coherent extrapolated volition in accordance with the Bayes-structure of the universe. Refuse to accept words of thanks (it's not about you; it's not about me; it's about credit-assignment). If they try to refuse the money, explain that you will literally burn that much money in paper currency if they don't take it. (Shredding instead of burning is also acceptable.)

  • Ask if they'd like to be publicly named and praised as having received an award as part of the credit-assignment ritual. (Remember that it's quite possible and understandable and good that they might want to accept the money, but not be publicly praised by you. After all, if you're the sort of person who is considering actually doing this, you're probably kind of weird! Maybe people don't want to be associated with you!)

  • To complete the ritual, publish a blog post naming the people and the awards they received. People who preferred not to be named should be credited as Anonymous Friend A, B, C, &c. Also list the amount of money you burned or shredded if anyone foolishly rejected their award in defiance of the Bayes-structure of the universe. Do not explain the nature of the crisis or how the named people helped you. (You might want to tell the story in a different post, but that's not part of the ritual, which is about credit-assignment.)

Dreaming of Political Bayescraft

My old political philosophy: "Socially liberal, fiscally confused; I don't know how to run a goddamned country (and neither do you)."

Commentary: Pretty good, but not quite meta enough.

My new political philosophy: "Being smart is more important than being good (for humans). All ideologies are false; some are useful."

Commentary: Social design space is very large and very high-dimensional; the forces of memetic evolution are somewhat benevolent (all ideas that you've heard of have to be genuinely appealing to some feature of human psychology, or no one would have an incentive to tell you about them), but really smart people who know lots of science and lots of probability and game theory might be able to do better for themselves! Any time you find yourself being tempted to be loyal to an idea, it turns out that what you should actually be loyal to is whatever underlying feature of human psychology makes the idea look like a good idea; that way, you'll find it easier to fucking update when it turns out that the implementation of your favorite idea isn't as fun as you expected! This stance is itself, technically, loyalty to an idea, but hopefully it's a sufficiently meta idea to avoid running into the standard traps while also being sufficiently object-level to have easily-discoverable decision-relevant implications and not run afoul of the principle of ultrafinite recursion ("all infinite recursions are at most three levels deep").

Type Theory

We never know what people are actually thinking; all we can do is make inferences from their behavior, including inferences about the inferences they're making.

Sometimes someone makes an expression or a comment that seems to carry an overtone of contempt; I know your type, it seems to say, and I disapprove. And there's a distinct pain in being on the receiving end of this, wanting to reply to the implication, but expecting to lack the shared context needed for the reply to begin to make sense—

"Yes, but I don't think you've adequately taken into account that I know that you know my type, that I know your type, that we can respect each other even if we are different types of creatures optimizing different things, and that I know that this is all relative to my inert, irrelevant sense of what I think you should adequately take into account, which I know that you may have no reason to care about."

Missing Refutations

It looks like the opposing all-human team is winning the exhibition game of me and my it's-not-chess engine (as White) versus everyone in the office who (unlike me) actually knows something about chess (as Black). I mean, naïvely, my team is up a bishop right now, but our king is pretty exposed, and the principal variation that generated one of our recent moves (16. Bxb4 Bf5 17. Kd1 Qxd4+ 18. Kc1 Ng3 19. Qxc7 Nxh1) looks dreadful.

Real chess aficionados (chessters? chessies?) will laugh at me, but it actually took me a while to understand why Ng3 was in that principal variation (I might even have invoked the engine again to help). The position after Ng3 looks like

    a b c d e f g h
 8 ♜       ♜   ♚   
 7 ♟ ♟ ♟     ♟ ♟ ♟ 
 6                 
 5           ♝     
 4   ♗   ♛         
 3 ♙           ♞   
 2   ♙ ♕     ♙ ♙ ♙ 
 1 ♖ ♘ ♔     ♗   ♖ 

and—forgive me—I didn't understand why that wasn't refuted by fxg3 or hxg3; in my novice's utter blindness, I somehow failed to see the discovered attack on the white queen, the necessity of evading which allows the black knight to capture the white rook, and preparation for which was clearly the purpose of 16. … Bf5 (insofar as we—anthropomorphically?—attribute purpose to a sequence of moves discovered by a minimax search algorithm which doesn't represent concepts like discovered attack anywhere).
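For readers who, like me, don't see these things at a glance, the tactic is easy to check mechanically. Here's a minimal stdlib-only sketch (the piece placements are my own transcription of the diagram above, so any transcription error is mine): a bishop attacks a square if they share a diagonal and every square in between is empty, which is exactly what the knight's departure from e4 changes.

```python
# Sanity-check the "discovered attack" in the diagrammed position.
# Squares are (file, rank) pairs with files a=0..h=7, ranks 1..8 -> 0..7.

def sq(name):
    """Convert algebraic notation like 'f5' to (file, rank) coordinates."""
    return (ord(name[0]) - ord('a'), int(name[1]) - 1)

# Position after 18...Ng3 (White to move), transcribed from the diagram:
pieces = {
    sq('a8'): 'r', sq('e8'): 'r', sq('g8'): 'k',
    sq('a7'): 'p', sq('b7'): 'p', sq('c7'): 'p',
    sq('f7'): 'p', sq('g7'): 'p', sq('h7'): 'p',
    sq('f5'): 'b',
    sq('b4'): 'B', sq('d4'): 'q',
    sq('a3'): 'P', sq('g3'): 'n',
    sq('b2'): 'P', sq('c2'): 'Q',
    sq('f2'): 'P', sq('g2'): 'P', sq('h2'): 'P',
    sq('a1'): 'R', sq('b1'): 'N', sq('c1'): 'K',
    sq('f1'): 'B', sq('h1'): 'R',
}

def bishop_attacks(board, frm, to):
    """True if a bishop on `frm` attacks `to`: same diagonal, nothing between."""
    df, dr = to[0] - frm[0], to[1] - frm[1]
    if abs(df) != abs(dr) or df == 0:
        return False
    step = (df // abs(df), dr // abs(dr))
    cur = (frm[0] + step[0], frm[1] + step[1])
    while cur != to:
        if cur in board:
            return False  # something blocks the diagonal
        cur = (cur[0] + step[0], cur[1] + step[1])
    return True

# With the knight now on g3, the f5-bishop hits the white queen on c2
# through the empty squares e4 and d3:
assert bishop_attacks(pieces, sq('f5'), sq('c2'))

# Before ...Ng3, the knight stood on e4, blocking that same diagonal:
before = dict(pieces)
del before[sq('g3')]
before[sq('e4')] = 'n'
assert not bishop_attacks(before, sq('f5'), sq('c2'))

print("Ng3 discovers the f5-bishop's attack on the c2 queen.")
```

So fxg3 or hxg3 loses the queen to ...Bxc2; White's queen must move instead, and the knight lives long enough to take the h1 rook.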

Continue reading

Mirage

(just some quick notes, hopefully in the spirit of delightfully quirky symmetry-breaking)

In her little 2010 book The Mirage of a Space Between Nature and Nurture, Evelyn Fox Keller examines some of the eternal conceptual confusions surrounding the perennially popular nature/nurture question. Like, it's both, and everyone knows it's both, so why can't the discourse move on to more interesting and well-specified questions? That the oppositional form of the question isn't well-specified can be easily seen just from simple thought experiments. One such from the book: if one person has PKU, a high-phenylalanine diet, and a low IQ, and another person doesn't have PKU, eats a low-phenylalanine diet, and has a normal IQ, we can't attribute the IQ difference to either diet or genetics alone; the question dissolves once you understand the causal mechanism. Keller argues that the very idea of distinguishing heredity and environment as distinct, separable, exclusive alternatives whose relative contributions can be compared is a historically recent one that we can probably blame on Francis Galton.

The "Bay Area" was ostensibly hosting the big game this year. They blocked off a big swath around the Embarcadero this last week to put on Super Bowl City, "a free-to-the-public fan village [...] with activities, concerts, and more." I really don't see how much sense this makes, given that the actual game was 45 miles away in Santa Clara, just as I don't think we (can I still say we if I only work in the city?) really have a football team anymore; I like to imagine someone just forgot to rename them the Santa Clara 49ers. Even if you don't think Santa Clara is big enough to be a real city—and it's bigger than Green Bay—then why not San Jose, which is a lot closer? I think I would forgive it if the marketers had at least taken advantage of the golden (sic) opportunity to flaunt the single-"digit" Roman numeral L (so graceful! so succinct!), but for some dumb reason they went Arabic this year and called it Super Bowl 50.

Anyway, on a whim, I toured through Super Bowl City after work on Friday. It was as boring as it was packed, and it was packed. I wasn't sure if my whimsy was worth waiting in the throng of people to get in the obvious entrance on Market Street (the metal-detection security theater really took its toll on throughput), but I happened to hear a docent shouting that there was a less-crowded entrance if you went around and took a left each on Beale and Mission, so I did that. There were attractions, I guess?—if you could call them that. There were rooms with corporate exhibits, and an enormous line to try some be-the-quarterback VR game, and loud recorded music, and a stage with live music, and an empty stage where TV broadcasts would presumably be filmed later. There was a big statue of a football made out of cut-up beer cans near one of the stands where they were selling beer for $8, which sounded really expensive to me, although admittedly I don't have much of a sense for how much beer normally costs.

In summary, I didn't see the appeal of the "fan village," although I do understand what it feels like to be enthusiastic about the game itself—I really do, even if I haven't been paying much attention in recent years.

Continue reading

"I Have the Honor to Be Your Obedient Servant"

A friend of the blog recently told me that I'm meaner in meatspace (what some prefer to call by the bizarre misnomer "real life") than you would guess from my online persona. I'm not proud to have prompted this observation, but I didn't deny it, either. And yet—insofar as one has any reflectively-endorsed non-nice social impulses (to create incentives for good behavior, or perhaps from an ungentle although-sadistic-would-be-far-too-strong-of-a-word æsthetic that appreciates a world in which people don't always get everything they want), it does seem like the correct strategy: in meatspace, you can react to verbal and nonverbal cues in real time and try to smooth things over if you go too far, whereas in the blogosphere, it's possible to die in a harrowing thermonuclear flamewar and not even know until you check your messages the next day. We must use diplomacy where we cannot wield our weapons so precisely.

Dismal Science

There's something that feels viscerally distasteful and fundamentally morally dubious about looking for a job or a significant other. Search and comparison are for crass, commonplace, material things: we might say that this brand of soap smells nice, but is expensive, or that this car gets poor mileage, but is cheap, and while we may err in our judgment of any particular product, the general procedure must be regarded as legitimate: there's nothing problematic about going out to shop for some soap or a car and purchasing the best that happens to be available on one's budget, even if there's no sense of destiny and perfection about the match. Rather, we want to be clean, and we want to go places, and we took action to make these things come to pass.

Continue reading

Engineering Selection

This whole business of being alive used to seem so much simpler and less morally ambiguous before I realized that the strong do what they can and the weak suffer what they must, that it has always been thus and could not have been otherwise. The other day I was reading Luke Muehlhauser's interview with Steve Hsu, and Hsu says:

Let me add that, in my opinion, each society has to decide for itself (e.g. through democratic process) whether it wants to legalize or forbid activities that amount to genetic engineering. Intelligent people can reasonably disagree as to whether such activity is wise.

There was once a time in my youth when I would have objected with principled transhumanist/libertarian fervor against the suggestion that the glorious potential of designer babies might be suppressed by the tyranny of the majority.

I don't have (those kinds of) principles anymore. Nor faith that freedom to enhance will inevitably turn out to be for the best. These days, my thoughts are more attuned to practical concerns. Oh, I'm sure he's just saying that because it sounds nice and deferential to contemporary political sensibilities and he doesn't want to catch any more flak than he does already. Obviously, the societies that forbid it are just going to get crushed under the boot of history.

Think about it. The arrival of Europeans in North America didn't go very well for the people who were already here—and that was just a matter of mere guns, germs, and steel (in Jared Diamond's immortal phrase). What happens to our precious concept of democratic process when someone has the option to mass-produce von Neumann-level intellects to design the next generation of superguns, ultragerms, and adamantium-unobtanium alloy?

The Future of Ideas

William Gibson famously said, "The future is already here—it's just not very evenly distributed." It's easy to imagine a science-fictional fantasy world where everything is made of diamond and plastic, and literally everyone has their own brigade of robots, spacepacks, and jetcars to do their bidding, but as Gibson points out, the real world doesn't actually work like this: there's nothing contradictory about the high technology allowing you to read this post existing in the same world where millions of others are starving, thirsty, and illiterate. The Earth is just a very big place compared to what we know how to imagine personally; the wealth and wonders that exist in some places, don't exist everywhere. As long as this is true, we should expect variance in wealth to increase, as new toys for the rich get invented faster than the basics can be provisioned for everyone; Carlos Slim can purchase extravagances that hadn't been invented in the days of Cornelius Vanderbilt, but dying of malaria is the same as it's ever been.

A similar thing could be said about knowledge and ideas. Human civilization has been rapidly accumulating knowledge, but we're not getting proportionately more capable as individuals. People typically don't have the resources or inclination to learn deeply outside of their own specialties, and many never get to master any specialty at all. There's nothing contradictory about our brightest scholars seeing more deeply into the true structure of the world beneath the world than the uninitiated would have ever conceived possible, while at the same time, the masses labor under the most primitive of superstitions. As long as this is true, we should expect variance in knowledge to increase, as the cognitive elite continues to advance the frontier of the known faster than the basics can be taught to everyone; our master biologists know more about the nature of life than their analogues in the days of Darwin and Wallace, but to the proletariat, "God did it in six days" probably still sounds like as good of an explanation as it's ever been.

Draft of a Letter to a Former Teacher, Which I Did Not Send Because Doing So Would Be a Bad Idea

Dear [name redacted]:

So, I'm trying (mostly unsuccessfully) to stop being bitter, because I'm powerless to change anything, and so being bitter is a waste of time when I could be doing something useful instead, but I still don't understand how a good person like you can actually think our so-called educational system is actually a good idea. I can totally understand being practical and choosing to work within the system because it's all we've got; there's nothing wrong with selling out as long as you get a good price. If you think you're actually helping your students become better thinkers and writers, then that's great, and you should be praised for having more patience than me. But I don't understand how you can unambiguously say that this gargantuan soul-destroying engine of mediocrity deserves more tax money without at least displaying a little bit of uncertainty!

Continue reading

Goodhart's World

Someone needs to write a history of the entire world in terms of incentive systems and agents' attempts to game them. We have money to incentivize the production of useful goods and services, but we all know that there are lots of ways to make money that don't actually help anyone. Even in jobs that are actually useful, people spend a lot of their effort on trying to look like they're doing good work, rather than actually doing good work. And don't get me started about what passes for "education." (Seriously, don't.)

Much in a similar theme could be said about romance, and about economic systems in other places and times. And there's even a standpoint from which the things that we think are truly valuable for their own sake—wealth and happiness and true love, &c.—can be said to be the result of our species gaming the incentives that evolution built into us because they happened to promote inclusive genetic fitness in the ancestral environment.

The future is the same thing: superhuman artificial intelligence gaming the utility function we gave it, instead of the one we should have given it. Only there will be no one we'd recognize as a person to read or write that chapter.