Hazards of Selection Effects on Approved Information

Posted on 13 February 2026 by Zack M. Davis

In a busy, busy world, there’s so much to read that no one could possibly keep up with it all. You can’t not prioritize what you pay attention to and (even more so) what you respond to. Everyone and her dog tells herself a story that she wants to pay attention to “good” (true, useful) information and ignore “bad” (false, useless) information.

Keeping the story true turns out to be a harder problem than it sounds. Everyone and her dog knows that the map is not the territory, but the reason we need a whole slogan about it is because we never actually have unmediated access to the territory. Everything we think we know about the territory is actually just part of our map (the world-simulation our brains construct from sensory data), which makes it easy to lose track of whether your actions are improving the real territory, or just your view of it on your map.

For example, I like it when I have good ideas. It makes sense for me to like that. I endorse taking actions that will result in world-states in which I have good ideas.

The problem is that I might not be able to tell the difference between world-states in which I have good ideas, and world-states in which I think my ideas are good, but they’re actually bad. Those two different states of the territory would look the same on my map.

If my brain’s learning algorithms reinforce behaviors that lead to me having ideas that I think are good, then in addition to learning behaviors that make me have better ideas (like reading a book), I might also inadvertently pick up behaviors that prevent me from hearing about it if my ideas are bad (like silencing critics).

This might seem like an easy problem to solve, because the most basic manifestations of the problem are in fact pretty easy to solve. If I were to throw a crying fit and yell, “Critics bad! No one is allowed to criticize my ideas!” every time someone criticized my ideas, the problem with that would be pretty obvious to everyone and her dog, and I would stop getting invited to the salon.

But what if there were subtler manifestations of the problem, that weren’t obvious to everyone and her dog? Then I might keep getting invited to the salon, and possibly even spread the covertly dysfunctional behavior to other salon members. (If they saw the behavior seeming to work for me, they might imitate it, and their brain’s learning algorithms would reinforce it if it seemed to work for them.) What might those look like? Let’s try to imagine.

Filtering Interlocutors

Goofusia: I don’t see why you tolerate that distrustful witch Goody Osborne at your salon. Of course I understand the importance of criticism, which is an essential nutrient for any truthseeker. But you can acquire the nutrient without the downside of putting up with unpleasant people like her. At least, I can. I’ve already got plenty of perceptive critics in my life among my friends who want the truth, and know that I want the truth—who will assume my good faith, because they know my heart is in the right place.

Gallantina: But aren’t your friends who know you want the truth selected for agreeing with you, over and above their being selected for being correct? If there were some crushing counterargument to your beliefs that would only be found by someone who didn’t know that you want the truth and wouldn’t assume good faith, how would you ever hear about it?

This one is subtle. Goofusia isn’t throwing a crying fit every time a member of the salon criticizes her ideas. And indeed, you can’t invite the whole world to your salon. You can’t not do some sort of filtering. The question is whether salon invitations are being extended or withheld for “good” reasons (that promote the salon processing true and useful information) or “bad” reasons (that promote false or useless information).

The problem is that being friends with Goofusia and “know[ing] that [she and other salon members] want the truth” is a bad membership criterion, not a good one, because people who aren’t friends with Goofusia and don’t know that she wants the truth are likely to have different things to say. Even if Goofusia can answer all the critiques her friends can think of, that shouldn’t give her confidence that her ideas are solid, if there are likely to exist serious critiques that wouldn’t be independently reïnvented by the kinds of people who become Goofusia’s friends.

The “nutrient” metaphor is a tell. Goofusia seems to be thinking of criticism as if it were a homogeneous ingredient necessary for a healthy epistemic environment, but that it doesn’t particularly matter where it comes from. In analogy, it doesn’t matter whether you get your allowance of potassium from bananas or potatoes or artificial supplements. If you find bananas and potatoes unpleasant, you can still take supplements and get your potassium that way; if you find Goody Osborne unpleasant, you can just talk to your friends who know you want the truth and get your criticism that way.

But unlike chemically uniform nutrients, criticism isn’t homogeneous: different critics are differently equipped by virtue of their different intellectual backgrounds to notice different flaws in a piece of work. The purpose of criticism is not to virtuously endure being criticized; the purpose is to surface and fix every individual flaw. (If you independently got everything exactly right the first time, then there would be nothing for critics to do; it’s just that that seems pretty unlikely if you’re talking about anything remotely complicated. It would be hard to believe that such an unlikely-seeming thing had really happened without the toughest critics getting the chance to do their worst.)

“Knowing that (someone) wants the truth” is a particularly poor filter, because people who think that they have strong criticisms of your ideas are particularly likely to think that you don’t want the truth. (Because, the reasoning would go, if you did want the truth, why would you propose such flawed ideas, instead of independently inventing the obvious-to-them criticism yourself and dropping the idea without telling anyone?) Refusing to talk to people who think that they have strong criticisms of your ideas is a bad thing to do if you care about your ideas being correct.

The selection effect is especially bad in situations where the fact that someone doesn’t want the truth is relevant to the correct answer. Suppose Goofusia proposes that the salon buys cookies from a certain bakery—which happens to be owned by Goofusia’s niece. If Goofusia’s proposal was motivated by nepotism, that’s probabilistically relevant to evaluating the quality of the proposal. (If the salon members aren’t omniscient at evaluating bakery quality on the merits, then they can be deceived by recommendations made for reasons other than the merits.) The salon can debate back and forth about the costs and benefits of spending the salon’s snack budget at the niece’s bakery, but if no one present is capable of thinking “Maybe Goofusia is being nepotistic” (because anyone who could think that would never be invited to Goofusia’s salon), that bodes poorly for the salon’s prospects of understanding the true cost–benefit landscape of catering options.

Filtering Information Sources

Goofusia: One shouldn’t have to be the sort of person who follows discourse in crappy filter-bubbles in order to understand what’s happening. The Rev. Samuel Parris’s news summary roundups are the sort of thing that lets me do that. Our salon should work like that if it’s going to talk about the atheist threat and the witchcraft crisis. I don’t want to have to read the awful corners of the internet where this is discussed all day. They do truthseeking far worse there.

Gallantina: But then you’re turning your salon into a Rev. Parris filter bubble. Don’t you want your salon members to be well-read? Are you trying to save time, or are you worried about being contaminated by ideas that haven’t been processed and vetted by Rev. Parris?

This one is subtle, too. If Goofusia is busy and just doesn’t have time to keep up with what the world is saying about atheism and witchcraft, it might very well make sense to delegate her information gathering to Rev. Parris. That way, she can get the benefits of being mostly up to speed on these issues without having to burn too many precious hours that could be spent studying more important things.

The problem is that the suggestion doesn’t seem to be about personal time-saving. Rev. Parris is only one person; even if he tries to make his roundups reasonably comprehensive, he can’t help but omit information in ways that reflect his own biases. (For he is presumably not perfectly free of bias, and if he didn’t omit anything, there would be no time-saving value to his subscribers in being able to just read the roundup rather than having to read everything that Rev. Parris reads.) If some salon members are less busy than Goofusia and can afford to do their own varied primary source reading rather than delegating it all to Rev. Parris, Goofusia should welcome that—but instead, she seems to be suspicious of those who would “be the sort of person” who does that. Why?

The admonition that “They do truthseeking far worse there” is a tell. The implication seems to be that good truthseekers should prefer to only read material by other good truthseekers. Rev. Parris isn’t just saving his subscribers time; he’s protecting them from contamination, heroically taking up the burden of extracting information out of the dangerous ravings of non-truthseekers.

But it’s not clear why such a risk of contamination should exist. Part of the timeless ideal of being well-read is that you’re not supposed to believe everything you read. If I’m such a good truthseeker, then I should want to read everything I can about the topics I’m seeking the truth about. If the authors who publish such information aren’t such good truthseekers as I am, I should take that into account when performing updates on the evidence they publish, rather than denying myself the evidence.

Information is transmitted across the physical universe through links of cause and effect. If Mr. Proctor is clear-sighted and reliable, then when he reports seeing a witch, I infer that there probably was a witch. If the correlation across possible worlds is strong enough—if I think Mr. Proctor reports witches when there are witches, and not when there aren’t—then Mr. Proctor’s word is almost as good as if I’d seen the witch myself. If Mr. Corey has poor eyesight and is of a less reliable character, I am less credulous about reported witch sightings from him, but if I don’t face any particular time constraints, I’d still rather hear Mr. Corey’s testimony, because the value of information to a Bayesian reasoner is always nonnegative. For example, Mr. Corey’s report could corroborate information from other sources, even if it wouldn’t be definitive on its own. (Even the fact that people sometimes lie doesn’t fundamentally change the calculus, because the possibility of deception can be probabilistically “priced in”.)

That’s the theory, anyway. A potential reason to fear contamination from less-truthseeking sources is that perhaps the Bayesian ideal is too hard to practice and salon members are too prone to believe what they read. After all, many news sources have been adversarially optimized to corrupt and control their readers and make them less sane by seeing the world through ungrounded lenses.

But the means by which such sources manage to control their readers is precisely by capturing their trust and convincing them that they shouldn’t want to read the awful corners of the internet where they do truthseeking far worse than here. Readers who have mastered multiple ungrounded lenses and can check them against each other can’t be owned like that. If you can spare the time, being well-read is a more robust defense against the risk of getting caught in a bad filter bubble, than trying to find a good filter bubble and blocking all (presumptively malign) outside sources of influence. All the bad bubbles have to look good from the inside, too, or they wouldn’t exist.

To some, the risk of being in a bad bubble that looks good may seem too theoretical or paranoid to take seriously. It’s not like there are no objective indicators of filter quality. In analogy, the observation that dreaming people don’t know that they’re asleep, probably doesn’t make you worry that you might be asleep and dreaming right now.

But it being obvious that you’re not in one of the worst bubbles shouldn’t give you much comfort. There are still selection effects on what information gets to you, if for no other reason that there aren’t enough good truthseekers in the world to uniformly cover all the topics that a truthseeker might want to seek truth about. The sad fact is that people who write about atheism and witchcraft are disproportionately likely to be atheists or witches themselves, and therefore non-truthseeking. If your faith in truthseeking is so weak that you can’t even risk hearing what non-truthseekers have to say, that necessarily limits your ability to predict and intervene on a world in which atheists and witches are real things in the physical universe that can do real harm (where you need to be able to model the things in order to figure out which interventions will reduce the harm).

Suppressing Information Sources

Goofusia: I caught Goody Osborne distributing pamphlets quoting the honest and candid and vulnerable reflections of Rev. Parris on guiding his flock, and just trying to somehow twist that into maximum anger and hatred. It seems quite clear to me what’s going on in that pamphlet, and I think signal-boosting it is a pretty clear norm violation in my culture.

Gallantina: I read that pamphlet. It seemed like intellectually substantive satire of a public figure. If you missed the joke, it was making fun of an alleged tendency in Rev. Parris’s sermons to contain sophisticated analyses of the causes of various social ills, and then at the last moment, veer away from the uncomfortable implications and blame it all on witches. If it’s a norm violation to signal-boost satire of public figures, that’s artificially making it harder for people to know about flaws in the work of those public figures.

This one is worse. Above, when Goofusia filtered who she talks to and what she reads for bad reasons, she was in an important sense only hurting herself. Other salon members who aren’t sheltering themselves from information are unaffected by Goofusia’s preference for selective ignorance, and can expect to defeat Goofusia in public debate if the need arises. The system as a whole is self-correcting.

The invocation of “norm violations” changes everything. Norms depend on collective enforcement. Declaring something a norm violation is much more serious than saying that you disagree with it or don’t like it; it’s expressing an intent to wield social punishment in order to maintain the norm. Merely bad ideas can be criticized, but ideas that are norm-violating to signal-boost are presumably not even to be seriously discussed. (Seriously discussing a work is signal-boosting it.) Norm-abiding group members are required to be ignorant of their details (or act as if they’re ignorant).

Mandatory ignorance of anything seems bad for truthseeking. What is Goofusia thinking here? Why would this seem like a good idea to someone?

At a guess, the “maximum anger and hatred” description is load-bearing. Presumably the idea is that it’s okay to calmly and politely criticize Rev. Parris’s sermons; it’s only sneering or expressing anger or hatred that is forbidden. If the salon’s speech code only targets form and not content, the reasoning goes, then there’s no risk of the salon missing out on important content.

The problem is that the line between form and content is blurrier than many would prefer to believe, because words mean things. You can’t just swap in non-angry words for angry words without changing the meaning of a sentence. Maybe the distortion of meaning introduced by substituting nicer words is small, but then again, maybe it’s large: the only person in a position to say is the author. People don’t express anger and hatred for no reason. When they do, it’s because they have reasons to think something is so bad that it deserves their anger and hatred. Are those good reasons or bad reasons? If it’s norm-violating to talk about it, we’ll never know.

Unless applied with the utmost stringent standards of evenhandedness and integrity, censorship of form quickly morphs into censorship of content, as heated criticism of the ingroup is construed as norm-violating, while equally heated criticism of the outgroup is unremarkable and passes without notice. It’s one of those irregular verbs: I criticize; you sneer; she somehow twists into maximum anger and hatred.

The conjunction of “somehow” and “it seems quite clear to me what’s going on” is a tell. If it were actually clear to Goofusia what was going on with the pamphlet author expressing anger and hatred towards Rev. Parris, she would not use the word “somehow” in describing the author’s behavior: she would be able to pass the author’s ideological Turing test and therefore know exactly how.

If that were just Goofusia’s mistake, the loss would be hers alone, but if Goofusia is in a position of social power over others, she might succeed at spreading her anti-speech, anti-reading cultural practices to others. I can only imagine that the result would be a subculture that was obsessively self-congratulatory about its own superiority in “truthseeking”, while simultaneously blind to everything outside itself. People spending their lives immersed in that culture wouldn’t necessarily notice anything was wrong from the inside. What could you say to help them?

An Analogy to Reinforcement Learning From Human Feedback

Pointing out problems is easy. Finding solutions is harder.

The training pipeline for frontier AI systems typically includes a final step called reinforcement learning from human feedback (RLHF). After training a “base” language model that predicts continuations of internet text, supervised fine-tuning is used to make the model respond in the form of an assistant answering user questions, but making the assistant responses good is more work. It would be expensive to hire a team of writers to manually compose the thousands of user-question–assistant-response examples needed to teach the model to be a good assistant. The solution is RLHF: a reward model (often just the same language model with a different final layer) is trained to predict the judgments of human raters about which of a pair of model-generated assistant responses is better, and the model is optimized against the reward model.

The problem with the solution is that human feedback (and the reward model’s prediction of it) is imperfect. The reward model can’t tell the difference between “The AI is being good” and “The AI looks good to the reward model”. This already has the failure mode of sycophancy, in which today’s language model assistants tell users what they want to hear, but theory and preliminary experiments suggest that much larger harms (up to and including human extinction) could materialize from future AI systems deliberately deceiving their overseers—not because they suddenly “woke up” and defied their training, but because what we think we trained them to do (be helpful, honest, and harmless) isn’t what we actually trained them to do (perform whatever computations were the antecedents of reward on the training distribution).

The problem doesn’t have any simple, obvious solution. In the absence of some sort of international treaty to halt all AI development worldwide, “Just don’t do RLHF” isn’t feasible and doesn’t even make any sense; you need some sort of feedback in order to make an AI that does anything useful at all.

The problem may or may not ultimately be solvable with some sort of complicated, nonobvious solution that tries to improve on naïve RLHF. Researchers are hard at work studying alternatives involving red-teaming, debate, interpretability, mechanistic anomaly detection, and more.

But the first step on the road to some future complicated solution to the problem of naïve RLHF, is acknowledging that the the problem is at least potentially real, and having some respect that the problem might be difficult, rather than just eyeballing the results of RLHF and saying that it looks great.

If a safety auditor comes to the CEO of an AI company expressing concerns about the company’s RLHF pipeline being unsafe due to imperfect rater feedback, it’s more reassuring if the CEO says, “Yes, we thought of that, too; we’ve implemented these-and-such mitigations and are monitoring such-and-these signals which we hope will clue us in if the mitigations start to fail.”

If the CEO instead says, “Well, I think our raters are great. Are you insulting our raters?”, that does not inspire confidence. The natural inference is that the CEO is mostly interested in this quarter’s profits and doesn’t really care about safety.

Similarly, the problem with selection effects on approved information, in which your salon can’t tell the difference between “Our ideas are good” and “Our ideas look good to us,” doesn’t have any simple, obvious solution. “Just don’t filter information” isn’t feasible and doesn’t even make any sense; you need some sort of filter because it’s not physically possible to read everything and respond to everything.

The problem may or may not ultimately be solvable with some complicated solution involving prediction markets, adversarial collaborations, anonymous criticism channels, or any number of other mitigations I haven’t thought of, but the first step on the road to some future complicated solution is acknowledging that the problem is at least potentially real, and having some respect that the problem might be difficult. If alarmed members come to the organizers of the salon with concerns about collective belief distortions due to suppression of information and the organizers meet them with silence, “bowing out”, or defensive blustering, rather than “Yes, we thought of that, too,” that does not inspire confidence. The natural inference is that the organizers are mostly interested in maintaining the salon’s prestige and don’t really care about the truth.

Disagreement Comes From the Dark World

Posted on 27 January 2026 by Zack M. Davis

In “Truth or Dare”, Duncan Sabien articulates a phenomenon in which expectations of good or bad behavior can become self-fulfilling: people who expect to be exploited and feel the need to put up defenses both elicit and get sorted into a Dark World where exploitation is likely and defenses are necessary, whereas people who expect beneficence tend to attract beneficence in turn.

Among many other examples, Sabien highlights the phenomenon of gift economies: a high-trust culture in which everyone is eager to help each other out whenever they can is a nicer place to live than a low-trust culture in which every transaction must be carefully tracked for fear of enabling free-riders.

I’m skeptical of the extent to which differences between high- and low-trust cultures can be explained by self-fulfilling prophecies as opposed to pre-existing differences in trust-worthiness, but I do grant that self-fulfilling expectations can sometimes play a role: if I insist on always being paid back immediately and in full, it makes sense that that would impede the development of gift-economy culture among my immediate contacts. So far, the theory articulated in the essay seems broadly plausible.

Later, however, the post takes an unexpected turn:

Treating all of the essay thus far as prerequisite and context:

This is why you should not trust Zack Davis, when he tries to tell you what constitutes good conduct and productive discourse. Zack Davis does not understand how high-trust, high-cooperation dynamics work. He has never seen them. They are utterly outside of his experience and beyond his comprehension. What he knows how to do is keep his footing in a world of liars and thieves and pickpockets, and he does this with genuinely admirable skill and inexhaustible tenacity.

But (as far as I can tell, from many interactions across years) Zack Davis does not understand how advocating for and deploying those survival tactics (which are 100% appropriate for use in an adversarial memetic environment) utterly destroys the possibility of building something Better. Even if he wanted to hit the “cooperate” button—

(In contrast to his usual stance, which from my perspective is something like “look, if we all hit ‘defect’ together, in full foreknowledge, then we don’t have to extend trust in any direction and there’s no possibility of any unpleasant surprises and you can all stop grumping at me for repeatedly ‘defecting’ because we’ll all be cooperating on the meta level, it’s not like I didn’t warn you which button I was planning on pressing, I am in fact very consistent and conscientious.”)

—I don’t think he knows where it is, or how to press it.

(Here I’m talking about the literal actual Zack Davis, but I’m also using him as a stand-in for all the dark world denizens whose well-meaning advice fails to take into account the possibility of light.)

As a reader of the essay, I reply: wait, who? Am I supposed to know who this Davies person is? Ctrl-F search confirms that they weren’t mentioned earlier in the piece; there’s no reason for me to have any context for whatever this section is about.

As Zack Davis, however, I have a more specific reply, which is: yeah, I don’t think that button does what you think it does. Let me explain.

In figuring out what would constitute good conduct and productive discourse, it’s important to appreciate how bizarre the human practice of “discourse” looks in light of Aumann’s dangerous idea.

There’s only one reality. If I’m a Bayesian reasoner honestly reporting my beliefs about some question, and you’re also a Bayesian reasoner honestly reporting your beliefs about the same question, we should converge on the same answer, not because we’re cooperating with each other, but because it is the answer. When I update my beliefs based on your report on your beliefs, it’s strictly because I expect your report to be evidentially entangled with the answer. Maybe that’s a kind of “trust”, but if so, it’s in the same sense in which I “trust” that an increase in atmospheric pressure will exert force on the exposed basin of a classical barometer and push more mercury up the reading tube. It’s not personal and it’s not reciprocal: the barometer and I aren’t doing each other any favors. What would that even mean?

In contrast, my friends and I in a gift economy are doing each other favors. That kind of setting featuring agents with a mixture of shared and conflicting interests is the context in which the concepts of “cooperation” and “defection” and reciprocal “trust” (in the sense of people trusting each other, rather than a Bayesian robot trusting a barometer) make sense. If everyone pitches in with chores when they can, we all get the benefits of the chores being done—that’s cooperation. If you never wash the dishes, you’re getting the benefits of a clean kitchen without paying the costs—that’s defection. If I retaliate by refusing to wash any dishes myself, then we both suffer a dirty kitchen, but at least I’m not being exploited—that’s mutual defection. If we institute a chore wheel with an auditing regime, that reëstablishes cooperation, but we’re paying higher transaction costs for our lack of trust. And so on: Sabien’s essay does a good job of explaining how there can be more than one possible equilibrium in this kind of system, some of which are much more pleasant than others.

If you’ve seen high-trust gift-economy-like cultures working well and low-trust backstabby cultures working poorly, it might be tempting to generalize from the domains of interpersonal or economic relationships, to rational (or even “rationalist”) discourse. If trust and cooperation are essential for living and working together, shouldn’t the same lessons apply straightforwardly to finding out what’s true together?

Actually, no. The issue is that the payoff matrices are different.

Life and work involve a mixture of shared and conflicting interests. The existence of some conflicting interests is an essential part of what it means for you and me to be two different agents rather than interchangable parts of the same hivemind: we should hope to do well together, but when push comes to shove, I care more about me doing well than you doing well. The art of cooperation is about maintaining the conditions such that push does not in fact come to shove.

But correct epistemology does not involve conflicting interests. There’s only one reality. Bayesian reasoners cannot agree to disagree. Accordingly, when humans successfully approach the Bayesian ideal, it doesn’t particularly feel like cooperating with your beloved friends, who see you with all your blemishes and imperfections but would never let a mere disagreement interfere with loving you. It usually feels like just perceiving things—resolving disagreements so quickly that you don’t even notice them as disagreements.

Suppose you and I have just arrived at a bus stop. The bus arrives every half-hour. I don’t know when the last bus was, so I don’t know when the next bus will be: I assign a uniform probability distribution over the next thirty minutes. You recently looked at the transit authority’s published schedule, which says the bus will come in six minutes: most of your probability-mass is concentrated tightly around six minutes from now.

We might not consciously notice this as a “disagreement”, but it is: you and I have different beliefs about when the next bus will arrive; our probability distributions aren’t the same. It’s also very ephemeral: when I ask, “When do you think the bus will come?” and you say, “six minutes; I just checked the schedule”, I immediately replace my belief with yours, because I think the published schedule is probably right and there’s no particular reason for you to lie about what it says.

Alternatively, suppose that we both checked different versions of the schedule, which disagree: the schedule I looked at said the next bus is in twenty minutes, not six. When we discover the discrepancy, we infer that one of the schedules must have been outdated, and both adopt a distribution with most of the probability-mass in separate clumps around six and twenty minutes from now. Our initial beliefs can’t both have been right—but there’s no reason for me to weight my prior belief more heavily just because it was mine.

At worst, approximating ideal belief exchange feels like working on math. Suppose you and I are studying the theory of functions of a complex variable. We’re trying to prove or disprove the proposition that if an entire function satisfies f(x + 1) = f(x) for real x, then f(z + 1) = f(z) for all complex z. I suspect the proposition is false and set about trying to construct a counterexample; you suspect the proposition is true and set about trying to write a proof by contradiction. Our different approaches do seem to imply different probabilistic beliefs about the proposition, but I can’t be confident in my strategy just because it’s mine, and we expect the disagreement to be transient: as soon as I find my counterexample or you find your reductio, we should be able to share our work and converge.

Most real-world disagreements of interest don’t look like the bus arrival or math problem examples—qualitatively, not as a matter of trying to prove quantitatively harder theorems. Real-world disagreements tend to persist; they’re predictable—in flagrant contradiction of how the beliefs of Bayesian reasoners would follow a random walk. From this we can infer that typical human disagreements aren’t “honest”, in the sense that at least one of the participants is behaving as if they have some other goal than getting to the truth.

Importantly, this characterization of dishonesty is using a functionalist criterion: when I say that people are behaving as if they have some other goal than getting to the truth, that need not imply that anyone is consciously lying; “mere” bias is sufficient to carry the argument.

Dishonest disagreements end up looking like conflicts because they are disguised conflicts. The parties to a dishonest disagreement are competing to get their preferred belief accepted, where beliefs are being preferred for some reason other than their accuracy: for example, because acceptance of the belief would imply actions that would benefit the belief-holder. If it were true that my company is the best, it would follow logically that customers should buy my products and investors should fund me. And yet a discussion with me about whether or not my company is the best probably doesn’t feel like a discussion about bus arrival times or the theory of functions of a complex variable. You probably expect me to behave as if I thought my belief is better “because it’s mine”, to treat attacks on the belief as if they were attacks on my person: a conflict rather than a disagreement.

“My company is the best” is a particularly stark example of a typically dishonest belief, but the pattern is very general: when people are attached to their beliefs for whatever reason—which is true for most of the beliefs that people spend time disagreeing about, as contrasted to math and bus-schedule disagreements that resolve quickly—neither party is being rational (which doesn’t mean neither party is right on the object level). Attempts to improve the situation should take into account that the typical case is not that of truthseekers who can do better at their shared goal if they learn to trust each other, but rather of people who don’t trust each other because each correctly perceives that the other is not truthseeking.

Again, “not truthseeking” here is meant in a functionalist sense. It doesn’t matter if both parties subjectively think of themselves as honest. The “distrust” that prevents Aumann-agreement-like convergence is about how agents respond to evidence, not about subjective feelings. It applies as much to a mislabeled barometer as it does to a human with a functionally-dishonest belief. If I don’t think the barometer readings correspond to the true atmospheric pressure, I might still update on evidence from the barometer in some way if I have a guess about how its labels correspond to reality, but I’m still going to disagree with its reading according to the false labels.

There are techniques for resolving economic or interpersonal conflicts that involve both parties adopting a more cooperative approach, each being more willing to do what the other party wants (while the other reciprocates by doing more of what the first one wants). Someone who had experience resolving interpersonal conflicts using techniques to improve cooperation might be tempted to apply the same toolkit to resolving dishonest disagreements.

It might very well work for resolving the disagreement. It probably doesn’t work for resolving the disagreement correctly, because cooperation is about finding a compromise amongst agents with partially conflicting interests, and in a dishonest disagreement in which both parties have non-epistemic goals, trying to do more of what the other party functionally “wants” amounts to catering to their bias, not systematically getting closer to the truth.

Cooperative approaches are particularly dangerous insofar as they seem likely to produce a convincing but false illusion of rationality, despite the participants’ best of subjective conscious intentions. It’s common for discussions to involve more than one point of disagreement. An apparently productive discussion might end with me saying, “Okay, I see you have a point about X, but I was still right about Y.”

This is a success if the reason I’m saying that is downstream of you in fact having a point about X but me in fact having been right about Y. But another state of affairs that would result in me saying that sentence, is that we were functionally playing a social game in which I implicitly agreed to concede on X (which you visibly care about) in exchange for you ceding ground on Y (which I visibly care about).

Let’s sketch out a toy model to make this more concrete. “Truth or Dare” uses color perception as an illustration of confirmation bias: if you’ve been primed to make the color yellow salient, it’s easy to perceive an image as being yellower than it is.

Suppose Jade and Ruby consciously identify as truthseekers, but really, Jade is biased to perceive non-green things as green 20% of the time, and Ruby is biased to perceive non-red things as red 20% of the time. In our functionalist sense, we can model Jade as “wanting” to misrepresent the world as being greener than it is, and Ruby as “wanting” to misrepresent the world is being redder than it is.

Confronted with a sequence of gray objects, Jade and Ruby get into a heated argument: Jade thinks 20% of the objects are green and 0% are red, whereas Ruby thinks they’re 0% green and 20% red.

As tensions flare, someone who didn’t understand the deep disanalogy between human relations and epistemology might propose that Jade and Ruby should strive be more “cooperative”, establish higher “trust.”

What does that mean? Honestly, I’m not entirely sure, but I worry that if someone takes high-trust gift-economy-like cultures as their inspiration and model for how to approach intellectual disputes, they’ll end up giving bad advice in practice.

Cooperative human relationships result in everyone getting more of what they want. If Jade wants to believe that the world is greener than it is and Ruby wants to believe that the world is redder than it is, then naïve attempts at “cooperation” might involve Jade making an effort to see things Ruby’s way at Ruby’s behest, and vice versa. But Ruby is only going to insist that Jade make an effort to see it her way when Jade says an item isn’t red. (That’s what Ruby cares about.) Jade is only going to insist that Ruby make an effort to see it her way when Ruby says an item isn’t green. (That’s what Jade cares about.)

If the two (perversely) succeed at seeing things the other’s way, they would end up converging on believing that the sequence of objects is 20% green and 20% red (rather than the 0% green and 0% red that it actually is). They’d be happier, but they would also be wrong. In order for the pair to get the correct answer, then without loss of generality, when Ruby says an object is red, Jade needs to stand her ground: “No, it’s not red; no, I don’t trust you and won’t see things your way; let’s break out the Pantone swatches.” But that doesn’t seem very “cooperative” or “trusting”.

At this point, a proponent of the high-trust, high-cooperation dynamics that Sabien champions is likely to object that the absurd “20% green, 20% red” mutual-sycophancy outcome in this toy model is clearly not what they meant. (As Sabien takes pains to clarify in “Basics of Rationalist Discourse”, “If two people disagree, it’s tempting for them to attempt to converge with each other, but in fact the right move is for both of them to try to see more of what’s true.”)

Obviously, the mutual sycophancy outcome is clearly not what proponents of trust and cooperation consciously intend. The problem is that mutual sycophancy seems to be the natural outcome of treating interpersonal conflicts as analogous to epistemic disagreements and trying to resolve them both using cooperative practices, when in fact the decision-theoretic structure of those situations are very different. The text of “Truth or Dare” seems to treat the analogy as a strong one; it wouldn’t make sense to spend so many thousands of words discussing gift economies and the eponymous party game and then draw a conclusion about “what constitutes good conduct and productive discourse”, if gift economies and the party game weren’t relevant to what constitutes productive discourse.

“Truth or Dare” seems to suggest that it’s possible to escape the Dark World by excluding the bad guys. “[F]rom the perspective of someone with light world privilege, […] it did not occur to me that you might be hanging around someone with ill intent at all,” Sabien imagines a denizen of the light world saying. “Can you, um. Leave? Send them away? Not be spending time in the vicinity of known or suspected malefactors?”

If we’re talking about holding my associates to a standard of ideal truthseeking (as contrasted to a lower standard of “not using this truth-or-dare game to blackmail me”), then, no, I think I’m stuck spending time in the vicinity of people who are known or suspected to be biased. I can try to mitigate the problem by choosing less biased friends, but when we do disagree, I have no choice but to approach that using the same rules of reasoning that I would use with a possibly-mislabeled barometer, which do not have a particularly cooperative character. Telling us that the right move is for both of us to try to see more of what’s true is tautologically correct but non-actionable; I don’t know how to do that except by my usual methodology, which Sabien has criticized as characteristic of living in a dark world.

That is to say: I do not understand how high-trust, high-cooperation dynamics work. I’ve never seen them. They are utterly outside my experience and beyond my comprehension. What I do know is how to keep my footing in a world of people with different goals from me, which I try to do with what skill and tenacity I can manage.

And if someone should say that I should not be trusted when I try to explain what constitutes good conduct and productive discourse … well, I agree!

I don’t want people to trust me, because I think trust would result in us getting the wrong answer.

I want people to read the words I write, think it through for themselves, and let me know in the comments if I got something wrong.

"Yes, and—" Requires the Possibility of "No, Because—"

Posted on 9 October 2025 by Zack M. Davis

Scott Garrabrant gives a number of examples to illustrate that “Yes Requires the Possibility of No”. We can understand the principle in terms of information theory. Consider the answer to a yes-or-no question as a binary random variable. The “amount of information” associated with a random variable is quantified by the entropy, the expected value of the negative logarithm of the probability of the outcome. If we know in advance of asking that the answer to the question will always be Yes, then the entropy is −P(Yes)·log(P(Yes)) − P(No)·log(P(No)) = −1·log(1) − 0·log(0) = 0.¹ If you already knew what the answer would be, then the answer contains no information; you didn’t learn anything new by asking.

In the art of improvisational theater (“improv” for short), actors perform scenes that they make up as they go along. Without a script, each actor’s choices of what to say and do amount to implied assertions about the fictional reality being portrayed, which have implications for how the other actors should behave. A choice that establishes facts or gives direction to the scene is called an offer. If an actor opens a scene by asking their partner, “Is it serious, Doc?”, that’s an offer that the first actor is playing a patient awaiting diagnosis, and the second actor is playing a doctor.

A key principle of improv is often known as “Yes, and” after an exercise that involves starting replies with those words verbatim, but the principle is broader and doesn’t depend on the particular words used: actors should “accept” offers (“Yes”), and respond with their own complementary offers (“and”). The practice of “Yes, and” is important for maintaining momentum while building out the reality of the scene.

Rejecting an offer is called blocking, and is frowned upon. If one actor opens the scene with, “Surrender, Agent Stone, or I’ll shoot these hostages!”—establishing a scene in which they’re playing an armed villain being confronted by an Agent Stone—it wouldn’t do for their partner to block by replying, “That’s not my name, you don’t have a gun, and there are no hostages.” That would halt the momentum and confuse the audience. Better for the second actor to say, “Go ahead and shoot, Dr. Skull! You’ll find that my double agent on your team has stolen your bullets”—accepting the premise (“Yes”), then adding new elements to the scene (“and”, the villain’s name and the double agent).

Notice a subtlety: the Agent Stone character isn’t “Yes, and”-ing the Dr. Skull character’s demand to surrender. Rather, the second actor is “Yes, and”-ing the first actor’s worldbuilding offers (where the offer happens to involve their characters being in conflict). Novice improvisers are sometimes tempted to block to try to control the scene when they don’t like their partner’s offers, but it’s almost always a mistake. Persistently blocking your partner’s offers kills the vibe, and with it, the scene. No one wants to watch two people arguing back-and-forth about what reality is.

Proponents of collaborative truthseeking think that many discussions benefit from a more “open” or “interpretive” mode in which participants prioritize constructive contributions that build on each other’s work rather than tearing each other down.

The analogy to improv’s “Yes, and” doctrine writes itself, right down to the subtlety that collaborative truthseeking does not discourage disagreement as such—any more than the characters in an improv sketch aren’t allowed to be in conflict. What’s discouraged is the persistent blocking of offers, refusing to cooperate with the “scene” of discourse your partner is trying to build. Partial disagreement with polite elaboration (“I see what you’re getting at, but have you considered …”) is typically part of the offer—that we’re “playing” reasonable people having a cooperative intellectual discussion. Only wholesale negation (“That’s not a thing”) is blocking—by rejecting the offer that we’re both playing reasonable people.

Whatever you might privately think of your interlocutor’s contribution, it’s not hard to respond in a constructive manner without lying. Like a good improv actor, you can accept their contribution to the scene/discourse (“Yes”), then add your own contribution (“and”). If nothing else, you can write about how their comment reminded you of something else you’ve read, and your thoughts about that.

Reading over a discussion conducted under such norms, it’s easy to not see a problem. People are building on each other’s contributions; information is being exchanged. That’s good, right?

The problem is that while the individual comments might (or might not) make sense when read individually, the harmonious social exchange of mutually building on each other’s contributions isn’t really a conversation unless the replies connect to each other in a less superficial way that risks blocking.

What happens when someone says something wrong or confusing or unclear? If their interlocutor prioritizes correctness and clarity, the natural behavior is to say, “No, that’s wrong, because …” or “No, I didn’t understand that”—and not only that, but to maintain that “No” until clarity is forthcoming. That’s blocking. It feels much more cooperative to let it pass in order to keep the scene going—with the result that falsehood, confusion, and unclarity accumulate as the interaction goes on.

There’s a reason improv is almost synonymous with improv comedy. Comedy thrives on absurdity: much of the thrill and joy of improv comedy is in appreciating what lengths of cleverness the actors will go to maintain the energy of a scene that has long since lost any semblance of coherence or plausibility. The rules that work for improv comedy don’t even work for (non-improvised, dramatic) fiction; it certainly won’t work for philosophy.

Per Garrabrant’s principle, the only way an author could reliably expect discussion of their work to illuminate what they’re trying to communicate is if they knew they were saying something the audence already believed. If you’re thinking carefully about what the other person said, you’re often going to end up saying “No” or “I don’t understand”, not just “Yes, and”: if you’re committed to validating your interlocutor’s contribution to the scene before providing your own, you’re not really talking to each other.

I’m glossing over a technical subtlety here by assuming—pretending?—that 0·log(0) = 0, when log(0) is actually undefined. But it’s the correct thing to pretend, because the linear factor goes to zero faster than can go to negative infinity. Formally: $\lim_{p \to 0^+} p \log(p) = \lim_{p \to 0^+} \frac{\log(p)}{1/p} = \lim_{p \to 0^+} \frac{1/p}{-1/p^2} = 0$

An Algorithmic Lucidity

a blog

Tag Archives: discourse

Disagreement Comes From the Dark World