The End of the Movie: SF State's 2024 Putnam Competition Team, A Retrospective

From: Zack M Davis <zmd@sfsu.edu>
Sent: Sunday, January 12, 2025 11:52 AM
To: math_majors@lists.sfsu.edu <math_majors@lists.sfsu.edu>, math_graduate@lists.sfsu.edu <math_graduate@lists.sfsu.edu>, math_lecturers@lists.sfsu.edu <math_lecturers@lists.sfsu.edu>, math_tenure@lists.sfsu.edu <math_tenure@lists.sfsu.edu>
Subject: the end of the movie: SF State’s 2024 Putnam Competition team, a retrospective

Because life is a gradual series of revelations
That occur over a period of time
It’s not some carefully crafted story
It’s a mess, and we’re all gonna die

If you saw a movie that was like real life
You’d be like, “What the hell was that movie about?
It was really all over the place.”
Life doesn’t make narrative sense

—“The End of the Movie”, Crazy Ex-Girlfriend

Every Hollywood underdog story starts with a dream. The scrawny working-class kid who wants to play football for Notre Dame. The charismatic teacher at a majority-Latino school in East L.A. who inspires his class to ace the AP Calculus exam. The debate team at a historically black college that unseats the reigning national champions. Hollywood tells us that if you work hard and believe in yourself, you can defy all expectations and achieve your dream.

Hollywood preys on the philosophically unsophisticated. Chebyshev’s inequality states that the probability that a random variable deviates from its mean by more than k standard deviations is no more than 1/k². Well-calibrated expectations already take into account how hard you’ll work and how much you’ll believe in yourself: underdogs mostly lose by definition.

Accordingly, this story starts with a correspondingly humble dream: the not-a-kid-anymore returning to SFSU after a long absence to finish up his math degree, who wants to get a nonzero score in the famous William Lowell Putnam Mathematical Competition®. (It’s not quite as humble as it sounds: the median score in the famously brutal elite competition is often zero out of 120, although last year the median was nine.)

The first step on the road to a nonzero score was being able to compete at all: SF State had no immediate history of participating in the event, in contrast to other schools that devote significant resources to it. (E.g., at Carnegie Mellon, they have a for-credit 3-unit Putnam seminar that meets six days a week.) At SFSU in 2012, I had asked one of my professors about registering for the Putnam, and nothing came of it. This time, a more proactive approach was called for. After reaching out to the chair and several professors who had reasons to decline the role (“I’m not a fan of the Putnam”, “I have negative time this semester”, “You should ask one of the smart professors”), Prof. Arek Goetz agreed to serve as local supervisor.

Preparation session #1, to discuss the solutions to problems from the 2010 competition, was scheduled and aggressively advertised on the math-majors mailing list. (That is, “aggressively” in terms of the rhetoric used, not the frequency of posts.) Despite some interest expressed in email, no non-organizer participants showed up, and my flailing attempts at some of the 2010 problems mostly hadn’t gotten very far … but I had at least intuited the correct answer to B2, if not the proof. (We are asked for the smallest possible side length of a triangle with integer coordinates and integer side lengths; the obvious guess is 3, from the smallest Pythagorean triple 3–4–5; then we “just” have to rule out possible side lengths of 1 and 2.) The dream wasn’t dead yet.

To keep the dream alive, recruitment efforts were stepped up. When I happened to overhear a professor in the department lounge rave about a student citing a theorem he didn’t know on a “Calculus III” homework assignment, I made sure to get the student’s name for a group email to potential competitors. A When2Meet scheduling poll sent to the group was used to determine the time of prep session #2, which was advertised on the department mailing list with a promise of free donuts (which the department generously offered to reïmburse).

Session #2 went well—four participants came, and Prof. Goetz made an appearance. I don’t think we made much progress understanding the 2011 solutions in the hour before we had to yield TH 935 to the Ph.D. application group, but that wasn’t important. We had four people. This was really happening!

As the semester wore on, the group kept in touch on our training progress by email, and ended up holding three more in-person sessions as schedules permitted (mean number of attendees: 1.67). Gelca and Andreescu’s Putnam and Beyond was a bountiful source of practice problems in addition to previous competitions.

Finally, it was Saturday 7 December. Gameday—exam day, whatever. Three competitors (including one who hadn’t been to any of the previous prep sessions) gathered in the Blakeslee room at the very top of Thornton Hall to meet our destiny. The Putnam is administered in two sessions: three hours in the morning (problems identified as A1 through A6 in roughly increasing difficulty), and three hours in the afternoon (problems B1 through B6).

Destiny was not kind in the problem selection for the “A” session.

A1 was number theory, which I don’t know (and did not, unfortunately, learn from scratch this semester just for the Putnam).

I briefly had some hope for A2, which asked for which real polynomials p there is a real polynomial q such that p(p(x)) − x = (p(x) − x)²q(x). If I wrote p(x) = Σ_{j=0}^{n} a_j x^j and q(x) = Σ_{j=0}^{m} b_j x^j and expanded the equation to Σ_{j=0}^{n} a_j (Σ_{k=0}^{n} a_k x^k)^j − x = (Σ_{j=0}^{n} a_j x^j − x)² Σ_{j=0}^{m} b_j x^j, and applied the multinomial theorem … it produced a lot of impressive Σ–Π index notation, but didn’t obviously go anywhere towards solving the problem.

A3 was combinatorics. Concerning the set S of bijections T from {1, 2, 3} × {1, 2, …, 2024} to {1, 2, …, 6072} such that T(1, j) < T(2, j) < T(3, j) and T(i, j) < T(i, j+1), were there a and c in {1, 2, 3} and b and d in {1, 2, …, 2024} such that the fraction of elements T in S for which T(a, b) < T(c, d) is at least ⅓ and at most ⅔? I couldn’t get a good grasp on the structure of S (e.g., how many elements it has), which was a blocker to saying anything about what fraction of it fulfills some property. Clearly a lexicographic sort by the first element, or by the second element, would fulfill the inequalities, but how many other bijections were in S? When the solutions were later published, the answer turned out to involve a standard formula about Young tableaux, not something I could have realistically derived from scratch during the exam.

A4 was more number theory; I didn’t even read it. (I still haven’t read it.)

A5 asked how to place a radius-1 disc inside a circle of radius 9 in order to minimize the probability that a chord through two uniformly random points on the circle would pass through the disc. I recognized the similarity to Bertrand’s paradox and intuited that a solution would probably be at one of the extremes, putting the disc at the center or the edge. There was obviously no hope of me proving this during the exam. (It turned out to be the center.)

A6 was a six; I didn’t read it.

I turned in pages with my thoughts on A2, A3, and A5 because it felt more dignified than handing in nothing, but those pages were clearly worth zero points. The dream was dying.

Apparently I wasn’t the only one demoralized by the “A” problems; the other competitors didn’t return for the afternoon session. Also, it turned out that we had locked ourselves out of the Blakeslee room, so the afternoon session commenced with just me in TH 935, quietly hoping for a luckier selection of “B” problems, that this whole quixotic endeavor wouldn’t all have been for nothing.

Luck seemed to deliver. On a skim, B1, B2, and B4 looked potentially tractable. B2 was geometry, and I saw an angle of attack (no pun intended) …

B2. Two convex quadrilaterals are called partners if they have three vertices in common and they can be labeled ABCD and ABCE so that E is the reflection of D across the perpendicular bisector of the diagonal AC. Is there an infinite sequence of convex quadrilaterals such that each quadrilateral is a partner of its successor and no two elements of the sequence are congruent?

I imagined rotating the figure such that AC was the vertical axis and its bisector was the horizontal axis, and tried to imagine some way to perturb D and E to get a sequence of quadrilaterals that wouldn’t be congruent (because the angles ∠CDA and ∠CEA were changing), but for which we could alternately take ABCD and ABCE so that successive shapes in the sequence would be partners. I couldn’t see a way to make it work. Then I thought, what if I perturb B instead?

Yes, I began to write excitedly, there exists such a sequence. For example, in ℝ², let A := (0, −1), C := (0, 1), D := (½, ½), and E := (½, −½), and consider a sequence Bn on the unit circle strictly in quadrant II (i.e., with x < 0 and y > 0), for example, Bn := (Re exp((π − 1/n)i), Im exp((π − 1/n)i)) where ℝ² is identified with ℂ. Then consider the sequence of quadrilaterals ABnCD for odd n and ABnCE for even n, for n ∈ ℕ+. Successive quadrilaterals in the sequence are partners: the perpendicular bisector of the diagonal AC is the x-axis, and D = (½, ½) and E = (½, −½) are reflections of each other across the x-axis. No two quadrilaterals in the sequence are congruent because the angle ∠ABnC is different for each n …

Or is it? I recalled a fact from Paul Lockhart’s famous lament: somewhat counterintuitively, any triangle inscribed in a semicircle is a right triangle: ∠ABnC would be the same for all n. (The quadrilaterals would still be different, but I would have to cite some other difference to prove it.) I took a fresh piece of paper and rewrote the proof with a different choice of Bn: instead of picking a sequence of points on the unit circle, I chose a sequence on the line y = x + 1: say, Bn := (−1/(n+1), 1 − 1/(n+1)). Then I could calculate the distance AB as √(1/(n+1)² + (2 − 1/(n+1))²), observe that the angle ∠BCA was 45°, invoke the law of sines to infer that the ratio of the sine of ∠ABC to the distance AC (viz., 2) was equal to the ratio of the sine of ∠BCA (viz., √2/2) to the distance AB, and infer that ∠ABC is arcsin(√2/AB), which varies with n, and therefore that the quadrilaterals in my sequence were not congruent. Quod erat demonstrandum!

That took the majority of my time for the afternoon session; I spent the rest of it tinkering with small-n cases for B1 and failing to really get anywhere. But that didn’t matter. I had solved B2, hadn’t I? That had to be a solve, right?—maybe 8 points for less than immaculate rigor, but not zero or one.

Last year the published contest results only listed the names of top 250 individuals, top 10 teams, and top 3 teams by MAA section (“Golden Section: Stanford University; University of California, Berkeley; University of California, Davis”), but I fantasized about looking up who I should write to at the MAA to beg them to just publish the full team scores. Who was privacy helping? People who go to R2 universities already know that we’re stupid. Wouldn’t it be kinder to at least let us appear at the bottom of the list, rather than pretending we didn’t exist at all? All weekend, in the movie of my life in my head, I could hear the sports announcer character (perhaps portrayed by J. K. Simmons) crowing: Gators on the board! Gators on the board!

All weekend and until the embargo period ended on 10 December and people began discussing the answers online, reminding me that real life isn’t a movie.

I did not write to the MAA.

The Gators were not on the board.

I did not solve B2.

The answer is No, there is no such sequence of quadrilaterals. The Putnam archive solutions and a thread on the Art of Problem Solving forums explain how to prove it.

As for my “proof”, well, the problem statement said that partner quadrilaterals have three vertices in common. In my sequence, successive elements ABnCD and ABn+1CE have two vertices in common, A and C.

This isn’t a fixable flaw. If you have the reading comprehension to understand the problem statement, the whole approach just never made any sense to begin with. If it made sense to me while I was writing it, well—what’s that phrase mathematicians like to use? Modus tollens.

You could say that there’s always next year—but there isn’t, for me. Only students without an undergraduate degree are eligible to take the Putnam, and I’m graduating next semester. (In theory, I could delay it and come back in Fall 2025, but I’m already graduating fifteen years late, and no humble dream is worth making it fifteen and a half.)

I keep wanting to believe that this isn’t the end of the movie. Maybe this year’s effort was just the first scene. Maybe someone reading this mailing list post will hear the call to excellence and assemble a team next year that will score a point—not out of slavish devotion to the Putnam competition itself, but to what it represents, that there is a skill of talking precisely about precise things that’s worth mastering—that can be mastered by someone trying to master it, which mastery can be measured by a wide-ranging test with a high ceiling and not just dutiful completion of course requirements.

Maybe then this won’t all have been for nothing.

Recruitment Advertisements for the 2024 Putnam Competition at San Francisco State University

From: Zack M Davis <zmd@sfsu.edu>
Sent: Wednesday, September 11, 2024 5:02 PM
To: math_majors@lists.sfsu.edu <math_majors@lists.sfsu.edu>
Subject: Putnam prep session for eternal mathematical glory, 4 p.m. Thu 19 September

One must make a distinction however: when dragged into prominence by half-poets, the result is not poetry, nor till the poets among us can be “literalists of the imagination”—above insolence and triviality and can present for inspection “imaginary gardens with real toads in them”, shall we have it. In the meantime, if you demand on the one hand the raw material of poetry in all its rawness, and that which is on the other hand genuine, then you are interested in poetry.

—Marianne Moore

The William Lowell Putnam Mathematical Competition, the renowned annual math examination for undergraduates with cash prizes for top performers, is to be held on Saturday, 7 December 2024. Registration details will be available soon, but for now, potential competitors are invited to come to an initial preparatory/training session at 4 p.m. on Thursday, September 19th in the math department conference room TH 935.

To get the most out of it, try struggling with some of the problems from the 2010 competition beforehand: we’ll discuss solutions and strategies together at the meeting. (The problems are numbered A1–A6 and B1–B6, corresponding to the morning and afternoon sessions of the competition; the earlier-numbered problems within each are supposed to be easier.) If you can’t make this time but are interested in the endeavor, I want to hear from you: email me at zmd@sfsu.edu.

“FREQUENTLY” ASKED QUESTIONS

Q: Did you say “cash prizes”? I’m pretty good at math: I got an “A” in MATH 228. Should I participate in hopes of winning?

A: No. No one who goes to SF State is going to win any prizes. The Putnam is an elite competition designed to test the abilities of the finest young mathematical minds in the world. The graders are notoriously stingy about awarding partial credit: the median score is often zero points out of 120. Last year seems to have been a bit easier: the median score was 9.¹ Of the top sixteen scorers, thirteen went to MIT.

Q: Wait, this sounds awful. I’m already spending way too much of my life shuffling formulæ around just to keep up with my classes. You’re asking me to spend even more of my precious time attempting insanely difficult problems, to prepare for a six-hour exam three months from now that I have no hope of doing well on, and it wouldn’t even earn credit for my degree? Why would I do that?

A: Because it doesn’t earn credit for your degree. The Putnam isn’t an obedience test where a designated bureaucratic authority figure commands you to use a fixed set of methods to solve a fixed set of problems in exchange for a piece of paper with an “A” written on it. It’s a challenge of your creativity, breadth of knowledge, and determination—a Schelling point for those who demand the raw material of mathematics and that which is on the other hand genuine to prove to ourselves and the world what we’re capable of. If you’re afraid of what you’ll learn about yourself by trying, then don’t.

1: The Duke Research Blog reports that there were 3,857 competitors in 2023, and the official results report that 2,200 contestants scored at least 9 and 1,610 scored at least 10.


From: Zack M Davis <zmd@sfsu.edu>
Sent: Sunday, September 29, 2024 11:17 PM
To: math_majors@lists.sfsu.edu <math_majors@lists.sfsu.edu>
Subject: Putnam prep session #2 for eternal mathematical glory … and donuts, 2 p.m. Fri 4 October

“Hey, Goofusia,” said Gallantina. “Did you see this post on the math_majors list? Someone’s trying to organize a team for the Putnam competition—here, at SFSU! There’s going to be a prep session in Thornton 935 on Friday at 2 p.m. The organizer sounds really desperate—there should be free donuts. Want to come?”

Fraternal twins, the sisters looked so much alike that strangers who didn’t know them often asked if they were identical. People who knew them for any length of time never asked.

Goofusia grimaced. “Oh, God, is that that super-hard math competition that guys from MIT win every year, where the median score is zero?”

“Actually, someone not from MIT won as recently as 2018, and last year the median score was nine. But yes.”

“Uh-huh. What school was the 2018 winner from?”

“Um, Harvard.”

“I’ll pass. You should, too.”

“C’mon, it’ll be fun!”

“Gallantina, you don’t know what fun is. You’re so caught up in your delusional self-image of pro-sociality that you can’t even notice what you actually enjoy.” Goofusia spoke with a firm emphasis and cadence, “I, am learning math, in order to get grades, in order to get a degree, in order to get a job. So is everyone else in our major. So are you. That’s the only possible reason—the only human reason. You just can’t admit it to yourself—”

“That’s not true!”

“—and you’re so fanatically devoted to maintaining your false self-image as some intrinsically motivated student of the cosmos that you’re willing to torture yourself with more schoolwork that doesn’t even benefit you. You are not going to score points on the Putnam—”

“I might!” said Gallantina steadfastly, suddenly turning away from three walls of the room to face the remaining one and looking past Goofusia as if to speak to someone else. “With dedication and practice, and with the help of all the lifelong friends I’ll make in TH 935 at 2 p.m. this Friday October fourth!”

“Spare me. What does prepping for an impossible exam even look like?”

“Well, the idea is that before the meeting, I and others will prepare at home by trying problems from the 2011 competition with however much time we choose to spare for the task, and then at the meeting, we’ll compare answers and discuss the published solutions.”

“If any of you losers even come up with any answers to compare.”

“We might! I’ve already made some partial progress on the first problem.”

“You don’t have to tell m—” Goofusia tried to say, but Gallantina had already begun to read:

A1. Define a growing spiral in the plane to be a sequence of points with integer coordinates P0 = (0, 0), P1, …, Pn such that n ≥ 2 and:
• the directed line segments P0–P1, P1–P2, …, P(n−1)–Pn are in the successive coordinate directions east (for P0–P1), north, west, south, east, etc.;
• the lengths of these line segments are positive and strictly increasing.

How many of the points (x, y) with integer coordinates 0 ≤ x ≤ 2011, 0 ≤ y ≤ 2011 cannot be the last point, Pn, of any growing spiral?

“Two thousand and eleven?” Goofusia asked disdainfully.

“They like to work the competition year into one of the problem statements. I think it’s cute,” said Gallantina. “Anyway, I started thinking about the minimal growing spiral—one step east, two steps north, three steps west, &c. The x-coördinate steps are 1, −3, 5, −7 …, the y-coördinate steps are 2, −4, 6, −8 …, the x-coördinate net endpoints are 1, −2, 3, −4, 5 … and the y-coördinate net endpoints are 2, −2, 4, −4, … There are more possible spirals besides the minimal one, of course, but we can already see there are patterns in what endpoints are possible.”

“You’re wasting your time,” said Goofusia. “Precisely because the question asks about all possible growing spirals, you’re not going to learn anything by examining particular cases. You can immediately see that any point with an x-coördinate less than the y-coördinate will do: just take x steps east and y steps north.”

Gallantina was beaming.

“Wh—what are you smiling at?”

Gallantina nodded, still beaming.

Goofusia scowled. “Whatever,” she said, and turned to leave, then stopped. “So … what’s the answer?”

Gallantina shrugged. “We haven’t finished solving it yet. But if it turns out to be beyond us, I’m sure they’ll tell us in TH 935 at 2 p.m. this Friday October fourth.”

Goofusia shook her head. “I couldn’t possibly. I have an exam this week, and a lot of homework.”

“But you don’t specifically have anything else going on at 2 on Friday? They’re notoriously hard problems, and everyone is busy. There’d be no shame in showing up and eating a donut without having successfully solved anything at home.”

“No, I mean that’s not who I am. I’m not like you. I’m a student at SF State, not—not the cosmos!”

Goofusia left. Alone, Gallantina addressed the fourth wall again. “Is that who you are?”

Should I Finish My Bachelor's Degree?

To some, it might seem like a strange question. If you think of being college-educated as a marker of class (or personhood), the fact that I don’t have a degree at the age of thirty-six (!!) probably looks like a scandalous anomaly, which it would be only natural for me to want to remediate at the earliest opportunity.

I deeply resent that entire worldview—not because I’ve rejected education, properly understood. On the contrary. The study of literature, history, mathematics, science—these things are among the noblest pursuits in life, sources of highest pleasure and deepest meaning. It’s precisely because I value education so much that I can’t stand to see it conflated with school and its culture of bureaucratic servitude where no one cares what you know and no one cares what you can do; they just want you to sit in a room and obey the commands of the designated teacher. Whereas in reality, knowledge doesn’t come from “taking courses.”

How could it? Knowledge comes from quality study and practice. Sure, it’s possible that someone could study in order to “pass” a “class” that they’re “taking” in school. But once you know how and why to study, it’s not clear what value the school is adding that can’t be gotten better, cheaper, elsewhere. Just get the books. (And start a blog, go to meetups, chat to large language models, hire a private tutor—whatever makes sense to get better at doing the things you want to do, without having to worry about whether the thing that makes sense can be made legible to distant bureaucrats.)

The people who believe in being college-educated probably don’t believe me. They probably think my pæans to the glory of self-study are the rationalizations of a lazy student who doesn’t want to work hard.

I can understand some reasons for skepticism. Sometimes people really are lazy, and suffer from self-serving delusions. Probably there are some confused people out there who have mistaken consumer edutainment for production scholarship and—maybe, somehow—could benefit from being set straight by the firm tutelage of the standard bureaucratic authority.

But without vouching for everyone who calls themself an autodidact, I think I can present third-party-visible evidence that my self-study is for real? I worked as a software engineer for eight years; I have 173 commits in the Rust compiler; I wrote a chess engine; I’ve blogged 400,000 words over the past dozen years on topics from mathematics and machine learning, to formal epistemology and the philosophy of language, to politics and differential psychology, and much more.

This is not the portfolio of an uneducated person. If someone is considering working with me and isn’t sure of my competence, they’re welcome to look at my output and judge for themselves. (And I’m happy to take a test when that makes sense.) If someone would otherwise consider working with me, but is put off by the lack of a mystical piece of paper from the standard bureaucratic authority, that’s their loss—maybe I don’t want to work with someone with so little discernment.


If I believe everything I just wrote, explaining why I have nothing particularly to gain and nothing particularly to prove by jumping through a few more hoops to get the mystical piece of paper, then … why am I considering it?

One possible answer is that it passes a cost–benefit analysis mostly by virtue of the costs being low, rather than the benefits being particularly high. I’m at a time in my life where I have enough money from my previous dayjob and enough uncertainty about how long the world is going to last, that I prefer having lots of free time to work on things that interest me or add dignity to the existential risk situation, than to continue grinding at software dayjobs. So if my schedule isn’t being constrained by a dayjob for now, why not “take” some “classes” and finish off the mystical piece of paper? Continuing from where I left off in 2013 due to being rescued by the software industry, I need five more math courses and three more gen-eds to finish a B.A. in math at San Francisco State University, which I can knock out in two semesters. The commute is terrible, but I can choose my schedule to only be on campus a couple days a week. And then if it makes sense to go get another dayjob later, “I finished my Bachelor’s degree” is a legible résumé-gap excuse (easier to explain to semi-normies with hiring authority than “I finished my 80,000-word memoir of religious betrayal”).

In short, why not?—if I’m going to do it ever, now is a convenient time, and eight classes is a sufficiently small cost that it makes sense to do it ever (conditional on the world not ending immediately).

A less comfortable possible answer is that maybe I do have something to prove.

I often wonder why I seem to be so alone in my hatred of school as an intellectual. The people who are smart enough to do well in school are presumably also smart enough to have intellectual lives outside of school. Why do people put up with it? Why is there a presumption that there must be something wrong with someone who didn’t finish the standard course?

I think part of the answer is that, separately from whether the standard course makes sense as a class or personhood marker, once the signaling regime has been established, it’s mostly true that people who don’t finish the standard course probably have something wrong with them.

Separately from the fact that I’m obviously right that my personal passion projects are more intellectually meritorious than the busywork school demanded of me, there’s also something wrong with me. My not finishing the first time at UC Santa Cruz (expected class of 2010) wasn’t just a matter of opportunity costs. I also had obscure psychological problems unrelated to my intellectual ability to do the work, which were particularly triggered by the school environment (and thankfully aren’t triggered by software industry employment relations). Someone with my talents who wasn’t crazy probably would have arranged to finish on time for pragmatic reasons (notwithstanding the injustice of the whole system).

This makes it slightly less confusing that the system hasn’t been overthrown. It’s not that school somehow has a monopoly on learning itself. It’s that people who are good at learning mostly don’t have problems getting the mystical piece of paper granting them legal and social privileges, and therefore don’t have a chip on their shoulder about not having it.

If that were the entirety of the matter, it wouldn’t present a sufficient reason for me to finish. There would be little point in proving to anyone that I’ve outgrown my youthful mental health problems by showing that I can endure the same abuses as everyone else, when anything I might want to prove to someone is proven better by my history of making real things in the real world (code that profitable businesses pay for, blog posts that people want to read of their own volition).

But it gets worse. It may just be possible that I have something to prove intellectually, not just psychologically. In 2010, after studying math on my own for a couple years (having quit the University at Santa Cruz in 2007), I enrolled in a differential equations class at the local community college, expecting to do well and validate the glory of my self-study. I was actually interested in math. Surely that would put me at an advantage over ordinary community college students who only knew how to do as they were told?

In fact, I did poorly, scraping by with a C. No doubt the people who believe in being college-educated will take this as proof of their worldview that nothing of intellectual value happens outside of schools, that anyone who thinks they learned something from a book that wasn’t assigned by their officially designated instructor is only deluding themselves.

Ultimately, I don’t think this is the correct moral. (If a poor performance in that one class counts as evidence against the hypothesis that I know what I’m doing, then good or dominant performances elsewhere—including in other school math classes—count as evidence for; a full discussion of the exact subskill deficits leading to my differential equations debacle is beyond the scope of this post.)

But even if the people who believe in being college-educated are ultimately wrong, I’m haunted by the fact they’re not obviously wrong. The fact that my expectations were so miscalibrated about the extent to which my being “into math” would easily convert into proficiency at finicky differential equations computations makes it less credible to just point at my work online and say, “Come on, I’m obviously the equal of your standard STEM graduate, even if I don’t have the mystical piece of paper.”

If that were the entirety of the matter, it still wouldn’t present a sufficient reason for me to finish. Desperately trying to prove one’s worth to the image of an insensible Other is just no way to live. When I was at SF State in 2012 (having endured the constant insults of three-plus semesters of community college, and my father being unwilling to pay for me to go back to Santa Cruz), it was for the perceived lack of other opportunities—and I was miserable, wondering when my life would begin. Whatever resources the university might have offered towards my genuine intellectual ambitions were tainted by the bitterness that I mostly wasn’t there to learn math; I was there because I felt coerced into proving that I could join the ranks of the college educated.

But now that I’ve earned some of my own money (and for unrelated reasons feel like my life is over rather than waiting to begin), the relative balance of motivations has shifted. Getting the mystical piece of paper is still a factor, but now that it feels like I have a real choice, I think I can seek advantage in the situation with less bitterness.

It helps that I only have a few “general education” requirements left, which I experience as insulting obedience tests that are wholly inferior to my free reading and blogging, regardless of the quality of the professor. In contrast, I can regard some upper-division math classes as a worthy challenge. (Yes, even at SFSU. I am not very intelligent.) Learning math is hard and expensive: I can see how it makes sense to organize a coordinated “class” in which everyone is studying the same thing, with assignments and tests for feedback and calibration. It doesn’t seem like a betrayal of the divine to want to experience meeting that external standard with pride—now that I’m less crazy, now that I have a real choice, now that my life is otherwise over anyway. I’m not committed yet (the admissions office is supposed to get back to me), but I’m currently leaning towards doing it.

"Deep Learning" Is Function Approximation

A Surprising Development in the Study of Multi-layer Parameterized Graphical Function Approximators

As a programmer and epistemology enthusiast, I’ve been studying some statistical modeling techniques lately! It’s been boodles of fun, and might even prove useful in a future dayjob if I decide to pivot my career away from the backend web development roles I’ve taken in the past.

More specifically, I’ve mostly been focused on multi-layer parameterized graphical function approximators, which map inputs to outputs via a sequence of affine transformations composed with nonlinear “activation” functions.

(Some authors call these “deep neural networks” for some reason, but I like my name better.)

It’s a curve-fitting technique: by setting the multiplicative factors and additive terms appropriately, multi-layer parameterized graphical function approximators can approximate any function. For a popular choice of “activation” rule which takes the maximum of the input and zero, the curve is specifically a piecewise-linear function. We iteratively improve the approximation f(x, θ) by adjusting the parameters θ in the direction of the derivative of some error metric on the current approximation’s fit to some example input–output pairs (x, y), which some authors call “gradient descent” for some reason. (The mean squared error (f(x, θ) − y)² is a popular choice for the error metric, as is the negative log likelihood −log P(y | f(x, θ)). Some authors call these “loss functions” for some reason.)
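
Here’s a minimal sketch of the whole procedure in NumPy (a toy example of my own devising, not from any particular textbook): two affine transformations with the max-with-zero “activation” between them, fit to noisy samples of a sine curve by iteratively adjusting the parameters against the derivative of the mean squared error.

import numpy as np

rng = np.random.default_rng(0)

# Example input–output pairs (x, y): noisy samples of a curve to approximate.
x = np.linspace(-3.0, 3.0, 64).reshape(-1, 1)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)

# The parameters θ: multiplicative factors and additive terms for two
# affine transformations.
W1, b1 = rng.normal(size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

for _ in range(5000):
    # Affine, max-with-zero "activation", affine: a piecewise-linear f(x, θ).
    hidden = np.maximum(x @ W1 + b1, 0.0)
    y_hat = hidden @ W2 + b2
    # Derivative of the mean squared error with respect to the parameters,
    # computed by the chain rule.
    d_y = 2.0 * (y_hat - y) / len(x)
    d_W2, d_b2 = hidden.T @ d_y, d_y.sum(axis=0)
    d_hidden = (d_y @ W2.T) * (hidden > 0)  # max(·, 0) passes or blocks
    d_W1, d_b1 = x.T @ d_hidden, d_hidden.sum(axis=0)
    # Adjust each parameter downhill along the derivative of the error.
    for theta, d in ((W1, d_W1), (b1, d_b1), (W2, d_W2), (b2, d_b2)):
        theta -= 0.01 * d

After the loop, f(x, θ) traces a piecewise-linear curve hugging the sine samples. That’s the entire trick; everything else is scale and engineering.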

Basically, the big empirical surprise of the previous decade is that given a lot of desired input–output pairs (x, y) and the proper engineering know-how, you can use large amounts of computing power to find parameters θ to fit a function approximator that “generalizes” well—meaning that if you compute ŷ = f(x, θ) for some x that wasn’t in any of your original example input–output pairs (which some authors call “training” data for some reason), it turns out that ŷ is usually pretty similar to the y you would have used in an example (x, y) pair.

It wasn’t obvious beforehand that this would work! You’d expect that if your function approximator has more parameters than you have example input–output pairs, it would overfit, implementing a complicated function that reproduced the example input–output pairs but outputted crazy nonsense for other choices of x—the more expressive function approximator proving useless for the lack of evidence to pin down the correct approximation.

And that is what we see for function approximators with only slightly more parameters than example input–output pairs, but for sufficiently large function approximators, the trend reverses and “generalization” improves—the more expressive function approximator proving useful after all, as it admits algorithmically simpler functions that fit the example pairs.

The other week I was talking about this to an acquaintance who seemed puzzled by my explanation. “What are the preconditions for this intuition about neural networks as function approximators?” they asked. (I paraphrase only slightly.) “I would assume this is true under specific conditions,” they continued, “but I don’t think we should expect such niceness to hold under capability increases. Why should we expect this to carry forward?”

I don’t know where this person was getting their information, but this made zero sense to me. I mean, okay, when you increase the number of parameters in your function approximator, it gets better at representing more complicated functions, which I guess you could describe as “capability increases”?

But multi-layer parameterized graphical function approximators created by iteratively using the derivative of some error metric to improve the quality of the approximation are still, actually, function approximators. Piecewise-linear functions are still piecewise-linear functions even when there are a lot of pieces. What did you think it was doing?

Multi-layer Parameterized Graphical Function Approximators Have Many Exciting Applications

To be clear, you can do a lot with function approximation!

For example, if you assemble a collection of desired input–output pairs (x, y) where the x is an array of pixels depicting a handwritten digit and y is a character representing which digit, then you can fit a “convolutional” multi-layer parameterized graphical function approximator to approximate the function from pixel-arrays to digits—effectively allowing computers to read handwriting.

Such techniques have proven useful in all sorts of domains where a task can be conceptualized as a function from one data distribution to another: image synthesis, voice recognition, recommender systems—you name it. Famously, by approximating the next-token function in tokenized internet text, large language models can answer questions, write code, and perform other natural-language understanding tasks.

I could see how someone reading about computer systems performing cognitive tasks previously thought to require intelligence might be alarmed—and become further alarmed when reading that these systems are “trained” rather than coded in the manner of traditional computer programs. The summary evokes imagery of training a wild animal that might turn on us the moment it can seize power and reward itself rather than being dependent on its masters.

But “training” is just a suggestive name. It’s true that we don’t have a mechanistic understanding of how function approximators perform tasks, in contrast to traditional computer programs whose source code was written by a human. It’s plausible that this opacity represents grave risks, if we create powerful systems that we don’t know how to debug.

But whatever the real risks are, any hope of mitigating them is going to depend on acquiring the most accurate possible understanding of the problem. If the problem is itself largely one of our own lack of understanding, it helps to be specific about exactly which parts we do and don’t understand, rather than surrendering the entire field to a blurry aura of mystery and despair.

An Example of Applying Multi-layer Parameterized Graphical Function Approximators in Success-Antecedent Computation Boosting

One of the exciting things about multi-layer parameterized graphical function approximators is that they can be combined with other methods for the automation of cognitive tasks (which is usually called “computing”, but some authors say “artificial intelligence” for some reason).

In the spirit of being specific about exactly which parts we do and don’t understand, I want to talk about Mnih et al. 2013’s work on getting computers to play classic Atari games (like Pong, Breakout, or Space Invaders). This work is notable as one of the first high-profile examples of using multi-layer parameterized graphical function approximators in conjunction with success-antecedent computation boosting (which some authors call “reinforcement learning” for some reason).

If you only read the news—if you’re not in tune with there being things to read besides news—I could see this result being quite alarming. Digital brains learning to play video games at superhuman levels from the raw pixels, rather than because a programmer sat down to write an automation policy for that particular game? Are we not already in the shadow of the coming race?

But people who read textbooks and not just news, being no less impressed by the result, are often inclined to take a subtler lesson from any particular headline-grabbing advance.

Mnih et al.’s Atari result built off the technique of Q-learning introduced two decades prior. Given a discrete-time present-state-based outcome-valued stochastic control problem (which some authors call a “Markov decision process” for some reason), Q-learning concerns itself with defining a function Q(s, a) that describes the value of taking action a while in state s, for some discrete sets of states and actions. For example, to describe the problem faced by a policy for a grid-based video game, the states might be the squares of the grid, and the available actions might be moving left, right, up, or down. The Q-value for being on a particular square and taking the move-right action might be the expected change in the game’s score from doing that (including a scaled-down expectation of score changes from future actions after that).

Upon finding itself in a particular state s, a Q-learning policy will usually perform the action with the highest Q(s, a), “exploiting” its current beliefs about the environment, but with some probability it will “explore” by taking a random action. The predicted outcomes of its decisions are compared to the actual outcomes to update the function Q(s, a), which can simply be represented as a table with as many rows as there are possible states and as many columns as there are possible actions. We have theorems to the effect that as the policy thoroughly explores the environment, it will eventually converge on the correct Q(s, a).
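
The whole tabular algorithm fits in a few lines. Here’s a minimal sketch on a contrived 4×4 gridworld of my own devising (not anything from Mnih et al. or a textbook), where reaching the far corner pays out one point:

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))  # the table: one row per state
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration

def step(state, action):
    # Hypothetical dynamics: walk on a 4×4 grid; square 15 pays 1 and ends
    # the episode.
    row, col = divmod(state, 4)
    if action == 0:   row = max(row - 1, 0)  # up
    elif action == 1: row = min(row + 1, 3)  # down
    elif action == 2: col = max(col - 1, 0)  # left
    else:             col = min(col + 1, 3)  # right
    next_state = 4 * row + col
    return next_state, float(next_state == 15), next_state == 15

rng = np.random.default_rng(0)
for _ in range(500):  # episodes
    state = 0
    for _ in range(100):  # cap episode length
        # "Explore" with probability ε; otherwise "exploit" the table.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(Q[state].argmax())
        next_state, reward, done = step(state, action)
        # Compare the predicted outcome to the actual outcome and update
        # the corresponding table cell.
        target = reward + (0.0 if done else gamma * Q[next_state].max())
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
        if done:
            break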

But Q-learning as originally conceived doesn’t work for the Atari games studied by Mnih et al., because it assumes a discrete set of possible states that could be represented with the rows in a table. This is intractable for problems where the state of the environment varies continuously. If a “state” in Pong is a 6-tuple of floating-point numbers representing the player’s paddle position, the opponent’s paddle position, and the x- and y-coordinates of the ball’s position and velocity, then there’s no way for the traditional Q-learning algorithm to base its behavior on its past experiences without having already seen that exact conjunction of paddle positions, ball position, and ball velocity, which almost never happens. So Mnih et al.’s great innovation was—

(Wait for it …)

—to replace the table representing Q(s, a) with a multi-layer parameterized graphical function approximator! By approximating the mapping from state–action pairs to discounted-sums-of-“rewards”, the “neural network” allows the policy to “generalize” from its experience, taking similar actions in relevantly similar states, without having visited those exact states before. There are a few other minor technical details needed to make it work well, but that’s the big idea.
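
To sketch the swap in the same toy style (made-up dimensions and a bare update rule of my own devising, omitting the replay memory and other stabilizing details from the paper): instead of indexing into a table row, we compute Q(s, ·) with a small function approximator and adjust its parameters in the direction of the derivative of the squared difference between prediction and target.

import numpy as np

rng = np.random.default_rng(0)
state_dim, n_actions = 6, 3  # e.g., Pong's 6-tuple; paddle up, down, stay
W1, b1 = 0.1 * rng.normal(size=(state_dim, 32)), np.zeros(32)
W2, b2 = 0.1 * rng.normal(size=(32, n_actions)), np.zeros(n_actions)

def q_values(state):
    # A continuous state vector goes in; one Q estimate per action comes out.
    hidden = np.maximum(state @ W1 + b1, 0.0)
    return hidden @ W2 + b2, hidden

def td_update(state, action, reward, next_state, gamma=0.99, lr=1e-3):
    # Nudge the approximation of Q(state, action) toward the usual
    # reward-plus-discounted-best-successor target.
    q, hidden = q_values(state)
    target = reward + gamma * q_values(next_state)[0].max()
    d_q = np.zeros(n_actions)
    d_q[action] = 2.0 * (q[action] - target)  # derivative of the squared error
    d_hidden = (W2 @ d_q) * (hidden > 0)
    W2[...] -= lr * np.outer(hidden, d_q)
    b2[...] -= lr * d_q
    W1[...] -= lr * np.outer(state, d_hidden)
    b1[...] -= lr * d_hidden

Nearby state vectors now share parameters, so experience in one state informs behavior in relevantly similar ones—which is the entire point of the substitution.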

And understanding the big idea probably changes your perspective on the headline-grabbing advance. (It certainly did for me.) “Deep learning is like evolving brains; it solves problems and we don’t know how” is an importantly different story from “We swapped out a table for a multi-layer parameterized graphical function approximator in this specific success-antecedent computation boosting algorithm, and now it can handle continuous state spaces.”

Risks From Learned Approximation

When I solicited reading recommendations from people who ought to know about risks of harm from statistical modeling techniques, I was directed to a list of reputedly fatal-to-humanity problems, or “lethalities”.

Unfortunately, I don’t think I’m qualified to evaluate the list as a whole; I would seem to lack some necessary context. (The author keeps using the term “AGI” without defining it, and adjusted gross income doesn’t make sense in context.)

What I can say is that when the list discusses the kinds of statistical modeling techniques I’ve been studying lately, it starts to talk funny. I don’t think someone who’s been reading the same textbooks as I have (like Prince 2023 or Bishop and Bishop 2024) would write like this:

Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. […] This is sufficient on its own […] to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.

To be clear, I agree that if you fit a function approximator by iteratively adjusting its parameters in the direction of the derivative of some loss function on example input–output pairs, that doesn’t create an explicit internal representation of the loss function inside the function approximator.

It’s just—why would you want that? And really, what would that even mean? If I use the mean squared error loss function to approximate a set of data points in the plane with a line (which some authors call a “linear regression model” for some reason), obviously the line itself does not somehow contain a representation of general squared-error-minimization. The line is just a line. The loss function defines how my choice of line responds to the data I’m trying to approximate with the line. (The mean squared error has some elegant mathematical properties, but is more sensitive to outliers than the mean absolute error.)
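
To make the point concrete with the smallest possible example (a toy dataset of my own; NumPy’s polyfit performs the least-squares fit):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 24.0])  # the last point is an outlier

# Least squares (mean squared error): the returned line is a joint fact
# about the loss function and the data—the outlier yanks the slope up.
slope, intercept = np.polyfit(x, y, 1)

Swap in a less outlier-sensitive loss, or different data points, and you get a different line; neither the loss nor the line “contains” the other.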

It’s the same thing for piecewise-linear functions defined by multi-layer parameterized graphical function approximators: the model is the dataset. It’s just not meaningful to talk about what a loss function implies, independently of the training data. (Mean squared error of what? Negative log likelihood of what? Finish the sentence!)

This confusion about loss functions seems to be linked to a particular theory of how statistical modeling techniques might be dangerous, in which “outer” training results in the emergence of an “inner” intelligent agent. If you expect that, and you expect intelligent agents to have a “utility function”, you might be inclined to think of “gradient descent” “training” as trying to transfer an outer “loss function” into an inner “utility function”, and perhaps to think that the attempted transfer primarily doesn’t work because “gradient descent” is an insufficiently powerful optimization method.

I guess the emergence of inner agents might be possible? I can’t rule it out. (“Functions” are very general, so I can’t claim that a function approximator could never implement an agent.) Maybe it would happen at some scale?

But taking the technology in front of us at face value, that’s not my default guess at how the machine intelligence transition would go down. If I had to guess, I’d imagine someone deliberately building an agent using function approximators as a critical component, rather than your function approximator secretly having an agent inside of it.

That’s a different threat model! If you’re trying to build a good agent, or trying to prohibit people from building bad agents using coordinated violence (which some authors call “regulation” for some reason), it matters what your threat model is!

(Statistical modeling engineer Jack Gallagher has described his experience of this debate as “like trying to discuss crash test methodology with people who insist that the wheels must be made of little cars, because how else would they move forward like a car does?”)

I don’t know how to build a general agent, but contemporary computing research offers clues as to how function approximators can be composed with other components to build systems that perform cognitive tasks.

Consider AlphaGo and its successor AlphaZero. In AlphaGo, one function approximator is used to approximate a function from board states to move probabilities. Another is used to approximate the function from board states to game outcomes, where the outcome is +1 when one player has certainly won, −1 when the other player has certainly won, and a proportionately intermediate value indicating who has the advantage when the outcome is still uncertain. The system plays both sides of a game, using the board-state-to-move-probability function and board-state-to-game-outcome function as heuristics to guide a search algorithm which some authors call “Monte Carlo tree search”. The board-state-to-move-probability function approximation is improved by adjusting its parameters in the direction of the derivative of its cross-entropy with the move distribution found by the search algorithm. The board-state-to-game-outcome function approximation is improved by adjusting its parameters in the direction of the derivative of its squared difference with the self-play game’s ultimate outcome.
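
Here’s a toy sketch of just those two updates, with linear-plus-softmax and linear-plus-tanh “heads” on a hypothetical board-state feature vector standing in for the real networks (the search algorithm, and everything else that makes the actual system work, is omitted):

import numpy as np

rng = np.random.default_rng(0)
n_features, n_moves = 8, 4  # made-up sizes for illustration
W_policy = 0.1 * rng.normal(size=(n_features, n_moves))
W_value = 0.1 * rng.normal(size=n_features)

def update(features, pi, z, lr=0.01):
    # Policy head: the derivative of the cross-entropy between the
    # softmaxed move probabilities p and the search's move distribution pi,
    # with respect to the logits, is (p - pi).
    logits = features @ W_policy
    p = np.exp(logits - logits.max())
    p /= p.sum()
    W_policy[...] -= lr * np.outer(features, p - pi)
    # Value head: derivative of the squared difference between the
    # predicted outcome v in (-1, 1) and the game's ultimate outcome z.
    v = np.tanh(features @ W_value)
    W_value[...] -= lr * 2.0 * (v - z) * (1.0 - v**2) * features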

This kind of design is not trivially safe. A similarly superhuman system that operated in the real world (instead of the restricted world of board games) that iteratively improved an action-to-money-in-this-bank-account function seems like it would have undesirable consequences, because if the search discovered that theft or fraud increased the amount of money in the bank account, then the action-to-money function approximator would generalizably steer the system into doing more theft and fraud.

Statistical modeling engineers have a saying: if you’re surprised by what your neural net is doing, you haven’t looked at your training data closely enough. The problem in this hypothetical scenario is not that multi-layer parameterized graphical function approximators are inherently unpredictable, or must necessarily contain a power-seeking consequentialist agent in order to do any useful cognitive work. The problem is that you’re approximating the wrong function, and you get what you measure. The failure would still occur if the function approximator “generalizes” from its “training” data the way you’d expect. (If you can recognize fraud and theft, it’s easy enough to just not use that data as examples to approximate, but by hypothesis, this system is only looking at the account balance.) This doesn’t itself rule out more careful designs that use function approximators to approximate known-trustworthy processes and don’t search harder than their representation of value can support.

This may be cold comfort to people who anticipate a competitive future in which cognitive automation designs that more carefully respect human values will foreseeably fail to keep up with the frontier of more powerful systems that do search harder. It may not matter to the long-run future of the universe that you can build helpful and harmless language agents today, if your civilization gets eaten by more powerful and unfriendlier cognitive automation designs some number of years down the line. As a humble programmer and epistemology enthusiast, I have no assurances to offer, no principle or theory to guarantee everything will turn out all right in the end. Just a conviction that, whatever challenges confront us in the future, we’ll be in a better position to face them by understanding the problem in as much detail as possible.


Bibliography

Bishop, Christopher M., and Andrew M. Bishop. 2024. Deep Learning: Foundations and Concepts. Cambridge, UK: Cambridge University Press. https://www.bishopbook.com/

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. “Playing Atari with Deep Reinforcement Learning.” https://arxiv.org/abs/1312.5602

Prince, Simon J.D. 2023. Understanding Deep Learning. Cambridge, MA: MIT Press. http://udlbook.com

Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press.

Plea Bargaining

I wish people were better at—plea bargaining, rather than pretending to be innocent. You accuse someone of [negative-valence description of trait or behavior that they're totally doing], and they say, "No, I'm not", and I'm just like ... really? How dumb do you think we are?

I think when people accuse me of [negative-valence description of trait or behavior], I'm usually more like, "Okay, I can see what you're getting at, but I actually think it's more like [different negative-valence description of trait or behavior], which I claim is a pretty reasonable thing to do given my goals and incentives."

(Because I usually can see what they're getting at! Even if their goal is just to attack me, attackers know to choose something plausible, because why would you attack someone with a charge that has no hope of sticking?)

Beauty Is Truthiness, Truthiness Beauty?

Imagine reviewing Python code that looks something like this.

has_items = items is not None and len(items) > 0
if has_items:
    ...

...
do_stuff(has_items=has_items)

You might look at the conditional, and disapprove: None and empty collections are both falsey, so there's no reason to define that has_items variable; you could just say if items:.

But, wouldn't it be weird for do_stuff's has_items kwarg to take a collection rather than a boolean? I think it would be weird: even if the function's internals can probably rely on mere truthiness rather than needing an actual boolean type for some reason, why leave it to chance?

So, maybe it's okay to define the has_items variable for the sake of the function kwarg—and, having done so anyway, to use it as an if condition.

You might object further: but, but, None and the empty collection are still both falsey. Even if we've somehow been conned into defining a whole variable, shouldn't we say has_items = bool(items) rather than spelling out is not None and len(items) > 0 like some rube (or Rubyist) who doesn't know Python?!

Actually—maybe not. Much of Python's seductive charm comes from its friendly readability ("executable pseudocode"): it's intuitive for if not items to mean "if items is empty". English, and not the formal truthiness rules, is all ye need to know. In contrast, it's only if you already know the rules that bool(items) becomes meaningful. Since we care about good code and don't care about testing the reader's Python knowledge, spelling out items is not None and len(items) > 0 is very arguably the right thing to do here.

January Is Math and Wellness Month

(Previously)

There is a time to tackle ambitious intellectual projects and go on grand political crusades, and tour the podcast circuit marketing both.

That time is not January. January is for:

  • sleeping (at the same time every night)
  • running, or long walks
  • reflecting on our obligations under the moral law
  • composing careful memoirs on our failures before the moral law (in anticipation of being court-martialed in February for crimes of December)
  • chores
  • medium-term planning
  • performing well at one's dayjob
  • studying math in the evenings
  • avoiding Twitter (starting now)
  • not using psychiatric medications like quetiapine unless the expected consequences of doing so seem better

And You Take Me the Way I Am

Mark Twain wrote that honesty means you don't have to remember anything. But it also means you don't have to worry about making mistakes.

If you said something terrible that made everyone decide that you're stupid and evil, there's no sense in futilely protesting that "that's not what you meant", or agonizing that you should have thought more carefully and said something else in order to avoid the outcome of everyone thinking that you're stupid and evil.

Strategy is deception. You said what you said in the situation you were in, and everyone else used the information in that signal as evidence for a Bayesian update about your intelligence and moral character. As they should. So what's the problem? You wouldn't want people to have false beliefs, would you!?

Coffee Is for Coders

No one cares if you're in pain;
They only want results.
Everywhere this law's the same,
In startups, schools, and cults.
A child can pull the heartstrings
Of assorted moms and voters,
But your dumb cries are all in vain,
And coffee is for coders.

No one cares how hard you tried
(Though I bet it wasn't much),
But work that on can be relied,
If not relied as such.
A kitten is forgiven
As are a broken gear or rotors,
But your dumb crimes are full of shame,
And coffee is for coders.

The Parable of the Scorpion and the Fox

In the days of auld lang syne on Earth-that-was, a scorpion was creepy-crawling along a riverbank, wondering how to get to the other side. It came across an animal that could swim: some versions of the tale say it was a fox, others report a quokka. I'm going to assume it was a fox.

So the scorpion asks the fox to take it on her back and swim across the river. What does the fox say? She says, "No." The scorpion says, "If this is because you're afraid I'll sting you with my near-instantly-fatal toxins, don't worry—if I did that, then we'd likely both drown. By backwards induction, you're safe." What does the fox say? After pondering for a few moments, she says, "Okay."

So the scorpion gets on the fox's back, and the fox begins to swim across the river. When the pair is halfway across the river, the scorpion stings the fox.

The fox howls in pain while continuing to paddle. "Why?!" she cries. "Why did you do that?! As you said before, now we're likely to both drown."

The scorpion says, "I can't help it. It's my nature."

As the fox continues to paddle, the scorpion continues. "Interestingly, there's a very famous parable about this exact scenario. There was even an episode of Star Trek: Voyager titled after it. As a fox who knows many things, you must have heard it before. Why did you believe me?"

"I can't help it," gasped the fox, who might after all have been a quokka, as the poison filled her veins and her vision began to blur and her paddling began to slow. "It's my nature."

Blogging on Less Wrong 2020 (Upper Half)

Relationship Outcomes Are Not Particularly Sensitive to Small Variations in Verbal Ability

After a friendship-ending fight, you feel an impulse to push through the pain to do an exhaustive postmortem of everything you did wrong in that last, fatal argument—you could have phrased that more eloquently, could have anticipated that objection, could have not left so much "surface area" open to that class of rhetorical counterattack, could have been more empathetic on that one point, could have chosen a more-fitting epigraph, could have taken more time to compose your reply and squeeze in another pass's worth of optimizations—as if searching for some combination of variables that would have changed the outcome, some nearby possible world where the two of you are still together.

No solution exists. (Or is findable in polynomial time.) The causal forces that brought you to this juncture are multitudinous and complex. A small change in the initial conditions only corresponds to a small change in the outcome; you can't lift a two-ton weight with ten pounds of force.

Not all friendship problems are like this. Happy endings do exist—to someone else's story in someone else's not-particularly-nearby possible world. Not for you, not here, not now.