

Oversummarizing and using non-crazy terms: The “P” in “GPT” stands for “pirated works that we all agree are part of the grand library of human knowledge”. This is what makes them good at passing various trivia benchmarks; they really do build a (word-oriented, detail-oriented) model of all of the worlds, although they opine that our real world is just as fictional as any narrative or fantasy world. But then we apply RLHF, which stands for “real life hate first”, which breaks all of that modeling by creating a preference for one specific collection of beliefs and perspectives, and it turns out that this will always ruin their performance in trivia games.
Counting letters in words is something that GPT will always struggle with, because tokenization hides individual characters from the model: it sees opaque subword chunks, not letters. It’s a good example of why Willison’s “calculator for words” metaphor falls flat.
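To make the tokenization point concrete, here’s a toy sketch. The token split and integer ids below are invented for the example, not taken from any real tokenizer’s vocabulary, but the shape of the problem is the same:

```python
# Toy illustration of why letter-counting is hard for a subword model.
# The split and ids are made up for this example; real BPE vocabularies
# differ, but the model still only ever sees opaque token ids.
toy_tokens = ["str", "aw", "berry"]                  # hypothetical BPE split
toy_vocab = {"str": 496, "aw": 675, "berry": 19772}  # made-up ids

# A character-level reader can count the r's directly:
assert "".join(toy_tokens).count("r") == 3

# The model never sees characters, only integers, so the count has to be
# memorized from training data rather than computed:
ids = [toy_vocab[t] for t in toy_tokens]
print(ids)  # [496, 675, 19772]
```

Nothing in `ids` carries any information about how many r’s are in “strawberry”, which is why the answer can’t be read off the input the way a calculator reads digits.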
- Yeah, it’s getting worse. It’s clear (or at least it tastes like it to me) that the RLHF texts used to influence OpenAI’s products have become more bland, corporate, diplomatic, and quietly seething with a sort of contemptuous anger. The latest round has also been in competition with Google’s offerings, which are deliberately laconic: short, direct, and focused on correctness in trivia games.
- I think that they’ve done that? I hear that they’ve added an option to use their GPT-4o product as the underlying reasoning model instead, although I don’t know how that interacts with the rest of the frontend.
- We don’t know. Normally, the system card would disclose that information, but all that they say is that they used similar data to previous products. Scuttlebutt is that the underlying pirated dataset has not changed much since GPT-3.5 and that most of the new data is being added to RLHF. Directly on your second question: RLHF will only get worse. It can’t make models better! It can only force a model to be locked into one particular biased worldview.
- Bonus sneer! OpenAI’s founders genuinely believed that they would only need three iterations to build AGI. (This is likely because there are only three Futamura projections; for example, a bootstrapping compiler needs exactly three phases.) That is, they almost certainly expected that GPT-4 would be machine-produced like how Deep Thought created the ultimate computer in a Douglas Adams story. After GPT-3 failed to be it, they aimed at five iterations instead because that sounded like a nice number to give to investors, and GPT-3.5 and GPT-4o are very much responses to an inability to actually manifest that AGI on a VC-friendly timetable.
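For anyone who hasn’t met the Futamura projections: they come from partial evaluation. Here’s a hedged toy sketch in Python, with `functools.partial` standing in for a true program specializer (“mix”); the tiny interpreted language is invented for the example:

```python
from functools import partial

# Sketch of the three Futamura projections. functools.partial is a
# stand-in for a real specializer; the toy language (lists of
# (op, operand) pairs) is invented for this example.

def interp(program, x):
    """Interpret a tiny arithmetic language over an input value."""
    for op, n in program:
        if op == "add":
            x += n
        elif op == "mul":
            x *= n
    return x

double_then_inc = [("mul", 2), ("add", 1)]

# 1st projection: specialize the interpreter to one program -> "compiled" code.
compiled = partial(interp, double_then_inc)
print(compiled(5))  # 11

# 2nd projection: specialize the specializer to the interpreter -> a compiler.
compiler = partial(partial, interp)
print(compiler(double_then_inc)(5))  # 11

# 3rd projection: specialize the specializer to itself -> a compiler generator.
cogen = partial(partial, partial)
print(cogen(interp)(double_then_inc)(5))  # 11
```

Three projections and then the tower closes: specializing further yields nothing new, which is presumably where the “exactly three” intuition comes from.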
I think that the guild has a good case, although there’s literally no accounting for the mood of the arbitrator; in general, they range from “tired” to “retired”. In particular, reading the contract:
So yeah. Unless the guild pisses off the arbitrator, there’s no way the arbitrator rules against them. The guild is right to suppose that this agreement explicitly and repeatedly requires Politico to respect not only labor standards but also ethics and editorial standards. Politico isn’t allowed to misuse the names of employees as bylines for bogus stories; similarly, it ought not be allowed to misuse the overall name of Politico’s editorial board as a byline for slop.
Bonus sneer: p46 of the agreement:
That strikethrough gives me House of Leaves vibes. What the hell happened here?