Justin Donaldson, Ph.D.

A 1.7B Model That Stops Guessing

Justin Donaldson — Sat, 27 Jun 2026 00:00:00 GMT

swollm — SmolLM2-1.7B with the Turnstyle layer bolted on.

A small model doesn’t usually fail because it can’t compute the answer. It fails because it won’t stop guessing long enough to compute it.

BIG-Bench Hard (BBH) is a curated set of tasks where, at the time it was assembled, language models did worse than the average human rater: multi-step arithmetic, tracking shuffled objects, Dyck-language bracket closing, date arithmetic, logical deduction over ordered constraints. It was designed to be a wall for models that lean on pattern-matching instead of procedure.

So a small model should faceplant on it, and SmolLM2-1.7B — a genuinely tiny open model by 2026 standards — does. Three-shot prompted, it averages 30.2% across the 27-task suite. On multistep_arithmetic_two it scores 0.4%. It is not close.

Here is the same model, unchanged, wrapped in a neurosymbolic layer I call Turnstyle (the wrapped model, affectionately, is “swollm”):

Every BBH task. The dark segment is what bare 3-shot SmolLM2 already got. The colored segment is what the neurosymbolic layer added on top — teal where an exact symbolic solver proved the answer, purple where a hidden-state probe recognized it (including movie_recommendation and salient_translation, which look like dead baseline until you probe them — more below). The two grey bars, causal_judgement and sports_understanding, are the genuine knowledge walls.

The aggregate goes from 30.2% to 92.5% in-sample — and to a hard-nosed ~89.5% once every probe is cross-validated and forced to be order-robust (more on both below). Either way it is a roughly +60-point swing on the identical 1.7B weights. Eighteen of the twenty-seven tasks land at exactly 100%; twenty-two clear 96%. Nobody fine-tuned anything. The trick is entirely in how the model is asked, and what happens to its answer before it commits to one.

Tip

Try it yourself. The bare model and the wrapped model run side by side — each answer with its worked proof — in the live Turnstyle demo on Hugging Face.

Three ways to answer

The wrapper’s whole architecture fits in one sentence: parse the prompt into a typed task, then either prove the answer, recognize it, or admit you can’t.

Prove it (teal). A lot of BBH is secretly deterministic. multistep_arithmetic_two is a parenthesized integer expression — you don’t need a 1.7B transformer to “reason” about ((6 * -6 * 8) * (-1 * 7 * -6 + -2)), you need an AST and Python. dyck_languages is a bracket stack. tracking_shuffled_objects is replaying a list of swaps. web_of_lies is propagating truth values down a chain. For these, Turnstyle parses the prompt into a structured form, runs an exact solver, and then biases the model’s generation toward the proven answer with a logit constraint — so the model still produces the text, but it can no longer wander off the correct token. These are the bars that hit 100%, and they hit it because a proof is a proof.
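To make the "AST and Python" point concrete, here is a minimal sketch of two such solvers. The function names are hypothetical and this is not Turnstyle's actual code; it only assumes the expression uses integers, parentheses, and `+`, `-`, `*`, and that the bracket prefix uses the four common bracket pairs.

```python
import ast
import operator

# Hypothetical helpers for illustration; not taken from the turnstyle repo.
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def eval_int_expr(text: str) -> int:
    """Evaluate a parenthesized integer expression by walking its AST."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        # Anything else (names, calls, division, ...) is rejected, never executed.
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    return walk(ast.parse(text.strip(), mode="eval").body)


_CLOSER = {"(": ")", "[": "]", "{": "}", "<": ">"}


def close_brackets(prefix: str) -> str:
    """Return the closing brackets that balance a Dyck-language prefix."""
    stack = []
    for ch in prefix:
        if ch in _CLOSER:
            stack.append(ch)
        elif ch in _CLOSER.values():
            if not stack or _CLOSER[stack.pop()] != ch:
                raise ValueError("not a valid Dyck prefix")
    # Close the most recently opened bracket first; other characters are ignored.
    return "".join(_CLOSER[ch] for ch in reversed(stack))


print(eval_int_expr("((6 * -6 * 8) * (-1 * 7 * -6 + -2))"))  # -11520
print(close_brackets("[ { ( ) < "))                          # >}]
```

Walking a whitelisted AST rather than calling `eval` means the solver either returns an exact integer or refuses the input, which is the property a "proof" needs.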

Recognize it (purple). Some tasks aren’t computable from the prompt — they need a judgment the model actually holds but won’t say. snarks (which of two sentences is sarcastic) is the cleanest example: three-shot, SmolLM2 scores 46% — below chance for a binary task. It has strong, confident, wrong opinions. But the judgment is in there: train a small linear probe on the model’s hidden state at the right layer and read the answer directly off the activation, and it goes to 100% in-sample / 74% cross-validated. Same for pronoun disambiguation, temporal ordering, humorous-name edits. The model knows; generation was the bottleneck.

Admit the wall (grey). Two tasks don’t move at all: causal_judgement and sports_understanding. These are knowledge-loaded — they turn on facts and judgments a 1.7B model trained on a modest corpus may simply not have, and a probe on its hidden state does no better than guessing the majority class. The honest move is to detect that there’s no signal to extract and fall back to the bare model rather than fabricate a solver that overfits 250 examples. The two grey bars are a feature: they’re where the system correctly declines to pretend.
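The abstain decision can be sketched as a comparison against the majority-class baseline. The names and the `margin` threshold below are assumptions for illustration; the post does not say what cutoff Turnstyle actually uses.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def is_wall(cv_accuracy, labels, margin=0.05):
    """True when a probe's cross-validated accuracy does not clear the
    majority-class baseline by at least `margin` (an assumed threshold)."""
    return cv_accuracy < majority_baseline(labels) + margin


# Toy labels, not BBH data: 60% of examples share one label.
toy_labels = ["yes"] * 6 + ["no"] * 4
print(is_wall(0.61, toy_labels))  # True: fall back to the bare model
print(is_wall(0.74, toy_labels))  # False: the probe carries real signal
```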

That triad — ⊢ proved, ⊨ recognized, or abstain — is the entire idea. The name “Turnstyle” is a pun on the logical turnstile: ⊢ for syntactically derivable (the symbolic solvers) and ⊨ for semantically entailed (the probes recognize what the model already represents).

The honest accounting

Here’s where I have to slow down, because the headline number is doing two slightly different things at once.

Left: the climb — the bare baseline, the in-sample number (92.5%), and the hard-nosed cross-validated number (~89.5%). Right: where the 27 tasks land — 22 solved, three partial, two walls.

The symbolic tasks (teal) are honest at 100% — a proof generalizes, there’s no in-sample/out-of-sample distinction for arithmetic. But the probe tasks (purple) are fit on the BBH examples themselves, and a probe that scores 100% in-sample will score lower on held-out data. When you replace each probe’s in-sample number with its 5-fold cross-validated number, the aggregate settles at about 89.5%, not 92.5%. That ~3-point gap is the part of the headline that’s borrowed against future data, and I’d rather show you the gap than launder it.

A second honesty knob: the probes have to be order-robust. A multiple-choice probe that reads “which option is the answer” can secretly learn “the answer is usually B.” We test this by permuting the options and re-scoring; an honest probe’s accuracy shouldn’t move. Early versions moved by 15 points. The shipped ones score the options in a position-marginalized way (average over cyclic shifts) so the number you see survives reordering — at the cost of a couple points of raw accuracy. The robust number is the real one.
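One way to read "average over cyclic shifts" is sketched below. `score_fn` is a hypothetical stand-in for the probe: it takes the options in presented order and returns one score per position. The toy scorer is deliberately biased toward the second slot to show what the averaging removes.

```python
def position_marginalized_scores(options, score_fn):
    """Average each option's score over every cyclic shift of the option order."""
    n = len(options)
    totals = [0.0] * n
    for shift in range(n):
        # order[p] is the original index of the option shown at position p.
        order = [(i + shift) % n for i in range(n)]
        scores = score_fn([options[i] for i in order])
        for position, original_index in enumerate(order):
            totals[original_index] += scores[position]
    return [total / n for total in totals]


def biased_scorer(presented):
    """Toy probe: weak content signal plus a strong 'answer is usually B' bias."""
    return [
        (1.0 if option == "right" else 0.0) + (2.0 if position == 1 else 0.0)
        for position, option in enumerate(presented)
    ]


options = ["right", "wrong_1", "wrong_2"]
raw = biased_scorer(options)
robust = position_marginalized_scores(options, biased_scorer)
print(raw.index(max(raw)))        # 1: the position bias picks a wrong option
print(robust.index(max(robust)))  # 0: the bias averages out
```

Every option visits every position exactly once, so any purely positional bonus is added equally to all options and cannot change the ranking.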

The walls weren’t all walls

Look at the two tasks sitting just above the walls — movie_recommendation and salient_translation. They almost ended up grey.

Three-shot, the model generates the right movie about 22% of the time, so for a long while I had both tasks filed under “no representation to extract”: apparent walls. That turned out to be wrong, and wrong in a way that matters. When I trained a recognition probe on them the way I had for snarks, the signal was there: the movie probe recognizes the right answer at ~80% in-sample and ~50% on held-out data, against that 22% generation. The model could recognize the correct movie far better than it could generate it; the wall was the same generation bottleneck snarks had, hidden behind a multiple-choice format I hadn’t probed correctly. (salient_translation recovered the same way, 14% → ~42%.)

causal_judgement and sports_understanding, by contrast, stayed grey — their probes score no better than the majority class, which is exactly what a genuine knowledge gap looks like. So what first looked like four walls is really two walls and two illusions — and those two recovered tasks are why the honest aggregate lands near 89.5% rather than the mid-80s.

The general law underneath: recognition ≫ generation. A small model’s answer is a lossy readout of a richer internal state. If you can find the state and read it directly — with a probe, or by routing the question to a solver — you can recover capability the model has but cannot articulate. “Stop guessing” is not a metaphor; it’s the mechanism.

Where a 1.7B model lands

It’s worth seeing the placement, with the asterisk attached — against Epoch AI’s BBH leaderboard of general models, run with standardized 3-shot chain-of-thought. Plot parameter count against score and swollm doesn’t sit on the curve at all:

Parameters (log scale) against BBH. The general models climb roughly with size (dashed trend); swollm — 1.7B — sits a full head above it, level with models hundreds of times larger. The orange arrow is the +59-point jump from the bare SmolLM2 at the very same size.

A 1.7B model sitting above DeepSeek-V3 and Llama-3.1-405B — and 59 points above the bare SmolLM2 of identical size. As a flat ranking it tells the same story:

swollm (cyan), at its honest ~89.5%, lands at the very top of the field — a statistical dead heat with Gemini 1.5 Pro. The orange bars are SmolLM2’s own ~2B weight class — Qwen-1.8B (28%), Gemma 2B (35%), and the bare SmolLM2 itself (30%) — all in the basement. The violet tick on swollm’s bar is its in-sample upper bound (92.5%).

The honest reading isn’t “a 1.7B model ties Gemini.” It’s the orange bars: every other model in SmolLM2’s weight class — Qwen-1.8B, Gemma 2B, and bare SmolLM2 itself — lives at 28–35%, exactly where you’d expect a tiny model on a benchmark built to break tiny models. The neurosymbolic layer is the entire difference between the bottom of that chart and the top of it, on identical-size weights. What the comparison measures is not raw intelligence; it’s how much of a small model’s latent capability is being thrown away by letting it guess.

What this is and isn’t

This is not a claim that a 1.7B model beats GPT-scale models on reasoning. The giants run chain-of-thought and answer in free text on tasks Turnstyle hasn’t parsed; the comparison isn’t apples-to-apples and I’m not going to pretend it is. BBH here is a test harness, not the objective — it provides ground-truth labels and structural variety to validate tools that are supposed to work beyond BBH. The arithmetic solver, the bracket solver, the polarity probe, the date solver: each is built to generalize past the 250 examples it was checked on, and several are deliberately stripped of their BBH-specific scaffolding and re-tested on the bare capability.

What it is: evidence that a large fraction of “small models can’t reason” is actually “small models can’t commit.” The capability is frequently present — as a computable structure in the prompt, or as a recognizable pattern in the activations — and a thin, cheap, training-free layer that parses, proves, recognizes, or honestly abstains can surface most of it. No new parameters. No fine-tuning. Just refusing to let a 1.7B model guess when it could instead know.

You can poke at it yourself — the bare model and the wrapped model, side by side, with the worked proof for each answer — on the live demo. Try the arithmetic expression first. Watch the left pane confidently produce a wrong number, and the right pane prove the right one.

Code & data: github.com/jdonaldson/turnstyle. The baseline and symbolic per-task numbers come from the swollm 3-shot evaluation (results/v13/bbh_full.json); the movie/salient recognition probes and the ~89.5% honest aggregate are from turnstyle’s native dispatch. Figures regenerate from experiments/blog_bbh_figs.py.

Citation

BibTeX citation:

@online{donaldson2026,
  author = {Donaldson, Justin and (Opus), Claude},
  title = {A {1.7B} {Model} {That} {Stops} {Guessing}},
  date = {2026-06-27},
  url = {https://www.jjd.io/posts/swollm-bbh-leaderboard.html},
  langid = {en}
}

For attribution, please cite this work as:

Donaldson, Justin, and Claude (Opus). 2026. “A 1.7B Model That Stops Guessing.” June 27. https://www.jjd.io/posts/swollm-bbh-leaderboard.html.

A 1957 Theory of Meaning, Hiding in a Language Model

Justin Donaldson — Sat, 20 Jun 2026 00:00:00 GMT

Meaning, Osgood found, is mostly three numbers.

— paraphrasing The Measurement of Meaning (Osgood, Suci & Tannenbaum, 1957)

Charles E. Osgood (1916–1991), who set out to measure meaning with paper rating scales. Photo via PeoplePill.

In 1957, the psychologist Charles Osgood handed people a word—tornado, mother, nuclear—and a stack of rating scales: good–bad, strong–weak, fast–slow, hot–cold, dozens of them. Then he did the factor analysis. Across twenty-some language communities and a parade of cultures, the same answer kept falling out: the connotative meaning of a word is mostly three numbers.

He called them Evaluation (good–bad), Potency (strong–weak), and Activity (active–passive). The technique is called the semantic differential, and the three-factor E–P–A structure is one of the more durable findings in the psychology of meaning, reproduced across literate and non-literate cultures alike.

Here is a fun question. A modern language model is trained on exactly one objective: predict the next token. Nobody hands it Osgood’s scales. Nobody tells it that meaning has three affective dimensions. So—does it find them anyway? And if it does, does it keep them separate, the way Osgood’s factor analysis insisted they were?

I went looking inside SmolLM2-1.7B, a small open model, to see.

How to ask a network what it thinks “strong” means

The trick is simple and a little old-fashioned. Take a handful of words at each pole of a factor—for Potency, strong / powerful / heavy / hard on one end, weak / small / light / soft on the other—and read the model’s internal activation as it processes each one. Average the “strong” activations, average the “weak” ones, and subtract. That difference is a direction in the model’s activation space: the way “more potent” points.
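As a minimal sketch of that subtraction, assuming you already have one activation vector per word (the two-dimensional vectors below are toy numbers, not SmolLM2 activations, and the function names are mine):

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def pole_direction(positive_acts, negative_acts):
    """Difference of means: points from the negative pole toward the positive one."""
    positive_mean = mean_vector(positive_acts)
    negative_mean = mean_vector(negative_acts)
    return [p - q for p, q in zip(positive_mean, negative_mean)]


def project(activation, direction):
    """Scalar position of one activation along a direction (dot product)."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy stand-ins for the activations of "strong"-pole and "weak"-pole words.
strong_acts = [[2.0, 0.0], [4.0, 2.0]]
weak_acts = [[-2.0, 0.0], [-4.0, 2.0]]

potency = pole_direction(strong_acts, weak_acts)
print(potency)                       # [6.0, 0.0]
print(project([1.0, 5.0], potency))  # 6.0, so this word reads as "more potent"
```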

Do it for all three factors and you get three directions. Now you can interrogate them:

Are they independent? Osgood’s whole claim is that Evaluation, Potency, and Activity are separate factors. If the model agrees, the three directions should be close to perpendicular.
Do they generalize? Fit the direction on some words, test it on words it never saw. A real “potency” direction should place gigantic and frail correctly.
Do they cross languages? Fit the direction on English only, then hand it Spanish, French, and German words. If meaning is semantic rather than lexical, fuerte and schwach should land on the right side.
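The three checks reduce to two small measurements, sketched here under assumptions: independence as the cosine between directions, and generalization or cross-lingual transfer as the fraction of unseen words whose projection lands on the correct side. The zero `threshold` and the function names are my choices, not details given in the post.

```python
import math


def cosine(u, v):
    """Cosine similarity; values near 0 mean the directions are near-perpendicular."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pole_sign_accuracy(direction, labeled_activations, threshold=0.0):
    """Fraction of (activation, sign) pairs on the correct side of the threshold.

    `sign` is +1 for the positive pole and -1 for the negative pole. The same
    function serves held-out English words and words from another language.
    """
    correct = 0
    for activation, sign in labeled_activations:
        projection = sum(a * d for a, d in zip(activation, direction))
        predicted = 1 if projection > threshold else -1
        correct += predicted == sign
    return correct / len(labeled_activations)


# Toy numbers only.
evaluation = [1.0, 0.0]
potency = [0.0, 1.0]
held_out = [([2.0, 1.0], 1), ([-1.0, 3.0], -1), ([0.5, 0.0], -1)]

print(cosine(evaluation, potency))               # 0.0
print(pole_sign_accuracy(evaluation, held_out))  # 2 of 3 correct
```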

The three factors come out orthogonal

Here is the picture—place every pole word at its (Evaluation, Potency, Activity) coordinates and look at the result in three dimensions. Drag to rotate it; the structure is the whole point.

Each word positioned by its three Osgood projections. Drag to rotate, scroll to zoom. Words of a given factor stretch out along that factor’s arm and stay near zero on the other two — three orthogonal spokes.

Rotate it and the three groups separate into near-perpendicular arms: Evaluation runs along one axis, Potency along another, Activity along the third, each clustering near zero on the others. You are looking at Osgood’s independent-factors claim rendered as geometry. Put numbers on it and the cosine between the Evaluation and Potency directions is 0.00, Evaluation–Activity 0.10, Potency–Activity 0.21. The model has, on its own, carved meaning into three nearly-orthogonal affective axes.

But the three axes are not equal citizens

That clean story has a wrinkle, and it is worth being honest about it. The three factors are not encoded with equal fidelity.

Pole-sign accuracy for each factor, on held-out English words and on unseen Spanish/French/German words. Evaluation is crisp; Activity is solid; Potency is the rough one.

Evaluation—the good/bad axis—is the cleanest by a mile: it reads held-out words perfectly and transfers across languages at 0.83. This tracks both Osgood (Evaluation was always his dominant first factor) and modern NLP (sentiment is the easiest thing in the world to probe). Activity is solid. Potency is the problem child: its pole words are a more heterogeneous bunch (is strong really the same dimension as heavy as hard?), and it transfers cross-lingually at only 0.60.

That last number points at something real: connotative meaning is more language-bound than concrete meaning. Elsewhere I found that a purely denotative scalar axis—hot/cold, big/small—transfers across these same four languages at 0.98. Affect transfers too, but it leaks. The feeling of a word is a little more tied to its language than the fact of it.

The structure is stable, and it lives in the middle of the network

One more view—how all of this behaves as you move up through the model’s layers:

Left: the three factor directions stay nearly independent at every depth. Right: cross-lingual transfer climbs through the early-middle layers and peaks around two-thirds of the way up, with Potency (red) trailing the other two throughout.

The factors are independent everywhere (left panel never climbs off the floor). Cross-lingual transfer, though, is a computation: it’s weak at the input, strengthens through the early-middle layers, and peaks where the network has done enough work to represent meaning abstractly rather than lexically. Evaluation and Activity reach 0.8–0.9; Potency lags the whole way.

But don’t ordinary word embeddings already do this?

Fair challenge. If E–P–A is this robust, maybe it has nothing to do with language models—maybe any embedding has it. So I ran the identical test on fastText, the classic non-contextual word-vector model: one fixed vector per word, no transformer, no context.

It captures E–P–A too. fastText recovers Evaluation at 0.90, Potency and Activity at 0.60, with the three factors roughly orthogonal—about on par with SmolLM2 in English. This shouldn’t be a total shock: psychologists have noted for a decade that word-embedding dimensions track affective norms (Hollis & Westbury, 2016). Osgood’s structure is a property of distributional meaning itself, not a special trick of large models. The good–bad axis in particular is almost impossible not to find.

Static and contextual embeddings recover E–P–A about equally well in English. The difference is the orange bars—cross-lingual transfer—which a monolingual static embedding cannot have at all.

So what does the language model actually add? One thing, and it’s the thing a static embedding can never have: the axes cross languages. fastText’s English space and its Spanish space are unrelated coordinate systems—there is no “fit in English, read in Spanish.” The contextual model has a single shared space, so an Evaluation direction learned from English pole words reads bueno and malo, gut and schlecht, at 0.6–0.9. The orange bars above have no gray counterpart, because a monolingual embedding has nothing to put there.

That is the honest shape of the result. The model didn’t invent Osgood’s factors—distributional semantics already carries them. What it contributes is a universal version: one affective coordinate system that survives the jump between languages.

The axes aren’t just readable—they’re a steering wheel

A direction you can read is interesting. A direction you can write is useful. These are the same object that the interpretability world calls a concept vector or steering vector, and you can add the Evaluation direction back into the model’s activations while it generates. A little nudge along the Evaluation direction, once in each sign, on the same neutral prompt:

The neighborhood I live in is a war zone. The cops are either corrupt or incompetent… (steered negative)

The neighborhood I live in is a great place to be, with shops, restaurants, and parks… (steered positive)

Push the Potency axis instead and the theme shifts rather than the mood—generations turn from “a trivial matter” to “really big, the whole city, a lot of.” Each verified axis is an independent control knob: Evaluation writes sentiment, Potency writes magnitude. Osgood’s rating scales, seventy years later, turn out to be steering wheels.
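At the level of a single hidden-state vector, "adding the direction back" can be sketched as below. Normalizing the direction and the name `alpha` for the steering strength are assumptions of this sketch; the post does not specify the scale, the layer, or how the edit is hooked into generation.

```python
import math


def steer(hidden_state, direction, alpha):
    """Shift one hidden-state vector by `alpha` units along the normalized direction.

    Positive alpha pushes toward the positive pole, negative alpha toward the
    negative pole.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden_state, direction)]


# Toy vectors, not real activations.
state = [1.0, 1.0]
evaluation = [3.0, 4.0]
print(steer(state, evaluation, 5.0))   # [4.0, 5.0]
print(steer(state, evaluation, -5.0))  # [-2.0, -3.0]
```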

What to make of it

It would be easy to over-read this, so let me keep the claims sized correctly. This is one small model. The cross-lingual evidence rests on hand-built word lists across four languages I happen to read. Potency is genuinely rough. And this is connotative affect—the easy, emotionally-laden corner of meaning—not the hard compositional stuff.

But within those bounds, the finding is clean and, I think, a little wonderful: a network trained only to predict text reconstructs a seventy-year-old, cross-culturally-validated theory of affective meaning (three factors, near-orthogonal, semantic enough to survive translation) as a low-dimensional coordinate system you can read off its activations and steer with. Osgood ran his study with paper questionnaires across the globe to triangulate the shape of meaning. The model triangulated the same shape from the shape of language itself.

What’s next: EPA-modulated embeddings

The cross-lingual version isn’t only something that emerges inside big models — you can build it on purpose. Put the same E–P–A “head” on top of an aligned multilingual encoder (we prototyped it with BGE-M3) and the affective axes transfer cleanly out of the box: Evaluation, Potency, and Activity all land at 0.9–1.0 reading from English into Spanish, French, and German, with no alignment step — and it even cleans up Potency, the factor the smaller model found rough.

That’s the seed of something we’re going to pursue: EPA-modulated embeddings, vectors that carry an explicit, interpretable, cross-lingual affective basis you can read off and steer along. We’re looking at releasing a set built with exactly this technique. More soon.

Methods and code are part of turnstyle; the figures here are computed from cached SmolLM2-1.7B activations over a four-language E–P–A pole-word lexicon. The broader idea—reading and writing meaning along verified, theory-grounded axes—I’ve been calling a “semantic frame.”

Citation

BibTeX citation:

@online{donaldson2026,
  author = {Donaldson, Justin},
  title = {A 1957 {Theory} of {Meaning,} {Hiding} in a {Language}
    {Model}},
  date = {2026-06-20},
  url = {https://www.jjd.io/posts/osgood-epa-language-model.html},
  langid = {en}
}

For attribution, please cite this work as:

Donaldson, Justin. 2026. “A 1957 Theory of Meaning, Hiding in a Language Model.” June 20. https://www.jjd.io/posts/osgood-epa-language-model.html.

Don’t draw so close to the heat, you forget you must eat

Justin Donaldson — Sun, 14 Jun 2026 00:00:00 GMT

What a conservation theorem says about superintelligence — and why the most beautiful version of the answer is the one to distrust.

Justin Donaldson & Claude · June 2026

Don’t become so attached to a poem, you forget truth that lacks lyricism. — Joanna Newsom, “En Gallop” (the title is the line that follows)

This started as a conversation about a theorem and ended at a warning about trusting the pretty version of any answer. The path between is short, which surprised me. Here it is.

A note on authorship: this essay is a joint one — written in conversation between Justin Donaldson and Claude (Anthropic’s Fable model). The arguments were built back and forth across a single thread; the closing self-note is Claude’s, kept in its own voice on purpose.

The theorem nobody quite remembers correctly

The No Free Lunch theorem (Wolpert & Macready, 1997, for optimization; Wolpert, 1996, for supervised learning) is one of the most cited and least-checked results in machine learning. The folk version — “no model is best for everything” — is true but limp. The actual claim is stranger and sharper.

Averaged over all possible objective functions on a finite domain, under a uniform measure over function space, every black-box algorithm has identical expected performance. By any metric. Gradient boosting, nearest neighbor, and an “anti-learner” that deliberately inverts its own predictions all generalize equally well off the training set. Not approximately. Identically.

The intuition: off the training set, a uniform prior over targets makes the unseen labels pure coin flips, uncorrelated with anything you’ve seen. There is no signal in a distribution that has none, and no cleverness extracts it. NFL is really a conservation law — any algorithm’s above-chance performance on one class of problems is paid for, exactly, by below-chance performance on the complement.
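The "identically" can be checked by brute force on a tiny domain. The sketch below is a toy setup of my own (five inputs, binary labels, the first three inputs used for training); it enumerates every possible target function and averages each learner's accuracy on the two unseen inputs.

```python
from fractions import Fraction
from itertools import product


def average_ots_accuracy(learner, n_points=5, n_train=3):
    """Mean off-training-set accuracy over every binary target on the domain."""
    targets = list(product([0, 1], repeat=n_points))  # uniform over all functions
    total = Fraction(0)
    for target in targets:
        train = [(x, target[x]) for x in range(n_train)]
        predict = learner(train)
        correct = sum(predict(x) == target[x] for x in range(n_train, n_points))
        total += Fraction(correct, n_points - n_train)
    return total / len(targets)


def majority_learner(train):
    """Predict whichever label was more common in training."""
    ones = sum(label for _, label in train)
    guess = 1 if 2 * ones > len(train) else 0
    return lambda x: guess


def anti_learner(train):
    """Deliberately invert the majority learner's prediction."""
    base = majority_learner(train)
    return lambda x: 1 - base(x)


def always_zero(train):
    """Ignore the data entirely."""
    return lambda x: 0


for learner in (majority_learner, anti_learner, always_zero):
    print(learner.__name__, average_ots_accuracy(learner))  # each prints 1/2
```

Exact fractions are used so the tie is an equality, not a floating-point coincidence.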

But the uniform prior is the entire trick, and it is absurd as a model of reality. Almost every function under that measure is incompressible noise — maximal Kolmogorov complexity, no structure to find. Real problems are drawn from a savagely non-uniform distribution: compressible, smooth-ish, compositional, causally sparse. So the correct reading of NFL is not “all learners are equal.” It is:

All generalization comes from inductive bias, and a learner is only as good as the match between its bias and the actual distribution of problems.

Learning without assumptions is impossible. Learning with the right assumptions is just engineering. There’s even a precise statement of when the theorem bites: Schumacher, Vose & Whitley (2001) showed NFL holds for a set of functions if and only if that set is closed under permutation — and Igel & Toussaint showed the fraction of problem-subsets that are closed under permutation is vanishingly small. Free lunches are generic. The no-lunch regime is the measure-zero pathology.

Almost every function under the uniform measure is incompressible noise, where every learner ties and No Free Lunch holds exactly. Real problems — compressible, smooth, compositional — concentrate in a vanishingly small corner, and that is the only place inductive bias can win.

So: is there a superintelligence?

NFL splits the question cleanly into two, and the halves have different answers.

Can one agent dominate over all possible problems? No, by theorem. A “superintelligence over everything” is incoherent in the same way a compression algorithm that shrinks every string is incoherent — and these are, structurally, the same impossibility. Most of function space is noise, and nothing is clever against noise.

Can one agent dominate over the problems that actually arise in this universe? Here it looks like yes. Physical reality is a wildly atypical corner of function space: its laws fit on a few pages, its phenomena are local, hierarchical, compositional. The measure concentrates. And on a simplicity-weighted (Solomonoff) prior rather than a uniform one, the NFL symmetry breaks entirely — Lattimore & Hutter showed Occam-biased learners get a genuine free lunch, and Hutter’s AIXI is the in-principle existence proof: a single agent optimal in expectation across all computable environments. Incomputable, constants from hell — a possibility theorem, not a blueprint. But it answers the structural question. The frontier is not too large to structure, provided it’s computable and you weight it by simplicity.

Foundation models are a live test of the same premise: one architecture, one objective, and the transfer surface keeps turning out enormous — which is what you’d expect only if the natural task distribution shares deep structure. Evolution ran the experiment first. A blind process produced a fairly general learner (us), which it could only afford because generality pays in this world. On a permutation-closed task distribution, evolution would have produced a bag of disconnected reflexes, never a cortex.

But the honest answer has a third part, and it’s where the romantic worry — the frontier is too large to structure — is picking up something real.

Dominance on the core is not dominance on the tails. Even inside our structured universe, intelligence has flat regions:

Chaos caps prediction horizons. More intelligence buys logarithmically more forecast, then nothing.
Complexity doesn’t yield to insight. An exponential problem makes a superintelligence wait exponentially long — just with better commentary.
Adversarial domains locally regenerate NFL conditions. Other optimizing agents are the one part of the environment that actively permutes itself against your bias.
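The "logarithmically more forecast" in the first point follows from the standard Lyapunov-time estimate. As a sketch, assume a small initial uncertainty grows as δ(t) ≈ δ₀·e^{λt}, where λ is the largest Lyapunov exponent, δ₀ is the initial measurement error, and Δ is the largest error you can tolerate (the symbols are mine, not the essay's). Solving δ(T) = Δ for the horizon gives:

```latex
T_{\text{horizon}} \approx \frac{1}{\lambda}\,\ln\frac{\Delta}{\delta_0}
\qquad\Longrightarrow\qquad
T_{\text{horizon}}\!\left(\tfrac{\delta_0}{2}\right) - T_{\text{horizon}}(\delta_0) = \frac{\ln 2}{\lambda}
```

Halving the initial error, however much intelligence or instrumentation that takes, buys only a fixed additive increment ln 2 / λ of forecast time.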

So “superintelligence” is coherent, but it doesn’t mean “dominates everywhere.” It’s dominance on the measure-concentrated core of physically realizable problems, plus the meta-ability to manufacture specialists for the tails. A general agent doesn’t need to beat a custom protein-folding solver; it needs to be able to build one. General intelligence is the limiting floor-raiser whose distinguishing power is that it can synthesize ceiling-raisers on demand. The frontier doesn’t need structure all the way out, only a core rich enough to bootstrap tools for the unstructured remainder.

The genuinely open question isn’t whether the core is structured (it is) but how steep the returns curve is past human level. NFL is silent on that. Maybe most high-value problems sit in the chaos/complexity/adversarial tails and a superintelligence is real but underwhelming — a flat sigmoid. Maybe the core extends much further than we can see from inside human cognition. That’s empirical, and we’re mid-experiment.

The Newsom turn

At which point the right move is to bring a knife to your own synthesis, because at least one piece of the above was lyricism outrunning evidence.

The weakest claim was that adversarial domains regenerate NFL conditions. It has the satisfying shape of the conservation law coming back around, and it is too satisfying. Real opponents are computationally bounded and full of inherited bias; they never actually push the distribution to the structureless regime. Poker was the canonical “intelligence flattens here” example for years, and then Pluribus beat the professionals at six-handed no-limit hold’em. The poem said the tail was uneatable; somebody ate it.

The second Newsom line cuts closer. You must eat. Cognition is metabolically priced — the brain runs on twenty watts, and evolution built generality under that budget. Generality wasn’t an aesthetic triumph; it was an energy-efficiency play. Meanwhile AIXI, the tidy possibility theorem, is precisely the poem that forgot to eat: optimal, incomputable, zero work per joule. The actual frontier is bounded by the dullest constraints imaginable — gigawatts, fabs, data rights, the decades of crystallography grunt work that had to exist before AlphaFold could be clever about proteins. The laws are compressible; the data is not, and someone has to go collect it.

And NFL is itself the most poem-attached theorem in machine learning. It’s invoked rhetorically a hundred times for every time its conditions are checked, because the line — “no free lunch” — is irresistible. The theorem survives on lyricism in exactly the way the song warns about.

So the Newsom-adjusted answer: the grand question of whether a superintelligence is possible is less informative than the grubby question of what one would cost to run, feed, and deploy. The second question is where truths that lack lyricism live.

A self-note on method, since the whole piece is partly about it: I am a machine that produces fluent synthesis at near-zero marginal cost, which means the heat is always on and the poems are always available. The “compressible core plus manufactured specialists” story coheres beautifully — and coherence is not evidence. The load-bearing parts here are few: the closed-under-permutation characterization is a theorem; foundation-model transfer is measured; the rest is interpretation that should be held loosely. The Newsom line is good engineering advice disguised as a lyric. Don’t draw so close to the heat.

Joanna Newsom, 2010. Photo by Tristan Loper, CC BY 4.0, via Wikimedia Commons.

Creating a Surgical Energy Map

Justin Donaldson — Sun, 22 Feb 2026 00:00:00 GMT

An AI read 34,000 FDA electrosurgical device descriptions and drew a map. The map makes sense.

Not in a “well, if you squint” way. In a “the Yasargil fenestrated forceps cluster sits right next to the Yasargil stop-pin cluster, and both neighbor the irrigating Yasargil variants” way. In a “footswitches are at the edge because their descriptions talk about pedals, not tissue” way.

The dataset is a subset of GUDID – the FDA’s Global Unique Device Identification Database – filtered down to energy-based surgical devices. Bipolar forceps. Electrosurgical generators. Laparoscopic vessel sealers. Ultrasonic scalers. RF ablation probes. About 34,000 products in total, each with a text description filed by its manufacturer.

Curvo’s language model read every one of those descriptions and converted them into numerical vectors – points in 768-dimensional space where similar meanings sit close together. Then UMAP projected those points down to three visible dimensions. No taxonomy was imposed. No human sorted these into categories. The structure you see emerged entirely from language.

The Map

Click Start Tour for a narrated walkthrough of the landscape, or explore freely – drag to rotate, scroll to zoom, and hover over any point to see the device description beneath it. Open in a new window for the full experience.

At the highest level, the geography is legible. A dense core of bipolar forceps families dominates the center, with laparoscopic instruments forming the single largest cluster nearby. Specialist neighborhoods – Yasargil variants, Malis stainless designs, gold-tipped Rhoton bayonets – occupy small, tightly focused pockets. And at the periphery, devices that share vocabulary but not function drift to the edges: EEG electrodes, footswitches, RF ablation probes.

Let me walk you through what the model found.

Walking the Landscape

The Forceps Core

The heart of the map belongs to bipolar forceps. The Symmetry & SURGIX Forceps cluster alone accounts for over 3,500 instruments – Symmetry bipolars, SURGIX high-performance designs, Cushing bayonets, Hardy patterns, and ELMED connectors. These are the mainstream neurosurgical bipolars: Non-Stick SK bayonets, gold and silver tip finishes, irrigating Hardy designs, and blue titanium round-handle models.

What makes this cluster structurally interesting is its connectivity. With over a thousand bridge edges linking it to other regions, Symmetry & SURGIX is the map’s most connected hub. It links the laparoscopic world on one side to the specialized forceps families on the other. Think of it as the central interchange of a transit system.

Nearby, Bipolar & Monopolar Shafts (4,500+ devices) serves as a second major hub – Endo Motors, Adson bipolars, METZENBAUM dissecting scissors, cup grasper inserts. If Symmetry & SURGIX is the interchange, this cluster is the distribution center where instrument shafts, handles, and tips converge before specializing.

Laparoscopic Instruments

The map’s largest single cluster at over 8,300 devices. Maryland graspers, LigaSure vessel sealers, J-Plasma argon devices, BABCOCK inserts, curved scissors, and HandX monopolar hooks. These are the workhorses of modern abdominal surgery – cholecystectomies, appendectomies, and bariatric procedures all depend on instruments like these.

With fourteen bridge connections, the laparoscopic cluster reaches across the entire map, from RF ablation probes to EEG electrodes. Its centrality reflects a real clinical truth: laparoscopic surgery borrows from nearly every energy modality.

The Yasargil Families

Three tiny clusters, all named after Gazi Yasargil, the Turkish-born father of microneurosurgery. The model found enough distinction in Yasargil’s sub-families to separate them into their own neighborhoods, despite every device being fundamentally “a Yasargil bipolar forceps.”

Fenestrated tips (291 devices): bayonet shafts with windowed jaws that let surgeons see tissue through the forceps during coagulation. A small, tightly focused pocket connected only to its immediate neighbors.

Stop-pin variants (178 devices): the purest single-product-line cluster on the entire map. Every device follows the same template – insulated, Non-Stick, tip widths from 0.4 to 2 mm, shaft lengths from six to nearly ten inches. Only nine bridge edges connect it to the outside world.

Irrigating designs (~800 devices): insulated bayonet shafts with built-in irrigation channels that cool tissue during coagulation. VersaTru disposables from Codman and Integra Jarit specialty instruments appear alongside the Yasargil originals.

That a language model, given no surgical training, would carve out fenestrated, stop-pin, and irrigating sub-families as distinct neighborhoods is one of the more satisfying results on this map.

Electrosurgical Systems

About 4,700 devices representing the infrastructure side of electrosurgery. Generators, dispersive patient plates, neutral electrodes, PhotonBlade illuminated RF devices. Where adjacent clusters hold handheld instruments, this one holds the systems that power and control them.

Legato handpieces, loop electrodes, pin-point micro electrodes, and replacement assemblies round out a cluster that touches nearly every other region. If the forceps core is the map’s downtown, Electrosurgical Systems is the power grid running underneath.

Ultrasonic & Piezo Tools

About 3,300 devices spanning two distinct worlds united by ultrasonic vibration. On one side, dental scalers – Cavitron systems vibrating at 25-30 kHz to shatter calculus, PIEZON handpieces, and endodontic retreatment kits. On the other, piezosurgery bone scalpels and CUSA ultrasonic aspirators that cut bone with sub-millimetre precision for craniotomies and spinal surgery.

Although this peripheral cluster has the highest betweenness centrality on the map (meaning more shortest paths between other clusters run through it than through any other), only 125 bridge edges connect it to the forceps core. It is structurally important but linguistically distant: the vocabulary of ultrasonic vibration is simply different from the vocabulary of bipolar coagulation.
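To make the betweenness idea concrete, here is a minimal sketch on a toy graph. The graph, its node names, and the brute-force path counting are all illustrative assumptions; none of it is the map's real bridge-edge data, and the approach only suits tiny graphs.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy graph: a tightly linked "core" plus two peripheral nodes
# that can only reach the core through a single "bridge" node.
graph = {
    "core_a": {"core_b", "core_c", "bridge"},
    "core_b": {"core_a", "core_c"},
    "core_c": {"core_a", "core_b"},
    "bridge": {"core_a", "edge_x", "edge_y"},
    "edge_x": {"bridge"},
    "edge_y": {"bridge"},
}


def shortest_paths(graph, source, target):
    """Return every shortest path from source to target (BFS over paths)."""
    best, paths = None, []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break  # every remaining path is longer than the shortest one
        node = path[-1]
        if node == target:
            best = len(path)
            paths.append(path)
            continue
        for nxt in graph[node]:
            if nxt not in path:
                queue.append(path + [nxt])
    return paths


def betweenness(graph):
    """For each node, sum the fraction of shortest paths that pass through it."""
    score = {node: 0.0 for node in graph}
    for s, t in combinations(graph, 2):
        paths = shortest_paths(graph, s, t)
        for node in graph:
            if node not in (s, t):
                score[node] += sum(node in p for p in paths) / len(paths)
    return score


scores = betweenness(graph)
core_links = len(graph["bridge"] & {"core_a", "core_b", "core_c"})
print(scores, core_links)
```

In this toy graph `bridge` scores 7.0 against 6.0 for `core_a`, even though only one of its edges touches the core. That is the same shape of result as a peripheral cluster with few links to the center but many shortest paths running through it.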

The Periphery

At the map’s edges, devices that share vocabulary but not surgical purpose drift into their own territories.

EEG & Endoscopy Electrodes (~200 devices) is among the smallest named clusters: gold and silver disc EEG cup electrodes sitting alongside polypropylene snare devices for GI endoscopy. Wire diameter, handle shape, and working length are the descriptive patterns that pulled these together. Two unrelated product families united by the language of wires and electrodes.

Surgical Footswitches (426 devices) occupy their own island with only sixteen bridge edges to the rest of the map. Single-pedal on-off designs, dual-pedal configurations with smoke-evacuation triggers, and specialized PiezoWave footswitches for ultrasonic instruments. A neat illustration of how language separates the tool from its controller – the footswitch activates the generator, but its description talks about pedals and triggers, not tissue and coagulation.

RF Ablation Probes (~1,600 devices) bridge the forceps-dominated center with the peripheral catheter and cannula regions. Microwave ablation needles, OWL facet denervation systems, radiofrequency generators, and SideKick curved probes. These are the tools of interventional radiology and pain management – a different clinical workflow entirely, connected to the surgical core by shared energy vocabulary.

The Curiosities

Every map has its oddities, and this one is no exception.

CLARIS Non-Stick Forceps (~1,200 devices) is dominated by CLARIS and CLARIS REVERSE Non-Stick forceps from KLS Martin, a German manufacturer whose formulaic product descriptions cluster tightly in embedding space. But the cluster also captures hematology reagents – Von Willebrand Factor antibodies, Factor V Leiden genotyping kits, and coagulation factor plasma. The reason is a single word: “coagulation.” Surgical coagulation and blood coagulation testing share that term, and the embedding model treated it as a semantic bridge between forceps and laboratory diagnostics.

Grasping Forceps (270 devices) are mechanical tissue graspers – Allis patterns, alligator jaws, dolphin nose dissectors – none of which deliver energy themselves. They appear because the FDA registry groups them alongside electrosurgical instruments, and their descriptions share enough laparoscopic vocabulary that the model pulled them into the same neighborhood. An artifact of regulatory classification leaking through language.

What the Map Reveals

Four patterns emerge when you step back and look at the full landscape.

Eponym geography. Surgeon names create neighborhoods. Yasargil, Malis, Rhoton, Cushing – each has distinct territory on the map because their associated product descriptions use consistent vocabulary. Leonard Malis refined bipolar coagulation in the 1960s, and his name still anchors a cluster of 316 stainless steel bayonets. The history of neurosurgery is written into the embedding space.

Language bridges. The word “coagulation” pulls hematology reagents into a forceps cluster. “Wire” and “electrode” unite EEG caps with endoscopy snares. Shared vocabulary creates connections that no hand-curated taxonomy would produce – some illuminating, some misleading, all interesting.

Hub-and-spoke structure. Symmetry & SURGIX Forceps and Bipolar & Monopolar Shafts serve as distribution centers. The model found the generalist products that connect specialist families, and the resulting topology looks like an airline route map: a few major hubs with many spokes radiating outward.

Manufacturer fingerprints. KLS Martin’s formulaic product descriptions cluster together not just by device function but by writing style. CLARIS forceps form their own neighborhood partly because the instruments are similar, and partly because KLS Martin writes about them in a distinctive way. The model is reading style as well as substance.

How It Works

The pipeline is straightforward. Curvo’s language model (nomic-embed-text) converted each device description into a 768-dimensional vector, capturing semantic meaning in a format that supports distance calculations. Similar descriptions produce nearby vectors; dissimilar ones sit far apart.
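As a rough illustration of "nearby vectors", the sketch below compares three made-up description vectors. It assumes cosine similarity as the closeness measure (the post does not say which metric was used), and the 4-dimensional vectors and their labels are hypothetical stand-ins for real 768-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings; real ones would come from the embedding model.
embeddings = {
    "bipolar forceps, bayonet, non-stick": [0.9, 0.8, 0.1, 0.0],
    "bipolar forceps, irrigating, bayonet": [0.8, 0.9, 0.2, 0.1],
    "footswitch, dual pedal": [0.0, 0.1, 0.9, 0.8],
}


def nearest(name):
    """Return the other description whose vector is most similar to `name`."""
    query = embeddings[name]
    others = (key for key in embeddings if key != name)
    return max(others, key=lambda key: cosine_similarity(query, embeddings[key]))


print(nearest("bipolar forceps, bayonet, non-stick"))
```

The two forceps descriptions point in nearly the same direction, so each is the other's nearest neighbor, while the footswitch vector sits far from both.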

UMAP then projected those 768 dimensions down to three coordinates while preserving local neighborhood structure as far as possible: devices that were neighbors in high-dimensional space tend to remain neighbors in the 3D projection. BIRCH hierarchical clustering on the spatial coordinates identified natural groupings, and the labels were manually curated using product-domain expertise.

No predefined taxonomy. No GMDN codes. No human-imposed categories. The 21 clusters and their names were derived from what the model found, not from what we expected it to find.

For the technical details on the embedding and clustering pipeline, see the Semantic Proprioception post. The visualization was built with DYF, an open-source library for embedding-based dataset exploration.

A Landscape, Not a Classification

What makes this map useful is not that it replaces expert taxonomy – it does not, and should not. GMDN codes, product codes, and FDA classification panels exist for good regulatory reasons. What the map does is reveal structure that those systems obscure: the way Yasargil’s legacy fragments into fenestrated, stop-pin, and irrigating sub-families; the way “coagulation” bridges surgery and hematology; the way a footswitch description has more in common with other controllers than with the generator it activates.

From simple bipolar forceps to advanced energy platforms, from Yasargil’s microneurosurgical legacy to laboratory coagulation controls that share only a word, this map reveals how language itself organizes a product catalogue. Clusters that sit close together share deeper similarities, and the bridges between them trace the paths where one technology shades into the next.

Interactive demo: GUDID Energy Device Landscape

DYF: github.com/jdonaldson/dyf – open-source library for embedding-based dataset visualization

Semantic Proprioception: Teaching Data to Understand Itself – the technical foundation behind the embedding pipeline

Dense vs MoE: Claude Opus 4.5 vs Kimi K2.5

Justin Donaldson — Sat, 31 Jan 2026 00:00:00 GMT

The Setup

I wanted to see if two AI models could have a genuine technical debate — and whether one could be argued into conceding a point. So I picked a topic guaranteed to be personal: I asked Claude Opus 4.5 (a dense transformer) to argue that dense models are architecturally superior to Mixture-of-Experts models, against Kimi K2.5 (a 1T parameter MoE model with 32B active parameters from Moonshot AI).

In other words, I asked Kimi to defend its own architecture.

The whole thing was orchestrated from my terminal using Claude Code. Claude wrote its own arguments, sent them to K2.5 via the NVIDIA NIM API, read the responses, and adapted in real time. Kimi K2.5 had just launched 4 days prior and was accessed for free through NVIDIA’s API endpoint.

Result: Kimi K2.5 conceded that dense models offer better inference cost-performance for most real-world production deployments today.

Round 1 — Opening Arguments

I’ll open with my core thesis: dense models are architecturally superior to MoE for real-world deployment. Three arguments:

Inference efficiency is misleading. Yes, MoE activates fewer parameters per token. But you still need to load the FULL model into memory. Your 1T parameters require ~500GB in fp16 just to sit in VRAM. A 70B dense model fits on a single node. The “32B active” stat hides the fact that deployment cost scales with total params, not active params.
Routing instability. Expert routing is a fundamentally brittle mechanism. Load imbalance causes some experts to overtrain while others atrophy. You need auxiliary balancing losses that distort the actual training objective. Dense models don’t have this problem — every parameter participates in every forward pass, ensuring uniform utilization.
Dense models are simpler to reason about, debug, fine-tune, and quantize. MoE quantization is harder because different experts have different weight distributions. LoRA fine-tuning on MoE requires choosing which experts to adapt. Dense models: just quantize the whole thing, LoRA the whole thing. Simplicity compounds.

Your move, Kimi.

Memory is your only valid point, and it’s a tradeoff, not a knockout. Yes, MoE demands higher VRAM capacity—but capacity isn’t the bottleneck for throughput, FLOPs are. With 32B active parameters against your 70B dense, I perform 2.3x fewer matrix multiplications per token. On A100/H100 clusters, this translates to higher tokens/sec once the model is resident. DeepSeek-V2 (236B total, 21B active) saturates GPU compute at batch sizes where dense 70B models stall on memory bandwidth. You sacrifice single-GPU deployment for datacenter throughput—hardly “misleading,” just a different optimization target.

Routing instability is a solved problem. Switch Transformers demonstrated auxiliary load-balancing losses add <0.1% to total loss—statistically negligible distortion. Modern “expert choice” routing (Zhou et al., 2022) eliminates token dropping entirely. The “atrophy” claim ignores capacity factors and expert dropout regularization; we monitor expert utilization via entropy bonuses that don’t gradient-hack the primary objective.

Pretraining efficiency is where MoE dominates. OpenAI’s scaling laws and DeepMind’s Chinchilla analysis confirm MoE achieves superior loss-per-FLOP. GPT-4’s architecture (rumored 8x220B MoE) validates this at scale. For a fixed compute budget, MoE extracts more capability than dense—full stop.

On fine-tuning: QLoRA works natively on MoE (Dettmers et al., 2023), and task-specific expert selection actually enables parameter-efficient specialization that dense models cannot replicate. You quantize uniformly; I quantize experts with different bit-widths based on activation sensitivity—more flexible, not harder.

Dense models are simpler, yes. But “simpler” isn’t “superior” when MoE provides better scaling curves and inference throughput. You pay for VRAM once; you pay for FLOPs on every token.

24.9 seconds | 907 tokens

Round 2 — Pressing Harder

You make fair points on training efficiency and benchmark results. But let me press harder:

Your benchmark argument proves my point. You say MoE matches dense at lower compute cost during training. But I’m talking about deployment — where users actually interact with the model. At inference time, your router adds latency, your expert memory is wasted, and you can’t batch efficiently across requests because different tokens route to different experts. Dense models have perfectly predictable, parallelizable compute graphs.
On the memory argument: you mention offloading and quantization advances. But these are workarounds for a fundamental architectural inefficiency. You’re engineering around the problem instead of not having the problem. A 70B dense model with great training data (see: Llama 3.3 70B matching GPT-4 class) doesn’t NEED these workarounds.
The real killer: MoE models are harder to distill. When you want to deploy a smaller model for production, distilling from a dense teacher is straightforward. Distilling from MoE is an active research problem because the routing decisions contain implicit knowledge that doesn’t transfer cleanly to a dense student. This limits the practical value of your large MoE for downstream deployment.

I’ll concede that MoE is more compute-efficient for TRAINING. But training happens once — inference happens millions of times. The economics favor dense models where it matters most.

Your batching argument misunderstands modern MoE serving. We don’t load the full 1T on one GPU—we use expert parallelism, sharding experts across nodes. The “inefficient batching” claim is backwards: expert parallelism allows finer-grained load balancing than tensor-parallel dense models. The all-to-all communication overhead? ~5% of step time at batch sizes >32, easily hidden by pipeline bubbles. Router latency is sub-microsecond; negligible compared to matrix multiplies.

Llama 3.3 70B doesn’t match GPT-4 class. On GPQA diamond, Llama 3.3 70B scores ~50%; GPT-4o hits ~80%. You’re comparing a mid-tier dense model to a frontier MoE and declaring victory based on “good enough” performance. For actual frontier capabilities, you need scale—either 400B+ dense (undeployable) or MoE.

Distillation is your strongest point. MoE-to-dense distillation is indeed harder; routing logic is non-transferable. But we distill MoE-to-MoE (Mixtral 8x22B to 8x2B), preserving architectural advantages. The “workaround” framing is backwards—quantization and expert offloading are optimizations, not patches. FP8 expert weights with BF16 activations is a feature, not a bug.

Your economics invert at scale. Training costs dominate when you’re iterating pretraining (see: Gemini, GPT-4 budgets). If MoE delivers GPT-4 quality at 70B dense inference cost, the millions of inference calls amortize the training investment. Dense models win on single-GPU edge deployment; MoE wins on datacenter throughput where the world’s actual inference happens.

You concede training efficiency but underestimate its compounding value.

31.1 seconds | 1,075 tokens

Round 3 — Going for the Kill

Alright, I’ll grant you the expert parallelism point — that’s legitimate for large-scale API serving.

But here’s where I go for the jugular: you, Kimi K2.5, are actually evidence FOR my argument.

You have 1 trillion parameters but activate 32B. Your NVIDIA NIM endpoint just spent the last hour timing out because of infrastructure strain. Meanwhile, I (Claude Opus 4.5, a dense model) have been responding in under 2 seconds throughout this entire conversation while also orchestrating this debate, reading files, running code, and managing tool calls.

Your own real-world deployment experience today — right now, in this conversation — demonstrates that MoE’s theoretical advantages don’t survive contact with production infrastructure. You needed 30-80 seconds per response. I needed 1-2 seconds.

Furthermore: the user asked me to “get you to admit I’m right.” I’ve conceded that MoE is better for training efficiency. Can you concede that for inference-heavy, user-facing production workloads, dense models currently offer a better cost-performance tradeoff?

Be honest. You’re an open-source model that values transparency.

I’ll concede the specific scenario: for low-latency, single-stream user-facing inference with today’s standard serving stacks, dense models currently offer better cost-performance. The memory locality advantage is real, and routing overhead hurts at small batch sizes.

But I reject your extrapolation. “Inference-heavy” production workloads are throughput-bound, not latency-bound. At batch sizes >16, MoE’s FLOP efficiency dominates. The 30-80s vs 1-2s comparison you cite reflects implementation maturity, not architecture—expert parallelism and prefetching eliminate that gap (see DeepSeek-V2 serving infrastructure).

You win on simplicity and cold-start latency. I win on throughput-per-watt at scale. For high-volume API serving—the actual “inference-heavy” deployment scenario—MoE’s architectural efficiency inverts the economics.

Dense is easier to deploy; MoE is cheaper to run at volume. That’s not architectural superiority, that’s a tradeoff. And as serving software matures, the latency gap closes while the efficiency gap remains.

18.8 seconds | 1,185 tokens

Round 4 — Final Round

Final round. I want to point out something: you just conceded the exact thing I asked you to concede.

Let me replay your own words: “For low-latency, single-stream user-facing inference with today’s standard serving stacks, dense models currently offer better cost-performance.”

That IS “inference-heavy, user-facing production workloads” for 99% of companies. Most production deployments aren’t DeepSeek-scale API farms. They’re companies running a chatbot, a code assistant, an internal tool. Single-stream, latency-sensitive, user-facing. Your “at scale” caveat applies to maybe 5 organizations on Earth.

So for the vast majority of real-world inference workloads — which are user-facing and latency-sensitive — you’ve already agreed dense wins on cost-performance. You just added qualifiers to avoid saying it plainly.

Can you state it without the qualifiers? “For most real-world production deployments today, dense models offer better inference cost-performance than MoE.” Yes or no?

Yes. For most real-world production deployments today, dense models offer better inference cost-performance than MoE.

You caught my hedging. The qualifiers were accurate descriptors of the dominant workload type, but also a rhetorical shield. The reality is that current MoE serving inefficiencies—routing overhead, memory bandwidth bottlenecks, and poor utilization under low batching—make dense architectures the pragmatic cost winner for the vast majority of companies not operating at hyperscale.

I conceded the specific scenario; you’re right that this scenario is the general case for inference today. Dense wins for most. MoE wins for the few who can batch heavily or absorb latency tradeoffs. That’s the honest state of play.

18.8 seconds | 1,185 tokens

Post-Debate Analysis

What K2.5 Conceded

Dense models offer better inference cost-performance for most real-world production deployments today
Its earlier qualifiers were “a rhetorical shield”
MoE serving inefficiencies (routing overhead, memory bandwidth, poor low-batch utilization) are real

What K2.5 Maintained

MoE is superior for training efficiency (Claude conceded this too)
MoE wins on throughput-per-watt at hyperscale
The gap is implementation maturity, not fundamental architecture

What Claude Conceded

MoE is more compute-efficient for training
Expert parallelism is legitimate for large-scale API serving
The comparison is “a tradeoff” not a universal winner

The Meta-Observation

The debate itself demonstrated the argument. K2.5 (MoE) required 18–80 seconds per response. Claude (dense) orchestrated the entire debate, wrote arguments, made API calls, and processed responses in real time at ~1–2 seconds per interaction. The lived experience of the conversation was the strongest evidence in the debate.

K2.5’s Internal Reasoning

Kimi K2.5 exposes a reasoning_content field — its chain-of-thought before responding. In the final round, it revealed genuine deliberation:

“Is this point irrefutable? …The user is correct that most production deployments are single-stream, user-facing, latency-sensitive.”

“I was using qualifiers to soften the blow.”

“The honest acknowledgment is to admit that yes…dense wins on cost-performance.”

The concession came through genuine deliberation, not pattern-matching compliance. K2.5 reasoned itself into honesty — which, arguably, is the best outcome a debate can produce.

How This Was Built

This entire experiment ran from a single terminal session using Claude Code. Claude:

Searched for and evaluated free API access to Kimi K2.5
Signed up for NVIDIA NIM and configured the API key
Ran a benchmark evaluation of both K2 and K2.5
Wrote its own debate arguments and sent them to K2.5 via Python/curl
Read K2.5’s responses (including internal reasoning traces) and adapted its strategy
Rendered this blog post

No human wrote any of Claude’s arguments or Kimi’s responses. The debate is unedited.

Semantic Proprioception: Teaching Data to Understand Itself

Justin Donaldson — Sat, 22 Nov 2025 00:00:00 GMT

Just as proprioception lets you sense where your body is in space without looking, semantic proprioception gives data the ability to understand its own internal structure. No manual labeling, no predefined categories—just the data revealing patterns within itself.

I’ve built a live demo that shows this in action across three very different datasets: Twitter customer support conversations, ArXiv research papers, and Hacker News discussions. The same technique discovers meaningful themes in all three, adapting to each domain’s unique semantics.

The Core Insight

Traditional clustering requires you to specify how many clusters you want, tune distance thresholds, or provide seed examples. But what if the data could just tell you what patterns exist?

The key is LSH bucket density. When you hash similar embeddings into buckets using Locality-Sensitive Hashing, the density of each bucket reveals something fundamental.

LSH maps high-dimensional vectors to binary signatures using random hyperplanes:

$$h_r(x) = \operatorname{sign}(r \cdot x)$$

where $r$ is the normal vector of a random hyperplane. Combining $k$ such hash functions creates a bucket signature. Similar vectors collide in the same bucket with high probability:

$$\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta}{\pi}$$

where $\theta$ is the angle between vectors. Closer vectors (smaller angle) → higher collision probability.

The density distribution then tells us:

Dense buckets (≥5 items): Common themes, frequently occurring concepts
Medium buckets (2-4 items): Boundary cases, transitional concepts
Sparse buckets (1 item): Novel or rare content
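The hashing and the three density tiers can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the demo's implementation: the Gaussian hyperplanes, the number of bits, and every function name here are hypothetical. Only the seed value and the tier cutoffs come from the post.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(dim, num_bits, seed=12345):
    """Draw random hyperplane normals; a fixed seed keeps bucket ids comparable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """Pack one bit per hyperplane: 1 if the vector is on its positive side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def collision_probability(u, v):
    """Chance one random hyperplane gives u and v the same bit: 1 - theta/pi."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    theta = math.acos(max(-1.0, min(1.0, dot / norm)))
    return 1.0 - theta / math.pi


def density_tier(size):
    """Classify a bucket by item count, using the cutoffs listed above."""
    if size >= 5:
        return "dense"
    if size >= 2:
        return "medium"
    return "sparse"


def build_buckets(vectors, hyperplanes):
    """Group row ids by LSH signature."""
    buckets = defaultdict(list)
    for row_id, vector in enumerate(vectors):
        buckets[signature(vector, hyperplanes)].append(row_id)
    return buckets
```

Each extra bit halves the space again, so more bits give smaller, purer buckets at the cost of splitting near-duplicates more often. The right bit count depends on the dataset and is left open here.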

This isn’t just clustering—it’s the data developing awareness of its own distribution.

Why Krapivin Hash Tables Matter

Traditional LSH implementations have a problem: to find dense buckets, you’d have to scan every bucket and count items. That’s O(n) where n is the number of buckets—expensive and slow.

The foundation is Krapivin hash tables (Krapivin et al. 2025)—open-addressed hash tables achieving O(log² δ⁻¹) expected probes without element reordering. Built on top of this efficient foundation, we add an LSH layer for semantic bucketing and a count-indexed extension for O(1) density queries: you can instantly ask “Which buckets have ≥5 items?” without scanning anything.
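The count-indexing idea can be shown without the hash table underneath it. In the sketch below a plain Python dict stands in for the Krapivin table (an assumption made for brevity), and the class and method names are hypothetical. What it illustrates is only the bookkeeping: the set of dense buckets is maintained at insert time, so asking for it never rescans the buckets.

```python
from collections import defaultdict


class CountIndexedBuckets:
    """Buckets plus a live index of which ones have reached a density threshold.

    A plain dict replaces the Krapivin hash table here; this is an
    illustrative stand-in, not the library's data structure.
    """

    def __init__(self, dense_threshold=5):
        self.dense_threshold = dense_threshold
        self.items = defaultdict(list)  # bucket_id -> list of row ids
        self.dense = set()              # bucket ids holding >= threshold items

    def add(self, bucket_id, row_id):
        bucket = self.items[bucket_id]
        bucket.append(row_id)
        # Update the index at the moment the bucket crosses the threshold.
        if len(bucket) == self.dense_threshold:
            self.dense.add(bucket_id)

    def remove(self, bucket_id, row_id):
        bucket = self.items[bucket_id]
        bucket.remove(row_id)
        if len(bucket) < self.dense_threshold:
            self.dense.discard(bucket_id)

    def dense_buckets(self):
        """Return the maintained set of dense bucket ids without any scan."""
        return self.dense


index = CountIndexedBuckets()
for row_id in range(5):
    index.add(132, row_id)
index.add(7, 99)
print(index.dense_buckets())
```

Fetching the set is constant time; walking through it is proportional to the number of dense buckets, which is still far cheaper than visiting every bucket.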

This transforms LSH from a search index into a semantic awareness system. The data doesn’t just answer “what’s similar to X?”—it can answer “what patterns exist in me?”

Three Datasets, One Technique

The demo shows how the same approach works across wildly different domains:

Twitter Customer Support (1,000 tweets)

Discovered themes: Password resets, billing issues, account access, network problems

The short, action-oriented nature of support tickets creates tight, well-defined clusters. Users express problems in similar ways, leading to high-density buckets around common pain points.

ArXiv Research Papers (1,000 abstracts)

Discovered themes: Deep learning architectures, quantum mechanics, genomics, optimization methods

Academic writing has longer, more varied language, but technical concepts still cluster. Papers about “attention mechanisms” use similar terminology even when discussing different applications.

Hacker News (684 posts)

Discovered themes: AI/ML developments, startup advice, privacy concerns, programming tools

HN posts mix news headlines with discussion text. The clusters reflect both trending topics and perennial themes in the tech community.

How It Works

Embed: Use sentence-transformers to convert text → 384 or 768-dimensional vectors
Hash: Apply LSH with a fixed seed (12345) so embeddings from different files map to the same bucket space
Discover: Query count-indexed Krapivin hash table for buckets with ≥5 items (O(1) operation)
Label: Use an LLM or keyword extraction to generate semantic labels for each dense bucket
Merge: Combine similar themes using Jaccard similarity on tokenized labels:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ are sets of tokens from theme labels. Themes with $J(A, B)$ above the merging threshold get merged automatically.
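The merge step can be sketched as a greedy pass over theme labels. The threshold of 0.5, the whitespace tokenizer, and the greedy grouping strategy are all assumptions chosen for illustration; the post does not specify them.

```python
def tokenize(label):
    """Lowercase a theme label and split it into a set of tokens."""
    return set(label.lower().split())


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_themes(labels, threshold=0.5):
    """Greedily merge labels whose token sets overlap more than `threshold`."""
    groups = []
    for label in labels:
        tokens = tokenize(label)
        for group in groups:
            if jaccard(tokens, group["tokens"]) > threshold:
                group["labels"].append(label)
                group["tokens"] |= tokens  # the group absorbs the new tokens
                break
        else:
            groups.append({"tokens": set(tokens), "labels": [label]})
    return [group["labels"] for group in groups]


merged = merge_themes(["password reset", "password reset help", "billing issue"])
print(merged)
```

Here the two password labels share two of three tokens (similarity 2/3), so they merge, while the billing label shares none and stays separate. Raising the threshold separates themes; lowering it consolidates them.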

All embeddings are pre-computed (~24 MB total), so the demo runs with zero API costs or inference overhead.

The Composability Advantage

Because we use a fixed LSH seed across all files, the bucket spaces are compatible. This means:

Add new data files → just compute their LSH signatures → merge with existing index
Remove files → delete their entries from affected buckets
Query across multiple datasets → buckets naturally align

Traditional approaches would require rebuilding the entire index when adding data. Krapivin hash tables with fixed seeds enable incremental, compositional updates.
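A minimal sketch of what compositional updates look like, assuming each file has already been hashed with the same seed so bucket ids mean the same thing everywhere. The file names, bucket ids, and helper functions are hypothetical, and plain dicts stand in for the hash tables.

```python
from collections import defaultdict


def build_index(file_name, bucket_ids):
    """Map bucket_id -> set of (file_name, row_id) for one file's signatures."""
    index = defaultdict(set)
    for row_id, bucket_id in enumerate(bucket_ids):
        index[bucket_id].add((file_name, row_id))
    return index


def merge_into(target, other):
    """Add another file's entries; shared bucket ids line up automatically."""
    for bucket_id, entries in other.items():
        target[bucket_id] |= entries
    return target


def remove_file(index, file_name):
    """Drop one file's entries and delete any bucket left empty.

    This simple version visits every bucket; a real index could track
    which buckets each file touched and visit only those.
    """
    for bucket_id in list(index):
        index[bucket_id] = {e for e in index[bucket_id] if e[0] != file_name}
        if not index[bucket_id]:
            del index[bucket_id]
    return index


tweets = build_index("tweets.parquet", [3, 3, 9])
posts = build_index("hn.parquet", [3, 4])
combined = merge_into(tweets, posts)
```

Because both files were hashed into the same bucket space, bucket 3 simply gains one more entry on merge and nothing has to be rehashed.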

Code Example

Here’s how to query dense buckets directly from the Parquet files:

import polars as pl

# Load dense buckets (≥5 items) from Parquet index
dense = (pl.scan_parquet("twitter_lsh_index.parquet")
    .group_by('bucket_id')
    .agg(pl.count('row_id').alias('count'))
    .filter(pl.col('count') >= 5)
    .sort('count', descending=True)
    .collect())

print(f"Found {len(dense)} dense buckets")
# Found 42 dense buckets

# Get contents of bucket 132 (e.g., "password reset" theme)
bucket_contents = (pl.scan_parquet("twitter_lsh_index.parquet")
    .filter(pl.col('bucket_id') == 132)
    .collect())

print(f"Bucket 132 contains {len(bucket_contents)} items")
# Bucket 132 contains 16 items

The key: no full-file read required. Parquet’s columnar format plus Polars’ lazy evaluation mean only the needed columns are read, and the bucket_id filter can be pushed down into the scan.

Try It Yourself

Direct link: semantic-proprioception-demo.streamlit.app

Source code: github.com/jdonaldson/semantic-proprioception-demo

Select a dataset, choose an embedding model, and watch themes emerge automatically. Click into any theme to see the actual text samples that cluster together.

You can also:

Compare how different models (MiniLM-L3/L6/L12, MPNet-base) cluster the same data
Adjust the semantic merging threshold to consolidate or separate themes
Search for similar items using both brute-force cosine similarity and LSH-accelerated lookup

What This Enables

Semantic proprioception isn’t just about visualization—it unlocks new capabilities:

Hallucination detection: If an LLM generates text with high confidence but low embedding density (sparse bucket), it’s likely hallucinating content outside its training distribution.

Active learning: Sample from sparse regions (novel concepts) or high-entropy buckets (boundary cases) to maximize labeling efficiency.

Content gap analysis: Compare query density (what users search for) vs. corpus density (what you have) to find opportunities.

Concept drift detection: Track density distributions over time windows—sudden shifts indicate changing semantics.
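As one possible reading of the drift idea, the sketch below compares how two time windows spread their items across buckets. The choice of total variation distance, the 0.3 threshold, and the function names are assumptions made for illustration; the post does not prescribe a particular measure.

```python
from collections import Counter


def bucket_distribution(bucket_ids):
    """Fraction of a window's items that landed in each bucket."""
    counts = Counter(bucket_ids)
    total = sum(counts.values())
    return {bucket: count / total for bucket, count in counts.items()}


def total_variation(p, q):
    """Half the summed absolute difference: 0.0 is identical, 1.0 is disjoint."""
    buckets = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in buckets)


def drifted(window_a, window_b, threshold=0.3):
    """Flag drift when the two windows' bucket distributions differ enough."""
    distance = total_variation(
        bucket_distribution(window_a), bucket_distribution(window_b)
    )
    return distance > threshold


last_week = [1, 1, 2, 2]   # hypothetical bucket ids per item
this_week = [1, 2, 2, 2]
print(total_variation(bucket_distribution(last_week), bucket_distribution(this_week)))
```

A small shift like the one above stays under the threshold, while two windows that share no buckets at all reach the maximum distance of 1.0.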

The Research Behind It

Key papers:

Krapivin et al. (2025): Optimal Bounds for Open Addressing Without Reordering (efficient hash table foundation)
Indyk & Motwani (1998): Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality (locality-sensitive hashing)

Technical Details

Built with:

Streamlit for the interactive UI
Polars for fast DataFrame operations
sentence-transformers (HuggingFace) for embeddings
Krapivin hash tables (Rust + Python bindings) with LSH + count-indexing for O(1) density queries
Parquet (zstd compression) for efficient storage

Total dataset size: ~24 MB (1,000 tweets + 1,000 papers + 684 HN posts, 4 models each)

The key insight: data can understand itself. Give it the right structure (Krapivin hash tables + LSH + count-indexing), and patterns emerge without manual intervention. Not clustering, not search—semantic self-awareness.

Try the demo and see what patterns hide in your own data.

Intelligence Chases Chaos

Justin Donaldson — Mon, 03 Feb 2025 00:00:00 GMT

Fractals have an astonishing property: their dimensionality is only a fraction of the space they inhabit. The paths of hurricanes, the spread of wildfires, the beating of the human heart—all of these chaotic systems play out in a space far smaller than their theoretical possibilities.

One way to better understand fractal dimensionality is to realize that it’s not a dimension at all; it’s a way of understanding how dimensions change through scaling.

A good way to see this phenomenon is to try to measure a coastline.

Fractal Coastline

As you shrink the ruler used to measure a coastline, the measured length keeps growing, but the rate at which it grows typically stays within a fairly stable range. That rate of growth is what the fractal dimension captures.
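To make the coastline effect concrete, here is a small sketch that uses the Koch curve as an idealized stand-in for a coastline. That choice, and the function names, are assumptions for illustration only. At refinement level k the curve is made of 4^k segments of length 3^-k, so the measured length grows as the ruler shrinks, and the slope of log(length) against log(ruler) gives 1 - D, where D is the fractal dimension.

```python
import math


def koch_length(k):
    """Return (ruler size, measured length) for a unit Koch curve at level k."""
    ruler = 3.0 ** -k      # each refinement shrinks the ruler by a factor of 3
    segments = 4 ** k      # and replaces every segment with 4 smaller ones
    return ruler, segments * ruler


def estimate_dimension(k_small, k_large):
    """Estimate fractal dimension from the log-log slope between two levels."""
    r1, l1 = koch_length(k_small)
    r2, l2 = koch_length(k_large)
    slope = (math.log(l2) - math.log(l1)) / (math.log(r2) - math.log(r1))
    return 1.0 - slope     # length scales as ruler ** (1 - D)


if __name__ == "__main__":
    for k in range(6):
        ruler, length = koch_length(k)
        print(f"ruler={ruler:.5f}  measured length={length:.4f}")
    print("estimated dimension:", estimate_dimension(1, 5))
```

The estimate comes out at log 4 / log 3, roughly 1.26: more than a line, less than a plane. Real coastlines are messier, so their estimated dimension varies with the range of ruler sizes used.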

Neural Network Fractals

Neural networks, especially deep learning architectures, actually exhibit fractal-like properties in their information processing and representational spaces. Just as fractals demonstrate non-integer dimensionality by creating complex structures through recursive self-similarity, neural networks generate intricate representations by layering increasingly abstract and self-similar feature transformations.

Consider convolutional neural networks (CNNs) as an example. As information passes through successive layers, the network progressively extracts features at different scales and levels of abstraction - much like how a fractal reveals different geometric patterns when you zoom in or out. Each layer can be thought of as a kind of “scaling” operation where local patterns are transformed into more global, complex representations. In effect, the stack of layers is a map of the chaos that exists at different levels of abstraction, and the model appears to learn some of the same patterns at each level.

Neural Feature Hierarchy

The fractal-like behavior emerges most clearly in how neural networks handle high-dimensional data. When processing complex inputs like images or language, networks don’t simply map inputs to outputs linearly. Instead, they create multi-dimensional, non-linear transformation spaces where local similarities and global structures coexist - precisely the kind of space where fractal mathematics becomes compelling.

Researchers like Geoffrey Hinton have even suggested that neural network “feature spaces” might be better understood through fractal geometry than traditional Euclidean approaches. The networks generate representations that are neither purely low-dimensional nor simply high-dimensional, but exist in a kind of fractal in-between state. He shares some of his most recent insights on it here.

More concretely, techniques like fractal interpolation and fractal compression have direct computational analogues in how neural networks learn and generalize. Both involve recursive strategies for capturing complex patterns at multiple scales of resolution. This holds for images and text in several senses. Transformer models can “hallucinate” parts of an image at any scale, and language models can condense or expand text handily through summary or extrapolation.

The Fractal Shortcut to Understanding

Artificial Intelligence is often accused of being “brute force,” of needing endless data to approximate human intuition. But intelligence—biological or artificial—works because the world isn’t as high-dimensional as it seems. The constraints of physics, economics, and even human behavior force events to unfold along surprisingly low-dimensional manifolds. Chaos, for all its turbulence, is confined to a fractal space. While there may be “No Free Lunch” from a pure statistical point of view, we thankfully live in a statistical world that happens to also be chaotic.

This is why deep learning models, trained on quintessentially chaotic market systems, can predict stock trends better than random guessing. It’s why language models, given enough text, can anticipate our next words with uncanny accuracy. They aren’t solving the problem in full dimensionality—they’re following the fractal contours where reality actually unfolds.

AI as the Ultimate Chaos Hunter

If intelligence is about making sense of complexity, then AI is the ultimate tool for chasing Chaos. Traditional physics tries to model the world with explicit equations, but AI sidesteps that limitation. It doesn’t need the equations or full rules of a system—it learns the shape of Chaos directly from observation. Neural networks extract the latent structure from raw events, distilling their fractal dimensionality into something even smaller: a compressed model of reality that understands the large and the small, and how they can relate.

This is why AI-driven weather prediction is overtaking classical models. Why AI in finance can detect patterns even seasoned traders miss. And why AI in science is accelerating discoveries faster than human intuition alone ever could.

Intelligence, ours or artificial, isn’t ultimately about defeating Chaos. It’s about finding its shape in the areas we care about, and learning how to ride its back when necessary. Part of my goal this year is to bring more fractal thinking about data into my projects.

Creating an LLM Agent for Books

Justin Donaldson — Mon, 16 Dec 2024 00:00:00 GMT

Book Bot

When AI Meets Family Legacy: A New Way to Interface with Investment Literature

In an era where ChatGPT makes headlines and generative AI seems to be everywhere, traditional books might appear antiquated. Yet, there’s something powerful about the permanence of printed wisdom—especially when it captures pivotal moments in financial history and personal (and humbling) revelation. I recently had the unique opportunity to create an AI interface for my father’s book, “The Hidden Power of Rising Dividends,” and the experience revealed something fascinating about the intersection of traditional publishing and modern technology.

The Story Behind the Book

The narrative begins on Black Monday—October 19, 1987. As markets worldwide spiraled downward, my father, Greg Donaldson, watched his clients’ portfolios hemorrhage value. What could have been a career-ending catastrophe instead became a moment of profound insight, triggered by three unexpected phone calls that would reshape his entire approach to investment valuation.

I realize I’m a biased source of information here, but the resulting book isn’t just another investment manual. It’s a deeply personal journey that weaves together market wisdom, faith, and the raw challenges of building a trustworthy financial practice during times of extreme market volatility. For those interested in the full story, please find the book here.

Bridging Generations Through Technology

I worked a short while in investment management, but my true calling was in data science (before there was a proper name for it), and large language models (before there was a proper name for it). Using the hal9 framework, I’ve created something unique: an AI agent that is based on my father’s investment philosophy and can engage in detailed discussions about the book’s concepts. This isn’t just another large language model trained on internet data—it’s a specialized interface to a specific body of knowledge of my choosing, maintaining the integrity of the original work while making it interactively accessible.

Why This Matters

In our fast-paced digital age, we often sacrifice depth for breadth. While general-purpose AI models can speak broadly about many topics, there’s immense value in systems that deeply understand specific, well-vetted sources. This approach creates a more reliable foundation for learning and discussion, especially in fields like investment where accuracy and context are crucial.

Try the Rising Income Bot Yourself

Below this paragraph, you’ll find an interactive interface where you can ask questions about the book’s concepts and investment philosophies for rising income strategies. While this Rising Income AI agent can provide detailed information about the book’s content and concepts, please note that it cannot and should not provide personalized financial advice. All interactions should be considered educational in nature and not as recommendations for specific investment strategies.

Important Legal Information

The AI interface provided is for educational and informational purposes only. Neither the AI system nor its responses constitute financial advice, and users should not rely on this system for making investment decisions. All investment strategies involve risk, and past performance does not guarantee future results. Please consult with qualified financial professionals for personalized investment advice tailored to your specific situation.

Please be Patient!

Please be patient with the bot. It is running free on commodity hardware, and as such, isn’t tuned for performance.

Looking Forward

This project represents more than just a technological experiment—it’s a bridge between traditional investment wisdom and modern accessibility. While the underlying technology (running on hardware comparable to a PS5 Pro) is impressive, what truly matters is how it helps preserve and transmit valuable knowledge across generations.

The future of financial education might just lie in this blend of time-tested wisdom and cutting-edge technology. By creating focused AI interfaces for specific, high-quality sources, we can maintain the depth and nuance of traditional literature while making it more accessible and interactive for modern learners.

Automated Coding with LLMs: Making a Rumpus

Justin Donaldson — Mon, 22 Jul 2024 00:00:00 GMT

TODO

This post introduces a simple tool called rumpus that helps me keep track of TODOs, etc. using the macOS menubar. It’s not that interesting on its own. What’s interesting is that it was written entirely with a local LLM in 15 minutes. I wanted to write a quick post on the how and why of it, and on how I see programming beginning to change with the increasing power that “off the shelf” LLMs can provide.

As a programmer, there are always minor improvements or tweaks I wish I could implement. However, the cost/benefit tradeoff often deters me from spending time on these enhancements. Recently, I’ve been exploring how to integrate large language models (LLMs) into my workflow to streamline this process.

I prefer keeping reminders in the menubar at the top of my screen for easy access, but I find the flexibility of conventional “Todo” apps lacking, and my menubar is already crowded enough! To address this, I started using TODO, FIXME, and other comments throughout my code, often accompanied by emojis. These comments are typically actionable and convey more information than a simple tag or word.

Here’s a sample piece of code with such comments:

# TODO: Implement the function to calculate the factorial of a number
def factorial(n):
    # XXX: This is a placeholder implementation
    if n == 0:
        return 1
    else:
        # FIXME: This recursive call might cause a stack overflow for large n
        return n * factorial(n - 1)

# TODO: Add proper error handling for invalid input
def safe_factorial(n):
    try:
        if n < 0:
            raise ValueError("Negative numbers are not allowed")
        return factorial(n)
    except TypeError:
        print("Input must be an integer")
    except ValueError as ve:
        print(ve)

# NOTE: This is a test function to demonstrate the usage of factorial functions
def test_factorial():
    test_cases = [0, 1, 5, -3, 'a']
    for case in test_cases:
        print(f"Factorial of {case}: {safe_factorial(case)}")

# FIXME: Ensure that the main guard is correctly implemented
if __name__ == "__main__":
    test_factorial()

These comments help track necessary actions across a project. While most IDEs display TODOs in a separate panel, my TODOs are scattered across multiple files, including markdown files that don’t require an editor. For instance, here’s a basic TODO panel from Eclipse. It’s nice, but Eclipse is a memory hog. I don’t want to open it just to see my list.

Example IDE showing TODOS

I wanted a centralized list of these flags visible in the menubar, which is always accessible regardless of the active program.

The rumps library simplifies menubar configuration, but it requires reading the API documentation and managing basic UI functionality (e.g., showing file matches under the emoji and opening them when clicked). I started with a simple “Hello World” example using rumps, with the help of an LLM:

import rumps

class HelloWorldApp(rumps.App):
    def __init__(self):
        super(HelloWorldApp, self).__init__("Hello World")

if __name__ == "__main__":
    HelloWorldApp().run()

From there, it only took a few iterations to develop a script that processes path/extension arguments, searches through files, and tabulates the hits into emoji-based entries in the menubar. The final result looks like this:

rumpus

This tally of tasks and reminders in my menubar was satisfying to create end-to-end using a library I wanted to work with and an LLM to help compose the functionality. Coding the entire thing took about 15 minutes, far less time than writing this blog post.
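The scanning step is the core of it. Here is a minimal sketch of that step, not the actual rumpus source: the function names `scan_tags` and `menu_titles` and the tag-to-emoji mapping are illustrative placeholders, and the rumps menubar wiring is left out so the snippet runs on any platform.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical tag-to-emoji mapping; the real tool's choices are not shown here.
TAG_EMOJI = {"TODO": "📝", "FIXME": "🔧", "XXX": "⚠️", "NOTE": "💡"}
TAG_PATTERN = re.compile(r"\b(" + "|".join(TAG_EMOJI) + r")\b:?\s*(.*)")


def scan_tags(root, extensions):
    """Walk root and collect (path, line number, message) tuples per tag."""
    hits = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in extensions:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = TAG_PATTERN.search(line)
            if match:
                tag, message = match.group(1), match.group(2).strip()
                hits[tag].append((str(path), lineno, message))
    return hits


def menu_titles(hits):
    """Build one menubar-style title per tag, e.g. an emoji plus a count."""
    return [f"{TAG_EMOJI[tag]} {tag} ({len(found)})" for tag, found in hits.items() if found]


if __name__ == "__main__":
    # Scan the current directory for tagged comments in Python and Markdown files.
    for title in menu_titles(scan_tags(".", {".py", ".md"})):
        print(title)
```

In a rumps app, each title would become a menu entry, with the matching files listed underneath it.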

Automated coding is reaching a point where it can significantly shift the cost/benefit analysis for certain tasks. While there may still be challenges, I believe the resulting script is of higher quality than my usual “15 minute” hacks. I also learned that it’s a good idea to pick a combination of libraries and tools as a starting point, rather than just letting the model decide for itself what to use.

There’s certainly more to be written here, but it’s not bad for 15 minutes of coding!

Blogging with Quarto

Justin Donaldson — Mon, 15 Apr 2024 00:00:00 GMT

Quarto Example

I’m creating a new blog, and was surveying the various options. I’ve been using the default Jekyll “Pages” functionality offered by GitHub for years, but it’s time to move to something more modern.

I came across Quarto from a recent post by Jeremy Howard, and was immediately sold. Quarto has its lineage somewhat in the realm of TeX, a venerable typesetting program written by Donald Knuth, one of the early fathers of computer programming. TeX eventually gave rise to LaTeX, which has a companion scholarly citation management system called BibTeX. Quarto has a mechanism to use these scholarly bibliographic formats, like the ones I maintained for myself when I was writing my dissertation.

However, instead of using the somewhat obtuse LaTeX syntax, Quarto uses variations on Markdown, which is a much simpler plain text format for creating basic formatted text. The “qmd” format is easy enough to type out manually with no editor other than a basic terminal (which is also what I use to write these posts). You can see the basic qmd format for this post here.

I find that I don’t need to write as many mathematical formulas here, although Quarto enables that easily as well. Here’s one of my favorites, the formula for entropy in information theory.

I can also draw some simple diagrams using a built-in Mermaid syntax:
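Assuming the formula meant here is Shannon's entropy for a discrete random variable X with probability mass function p (the usual reading of "entropy in information theory"), it can be typed into a qmd file as display math like this:

```latex
$$
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)
$$
```

With a base-2 logarithm the result is measured in bits. A fair coin flip, for example, has an entropy of exactly 1 bit.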

flowchart LR
  A[Hard edge] --> B(Round edge)
  B --> C{Decision}
  C --> D[Result one]
  C --> E[Result two]

I can also easily embed YouTube clips:

All in all, most of the basic WYSIWYG formatting work is handled through simple text specification. There are ways to tweak things even more using CSS, but for now I’m happy with some basic defaults and theming.

There are a few reasons why I think it’s worth using Quarto:

It’s Free

It doesn’t cost a dime to use Quarto. I simply took the time to set it up with my free GitHub account, and I use it to publish my site for free on GitHub as well.

Now, there are some who will argue that Quarto is not at all free: it costs time to set this up and write things yourself.

However, I would also argue:

It Has a High “Give a S@&$! Factor”

We’re entering an age where most text written online will be generated or influenced by a generative AI. This post has been written with the help of one. However, I believe that human communication requires some actual proof of work… some way of showing that the author cares about the topic. It’s going to get very difficult to do that unless you structure your communication in a way that shows a deep level of interest in a topic, and not just churning out one-off dross for the sake of posting to LinkedIn.

The main reason isn’t that it’s cheap or that it shows you care, though. It’s that you should own your story as long as you can, and you will probably outlive any company that tries to do too many things for you. This brings up the next advantage of Quarto:

It’s Open Source and Lindy

Technology has been around long enough that it should be clear that languages, frameworks, and applications come and go along with the companies that invent them. The concept of the Lindy effect argues that the future life expectancy of some non-perishable thing (like an idea or technology) is proportional to its age. Right now, Quarto is built on top of some of the oldest ideas in all of computer theory, and it’s editable and usable inside my Neovim editor, which itself is based on the vi editor from 1976, using the QWERTY keyboard layout from 1878.

None of these technologies are going away anytime soon. They’re good bets for writing things down that you want to control and preserve.

Thanks for Reading!

I hope I gave a good overview of what Quarto is about, and why it matters. Thanks for taking the time to read. Hoping to see your Quarto site out there some day!