Hacking Nano-GPT into a Diffusion LLM

Note: Here, I hacked together a diffusion LLM implementation on top of nanoGPT. All the code can be found in this github repo. I've been really interested in diffusion models lately, and a really interesting application of them is language modeling. Specifically, I am talking about diffusion LLMs, where an LM iteratively refines a text output. For example, the LLaDA paper outlines a method that starts from a fixed number of masked tokens and refines that window into a coherent output. The advantage of this is that it can generate a large number of tokens in parallel, whereas autoregressive LMs can really only produce one token at a time (when not batching, as in most inference applications). ...
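The excerpt only describes the decoding idea in prose; as a rough PyTorch sketch of that LLaDA-style loop (a fixed masked window, every masked position predicted each step, the most confident predictions committed), something like the following captures the idea. The function name, mask id, and confidence-based unmasking schedule are illustrative assumptions, not the repo's actual code.

```python
import torch

def diffusion_decode(model, mask_id, length, steps, device="cpu"):
    """Fill a fixed window of masked tokens by repeatedly predicting every
    masked position and committing the most confident predictions."""
    x = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    for _ in range(steps):
        still_masked = x == mask_id            # (1, length) bool
        if not still_masked.any():
            break
        logits = model(x)                      # (1, length, vocab_size)
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)         # per-position confidence + argmax
        # only consider positions that are still masked
        conf = conf.masked_fill(~still_masked, -1.0)
        # commit roughly length/steps tokens per step, highest confidence first
        k = min(max(1, length // steps), int(still_masked.sum()))
        keep = conf.topk(k, dim=-1).indices[0]
        x[0, keep] = pred[0, keep]
    return x

# toy usage: random logits stand in for a trained masked-prediction transformer
vocab_size, mask_id = 50304, 50303
dummy_model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab_size)
print(diffusion_decode(dummy_model, mask_id, length=16, steps=4))
```

This is where the parallelism comes from: each pass over the window scores every masked position at once, so many tokens can be committed per forward pass instead of one.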

September 29, 2025 · 10 min

Thoughts on Tokenization, H-Nets, and Adaptive Compute

This post is a very unstructured set of thoughts I had after reading a blog post [1] by Albert Gu. The ideas here will be very incomplete and only reflect my current understandings, misunderstandings, and speculations. Tokenization: End-to-end tokenization schemes have always seemed to me like the natural way to learn natural language. In fact, tokenization appeared as a sort of feature-engineering trick used to reduce the computational overhead transformers face when trying to predict things like [Hello][,_][nice_][to_][meet_][you_]. In that example, the comma token might've taken some amount of 'intelligence' for a model to predict, but the whitespace following it is extremely obvious for even far less capable models to predict. Instead of wasting compute on having models learn such trivial relations in the distribution of all possible text outputs, we simply hand this to transformer models in the form of a tokenizer, by declaring that a comma followed by a space, [,_], is a single unit the model should care about. ...
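A quick way to see this merging in practice is to inspect an off-the-shelf BPE tokenizer; the snippet below uses the tiktoken library as an assumed example (the post itself doesn't reference it), and GPT-2's vocabulary groups the space differently than the [,_] notation above, but the trivial whitespace still never appears as its own token.

```python
import tiktoken

# Inspect how GPT-2's BPE splits the example sentence. The space attaches to
# the following word rather than to the comma, so the model never has to
# spend a prediction on a bare whitespace token: that regularity is baked
# into the vocabulary instead of being learned.
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Hello, nice to meet you")
print([enc.decode([t]) for t in tokens])
# ['Hello', ',', ' nice', ' to', ' meet', ' you']
```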

September 2, 2025 · 6 min