Hacking Nano-GPT into a Diffusion LLM
Note: Here, I hacked together a diffusion LLM implementation on top of nanoGPT. All the code can be found in this GitHub repo. I’ve been really interested in diffusion models lately, and a particularly interesting application of them is language modeling. Specifically, I am talking about diffusion LLMs, where an LM iteratively refines a text output. For example, the LLaDA paper outlines a method that starts from a fixed number of masked tokens and refines that window to produce a coherent output. The advantage is that a diffusion LLM can generate a large number of tokens in parallel, whereas autoregressive LMs can only produce one token at a time (when not batching, as in most inference applications). ...
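As a rough illustration of that masked-refinement loop, here is a minimal PyTorch sketch. The `model` interface, `MASK_ID`, and the confidence-based unmasking schedule are all assumptions for illustration, not the repo's actual implementation:

```python
import torch

MASK_ID = 50257          # assumed id of the [MASK] token (hypothetical)
SEQ_LEN = 64             # fixed window of masked tokens to refine
NUM_STEPS = 8            # number of refinement iterations

@torch.no_grad()
def diffusion_decode(model, seq_len=SEQ_LEN, num_steps=NUM_STEPS):
    # Start from a window that is entirely masked.
    x = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    for step in range(num_steps):
        logits = model(x)                  # assumed shape: (1, seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)     # most likely token + its confidence
        still_masked = x == MASK_ID
        # Unmask a growing fraction of positions, keeping the most confident ones.
        k = int(seq_len * (step + 1) / num_steps) - int((~still_masked).sum())
        if k > 0:
            conf = conf.masked_fill(~still_masked, -1.0)  # only consider masked slots
            topk = conf.topk(k, dim=-1).indices
            x[0, topk[0]] = pred[0, topk[0]]
    return x
```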
Lecture 2 - Constructing the Training Target
To summarize Lecture 1, given an $X_0 \sim p_{init}$, we use a flow model or a diffusion model to obtain trajectories by solving the ODE and SDE, $$ \begin{align*} \text{d}X_t &= u_t^\theta(X_t) \text{d}t \\ \text{d}X_t &= u_t^\theta(X_t) \text{d}t + \sigma_t \text{d}W_t, \end{align*} $$ respectively. Now, our goal is to find the parameters $\theta$ that make $u_t^\theta$ a good approximation of our target vector field $u_t^\text{target}$. A simple loss function we could use is the mean squared error: ...
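For reference, the mean squared error objective alluded to above has the generic regression form (written here in the excerpt's symbols; the lecture's exact formulation may differ in how the expectation is taken):

$$
\mathcal{L}(\theta) = \mathbb{E}_{t,\, X_t}\left[ \left\| u_t^\theta(X_t) - u_t^\text{target}(X_t) \right\|^2 \right].
$$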
Lecture 1 - Flow and Diffusion Models
ODEs, Vector Fields, and Flows A first-order ODE is an equation and an initial condition that defines a trajectory in time. The general form of an ODE and its initial condition is $$ \begin{align*} \frac{\text{d}}{\text{d}t} X_t &= u_t(X_t) \\ X_0 &= x_0 \end{align*} $$where $X: [0,1] \rightarrow \mathbb{R}^d, \space t \mapsto X_t$ gives us a trajectory through the time-varying vector field $u: \mathbb{R}^d \times [0,1] \rightarrow \mathbb{R}^d, (x,t) \mapsto u_t(x)$ for the initial condition $X_0$. Essentially, in an ODE we have a trajectory that follows a vector field throughout time, starting at a specific point in space. ...
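To make the trajectory picture concrete, here is a minimal sketch of simulating such an ODE with the Euler method, the simplest discretization; the vector field `u` and the rotating-field example are illustrative assumptions:

```python
import numpy as np

def euler_trajectory(u, x0, num_steps=100):
    """Simulate dX_t = u_t(X_t) dt from t=0 to t=1 with the Euler method.

    u:  callable (x, t) -> dx/dt, the time-varying vector field
    x0: array-like, the initial condition X_0
    """
    dt = 1.0 / num_steps
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for i in range(num_steps):
        t = i * dt
        x = x + u(x, t) * dt      # one Euler step along the vector field
        traj.append(x.copy())
    return np.stack(traj)

# Example: a field that rotates points around the origin in 2D.
rotate = lambda x, t: np.array([-x[1], x[0]])
path = euler_trajectory(rotate, x0=[1.0, 0.0])
```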
Introduction to Flow Matching and Diffusion Models
Here are my notes for MIT CSAIL’s course titled Introduction to Flow Matching and Diffusion Models. While I am finding the labs very helpful and am making sure to do them, I will not document my progress on them here.

Lecture 1 - Flow and Diffusion Models
Lecture 2 - Constructing the Training Target
Thoughts on Tokenization, H-Nets, and Adaptive Compute
This post is a very unstructured set of thoughts I had after reading a blog post [1] by Albert Gu. Ideas here will be very incomplete and only reflect my current understandings, misunderstandings, and speculations. Tokenization End-to-end tokenization schemes have always seemed to me like the natural way to learn natural language. In fact, tokenization appeared as a sort of feature-engineering trick we used to reduce the computational overhead transformers face when trying to predict things like [Hello][,_][nice_][to_][meet_][you_]. In that example, the comma token , might’ve taken some amount of ‘intelligence’ for a model to predict, but the whitespace following it is extremely obvious for even less capable models to predict. But instead of wasting compute on having models learn trivial relations in the distribution of all possible text outputs, we simply give this to transformer models in the form of a tokenizer, by saying that a comma followed by a space, [,_], is something they should care about. ...
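As a quick way to see how a real BPE vocabulary carves up the example sentence, here is a small sketch using tiktoken's GPT-2 encoding. Note the exact splits are a property of that particular vocabulary (GPT-2 tends to absorb whitespace into the following word rather than merging it with the comma), so this illustrates the general idea rather than the [,_] merge specifically:

```python
import tiktoken

# Inspect how a trained BPE vocabulary splits the example sentence.
enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("Hello, nice to meet you")
pieces = [enc.decode([i]) for i in ids]
print(pieces)  # whitespace gets absorbed into neighboring tokens, never predicted alone
```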
Auditing CNNs with Adversarial Vulnerability
This post is a result I came across while working on a SPAR-sponsored project mentored by Satvik Golechha. You can read our full report here: Near Zero-Knowledge Detection of Undesired Behavior [1]. Introduction and Background Say we have two distributions of data:

- $D = \{ (x_i, y_i) \}_{i=1}^n$: the intended distribution, which we want to learn
- $D_u = \{ (x_{u_i}, y_{u_i}) \}_{i=1}^m$: the undesired distribution, which exhibits harmful behavior

And two models:

- $M_D$: a model which performs well on $D$
- $M_u$: a model which performs well on $D_u$ and performs $\epsilon$-close to $M_D$ on $D$

Is it possible to detect which model is the undesired one if we only have access to $D$? On the surface, this seems like an impossible task for any general distributions and models, so to make the problem more tractable, let’s work with a concrete setup. ...
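To see why the naive approach fails, here is a minimal sketch of the only direct check the setup permits: evaluating both models on samples from $D$. The names `model`, `loader`, and the classification framing are placeholder assumptions:

```python
import torch

@torch.no_grad()
def accuracy_on_D(model, loader):
    """Accuracy over the intended distribution D. By assumption M_u is
    epsilon-close to M_D on D, so this check alone cannot separate them,
    which is what motivates auditing via adversarial vulnerability."""
    correct, total = 0, 0
    for x, y in loader:
        pred = model(x).argmax(dim=-1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```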
Induction Heads in Chronos Part 2
Previously, in the part 1 post, we found some evidence that induction heads exist in the Chronos models [1]. However, there were some things I did incorrectly and some things I wanted to further explore:

- First, my implementation of the repeated random tokens (RRT) method was incorrect. Namely, I randomly sampled over all the non-special tokens, but Chronos scales the given input such that the encoder input tokens almost always fall within a range of token ids from 1910-2187. Sampling over only this range greatly improved the induction mosaics (a corrected construction is sketched below).
- I wanted to further study how changing the number of repetitions and the lengths of the individual sequences in the RRT affects how many induction heads we detect.
- I wanted to go beyond RRT data and see if we can find any interesting inductive properties in multisine data.

Background First, let me clear up what an induction head actually is in a more concrete way than my last post. ...
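Here is a minimal sketch of the corrected RRT construction mentioned in the first bullet above, assuming the token-id range the post cites (1910-2187); sequence length, repetition count, and seeding are illustrative defaults:

```python
import torch

def repeated_random_tokens(seq_len=50, num_reps=2, lo=1910, hi=2187, seed=0):
    """Build a repeated-random-tokens (RRT) input: one random sequence,
    sampled only from the ids Chronos' input scaling actually produces,
    then tiled, e.g. [A B C ... A B C ...]."""
    g = torch.Generator().manual_seed(seed)
    base = torch.randint(lo, hi + 1, (seq_len,), generator=g)
    return base.repeat(num_reps)
```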
Hunting for Induction Heads in Amazon's Chronos
Notice: While the theory here is correct, I realized I had some implementation errors in the RRT test, which are corrected in a follow-up post. This summer, I expect to be working on things related to mechanistic interpretability in time series forecasting, and a model of interest was Amazon’s Chronos, a probabilistic time series forecasting model. To better understand how the model works and to get my hands dirty with some MI work, I decided to look for evidence of induction heads in Chronos. ...
A Clock Hand Puzzle
I used to not like analog clocks because they unnecessarily made it harder to tell time in a world where digital clocks are a reality. Now, I appreciate them a lot more for all the mathematical fun they present. So, here’s a very simple puzzle I thought of while looking at one. The Puzzle It is 3:00 right now on an analog clock. How much longer do I have to wait to see the minute and the hour hands cross each other? ...
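(Spoiler: one quick rate argument, worked here for convenience. The minute hand sweeps $6^\circ$ per minute and the hour hand $0.5^\circ$ per minute, and at 3:00 they start $90^\circ$ apart, so the hands cross when

$$
6t = 90 + 0.5t \quad\Rightarrow\quad t = \frac{90}{5.5} = \frac{180}{11} \approx 16.36 \text{ minutes.}
$$

)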
Metrobike Optimization Around UT Austin
This project was done as our final project for William Gilpin’s graduate Computational Physics course. Our complete GitHub repository, with instructions on how to replicate our results, can be found here. Introduction The goal of this project is to simulate the behavior of a bike-sharing system in a network of stations and destinations, and then optimize the positions of the stations. We approach the simulation of the bike-sharing system with Agent-Based Modeling (ABM). ...
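As a minimal illustration of the agent-based loop, here is a toy sketch; the station count, capacities, and uniform-random trip model are illustrative assumptions, not the project's actual parameters:

```python
import random

def simulate_bikeshare(num_stations=5, bikes_per_station=10, num_riders=100, seed=0):
    """Toy agent-based model: each rider agent picks a random origin and
    destination station; a trip completes only if the origin has a bike."""
    rng = random.Random(seed)
    bikes = [bikes_per_station] * num_stations
    completed = 0
    for _ in range(num_riders):
        origin = rng.randrange(num_stations)
        dest = rng.randrange(num_stations)
        if bikes[origin] > 0:        # a bike is available to check out
            bikes[origin] -= 1
            bikes[dest] += 1         # rider docks at the destination
            completed += 1
    return completed, bikes
```

A real simulation layers in a station graph, demand patterns, and travel times; this only shows the shape of the agent loop that gets optimized over station positions.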