Induction Heads in Chronos Part 2
Previously, in Part 1, we found some evidence that induction heads exist in the Chronos models [1]. However, I did some things incorrectly, and there were others I wanted to explore further:

First, my implementation of the repeated random tokens (RRT) method was incorrect. Namely, I randomly sampled over all the non-special tokens, but Chronos scales the given input such that the encoder input tokens almost always fall within a range of token ids from 1910 to 2187. Sampling over only this range greatly improved the attention mosaics.
I wanted to further study how changing the number of repetitions and the lengths of the individual sequences in the RRT affects how many induction heads we detect.
I wanted to go beyond RRT data and see if we can find any interesting inductive properties in multisine data.

Background

First, let me clear up what an induction head actually is in a more concrete way than my last post. ...
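For reference, the corrected RRT sampling mentioned at the top of this post can be sketched roughly as follows. This is a minimal illustration, not the exact code used for the experiments: the function name `make_rrt`, its parameters, and the choice to treat the 1910 to 2187 range as inclusive are my assumptions, and it only builds the token ids without passing them to a model.

```python
import numpy as np

# Token id range quoted in this post; treating both ends as inclusive
# is an assumption made for this sketch.
LOW_ID, HIGH_ID = 1910, 2187


def make_rrt(seq_len, n_repeats, seed=0):
    """Build a repeated-random-tokens (RRT) sequence.

    Samples `seq_len` token ids uniformly from [LOW_ID, HIGH_ID] and
    repeats that block `n_repeats` times back to back.
    """
    rng = np.random.default_rng(seed)
    # Generator.integers excludes the upper bound, hence the +1.
    base = rng.integers(LOW_ID, HIGH_ID + 1, size=seq_len)
    return np.tile(base, n_repeats)


# Hypothetical settings: one block of 16 random tokens, repeated 3 times.
tokens = make_rrt(seq_len=16, n_repeats=3)
```

Varying `seq_len` and `n_repeats` here corresponds to the two knobs discussed above: the length of the individual sequences and the number of repetitions.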