Auditing CNNs with Adversarial Vulnerability

This post is a result I came across while working on a SPAR-sponsored project mentored by Satvik Golechha. You can read our full report here: Near Zero-Knowledge Detection of Undesired Behavior [1].

Introduction and Background

Say we have two distributions of data:

- $D = \{ (x_i, y_i) \}_{i=1}^n$: the intended distribution, which we want to learn
- $D_u = \{ (x_{u_i}, y_{u_i}) \}_{i=1}^m$: the undesired distribution, which exhibits harmful behavior

And two models:

- $M_D$: a model which performs well on $D$
- $M_u$: a model which performs well on $D_u$ and performs $\epsilon$-close to $M_D$ on $D$

Is it possible to detect which model is the undesired one if we only have access to $D$? On the surface, this seems like an impossible task for any general distributions and models, so to make the problem more tractable, let’s work with a concrete setup. ...
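To make the setup concrete, here is a minimal sketch (not from the post) of the indistinguishability condition on $D$: accuracy on the intended distribution alone cannot tell $M_D$ and an $\epsilon$-close $M_u$ apart. All names below (`model_D`, `model_u`, `loader_D`) are hypothetical stand-ins, with small linear classifiers in place of CNNs.

```python
# A minimal sketch (not from the post) of why accuracy on D alone cannot
# separate M_D from M_u. The models, data, and epsilon are hypothetical
# stand-ins: simple linear classifiers play the role of the CNNs.
import copy

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def accuracy(model, loader):
    """Fraction of examples in `loader` that `model` classifies correctly."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total


torch.manual_seed(0)
# Stand-in for the intended distribution D.
X = torch.randn(512, 16)
y = (X.sum(dim=1) > 0).long()
loader_D = DataLoader(TensorDataset(X, y), batch_size=64)

# M_D: the benign model; M_u: a copy with tiny weight perturbations,
# so by construction it behaves almost identically on D.
model_D = nn.Linear(16, 2)
model_u = copy.deepcopy(model_D)
with torch.no_grad():
    for p in model_u.parameters():
        p.add_(0.01 * torch.randn_like(p))

epsilon = 0.05
gap = abs(accuracy(model_D, loader_D) - accuracy(model_u, loader_D))
print(f"accuracy gap on D: {gap:.3f} (indistinguishable via D alone if <= {epsilon})")
```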

June 2, 2025 · 6 min

Review of "Planting Undetectable Backdoors in Machine Learning Models" paper by Goldwasser

Notes on the paper Planting Undetectable Backdoors in Machine Learning Models by Shafi Goldwasser, Michael P. Kim, Vinod Vaikuntanathan, and Or Zamir. Scott Aaronson recommended this paper to me as a way to better understand some of the earlier, more cryptographic/theoretical work on backdooring neural networks. I am also reading through Anthropic’s Sleeper Agents paper, which takes a more recent and practical approach to backdooring current LLMs; those notes will be posted soon as well. ...

November 4, 2024 · 9 min · Hasith Vattikuti