Trying to understand the minds on our computers
You can find my publications on Semantic Scholar and my CV. My models and datasets are on HuggingFace. I'm best reached at kyledevinobrien1@gmail.com.
Direction
I am driven by the challenge of understanding the internal mechanisms of advanced machine learning models. While opaque optimization has achieved impressive capabilities, this success has come at the cost of understanding the algorithms these models actually learn. As these systems are increasingly deployed in high-stakes settings, this opacity becomes a critical concern. My career's north star is improving our understanding and control of advanced AI systems.
My technical machine learning research spans activation steering, adversarial robustness, machine unlearning, and pre-training data interventions. I'm particularly passionate about developing lightweight interventions that can address safety issues during deployment without costly retraining, much as security patches provide critical safety benefits in traditional software. My research has been published in leading venues including ICLR and TMLR.
I'm currently seeking research opportunities, including full-time roles and fellowships. Don't hesitate to reach out!
Publications
Steering Language Model Refusal with Sparse Autoencoders
Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh
arXiv, 2024 — Paper Link
Composable Interventions for Language Models
Arinbjörn Kolbeinsson*, Kyle O'Brien*, Tianjin Huang, Shanghua Gao, Shiwei Liu, Jonathan Richard Schwarz, Anurag J. Vaidya, Faisal Mahmood, M. Zitnik, Tianlong Chen, Thomas Hartvigsen
ICLR, 2025 — Paper Link
Improving Black-box Robustness with In-Context Rewriting
Kyle O'Brien, Isha Puri, Nathan Ng, Jorge Mendez, Hamid Palangi, Yoon Kim, Marzyeh Ghassemi, Thomas Hartvigsen
TMLR, 2024 — Paper Link
Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon
USVSN Sai Prashanth*, Alvin Deng*, Kyle O'Brien*, V. JyothirS*, Mohammad Aflah Khan, Jaydeep Borkar, Christopher A. Choquette-Choo, Jacob Ray Fuehne, Stella Biderman, Tracy Ke, Katherine Lee, Naomi Saphra
ICLR, 2025 — Paper Link
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Stella Rose Biderman, Hailey Schoelkopf, Quentin G. Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal
ICML, 2023 — Paper Link