Trying to understand the minds on our computers

You can find my publications on Semantic Scholar and my CV. My models and datasets are on HuggingFace. I'm best reached at kyledevinobrien1@gmail.com.

Direction

I am driven by the challenge of understanding the internal mechanisms of advanced machine learning models. While opaque optimization has achieved impressive capabilities, this success has come at the cost of understanding the algorithms these models actually learn. As these systems are increasingly deployed in high-stakes settings, this opacity becomes a critical concern. My career's north star is improving our understanding and control of advanced AI systems.

My technical machine learning research spans activation steering, adversarial robustness, machine unlearning, and pre-training data interventions. I'm particularly passionate about developing lightweight interventions that address safety issues during deployment without requiring costly retraining, much as security patches provide critical safety benefits in traditional software. My research has been published in venues including ICLR and TMLR.
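
For readers unfamiliar with the idea, the sketch below shows what a lightweight deployment-time intervention can look like: adding a steering direction to a transformer's residual stream through a forward hook, with no retraining. It is a generic illustration rather than code from any paper listed below; the model, layer index, and random steering vector are placeholders, and a real intervention would use a learned or SAE-derived direction.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any decoder-only Hugging Face model works similarly.
model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 3  # which layer's residual stream to intervene on (placeholder)
hidden_size = model.config.hidden_size
# Stand-in for a learned or SAE-derived steering direction.
steering_vector = 0.05 * torch.randn(hidden_size)

def add_steering(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Pythia (GPT-NeoX) exposes layers at model.gpt_neox.layers;
# Llama-style models expose them at model.model.layers.
handle = model.gpt_neox.layers[layer_idx].register_forward_hook(add_steering)

prompt = "Explain how to stay safe online."
ids = tokenizer(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # the base model is untouched once the hook is removed
```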

I'm currently seeking research opportunities, including full-time roles and fellowships. Don't hesitate to reach out!

Publications

Steering Language Model Refusal with Sparse Autoencoders

Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh

arXiv, 2024 — Paper Link

Composable Interventions for Language Models

Arinbjörn Kolbeinsson*, Kyle O'Brien*, Tianjin Huang, Shanghua Gao, Shiwei Liu, Jonathan Richard Schwarz, Anurag J. Vaidya, Faisal Mahmood, Marinka Zitnik, Tianlong Chen, Thomas Hartvigsen

ICLR, 2025 — Paper Link

Improving Black-box Robustness with In-Context Rewriting

Kyle O'Brien, Isha Puri, Nathan Ng, Jorge Mendez, Hamid Palangi, Yoon Kim, Marzyeh Ghassemi, Thomas Hartvigsen

TMLR, 2024 — Paper Link

Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

USVSN Sai Prashanth*, Alvin Deng*, Kyle O'Brien*, Jyothir S V*, Mohammad Aflah Khan*, Jaydeep Borkar, Christopher A. Choquette-Choo, Jacob Ray Fuehne, Stella Biderman, Tracy Ke, Katherine Lee, Naomi Saphra

ICLR, 2025 — Paper Link

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal

ICML, 2023 — Paper Link

I am always happy to meet with people to discuss research, collaborations, and career advice. I have benefited tremendously from others sharing their time with me and am excited to pay it forward.