COLLOQUIUM: Denghui Zhang, "Mechanism Understanding as Prerequisite for Responsible AI"

Feb 18, 2026, 2:30 pm
HYBRID: 2405 Siebel Center for Computer Science or online
Sponsor: Siebel School of Computing and Data Science
Originating Calendar: Siebel School Colloquium Series

Zoom: https://illinois.zoom.us/j/81733397108?pwd=nzUNrrubnbDoXQoAVbeVu4DUP9p5dz.1

Refreshments Provided.

Abstract: 
As large language models are increasingly deployed in safety-critical, legal, and trust-sensitive settings, prevailing alignment methods rely largely on brute-force strategies such as global weight updates, post hoc filtering, or external prompting, without understanding the internal mechanisms that govern model behavior. In this talk, I argue that mechanism understanding is a prerequisite for trustworthy AI and present a unified research program that opens the black box of modern language models to analyze and control their internal decision processes. I first show that high-level cognitive abilities such as Theory of Mind emerge from sparse, structured attention circuits, where perturbing a tiny fraction of parameters selectively disrupts belief tracking while preserving language fluency. Building on this insight, I introduce activation-level disentanglement and control as a new paradigm for safety alignment and present SafeSwitch, a dynamic framework that separates the recognition of unsafe content from refusal behavior at inference time, enabling targeted safety intervention with minimal utility loss. I further advance trustworthy data ecosystems through a holistic copyright-risk analysis framework that integrates adversarial red teaming, internal representational signals, and logit-level probabilistic auditing. Together, this work outlines a mechanistic foundation for precise intervention, scalable governance, and principled trust in increasingly agentic AI systems.

Bio:
Denghui Zhang is an Assistant Professor at Stevens Institute of Technology and a Visiting Faculty Member with the NLP Group at the University of Illinois Urbana-Champaign. His research focuses on the mechanistic understanding and control of large language models, with an emphasis on safety, trustworthiness, and legal compliance, including internal-state analysis, dynamic safety steering, theory-of-mind reasoning, and copyright-risk detection. His work has appeared in top venues such as NeurIPS, EMNLP, NAACL, IJCAI, and ICIS, and has received several recognitions, including the ICIS Best Student Paper Award (in Honor of TP Liang), an NSF NAIRR Pilot Program award, and support from the OpenAI Researcher Access Program. He has led tutorials at major AI and NLP conferences, including AAAI and NAACL, and has given invited talks at workshops on responsible foundation models and trustworthy AI, including at NeurIPS. Additional information is available at https://zhangdenghui.site/.


Part of the Siebel School Speakers Series. Faculty Host: Heng Ji


Meeting ID: 817 3339 7108 
Passcode: csillinois


If accommodation is required, please email <erink@illinois.edu> or <communications@cs.illinois.edu>. Someone from our staff will contact you to discuss your specific needs.