Abstract: Large language models (LLMs) have achieved remarkable fluency and versatility, yet they remain fundamentally opaque and vulnerable, posing challenges for both responsible control and safe deployment. This talk presents two complementary approaches to advancing trustworthy AI: one focused on interpretable control and the other on adversarial robustness. We first introduce JAM (Just A Move), a novel framework for controllable text generation that leverages causal interventions in the latent space of LLMs. By uncovering and manipulating the causal structure underlying generation, JAM enables interpretable and efficient control over model outputs. Empirical evaluations across alignment benchmarks, including the helpfulness, honesty, and harmlessness (HHH) criteria, toxicity reduction, and GPT-4 alignment, demonstrate that JAM improves controllability by up to 22% while maintaining computational efficiency. We next examine the vulnerabilities of LLMs through intent-hiding adversarial prompting, a scalable attack strategy that composes benign skills to conceal malicious intent. Using a game-theoretic framework, we analyze the dynamics between attackers and defense systems, revealing structural advantages for adversaries. We further propose targeted defenses and validate their effectiveness across real-world models and malicious behaviors. Together, these contributions highlight the dual imperative of building LLMs that are both controllable by design and resilient to adversarial misuse, offering a roadmap toward more trustworthy and secure AI systems.
Bio: Dr. Abhishek K. Umrawal is a Teaching Assistant Professor in the Department of Electrical and Computer Engineering at the University of Illinois Urbana-Champaign.