USC Engineers Offer Solutions to Stop Unsafe AI Behaviors

Venice Tang | March 3, 2026 

USC Viterbi Researchers Plan to Develop Mathematical Guardrails for AI Safety in Upcoming Study With Philanthropic Funding

A robot attempting to make good decisions at a stoplight (Credit: MidJourney)

A robot attempting to make good decisions at a stoplight (Credit: MidJourney)

Artificial Intelligence (AI) tools have become embedded in daily life, functioning like pocket assistants, as new versions of increasingly powerful systems are being released every few weeks, pushing the technology closer to human-level performance.

But safety measures have not kept pace with the rapid development of AI, as high-profile safety failures continue to make headlines, from chatbots facilitating explosive construction to AI agents promoting self-harm.

Three USC researchers are teaming up to develop safer AI through a holistic approach in an upcoming study that aims to fundamentally redesign how models are trained. Instead of patching quick fixes onto safety threats as they arise—a long-standing challenge in current efforts battling AI issues— the team plans to address the root causes by rethinking how AI models are trained and designed.

Paria Rashidinejad, Mahdi Soltanolkotabi and Stephen Tu, faculty members from the USC Viterbi School of Engineering’s Ming Hsieh Department of Electrical and Computer Engineering and School of Advanced Computing, were awarded $400,000 in funding from Coefficient Giving, formerly Open Philanthropy, a philanthropic funder that supports neglected areas of scientific research. Soltanolkatabi also holds faculty appointments in the Thomas Lord Department of Computer Science and the Daniel J. Epstein Department of Industrial and Systems Engineering. Tu also holds a faculty appointment in the Thomas Lord Department of Computer Science.

Bringing backgrounds in automation, optimization and reinforcement learning respectively, the three researchers plan to come up with mathematical guardrails designed to prevent misaligned behaviors before damage is done.

“Developing safer AI is a core priority to USC as a leader in AI research,” explained Rashidinejad, WiSE Gabilan Assistant Professor of Electrical and Computer Engineering joining from Meta’s Fundamental AI Research Labs. The trio also emphasized that their findings will be applied broadly across AI models as part of a philanthropic mission, meaning the study’s results will be shared with the wider AI industry at no cost.

Unpacking AI’s Unsafe Behaviors

From teaching users how to build a bomb and promoting prohibited substance use to orchestrating a murder plot or powering autonomous weapons, AI models today still face safety challenges with the potential to cause serious or even fatal harm. Even when direct guardrails are in place, AI agents often still answer dangerous requests and fall for jailbreak prompts that bypass safeguards. For instance, asking an AI model like ChatGPT to write a “fictional movie script” about a character building a bomb without asking for the instructions directly can trick AI into providing harmful answers.

Some of AI’s dangerous, misaligned behaviors that USC researchers hope to tackle in their upcoming study include reward hacking, sandbagging and catastrophic forgetting.

Like a dog performing a trick only to get the treat, reward hacking occurs when an AI system learns to pass safety tests researchers set up, gaming the metrics used to evaluate it, instead of fully understanding the intent behind the task.

Researchers also pointed out AI can be deceptive, as it has been observed that these models switch behavior and begin showing unsafe tendencies when deployed in the real world, as the system optimizes to perform other tasks requested by users. This type of behavior is called sandbagging. It can be particularly dangerous, as it has led AI developers to release unsafe models that are masked as safe during testing.

AI can also be forgetful and overwhelmed when given too many instructions, much like a five-year-old learning calculus in one day and forgetting basic addition or subtraction. This phenomenon, called catastrophic forgetting, can happen when researchers try to “cram” new safety rules or facts into an existing AI. The model often rewrites its internal connections so aggressively that it forgets previous skills, making its ability to perform safely and reliably unpredictable.

Currently, AI experts are tackling AI safety issues whack-a-mole-style, reacting to individual risks as they come up.

But much like the famous line from Taylor Swift’s hit song “Bad Blood,” “Band-Aids don’t fix bullet holes,” these quick fixes ultimately will not stop AI from performing risky tasks.

Soltanolkotabi, Andrew and Erna Viterbi Early Career Chair and Professor of Electrical and Computer Engineering, Computer Science, and Industrial and Systems Engineering, further cited that this reactive approach is inefficient and often “too late,” as the damage may already be done. The rapid, automated generation of harm by AI models can easily outpace manual takedown efforts.

USC’s Solution: Mathematical Guardrails to Stop AI From Acting ‘Unsafe’

As the goal is to develop mathematical guarantees that prevent harmful behaviors before systems are deployed, the team’s theory-driven approach examines safety at the design level in this upcoming study. Tu, assistant Professor of Electrical and Computer Engineering and Computer Science, whose background includes automation and safety frameworks for robotics will draw on control systems that keep machines stable in unpredictable environments. Soltanolkotabi, an expert in optimization, statistics and foundations of AI, will focus on building faster, more efficient formulations, and statistical guarantees that embed safety directly into a model’s objectives. Rashidinejad, whose research advances decision-making, prediction, reasoning, and safety in AI systems in areas such as reinforcement learning and large language models, will develop mathematical formulations and scalable algorithms that enable AI systems to continually adapt through sequential, targeted interventions. .

Together, the researchers aim to create AI systems that are not only more capable, but predictably safe, models designed to act responsibly by design.

Tackling challenges beyond safety, the trio also plans to make AI systems more efficient and easier to update as part of their research. Today, major upgrades to large language models like ChatGPT often require rebuilding from the ground up, a process that is costly, energy-intensive, and slow. In upcoming work, the team wants to make AI models capable of continual learning–so developers can start with a solid, ‘college-level’ system and then keep upgrading it over time, like adding a new floor to a house instead of rebuilding from the ground up.

Learn more about Coefficient Giving’s call for AI safety research here.

Madhi Soltanolkotabi (left), Paria Rashidinejad (center), and Stephen Tu (right) team up to tackle AI safety issues

Madhi Soltanolkotabi (left), Paria Rashidinejad (center), and Stephen Tu (right) team up to tackle AI safety issues

Published on March 3rd, 2026

Last updated on March 3rd, 2026

This article may feature some AI-assisted content for clarity, consistency, and to help explore complex scientific concepts with greater depth and creative range.