The terms AI safety and technical safety are closely related, but not entirely synonymous. “AI safety” has become something of a catch-all or umbrella term covering many facets of safety, including both technical and socioeconomic areas of focus. AI safety is an interdisciplinary field focused on ensuring that AI systems are developed and deployed in ways that are safe, reliable, and beneficial to humanity. Technical safety can be considered a subset of AI safety that focuses specifically on the technical aspects of making AI systems safe and reliable.
Technical safety research comes in two main types: empirical and theoretical. Empirical AI safety research usually involves working directly with ML models to identify risks and develop ways to mitigate them. Theoretical AI safety research is far more conceptual and mathematical, and involves identifying properties that it would be useful for safe ML algorithms to have.
There are three main subsets of technical safety: robustness, interpretability and explainability, and alignment.
Robustness research includes identifying and defending AI systems against deliberate attacks, as well as hardening models against incidental failures. Examples of deliberate attacks include feeding a system inputs intentionally designed to cause it to fail (adversarial examples) or manipulating training data so a model learns the wrong thing (data poisoning). Incidental failures can include a model being used in a setting it was not trained for.
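To make the “inputs intentionally designed to cause failure” idea concrete, here is a minimal sketch of the fast gradient sign method (FGSM), one classic way to craft adversarial inputs. The toy model, random input, and epsilon value are placeholders for illustration, not drawn from any particular system discussed above.

```python
import torch
import torch.nn as nn

def fgsm_attack(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                epsilon: float = 0.03) -> torch.Tensor:
    """Craft an adversarial input by nudging x in the direction that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step each pixel by epsilon in the sign of the gradient, then clamp to the valid range.
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

# Hypothetical usage: a tiny classifier and a random "image" stand in for a real model and data.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(1, 1, 28, 28)   # placeholder input in [0, 1]
y = torch.tensor([3])          # placeholder label
x_adv = fgsm_attack(model, x, y)
print((x_adv - x).abs().max()) # the perturbation is bounded by epsilon
```

The perturbation is often imperceptible to a human yet enough to flip the model’s prediction, which is exactly why robustness research treats such attacks as a core threat.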
Interpretability research studies why AI systems do what they do and puts those reasons into terms that humans can understand. The aim is to build tools and approaches that translate the millions of parameters in a machine learning model into forms that let humans understand what’s going on.
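One simple example of such a tool is a gradient-based saliency map, which estimates how sensitive a model’s prediction is to each input feature. The sketch below is illustrative only; the toy model and random input are placeholders.

```python
import torch
import torch.nn as nn

def saliency_map(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Return |d(top-class score)/d(input)|: a rough map of which inputs mattered most."""
    x = x.clone().detach().requires_grad_(True)
    scores = model(x)
    top_class_score = scores.max(dim=1).values.sum()
    top_class_score.backward()
    return x.grad.abs()

# Hypothetical usage with a toy classifier and random input.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(1, 1, 28, 28)
print(saliency_map(model, x).shape)  # same shape as the input: one sensitivity value per pixel
```

Saliency maps are among the crudest interpretability techniques, but they capture the basic goal: turning opaque parameters and gradients into something a person can look at and reason about.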
Alignment research involves developing methodologies and frameworks to guide AI systems toward understanding and adhering to human ethical standards and societal norms. AI alignment matters because powerful AI systems acting in ways that are misaligned with human values could have severe consequences.
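One widely used alignment technique is learning a reward model from human preference comparisons, as in reinforcement learning from human feedback (RLHF). The sketch below shows the standard pairwise preference loss, which trains the reward model to score human-preferred responses above rejected ones. The reward values here are made-up placeholders; a real setup would compute them with a learned network over model outputs.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the preferred response's reward above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical rewards for a batch of (preferred, rejected) response pairs.
reward_chosen = torch.tensor([1.2, 0.4, 2.0])
reward_rejected = torch.tensor([0.3, 0.9, 1.5])
print(preference_loss(reward_chosen, reward_rejected))  # lower when preferred responses score higher
```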
It’s worth noting there are many ways to further subcategorize technical AI safety, including threat modelling, safe exploration, scalable oversight, assurance and verification, and more. For brevity, I’ve identified the three largest clusters and the two main types of research within technical AI safety.