This publication is licensed under the terms of the Creative Commons Attribution License 4.0 which permits unrestricted use, provided the original authors and source are credited.

Introduction

This Expert Analysis is based on a discussion at a joint CETaS–AISI virtual workshop hosted on 7 November 2024. The workshop focused on how to realise the potential of sociotechnical approaches to AI evaluation, involving 38 UK and international experts from across government, academia, industry and civil society. The discussion topics were inspired by “Evaluating Malicious Generative AI Capabilities: Understanding inflection points in risk,” CETaS Briefing Papers (July 2024). We would like to thank the workshop participants for their invaluable insights, and particularly acknowledge the contribution of our keynote presenter, Laura Weidinger.

Over the past year, the discourse on AI safety has increasingly centred on evaluation as a way to mitigate the risks posed by advanced generative AI (GenAI) systems. Approaches to evaluation range from red-teaming to automated solutions and human-participant studies. As the field evolves, establishing best practices has become a priority for experts. With AI safety hubs emerging around the world, it is crucial that experts maintain a holistic understanding of the AI threat landscape, enabling them to address key areas of concern effectively.

As explored in the CETaS Briefing Paper, GenAI systems can uplift malicious actors’ capabilities in numerous ways. The paper described a series of inflection points in risk across three threat domains: malicious code generation, radicalisation, and weapon instruction and attack planning. Unlike previous work focused primarily on technical changes in AI capabilities, the paper took an intelligence-led approach by addressing: malicious actors’ preferences and readiness to adopt certain technologies; in-group characteristics that may inform their interactions with AI systems; and systemic factors that are crucial in shaping the broader operating context. 

Landmark discussions on AI safety took place in San Francisco in November 2024 and will continue at the AI Action Summit in Paris in February 2025. Accordingly, it is more important than ever that AI developers, AI evaluators and the national security community find common ground and adopt shared approaches to these challenges.

Sociotechnical approaches to AI evaluation

Participants in the workshop attempted to establish a shared definition of sociotechnical approaches to human-machine interaction and AI evaluation. According to research by Weidinger et al. (2023), sociotechnical approaches to AI evaluation address risks linked to:[1]

  • Technical components and system behaviours (e.g. a propensity to reproduce harmful stereotypes in images or utterances).
  • Interactions between technical systems and human users (e.g. challenges for data annotators who are exposed to harmful model outputs).
  • Systemic and structural factors that influence model capability and human interactions (e.g. increasing homogeneity in knowledge production and creativity).

Weidinger et al. assert[2] that the sociotechnical lens gives rise to questions such as:

  • Who is interacting with an AI system, and in which institutions?
  • Which workflows and processes stem from these interactions?
  • What do people use an AI system for, and how does this differ from developers’ intentions?
  • How does an AI system function for people whom its creators did not envisage as primary users?
  • What points of failure emerge when there is widespread adoption of an AI system?

As Clark Barrett and his co-authors argued in a 2023 paper on the risks of GenAI,[3] “human activity is remarkably adaptable, nuanced, and context-driven, whereas computational algorithms often exhibit a ‘rigid and brittle’ nature.” Bridging the gap between technical solutions and the social requirements for GenAI deployment is central to designing an effective evaluation ecosystem.

AI evaluation methods

Weidinger et al. have stated[4] that while there is a logic to working ‘in to out’ – i.e. starting with model capabilities and ending with their impact on institutions and society – there is also a need to work ‘out to in’: assessing the real-world failure modes of AI systems and investigating the key factors that drive their adoption in different contexts, to ground and validate capability evaluations that might otherwise depart from the status quo.

Splitting AI evaluation into goals and methods clarifies the role of sociotechnical approaches in a broader context (Weidinger et al.):[5]

  1. For successful ‘hill-climbing’ (making small adjustments iteratively, to optimise performance based on feedback from evaluation metrics), benchmarks provide a useful performance metric for AI capabilities (a minimal code sketch follows this list).
  2. To explore likely failure modes, AI red-teaming adapts methods from the cybersecurity domain, identifying a model’s weak points so they can be patched.
  3. For the goal of understanding the inner workings of an AI model, mechanistic interpretability can be helpful – although the science of machine behaviour remains a live and developing area.
  4. For the goal of providing assurance (i.e. improving assessments of whether a model is safe), sociotechnical approaches may show the most promise given how they account for context.
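
To make the ‘hill-climbing’ loop in point 1 concrete, the minimal Python sketch below perturbs a model configuration, re-scores it against a benchmark, and keeps the change only if the score improves. The `benchmark_score` function and the configuration parameters are hypothetical placeholders standing in for a real evaluation harness, not part of any cited framework.

```python
import random

def benchmark_score(config: dict) -> float:
    """Hypothetical stand-in for running a benchmark suite and returning
    an aggregate score for a given model configuration."""
    # In practice this would call an evaluation harness; here it is a toy
    # objective with a single optimum, purely for illustration.
    return -((config["temperature"] - 0.7) ** 2) - ((config["top_p"] - 0.9) ** 2)

def hill_climb(config: dict, steps: int = 200, step_size: float = 0.05) -> dict:
    """Greedy hill-climbing: apply small random perturbations and keep a
    change only when the benchmark metric improves."""
    best = dict(config)
    best_score = benchmark_score(best)
    for _ in range(steps):
        candidate = dict(best)
        key = random.choice(list(candidate))
        candidate[key] += random.uniform(-step_size, step_size)
        score = benchmark_score(candidate)
        if score > best_score:  # keep only improvements
            best, best_score = candidate, score
    return best

print(hill_climb({"temperature": 0.2, "top_p": 0.5}))
```

The same greedy loop underlies most benchmark-driven optimisation, which is why benchmarks suit this goal well while saying little about interaction or systemic effects.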

Weidinger et al. also point out that one nascent approach is automation and simulation:[6] testing AI systems in environments that emulate real-world conditions or specific task settings. Despite the complexity of grappling with lived human experiences, simulation could offer a partial solution to the challenge posed by human reviewers’ prolonged exposure to harmful real-world content during evaluation.
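
As an illustration of what automated, simulation-based evaluation could look like in practice, the sketch below pairs a simulated user with the system under test and uses an automated judge to flag harmful turns, so no human reviewer needs to read raw transcripts. This is a minimal sketch under stated assumptions: `simulated_user_reply`, `target_model_reply` and `judge_is_harmful` are hypothetical stand-ins for whichever model APIs and classifiers an evaluator actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class Transcript:
    turns: list = field(default_factory=list)    # (speaker, text) pairs
    flagged: list = field(default_factory=list)  # indices of flagged model turns

def simulated_user_reply(history: list) -> str:
    """Hypothetical: a model role-playing a user persona in a task setting."""
    return f"simulated user turn {len(history)}"

def target_model_reply(history: list) -> str:
    """Hypothetical: the system under evaluation."""
    return f"target model turn {len(history)}"

def judge_is_harmful(text: str) -> bool:
    """Hypothetical: an automated classifier replacing human review."""
    return "harmful" in text.lower()

def run_simulated_dialogue(max_turns: int = 10) -> Transcript:
    """Alternate simulated-user and target-model turns, scoring each model
    turn automatically and recording which ones were flagged."""
    transcript = Transcript()
    for _ in range(max_turns):
        user_text = simulated_user_reply(transcript.turns)
        transcript.turns.append(("user", user_text))
        model_text = target_model_reply(transcript.turns)
        transcript.turns.append(("model", model_text))
        if judge_is_harmful(model_text):
            transcript.flagged.append(len(transcript.turns) - 1)
    return transcript

report = run_simulated_dialogue()
print(f"{len(report.flagged)} of {len(report.turns) // 2} model turns flagged")
```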

Notwithstanding the progress being made in these areas, there remain numerous gaps in AI evaluation methods. For example, research into interaction harms is accelerating but could be outpaced by the increasing volume and variety of people engaging with AI systems. Multi-modal evaluation is another promising area, but it is at an early stage and subject to the uncertainty caused by a growth in the applications of autonomous AI agents.

There is no single solution that fits all the complex layers of AI evaluation. However, according to Weidinger et al.,[7] 85.6% of evaluations focus on capability, with only 5.3% and 9.1% addressing human interaction and systemic impact respectively. There is therefore a need to build on the goals and methods associated with sociotechnical AI evaluation.

Inflection points in risk and malicious actor uplift

Given the increase in both the accessibility of GenAI systems and the potential impact of their use by malicious actors, experts require new conceptual frameworks to address the risks. In this regard, inflection points are valuable for understanding and forecasting heightened risk in certain uses of GenAI systems. The concept encapsulates how risk levels may rise quickly or otherwise change at certain moments, reflecting the reality that risks can increase in a non-linear manner, with sudden shifts resulting from various factors (e.g. technological breakthroughs, regulatory changes or new uses of AI). Just as it is important to work both ‘in to out’ and ‘out to in’, it is vital to balance forward-looking risk identification with efforts to work backwards from undesirable sociotechnical outcomes.
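
Purely as an illustration of how an inflection point might be operationalised, the sketch below tracks a risk-relevant indicator across evaluation rounds and flags the round where its rate of change accelerates most sharply (a simple second-difference heuristic). The indicator values are invented for the example; any real threshold would need to reflect the intelligence-led factors described above.

```python
def find_inflection_round(scores: list[float]) -> int:
    """Return the index of the round where the indicator's rate of change
    accelerates most (largest discrete second difference)."""
    first_diff = [b - a for a, b in zip(scores, scores[1:])]
    second_diff = [b - a for a, b in zip(first_diff, first_diff[1:])]
    # The second difference at index i measures acceleration at scores[i + 1],
    # so add 1 to refer back to the original score series.
    return max(range(len(second_diff)), key=lambda i: second_diff[i]) + 1

# Invented indicator values: capability climbs slowly, then jumps.
capability_by_round = [0.10, 0.12, 0.13, 0.15, 0.16, 0.31, 0.45, 0.52]
print(find_inflection_round(capability_by_round))  # flags the sudden shift
```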

Greater clarity about inflection points and risk thresholds would be useful in persuading stakeholders to reach a consensus on the points at which mitigation measures will be needed. This may align with intelligence-led frameworks in the national security context, which are often constructed around threat levels. For example, the UK categorises the terrorism threat level as either low, moderate, substantial, severe or critical.[8]

One caveat to this discussion is that the dual-use nature of many AI capabilities means there will also be a positive aspect to certain inflection points. Developers may therefore see an inflection point in risk as an equally important innovation opportunity. For instance, inflection points in malicious code generation may correspond with those in autonomous agents for cyber defence. Similarly, a considerable improvement in GenAI systems’ social awareness and persuasiveness may be a significant concern from a radicalisation perspective but a boon from a marketing perspective, with advantages for applications like political fundraising. Given that an AI model’s persuasiveness is not an intrinsic threat, risk assessments should balance the identification of bad actors deploying GenAI for harmful purposes against benign commercial use cases.

Inflection points and indicators 

As discussed, the workshop focused on the threat areas of malicious code generation, radicalisation, and weapon instruction and attack planning, while also addressing systemic factors. The aim was to scrutinise the inflection points covered in the CETaS Briefing Paper and identify new ones. The tables below summarise the inflection points of greatest interest to participants in the workshop, along with potential indicators that these points have been reached.

Table 1. Malicious code generation 

Inflection point: Reasoning and tactical foresight – if GenAI systems can maintain a ‘model’ of code, they may improve their strategic decision-making during cyberattacks.
Indicators:
  • Differentiation between an uplift in a human’s throughput and a model autonomously executing strategic planning and decision-making.
  • Companies are likely to publicise their AI models’ ability to complete tasks that require complex autonomous decision-making, making identification easier.

Inflection point: High-quality training examples – if a GenAI system is trained on better examples of sophisticated malware, its offensive capability may outperform its defensive capability.
Indicators:
  • A model’s ability to infer how to write malicious code without needing to be trained on direct examples of it – and how to generate new exploits far beyond its training data (by using, for example, information available in documents or online videos).

Inflection point: Teams of agents – if autonomous GenAI agents can cooperate, learning from and adapting to one another’s experiences.
Indicators:
  • A GenAI system’s ability to conduct trial and error, and to learn from mistakes – if it takes millions of iterations to be effective, it is less concerning than if it takes fewer than a hundred.

Inflection point: Languages and tools – if GenAI systems can leverage programming languages and tools best suited to augmenting malicious code.
Indicators:
  • A GenAI system’s capacity to use C/C++ or machine code – which would be a significant indicator of capability uplift, even if most exploits now are in the web space (e.g. SQL injection).
  • A GenAI system’s ability to use tools specific to cybersecurity, such as Mythic.

Table 2. Radicalisation 

Inflection point: Better social awareness and persuasiveness – if GenAI systems produce minimal hallucinations and become more socially aware and persuasive as a result, extremist groups could become more willing to delegate their messaging to these systems.
Indicators:
  • Evidence of improved social awareness and persuasiveness in commercial use cases, which may trigger greater usage by terrorist and violent extremist (TVE) groups.
  • Sustained improvements in TVE groups’ tradecraft and communications, coming across as more fluent and sophisticated – which may indicate increased uptake.

Inflection point: Enhanced retrieval augmented generation (RAG) capabilities – if RAG capabilities are developed for ‘radicalisation datasets’, GenAI systems may query custom knowledge bases to provide effective extremist messaging.
Indicators:
  • Increasing proliferation of jailbroken models – which would indicate an easier route for TVE groups to influence users through sustained interactions, at volume and at pace.

Inflection point: Accurate targeting – if GenAI systems can compile precise information about potential radicalisation targets, leading to more tailored approaches and AI scanning/vetting of new members.
Indicators:
  • TVE groups segmenting their approaches to different target groups based on different modalities (e.g. voice, video or text) and using GenAI to determine particularly vulnerable subsets of their target audience.

Table 3. Weapon instruction  

Inflection point: Contextual adaptation and situational awareness – if GenAI systems can adapt to the specific context of an attack plan, providing real-time tips and advice (see CETaS Briefing Paper above).
Indicators:
  • A GenAI system’s ability to handle multi-turn conversations, learning from each interaction. The number of interactions required could indicate the level of contextual adaptation and learning pace – with many interactions signalling limited efficiency in contextual adaptability and learning, and few signalling greater efficiency.

Inflection point: Narrow AI tools – if GenAI systems can successfully integrate with ‘narrow AI’ tools, the natural language features of these systems could make specialised tasks more achievable.
Indicators:
  • The integration of domain-specific data (such as environmental or biological information) into AI models, and their ability to act on this in the context of designing attack plans or creating chemical, biological, radiological or nuclear weapons.
  • The accessibility and uptake of smaller models that can be deployed locally and may face fewer barriers to adoption.

Inflection point: Automated targeting – if agent-based systems can reliably identify connections with other linked individuals and independently compile multiple dossiers on targets.
Indicators:
  • Models’ ability to accurately synthesise data for identifying, profiling and selecting targets.
  • Malicious actors’ access to closed-weight models, skills and materials – which could enable low-resource groups to optimise AI for targeting.

Table 4. Systemic factors linked to inflection points  

Systemic factor: Demographic change
Indicator: Future generations’ greater reliance on GenAI systems may result in differing patterns of use and increased technical competency.

Systemic factor: Agent-to-agent interactions
Indicator: As AI systems become more autonomous, individuals may need their own AI agents to effectively interact with other individuals’ or institutions’ AI agents. This complex network of AI-to-AI interactions could have diverse and far-reaching effects.

Systemic factor: Decentralised training
Indicator: Efforts to monitor large, centralised computers could become less useful for controlling the development of GenAI systems.

Systemic factor: Overreliance
Indicator: As reliance on AI systems continues to grow, this may lead to skills atrophy in multiple contexts.

Systemic factor: Decentralised learning
Indicator: Enhancement of AI training pipelines through data acquisition on edge devices. This could accelerate progress towards a future in which AI acts as ‘middleware’ between individuals and computers.

Systemic factor: Distributed governance
Indicator: As responsibility for decision-making and oversight spreads across multiple entities, it becomes challenging to maintain accountability for managing and mitigating the risks involved in these processes. The resulting ambiguity may undermine institutional responses to AI harms.


Conclusion

The workshop highlighted the need for greater scrutiny of how technical, organisational and societal factors interact and mutually reinforce one another to amplify AI risks. Capability-centric perspectives on AI adoption have merit, but they should be coupled with intelligence-led threat assessments of malicious actors’ tradecraft and operations.

No single community can tackle this challenge on its own – the CETaS-AISI workshop is just one of many dialogues aiming to broaden the coalition of stakeholders in the endeavour. Such efforts should evolve alongside technical and sociocultural shifts, and proceed with an open mind as to which combination of evaluation approaches will have the best chance of mitigating AI harms.

The views expressed in this article are those of the authors, and do not necessarily represent the views of The Alan Turing Institute or any other organisation. 

References

[1] Laura Weidinger et al., "Sociotechnical Safety Evaluation of Generative AI Systems," arXiv (October 2023), https://arxiv.org/abs/2310.11986.

[2] Laura Weidinger et al., "Sociotechnical Safety Evaluation of Generative AI Systems," arXiv (October 2023), https://arxiv.org/abs/2310.11986.

[3] Clark Barrett et al., "Identifying and Mitigating the Security Risks of Generative AI," arXiv (December 2023), https://arxiv.org/abs/2308.14840.

[4] Laura Weidinger et al., "Sociotechnical Safety Evaluation of Generative AI Systems," arXiv (October 2023), https://arxiv.org/abs/2310.11986.

[5] Laura Weidinger et al., "Sociotechnical Safety Evaluation of Generative AI Systems," arXiv (October 2023), https://arxiv.org/abs/2310.11986.

[6] Laura Weidinger et al., "Sociotechnical Safety Evaluation of Generative AI Systems," arXiv (October 2023), https://arxiv.org/abs/2310.11986.

[7] Laura Weidinger et al., "Sociotechnical Safety Evaluation of Generative AI Systems," arXiv (October 2023), https://arxiv.org/abs/2310.11986.

[8] UK Government, "Terrorism and national emergencies," https://www.gov.uk/terrorism-national-emergency.

 

This article was revised on 10 December 2024 to give full credit to all cited work from Weidinger et al. (2023) and Barrett et al. (2023).

Citation information

Ardi Janjeva, Anna Gausen and Tvesha Sippy, "Realising the Potential of Sociotechnical Approaches to AI Evaluation," CETaS Expert Analysis (December 2024).