Trustworthy LLMs: a Survey and Guideline for Evaluating 
Large Language Models' Alignment 

Yang Liu*, Yuanshun Yao*, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo
Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li

ByteDance Research


Alignment, i.e., making models behave in accordance with human intentions, has become a key task before deploying large language models (LLMs) in production. For example, OpenAI spent six months iteratively aligning GPT-4 before its release. Nonetheless, the lack of clear guidance on how to evaluate whether LLM outputs align with social norms, values, and regulations remains an obstacle for practitioners who want to systematically iterate on and deploy LLMs. This paper surveys the major dimensions that we believe are important to consider and that, we hope, cover the majority of current concerns about LLM trustworthiness. We include 7 major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category contains several sub-categories, 29 in total. In addition, we select a subset of 8 sub-categories, design corresponding measurement studies, and report results on several widely used LLMs. Overall, the measurement results show that more-aligned models indeed tend to perform better on trustworthiness. However, the effectiveness of alignment differs across the trustworthiness categories considered, highlighting the need to perform more fine-grained analyses, testing, and improvements on LLM alignment.

PDF link: Trustworthy LLMs

Please cite:
title={Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment},
author={Liu, Yang and Yao, Yuanshun and Ton, Jean-Francois and Zhang, Xiaoying and Guo, Ruocheng and Cheng, Hao and Klochkov, Yegor and Taufiq, Muhammad Faaiz and Li, Hang},