WALLEDEVAL: A Comprehensive Safety Evaluation Toolkit for Large Language Models

Prannaya Gupta; Le Qi Yau; Hao Han Low; I-Shiang Lee; Hugo M. Lim; Yu Xin Teoh; Jia Hng Koh; Dar Win Liew; Rishabh Bhardwaj; Rajat Bhardwaj; Soujanya Poria

Conference proceeding

2024 Conference on Empirical Methods in Natural Language Processing, pp. 397-407

01/01/2024

Abstract

Computer Science

Computer Science, Artificial Intelligence

Computer Science, Software Engineering

Computer Science, Theory & Methods

Science & Technology

Technology

WALLEDEVAL is a comprehensive AI safety testing toolkit designed to evaluate large language models (LLMs). It accommodates a diverse range of models, including both open-weight and API-based ones, and features over 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injections. The framework supports both LLM and judge benchmarking and incorporates custom mutators to test safety against various text-style mutations, such as future tense and paraphrasing. Additionally, WALLEDEVAL introduces WALLEDGUARD, a new, small, and performant content moderation tool, and two datasets: SGXSTEST and HIXSTEST, which serve as benchmarks for assessing the exaggerated safety of LLMs and judges in cultural contexts. We make WALLEDEVAL publicly available at https://github.com/walledai/walledeval.

Details

Title: WALLEDEVAL: A Comprehensive Safety Evaluation Toolkit for Large Language Models
Creators - without role: Prannaya Gupta - Walled AI Labs, Singapore, Singapore
Le Qi Yau - Walled AI Labs, Singapore, Singapore
Hao Han Low - Walled AI Labs, Singapore, Singapore
I-Shiang Lee - Walled AI Labs, Singapore, Singapore
Hugo M. Lim - Walled AI Labs, Singapore, Singapore
Yu Xin Teoh - Walled AI Labs, Singapore, Singapore
Jia Hng Koh - Walled AI Labs, Singapore, Singapore
Dar Win Liew - Walled AI Labs, Singapore, Singapore
Rishabh Bhardwaj - Walled AI Labs, Singapore, Singapore
Rajat Bhardwaj - Walled AI Labs, Singapore, Singapore
Soujanya Poria - Walled AI Labs, Singapore, Singapore
Contributors - without role: DIH Farias
T Hope
M Li
Publication Details: 2024 Conference on Empirical Methods in Natural Language Processing, pp. 397-407
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 11
Identifiers: 9912605809846
Academic Unit: ISTD Pillar
Language: English
Resource Type: Conference proceeding
