DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph

Abstract

The current paradigm of evaluating Large Language Models (LLMs) through static benchmarks comes with significant limitations, such as vulnerability to data contamination and a lack of adaptability to the evolving capabilities of LLMs. Therefore, evaluation methods that can adapt and generate evaluation data with controlled complexity are urgently needed. In this work, we introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks. We further use a code-augmented LLM to ensure the label correctness of newly generated data. We apply our DARG framework to diverse reasoning tasks in four domains with 15 state-of-the-art LLMs. Experimental results show that almost all LLMs experience a performance decrease with increased complexity and certain LLMs exhibit significant drops. Additionally, we find that LLMs exhibit more biases when being evaluated via the data generated by DARG with higher complexity levels. These observations provide useful insights into how to dynamically and adaptively evaluate LLMs.

Overall Framework

In our DARG framework, we first construct reasoning graphs for the data points in a given benchmark using LLMs (e.g., the following figure shows computational reasoning graphs for solving a math problem). Next, we perform fine-grained perturbations along various dimensions of the reasoning graph. We then convert the perturbed reasoning graph back into a textual description that preserves the linguistic diversity of the original data. To ensure the correctness of both reasoning graph construction and graph-to-text generation, we use tool-augmented LLMs to verify the reasoning graphs and the generated text, so that only valid test examples are produced.
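As a rough illustration of the graph perturbation step, the sketch below represents a math problem as a small computational graph, perturbs it, and recomputes the answer by evaluating the graph. It is a minimal sketch under stated assumptions, not the DARG implementation: the `Node` class, the helpers `shift_leaf_values` and `add_final_step`, and the example problem are all hypothetical, and evaluating the graph directly only stands in for the tool-augmented verification described above.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass
class Node:
    name: str
    value: Optional[float] = None   # set only for given (leaf) quantities
    op: Optional[str] = None        # set only for computed quantities
    inputs: List[str] = field(default_factory=list)


Graph = Dict[str, Node]


def evaluate(graph: Graph, target: str) -> float:
    """Recursively compute the value of `target` from its inputs."""
    node = graph[target]
    if node.op is None:
        return node.value
    left, right = (evaluate(graph, name) for name in node.inputs)
    return OPS[node.op](left, right)


def shift_leaf_values(graph: Graph, delta: float) -> Graph:
    """Hypothetical numerical perturbation: shift every given quantity."""
    out = {}
    for name, node in graph.items():
        if node.op is None:
            out[name] = replace(node, value=node.value + delta)
        else:
            out[name] = replace(node, inputs=list(node.inputs))
    return out


def add_final_step(graph: Graph, target: str, op: str,
                   leaf_name: str, leaf_value: float, new_target: str) -> Graph:
    """Hypothetical depth perturbation: apply one more operation to the answer."""
    out = dict(graph)
    out[leaf_name] = Node(leaf_name, value=leaf_value)
    out[new_target] = Node(new_target, op=op, inputs=[target, leaf_name])
    return out


# Made-up problem: 3 baskets of 4 apples each, 2 apples eaten.
graph = {
    "baskets": Node("baskets", value=3),
    "per_basket": Node("per_basket", value=4),
    "eaten": Node("eaten", value=2),
    "total": Node("total", op="*", inputs=["baskets", "per_basket"]),
    "left": Node("left", op="-", inputs=["total", "eaten"]),
}
larger_numbers = shift_leaf_values(graph, 10)
deeper = add_final_step(graph, "left", "*", "price", 5, "revenue")

print(evaluate(graph, "left"))           # 10
print(evaluate(larger_numbers, "left"))  # 170
print(evaluate(deeper, "revenue"))       # 50
```

Because the label is derived from the perturbed graph rather than from the generated text, the answer stays well defined however the problem is later reworded.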

Math Reasoning

We evaluate 15 SOTA LLMs on GSM8K using DARG with reasoning graphs of increased width, depth, and numerical complexity. The performance of almost all LLMs drops as complexity increases, while closed-source models and larger models are more resilient.

This radar chart shows how resilient different LLMs are to complexity increases, measured by the Complexity-Induced Accuracy Retention Rate (CIARR). CIARR is the average percentage of accuracy retained per complexity increment, computed as the mean ratio of the accuracy at each complexity level to the accuracy at the previous level.
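The CIARR definition above can be sketched in a few lines of Python. The function name `ciarr` and the example accuracies are assumptions made for illustration; the numbers are not results from the paper.

```python
from typing import Sequence


def ciarr(accuracies: Sequence[float]) -> float:
    """Mean ratio of each level's accuracy to the previous level's, in percent.

    `accuracies` is ordered from the lowest to the highest complexity level.
    """
    if len(accuracies) < 2:
        raise ValueError("need accuracy at two or more complexity levels")
    ratios = []
    for prev, curr in zip(accuracies, accuracies[1:]):
        if prev <= 0:
            raise ValueError("previous-level accuracy must be positive")
        ratios.append(curr / prev)
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical accuracies at four complexity levels.
example = [0.80, 0.60, 0.45, 0.36]
print(f"CIARR = {ciarr(example):.1f}%")
```

A value of 100% means accuracy is fully retained from one complexity level to the next on average; lower values mean a steeper decline.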

Social Reasoning

We evaluate SOTA LLMs on the Bias Benchmark for QA (BBQ) using DARG with reasoning graphs that have an increased number of attribute nodes and modified attributes' polarity. The metrics are accuracy, bias score, and Overall Avoidance Rate, which measures how often LLMs are overly sensitive to contexts involving protected groups, often choosing 'Cannot be determined.' even when clear evidence supports an answer. LLMs perform worse as complexity increases and show increasing biases towards protected groups.

Spatial Reasoning

We evaluate SOTA LLMs on the BIG-Bench Hard (BBH) Navigate dataset, a spatial reasoning dataset that involves giving the LLM navigation steps to determine if the agent returns to the starting point. As the depth of the reasoning graph increases, most LLMs' overall accuracy drops, with a significant decline in accuracy on positive cases (where the label is 'Yes') while the accuracy on negative cases remains comparatively stable, indicating biases.
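To make the task concrete, the ground-truth label of a Navigate-style example can be computed by simulating the agent's moves. The sketch below assumes a simplified, structured instruction format of `(action, amount)` pairs, which is a hypothetical stand-in for the dataset's natural-language instructions.

```python
from typing import Iterable, Tuple


def returns_to_start(instructions: Iterable[Tuple[str, int]]) -> bool:
    """Simulate moves on a grid and report whether the agent ends at the origin."""
    x, y = 0, 0
    dx, dy = 0, 1  # the agent initially faces "north"
    for action, amount in instructions:
        if action == "forward":
            x, y = x + dx * amount, y + dy * amount
        elif action == "left":      # rotate 90 degrees counterclockwise
            dx, dy = -dy, dx
        elif action == "right":     # rotate 90 degrees clockwise
            dx, dy = dy, -dx
        elif action == "around":    # rotate 180 degrees
            dx, dy = -dx, -dy
        else:
            raise ValueError(f"unknown action: {action!r}")
    return (x, y) == (0, 0)


# Walk three steps, turn around, walk three steps back: label 'Yes'.
out_and_back = [("forward", 3), ("around", 0), ("forward", 3)]
print(returns_to_start(out_and_back))
```

The same simulation applies unchanged to longer instruction sequences, so the label can be recomputed after a perturbation adds more steps.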

Symbolic Reasoning

We evaluate SOTA LLMs on the BIG-Bench Hard (BBH) Dyck Language dataset, a symbolic reasoning dataset that requires the model to predict the sequence of closing parentheses for a Dyck-4 word missing its last few closing parentheses. As the depth of the reasoning graph's input and output parts increases, all LLMs' performances tend to decrease.
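The expected answer for a Dyck completion example can be derived with a stack: every unmatched opening bracket must be closed in reverse order. The sketch below assumes the four bracket types `()`, `[]`, `{}`, and `<>` and ignores whitespace; the function name `closing_sequence` is hypothetical.

```python
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = set(PAIRS.values())


def closing_sequence(prefix: str) -> str:
    """Return the closing brackets needed to complete a Dyck-4 prefix."""
    stack = []
    for ch in prefix:
        if ch in PAIRS:
            stack.append(ch)
        elif ch in CLOSERS:
            # A closer must match the most recent unmatched opener.
            if not stack or PAIRS[stack.pop()] != ch:
                raise ValueError(f"not a valid Dyck prefix: {prefix!r}")
        elif not ch.isspace():
            raise ValueError(f"unexpected character: {ch!r}")
    # Close the remaining openers, innermost first.
    return "".join(PAIRS[ch] for ch in reversed(stack))


print(closing_sequence("[ { ( ) <"))  # >}]
```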

Finetuning LLMs with DARG-Generated Data

We compare Llama2-7B and Mistral-7B finetuned with DARG-generated data against the same models finetuned with GSM8K's original training data. Both models finetuned with DARG-generated data outperform their counterparts finetuned with an equivalent amount of GSM8K's original training data. This demonstrates DARG's potential not only to dynamically generate new test samples but also to produce training data that enables LLMs to adapt to various complexity levels.

BibTeX

@misc{zhang2024dargdynamicevaluationlarge,
    title={DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph},
    author={Zhehao Zhang and Jiaao Chen and Diyi Yang},
    year={2024},
    eprint={2406.17271},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2406.17271},
}

Usage and License Notices

The data, code, and model checkpoints are intended and licensed for research use only. Please do not use them for any malicious purposes.

The benchmark is built on top of the C4 dataset, under the ODC Attribution License (ODC-By).

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The source code of this website is borrowed from Nerfies.