ARC

ARC, or the AI2 Reasoning Challenge, is a dataset used to benchmark language models' reasoning abilities. The benchmark consists of 7,787 multiple-choice questions drawn from grade 3 to grade 9 science exams. The dataset includes two modes: easy and challenge, with the latter featuring more difficult questions that require advanced reasoning.

tip

To learn more about the dataset and its construction, you can read the original paper here.

Arguments

There are three optional arguments when using the ARC benchmark:

  • [Optional] n_problems: the number of problems for model evaluation. By default, this is set to all problems available in each benchmark mode.
  • [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
  • [Optional] mode: an ARCMode enum that selects the evaluation mode. This is set to ARCMode.EASY by default. deepeval currently supports 2 modes: EASY and CHALLENGE.

info

Both EASY and CHALLENGE modes consist of multiple-choice questions. However, CHALLENGE questions are more difficult and require more advanced reasoning.
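For example, here is a minimal sketch of selecting the CHALLENGE mode instead (the n_problems and n_shots values are illustrative):

from deepeval.benchmarks import ARC
from deepeval.benchmarks.modes import ARCMode

# Same arguments as EASY mode, but questions come from the harder CHALLENGE set
challenge_benchmark = ARC(
    n_problems=50,
    n_shots=5,
    mode=ARCMode.CHALLENGE
)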

Example

The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on 100 problems from ARC's EASY mode using 3-shot prompting.

from deepeval.benchmarks import ARC
from deepeval.benchmarks.modes import ARCMode

# Define benchmark with specific n_problems and n_shots in easy mode
benchmark = ARC(
    n_problems=100,
    n_shots=3,
    mode=ARCMode.EASY
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score ranges from 0 to 1 and represents the fraction of questions the model answered correctly. Both modes are measured with an exact match scorer: a prediction counts as correct only if it matches the correct answer choice exactly.
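
If you want to inspect individual results rather than just the aggregate score, the sketch below assumes ARC follows the same convention as other deepeval benchmarks and exposes a predictions table after evaluation; the attribute name is an assumption, not confirmed for ARC specifically.

# Run only after benchmark.evaluate(...) has completed, as in the example above.
# Assumption: ARC populates a `predictions` DataFrame (one row per question,
# with the input, the model's answer, and the exact-match result) like other
# deepeval benchmarks.
print(benchmark.predictions)

If that table is available, the overall_score should equal the fraction of rows marked as correct.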