Evaluation Dataset
Dataset Class
The `Dataset` class is a convenience class that represents a dataset to be used for evaluation.
The `Dataset` class can be initialized with a path to a folder or a file. The folder should contain the following files:

- `dataset.jsonl`, which contains a collection of queries/instructions and the corresponding reference outputs for the modules in the pipeline
- an optional `manifest.yaml`, which declares the structure and fields of the dataset, the license, and other metadata
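For illustration, a minimal dataset folder might be laid out as follows (the folder name is hypothetical):

```
my_dataset/
├── dataset.jsonl    # required: one JSON record per line
└── manifest.yaml    # optional: field types, license, and other metadata
```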
```python
from continuous_eval.eval import Dataset

dataset = Dataset("path_to_folder")  # or Dataset("path_to_file.jsonl")
```

Alternatively, you can also create a dataset from a list of dictionaries:
```python
dataset = Dataset.from_data([
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the capital of Germany?", "answer": "Berlin"},
])
```

To access the raw data, you can use the `data` attribute:
```python
print(dataset.data[0])
```

Dataset fields
To reference a dataset field, you can use the `DatasetField` class:
```python
class DatasetField:
    name: str
    type: type = typing.Any  # type: ignore
    description: str = ""
    is_ground_truth: bool = False
```

When you load the dataset, the `Dataset` class will automatically infer the fields from the data.
```python
type(dataset.question)  # DatasetField
```

This is particularly useful when defining the inputs and outputs of the modules in the pipeline, as sketched below.
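As a sketch of that usage (this assumes the `Module` and `Pipeline` classes described in the pipeline documentation; the module name and output type here are hypothetical):

```python
from typing import Dict, List

from continuous_eval.eval import Dataset, Module, Pipeline

dataset = Dataset("path_to_folder")

# Wire the module's input directly to an inferred dataset field.
retriever = Module(
    name="retriever",
    input=dataset.question,
    output=List[Dict[str, str]],
)

pipeline = Pipeline([retriever], dataset=dataset)
```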
Example Data Folder
Here’s an example golden dataset that contains `uuid`, `question`, `answer` (ground truth answers), and `tool_calls` (the tools that are expected to be used).
Dataset File
{ "uuid": "1", "question": "What is Uber revenue as of March 2022?", "answer": [ "Uber's revenue as of March 2022 is $6,854 million.", "$6,854 million", "$6,854M" ], "tool_calls": [ { "name": "march" } ]}{ "uuid": "2", "question": "What is Uber revenue as of Sept 2022?", "answer": [ "Uber's revenue as of September 2022 is $23,270 million.", "$23,270 million", "$23,270M" ], "tool_calls": [ { "name": "sept" } ]}{ "uuid": "3", "question": "What is Uber revenue as of June 2022?", "answer": [ "Uber's revenue as of September 2022 is $8,073 million.", "$8,073 million", "$8,073M" ], "tool_calls": [ { "name": "june" } ]}Manifest (optional)
Manifest (optional)

```yaml
name: Uber 10Q
description: Uber 10Q filings from 2022
format: jsonl
license: CC0
fields:
  uuid:
    description: Unique identifier for the filing
    type: UUID
  question:
    description: The question asked in the filing
    type: str
  answer:
    description: The answer to the question
    type: List[str]
  tool_calls:
    description: The tools used to extract the question and answer
    type: List[Dict[str, str]]
```

Example Datasets
Below are the example datasets you can use to test your pipeline/code.
| Dataset | Description | Data format |
|---|---|---|
| correctness | 1,200 examples, created from InstructQA | `Dataset` |
| retrieval | 300 examples, created from HotpotQA | `Dataset` |
| faithfulness | 544 examples, created from InstructQA | `Dataset` |
| graham_essays/small/txt | 10 Paul Graham essays, created from graham-essays | Zip of txt |
| graham_essays/small/dataset | 55 questions about Paul Graham essays | `Dataset` |
| graham_essays/small/results | The results (i.e., answer and retrieved documents) from a simple RAG pipeline | JSON |
Download Datasets
The example datasets can be downloaded with the `example_data_downloader` helper function.
```python
from continuous_eval.data_downloader import example_data_downloader

# Download a dataset for evaluation
dataset = example_data_downloader("retrieval")
```
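For entries whose data format is `Dataset` in the table above, the downloaded object behaves like any other `Dataset`, so you can inspect it as usual:

```python
from continuous_eval.data_downloader import example_data_downloader

dataset = example_data_downloader("retrieval")

# Inspect the first record and an inferred field
# (assuming the retrieval dataset exposes a `question` field).
print(dataset.data[0])
print(type(dataset.question))  # DatasetField
```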