Binary file modified .DS_Store
Binary file not shown.
43 changes: 2 additions & 41 deletions features/evaluations/building-pipelines.mdx
@@ -1,43 +1,4 @@
---
title: "Getting Started"
icon: "flag-checkered"
title: "Building an Eval Pipeline"
icon: "hammer"
---

<iframe width="560" height="315" src="https://www.youtube.com/embed/8hW-OjwpwMk" title="YouTube video player" frameBorder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowFullScreen></iframe>
The overall process of building an evaluation pipeline looks like this:

1. **Select Your Dataset**: Choose or upload datasets to serve as the basis for your evaluations, whether for scoring, regression testing, or bulk job processing.
2. **Build Your Pipeline**: Start by visually constructing your evaluation pipeline, defining each step from input data processing to final evaluation.
3. **Run Evaluations**: Execute your pipeline, observe the results in a spreadsheet-like interface, and make informed decisions based on comprehensive metrics and scores.

## Creating a Pipeline

1. **Initiate a Batch Run**: Start by creating a new batch run, which requires specifying a name and selecting a dataset.
2. **Dataset Selection**: Upload a CSV/JSON dataset, or create a dataset from historical data using filters like time range, prompt template logs, scores, and metadata. [Learn more here.](/features/evaluations/datasets)

You now have a pipeline. Preview mode lets you iterate with live feedback, making adjustments in real time.

## Setting up the Pipeline

### Adding Steps

Click 'Add Step' to start building your pipeline, with each column representing a step in the evaluation process.

Steps execute in order from left to right. If a column depends on a previous column, make sure it appears to the right of its dependency.

#### Common Step Types

- **Prompt Template**: Select a prompt template from the registry, set model parameters, LLM, arguments, and template version.
- **Custom API Endpoint**: Define a URL to send and receive data, suitable for custom evaluators or external systems.
- **Human Input**: Engage human graders by adding a step that allows for textual input.
- **String Comparison**: Use this step to compare the outputs of two previous steps, showing a visual diff when relevant.
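For the Custom API Endpoint step, the handler you deploy at that URL receives a row's data and returns a value for the column. The exact request/response schema is not documented here, so the field names below (`output`, `result`) are assumptions — a minimal sketch of evaluator logic you might serve behind such an endpoint:

```python
# Minimal sketch of custom-evaluator logic for a Custom API Endpoint step.
# Assumption: the pipeline POSTs JSON with the previous step's text under an
# "output" key and expects a JSON value back -- field names are hypothetical.
def evaluate(payload: dict) -> dict:
    output = payload.get("output", "")
    # Example check: flag whether the model response mentions a refund.
    return {"result": "refund" in output.lower()}

evaluate({"output": "A refund has been issued."})  # -> {"result": True}
```

In practice this function would sit behind a web framework route; the pipeline calls the URL once per row and places the returned value in the step's column.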

#### Scoring

If the last step of your evaluation pipeline contains all boolean or all numeric values, it will be considered the score for each row. Your full evaluation report will include a scorecard showing the average of this last step.

_NOTE: All cells in the last column must be boolean, or all must be numeric. If any cell deviates, the score will not be calculated._
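The scoring rule above can be sketched as follows — an illustration of the described behavior, not PromptLayer's actual implementation:

```python
def score_column(cells):
    """Average the final column if every cell is boolean, or every cell numeric."""
    # All-boolean column: the score is the fraction of True cells.
    if cells and all(isinstance(c, bool) for c in cells):
        return sum(cells) / len(cells)
    # All-numeric column (excluding bools, which are ints in Python): the mean.
    if cells and all(isinstance(c, (int, float)) and not isinstance(c, bool) for c in cells):
        return sum(cells) / len(cells)
    # Mixed or empty column: no score is calculated.
    return None

score_column([True, True, False])  # fraction of True cells
score_column([True, "yes", 3])     # mixed types, so no score
```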

## Executing Full Batch Runs

Transition from pipeline to full batch run to apply your pipeline across the entire dataset for comprehensive evaluation.
66 changes: 0 additions & 66 deletions features/evaluations/continuous-integration.mdx

This file was deleted.

49 changes: 8 additions & 41 deletions features/evaluations/datasets.mdx
@@ -8,53 +8,20 @@ Datasets are often used for evaluations, but they can also be exported. Each dat

You can create a dataset from your LLM request history or by uploading a dataset file.

## Creating a Dataset from History
Datasets are also versioned, allowing you to add examples over time.

## Creating a Dataset from Production History

Creating a dataset from your history is straightforward using the Dataset dialog, which builds a dataset from your request history. The dataset will include metadata, input variable context, tags, and the request response. This is useful for backtesting new prompt versions.

When creating a dataset from your history, several filtering options are available. You can use time filters to narrow the included history by specifying a start and end time. You can also refine the dataset by metadata (key-value pairs) or by prompt templates (specified by name and version number). For more advanced customization, you can filter by search query, scoring criteria, or tags.

## Uploading a Dataset

## Creating a Dataset File

JSON or CSV files are accepted for the dataset input file.

### JSON Format

In the JSON format, each test case is represented as a separate JSON object. The keys of the object correspond to the input variable names defined in the prompt template. The values represent the specific input values for each test case. Here's an example:

```json
[
{
"name": "John Doe",
"age": 30,
"location": "New York"
},
{
"name": "Jane Smith",
"age": 35,
"location": "Los Angeles"
},
{
"name": "Michael Johnson",
"age": 40,
"location": "Chicago"
}
]
```

In the above example, the prompt template may contain input variables like `{name}`, `{age}`, and `{location}`. Each test case object provides the corresponding values for these variables.
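To illustrate, a prompt template with those variables can be filled in per test case. The template string below is hypothetical, and Python's built-in formatting stands in for the registry's templating:

```python
# Hypothetical template using the same input variables as the dataset above.
template = "Write a short bio for {name}, age {age}, based in {location}."

# One test case from the JSON dataset.
test_case = {"name": "John Doe", "age": 30, "location": "New York"}

# Substitute the test case's values into the template.
prompt = template.format(**test_case)
# -> "Write a short bio for John Doe, age 30, based in New York."
```

Each object in the dataset produces one rendered prompt, so a three-row dataset yields three evaluation runs.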

### CSV Format
You can also upload a dataset file in CSV or JSONL format. The uploaded file should contain the input variables and any expected outputs you want to include in the dataset.

In the CSV format, each test case is represented as a separate row in the CSV file. The column headers correspond to the input variable names defined in the prompt template. The cells in each row represent the specific input values for each test case. Here's an example:
When uploading a dataset, ensure that the file is properly formatted. For CSV files, each column should represent an input variable or expected output, with the first row containing the headers. For JSONL files, each line should be a valid JSON object representing a single example with key-value pairs for input variables and expected outputs.

```
name,age,location
John Doe,30,New York
Jane Smith,35,Los Angeles
Michael Johnson,40,Chicago
```
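For reference, the same test cases expressed in JSONL — the line-delimited format mentioned above, with one JSON object per line:

```jsonl
{"name": "John Doe", "age": 30, "location": "New York"}
{"name": "Jane Smith", "age": 35, "location": "Los Angeles"}
{"name": "Michael Johnson", "age": 40, "location": "Chicago"}
```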
## Editing a Dataset on PromptLayer

In this example, the prompt template may contain input variables like `{name}`, `{age}`, and `{location}`. Each row in the CSV file provides the corresponding values for these variables.
You can also edit a dataset directly on PromptLayer.