diff --git a/.DS_Store b/.DS_Store
index 76b1ac1..cca571c 100644
Binary files a/.DS_Store and b/.DS_Store differ
diff --git a/features/evaluations/building-pipelines.mdx b/features/evaluations/building-pipelines.mdx
index 260544e..0cb0097 100644
--- a/features/evaluations/building-pipelines.mdx
+++ b/features/evaluations/building-pipelines.mdx
@@ -1,43 +1,4 @@
---
-title: "Getting Started"
-icon: "flag-checkered"
+title: "Building an Eval Pipeline"
+icon: "hammer"
---
-
-
-The overall process of building an evaluation pipeline looks like this:
-
-1. **Select Your Dataset**: Choose or upload datasets to serve as the basis for your evaluations, whether for scoring, regression testing, or bulk job processing.
-2. **Build Your Pipeline**: Start by visually constructing your evaluation pipeline, defining each step from input data processing to final evaluation.
-3. **Run Evaluations**: Execute your pipeline, observe the results in a spreadsheet-like interface, and make informed decisions based on comprehensive metrics and scores.
-
-## Creating a Pipeline
-
-1. **Initiate a Batch Run**: Start by creating a new batch run, which requires specifying a name and selecting a dataset.
-2. **Dataset Selection**: Upload a CSV/JSON dataset, or create a dataset from historical data using filters like time range, prompt template logs, scores, and metadata. [Learn more here.](/features/evaluations/datasets)
-
-You now have a pipeline. Preview mode gives you live feedback as you build, so you can make adjustments in real time.
-
-## Setting up the Pipeline
-
-### Adding Steps
-
-Click 'Add Step' to start building your pipeline, with each column representing a step in the evaluation process.
-
-Steps execute in order left to right. That means that if a column depends on a previous column, make sure it appears to the right of the dependency.
-
-#### Common Step Types
-
-- **Prompt Template**: Select a prompt template from the registry, set model parameters, LLM, arguments, and template version.
-- **Custom API Endpoint**: Define a URL to send and receive data, suitable for custom evaluators or external systems.
-- **Human Input**: Engage human graders by adding a step that allows for textual input.
-- **String Comparison**: Use this step to compare the outputs of two previous steps, showing a visual diff when relevant.
-
-#### Scoring
-
-If the last step of your evaluation pipeline contains all boolean or all numeric values, it will be considered the score for the row. Your full evaluation report will include a scorecard showing the average of this last step.
-
-_NOTE: All cells in the last column must be boolean or all must be numeric. If any cell deviates, the score will not be calculated_
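As a rough sketch (an illustration, not PromptLayer's actual implementation), the scoring rule behaves like this:

```python
def scorecard(last_column):
    """Average the last column if every cell is boolean, or every cell is numeric."""
    if all(isinstance(v, bool) for v in last_column):
        # Booleans average to a pass rate between 0 and 1.
        return sum(last_column) / len(last_column)
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in last_column):
        return sum(last_column) / len(last_column)
    return None  # mixed or non-scorable types: no score is calculated

print(scorecard([True, True, False, False]))  # 0.5
print(scorecard([8, 9, 10]))                  # 9.0
print(scorecard([True, 7]))                   # None
```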
-
-## Executing Full Batch Runs
-
-Transition from pipeline to full batch run to apply your pipeline across the entire dataset for comprehensive evaluation.
diff --git a/features/evaluations/continuous-integration.mdx b/features/evaluations/continuous-integration.mdx
deleted file mode 100644
index 2d2c1be..0000000
--- a/features/evaluations/continuous-integration.mdx
+++ /dev/null
@@ -1,66 +0,0 @@
----
-title: "Continuous Integration"
-icon: "arrows-spin"
----
-
-Continuous Integration (CI) of prompt evaluations is the holy grail of prompt engineering. 🏆
-
-CI in the context of prompt engineering involves the automated testing and validation of prompts every time a new version is created or updated. LLMs are a probabilistic technology. It is hard (read: virtually impossible) to ensure a new prompt version doesn't break old user behavior just by eyeballing the prompt. Rigorous testing is the best tool we have.
-
-We believe that it's important to both allow subject-matter experts to write new prompts and provide them with tools to easily test if the prompts broke anything. That's where PromptLayer evaluations comes in.
-
-## Test-driven Prompt Engineering
-
-Similar to test-driven development (TDD) in software engineering, test-driven prompt engineering involves writing and running evaluations against new prompt versions before they are used in production. This proactive testing ensures that new prompts meet predefined criteria and behave as expected, minimizing the risk of unintended consequences.
-
-Setting up automatic evaluations on a specific prompt template is easy. When creating a new version, after adding a commit message, you will be prompted to select an evaluation pipeline to run. After doing this once, every new version of this prompt template will run the pipeline by default.
-
-**NOTE**: Make sure your evaluation pipeline uses the "latest" version of the prompt template in its column step. The template is fetched at runtime. If you specify a frozen version, the evaluation report won't reflect your newest prompt template.
-
-
-
-## Testing Strategies
-
-### Backtesting
-
-Backtesting involves running new prompt versions against a dataset compiled from historical production data. This strategy provides a real-world context for evaluating prompts, allowing you to assess how new versions would have performed under past conditions. It's an effective way to detect potential regressions and validate improvements, ensuring that updates enhance rather than detract from the user experience.
-
-To set up backtests, follow the steps below:
-
-**1. Create a historical dataset**
-
-
-
-[Create a dataset](/features/evaluations/datasets) using a search query. For example, I might want to create a dataset using all logged requests:
-- That use `my_prompt_template` version 6 or version 5
-- That were made in the last 2 months
-- That were using the tag `prod`
-- That users gave a 👍 response to
-
-This dataset will help you understand if your new prompt version broke any previous versions!
-
-**2. Build an evaluation pipeline**
-
-The next step is to create an evaluation pipeline using our new historical dataset.
-
-In plain English, this evaluation will feed in historical request context into your new prompt version then compare the new results to the old results. You can do a simple string comparison or get fancy with cosine similarities. PromptLayer will even show you a diff view for responses that are different.
-
-**3. Run it when you make a new version**
-
-This is the fun part. Next time you make a new prompt version, just select our new backtesting pipeline to see how the new prompt version fares.
-
-
-
-### Regression Testing
-
-Regression testing is the continuous refinement of evaluation datasets to include new edge cases and scenarios as they are discovered. This iterative process ensures that prompts remain robust against a growing set of challenges, preventing regressions in areas previously identified as potential failure points. By continually updating evaluations with new edge cases, you maintain a high standard of prompt quality and reliability.
-
-The process of setting up regression tests looks similar to backtesting.
-
-[Create a dataset](/features/evaluations/datasets) containing test cases for every edge case you can think of. The dataset should include context variables that you can input to your prompt template.
-
-### Scoring
-
-The evaluation can result in a single quantitative final score. To configure the score card, all you need to do is make sure that the last step consists entirely of numbers or Booleans. A final objective score makes comparing prompt performance easy, and it will be displayed alongside prompts in the Prompt Registry.
-
-
\ No newline at end of file
diff --git a/features/evaluations/datasets.mdx b/features/evaluations/datasets.mdx
index ebd160a..e06a49c 100644
--- a/features/evaluations/datasets.mdx
+++ b/features/evaluations/datasets.mdx
@@ -8,53 +8,20 @@ Datasets are often used for evaluations, but they can also be exported. Each dat
You can create a dataset from your LLM request history or by uploading a dataset file.
-## Creating a Dataset from History
+Datasets are also versioned, allowing you to add examples over time.
+
+## Creating a Dataset from Production History
Creating a dataset from your history is straightforward using the Dataset dialogue. Here, you can build a dataset from your request history. The dataset will include metadata, input variable context, tags, and the request response. This is useful for backtesting new prompt versions.
When creating a dataset from your history, several options are available for customization. You have the option to use time filters to narrow down the history included in the dataset by specifying a start and end time. Additionally, you can refine your dataset by including specific metadata or prompt templates, where metadata involves key-value pairs and prompt templates can be specified by name and version numbers. For those seeking more advanced customization, filtering based on a search query, specific scoring criteria, or tags can be used to build your dataset.
+## Uploading a Dataset
-## Creating a Dataset File
-
-JSON or CSV files are accepted for the dataset input file.
-
-### JSON Format
-
-In the JSON format, each test case is represented as a separate JSON object. The keys of the object correspond to the input variable names defined in the prompt template. The values represent the specific input values for each test case. Here's an example:
-
-```json
-[
- {
- "name": "John Doe",
- "age": 30,
- "location": "New York"
- },
- {
- "name": "Jane Smith",
- "age": 35,
- "location": "Los Angeles"
- },
- {
- "name": "Michael Johnson",
- "age": 40,
- "location": "Chicago"
- }
-]
-
-```
-
-In the above example, the prompt template may contain input variables like `{name}`, `{age}`, and `{location}`. Each test case object provides the corresponding values for these variables.
-
-### CSV Format
+You can also upload a dataset file in CSV or JSONL format. The uploaded file should contain the input variables and any expected outputs you want to include in the dataset.
-In the CSV format, each test case is represented as a separate row in the CSV file. The column headers correspond to the input variable names defined in the prompt template. The cells in each row represent the specific input values for each test case. Here's an example:
+When uploading a dataset, ensure that the file is properly formatted. For CSV files, each column should represent an input variable or expected output, with the first row containing the headers. For JSONL files, each line should be a valid JSON object representing a single example with key-value pairs for input variables and expected outputs.
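For example, a small JSONL dataset file might look like this (the field names are illustrative):

```
{"question": "What is the capital of France?", "expected_output": "Paris"}
{"question": "What is 2 + 2?", "expected_output": "4"}
```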
-```
-name,age,location
-John Doe,30,New York
-Jane Smith,35,Los Angeles
-Michael Johnson,40,Chicago
-```
+## Editing a Dataset on PromptLayer
-In this example, the prompt template may contain input variables like `{name}`, `{age}`, and `{location}`. Each row in the CSV file provides the corresponding values for these variables.
+You can also edit a dataset directly on PromptLayer.
diff --git a/features/evaluations/eval-types.mdx b/features/evaluations/eval-types.mdx
deleted file mode 100644
index 7f5842d..0000000
--- a/features/evaluations/eval-types.mdx
+++ /dev/null
@@ -1,287 +0,0 @@
----
-title: "Eval Types"
-icon: "screwdriver-wrench"
----
-
-This page provides an overview of the various evaluation column types available on our platform.
-
-## Primary Types
-
-
-
-### Prompt Template
-
-The _Prompt Template_ evaluation type allows you to execute a prompt template from the Prompt Registry. You have the flexibility to select the latest version, a specific label, or a particular version of the prompt template. You also have the ability to assign the input variables based on available inputs from the dataset or other columns. You can override the model parameters that are set in the Prompt Registry. This functionality is particularly useful for testing a prompt template within a larger evaluation pipeline, comparing different model parameters, or implementing an "LLM as a judge" prompt template.
-
-### Custom API Endpoint
-
-The _Custom API Endpoint_ enables you to set up a webhook that our system will call (POST) with all the columns to the left of the API endpoint when that cell is executed. As cells are processed sequentially, we will call this endpoint with all the columns to the left as the given payload, and the returned result will be displayed. This feature allows for extensive customization to accommodate specific use cases and integrate with external systems or custom evaluators.
-
-The payload will be in the form of
-
- ```json
- {
- "data": {
- "column1": "value1",
- "column2": "value2",
- "column3": "value3",
- ...
- }
- }
- ```
-
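For illustration, a minimal evaluator behind such an endpoint might process the payload like the sketch below. The `response` column name and the pass/fail rule are hypothetical, not part of the PromptLayer API:

```python
import json

def evaluate(payload: str) -> str:
    """Handle one Custom API Endpoint call; `payload` is the JSON body shown above."""
    columns = json.loads(payload)["data"]  # all columns to the left of this step
    # Hypothetical rule: pass if the response column mentions a refund
    passed = "refund" in columns.get("response", "").lower()
    return json.dumps({"result": passed})  # the returned value fills the cell

print(evaluate('{"data": {"query": "Where is my money?", "response": "A refund was issued."}}'))
```

In production this function would sit behind the POST URL you configure for the step.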
-### MCP
-
-The _MCP Action_ allows you to run functions on a remote MCP server. Simply plug in your server URL and auth, select your function and you will be able to call your function with inputs mapped from other cells. For more information about MCP check out [the official MCP docs.](https://modelcontextprotocol.io/introduction)
-
-### Human Input
-
-The _Human Input_ evaluation type allows the addition of either numeric or text input where an evaluator can provide feedback via a slider or a text box. This input can then be utilized in subsequent columns in the evaluation pipeline, allowing for the incorporation of human judgment.
-
-### Code Execution
-
-The _Code Execution_ evaluation type allows you to write and execute code for each row in your dataset. You can access the data through the `data` variable and return the cell value. Note that stdout will be ignored. There is a `6 minute timeout` for code execution.
-
-Code example to return a list of the names of each column:
-
-
-
-```py Python
-message = "These are my column names: "
-columns = [column_name for column_name in data.keys()]
-return message + str(columns)
-```
-
-```js JavaScript
-const message = "These are my column names: ";
-const columns = Object.keys(data);
-return message + JSON.stringify(columns);
-```
-
-
-
-**Python Runtime**
-
-```
-The Python runtime runs Python 3.12.0 with no filesystem. Runtime does have network access. Only the standard library is available. Here are the resource quotas:
-
-- Input code size: 1MiB
-- Size of stdin: 10MiB
-- Size of stdout: 20MiB
-- Size of stderr: 10MiB
-- Number of environment variables: 100
-- Environment variable key size: 4KiB
-- Environment variable value size: 100KiB
-- Number of arguments: 100
-- Argument size: 100KiB
-- Memory consumption: 128MiB
-```
-
-**JavaScript Runtime**
-
-```
-The JavaScript runtime is built on Mozilla's SpiderMonkey engine with no filesystem. Runtime does have network access. It is not node or deno. Available APIs include:
-
-- Legacy Encoding: atob, btoa, decodeURI, encodeURI, decodeURIComponent, encodeURIComponent
-- Streams: ReadableStream, ReadableStreamBYOBReader, ReadableStreamBYOBRequest, ReadableStreamDefaultReader, ReadableStreamDefaultController, ReadableByteStreamController, WritableStream, ByteLengthQueuingStrategy, CountQueuingStrategy, TransformStream
-- URL: URL, URLSearchParams
-- Console: console
-- Performance: Performance
-- Task: queueMicrotask, setInterval, setTimeout, clearInterval, clearTimeout
-- Location: WorkerLocation, location
-- JSON: JSON
-- Encoding: TextEncoder, TextDecoder, CompressionStream, DecompressionStream
-- Structured Clone: structuredClone
-- Fetch: fetch, Request, Response, Headers
-- Crypto: SubtleCrypto, Crypto, crypto, CryptoKey
-
-Resource Quotas:
-
-- Input code size: 1MiB
-- Size of stdin: 10MiB
-- Size of stdout: 20MiB
-- Size of stderr: 10MiB
-- Number of environment variables: 100
-- Environment variable key size: 4KiB
-- Environment variable value size: 100KiB
-- Number of arguments: 100
-- Argument size: 100KiB
-- Memory consumption: 128MiB
-```
-
-### Coding Agent
-
-The _Coding Agent_ evaluation type uses an AI coding agent (such as [Claude Code](https://www.claude.com/product/claude-code)) in a secure, sandboxed environment for each row in your dataset. Instead of writing code directly, you provide natural language instructions describing what you want to accomplish, and the AI coding agent handles the implementation.
-
-**How it works:**
-
-You provide a **natural language prompt** describing the task you want to accomplish. The coding agent executes in an isolated sandbox with access to:
-
-* **variables.json** - Automatically injected file containing all column values from previous cells in that row
-* **File attachments** - Any files you upload (CSV, JSON, text files, etc.) are available in the sandbox
-* **Network access** - Can make API calls and fetch external data
-
-The agent returns the result which populates the cell for that row.
-
-**Example use cases:**
-
-* **Data transformation**: "Parse the JSON response from the API column and extract all user emails into a comma-separated list"
-* **File processing**: "Read the attached sales_data.csv and calculate the total revenue for products in the 'Electronics' category"
-* **API integration**: "Use the api_key from variables.json to fetch user details from https://api.example.com/users/{user_id} and return their account status"
-
-### Conversation Simulator
-
-The _Conversation Simulator_ evaluation type automates the back-and-forth between your AI agent and simulated users to test conversational AI performance. This is particularly useful for evaluating multi-turn conversations where context maintenance, goal achievement, and user interaction patterns are critical.
-
-When setting up the conversation simulator:
-
-* Select your AI agent prompt template from the Prompt Registry
-* Pass in user details or context variables from your dataset
-* Define a test persona that challenges your AI with specific behaviors or constraints
-
-**Example Test Persona:**
-```
-User is nervous about seeing the doctor, hasn't been in a long time,
-won't share phone number until asked three times for it
-```
-
-**Optional Advanced Configuration:**
-
-* **User Goes First**: By default, the AI agent initiates the conversation. You can enable this setting to have the simulated user start the conversation instead.
-
-* **Conversation Samples**: You can provide sample conversations to help guide the simulated user's responses. These samples help maintain consistent voice and interaction patterns, ensuring the simulated user behaves realistically and consistently with your expected user base.
-
-The conversation results are returned as a JSON list of messages that can then be evaluated using other eval types like LLM Assertions to assess success criteria.
-
-## Simple Evals
-
-
-
-### Equality Comparison
-
-_Equality Comparison_ allows you to compare two different columns as strings. It provides a visual diff if there is a difference between the columns. Note that the diff is not used when calculating the score in that column and the column will be treated as a boolean for the purposes of a score. If there is no difference, this column returns true.
-
-### Contains Value
-
-The _Contains_ evaluation type enables you to search for a substring within a column. For instance, you could search for a specific word or phrase within each cell in the column. It uses the Python `in` operator to check whether the substring appears in the cell, and the check is case-insensitive.
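A minimal Python sketch of this behavior:

```python
def contains(cell: str, needle: str) -> bool:
    # Mirrors the documented behavior: Python's `in` operator, case-insensitive.
    return needle.lower() in cell.lower()

print(contains("The API Key is hidden", "api key"))  # True
```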
-
-### Regex Match
-
-The _Regex Match_ evaluation type allows you to define a regular expression pattern to search within the column. This provides powerful pattern matching capabilities for complex text analysis tasks.
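For example, a sketch using a hypothetical email pattern:

```python
import re

# Hypothetical pattern: does the cell contain an email address?
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def regex_match(cell: str) -> bool:
    return EMAIL.search(cell) is not None

print(regex_match("Contact jane@example.com today"))  # True
```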
-
-### Absolute Numeric Distance
-
-The _Absolute Numeric Distance_ evaluation type allows you to select two different columns and output the absolute distance between their numeric values in a new column. Both source columns must contain numeric values.
-
-## LLM Evals
-
-
-
-### Run LLM Assertion
-
-The _LLM Assertion_ evaluation type enables you to run an assertion on a column using natural language prompts. You can create prompts such as "Does this contain an API key?", "Is this sensitive content?", or "Is this in English?". Our system uses a backend prompt template that processes your assertion and returns either true or false. Assertions should be framed as questions.
-
-### AI Data Extract
-
-The _AI Data Extract_ evaluation type uses AI/LLM to extract specific information from data sources. You can describe what you want to extract using natural language queries, whether the content is JSON, XML, or just unstructured text.
-
-Example queries:
-* "Extract the product name"
-* "Find the customer's email address"
-* "Get all mentioned dates"
-* "Extract the total price including tax"
-
-### Cosine Similarity
-
-_Cosine Similarity_ allows you to compare the vector distance between two columns. The system takes the two columns you supply, converts them into strings, and then embeds them using OpenAI's embedding vectors. It then calculates the cosine similarity, resulting in a number between 0 and 1. This metric is useful for understanding how semantically similar two bodies of text are, which can be valuable in assessing topic adherence or content similarity.
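The underlying calculation can be sketched in a few lines of Python; the toy vectors below stand in for real embedding vectors:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors standing in for embedding vectors
print(round(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.5, 1.0]), 3))  # 0.943
```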
-
-## Helper Functions
-
-
-
-### JSON Extraction
-
-The _JSON Extraction_ evaluation type allows you to define a JSON path and extract either the first match or all matches in that path. We will automatically cast the source column into a JSON object. This is particularly useful for parsing structured data within your evaluations.
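As a rough illustration (the platform's JSON path syntax is richer than this), extracting a value along a dotted path might look like:

```python
import json

def extract(cell: str, path: str):
    """Minimal sketch: cast the cell to JSON, then walk a dotted path."""
    obj = json.loads(cell)
    for key in path.split("."):
        obj = obj[int(key)] if isinstance(obj, list) else obj[key]
    return obj

cell = '{"user": {"emails": ["a@x.com", "b@x.com"]}}'
print(extract(cell, "user.emails.0"))  # a@x.com
```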
-
-### Parse Value
-
-The _Parse Value_ column type enables you to convert another column into one of the following value types: string, number, Boolean, or JSON.
-
-### Apply Diff
-
-The _Apply Diff_ evaluation type applies diff patches to original content, similar to git merge operations. This helper function requires two source columns: the original content and a diff patch to apply.
-
-This evaluation type is particularly powerful when combined with code generation workflows or document editing pipelines where AI agents generate incremental changes rather than complete replacements. It enables sophisticated multi-step workflows where agents can review and refine each other's outputs.
-
-Using diff formats often saves context and leads to better results for editing large content.
-
-**Diff Format Details**
-
-The diff patch must be in the standard **unified diff** format, including file headers and hunk headers, as used by tools like `git` and described in the [unidiff documentation](https://pypi.org/project/unidiff/).
-
-If you are using an LLM to generate the diffs, copy and paste the following text into your prompt for format specifics:
-
-```markdown
-## Unified Diff Specification (strict unidiff)
-
-Produce a valid **unified diff** with file headers and hunk headers. Only modifications of existing text are supported (no file creation or deletion).
-
-### File headers (required)
-- Old (source): \`--- a/\`
-- New (target): \`+++ b/\`
-- Use consistent prefixes \`a/\` and \`b/\`.
-
-### Hunk headers (required for every changed region)
-- Format: \`@@ -<old_start>,<old_count> +<new_start>,<new_count> @@\`
- - \`<old_start>\` / \`<new_start>\` are 1-based line numbers.
- - \`<old_count>\` / \`<new_count>\` are the line counts for the hunk in old/new.
- - Multiple hunks per file are allowed; order them top-to-bottom.
-
-### Hunk body line prefixes (strict)
-- \`' '\` (space) = unchanged context line
-- \`-\` = line removed from source
-- \`+\` = line added in target
-- Preserve original whitespace and line endings exactly.
-
-### Rules
-- The concatenation of all **context + removed** lines in each hunk must appear **verbatim and contiguously** in the source file.
-- Keep context minimal but sufficient for unambiguous matching (usually 1-3 lines around changes).
-- Multiple files may be patched in one diff, but each requires its own \`---\` / \`+++\` headers and hunks.
-- If no changes are needed, output an empty string (no diff).
-
-### Example
---- a/essay.txt
-+++ b/essay.txt
-@@ -1,3 +1,3 @@
- This is a simple essay.
--It has a bad sentence.
-+It has a better sentence.
- The end.
-```
-
-### Static Value
-
-The _Static Value_ evaluation type allows you to pre-populate a column with a specific value. This is useful for adding constant values or context that you may need to use later in one of the other columns in your evaluation pipeline.
-
-### Type Validation
-
-_Type Validation_ returns a boolean for the given source column if it fits one of the specified types. The types supported for validation are JSON, number, or SQL. It will return `true` if the value is valid for the specified type, and `false` otherwise. For SQL validation, the system utilizes the [SQLGlot library](https://github.com/tobymao/sqlglot?tab=readme-ov-file#parser-errors).
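A sketch of the JSON and number checks in Python (the SQL check, which relies on SQLGlot, is omitted):

```python
import json

def is_valid(value: str, kind: str) -> bool:
    """Return True if `value` parses as the given type; sketch of JSON/number only."""
    if kind == "json":
        try:
            json.loads(value)
            return True
        except ValueError:
            return False
    if kind == "number":
        try:
            float(value)
            return True
        except ValueError:
            return False
    raise ValueError(f"unsupported kind: {kind}")

print(is_valid('{"a": 1}', "json"))  # True
print(is_valid("abc", "number"))     # False
```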
-
-### Coalesce
-
-The _Coalesce_ evaluation type allows you to take multiple different columns and coalesce them, similar to [SQL's COALESCE function](https://www.w3schools.com/sql/func_sqlserver_coalesce.asp).
-
-### Count
-
-The _Count_ evaluation type allows you to select a source column and count either the characters, words, or paragraphs within it. This will output a numeric value, which can be useful for analyzing the length or complexity of LLM outputs.
-
-
-Please reach out to us if you have any other evaluation types you would like to see on the platform. We are always looking to expand our evaluation capabilities to better serve your needs.
\ No newline at end of file
diff --git a/features/evaluations/examples.mdx b/features/evaluations/examples.mdx
deleted file mode 100644
index d243e13..0000000
--- a/features/evaluations/examples.mdx
+++ /dev/null
@@ -1,39 +0,0 @@
----
-title: "Eval Examples"
-icon: "people-group"
----
-
-## Building & Evaluating a RAG Chatbot
-
-
-
-
-
-
-This example shows how you can use PromptLayer to evaluate Retrieval Augmented Generation (RAG) systems. As a cornerstone of the LLM revolution, RAG systems enhance our ability to extract precise information from vast datasets, significantly improving question-answering capabilities.
-
-We will create a RAG system designed for financial data analysis using a dataset from the New York Stock Exchange. The tutorial video elaborates on the step-by-step process of constructing a pipeline that encompasses prompt creation, data retrieval, and the evaluation of the system's efficacy in answering finance-related queries.
-
-Most importantly, you can use PromptLayer to build end-to-end evaluation tests for RAG systems.
-
-## Migrating Prompts to Open-Source Models
-
-
-
-[Click Here to Read the Tutorial](https://blog.promptlayer.com/migrating-prompts-to-open-source-models-c21e1d482d6f)
-
-This tutorial demonstrates how to use PromptLayer to migrate prompts between different language models, with a focus on open-source models like [Mistral](https://mistral.ai/). It covers techniques for batch model comparisons, allowing you to evaluate the performance of your prompt across multiple models. The example showcases migrating an existing prompt for a RAG system to the open-source Mistral model and comparing the new outputs with visual diffs.
-
-The key steps include:
-
-1. Setting up a batch evaluation pipeline to run the prompt on both the original model (e.g., GPT) and the new target model (Mistral), while diffing the outputs.
-2. Analyzing the results, including accuracy scores, cost/latency metrics, and string output diffs, to assess the impact of migrating to the new model.
-3. Seamlessly updating the prompt template to use the new model (Mistral) if the migration is beneficial.
-
-This example highlights PromptLayer's capabilities for efficient prompt iteration and evaluation across different language models, facilitating the adoption of open-source alternatives like Mistral.
\ No newline at end of file
diff --git a/features/evaluations/getting-started.mdx b/features/evaluations/getting-started.mdx
new file mode 100644
index 0000000..f09b08f
--- /dev/null
+++ b/features/evaluations/getting-started.mdx
@@ -0,0 +1,22 @@
+---
+title: "Getting Started"
+icon: "flag-checkered"
+---
+
+
+To read about PromptLayer's view on Evaluations, see the [Why PromptLayer? Evaluations page](../../features/evaluations/overview).
+
+
+## Introduction
+
+Evaluations are a core feature of PromptLayer, designed to allow teams to test their prompt templates and agents at scale.
+
+At its core, an evaluation is a repeatable pipeline of steps (columns) that you run on a series of data (rows). This allows you to systematically assess the performance of your prompts or agents across various scenarios.
+You can create evaluations directly through the PromptLayer UI, enabling both technical and non-technical team members to collaborate on prompt testing and refinement.
+
+You can also define a score to track the progress as you iterate on your prompts or agents.
+
+There are a few core concepts:
+- ***A dataset***: all evaluations start with a dataset
+- ***An eval pipeline***: a definition of an evaluation and an optional score, defined on a sample of 4 rows from the dataset
+- ***An evaluation report***: the results of running the eval pipeline on the entire dataset
\ No newline at end of file
diff --git a/features/evaluations/overview.mdx b/features/evaluations/overview.mdx
index 8237dc8..68b7fa1 100644
--- a/features/evaluations/overview.mdx
+++ b/features/evaluations/overview.mdx
@@ -1,29 +1,48 @@
---
-title: "Evals Overview"
+title: "Evaluations"
icon: "book"
---
-**We believe that evaluation engineering is half the challenge of building a good prompt.** The Evaluations page is designed to help you iterate, build, and run batch evaluations on top of your prompts. Every prompt and every use case is different.
+
+To read about how to use the Evaluations feature, see the [Evaluations User Guide](../../features/evaluations/getting-started).
+
-Inspired by the flexibility of tools like Excel, we offer a visual pipeline builder that allows users to construct complex evaluation batches tailored to their specific requirements. Whether you're scoring prompts, running bulk jobs, or conducting regression testing, the Evaluations page provides the tools needed to assess prompt quality effectively. Made for both engineers and subject-matter experts.
+## UI-First Evaluations
-## Common Tasks
+PromptLayer's evaluation system is UI-first. We have all the features to run evals [via code](../../reference/create-reports), but we built the system from the ground up to be fully featured and functional in the UI.
-- **Scoring Prompts**: Utilize golden datasets for comparing prompt outputs with ground truths and incorporate human or AI evaluators for quality assessment.
-- **One-off Bulk Jobs**: Ideal for prompt experimentation and iteration.
-- **Backtesting**: Use historical data to build datasets and compare how a new prompt version performs against real production examples.
-- **Regression Testing**: Build evaluation pipelines and datasets to prevent edge-case regression on prompt template updates.
-- **Continuous Integration**: Connect evaluation pipelines to prompt templates to automatically run an eval with each new version (and catalogue the results). Think of it like a GitHub action.
+Why? We want evals to be a collaborative experience with multiple stakeholders—technical and non-technical. This fits with our product ethos: interacting with LLMs is a new form of work that needs to be collaborative, transparent, and include multiple stakeholders.
-## Example Use Cases
+## How It Works
-- **Chatbot Enhancements**: Improve chatbot interactions by evaluating responses to user requests against semantic criteria.
-- **RAG System Testing**: Build a RAG pipeline and validate responses against a golden dataset.
-- **SQL Bot Optimization**: Test Natural Language to SQL generation prompts by _actually_ running generated queries against a database (using the API Endpoint step), followed by an evaluation of the results' accuracy.
-- **Improving Summaries**: Combine AI evaluating prompts and human graders to help improve prompts without a ground truth.
+An evaluation is a pipeline of steps (columns) that you run over a dataset. It lets you apply a repeatable set of steps to multiple rows and combine the results into a score.
-## Additional Resources
+We've spent a lot of time enabling all types of evaluations—both super simple and super sophisticated—to be successful on PromptLayer. You can read more about how it works and how to create your own evals in the [Evaluations User Guide](../../features/evaluations/getting-started).
-For a deeper understanding of evaluation approaches, especially for complex LLM applications beyond simple classification or programming tasks, check out our blog post: [How to Evaluate LLM Prompts Beyond Simple Use Cases](https://blog.promptlayer.com/how-to-evaluate-llm-prompts-beyond-simple-use-cases/). This guide explores strategies like Decomposition Testing, working with Negative Examples, and implementing LLM as a Judge Rubric frameworks.
+## Ad-Hoc vs Templated Evals
-[Click here to see in-depth examples.](/features/evaluations/examples)
\ No newline at end of file
+While you can use evals for any use case, we have a few mental models we like to share:
+
+**Ad-hoc evals** answer one specific question. Maybe you want to analyze some production data or try out a new model. These are one-off. We have also found customers running ad-hoc evals that are not really evals at all, but more like data analysis or exploration: for example, taking a subset of production data and running a prompt over it to categorize the type of query. This kind of one-off experiment leverages PromptLayer's easy-to-use evals platform and the fact that we already have access to your production data.
+
+**Templated evals** follow a specific rubric that your team will re-run over and over—[either nightly](../../features/evaluations/programmatic) or as part of a big product or feature push. These are more structured and defined, with specific metrics.
+
+## Our Take: Use Assertions
+
+Whatever type of eval you create, we've found the best rubrics are a series of true/false questions. Our LLM Assertion column is perfect for this, but you can create your own rubric too.
+
+Basically, your team should come up with a series of yes/no questions to ask on your data. For example:
+- Does the output have at least one header?
+- Does it reference the correct customer's name?
+
+One advanced trick: earlier in your eval, add a step that generates a series of assertions from the information in each row, then use those to judge the output. You can import the dynamically generated assertions into the LLM Assertion column.
+
+That said, use whatever eval type fits your needs. Assertions are just what we've found easiest to understand and build out.
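As a rough sketch of this pattern (illustrative code only, not a PromptLayer API), an assertion rubric is just a list of yes/no checks whose overall pass rate becomes the score:

```python
# Hypothetical assertion rubric: each check answers a yes/no question
# about one output row. The checks and sample rows are illustrative.

def has_header(output: str) -> bool:
    return any(line.startswith("#") for line in output.splitlines())

def references_customer(output: str, customer: str) -> bool:
    return customer.lower() in output.lower()

rows = [
    {"output": "# Summary\nHello Ada, your order shipped.", "customer": "Ada"},
    {"output": "No header here, Bob.", "customer": "Bob"},
]

# Run every assertion on every row; score = percent of assertions that passed.
results = [
    [has_header(r["output"]), references_customer(r["output"], r["customer"])]
    for r in rows
]
passed = sum(sum(row) for row in results)
total = sum(len(row) for row in results)
score = 100 * passed / total
print(f"Score: {score:.0f}%")  # 3 of 4 assertions pass -> 75%
```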
+
+## Use Production Data
+
+Another strong recommendation is to use production data. We believe the best data is production data. A lot of teams come to us without a dataset or an evaluation, but with tons of production data. That is a perfect place to start.
+
+We've built a fully featured dataset builder that lets you construct sophisticated datasets from your production data.
+
+If you're not using that, you're missing one of the most powerful features of PromptLayer.
\ No newline at end of file
diff --git a/features/evaluations/score-card.mdx b/features/evaluations/score-card.mdx
index e7e5ef5..eaf86ef 100644
--- a/features/evaluations/score-card.mdx
+++ b/features/evaluations/score-card.mdx
@@ -3,124 +3,52 @@ title: "Score Card"
icon: "star"
---
-The score card feature in PromptLayer allows you to assign a score to each evaluation you run. This score provides a quick and easy way to assess the performance of your prompts and compare different versions.
+Score cards in PromptLayer provide a powerful way to automatically calculate and track evaluation scores for your pipelines. Scores are calculated automatically when an evaluation completes, giving you immediate insights into your prompt performance.
-## Configuring the Score Card
+## How Scoring Works
-
-
-
+When an evaluation pipeline finishes running, PromptLayer automatically calculates a score based on the results. There are two types of scoring methods available:
-### Default Configuration
+### Simple Scores
-By default, the score is calculated based on the last column in your evaluation results:
+Simple scores are the default scoring method. They automatically aggregate results from selected columns in your pipeline.
-- If the last column contains Booleans, the score will be the percentage of `true` values.
-- If the last column contains numbers, the score will be the average of those numbers.
+**How Simple Scoring Works:**
-### Custom Column Selection
+1. **Column Selection**: By default, the last column in your pipeline is used for scoring. You can select specific columns to include in the score calculation.
-You can customize which columns are included in the score card calculation. When setting up your evaluation pipeline, click the "Score card" button to configure the score card.
+
-Here, you can add specific columns to be included in the score calculation:
+2. **Value Aggregation**: For each scoring column, the system:
+ - Collects all completed cell values
+ - Converts values to booleans (for true/false assertions) or numbers
+ - Calculates the mean of all values
-- If you add multiple numeric columns, the total score will be the average of the averages for each selected column.
-- If you add multiple Boolean columns, the total score will be the average of the `true` percentages for each selected column.
-- Columns that do not contain numbers or Booleans will not be included in the score calculation.
+3. **Score Types**:
+ - **Boolean scores**: Displayed as a percentage (0-100%) representing the ratio of true values
+ - **Numeric scores**: Displayed as the average of all numeric values
-
-
-
+4. **Final Score**: If multiple columns are selected for scoring, the final score is the mean of all column scores. You will see a breakdown of each column's contribution to the overall score.
-These selected columns will also be formatted for more easy viewing in the evaluation report. You will see larger numbers, and check/x icons for booleans.
+
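The aggregation described above can be sketched in a few lines (an illustration only; PromptLayer computes this for you automatically):

```python
# Illustrative sketch of simple scoring: each column's completed cells
# are aggregated to a mean, then the column scores are averaged.

def column_score(values):
    if all(isinstance(v, bool) for v in values):
        return 100 * sum(values) / len(values)  # booleans -> percent true
    return sum(values) / len(values)            # numbers -> average

columns = {
    "has_header": [True, True, False, True],  # boolean assertion column
    "quality":    [7, 8, 6, 9],               # numeric column
}

scores = {name: column_score(vals) for name, vals in columns.items()}
final = sum(scores.values()) / len(scores)  # mean of all column scores
print(scores, round(final, 2))
```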
-### Custom Scoring Logic
+### Matrix Scores
-For more advanced scoring needs, you can provide your own custom scoring logic using Python or JavaScript code. The code execution environment is the same as the one used for the code execution evaluation column type [(learn more)](/features/evaluations/eval-types#code-execution).
+Matrix scores provide advanced scoring capabilities using custom code. This allows you to implement complex scoring logic, weighted averages, or custom business rules.
-This custom scoring logic can be used to generate a single score number or a drill-down matrix.
+**How Matrix Scoring Works:**
-
-
-
+1. **Custom Code Execution**: You provide Python (or JavaScript) code that receives all evaluation data
+2. **Data Access**: Your code receives a `data` variable containing all row results
+3. **Score Calculation**: Your code must return:
+ - A `score` key with a numeric value (required)
+ - Optionally, a `score_matrix` for detailed scoring breakdowns. You can provide multiple matrices if needed
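A minimal matrix scoring snippet might look like the following. In real scoring code, `data` is injected by PromptLayer and you `return` the dictionary rather than printing; the column names here are illustrative:

```python
# Sample of the injected `data` variable: one dict per row,
# mapping column name -> cell value (column names are made up).
data = [
    {"passed": True,  "latency": 1.2},
    {"passed": False, "latency": 0.8},
    {"passed": True,  "latency": 1.0},
]

pass_rate = 100 * sum(row["passed"] for row in data) / len(data)
avg_latency = sum(row["latency"] for row in data) / len(data)

result = {
    "score": round(pass_rate),  # required numeric score
    "score_matrix": [[          # optional drill-down matrix
        ["Metric", "Value"],
        ["Pass rate (%)", round(pass_rate)],
        ["Avg latency (s)", round(avg_latency, 2)],
    ]],
}
print(result["score"])
```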
-You can optionally return multiple drill-down matrices. This is useful for generating confusion matrices.
-
-
-
+
-Your custom scoring code must return an object with the following keys:
-- `score` (required): A number representing the overall score. This is mandatory.
-- `score_matrix` (optional): A list of lists of lists, representing one or more matrices of drilled-down scores. Each cell in these matrices can be a raw value or an object with metadata.
-#### Score Matrix Cell Format
+### Updating a Score on a Report
-Each cell in the `score_matrix` can be either:
-- A raw value (string or number), or
-- An object with the following properties:
- - `value`: The actual value of the cell, which can be a string or number.
- - `positive_metric`: (Optional) A boolean indicating whether an increase in this value is considered positive (`true`). If absent, we default to true.
-
-**Examples**
-- Simple value: `42`
-- Object with metadata: `{"value": 42, "positive_metric": true}`
-
-The optional `positive_metric` property can be used to indicate how changes in the value should be interpreted when comparing evaluations. This is particularly useful for automated reporting and analysis tools.
-
-#### Adding Titles to Score Matrices
-
-To add titles to your score matrices, simply add an extra field to the first row of the matrix and it will automatically be interpreted as the primary title. For example, if you have a matrix like:
-
-```python
-[[1,2],[1,2]]
-```
-
-You can add a title by modifying it to:
-
-```python
-[["Title",1,2],[1,2]]
-```
-
-### Code example
-
-The `data` variable will be available in your scoring code, which is a list containing a dictionary for each row in the evaluation results. The keys in each dictionary correspond to the column names, and the values are the corresponding cell values.
-
-For example:
-
-```py Python
-# The variable `data` is a list of rows.
-# Each row is a dictionary of column name -> value
-# For example: [
-# {'columnA': 1, 'columnB': 2},
-# {'columnA': 4, 'columnB': 1}
-# ]
-#
-# Must return a dictionary with the following structure:
-# {
-# 'score': int, # Required
-# 'score_matrix': [[[int, int, ...], ...]...], # Optional - list of lists of lists of integers
-# }
-
-return {
- 'score': len(data),
- 'score_matrix': [[
- ["Criteria", "Weight", "Value"],
- ["Correctness", 4, 7],
- ["Completeness", 3, 6],
- ["Accuracy", 5, 8],
- ["Relevance", 4, 9]
- ]],
-}
-```
-
-## Comparing Evaluation Reports
-
-You can compare two evaluation reports to see how scores and other metrics have changed between runs. Simply click the "Compare" button and select the evaluation reports you want to compare.
-
-The score card and any score matrices will be displayed side-by-side for easy comparison of your prompt's performance over time.
-
-
-
-
+By default, a report inherits its score configuration from its pipeline. However, you can update the score calculation on an existing report by editing the report settings. Changes to scoring automatically recalculate the score based on the new configuration.
\ No newline at end of file
diff --git a/features/prompt-history/sharing-prompts.mdx b/features/prompt-history/sharing-prompts.mdx
deleted file mode 100644
index 029e99d..0000000
--- a/features/prompt-history/sharing-prompts.mdx
+++ /dev/null
@@ -1,22 +0,0 @@
----
-title: "Sharing Requests"
-icon: "share"
----
-
-Often you may find yourself collaborating on prompts with other stakeholders. PromptLayer allows you to share prompts that were logged on our system easily.
-
-To do this, navigate to the dashboard and find the prompt you want to share:
-
-
-
-In the top right-hand corner, select the share button and click on the tab to make your prompt public:
-
-
-
-_Copy that link, and you are good to go!_
-
-Here is a link to the shared prompt from this tutorial: [https://promptlayer.com/share/89cb2cbf2e8b42a341bcd1da5443f65d](https://promptlayer.com/share/89cb2cbf2e8b42a341bcd1da5443f65d)
-
----
-
-Want to say hi 👋 , submit a feature request, or report a bug? [✉️ Contact us](mailto:hello@magniv.io)
\ No newline at end of file
diff --git a/features/prompt-registry/zero-downtime-releases.mdx b/features/prompt-registry/zero-downtime-releases.mdx
deleted file mode 100644
index 89f66ec..0000000
--- a/features/prompt-registry/zero-downtime-releases.mdx
+++ /dev/null
@@ -1,110 +0,0 @@
----
-title: "Zero Downtime Releases"
-icon: "rocket"
----
-
-## Input Variable Handling
-
-The `pl.run()` function handles input variables in the following ways:
-
-1. **Normal Usage**:
-
-Provide all required variables as defined in your prompt template:
-
-```python
-response = pl.run(
- prompt_name="movie_recommender",
- prompt_release_label="prod",
- input_variables={
- "favorite_movie": "The Shawshank Redemption"
- },
-)
-```
-
-2. **Missing Variables**:
-
-If you don't provide the required input variables, you'll receive a warning in the console, but the prompt template will still run. The missing variables will be sent to the LLM as unprocessed strings:
-
-```python
-response = pl.run(
- prompt_name="movie_recommender",
- prompt_release_label="prod",
- input_variables={},
-)
-```
-
-```
-WARNING: While getting your prompt template: Some input variables are missing: (`favorite_movie`)
-Undefined variable in message index 1: 'favorite_movie' is undefined
-```
-
-3. **Extra Variables**:
-
-If you include extra variables that aren't in the template, they will be ignored:
-
-```python
-response = pl.run(
- prompt_name="movie_recommender",
- prompt_release_label="prod",
- input_variables={
- "favorite_movie": "The Shawshank Redemption",
- "release_year": 1994
- },
-)
-```
-
-In this case, the `release_year` variable will be ignored in the LLM request if it's not part of the current template.
-
-When you need to add new input variables to your prompt template, it's important to keep your source code in sync with the template changes. This guide outlines the process for deploying these updates to your production environment.
-
-## Example Scenario
-
-Assume you have a prompt template version tagged with `prod` that uses only one input variable, `favorite_movie`:
-
-```python
-response = pl.run(
- prompt_name="movie_recommender",
- prompt_release_label="prod",
- input_variables={
- "favorite_movie": "The Shawshank Redemption"
- },
-)
-```
-
-## Update Process
-
-Follow these steps to safely add a new `mood` variable to your prompt template:
-
-1. Create a new template version with the new `mood` variable
-
-2. Apply a unique temporary label (e.g., `new-var`) to the new version
-
-3. Update and deploy your code to use the new template version and include the new variable:
-
-```python
-response = pl.run(
- prompt_name="movie_recommender",
- prompt_release_label="new-var",
- input_variables={
- "favorite_movie": "The Shawshank Redemption",
- "mood": "uplifting"
- },
-)
-```
-
-4. In the PromptLayer UI, move the `prod` label to the most recent prompt version
-
-5. Update your source code to reference the `prod` prompt version again and deploy:
-
-```python
-response = pl.run(
- prompt_name="movie_recommender",
- prompt_release_label="prod",
- input_variables={
- "favorite_movie": "The Shawshank Redemption",
- "mood": "uplifting"
- },
-)
-```
-
-6. Delete the temporary `new-var` label from the PromptLayer UI
diff --git a/images/evals/configure-scorecard-simple.png b/images/evals/configure-scorecard-simple.png
new file mode 100644
index 0000000..9ad4c8a
Binary files /dev/null and b/images/evals/configure-scorecard-simple.png differ
diff --git a/images/evals/scorecard-matrix.png b/images/evals/scorecard-matrix.png
new file mode 100644
index 0000000..51b1ef5
Binary files /dev/null and b/images/evals/scorecard-matrix.png differ
diff --git a/images/evals/scorecard-simple.png b/images/evals/scorecard-simple.png
new file mode 100644
index 0000000..7b29685
Binary files /dev/null and b/images/evals/scorecard-simple.png differ
diff --git a/mint.json b/mint.json
index 4125e98..8293aa7 100644
--- a/mint.json
+++ b/mint.json
@@ -39,7 +39,6 @@
{
"group": "Usage Documentation",
"pages": [
- "features/prompt-history/sharing-prompts",
{
"group": "Prompt Registry",
"pages": [
@@ -52,12 +51,21 @@
"features/prompt-registry/input-variable-sets",
"features/prompt-registry/template-variables",
"features/prompt-registry/placeholder-messages",
- "running-requests/prompt-blueprints",
- "features/prompt-registry/webhooks",
- "features/prompt-registry/zero-downtime-releases"
+ "running-requests/prompt-blueprints"
],
"icon": "notebook"
},
+ {
+ "group": "Evaluation",
+ "pages": [
+ "features/evaluations/getting-started",
+ "features/evaluations/datasets",
+ "features/evaluations/building-pipelines",
+ "features/evaluations/score-card",
+ "features/evaluations/programmatic"
+ ],
+ "icon": "vial-circle-check"
+ },
{
"group": "Running Requests",
"pages": [
@@ -81,7 +89,15 @@
],
"icon": "cassette-tape"
},
- "onboarding-guides/deployment-strategies",
+ {
+ "group": "Deployment",
+ "pages": [
+ "onboarding-guides/deployment-strategies",
+ "onboarding-guides/webhooks",
+ "onboarding-guides/zero-downtime-deploys"
+ ],
+ "icon": "ship"
+ },
{
"group": "Models & Integrations",
"pages": [
@@ -102,27 +118,12 @@
"why-promptlayer/prompt-management",
"why-promptlayer/advanced-search",
"why-promptlayer/ab-releases",
- {
- "group": "Evaluations",
- "pages": [
- "features/evaluations/overview",
- "features/evaluations/continuous-integration",
- "features/evaluations/examples",
- "features/evaluations/building-pipelines",
- "features/evaluations/eval-types",
- "features/evaluations/score-card",
- "features/evaluations/datasets",
- "features/evaluations/programmatic",
- "why-promptlayer/voice-agents"
- ],
- "icon": "vial-circle-check"
- },
+ "features/evaluations/overview",
+ "why-promptlayer/voice-agents",
"why-promptlayer/agents",
"why-promptlayer/multi-turn-chat",
"why-promptlayer/fine-tuning",
"why-promptlayer/analytics",
- "why-promptlayer/evaluation-and-ranking",
- "why-promptlayer/playground",
"why-promptlayer/shared-workspaces",
"why-promptlayer/how-it-works"
]
diff --git a/onboarding-guides/deployment-strategies.mdx b/onboarding-guides/deployment-strategies.mdx
index 1ed3260..d7678f2 100644
--- a/onboarding-guides/deployment-strategies.mdx
+++ b/onboarding-guides/deployment-strategies.mdx
@@ -6,9 +6,9 @@ icon: ship
PromptLayer fits into your stack at three levels of sophistication:
-1. **`promptlayer_client.run`** – zero-setup SDK sugar
+1. **`promptlayer_client.run`** – zero-setup SDK sugar
2. **Webhook-driven caching** – maintain local cache of prompt templates
-3. **Managed Agents** – let PromptLayer orchestrate everything server-side
+3. **Managed Agents** – let PromptLayer orchestrate everything server-side
@@ -57,9 +57,9 @@ const response = await plClient.run({
3. SDK writes the log back to PromptLayer.
> 💡 **Tip** – If latency is critical, enqueue the log to a background worker and let your request return immediately.
----
+---
-# Cache prompts with Webhooks
+# Cache prompts with Webhooks
Eliminate the extra round‑trip by **replicating prompts into your own cache or database**.
@@ -86,7 +86,7 @@ sequenceDiagram
1. **Subscribe to webhooks in the UI**
-Read more here about webhooks [here](/features/prompt-registry/webhooks).
+Read more about webhooks [here](/onboarding-guides/webhooks).
2. **Maintain a local cache**
@@ -107,7 +107,7 @@ queue.enqueue(track_to_promptlayer, llm_response)
> **Tip:** Most teams push the track_to_promptlayer onto a Redis or SQS queue so as to not block on the logging of a request.
-Read the full guide: **[PromptLayer Webhooks ↗](/features/prompt-registry/webhooks)**
+Read the full guide: **[PromptLayer Webhooks ↗](/onboarding-guides/webhooks)**
---
@@ -173,10 +173,11 @@ Learn more: **[Agents documentation ↗](/why-promptlayer/agents)**
## Further reading 📚
* **Quickstart** – [Your first prompt](/quickstart)
-* **Webhooks** – [Events & signature verification](/features/prompt-registry/webhooks)
+* **Webhooks** – [Events & signature verification](/onboarding-guides/webhooks)
+* **Zero Downtime Deploys** – [Deploy new variables safely](/onboarding-guides/zero-downtime-deploys)
* **Agents** – [Concepts & versioning](/why-promptlayer/agents)
-* **CI for prompts** – [Continuous Integration guide](/features/evaluations/ci)
+* **Evaluation** – [Building evaluation pipelines](/features/evaluations/getting-started)
---
-> ✉️ **Need a hand?** Ping us in Discord or email [hello@promptlayer.com](mailto:hello@promptlayer.com)—happy to chat architecture!
+> ✉️ **Need a hand?** Ping us in Discord or email [hello@promptlayer.com](mailto:hello@promptlayer.com)—happy to chat architecture!
\ No newline at end of file
diff --git a/features/prompt-registry/webhooks.mdx b/onboarding-guides/webhooks.mdx
similarity index 83%
rename from features/prompt-registry/webhooks.mdx
rename to onboarding-guides/webhooks.mdx
index 1e84702..99ae4de 100644
--- a/features/prompt-registry/webhooks.mdx
+++ b/onboarding-guides/webhooks.mdx
@@ -5,7 +5,13 @@ icon: "fishing-rod"
Webhooks can be set up to receive notifications about changes to prompt templates. This functionality is particularly useful for storing prompts in cache, allowing for quicker retrieval without slowing down releases.
-### Event Payload Format
+## Setting Up Webhooks
+
+To set up a webhook, go to the **Webhook** section in the **Settings** page. Enter the URL of the endpoint you want to send the webhook to and click **Submit**.
+
+
+
+## Event Payload Format
When an event occurs, we send a POST request with a payload in this structure:
@@ -21,8 +27,7 @@ When an event occurs, we send a POST request with a payload in this structure:
}
```
-### Supported Event Types
-We notify you for these events:
+## Supported Event Types
| Event Type | Description | Details |
|--------------------------------------|-----------------------------------------------------------------|-------------------------------------------------------------------------|
@@ -37,9 +42,9 @@ We notify you for these events:
| `report_finished` | When an evaluation report is completed. |
`report_id` (number)
`report_name` (string)
|
| `dataset_version_created_by_file` | When a dataset version is successfully created from a file upload. |
`dataset_id` (number)
`dataset_version_number` (number)
|
| `dataset_version_created_by_file_failed` | When dataset file processing fails. |
`error_message` (string)
|
-| `dataset_version_created_from_filter_params` | When a dataset version is created from filter parameters. |
`dataset_id` (number)
`rows_added` (number)
`dataset_version_number` (number)
|
+| `dataset_version_created_from_filter_params` | When a dataset version is created from filter parameters. |
`dataset_id` (number)
`rows_added` (number)
`dataset_version_number` (number)
|
-### Example Payload
+## Example Payload
```json
{
@@ -59,21 +64,15 @@ We notify you for these events:
}
```
-### Configuring a Webhook
+## Securing Your Webhook
-To set up a webhook, go to the **Webhook** section in the **Settings** page. Enter the URL of the endpoint you want to send the webhook to and click **Submit**.
-
-
-
-### Securing Your Webhook
When you create a webhook, you'll receive a webhook secret signature that looks like this:

This secret is used to verify that incoming webhook requests are authentic and come from PromptLayer. The signature is included in the `X-PromptLayer-Signature` header of each webhook request.
-
-#### Verifying Webhook Signatures
+### Verifying Webhook Signatures
Here are code examples showing how to verify the signatures:
@@ -135,8 +134,49 @@ export function verifySignature(signature, payload, secretKey) {
return false;
}
}
-
+
const isValid = verifySignature(signature, payload, secretKey);
console.log("Signature is", isValid ? "valid" : "invalid");
```
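If you prefer Python, here is a sketch of the same verification, under the assumption that the signature in the `X-PromptLayer-Signature` header is an HMAC-SHA256 hex digest of the raw request body (the secret and payload below are hypothetical):

```python
import hashlib
import hmac

def verify_signature(signature: str, payload: bytes, secret_key: str) -> bool:
    # Assumption: the signature is an HMAC-SHA256 hex digest of the raw payload.
    expected = hmac.new(secret_key.encode(), payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information during comparison.
    return hmac.compare_digest(expected, signature)

secret_key = "whsec_example"  # hypothetical webhook secret
payload = b'{"event_type": "prompt_template_updated"}'
signature = hmac.new(secret_key.encode(), payload, hashlib.sha256).hexdigest()
print(verify_signature(signature, payload, secret_key))   # True
print(verify_signature("tampered", payload, secret_key))  # False
```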
-
\ No newline at end of file
+
+
+## Using Webhooks for Caching
+
+Webhooks are particularly useful for maintaining a local cache of prompts, eliminating the need for extra round-trips to PromptLayer. Here's how it works:
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant PL as PromptLayer Server
+ participant APP as Your Application
+ participant DB as Your Cache / DB
+ participant LLM as Model Provider
+
+ PL->>APP: "prompt.updated" webhook
+ APP->>DB: invalidate + store latest prompt
+ user->>APP: request needing AI
+ APP->>DB: fetch prompt
+ APP->>LLM: run prompt
+ APP-->>PL: async track.log (optional queue)
+```
+
+### Implementation Example
+
+1. **Handle webhook events**:
+
+```python
+# pseudocode
+def handle_pl_webhook(event):
+ prompt = event["data"]
+ db.prompts.upsert(prompt["prompt_template_name"], prompt)
+```
+
+2. **Serve traffic from cache**:
+
+```python
+prompt = db.prompts.get("order-summary")
+llm_response = openai.chat.completions.create(...)
+queue.enqueue(track_to_promptlayer, llm_response)
+```
+
+> **Tip:** Most teams push `track_to_promptlayer` onto a Redis or SQS queue so that request handling doesn't block on logging.
\ No newline at end of file
diff --git a/onboarding-guides/zero-downtime-deploys.mdx b/onboarding-guides/zero-downtime-deploys.mdx
new file mode 100644
index 0000000..c47b956
--- /dev/null
+++ b/onboarding-guides/zero-downtime-deploys.mdx
@@ -0,0 +1,155 @@
+---
+title: "Zero Downtime Deploys"
+icon: "rocket"
+---
+
+When you need to add new input variables to your prompt template, it's important to keep your source code in sync with the template changes. This guide outlines the process for deploying these updates to your production environment without any downtime.
+
+## Input Variable Handling
+
+The `pl.run()` function handles input variables in the following ways:
+
+### 1. Normal Usage
+
+Provide all required variables as defined in your prompt template:
+
+```python
+response = pl.run(
+ prompt_name="movie_recommender",
+ prompt_release_label="prod",
+ input_variables={
+ "favorite_movie": "The Shawshank Redemption"
+ },
+)
+```
+
+### 2. Missing Variables
+
+If you don't provide the required input variables, you'll receive a warning in the console, but the prompt template will still run. The missing variables will be sent to the LLM as unprocessed strings:
+
+```python
+response = pl.run(
+ prompt_name="movie_recommender",
+ prompt_release_label="prod",
+ input_variables={},
+)
+```
+
+```
+WARNING: While getting your prompt template: Some input variables are missing: (`favorite_movie`)
+Undefined variable in message index 1: 'favorite_movie' is undefined
+```
+
+### 3. Extra Variables
+
+If you include extra variables that aren't in the template, they will be ignored:
+
+```python
+response = pl.run(
+ prompt_name="movie_recommender",
+ prompt_release_label="prod",
+ input_variables={
+ "favorite_movie": "The Shawshank Redemption",
+ "release_year": 1994 # This will be ignored if not in template
+ },
+)
+```
+
+In this case, the `release_year` variable will be ignored in the LLM request if it's not part of the current template.
+
+## Zero Downtime Update Process
+
+Follow these steps to safely add new variables to your prompt template without any service interruption:
+
+### Example Scenario
+
+Assume you have a prompt template version tagged with `prod` that uses only one input variable, `favorite_movie`:
+
+```python
+response = pl.run(
+ prompt_name="movie_recommender",
+ prompt_release_label="prod",
+ input_variables={
+ "favorite_movie": "The Shawshank Redemption"
+ },
+)
+```
+
+Now you want to add a new `mood` variable to enhance the recommendations.
+
+### Step-by-Step Process
+
+1. **Create a new template version** with the new `mood` variable in the PromptLayer UI
+
+2. **Apply a unique temporary label** (e.g., `new-var`) to the new version in the UI
+
+3. **Update and deploy your code** to use the new template version and include the new variable:
+
+```python
+response = pl.run(
+ prompt_name="movie_recommender",
+ prompt_release_label="new-var", # Temporary label
+ input_variables={
+ "favorite_movie": "The Shawshank Redemption",
+ "mood": "uplifting" # New variable
+ },
+)
+```
+
+4. **Verify the deployment** is working correctly with the new variable
+
+5. **In the PromptLayer UI, move the `prod` label** to the new prompt version
+
+6. **Update your source code** to reference the `prod` prompt version again and deploy:
+
+```python
+response = pl.run(
+ prompt_name="movie_recommender",
+ prompt_release_label="prod", # Back to prod
+ input_variables={
+ "favorite_movie": "The Shawshank Redemption",
+ "mood": "uplifting"
+ },
+)
+```
+
+7. **Clean up** by deleting the temporary `new-var` label from the PromptLayer UI
+
+## Benefits of This Approach
+
+- **Zero downtime**: Your service remains available throughout the update process
+- **Rollback capability**: You can quickly revert to the previous version if issues arise
+- **Gradual rollout**: You can test the new version with a subset of traffic first
+- **Version control**: Both prompt templates and code changes are versioned and synchronized
+
+## Best Practices
+
+1. **Always test new variables** in a staging environment first
+2. **Use descriptive temporary labels** that indicate the purpose (e.g., `add-mood-var`)
+3. **Document variable changes** in your commit messages and PR descriptions
+4. **Consider using feature flags** for more complex deployments
+5. **Monitor logs** during the deployment for any unexpected warnings
+
+## Automation with CI/CD
+
+You can automate this process using PromptLayer's API in your CI/CD pipeline:
+
+```python
+# Example CI/CD script (method names below are illustrative pseudocode)
+from promptlayer import PromptLayer
+
+pl_client = PromptLayer()
+
+# Step 1: Create new version and apply temporary label
+pl_client.create_prompt_template_version(...)
+pl_client.apply_label("new-var", ...)
+
+# Step 2: Deploy application code
+# ... deployment logic ...
+
+# Step 3: Move prod label to new version
+pl_client.move_label("prod", to_version=new_version)
+
+# Step 4: Clean up temporary label
+pl_client.delete_label("new-var")
+```
+
+This ensures consistent, repeatable deployments without manual intervention.
\ No newline at end of file
diff --git a/why-promptlayer/evaluation-and-ranking.mdx b/why-promptlayer/evaluation-and-ranking.mdx
deleted file mode 100644
index 8a5290f..0000000
--- a/why-promptlayer/evaluation-and-ranking.mdx
+++ /dev/null
@@ -1,70 +0,0 @@
----
-title: "Scoring & Ranking Prompts"
-icon: "ranking-star"
----
-
-One of the biggest challenges in prompt engineering is understanding if Prompt A performs better than Prompt B. PromptLayer helps you solve this.
-
-Testing in development can only get you so far. We believe the best way to understand your prompts is by analyzing them in production.
-
-Below are some ways you can use PromptLayer to answer the following key questions:
-
-- How much does PromptA vs PromptB cost?
-- How often is PromptA used?
-- Is PromptA working better than PromptB?
-- Which prompts are receiving the most negative user feedback?
-- How do I synthetically evaluate my prompts using LLMs?
-
-## A/B Testing
-
-
-
-PromptLayer is best used as an orchestration & data layer of your prompts.
-
-That means [A/B testing](/why-promptlayer/ab-releases) is easy. Use the [Prompt Registry](/features/prompt-registry) to version templates build different tests and automatically segment versions using [dynamic release labels](/features/prompt-registry/dynamic-release-labels).
-
-## Scoring
-
-*Every PromptLayer request can have multiple "Scores". A score is an integer from 0-100.*
-
-
-
-In PromptLayer, ranking is based on Score values. Scores can be updated via the UI or programmatically, allowing for the creation of named or unnamed scores. For further details, refer to the provided documentation on prompt history, metadata, and request IDs.
-The three most common ways to Score to rank your prompts are:
-
-1. **User feedback**: Present a 👍 and 👎 to your users after the completion. A user press of one of those buttons fills in a score of [100, 0] respectively.
-2. **RLHF**: Use our visual dashboard to fill in scores by hand. You can then use this data to decide between prompt templates or to fine-tune.
-3. **Synthetic Evaluation**: Use LLMs to score LLMs. After getting a completion, run an evaluation prompt on it and translate that to a score [0, 100].
-
- For example, your prompt could be:
-
- ```
- The following is an AI chat message given to a user:
-
- {completion}
-
- --
-
- We are worried that the chatbot is being rude.
- How rude is the chat on a scale of 0 to 100?
- ```
-
-
-## Analytics
-
-After populating Scores as described above, navigate to the Prompt Template page to see how each template stacks up.
-
-
-
-
-## Pricing
-
-We live in the real world, so money matters. Building a prod LLM system means managing price. Some LLMs are cheaper than other LLMs. Some prompts are cheaper than other prompts.
-
-Each request history page will tell you its individual cost:
-
-
-
-You can also see the lifetime cost of a template in the Prompt Registry template page.
-
-
\ No newline at end of file
diff --git a/why-promptlayer/playground.mdx b/why-promptlayer/playground.mdx
deleted file mode 100644
index c9e3625..0000000
--- a/why-promptlayer/playground.mdx
+++ /dev/null
@@ -1,26 +0,0 @@
----
-title: "Playground"
-icon: "circle-play"
----
-
-The Playground is a native way to create and run new LLM requests all through PromptLayer. Your run history will be tracked in the sidebar. The Playground is most useful as a tool to "replay" and debug old requests.
-
-## Replay requests
-
-The Playground allows PromptLayer users to rerun previous LLM requests. Simply click "Open in Playground" on any historical request.
-
-
-
-
-
-## OpenAI Tools
-
-The Playground fully supports [OpenAI function calling](https://platform.openai.com/docs/guides/function-calling). These tools can be accessed directly from the Playground interface and can be incorporated into your requests as needed. *Not even OpenAI's playground does this đź‘€*
-
-
-
-## Custom models
-
-The Playground also supports the use of custom models for LLM requests. This means you can use a fine-tuned model or a dedicated OpenAI instance.
\ No newline at end of file