908 changes: 908 additions & 0 deletions assets/conversations-a.json


1,872 changes: 1,872 additions & 0 deletions assets/conversations-b.json


Binary file added assets/images/external_logs_evaluations.png
Binary file added assets/images/external_logs_flow_logs.png
Binary file added assets/images/external_logs_logs_a.png
Binary file added assets/images/external_logs_runs_b.png
Binary file added assets/images/external_logs_stats_a.png
Binary file added assets/images/external_logs_stats_b.png
2,953 changes: 1,689 additions & 1,264 deletions poetry.lock


3 changes: 2 additions & 1 deletion pyproject.toml
@@ -4,6 +4,7 @@ version = "0.1.0"
 description = ""
 authors = ["Peter Hayes <ucabpnh@ucl.ac.uk>"]
 readme = "README.md"
+package-mode = false
 
 [tool.poetry.dependencies]
 python = "^3.11"
@@ -13,7 +14,7 @@ openai = "^1.42.0"
 pandas = "^2.2.2"
 pyarrow = "^17.0.0"
 prettytable = "^3.11.0"
-humanloop = "^0.8.13"
+humanloop = "0.8.18"
 
 [build-system]
 requires = ["poetry-core"]
356 changes: 356 additions & 0 deletions tutorials/evaluating_external_logs.ipynb
@@ -0,0 +1,356 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluating external Logs on Humanloop\n",
"\n",
"This notebook demonstrates how to run an Evaluation on Humanloop using your own logs.\n",
"This is useful if you have existing logs in an external system and you want to evaluate them on Humanloop with minimal setup.\n",
"\n",
"In this notebook, we will use the example of a JSON file containing chat messages between users and customer support agents. We will bring you through uploading these logs to Humanloop and creating an Evaluation with them.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"First, we import the data we will evaluate. These are the `conversations-a.json` and `conversations-b.json` files.\n",
"Then, we configure our Humanloop SDK client."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"from pathlib import Path\n",
"\n",
"with open(Path(os.getcwd()).parent / \"assets\" / \"conversations-a.json\") as f:\n",
" data = json.load(f)\n",
"\n",
"import pprint\n",
"pprint.pprint(data[:2], width=120)"
]
},
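{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each entry in `conversations-a.json` is assumed to be one conversation: a list of chat messages, each with a `role` and a `content` key. The optional sanity check below (a sketch of the shape we assume) makes that explicit before we upload anything."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (sketch): each conversation is assumed to be a list\n",
"# of messages, and each message to have \"role\" and \"content\" keys.\n",
"assert isinstance(data, list)\n",
"for conversation in data:\n",
"    assert isinstance(conversation, list)\n",
"    for message in conversation:\n",
"        assert {\"role\", \"content\"} <= set(message), f\"Unexpected message shape: {message}\""
]
},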
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dotenv import load_dotenv\n",
"\n",
"# load .env file that contains API keys\n",
"load_dotenv()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from humanloop import Humanloop\n",
"\n",
"humanloop = Humanloop(api_key=os.getenv(\"HUMANLOOP_KEY\"), base_url=os.getenv(\"HUMANLOOP_BASE_URL\"))"
]
},
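{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the client fails to authenticate, the most common cause is a missing or misnamed environment variable. An optional early check, assuming the variable names used above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fail early with a clear message if the API key was not loaded from .env.\n",
"assert os.getenv(\"HUMANLOOP_KEY\"), \"HUMANLOOP_KEY is not set; add it to your .env file\""
]
},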
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload Logs to Humanloop\n",
"\n",
"\n",
"Use the `log(...)` method to upload the logs to Humanloop. This will create a new **Flow** on Humanloop.\n",
"\n",
"We additionally pass in some `attributes` identifying\n",
"the configuration of the system that generated these logs.\n",
"`attributes` accepts arbitrary values, and is used for versioning the Flow.\n",
"Here, it allows us to associate this set of logs with a specific version of the support agent.\n",
"\n",
"<div class=\"alert alert-block alert-info\">\n",
"Note that a Flow on Humanloop usually captures interactions between other Humanloop Files, such as Prompts and Tools.\n",
"However, we are using it here to represent Logs captured by a black-box system, with the system only identified by a version number.\n",
"\n",
"If you have more context about the system generating the Logs, you should consider using a more appropriate Humanloop File or providing more context under the Flow's `metadata` field.\n",
"In this support agent chat example, using a Prompt would be more appropriate, but would require also specifying additional information\n",
"that might not be available to you, such as `model`. Here, we use a Flow to keep things simple.\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tqdm import tqdm\n",
"\n",
"log_ids = []\n",
"for messages in tqdm(data):\n",
" log = humanloop.flows.log(\n",
" path=\"External logs demo/Travel planner\",\n",
" flow={\"attributes\": {\"agent-version\": \"1.0.0\"}}, # Optionally add attributes to identify this version of the support agent.\n",
" messages=messages,\n",
" )\n",
" log_ids.append(log.id)\n",
"\n",
"# We'll use the `version_id` later when creating a Run.\n",
"version_id = log.version_id"
]
},
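{
"cell_type": "markdown",
"metadata": {},
"source": [
"Uploading happens one Log at a time, so a single malformed conversation would abort the whole loop. If that is a concern, a more defensive variant (a sketch reusing only the `flows.log(...)` call above, shown as a snippet rather than a cell so the Logs are not uploaded twice) collects failures instead of raising:\n",
"\n",
"```python\n",
"failed = []\n",
"for i, messages in enumerate(tqdm(data)):\n",
"    try:\n",
"        log = humanloop.flows.log(\n",
"            path=\"External logs demo/Travel planner\",\n",
"            flow={\"attributes\": {\"agent-version\": \"1.0.0\"}},\n",
"            messages=messages,\n",
"        )\n",
"        log_ids.append(log.id)\n",
"    except Exception as exc:\n",
"        failed.append((i, exc))\n",
"print(f\"{len(failed)} of {len(data)} conversations failed to upload\")\n",
"```"
]
},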
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will have created a new Flow on Humanloop named **Travel planner**.\n",
"\n",
"To confirm this logging has succeeded, navigate to the **Logs** tab of the Flow\n",
"and view the uploaded logs. Each Log should correspond to a conversation and contain\n",
"a list of messages.\n",
"\n",
"![Flow Logs](../assets/images/external_logs_flow_logs.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an Evaluation on Humanloop\n",
"\n",
"Next, we'll create an Evaluation on Humanloop. This will allow us to evaluate the Logs we uploaded.\n",
"An Evaluation will have a set of Runs, each of which will have a set of Logs, allowing us to compare the performance across different Runs.\n",
"\n",
"Here, we'll use the example \"Helpfulness\" LLM-as-a-judge Evaluator. This will automatically rate\n",
"the helpfulness of the support agent across our logs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create Evaluation\n",
"evaluation = humanloop.evaluations.create(\n",
" name=\"Past records\",\n",
" # NB: you can use `path`or `id` for references on Humanloop\n",
" file={\"path\": \"External logs demo/Travel planner\"},\n",
" evaluators=[\n",
" {\"path\": \"Example Evaluators/AI/Helpfulness\"},\n",
" ],\n",
")\n",
"print(f\"Created Evaluation: {evaluation.id}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a Run for this set of Logs\n",
"run = humanloop.evaluations.create_run(\n",
" id=evaluation.id,\n",
" version={'version_id': version_id}, # Associate this Run to the Flow version created above.\n",
")\n",
"print(f\"Created Run: {run.id}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Assign Logs to the Run\n",
"\n",
"We'll associate the Logs we uploaded to the Run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"humanloop.evaluations.add_logs_to_run(\n",
" id=evaluation.id,\n",
" run_id=run.id,\n",
" log_ids=log_ids,\n",
")"
]
},
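{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a very large number of Logs, you may prefer to add them in batches rather than in a single request. A sketch (assuming `add_logs_to_run` accepts any subset of your Log IDs, as in the call above; shown as a snippet so the Logs are not added twice):\n",
"\n",
"```python\n",
"BATCH_SIZE = 100\n",
"for start in range(0, len(log_ids), BATCH_SIZE):\n",
"    humanloop.evaluations.add_logs_to_run(\n",
"        id=evaluation.id,\n",
"        run_id=run.id,\n",
"        log_ids=log_ids[start : start + BATCH_SIZE],\n",
"    )\n",
"```"
]
},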
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Review the Evaluation\n",
"\n",
"You have now created an Evaluation on Humanloop and added Logs to it. \n",
"\n",
"Go to the Humanloop UI to view the Evaluation.\n",
"\n",
"![Evaluation on Humanloop](../assets/images/external_logs_evaluations.png)\n",
"\n",
"\n",
"Within the Evaluation, go to **Logs** tab.\n",
"Here, you can view your uploaded logs as well as the Evaluator judgments.\n",
"\n",
"![Logs tab of Evaluation](../assets/images/external_logs_logs_a.png)\n",
"\n",
"The following steps will guide you through adding a different set of logs to a new Run\n",
"for comparison."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload new Logs\n",
"\n",
"Now that we have an Evaluation on Humanloop, we can add a separate set of logs to it\n",
"and compare the performance across this set of logs to the previous set.\n",
"\n",
"While we can achieve this by repeating the above steps, we can add logs to a Run\n",
"in a more direct and simpler way now that we have an existing Evaluation.\n",
"\n",
"We'll continue with the Evaluation created in the previous section,\n",
"and add a new Run with the data from `conversations-b.json`. These represent a set of logs\n",
"from a prototype version of the support agent."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the new data\n",
"with open(Path(os.getcwd()).parent / \"assets\" / \"conversations-b.json\") as f:\n",
" new_data = json.load(f)\n",
"\n",
"import pprint\n",
"pprint.pprint(new_data[:2], width=120)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use the previously-created Evaluation\n",
"evaluation_id = evaluation.id\n",
"evaluation_id"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create new Run in the same Evaluation\n",
"new_run = humanloop.evaluations.create_run(\n",
" id=evaluation.id,\n",
")\n",
"print(f\"Created new Run: {new_run.id}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Log to the Run\n",
"\n",
"Pass the `run_id` argument in the `log(...)` call to associate the Log with the Run.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add the new data to the Run\n",
"for messages in tqdm(new_data):\n",
" log = humanloop.flows.log(\n",
" path=\"External logs demo/Travel planner\",\n",
" flow={\"attributes\": {\"agent-version\": \"2.0.0\"}},\n",
" messages=messages,\n",
" # Pass `run_id` to associate the Log with the Run.\n",
" run_id=new_run.id,\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have now added a second Run to the Evaluation and populated it with Logs. You can now view the Run on the Humanloop UI."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compare the results\n",
"\n",
"\n",
"View the Evaluation on Humanloop. It will now contain two Runs.\n",
"\n",
"![Evaluation with two Runs on Humanloop](../assets/images/external_logs_runs_b.png)\n",
"\n",
"\n",
"In the **Stats** tab of the Evaluation, you can now compare the performance of the two sets of logs.\n",
"\n",
"In our case, our second set of logs (on the right) can be seen to be less helpful.\n",
"\n",
"![Stats tab showing box plots for the two Runs](../assets/images/external_logs_stats_b.png)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"The above examples demonstrates how you can quickly populate an Evaluation Run with your logs. This allows you to utilise the Evaluation and Evaluator features to perform workflows such as using [Code Evaluators](https://humanloop.com/docs/v5/guides/evals/code-based-evaluator) to calculate metrics, or using [Human Evaluators](https://humanloop.com/docs/v5/guides/evals/human-evaluators) to set up your Logs to be reviewed by your subject-matter experts.\n",
"\n",
"Refer to our [documentation](https://humanloop.com/docs/v5/guides/evals) for more information on how to set up custom Evaluators and extend the Evaluation for your use-case.\n",
"\n",
"Now that you've set up an Evaluation, explore the other [File](../../explanation/files) types on Humanloop to see how they can better reflect\n",
"your production systems, and how you can use Humanloop to version-control them.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}