Merged
1 change: 1 addition & 0 deletions .github/workflows/lint-and-test.yaml
@@ -9,6 +9,7 @@ on:
branches:
- main
- dev
- '!f/pypi_release'

jobs:
test-integration:
5 changes: 3 additions & 2 deletions .github/workflows/publish_testpypi.yaml
@@ -57,6 +57,7 @@ jobs:
with:
sparse-checkout: |
tests/*
examples/data_factory.ipynb
sparse-checkout-cone-mode: false
- name: Update system packages
run: |
@@ -88,13 +89,13 @@ jobs:
--ExecutePreprocessor.timeout=120 \
--no-prompt --no-input \
--stdout \
- tests/data_factory/factory/test_resume_index_1.ipynb; then
+ examples/data_factory.ipynb; then
echo "::error::Notebook execution failed"

fi
echo "Notebook executed successfully. Summary:" && \
jupyter nbconvert --to markdown --stdout \
- tests/data_factory/factory/test_resume_index_1.ipynb | \
+ examples/data_factory.ipynb | \
grep -E '^#|^##' || true

# Add tag deletion step
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,4 +1,6 @@
# Adhoc stuff
input.json
output.json
.serena/
docs/
/vibe_coding/response.md
2 changes: 1 addition & 1 deletion internal
28 changes: 28 additions & 0 deletions prebuilt_template/README.md
@@ -14,13 +14,41 @@ Data generation templates are **prebuilt** that encapsulate sophisticated data g
4. **Generate Data**: Run the template to produce high-quality synthetic data
5. **Export & Use**: Data comes ready for training, testing, or evaluation

## Use the data-template CLI like this:
```
# List all templates
data-template list-templates

# List with details
data-template list-templates --detail

# Get template details
data-template get-template my_template

# Print schema
data-template print-schema my_template

# Print example
data-template print-example my_template

# Run template with interactive input
data-template run-template my_template

# Run template with input file
data-template run-template my_template --input-file input.json

# Run template and save output
data-template run-template my_template --input-file input.json --output-file output.json
```
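
The `--input-file` option above expects a JSON file. As a minimal sketch, the script below writes an `input.json` that mirrors the example printed by `loaded.print_example()` for `starfish/generate_func_call_dataset`. The helper names `missing_required_fields` and `write_input_file` are hypothetical and not part of the CLI; the required-field check is hand-rolled from the schema printed by `loaded.print_schema()`, and it is an assumption that the CLI reads the file in exactly this shape.

```python
import json
from pathlib import Path

# Values copied from the print_example() output for
# starfish/generate_func_call_dataset; omitted fields fall back to schema defaults.
input_data = {
    "num_records": 4,
    "api_contract": {
        "name": "weather_api.get_current_weather",
        "description": "Retrieves the current weather conditions for a specified location.",
        "parameters": {
            "location": {
                "type": "string",
                "description": "The name of the city or geographic location.",
                "required": True,
            },
            "units": {
                "type": "string",
                "description": "The units for temperature measurement (e.g., 'Celsius', 'Fahrenheit').",
                "required": False,
            },
        },
    },
    "generation_model_name": "openai/gpt-4o-mini",
    "data_factory_config": {"max_concurrency": 24, "task_runner_timeout": 120},
}


def missing_required_fields(data: dict) -> list[str]:
    """Hypothetical helper: list required fields (per the printed schema) that are absent."""
    contract = data.get("api_contract")
    if contract is None:
        return ["api_contract"]
    missing = []
    # APIContract requires name, description, and parameters.
    for key in ("name", "description", "parameters"):
        if key not in contract:
            missing.append(f"api_contract.{key}")
    # Each ParameterDefinition requires type and description.
    for param_name, param in contract.get("parameters", {}).items():
        for key in ("type", "description"):
            if key not in param:
                missing.append(f"api_contract.parameters.{param_name}.{key}")
    return missing


def write_input_file(data: dict, path: str = "input.json") -> Path:
    """Hypothetical helper: validate required fields, then write the JSON file."""
    missing = missing_required_fields(data)
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    target = Path(path)
    target.write_text(json.dumps(data, indent=2))
    return target


if __name__ == "__main__":
    print(f"wrote {write_input_file(input_data)}")
```

The resulting file can then be passed to `data-template run-template` with `--input-file input.json`, using whichever template name `data-template list-templates` reports.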
## Source Code Location

The actual implementation of these templates can be found in:
```
src/starfish/data_gen_template/templates/
```



## Community & Contributions 🤝

Like what you see? We'd love your help in expanding our template collection! Here's how you can get involved:
206 changes: 203 additions & 3 deletions prebuilt_template/function_calling/sample_run.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
- "execution_count": 2,
+ "execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -38,6 +38,206 @@
"loaded = data_gen_template.get(\"starfish/generate_func_call_dataset\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"get the template input_data schema and example"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32m2025-05-23 11:08:41\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[1mPlease run the template with this input schema\u001b[0m\n",
"\u001b[32m2025-05-23 11:08:41\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[1m{\n",
" \"$defs\": {\n",
" \"APIContract\": {\n",
" \"description\": \"Pydantic model representing an API contract structure.\",\n",
" \"properties\": {\n",
" \"name\": {\n",
" \"title\": \"Name\",\n",
" \"type\": \"string\"\n",
" },\n",
" \"description\": {\n",
" \"title\": \"Description\",\n",
" \"type\": \"string\"\n",
" },\n",
" \"parameters\": {\n",
" \"additionalProperties\": {\n",
" \"$ref\": \"#/$defs/ParameterDefinition\"\n",
" },\n",
" \"title\": \"Parameters\",\n",
" \"type\": \"object\"\n",
" }\n",
" },\n",
" \"required\": [\n",
" \"name\",\n",
" \"description\",\n",
" \"parameters\"\n",
" ],\n",
" \"title\": \"APIContract\",\n",
" \"type\": \"object\"\n",
" },\n",
" \"ParameterDefinition\": {\n",
" \"description\": \"Pydantic model representing parameter definition in an API contract.\",\n",
" \"properties\": {\n",
" \"type\": {\n",
" \"title\": \"Type\",\n",
" \"type\": \"string\"\n",
" },\n",
" \"description\": {\n",
" \"title\": \"Description\",\n",
" \"type\": \"string\"\n",
" },\n",
" \"required\": {\n",
" \"default\": true,\n",
" \"title\": \"Required\",\n",
" \"type\": \"boolean\"\n",
" }\n",
" },\n",
" \"required\": [\n",
" \"type\",\n",
" \"description\"\n",
" ],\n",
" \"title\": \"ParameterDefinition\",\n",
" \"type\": \"object\"\n",
" }\n",
" },\n",
" \"description\": \"Input schema for the generate_by_topic template.\\n\\nIMPORTANT: This Pydantic model is the single source of truth for default values.\\nThe validation and default values are controlled by this model, not the function signature.\",\n",
" \"properties\": {\n",
" \"num_records\": {\n",
" \"anyOf\": [\n",
" {\n",
" \"type\": \"integer\"\n",
" },\n",
" {\n",
" \"type\": \"null\"\n",
" }\n",
" ],\n",
" \"default\": 10,\n",
" \"title\": \"Num Records\"\n",
" },\n",
" \"api_contract\": {\n",
" \"$ref\": \"#/$defs/APIContract\"\n",
" },\n",
" \"topic_model_name\": {\n",
" \"default\": \"openai/gpt-4o-mini\",\n",
" \"title\": \"Topic Model Name\",\n",
" \"type\": \"string\"\n",
" },\n",
" \"topic_model_kwargs\": {\n",
" \"anyOf\": [\n",
" {\n",
" \"additionalProperties\": true,\n",
" \"type\": \"object\"\n",
" },\n",
" {\n",
" \"type\": \"null\"\n",
" }\n",
" ],\n",
" \"default\": null,\n",
" \"title\": \"Topic Model Kwargs\"\n",
" },\n",
" \"generation_model_name\": {\n",
" \"default\": \"openai/gpt-4o-mini\",\n",
" \"title\": \"Generation Model Name\",\n",
" \"type\": \"string\"\n",
" },\n",
" \"generation_model_kwargs\": {\n",
" \"anyOf\": [\n",
" {\n",
" \"additionalProperties\": true,\n",
" \"type\": \"object\"\n",
" },\n",
" {\n",
" \"type\": \"null\"\n",
" }\n",
" ],\n",
" \"default\": null,\n",
" \"title\": \"Generation Model Kwargs\"\n",
" },\n",
" \"data_factory_config\": {\n",
" \"anyOf\": [\n",
" {\n",
" \"additionalProperties\": true,\n",
" \"type\": \"object\"\n",
" },\n",
" {\n",
" \"type\": \"null\"\n",
" }\n",
" ],\n",
" \"default\": {},\n",
" \"title\": \"Data Factory Config\"\n",
" }\n",
" },\n",
" \"required\": [\n",
" \"api_contract\"\n",
" ],\n",
" \"title\": \"GenerateFuncCallDataSet\",\n",
" \"type\": \"object\"\n",
"}\u001b[0m\n"
]
}
],
"source": [
"loaded.print_schema()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32m2025-05-23 11:09:02\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[1mHere is an example with api_contract.name as weather_api.get_current_weather\u001b[0m\n",
"\u001b[32m2025-05-23 11:09:02\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[1m{\n",
" \"num_records\": 4,\n",
" \"api_contract\": {\n",
" \"name\": \"weather_api.get_current_weather\",\n",
" \"description\": \"Retrieves the current weather conditions for a specified location .\",\n",
" \"parameters\": {\n",
" \"location\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The name of the city or geographic location .\",\n",
" \"required\": true\n",
" },\n",
" \"units\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The units for temperature measurement( e.g., 'Celsius', 'Fahrenheit') .\",\n",
" \"required\": false\n",
" }\n",
" }\n",
" },\n",
" \"topic_model_name\": \"openai/gpt-4\",\n",
" \"topic_model_kwargs\": {\n",
" \"temperature\": 0.7\n",
" },\n",
" \"generation_model_name\": \"openai/gpt-4o-mini\",\n",
" \"generation_model_kwargs\": {\n",
" \"temperature\": 0.8,\n",
" \"max_tokens\": 200\n",
" },\n",
" \"data_factory_config\": {\n",
" \"max_concurrency\": 24,\n",
" \"task_runner_timeout\": 120\n",
" }\n",
"}\u001b[0m\n"
]
}
],
"source": [
"loaded.print_example()"
]
},
{
"cell_type": "code",
"execution_count": 5,
@@ -203,7 +403,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "starfish-core-T7IInzTH-py3.11",
+ "display_name": ".venv",
"language": "python",
"name": "python3"
},
@@ -217,7 +417,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.11.7"
+ "version": "3.11.4"
}
},
"nbformat": 4,