This document provides reference-level documentation for all public SQLite-AI functions, virtual tables, and metadata properties exposed to SQL. These functions enable loading and interacting with LLMs, configuring samplers, generating embeddings and text, and managing chat sessions.
### ai_version

Returns: TEXT

Description: Returns the current version of the SQLite-AI extension.

Example:

```sql
SELECT ai_version();
-- e.g., '0.5.1'
```
### ai_log_info

Returns: NULL

Description: Enables or disables extended logging information. Use 1 to enable, 0 to disable.

Example:

```sql
SELECT ai_log_info(1);
```
### llm_model_load

Returns: NULL

Description: Loads a GGUF model from the specified file path, with optional comma-separated key=value configuration.
If no options are provided, the default gpu_layers=99 is used.

The following keys are available:

- gpu_layers=N (N is the number of layers to store in VRAM)
- main_gpu=K (K is the GPU used for the entire model when split_mode is 0)
- split_mode=N (how to split the model across multiple GPUs: 0 means none, 1 means layer, 2 means rows)
- vocab_only=1/0 (only load the vocabulary, no weights)
- use_mmap=1/0 (use mmap if possible)
- use_mlock=1/0 (force the system to keep the model in RAM)
- check_tensors=1/0 (validate model tensor data)
- log_info=1/0 (enable/disable info logging)

Example:

```sql
SELECT llm_model_load('./models/llama.gguf', 'gpu_layers=99');
```
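Several of these keys can be combined in one option string. As a hedged sketch (the model path is the same placeholder used above), a CPU-only, memory-mapped load might look like this:

```sql
-- gpu_layers=0 keeps all layers on the CPU; use_mmap=1 memory-maps the
-- file instead of reading it fully into RAM.
SELECT llm_model_load('./models/llama.gguf', 'gpu_layers=0,use_mmap=1,use_mlock=0');

-- Release the model and its memory when done.
SELECT llm_model_free();
```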
### llm_model_free

Returns: NULL

Description: Unloads the current model and frees the associated memory.

Example:

```sql
SELECT llm_model_free();
```
### llm_context_create

Parameters: context_settings: comma-separated key=value pairs (see the tables below).

Returns: NULL

Description: Creates a new inference context with comma-separated key=value configuration.
A context must be explicitly created before performing any AI operation!

The following keys are available in context_settings:

| Key | Type | Meaning |
|---|---|---|
| generate_embedding | 1 or 0 | Force the model to generate embeddings. |
| normalize_embedding | 1 or 0 | Force normalization during embedding generation (defaults to 1). |
| json_output | 1 or 0 | Force JSON output in embedding generation (defaults to 0). |
| max_tokens | number | Set a maximum number of tokens in input. If the input is too large, an error is returned. |
| n_predict | number | Control the maximum number of tokens generated during text generation. |
| embedding_type | FLOAT32, FLOAT16, BFLOAT16, UINT8, INT8 | Set the model native type; mandatory during embedding generation. |

| Key | Type | Meaning |
|---|---|---|
| context_size | number | Equivalent to n_ctx = N and n_batch = N. |
| n_ctx | number | Text context length (tokens). 0 = from model. |
| n_batch | number | Logical maximum batch size submitted to llama_decode. |
| n_ubatch | number | Physical maximum micro-batch size. |
| n_seq_max | number | Maximum concurrent sequences (parallel states for recurrent models). |
| n_threads | number | Threads for generation. |
| n_threads_batch | number | Threads for batch processing. |

| Key | Type | Meaning |
|---|---|---|
| pooling_type | none, unspecified, mean, cls, last or rank | How to aggregate token embeddings (e.g., mean). |
| attention_type | unspecified, causal, non_causal | Attention algorithm for embeddings. |
| flash_attn_type | auto, disabled, enabled | Controls when/if Flash-Attention is used. |

| Key | Type | Meaning |
|---|---|---|
| rope_scaling_type | unspecified, none, linear, yarn, longrope | RoPE scaling strategy. |
| rope_freq_base | float number | RoPE base frequency. 0 = from model. |
| rope_freq_scale | float number | RoPE frequency scaling factor. 0 = from model. |
| yarn_ext_factor | float number | YaRN extrapolation mix factor. <0 = from model. |
| yarn_attn_factor | float number | YaRN magnitude scaling. |
| yarn_beta_fast | float number | YaRN low correction dimension. |
| yarn_beta_slow | float number | YaRN high correction dimension. |
| yarn_orig_ctx | number | YaRN original context size. |

| Key | Type | Meaning |
|---|---|---|
| type_k | ggml_type | Data type for the K cache. |
| type_v | ggml_type | Data type for the V cache. |

The following boolean keys may appear anywhere in the option string; order does not matter.

| Key | Type | Meaning |
|---|---|---|
| embeddings | 1 or 0 | If 1, extract embeddings (with logits). Used by the embedding preset. |
| offload_kqv | 1 or 0 | Offload KQV ops (including the KV cache) to the GPU. |
| no_perf | 1 or 0 | Disable performance timing. |
| op_offload | 1 or 0 | Offload host tensor ops to the device. |
| swa_full | 1 or 0 | Use a full-size SWA cache. When 0 and n_seq_max > 1, performance may degrade. |
| kv_unified | 1 or 0 | Use a unified buffer across input sequences during attention. Try disabling when n_seq_max > 1 and sequences do not share a long prefix. |
| defrag_thold | float number | Deprecated. Defragment the KV cache if holes/size > thold. <= 0 disables. |

Example:

```sql
SELECT llm_context_create('n_ctx=2048,n_threads=6,n_batch=256');
```
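Keys from the different tables can be freely mixed in one call. As a hedged sketch (the specific values are illustrative, not required), a context tuned for embedding generation might combine:

```sql
-- generate_embedding switches the context into embedding mode,
-- pooling_type=mean averages token embeddings into a single vector,
-- embedding_type must match the model's native type.
SELECT llm_context_create('generate_embedding=1,pooling_type=mean,embedding_type=FLOAT32,n_ctx=2048');
```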
### llm_context_create_embedding

Parameters: context_settings (optional): comma-separated key=value pairs to override or extend the default settings (see context settings in llm_context_create).

Returns: NULL

Description: Creates a new inference context specifically set up for embedding generation.
It is equivalent to SELECT llm_context_create('generate_embedding=1,normalize_embedding=1,pooling_type=mean');
A context must be explicitly created before performing any AI operation!

Example:

```sql
SELECT llm_context_create_embedding();
```
### llm_context_create_chat

Parameters: context_settings (optional): comma-separated key=value pairs to override or extend the default settings (see context settings in llm_context_create).

Returns: NULL

Description: Creates a new inference context specifically set up for chat conversation.
It is equivalent to SELECT llm_context_create('context_size=4096');
A context must be explicitly created before performing any AI operation!

Example:

```sql
SELECT llm_context_create_chat();
```
### llm_context_create_textgen

Parameters: context_settings (optional): comma-separated key=value pairs to override or extend the default settings (see context settings in llm_context_create).

Returns: NULL

Description: Creates a new inference context specifically set up for text generation.
It is equivalent to SELECT llm_context_create('context_size=4096');
A context must be explicitly created before performing any AI operation!

Example:

```sql
SELECT llm_context_create_textgen();
```
### llm_context_free

Returns: NULL

Description: Frees the current inference context.

Example:

```sql
SELECT llm_context_free();
```
### llm_context_size

Returns: INTEGER

Description: Returns the total token capacity (context window) of the current llama context. Use this after llm_context_create to confirm the configured context_size. Raises an error if no context is active.

Example:

```sql
SELECT llm_context_size();
-- 4096
```
### llm_context_used

Returns: INTEGER

Description: Returns how many tokens of the current llama context have already been consumed. Combine this with llm_context_size() to monitor usage. Raises an error if no context is active.

Example:

```sql
SELECT llm_context_used();
-- 1024
```
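Because both functions are plain scalar calls, the remaining capacity can be computed directly in SQL; a small sketch:

```sql
-- Tokens still available in the active context
-- (raises an error if no context is active).
SELECT llm_context_size() - llm_context_used() AS tokens_remaining;
```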
### llm_sampler_create

Returns: NULL

Description: Initializes a new sampling strategy for text generation. A sampler is the mechanism that determines how the model selects the next token (word or subword) during text generation. If no sampler is explicitly created, one will be created automatically when needed.

Example:

```sql
SELECT llm_sampler_create();
```
### llm_sampler_free

Returns: NULL

Description: Frees resources associated with the current sampler.

Example:

```sql
SELECT llm_sampler_free();
```
### llm_lora_load

Returns: NULL

Description: Loads a LoRA adapter from the given file path with a mandatory scale value. LoRA (Low-Rank Adaptation) is a technique to inject trainable, low-rank layers into a pre-trained model.

Example:

```sql
SELECT llm_lora_load('./adapters/adapter.lora', 1.0);
```
### llm_lora_free

Returns: NULL

Description: Unloads any currently loaded LoRA adapter.

Example:

```sql
SELECT llm_lora_free();
```
### llm_sampler_init_greedy

Returns: NULL

Description: Configures the sampler to use greedy decoding (always picks the most probable token).

Example:

```sql
SELECT llm_sampler_init_greedy();
```
### llm_sampler_init_dist

Returns: NULL

Description: Initializes a random distribution-based sampler with the given seed. If a seed value is not specified, the default 0xFFFFFFFF is used.

Example:

```sql
SELECT llm_sampler_init_dist(42);
```
### llm_sampler_init_top_k

Returns: NULL

Description: Limits sampling to the top k most likely tokens. Top-K sampling is described in the academic paper "The Curious Case of Neural Text Degeneration" (https://arxiv.org/abs/1904.09751).

Example:

```sql
SELECT llm_sampler_init_top_k(40);
```
### llm_sampler_init_top_p

Returns: NULL

Description: Top-p sampling retains tokens with cumulative probability >= p, always keeping at least min_keep tokens. Nucleus sampling is described in the academic paper "The Curious Case of Neural Text Degeneration" (https://arxiv.org/abs/1904.09751).

Example:

```sql
SELECT llm_sampler_init_top_p(0.9, 1);
```
### llm_sampler_init_min_p

Returns: NULL

Description: Like top-p, but with a minimum token probability threshold p. Min-p sampling as described in ggml-org/llama.cpp#3841.

Example:

```sql
SELECT llm_sampler_init_min_p(0.05, 1);
```
### llm_sampler_init_typical

Returns: NULL

Description: Typical sampling prefers tokens near the expected entropy level. Locally typical sampling is described in the paper https://arxiv.org/abs/2202.00666.

Example:

```sql
SELECT llm_sampler_init_typical(0.95, 1);
```
### llm_sampler_init_temp

Returns: NULL

Description: Adjusts the sampling temperature to control randomness.

Example:

```sql
SELECT llm_sampler_init_temp(0.8);
```
### llm_sampler_init_temp_ext

Returns: NULL

Description: Advanced temperature control using exponential scaling. Dynamic temperature (a.k.a. entropy-based) sampling is described in the paper https://arxiv.org/abs/2309.02772.

Example:

```sql
SELECT llm_sampler_init_temp_ext(0.8, 0.1, 2.0);
```
### llm_sampler_init_xtc

Returns: NULL

Description: Combines top-p, temperature, and seed-based sampling with a minimum token count. XTC sampler as described in oobabooga/text-generation-webui#6335.

Example:

```sql
SELECT llm_sampler_init_xtc(0.9, 0.8, 1, 42);
```
### llm_sampler_init_top_n_sigma

Returns: NULL

Description: Limits sampling to tokens within n standard deviations of the maximum logit. Top-nσ sampling is described in the academic paper "Top-nσ: Not All Logits Are You Need" (https://arxiv.org/pdf/2411.07641).

Example:

```sql
SELECT llm_sampler_init_top_n_sigma(1.5);
```
### llm_sampler_init_mirostat

Returns: NULL

Description: Initializes Mirostat sampling with entropy control. The Mirostat 1.0 algorithm is described in the paper https://arxiv.org/abs/2007.14966; this implementation uses tokens instead of words.

Example:

```sql
SELECT llm_sampler_init_mirostat(42, 5.0, 0.1, 100);
```
### llm_sampler_init_mirostat_v2

Returns: NULL

Description: Mirostat v2 entropy-based sampling. The Mirostat 2.0 algorithm is described in the paper https://arxiv.org/abs/2007.14966; this implementation uses tokens instead of words.

Example:

```sql
SELECT llm_sampler_init_mirostat_v2(42, 5.0, 0.1);
```
### llm_sampler_init_grammar

Returns: NULL

Description: Constrains output to match a specified grammar. The grammar syntax is described at https://github.com/ggml-org/llama.cpp/tree/master/grammars.

Example:

```sql
SELECT llm_sampler_init_grammar('...BNF...', 'root');
```
### llm_sampler_init_infill

Returns: NULL

Description: Enables infill (prefix-suffix) mode for completions.

Example:

```sql
SELECT llm_sampler_init_infill();
```
### llm_sampler_init_penalties

Returns: NULL

Description: Applies repetition, frequency, and presence penalties.

Example:

```sql
SELECT llm_sampler_init_penalties(64, 1.2, 0.5, 0.8);
```
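A hedged sketch of a typical setup: assuming the init functions above configure the current sampler cumulatively (a chain, as llama.cpp samplers do), a common top-k → top-p → temperature pipeline could be assembled as follows. If each init instead replaces the previous setting, only the last call would apply.

```sql
SELECT llm_sampler_create();
SELECT llm_sampler_init_top_k(40);      -- keep the 40 most likely tokens
SELECT llm_sampler_init_top_p(0.9, 1);  -- then trim to 90% cumulative probability
SELECT llm_sampler_init_temp(0.8);      -- soften the distribution
SELECT llm_sampler_init_dist(42);       -- finally sample with a fixed seed
```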
### llm_token_count

Returns: INTEGER

Description: Returns how many tokens the current model would consume for the supplied text, using the active context's vocabulary. Requires a context created via llm_context_create.

Example:

```sql
SELECT llm_token_count('Hello world!');
-- 5
```
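Combined with the context inspection functions, this allows a quick pre-flight check that a prompt still fits; a sketch:

```sql
-- 1 when the prompt fits in what is left of the context window, 0 otherwise.
SELECT llm_token_count('Once upon a time')
       <= llm_context_size() - llm_context_used() AS fits;
```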
### llm_embed_generate

Returns: BLOB or TEXT

Description: Generates a text embedding as a BLOB vector, with optional configuration provided as a comma-separated list of key=value pairs.
By default, the embedding is normalized unless normalize_embedding=0 is specified.
If json_output=1 is set, the function returns a JSON object instead of a BLOB.

Example:

```sql
SELECT llm_embed_generate('hello world', 'json_output=1');
```
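Since the result is an ordinary BLOB, embeddings can be materialized straight into a table. A sketch, assuming a hypothetical documents(content, embedding) table that is not part of SQLite-AI:

```sql
-- Illustrative schema; any table with a text column works the same way.
UPDATE documents SET embedding = llm_embed_generate(content);
```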
### llm_text_generate

Returns: TEXT

Description: Generates a full-text completion based on the input, with optional configuration provided as a comma-separated list of key=value pairs.

Example:

```sql
SELECT llm_text_generate('Once upon a time', 'n_predict=1024');
```
### llm_chat

Returns: VIRTUAL TABLE

Description: Streams a chat-style reply one token per row.

Example:

```sql
SELECT reply FROM llm_chat('Tell me a joke.');
```
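Because the virtual table yields one token per row, standard SQLite aggregates can reassemble the full reply; a sketch:

```sql
-- group_concat stitches the streamed tokens back into a single string.
SELECT group_concat(reply, '') AS full_reply
FROM llm_chat('Tell me a joke.');
```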
### llm_chat_create

Returns: TEXT

Description: Starts a new in-memory chat session and returns a unique UUIDv7 value. If no chat is explicitly created, one will be created automatically when needed.

Example:

```sql
SELECT llm_chat_create();
```
### llm_chat_free

Returns: NULL

Description: Ends the current chat session.

Example:

```sql
SELECT llm_chat_free();
```
### llm_chat_save

Returns: TEXT

Description: Saves the current chat session, with optional title and metadata, into the ai_chat_history and ai_chat_messages tables and returns a UUID.

Example:

```sql
SELECT llm_chat_save('Support Chat', '{"user": "Marco"}');
```
### llm_chat_restore

Returns: TEXT

Description: Restores a previously saved chat session by UUID.

Example:

```sql
SELECT llm_chat_restore('b59e...');
```
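The save/restore pair persists a conversation across connections. A sketch of the round trip (the UUID passed to llm_chat_restore is whatever value llm_chat_save returned earlier; 'b59e...' is a truncated placeholder):

```sql
-- Persist the current session and capture its UUID...
SELECT llm_chat_save('Support Chat', '{"user": "Marco"}');
-- ...then, later or from another connection, bring it back by that UUID.
SELECT llm_chat_restore('b59e...');
```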
### llm_chat_respond

Returns: TEXT

Description: Generates a context-aware reply using chat memory, returned as a single, complete response. For a streamed reply, use the llm_chat virtual table.

Example:

```sql
SELECT llm_chat_respond('What are the most visited cities in Italy?');
```
### Model metadata functions

These functions return internal model properties:

```sql
SELECT
  llm_model_n_params(),
  llm_model_size(),
  llm_model_n_ctx_train(),
  llm_model_n_embd(),
  llm_model_n_layer(),
  llm_model_n_head(),
  llm_model_n_head_kv(),
  llm_model_n_swa(),
  llm_model_rope_freq_scale_train(),
  llm_model_n_cls_out(),
  llm_model_cls_label(),
  llm_model_desc(),
  llm_model_has_encoder(),
  llm_model_has_decoder(),
  llm_model_is_recurrent(),
  llm_model_chat_template();
```

All return INTEGER, REAL, or TEXT values depending on the property.
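One practical use of these properties is sizing a context from the model itself rather than hard-coding values; a hedged sketch:

```sql
-- Inspect the model's trained context length and embedding width
-- before choosing n_ctx for llm_context_create.
SELECT llm_model_n_ctx_train() AS trained_ctx,
       llm_model_n_embd()      AS embedding_dim;
```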