This document provides reference-level documentation for all public SQLite-AI functions, virtual tables, and metadata properties exposed to SQL. These functions enable loading and interacting with LLMs, configuring samplers, generating embeddings and text, and managing chat sessions.
### ai_version

Returns: TEXT

Description: Returns the current version of the SQLite-AI extension.

Example:

```sql
SELECT ai_version();
-- e.g., '0.5.1'
```
### ai_log_info

Returns: NULL

Description: Enables or disables extended logging information. Use 1 to enable, 0 to disable.

Example:

```sql
SELECT ai_log_info(1);
```
### llm_model_load

Returns: NULL

Description: Loads a GGUF model from the specified file path, with optional comma-separated key=value configuration.
If no options are provided, the default gpu_layers=99 is used.

The following keys are available:

- gpu_layers=N (N is the number of layers to store in VRAM)
- main_gpu=K (K is the GPU used for the entire model when split_mode is 0)
- split_mode=N (how to split the model across multiple GPUs: 0 means none, 1 means layer, 2 means rows)
- vocab_only=1/0 (only load the vocabulary, no weights)
- use_mmap=1/0 (use mmap if possible)
- use_mlock=1/0 (force the system to keep the model in RAM)
- check_tensors=1/0 (validate model tensor data)
- log_info=1/0 (enable/disable info logging)

Example:

```sql
SELECT llm_model_load('./models/llama.gguf', 'gpu_layers=99');
```
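Several of these keys can be combined in one option string. As a hedged sketch (the model path is the same placeholder used above), a CPU-only, memory-mapped load might look like this:

```sql
-- gpu_layers=0 keeps all layers on the CPU; use_mmap=1 memory-maps the
-- file instead of reading it fully into RAM.
SELECT llm_model_load('./models/llama.gguf', 'gpu_layers=0,use_mmap=1,use_mlock=0');

-- Release the model and its memory when done.
SELECT llm_model_free();
```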
### llm_model_free

Returns: NULL

Description: Unloads the current model and frees the associated memory.

Example:

```sql
SELECT llm_model_free();
```
### llm_context_create

Parameters: context_settings: comma-separated key=value pairs (see the tables below).

Returns: NULL

Description: Creates a new inference context with comma-separated key=value configuration.
A context must be explicitly created before performing any AI operation!

The following keys are available in context_settings:

| Key | Type | Meaning |
|---|---|---|
| generate_embedding | 1 or 0 | Force the model to generate embeddings. |
| normalize_embedding | 1 or 0 | Force normalization during embedding generation (defaults to 1). |
| json_output | 1 or 0 | Force JSON output in embedding generation (defaults to 0). |
| max_tokens | number | Set a maximum number of tokens in input. If the input is too large, an error is returned. |
| n_predict | number | Control the maximum number of tokens generated during text generation. |
| embedding_type | FLOAT32, FLOAT16, BFLOAT16, UINT8, INT8 | Set the model native type; mandatory during embedding generation. |

| Key | Type | Meaning |
|---|---|---|
| context_size | number | Equivalent to n_ctx = N and n_batch = N. |
| n_ctx | number | Text context length (tokens). 0 = from model. |
| n_batch | number | Logical maximum batch size submitted to llama_decode. |
| n_ubatch | number | Physical maximum micro-batch size. |
| n_seq_max | number | Maximum concurrent sequences (parallel states for recurrent models). |
| n_threads | number | Threads for generation. |
| n_threads_batch | number | Threads for batch processing. |

| Key | Type | Meaning |
|---|---|---|
| pooling_type | none, unspecified, mean, cls, last or rank | How to aggregate token embeddings (e.g., mean). |
| attention_type | unspecified, causal, non_causal | Attention algorithm for embeddings. |
| flash_attn_type | auto, disabled, enabled | Controls when/if Flash-Attention is used. |

| Key | Type | Meaning |
|---|---|---|
| rope_scaling_type | unspecified, none, linear, yarn, longrope | RoPE scaling strategy. |
| rope_freq_base | float number | RoPE base frequency. 0 = from model. |
| rope_freq_scale | float number | RoPE frequency scaling factor. 0 = from model. |
| yarn_ext_factor | float number | YaRN extrapolation mix factor. <0 = from model. |
| yarn_attn_factor | float number | YaRN magnitude scaling. |
| yarn_beta_fast | float number | YaRN low correction dimension. |
| yarn_beta_slow | float number | YaRN high correction dimension. |
| yarn_orig_ctx | number | YaRN original context size. |

| Key | Type | Meaning |
|---|---|---|
| type_k | ggml_type | Data type for the K cache. |
| type_v | ggml_type | Data type for the V cache. |

The following boolean keys may appear anywhere in the option string; order does not matter.

| Key | Type | Meaning |
|---|---|---|
| embeddings | 1 or 0 | If 1, extract embeddings (with logits). Used by the embedding preset. |
| offload_kqv | 1 or 0 | Offload KQV ops (including the KV cache) to the GPU. |
| no_perf | 1 or 0 | Disable performance timing. |
| op_offload | 1 or 0 | Offload host tensor ops to the device. |
| swa_full | 1 or 0 | Use a full-size SWA cache. When 0 and n_seq_max > 1, performance may degrade. |
| kv_unified | 1 or 0 | Use a unified buffer across input sequences during attention. Try disabling when n_seq_max > 1 and sequences do not share a long prefix. |
| defrag_thold | float number | Deprecated. Defragment the KV cache if holes/size > thold. <= 0 disables. |

Example:

```sql
SELECT llm_context_create('n_ctx=2048,n_threads=6,n_batch=256');
```
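Keys from the different tables can be freely mixed in one call. As a hedged sketch (the specific values are illustrative, not required), a context tuned for embedding generation might combine:

```sql
-- generate_embedding switches the context into embedding mode,
-- pooling_type=mean averages token embeddings into a single vector,
-- embedding_type must match the model's native type.
SELECT llm_context_create('generate_embedding=1,pooling_type=mean,embedding_type=FLOAT32,n_ctx=2048');
```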
### llm_context_create_embedding

Parameters: context_settings (optional): comma-separated key=value pairs to override or extend the default settings (see context settings in llm_context_create).

Returns: NULL

Description: Creates a new inference context specifically set up for embedding generation.
It is equivalent to SELECT llm_context_create('generate_embedding=1,normalize_embedding=1,pooling_type=mean');
A context must be explicitly created before performing any AI operation!

Example:

```sql
SELECT llm_context_create_embedding();
```
### llm_context_create_chat

Parameters: context_settings (optional): comma-separated key=value pairs to override or extend the default settings (see context settings in llm_context_create).

Returns: NULL

Description: Creates a new inference context specifically set up for chat conversation.
It is equivalent to SELECT llm_context_create('context_size=4096');
A context must be explicitly created before performing any AI operation!

Example:

```sql
SELECT llm_context_create_chat();
```
### llm_context_create_textgen

Parameters: context_settings (optional): comma-separated key=value pairs to override or extend the default settings (see context settings in llm_context_create).

Returns: NULL

Description: Creates a new inference context specifically set up for text generation.
It is equivalent to SELECT llm_context_create('context_size=4096');
A context must be explicitly created before performing any AI operation!

Example:

```sql
SELECT llm_context_create_textgen();
```
### llm_context_free

Returns: NULL

Description: Frees the current inference context.

Example:

```sql
SELECT llm_context_free();
```
### llm_context_size

Returns: INTEGER

Description: Returns the total token capacity (context window) of the current llama context. Use this after llm_context_create to confirm the configured context_size. Raises an error if no context is active.

Example:

```sql
SELECT llm_context_size();
-- 4096
```
### llm_context_used

Returns: INTEGER

Description: Returns how many tokens of the current llama context have already been consumed. Combine this with llm_context_size() to monitor usage. Raises an error if no context is active.

Example:

```sql
SELECT llm_context_used();
-- 1024
```
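Because both functions are plain scalar calls, the remaining capacity can be computed directly in SQL; a small sketch:

```sql
-- Tokens still available in the active context
-- (raises an error if no context is active).
SELECT llm_context_size() - llm_context_used() AS tokens_remaining;
```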
### llm_sampler_create

Returns: NULL

Description: Initializes a new sampling strategy for text generation. A sampler is the mechanism that determines how the model selects the next token (word or subword) during text generation. If no sampler is explicitly created, one will be created automatically when needed.

Example:

```sql
SELECT llm_sampler_create();
```
### llm_sampler_free

Returns: NULL

Description: Frees resources associated with the current sampler.

Example:

```sql
SELECT llm_sampler_free();
```
### llm_lora_load

Returns: NULL

Description: Loads a LoRA adapter from the given file path with a mandatory scale value. LoRA (Low-Rank Adaptation) is a technique to inject trainable, low-rank layers into a pre-trained model.

Example:

```sql
SELECT llm_lora_load('./adapters/adapter.lora', 1.0);
```
### llm_lora_free

Returns: NULL

Description: Unloads any currently loaded LoRA adapter.

Example:

```sql
SELECT llm_lora_free();
```
### llm_sampler_init_greedy

Returns: NULL

Description: Configures the sampler to use greedy decoding (always picks the most probable token).

Example:

```sql
SELECT llm_sampler_init_greedy();
```
### llm_sampler_init_dist

Returns: NULL

Description: Initializes a random distribution-based sampler with the given seed. If a seed value is not specified, the default 0xFFFFFFFF is used.

Example:

```sql
SELECT llm_sampler_init_dist(42);
```
### llm_sampler_init_top_k

Returns: NULL

Description: Limits sampling to the top k most likely tokens. Top-K sampling is described in the academic paper "The Curious Case of Neural Text Degeneration" (https://arxiv.org/abs/1904.09751).

Example:

```sql
SELECT llm_sampler_init_top_k(40);
```
### llm_sampler_init_top_p

Returns: NULL

Description: Top-p sampling retains tokens with cumulative probability >= p, always keeping at least min_keep tokens. Nucleus sampling is described in the academic paper "The Curious Case of Neural Text Degeneration" (https://arxiv.org/abs/1904.09751).

Example:

```sql
SELECT llm_sampler_init_top_p(0.9, 1);
```
### llm_sampler_init_min_p

Returns: NULL

Description: Like top-p, but with a minimum token probability threshold p. Min-p sampling as described in ggml-org/llama.cpp#3841.

Example:

```sql
SELECT llm_sampler_init_min_p(0.05, 1);
```
### llm_sampler_init_typical

Returns: NULL

Description: Typical sampling prefers tokens near the expected entropy level. Locally typical sampling is described in the paper https://arxiv.org/abs/2202.00666.

Example:

```sql
SELECT llm_sampler_init_typical(0.95, 1);
```
### llm_sampler_init_temp

Returns: NULL

Description: Adjusts the sampling temperature to control randomness.

Example:

```sql
SELECT llm_sampler_init_temp(0.8);
```
### llm_sampler_init_temp_ext

Returns: NULL

Description: Advanced temperature control using exponential scaling. Dynamic temperature (a.k.a. entropy-based) sampling is described in the paper https://arxiv.org/abs/2309.02772.

Example:

```sql
SELECT llm_sampler_init_temp_ext(0.8, 0.1, 2.0);
```
### llm_sampler_init_xtc

Returns: NULL

Description: Combines top-p, temperature, and seed-based sampling with a minimum token count. XTC sampler as described in oobabooga/text-generation-webui#6335.

Example:

```sql
SELECT llm_sampler_init_xtc(0.9, 0.8, 1, 42);
```
### llm_sampler_init_top_n_sigma

Returns: NULL

Description: Limits sampling to tokens within n standard deviations of the maximum logit. Top-nσ sampling is described in the academic paper "Top-nσ: Not All Logits Are You Need" (https://arxiv.org/pdf/2411.07641).

Example:

```sql
SELECT llm_sampler_init_top_n_sigma(1.5);
```
### llm_sampler_init_mirostat

Returns: NULL

Description: Initializes Mirostat sampling with entropy control. The Mirostat 1.0 algorithm is described in the paper https://arxiv.org/abs/2007.14966; this implementation uses tokens instead of words.

Example:

```sql
SELECT llm_sampler_init_mirostat(42, 5.0, 0.1, 100);
```
### llm_sampler_init_mirostat_v2

Returns: NULL

Description: Mirostat v2 entropy-based sampling. The Mirostat 2.0 algorithm is described in the paper https://arxiv.org/abs/2007.14966; this implementation uses tokens instead of words.

Example:

```sql
SELECT llm_sampler_init_mirostat_v2(42, 5.0, 0.1);
```
### llm_sampler_init_grammar

Returns: NULL

Description: Constrains output to match a specified grammar. The grammar syntax is described at https://github.com/ggml-org/llama.cpp/tree/master/grammars.

Example:

```sql
SELECT llm_sampler_init_grammar('...BNF...', 'root');
```
### llm_sampler_init_infill

Returns: NULL

Description: Enables infill (prefix-suffix) mode for completions.

Example:

```sql
SELECT llm_sampler_init_infill();
```
### llm_sampler_init_penalties

Returns: NULL

Description: Applies repetition, frequency, and presence penalties.

Example:

```sql
SELECT llm_sampler_init_penalties(64, 1.2, 0.5, 0.8);
```
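A hedged sketch of a typical setup: assuming the init functions above configure the current sampler cumulatively (a chain, as llama.cpp samplers do), a common top-k → top-p → temperature pipeline could be assembled as follows. If each init instead replaces the previous setting, only the last call would apply.

```sql
SELECT llm_sampler_create();
SELECT llm_sampler_init_top_k(40);      -- keep the 40 most likely tokens
SELECT llm_sampler_init_top_p(0.9, 1);  -- then trim to 90% cumulative probability
SELECT llm_sampler_init_temp(0.8);      -- soften the distribution
SELECT llm_sampler_init_dist(42);       -- finally sample with a fixed seed
```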
### llm_token_count

Returns: INTEGER

Description: Returns how many tokens the current model would consume for the supplied text, using the active context's vocabulary. Requires a context created via llm_context_create.

Example:

```sql
SELECT llm_token_count('Hello world!');
-- 5
```
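Combined with the context inspection functions, this allows a quick pre-flight check that a prompt still fits; a sketch:

```sql
-- 1 when the prompt fits in what is left of the context window, 0 otherwise.
SELECT llm_token_count('Once upon a time')
       <= llm_context_size() - llm_context_used() AS fits;
```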
### llm_embed_generate

Returns: BLOB or TEXT

Description: Generates a text embedding as a BLOB vector, with optional configuration provided as a comma-separated list of key=value pairs.
By default, the embedding is normalized unless normalize_embedding=0 is specified.
If json_output=1 is set, the function returns a JSON object instead of a BLOB.

Example:

```sql
SELECT llm_embed_generate('hello world', 'json_output=1');
```
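Since the result is an ordinary BLOB, embeddings can be materialized straight into a table. A sketch, assuming a hypothetical documents(content, embedding) table that is not part of SQLite-AI:

```sql
-- Illustrative schema; any table with a text column works the same way.
UPDATE documents SET embedding = llm_embed_generate(content);
```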
### llm_text_generate

Returns: TEXT

Description: Generates a full-text completion based on the input, with optional configuration provided as a comma-separated list of key=value pairs.

Example:

```sql
SELECT llm_text_generate('Once upon a time', 'n_predict=1024');
```
### llm_chat

Returns: VIRTUAL TABLE

Description: Streams a chat-style reply one token per row.

Example:

```sql
SELECT reply FROM llm_chat('Tell me a joke.');
```
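Because the virtual table yields one token per row, standard SQLite aggregates can reassemble the full reply; a sketch:

```sql
-- group_concat stitches the streamed tokens back into a single string.
SELECT group_concat(reply, '') AS full_reply
FROM llm_chat('Tell me a joke.');
```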
### llm_chat_create

Returns: TEXT

Description: Starts a new in-memory chat session and returns a unique UUIDv7 value. If no chat is explicitly created, one will be created automatically when needed.

Example:

```sql
SELECT llm_chat_create();
```
### llm_chat_free

Returns: NULL

Description: Ends the current chat session.

Example:

```sql
SELECT llm_chat_free();
```
### llm_chat_save

Returns: TEXT

Description: Saves the current chat session, with optional title and metadata, into the ai_chat_history and ai_chat_messages tables and returns a UUID.

Example:

```sql
SELECT llm_chat_save('Support Chat', '{"user": "Marco"}');
```
### llm_chat_restore

Returns: TEXT

Description: Restores a previously saved chat session by UUID.

Example:

```sql
SELECT llm_chat_restore('b59e...');
```
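The save/restore pair persists a conversation across connections. A sketch of the round trip (the UUID passed to llm_chat_restore is whatever value llm_chat_save returned earlier; 'b59e...' is a truncated placeholder):

```sql
-- Persist the current session and capture its UUID...
SELECT llm_chat_save('Support Chat', '{"user": "Marco"}');
-- ...then, later or from another connection, bring it back by that UUID.
SELECT llm_chat_restore('b59e...');
```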
### llm_chat_respond

Returns: TEXT

Description: Generates a context-aware reply using chat memory, returned as a single, complete response. For a streamed reply, use the llm_chat virtual table.

Example:

```sql
SELECT llm_chat_respond('What are the most visited cities in Italy?');
```
### Model metadata functions

These functions return internal model properties:

```sql
SELECT
  llm_model_n_params(),
  llm_model_size(),
  llm_model_n_ctx_train(),
  llm_model_n_embd(),
  llm_model_n_layer(),
  llm_model_n_head(),
  llm_model_n_head_kv(),
  llm_model_n_swa(),
  llm_model_rope_freq_scale_train(),
  llm_model_n_cls_out(),
  llm_model_cls_label(),
  llm_model_desc(),
  llm_model_has_encoder(),
  llm_model_has_decoder(),
  llm_model_is_recurrent(),
  llm_model_chat_template();
```

All return INTEGER, REAL, or TEXT values depending on the property.
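One practical use of these properties is sizing a context from the model itself rather than hard-coding values; a hedged sketch:

```sql
-- Inspect the model's trained context length and embedding width
-- before choosing n_ctx for llm_context_create.
SELECT llm_model_n_ctx_train() AS trained_ctx,
       llm_model_n_embd()      AS embedding_dim;
```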