Eval bug: Runtime failure when using ChatBox with tools enabled and GPT-OSS-20B #15170

@mancubus77

Description

Name and Version

Environment (compiled from master)

root@77c821627b43:/app# ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
version: 6118 (6c7e9a54)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Issue

Runtime runs in a container (note the root@77c821627b43 shell prompt in the environment output above).

Operating systems

Linux

GGML backends

CUDA

Hardware

2x CUDA GPUs: NVIDIA GeForce RTX 4070 (compute capability 8.9) and NVIDIA GeForce RTX 3090 (compute capability 8.6)

Models

gpt-oss-20b-BF16.gguf

Problem description & steps to reproduce

Issue:

The server fails at runtime when attempting to use ChatBox with tools enabled. Everything builds fine from master, and the server starts up without issues. However, once I initiate a session using the ChatBox frontend with tools turned on, the process crashes with an uncaught std::runtime_error ("Unexpected content at end of input").

Expected behavior:

The server should operate normally with tools enabled in ChatBox.

Steps to reproduce:

1. Build llama-server from the latest master
2. Start the server
3. Connect with ChatBox
4. Enable tools (MCP server)
5. Attempt to start a chat
6. Runtime fails with the error below (a minimal curl sketch that may reproduce it follows the log)

srv  log_server_r: request: POST /v1/chat/completions 192.168.1.248 200
slot      release: id  0 | task 2 | stop processing: n_past = 419, truncated = 0
slot print_timing: id  0 | task 2 |
prompt eval time =     109.57 ms /   327 tokens (    0.34 ms per token,  2984.31 tokens per second)
       eval time =     291.01 ms /    34 tokens (    8.56 ms per token,   116.83 tokens per second)
      total time =     400.59 ms /   361 tokens
libggml-base.so(+0x16d4b)[0x7f5eb1ed3d4b]
libggml-base.so(ggml_print_backtrace+0x21f)[0x7f5eb1ed41af]
libggml-base.so(+0x28aaf)[0x7f5eb1ee5aaf]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f5eb1d3d20c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x7f5eb1d3d277]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8)[0x7f5eb1d3d4d8]
/app/llama-server(+0x44fb2)[0x5593b03aefb2]
/app/llama-server(+0x158ce8)[0x5593b04c2ce8]
/app/llama-server(+0xb0f14)[0x5593b041af14]
/app/llama-server(+0xb321c)[0x5593b041d21c]
/app/llama-server(+0xdf406)[0x5593b0449406]
/app/llama-server(+0x856fd)[0x5593b03ef6fd]
/app/llama-server(+0x4d5e5)[0x5593b03b75e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f5eb1988d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f5eb1988e40]
/app/llama-server(+0x4f035)[0x5593b03b9035]
terminate called after throwing an instance of 'std::runtime_error'
  what():  Unexpected content at end of input
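
For reference, a request along these lines may reproduce the failure without ChatBox. This is a minimal sketch: the get_weather tool definition is a hypothetical stand-in for whatever tool ChatBox actually sends, and it assumes the server is listening on the default port 8080.

# Hypothetical minimal tool-enabled request against the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is the weather in Sydney?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'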

Additional context:

Both CUDA and CPU backends load successfully.
No errors during build or initial startup.
ChatBox works fine without tools.
The failure happens only when tools are enabled.
Let me know if additional logs or stack traces are needed.
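
For a symbolized backtrace, a debug build run under gdb should work; a sketch, assuming a CMake build from the repository root:

# Rebuild llama-server with debug symbols (CUDA enabled)
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DGGML_CUDA=ON
cmake --build build --target llama-server -j

# Run under gdb and capture the backtrace at the crash
gdb --args ./build/bin/llama-server -m gpt-oss-20b-BF16.gguf --jinja
(gdb) run
(gdb) bt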

Runtime configuration

Configuration:
  GPU Layers: 99
  Threads: -1
  Context Size: 16384
  Temperature: 1.0
  Top-p: 1.0
  Top-k: 0
  Jinja: true
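
The equivalent server invocation, as a sketch (flag names per llama-server --help; host, port, and everything else left at defaults):

./llama-server -m gpt-oss-20b-BF16.gguf \
  --n-gpu-layers 99 \
  --threads -1 \
  --ctx-size 16384 \
  --temp 1.0 \
  --top-p 1.0 \
  --top-k 0 \
  --jinja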

First Bad Commit

No response

Relevant log output

(Same as the log shown under "Problem description & steps to reproduce" above.)
