Local AI models for code completion and chat in VS Code via the Continue extension. The project can run various models; the Qwen2.5-Coder family has shown good results experimentally. The server is precompiled for Windows.
- Laptop: Acer Swift Go 14 (Intel Arc iGPU, 18 GB GPU memory, 32 GB RAM)
- Software: Windows 11, Intel® oneAPI Base Toolkit 2025.3.0, Intel® Deep Learning Essentials 2025.3.0 (possibly not required)
- Folder: `C:\ai-projects\ovms-continue` (should work from any folder)
| Model | Size | Speed | Best for | Pushed to GitHub |
|---|---|---|---|---|
| Qwen2.5-Coder-14B | ~8 GB | ⚡ | Best quality | No, too large |
| Qwen2.5-Coder-7B | ~4 GB | ⚡⚡ | Balanced quality/speed | No, too large |
| Qwen2.5-Coder-3B | ~2 GB | ⚡⚡⚡ | Quick answers | Yes |
| Qwen2.5-Coder-1.5B | ~1 GB | ⚡⚡⚡⚡ | Autocomplete | Yes |
| Qwen2.5-Coder-0.5B | ~300 MB | ⚡⚡⚡⚡⚡ | Minimal resources | Yes |
Edit `config_all.json` in the `models` folder to choose which models are loaded. For example:

```json
{
  "model_config_list": [
    {
      "config": {
        "name": "Qwen2.5-Coder-0.5B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-0.5B-Instruct-int4-ov"
      }
    },
    {
      "config": {
        "name": "Qwen2.5-Coder-1.5B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-1.5B-Instruct-int4-ov"
      }
    },
    {
      "config": {
        "name": "Qwen2.5-Coder-3B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-3B-Instruct-int4-ov"
      }
    },
    {
      "config": {
        "name": "Qwen2.5-Coder-7B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-7B-Instruct-int4-ov"
      }
    },
    {
      "config": {
        "name": "Qwen2.5-Coder-14B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-14B-Instruct-int4-ov"
      }
    }
  ]
}
```

Loading many models at once is not recommended: every loaded model consumes additional memory.
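The JSON is easy to break when commenting models in and out (a missing comma between entries is a common slip). A quick syntax check before restarting the server; the path is assumed from the setup above:

```python
import json

# Parse config_all.json; on a syntax error json.load reports the
# offending line and column (e.g. a missing comma between entries).
path = r"C:\ai-projects\ovms-continue\models\config_all.json"
with open(path, encoding="utf-8") as f:
    config = json.load(f)

for entry in config["model_config_list"]:
    print("configured:", entry["config"]["name"])
```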
Start the server:

```bat
cd C:\ai-projects\ovms-continue
start-ovms.bat
```

Open http://localhost:8000/v3/models in a browser to verify that the configured models are loaded.
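The endpoints can also be exercised from a script. A sketch using only the Python standard library; the model name must match one of the names in `config_all.json`:

```python
import json
import urllib.request

BASE = "http://localhost:8000/v3"

# List the models the server has loaded.
with urllib.request.urlopen(f"{BASE}/models") as resp:
    print(json.load(resp))

# Send a minimal OpenAI-style chat completion request.
payload = {
    "model": "Qwen2.5-Coder-1.5B-Instruct-int4-ov",
    "messages": [{"role": "user", "content": "Write a hello world in Python."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    f"{BASE}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    answer = json.load(resp)
print(answer["choices"][0]["message"]["content"])
```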
Continue will automatically connect to OVMS (port 8000). Edit the configuration file `C:\Users\<Username>\.continue\config.yaml` like so:
```yaml
models:
  - name: Qwen2.5-Coder-3B (GPU)
    provider: openai
    model: Qwen2.5-Coder-3B-Instruct-int4-ov
    apiKey: unused
    apiBase: http://localhost:8000/v3
    roles:
      - chat
      - edit
      - apply
  - name: Qwen2.5-Coder-1.5B (GPU)
    provider: openai
    model: Qwen2.5-Coder-1.5B-Instruct-int4-ov
    apiKey: unused
    apiBase: http://localhost:8000/v3
    roles:
      - autocomplete
```

This sets your local models in the Continue extension: the 3B model serves chat/edit/apply, and the 1.5B model serves autocomplete.
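For reference, Continue's autocomplete role sends fill-in-the-middle (FIM) requests. You can reproduce one by hand using Qwen2.5-Coder's FIM special tokens; this sketch assumes the OpenAI-style `/v3/completions` endpoint is available on OVMS:

```python
import json
import urllib.request

# Qwen2.5-Coder FIM prompt: ask the model to complete the code
# between the prefix and the suffix.
prompt = (
    "<|fim_prefix|>def add(a, b):\n    "
    "<|fim_suffix|>\n    return result\n<|fim_middle|>"
)
payload = {
    "model": "Qwen2.5-Coder-1.5B-Instruct-int4-ov",
    "prompt": prompt,
    "max_tokens": 32,
    "temperature": 0,
}
req = urllib.request.Request(
    "http://localhost:8000/v3/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```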
Download a model into the `models` folder (cloning from Hugging Face requires Git LFS):

```bat
cd .\models
git clone https://huggingface.co/OpenVINO/Qwen2.5-Coder-7B-Instruct-int4-ov
```

OVMS requires a specific folder structure. After downloading:
1. Create a version subfolder `1/`:

```bat
cd Qwen2.5-Coder-7B-Instruct-int4-ov
mkdir 1
```

2. Move the model files into `1/`:

```bat
move *.json 1\
move *.bin 1\
move *.xml 1\
move *.txt 1\
move *.model 1\
```

3. Create `graph.pbtxt` in the model root folder (not in `1/`):
```
input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"
node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: 'LOOPBACK:0',
    back_edge: true
  }
  node_options: {
    [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
      models_path: "./1",
      cache_size: 4,
      max_num_seqs: 256,
      dynamic_split_fuse: true,
      device: "GPU"
    }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}
```

Change `device: "GPU"` to `"CPU"` or `"NPU"` if needed; `cache_size` is the KV cache size in GB.
4. Create `chat_template.jinja` for chat models (example for Qwen's ChatML format):

```jinja
{%- for message in messages -%}
{%- if message['role'] == 'system' -%}
{{- '<|im_start|>system\n' + message['content'] + '<|im_end|>\n' -}}
{%- elif message['role'] == 'user' -%}
{{- '<|im_start|>user\n' + message['content'] + '<|im_end|>\n' -}}
{%- elif message['role'] == 'assistant' -%}
{{- '<|im_start|>assistant\n' + message['content'] + '<|im_end|>\n' -}}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|im_start|>assistant\n' -}}
{%- endif -%}
```
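To check that the template produces the ChatML layout Qwen expects, you can render it locally. A sketch using the `jinja2` package (`pip install jinja2`):

```python
from jinja2 import Template

# Load the template saved next to graph.pbtxt.
with open("chat_template.jinja", encoding="utf-8") as f:
    template = Template(f.read())

rendered = template.render(
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    add_generation_prompt=True,
)
print(rendered)
# Expected shape:
# <|im_start|>system ... <|im_end|>
# <|im_start|>user ... <|im_end|>
# <|im_start|>assistant
```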
The resulting folder structure:

```
models/
└── Qwen2.5-Coder-7B-Instruct-int4-ov/
    ├── graph.pbtxt              # OVMS graph config
    ├── chat_template.jinja      # Chat format template
    └── 1/                       # Version folder (required!)
        ├── config.json
        ├── generation_config.json
        ├── openvino_model.xml
        ├── openvino_model.bin
        ├── openvino_tokenizer.xml
        ├── openvino_tokenizer.bin
        ├── openvino_detokenizer.xml
        ├── openvino_detokenizer.bin
        ├── tokenizer.json
        ├── tokenizer_config.json
        └── ...
```
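A small script can verify the layout before starting the server (a hypothetical helper based on the tree above; adjust the path to your model):

```python
from pathlib import Path

MODEL_DIR = Path(r"C:\ai-projects\ovms-continue\models\Qwen2.5-Coder-7B-Instruct-int4-ov")

# Files OVMS needs in the model root and in the version folder 1/.
required = [
    MODEL_DIR / "graph.pbtxt",
    MODEL_DIR / "chat_template.jinja",
    MODEL_DIR / "1" / "openvino_model.xml",
    MODEL_DIR / "1" / "openvino_model.bin",
    MODEL_DIR / "1" / "openvino_tokenizer.xml",
]

missing = [p for p in required if not p.exists()]
if missing:
    for p in missing:
        print("MISSING:", p)
else:
    print("Folder structure looks OK.")
```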
5. Register the model in `config_all.json`:

```json
{
  "model_config_list": [
    {
      "config": {
        "name": "Qwen2.5-Coder-7B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-7B-Instruct-int4-ov"
      }
    }
  ]
}
```

Notes:
- File encoding: `graph.pbtxt` must be UTF-8 without BOM. PowerShell's `Out-File` adds a BOM; use `[System.IO.File]::WriteAllText($path, $content, [System.Text.UTF8Encoding]::new($false))` instead.
- Context length: models with a small context window (2K tokens, like GPT-J) don't work well with Continue. Use models with 4K+ context.
- Vision models: may require a newer OpenVINO version.
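If `graph.pbtxt` is generated from Python rather than PowerShell, the default UTF-8 writer already omits the BOM:

```python
# Python's "utf-8" codec writes no BOM, unlike PowerShell's Out-File.
graph_text = 'input_stream: "HTTP_REQUEST_PAYLOAD:input"\n'  # ... rest of the graph

with open("graph.pbtxt", "w", encoding="utf-8", newline="\n") as f:
    f.write(graph_text)

# Verify: the file must not start with the UTF-8 BOM bytes EF BB BF.
with open("graph.pbtxt", "rb") as f:
    assert f.read(3) != b"\xef\xbb\xbf", "graph.pbtxt has a BOM"
```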
This project was created for fun by Oleksandr Lopatnov.

