Local AI models for code completion and chat in VS Code via the Continue extension. The project can run various models; the Qwen2.5-Coder family has shown good results experimentally. The server is precompiled for Windows.
- Laptop: Acer Swift Go 14 (Intel Arc iGPU, 18 GB GPU memory, 32 GB RAM)
- Software: Windows 11, Intel® oneAPI Base Toolkit 2025.3.0, Intel® Deep Learning Essentials 2025.3.0 (possibly not required)
- Folder: `C:\ai-projects\ovms-continue` (should work from any folder)
| Model | Size | Speed | Best for | Pushed to GitHub |
|---|---|---|---|---|
| Qwen2.5-Coder-14B | ~8 GB | ⚡ | Best quality | No, too large |
| Qwen2.5-Coder-7B | ~4 GB | ⚡⚡ | Balanced quality/speed | No, too large |
| Qwen2.5-Coder-3B | ~2 GB | ⚡⚡⚡ | Quick answers | Yes |
| Qwen2.5-Coder-1.5B | ~1 GB | ⚡⚡⚡⚡ | Autocomplete | Yes |
| Qwen2.5-Coder-0.5B | ~300 MB | ⚡⚡⚡⚡⚡ | Minimal resources | Yes |
Edit `config_all.json` in the `models` folder to choose which models are loaded. For example:

```json
{
  "model_config_list": [
    {
      "config": {
        "name": "Qwen2.5-Coder-0.5B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-0.5B-Instruct-int4-ov"
      }
    },
    {
      "config": {
        "name": "Qwen2.5-Coder-1.5B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-1.5B-Instruct-int4-ov"
      }
    },
    {
      "config": {
        "name": "Qwen2.5-Coder-3B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-3B-Instruct-int4-ov"
      }
    },
    {
      "config": {
        "name": "Qwen2.5-Coder-7B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-7B-Instruct-int4-ov"
      }
    },
    {
      "config": {
        "name": "Qwen2.5-Coder-14B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-14B-Instruct-int4-ov"
      }
    }
  ]
}
```

Loading many models at once is not recommended: every loaded model consumes additional memory.
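The JSON is easy to break when commenting models in and out (a missing comma between entries is a common slip). A quick syntax check before restarting the server; the path is assumed from the setup above:

```python
import json

# Parse config_all.json; on a syntax error json.load reports the
# offending line and column (e.g. a missing comma between entries).
path = r"C:\ai-projects\ovms-continue\models\config_all.json"
with open(path, encoding="utf-8") as f:
    config = json.load(f)

for entry in config["model_config_list"]:
    print("configured:", entry["config"]["name"])
```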
Start the server:

```bat
cd C:\ai-projects\ovms-continue
start-ovms.bat
```

Open http://localhost:8000/v3/models in a browser to verify that the configured models are loaded.
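The endpoints can also be exercised from a script. A sketch using only the Python standard library; the model name must match one of the names in `config_all.json`:

```python
import json
import urllib.request

BASE = "http://localhost:8000/v3"

# List the models the server has loaded.
with urllib.request.urlopen(f"{BASE}/models") as resp:
    print(json.load(resp))

# Send a minimal OpenAI-style chat completion request.
payload = {
    "model": "Qwen2.5-Coder-1.5B-Instruct-int4-ov",
    "messages": [{"role": "user", "content": "Write a hello world in Python."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    f"{BASE}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    answer = json.load(resp)
print(answer["choices"][0]["message"]["content"])
```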
Continue will automatically connect to OVMS (port 8000). Edit the configuration file `C:\Users\<Username>\.continue\config.yaml` like so:
```yaml
models:
  - name: Qwen2.5-Coder-3B (GPU)
    provider: openai
    model: Qwen2.5-Coder-3B-Instruct-int4-ov
    apiKey: unused
    apiBase: http://localhost:8000/v3
    roles:
      - chat
      - edit
      - apply
  - name: Qwen2.5-Coder-1.5B (GPU)
    provider: openai
    model: Qwen2.5-Coder-1.5B-Instruct-int4-ov
    apiKey: unused
    apiBase: http://localhost:8000/v3
    roles:
      - autocomplete
```

This sets your local models in the Continue extension: the 3B model serves chat/edit/apply, and the 1.5B model serves autocomplete.
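For reference, Continue's autocomplete role sends fill-in-the-middle (FIM) requests. You can reproduce one by hand using Qwen2.5-Coder's FIM special tokens; this sketch assumes the OpenAI-style `/v3/completions` endpoint is available on OVMS:

```python
import json
import urllib.request

# Qwen2.5-Coder FIM prompt: ask the model to complete the code
# between the prefix and the suffix.
prompt = (
    "<|fim_prefix|>def add(a, b):\n    "
    "<|fim_suffix|>\n    return result\n<|fim_middle|>"
)
payload = {
    "model": "Qwen2.5-Coder-1.5B-Instruct-int4-ov",
    "prompt": prompt,
    "max_tokens": 32,
    "temperature": 0,
}
req = urllib.request.Request(
    "http://localhost:8000/v3/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```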
Download a model into the `models` folder (cloning from Hugging Face requires Git LFS):

```bat
cd .\models
git clone https://huggingface.co/OpenVINO/Qwen2.5-Coder-7B-Instruct-int4-ov
```

OVMS requires a specific folder structure. After downloading:
1. Create a version subfolder `1/`:

```bat
cd Qwen2.5-Coder-7B-Instruct-int4-ov
mkdir 1
```

2. Move the model files into `1/`:

```bat
move *.json 1\
move *.bin 1\
move *.xml 1\
move *.txt 1\
move *.model 1\
```

3. Create `graph.pbtxt` in the model root folder (not in `1/`):
```
input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"
node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: 'LOOPBACK:0',
    back_edge: true
  }
  node_options: {
    [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
      models_path: "./1",
      cache_size: 4,
      max_num_seqs: 256,
      dynamic_split_fuse: true,
      device: "GPU"
    }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}
```

Change `device: "GPU"` to `"CPU"` or `"NPU"` if needed; `cache_size` is the KV cache size in GB.
4. Create `chat_template.jinja` for chat models (example for Qwen's ChatML format):

```jinja
{%- for message in messages -%}
{%- if message['role'] == 'system' -%}
{{- '<|im_start|>system\n' + message['content'] + '<|im_end|>\n' -}}
{%- elif message['role'] == 'user' -%}
{{- '<|im_start|>user\n' + message['content'] + '<|im_end|>\n' -}}
{%- elif message['role'] == 'assistant' -%}
{{- '<|im_start|>assistant\n' + message['content'] + '<|im_end|>\n' -}}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|im_start|>assistant\n' -}}
{%- endif -%}
```
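To check that the template produces the ChatML layout Qwen expects, you can render it locally. A sketch using the `jinja2` package (`pip install jinja2`):

```python
from jinja2 import Template

# Load the template saved next to graph.pbtxt.
with open("chat_template.jinja", encoding="utf-8") as f:
    template = Template(f.read())

rendered = template.render(
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    add_generation_prompt=True,
)
print(rendered)
# Expected shape:
# <|im_start|>system ... <|im_end|>
# <|im_start|>user ... <|im_end|>
# <|im_start|>assistant
```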
The resulting folder structure:

```
models/
└── Qwen2.5-Coder-7B-Instruct-int4-ov/
    ├── graph.pbtxt              # OVMS graph config
    ├── chat_template.jinja      # Chat format template
    └── 1/                       # Version folder (required!)
        ├── config.json
        ├── generation_config.json
        ├── openvino_model.xml
        ├── openvino_model.bin
        ├── openvino_tokenizer.xml
        ├── openvino_tokenizer.bin
        ├── openvino_detokenizer.xml
        ├── openvino_detokenizer.bin
        ├── tokenizer.json
        ├── tokenizer_config.json
        └── ...
```
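A small script can verify the layout before starting the server (a hypothetical helper based on the tree above; adjust the path to your model):

```python
from pathlib import Path

MODEL_DIR = Path(r"C:\ai-projects\ovms-continue\models\Qwen2.5-Coder-7B-Instruct-int4-ov")

# Files OVMS needs in the model root and in the version folder 1/.
required = [
    MODEL_DIR / "graph.pbtxt",
    MODEL_DIR / "chat_template.jinja",
    MODEL_DIR / "1" / "openvino_model.xml",
    MODEL_DIR / "1" / "openvino_model.bin",
    MODEL_DIR / "1" / "openvino_tokenizer.xml",
]

missing = [p for p in required if not p.exists()]
if missing:
    for p in missing:
        print("MISSING:", p)
else:
    print("Folder structure looks OK.")
```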
5. Register the model in `config_all.json`:

```json
{
  "model_config_list": [
    {
      "config": {
        "name": "Qwen2.5-Coder-7B-Instruct-int4-ov",
        "base_path": "Qwen2.5-Coder-7B-Instruct-int4-ov"
      }
    }
  ]
}
```

Notes:
- File encoding: `graph.pbtxt` must be UTF-8 without BOM. PowerShell's `Out-File` adds a BOM; use `[System.IO.File]::WriteAllText($path, $content, [System.Text.UTF8Encoding]::new($false))` instead.
- Context length: models with a small context window (2K tokens, like GPT-J) don't work well with Continue. Use models with 4K+ context.
- Vision models: may require a newer OpenVINO version.
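If `graph.pbtxt` is generated from Python rather than PowerShell, the default UTF-8 writer already omits the BOM:

```python
# Python's "utf-8" codec writes no BOM, unlike PowerShell's Out-File.
graph_text = 'input_stream: "HTTP_REQUEST_PAYLOAD:input"\n'  # ... rest of the graph

with open("graph.pbtxt", "w", encoding="utf-8", newline="\n") as f:
    f.write(graph_text)

# Verify: the file must not start with the UTF-8 BOM bytes EF BB BF.
with open("graph.pbtxt", "rb") as f:
    assert f.read(3) != b"\xef\xbb\xbf", "graph.pbtxt has a BOM"
```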
This project was created for fun by Oleksandr Lopatnov.

