Run any Hugging Face model on a temporary vLLM server inside your CI pipeline.
```yaml
name: LLM smoke test
on: push

jobs:
  infer:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run LLM inference
        uses: muhac/llm-actions@v2
        with:
          model: Qwen/Qwen3-1.7B
          # below are optional inputs
          vllm_version: v0.17.0
          hf_token: ${{ secrets.HF_TOKEN }}
          kv_cache_space: 4
          max_model_len: 32768
      # The rest of your workflow...
```

Inputs:

- `model`: (Required) The model to use for inference. Recommended: `Qwen/Qwen3-1.7B`.
- `vllm_version`: The version of the vLLM Docker image to use. Default is `v0.17.0`. You may want a newer version for support of more models.
- `hf_token`: The Hugging Face token to use for downloading the model. If you are using a public model, you can leave it empty. If you are using a private model, you need to provide your Hugging Face token.
- `kv_cache_space`: CPU KV cache size in GiB (`VLLM_CPU_KVCACHE_SPACE`). Optional; leave empty to use the vLLM default.
- `max_model_len`: Maximum sequence length for the model (`--max-model-len`). Optional; leave empty to use the model's default.
Outputs:

- `logs`: Full vLLM server log.
Here is an example workflow that runs a few models and makes API calls to them.
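Such an API call can be sketched in Python from a later workflow step. This assumes the action exposes vLLM's OpenAI-compatible API at its default address, `http://localhost:8000/v1`; the endpoint the action actually serves may differ, so adjust accordingly. The helper name `build_chat_request` is illustrative, not part of the action.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request for the local vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:8000/v1", "Qwen/Qwen3-1.7B", "Say hello.")
# To actually send it once the server is up:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```

Using the standard library keeps the step dependency-free; swapping in the `openai` client against the same base URL works equally well.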
This repository contains the code to run LLMs for inference in GitHub Actions. You can run LLMs hosted by vLLM, or write your own code to run LLMs in GitHub Actions. Here are some examples with tested models.
Qwen/Qwen3-1.7B is recommended as the default — it offers a good balance of speed and quality. For vision tasks, Qwen/Qwen2.5-VL-3B-Instruct is a compact multimodal model that fits comfortably within the runner's memory. If you want to disable thinking on Qwen3 models, add /no_think as the system prompt.
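As a sketch of the `/no_think` tip above, the system prompt is just the first entry in the chat message list you send to the server (the helper name here is hypothetical):

```python
def no_think_messages(user_prompt: str) -> list[dict]:
    """Prepend the /no_think system prompt to disable Qwen3 thinking mode."""
    return [
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": user_prompt},
    ]

messages = no_think_messages("Summarize this diff in one sentence.")
```

The resulting list drops straight into the `messages` field of a chat-completion request.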
For the latest models, check out the Open LLM Leaderboard. Newer versions of vLLM and Transformers may support more models.
The largest model that can run in GitHub Actions is about 6B parameters; 7B models exceed the memory available in the current environment. In extreme cases you may run quantized 8B (even 14B) models for advanced reasoning tasks, but inference will be extremely slow and the workflow may be unstable.
MIT
Pull requests are welcome.