Run any Hugging Face model on a temporary vLLM server inside your CI pipeline.
```yaml
name: LLM smoke test
on: push

jobs:
  infer:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run LLM inference
        uses: muhac/llm-actions@v2
        with:
          model: Qwen/Qwen3-1.7B
          # below are optional inputs
          vllm_version: v0.17.0
          hf_token: ${{ secrets.HF_TOKEN }}
          kv_cache_space: 4
          max_model_len: 32768
      # The rest of your workflow...
```

Inputs:

- `model`: (Required) The model to use for inference. Recommended: `Qwen/Qwen3-1.7B`.
- `vllm_version`: The version of the vLLM Docker image to use. Default is `v0.17.0`. You may want a newer version for support of more models.
- `hf_token`: The Hugging Face token to use for downloading the model. If you are using a public model, you can leave it empty. If you are using a private model, you need to provide your Hugging Face token.
- `kv_cache_space`: CPU KV cache size in GiB (`VLLM_CPU_KVCACHE_SPACE`). Optional; leave empty to use the vLLM default.
- `max_model_len`: Maximum sequence length for the model (`--max-model-len`). Optional; leave empty to use the model's default.
Outputs:

- `logs`: Full vLLM server log.
Here is an example workflow that runs a few models and makes API calls to them.
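Such an API call can be sketched in Python from a later workflow step. This assumes the action exposes vLLM's OpenAI-compatible API at its default address, `http://localhost:8000/v1`; the endpoint the action actually serves may differ, so adjust accordingly. The helper name `build_chat_request` is illustrative, not part of the action.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request for the local vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:8000/v1", "Qwen/Qwen3-1.7B", "Say hello.")
# To actually send it once the server is up:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```

Using the standard library keeps the step dependency-free; swapping in the `openai` client against the same base URL works equally well.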
This repository contains the code to run LLMs for inference in GitHub Actions. You can run LLMs hosted by vLLM, or write your own code to run LLMs in GitHub Actions. Here are some examples with tested models.
Qwen/Qwen3-1.7B is recommended as the default — it offers a good balance of speed and quality. For vision tasks, Qwen/Qwen2.5-VL-3B-Instruct is a compact multimodal model that fits comfortably within the runner's memory. If you want to disable thinking on Qwen3 models, add /no_think as the system prompt.
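As a sketch of the `/no_think` tip above, the system prompt is just the first entry in the chat message list you send to the server (the helper name here is hypothetical):

```python
def no_think_messages(user_prompt: str) -> list[dict]:
    """Prepend the /no_think system prompt to disable Qwen3 thinking mode."""
    return [
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": user_prompt},
    ]

messages = no_think_messages("Summarize this diff in one sentence.")
```

The resulting list drops straight into the `messages` field of a chat-completion request.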
For the latest models, check out the Open LLM Leaderboard. Newer versions of vLLM and Transformers may support more models.
The largest model that can run in GitHub Actions is about 6B parameters; 7B models exceed the memory available in the current environment. In extreme cases you may run quantized 8B (even 14B) models for advanced reasoning tasks, but inference will be extremely slow and the workflow may be unstable.
MIT
Pull requests are welcome.