NM i AI (Norwegian AI Championship) is one of Norway's largest AI competitions — a 69-hour sprint where teams build working AI systems from scratch. The 2026 edition ran from March 19–22 with a 1,000,000 NOK prize pool and 476 competing teams.
This repository documents my solo entry: all code, experiments, submissions, and lessons learned across three independent tasks.
| Task | Description | Score | Rank |
|---|---|---|---|
| Task 1 — Tripletex | AI accounting agent | 34.3 normalized / 35.79 raw | #233 / 398 |
| Task 2 — Astar Island | Norse world terrain prediction | 69.4 normalized / 185.1 raw | #242 / 395 |
| Task 3 — NorgesGruppen | Grocery shelf object detection | 95.2 normalized / 0.8818 mAP | #204 / 347 |
| Overall | Average across all three tasks | 66.3 | #214 / 476 |
Since points could be earned in just one task, each per-task leaderboard is shorter than the overall leaderboard, where all 476 teams scored in at least one task.
Three independent AI challenges, each worth 33% of your total score.
| Task | Partner | Type | Submission format |
|---|---|---|---|
| Tripletex | Tripletex | AI accounting agent | Live HTTPS endpoint |
| Astar Island | — | Probabilistic terrain prediction | REST API predictions |
| NorgesGruppen Data | NorgesGruppen | Grocery shelf object detection | Code upload (ZIP) |
The competition kicked off at 18:00 on Thursday and ended at 15:00 on Sunday — 69 hours to go from nothing to a working, deployed system.
| Time | Rank | Total | Detection | Tripletex | Astar | What happened |
|---|---|---|---|---|---|---|
| Sat Mar 21, ~02:39 | #318 | 12.9 | 19.3 | 19.5 | — | Early Saturday morning |
| Sat Mar 21, morning | #274 | 32.5 | 19.3 | 27.6 | 50.6 | First Astar round result in |
| Sat Mar 21, midday | #273 | 35.2 | 19.3 | 31.8 | 54.5 | Tripletex improving |
| Sat Mar 21, ~21:42 | #233 | — | — | 26.8 | — | Overnight bot loop starts |
| Sun Mar 22, early | #230 | 56.3 | 82.4 | 32.3 | 54.1 | NorgesGruppen: fixed pipeline, 0.1786 → 0.7626, then 0.7780 |
| Sun Mar 22 | #230 | 56.8 | 84.0 | 32.3 | 54.1 | Tripletex f-string crash fixed |
| Sun Mar 22, morning | #220 | 63.1 | 84.0 | 34.3 | 70.9 | Overnight retrain lands + final NorgesGruppen submissions: 0.8818 |
| Sun Mar 22, ~12:xx | #206 | 66.1 | 93.0 | 34.3 | 70.9 | Astar rounds completing |
| Sun Mar 22, ~13:xx | #202 | 66.8 | 95.2 | 34.3 | 70.9 | Peak rank during competition |
| Sun Mar 22, 15:00 | #214 | 66.3 | 95.2 | 34.3 | 69.4 | Competition closed |
The score went from 12.9 → 66.3 and rank from #318 → #214 over the course of the weekend. Peak rank was #202 just before the final Astar round adjusted the score slightly downward at close.
Score: 35.79 raw / 34.3 normalized — Rank #233 of 398 — 30/30 task types covered
The validator sent a random accounting task — in any of seven languages (Norwegian, Nynorsk, English, German, French, Spanish, Portuguese) — to our HTTP endpoint, then inspected the Tripletex ERP sandbox to see if the agent had completed it correctly. Each of the 30 task types tested a different accounting workflow.
A FastAPI application deployed on Google Cloud Run:
- Receives the task prompt + any attached PDFs
- Calls Gemini 2.5 Flash (via Vertex AI) with detailed Tripletex API instructions
- Gemini returns a structured JSON plan — a sequence of API calls
- The agent executes those calls, resolving references between steps (e.g. "use the ID from step 2 in step 4")
- If calls fail with validation errors, Gemini is asked to repair and retry once
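To make the flow concrete, here is a minimal sketch of the plan-execute loop. The plan schema, helper names (`execute_plan`, `resolve_refs`), and base URL are illustrative stand-ins, not the production code:

```python
import re

import httpx

TRIPLETEX_BASE = "https://api-test.tripletex.tech/v2"  # illustrative base URL

def resolve_refs(obj, results):
    """Replace "{step_N.field}" placeholders with values from earlier steps."""
    if isinstance(obj, dict):
        return {key: resolve_refs(value, results) for key, value in obj.items()}
    if isinstance(obj, list):
        return [resolve_refs(item, results) for item in obj]
    if isinstance(obj, str):
        match = re.fullmatch(r"\{step_(\d+)\.(\w+)\}", obj)
        if match:
            step_index, field = int(match.group(1)) - 1, match.group(2)
            return results[step_index][field]
    return obj

def execute_plan(plan: list[dict], session_token: str) -> list[dict]:
    """Execute the JSON plan returned by Gemini, step by step.

    Each step looks like {"method": "POST", "path": "/invoice", "body": {...}}
    and may reference earlier results via placeholders like "{step_2.id}".
    """
    results: list[dict] = []
    # Auth simplified: the session token is passed as HTTP basic auth here.
    with httpx.Client(base_url=TRIPLETEX_BASE, auth=("0", session_token)) as client:
        for step in plan:
            body = resolve_refs(step.get("body"), results)
            response = client.request(step["method"], step["path"], json=body)
            response.raise_for_status()  # fail loudly; a repair prompt handles retries
            results.append(response.json().get("value", {}))
    return results
```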
The core insight: the "intelligence" lives almost entirely in the prompt. Better instructions = better score. The Python code is mostly plumbing. Every PR that improved the score was a prompt change, not a code change.
- Python 3.11 + FastAPI
- Google Cloud Run (europe-north1, 2 GB memory, 300s timeout)
- Gemini 2.5 Flash via Vertex AI (service account auth — no API key needed on Cloud Run)
- PDF parsing: pypdf
- Logging: Google Cloud Logging
Score progression: ~23 (night 1) → 28.7 → 35.79 (final). Key improvements:
- Hardcoded VAT IDs instead of fetching them (the endpoint returns 404)
- Added mandatory bank account setup before any invoice payment flow
- Fixed bank return reversal — credit notes cancel the invoice itself; voucher reversal was the correct endpoint for reversing the payment
- Fixed f-string crash (see below)
The f-string crash — the system prompt lives inside a Python f-string, so every {word} in documentation examples gets evaluated as a variable. One unescaped {voucherId} caused a NameError → HTTP 500 on every single request, for over an hour, without any alert. Score during that window: 0%.
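A distilled reproduction of the bug (the prompt text here is illustrative):

```python
task = "register an incoming invoice"

# Buggy: the system prompt lives in an f-string, so a documentation example
# containing {voucherId} is evaluated as a Python variable and raises
# NameError: name 'voucherId' is not defined -> HTTP 500 on every request.
#
#     prompt = f"Task: {task}. The API responds with {voucherId}."
#
# Fixed: double the braces so they reach the model as literal text.
prompt = f"Task: {task}. The API responds with {{voucherId}}."
assert "{voucherId}" in prompt
```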
The overnight loop — we set up an auto-submit loop in the browser to run while we slept. It ran for 5 hours and had burned all 300 of our daily submissions by 07:30. When we woke up with new fixes ready, there was nothing left to submit. 6 of the 30 task types were never scored.
Lesson: Silent failures compound. Design systems to fail loudly.
📁 Full write-up: task1-Tripletex/README.md
Score: 185.1 raw / 69.4 normalized — Rank #242 of 395 — 7 rounds played
Astar Island is a Norse-themed simulation: a 40×40 tile map where each cell can evolve over time into one of six terrain classes (Empty, Settlement, Port, Ruin, Forest, Mountain). The task was to query a viewport-based simulator, observe partial information about the map, and submit a calibrated probability distribution for every cell.
Scores compound across rounds — later rounds have higher weight multipliers, so performance consistency matters more than any single result.
We built a full operator service with:
- Direct REST API integration against the Astar Island competition API
- A web dashboard (Dashboard, Explorer, Submit, Logs tabs) for real-time monitoring
- Background query runner with automatic draft generation
- Manual submit controls by default, with a T-20 minute deadline guard as a safety fallback
- A lightweight history-aware linear model trained on 17 rounds of ground-truth data to calibrate class predictions
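As a rough illustration of the prediction side, here is a sketch of the calibrated-grid idea. The class order matches the task, but the prior numbers, smoothing, and function shape are simplified stand-ins for the trained linear model:

```python
import numpy as np

CLASSES = ["Empty", "Settlement", "Port", "Ruin", "Forest", "Mountain"]
GRID = 40

# Illustrative global class frequencies from past rounds. Note how rare
# Port is (~0.8% of cells): overpredicting it was consistently punished.
PRIOR = np.array([0.55, 0.12, 0.008, 0.03, 0.20, 0.092])
PRIOR = PRIOR / PRIOR.sum()

def predict(observed: dict[tuple[int, int], int], eps: float = 0.05) -> np.ndarray:
    """Return a (40, 40, 6) grid of class probabilities.

    Unobserved cells fall back to the global prior; observed cells get a
    smoothed near-one-hot distribution, since terrain can still evolve
    before scoring and hard 0/1 probabilities are badly punished if wrong.
    """
    probs = np.tile(PRIOR, (GRID, GRID, 1))
    for (row, col), cls in observed.items():
        smoothed = np.full(len(CLASSES), eps / (len(CLASSES) - 1))
        smoothed[cls] = 1.0 - eps
        probs[row, col] = smoothed
    return probs

# Example: the viewport showed a Forest at (3, 7).
grid = predict({(3, 7): CLASSES.index("Forest")})
assert np.allclose(grid.sum(axis=-1), 1.0)
```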
The service ran on Google Cloud Run and persisted state across restarts so we could manage live rounds without laptop dependency. We never missed a round.
- Round-to-round variance is real: scores ranged from 25 to 52 across rounds as map dynamics shifted
- Overpredicting rare dynamic classes (Port, Ruin) hurt consistently — the global average for Port across all rounds was only ~0.8% of cells
- "Early baseline submit + deadline guard" = you never miss a round, no matter what
- Keeping `OBSERVED / INFERRED / DECISION` evidence tags in documentation made debugging under pressure much cleaner
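The deadline-guard pattern, sketched below: the T-20 margin is from our setup, while the hook functions are hypothetical stand-ins for the operator service's state and API client:

```python
import asyncio
from datetime import datetime, timedelta, timezone

# Hypothetical hooks into the operator service.
async def already_submitted() -> bool: ...
async def current_draft() -> dict: ...
async def submit(draft: dict) -> None: ...

async def deadline_guard(deadline: datetime,
                         margin: timedelta = timedelta(minutes=20)) -> None:
    """Sleep until T-20 minutes, then auto-submit if no manual submit landed.

    Combined with an early baseline submit, a missed manual submit costs
    a few points instead of an entire round.
    """
    fire_at = deadline - margin
    await asyncio.sleep(max(0.0, (fire_at - datetime.now(timezone.utc)).total_seconds()))
    if not await already_submitted():
        await submit(await current_draft())
```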
📁 Full write-up: task2-Astar/README.md
Score: 0.8818 mAP / 95.2 normalized — Rank #204 of 347 — from 0.1786 to 0.8818 in one night
The task was to detect grocery products on store shelf images. Submit a ZIP file containing run.py + a model (best.onnx). The organizer runs it on 248 images and scores the result using mAP (mean Average Precision).
| Submission | Score | What changed |
|---|---|---|
| Baseline | 0.1786 | Original ONNX pipeline |
| After inference fixes | 0.7626 | Letterbox-aware scaling, correct ONNX decode, class-aware NMS |
| After conf threshold tuning | 0.7780 | CONF_THRESHOLD 0.25 → 0.20 |
| After overnight retraining | 0.8621 | YOLOv8 fine-tuned on GCP GPU (NVIDIA L4) |
| After CONF sweep | 0.8798 | CONF_THRESHOLD → 0.06 |
| After IoU sweep | 0.8808 | IOU_THRESHOLD 0.70 → 0.60 |
| Final | 0.8818 | CONF_THRESHOLD → 0.04 |
- Fixed inference reliability first — the original pipeline had subtle errors in ONNX output decoding, coordinate rescaling, and NMS. Fixing those alone moved us from 0.1786 to 0.7626 (the two key fixes are sketched after this list).
- Switched to bounded one-variable experiments — changing one parameter per submission made outcomes easy to attribute and rollbacks trivial.
- Ran a high-upside overnight retraining job on GCP — a YOLOv8 fine-tune on a dedicated GPU VM (g2-standard-4, NVIDIA L4) ran while we slept and produced the biggest single jump (+0.084).
- Swept thresholds systematically — each submission tested a specific hypothesis with a clear rollback plan.
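The two fixes behind the 0.1786 → 0.7626 jump, sketched here with illustrative code (tensor layout and exact constants depend on the actual ONNX export; the `nms` helper is a plain greedy implementation):

```python
import numpy as np

def undo_letterbox(boxes_xyxy: np.ndarray, scale: float,
                   pad_x: float, pad_y: float) -> np.ndarray:
    """Map boxes from the letterboxed model input back to original image pixels.

    The original bug: dividing by the scale without first subtracting the
    letterbox padding, which shifted every box on non-square images.
    """
    boxes = boxes_xyxy.copy()
    boxes[:, [0, 2]] = (boxes[:, [0, 2]] - pad_x) / scale
    boxes[:, [1, 3]] = (boxes[:, [1, 3]] - pad_y) / scale
    return boxes

def class_aware_nms(boxes: np.ndarray, scores: np.ndarray,
                    class_ids: np.ndarray, iou_thr: float = 0.60) -> np.ndarray:
    """Offset boxes by a class-dependent shift so boxes of different classes
    never suppress each other, then run a single plain NMS pass."""
    offsets = class_ids[:, None].astype(np.float32) * 10_000.0
    return nms(boxes + offsets, scores, iou_thr)

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float) -> np.ndarray:
    """Greedy IoU-based NMS; returns indices of the kept boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thr]
    return np.array(keep)
```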
- Reliability before optimization — we spent hours chasing tuning before the pipeline was even correct. Fix the fundamentals first.
- Offline proxy metrics don't always transfer — one change that looked better in local evaluation scored lower on the leaderboard.
- High-upside compute at the right time — running a model retrain is only worth it once you have a stable, correct baseline.
📁 Full write-up: task3-Norgesgruppen/README.md
All three tasks ran on a Google Cloud Platform project provided as part of the NM i AI 2026 GCP sponsorship:
- Cloud Run — for Task 1 (agent endpoint) and Task 2 (operator service)
- Vertex AI / Gemini 2.5 Flash — LLM inference for Task 1
- Compute Engine — GPU VM (NVIDIA L4) for Task 3 model training
- Cloud Logging — structured log aggregation for debugging under pressure
This entire project was built almost entirely with AI coding assistance (Claude), by someone who had been coding with AI tools for about a week at competition start.
A few things that worked well about this workflow:
- Keeping AGENT.md and PROGRESS.md updated meant each new AI session could pick up context without losing state — essentially a persistent memory layer across sessions
- Don't write prompts yourself — ask the AI how a prompt should be written based on what you need, then use what it produces. Every time we did this instead of editing manually, the prompt got better
- Claude Code can actually use a browser — at one point it started clicking around websites looking for documentation, taking screenshots, analyzing them to figure out where to click next. Things it couldn't read from source code, it found visually. That was genuinely surprising to watch
- Delegating clearly — "you handle the logs, I'll handle merging PRs" — kept the work moving without confusion
- Reviewing every PR before merging, even without fully understanding the code, caught at least one significant bug
And a few things to grow into:
- Learning to read diffs and logs directly — not to write the code, but to have an opinion about whether it looks right
- Building intuition for what's reversible vs irreversible before running something unattended
- Understanding one layer below the abstraction — enough to form a hypothesis when something breaks
The broader lesson about agentic AI: The agent did exactly what we told it to do. When it failed, it was almost always because our instructions were ambiguous or incomplete. The skill in building agents isn't traditional programming — it's clarity of specification: knowing precisely what you want, knowing what can go wrong, and stating both explicitly. That's a learnable skill, and 69 hours of pressure is one of the fastest ways to develop it.
nm-ai-2026/
├── task1-Tripletex/ # AI accounting agent (FastAPI + Gemini + Cloud Run)
│ ├── main.py # Agent logic + Gemini prompt
│ ├── tests/ # 14 unit tests
│ ├── README.md # Full retrospective
│ └── PROGRESS.md # Competition diary — all 30 task types covered
│
├── task2-Astar/ # Terrain prediction operator service
│ ├── core.py # Prediction + probability math
│ ├── backend.py # Orchestration + API client
│ ├── main.py # FastAPI app + endpoints
│ ├── static/ # Web dashboard (HTML/JS/CSS)
│ ├── history/ # Training data + linear model
│ ├── tests/ # 16 unit tests
│ ├── README.md # Full retrospective
│ └── SPEC.md # Architecture decisions
│
└── task3-Norgesgruppen/ # Grocery shelf object detection
├── run.py # ONNX inference pipeline (final submission)
├── README.md # Full retrospective
└── PastSubmissions.md # Full submission log with hypotheses and outcomes
I'm a frontend and mobile development student in my 6th semester at Kristiania University College in Bergen, currently writing my bachelor's thesis. At home I have a wife and a nine-month-old, so free time is limited. I managed roughly 25 hours of actual competition time across the 69 hours.
I signed up about a week before the competition started, having downloaded Claude Code eight days prior. My classmates had full teams. I competed solo.
In those 25 hours I deployed three independent AI systems from scratch on Google Cloud: a live accounting agent, a terrain prediction service with a real-time web dashboard, and a computer vision pipeline for grocery shelf detection. Finished 214th out of 476 teams.
Before my IT studies, I spent nine years in sales and customer service at Elkjøp, including a stint as team lead, and I have a vocational background as a graphic designer. I'm currently looking for my first developer job in Bergen. Frontend, fullstack, mobile, or anything IT-related works. I learn best somewhere I can be around colleagues, ask questions, and grow on the job.
MIT License