☕
Turning coffee into code

44za12/README.md

Hi, I'm Aazar 👋

Founder & Principal Architect at NehmeAI Labs

I build AI systems that actually work. Not the biggest models, but the right-sized ones.


The Thesis

Most production AI stacks are over-engineered. 70B models for tasks a 4B handles. JSON output where delimiters would do. GPT-4 for email classification that a fine-tuned 1B nails at 1/100th the cost.

I fix this. Architecture audits that cut inference costs 40-60%. Tools that prove you don't need frontier models for most prompts. Research that shows specialization beats scale.


What I'm Building

FlashCheck: Hallucination detection that actually works
A 4B model that hits 91.7% on RAG Truth, beating Llama 405B. Purpose-built verification > general-purpose giants.
FlashCheck-Nano (270M) and FlashCheck-Lite (1B) are open source.

RightSize: Stop overpaying for inference
Most prompts don't need frontier models. This tool proves it on your actual data. 50-100x cost savings.

LLM Sanity Checks: A practical guide to not over-engineering
Decision trees, benchmarks, anti-patterns. Before you reach for GPT-4, read this.


Philosophy

Can a regex solve it?          → Use that. Stop.
Is it search/retrieval?        → Try BM25 first. It's 20x faster.
Is the task simple?            → 1B-8B model. Test it.
Actually complex reasoning?    → Maybe frontier. But measure.
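
The ladder above can be sketched as a small routing function. This is a minimal illustration, not code from any of the tools listed here: the function name `pick_tier`, its parameters, and the tier labels are all hypothetical, and the "is it simple?" judgment is left to the caller.

```python
import re
from typing import Optional

# Hypothetical tier labels, in the order the ladder is walked.
REGEX = "regex"
BM25 = "bm25"
SMALL_MODEL = "small-model (1B-8B)"
FRONTIER = "frontier"


def pick_tier(
    pattern: Optional[str] = None,
    is_retrieval: bool = False,
    is_simple: bool = True,
) -> str:
    """Walk the ladder top-down and stop at the first rung that applies."""
    if pattern is not None:
        re.compile(pattern)  # raises re.error early if the pattern is invalid
        return REGEX
    if is_retrieval:
        return BM25
    if is_simple:
        return SMALL_MODEL
    # Only reached when nothing cheaper applies; still worth measuring.
    return FRONTIER


if __name__ == "__main__":
    print(pick_tier(pattern=r"\b\d{4}-\d{2}-\d{2}\b"))  # date extraction
    print(pick_tier(is_retrieval=True))
    print(pick_tier(is_simple=False))
```

The point of the ordering is that each rung is checked before anything more expensive is considered, so the frontier tier is the fallback rather than the default.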

The JSON Tax: Everyone outputs JSON. But {"name": "John"} is 3x the tokens of John. At scale, that's real money.
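
A rough way to see the overhead on your own records: serialize the same data as JSON and as a delimited line, then compare. The sketch below uses character count as a crude stand-in for tokens (real token counts depend on the tokenizer), and the field names, the `|` separator, and the sample record are assumptions made up for illustration.

```python
import json

# Hypothetical fixed schema, known to both the prompt and the parser.
FIELDS = ("name", "city", "plan")


def to_json(record: dict) -> str:
    return json.dumps(record)


def to_delimited(record: dict, sep: str = "|") -> str:
    # Keys are implied by position, so only the values are emitted.
    return sep.join(str(record[field]) for field in FIELDS)


def from_delimited(line: str, sep: str = "|") -> dict:
    values = line.split(sep)
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


record = {"name": "John", "city": "Pune", "plan": "pro"}
as_json = to_json(record)
as_delim = to_delimited(record)

if __name__ == "__main__":
    print(f"JSON:      {len(as_json)} chars  {as_json}")
    print(f"Delimited: {len(as_delim)} chars  {as_delim}")
```

The trade-off: delimited output only works when the schema is fixed and values cannot contain the separator. Nested or optional fields are where JSON earns its cost.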

Specialization > Scale: FlashCheck-4B beats models 100x larger because it does one thing well. Your extraction task doesn't need a model trained on Shakespeare.

Measure First, Scale Never: I've never seen a production workload where 0% of prompts could use smaller models. The number is usually 60-80%.
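
Measuring that share can be as simple as replaying a sample of real prompts through a smaller model and counting how often the output is acceptable. The sketch below is an assumption-laden illustration, not how RightSize works: `small_model_share`, the `small_model` callable, and the `is_acceptable` check are hypothetical placeholders you would replace with your own model call and your own quality bar.

```python
from typing import Callable, Iterable, Tuple


def small_model_share(
    samples: Iterable[Tuple[str, str]],
    small_model: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool],
) -> float:
    """Fraction of (prompt, reference) pairs the small model handles acceptably."""
    pairs = list(samples)
    if not pairs:
        raise ValueError("need at least one sample to measure")
    passed = sum(
        1 for prompt, reference in pairs
        if is_acceptable(small_model(prompt), reference)
    )
    return passed / len(pairs)


if __name__ == "__main__":
    # Stub model and exact-match check, for demonstration only.
    demo = [("spam?: win a prize", "spam"), ("spam?: lunch at 1", "ham")]
    stub = lambda prompt: "spam" if "prize" in prompt else "ham"
    print(small_model_share(demo, stub, lambda out, ref: out == ref))
```

Exact match is the strictest possible check; for free-form outputs a task-specific comparison is usually needed.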


Background

  • Global #1 on HackerRank in Python
  • 9+ years building high-load backend systems
  • Previously: Engineering Consultant at Dun & Bradstreet, Senior Software Engineer at Nykaa
  • Built verification systems processing 2M+ daily requests
  • Outperformed Microsoft Presidio SOTA benchmarks by 10.19% F1 on PII detection

Current Focus

  • Shipping specialized models that outperform frontier on narrow tasks
  • Building tools that make right-sizing painless
  • Writing about what actually matters in production AI
  • Helping enterprises stop burning money on over-provisioned inference

Get in Touch

Building AI and watching your inference bill climb? Let's talk.

📧 aazar@nehmeailabs.com

🌐 nehmeailabs.com

💼 LinkedIn


"That's not a flex. That's a $50K/month cloud bill waiting to happen." (on teams using GPT-4 for everything)

Pinned

  1. NehmeAILabs/llm-sanity-checks Public

    37 2

  2. surf Public

    Python 23 4

  3. mailsleuth Public

    MailSleuth is an extremely quick and efficient email OSINT (Open Source Intelligence) tool designed to check the presence of email addresses across various social media platforms and other web serv…

    Go 11 3

  4. breach-parse-rs Public

    Rust 2

  5. vulnhunter Public

    Go