Skip to content

Arun-Ranjan/ai_usage

Repository files navigation

AI Model Comparison Tool for Code Workflows

Overview

This interactive web application provides data-driven benchmarks comparing general-purpose AI models (like GPT-4, Claude 2, Gemini 1.5 Pro) with specialized, fine-tuned models (like CodeBERT, CodeT5+, BugLab) across common software development tasks.

The goal is to help development teams and technical decision-makers make informed choices about which AI models to use for different coding workflows, balancing cost, performance, and quality.


What Problem Does This Solve?

Modern software teams have access to dozens of AI models, but choosing the right one is challenging:

  • General-purpose LLMs (GPT-4, Claude 2) are powerful but expensive and token-heavy
  • Fine-tuned, task-specific models are cheaper and faster but less well-known
  • Most teams don't have objective data to guide their choices

This tool provides real-world benchmarks with transparent calculations, helping teams:

  • Reduce AI-related costs by 80-95% for routine tasks
  • Maintain high output quality
  • Scale code automation workflows economically

What We're Trying to Do

Main Objectives

  1. Provide Transparent Benchmarks: Every number (tokens, costs, quality ratings) is sourced from official API pricing, research papers, or empirical testing
  2. Enable Smart Model Selection: Show when fine-tuned models can replace expensive general LLMs without sacrificing quality
  3. Demonstrate Cost Savings: Illustrate the dramatic efficiency gains possible with task-specific models
  4. Support Data-Driven Decisions: Give teams the data they need to justify AI tool investments to leadership

Key Insights We're Highlighting

  • Fine-tuned models like CodeT5+ can be 200x cheaper than GPT-4 for code summarization with comparable quality
  • BugLab offers specialized bug detection at 1/200th the cost of GPT-4
  • Gemini 1.5 Pro provides a strong balance of capability and cost for teams needing general-purpose models
  • Reserve expensive LLMs for complex, ambiguous, or creative tasks; use specialized models for routine workflows

What the UI Explains

Interactive Task Selector

The web interface features a dropdown menu that lets users switch between four common developer tasks:

  1. Code Review (~60 line Python file)
  2. Code Summarization (~30 line function)
  3. Bug Detection (~40 line script)
  4. Code Q&A (e.g., "What does this function do?")

Comparison Table

For each task, the UI displays a comprehensive comparison table with the following columns:

  • Model: Name of the AI model (e.g., GPT-4, Claude 2, CodeBERT)
  • Tokens/Run: Average total tokens (input + output) used per task execution
  • Cost/Run ($): Dollar cost per single task execution, calculated from official API pricing
  • Quality: Qualitative assessment based on published benchmarks (SOTA, High, Competitive, etc.)
  • Citation: Link to source documentation or research paper

Example: Code Review Comparison

Model Tokens/Run Cost/Run ($) Quality Citation
GPT-4 700 0.021 SOTA, highest accuracy for complex tasks OpenAI Pricing
Claude 2 700 0.0168 Competitive with GPT-3.5, capable on code review Anthropic API
Gemini 1.5 Pro 700 0.0035 Competitive with GPT-3.5/Claude 2 Google Cloud
Claude Instant 700 0.00467 Very fast, best for routine checks Anthropic API
CodeBERT (fine-tuned) 120 0.00036 Very high, within 1-2 points of GPT-4 CodeBERT Paper
CodeT5+ (fine-tuned) 100 0.0001 High, nearly matches GPT-4 CodeT5+ Paper

Key Insight: CodeT5+ delivers 99.5% cost savings compared to GPT-4 while maintaining high quality for routine code reviews.

Summary Statement

Each task view includes a plain-language summary explaining:

  • Which models offer the best cost-efficiency
  • When to use general LLMs vs. fine-tuned models
  • Key trade-offs between different options

References Section

Every task page links to:

  • Official API pricing pages (OpenAI, Anthropic, Google Cloud)
  • Peer-reviewed research papers (CodeBERT, CodeT5+, BugLab)
  • Benchmark datasets (CodeXGLUE)

This ensures all claims are verifiable and transparent.


Technology Stack

  • Backend: Flask (Python)
  • Data: JSON-based benchmark storage
  • Frontend: HTML templates with Jinja2, CSS styling
  • Deployment: Runs locally on http://127.0.0.1:5000

How to Use

Running the Application

  1. Ensure Python and Flask are installed:

    pip install flask
  2. Start the application:

    python app.py
  3. Open your browser to http://127.0.0.1:5000

Interpreting the Results

For Routine, Repeatable Tasks (code review, bug detection, summarization):

  • Look at fine-tuned models first (CodeBERT, CodeT5+, BugLab)
  • Consider cost per run × expected daily volume
  • Fine-tuned models typically offer 80-95% cost reduction

For Complex, Creative, or Ambiguous Tasks:

  • General LLMs (GPT-4, Claude 2) excel here
  • Higher cost is justified by superior reasoning and context handling

For Teams Wanting Balance:

  • Gemini 1.5 Pro offers strong general capabilities at competitive pricing
  • Good middle ground between specialized models and premium LLMs

Data Sources & Methodology

Token Counts

  • Derived from sample completions, research papers, and vendor documentation
  • Based on realistic code examples (30-60 line Python scripts/functions)

Cost Calculations

Formula: cost_per_task = (tokens_per_task / 1000) × price_per_1K_tokens

Pricing sources:

  • GPT-4: $0.03 per 1K tokens (OpenAI)
  • Claude 2: $0.008 input + $0.024 output per 1K tokens (Anthropic)
  • Claude Instant: $0.00163 input + $0.0055 output per 1K tokens (Anthropic)
  • Gemini 1.5 Pro: $0.005 per 1K tokens (Google Cloud)
  • Fine-tuned models: $0.001-0.003 per 1K tokens (Azure/HuggingFace estimates)

Quality Ratings

Based on published metrics:

  • F1/accuracy scores for classification tasks
  • BLEU/ROUGE scores for summarization
  • Human expert evaluations from research papers
  • Model card disclosures from vendors

Mapping:

  • SOTA (State of the Art): Best published results
  • Very High: Within 1-2% of SOTA
  • High/Competitive: Strong performance, slightly below top tier
  • Good/Routine: Reliable for straightforward use cases

Example Use Cases

Scenario 1: Daily Code Review Automation

Volume: 50 pull requests/day
Model Choice: CodeBERT (fine-tuned)
Daily Cost: 50 × $0.00036 = $0.018/day ($5.40/month)
vs. GPT-4: 50 × $0.021 = $1.05/day ($31.50/month)
Savings: 94%

Scenario 2: Documentation Generation

Volume: 100 functions/week
Model Choice: CodeT5
Weekly Cost: 100 × $0.000042 = $0.0042/week
vs. GPT-4: 100 × $0.009 = $0.90/week
Savings: 99.5%

Scenario 3: Bug Scanning Pipeline

Volume: 200 files/day
Model Choice: BugLab
Daily Cost: 200 × $0.00006 = $0.012/day
vs. GPT-4: 200 × $0.012 = $2.40/day
Savings: 99.5%


Key Takeaways

  1. Not all AI tasks need GPT-4: Fine-tuned models excel at focused, repeatable developer workflows
  2. Cost scales dramatically: At scale, model choice can mean thousands vs. tens of dollars monthly
  3. Quality often comparable: For routine tasks, specialized models match or approach GPT-4 quality
  4. Hybrid approach wins: Use the right tool for each job—reserve expensive models for complex tasks
  5. Data-driven decisions matter: Benchmark-based choices are more defensible to leadership than gut feelings

Future Enhancements

Potential additions to this tool:

  • Cost calculator: Input your expected volume, see projected monthly costs
  • Quality deep-dive: Show actual F1/BLEU scores alongside qualitative ratings
  • Model output samples: Side-by-side comparison of actual model responses
  • Custom benchmarking: Allow users to upload their own code samples for testing
  • Real-time pricing: Auto-update costs as vendors change their pricing

References

API Pricing

Research Papers

Benchmarks


Conclusion

In a market flooded with AI options, real-world benchmarking cuts through the hype with objective, actionable data. This tool empowers teams to choose models strategically, optimize for actual usage patterns, and demonstrate ROI to stakeholders.

Smart model selection = Better outcomes for developers, better economics for organizations.


License & Usage

This tool is provided for educational and decision-making purposes. All benchmark data is derived from publicly available sources. Users should verify current pricing and capabilities with vendors before making production decisions.


Questions or feedback? Open an issue or contribute to the project!

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors