Blog

Practical guides for working with AI models


How to Choose the Right LLM for Your Project

Choosing an LLM isn't about finding the "best" model—it's about finding the best model for your specific use case. A chatbot needs different characteristics than a code completion tool or a document summarizer.

Start With Your Constraints

Before comparing models, define your requirements:

  • Latency tolerance: Real-time chat needs sub-200ms time-to-first-token. Batch processing can wait.
  • Budget: A hobby project and an enterprise product have different cost sensitivities.
  • Context needs: Processing long documents requires 100k+ context. Simple Q&A works with 8k.
  • Accuracy requirements: Medical or legal applications need higher accuracy than creative writing.

The Decision Framework

For Real-Time Applications (Chat, Autocomplete)

Prioritize latency and throughput. Users notice delays over 300ms. Look for models with sub-100ms latency and high tokens-per-second throughput. Smaller, faster models often beat larger ones for UX.

For Batch Processing (Analysis, Summarization)

Prioritize cost and accuracy. Latency doesn't matter when processing overnight. Optimize for the lowest cost-per-token that meets your quality bar.

For Code Generation

Prioritize code-specific training. Models trained on code repositories outperform general models. Look for models with "code" or "coder" in the name—they're optimized for programming tasks.

For Vision Tasks

Filter to multimodal models. Not all LLMs can process images. You need models that explicitly support image input.

Compare models by your priorities

Filter by speed, latency, price, and capabilities to find models that match your requirements.

Browse Models

Practical Tips

  • Test with real data. Benchmarks don't tell the whole story. Run your actual prompts through candidate models.
  • Consider provider reliability. Some providers have better uptime than others. Check if a model is available from multiple providers for redundancy.
  • Start cheap, upgrade as needed. Begin with a budget model and only upgrade if quality is genuinely insufficient—cheap models handle more tasks than you might expect.
  • Watch for context limits. A model's stated context length is the max. Actual useful context is often lower due to attention degradation.

LLM Pricing Explained: Understanding Token Costs

LLM APIs charge by the token—but what does that actually mean for your costs? Here's how pricing works and how to estimate what you'll pay.

What's a Token?

A token is roughly 4 characters or 0.75 words in English. The sentence "Hello, how are you today?" is about 7 tokens. Code typically uses more tokens per line than prose because of syntax characters.

Rule of thumb: 1,000 tokens ≈ 750 words ≈ 1-2 pages of text.
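The rule of thumb above can be turned into a quick estimator. This is an approximation only—actual counts depend on the model's tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token in English.
    Real counts vary by model and tokenizer."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(word_count: int) -> int:
    """Alternative estimate: ~0.75 words per token."""
    return round(word_count / 0.75)
```

Use these for budgeting, not billing—always check actual usage reported by the API.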

Input vs Output Pricing

Most models charge differently for input (your prompt) and output (the response):

  • Input tokens: What you send (system prompt, user message, context)
  • Output tokens: What the model generates

Output tokens are typically 2-5x more expensive than input tokens. This means long responses cost more than long prompts.

Example Cost Calculation

For a model priced at $1/M input, $3/M output:

  • You send 500 tokens (input): $0.0005
  • Model responds with 200 tokens (output): $0.0006
  • Total cost per request: $0.0011

At 10,000 requests per day, that's $11/day or ~$330/month.
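That arithmetic is easy to script. A minimal sketch using the example prices above:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request in dollars, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# The example above: $1/M input, $3/M output
cost = request_cost(500, 200, input_price_per_m=1.0, output_price_per_m=3.0)
daily = cost * 10_000  # 0.0011 * 10,000 = $11/day
```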

Cost Reduction Strategies

1. Shorten Your Prompts

Every token in your system prompt is charged on every request. Trim unnecessary instructions. Be concise.

2. Limit Output Length

Set max_tokens to prevent runaway responses. If you need 100 words, don't allow 1,000.
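Most chat-completion APIs accept a cap on response length in the request body. A sketch of an OpenAI-style payload—the exact field name varies by provider (e.g. `max_tokens`, `max_output_tokens`, or `max_completion_tokens`), and the model name here is a placeholder:

```python
# OpenAI-style request body; field names vary by provider.
payload = {
    "model": "example-model",  # placeholder, not a real model name
    "messages": [
        {"role": "user", "content": "Summarize this article in 100 words."}
    ],
    # ~100 words is ~135 tokens; 150 leaves headroom while
    # hard-capping what you can be billed for output.
    "max_tokens": 150,
}
```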

3. Use Cheaper Models for Simple Tasks

Route simple queries to smaller, cheaper models. Save expensive models for complex reasoning.

4. Cache Common Responses

If users ask similar questions, cache responses instead of re-querying the API.
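A minimal in-process cache keyed on a normalized prompt—a sketch only; production systems would typically add a TTL, a size bound, and possibly semantic matching:

```python
import hashlib

_cache: dict[str, str] = {}

def cached_completion(prompt: str, call_api) -> str:
    """Return a cached response for repeated prompts; otherwise call
    the API (passed in as `call_api`) and store the result."""
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_api(prompt)
    return _cache[key]
```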

Compare pricing across models

Sort by input/output price to find the most cost-effective options for your use case.

View Pricing

Hidden Costs to Watch

  • System prompts multiply. A 500-token system prompt sent 10,000 times = 5M input tokens.
  • Context accumulation. Chat apps that send full history get expensive fast. Consider summarization.
  • Retries. Failed requests that you retry still cost money for the input tokens sent.
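The first point is worth quantifying. A quick sketch of what re-sending a fixed system prompt costs over a month:

```python
def system_prompt_monthly_cost(prompt_tokens: int, requests_per_day: int,
                               input_price_per_m: float, days: int = 30) -> float:
    """Dollars per month spent re-sending the same system prompt."""
    total_tokens = prompt_tokens * requests_per_day * days
    return total_tokens * input_price_per_m / 1_000_000

# 500-token system prompt, 10,000 requests/day, $1/M input:
# 150M input tokens/month -> $150/month on the system prompt alone
```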

LLMs for Coding: What Makes a Good Code Model

Not all LLMs are good at code. Models specifically trained on programming tasks consistently outperform general-purpose models for coding work.

What Makes Code Models Different

Code-focused models are trained on:

  • Large code repositories: GitHub, GitLab, open-source projects
  • Programming documentation: API docs, tutorials, Stack Overflow
  • Commit histories: Understanding how code changes over time

This specialized training means they understand syntax, patterns, and conventions that general models miss.

Key Factors for Code Models

Language Coverage

Models trained on more languages handle edge cases better. Check if your primary language is well-represented in the training data.

Context Length

Coding often requires understanding large files or multiple files at once. Models with longer context windows can hold more code in memory.

Speed for Autocomplete

If you're building an IDE plugin or autocomplete feature, latency matters more than anything. A slower, smarter model creates a worse UX than a faster, slightly less accurate one.

Instruction Following

Good code models follow specific instructions: "refactor this function," "add error handling," "write tests for this class." They don't just complete—they transform.

Find code-optimized models

Use the coding filter to find models specifically trained for programming tasks.

Browse Code Models

Practical Recommendations

  • For autocomplete: Prioritize speed. Sub-100ms latency with decent accuracy beats slow perfection.
  • For code review: Prioritize accuracy. You can wait a second for better suggestions.
  • For generation: Balance both. Users expect reasonable speed and good output.

Testing Code Models

Don't rely on benchmarks alone. Test with your actual codebase:

  • Can it understand your project's conventions?
  • Does it use your existing utilities or reinvent them?
  • Are suggestions syntactically correct in your language?

LLM Speed vs Cost: Finding the Right Balance

Speed and cost often pull in opposite directions: the lowest-latency serving of a capable model usually commands a premium. But "faster is better" isn't always true—the right choice depends on your application.

When Speed Matters

Real-Time User Interfaces

Chat interfaces, autocomplete, and interactive tools need fast responses. Users perceive delays over 300ms as sluggish. For these cases, invest in low-latency models.

High-Volume Streaming

When you're showing tokens as they generate, throughput (tokens per second) determines how fluid the experience feels. Aim for 50+ tokens/second for smooth streaming.

Time-Sensitive Workflows

If your pipeline has humans waiting—like AI-assisted customer support—delays compound. Fast models keep humans productive.

When Cost Matters More

Batch Processing

Processing 10,000 documents overnight? Nobody's watching. Use the cheapest model that meets quality requirements. A 10x cheaper model saves significant money at scale.

Background Tasks

Email categorization, content moderation, data extraction—if users don't see it happening, optimize for cost.

Development and Testing

You'll run thousands of test queries while building. Use cheap models for development, expensive ones for production.

Sort by what matters to you

Filter models by throughput for speed or by price for cost efficiency.

Compare Models

The Hybrid Approach

Smart systems use multiple models:

  • Router pattern: Classify incoming requests, route simple ones to cheap models, complex ones to capable models.
  • Cascade pattern: Try the cheap model first. If confidence is low, escalate to the expensive one.
  • Task-specific: Different endpoints for different tasks, each using an appropriate model.

These patterns can reduce costs 50-80% while maintaining quality where it matters.
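The cascade pattern can be sketched in a few lines. Everything here is a placeholder—real systems might judge confidence with log-probabilities, a heuristic, or a small verifier model:

```python
def cascade(prompt: str, cheap_model, strong_model, confident) -> str:
    """Try the cheap model first; escalate to the expensive model
    only when the cheap answer looks unreliable. All three
    callables are placeholders for your actual clients/checks."""
    draft = cheap_model(prompt)
    if confident(draft):
        return draft
    return strong_model(prompt)
```

The router pattern is the same idea inverted: classify the request first, then dispatch to one model, paying for only a single call.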


Free AI APIs: What's Available

You don't need a budget to start building with AI. Several options let you prototype and even run low-volume production without paying for API access.

Types of Free Access

Free Tiers

Most providers offer limited free usage—typically $5-20 worth of credits or a few thousand requests per month. Good for prototyping and learning.

Open-Source Models

Models like Llama, Mistral, and others are free to run yourself. You pay for compute (or use your own hardware) instead of per-token API costs.

Rate-Limited Free APIs

Some providers offer free access with rate limits. Fine for development, not for production traffic.

What to Watch Out For

  • Rate limits: Free tiers often limit requests per minute or day. Plan around this.
  • Model restrictions: The best models usually aren't in free tiers. You get capable but not cutting-edge.
  • No SLA: Free access means no uptime guarantees. Don't build production systems on free tiers.
  • Data policies: Some free tiers use your data for training. Check the terms.

Filter for free models

Use the "Free" toggle to find models with no per-token costs.

Find Free Models

Making Free Work for You

  • Prototype fast: Validate your idea with free APIs before committing budget.
  • Cache aggressively: Store responses for common queries to stay under limits.
  • Use locally: For development, run smaller open-source models on your machine.
  • Plan for paid: Build with the assumption you'll upgrade. Don't lock yourself into free-tier limitations.

Understanding LLM Latency and Throughput

Latency and throughput are the two key performance metrics for LLMs—but they measure different things and matter in different situations.

Latency: Time to First Token

What it measures: How long until the model starts responding.

Latency includes network time, queue time, and the model's processing time before generating the first token. It's measured in milliseconds.

  • Sub-100ms: Feels instant. Ideal for autocomplete.
  • 100-300ms: Responsive. Good for chat.
  • 300-500ms: Noticeable delay. Acceptable for complex queries.
  • 500ms+: Feels slow. Only acceptable for batch processing.
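Time-to-first-token is easy to measure yourself around any streaming client. A sketch—`stream` stands in for whatever chunk iterator your SDK returns:

```python
import time

def time_to_first_token(stream) -> float:
    """Seconds until the stream yields its first chunk.
    `stream` is any iterator of response chunks (SDK-specific)."""
    start = time.perf_counter()
    next(iter(stream))  # block until the first chunk arrives
    return time.perf_counter() - start
```

Measure from your production region and at your production request sizes—numbers from a benchmark page won't match your deployment exactly.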

Throughput: Tokens Per Second

What it measures: How fast the model generates output once it starts.

Throughput determines how quickly a response completes. For streaming responses, it's how fast text appears.

  • 100+ t/s: Faster than reading speed. Feels instant.
  • 50-100 t/s: Comfortable reading pace.
  • 20-50 t/s: Noticeably slow streaming.
  • Under 20 t/s: Painfully slow for streaming.

Which Matters When

Latency-Critical Use Cases

  • Autocomplete (IDE, search)
  • Real-time chat
  • Voice assistants

Throughput-Critical Use Cases

  • Streaming long responses
  • Batch processing (total time = tokens / throughput)
  • Document generation
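For batch jobs, total generation time follows directly from throughput. A quick sketch (the document counts and rates below are illustrative):

```python
def batch_generation_time(total_output_tokens: int, tokens_per_second: float,
                          parallel_requests: int = 1) -> float:
    """Approximate wall-clock seconds to generate a batch, ignoring
    per-request latency (reasonable when outputs are long)."""
    return total_output_tokens / (tokens_per_second * parallel_requests)

# 10,000 documents x 300 output tokens at 50 t/s, 10 concurrent requests:
# 3,000,000 / 500 = 6,000 seconds (~1.7 hours)
```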

Compare real-time performance

See live latency and throughput metrics across all models.

View Benchmarks

Why Metrics Vary

The same model can show different numbers because of:

  • Provider infrastructure: Different providers run the same model on different hardware.
  • Load: Busy servers mean higher latency and lower throughput.
  • Request size: Longer prompts take longer to process.
  • Time of day: Peak hours see more congestion.

This is why we show metrics by provider—the same model performs differently depending on where it's hosted.