Foundry Local - Run Real AI Without Paying the Cloud Tax

February 5, 2026

AI demos are great until the bill shows up thirty days later.

If you’ve shipped anything using OpenAI, Azure OpenAI, or Copilot Studio, you know the reality: Your app scales faster than your budget.

Foundry Local attempts to fix this. It allows you to run production-grade LLMs, embeddings, and agents locally. No Azure subscription, no invoices, no rate limits, and no vendor lock-in.

You still use the Azure AI Foundry APIs and SDKs. You just swap the runtime. It is cloud-native AI, minus the cloud bill.

What is Foundry Local?

Foundry Local is Microsoft’s local execution runtime for the Azure AI Foundry stack.

It provides local execution for LLMs, embeddings, and agents, exposed through OpenAI-compatible REST endpoints and the same Azure AI Foundry APIs and SDKs you already use.

It runs on your laptop, dev box, or on-prem server. Build once, run anywhere.

Why Use It?

1. Zero inference cost

No per-token fees. Your GPU is the cloud.

2. Regulated & offline workloads

Healthcare, manufacturing, defense, and finance become viable targets when you don’t need a constant internet connection to Redmond.

3. Full parity with Azure AI Foundry

Same code. Same patterns. Different runtime.

4. Model Control

Swap Phi → Llama → Mistral → Qwen without rewriting your business logic.

Mental Model

The architecture is simple:

Your App -> Foundry SDK / REST -> Foundry Runtime (Local) -> LLM + Embeddings + Vector Store + Tools

Getting Started

1. Install Foundry Local

brew install foundry
# or
winget install microsoft.foundry

Verify the install:

foundry --version

2. Start the Runtime

foundry local start

3. Run a Model

foundry model run qwen2.5-0.5b

Check the service status to get your endpoint:

foundry service status

Example output: http://127.0.0.1:52733

This is your FOUNDRY_ENDPOINT.
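One way to pick that value up in code without hardcoding the port (a sketch; FOUNDRY_ENDPOINT is an environment variable you would export yourself, not something the CLI sets for you) is:

// Sketch: read the endpoint from an environment variable, falling back to the
// value reported by `foundry service status`.
var endpoint = Environment.GetEnvironmentVariable("FOUNDRY_ENDPOINT")
               ?? "http://127.0.0.1:52733";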

The Implementation

Most “AI” features boil down to three things: chat completions, embeddings, and retrieval-augmented generation (RAG).

Since Foundry Local exposes OpenAI-compatible REST endpoints, your existing code usually just works.

Chat Completions (C#)

appsettings.json

{
  "Foundry": {
    "Endpoint": "http://127.0.0.1:52733",
    "Model": "qwen2.5-0.5b"
  }
}

DI + HttpClient

public sealed class FoundryOptions
{
    public required string Endpoint { get; init; }
    public required string Model { get; init; }
}

builder.Services.Configure<FoundryOptions>(
    builder.Configuration.GetSection("Foundry"));

builder.Services.AddHttpClient("Foundry", (sp, client) =>
{
    var opt = sp.GetRequiredService<IOptions<FoundryOptions>>().Value;
    client.BaseAddress = new Uri(opt.Endpoint);
});
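The calls below live in a small service class that resolves the named client and the options registered above. The original wiring isn't shown, so this is a hedged sketch; FoundryClient is a hypothetical name:

// Hypothetical service class that owns the _http and _opt fields used in the calls below.
public sealed class FoundryClient
{
    private readonly HttpClient _http;
    private readonly FoundryOptions _opt;

    public FoundryClient(IHttpClientFactory factory, IOptions<FoundryOptions> options)
    {
        _http = factory.CreateClient("Foundry"); // the named client registered above
        _opt = options.Value;
    }

    // AskAsync, EmbedAsync, and RagAsync (shown below) are members of this class.
}

builder.Services.AddScoped<FoundryClient>();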

Chat call (/v1/chat/completions)

public async Task<string> AskAsync(string prompt)
{
    var req = new
    {
        model = _opt.Model,
        messages = new[]
        {
            new { role = "system", content = "You are a helpful assistant." },
            new { role = "user", content = prompt }
        },
        temperature = 0.2
    };

    var resp = await _http.PostAsJsonAsync("/v1/chat/completions", req);
    resp.EnsureSuccessStatusCode();

    var json = await resp.Content.ReadFromJsonAsync<ChatCompletionsResponse>();
    return json?.choices?.FirstOrDefault()?.message?.content ?? "";
}
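ChatCompletionsResponse isn't defined above. A minimal DTO matching the OpenAI-compatible response shape, with only the fields the call reads, would look like this:

// Minimal chat-completions response DTOs — only the fields read by AskAsync.
public sealed class ChatCompletionsResponse
{
    public List<Choice>? choices { get; set; }

    public sealed class Choice
    {
        public Message? message { get; set; }
    }

    public sealed class Message
    {
        public string? role { get; set; }
        public string? content { get; set; }
    }
}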

Embeddings

public async Task<float[]> EmbedAsync(string text)
{
    var req = new
    {
        model = "phi-embed",
        input = text
    };

    var resp = await _http.PostAsJsonAsync("/v1/embeddings", req);
    resp.EnsureSuccessStatusCode();

    var json = await resp.Content.ReadFromJsonAsync<EmbeddingsResponse>();
    return json?.data?.FirstOrDefault()?.embedding ?? Array.Empty<float>();
}
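Likewise, EmbeddingsResponse is a small DTO shaped after the OpenAI-compatible embeddings payload; a minimal sketch:

// Minimal embeddings response DTO — only the fields read by EmbedAsync.
public sealed class EmbeddingsResponse
{
    public List<Item>? data { get; set; }

    public sealed class Item
    {
        public float[]? embedding { get; set; }
    }
}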

Once you have vectors, you have semantic search, clustering, and RAG.

Local RAG

public async Task<string> RagAsync(string question, IReadOnlyList<string> chunks)
{
    var qVec = await EmbedAsync(question);

    // Embed every chunk up front; blocking on .Result inside a Select risks deadlocks.
    var chunkVecs = await Task.WhenAll(chunks.Select(EmbedAsync));

    var ranked = chunks
        .Zip(chunkVecs, (chunk, vec) => (chunk, score: CosineSimilarity(qVec, vec)))
        .OrderByDescending(x => x.score)
        .Take(4)
        .Select(x => x.chunk);

    var context = string.Join("\n---\n", ranked);

    return await AskAsync($"""
    Use the following context to answer the question:

    {context}

    Question: {question}
    """);
}
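CosineSimilarity isn't shown above; a minimal implementation (assuming both vectors have the same length) looks like this:

// Standard cosine similarity: dot(a, b) / (|a| * |b|), with a small epsilon to avoid divide-by-zero.
private static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, magA = 0, magB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB) + 1e-12);
}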

Two endpoints. Zero cloud.

Native API vs REST

Foundry Local also supports a native .NET API that skips HTTP entirely for better performance, but REST is easier for demos (curl + Postman). I use REST for prototyping and the native API for production.

Working Examples

I use this pattern in my meetup demos. You can view the source code here:

github.com/Alec13355/JanDotNetUG

The repo includes local LLM demos, embedding pipelines, and RAG patterns.

Summary

Foundry Local is what cloud AI should have been: portable, predictable, and controlled by the engineer.

You choose where it runs. You choose the models. You control the costs. This is how you build systems that don’t expire when your credits run out.