LM Studio is a desktop application for running large language models locally with a user-friendly interface. It provides an OpenAI-compatible API, making it easy to integrate with the ngrok AI Gateway.

Prerequisites

Overview

Since LM Studio runs locally on HTTP, you’ll expose it through an ngrok internal endpoint, then configure the AI Gateway to route requests to it.

Getting started

1. Download a model

Download a model using the GUI:
  1. Open LM Studio
  2. Go to the Discover tab (or press Ctrl+2 on Windows/Linux, ⌘+2 on Mac)
  3. Search for a model (for example, llama-3.2-3b-instruct)
  4. Choose a quantization level (Q4 or higher recommended) and click Download
See the LM Studio download guide for more details on choosing the right model and quantization level.
2. Start LM Studio's local server

Start the server from the GUI:
  1. Go to the Developer tab in LM Studio
  2. Select the model you want to serve
  3. Click Start Server
By default, LM Studio runs on port 1234. Verify the server is running:
curl http://localhost:1234/v1/models
3. Expose LM Studio with ngrok

Use the ngrok agent to create an internal endpoint:
ngrok http 1234 --url https://lm-studio.internal
Internal endpoints (.internal domains) are private to your ngrok account. They’re not accessible from the public internet.
4. Configure the AI Gateway

Create a Traffic Policy with LM Studio as a provider:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          models:
            - id: "llama-3.2-3b-instruct"
            - id: "mistral-7b-instruct-v0.3"
            - id: "qwen2.5-coder-7b-instruct"
LM Studio doesn’t require API keys, so you can omit the api_keys field entirely.
The model ID should match the identifier shown in LM Studio. You can find it by calling GET /v1/models or checking the model details in the app.
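For example, a quick way to see exactly which IDs LM Studio reports is to point the OpenAI SDK directly at the local server and list them (a minimal sketch; the placeholder API key is ignored by LM Studio):
from openai import OpenAI

# Talk to LM Studio directly (not through the gateway) to inspect available models
local = OpenAI(base_url="http://localhost:1234/v1", api_key="unused")

for model in local.models.list():
    print(model.id)  # Use these IDs verbatim in your Traffic Policy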
5. Use with OpenAI SDK

Point any OpenAI-compatible SDK at your AI Gateway:
from openai import OpenAI

client = OpenAI(
    base_url="https://your-ai-subdomain.ngrok.app/v1",
    api_key="unused"  # LM Studio doesn't need a key
)

response = client.chat.completions.create(
    model="lm-studio:llama-3.2-3b-instruct",  # Prefix with provider ID
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
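Streaming works through the same interface, which is useful for local models where tokens can arrive slowly. A minimal sketch, assuming the gateway passes streamed responses through to LM Studio unchanged:
stream = client.chat.completions.create(
    model="lm-studio:llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)

for chunk in stream:
    # Print each token delta as it arrives
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)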

Advanced configuration

Restrict to LM Studio only

Block requests to cloud providers and only allow LM Studio:
on_http_request:
  - type: ai-gateway
    config:
      only_allow_configured_providers: true
      only_allow_configured_models: true
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          models:
            - id: "llama-3.2-3b-instruct"
            - id: "mistral-7b-instruct-v0.3"

Failover to cloud provider

Use LM Studio as primary with automatic failover to OpenAI:
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          models:
            - id: "llama-3.2-3b-instruct"
        
        - id: "openai"
          api_keys:
            - value: ${secrets.get('openai', 'api-key')}
      
      model_selection:
        strategy:
          - "ai.models.filter(m, m.provider_id == 'lm-studio')"
          - "ai.models.filter(m, m.provider_id == 'openai')"
The first strategy that returns models wins. If LM Studio has matching models, only those are tried. OpenAI is only used if no LM Studio models match. For cross-provider failover when requests fail, have clients specify multiple models: models: ["lm-studio:llama-3.2-3b-instruct", "openai:gpt-4o"].
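With the OpenAI SDK, that models list can be sent through extra_body, since it isn't a standard OpenAI field; a sketch (how the gateway reconciles it with the model field is an assumption here):
response = client.chat.completions.create(
    model="lm-studio:llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    # Candidate models in failover order, as described above
    extra_body={"models": ["lm-studio:llama-3.2-3b-instruct", "openai:gpt-4o"]},
)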

Increase timeouts

Local models can be slower, especially on first load. Increase timeouts as needed:
on_http_request:
  - type: ai-gateway
    config:
      per_request_timeout: "120s"
      total_timeout: "5m"
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          models:
            - id: "llama-3.2-3b-instruct"

Add model metadata

Track model details with metadata:
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          metadata:
            location: "local"
            hardware: "Apple M2 Pro"
          models:
            - id: "llama-3.2-3b-instruct"
              metadata:
                parameters: "3B"
                quantization: "Q4_K_M"
            - id: "qwen2.5-coder-7b-instruct"
              metadata:
                parameters: "7B"
                use_case: "coding"

Use embeddings

LM Studio supports the /v1/embeddings endpoint for embedding models:
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          models:
            - id: "nomic-embed-text-v1.5"
# Reuse the gateway client from the SDK example above
response = client.embeddings.create(
    model="lm-studio:nomic-embed-text-v1.5",
    input="The quick brown fox jumps over the lazy dog"
)
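The response contains one embedding vector per input. As a quick usage sketch, here's a comparison of two embeddings with cosine similarity (plain Python, no extra dependencies):
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

result = client.embeddings.create(
    model="lm-studio:nomic-embed-text-v1.5",
    input=["The quick brown fox", "A fast auburn fox"],
)

print(cosine_similarity(result.data[0].embedding, result.data[1].embedding))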

Troubleshooting

Connection refused

Symptom: Requests fail with connection errors.
Solutions:
  1. Verify the LM Studio server is running: the Developer tab should show “Server running”
  2. Verify the server port: the default is 1234; check LM Studio settings if yours differs
  3. Verify the ngrok tunnel is running: look for https://lm-studio.internal in your ngrok dashboard
  4. Ensure the internal endpoint URL matches your config

Model not found

Symptom: Requests fail with an error saying the model doesn’t exist.
Solutions:
  1. List available models: curl http://localhost:1234/v1/models
  2. Verify the model is loaded in LM Studio (check the Developer tab)
  3. Ensure the model ID matches exactly what LM Studio reports

Slow first response

Symptom: The first request takes a very long time.
Cause: LM Studio loads models into memory on first use.
Solutions:
  1. Increase per_request_timeout to allow for model loading
  2. Pre-load the model by selecting it in LM Studio before starting the server
  3. Enable “Keep model in memory” in LM Studio settings
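Another option is to warm the model yourself: send a small throwaway request once the server and tunnel are up, so the load cost is paid before real traffic arrives. A sketch of that idea (the model ID and max_tokens value are just examples):
# One-off warm-up request to trigger model loading
client.chat.completions.create(
    model="lm-studio:llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)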

Out of memory

Symptom: LM Studio crashes or returns errors for large models.
Solutions:
  1. Use a smaller model or a more heavily quantized version (for example, Q4 instead of Q8)
  2. Close other applications to free up RAM
  3. Adjust GPU layers in LM Studio’s model settings
  4. Use CPU-only inference if GPU memory is insufficient

Server not starting

Symptom: The LM Studio server won’t start.
Solutions:
  1. Check if port 1234 is already in use: netstat -an | grep 1234
  2. Try a different port in LM Studio settings
  3. Restart LM Studio

Next steps