LM Studio is a desktop application for running large language models locally with a user-friendly interface. It provides an OpenAI-compatible API, making it easy to integrate with the ngrok AI Gateway.

Prerequisites

Overview

Since LM Studio runs locally on HTTP, you’ll expose it through an ngrok internal endpoint, then configure the AI Gateway to route requests to it.

Getting started

1. Download a model

Download a model using the GUI:
  1. Open LM Studio
  2. Go to the Discover tab (or press Ctrl+2 on Windows/Linux, ⌘+2 on Mac)
  3. Search for a model (for example, llama-3.2-3b-instruct)
  4. Choose a quantization level (Q4 or higher recommended) and click Download
See the LM Studio download guide for more details on choosing the right model and quantization level.
2. Start LM Studio's local server

Start the server from the GUI:
  1. Go to the Developer tab in LM Studio
  2. Select the model you want to serve
  3. Click Start Server
By default, LM Studio runs on port 1234. Verify the server is running:
curl http://localhost:1234/v1/models
3. Expose LM Studio with ngrok

Use the ngrok agent to create an internal endpoint:
ngrok http 1234 --url https://lm-studio.internal
Internal endpoints (.internal domains) are private to your ngrok account. They’re not accessible from the public internet.
4. Configure the AI Gateway

Create a Traffic Policy with LM Studio as a provider:
policy.yaml
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          models:
            - id: "llama-3.2-3b-instruct"
            - id: "mistral-7b-instruct-v0.3"
            - id: "qwen2.5-coder-7b-instruct"
LM Studio doesn’t require API keys, so you can omit the api_keys field entirely.
The model ID should match the identifier shown in LM Studio. You can find it by calling GET /v1/models or checking the model details in the app.
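For example, a quick way to see exactly which IDs LM Studio reports is to point the OpenAI SDK directly at the local server and list them (a minimal sketch; the placeholder API key is ignored by LM Studio):
from openai import OpenAI

# Talk to LM Studio directly (not through the gateway) to inspect available models
local = OpenAI(base_url="http://localhost:1234/v1", api_key="unused")

for model in local.models.list():
    print(model.id)  # Use these IDs verbatim in your Traffic Policy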
5. Use with OpenAI SDK

Point any OpenAI-compatible SDK at your AI Gateway:
from openai import OpenAI

client = OpenAI(
    base_url="https://your-ai-subdomain.ngrok.app/v1",
    api_key="unused"  # LM Studio doesn't need a key
)

response = client.chat.completions.create(
    model="lm-studio:llama-3.2-3b-instruct",  # Prefix with provider ID
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
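Streaming works through the same interface, which is useful for local models where tokens can arrive slowly. A minimal sketch, assuming the gateway passes streamed responses through to LM Studio unchanged:
stream = client.chat.completions.create(
    model="lm-studio:llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)

for chunk in stream:
    # Print each token delta as it arrives
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)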

Advanced configuration

Restrict to LM Studio only

Block requests to cloud providers and only allow LM Studio:
on_http_request:
  - type: ai-gateway
    config:
      only_allow_configured_providers: true
      only_allow_configured_models: true
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          models:
            - id: "llama-3.2-3b-instruct"
            - id: "mistral-7b-instruct-v0.3"

Failover to cloud provider

Use LM Studio as primary with automatic failover to OpenAI:
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          models:
            - id: "llama-3.2-3b-instruct"
        
        - id: "openai"
          api_keys:
            - value: ${secrets.get('openai', 'api-key')}
      
      model_selection:
        strategy:
          - "ai.models.filter(m, m.provider_id == 'lm-studio')"
          - "ai.models.filter(m, m.provider_id == 'openai')"
The first strategy that returns models wins. If LM Studio has matching models, only those are tried. OpenAI is only used if no LM Studio models match. For cross-provider failover when requests fail, have clients specify multiple models: models: ["lm-studio:llama-3.2-3b-instruct", "openai:gpt-4o"].
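With the OpenAI SDK, that models list can be sent through extra_body, since it isn't a standard OpenAI field; a sketch (how the gateway reconciles it with the model field is an assumption here):
response = client.chat.completions.create(
    model="lm-studio:llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    # Candidate models in failover order, as described above
    extra_body={"models": ["lm-studio:llama-3.2-3b-instruct", "openai:gpt-4o"]},
)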

Increase timeouts

Local models can be slower, especially on first load. Increase timeouts as needed:
on_http_request:
  - type: ai-gateway
    config:
      per_request_timeout: "120s"
      total_timeout: "5m"
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          models:
            - id: "llama-3.2-3b-instruct"

Add model metadata

Track model details with metadata:
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          metadata:
            location: "local"
            hardware: "Apple M2 Pro"
          models:
            - id: "llama-3.2-3b-instruct"
              metadata:
                parameters: "3B"
                quantization: "Q4_K_M"
            - id: "qwen2.5-coder-7b-instruct"
              metadata:
                parameters: "7B"
                use_case: "coding"

Use embeddings

LM Studio supports the /v1/embeddings endpoint for embedding models:
on_http_request:
  - type: ai-gateway
    config:
      providers:
        - id: "lm-studio"
          base_url: "https://lm-studio.internal"
          models:
            - id: "nomic-embed-text-v1.5"
# Reuse the gateway client from the SDK example above
response = client.embeddings.create(
    model="lm-studio:nomic-embed-text-v1.5",
    input="The quick brown fox jumps over the lazy dog"
)
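The response contains one embedding vector per input. As a quick usage sketch, here's a comparison of two embeddings with cosine similarity (plain Python, no extra dependencies):
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

result = client.embeddings.create(
    model="lm-studio:nomic-embed-text-v1.5",
    input=["The quick brown fox", "A fast auburn fox"],
)

print(cosine_similarity(result.data[0].embedding, result.data[1].embedding))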

Troubleshooting

Connection refused

Symptom: Requests fail with connection errors.
Solutions:
  1. Verify the LM Studio server is running: the Developer tab should show “Server running”
  2. Verify the server port: the default is 1234; check LM Studio settings if yours differs
  3. Verify the ngrok tunnel is running: look for https://lm-studio.internal in your ngrok dashboard
  4. Ensure the internal endpoint URL matches your config

Model not found

Symptom: Requests fail with an error saying the model doesn’t exist.
Solutions:
  1. List available models: curl http://localhost:1234/v1/models
  2. Verify the model is loaded in LM Studio (check the Developer tab)
  3. Ensure the model ID matches exactly what LM Studio reports

Slow first response

Symptom: The first request takes a very long time.
Cause: LM Studio loads models into memory on first use.
Solutions:
  1. Increase per_request_timeout to allow for model loading
  2. Pre-load the model by selecting it in LM Studio before starting the server
  3. Enable “Keep model in memory” in LM Studio settings
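Another option is to warm the model yourself: send a small throwaway request once the server and tunnel are up, so the load cost is paid before real traffic arrives. A sketch of that idea (the model ID and max_tokens value are just examples):
# One-off warm-up request to trigger model loading
client.chat.completions.create(
    model="lm-studio:llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)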

Out of memory

Symptom: LM Studio crashes or returns errors for large models.
Solutions:
  1. Use a smaller model or a more heavily quantized version (for example, Q4 instead of Q8)
  2. Close other applications to free up RAM
  3. Adjust GPU layers in LM Studio’s model settings
  4. Use CPU-only inference if GPU memory is insufficient

Server not starting

Symptom: The LM Studio server won’t start.
Solutions:
  1. Check if port 1234 is already in use: netstat -an | grep 1234
  2. Try a different port in LM Studio settings
  3. Restart LM Studio

Next steps