Prerequisites
- ngrok account with AI Gateway access
- LM Studio installed
- ngrok agent installed
- A model downloaded in LM Studio
Overview
Since LM Studio runs locally on HTTP, you’ll expose it through an ngrok internal endpoint, then configure the AI Gateway to route requests to it.
Getting started
1. Download a model
Download a model using the GUI or CLI:
GUI:
- Open LM Studio
- Go to the Discover tab (or press Ctrl+2 on Windows/Linux, ⌘+2 on Mac)
- Search for a model (for example, llama-3.2-3b-instruct)
- Choose a quantization level (Q4 or higher recommended) and click Download
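CLI: if you use LM Studio’s lms command-line tool, a download along these lines should also work (the lms get subcommand and the model key shown are assumptions based on recent LM Studio CLI versions; adjust to your setup):

```bash
# Download a model by its LM Studio key (example key; pick any model you like)
lms get llama-3.2-3b-instruct
```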
2. Start LM Studio's local server
Start the server using the GUI or CLI:
GUI:
- Go to the Developer tab in LM Studio
- Select the model you want to serve
- Click Start Server
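CLI: with the lms tool, starting the server should look roughly like this (the lms server start subcommand is an assumption based on recent LM Studio CLI versions):

```bash
# Start LM Studio's OpenAI-compatible local server (defaults to port 1234)
lms server start
```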
By default, LM Studio runs on port 1234. Verify the server is running:
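```bash
# Should return a JSON list of the models LM Studio currently exposes
curl http://localhost:1234/v1/models
```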
3. Expose LM Studio with ngrok
Use the ngrok agent to create an internal endpoint:
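For example, forwarding the default LM Studio port to the lm-studio.internal URL used throughout this guide (a sketch assuming the agent’s --url flag; substitute your own internal URL and port):

```bash
# Create an internal endpoint that forwards to the local LM Studio server
ngrok http 1234 --url https://lm-studio.internal
```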
Internal endpoints (.internal domains) are private to your ngrok account. They’re not accessible from the public internet.
4. Configure the AI Gateway
Create a Traffic Policy (policy.yaml) with LM Studio as a provider.
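A hypothetical sketch of what this policy could look like; the ai-gateway action name and the providers/url/models field layout are assumptions, so confirm the exact schema against the AI Gateway Traffic Policy reference:

```yaml
# Illustrative only: action and field names are assumptions
on_http_request:
  - actions:
      - type: ai-gateway
        config:
          providers:
            - id: lm-studio
              # The internal endpoint created in the previous step
              url: https://lm-studio.internal
              models:
                # Must match the ID LM Studio reports via GET /v1/models
                - id: llama-3.2-3b-instruct
```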
The model ID should match the identifier shown in LM Studio. You can find it by calling GET /v1/models or checking the model details in the app.
5. Use with OpenAI SDK
Point any OpenAI-compatible SDK at your AI Gateway:
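For example, with the official OpenAI Python SDK; the gateway URL and API key below are placeholders for whatever your AI Gateway endpoint and credentials are:

```python
from openai import OpenAI

# Point the client at the AI Gateway instead of api.openai.com.
# Replace base_url and api_key with your gateway's URL and credential.
client = OpenAI(
    base_url="https://your-ai-gateway.example.com/v1",
    api_key="YOUR_GATEWAY_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.2-3b-instruct",  # the model ID LM Studio reports
    messages=[{"role": "user", "content": "Say hello from a local model."}],
)
print(response.choices[0].message.content)
```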
Advanced configuration
Restrict to LM Studio only
Block requests to cloud providers and only allow LM Studio.
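One way to express this, continuing the hypothetical schema sketched above, is to configure LM Studio as the only provider so there is nothing else the gateway can route to:

```yaml
# Illustrative only: with no cloud providers configured,
# requests can only be routed to the LM Studio internal endpoint
on_http_request:
  - actions:
      - type: ai-gateway
        config:
          providers:
            - id: lm-studio
              url: https://lm-studio.internal
              models:
                - id: llama-3.2-3b-instruct
```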
Failover to cloud provider
Use LM Studio as primary with automatic failover to OpenAI.
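A sketch under the same assumed schema: listing LM Studio before OpenAI is meant to make it the preferred provider, with OpenAI as the fallback (the ordering semantics and field names here are assumptions, not confirmed syntax):

```yaml
# Illustrative only: LM Studio listed first as the primary provider,
# OpenAI as the fallback
on_http_request:
  - actions:
      - type: ai-gateway
        config:
          providers:
            - id: lm-studio
              url: https://lm-studio.internal
              models:
                - id: llama-3.2-3b-instruct
            - id: openai
              models:
                - id: gpt-4o
```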
models: ["lm-studio:llama-3.2-3b-instruct", "openai:gpt-4o"].Increase timeouts
Increase timeouts
Local models can be slower, especially on first load. Increase timeouts as needed.
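For example, raising the per-request timeout mentioned in the Troubleshooting section below; the placement of per_request_timeout and the duration format here are assumptions:

```yaml
# Illustrative only: allow extra time for the first request,
# which may include loading the model into memory
on_http_request:
  - actions:
      - type: ai-gateway
        config:
          providers:
            - id: lm-studio
              url: https://lm-studio.internal
              per_request_timeout: 120s
              models:
                - id: llama-3.2-3b-instruct
```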
Add model metadata
Track model details with metadata.
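A sketch of per-model metadata under the same assumed schema; the metadata block and the keys shown are hypothetical examples of details you might want to record:

```yaml
# Illustrative only: metadata keys are arbitrary examples
on_http_request:
  - actions:
      - type: ai-gateway
        config:
          providers:
            - id: lm-studio
              url: https://lm-studio.internal
              models:
                - id: llama-3.2-3b-instruct
                  metadata:
                    quantization: Q4_K_M
                    served_by: lm-studio-local
```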
Use embeddings
LM Studio supports the /v1/embeddings endpoint for embedding models.
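For example, reusing the OpenAI Python client configured earlier; the embedding model ID below is a placeholder for whichever embedding model you have loaded in LM Studio:

```python
# Request embeddings through the gateway from an embedding model served by LM Studio
embedding = client.embeddings.create(
    model="nomic-embed-text-v1.5",  # placeholder: use the ID LM Studio reports
    input="The quick brown fox jumps over the lazy dog.",
)
print(len(embedding.data[0].embedding))  # dimensionality of the returned vector
```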
Troubleshooting
Connection refused
Symptom: Requests fail with connection errors.
Solutions:
- Verify the LM Studio server is running: check that the Developer tab shows “Server running”
- Verify the server port: the default is 1234; check LM Studio settings if different
- Verify the ngrok tunnel is running: check for https://lm-studio.internal in your ngrok dashboard
- Ensure the internal endpoint URL matches your config
Model not found
Symptom: Error saying the model doesn’t exist.
Solutions:
- List available models: curl http://localhost:1234/v1/models
- Verify the model is loaded in LM Studio (check the Developer tab)
- Ensure the model ID matches exactly what LM Studio reports
Slow first response
Symptom: The first request takes a very long time.
Cause: LM Studio loads models into memory on first use.
Solutions:
- Increase per_request_timeout to allow for model loading
- Pre-load the model by selecting it in LM Studio before starting the server
- Enable “Keep model in memory” in LM Studio settings
Out of memory
Symptom: LM Studio crashes or returns errors for large models.
Solutions:
- Use a smaller model or a more heavily quantized version (for example, Q4 instead of Q8)
- Close other applications to free up RAM
- Adjust GPU layers in LM Studio’s model settings
- Use CPU-only inference if GPU memory is insufficient
Server not starting
Symptom: The LM Studio server won’t start.
Solutions:
- Check if port 1234 is already in use: netstat -an | grep 1234
- Try a different port in LM Studio settings
- Restart LM Studio
Next steps
- Custom Providers - Learn about URL requirements and configuration options
- Model Selection Strategies - Route requests intelligently
- Multi-Provider Failover - Advanced failover patterns