Fast and secure ingress to remote AIs with ngrok, Deepseek, and Ollama

January 27, 2025 | 8 min read
Joel Hans
01-27-2025: We updated this blog post with new information about recent LLMs, a simplified setup method, and specific instructions on using Deepseek-R1.

Congratulations, you’ve decided to enter the chaotic, thrilling, and occasionally nonsensical world of LLM validation! Maybe your CTO asked you to “just spin up a model real quick.” Maybe you’re on a quest to prove that Deepseek-R1 is the next big thing—or at least better than your company's last AI-powered attempts at collaborating on an open source AI or integrating its outputs into your own apps.

Usually, even validating this kind of work sends you down one of two rabbit-holes: First, you set all this up on your local machine, which is fragile, unrepeatable, and decidedly not GPU-accelerated. Second, you over-complicate and end up with a massive—and expensive—deployment before you've gotten to hello, world.

Today, we're going to show you a just-right setup—including ngrok, Deepseek, and Ollama—to help you test, collaborate on, and validate AI models like Deepseek-R1 without exposing them to the entire internet. Let’s dive in before someone suggests retraining GPT-2 to save a few bucks.

Why connect your AI development to external CPU/GPU compute

The obvious first choice is to develop your LLM locally, as you do with other development work, committing changes and pushing them to a Git provider like GitHub.

Rodrigo Rocco, for example, pleasantly surprised us when he showcased on Twitter/X how he’s running AI models locally and making results available externally through an ngrok tunnel. That said, there are limitations to the local-first model.

First, the most powerful LLMs require 16GB, 32GB, or even more RAM, plus a discrete GPU for churning through LLM-specific workloads. Your local workstation might not be that powerful now, and upgrades aren't in the budget. Or, even if you have a newer system, you don't feel like taxing it (and slowing down your other work) while you fine-tune an LLM locally.

There are other limitations around collaboration and eventual hosting. A local development workflow doesn't let others collaborate with you easily unless you expose your workstation to the public internet and leave it on 24/7. The same goes for the eventual transition into an API: better to build on a platform you won't inevitably have to migrate off of.

When looking specifically at remote compute for LLM development, you have two choices:

  • Hosted: Hosted platforms work like a SaaS—they launch your LLM on their infrastructure, and you get a simple web app or API to explore. There are plenty of hosted AI/LLM platforms already, like RunPod, Mystic, BentoML, and others. Major cloud providers also have platforms for training and deploying LLMs, like Google’s Vertex AI or AWS’ AI Services. Hosted platforms win out on simplicity but don’t come with privacy and compliance guarantees, and trend on the expensive side.
  • Self-hosted: When you self-host an LLM, you install and configure the AI toolkit yourself on a barebones virtual machine (VM). There are LLM “orchestrator” tools, like Trainy, that claim to simplify the process, but also come with adoption learning curves. Self-hosting is typically cheaper in the long-term, but the onus is on you to build a workable AI development stack.

Hosted options are great for fast-moving startups that need to launch and train LLMs with the absolute minimum infrastructure, but you need something that’s both owned by your organization and persistent for continuous development.

Self-hosting is the best choice for this situation, but comes with new technical challenges.

Securely connecting your local workstation to a remote service requires you to deal with proxies, port forwarding, firewalls, and so on. If you were to bring an LLM service into production through the normal route, a formal request to your DevOps peers, it might take weeks of code and coordination to satisfy security and compliance requirements.

That said, companies like Softwrd.ai and Factors.ai are already using ngrok to connect to remote CPU/GPU compute and get new AI-based APIs to market fast, which means you're not the first to venture into this kind of proof of concept with ngrok.

Plan your tech stack for external AI compute

You’ve gone through all the technical requirements of building this proof of concept AI workflow—time for everyone’s favorite part of building new tech: deciding on your stack.

Based on the CTO’s goals and the roadblocks you’ve already found, a viable option consists of:

  • A Linux virtual machine with GPU acceleration: You need a persistent, configurable machine for storing your LLM and computing responses. Because the next two parts of our stack run wherever Linux does, you can launch this VM in whichever cloud provider works best for your organization. The only requirements are that your VM is GPU-accelerated and lets you install both Ollama and ngrok.
  • Ollama: Ollama is an open source project that simplifies how developers run, interact with, and train LLMs. It was originally designed to run locally, but because it runs anywhere Linux does and doesn't require a web interface, it's just as easy to install on a remote VM.
  • ngrok: You’ll lean on ngrok’s universal ingress platform for securing and persisting ingress to Ollama and the GPU power behind your LLM. ngrok abstracts away the networking and configuration complexities around securely connecting to remote services, while also layering in authentication, authorization, and observability you’ll need for a viable long-term solution.

This stack is by no means the only way of connecting your local development workflows to remote compute power. We've chosen it here because it's simple to get started:

  • You can start playing with many popular open source LLMs like Deepseek-R1 in about 15 minutes.
  • You can maintain your VM’s lifecycle through the GCP console or the gcloud CLI (see the sketch after this list), allowing you to stop your VM while it’s not in use to conserve that pesky budget when you’re not actively developing AI.
  • Unlike hosted platforms, you own the node and its data.
  • Unlike lower-cost platforms like Colab Pro, this solution is persistent, allowing you to store data and fine-tune an existing model in the future (see the following disadvantages).
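On that lifecycle point, here's a minimal sketch of stopping and starting the instance from the gcloud CLI. The instance name and zone are placeholders for whatever you choose when you create the VM in the next section:

# Stop the GPU VM when you're done for the day: you stop paying for compute, but keep the disk
gcloud compute instances stop my-llm-vm --zone=us-central1-a

# Start it again when you're ready to keep experimenting
gcloud compute instances start my-llm-vm --zone=us-central1-a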

There are some disadvantages to this approach, too:

  • You’ll likely need to harden your VM, from a Linux and networking standpoint, against cyberattack… I’m sure your IT/DevOps peers would be thrilled to help.
  • VMs require more setup and maintenance than a purely hosted solution.
  • On-premises hardware would likely be cheaper in the long-term.
  • This stack currently lets you deploy existing open source models and customize certain parameters, but not perform deep re-training or fine-tuning.

Launch your remote Linux VM

Head on over to the Google Cloud Console and Create a VM.

Under Machine configuration, select GPUs and pick the GPU type that works for your budget and needs. For the machine type, pick a high memory instance, like n1-highmem-2. Down in the Boot disk section, click Switch Image to get an optimized operating system like Deep Learning on Linux, and up the size of the disk to 100 GB to be on the safe side. Down in the Firewall section, click Allow HTTPS traffic—ngrok will use that later to make your LLM accessible from anywhere.
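If you'd rather skip the console clicking, the command below is a rough gcloud equivalent of those choices. Treat it as a sketch: the instance name, zone, and GPU type are placeholders, and the exact image family for the Deep Learning on Linux images changes over time, so check what's currently available in the deeplearning-platform-release project before you run it.

# Roughly equivalent to the console steps above; adjust name, zone, GPU type, and image family
gcloud compute instances create my-llm-vm \
  --zone=us-central1-a \
  --machine-type=n1-highmem-2 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-project=deeplearning-platform-release \
  --image-family={DEEP_LEARNING_IMAGE_FAMILY} \
  --boot-disk-size=100GB \
  --tags=https-server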

Those are the fundamentals you need to launch remote AI compute, at about $0.35 per hour.

Give your instance some time to fire up. When it’s ready, SSH into it with your preferred method. The first time you log in, your VM will ask whether you want to install Nvidia drivers. Hit <code>y</code> and <code>Enter</code>, as the whole reason you’re paying extra for this VM is access to GPU compute.

Once that’s done, you can run <code>nvidia-smi</code> to verify that your GPU acceleration works as expected.

nvidia-smi

Fri Jan 19 18:15:27 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P0    33W /  70W |      2MiB / 15360MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Install Ollama

The fastest way to get Ollama working is their Bash one-liner, but you can also opt for a manual installation or Docker.

curl -fsSL https://ollama.com/install.sh | sh


With Ollama installed, pull deepseek-r1.

ollama pull deepseek-r1


You now have Deepseek-R1 available on Ollama, which is already running a server on port 11434. That's great, but since said server is running on the remote VM, you need a way to connect to chats and completions without ssh-ing in or opening port 11434 for attackers and bots to discover and abuse.
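Before you expose anything, you can sanity-check the server from inside your SSH session. The request below hits Ollama's standard endpoint for listing locally available models, and the response should include deepseek-r1:

# From the VM itself: list the models Ollama has pulled locally
curl http://localhost:11434/api/tags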

Install and start ngrok

To download the ngrok agent, head on over to the downloads page or quickstart doc for multiple options compatible with any Linux VM, Debian and beyond. Make sure you grab your Authtoken from your ngrok dashboard and connect your account.
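As one example, on a Debian-based VM the install and authtoken setup look roughly like the following at the time of writing; check the downloads page for the current commands for your distribution, and swap in the authtoken from your dashboard.

# Add ngrok's apt repository and install the agent (Debian/Ubuntu)
curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
  | sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null \
  && echo "deb https://ngrok-agent.s3.amazonaws.com buster main" \
  | sudo tee /etc/apt/sources.list.d/ngrok.list \
  && sudo apt update \
  && sudo apt install ngrok

# Connect the agent to your ngrok account
ngrok config add-authtoken {YOUR_NGROK_AUTHTOKEN}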

While you have the ngrok dashboard open, reserve a new ngrok domain—this is your only method of having a consistent tunnel URL for all your future LLM training operations using this stack, either using a subdomain of <code>ngrok.app</code> or a custom domain you own.

The following command creates an ngrok endpoint, which routes all traffic from {YOUR_LLM_DOMAIN} to port 11434 on your GPU-powered VM. Per the Ollama docs, the --host-header flag ensures Ollama receives the expected Host header. The ngrok agent immediately starts forwarding any API requests on your reserved domain to Ollama and your new Deepseek-R1 model.

ngrok http 11434 --url={YOUR_LLM_DOMAIN} --host-header="localhost:11434"

Run your first LLM requests

You can now send API requests directly to your GPU-powered Deepseek-R1 model.

curl https://{YOUR_LLM_DOMAIN}/api/generate -d '{
  "model": "deepseek-r1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'


Depending on the horsepower of the VM you chose, you'll see a similar result soon enough.

{
  "model":"deepseek-r1",
  "created_at":"2025-01-27T21:46:52.004359039Z",
  "response":"\u003cthink\u003e\n\n\u003c/think\u003e\n\nThe sky appears blue due to a phenomenon called Rayleigh scattering. When sunlight reaches Earth's atmosphere, it interacts with molecules and small particles in the air. Blue light scatters more frequently than other colors because it travels shorter wavelengths and interacts more with the atmosphere. This scattering effect is why we see the sky as blue during the day.\n\nAdditionally, during sunrise or sunset, the sky often appears red or orange. This is due to a phenomenon called Rayleigh scattering combined with the way our eyes perceive light. When sunlight passes through layers of the atmosphere, the blue light is scattered away, leaving the longer wavelengths (red and orange) to dominate the sky's appearance.",
  "done":true,
  "done_reason":"stop",
  "context":[151644,10234,374,279,12884,6303,30,151645,151648,271,151649,271,785,12884,7952,6303,4152,311,264,24844,2598,13255,62969,71816,13,3197,39020,24491,9237,594,16566,11,432,83161,448,34615,323,2613,18730,304,279,3720,13,8697,3100,1136,10175,803,13814,1091,1008,7987,1576,432,34192,23327,92859,323,83161,803,448,279,16566,13,1096,71816,2456,374,3170,582,1490,279,12884,438,6303,2337,279,1899,382,49574,11,2337,63819,476,42984,11,279,12884,3545,7952,2518,476,18575,13,1096,374,4152,311,264,24844,2598,13255,62969,71816,10856,448,279,1616,1039,6414,44393,3100,13,3197,39020,16211,1526,13617,315,279,16566,11,279,6303,3100,374,36967,3123,11,9380,279,5021,92859,320,1151,323,18575,8,311,40736,279,12884,594,11094,13],
  "total_duration":43516935923,
  "load_duration":7419493865,
  "prompt_eval_count":9,
  "prompt_eval_duration":1455000000,
  "eval_count":138,
  "eval_duration":34639000000
}
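If you'd rather send conversation-shaped requests, Ollama also exposes a chat endpoint at /api/chat on the same server; a minimal example against this deployment looks like the following.

curl https://{YOUR_LLM_DOMAIN}/api/chat -d '{
  "model": "deepseek-r1",
  "messages": [
    {"role": "user", "content": "Explain Rayleigh scattering in one sentence."}
  ],
  "stream": false
}'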

Protect your Deepseek model from unwanted use

Now that you have an LLM running, you absolutely want to limit who can access it.

Through our Traffic Policy engine, ngrok offers many possibilities, from basic authentication to IP restrictions and JWT validation.

For the sake of simplicity, let's implement the Basic Auth action using Traffic Policy. Close down ngrok on your VM and create a new file called policy.yaml with the following.

---
on_http_request:
  - actions:
      - type: basic-auth
        config:
          credentials:
            - user1:password1
            - user2:password2


These lines of YAML implement the Basic Auth action, which checks for Base64-encoded credentials with every request. All requests without these credentials receive a 401 Unauthorized response directly from ngrok's network without even touching your VM. You can also add up to 10 total credentials for different people on your team.

Start up the ngrok agent again with your new Traffic Policy file.

ngrok http 11434 --url={YOUR_LLM_DOMAIN} --host-header="localhost:11434" --traffic-policy-file=policy.yaml


With your next request to Deepseek-R1, use curl's -u flag, which Base64-encodes the credentials for you.

curl -u user1:password1 \
  https://{YOUR_LLM_DOMAIN}/api/generate -d '{
    "model": "deepseek-r1",
    "prompt": "Why is ngrok so great?",
    "stream": false
  }'
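To confirm the policy is actually doing its job, try the same request without the -u flag. You should see a 401 Unauthorized come back from ngrok's edge instead of a model response:

# No credentials: expect a 401 Unauthorized from ngrok, and no load on your VM
curl -i https://{YOUR_LLM_DOMAIN}/api/generate -d '{
  "model": "deepseek-r1",
  "prompt": "Why is ngrok so great?",
  "stream": false
}'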

What comes after ngrok and Deepseek-R1?

To be honest, we're not quite sure at this point. Things are moving fast.

What matters most is that you now have an effective testbed for open source LLMs like Deepseek-R1, whether you're using it to collaborate with your coworkers, trying to validate whether you could integrate it into an existing app, or just trying to compare this new model's results to LLMs you've used before.

As you experiment with Deepseek or other LLMs and eventually want to make your deployment more secure, I have a few recommendations:

  1. Learn about our Traffic Policy engine, which allows you to filter, manage, and act on traffic from your API gateway.
  2. Implement additional authentication measures, like JWT validation and IP restrictions, to prevent anyone outside your organization from accessing your LLM (a sketch of an IP restriction follows this list).
  3. Learn about turning your ngrok agent(s) into a composable API gateway implementation for all your APIs.
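If you go the IP restriction route, here's a rough sketch of what it could look like as another Traffic Policy action in policy.yaml. The action type and config keys follow ngrok's restrict-ips action, and the CIDR is a placeholder for your own office or VPN range:

---
on_http_request:
  - actions:
      - type: restrict-ips
        config:
          enforce: true
          allow:
            - 203.0.113.0/24 # placeholder: your office or VPN CIDR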

We'd love to hear what you learned from this project; ping us on X (aka Twitter) @ngrokhq or LinkedIn. Have other questions about using ngrok in production? Join us on our monthly Office Hours livestreams for demos and Q&A from our DevRel and Product teams.

Joel Hans
Joel Hans is a Senior Developer Educator. Away from blog posts and demo apps, you might find him mountain biking, writing fiction, or digging holes in his yard.