The previous article covered public AI versus private AI: when a public AI service is fine, and when you need more control over data and processing. This article looks at the next question. If you want to run private inference, what does that look like in practice?
Private inference is not a loose model on a GPU. It is a managed chain with a model, inference server, API layer, identity, network isolation, RAG or data layer, logging, monitoring and operations. Without those layers, you mostly have a technical experiment. With them, you have an environment you can use, explain and manage.
The main question is not: which model is best? The first question is: where does the data flow, who has access and who manages the chain?
Private inference in one sentence
Inference means using an existing AI model to generate text, analyse documents, classify information or answer questions. Private means that this processing takes place inside a controlled environment, for example in a private cloud, on dedicated hardware, on-premises or in a sovereign cloud environment.
So this is not about training your own foundation model. It is about making an existing model available to your applications, employees or customers in a secure and manageable way.
Private inference as a chain
Control is not only in the model. It mostly sits in the layers around it.
Model and inference server
A private inference environment usually starts with a model artifact and an inference server. The model artifact contains the model weights, tokenizer and configuration. The inference server loads the model, processes requests, manages memory, schedules tokens and exposes an API for applications.
There are several ways to run such a server. Think of llama.cpp, vLLM, TGI, TensorRT-LLM or Ollama. These are not interchangeable products with one obvious best choice. The right choice depends on your workload, model size, hardware, desired API, operational knowledge and production requirements. For this article, the main point is that there should always be a serving layer between application and model.
Many teams like an OpenAI-compatible API because existing applications then need fewer changes. But OpenAI-compatible only means the API shape looks familiar to developers. It says nothing about privacy, logging, access rights or network security. You still have to design those yourself.
Hardware is more than choosing a GPU
GPUs may be needed, but not every workload starts there. For small models, low volumes or batch tasks, CPU inference can sometimes be enough. For interactive chat, multiple users or longer context, you reach GPUs sooner.
"Does the model fit in VRAM?" is not enough. VRAM is not only used for model weights. You also need memory for context, KV cache, batching, concurrent requests and runtime overhead. A model that barely fits with one test prompt can still fail in production once several users send long documents.
So design limits as well: maximum context length, maximum output tokens, queue duration, concurrency per tenant and rate limits per application. Treat an LLM server like a database under peak load. It is better to return a clear response when something is too large than to make every user wait.
From model process to managed API
Do not let applications talk directly to a model without control.
Little visibility into rights, limits, tenants, logging and error handling.
The API layer decides who can do what, how much traffic is allowed and what gets logged.
Network isolation and access
Private inference is only private if the route to the model is under control. Do not expose an inference endpoint to the public internet unless you have a clear reason. Put applications, inference servers and data layers in private networks. Separate test, acceptance and production. Keep management interfaces separate from application traffic.
Data location alone is not enough. For digital sovereignty, management, access, logging, jurisdiction, exit options and incident response also matter. Running a model in the Netherlands helps little if every application can query everything with the same key.
That is why identity belongs outside the model. The API layer should determine which application may use which model, which user belongs to which tenant, which data class may be processed and which limits apply. A system prompt is not a security boundary. Authorisation belongs outside the model.
RAG makes the data layer central
Private inference often becomes useful when the model gets its own context. Think of internal documents, tickets, customer records, security logs, source code or product data. In SaaS, that data layer directly touches your SaaS infrastructure. This is usually done with RAG: documents are stored, split into chunks, converted into embeddings and made searchable through a vector database or search layer.
The data layer is often more sensitive than the model itself. If retrieval ignores access rights, a user can use AI to see documents they were never allowed to see outside AI. RAG must therefore be permission-aware. Document rights, tenant boundaries and metadata filters must be enforced server-side, not as an instruction in a prompt.
Also pay attention to source references and traceability. In business use, you often want to see which documents were retrieved, which version was used and why an answer stayed within the allowed context.
Logging, monitoring and operations
Private inference gives you control, but only if you set up that control. Monitoring is not only about uptime. Measure latency, time to first token, tokens per second, queue depth, error rates, throttling, GPU usage and VRAM usage. Link those metrics to application, tenant and model version where possible.
Logging needs extra care. Full prompts and outputs can contain personal data, customer data or business secrets. Sometimes metadata-only logging is safer: requestId, userId, tenantId, model version, prompt template version, token counts, retrieved document IDs and errors. Decide who may read logs and how long you keep them.
Operations are the rest of the story: patching, model updates, versioning, rollback, quotas, evaluation and incident response. A model update can change behaviour. A prompt template can accidentally include too much context. A new embedding run can affect retrieval. Record what is running, why it is running and how you can roll back.
First pilot without a large platform
Start small, but put the management layers in place from day one.
Start small, design seriously
You do not need to build a full AI platform from the start. Begin with one concrete use case, a limited data class and a clear acceptance criterion. Choose a model that is good enough for that task, limit context and output, and measure how the environment behaves under normal use.
But do not start without a basic architecture. API, identity, network isolation, logging and monitoring are not later add-ons. A pilot is exactly where you learn which data moves through the chain, which access rights are needed and what operational load is realistic.
Running private inference is therefore mostly infrastructure work. The model matters, but the value is in the controlled chain around it. If you design that chain well, you can use AI in places where public AI gives too little control. The next step is deciding when private inference is suitable for your use case.
Use AI with control over data and operations? See how cloud.nl looks at AI infrastructure and when a private cloud is a logical foundation for private inference.