AI progress and the data question
In the current transformer era, LLM quality scales mostly with three things: parameters, data, and compute. Parameters and compute are costly but straightforward: buy more chips and extend training. Data is a more complex bottleneck. The internet has been scraped, filtered, and recycled. The next frontier isn't just bigger models; it's better data.
So where does it come from?
The current race toward AGI is dominated by a handful of frontier AI companies. They have capital, compute, talent, and a strong incentive to collect as much high-quality training data as possible.
Better data improves models. Better models may eventually help improve themselves. If more useful data increases the chance of building more capable models, then every private document, every chat, and every company knowledge base starts to look like fuel.
The stakes are enormous. These companies aren't just selling AI products. They are competing to become the default interface through which people and companies use machine intelligence.
So the question is: should we hand them our most personal data, or our company’s crown jewels, just because a data processing agreement says the right things?
Maybe we don't have to.
This article is about privacy-preserving AI inference, a part of the broader field of privacy-enhancing technologies.
On-device or local inference
This is the cleanest option from a privacy perspective. You buy the hardware, run the inference backend in your own environment, and keep the data inside your own network. No prompt crosses a third-party boundary.
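As a rough sketch, assume a locally hosted, OpenAI-compatible inference server (llama.cpp's server, vLLM, and Ollama can all expose one) listening on localhost; the endpoint, port, and model name below are placeholders for your own deployment, not fixed values:

```python
import requests

# Assumed local, OpenAI-compatible endpoint (e.g. llama.cpp server, vLLM, Ollama).
# Host, port, and model name are placeholders for whatever you actually run.
LOCAL_ENDPOINT = "http://localhost:8000/v1/chat/completions"

def local_chat(prompt: str) -> str:
    # The request targets localhost, so the prompt never leaves your own network.
    payload = {
        "model": "local-model",  # whichever model the local server serves
        "messages": [{"role": "user", "content": prompt}],
    }
    resp = requests.post(LOCAL_ENDPOINT, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(local_chat("Summarize our internal incident report policy."))
```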
The trade-off is cost and complexity. Useful local inference needs accelerators, memory, operations, and people who know how to run the stack. If you need broad, general-purpose AI capabilities, you usually want frontier models. And frontier models need serious compute.
Maintenance is the second problem. The AI market moves fast. New models, runtimes, quantization methods, serving frameworks, and security patches appear constantly. Running AI locally gives you control, but control isn't free. Someone must keep the system useful, updated, and secure.
Fully homomorphic encryption (FHE)
Fully homomorphic encryption sounds complicated. It also gives the cleanest security story for confidential cloud inference. The server computes entirely on encrypted data and never sees the plaintext input. The idea is simple:
Local encryption:
enc_prompt ← ENC(prompt)
Cloud processing:
enc_response ← INFERENCE_FHE(enc_prompt)
Local decryption:
response ← DEC(enc_response)
The cloud only processes ciphertext. The prompt is encrypted before it leaves the client, and the response is decrypted only after it returns. That is the appeal: confidentiality isn't based on a provider promise, an access-control policy, or even trust in hardware and architecture. It rests on cryptographic assumptions.
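For intuition, here is a minimal sketch of that pattern using the TenSEAL library (a Python wrapper around Microsoft SEAL). It encrypts a small vector with the CKKS scheme and lets the "server" compute a dot product directly on the ciphertext; this is a toy linear layer with illustrative parameters, not LLM inference:

```python
import tenseal as ts

# Client side: create a CKKS context and keys (illustrative parameters).
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

prompt_embedding = [0.25, -1.3, 0.7, 2.0]               # stand-in for sensitive input features
enc_prompt = ts.ckks_vector(context, prompt_embedding)  # ENC(prompt)

# "Server" side: computes on ciphertext only, never sees the plaintext.
weights = [0.5, 0.1, -0.4, 0.9]                 # a toy linear layer
enc_response = enc_prompt.dot(weights)          # INFERENCE_FHE(enc_prompt)

# Client side: decrypt the result locally.
response = enc_response.decrypt()               # DEC(enc_response)
print(response)  # approximately the plaintext dot product
```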
The problem is practicality. FHE-based computation is expensive, and LLMs are full of operations that don't map cleanly to it. Attention, nonlinearities, and autoregressive decoding make the overhead painful. For large LLM inference, FHE is therefore still mostly a research direction, not something you deploy for normal chat latency today.
Multi-party computation (MPC)
Multi-party computation takes a different route. The prompt isn't encrypted for one server. It is split into secret shares and processed by multiple parties. Each share is useless on its own. It reveals nothing about the prompt without the other shares. The core idea looks like this:
Local sharing:
prompt_1, prompt_2 ← SHARE(prompt)
Joint cloud processing:
res_1, res_2 ← INFERENCE_MPC(prompt_1, prompt_2)
Local reconstruction:
response ← UNSHARE(res_1, res_2)
No single compute party sees the plaintext prompt or the plaintext response. To reconstruct useful data, the parties would have to collude or break the protocol assumptions. That gives MPC a strong confidentiality story.
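To make the sharing step concrete, here is a toy sketch of additive secret sharing modulo a prime, with two parties independently applying the same linear layer to their shares. Real MPC protocols for neural networks are far more involved (nonlinear operations require interaction between the parties), so treat the numbers and the one-layer "model" purely as an illustration:

```python
import random

P = 2 ** 61 - 1  # a large prime; all arithmetic is done modulo P

def share(x):
    # SHARE(prompt): split each value into two additive shares.
    # Each share on its own is uniformly random and reveals nothing about x.
    x1 = [random.randrange(P) for _ in x]
    x2 = [(xi - s) % P for xi, s in zip(x, x1)]
    return x1, x2

def linear_layer(weights, x):
    # Toy "model": one linear layer, evaluated modulo P.
    return [sum(w * xi for w, xi in zip(row, x)) % P for row in weights]

def unshare(y1, y2):
    # UNSHARE(res1, res2): recombine the two result shares.
    return [(a + b) % P for a, b in zip(y1, y2)]

W = [[1, 2, 0], [0, 1, 3]]
prompt = [4, 7, 9]                  # stand-in for an encoded sensitive input

p1, p2 = share(prompt)              # client splits the prompt
res1 = linear_layer(W, p1)          # party 1 computes on its share only
res2 = linear_layer(W, p2)          # party 2 computes on its share only

print(unshare(res1, res2))          # equals linear_layer(W, prompt)
print(linear_layer(W, prompt))
```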
The cost is complexity. Inference is no longer just matrix multiplications on one server. It becomes a cryptographic protocol with communication between parties. For large language models, that usually means high latency, high engineering overhead, and a setup that is still too fiddly for normal production use.
Split inference
Split inference, as the name suggests, also splits the computation, but between client and cloud rather than across cryptographic parties. It cuts the model into two parts:
INFERENCE(prompt) = INFERENCE_2(INFERENCE_1(prompt))
The first part runs locally. It's small enough to execute on the client. The second part runs in the cloud, where the expensive computation happens. At first glance, this looks attractive. The raw prompt never leaves the device. The problem is that the cloud still receives an intermediate representation:
cloud_input ← INFERENCE_1(prompt)
That representation isn't encryption. It's just the output of the first part of the model. It still leaks a lot about the original prompt.
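A small PyTorch sketch shows the shape of the problem; the two-part model below is invented for illustration. The client runs the first part and ships only the intermediate activations, yet those activations are a deterministic function of the prompt:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Invented toy model, split into a small client-side head and a larger cloud-side tail.
part1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())                    # INFERENCE_1, runs locally
part2 = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 4))  # INFERENCE_2, runs in the cloud

prompt_features = torch.randn(1, 16)  # stand-in for an encoded sensitive prompt

with torch.no_grad():
    cloud_input = part1(prompt_features)  # this intermediate tensor is what the cloud receives
    response = part2(cloud_input)

# cloud_input is not ciphertext: it is a deterministic function of the prompt,
# and with knowledge of part1 (or enough queries) much of the input can be reconstructed.
print(cloud_input.shape, response.shape)
```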
That makes split inference one of the weaker approaches from a confidentiality perspective. It can reduce exposure, but it doesn't provide strong cryptographic guarantees. For sensitive data, it shouldn't be treated as a complete privacy solution.
Confidential computing
The idea of confidential computing isn't to keep data encrypted during the entire computation, as with FHE. Instead, data is decrypted only inside a protected execution environment during processing. By design, the cloud provider cannot inspect it in plaintext.
Originally, this was mostly a CPU story. Today, the same idea is moving into AI accelerators and GPU-based inference. That matters because large language models don't run on CPUs at useful speed. They need accelerators.
This is the core idea behind Confidential AI: running AI inference in hardware-isolated environments and using remote attestation to verify what is processing the data.
Remote attestation is the key piece. Before sending sensitive data, the client can verify what is running on the server: the hardware, the runtime, the model service, and ideally the exact software stack. Only if the measurement matches the expected deployment does the client release any data.
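In simplified Python, the client-side flow looks roughly like the sketch below. The `/attestation` endpoint, `get_attestation_report`, and `EXPECTED_MEASUREMENT` are hypothetical placeholders, not a real API: in practice the report is verified with the hardware vendor's attestation tooling (for example for Intel TDX, AMD SEV-SNP, or NVIDIA confidential GPUs), not a bare string comparison:

```python
import requests

# Hypothetical values: in a real deployment the expected measurement comes from a
# reproducible build of the approved inference stack, and verification uses the
# hardware vendor's attestation tooling and signature chain.
EXPECTED_MEASUREMENT = "<digest of the approved inference stack>"
ENCLAVE_URL = "https://inference.example.com"  # assumed confidential inference service

def get_attestation_report(url: str) -> dict:
    # Hypothetical endpoint returning the enclave's signed attestation report.
    return requests.get(f"{url}/attestation", timeout=30).json()

def verify(report: dict) -> bool:
    # Simplified check: compare the reported measurement against the expected one.
    # A real verifier also validates the signature chain up to the hardware vendor.
    return report.get("measurement") == EXPECTED_MEASUREMENT

def confidential_chat(prompt: str) -> str:
    report = get_attestation_report(ENCLAVE_URL)
    if not verify(report):
        raise RuntimeError("Attestation failed: refusing to send sensitive data")
    # Only after successful attestation is the prompt released to the service.
    resp = requests.post(f"{ENCLAVE_URL}/v1/chat", json={"prompt": prompt}, timeout=120)
    return resp.json()["response"]
```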
That shifts the trust boundary. The user no longer has to fully trust the cloud provider or the AI service operator. Instead, trust moves toward the hardware vendor, the firmware, and the attestation chain. Confidential computing isn't pure cryptography: unlike FHE, it doesn't keep data encrypted throughout the entire computation. The data is decrypted during processing, but only inside a hardware-isolated environment. Confidentiality therefore still depends on additional factors: the processor architecture, the accelerator, the firmware, the runtime, the attestation setup, and the absence or mitigation of relevant side channels.
The quality of the specific setup also matters a lot. A confidential computing environment with a non-transparent software stack isn't the same as an end-to-end attestable and reproducible one.
The practical advantage, however, is hard to ignore. In contrast to FHE and MPC, confidential computing can run large models at ordinary inference speed in production. For large cloud-hosted LLMs, it is currently the most practical path toward confidential inference with production-grade performance.
Prompt redaction and pseudonymization
Prompt redaction takes a route that doesn't try to keep the whole prompt confidential. Instead, sensitive data is removed, replaced, or pseudonymized before the prompt is sent to the model. This can be done with static rules, classical PII detection, or dedicated smaller models such as OpenAI’s Privacy Filter.
The idea is simple:
Local redaction:
redacted_prompt, mapping ← REDACT(prompt)
Cloud processing:
redacted_response ← INFERENCE(redacted_prompt)
Local restoration:
response ← RESTORE(redacted_response, mapping)
Redaction and pseudonymization happen locally. The model provider only sees the redacted version. Names, addresses, customer IDs, contract details, or other sensitive fields can be replaced with placeholders before they leave the client.
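A minimal, rule-based sketch with regular expressions illustrates the REDACT/RESTORE round trip; the patterns and the `CUST-` ID format are assumptions for the example, and real deployments would combine such rules with NER or a dedicated filtering model:

```python
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CUSTOMER_ID": re.compile(r"\bCUST-\d{6}\b"),  # assumed in-house ID format, illustrative
}

def redact(prompt: str):
    # REDACT(prompt): replace sensitive matches with numbered placeholders and
    # keep the mapping locally, so the original values never leave the client.
    mapping = {}
    redacted = prompt
    for label, pattern in PATTERNS.items():
        for i, match in enumerate(pattern.findall(redacted)):
            placeholder = f"<{label}_{i}>"
            mapping[placeholder] = match
            redacted = redacted.replace(match, placeholder)
    return redacted, mapping

def restore(response: str, mapping: dict) -> str:
    # RESTORE(redacted_response, mapping): swap the placeholders back in locally.
    for placeholder, original in mapping.items():
        response = response.replace(placeholder, original)
    return response

redacted_prompt, mapping = redact("Write a reminder to anna@example.com about CUST-001234.")
print(redacted_prompt)  # "Write a reminder to <EMAIL_0> about <CUSTOMER_ID_0>."
# redacted_prompt is sent to the model; the mapping stays on the client.
```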
The problem is reliability. Real prompts are messy. Sensitive information doesn't always look like a name, an email address, or a phone number. It can hide in context, rare identifiers, technical details, customer-specific terminology, or combinations of otherwise harmless facts.
That makes redaction useful, but brittle. It can reduce exposure, it can catch obvious leaks, and it can be a valuable layer in a broader privacy strategy. But it shouldn't be the only guardrail.
Summary
Privacy-preserving AI inference isn't one technology. It's a spectrum.
At one end are pure cryptographic approaches such as FHE and MPC. They offer strong confidentiality guarantees, but are still not practical enough for large LLMs in ordinary production settings.
At the other end are pragmatic controls such as redaction and pseudonymization. They reduce exposure, but they are brittle: real-world prompts are messy, and sensitive information often hides in edge cases.
Confidential computing sits in between. It isn't pure cryptography: data is still decrypted during processing, but only inside a hardware-enforced, isolated execution environment. It is, however, practical enough to run large models in the cloud while reducing trust in the cloud provider and the AI service operator. That makes it the most realistic path for confidential AI inference today.