Why most RAG systems are not secure and how to build one that is

Lessons learned from real-world data leaks and how to design RAG systems that do not repeat them.

Martin Paloncy

January 19, 2026

In 2023, just weeks after Samsung allowed employees to use ChatGPT, its engineers caused multiple data leaks in a single month. In an effort to work faster, they pasted confidential source code into the chatbot to ask for “optimization” and uploaded meeting recordings to have the AI model generate minutes.

At the time, ChatGPT retained user input for training by default, so Samsung’s proprietary semiconductor code and meeting transcripts effectively became part of OpenAI’s training datasets. In a separate incident, Amazon lawyers even claimed to have seen AI output closely resembling internal company data, according to Business Insider.

Independent of data retention policies, all major AI services currently process prompts in plaintext, exposing sensitive inputs to both cloud and AI service admins. At a time when data at rest and in transit is encrypted by default, this is unacceptable for security-conscious users and a big reason for the low adoption of RAG in industries like healthcare or government. Two solutions have emerged to limit this exposure: either process your data in plaintext but not in the cloud, or process it in the cloud but not in plaintext. These approaches are not mutually exclusive. In this post, I want to share my experiences and best practices to get the best of both worlds.

How RAG works

Retrieval-augmented generation (RAG) is a technique for improving the quality of AI responses to questions about specific information by adding custom context to the prompt (“enriching”). Instead of relying only on the model’s training data (which suffices for most generic questions), custom, often company-specific information is provided to the model as context.

What the model sees
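As a minimal illustration, this is roughly what an enriched prompt looks like from the model’s perspective. The policy snippets and the question below are made up; the point is only that the retrieved, company-specific chunks are placed into the prompt ahead of the user’s question.

```python
# Hypothetical example of prompt "enriching": retrieved company-specific
# snippets are pasted into the prompt before the user's question.
retrieved_chunks = [
    "Travel policy v3: bookings above EUR 500 need VP approval.",
    "Update 2025: all travel approvals are handled in the ERP tool.",
]
question = "Who has to approve an EUR 800 flight booking?"

enriched_prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n- " + "\n- ".join(retrieved_chunks) + "\n\n"
    "Question: " + question
)
print(enriched_prompt)  # this string is what the model actually sees
```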


The more relevant data your RAG system can access, the more valuable it is to you. And yes, the relevance of your data often correlates with its sensitivity. RAG with meaningful impact on productivity will require access to your internal knowledge, especially the confidential bits.

Setting this up without the proper security guardrails can cause catastrophic data exposure, harm employees and even damage third parties like customers and partners. So, special attention to data security is necessary on two fronts: ingestion (feeding information to the system) and retrieval (asking the system questions).

Ingestion

Ingestion is the process of converting documents and other data into formats an LLM can search and read. Common approaches include vector databases and knowledge graphs.

You can use either or both as the foundation of your knowledge base, and both approaches usually require AI inference during ingestion.

Vector storage

To capture semantic information as vectors, you need an embedding model. Embedding models split your documents into chunks and assign vector values to them, which can then be compared to your prompts to find semantic similarities and, as a result, the most relevant data sources for your query.
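As a sketch of what this looks like in practice, here is a minimal local embedding and similarity search using the sentence-transformers library. The model name and sample chunks are only placeholders.

```python
# Minimal local-embedding sketch with sentence-transformers.
# The model name is just an example; pick one model and stick with it,
# since vectors from different embedding models are not comparable.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small enough for a laptop

chunks = [
    "Fab 3 reached a 92% semiconductor yield in Q3.",
    "The cafeteria menu changes every Monday.",
]
chunk_vectors = model.encode(chunks)                    # one vector per chunk

query_vector = model.encode("What is our current chip production yield?")
scores = util.cos_sim(query_vector, chunk_vectors)[0]   # cosine similarity per chunk

best = int(scores.argmax())
print(chunks[best], float(scores[best]))                # most relevant chunk wins
```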

Privatemode AI offers a confidential embedding API, but my honest recommendation is to pick an embedding model and run it locally, for two reasons:

  1. The specific format of your vectors depends entirely on the chosen embedding model (chunk size, number of dimensions). This means it’s difficult to change the embedding model after ingesting documents, because the vector formats will likely not match. Service providers may change or discontinue certain embedding models, which can significantly impact the functionality of your system, so it’s best to pick one model and stick with it.
  2. Embedding models are small, at least compared to LLMs. State-of-the-art models have between 4B and 8B parameters, so even modern laptops can run them.

On-prem AI platforms like Zylon.ai offer local embedding and SLM capabilities along with connectivity to Privatemode's secure and powerful LLMs.

Knowledge graphs

Knowledge graphs are essentially machine-readable mind maps that describe entities such as organizations, people, and locations, as well as the relationships between them. They are proving to be extremely useful for context engineering and for giving your LLM relevant additional information, even when it is not semantically similar to your prompt.

When ingesting documents into a knowledge graph, reasoning LLMs can reliably extract these entities and relationships (“Jane Doe <> WORKS_AT <> Doe Corp.”). But to do this, the LLM has to read the entire (sensitive) document. This is where Privatemode AI becomes useful for all, and necessary for some.

Reasoning LLMs that are “smart” enough to identify and relate entities correctly cannot run efficiently on laptops, or even on powerful PCs. They require server-grade hardware or personal supercomputers, which makes data confidentiality a real challenge. Privatemode AI offers you a way to access gpt-oss-120b and (soon) other reasoning models without sacrificing privacy. Nobody, including cloud or service providers, can ever see any input in plaintext, and your sensitive documents can be ingested into a (local) knowledge graph safely.
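To make the ingestion step concrete, here is a rough sketch. It assumes an OpenAI-compatible endpoint (for example a locally running Privatemode proxy; the base URL and model name are placeholders for your own setup) and a simple networkx graph as local storage.

```python
# Sketch: a reasoning LLM extracts (entity, relation, entity) triples from a
# document, and the triples are stored in a local knowledge graph.
# base_url and model are assumptions; point them at the OpenAI-compatible
# endpoint of your own deployment (e.g. a Privatemode proxy).
import networkx as nx
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

document = "Jane Doe has worked at Doe Corp. in Berlin since 2021."
response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{
        "role": "user",
        "content": "Extract relations from the text below as one "
                   "'ENTITY | RELATION | ENTITY' triple per line.\n\n" + document,
    }],
)

graph = nx.DiGraph()
for line in response.choices[0].message.content.splitlines():
    parts = [p.strip() for p in line.split("|")]
    if len(parts) == 3:
        head, relation, tail = parts
        graph.add_edge(head, tail, relation=relation)  # e.g. Jane Doe -WORKS_AT-> Doe Corp.

print(list(graph.edges(data=True)))
```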

Retrieval

Retrieval is the process of gathering relevant context for your prompt or question by performing a semantic search. Similar to ingestion, a hybrid approach between local models and confidential cloud inference is best for most users:  

First, your prompt is transformed into vectors by the same local embedding model used during ingestion. These vectors are matched against the stored vectors in your database. An additional search across your knowledge graph can also be performed to cover related information that might not match semantically. Next, small language models (SLMs) running locally can aggregate the retrieved context into a coherent “master prompt”. SLMs have become powerful enough for this kind of task, and keeping it local minimizes latency. A sketch of this retrieval step follows after the note below.

Note: for local SLMs and embedding models, it can be beneficial to consider alternative architectures to transformers, such as Liquid Foundation Models (LFMs), which generally outperform GPT-style SLMs of up to 1B parameters.
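Put together, retrieval might look roughly like the sketch below. Function and variable names are illustrative; the embedding model and graph stand in for the local components described above.

```python
# Illustrative retrieval sketch: embed the query locally, take the top-k
# chunks by cosine similarity, then add knowledge-graph facts about any
# entity mentioned in the query. The resulting context list is what a local
# SLM would aggregate into the "master prompt".
import numpy as np

def retrieve(query, embed, chunks, chunk_vectors, graph, k=3):
    q = embed(query)  # same local embedding model as during ingestion
    sims = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    context = [chunks[i] for i in np.argsort(-sims)[:k]]

    # Graph hop: related facts that may not be semantically similar to the query.
    for entity in graph.nodes:
        if entity.lower() in query.lower():
            for _, neighbour, data in graph.edges(entity, data=True):
                context.append(f"{entity} {data.get('relation', 'RELATED_TO')} {neighbour}")
    return context
```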

Finally, the enriched prompt with all relevant context in a proper structure can be sent to your reasoning model via Privatemode AI for the heavy lifting: the compute-intensive reasoning that requires a large model. This is where you want the smartest model possible with a sufficient context window, so staying local is not feasible for most users.
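The final hand-off can then be as simple as the following sketch. Again, the base URL and model name are assumptions about your own Privatemode setup, and the hard-coded master prompt stands in for the output of the local SLM.

```python
# Sketch: send the locally assembled master prompt to the large reasoning
# model over a confidential endpoint. base_url and model are assumptions
# about your own setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

master_prompt = (
    "Context:\n"
    "- Travel policy v3: bookings above EUR 500 need VP approval.\n"
    "- Jane Doe WORKS_AT Doe Corp.\n\n"
    "Question: Who has to approve Jane Doe's EUR 800 flight booking?"
)

answer = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": master_prompt},
    ],
)
print(answer.choices[0].message.content)
```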

The takeaway

So what does this mean for you? Stick to local embedding models for standardized, private vector indexing, local SLMs for fast, private prompt enriching, and Privatemode AI for heavy but confidential reasoning.

Review your AI infrastructure today and ensure that your pipelines protect data end-to-end. And check out Privatemode AI to access the power of advanced models without compromising on privacy.

The future of AI is hybrid, and starting today will still put you ahead of the curve.
