ai · vector databases · postgres · architecture

Polyglot Persistence for AI Apps

Every AI-powered application needs at least three databases. Here's the architecture that works, why vector databases exist, and how to combine them with your existing stack.

10 min read · 21 April 2026

Building an AI-powered application in 2026 means building a polyglot application whether you plan to or not. The moment you add an LLM to your stack, you need to store and retrieve vector embeddings — and that requires a database type that most developers haven’t worked with before.

Here’s the architecture that actually works, and why each piece exists.

The standard AI app stack

A typical AI application needs at least three database layers:

A relational database (PostgreSQL) — for your structured data. Users, conversations, documents, metadata. Everything that has a defined shape and needs ACID guarantees.

A vector store (pgvector or Pinecone) — for embeddings. The numerical representations of text, images, or other content that enable semantic similarity search.

A cache (Redis) — for LLM responses, embeddings, and session state. LLM API calls are slow and expensive. Caching identical or similar requests saves both.

Some applications add a fourth layer — a document store like MongoDB for flexible-schema content, or Elasticsearch for hybrid keyword and semantic search. But the three above are the baseline.
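
As a rough sketch of what those layers look like wired into a Node.js app (assuming node-postgres, ioredis, and the official openai SDK; the client choices and environment variables here are illustrative, not prescriptive):

import pg from 'pg';
import Redis from 'ioredis';
import OpenAI from 'openai';

// Relational layer: PostgreSQL, with the pgvector extension enabled later
const db = new pg.Pool({ connectionString: process.env.DATABASE_URL });

// Cache layer: Redis for LLM responses, embeddings, and session state
const redis = new Redis(process.env.REDIS_URL);

// Embedding + LLM API client (not a database, but every layer serves it)
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

The examples below assume these three clients: db, redis, and openai.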

What vector embeddings actually are

Before understanding vector databases, you need to understand embeddings.

When you pass text to an embedding model (like OpenAI’s text-embedding-3-small), it returns a list of numbers — typically 1,536 floats. This list is a vector that represents the meaning of the text in a high-dimensional space.

The key property: texts with similar meanings produce vectors that are close together in that space. “I love dogs” and “Puppies are wonderful” produce similar vectors. “Quarterly revenue report” produces a very different vector.
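
"Close together" is usually measured with cosine similarity. A minimal sketch in plain JavaScript (no library assumed):

// Cosine similarity: ~1 means same direction (similar meaning), ~0 means unrelated
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

Note that pgvector's <=> operator (with vector_cosine_ops) returns cosine distance rather than similarity, which is why the query further down computes 1 - (embedding <=> $1) to get a similarity score.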

This enables semantic search — finding content by meaning rather than exact keyword match. It’s how RAG (Retrieval Augmented Generation) works: embed the user’s question, find the most similar document chunks in your vector store, pass those chunks to the LLM as context.

// Embed the user's question
const questionEmbedding = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: userQuestion
});

// Find the five most similar chunks with pgvector (db is a node-postgres pool;
// the embedding array is serialised to pgvector's '[...]' text format)
const { rows: similarDocs } = await db.query(
  `SELECT content, 1 - (embedding <=> $1) AS similarity
   FROM documents
   ORDER BY embedding <=> $1
   LIMIT 5`,
  [JSON.stringify(questionEmbedding.data[0].embedding)]
);

// Pass the retrieved chunks to the LLM as grounding context
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'system',
      content: `Answer based on this context: ${similarDocs.map(d => d.content).join('\n')}`
    },
    { role: 'user', content: userQuestion }
  ]
});

pgvector vs dedicated vector databases

You have two main options for storing vectors: pgvector (an extension for PostgreSQL) or a dedicated vector database like Pinecone, Weaviate, or Qdrant.

pgvector adds vector storage and similarity search directly to PostgreSQL. Your vectors live in the same database as your other data. Simple to operate, no new infrastructure, and good enough for most use cases up to tens of millions of vectors.

-- Enable the extension (once per database)
CREATE EXTENSION vector;

-- One 1,536-dimension vector per document, matching text-embedding-3-small
ALTER TABLE documents ADD COLUMN embedding vector(1536);

-- Approximate nearest-neighbour index using cosine distance
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- <=> is cosine distance: smallest distance = most similar
SELECT content
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 5;

Dedicated vector databases like Pinecone are purpose-built for vector operations. They offer better performance at massive scale, more sophisticated indexing algorithms, and features like namespace separation and metadata filtering that are more ergonomic than pgvector.
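
For a sense of what that ergonomics difference looks like, here's a rough sketch of a namespaced, metadata-filtered query with Pinecone's JavaScript client; the index name, namespace, and metadata field are invented for illustration:

import { Pinecone } from '@pinecone-database/pinecone';

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pc.index('docs'); // hypothetical index name

// Query one tenant's namespace, filtering on document metadata
const results = await index.namespace('tenant-a').query({
  vector: questionEmbedding,                // embedding from the earlier step
  topK: 5,
  filter: { source: { $eq: 'handbook' } },  // hypothetical metadata field
  includeMetadata: true
});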

The practical guidance: start with pgvector. It requires no new infrastructure, works with your existing PostgreSQL instance, and handles the scale of most applications. Move to a dedicated vector database if you’re storing hundreds of millions of vectors or need advanced filtering capabilities that pgvector doesn’t support well.

Caching LLM responses with Redis

LLM API calls have two problems: they’re slow (often 1-5 seconds) and expensive. Redis solves both with response caching.

import { createHash } from 'node:crypto';

// Deterministic, fixed-length cache key for an arbitrary prompt
const hashPrompt = (prompt) => createHash('sha256').update(prompt).digest('hex');

async function getLLMResponse(prompt) {
  const cacheKey = `llm:${hashPrompt(prompt)}`;

  // Return the cached answer if we've seen this exact prompt before
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }]
  });

  // Cache the answer for an hour
  const result = response.choices[0].message.content;
  await redis.set(cacheKey, JSON.stringify(result), 'EX', 3600);

  return result;
}

Cache embeddings too — embedding the same text repeatedly wastes API calls:

async function getEmbedding(text) {
  // hashText: same SHA-256 helper pattern as hashPrompt above
  const cacheKey = `embedding:${hashText(text)}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });

  // Embeddings are stable for a given model and text, so cache for 24 hours
  const embedding = response.data[0].embedding;
  await redis.set(cacheKey, JSON.stringify(embedding), 'EX', 86400);

  return embedding;
}

Storing conversation history

The pattern that works: PostgreSQL for persistence, Redis for active sessions.

// Persist every message durably in PostgreSQL
await db.query(
  'INSERT INTO messages (conversation_id, role, content) VALUES ($1, $2, $3)',
  [conversationId, role, content]
);

// Keep the 20 most recent messages hot in Redis for the active session
const cacheKey = `conversation:${conversationId}:messages`;
await redis.lpush(cacheKey, JSON.stringify({ role, content }));
await redis.ltrim(cacheKey, 0, 19);
await redis.expire(cacheKey, 3600);
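
The read side of that pattern might look like the sketch below: pull the hot window from Redis first, and fall back to PostgreSQL when the cache has expired. The created_at column is an assumption here; the INSERT above doesn't show the table's timestamp column.

async function loadRecentMessages(conversationId, limit = 20) {
  const cacheKey = `conversation:${conversationId}:messages`;

  // Hot path: newest-first entries kept by the LPUSH/LTRIM calls above
  const cached = await redis.lrange(cacheKey, 0, limit - 1);
  if (cached.length > 0) {
    return cached.map((m) => JSON.parse(m)).reverse(); // back to chronological order
  }

  // Cold path: cache expired, so read from PostgreSQL
  // (assumes a created_at timestamp column on messages)
  const { rows } = await db.query(
    `SELECT role, content
     FROM messages
     WHERE conversation_id = $1
     ORDER BY created_at DESC
     LIMIT $2`,
    [conversationId, limit]
  );
  return rows.reverse();
}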

The complete AI app architecture

Putting it all together, a production AI application looks like this:

A user request hits the application server and flows through five steps:

1. Check the Redis cache. If this exact prompt was seen before, return the cached response immediately and skip all remaining steps (cache hit → return instantly).

2. Get an embedding from OpenAI. Convert the user's question into a vector, checking Redis first; if this text was embedded recently, use the cached vector (Redis → OpenAI fallback).

3. Run a similarity search via pgvector. Find the 5 most semantically similar document chunks in PostgreSQL using the embedding from step 2 (pgvector cosine similarity).

4. Load conversation history. Fetch recent messages from Redis for speed, falling back to PostgreSQL for older history beyond the cache window (Redis for recent, PostgreSQL for the archive).

5. Call the LLM with full context. Send the question, retrieved documents, and conversation history to the LLM, then cache the response in Redis and persist it to PostgreSQL (OpenAI → Redis cache + PostgreSQL persist).

The response then goes back to the user.
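
Strung together in code, the whole flow might look something like this sketch. It reuses the hashPrompt and getEmbedding helpers and the db, redis, and openai clients from earlier; error handling, streaming, and the PostgreSQL fallback for older history are omitted to keep it short.

async function answerQuestion(conversationId, userQuestion) {
  // 01. Check Redis for an identical prompt
  const cacheKey = `llm:${hashPrompt(userQuestion)}`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // 02. Embed the question (Redis-cached, see getEmbedding above)
  const embedding = await getEmbedding(userQuestion);

  // 03. Similarity search via pgvector
  const { rows: docs } = await db.query(
    `SELECT content FROM documents ORDER BY embedding <=> $1 LIMIT 5`,
    [JSON.stringify(embedding)]
  );

  // 04. Recent history from Redis (newest first, so reverse to chronological)
  const recent = await redis.lrange(`conversation:${conversationId}:messages`, 0, 19);
  const history = recent.map((m) => JSON.parse(m)).reverse();

  // 05. Call the LLM with retrieved documents and history as context
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `Answer based on this context: ${docs.map(d => d.content).join('\n')}`
      },
      ...history,
      { role: 'user', content: userQuestion }
    ]
  });

  // Cache the answer in Redis and persist it to PostgreSQL
  // (the user message is stored the same way, omitted here)
  const answer = response.choices[0].message.content;
  await redis.set(cacheKey, JSON.stringify(answer), 'EX', 3600);
  await db.query(
    'INSERT INTO messages (conversation_id, role, content) VALUES ($1, $2, $3)',
    [conversationId, 'assistant', answer]
  );

  return answer;
}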

Getting started

The postgres-vector-starter template on our GitHub gives you PostgreSQL with pgvector enabled, Redis for caching, and boilerplate for embeddings and similarity search — all running locally with one docker compose up command.

It’s the fastest way to get this architecture running without spending hours on configuration.