PythonMCPSQLiteAIArchitecture

Building PKB MCP: Turning My Digital Chaos into a Queryable Brain

Abhishek Joshi

April 17, 2026 · 14 min read

I've had a bad habit for years: I save things I'll never find again. YouTube videos I fully intend to watch. Markdown notes half-finished in Obsidian. Bookmarks dumped into a folder called "read later" that functions more as a black hole. One day I searched my own Obsidian vault for something I was sure I'd written and found nothing — then found the note two weeks later by accident. That was the moment I decided to actually fix it.

The result is PKB MCP — a locally-hosted Model Context Protocol server that ingests my notes, bookmarks, and YouTube watch history, stores them in a searchable local database, and connects to Claude Desktop so I can query my own knowledge base in plain English. It currently has 1,430+ YouTube videos ingested, my Obsidian vault connected, and 10 MCP tools live.

This post is about the design decisions that shaped it — the things I chose deliberately, the things I tried and discarded, and the tradeoffs I'd make differently if I started over.

Why MCP, and why local?

The Model Context Protocol is Anthropic's open standard for connecting AI models to external tools and data sources. It gives Claude a structured way to call functions — search my knowledge base, sync a YouTube playlist, answer a question with citations — rather than stuffing a huge context window with raw text.

Running it locally was a deliberate constraint. My Obsidian vault has health logs, financial planning notes, and journal entries. I wasn't going to pipe that through a third-party cloud service. Local-first also means no latency on the retrieval side — only the final answer_question call touches the Anthropic API, and even that is swappable with a local Ollama model in a future phase for fully air-gapped operation on sensitive sources.

The tradeoff is that "local" means the embedding model, vector store, FTS index, and file watcher all run on my Mac. On an M-series chip this is completely fine — the 90MB all-MiniLM-L6-v2 model loads in under two seconds and inference is fast. But it's a real constraint if you wanted to run this on a headless server or share it across devices. I stubbed NOTE[remote] swap points throughout the codebase to make a future HTTP deployment a configuration change, not a rewrite.

The storage decision: SQLite + ChromaDB

Most vector search projects jump straight to a vector database and treat it as the source of truth. I didn't, and this was one of the most important early decisions.

Vector search is great at semantic similarity — finding concepts, meanings, and topics. It's terrible at structured filtering. If you want "Kurzgesagt videos shorter than 15 minutes uploaded in 2023," you don't want to embed that query and run cosine similarity against 1,400 video descriptions. You want a SQL WHERE clause with channel LIKE '%Kurzgesagt%' AND duration_seconds <= 900 AND uploaded_at >= '2023-01-01'. That's what SQLite is built for.

So the architecture uses two stores with clear, non-overlapping responsibilities:

SQLite with FTS5 — the system of record. All 22 columns of metadata per note: channel, duration, view count, upload date, language, categories, soft-delete status. FTS5 provides BM25-ranked keyword search over the full text. This is where structured browse queries and exact lookups live.
ChromaDB — the semantic layer. Stores 384-dimensional embeddings per chunk with a pointer back to the SQLite note ID. Handles meaning-based retrieval only.

The key insight: ChromaDB is a cache of embeddings, not the canonical store. If it gets corrupted or I want to rebuild it, I can re-embed from the raw_text preserved in SQLite. This also makes soft delete and restore possible — when a note is soft-deleted, its ChromaDB chunks are removed immediately, but the raw text lives on in SQLite. Restoring re-embeds from that preserved text.

One non-obvious gotcha: SQLite's FTS5 doesn't automatically clean up its inverted index when you hard-delete a row. Without a custom AFTER DELETE trigger, FTS5 retains stale references that cause "missing row" errors on subsequent queries. I learned this the painful way and added three triggers: notes_ai, notes_au, and notes_ad to keep the FTS index perfectly in sync with the main table.

Hybrid search and why pure vector search wasn't enough

Early on I tried running purely semantic vector search. It felt magical for vague, concept-driven queries — "space exploration videos" would surface relevant results I'd never have found with keywords. But it had a blind spot that revealed itself quickly: exact lookups.

If I search for a specific YouTube URL, a video ID, or a channel name like "3Blue1Brown," vector search produces mediocre results. The embedding for "3Blue1Brown" doesn't cluster neatly near the embeddings of videos by 3Blue1Brown — proper nouns are notorious for this. Keyword search, on the other hand, handles this trivially.

The solution is hybrid search with Reciprocal Rank Fusion (RRF). Both FTS5 and ChromaDB return ranked result lists independently, and RRF merges them using:

score = 1 / (60 + rank_fts + 1) + 1 / (60 + rank_vector + 1)

Notes that appear in both lists score highest. Notes that appear in only one list still surface if they scored well there. The constant 60 is the standard RRF smoothing factor — it prevents top-ranked results from dominating so completely that lower-ranked items from the other list are ignored. The result is a merged ranking that's better than either system alone.

The tradeoff is latency: every hybrid query runs two lookups in parallel. In practice this is imperceptible locally, but it's worth knowing that fts_only=true is available on search_knowledge when you need the speed of a pure keyword lookup.

The YouTube Watch Later problem

This was the most annoying technical constraint in the whole project, and it forced an architecture I wouldn't have chosen otherwise.

Google intentionally blocks the Watch Later playlist (WL) from the YouTube Data API. If you call playlistItems.list with playlistId=WL using valid OAuth2 credentials, you get zero results. It's not a bug — it's documented. Watch Later is treated as private infrastructure, not a queryable resource.

The workaround is a hybrid approach:

yt-dlp fetches the Watch Later video IDs using extract_flat=True — a single shallow page request using Chrome cookies, not per-video scraping. This gives me the ID list fast.
YouTube Data API v3 then fetches full metadata for each ID via videos.list: title, channel, description, duration, view count, upload date, categories, language, thumbnail.

Custom playlists use the API exclusively — no browser cookies needed. The hybrid is only required for Watch Later. I don't love the cookie dependency for that path, but the alternative was either no Watch Later support or scraping full page HTML, which is far more fragile.

The API free quota is 10,000 units/day, with videos.list costing 1 unit per video. My 1,430-video initial sync consumed 14% of the daily quota. Incremental syncs are cheap — the delta reconciliation skips videos that are already fully enriched in the database.

Delta reconciliation: keeping the KB honest

Syncing data in is easy. Knowing when data should be removed is harder.

If I remove a video from my Watch Later playlist, or delete a note from my Obsidian vault, I want that reflected in the knowledge base on the next sync. Without this, the KB slowly drifts from reality — accumulating ghost entries for things I've deliberately removed.

The reconciliation logic is a set-difference operation:

existing_ids = list_ids_by_source(source_name)   # active notes in DB
current_ids  = {note_id for each item in current playlist/vault}
removed      = existing_ids - current_ids
→ soft_delete each removed ID

This runs at the end of every sync. The deleted_from column records the cause: playlist_removed for YouTube, file_deleted for Obsidian, user for manual deletions. That audit trail has been genuinely useful — I've recovered notes I thought I wanted gone.

Two-tier deletion: the case for soft delete

I built two distinct deletion modes, and the distinction matters more than it first seems.

Soft delete marks a note as inactive (status='deleted'), removes its chunks from ChromaDB immediately so it stops appearing in search results, but preserves the full raw_text in SQLite. Every read query includes AND status='active', so soft- deleted notes are invisible to all normal operations. But they're recoverable — restore_item re-embeds from the preserved raw text and brings the note back fully.

Hard delete permanently removes the row from SQLite and its chunks from ChromaDB. I added a safeguard: get_by_id_any_status() allows hard-deleting a note that's already soft-deleted, without requiring a restore step first. This matters because otherwise you'd have to restore a note before you could permanently remove it, which is exactly backwards from intuitive behavior.

The tradeoff is storage. Soft-deleted notes accumulate over time — sync removals can build up quickly if you're regularly cycling content. A periodic purge step (make purge-deleted) hard- deletes rows soft-deleted more than 90 days ago, giving a configurable grace period before permanent loss.

Choosing the embedding model

I went with all-MiniLM-L6-v2 from sentence-transformers. At ~90MB and 384 dimensions, it's not the most powerful embedding model available — but it was the right choice for this use case for three reasons.

First, it runs fast on CPU. An M-series Mac handles real-time ingestion without perceptible lag. A larger model like all-mpnet-base-v2 would produce better embeddings but takes 3-4x longer per batch — a meaningful difference when you're ingesting 1,400 videos in a single session.

Second, 384 dimensions is more than sufficient for the retrieval task here. The queries are natural language questions against a knowledge base of personal notes and video descriptions. We're not doing cross-lingual retrieval or code search — the semantic complexity is moderate and the model handles it well.

Third, it's entirely local. No API key, no usage cost, no data leaving the machine. For a personal knowledge base this matters both for privacy and for cost.

The test suite: why the fixture design matters

The test architecture has one decision I'm particularly proud of: the shared_embedder fixture is session-scoped.

all-MiniLM-L6-v2 takes about 2 seconds to load. With 8 integration tests, each creating their own embedder instance, that's 16 seconds just in model loading — before any actual test logic runs. Session-scoped fixtures in pytest load once per run and share the instance across all tests. The integration suite runs in ~5 seconds total.

Integration tests use a mock_embedder that returns deterministic fake 384-dim vectors instantly. This means the pipeline logic — deduplication, storage routing, classification, metadata handling — is tested correctly without any neural network inference. Real embedding quality is validated by the smoke tests against the live server, which are the only tests that touch actual embeddings.

The three-layer strategy (unit → integration → smoke) maps cleanly to the three things I care about: individual function correctness, full pipeline correctness with real storage, and end-to-end tool behavior against the live MCP server. Running make test after every change gives fast feedback without touching the production database.

What I'd do differently

A few things I'd reconsider with fresh eyes:

Obsidian vault sync via API, not file watching. The current approach watches the vault folder with watchdog and re-ingests on every save. It works, but Obsidian's own plugin ecosystem has a proper API surface. A dedicated plugin that pushes change events would be more reliable and wouldn't require the vault path to be on the local filesystem.
Chunk IDs should be content-addressed. Currently chunks are keyed as {note_id}_chunk_{index}. If a note is updated and re-chunked, old chunk IDs need to be explicitly deleted and new ones inserted. Content-addressed IDs (based on a hash of the chunk text) would make upserts idempotent and eliminate the delete-then-reinsert cycle.
The YouTube quota constraint is real. 10,000 API units/day sounds like a lot until you realize a large initial sync can dent it meaningfully. I'd add a rate-limiting layer and queue mechanism earlier, rather than relying on the caller to be careful about when they trigger a full re-sync.

Current state and what's next

The server is in daily use. I query it through Claude Desktop asking things like "what have I saved about distributed systems?" or "find chess videos shorter than 10 minutes" — and it works. Finding something I saved six months ago now takes seconds instead of a resigned shrug.

The next phase adds cloud connectors (Apple Notes via AppleScript, OneDrive via Graph API, Gmail) and a local LLM path via Ollama so that sensitive sources like journal entries and health logs never touch an external API at all. The remote deployment path is already fully stubbed — activating it is a config change.

The best personal tools are the ones you build for the version of yourself that's tired of the problem, not the idealized version that would have been more organized in the first place.

The full source is on GitHub at github.com/Abhishek-P-Joshi/pkb-mcp — including the README that covers the architecture, data flows, and setup in full detail.