Engineering May 2026

Why we used embedding models instead of LLMs — and when that's the right call

When we tell people FilingDrift doesn't use an LLM, the reaction is usually some version of: "...why not?" It's 2026. Everything uses LLMs. Your toaster probably has a system prompt.

There's a good reason — and it comes down to what kind of question you're actually trying to answer. We want to share the reasoning because we think it applies to a lot of problems where people reflexively reach for GPT or Claude when a much simpler, cheaper, and more reliable tool would do better.

The question we're asking

FilingDrift's core question is: "Is the language change in this filing unusual — for this company, in this year, relative to the whole corpus?"

Notice what that question requires:

Compare one document against the same company's previous documents
Weight each phrase by how unusual it is across the whole corpus of filings that year
Do this deterministically — the same filing must always produce the same score
Do this across ~4,900 companies, across 10+ years of filings

This is not a summarization task. It's not a question-answering task. It's a geometric comparison task: where does this document sit in semantic space, relative to a reference set?
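
To make "where does this document sit relative to a reference set" concrete, here is a minimal sketch. It assumes each filing has already been reduced to a single vector; the three-dimensional values and the function names (cosine, centroid, drift_from_reference) are illustrative placeholders, not FilingDrift's actual pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def drift_from_reference(doc_vec, reference_vecs):
    """Cosine distance between a document and the centroid of its reference set."""
    return 1.0 - cosine(doc_vec, centroid(reference_vecs))

# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
prior_filings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
similar_filing = [0.85, 0.15, 0.0]
shifted_filing = [0.2, 0.1, 0.9]

small = drift_from_reference(similar_filing, prior_filings)
large = drift_from_reference(shifted_filing, prior_filings)
print(f"similar filing drift: {small:.3f}, shifted filing drift: {large:.3f}")
```

A filing that sits close to the company's own history gets a distance near zero; one that has moved to a different region of the space gets a much larger one.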

For that specific task, embedding models are the right tool and LLMs are the wrong one. Here's why.

Five reasons embeddings win this task

1. Determinism

Run an embedding model on the same sentence twice, with the same weights and the same software stack, and you get the same vector. Run GPT or Claude on the same sentence twice and the output can differ. Temperature, sampling, and model updates all introduce variation. For a financial scoring system that needs to be auditable ("why did this company's score change?"), non-determinism is disqualifying.
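
The property is easy to check mechanically. The sketch below uses a hashed bag-of-words function (toy_embed, a hypothetical stand-in for a real embedding model) and shows one way an audit log could record a fingerprint of each vector. Neither function is taken from our codebase; they only illustrate the idea.

```python
import hashlib
import math

DIM = 16  # toy dimensionality, an assumption for this sketch

def toy_embed(sentence):
    """Deterministic hashed bag-of-words vector (stand-in for a real model)."""
    vec = [0.0] * DIM
    for token in sentence.lower().split():
        # hashlib is stable across processes; the built-in hash() is salted per run.
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def fingerprint(vec):
    """Short hash of a vector, suitable for storing next to a score in an audit log."""
    payload = ",".join(f"{x:.12f}" for x in vec)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

sentence = "We face significant liquidity risk"
first = toy_embed(sentence)
second = toy_embed(sentence)

same_vector = first == second
same_fingerprint = fingerprint(first) == fingerprint(second)
print(same_vector, same_fingerprint)
```

If a score changes between runs, comparing stored fingerprints tells you whether the input vectors changed or something downstream did.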

2. No context-window bottleneck

10-K filings run 70,000–150,000 words. Even today's large-context LLMs can hold at most a handful of these at once. We need to compare a filing against the whole corpus and against multiple prior filings. Embeddings handle this naturally: embed each sentence offline, store the vectors, and compare them in arbitrary combinations whenever you need. Because each input is a single sentence, nothing gets truncated, and there is no "which 30% of the document do we send to the model?" decision.
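
A rough sketch of the "embed once, compare in any combination" pattern follows. The ticker ACME, the sentences, the store layout, and the toy_embed stand-in are all hypothetical; a real system would call an actual sentence-embedding model in the offline step.

```python
import hashlib
import math

DIM = 256  # toy dimensionality, an assumption for this sketch

def toy_embed(sentence):
    """Deterministic hashed bag-of-words unit vector (stand-in for a real model)."""
    vec = [0.0] * DIM
    for token in sentence.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def dot(u, v):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(a * b for a, b in zip(u, v))

# Offline step: embed every sentence once and keep the vectors.
filings = {
    ("ACME", 2022): ["Our liquidity position is adequate",
                     "We operate in competitive markets"],
    ("ACME", 2023): ["We face significant liquidity risk",
                     "We operate in competitive markets"],
}
store = {key: [toy_embed(s) for s in sents] for key, sents in filings.items()}

def novelty(new_key, old_key):
    """Per sentence in the newer filing: 1 minus its best match in the older one."""
    return [1.0 - max(dot(v, w) for w in store[old_key]) for v in store[new_key]]

# Online step: any pair of stored filings can be compared without re-embedding.
scores = novelty(("ACME", 2023), ("ACME", 2022))
print([round(s, 3) for s in scores])
```

The unchanged sentence scores near zero, and the new liquidity sentence stands out. The same store can serve year-over-year, peer-to-peer, or corpus-wide comparisons.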

3. Corpus-wide comparison at scale

The critical insight: the language of Silicon Valley Bank (SVB) wasn't just negative; its change was unusual across the whole corpus that year. To make that comparison, you need every company's sentences embedded in the same vector space. With embeddings, this is a nearest-neighbor search across a pre-built index: fast, offline, cheap. With an LLM, you'd need to somehow send thousands of documents simultaneously and ask "which of these is most different?" That's not how LLMs work.
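
As an illustration of what "nearest-neighbor search across a pre-built index" means, here is a brute-force version over a tiny index. The company labels and vectors are made up for the example, and at real corpus scale an approximate nearest-neighbor index would normally replace the linear scan.

```python
import heapq
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Pre-built index of (company, unit vector) pairs; values are hypothetical.
index = [
    ("BANK_A", normalize([1.0, 0.0, 0.0])),
    ("BANK_B", normalize([0.9, 0.1, 0.0])),
    ("BANK_C", normalize([0.0, 1.0, 0.0])),
    ("BANK_D", normalize([0.0, 0.0, 1.0])),
]

def nearest(query, k=2):
    """Return the k (similarity, company) pairs closest to the query by cosine."""
    q = normalize(query)
    scored = ((sum(a * b for a, b in zip(q, vec)), name) for name, vec in index)
    return heapq.nlargest(k, scored)

top = nearest([1.0, 0.02, 0.0], k=2)
names = [name for _, name in top]
print(names)
```

A sentence whose nearest neighbors are all far away is, by construction, one that few other companies are writing.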

4. No hallucination

The score is computed from actual vectors derived from actual sentences in the actual filing. There is no generation step. The score cannot contain information that wasn't in the document, because it's a mathematical operation on the document's content. An LLM summarizing a 10-K might confidently tell you things that are plausible but wrong. Our system can only tell you what the geometry of the document's language looks like.

5. Cost and latency at corpus scale

We process ~4,900 companies with multi-year filing histories, which adds up to hundreds of thousands of sentences. At LLM API prices, a single analysis pass over the full corpus would cost hundreds of dollars. Embedding models run locally, cost fractions of a cent per document, and complete in minutes. We rebuild the full corpus score cache every 3 hours, a cadence that is not economically viable with LLM API calls.

When LLMs are the right tool

None of this means LLMs are bad — just that they're solving a different class of problem. LLMs are excellent when you need:

Summarization of a single document. "Explain the risk factors section of this 10-K in plain English" — perfect LLM task.
Structured extraction. "Pull all the dollar amounts and dates from this filing into a table" — LLMs handle this well.
Question answering. "Does this company mention going concern uncertainty?" — faster with an LLM than manually reading.
Generating analysis drafts. Once you have a score and flagged sentences, an LLM can help write the narrative explaining why.

The pattern: LLMs are good at single-document tasks where you need flexible language understanding and are okay with probabilistic output. Embeddings are good at cross-document comparison tasks where you need deterministic, geometric reasoning at scale.

The model we use

We use a sentence transformer from the SBERT family, trained specifically for semantic similarity tasks. It captures meaning well enough to distinguish "we face significant liquidity risk" from "we believe our liquidity position is adequate" — which is exactly the kind of distinction that matters here.

It runs locally in under a second per document on CPU. The entire corpus re-embeds in minutes. It's not the most powerful model in the world — but for this specific task, "most powerful" is not what matters. What matters is stable geometry, fast inference, and consistency across runs.

The broader point: The NLP toolbox has more than one tool. Transformer-based embedding models have been solving similarity and retrieval problems efficiently for years, with properties (determinism, speed, geometric interpretability) that make them well-suited for auditable analytical systems. "Just use an LLM" is often the right call; sometimes it's reaching for a jackhammer when a precise chisel is what the job requires.

IDF weighting: why not all sentences are equal

One more piece worth explaining: we don't treat every sentence equally. We weight sentences by their inverse document frequency (IDF) across the full corpus — a technique borrowed from classic information retrieval.

In plain terms: a sentence that every company in our corpus uses scores low regardless of its content, because it's boilerplate. A sentence that only one company is using scores high. "We are subject to various risks and uncertainties" appears in essentially every 10-K ever filed — it carries no signal. "Our held-to-maturity portfolio has unrealized losses of $X, which may require liquidation at a loss to fund deposit withdrawals" — if few companies across the corpus are writing that, it's informative.
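
A minimal sketch of sentence-level IDF, under a simplifying assumption: two sentences count as "the same" only if they match exactly after lowercasing and whitespace normalization. The company labels and the helper names (normalize, sentence_idf) are hypothetical, and matching on embedding similarity instead of exact text would follow the same shape.

```python
import math
from collections import Counter

# Hypothetical three-company corpus.
corpus = {
    "CO_A": ["We are subject to various risks and uncertainties.",
             "Our held-to-maturity portfolio has unrealized losses."],
    "CO_B": ["We are subject to various risks and uncertainties.",
             "We operate retail stores."],
    "CO_C": ["We are subject to various risks and uncertainties.",
             "We operate retail stores."],
}

def normalize(sentence):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(sentence.lower().split())

def sentence_idf(corpus):
    """IDF per sentence: log(number of companies / companies using the sentence)."""
    n = len(corpus)
    df = Counter()
    for sentences in corpus.values():
        # Count each sentence at most once per company.
        for s in {normalize(x) for x in sentences}:
            df[s] += 1
    return {s: math.log(n / count) for s, count in df.items()}

idf = sentence_idf(corpus)
boilerplate = idf[normalize("We are subject to various risks and uncertainties.")]
shared = idf[normalize("We operate retail stores.")]
unique = idf[normalize("Our held-to-maturity portfolio has unrealized losses.")]
print(boilerplate, round(shared, 3), round(unique, 3))
```

The sentence every company uses gets a weight of exactly zero, the one shared by two companies gets a small weight, and the one only a single company writes gets the largest.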

IDF weighting is what makes the score sensitive to the specific and unusual, rather than amplifying the routine. An LLM reading a 10-K has no natural way to know which sentences are corpus-wide boilerplate. An embedding model combined with an IDF index does.
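
One simple way to combine the two signals is an IDF-weighted average of per-sentence novelty. This is a sketch of the general idea, not our exact scoring formula; the function name weighted_drift and all numbers are illustrative.

```python
def weighted_drift(novelty, weights):
    """IDF-weighted mean of per-sentence novelty; 0.0 if every weight is zero."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(n * w for n, w in zip(novelty, weights)) / total

# Hypothetical filing with three sentences:
#   1. newly added boilerplate (novel for this company, common corpus-wide)
#   2. an unchanged, moderately rare sentence
#   3. a newly added sentence that few companies write
novelty = [0.8, 0.1, 0.9]
weights = [0.0, 0.4, 1.1]

unweighted = sum(novelty) / len(novelty)
weighted = weighted_drift(novelty, weights)
print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
```

With the weights applied, the new boilerplate contributes nothing and the rare new sentence dominates the result.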

Questions or pushback on the methodology? We like both. hello@filingdrift.com