Frequently Asked Questions

Common questions about the methodology, data, and what the score means.

What is FilingDrift?

FilingDrift is a tool that reads SEC 10-K annual and 10-Q quarterly filings and scores language change — how much the wording has shifted compared to the prior period, and how unusual that shift looks compared to the whole corpus in the same year. We flag the outliers. You decide what they mean.

We are not financial analysts, economists, or credit rating agencies. We are engineers who built a language scoring system over a corpus of public filings, and we make the output available so you can add it to your own research process.

How do you compute the score?

The score combines two things:

What's new or escalating. Phrases that appeared for the first time, or dramatically increased year-over-year, weighted by how rarely other companies use them. Boilerplate that every company uses scores low. Language specific to this company, in this year, scores high.
Where the language is heading. A measure of how much the language has shifted overall, focused on the sections most predictive of distress — risk factors, liquidity disclosures, and the management discussion. This catches concept-level drift that keyword lists miss.

Both components are calibrated against the healthy companies in the corpus. The 95th percentile of that group is the control ceiling — scores above it are flagged. See the About page for more detail.
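To make the first component and the ceiling concrete, here is a minimal sketch, not the production scoring code. It assumes an IDF-style rarity weight and a nearest-rank percentile; the function names, the weighting formula, and the toy counts are all illustrative assumptions.

```python
import math

def escalation_score(prev_counts, curr_counts, doc_freq, n_companies):
    """Sum of year-over-year phrase increases, each weighted by corpus rarity."""
    score = 0.0
    for phrase, curr in curr_counts.items():
        increase = curr - prev_counts.get(phrase, 0)
        if increase <= 0:
            continue  # only new or escalating phrases contribute
        # Assumed IDF-style weight: phrases used by few companies count for more.
        rarity = math.log(n_companies / (1 + doc_freq.get(phrase, 0)))
        score += increase * max(rarity, 0.0)
    return score

def control_ceiling(healthy_scores, pct=95):
    """Nearest-rank percentile of the healthy-company scores (integer math)."""
    ordered = sorted(healthy_scores)
    rank = (pct * len(ordered) + 99) // 100  # ceiling division
    return ordered[rank - 1]

# Toy data: one boilerplate phrase used by nearly everyone, one rare phrase.
doc_freq = {"forward-looking statements": 999, "going concern": 9}
prev = {"forward-looking statements": 5}
curr = {"forward-looking statements": 8, "going concern": 4}

score = escalation_score(prev, curr, doc_freq, n_companies=1000)
ceiling = control_ceiling([float(s) for s in range(1, 21)])
flagged = score > ceiling
```

In this toy case the boilerplate phrase contributes nothing, because its rarity weight is zero, and the whole score comes from the rare phrase that appeared for the first time.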

How is this different from asking ChatGPT?

We don't use ChatGPT, Claude, or any large language model. This is worth being direct about because it changes the reliability properties entirely.

FilingDrift scores filings using a deterministic algorithm — the same filing always produces the same score. There is no text generation, no prompting, no summarization that might be inaccurate. The output is a number, computed from the actual language in the document.

No hallucination. The score is computed from actual sentences in the filing. We don't generate text or infer things the document doesn't say.
No truncation. 10-Ks are 100–200 pages. LLMs cut off at their context window. We process the entire document.
No one-shot snapshots. ChatGPT sees one filing. We compare it to every prior filing from the same company, and to every other company's filing from the same year.
Repeatable. Run it twice on the same filing and you get the same score. ChatGPT gives you a different answer every time.

ChatGPT is good for summarizing things you already understand. FilingDrift is for detecting the drift you wouldn't otherwise notice.

If the score is deterministic, why did it change since the last time I checked?

The scoring algorithm is fully deterministic: given the same filings and the same corpus snapshot, it will always produce the same score. But the corpus changes as we ingest new companies.

Each score measures how unusual a company's language change is relative to the rest of the corpus that year. As we add more companies — we currently cover 4905+ and are expanding continuously — the corpus-wide distribution shifts. A company that looked like a strong outlier in a 500-company corpus may look less extreme in a 5,000-company corpus if the rest of the corpus shows similar patterns. (Normalization is corpus-wide, not by industry sector.)

If you run the same score on the same corpus snapshot twice, you will always get identical results. The score you see today may differ from last month not because of any change in the underlying filing, but because the corpus has grown. This is expected behavior, not a bug. We plan to publish corpus version numbers so you can track which corpus snapshot was used for any given score.
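A toy illustration of why a deterministic score can still move. It uses a plain percentile rank as a stand-in for the corpus normalization, which is an assumption for illustration; the values and the `percentile_rank` helper are hypothetical.

```python
def percentile_rank(value, corpus_values):
    """Share of corpus values strictly below `value`, as a percentage."""
    below = sum(1 for v in corpus_values if v < value)
    return 100.0 * below / len(corpus_values)

raw_change = 8.0  # the company's own language change; it never changes here

small_corpus = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
# Hypothetical newly ingested companies, many with similarly large changes.
grown_corpus = small_corpus + [8.5, 9.0, 9.5, 11.0, 12.0, 7.5, 8.2, 9.9, 10.5, 13.0]

rank_before = percentile_rank(raw_change, small_corpus)
rank_after = percentile_rank(raw_change, grown_corpus)

# Same inputs, same snapshot: the result is identical on every run.
rank_repeat = percentile_rank(raw_change, small_corpus)
```

The filing's raw value is the same in both calls. Only the reference distribution changed, so the company looks less extreme against the larger corpus.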

How is this different from just reading the 10-K myself?

You could read SVB's 2022 10-K and notice the phrase "unrealized losses" appears frequently. What you can't easily do: know that the phrase was relatively rare across the whole corpus that year, that SVB's usage increased sharply year-over-year, and that the sentence embeddings of those paragraphs place them semantically closer to distress language than most filings.

The value isn't reading one filing. It's knowing where that filing's change sits relative to the whole corpus, at the same moment in time. That's the cross-sectional comparison you can't do by hand.

Are all the validation numbers from the same dataset?

No — and this is important. We use two distinct datasets that answer different questions:

Labeled crisis set (~43 companies). Hand-selected companies with known bankruptcy / distress events. Used for recall and precision on crisis detection only.
Full corpus (4905+ companies). All ingested companies, no selection. Used for the forward-return backtest (7069 flag events). No hand-curation, just every company we've processed.

The forward-return backtest numbers (−8.6% alpha at 1yr, 58% underperforming) come from the full 4905-company corpus, not the 43-company labeled set. The recall / precision stats come from the labeled set only. They are measuring different things.

How accurate is it?

Two ways to measure this, which answer different questions:

Bankruptcy detection — labeled set (~43 companies)

On crisis companies with at least 2 pre-event filing pairs, the model detected the majority before the collapse. Precision ~50% — about 1 in 2 flags co-occurred with a labeled crisis event. Full detail on the About page. This number is computed on a curated set of known failures, not the general corpus.

Forward-return signal — full corpus (4905 companies, 7069 flag events)

Flagged companies underperformed SPY by a median of −8.6% at 12 months, −14.8% at 24 months, and −22.4% at 36 months. 58% of flag events had negative alpha at 12 months vs. ~50% expected by chance. This is computed on every company we've processed — no hand-selection.
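For readers who want to reproduce this kind of summary on their own data, a minimal sketch of the two statistics quoted above: median excess return and share of events with negative excess return. The returns below are made up, and pairing each flag event with the benchmark return over the same window is an assumption about how the comparison is set up.

```python
from statistics import median

def excess_return_summary(flag_returns, benchmark_returns):
    """Median of (flagged minus benchmark) and the share of negative values."""
    excess = [f - b for f, b in zip(flag_returns, benchmark_returns)]
    share_negative = sum(1 for e in excess if e < 0) / len(excess)
    return median(excess), share_negative

# Hypothetical 12-month returns for five flag events and SPY over the same windows.
flagged = [-0.30, 0.05, -0.10, 0.20, -0.02]
spy = [0.10, 0.10, 0.08, 0.12, 0.05]

median_excess, share_negative = excess_return_summary(flagged, spy)
```

The median is used rather than the mean so that a handful of total collapses or sharp rebounds does not dominate the figure.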

This is a screening signal, not a verdict. It is most useful as one layer in a larger research process, not as a standalone buy/sell decision.

Are the backtest results statistically significant?

Yes — decisively, on both measures. With 7069 flag events the sample is large enough to distinguish a real signal from noise.

Metric	1 yr	2 yr	3 yr
% of flags underperforming SPY	58%	61%	63%
Binomial p-value vs 50% null	<10⁻¹⁵	<10⁻¹⁵	<10⁻¹⁵
Median excess return (flagged minus SPY)	−8.6%	−14.8%	−22.4%
Wilcoxon signed-rank p-value vs 0	<10⁻¹⁵	<10⁻¹⁵	<10⁻²⁰

Two caveats worth stating explicitly. First, statistical significance is not the same as practical significance. The effect size matters: a median alpha of −8.6% at one year is meaningful for credit monitoring or due diligence work, but it is not the kind of edge that supports a systematic short strategy on its own (execution costs, position sizing, and timing all matter). Second, these numbers come from a corpus of ~4,900 companies, not a random sample of the entire market. Whether they generalize beyond companies that happen to have made it into EDGAR is an open question.

What the p-values rule out is luck. The pattern is real and large enough that it cannot be explained by random variation across 7069 flag events.
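As a rough check on the binomial row, the tail probability can be computed with the standard library alone. This sketch assumes a one-sided test against a 50% null and treats flag events as independent, which real flag events (several per company, overlapping windows) may not be. The count of underperforming events is reconstructed from the rounded 58% figure, so it is approximate.

```python
import math

def binomial_upper_tail_log10(n, k):
    """log10 of P(X >= k) for X ~ Binomial(n, 0.5), computed in log space."""
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + n * math.log(0.5)
        for i in range(k, n + 1)
    ]
    peak = max(log_terms)
    # log-sum-exp keeps the tiny probabilities from underflowing to zero
    log_p = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return log_p / math.log(10)

n_flags = 7069
n_under = round(0.58 * n_flags)  # approximate: 58% is a rounded figure

log10_p = binomial_upper_tail_log10(n_flags, n_under)
below_table_bound = log10_p < -15
```

Under these assumptions the exponent comes out far below −15, consistent with the bound reported in the table. Any dependence between flag events would make the true p-value larger than this calculation suggests.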

Isn't the underperformance just because your corpus is small-cap and the S&P 500 is dominated by mega-cap tech?

This is the right question, and it's largely correct: most of the raw vs-SPY underperformance is the small-cap size effect. That's exactly why we don't lead with it. Two responses:

It's not the absolute return. The below-ceiling bucket shows the same raw underperformance, which is what a size effect would produce. The distress signal is the lift (moderate-flagged companies reach a distress outcome about 1.2× the corpus base rate) and the 75% labeled recall.

Factor-adjusted. The portfolio version regresses returns on the Fama-French factors, which include the size factor (SMB) explicitly, plus momentum and quality. After all of that, the most-stable-language quintile still earns 93.2 bps/month (t=9.16) over the full 2000–2026 period.

SEC filings are public the moment they're filed. Why hasn't the market already priced this in?

A 10-K is 60–200 pages of dense legalese. Analysts read the highlights, scan for headline numbers, and move on. Almost nobody reads the full risk factors and MD&A sections word-for-word, and even fewer do it with a memory of exactly what those same sections said last year, or what every competitor wrote this year.

The signal we detect is not in any single sentence — it's in the pattern of change across a full document, measured against a cross-section of ~4930 filings from the same year. A company can write three sentences about covenant headroom that look innocuous in isolation, but are unusual relative to the whole corpus that year. No analyst catches that without a systematic tool.

This is not a new idea. The Loughran–McDonald research on 10-K language (2011, 2016) demonstrated statistically significant forward return predictability from filing text. Our corpus-wide normalization addresses a limitation of that work — distinguishing company-specific deterioration from market-wide language shifts (e.g., every company mentioned interest rate risk in 2022). That refinement is where the incremental signal lives.

Markets are efficient at processing structured data (EPS, revenue, guidance). They are much less efficient at processing high-dimensional, cross-sectional textual change at the sentence level. That gap is what FilingDrift occupies.

The backtest shows 41% of flags didn't underperform at 1 year. Isn't that your false positive rate?

41% is not a false positive rate in the meaningful sense. It's the share of flag events where the stock didn't underperform the S&P 500 within 12 months. Those are different things.

A genuine false positive would be: a company whose language genuinely escalated relative to its own history and the corpus, but where the company was financially fine. That category is real — mergers, regulatory changes, and one-time restructurings can all produce distress-adjacent language without actual distress. We mention RTX 2020 as an example on the About page.

But "didn't underperform at 12 months" has three possible explanations:

Delayed signal. The distress is real but the collapse happens at 18–36 months, not 12. The same flag events that show 58% underperformance at 12 months show 61% at 24 months and 63% at 36 months. The signal strengthens with time.
Resolved event. The company raised capital, changed management, or restructured between the filing date and the 12-month measurement. The language was accurate; the outcome was avoided.
Genuine false positive. The language escalated but there was no underlying distress.

We can't separate these three categories without individual case review. What we can say: most of the raw vs-SPY underperformance is the small-cap size effect, so we don't lean on it. The distress evidence is the lift (moderate-flagged companies reach a distress outcome about 1.2× the corpus base rate) and the share of flags with negative alpha: 58% at 12 months, rising to 63% at 36 months, vs. the ~50% expected by chance. The signal is real; it is not a precise 12-month trading clock.

What should I do when a company is flagged?

A flag means the language in a company's 10-K or 10-Q has shifted significantly relative to its own prior filings and relative to the rest of the corpus in the same year. It's a prompt to look closer — not a directive to act.

Useful first steps: read the specific anomalous sentences we surface on the company page — they're the actual text from the filing that drove the score. Then check whether the same language was rising across the corpus that year (if so, it may be macro, not company-specific). Then look at the score arc — is this a one-year spike or a multi-year escalation?

A spike paired with a multi-year trend and corpus-unusual language is the strongest combination. A single anomalous year with language that rose corpus-wide is weaker. The tool gives you the data; the interpretation is yours.
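If you track flags in your own tooling, the reading order above can be turned into a rough triage helper. This is a hypothetical sketch, not a feature of the product: the three-filing window for "multi-year escalation", the function name, and the labels are all assumptions.

```python
def classify_flag(score_history, ceiling, language_rose_corpus_wide):
    """Rough triage of the latest score, oldest score first in `score_history`."""
    if score_history[-1] <= ceiling:
        return "not flagged"
    # Assumed definition of a multi-year escalation: three strictly rising scores.
    rising = (
        len(score_history) >= 3
        and score_history[-3] < score_history[-2] < score_history[-1]
    )
    if rising and not language_rose_corpus_wide:
        return "strongest: multi-year trend with corpus-unusual language"
    if not rising and language_rose_corpus_wide:
        return "weaker: single-year spike, language rose corpus-wide"
    return "mixed: read the surfaced sentences before deciding"

strong_case = classify_flag([1.0, 2.0, 3.5], ceiling=3.0, language_rose_corpus_wide=False)
weak_case = classify_flag([1.0, 1.0, 3.5], ceiling=3.0, language_rose_corpus_wide=True)
```

The helper only sorts flags into a reading queue. Reading the anomalous sentences themselves is still the step that decides what a flag means.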

What's the difference between semantic drift and sentiment analysis?

Sentiment analysis assigns a positive/negative score to a piece of text. "We face significant liquidity risks" is negative. That's useful but shallow — most companies use cautious legal boilerplate, so everything scores slightly negative all the time.

Semantic drift is different. We're not asking "is this sentence negative?" We're asking: "Is this sentence semantically different from what this company said last year, and unusual relative to what the whole corpus said this year?" A company that shifts from standard risk-factor boilerplate to language structurally similar to sentences found in bankruptcy filings — that's drift. The sentiment score might be unchanged. The semantic position has moved.

The other key difference: drift is relative. SVB mentioning "unrealized losses" is only meaningful because they mentioned it sharply more than last year and the phrase was relatively rare across the corpus that year. Sentiment analysis looks at each sentence in isolation.

Isn't this survivorship bias? You picked companies you already knew failed.

It's a fair challenge. We selected crisis companies after the fact — SVB, BBBY, Rite Aid, Party City — because they had documented collapse events with known dates. That selection process can't introduce bias into the scoring algorithm itself (which is deterministic and has no knowledge of the outcome), but it absolutely could bias how we report results. We've tried to address this three ways.

First, the full results table, including the misses (Party City, Revlon, Silvergate), is at /about. Second, we include 30 healthy control companies in the ceiling computation and report the false positive rate: 8 of 30 exceeded the ceiling at some point. Third, we're expanding the corpus continuously rather than hand-selecting the most favorable set.

The deeper version of the question is: "If I had been watching a random set of 500 companies in 2022, would FilingDrift's elevated scores have been actionable, or would they have been drowned out by false positives?" That's the right test — and it's what we're building toward as the corpus grows.

There's a second, distinct survivorship issue on the factor side: the quintile backtest runs on companies that still file, so names that have since delisted are absent from the universe — which inflates the long (stable-language) side. We bound it by rebuilding the universe with 4,716 delisted companies and modeling each as a total loss (a deliberate upper bound): the corrected edge shrinks and concentrates in small-caps (micro-cap Q1–Q5 ≈ 164 bps/mo), and above ~$300M market cap it inverts. Detail in the validation post.
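The arithmetic behind the total-loss bound is simple enough to show with a toy example. The returns and counts below are invented; only the idea of assigning every delisted name a return of −100% comes from the description above.

```python
def mean_return_with_delistings(surviving_returns, n_delisted, delisted_return=-1.0):
    """Average return after adding back delisted names at an assumed return."""
    total = sum(surviving_returns) + n_delisted * delisted_return
    return total / (len(surviving_returns) + n_delisted)

# Hypothetical bucket: four companies that still file, one that delisted.
survivors = [0.10, 0.20, 0.00, 0.10]

naive_mean = mean_return_with_delistings(survivors, n_delisted=0)
bounded_mean = mean_return_with_delistings(survivors, n_delisted=1)
```

Because some delisted companies were acquired at a premium rather than wiped out, treating all of them as total losses overstates the correction. That is what makes it a bound rather than an estimate.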

Does the algorithm have lookahead bias? Did you tune it knowing the outcomes?

We were careful about this but you're right to ask. The algorithm has two components: a phrase escalation score and a semantic drift score. The phrase escalation score is entirely blind to outcomes — it measures frequency change and cross-company rarity, which are properties of the text itself, not labels we applied.

The semantic component uses an "anchor" set of distress-adjacent sentences drawn from confirmed crisis filings to define a "distress direction." This is where lookahead risk exists: if we tuned the anchor set to maximize scores for known failures, the results would be circular. In practice, we built the anchor set before running the full analysis, and we use the same anchors across all companies — we didn't iterate to improve detection on specific cases.

The approach was developed with some knowledge that certain companies had failed, so this is not a fully out-of-sample test. The algorithm has no company-specific tuning — SVB's score is computed the same way as JPMorgan's. The right validation is prospective: watching how it performs on new filings from companies not in the original set. We'll report on that as the corpus grows.
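One way to picture a "distress direction" is as a projection in embedding space. The sketch below is an assumption about how such a score could be built, not the actual implementation: it defines the direction as the difference between the mean anchor embedding and a mean baseline embedding (the baseline set is our own illustrative addition), then measures how far a company's mean sentence embedding moved along that direction year over year. The 2-D vectors stand in for real sentence embeddings.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def drift_along_direction(prev_embs, curr_embs, anchor_embs, baseline_embs):
    """Signed movement of the mean embedding along the anchor-defined direction."""
    direction = unit(
        [a - b for a, b in zip(mean_vector(anchor_embs), mean_vector(baseline_embs))]
    )
    shift = [c - p for c, p in zip(mean_vector(curr_embs), mean_vector(prev_embs))]
    return sum(s * d for s, d in zip(shift, direction))

# Toy 2-D embeddings; the same anchors are used for every company.
anchors = [[2.0, 0.0], [4.0, 0.0]]
baseline = [[1.0, 0.0], [1.0, 0.0]]
last_year = [[0.1, 0.5], [0.1, 0.3]]
this_year = [[0.6, 0.5], [0.4, 0.3]]

drift = drift_along_direction(last_year, this_year, anchors, baseline)
```

Because the anchors are fixed inputs, the lookahead question reduces to how and when the anchor set was chosen, which is the point made above.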

How does the signal behave during a systemic crisis like 2008?

The headline numbers run the full 2000–2026 period, including 2007–2011. (An earlier version of this product excluded the crisis era when it was framed as a distress detector; the quality-factor result is robust across eras, so the canonical evaluation now includes the whole period.)

The per-period normalization compares each company's language change against the rest of the corpus that year. During a systemic crisis — 2007–2011, or COVID in 2020 — distress language rises corpus-wide at once: regional banks, mortgage companies, and homebuilders escalating simultaneously. When distress language is the baseline everywhere, the normalized score compresses toward zero — not because individual companies weren't distressed, but because distress was the norm that year.
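The compression effect is easy to reproduce with a cross-sectional z-score, used here as a simple stand-in for the per-period normalization. The choice of z-score and all the numbers are illustrative assumptions.

```python
from statistics import mean, pstdev

def cross_sectional_z(value, corpus_values):
    """How many standard deviations `value` sits above the same-year corpus mean."""
    return (value - mean(corpus_values)) / pstdev(corpus_values)

company_raw = 9.0  # identical raw language change in both scenarios

calm_year = [1.0, 2.0, 1.5, 2.5, 1.0, 2.0]      # corpus mostly stable
crisis_year = [7.0, 9.0, 8.0, 10.0, 8.5, 9.5]   # distress language rising everywhere

z_calm = cross_sectional_z(company_raw, calm_year)
z_crisis = cross_sectional_z(company_raw, crisis_year)
```

The same raw change is a far outlier in the calm year and close to average in the crisis year, which is the behavior described above.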

So the two readings of the score behave differently across eras. The long-side factor (stable-language companies outperform) holds on the full period. The distress recall is weaker in a systemic crisis, because FilingDrift is built to detect idiosyncratic distress — one company deteriorating while the corpus is healthy — not systemic stress where corpus-wide deterioration is the baseline. We report the full-period numbers rather than excluding the crisis, and treat the era sensitivity as a property of the mechanism, not something to hide.

If this tool is so accurate, why don't you trade on the signals yourselves?

First: we're engineers, not traders. Building a reliable short position on a company requires more than a signal — it requires position sizing, risk management, broker relationships, and a thesis on timing. A company can have elevated language in its filing and still take 18 months to collapse. Being right about the direction doesn't tell you when, and "when" is what determines whether a trade makes money.

Second: this signal is not sufficient alone. FilingDrift scored SVB above the ceiling in its final filing. It also scored RTX above the ceiling in 2020, because of a merger that generated distress-adjacent language with no actual distress. False positives are real: 8 of the 30 healthy control companies exceeded the ceiling at some point. A signal this uncertain, without other confirming indicators, doesn't make a good sole basis for trading.

Most importantly: we don't generate trading signals. What we provide is one research layer — a descriptive tool, a heads-up that the language has changed in a statistically unusual way. We are not predicting the future. What you do with that, in combination with your own analysis, is entirely your call. We explicitly do not provide investment advice.

Is this investment advice?

No. We analyze language in public SEC filings. We don't predict stock prices, recommend trades, or guarantee any outcome. Past detection of distress events does not mean future detections will be accurate.

We built a linguistic measurement tool. What you do with the measurements is entirely your call.

How often are scores updated?

We check EDGAR daily for new 10-K and 10-Q filings. When a tracked company files, we process it and update the score within 24 hours. Pro subscribers get an email alert when this happens.

8-K material event filings are on the roadmap.

What companies are covered?

Currently 4930+ companies — all US public companies filing annual or quarterly reports on EDGAR. This includes a hand-labeled set of ~43 verified crisis events (SVB, Bed Bath & Beyond, Party City, Rite Aid, and others used for recall and precision measurement) and ~30 healthy control companies, plus the broader corpus of companies added during ingestion with no hand-selection.

Researcher subscribers can add up to 50 tickers to their watchlist. Professional subscribers get up to 500. Desk subscribers get unlimited coverage across the full corpus.

What's in the Pro tier?

Everything in Free, plus: email alerts when a company you follow files a new 10-K or 10-Q with an elevated score, a watchlist for up to 50 tickers, CSV export for your own models, and API access for programmatic queries.

See the pricing page for current rates.

I've been subscribed for months and haven't received any alerts. Is something wrong?

Annual 10-K filings are filed once per year, but quarterly 10-Q filings are now live too — so for most companies you can expect up to 4 alerts per year when scores are elevated.

The value of the subscription is not missing the signal when it does arrive. When SVB filed on February 24, 2023, the score above ceiling was available that day. Without an alert, you would have had to check manually.

8-K material event filings (going concern disclosures, officer departures, covenant violations) are on the roadmap and will further increase near-real-time coverage.

If you want to check the current score of any company you're watching, the dashboard is always live. If you believe a company has filed and you haven't received an alert, email us at hello@filingdrift.com.

Why did a company's score change even though its filing didn't?

Scores are corpus-normalized — they measure how unusual a company's language change is relative to what every other company in the corpus wrote in the same year. When we expand the corpus (adding more companies), the corpus baseline changes, and scores are recomputed accordingly.

This is a feature, not a bug. A phrase that most companies used in a given year (interest rate risk in 2022, for example) should score near zero for any individual company in that year: it's macro noise, not a company-specific signal. As we add more companies to the baseline, that normalization becomes more accurate.

The control ceiling (the threshold above which a company is flagged) is also recomputed when the corpus changes, since it's set at the 95th percentile of healthy companies in the expanded set.

Practical implication: historical scores may shift slightly between corpus versions. All scores shown on the site are always computed against the current corpus. Point-in-time historical data (scores as of each filing date, using only companies available at that time) is available via the API for systematic backtesting.

How does this compare to existing academic research on 10-K text analysis?

There's a substantial body of academic work on extracting signals from SEC filings, and FilingDrift builds upon that. Two of the most relevant papers are:

Loughran & McDonald (2011) — the foundational paper on financial text analysis — built a word list of positive and negative terms specific to financial language and showed that sentiment polarity in 10-Ks predicts returns. Their word list is still widely used. FilingDrift doesn't use sentiment polarity; it uses semantic position (where the language sits in embedding space, relative to distress anchor sentences) and frequency escalation (unusual new or escalating phrases). This catches structural drift that uniform sentiment scoring misses.

Lazy Prices (Cohen, Malloy & Nguyen, 2020) — the closest academic precursor — showed that the degree of change in 10-K language year-over-year predicts stock returns: companies that change their filings more tend to underperform. FilingDrift is the directed version: instead of undirected change magnitude, it tracks the increase in distress vocabulary specifically, normalized corpus-wide (not by sector — we tested per-sector and it scored worse), with a secondary semantic-distance signal anchored on confirmed distress language.

Sorted into quintiles, the most-stable-language quintile earns a five-factor + quality (FF5+QMJ) alpha of 93.2 bps/month (t=9.16) over the full 2000–2026 period — and it survives removing all the corpus normalization. The long-short spread (FF3 59.5 bps/mo; FF5 38.9 bps/mo) is comparable to Lazy Prices' 18–45 bps range — the long side is where the signal lives. Our methodology is documented in detail here.

Who is this for?

People who read SEC filings as part of their job or research — independent investors, credit analysts, short sellers, journalists covering corporate distress, and students studying the 2008 crisis or COVID bankruptcies.

It is probably not for casual retail investors looking for a stock-picking signal. The tool is most useful when you already have a view on a company and want to know if the language is confirming or contradicting it.

You started with 10-Ks. Do you now cover 10-Qs, 8-Ks, earnings call transcripts? And global markets?

Annual 10-Ks are where we started — they are the most comprehensive, most consistent, and most comparable documents in the corpus. Every US public company files one, on the same schedule, with the same required sections. That consistency is what makes corpus-wide normalization work reliably.

10-Q quarterly filings are now live. Coverage is expanding continuously as new quarterly filings appear on EDGAR. The same scoring methodology applies — year-over-year pairing, corpus-wide normalization, unified ceiling — at quarterly cadence.

8-K material events (CEO departures, going concern disclosures, covenant violations) are next on the roadmap. Earnings call transcripts are also planned: management tone on calls often shifts one or two quarters before the 10-K language changes.

Global markets — EU annual reports (ESMA XBRL), UK Companies House, Japan EDINET — are on the longer-term roadmap. The methodology is form-agnostic; the constraint is building reliable parsers for each filing format.

Who built this?

FilingDrift is an independent product operated by Latent Systems SAS, a French software company. We are not a hedge fund, not a financial advisory firm, and not affiliated with any broker-dealer.

See the About page for more.

Have a question that's not here? Email us at hello@filingdrift.com.