Common questions about the methodology, data, and what the score means.
FilingDrift is a tool that reads SEC 10-K filings and scores language change — how much the wording has shifted year-over-year, and how unusual that shift looks compared to peer companies in the same year. We flag the outliers. You decide what they mean.
We are not financial analysts, economists, or credit rating agencies. We are engineers who built a language scoring system over a corpus of public filings, and we make the output available so you can add it to your own research process.
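The score combines two things: a phrase escalation score, which measures how sharply a phrase's frequency has changed and how rare it is across companies, and a semantic drift score, which measures how far the language has moved toward distress-adjacent anchor sentences.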
Both components are calibrated against the healthy companies in the corpus. The 95th percentile of that group is the control ceiling — scores above it are flagged. See the About page for more detail.
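As a rough illustration of the control ceiling described above (not FilingDrift's actual code), the sketch below takes the 95th percentile of a set of healthy-company scores and flags anything above it. The score values, the function names, and the linear-interpolation percentile are all assumptions made for the example.

```python
import math

def percentile(values, q):
    """Linear-interpolated percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def control_ceiling(healthy_scores, q=95.0):
    """Ceiling = q-th percentile of the healthy control group's scores."""
    return percentile(healthy_scores, q)

def is_flagged(score, ceiling):
    """A score is flagged only when it is strictly above the ceiling."""
    return score > ceiling

# Hypothetical scores for ten healthy control companies (illustrative only).
healthy = [0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.35, 0.90]
ceiling = control_ceiling(healthy)

print(round(ceiling, 4))          # ceiling for this toy group
print(is_flagged(0.70, ceiling))  # above the ceiling
print(is_flagged(0.50, ceiling))  # below the ceiling
```

Because the ceiling is a percentile of the control group, it moves whenever that group changes, which is why it has to be recomputed when the corpus is expanded.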
We don't use ChatGPT, Claude, or any large language model. This is worth being direct about because it changes the reliability properties entirely.
FilingDrift scores filings using a deterministic algorithm — the same filing always produces the same score. There is no text generation, no prompting, no summarization that might be inaccurate. The output is a number, computed from the actual language in the document.
ChatGPT is good for summarizing things you already understand. FilingDrift is for detecting the drift you wouldn't otherwise notice.
The scoring algorithm is fully deterministic: given the same filings and the same peer set, it will always produce the same score. But the peer set changes as we ingest new companies into the corpus.
Each score is a measure of how unusual a company's language is relative to its peers that year. As we add more companies to the corpus — we currently cover 4923+ and are expanding continuously — the peer distribution shifts. A company that looked like a strong outlier in a 500-company corpus may look less extreme in a 5,000-company corpus if its industry peers show similar patterns.
If you run the same score on the same corpus snapshot twice, you will always get identical results. The score you see today may differ from last month not because of any change in the underlying filing, but because the peer universe has grown. This is expected behavior, not a bug. We plan to publish corpus version numbers so you can track which peer universe was used for any given score.
You could read SVB's 2022 10-K and notice the phrase "unrealized losses" appears frequently. What you can't easily do: know that no other bank mentioned it as often that year, that SVB's usage increased 4× year-over-year, and that the sentence embeddings of those paragraphs place them semantically closer to distress language than anything JPMorgan filed.
The value isn't reading one filing. It's knowing where that filing sits relative to dozens of peers, at the same moment in time. That's the cross-sectional comparison you can't do by hand.
No — and this is important. We use two distinct datasets that answer different questions:
The forward-return backtest numbers (−8.6% alpha at 1yr, 58% underperforming) come from the full 4923-company corpus, not the 43-company labeled set. The recall / precision stats come from the labeled set only. They are measuring different things.
Two ways to measure this, which answer different questions:
On crisis companies with at least 2 pre-event filing pairs, the model detected the majority before the collapse. Precision ~50% — about 1 in 2 flags co-occurred with a labeled crisis event. Full detail on the About page. This number is computed on a curated set of known failures, not the general corpus.
Flagged companies underperformed SPY by a median of −8.6% at 12 months, −14.8% at 24 months, and −22.4% at 36 months. 58% of flag events had negative alpha at 12 months vs. ~50% expected by chance. This is computed on every company we've processed — no hand-selection.
This is a screening signal, not a verdict. It is most useful as one layer in a larger research process, not as a standalone buy/sell decision.
Yes — decisively, on both measures. With 7069 flag events the sample is large enough to distinguish a real signal from noise.
| Metric | 1 yr | 2 yr | 3 yr |
|---|---|---|---|
| % of flags underperforming SPY | 58% | 61% | 63% |
| Binomial p-value vs 50% null | <10⁻¹⁵ | <10⁻¹⁵ | <10⁻¹⁵ |
| Median excess return (flagged minus SPY) | −8.6% | −14.8% | −22.4% |
| Wilcoxon signed-rank p-value vs 0 | <10⁻¹⁵ | <10⁻¹⁵ | <10⁻²⁰ |
Two caveats worth stating explicitly. First, statistical significance is not the same as practical significance. The effect size matters: a median alpha of −8.6% at one year is meaningful for credit monitoring or due diligence work, but it is not the kind of edge that supports a systematic short strategy on its own (execution costs, position sizing, and timing all matter). Second, these numbers come from a corpus of ~4,900 companies, not a random sample of the entire market. Whether they generalize beyond companies that happen to have made it into EDGAR is an open question.
What the p-values rule out is luck. The pattern is real and large enough that it cannot be explained by random variation across 7069 flag events.
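For readers who want to sanity-check the order of magnitude of the binomial p-value, here is a minimal sketch of a one-sided exact binomial test against a 50% null. It assumes the 1-year count of underperforming flags is 58% of 7069 rounded to the nearest integer; the exact count is not published here, so treat that as an approximation. The helper names are made up for the example.

```python
import math

def log_binom_pmf(k, n, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), via log-gamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def binom_upper_tail(k, n, p=0.5):
    """P(X >= k), summed in log space so tiny tails do not underflow early."""
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    m = max(logs)
    return math.exp(m) * sum(math.exp(x - m) for x in logs)

n_flags = 7069
# Assumption: reconstruct the count from the rounded 58% figure.
underperformers = round(0.58 * n_flags)
p_value = binom_upper_tail(underperformers, n_flags)

print(underperformers)
print(p_value < 1e-15)  # consistent with the "<10^-15" entry in the table
```

The test only asks whether the share of underperformers differs from a coin flip. It says nothing about the size of the underperformance, which is what the median excess return and the Wilcoxon test address.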
This is the right question to ask, and we tested it two ways.
Equal-weighted benchmark test. We recomputed all 6059 flag events using RSP (S&P 500 Equal Weight ETF) as the benchmark instead of SPY. RSP weights every S&P 500 component equally, which removes the Magnificent 7 concentration that inflated cap-weighted returns from 2020–2024. Result: −21.9% at 36 months vs. RSP, compared with −22.4% vs. SPY. The composition bias is about 0.5 percentage points, which is negligible.
The alpha is essentially identical on cap-weighted and equal-weighted benchmarks. It is not an artifact of the Magnificent 7 composition effect.
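The benchmark comparison above boils down to recomputing excess returns against a second benchmark and comparing the summaries. The sketch below shows that arithmetic on made-up return series; none of the numbers are real SPY, RSP, or flag-event data, and the function names are hypothetical.

```python
from statistics import median

def excess_returns(stock_returns, benchmark_returns):
    """Per-event excess return: stock return minus benchmark return."""
    return [s - b for s, b in zip(stock_returns, benchmark_returns)]

def summarize(stock_returns, benchmark_returns):
    """Return (median excess return, share of events with negative excess)."""
    ex = excess_returns(stock_returns, benchmark_returns)
    share_under = sum(1 for e in ex if e < 0) / len(ex)
    return median(ex), share_under

# Hypothetical returns for five flag events and the matching benchmark
# returns over the same holding windows (illustrative only).
flagged = [-0.30, -0.10, 0.05, -0.25, 0.40]
cap_weighted = [0.10, 0.12, 0.08, -0.05, 0.15]    # stands in for SPY
equal_weighted = [0.08, 0.12, 0.09, -0.04, 0.13]  # stands in for RSP

med_cap, share_cap = summarize(flagged, cap_weighted)
med_eq, share_eq = summarize(flagged, equal_weighted)

print(round(med_cap, 4), share_cap)
print(round(med_eq, 4), share_eq)
print(round(abs(med_cap - med_eq), 4))  # gap attributable to benchmark choice
```

If the gap between the two medians is small relative to the medians themselves, the choice of benchmark is not what drives the result.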
A 10-K is 60–200 pages of dense legalese. Analysts read the highlights, scan for headline numbers, and move on. Almost nobody reads the full risk factors and MD&A sections word-for-word, and even fewer do it with a memory of exactly what those same sections said last year, or what every competitor wrote this year.
The signal we detect is not in any single sentence — it's in the pattern of change across a full document, measured against a cross-section of ~4906 peer filings from the same year. A company can write three sentences about covenant headroom that look innocuous in isolation, but are unlike anything any peer company wrote that year. No analyst catches that without a systematic tool.
This is not a new idea. The Loughran–McDonald research on 10-K language (2011, 2016) demonstrated statistically significant forward return predictability from filing text. Our peer-normalization approach addresses a limitation of that work — distinguishing company-specific deterioration from market-wide language shifts (e.g., every company mentioned interest rate risk in 2022). That refinement is where the incremental signal lives.
Markets are efficient at processing structured data (EPS, revenue, guidance). They are much less efficient at processing high-dimensional, cross-sectional textual change at the sentence level. That gap is what FilingDrift occupies.
41% is not a false positive rate in the meaningful sense. It's the share of flag events where the stock didn't underperform the S&P 500 within 12 months. Those are different things.
A genuine false positive would be: a company whose language genuinely escalated relative to its own history and its peers, but where the company was financially fine. That category is real — mergers, regulatory changes, and one-time restructurings can all produce distress-adjacent language without actual distress. We mention RTX 2020 as an example on the About page.
But "didn't underperform at 12 months" has three possible explanations:
We can't separate these three categories without individual case review. What we can say: measured at 3 years instead of 1, 63% of flags underperform and the median alpha gap is −22.4% — well above the ~50% baseline you'd expect from chance. The signal is real; it is not a precise 12-month trading clock.
A flag means the language in a company's 10-K has shifted significantly relative to its own prior filings and relative to peer companies in the same year. It's a prompt to look closer — not a directive to act.
Useful first steps: read the specific anomalous sentences we surface on the company page — they're the actual text from the filing that drove the score. Then check whether the same language appears in peer filings (if yes, it may be sector-wide). Then look at the score arc — is this a one-year spike or a multi-year escalation?
A spike paired with a multi-year trend and peer-unique language is the strongest combination. A single anomalous year with language that peers also use is weaker. The tool gives you the data; the interpretation is yours.
Sentiment analysis assigns a positive/negative score to a piece of text. "We face significant liquidity risks" is negative. That's useful but shallow — most companies use cautious legal boilerplate, so everything scores slightly negative all the time.
Semantic drift is different. We're not asking "is this sentence negative?" We're asking: "Is this sentence semantically different from what this company said last year, and from what every peer company said this year?" A company that shifts from standard risk-factor boilerplate to language structurally similar to sentences found in bankruptcy filings — that's drift. The sentiment score might be unchanged. The semantic position has moved.
The other key difference: drift is relative. SVB mentioning "unrealized losses" is only meaningful because they mentioned it 4× as often as the year before and more often than any peer bank that year. Sentiment analysis looks at each sentence in isolation.
It's a fair challenge. We selected crisis companies after the fact — SVB, BBBY, Rite Aid, Party City — because they had documented collapse events with known dates. That selection process can't introduce bias into the scoring algorithm itself (which is deterministic and has no knowledge of the outcome), but it absolutely could bias how we report results. We've tried to address this three ways.
First, the full results table, including the misses (Silvergate, Countrywide, PG&E), is at /about. Second, we include 30 healthy control companies in the ceiling computation and report the false positive rate: 6 of 30 exceeded the ceiling at some point. Third, we're expanding the corpus continuously rather than hand-selecting the most favorable set.
The deeper version of the question is: "If I had been watching a random set of 500 companies in 2022, would FilingDrift's elevated scores have been actionable, or would they have been drowned out by false positives?" That's the right test — and it's what we're building toward as the corpus grows.
We were careful about this but you're right to ask. The algorithm has two components: a phrase escalation score and a semantic drift score. The phrase escalation score is entirely blind to outcomes — it measures frequency change and cross-company rarity, which are properties of the text itself, not labels we applied.
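To make "frequency change and cross-company rarity" concrete, here is a minimal sketch of both quantities computed from raw text. It is an illustration under stated assumptions, not the production scoring: the add-one smoothing, the whole-phrase regex match, the way rarity is defined, and all sample sentences are invented for the example.

```python
import re

def count_phrase(text, phrase):
    """Case-insensitive count of whole-phrase occurrences in text."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))

def escalation_ratio(current_text, prior_text, phrase):
    """Year-over-year frequency ratio, with add-one smoothing so a phrase
    that is new this year does not cause a division by zero."""
    current = count_phrase(current_text, phrase)
    prior = count_phrase(prior_text, phrase)
    return (current + 1) / (prior + 1)

def peer_rarity(peer_texts, phrase):
    """Share of peers that never use the phrase (1.0 = no peer uses it)."""
    if not peer_texts:
        return 1.0
    users = sum(1 for t in peer_texts if count_phrase(t, phrase) > 0)
    return 1.0 - users / len(peer_texts)

# Hypothetical filing excerpts (illustrative only).
prior = "Unrealized losses on securities were not material."
current = ("Unrealized losses increased. We may be required to sell securities "
           "and realize those unrealized losses. Unrealized losses reduce our liquidity.")
peers = [
    "Interest rate risk is managed through hedging.",
    "Unrealized losses were modest.",
    "We face competition.",
    "Deposits grew during the year.",
]

print(escalation_ratio(current, prior, "unrealized losses"))
print(peer_rarity(peers, "unrealized losses"))
```

Neither function looks at what happened to the company afterwards. Both are properties of the text alone, which is the sense in which this component is blind to outcomes.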
The semantic component uses an "anchor" set of distress-adjacent sentences drawn from confirmed crisis filings to define a "distress direction." This is where lookahead risk exists: if we tuned the anchor set to maximize scores for known failures, the results would be circular. In practice, we built the anchor set before running the full analysis, and we use the same anchors across all companies — we didn't iterate to improve detection on specific cases.
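One common way to turn an anchor set into a "direction" is to take the difference between the average anchor embedding and the average baseline embedding, then measure how far a company's language moves along it from one year to the next. The sketch below does that with toy 3-dimensional vectors. It is an assumption about how such a component could work, not a description of FilingDrift's implementation; real sentence embeddings have hundreds of dimensions and come from an embedding model that is not specified here.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v):
    """Scale a non-zero vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def distress_direction(anchor_vecs, baseline_vecs):
    """Unit vector pointing from the baseline centroid to the anchor centroid."""
    a = mean_vector(anchor_vecs)
    b = mean_vector(baseline_vecs)
    return unit([x - y for x, y in zip(a, b)])

def drift_along(direction, prior_vec, current_vec):
    """Signed movement from last year's embedding to this year's, projected
    onto the direction. Positive means movement toward the anchors."""
    return sum(d * (c - p) for d, c, p in zip(direction, current_vec, prior_vec))

# Toy embeddings (illustrative only). The anchor set is fixed once and
# reused for every company, so no company gets its own tuned direction.
anchors = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
baseline = [[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]]
direction = distress_direction(anchors, baseline)

prior_year = [0.1, 0.9, 0.0]
current_year = [0.6, 0.4, 0.0]
drift = drift_along(direction, prior_year, current_year)

print(round(drift, 4))
print(drift_along(direction, prior_year, prior_year))  # no change, no drift
```

The lookahead concern discussed above lives entirely in the choice of `anchors`: if that list were edited after seeing which companies failed, the direction itself would encode the outcomes.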
The approach was developed with some knowledge that certain companies had failed, so this is not a fully out-of-sample test. The algorithm has no company-specific tuning — SVB's score is computed the same way as JPMorgan's. The right validation is prospective: watching how it performs on new filings from companies not in the original set. We'll report on that as the corpus grows.
The peer normalization step requires a healthy population of surviving peers in each industry group for the same filing year. During 2007–2011, entire sectors — regional banks, mortgage companies, homebuilders — collapsed simultaneously. When every peer in a group shows elevated distress language, the peer-normalized score converges toward zero: not because individual companies weren't distressed, but because distress was the baseline for the whole sector that year.
Including that period would understate the signal under ordinary conditions. Companies that genuinely escalated their language relative to peers would look like non-flags, because there were no healthy peers left to normalize against. The backtest would then look weaker than the algorithm actually performs outside of systemic crises, which would be just as misleading as overstating it.
FilingDrift detects idiosyncratic distress — one company deteriorating while its sector is healthy. It does not detect systemic stress where sector-wide deterioration is the baseline. That is why 2007–2011 is excluded, and it is the right framing for interpreting the backtest numbers.
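The convergence toward zero is easy to see with a toy peer normalization. The sketch below uses a plain z-score against same-year peers, which is an assumption chosen for simplicity (the exact normalization is not specified here), and the raw scores are invented.

```python
from statistics import mean, pstdev

def peer_normalized(raw, peer_raws):
    """Z-score of a company's raw score against its same-year peer group."""
    mu = mean(peer_raws)
    sigma = pstdev(peer_raws)
    if sigma == 0:
        return 0.0  # every peer identical: nothing to stand out against
    return (raw - mu) / sigma

company_raw = 0.8  # the same raw distress-language level in both scenarios

# Ordinary year: peers are calm, so the company stands out sharply.
calm_peers = [0.1, 0.2, 0.1, 0.2]
z_calm = peer_normalized(company_raw, calm_peers)

# Systemic crisis: every peer is elevated too, so the company looks typical.
crisis_peers = [0.7, 0.9, 0.7, 0.9]
z_crisis = peer_normalized(company_raw, crisis_peers)

print(round(z_calm, 2))
print(round(z_crisis, 2))
```

The raw level is identical in both cases. Only the peer baseline changes, which is why a peer-relative score cannot see sector-wide stress.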
First: we're engineers, not traders. Building a reliable short position on a company requires more than a signal — it requires position sizing, risk management, broker relationships, and a thesis on timing. A company can have elevated language in its filing and still take 18 months to collapse. Being right about the direction doesn't tell you when, and "when" is what determines whether a trade makes money.
Second: this signal is not sufficient alone. FilingDrift scored SVB above the ceiling in its final filing. It also scored RTX above the ceiling in 2020 — because of a merger that generated distress-adjacent language with no actual distress. The false positive rate is low but real. A signal this uncertain, without other confirming indicators, doesn't make a good sole basis for trading.
Most importantly: we don't generate trading signals. What we provide is one research layer — a descriptive tool, a heads-up that the language has changed in a statistically unusual way. We are not predicting the future. What you do with that, in combination with your own analysis, is entirely your call. We explicitly do not provide investment advice.
No. We analyze language in public SEC filings. We don't predict stock prices, recommend trades, or guarantee any outcome. Past detection of distress events does not mean future detections will be accurate.
We built a linguistic measurement tool. What you do with the measurements is entirely your call.
We check EDGAR daily for new 10-K filings. When a tracked company files, we process it and update the score within 24 hours. Pro subscribers get an email alert when this happens.
Most companies file once a year. Scores are typically available within 24 hours of the filing appearing on EDGAR.
We're actively expanding coverage to include quarterly 10-Q filings and material event 8-K filings. This will increase alert cadence from annual to quarterly and near-real-time for major corporate events.
Currently 4906+ companies: a mix of verified crisis events (SVB, Lehman, Enron, Bed Bath & Beyond, Party City, Revlon, and others) and healthy control companies (large-cap banks, retailers, consumer staples) used to calibrate the baseline.
Researcher subscribers can add up to 50 tickers to their watchlist. Professional subscribers get up to 500. Desk subscribers get unlimited coverage across the full corpus.
Everything in Free, plus: email alerts when a company you follow files a new 10-K with an elevated score, a watchlist for up to 40 tickers, CSV export for your own models, and API access for programmatic queries.
See the pricing page for current rates.
Probably not. Annual 10-K filings are — as the name suggests — filed once per year. A company that filed in February 2025 won't file again until February 2026. For most tracked companies, there is genuinely nothing to alert you about for 11 months of the year.
The value of the subscription is not missing the signal when it does arrive. When SVB filed on February 24, 2023, the score above ceiling was available that day. Without an alert, you would have had to check manually.
We're actively expanding coverage to include quarterly 10-Q filings and material event 8-K filings. Once live, this will increase alert cadence significantly — from annual to quarterly and near-real-time for major corporate events.
If you want to check the current score of any company you're watching, the dashboard is always live. If you believe a company has filed and you haven't received an alert, email us at hello@filingdrift.com.
Scores are peer-normalized — they measure how unusual a company's language is relative to what every other company in the corpus wrote in the same year. When we expand the corpus (adding more companies), the peer baseline changes, and scores are recomputed accordingly.
This is a feature, not a bug. A phrase that 80% of S&P 500 companies used in 2022 (like "unrealized losses") should score near zero for any individual company in that year — it's macro noise, not a company-specific signal. As we add more companies to the baseline, that normalization becomes more accurate.
The control ceiling (the threshold above which a company is flagged) is also recomputed when the corpus changes, since it's set at the 95th percentile of stable companies in the expanded set.
Practical implication: historical scores may shift slightly between corpus versions. All scores shown on the site are always computed against the current corpus. Point-in-time historical data (scores as of each filing date, using only companies available at that time) is available via the API for systematic backtesting.
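The difference between a current-corpus score and a point-in-time score comes down to which peers are allowed into the baseline. The sketch below illustrates that filter with a simple percentile rank. It does not use or describe the FilingDrift API: the record layout, field names, tickers, dates, and the rank-based normalization are all hypothetical.

```python
from datetime import date

def peers_as_of(corpus, as_of, fiscal_year):
    """Raw scores for fiscal_year from companies ingested on or before as_of."""
    return [r["raw"] for r in corpus
            if r["year"] == fiscal_year and r["ingested"] <= as_of]

def percentile_rank(raw, peer_raws):
    """Share of peers at or below the given raw score (None if no peers)."""
    if not peer_raws:
        return None
    return sum(1 for p in peer_raws if p <= raw) / len(peer_raws)

# Hypothetical corpus records (illustrative only).
corpus = [
    {"ticker": "AAA", "year": 2022, "raw": 0.20, "ingested": date(2023, 3, 1)},
    {"ticker": "BBB", "year": 2022, "raw": 0.40, "ingested": date(2023, 3, 1)},
    {"ticker": "CCC", "year": 2022, "raw": 0.90, "ingested": date(2024, 6, 1)},
    {"ticker": "DDD", "year": 2022, "raw": 0.95, "ingested": date(2024, 6, 1)},
]
company_raw = 0.60

# Point-in-time: only peers that were in the corpus shortly after filing.
rank_then = percentile_rank(
    company_raw, peers_as_of(corpus, date(2023, 3, 15), 2022))

# Current corpus: every peer ingested to date.
rank_now = percentile_rank(
    company_raw, peers_as_of(corpus, date(2025, 1, 1), 2022))

print(rank_then)
print(rank_now)
```

For backtesting, the point-in-time version is the one that avoids lookahead, because it only uses peers that were actually available when the filing appeared.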
There's a substantial body of academic work on extracting signals from SEC filings, and FilingDrift builds upon that. Two of the most relevant papers are:
Loughran & McDonald (2011) — the foundational paper on financial text analysis — built a word list of positive and negative terms specific to financial language and showed that sentiment polarity in 10-Ks predicts returns. Their word list is still widely used. FilingDrift doesn't use sentiment polarity; it uses semantic position (where the language sits in embedding space, relative to distress anchor sentences) and frequency escalation (unusual new or escalating phrases). This catches structural drift that uniform sentiment scoring misses.
Lazy Prices (Cohen, Malloy & Nguyen, 2020) — the closest academic precursor — showed that the degree of change in 10-K language year-over-year predicts stock returns. Companies that change their filings more tend to underperform. FilingDrift extends this in two ways: it adds peer-normalization (changes that are unusual relative to the sector matter more than changes that every company made) and it combines a change signal with a semantic distance signal anchored on confirmed distress language, not just any change.
The FF3-adjusted alpha of 61 bps/month (t-stat 5.41) we report across 292 months is above the Lazy Prices benchmark on the same data window. Our methodology is documented in detail here.
People who read SEC filings as part of their job or research — independent investors, credit analysts, short sellers, journalists covering corporate distress, and students studying the 2008 crisis or COVID bankruptcies.
It is probably not for casual retail investors looking for a stock-picking signal. The tool is most useful when you already have a view on a company and want to know if the language is confirming or contradicting it.
Annual 10-Ks are where we started — they are the most comprehensive, most consistent, and most comparable documents in the corpus. Every US public company files one, on the same schedule, with the same required sections. That consistency is what makes cross-sectional peer normalization work reliably.
10-Q quarterly filings and 8-K material events (CEO departures, going concern disclosures, covenant violations) are next on the roadmap. 10-Qs are particularly valuable because they shift the signal from annual to quarterly cadence — Q1 10-Qs are filing right now. Earnings call transcripts are also planned: management tone on calls often shifts one or two quarters before the 10-K language changes.
Global markets — EU annual reports (ESMA XBRL), UK Companies House, Japan EDINET — are on the longer-term roadmap. The methodology is form-agnostic; the constraint is building reliable parsers for each filing format. If global coverage is important to you, let us know at hello@filingdrift.com — demand shapes the priority order.
FilingDrift is a small independent product operated by Latent Systems SAS, a French software company. We are not a hedge fund, not a financial advisory firm, and not affiliated with any broker-dealer.
We built this because we noticed that nobody was doing systematic language-change scoring on SEC filings at the sentence level, with peer comparison. The SVB case study validated the approach. We're sharing it. See the About page for more.
Have a question that's not here? Email us at hello@filingdrift.com.