How We Search 3.5 Million Pages
for Japanese Names

3.5M Total Pages
280GB Data Volume
DS1–12 Data Sets
186 Known Persons

Between December 2025 and January 2026, the U.S. Department of Justice (DOJ) released Epstein-related documents in stages — approximately 3.5 million pages, 280 GB, organized across Data Sets 1 through 12. The goal of this investigation is singular: to determine whether any Japanese names appear in those documents.

Reading 3.5 million pages by hand is not possible, so we developed an automated detection program. During development, we discovered that some documents contain fake redactions — black boxes placed over text that is still present in the underlying PDF data and can be recovered. This page explains why the results obtained through this methodology can be trusted.

Data Acquisition — The Hardest Part

Before analyzing 3.5 million pages, we first had to obtain them — and that proved extremely difficult. The DOJ published the Epstein files on its official website and through Internet Archive (archive.org), but the published versions are far from complete.

Take Data Set 10 as an example. The DOJ's official release contains roughly 100 PDFs for DS10. The actual DS10, however, contains 503,154 PDFs. The official release omits 99.98% of the data. The EFTA number ranges in the official release are entirely different from those in the full dataset — a researcher relying only on the DOJ website would not even recognize the two as the same data set.

The complete data exists in torrent copies maintained and shared by volunteers around the world. This investigation obtained the full dataset through those torrent networks. For DS10 alone, approximately 180 GB of data was downloaded and scanned locally.

Data Sets 9–11 were released as a single bulk package despite being distinct sets, making them difficult to distinguish. Multiple unexplained gaps exist in the EFTA number sequence. Thousands of files were deleted from the DOJ site after initial publication. The total document count was initially reported as approximately 6 million pages; only around 3.5 million were actually released. The whereabouts of the remaining 2.5 million pages are unknown.

Observations on the DOJ Release Process (Epstein Exposure Investigation Team)

These facts demonstrate the opacity of the release process and explain why this investigation proceeds on the assumption that the published documents do not represent the full picture. We work with the maximum available data under these constraints.

Redactions: Real vs. Fake

There are two types of redaction in PDFs. A genuine redaction deletes the underlying text data and then draws a black rectangle over it. Official DOJ documents bearing an EFTA number use this method — copy-pasting a genuine redaction yields nothing, because the text does not exist.

A fake redaction, by contrast, leaves the text data intact and merely places a black rectangle on top. According to an analysis by the PDF Association, court filings incorporated into the DOJ document release — including 2021 civil litigation documents from the U.S. Virgin Islands Attorney General — contain confirmed fake redactions.

Layer 0 — Reading Beneath the Black Box

The program begins by checking for fake redactions the moment a PDF is opened. It detects every black rectangle on each page, then checks whether text data exists beneath the area covered by each rectangle. If text is found, the rectangle is classified as a fake redaction and the text is recovered.

Detection uses x-ray, an open-source library developed by the Free Law Project — a non-profit organization that operates a U.S. case law database and built the tool specifically to identify improper redactions in court documents. Recovered text is automatically tagged REDACTED_RECOVERED to distinguish it from ordinary text. If a Japanese name is found beneath a redaction, it may have been intentionally concealed — a finding of significant investigative weight.
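The core of this check is geometric: does an extracted word's bounding box lie beneath a drawn black rectangle? In the pipeline, x-ray performs this inspection against the PDF's drawing and text layers; the sketch below reproduces only the overlap logic in plain Python. The function names, the 50% overlap threshold, and the coordinates in the usage note are illustrative assumptions, not the tool's actual internals.

```python
# Sketch of the Layer-0 test: flag text whose bounding box is covered by a
# black rectangle. Bounding boxes are (x0, y0, x1, y1) tuples as a PDF
# parser would report them.

def overlaps(word_bbox, rect_bbox, min_ratio=0.5):
    """True if at least min_ratio of the word's area lies inside the rectangle."""
    wx0, wy0, wx1, wy1 = word_bbox
    rx0, ry0, rx1, ry1 = rect_bbox
    ix0, iy0 = max(wx0, rx0), max(wy0, ry0)
    ix1, iy1 = min(wx1, rx1), min(wy1, ry1)
    if ix0 >= ix1 or iy0 >= iy1:          # no intersection at all
        return False
    inter = (ix1 - ix0) * (iy1 - iy0)
    word_area = (wx1 - wx0) * (wy1 - wy0)
    return word_area > 0 and inter / word_area >= min_ratio

def recover_fake_redactions(words, black_rects):
    """words: list of (text, bbox); black_rects: list of bbox.
    Returns covered text, tagged the way the pipeline tags recovered text."""
    recovered = []
    for text, bbox in words:
        if any(overlaps(bbox, r) for r in black_rects):
            recovered.append(("REDACTED_RECOVERED", text))
    return recovered
```

A word fully inside a rectangle is recovered; a word outside it is left alone. A partial-overlap threshold (rather than exact containment) matters because redaction boxes rarely align pixel-perfectly with the text they cover.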

The Five-Stage Processing Pipeline

Searching 280 GB of PDFs directly for "Japanese names" is inefficient. We inverted the approach: first extract every proper noun in the documents, then filter for Japanese-related names.

STEP 1

Full Proper-Noun Extraction

Text is extracted from all 280 GB of PDFs and scanned for:

  • two- or three-word sequences beginning with capital letters (standard English-language name format)
  • email headers (From / To / Cc fields)
  • names preceded by honorifics (Mr., Dr., Ambassador, etc.)
  • all-caps names common in legal documents ("JOHN DOE" format)
  • email addresses and phone numbers

At this stage, nationality is not considered. The 3.5 million pages are compressed into a list of hundreds of thousands of proper nouns.
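The Step-1 patterns can be sketched with ordinary regular expressions. These are simplified stand-ins for illustration; the pipeline's actual patterns are part of its externalized configuration and are not reproduced here.

```python
import re

# Simplified stand-ins for the Step-1 extraction patterns.
PATTERNS = [
    # Two- or three-word capitalized sequences ("Mary Jane Doe")
    re.compile(r"\b(?:[A-Z][a-z]+ ){1,2}[A-Z][a-z]+\b"),
    # Honorific followed by a name ("Mr. Tanaka", "Ambassador Kato")
    re.compile(r"\b(?:Mr\.|Dr\.|Ambassador) [A-Z][a-z]+\b"),
    # All-caps legal-document names ("JOHN DOE")
    re.compile(r"\b[A-Z]{2,} [A-Z]{2,}\b"),
    # Email addresses
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
]

def extract_candidates(text):
    """Collect every match of every pattern, deduplicated, order-preserving."""
    seen, out = set(), []
    for pat in PATTERNS:
        for m in pat.finditer(text):
            if m.group() not in seen:
                seen.add(m.group())
                out.append(m.group())
    return out
```

Run against a line like "From: Mr. Tanaka <k.tanaka@example.jp> re JOHN DOE and Mary Jane Doe" (a fabricated example), this yields the name, the honorific form, the all-caps form, and the email address as four separate candidates for the next stage.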

STEP 2

Japan Filter — 6-Layer Detection

  • Layer 1: Exact match against a pre-compiled list of 186 investigation subjects
  • Layer 2: Presence of one of 185 Japanese family names
  • Layer 3: Surrounding context includes terms such as "Japan," "Tokyo," or "Mitsubishi"
  • Layer 4: Email address with a .jp domain
  • Layer 5: Phone number beginning with +81 (Japan country code)
  • Layer 6: Detection of hiragana, katakana, or kanji characters

Layer 1 is high-certainty. Layers 2–6 are a net cast to capture individuals not anticipated in advance.
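The six layers reduce to straightforward string and character-range tests. In the sketch below, the subject and surname lists are tiny placeholders (the real lists have 186 and 185 entries and live in external configuration files), and the fictitious name "Kazuo Tanaka" is not an actual finding.

```python
import re

SUBJECTS = {"Kazuo Tanaka"}                       # Layer 1 (placeholder list)
SURNAMES = {"Tanaka", "Sato", "Suzuki"}           # Layer 2 (placeholder list)
CONTEXT_TERMS = {"Japan", "Tokyo", "Mitsubishi"}  # Layer 3

# Layer 6: hiragana, katakana, and the main CJK-ideograph block
JP_CHARS = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

def japan_layers(candidate, context=""):
    """Return the list of layer numbers (1-6) that flag this candidate."""
    hits = []
    if candidate in SUBJECTS:
        hits.append(1)
    if any(part in SURNAMES for part in candidate.split()):
        hits.append(2)
    if any(term in context for term in CONTEXT_TERMS):
        hits.append(3)
    if re.search(r"@[\w.-]+\.jp\b", candidate):    # Layer 4: .jp domain
        hits.append(4)
    if re.search(r"\+81[\d\- ]{8,}", candidate):   # Layer 5: +81 number
        hits.append(5)
    if JP_CHARS.search(candidate):
        hits.append(6)
    return hits
```

A candidate that trips several layers at once (a listed subject, with a Japanese surname, in a Tokyo-related context) carries more weight than a single Layer-3 context hit, which is how the net stays wide without drowning the review stage.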

STEP 3

Classification — Known / Unknown

Detected names are matched against the investigation subject list and classified into two categories: Known (matched to a listed individual) and Unknown (not on the list but flagged as potentially Japanese). The discovery of "Unknown" individuals is the greatest value of this system — it opens the possibility of surfacing names no one anticipated.
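The Known/Unknown split is a simple partition against the subject list, summarized per PDF in the same shape as the scanner's log lines. Names in the usage note are placeholders, not actual detections.

```python
def classify(detected, subjects):
    """Partition detected names: Known = on the subject list, Unknown = not."""
    known = sorted({n for n in detected if n in subjects})
    unknown = sorted({n for n in detected if n not in subjects})
    return known, unknown

def summary_line(pdf_name, known, unknown):
    """Per-PDF summary in the scanner's log format."""
    return f"★ {pdf_name}: {len(known)} known, {len(unknown)} unknown"
```

An "0 known, 2 unknown" line is therefore the interesting case: two flagged names that no one put on the list in advance.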

STEP 4

Cross-Theme Investigation

The established procedure is repeated across four investigative themes:

  • T-A: Japan travel and visa arrangements
  • T-B: Grand Hyatt Tokyo staff
  • T-D: money laundering via Deutsche Bank Tokyo
  • T-E: Tawaraya Inn, Kyoto

A keyword JSON file is designed for each theme, and a cross-dataset scan covering DS1–12 is executed.
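A theme file, in this design, is nothing more than a named keyword list that the scanner loads and matches against document text. The structure and keywords below are illustrative assumptions, not the investigation's actual T-A configuration.

```python
import json

# A hypothetical per-theme keyword file, inlined here for illustration.
# In the pipeline each theme ships as a standalone .json file.
THEME_TA = json.loads("""
{
  "theme": "T-A",
  "keywords": ["Japan", "Tokyo", "visa", "Narita", "itinerary"]
}
""")

def scan_text(text, theme):
    """Return the theme keywords that appear in a document's text."""
    lowered = text.lower()
    return [kw for kw in theme["keywords"] if kw.lower() in lowered]
```

Keeping the keywords in external JSON rather than in code is what makes the cross-theme runs reproducible: swapping T-A for T-B changes the file, not the scanner.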

STEP 5

Verification Against DOJ Primary Sources & Reporting

All detected findings are verified against DOJ primary documents. Processing status and results are logged to a SQLite database, enabling post-hoc auditing. The system is designed to run incrementally over months — there is no need to process 280 GB in a single session.
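The incremental design hinges on one idea: every processed PDF gets a row in SQLite, and the scanner skips anything already logged. A minimal sketch, with table and column names that are assumptions rather than the pipeline's actual schema:

```python
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) the audit log database."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS scan_log (
        pdf        TEXT PRIMARY KEY,
        known      INTEGER,
        unknown    INTEGER,
        scanned_at TEXT DEFAULT CURRENT_TIMESTAMP)""")
    return db

def record(db, pdf, known, unknown):
    """Log one PDF's result; re-running a file overwrites its row."""
    db.execute(
        "INSERT OR REPLACE INTO scan_log (pdf, known, unknown) VALUES (?, ?, ?)",
        (pdf, known, unknown))
    db.commit()

def already_scanned(db, pdf):
    """Skip PDFs already in the log -- this is what makes runs resumable."""
    return db.execute(
        "SELECT 1 FROM scan_log WHERE pdf = ?", (pdf,)).fetchone() is not None
```

Because the log persists between sessions, a 280 GB corpus can be worked through in whatever increments the hardware allows, and the same table doubles as the post-hoc audit trail.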

Bugs Discovered and Fixed

Two significant bugs were identified during the investigation and immediately corrected. The following records are dated March 5, 2026.

tc_context_aggregator v1.5
  Bug: Context exceeding 50,000 characters was truncated. The overflow was saved as only the first 2,000 characters; the remainder was permanently lost. Large documents (e.g., 107,954+ characters) lost most of their content.
  Fix: v1.6 removes the character limit; full text is retained.

tc_timeline_builder v2.2
  Bug: Markdown output was truncated to the first 600 characters of context. Classification used the full text, but human reviewers and the AI received only a fragment.
  Fix: v2.3 removes the 600-character truncation; full output is written.

After the bugs were identified, Theme T-C was re-processed using the corrected scripts. Themes T-A, T-B, T-D, and T-E were run from the outset on v1.6 / v2.3 and require no re-run. Problems encountered during the investigation are recorded and disclosed in full to maintain reproducibility.

Execution Log — Proof It Is Running

Below is actual output from the scanner running against DOJ Data Set 1. The unknown count at the end of each line indicates names not on the investigation subject list but flagged by the 6-layer filter as potentially Japanese. In other words, previously unregistered Japanese-name candidates were found within the documents.

2026-02-11 14:27:50,624 [INFO]   ★ EFTA01362493.pdf: 1 known, 0 unknown
2026-02-11 14:28:20,161 [INFO]   ★ EFTA01362931.pdf: 1 known, 0 unknown
2026-02-11 14:28:20,218 [INFO]   ★ EFTA01362932.pdf: 1 known, 0 unknown
2026-02-11 14:28:34,316 [INFO]   ★ EFTA01363137.pdf: 3 known, 0 unknown
2026-02-11 14:28:34,406 [INFO]   ★ EFTA01363138.pdf: 2 known, 0 unknown
2026-02-11 14:28:44,487 [INFO]   ★ EFTA01363299.pdf: 1 known, 0 unknown
2026-02-11 14:28:49,166 [INFO]   ★ EFTA01363354.pdf: 0 known, 2 unknown

Entries marked with ★ are PDFs in which at least one Japanese-name candidate was detected. Each of these is verified against DOJ primary documents in the next step.

Why We Publish the Methodology

Analysis of Epstein documents is ongoing worldwide. Most of it, however, consists of fragmentary reports: "this page contained this name." Showing one page cherry-picked from 3.5 million tells you nothing about its significance to the whole.

This investigation applies the same algorithm mechanically to all 280 GB of documents. Rather than targeting specific names, we extract every proper noun and then filter for Japan-related names — including text beneath redactions.

We publish the methodology because transparency is the only guarantor of credibility.

  • All tools used are open-source and independently verifiable
  • The fake-redaction detection logic is based on the publicly released x-ray library
  • All 6-layer filter parameters — surname list, keywords, regex patterns — are externalized in configuration files; any third party can reproduce the results under identical conditions
  • Processing progress and results are logged to a SQLite database and remain auditable after the fact

Investigation updates are reported on X: @FactCheckAomori