Why General Tokenizers Struggle with Drug Safety Text
A practical look at why general-purpose tokenizers fragment pharmacovigilance text, inflate token counts, and miss domain-specific meaning.
What is a token?
A token is the smallest unit of text that a language model actually reads. Think of it like this: when you read the word “gastroesophageal,” your brain processes it as one concept — you recognize it whole. But when a general-purpose model reads that same word, it might see something entirely different: g ast ro es oph age al — seven fragments where you see one.
The number of tokens a text produces matters because every model has a finite context window — the maximum number of tokens it can process at once. Fewer tokens means less compute, lower cost, and for long documents like ICSR narratives, the difference between fitting a full case report into context or truncating halfway through.
What is a tokenizer?
A tokenizer is the piece of software that converts raw text into tokens before a language model processes it. It’s the very first thing that touches your data. Everything a model “knows” about your text — every medical term, every lab value, every drug name — is filtered through the tokenizer’s vocabulary.
If the tokenizer doesn’t recognize “rhabdomyolysis” as a meaningful unit, neither will the model. Or at least, the model will have to work much harder to piece it together from fragments.
What is vocabulary?
The vocabulary is the complete set of tokens a tokenizer knows about. GPT-2 has a vocabulary of 50,257 tokens. That sounds like a lot until you realize that GPT-2 was trained on general internet text — Reddit posts, news articles, Wikipedia entries. Its vocabulary reflects that world.
Think of vocabulary as the tokenizer’s dictionary. Every word that isn’t in the dictionary gets broken into smaller pieces — subwords — that are in the dictionary. It’s like having a dictionary that contains “cat” and “dog” but not “catheter” or “digitoxin.” You can still spell them out with smaller pieces, but it’s inefficient and noisy.
What is BPE?
Byte Pair Encoding (BPE) is the algorithm most modern tokenizers use to build their vocabulary. It works like this:
- Start with every individual character as its own token:
h,e,p,a,t,i,t,i,s - Find the pair of tokens that appears together most often in the training data — say,
ti - Merge
tiinto a single token - Repeat thousands of times
After enough iterations, frequent words and word-parts become single tokens. The algorithm is completely driven by what it sees in training data. If your training data is Reddit comments and Wikipedia articles, “hepatotoxicity” never appears frequently enough to survive as a single token. It gets shattered into fragments.
What is a BPE tokenizer?
A BPE tokenizer combines the BPE vocabulary with rules for applying it. GPT-2 uses a “byte-level” BPE tokenizer. This means space characters aren’t separate tokens — they get fused to the beginning of the next word. That’s why you see things like Ġfemale in GPT-2 token outputs. The Ġ literally represents “this word starts after a space.”
BioGPT (Microsoft’s biomedically-trained model) uses a different variant — the Moses-style BPE tokenizer — where end-of-word boundaries get marked with </w>. Same algorithm, slightly different rules, vastly different results depending on training data.
Why pharmacovigilance text is different
Pharmacovigilance text — ICSR narratives, adverse event descriptions, drug safety reports — sits at the intersection of three linguistic domains that general tokenizers are terrible at:
1. Clinical terminology. Words like “thrombocytopenia,” “agranulocytosis,” and “paroxysmal atrial fibrillation” are everyday vocabulary in safety case processing. To GPT-2, they’re alien compounds to be dissected letter by letter.
2. Drug names. Both generic (“atorvastatin calcium,” “clopidogrel bisulfate”) and brand names follow naming conventions that general text never sees. Generic drug names use WHO INN stems — suffixes like “-vastatin,” “-sartan,” “-lol” — that carry pharmacological meaning. A domain-aware tokenizer can learn these patterns.
3. Structured data in free text. Lab values (ALT 245 U/L, INR 3.8, eGFR 28 mL/min/1.73m2), dosing information (40 mg daily), and coded terms mixed into narratives create tokenization nightmares for general-purpose vocabularies.
The consequence? A general tokenizer sees ICSR text as a cascade of unknown subwords, inflating token counts by 60% or more.
Tokenization examples from drug safety text
Here are real examples from running GPT-2’s tokenizer against ICSR narratives:
| Medical Term | GPT-2 Tokens | Count |
|---|---|---|
| hyperlipidemia | hyper lip id emia | 4 |
| atorvastatin | ator v ast atin | 4 |
| rhabdomyolysis | r hab dom y oly sis | 6 |
| clopidogrel bisulfate | cl op id og rel bis ulf ate | 8 |
| gastroesophageal reflux disease | g ast ro es oph age al ref lux disease | 10 |
Now look at those same terms through BioGPT (trained on PubMed) and a custom tokenizer trained on ICSR case reports:
| Medical Term | GPT-2 (general) | BioGPT (PubMed) | Custom ICSR BPE |
|---|---|---|---|
| hyperlipidemia | 4 tokens | 1 token | 1 token |
| atorvastatin | 4 tokens | 1 token | 2 tokens |
| rhabdomyolysis | 6 tokens | 1 token | 2 tokens |
| clopidogrel bisulfate | 8 tokens | 4 tokens | 3 tokens |
| gastroesophageal reflux disease | 10 tokens | 3 tokens | 4 tokens |
The gap is stark. GPT-2 shreds gastroesophageal reflux disease into 10 fragments. A PubMed-trained tokenizer sees it in 3 pieces. The ICSR-trained tokenizer handles it cleanly in 4. Even BioGPT has trouble with the full drug name “clopidogrel bisulfate” (4 tokens), while the ICSR tokenizer compresses it further (3 tokens) because it’s seen that exact pattern in its training data.
And it’s not just isolated terms. Here’s an actual ICSR narrative fragment run through both tokenizers:
GPT-2:
hyperlipidemiapresentedtotheemergency
Domain tokenizer:
hyperlipidemiapresentedtotheemergency
One sees “hyperlipidemia” as a meaningful concept. The other sees noise.
Measuring fragmentation
To quantify just how bad this gets, we tested 49 medical terms — adverse events, drug names, conditions, and lab values — across three tokenizers. Then we ran five realistic ICSR narratives (~100 words each) through the same comparison.
Medical term fragmentation:
| Tokenizer | Avg tokens/term | Terms intact (1 token) | Terms in ≥3 fragments |
|---|---|---|---|
| GPT-2 (general web) | 5.84 | 0 of 49 | 49 of 49 |
| BioGPT (PubMed) | 2.90 | 15 of 49 | 24 of 49 |
| Custom ICSR BPE | 3.22 | 4 of 49 | 23 of 49 |
GPT-2 couldn’t keep a single medical term intact. Not one. The average drug name, condition, or lab value took nearly 6 tokens to represent. BioGPT — trained on PubMed’s 30 million abstracts — managed to keep 15 terms whole, but still fragmented half the list.
Narrative efficiency (5 ICSR case reports, 489 total words):
| Tokenizer | Total tokens | Token/word ratio | Reduction vs GPT-2 |
|---|---|---|---|
| GPT-2 (general) | 780 | 1.60 | — |
| BioGPT (PubMed) | 624 | 1.28 | 20.0% |
| Custom ICSR BPE | 606 | 1.24 | 22.3% |
Across just five sample narratives, the domain-trained tokenizer saved 174 tokens compared to GPT-2. For context: if you’re processing 15,000 ICSR reports (the size of the BioDEX-ICSR dataset), that translates to roughly half a million fewer tokens. At current API pricing and considering context window limitations, that matters.
Why this motivates a PV-specific tokenizer
The data makes a clear case, but let me frame it practically.
1. Token efficiency is cost efficiency. Every API call to a hosted LLM charges by token. If 22% of your tokens are wasted on morphological noise that a medical tokenizer would compress, you’re leaving money on the table at scale.
2. Context windows are scarce. A GPT-2 vocabulary fluffs up ICSR text by 60%. If your narrative is 800 words, GPT-2 needs ~1,280 tokens. That eats into the context window before your model even sees the full case. For large models with longer contexts this is less pressing, but for many production deployments running smaller models, every wasted token is model attention not spent on clinical reasoning.
3. Model performance depends on token quality. A model can only “think” in tokens it knows. When “Stevens-Johnson syndrome” comes in as five fragmented subwords, the model’s attention mechanism has to reconstruct the concept from pieces. That’s cognitive overhead that a domain-aware tokenizer eliminates.
4. BioGPT proves the concept but isn’t optimized for PV. BioGPT’s PubMed training helps massively (20% reduction), but it was trained on biomedical research articles — not case report narratives. The ICSR-trained tokenizer edged it out by another 2.3% because it learned the specific patterns of case reports: the rhythm of “patient presented with,” the structure of lab panels, the co-occurrence of drugs and reactions.
The bottom line: if you’re building language model applications in pharmacovigilance, your tokenizer is not a commodity layer. It’s the lens through which your model sees every single word of every single case. Spending a few hundred lines of code to give it domain-specific vision pays itself back in tokens — literally — from the very first API call.