Why General Tokenizers Struggle with Drug Safety Text

A practical look at why general-purpose tokenizers fragment pharmacovigilance text, inflate token counts, and miss domain-specific meaning.

Tags: Pharmacovigilance, Tokenization, AI, NLP

What is a token?

[Animation: a sentence is split into the smallest units a model can read, its tokens.]

A token is the smallest unit of text that a language model actually reads. Think of it like this: when you read the word “gastroesophageal,” your brain processes it as one concept — you recognize it whole. But when a general-purpose model reads that same word, it might see something entirely different: g ast ro es oph age al — seven fragments where you see one.

The number of tokens a text produces matters because every model has a finite context window — the maximum number of tokens it can process at once. Fewer tokens means less compute, lower cost, and for long documents like ICSR narratives, the difference between fitting a full case report into context or truncating halfway through.


What is a tokenizer?

[Animation: the tokenizer sits between raw text and the language model. It is the first pipeline stage: raw text goes in, tokens come out.]

A tokenizer is the piece of software that converts raw text into tokens before a language model processes it. It’s the very first thing that touches your data. Everything a model “knows” about your text — every medical term, every lab value, every drug name — is filtered through the tokenizer’s vocabulary.

If the tokenizer doesn’t recognize “rhabdomyolysis” as a meaningful unit, neither will the model. Or at least, the model will have to work much harder to piece it together from fragments.


What is vocabulary?

[Animation: the vocabulary is the dictionary of every token the tokenizer knows.]

The vocabulary is the complete set of tokens a tokenizer knows about. GPT-2 has a vocabulary of 50,257 tokens. That sounds like a lot until you realize that GPT-2 was trained on WebText, a scrape of general web pages linked from Reddit posts: news articles, blogs, forum threads. Its vocabulary reflects that world.

Think of vocabulary as the tokenizer’s dictionary. Every word that isn’t in the dictionary gets broken into smaller pieces — subwords — that are in the dictionary. It’s like having a dictionary that contains “cat” and “dog” but not “catheter” or “digitoxin.” You can still spell them out with smaller pieces, but it’s inefficient and noisy.
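To make the dictionary analogy concrete, here is a minimal sketch that splits a word into the longest pieces it can find in a vocabulary. The vocabulary below is a hypothetical toy, and greedy longest-match is a simplification chosen for readability; real BPE tokenizers apply learned merges instead.

```python
def segment(word, vocab):
    """Split `word` into the longest pieces found in `vocab`, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until something matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical toy vocabulary: a few everyday words plus short fragments.
TOY_VOCAB = {"cat", "dog", "he", "ter", "dig", "it", "ox", "in"}

print(segment("cat", TOY_VOCAB))        # ['cat']
print(segment("catheter", TOY_VOCAB))   # ['cat', 'he', 'ter']
print(segment("digitoxin", TOY_VOCAB))  # ['dig', 'it', 'ox', 'in']
```

The everyday word comes out whole, while the clinical words are spelled out of pieces that carry no medical meaning on their own.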


What is BPE?

[Animation: Byte Pair Encoding repeatedly merges the most frequent adjacent pair into a new token.]

Byte Pair Encoding (BPE) is the algorithm most modern tokenizers use to build their vocabulary. It works like this:

  1. Start with every individual character as its own token: h, e, p, a, t, i, t, i, s
  2. Find the pair of tokens that appears together most often in the training data — say, ti
  3. Merge ti into a single token
  4. Repeat thousands of times

After enough iterations, frequent words and word-parts become single tokens. The algorithm is completely driven by what it sees in training data. If your training data is general web text, “hepatotoxicity” rarely appears often enough to be merged into a single token. It gets shattered into fragments.
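The four steps above can be sketched in a few lines of Python. This is a teaching sketch rather than a production trainer: the corpus is a made-up list of four words, ties are broken by whichever pair was seen first, and the function name `learn_bpe` is my own.

```python
from collections import Counter


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Step 1: every word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        merges.append(best)
        # Step 3: rewrite every word with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words  # Step 4: repeat with the updated words.
    return merges, words


# Made-up toy corpus for illustration.
corpus = ["hepatitis", "hepatitis", "hepatic", "gastritis"]
merges, words = learn_bpe(corpus, num_merges=5)
print(merges[0])  # ('t', 'i')
```

In this toy corpus the first merge is indeed t + i, because that pair occurs six times across the four words. Swap the corpus for one without medical terms and the merges that build “hepat” or “itis” never get learned.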


What is a BPE tokenizer?

[Animation: a BPE tokenizer is the BPE vocabulary plus the rules for applying it to new text, producing token IDs.]

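A BPE tokenizer combines the BPE vocabulary with rules for applying it. GPT-2 uses a “byte-level” BPE tokenizer: it starts from the 256 possible byte values rather than from characters, so any input can be encoded without an unknown-token fallback. Its pre-tokenization step also keeps each space attached to the word that follows instead of emitting it as a separate token. That’s why you see things like Ġfemale in GPT-2 token outputs. The Ġ is simply how the space byte is displayed, so it reads as “this word starts after a space.”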
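The “rules for applying it” part can be sketched as replaying the learned merges, in the order they were learned, over a new word. The merge list below is hypothetical (imagine it was learned from text where “hepat” and “itis” are common), and the sketch skips details real tokenizers add, such as pre-tokenization and mapping tokens to IDs.

```python
def apply_bpe(word, merges):
    """Encode `word` by replaying learned merges in the order they were learned."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge list, earliest (most frequent) merge first.
MERGES = [
    ("i", "t"), ("i", "s"), ("it", "is"),
    ("h", "e"), ("p", "a"), ("he", "pa"), ("hepa", "t"),
]

print(apply_bpe("hepatitis", MERGES))
# ['hepat', 'itis']
print(apply_bpe("hepatotoxicity", MERGES))
# ['hepat', 'o', 't', 'o', 'x', 'i', 'c', 'it', 'y']
```

A word whose parts were frequent in training compresses to two tokens. A word the training data barely covered falls back to near character level for everything past the shared prefix.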

BioGPT (Microsoft’s biomedically-trained model) uses a different variant: text is first split with the Moses tokenizer and then encoded with BPE, with end-of-word boundaries marked by </w>. Same algorithm, slightly different rules, vastly different results depending on training data.
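Here is a small sketch of where those two markers come from. The first function rebuilds the byte-to-character table as I understand it from GPT-2's released encoder (treat that correspondence as an assumption). The second simply appends </w> to whitespace-split words, which only imitates the marker convention and not Moses tokenization itself. The helper names are mine.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100))  # U+00AE .. U+00FF
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Unprintable bytes (space, newline, ...) are shifted past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def byte_level_view(text):
    """Show text the way a byte-level vocabulary displays it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def end_of_word_view(text):
    """Mark word endings the way </w>-style BPE vocabularies do."""
    return [word + "</w>" for word in text.split()]


print(byte_level_view(" female"))          # Ġfemale
print(end_of_word_view("female patient"))  # ['female</w>', 'patient</w>']
```

The space byte (32) is the 33rd unprintable byte, so it lands on code point 288, which is Ġ. One scheme marks where a word begins, the other where it ends.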


Why pharmacovigilance text is different

Pharmacovigilance text — ICSR narratives, adverse event descriptions, drug safety reports — sits at the intersection of three linguistic domains that general tokenizers are terrible at:

1. Clinical terminology. Words like “thrombocytopenia,” “agranulocytosis,” and “paroxysmal atrial fibrillation” are everyday vocabulary in safety case processing. To GPT-2, they’re alien compounds to be dissected piece by piece.

2. Drug names. Both generic (“atorvastatin calcium,” “clopidogrel bisulfate”) and brand names follow naming conventions that general text rarely sees. Generic drug names use WHO INN stems, suffixes like “-vastatin,” “-sartan,” and “-olol,” that carry pharmacological meaning. A domain-aware tokenizer can learn these patterns.

3. Structured data in free text. Lab values (ALT 245 U/L, INR 3.8, eGFR 28 mL/min/1.73 m²), dosing information (40 mg daily), and coded terms mixed into narratives create tokenization nightmares for general-purpose vocabularies.

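The consequence? A general tokenizer breaks ICSR text into a cascade of short, low-meaning subwords. In the narratives measured below, GPT-2 needed about 1.6 tokens per word, roughly 29% more tokens than a tokenizer trained on case reports.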


Tokenization examples from drug safety text

Here are real examples from running GPT-2’s tokenizer against ICSR narratives:

| Medical term | GPT-2 tokens | Count |
|---|---|---|
| hyperlipidemia | hyper lip id emia | 4 |
| atorvastatin | ator v ast atin | 4 |
| rhabdomyolysis | r hab dom y oly sis | 6 |
| clopidogrel bisulfate | cl op id og rel bis ulf ate | 8 |
| gastroesophageal reflux disease | g ast ro es oph age al ref lux disease | 10 |

Now look at those same terms through BioGPT (trained on PubMed) and a custom tokenizer trained on ICSR case reports:

| Medical term | GPT-2 (general) | BioGPT (PubMed) | Custom ICSR BPE |
|---|---|---|---|
| hyperlipidemia | 4 tokens | 1 token | 1 token |
| atorvastatin | 4 tokens | 1 token | 2 tokens |
| rhabdomyolysis | 6 tokens | 1 token | 2 tokens |
| clopidogrel bisulfate | 8 tokens | 4 tokens | 3 tokens |
| gastroesophageal reflux disease | 10 tokens | 3 tokens | 4 tokens |

The gap is stark. GPT-2 shreds gastroesophageal reflux disease into 10 fragments. A PubMed-trained tokenizer sees it in 3 pieces. The ICSR-trained tokenizer handles it cleanly in 4. Even BioGPT has trouble with the full drug name “clopidogrel bisulfate” (4 tokens), while the ICSR tokenizer compresses it further (3 tokens) because it’s seen that exact pattern in its training data.

And it’s not just isolated terms. Here’s an actual ICSR narrative fragment run through both tokenizers:

GPT-2: hyper lip id emia presented to the emergency

Domain tokenizer: hyperlipidemia presented to the emergency

One sees “hyperlipidemia” as a meaningful concept. The other sees noise.


Measuring fragmentation

To quantify just how bad this gets, we tested 49 medical terms — adverse events, drug names, conditions, and lab values — across three tokenizers. Then we ran five realistic ICSR narratives (~100 words each) through the same comparison.

Medical term fragmentation:

| Tokenizer | Avg tokens/term | Terms intact (1 token) | Terms in ≥3 fragments |
|---|---|---|---|
| GPT-2 (general web) | 5.84 | 0 of 49 | 49 of 49 |
| BioGPT (PubMed) | 2.90 | 15 of 49 | 24 of 49 |
| Custom ICSR BPE | 3.22 | 4 of 49 | 23 of 49 |

GPT-2 couldn’t keep a single medical term intact. Not one. The average drug name, condition, or lab value took nearly 6 tokens to represent. BioGPT, pre-trained on about 15 million PubMed abstracts, managed to keep 15 terms whole, but still split about half the list (24 of 49) into three or more pieces.
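For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of the three statistics. It uses only the five GPT-2 splits from the example table earlier in this post, not the full 49-term list, so its numbers differ from the table above. The function name and dictionary keys are my own.

```python
# GPT-2 splits copied from the five-term example table earlier in this post.
GPT2_SPLITS = {
    "hyperlipidemia": "hyper lip id emia",
    "atorvastatin": "ator v ast atin",
    "rhabdomyolysis": "r hab dom y oly sis",
    "clopidogrel bisulfate": "cl op id og rel bis ulf ate",
    "gastroesophageal reflux disease": "g ast ro es oph age al ref lux disease",
}


def fragmentation_stats(splits):
    """Summarize how badly a tokenizer fragments a list of terms."""
    counts = [len(pieces.split()) for pieces in splits.values()]
    return {
        "avg_tokens_per_term": sum(counts) / len(counts),
        "intact": sum(1 for c in counts if c == 1),
        "three_or_more": sum(1 for c in counts if c >= 3),
    }


print(fragmentation_stats(GPT2_SPLITS))
# {'avg_tokens_per_term': 6.4, 'intact': 0, 'three_or_more': 5}
```

To compare tokenizers, build one such dictionary per tokenizer from its actual output and run the same function over each.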

Narrative efficiency (5 ICSR case reports, 489 total words):

| Tokenizer | Total tokens | Token/word ratio | Reduction vs GPT-2 |
|---|---|---|---|
| GPT-2 (general) | 780 | 1.60 | (baseline) |
| BioGPT (PubMed) | 624 | 1.28 | 20.0% |
| Custom ICSR BPE | 606 | 1.24 | 22.3% |

Across just five sample narratives, the domain-trained tokenizer saved 174 tokens compared to GPT-2. For context: if you’re processing 15,000 ICSR reports (the size of the BioDEX-ICSR dataset) and each one saves the same ~35 tokens as these short samples, that translates to roughly half a million fewer tokens. With per-token API pricing and finite context windows, that matters.
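The arithmetic behind those figures is easy to check. This sketch recomputes the ratios, the reductions, and the 15,000-report projection from the totals reported above. The projection assumes every report saves the same number of tokens as the five samples did, which is a simplification.

```python
TOTAL_WORDS = 489
NUM_SAMPLES = 5
TOTAL_TOKENS = {"GPT-2": 780, "BioGPT": 624, "Custom ICSR BPE": 606}

baseline = TOTAL_TOKENS["GPT-2"]


def reduction_pct(total):
    """Percentage of tokens saved relative to the GPT-2 baseline."""
    return (baseline - total) / baseline * 100


for name, total in TOTAL_TOKENS.items():
    ratio = total / TOTAL_WORDS
    print(f"{name}: {ratio:.2f} tokens/word, {reduction_pct(total):.1f}% fewer than GPT-2")

# Tokens saved on the five samples, scaled up to 15,000 reports.
saved = baseline - TOTAL_TOKENS["Custom ICSR BPE"]
projected = saved * 15_000 // NUM_SAMPLES
print(saved, projected)  # 174 522000
```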


Why this motivates a PV-specific tokenizer

The data makes a clear case, but let me frame it practically.

1. Token efficiency is cost efficiency. Every API call to a hosted LLM charges by token. If 22% of your tokens are wasted on morphological noise that a medical tokenizer would compress, you’re leaving money on the table at scale.

2. Context windows are scarce. In the measurements above, GPT-2’s vocabulary produced about 1.6 tokens per word of ICSR text. If your narrative is 800 words, GPT-2 needs ~1,280 tokens, versus roughly 990 with the ICSR-trained tokenizer. That eats into the context window before your model even sees the full case. For large models with longer contexts this is less pressing, but for many production deployments running smaller models, every wasted token is model attention not spent on clinical reasoning.
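A quick way to see the effect is to estimate the token count for the 800-word narrative under each ratio and compare it with a window size. The 1,024-token window below is an assumption chosen for illustration (it matches GPT-2's own limit), and the estimate ignores any tokens reserved for a prompt or for the model's output.

```python
def estimated_tokens(word_count, tokens_per_word):
    """Rough token estimate from a measured tokens-per-word ratio."""
    return round(word_count * tokens_per_word)


# Ratios from the five-narrative comparison in this post.
RATIOS = {"GPT-2": 1.60, "Custom ICSR BPE": 1.24}

# Assumption for illustration only: a 1,024-token context window.
CONTEXT_WINDOW = 1024

for name, ratio in RATIOS.items():
    needed = estimated_tokens(800, ratio)
    verdict = "fits" if needed <= CONTEXT_WINDOW else "truncated"
    print(name, needed, verdict)
# GPT-2 1280 truncated
# Custom ICSR BPE 992 fits
```

Under these assumptions the same narrative is truncated with one vocabulary and fits whole with the other.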

3. Model performance depends on token quality. A model can only “think” in tokens it knows. When “Stevens-Johnson syndrome” comes in as five fragmented subwords, the model’s attention mechanism has to reconstruct the concept from pieces. That’s cognitive overhead that a domain-aware tokenizer reduces.

4. BioGPT proves the concept but isn’t optimized for PV. BioGPT’s PubMed training helps massively (20% reduction), but it was trained on biomedical research articles, not case report narratives. The ICSR-trained tokenizer edged it out on narratives by another 2.3 percentage points (22.3% vs. 20.0% reduction), even though BioGPT fragmented isolated terms slightly less. The likely reason is that it learned the specific patterns of case reports: the rhythm of “patient presented with,” the structure of lab panels, the co-occurrence of drugs and reactions.

The bottom line: if you’re building language model applications in pharmacovigilance, your tokenizer is not a commodity layer. It’s the lens through which your model sees every single word of every single case. Spending a few hundred lines of code to give it domain-specific vision pays for itself in tokens, literally, from the very first API call.