Tokenizers

How Fuse counts tokens, the encodings it supports, and how the tokenizer flag resolves a model or encoding name.

Fuse reports the token count of every fusion so you can judge how much of a model's context window it will consume. For OpenAI-compatible encodings it produces that figure with the Microsoft.ML.Tokenizers library, counting the same tokens the target model counts rather than estimating from character length. For the Anthropic (Claude) and Google (Gemini) families, which do not publish a local tokenizer for their current models, Fuse uses a deterministic estimator instead; those counts are approximate.

This page is for engineers who want the reported count to match a specific model and maintainers who need the exact resolution rules.

Purpose And Scope

Fuse counts tokens during the Emission stage, after reduction. The encoding it uses determines the count, because different encodings tokenize the same text differently. The default encoding is o200k_base, the encoding used by the current generation of OpenAI models. You select a different encoding with the --tokenizer flag, which accepts either a model name or an encoding name. See the Options reference.

This page documents how a tokenizer value resolves to an encoding. It does not cover the manifest or the reported summary line, which the Core Concepts page describes.

Resolution Rules

The --tokenizer value resolves to an encoding by the following rules, applied in order. Matching is case-insensitive and surrounding whitespace is trimmed.

Input	Resolves to
`o200k_base`	`o200k_base`
`gpt-4o`	`o200k_base`
`gpt-4o-mini`	`o200k_base`
`gpt-4.1`	`o200k_base`
`gpt-4.1-mini`	`o200k_base`
`cl100k_base`	`cl100k_base`
`gpt-4`	`cl100k_base`
`gpt-3.5-turbo`	`cl100k_base`
`gpt-3.5`	`cl100k_base`
Any value starting with `claude` or `anthropic`	Anthropic estimator
Any value starting with `gemini` or `google`	Gemini estimator
Any value containing an underscore	Used as-is, treated as an encoding name
Anything else	`o200k_base` (the default)

The Anthropic and Gemini families are matched first, by name prefix, so claude-opus-4 and gemini-1.5-pro resolve to their estimators. Two consequences follow from the remaining rules. A value containing an underscore is passed through unchanged on the assumption that it names an encoding, so a misspelled encoding name is used verbatim rather than corrected. Any value that matches no known alias and contains no underscore falls back to the default o200k_base rather than raising an error.

Estimated Encodings

Anthropic and Google do not publish a local tokenizer vocabulary for their current models, so an exact offline count is not possible; both expose a token-counting API for exact figures. For these families Fuse uses a deterministic estimator: each run of letters and digits costs ceil(length / charsPerToken) tokens and each other non-whitespace character costs one token. The Anthropic estimator uses 3.5 characters per token and the Gemini estimator uses 4, both published rules of thumb for English and code. The result is a stable estimate suitable for budgeting, not an exact count. For an exact offline count, use an OpenAI encoding; for an exact provider-specific count, use that provider's counting API. The estimators are AOT-safe and allocation-light, with no reflection.

Choosing An Encoding

Choose the encoding that matches the model you will paste the fusion into. If you target a current OpenAI model, the default o200k_base is correct and no flag is needed. If you target an older model in the GPT-4 or GPT-3.5 family, pass cl100k_base or one of its model aliases so the reported count matches that model. Counts produced under one encoding are an approximation for a model that uses a different one.

What This Does Not Cover

This page covers token counting and encoding resolution. It does not cover the token budget, the split threshold, or how the manifest reports per-file token costs; those are part of the Emission stage and are documented in the Output Specification.

See the Options reference for the --tokenizer flag and related budget options, or Core Concepts for where token counting sits in the pipeline.

Purpose And Scope

Resolution Rules

Estimated Encodings

Choosing An Encoding

What This Does Not Cover

Next

On this page