
✂️ Tokenizers

Tokenizers split raw text into smaller units (tokens) and assign each token:

The text (string form of the token)

The start and end character offsets in the original text

The token ID (usually the vocabulary index from a model tokenizer)

Chisel uses tokenizers to align character-level entity spans with model-compatible tokens for tasks like Named Entity Recognition (NER).


🧱 Tokenizer Interface

Each tokenizer implements the following protocol:

from typing import List, Protocol

class Tokenizer(Protocol):
    def tokenize(self, text: str) -> List[Token]:
        ...

Where Token is:

from pydantic import BaseModel

class Token(BaseModel):
    id: int       # model vocab ID
    text: str     # token string
    start: int    # start char index
    end: int      # end char index
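
The offsets are plain character indices into the original string, so text[start:end] recovers the token's surface form. A small illustration (the id value below is an arbitrary placeholder, not a real vocabulary index):

text = "Barack Obama was president."

# "Obama" covers characters 7 through 11, so start=7 and end=12 (end-exclusive)
tok = Token(id=0, text="Obama", start=7, end=12)

assert text[tok.start:tok.end] == "Obama"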

🧰 Built-in Tokenizers

1. HuggingFaceTokenizer

Wraps any pretrained Hugging Face tokenizer (AutoTokenizer) and outputs properly aligned Token objects.

🔧 Parameters

Name        Description
model_id    Pretrained model name (e.g., "bert-base-uncased", "distilroberta-base")

🧪 Example

from chisel.tokenizers.hf_tokenizer import HuggingFaceTokenizer

tokenizer = HuggingFaceTokenizer("bert-base-uncased")
tokens = tokenizer.tokenize("Barack Obama was president.")

Returns a list of Token objects with character offsets and token IDs.
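
A quick way to inspect the alignment is to print each token next to the character span it covers in the original string (a minimal sketch, assuming only the Token fields shown above):

text = "Barack Obama was president."
tokens = HuggingFaceTokenizer("bert-base-uncased").tokenize(text)

for tok in tokens:
    # Each token carries its vocab id, its string form, and the character
    # span it covers in the original text.
    print(tok.id, tok.text, text[tok.start:tok.end])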

⚠️ Tokenizer Behavior

Different tokenizers use different subword strategies:

WordPiece (e.g., BERT): splits rare or unknown words into subword fragments, marking word-internal pieces with a ## prefix

BPE (e.g., RoBERTa, GPT-2): merges frequent byte pairs, so rare words are often split into very short pieces or individual characters

SentencePiece (e.g., ALBERT, T5): learned segmentation of raw text, with ▁ marking the start of a new word

These affect how span alignment and labeling must be handled. Chisel supports multiple subword alignment strategies (see Labelers) to accommodate this.
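
To see these differences directly, you can compare the raw subword output of a few tokenizers. The sketch below calls transformers.AutoTokenizer directly rather than Chisel's wrapper, and downloading the models requires network access:

from transformers import AutoTokenizer

sentence = "Barack Obama was president."

for model_id in ["bert-base-uncased", "roberta-base", "albert-base-v2"]:
    hf_tok = AutoTokenizer.from_pretrained(model_id)
    # WordPiece marks word-internal pieces with "##", byte-level BPE marks
    # word-initial pieces with "Ġ", and SentencePiece prefixes words with "▁".
    print(model_id, hf_tok.tokenize(sentence))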

🧠 Tips

Use tokenizer.tokenize() when creating datasets or debugging span alignment.

The returned tokens are automatically compatible with labelers and chunkers.

You can test tokenizer behavior on edge cases using the tests/test_tokenizers suite.
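
As an illustration, a minimal offset sanity check could look like the following (a sketch only; the test name and assertion are not taken from the actual suite):

from chisel.tokenizers.hf_tokenizer import HuggingFaceTokenizer

def test_offsets_stay_within_bounds():
    text = "Barack Obama was president."
    tokens = HuggingFaceTokenizer("bert-base-uncased").tokenize(text)
    for tok in tokens:
        # Every offset pair must form a valid character span inside the text.
        assert 0 <= tok.start <= tok.end <= len(text)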

➕ Custom Tokenizer

To create your own tokenizer, implement the Tokenizer protocol:

class MyCustomTokenizer:
    def tokenize(self, text: str) -> List[Token]:
        # Custom logic
        ...
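
For example, a minimal whitespace tokenizer could be sketched as follows. This is not part of Chisel: the id=-1 placeholder and the regex-based splitting are assumptions, and Token is the model shown in the Tokenizer Interface section above (its exact import path depends on your project layout):

import re
from typing import List

class WhitespaceTokenizer:
    """Splits on whitespace and records character offsets for each token."""

    def tokenize(self, text: str) -> List[Token]:
        return [
            Token(
                id=-1,  # placeholder: no model vocabulary backs this tokenizer
                text=match.group(),
                start=match.start(),
                end=match.end(),
            )
            for match in re.finditer(r"\S+", text)
        ]

Because the protocol is structural, any class with a matching tokenize() method works with the rest of Chisel; no base class is required.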

Next up: 🏷 Labelers