Skip to content

🏷 Labelers

Labelers convert annotated EntitySpan objects into token-level labels, typically in the BIO, BILOU, or Binary formats, which are required for training token classification models (e.g., for NER).


🧱 Labeler Interface

class Labeler(Protocol):
    subword_strategy: Literal["first", "all", "strict"] = "strict"
    misalignment_policy: Literal["skip", "warn", "fail"] = "skip"

    def label(self, tokens: List[Token], entities: List[EntitySpan]) -> List[str]:
        ...

βš™οΈ Parameters

Parameter Description
subword_strategy How to label subword tokens:
β€’ "first" = label first subword only
β€’ "all" = label all subwords
β€’ "strict" = label only if token exactly matches entity
misalignment_policy How to handle tokens that don’t align with any entity:
β€’ "skip" = ignore
β€’ "warn" = log a warning
β€’ "fail" = raise an error

🧰 Built-in Labelers

1. BIOLabeler

Applies standard BIO tagging:

B-

I-

O: Outside any entity

2. BILOLabeler

Uses more expressive BILOU tagging:

B-

I-

L-

O: Outside

U-

3. BinaryLabeler

Simplified format for binary tasks:

ENTITY: Any token part of a span

O: Outside

πŸ§ͺ Example

from chisel.labelers.bio import BIOLabeler

labeler = BIOLabeler(subword_strategy="first", misalignment_policy="warn")
labels = labeler.label(tokens, entities)
# ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]

⚠️ Subword Behavior

Subword tokenization can fragment entities:

Input text: "Barack Obama"

Tokens: ["Bar", "##ack", "Obama"]

Depending on the subword_strategy:

"first" β†’ ["B-PER", "I-PER", "O"]

"all" β†’ ["B-PER", "I-PER", "I-PER"]

"strict" β†’ will only label if one token covers the full span

🧠 Tips

BIO/BILOU output is compatible with most token classification models.

Use LabelEncoder to convert string labels to integer IDs.

For debugging span alignment, use TokenAlignmentValidator.

πŸ”’ LabelEncoder

The SimpleLabelEncoder is a lightweight utility for converting between string-based labels (e.g. "B-PER", "O") and integer IDs required by most machine learning frameworks.

Unlike typical encoders, this version requires you to pass in the label mapping explicitly at initialization β€” making its behavior predictable and immutable.

🧰 Features

Requires an explicit label_to_id dictionary at initialization.

  • Converts labels to IDs (encode) and vice versa (decode).

  • Throws helpful errors if unknown labels or IDs are encountered.

  • Can be used internally in export pipelines for Hugging Face and PyTorch compatibility.

πŸ§ͺ Example

from chisel.extraction.labelers.label_encoder import SimpleLabelEncoder

# Define a label-to-id mapping
label_map = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-LOC": 3,
    "I-LOC": 4
}

encoder = SimpleLabelEncoder(label_to_id=label_map)

# Encode a list of labels
label_ids = encoder.encode(["B-PER", "I-PER", "O"])  # ➝ [1, 2, 0]

# Decode a list of label IDs
decoded = encoder.decode([1, 2, 0])  # ➝ ["B-PER", "I-PER", "O"]

⚠️ Error Handling

If you try to encode or decode unknown values, the encoder raises a clear error:

encoder.encode(["B-ORG"])  
# ValueError: Unknown label 'B-ORG' encountered during encoding.

πŸ“¦ API Reference

Method Description
encode(labels) Convert list of label strings to integer IDs
decode(ids) Convert list of integer IDs back to label strings
get_label_to_id() Return internal label β†’ ID dictionary
get_id_to_label() Return internal ID β†’ label dictionary