🏷 Labelers
Labelers convert annotated EntitySpan objects into token-level labels, typically in the BIO, BILOU, or Binary formats, which are required for training token classification models (e.g., for NER).
🧱 Labeler Interface
from typing import List, Literal, Protocol

# Token and EntitySpan are the library's token and span types.
class Labeler(Protocol):
    subword_strategy: Literal["first", "all", "strict"] = "strict"
    misalignment_policy: Literal["skip", "warn", "fail"] = "skip"

    def label(self, tokens: List[Token], entities: List[EntitySpan]) -> List[str]:
        ...
⚙️ Parameters
Parameter | Description |
---|---|
subword_strategy | How to label subword tokens: "first" = label the first subword only; "all" = label all subwords; "strict" = label only if the token exactly matches the entity |
misalignment_policy | How to handle tokens that don't align with any entity: "skip" = ignore; "warn" = log a warning; "fail" = raise an error |
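The three misalignment policies can be pictured as a small dispatch function. The sketch below is illustrative only: `handle_misalignment`, the logger name, and the choice of `ValueError` for "fail" are assumptions, not the library's implementation.

```python
import logging
from typing import Literal

logger = logging.getLogger("labeler_sketch")  # hypothetical logger name

MisalignmentPolicy = Literal["skip", "warn", "fail"]


def handle_misalignment(policy: MisalignmentPolicy, message: str) -> bool:
    """Apply a misalignment policy; return True if labeling should continue."""
    if policy == "skip":
        return True  # silently ignore the misaligned token
    if policy == "warn":
        logger.warning(message)  # report the problem, then carry on
        return True
    if policy == "fail":
        raise ValueError(message)  # assumed exception type
    raise ValueError(f"Unknown misalignment_policy: {policy!r}")
```

With "skip" and "warn" the caller keeps going; with "fail" the first misaligned token aborts labeling.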
🧰 Built-in Labelers
1. BIOLabeler
Applies standard BIO tagging:
B-: First token of an entity (e.g. B-PER)
I-: Any later token inside the same entity (e.g. I-PER)
O: Outside any entity
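As a rough illustration of BIO tagging from character offsets, the sketch below uses simplified stand-ins for `Token` and `EntitySpan` (their fields are assumptions, not the library's definitions) and ignores subword handling entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:  # hypothetical stand-in for the library's Token
    text: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


@dataclass
class EntitySpan:  # hypothetical stand-in for the library's EntitySpan
    start: int
    end: int
    label: str


def bio_labels(tokens: List[Token], entities: List[EntitySpan]) -> List[str]:
    """Tag tokens that lie fully inside an entity span with B-/I-, others with O."""
    labels = ["O"] * len(tokens)
    for ent in entities:
        inside = [i for i, tok in enumerate(tokens)
                  if tok.start >= ent.start and tok.end <= ent.end]
        for pos, i in enumerate(inside):
            prefix = "B" if pos == 0 else "I"  # first covered token opens the entity
            labels[i] = f"{prefix}-{ent.label}"
    return labels


# Text: "Barack Obama visited Acme Corp"
tokens = [Token("Barack", 0, 6), Token("Obama", 7, 12), Token("visited", 13, 20),
          Token("Acme", 21, 25), Token("Corp", 26, 30)]
entities = [EntitySpan(0, 12, "PER"), EntitySpan(21, 30, "ORG")]
print(bio_labels(tokens, entities))
# ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG']
```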
2. BILOLabeler
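Uses more expressive BILOU tagging:
B-: First token of a multi-token entity
I-: Inside a multi-token entity
L-: Last token of a multi-token entity
O: Outside
U-: Unit, a single-token entity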
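The relationship between the two schemes can be sketched as a conversion from BIO to BILOU. The helper `bio_to_bilou` is hypothetical and assumes well-formed BIO input; it is not part of the library.

```python
from typing import List


def bio_to_bilou(labels: List[str]) -> List[str]:
    """Convert a well-formed BIO sequence to BILOU."""
    out = []
    for i, label in enumerate(labels):
        if label == "O":
            out.append(label)
            continue
        prefix, _, etype = label.partition("-")
        nxt = labels[i + 1] if i + 1 < len(labels) else "O"
        continues = nxt == f"I-{etype}"  # does the same entity go on?
        if prefix == "B":
            out.append(f"B-{etype}" if continues else f"U-{etype}")
        else:  # prefix == "I"
            out.append(f"I-{etype}" if continues else f"L-{etype}")
    return out


print(bio_to_bilou(["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG"]))
# ['B-PER', 'L-PER', 'O', 'U-LOC', 'O', 'B-ORG', 'I-ORG', 'L-ORG']
```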
3. BinaryLabeler
Simplified format for binary tasks:
ENTITY: Any token part of a span
O: Outside
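The binary format is equivalent to collapsing every entity tag into a single class. A minimal sketch, using a hypothetical helper `bio_to_binary` rather than anything from the library:

```python
from typing import List


def bio_to_binary(labels: List[str]) -> List[str]:
    """Map every non-O tag to ENTITY and leave O untouched."""
    return ["O" if label == "O" else "ENTITY" for label in labels]


print(bio_to_binary(["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]))
# ['ENTITY', 'ENTITY', 'O', 'O', 'ENTITY', 'ENTITY']
```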
🧪 Example
from chisel.labelers.bio import BIOLabeler

# tokens (List[Token]) and entities (List[EntitySpan]) are assumed to be prepared earlier
labeler = BIOLabeler(subword_strategy="first", misalignment_policy="warn")
labels = labeler.label(tokens, entities)
# e.g. ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
⚠️ Subword Behavior
Subword tokenization can fragment entities:
Input text: "Barack Obama"
Tokens: ["Bar", "##ack", "Obama"]
Depending on the subword_strategy:
"first" → ["B-PER", "I-PER", "O"]
"all" → ["B-PER", "I-PER", "I-PER"]
"strict" → labels a token only if that single token covers the full span
🧠 Tips
BIO/BILOU output is compatible with most token classification models.
Use LabelEncoder to convert string labels to integer IDs.
For debugging span alignment, use TokenAlignmentValidator.
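When debugging alignment by hand, a quick check is whether each entity's boundaries coincide with token boundaries. The sketch below is a hypothetical helper working on plain `(start, end)` offsets; it does not reflect the TokenAlignmentValidator API, which is not shown here.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def misaligned_entities(token_spans: List[Span], entity_spans: List[Span]) -> List[Span]:
    """Return entity spans whose start or end does not fall on a token boundary."""
    starts = {start for start, _ in token_spans}
    ends = {end for _, end in token_spans}
    return [(s, e) for s, e in entity_spans if s not in starts or e not in ends]


# Offsets for "Barack Obama" tokenized as ["Bar", "##ack", "Obama"]
token_spans = [(0, 3), (3, 6), (7, 12)]
print(misaligned_entities(token_spans, [(0, 12), (1, 6)]))
# [(1, 6)]
```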
🔢 LabelEncoder
The SimpleLabelEncoder is a lightweight utility for converting between string-based labels (e.g. "B-PER", "O") and integer IDs required by most machine learning frameworks.
Unlike typical encoders, it requires you to pass in the label mapping explicitly at initialization, which makes its behavior predictable and immutable.
🧰 Features
Requires an explicit label_to_id dictionary at initialization.
Converts labels to IDs (encode) and vice versa (decode).
Raises a descriptive error if an unknown label or ID is encountered.
Can be used internally in export pipelines for Hugging Face and PyTorch compatibility.
🧪 Example
from chisel.extraction.labelers.label_encoder import SimpleLabelEncoder

# Define a label-to-id mapping
label_map = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-LOC": 3,
    "I-LOC": 4,
}
encoder = SimpleLabelEncoder(label_to_id=label_map)

# Encode a list of labels
label_ids = encoder.encode(["B-PER", "I-PER", "O"])  # → [1, 2, 0]

# Decode a list of label IDs
decoded = encoder.decode([1, 2, 0])  # → ["B-PER", "I-PER", "O"]
⚠️ Error Handling
If you try to encode or decode unknown values, the encoder raises a clear error:
encoder.encode(["B-ORG"])
# ValueError: Unknown label 'B-ORG' encountered during encoding.
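One way to avoid unknown-label errors is to generate the mapping from the full list of entity types, so that every tag the labeler can emit has an ID. The helper `build_bio_label_map` below is a hypothetical convenience for BIO tags, not a library function.

```python
from typing import Dict, Iterable


def build_bio_label_map(entity_types: Iterable[str]) -> Dict[str, int]:
    """Build a label-to-ID mapping with O first, then B-/I- tags per entity type."""
    label_map = {"O": 0}
    for etype in entity_types:
        for prefix in ("B", "I"):
            label_map[f"{prefix}-{etype}"] = len(label_map)  # next free ID
    return label_map


print(build_bio_label_map(["PER", "LOC"]))
# {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-LOC': 3, 'I-LOC': 4}
```

The resulting dictionary can be passed as the label_to_id argument shown in the example above.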
📦 API Reference
Method | Description |
---|---|
encode(labels) | Convert a list of label strings to integer IDs |
decode(ids) | Convert a list of integer IDs back to label strings |
get_label_to_id() | Return the internal label → ID dictionary |
get_id_to_label() | Return the internal ID → label dictionary |
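To make the four methods concrete, here is a minimal stand-in with the same method names. `ExplicitLabelEncoder` is a hypothetical class written for illustration; the actual SimpleLabelEncoder may differ internally, and the decoding error message is an assumption.

```python
from typing import Dict, List


class ExplicitLabelEncoder:
    """Hypothetical stand-in; not the library's SimpleLabelEncoder."""

    def __init__(self, label_to_id: Dict[str, int]) -> None:
        # Copy the mapping so later changes to the caller's dict have no effect.
        self._label_to_id = dict(label_to_id)
        self._id_to_label = {idx: label for label, idx in self._label_to_id.items()}
        if len(self._id_to_label) != len(self._label_to_id):
            raise ValueError("Each label must map to a unique ID.")

    def encode(self, labels: List[str]) -> List[int]:
        for label in labels:
            if label not in self._label_to_id:
                raise ValueError(f"Unknown label {label!r} encountered during encoding.")
        return [self._label_to_id[label] for label in labels]

    def decode(self, ids: List[int]) -> List[str]:
        for idx in ids:
            if idx not in self._id_to_label:
                # Wording assumed; the source only shows the encoding error.
                raise ValueError(f"Unknown ID {idx!r} encountered during decoding.")
        return [self._id_to_label[idx] for idx in ids]

    def get_label_to_id(self) -> Dict[str, int]:
        return dict(self._label_to_id)  # return a copy to keep internals immutable

    def get_id_to_label(self) -> Dict[int, str]:
        return dict(self._id_to_label)


encoder = ExplicitLabelEncoder({"O": 0, "B-PER": 1, "I-PER": 2})
print(encoder.encode(["B-PER", "I-PER", "O"]))  # [1, 2, 0]
print(encoder.decode([1, 2, 0]))                # ['B-PER', 'I-PER', 'O']
```

Returning copies from the getters is one way to keep the mapping effectively immutable after construction.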