Data Models in Chisel
Chisel uses a small set of clearly defined, Pydantic-based data models to standardize the representation of annotated texts, tokenization outputs, and aligned spans. These models form the internal "data language" that parsers, tokenizers, labelers, and exporters operate on.
Token
Represents a single token within a text, along with its character-level span.
class Token(BaseModel):
    id: int
    text: str
    start: int
    end: int
Field | Type | Description |
---|---|---|
id | int | Unique identifier (usually index in token list) |
text | str | Raw token text |
start | int | Character start index in the original text |
end | int | Character end index (exclusive) |
Example:
{
  "id": 0,
  "text": "aspirin",
  "start": 0,
  "end": 7
}
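The offset convention above (inclusive `start`, exclusive `end`) means that slicing the original text with a token's offsets should return the token's text. The following is a minimal sketch of that invariant, not Chisel's own tokenizer: `whitespace_tokenize` is a hypothetical helper, and a plain dataclass stands in for the Pydantic `Token` model so the snippet runs without dependencies.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:  # stand-in that mirrors the fields of the Token model
    id: int
    text: str
    start: int
    end: int


def whitespace_tokenize(text: str) -> list[Token]:
    """Hypothetical helper: split on whitespace and record character offsets."""
    return [
        Token(id=i, text=m.group(), start=m.start(), end=m.end())
        for i, m in enumerate(re.finditer(r"\S+", text))
    ]


text = "aspirin relieves pain"
tokens = whitespace_tokenize(text)

# Because `end` is exclusive, slicing the source text recovers each token.
assert all(text[t.start:t.end] == t.text for t in tokens)
```

Checking this invariant after tokenization is a cheap way to catch offset drift introduced by text normalization.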
EntitySpan
Represents a labeled span of text extracted from annotations, often prior to tokenization.
class EntitySpan(BaseModel):
    text: str
    start: int
    end: int
    label: str
    attributes: dict[str, str] = {}
Field | Type | Description |
---|---|---|
text | str | The extracted text span |
start | int | Start character index |
end | int | End character index (exclusive) |
label | str | The entity label (e.g. "DISEASE", "ORG") |
attributes | dict | Optional metadata associated with the span |
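As an illustration of how these fields fit together, the sketch below builds a span by slicing the source text, so that `text` always agrees with `start` and `end`. It is a sketch under stated assumptions: `make_span` and the `negated` attribute are hypothetical, and a dataclass stands in for the Pydantic model.

```python
from dataclasses import dataclass, field


@dataclass
class EntitySpan:  # stand-in that mirrors the fields of the EntitySpan model
    text: str
    start: int
    end: int
    label: str
    attributes: dict[str, str] = field(default_factory=dict)


def make_span(source: str, start: int, end: int, label: str, **attributes: str) -> EntitySpan:
    """Hypothetical helper: derive the span text from the source offsets."""
    if not 0 <= start < end <= len(source):
        raise ValueError(f"invalid span [{start}, {end}) for text of length {len(source)}")
    return EntitySpan(
        text=source[start:end],
        start=start,
        end=end,
        label=label,
        attributes=dict(attributes),
    )


source = "Aspirin is used to treat pain."
# "negated" is an illustrative attribute key, not one defined by Chisel.
span = make_span(source, 0, 7, "DRUG", negated="false")
```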
TokenEntitySpan
Aligns an EntitySpan to its token-level representation. Used after tokenization and alignment.
class TokenEntitySpan(BaseModel):
    entity: EntitySpan
    token_indices: List[int]
Field | Type | Description |
---|---|---|
entity | EntitySpan | The original span |
token_indices | List[int] | Indices into the Token list that align with this span |
This format is useful for converting between span-level and sequence-label formats like BIO/BILOU.
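One way to picture the span-to-sequence conversion is sketched below. It assumes alignment by character overlap and a plain BIO scheme. Chisel's actual alignment strategy may differ, the `align` and `to_bio` helpers are hypothetical, and dataclasses stand in for the Pydantic models.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:  # stand-in for the Token model
    id: int
    text: str
    start: int
    end: int


@dataclass
class EntitySpan:  # stand-in for the EntitySpan model (attributes omitted)
    text: str
    start: int
    end: int
    label: str


@dataclass
class TokenEntitySpan:  # stand-in for the TokenEntitySpan model
    entity: EntitySpan
    token_indices: list[int]


def align(entity: EntitySpan, tokens: list[Token]) -> TokenEntitySpan:
    """Collect indices of tokens whose [start, end) range overlaps the entity's."""
    indices = [
        i for i, t in enumerate(tokens)
        if t.start < entity.end and t.end > entity.start
    ]
    return TokenEntitySpan(entity=entity, token_indices=indices)


def to_bio(n_tokens: int, spans: list[TokenEntitySpan]) -> list[str]:
    """Emit one BIO tag per token: B- on the first aligned token, I- on the rest."""
    tags = ["O"] * n_tokens
    for span in spans:
        for position, index in enumerate(span.token_indices):
            prefix = "B-" if position == 0 else "I-"
            tags[index] = prefix + span.entity.label
    return tags


text = "Aspirin treats chronic pain."
# Illustrative tokenizer: words and punctuation marks become separate tokens.
tokens = [
    Token(i, m.group(), m.start(), m.end())
    for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text))
]
entities = [
    EntitySpan("Aspirin", 0, 7, "DRUG"),
    EntitySpan("chronic pain", 15, 27, "DISEASE"),
]
aligned = [align(e, tokens) for e in entities]
bio = to_bio(len(tokens), aligned)
print(bio)  # ['B-DRUG', 'O', 'B-DISEASE', 'I-DISEASE', 'O']
```

Keeping `token_indices` alongside the original `EntitySpan` means the character-level span stays available even after the labels have been projected onto tokens.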
ChiselRecord
A central container for all relevant information about a processed text segment. Used throughout pipelines and by all exporters.
class ChiselRecord(BaseModel):
    id: str
    chunk_id: int
    text: str
    tokens: List[Token]
    entities: List[EntitySpan]
    bio_labels: Optional[List[str]] = Field(default=None, alias="bio-labels")
    labels: Optional[List[int]] = None
    input_ids: Optional[List[int]] = None
    attention_mask: Optional[List[int]] = None
Field | Type | Description |
---|---|---|
id | str | Unique document ID |
chunk_id | int | Unique ID for this chunk of the document |
text | str | The original or preprocessed text |
tokens | List[Token] | Tokenized representation of the text |
entities | List[EntitySpan] | Extracted entities in character span format |
bio_labels | Optional[List[str]] | BIO/BILOU labels (one per token) |
labels | Optional[List[int]] | Encoded integer labels |
input_ids | Optional[List[int]] | Tokenizer output for transformer input |
attention_mask | Optional[List[int]] | Attention mask corresponding to input_ids |
Example:
{
  "id": "doc123",
  "chunk_id": 0,
  "text": "Aspirin is used to treat pain.",
  "tokens": [...],
  "entities": [...],
  "bio_labels": ["B-DRUG", "O", "O", "O", "O"],
  "labels": [1, 0, 0, 0, 0],
  "input_ids": [101, 1234, 2003, ...],
  "attention_mask": [1, 1, 1, ...]
}
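The relationship between `bio_labels` and `labels` in the example can be sketched as a lookup through a label-to-ID mapping. The mapping below is an assumption chosen to reproduce the example values, and `encode_labels` and `decode_labels` are hypothetical helpers rather than part of Chisel's API.

```python
def encode_labels(bio_labels: list[str], label2id: dict[str, int]) -> list[int]:
    """Map each string tag to its integer ID; unknown tags raise KeyError."""
    return [label2id[tag] for tag in bio_labels]


def decode_labels(labels: list[int], label2id: dict[str, int]) -> list[str]:
    """Invert the mapping to turn integer IDs back into string tags."""
    id2label = {index: tag for tag, index in label2id.items()}
    return [id2label[index] for index in labels]


# Assumed mapping; a real pipeline would derive it from the label set in use.
label2id = {"O": 0, "B-DRUG": 1, "I-DRUG": 2}

bio_labels = ["B-DRUG", "O", "O", "O", "O"]
labels = encode_labels(bio_labels, label2id)
print(labels)  # [1, 0, 0, 0, 0]

# The encoding is lossless as long as the mapping is one-to-one.
assert decode_labels(labels, label2id) == bio_labels
```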