Getting Started

Welcome to Chisel — a lightweight and extensible Python library for preparing token-level annotation datasets for tasks like Named Entity Recognition (NER), Span Classification, and beyond.

Chisel makes it easy for NLP practitioners to experiment with different models on extraction tasks by standardising preprocessing and validation, so you can swap between models with different data and preprocessing requirements without writing ad-hoc glue code each time.

This guide will walk you through setting up Chisel, understanding the core concepts, and running your first processing pipeline.


🧰 Installation

From the root of a cloned repository, install Chisel in editable mode:

pip install -e .

Ensure you have a compatible Python version (3.8–3.11 recommended) and install dependencies:

pip install -r requirements.txt
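
To verify the installation, a quick import check should succeed (this assumes the package is importable as chisel; adjust the name if your checkout differs):

python -c "import chisel; print(chisel.__name__)"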

🧱 Core Concepts

Chisel follows a modular pipeline architecture with pluggable components:

Component        Role
Parser           Extracts raw entity spans from annotated documents
Tokenizer        Splits raw text into tokens
Chunker          Optionally breaks long documents into chunks that fit a model's token window
SpanAligner      Maps character spans onto tokens (see the sketch after this table)
Labeler          Converts token-level alignments into BIO, BILOU, or binary tags
ParseValidator   Ensures consistency between text and character spans
TokenValidator   Ensures consistency between character spans, tokens, and labels
Formatter        Turns the final processed data into your preferred format
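
To make alignment concrete, here is a minimal sketch of what a SpanAligner does, assuming tokens carry character offsets. The align_span function and the (token, start, end) layout are illustrative assumptions, not Chisel's actual API:

text = "Ada Lovelace wrote programs"
tokens = [("Ada", 0, 3), ("Lovelace", 4, 12), ("wrote", 13, 18), ("programs", 19, 27)]

def align_span(char_start, char_end, token_offsets):
    """Return indices of tokens that overlap the half-open span [char_start, char_end)."""
    return [i for i, (_, s, e) in enumerate(token_offsets) if s < char_end and e > char_start]

print(align_span(0, 12, tokens))  # [0, 1] -> "Ada Lovelace"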

Each component follows a defined Protocol, so you can swap in custom implementations as needed.
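
As an illustration of how a pluggable component might look, below is a hypothetical Labeler protocol together with a BIO implementation. The Labeler name, the label signature, and its arguments are assumptions for this sketch, not Chisel's actual interface:

from typing import List, Protocol

class Labeler(Protocol):
    """Hypothetical component protocol; Chisel's real interface may differ."""
    def label(self, tokens: List[str], entity_token_ids: List[int], entity_type: str) -> List[str]:
        ...

class BIOLabeler:
    """Tags the first entity token B-<type>, later entity tokens I-<type>, everything else O."""
    def label(self, tokens: List[str], entity_token_ids: List[int], entity_type: str) -> List[str]:
        tags = ["O"] * len(tokens)
        for position, token_index in enumerate(entity_token_ids):
            tags[token_index] = ("B-" if position == 0 else "I-") + entity_type
        return tags

labeler: Labeler = BIOLabeler()  # structural typing: no inheritance required
print(labeler.label(["Ada", "Lovelace", "wrote", "programs"], [0, 1], "PER"))
# ['B-PER', 'I-PER', 'O', 'O']

Because the protocol is structural, any object with a matching method can be dropped into a pipeline, which is what makes the components swappable.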

🚀 Examples

For examples of how to use Chisel in practice with common annotation formats such as HTML tags, CoNLL, or JSON, take a look at the notebooks in the example folder.

🚧 Development roadmap (in no particular order)

  • Implement exporters for a broad range of data providers such as Hugging Face Datasets, PyTorch, spaCy, and DVC
  • Implement parsers for common annotation tools such as Doccano and Label Studio
  • Ensure compatibility with modern subword tokenizers (e.g. BPE-based tokenizers)
  • Implement more sophisticated chunking methods for models with smaller token limits (e.g. DistilBERT's 512-token window)
  • Design a principled and declarative way to build pipelines.

📖 What's Next?

  • Learn how each component works
  • Check out real-world examples
  • Explore reference docs