Getting Started¶
Welcome to Chisel — a lightweight and extensible Python library for preparing token-level annotation datasets for tasks like Named Entity Recognition (NER), Span Classification, and beyond.
Chisel aims to make it easy for NLP practitioners to experiment with different models on their extraction tasks by standardising preprocessing and validation, so you can swap in models with different data and preprocessing requirements without writing ad-hoc glue code each time.
This guide will walk you through setting up Chisel, understanding the core concepts, and running your first processing pipeline.
🧰 Installation¶
pip install -e . # From source
Ensure you have a compatible Python version (3.8–3.11 recommended) and install dependencies:
pip install -r requirements.txt
🧱 Core Concepts¶
Chisel follows a modular pipeline architecture with pluggable components:
| Component | Role |
| --- | --- |
| Parser | Extracts raw entity spans from annotated documents |
| Tokenizer | Splits raw text into tokens |
| Chunker | Optionally breaks long documents into chunks to fit token windows |
| SpanAligner | Maps character spans to tokens |
| Labeler | Converts token-level alignment into BIO, BILOU, or binary tags |
| ParseValidator | Ensures consistency between text and character spans |
| TokenValidator | Ensures consistency between character spans, tokens, and labels |
| Formatters | Converts the final processed data into your preferred format |
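To make the flow concrete, here is a minimal, library-free sketch of the transformation the SpanAligner and Labeler steps perform: character-level entity spans are mapped onto tokens and then converted to BIO tags. The whitespace tokenizer and function names below are illustrative assumptions, not Chisel's API.

```python
# Illustrative sketch (not Chisel's API): map character spans to tokens, then to BIO tags.
from typing import List, Tuple


def whitespace_tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, char_start, char_end) triples for a naive whitespace split."""
    tokens, start = [], 0
    for token in text.split():
        start = text.index(token, start)
        end = start + len(token)
        tokens.append((token, start, end))
        start = end
    return tokens


def bio_labels(tokens, spans):
    """Assign B-/I-/O tags to tokens based on the character spans they fall inside."""
    labels = []
    for _, tok_start, tok_end in tokens:
        label = "O"
        for span_start, span_end, span_label in spans:
            if tok_start >= span_start and tok_end <= span_end:
                label = ("B-" if tok_start == span_start else "I-") + span_label
                break
        labels.append(label)
    return labels


text = "Ada Lovelace worked in London"
spans = [(0, 12, "PER"), (23, 29, "LOC")]  # (char_start, char_end, label)

tokens = whitespace_tokenize(text)
print([t for t, _, _ in tokens])  # ['Ada', 'Lovelace', 'worked', 'in', 'London']
print(bio_labels(tokens, spans))  # ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']
```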
Each component follows a defined Protocol, so you can swap in custom implementations as needed.
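As a sketch of what swapping a component might look like, the snippet below defines a tokenizer-style `typing.Protocol` and a custom implementation that satisfies it. The protocol name and method signature here are assumptions made for illustration; check the reference docs for Chisel's actual interfaces.

```python
# Hypothetical example: the protocol and signature below are assumptions,
# not Chisel's actual interface.
from typing import List, Protocol, Tuple


class TokenizerProtocol(Protocol):
    def tokenize(self, text: str) -> List[Tuple[str, int, int]]:
        """Return (token, char_start, char_end) triples."""
        ...


class CommaAwareTokenizer:
    """A custom tokenizer that also splits off trailing commas."""

    def tokenize(self, text: str) -> List[Tuple[str, int, int]]:
        tokens, start = [], 0
        for raw in text.split():
            start = text.index(raw, start)
            pieces = [raw[:-1], ","] if raw.endswith(",") else [raw]
            offset = start
            for piece in pieces:
                if piece:
                    tokens.append((piece, offset, offset + len(piece)))
                    offset += len(piece)
            start += len(raw)
        return tokens


# Any object with a matching `tokenize` method satisfies the protocol,
# so it can be dropped into a pipeline wherever a tokenizer is expected.
tokenizer: TokenizerProtocol = CommaAwareTokenizer()
print(tokenizer.tokenize("London, United Kingdom"))
# [('London', 0, 6), (',', 6, 7), ('United', 8, 14), ('Kingdom', 15, 22)]
```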
🚀 Examples¶
For examples of how to use Chisel in practice with common annotation formats such as HTML tags, CoNLL, or JSON, take a look at the notebooks in the example folder. The snippet below illustrates what these input formats typically look like.
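The same annotated sentence can arrive as inline HTML-style tags, CoNLL-style token/tag rows, or a JSON record with character offsets. The field names in the JSON example are illustrative, not a schema Chisel prescribes.

```python
# Three common ways the same annotation arrives (field names are illustrative):

# 1. Inline HTML-style tags
html_annotated = "<per>Ada Lovelace</per> worked in <loc>London</loc>"

# 2. CoNLL-style token/tag pairs, one token per line
conll_annotated = """Ada B-PER
Lovelace I-PER
worked O
in O
London B-LOC"""

# 3. JSON with character-offset spans
json_annotated = {
    "text": "Ada Lovelace worked in London",
    "spans": [
        {"start": 0, "end": 12, "label": "PER"},
        {"start": 23, "end": 29, "label": "LOC"},
    ],
}
```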
🚧 Development roadmap (in no particular order)¶
- Implement exporters for a broad range of data providers such as Hugging Face Datasets, PyTorch datasets, spaCy, and DVC
- Implement parsers for common annotation tools such as Doccano, Label Studio, etc.
- Ensure compatibility with modern subword tokenizers (e.g., BPE).
- Implement more sophisticated chunking methods for models with smaller token limits (e.g., DistilBERT's 512-token window)
- Design a principled and declarative way to build pipelines.
📖 What's Next?¶
- Learn how each component works
- Check out real-world examples
- Explore reference docs