Skip to content

🔄 Formatters

Formatters in Chisel convert ChiselRecord objects into the final data formats expected by downstream NLP model training libraries such as PyTorch and HuggingFace 🤗 Datasets.

They provide a clean separation between data processing and model consumption, making it easy to switch between libraries or pipelines.

✅ Supported Formatters

TorchDatasetFormatter

Converts a list of ChiselRecord instances into a PyTorch-compatible dataset.

Output: List[Dict[str, torch.Tensor]]

Fields Included:

  • input_ids

  • attention_mask

  • labels

Each record is represented as a dictionary where values are PyTorch tensors, ready to be wrapped in a DataLoader.

from chisel.extraction.formatters.torch_formatter import TorchDatasetFormatter

formatter = TorchDatasetFormatter()
torch_data = formatter.format(chisel_records)

HFDatasetFormatter

Converts a list of ChiselRecord instances into a 🤗 HuggingFace Dataset.

Output: datasets.Dataset object

Fields Included: - id - chunk_id - text - input_ids - attention_mask - labels - Optionally: bio_labels if present

Usage:

from chisel.extraction.formatters.hf_formatter import HFDatasetFormatter

formatter = HFDatasetFormatter()
hf_dataset = formatter.format(chisel_records)

🧩 Why Formatters?

Machine learning frameworks expect specific formats — not domain-rich objects like ChiselRecord. Formatters handle this final transformation step, letting you:

  • Stay library-agnostic during preprocessing.

  • Plug in different downstream toolkits easily.

  • Avoid writing custom conversion logic repeatedly.