Skip to content

🧩 Parsers

Parsers are the entry point to the Chisel pipeline. They take raw annotated documents and extract the cleaned text and a list of entity spans (EntitySpan) with their character offsets.

Each parser implements the following protocol:

class Parser(Protocol):
    def parse(self, doc: str) -> tuple[str, List[EntitySpan]]:
        ...

🧱 Available Parsers

1. HTMLTagParser

Parses entity spans from non-standard HTML/XML tags like , , or custom tags like .

🧪 Example Input

The <ORG>UN</ORG> met with <PER>Joe Biden</PER> today.

✅ Output

Cleaned Text: "The UN met with Joe Biden today."

EntitySpan(label="ORG", start=4, end=6, text="UN")

EntitySpan(label="PER", start=16, end=25, text="Joe Biden")

🔧 Parameters

Name Description
label_strategy "tag" to use the tag name as label (e.g., PER) or "attribute" to extract a specific HTML attribute as the label
attribute_name Used if label_strategy="attribute" — specifies which attribute to use as label
allow_nested If True, allows nested tags and creates spans for each. If False, only outermost span is retained

2. JSONSpanParser

To be written. There is an example implementation in the examples folder on git.

3. ConllParser

To be written. There is an example implementation in the examples folder on git.

🧠 Notes

Parsers are intentionally minimal and decoupled from tokenization.

You can easily extend Chisel by writing your own parser, e.g., for PDFs, docx files, or domain-specific formats.

All parsers return spans using start/end character offsets on the cleaned version of the text.

âž• Custom Parsers

To implement your own parser, simply conform to the protocol:

class MyCustomParser:
    def parse(self, doc: str) -> Tuple[str, List[EntitySpan]]:
        # 1. Extract raw text
        # 2. Identify annotated spans with start, end, label
        # 3. Return cleaned text and EntitySpan list
        ...