metadata
license: cc-by-4.0
datasets:
  - wikiann
language:
  - pl
pipeline_tag: token-classification
widget:
  - text: >-
      Nazywam się Grzegorz Brzęszczyszczykiewicz, pochodzę z
      Chrząszczyżewoszczyc, pracuję w Łękołodzkim Urzędzie Powiatowym
  - text: Jestem Krzysiek i pracuję w Ministerstwie Sportu
  - text: Na imię jej Wiktoria, pracuje w Krakowie na AGH
model-index:
  - name: herbert-base-ner
    results:
      - task:
          name: Token Classification
          type: token-classification
        dataset:
          name: wikiann
          type: wikiann
          config: pl
          split: test
          args: pl
        metrics:
          - name: Precision
            type: precision
            value: 0.8857142857142857
          - name: Recall
            type: recall
            value: 0.9070532179048386
          - name: F1
            type: f1
            value: 0.896256755412619
          - name: Accuracy
            type: accuracy
            value: 0.9581463871961428

herbert-base-ner

Model description

herbert-base-ner is a fine-tuned HerBERT model for Named Entity Recognition (NER). It is trained to recognize three types of entities: person (PER), location (LOC), and organization (ORG).

Specifically, this model is the allegro/herbert-base-cased checkpoint fine-tuned on the Polish subset of the wikiann dataset.
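Fine-tuning a subword model on word-level NER tags usually requires mapping each word's tag onto the tokens produced by the tokenizer. The sketch below shows one common convention (label only the first subword of each word and mask the rest with -100); it is an illustration, not necessarily the procedure used to train this model. The function name `align_labels_with_tokens` and the sample inputs are hypothetical. The `word_ids` list has the shape returned by a fast tokenizer's `word_ids()` method when called with `is_split_into_words=True`.

```python
from typing import List, Optional

# Label value conventionally ignored by PyTorch's cross-entropy loss
IGNORE_INDEX = -100


def align_labels_with_tokens(
    word_labels: List[int], word_ids: List[Optional[int]]
) -> List[int]:
    """Map word-level label ids onto subword tokens.

    Only the first subword of each word keeps the label; special tokens
    (word id None) and continuation subwords are masked with IGNORE_INDEX.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous_word = word_id
    return aligned


# Hypothetical input: three words, the second split into 2 subwords and
# the third into 3, wrapped in two special tokens.
word_labels = [0, 1, 2]
word_ids = [None, 0, 1, 1, 2, 2, 2, None]
aligned = align_labels_with_tokens(word_labels, word_ids)
print(aligned)
```

Masking continuation subwords means each word contributes exactly once to the loss, whatever its length in tokens. A side effect is that predictions on continuation subwords are not directly supervised.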

How to use

You can use this model with the Transformers pipeline for NER:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
model_checkpoint = "pietruszkowiec/herbert-base-ner"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

# Build a token-classification ("ner") pipeline around the model
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Nazywam się Grzegorz Brzęszczyszczykiewicz, pochodzę "\
    "z Chrząszczyżewoszczyc, pracuję w Łękołodzkim Urzędzie Powiatowym"

# Each result is a dict describing one token predicted as part of an entity
ner_results = nlp(example)
print(ner_results)
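By default the pipeline returns one dict per token, so an entity that spans several subwords or words appears as several entries. Passing `aggregation_strategy="simple"` to `pipeline(...)` is the built-in way to group them. The sketch below shows the same idea by hand, assuming each entry carries `entity`, `index`, `start`, and `end` keys and that a fast tokenizer fills in the character offsets. The helper `group_entities` and the token list are hypothetical, hand-written for illustration; they are not actual output of this model.

```python
from typing import Dict, List


def group_entities(text: str, tokens: List[Dict]) -> List[Dict]:
    """Merge token-level predictions into entity spans.

    A token extends the previous span when it is the next token in the
    sequence, has the same entity type, and is either tagged I-* or
    starts exactly where the previous token ended (same word).
    """
    spans: List[Dict] = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        last = spans[-1] if spans else None
        continues = (
            last is not None
            and last["label"] == label
            and tok["index"] == last["last_index"] + 1
            and (prefix == "I" or tok["start"] == last["end"])
        )
        if continues:
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            spans.append(
                {
                    "label": label,
                    "start": tok["start"],
                    "end": tok["end"],
                    "last_index": tok["index"],
                }
            )
    # Recover the surface string of each span from the character offsets
    return [{"label": s["label"], "text": text[s["start"]:s["end"]]} for s in spans]


text = "Jestem Krzysiek i pracuję w Ministerstwie Sportu"

# Hand-written tokens in the pipeline's output shape (NOT real model output);
# only the keys read by group_entities are included.
sample_tokens = [
    {"entity": "B-PER", "index": 2, "start": 7, "end": 11},
    {"entity": "I-PER", "index": 3, "start": 11, "end": 15},
    {"entity": "B-ORG", "index": 7, "start": 28, "end": 41},
    {"entity": "I-ORG", "index": 8, "start": 42, "end": 48},
]

entities = group_entities(text, sample_tokens)
print(entities)
```

Slicing the original text by offsets avoids having to reassemble tokenizer-specific subword strings.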

BibTeX entry and citation info

@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert  and
      Rybak, Piotr  and
      Wr{\'o}blewska, Alina  and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kiyv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}
@inproceedings{pan-etal-2017-cross,
    title = "Cross-lingual Name Tagging and Linking for 282 Languages",
    author = "Pan, Xiaoman  and
      Zhang, Boliang  and
      May, Jonathan  and
      Nothman, Joel  and
      Knight, Kevin  and
      Ji, Heng",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-1178",
    doi = "10.18653/v1/P17-1178",
    pages = "1946--1958",
    abstract = "The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by performing a series of new KB mining methods: generating {``}silver-standard{''} annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and non-Wikipedia data.",
}