Gravitee BERT PII (Personally Identifiable Information extraction)

This application uses the gravitee-io/bert-small-pii-detection model for Named Entity Recognition (NER) to detect personally identifiable information. The model uses token classification with BIO tagging to identify predefined entity types including names, addresses, financial information, and more.
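To make the BIO scheme concrete: each token is tagged B-TYPE when it begins an entity, I-TYPE when it continues one, and O when it is outside any entity. The sketch below merges such tags back into entities. The helper name group_bio and the hand-written tags are illustrative assumptions, not output from the model, and the choice of LOCATION for a street address is a guess at how the model might label it.

```python
def group_bio(tokens, tags):
    """Merge word-level BIO tags into (entity_type, text) pairs."""
    entities = []
    current_type, current_words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        # An I- tag of the same type extends the entity that is currently open.
        if prefix == "I" and etype == current_type:
            current_words.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):
            # B- starts a new entity; a stray I- is treated leniently as a start.
            current_type, current_words = etype, [token]
        else:
            current_type, current_words = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities


# Hand-written tags for illustration only (not real model predictions).
tokens = ["John", "Doe", "lives", "at", "123", "Main", "St"]
tags = ["B-PERSON", "I-PERSON", "O", "O", "B-LOCATION", "I-LOCATION", "I-LOCATION"]
print(group_bio(tokens, tags))
# [('PERSON', 'John Doe'), ('LOCATION', '123 Main St')]
```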

The model can detect the following entity types:

Personal Information:

  • PERSON (names)
  • AGE
  • PHONE_NUMBER
  • EMAIL_ADDRESS

Location & Address:

  • LOCATION
  • COORDINATE

Financial:

  • CREDIT_CARD
  • IBAN_CODE
  • FINANCIAL
  • US_BANK_NUMBER

Government IDs:

  • US_SSN (Social Security Number)
  • US_DRIVER_LICENSE
  • US_PASSPORT
  • US_ITIN
  • US_LICENSE_PLATE
  • NRP (National Registration Number)

Technical:

  • IP_ADDRESS
  • MAC_ADDRESS
  • URL
  • IMEI
  • PASSWORD

Other:

  • DATE_TIME
  • ORGANIZATION
  • TITLE

Installation

To use this model, install the required dependencies:

pip install transformers "optimum[onnxruntime]" torch

Usage

Load the model using the Optimum library for ONNX Runtime:

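from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer

# Load the tokenizer and the ONNX export of the model from the Hugging Face Hub
model_path = "gravitee-io/bert-small-pii-detection"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = ORTModelForTokenClassification.from_pretrained(model_path, file_name="model.onnx")

text = "John Doe lives at 123 Main St and his email is john@example.com"
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
# Offsets map tokens back to character positions; they are not a model input
offsets = inputs.pop("offset_mapping")
outputs = model(**inputs)  # outputs.logits holds one score per BIO label for each token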
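The snippet above stops at the raw model outputs. As a rough sketch of one way to turn per-token logits into character-level entity spans, the following pure-Python helper applies a softmax, picks the best label per token, and merges B-/I- tags using the tokenizer offsets. The function name decode_entities, the threshold parameter, and the toy logits and label map are assumptions for illustration; they are not part of the model's API.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def decode_entities(text, logits, offsets, id2label, threshold=0.5):
    """Turn per-token logits into character-level entity spans."""
    entities, current = [], None
    for row, (start, end) in zip(logits, offsets):
        if start == end:  # special tokens such as [CLS]/[SEP] have empty offsets
            continue
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        label, score = id2label[best], probs[best]
        prefix, _, etype = label.partition("-")
        if label == "O" or score < threshold:
            # Outside any entity, or not confident enough: close the open span.
            if current:
                entities.append(current)
                current = None
        elif prefix == "I" and current and current["type"] == etype:
            # Continuation: extend the span and keep the weakest token score.
            current["end"] = end
            current["score"] = min(current["score"], score)
        else:
            # B- tag (or a stray I- tag, handled leniently) opens a new span.
            if current:
                entities.append(current)
            current = {"type": etype, "start": start, "end": end, "score": score}
    if current:
        entities.append(current)
    for entity in entities:
        entity["text"] = text[entity["start"]:entity["end"]]
    return entities


# Toy inputs for illustration only (hypothetical label map, hand-made logits).
demo_text = "John Doe lives in Paris"
demo_id2label = {0: "O", 1: "B-PERSON", 2: "I-PERSON", 3: "B-LOCATION"}
demo_offsets = [(0, 0), (0, 4), (5, 8), (9, 14), (15, 17), (18, 23), (0, 0)]
demo_logits = [
    [5, 0, 0, 0], [0, 6, 0, 0], [0, 0, 6, 0], [6, 0, 0, 0],
    [6, 0, 0, 0], [0, 0, 0, 6], [5, 0, 0, 0],
]
found = decode_entities(demo_text, demo_logits, demo_offsets, demo_id2label)
print([(e["type"], e["text"]) for e in found])
# [('PERSON', 'John Doe'), ('LOCATION', 'Paris')]
```

With the real model, the logits would presumably come from outputs.logits[0].tolist(), the label map from model.config.id2label, and the offsets from the tokenizer's offset_mapping for the first sequence. Raising the threshold trades recall for precision.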