Classifier

Purpose

The classifier resource scores pairs of records and classifies each pair as a match (duplicate), a non-match, or a candidate for manual review.

Configuration

Comparison Configuration

JSON
{
  "type": "classifier",
  "id": "customer_classifier",
  "dataset": "customer_dataset",
  "thresholds": {
    "matchThreshold": 0.85,
    "nonMatchThreshold": 0.40
  },
  "weights": [
    {
      "column": "email",
      "weight": 2.0,
      "comparer": "EXACT"
    },
    {
      "column": "full_name",
      "weight": 1.5,
      "comparer": "LEVENSHTEIN"
    },
    {
      "column": "phone",
      "weight": 1.0,
      "comparer": "EXACT"
    }
  ]
}
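The page does not say how the per-column weights are combined into a single score. As a minimal sketch, assuming the score is a weighted average of per-column similarities (each in the range 0.0 to 1.0), the configuration above could be evaluated as follows. The function name `combined_score` and the sample similarity values are hypothetical and are not part of the product.

```python
# Weights copied from the configuration above.
WEIGHTS = [
    {"column": "email", "weight": 2.0, "comparer": "EXACT"},
    {"column": "full_name", "weight": 1.5, "comparer": "LEVENSHTEIN"},
    {"column": "phone", "weight": 1.0, "comparer": "EXACT"},
]


def combined_score(weights, similarities):
    """Weighted average of per-column similarities.

    ASSUMPTION: the real classifier may combine weights differently;
    this formula is only an illustration.
    """
    total_weight = sum(w["weight"] for w in weights)
    weighted_sum = sum(w["weight"] * similarities[w["column"]] for w in weights)
    return weighted_sum / total_weight


# Hypothetical similarities for one record pair:
# same email, similar name, different phone.
similarities = {"email": 1.0, "full_name": 0.8, "phone": 0.0}
score = combined_score(WEIGHTS, similarities)
print(round(score, 3))  # (2.0*1.0 + 1.5*0.8 + 1.0*0.0) / 4.5
```

Under this assumption the pair scores about 0.711, which lies between the configured `nonMatchThreshold` (0.40) and `matchThreshold` (0.85).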

Comparison Algorithms

| Comparer | Type | Description |
| --- | --- | --- |
| EXACT | String | Exact string match (1.0 or 0.0) |
| LEVENSHTEIN | String | Edit distance similarity |
| JARO_WINKLER | String | Character-based similarity (name matching) |
| NUMBER | Numeric | Numeric value comparison |
| DATE | Temporal | Date/time similarity |
| GEOGRAPHIC | Spatial | Distance-based matching |
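To make the first two comparers concrete, here is a minimal sketch of how EXACT and LEVENSHTEIN similarities could be computed. The normalization used for Levenshtein (one minus the edit distance divided by the longer string's length) is a common convention and an assumption here; the page does not document the exact formula the product uses. All function names are hypothetical.

```python
def exact(a: str, b: str) -> float:
    """EXACT: 1.0 if the strings are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """ASSUMED normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(exact("a@example.com", "a@example.com"))
print(levenshtein_distance("kitten", "sitting"))
print(round(levenshtein_similarity("kitten", "sitting"), 3))
```

The classic pair "kitten" and "sitting" has an edit distance of 3, so this normalization gives a similarity of 1 - 3/7.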

Threshold Tuning

Match Threshold (e.g., 0.85):

  • Scores ≥ 0.85 → MATCH

  • Higher = fewer false positives (false merges), but more true duplicates miss automatic matching and fall into REVIEW

Non-Match Threshold (e.g., 0.40):

  • Scores ≤ 0.40 → NON_MATCH

  • Lower = fewer false negatives (duplicates rejected outright), but more pairs sent to REVIEW

Review Range:

  • Scores strictly between the two thresholds (e.g., 0.40 < score < 0.85) → REVIEW

  • Requires manual review

Start with conservative thresholds (match: 0.90, non-match: 0.30) and adjust based on results.
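The three-way decision described in this section can be sketched as a small function. The boundary handling follows the rules above (≥ for MATCH, ≤ for NON_MATCH); the function name `classify` and its default arguments are hypothetical and only mirror the example values on this page.

```python
def classify(score: float,
             match_threshold: float = 0.85,
             non_match_threshold: float = 0.40) -> str:
    """Map a similarity score to MATCH, NON_MATCH, or REVIEW."""
    if non_match_threshold >= match_threshold:
        # Illustrative sanity check; the product may validate differently.
        raise ValueError("non_match_threshold must be below match_threshold")
    if score >= match_threshold:
        return "MATCH"
    if score <= non_match_threshold:
        return "NON_MATCH"
    return "REVIEW"  # strictly between the two thresholds


# Example thresholds from the configuration (0.85 / 0.40).
print(classify(0.87))  # MATCH
print(classify(0.60))  # REVIEW
print(classify(0.40))  # NON_MATCH

# Conservative starting point (0.90 / 0.30): the same 0.87 now needs review.
print(classify(0.87, match_threshold=0.90, non_match_threshold=0.30))  # REVIEW
```

Widening the gap between the thresholds trades automatic decisions for manual review, which is why the conservative starting values produce a larger review range.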

Best Practices

  • Weight important fields: strong identifiers such as email and phone should carry high weight

  • Choose appropriate comparers: use LEVENSHTEIN or JARO_WINKLER for names, EXACT for IDs

  • Use conservative thresholds: start with a high match threshold to avoid false merges

  • Monitor scores: review classification details regularly
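One way to act on the monitoring advice is to take a sample of pairs that a person has already reviewed and count how a given threshold pair would have handled them. The sketch below assumes you can export (score, is_duplicate) pairs; the sample data, the function name `threshold_report`, and the report keys are all hypothetical.

```python
def threshold_report(samples, match_threshold, non_match_threshold):
    """Count outcomes for manually labeled (score, is_duplicate) samples."""
    false_merges = sum(1 for score, dup in samples
                       if score >= match_threshold and not dup)
    missed_duplicates = sum(1 for score, dup in samples
                            if score <= non_match_threshold and dup)
    review = sum(1 for score, _ in samples
                 if non_match_threshold < score < match_threshold)
    return {
        "false_merges": false_merges,            # auto-matched, not duplicates
        "missed_duplicates": missed_duplicates,  # auto-rejected, were duplicates
        "review": review,                        # left for manual review
    }


# Hypothetical labeled sample: (score, reviewer says it is a duplicate).
samples = [
    (0.95, True), (0.88, False), (0.70, True),
    (0.50, False), (0.35, True), (0.10, False),
]

print(threshold_report(samples, 0.85, 0.40))
print(threshold_report(samples, 0.90, 0.30))
```

On this made-up sample, the 0.85 / 0.40 thresholds produce one false merge and one missed duplicate with two pairs in review, while the conservative 0.90 / 0.30 thresholds produce no automatic errors but leave four pairs for review.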
