Classifier

Purpose

The classifier resource scores pairs of records and labels each pair as a match (duplicate), a non-match, or a candidate for manual review.

Configuration

Comparison Configuration

JSON

{
  "type": "classifier",
  "id": "customer_classifier",
  "dataset": "customer_dataset",
  "thresholds": {
    "matchThreshold": 0.85,
    "nonMatchThreshold": 0.40
  },
  "weights": [
    {
      "column": "email",
      "weight": 2.0,
      "comparer": "EXACT"
    },
    {
      "column": "full_name",
      "weight": 1.5,
      "comparer": "LEVENSHTEIN"
    },
    {
      "column": "phone",
      "weight": 1.0,
      "comparer": "EXACT"
    }
  ]
}
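The configuration does not state how the per-column similarities are combined into a single score. A minimal sketch, assuming a weighted average (which keeps the result in the 0.0 to 1.0 range used by the thresholds), might look like the following. The function name `combined_score` and the sample similarity values are hypothetical and are not part of the product.

```python
def combined_score(similarities, weights):
    """Weighted average of per-column similarities, each expected in [0.0, 1.0].

    Assumption: the classifier normalizes by the sum of the weights.
    The source configuration does not confirm this.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(weights[col] * similarities[col] for col in weights)
    return weighted / total_weight


# Weights copied from the example configuration.
WEIGHTS = {"email": 2.0, "full_name": 1.5, "phone": 1.0}

# Hypothetical similarities for one record pair:
# identical email, similar name, different phone.
example = {"email": 1.0, "full_name": 0.8, "phone": 0.0}

score = combined_score(example, WEIGHTS)
print(round(score, 3))
```

Under this assumption the pair scores about 0.71, which lies between the example thresholds of 0.40 and 0.85.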

Comparison Algorithms

Comparer	Type	Description
EXACT	String	Exact string match (1.0 or 0.0)
LEVENSHTEIN	String	Edit distance similarity
JARO_WINKLER	String	Character-based similarity (name matching)
NUMBER	Numeric	Numeric value comparison
DATE	Temporal	Date/time similarity
GEOGRAPHIC	Spatial	Distance-based matching
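To make the first two comparers concrete, here is an illustrative sketch of an exact comparer and an edit-distance similarity. Normalizing the Levenshtein distance by the length of the longer string is an assumption; the table only says "edit distance similarity", and the product's actual normalization may differ. All function names are hypothetical.

```python
def exact_similarity(a: str, b: str) -> float:
    """Return 1.0 when the strings are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map edit distance to [0.0, 1.0]; assumed normalization by max length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(exact_similarity("a@example.com", "a@example.com"))
print(levenshtein_distance("kitten", "sitting"))
```

With this normalization, a single-character typo in a long name still scores close to 1.0, while EXACT drops straight to 0.0. That is the practical reason to prefer a fuzzy comparer for free-text fields.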

Threshold Tuning

Match Threshold (e.g., 0.85):

Scores ≥ 0.85 → MATCH
Higher = fewer false merges (false positives), but more true duplicates fall into the review range

Non-Match Threshold (e.g., 0.40):

Scores ≤ 0.40 → NON_MATCH
Lower = fewer duplicates wrongly discarded (false negatives), but more pairs fall into the review range

Review Range:

Scores strictly between the two thresholds (0.40 < score < 0.85) → REVIEW
These pairs require manual review

Start with conservative thresholds (match: 0.90, non-match: 0.30) and adjust based on results.
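The three outcomes described in this section can be sketched as a small decision function. The name `classify` and its defaults (taken from the example thresholds 0.85 and 0.40) are illustrative only; the boundary handling follows the ≥ and ≤ rules stated above.

```python
def classify(score: float,
             match_threshold: float = 0.85,
             non_match_threshold: float = 0.40) -> str:
    """Map a similarity score to MATCH, NON_MATCH, or REVIEW."""
    if non_match_threshold >= match_threshold:
        raise ValueError("nonMatchThreshold must be lower than matchThreshold")
    if score >= match_threshold:
        return "MATCH"
    if score <= non_match_threshold:
        return "NON_MATCH"
    return "REVIEW"  # strictly between the two thresholds


for s in (0.92, 0.71, 0.25):
    print(s, classify(s))

# The conservative starting point widens the review range:
# 0.87 is a MATCH at the example thresholds but needs review at 0.90 / 0.30.
print(classify(0.87), classify(0.87, match_threshold=0.90, non_match_threshold=0.30))
```

Widening the gap between the thresholds trades automatic decisions for manual review volume, so it is worth tracking how many pairs land in REVIEW as the thresholds are adjusted.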

Best Practices

Weight important fields - Give strongly identifying fields such as email and phone a high weight
Appropriate comparers - Use LEVENSHTEIN or JARO_WINKLER for names, EXACT for IDs
Conservative thresholds - Start with a high match threshold to avoid false merges
Monitor scores - Review classification details regularly and adjust weights and thresholds accordingly