Classifier
Purpose
The classifier resource compares pairs of records column by column and classifies each pair as a match, a non-match, or a candidate for manual review.
Configuration
Comparison Configuration
{
  "type": "classifier",
  "id": "customer_classifier",
  "dataset": "customer_dataset",
  "thresholds": {
    "matchThreshold": 0.85,
    "nonMatchThreshold": 0.40
  },
  "weights": [
    {
      "column": "email",
      "weight": 2.0,
      "comparer": "EXACT"
    },
    {
      "column": "full_name",
      "weight": 1.5,
      "comparer": "LEVENSHTEIN"
    },
    {
      "column": "phone",
      "weight": 1.0,
      "comparer": "EXACT"
    }
  ]
}
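The configuration does not state how per-column results are combined into a single score. As a minimal sketch, assuming the classifier takes a weighted average of per-column similarities (an assumption, not confirmed behavior), the weights above could be applied as follows. The function name `combined_score` and the sample similarity values are hypothetical.

```python
# Weights copied from the example configuration above.
WEIGHTS = {"email": 2.0, "full_name": 1.5, "phone": 1.0}


def combined_score(similarities, weights=WEIGHTS):
    """Weighted average of per-column similarities in [0, 1].

    ASSUMPTION: the real classifier may combine columns differently;
    a normalized weighted average is used here only for illustration.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Columns missing from `similarities` contribute 0.0.
    weighted = sum(w * similarities.get(col, 0.0) for col, w in weights.items())
    return weighted / total_weight


# Hypothetical comparer outputs for one pair of records:
# identical email, similar name, different phone.
example = {"email": 1.0, "full_name": 0.8, "phone": 0.0}
score = combined_score(example)
print(round(score, 3))  # 0.711
```

Under this assumption, a pair with a matching email but a different phone number scores about 0.71, which lies between the two configured thresholds.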
Comparison Algorithms
| Comparer | Type | Description |
|---|---|---|
| EXACT | String | Exact string match (1.0 or 0.0) |
| LEVENSHTEIN | String | Edit distance similarity |
| JARO_WINKLER | String | Character-based similarity (name matching) |
| NUMBER | Numeric | Numeric value comparison |
| DATE | Temporal | Date/time similarity |
| GEOGRAPHIC | Spatial | Distance-based matching |
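To illustrate what the string comparers might compute, here is a rough sketch of `EXACT` and `LEVENSHTEIN` as similarity functions that return values in [0, 1]. Normalizing the edit distance by the length of the longer string is an assumption; the product's actual normalization is not documented here, and the function names are hypothetical.

```python
def exact_similarity(a: str, b: str) -> float:
    """1.0 when the strings are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """ASSUMPTION: similarity = 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(exact_similarity("a@example.com", "a@example.com"))  # 1.0
print(levenshtein_similarity("Jon Smith", "John Smith"))   # 0.9
```

A graded comparer such as this tolerates small typos in names, whereas `EXACT` treats any difference as a complete mismatch.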
Threshold Tuning
Match Threshold (e.g., 0.85):
Scores ≥ 0.85 → MATCH
Higher = fewer false positives, more false negatives
Non-Match Threshold (e.g., 0.40):
Scores ≤ 0.40 → NON_MATCH
Lower = fewer false negatives, more pairs sent to review
Review Range:
Scores between thresholds → REVIEW
Requires manual review
Start with conservative thresholds (match: 0.90, non-match: 0.30) and adjust based on results.
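The threshold rules above can be sketched as a small function. The labels `MATCH`, `NON_MATCH`, and `REVIEW` come from this page; the function name `classify` and its parameter names are hypothetical, and the defaults mirror the example configuration.

```python
def classify(score: float,
             match_threshold: float = 0.85,
             non_match_threshold: float = 0.40) -> str:
    """Map a similarity score to MATCH, NON_MATCH, or REVIEW."""
    if non_match_threshold >= match_threshold:
        raise ValueError("non-match threshold must be below the match threshold")
    if score >= match_threshold:
        return "MATCH"
    if score <= non_match_threshold:
        return "NON_MATCH"
    return "REVIEW"  # strictly between the two thresholds


for s in (0.92, 0.60, 0.25):
    print(s, classify(s))

# With the conservative starting thresholds, 0.87 falls into the review range.
print(classify(0.87, match_threshold=0.90, non_match_threshold=0.30))
```

Widening the gap between the two thresholds sends more pairs to manual review, which trades reviewer effort for fewer automatic mistakes.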
Best Practices
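Weight important fields - Give strongly identifying columns such as email and phone higher weights
Appropriate comparers - Use LEVENSHTEIN or JARO_WINKLER for names, EXACT for IDs
Conservative thresholds - Start with a high match threshold to avoid false merges
Monitor scores - Review classification details regularly and adjust weights and thresholds based on the results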