Example: Manual Deduplication Workflow
This guide walks you through manually reviewing and resolving duplicate records in Golden. Use this workflow for uncertain matches that require human judgment.
Overview
Item | Value |
|---|---|
Scenario | Review and resolve REVIEW-classified buckets |
Role Required | STEWARD or ADMIN |
Skill Level | Beginner |
Time Per Bucket | 1-5 minutes |
When to Use This Workflow
This workflow applies to entities configured with:
DUPLICATES mode - All buckets require manual review
AUTO_DUPLICATES mode - Only REVIEW buckets need manual attention
Understanding Classifications
Classification | Confidence | Action |
|---|---|---|
DUPLICATES | High (>90%) | Auto-merged in AUTO_DUPLICATES mode, manual review in DUPLICATES mode |
REVIEW | Medium (60-90%) | Always requires manual review |
UNIQUE | Low (<60%) | No duplicates detected |
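The thresholds in the table can be read as a simple mapping from match score to classification. The sketch below is illustrative only: the function name `classify` is hypothetical, the cut-offs are copied from the table, and the handling of scores that fall exactly on 0.60 or 0.90 is an assumption, since the table does not say which side those values belong to. Golden's actual thresholds may be configured differently per entity.

```python
def classify(confidence: float) -> str:
    """Map a match score in [0.0, 1.0] to a classification label.

    Thresholds follow the table above. Treating exactly 0.60 and
    exactly 0.90 as REVIEW is an assumption made for this sketch.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be between 0.0 and 1.0, got {confidence}")
    if confidence > 0.90:
        return "DUPLICATES"
    if confidence >= 0.60:
        return "REVIEW"
    return "UNIQUE"


for score in (0.95, 0.75, 0.30):
    print(score, classify(score))
```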
The Review Process
Find REVIEW Buckets → Examine Records → Make Decision → Track Progress
Step 1: Find Buckets to Review
First, retrieve all buckets that need manual review.
Using the API
# Get all REVIEW buckets for your entity
curl -X 'GET' "${GOLDEN_URL}/api/golden/customers/bucket?classification=REVIEW" \
-H "Authorization: Bearer ${GOLDEN_TOKEN}"
Response Example
{
  "content": [
    {
      "bucketId": "B-10001",
      "classification": "REVIEW",
      "confidence": 0.75,
      "recordCount": 2,
      "created": "2024-01-15T10:30:00Z"
    },
    {
      "bucketId": "B-10002",
      "classification": "REVIEW",
      "confidence": 0.68,
      "recordCount": 3,
      "created": "2024-01-15T10:30:00Z"
    }
  ],
  "page": {
    "totalElements": 150,
    "totalPages": 8
  }
}
What the Fields Mean
Field | Description |
|---|---|
bucketId | Unique identifier for this bucket |
classification | Current classification (REVIEW) |
confidence | Match score (0.0 to 1.0) - higher means more likely duplicates |
recordCount | Number of records in this bucket |
created | When this bucket was created |
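If you script against the list endpoint, it can help to turn the response into typed objects before working with it. The following is a minimal sketch that assumes the response has exactly the shape shown in the example above; the names `BucketSummary` and `parse_bucket_page` are hypothetical and not part of Golden.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketSummary:
    bucket_id: str
    classification: str
    confidence: float
    record_count: int
    created: str  # ISO 8601 timestamp, kept as text


def parse_bucket_page(body: str) -> tuple[list[BucketSummary], int, int]:
    """Return (buckets, totalElements, totalPages) from a list response."""
    payload = json.loads(body)
    buckets = [
        BucketSummary(
            bucket_id=item["bucketId"],
            classification=item["classification"],
            confidence=float(item["confidence"]),
            record_count=int(item["recordCount"]),
            created=item["created"],
        )
        for item in payload["content"]
    ]
    page = payload["page"]
    return buckets, page["totalElements"], page["totalPages"]


# Sample body copied from the response example above (first bucket only).
sample = """
{
  "content": [
    {"bucketId": "B-10001", "classification": "REVIEW", "confidence": 0.75,
     "recordCount": 2, "created": "2024-01-15T10:30:00Z"}
  ],
  "page": {"totalElements": 150, "totalPages": 8}
}
"""
buckets, total_elements, total_pages = parse_bucket_page(sample)
print(buckets[0].bucket_id, total_elements, total_pages)
```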
Prioritizing Your Queue
Consider reviewing buckets in this order:
Highest confidence first - More likely to be actual duplicates
Smallest record count - Easier to review (2 records vs 5)
Oldest first - Clear the backlog systematically
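One way to apply this ordering is a single sort with the three criteria as successive tie-breakers. This is a sketch of one interpretation, not a Golden feature: the function name `prioritize` is hypothetical, bucket B-10003 is invented to show tie-breaking, and comparing `created` as text assumes all timestamps use the same ISO 8601 UTC format as in the response example.

```python
def prioritize(buckets: list[dict]) -> list[dict]:
    """Order buckets: highest confidence, then fewest records, then oldest."""
    return sorted(
        buckets,
        key=lambda b: (-b["confidence"], b["recordCount"], b["created"]),
    )


queue = [
    {"bucketId": "B-10002", "confidence": 0.68, "recordCount": 3,
     "created": "2024-01-15T10:30:00Z"},
    # Hypothetical bucket, added only to show the record-count tie-breaker.
    {"bucketId": "B-10003", "confidence": 0.75, "recordCount": 3,
     "created": "2024-01-15T10:30:00Z"},
    {"bucketId": "B-10001", "confidence": 0.75, "recordCount": 2,
     "created": "2024-01-15T10:30:00Z"},
]

ordered_ids = [b["bucketId"] for b in prioritize(queue)]
print(ordered_ids)
```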
Step 2: Examine Records in a Bucket
Open a specific bucket to see all records and their details.
Using the API
# Get detailed bucket information
curl -X 'GET' "${GOLDEN_URL}/api/golden/customers/bucket/B-10001" \
-H "Authorization: Bearer ${GOLDEN_TOKEN}"
Response Example
{
  "bucketId": "B-10001",
  "classification": "REVIEW",
  "confidence": 0.75,
  "matchReason": "email_similarity: 0.95, name_similarity: 0.55",
  "records": [
    {
      "recordId": "R-5001",
      "source": "CRM",
      "sourceId": "CRM-12345",
      "data": {
        "full_name": "John Smith",
        "email": "[email protected]",
        "phone": "+1-555-0101",
        "city": "Boston"
      },
      "created": "2024-01-10T08:00:00Z",
      "updated": "2024-01-14T15:30:00Z"
    },
    {
      "recordId": "R-5002",
      "source": "E-Commerce",
      "sourceId": "EC-67890",
      "data": {
        "full_name": "J. Smith",
        "email": "[email protected]",
        "phone": "+1-555-0102",
        "city": "Cambridge"
      },
      "created": "2024-01-12T14:00:00Z",
      "updated": "2024-01-12T14:00:00Z"
    }
  ],
  "goldenPreview": {
    "full_name": "John Smith",
    "email": "[email protected]",
    "phone": "+1-555-0101",
    "city": "Boston"
  }
}
What to Look For
Check | Question to Ask |
|---|---|
Email | Do the emails match exactly or closely? |
Name | Could these be the same person with different name formats? |
Phone | Are phone numbers similar or from same area? |
Address | Same or nearby locations? |
Source | Do different sources have different data quality? |
Timing | Does the update timeline make sense? |
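When a bucket has many fields, listing only the ones that differ can focus the review. The sketch below compares the data objects of two records after trimming whitespace and ignoring case. The helper name `differing_fields` is hypothetical, the normalization rule is an assumption rather than Golden's matching logic, and the sample values are taken from the response example above with the email field left out.

```python
def _normalize(value):
    """Trim and case-fold strings so trivial formatting differences are ignored."""
    return value.strip().casefold() if isinstance(value, str) else value


def differing_fields(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for fields whose values differ."""
    diffs = {}
    for field in sorted(set(a) | set(b)):
        left, right = a.get(field), b.get(field)
        if _normalize(left) != _normalize(right):
            diffs[field] = (left, right)
    return diffs


crm = {"full_name": "John Smith", "phone": "+1-555-0101", "city": "Boston"}
shop = {"full_name": "J. Smith", "phone": "+1-555-0102", "city": "Cambridge"}

diffs = differing_fields(crm, shop)
for field, (left, right) in diffs.items():
    print(f"{field}: {left!r} vs {right!r}")
```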
Understanding Match Reasons
The matchReason field explains why records were grouped:
Reason | What It Means |
|---|---|
email_similarity: 0.95 | 95% email match (nearly identical) |
name_similarity: 0.55 | 55% name match (partial, possible abbreviation) |
| Phone numbers match exactly |
| 70% address similarity (may be typos) |
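To work with these scores programmatically, the matchReason string can be split into name and score pairs. This sketch assumes the value is always a comma-separated list of "name: score" entries, as in the sample response; that format is inferred from the one example in this guide, and the function name `parse_match_reason` is hypothetical.

```python
def parse_match_reason(reason: str) -> dict[str, float]:
    """Parse 'name: score, name: score' into a {name: score} dictionary."""
    scores = {}
    for part in reason.split(","):
        name, separator, value = part.partition(":")
        if not separator:
            continue  # skip fragments that are not in 'name: score' form
        scores[name.strip()] = float(value)
    return scores


scores = parse_match_reason("email_similarity: 0.95, name_similarity: 0.55")
weakest = min(scores, key=scores.get)
print(scores, "weakest signal:", weakest)
```

The weakest signal is often the field worth checking first, since it is the main reason the bucket did not reach the DUPLICATES threshold.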
Step 3: Make Your Decision
Based on your review, choose one of three actions.
Decision Guide
Scenario | Decision | Action |
|---|---|---|
Records are clearly the same person/entity | Merge | Combine into one golden record |
Records are clearly different people/entities | Disconnect | Separate into individual buckets |
Not enough information to decide | Skip | Leave for later review |
Option A: Merge (Confirm Duplicates)
Use this when you're confident the records represent the same entity.
curl -X 'PUT' "${GOLDEN_URL}/api/golden/customers/bucket/merge" \
-H "Authorization: Bearer ${GOLDEN_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"bucketId": "B-10001"
}'
What Happens After Merge
Records are combined into a single golden record
Best values are selected based on merger rules
Source records maintain lineage (can trace back)
Bucket is marked as resolved
Merge Response
{
  "success": true,
  "goldenRecordId": "G-3001",
  "message": "Bucket B-10001 merged successfully",
  "recordsMerged": 2
}
Option B: Disconnect (Not Duplicates)
Use this when you determine the records are NOT the same entity.
curl -X 'PUT' "${GOLDEN_URL}/api/golden/customers/bucket/disconnect" \
-H "Authorization: Bearer ${GOLDEN_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"bucketId": "B-10001",
"recordIds": ["R-5001", "R-5002"]
}'
What Happens After Disconnect
Records are separated into individual buckets
Each record becomes its own golden record
System learns from this decision (improves future matching)
Original bucket is marked as resolved
Disconnect Response
{
  "success": true,
  "message": "Records disconnected from bucket B-10001",
  "newBuckets": ["B-10003", "B-10004"]
}
Partial Disconnect
If a bucket has 3+ records and only some are duplicates:
# Disconnect only R-5003 (the non-duplicate)
curl -X 'PUT' "${GOLDEN_URL}/api/golden/customers/bucket/disconnect" \
-H "Authorization: Bearer ${GOLDEN_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"bucketId": "B-10002",
"recordIds": ["R-5003"]
}'
This moves R-5003 into its own bucket while keeping the remaining records together.
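The same disconnect call can be prepared from a script. The sketch below builds the request with Python's standard library and mirrors the curl command above (PUT, bearer token, JSON body). The helper name `build_disconnect_request`, the host `golden.example.com`, and the token value are placeholders; the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_disconnect_request(base_url, token, entity, bucket_id, record_ids):
    """Build (but do not send) a PUT request for the disconnect endpoint."""
    if not record_ids:
        raise ValueError("record_ids must not be empty")
    body = json.dumps(
        {"bucketId": bucket_id, "recordIds": list(record_ids)}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/golden/{entity}/bucket/disconnect",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )


# Placeholder host and token, for illustration only.
request = build_disconnect_request(
    "https://golden.example.com", "TOKEN", "customers", "B-10002", ["R-5003"]
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```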
Option C: Skip (Defer Decision)
If you need more information or want to review later, simply move to the next bucket. The current bucket remains in the REVIEW queue.
When to Skip:
Need to verify information with another team
Data quality is too poor to determine
Complex case requiring additional context
Training purposes - want supervisor review
Step 4: Track Your Progress
Monitor your review progress and the overall entity statistics.
Get Entity Statistics
curl -X 'GET' "${GOLDEN_URL}/api/entity/customers" \
-H "Authorization: Bearer ${GOLDEN_TOKEN}"
Key Statistics
{
  "name": "customers",
  "status": "COMPLETED",
  "statistics": {
    "totalRecords": 50000,
    "totalBuckets": 48500,
    "uniqueBuckets": 45000,
    "duplicateBuckets": 2000,
    "reviewBuckets": 150,
    "goldenRecords": 47000,
    "duplicateRate": 0.06
  }
}
Understanding the Numbers
Metric | Description | Target |
|---|---|---|
totalRecords | All loaded records | |
totalBuckets | Groups after indexing | |
uniqueBuckets | Confirmed non-duplicates | Higher is better |
duplicateBuckets | Confirmed duplicates (merged) | |
reviewBuckets | Awaiting manual review | Lower is better |
goldenRecords | Master records created | |
duplicateRate | % of records that are duplicates | Varies by data quality |
Tracking Review Progress
Calculate your progress:
Review Completion = (initial reviewBuckets - current reviewBuckets) / initial reviewBuckets
Example: Started with 500 REVIEW buckets, now have 150: (500 - 150) / 500 = 70% complete
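The formula can be scripted as follows. This is a minimal sketch: the function name `review_completion` is hypothetical, and the initial count is a number you record yourself when you start, since the statistics response shown above only reports the current reviewBuckets value.

```python
def review_completion(initial_review_buckets: int, current_review_buckets: int) -> float:
    """Fraction of the initial REVIEW queue that has been resolved."""
    if initial_review_buckets <= 0:
        raise ValueError("initial_review_buckets must be positive")
    resolved = initial_review_buckets - current_review_buckets
    return resolved / initial_review_buckets


progress = review_completion(500, 150)
print(f"{progress:.0%} complete")
```

If a synchronization adds new REVIEW buckets after you record the initial count, the result can drop or even turn negative, so re-record the starting value after each sync.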
Best Practices
For Efficient Review
Practice | Benefit |
|---|---|
Review in batches | Maintain consistency, avoid fatigue |
Take regular breaks | Keep accuracy high |
Document unusual cases | Help train other stewards |
Use keyboard shortcuts | Speed up common actions |
For Accurate Decisions
Practice | Benefit |
|---|---|
Check all fields | Don't rely on just one match |
Consider data sources | Some sources more reliable than others |
Look at timestamps | Recent data often more accurate |
When in doubt, skip | Better to defer than to make a wrong decision |
For Quality Improvement
Practice | Benefit |
|---|---|
Track common patterns | Identify system improvements |
Report false positives | Help tune matching rules |
Report missed duplicates | Improve indexer coverage |
Share learnings | Train the team |
Common Scenarios
Scenario 1: Same Email, Different Names
Record 1 | Record 2 |
|---|---|
John Smith | J. Smith |
Decision: MERGE - Same email, name is likely abbreviated
Scenario 2: Same Name, Different Emails
Record 1 | Record 2 |
|---|---|
John Smith | John Smith |
Decision: Need more info - Check phone, address, or other fields. Could be same person with work/personal emails, or different people with same name.
Scenario 3: Similar Data with Typos
Record 1 | Record 2 |
|---|---|
John Smith | John Smth |
123 Main St | 123 Main Street |
Decision: MERGE - Obvious typos and abbreviations
Scenario 4: Family Members
Record 1 | Record 2 |
|---|---|
John Smith | Jane Smith |
123 Main St | 123 Main St |
Decision: DISCONNECT - Different people, same household
Next Steps
After completing your review queue:
Run synchronization - Process new data and create new buckets
Review statistics - Analyze duplicate rates and trends
Refine rules - Work with admins to improve matching accuracy
Schedule reviews - Set regular review cadence
Related Documentation
Topic | Link |
|---|---|
Understanding Buckets | |
Entity Configuration | |
User Guide |