Example: Manual Deduplication Workflow

This guide walks you through the process of manually reviewing and resolving duplicate records in Golden. This workflow is essential for handling uncertain matches that require human judgment.

Overview

Item	Value
Scenario	Review and resolve REVIEW-classified buckets
Role Required	STEWARD or ADMIN
Skill Level	Beginner
Time Per Bucket	1-5 minutes

When to Use This Workflow

This workflow applies to entities configured with:

DUPLICATES mode - All buckets require manual review
AUTO_DUPLICATES mode - Only REVIEW buckets need manual attention

Understanding Classifications

Classification	Confidence	Action
DUPLICATES	High (>90%)	Auto-merged in AUTO_DUPLICATES mode, manual review in DUPLICATES mode
REVIEW	Medium (60-90%)	Always requires manual review
UNIQUE	Low (<60%)	No duplicates detected
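
The confidence bands in the table can be expressed as a small function. This is only a sketch: `classify_confidence` is a hypothetical helper, not part of Golden, and the guide does not say how the exact boundary values (0.60 and 0.90) are classified, so treating both as REVIEW is an assumption.

```python
def classify_confidence(confidence: float) -> str:
    """Map a match score (0.0 to 1.0) to the classification bands in the table."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be between 0.0 and 1.0, got {confidence}")
    if confidence > 0.90:
        return "DUPLICATES"
    if confidence < 0.60:
        return "UNIQUE"
    # Assumption: the boundary values 0.60 and 0.90 themselves count as REVIEW.
    return "REVIEW"


for score in (0.95, 0.75, 0.30):
    print(score, classify_confidence(score))
```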

The Review Process

CODE

Find REVIEW Buckets → Examine Records → Make Decision → Track Progress

Step 1: Find Buckets to Review

First, retrieve all buckets that need manual review.

Using the API

BASH

# Get all REVIEW buckets for your entity
curl -X 'GET' "${GOLDEN_URL}/api/golden/customers/bucket?classification=REVIEW" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}"

Response Example

JSON

{
  "content": [
    {
      "bucketId": "B-10001",
      "classification": "REVIEW",
      "confidence": 0.75,
      "recordCount": 2,
      "created": "2024-01-15T10:30:00Z"
    },
    {
      "bucketId": "B-10002",
      "classification": "REVIEW",
      "confidence": 0.68,
      "recordCount": 3,
      "created": "2024-01-15T10:30:00Z"
    }
  ],
  "page": {
    "totalElements": 150,
    "totalPages": 8
  }
}

What the Fields Mean

Field	Description
`bucketId`	Unique identifier for this bucket
`classification`	Current status (REVIEW)
`confidence`	Match score (0.0 to 1.0) - higher means more likely duplicates
`recordCount`	Number of records in this bucket
`created`	When this bucket was created

Prioritizing Your Queue

Consider reviewing buckets in this order:

Highest confidence first - More likely to be actual duplicates
Smallest record count - Easier to review (2 records vs 5)
Oldest first - Clear the backlog systematically
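
One way to apply these criteria is to sort the bucket list client-side. The sketch below assumes the three criteria act as tie-breakers in the order listed (the guide does not say how to combine them) and uses the field names from the response example; `prioritize` is a hypothetical helper.

```python
from datetime import datetime


def prioritize(buckets: list[dict]) -> list[dict]:
    """Order buckets by highest confidence, then fewest records, then oldest."""

    def sort_key(bucket: dict):
        # Python versions before 3.11 do not accept a trailing "Z" in fromisoformat.
        created = datetime.fromisoformat(bucket["created"].replace("Z", "+00:00"))
        return (-bucket["confidence"], bucket["recordCount"], created)

    return sorted(buckets, key=sort_key)


# The two buckets from the response example, deliberately out of order.
buckets = [
    {"bucketId": "B-10002", "confidence": 0.68, "recordCount": 3,
     "created": "2024-01-15T10:30:00Z"},
    {"bucketId": "B-10001", "confidence": 0.75, "recordCount": 2,
     "created": "2024-01-15T10:30:00Z"},
]
ordered_ids = [bucket["bucketId"] for bucket in prioritize(buckets)]
print(ordered_ids)
```

With the sample data, B-10001 comes first because it has the higher confidence.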

Step 2: Examine Records in a Bucket

Open a specific bucket to see all records and their details.

Using the API

BASH

# Get detailed bucket information
curl -X 'GET' "${GOLDEN_URL}/api/golden/customers/bucket/B-10001" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}"

Response Example

JSON

{
  "bucketId": "B-10001",
  "classification": "REVIEW",
  "confidence": 0.75,
  "matchReason": "email_similarity: 0.95, name_similarity: 0.55",
  "records": [
    {
      "recordId": "R-5001",
      "source": "CRM",
      "sourceId": "CRM-12345",
      "data": {
        "full_name": "John Smith",
        "email": "[email protected]",
        "phone": "+1-555-0101",
        "city": "Boston"
      },
      "created": "2024-01-10T08:00:00Z",
      "updated": "2024-01-14T15:30:00Z"
    },
    {
      "recordId": "R-5002",
      "source": "E-Commerce",
      "sourceId": "EC-67890",
      "data": {
        "full_name": "J. Smith",
        "email": "[email protected]",
        "phone": "+1-555-0102",
        "city": "Cambridge"
      },
      "created": "2024-01-12T14:00:00Z",
      "updated": "2024-01-12T14:00:00Z"
    }
  ],
  "goldenPreview": {
    "full_name": "John Smith",
    "email": "[email protected]",
    "phone": "+1-555-0101",
    "city": "Boston"
  }
}

What to Look For

Check	Question to Ask
Email	Do the emails match exactly or closely?
Name	Could these be the same person with different name formats?
Phone	Are the phone numbers similar or from the same area?
Address	Are the locations the same or nearby?
Source	Do different sources have different data quality?
Timing	Does the update timeline make sense?
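
To make the field-by-field check easier, the records in a bucket can be compared programmatically. The following is a minimal sketch that assumes the `records[].data` layout from the response example (trimmed here to three fields); `compare_fields` is a hypothetical helper, not a Golden API.

```python
def compare_fields(records: list[dict]) -> dict[str, list]:
    """Return, for each field whose values differ, the distinct values seen."""
    differences = {}
    field_names = sorted({name for record in records for name in record["data"]})
    for name in field_names:
        values = []
        for record in records:
            value = record["data"].get(name)  # None if the record lacks the field
            if value not in values:
                values.append(value)
        if len(values) > 1:
            differences[name] = values
    return differences


records = [
    {"recordId": "R-5001",
     "data": {"full_name": "John Smith", "phone": "+1-555-0101", "city": "Boston"}},
    {"recordId": "R-5002",
     "data": {"full_name": "J. Smith", "phone": "+1-555-0102", "city": "Cambridge"}},
]
differences = compare_fields(records)
for field, values in differences.items():
    print(f"{field}: {values}")
```

Fields that are identical across all records are left out, so the output lists only what needs a human decision.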

Understanding Match Reasons

The `matchReason` field explains why records were grouped:

Reason	What It Means
`email_similarity: 0.95`	95% email match (nearly identical)
`name_similarity: 0.55`	55% name match (partial, possible abbreviation)
`phone_exact: 1.0`	Phone numbers match exactly
`address_fuzzy: 0.70`	70% address similarity (may be typos)
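
If you want to work with the individual scores, the string can be split into a dictionary. This sketch assumes `matchReason` always has the comma-separated `name: score` shape shown in the response example, which this guide does not guarantee; `parse_match_reason` is a hypothetical helper.

```python
def parse_match_reason(match_reason: str) -> dict[str, float]:
    """Split a string such as 'email_similarity: 0.95, name_similarity: 0.55'."""
    scores = {}
    for part in match_reason.split(","):
        if not part.strip():
            continue  # tolerate an empty string or a trailing comma
        name, _, value = part.partition(":")
        scores[name.strip()] = float(value)
    return scores


scores = parse_match_reason("email_similarity: 0.95, name_similarity: 0.55")
weakest = min(scores, key=scores.get)
print(scores, "weakest signal:", weakest)
```

Looking at the weakest signal first (here `name_similarity`) tells you which field to inspect most carefully.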

Step 3: Make Your Decision

Based on your review, choose one of three actions.

Decision Guide

Scenario	Decision	Action
Records are clearly the same person/entity	Merge	Combine into one golden record
Records are clearly different people/entities	Disconnect	Separate into individual buckets
Not enough information to decide	Skip	Leave for later review

Option A: Merge (Confirm Duplicates)

Use this when you're confident the records represent the same entity.

BASH

curl -X 'PUT' "${GOLDEN_URL}/api/golden/customers/bucket/merge" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
      "bucketId": "B-10001"
    }'

What Happens After Merge

Records are combined into a single golden record
Best values are selected based on merger rules
Source records maintain lineage (can trace back)
Bucket is marked as resolved

Merge Response

JSON

{
  "success": true,
  "goldenRecordId": "G-3001",
  "message": "Bucket B-10001 merged successfully",
  "recordsMerged": 2
}
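
The merge call can also be prepared from a script. The sketch below builds the same PUT request with Python's standard library and does not send it. It assumes that `customers` in the path is the entity name; the base URL and token are placeholders, and `build_merge_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_merge_request(base_url: str, token: str, entity: str,
                        bucket_id: str) -> urllib.request.Request:
    """Build (but do not send) the PUT request that merges a bucket."""
    body = json.dumps({"bucketId": bucket_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/golden/{entity}/bucket/merge",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )


# Placeholder URL and token, for illustration only.
request = build_merge_request("https://golden.example.com", "TOKEN",
                              "customers", "B-10001")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```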

Option B: Disconnect (Not Duplicates)

Use this when you determine the records are NOT the same entity.

BASH

curl -X 'PUT' "${GOLDEN_URL}/api/golden/customers/bucket/disconnect" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
      "bucketId": "B-10001",
      "recordIds": ["R-5001", "R-5002"]
    }'

What Happens After Disconnect

Records are separated into individual buckets
Each record becomes its own golden record
System learns from this decision (improves future matching)
Original bucket is marked as resolved

Disconnect Response

JSON

{
  "success": true,
  "message": "Records disconnected from bucket B-10001",
  "newBuckets": ["B-10003", "B-10004"]
}

Partial Disconnect

If a bucket has three or more records and only some of them are duplicates, disconnect just the records that do not belong:

BASH

# Disconnect only R-5003 (the non-duplicate)
curl -X 'PUT' "${GOLDEN_URL}/api/golden/customers/bucket/disconnect" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
      "bucketId": "B-10002",
      "recordIds": ["R-5003"]
    }'

This moves R-5003 into its own bucket while keeping the remaining records together.
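
Before sending a partial disconnect, it can help to check that every record ID you list actually belongs to the bucket. The sketch below builds the request body from a bucket detail response; the record IDs R-5004 and R-5005 are made up for illustration, and `build_disconnect_payload` is a hypothetical helper.

```python
def build_disconnect_payload(bucket: dict, non_duplicate_ids: list[str]) -> dict:
    """Build the disconnect request body after validating the record IDs."""
    if not non_duplicate_ids:
        raise ValueError("list at least one record to disconnect")
    known_ids = {record["recordId"] for record in bucket["records"]}
    unknown = [rid for rid in non_duplicate_ids if rid not in known_ids]
    if unknown:
        raise ValueError(f"records not in bucket {bucket['bucketId']}: {unknown}")
    return {"bucketId": bucket["bucketId"], "recordIds": list(non_duplicate_ids)}


# Hypothetical three-record bucket; only R-5003 appears in the guide.
bucket = {
    "bucketId": "B-10002",
    "records": [{"recordId": "R-5003"}, {"recordId": "R-5004"}, {"recordId": "R-5005"}],
}
payload = build_disconnect_payload(bucket, ["R-5003"])
print(payload)
```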

Option C: Skip (Defer Decision)

If you need more information or want to review later, simply move to the next bucket. The current bucket remains in the REVIEW queue.

When to Skip:

You need to verify information with another team
Data quality is too poor to make a determination
The case is complex and requires additional context
You are in training and want a supervisor to review the bucket

Step 4: Track Your Progress

Monitor your review progress and the overall entity statistics.

Get Entity Statistics

BASH

curl -X 'GET' "${GOLDEN_URL}/api/entity/customers" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}"

Key Statistics

JSON

{
  "name": "customers",
  "status": "COMPLETED",
  "statistics": {
    "totalRecords": 50000,
    "totalBuckets": 48500,
    "uniqueBuckets": 45000,
    "duplicateBuckets": 2000,
    "reviewBuckets": 150,
    "goldenRecords": 47000,
    "duplicateRate": 0.06
  }
}

Understanding the Numbers

Metric	Description	Target
`totalRecords`	All loaded records
`totalBuckets`	Groups after indexing
`uniqueBuckets`	Confirmed non-duplicates	Higher is better
`duplicateBuckets`	Confirmed duplicates (merged)
`reviewBuckets`	Awaiting manual review	Lower is better
`goldenRecords`	Master records created
`duplicateRate`	Fraction of records that are duplicates (0.06 = 6%)	Varies by data quality

Tracking Review Progress

Calculate your progress:

CODE

Review Completion = (initial reviewBuckets - current reviewBuckets) / initial reviewBuckets

Example: You started with 500 REVIEW buckets and 150 remain, so (500 - 150) / 500 = 70% complete.
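
The same calculation as a short script, in case you want to log progress between sessions. `review_completion` is a hypothetical helper; the initial count is something you have to record yourself when you start, since the statistics response shown in this guide only reports the current `reviewBuckets` value.

```python
def review_completion(initial_review_buckets: int, current_review_buckets: int) -> float:
    """Fraction of the initial REVIEW backlog that has been resolved."""
    if initial_review_buckets <= 0:
        raise ValueError("initial_review_buckets must be positive")
    resolved = initial_review_buckets - current_review_buckets
    return resolved / initial_review_buckets


progress = review_completion(500, 150)
print(f"{progress:.0%} complete")
```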

Best Practices

For Efficient Review

Practice	Benefit
Review in batches	Maintain consistency, avoid fatigue
Take regular breaks	Keep accuracy high
Document unusual cases	Help train other stewards
Use keyboard shortcuts	Speed up common actions

For Accurate Decisions

Practice	Benefit
Check all fields	Don't rely on just one match
Consider data sources	Some sources are more reliable than others
Look at timestamps	Recent data is often more accurate
When in doubt, skip	Better to defer than to make a wrong decision

For Quality Improvement

Practice	Benefit
Track common patterns	Identify system improvements
Report false positives	Help tune matching rules
Report missed duplicates	Improve indexer coverage
Share learnings	Train the team

Common Scenarios

Scenario 1: Same Email, Different Names

Record 1	Record 2
John Smith	J. Smith
[email protected]	[email protected]

Decision: MERGE - Same email, name is likely abbreviated

Scenario 2: Same Name, Different Emails

Record 1	Record 2
John Smith	John Smith
[email protected]	[email protected]

Decision: Need more info - Check phone, address, or other fields. Could be same person with work/personal emails, or different people with same name.

Scenario 3: Similar Data with Typos

Record 1	Record 2
John Smith	John Smth
123 Main St	123 Main Street

Decision: MERGE - Obvious typos and abbreviations
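
To get a feel for why typos and abbreviations still score as close matches, here is a rough illustration using Python's standard `difflib`. This is not Golden's matching algorithm, which this guide does not describe; it is a generic string-similarity measure swapped in for demonstration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


name_score = similarity("John Smith", "John Smth")
address_score = similarity("123 Main St", "123 Main Street")
print(f"name: {name_score:.2f}, address: {address_score:.2f}")
```

Both pairs score well above what unrelated strings would, which matches the intuition behind the MERGE decision.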

Scenario 4: Family Members

Record 1	Record 2
John Smith	Jane Smith
123 Main St	123 Main St
[email protected]	[email protected]

Decision: DISCONNECT - Different people, same household

Next Steps

After completing your review queue:

Run synchronization - Process new data and create new buckets
Review statistics - Analyze duplicate rates and trends
Refine rules - Work with admins to improve matching accuracy
Schedule reviews - Set regular review cadence

Related Documentation

Topic	Link
Understanding Buckets	Golden Records
Entity Configuration	Entities
User Guide	User Guide
