Example: Manual Deduplication Workflow

This guide walks through manually reviewing and resolving duplicate records in Golden. Use this workflow for uncertain matches that require human judgment.

Overview

| Item | Value |
| --- | --- |
| Scenario | Review and resolve REVIEW-classified buckets |
| Role Required | STEWARD or ADMIN |
| Skill Level | Beginner |
| Time Per Bucket | 1-5 minutes |


When to Use This Workflow

This workflow applies to entities configured with:

  • DUPLICATES mode - All buckets require manual review

  • AUTO_DUPLICATES mode - Only REVIEW buckets need manual attention

Understanding Classifications

| Classification | Confidence | Action |
| --- | --- | --- |
| DUPLICATES | High (>90%) | Auto-merged in AUTO_DUPLICATES mode; manual review in DUPLICATES mode |
| REVIEW | Medium (60-90%) | Always requires manual review |
| UNIQUE | Low (<60%) | No duplicates detected |
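
The confidence bands above can be written as a small function. This is a minimal sketch for reasoning about the bands, not Golden's actual scoring code: the name `classify` is hypothetical, and placing scores of exactly 0.60 and 0.90 in REVIEW is an assumption based on the "60-90%" range.

```python
def classify(confidence: float) -> str:
    """Map a match score (0.0 to 1.0) to the classification bands in the table.

    Assumption: scores of exactly 0.60 and 0.90 fall in REVIEW.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence > 0.90:
        return "DUPLICATES"
    if confidence >= 0.60:
        return "REVIEW"
    return "UNIQUE"


print(classify(0.75))  # REVIEW
```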


The Review Process

CODE
Find REVIEW Buckets → Examine Records → Make Decision → Track Progress

Step 1: Find Buckets to Review

First, retrieve all buckets that need manual review.

Using the API

BASH
# Get all REVIEW buckets for your entity
curl -X 'GET' "${GOLDEN_URL}/api/golden/customers/bucket?classification=REVIEW" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}"

Response Example

JSON
{
  "content": [
    {
      "bucketId": "B-10001",
      "classification": "REVIEW",
      "confidence": 0.75,
      "recordCount": 2,
      "created": "2024-01-15T10:30:00Z"
    },
    {
      "bucketId": "B-10002",
      "classification": "REVIEW",
      "confidence": 0.68,
      "recordCount": 3,
      "created": "2024-01-15T10:30:00Z"
    }
  ],
  "page": {
    "totalElements": 150,
    "totalPages": 8
  }
}
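
If you prefer a script to curl, the same request can be built with Python's standard library. This is a sketch under assumptions: the helper names are hypothetical, `GOLDEN_URL` and `GOLDEN_TOKEN` are read from the environment as in the curl example, and only the first page of results is read.

```python
import json
import os
import urllib.parse
import urllib.request


def review_buckets_request(base_url: str, token: str, entity: str = "customers") -> urllib.request.Request:
    """Build the same GET request as the curl example (hypothetical helper)."""
    query = urllib.parse.urlencode({"classification": "REVIEW"})
    url = f"{base_url}/api/golden/{entity}/bucket?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")


def fetch_review_buckets(base_url: str, token: str) -> list[dict]:
    """Send the request and return the buckets from the "content" array."""
    with urllib.request.urlopen(review_buckets_request(base_url, token)) as resp:
        return json.load(resp)["content"]


# Only runs against a live server when the environment is configured.
if __name__ == "__main__" and "GOLDEN_URL" in os.environ:
    for bucket in fetch_review_buckets(os.environ["GOLDEN_URL"], os.environ["GOLDEN_TOKEN"]):
        print(bucket["bucketId"], bucket["confidence"])
```

The sample response is paginated (8 pages for 150 buckets). The query parameter for requesting further pages is not shown in this guide, so the sketch does not attempt to page through results.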

What the Fields Mean

| Field | Description |
| --- | --- |
| bucketId | Unique identifier for this bucket |
| classification | Current status (REVIEW) |
| confidence | Match score (0.0 to 1.0) - higher means more likely duplicates |
| recordCount | Number of records in this bucket |
| created | When this bucket was created |

Prioritizing Your Queue

Consider reviewing buckets in this order:

  1. Highest confidence first - More likely to be actual duplicates

  2. Smallest record count - Easier to review (2 records vs 5)

  3. Oldest first - Clear the backlog systematically
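
The ordering above can be applied to the `content` array from the list response. This is a minimal sketch that treats the three criteria as successive tie-breakers, which is one reading of the list. The function name `prioritize` is hypothetical.

```python
from datetime import datetime


def prioritize(buckets: list[dict]) -> list[dict]:
    """Sort by confidence (high first), then record count (small first), then age (old first)."""
    def key(bucket: dict):
        # fromisoformat() on older Python versions does not accept a trailing "Z".
        created = datetime.fromisoformat(bucket["created"].replace("Z", "+00:00"))
        return (-bucket["confidence"], bucket["recordCount"], created)
    return sorted(buckets, key=key)


queue = [
    {"bucketId": "B-10002", "confidence": 0.68, "recordCount": 3, "created": "2024-01-15T10:30:00Z"},
    {"bucketId": "B-10001", "confidence": 0.75, "recordCount": 2, "created": "2024-01-15T10:30:00Z"},
]
print([b["bucketId"] for b in prioritize(queue)])  # ['B-10001', 'B-10002']
```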


Step 2: Examine Records in a Bucket

Open a specific bucket to see all records and their details.

Using the API

BASH
# Get detailed bucket information
curl -X 'GET' "${GOLDEN_URL}/api/golden/customers/bucket/B-10001" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}"

Response Example

JSON
{
  "bucketId": "B-10001",
  "classification": "REVIEW",
  "confidence": 0.75,
  "matchReason": "email_similarity: 0.95, name_similarity: 0.55",
  "records": [
    {
      "recordId": "R-5001",
      "source": "CRM",
      "sourceId": "CRM-12345",
      "data": {
        "full_name": "John Smith",
        "email": "[email protected]",
        "phone": "+1-555-0101",
        "city": "Boston"
      },
      "created": "2024-01-10T08:00:00Z",
      "updated": "2024-01-14T15:30:00Z"
    },
    {
      "recordId": "R-5002",
      "source": "E-Commerce",
      "sourceId": "EC-67890",
      "data": {
        "full_name": "J. Smith",
        "email": "[email protected]",
        "phone": "+1-555-0102",
        "city": "Cambridge"
      },
      "created": "2024-01-12T14:00:00Z",
      "updated": "2024-01-12T14:00:00Z"
    }
  ],
  "goldenPreview": {
    "full_name": "John Smith",
    "email": "[email protected]",
    "phone": "+1-555-0101",
    "city": "Boston"
  }
}

What to Look For

| Check | Question to Ask |
| --- | --- |
| Email | Do the emails match exactly or closely? |
| Name | Could these be the same person with different name formats? |
| Phone | Are the phone numbers similar or from the same area? |
| Address | Are the locations the same or nearby? |
| Source | Do different sources have different data quality? |
| Timing | Does the update timeline make sense? |

Understanding Match Reasons

The matchReason field explains why records were grouped:

| Reason | What It Means |
| --- | --- |
| email_similarity: 0.95 | 95% email match (nearly identical) |
| name_similarity: 0.55 | 55% name match (partial, possible abbreviation) |
| phone_exact: 1.0 | Phone numbers match exactly |
| address_fuzzy: 0.70 | 70% address similarity (may be typos) |
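
To see which signal is weakest in a bucket, you can split the `matchReason` string into individual scores. This sketch assumes the string always has the comma-separated `name: score` form shown in the sample response. The helper name `parse_match_reason` is hypothetical.

```python
def parse_match_reason(match_reason: str) -> dict[str, float]:
    """Split "name: score, name: score" into {name: score}."""
    scores = {}
    for part in match_reason.split(","):
        name, _, value = part.partition(":")
        if name.strip() and value.strip():
            scores[name.strip()] = float(value)
    return scores


scores = parse_match_reason("email_similarity: 0.95, name_similarity: 0.55")
weakest = min(scores, key=scores.get)
print(scores)   # {'email_similarity': 0.95, 'name_similarity': 0.55}
print(weakest)  # name_similarity
```

The weakest signal is usually the field to inspect first when comparing the records by eye.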


Step 3: Make Your Decision

Based on your review, choose one of three actions.

Decision Guide

| Scenario | Decision | Action |
| --- | --- | --- |
| Records are clearly the same person/entity | Merge | Combine into one golden record |
| Records are clearly different people/entities | Disconnect | Separate into individual buckets |
| Not enough information to decide | Skip | Leave for later review |


Option A: Merge (Confirm Duplicates)

Use this when you're confident the records represent the same entity.

BASH
curl -X 'PUT' "${GOLDEN_URL}/api/golden/customers/bucket/merge" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
      "bucketId": "B-10001"
    }'

What Happens After Merge

  1. Records are combined into a single golden record

  2. Best values are selected based on merger rules

  3. Source records maintain lineage (can trace back)

  4. Bucket is marked as resolved

Merge Response

JSON
{
  "success": true,
  "goldenRecordId": "G-3001",
  "message": "Bucket B-10001 merged successfully",
  "recordsMerged": 2
}

Option B: Disconnect (Not Duplicates)

Use this when you determine the records are NOT the same entity.

BASH
curl -X 'PUT' "${GOLDEN_URL}/api/golden/customers/bucket/disconnect" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
      "bucketId": "B-10001",
      "recordIds": ["R-5001", "R-5002"]
    }'

What Happens After Disconnect

  1. Records are separated into individual buckets

  2. Each record becomes its own golden record

  3. System learns from this decision (improves future matching)

  4. Original bucket is marked as resolved

Disconnect Response

JSON
{
  "success": true,
  "message": "Records disconnected from bucket B-10001",
  "newBuckets": ["B-10003", "B-10004"]
}

Partial Disconnect

If a bucket has three or more records and only some of them are duplicates, disconnect only the records that do not belong:

BASH
# Disconnect only R-5003 (the non-duplicate)
curl -X 'PUT' "${GOLDEN_URL}/api/golden/customers/bucket/disconnect" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{
      "bucketId": "B-10002",
      "recordIds": ["R-5003"]
    }'

This moves R-5003 into its own bucket while keeping the remaining records together.
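
When a bucket is larger, it can be easier to name the records that belong together and derive the disconnect list from them. This is a minimal sketch that only builds the request body. The helper `disconnect_payload` and the record IDs R-5004 and R-5005 are hypothetical and used for illustration.

```python
import json


def disconnect_payload(bucket_id: str, record_ids: list[str], keep_together: list[str]) -> dict:
    """Build the request body that disconnects every record not in keep_together."""
    unknown = set(keep_together) - set(record_ids)
    if unknown:
        raise ValueError(f"not in bucket {bucket_id}: {sorted(unknown)}")
    to_disconnect = [rid for rid in record_ids if rid not in keep_together]
    if not to_disconnect:
        raise ValueError("nothing to disconnect; merge the bucket instead")
    return {"bucketId": bucket_id, "recordIds": to_disconnect}


# R-5004 and R-5005 are made-up IDs standing in for the two records that match.
payload = disconnect_payload("B-10002", ["R-5003", "R-5004", "R-5005"], ["R-5004", "R-5005"])
print(json.dumps(payload))  # {"bucketId": "B-10002", "recordIds": ["R-5003"]}
```

The printed JSON is the same body passed to `-d` in the curl command above.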


Option C: Skip (Defer Decision)

If you need more information or want to review later, simply move to the next bucket. The current bucket remains in the REVIEW queue.

When to Skip:

  • You need to verify information with another team

  • Data quality is too poor to make a determination

  • The case is complex and needs additional context

  • You are in training and want a supervisor to review the case


Step 4: Track Your Progress

Monitor your review progress and the overall entity statistics.

Get Entity Statistics

BASH
curl -X 'GET' "${GOLDEN_URL}/api/entity/customers" \
    -H "Authorization: Bearer ${GOLDEN_TOKEN}"

Key Statistics

JSON
{
  "name": "customers",
  "status": "COMPLETED",
  "statistics": {
    "totalRecords": 50000,
    "totalBuckets": 48500,
    "uniqueBuckets": 45000,
    "duplicateBuckets": 2000,
    "reviewBuckets": 150,
    "goldenRecords": 47000,
    "duplicateRate": 0.06
  }
}

Understanding the Numbers

| Metric | Description | Target |
| --- | --- | --- |
| totalRecords | All loaded records | |
| totalBuckets | Groups after indexing | |
| uniqueBuckets | Confirmed non-duplicates | Higher is better |
| duplicateBuckets | Confirmed duplicates (merged) | |
| reviewBuckets | Awaiting manual review | Lower is better |
| goldenRecords | Master records created | |
| duplicateRate | Fraction of records that are duplicates (0.06 = 6%) | Varies by data quality |
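
In the sample response, `duplicateRate` is consistent with the share of loaded records that did not become their own golden record. Whether Golden computes the field this way is an assumption; the sketch below only checks the sample numbers, and `implied_duplicate_rate` is a hypothetical helper.

```python
import math

# Values copied from the sample statistics response above.
stats = {"totalRecords": 50000, "goldenRecords": 47000, "duplicateRate": 0.06}


def implied_duplicate_rate(statistics: dict) -> float:
    """Fraction of loaded records that did not become their own golden record."""
    return (statistics["totalRecords"] - statistics["goldenRecords"]) / statistics["totalRecords"]


print(f"{implied_duplicate_rate(stats):.0%}")  # 6%
print(math.isclose(implied_duplicate_rate(stats), stats["duplicateRate"]))  # True
```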

Tracking Review Progress

Calculate your progress:

CODE
Review Completion = (initial reviewBuckets - current reviewBuckets) / initial reviewBuckets

Example: if you started with 500 REVIEW buckets and 150 remain, completion is (500 - 150) / 500 = 70%.
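
The same calculation as a function, in case you want to log progress from a script. This is a minimal sketch. You need to record the initial `reviewBuckets` count yourself when you start, since the statistics response shown above only reports the current value.

```python
def review_completion(initial_review_buckets: int, current_review_buckets: int) -> float:
    """Return the fraction of the initial REVIEW queue that has been resolved."""
    if initial_review_buckets <= 0:
        raise ValueError("initial_review_buckets must be positive")
    return (initial_review_buckets - current_review_buckets) / initial_review_buckets


print(f"{review_completion(500, 150):.0%}")  # 70%
```

If new data is synchronized during the review, the current count can exceed the initial one and the result becomes negative, so reset the baseline after each synchronization.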


Best Practices

For Efficient Review

| Practice | Benefit |
| --- | --- |
| Review in batches | Maintain consistency, avoid fatigue |
| Take regular breaks | Keep accuracy high |
| Document unusual cases | Help train other stewards |
| Use keyboard shortcuts | Speed up common actions |

For Accurate Decisions

| Practice | Benefit |
| --- | --- |
| Check all fields | Don't rely on just one match |
| Consider data sources | Some sources are more reliable than others |
| Look at timestamps | Recent data is often more accurate |
| When in doubt, skip | Better to defer than to make a wrong decision |

For Quality Improvement

| Practice | Benefit |
| --- | --- |
| Track common patterns | Identify system improvements |
| Report false positives | Help tune matching rules |
| Report missed duplicates | Improve indexer coverage |
| Share learnings | Train the team |


Common Scenarios

Scenario 1: Same Email, Different Names

| Record 1 | Record 2 |
| --- | --- |
| John Smith | J. Smith |
| [email protected] | [email protected] |

Decision: MERGE - Same email, name is likely abbreviated


Scenario 2: Same Name, Different Emails

| Record 1 | Record 2 |
| --- | --- |
| John Smith | John Smith |
| [email protected] | [email protected] |

Decision: SKIP (need more info) - Check phone, address, or other fields before deciding. This could be the same person with work and personal emails, or different people with the same name.


Scenario 3: Similar Data with Typos

| Record 1 | Record 2 |
| --- | --- |
| John Smith | John Smth |
| 123 Main St | 123 Main Street |

Decision: MERGE - Obvious typos and abbreviations


Scenario 4: Family Members

| Record 1 | Record 2 |
| --- | --- |
| John Smith | Jane Smith |
| 123 Main St | 123 Main St |
| [email protected] | [email protected] |

Decision: DISCONNECT - Different people, same household


Next Steps

After completing your review queue:

  1. Run synchronization - Process new data and create new buckets

  2. Review statistics - Analyze duplicate rates and trends

  3. Refine rules - Work with admins to improve matching accuracy

  4. Schedule reviews - Set regular review cadence

Related Documentation

| Topic | Link |
| --- | --- |
| Understanding Buckets | Golden Records |
| Entity Configuration | Entities |
| User Guide | User Guide |
