Golden Records

Introduction

Golden Records represent the "single source of truth" for each entity in your system. This guide explains how Golden Core identifies duplicates, manages buckets, and creates master data records through intelligent merging.

A Golden Record is a consolidated, high-quality record created by merging duplicate records together using configurable rules.

Core Concepts

Record

A Record is a JSON document containing data about a single entity instance (customer, product, etc.).

Structure:

JSON

{
  "_id": "uuid-12345",
  "email": "[email protected]",
  "full_name": "John Doe",
  "phone": "+1234567890",
  "_metadata": {
    "_updated": "2025-01-30T10:00:00Z",
    "_quality": 0.95,
    "_merged": ["uuid-67890"],
    "_errors": []
  }
}

Metadata

Every record includes metadata, stored under the _metadata key with underscore-prefixed field names (for example, _updated), that tracks:

updated - Last modification timestamp
quality - Quality score (0.0 to 1.0)
merged - IDs of records merged into this one
unrelated - IDs manually marked as not duplicates
errors - Validation errors
quality_facts - Quality observations

Bucket

A Bucket groups potentially duplicate records based on shared characteristics (same email, similar name, etc.).

Purpose: Avoid comparing every record with every other record (O(n²) complexity).

JSON

{
  "_id": "[email protected]",
  "indexId": "email",
  "key": "[email protected]",
  "items": ["uuid-12345", "uuid-67890"],
  "size": 2,
  "classification": "MATCH",
  "averageScore": 0.92
}
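To make the savings concrete, the arithmetic below compares exhaustive pairwise comparison with comparison inside buckets only. The record count and the bucket-size distribution are made-up figures for illustration, not measurements from Golden Core.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def bucketed_pair_count(bucket_sizes) -> int:
    """Comparisons needed when records are only compared inside their bucket."""
    return sum(pair_count(size) for size in bucket_sizes)


# Hypothetical data set: 10,000 records in total.
total_records = 10_000
exhaustive = pair_count(total_records)

# Hypothetical bucketing: 2,000 buckets of two records, 6,000 singletons.
bucket_sizes = [2] * 2_000 + [1] * 6_000
bucketed = bucketed_pair_count(bucket_sizes)

print(exhaustive)  # 49995000
print(bucketed)    # 2000
```

Singleton buckets contribute no comparisons at all, which is why most of the work disappears once records are grouped.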

Golden Record

A Golden Record is the result of merging duplicate records:

Contains the best values from all duplicates
Tracks source records via the _merged metadata
Represents the authoritative version of the entity

Deduplication Process

Step-by-Step Workflow

CODE

 1. INDEXING                             
    Records grouped into buckets
    Based on indexer configuration       
                  ↓
 2. CLASSIFICATION                       
    Records within bucket compared       
    Similarity scores calculated         
    Buckets classified                   
                  ↓
 3. REVIEW (Optional)                    
    Human review of uncertain matches    
    Manual merge/split decisions         
                  ↓
 4. MERGING                              
    Duplicate records consolidated       
    Golden record created/updated        
    Source records moved to history

Indexing Phase

Records are grouped into buckets based on index mappings configured in the indexer resource.

Example Index Mappings:

Email exact match
Phone number exact match
Name fuzzy match
Geographic proximity

Result: Records with similar attributes land in the same bucket.
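As a rough illustration of the indexing phase, the sketch below groups records into buckets using two exact-match mappings. The INDEX_MAPPINGS dictionary, the normalization rules, and the sample addresses are assumptions made for this example; the real indexer resource is configured in Golden Core and may behave differently.

```python
from collections import defaultdict

# Hypothetical index mappings: index ID -> function that derives a bucket key.
INDEX_MAPPINGS = {
    "email": lambda r: (r.get("email") or "").strip().lower(),
    "phone": lambda r: "".join(ch for ch in (r.get("phone") or "") if ch.isdigit()),
}


def build_buckets(records):
    """Group record IDs by (index ID, key). Records with no value are skipped."""
    buckets = defaultdict(list)
    for record in records:
        for index_id, key_fn in INDEX_MAPPINGS.items():
            key = key_fn(record)
            if key:  # an empty indexed field produces no bucket entry
                buckets[(index_id, key)].append(record["_id"])
    return dict(buckets)


# Made-up sample records.
records = [
    {"_id": "uuid-12345", "email": "John.Doe@example.com", "phone": "+1234567890"},
    {"_id": "uuid-67890", "email": "john.doe@example.com ", "phone": None},
    {"_id": "uuid-99999", "email": "", "phone": "+1 234 567 890"},
]

buckets = build_buckets(records)
for (index_id, key), items in buckets.items():
    print(index_id, key, items)
```

A record can land in several buckets at once, one per index mapping for which it has a value.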

Classification Phase

Records within each bucket are compared using the classifier resource.

Classification Results:

Classification	Score Range	Meaning
MATCH	High (≥ match threshold)	Likely duplicates - should merge
REVIEW	Medium (between thresholds)	Uncertain - needs human review
NON_MATCH	Low (≤ non-match threshold)	Different records - ignore
IGNORE	N/A	Single record or manually ignored
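The table can be read as a simple decision rule. The sketch below is one possible reading of it, assuming the bucket's average score is what gets compared against the thresholds; the function name and the default threshold values (0.85 and 0.50) are illustrative, not Golden Core defaults.

```python
def classify_bucket(average_score, size,
                    match_threshold=0.85,
                    non_match_threshold=0.50,
                    ignored=False):
    """Map a bucket's score to a classification, following the table above."""
    if ignored or size < 2:
        return "IGNORE"      # single record or manually ignored
    if average_score >= match_threshold:
        return "MATCH"       # likely duplicates
    if average_score <= non_match_threshold:
        return "NON_MATCH"   # different records
    return "REVIEW"          # between the thresholds: needs human review


print(classify_bucket(0.92, size=2))  # MATCH
print(classify_bucket(0.70, size=2))  # REVIEW
print(classify_bucket(0.30, size=3))  # NON_MATCH
print(classify_bucket(0.99, size=1))  # IGNORE
```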

Merging Phase

For MATCH-classified buckets, records are merged using the merger resource.

Merge Strategies:

Weighted selection (trust scores)
Most recent value
Most complete value
Custom merge logic
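To show how the first three strategies can pick different values for the same field, here is a minimal sketch. The function names, the per-record trust weights, and the "longest value is most complete" rule are assumptions made for illustration; the actual merger resource defines its own behavior.

```python
from datetime import datetime


def _has_value(record, field):
    return record.get(field) not in (None, "")


def most_recent(records, field):
    """Value from the record with the latest _metadata._updated timestamp."""
    candidates = [r for r in records if _has_value(r, field)]
    if not candidates:
        return None
    def updated(r):
        # Replace the trailing Z so fromisoformat works on older Python versions.
        return datetime.fromisoformat(r["_metadata"]["_updated"].replace("Z", "+00:00"))
    return max(candidates, key=updated)[field]


def most_complete(records, field):
    """Assumption: the longest non-empty value counts as the most complete."""
    candidates = [r for r in records if _has_value(r, field)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: len(str(r[field])))[field]


def weighted(records, field, trust):
    """Value from the record with the highest (hypothetical) trust weight."""
    candidates = [r for r in records if _has_value(r, field)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: trust.get(r["_id"], 0.0))[field]


records = [
    {"_id": "uuid-12345", "full_name": "John Doe", "phone": "",
     "_metadata": {"_updated": "2025-01-30T10:00:00Z"}},
    {"_id": "uuid-67890", "full_name": "Jon Doe", "phone": "+1234567890",
     "_metadata": {"_updated": "2025-02-02T08:30:00Z"}},
]
trust = {"uuid-12345": 0.9, "uuid-67890": 0.4}

print(most_recent(records, "full_name"))      # Jon Doe
print(most_complete(records, "full_name"))    # John Doe
print(weighted(records, "full_name", trust))  # John Doe
print(most_recent(records, "phone"))          # +1234567890
```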

Working with Buckets

Buckets are the key to managing duplicates in Golden Core.

List Buckets

Endpoint: GET /golden/buckets
Permission: golden.listBucket

Query Parameters:

entity - Entity identifier (required)
classification - Filter by classification (MATCH, REVIEW, etc.)
indexId - Filter by specific index
pageNumber - Page number (default: 0)
pageSize - Page size (default: 10)
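A request URL for this endpoint could be assembled as sketched below. The host name is a placeholder and the helper function is hypothetical; only the path and the query parameter names come from the list above. Authentication is not covered here.

```python
from urllib.parse import urlencode

BASE_URL = "https://golden.example.com"  # placeholder host, replace with your own


def list_buckets_url(entity, classification=None, index_id=None,
                     page_number=0, page_size=10):
    """Build the GET /golden/buckets URL; optional filters are omitted when unset."""
    params = {"entity": entity, "pageNumber": page_number, "pageSize": page_size}
    if classification:
        params["classification"] = classification
    if index_id:
        params["indexId"] = index_id
    return f"{BASE_URL}/golden/buckets?{urlencode(params)}"


print(list_buckets_url("customers", classification="REVIEW"))
# https://golden.example.com/golden/buckets?entity=customers&pageNumber=0&pageSize=10&classification=REVIEW
```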

View Bucket Details

Endpoint: GET /golden/bucket/{bucketId}
Permission: golden.viewBucket

Returns bucket information and all records within it.

JSON

{
  "bucket": {
    "id": "[email protected]",
    "classification": "MATCH",
    "size": 2,
    "averageScore": 0.92
  },
  "records": [
    {
      "_id": "uuid-12345",
      "email": "[email protected]",
      "full_name": "John Doe"
    },
    {
      "_id": "uuid-67890",
      "email": "[email protected]",
      "full_name": "Jon Doe"
    }
  ]
}

Bucket Operations

Merge Bucket

Endpoint: PUT /golden/bucket/merge
Permission: golden.mergeBucket

Merge all records in a bucket into a single golden record.

JSON

{
  "entity": "customers",
  "bucketId": "[email protected]"
}

Process:

CODE

Apply merger algorithm to create golden record
                  ↓
Store original record IDs in _merged metadata
                  ↓
Move original records to history table
                  ↓
Keep golden record in main table
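The four steps above can be mimicked with plain dictionaries standing in for the main and history tables. This is only an in-memory sketch: the table layout, the first_non_empty_merge function, and the choice to let the golden record keep the first record's _id are all assumptions, not the server's actual implementation.

```python
def first_non_empty_merge(records):
    """Hypothetical merger: keep the first non-empty value seen for each field."""
    golden = {"_id": records[0]["_id"]}
    for record in records:
        for field, value in record.items():
            if field.startswith("_"):
                continue  # skip _id and _metadata
            if golden.get(field) in (None, "") and value not in (None, ""):
                golden[field] = value
    return golden


def merge_bucket(main_table, history_table, bucket_items, merge_fn):
    originals = [main_table[rid] for rid in bucket_items]
    golden = merge_fn(originals)                        # 1. apply merger
    merged_ids = [rid for rid in bucket_items if rid != golden["_id"]]
    golden.setdefault("_metadata", {})["_merged"] = merged_ids  # 2. track sources
    for rid in bucket_items:                            # 3. originals go to history
        history_table[rid] = main_table.pop(rid)
    main_table[golden["_id"]] = golden                  # 4. golden stays in main
    return golden


main_table = {
    "uuid-12345": {"_id": "uuid-12345", "full_name": "John Doe", "phone": ""},
    "uuid-67890": {"_id": "uuid-67890", "full_name": "Jon Doe", "phone": "+1234567890"},
}
history_table = {}

golden = merge_bucket(main_table, history_table,
                      ["uuid-12345", "uuid-67890"], first_non_empty_merge)
print(golden)
print(sorted(history_table))
```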

Split Bucket

Endpoint: PUT /golden/bucket/split
Permission: golden.splitBucket

Break up a bucket by clearing index values that caused records to group together.

JSON

{
  "entity": "customers",
  "bucketId": "[email protected]",
  "recordIds": ["uuid-12345"]
}

Use Case: False positives where records aren't actually duplicates.

Effect: Clears indexing fields for specified records so they won't group together again.

Disconnect Bucket

Endpoint: PUT /golden/bucket/disconnect
Permission: golden.disconnectBucket

Mark records as unrelated without modifying data.

JSON

{
  "entity": "customers",
  "bucketId": "[email protected]",
  "recordIds": ["uuid-12345", "uuid-67890"]
}

Effect: Adds record IDs to _unrelated metadata, preventing future merging.

Use Case: Keep data intact but prevent false-positive merges.
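The effect of a disconnect can be pictured as follows. The sketch assumes each listed record stores the IDs of the other listed records in _unrelated, and that a merge step checks this list before combining a pair; both the helper names and that check are illustrative assumptions.

```python
def disconnect(records_by_id, record_ids):
    """Record every listed ID as unrelated to the other listed IDs."""
    for rid in record_ids:
        metadata = records_by_id[rid].setdefault("_metadata", {})
        unrelated = metadata.setdefault("_unrelated", [])
        for other in record_ids:
            if other != rid and other not in unrelated:
                unrelated.append(other)


def may_merge(a, b):
    """A pair is mergeable only if neither record lists the other as unrelated."""
    a_unrelated = a.get("_metadata", {}).get("_unrelated", [])
    b_unrelated = b.get("_metadata", {}).get("_unrelated", [])
    return b["_id"] not in a_unrelated and a["_id"] not in b_unrelated


records_by_id = {
    "uuid-12345": {"_id": "uuid-12345", "full_name": "John Doe"},
    "uuid-67890": {"_id": "uuid-67890", "full_name": "Jon Doe"},
    "uuid-11111": {"_id": "uuid-11111", "full_name": "J. Doe"},
}

disconnect(records_by_id, ["uuid-12345", "uuid-67890"])

print(may_merge(records_by_id["uuid-12345"], records_by_id["uuid-67890"]))  # False
print(may_merge(records_by_id["uuid-12345"], records_by_id["uuid-11111"]))  # True
```

Unlike a split, the business fields of the records are left untouched; only the metadata changes.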

Ignore Bucket

Endpoint: PUT /golden/bucket/ignore
Permission: golden.ignoreBucket

Mark a bucket so that it is skipped during processing.

JSON

{
  "entity": "customers",
  "bucketId": "[email protected]",
  "ignore": true
}

Effect: Sets the bucket classification to IGNORE, excluding it from steward processing.

Delete Bucket

Endpoint: DELETE /golden/bucket
Permission: golden.deleteBucket

Delete all records in a bucket.

JSON

{
  "entity": "customers",
  "bucketId": "[email protected]"
}

Warning: This permanently deletes every record in the bucket. Use with caution.

Working with Records

Search Records

Endpoint: POST /golden/search
Permission: golden.searchRecord

Full-text search across entity records using Typesense.

JSON

{
  "entity": "customers",
  "query": "john",
  "pageNumber": 0,
  "pageSize": 20,
  "facetBy": ["status", "country"]
}

Search Features:

Full-text search across all fields
Faceted search (filtering)
Fuzzy matching
Geo-search (distance-based)
Result ranking

Get Record

Endpoint: GET /golden/record/{recordId}
Permission: golden.viewRecord

Retrieve a single record by ID, optionally with expanded details.

BASH

GET /golden/record/uuid-12345?entity=customers&expanded=true

Expanded view includes:

Full record data
Metadata and audit trail
Merged record references
Quality information

Upsert Record

Endpoint: POST /golden/record
Permission: golden.upsertRecord

Create or update a record.

JSON

{
  "entity": "customers",
  "record": {
    "_id": "uuid-12345",
    "email": "[email protected]",
    "full_name": "John Doe",
    "phone": "+1234567890"
  }
}

Tip: Omit _id to create a new record. Include _id to update an existing record.
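The create-versus-update distinction can be captured in a small payload builder. The helper below is hypothetical and only assembles the request body shown above; sending it to POST /golden/record is left to whatever HTTP client you use.

```python
import json


def upsert_payload(entity, fields, record_id=None):
    """Build the request body: without record_id a new record is created,
    with record_id the existing record is updated."""
    record = dict(fields)  # copy so the caller's dictionary is not modified
    if record_id is not None:
        record["_id"] = record_id
    return {"entity": entity, "record": record}


fields = {"full_name": "John Doe", "phone": "+1234567890"}

create_body = upsert_payload("customers", fields)
update_body = upsert_payload("customers", fields, record_id="uuid-12345")

print(json.dumps(create_body))
print(json.dumps(update_body))
```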

Delete Record

Endpoint: DELETE /golden/record
Permission: golden.deleteRecord

Delete a record.

JSON

{
  "entity": "customers",
  "recordId": "uuid-12345"
}

Best Practices

Deduplication Strategy

Start manual - Use DUPLICATES type initially
Tune thresholds - Adjust based on manual review results
Test thoroughly - Validate merger logic with sample data
Go automatic - Switch to AUTO_DUPLICATES when confident

Troubleshooting

No Duplicates Detected

Check:

Indexer has duplicates: true on mappings
Records have values in indexed fields
Classifier thresholds aren't too high
Entity synchronization completed successfully

Too Many False Positives

Solution:

Increase match threshold (e.g., 0.85 → 0.90)
Add more weighted comparison fields
Use stricter comparison algorithms
Use disconnect operation for known false positives

Too Many False Negatives

Solution:

Decrease match threshold (e.g., 0.85 → 0.75)
Add fuzzy index mappings
Reduce comparison field weights
Check data quality issues
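The trade-off between false positives and false negatives comes down to where the match threshold sits. The sketch below counts how many pairs would be classified as MATCH at the three thresholds mentioned in this section; the similarity scores are invented for illustration.

```python
def count_matches(scores, match_threshold):
    """Number of pairs whose score reaches the match threshold."""
    return sum(1 for score in scores if score >= match_threshold)


# Invented similarity scores for four candidate pairs.
pair_scores = [0.95, 0.88, 0.80, 0.60]

baseline = count_matches(pair_scores, 0.85)  # starting point
stricter = count_matches(pair_scores, 0.90)  # fewer false positives
looser = count_matches(pair_scores, 0.75)    # fewer false negatives

print(baseline, stricter, looser)  # 2 1 3
```

Raising the threshold removes borderline pairs from MATCH, and lowering it admits more of them, so it helps to re-check a reviewed sample after each change.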

Poor Golden Record Quality

Check:

Merger weights configured correctly
Merge type appropriate for data
Source data quality (use _quality score)
Nested dataset merge logic