Entities

Introduction

Entities are the primary abstraction in Golden Core for managing master data and performing entity resolution. An entity orchestrates all components necessary for data deduplication, search capabilities, and golden record management.

Think of an entity as a smart wrapper around a data table that adds powerful search, duplicate detection, and master data management capabilities.

An entity represents a business concept (customers, products, suppliers, etc.) and provides:

Data Storage - References a table containing actual records
Search Capabilities - Full-text and structured search via indexing
Duplicate Detection - Intelligent grouping and classification of similar records
Master Data Management - Automated or manual duplicate resolution
Data Integration - Sources (input) and sinks (output) for data flow
Synchronization - Scheduled or on-demand data processing

Entity Types

Type	Capabilities	Required Components	Use Cases
NONE	Basic storage only	Table, Dataset	Simple data storage without advanced features
SEARCH	Storage + Search	Table, Dataset, Indexer	Searchable catalogs, directories
DUPLICATES	Storage + Search + Manual Deduplication	Table, Dataset, Indexer, Classifier, Merger	Customer MDM with human oversight
AUTO_DUPLICATES	Storage + Search + Automatic Deduplication	Table, Dataset, Indexer, Classifier, Merger, Steward	Fully automated master data management

Start with the SEARCH type to test your configuration, then upgrade to DUPLICATES or AUTO_DUPLICATES once indexing works correctly.

Entity Configuration

Core Properties

Property	Type	Required	Description
id	String	Yes	Unique identifier (alphanumeric, underscore, hyphen)
description	String	No	Human-readable description
table	String	Yes	Reference to the data table
type	EntityType	Yes	NONE, SEARCH, DUPLICATES, or AUTO_DUPLICATES
enabled	Boolean	No	Enable/disable data operations (default: true)
locked	Boolean	No	Lock configuration changes (default: false)
automatic	Boolean	No	Enable automatic synchronization (default: false)

Component References

Component	Required For	Purpose
indexer	SEARCH, DUPLICATES, AUTO_DUPLICATES	Defines search indexes and duplicate detection keys
classifier	DUPLICATES, AUTO_DUPLICATES	Compares records to determine if they are duplicates
merger	DUPLICATES, AUTO_DUPLICATES	Defines strategy for creating golden records
steward	AUTO_DUPLICATES	Automated duplicate resolution logic
pipeline	Optional	Data transformation and cleaning operations
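
The component requirements above can be checked on the client before a configuration is submitted. The following is a minimal sketch, not part of Golden Core: the helper name missing_components is hypothetical, and the mapping is copied from the Component References table.

```python
# Required component references per entity type, copied from the table above.
REQUIRED_COMPONENTS = {
    "NONE": (),
    "SEARCH": ("indexer",),
    "DUPLICATES": ("indexer", "classifier", "merger"),
    "AUTO_DUPLICATES": ("indexer", "classifier", "merger", "steward"),
}


def missing_components(config: dict) -> list[str]:
    """Return the references the entity type requires but the config lacks."""
    entity_type = config.get("type")
    if entity_type not in REQUIRED_COMPONENTS:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    # A reference counts as missing when the key is absent or empty.
    return [name for name in REQUIRED_COMPONENTS[entity_type] if not config.get(name)]


if __name__ == "__main__":
    draft = {
        "id": "customers",
        "table": "customer_table",
        "type": "DUPLICATES",
        "indexer": "customer_indexer",
    }
    print(missing_components(draft))  # ['classifier', 'merger']
```

The server performs its own validation on create, so a check like this only shortens the feedback loop.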

Entity Lifecycle

Status States

Status	Description	Next Actions
EMPTY	Just created, no data	Load data via synchronization
WORKING	Currently processing	Wait for completion
READY	Operational and ready	Normal operations
INCONSISTENT	Configuration changed	Re-synchronize to update indexes
ERROR	Validation or runtime error	Check logs, fix configuration

Typical Lifecycle Flow

CODE

CREATE entity → Status: EMPTY
↓
SYNCHRONIZE (load data) → Status: WORKING
↓
Processing completes → Status: READY
↓
UPDATE configuration → Status: INCONSISTENT
↓
RE-SYNCHRONIZE → Status: READY
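
The flow above can be read as a small state machine. The sketch below models only the transitions shown in the flow; the event names are hypothetical, and it assumes a re-synchronization passes through WORKING before reaching READY, which the flow abbreviates. Transitions into ERROR are not modeled.

```python
# Transitions taken from the lifecycle flow. Event names are illustrative only.
FLOW_TRANSITIONS = {
    ("EMPTY", "synchronize"): "WORKING",
    ("WORKING", "complete"): "READY",
    ("READY", "update_configuration"): "INCONSISTENT",
    # Assumption: a re-synchronization also goes through WORKING.
    ("INCONSISTENT", "synchronize"): "WORKING",
}


def next_status(status: str, event: str) -> str:
    """Return the status that follows `event`, or raise if the flow does not show it."""
    try:
        return FLOW_TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(f"no transition for {event!r} from {status}") from None


def replay(events: list[str], start: str = "EMPTY") -> str:
    """Apply a sequence of events and return the final status."""
    status = start
    for event in events:
        status = next_status(status, event)
    return status


if __name__ == "__main__":
    print(replay(["synchronize", "complete", "update_configuration"]))  # INCONSISTENT
```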

API Operations

Entity Management

Create or Update Entity

Endpoint: POST /entities
Permission: entity.save

JSON

{
  "create": true,
  "id": "customers",
  "description": "Customer master data",
  "table": "customer_table",
  "type": "AUTO_DUPLICATES",
  "indexer": "customer_indexer",
  "classifier": "customer_classifier",
  "merger": "customer_merger",
  "steward": "customer_steward",
  "stewardCron": "0 2 * * *",
  "enabled": true,
  "automatic": false
}
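
As an illustration, the request above could be prepared from Python with the standard library. This is a sketch under assumptions: BASE_URL is a placeholder for your deployment, the function name is hypothetical, and authentication is omitted because the source does not describe it.

```python
import json
import urllib.request

# Hypothetical base URL; replace with the address of your Golden Core deployment.
BASE_URL = "https://golden.example.com/api"


def build_save_entity_request(entity: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST /entities request for the given configuration."""
    body = json.dumps(entity).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/entities",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    request = build_save_entity_request({
        "create": True,
        "id": "customers",
        "table": "customer_table",
        "type": "SEARCH",
        "indexer": "customer_indexer",
    })
    print(request.get_method(), request.full_url)
    # To send it, add credentials and call urllib.request.urlopen(request).
```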

List All Entities

Endpoint: GET /entities
Permission: entity.list

JSON

{
  "entities": [
    {
      "id": "customers",
      "type": "AUTO_DUPLICATES",
      "status": "READY",
      "enabled": true,
      "locked": false,
      "stats": {
        "recordCount": 15000,
        "totalBuckets": 12500,
        "duplicateBuckets": 450
      }
    }
  ]
}
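
A response shaped like the one above can be scanned for entities that need follow-up. The sketch below assumes the response body matches the example; the function name is hypothetical, and the set of statuses treated as needing attention is a choice based on the Status States table.

```python
import json

# Statuses that call for action according to the Status States table.
NEEDS_ATTENTION = {"INCONSISTENT", "ERROR"}


def entities_needing_attention(response_body: str) -> dict[str, str]:
    """Map entity id to status for entities that are INCONSISTENT or in ERROR."""
    payload = json.loads(response_body)
    return {
        entity["id"]: entity["status"]
        for entity in payload.get("entities", [])
        if entity.get("status") in NEEDS_ATTENTION
    }


if __name__ == "__main__":
    sample = json.dumps({
        "entities": [
            {"id": "customers", "status": "READY"},
            {"id": "products", "status": "INCONSISTENT"},
        ]
    })
    print(entities_needing_attention(sample))  # {'products': 'INCONSISTENT'}
```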

Get Entity Details

Endpoint: GET /entities/{id}
Permission: entity.view

Delete Entity

Endpoint: DELETE /entities/{id}
Permission: entity.delete

Deleting an entity removes all bucket data. The underlying table is preserved.

Control Operations

Lock/Unlock Entity

Endpoint: PUT /entities/{id}/locked/{true|false}
Permission: entity.lock

Locked entities prevent configuration changes but allow data operations.

Enable/Disable Entity

Endpoint: PUT /entities/{id}/enabled/{true|false}
Permission: entity.enable

Disabled entities reject data operations but allow configuration changes.

Set Automatic Mode

Endpoint: PUT /entities/{id}/automatic/{true|false}
Permission: entity.automatic

Automatic Mode (true):

Sources execute on their CRON schedules
Steward runs automatically (AUTO_DUPLICATES only)
Fully hands-off operation

Manual Mode (false):

No automatic processing
Requires explicit synchronization calls
Full control over timing
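
The three control endpoints share one path shape, so a client can build them with a single helper. A minimal sketch; the helper name control_path is hypothetical, and the paths follow the endpoints listed above.

```python
from urllib.parse import quote

# Flags exposed by the control endpoints described above.
CONTROL_FLAGS = ("locked", "enabled", "automatic")


def control_path(entity_id: str, flag: str, value: bool) -> str:
    """Build the path for PUT /entities/{id}/{flag}/{true|false}."""
    if flag not in CONTROL_FLAGS:
        raise ValueError(f"unknown control flag: {flag!r}")
    # The path expects lowercase true/false rather than Python's True/False.
    literal = "true" if value else "false"
    return f"/entities/{quote(entity_id, safe='')}/{flag}/{literal}"


if __name__ == "__main__":
    # Example: lock the configuration, then hand processing over to the schedules.
    print(control_path("customers", "locked", True))     # /entities/customers/locked/true
    print(control_path("customers", "automatic", True))  # /entities/customers/automatic/true
```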

Data Operations

Synchronize Entity

Endpoint: PUT /entities/synchronize
Permission: entity.synchronize

JSON

{
  "entity": "customers",
  "loadMask": "FULL",
  "indexClassificationMask": "FULL",
  "sinkMask": "FULL"
}

Load Masks:

FULL - Load all data from all sources
INCREMENTAL - Load only changed records since last execution
CUSTOM - Load data within a specific date range (use loadFrom/loadTo)
NONE - Skip loading

Index Classification Masks:

FULL - Re-index and classify all records
CHANGES - Index only new/modified records
NONE - Skip indexing/classification

Sink Masks:

FULL - Export all records to sinks
NONE - Skip sink operations

Returns a task ID for monitoring progress.

Clear Entity

Endpoint: PUT /entities/{id}/clear
Permission: entity.clear

Removes all bucket data. Table records are preserved.

Entity Synchronization Workflow

When you synchronize an entity, the following process occurs:

CODE

1. LOAD PHASE
    - Execute source queries
    - Apply transformations
    - Insert/update records in table
                  ↓
2. INDEX & CLASSIFICATION PHASE
    - Calculate index keys for records
    - Group records into buckets
    - Classify buckets (MATCH/REVIEW)
    - Update bucket statistics
                  ↓
3. SINK PHASE
    - Apply output transformations
    - Export records to configured sinks
    - Generate audit trail (if enabled)
                  ↓
4. STEWARD PHASE (AUTO_DUPLICATES only)
    - Find MATCH buckets
    - Merge duplicate records
    - Create/update golden records
    - Update statistics
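
To reason about what a given synchronization request will do, the phases above can be derived from the masks. The sketch below is a client-side model, not Golden Core code: the function name and phase labels are hypothetical, and it assumes that a mask of NONE skips its phase and that the steward phase runs for every AUTO_DUPLICATES entity.

```python
def planned_phases(request: dict, entity_type: str) -> list[str]:
    """List the workflow phases a synchronization request is expected to run."""
    phases = []
    if request["loadMask"] != "NONE":
        phases.append("LOAD")
    if request["indexClassificationMask"] != "NONE":
        phases.append("INDEX_AND_CLASSIFICATION")
    if request["sinkMask"] != "NONE":
        phases.append("SINK")
    # The steward phase applies to AUTO_DUPLICATES entities only.
    if entity_type == "AUTO_DUPLICATES":
        phases.append("STEWARD")
    return phases


if __name__ == "__main__":
    reindex_only = {
        "entity": "customers",
        "loadMask": "NONE",
        "indexClassificationMask": "FULL",
        "sinkMask": "NONE",
    }
    print(planned_phases(reindex_only, "DUPLICATES"))  # ['INDEX_AND_CLASSIFICATION']
```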

Working with Duplicates

Duplicate Detection Process

Indexing - Records are grouped into buckets by similarity
Classification - An algorithm determines whether the records in each bucket are duplicates
Review - A human or the automated steward reviews potential matches
Resolution - Duplicates are merged into a golden record or marked as non-duplicates

Classification Results

Classification	Meaning	Action
MATCH	High similarity, likely duplicates	Merge (manual or auto)
NON_MATCH	Low similarity, different records	No action needed
REVIEW	Medium similarity, uncertain	Human review required
IGNORE	Manually marked to skip	No processing
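
The relationship between similarity and the three automatic classifications can be pictured as two thresholds. The sketch below is a generic illustration, not the Golden Core classifier: the score scale (0.0 to 1.0), the function name, and the threshold values are all assumptions. IGNORE is never produced here because it is set manually.

```python
def classify(score: float, match_threshold: float = 0.9, review_threshold: float = 0.7) -> str:
    """Map a similarity score in [0.0, 1.0] to MATCH, REVIEW, or NON_MATCH.

    The thresholds are illustrative values, not product defaults.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if review_threshold > match_threshold:
        raise ValueError("review_threshold must not exceed match_threshold")
    if score >= match_threshold:
        return "MATCH"
    if score >= review_threshold:
        return "REVIEW"
    return "NON_MATCH"


if __name__ == "__main__":
    for score in (0.95, 0.80, 0.40):
        print(score, classify(score))
```

Lowering match_threshold moves buckets from REVIEW to MATCH; lowering review_threshold moves buckets from NON_MATCH to REVIEW.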

Manual Duplicate Operations

See the Golden Records guide for bucket operations (merge, split, disconnect).

Entity Statistics

Entities track key metrics accessible via API:

Metric	Description
recordCount	Total records in table
totalBuckets	Total index buckets created
duplicateBuckets	Buckets with potential duplicates
duplicatesByIndex	Duplicate counts per index mapping

Use these to monitor data quality and deduplication progress.
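
One simple way to track progress is the share of buckets that still contain potential duplicates. A minimal sketch, assuming the stats object has the shape shown in the List All Entities response; the function name is hypothetical.

```python
def duplicate_bucket_ratio(stats: dict) -> float:
    """Return duplicateBuckets / totalBuckets, or 0.0 when there are no buckets."""
    total = stats.get("totalBuckets", 0)
    if total == 0:
        return 0.0
    return stats.get("duplicateBuckets", 0) / total


if __name__ == "__main__":
    stats = {"recordCount": 15000, "totalBuckets": 12500, "duplicateBuckets": 450}
    print(f"{duplicate_bucket_ratio(stats):.1%}")  # 3.6%
```

A ratio that falls after each steward run or manual review session suggests deduplication is keeping pace with incoming data.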

Import and Export

Export Entities

Endpoint: POST /entities/export

JSON

{
  "ids": ["customers", "products"]
}

Returns entity configurations as JSON for backup or migration.

Import Entities

Endpoint: POST /entities/import

CODE

{
  "entities": [
    { /* entity configuration */ }
  ]
}

Imports entity configurations. Configurations are validated before they are imported.

Best Practices

Configuration

Descriptive IDs - Use clear business names (customers, products)
Start Simple - Begin with SEARCH type, upgrade after testing
Lock Production - Lock entities in production to prevent accidental changes
Test Thoroughly - Validate configuration with sample data first

Synchronization Strategy

Manual for Testing - Use manual mode during development
Incremental Loading - Enable incremental loads for large datasets
Off-Peak Scheduling - Schedule steward during low-usage hours
Monitor Tasks - Always check task status after synchronization

Duplicate Management

Tune Classifier - Adjust match/review thresholds based on results
Review Buckets - Periodically review REVIEW-classified buckets
Test Merger Logic - Verify golden record quality with sample data
Track Statistics - Monitor duplicate counts over time

Performance

Appropriate Indexes - Use EXACT for high-cardinality fields
Limit Fuzzy Matching - Set reasonable maximumResults for fuzzy searches
Batch Operations - Use FULL sync for initial load, INCREMENTAL thereafter
Monitor Resources - Track database and memory usage during sync

Troubleshooting

Entity Status INCONSISTENT

Cause: Configuration changed after data was loaded

Solution: Run a synchronization with indexClassificationMask set to FULL

Validation Errors on Create

Cause: Missing or incompatible resources

Solution:

Verify all referenced resources exist (indexer, classifier, etc.)
Check dataset compatibility with indexer/classifier
Ensure CRON expressions are valid

Synchronization Fails

Cause: Entity disabled or in automatic mode. A lock blocks configuration changes only; locked entities still allow data operations.

Solution:

Enable entity if disabled
Unlock if locked (for configuration changes)
Switch to manual mode to run synchronization
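
These checks can be run against an entity's details before requesting a synchronization. The sketch below is a hypothetical client-side helper; it reads the enabled and automatic properties with the defaults from the Core Properties table, and it leaves out locked because locked entities still allow data operations.

```python
def sync_blockers(entity: dict) -> list[str]:
    """Return the reasons a manual synchronization is likely to be rejected."""
    entity_id = entity.get("id", "{id}")
    blockers = []
    # enabled defaults to true, automatic defaults to false (see Core Properties).
    if not entity.get("enabled", True):
        blockers.append(f"disabled: PUT /entities/{entity_id}/enabled/true")
    if entity.get("automatic", False):
        blockers.append(f"automatic mode: PUT /entities/{entity_id}/automatic/false")
    return blockers


if __name__ == "__main__":
    details = {"id": "customers", "enabled": False, "automatic": True, "locked": True}
    for reason in sync_blockers(details):
        print(reason)
```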

No Duplicates Found

Cause: Indexer not configured for duplicates or classifier thresholds too high

Solution:

Check that the indexer mappings have duplicates: true
Lower classifier match threshold
Verify records have values in indexed fields