Entities
Introduction
Entities are the primary abstraction in Golden Core for managing master data and performing entity resolution. An entity orchestrates all components necessary for data deduplication, search capabilities, and golden record management.
Think of an entity as a smart wrapper around a data table that adds powerful search, duplicate detection, and master data management capabilities.
An entity represents a business concept (customers, products, suppliers, etc.) and provides:
Data Storage - References a table containing actual records
Search Capabilities - Full-text and structured search via indexing
Duplicate Detection - Intelligent grouping and classification of similar records
Master Data Management - Automated or manual duplicate resolution
Data Integration - Sources (input) and sinks (output) for data flow
Synchronization - Scheduled or on-demand data processing
Entity Types
| Type | Capabilities | Required Components | Use Cases |
|---|---|---|---|
| NONE | Basic storage only | Table, Dataset | Simple data storage without advanced features |
| SEARCH | Storage + Search | Table, Dataset, Indexer | Searchable catalogs, directories |
| DUPLICATES | Storage + Search + Manual Deduplication | Table, Dataset, Indexer, Classifier, Merger | Customer MDM with human oversight |
| AUTO_DUPLICATES | Storage + Search + Automatic Deduplication | Table, Dataset, Indexer, Classifier, Merger, Steward | Fully automated master data management |
Start with the SEARCH type to test your configuration, then upgrade to DUPLICATES or AUTO_DUPLICATES once indexing works correctly.
Entity Configuration
Core Properties
| Property | Type | Required | Description |
|---|---|---|---|
| id | String | Yes | Unique identifier (alphanumeric, underscore, hyphen) |
| description | String | No | Human-readable description |
| table | String | Yes | Reference to the data table |
| type | EntityType | Yes | NONE, SEARCH, DUPLICATES, or AUTO_DUPLICATES |
| enabled | Boolean | No | Enable/disable data operations (default: true) |
| locked | Boolean | No | Lock configuration changes (default: false) |
| automatic | Boolean | No | Enable automatic synchronization (default: false) |
Component References
| Component | Required For | Purpose |
|---|---|---|
| indexer | SEARCH, DUPLICATES, AUTO_DUPLICATES | Defines search indexes and duplicate detection keys |
| classifier | DUPLICATES, AUTO_DUPLICATES | Compares records to determine if they are duplicates |
| merger | DUPLICATES, AUTO_DUPLICATES | Defines strategy for creating golden records |
| steward | AUTO_DUPLICATES | Automated duplicate resolution logic |
| pipeline | Optional | Data transformation and cleaning operations |
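The component requirements above can be checked on the client before a configuration is submitted. The following is a minimal sketch, not part of Golden Core: the helper name `missing_components` is hypothetical, and the mapping simply transcribes the "Required For" column.

```python
# Required component references per entity type, transcribed from the
# "Component References" table. This is a client-side convenience only;
# the server performs its own validation.
REQUIRED_COMPONENTS = {
    "NONE": set(),
    "SEARCH": {"indexer"},
    "DUPLICATES": {"indexer", "classifier", "merger"},
    "AUTO_DUPLICATES": {"indexer", "classifier", "merger", "steward"},
}


def missing_components(config: dict) -> list[str]:
    """Return the component references the config still needs for its type."""
    entity_type = config.get("type")
    if entity_type not in REQUIRED_COMPONENTS:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    required = REQUIRED_COMPONENTS[entity_type]
    # A reference counts as missing when the key is absent or empty.
    return sorted(name for name in required if not config.get(name))


if __name__ == "__main__":
    draft = {
        "id": "customers",
        "table": "customer_table",
        "type": "DUPLICATES",
        "indexer": "customer_indexer",
    }
    print(missing_components(draft))
```

An empty list means every reference required for the chosen type is present; it says nothing about whether the referenced resources actually exist.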
Entity Lifecycle
Status States
| Status | Description | Next Actions |
|---|---|---|
| EMPTY | Just created, no data | Load data via synchronization |
| WORKING | Currently processing | Wait for completion |
| READY | Operational and ready | Normal operations |
| INCONSISTENT | Configuration changed | Re-synchronize to update indexes |
| ERROR | Validation or runtime error | Check logs, fix configuration |
Typical Lifecycle Flow
CREATE entity → Status: EMPTY
↓
SYNCHRONIZE (load data) → Status: WORKING
↓
Processing completes → Status: READY
↓
UPDATE configuration → Status: INCONSISTENT
↓
RE-SYNCHRONIZE → Status: READY
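The flow above can be read as a small state machine. The sketch below is illustrative only: the event names are hypothetical, the ERROR status is left out, and it assumes a re-synchronization passes through WORKING before reaching READY, as the initial synchronization does.

```python
# Status transitions taken from the lifecycle flow. Event names are
# hypothetical labels, not API identifiers.
TRANSITIONS = {
    ("EMPTY", "synchronize"): "WORKING",
    ("WORKING", "complete"): "READY",
    ("READY", "update_config"): "INCONSISTENT",
    # Assumption: re-synchronizing goes through WORKING like the first sync.
    ("INCONSISTENT", "synchronize"): "WORKING",
}


def next_status(status: str, event: str) -> str:
    """Return the status that follows `status` when `event` happens."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(f"no transition from {status} on {event}") from None


def replay(events: list[str], start: str = "EMPTY") -> list[str]:
    """Apply events in order and return every status visited."""
    history = [start]
    for event in events:
        history.append(next_status(history[-1], event))
    return history


if __name__ == "__main__":
    flow = ["synchronize", "complete", "update_config", "synchronize", "complete"]
    print(" -> ".join(replay(flow)))
```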
API Operations
Entity Management
Create or Update Entity
Endpoint: POST /entities
Permission: entity.save
{
  "create": true,
  "id": "customers",
  "description": "Customer master data",
  "table": "customer_table",
  "type": "AUTO_DUPLICATES",
  "indexer": "customer_indexer",
  "classifier": "customer_classifier",
  "merger": "customer_merger",
  "steward": "customer_steward",
  "stewardCron": "0 2 * * *",
  "enabled": true,
  "automatic": false
}
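A request carrying this body could be assembled with the Python standard library as sketched below. The base URL is an assumption, authentication is omitted because the source does not describe it, and the function only builds the request without sending it.

```python
import json
import urllib.request

# Assumption: the API is served from this address; adjust to your deployment.
BASE_URL = "http://localhost:8080"


def build_save_entity_request(config: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST /entities request for `config`."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/entities",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    request = build_save_entity_request({
        "create": True,
        "id": "customers",
        "table": "customer_table",
        "type": "SEARCH",
        "indexer": "customer_indexer",
    })
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```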
List All Entities
Endpoint: GET /entities
Permission: entity.list
{
  "entities": [
    {
      "id": "customers",
      "type": "AUTO_DUPLICATES",
      "status": "READY",
      "enabled": true,
      "locked": false,
      "stats": {
        "recordCount": 15000,
        "totalBuckets": 12500,
        "duplicateBuckets": 450
      }
    }
  ]
}
Get Entity Details
Endpoint: GET /entities/{id}
Permission: entity.view
Delete Entity
Endpoint: DELETE /entities/{id}
Permission: entity.delete
Deleting an entity removes all bucket data. The underlying table is preserved.
Control Operations
Lock/Unlock Entity
Endpoint: PUT /entities/{id}/locked/{true|false}
Permission: entity.lock
Locked entities prevent configuration changes but allow data operations.
Enable/Disable Entity
Endpoint: PUT /entities/{id}/enabled/{true|false}
Permission: entity.enable
Disabled entities reject data operations but allow configuration changes.
Set Automatic Mode
Endpoint: PUT /entities/{id}/automatic/{true|false}
Permission: entity.automatic
Automatic Mode (true):
Sources execute on their CRON schedules
Steward runs automatically (AUTO_DUPLICATES only)
Fully hands-off operation
Manual Mode (false):
No automatic processing
Requires explicit synchronization calls
Full control over timing
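The three control endpoints share one path shape, so a single helper can build all of them. This is a sketch with a hypothetical function name; it only formats the path for a PUT request and does not call the API.

```python
from urllib.parse import quote

# The three flags exposed by the control endpoints described above.
CONTROL_FLAGS = ("locked", "enabled", "automatic")


def control_path(entity_id: str, flag: str, value: bool) -> str:
    """Return the PUT path that sets `flag` to `value` for an entity."""
    if flag not in CONTROL_FLAGS:
        raise ValueError(f"unknown control flag: {flag!r}")
    # The API expects the literal strings "true" or "false" in the path.
    literal = "true" if value else "false"
    return f"/entities/{quote(entity_id, safe='')}/{flag}/{literal}"


if __name__ == "__main__":
    print(control_path("customers", "locked", True))
    print(control_path("customers", "automatic", False))
```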
Data Operations
Synchronize Entity
Endpoint: PUT /entities/synchronize
Permission: entity.synchronize
{
  "entity": "customers",
  "loadMask": "FULL",
  "indexClassificationMask": "FULL",
  "sinkMask": "FULL"
}
Load Masks:
FULL - Load all data from all sources
INCREMENTAL - Load only changed records since last execution
CUSTOM - Load data within specific date range (use loadFrom/loadTo)
NONE - Skip loading
Index Classification Masks:
FULL - Re-index and classify all records
CHANGES - Index only new/modified records
NONE - Skip indexing/classification
Sink Masks:
FULL - Export all records to sinks
NONE - Skip sink operations
Returns a task ID for monitoring progress.
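Because each mask accepts only a few values, it can help to validate them before calling the endpoint. The builder below is a sketch under assumptions: the function name is hypothetical, and it assumes loadFrom and loadTo are top-level fields of the same body whose value format the source does not specify, so they are passed through unchanged.

```python
# Allowed mask values as listed for the synchronize endpoint.
LOAD_MASKS = {"FULL", "INCREMENTAL", "CUSTOM", "NONE"}
INDEX_MASKS = {"FULL", "CHANGES", "NONE"}
SINK_MASKS = {"FULL", "NONE"}


def build_sync_payload(entity, load="FULL", index="FULL", sink="FULL",
                       load_from=None, load_to=None) -> dict:
    """Build the body for PUT /entities/synchronize, rejecting unknown masks."""
    for name, value, allowed in (
        ("loadMask", load, LOAD_MASKS),
        ("indexClassificationMask", index, INDEX_MASKS),
        ("sinkMask", sink, SINK_MASKS),
    ):
        if value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
    payload = {
        "entity": entity,
        "loadMask": load,
        "indexClassificationMask": index,
        "sinkMask": sink,
    }
    if load == "CUSTOM":
        # Assumption: the date range travels as loadFrom/loadTo in this body.
        if load_from is None or load_to is None:
            raise ValueError("CUSTOM load mask needs load_from and load_to")
        payload["loadFrom"] = load_from
        payload["loadTo"] = load_to
    return payload


if __name__ == "__main__":
    print(build_sync_payload("customers", load="INCREMENTAL", index="CHANGES"))
```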
Clear Entity
Endpoint: PUT /entities/{id}/clear
Permission: entity.clear
Removes all bucket data. Table records are preserved.
Entity Synchronization Workflow
When you synchronize an entity, the following process occurs:
1. LOAD PHASE
- Execute source queries
- Apply transformations
- Insert/update records in table
↓
2. INDEX & CLASSIFICATION PHASE
- Calculate index keys for records
- Group records into buckets
- Classify buckets (MATCH/REVIEW)
- Update bucket statistics
↓
3. SINK PHASE
- Apply output transformations
- Export records to configured sinks
- Generate audit trail (if enabled)
↓
4. STEWARD PHASE (AUTO_DUPLICATES only)
- Find MATCH buckets
- Merge duplicate records
- Create/update golden records
- Update statistics
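Combining the phase list with the mask descriptions, the phases a given synchronization request would run can be predicted as follows. This is a reading of the documentation rather than the engine's actual scheduling logic, and the phase labels and function name are hypothetical.

```python
def planned_phases(entity_type: str, load_mask: str,
                   index_mask: str, sink_mask: str) -> list[str]:
    """List the workflow phases expected to run, in order."""
    phases = []
    # A mask of NONE skips the corresponding phase.
    if load_mask != "NONE":
        phases.append("LOAD")
    if index_mask != "NONE":
        phases.append("INDEX_AND_CLASSIFICATION")
    if sink_mask != "NONE":
        phases.append("SINK")
    # The steward phase applies to AUTO_DUPLICATES entities only.
    if entity_type == "AUTO_DUPLICATES":
        phases.append("STEWARD")
    return phases


if __name__ == "__main__":
    print(planned_phases("AUTO_DUPLICATES", "FULL", "FULL", "FULL"))
    print(planned_phases("SEARCH", "INCREMENTAL", "CHANGES", "NONE"))
```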
Working with Duplicates
Duplicate Detection Process
Indexing - Records grouped into buckets by similarity
Classification - Algorithm determines if records in bucket are duplicates
Review - Human or automated review of potential matches
Resolution - Merge duplicates into golden record or mark as non-duplicates
Classification Results
| Classification | Meaning | Action |
|---|---|---|
| MATCH | High similarity, likely duplicates | Merge (manual or auto) |
| NON_MATCH | Low similarity, different records | No action needed |
| REVIEW | Medium similarity, uncertain | Human review required |
| IGNORE | Manually marked to skip | No processing |
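One way to picture the high/medium/low split is a pair of thresholds applied to a similarity score. The sketch below is purely illustrative: the score range of 0 to 1 and both default thresholds are assumptions, not values used by Golden Core classifiers, and IGNORE is excluded because it is set manually.

```python
def classify(similarity: float, match_threshold: float = 0.9,
             review_threshold: float = 0.7) -> str:
    """Map a similarity score in [0, 1] to MATCH, REVIEW, or NON_MATCH."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    if review_threshold > match_threshold:
        raise ValueError("review_threshold must not exceed match_threshold")
    if similarity >= match_threshold:
        return "MATCH"        # high similarity, likely duplicates
    if similarity >= review_threshold:
        return "REVIEW"       # medium similarity, needs a human decision
    return "NON_MATCH"        # low similarity, different records


if __name__ == "__main__":
    for score in (0.95, 0.80, 0.20):
        print(score, classify(score))
```

Lowering `match_threshold` moves buckets from REVIEW to MATCH; lowering `review_threshold` moves buckets from NON_MATCH to REVIEW.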
Manual Duplicate Operations
See the Golden Records guide for bucket operations (merge, split, disconnect).
Entity Statistics
Entities track key metrics accessible via API:
| Metric | Description |
|---|---|
| recordCount | Total records in table |
| totalBuckets | Total index buckets created |
| duplicateBuckets | Buckets with potential duplicates |
| duplicatesByIndex | Duplicate counts per index mapping |
Use these to monitor data quality and deduplication progress.
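As an example of such monitoring, the share of buckets that contain potential duplicates can be derived from two of these metrics. The helper name is hypothetical; the input is assumed to be the `stats` object shown in the entity list response.

```python
def duplicate_bucket_ratio(stats: dict) -> float:
    """Return duplicateBuckets / totalBuckets, or 0.0 when there are no buckets."""
    total = stats["totalBuckets"]
    if total == 0:
        return 0.0
    return stats["duplicateBuckets"] / total


if __name__ == "__main__":
    stats = {"recordCount": 15000, "totalBuckets": 12500, "duplicateBuckets": 450}
    print(f"{duplicate_bucket_ratio(stats):.1%} of buckets hold potential duplicates")
```

Tracking this ratio after each synchronization shows whether deduplication work is reducing the backlog.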
Import and Export
Export Entities
Endpoint: POST /entities/export
{
  "ids": ["customers", "products"]
}
Returns entity configurations as JSON for backup or migration.
Import Entities
Endpoint: POST /entities/import
{
  "entities": [
    { /* entity configuration */ }
  ]
}
Imports entity configurations. Validates before importing.
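The export and import bodies are simple enough to build with two small helpers. This is a sketch with hypothetical function names; it assumes each configuration passed to the import helper is a dictionary shaped like the entity configuration described earlier, and it does not model the export response, whose exact shape the source does not show.

```python
def build_export_body(ids: list[str]) -> dict:
    """Build the body for POST /entities/export from a list of entity IDs."""
    unique = list(dict.fromkeys(ids))  # drop repeated IDs, keep first-seen order
    if not unique:
        raise ValueError("at least one entity id is required")
    return {"ids": unique}


def build_import_body(configs: list[dict]) -> dict:
    """Build the body for POST /entities/import from entity configurations."""
    without_id = [pos for pos, cfg in enumerate(configs) if not cfg.get("id")]
    if without_id:
        raise ValueError(f"configurations at positions {without_id} have no id")
    return {"entities": list(configs)}


if __name__ == "__main__":
    print(build_export_body(["customers", "products", "customers"]))
    print(build_import_body([{"id": "customers", "table": "customer_table", "type": "NONE"}]))
```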
Best Practices
Configuration
Descriptive IDs - Use clear business names (customers, products)
Start Simple - Begin with SEARCH type, upgrade after testing
Lock Production - Lock entities in production to prevent accidental changes
Test Thoroughly - Validate configuration with sample data first
Synchronization Strategy
Manual for Testing - Use manual mode during development
Incremental Loading - Enable incremental loads for large datasets
Off-Peak Scheduling - Schedule steward during low-usage hours
Monitor Tasks - Always check task status after synchronization
Duplicate Management
Tune Classifier - Adjust match/review thresholds based on results
Review Buckets - Periodically review REVIEW-classified buckets
Test Merger Logic - Verify golden record quality with sample data
Track Statistics - Monitor duplicate counts over time
Performance
Appropriate Indexes - Use EXACT for high-cardinality fields
Limit Fuzzy Matching - Set reasonable maximumResults for fuzzy searches
Batch Operations - Use FULL sync for initial load, INCREMENTAL thereafter
Monitor Resources - Track database and memory usage during sync
Troubleshooting
Entity Status INCONSISTENT
Cause: Configuration changed after data was loaded
Solution: Run synchronization with FULL index/classification mask
Validation Errors on Create
Cause: Missing or incompatible resources
Solution:
Verify all referenced resources exist (indexer, classifier, etc.)
Check dataset compatibility with indexer/classifier
Ensure CRON expressions are valid
Synchronization Fails
Cause: Entity disabled, locked, or in automatic mode
Solution:
Enable entity if disabled
Unlock if locked (for configuration changes)
Switch to manual mode to run synchronization
No Duplicates Found
Cause: Indexer not configured for duplicates or classifier thresholds too high
Solution:
Check indexer mappings have duplicates: true
Lower classifier match threshold
Verify records have values in indexed fields