Deduplication Semantics
Configuration
Each deduper is created with a DeDuperConfig that controls the Bloom filter parameters:
| Parameter | Type | Default | Min | Max | Description |
|---|---|---|---|---|---|
noOfHashFunctions |
int |
7 | 7 | 13 | Number of hash functions for the Bloom filter. More functions reduce false positives but increase write cost. |
noOfShards |
long |
10,000,000 | 10,000,000 | 150,000,000 | Number of shards for distributing data. Each entity is assigned to one shard via Murmur3 hashing. |
bitsPerShard |
int |
1,000 | 1,000 | 30,000 | Number of bit positions in each shard's Bloom filter. |
deDuperLevel |
DeDuperLevel |
XDC |
— | — | DC (datacenter-local) or XDC (cross-datacenter). |
DeDuperConfig config = DeDuperConfig.builder()
.noOfHashFunctions(10)
.noOfShards(50_000_000L)
.bitsPerShard(20_000)
.deDuperLevel(DeDuperLevel.XDC)
.build();
Immutable after registration
Once a deduper is registered, its configuration cannot be changed. Attempting to re-register with a different config throws SolusException with ErrorCode.DEDUPER_CONFIG_MISMATCH.
Deduplication levels
Solus supports two deduplication levels that control the scope of duplicate detection:
XDC — Cross Datacenter
Deduplication is shared across all datacenters. An entity added in DC1 will be detected as a duplicate in DC2. The storage key omits the farm identifier:
This is the default and recommended mode for most workloads.
DC — Datacenter Local
Deduplication is isolated to each datacenter. Each farm maintains independent Bloom filter data. The storage backend uses farm-specific sets/tables:
| Backend | Key structure |
|---|---|
| Aerospike | Set name: <farm>_<setName> |
| HBase | Table name: <farm>_<tableName> |
Use DC when you need independent deduplication per region or when cross-DC replication introduces unacceptable latency.
DC consistency
With DC level, the same entity can be added independently in different datacenters without conflict. This is by design — each DC maintains its own Bloom filter state.
Capacity planning
The maximum key space is calculated as:
With maximum configuration (150M shards × 30K bits), Solus can handle up to 4.5 trillion unique keys.
False positive rates
The false positive probability for a Bloom filter is approximately:
Where:
k= number of hash functions (noOfHashFunctions)n= number of elements insertedm= total number of bits (noOfShards × bitsPerShard)
The optimal number of hash functions to minimize false positives:
Recommended configurations
Performance benchmarks
| Configuration | Keys | False Positive Rate |
|---|---|---|
| 7 hash functions, 5K bits, 200K shards | 100M | ~0.8% |
| 7 hash functions, 20K bits, 500K shards | 1B | ~0.6% |
API reference
checkAbsence — single entity
| Method | Description |
|---|---|
checkAbsence(String deDuperName, T entity) |
Returns true if the entity is absent (not seen before). Returns false if the entity was previously added. |
checkAbsence — batch
| Method | Description |
|---|---|
checkAbsence(String deDuperName, Set<T> entities) |
Returns a Map<T, Boolean> — true for absent entities, false for seen entities. Entities are grouped by shard for efficient batch reads. |
add — single entity
| Method | Description |
|---|---|
add(String deDuperName, T entity, long ttlInMs) |
Adds the entity to the Bloom filter with the specified TTL in milliseconds. |
add — batch
| Method | Description |
|---|---|
add(String deDuperName, Set<T> entities, long ttlInMs) |
Adds all entities to the Bloom filter. Entities are grouped by shard for efficient batch writes. |
addIfAbsent — single entity
| Method | Description |
|---|---|
addIfAbsent(String deDuperName, T entity, long ttlInMs) |
Checks if the entity is absent, and if so, adds it. Returns true if the entity was added. |
addIfAbsent — batch
| Method | Description |
|---|---|
addIfAbsent(String deDuperName, Set<T> entities, long ttlInMs) |
Returns a Map<T, Boolean> — true for entities that were added (were absent). |
Registration methods
| Method | Description |
|---|---|
register(String name) |
Register a deduper with default configuration. |
register(String name, DeDuperConfig config) |
Register a deduper with custom configuration. |
unregister(String name) |
Mark a deduper as inactive. |
getDeDuper(String name) |
Retrieve deduper metadata from the store. |
getCachedDeDuper(String name) |
Retrieve deduper metadata from the Caffeine cache. |
getActiveDeDupers() |
Retrieve all active dedupers as Map<String, DeDuper>. |
Error codes
All errors are surfaced as SolusException with one of the following codes:
| Code | Description |
|---|---|
DEDUPER_NOT_FOUND |
The requested deduper does not exist or is inactive. |
INVALID_CONFIG |
DeDuperConfig validation failed (values out of allowed range). |
DEDUPER_CONFIG_MISMATCH |
Attempting to re-register an existing deduper with a different configuration. |
AEROSPIKE_ERROR |
Aerospike backend returned an error during a data operation. |
HBASE_ERROR |
HBase backend returned an error during a data operation. |
TABLE_CREATION_ERROR |
HBase table creation failed during storage context initialization. |
CACHE_ERROR |
Internal Caffeine cache error. |
INTERNAL_ERROR |
Catch-all for unexpected failures. |
Exception propagation
SolusException.propagate(throwable) unwraps nested SolusException instances so you always receive the original error code rather than a wrapped INTERNAL_ERROR.
Thread safety
SolusEngineis thread-safe — it can be shared across threads.- The underlying
DeDuperCrudCommandsuses a CaffeineAsyncLoadingCachefor thread-safe cached reads. ShardCalculatorand Bloom filter hash computations are stateless and thread-safe.