Solus
Solus is a high-performance, TTL-based deduplication library designed for deduplication at the scale of hundreds of billions of entities. It uses sharded probabilistic data structures (Bloom filters) to provide memory-efficient duplicate detection with configurable accuracy.
Why Solus?
In large-scale distributed systems, duplicate detection is a fundamental challenge — whether for coupon redemption, event deduplication, request idempotency, or notification throttling. Solus provides a lightweight, pluggable deduplication engine that handles billions of unique entities with a minimal memory footprint and automatic TTL-based expiration.
Key features
- Massive scale — handle billions of unique entities using sharded Bloom filters.
- TTL support — automatic expiration of entries based on time-to-live.
- Pluggable storage backends — Aerospike and HBase out of the box.
- Multi-datacenter support — `DC` (datacenter-local) and `XDC` (cross-datacenter) deduplication levels.
- Configurable accuracy — tune false positive rates by adjusting hash functions, shards, and bits per shard.
- Batch operations — efficient bulk `add` and `checkAbsence` operations.
- Thread-safe — safe for concurrent use in multi-threaded applications.
- Atomic operations — `addIfAbsent` for atomic check-and-add patterns.
- Built-in caching — Caffeine-based async cache for deduper metadata lookups.
- Built-in retry — configurable retry with backoff on transient Aerospike failures.
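To make the `addIfAbsent` and TTL semantics concrete, here is a minimal in-memory sketch of the check-and-add pattern. This is not the Solus API — the class name, constructor, and map-based storage are illustrative stand-ins for the real sharded Bloom filter backend.

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal illustration of addIfAbsent-style TTL deduplication semantics.
// NOT the Solus implementation: a plain map stands in for sharded Bloom filters.
public class TtlDedupSketch {
    private final ConcurrentHashMap<String, Long> expiryByKey = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlDedupSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns true if the key was absent (or expired); false if it is a live duplicate. */
    public boolean addIfAbsent(String key) {
        long now = System.currentTimeMillis();
        boolean[] added = {false};
        // compute() is atomic per key, giving the check-and-add in one step.
        expiryByKey.compute(key, (k, expiry) -> {
            if (expiry == null || expiry <= now) {
                added[0] = true;           // first sighting, or prior entry expired
                return now + ttlMillis;    // claim the key with a fresh TTL
            }
            return expiry;                 // live duplicate: keep the existing expiry
        });
        return added[0];
    }
}
```

The same call shape applies at scale: the first `addIfAbsent("coupon-42")` succeeds, subsequent calls within the TTL report a duplicate.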
How it works
┌─────────────────────────────────────────────────────────────────┐
│ SolusEngine<T> │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ DeDuperCrudCommands │ │ DeDuperDataCommands │ │
│ └──────────┬──────────┘ └──────────┬──────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ StorageContext │ │
│ │ ┌─────────────────────┐ ┌─────────────────────────┐ │ │
│ │ │ AerospikeStorage │ │ HBaseStorage │ │ │
│ │ │ Context │ │ Context │ │ │
│ │ └─────────┬───────────┘ └───────────┬─────────────┘ │ │
│ └────────────┼──────────────────────────┼─────────────────┘ │
└───────────────┼──────────────────────────┼─────────────────────┘
│ │
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Aerospike │ │ HBase │
│ Cluster │ │ Cluster │
└───────────────┘ └───────────────┘
Deduplication pipeline
- Sharding — entities are hashed (Murmur3-128) and distributed across millions of shards.
- Bloom filters — each shard maintains a space-efficient Bloom filter with configurable bit positions.
- TTL management — the storage backend handles automatic expiration of entries.
- Hash functions — multiple MD5-based hash functions reduce false positive rates.
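The hashing steps of the pipeline can be sketched as follows. Solus uses Murmur3-128 for sharding, but to keep this sketch runnable on the plain JDK it uses MD5 (which the document names for the Bloom filter hash functions) for both steps; the parameter names `numShards`, `numHashes`, and `bitsPerShard` are illustrative, not the library's configuration keys.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the pipeline's two hashing steps: pick a shard for the entity,
// then derive k bit positions inside that shard's Bloom filter.
public class BloomPipelineSketch {
    static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available in the JDK
        }
    }

    /** Step 1 (sharding): hash the entity and map it to one of numShards shards. */
    static long shardOf(String entity, long numShards) {
        byte[] digest = md5(entity.getBytes(StandardCharsets.UTF_8));
        long h = ByteBuffer.wrap(digest).getLong(); // first 8 of 16 digest bytes
        return Math.floorMod(h, numShards);
    }

    /** Steps 2 and 4: derive numHashes bit positions by salting the entity
     *  with the hash-function index, so each "hash function" differs. */
    static int[] bitPositions(String entity, int numHashes, int bitsPerShard) {
        int[] positions = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            byte[] digest = md5((i + ":" + entity).getBytes(StandardCharsets.UTF_8));
            long h = ByteBuffer.wrap(digest).getLong();
            positions[i] = (int) Math.floorMod(h, (long) bitsPerShard);
        }
        return positions;
    }
}
```

Both steps are deterministic, so the same entity always lands on the same shard and bit positions — which is what makes a later `checkAbsence` or `addIfAbsent` find the earlier `add`.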
What to read next
- Getting Started — dependency setup, prerequisites, building locally.
- Usage — initialization, registration, data operations, error handling.
- Deduplication Semantics — configuration, capacity planning, deduplication levels, false positive rates.
- Storage Backends — Aerospike and HBase internals.