E-commerce Search Infrastructure - Case Study

Senior Backend Engineer · 2024 · 3 min read

Rebuilt search infrastructure handling 10M+ queries/day, improving search click-through rate by 45% and reducing p99 latency to under 100ms.

Overview

Led the redesign of search infrastructure for an e-commerce platform, replacing a basic database-backed search with a modern, ML-enhanced search system.

Problem

The existing search was a simple LIKE query against PostgreSQL. It couldn't handle typos, synonyms, or relevance ranking. Search conversion rates were poor, and the system couldn't scale beyond 1000 QPS without significant latency degradation.

Constraints

  • Must index 2M+ products with real-time updates
  • p99 latency must be under 200ms
  • Budget constraints ruled out managed search services
  • Team had no prior Elasticsearch experience

Approach

Implemented Elasticsearch as the search backend with a custom relevance tuning pipeline. Built a real-time indexing system using CDC (Change Data Capture) to keep search index synchronized with the product database. Added query understanding layer for typo correction and synonym expansion.
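A minimal sketch of the kind of Elasticsearch query body this approach enables, combining typo tolerance with field boosting. The field names and boost values here are illustrative assumptions, not the production configuration:

```python
def build_search_query(user_query: str) -> dict:
    """Build an Elasticsearch query body with typo tolerance and field boosts.

    Field names (title, brand, description) and boost values are
    illustrative assumptions, not the production relevance config.
    """
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                # Boost matches in title and brand over the long description
                "fields": ["title^3", "brand^2", "description"],
                "fuzziness": "AUTO",  # tolerate small typos per term
                "operator": "and",
            }
        },
        "size": 20,
    }
```

Passing a body like this to the `_search` endpoint lets Elasticsearch handle misspellings via edit distance while still ranking title matches highest.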

Key Decisions

Use Elasticsearch over Algolia

Reasoning:

Algolia's pricing at our scale was prohibitive ($50k+/year). Elasticsearch gave us more control over relevance tuning and the ability to run complex aggregations for faceted search.

Alternatives considered:
  • Algolia managed search
  • Apache Solr
  • Meilisearch

Implement CDC-based indexing instead of dual writes

Reasoning:

Dual writes are error-prone and can lead to inconsistencies. CDC from PostgreSQL WAL ensures the search index is eventually consistent with the source of truth without application code changes.

Alternatives considered:
  • Application-level dual writes
  • Periodic batch reindexing
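The consumer side of this pipeline is mostly a translation step: each change event read from Kafka becomes one Elasticsearch bulk action. A simplified sketch, assuming the standard Debezium envelope (`payload.op` in "c"/"u"/"d"/"r", row images under `payload.after` and `payload.before`) and collapsing the bulk action and document into one dict for readability:

```python
def event_to_es_action(event: dict, index: str = "products") -> dict:
    """Translate a Debezium-style change event into an Elasticsearch action.

    Sketch only: assumes the standard Debezium envelope and returns the
    action metadata and document together; the real bulk API sends them
    as separate newline-delimited JSON lines.
    """
    payload = event["payload"]
    op = payload["op"]
    if op == "d":
        # Row deleted upstream: remove the document from the index
        doc_id = payload["before"]["id"]
        return {"delete": {"_index": index, "_id": doc_id}}
    # Create ("c"), update ("u"), or snapshot read ("r"): upsert the new row
    row = payload["after"]
    return {"index": {"_index": index, "_id": row["id"]}, "doc": row}
```

Because the event stream is ordered per key, replaying it after a failure converges the index to the same state as the source table.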

Build custom query understanding layer

Reasoning:

Off-the-shelf solutions didn't handle our domain-specific vocabulary well. Custom layer allowed us to incorporate product taxonomy and user behavior signals.
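The shape of such a layer can be sketched in a few lines: correct unknown tokens against a known vocabulary, then expand domain synonyms. The synonym map and vocabulary below are illustrative stand-ins; the real ones were derived from the product taxonomy and behavior signals:

```python
from difflib import get_close_matches

# Illustrative domain data; the production versions were built from the
# product taxonomy and user behavior signals.
SYNONYMS = {"sneakers": ["trainers"], "tv": ["television"]}
VOCABULARY = {"sneakers", "television", "laptop", "headphones", "tv"}

def understand_query(raw: str) -> list[str]:
    """Correct typos and expand synonyms, returning candidate query terms."""
    terms = []
    for token in raw.lower().split():
        if token not in VOCABULARY:
            # Crude typo correction: snap to the closest known word
            matches = get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
            token = matches[0] if matches else token
        terms.append(token)
        terms.extend(SYNONYMS.get(token, []))  # synonym expansion
    return terms
```

The expanded term list then feeds the Elasticsearch query, so a search for a misspelled or colloquial term still reaches the right products.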

Tech Stack

  • Elasticsearch
  • Python
  • Debezium
  • Kafka
  • PostgreSQL
  • Redis
  • Kubernetes

Result & Impact

  • Search Conversion: 45% improvement in click-through rate
  • p99 Latency: under 100ms (down from 800ms)
  • Query Capacity: 10,000+ QPS (up from 1,000)
  • Zero-Result Searches: reduced by 60%

Search went from being a pain point to a competitive advantage. The merchandising team can now tune relevance without engineering involvement. The faceted search and autocomplete features have significantly improved the shopping experience.

Learnings

  • Relevance tuning is an ongoing process, not a one-time setup—build tools for non-engineers to iterate
  • CDC is powerful but adds operational complexity—invest in monitoring and alerting
  • Search is a product, not just a feature—dedicate resources to continuous improvement
  • Elasticsearch cluster management requires dedicated expertise

Relevance Tuning Journey

The initial Elasticsearch deployment actually performed worse than the PostgreSQL search for some queries. We spent significant time tuning BM25 parameters, field boosting, and function scores.

The breakthrough came when we incorporated click-through data into relevance scoring. Products that users actually clicked on after searching got boosted, creating a feedback loop that continuously improved results.
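One common way to wire such a signal into Elasticsearch is a `function_score` wrapper over the base query. A sketch under the assumption that each indexed product carries a precomputed `click_score` field (e.g., a smoothed click-through rate); the factor, modifier, and boost mode are illustrative, not the tuned production values:

```python
def boost_by_clicks(base_query: dict) -> dict:
    """Wrap a query in a function_score that boosts well-clicked products.

    Assumes documents carry a precomputed `click_score` field; the factor,
    modifier, and boost_mode values are illustrative, not the tuned ones.
    """
    return {
        "query": {
            "function_score": {
                "query": base_query,
                "field_value_factor": {
                    "field": "click_score",
                    "modifier": "log1p",  # dampen runaway popular items
                    "factor": 1.2,
                    "missing": 0,  # products with no clicks get no boost
                },
                "boost_mode": "sum",  # add the click signal to the text score
            }
        }
    }
```

Refreshing `click_score` on a schedule closes the loop: better-ranked products earn clicks, and clicked products rank better.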

Operational Lessons

Running Elasticsearch at scale taught us a lot about JVM tuning, shard management, and cluster topology. We had several incidents early on due to GC pauses and unbalanced shards. Building comprehensive monitoring and runbooks was essential.