E-commerce Search Infrastructure - Case Study

Senior Backend Engineer · 2024 · 3 min read

Rebuilt search infrastructure handling 10M+ queries/day, improving search click-through rate by 45% and reducing p99 latency to under 100ms.

Overview

Led the redesign of search infrastructure for an e-commerce platform, replacing a basic database-backed search with a modern, ML-enhanced search system.

Problem

The existing search was a simple LIKE query against PostgreSQL. It couldn't handle typos, synonyms, or relevance ranking. Search conversion rates were poor, and the system couldn't scale beyond 1000 QPS without significant latency degradation.

Constraints

  • Must index 2M+ products with real-time updates
  • p99 latency must be under 200ms
  • Budget constraints ruled out managed search services
  • Team had no prior Elasticsearch experience

Approach

Implemented Elasticsearch as the search backend with a custom relevance tuning pipeline. Built a real-time indexing system using CDC (Change Data Capture) to keep search index synchronized with the product database. Added query understanding layer for typo correction and synonym expansion.
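A minimal sketch of the kind of Elasticsearch query body this approach enables, combining typo tolerance with field boosting. The field names and boost values here are illustrative assumptions, not the production configuration:

```python
def build_search_query(user_query: str) -> dict:
    """Build an Elasticsearch query body with typo tolerance and field boosts.

    Field names (title, brand, description) and boost values are
    illustrative assumptions, not the production relevance config.
    """
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                # Boost matches in title and brand over the long description
                "fields": ["title^3", "brand^2", "description"],
                "fuzziness": "AUTO",  # tolerate small typos per term
                "operator": "and",
            }
        },
        "size": 20,
    }
```

Passing a body like this to the `_search` endpoint lets Elasticsearch handle misspellings via edit distance while still ranking title matches highest.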

Key Decisions

Use Elasticsearch over Algolia

Reasoning:

Algolia's pricing at our scale was prohibitive ($50k+/year). Elasticsearch gave us more control over relevance tuning and the ability to run complex aggregations for faceted search.

Alternatives considered:
  • Algolia managed search
  • Apache Solr
  • Meilisearch

Implement CDC-based indexing instead of dual writes

Reasoning:

Dual writes are error-prone and can lead to inconsistencies. CDC from PostgreSQL WAL ensures the search index is eventually consistent with the source of truth without application code changes.

Alternatives considered:
  • Application-level dual writes
  • Periodic batch reindexing
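The consumer side of this pipeline is mostly a translation step: each change event read from Kafka becomes one Elasticsearch bulk action. A simplified sketch, assuming the standard Debezium envelope (`payload.op` in "c"/"u"/"d"/"r", row images under `payload.after` and `payload.before`) and collapsing the bulk action and document into one dict for readability:

```python
def event_to_es_action(event: dict, index: str = "products") -> dict:
    """Translate a Debezium-style change event into an Elasticsearch action.

    Sketch only: assumes the standard Debezium envelope and returns the
    action metadata and document together; the real bulk API sends them
    as separate newline-delimited JSON lines.
    """
    payload = event["payload"]
    op = payload["op"]
    if op == "d":
        # Row deleted upstream: remove the document from the index
        doc_id = payload["before"]["id"]
        return {"delete": {"_index": index, "_id": doc_id}}
    # Create ("c"), update ("u"), or snapshot read ("r"): upsert the new row
    row = payload["after"]
    return {"index": {"_index": index, "_id": row["id"]}, "doc": row}
```

Because the event stream is ordered per key, replaying it after a failure converges the index to the same state as the source table.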

Build custom query understanding layer

Reasoning:

Off-the-shelf solutions didn't handle our domain-specific vocabulary well. Custom layer allowed us to incorporate product taxonomy and user behavior signals.
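The shape of such a layer can be sketched in a few lines: correct unknown tokens against a known vocabulary, then expand domain synonyms. The synonym map and vocabulary below are illustrative stand-ins; the real ones were derived from the product taxonomy and behavior signals:

```python
from difflib import get_close_matches

# Illustrative domain data; the production versions were built from the
# product taxonomy and user behavior signals.
SYNONYMS = {"sneakers": ["trainers"], "tv": ["television"]}
VOCABULARY = {"sneakers", "television", "laptop", "headphones", "tv"}

def understand_query(raw: str) -> list[str]:
    """Correct typos and expand synonyms, returning candidate query terms."""
    terms = []
    for token in raw.lower().split():
        if token not in VOCABULARY:
            # Crude typo correction: snap to the closest known word
            matches = get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
            token = matches[0] if matches else token
        terms.append(token)
        terms.extend(SYNONYMS.get(token, []))  # synonym expansion
    return terms
```

The expanded term list then feeds the Elasticsearch query, so a search for a misspelled or colloquial term still reaches the right products.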

Tech Stack

  • Elasticsearch
  • Python
  • Debezium
  • Kafka
  • PostgreSQL
  • Redis
  • Kubernetes

Result & Impact

  • Search Conversion: 45% improvement in click-through rate
  • p99 Latency: under 100ms (down from 800ms)
  • Query Capacity: 10,000+ QPS (up from 1,000)
  • Zero-Result Searches: reduced by 60%

Search went from being a pain point to a competitive advantage. The merchandising team can now tune relevance without engineering involvement. The faceted search and autocomplete features have significantly improved the shopping experience.

Learnings

  • Relevance tuning is an ongoing process, not a one-time setup—build tools for non-engineers to iterate
  • CDC is powerful but adds operational complexity—invest in monitoring and alerting
  • Search is a product, not just a feature—dedicate resources to continuous improvement
  • Elasticsearch cluster management requires dedicated expertise

Relevance Tuning Journey

The initial Elasticsearch deployment actually performed worse than the PostgreSQL search for some queries. We spent significant time tuning BM25 parameters, field boosting, and function scores.

The breakthrough came when we incorporated click-through data into relevance scoring. Products that users actually clicked on after searching got boosted, creating a feedback loop that continuously improved results.
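One common way to wire such a signal into Elasticsearch is a `function_score` wrapper over the base query. A sketch under the assumption that each indexed product carries a precomputed `click_score` field (e.g., a smoothed click-through rate); the factor, modifier, and boost mode are illustrative, not the tuned production values:

```python
def boost_by_clicks(base_query: dict) -> dict:
    """Wrap a query in a function_score that boosts well-clicked products.

    Assumes documents carry a precomputed `click_score` field; the factor,
    modifier, and boost_mode values are illustrative, not the tuned ones.
    """
    return {
        "query": {
            "function_score": {
                "query": base_query,
                "field_value_factor": {
                    "field": "click_score",
                    "modifier": "log1p",  # dampen runaway popular items
                    "factor": 1.2,
                    "missing": 0,  # products with no clicks get no boost
                },
                "boost_mode": "sum",  # add the click signal to the text score
            }
        }
    }
```

Refreshing `click_score` on a schedule closes the loop: better-ranked products earn clicks, and clicked products rank better.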

Operational Lessons

Running Elasticsearch at scale taught us a lot about JVM tuning, shard management, and cluster topology. We had several incidents early on due to GC pauses and unbalanced shards. Building comprehensive monitoring and runbooks was essential.