Search Ranking & Relevance

Returning search results is easy. Returning the right results in the right order is the hard part. Ranking transforms a set of matching documents into an ordered list where the most useful results appear first.

Scoring Fundamentals

Every search engine assigns a relevance score to each matching document, then sorts by that score. The base score comes from text matching algorithms like BM25, but production systems layer additional signals on top.

Multi-Signal Scoring

Final score = text_relevance * w1 + freshness * w2 + popularity * w3 + quality * w4

Example for a news search:
  text_relevance (BM25):  0.85 (query terms match headline and body)
  freshness:              0.95 (published 2 hours ago)
  popularity:             0.70 (moderate click-through rate)
  quality:                0.90 (reputable source)
  
  Weighted: 0.85*0.4 + 0.95*0.3 + 0.70*0.15 + 0.90*0.15 = 0.865
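
The weighted sum above can be reproduced in a few lines. This is a minimal sketch: the function name final_score and the dictionary layout are illustrative assumptions, while the weights and signal values are copied from the news example.

```python
# Weights from the news-search example; they sum to 1.0.
WEIGHTS = {
    "text_relevance": 0.40,
    "freshness": 0.30,
    "popularity": 0.15,
    "quality": 0.15,
}


def final_score(signals: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of signals, each assumed to be normalized to [0, 1]."""
    return sum(weights[name] * signals[name] for name in weights)


article = {
    "text_relevance": 0.85,  # query terms match headline and body
    "freshness": 0.95,       # published 2 hours ago
    "popularity": 0.70,      # moderate click-through rate
    "quality": 0.90,         # reputable source
}
score = final_score(article)
print(round(score, 3))  # 0.865
```

Keeping every signal on the same 0 to 1 scale is what makes the weights comparable; an unnormalized raw BM25 score would otherwise swamp the other terms.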

Google combines hundreds of ranking signals including PageRank (link authority), content quality, page speed, mobile friendliness, and user engagement metrics. No single signal dominates; the combination is what produces useful rankings.

Field-Level Scoring

Different fields in a document carry different importance. A match in the title should score higher than a match in the description, which should score higher than a match in the body.

Product search field weights:
  title:       boost 3.0
  brand:       boost 2.5
  category:    boost 2.0
  description: boost 1.0
  reviews:     boost 0.5

Query: "nike running shoes"
  Product A: "Nike" in title, "running shoes" in title     --> high score
  Product B: "Nike" in brand, "running" in description     --> medium score
  Product C: "running shoes" in review text                 --> low score
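
As a rough illustration of how field boosts separate the three products, the sketch below replaces BM25 with a simple "term present in field" check, which is an assumption made to keep the example short. The product records and the field_score helper are hypothetical; the boost values are the ones listed above.

```python
FIELD_BOOSTS = {
    "title": 3.0,
    "brand": 2.5,
    "category": 2.0,
    "description": 1.0,
    "reviews": 0.5,
}


def field_score(query: str, doc: dict) -> float:
    """For each query term, add the highest boost among fields containing it."""
    total = 0.0
    for term in query.lower().split():
        boosts = [
            FIELD_BOOSTS[field]
            for field, text in doc.items()
            if field in FIELD_BOOSTS and term in text.lower().split()
        ]
        total += max(boosts, default=0.0)
    return total


product_a = {"title": "Nike Air Zoom running shoes", "brand": "Nike"}
product_b = {"title": "Pegasus 40", "brand": "Nike",
             "description": "Lightweight trainer for daily running"}
product_c = {"title": "Trail socks", "reviews": "great with my running shoes"}

query = "nike running shoes"
scores = [field_score(query, p) for p in (product_a, product_b, product_c)]
print(scores)  # [9.0, 3.5, 1.0]
```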

Boosting

Boosting increases or decreases the score of specific results based on criteria beyond text relevance.

Query-Time Boosting

Apply boost factors when executing the query, without changing the index.

Boosting strategies:
  
  Recency boost:
    Documents from the last 24 hours: boost 2.0
    Documents from the last week: boost 1.5
    Documents older than a week: boost 1.0
  
  Popularity boost:
    Items with 1000+ reviews: boost 1.8
    Items with 100+ reviews: boost 1.3
    Items with fewer reviews: boost 1.0
  
  Availability boost:
    In stock: boost 2.0
    Backordered: boost 1.0
    Out of stock: boost 0.1 (demote but still show)
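
One way to apply these tiers at query time is as multipliers on the text score. The sketch below assumes the boosts combine by multiplication and that anything older than a week gets a neutral 1.0; neither choice is prescribed by the tiers themselves, and all function names are illustrative.

```python
def recency_boost(age_hours: float) -> float:
    if age_hours <= 24:
        return 2.0
    if age_hours <= 7 * 24:
        return 1.5
    return 1.0  # neutral: older documents are neither promoted nor demoted


def popularity_boost(review_count: int) -> float:
    if review_count >= 1000:
        return 1.8
    if review_count >= 100:
        return 1.3
    return 1.0


AVAILABILITY_BOOST = {"in_stock": 2.0, "backordered": 1.0, "out_of_stock": 0.1}


def boosted_score(text_score: float, age_hours: float,
                  review_count: int, availability: str) -> float:
    """Multiply the text score by each boost factor (assumed combination rule)."""
    return (text_score
            * recency_boost(age_hours)
            * popularity_boost(review_count)
            * AVAILABILITY_BOOST[availability])


fresh_popular = boosted_score(1.0, age_hours=2, review_count=1500,
                              availability="in_stock")
old_unavailable = boosted_score(1.0, age_hours=24 * 90, review_count=12,
                                availability="out_of_stock")
```

Because the factors multiply, a single strong demotion such as out of stock outweighs several promotions, which is usually the intent.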

Negative Boosting (Demotion)

Some results match textually but should appear lower. Demoting is as important as promoting.

Demotion examples:
  Low-quality content: boost 0.3
  Clickbait titles detected: boost 0.2
  Duplicate or near-duplicate content: boost 0.1
  User has already seen this result: boost 0.5

Amazon product search uses boosting heavily. Products with Prime eligibility, high ratings, and strong sales velocity receive boosts. Products with quality complaints or high return rates are demoted.

Function Score Queries

Elasticsearch function_score queries allow combining text relevance with arbitrary scoring functions.

Function score example:
  Base query: match "wireless headphones"
  
  Functions applied:
    1. Field value factor: multiply by log(1 + sales_count)
    2. Decay function: exponential decay based on days since listing
    3. Script score: custom business logic (margin, inventory level)
    4. Weight: boost verified sellers by 1.5x
  
  Combination: multiply all function scores with text relevance
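
The example above might translate into a request body along the following lines, written here as a Python dictionary. The field names (title, sales_count, listed_at, margin, verified_seller), the 30-day decay scale, and the script body are all hypothetical; only the overall function_score structure follows the Elasticsearch query DSL.

```python
import json

function_score_query = {
    "query": {
        "function_score": {
            # Base query: text relevance for the search terms
            "query": {"match": {"title": "wireless headphones"}},
            "functions": [
                # 1. Field value factor: log1p is the base-10 log of (1 + value)
                {"field_value_factor": {
                    "field": "sales_count", "modifier": "log1p", "missing": 1}},
                # 2. Exponential decay on listing age; the score factor halves
                #    every 30 days (assumed scale)
                {"exp": {"listed_at": {
                    "origin": "now", "scale": "30d", "decay": 0.5}}},
                # 3. Script score: placeholder for custom business logic
                {"script_score": {"script": {
                    "source": "1 + doc['margin'].value"}}},
                # 4. Weight: boost verified sellers by 1.5x
                {"filter": {"term": {"verified_seller": True}}, "weight": 1.5},
            ],
            "score_mode": "multiply",  # how the function scores combine
            "boost_mode": "multiply",  # how that result combines with the query score
        }
    }
}

body = json.dumps(function_score_query, indent=2)
```

Note that under a pure multiply combination an item with zero sales gets log(1 + 0) = 0 and drops to a score of zero, so a real configuration would likely offset or cap that factor.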

Faceted Search

Faceted search lets users narrow results by categories, attributes, or ranges. It answers "what can I filter by?" alongside "what matches my query?"

Query: "laptop"
Results: 5,847 matching products

Facets returned alongside results:
  Brand:
    Dell (1,203)
    HP (987)
    Lenovo (856)
    Apple (743)
    Asus (621)
  
  Price range:
    Under $500 (1,456)
    $500-$1000 (2,312)
    $1000-$2000 (1,654)
    Over $2000 (425)
  
  Screen size:
    13 inch (1,102)
    15 inch (2,845)
    17 inch (1,900)
  
  Rating:
    4 stars and up (3,211)
    3 stars and up (4,567)

Computing facet counts requires aggregating across all matching documents, not just the top page of results. This is expensive.
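
A minimal in-memory sketch shows why the cost scales with the full match set: the loop has to touch every matching document for every facet field. The facet_counts helper and the sample documents are illustrative, not taken from any particular engine.

```python
from collections import Counter


def facet_counts(matching_docs: list, facet_fields: list) -> dict:
    """Count facet values across ALL matching documents."""
    counts = {field: Counter() for field in facet_fields}
    for doc in matching_docs:  # every match, not just the first page
        for field in facet_fields:
            value = doc.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


matches = [
    {"brand": "Dell", "screen": "15 inch"},
    {"brand": "HP", "screen": "15 inch"},
    {"brand": "Dell", "screen": "13 inch"},
    {"brand": "Lenovo"},  # missing screen size is simply skipped
]
all_counts = facet_counts(matches, ["brand", "screen"])
page_counts = facet_counts(matches[:2], ["brand", "screen"])  # top page only

print(all_counts["brand"])   # Counter({'Dell': 2, 'HP': 1, 'Lenovo': 1})
print(page_counts["brand"])  # Counter({'Dell': 1, 'HP': 1})
```

Counting only the top page would under-report Dell and hide Lenovo entirely, which is why facets cannot be derived from the visible results.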

Performance considerations:
  - Facet computation scans all matching documents (not just top 10)
  - Cost scales with matching docs x facet dimensions: 5,847 docs across 20 facets is roughly 117,000 value lookups per query
  - Solution: approximate counts using sampling for large result sets
  - Cache facet results for popular queries
  
Elasticsearch aggregations:
  Terms aggregation: count by category
  Range aggregation: count by price buckets
  Histogram aggregation: count by fixed intervals

eBay computes facets across billions of listings. They use a combination of pre-computed facet counts for popular categories and real-time aggregation for long-tail queries.

When a user selects a facet, the system must decide how it interacts with other selected facets.

Facet interaction models:
  
  AND within same facet:
    Brand: Dell AND HP --> products that are both Dell AND HP (empty set)
    This is wrong for most use cases.
  
  OR within same facet (multi-select):
    Brand: Dell OR HP --> products from either brand
    This is the expected behavior.
  
  AND across different facets:
    Brand: Dell AND Price: Under $500 --> Dell laptops under $500
    Selections in different facet groups are combined with AND.
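
The multi-select model (OR within a facet, AND across facets) fits in a single predicate. This is a sketch over in-memory dictionaries; the matches_facets name and the sample laptops are assumptions for illustration.

```python
def matches_facets(doc: dict, selected: dict) -> bool:
    """OR within a facet group (set membership), AND across groups (all)."""
    return all(
        doc.get(facet) in values
        for facet, values in selected.items()
        if values  # an empty selection places no constraint
    )


laptops = [
    {"id": 1, "brand": "Dell", "price_range": "Under $500"},
    {"id": 2, "brand": "HP", "price_range": "Under $500"},
    {"id": 3, "brand": "Dell", "price_range": "$500-$1000"},
    {"id": 4, "brand": "Apple", "price_range": "Under $500"},
]
selected = {"brand": {"Dell", "HP"}, "price_range": {"Under $500"}}

hits = [d["id"] for d in laptops if matches_facets(d, selected)]
print(hits)  # [1, 2]
```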

Personalization

Personalization adjusts rankings based on what the system knows about the individual user.

Personalization Signals

User-level signals:
  - Purchase history (bought running shoes before --> boost running gear)
  - Browse history (viewed gaming laptops --> boost gaming results)
  - Location (show nearby stores, local inventory)
  - Language and locale preferences
  - Price sensitivity (budget vs premium buyer)
  - Brand affinity (repeatedly buys from certain brands)

Personalization Architecture

Search flow with personalization:
  1. User submits query "shoes"
  2. Search service retrieves user profile from user features store
  3. Base search returns candidate results with BM25 scores
  4. Personalization layer re-ranks based on user features
  5. Return personalized top results

  User A (runner): running shoes ranked first
  User B (office worker): dress shoes ranked first
  Same query, different results
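
Steps 3 and 4 of the flow can be sketched as a re-ranking pass over the base candidates. The affinity profiles, the 0.5 weight, and the multiplicative adjustment are all assumptions chosen for illustration; a production layer would typically use a trained model instead.

```python
def personalize(candidates: list, user_affinity: dict, weight: float = 0.5) -> list:
    """Re-rank candidates by bm25 * (1 + weight * affinity for the category)."""
    def adjusted(candidate: dict) -> float:
        affinity = user_affinity.get(candidate["category"], 0.0)
        return candidate["bm25"] * (1.0 + weight * affinity)

    return sorted(candidates, key=adjusted, reverse=True)


# Base search results for the query "shoes" (hypothetical scores)
candidates = [
    {"title": "Leather dress shoes", "category": "dress", "bm25": 1.00},
    {"title": "Trail running shoes", "category": "running", "bm25": 0.95},
]

runner = {"running": 0.9}       # hypothetical profile from the features store
office_worker = {"dress": 0.8}

top_for_runner = personalize(candidates, runner)[0]["title"]
top_for_office = personalize(candidates, office_worker)[0]["title"]
print(top_for_runner)  # Trail running shoes
print(top_for_office)  # Leather dress shoes
```

Re-ranking only the retrieved candidates keeps the index itself user-independent, so the base search can still be cached and shared across users.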

Netflix personalizes not just what content appears, but the artwork shown for each title. The same movie displays different poster images to different users based on their viewing preferences.

Spotify's search ranks results based on your listening history, saved artists, and playlist contents. Searching for "the" surfaces "The Weeknd" for R&B listeners and "The Beatles" for classic rock listeners.

Cold Start Problem

New users have no history. Cold-start strategies include falling back to demographic defaults and popular items, then gradually shifting toward personalized results as behavior data accumulates.

Cold start progression:
  Visit 1-5:   100% popularity-based ranking
  Visit 6-20:  70% popularity, 30% personalized
  Visit 21-50: 40% popularity, 60% personalized
  Visit 51+:   10% popularity, 90% personalized
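
The progression amounts to a visit-dependent blend of two scores. The sketch below treats each tier's upper bound as inclusive and blends linearly, both of which are assumptions; the function names are illustrative.

```python
def personalization_weight(visit_count: int) -> float:
    """Share of the final score that comes from the personalized ranker."""
    if visit_count <= 5:
        return 0.0
    if visit_count <= 20:
        return 0.3
    if visit_count <= 50:
        return 0.6
    return 0.9


def blended_score(popularity_score: float, personalized_score: float,
                  visit_count: int) -> float:
    w = personalization_weight(visit_count)
    return (1.0 - w) * popularity_score + w * personalized_score


# An item that is popular overall but a poor fit for this particular user
first_visit = blended_score(0.8, 0.2, visit_count=3)
tenth_visit = blended_score(0.8, 0.2, visit_count=10)
regular = blended_score(0.8, 0.2, visit_count=80)
```

A step function is easy to reason about; a smooth ramp on visit count would avoid rankings jumping visibly at tier boundaries.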

Learning to Rank

Learning to rank (LTR) uses machine learning to train a ranking model on user behavior data, replacing hand-tuned scoring formulas.

Training Data

The model learns mostly from implicit user feedback (clicks, purchases, dwell time, bounce rates), sometimes supplemented by explicit signals such as review ratings.

Training data format:
  Query: "wireless headphones"
  
  Result A: clicked, purchased, 5-star review   --> relevance: 5 (perfect)
  Result B: clicked, browsed 2 minutes, no buy   --> relevance: 3 (good)
  Result C: shown but never clicked               --> relevance: 0 (irrelevant)
  Result D: clicked, bounced in 3 seconds          --> relevance: 1 (poor)
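
A labeling rule that reproduces the four grades above could look like the following. The Interaction record, the 10-second bounce threshold, and the decision order are hypothetical; real pipelines tune these against human judgments.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    clicked: bool
    dwell_seconds: float = 0.0
    purchased: bool = False


def relevance_label(interaction: Interaction) -> int:
    """Map logged behavior to a graded relevance label (0 to 5)."""
    if not interaction.clicked:
        return 0  # shown but never clicked
    if interaction.purchased:
        return 5  # strongest positive signal
    if interaction.dwell_seconds < 10:  # assumed bounce threshold
        return 1  # clicked, then left almost immediately
    return 3      # engaged with the result but did not buy


labels = [
    relevance_label(Interaction(clicked=True, dwell_seconds=300, purchased=True)),  # A
    relevance_label(Interaction(clicked=True, dwell_seconds=120)),                  # B
    relevance_label(Interaction(clicked=False)),                                    # C
    relevance_label(Interaction(clicked=True, dwell_seconds=3)),                    # D
]
print(labels)  # [5, 3, 0, 1]
```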

Feature Engineering

Each query-document pair is represented by features that the model uses for ranking.

Features for learning to rank:
  Text features:
    BM25 score, TF-IDF score, query-title match ratio
  
  Document features:
    Page authority, content quality score, freshness
    Number of reviews, average rating, sales velocity
  
  User features:
    Purchase history similarity, brand affinity score
  
  Query features:
    Query length, query category, commercial intent score
  
  Interaction features:
    Historical CTR for this query-document pair
    Position-adjusted click rate
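
One common way to compute a position-adjusted click rate is clicks over expected clicks: divide observed clicks by the clicks an average result would have received at the same positions. The per-position priors below are made-up numbers for illustration, and the function name is an assumption.

```python
# Hypothetical average click-through rate by rank position
POSITION_PRIOR = {1: 0.30, 2: 0.15, 3: 0.10}
DEFAULT_PRIOR = 0.05  # assumed rate for positions not listed above


def clicks_over_expected(impressions: list) -> float:
    """impressions: list of (position, clicked) pairs for one query-document pair."""
    clicks = sum(1 for _, clicked in impressions if clicked)
    expected = sum(POSITION_PRIOR.get(pos, DEFAULT_PRIOR) for pos, _ in impressions)
    return clicks / expected if expected else 0.0


# Doc X: shown 10 times at position 1, clicked 3 times (raw CTR 0.30)
doc_x = [(1, True)] * 3 + [(1, False)] * 7
# Doc Y: shown 10 times at position 3, clicked 2 times (raw CTR 0.20)
doc_y = [(3, True)] * 2 + [(3, False)] * 8

coec_x = clicks_over_expected(doc_x)  # about 1.0: performs as expected
coec_y = clicks_over_expected(doc_y)  # about 2.0: outperforms its position
```

Doc X wins on raw CTR, but Doc Y earns twice the clicks its position would predict, so the adjusted feature ranks it higher.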

LTR Model Types

Pointwise: predict absolute relevance score per document
  Simple but ignores relative ordering between documents
  Example: predict rating 1-5 for each result

Pairwise: predict which of two documents is more relevant
  Learns relative ordering, more aligned with ranking objective
  Example: RankNet; LambdaMART (used by Microsoft Bing) extends the pairwise approach by weighting each pair by its effect on NDCG

Listwise: optimize entire ranked list at once
  Most aligned with ranking evaluation metrics
  Example: optimize NDCG directly
  Higher computational cost
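
To make the pairwise idea concrete, the sketch below computes a pairwise logistic loss of the kind RankNet uses: for every pair where one document has a higher label, the loss is small when the model also scores it higher. It only evaluates the loss for given scores and is not a training loop; the helper names are illustrative.

```python
import math


def pairwise_logistic_loss(score_preferred: float, score_other: float) -> float:
    """log(1 + exp(-(s_preferred - s_other))), computed in a numerically stable way."""
    x = -(score_preferred - score_other)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mean_pairwise_loss(scores: list, labels: list) -> float:
    """Average the loss over all pairs (i, j) where labels[i] > labels[j]."""
    losses = [
        pairwise_logistic_loss(scores[i], scores[j])
        for i in range(len(scores))
        for j in range(len(scores))
        if labels[i] > labels[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0


labels = [5, 3, 0]                  # graded relevance for three results
well_ordered = [2.0, 1.0, 0.0]      # model scores that agree with the labels
badly_ordered = [0.0, 1.0, 2.0]     # model scores that invert the labels

good_loss = mean_pairwise_loss(well_ordered, labels)
bad_loss = mean_pairwise_loss(badly_ordered, labels)
```

A pointwise loss would compare each score to its label in isolation; here only the score differences matter, which is why pairwise training tracks ordering more closely.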

Airbnb trains an LTR model that combines listing features (price, location, amenities, photos), user features (trip purpose, budget, past bookings), and context features (search dates, group size) to rank listings. They retrain the model regularly as user behavior evolves.

Online vs Offline Evaluation

Offline metrics:
  NDCG (Normalized Discounted Cumulative Gain): measures ranking quality
  MRR (Mean Reciprocal Rank): how high the first relevant result appears
  Precision@K: fraction of top K results that are relevant
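
The three offline metrics are short enough to write out. This sketch uses the exponential-gain form of DCG (2^rel - 1), which is one common variant, and treats any label of 1 or more as relevant for MRR and Precision@K; both are assumptions rather than the only convention.

```python
import math


def dcg(relevances: list, k: int) -> float:
    """Discounted cumulative gain over the top k results."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances: list, k: int) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances: list, threshold: int = 1) -> float:
    for rank, rel in enumerate(relevances, start=1):
        if rel >= threshold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: list) -> float:
    """Average reciprocal rank across the result lists of several queries."""
    return sum(reciprocal_rank(r) for r in queries) / len(queries)


def precision_at_k(relevances: list, k: int, threshold: int = 1) -> float:
    return sum(1 for rel in relevances[:k] if rel >= threshold) / k


perfect = ndcg([3, 2, 1], k=3)          # ideal order
reversed_order = ndcg([1, 2, 3], k=3)   # best result ranked last
mrr = mean_reciprocal_rank([[1, 0], [0, 1]])
p_at_4 = precision_at_k([1, 0, 1, 0], k=4)
```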

Online metrics (A/B testing):
  Click-through rate
  Conversion rate
  Session success rate (user found what they needed)
  Time to first click

Common Pitfalls

Over-personalizing results. Filter bubbles degrade discovery. Balance personalization with diversity and serendipity.
Optimizing for clicks instead of satisfaction. Clickbait titles generate clicks but not user satisfaction. Measure dwell time and conversion, not just CTR.
Treating all facets the same. Some facets filter (brand, size), others rank (rating, price). Mixing them up creates confusing user experiences.
Static ranking formulas. User expectations and content change over time. Hand-tuned weights drift out of alignment. Use LTR models that adapt from feedback.
Ignoring position bias in training data. Users click higher-ranked results more often regardless of relevance. LTR training data must account for position bias to avoid reinforcing bad rankings.
Boosting without measuring impact. Every boost changes rankings globally. A/B test boost changes to verify they improve outcomes rather than introducing regressions.

Key Takeaways

Ranking combines text relevance with business signals like freshness, popularity, and quality. The weights between these signals define the search experience.
Boosting is a powerful lever but must be measured through A/B testing. Unmeasured boosts accumulate as invisible technical debt.
Faceted search requires computing aggregations across all matching documents, making it one of the most expensive operations in a search system.
Personalization improves relevance for returning users but needs cold-start strategies for new users and diversity safeguards against filter bubbles.
Learning to rank replaces hand-tuned formulas with models trained on user behavior, but requires careful handling of position bias and ongoing retraining.