
Weighted Search for E-commerce: From Keyword Matching to Intelligent Ranking

How ML-enriched catalogs and LLM query translation enable semantic product discovery

By Shiva Kumar Pati
December 2025

E-commerce search has relied on the same basic approach for twenty years. A user searches for "educational toys for 8 year old boys" and the system tokenizes the query, looks up matching terms in an inverted index, and ranks results using BM25 or TF-IDF scoring. Simple filters apply for structured fields like age range and category.

This works adequately when users search the way the catalog is organized. But it breaks down with natural language queries that express nuance, context, or implicit preferences. Consider "unique educational items for 8 year old boys." Traditional search handles "educational," "8," "year," and "boys" but misses "unique" entirely. There's no catalog field for uniqueness. The semantic intent—something distinctive, not mass-market—goes unrecognized.

The fundamental mismatch: traditional search requires the query to conform to the catalog's vocabulary. Modern search needs the opposite—the catalog must be understood in the language customers naturally use.


The Architecture Shift

The solution involves two fundamental changes in when and how computation happens:

Ingestion time (offline, expensive): Use machine learning to deeply understand every product, extracting attributes that don't exist in the raw catalog data. Pre-compute everything that doesn't depend on the specific query.

Query time (online, fast): Use an LLM to understand user intent and dynamically construct a weighted scoring formula. Apply this formula using the pre-computed values to rank products in milliseconds.

This inverts the traditional model where most computation happens at query time. Instead, we do the heavy analysis once per product and keep query-time operations lightweight.
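
A minimal sketch of that split, assuming illustrative field names, scores, and weights (this is not a reference implementation):

```python
# Sketch of the two-phase split; every field name and number is a placeholder.

def enrich_product(raw: dict) -> dict:
    """Ingestion time (offline): run ML models once and store the results."""
    # In a real system these values come from classifiers, review mining,
    # and an embedding model; here they are hard-coded placeholders.
    return {
        **raw,
        "educational_value": 0.95,
        "uniqueness": 0.87,
        "base_quality": 0.82,
        "base_business": 0.71,
    }

def score_product(enriched: dict, weights: dict[str, float]) -> float:
    """Query time (online): a weighted sum over pre-computed attributes,
    with weights chosen per query by the LLM."""
    return sum(w * enriched.get(attr, 0.0) for attr, w in weights.items())

product = enrich_product({"title": "Wooden Robot Building Kit"})
weights = {"educational_value": 0.4, "uniqueness": 0.4, "base_quality": 0.2}
print(round(score_product(product, weights), 3))  # 0.892
```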

Catalog Enrichment with Machine Learning

Raw product data is sparse. A typical product entry contains a title, description, category, price, and maybe an age range. Whatever manual tagging exists is incomplete and inconsistent.

At ingestion, we run multiple ML models to extract attributes that capture how people actually think about products:

Educational value analysis goes beyond a binary "educational" tag. A classifier trained on user behavior data extracts specific educational dimensions: Is this STEM-focused? Does it develop problem-solving skills? Is it hands-on learning or instruction-based? What's the pedagogical approach—Montessori-aligned, constructivist, self-directed?

For a wooden robot building kit, the model might output high scores for STEM (0.92), engineering (0.88), hands-on learning (0.91), and spatial reasoning (0.78). This granular understanding lets us match against queries that never use the word "educational"—like "science activities for kids" or "gifts for builders."

Uniqueness scoring requires understanding market position. How many sellers carry this item? How does the price compare to category medians? Is the description distinctive? For products with images, computer vision can assess design originality compared to similar items. Brand market share matters—a product from a small artisan maker is inherently more unique than one from a major manufacturer. Material choices factor in too: wooden toys occupy a different market position than plastic equivalents.

A small-batch wooden kit from a boutique brand might score 0.87 for uniqueness. A mass-market plastic toy scores 0.23. These scores get computed once and stored.
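
One plausible way to blend these market signals into a single score; the specific signals, transformations, and weights below are assumptions, not the production model:

```python
# Illustrative uniqueness score: a weighted blend of market-position signals.

def uniqueness_score(seller_count: int, price: float, category_median_price: float,
                     brand_market_share: float, design_originality: float) -> float:
    scarcity = 1.0 / (1.0 + seller_count / 10.0)                     # fewer sellers, more unique
    price_position = min(price / category_median_price, 2.0) / 2.0   # premium pricing signal
    small_brand = 1.0 - brand_market_share                           # artisan makers score higher
    score = (0.35 * scarcity + 0.15 * price_position
             + 0.25 * small_brand + 0.25 * design_originality)
    return round(score, 2)

print(uniqueness_score(3, 45.0, 30.0, 0.01, 0.9))    # ~0.85, boutique wooden kit
print(uniqueness_score(400, 12.0, 30.0, 0.30, 0.2))  # ~0.26, mass-market plastic toy
```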

Age appropriateness needs more sophistication than a simple range. Sure, the packaging says 8-12, but what do actual purchase patterns show? Review mining reveals that most buyers purchase for 8-9 year olds, and success rates (completing the product without frustration) peak at age 8-10. Factor in complexity indicators: piece count, instruction reading level, fine motor requirements, estimated completion time. The model refines the labeled age range into an optimal range with confidence scores.

Interest mapping replaces crude gender labels with nuanced interest clusters. Instead of "boys toys," we identify building/construction (0.94), mechanical systems (0.89), robotics (0.86), creative design (0.71). This lets us match based on interests rather than stereotypes while still understanding aggregate market patterns. A query mentioning "boys" maps to common interest clusters without hard filtering.

Quality signals aggregate everything indicating product satisfaction. Rating scores get weighted by review count—5 stars with 3 reviews matters less than 4.5 stars with 200 reviews. Return rates reveal problems. A 4% return rate is good; 15% signals issues. Review sentiment analysis goes deeper than star ratings. Seller reputation combines fulfillment rates, response times, years active, dispute rates.
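
One common way to weight a rating by review count is a Bayesian average that shrinks toward the catalog-wide mean until enough reviews accumulate. A sketch, assuming a catalog mean of 4.1 and a prior strength of 50 reviews:

```python
# Bayesian-average rating: shrink toward the catalog mean until enough
# reviews accumulate. The catalog mean and prior strength (m) are assumptions.

def weighted_rating(avg_rating: float, review_count: int,
                    catalog_mean: float = 4.1, m: int = 50) -> float:
    return (review_count * avg_rating + m * catalog_mean) / (review_count + m)

print(round(weighted_rating(5.0, 3), 2))    # ~4.15: 3 reviews barely move the prior
print(round(weighted_rating(4.5, 200), 2))  # ~4.42: 200 reviews dominate
```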

Business metrics capture commercial viability. Margins, inventory levels, conversion rates (view-to-cart, cart-to-purchase), price positioning within the category, even estimated demand elasticity. Some products drive new customer acquisition or increase basket sizes—strategic value beyond the immediate sale.

After this enrichment, every product has dozens of pre-computed attributes that didn't exist in the raw data. A wooden robot kit that started as just a title and description now has quantified scores for educational value, uniqueness, age appropriateness, interest alignment, quality indicators, and business metrics. All stored and indexed, ready for fast lookup.
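
As a rough illustration, an enriched record might look like the following; every field beyond the title and description is hypothetical output from the models described above:

```python
# What an enriched catalog record might look like after ingestion.
# Field names and values are illustrative.

enriched_product = {
    "title": "Wooden Robot Building Kit",
    "description": "Build and decorate your own wooden robot...",
    "educational": {"stem": 0.92, "engineering": 0.88,
                    "hands_on": 0.91, "spatial_reasoning": 0.78},
    "uniqueness": 0.87,
    "age": {"labeled": (8, 12), "optimal": (8, 10), "confidence": 0.83},
    "interests": {"building": 0.94, "mechanical": 0.89,
                  "robotics": 0.86, "creative_design": 0.71},
    "quality": {"weighted_rating": 0.82, "return_rate": 0.04,
                "seller_reputation": 0.88, "base_score": 0.82},
    "business": {"margin": 0.38, "conversion": 0.05,
                 "inventory_health": 0.9, "base_score": 0.71},
    "embedding": [0.12, -0.04, 0.33],  # truncated; 768 dimensions in practice
}
```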

Semantic Embeddings

Beyond discrete attributes, we generate dense vector embeddings that capture semantic meaning. The product title, description, and enriched attributes get encoded into a 768-dimensional vector using a model trained on e-commerce data.

These embeddings enable semantic similarity matching. A query about "screen-free activities" will match products even if they never mention screens, because the embedding space clusters similar concepts together. Physical building toys, board games, and outdoor equipment all sit near each other in this semantic space.

Vector databases with approximate nearest neighbor search make this fast even with millions of products. The expensive part—generating embeddings—happens once at ingestion. Query-time similarity search takes milliseconds.
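
A minimal sketch of query-time similarity search. Brute-force cosine similarity over random vectors stands in for a real encoder and an approximate nearest neighbor index, which is what a production system would use:

```python
import numpy as np

# Cosine-similarity retrieval over pre-computed product embeddings.
# Random vectors stand in for real embeddings; a vector database with an
# ANN index replaces the brute-force loop at scale.

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec: np.ndarray, catalog: dict[str, np.ndarray], k: int = 3):
    scored = [(pid, cosine_similarity(query_vec, vec)) for pid, vec in catalog.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

rng = np.random.default_rng(0)
catalog = {f"sku-{i}": rng.normal(size=768) for i in range(1000)}
query_vec = rng.normal(size=768)
print(top_k(query_vec, catalog))
```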

LLM Query Translation

When a user searches for "unique educational items for 8 year old boys," an LLM analyzes the query with a structured prompt that extracts intent, context, requirements, and preferences.

The LLM identifies purchase stage (consideration phase, likely comparing options), infers context (probably gift shopping based on the emphasis on "unique"), recognizes urgency (medium—not urgent but not idle browsing), and estimates price sensitivity (moderate—uniqueness emphasis suggests willing to pay more than bottom-dollar).

It separates hard requirements from soft preferences. Hard filters: age must be 7-9, product must be educational, exclude purely digital items. Soft preferences: uniqueness is highly important, certain educational domains (STEM, problem-solving) preferred over others, interest alignment with building/robotics/science, strong preference for physical over digital.

Most importantly, the LLM generates a query-specific weight formula. Traditional search applies the same weights to every query. The LLM recognizes that this query heavily emphasizes uniqueness, so it constructs a formula that weights uniqueness at 0.20 within the relevance score—much higher than typical. Educational value gets 0.20, age matching gets 0.25, interest alignment gets 0.20, semantic similarity gets 0.15.

For quality components, it allocates 40% to rating scores, 30% to return rates, 30% to seller reputation. For business metrics: 35% price competitiveness, 25% margin, 20% conversion rate, 20% inventory health.

Top-level weights between these three categories: 55% relevance (finding the right product matters most), 30% quality (gift context means quality is important), 15% business (less weight since this is about satisfying the customer, not maximizing margin).

The LLM also specifies ranking adjustments. Apply a 1.15x boost to products with uniqueness scores above 0.8. Apply a 1.08x boost to products tagged as gift-appropriate. Penalize common items (uniqueness below 0.3) with 0.85x.
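
Putting these pieces together, the LLM's structured output for this query might look like the following. The schema and field names are assumptions; the numbers are the ones discussed above:

```python
# Illustrative structured output from the LLM's query analysis for
# "unique educational items for 8 year old boys".

query_plan = {
    "hard_filters": {"age_range": [7, 9], "educational": True, "exclude_digital": True},
    "relevance_weights": {"age_match": 0.25, "educational_value": 0.20,
                          "uniqueness": 0.20, "interest_alignment": 0.20,
                          "semantic_similarity": 0.15},
    "quality_weights": {"rating": 0.40, "return_rate": 0.30, "seller_reputation": 0.30},
    "business_weights": {"price_competitiveness": 0.35, "margin": 0.25,
                         "conversion_rate": 0.20, "inventory_health": 0.20},
    "top_level_weights": {"relevance": 0.55, "quality": 0.30, "business": 0.15},
    "adjustments": [
        {"when": "uniqueness > 0.8", "multiplier": 1.15},
        {"when": "gift_appropriate", "multiplier": 1.08},
        {"when": "uniqueness < 0.3", "multiplier": 0.85},
    ],
}
```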

This entire query analysis takes 200-800ms depending on LLM latency. But it happens once per query, and the output drives scoring for thousands of candidate products.

Runtime Scoring

Now scoring becomes straightforward arithmetic using pre-computed values.

For the wooden robot kit:

Relevance scoring: Age match uses a Gaussian distribution centered at 8, and the product's optimal age range of 8-10 scores 0.98. Educational value was pre-computed at 0.95, exceeding the 0.7 threshold—if it didn't, we'd filter this product entirely. Uniqueness score of 0.87 exceeds the 0.6 preference threshold. Interest alignment compares the query's desired clusters (building, robotics, STEM) against the product's pre-computed clusters (building 0.94, robotics 0.86, STEM 0.92), averaging to 0.89. Semantic similarity between query embedding and product embedding computes to 0.91.

Apply the LLM's weights: (0.98 × 0.25) + (0.95 × 0.20) + (0.87 × 0.20) + (0.89 × 0.20) + (0.91 × 0.15) = 0.9235

Quality scoring: The product's pre-computed base quality score is 0.82, combining weighted rating (4.7 stars with 127 reviews), low return rate (4%), and strong seller reputation.

Business scoring: Pre-computed base business score is 0.71, reflecting good margins (38%), healthy inventory, solid conversion rates, and competitive pricing.

Final combination: (0.9235 × 0.55) + (0.82 × 0.30) + (0.71 × 0.15) ≈ 0.860

Adjustments: Uniqueness score of 0.87 triggers the 1.15x boost: 0.860 × 1.15 ≈ 0.989
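
The same arithmetic as straightforward code, using the example values above:

```python
# Runtime scoring for the wooden robot kit.
# Component scores come from pre-computed attributes; weights come from the LLM.

relevance_components = {"age_match": 0.98, "educational_value": 0.95,
                        "uniqueness": 0.87, "interest_alignment": 0.89,
                        "semantic_similarity": 0.91}
relevance_weights = {"age_match": 0.25, "educational_value": 0.20,
                     "uniqueness": 0.20, "interest_alignment": 0.20,
                     "semantic_similarity": 0.15}

relevance = sum(relevance_components[k] * w for k, w in relevance_weights.items())

base_quality, base_business = 0.82, 0.71
final = 0.55 * relevance + 0.30 * base_quality + 0.15 * base_business

if relevance_components["uniqueness"] > 0.8:   # LLM-specified boost for high uniqueness
    final *= 1.15

print(f"relevance={relevance:.4f}, final={final:.3f}")  # relevance=0.9235, final=0.989
```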

Scoring a single product takes microseconds. The expensive operations—ML attribute extraction, embedding generation, quality analysis—all happened days ago at ingestion. Query time is just fast arithmetic.

Multi-Stage Retrieval

Large catalogs need progressive refinement to stay fast. A naive approach that scores every product is too slow.

Stage 1: Candidate retrieval. Vector search finds the top 1,000 products most semantically similar to the query, filtered by hard requirements (age includes 8, must be educational, must be in stock). Approximate nearest neighbor algorithms make this fast—around 50ms even with millions of products.

Stage 2: Lightweight scoring. These 1,000 candidates get quick scores using a simplified formula: semantic similarity (0.4), base quality score (0.3), base business score (0.3). Sort and keep the top 500. Takes about 10ms.

Stage 3: Full scoring. Apply the complete query-specific weight formula to these 500 products. Calculate all the detailed relevance components, apply the adjustments, get precise final scores. Top 100 products emerge. Takes about 15ms.

Stage 4: Diversification. Ensure variety in the final results—no more than 3 products from the same brand, represent different price tiers, cover multiple educational domains. Apply final business rules like minimum margin thresholds. Takes about 5ms.

Total query time: 80ms for the scoring pipeline, plus 200-800ms for LLM query translation. Under one second total, and much of this can be parallelized or cached for similar queries.
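
A lean sketch of how the stages might chain together. Field names are assumed, brute-force similarity stands in for the ANN index, catalog embeddings are assumed to be numpy vectors, and score_fn carries the LLM's query-specific formula:

```python
import numpy as np

# Sketch of the four-stage funnel; each stage is deliberately simplified.
# The structure (progressively smaller candidate sets) is the point.

def stage1_retrieve(query_vec, catalog, k=1000):
    """Hard filters plus vector similarity (brute force stands in for ANN search)."""
    pool = [p for p in catalog if p["in_stock"] and p["educational"]]
    for p in pool:
        p["sim"] = float(query_vec @ p["embedding"] /
                         (np.linalg.norm(query_vec) * np.linalg.norm(p["embedding"])))
    return sorted(pool, key=lambda p: p["sim"], reverse=True)[:k]

def stage2_lightweight(pool, k=500):
    """Fixed blend of cheap, pre-computed signals to shrink the pool."""
    for p in pool:
        p["light"] = 0.4 * p["sim"] + 0.3 * p["base_quality"] + 0.3 * p["base_business"]
    return sorted(pool, key=lambda p: p["light"], reverse=True)[:k]

def stage3_full(pool, score_fn, k=100):
    """Query-specific weighted scoring; score_fn applies the LLM's formula."""
    for p in pool:
        p["score"] = score_fn(p)
    return sorted(pool, key=lambda p: p["score"], reverse=True)[:k]

def stage4_diversify(pool, max_per_brand=3):
    """Cap results per brand; price-tier and domain coverage would go here too."""
    per_brand, results = {}, []
    for p in pool:
        if per_brand.get(p["brand"], 0) < max_per_brand:
            results.append(p)
            per_brand[p["brand"]] = per_brand.get(p["brand"], 0) + 1
    return results

def search(query_vec, score_fn, catalog):
    pool = stage1_retrieve(query_vec, catalog)
    pool = stage2_lightweight(pool)
    pool = stage3_full(pool, score_fn)
    return stage4_diversify(pool)
```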

Why This Works

The key insight is separating expensive, product-intrinsic computation from fast, query-specific scoring. Understanding that a wooden robot kit scores 0.87 for uniqueness requires analyzing market data, comparing to similar products, assessing design originality. This takes time. But it's the same answer whether someone queries "unique toys" or "distinctive educational items" or "special gifts for kids." Compute it once.

The query-specific part—deciding that uniqueness should weigh 0.20 in the relevance score for this particular query—is just a number. Applying that weight is multiplication. Fast.

Traditional search conflates these. Every query re-analyzes products to determine relevance. This system decouples analysis from application.

It also enables flexibility impossible in traditional search. Changing ranking logic in traditional systems requires rebuilding indexes. Here, the LLM just generates different weights. Want to emphasize gift-appropriateness during holidays? The LLM adjusts weights. Running a clearance sale? Boost inventory-health weights. The pre-computed attributes stay the same; only the formula changes.

Concrete Comparison

Traditional keyword search

Top result: "Educational Alphabet Flash Cards" because it contains "educational" and the seller gamed SEO with "boys" and "8" in the description. The product is generic (sold by 400 vendors), targets ages 3-6, and has nothing unique about it.

Second result: "Boys' Science Kit" gets keyword matches on "boys," "educational" (implied by "science"), and happens to have "8+" on the box. It's a basic chemistry set that's been on the market for 30 years—not remotely unique.

Third result: "Learning Tablet for Kids" ranks high because "educational" and the seller specified age 8 in metadata. But it's a screen-based device (arguably not what someone wants when emphasizing "items") and mass-market.

None of these are good matches. The query's intent—finding something distinctive and educational—gets lost because traditional search can't understand "unique" and can't weight attributes differently based on query context.

ML-enriched, LLM-translated search

Top result: "Wooden Robot Building Kit" scores 0.986. Near-perfect age match, very high educational value, excellent uniqueness score, strong alignment with typical interests. Quality signals are solid, business metrics are good, and it gets the uniqueness boost.

Second result: "Marble Run Engineering Set" scores 0.941. Also excellent age fit, highly educational, above-average uniqueness (smaller brand, distinctive design), matches building/engineering interests well.

Third result: "Crystal Growing Science Lab" scores 0.923. Perfect age range, strong educational component, unique experience (growing crystals is less common than building toys), aligns with science interests.

These results actually satisfy the query's intent. They're all distinctive products that will appeal to a parent looking for something beyond the typical toy-store offerings.

The Learning Feedback Loop

The system improves over time by learning from user behavior. When people search for "educational toys" and consistently click on STEM-related products, the educational value classifier learns that "educational" often means "STEM" in practice. When products marked as unique get higher engagement than generic alternatives, the uniqueness model refines its scoring.

This feedback can extend to the weighting formulas themselves. Track which weight combinations lead to higher conversion rates. If queries emphasizing "unique" see better outcomes when uniqueness weighs 0.25 instead of 0.20, future queries can use the better weights.
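
A minimal sketch of that tracking, assuming each served query logs the weight profile it used and whether the session converted; the log format, profile labels, and numbers are invented for illustration:

```python
from collections import defaultdict

# Compare conversion rates across weight profiles from a (toy) serving log.

conversion_log = [
    # (weight_profile_id, converted) -- toy data, not real measurements
    ("uniqueness_0.20", True), ("uniqueness_0.20", False),
    ("uniqueness_0.25", True), ("uniqueness_0.25", True),
]

def conversion_by_profile(log):
    totals = defaultdict(lambda: [0, 0])          # profile -> [conversions, impressions]
    for profile, converted in log:
        totals[profile][0] += int(converted)
        totals[profile][1] += 1
    return {p: conv / imp for p, (conv, imp) in totals.items()}

print(conversion_by_profile(conversion_log))
# {'uniqueness_0.20': 0.5, 'uniqueness_0.25': 1.0}
```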

Eventually, you can train a learning-to-rank model that directly optimizes for business outcomes rather than using hand-crafted weights. But starting with interpretable weighted formulas is crucial. You need to understand what's working and why before handing control to a black-box model.


Where This Leads

Once you have ML-enriched catalogs and LLM query translation, new capabilities emerge naturally. Personalization becomes straightforward—just include user preferences in the weight formula. Conversational refinement is easy—the LLM maintains context across multiple turns. Cross-lingual search works because embeddings capture meaning beyond specific words.

The real transformation is moving from "search" to "understanding." Traditional search asks which products contain these words. Modern search asks what the person actually wants and which products genuinely satisfy that need. The technical architecture—pre-computed attributes, semantic embeddings, dynamic weighting—is just the implementation of that philosophical shift.
