Grading Shopify AI Search: A Retrieval Eval Harness

Frederick Casey Housand - Jun 8, 2026

A retrieval eval harness for Shopify search: a PASS or BLOCKED regression-gate stamp over a strip of messy real shopper queries (typo, synonym, descriptive, intent) graded right, partial, and wrong

A retrieval eval harness for Shopify search grades an AI answer before a shopper ever sees it: a fixed set of real, messy queries runs against the index on every change, and a relevance score below the last-good baseline blocks the ship. (Updated: June 2026.)

An app that skips this step has not removed the test. It has moved the test onto its shoppers, who pay for every regression with a product they never found.

Here is the claim we will defend: a retrieval change you cannot grade offline is a change you are shipping blind. A new embedding model, a reranker, a re-index, a single synonym tweak. Each one either improves or quietly degrades which products surface for a real query, and correct-but-blocked beats fast-but-wrong-in-production every time.

What follows is the grading view, not the scale or speed view. It sits inside our wider look at AI search for Shopify, alongside its two siblings: the scale-axis sibling on what breaks past a million SKUs, and the time-axis sibling on where the milliseconds go.

This is the correctness axis. The eval grades which products surface; the live read grades whether their stock and price are current. Different problems, and we keep them apart on purpose.

Why “we use AI” is a claim you cannot check without an eval

“We use AI” and “semantic search” are unfalsifiable from the outside. The only honest check is whether the right product surfaces for a real shopper query, and that check has to run on the messy queries production actually sees, not the clean ones a demo is built around. Without an offline eval, the shopper is the test harness and the bug report arrives weeks later as a dip in the conversion report.

The demo is a curated query set. Production is everything the demo left out.

A shopper types “waterprood jacket” and a typo-blind index returns nothing.
A shopper wants a “swim shirt” and the catalog files it under rashguard.
A parent searches “warm coat for toddler” and gets every coat, sorted by no one’s intent.

Shopify’s own search team grades offline before they ship. Their published framework collects ground-truth relevance labels, scores candidate algorithms on MAP and NDCG, then runs an online A/B, explicitly to “mitigate the risks of launching new algorithms into production” by running thousands of historical queries against each variant.

The merchant-facing cut almost nobody makes: the app you install will rarely tell you whether it does the same. The search log audit finds the failure after it happens; the eval is what keeps it from coming back.

What a golden set is, and why it is full of typos

A golden set is a fixed, version-controlled list of real shopper queries, each paired with a known-good expected result. It is not the clean demo set. It is built from the things shoppers actually type, and it lives in source control so every retrieval change runs against the same bar. The four query classes worth labeling are the four the demo never shows.

Typo: “waterprood jacket” should still find the waterproof jacket.
Synonym: “swim shirt” should find the rashguard. (This is the synonyms trap, measured instead of patched.)
Descriptive: “warm coat for toddler” should rank insulated kids’ coats over thin ones.
Intent: “gift for a runner” should surface giftable running gear, not the entire running category.

Why messy and real, not curated? Because zero-setup autonomous learning means no human hand-tuned a synonym list, so the eval has to grade the catalog exactly as shoppers hit it. And because the app answers in 23+ languages with automatic detection, the golden set carries non-English rows too. The golden-set size is an illustrative target throughout this post, never a measured Shoply number.
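To make "fixed, version-controlled, paired with a known-good result" concrete, here is a minimal sketch of what a golden set file could look like in Python. The field names, the 2/1 grade scale, and every product id are assumptions made up for illustration; none of them describe how Shoply stores its own set.

```python
from dataclasses import dataclass

# Assumed grade scale for this sketch: 2 = right, 1 = partial; unlabeled = wrong.
RIGHT, PARTIAL = 2, 1
QUERY_CLASSES = {"typo", "synonym", "descriptive", "intent"}


@dataclass
class GoldenQuery:
    query: str        # exactly what the shopper typed, typos included
    query_class: str  # one of QUERY_CLASSES
    language: str     # language tag; non-English rows carry their own
    expected: dict    # hypothetical product id -> relevance grade


# Product ids below are invented for illustration only.
GOLDEN_SET = [
    GoldenQuery("waterprood jacket", "typo", "en",
                {"sku-waterproof-jacket": RIGHT}),
    GoldenQuery("swim shirt", "synonym", "en",
                {"sku-rashguard": RIGHT}),
    GoldenQuery("warm coat for toddler", "descriptive", "en",
                {"sku-insulated-kids-coat": RIGHT, "sku-thin-kids-coat": PARTIAL}),
    GoldenQuery("gift for a runner", "intent", "en",
                {"sku-running-gift-set": RIGHT, "sku-running-socks": PARTIAL}),
]


def validate(golden_set):
    """Raise ValueError on an unknown class, a row with no labels, or a
    repeated query. Returns the number of distinct rows checked."""
    seen = set()
    for row in golden_set:
        if row.query_class not in QUERY_CLASSES:
            raise ValueError(f"unknown query class: {row.query_class!r}")
        if not row.expected:
            raise ValueError(f"no expected results for: {row.query!r}")
        key = (row.language, row.query)
        if key in seen:
            raise ValueError(f"duplicate query: {row.query!r}")
        seen.add(key)
    return len(seen)
```

Keeping the set as plain data in source control means a change to a label shows up in a diff, the same way a change to a test would.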

The graded-relevance view

Relevance is graded, not a yes or a no

Each messy query is scored across its top three results. A result is right, partially relevant, or wrong. The grade is what separates “buried at the bottom” from “absent”, and that difference is the whole signal.

Messy query: grades across Result #1 to Result #3
typo · “waterprood jacket”: right, partial
synonym · “swim shirt”: right, partial, wrong
descriptive · “warm coat for toddler”: right
intent · “gift for a runner”: partial, wrong
Legend: right, partial, wrong

Methodology: grades are illustrative of how graded relevance works, not measured results on a Shoply index.

The graded-not-binary idea is the load-bearing one. Weaviate’s write-up on retrieval evaluation and Pinecone’s offline-evaluation guide both treat relevance as a scale, not a switch, because a “close” result and a “missing” result are not the same failure.

How recall and NDCG say how wrong, not just whether

Two numbers carry the grade, in plain merchant terms. Recall@k asks whether the right product made the top k at all. NDCG@k asks whether it landed high enough to actually be seen, with partial credit for “close” results. A binary right-or-wrong throws away the gap between buried at #19 and absent entirely, and that gap is exactly where a quiet regression hides.

The math stays light here on purpose; the Stanford IR textbook chapter on ranked retrieval and the Weaviate metrics guide carry the formal definitions for anyone who wants them.
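For readers who prefer code to formulas, the two metrics fit in a few lines. This is a sketch under stated assumptions: grades use a 2/1/0 scale (right, partial, wrong), NDCG uses the linear-gain form with a log2 rank discount (one common variant among several), and the product ids are made up.

```python
import math


def recall_at_k(ranked_ids, grades, k):
    """Share of labeled-relevant products (grade > 0) found in the top k."""
    relevant = {pid for pid, grade in grades.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, grades, k):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    gains = [grades.get(pid, 0) for pid in ranked_ids]
    ideal = sorted(grades.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical labels for "swim shirt": one right answer.
grades = {"sku-rashguard": 2}
on_top = ["sku-rashguard", "sku-x", "sku-y"]
buried = ["sku-x", "sku-y", "sku-rashguard"]
absent = ["sku-x", "sku-y", "sku-z"]

print(recall_at_k(on_top, grades, 3), ndcg_at_k(on_top, grades, 3))  # 1.0 1.0
print(recall_at_k(buried, grades, 3), ndcg_at_k(buried, grades, 3))  # 1.0 0.5
print(recall_at_k(absent, grades, 3), ndcg_at_k(absent, grades, 3))  # 0.0 0.0
```

Recall cannot tell the first ranking from the second; NDCG can. That is the reason the harness reports both.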

One illustrative case shows why both numbers matter: a change that lifts recall on typo queries but drops NDCG on intent queries is not obviously an improvement. The harness is what surfaces that trade instead of letting it ride. (Illustrative, not a measured result.)
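A trade like that only shows up if scores are broken out by query class instead of averaged into one number. The sketch below assumes each query has already been scored, and it takes rows of (query class, recall, NDCG); the function names and every number in the example are invented for illustration, not measured.

```python
from collections import defaultdict
from statistics import mean


def per_class_means(rows):
    """rows: iterable of (query_class, recall, ndcg) tuples.
    Returns {query_class: (mean_recall, mean_ndcg)}."""
    buckets = defaultdict(list)
    for query_class, recall, ndcg in rows:
        buckets[query_class].append((recall, ndcg))
    return {
        cls: (mean(r for r, _ in scores), mean(n for _, n in scores))
        for cls, scores in buckets.items()
    }


def class_deltas(baseline_rows, candidate_rows):
    """Candidate minus baseline, per class, as (recall_delta, ndcg_delta)."""
    base = per_class_means(baseline_rows)
    cand = per_class_means(candidate_rows)
    return {
        cls: (cand[cls][0] - base[cls][0], cand[cls][1] - base[cls][1])
        for cls in base
        if cls in cand
    }


# Made-up scores: the candidate helps typo queries and hurts intent queries.
baseline = [("typo", 0.5, 0.6), ("intent", 1.0, 0.9)]
candidate = [("typo", 1.0, 0.8), ("intent", 1.0, 0.6)]

for cls, (d_recall, d_ndcg) in class_deltas(baseline, candidate).items():
    print(f"{cls}: recall {d_recall:+.2f}, ndcg {d_ndcg:+.2f}")
```

A single averaged score could rise here while intent queries get worse, which is why the per-class view is worth the extra lines.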

Recall falling off as a catalog grows is the scale-axis sibling’s subject. Here recall is the instrument, measured on any change at any catalog size. One sentence, one link, and we stay on the correctness axis.

How the regression gate turns a number into a decision

The regression gate is the part that makes the number matter. The harness runs the golden set against the candidate change, computes recall@k and NDCG@k, and blocks the ship if either falls below the last-good baseline. The output is not a dashboard someone might glance at. It is a gate, the same shape as a test suite that refuses a bad deploy, which is exactly why an engineer trusts it.
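As a rough sketch of that decision, the gate can be a function that compares candidate metrics to the stored baseline and returns a process exit code, so a CI job fails the same way a failing test would. The metric names, the `tolerance` parameter, and the scores are assumptions for illustration; a tolerance of zero matches the rule described above.

```python
def regression_gate(candidate, baseline, tolerance=0.0):
    """Compare metric dicts (name -> score). A metric fails if it is missing
    from the candidate or falls more than `tolerance` below the baseline.
    Returns (passed, sorted list of failing metric names)."""
    failures = []
    for name, base_score in baseline.items():
        cand_score = candidate.get(name)
        if cand_score is None or cand_score < base_score - tolerance:
            failures.append(name)
    return (not failures, sorted(failures))


def gate_exit_code(candidate, baseline, tolerance=0.0):
    """0 means ship, 1 means blocked; suitable for sys.exit() in a CI step."""
    passed, failures = regression_gate(candidate, baseline, tolerance)
    for name in failures:
        print(f"BLOCKED: {name} = {candidate.get(name)} "
              f"is below baseline {baseline[name]}")
    return 0 if passed else 1


# Illustrative numbers only, not measured on any real index.
last_good = {"recall@10": 0.88, "ndcg@10": 0.72}
proposed = {"recall@10": 0.90, "ndcg@10": 0.70}

print(gate_exit_code(proposed, last_good))  # prints the BLOCKED line, then 1
```

Recall went up in this example and the change is still blocked, because either metric falling below baseline is enough to stop the ship.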

The eval-loop view

A retrieval change ships only if it clears the gate

Every candidate change runs the golden set, gets graded, and meets one decision: pass to production, or block and loop back. The gate is the whole point. A recall or NDCG drop below baseline takes the blocked branch back to the start.

Methodology: the baseline threshold is an illustrative target, not a measured production figure.

One harness can cover more than search alone. The combined AI Search + Chatbot is one retrieval surface over one live index, so the chatbot answer and the search results are the same retrieval under test. Grade it once, gate it once. The structural case for why those two belong together lives in the AI Search and Chatbot pillar.

The gate covers retrieval correctness; keeping the surfaced product’s stock and price current is the job of live-state reads, which carry a latency budget of their own. We name the seam and leave it there.

The contrast that makes the gate worth building is who finds the regression first.

The who-finds-it-first view

Either the harness finds it, or the shopper does

Same retrieval regression, two timelines. Without an eval, the shopper is the test set and the lesson costs a sale. With one, the gate catches it before the change ever ships.

WITHOUT eval · the shopper is the test set · WEEKS TO NOTICE
change ships → shopper hits a missing product → lost sale → noticed weeks later in the conversion report

WITH eval · the harness is the test set · CAUGHT PRE-SHIP
change runs the harness → gate blocks the regression → shopper never sees it

Methodology: an illustrative contrast of two workflows, not a measured timeline.

What this harness does not catch

An offline eval de-risks a change; it does not certify the answer. Naming the limits is part of trusting the number, so here is what the harness misses even when it is green.

A golden set is only as good as its labels. Stale labels grade yesterday’s catalog. Shopify’s own team flags that evaluation datasets can drift and need refreshing, and ours are no different.
Offline recall and NDCG are not conversion. A technically relevant result can still be the wrong merchandising call. The online A/B is the other half of the picture and is out of scope here.
The harness grades correctness, not freshness or latency. Whether the surfaced product’s stock is current is the live read; how fast the answer arrives is the budget. Those are the time-axis sibling’s territory, not this one’s.
Graded relevance still embeds human judgment. Two annotators disagree, and an LLM-as-judge inherits its own biases. Using a model to label is a real option, with a real caveat attached.
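The first limit, stale labels, is partly checkable by machine. The sketch below covers only one narrow kind of staleness: labels that point at a product id no longer in the catalog. It assumes the golden set is a mapping of query to graded product ids and that a list of live product ids is available; both shapes and all ids are hypothetical. It does not detect a label whose product still exists but is no longer the right answer.

```python
def stale_labels(golden_set, live_product_ids):
    """golden_set: {query: {product_id: grade}}.
    Returns {query: sorted product ids that are labeled but not live}."""
    live = set(live_product_ids)
    stale = {}
    for query, expected in golden_set.items():
        missing = sorted(pid for pid in expected if pid not in live)
        if missing:
            stale[query] = missing
    return stale


# Made-up ids: the rashguard was delisted after the label was written.
golden = {
    "swim shirt": {"sku-rashguard": 2},
    "waterprood jacket": {"sku-waterproof-jacket": 2},
}
catalog = ["sku-waterproof-jacket", "sku-rain-poncho"]

print(stale_labels(golden, catalog))  # {'swim shirt': ['sku-rashguard']}
```

A query flagged this way would otherwise score as a retrieval failure on every run, which is a label problem and not a search problem.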

Thanks to Rui for the reality-check that kept every number in this post labeled honestly as illustrative. If you run your own retrieval eval and want to compare notes on where your scores start to slip, we would genuinely like to hear about it.

Happy grading.

Frequently asked questions

What is a retrieval eval harness for Shopify search?

It is an offline test that grades an AI search or chatbot answer before a shopper sees it. A fixed golden set of real, messy shopper queries runs against the index on every retrieval change, each result is graded for relevance, and a recall or NDCG score below the last-good baseline blocks the change from shipping. The thresholds here are illustrative targets, not measured Shoply figures.

How do you know if your Shopify AI search is accurate?

Run a golden set of real shopper queries against the index and grade the results with recall@k and NDCG@k. Recall@k asks whether the right product made the top k at all; NDCG@k asks whether it ranked high enough to be seen. Accuracy you cannot reproduce on a fixed query set is a claim, not a measurement.

What is a golden query set and what should it contain?

A golden query set is a fixed, version-controlled list of real shopper queries each paired with a known-good expected result. It should contain the messy queries production actually sees: typos like “waterprood jacket”, synonyms like “swim shirt” for rashguard, descriptive phrases like “warm coat for toddler”, and intent queries like “gift for a runner”, including non-English rows when the store sells internationally.

Does offline evaluation replace A/B testing?

No. Offline evaluation de-risks a retrieval change before it ships by catching recall and NDCG regressions on a golden set. Online A/B testing measures what the change does to real conversion afterward. They are two halves of one process, and the harness is the half that runs on every code or index change.

If you want to run the falsification test yourself

The fastest way to feel the difference is to ask a demo store a messy query the demo set would never contain. A typo, a synonym, a “gift for someone who runs”. Shoply AI runs combined AI Search and a chatbot over one live index, learns the catalog with zero setup, and reads live stock and price on the answer.

See it: Shopify listing at apps.shopify.com/shopping-assistant-by-shoplyai, live demo at demo.shoplyai.ai.
Go deeper: the broader AI search for Shopify landscape, its scale-axis sibling on what breaks past a million SKUs, and its time-axis sibling on where the milliseconds go.
