We all know how to use search engines. Enter “weather” or “Justin Bieber birthday” into a search box and voila! Ads, links, a seven-day local forecast, perhaps even an answer like “March 1, 1994 (age 19 years)” appear before us. The sheer convenience of search engines, after all, is precisely their point.
But using technology and understanding it are two different beasts. So I asked Fred Mailhot, a computational linguist with Ask.com, née AskJeeves.com, to fill in some of the gaps in my own knowledge of search engines.
The first thing to understand is that bots (web robots) known as web crawlers work their way through a long, ever-replenishing list of URLs (web addresses). With each visit, a site’s content is copied into a giant index. How giant? According to Mailhot, it contains “a nontrivial fraction” of all of the documents available on the World Wide Web. It is this index, not the immensity of the entire Web, that we search when we enter our queries. Efficient and discriminating web crawlers keep the search index timely and representative, and thus improve its odds of containing relevant results.
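To make the crawl-then-index idea concrete, here is a minimal sketch of my own (not Ask.com's code). It assumes a tiny in-memory stand-in for the web, `TOY_WEB`, with made-up addresses, and the function names `crawl` and `search_index` are hypothetical. The point it illustrates is that a query only ever touches the index built by the crawler, never the pages themselves.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("weather forecast for the week", ["b.example", "c.example"]),
    "b.example": ("tornado weather report", ["a.example"]),
    "c.example": ("celebrity birthday photos", []),
}

def crawl(seed_urls, web):
    """Visit URLs from a frontier, copying each page's words into an index."""
    index = defaultdict(set)       # word -> set of URLs containing that word
    frontier = deque(seed_urls)    # the ever-replenishing list of URLs to visit
    seen = set(seed_urls)
    while frontier:
        url = frontier.popleft()
        text, links = web.get(url, ("", []))
        for word in text.split():
            index[word].add(url)
        for link in links:         # newly discovered URLs replenish the list
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

def search_index(index, query):
    """Return URLs containing every query word; only the index is consulted."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = crawl(["a.example"], TOY_WEB)
print(sorted(search_index(index, "weather")))
```

A page the crawler never reaches simply cannot be returned, which is why the quality of the crawl limits the quality of the results.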
But whether the search engine will spit back those relevant results is another question entirely. Every query returns a subset of the search index. But what good is a seven-day local forecast to a user preparing a report on tornadoes, or “March 1, 1994 (age 19 years)” to a user keen on viewing photos from celebrity birthday bashes?
So Mailhot and his fellow linguists and data scientists spend much of their days trying to discern a user’s intent. “Is the user simply looking for open-ended information? Very specific information?” asks Mailhot. “Are they trying to purchase something, or download some resource?” In other words, a search engine cannot be relevant without engaging in a bit of mind reading. Just how the mind reading works, says Mailhot, “is every search engine company’s secret sauce.”
Or, to be more precise, every company’s secret algorithm. In an ideal world, the search process might consist of taking a single query, comparing it to every query that has ever been entered, identifying all similar queries, and then returning the URLs that were eventually clicked on by the users who entered those similar queries. But in practice, Mailhot tells me, such a process is simply not feasible. (For one, it would take too long: “If Google made you wait two seconds before you get your results back, you would stop using them. We’ve been trained to get our results back very quickly.”) So instead, a search engine has an algorithm: a set of probabilistic rules inferred from successful query-URL pairs.
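For the curious, the “ideal world” process can be sketched in a few lines. This is my own toy illustration: the click log, the word-overlap (Jaccard) similarity measure, and the 0.5 threshold are all assumptions, not anything Mailhot described. Note that the loop touches every logged query for every search, which hints at why the approach does not scale.

```python
from collections import Counter

# Hypothetical click log: (past query, URL the user eventually clicked).
CLICK_LOG = [
    ("justin bieber birthday", "bio.example/bieber"),
    ("bieber birthday date", "bio.example/bieber"),
    ("justin bieber birthday party photos", "gossip.example/bieber-party"),
    ("weather this week", "forecast.example"),
]

def jaccard(a, b):
    """Word-overlap similarity between two queries, from 0.0 to 1.0."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def ideal_lookup(query, log, threshold=0.5):
    """Scan every logged query and rank the URLs clicked for similar ones."""
    clicks = Counter()
    for past_query, url in log:    # cost grows with the size of the whole log
        if jaccard(query, past_query) >= threshold:
            clicks[url] += 1
    return [url for url, _ in clicks.most_common()]

print(ideal_lookup("justin bieber birthday", CLICK_LOG))
```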
What sort of rules? Take, for a simple example, a navigational query, in which users enter search terms with a particular destination (Amazon, say, or Facebook) in mind. (Think of all the times you’ve gotten to a specific site via a search engine’s page in lieu of typing out the entire URL in your browser’s address bar.) Were a search algorithm to know that your query is navigational, it could send the desired website to the top of your search results, or even send you directly to your desired site, thoroughly delighting you and ensuring you’ll be back. How might it know? Well, including “.com” or “.org” in a query is pretty much a dead giveaway, says Mailhot, so a search engine might have a rule such as: if a query contains “.org,” it is navigational with some probability. Alternatively, an algorithm might learn over time that certain types of queries are navigational because they end up quite literally “looking like” the URL a user ends up visiting.
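Here is a rough sketch of what both kinds of evidence might look like in code. The probabilities are numbers I made up for illustration (a real engine would infer them from query-URL data), and the function names are hypothetical. The “looking like” comparison uses Python’s standard `difflib.SequenceMatcher`, which is just one convenient string-similarity measure, not necessarily what any search engine uses.

```python
from difflib import SequenceMatcher

# Assumed, made-up probabilities; a real engine would learn these from data.
TLD_RULES = {".com": 0.9, ".org": 0.9}
DEFAULT_PROBABILITY = 0.1

def navigational_probability(query):
    """Rule: a query containing a listed suffix is probably navigational."""
    q = query.lower()
    return max(
        (p for suffix, p in TLD_RULES.items() if suffix in q),
        default=DEFAULT_PROBABILITY,
    )

def looks_like(query, visited_host):
    """Similarity (0.0 to 1.0) between a query and the host the user visited."""
    return SequenceMatcher(None, query.lower(), visited_host.lower()).ratio()

print(navigational_probability("wikipedia.org"))
print(navigational_probability("tornado report"))
print(looks_like("gogle.com", "google.com") > looks_like("tornado report", "google.com"))
```

The rule gives a probability rather than a verdict, so it can be weighed against other evidence about what the user wants.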
And for the most common queries, a search engine might decide to abandon the algorithm altogether. (Incidentally, across all search engines, the most common queries also happen to be navigational: things like “Google.com,” its bastardized cousin “Gogle.com,” and “Twitter.”) The distribution of search terms has a decidedly long tail: a small number of terms are searched very frequently, while the vast majority are hardly ever searched. For this small set of very popular terms, churning through the entire algorithm again and again is more work than simply hardcoding a fixed page of results.
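In code, the shortcut amounts to checking a small lookup table before doing any real work. The sketch below is my own illustration under stated assumptions: the table contents, the placeholder `run_ranking_algorithm`, and the call counter are all hypothetical stand-ins.

```python
# Hypothetical hand-maintained results for the most popular queries.
HARDCODED_RESULTS = {
    "google.com": ["https://www.google.com/"],
    "gogle.com": ["https://www.google.com/"],
    "twitter": ["https://twitter.com/"],
}

algorithm_calls = {"count": 0}   # counts how often the expensive path runs

def run_ranking_algorithm(query):
    """Placeholder for the expensive, general-purpose algorithm."""
    algorithm_calls["count"] += 1
    return ["https://results.example/?q=" + query.replace(" ", "+")]

def answer_query(query):
    """Serve popular queries from the fixed table; run the algorithm otherwise."""
    key = query.strip().lower()
    if key in HARDCODED_RESULTS:
        return HARDCODED_RESULTS[key]
    return run_ranking_algorithm(key)

print(answer_query("Gogle.com"))        # served from the table
print(answer_query("tornado report"))   # falls through to the algorithm
print(algorithm_calls["count"])
```

Because so few distinct queries account for so much traffic, even a short table like this would spare the algorithm a large share of its work.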
In some ways, then, the algorithms that search engines use to understand language efficiently—with their rules and their exceptions—seem to resemble language itself. We have implicit grammatical “rules” that let us make sense of combinations of words we’ve never heard before. Yet we also have idioms like “kicked the bucket” or “balls to the wall”—fixed phrases whose meanings are not understood compositionally (one cannot “kick the bucket very hard,” for instance). And while we’ve all internalized combinatorial rules like add –ed to a verb to form the past tense, we’re still quick to understand words like bent, ate, and understood.
I’ll include the rest of my conversation with Mailhot in a subsequent post: how are search engines using what we know about language to discern intent from even the most cryptic of queries?