There’s a reason we don’t communicate using keywords. Imagine Hamlet’s third-act soliloquy reduced to: Question nobler mind suffer slings arrows outrageous fortune arms sea troubles opposing end—obviously no profundity and eloquence here, never mind logic and comprehensibility.
Yet the search engines we use to scour the web for information make sense of such fragments all the time. “Search engines like Google and Yahoo have taught people implicitly to enter queries as unstructured strings of nouns rather than as syntactically well-formed questions,” says Fred Mailhot, a computational linguist with the search company Ask.com.
The typical query runs just three or four words, he tells me, which “rapidly breaks our ability to actually parse these queries.” Long gone are function words like “is,” “on,” or “they.” Historically, these words, which carry little explicitly meaningful content, have not provided a good indication of relevance; they are just as likely to appear on the web pages we want as on the ones we don’t. Yet as my butchered Shakespeare rendition demonstrates, they provide information critical for making sense of language.
Thus, search engines like Ask.com would like to “train people to enter queries that have a bit more structure,” says Mailhot, allowing their algorithms to make use of how the words are strung together.
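To see how little survives once function words are discarded, here is a minimal sketch in Python. It is not how Ask.com or any other search engine actually works; the word list and the `to_keywords` helper are made up for illustration, and real systems use far larger lists and more careful tokenizing.

```python
import string

# Hypothetical, hand-picked list of function words (an assumption for this
# sketch; production stop-word lists are much longer).
FUNCTION_WORDS = {
    "and", "be", "in", "is", "not", "of", "on", "or",
    "that", "the", "they", "tis", "to", "whether",
}

def to_keywords(query: str) -> list[str]:
    """Lowercase each word, trim surrounding punctuation, drop function words."""
    tokens = [word.strip(string.punctuation).lower() for word in query.split()]
    return [t for t in tokens if t and t not in FUNCTION_WORDS]

line = ("Whether 'tis nobler in the mind to suffer "
        "the slings and arrows of outrageous fortune")
print(" ".join(to_keywords(line)))
# prints: nobler mind suffer slings arrows outrageous fortune
```

What remains is a bag of content words with no record of who does what to whom, which is the information a more structured query could restore.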
What might be gained here? Studies of natural-language processing in humans give us some ideas. Consider the difference between the active sentence “The Double R Diner bakes the best cherry pie” and the passive sentence “The best cherry pie is baked by the Double R Diner.” Ostensibly, the two sentences cover much the same territory (and they certainly share keywords). Yet in the first sentence, the Double R Diner is the subject, a position of prominence: people tend to focus on the subject and remember it better than other parts of the sentence. In the second sentence, the best cherry pie is the subject.
This difference in emphasis becomes even more apparent for questions. Consider “Does the Double R Diner bake the best cherry pie?” and “Is the best cherry pie baked by the Double R Diner?” Had these sentences been entered as queries, a linguistically savvy search engine might intuit that the first user has ever so slightly more interest in the Double R Diner, while the second user’s interest leans toward cherry pie, and display results accordingly.
But although such interpretive feats are simple enough for people, they’re pretty hard to carry out algorithmically. (Indeed, search engines struggle even to tag words as nouns or verbs, the first step toward identifying the subject of a sentence.) And infuriatingly, even an algorithmic rule like “Upweight the subject of the sentence” would be insufficient. Consider queries like “Tell me about Spanish politics” or “Are there ghosts in graveyards?” For these sentences, the topics (Spanish politics and ghosts) are not in the subject position at all.
Still, the rewards tantalize. The ability to identify and assign more prominence to the topic of a sentence would, for instance, assist search algorithms in understanding pronouns, which also tend to be topical.
So when will this brave new world of conversational search engines arise? Mailhot suspects sooner rather than later, thanks to features such as Google’s Voice Search: “My suspicion is that typed queries won’t ever become as structured as spoken ones,” he says. But this “won’t matter as searching happens more and more via mobile interfaces, where voice is the natural medium.” Take, for example, the iPhone’s Siri.
Search users (and the advertisers who hope to lure them) won’t be the only ones who benefit from search engines capable of parsing speechlike queries. The science behind search may change how linguists view natural language. For one, linguists may move away from modeling language using formal grammars, Chomskyan or otherwise. “What you end up finding is that the things that people do with language very rarely fit into the formal grammar that you carved out from the outset,” says Mailhot. “The data are what they are and people do what they do and the best strategy is to make inferences based on what people do rather than carve something out ahead of time and shoehorn the data into it.”
This hit home for Mailhot while he was briefly working on a project using data from Twitter, where “people just do the absolute most fascinating stuff with language.” Take the ironic construction “X much,” as in “Jealous much?” or “Hypocritical much?” “There are no grammars that will give you something out of this,” he says, “but you have to know that X much is a thing that people are doing.”
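For the curious, spotting the construction in raw text can be sketched with a surface pattern rather than a grammar. This is only an illustration, not Mailhot’s method; the `find_x_much` helper and the list of ordinary words that precede “much” are assumptions, and a pattern this crude will miss plenty of cases and misfire on others.

```python
import re

# Hypothetical list of words that ordinarily come before "much" and so do not
# signal the ironic construction (an assumption for this sketch).
ORDINARY = {"as", "how", "pretty", "so", "that", "this", "too", "very"}

# One word, a space, then "much?" with the question mark required.
PATTERN = re.compile(r"\b([A-Za-z]+) much\?", re.IGNORECASE)

def find_x_much(text: str) -> list[str]:
    """Return each X found in an 'X much?' fragment, skipping ordinary uses."""
    return [
        match.group(1)
        for match in PATTERN.finditer(text)
        if match.group(1).lower() not in ORDINARY
    ]

print(find_x_much("Jealous much? How much? Hypocritical much?"))
# prints: ['Jealous', 'Hypocritical']
```

The pattern knows nothing about syntax; it simply records what people do, which is the data-first strategy Mailhot describes.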
Jessica Love’s next post will appear on July 11.