Wednesday, September 7, 2011

Google CSE Auto Completions

The emergence of universal (i.e. non-contextual), multilingual and, as a consequence, algorithm-based search engines has made it difficult to standardize the indexing of web pages with a controlled vocabulary. First, search engine indexing rests on the terminology chosen by writers with varying degrees of expertise (in metadata, full text, or even anchor text) and is therefore inconsistent. Second, the scope of these search engines increases the likelihood of homographs from different disciplines, languages, or even dialects. As a result, general search engines may return more serendipitous results. (On the other hand, they may bridge the gap between experts' and users' terminology.)

Preferred Terms and Qualifiers

Historically, the most common terms in a given field were chosen as the "preferred terms", and ambiguous terms were "standardized" with qualifiers. For example, Wikipedia uses the qualifiers "fruit" and "colour" to represent two meanings of the word "orange" (i.e. the terms "Orange (fruit)" and "Orange (colour)" designate two interpretations of the word "orange"). DuckDuckGo (DDG) also uses Wikipedia's terms to refine one-word queries. For instance, if we type "orange" in DDG's search box, we get a list of the Wikipedia terms that this word can represent. However, in the realm of full-text search there are only keywords and key phrases. Thus the query "Orange (colour)" will return links to web pages containing the keywords "orange" AND "colour" (in most search engines, parentheses only designate the order of operations in a Boolean expression). In other words, DDG uses quasi-qualifiers as a rough filter to limit search results to the appropriate context.
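To illustrate how a qualified term loses its structure in full-text search, here is a minimal Python sketch. The function names are hypothetical and the tokenization is a deliberate simplification; it is not how DDG or any particular search engine parses queries.

```python
import re

def qualified_term_to_keywords(term):
    """Split a qualified term such as 'Orange (colour)' into lowercase keywords.

    Hypothetical helper: parentheses are treated only as grouping and are
    dropped, so the qualifier simply becomes one more keyword.
    """
    return [word.lower() for word in re.findall(r"\w+", term)]

def as_boolean_query(term):
    """Join the keywords with AND (the implicit operator assumed in this sketch)."""
    return " AND ".join(qualified_term_to_keywords(term))

print(as_boolean_query("Orange (colour)"))  # orange AND colour
```

The qualifier survives only as an extra keyword, which is why it acts as a rough context filter rather than as a true disambiguator.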

Qualifiers enclosed in parentheses may seem odd to many people. Moreover, the NISO guidelines recommend avoiding qualifiers as much as possible and using unambiguous, precise terms instead. For instance, in the previous post it was noted that "inlinks" is a synonym of "backlinks". However, "inLinks" is also the name of a contextual advertising service. (Google queries are not case-sensitive, so the brand name is equivalent to "inlinks".) We could use the term "inlinks (advertising)" to refer specifically to this concept (i.e. to this interpretation of the word "inlinks"). However, the expression "inlinks text ads" would be a more natural term (in fact, expressions are one of NISO's alternatives to qualifiers).

Auto Completions and Homographs

When we deal with topical search engines, the homograph problem is less severe, since we know the context of the terms (i.e. the topic of the search engine). More specifically, custom search engines produce results mainly by limiting their search to lists of whole or partial websites whose content relates to their declared topic.

For instance, if we search for "PR" in a Search Engine Optimization (SEO) search engine, we probably won't get results for "Pattern Recognition" or "Puerto Rico". On the other hand, we will get results for both "Page Rank" and "Public Relations". Google CSE doesn't have a special feature for dealing with homographs. Thus, in the case of acronyms like "PR", a good workaround is to add the expressions "Page Rank" and "Public Relations" to the 'Auto Completions' section in the 'Control panel', so that when users type the letter 'P' they get these two expressions at the top of the suggestion list. (This works because manually entered suggestions take precedence over Google's algorithm-made suggestions, no matter what scores you assign to them.)
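The precedence rule described above can be modelled with a short Python sketch. This is only a simplified illustration of the behaviour as described in this post, not Google's implementation; the function `suggest` and the algorithmic suggestions in the sample data are made up for the example.

```python
def suggest(prefix, manual, algorithmic, limit=5):
    """Return suggestions that start with `prefix`, manual entries first.

    Simplified model: manual entries always outrank algorithmic ones,
    regardless of any score, and case-insensitive duplicates are dropped.
    """
    prefix = prefix.lower()
    seen = set()
    ranked = []
    for term in list(manual) + list(algorithmic):
        key = term.lower()
        if key.startswith(prefix) and key not in seen:
            seen.add(key)
            ranked.append(term)
    return ranked[:limit]

# Manual entries from the "PR" example; the algorithmic list is invented.
manual = ["Page Rank", "Public Relations"]
algorithmic = ["paid links", "page rank", "penalty recovery"]

print(suggest("p", manual, algorithmic))
```

With this sample data, typing "p" puts "Page Rank" and "Public Relations" ahead of the algorithmic suggestions, and the duplicate "page rank" is dropped.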

'Autocompletions' is a good candidate for dealing with homographs, because it lets the user choose between similar expressions. However, finding the appropriate expressions is not always trivial. Going back to the "inlinks" example, we could suggest the expression "inlinks text ads" to refer to the advertising service. However, there seems to be no expression that starts with the word "inlinks" and refers to "inlinks" as a kind of hyperlink. Of course, we could use the term "inlinks (hyperlinks)" as a reference, but both the syntax and the qualifier may be unclear to the average user. Instead, we can use related terms (RT) like "inlinks quality", "inlinks analysis", and "inlinks anchor text" to approximate the intention of the user who types the word "inlinks" in the search box.

Additional Practical Advice

Some people may avoid using the 'Autocompletions' feature because of the lack of control over the algorithm-based suggestions. As I mentioned earlier, Google places the manually entered suggestions at the top of the suggestion list. Google CSE also lets the owner manually exclude specific suggestions and even suggestion patterns. However, blocking unwanted suggestions can be an exhausting, endless task, so in my opinion a more productive approach is to use the output of the 'Autocompletions' algorithm to trim the CSE 'Included sites' list.

For example, my SEO search engine has a 'refinement' that focuses the search query on the webmaster guidelines provided by several major search engines. When I tried to analyze some of the irrelevant suggestions on this search engine, I realized that the vast majority of the results for these queries came from a single website. Its operator discontinued web crawling almost a year ago, so it is unclear whether its webmaster guidelines are still relevant. Moreover, most of that website consists of Q&A content that is not necessarily related to SEO. Consequently, including this content in the CSE generated irrelevant suggestions. Additional irrelevant suggestions came from websites with aggressively promoted services. After I excluded the spam-producing websites and limited some websites' results to specific subdomains or subfolders, my search engine's suggestions became more relevant.
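The kind of analysis described above can be sketched in a few lines of Python. This assumes you have already collected, by hand or otherwise, the result URLs returned for each irrelevant suggestion; the function name and the sample data (placeholder example.com hosts) are hypothetical and do not come from my search engine.

```python
from collections import Counter
from urllib.parse import urlsplit

def hosts_behind_suggestions(results_by_suggestion):
    """Count how often each host appears in the results of flagged suggestions."""
    counts = Counter()
    for urls in results_by_suggestion.values():
        for url in urls:
            host = urlsplit(url).hostname
            if host:  # skip strings that do not parse as URLs with a host
                counts[host] += 1
    return counts

# Placeholder data: irrelevant suggestion -> result URLs it produced.
flagged = {
    "cheap flights": ["http://qa.example.com/q/1", "http://qa.example.com/q/2"],
    "celebrity news": ["http://qa.example.com/q/3", "http://ads.example.net/promo"],
}

for host, count in hosts_behind_suggestions(flagged).most_common():
    print(host, count)
```

Hosts at the top of this list are the first candidates to exclude from the 'Included sites' list or to restrict to a relevant subdomain or subfolder.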

Finally, it is important to remember that updating the autocompletion database can take 24 to 48 hours. Thus, the effect of changes to the 'Included sites' list, such as deleting websites, changing a website's subfolder, or adding new websites, should be examined only after this period has passed.

