Tuesday, December 30, 2014

Finland's Musicinfo seeks to be world's largest music search engine

cds hdr by piddy77, on Flickr
Creative Commons Attribution 2.0 Generic License by piddy77

From YLE.fi:
“People look for something online for a maximum of ten seconds and then give up their search as fruitless. There is an opportunity there. People want one spot that provides fast and convenient service, and that is what we offer,” he says.

The Finnish company’s first goal was to create the world’s largest music database. It has since succeeded in this endeavour, winning many awards along the way.
"Finland's Musicinfo seeks to be world’s largest music search engine", Dec 21, 2014, Available from: http://yle.fi/uutiset/finlands_musicinfo_seeks_to_be_worlds_largest_music_search_engine/7701989 (Accessed Dec 30, 2014)

New Booking Site Hotelwatchdog Wants to Change How You Pick Hotels

By Paul Brady

From Condé Nast Traveler (via Hospitality Net):
Ever been overwhelmed when looking for a hotel in a huge city filled with options like New York or Paris? The flight deal-finding site Airfarewatchdog has a new way to filter an overwhelming amount of hotel information with its newly launched spinoff site Hotelwatchdog.

Here’s how it works: Rather than present a huge list of hotels and a seemingly endless array of filters, Hotelwatchdog automatically picks the top 20 properties in a given destination and then lets you make further refinements from there. How do they arrive at the top 20? The site, which is part of TripAdvisor, parses thousands of guest reviews as well as historical price information to determine which properties offer the best value. The aim, site spokesman George Hobica says, is to save consumers time by only putting forward the very best (though not necessarily the most expensive) hotels.
Paul Brady (2014), "New Booking Site Hotelwatchdog Wants to Change How You Pick Hotels",  CNTraveler.com, November 12, 2014, Available from: http://www.cntraveler.com/stories/2014-11-12/new-booking-site-hotelwatchdog-wants-to-change-how-you-pick-hotels (Accessed Dec 30, 2014)

Tuesday, October 15, 2013

Healthcare-Related Search Engines

DC Health Week Code-a-Thon 13121 by tedeytan, on Flickr
Creative Commons Attribution-Share Alike 2.0 Generic License by tedeytan

At one stage of search engines’ history, consumer health search engines were considered true rivals of horizontal search engines. Taking into account the demand for online health information and the online advertising budgets of the pharmaceutical industry and health-care institutions, these search engines had a profitable business model. Moreover, Microsoft acquired one of these specialized search engines for its then-new horizontal search engine, Bing.

However, in the long run, search engines like Healia, Kosmix, and MammaHealth made room for health information portals that offer comprehensive consumer health information. These portals usually include an internal search engine with keyword-based or symptom-based search options (see the CHAPIS top 100 list for the most authoritative portals). This way they can gain more control over content reliability and provide a unified experience for their users.

From a business perspective, these portals can keep users in their own domain and make money from search as well as from content. Moreover, internet users may still use their favorite general search engine as a vehicle to reach these websites.

Taking these into consideration, it's not surprising that nonprofit organizations undertake the development of consumer health search engines.

MedlinePlus is both a health portal and a specialized search engine. Established by the U.S. National Library of Medicine (NLM), this vertical search engine includes references to websites that comply with its quality guidelines. MedlinePlus’ user interface (UI) includes a spell checker and refinement options in its left column. In addition, it has Spanish and mobile versions.

HONsearch is a not-for-profit search engine built on Google CSE, with English, Spanish, French, German, and Polish versions. This multilingual CSE is based on websites certified by Health On the Net (HON). Its UI includes features like a spell checker and refinements based on the consumer’s gender and age. As opposed to MedlinePlus, HONsearch doesn't have a mobile website.

Last but not least, HealthMash is a semantic health-related search engine that uses a top-down approach to enhance its search results. HealthMash utilizes WebLib’s proprietary Health Knowledge Base to suggest to its users related concepts, health concerns, tests and treatments based on their own keywords. By default HealthMash displays search results only from trustworthy websites; however, it also has unfiltered web search options. Additionally, HealthMash has Android and iPhone applications. (BTW, HealthMash's business model is licensing its semantic federated search to libraries, publishers, and governmental organizations.)

Saturday, July 6, 2013

Google CSE for Open Access Web Catalogs

The rise of Google's PageRank algorithm made human-powered web catalogs like Yahoo! Directory unprofitable. Although over the years there were some attempts to incorporate human curation into the search engines’ evaluation process, algorithm-based search engines are still dominant in this market.

Nevertheless, in the academic world, human-based catalogs occupy a place of honor since, as opposed to algorithm-based search engines, they may guarantee the quality of their resources. In other words, while listing in major search engines is open for every website, listing in an academic catalog relies upon professional judgment. Now, although Google has made great efforts to enhance its search quality, it is still far from being perfect. Moreover, Google's definition of quality is much broader.

"Open Access" Catalogs

The term "open access" (OA) is used to describe scholarly articles that are available on the web without any access fee. One of the advantages of OA resources is that search engines may index their full text. I would like to borrow this term for describing web catalogs that have opened their metadata for web users and crawlers through metadata web-pages.

For example, Temoa is a multilingual catalog of Open Educational Resources (OER) that is curated by an academic community. When one searches Temoa for OERs, he/she may see "ten blue links". However, clicking on one of the links will lead not to the resource directly, but to the resource’s metadata page that includes a hyperlink, abstract, keywords, etc. Globe, on the other hand, supplies a direct hyperlink for its resources (some of the metadata is integrated into the results page).

Google CSE for OA Catalogs

While almost every web catalog has an internal search feature, searching these catalogs through one of the major search engines like Google or Bing may have some benefits. First, their UI is familiar to many users. Second, they enable highly complex queries through Boolean searching, nesting, and superior query length limits. Third, almost all of these search engines have an advanced spelling-correction mechanism. Finally, the speed of major search engines is usually greater than that of the catalogs’ internal search.

Nevertheless, web catalogs usually implement faceted search through filters on the left pane of their user interface (UI). This feature can be imitated in general search engines through advanced operators only partly, since their metadata record structure is much less detailed. In addition, new resources are usually indexed in the catalog before the respective web-page is crawled by the search engine, which means that the internal index is more up to date.

While we may use the "site:" operator to search in every OA catalog using one of the major search engines (e.g. [van gogh site:temoa.info]), a better idea would be to list all these catalogs in a Google Custom Search Engine (CSE). This search engine would enable users to formulate highly complex queries and search in these catalogs from a single UI. In addition, this custom-built search engine would be different from other CSEs in the sense that it would supply references to handpicked web-pages from various domains instead of referencing all the web-pages in a list of domains.
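The "site:" approach can be sketched as a small query builder. This is only an illustration: build_catalog_query is a hypothetical helper, and apart from temoa.info (taken from the example above) the domain used here is a placeholder, not a real catalog.

```python
def build_catalog_query(terms, domains):
    """Combine search terms with one or more site: restrictions."""
    if not domains:
        return terms
    sites = " OR ".join(f"site:{d}" for d in domains)
    if len(domains) == 1:
        return f"{terms} {sites}"
    # Parentheses group the OR-ed site: restrictions so the terms apply to all.
    return f"{terms} ({sites})"


# catalog.example.org is a placeholder domain used only for illustration.
print(build_catalog_query("van gogh", ["temoa.info"]))
print(build_catalog_query("van gogh", ["temoa.info", "catalog.example.org"]))
```

A query like this grows quickly as catalogs are added, which is one more reason to keep the list of catalogs inside a CSE instead of typing it each time.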


Although they are not identical to web catalogs, I included two OER repositories that specialize in open-source materials. Since in the open source world every material may be copied and distributed with the appropriate attributions, anyone may store such material on his own website, consequently making a reference catalog for offsite resources unnecessary.

Saylor, the first repository on the list, is managed by a non-profit organization that hires credentialed professors to locate, assess, and compile open-source OERs into comprehensive OpenCourseWares. Open Course Library, the second repository, specializes in college courses. However, not all the resources in the repository are complete. Some of the resources contain only some of the components that are required for a complete courseware, such as syllabi, course activities, readings, and assessments.

One important catalog that I excluded from my CSE is Merlot. This catalog uses a peer review process to rate its OER materials. However, not all of the resources in this catalog are rated. Actually, any person is allowed to contribute an OER reference to the catalog. To mitigate this, Merlot’s search results are ordered according to their rating by default. Therefore, the best way to find materials on Merlot is through its internal search engine.

Monday, May 13, 2013

Flights Search Engines Review

The entrance of global online travel agencies (OTA) revolutionized the worldwide travel industry. On the one hand, it enabled customers to book flights, hotels and other travel-related services by themselves, and save the commission of the human travel agent. On the other hand, it moved the challenge of looking for the best deal from the travel agent to the customer.

Travel comparison search engines, or travel aggregators, try to mitigate this issue by enabling customers to compare the offers of different OTAs and airlines with a single keystroke. However, like other algorithm-based entities, they are often better on the quantitative than the qualitative aspect of the product.

Flight aggregators suffer from unique problems. First, since flight markets behave like a stock exchange, most aggregators find it difficult to guarantee real-time flight quotes. Second, since airlines usually do not supply these aggregators with data about flight taxes & fees, they cannot calculate extra travel expenses.

Fly.com - The Real Time Aggregator

Fly.com is unique in the arena of flight aggregators in the sense that it can provide real-time quotes. It also declares that it searches budget airlines, a feature that may lower the flight price. In addition, Fly.com has some filtering options on the left pane which have almost become standard by now. These include filtering by price range, airlines, flight duration, stops, and take-off & landing time. It also uses an airline matrix to visualize the filtering.

Tip #1:  Fly.com has a partly advertorial feature that suggests comparing flight & hotel deals on the same dates on Expedia. While Expedia is not the only source for package deals,  this is one of the known strategies to lower the travel price.

Kayak.com -  The Flexible Search Engine

One of the features of Kayak.com that makes it ideal for travelers (as opposed to tourists) is the "flex dates". This feature includes a pre-search tool to find the cheapest dates for departure and return within a period of one month of the desired date (this tool is available only for round-trip searches). In addition, it allows users to widen their search to flights in a range of three days before and after the planned departure and return dates. Kayak also enables users to search for round trips for all the weekends in a given calendar month.

Kayak includes by default "Hacker Fares" which combine two one-way tickets from separate airlines. After choosing a Hacker Fare, Kayak will ask the user to confirm a disclaimer and then it will provide him with two links to book each ticket with its respective airline. BTW, OTAs often bundle tickets from separate airlines to generate a cheaper round trip. (These bundles are usually symbolized by a monochromatic airplane icon in flight search engines.) This way the user may book both tickets on the same website and avoid the risk of booking a ticket on one site while the tickets on the other site may have been sold or their price may have increased.

Finally, Kayak enables users to widen search results with flights within a 70-mile radius of the departure and return airports. One may also search for flights between two clusters of up to four airports within a radius of 200 miles of a given departure and return airport. This feature is useful especially when performing a "multi-city" or complex itinerary search.

Tip #2: Kayak’s pre-search calendar displays the cheapest flights for each day in the selected month. However, these values are dynamic and may change from one search to another. Moreover, these prices by no means represent the average prices for these dates. Therefore, in order to “catch” the ideal days for the flight, one may select departure and return dates that are in the middle of successive low-fare days and then expand the results to three days before and after these dates.
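The idea in Tip #2 of picking a date in the middle of successive low-fare days can be sketched roughly as follows. The fares and the threshold are made up, and middle_of_longest_low_run is a hypothetical helper; it says nothing about how Kayak itself computes its calendar.

```python
def middle_of_longest_low_run(fares, threshold):
    """Return the index of the middle day in the longest run of fares <= threshold.

    Returns None if no fare is at or below the threshold.
    """
    best_start, best_len = None, 0
    start = None
    # A sentinel fare above any threshold closes a run that reaches the last day.
    for i, fare in enumerate(list(fares) + [float("inf")]):
        if fare <= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


# Invented cheapest fares for days 1..10 of a month (index 0 is day 1).
fares = [420, 310, 300, 305, 298, 315, 450, 299, 460, 470]
index = middle_of_longest_low_run(fares, threshold=320)
print("Suggested day of month:", index + 1)
```

Searching around the suggested day with a window of three days before and after then covers most of the low-fare run, even if individual prices move between searches.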


While the previous search engines focus on the cheapest price, Momondo tries to find the balance between money & time savings. True, one may use Fly’s and Kayak’s filters to limit flight duration, layover and stops through their left pane options. However, as we already know, most users usually overlook these options due to the great number of filtering selections.

By default, Momondo’s rating balances between money & time savings. However, one may adjust the slide above the search results to change the ratio between the two. Also, it is important to notice that by default Momondo sorts its results by price, yet it advises the users to sort the results by rating by clicking the 'show rating' button, which adds a 'rating' column to the search results and sorts them by rating values.
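Momondo does not document how its rating is calculated, so the following is only a guess at the general idea of a slider that trades price against time. The function rate_flights, the price_weight parameter, and the sample flights are all assumptions made for illustration.

```python
def rate_flights(flights, price_weight=0.5):
    """Rate flights from 0 (worst) to 1 (best), mixing price and duration.

    price_weight plays the role of the slider: 1.0 means only price matters,
    0.0 means only duration matters.
    """
    def normalise(values):
        lo, hi = min(values), max(values)
        # Lower is better, so the cheapest or shortest flight scores 1.0.
        return [1.0 if hi == lo else (hi - v) / (hi - lo) for v in values]

    price_scores = normalise([f["price"] for f in flights])
    time_scores = normalise([f["hours"] for f in flights])
    rated = [
        {**f, "rating": price_weight * p + (1 - price_weight) * t}
        for f, p, t in zip(flights, price_scores, time_scores)
    ]
    return sorted(rated, key=lambda f: f["rating"], reverse=True)


# Invented flights: A is cheapest but slow, B is fastest but expensive.
flights = [
    {"id": "A", "price": 400, "hours": 30},
    {"id": "B", "price": 520, "hours": 14},
    {"id": "C", "price": 450, "hours": 16},
]
for flight in rate_flights(flights, price_weight=0.5):
    print(flight["id"], round(flight["rating"], 2))
```

With a balanced weight the middle option C comes first, while moving the weight fully towards price or time brings A or B to the top respectively.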

Momondo, as its developers state, aggregates a long list of budget airlines and OTAs.  Combined with its rating system, this engine may display quality low-fare flights on top of its search results.

Tip #3: One may search for the best rated flights under conditions like the overall flight time, departure & arrival times, and airlines, by using the slides and check-boxes on the left pane.


TripAdvisor is a social travel-rating website that is known for its large community. Despite the criticism of its review moderation, it is still one of the largest travel websites in terms of reviews and visitors. The uniqueness of TripAdvisor’s flight search is the integration of its social airline rating system into the search results. First, by hovering over the airline logo in the search results, one may see the airline’s rating. Second, one may filter flights in the search results by specifying the minimum airline rating in the left pane.

In addition, TripAdvisor has some of the features of the previous search engines. First, by default, it expands the results with flights in the range of a day before and after the departure and return dates. This mild tweak can save the user from booking expensive flights on crowded days, without changing the schedule dramatically. Second, TripAdvisor displays remarks about schedule or layover deviations in red. Third, it automatically includes flights to airports near the resulting airports. Fourth, it has a 'best value' button that sorts search results by weighing price, time and other undocumented factors. Finally, it has a baggage fees calculator that adds estimated baggage fees to the search results. All these features make TripAdvisor ideal for people who seldom fly and want to find quality flights at a reasonable price without the need to conduct detailed research.

Tip #4:  If one is flexible, he/she may optimize departure and return dates with Kayak’s "flexible dates" pre-search tool and then search for these dates in TripAdvisor.

Final Thoughts

Although every search engine in this list has some features to lower the airfare, it seems that right now none of them can beat a skilled searcher. In particular, none of these search engines can find cheaper alternative routes to the desired destination. Actually, a crowd-sourcing site like FlightFox may get better results than all of these search engines.

In order to break the itinerary into smaller segments, one must open at least two windows -- one for the origin and one for the destination -- and then explore all of the reasonable intermediate stations between them. For example, if one wants to find a cheap route from Sydney to Berlin he/she may open one window with Sydney as the origin and one with Berlin as the destination and then fill the blank destination and origin in the respective windows with cities like Bangkok and New Delhi.
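The two-window routine above amounts to comparing the combined price of two separate tickets for each candidate stopover. A minimal sketch, assuming the fares have already been collected by hand into a table (all prices here are invented, and cheapest_split is a hypothetical helper):

```python
def cheapest_split(origin, destination, stopovers, fares):
    """Find the cheapest two-ticket route via one intermediate city.

    fares maps (from_city, to_city) to a price; missing pairs are skipped.
    Returns (stopover, total_price), or None if no split is possible.
    """
    best = None
    for city in stopovers:
        first = fares.get((origin, city))
        second = fares.get((city, destination))
        if first is None or second is None:
            continue
        total = first + second
        if best is None or total < best[1]:
            best = (city, total)
    return best


# Invented one-way prices, purely for illustration.
fares = {
    ("Sydney", "Bangkok"): 380,
    ("Bangkok", "Berlin"): 410,
    ("Sydney", "New Delhi"): 450,
    ("New Delhi", "Berlin"): 360,
}
print(cheapest_split("Sydney", "Berlin", ["Bangkok", "New Delhi"], fares))
```

The sketch ignores layover time and the risk of missing the second flight, which still have to be weighed by the searcher.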

In order to explore multiple routes without the clutter of outdated search results, it may be better to use a real-time search engine like Fly.com. In addition, one may combine well-filtered email alerts or Twitter follow-ups to obtain more real-time results. However, in order to mitigate the risks that come with low airfares, one should use tools like FlightStats and TripAdvisor’s airline ratings.

Thursday, January 12, 2012

An Alternative to Reading Lists?

One of the ideas behind creating a customized search engine for course reading materials is to let students explore the course topics in an arbitrary order. However, even massive open online course (MOOC) opponents may look for some sort of agenda that may take the form of daily newsletters to highlight distinguished thoughts. When using custom-built search engines, this agenda may be manifested outside the search engine’s interface, for example as links to highlighted queries, or be integrated into the engine's interface by using promotions or tweaking the ranking of the search results.

Although students are considered to be more search dominant, providing a browsable version of the reading list will make it more usable for link-dominant students. In addition, it may enhance their learning process by acquainting them with the instructor’s scheme of the course material, and may equip them with a web browsing skill that is often neglected nowadays (i.e. hierarchical navigation).

Now, this browsable reading list may be implemented as a vertical portal or as part of an institutional course portal. In any case, the hyperlinks to the course web-resources (e.g. web-pages, PowerPoint or video files) may be implemented manually or dynamically through URLs of highlighted query strings in the custom search engine or even in some general search engine. The latter will provide the students with references to relevant documents and at the same time supply them with examples of domain experts' queries that they may reformulate, or that may even inspire them to formulate new queries.
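A dynamic link of this kind is just the address of the search page with the query encoded in its query string. The sketch below assumes a hypothetical search page address and assumes that the page reads the query from a parameter named q; both need to be checked against the actual search engine.

```python
from urllib.parse import urlencode

# Hypothetical search page; replace with the real address of the course engine.
SEARCH_PAGE = "https://search.example.edu/course101"


def highlighted_query_link(label, query):
    """Return (label, url), where the url runs the given query on the search page.

    The parameter name 'q' is an assumption, not a documented setting.
    """
    return label, f"{SEARCH_PAGE}?{urlencode({'q': query})}"


reading_list = [
    highlighted_query_link("Week 1: Water conservation", "water conservation"),
    highlighted_query_link("Week 2: Waste hierarchy", '"waste hierarchy" recycle'),
]
for label, url in reading_list:
    print(label, "->", url)
```

Because the query is visible in the link, students can see how the instructor phrased it and adapt it for their own searches.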

Getting the most relevant results may be in the students' best interest even more than in their instructors'. The simplest way to control the course search engine results is to specify a detailed list of URLs to relevant resources. Although this approach may seem no better than a standard reading list, it may have some advantages like full-text search over the list of resources (except maybe for subscription materials, for which Google may have access only to the abstract web-page). This method may entail greater investment from the instructor and leave less room for exploration to the student.

An alternative method is to set a list of domain names (i.e. “websites”) and let the student search this space of domains. Although this approach may lead to incidental search results, it is less likely to yield empty results compared with the previous approach.

Google Custom Search Engine Settings

Google Custom Search Engine (CSE) has various means to customize the user’s search interface. These may help students to discover course-applicable resources and mitigate the effect of incidental results.

One way to improve search engine efficiency is to use auto-completions. These may be curated by the instructor or be generated solely by the platform's algorithm. Curated suggestions can integrate experts’ queries into the search engine interface. These queries may represent the instructor's endeavors to extract relevant resources over the listed domains space. Additionally, there is a new feature in Google CSE which assimilates promotions into the auto-completions management so that one may set a suggested query and a promoted URL that will be triggered by this query directly from the 'Autocompletions' section (i.e., 'Control Panel'–>'Autocompletions'–>'Autocompletion Promotions' ).

Another way to gain control over the search results is to set synonym expansions. Although it is difficult to predict what terms students will use and how these terms may relate to the domain's professional terms, search logs or even students’ works from previous related courses may provide a hint for necessary synonym expansions (in any case it is recommended to confirm that Google isn't utilizing these synonyms already). Tracking students' search behavior may raise some ethical issues, though librarians track their patrons' (i.e. the institute’s) database search logs as a common practice. It is also worth mentioning that Google CSE has two levels of users' search behavior tracking. By default it reports only the top search queries; to get a detailed report one should ask explicitly to track users' search history. Additionally, current search logs may serve as an opportunity to enrich students’ professional vocabulary and clarify essential terms (directly or indirectly).
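To make the effect of a synonym expansion concrete, here is a rough sketch of what happens to a student's query once a synonym list is applied. It only imitates the outcome with an OR-group; it does not describe how Google CSE implements synonyms, and the function name and the word list are invented.

```python
def expand_query(query, synonyms):
    """Replace each known term with an OR-group of the term and its variants."""
    expanded = []
    for word in query.split():
        variants = synonyms.get(word.lower())
        if variants:
            expanded.append("(" + " OR ".join([word] + variants) + ")")
        else:
            expanded.append(word)
    return " ".join(expanded)


# Invented mapping from an everyday student term to professional terms.
synonyms = {"garbage": ["waste", "refuse"]}
print(expand_query("garbage recycling", synonyms))
```

A list like this could be drafted from the terms that show up in search logs but not in the course resources.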

Using Google CSE 'Keywords'

Although apparently a minor feature, CSE 'Keywords' may help the instructor to confirm that students are reading resources which correlate to the course’s key concepts (significant especially for entry level classes). Also, the 'keywords' property may serve as an alternative to expert's personalized keywords in match-word profile algorithms. To set the CSE keywords one should go to 'Control panel'->'Basics' in the GUI or add them to the 'CustomSearchEngine' element in the 'context file'.

Suppose, for example, we want to set up a CSE for an “environmental protection” course. First we would identify the core principles behind this issue. In our case they may be: sustainability, waste hierarchy, and conservation. Then we may set the principles that derive from them in the 'Keywords' field. Here is the outline of our imaginary course principles:
  1. Sustainability
    1. Sustainable energy
    2. Sustainable agriculture
    3. Sustainable design
  2. Waste hierarchy
    1. Reduce
    2. Reuse
    3. Recycle
    4. Recovery
    5. (Disposal)
  3. Conservation
    1. Water conservation
    2. Energy conservation
    3. Conservation biology
    4. (Geoengineering)
Now, Google's CSE keywords are limited to a maximum of 100 characters (including spaces between keywords and quotation marks for key-phrases). Thus, we may use the keyword “sustainable” to represent all the principles that derive from the “sustainability” core principle. Next, we will add the key-phrase “waste hierarchy” and its related keywords. However, we may want to eliminate the term “disposal” because it is the least-recommended action (i.e. it is on the bottom of the hierarchy). Alternatively, we may use the term “incineration” which is less environmentally destructive. Additionally, the instructor may use the term “Geoengineering” according to his world-view or to the course's goals.
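Since the 100-character budget counts spaces and quotation marks, it helps to assemble the field value and measure it before pasting it into the control panel. A minimal sketch, with hypothetical helper names and one possible selection of keywords from the outline above:

```python
MAX_KEYWORD_CHARS = 100  # the limit mentioned above


def keywords_field(keywords):
    """Join keywords with spaces, quoting multi-word key-phrases."""
    return " ".join(f'"{k}"' if " " in k else k for k in keywords)


def fits_limit(keywords):
    """Check the assembled field against the character limit."""
    return len(keywords_field(keywords)) <= MAX_KEYWORD_CHARS


# One possible selection; "disposal" is left out as discussed above.
course_keywords = [
    "sustainable", "waste hierarchy", "reduce", "reuse",
    "recycle", "recovery", "conservation",
]
field = keywords_field(course_keywords)
print(field)
print(len(field), "characters, fits:", fits_limit(course_keywords))
```

This selection uses 72 of the 100 characters, which leaves room for a term such as “incineration” or “geoengineering”.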

Of course, we may use alternative sets of keywords for the same course or even use distinct sets for various courses over the same set of websites. This may be achieved by referring to the same external annotations file from every course context file. In any case, students who submit one-word queries or use improper terms may still get valuable articles that deal with the most important aspects of the course (though not necessarily corresponding to the intent of the student).

Wednesday, September 7, 2011

Google CSE Auto Completions

The emergence of universal (i.e. non-contextual), multilingual, and as a consequence algorithm-based search engines made it difficult to standardize the indexing of web-pages with controlled vocabulary. First, search engine indexing rests on the terminologies set by writers with various degrees of expertise (either in metadata, full text, or even anchor text) and is therefore inconsistent. Second, the scope of these search engines increases the possibility of homographs from different disciplines, languages, or even different dialects. As a result, general search engines may show more serendipitous results. (On the other hand, they may fill the gaps between experts' and users' terminology.)

Preferred Terms and Qualifiers

Historically, the most common terms in a given field were chosen as the "preferred terms", and ambiguous terms were "standardized" with qualifiers. For example, Wikipedia uses the qualifiers "fruit" and "colour" to represent two meanings of the word "orange" (i.e. the terms "Orange (fruit)" and "Orange (colour)" designate two interpretations of the word "orange"). Also, DuckDuckGo (DDG) uses Wikipedia's terms to refine one-word queries. For instance, if we type "orange" in DDG's search box, we will get a list of Wikipedia's terms that are represented by this word. However, in the realm of full-text search there are only keywords and key phrases. Thus the query "Orange (colour)" will return links to web-pages with the keywords "orange" AND "colour" (in most search engines, parentheses designate the order of operations in a Boolean expression). In other words, DDG uses quasi-qualifiers as a rough filter to limit search results to the appropriate context.

Qualifiers that are enclosed in parentheses may seem odd to many people. Moreover, the NISO guidelines recommend avoiding qualifiers as much as possible, and using unambiguous and precise terms instead. For instance, in the previous post it was noted that "inlinks" is one synonym of "backlinks". Nevertheless, "inLinks" is also the name of a contextual advertising service. (Google queries are not case-sensitive so the brand name is equal to "inlinks".) Now, we could use the term "inlinks (advertising)" to refer specifically to this concept (i.e. to this interpretation of the word "inlinks"). However, the expression "inlinks text ads" would be a more natural term (actually, expressions are one of NISO's alternatives to qualifiers).

Auto Completions and Homographs

When we deal with topical search engines the difficulty of homographs is less severe since we know the context of these terms (i.e. the topic of this search engine). More specifically, custom-search engines produce results mainly by limiting their search results to lists of whole or partial websites with content related to their declared topic.

For instance, if we search "PR" in a Search Engine Optimization (SEO) search engine, we probably wouldn't get results for "Pattern Recognition" or "Puerto Rico". On the other hand, we would get results for "Page Rank" and "Public Relations". Now, Google CSE doesn't have a special feature to deal with homographs. Thus in the case of acronyms like "PR", a good workaround would be to add the expressions "Page Rank" and "Public Relations" to the 'Auto Completions' section in the 'Control panel' so that when the user types the letter 'P' he/she would get these two expressions on the top of the suggestion list (that is because the manually-entered suggestions have a preference over Google's algorithm-made suggestions, no matter what scores you assign to them.)
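The ordering described above, with manually entered suggestions shown before the algorithm-made ones, can be imitated in a few lines. This is only a model of the observed behavior, not Google's implementation; the function name and the algorithmic suggestions are invented.

```python
def suggest(prefix, manual, algorithmic, limit=5):
    """Return suggestions matching a prefix, listing manual entries first."""
    wanted = prefix.lower()
    seen = set()
    results = []
    # Manual entries are scanned first, so they always precede algorithmic ones.
    for source in (manual, algorithmic):
        for suggestion in source:
            key = suggestion.lower()
            if key.startswith(wanted) and key not in seen:
                seen.add(key)
                results.append(suggestion)
    return results[:limit]


manual = ["Page Rank", "Public Relations"]
algorithmic = ["paid links", "page rank", "penalty recovery"]  # invented examples
print(suggest("p", manual, algorithmic))
```

The duplicate "page rank" from the algorithmic list is dropped, so the user sees each expression once, with the two curated expansions of "PR" on top.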

'Autocompletions' is a proper candidate for dealing with homographs, because it allows the user to choose between similar expressions. However, finding the appropriate expressions is not always trivial. Going back to the "inlinks" example, we could suggest the expression "inlinks text ads" to refer to the advertising service. However, it seems that there is no expression that starts with the word "inlinks" and can refer to "inlinks" as a kind of hyperlink. Of course, we could use the term "inlinks (hyperlinks)" as a reference, but the syntax as well as the qualifier may be unclear to the average user. Instead, we can use related terms (RT) like "inlinks quality", "inlinks analysis", and "inlinks anchor text" to possibly represent the intention of the user who types the word "inlinks" in the search box.

Additional Practical Advice

Some people may avoid using the 'Autocompletions' feature, due to lack of control over the algorithm-based suggestions. As I mentioned earlier, Google places the manually-made suggestions on top of the suggestions list. Also, Google CSE enables the user to manually exclude specific suggestions and even suggestion patterns. However, blocking unwanted suggestions may be an exhausting, endless task, so in my opinion, a more productive approach would be to use the 'Autocompletions' algorithm to trim the CSE 'Included sites' list.

For example, my SEO search engine has a 'refinement' that focuses the search query on the webmaster guidelines provided by several major search engines. When I tried to analyze some of the irrelevant suggestions on this search engine, I realized that the vast majority of the results for these queries came from the Ask.com website. Now, Ask.com discontinued web crawling almost a year ago, so it is unclear if their webmaster guidelines are still relevant. Moreover, most of their website contains Q&A that are not necessarily related to SEO. Consequently, including this content in the CSE has generated irrelevant suggestions. Additional irrelevant suggestions came from websites with aggressively-promoted services. So after excluding spam-producing websites as well as limiting some websites’ results to specific subdomains or subfolders, my search engine suggestions became more relevant.
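The analysis I did by hand can be approximated by counting which hosts the results for an irrelevant suggestion come from. The sketch assumes the result URLs have been copied into a list; the URLs below are invented, and hosts_behind is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlparse


def hosts_behind(result_urls):
    """Count how many result URLs come from each host."""
    return Counter(urlparse(url).netloc.lower() for url in result_urls)


# Invented result URLs for one irrelevant suggestion.
urls = [
    "http://www.ask.com/question/a",
    "http://www.ask.com/question/b",
    "http://blog.example.com/seo-tips",
]
for host, count in hosts_behind(urls).most_common():
    print(host, count)
```

A host that dominates the counts for several irrelevant suggestions is a candidate for exclusion, or for narrowing to a specific subdomain or subfolder.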

Finally, it is important to remember that the update of the auto-completion database can take 24 to 48 hours. Thus, changes in the 'Included sites' list, like deleting websites, changing a website's subfolder, or even adding new websites, should be examined after this estimated time period.
