You are what you index

Twitter Summary: Search engines can only return results for the items they index. The more garbage they add, the more garbage their customers see.

Garbage In, Garbage Out

Search engines can only return results based on the content they index. The items indexed need to (a) match the user’s expectations and (b) be relevant, high-quality results. This seems obvious, but the implications for building high-quality search results are tremendous: a web site is quickly limited in the type of information it can return. In addition, the nature of the items indexed creates requirements for information quality that most search engines can’t completely control.

Prime examples of this are:

Search Engine     Items Indexed            Relevancy
Amazon            Things people can buy    Sales rank
Google            Web pages                PageRank
Twitter search    Tweets                   Chronological order
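
The pairing of one item type with one ranking signal can be sketched in a few lines of Python. This is only an illustration, not how any of these engines is built: `Item`, `ToyIndex`, the `kind` labels and the single numeric `signal` (assumed here to mean "higher is better") are all hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    kind: str      # hypothetical label, e.g. "product", "web_page", "tweet"
    text: str      # the searchable text of the item
    signal: float  # assumed ranking signal; higher sorts first


class ToyIndex:
    """A tiny in-memory index that accepts only one kind of item."""

    def __init__(self, accepted_kind: str) -> None:
        self.accepted_kind = accepted_kind
        self.items: list[Item] = []

    def add(self, item: Item) -> bool:
        # Refuse anything this engine's customers would not expect to see.
        if item.kind != self.accepted_kind:
            return False
        self.items.append(item)
        return True

    def search(self, query: str) -> list[Item]:
        # An item matches only if every query term appears as a word in it.
        terms = query.lower().split()
        hits = [
            item for item in self.items
            if all(term in item.text.lower().split() for term in terms)
        ]
        # Order matches by the engine's single ranking signal.
        return sorted(hits, key=lambda item: item.signal, reverse=True)
```

Nothing that fails `add` can ever come back from `search`, which is the point of the table: what goes into the index determines what can come out.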

These search engines will break the customer’s expectations if they begin returning results that the customer did not intend to see. If Amazon were to start returning web pages or Twitter “tweets” as part of its search results, it would entirely miss its customers’ expectations of what the search results should provide. If Google began returning a page full of “tweet” results and product information, it would certainly capture more information to index, but it would be unlikely to return results that satisfy the customer.

The relevancy of the results also becomes a limiting factor for these search engines, as they need to spend effort controlling for quality when their basic algorithms return inadequate results.

In Amazon’s case, the quality of results is hurt by manufacturers creating product names that contain typos, or by band names that were intentionally misspelled. It would be easier to just let the misspellings float through the system, but returning a poor result, or not showing the customer a potential match, can cost a potential sale.

In Google’s case, they have to deal with malicious websites that inflate page ranks through the use of “link farms”, and with websites that are mirror images of other websites and vary only by the URL at the top of the page. Rather than display the same information repeatedly on the same result page, Google spends effort removing duplicate web pages and eliminating rank inappropriately created by a link farm.

In Twitter’s case, returning results in chronological order is simple, but searches for “milk chocolate” and “chocolate milk” are actually two distinct searches, and Twitter can’t improve the results for either without breaking its default time ordering. Twitter also suffers from its users’ typos (or simple pluralization mistakes), which could be remedied, but the way it returns and displays search results makes them difficult to fix.
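
To make the mirror-site problem concrete, here is a minimal sketch of dropping pages whose bodies are identical and differ only by URL. It is not Google’s method: exact hashing of a whitespace- and case-normalized body stands in for real duplicate detection, and `content_fingerprint` and `drop_mirrors` are hypothetical helper names invented for this example.

```python
import hashlib
import re


def content_fingerprint(body: str) -> str:
    """Hash the page body after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_mirrors(pages):
    """Keep the first URL seen for each distinct body.

    `pages` is a list of (url, body) tuples; order is preserved.
    """
    seen = set()
    unique = []
    for url, body in pages:
        fingerprint = content_fingerprint(body)
        if fingerprint in seen:
            continue  # same content under a different URL: skip the mirror
        seen.add(fingerprint)
        unique.append((url, body))
    return unique
```

Even this toy version shows where the effort goes: the index has to look at the content itself, not just the URL, before it can decide what is worth returning. A mirror that changes a single word would slip past an exact hash, which is one reason quality control costs more than the basic algorithm.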

At various companies where I have worked, there have been multiple attempts to build search that returned several types of data in one result list. The user interface challenge was enormous, since that kind of result list requires the user to know what type of result to expect before clicking a link. In the end, it was simpler to design, build and explain to customers that the type of result they got depended on which search box they used. Managing customer expectations prior to the search made the page easier to design and easier for the customer to use.

Improving data quality is challenging regardless of domain and frequently requires human judgment, as the data is typically made by humans for humans to consume. If search engines were made just for machines to use, we wouldn’t need relevancy, as a machine could process all the results. People using a search engine need help figuring out which search result matches their query.

In the end, the old rule of “Garbage In, Garbage Out”, or in this case “You are what you index”, is tremendously meaningful in figuring out what it takes to return a great search result.

One thought on “You are what you index”

  1. I seem to recall that a few years back both Amazon and Microsoft had problems with this. Amazon vastly expanded their index by adding in the content of many books. Microsoft Live Search made a big push to increase the size of their web crawl.

    In both cases, the index bloat caused many new false positives, and relevance plummeted. Both took years of effort to improve their rankers to the point that they could weed out the false positives and get back to their former level of relevance.
