Course background

Information retrieval has traditionally been evaluated in terms of precision and recall. The results of a search engine query have high precision if a high percentage of the retrieved documents are relevant to the user's question. They have high recall if the search engine returned a high percentage of the relevant documents in the whole document base. These measures were particularly useful for thinking about search engine performance in the context of professional searchers, well-defined topics, and document collections of fewer than a couple hundred thousand documents.
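
The two measures can be made concrete with a small calculation. The following is a minimal sketch, assuming that the retrieved documents and the relevant documents are each known as a list of document identifiers; the function name and the sample identifiers are hypothetical and serve only as illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents the search engine returned
    relevant:  identifiers of all relevant documents in the collection
    Both arguments are illustrative assumptions, not part of any real API.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant

    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant in total, 2 in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5", "d6", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # prints: 0.5 0.4
```

Here half of the retrieved documents are relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4). Computing recall requires knowing every relevant document in the collection, which is practical only for small, well-studied collections.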

Now that search engines must serve ordinary people searching for documents on ill-defined topics within collections numbering in the billions, the discussion of information retrieval should take a new direction. Precision and recall are still relevant; however, other concerns have become more central. In this context, the important dimensions to consider when choosing a search engine are its support for discovering the structure of a query result, its support for exploring that structure, and its facilities for keeping a user aware of pertinent changes to a document repository. Discovering the structure is part of the interactive process through which the user learns what documents are available and refines the query. Exploring the structure is the process through which the user learns how the documents returned by a query might be clustered and how the clusters might be related to each other. Change notification is a tool that lets a person keep up, in a structured and facilitated way, with changes relevant to a particular query or set of documents as they occur.
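
One simple way to picture change notification is to compare two snapshots of the documents returned for the same query. The following is a minimal sketch, assuming each snapshot is a dictionary that maps a URL to the page text; the function names, the snapshot format, and the example.org URLs are hypothetical, and real page monitors and alert services may work quite differently.

```python
import hashlib


def fingerprint(text):
    """Return a short, stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_changes(old_snapshot, new_snapshot):
    """Compare two {url: text} snapshots of the same query's results.

    Returns the URLs that were added, removed, or changed between them.
    The snapshot format is an assumption made for this illustration.
    """
    old = {url: fingerprint(body) for url, body in old_snapshot.items()}
    new = {url: fingerprint(body) for url, body in new_snapshot.items()}

    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # A URL present in both snapshots has changed if its fingerprint differs.
    changed = sorted(u for u in old.keys() & new.keys() if old[u] != new[u])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots taken on two different days.
monday = {
    "http://example.org/a": "first version",
    "http://example.org/b": "unchanged text",
}
friday = {
    "http://example.org/a": "second version",
    "http://example.org/b": "unchanged text",
    "http://example.org/c": "a new page",
}
print(detect_changes(monday, friday))
```

In this example the report lists `http://example.org/c` as added and `http://example.org/a` as changed, while the unchanged page is not mentioned. Storing fingerprints rather than full page text keeps the stored state small, at the cost of not being able to show what changed.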

Further, searchers are also interested in the delivery form of the information, the automation level of the query, and the type of the information. The delivery form is the technological channel through which the information is delivered (email, text message, Web page, or RSS feed). The automation level of the query (and of the resulting information) refers to whether the user can set up an automated monitor to deliver the information or must perform the query each time the information is wanted. The type of information refers to whether it comes from a reference site (acting as a primary source of some information), a mainstream media site (newspaper or journal), or a more opinion-based site (a blog or wiki). Knowing the document type is not simply of academic interest; it rules out any search engine that does not index the document type of interest. The document type is also an indicator of the quality of the data. The choice of search engine and the user's overall query strategy are determined by the desired delivery form, how automated the user wants delivery to be, and where the user thinks the information should come from.

Finally, the scope, scale, and variety of information are so vast that a person cannot hope to have complete knowledge of anything but the most obscure topic. This is where the interrelationships among the types of information can help the searcher find relevant information and reference sites. Blogs tend to focus on specific topics and to follow developments reported in the mainstream media; they also tend to point out reference sites that they find interesting and valuable. Mainstream media sites tend to offer well-organized access to specific topics, easy-to-access archives of stories, and informative articles that give both useful overviews of specific topics and the site's own opinion about the usefulness of online and real-world reference sites. If a person is interested in a specific topic, both blogs and the mainstream media can be used to search for and filter information that might be of interest. Thus, even if a person is interested only in the reference sites related to a specific topic, reading and tracking relevant bloggers and mainstream media sites can be an effective way of discovering new sites and learning of current developments related to them.

All of the above leads to the goals of this class. The student will learn how to evaluate Web search engines according to a variety of criteria: how well an engine performs (precision, recall), how well it supports the search process (discovery and exploration of structure), how well it supports the continued monitoring of a topic (change notification), the delivery forms it supports, the automation level it provides, and the types of information it can retrieve. The student will also learn how to use a variety of search engines (for Web pages, blog sites, RSS feeds) and search tools (email alerts, page monitors) to search for and monitor Web pages and blogs, in order to learn about a topic more efficiently and keep up with changes related to that topic.
