by Scott Moore (samoore in BIT330, Fall 2008)
Summary data for the class
Search engine overlap data
The following tables contain the precision of the top 20 results returned for each query and the percentage overlap between the top 20 results of different search engines.
- Web search
- The apparent result is that just over half of the results for Google and Yahoo are relevant, while about two-fifths of the Live results are relevant. Further, for all pairs of search engines, about one-fifth of the results overlap with other search engines.
- Blog search
- The apparent result is that about half of the results for Google Blog are relevant, about two-fifths of the results for Bloglines, and about one-third for Technorati. Further, less than one-tenth of the results of one search engine show up in another search engine's results.
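As a rough illustration of the two measures summarized above, the following sketch computes precision and pairwise overlap for a single query. The function names, the relevance judgments, and the URLs are all hypothetical; they are not the class data.

```python
def precision(relevance_flags):
    """Fraction of the examined results that were judged relevant."""
    return sum(relevance_flags) / len(relevance_flags)


def overlap_fraction(results_a, results_b):
    """Fraction of the results in results_a that also appear in results_b.

    Assumes each list contains no repeated URLs.
    """
    shared = set(results_a) & set(results_b)
    return len(shared) / len(results_a)


# Hypothetical judgments for the top 20 results of one query
# (True means the student judged the result relevant).
judged = [True, True, False, True, False, True, False, False, True, True,
          False, True, False, True, False, False, True, False, True, True]

# Hypothetical (shortened) result lists from two engines for the same query.
engine_a = ["a.example/1", "a.example/2", "shared.example/x", "shared.example/y"]
engine_b = ["shared.example/x", "b.example/1", "shared.example/y", "b.example/2"]

print("precision:", precision(judged))
print("overlap:", overlap_fraction(engine_a, engine_b))
```

Averaging these per-query values over all students would give figures comparable to the class summaries, assuming that is how the averages were formed.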
The following table presents a bit more detail than the averages on the diagonals above.
Search engine ranking overlap data
- Web search. For the following, I am going to combine the results of the top two tables. More detailed results can be found when looking at one table or the other or looking at values in different cells — but since the numbers are so small, I am just going to focus on the last column in each table.
- If you look at the top 5 results in one search engine, on average no more than 1.6 of those results will be found in the top 20 results of the other search engine.
- If you look at the top 10 results in one search engine, on average no more than 2.6 of those results will be found in the top 20 results of the other search engine.
- If you look at the top 20 results in one search engine, on average no more than 3.8 of those results will be found in the top 20 results of the other search engine.
- Blog search: The numbers in all of these cells are so small that the standard deviation is greater than the average in all cases. Thus, it wouldn't be any surprise for a searcher to find absolutely no results in common when looking at the top 20 results of the two search engines; however, on average about 1 result will be found in common between the two blog search engines.
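The ranking overlap described above (how many of the top 5, 10, or 20 results of one engine show up in the top 20 of the other) can be sketched as follows. The function name, the URLs, and the per-student counts are hypothetical and only illustrate the calculation.

```python
from statistics import mean, stdev


def top_k_in_top_n(ranked_a, ranked_b, k, n=20):
    """Count how many of the first k results in ranked_a appear in the first n of ranked_b."""
    top_b = set(ranked_b[:n])
    return sum(1 for url in ranked_a[:k] if url in top_b)


# Hypothetical ranked result lists for one query.
ranked_a = [f"site{i}.example" for i in range(20)]
ranked_b = ["site3.example", "site12.example"] + [f"other{i}.example" for i in range(18)]

for k in (5, 10, 20):
    print(k, top_k_in_top_n(ranked_a, ranked_b, k))

# Hypothetical per-student counts of shared results. With counts this small,
# the standard deviation can easily exceed the average.
shared_counts = [0, 0, 3, 0, 2, 0, 4, 0]
print(mean(shared_counts), stdev(shared_counts))
```

Note that the measure is not symmetric: the count for engine A's top 5 against engine B's top 20 generally differs from the reverse, which is presumably why each pair of engines gets two tables.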
This table provides a measure of how much of Google's responses are reproduced by Yahoo.
This table provides a measure of how much of Yahoo's responses are reproduced by Google.
This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.
This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.
Results
Explanation of statistics
For the individual results, I show for how many students the precision was better for the first search engine, better for the second search engine, or the same for the two search engines. For the Student's paired t test, I test the hypothesis that the mean difference in precision between the two search engines is equal to zero; this test assumes that the differences are normally distributed. I used this table of values to test the hypotheses. For the Wilcoxon signed rank test, I test the hypothesis that the precisions for the two search engines are drawn from the same distribution (no matter what that distribution might be). I used the method described on this page to calculate this statistic.
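For readers who want to reproduce these three statistics, here is a minimal sketch using only the Python standard library. It follows the common textbook forms of the paired t statistic and the Wilcoxon signed rank sums (zero differences dropped, tied absolute differences given average ranks); the exact method on the page referenced above may differ. The sample data are hypothetical counts of relevant results out of 20, not the class data, and the resulting statistics would still need to be compared against a table of critical values.

```python
import math
from statistics import mean, stdev


def win_loss_tie(first, second):
    """Count students for whom the first engine was better, worse, or equal."""
    wins = sum(a > b for a, b in zip(first, second))
    losses = sum(a < b for a, b in zip(first, second))
    return wins, losses, len(first) - wins - losses


def paired_t_statistic(first, second):
    """t = mean(d) / (s_d / sqrt(n)), with n - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def wilcoxon_signed_rank(first, second):
    """Return (W_plus, W_minus, n) where n is the number of non-zero differences."""
    diffs = [a - b for a, b in zip(first, second) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2
        for idx in range(i, j + 1):
            ranks[idx] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus, len(ordered)


# Hypothetical relevant-result counts (out of 20) for six students.
# Integer counts avoid floating-point noise when detecting tied differences.
first_engine = [12, 10, 14, 9, 11, 13]
second_engine = [8, 10, 9, 10, 7, 9]

print(win_loss_tie(first_engine, second_engine))
print(paired_t_statistic(first_engine, second_engine))
print(wilcoxon_signed_rank(first_engine, second_engine))
```

The smaller of the two rank sums is the value usually looked up in a Wilcoxon table for the given n.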
Web search: Differences in precision
Google vs. Live: Test hypothesis that Google is better than Live. The statistics are listed below. All three statistics support the conclusion that Google is better than Live (with a 97.5% confidence level).
- Individual results (G/L/=): 14/4/0
- Student's paired t:

- Wilcoxon:

Yahoo vs Live: Test hypothesis that Yahoo is better than Live. The statistics are listed below. All three statistics support the conclusion that Yahoo is better than Live (with a 99% confidence level).
- Individual results (Y/L/=): 12/3/3
- Student's paired t:

- Wilcoxon:

Google vs Yahoo: Test hypothesis that Google and Yahoo are equivalent. The statistics are listed below. The results here are mixed. The Student t test supports the idea that these two are equivalent, while the other two provide a slight level of support that they are not equivalent. More data would have to be collected to draw any strong conclusion either way.
- Individual results (G/Y/=): 10/6/4
- Student's paired t:

- Wilcoxon:

Blog search: Differences in precision
Google Blog Search vs Technorati: Test hypothesis that Google is better than Technorati. The statistics are listed below. All three statistics support the conclusion that Google Blog Search is better than Technorati (with a 99.5% confidence level).
- Individual results (G/T/=): 14/1/3
- Student's paired t:

- Wilcoxon:

Bloglines vs Technorati: Test hypothesis that Bloglines is better than Technorati. The statistics are listed below. All three statistics support the conclusion that Bloglines is better than Technorati (with a 99% confidence level).
- Individual results (B/T/=): 14/4/0
- Student's paired t:

- Wilcoxon:

Google vs Bloglines: Test hypothesis that Google Blog and Bloglines are equivalent. The statistics are listed below. The results here indicate that Google Blog Search and Bloglines are not equivalent (with only a 90% confidence level). I would conclude that more data needs to be collected, but these results are suggestive that Google Blog Search is better than Bloglines.
- Individual results (G/B/=): 11/6/1
- Student's paired t:

- Wilcoxon:

Discussion
Web search
Google and Yahoo are clearly the best two Web search engines (of the three we looked at). Let's consider what would happen if the searcher looked at the top 20 results in Google and then the top 20 in Yahoo, or looked at the top 10 in Google and then the top 10 in Yahoo:
- Look at 20 in Google, then 20 in Yahoo
- 11 (20*0.544) of the top 20 Google results should be relevant. Of the top 20 Yahoo results, 10 (20*0.517) would be relevant but about 4 (3.7) would be duplicates of the Google results, and I'll assume that half of these would be duplicates of the relevant results (given no information otherwise). Thus, looking at the top 20 results in Yahoo would give you an additional 8 relevant results — for a total of 19.
- Look at 10 in Google, then 10 in Yahoo
- 5.4 (10*0.544) of the top 10 Google results should be relevant. Of the top 10 Yahoo results, 5.2 (10*0.517) would be relevant but about 2 would be duplicates, and about half of these would be duplicates of the relevant results. Thus, looking at the top 10 results in Yahoo would give you 4.2 additional results — for a total of almost 10.
In any case, it's clear that, after looking at the results in Google, the searcher has much to gain by looking at the results given by Yahoo.
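The arithmetic in the two scenarios above can be sketched as a small function. The function and parameter names are hypothetical; the precision and duplicate figures are the ones quoted in the text, and the share of duplicates assumed to be relevant defaults to one half, as assumed above. Because the sketch does not round intermediate values, its totals differ slightly from the rounded figures in the text.

```python
def combined_relevant(n, precision_first, precision_second, duplicates,
                      relevant_share=0.5):
    """Estimate relevant results from reading n results in each of two engines.

    duplicates is the expected number of second-engine results already seen
    in the first engine; relevant_share is the assumed fraction of those
    duplicates that are relevant.
    Returns (relevant in first, additional from second, total).
    """
    relevant_first = n * precision_first
    relevant_second = n * precision_second
    additional = relevant_second - duplicates * relevant_share
    return relevant_first, additional, relevant_first + additional


# Top 20 in Google, then top 20 in Yahoo (about 3.7 duplicates).
print(combined_relevant(20, 0.544, 0.517, 3.7))

# Top 10 in Google, then top 10 in Yahoo (about 2 duplicates).
print(combined_relevant(10, 0.544, 0.517, 2))
```

Changing `relevant_share` shows how sensitive the totals are to the assumption that half of the duplicates are relevant.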
Blog search
Google Blog Search and Bloglines are both clearly better than Technorati, and it might be the case that Google Blog Search is better than Bloglines. Let's consider what would happen if the searcher looked at the top 20 results in Google Blog Search and then the top 20 in Bloglines, or looked at the top 10 in Google Blog Search and then the top 10 in Bloglines:
- Look at 20 in Google, then 20 in Bloglines
- Between 8.5 (20*0.425) and 10.5 (20*0.525) of the top 20 Google results should be relevant. Of the top 20 Bloglines results, about 8.9 (20*0.444) would be relevant but about 1.1 would be duplicates of the Google results, and I'll assume that half of these (about 0.55) would be duplicates of the relevant results (given no information otherwise). Thus, looking at the top 20 results in Bloglines would give you just over 8 additional relevant results, for a total of between about 17 and 19.
- Look at 10 in Google, then 10 in Bloglines
- Between 4.3 (10*0.425) and 5.3 (10*0.525) of the top 10 Google results should be relevant. Of the top 10 Bloglines results, 4.4 (10*0.444) would be relevant but about 0.5 would be duplicates, and about half of these (0.25) would be duplicates of the relevant results. Thus, looking at the top 10 results in Bloglines would give you about 4.1 additional relevant results, for a total of between about 8.4 and 9.4.
Again, it's clear that, after looking at the results in Google Blog Search, the searcher has much to gain by looking at the results given by Bloglines.