Search Analysis Writeup

by Scott Moore (samoore in BIT330, Fall 2008)

Summary data for the class

Search engine overlap data

The following tables contain the precision of the top 20 results returned on queries and the percentage overlap between the top 20 results of different search engines.

Web search
The apparent result is that just over half of the results for Google and Yahoo are relevant, while about two-fifths of the Live results are relevant. Further, for all pairs of search engines, about one-fifth of the results overlap with other search engines.
Blog search
The apparent result is that about half of the results for Google Blog are relevant, about two-fifths of the results for Bloglines, and about one-third for Technorati. Further, less than one-tenth of the results of one search engine show up in another search engine's results.
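(Diagonal cells: average precision, %. Off-diagonal cells: overlap, %.)

Web search | Live | Google | Yahoo Web
Live       | 42.8 | 18.3   | 20.0
Google     |      | 54.4   | 20.6
Yahoo Web  |      |        | 51.7
All        | 10.0

Blog search | Technorati | Google Blog | Bloglines
Technorati  | 33.1       | 3.6         | 9.2
Google Blog |            | 52.5        | 6.9
Bloglines   |            |             | 44.4
All         | 1.4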
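To make the two measures concrete, here is a minimal Python sketch of how precision and pairwise overlap might be computed for a single query. The function names, the sample URLs, and the choice to divide the overlap by the number of results examined are assumptions for illustration; the class data may have been tabulated differently.

```python
def precision_at_k(results, relevant, k=20):
    """Percentage of the top-k results that were judged relevant."""
    top = results[:k]
    if not top:
        return 0.0
    return 100.0 * sum(1 for url in top if url in relevant) / len(top)


def percent_overlap(results_a, results_b, k=20):
    """Percentage of A's top-k results that also appear in B's top-k."""
    top_a = results_a[:k]
    top_b = set(results_b[:k])
    if not top_a:
        return 0.0
    return 100.0 * sum(1 for url in top_a if url in top_b) / len(top_a)


# Hypothetical result lists for one query (placeholders, not class data).
google = ["a", "b", "c", "d"]
yahoo = ["c", "e", "a", "f"]
relevant = {"a", "c", "e"}

google_precision = precision_at_k(google, relevant, k=4)
google_yahoo_overlap = percent_overlap(google, yahoo, k=4)
```

In this toy case two of Google's four results are relevant and two of them also appear in Yahoo's list, so both values come out to 50.0. Averaging these per-query values over all students would give numbers of the kind shown in the tables.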

The following table presents a bit more detail than the averages on the diagonals above.

Precision | Avg  | Med  | sd
Live      | 42.8 | 42.5 | 22.8
Google    | 54.4 | 57.5 | 20.1
Yahoo     | 51.7 | 52.5 | 22.4

Precision   | Avg  | Med  | sd
Technorati  | 33.1 | 30.0 | 21.2
Google Blog | 52.5 | 42.5 | 22.2
Bloglines   | 44.4 | 47.4 | 14.3

Search engine ranking overlap data

  • Web search. For the following, I am going to combine the results of the top two tables. More detailed results can be found when looking at one table or the other or looking at values in different cells — but since the numbers are so small, I am just going to focus on the last column in each table.
    • If you look at the top 5 results in one search engine, on average no more than 1.6 of those results will be found in the top 20 results of the other search engine.
    • If you look at the top 10 results in one search engine, on average no more than 2.6 of those results will be found in the top 20 results of the other search engine.
    • If you look at the top 20 results in one search engine, on average no more than 3.8 of those results will be found in the top 20 results of the other search engine.
  • Blog search: The numbers in all of these cells are so small that the standard deviation is greater than the average in all cases. Thus, it wouldn't be any surprise for a searcher to find absolutely no results in common when looking at the top 20 results of the two search engines; however, on average about 1 result will be found in common between the two blog search engines.
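This table provides a measure of how much of Google's responses are reproduced by Yahoo. Each cell gives the average number of shared results, with the standard deviation in parentheses.

GY     | Yahoo
Google | 5         | 10        | 20
5      | 1.1 (1.2) | 1.3 (1.2) | 1.6 (1.2)
10     | 1.4 (1.3) | 2.0 (1.3) | 2.5 (1.5)
20     | 1.6 (1.4) | 2.6 (1.7) | 3.7 (2.1)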
This table provides a measure of how much of Yahoo's responses are reproduced by Google.

YG    | Google
Yahoo | 5         | 10        | 20
5     | 1.1 (1.2) | 1.5 (1.2) | 1.9 (1.3)
10    | 1.2 (1.3) | 1.9 (1.4) | 2.6 (1.7)
20    | 1.6 (1.4) | 2.5 (1.6) | 3.8 (2.1)
This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.

BG        | Google
Bloglines | 5         | 10        | 20
5         | 0.3 (0.5) | 0.4 (0.6) | 0.7 (0.9)
10        | 0.4 (0.6) | 0.5 (0.7) | 0.8 (1.1)
20        | 0.5 (0.6) | 0.8 (1.0) | 1.1 (1.2)
This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.

GB    | Bloglines
GBlog | 5         | 10        | 20
5     | 0.3 (0.5) | 0.4 (0.6) | 0.5 (0.6)
10    | 0.4 (0.6) | 0.5 (0.7) | 0.9 (1.0)
20    | 0.6 (0.9) | 0.8 (1.1) | 1.1 (1.2)
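The grids above could be produced, for one student's query, by counting how many of one engine's top results show up at various depths of the other engine's list. The following is a minimal sketch under that assumption; the function name and the sample result lists are hypothetical, and the actual tabulation procedure used for the class data is not documented here.

```python
def rank_overlap_grid(results_a, results_b, depths=(5, 10, 20)):
    """grid[i][j] = how many of A's top depths[i] results are in B's top depths[j]."""
    grid = []
    for depth_a in depths:
        top_a = results_a[:depth_a]
        row = []
        for depth_b in depths:
            top_b = set(results_b[:depth_b])
            row.append(sum(1 for url in top_a if url in top_b))
        grid.append(row)
    return grid


# Hypothetical ranked lists: u0 and u7 are near the top of B, u15 is deeper.
engine_a = [f"u{i}" for i in range(20)]
engine_b = (["u0", "u7"] + [f"v{i}" for i in range(10)]
            + ["u15"] + [f"v{i}" for i in range(10, 17)])

grid = rank_overlap_grid(engine_a, engine_b)
```

For these lists the grid is [[1, 1, 1], [2, 2, 2], [2, 2, 3]]. Averaging each cell over all students, and taking the standard deviation of the same cell, would give entries in the "average (sd)" form used in the tables.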

Results

Explanation of statistics

For the individual results, I show the number of students for whom the precision was better for the first search engine, better for the second search engine, or the same for the two search engines. For the Student's paired t, I test the hypothesis that the difference in precision between the two search engines is equal to zero; this test assumes that the data are normally distributed. I used this table of values to test the hypotheses. For the Wilcoxon signed rank test, I am testing the hypothesis that the precisions for the two search engines are selected from the same distribution (no matter what that distribution might be). I used the method described on this page to calculate this statistic.
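As a rough illustration of the two tests, the Python sketch below computes a paired t statistic and a Wilcoxon signed-rank sum W with its normal approximation z. The function names are hypothetical. The formula z = (|W| - 0.5) / sqrt(n(n+1)(2n+1)/6), where n is the number of non-tied pairs, is an assumption: it is one common form of the approximation and appears consistent with several of the values reported in this writeup, but it is not confirmed to be the exact method used.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def signed_rank_sum(first, second):
    """Sum of signed ranks W and the number of non-tied pairs n.

    Pairs with equal values are dropped; equal absolute differences
    share the average of the ranks they occupy.
    """
    diffs = sorted((a - b for a, b in zip(first, second) if a != b), key=abs)
    n = len(diffs)
    w = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for d in diffs[i:j]:
            w += avg_rank if d > 0 else -avg_rank
        i = j
    return w, n


def z_from_w(w, n):
    """Normal approximation for W with a 0.5 continuity correction (assumed)."""
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 6)
    return (abs(w) - 0.5) / sigma
```

With W = 117 and n = 15 non-tied pairs, z_from_w returns about 3.308, which agrees with the Google Blog Search vs Technorati figure reported below. The resulting t and z values are then compared against critical values from standard tables.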

Web search: Differences in precision

Google vs. Live: Test hypothesis that Google is better than Live. The statistics are listed below. All three statistics support the conclusion that Google is better than Live (with a 97.5% confidence level).

  • Individual results (G/L/=): 14/4/0
  • Student's paired t: $t = 2.5 > t_{17,97.5} = 2.110$
  • Wilcoxon: $W = 105 \Rightarrow z = 2.276 > z_{97.5} = 1.960$

Yahoo vs Live: Test hypothesis that Yahoo is better than Live. The statistics are listed below. All three statistics support the conclusion that Yahoo is better than Live (with a 99% confidence level).

  • Individual results (Y/L/=): 12/3/3
  • Student's paired t: $t = 2.8 > t_{17,99} = 2.567$
  • Wilcoxon: $W = 85 \Rightarrow z = 2.4 > z_{99} = 2.326$

Google vs Yahoo: Test hypothesis that Google and Yahoo are equivalent. The statistics are listed below. The results here are mixed. The Student t test supports the idea that these two are equivalent, while the other two provide a slight level of support that they are not equivalent. More data would have to be collected to make any kind of strong conclusion either way.

  • Individual results (G/Y/=): 10/6/4
  • Student's paired t: $t = 0.83 < t_{17,60} = 0.863$
  • Wilcoxon: $W = 35 \Rightarrow z = 0.9178 < z_{90} = 1.645$

Blog search: Differences in precision

Google Blog Search vs Technorati: Test hypothesis that Google is better than Technorati. The statistics are listed below. All three statistics support the conclusion that Google Blog Search is better than Technorati (with a 99.5% confidence level).

  • Individual results (G/T/=): 14/1/3
  • Student's paired t: $t = 4.0 > t_{17,99.95} = 3.965$
  • Wilcoxon: $W = 117 \Rightarrow z = 3.308 > z_{99.5} = 2.576$

Bloglines vs Technorati: Test hypothesis that Bloglines is better than Technorati. The statistics are listed below. All three statistics support the conclusion that Bloglines is better than Technorati (with a 97.5% confidence level; the paired t test alone reaches 99%).

  • Individual results (B/T/=): 14/4/0
  • Student's paired t: $t = 2.78 > t_{17,99} = 2.567$
  • Wilcoxon: $W = 107 \Rightarrow z = 2.319 > z_{97.5} = 1.96$

Google vs Bloglines: Test hypothesis that Google Blog and Bloglines are equivalent. The statistics are listed below. The results here indicate that Google Blog Search and Bloglines are not equivalent (with only a 90% confidence level). I would conclude that more data needs to be collected, but these results are suggestive that Google Blog Search is better than Bloglines.

  • Individual results (G/B/=): 11/6/1
  • Student's paired t: $t = 1.93 > t_{17,90} = 1.740$
  • Wilcoxon: $W = 75 \Rightarrow z = 1.787 > z_{90} = 1.645$

Discussion

Web search

Google and Yahoo are clearly the best two Web search engines (of the three we looked at). Let's consider what would happen if the searcher looked at the top 20 results in Google and then the top 20 in Yahoo, or looked at the top 10 in Google and then the top 10 in Yahoo:

Look at 20 in Google, then 20 in Yahoo
11 (20*0.544) of the top 20 Google results should be relevant. Of the top 20 Yahoo results, 10 (20*0.517) would be relevant but about 4 (3.7) would be duplicates of the Google results, and I'll assume that half of these would be duplicates of the relevant results (given no information otherwise). Thus, looking at the top 20 results in Yahoo would give you an additional 8 relevant results — for a total of 19.
Look at 10 in Google, then 10 in Yahoo
5.4 (10*0.544) of the top 10 Google results should be relevant. Of the top 10 Yahoo results, 5.2 (10*0.517) would be relevant but about 2 would be duplicates, and about half of these would be duplicates of the relevant results. Thus, looking at the top 10 results in Yahoo would give you 4.2 additional results — for a total of almost 10.

In any case, it's clear that, after looking at the results in Google, the searcher has much to gain by looking at the results given by Yahoo.
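The arithmetic behind these two scenarios can be written as a small Python helper. This is only a sketch: the function and parameter names are hypothetical, and the relevant_share default of 0.5 encodes the assumption stated above that half of the duplicates are duplicates of relevant results.

```python
def combined_relevant(k, precision_first, precision_second,
                      duplicates, relevant_share=0.5):
    """Expected relevant results after reading the top k of two engines.

    precision_* are fractions (0.544 for 54.4%); duplicates is the average
    number of the second engine's top-k results already seen in the first.
    Returns (relevant in first, additional relevant from second, total).
    """
    relevant_first = k * precision_first
    relevant_second = k * precision_second
    additional = relevant_second - duplicates * relevant_share
    return relevant_first, additional, relevant_first + additional


# Google then Yahoo, using the averages reported in the tables above.
top20 = combined_relevant(20, 0.544, 0.517, duplicates=3.7)
top10 = combined_relevant(10, 0.544, 0.517, duplicates=2.0)
```

Without intermediate rounding the totals come out to about 19.4 for the top 20 and about 9.6 for the top 10, in line with the rounded figures of 19 and "almost 10" given above.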

Blog search

Google Blog Search and Bloglines are both clearly better than Technorati, and it might be the case that Google Blog Search is better than Bloglines. Let's consider what would happen if the searcher looked at the top 20 results in Google Blog Search and then the top 20 in Bloglines, or looked at the top 10 in Google Blog Search and then the top 10 in Bloglines:

Look at 20 in Google, then 20 in Bloglines
Between 8.5 (20*0.425) and 10.5 (20*0.525) of the top 20 Google results should be relevant. Of the top 20 Bloglines results, about 8.9 (20*0.444) would be relevant but about 1.1 would be duplicates of the Google results, and I'll assume that half of these would be duplicates of the relevant results (given no information otherwise). Thus, looking at the top 20 results in Bloglines would give you just over 8 additional relevant results, for a total of between about 17 and 19.
Look at 10 in Google, then 10 in Bloglines
Between 4.3 (10*0.425) and 5.3 (10*0.525) of the top 10 Google results should be relevant. Of the top 10 Bloglines results, 4.4 (10*0.444) would be relevant but about 0.5 would be duplicates, and about half of these (0.25) would be duplicates of the relevant results. Thus, looking at the top 10 results in Bloglines would give you about 4.1 additional relevant results, for a total of between about 8.4 and 9.4.

Again, it's clear that, after looking at the results in Google Blog Search, the searcher has much to gain by looking at the results given by Bloglines.
