By Jennifer Stanczak (jenstan in BIT330, Fall 2008)
Questions and queries
Web search engines
When I was younger, I always played with Barbie dolls. I had tons of them, from Beach Barbie to Doctor Barbie. I knew that Barbies had been around for a long time, but I wondered who created the popular doll. My web search will try to find the creator of the Barbie doll.
In all three search engines, I will use the search query “Barbie doll creator.”
Blog search engines
For my search in the blog search engines, I will be looking for reviews of Google’s new web browser, Google Chrome. I don’t know anything about it, so I want reviews that explain what it does and whether it is a good web browser to use.
In all three blog search engines, I will use the search query “google chrome.”
Data that I collected
Search engine overlap data
Web search | Live | Google | Yahoo Web |
Live | 80 | 25 | 20 |
Google | | 45 | 20 |
Yahoo Web | | | 75 |
All | 10 | | |
Blog search | Technorati | Google Blog | Bloglines |
Technorati | 30 | 5 | 10 |
Google Blog | | 70 | 10 |
Bloglines | | | 55 |
All | 5 | | |
Search engine ranking overlap data
This table provides a measure of how much of Google's responses are reproduced by Yahoo.
GY | Yahoo |
Google | 5 | 10 | 20 |
5 | 1 | 2 | 2 |
10 | 2 | 3 | 3 |
20 | 2 | 4 | 4 |
This table provides a measure of how much of Yahoo's responses are reproduced by Google.
YG | Google |
Yahoo | 5 | 10 | 20 |
5 | 1 | 2 | 2 |
10 | 2 | 3 | 4 |
20 | 2 | 3 | 4 |
This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.
BG | Google |
Bloglines | 5 | 10 | 20 |
5 | 0 | 0 | 0 |
10 | 0 | 1 | 2 |
20 | 0 | 1 | 2 |
This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.
GB | Bloglines |
GBlog | 5 | 10 | 20 |
5 | 0 | 0 | 0 |
10 | 0 | 1 | 1 |
20 | 0 | 2 | 2 |
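The counts in the four tables above were tallied by hand. A minimal sketch of how one such cell could be computed is shown here, assuming each cell counts how many of the top results from one engine also appear among the top results of the other. The function name, its parameters, and the sample URLs are hypothetical and are used only for illustration; the page does not spell out the exact definition, so this is one plausible reading.

```python
def overlap_count(results_a, results_b, depth_a, depth_b):
    """Count how many of the top `depth_a` results of engine A
    also appear in the top `depth_b` results of engine B.

    Assumes each list is ordered by rank and holds no duplicate URLs.
    """
    top_a = results_a[:depth_a]
    top_b = set(results_b[:depth_b])
    return sum(1 for url in top_a if url in top_b)


# Hypothetical ranked result lists (not data from this study).
engine_a = ["u1", "u2", "u3", "u4", "u5", "u6"]
engine_b = ["u3", "x1", "u1", "x2", "x3", "u6"]

top5_vs_top5 = overlap_count(engine_a, engine_b, 5, 5)
top6_vs_top6 = overlap_count(engine_a, engine_b, 6, 6)
print(top5_vs_top5, top6_vs_top6)
```

Looking deeper into either list can only keep the count the same or raise it, which is why the values in each table never decrease as the depths grow.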
Results
Web search
Web search | Precision: Live | Precision: Google | Precision: Yahoo | Overlap: L/G | Overlap: L/Y | Overlap: G/Y | All: L/G/Y | Precision(Google) - Precision(Yahoo) |
Mean | 42.78 | 54.44 | 51.67 | 18.33 | 20 | 20.56 | 10 | 2.78 |
Median | 42.5 | 57.5 | 52.5 | 20 | 20 | 20 | 10 | 5 |
Mode | 15 | 70 | 70 | 10 | 10 | 25 | 10 | 10 |
Std. Dev. | 22.77 | 20.07 | 22.43 | 9.549 | 11.38 | 7.838 | 7.475 | 14.17 |
N | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 |
In the table above, I calculated the mean, median, mode, and standard deviation for the precision and overlap of the search engines. I also added a column for the precision of Google minus the precision of Yahoo. I chose these two search engines because they had a higher average precision than Live Search. I then used this column to perform a hypothesis test to determine whether there was sufficient evidence to conclude that Google is more precise than Yahoo. My null hypothesis is that μd (the mean of the differences) is less than or equal to zero, and the alternative hypothesis is that μd is greater than zero. I chose a significance level (alpha) of .025 and calculated the t-statistic to be .83, which does not fall within the rejection region of t greater than 2.110 (17 degrees of freedom). Therefore, I fail to reject the null hypothesis. There is insufficient statistical evidence to conclude that the mean difference between the precisions of Google and Yahoo is greater than zero (no evidence that Google has a higher precision).
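The t-statistic can be reproduced from the summary values in the table. The sketch below assumes the usual paired-difference form of the test, t = mean / (standard deviation / sqrt(N)); the function and variable names are my own, and the critical value is simply the 2.110 quoted above rather than something the code looks up.

```python
import math


def paired_t_statistic(mean_diff, std_diff, n):
    """t statistic for testing H0: mu_d <= 0 from summary statistics."""
    standard_error = std_diff / math.sqrt(n)
    return mean_diff / standard_error


# Summary values from the Precision(Google) - Precision(Yahoo) column.
t_web = paired_t_statistic(mean_diff=2.78, std_diff=14.17, n=18)

# One-tailed critical value quoted in the text (alpha = .025, 17 d.f.).
CRITICAL_T = 2.110
reject_null = t_web > CRITICAL_T

print(round(t_web, 2), reject_null)  # prints: 0.83 False
```

Because the statistic is well below the critical value, the sketch reaches the same decision as the write-up: fail to reject the null hypothesis.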
GY | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20) |
Mean | 1.0588 | 1.3529 | 1.6471 | 1.2941 | 2 | 2.6471 | 1.6471 | 2.4706 | 3.7059 |
Median | 1 | 1 | 2 | 1 | 2 | 3 | 1 | 3 | 4 |
Mode | 1 | 0 | 0 | 1 | 1 | 4 | 1 | 3 | 5 |
Std. Dev. | 1.1974 | 1.3201 | 1.4116 | 1.2127 | 1.3229 | 1.7299 | 1.2217 | 1.5459 | 2.1144 |
N | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
YG | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20) |
Mean | 1.0588 | 1.1765 | 1.6471 | 1.4706 | 1.9412 | 2.4706 | 1.8824 | 2.6471 | 3.7647 |
Median | 1 | 1 | 1 | 1 | 2 | 3 | 2 | 3 | 4 |
Mode | 1 | 0 | 1 | 1 | 3 | 3 | 1 | 4 | 5 |
Std. Dev. | 1.1974 | 1.2862 | 1.3666 | 1.2307 | 1.3906 | 1.5858 | 1.269 | 1.7299 | 2.0775 |
N | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
Top 5 results in Yahoo also appearing in Google | 1.6471 |
Results 6-10 of Yahoo also appearing in Google | 2.6471 - 1.6471 = 1 |
Results 11-20 of Yahoo also appearing in Google | 3.7059 - 2.6471 = 1.0588 (divided by 2 to put in terms of 5 results = .5294) |
Top 5 results in Google also appearing in Yahoo | 1.6471 |
Results 6-10 of Google also appearing in Yahoo | 2.4706 - 1.6471 = .8235 |
Results 11-20 of Google also appearing in Yahoo | 3.7647 - 2.4706 = 1.2941 (divided by 2 to put in terms of 5 results = .64705) |
In the tables above, I calculated the mean, median, mode, and standard deviation for the overlap of rankings in Google/Yahoo and Yahoo/Google. The idea is to determine whether a top result in one search engine is more likely than a lower result to appear in the other search engine. I therefore used the means to calculate the average number of overlaps among the top 5 results, results 6-10, and results 11-20, scaling the last band to a per-5-results figure so the three bands are comparable. I found that there is indeed more overlap in the top results than in the lower results.
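The band-by-band subtraction can also be written out as a short sketch. It assumes the three inputs are cumulative mean overlaps at depths 5, 10, and 20 (the values used in the worked rows above); the function name and the dictionary layout are hypothetical choices for illustration.

```python
def overlap_per_five(cumulative):
    """Convert cumulative mean overlaps (depth -> mean) into
    overlaps per band, scaled to a band of 5 results."""
    rates = {}
    prev_depth, prev_value = 0, 0.0
    for depth in sorted(cumulative):
        band_size = depth - prev_depth
        increment = cumulative[depth] - prev_value
        # Scale so a 10-result band is comparable to a 5-result band.
        rates[(prev_depth + 1, depth)] = increment * 5 / band_size
        prev_depth, prev_value = depth, cumulative[depth]
    return rates


# Mean overlaps at depths 5, 10, and 20 from the first group of worked rows.
web_rates = overlap_per_five({5: 1.6471, 10: 2.6471, 20: 3.7059})
for band, rate in web_rates.items():
    print(band, round(rate, 4))
```

The per-band figures fall from 1.6471 to 1 to about .5294, which is the pattern the paragraph above describes: overlap is concentrated in the top results.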
Blog search
Blog search | Precision: Technorati | Precision: Google Blog | Precision: Bloglines | Overlap: T/G | Overlap: T/B | Overlap: G/B | All: T/G/B | Precision(Google Blog) - Precision(Bloglines) |
Mean | 33.06 | 52.5 | 44.44 | 3.611 | 9.167 | 6.944 | 1.389 | 8.06 |
Median | 30 | 42.5 | 47.5 | 0 | 7.5 | 5 | 0 | 10 |
Mode | 30 | 40 | 50 | 0 | 5 | 5 | 0 | 10 |
Std. Dev. | 21.15 | 22.18 | 14.34 | 7.031 | 7.717 | 6.449 | 3.346 | 17.75 |
N | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 |
In the table above, I calculated the mean, median, mode, and standard deviation for the precision and overlap of the blog search engines. I also added a column for the precision of Google Blog minus the precision of Bloglines. I chose these two search engines because they had a higher average precision than Technorati. I then used this column to perform a hypothesis test to determine whether there was sufficient evidence to conclude that Google Blog searches are more precise than Bloglines searches. My null hypothesis is that μd (the mean of the differences) is less than or equal to zero, and the alternative hypothesis is that μd is greater than zero. I chose a significance level (alpha) of .025 and calculated the t-statistic to be 1.93, which does not fall within the rejection region of t greater than 2.110 (17 degrees of freedom). Therefore, I fail to reject the null hypothesis. There is insufficient statistical evidence to conclude that the mean difference between the precisions of Google Blog and Bloglines is greater than zero (no evidence that Google Blog has a higher precision).
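As a check on the arithmetic, the t-statistic can be worked out from the summary row of the table, assuming the usual paired-difference form of the test. The symbols are my own notation: \bar{d} is the mean of the difference column, s_d its standard deviation, and n the number of queries.

```latex
t = \frac{\bar{d}}{s_d / \sqrt{n}}
  = \frac{8.06}{17.75 / \sqrt{18}}
  \approx \frac{8.06}{4.18}
  \approx 1.93 < 2.110
```

The value is closer to the rejection region than in the web search test, but it still falls short of it.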
GB | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20) |
Mean | 0.2941 | 0.3529 | 0.4706 | 0.4118 | 0.4706 | 0.8235 | 0.7059 | 0.7647 | 1.0588 |
Median | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Mode | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Std. Dev. | 0.4697 | 0.6063 | 0.6243 | 0.6183 | 0.7174 | 1.0146 | 0.9196 | 1.0914 | 1.1974 |
N | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
BG | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20) |
Mean | 0.2941 | 0.3529 | 0.5882 | 0.4118 | 0.5294 | 0.8235 | 0.5294 | 0.8824 | 1.1176 |
Median | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
Mode | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Std. Dev. | 0.4697 | 0.6063 | 0.8703 | 0.6183 | 0.7174 | 1.0744 | 0.6243 | 0.9926 | 1.1663 |
N | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
Top 5 results in Bloglines also appearing in Google Blog | .4706 |
Results 6-10 of Bloglines also appearing in Google Blog | .8235 - .4706 = .3529 |
Results 11-20 of Bloglines also appearing in Google Blog | 1.0588 - .8235 = .2353 (divided by 2 to put in terms of 5 results = .11765) |
Top 5 results in Google Blog also appearing in Bloglines | .5882 |
Results 6-10 of Google Blog also appearing in Bloglines | .8235 - .5882 = .2353 |
Results 11-20 of Google Blog also appearing in Bloglines | 1.1176 - .8235 = .2941 (divided by 2 to put in terms of 5 results = .14705) |
In the tables above, I calculated the mean, median, mode, and standard deviation for the overlap of rankings in Google Blog/Bloglines and Bloglines/Google Blog. The idea is to determine whether a top result in one search engine is more likely than a lower result to appear in the other search engine. I therefore used the means to calculate the average number of overlaps among the top 5 results, results 6-10, and results 11-20, scaling the last band to a per-5-results figure so the three bands are comparable. I found that there is indeed more overlap in the top results than in the lower results, although the overlap is small at every depth.
Discussion
Web search
Based on the data sets, I found no evidence that any one search engine is more accurate than another. My hypothesis test of whether Google searches were more precise than Yahoo searches found no statistically significant difference. The top results in each engine were more likely than the lower results to also appear in the other engine. Therefore, when performing a search and only looking at the top few results, you are getting many of the same results no matter which search engine you use. I recommend that a person searching for information use the search engines interchangeably for the most part. To save time, I would not recommend running the same search in all three, because there is not much difference in results or precision. If this topic were investigated further, I would recommend a bigger sample size. The small samples used here do not represent the true data well, and statistical analysis is more difficult when the sample is not large enough to assume normality.
Blog search
Based on the data sets, I found no evidence that any one blog search engine is more accurate than another. My hypothesis test of whether Google Blog searches were more precise than Bloglines searches found no statistically significant difference. The top results in each engine were more likely than the lower results to also appear in the other engine. Therefore, if you only look at the top results, you will get some of the same results across the two search engines. However, because the overlap was so low, I would recommend running a query in all of them if time allows. If this topic were investigated further, I would recommend a few changes in the methods and approach. First, I would use bigger samples, because the small samples used here do not represent the true data well. I would also revise my query, because searching for a term that relates specifically to one of the search engines, such as "Google Chrome" in the Google Blog search engine, could distort the results somewhat.