Search Tool Data Analysis

By jenstanjenstan (Oct 13, 2008)

by Jennifer Stanczak (jenstanjenstan in BIT330, Fall 2008)

Questions and queries

Web search engines

When I was younger I always played with Barbie dolls. I had tons of them, from Beach Barbie to Doctor Barbie. I knew that Barbies had been around for a long time, but I wondered who was the creator of the popular doll. My web search engine search will be to find the creator of the Barbie Doll.

In all three search engines, I will use the search query “Barbie doll creator.”

Blog search engines

For my search in the blog search engines, I will be looking for reviews for Google’s new web browser, Google Chrome. I don’t know anything about it so I am looking for reviews on what it does and if it is a good web browser to use.

In all three blog search engines, I will use the search query “google chrome.”

Data that I collected

Search engine overlap data

Web search   Live         Google       Yahoo Web
Live         80           25           20
Google                    45           20
Yahoo Web                              75
All          10

Blog search  Technorati   Google Blog  Bloglines
Technorati   30           5            10
Google Blog               70           10
Bloglines                              55
All          5

Search engine ranking overlap data

This table provides a measure of how much of Google's responses are reproduced by Yahoo.
GY              Yahoo top 5    Yahoo top 10   Yahoo top 20
Google top 5    1              2              2
Google top 10   2              3              3
Google top 20   2              4              4
This table provides a measure of how much of Yahoo's responses are reproduced by Google.
YG              Google top 5   Google top 10  Google top 20
Yahoo top 5     1              2              2
Yahoo top 10    2              3              4
Yahoo top 20    2              3              4
This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.
BG                  Google Blog top 5   Google Blog top 10  Google Blog top 20
Bloglines top 5     0                   0                   0
Bloglines top 10    0                   1                   2
Bloglines top 20    0                   1                   2
This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.
GB                  Bloglines top 5     Bloglines top 10    Bloglines top 20
Google Blog top 5   0                   0                   0
Google Blog top 10  0                   1                   1
Google Blog top 20  0                   2                   2
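
The tables above do not spell out how each cell was counted. One reading that is consistent with them (each table is the transpose of its counterpart) is that the cell for cutoffs i and j counts the results shared by the top i of one engine and the top j of the other. The following is a minimal Python sketch under that assumption; the function name and the two result lists are hypothetical placeholders, not the data collected for this report.

```python
def ranking_overlap(engine_a, engine_b, cutoffs=(5, 10, 20)):
    """Count shared results between the top i of engine_a and the top j of engine_b.

    Returns a dict keyed by (i, j). This counting rule is an assumption,
    not a definition taken from the report.
    """
    table = {}
    for i in cutoffs:
        top_a = set(engine_a[:i])
        for j in cutoffs:
            top_b = set(engine_b[:j])
            table[(i, j)] = len(top_a & top_b)
    return table


# Hypothetical ranked result lists (placeholder identifiers, not real URLs).
google = [f"u{k}" for k in range(1, 21)]
yahoo = ["u3", "v1", "v2", "v3", "v4", "u1", "v5", "v6", "v7", "v8", "u12"] + [
    f"v{k}" for k in range(9, 18)
]

table = ranking_overlap(google, yahoo)
for i in (5, 10, 20):
    print(i, [table[(i, j)] for j in (5, 10, 20)])
```

Swapping the two arguments gives the transposed table, which is why a Google/Yahoo table and its Yahoo/Google counterpart should mirror each other for a single query.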

Results

Web search

Web search
            Precision                     Overlap                       All
            Live      Google    Yahoo     L/G       L/Y       G/Y       L/G/Y     PrecisionGoogle - PrecisionYahoo
Mean        42.78     54.44     51.67     18.33     20        20.56     10        2.78
Median      42.5      57.5      52.5      20        20        20        10        5
Mode        15        70        70        10        10        25        10        10
Std. Dev.   22.77     20.07     22.43     9.549     11.38     7.838     7.475     14.17
N           18        18        18        18        18        18        18        18

In the above table, I calculated the mean, median, mode, and standard deviation for the precision and overlap of the search engines. I also added a column for the precision of Google minus the precision of Yahoo. I chose these two search engines because they had a higher average precision than Live Search. I then used this data to perform a hypothesis test to determine whether there was sufficient evidence to conclude that Google is more precise than Yahoo. My null hypothesis is that μd (the mean of the differences) is less than or equal to zero, and the alternative hypothesis is that μd is greater than zero. I chose a significance level (alpha) of .025 and, with N = 18 (17 degrees of freedom), calculated the t-statistic to be .83, which does not fall within the rejection region of t greater than 2.110. Therefore, I fail to reject the null hypothesis. There is insufficient statistical evidence to conclude that the mean difference between the precisions of Google and Yahoo is greater than zero (that is, no evidence that Google has a higher precision).
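
The t-statistic above can be checked from the summary values in the table. The sketch below assumes a one-tailed paired t-test, in which t is the mean difference divided by its standard error; this formula reproduces the reported .83. The function and variable names are illustrative only.

```python
import math


def paired_t_statistic(mean_diff: float, sd_diff: float, n: int) -> float:
    """t = mean difference / (standard deviation of differences / sqrt(n))."""
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error


# Summary values from the "PrecisionGoogle - PrecisionYahoo" column.
MEAN_DIFF = 2.78
SD_DIFF = 14.17
N = 18
CRITICAL_T = 2.110  # rejection region stated in the text (alpha = .025)

t_stat = paired_t_statistic(MEAN_DIFF, SD_DIFF, N)
reject_null = t_stat > CRITICAL_T

print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
# prints: t = 0.83, reject H0: False
```

Plugging in the blog search values from the corresponding column (mean 8.06, standard deviation 17.75, N 18) gives about 1.93 in the same way.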

GY
            o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Mean        1.0588    1.3529    1.6471    1.2941    2         2.6471    1.6471    2.4706    3.7059
Median      1         1         2         1         2         3         1         3         4
Mode        1         0         0         1         1         4         1         3         5
Std. Dev.   1.1974    1.3201    1.4116    1.2127    1.3229    1.7299    1.2217    1.5459    2.1144
N           17        17        17        17        17        17        17        17        17

YG
            o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Mean        1.0588    1.1765    1.6471    1.4706    1.9412    2.4706    1.8824    2.6471    3.7647
Median      1         1         1         1         2         3         2         3         4
Mode        1         0         1         1         3         3         1         4         5
Std. Dev.   1.1974    1.2862    1.3666    1.2307    1.3906    1.5858    1.269     1.7299    2.0775
N           17        17        17        17        17        17        17        17        17

Top 5 results in Yahoo also appearing in Google: 1.6471
Results 6-10 of Yahoo also appearing in Google: 2.6471 - 1.6471 = 1
Results 11-20 of Yahoo also appearing in Google: 3.7059 - 2.6471 = 1.0588 (divided by 2 to put in terms of 5 results = .5294)
Top 5 results in Google also appearing in Yahoo: 1.6471
Results 6-10 of Google also appearing in Yahoo: 2.4706 - 1.6471 = .8235
Results 11-20 of Google also appearing in Yahoo: 3.7647 - 2.4706 = 1.2941 (divided by 2 to put in terms of 5 results = .64705)

In the above table, I calculated the mean, median, mode, and standard deviation for the overlap of rankings in Google/Yahoo and Yahoo/Google. The idea is to determine whether a top result in one search engine is more likely than a lower result to appear in the other search engine. Therefore, I used the means to calculate the average number of overlaps in the top 5, 6-10, and 11-20 results. I found that there is indeed more overlap in the top results than in the lower results.
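
The subtraction used above can be written as a small helper. This is a sketch only: the function name is hypothetical, and the inputs are the mean o(20,5), o(20,10), and o(20,20) values from the table, read as cumulative counts for the top 5, top 10, and top 20 results.

```python
def overlap_by_rank_band(top5: float, top10: float, top20: float) -> dict:
    """Split cumulative overlap counts into rank bands, each scaled to 5 results."""
    return {
        "1-5": top5,
        "6-10": top10 - top5,
        "11-20": (top20 - top10) / 2,  # this band spans 10 results, so halve it
    }


# Mean values from the GY half of the table (Yahoo results found in Google).
yahoo_in_google = overlap_by_rank_band(1.6471, 2.6471, 3.7059)
# Mean values from the YG half of the table (Google results found in Yahoo).
google_in_yahoo = overlap_by_rank_band(1.6471, 2.4706, 3.7647)

for label, bands in (("Yahoo in Google", yahoo_in_google),
                     ("Google in Yahoo", google_in_yahoo)):
    print(label, {band: round(value, 4) for band, value in bands.items()})
```

In both directions the per-band value falls as the rank gets lower, which is the pattern described in the paragraph above.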

Blog search

Blog search
            Precision                              Overlap                       All
            Technorati   Google Blog  Bloglines    T/G       T/B       G/B       T/G/B     PrecisionGoogle Blog - PrecisionBloglines
Mean        33.06        52.5         44.44        3.611     9.167     6.944     1.389     8.06
Median      30           42.5         47.5         0         7.5       5         0         10
Mode        30           40           50           0         5         5         0         10
Std. Dev.   21.15        22.18        14.34        7.031     7.717     6.449     3.346     17.75
N           18           18           18           18        18        18        18        18

In the above table, I calculated the mean, median, mode, and standard deviation for the precision and overlap of the blog search engines. I also added a column for the precision of Google Blog minus the precision of Bloglines. I chose these two search engines because they had a higher average precision than Technorati. I then used this data to perform a hypothesis test to determine whether there was sufficient evidence to conclude that Google Blog searches are more precise than Bloglines searches. My null hypothesis is that μd (the mean of the differences) is less than or equal to zero, and the alternative hypothesis is that μd is greater than zero. I chose a significance level (alpha) of .025 and, with N = 18 (17 degrees of freedom), calculated the t-statistic to be 1.93, which does not fall within the rejection region of t greater than 2.110. Therefore, I fail to reject the null hypothesis. There is insufficient statistical evidence to conclude that the mean difference between the precisions of Google Blog and Bloglines is greater than zero (that is, no evidence that Google Blog has a higher precision).

GB
            o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Mean        0.2941    0.3529    0.4706    0.4118    0.4706    0.8235    0.7059    0.7647    1.0588
Median      0         0         0         0         0         0         0         0         1
Mode        0         0         0         0         0         0         0         0         0
Std. Dev.   0.4697    0.6063    0.6243    0.6183    0.7174    1.0146    0.9196    1.0914    1.1974
N           17        17        17        17        17        17        17        17        17

BG
            o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Mean        0.2941    0.3529    0.5882    0.4118    0.5294    0.8235    0.5294    0.8824    1.1176
Median      0         0         0         0         0         1         0         1         1
Mode        0         0         0         0         0         0         0         0         1
Std. Dev.   0.4697    0.6063    0.8703    0.6183    0.7174    1.0744    0.6243    0.9926    1.1663
N           17        17        17        17        17        17        17        17        17

Top 5 results in Bloglines also appearing in Google Blog: .4706
Results 6-10 of Bloglines also appearing in Google Blog: .8235 - .4706 = .3529
Results 11-20 of Bloglines also appearing in Google Blog: 1.0588 - .8235 = .2353 (divided by 2 to put in terms of 5 results = .11765)
Top 5 results in Google Blog also appearing in Bloglines: .5882
Results 6-10 of Google Blog also appearing in Bloglines: .8235 - .5882 = .2353
Results 11-20 of Google Blog also appearing in Bloglines: 1.1176 - .8235 = .2941 (divided by 2 to put in terms of 5 results = .14705)

In the above table, I calculated the mean, median, mode, and standard deviation for the overlap of rankings in Google Blog/Bloglines and Bloglines/Google Blog. The idea is to determine whether a top result in one search engine is more likely than a lower result to appear in the other search engine. Therefore, I used the means to calculate the average number of overlaps in the top 5, 6-10, and 11-20 results. I found that there is indeed more overlap in the top results than in the lower results.

Discussion

Web search

Based on the data sets, I can conclude that no single search engine is more accurate than another. I showed this in my hypothesis test to determine whether Google searches were more precise than Yahoo searches. There was insufficient statistical evidence that they were. The top results in each engine were more likely than lower-ranked results to appear in the other engine's results. Therefore, when performing a search and only looking at the top few results, you are essentially getting a lot of the same results no matter which search engine you use. I recommend that a person searching for information use the search engines interchangeably for the most part. To save time, I would not recommend running the same search in all three, because there is not much difference in results or precision. If this topic were investigated further, I would recommend a bigger sample size. The small samples used here do not represent the true data well, and they make statistical analysis more difficult because the sample is not large enough to assume normality.

Blog search

Based on the data sets, I can conclude that no single blog search engine is more accurate than another. I showed this in my hypothesis test to determine whether Google Blog searches were more precise than Bloglines searches. There was insufficient statistical evidence that they were. The top results in each engine were more likely than lower-ranked results to appear in the other engine's results. Therefore, if you only look at the top results, you will get some of the same results across the two search engines. However, because the overlap was so low, I would recommend searching for a query in all of them if time allows. If this topic were investigated further, I would recommend a few changes in the methods and approach. First of all, I would use bigger samples. The small samples used here do not provide a good representation of the true data. Also, I would revise my query, because searching for a term that specifically relates to the search engine, such as "Google Chrome" in the Google Blog search engine, could distort the results somewhat.
