
JOURNAL OF RESEARCH IN NATIONAL DEVELOPMENT VOLUME 8 NO 1, JUNE, 2010


INTERNET SEARCH TOOLS: A COMPARATIVE STUDY AND ANALYSIS

Akeredolu Gbenga
Department Of Computer Science, Federal Polytechnic, Idah. Kogi-State, Nigeria
akmasgbenga@yahoo.com

Abstract
The ultimate goal in designing and publishing a web page is to share information. To do so, web pages must be accessible to the outside world, and an important facilitator for this is the internet search tool (ST). The objective of this paper is to evaluate the precision and effectiveness of information retrieved by nine internet search tools: Google, AltaVista, Alltheweb, Lii, Yahoo, Infomine, Vivisimo, ixquick and askjeeves. Each of the nine search tools was tested with ten different queries, and an experiment was conducted to compare the precision (relevance) of the first ten retrieved links. Two-way analysis of variance (ANOVA) was used to analyze the responses. The results show that all the search tools were sensitive to the queries and that there were noticeable differences in the precision of the retrieved references. Among all the search tools, Vivisimo was found to be the best, with the highest mean precision of 0.89 (i.e. proportion of good links), followed by askjeeves and Yahoo respectively.

Keywords: Search tool, search engine, information retrieval, query.

 

Introduction
The web has grown very rapidly in its number of pages, number of hosts and number of domain names.

Fig 1 illustrates the number of hosts available on the Internet between 1994 and 2009:


Figure 1:  Internet host count between 1994 and 2009.


The largest search engines have done an impressive job in extending their reach, though web growth itself has exceeded the crawling ability of search engines (Lawrence and Giles, 1999); even the largest popular search engines, such as AltaVista and HotBot, indexed less than 18% of the accessible web as at February 1999 (Lawrence and Giles, 1999).

Web search engines create and maintain an index of words within documents that are found on the web. They return to a user a ranked list of relevant documents as search results. Few of these results may be valuable to a user (Cartwright and Shepherd, 2002). Several ranking methods have been proposed to improve the ranking of the resulting documents (Lawrence and Giles, 1999). For this reason, it may be helpful to use some user context information when returning and ranking results.

Search engines are listed among the most accessed sites (Courtois and Benny, 1999), and most people use them to find interesting information on the web. As the web continues to grow, major general-purpose search engines have faced serious problems. They are unable to index all the documents on the web because of the rapid growth in the amount of data and in the number of publicly available documents. Their results may be out-of-date, and they do not index documents behind authentication requirements or search forms. As more people share their information with others, the need for better search services to locate interesting information becomes increasingly important (Gantz and Glashem, 1999).

The World Wide Web (WWW) is widely used for the exchange of information. Search engines play a large part in facilitating research and the exchange of information by allowing users to find what is already in the public domain and by providing the Uniform Resource Locators (URLs) needed to reach it. Many search tools are available, and users tend to find one or two with which they are comfortable and stick to them (Hearst, 2000), but is there any way of ascertaining whether these are well suited to the user's needs, in terms of the results they return? Will experimentation and qualitative analysis of these results allow us to make inferences about the suitability and relative performance of different Search Tools (STs)?

A study indicates that Information Technology (IT) professionals tend to use mainly simple searches, suggesting that perhaps they are no better informed about searching or Information Retrieval (IR) techniques than web users in general (Courtois and Benny, 1999).

This research work explores an empirical approach to investigating search tools and analyzing their results. This allows us to make some inferences about the tools involved and to identify questions and avenues for further research.

Methodology
Experimental data collected
The results for each search tool were measured as follows:
Measures of overlap
Number of Unique URLs returned: a count of the number of URLs not duplicated in the results returned by the other search tools tested. For the results from one search engine, duplicate links with the same URL are eliminated.
Number of URLs found by other search tools: a count of URLs returned which matched results returned by other search tools.
Number of Unique relevant URLs: a count of the number of URLs that are relevant and not duplicated in the results of other search tools.
Number of Unique irrelevant URLs: a count of the number of URLs that are irrelevant and are not duplicated by other search tools.
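The overlap counts above reduce to simple set operations over each tool's de-duplicated result URLs. The sketch below illustrates the idea; the tool names and URL sets are hypothetical, not the study's data.

```python
# Hypothetical de-duplicated result sets for three search tools.
results = {
    "tool_a": {"u1", "u2", "u3"},
    "tool_b": {"u2", "u4"},
    "tool_c": {"u3", "u5"},
}

def unique_urls(tool, results):
    """URLs returned by `tool` and by no other search tool."""
    others = set().union(*(r for t, r in results.items() if t != tool))
    return results[tool] - others

def shared_urls(tool, results):
    """URLs returned by `tool` that at least one other tool also returned."""
    return results[tool] - unique_urls(tool, results)
```

The relevant/irrelevant variants of these counts are the same operations restricted to the URLs judged relevant or irrelevant.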

Measure of relevancy
Number of URLs irrelevant to the query: a count of the URLs irrelevant to the query.
Mean Keyword matches for all URLs: the mean of the keyword matches for all results returned by a search tool.
Mean Keyword matches for irrelevant URLs: the mean of the keyword matches for the results returned by a search tool considered irrelevant.
Mean clicks to relevant information: the mean number of hyperlinks that have to be followed in order to reach relevant information, for each search tool.
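The relevancy measures above can be sketched as straightforward aggregations over per-page records; the records and their values below are hypothetical illustrations.

```python
# Each retrieved page as a small record (all values hypothetical).
pages = [
    {"url": "u1", "relevant": True,  "keyword_matches": 5, "clicks_to_info": 1},
    {"url": "u2", "relevant": False, "keyword_matches": 2, "clicks_to_info": 0},
    {"url": "u3", "relevant": True,  "keyword_matches": 4, "clicks_to_info": 2},
]

def mean(xs):
    return sum(xs) / len(xs) if xs else 0.0

# Number of URLs irrelevant to the query.
irrelevant_count = sum(not p["relevant"] for p in pages)
# Mean keyword matches for all URLs, and for the irrelevant ones only.
mean_matches_all = mean([p["keyword_matches"] for p in pages])
mean_matches_irrelevant = mean(
    [p["keyword_matches"] for p in pages if not p["relevant"]])
# Mean clicks needed to reach relevant information.
mean_clicks_relevant = mean(
    [p["clicks_to_info"] for p in pages if p["relevant"]])
```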

Measure of freshness
Number of dead URLs: a count of the number of URLs which generate error 404 (page not found) or 603 (server not responding) when requested. In the case of error 603, the link is tried twice before concluding that it is a dead link.
Number of redirects: a count of the number of URLs that lead to a redirect page.
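The freshness check above, with the retry on a non-responding server, can be sketched as follows. The `fetch` parameter is a hypothetical injected callable that returns an HTTP status code or raises `OSError` when the server does not respond, so the classification logic can be exercised without a network.

```python
def classify(url, fetch, retries=2):
    """Return 'dead', 'redirect', or 'alive' for one URL.

    A URL is dead on a 404 response, or after `retries` failed attempts
    when the server does not respond (the paper's error 603).
    """
    for _ in range(retries):
        try:
            status = fetch(url)   # hypothetical: returns an HTTP status code
        except OSError:           # no response from server
            continue              # retry before declaring the link dead
        if status == 404:
            return "dead"
        if status in (301, 302):
            return "redirect"
        return "alive"
    return "dead"                 # all attempts failed to get a response
```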

Sample queries and the test environment
Sample queries
Ten separate search queries were constructed and run on each of the nine search tools. These queries are intended to test the various features each search engine claims to have, as well as to represent different levels of searching complexity.

Test queries:

  • TCP/IP
  • Data Mining
  • Artificial Intelligence
  • HTML
  • Java
  • Fortran
  • Software engineering
  • Distributed systems
  • Information systems
  • Compiler construction

As can be seen, all these queries fall within the domain of information technology. This domain was chosen intentionally by the author for familiarity, so as to be able to judge the search tools well.

The test environment
Microsoft Internet Explorer was used as the web browser for the searches, since it is compatible with all the selected search tools and adequately supports their features.
Owing to time constraints, only the first 10 web records were examined for each query. Since all the selected search tools display results in descending order of relevance, calculated one way or another, it is believed that this will not critically affect the validity of the study.
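Because only the first ten results are examined per query, the precision figures reported below are effectively precision-at-10: the fraction of the first ten links judged relevant. A minimal sketch (the judgement lists are hypothetical):

```python
def precision_at_k(judgements, k=10):
    """judgements: booleans for the returned links, in rank order."""
    top = judgements[:k]
    return sum(top) / len(top) if top else 0.0

def mean_precision(per_query):
    """Mean precision over all test queries for one search tool."""
    return sum(per_query) / len(per_query)

# e.g. 8 relevant links among the first 10 gives precision 0.8
judged = [True] * 8 + [False] * 2
assert precision_at_k(judged) == 0.8
```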

 

Data analysis
Precision chart for the nine search tools
The search tools are indicated using the following symbols:
ST1 – Google          ST2 – AltaVista       ST3 – Alltheweb
ST4 – lii.org         ST5 – Yahoo.com       ST6 – Infomine.edu
ST7 – Vivisimo.com    ST8 – ixquick.com     ST9 – askjeeves.com


 

Query   ST1    ST2    ST3    ST4    ST5    ST6    ST7    ST8    ST9    Mean
1       0.8    0.3    0.7    0.5    0.9    0.5    1.0    0.8    0.9    0.71
2       0.6    0.3    0.3    0.3    0.8    0.8    0.8    0.6    0.8    0.58
3       0.9    0.6    0.7    0.9    0.9    0.9    1.0    0.9    1.0    0.86
4       0.8    0.5    0.6    0.4    0.8    0.7    0.9    0.7    0.9    0.70
5       0.7    0.7    0.5    0.3    0.9    0.8    0.8    0.8    0.8    0.70
6       0.8    0.5    0.4    0.5    0.7    0.7    0.9    0.9    0.9    0.70
7       0.9    0.6    0.7    0.6    0.8    0.8    0.8    0.7    0.8    0.74
8       0.7    0.8    0.6    0.3    0.8    0.7    0.9    0.6    0.7    0.67
9       0.7    0.7    0.7    0.4    0.8    0.8    0.9    0.8    0.8    0.73
10      0.8    0.8    0.5    0.3    0.9    0.7    0.9    0.7    0.8    0.71
Mean    0.77   0.58   0.57   0.45   0.83   0.74   0.89   0.75   0.84

Table 1: Precision table for the nine search tools


Figure 2:  Mean Precision chart for the nine search tools

Dead/irrelevant links by search tool

Search Tool       Dead/irrelevant links
Google            23/100     0.23
AltaVista         42/100     0.42
Alltheweb         43/100     0.43
lii.org           55/100     0.55
Yahoo.com         17/100     0.17
Infomine.edu      26/100     0.26
Vivisimo.com      11/100     0.11
ixquick.com       25/100     0.25
askjeeves.com     16/100     0.16

Table 2: Dead/irrelevant URLs by search tool

Class of search tool     Proportion of dead/irrelevant URLs
Search Engine            108/300     0.36
Directory                 98/300     0.32
Meta Search Engine        52/300     0.17

Table 3: Dead/irrelevant URLs by search tool category

 

Source            Sum of squares   df   Mean square   F       Sig.
Between groups    2.181            8    0.2726        26.21   0.000
Within groups     0.840            81   0.0104
Total             3.021            89

Table 4: ANOVA result

Source                     Sum of squares   df   Mean square   F      Sig.
Treatment (search tools)   0.38             8    0.05          7.1    0.000
Block (queries)            1.30             9    0.14          20.0   0.001
Error                      0.48             72   0.007
Corrected total            2.16             89

Table 5: Test of between-subjects effects (dependent variable: OUTPUT)


Results
The results of this research work are judged on the basis of two hypotheses.

From the ANOVA result in Table 4:

H1: There are differences in the search engines' results.
Level of significance = 0.05.
Test statistic: Fcal = 26.21.
Critical value: Ftab = F(8, 81; 0.05) ≈ 2.05.
Decision rule: reject H0 if Fcal >= Ftab; otherwise do not reject H0.
Conclusion: since Fcal > Ftab, i.e. 26.21 > 2.05, we conclude that the search-tool factor is significant.

Also, from the test of between-subjects effects in Table 5:

H1: There is a significant difference among the means obtained for the different search tools and queries.
Level of significance = 0.05.
Test statistics: Fcal = 7.1 for search tools and Fcal = 20.0 for queries.
Critical values: F(8, 72; 0.05) ≈ 2.08 and F(9, 72; 0.05) ≈ 2.02.
Decision rule: reject H0 if Fcal >= Ftab; otherwise do not reject H0.
Since Fcal > Ftab in both cases, i.e. 7.1 > 2.08 and 20.0 > 2.02, we conclude that the search tools and the queries both have significant effects in the model. Hypothesis H1 can therefore be accepted.
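The randomized-block (two-way, no-replication) ANOVA behind Table 5 can be sketched as follows, with rows as queries (blocks) and columns as search tools (treatments). The 3x4 matrix here is hypothetical illustration data, not Table 1's results.

```python
# Hypothetical precision matrix: 3 queries (blocks) x 4 tools (treatments).
data = [
    [0.8, 0.6, 0.9, 0.7],
    [0.5, 0.4, 0.8, 0.6],
    [0.7, 0.5, 0.9, 0.8],
]
b, t = len(data), len(data[0])                       # blocks, treatments
grand = sum(sum(row) for row in data) / (b * t)      # grand mean
col_means = [sum(row[j] for row in data) / b for j in range(t)]
row_means = [sum(row) / t for row in data]

# Sum-of-squares decomposition: total = treatment + block + error.
ss_treat = b * sum((m - grand) ** 2 for m in col_means)
ss_block = t * sum((m - grand) ** 2 for m in row_means)
ss_total = sum((x - grand) ** 2 for row in data for x in row)
ss_error = ss_total - ss_treat - ss_block

df_treat, df_block = t - 1, b - 1
df_error = df_treat * df_block
f_treat = (ss_treat / df_treat) / (ss_error / df_error)
f_block = (ss_block / df_block) / (ss_error / df_error)
# Reject H0 for a factor when its F statistic exceeds the tabulated
# critical value F(df_factor, df_error; 0.05).
```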

This result reflects the different features of the different search engines. Each search engine collects different information according to its own algorithm and constructs its own information resources. This suggests that the indexes of the search engines are diversified enough that they cover different portions of the entire web.

From the precision chart in Fig 2, Vivisimo has the highest mean precision, 0.89 (i.e. proportion of good links), while lii.org turned out to be the worst, with a mean precision of 0.45. This search tool (lii.org) is comparatively unreliable for research work.

The manner in which Vivisimo.com clusters its results into categories makes it easier to use and provides better information for search engine users.
Looking at the search tools by category in Table 3, the meta search engines are the most current overall; this may be because meta search engines make use of both directories and search engines to compile their results. The lower level of irrelevant URLs for meta search engines may also support their claim to add value to searches through their ranking algorithms, an example being giving more weight to URLs that are ranked highly by more than one of the search tools they use.
Table 2 shows the number of irrelevant links by search tool, as judged subjectively by the author. Lii.org stands out as the least helpful search tool, with 55% of its links judged irrelevant.

Summary
This work analyzed the results of information retrieved from nine search tools. Ten different queries were used to test each of the search tools, and the first ten links returned were followed carefully to see how relevant they were to the given query. The results of the experiment show that Vivisimo is the best, having scored highest in all the features tested, followed closely by askjeeves and Yahoo.

Conclusions
This work has described a preliminary study into web searching and has pointed to some specific results:

 

References:
Belkin, N.J. (2000): Helping people find what they don't know. Communications of the ACM, 43(8).

Cartwright, A.M.H. and Shepherd, M.J. (2002): A quantitative analysis of search engine responses.

Courtois, M.P. and Benny, M.N. (1999): Results ranking in web search engines.

Davidson, B.D. (2000): Topical locality in the web. Research and Development in Information Retrieval, pages 272-279.

Diligenti, M., Coetzee, F., Lawrence, S. and Giles, C.L. (2000): Focused crawling using context graphs. In 26th International Conference on Very Large Databases (VLDB), Cairo, Egypt.

Gantz, J. and Glashem, C. (1999): The global market forecast for internet usage and commerce. IDC Report, Version 5.

Guernsey, L. (2001): Mining the deep web with specialized drills. New York Times, January 25.

Hearst, M.A. (2000): Next generation web search. IEEE Data Engineering Bulletin.

Lawrence, S. and Giles, C.L. (1999): Accessibility of information on the web. Nature, 400:107-109.