The Anatomy of a Small-Scale Document Search Engine Tool: Incorporating a new Ranking Algorithm

(1)

The Anatomy of a Small-Scale Document

Search Engine Tool: Incorporating a new

Ranking Algorithm

1

KAUSTUBH S. RAVAL

Research Scholar raval_kaustubh@yahoo.co.in

2

RANJEETSINGH S. SURYAWANSHI

Research Scholar

ranjeetsuryawanshi06@gmail.com

3

J. NAVEENKUMAR

Research Scholar pro_naveen@hotmail.com

4

DEVENDRA M. THAKORE

Professor, Department of Computer Engineering dmthakore@bvucoep.edu.in

1, 2, 3, 4

Bharati Vidyapeeth Deemed University, College of Engineering, Pune – 411043, Maharashtra, INDIA.

Abstract:

A search engine is an information retrieval system to help find out the information contained in documents stored on a computer system. The results provided by this kind of a system are usually in form of a list. Search engines basically work on the concept called ‘Text-Mining’. Text mining is a variation on a field called data mining and refers to the process of deriving high-quality information from unstructured text. In this paper we are going to depict an intelligent agent based search engine tool which takes the input from user in form of keyword and based on the keyword, find out the matching documents and show it to user (in the form of links). This tool uses a new ‘Ranking Algorithm’ to rank the documents.

Keywords – Search Engine, Text Mining, Intelligent Agent, Ranking Algorithm.

I. Introduction

First of all, we need basic information about various terms on which this work is to be carried out.

Search Engine: A search engine is an information retrieval system to help find out the information contained in documents stored on a computer system. The results provided by this kind of a system are usually in form of a list.

(2)

to as a search query. Then search engines identify the desired concept that one or more documents may contain.

The list of items that meet the criteria specified by the query is typically sorted, or ranked. Ranking items by

relevance (from highest to lowest) reduces the time required to find the desired information. [1]

Text Mining: Text-mining is a variation on a field called data-mining and refers to the process of deriving high-quality information from the unstructured text. ‘High high-quality’ in text-mining usually refers to some combination

of relevance, novelty and interestingness. [2]

Intelligent Agents: Intelligent agents are software entities that carry out some set of operations on behalf of a user with some degree of independence or autonomy, and in doing so, employ some knowledge or

representation of the user’s goals or desires. Software agents are useful in automating repetitive tasks, finding

and filtering information, intelligently summarizing complex data, and so on, but more importantly, just like

their human counterparts, intelligent agents can have capability to learn from the managers and even make

recommendations to them regarding a particular course of action. [3]

Ranking Algorithm: Ranking algorithms are mainly used to rank or index the documents based on some relevance. They are a formal set of instructions that can be followed to perform ranking or indexing task, such as

a mathematical formula or a set of instructions in a computer program. A ranking is a relationship between a set

of items such that, for any two items, the first is either 'ranked higher than', 'ranked lower than' or 'ranked equal

to' the second. The rankings themselves are totally ordered and make it possible to evaluate complex

information according to certain criteria. [4]

Motivation: The literature study of various research papers and my interest in the field of ‘Data Mining’ motivated me to take up this as my dissertation topic for post-graduation.

Study of different existing ranking algorithms gave me insight how the ranking algorithms work and ultimately

provided me the idea of developing new algorithm.

Working scenario of ‘Google Search Engine’ also has been the motivational factor to take up this topic as my

dissertation work. ‘Google Search Engine’ is the best example of optimized intelligent software agent based

text-mining system encompassing a very large domain of web.

II. System Description

System description is the context which includes the details about the overall working of the existing or

proposed system.

Why Agents?

Text mining mainly includes the field of information retrieval which means the finding of documents which

contain answers to questions and not the finding of answers itself and for this to achieve statistical measures and

methods are used. By using statistical measures and methods automatic processing of text data and comparison

(3)

System Architecture

Fig. 1 shows the architectural diagram for intelligent agent based text-mining system. It includes all the

components required to make the system workable and the relationship and interaction between them. There are

mainly three agents, one dataset, the user category, and one cache/log component.

We are basically considering that all the processing has been done on the dataset and after those

pre-processing steps we would have following things:

i) List of keywords

ii) Table or File consisting of Contexts (categories) and their respective Keywords.

iii) Table or File consisting of Contexts and their respective Documents (Articles).

Now, how exactly the system should work is to be explained as follows:

Step 1: When the user types in something (Word), get that keyword and look up into the list of keywords.

Step 2: If the ‘keyword’ appears in the list then look for the corresponding category.

Step 3: After getting the corresponding category, find out all the documents in that category in which ‘keyword’

appears.

Step 4: Now, weigh all these documents for a given ‘keyword’ using term frequency method and assign weight

values to respective documents (articles).

Step 5: Now, using the ranking algorithm (which will be presented in the next section), rank all the documents

and then shows those ranked documents to user.

(4)

III. Algorithm

According to Webster’s Dictionary “An algorithm is a precise set of rules specifying how to solve some

problem”.

I’m going to write the steps for a new ranking algorithm which is used to rank documents based on their

weights. Weights of the documents are calculated using (tf*idf) equation, where ‘tf’ represents term frequency

(word frequency) and ‘idf’ represents inverse document frequency.

So, the equation for weighing a particular document d1 can be,

(1)

Where, tf(d1) - term frequency in document d1

n – Total number of documents

df – no. of documents in which the term appears.

Moreover, I’m considering that all the documents which contains the required word are weighted using the [Equ

– 1].

Now, below is the ranking algorithm for ranking the documents based on weight values calculated.

Ranking Algorithm

Inputs: list l, int dw[], int num

// ‘l’ is the list of documents in which inputted word appears.

// ‘dw’ is an integer array containing weight values of the documents in ‘l’.

// ‘num’ represents the no. of documents containing inputted word.

Output: list final // ‘final’ will contain all the ranked documents

Step 1: Find the average weight value by adding the weights of all documents in ‘l’ and dividing that by ‘num’.

Step 2: for all the documents in ‘l’, compare the weight of document with the ‘avg. weight’.

Step 3: The documents which are having less weight than the ‘avg. weight’ will be added to another list ‘M’ and their weights are added to another array ‘weight1’ and the documents which are having more weight than ‘avg.

weight’ will be added to another list ‘N’ and their respective weights are added to another array ‘weight2’.

Step 4: If there are more than one documents in the list ‘N’ then recursively call the Algorithm with parameters related to documents in list ‘N’. If all the documents in the list ‘N’ are having same ‘weight value’ then there is

no need to repeat the process because every time the average weight will be same and thus it will go into the

infinite loop. And if there is only one document in the list ‘N’ then add that document in the ‘final’ list.

Step 5: If there are more than one documents in the list ‘M’ then recursively call the Algorithm with parameters related to documents in list ‘M’. If all the documents in the list ‘N’ are having same ‘weight value’ then there is

Weight (d1) = tf(d1) * idf

(5)

no need to repeat the process because every time the average weight will be same and thus it will go into the

infinite loop. And if there is only one document in the list ‘N’ then add that document in the ‘final’ list.

Step 6: Display the documents in the list ‘final’ to the user.

Thus ‘final’ will have all the ranked documents with their title and data which will be displayed to user who had

given the search query using particular keyword.

IV. Results

We converted the above algorithm into JAVA programming code and tested it on the Reuters-21578 news

articles dataset. We had compiled and ran that code into command prompt and got the results as shown below

for the inputted word ‘APPLE’.

Table 1: Weighted Documents

Weight Title (Inputted Word: Apple)

34.91431376 APPLE COMPUTER <AAPL> UPGRADES

MACINTOSH LINE

20.94858825 APPLE COMPUTER <AAPL> HAS NEW

MACINTOSH MODELS

6.982862751 LOTUS <LOTS> INTRODUCES NEW SOFTWARE

20.94858825 APPLE <AAPL>, AST <ASTA> OFFER MS-DOS

PRODUCTS

6.982862751 SOFTWARE COS SUPPORT NEW APPLE

<AAPL> PRODUCTS

6.982862751 COMPUTER COMPANIES FORM NETWORKING

GROUP

13.9657255 FCOJ SUPPLIES SIGNIFICANTLY ABOVE YEAR

AGO-USDA

13.9657255 BERTELSMANN TO MARKET APPLE

SOFTWARE IN GERMANY

6.982862751 MOTOROLA (MOT) SEES CONTINUED

GROWTH FOR CHIPS

6.982862751 TECHNOLOGY/DESKTOP PUBLISHING

20.94858825 APPLE <AAPL> EXPANDS NETWORK

CAPABILITIES

6.982862751 RJR <RJR> UNIT SELLS FOUR TOBACCO

BRANDS

6.982862751 NYNEX <NYN> TO SELL NEW IBM <IBM>

COMPUTERS

6.982862751 WALL STREET STOCKS/COMPAQ COMPUTER

<CPQ>

13.9657255 THORN-EMI WINS U.S. RULING IN BEATLES

SUIT

6.982862751 DIGITAL COMMUNICATIONS <DCAI>

INTRODUCES ITEMS

The graph has been generated on the basis of the result of the above table which shows the weight of the word in

each document with respect to which the documents has been weighted in descending order. The graph is shown

(6)

Figure 2: Weighted document graph.

The weighted document shown above in table 1 is just a sample, for showing the result of the algorithm which is

been implemented. Some more results about the words inputted to the algorithm are shown below. The graph

below shows the highest ranked documents for the searched word.

Table 2: Highest Ranked documents for various words

Weight Title

34.91431376 APPLE COMPUTER <AAPL> UPGRADES

MACINTOSH LINE

31.59958129 PORSCHE EXPECTS IMPROVEMENT IN U.S.

SALES

21.61913187 REGAN DEPARTURE MAKES 3RD VOLCKER

TERM LIKELY

21.61913187 DONALD REGAN SAYS U.S. SHOULD EASE

CREDIT SUPPLY

36.35504269 QATAR"S BANKS SET FOR FURTHER LEAN

SPELL

13.77510514 ESSO UK PLANNING SLIGHTLY LESS OIL

(7)

Figure 3: Graphs for highest ranked documents

Conclusion

Information retrieval in documents became a very tedious work because of very large amount of data. But if the

documents are weighted and ranked properly then the relevant documents are retrieved easily and with less

amount of time. Results in this research paper shows that, both, document weighing and ranking works

correctly. Thus we can conclude that there could be a small-scale search engine tool that can incorporate a new

ranking algorithm and serve the purpose of information retrieval in documents.

References

[1] http://en.wikipedia.org/wiki/Search_engine_ (computing)

[2] Vishal Gupta and Gurpreet S. Lehal, “A Survey of Text Mining Techniques and Applications”, Journal of Emerging Technologies in Web Intelligence, vol. 1,pages 60-76, August 2009

[3] Stuart Russell and Peter Norvig, “Artificial Intelligence, Chapter 2: Intelligent Agents – A Modern Approach”. [4] http://en.wikipedia.org/wiki/Ranking

[5] Kaustubh S. Raval, Ranjeetsingh S. Suryawanshi and Prof. Devendra M. Thakore, “An Intelligent Agent Based Text-Mining System: Presenting Concept through Design Approach”, International Journal of Computer Science and Information Security, Vol.9 No.4, pages. 112-117, April 2011.