SEO Articles
Topic-Sensitive PageRank
July 17th, 2006The link structure of the Web is highly sensitive to page topic. Pages tend to contain links pointing to other pages on the same broad topic, e.g. pages on investment banking often link to other business-related resources but rarely to sports portals. While using offline PageRank scores has an advantage of faster processing, it also creates a situation where some highly linked page receive higher ranking on topics for which they have no authority. A query-time adjustment of the scoring function is necessary to refine the search results. Some algorithms like HITS and Hilltop allow such an adjustment. However these algorithms have their own shortcomings that restrict their efficient use by search engines.
HITS algorithm calculates hubs and authorities in query-time but relies on a relatively small subset of the Web – the immediate neighborhood of a page, since otherwise computation time would be unacceptably long. Hilltop algorithm analyses a query and calculates score values by finding pages that seem to be experts in the query-specific topic. This algorithm restricts itself to popular queries, since it can’t produce score values when no experts for an uncommon search term are found.
Topic-Sensitive PageRank extends the original PageRank idea by adding a query-time topic-sensitive adjustment. Instead of a single vector of PageRank values, multiple topic-specific PageRank vectors are calculated. Creation of a PageRank vectors for every possible topic would require extensive resources, so in practice the algorithm uses only 16 topic-specific ranking vectors representing the top categories of the ODP project. Other sources of topics can be used for this purpose as well, but since ODP project is created and edited by a large number of independent volunteers, it is the least likely to be influenced by any one party. For each page in the Web a set of importance scores with respect to various topics is precomputed and stored offline. In query time the topic-specific score is combined with other scores (e.g. content analysis) to form the final ranking for a page.
Let cj be one of the 16 top-level ODP categories. For each topic cj it is necessary to calculate a biased PageRank vector. Let M be a modified adjacency matrix. Each element mji has value 1/Nj, if there is a link from j to i, and where Nj is the number of outgoing links on page j. Otherwise element value is 0. Then the original PageRank formula in matrix notation looks as following:

Parameter α here is the dumping factor that equals 1-d in the original PageRank formula. The resulting vector of PageRank values is denoted as PR(α, p).
With minor modifications the same formula is used to calculate the topic-sensitive ranking vectors. Let Tj be the set of pages under a topic cj. Then instead of the uniform distribution damping factor p, a non-uniform vector vj is used, where:

Resulting PageRank vector is denoted as PR(a, vij). Additionally using all the documents under each topic cj a term vector Dj is constructed. Term vector elements Djt are the numbers of occurrences of every term under topic cj. In order to detect a topic, to which a search query term relates to, two scenarios are considered. In the first scenario a user highlights a keyword in a page and initiates a search. In this case the search topic is defined by the page content. For example if word ‘architecture’ is highlighted in a page about famous buildings, the pages on CPU architecture should not appear among search results. So if a term q is highlighted in some page u, its context q’ would be the words in u. In the second scenario a user enters a keyword into a search form in the conventional way. In this case the context of the query q is the search term itself: q’ = q. When a history of search terms is kept, it is also possible to use it as the context q’. In query time the proximity of query context q’ to one of the topics cj is calculated:

Proximity value P(q’|cj) is calculated using the term vectors Dj. Then a query-sensitive score values are computed for every page d from the index that contains the original search term q. For each document d we sum up the products of topic proximity values P(cj|q’) and the page rank rd. The rank rd is an element of the page rank vector PR(a, vji) of a topic to which the document d belongs to:

The final search results are sorted by the values of sd. Since the PageRank calculations are performed in advance, the algorithm is able to quickly perform topic adjustments in the query time.
Cited resources
- Haveliwala, T.H. ‘Topic-sensitive PageRank’. In Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, May 2002. Available at http://citeseer.ist.psu.edu/haveliwala02topicsensitive.html
- Haveliwala, T.H. ‘Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search’. IEEE Trans. Knowl. Data Eng., 15(4):784–796, 2003. Available at http://citeseer.ist.psu.edu/article/haveliwala03topicsensitive.html
Did you like it? Was it useful? Bookmark or share this post:



Digg This!
Technorati
Del.icio.us
Furl
Blinklist
Ma.gnolia
Yahoo! My Web

