By Gerard Salton

Presents a theory of indexing capable of ranking index terms, or subject identifiers, in decreasing order of importance. This leads to the choice of good document representations, and also accounts for the role of phrases and of thesaurus classes in the indexing process.

This study is typical of theoretical work in automatic information organization and retrieval, in that concepts are used from mathematics, computer science, and linguistics. A complete theory of information retrieval may emerge from an appropriate combination of these three disciplines.

TABLE 20. Average precision values at indicated recall points for three collections.
Columns: standard term frequency weights; phrases formed from high frequency nondiscriminators; phrases formed from medium frequency discriminators. [table values not recoverable]

$f_i^k$: Standard term frequency weighting (word stem run).
SPT: Single terms, pairs and triples used in queries and documents.
PT: Pairs and triples used; corresponding single terms deleted.
ST: Single terms retained; triples added.
P: Pairs added; corresponding single terms deleted.

To summarize, several methods based on the multiplication of standard term frequency weights by inverse document frequency and discrimination values have been found that appear to offer high performance standards. Among the methods which offer statistically significant improvements over the standard term weighting procedures for all processing environments, the following are the most promising:
(a) $f_i^k$ standard weights with elimination of poor discriminators;
(b) $f_i^k \cdot (\mathrm{IDF})_k$ without elimination, or with elimination of poor discriminators or of terms with high document frequency;
(c) $f_i^k \cdot DV_k$ with elimination of poor discriminators or of high frequency terms.
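The combination of term frequency weights with discrimination values, together with the elimination of poor discriminators, can be illustrated with a small sketch. The excerpt does not define the discrimination value, so the code assumes one common formulation from the term discrimination model: the space density Q is taken as the average pairwise cosine similarity of the document vectors, and the discrimination value of term k is Q_k - Q, where Q_k is the density after term k is removed. Terms with a non-positive value are treated as poor discriminators. All function names here are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def density(vectors):
    """Assumed density measure: average pairwise cosine similarity."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def discrimination_values(vectors):
    """DV_k = Q_k - Q: change in density when term k is removed."""
    base = density(vectors)
    terms = set().union(*vectors)
    values = {}
    for k in terms:
        reduced = [{t: w for t, w in v.items() if t != k} for v in vectors]
        values[k] = density(reduced) - base
    return values


def tf_dv_weights(vector, dv):
    """Term frequency times DV, dropping poor discriminators (DV <= 0)."""
    return {t: w * dv[t] for t, w in vector.items() if dv[t] > 0}
```

In this sketch a term that occurs in every document, such as a function word, lowers the density when removed and so receives a negative value; it is then eliminated by `tf_dv_weights`.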

1, averaged over the 24 user queries that are used with each collection.

TABLE 9. Comparison of binary and term frequency weighting with and without inverse document frequency normalization.
Columns: binary weights; term frequency weights; binary weights with IDF; term frequency weights with IDF. Rows: CRAN, MED, Time. [table values not recoverable]

Four weighting procedures are used to produce the output of Table 9, including binary term weights $b_i^k$, term frequency weights $f_i^k$, and binary as well as term frequency weights multiplied by an inverse document frequency factor, designated $(\mathrm{IDF})_k$ in Table 9.
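The four weighting procedures compared in Table 9 can be sketched as follows. This is a minimal illustration and not the procedure used in the experiments: the excerpt does not give the IDF formula, so the code assumes the form log2(n / d_k) + 1, where n is the number of documents and d_k is the number of documents containing term k. The function names are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts once per term
    return df


def idf(term, df, n_docs):
    """Assumed IDF form: log2(n / d_k) + 1 (not taken from the source)."""
    return math.log2(n_docs / df[term]) + 1.0


def weight_vectors(doc, df, n_docs):
    """Return the four weightings for one tokenized document."""
    tf = Counter(doc)
    return {
        "binary": {t: 1.0 for t in tf},
        "tf": {t: float(c) for t, c in tf.items()},
        "binary_idf": {t: idf(t, df, n_docs) for t in tf},
        "tf_idf": {t: c * idf(t, df, n_docs) for t, c in tf.items()},
    }
```

With four documents in which a term occurs in two, the assumed IDF factor is log2(4 / 2) + 1 = 2, so a term frequency of 2 yields a combined weight of 4, while a term found in only one document receives the larger factor 3.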