TF-IDF.md (4252B)
1 # TF-IDF 2 3 **Source:** Wikipedia 4 5 **Definition:** TF-IDF (term frequency-inverse document frequency) is an information retrieval metric used to describe the importance of a word in a given document from a corpus of documents. 6 7 Intuitively, this value describes how different the usage of a given term is in the current document compared to how it is used in other documents in the corpus. The idea is terms used more frequently in the current document are of interest. 8 9 ## Calculation 10 11 1. Calculate TF(word) 12 - \# of times word appears in document / total number of terms in the document 13 - Probability of a random term in the document being the word 14 2. Calculate IDF(word) 15 - Inverse document frequency of the word is the log of the # of documents / number of documents with the word 16 - This comes directly from the information theoretic derivation of the information derived from knowing a word exists in a document. 17 3. Calculate TFIDF(word) 18 - TF(word) * IDF(word) 19 - This amounts to the probability of a random term in the document being the word multiplied by the probability a random document has the term. 20 - As such, this value describes 21 22 ## Description of Calculation 23 24 TF is high when the current document uses the term frequently. 25 26 IDF is high when most documents don't use the term at all. 27 28 Given this, high values occur when infrequently used terms are used frequently in the current document since we multiply the TF and IDF to get the final value. 29 30 ## Implementation 31 32 ```python3 33 import os 34 import math 35 import re 36 import sys 37 38 # get all words 39 def get_words(filename): 40 with open(filename, 'r') as f: 41 lines = f.read() 42 lines = re.sub('[^0-9a-zA-Z]+', ' ', lines) 43 lines = lines.split(' ') 44 final = [] 45 for i in range(0, len(lines)): 46 if lines[i] != '': 47 final.append(lines[i].lower()) 48 return final 49 50 def tfs(prefix, filenames, word): 51 tfs = {} 52 for filename in filenames: 53 tfs[filename] = tf(prefix, filename, word) 54 return tfs 55 56 def tf(prefix, filename, word): 57 words = get_words(prefix + filename) 58 count_of_word = 0 59 for cw in words: 60 if cw == word: 61 count_of_word += 1 62 if len(words) != 0: 63 return count_of_word / len(words) 64 return 0 # empty documents 65 66 # technically, we might just want the one word and output the value for that 67 def idf(prefix, filenames): 68 word_document_frequency = {} 69 for filename in filenames: 70 words = get_words(prefix + filename) 71 for word in words: 72 if word in word_document_frequency: 73 word_document_frequency[word] += 1 74 else: 75 word_document_frequency[word] = 1 76 idf = word_document_frequency.copy() 77 for word in idf: 78 idf[word] = math.log(len(filenames) / idf[word]) 79 return idf 80 81 if __name__ == "__main__": 82 user_input = True # continually prompt if the user is in interactive mode 83 while user_input: 84 word = "" 85 top_k = 1 86 if len(sys.argv) == 2: 87 word = sys.argv[1] 88 top_k = int(sys.argv[2]) 89 user_input = False 90 else: 91 user_input = True 92 word = input("Word to find: ") 93 top_k = int(input("Top k elements to show: ")) 94 95 filenames = os.listdir('documents') 96 idf_dict = idf('documents/', filenames) 97 98 if word not in idf_dict: 99 print('Word does not appear in any documents') 100 exit() 101 102 tf_dict = tfs('documents/', filenames, word) 103 104 tfidf = {} 105 106 for filename in filenames: 107 tfidf[filename] = idf_dict[word] * tf_dict[filename] 108 109 sorted_items = sorted(tfidf.items(), key=lambda kv: (kv[1], kv[0])) 110 sorted_items.reverse() 111 112 for i in range(top_k): 113 print(sorted_items[i]) 114 ``` 115 116 ## TF-IDF For Feature Extraction 117 118 There is some literature about using TF-IDF for feature extraction. In particular, existing approaches have used TF-IDF with [SVMs](SVM.md) and (separately) [Naive Bayes](NaiveBayes.md) for text classification tasks, first computing features with TF-IDF, and then passing the processed samples into the subsequent classifier. That said, the literature appears sparse.