notes

Personal notes
git clone git://git.laack.co/notes.git
Log | Files | Refs

TF-IDF.md (4252B)


      1 # TF-IDF
      2 
      3 **Source:** Wikipedia
      4 
      5 **Definition:** TF-IDF (term frequency-inverse document frequency) is an information retrieval metric used to describe the importance of a word in a given document from a corpus of documents.
      6 
      7 Intuitively, this value describes how different the usage of a given term is in the current document compared to how it is used in other documents in the corpus. The idea is terms used more frequently in the current document are of interest.
      8 
      9 ## Calculation
     10 
     11 1. Calculate TF(word)
     12     - \# of times word appears in document / total number of terms in the document
     13         - Probability of a random term in the document being the word
     14 2. Calculate IDF(word)
     15     - Inverse document frequency of the word is the log of the # of documents / number of documents with the word
     16         - This comes directly from the information theoretic derivation of the information derived from knowing a word exists in a document.
     17 3. Calculate TFIDF(word)
     18     - TF(word) * IDF(word)
     19         - This amounts to the probability of a random term in the document being the word multiplied by the probability a random document has the term.
     20             - As such, this value describes 
     21 
     22 ## Description of Calculation
     23 
     24 TF is high when the current document uses the term frequently. 
     25 
     26 IDF is high when most documents don't use the term at all.
     27 
     28 Given this, high values occur when infrequently used terms are used frequently in the current document since we multiply the TF and IDF to get the final value.
     29 
     30 ## Implementation
     31 
     32 ```python3
     33 import os
     34 import math
     35 import re
     36 import sys
     37 
     38 # get all words
     39 def get_words(filename):
     40     with open(filename, 'r') as f:
     41         lines = f.read()
     42     lines = re.sub('[^0-9a-zA-Z]+', ' ', lines)
     43     lines = lines.split(' ')
     44     final = []
     45     for i in range(0, len(lines)):
     46         if lines[i] != '':
     47             final.append(lines[i].lower())
     48     return final
     49 
     50 def tfs(prefix, filenames, word):
     51     tfs = {}
     52     for filename in filenames:
     53         tfs[filename] = tf(prefix, filename, word)
     54     return tfs
     55 
     56 def tf(prefix, filename, word):
     57     words = get_words(prefix + filename)
     58     count_of_word = 0
     59     for cw in words:
     60         if cw == word:
     61             count_of_word += 1
     62     if len(words) != 0:
     63         return count_of_word / len(words)
     64     return 0 # empty documents
     65 
     66 # technically, we might just want the one word and output the value for that
     67 def idf(prefix, filenames):
     68     word_document_frequency = {}
     69     for filename in filenames:
     70         words = get_words(prefix + filename)
     71         for word in words:
     72             if word in word_document_frequency:
     73                 word_document_frequency[word] += 1
     74             else:
     75                 word_document_frequency[word] = 1
     76     idf = word_document_frequency.copy()
     77     for word in idf:
     78         idf[word] = math.log(len(filenames) / idf[word])
     79     return idf
     80 
     81 if __name__ == "__main__":
     82     user_input = True # continually prompt if the user is in interactive mode
     83     while user_input:
     84         word = ""
     85         top_k = 1
     86         if len(sys.argv) == 2:
     87             word = sys.argv[1]
     88             top_k = int(sys.argv[2])
     89             user_input = False
     90         else:
     91             user_input = True
     92             word = input("Word to find: ")
     93             top_k = int(input("Top k elements to show: "))
     94 
     95         filenames = os.listdir('documents')
     96         idf_dict = idf('documents/', filenames)
     97 
     98         if word not in idf_dict:
     99             print('Word does not appear in any documents')
    100             exit()
    101 
    102         tf_dict = tfs('documents/', filenames, word)
    103 
    104         tfidf = {}
    105 
    106         for filename in filenames:
    107             tfidf[filename] = idf_dict[word] * tf_dict[filename]
    108 
    109         sorted_items = sorted(tfidf.items(), key=lambda kv: (kv[1], kv[0]))
    110         sorted_items.reverse()
    111 
    112         for i in range(top_k):
    113             print(sorted_items[i])
    114 ```
    115 
    116 ## TF-IDF For Feature Extraction
    117 
    118 There is some literature about using TF-IDF for feature extraction. In particular, existing approaches have used TF-IDF with [SVMs](SVM.md) and (separately) [Naive Bayes](NaiveBayes.md) for text classification tasks, first computing features with TF-IDF, and then passing the processed samples into the subsequent classifier. That said, the literature appears sparse.