information-retrieval

Exploration of information retrieval topics
git clone git://git.laack.co/information-retrieval.git

README.md (2886B)


# Indexing

The indexer reads entries from the indexing queue and indexes each one.

## Guarantees

- Every filepath / URL pair added to the indexing queue will start the indexing process at least once.
    - NOTE: An early pruning step may still keep the page from appearing in searches.
        - It removes non-English documents and documents that are too short once reduced to plaintext.
    - This is achieved with a pending status that unlocks once enough time has passed since the entry was claimed.
    - Correct ordering may not be guaranteed, because there may be additional priorities.
        If the same URL is added with different filepaths, do not assume which one will be indexed.
- All information required for the search engine to function will be stored in the indexing database.
    - No necessary data, including snippets, will be stored on the filesystem.
    - Where things like PageRank will live is still under consideration, so those calculations may not be
        part of indexing, but the information required to compute them will reside in the indexing database.
- Old indexed data for a page will be removed from the database in the same transaction that adds the page's new data.
- Old indexed data may be removed at any time, for any reason.
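
The pending-status guarantee above could be sketched roughly as follows with Python's built-in `sqlite3` module. This is only a sketch under assumptions: the README does not name a database engine or any status values other than pending, so the `'queued'` status, the `CLAIM_TIMEOUT_SECONDS` value, and the `open_queue` / `claim_next` helpers are hypothetical. Only the column names come from the schema.

```python
import sqlite3

# Assumption: a pending claim older than this many seconds may be re-claimed.
CLAIM_TIMEOUT_SECONDS = 600


def open_queue():
    """Create an in-memory queue table using the columns from the schema."""
    # isolation_level=None puts sqlite3 in autocommit mode, so the code
    # below can issue BEGIN / COMMIT explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        """
        CREATE TABLE indexing_queue (
            id INTEGER PRIMARY KEY,
            url TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'queued',  -- 'queued' is hypothetical
            claimed_at REAL,
            filepath TEXT NOT NULL,
            creation_timestamp REAL NOT NULL
        )
        """
    )
    return conn


def claim_next(conn, now):
    """Claim the oldest claimable entry; return (id, url, filepath) or None."""
    # BEGIN IMMEDIATE takes the write lock up front, so two workers cannot
    # both select the same row before either one marks it pending.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            """
            SELECT id, url, filepath FROM indexing_queue
            WHERE status = 'queued'
               OR (status = 'pending' AND claimed_at <= ?)
            ORDER BY id
            LIMIT 1
            """,
            (now - CLAIM_TIMEOUT_SECONDS,),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE indexing_queue SET status = 'pending', claimed_at = ? "
                "WHERE id = ?",
                (now, row[0]),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row
```

In this sketch, an entry whose worker crashes stays pending and becomes claimable again after the timeout, which is what gives "at least once". A worker that finishes would move the entry to some final status (not shown) so it is never picked up again.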

## Metrics

- BM25
    - Score contribution of query term q_i for document d:
        - IDF(q_i) * (occurrences in d * (k_1 + 1)) / (occurrences in d + k_1 * (1 - b + b * (terms in d / average document length)))
            - k_1 and b are hyperparameters, commonly set to:
                - k_1 \in [1.2, 2.0]
                - b = 0.75
        - Given the above, we must be able to determine the number of occurrences of the term in a document, the total number of terms in that document, and the average document length.
        - We also need the IDF, which requires the total number of documents and the number of documents containing the term.
- Page rankings
- Domain rankings
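
As a rough illustration of the BM25 bullet above, the per-term score can be computed from exactly the quantities listed there. The function and parameter names are hypothetical (chosen to echo the schema columns), and the IDF shown is one common smoothed variant; the README does not say which IDF form the project uses.

```python
import math


def idf(num_documents, document_count):
    """Smoothed IDF (an assumed variant, not specified by the README).

    num_documents:  total number of documents in the collection
    document_count: number of documents containing the term
    """
    return math.log(
        1 + (num_documents - document_count + 0.5) / (document_count + 0.5)
    )


def bm25_term_score(tf, term_count, average_document_length,
                    num_documents, document_count, k1=1.2, b=0.75):
    """BM25 contribution of a single query term for a single document.

    tf:         occurrences of the term in the document
    term_count: total number of terms in the document
    """
    # Length normalisation: 1.0 for an average-length document,
    # larger for longer documents, smaller for shorter ones.
    length_norm = 1 - b + b * (term_count / average_document_length)
    return (
        idf(num_documents, document_count)
        * (tf * (k1 + 1))
        / (tf + k1 * length_norm)
    )
```

A full query score would be the sum of `bm25_term_score` over the query's terms. Note that the sketch assumes `average_document_length` is non-zero.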

## Schema

(Consider how to rank domains / pages by quality beyond BM25 and such.)
(Consider how to include search information like paragraphs, titles, etc.)

- indexing_queue(id, url, status, claimed_at, filepath, creation_timestamp)
- page(url, language, url_term_count, term_count, last_updated_timestamp) -- currently only supports English
- document_term(term, url, tf, positional_postings)
- url_term(term, url, tf, positional_postings)
- title_term(term, url, tf, positional_postings) -- not in use currently
- link(source, destination)
- term(name, document_count) -- should document_count be computed instead of stored?
                             -- should we also add url_count for URL counts here?
- collection(num_documents, average_document_length, average_url_length)
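
To make the "same transaction" guarantee concrete against this schema, here is a minimal sketch using Python's built-in `sqlite3` module. It is an assumption-laden illustration, not the project's implementation: the database engine, the column types, the JSON encoding of `positional_postings`, and the `open_index` / `reindex_page` helpers are all hypothetical, and only a subset of the tables and columns is modelled. Because the delete and the inserts share one transaction, a reader should see either the old rows or the new rows for a page, never a mixture.

```python
import json
import sqlite3
import time


def open_index():
    """Create an in-memory database with a subset of the schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            url TEXT PRIMARY KEY,
            language TEXT NOT NULL,
            term_count INTEGER NOT NULL,
            last_updated_timestamp REAL NOT NULL
        );
        CREATE TABLE document_term (
            term TEXT NOT NULL,
            url TEXT NOT NULL,
            tf INTEGER NOT NULL,
            positional_postings TEXT NOT NULL,  -- JSON list (assumption)
            PRIMARY KEY (term, url)
        );
        """
    )
    return conn


def reindex_page(conn, url, tokens, language="en"):
    """Replace all indexed data for `url` with data derived from `tokens`."""
    # Group token positions by term; tf is the number of positions.
    positions = {}
    for i, token in enumerate(tokens):
        positions.setdefault(token, []).append(i)

    # `with conn` commits if the block succeeds and rolls back on any
    # exception, so the old rows are only removed if the new ones land.
    with conn:
        conn.execute("DELETE FROM document_term WHERE url = ?", (url,))
        conn.execute(
            "INSERT OR REPLACE INTO page "
            "(url, language, term_count, last_updated_timestamp) "
            "VALUES (?, ?, ?, ?)",
            (url, language, len(tokens), time.time()),
        )
        conn.executemany(
            "INSERT INTO document_term (term, url, tf, positional_postings) "
            "VALUES (?, ?, ?, ?)",
            [(t, url, len(p), json.dumps(p)) for t, p in positions.items()],
        )
```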

---

TODO: Add snippets and such to tables to ensure querying works correctly
TODO: (should I) compute IDFs for title vs URL vs body (?)