README.md (2886B)
# Indexing

The indexer reads entries from the indexing queue and indexes them.

## Guarantees

- Every filepath / url pair added to the indexing queue will start the indexing process at least once
  - NOTE: There is an early pruning process that may result in a page not appearing in searches.
    - This removes non-English documents and documents whose plaintext turns out to be short.
  - This is achieved with the pending status and by unlocking pending entries based on time passed
  - It is uncertain if we will guarantee correct ordering, because there may be additional priorities,
    so if the same url is added with different filepaths it shouldn't be assumed which one will be indexed.
- All information required for the search engine to function will be stored in the indexing database
  - No necessary data, including snippets, will be stored on the filesystem
  - There is still consideration for where things like pagerank will live, so those calculations may not be
    part of indexing, but the information required to do that will reside in the indexing db.
- Old indexed data will be removed from the database in the same transaction as new data being added for a given page
- Old indexed data may be removed at any time, for any reason.

## Metrics

- BM25
  - Calculation for query term q_i:
    - IDF(q_i) * (occurrences in document * (k_1 + 1)) / (occurrences in document + k_1 * (1 - b + b * (terms in d / average document length)))
  - k_1 and b are hyperparameters, often set to:
    - k_1 \in [1.2, 2.0]
    - b = 0.75
  - Given the above, we must be able to determine how many instances of the term there are in a document, the average document length, and the total number of terms in the document.
  - We also need the IDF, which requires the total number of documents and the number of documents containing the term
- Page rankings
- Domain rankings

## Schema

(Consider how to rank domains / pages by quality beyond bm25 and such)
(Consider how to include search information like paragraphs, titles, etc.)

- indexing_queue(id, url, status, claimed_at, filepath, creation_timestamp)
- page(url, language, url_term_count, term_count, last_updated_timestamp) -- currently only supports English
- document_term(term, url, tf, positional_postings)
- url_term(term, url, tf, positional_postings)
- title_term(term, url, tf, positional_postings) -- not in use currently
- link(source, destination)
- term(name, document_count) -- should this be a computed value instead of document_count?
  -- should we also add url_count for url counts here?
- collection(num_documents, average_document_length, average_url_length)

---

TODO: Add snippets and such to tables to ensure querying works correctly
TODO: (should I) compute IDFs for title vs url vs body (?)
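
As a rough illustration of the BM25 calculation described under Metrics, here is a minimal Python sketch. It is not the indexer's actual code. The function names `idf` and `bm25_term_score` are hypothetical, and the parameter names are borrowed from the schema columns (`tf`, `term_count`, `document_count`, `num_documents`, `average_document_length`) only for readability. The README does not say which IDF variant is intended, so the smoothed form `ln(1 + (N - n + 0.5) / (n + 0.5))` used here is an assumption.

```python
import math


def idf(num_documents: int, document_count: int) -> float:
    """Smoothed IDF (assumed variant; the README does not fix one).

    num_documents:  total number of documents in the collection
    document_count: number of documents containing the term
    """
    return math.log(
        1.0 + (num_documents - document_count + 0.5) / (document_count + 0.5)
    )


def bm25_term_score(
    tf: int,
    term_count: int,
    average_document_length: float,
    num_documents: int,
    document_count: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """BM25 contribution of a single query term for a single document.

    tf:         occurrences of the term in the document
    term_count: total number of terms in the document
    """
    if tf == 0 or average_document_length <= 0:
        return 0.0
    # Length normalisation: documents longer than average are penalised.
    length_norm = 1.0 - b + b * (term_count / average_document_length)
    return (
        idf(num_documents, document_count)
        * (tf * (k1 + 1.0))
        / (tf + k1 * length_norm)
    )
```

The score for a whole query would be the sum of `bm25_term_score` over its terms. Every input corresponds to a value listed in the Metrics section, which is why per-term frequencies, per-page term counts, per-term document counts, and collection-wide totals all need to be available from the indexing database.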