information-retrieval

Exploration of information retrieval topics
git clone git://git.laack.co/information-retrieval.git

TODO.md (1332B)


1. crawler (A crawler background job is responsible for cleaning up old files, even if they still exist in the indexing_queue; crawl_cache is ephemeral)
    - pulls a link from the queue
    - fetches the data
    - saves it to disk
    - extracts links to crawl next
    - adds the file to the indexing queue
2. indexer (NOTE: The indexer never changes crawl_cache data, even after it has used all of the information it needs)
    - if the file exists on disk, loads it into memory / copies it to a persistent location for manipulation
    - parses the data and performs calculations for scoring
    - deletes existing indexed data for the current url from the db and, in the same transaction,
        atomically writes all data needed to support search engine functionality, including snippets
        (NOTE: The old data should still be deleted in certain circumstances even if there is nothing to insert)
3. search engine
    - ranks search results using indexed snippets and scoring.

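A rough sketch of one crawler step and one indexer step as listed above, not the repo's actual code. Everything named here is an assumption for illustration: `crawl_one`, `index_one`, `cache_path`, `init_db`, the `postings` / `snippets` tables, the regex-based link and token extraction, and the injected `fetch` callable. It also assumes that "file missing from crawl_cache" is one of the circumstances where old data is deleted with nothing inserted. Call `crawl_one(crawl_queue, indexing_queue, cache_dir, fetch)` and then `index_one(indexing_queue, cache_dir, db)` in a loop to drive it.

```python
import hashlib
import re
import sqlite3
from collections import Counter, deque
from pathlib import Path
from typing import Callable

LINK_RE = re.compile(r'href="(https?://[^"]+)"')  # naive link extraction (assumption)
TAG_RE = re.compile(r"<[^>]+>")                   # naive tag stripping (assumption)
TOKEN_RE = re.compile(r"[a-z0-9]+")


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a url to its file in crawl_cache (hypothetical naming scheme)."""
    return cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")


def crawl_one(crawl_queue: deque, indexing_queue: deque, cache_dir: Path,
              fetch: Callable[[str], str]) -> None:
    url = crawl_queue.popleft()                                    # pull link from queue
    body = fetch(url)                                              # pull data
    cache_path(cache_dir, url).write_text(body, encoding="utf-8")  # save to disk
    crawl_queue.extend(LINK_RE.findall(body))                      # links to crawl next (no dedup here)
    indexing_queue.append(url)                                     # hand off to the indexer


def init_db(db: sqlite3.Connection) -> None:
    db.execute("CREATE TABLE IF NOT EXISTS postings (url TEXT, term TEXT, tf INTEGER)")
    db.execute("CREATE TABLE IF NOT EXISTS snippets (url TEXT PRIMARY KEY, snippet TEXT)")


def index_one(indexing_queue: deque, cache_dir: Path, db: sqlite3.Connection) -> None:
    url = indexing_queue.popleft()
    path = cache_path(cache_dir, url)
    # Read only: the indexer never modifies or deletes crawl_cache files.
    html = path.read_text(encoding="utf-8") if path.exists() else ""
    plain = " ".join(TAG_RE.sub(" ", html).split())
    counts = Counter(TOKEN_RE.findall(plain.lower()))
    with db:  # one transaction: commits on success, rolls back on exception
        db.execute("DELETE FROM postings WHERE url = ?", (url,))
        db.execute("DELETE FROM snippets WHERE url = ?", (url,))
        if counts:  # with nothing to insert, the old rows are still removed
            db.executemany(
                "INSERT INTO postings (url, term, tf) VALUES (?, ?, ?)",
                [(url, term, tf) for term, tf in counts.items()],
            )
            db.execute(
                "INSERT INTO snippets (url, snippet) VALUES (?, ?)",
                (url, plain[:160]),
            )
```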
---

- How to do incremental updates of idf over time?
    - should this just be derived by querying across all terms on some timescale?
        - that would be easier since all the data is known, and a slightly stale idf really isn't that bad because it is more of a suggestion than anything else;
            as long as it updates over time, that should suffice (take advantage of new information)
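
One possible shape for the "derive it by querying across all terms" option: rebuild idf from scratch on a schedule rather than updating it incrementally. This is only a sketch under assumptions. The `postings` and `idf` tables, their columns, and `recompute_idf` are hypothetical names, and the formula is the plain unsmoothed `log(N / df)`, where `N` is the number of indexed urls and `df` is the number of urls containing the term.

```python
import math
import sqlite3

# Hypothetical schema: one row per (url, term) pair, plus a derived idf table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS postings (url TEXT, term TEXT, tf INTEGER);
CREATE TABLE IF NOT EXISTS idf (term TEXT PRIMARY KEY, value REAL);
"""


def recompute_idf(db: sqlite3.Connection) -> None:
    """Rebuild the whole idf table from the current postings."""
    (n_docs,) = db.execute("SELECT COUNT(DISTINCT url) FROM postings").fetchone()
    rows = db.execute(
        "SELECT term, COUNT(DISTINCT url) FROM postings GROUP BY term"
    ).fetchall()
    # If there are no documents, rows is empty and nothing is divided.
    values = [(term, math.log(n_docs / df)) for term, df in rows]
    with db:  # swap old values for new ones in a single transaction
        db.execute("DELETE FROM idf")
        db.executemany("INSERT INTO idf (term, value) VALUES (?, ?)", values)
```

Run from a periodic background job, readers always see either the previous values or the new ones, never a half-written table. How often to run it is left open, as in the question above.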