TODO.md (1332B)
1. crawler (A crawler bg job is responsible for cleaning up old files, even if they exist in the indexing_queue; crawl_cache is ephemeral)
   - pulls link from queue
   - pulls data
   - saves to disk
   - extracts links to crawl next
   - adds file to indexing queue

2. indexer (NOTE: The indexer never changes crawl_cache data, even after using all of the information it needs)
   - if the file exists on disk, loads it into memory / copies it to a persistent location for manipulation
   - parses data and performs calculations for scoring
   - deletes existing indexed data for the current url from the db and, in the same transaction,
     atomically writes all data needed to support search engine functionality to the db, including snippets
     (NOTE: The old data should still be deleted in certain circumstances even if there is nothing to insert)

3. search engine
   - ranks search results using indexed snippets and scoring

---

- How to do incremental updates of idf over time?
  - should this just be derived by querying across all terms on some timescale?
  - that would be easier since all the data is known, and it really isn't that bad because idf is more of a suggestion than anything else;
    as long as it updates over time, that should suffice (take advantage of new information)
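
A minimal sketch of two of the ideas above: the indexer's delete-then-write step in one transaction, and idf derived by querying the index instead of being updated incrementally. It assumes SQLite through Python's standard `sqlite3` module. The table names (`postings`, `snippets`), the function names (`open_index`, `reindex`, `compute_idf`), and the smoothed idf formula are all hypothetical choices for illustration, not decisions recorded in these notes.

```python
import math
import sqlite3

# Hypothetical schema: one row per (url, term) pair, plus one snippet per url.
SCHEMA = """
CREATE TABLE IF NOT EXISTS postings (
    url  TEXT NOT NULL,
    term TEXT NOT NULL,
    tf   INTEGER NOT NULL,
    PRIMARY KEY (url, term)
);
CREATE TABLE IF NOT EXISTS snippets (
    url     TEXT PRIMARY KEY,
    snippet TEXT NOT NULL
);
"""


def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def reindex(conn, url, term_counts, snippet=None):
    """Replace everything indexed for `url` in a single transaction.

    An empty `term_counts` with no snippet only deletes the old rows, which
    covers the "nothing to insert" case from the notes.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM postings WHERE url = ?", (url,))
        conn.execute("DELETE FROM snippets WHERE url = ?", (url,))
        conn.executemany(
            "INSERT INTO postings (url, term, tf) VALUES (?, ?, ?)",
            [(url, term, tf) for term, tf in term_counts.items()],
        )
        if snippet is not None:
            conn.execute(
                "INSERT INTO snippets (url, snippet) VALUES (?, ?)",
                (url, snippet),
            )


def compute_idf(conn):
    """Derive idf for every term from whatever is currently indexed."""
    (n_docs,) = conn.execute(
        "SELECT COUNT(DISTINCT url) FROM postings"
    ).fetchone()
    rows = conn.execute("SELECT term, COUNT(*) FROM postings GROUP BY term")
    # Smoothed variant (an assumption): log((N + 1) / (df + 1)) + 1,
    # which keeps every weight positive even for terms in all documents.
    return {
        term: math.log((n_docs + 1) / (df + 1)) + 1.0 for term, df in rows
    }
```

Running `compute_idf` from a periodic background job and caching the result would give the "updates over time" behaviour without tracking idf incrementally; how often to run it is left open here, as it is in the notes.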