- add language information somewhere
- should we have a term count table for documents?
- tf indexing
  - for each document go through each term within it and calculate the tf value directly
  - we then save something to this table:
    - tf(document-path, term, value)
    - indexed: document-path and term
      - combined index?
        - probably not
        - in general we are interested in the terms being indexed, but it also probably makes sense to lookup documents too for derived values.
- add linking support?
  - how to do this?
    - key value db where keys are urls and values are outlinks?
- language detection
  - this should be included as part of the sites lookup
  - or should this be its own db per lang?
- should this be authority based
  - on one hand, I hate centralization and authority
  - on the other hand, is there any way around this if there are llms online
  - should we just do some sort of llm prediction?
    - rank domains based on llm suspicion
    - distance from authority
- ensure pruning logic is used during spider crawling so we don't write useless stuff to begin with
- update deletion of documents to also update the db
  - this will be used to apply rules backwards
- improve pruning
  - currently just based on length, but I could see information content being useful too
    - since we have words, the information content could be computed based on word frequency
      - actually, that'd be a bit different because we don't have word frequency globally

---

- update idf to be incrementally calculated
  - constantly updating as things change in the corpus
  - hmm, what about when we remove old stuff though? that stuff will throw off the stats more and more over time...
    - might not want this to be incremental after all...
- indexing should include adding language
- add centralized indexing
  - I added an indexed field to support this idea for incremental indexing
- smarter queueing
  - url lookup table
    - fixes some of the memory issues
- ensure pruning prior to writing
  - should there be a penalty for this? probably
- Make everything incremental
  - this is getting too slow to do tf / idf on everything...
  - I probably don't want to do indexing and crawling at the same time for perf sake so there should be a field added to the site table for 'indexed'.

- How to do eviction of old urls / sites that are queried?
  - this might be early, but this can balloon quickly (100s of GB added every few days)

- forward / backlink calculation
  - maybe after crawling as these are derived / indexing type values

---

- improve priority assignment
  - how to do this?
    - rank domains based on importance
      - how?
    - then use ranking to assign priority
    - there should be some temporal priority too so maybe it is based on date/time inversion stuff

- support sitemap .xml
  - this should more fully search specific domains
  - these should (probably) take priority over other links
  - example: https://www.google.com/chromebook/sitemap.xml
  - we should treat these differently because they don't have href stuff
    - regexp it is
- something interesting would be link counts
  - the more a page is linked to, the better
- respect:
  - noindex and nofollow
  - https://en.wikipedia.org/wiki/Web_crawler
- what I want
  - I want to crawl a smaller subset of the internet that is useful
    - I get lots of stuff from audible which is kind of useless in terms of importance
  - could we do some sort of ranking per domain for average IC across documents?
    - maybe, but would that just prioritize random garbage?
      - probably....
  - maybe prioritize longer documents
    - that seems like it could be easily gamed, but also this is just the spider

---

- add regexp for url filtering
  - don't use stuff with parameters (or logins maybe?)
    - https://www.oed.com/shibboleth-login-redirect?returnUrl=https://oup-sp.sams-sigma.com/Shibboleth.sso/Login?SAMLDS%3D1%26target%3Dss%253Amem%253A292bfd0634875a2fa6b2ffc899913c45dca16f346eb3fd65a9b1269d9c16a659&entityID=https://idp.plymouthart.ac.uk/shibboleth,

---

next selection where C is the corpus:

Where C_i,b is the backlinks of the i'th element
Where C_i,t is the seconds since being added to the corpus
(uncertain about this one) Where C_i,d is the distance from a seed authority (beta being a negative hyperparameter)

s(C) = max_i( C_i,b * alpha + C_i,t * gamma + z + C_i,d * beta )

---


Other approaches:

- breadth first
  - The explanation given by the authors for this result is that "the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates."
  - this makes sense as a starting point, but then we have to consider other things too... because otherwise we never reindex things???
    - or do we expect circularity to fix that issue?

---

- languages
  - this is really important so put at top of list
  - we want to calculate and add a column to the site table for the language of each site
  - we then want these joined together for search querying

---

- sites should be prioritized by language as well...
  - maybe, we don't actually know the language until we query it
    - could use a heuristic based on subdomain

---

- lang detect is kind of weak
  - these are english:
    - https://hi.wikipedia.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%AA%E0%A5%80%E0%A4%A1%E0%A4%BF%E0%A4%AF%E0%A4%BE:%E0%A4%AC%E0%A5%89%E0%A4%9F/%E0%A4%85%E0%A4%A8%E0%A5%81%E0%A4%AE%E0%A5%8B%E0%A4%A6%E0%A4%A8_%E0%A4%B9%E0%A5%87%E0%A4%A4%E0%A5%81_%E0%A4%85%E0%A4%A8%E0%A5%81%E0%A4%B0%E0%A5%8B%E0%A4%A7
    - https://www.mediawiki.org/wiki/ORES/ru
  - maybe add more stringent requirements about language percentage?
    - seems sus, but still...
  - there seems to be some html stuff related to content language
    - https://stackoverflow.com/questions/6157485/what-are-content-language-and-accept-language

SOLVED:

- Made current version the fallback and added an html check for content type
  - should probably ding scores if they use the fallback because that is not to spec

---

- this sucks for word association
  - it finds the words on pages, but not in the right locations
  - I think this is why you have index values and ordering stuff

---

- incremental updating for some stuff
  - sites that haven't been reindexed can remain the same
    - they wouldn't add to the term list anyways, at present anyways...

---

- track forward and backward links better
  - implement pagerank like system

---

- uprank domain names
  - this is what people often search for...
  - why do they rank openai so highly?
    - is there some sort of stock / valuation manifest I could use for this / I haven't indexed yet?
    - news sites?

"
As page-rank is described in the original article, and in the wikipedia article, it is indeed not defined when out-degree(v)=0 for some v, since you get P(v,u)=d/n+(1-d)*0/0 - which is undefined

A node that has no outgoing edge is called a dangling node and there are basically 3 common ways to take care of them:

1. Eliminate such nodes from the graph (and repeat the process iteratively until there are no dangling nodes).
2. Consider those pages to link back to the pages that linked to them (i.e. - for each edge (u,v), if out-degree(v) = 0, regard (v,u) as an edge).
3. Link the dangling node to all pages (including itself usually), and effectively make the probability for random jump from this node 1.

About a page with no incoming node - that shouldn't be an issue because everything is perfectly defined. Such a node will have a page rank of exactly d/n - because you can only get to it by random surfing from any node - and that's the probability to be in it.

Hope that answered your question!
"

---

Improve indexing

- I like most of what I have right now, but when a site is recrawled, the old version should be deleted.
  - this should help with headaches that require ordering by date / timestamp, and it should make page rank easier as each site will be unique.
  - this also means that we only care about url and not the filepath location, except in cases where we are doing something to the cached data.
- how would this impact my search functionality?
  - unclear
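
---

Rough sketch of the pagerank idea, using the third dangling-node option from the quote above (a page with no outlinks is treated as linking to every page). Everything here is an assumption for illustration, not existing code: the function name `pagerank`, the `outlinks` dict (urls mapped to lists of outlink urls, like the key value db idea earlier in these notes), the default `d=0.15`, and the fixed iteration count. `d` follows the quote's convention, so it is the random-jump probability from the `d/n` term.

```python
def pagerank(outlinks, d=0.15, iterations=50):
    """Power-iteration PageRank over a dict of url -> list of outlink urls.

    d is the random-jump probability (the d/n term in the quoted answer).
    """
    # Collect every url that appears as a source or as a link target.
    nodes = set(outlinks)
    for targets in outlinks.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    if n == 0:
        return {}

    # Start from a uniform distribution.
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Dangling nodes (no outlinks) spread their rank evenly over all pages.
        dangling = sum(rank[u] for u in nodes if not outlinks.get(u))
        new_rank = {u: d / n + (1 - d) * dangling / n for u in nodes}
        for u in nodes:
            targets = outlinks.get(u)
            if not targets:
                continue
            # Each outlink gets an equal share of this page's rank.
            share = (1 - d) * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical tiny graph: "c" is a dangling node, "d" has no backlinks.
    graph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["c"]}
    for url, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(url, round(score, 4))
```

Since dangling pages hand their rank to everyone, the scores keep summing to 1, which should make them comparable across recrawls. Note that with this option a page with no backlinks ends up slightly above d/n, because it also receives a share from the dangling pages.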