information-retrieval

Exploration of information retrieval topics
git clone git://git.laack.co/information-retrieval.git

TODO.md (8612B)


      1 - add language information somewhere
      2 - should we have a term count table for documents?
      3 - tf indexing
      4     - for each document go through each term within it and calculate the tf value directly
      5         - we then save something to this table:
      6             - tf(document-path, term, value)
      7                 - indexed: document-path and term
      8                     - combined index?
      9                         - probably not
     10                             - in general we are interested in the terms being indexed, but it also probably makes sense to lookup documents too for derived values.
     11 - add linking support?
     12     - how to do this?
     13         - key value db where keys are urls and values are outlinks?
     14 - language detection
     15     - this should be included as part of the sites lookup
     16         - or should this be its own db per lang?
     17 - should this be authority based?
     18     - on one hand, I hate centralization and authority
     19     - on the other hand, is there any way around this if there are llms online
     20     - should we just do some sort of llm prediction?
     21         - rank domains based on llm suspicion
     22 - distance from authority
     23 - ensure pruning logic is used during spider crawling so we don't write useless stuff to begin with
     24 - update deletion of documents to also update the db
     25     - this will be used to apply rules backwards
     26 - improve pruning
     27     - currently just based on length, but I could see information content being useful too
     28     - since we have words, the information content could be computed based on word frequency
     29         - actually, that'd be a bit different because we don't have word frequency globally
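For the tf indexing item near the top of this list, a minimal sketch of the per-document pass might look like the following. The whitespace tokenizer, the choice of relative frequency as the tf value, and the name `tf_rows` are all assumptions for illustration; the notes only fix the row shape tf(document-path, term, value).

```python
from collections import Counter


def tf_rows(document_path, text):
    """Return (document-path, term, value) rows for one document.

    Assumption: tf is the term count divided by the document's total
    term count, and terms are lowercase whitespace-separated tokens.
    """
    terms = text.lower().split()
    counts = Counter(terms)
    total = len(terms)
    # An empty document yields no rows, so there is no division by zero.
    return [(document_path, term, count / total) for term, count in counts.items()]


rows = tf_rows("docs/a.txt", "the cat saw the dog")
```

Each row maps directly onto one insert into the tf table, with separate indexes on document-path and term as discussed above.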
     30 
     31 
     32 ---
     33 
     34 - update idf to be incrementally calculated
     35     - constantly updating as things change in the corpus
     36         - hmm, what about when we remove old stuff though? that stuff will throw off the stats more and more over time...
     37         - might not want this to be incremental after all...
     38 - indexing should include adding language
     39 - add centralized indexing
     40     - i added an indexed field to support this idea for incremental indexing
     41 - smarter queueing
     42 - url lookup table
     43     - fixes some of the memory issues
     44 - ensure pruning prior to writing
     45     - should there be a penalty for this? probably
     46 - Make everything incremental
     47     - this is getting too slow to do tf / idf on everything...
     48         - I probably don't want to do indexing and crawling at the same time for perf sake so there should be a field added to the site table for 'indexed'.
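One possible way around the removal worry in the idf item above is to store document-frequency counts rather than idf values, so that deleting a document is just a decrement and idf is derived at query time. This is only a sketch: the class name `DocumentFrequencies` and the plain log(N / df) formula are assumptions, not what the repo does.

```python
import math
from collections import Counter


class DocumentFrequencies:
    """Tracks how many documents contain each term; idf is derived on demand."""

    def __init__(self):
        self.doc_count = 0
        self.df = Counter()

    def add(self, terms):
        # Each document contributes at most 1 to a term's document frequency.
        self.doc_count += 1
        self.df.update(set(terms))

    def remove(self, terms):
        # Must be called with the same terms that were added for this document.
        self.doc_count -= 1
        for term in set(terms):
            self.df[term] -= 1
            if self.df[term] <= 0:
                del self.df[term]

    def idf(self, term):
        # Assumption: plain log(N / df); terms not in the corpus get 0.0.
        df = self.df.get(term, 0)
        return math.log(self.doc_count / df) if df else 0.0
```

With this layout the stats do not drift when old documents are evicted, at the cost of computing the logarithm per lookup instead of reading a stored value.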
     49 
     50 - How to do eviction of old urls / sites that are queried?
     51     - this might be early, but this can balloon quickly (100s of GB added every few days)
     52 
     53 - forward / backlink calculation
     54     - maybe after crawling as these are derived / indexing type values
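If outlinks are stored per url, backlinks can be derived after crawling by inverting that mapping. A rough sketch, assuming the outlinks are available as a plain dict (the function name `invert_links` is hypothetical):

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {url: [outlink, ...]} into {url: {backlink, ...}}."""
    backlinks = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            if target != source:  # ignore self-links
                backlinks[target].add(source)
    return dict(backlinks)


links = invert_links({"a": ["b", "c"], "b": ["c"], "c": ["c"]})
```

The size of each backlink set is the link count per page, so the same pass could feed both values.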
     55 
     56 ---
     57 
     58 - improve priority assignment
     59     - how to do this?
     60         - rank domains based on importance
     61             - how?
     62         - then use ranking to assign priority
     63             - there should be some temporal priority too so maybe it is based on date/time inversion stuff
     64 
     65 - support sitemap.xml
     66     - this should more fully search specific domains
     67         - these should (probably) take priority over other links
     68             - example: https://www.google.com/chromebook/sitemap.xml
     69                 - we should treat these differently because they don't have href stuffs
     70                     - regexp it is
     71 - something interesting would be link counts
     72     - the more a page is linked to, the better
     73 - respect:
     74     - noindex and nofollow
     75         - https://en.wikipedia.org/wiki/Web_crawler
     76 - what I want
     77     - I want to crawl a smaller subset of the internet that is useful
     78         - I get lots of stuff from audible which is kind of useless in terms of importance
     79             - could we do some sort of ranking per domain for average IC across documents?
     80                 - maybe, but would that just prioritize random garbage?
     81                     - probably....
     82                 - maybe prioritize longer documents
     83                     - that seems like it could be easily gamed, but also this is just the spider
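For the sitemap item above, the regexp route could be as small as pulling out the `<loc>` entries. This is a sketch under the assumption that the sitemap is already downloaded as text; it does not handle sitemap index files or gzip, and `sitemap_urls` is a made-up name.

```python
import re
from html import unescape

# Sitemaps list pages in <loc> elements rather than href attributes.
LOC_PATTERN = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.IGNORECASE | re.DOTALL)


def sitemap_urls(xml_text):
    """Return the urls listed in <loc> elements, with XML entities decoded."""
    return [unescape(match) for match in LOC_PATTERN.findall(xml_text)]


example = """
<urlset>
  <url><loc>https://example.com/a</loc></url>
  <url><loc> https://example.com/b?x=1&amp;y=2 </loc></url>
</urlset>
"""
urls = sitemap_urls(example)
```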
     84 
     85 ---
     86 
     87 - add regexp for url filtering
     88     - don't use stuff with parameters (or logins maybe?)
     89     - https://www.oed.com/shibboleth-login-redirect?returnUrl=https://oup-sp.sams-sigma.com/Shibboleth.sso/Login?SAMLDS%3D1%26target%3Dss%253Amem%253A292bfd0634875a2fa6b2ffc899913c45dca16f346eb3fd65a9b1269d9c16a659&entityID=https://idp.plymouthart.ac.uk/shibboleth,
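A first pass at the filter could combine a query-string check with a small regexp for login-style paths. The pattern list here is a guess at what counts as a login url, and `should_crawl` is a hypothetical helper, so treat this as a starting point only.

```python
import re
from urllib.parse import urlsplit

# Assumption: these path fragments are good enough markers for login pages.
LOGIN_PATTERN = re.compile(r"login|signin|sign-in|shibboleth", re.IGNORECASE)


def should_crawl(url):
    """Reject non-http urls, urls with parameters, and login-looking paths."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    if parts.query:
        return False
    return LOGIN_PATTERN.search(parts.path) is None
```

Dropping every url with a query string is aggressive and will skip some real pages, but it matches the "don't use stuff with parameters" rule as written.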
     90 
     91 ---
     92 
     93 next selection where C is the corpus:
     94 
     95 Where C_i,b is the backlinks of the i'th element
     96 Where C_i,t is the seconds since being added to the corpus
     97 (uncertain about this one) Where C_i,d is the distance from a seed authority (beta being a negative hyperparameter)
     98 
     99 s(C) = max_i( C_i,b * alpha + C_i,t * gamma + z + C_i,d * beta )
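The selection rule above could be sketched as follows, reading the max as "pick the element with the highest score". The dict keys, the default hyperparameter values, and the treatment of z as a plain constant are all placeholders; the notes do not define z or fix any of the weights.

```python
def score(backlinks, seconds, distance, alpha, gamma, beta, z):
    """Score one corpus element: C_i,b * alpha + C_i,t * gamma + z + C_i,d * beta."""
    return backlinks * alpha + seconds * gamma + z + distance * beta


def next_selection(corpus, alpha=1.0, gamma=0.001, beta=-0.5, z=0.0):
    """corpus: list of dicts with keys url, backlinks, seconds, distance."""
    return max(
        corpus,
        key=lambda c: score(c["backlinks"], c["seconds"], c["distance"], alpha, gamma, beta, z),
    )


corpus = [
    {"url": "a", "backlinks": 10, "seconds": 100, "distance": 2},
    {"url": "b", "backlinks": 3, "seconds": 5000, "distance": 1},
    {"url": "c", "backlinks": 0, "seconds": 20000, "distance": 5},
]
chosen = next_selection(corpus)
```

Because z is added to every element equally, it does not change which element wins; it only matters if scores are compared against a threshold elsewhere.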
    100 
    101 ---
    102 
    103 
    104 Other approaches:
    105 
    106 - breadth first
    107     - The explanation given by the authors for this result is that "the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates."
    108         - this makes sense as a starting point, but then we have to consider other things too... because otherwise we never reindex things???
    109             - or do we expect circularity to fix that issue?
    110 
    111 ---
    112 
    113 - languages
    114     - this is really important so put at top of list
    115     - we want to calculate and add a column to the site table for the language of each site
    116     - we then want these joined together for search querying
    117 
    118 ---
    119 
    120 - sites should be prioritized by language as well...
    121     - maybe, we don't actually know the language until we query it
    122         - could use a heuristic based on subdomain
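The subdomain heuristic might look like this. The set of language codes is a tiny illustrative sample, and the result should only be treated as a prior for queueing, since a subdomain does not guarantee the language of the page behind it.

```python
from urllib.parse import urlsplit

# Assumption: a small sample of language codes; a real list would be longer.
KNOWN_LANGUAGE_CODES = {"en", "de", "fr", "es", "hi", "ru", "ja", "zh"}


def guess_language_from_host(url):
    """Guess a language from a leading subdomain such as hi.wikipedia.org."""
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    # Require at least sub.domain.tld so a bare domain like "de.org" is not matched.
    if len(labels) >= 3 and labels[0] in KNOWN_LANGUAGE_CODES:
        return labels[0]
    return None
```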
    123 
    124 ---
    125 
    126 - lang detect is kind of weak
    127     - these are english:
    128         - https://hi.wikipedia.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%AA%E0%A5%80%E0%A4%A1%E0%A4%BF%E0%A4%AF%E0%A4%BE:%E0%A4%AC%E0%A5%89%E0%A4%9F/%E0%A4%85%E0%A4%A8%E0%A5%81%E0%A4%AE%E0%A5%8B%E0%A4%A6%E0%A4%A8_%E0%A4%B9%E0%A5%87%E0%A4%A4%E0%A5%81_%E0%A4%85%E0%A4%A8%E0%A5%81%E0%A4%B0%E0%A5%8B%E0%A4%A7
    129         - https://www.mediawiki.org/wiki/ORES/ru
    130 - maybe add more stringent requirements about language percentage?
    131     - seems sus, but still...
    132 - there seems to be some html stuff related to content language
    133     - https://stackoverflow.com/questions/6157485/what-are-content-language-and-accept-language
    134 
    135 SOLVED:
    136 
    137 - Made current version the fallback and added an html check for content type
    138     - should probably ding scores if they use the fallback because that is not to spec
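A sketch of what the html check plus fallback could look like, assuming the check reads the `lang` attribute on `<html>` or a `content-language` meta tag. The names `LanguageTagParser` and `declared_language` are made up here, and `fallback_detector` stands in for the existing detector.

```python
from html.parser import HTMLParser


class LanguageTagParser(HTMLParser):
    """Records the first language declared via <html lang> or a content-language meta tag."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        if self.language is not None:
            return
        attributes = dict(attrs)
        if tag == "html" and attributes.get("lang"):
            self.language = attributes["lang"].strip().lower()
        elif tag == "meta" and (attributes.get("http-equiv") or "").lower() == "content-language":
            content = (attributes.get("content") or "").strip().lower()
            if content:
                self.language = content


def declared_language(html_text, fallback_detector):
    """Return (language, used_fallback) so callers can ding fallback results."""
    parser = LanguageTagParser()
    parser.feed(html_text)
    if parser.language:
        return parser.language, False
    return fallback_detector(html_text), True
```

Returning the `used_fallback` flag alongside the language keeps the scoring penalty decision out of the detection code.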
    139 
    140 ---
    141 
    142 - this sucks for word association
    143     - it finds the words on pages, but not in the right locations
    144     - I think this is why you have index values and ordering stuff
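Storing term positions is the usual way to get at word order. A small sketch of a positional index with an exact-phrase lookup, assuming whitespace tokenization and in-memory dicts (both function names are hypothetical):

```python
from collections import defaultdict


def build_positional_index(documents):
    """documents: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_matches(index, phrase):
    """Return doc ids where the phrase's terms appear consecutively, in order."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    matches = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Every later term must sit exactly `offset` positions after the first.
            if all(start + offset in index[term].get(doc_id, ()) for offset, term in enumerate(terms)):
                matches.add(doc_id)
                break
    return matches


index = build_positional_index({"a": "new york city", "b": "york is new"})
```

Both documents contain "new" and "york", but only the first has them adjacent and in order, which is the distinction a plain term index cannot make.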
    145 
    146 ---
    147 
    148 - incremental updating for some stuff
    149     - sites that haven't been reindexed can remain the same
    150         - they wouldn't add to the term list anyway, at least at present...
    151 
    152 ---
    153 
    154 - track forward and backward links better
    155     - implement a PageRank-like system
    156 
    157 ---
    158 
    159 - uprank domain names
    160     - this is what people often search for...
    161 - why do they rank openai so highly?
    162     - is there some sort of stock / valuation manifest I could use for this / I haven't indexed yet?
    163         - news sites?
    164 
    165 "
    166 As page-rank is described in the original article, and in the wikipedia article, it is indeed not defined when out-degree(v)=0 for some v, since you get P(v,u)=d/n+(1-d)*0/0 - which is undefined
    167 
    168 A node that has no outgoing edge is called a dangling node and there are basically 3 common ways to take care of them:
    169 
    170 1. Eliminate such nodes from the graph (and repeat the process iteratively until there are no dangling nodes).
    171 2. Consider those pages to link back to the pages that linked to them (i.e. - for each edge (u,v), if out-degree(v) = 0, regard (v,u) as an edge).
    172 3. Link the dangling node to all pages (including itself usually), and effectively make the probability for random jump from this node 1.
    173 About a page with no incoming node - that shouldn't be an issue because everything is perfectly defined. Such a node will have a page rank of exactly d/n - because you can only get to it by random surfing from any node - and that's the probability to be in it.
    174 
    175 Hope that answered your question!
    176 "
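A sketch of the third option from the quote (dangling nodes link to every page), written as a plain power iteration. The parameter `jump` plays the role of d in the quote, i.e. the random-jump probability; its default value and the fixed iteration count are arbitrary choices for illustration.

```python
def pagerank(outlinks, jump=0.15, iterations=50):
    """outlinks: {page: [linked pages]}. Returns {page: rank}; ranks sum to 1."""
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by dangling nodes is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not outlinks.get(p))
        new_rank = {p: jump / n + (1 - jump) * dangling / n for p in pages}
        for page, targets in outlinks.items():
            if targets:
                share = (1 - jump) * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["b"]})
```

In this example "a" has no incoming links and the graph has no dangling nodes, so its rank comes out as jump / n, matching the last point in the quote.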
    177 
    178 ---
    179 
    180 Improve indexing
    181 
    182 - I like most of what I have right now, but when a site is recrawled, the old version should be deleted. 
    183     - this should help with headaches that require ordering by date / timestamp, and it should make page rank easier as each site will be unique. 
    184         - this also means that we only care about url and not the filepath location, except in cases where we are doing something to the cached data.
    185 - how would this impact my search functionality?
    186     - unclear
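One way to keep each url unique on recrawl is to replace its row inside a single transaction. The table layout below is an assumption: only the site table and its 'indexed' field are mentioned in these notes, while the `filepath` and `crawled_at` columns and the `save_crawl` helper are invented for the sketch.

```python
import sqlite3


def save_crawl(connection, url, filepath, crawled_at):
    """Replace any previous row for this url and mark it as not yet indexed."""
    with connection:  # commits on success, rolls back on error
        connection.execute("DELETE FROM site WHERE url = ?", (url,))
        connection.execute(
            "INSERT INTO site (url, filepath, crawled_at, indexed) VALUES (?, ?, ?, 0)",
            (url, filepath, crawled_at),
        )


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE site (url TEXT PRIMARY KEY, filepath TEXT, crawled_at INTEGER, indexed INTEGER)"
)
save_crawl(connection, "https://example.com/", "cache/1.html", 100)
save_crawl(connection, "https://example.com/", "cache/2.html", 200)
```

This only covers the db row; removing the old cached file and any tf rows derived from it would have to happen in the same step for search results to stay consistent.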