information-retrieval

commit 0547119084bf9b2df319d703a9e30f116e940c66
parent 3fd57e78245f76e883bf38fd86c4b40be798b05a
Author: Andrew Laack <andrew.laack@imbue.com>
Date:   Fri,  2 Jan 2026 22:26:39 -0600

Leaving as is; starting fresh with postgres

Diffstat:
D TODO.md | 122 -------------------------------------------------------------------------------
D collection/spider.py | 250 -------------------------------------------------------------------------------
D indexing/lang-detect.py | 28 ----------------------------
D indexing/tf.py | 43 -------------------------------------------
D seeds/code.txt | 7 -------
A sqlite-tfidf/TODO.md | 186 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A sqlite-tfidf/collection/__pycache__/prune.cpython-313.pyc | 0
R collection/prune.py -> sqlite-tfidf/collection/prune.py | 0
A sqlite-tfidf/collection/spider.py | 256 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
R indexing/__init__.py -> sqlite-tfidf/indexing/__init__.py | 0
A sqlite-tfidf/indexing/__pycache__/__init__.cpython-313.pyc | 0
A sqlite-tfidf/indexing/__pycache__/utils.cpython-313.pyc | 0
R indexing/idf.py -> sqlite-tfidf/indexing/idf.py | 0
A sqlite-tfidf/indexing/lang-detect.py | 41 +++++++++++++++++++++++++++++++++++++++++
A sqlite-tfidf/indexing/tf.py | 43 +++++++++++++++++++++++++++++++++++++++++++
R indexing/utils.py -> sqlite-tfidf/indexing/utils.py | 0
R metrics/cosine-similarity.py -> sqlite-tfidf/metrics/cosine-similarity.py | 0
R metrics/tf-idf.py -> sqlite-tfidf/metrics/tf-idf.py | 0
R pyproject.toml -> sqlite-tfidf/pyproject.toml | 0
R search/query.py -> sqlite-tfidf/search/query.py | 0
A sqlite-tfidf/seeds/code.txt | 7 +++++++
R seeds/dictionaries.txt -> sqlite-tfidf/seeds/dictionaries.txt | 0
R seeds/music.txt -> sqlite-tfidf/seeds/music.txt | 0
R seeds/otr.txt -> sqlite-tfidf/seeds/otr.txt | 0
R seeds/piracy.txt -> sqlite-tfidf/seeds/piracy.txt | 0
R seeds/research.txt -> sqlite-tfidf/seeds/research.txt | 0
R seeds/wikis.txt -> sqlite-tfidf/seeds/wikis.txt | 0
27 files changed, 533 insertions(+), 450 deletions(-)

diff --git a/TODO.md b/TODO.md
@@ -1,122 +0,0 @@
-- add language information somewhere
-- should we have a term count table for documents?
-- tf indexing
-    - for each document go through each term within it and calculate the tf value directly
-    - we then save something to this table:
-        - tf(document-path, term, value)
-        - indexed: document-path and term
-        - combined index?
-            - probably not
-            - in general we are interested in the terms being indexed, but it also probably makes sense to lookup documents too for derived values.
-- add linking support?
-    - how to do this?
-        - key value db where keys are urls and values are outlinks?
-- language detection
-    - this should be included as part of the sites lookup
-    - or should this be its own db per lang?
-- should this be authority based
-    - on one hand, I hate centralization and authority
-    - on the other hand, is there any way around this if there are llms online
-    - should we just do some sort of llm prediction?
-        - rank domains based on llm suspicion
-- distance from authority
-- ensure pruning logic is used during spider crawling so we don't write useless stuff to begin with
-- update deletion of documents to also update the db
-    - this will be used to apply rules backwards
-- improve pruning
-    - currently just based on length, but I could see information content being useful too
-        - since we have words, the information content could be computed based on word frequency
-            - actually, that'd be a bit different because we don't have word frequency globally
-
-
----
-
-- update idf to be incrementally calculated
-    - constantly updating as things change in the corpus
-    - hmm, what about when we remove old stuff though? that stuff will throw off the stats more and more over time...
-        - might not want this to be incremental after all...
-- indexing should include adding language
-- add centralized indexing
-    - i added an indexed field to support this idea for incremental indexing
-- smarter queueing
-- url lookup table
-    - fixes some of the memory issues
-- ensure pruning prior to writing
-    - should there be a penalty for this? probably
-- Make everything incremental
-    - this is getting too slow to do tf / idf on everything...
-    - I probably don't want to do indexing and crawling at the same time for perf sake so there should be a field added to the site table for 'indexed'.
-
-- How to do eviction of old urls / sites that are queried?
-    - this might be early, but this can ballon quickly (100s of gb added every few days)
-
-- forward / backlink calculation
-    - maybe after crawling as these are derived / indexing type values
-
----
-
-- improve priority assignment
-    - how to do this?
-        - rank domains based on importance
-            - how?
-        - then use ranking to assign priority
-        - there should be some temporal priority too so maybe it is based on date/time inversion stuff
-
-- support sitemap .xml
-    - this should more fully search specific domains
-    - these should (probably) take priority over other links
-    - example: https://www.google.com/chromebook/sitemap.xml
-    - we should treat these differently because they don't have href stuffs
-        - regexp it is
-- something interesting would be link counts
-    - the more a page is linked to, the better
-- respect:
-    - noindex and nofollow
-    - https://en.wikipedia.org/wiki/Web_crawler
-- what I want
-    - I want to crawl a smaller subset of the internet that is useful
-    - I get lots of stuff from audible which is kind of useless in terms of importance
-    - could we do some sort of ranking per domain for average IC across documents?
-        - maybe, but would that just prioritize random garbage?
-            - probably....
-    - maybe prioritize longer documents
-        - that seems like it could be easily gamed, but also this is just the spider
-
----
-
-- add regexp for url filtering
-    - don't use stuff with parameters (or logins maybe?)
-        - https://www.oed.com/shibboleth-login-redirect?returnUrl=https://oup-sp.sams-sigma.com/Shibboleth.sso/Login?SAMLDS%3D1%26target%3Dss%253Amem%253A292bfd0634875a2fa6b2ffc899913c45dca16f346eb3fd65a9b1269d9c16a659&entityID=https://idp.plymouthart.ac.uk/shibboleth,
-
----
-
-next selection where C is the corpus:
-
-Where C_i,b is the backlinks of the i'th element
-Where C_i,t is the seconds since being added to the corpus
-(uncertain about this one) Where C_i,d is the distance from a seed authority (beta being a negative hyperparameter)
-
-s(C) = max( C_i,b * alpha + C_i,t * gamma + z + C_i,d * beta)
-
----
-
-
-Other approaches:
-
-- breadth first
-    - The explanation given by the authors for this result is that "the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates
-    - this makes sense as a starting point, but then we have to consider other things too... because otherwise we never reindex things???
-        - or do we expect circularity to fix that issue?
-
----
-
-- languages
-    - this is really important so put at top of list
-    - we want to calculate and add a column to the site table for the language of each site
-    - we then want these joined together for search querying
-
----
-
-- sites should be prioritized by language as well...
-    - maybe, we don't actually know the language until we query it
-        - could use a heuristic based on subdomain
diff --git a/collection/spider.py b/collection/spider.py
@@ -1,250 +0,0 @@
-# TODO: Create another database (it should be unique in case it needs to be flushed, it is basically tmp after all)
-# that is a priority queue for links to be parsed. This requires consideration for how we determine what should be
-# searched and when it should be searched.
-
-# MVP: Create db to store links to follow. Add to the bottom and take from the top.
-
-# Q: How should we handle the query timelines?
-# A: We have a day check for selections, but I don't like this. We should instead be
-# using a queue to remove this additional logic. We can reuse that to populate the queue, but that is
-# unrelated.
-
-import urllib.robotparser
-from urllib.parse import urlparse
-import urllib.request
-import requests
-import os
-import datetime
-import uuid
-from urllib.parse import urljoin, urlparse
-from concurrent.futures import ThreadPoolExecutor, as_completed
-from bs4 import BeautifulSoup
-import sys
-import sqlite3
-from prune import process_file
-import time
-
-# Layout:
-# - sites
-    # - date
-        # - each file is one site with a UUID as the filename
-# - database
-    # - manifest.db
-        # this is the database that maps UUIDs with urls and dates
-        # the urls might not be unique as we could have multiple copies of the same site from different times
-        # no, this is not 3nf, no I don't care, this is faster.
-        # tables:
-            # site
-                # url, filepath, date, indexed
-            # tf
-                # TODO
-    # - urls.db
-        # - this is only for url lookups. this table can be considered ephemeral
-        # tables:
-            # url, priority, (possible add distance from authority here as well, uncertain)
-
-
-# there seems to be a memory leak somewhere which is limited by max workers, but this really bugs me.
-
-
-# bytes
-MAX_SIZE = 2_000_000
-MAX_WORKERS = 50
-MAX_URLS_PER_SITE = 100
-NOT_INDEXED = 0
-INDEXED = 1
-REINDEX_FREQUENCY_DAYS = 7
-
-
-# if seconds weight == 1 then 3600 for backlink weight means
-# each additional backlink equates to a decrease of 1 hour
-
-BACKLINK_WEIGHT = 3600
-SECONDS_WEIGHT = 1
-
-def should_queue(url, cur):
-    cutoff = time.time() - (REINDEX_FREQUENCY_DAYS * 86400)
-    cur.execute("""
-        SELECT 1 FROM site WHERE url = ? AND date > ? LIMIT 1
-    """, (url, cutoff))
-    return cur.fetchone() is None
-
-def is_allowed(url, user_agent, timeout=1):
-    try:
-        parsed = urlparse(url)
-        robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
-        rp = urllib.robotparser.RobotFileParser()
-        rp.set_url(robots_url)
-        with urllib.request.urlopen(robots_url, timeout=timeout) as response:
-            rp.parse(response.read().decode('utf-8').splitlines())
-        return rp.can_fetch(user_agent, url)
-    except Exception:
-        return True
-
-# you should always repect robots.txt, but if you are trying to do something with this spider I guess you can
-# disable it. please don't do this en-masse though, that's naughty.
-# TODO: Check the size with a HEAD request prior to reading into memory.
-def search_url(url, filepath, respect_robots_txt=True):
-    user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:142.0) Gecko/20100101 Firefox/142.0'
-    if respect_robots_txt:
-        if not is_allowed(url,user_agent):
-            print(f"Can't crawl {url} due to robots.txt violation")
-            return "", "", set()
-
-    links = set()
-
-    headers = {
-        'User-Agent': user_agent,
-    }
-
-    try:
-        source_code = requests.get(url, headers=headers, timeout=1) # natural limit to file size in memory
-        if not source_code.ok:
-            print(f'Status code not 2xx for {url}, returning.')
-            return "", "", set()
-
-        content_type = source_code.headers.get('Content-Type', '')
-        if 'text/html' not in content_type:
-            print(f'Content type for {url} not html, returning.')
-            return "", "", set()
-
-        soup = BeautifulSoup(source_code.content, 'html.parser')
-        content = soup.prettify()
-
-        if len(content.encode('utf-8')) < MAX_SIZE:
-            with open(filepath, 'w') as f:
-                f.write(content)
-                print(f'Wrote {url} to {filepath}')
-            deleted = process_file(filepath)
-
-            # we don't want to spider from bad sites.
-            # process_file does some regexp checks on the site to see if it is short / bad in some other way.
-            if deleted:
-                return "", "", set()
-        else:
-            print(f'skipping fs write for {url}, too large')
-    except Exception as e:
-        print(e)
-        return "", "", set()
-
-    current_url_without_fragment = urlparse(url)._replace(fragment='').geturl()
-
-    # find all links < max_urls_per_site that direct to a different page.
-    for link in soup.find_all('a', href=True):
-        href = link.get('href')
-
-        if href.startswith('#'):
-            continue
-
-        absolute_url = urljoin(url, href)
-        parsed = urlparse(absolute_url)
-
-        url_without_fragment = parsed._replace(fragment='').geturl()
-
-        if url_without_fragment == current_url_without_fragment:
-            continue
-
-        if parsed.scheme in ('http', 'https'):
-            if len(links) < MAX_URLS_PER_SITE:
-                links.add(absolute_url)
-
-    return filepath, url, links
-
-# pop links so multiple processes can run concurrently
-def get_links(num_links, cur_link, con_links, backlink_weight, seconds_weight):
-
-    cur_link.execute("""
-        DELETE FROM link
-        WHERE url IN (
-            SELECT url FROM link
-            ORDER BY (backlink_count * ?) + ( ? * (? - added)) DESC
-            LIMIT ?
-        )
-        RETURNING url
-    """, (backlink_weight, seconds_weight, int(time.time()), num_links))
-
-    urls = {row[0] for row in cur_link.fetchall()}
-    con_links.commit()
-    return urls
-
-
-if __name__ == "__main__":
-    seed_filename = ""
-    if len(sys.argv) == 2:
-        seed_filename = sys.argv[1]
-    con = sqlite3.connect('database/manifest.db', timeout=60)
-    con.execute('PRAGMA journal_mode=WAL')
-    cur = con.cursor()
-
-    cur.execute("CREATE TABLE IF NOT EXISTS site(url, filepath, date, indexed, language)")
-    cur.execute("CREATE INDEX IF NOT EXISTS idx_site_url ON site(url)")
-    cur.execute("CREATE INDEX IF NOT EXISTS idx_site_indexed ON site(indexed)")
-    cur.execute("CREATE INDEX IF NOT EXISTS idx_site_filepath ON site(filepath)")
-    cur.execute("CREATE INDEX IF NOT EXISTS idx_site_language ON site(filepath)")
-
-
-    con_links = sqlite3.connect('database/urls.db', timeout=60)
-    con_links.execute('PRAGMA journal_mode=WAL')
-    cur_link = con_links.cursor()
-    cur_link.execute("CREATE TABLE IF NOT EXISTS link(url UNIQUE, backlink_count, added)")
-    cur_link.execute("CREATE INDEX IF NOT EXISTS idx_link_url ON link(url)")
-    cur_link.execute("CREATE INDEX IF NOT EXISTS idx_link_added ON link(added)")
-    cur_link.execute("CREATE INDEX IF NOT EXISTS idx_link_backlink_count ON link(backlink_count)")
-
-    urls = set()
-
-    if seed_filename != "":
-        urlLs = []
-        with open(seed_filename, 'r') as f:
-            urlLs = f.readlines()
-        for i in range(len(urlLs)):
-            urls.add(urlLs[i].strip())
-        print(f"Loaded seed file with {len(urls)} urls")
-    save_location = 'sites/'
-
-    # TODO: better stopping. only stops when all links have been traversed
-    while True:
-        if len(urls) == 0:
-            urls = get_links(MAX_WORKERS, cur_link, con_links, BACKLINK_WEIGHT, SECONDS_WEIGHT)
-            if len(urls) == 0:
-                print("NO MORE QUEUED LINKS TO SEARCH, EXITING")
-                break
-            print(f"Loaded {len(urls)} urls from queue")
-
-        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
-            now = datetime.datetime.now()
-            pth = save_location + now.strftime("%Y-%m-%d") + "/"
-            os.makedirs(pth, exist_ok=True)
-            futures = {
-                executor.submit(search_url, url, pth+ str(uuid.uuid4())): url
-                for url in urls
-            }
-
-            for future in as_completed(futures):
-                filepath, url, links = future.result()
-                if filepath != '' and url != '':
-                    # insert into site list
-                    cur.execute("""
-                        INSERT INTO site (url, filepath, date, indexed)
-                        VALUES (?, ?, ?, ?)
-                    """, (url, filepath, int(datetime.datetime.now().timestamp()), NOT_INDEXED))
-                    con.commit()
-                    for link in links:
-                        # TODO: Make priority better, also speed this up with transactions
-                        # link(url UNIQUE, backlink_count, added)
-                        if should_queue(link, cur):
-                            cur_link.execute("""
-                                INSERT INTO link VALUES (?, 1, ?)
-                                ON CONFLICT(url) DO UPDATE SET backlink_count = backlink_count + 1
-                            """, (link, int(datetime.datetime.now().timestamp())))
-                            con_links.commit()
-                        else:
-                            print(f"Skipping '{link}' for indexing")
-        urls = set()
-
-
-    cur_link.close()
-
-    con_links.close()
-    cur.close()
-    con.close()
diff --git a/indexing/lang-detect.py b/indexing/lang-detect.py
@@ -1,28 +0,0 @@
-from langdetect import detect
-from tqdm import tqdm
-import sqlite3
-from utils import get_plaintext
-
-def detect_language(text):
-    return detect(text)
-
-if __name__ == "__main__":
-    con_sites = sqlite3.connect('database/manifest.db', timeout=60)
-    con_sites.execute('PRAGMA journal_mode=WAL')
-    cur_sites = con_sites.cursor()
-
-    cur_sites.execute("""
-        SELECT filepath FROM site WHERE language IS NULL
-    """)
-
-    missing_language = cur_sites.fetchall()
-
-    for i in tqdm(range(len(missing_language))):
-        filepath_tuple = missing_language[i]
-        filepath = filepath_tuple[0]
-        plaintext = get_plaintext(filepath)
-        lang = detect_language(plaintext)
-        cur_sites.execute("""
-            UPDATE site SET language = ? WHERE filepath = ?
-        """, (lang, filepath))
-        con_sites.commit()
diff --git a/indexing/tf.py b/indexing/tf.py
@@ -1,43 +0,0 @@
-import sqlite3
-from utils import get_tfs
-from utils import get_terms
-from utils import get_filepaths
-from tqdm import tqdm
-
-if __name__ == "__main__":
-    filepaths = get_filepaths('sites')
-    print(f'Found {len(filepaths)} files')
-
-    # returns a set of all indexed terms
-    terms = get_terms('database/manifest.db')
-    print(f'fetched {len(terms)} terms')
-
-    # tfs is a dict:
-    # term : tf
-    # tf(document_path, term, value)
-
-    # having the one db makes the deletion very slow.
-    con = sqlite3.connect('database/manifest.db', timeout=60)
-    con.execute('PRAGMA journal_mode=WAL')
-    cur = con.cursor()
-    cur.execute("CREATE TABLE IF NOT EXISTS tf(document_path, term, value)")
-    cur.execute("CREATE INDEX IF NOT EXISTS idx_tf_document_path ON tf(document_path)")
-    cur.execute("CREATE INDEX IF NOT EXISTS idx_tf_term ON tf(term)")
-    cur.execute("DELETE FROM tf")
-
-    # the reason we want a term list is because that gives us a guarantee about having an idf.
-    # considerations can be made for the necessity of this, but I think this is safe for now.
-    # TODO: Possibly add idf imputation during for searches on terms that don't exist in the term table.
-
-
-    # TODO: update indexed status for sites and only use that status for indexing, not every file with get_filepaths
-
-    for i in tqdm(range(0,len(filepaths))):
-        filepath = filepaths[i]
-        # this returns the tfs for all terms from terms that exist in the current document (filepath)
-        tfs = get_tfs(filepath, terms)
-        for word in tfs:
-            cur.execute("INSERT INTO tf VALUES (?, ?, ?)", (filepath, word, tfs[word]))
-    con.commit()
-    cur.close()
-    con.close()
diff --git a/seeds/code.txt b/seeds/code.txt
@@ -1,7 +0,0 @@
-https://github.com/trending
-https://codeberg.org/
-https://about.gitlab.com/
-https://github.com/
-https://github.com/topics/awesome
-https://www.reddit.com/r/programming/
-https://www.reddit.com/r/ProgrammingLanguages/
diff --git a/sqlite-tfidf/TODO.md b/sqlite-tfidf/TODO.md
@@ -0,0 +1,186 @@
+- add language information somewhere
+- should we have a term count table for documents?
+- tf indexing
+    - for each document go through each term within it and calculate the tf value directly
+    - we then save something to this table:
+        - tf(document-path, term, value)
+        - indexed: document-path and term
+        - combined index?
+            - probably not
+            - in general we are interested in the terms being indexed, but it also probably makes sense to lookup documents too for derived values.
+- add linking support?
+    - how to do this?
+        - key value db where keys are urls and values are outlinks?
+- language detection
+    - this should be included as part of the sites lookup
+    - or should this be its own db per lang?
+- should this be authority based
+    - on one hand, I hate centralization and authority
+    - on the other hand, is there any way around this if there are llms online
+    - should we just do some sort of llm prediction?
+        - rank domains based on llm suspicion
+- distance from authority
+- ensure pruning logic is used during spider crawling so we don't write useless stuff to begin with
+- update deletion of documents to also update the db
+    - this will be used to apply rules backwards
+- improve pruning
+    - currently just based on length, but I could see information content being useful too
+        - since we have words, the information content could be computed based on word frequency
+            - actually, that'd be a bit different because we don't have word frequency globally
+
+
+---
+
+- update idf to be incrementally calculated
+    - constantly updating as things change in the corpus
+    - hmm, what about when we remove old stuff though? that stuff will throw off the stats more and more over time...
+        - might not want this to be incremental after all...
+- indexing should include adding language
+- add centralized indexing
+    - i added an indexed field to support this idea for incremental indexing
+- smarter queueing
+- url lookup table
+    - fixes some of the memory issues
+- ensure pruning prior to writing
+    - should there be a penalty for this? probably
+- Make everything incremental
+    - this is getting too slow to do tf / idf on everything...
+    - I probably don't want to do indexing and crawling at the same time for perf sake so there should be a field added to the site table for 'indexed'.
+
+- How to do eviction of old urls / sites that are queried?
+    - this might be early, but this can ballon quickly (100s of gb added every few days)
+
+- forward / backlink calculation
+    - maybe after crawling as these are derived / indexing type values
+
+---
+
+- improve priority assignment
+    - how to do this?
+        - rank domains based on importance
+            - how?
+        - then use ranking to assign priority
+        - there should be some temporal priority too so maybe it is based on date/time inversion stuff
+
+- support sitemap .xml
+    - this should more fully search specific domains
+    - these should (probably) take priority over other links
+    - example: https://www.google.com/chromebook/sitemap.xml
+    - we should treat these differently because they don't have href stuffs
+        - regexp it is
+- something interesting would be link counts
+    - the more a page is linked to, the better
+- respect:
+    - noindex and nofollow
+    - https://en.wikipedia.org/wiki/Web_crawler
+- what I want
+    - I want to crawl a smaller subset of the internet that is useful
+    - I get lots of stuff from audible which is kind of useless in terms of importance
+    - could we do some sort of ranking per domain for average IC across documents?
+        - maybe, but would that just prioritize random garbage?
+            - probably....
+    - maybe prioritize longer documents
+        - that seems like it could be easily gamed, but also this is just the spider
+
+---
+
+- add regexp for url filtering
+    - don't use stuff with parameters (or logins maybe?)
+        - https://www.oed.com/shibboleth-login-redirect?returnUrl=https://oup-sp.sams-sigma.com/Shibboleth.sso/Login?SAMLDS%3D1%26target%3Dss%253Amem%253A292bfd0634875a2fa6b2ffc899913c45dca16f346eb3fd65a9b1269d9c16a659&entityID=https://idp.plymouthart.ac.uk/shibboleth,
+
+---
+
+next selection where C is the corpus:
+
+Where C_i,b is the backlinks of the i'th element
+Where C_i,t is the seconds since being added to the corpus
+(uncertain about this one) Where C_i,d is the distance from a seed authority (beta being a negative hyperparameter)
+
+s(C) = max( C_i,b * alpha + C_i,t * gamma + z + C_i,d * beta)
+
+---
+
+
+Other approaches:
+
+- breadth first
+    - The explanation given by the authors for this result is that "the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates
+    - this makes sense as a starting point, but then we have to consider other things too... because otherwise we never reindex things???
+        - or do we expect circularity to fix that issue?
+
+---
+
+- languages
+    - this is really important so put at top of list
+    - we want to calculate and add a column to the site table for the language of each site
+    - we then want these joined together for search querying
+
+---
+
+- sites should be prioritized by language as well...
+    - maybe, we don't actually know the language until we query it
+        - could use a heuristic based on subdomain
+
+---
+
+- lang detect is kind of weak
+    - these are english:
+        - https://hi.wikipedia.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%AA%E0%A5%80%E0%A4%A1%E0%A4%BF%E0%A4%AF%E0%A4%BE:%E0%A4%AC%E0%A5%89%E0%A4%9F/%E0%A4%85%E0%A4%A8%E0%A5%81%E0%A4%AE%E0%A5%8B%E0%A4%A6%E0%A4%A8_%E0%A4%B9%E0%A5%87%E0%A4%A4%E0%A5%81_%E0%A4%85%E0%A4%A8%E0%A5%81%E0%A4%B0%E0%A5%8B%E0%A4%A7
+        - https://www.mediawiki.org/wiki/ORES/ru
+- maybe add more stringent requirements about language percentage?
+    - seems sus, but still...
+- there seems to be some html stuff related to content language
+    - https://stackoverflow.com/questions/6157485/what-are-content-language-and-accept-language
+
+SOLVED:
+
+- Made current version the fallback and added an html check for content type
+    - should probably ding scores if they use the fallback because that is not to spec
+
+----
+
+- this sucks for word association
+    - it finds the words on pages, but not in the right locations
+    - I think this is why you have index values and ordering stuff
+
+---
+
+- incremental updating for some stuff
+    - sites that haven't been reindexed can remain the same
+    - they wouldn't add to the term list anyways, at present anyways...
+
+---
+
+- track forward and backward links better
+    - implement pagerank like system
+
+---
+
+- uprank domain names
+    - this is what people often search for...
+- why do they rank openai so highly?
+    - is there some sort of stock / valuation manifest I could use for this / I haven't indexed yet?
+    - news sites?
+
+"
+As page-rank is described in the original article, and in the wikipedia article, it is indeed not defined when out-degree(v)=0 for some v, since you get P(v,u)=d/n+(1-d)*0/0 - which is undefined
+
+A node that has no outgoing edge is called a dangling node and there are basically 3 common ways to take care of them:
+
+Eliminate such nodes from the graph (and repeat the process iteratively until there are no dangling nodes.
+Consider those pages to link back to the pages that linked to them (i.e. - for each edge (u,v), if out-degree(v) = 0, regard (v,u) as an edge).
+Link the dangling node to all pages (including itself usually), and effectively make the probability for random jump from this node 1.
+About a page with no incoming node - that shouldn't be an issue because everything is perfectly defined. Such a node will have a page rank of exactly d/n - because you can only get to it by random surfing from any node - and that's the probability to be in it.
+
+Hope that answered your question!
+"
+
+---
+
+Improve indexing
+
+- I like most of what I have right now, but when a site is recrawled, the old version should be deleted.
+    - this should help with headaches that require ordering by date / timestamp, and it should make page rank easier as each site will be unique.
+    - this also means that we only care about url and not the filepath location, except in cases where we are doing something to the cached data.
+- how would this impact my search functionality?
+    - unclear
diff --git a/sqlite-tfidf/collection/__pycache__/prune.cpython-313.pyc b/sqlite-tfidf/collection/__pycache__/prune.cpython-313.pyc
Binary files differ.
diff --git a/collection/prune.py b/sqlite-tfidf/collection/prune.py
diff --git a/sqlite-tfidf/collection/spider.py b/sqlite-tfidf/collection/spider.py
@@ -0,0 +1,256 @@
+# TODO: Create another database (it should be unique in case it needs to be flushed, it is basically tmp after all)
+# that is a priority queue for links to be parsed. This requires consideration for how we determine what should be
+# searched and when it should be searched.
+
+# MVP: Create db to store links to follow. Add to the bottom and take from the top.
+
+# Q: How should we handle the query timelines?
+# A: We have a day check for selections, but I don't like this. We should instead be
+# using a queue to remove this additional logic. We can reuse that to populate the queue, but that is
+# unrelated.
+
+import urllib.robotparser
+from urllib.parse import urlparse
+import urllib.request
+import requests
+import os
+import datetime
+import uuid
+from urllib.parse import urljoin, urlparse
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from bs4 import BeautifulSoup
+import sys
+import sqlite3
+from prune import process_file
+import time
+
+# Layout:
+# - sites
+    # - date
+        # - each file is one site with a UUID as the filename
+# - database
+    # - manifest.db
+        # this is the database that maps UUIDs with urls and dates
+        # the urls might not be unique as we could have multiple copies of the same site from different times
+        # no, this is not 3nf, no I don't care, this is faster.
+        # tables:
+            # site
+                # url, filepath, date, indexed
+            # tf
+                # TODO
+    # - urls.db
+        # - this is only for url lookups. this table can be considered ephemeral
+        # tables:
+            # url, priority, (possible add distance from authority here as well, uncertain)
+
+
+# there seems to be a memory leak somewhere which is limited by max workers, but this really bugs me.
+
+
+# bytes
+MAX_SIZE = 2_000_000
+MAX_WORKERS = 50
+MAX_URLS_PER_SITE = 100
+NOT_INDEXED = 0
+INDEXED = 1
+REINDEX_FREQUENCY_DAYS = 7
+
+
+# if seconds weight == 1 then 3600 for backlink weight means
+# each additional backlink equates to a decrease of 1 hour
+
+BACKLINK_WEIGHT = 3600
+SECONDS_WEIGHT = 1
+
+def should_queue(url, cur):
+    cutoff = time.time() - (REINDEX_FREQUENCY_DAYS * 86400)
+    cur.execute("""
+        SELECT 1 FROM site WHERE url = ? AND date > ? LIMIT 1
+    """, (url, cutoff))
+    return cur.fetchone() is None
+
+def is_allowed(url, user_agent, timeout=1):
+    try:
+        parsed = urlparse(url)
+        robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
+        rp = urllib.robotparser.RobotFileParser()
+        rp.set_url(robots_url)
+        with urllib.request.urlopen(robots_url, timeout=timeout) as response:
+            rp.parse(response.read().decode('utf-8').splitlines())
+        return rp.can_fetch(user_agent, url)
+    except Exception:
+        return True
+
+# you should always repect robots.txt, but if you are trying to do something with this spider I guess you can
+# disable it. please don't do this en-masse though, that's naughty.
+# TODO: Check the size with a HEAD request prior to reading into memory.
+def search_url(url, filepath, respect_robots_txt=True):
+    user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:142.0) Gecko/20100101 Firefox/142.0'
+    if respect_robots_txt:
+        if not is_allowed(url,user_agent):
+            print(f"Can't crawl {url} due to robots.txt violation")
+            return "", "", set()
+
+    links = set()
+
+    headers = {
+        'User-Agent': user_agent,
+    }
+
+    try:
+        source_code = requests.get(url, headers=headers, timeout=1) # natural limit to file size in memory
+        if not source_code.ok:
+            print(f'Status code not 2xx for {url}, returning.')
+            return "", "", set()
+
+        content_type = source_code.headers.get('Content-Type', '')
+        if 'text/html' not in content_type:
+            print(f'Content type for {url} not html, returning.')
+            return "", "", set()
+
+        soup = BeautifulSoup(source_code.content, 'html.parser')
+        content = soup.prettify()
+
+        if len(content.encode('utf-8')) < MAX_SIZE:
+            with open(filepath, 'w') as f:
+                f.write(content)
+                print(f'Wrote {url} to {filepath}')
+            deleted = process_file(filepath)
+
+            # we don't want to spider from bad sites.
+            # process_file does some regexp checks on the site to see if it is short / bad in some other way.
+            if deleted:
+                return "", "", set()
+        else:
+            print(f'skipping fs write for {url}, too large')
+    except Exception as e:
+        print(e)
+        return "", "", set()
+
+    current_url_without_fragment = urlparse(url)._replace(fragment='').geturl()
+
+    # find all links < max_urls_per_site that direct to a different page.
+    for link in soup.find_all('a', href=True):
+        href = link.get('href')
+
+        if href.startswith('#'):
+            continue
+
+        absolute_url = urljoin(url, href)
+        parsed = urlparse(absolute_url)
+
+        url_without_fragment = parsed._replace(fragment='').geturl()
+
+        if url_without_fragment == current_url_without_fragment:
+            continue
+
+        if parsed.scheme in ('http', 'https'):
+            if len(links) < MAX_URLS_PER_SITE:
+                links.add(absolute_url)
+
+    return filepath, url, links
+
+# pop links so multiple processes can run concurrently
+def get_links(num_links, cur_link, con_links, backlink_weight, seconds_weight):
+
+    # this is kind of bad... if there is a concurrent spider running and it picks up the same
+    # link that is currently being processed, it will get added to the queue right after
+    # being popped, but before being written to the sites table...
+
+    # TODO: Add mutex to fix issue here
+
+    cur_link.execute("""
+        DELETE FROM link
+        WHERE url IN (
+            SELECT url FROM link
+            ORDER BY (backlink_count * ?) + ( ? * (? - added)) DESC
+            LIMIT ?
+        )
+        RETURNING url
+    """, (backlink_weight, seconds_weight, int(time.time()), num_links))
+
+    urls = {row[0] for row in cur_link.fetchall()}
+    con_links.commit()
+    return urls
+
+
+if __name__ == "__main__":
+    seed_filename = ""
+    if len(sys.argv) == 2:
+        seed_filename = sys.argv[1]
+    con = sqlite3.connect('database/manifest.db', timeout=60)
+    con.execute('PRAGMA journal_mode=WAL')
+    cur = con.cursor()
+
+    cur.execute("CREATE TABLE IF NOT EXISTS site(url, filepath, date, indexed, language)")
+    cur.execute("CREATE INDEX IF NOT EXISTS idx_site_url ON site(url)")
+    cur.execute("CREATE INDEX IF NOT EXISTS idx_site_indexed ON site(indexed)")
+    cur.execute("CREATE INDEX IF NOT EXISTS idx_site_filepath ON site(filepath)")
+    cur.execute("CREATE INDEX IF NOT EXISTS idx_site_language ON site(language)")
+
+
+    con_links = sqlite3.connect('database/urls.db', timeout=60)
+    con_links.execute('PRAGMA journal_mode=WAL')
+    cur_link = con_links.cursor()
+    cur_link.execute("CREATE TABLE IF NOT EXISTS link(url UNIQUE, backlink_count, added)")
+    cur_link.execute("CREATE INDEX IF NOT EXISTS idx_link_url ON link(url)")
+    cur_link.execute("CREATE INDEX IF NOT EXISTS idx_link_added ON link(added)")
+    cur_link.execute("CREATE INDEX IF NOT EXISTS idx_link_backlink_count ON link(backlink_count)")
+
+    urls = set()
+
+    if seed_filename != "":
+        urlLs = []
+        with open(seed_filename, 'r') as f:
+            urlLs = f.readlines()
+        for i in range(len(urlLs)):
+            urls.add(urlLs[i].strip())
+        print(f"Loaded seed file with {len(urls)} urls")
+    save_location = 'sites/'
+
+    # TODO: better stopping. only stops when all links have been traversed
+    while True:
+        if len(urls) == 0:
+            urls = get_links(MAX_WORKERS, cur_link, con_links, BACKLINK_WEIGHT, SECONDS_WEIGHT)
+            if len(urls) == 0:
+                print("NO MORE QUEUED LINKS TO SEARCH, EXITING")
+                break
+            print(f"Loaded {len(urls)} urls from queue")
+
+        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
+            now = datetime.datetime.now()
+            pth = save_location + now.strftime("%Y-%m-%d") + "/"
+            os.makedirs(pth, exist_ok=True)
+            futures = {
+                executor.submit(search_url, url, pth+ str(uuid.uuid4())): url
+                for url in urls
+            }
+
+            for future in as_completed(futures):
+                filepath, url, links = future.result()
+                if filepath != '' and url != '':
+                    # insert into site list
+                    cur.execute("""
+                        INSERT INTO site (url, filepath, date, indexed)
+                        VALUES (?, ?, ?, ?)
+                    """, (url, filepath, int(datetime.datetime.now().timestamp()), NOT_INDEXED))
+                    con.commit()
+                    for link in links:
+                        # TODO: Make priority better, also speed this up with transactions
+                        # link(url UNIQUE, backlink_count, added)
+                        if should_queue(link, cur):
+                            cur_link.execute("""
+                                INSERT INTO link VALUES (?, 1, ?)
+                                ON CONFLICT(url) DO UPDATE SET backlink_count = backlink_count + 1
+                            """, (link, int(datetime.datetime.now().timestamp())))
+                            con_links.commit()
+                        else:
+                            print(f"Skipping '{link}' for indexing")
+        urls = set()
+
+
+    cur_link.close()
+
+    con_links.close()
+    cur.close()
+    con.close()
diff --git a/indexing/__init__.py b/sqlite-tfidf/indexing/__init__.py
diff --git a/sqlite-tfidf/indexing/__pycache__/__init__.cpython-313.pyc b/sqlite-tfidf/indexing/__pycache__/__init__.cpython-313.pyc
Binary files differ.
diff --git a/sqlite-tfidf/indexing/__pycache__/utils.cpython-313.pyc b/sqlite-tfidf/indexing/__pycache__/utils.cpython-313.pyc
Binary files differ.
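Note on the race described in the get_links comment in spider.py: one possible alternative to a mutex is to mark rows as claimed instead of deleting them, so that a link rediscovered while it is still being fetched hits the existing row rather than being queued again. The sketch below is only an illustration under assumptions: the `claimed` column and the helper names `make_queue`, `enqueue`, `claim_links`, and `finish_link` are hypothetical and do not appear in the commit.

```python
import sqlite3
import time


def make_queue(con):
    # Same shape as the link table in spider.py, plus a hypothetical `claimed` flag.
    con.execute(
        "CREATE TABLE IF NOT EXISTS link("
        "url UNIQUE, backlink_count, added, claimed DEFAULT 0)"
    )


def enqueue(con, url):
    # Upsert: a URL that is queued or currently claimed only gets its
    # backlink count bumped; it never becomes a second, unclaimed row.
    con.execute(
        """
        INSERT INTO link (url, backlink_count, added) VALUES (?, 1, ?)
        ON CONFLICT(url) DO UPDATE SET backlink_count = backlink_count + 1
        """,
        (url, int(time.time())),
    )


def claim_links(con, num_links, backlink_weight, seconds_weight):
    # BEGIN IMMEDIATE takes the write lock up front, so two spiders
    # cannot select the same unclaimed rows at the same time.
    con.execute("BEGIN IMMEDIATE")
    try:
        rows = con.execute(
            """
            SELECT url FROM link
            WHERE claimed = 0
            ORDER BY (backlink_count * ?) + (? * (? - added)) DESC
            LIMIT ?
            """,
            (backlink_weight, seconds_weight, int(time.time()), num_links),
        ).fetchall()
        urls = {row[0] for row in rows}
        con.executemany(
            "UPDATE link SET claimed = 1 WHERE url = ?", [(u,) for u in urls]
        )
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")
        raise
    return urls


def finish_link(con, url):
    # Call once the page has been written to the site table (or rejected).
    con.execute("DELETE FROM link WHERE url = ?", (url,))
```

The connection is assumed to be opened with `isolation_level=None` so the explicit BEGIN/COMMIT statements are the only transaction control. A worker would call `claim_links`, fetch each URL, insert the site row, and only then call `finish_link`.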
diff --git a/indexing/idf.py b/sqlite-tfidf/indexing/idf.py
diff --git a/sqlite-tfidf/indexing/lang-detect.py b/sqlite-tfidf/indexing/lang-detect.py
@@ -0,0 +1,41 @@
+from langdetect import detect
+from tqdm import tqdm
+import sqlite3
+from utils import get_plaintext
+from bs4 import BeautifulSoup
+
+def get_html_language(filepath):
+    try:
+        with open(filepath, 'r', encoding='utf-8') as f:
+            soup = BeautifulSoup(f, 'html.parser')
+            html_tag = soup.find('html')
+            if html_tag and html_tag.get('lang'):
+                return html_tag.get('lang').split('-')[0].lower()
+    except OSError:
+        pass
+
+def detect_language(filepath):
+    result = get_html_language(filepath)
+    if result is not None:
+        return result
+    return detect(get_plaintext(filepath))
+
+if __name__ == "__main__":
+    con_sites = sqlite3.connect('database/manifest.db', timeout=60)
+    con_sites.execute('PRAGMA journal_mode=WAL')
+    cur_sites = con_sites.cursor()
+
+    cur_sites.execute("""
+        SELECT filepath FROM site WHERE language IS NULL
+    """)
+
+    missing_language = cur_sites.fetchall()
+
+    for i in tqdm(range(len(missing_language))):
+        filepath_tuple = missing_language[i]
+        filepath = filepath_tuple[0]
+        lang = detect_language(filepath)
+        cur_sites.execute("""
+            UPDATE site SET language = ? WHERE filepath = ?
+        """, (lang, filepath))
+        con_sites.commit()
diff --git a/sqlite-tfidf/indexing/tf.py b/sqlite-tfidf/indexing/tf.py
@@ -0,0 +1,43 @@
+import sqlite3
+from utils import get_tfs
+from utils import get_terms
+from utils import get_filepaths
+from tqdm import tqdm
+
+if __name__ == "__main__":
+    filepaths = get_filepaths('sites')
+    print(f'Found {len(filepaths)} files')
+
+    # returns a set of all indexed terms
+    terms = get_terms('database/manifest.db')
+    print(f'fetched {len(terms)} terms')
+
+    # tfs is a dict:
+    # term : tf
+    # tf(document_path, term, value)
+
+    # having the one db makes the deletion very slow.
+    con = sqlite3.connect('database/manifest.db', timeout=60)
+    con.execute('PRAGMA journal_mode=WAL')
+    cur = con.cursor()
+    cur.execute("CREATE TABLE IF NOT EXISTS tf(document_path, term, value)")
+    cur.execute("CREATE INDEX IF NOT EXISTS idx_tf_document_path ON tf(document_path)")
+    cur.execute("CREATE INDEX IF NOT EXISTS idx_tf_term ON tf(term)")
+    cur.execute("DELETE FROM tf")
+
+    # the reason we want a term list is because that gives us a guarantee about having an idf.
+    # considerations can be made for the necessity of this, but I think this is safe for now.
+    # TODO: Possibly add idf imputation during searches on terms that don't exist in the term table.
+
+
+    # TODO: update indexed status for sites and only use that status for indexing, not every file with get_filepaths
+
+    for i in tqdm(range(0,len(filepaths))):
+        filepath = filepaths[i]
+        # this returns the tfs for all terms from terms that exist in the current document (filepath)
+        tfs = get_tfs(filepath, terms)
+        for word in tfs:
+            cur.execute("INSERT INTO tf VALUES (?, ?, ?)", (filepath, word, tfs[word]))
+        con.commit()
+    cur.close()
+    con.close()
diff --git a/indexing/utils.py b/sqlite-tfidf/indexing/utils.py
diff --git a/metrics/cosine-similarity.py b/sqlite-tfidf/metrics/cosine-similarity.py
diff --git a/metrics/tf-idf.py b/sqlite-tfidf/metrics/tf-idf.py
diff --git a/pyproject.toml b/sqlite-tfidf/pyproject.toml
diff --git a/search/query.py b/sqlite-tfidf/search/query.py
diff --git a/sqlite-tfidf/seeds/code.txt b/sqlite-tfidf/seeds/code.txt
@@ -0,0 +1,7 @@
+https://www.die.net/
+https://en.wikipedia.org/wiki/List_of_programming_languages
+https://ziglang.org/
+https://rust-lang.org/
+https://www.python.org/
+https://cppreference.com/
+https://cplusplus.com/reference/
diff --git a/seeds/dictionaries.txt b/sqlite-tfidf/seeds/dictionaries.txt
diff --git a/seeds/music.txt b/sqlite-tfidf/seeds/music.txt
diff --git a/seeds/otr.txt b/sqlite-tfidf/seeds/otr.txt
diff --git a/seeds/piracy.txt b/sqlite-tfidf/seeds/piracy.txt
diff --git a/seeds/research.txt b/sqlite-tfidf/seeds/research.txt
diff --git a/seeds/wikis.txt b/sqlite-tfidf/seeds/wikis.txt
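Note on tf.py: the comments there say that `get_tfs` returns a dict of term to tf, restricted to terms already present in the term list so that every stored term is guaranteed to have an idf. The body of `get_tfs` lives in utils.py and is not shown in this diff, so the following is only a guess at the idea, not the author's implementation. The function name `term_frequencies`, the tokenizer regex, and the normalization by total token count are all assumptions.

```python
import re
from collections import Counter


def term_frequencies(text, terms):
    """Return {term: tf} for the known terms that occur in text.

    Hypothetical stand-in for utils.get_tfs: tf is the count of a term
    divided by the total number of tokens in the document.
    """
    # Assumed tokenizer: lowercase runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return {}
    # Only count tokens that are in the known term set, so every
    # returned term has a matching idf entry.
    counts = Counter(token for token in tokens if token in terms)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def tf_rows(document_path, tfs):
    # Rows shaped like the tf(document_path, term, value) table.
    return [(document_path, term, value) for term, value in tfs.items()]
```

Rows built this way could be passed to a single `executemany("INSERT INTO tf VALUES (?, ?, ?)", rows)` call per document instead of one `execute` per term.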