blog

Personal blog
git clone git://git.laack.co/blog.git

commit f92512849b9e2f42d5846baa868aad52f3c12dd6
parent 5c8ad2975f85d2dbe174a2b0688e84f27ac24675
Author: Andrew Laack <andrew@laack.co>
Date:   Mon,  1 Jun 2026 17:47:48 -0500

This will be too much work; consider how to optimize this search methodology so I don't have to spend so much time thinking about it.

Diffstat:
M posts/wip/advertising.md       |  4 ----
A posts/wip/captchas.md          | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
A python/search-engines/query.py | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 132 insertions(+), 4 deletions(-)

diff --git a/posts/wip/advertising.md b/posts/wip/advertising.md
@@ -50,7 +50,3 @@ Unacceptable:
 - This is slightly different, but if I'm at something like defcon I'd expect not to be gratuitously advertised to during talks. Stations are par for the course, but talks should be devoid of ulterior motivations; be a human.
 
 A form of advertisement that bothers me is self-advertisement. People whoring themselves out to find a job at so-called "networking events" are one thing (simply don't attend), but if I'm at dinner with a group, I expect everyone to live in the moment, not try to get a job out of me or get me to work with / for them. It's annoying. I noticed this is especially bad in San Francisco, mostly because of how little I've noticed these annoyances in the Midwest.
-
-## Sharing
-
-Moving forwards, I am going to maintain a git repository that catalogs every company that has nonconsensually advertised to me to warn others about these companies, and to ensure I never patronize them.
diff --git a/posts/wip/captchas.md b/posts/wip/captchas.md
@@ -0,0 +1,52 @@
+# CAPTCHAs
+
+## Background
+
+I use VPNs most of the time despite concerns that their usage may limit my Fourth Amendment rights [1]. One annoyance of using VPNs is being hit with CAPTCHAs. This isn't an issue when I use my SearXNG instance because it doesn't have CAPTCHAs, but all modern, freely available search engines do.
+
+## Methodology
+
+EDIT: TODO - I used LibreWolf instead of Mullvad Browser because of some issues with multi-search engine opening
+
+Each query was sent while connected to a U.S.-based ProtonVPN exit node. While the exit node changed over time, every search engine was queried through the same exit node for a given query. To achieve this, I used the Multi engine search Firefox extension [2]. Additionally, I used my browser of choice, Mullvad Browser [3], which supports JavaScript, to perform these evaluations. The only changes to the browser were adding the multi-search extension, removing the Mullvad Browser extension, and adding each of the search engines I was interested in testing as search engines in the browser settings. Finally, I created a "New Identity" after every 5 searches to evaluate how often CAPTCHAs are shown when users have already completed them within the same "session". I also rotated my IP address every 5 searches by using a different random U.S.-based ProtonVPN server. I considered resetting after every search, or using my browser naturally, but thought this would be a more consistent way to evaluate CAPTCHA rates.
+
+Alongside tracking CAPTCHA hit rates, I also tracked how long it took to pass through the CAPTCHAs, and the search index of the result I was looking for. This index was the position of the first result that contained a satisfactory answer to my query. I also tracked slop count, which was the number of the top 5 results that were AI slop / SEO spam sites, based on my subjective definitions of both. Since I am using Mullvad Browser, which comes with uBlock Origin out of the box, advertisements were never (at least not obviously or explicitly) displayed in search results.
+
+## Limitations
+
+A few limitations are listed below:
+
+- Mullvad Browser
+    - I could be treated differently by each of the search engines on this basis
+- Multi-search queries
+    - It is possible data is being shared between search engines on the backend, resulting in some search engines showing CAPTCHAs based on searches in other search engines that match the current search. This seems slightly unlikely, but DuckDuckGo does primarily use Bing on the backend, so it's possible.
+
+## Description of My Search Habits
+
+All searches are tracked in a Markdown file [4], but at a high level, I was working, reading, and doing other general browsing. I don't think it's fair to say these are my normal habits because there was a non-zero amount of friction added by marking down this data as I used the web, but I didn't consciously make any changes to how I search the web.
+
+## Selected Search Engines
+
+- Google
+- Startpage
+- Brave Search
+- noai.duckduckgo.com
+- Bing
+- Ecosia
+- Qwant
+- Mojeek
+- Yahoo
+
+## Unused Search Engines
+
+- Kagi
+    - Kagi lacks a privacy-respecting payment method. As such, this is a non-starter for me. If there were a browser that accepted crypto and had accounts similar to Mullvad's, I would consider using it.
+- SearXNG
+    - SearXNG is what I like to use, but it does have some drawbacks. Specifically, if you aren't sharing your SearXNG instance with other people, the IP address of your server will get tied to your identity for tracking, reducing many of the privacy benefits associated with using a VPN. Additionally, since my SearXNG instance is hosted on Hetzner, it frequently returns no results due to all upstreams replying with CAPTCHAs. While an interesting concept, I find it breaks down in practice.
+- Perplexity
+    - I'll write about this.
+
+[1] - https://www.wired.com/story/using-a-vpn-may-subject-you-to-nsa-spying/
+[2] - https://addons.mozilla.org/en-US/firefox/addon/multi-engine-search/
+[3] - https://mullvad.net/en/browser/
+[4] - TODO
diff --git a/python/search-engines/query.py b/python/search-engines/query.py
@@ -0,0 +1,80 @@
+import csv
+import os
+from enum import Enum
+import time
+
+
+# list of search engines
+class Engine(Enum):
+    GOOGLE = 'google'
+    STARTPAGE = 'startpage'
+    BRAVE_SEARCH = 'brave_search'
+    DDG = 'ddg'
+    BING = 'bing'
+    ECOSIA = 'ecosia'
+    QWANT = 'qwant'
+    MOJEEK = 'mojeek'
+    YAHOO = 'yahoo'
+
+class Query:
+    query_message: str
+    engine: Engine
+    time: float
+    captcha_hit: bool
+    captcha_time: float
+    pow_captcha: bool
+    answer_index: int
+    slop_sites_top_5: int
+
+
+def create_query():
+    current = time.time()
+    query_message = input("Enter a search query: ")
+    query = Query()
+    query.query_message = query_message
+    query.time = current
+    return query
+
+
+def ensure_csv():
+    if not os.path.exists('search.csv') or os.path.getsize('search.csv') == 0:
+        with open('search.csv', 'w', newline='') as f:
+            f.write('query,engine,time,captcha_hit,captcha_time,pow_captcha,slop_sites_top_5,answer_index\n')
+
+def write_query(query):
+    # newline='' lets the csv module control line endings
+    with open('search.csv', 'a', newline='') as f:
+        csvwriter = csv.writer(f)
+        csvwriter.writerow([query.query_message, query.engine.value, query.time, query.captcha_hit, query.captcha_time, query.pow_captcha, query.slop_sites_top_5, query.answer_index])
+
+def get_csv_row_count():
+    with open('search.csv', 'r') as f:
+        # subtract one for the header row
+        return len(f.readlines()) - 1
+
+ensure_csv()
+query = create_query()
+
+for engine in Engine:
+    query.engine = engine
+    print("Searching for " + query.query_message + " with " + engine.value)
+
+    captcha = input("Did you see a captcha? (y/n): ")
+
+    if captcha == 'y':
+        query.captcha_time = float(input("How many seconds did it take to solve (-1 means failed to solve): "))
+        query.captcha_hit = True
+        query.pow_captcha = input("Was the captcha a PoW captcha? (y/n): ") == "y"
+    else:
+        query.pow_captcha = False
+        query.captcha_hit = False
+        query.captcha_time = 0.0
+
+    query.slop_sites_top_5 = int(input("Slop sites / SEO site count in the top 5: "))
+    query.answer_index = int(input("First index containing the answer (-1 means no answer on first page of results): "))
+
+    write_query(query)
+
+
+    if get_csv_row_count() % 5 == 0:
+        print("Rotate IP and clear fingerprint.")
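
Note on the logged data: the CSV written by query.py holds one row per engine per query, so per-engine CAPTCHA hit rates can be derived from it directly. The sketch below is not part of the commit. It assumes the column names from the header written by `ensure_csv()` and that booleans were stored as the strings `True` / `False`, which is what `csv.writer` produces for Python booleans. The names `captcha_hit_rates`, `should_rotate`, and `SAMPLE_ROWS` are hypothetical, and the sample rows are made up purely for illustration; they are not measurements. `should_rotate` assumes the intent is to rotate after every five complete queries (five queries times the number of engines), whereas the committed check counts individual rows.

```python
import csv
import io
from collections import defaultdict


def captcha_hit_rates(csv_file):
    """Return {engine: fraction of logged rows where a CAPTCHA was shown}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row['engine']] += 1
        # csv.writer stores Python booleans as the strings 'True' / 'False'
        if row['captcha_hit'] == 'True':
            hits[row['engine']] += 1
    return {engine: hits[engine] / totals[engine] for engine in totals}


def should_rotate(row_count, engine_count, queries_per_session=5):
    """True once a whole number of sessions' worth of rows has been logged.

    Assumption: one "search" means one query sent to every engine, so a
    session of 5 searches produces 5 * engine_count rows.
    """
    rows_per_session = engine_count * queries_per_session
    return row_count > 0 and row_count % rows_per_session == 0


# Made-up rows for illustration only; not real results.
SAMPLE_ROWS = (
    "query,engine,time,captcha_hit,captcha_time,pow_captcha,slop_sites_top_5,answer_index\n"
    "example query,google,0,True,12.5,False,1,2\n"
    "example query,mojeek,0,False,0,False,0,1\n"
    "another query,google,0,False,0,False,2,3\n"
)

rates = captcha_hit_rates(io.StringIO(SAMPLE_ROWS))
print(rates)
```

To run this against real data, pass an open file instead of the `io.StringIO` sample, for example `with open('search.csv', newline='') as f: captcha_hit_rates(f)`. With nine engines, `should_rotate(row_count, 9)` first returns true at 45 rows, so the rotation prompt would only appear after a query has been sent to every engine.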