This effectively removes the parallelism of filtering, which lengthens the processing time (5->8 hours), but it lets me experiment with the performance of this step, which I aim to improve drastically.
Generates a host list of first-party trackers for ad-blocking.
The latest list is available here: https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt
DISCLAIMER: I'm by no means an expert on this subject, so my vocabulary or other details might be wrong. Use at your own risk.
What's a first-party tracker?
Traditionally, websites load tracker scripts directly.
For example, website1.com and website2.com both load
https://trackercompany.com/trackerscript.js to track their users.
In order to block those, one can simply block the host trackercompany.com.
However, to circumvent this easy block, tracker companies made the websites using them load trackers from somethingirelevant.website1.com.
The latter is a DNS redirection to
website1.trackercompany.com, which points directly to a server serving the tracking script.
Those are the first-party trackers.
Blocking trackercompany.com doesn't work any more, and blocking
*.trackercompany.com isn't really possible since:
- Most ad-blockers don't support wildcards
- It's a DNS redirection, meaning that most ad-blockers will only see somethingirelevant.website1.com
So the only solution is to block every known
somethingirelevant.website1.com-like subdomain, which is a lot.
That's where this script comes in, to generate a list of such subdomains.
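To make the redirection concrete, here is a minimal sketch in Python. It does not query real DNS: `CNAME_RECORDS` is a hypothetical in-memory table that stands in for DNS answers, using only the illustrative names from the explanation above. The helper names `resolve_chain` and `naive_blocker_allows` are assumptions for this example, not functions from this repository.

```python
# Simulated DNS redirections (CNAME records). These are the illustrative
# names from the text, not real DNS data.
CNAME_RECORDS = {
    "somethingirelevant.website1.com": "website1.trackercompany.com",
}


def resolve_chain(host, records):
    """Follow redirections and return every name visited, in order."""
    chain = [host]
    seen = {host}
    while chain[-1] in records:
        target = records[chain[-1]]
        if target in seen:  # guard against redirection loops
            break
        chain.append(target)
        seen.add(target)
    return chain


def naive_blocker_allows(host, blocked_hosts):
    """A hosts-file style blocker only compares the requested name."""
    return host not in blocked_hosts


requested = "somethingirelevant.website1.com"
blocked = {"trackercompany.com"}

# The blocker lets the request through because it never sees the tracker's name.
print(naive_blocker_allows(requested, blocked))
# The redirection chain reveals where the request really ends up.
print(resolve_chain(requested, CNAME_RECORDS))
```

The point of the sketch is that the blocker's decision and the redirection chain are computed from different information, which is why the subdomain itself has to be on the block list.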
How does this script work?
It takes as input a list of websites with trackers included. So far, this list is manually generated from the client lists of such first-party trackers (later, we should use a general list of websites to be more exhaustive). It opens each of those websites (just the homepage) in a web browser and records the domains of the network requests the page makes.
Additionally, or alternatively, you can feed the script some browsing history and get domains from there.
It then finds the DNS redirections of those domains and compares them with regexes of known tracking domains. It finally outputs the matching ones.
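The last two steps (matching redirection targets against regexes, then outputting the matches) could be sketched as follows. This is an illustration under stated assumptions, not the repository's actual code: the pattern in `TRACKER_REGEXES` is hypothetical, the redirection targets are passed in as plain data instead of being resolved over DNS, and the `0.0.0.0 <host>` output format is assumed from common hosts-file conventions.

```python
import re

# Hypothetical pattern for the illustrative tracker from the section above.
# The project's real regexes are not shown in this document.
TRACKER_REGEXES = [re.compile(r"^.+\.trackercompany\.com\.?$")]


def is_tracker(redirection_targets, regexes=TRACKER_REGEXES):
    """Return True if any redirection target matches a known tracker regex."""
    return any(rx.match(target) for target in redirection_targets for rx in regexes)


def to_hosts_lines(subdomain_to_targets):
    """Build hosts-file lines for every subdomain that redirects to a tracker."""
    lines = []
    for subdomain, targets in sorted(subdomain_to_targets.items()):
        if is_tracker(targets):
            lines.append(f"0.0.0.0 {subdomain}")
    return lines


# Collected subdomains mapped to the names they redirect to (made-up data).
collected = {
    "somethingirelevant.website1.com": ["website1.trackercompany.com"],
    "www.website1.com": ["cdn.example.net"],
}
for line in to_hosts_lines(collected):
    print(line)
```

Note that the subdomain seen by the browser is what gets blocked, while the regex is applied to the name it redirects to.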
This is only needed to build the list; you can find an already-built list in the releases.
(if you don't want to collect the subdomains, you can skip the following)
This is only if you want to build the list yourself. If you just want to use the list, the latest build is available here: https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt. It was built using additional sources not included in this repository for privacy reasons.
Add personal sources
The list of websites provided in this script is by no means exhaustive, so adding your own browsing history will help create a better list. Here are reference commands for possible sources:
- Pi-hole: sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list
- Firefox: cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp
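The Firefox command works because `rev_host` stores each host reversed with a trailing dot, so reversing it and dropping the leading dot gives the hostname back. A rough Python equivalent, using only the standard library, might look like this. The function names are made up for this example, and unlike the shell pipeline it skips empty entries and sorts the result. As in the command above, it should be run on a copy of `places.sqlite`, not on the live profile file.

```python
import sqlite3


def host_from_rev_host(rev_host):
    """Reverse the stored string and drop one leading dot, like `rev | sed`."""
    host = rev_host[::-1]
    return host[1:] if host.startswith(".") else host


def hosts_from_places(db_path):
    """Read the distinct hosts from a copy of Firefox's places.sqlite."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "select distinct rev_host from moz_places"
        ).fetchall()
    finally:
        connection.close()
    # Skip NULL or empty entries, which do not correspond to a host.
    return sorted({host_from_rev_host(row[0]) for row in rows if row[0]})


print(host_from_rev_host("moc.elpmaxe.www."))  # www.example.com
```

Writing the result of `hosts_from_places` one host per line into a `.custom.list` file under `subdomains/` would mirror what the shell command does.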
Collect subdomains from websites
This is a long step, and might be memory-intensive from time to time.
This step is optional if you already added personal sources.
Alternatively, you can just download the list of subdomains used to generate the official block list here: https://hostfiles.frogeye.fr/from_websites.cache.list (put it in the
Extract tracking domains
Make sure your system is configured with a DNS server without limitation.
The files you need will be in the folder
Just add the URL to the relevant list:
Adding first-party trackers regex
Just add them to