Generates a host list of first-party trackers for ad-blocking. https://hostfiles.frogeye.fr

Geoffrey “Frogeye” Preud'homme dcf39c9582 Put packing in parsing thread Why did I think this would be a good idea? - value don't need to be packed most of the time, but we don't know that early - packed domain (it's one most of the time) is way larger than its unpacked counterpart		2019-12-16 10:38:37 +01:00
dist	Added possibility to add personal sources	2019-11-11 11:19:46 +01:00
rules	Tracker: intendmedia?	2019-12-08 01:32:49 +01:00
rules_adblock	Improved rules handling	2019-12-03 08:48:12 +01:00
rules_asn	Workflow: Automatically import IP ranges from ASN	2019-12-13 08:23:38 +01:00
rules_hosts	Improved rules handling	2019-12-03 08:48:12 +01:00
rules_ip	Workflow: Automatically import IP ranges from ASN	2019-12-13 08:23:38 +01:00
subdomains	Added possibility to add personal sources	2019-11-11 11:19:46 +01:00
temp	Separated DNS resolution from filtering	2019-12-02 19:03:08 +01:00
tests	Tracker: intendmedia?	2019-12-08 01:32:49 +01:00
websites	Added RED by SFR website	2019-11-13 18:14:56 +01:00
.gitignore	Workflow: SQL -> Tree	2019-12-15 15:56:26 +01:00
adblock_to_domain_list.py	Removed third-parties from easyprivacy	2019-12-05 01:19:10 +01:00
collect_subdomains.py	Added some delay for websites subdomains collecting	2019-11-14 06:29:24 +01:00
collect_subdomains.sh	Fix log in scripts	2019-12-07 18:45:48 +01:00
database.py	Put packing in parsing thread	2019-12-16 10:38:37 +01:00
eulaurarien.sh	Improved rules handling	2019-12-03 08:48:12 +01:00
export.py	Added level	2019-12-16 09:31:29 +01:00
feed_asn.py	Reworked match and node system	2019-12-15 23:13:25 +01:00
feed_dns.old.py	Reworked how paths work	2019-12-15 22:21:05 +01:00
feed_dns.py	Put packing in parsing thread	2019-12-16 10:38:37 +01:00
feed_rules.py	Put packing in parsing thread	2019-12-16 10:38:37 +01:00
fetch_resources.sh	Typo in source	2019-12-15 01:52:45 +01:00
filter_subdomains.py	Optimized IP matching	2019-12-08 01:23:36 +01:00
filter_subdomains.sh	Worflow: Fixed rules counts	2019-12-13 18:36:08 +01:00
import_rules.sh	Workflow: SQL -> Tree	2019-12-15 15:56:26 +01:00
new_workflow.sh	Workflow: Can now import DnsMass output	2019-12-15 00:28:08 +01:00
README.md	Can now use AdBlock lists for tracking matching	2019-11-15 08:57:31 +01:00
resolve_subdomains.sh	Added intermediate representation for DNS datasets	2019-12-13 21:59:35 +01:00

README.md

eulaurarien

Generates a host list of first-party trackers for ad-blocking.

The latest list is available here: https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt

DISCLAIMER: I'm by no means an expert on this subject, so my vocabulary or other details might be wrong. Use at your own risk.

What's a first-party tracker?

Traditionally, websites load tracker scripts directly. For example, website1.com and website2.com both load https://trackercompany.com/trackerscript.js to track their users. To block those, one can simply block the host trackercompany.com.

However, to circumvent this easy block, tracker companies made the websites using them load trackers from somethingirelevant.website1.com. The latter is a DNS redirection to website1.trackercompany.com, which points directly to a server serving the tracking script. Those are the first-party trackers.

Blocking trackercompany.com doesn't work any more, and blocking *.trackercompany.com isn't really possible since:

Most ad-blockers don't support wildcards
It's a DNS redirection, meaning that most ad-blockers will only see somethingirelevant.website1.com

So the only solution is to block every known somethingirelevant.website1.com-like subdomain, and there are a lot of them. That's where this script comes in: it generates a list of such subdomains.
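To make the problem concrete, here is a minimal sketch of why a plain host block misses a first-party tracker. The `CNAMES` table, `BLOCKLIST` and both helper functions are hypothetical and only illustrate the explanation above; no real DNS lookup is performed.

```python
# Hypothetical DNS data: the first-party subdomain redirects to the tracker.
CNAMES = {"somethingirelevant.website1.com": "website1.trackercompany.com"}

# A classic block list that only knows the tracker company's own host.
BLOCKLIST = {"trackercompany.com"}


def is_blocked(requested_host, blocklist):
    """Exact-match blocking, as with a plain hosts file (no wildcards)."""
    return requested_host in blocklist


def redirection_target(requested_host, cnames):
    """Return the host the request is redirected to, or None if there is none."""
    return cnames.get(requested_host)


# The traditional tracker is blocked...
print(is_blocked("trackercompany.com", BLOCKLIST))
# ...but the first-party subdomain is not, although it leads to the same company.
print(is_blocked("somethingirelevant.website1.com", BLOCKLIST))
print(redirection_target("somethingirelevant.website1.com", CNAMES))
```

The ad-blocker only ever sees the requested host name, so the subdomain itself has to be on the list.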

How does this script work?

It takes as input a list of websites with trackers included. So far, this list is manually generated from the list of clients of such first-party trackers (later we should use a general list of websites to be more exhaustive). It opens each of those websites (just the homepage) in a web browser, and records the domains of the network requests the page makes.

Additionally, or alternatively, you can feed the script some browsing history and get domains from there.

It then finds the DNS redirections of those domains, and compares them with regexes of known tracking domains. It finally outputs the matching ones.
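The matching step described above can be sketched roughly as follows. This is not the code used in this repository: the function names, the `cname_records` dictionary standing in for real DNS answers, and the example regex are all assumptions made for illustration.

```python
import re

# Hypothetical regex for a known tracking domain (illustration only).
TRACKER_REGEXES = [re.compile(r"^([a-z0-9-]+\.)*trackercompany\.com$")]


def resolve_chain(domain, cname_records, max_depth=10):
    """Follow DNS redirections from `domain`, returning every target met.

    `cname_records` is a plain dict standing in for real DNS lookups.
    `max_depth` guards against redirection loops.
    """
    chain = []
    current = domain
    for _ in range(max_depth):
        target = cname_records.get(current)
        if target is None:
            break
        chain.append(target)
        current = target
    return chain


def find_first_party_trackers(domains, cname_records, regexes=TRACKER_REGEXES):
    """Return the collected domains whose redirections match a tracker regex."""
    matches = []
    for domain in domains:
        chain = resolve_chain(domain, cname_records)
        if any(rx.search(target) for rx in regexes for target in chain):
            matches.append(domain)
    return matches


records = {
    "somethingirelevant.website1.com": "website1.trackercompany.com",
    "www.website1.com": "website1.cdn.example",
}
collected = ["www.website1.com", "somethingirelevant.website1.com"]
print(find_first_party_trackers(collected, records))
```

Note that the output is the originally requested subdomain, not the redirection target, since that is the name an ad-blocker will see.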

Requirements

These are only needed to build the list yourself; an already-built list is available in the releases.

Bash
Python 3.4+
progressbar2
dnspython
A Python wrapper for re2 (optional, just speeds things up)

(if you don't want to collect the subdomains, you can skip the following)

Firefox
Selenium
seleniumwire

Usage

This is only needed if you want to build the list yourself. If you just want to use the list, the latest build is available at https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt (it was built using additional sources that are not included in this repository for privacy reasons).

Add personal sources

The list of websites provided with this script is by no means exhaustive, so adding your own browsing history will help create a better list. Here are reference commands for possible sources:

Pi-hole: sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list
Firefox: cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp
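For reference, the Firefox command above can also be written in Python using only the standard library. This is a sketch, not part of the repository: `hosts_from_places` is a hypothetical helper, and it assumes that `rev_host` holds the host name reversed with a trailing dot, which is what the `rev | sed` pipeline undoes.

```python
import sqlite3


def hosts_from_places(connection):
    """Return the distinct host names stored in a Firefox places database.

    `connection` is an open sqlite3 connection to a *copy* of places.sqlite.
    """
    rows = connection.execute(
        "SELECT DISTINCT rev_host FROM moz_places WHERE rev_host IS NOT NULL"
    )
    hosts = set()
    for (rev_host,) in rows:
        # Reverse the string back, then drop the dot that ends up in front.
        host = rev_host[::-1].lstrip(".")
        if host:
            hosts.add(host)
    return sorted(hosts)


# Small in-memory stand-in for places.sqlite, so the sketch runs anywhere.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE moz_places (rev_host TEXT)")
demo.executemany(
    "INSERT INTO moz_places VALUES (?)",
    [("moc.elpmaxe.www.",), ("moc.elpmaxe.www.",), (None,), (".",)],
)
print(hosts_from_places(demo))
```

As in the shell version, work on a copy of places.sqlite rather than on the file in the live profile, then write one host per line to a file in the subdomains folder.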

Collect subdomains from websites

Just run collect_subdomains.sh. This is a long step, and it can be memory-intensive at times.

This step is optional if you already added personal sources. Alternatively, you can just download the list of subdomains used to generate the official block list here: https://hostfiles.frogeye.fr/from_websites.cache.list (put it in the subdomains folder).

Extract tracking domains

Make sure your system is configured to use a DNS server that imposes no limitations. Then, run filter_subdomains.sh. The files you need will be in the dist folder.
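The exact format of the files written to dist is not described here. As an illustration only, a list of matching subdomains could be turned into conventional hosts-file entries like this; the `to_hosts_lines` helper and the `0.0.0.0` sink address are assumptions, not something taken from the scripts.

```python
def to_hosts_lines(domains, sink_ip="0.0.0.0"):
    """Format domains as hosts-file entries, deduplicated and sorted."""
    return ["{} {}".format(sink_ip, domain) for domain in sorted(set(domains))]


for line in to_hosts_lines(["b.website1.com", "a.website2.com", "b.website1.com"]):
    print(line)
```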

Contributing

Adding websites

Just add the URL to the relevant list: websites/<source>.list.

Adding first-party trackers regex

Just add them to regexes.py.