Geoffrey Frogeye
7d01d016a5
It's not very performant by itself, especially since pyre2 isn't maintained nor really compilable or installable anymore. The performance seems to have decreased from 200 req/s to 0.2 req/s when using 512 threads, and to 80 req/s when using 64 threads. This might or might not be related, as the CPU doesn't seem to be the bottleneck. I will probably add support for host-based rules, matching the subdomains of such hosts (as for now there doesn't seem to be any other pattern for first-party trackers than subdomains, and this would be a very broad improvement in performance and in compatibility with existing lists), and convert the AdBlock lists to this format, only keeping domain-only rules.
93 lines
4 KiB
Markdown
# eulaurarien

Generates a host list of first-party trackers for ad-blocking.

The latest list is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>

**DISCLAIMER:** I'm by no means an expert on this subject, so my vocabulary or other details might be wrong. Use at your own risk.

## What's a first-party tracker?

Traditionally, websites load tracker scripts directly.
For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users.
In order to block those, one can simply block the host `trackercompany.com`.

However, to circumvent this easy block, tracker companies made the websites using them load trackers from `somethingirelevant.website1.com`.
The latter is a DNS redirection to `website1.trackercompany.com`, which points directly to a server serving the tracking script.
Those are the first-party trackers.

Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since:

1. Most ad-blockers don't support wildcards
2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com`

So the only solution is to block every known `somethingirelevant.website1.com`-like subdomain, which is a lot.
That's where this script comes in, to generate a list of such subdomains.

## How does this script work?

It takes as input a list of websites with trackers included.
So far, this list is manually generated from the list of clients of such first-party trackers
(later we should use a general list of websites to be more exhaustive).
It opens each of those websites (just the homepage) in a web browser, and records the domains of the network requests the page makes.

Additionally, or alternatively, you can feed the script some browsing history and get domains from there.

It then finds the DNS redirections of those domains, and compares them with regexes of known tracking domains.
It finally outputs the matching ones.

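The last two steps can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function name, the pattern, and the sample data are hypothetical, and the DNS redirections are assumed to have been resolved already and passed in as a plain dictionary.

```python
import re

# Hypothetical pattern for illustration; the real ones live in regexes.py.
TRACKER_PATTERNS = [
    re.compile(r"^.+\.trackercompany\.com\.?$"),
]


def find_first_party_trackers(redirections, patterns=TRACKER_PATTERNS):
    """Return the subdomains whose DNS redirection targets match a pattern.

    `redirections` maps each collected subdomain to the names its DNS
    redirection chain goes through (assumed to be resolved elsewhere).
    """
    matching = []
    for subdomain, targets in redirections.items():
        if any(p.match(t) for t in targets for p in patterns):
            matching.append(subdomain)
    return sorted(matching)


# Made-up sample data: only the first entry redirects to a known tracker.
example = {
    "somethingirelevant.website1.com": ["website1.trackercompany.com."],
    "www.website1.com": [],
    "cdn.website2.com": ["website2.cdnprovider.example."],
}
result = find_first_party_trackers(example)
print(result)
```

Note that the output contains the subdomain seen by the browser, not the tracker's own domain, since that is the name an ad-blocker gets to see.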
## Requirements

These are only needed to build the list yourself; an already-built list is available in the releases.

- Bash
- [Python 3.4+](https://www.python.org/)
- [progressbar2](https://pypi.org/project/progressbar2/)
- dnspython
- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up)

(if you don't want to collect the subdomains, you can skip the following)

- Firefox
- Selenium
- seleniumwire

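Since the re2 wrapper is optional, a script can fall back to the standard library when it is missing. Here is a minimal sketch of that idea, assuming the wrapper is importable as `re2` and offers the same `compile` and `match` calls as `re`; the pattern is hypothetical and this is not necessarily how the scripts in this repository handle it.

```python
try:
    import re2 as re  # optional, faster engine (assumed importable as `re2`)
except ImportError:
    import re  # standard library fallback, slower but always available

# Hypothetical pattern for illustration only.
pattern = re.compile(r"^.+\.trackercompany\.com$")
is_tracker = pattern.match("website1.trackercompany.com") is not None
print(is_tracker)
```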
## Usage

This is only if you want to build the list yourself.
If you just want to use the list, the latest build is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
It was built using additional sources not included in this repository for privacy reasons.

### Add personal sources

The list of websites provided in this script is by no means exhaustive,
so adding your own browsing history will help create a better list.
Here are reference commands for possible sources:

- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`

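The Firefox command reverses each `rev_host` value and strips the leading dot. The same transformation can be sketched in Python, which may help when adapting it to another source. The function name and sample values are made up, and the samples assume that `rev_host` holds the host name reversed with a trailing dot, which is what the `rev | sed` pipeline implies.

```python
def rev_host_to_domain(rev_host):
    """Reverse a rev_host value and drop one leading dot, like `rev | sed`."""
    domain = rev_host[::-1]
    if domain.startswith("."):
        domain = domain[1:]
    return domain


# Made-up sample values, not taken from a real profile.
sample_rev_hosts = ["moc.1etisbew.", "moc.1etisbew.www.", "moc.1etisbew."]

# A set removes duplicates, mirroring `select distinct`.
domains = sorted({rev_host_to_domain(h) for h in sample_rev_hosts})
print(domains)
```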
### Collect subdomains from websites

Just run `collect_subdomain.sh`.
This is a long step, and might be memory-intensive from time to time.

This step is optional if you already added personal sources.
Alternatively, you can just download the list of subdomains used to generate the official block list here: <https://hostfiles.frogeye.fr/from_websites.cache.list> (put it in the `subdomains` folder).

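Conceptually, this step reduces the URLs requested by each page to their host names. Here is a minimal sketch of that reduction, assuming the request URLs have already been captured; the browser automation is left out, and the function name and sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def subdomains_from_requests(urls):
    """Return the sorted, de-duplicated host names found in request URLs."""
    hosts = set()
    for url in urls:
        # `hostname` is lowercased, and None for URLs without a network location
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


# Made-up capture of the requests made by one homepage.
captured = [
    "https://www.website1.com/",
    "https://somethingirelevant.website1.com/trackerscript.js",
    "https://WWW.website1.com/style.css",
    "data:image/png;base64,AAAA",
]
hosts = subdomains_from_requests(captured)
print(hosts)
```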
### Extract tracking domains

Make sure your system is configured with a DNS server that doesn't impose limitations (such as rate limiting), as this step makes a lot of DNS requests.
Then, run `filter_subdomain.sh`.
The files you need will be in the folder `dist`.

## Contributing

### Adding websites

Just add the URL to the relevant list: `websites/<source>.list`.

### Adding first-party tracker regexes

Just add them to `regexes.py`.