eulaurarien/README.md

99 lines
4.4 KiB
Markdown
Raw Normal View History

2019-11-10 18:14:25 +01:00
# eulaurarien
Generates a host list of first-party trackers for ad-blocking.
2019-11-11 12:10:46 +01:00
The latest list is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
2019-11-10 18:14:25 +01:00
**DISCLAIMER:** I'm by no way an expert on this subject so my vocabulary or other stuff might be wrong. Use at your own risk.
## What's a first-party tracker?
Traditionally, websites load trackers scripts directly.
For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users.
In order to block those, one can simply block the host `trackercompany.com`.
However, to circumvent this easy block, tracker companies made the website using them load trackers from `somethingirelevant.website1.com`.
The latter being a DNS redirection to `website1.trackercompany.com`, directly pointing to a server serving the tracking script.
Those are the first-party trackers.
Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since:
1. Most ad-blocker don't support wildcards
2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com`
So the only solution is to block every `somethingirelevant.website1.com`-like subdomains known, which is a lot.
That's where this scripts comes in, to generate a list of such subdomains.
## How does this script work
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
2019-11-10 18:14:25 +01:00
It takes an input a list of websites with trackers included.
So far, this list is manually-generated from the list of clients of such first-party trackers
(latter we should use a general list of websites to be more exhaustive).
It open each ones of those websites (just the homepage) in a web browser, and record the domains of the network requests the page makes.
Additionaly, or alternatively, you can feed the script some browsing history and get domains from there.
2019-11-10 18:14:25 +01:00
It then find the DNS redirections of those domains, and compare with regexes of known tracking domains.
It finally outputs the matching ones.
## Requirements
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
2019-11-10 18:14:25 +01:00
Just to build the list, you can find an already-built list in the releases.
- Bash
2019-11-11 12:41:22 +01:00
- [Python 3.4+](https://www.python.org/)
- [progressbar2](https://pypi.org/project/progressbar2/)
- dnspython
- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up)
(if you don't want to collect the subdomains, you can skip the following)
2019-11-10 18:14:25 +01:00
- Firefox
- Selenium
- seleniumwire
## Usage
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
2019-11-11 12:10:46 +01:00
This is only if you want to build the list yourself.
If you just want to use the list, the latest build is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
It was build using additional sources not included in this repository for privacy reasons.
### Add personal sources
The list of websites provided in this script is by no mean exhaustive,
so adding your own browsing history will help create a better list.
Here's reference command for possible sources:
- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
2019-11-11 12:10:46 +01:00
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`
### Collect subdomains from websites
Just run `collect_subdomain.sh`.
This is a long step, and might be memory-intensive from time to time.
2019-11-11 12:10:46 +01:00
This step is optional if you already added personal sources.
Alternatively, you can get just download the list of subdomains used to generate the official block list here: <https://hostfiles.frogeye.fr/from_websites.cache.list> (put it in the `subdomains` folder).
### Extract tracking domains
Make sure your system is configured with a DNS server without limitation.
Then, run `filter_subdomain.sh`.
The files you need will be in the folder `dist`.
2019-11-10 18:29:16 +01:00
2019-11-10 18:14:25 +01:00
## Contributing
### Adding websites
Just add the URL to the relevant list: `websites/<source>.list`.
2019-11-10 18:14:25 +01:00
### Adding first-party trackers regex
Just add them to `regexes.py`.