Updated README
Split in two actually (program and list). Closes #3 Also, Closes #1 Because I forgot to do it earlier.
newworkflow
4 changed files with 170 additions and 72 deletions
149 README.md
74 dist/README.md
7 export_lists.sh
12 fetch_resources.sh
@@ -1,98 +1,133 @@ |
|||
# eulaurarien |
|||
|
|||
Generates a host list of first-party trackers for ad-blocking. |
|||
This program generates a list of every hostname that is a DNS redirection to an entry in a list of DNS zones and IP networks. |
|||
|
|||
The latest list is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt> |
|||
It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md) (learn about first-party trackers by following this link). |
|||
|
|||
**DISCLAIMER:** I'm by no means an expert on this subject, so my vocabulary or other details might be wrong. Use at your own risk. |
|||
If you want to contribute but don't want to create an account on this forge, contact me the way you like: <https://geoffrey.frogeye.fr> |
|||
|
|||
## What's a first-party tracker? |
|||
## How does this work? |
|||
|
|||
Traditionally, websites load tracker scripts directly. |
|||
For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users. |
|||
In order to block those, one can simply block the host `trackercompany.com`. |
|||
This program takes as input: |
|||
|
|||
However, to circumvent this easy block, tracker companies made the website using them load trackers from `somethingirelevant.website1.com`. |
|||
The latter being a DNS redirection to `website1.trackercompany.com`, directly pointing to a server serving the tracking script. |
|||
Those are the first-party trackers. |
|||
- Lists of hostnames to match |
|||
- Lists of DNS zones to match (a domain and its subdomains) |
|||
- Lists of IP addresses / IP networks to match |
|||
- Lists of Autonomous System numbers to match |
|||
- An enormous quantity of DNS records |
|||
|
|||
Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since: |
|||
It outputs the hostnames that are DNS redirections to any item in the provided lists. |
|||
|
|||
1. Most ad-blockers don't support wildcards |
|||
2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com` |
|||
DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns). |
|||
|
|||
So the only solution is to block every known `somethingirelevant.website1.com`-like subdomain, which is a lot. |
|||
That's where this script comes in, to generate a list of such subdomains. |
|||
Those subdomains can either be provided as is, come from [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that). |
|||
|
|||
## How does this script work |
|||
## Usage |
|||
|
|||
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this. |
|||
Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md). |
|||
|
|||
It takes as input a list of websites with trackers included. |
|||
So far, this list is manually-generated from the list of clients of such first-party trackers |
|||
(later we should use a general list of websites to be more exhaustive). |
|||
It opens each of those websites (just the homepage) in a web browser, and records the domains of the network requests the page makes. |
|||
The following is for the people wanting to build their own list. |
|||
|
|||
Additionally, or alternatively, you can feed the script some browsing history and get domains from there. |
|||
### Requirements |
|||
|
|||
It then finds the DNS redirections of those domains, and compares them with regexes of known tracking domains. |
|||
It finally outputs the matching ones. |
|||
Depending on the sources you'll be using to generate the list, you'll need to install some of the following: |
|||
|
|||
## Requirements |
|||
- [Bash](https://www.gnu.org/software/bash/bash.html) |
|||
- [Coreutils](https://www.gnu.org/software/coreutils/) |
|||
- [curl](https://curl.haxx.se) |
|||
- [pv](http://www.ivarch.com/programs/pv.shtml) |
|||
- [Python 3.4+](https://www.python.org/) |
|||
- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself) |
|||
- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source) |
|||
- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source) |
|||
- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source) |
|||
- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source) |
|||
|
|||
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this. |
|||
### Create a new database |
|||
|
|||
These are only needed to build the list; you can find an already-built list in the releases. |
|||
The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASN, IPs, hostnames, zones…) and every entity leading to it. |
|||
For now there's no way to remove data from it, so here's the command to recreate it: `./db.py --initialize`. |
|||
|
|||
- Bash |
|||
- [Python 3.4+](https://www.python.org/) |
|||
- [progressbar2](https://pypi.org/project/progressbar2/) |
|||
- dnspython |
|||
- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up) |
|||
### Gather external sources |
|||
|
|||
(if you don't want to collect the subdomains, you can skip the following) |
|||
External sources are not stored in this repository. |
|||
You'll need to fetch them by running `./fetch_resources.sh`. |
|||
Those include: |
|||
|
|||
- Firefox |
|||
- Selenium |
|||
- seleniumwire |
|||
- Third-party trackers lists |
|||
- TLD lists (used to test the validity of hostnames) |
|||
- List of public DNS resolvers (for DNS resolving from subdomains) |
|||
- Top 1M subdomains |
|||
|
|||
## Usage |
|||
### Import rules into the database |
|||
|
|||
You need to put the lists of rules for matching in the different subfolders: |
|||
|
|||
- `rules`: Lists of DNS zones |
|||
- `rules_ip`: Lists of IP networks (for IP addresses append `/32`) |
|||
- `rules_asn`: Lists of Autonomous System numbers (IP ranges will be deduced from them) |
|||
- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted) |
|||
- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists |
|||
|
|||
See the provided examples for syntax. |
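
To illustrate the `/32` convention for `rules_ip`, here is a minimal sketch using Python's standard `ipaddress` module. It is not the program's own matching code; the function name `matches_any` and the sample rules (taken from documentation-reserved ranges) are assumptions made for the example.

```python
import ipaddress


def matches_any(address, rules):
    """Return True if `address` falls inside any network listed in `rules`.

    `rules` is a hypothetical list of lines as they might appear in a
    `rules_ip` file: one IP network per line, single addresses as /32.
    """
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(rule.strip()) for rule in rules)


# Illustrative rules only (documentation ranges, not real trackers).
rules = ["192.0.2.0/24", "198.51.100.7/32"]

print(matches_any("192.0.2.42", rules))    # inside the /24 network
print(matches_any("198.51.100.7", rules))  # the single /32 address
print(matches_any("198.51.100.8", rules))  # a neighbour of the /32, not matched
```

A `/32` network contains exactly one IPv4 address, which is why single addresses can be expressed in the same format as networks.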
|||
|
|||
In each folder: |
|||
|
|||
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this. |
|||
- `first-party.ext` will be the only files considered for the first-party variant of the list |
|||
- `*.cache.ext` are from external sources, and thus might be deleted / overwritten |
|||
- `*.custom.ext` are for sources that you don't want committed |
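
The naming convention above can be sketched as a small classifier. This is only an illustration of the patterns, assuming `ext` stands for any file extension; the function `classify` and the sample filenames are hypothetical and not part of the repository.

```python
from fnmatch import fnmatchcase


def classify(filename):
    """Classify a rules file by the naming convention described above."""
    if fnmatchcase(filename, "first-party.*"):
        return "first-party"  # only file used for the first-party variant
    if fnmatchcase(filename, "*.cache.*"):
        return "cache"        # fetched from external sources, may be overwritten
    if fnmatchcase(filename, "*.custom.*"):
        return "custom"       # personal sources, not meant to be committed
    return "regular"


# Hypothetical filenames, for illustration only.
for name in ["first-party.list", "easyprivacy.cache.txt",
             "my-pihole.custom.list", "other.list"]:
    print(name, "->", classify(name))
```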
|||
|
|||
This is only if you want to build the list yourself. |
|||
If you just want to use the list, the latest build is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt> |
|||
It was built using additional sources not included in this repository for privacy reasons. |
|||
Then, run `./import_rules.sh`. |
|||
|
|||
### Add personal sources |
|||
### Add subdomains |
|||
|
|||
The list of websites provided in this script is by no means exhaustive, |
|||
so adding your own browsing history will help create a better list. |
|||
If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive), |
|||
the top 1M subdomains provided might not be enough. |
|||
|
|||
You can add them into the `subdomains` folder. |
|||
It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files. |
|||
|
|||
#### Add personal sources |
|||
|
|||
Adding your own browsing history will help create a better-suited subdomain list. |
|||
Here are reference commands for possible sources: |
|||
|
|||
- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list` |
|||
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp` |
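
The `rev | sed 's|^\.||'` part of the Firefox command is needed because `moz_places.rev_host` stores each hostname reversed, with a trailing dot. A minimal Python sketch of the same transformation follows; the function name `host_from_rev_host` is an assumption made for the example.

```python
def host_from_rev_host(rev_host):
    """Turn a Firefox `rev_host` value back into a regular hostname.

    Mirrors `rev | sed 's|^\\.||'`: reverse the string, then drop the
    single leading dot left over from the trailing dot of `rev_host`.
    """
    host = rev_host[::-1]
    return host[1:] if host.startswith(".") else host


print(host_from_rev_host("moc.elpmaxe.www."))
```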
|||
|
|||
### Collect subdomains from websites |
|||
#### Collect subdomains from websites |
|||
|
|||
You can add the website URLs into the `websites` folder. |
|||
It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files. |
|||
|
|||
Just run `collect_subdomain.sh`. |
|||
Then, run `collect_subdomain.sh`. |
|||
This is a long step, and might be memory-intensive from time to time. |
|||
|
|||
This step is optional if you already added personal sources. |
|||
Alternatively, you can just download the list of subdomains used to generate the official block list here: <https://hostfiles.frogeye.fr/from_websites.cache.list> (put it in the `subdomains` folder). |
|||
> **Note:** For first-party tracking, a list of subdomains issued from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list> |
|||
|
|||
### Resolve DNS records |
|||
|
|||
Once you've added subdomains, you'll need to resolve them to get their DNS records. |
|||
The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory. |
|||
|
|||
Then, run `./resolve_subdomains.sh`. |
|||
Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet count. |
|||
|
|||
### Extract tracking domains |
|||
> Some VPS providers might detect this as a DDoS attack and cut the network access. |
|||
> Some Wi-Fi connections can be rendered unusable for other uses, some routers might cease to work. |
|||
> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow ethernet link (Raspberry Pi < 4). |
|||
|
|||
Make sure your system is configured with a DNS server without limitation. |
|||
Then, run `filter_subdomain.sh`. |
|||
The files you need will be in the folder `dist`. |
|||
The DNS records will automatically be imported into the database. |
|||
If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script. |
|||
|
|||
## Contributing |
|||
### Import DNS records from Rapid7 |
|||
|
|||
### Adding websites |
|||
Just run `./import_rapid7.sh`. |
|||
This will download about 35 GiB of data, but only the matching records will be stored (a few MiB for the tracking rules). |
|||
Note the download speed will most likely be limited by the database operation throughput (fast RAM will help). |
|||
|
|||
Just add the URL to the relevant list: `websites/<source>.list`. |
|||
### Export the lists |
|||
|
|||
### Adding first-party trackers regex |
|||
For the tracking list, use `./export_lists.sh`; the output will be in the `dist` folder (please change the links before distributing them). |
|||
For other purposes, tinker with the `./export.py` program. |
|||
|
|||
Just add them to `regexes.py`. |
@@ -0,0 +1,74 @@ |
|||
# Geoffrey Frogeye's block list of first-party trackers |
|||
|
|||
## What's a first-party tracker? |
|||
|
|||
A tracker is a script put on many websites to gather information about the visitor. |
|||
They can be used for multiple reasons: statistics, risk management, marketing, ad serving… |
|||
In any case, they are a threat to Internet users' privacy and many may want to block them. |
|||
|
|||
Traditionally, trackers are served from a third party. |
|||
For example, `website1.com` and `website2.com` both load their tracking script from `https://trackercompany.com/trackerscript.js`. |
|||
In order to block those, one can simply block the hostname `trackercompany.com`, which is what most ad blockers do. |
|||
|
|||
However, to circumvent this block, tracker companies made the websites using them load trackers from `somestring.website1.com`. |
|||
The latter is a DNS redirection to `website1.trackercompany.com`, which points directly to an IP address belonging to the tracking company. |
|||
Those are called first-party trackers. |
|||
|
|||
In order to block those trackers, ad blockers would need to block every subdomain pointing to anything under `trackercompany.com` or to their network. |
|||
Unfortunately, most don't support those blocking methods as they are not DNS-aware, i.e. they only see `somestring.website1.com`. |
|||
|
|||
This list is an inventory of every `somestring.website1.com` found, to allow non-DNS-aware ad blockers to still block first-party trackers. |
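
The redirection check described in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code used to build the list: DNS records are given as a plain dictionary mapping a hostname to its redirection target, and the names `in_zone`, `resolve_chain` and `is_first_party_tracker` are hypothetical.

```python
def in_zone(hostname, zone):
    """True if `hostname` is the zone itself or one of its subdomains."""
    return hostname == zone or hostname.endswith("." + zone)


def resolve_chain(hostname, redirections, max_depth=16):
    """Follow redirections from `hostname`, stopping on loops that are too long."""
    chain = [hostname]
    while chain[-1] in redirections and len(chain) <= max_depth:
        chain.append(redirections[chain[-1]])
    return chain


def is_first_party_tracker(hostname, redirections, tracker_zones):
    """True if `hostname` redirects to something under a tracker zone."""
    targets = resolve_chain(hostname, redirections)[1:]
    return any(in_zone(t, z) for t in targets for z in tracker_zones)


# Example data reusing the hostnames from the explanation above.
redirections = {"somestring.website1.com": "website1.trackercompany.com"}
tracker_zones = ["trackercompany.com"]

print(is_first_party_tracker("somestring.website1.com", redirections, tracker_zones))
print(is_first_party_tracker("www.website1.com", redirections, tracker_zones))
```

Only the redirection targets are tested against the tracker zones, so a hostname that sits directly under `trackercompany.com` is not reported by this sketch; an ordinary third-party block list already covers that case.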
|||
|
|||
## List variants |
|||
|
|||
### First-party trackers (recommended) |
|||
|
|||
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt> |
|||
- Raw list: <https://hostfiles.frogeye.fr/firstparty-trackers.txt> |
|||
|
|||
This list contains every hostname redirecting to [a hand-picked list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/rules/first-party.list). |
|||
It should be safe from false-positives. |
|||
Don't be afraid of the size of the list, as this is due to the nature of first-party trackers: a single tracker generates at least one hostname per client (typically two). |
|||
|
|||
### First-party only trackers |
|||
|
|||
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt> |
|||
- Raw list: <https://hostfiles.frogeye.fr/firstparty-only-trackers.txt> |
|||
|
|||
This is the same list as above, albeit not containing the hostnames under the tracking company domains. |
|||
This reduces the size of the list, but it no longer protects against third-party tracking. |
|||
Use in conjunction with other block lists. |
|||
|
|||
### Multi-party trackers |
|||
|
|||
- Hosts file: <https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt> |
|||
- Raw list: <https://hostfiles.frogeye.fr/multiparty-trackers.txt> |
|||
|
|||
As first-party trackers usually evolve from third-party trackers, this list contains every hostname redirecting to trackers found in existing lists of third-party trackers (see next section). |
|||
Since the latter were not designed with first-party trackers in mind, they are likely to contain false-positives. |
|||
On the other hand, they might protect against first-party trackers that we're not aware of / have not yet confirmed. |
|||
|
|||
#### Source of third-party trackers |
|||
|
|||
- [EasyPrivacy](https://easylist.to/easylist/easyprivacy.txt) |
|||
|
|||
(yes there's only one for now. A lot of existing ones cause a lot of false positives) |
|||
|
|||
### Multi-party only trackers |
|||
|
|||
- Hosts file: <https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt> |
|||
- Raw list: <https://hostfiles.frogeye.fr/multiparty-only-trackers.txt> |
|||
|
|||
This is the same list as above, albeit not containing the hostnames under the tracking company domains. |
|||
This reduces the size of the list, but it no longer protects against third-party tracking. |
|||
Use in conjunction with other block lists, especially the ones used to generate this list in the previous section. |
|||
|
|||
## Meta |
|||
|
|||
In case of false positives/negatives, or any other question, contact me the way you like: <https://geoffrey.frogeye.fr> |
|||
|
|||
The software used to generate this list is available here: <https://git.frogeye.fr/geoffrey/eulaurarien> |
|||
|
|||
Some of the first-party trackers included in this list have been found by: |
|||
- [Aeris](https://imirhil.fr/) |
|||
- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors |