# eulaurarien

This program generates a list of every hostname that is a DNS redirection to a given list of DNS zones and IP networks.

It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://hostfiles.frogeye.fr) (learn about first-party trackers by following this link).

If you want to contribute but don't want to create an account on this forge, contact me however you like: <https://geoffrey.frogeye.fr>
## How does this work

This program takes as input:

- Lists of hostnames to match
- Lists of DNS zones to match (a domain and its subdomains)
- Lists of IP addresses / IP networks to match
- Lists of Autonomous System numbers to match
- An enormous quantity of DNS records

It outputs the hostnames that are DNS redirections to any item in the provided lists.

DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).

Those subdomains can be provided as is, come from the [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), come from your browsing history, or come from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all of that).
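The matching step can be sketched as follows (a simplified shell illustration with made-up hostnames and a made-up zone, not the project's actual code): given CNAME records as `hostname target` pairs and a tracked zone, keep the hostnames whose redirection target falls inside that zone.

```shell
# Hypothetical tracked zone and CNAME records ("hostname target" pairs)
zone="trackingcompany.example"
printf '%s\n' \
  'metrics.shop.example cdn.trackingcompany.example' \
  'static.shop.example cdn.honestcdn.example' |
while read -r host target; do
  # Keep the hostname if its target is the zone itself or a subdomain of it
  case "$target" in
    "$zone"|*".$zone") echo "$host" ;;
  esac
done
# → metrics.shop.example
```

Here `metrics.shop.example` is kept because its CNAME points into the tracked zone. The real program additionally follows multi-step redirection chains and matches IP networks and ASNs, which is why it needs the database described below.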
## Usage

Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://hostfiles.frogeye.fr).

The following is for people who want to build their own list.
### Requirements

Depending on the sources you'll be using to generate the list, you'll need to install some of the following:

- [Bash](https://www.gnu.org/software/bash/bash.html)
- [Coreutils](https://www.gnu.org/software/coreutils/)
- [Gawk](https://www.gnu.org/software/gawk/)
- [curl](https://curl.haxx.se)
- [pv](http://www.ivarch.com/programs/pv.shtml)
- [Python 3.4+](https://www.python.org/)
- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry, I can't help myself)
- [numpy](https://www.numpy.org/)
- [python-abp](https://pypi.org/project/python-abp/) (only if you intend to use AdBlock rules as a rule source)
- [jq](http://stedolan.github.io/jq/) (only if you have a Rapid7 API key)
- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source)
- [markdown2](https://pypi.org/project/markdown2/) (only if you intend to generate the index webpage)
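To quickly see which of the command-line tools are already installed (tool names here are assumed to match the executables shipped by the packages above), you can run something like:

```shell
# Report which of the required command-line tools are present in $PATH
for tool in bash awk curl pv python3 jq massdns firefox; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```

The Python packages (coloredlogs, numpy, etc.) are installable from PyPI with `pip`.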
### Create a new database

The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASNs, IPs, hostnames, zones…) and every entity leading to them.

It exists because the list cannot be generated in one pass, as the links of a DNS redirection chain do not have to be input in order.

You can purge old records from the database by running `./prune.sh`.

When you remove a source of data, remove its corresponding file in `last_updates` to fix the pruning process.
### Gather external sources

External sources are not stored in this repository.

You'll need to fetch them by running `./fetch_resources.sh`.

Those include:

- Third-party tracker lists
- TLD lists (used to test the validity of hostnames)
- Lists of public DNS resolvers (for DNS resolving from subdomains)
- Top 1M subdomains
### Import rules into the database

You need to put the lists of rules for matching in the different subfolders:

- `rules`: Lists of DNS zones
- `rules_ip`: Lists of IP networks (for IP addresses, append `/32`)
- `rules_asn`: Lists of Autonomous System numbers (IP ranges will be deduced from them)
- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the rules concerning domains will be extracted)
- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists
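As a rough illustration, the plain rule files are one entry per line, along these lines (the entries and the `#` annotations here are hypothetical; defer to the example files shipped in each folder for the authoritative syntax):

```
# rules — one DNS zone per line
trackingcompany.example

# rules_ip — CIDR networks (a single address is a /32)
203.0.113.0/24

# rules_asn — one AS number per line
AS64496
```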
See the provided examples for syntax.

In each folder:

- `first-party.ext` are the only files considered for the first-party variant of the list
- `*.cache.ext` are from external sources, and thus might be deleted / overwritten
- `*.custom.ext` are for sources that you don't want committed

Then, run `./import_rules.sh`.
If you removed rules and you want to remove every record depending on those rules immediately, run the following command:

```
./db.py --prune --prune-before "$(cat "last_updates/rules.txt")" --prune-base
```
### Add subdomains

If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive), the top 1M subdomains provided might not be enough.

You can add more into the `subdomains` folder.

It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.
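For example, to add a hand-maintained list of subdomains (the filename and hostnames below are just examples):

```shell
# Drop a custom subdomains list into the subdomains folder
mkdir -p subdomains
printf '%s\n' 'telemetry.shop.example' 'cdn.shop.example' \
  > subdomains/my-domains.custom.list
```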
#### Add personal sources

Adding your own browsing history will help create a more suitable subdomains list.

Here are reference commands for possible sources:

- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`
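The `rev | sed` pipeline in the Firefox command exists because `moz_places` stores hostnames in a reversed form with a trailing dot (the `rev_host` column). You can see the transformation on a single made-up value:

```shell
# moz_places stores "example.com" as the rev_host "moc.elpmaxe.";
# rev restores the character order and sed strips the leading dot
echo 'moc.elpmaxe.' | rev | sed 's|^\.||'
# → example.com
```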
#### Collect subdomains from websites

You can add website URLs into the `websites` folder.

It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.

Then, run `collect_subdomain.sh`.

This is a long step, and might be memory-intensive from time to time.

> **Note:** For first-party tracking, a list of subdomains issued from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list>
### Resolve DNS records

Once you've added subdomains, you'll need to resolve them to get their DNS records.

The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory.

Then, run `./resolve_subdomains.sh`.

Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet count.

> **Note:** Some VPS providers might detect this as a DDoS attack and cut the network access.
> Some Wi-Fi connections can be rendered unusable for other uses, and some routers might cease to work.
> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow Ethernet link (Raspberry Pi < 4).

The DNS records will automatically be imported into the database.

If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.
### Import DNS records from Rapid7

If you have a Rapid7 Organization API key, make sure to append to `.env`:

```
RAPID7_API_KEY=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```

Then, run `./import_rapid7.sh`.

This will download about 35 GiB of data the first time, but only the matching records will be stored (a few MiB for the tracking rules).

Note that the download speed will most likely be limited by the database operation throughput (fast RAM will help).

The script remembers which sets were downloaded last, and will only download newer sets.

If the first-party rules changed, the corresponding sets will be re-imported anyway.

If you want to force re-importing, run `rm last_updates/rapid7_*.txt`.
### Export the lists

For the tracking list, use `./export_lists.sh`; the output will be in the `dist` folder (please change the links before distributing them).

For other purposes, tinker with the `./export.py` program.
#### Explanations

Note that if you created an `explanations` folder at the root of the project, a file with a timestamp will be created in it.

It contains every rule in the database and the reason for its presence (i.e. its dependency).

This might be useful to track changes between runs.

Every rule has an associated tag with four components:

1. A number: the level of the rule (1 if it is a rule present in the `rules*` folders)
2. A letter: `F` if first-party, `M` if multi-party
3. A letter: `D` if a duplicate (e.g. `foo.bar.com` if `*.bar.com` is already a rule), `_` if not
4. A number: the number of rules relying on this one
### Generate the index webpage

This is the one served on <https://hostfiles.frogeye.fr>.

Just run `./generate_index.py`.

### Everything

Once you've made sure every step runs fine, you can use `./eulaurarien.sh` to run every step consecutively.