# eulaurarien
This program is able to generate a list of every hostname that is a DNS redirection to a given list of DNS zones and IP networks.
It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md) (learn about first-party trackers by following this link).
If you want to contribute but don't want to create an account on this forge, contact me however you like: <https://geoffrey.frogeye.fr>
## How does this work
This program takes as input:
- Lists of hostnames to match
- Lists of DNS zones to match (a domain and all its subdomains)
- Lists of IP addresses / IP networks to match
- Lists of Autonomous System numbers to match
- An enormous quantity of DNS records
It will output the hostnames that are DNS redirections to any item in the provided lists.
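
For example, suppose `trackercompany.example` is one of the zones to match, and a shop's subdomain is in fact a CNAME redirection to it (all hostnames here are made up for illustration):

```
$ dig +short track.shop.example CNAME
collect.trackercompany.example.
```

In that case, `track.shop.example` will appear in the output.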
DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).
Those subdomains can either be provided as is, come from the [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), come from your browsing history, or come from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all of that).
## Usage
Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md).
The following is for people who want to build their own list.
### Requirements
Depending on the sources you'll be using to generate the list, you'll need to install some of the following:
- [Bash](https://www.gnu.org/software/bash/bash.html)
- [Coreutils](https://www.gnu.org/software/coreutils/)
- [curl](https://curl.haxx.se)
- [pv](http://www.ivarch.com/programs/pv.shtml)
- [Python 3.4+](https://www.python.org/)
- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
- [numpy](https://www.numpy.org/)
- [python-abp](https://pypi.org/project/python-abp/) (only if you intend to use AdBlock rules as a rule source)
- [jq](http://stedolan.github.io/jq/) (only if you have a Rapid7 API key)
- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source)
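
The Python libraries listed above are all on PyPI, so one way to get them (a sketch assuming a user-level pip install; adapt to your preferred Python setup) is:

```
# package names taken from the list above
pip3 install --user coloredlogs numpy python-abp selenium selenium-wire
```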
### Create a new database
The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASNs, IPs, hostnames, zones…) and every entity leading to them.
It exists because the list cannot be generated in one pass, as the links of a DNS redirection chain do not have to be input in order.
You can purge old records from the database by running `./prune.sh`.
When you remove a source of data, remove its corresponding file in `last_updates` as well, so that the pruning process keeps working correctly.
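
As an illustration, pruning could be automated with a cron entry like the following (the weekly schedule and the installation path are arbitrary examples):

```
# run prune.sh every Monday at 04:00; adjust the path to your checkout
0 4 * * 1  cd /path/to/eulaurarien && ./prune.sh
```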
### Gather external sources
External sources are not stored in this repository.
You'll need to fetch them by running `./fetch_resources.sh`.
Those include:
- Third-party tracker lists
- TLD lists (used to test the validity of hostnames)
- List of public DNS resolvers (for DNS resolving from subdomains)
- Top 1M subdomains
### Import rules into the database

You need to put the lists of rules for matching in the different subfolders:

- `rules`: Lists of DNS zones
- `rules_ip`: Lists of IP networks (for single IP addresses, append `/32`)
- `rules_asn`: Lists of Autonomous System numbers (the corresponding IP ranges will be deduced from them)
- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted)
- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists

See the provided examples for syntax.
In each folder:
- `first-party.ext` files are the only ones considered for the first-party variant of the list
- `*.cache.ext` files are from external sources, and thus might be deleted or overwritten
- `*.custom.ext` files are for sources that you don't want committed (see the sketch below)
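
As an illustration, a custom list of zones could be added like this (the file name and the one-zone-per-line format are assumptions; refer to the provided example files for the exact syntax):

```
# hypothetical file name; one DNS zone per line
printf 'tracker.example\nmetrics.example\n' > rules/my-zones.custom.list
```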
Then, run `./import_rules.sh`.
2019-12-25 14:54:57 +01:00
If you removed rules and you want to remove every record depending on those rules immediately,
run the following command:
```
./db.py --prune --prune-before "$(cat "last_updates/rules.txt")" --prune-base
```
### Add subdomains
If you plan to resolve DNS records yourself (as the DNS record datasets are not exhaustive),
the top 1M subdomains provided might not be enough.
You can add your own in the `subdomains` folder.
It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.
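
For example, extra subdomains can be dropped in as a custom file (the file name is arbitrary; the format is one hostname per line):

```
# hypothetical file name; one hostname per line
printf 'cdn.shop.example\nimg.shop.example\n' > subdomains/extra.custom.list
```
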
#### Add personal sources
Adding your own browsing history will help create a subdomain list better suited to your needs.
Here are reference commands for possible sources:
- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`
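
If all you have is a plain list of visited URLs from some other tool, a rough sketch like the following can reduce it to bare hostnames (file names are examples):

```
# strip the scheme, then everything after the host
sed -e 's|^[a-z]*://||' -e 's|[/:?#].*||' urls.txt | sort -u > /path/to/eulaurarien/subdomains/my-urls.custom.list
```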
#### Collect subdomains from websites
You can add the URLs of the websites to analyze in the `websites` folder.
It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.
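
For instance (the file name is an example; one URL per line is assumed, matching the other source folders):

```
# hypothetical file name; one URL per line assumed
printf 'https://shop.example/\nhttps://news.example/\n' > websites/mine.custom.list
```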
Then, run `collect_subdomain.sh`.
This is a long step, and it might be memory-intensive from time to time.
> **Note:** For first-party tracking, a list of subdomains derived from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list>
### Resolve DNS records
Once you've added subdomains, you'll need to resolve them to get their DNS records.
The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory.
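
For example, extra resolvers can be added as a custom file (the file name is an assumption; one IP address per line, which is what massdns expects for its resolver list):

```
# hypothetical file name; one resolver IP per line
printf '9.9.9.9\n149.112.112.112\n' > nameservers/mine.custom.list
```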
Then, run `./resolve_subdomains.sh`.
Note that this is a network-intensive process, not in terms of bandwidth, but in terms of the number of packets.
> **Note:** Some VPS providers might detect this as a DDoS attack and cut the network access.
> Some Wi-Fi connections can be rendered unusable for other uses, and some routers might cease to work.
> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow Ethernet link (Raspberry Pi < 4).
The DNS records will automatically be imported into the database.
If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.
### Import DNS records from Rapid7
If you have a Rapid7 Organization API key, make sure to append it to `.env`:
```
RAPID7_API_KEY=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```
Then, run `./import_rapid7.sh`.
This will download about 35 GiB of data the first time, but only the matching records will be stored (a few MiB for the tracking rules).
Note that the download speed will most likely be limited by the database operation throughput (fast RAM helps).
The script remembers the last sets it downloaded, and will only import newer sets.
If you want to force re-importing, run `rm last_updates/rapid7_*.txt`.
### Export the lists
For the tracking list, use `./export_lists.sh`; the output will be in the `dist` folder (please change the links before distributing them).
For other purposes, tinker with the `./export.py` program.
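
Assuming `./export.py` exposes a standard command-line interface, its options can be listed with:

```
# assumption: the script provides a standard --help
./export.py --help
```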
### Everything
Once you've made sure every step runs fine, you can use `./eulaurarien.sh` to run every step consecutively.