# eulaurarien

This program generates a list of every hostname that is a DNS redirection to any of a given list of DNS zones and IP networks.

It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md) (learn about first-party trackers by following this link).

If you want to contribute but don't want to create an account on this forge, contact me however you like: <https://geoffrey.frogeye.fr>

## How does this work

This program takes as input:

- Lists of hostnames to match
- Lists of DNS zones to match (a domain and its subdomains)
- Lists of IP addresses / IP networks to match
- Lists of Autonomous System numbers to match
- An enormous quantity of DNS records

It outputs the hostnames that are DNS redirections to any item in the provided lists.
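
To illustrate with made-up names: if `rules` contains the zone `trackercorp.example`, and the DNS records include the chain below, then `track.shop.example` will end up in the output, since it is a DNS redirection (here a CNAME) into the matched zone:

```
track.shop.example.     CNAME  c.trackercorp.example.
c.trackercorp.example.  A      203.0.113.4
```
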
DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).

Those subdomains can either be provided as-is, come from the [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), come from your browsing history, or come from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all of that).
## Usage

Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md).

The following is for people who want to build their own list.
### Requirements

Depending on the sources you'll be using to generate the list, you'll need to install some of the following:

- [Bash](https://www.gnu.org/software/bash/bash.html)
- [Coreutils](https://www.gnu.org/software/coreutils/)
- [curl](https://curl.haxx.se)
- [pv](http://www.ivarch.com/programs/pv.shtml)
- [Python 3.4+](https://www.python.org/)
- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
- [numpy](https://www.numpy.org/)
- [python-abp](https://pypi.org/project/python-abp/) (only if you intend to use AdBlock rules as a rule source)
- [jq](http://stedolan.github.io/jq/) (only if you have a Rapid7 API key)
- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source)
### Create a new database

The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASNs, IPs, hostnames, zones…) and every entity leading to them.
It exists because the list cannot be generated in one pass: the links of a DNS redirection chain are not necessarily input in order.

You can purge old records from the database by running `./prune.sh`.
When you remove a data source, also remove its corresponding file in `last_updates` so the pruning process stays consistent.
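
For example, if you stop using the Rapid7 source described below, you would remove its timestamp files:

```
rm last_updates/rapid7_*.txt
```
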
### Gather external sources

External sources are not stored in this repository.
You'll need to fetch them by running `./fetch_resources.sh`.
Those include:

- Third-party tracker lists
- TLD lists (used to test the validity of hostnames)
- Lists of public DNS resolvers (for resolving DNS records from subdomains)
- Top 1M subdomains
### Import rules into the database

You need to put the lists of rules for matching in the different subfolders:

- `rules`: lists of DNS zones
- `rules_ip`: lists of IP networks (for single IP addresses, append `/32`)
- `rules_asn`: lists of Autonomous System numbers (IP ranges will be deduced from them)
- `rules_adblock`: lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted)
- `rules_hosts`: lists of DNS zones, but in the form of hosts lists

See the provided examples for syntax; a hypothetical sketch is also shown below.
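
For instance, rule files could look like this (the file names, extensions, and the `#` annotation lines are illustrative assumptions; defer to the actual example files):

```
# rules/first-party.list — one DNS zone per line
trackercorp.example

# rules_ip/first-party.txt — one IP network per line
203.0.113.0/24
198.51.100.17/32

# rules_asn/first-party.txt — one AS number per line
AS64496
```
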
In each folder:

- `first-party.ext` will be the only files considered for the first-party variant of the list
- `*.cache.ext` are from external sources, and thus might be deleted / overwritten
- `*.custom.ext` are for sources that you don't want committed

Then, run `./import_rules.sh`.
If you removed rules and you want to remove every record depending on those rules immediately,
run the following command:

```
./db.py --prune --prune-before "$(cat "last_updates/rules.txt")" --prune-base
```
### Add subdomains

If you plan to resolve DNS records yourself (as the DNS record datasets are not exhaustive),
the top 1M subdomains provided might not be enough.

You can add your own subdomains to the `subdomains` folder.
It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.
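
For example (the file name here is an illustrative assumption):

```
echo 'telemetry.shop.example' >> subdomains/mine.custom.list
```
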
#### Add personal sources

Adding your own browsing history will help create a subdomain list better suited to your needs.
Here are reference commands for possible sources:

- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`
#### Collect subdomains from websites

You can add website URLs to the `websites` folder.
It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.
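
For example (again, the file name is an illustrative assumption):

```
echo 'https://www.shop.example/' >> websites/mine.custom.list
```
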
Then, run `./collect_subdomain.sh`.
This is a long step, and might be memory-intensive from time to time.

> **Note:** For first-party tracking, a list of subdomains issued from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list>
### Resolve DNS records

Once you've added subdomains, you'll need to resolve them to get their DNS records.
The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory.

Then, run `./resolve_subdomains.sh`.
Note that this is a network-intensive process: not in terms of bandwidth, but in terms of packet count.

> **Note:** Some VPS providers might detect this as a DDoS attack and cut off network access.
> Some Wi-Fi connections can be rendered unusable for other purposes, and some routers might stop working altogether.
> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow Ethernet link (Raspberry Pi < 4).
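
If dedicating a slow machine is not an option, a possible workaround (untested, only a sketch) is to cap the machine's outgoing traffic yourself, for instance with Linux's `tc`:

```
# Cap outgoing traffic on eth0 (adjust interface and rate to taste).
# Capping bandwidth indirectly caps the packet rate for small DNS queries.
sudo tc qdisc add dev eth0 root tbf rate 1mbit burst 32kbit latency 400ms
# Remove the limit afterwards:
sudo tc qdisc del dev eth0 root
```
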
The DNS records will automatically be imported into the database.
If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.
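
One quick way of doing that, assuming the last line of the script is a self-contained command (check before running):

```
bash -c "$(tail -n 1 ./resolve_subdomains.sh)"
```
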
### Import DNS records from Rapid7

If you have a Rapid7 Organization API key, make sure to append the following to `.env`:

```
RAPID7_API_KEY=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```
Then, run `./import_rapid7.sh`.
This will download about 35 GiB of data the first time, but only the matching records will be stored (a few MiB for the tracking rules).
Note that the download speed will most likely be limited by the database operation throughput (fast RAM helps).

The script remembers the last sets it downloaded, and will only import newer sets.
If you want to force a re-import, run `rm last_updates/rapid7_*.txt`.
### Export the lists

For the tracking list, use `./export_lists.sh`; the output will be in the `dist` folder (please change the links before distributing the lists).
For other purposes, tinker with the `./export.py` program.
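
To explore its options (assuming it exposes a standard `--help` flag, which is not guaranteed):

```
./export.py --help
```
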
### Everything

Once you've made sure every step runs fine, you can use `./eulaurarien.sh` to run every step consecutively.
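
For instance, to rebuild the list every week, a crontab entry could look like this (the schedule and path are just examples):

```
0 4 * * 1  cd /path/to/eulaurarien && ./eulaurarien.sh
```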