Generates a host list of first-party trackers for ad-blocking. https://hostfiles.frogeye.fr

Geoffrey Frogeye f5f9f88c42 Remove ThreatMetrix I received a lot of false positives for this one, and while I wasn't able to reproduce the issue in most of the cases, I trust the community. It's also not in any other CNAME tracker list, probably for the same reason. Plus, it's apparently not very nasty. So I'll let it go. Closes #17		2021-08-14 21:24:48 +02:00
dist	Add AdGuard in the distribution README	2020-12-06 23:18:27 +01:00
last_updates	Clever pruning mechanism	2019-12-25 14:54:57 +01:00
nameservers	Fixed scripting around	2019-12-18 13:01:32 +01:00
rules	Remove ThreatMetrix	2021-08-14 21:24:48 +02:00
rules_adblock	Improved rules handling	2019-12-03 08:48:12 +01:00
rules_asn	Investigated >1% trackers from Fukuda paper	2020-12-07 00:03:58 +01:00
rules_hosts	Improved rules handling	2019-12-03 08:48:12 +01:00
rules_ip	Re-import Rapid7 datasets when rules have been updated	2020-01-04 10:54:46 +01:00
subdomains	Added possibility to add personal sources	2019-11-11 11:19:46 +01:00
temp	Separated DNS resolution from filtering	2019-12-02 19:03:08 +01:00
tests	Investigated >0.5% trackers from Fukuda paper	2020-12-19 13:41:07 +01:00
websites	Added RED by SFR website	2019-11-13 18:14:56 +01:00
.env.default	Allow custom massdns path	2019-12-26 00:33:23 +01:00
.gitignore	Explanations folder	2019-12-27 15:35:30 +01:00
adblock_to_domain_list.py	Removed third-parties from easyprivacy	2019-12-05 01:19:10 +01:00
collect_subdomains.py	Improvements to subdomain collection	2020-01-03 22:08:06 +01:00
collect_subdomains.sh	Fix log in scripts	2019-12-07 18:45:48 +01:00
database.py	Fixed feed_dns not saving in single-threaded mode	2019-12-26 00:02:01 +01:00
db.py	Removed TODO placeholders in commands description	2019-12-19 08:07:01 +01:00
eulaurarien.sh	Added index webpage	2019-12-27 15:21:33 +01:00
export.py	Removed TODO placeholders in commands description	2019-12-19 08:07:01 +01:00
export_lists.sh	I don't know how to write the word “explanation“...	2020-01-11 11:31:16 +01:00
feed_asn.py	Removed TODO placeholders in commands description	2019-12-19 08:07:01 +01:00
feed_dns.py	Fixed handling of unknown field error	2019-12-27 01:10:21 +01:00
feed_rules.py	Removed TODO placeholders in commands description	2019-12-19 08:07:01 +01:00
fetch_resources.sh	Add Fukuda & co research paper to test suite	2020-12-06 22:13:05 +01:00
generate_index.py	Added index webpage	2019-12-27 15:21:33 +01:00
import_rapid7.sh	Disabled RDNS import due to #15	2020-01-07 14:17:38 +01:00
import_rules.sh	Clever pruning mechanism	2019-12-25 14:54:57 +01:00
LICENSE	Added LICENSE	2019-12-20 17:38:26 +01:00
prune.sh	Clever pruning mechanism	2019-12-25 14:54:57 +01:00
README.md	Add Fukuda & co research paper to test suite	2020-12-06 22:13:05 +01:00
resolve_subdomains.sh	Better list output	2019-12-27 21:46:57 +01:00
run_tests.py	Add Fukuda & co research paper to test suite	2020-12-06 22:13:05 +01:00
validate_list.py	Validate also lower the case of domains	2019-12-25 15:31:20 +01:00

README.md

eulaurarien

This program generates a list of every hostname that is a DNS redirection to any item in a list of DNS zones and IP networks.

It is primarily used to generate Geoffrey Frogeye's block list of first-party trackers (learn about first-party trackers by following this link).

If you want to contribute but don't want to create an account on this forge, contact me the way you like: https://geoffrey.frogeye.fr

How does this work

This program takes as input:

Lists of hostnames to match
Lists of DNS zones to match (a domain and its subdomains)
Lists of IP addresses / IP networks to match
Lists of Autonomous System numbers to match
An enormous quantity of DNS records

It outputs the hostnames that are a DNS redirection to any item in the provided lists.
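As a rough illustration of the matching described above (not the program's actual code), here is a minimal Python sketch. The zone, network, and DNS records are made-up example values, and `in_zone`, `in_networks` and `is_redirection_to_rule` are hypothetical helper names.

```python
import ipaddress

def in_zone(hostname, zone):
    """True if hostname is the zone itself or one of its subdomains."""
    hostname = hostname.lower().rstrip(".")
    zone = zone.lower().rstrip(".")
    return hostname == zone or hostname.endswith("." + zone)

def in_networks(address, networks):
    """True if the IP address belongs to any of the given networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(network) for network in networks)

# Hypothetical rules and DNS records, for illustration only.
ZONES = ["tracker.example"]
NETWORKS = ["192.0.2.0/24"]
CNAME_RECORDS = {"metrics.shop.example": "shop.tracker.example"}
A_RECORDS = {
    "stats.news.example": "192.0.2.17",
    "www.shop.example": "198.51.100.5",
}

def is_redirection_to_rule(hostname):
    """True if the hostname points (via CNAME or A record) to a listed item."""
    target = CNAME_RECORDS.get(hostname)
    if target is not None and any(in_zone(target, zone) for zone in ZONES):
        return True
    address = A_RECORDS.get(hostname)
    return address is not None and in_networks(address, NETWORKS)

for name in ("metrics.shop.example", "stats.news.example", "www.shop.example"):
    print(name, is_redirection_to_rule(name))
```

The sketch only follows a single redirection; real chains can be longer, which is why the program keeps a database.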

DNS records can either come from Rapid7 Open Data Sets or can be locally resolved from a list of subdomains using MassDNS.

Those subdomains can either be provided as is, come from the Cisco Umbrella Popularity List, from your browsing history, or from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that).

Usage

Remember you can get an already generated and up-to-date list of first-party trackers from here.

The following is for the people wanting to build their own list.

Requirements

Depending on the sources you'll be using to generate the list, you'll need to install some of the following:

Bash
Coreutils
Gawk
curl
pv
Python 3.4+
coloredlogs (sorry I can't help myself)
numpy
python-abp (only if you intend to use AdBlock rules as a rule source)
jq (only if you have a Rapid7 API key)
massdns in your $PATH (only if you have subdomains as a source)
Firefox (only if you have websites as a source)
selenium (Python bindings) (only if you have websites as a source)
selenium-wire (only if you have websites as a source)
markdown2 (only if you intend to generate the index webpage)

Create a new database

The so-called database (in the form of blocking.p) is a file storing all the matching entities (ASN, IPs, hostnames, zones…) and every entity leading to them. It exists because the list cannot be generated in one pass, as the links of a DNS redirection chain do not have to be input in order.
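To see why ordering matters, consider the following sketch. It is not how database.py works; it is a simplified fixed-point loop, with hypothetical names (`propagate`, `in_zone`) and made-up records, that keeps unmatched records around until a later record makes them match.

```python
def in_zone(hostname, zone):
    """True if hostname is the zone itself or one of its subdomains."""
    return hostname == zone or hostname.endswith("." + zone)

def propagate(rule_zones, cname_records):
    """Return every source hostname whose CNAME chain ends in a rule zone.

    Records may arrive in any order, so unmatched records are kept and
    re-examined until a full pass adds nothing new.
    """
    matched = set()
    pending = list(cname_records)
    changed = True
    while changed:
        changed = False
        remaining = []
        for source, target in pending:
            if target in matched or any(in_zone(target, z) for z in rule_zones):
                matched.add(source)
                changed = True
            else:
                remaining.append((source, target))
        pending = remaining
    return matched

# Hypothetical records: the first link of the chain arrives before the
# record that reveals where the chain ends.
records = [
    ("a.example", "b.example"),
    ("b.example", "x.tracker.example"),
    ("c.example", "d.example"),
]
print(sorted(propagate(["tracker.example"], records)))
```

With a single pass over the records, a.example would be missed, because b.example is not yet known to lead to a tracker when the first record is read.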

You can purge old records from the database by running ./prune.sh. When you remove a source of data, remove its corresponding file in last_updates to fix the pruning process.

Gather external sources

External sources are not stored in this repository. You'll need to fetch them by running ./fetch_resources.sh. Those include:

Third-party trackers lists
TLD lists (used to test the validity of hostnames)
List of public DNS resolvers (for DNS resolving from subdomains)
Top 1M subdomains

Import rules into the database

You need to put the lists of rules for matching in the different subfolders:

rules: Lists of DNS zones
rules_ip: Lists of IP networks (for IP addresses append /32)
rules_asn: Lists of Autonomous System numbers (IP ranges will be deduced from them)
rules_adblock: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted)
rules_hosts: Lists of DNS zones, but in the form of hosts lists

See the provided examples for syntax.
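If you prepare rule files with your own tooling, the two conversions mentioned above can be sketched in Python as follows. The helper names are hypothetical, and the hosts parsing assumes the usual "address hostname" layout; check the provided examples for the exact syntax the import scripts expect.

```python
import ipaddress

def to_network(entry):
    """Return an entry in CIDR notation.

    A bare IPv4 address becomes a /32 network (and a bare IPv6 address
    a /128 network); entries already in CIDR form are kept as they are.
    """
    return str(ipaddress.ip_network(entry.strip()))

def hosts_line_to_domain(line):
    """Extract the hostname from a hosts-style line, or None if there is none."""
    line = line.split("#", 1)[0].strip()  # drop comments
    fields = line.split()
    if len(fields) < 2:
        return None
    return fields[1].lower()

print(to_network("203.0.113.7"))
print(to_network("192.0.2.0/24"))
print(hosts_line_to_domain("0.0.0.0 tracker.example"))
```

`ipaddress.ip_network` raises a `ValueError` for malformed entries, which can help catch mistakes before running the import.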

In each folder:

first-party.ext will be the only files considered for the first-party variant of the list
*.cache.ext are from external sources, and thus might be deleted / overwritten
*.custom.ext are for sources that you don't want committed

Then, run ./import_rules.sh.

If you removed rules and you want to remove every record depending on those rules immediately, run the following command:

./db.py --prune --prune-before "$(cat "last_updates/rules.txt")" --prune-base

Add subdomains

If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive), the top 1M subdomains provided might not be enough.

You can add more subdomains to the subdomains folder. It follows the same conventions as the rules folder for *.cache.ext and *.custom.ext files.

Add personal sources

Adding your own browsing history will help create a better-suited subdomains list. Here are reference commands for possible sources:

Pi-hole: sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list
Firefox: cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp
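The Firefox command works because the rev_host column stores each hostname reversed, with a trailing dot; `rev` flips it back and `sed` drops the leading dot that results. For illustration, here is a Python sketch of the same transformation. It runs on an in-memory table with made-up rows that mimics only the column used here, not on a real places.sqlite, and the function names are hypothetical.

```python
import sqlite3

def host_from_rev_host(rev_host):
    """Undo the reversed storage: 'moc.elpmaxe.www.' -> 'www.example.com'."""
    forward = rev_host[::-1]
    return forward[1:] if forward.startswith(".") else forward

def hosts_from_places(connection):
    """Return the sorted, distinct hostnames found in moz_places."""
    rows = connection.execute(
        "SELECT DISTINCT rev_host FROM moz_places WHERE rev_host IS NOT NULL"
    )
    hosts = {host_from_rev_host(row[0]) for row in rows}
    hosts.discard("")  # entries without a host
    return sorted(hosts)

# Stand-in for a copy of places.sqlite, with hypothetical rows.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE moz_places (url TEXT, rev_host TEXT)")
connection.executemany(
    "INSERT INTO moz_places VALUES (?, ?)",
    [
        ("https://www.example.com/", "moc.elpmaxe.www."),
        ("https://www.example.com/page", "moc.elpmaxe.www."),
        ("https://news.example.org/", "gro.elpmaxe.swen."),
    ],
)
print("\n".join(hosts_from_places(connection)))
```

As in the shell version, work on a copy of the profile database rather than the live file, since Firefox keeps it locked while running.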

Collect subdomains from websites

You can add website URLs to the websites folder. It follows the same conventions as the rules folder for *.cache.ext and *.custom.ext files.

Then, run ./collect_subdomains.sh. This is a long step, and might be memory-intensive from time to time.

Note: For first-party tracking, a list of subdomains collected from the websites in the repository is available here: https://hostfiles.frogeye.fr/from_websites.cache.list

Resolve DNS records

Once you've added subdomains, you'll need to resolve them to get their DNS records. The program will use a list of public nameservers to do that, but you can add your own in the nameservers directory.

Then, run ./resolve_subdomains.sh. Note that this is a network-intensive process, not in terms of bandwidth, but in terms of the number of packets.

Note: Some VPS providers might detect this as a DDoS attack and cut network access. It can render some Wi-Fi connections unusable for other purposes, and some routers might stop working. Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow Ethernet link (Raspberry Pi < 4).

The DNS records will automatically be imported into the database. If you want to re-import the records without re-doing the resolving, just run the last line of the ./resolve_subdomains.sh script.

Import DNS records from Rapid7

If you have a Rapid7 Organization API key, make sure to append to .env:

RAPID7_API_KEY=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Then, run ./import_rapid7.sh. This will download about 35 GiB of data the first time, but only the matching records will be stored (a few MiB for the tracking rules). Note that the download speed will most likely be limited by the throughput of database operations (fast RAM will help).

The script remembers which sets were downloaded last, and will only import newer sets. If the first-party rules changed, the corresponding sets will be re-imported anyway. If you want to force re-importing, run rm last_updates/rapid7_*.txt.

Export the lists

For the tracking list, use ./export_lists.sh. The output will be in the dist folder (please change the links before distributing them). For other purposes, tinker with the ./export.py program.

Explanations

Note that if you create an explanations folder at the root of the project, a timestamped file will be created in it. It contains every rule in the database and the reason for its presence (i.e. its dependency). This might be useful to track changes between runs.

Every rule has an associated tag with four components:

A number: the level of the rule (1 if it is a rule present in the rules* folders)
A letter: F if first-party, M if multi-party.
A letter: D if a duplicate (e.g. foo.bar.com if *.bar.com is already a rule), _ if not.
A number: the number of rules relying on this one

Generate the index webpage

This is the one served on https://hostfiles.frogeye.fr. Just run ./generate_index.py.

Everything

Once you've made sure every step runs fine, you can use ./eulaurarien.sh to run every step consecutively.