diff --git a/README.md b/README.md index f27b6f6..7229f30 100644 --- a/README.md +++ b/README.md @@ -1,98 +1,133 @@ # eulaurarien -Generates a host list of first-party trackers for ad-blocking. +This program generates a list of every hostname that is a DNS redirection to any of a list of DNS zones and IP networks. -The latest list is available here: +It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md) (learn about first-party trackers by following this link). -**DISCLAIMER:** I'm by no way an expert on this subject so my vocabulary or other stuff might be wrong. Use at your own risk. +If you want to contribute but don't want to create an account on this forge, contact me the way you like: -## What's a first-party tracker? +## How does this work -Traditionally, websites load trackers scripts directly. -For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users. -In order to block those, one can simply block the host `trackercompany.com`. +This program takes as input: -However, to circumvent this easy block, tracker companies made the website using them load trackers from `somethingirelevant.website1.com`. -The latter being a DNS redirection to `website1.trackercompany.com`, directly pointing to a server serving the tracking script. -Those are the first-party trackers. +- Lists of hostnames to match +- Lists of DNS zones to match (a domain and its subdomains) +- Lists of IP addresses / IP networks to match +- Lists of Autonomous System numbers to match +- An enormous quantity of DNS records -Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since: +It outputs the hostnames that are a DNS redirection to any item in the provided lists. -1. Most ad-blocker don't support wildcards -2. 
It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com` +DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns). -So the only solution is to block every `somethingirelevant.website1.com`-like subdomains known, which is a lot. -That's where this scripts comes in, to generate a list of such subdomains. - -## How does this script work - -> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this. - -It takes an input a list of websites with trackers included. -So far, this list is manually-generated from the list of clients of such first-party trackers -(latter we should use a general list of websites to be more exhaustive). -It open each ones of those websites (just the homepage) in a web browser, and record the domains of the network requests the page makes. - -Additionaly, or alternatively, you can feed the script some browsing history and get domains from there. - -It then find the DNS redirections of those domains, and compare with regexes of known tracking domains. -It finally outputs the matching ones. - -## Requirements - -> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this. - -Just to build the list, you can find an already-built list in the releases. 
- -- Bash -- [Python 3.4+](https://www.python.org/) -- [progressbar2](https://pypi.org/project/progressbar2/) -- dnspython -- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up) - -(if you don't want to collect the subdomains, you can skip the following) - -- Firefox -- Selenium -- seleniumwire +Those subdomains can either be provided as is, come from the [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that). ## Usage -> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this. +Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md). -This is only if you want to build the list yourself. -If you just want to use the list, the latest build is available here: -It was build using additional sources not included in this repository for privacy reasons. +The following is for people who want to build their own list. -### Add personal sources +### Requirements -The list of websites provided in this script is by no mean exhaustive, -so adding your own browsing history will help create a better list. 
+Depending on the sources you'll be using to generate the list, you'll need to install some of the following: + +- [Bash](https://www.gnu.org/software/bash/bash.html) +- [Coreutils](https://www.gnu.org/software/coreutils/) +- [curl](https://curl.haxx.se) +- [pv](http://www.ivarch.com/programs/pv.shtml) +- [Python 3.4+](https://www.python.org/) +- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry, I can't help myself) +- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source) +- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source) +- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source) +- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source) + +### Create a new database + +The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASN, IPs, hostnames, zones…) and every entity leading to them. +For now, there's no way to remove data from it, so here's the command to recreate it: `./db.py --initialize`. + +### Gather external sources + +External sources are not stored in this repository. +You'll need to fetch them by running `./fetch_resources.sh`. 
+Those include: + +- Third-party tracker lists +- TLD lists (used to test the validity of hostnames) +- List of public DNS resolvers (for DNS resolving from subdomains) +- Top 1M subdomains + +### Import rules into the database + +You need to put the lists of rules for matching in the different subfolders: + +- `rules`: Lists of DNS zones +- `rules_ip`: Lists of IP networks (for IP addresses append `/32`) +- `rules_asn`: Lists of Autonomous System numbers (IP ranges will be deduced from them) +- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted) +- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists + +See the provided examples for syntax. + +In each folder: + +- `first-party.ext` are the only files considered for the first-party variant of the list +- `*.cache.ext` are from external sources, and thus might be deleted / overwritten +- `*.custom.ext` are for sources that you don't want committed + +Then, run `./import_rules.sh`. + +### Add subdomains + +If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive), +the top 1M subdomains provided might not be enough. + +You can add more subdomains to the `subdomains` folder. +It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files. + +#### Add personal sources + +Adding your own browsing history will help create a better-suited subdomain list. Here's reference command for possible sources: - **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list` - **Firefox**: `cp ~/.mozilla/firefox/.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp` -### Collect subdomains from websites +#### Collect subdomains from websites -Just run `collect_subdomain.sh`. 
+You can add the website URLs to the `websites` folder. +It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files. + +Then, run `collect_subdomain.sh`. This is a long step, and might be memory-intensive from time to time. -This step is optional if you already added personal sources. -Alternatively, you can get just download the list of subdomains used to generate the official block list here: (put it in the `subdomains` folder). +> **Note:** For first-party tracking, a list of subdomains derived from the websites in the repository is available here: -### Extract tracking domains +### Resolve DNS records -Make sure your system is configured with a DNS server without limitation. -Then, run `filter_subdomain.sh`. -The files you need will be in the folder `dist`. +Once you've added subdomains, you'll need to resolve them to get their DNS records. +The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory. -## Contributing +Then, run `./resolve_subdomains.sh`. +Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet count. -### Adding websites +> Some VPS providers might detect this as a DDoS attack and cut the network access. +> Some Wi-Fi connections can be rendered unusable for other uses, some routers might cease to work. +> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow Ethernet link (Raspberry Pi < 4). -Just add the URL to the relevant list: `websites/.list`. +The DNS records will automatically be imported into the database. +If you want to re-import the records without redoing the resolution, just run the last line of the `./resolve_subdomains.sh` script. -### Adding first-party trackers regex +### Import DNS records from Rapid7 + +Just run `./import_rapid7.sh`. 
+This will download about 35 GiB of data, but only the matching records will be stored (a few MiB for the tracking rules). +Note that the download speed will most likely be limited by the database operation throughput (fast RAM will help). + +### Export the lists + +For the tracking list, use `./export_lists.sh`; the output will be in the `dist` folder (please change the links before distributing them). +For other purposes, tinker with the `./export.py` program. -Just add them to `regexes.py`. diff --git a/dist/README.md b/dist/README.md new file mode 100644 index 0000000..31db01f --- /dev/null +++ b/dist/README.md @@ -0,0 +1,74 @@ +# Geoffrey Frogeye's block list of first-party trackers + +## What's a first-party tracker? + +A tracker is a script put on many websites to gather information about the visitor. +They can be used for multiple reasons: statistics, risk management, marketing, ad serving… +In any case, they are a threat to Internet users' privacy and many may want to block them. + +Traditionally, trackers are served from a third party. +For example, `website1.com` and `website2.com` both load their tracking script from `https://trackercompany.com/trackerscript.js`. +In order to block those, one can simply block the hostname `trackercompany.com`, which is what most ad blockers do. + +However, to circumvent this block, tracker companies made the websites using them load trackers from `somestring.website1.com`. +The latter is a DNS redirection to `website1.trackercompany.com`, or directly to an IP address belonging to the tracking company. +Those are called first-party trackers. + +In order to block those trackers, ad blockers would need to block every subdomain pointing to anything under `trackercompany.com` or to their network. +Unfortunately, most don't support those blocking methods as they are not DNS-aware, i.e. they only see `somestring.website1.com`. 
+ +This list is an inventory of every `somestring.website1.com` found, so that non-DNS-aware ad blockers can still block first-party trackers. + +## List variants + +### First-party trackers (recommended) + +- Hosts file: +- Raw list: + +This list contains every hostname redirecting to [a hand-picked list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/rules/first-party.list). +It should be safe from false positives. +Don't be afraid of the size of the list, as this is due to the nature of first-party trackers: a single tracker generates at least one hostname per client (typically two). + +### First-party only trackers + +- Hosts file: +- Raw list: + +This is the same list as above, but without the hostnames under the tracking company domains. +This reduces the size of the list, but it no longer protects against third-party tracking. +Use in conjunction with other block lists. + +### Multi-party trackers + +- Hosts file: +- Raw list: + +As first-party trackers usually evolve from third-party trackers, this list contains every hostname redirecting to trackers found in existing lists of third-party trackers (see next section). +Since the latter were not designed with first-party trackers in mind, they are likely to contain false positives. +On the other hand, they might protect against first-party trackers that we're not aware of / have not yet confirmed. + +#### Source of third-party trackers + +- [EasyPrivacy](https://easylist.to/easylist/easyprivacy.txt) + +(Yes, there's only one for now. Many existing ones cause a lot of false positives.) + +### Multi-party only trackers + +- Hosts file: +- Raw list: + +This is the same list as above, but without the hostnames under the tracking company domains. +This reduces the size of the list, but it no longer protects against third-party tracking. +Use in conjunction with other block lists, especially the ones used to generate this list in the previous section. 
+ +## Meta + +In case of false positives/negatives, or any other question, contact me the way you like: + +The software used to generate this list is available here: + +Some of the first-party trackers included in this list have been found by: +- [Aeris](https://imirhil.fr/) +- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors diff --git a/export_lists.sh b/export_lists.sh index b9853ed..5120562 100755 --- a/export_lists.sh +++ b/export_lists.sh @@ -54,7 +54,7 @@ do rules_output=$(./export.py --count $partyness_flags $trackerness_flags) function link() { # link partyness, link trackerness - url="https://hostfiles.frogeye.fr/${partyness}party-${trackerness}-hosts.txt" + url="https://hostfiles.frogeye.fr/${1}party-${2}-hosts.txt" if [ "$1" = "$partyness" ] && [ "$2" = "$trackerness" ] then url="$url (this one)" @@ -66,17 +66,18 @@ do echo "# First-party trackers host list" echo "# Variant: ${partyness}-party ${trackerness}" echo "#" - echo "# About first-party trackers: https://git.frogeye.fr/geoffrey/eulaurarien#whats-a-first-party-tracker" + echo "# About first-party trackers: TODO" echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien" echo "#" echo "# In case of false positives/negatives, or any other question," echo "# contact me the way you like: https://geoffrey.frogeye.fr" echo "#" - echo "# Latest versions:" + echo "# Latest versions and variants:" echo "# - First-party trackers : $(link first trackers)" echo "# - … excluding redirected: $(link first only-trackers)" echo "# - First and third party : $(link multi trackers)" echo "# - … excluding redirected: $(link multi only-trackers)" + echo '# (variant information: TODO)' echo '# (you can remove `-hosts` to get the raw list)' echo "#" echo "# Generation date: $gen_date" diff --git a/fetch_resources.sh b/fetch_resources.sh index cb66ff7..393d8e1 100755 --- a/fetch_resources.sh +++ b/fetch_resources.sh @@ -17,18 +17,6 @@ function dl() { log 
"Retrieving rules…" rm -f rules*/*.cache.* dl https://easylist.to/easylist/easyprivacy.txt rules_adblock/easyprivacy.cache.txt -# From firebog.net Tracking & Telemetry Lists -# dl https://v.firebog.net/hosts/Prigent-Ads.txt rules/prigent-ads.cache.list -# dl https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-blocklist.txt rules/notrack-blocklist.cache.list -# False positives: https://github.com/WaLLy3K/wally3k.github.io/issues/73 -> 69.media.tumblr.com chicdn.net -dl https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.2o7Net/hosts rules_hosts/add2o7.cache.txt -dl https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt rules_hosts/spy.cache.txt -# dl https://raw.githubusercontent.com/Kees1958/WS3_annual_most_used_survey_blocklist/master/w3tech_hostfile.txt rules/w3tech.cache.list -# False positives: agreements.apple.com -> edgekey.net -# dl https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt rules_hosts/ads-and-tracking-extended.cache.txt # Lots of false-positives -# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/android-tracking.txt rules_hosts/android-tracking.cache.txt -# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/SmartTV.txt rules_hosts/smart-tv.cache.txt -# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/AmazonFireTV.txt rules_hosts/amazon-fire-tv.cache.txt log "Retrieving TLD list…" dl http://data.iana.org/TLD/tlds-alpha-by-domain.txt temp/all_tld.temp.list
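
The README text in this patch describes the core idea as outputting every hostname that is a DNS redirection into one of the listed DNS zones. That matching step might be sketched as follows. This is only an illustration under stated assumptions: the in-memory `CNAMES` table, the `ZONES` set, and the helpers `in_zone` and `redirects_to_zone` are hypothetical names, not the actual API of `db.py`, and the hostnames are the placeholder examples from the README.

```python
from typing import Dict, Set

# Hypothetical inputs for illustration only (placeholder names from the README).
ZONES: Set[str] = {"trackercompany.com"}
CNAMES: Dict[str, str] = {
    "somestring.website1.com": "website1.trackercompany.com",
    "www.website1.com": "website1.com",
}


def in_zone(hostname: str, zones: Set[str]) -> bool:
    """Return True if hostname is a listed zone or one of its subdomains."""
    labels = hostname.lower().rstrip(".").split(".")
    # Check every suffix of the name: a.b.c, then b.c, then c.
    return any(".".join(labels[i:]) in zones for i in range(len(labels)))


def redirects_to_zone(
    hostname: str,
    cnames: Dict[str, str],
    zones: Set[str],
    max_hops: int = 16,
) -> bool:
    """Follow the CNAME chain from hostname and report whether it enters a zone."""
    current = hostname
    for _ in range(max_hops):  # bounded, so a CNAME loop cannot hang the check
        target = cnames.get(current)
        if target is None:
            return False  # end of the chain without reaching a listed zone
        if in_zone(target, zones):
            return True
        current = target
    return False


if __name__ == "__main__":
    for name in CNAMES:
        print(name, redirects_to_zone(name, CNAMES, ZONES))
```

Matching on label boundaries rather than on raw string suffixes keeps a name such as `nottrackercompany.com` from being treated as part of the `trackercompany.com` zone.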