Updated README

Split in two actually (program and list). Closes #3. Also closes #1, because I forgot to do it earlier.
2019-12-20 17:15:39 +01:00
commit 38cf532854
parent 53b14c6ffa
4 changed files with 180 additions and 82 deletions
--- a/README.md
+++ b/README.md
@@ -1,98 +1,133 @@
 # eulaurarien

-Generates a host list of first-party trackers for ad-blocking.
+This program generates a list of every hostname that is a DNS redirection to any of a given list of DNS zones and IP networks.

-The latest list is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
+It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md) (learn about first-party trackers by following this link).

-**DISCLAIMER:** I'm by no way an expert on this subject so my vocabulary or other stuff might be wrong. Use at your own risk.
+If you want to contribute but don't want to create an account on this forge, contact me the way you like: <https://geoffrey.frogeye.fr>

-## What's a first-party tracker?
+## How does this work

-Traditionally, websites load trackers scripts directly.
-For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users.
-In order to block those, one can simply block the host `trackercompany.com`.
+This program takes as input:

-However, to circumvent this easy block, tracker companies made the website using them load trackers from `somethingirelevant.website1.com`.
-The latter being a DNS redirection to `website1.trackercompany.com`, directly pointing to a server serving the tracking script.
-Those are the first-party trackers.
+- Lists of hostnames to match
+- Lists of DNS zones to match (a domain and its subdomains)
+- Lists of IP addresses / IP networks to match
+- Lists of Autonomous System numbers to match
+- An enormous quantity of DNS records

-Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since:
+It outputs the hostnames that are a DNS redirection to any item in the provided lists.

-1. Most ad-blocker don't support wildcards
-2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com`
+DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).

-So the only solution is to block every `somethingirelevant.website1.com`-like subdomains known, which is a lot.
-That's where this scripts comes in, to generate a list of such subdomains.
-
-## How does this script work
-
-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
-
-It takes an input a list of websites with trackers included.
-So far, this list is manually-generated from the list of clients of such first-party trackers
-(latter we should use a general list of websites to be more exhaustive).
-It open each ones of those websites (just the homepage) in a web browser, and record the domains of the network requests the page makes.
-
-Additionaly, or alternatively, you can feed the script some browsing history and get domains from there.
-
-It then find the DNS redirections of those domains, and compare with regexes of known tracking domains.
-It finally outputs the matching ones.
-
-## Requirements
-
-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
-
-Just to build the list, you can find an already-built list in the releases.
-
- Bash
- [Python 3.4+](https://www.python.org/)
- [progressbar2](https://pypi.org/project/progressbar2/)
- dnspython
- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up)
-
-(if you don't want to collect the subdomains, you can skip the following) 
-
- Firefox
- Selenium
- seleniumwire
+Those subdomains can either be provided as is, come from the [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that).

 ## Usage

-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
+Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md).

-This is only if you want to build the list yourself.
-If you just want to use the list, the latest build is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
-It was build using additional sources not included in this repository for privacy reasons.
+The following is for the people wanting to build their own list.

-### Add personal sources
+### Requirements

-The list of websites provided in this script is by no mean exhaustive,
-so adding your own browsing history will help create a better list.
+Depending on the sources you'll be using to generate the list, you'll need to install some of the following:
+
+- [Bash](https://www.gnu.org/software/bash/bash.html)
+- [Coreutils](https://www.gnu.org/software/coreutils/)
+- [curl](https://curl.haxx.se)
+- [pv](http://www.ivarch.com/programs/pv.shtml)
+- [Python 3.4+](https://www.python.org/)
+- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
+- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
+- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
+- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
+- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source)
+
+### Create a new database
+
+The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASN, IPs, hostnames, zones…) and every entity leading to it.
+For now there's no way to remove data from it, so here's the command to recreate it: `./db.py --initialize`.
+
+### Gather external sources
+
+External sources are not stored in this repository.
+You'll need to fetch them by running `./fetch_resources.sh`.
+Those include:
+
+- Third-party trackers lists
+- TLD lists (used to test the validity of hostnames)
+- List of public DNS resolvers (for DNS resolving from subdomains)
+- Top 1M subdomains
+
+### Import rules into the database
+
+You need to put the lists of rules for matching in the different subfolders:
+
+- `rules`: Lists of DNS zones
+- `rules_ip`: Lists of IP networks (for IP addresses append `/32`)
+- `rules_asn`: Lists of Autonomous System numbers (IP ranges will be deduced from them)
+- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted)
+- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists
+
+See the provided examples for syntax.
+
+In each folder:
+
+- `first-party.ext` will be the only files considered for the first-party variant of the list
+- `*.cache.ext` are from external sources, and thus might be deleted / overwritten
+- `*.custom.ext` are for sources that you don't want committed
+
+Then, run `./import_rules.sh`.
+
+### Add subdomains
+
+If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive),
+the top 1M subdomains provided might not be enough.
+
+You can add them into the `subdomains` folder.
+It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.
+
+#### Add personal sources
+
+Adding your own browsing history will help create a better-suited list of subdomains.
 Here's reference command for possible sources:

 - **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
 - **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`

-### Collect subdomains from websites
+#### Collect subdomains from websites

-Just run `collect_subdomain.sh`.
+You can add the website URLs into the `websites` folder.
+It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.
+
+Then, run `collect_subdomain.sh`.
 This is a long step, and might be memory-intensive from time to time.

-This step is optional if you already added personal sources.
-Alternatively, you can get just download the list of subdomains used to generate the official block list here: <https://hostfiles.frogeye.fr/from_websites.cache.list> (put it in the `subdomains` folder).
+> **Note:** For first-party tracking, a list of subdomains collected from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list>

-### Extract tracking domains
+### Resolve DNS records

-Make sure your system is configured with a DNS server without limitation.
-Then, run `filter_subdomain.sh`.
-The files you need will be in the folder `dist`.
+Once you've added subdomains, you'll need to resolve them to get their DNS records.
+The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory.

-## Contributing
+Then, run `./resolve_subdomains.sh`.
+Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet count.

-### Adding websites
+> Some VPS providers might detect this as a DDoS attack and cut the network access.
+> Some Wi-Fi connections can be rendered unusable for other uses, some routers might cease to work.
+> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow Ethernet link (Raspberry Pi < 4).

-Just add the URL to the relevant list: `websites/<source>.list`.
+The DNS records will automatically be imported into the database.
+If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.

-### Adding first-party trackers regex
+### Import DNS records from Rapid7
+
+Just run `./import_rapid7.sh`.
+This will download about 35 GiB of data, but only the matching records will be stored (a few MiB for the tracking rules).
+Note that the download speed will most likely be limited by the throughput of the database operations (fast RAM will help).
+
+### Export the lists
+
+For the tracking list, use `./export_lists.sh`; the output will be in the `dist` folder (please change the links before distributing them).
+For other purposes, tinker with the `./export.py` program.

-Just add them to `regexes.py`.
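
The matching described in the new "How does this work" section can be sketched as follows. This is a minimal illustration, not the code in this repository: the function names, the `(name, type, value)` record tuples and the sample hostnames and addresses are all assumptions made for the example, and the real program stores its state in a database rather than in memory.

```python
import ipaddress


def in_zone(hostname, zone):
    """True if hostname is the zone itself or one of its subdomains."""
    hostname = hostname.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    return hostname == zone or hostname.endswith("." + zone)


def find_redirections(records, zones, networks):
    """Return the hostnames that redirect to a matched zone or IP network.

    records is a list of (name, type, value) tuples (hypothetical format),
    where type is "cname" or "a".
    """
    nets = [ipaddress.ip_network(n) for n in networks]
    matched = set()
    changed = True
    # Repeat until no new hostname is found, so CNAME chains are followed
    # whatever the order of the records.
    while changed:
        changed = False
        for name, rtype, value in records:
            if name in matched:
                continue
            if rtype == "a":
                hit = any(ipaddress.ip_address(value) in net for net in nets)
            else:  # "cname"
                hit = value in matched or any(in_zone(value, z) for z in zones)
            if hit:
                matched.add(name)
                changed = True
    return matched


# Sample data, invented for the example (documentation IP ranges)
records = [
    ("alias.website2.com", "cname", "somestring.website1.com"),
    ("somestring.website1.com", "cname", "website1.trackercompany.com"),
    ("website1.trackercompany.com", "a", "192.0.2.10"),
    ("www.website1.com", "a", "198.51.100.7"),
    ("metrics.website3.com", "a", "203.0.113.5"),
]
result = find_redirections(records, ["trackercompany.com"], ["203.0.113.0/24"])
print(sorted(result))
```

A single address can be passed as a `/32` network, which is what the `rules_ip` folder asks for.
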
--- a/dist/README.md
+++ b/dist/README.md
@@ -0,0 +1,74 @@
+# Geoffrey Frogeye's block list of first-party trackers
+
+## What's a first-party tracker?
+
+A tracker is a script put on many websites to gather information about the visitor.
+They can be used for multiple reasons: statistics, risk management, marketing, ad serving…
+In any case, they are a threat to Internet users' privacy and many may want to block them.
+
+Traditionally, trackers are served from a third party.
+For example, `website1.com` and `website2.com` both load their tracking script from `https://trackercompany.com/trackerscript.js`.
+In order to block those, one can simply block the hostname `trackercompany.com`, which is what most ad blockers do.
+
+However, to circumvent this block, tracker companies made the websites using them load trackers from `somestring.website1.com`.
+The latter is a DNS redirection to `website1.trackercompany.com`, directly to an IP address belonging to the tracking company.
+Those are called first-party trackers.
+
+In order to block those trackers, ad blockers would need to block every subdomain pointing to anything under `trackercompany.com` or to their network.
+Unfortunately, most don't support those blocking methods as they are not DNS-aware, i.e. they only see `somestring.website1.com`.
+
+This list is an inventory of every `somestring.website1.com` found, so that non-DNS-aware ad blockers can still block first-party trackers.
+
+## List variants
+
+### First-party trackers (recommended)
+
+- Hosts file: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
+- Raw list: <https://hostfiles.frogeye.fr/firstparty-trackers.txt>
+
+This list contains every hostname redirecting to [a hand-picked list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/rules/first-party.list).
+It should be safe from false-positives.
+Don't be afraid of the size of the list, as this is due to the nature of first-party trackers: a single tracker generates at least one hostname per client (typically two).
+
+### First-party only trackers
+
+- Hosts file: <https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt>
+- Raw list: <https://hostfiles.frogeye.fr/firstparty-only-trackers.txt>
+
+This is the same list as above, albeit not containing the hostnames under the tracking company domains.
+This reduces the size of the list, but it no longer protects against third-party tracking.
+Use in conjunction with other block lists.
+
+### Multi-party trackers
+
+- Hosts file: <https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt>
+- Raw list: <https://hostfiles.frogeye.fr/multiparty-trackers.txt>
+
+As first-party trackers usually evolve from third-party trackers, this list contains every hostname redirecting to trackers found in existing lists of third-party trackers (see next section).
+Since the latter were not designed with first-party trackers in mind, they are likely to contain false-positives.
+On the other hand, they might protect against first-party trackers that we're not aware of / have not yet confirmed.
+
+#### Source of third-party trackers
+
+- [EasyPrivacy](https://easylist.to/easylist/easyprivacy.txt)
+
+(yes there's only one for now. A lot of existing ones cause a lot of false positives)
+
+### Multi-party only trackers
+
+- Hosts file: <https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt>
+- Raw list: <https://hostfiles.frogeye.fr/multiparty-only-trackers.txt>
+
+This is the same list as above, albeit not containing the hostnames under the tracking company domains.
+This reduces the size of the list, but it no longer protects against third-party tracking.
+Use in conjunction with other block lists, especially the ones used to generate this list in the previous section.
+
+## Meta
+
+In case of false positives/negatives, or any other question contact me the way you like: <https://geoffrey.frogeye.fr>
+
+The software used to generate this list is available here: <https://git.frogeye.fr/geoffrey/eulaurarien>
+
+Some of the first-party trackers included in this list have been found by:
+- [Aeris](https://imirhil.fr/)
+- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors
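
The difference between the regular and the "only" variants described above can be sketched like this. It is a rough illustration under stated assumptions: the function names are hypothetical, and the `0.0.0.0` sink address in the hosts output is a common convention, not necessarily the one used by the published lists.

```python
def in_zone(hostname, zone):
    """True if hostname is the zone itself or one of its subdomains."""
    hostname = hostname.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    return hostname == zone or hostname.endswith("." + zone)


def only_variant(hostnames, tracker_zones):
    """Drop the hostnames located under the tracking companies' own domains."""
    return sorted(
        h for h in hostnames
        if not any(in_zone(h, z) for z in tracker_zones)
    )


def to_hosts_lines(hostnames, sink="0.0.0.0"):
    """Format a raw list as hosts-file lines (sink address is an assumption)."""
    return [f"{sink} {h}" for h in hostnames]


# Sample data, invented for the example
full_list = [
    "somestring.website1.com",
    "website1.trackercompany.com",
    "trackercompany.com",
]
reduced = only_variant(full_list, ["trackercompany.com"])
print("\n".join(to_hosts_lines(reduced)))
```

With the sample data, only `somestring.website1.com` remains, which is why the "only" variants are meant to be combined with a block list that already covers `trackercompany.com` itself.
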
--- a/export_lists.sh
+++ b/export_lists.sh
@@ -54,7 +54,7 @@ do
        rules_output=$(./export.py --count $partyness_flags $trackerness_flags)

        function link() { # link partyness, link trackerness
-            url="https://hostfiles.frogeye.fr/${partyness}party-${trackerness}-hosts.txt"
+            url="https://hostfiles.frogeye.fr/${1}party-${2}-hosts.txt"
            if [ "$1" = "$partyness" ] && [ "$2" = "$trackerness" ]
            then
                url="$url (this one)"
@@ -66,17 +66,18 @@ do
            echo "# First-party trackers host list"
            echo "# Variant: ${partyness}-party ${trackerness}"
            echo "#"
-            echo "# About first-party trackers: https://git.frogeye.fr/geoffrey/eulaurarien#whats-a-first-party-tracker"
+            echo "# About first-party trackers: TODO"
            echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
            echo "#"
            echo "# In case of false positives/negatives, or any other question,"
            echo "# contact me the way you like: https://geoffrey.frogeye.fr"
            echo "#"
-            echo "# Latest versions:"
+            echo "# Latest versions and variants:"
            echo "# - First-party trackers  : $(link first trackers)"
            echo "# - … excluding redirected: $(link first only-trackers)"
            echo "# - First and third party : $(link multi trackers)"
            echo "# - … excluding redirected: $(link multi only-trackers)"
+            echo '# (variants informations: TODO)'
            echo '# (you can remove `-hosts` to get the raw list)'
            echo "#"
            echo "# Generation date: $gen_date"
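
The `link` fix above replaces the loop variables `${partyness}` / `${trackerness}` with the function's own arguments `${1}` / `${2}`; before the change, all four links in the header pointed to the variant currently being generated. A Python rendition of the corrected logic, given here only as an illustration (the script itself is Bash, and the `current` parameter is an assumption standing in for the loop variables):

```python
BASE_URL = "https://hostfiles.frogeye.fr"


def link(partyness, trackerness, current):
    """Build the URL of a variant from the arguments, not from the loop state.

    current is the (partyness, trackerness) pair being generated; it is only
    used to flag the matching link.
    """
    url = f"{BASE_URL}/{partyness}party-{trackerness}-hosts.txt"
    if (partyness, trackerness) == current:
        url += " (this one)"
    return url


# While generating the multi-party variant, each link keeps its own URL
current = ("multi", "trackers")
for variant in [("first", "trackers"), ("first", "only-trackers"),
                ("multi", "trackers"), ("multi", "only-trackers")]:
    print(link(*variant, current))
```
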
--- a/fetch_resources.sh
+++ b/fetch_resources.sh
@@ -17,18 +17,6 @@ function dl() {
 log "Retrieving rules…"
 rm -f rules*/*.cache.*
 dl https://easylist.to/easylist/easyprivacy.txt rules_adblock/easyprivacy.cache.txt
-# From firebog.net Tracking & Telemetry Lists
-# dl https://v.firebog.net/hosts/Prigent-Ads.txt rules/prigent-ads.cache.list
-# dl https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-blocklist.txt rules/notrack-blocklist.cache.list
-# False positives: https://github.com/WaLLy3K/wally3k.github.io/issues/73 -> 69.media.tumblr.com chicdn.net
-dl https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.2o7Net/hosts rules_hosts/add2o7.cache.txt
-dl https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt rules_hosts/spy.cache.txt
-# dl https://raw.githubusercontent.com/Kees1958/WS3_annual_most_used_survey_blocklist/master/w3tech_hostfile.txt rules/w3tech.cache.list
-# False positives: agreements.apple.com -> edgekey.net
-# dl https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt rules_hosts/ads-and-tracking-extended.cache.txt # Lots of false-positives
-# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/android-tracking.txt rules_hosts/android-tracking.cache.txt
-# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/SmartTV.txt rules_hosts/smart-tv.cache.txt
-# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/AmazonFireTV.txt rules_hosts/amazon-fire-tv.cache.txt

 log "Retrieving TLD list…"
 dl http://data.iana.org/TLD/tlds-alpha-by-domain.txt temp/all_tld.temp.list