Updated README
Split in two actually (program and list). Closes #3 Also, Closes #1 Because I forgot to do it earlier.
parent
53b14c6ffa
commit
38cf532854
169
README.md
|
@ -1,98 +1,133 @@
|
|||
# eulaurarien
|
||||
|
||||
Generates a host list of first-party trackers for ad-blocking.
|
||||
This program can generate a list of every hostname that is a DNS redirection to an entry in a list of DNS zones and IP networks.
|
||||
|
||||
The latest list is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
|
||||
It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md) (learn about first-party trackers by following this link).
|
||||
|
||||
**DISCLAIMER:** I'm by no means an expert on this subject, so my vocabulary or other details might be wrong. Use at your own risk.
|
||||
If you want to contribute but don't want to create an account on this forge, contact me the way you like: <https://geoffrey.frogeye.fr>
|
||||
|
||||
## What's a first-party tracker?
|
||||
## How does this work
|
||||
|
||||
Traditionally, websites load tracker scripts directly.
|
||||
For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users.
|
||||
In order to block those, one can simply block the host `trackercompany.com`.
|
||||
This program takes as input:
|
||||
|
||||
However, to circumvent this easy block, tracker companies made the website using them load trackers from `somethingirelevant.website1.com`.
|
||||
The latter is a DNS redirection to `website1.trackercompany.com`, which points directly to a server serving the tracking script.
|
||||
Those are the first-party trackers.
|
||||
- Lists of hostnames to match
|
||||
- Lists of DNS zones to match (a domain and its subdomains)
|
||||
- Lists of IP addresses / IP networks to match
|
||||
- Lists of Autonomous System numbers to match
|
||||
- An enormous quantity of DNS records
|
||||
|
||||
Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since:
|
||||
It then outputs every hostname that is a DNS redirection to any item in the provided lists.
|
||||
|
||||
1. Most ad-blockers don't support wildcards
|
||||
2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com`
|
||||
DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).
|
||||
|
||||
So the only solution is to block every known `somethingirelevant.website1.com`-like subdomain, and there are a lot of them.
|
||||
That's where this script comes in, to generate a list of such subdomains.
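
As a rough illustration of the idea above (not the project's actual implementation), the matching can be sketched in Python: follow each hostname's DNS redirections and keep the hostname if any target falls under a tracker's zone. The function names and the in-memory `cname_records` dictionary are assumptions made for this example.

```python
def resolve_chain(hostname, cname_records, max_depth=16):
    """Follow DNS redirections from hostname; return every name visited."""
    chain = [hostname]
    seen = {hostname}
    while chain[-1] in cname_records and len(chain) <= max_depth:
        target = cname_records[chain[-1]]
        if target in seen:  # guard against redirection loops
            break
        chain.append(target)
        seen.add(target)
    return chain


def in_zone(hostname, zone):
    """True if hostname is the zone itself or one of its subdomains."""
    return hostname == zone or hostname.endswith("." + zone)


def first_party_trackers(hostnames, cname_records, tracker_zones):
    """Return the hostnames whose redirection targets reach a tracker zone."""
    matches = []
    for hostname in hostnames:
        targets = resolve_chain(hostname, cname_records)[1:]
        if any(in_zone(name, zone) for name in targets for zone in tracker_zones):
            matches.append(hostname)
    return matches


# Hypothetical records mirroring the example in the text
cname_records = {
    "somethingirelevant.website1.com": "website1.trackercompany.com",
    "www.website1.com": "website1.com",
}
hostnames = ["somethingirelevant.website1.com", "www.website1.com"]
print(first_party_trackers(hostnames, cname_records, ["trackercompany.com"]))
```

Only the redirection targets are checked against the zones, so a hostname that merely redirects within its own site is not reported.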
|
||||
|
||||
## How does this script work
|
||||
|
||||
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
|
||||
|
||||
It takes as input a list of websites with trackers included.
|
||||
So far, this list is manually-generated from the list of clients of such first-party trackers
|
||||
(later we should use a general list of websites to be more exhaustive).
|
||||
It opens each of those websites (just the homepage) in a web browser, and records the domains of the network requests the page makes.
|
||||
|
||||
Additionally, or alternatively, you can feed the script some browsing history and get domains from there.
|
||||
|
||||
It then finds the DNS redirections of those domains, and compares them with regexes of known tracking domains.
|
||||
It finally outputs the matching ones.
|
||||
|
||||
## Requirements
|
||||
|
||||
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
|
||||
|
||||
These are only needed to build the list; you can find an already-built list in the releases.
|
||||
|
||||
- Bash
|
||||
- [Python 3.4+](https://www.python.org/)
|
||||
- [progressbar2](https://pypi.org/project/progressbar2/)
|
||||
- dnspython
|
||||
- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up)
|
||||
|
||||
(if you don't want to collect the subdomains, you can skip the following)
|
||||
|
||||
- Firefox
|
||||
- Selenium
|
||||
- seleniumwire
|
||||
Those subdomains can either be provided as is, come from [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that).
|
||||
|
||||
## Usage
|
||||
|
||||
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
|
||||
Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md).
|
||||
|
||||
This is only if you want to build the list yourself.
|
||||
If you just want to use the list, the latest build is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
|
||||
It was built using additional sources not included in this repository for privacy reasons.
|
||||
The following is for the people wanting to build their own list.
|
||||
|
||||
### Add personal sources
|
||||
### Requirements
|
||||
|
||||
The list of websites provided in this script is by no means exhaustive,
|
||||
so adding your own browsing history will help create a better list.
|
||||
Depending on the sources you'll be using to generate the list, you'll need to install some of the following:
|
||||
|
||||
- [Bash](https://www.gnu.org/software/bash/bash.html)
|
||||
- [Coreutils](https://www.gnu.org/software/coreutils/)
|
||||
- [curl](https://curl.haxx.se)
|
||||
- [pv](http://www.ivarch.com/programs/pv.shtml)
|
||||
- [Python 3.4+](https://www.python.org/)
|
||||
- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
|
||||
- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
|
||||
- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
|
||||
- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
|
||||
- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source)
|
||||
|
||||
### Create a new database
|
||||
|
||||
The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASN, IPs, hostnames, zones…) and every entity leading to it.
|
||||
For now there's no way to remove data from it, so here's the command to recreate it: `./db.py --initialize`.
|
||||
|
||||
### Gather external sources
|
||||
|
||||
External sources are not stored in this repository.
|
||||
You'll need to fetch them by running `./fetch_resources.sh`.
|
||||
Those include:
|
||||
|
||||
- Third-party trackers lists
|
||||
- TLD lists (used to test the validity of hostnames)
|
||||
- List of public DNS resolvers (for DNS resolving from subdomains)
|
||||
- Top 1M subdomains
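
To illustrate what testing a hostname against a TLD list can look like, here is a minimal Python sketch. It assumes the list has one TLD per line with `#` comment lines, and the helper names are hypothetical; the program's real validity check may differ.

```python
def load_tlds(lines):
    """Parse TLD list lines into a set of lowercase TLDs, skipping comments."""
    tlds = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        tlds.add(line.lower())
    return tlds


def has_valid_tld(hostname, tlds):
    """True if the hostname has at least two labels and ends with a known TLD."""
    labels = hostname.rstrip(".").lower().split(".")
    return len(labels) >= 2 and labels[-1] in tlds


# Small hypothetical excerpt, not the real list
sample = ["# sample header line", "COM", "FR", "ORG"]
tlds = load_tlds(sample)
print(has_valid_tld("hostfiles.frogeye.fr", tlds))
print(has_valid_tld("printer.localdomain", tlds))
```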
|
||||
|
||||
### Import rules into the database
|
||||
|
||||
You need to put the lists of rules for matching in the different subfolders:
|
||||
|
||||
- `rules`: Lists of DNS zones
|
||||
- `rules_ip`: Lists of IP networks (for IP addresses append `/32`)
|
||||
- `rules_asn`: Lists of Autonomous System numbers (IP ranges will be deduced from them)
|
||||
- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted)
|
||||
- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists
|
||||
|
||||
See the provided examples for syntax.
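
The `/32` convention for single IP addresses can be illustrated with Python's standard `ipaddress` module: a `/32` network contains exactly one IPv4 address, so an address rule is just a very small network rule. This is only a sketch of the concept; the function names are made up for the example and this is not how the program parses its rule files.

```python
import ipaddress


def parse_rule(rule):
    """Parse an IP network rule such as '192.0.2.0/24' or '198.51.100.7/32'."""
    return ipaddress.ip_network(rule, strict=True)


def matches(address, networks):
    """True if the address belongs to at least one of the networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)


# Documentation-reserved addresses used as placeholders
networks = [parse_rule("192.0.2.0/24"), parse_rule("198.51.100.7/32")]
print(networks[1].num_addresses)           # a /32 holds a single address
print(matches("192.0.2.55", networks))     # inside the /24
print(matches("198.51.100.8", networks))   # next to, but not in, the /32
```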
|
||||
|
||||
In each folder:
|
||||
|
||||
- `first-party.ext` will be the only files considered for the first-party variant of the list
|
||||
- `*.cache.ext` are from external sources, and thus might be deleted / overwritten
|
||||
- `*.custom.ext` are for sources that you don't want committed
|
||||
|
||||
Then, run `./import_rules.sh`.
|
||||
|
||||
### Add subdomains
|
||||
|
||||
If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive),
|
||||
the top 1M subdomains provided might not be enough.
|
||||
|
||||
You can add them into the `subdomains` folder.
|
||||
It follows the same conventions as the rules folder for `*.cache.ext` and `*.custom.ext` files.
|
||||
|
||||
#### Add personal sources
|
||||
|
||||
Adding your own browsing history will help create a better-suited subdomain list.
|
||||
Here are reference commands for possible sources:
|
||||
|
||||
- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
|
||||
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`
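
For readers less familiar with the `rev | sed` part of the Firefox command: `rev_host` stores each host reversed with a trailing dot, so reversing the string and dropping the leading dot restores the hostname. The same transformation can be sketched in Python as follows (the function names are hypothetical, and the sample values are made up):

```python
def host_from_rev_host(rev_host):
    """Reverse the string, then drop one leading dot, like `rev` followed by `sed`."""
    host = rev_host[::-1]
    return host[1:] if host.startswith(".") else host


def hosts_from_rev_hosts(rev_hosts):
    """Convert a sequence of rev_host values into unique, non-empty hostnames."""
    hosts = []
    seen = set()
    for rev_host in rev_hosts:
        host = host_from_rev_host(rev_host.strip())
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts


# Made-up rows in the reversed-with-trailing-dot layout
rows = ["moc.elpmaxe.www.", "rf.eyegorf.selifsoh.", "moc.elpmaxe.www.", "."]
print(hosts_from_rev_hosts(rows))
```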
|
||||
|
||||
### Collect subdomains from websites
|
||||
#### Collect subdomains from websites
|
||||
|
||||
Just run `collect_subdomain.sh`.
|
||||
You can add the website URLs into the `websites` folder.
|
||||
It follows the same conventions as the rules folder for `*.cache.ext` and `*.custom.ext` files.
|
||||
|
||||
Then, run `collect_subdomain.sh`.
|
||||
This is a long step, and might be memory-intensive from time to time.
|
||||
|
||||
This step is optional if you already added personal sources.
|
||||
Alternatively, you can just download the list of subdomains used to generate the official block list here: <https://hostfiles.frogeye.fr/from_websites.cache.list> (put it in the `subdomains` folder).
|
||||
> **Note:** For first-party tracking, a list of subdomains issued from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list>
|
||||
|
||||
### Extract tracking domains
|
||||
### Resolve DNS records
|
||||
|
||||
Make sure your system is configured with a DNS server without limitation.
|
||||
Then, run `filter_subdomain.sh`.
|
||||
The files you need will be in the folder `dist`.
|
||||
Once you've added subdomains, you'll need to resolve them to get their DNS records.
|
||||
The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory.
|
||||
|
||||
## Contributing
|
||||
Then, run `./resolve_subdomains.sh`.
|
||||
Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet count.
|
||||
|
||||
### Adding websites
|
||||
> Some VPS providers might detect this as a DDoS attack and cut the network access.
|
||||
> Some Wi-Fi connections can be rendered unusable for other uses, some routers might cease to work.
|
||||
> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow ethernet link (Raspberry Pi < 4).
|
||||
|
||||
Just add the URL to the relevant list: `websites/<source>.list`.
|
||||
The DNS records will automatically be imported into the database.
|
||||
If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.
|
||||
|
||||
### Adding first-party trackers regex
|
||||
### Import DNS records from Rapid7
|
||||
|
||||
Just run `./import_rapid7.sh`.
|
||||
This will download about 35 GiB of data, but only the matching records will be stored (about a few MiB for the tracking rules).
|
||||
Note that the download speed will most likely be limited by the database operation throughput (fast RAM will help).
|
||||
|
||||
### Export the lists
|
||||
|
||||
For the tracking list, use `./export_lists.sh`; the output will be in the `dist` folder (please change the links before distributing them).
|
||||
For other purposes, tinker with the `./export.py` program.
|
||||
|
||||
Just add them to `regexes.py`.
|
||||
|
|
74
dist/README.md
vendored
Normal file
|
@ -0,0 +1,74 @@
|
|||
# Geoffrey Frogeye's block list of first-party trackers
|
||||
|
||||
## What's a first-party tracker?
|
||||
|
||||
A tracker is a script put on many websites to gather information about the visitor.
|
||||
They can be used for multiple reasons: statistics, risk management, marketing, ad serving…
|
||||
In any case, they are a threat to Internet users' privacy and many may want to block them.
|
||||
|
||||
Traditionally, trackers are served from a third party.
|
||||
For example, `website1.com` and `website2.com` both load their tracking script from `https://trackercompany.com/trackerscript.js`.
|
||||
In order to block those, one can simply block the hostname `trackercompany.com`, which is what most ad blockers do.
|
||||
|
||||
However, to circumvent this block, tracker companies made the websites using them load trackers from `somestring.website1.com`.
|
||||
The latter is a DNS redirection to `website1.trackercompany.com`, directly to an IP address belonging to the tracking company.
|
||||
Those are called first-party trackers.
|
||||
|
||||
In order to block those trackers, ad blockers would need to block every subdomain pointing to anything under `trackercompany.com` or to their network.
|
||||
Unfortunately, most don't support those blocking methods as they are not DNS-aware, i.e. they only see `somestring.website1.com`.
|
||||
|
||||
This list is an inventory of every `somestring.website1.com` found, to allow non-DNS-aware ad blockers to still block first-party trackers.
|
||||
|
||||
## List variants
|
||||
|
||||
### First-party trackers (recommended)
|
||||
|
||||
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
|
||||
- Raw list: <https://hostfiles.frogeye.fr/firstparty-trackers.txt>
|
||||
|
||||
This list contains every hostname redirecting to [a hand-picked list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/rules/first-party.list).
|
||||
It should be safe from false-positives.
|
||||
Don't be afraid of the size of the list, as this is due to the nature of first-party trackers: a single tracker generates at least one hostname per client (typically two).
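
Each variant comes as a hosts file and as a raw list. As a rough sketch of how the two relate, the snippet below turns a raw list (assumed here to hold one hostname per line) into hosts-file lines. The `0.0.0.0` sink address and the function name are assumptions for illustration; the published hosts files may use a different address or layout.

```python
def to_hosts_lines(hostnames, sink_ip="0.0.0.0"):
    """Turn raw-list lines (one hostname each) into hosts-file lines."""
    lines = []
    for hostname in hostnames:
        hostname = hostname.strip()
        if not hostname or hostname.startswith("#"):
            continue  # skip blank lines and comments
        lines.append("{} {}".format(sink_ip, hostname))
    return lines


# Placeholder entries reusing the hostnames from the explanation above
raw_list = ["# comment line", "somestring.website1.com", "", "website1.trackercompany.com"]
for line in to_hosts_lines(raw_list):
    print(line)
```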
|
||||
|
||||
### First-party only trackers
|
||||
|
||||
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt>
|
||||
- Raw list: <https://hostfiles.frogeye.fr/firstparty-only-trackers.txt>
|
||||
|
||||
This is the same list as above, albeit not containing the hostnames under the tracking company domains.
|
||||
This reduces the size of the list, but it no longer protects against third-party tracking.
|
||||
Use in conjunction with other block lists.
|
||||
|
||||
### Multi-party trackers
|
||||
|
||||
- Hosts file: <https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt>
|
||||
- Raw list: <https://hostfiles.frogeye.fr/multiparty-trackers.txt>
|
||||
|
||||
As first-party trackers usually evolve from third-party trackers, this list contains every hostname redirecting to trackers found in existing lists of third-party trackers (see next section).
|
||||
Since the latter were not designed with first-party trackers in mind, they are likely to contain false-positives.
|
||||
On the other hand, they might protect against first-party trackers that we're not aware of / have not yet confirmed.
|
||||
|
||||
#### Source of third-party trackers
|
||||
|
||||
- [EasyPrivacy](https://easylist.to/easylist/easyprivacy.txt)
|
||||
|
||||
(yes there's only one for now. A lot of existing ones cause a lot of false positives)
|
||||
|
||||
### Multi-party only trackers
|
||||
|
||||
- Hosts file: <https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt>
|
||||
- Raw list: <https://hostfiles.frogeye.fr/multiparty-only-trackers.txt>
|
||||
|
||||
This is the same list as above, albeit not containing the hostnames under the tracking company domains.
|
||||
This reduces the size of the list, but it no longer protects against third-party tracking.
|
||||
Use in conjunction with other block lists, especially the ones used to generate this list in the previous section.
|
||||
|
||||
## Meta
|
||||
|
||||
In case of false positives/negatives, or any other question, contact me the way you like: <https://geoffrey.frogeye.fr>
|
||||
|
||||
The software used to generate this list is available here: <https://git.frogeye.fr/geoffrey/eulaurarien>
|
||||
|
||||
Some of the first-party trackers included in this list have been found by:
|
||||
- [Aeris](https://imirhil.fr/)
|
||||
- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors
|
|
@ -54,7 +54,7 @@ do
|
|||
rules_output=$(./export.py --count $partyness_flags $trackerness_flags)
|
||||
|
||||
function link() { # link partyness, link trackerness
|
||||
url="https://hostfiles.frogeye.fr/${partyness}party-${trackerness}-hosts.txt"
|
||||
url="https://hostfiles.frogeye.fr/${1}party-${2}-hosts.txt"
|
||||
if [ "$1" = "$partyness" ] && [ "$2" = "$trackerness" ]
|
||||
then
|
||||
url="$url (this one)"
|
||||
|
@ -66,17 +66,18 @@ do
|
|||
echo "# First-party trackers host list"
|
||||
echo "# Variant: ${partyness}-party ${trackerness}"
|
||||
echo "#"
|
||||
echo "# About first-party trackers: https://git.frogeye.fr/geoffrey/eulaurarien#whats-a-first-party-tracker"
|
||||
echo "# About first-party trackers: TODO"
|
||||
echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
|
||||
echo "#"
|
||||
echo "# In case of false positives/negatives, or any other question,"
|
||||
echo "# contact me the way you like: https://geoffrey.frogeye.fr"
|
||||
echo "#"
|
||||
echo "# Latest versions:"
|
||||
echo "# Latest versions and variants:"
|
||||
echo "# - First-party trackers : $(link first trackers)"
|
||||
echo "# - … excluding redirected: $(link first only-trackers)"
|
||||
echo "# - First and third party : $(link multi trackers)"
|
||||
echo "# - … excluding redirected: $(link multi only-trackers)"
|
||||
echo '# (variants informations: TODO)'
|
||||
echo '# (you can remove `-hosts` to get the raw list)'
|
||||
echo "#"
|
||||
echo "# Generation date: $gen_date"
|
||||
|
|
|
@ -17,18 +17,6 @@ function dl() {
|
|||
log "Retrieving rules…"
|
||||
rm -f rules*/*.cache.*
|
||||
dl https://easylist.to/easylist/easyprivacy.txt rules_adblock/easyprivacy.cache.txt
|
||||
# From firebog.net Tracking & Telemetry Lists
|
||||
# dl https://v.firebog.net/hosts/Prigent-Ads.txt rules/prigent-ads.cache.list
|
||||
# dl https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-blocklist.txt rules/notrack-blocklist.cache.list
|
||||
# False positives: https://github.com/WaLLy3K/wally3k.github.io/issues/73 -> 69.media.tumblr.com chicdn.net
|
||||
dl https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.2o7Net/hosts rules_hosts/add2o7.cache.txt
|
||||
dl https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt rules_hosts/spy.cache.txt
|
||||
# dl https://raw.githubusercontent.com/Kees1958/WS3_annual_most_used_survey_blocklist/master/w3tech_hostfile.txt rules/w3tech.cache.list
|
||||
# False positives: agreements.apple.com -> edgekey.net
|
||||
# dl https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt rules_hosts/ads-and-tracking-extended.cache.txt # Lots of false-positives
|
||||
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/android-tracking.txt rules_hosts/android-tracking.cache.txt
|
||||
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/SmartTV.txt rules_hosts/smart-tv.cache.txt
|
||||
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/AmazonFireTV.txt rules_hosts/amazon-fire-tv.cache.txt
|
||||
|
||||
log "Retrieving TLD list…"
|
||||
dl http://data.iana.org/TLD/tlds-alpha-by-domain.txt temp/all_tld.temp.list
|
||||
|
|