28 changed files with 1648 additions and 687 deletions

 .gitignore                 |   3
 README.md                  | 149
 database.py                | 739
 db.py                      |  46
 dist/README.md             |  74
 export.py                  |  64
 export_lists.sh            |  98
 feed_asn.py                |  71
 feed_dns.py                | 227
 feed_rules.py              |  54
 fetch_resources.sh         |  24
 filter_subdomains.py       | 160
 filter_subdomains.sh       |  85
 import_rapid7.sh           |  26
 import_rules.sh            |  22
 nameservers/.gitignore     |   2
 nameservers/popular.list   |  24
 regexes.py                 |  21
 resolve_subdomains.py      | 284
 resolve_subdomains.sh      |  17
 rules/first-party.list     |   9
 rules_asn/.gitignore       |   2
 rules_asn/first-party.txt  |  10
 rules_ip/first-party.txt   |  51
 run_tests.py               |  34
 tests/false-positives.csv  |   1
 tests/first-party.csv      |   3
 validate_list.py           |  35
.gitignore
@@ -1,3 +1,2 @@
 *.log
-nameservers
-nameservers.head
+*.p
README.md
@@ -1,98 +1,133 @@
 # eulaurarien

 Generates a host list of first-party trackers for ad-blocking.
+This program is able to generate a list of every hostname that is a DNS redirection to anything in a list of DNS zones and IP networks.

-The latest list is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
+It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md) (learn about first-party trackers by following this link).

 **DISCLAIMER:** I'm by no means an expert on this subject, so my vocabulary or other stuff might be wrong. Use at your own risk.
+If you want to contribute but don't want to create an account on this forge, contact me the way you like: <https://geoffrey.frogeye.fr>

-## What's a first-party tracker?
+## How does this work

-Traditionally, websites load tracker scripts directly.
-For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users.
-In order to block those, one can simply block the host `trackercompany.com`.
+This program takes as input:

-However, to circumvent this easy block, tracker companies made the websites using them load trackers from `somethingirelevant.website1.com`,
-the latter being a DNS redirection to `website1.trackercompany.com`, directly pointing to a server serving the tracking script.
-Those are the first-party trackers.
+- Lists of hostnames to match
+- Lists of DNS zones to match (a domain and its subdomains)
+- Lists of IP addresses / IP networks to match
+- Lists of Autonomous System numbers to match
+- An enormous quantity of DNS records

-Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since:
+It will output the hostnames that are DNS redirections to any item in the provided lists.

-1. Most ad-blockers don't support wildcards
-2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com`
+DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).

-So the only solution is to block every known `somethingirelevant.website1.com`-like subdomain, which is a lot.
-That's where this script comes in: it generates a list of such subdomains.
+Those subdomains can either be provided as is, come from the [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that).
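For instance, a CNAME record whose target falls under a listed zone flags the source hostname. A minimal sketch of that matching idea (the real implementation in `database.py` below uses prefix trees and also handles IPs and ASNs):

```python
# Simplified sketch of the core idea; not the actual implementation.
def zone_matches(hostname: str, zone: str) -> bool:
    # A zone rule matches the domain itself and all its subdomains
    return hostname == zone or hostname.endswith('.' + zone)

# Hypothetical CNAME record: somestring.website1.com -> website1.trackercompany.com
record = ('somestring.website1.com', 'website1.trackercompany.com')
if zone_matches(record[1], 'trackercompany.com'):
    print(record[0])  # hostname to add to the block list
```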

-## How does this script work
+## Usage

-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
+Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md).

-It takes as input a list of websites with trackers included.
-So far, this list is manually generated from the list of clients of such first-party trackers
-(later we should use a general list of websites to be more exhaustive).
-It opens each of those websites (just the homepage) in a web browser, and records the domains of the network requests the page makes.
+The following is for the people wanting to build their own list.

-Additionally, or alternatively, you can feed the script some browsing history and get domains from there.
+### Requirements

-It then finds the DNS redirections of those domains, and compares them with regexes of known tracking domains.
-It finally outputs the matching ones.
+Depending on the sources you'll be using to generate the list, you'll need to install some of the following:

-## Requirements
+- [Bash](https://www.gnu.org/software/bash/bash.html)
+- [Coreutils](https://www.gnu.org/software/coreutils/)
+- [curl](https://curl.haxx.se)
+- [pv](http://www.ivarch.com/programs/pv.shtml)
+- [Python 3.4+](https://www.python.org/)
+- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry, I can't help myself)
+- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
+- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
+- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
+- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source)

-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
+### Create a new database

-Just to build the list, you can find an already-built list in the releases.
+The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASNs, IPs, hostnames, zones…) and every entity leading to them.
+For now there's no way to remove data from it, so here's the command to recreate it: `./db.py --initialize`.
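A quick way to peek at what the file holds (a sketch going by `database.py` in this changeset; run it from the repository root so the `database` module's classes can be unpickled):

```python
# Sketch: inspect blocking.p (layout per database.py in this changeset;
# the version constant and payload order may evolve).
import pickle

with open('blocking.p', 'rb') as f:
    version, (rules, domtree, asns, ip4tree) = pickle.load(f)
print(version)  # should equal database.Database.VERSION (18 here)
```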

-- Bash
-- [Python 3.4+](https://www.python.org/)
-- [progressbar2](https://pypi.org/project/progressbar2/)
-- dnspython
-- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up)
+### Gather external sources

-(if you don't want to collect the subdomains, you can skip the following)
+External sources are not stored in this repository.
+You'll need to fetch them by running `./fetch_resources.sh`.
+Those include:

-- Firefox
-- Selenium
-- seleniumwire
+- Third-party tracker lists
+- TLD lists (used to test the validity of hostnames)
+- List of public DNS resolvers (for DNS resolving from subdomains)
+- Top 1M subdomains

-## Usage
+### Import rules into the database

+You need to put the lists of rules for matching in the different subfolders:

+- `rules`: Lists of DNS zones
+- `rules_ip`: Lists of IP networks (for IP addresses, append `/32`)
+- `rules_asn`: Lists of Autonomous System numbers (IP ranges will be deduced from them)
+- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the entries concerning domains will be extracted)
+- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists

+See the provided examples for syntax.

+In each folder:

-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
+- `first-party.ext` will be the only files considered for the first-party variant of the list
+- `*.cache.ext` are from external sources, and thus might be deleted / overwritten
+- `*.custom.ext` are for sources that you don't want committed

-This is only if you want to build the list yourself.
-If you just want to use the list, the latest build is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
-It was built using additional sources not included in this repository for privacy reasons.
+Then, run `./import_rules.sh`.
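Under the hood this feeds each rule into the database through `feed_rules.py` (shown at the end of this changeset). A minimal sketch of the equivalent direct API call, with a hypothetical zone:

```python
# Minimal sketch of what importing one zone rule boils down to,
# using the database.py API from this changeset.
# Assumes ./fetch_resources.sh ran first (domain validation needs
# temp/all_tld.list) and that blocking.p exists or can be initialized.
import time
import database

db = database.Database()
db.set_zone('trackercompany.com',             # hypothetical zone rule
            updated=int(time.time()),
            source=database.RuleFirstPath())  # or RuleMultiPath()
db.save()
```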

-### Add personal sources
+### Add subdomains

-The list of websites provided in this script is by no means exhaustive,
-so adding your own browsing history will help create a better list.
+If you plan to resolve DNS records yourself (as the DNS record datasets are not exhaustive),
+the top 1M subdomains provided might not be enough.

+You can add them into the `subdomains` folder.
+It follows the same specificities as the rules folder for `*.cache.ext` and `*.custom.ext` files.

+#### Add personal sources

+Adding your own browsing history will help create a more suitable subdomains list.
 Here are reference commands for possible sources:

 - **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
 - **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`

-### Collect subdomains from websites
+#### Collect subdomains from websites

+You can add the website URLs into the `websites` folder.
+It follows the same specificities as the rules folder for `*.cache.ext` and `*.custom.ext` files.

-Just run `collect_subdomain.sh`.
+Then, run `collect_subdomain.sh`.
 This is a long step, and might be memory-intensive from time to time.

-This step is optional if you already added personal sources.
-Alternatively, you can just download the list of subdomains used to generate the official block list here: <https://hostfiles.frogeye.fr/from_websites.cache.list> (put it in the `subdomains` folder).
+> **Note:** For first-party tracking, a list of subdomains issued from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list>

 ### Resolve DNS records

+Once you've added subdomains, you'll need to resolve them to get their DNS records.
+The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory.

 Then, run `./resolve_subdomains.sh`.
 Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet number.

-### Extract tracking domains
+> Some VPS providers might detect this as a DDoS attack and cut the network access.
+> Some Wi-Fi connections can be rendered unusable for other uses, and some routers might cease to work.
+> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow Ethernet link (Raspberry Pi < 4).

-Make sure your system is configured with a DNS server without limitation.
-Then, run `filter_subdomains.sh`.
-The files you need will be in the folder `dist`.
+The DNS records will automatically be imported into the database.
+If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.
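For reference, massdns's text output consists of a header line per query (carrying a timestamp) followed by `name type value` lines, which is what `feed_dns.py`'s `MassDnsParser` (below) expects. A sketch with a made-up record:

```python
# Sketch: how MassDnsParser reads one massdns text record
# (hypothetical hostnames; trailing dots stripped like in the parser).
line = 'somestring.website1.com. CNAME website1.trackercompany.com.'
name, rtype, value = line.split(' ')
name, value = name[:-1], value[:-1]  # drop the trailing dots
print(rtype, name, '->', value)
# CNAME somestring.website1.com -> website1.trackercompany.com
```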

-## Contributing
+### Import DNS records from Rapid7

-### Adding websites
+Just run `./import_rapid7.sh`.
+This will download about 35 GiB of data, but only the matching records will be stored (a few MiB for the tracking rules).
+Note that the download speed will most likely be limited by the database operation throughput (fast RAM will help).
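Each line of the Rapid7 FDNS dataset is a small JSON object; `Rapid7Parser` in `feed_dns.py` (below) pulls out its four fields by splitting on double quotes instead of paying for full JSON parsing. A sketch with a made-up record:

```python
# Sketch: the quote-splitting trick Rapid7Parser uses on FDNS lines
# (made-up record; the field order matters for the split offsets).
line = ('{"timestamp":"1573800000","name":"somestring.website1.com",'
        '"type":"cname","value":"website1.trackercompany.com"}')
split = line.split('"')
data = {split[k]: split[k + 2] for k in range(1, 14, 4)}
print(data['type'], data['name'], '->', data['value'])
```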

-Just add the URL to the relevant list: `websites/<source>.list`.
+### Export the lists

-### Adding first-party trackers regex
+For the tracking list, use `./export_lists.sh`; the output will be in the `dist` folder (please change the links before distributing them).
+For other purposes, tinker with the `./export.py` program.
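A sketch of such tinkering, mirroring what `./export.py --first-party` does through the `Database.list_records` API:

```python
# Sketch: list first-party hostname rules straight from the database,
# equivalent to `./export.py --first-party`.
import database

db = database.Database()
for hostname in db.list_records(first_party_only=True,
                                hostnames_only=True):
    print(hostname)
```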

-Just add them to `regexes.py`.
database.py
@@ -0,0 +1,739 @@
#!/usr/bin/env python3

"""
Utility functions to interact with the database.
"""

import typing
import time
import logging
import coloredlogs
import pickle
import numpy
import math

TLD_LIST: typing.Set[str] = set()

coloredlogs.install(
    level='DEBUG',
    fmt='%(asctime)s %(name)s %(levelname)s %(message)s'
)

Asn = int
Timestamp = int
Level = int


class Path():
    # FP add boolean here
    pass


class RulePath(Path):
    def __str__(self) -> str:
        return '(rule)'


class RuleFirstPath(RulePath):
    def __str__(self) -> str:
        return '(first-party rule)'


class RuleMultiPath(RulePath):
    def __str__(self) -> str:
        return '(multi-party rule)'


class DomainPath(Path):
    def __init__(self, parts: typing.List[str]):
        self.parts = parts

    def __str__(self) -> str:
        return '?.' + Database.unpack_domain(self)


class HostnamePath(DomainPath):
    def __str__(self) -> str:
        return Database.unpack_domain(self)


class ZonePath(DomainPath):
    def __str__(self) -> str:
        return '*.' + Database.unpack_domain(self)


class AsnPath(Path):
    def __init__(self, asn: Asn):
        self.asn = asn

    def __str__(self) -> str:
        return Database.unpack_asn(self)


class Ip4Path(Path):
    def __init__(self, value: int, prefixlen: int):
        self.value = value
        self.prefixlen = prefixlen

    def __str__(self) -> str:
        return Database.unpack_ip4network(self)


class Match():
    def __init__(self) -> None:
        self.source: typing.Optional[Path] = None
        self.updated: int = 0
        self.dupplicate: bool = False

        # Cache
        self.level: int = 0
        self.first_party: bool = False
        self.references: int = 0

    def active(self, first_party: typing.Optional[bool] = None) -> bool:
        if self.updated == 0 or (first_party and not self.first_party):
            return False
        return True


class AsnNode(Match):
    def __init__(self) -> None:
        Match.__init__(self)
        self.name = ''


class DomainTreeNode():
    def __init__(self) -> None:
        self.children: typing.Dict[str, DomainTreeNode] = dict()
        self.match_zone = Match()
        self.match_hostname = Match()


class IpTreeNode(Match):
    # Binary trie over the 32 bits of an IPv4 address
    def __init__(self) -> None:
        Match.__init__(self)
        self.zero: typing.Optional[IpTreeNode] = None
        self.one: typing.Optional[IpTreeNode] = None


Node = typing.Union[DomainTreeNode, IpTreeNode, AsnNode]
MatchCallable = typing.Callable[[Path, Match], typing.Any]


class Profiler():
    def __init__(self) -> None:
        self.log = logging.getLogger('profiler')
        self.time_last = time.perf_counter()
        self.time_step = 'init'
        self.time_dict: typing.Dict[str, float] = dict()
        self.step_dict: typing.Dict[str, int] = dict()

    def enter_step(self, name: str) -> None:
        now = time.perf_counter()
        try:
            self.time_dict[self.time_step] += now - self.time_last
            self.step_dict[self.time_step] += int(name != self.time_step)
        except KeyError:
            self.time_dict[self.time_step] = now - self.time_last
            self.step_dict[self.time_step] = 1
        self.time_step = name
        self.time_last = time.perf_counter()

    def profile(self) -> None:
        self.enter_step('profile')
        total = sum(self.time_dict.values())
        for key, secs in sorted(self.time_dict.items(), key=lambda t: t[1]):
            times = self.step_dict[key]
            self.log.debug(f"{key:<20}: {times:9d} × {secs/times:5.3e} "
                           f"= {secs:9.2f} s ({secs/total:7.2%}) ")
        self.log.debug(f"{'total':<20}: "
                       f"{total:9.2f} s ({1:7.2%})")


class Database(Profiler):
    VERSION = 18
    PATH = "blocking.p"

    def initialize(self) -> None:
        self.log.warning(
            "Creating database version: %d ",
            Database.VERSION)
        # Dummy match objects that everything refers to
        self.rules: typing.List[Match] = list()
        for first_party in (False, True):
            m = Match()
            m.updated = 1
            m.level = 0
            m.first_party = first_party
            self.rules.append(m)
        self.domtree = DomainTreeNode()
        self.asns: typing.Dict[Asn, AsnNode] = dict()
        self.ip4tree = IpTreeNode()

    def load(self) -> None:
        self.enter_step('load')
        try:
            with open(self.PATH, 'rb') as db_fdsec:
                version, data = pickle.load(db_fdsec)
                if version == Database.VERSION:
                    self.rules, self.domtree, self.asns, self.ip4tree = data
                    return
                self.log.warning(
                    "Outdated database version found: %d, "
                    "it will be rebuilt.",
                    version)
        except (TypeError, AttributeError, EOFError):
            self.log.error(
                "Corrupt (or heavily outdated) database found, "
                "it will be rebuilt.")
        except FileNotFoundError:
            pass
        self.initialize()

    def save(self) -> None:
        self.enter_step('save')
        with open(self.PATH, 'wb') as db_fdsec:
            data = self.rules, self.domtree, self.asns, self.ip4tree
            pickle.dump((self.VERSION, data), db_fdsec)
        self.profile()

    def __init__(self) -> None:
        Profiler.__init__(self)
        self.log = logging.getLogger('db')
        self.load()
        self.ip4cache_shift: int = 32
        self.ip4cache = numpy.ones(1)

    def _set_ip4cache(self, path: Path, _: Match) -> None:
        assert isinstance(path, Ip4Path)
        self.enter_step('set_ip4cache')
        mini = path.value >> self.ip4cache_shift
        maxi = (path.value + 2**(32-path.prefixlen)) >> self.ip4cache_shift
        if mini == maxi:
            self.ip4cache[mini] = True
        else:
            self.ip4cache[mini:maxi] = True

    def fill_ip4cache(self, max_size: int = 512*1024**2) -> None:
        """
        Size in bytes
        """
        if max_size > 2**32/8:
            self.log.warning("Allocating more than 512 MiB of RAM for "
                             "the Ip4 cache is not necessary.")
        max_cache_width = int(math.log2(max(1, max_size*8)))
        # Cap at 32: one cache bit per IPv4 address at most
        cache_width = min(32, max_cache_width)
        self.ip4cache_shift = 32-cache_width
        cache_size = 2**cache_width
        self.ip4cache = numpy.zeros(cache_size, dtype=bool)
        for _ in self.exec_each_ip4(self._set_ip4cache):
            pass

    @staticmethod
    def populate_tld_list() -> None:
        with open('temp/all_tld.list', 'r') as tld_fdesc:
            for tld in tld_fdesc:
                tld = tld.strip()
                TLD_LIST.add(tld)

    @staticmethod
    def validate_domain(path: str) -> bool:
        if len(path) > 255:
            return False
        splits = path.split('.')
        if not TLD_LIST:
            Database.populate_tld_list()
        if splits[-1] not in TLD_LIST:
            return False
        for split in splits:
            if not 1 <= len(split) <= 63:
                return False
        return True

    @staticmethod
    def pack_domain(domain: str) -> DomainPath:
        # Parts are stored reversed (TLD first) so the domain tree
        # can be walked from the zone down to the subdomain
        return DomainPath(domain.split('.')[::-1])

    @staticmethod
    def unpack_domain(domain: DomainPath) -> str:
        return '.'.join(domain.parts[::-1])

    @staticmethod
    def pack_asn(asn: str) -> AsnPath:
        asn = asn.upper()
        if asn.startswith('AS'):
            asn = asn[2:]
        return AsnPath(int(asn))

    @staticmethod
    def unpack_asn(asn: AsnPath) -> str:
        return f'AS{asn.asn}'

    @staticmethod
    def validate_ip4address(path: str) -> bool:
        splits = path.split('.')
        if len(splits) != 4:
            return False
        for split in splits:
            try:
                if not 0 <= int(split) <= 255:
                    return False
            except ValueError:
                return False
        return True

    @staticmethod
    def pack_ip4address(address: str) -> Ip4Path:
        addr = 0
        for split in address.split('.'):
            addr = (addr << 8) + int(split)
        return Ip4Path(addr, 32)

    @staticmethod
    def unpack_ip4address(address: Ip4Path) -> str:
        addr = address.value
        assert address.prefixlen == 32
        octets = [0] * 4
        for o in reversed(range(4)):
            octets[o] = addr & 0xFF
            addr >>= 8
        return '.'.join(map(str, octets))

    @staticmethod
    def validate_ip4network(path: str) -> bool:
        # A bit generous but ok for our usage
        splits = path.split('/')
        if len(splits) != 2:
            return False
        if not Database.validate_ip4address(splits[0]):
            return False
        try:
            if not 0 <= int(splits[1]) <= 32:
                return False
        except ValueError:
            return False
        return True

    @staticmethod
    def pack_ip4network(network: str) -> Ip4Path:
        address, prefixlen_str = network.split('/')
        prefixlen = int(prefixlen_str)
        addr = Database.pack_ip4address(address)
        addr.prefixlen = prefixlen
        return addr

    @staticmethod
    def unpack_ip4network(network: Ip4Path) -> str:
        addr = network.value
        octets = [0] * 4
        for o in reversed(range(4)):
            octets[o] = addr & 0xFF
            addr >>= 8
        return '.'.join(map(str, octets)) + '/' + str(network.prefixlen)

    def get_match(self, path: Path) -> Match:
        if isinstance(path, RuleMultiPath):
            return self.rules[0]
        elif isinstance(path, RuleFirstPath):
            return self.rules[1]
        elif isinstance(path, AsnPath):
            return self.asns[path.asn]
        elif isinstance(path, DomainPath):
            dicd = self.domtree
            for part in path.parts:
                dicd = dicd.children[part]
            if isinstance(path, HostnamePath):
                return dicd.match_hostname
            elif isinstance(path, ZonePath):
                return dicd.match_zone
            else:
                raise ValueError
        elif isinstance(path, Ip4Path):
            dici = self.ip4tree
            for i in range(31, 31-path.prefixlen, -1):
                bit = (path.value >> i) & 0b1
                dici_next = dici.one if bit else dici.zero
                if not dici_next:
                    raise IndexError
                dici = dici_next
            return dici
        else:
            raise ValueError

    def exec_each_asn(self,
                      callback: MatchCallable,
                      ) -> typing.Any:
        for asn in self.asns:
            match = self.asns[asn]
            if match.active():
                c = callback(
                    AsnPath(asn),
                    match,
                )
                try:
                    yield from c
                except TypeError:  # not iterable
                    pass

    def exec_each_domain(self,
                         callback: MatchCallable,
                         _dic: typing.Optional[DomainTreeNode] = None,
                         _par: typing.Optional[DomainPath] = None,
                         ) -> typing.Any:
        _dic = _dic or self.domtree
        _par = _par or DomainPath([])
        if _dic.match_hostname.active():
            c = callback(
                HostnamePath(_par.parts),
                _dic.match_hostname,
            )
            try:
                yield from c
            except TypeError:  # not iterable
                pass
        if _dic.match_zone.active():
            c = callback(
                ZonePath(_par.parts),
                _dic.match_zone,
            )
            try:
                yield from c
            except TypeError:  # not iterable
                pass
        for part in _dic.children:
            dic = _dic.children[part]
            yield from self.exec_each_domain(
                callback,
                _dic=dic,
                _par=DomainPath(_par.parts + [part])
            )

    def exec_each_ip4(self,
                      callback: MatchCallable,
                      _dic: typing.Optional[IpTreeNode] = None,
                      _par: typing.Optional[Ip4Path] = None,
                      ) -> typing.Any:
        _dic = _dic or self.ip4tree
        _par = _par or Ip4Path(0, 0)
        if _dic.active():
            c = callback(
                _par,
                _dic,
            )
            try:
                yield from c
            except TypeError:  # not iterable
                pass

        # 0
        pref = _par.prefixlen + 1
        dic = _dic.zero
        if dic:
            # addr0 = _par.value & (0xFFFFFFFF ^ (1 << (32-pref)))
            # assert addr0 == _par.value
            addr0 = _par.value
            yield from self.exec_each_ip4(
                callback,
                _dic=dic,
                _par=Ip4Path(addr0, pref)
            )
        # 1
        dic = _dic.one
        if dic:
            addr1 = _par.value | (1 << (32-pref))
            # assert addr1 != _par.value
            yield from self.exec_each_ip4(
                callback,
                _dic=dic,
                _par=Ip4Path(addr1, pref)
            )

    def exec_each(self,
                  callback: MatchCallable,
                  ) -> typing.Any:
        yield from self.exec_each_domain(callback)
        yield from self.exec_each_ip4(callback)
        yield from self.exec_each_asn(callback)

    def update_references(self) -> None:
        # Should be correctly calculated normally,
        # keeping this just in case
        def reset_references_cb(path: Path,
                                match: Match
                                ) -> None:
            match.references = 0
        for _ in self.exec_each(reset_references_cb):
            pass

        def increment_references_cb(path: Path,
                                    match: Match
                                    ) -> None:
            if match.source:
                source = self.get_match(match.source)
                source.references += 1
        for _ in self.exec_each(increment_references_cb):
            pass

    def prune(self, before: int, base_only: bool = False) -> None:
        raise NotImplementedError

    def explain(self, path: Path) -> str:
        match = self.get_match(path)
        if isinstance(match, AsnNode):
            string = f'{path} ({match.name}) #{match.references}'
        else:
            string = f'{path} #{match.references}'
        if match.source:
            string += f' ← {self.explain(match.source)}'
        return string

    def list_records(self,
                     first_party_only: bool = False,
                     end_chain_only: bool = False,
                     no_dupplicates: bool = False,
                     rules_only: bool = False,
                     hostnames_only: bool = False,
                     explain: bool = False,
                     ) -> typing.Iterable[str]:

        def export_cb(path: Path, match: Match
                      ) -> typing.Iterable[str]:
            if first_party_only and not match.first_party:
                return
            if end_chain_only and match.references > 0:
                return
            if no_dupplicates and match.dupplicate:
                return
            if rules_only and match.level > 1:
                return
            if hostnames_only and not isinstance(path, HostnamePath):
                return

            if explain:
                yield self.explain(path)
            else:
                yield str(path)

        yield from self.exec_each(export_cb)

    def count_records(self,
                      first_party_only: bool = False,
                      end_chain_only: bool = False,
                      no_dupplicates: bool = False,
                      rules_only: bool = False,
                      hostnames_only: bool = False,
                      ) -> str:
        memo: typing.Dict[str, int] = dict()

        def count_records_cb(path: Path, match: Match) -> None:
            if first_party_only and not match.first_party:
                return
            if end_chain_only and match.references > 0:
                return
            if no_dupplicates and match.dupplicate:
                return
            if rules_only and match.level > 1:
                return
            if hostnames_only and not isinstance(path, HostnamePath):
                return

            try:
                memo[path.__class__.__name__] += 1
            except KeyError:
                memo[path.__class__.__name__] = 1

        for _ in self.exec_each(count_records_cb):
            pass

        split: typing.List[str] = list()
        for key, value in sorted(memo.items(), key=lambda s: s[0]):
            split.append(f'{key[:-4].lower()}s: {value}')
        return ', '.join(split)

    def get_domain(self, domain_str: str) -> typing.Iterable[DomainPath]:
        self.enter_step('get_domain_pack')
        domain = self.pack_domain(domain_str)
        self.enter_step('get_domain_brws')
        dic = self.domtree
        depth = 0
        for part in domain.parts:
            if dic.match_zone.active():
                self.enter_step('get_domain_yield')
                yield ZonePath(domain.parts[:depth])
                self.enter_step('get_domain_brws')
            if part not in dic.children:
                return
            dic = dic.children[part]
            depth += 1
        if dic.match_zone.active():
            self.enter_step('get_domain_yield')
            yield ZonePath(domain.parts)
        if dic.match_hostname.active():
            self.enter_step('get_domain_yield')
            yield HostnamePath(domain.parts)

    def get_ip4(self, ip4_str: str) -> typing.Iterable[Path]:
        self.enter_step('get_ip4_pack')
        ip4 = self.pack_ip4address(ip4_str)
        self.enter_step('get_ip4_cache')
        if not self.ip4cache[ip4.value >> self.ip4cache_shift]:
            return
        self.enter_step('get_ip4_brws')
        dic = self.ip4tree
        for i in range(31, 31-ip4.prefixlen, -1):
            bit = (ip4.value >> i) & 0b1
            if dic.active():
                self.enter_step('get_ip4_yield')
                yield Ip4Path(ip4.value >> (i+1) << (i+1), 31-i)
                self.enter_step('get_ip4_brws')
            next_dic = dic.one if bit else dic.zero
            if next_dic is None:
                return
            dic = next_dic
        if dic.active():
            self.enter_step('get_ip4_yield')
            yield ip4

    def _set_match(self,
                   match: Match,
                   updated: int,
                   source: Path,
                   source_match: typing.Optional[Match] = None,
                   dupplicate: bool = False,
                   ) -> None:
        # source_match is in parameters because most of the time
        # its parent function needs it too,
        # so it can pass it to save a traversal
        source_match = source_match or self.get_match(source)
        new_level = source_match.level + 1
        if updated > match.updated or new_level < match.level \
                or source_match.first_party > match.first_party:
            # NOTE FP and level of matches referencing this one
            # won't be updated until run or prune
            if match.source:
                old_source = self.get_match(match.source)
                old_source.references -= 1
            match.updated = updated
            match.level = new_level
            match.first_party = source_match.first_party
            match.source = source
            source_match.references += 1
            match.dupplicate = dupplicate

    def _set_domain(self,
                    hostname: bool,
                    domain_str: str,
                    updated: int,
                    source: Path) -> None:
        self.enter_step('set_domain_val')
        if not Database.validate_domain(domain_str):
            raise ValueError(f"Invalid domain: {domain_str}")
        self.enter_step('set_domain_pack')
        domain = self.pack_domain(domain_str)
        self.enter_step('set_domain_fp')
        source_match = self.get_match(source)
        is_first_party = source_match.first_party
        self.enter_step('set_domain_brws')
        dic = self.domtree
        dupplicate = False
        for part in domain.parts:
            if part not in dic.children:
                dic.children[part] = DomainTreeNode()
            dic = dic.children[part]
            if dic.match_zone.active(is_first_party):
                dupplicate = True
        if hostname:
            match = dic.match_hostname
        else:
            match = dic.match_zone
        self._set_match(
            match,
            updated,
            source,
            source_match=source_match,
            dupplicate=dupplicate,
        )

    def set_hostname(self,
                     *args: typing.Any, **kwargs: typing.Any
                     ) -> None:
        self._set_domain(True, *args, **kwargs)

    def set_zone(self,
                 *args: typing.Any, **kwargs: typing.Any
                 ) -> None:
        self._set_domain(False, *args, **kwargs)

    def set_asn(self,
                asn_str: str,
                updated: int,
                source: Path) -> None:
        self.enter_step('set_asn')
        path = self.pack_asn(asn_str)
        if path.asn in self.asns:
            match = self.asns[path.asn]
        else:
            match = AsnNode()
            self.asns[path.asn] = match
        self._set_match(
            match,
            updated,
            source,
        )

    def _set_ip4(self,
                 ip4: Ip4Path,
                 updated: int,
                 source: Path) -> None:
        self.enter_step('set_ip4_fp')
        source_match = self.get_match(source)
        is_first_party = source_match.first_party
        self.enter_step('set_ip4_brws')
        dic = self.ip4tree
        dupplicate = False
        for i in range(31, 31-ip4.prefixlen, -1):
            bit = (ip4.value >> i) & 0b1
            next_dic = dic.one if bit else dic.zero
            if next_dic is None:
                next_dic = IpTreeNode()
                if bit:
                    dic.one = next_dic
                else:
                    dic.zero = next_dic
            dic = next_dic
            if dic.active(is_first_party):
                dupplicate = True
        self._set_match(
            dic,
            updated,
            source,
            source_match=source_match,
            dupplicate=dupplicate,
        )
        self._set_ip4cache(ip4, dic)

    def set_ip4address(self,
                       ip4address_str: str,
                       *args: typing.Any, **kwargs: typing.Any
                       ) -> None:
        self.enter_step('set_ip4add_val')
        if not Database.validate_ip4address(ip4address_str):
            raise ValueError(f"Invalid ip4address: {ip4address_str}")
        self.enter_step('set_ip4add_pack')
        ip4 = self.pack_ip4address(ip4address_str)
        self._set_ip4(ip4, *args, **kwargs)

    def set_ip4network(self,
                       ip4network_str: str,
                       *args: typing.Any, **kwargs: typing.Any
                       ) -> None:
        self.enter_step('set_ip4net_val')
        if not Database.validate_ip4network(ip4network_str):
            raise ValueError(f"Invalid ip4network: {ip4network_str}")
        self.enter_step('set_ip4net_pack')
        ip4 = self.pack_ip4network(ip4network_str)
        self._set_ip4(ip4, *args, **kwargs)
db.py
@@ -0,0 +1,46 @@
#!/usr/bin/env python3

import argparse
import database
import time
import os

if __name__ == '__main__':

    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Database operations")
    parser.add_argument(
        '-i', '--initialize', action='store_true',
        help="Reconstruct the whole database")
    parser.add_argument(
        '-p', '--prune', action='store_true',
        help="Remove old entries from database")
    parser.add_argument(
        '-b', '--prune-base', action='store_true',
        help="With --prune, only prune base rules "
        "(the ones added by ./feed_rules.py)")
    parser.add_argument(
        '-s', '--prune-before', type=int,
        default=(int(time.time()) - 60*60*24*31*6),
        help="With --prune, only rules updated before "
        "this UNIX timestamp will be deleted")
    parser.add_argument(
        '-r', '--references', action='store_true',
        help="DEBUG: Update the reference count")
    args = parser.parse_args()

    if not args.initialize:
        DB = database.Database()
    else:
        if os.path.isfile(database.Database.PATH):
            os.unlink(database.Database.PATH)
        DB = database.Database()

    DB.enter_step('main')
    if args.prune:
        DB.prune(before=args.prune_before, base_only=args.prune_base)
    if args.references:
        DB.update_references()

    DB.save()
dist/README.md
@@ -0,0 +1,74 @@
# Geoffrey Frogeye's block list of first-party trackers

## What's a first-party tracker?

A tracker is a script put on many websites to gather information about the visitors.
They can be used for multiple reasons: statistics, risk management, marketing, ad serving…
In any case, they are a threat to Internet users' privacy and many may want to block them.

Traditionally, trackers are served from a third party.
For example, `website1.com` and `website2.com` both load their tracking script from `https://trackercompany.com/trackerscript.js`.
In order to block those, one can simply block the hostname `trackercompany.com`, which is what most ad blockers do.

However, to circumvent this block, tracker companies made the websites using them load trackers from `somestring.website1.com`.
The latter is a DNS redirection to `website1.trackercompany.com`, directly to an IP address belonging to the tracking company.
Those are called first-party trackers.

In order to block those trackers, ad blockers would need to block every subdomain pointing to anything under `trackercompany.com` or to their network.
Unfortunately, most don't support those blocking methods as they are not DNS-aware, i.e. they only see `somestring.website1.com`.

This list is an inventory of every `somestring.website1.com` found, to allow non-DNS-aware ad blockers to still block first-party trackers.
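To observe such a redirection yourself, query the CNAME record of a blocked hostname; a sketch with the hypothetical domains above, using dnspython:

```python
# Illustration with hypothetical domains (any real entry from the raw
# list behaves the same); requires dnspython: pip install dnspython
import dns.resolver

answer = dns.resolver.resolve('somestring.website1.com', 'CNAME')
for rr in answer:
    print(rr.target)  # would show website1.trackercompany.com.
```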

## List variants

### First-party trackers (recommended)

- Hosts file: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/firstparty-trackers.txt>

This list contains every hostname redirecting to [a hand-picked list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/rules/first-party.list).
It should be safe from false positives.
Don't be afraid of the size of the list, as this is due to the nature of first-party trackers: a single tracker generates at least one hostname per client (typically two).

### First-party only trackers

- Hosts file: <https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/firstparty-only-trackers.txt>

This is the same list as above, albeit not containing the hostnames under the tracking companies' domains.
This reduces the size of the list, but it doesn't protect against third-party tracking either.
Use in conjunction with other block lists.

### Multi-party trackers

- Hosts file: <https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/multiparty-trackers.txt>

As first-party trackers usually evolve from third-party trackers, this list contains every hostname redirecting to trackers found in existing lists of third-party trackers (see next section).
Since the latter were not designed with first-party trackers in mind, they are likely to contain false positives.
On the other hand, they might protect against first-party trackers that we're not aware of / have not yet confirmed.

#### Source of third-party trackers

- [EasyPrivacy](https://easylist.to/easylist/easyprivacy.txt)

(yes, there's only one for now; a lot of existing ones cause a lot of false positives)

### Multi-party only trackers

- Hosts file: <https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/multiparty-only-trackers.txt>

This is the same list as above, albeit not containing the hostnames under the tracking companies' domains.
This reduces the size of the list, but it doesn't protect against third-party tracking either.
Use in conjunction with other block lists, especially the ones used to generate this list in the previous section.

## Meta

In case of false positives/negatives, or any other question, contact me the way you like: <https://geoffrey.frogeye.fr>

The software used to generate this list is available here: <https://git.frogeye.fr/geoffrey/eulaurarien>

Some of the first-party trackers included in this list have been found by:

- [Aeris](https://imirhil.fr/)
- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors
export.py
@@ -0,0 +1,64 @@
#!/usr/bin/env python3

import database
import argparse
import sys


if __name__ == '__main__':

    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Export the hostnames rules stored "
        "in the Database as plain text")
    parser.add_argument(
        '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
        help="Output file, one rule per line")
    parser.add_argument(
        '-f', '--first-party', action='store_true',
        help="Only output rules issued from first-party sources")
    parser.add_argument(
        '-e', '--end-chain', action='store_true',
        help="Only output rules that are not referenced by any other")
    parser.add_argument(
        '-r', '--rules', action='store_true',
        help="Output all kinds of rules, not just hostnames")
    parser.add_argument(
        '-b', '--base-rules', action='store_true',
        help="Output base rules "
        "(the ones added by ./feed_rules.py) "
        "(implies --rules)")
    parser.add_argument(
        '-d', '--no-dupplicates', action='store_true',
        help="Do not output rules that already match a zone/network rule "
        "(e.g. dummy.example.com when there's a zone example.com rule)")
    parser.add_argument(
        '-x', '--explain', action='store_true',
        help="Show the chain of rules leading to one "
        "(and the number of references they have)")
    parser.add_argument(
        '-c', '--count', action='store_true',
        help="Show the number of rules per type instead of listing them")
    args = parser.parse_args()

    DB = database.Database()

    if args.count:
        assert not args.explain
        print(DB.count_records(
            first_party_only=args.first_party,
            end_chain_only=args.end_chain,
            no_dupplicates=args.no_dupplicates,
            rules_only=args.base_rules,
            hostnames_only=not (args.rules or args.base_rules),
        ))
    else:
        for domain in DB.list_records(
            first_party_only=args.first_party,
            end_chain_only=args.end_chain,
            no_dupplicates=args.no_dupplicates,
            rules_only=args.base_rules,
            hostnames_only=not (args.rules or args.base_rules),
            explain=args.explain,
        ):
            print(domain, file=args.output)
export_lists.sh
@@ -0,0 +1,98 @@
#!/usr/bin/env bash

function log() {
    echo -e "\033[33m$*\033[0m"
}

log "Calculating statistics…"
gen_date=$(date -Isec)
gen_software=$(git describe --tags)
number_websites=$(wc -l < temp/all_websites.list)
number_subdomains=$(wc -l < temp/all_subdomains.list)
number_dns=$(grep -c '^$' temp/all_resolved.txt)

for partyness in {first,multi}
do
    if [ $partyness = "first" ]
    then
        partyness_flags="--first-party"
    else
        partyness_flags=""
    fi

    echo "Statistics for ${partyness}-party trackers"
    echo "Input rules: $(./export.py --count --base-rules $partyness_flags)"
    echo "Subsequent rules: $(./export.py --count --rules $partyness_flags)"
    echo "Subsequent rules (no dupplicate): $(./export.py --count --rules --no-dupplicates $partyness_flags)"
    echo "Output hostnames: $(./export.py --count $partyness_flags)"
    echo "Output hostnames (no dupplicate): $(./export.py --count --no-dupplicates $partyness_flags)"
    echo "Output hostnames (end-chain only): $(./export.py --count --end-chain $partyness_flags)"
    echo "Output hostnames (no dupplicate, end-chain only): $(./export.py --count --no-dupplicates --end-chain $partyness_flags)"
    echo

    for trackerness in {trackers,only-trackers}
    do
        if [ $trackerness = "trackers" ]
        then
            trackerness_flags=""
        else
            trackerness_flags="--end-chain --no-dupplicates"
        fi
        file_list="dist/${partyness}party-${trackerness}.txt"
        file_host="dist/${partyness}party-${trackerness}-hosts.txt"

        log "Generating lists for variant ${partyness}-party ${trackerness}…"

        # Real export here
        ./export.py $partyness_flags $trackerness_flags > "$file_list"
        # Sometimes a bit heavy to have the DB open and sort the output
        # so this is done in two steps
        sort -u "$file_list" -o "$file_list"

        rules_input=$(./export.py --count --base-rules $partyness_flags)
        rules_found=$(./export.py --count --rules $partyness_flags)
        rules_output=$(./export.py --count $partyness_flags $trackerness_flags)

        function link() { # link partyness, link trackerness
            url="https://hostfiles.frogeye.fr/${1}party-${2}-hosts.txt"
            if [ "$1" = "$partyness" ] && [ "$2" = "$trackerness" ]
            then
                url="$url (this one)"
            fi
            echo "$url"
        }

        (
            echo "# First-party trackers host list"
            echo "# Variant: ${partyness}-party ${trackerness}"
            echo "#"
            echo "# About first-party trackers: TODO"
            echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
            echo "#"
            echo "# In case of false positives/negatives, or any other question,"
            echo "# contact me the way you like: https://geoffrey.frogeye.fr"
            echo "#"
            echo "# Latest versions and variants:"
            echo "# - First-party trackers : $(link first trackers)"
            echo "# - … excluding redirected: $(link first only-trackers)"
            echo "# - First and third party : $(link multi trackers)"
            echo "# - … excluding redirected: $(link multi only-trackers)"
            echo '# (variant information: TODO)'
            echo '# (you can remove `-hosts` to get the raw list)'
            echo "#"
            echo "# Generation date: $gen_date"
            echo "# Generation software: eulaurarien $gen_software"
            echo "# Number of source websites: $number_websites"
            echo "# Number of source subdomains: $number_subdomains"
            echo "# Number of source DNS records: ~2E9 + $number_dns"
            echo "#"
            echo "# Input rules: $rules_input"
            echo "# Subsequent rules: $rules_found"
            echo "# Output rules: $rules_output"
            echo "#"
            echo
            sed 's|^|0.0.0.0 |' "$file_list"
        ) > "$file_host"

    done
done
feed_asn.py
@@ -0,0 +1,71 @@
#!/usr/bin/env python3

import database
import argparse
import requests
import typing
import ipaddress
import logging
import time

IPNetwork = typing.Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def get_ranges(asn: str) -> typing.Iterable[str]:
    req = requests.get(
        'https://stat.ripe.net/data/as-routing-consistency/data.json',
        params={'resource': asn}
    )
    data = req.json()
    for pref in data['data']['prefixes']:
        yield pref['prefix']


def get_name(asn: str) -> str:
    req = requests.get(
        'https://stat.ripe.net/data/as-overview/data.json',
        params={'resource': asn}
    )
    data = req.json()
    return data['data']['holder']


if __name__ == '__main__':

    log = logging.getLogger('feed_asn')

    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Add the IP ranges associated with the ASes "
        "in the database")
    args = parser.parse_args()

    DB = database.Database()

    def add_ranges(path: database.Path,
                   match: database.Match,
                   ) -> None:
        assert isinstance(path, database.AsnPath)
        assert isinstance(match, database.AsnNode)
        asn_str = database.Database.unpack_asn(path)
        DB.enter_step('asn_get_name')
        name = get_name(asn_str)
        match.name = name
        DB.enter_step('asn_get_ranges')
        for prefix in get_ranges(asn_str):
            parsed_prefix: IPNetwork = ipaddress.ip_network(prefix)
            if parsed_prefix.version == 4:
                DB.set_ip4network(
                    prefix,
                    source=path,
                    updated=int(time.time())
                )
                log.info('Added %s from %s (%s)', prefix, path, name)
            elif parsed_prefix.version == 6:
                log.warning('Unimplemented prefix version: %s', prefix)
            else:
                log.error('Unknown prefix version: %s', prefix)

    for _ in DB.exec_each_asn(add_ranges):
        pass

    DB.save()
feed_dns.py
@@ -0,0 +1,227 @@
#!/usr/bin/env python3

import argparse
import database
import logging
import sys
import typing
import multiprocessing
import time

Record = typing.Tuple[typing.Callable, typing.Callable, int, str, str]

# select, write
FUNCTION_MAP: typing.Any = {
    'a': (
        database.Database.get_ip4,
        database.Database.set_hostname,
    ),
    'cname': (
        database.Database.get_domain,
        database.Database.set_hostname,
    ),
    'ptr': (
        database.Database.get_domain,
        database.Database.set_ip4address,
    ),
}


class Writer(multiprocessing.Process):
    def __init__(self,
                 recs_queue: multiprocessing.Queue,
                 autosave_interval: int = 0,
                 ip4_cache: int = 0,
                 ):
        super(Writer, self).__init__()
        self.log = logging.getLogger('wr')
        self.recs_queue = recs_queue
        self.autosave_interval = autosave_interval
        self.ip4_cache = ip4_cache

    def run(self) -> None:
        self.db = database.Database()
        self.db.log = logging.getLogger('wr')
        self.db.fill_ip4cache(max_size=self.ip4_cache)
        if self.autosave_interval > 0:
            next_save = time.time() + self.autosave_interval
        else:
            next_save = 0

        self.db.enter_step('block_wait')
        block: typing.List[Record]
        for block in iter(self.recs_queue.get, None):

            record: Record
            for record in block:

                select, write, updated, name, value = record
                self.db.enter_step('feed_switch')

                try:
                    for source in select(self.db, value):
                        write(self.db, name, updated, source=source)
                except ValueError:
                    self.log.exception("Cannot execute: %s", record)

            if next_save > 0 and time.time() > next_save:
                self.log.info("Saving database...")
                self.db.save()
                self.log.info("Done!")
                next_save = time.time() + self.autosave_interval

            self.db.enter_step('block_wait')

        self.db.enter_step('end')
        self.db.save()


class Parser():
    def __init__(self,
                 buf: typing.Any,
                 recs_queue: multiprocessing.Queue,
                 block_size: int,
                 ):
        super(Parser, self).__init__()
        self.buf = buf
        self.log = logging.getLogger('pr')
        self.recs_queue = recs_queue
        self.block: typing.List[Record] = list()
        self.block_size = block_size
        self.prof = database.Profiler()
        self.prof.log = logging.getLogger('pr')

    def register(self, record: Record) -> None:
        self.prof.enter_step('register')
        self.block.append(record)
        if len(self.block) >= self.block_size:
            self.prof.enter_step('put_block')
            self.recs_queue.put(self.block)
            self.block = list()

    def run(self) -> None:
        self.consume()
        self.recs_queue.put(self.block)
        self.prof.profile()

    def consume(self) -> None:
        raise NotImplementedError


class Rapid7Parser(Parser):
    def consume(self) -> None:
        data = dict()
        for line in self.buf:
            self.prof.enter_step('parse_rapid7')
            split = line.split('"')

            try:
                for k in range(1, 14, 4):
                    key = split[k]
                    val = split[k+2]
                    data[key] = val

                select, writer = FUNCTION_MAP[data['type']]
                record = (
                    select,
                    writer,
                    int(data['timestamp']),
                    data['name'],
                    data['value']
                )
            except IndexError:
                self.log.exception("Cannot parse: %s", line)
                # Skip the line, otherwise a stale record from a
                # previous iteration would be registered again
                continue
            self.register(record)


class MassDnsParser(Parser):
    # massdns --output Snrql
    # --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
    TYPES = {
        'A': (FUNCTION_MAP['a'][0], FUNCTION_MAP['a'][1], -1, None),
        # 'AAAA': (FUNCTION_MAP['aaaa'][0], FUNCTION_MAP['aaaa'][1], -1, None),
        'CNAME': (FUNCTION_MAP['cname'][0], FUNCTION_MAP['cname'][1], -1, -1),
    }

    def consume(self) -> None:
        self.prof.enter_step('parse_massdns')
        timestamp = 0
        header = True
        for line in self.buf:
            line = line[:-1]
            if not line:
                header = True
                continue

            split = line.split(' ')
            try:
                if header:
                    timestamp = int(split[1])
                    header = False
                else:
                    select, write, name_offset, value_offset = \
                        MassDnsParser.TYPES[split[1]]
                    record = (
                        select,
                        write,
                        timestamp,
                        split[0][:name_offset],
                        split[2][:value_offset],
                    )
                    self.register(record)
                    self.prof.enter_step('parse_massdns')
            except KeyError:
                continue


PARSERS = {
    'rapid7': Rapid7Parser,
    'massdns': MassDnsParser,
}

if __name__ == '__main__':

    # Parsing arguments
    log = logging.getLogger('feed_dns')
    args_parser = argparse.ArgumentParser(
        description="Read DNS records and import "
        "tracking-relevant data into the database")
    args_parser.add_argument(
        'parser',
        choices=PARSERS.keys(),
        help="Input format")
    args_parser.add_argument(
        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
        help="Input file")
    args_parser.add_argument(
        '-b', '--block-size', type=int, default=1024,
        help="Performance tuning value")
    args_parser.add_argument(
        '-q', '--queue-size', type=int, default=128,
        help="Performance tuning value")
    args_parser.add_argument(
        '-a', '--autosave-interval', type=int, default=900,
        help="Interval to which the database will save in seconds. "
        "0 to disable.")
    args_parser.add_argument(
        '-4', '--ip4-cache', type=int, default=0,
        help="RAM cache for faster IPv4 lookup. "
        "Maximum useful value: 512 MiB (536870912). "
        "Warning: Depending on the rules, this might already "
        "be a memory-heavy process, even without the cache.")
    args = args_parser.parse_args()

    recs_queue: multiprocessing.Queue = multiprocessing.Queue(
        maxsize=args.queue_size)

    writer = Writer(recs_queue,
                    autosave_interval=args.autosave_interval,
                    ip4_cache=args.ip4_cache
                    )
    writer.start()

    parser = PARSERS[args.parser](args.input, recs_queue, args.block_size)
    parser.run()

    recs_queue.put(None)
    writer.join()
feed_rules.py
@@ -0,0 +1,54 @@
#!/usr/bin/env python3

import database
import argparse
import sys
import time

FUNCTION_MAP = {
    'zone': database.Database.set_zone,
    'hostname': database.Database.set_hostname,
    'asn': database.Database.set_asn,
    'ip4network': database.Database.set_ip4network,
    'ip4address': database.Database.set_ip4address,
}

if __name__ == '__main__':

    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Import base rules to the database")
    parser.add_argument(
        'type',
        choices=FUNCTION_MAP.keys(),
        help="Type of rule inputted")
    parser.add_argument(
        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
        help="File with one rule per line")
    parser.add_argument(
        '-f', '--first-party', action='store_true',
        help="The input only comes from verified first-party sources")
    args = parser.parse_args()

    DB = database.Database()

    fun = FUNCTION_MAP[args.type]

    source: database.RulePath
    if args.first_party:
        source = database.RuleFirstPath()
    else:
        source = database.RuleMultiPath()

    for rule in args.input:
        rule = rule.strip()
        try:
            fun(DB,
                rule,
                source=source,
                updated=int(time.time()),
                )
        except ValueError: