diff --git a/.gitignore b/.gitignore index e38bcd9..e6abf3c 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,2 @@ *.log -nameservers -nameservers.head +*.p diff --git a/README.md b/README.md index f27b6f6..7229f30 100644 --- a/README.md +++ b/README.md @@ -1,98 +1,133 @@ # eulaurarien -Generates a host list of first-party trackers for ad-blocking. +This program generates a list of every hostname that is a DNS redirection to a given list of DNS zones and IP networks. -The latest list is available here: +It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md) (learn about first-party trackers by following this link). -**DISCLAIMER:** I'm by no way an expert on this subject so my vocabulary or other stuff might be wrong. Use at your own risk. +If you want to contribute but don't want to create an account on this forge, contact me the way you like: -## What's a first-party tracker? +## How does this work -Traditionally, websites load trackers scripts directly. -For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users. -In order to block those, one can simply block the host `trackercompany.com`. +This program takes as input: -However, to circumvent this easy block, tracker companies made the website using them load trackers from `somethingirelevant.website1.com`. -The latter being a DNS redirection to `website1.trackercompany.com`, directly pointing to a server serving the tracking script. -Those are the first-party trackers. +- Lists of hostnames to match +- Lists of DNS zones to match (a domain and its subdomains) +- Lists of IP addresses / IP networks to match +- Lists of Autonomous System numbers to match +- An enormous quantity of DNS records -Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since: +It outputs the hostnames that are DNS redirections to any item in the provided lists. -1. Most ad-blocker don't support wildcards -2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com` +DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or be resolved locally from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns). -So the only solution is to block every `somethingirelevant.website1.com`-like subdomains known, which is a lot. -That's where this scripts comes in, to generate a list of such subdomains. - -## How does this script work - -> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this. - -It takes an input a list of websites with trackers included. -So far, this list is manually-generated from the list of clients of such first-party trackers -(latter we should use a general list of websites to be more exhaustive). -It open each ones of those websites (just the homepage) in a web browser, and record the domains of the network requests the page makes. - -Additionaly, or alternatively, you can feed the script some browsing history and get domains from there. - -It then find the DNS redirections of those domains, and compare with regexes of known tracking domains. -It finally outputs the matching ones. - -## Requirements - -> **Notice:** This section is a tad outdated. 
I'm still experimenting to make the generation process better. I'll update this once I'm done with this. - -Just to build the list, you can find an already-built list in the releases. - -- Bash -- [Python 3.4+](https://www.python.org/) -- [progressbar2](https://pypi.org/project/progressbar2/) -- dnspython -- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up) - -(if you don't want to collect the subdomains, you can skip the following) - -- Firefox -- Selenium -- seleniumwire +Those subdomains can either be provided as is, come from [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening an URL (the program provides utility to do all that). ## Usage -> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this. +Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md). -This is only if you want to build the list yourself. -If you just want to use the list, the latest build is available here: -It was build using additional sources not included in this repository for privacy reasons. +The following is for the people wanting to build their own list. -### Add personal sources +### Requirements -The list of websites provided in this script is by no mean exhaustive, -so adding your own browsing history will help create a better list. +Depending on the sources you'll be using to generate the list, you'll need to install some of the following: + +- [Bash](https://www.gnu.org/software/bash/bash.html) +- [Coreutils](https://www.gnu.org/software/coreutils/) +- [curl](https://curl.haxx.se) +- [pv](http://www.ivarch.com/programs/pv.shtml) +- [Python 3.4+](https://www.python.org/) +- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself) +- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source) +- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source) +- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source) +- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source) + +### Create a new database + +The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASN, IPs, hostnames, zones…) and every entity leading to it. +For now there's no way to remove data from it, so here's the command to recreate it: `./db.py --initialize`. + +### Gather external sources + +External sources are not stored in this repository. +You'll need to fetch them by running `./fetch_resources.sh`. 
+Those include: + +- Third-party tracker lists +- TLD lists (used to test the validity of hostnames) +- List of public DNS resolvers (for resolving DNS records from subdomains) +- Top 1M subdomains + +### Import rules into the database + +You need to put the lists of rules to match in the following subfolders: + +- `rules`: Lists of DNS zones +- `rules_ip`: Lists of IP networks (for IP addresses append `/32`) +- `rules_asn`: Lists of Autonomous System numbers (IP ranges will be derived from them) +- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted) +- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists + +See the provided examples for syntax. + +In each folder: + +- `first-party.ext` will be the only files considered for the first-party variant of the list +- `*.cache.ext` are from external sources, and thus might be deleted / overwritten +- `*.custom.ext` are for sources that you don't want committed + +Then, run `./import_rules.sh`. + +### Add subdomains + +If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive), +the top 1M subdomains provided might not be enough. + +You can add your own into the `subdomains` folder. +It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files. + +#### Add personal sources + +Adding your own browsing history will help create a better-suited subdomain list. Here's reference command for possible sources: - **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list` - **Firefox**: `cp ~/.mozilla/firefox/.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp` -### Collect subdomains from websites +#### Collect subdomains from websites -Just run `collect_subdomain.sh`. +You can add website URLs into the `websites` folder. +It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files. + +Then, run `collect_subdomain.sh`. This is a long step, and might be memory-intensive from time to time. -This step is optional if you already added personal sources. -Alternatively, you can get just download the list of subdomains used to generate the official block list here: (put it in the `subdomains` folder). +> **Note:** For first-party tracking, a list of subdomains derived from the websites in the repository is available here: -### Extract tracking domains +### Resolve DNS records -Make sure your system is configured with a DNS server without limitation. -Then, run `filter_subdomain.sh`. -The files you need will be in the folder `dist`. +Once you've added subdomains, you'll need to resolve them to get their DNS records. +The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory. -## Contributing +Then, run `./resolve_subdomains.sh`. +Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet count. -### Adding websites +> Some VPS providers might detect this as a DDoS attack and cut off network access. +> Some Wi-Fi connections can be rendered unusable for other uses, and some routers might stop working. +> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow Ethernet link (Raspberry Pi < 4). -Just add the URL to the relevant list: `websites/.list`. 
+The DNS records will automatically be imported into the database. +If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script. -### Adding first-party trackers regex +### Import DNS records from Rapid7 + +Just run `./import_rapid7.sh`. +This will download about 35 GiB of data, but only the matching records will be stored (about a few MiB for the tracking rules). +Note the download speed will most likely be limited by the database operation thoughput (a quick RAM will help). + +### Export the lists + +For the tracking list, use `./export_lists.sh`, the output will be in the `dist` forlder (please change the links before distributing them). +For other purposes, tinker with the `./export.py` program. -Just add them to `regexes.py`. diff --git a/database.py b/database.py new file mode 100644 index 0000000..c37369f --- /dev/null +++ b/database.py @@ -0,0 +1,739 @@ +#!/usr/bin/env python3 + +""" +Utility functions to interact with the database. +""" + +import typing +import time +import logging +import coloredlogs +import pickle +import numpy +import math + +TLD_LIST: typing.Set[str] = set() + +coloredlogs.install( + level='DEBUG', + fmt='%(asctime)s %(name)s %(levelname)s %(message)s' +) + +Asn = int +Timestamp = int +Level = int + + +class Path(): + # FP add boolean here + pass + + +class RulePath(Path): + def __str__(self) -> str: + return '(rule)' + + +class RuleFirstPath(RulePath): + def __str__(self) -> str: + return '(first-party rule)' + + +class RuleMultiPath(RulePath): + def __str__(self) -> str: + return '(multi-party rule)' + + +class DomainPath(Path): + def __init__(self, parts: typing.List[str]): + self.parts = parts + + def __str__(self) -> str: + return '?.' + Database.unpack_domain(self) + + +class HostnamePath(DomainPath): + def __str__(self) -> str: + return Database.unpack_domain(self) + + +class ZonePath(DomainPath): + def __str__(self) -> str: + return '*.' 
+ Database.unpack_domain(self) + + +class AsnPath(Path): + def __init__(self, asn: Asn): + self.asn = asn + + def __str__(self) -> str: + return Database.unpack_asn(self) + + +class Ip4Path(Path): + def __init__(self, value: int, prefixlen: int): + self.value = value + self.prefixlen = prefixlen + + def __str__(self) -> str: + return Database.unpack_ip4network(self) + + +class Match(): + def __init__(self) -> None: + self.source: typing.Optional[Path] = None + self.updated: int = 0 + self.dupplicate: bool = False + + # Cache + self.level: int = 0 + self.first_party: bool = False + self.references: int = 0 + + def active(self, first_party: bool = None) -> bool: + if self.updated == 0 or (first_party and not self.first_party): + return False + return True + + +class AsnNode(Match): + def __init__(self) -> None: + Match.__init__(self) + self.name = '' + + +class DomainTreeNode(): + def __init__(self) -> None: + self.children: typing.Dict[str, DomainTreeNode] = dict() + self.match_zone = Match() + self.match_hostname = Match() + + +class IpTreeNode(Match): + def __init__(self) -> None: + Match.__init__(self) + self.zero: typing.Optional[IpTreeNode] = None + self.one: typing.Optional[IpTreeNode] = None + + +Node = typing.Union[DomainTreeNode, IpTreeNode, AsnNode] +MatchCallable = typing.Callable[[Path, + Match], + typing.Any] + + +class Profiler(): + def __init__(self) -> None: + self.log = logging.getLogger('profiler') + self.time_last = time.perf_counter() + self.time_step = 'init' + self.time_dict: typing.Dict[str, float] = dict() + self.step_dict: typing.Dict[str, int] = dict() + + def enter_step(self, name: str) -> None: + now = time.perf_counter() + try: + self.time_dict[self.time_step] += now - self.time_last + self.step_dict[self.time_step] += int(name != self.time_step) + except KeyError: + self.time_dict[self.time_step] = now - self.time_last + self.step_dict[self.time_step] = 1 + self.time_step = name + self.time_last = time.perf_counter() + + def profile(self) -> None: + self.enter_step('profile') + total = sum(self.time_dict.values()) + for key, secs in sorted(self.time_dict.items(), key=lambda t: t[1]): + times = self.step_dict[key] + self.log.debug(f"{key:<20}: {times:9d} × {secs/times:5.3e} " + f"= {secs:9.2f} s ({secs/total:7.2%}) ") + self.log.debug(f"{'total':<20}: " + f"{total:9.2f} s ({1:7.2%})") + + +class Database(Profiler): + VERSION = 18 + PATH = "blocking.p" + + def initialize(self) -> None: + self.log.warning( + "Creating database version: %d ", + Database.VERSION) + # Dummy match objects that everything refer to + self.rules: typing.List[Match] = list() + for first_party in (False, True): + m = Match() + m.updated = 1 + m.level = 0 + m.first_party = first_party + self.rules.append(m) + self.domtree = DomainTreeNode() + self.asns: typing.Dict[Asn, AsnNode] = dict() + self.ip4tree = IpTreeNode() + + def load(self) -> None: + self.enter_step('load') + try: + with open(self.PATH, 'rb') as db_fdsec: + version, data = pickle.load(db_fdsec) + if version == Database.VERSION: + self.rules, self.domtree, self.asns, self.ip4tree = data + return + self.log.warning( + "Outdated database version found: %d, " + "it will be rebuilt.", + version) + except (TypeError, AttributeError, EOFError): + self.log.error( + "Corrupt (or heavily outdated) database found, " + "it will be rebuilt.") + except FileNotFoundError: + pass + self.initialize() + + def save(self) -> None: + self.enter_step('save') + with open(self.PATH, 'wb') as db_fdsec: + data = self.rules, self.domtree, self.asns, 
self.ip4tree + pickle.dump((self.VERSION, data), db_fdsec) + self.profile() + + def __init__(self) -> None: + Profiler.__init__(self) + self.log = logging.getLogger('db') + self.load() + self.ip4cache_shift: int = 32 + self.ip4cache = numpy.ones(1) + + def _set_ip4cache(self, path: Path, _: Match) -> None: + assert isinstance(path, Ip4Path) + self.enter_step('set_ip4cache') + mini = path.value >> self.ip4cache_shift + maxi = (path.value + 2**(32-path.prefixlen)) >> self.ip4cache_shift + if mini == maxi: + self.ip4cache[mini] = True + else: + self.ip4cache[mini:maxi] = True + + def fill_ip4cache(self, max_size: int = 512*1024**2) -> None: + """ + Size in bytes + """ + if max_size > 2**32/8: + self.log.warning("Allocating more than 512 MiB of RAM for " + "the Ip4 cache is not necessary.") + max_cache_width = int(math.log2(max(1, max_size*8))) + cache_width = min(2**32, max_cache_width) + self.ip4cache_shift = 32-cache_width + cache_size = 2**cache_width + self.ip4cache = numpy.zeros(cache_size, dtype=numpy.bool) + for _ in self.exec_each_ip4(self._set_ip4cache): + pass + + @staticmethod + def populate_tld_list() -> None: + with open('temp/all_tld.list', 'r') as tld_fdesc: + for tld in tld_fdesc: + tld = tld.strip() + TLD_LIST.add(tld) + + @staticmethod + def validate_domain(path: str) -> bool: + if len(path) > 255: + return False + splits = path.split('.') + if not TLD_LIST: + Database.populate_tld_list() + if splits[-1] not in TLD_LIST: + return False + for split in splits: + if not 1 <= len(split) <= 63: + return False + return True + + @staticmethod + def pack_domain(domain: str) -> DomainPath: + return DomainPath(domain.split('.')[::-1]) + + @staticmethod + def unpack_domain(domain: DomainPath) -> str: + return '.'.join(domain.parts[::-1]) + + @staticmethod + def pack_asn(asn: str) -> AsnPath: + asn = asn.upper() + if asn.startswith('AS'): + asn = asn[2:] + return AsnPath(int(asn)) + + @staticmethod + def unpack_asn(asn: AsnPath) -> str: + return f'AS{asn.asn}' + + @staticmethod + def validate_ip4address(path: str) -> bool: + splits = path.split('.') + if len(splits) != 4: + return False + for split in splits: + try: + if not 0 <= int(split) <= 255: + return False + except ValueError: + return False + return True + + @staticmethod + def pack_ip4address(address: str) -> Ip4Path: + addr = 0 + for split in address.split('.'): + addr = (addr << 8) + int(split) + return Ip4Path(addr, 32) + + @staticmethod + def unpack_ip4address(address: Ip4Path) -> str: + addr = address.value + assert address.prefixlen == 32 + octets: typing.List[int] = list() + octets = [0] * 4 + for o in reversed(range(4)): + octets[o] = addr & 0xFF + addr >>= 8 + return '.'.join(map(str, octets)) + + @staticmethod + def validate_ip4network(path: str) -> bool: + # A bit generous but ok for our usage + splits = path.split('/') + if len(splits) != 2: + return False + if not Database.validate_ip4address(splits[0]): + return False + try: + if not 0 <= int(splits[1]) <= 32: + return False + except ValueError: + return False + return True + + @staticmethod + def pack_ip4network(network: str) -> Ip4Path: + address, prefixlen_str = network.split('/') + prefixlen = int(prefixlen_str) + addr = Database.pack_ip4address(address) + addr.prefixlen = prefixlen + return addr + + @staticmethod + def unpack_ip4network(network: Ip4Path) -> str: + addr = network.value + octets: typing.List[int] = list() + octets = [0] * 4 + for o in reversed(range(4)): + octets[o] = addr & 0xFF + addr >>= 8 + return '.'.join(map(str, octets)) + '/' + 
str(network.prefixlen) + + def get_match(self, path: Path) -> Match: + if isinstance(path, RuleMultiPath): + return self.rules[0] + elif isinstance(path, RuleFirstPath): + return self.rules[1] + elif isinstance(path, AsnPath): + return self.asns[path.asn] + elif isinstance(path, DomainPath): + dicd = self.domtree + for part in path.parts: + dicd = dicd.children[part] + if isinstance(path, HostnamePath): + return dicd.match_hostname + elif isinstance(path, ZonePath): + return dicd.match_zone + else: + raise ValueError + elif isinstance(path, Ip4Path): + dici = self.ip4tree + for i in range(31, 31-path.prefixlen, -1): + bit = (path.value >> i) & 0b1 + dici_next = dici.one if bit else dici.zero + if not dici_next: + raise IndexError + dici = dici_next + return dici + else: + raise ValueError + + def exec_each_asn(self, + callback: MatchCallable, + ) -> typing.Any: + for asn in self.asns: + match = self.asns[asn] + if match.active(): + c = callback( + AsnPath(asn), + match, + ) + try: + yield from c + except TypeError: # not iterable + pass + + def exec_each_domain(self, + callback: MatchCallable, + _dic: DomainTreeNode = None, + _par: DomainPath = None, + ) -> typing.Any: + _dic = _dic or self.domtree + _par = _par or DomainPath([]) + if _dic.match_hostname.active(): + c = callback( + HostnamePath(_par.parts), + _dic.match_hostname, + ) + try: + yield from c + except TypeError: # not iterable + pass + if _dic.match_zone.active(): + c = callback( + ZonePath(_par.parts), + _dic.match_zone, + ) + try: + yield from c + except TypeError: # not iterable + pass + for part in _dic.children: + dic = _dic.children[part] + yield from self.exec_each_domain( + callback, + _dic=dic, + _par=DomainPath(_par.parts + [part]) + ) + + def exec_each_ip4(self, + callback: MatchCallable, + _dic: IpTreeNode = None, + _par: Ip4Path = None, + ) -> typing.Any: + _dic = _dic or self.ip4tree + _par = _par or Ip4Path(0, 0) + if _dic.active(): + c = callback( + _par, + _dic, + ) + try: + yield from c + except TypeError: # not iterable + pass + + # 0 + pref = _par.prefixlen + 1 + dic = _dic.zero + if dic: + # addr0 = _par.value & (0xFFFFFFFF ^ (1 << (32-pref))) + # assert addr0 == _par.value + addr0 = _par.value + yield from self.exec_each_ip4( + callback, + _dic=dic, + _par=Ip4Path(addr0, pref) + ) + # 1 + dic = _dic.one + if dic: + addr1 = _par.value | (1 << (32-pref)) + # assert addr1 != _par.value + yield from self.exec_each_ip4( + callback, + _dic=dic, + _par=Ip4Path(addr1, pref) + ) + + def exec_each(self, + callback: MatchCallable, + ) -> typing.Any: + yield from self.exec_each_domain(callback) + yield from self.exec_each_ip4(callback) + yield from self.exec_each_asn(callback) + + def update_references(self) -> None: + # Should be correctly calculated normally, + # keeping this just in case + def reset_references_cb(path: Path, + match: Match + ) -> None: + match.references = 0 + for _ in self.exec_each(reset_references_cb): + pass + + def increment_references_cb(path: Path, + match: Match + ) -> None: + if match.source: + source = self.get_match(match.source) + source.references += 1 + for _ in self.exec_each(increment_references_cb): + pass + + def prune(self, before: int, base_only: bool = False) -> None: + raise NotImplementedError + + def explain(self, path: Path) -> str: + match = self.get_match(path) + if isinstance(match, AsnNode): + string = f'{path} ({match.name}) #{match.references}' + else: + string = f'{path} #{match.references}' + if match.source: + string += f' ← {self.explain(match.source)}' + return 
string + + def list_records(self, + first_party_only: bool = False, + end_chain_only: bool = False, + no_dupplicates: bool = False, + rules_only: bool = False, + hostnames_only: bool = False, + explain: bool = False, + ) -> typing.Iterable[str]: + + def export_cb(path: Path, match: Match + ) -> typing.Iterable[str]: + if first_party_only and not match.first_party: + return + if end_chain_only and match.references > 0: + return + if no_dupplicates and match.dupplicate: + return + if rules_only and match.level > 1: + return + if hostnames_only and not isinstance(path, HostnamePath): + return + + if explain: + yield self.explain(path) + else: + yield str(path) + + yield from self.exec_each(export_cb) + + def count_records(self, + first_party_only: bool = False, + end_chain_only: bool = False, + no_dupplicates: bool = False, + rules_only: bool = False, + hostnames_only: bool = False, + ) -> str: + memo: typing.Dict[str, int] = dict() + + def count_records_cb(path: Path, match: Match) -> None: + if first_party_only and not match.first_party: + return + if end_chain_only and match.references > 0: + return + if no_dupplicates and match.dupplicate: + return + if rules_only and match.level > 1: + return + if hostnames_only and not isinstance(path, HostnamePath): + return + + try: + memo[path.__class__.__name__] += 1 + except KeyError: + memo[path.__class__.__name__] = 1 + + for _ in self.exec_each(count_records_cb): + pass + + split: typing.List[str] = list() + for key, value in sorted(memo.items(), key=lambda s: s[0]): + split.append(f'{key[:-4].lower()}s: {value}') + return ', '.join(split) + + def get_domain(self, domain_str: str) -> typing.Iterable[DomainPath]: + self.enter_step('get_domain_pack') + domain = self.pack_domain(domain_str) + self.enter_step('get_domain_brws') + dic = self.domtree + depth = 0 + for part in domain.parts: + if dic.match_zone.active(): + self.enter_step('get_domain_yield') + yield ZonePath(domain.parts[:depth]) + self.enter_step('get_domain_brws') + if part not in dic.children: + return + dic = dic.children[part] + depth += 1 + if dic.match_zone.active(): + self.enter_step('get_domain_yield') + yield ZonePath(domain.parts) + if dic.match_hostname.active(): + self.enter_step('get_domain_yield') + yield HostnamePath(domain.parts) + + def get_ip4(self, ip4_str: str) -> typing.Iterable[Path]: + self.enter_step('get_ip4_pack') + ip4 = self.pack_ip4address(ip4_str) + self.enter_step('get_ip4_cache') + if not self.ip4cache[ip4.value >> self.ip4cache_shift]: + return + self.enter_step('get_ip4_brws') + dic = self.ip4tree + for i in range(31, 31-ip4.prefixlen, -1): + bit = (ip4.value >> i) & 0b1 + if dic.active(): + self.enter_step('get_ip4_yield') + yield Ip4Path(ip4.value >> (i+1) << (i+1), 31-i) + self.enter_step('get_ip4_brws') + next_dic = dic.one if bit else dic.zero + if next_dic is None: + return + dic = next_dic + if dic.active(): + self.enter_step('get_ip4_yield') + yield ip4 + + def _set_match(self, + match: Match, + updated: int, + source: Path, + source_match: Match = None, + dupplicate: bool = False, + ) -> None: + # source_match is in parameters because most of the time + # its parent function needs it too, + # so it can pass it to save a traversal + source_match = source_match or self.get_match(source) + new_level = source_match.level + 1 + if updated > match.updated or new_level < match.level \ + or source_match.first_party > match.first_party: + # NOTE FP and level of matches referencing this one + # won't be updated until run or prune + if match.source: + 
old_source = self.get_match(match.source) + old_source.references -= 1 + match.updated = updated + match.level = new_level + match.first_party = source_match.first_party + match.source = source + source_match.references += 1 + match.dupplicate = dupplicate + + def _set_domain(self, + hostname: bool, + domain_str: str, + updated: int, + source: Path) -> None: + self.enter_step('set_domain_val') + if not Database.validate_domain(domain_str): + raise ValueError(f"Invalid domain: {domain_str}") + self.enter_step('set_domain_pack') + domain = self.pack_domain(domain_str) + self.enter_step('set_domain_fp') + source_match = self.get_match(source) + is_first_party = source_match.first_party + self.enter_step('set_domain_brws') + dic = self.domtree + dupplicate = False + for part in domain.parts: + if part not in dic.children: + dic.children[part] = DomainTreeNode() + dic = dic.children[part] + if dic.match_zone.active(is_first_party): + dupplicate = True + if hostname: + match = dic.match_hostname + else: + match = dic.match_zone + self._set_match( + match, + updated, + source, + source_match=source_match, + dupplicate=dupplicate, + ) + + def set_hostname(self, + *args: typing.Any, **kwargs: typing.Any + ) -> None: + self._set_domain(True, *args, **kwargs) + + def set_zone(self, + *args: typing.Any, **kwargs: typing.Any + ) -> None: + self._set_domain(False, *args, **kwargs) + + def set_asn(self, + asn_str: str, + updated: int, + source: Path) -> None: + self.enter_step('set_asn') + path = self.pack_asn(asn_str) + if path.asn in self.asns: + match = self.asns[path.asn] + else: + match = AsnNode() + self.asns[path.asn] = match + self._set_match( + match, + updated, + source, + ) + + def _set_ip4(self, + ip4: Ip4Path, + updated: int, + source: Path) -> None: + self.enter_step('set_ip4_fp') + source_match = self.get_match(source) + is_first_party = source_match.first_party + self.enter_step('set_ip4_brws') + dic = self.ip4tree + dupplicate = False + for i in range(31, 31-ip4.prefixlen, -1): + bit = (ip4.value >> i) & 0b1 + next_dic = dic.one if bit else dic.zero + if next_dic is None: + next_dic = IpTreeNode() + if bit: + dic.one = next_dic + else: + dic.zero = next_dic + dic = next_dic + if dic.active(is_first_party): + dupplicate = True + self._set_match( + dic, + updated, + source, + source_match=source_match, + dupplicate=dupplicate, + ) + self._set_ip4cache(ip4, dic) + + def set_ip4address(self, + ip4address_str: str, + *args: typing.Any, **kwargs: typing.Any + ) -> None: + self.enter_step('set_ip4add_val') + if not Database.validate_ip4address(ip4address_str): + raise ValueError(f"Invalid ip4address: {ip4address_str}") + self.enter_step('set_ip4add_pack') + ip4 = self.pack_ip4address(ip4address_str) + self._set_ip4(ip4, *args, **kwargs) + + def set_ip4network(self, + ip4network_str: str, + *args: typing.Any, **kwargs: typing.Any + ) -> None: + self.enter_step('set_ip4net_val') + if not Database.validate_ip4network(ip4network_str): + raise ValueError(f"Invalid ip4network: {ip4network_str}") + self.enter_step('set_ip4net_pack') + ip4 = self.pack_ip4network(ip4network_str) + self._set_ip4(ip4, *args, **kwargs) diff --git a/db.py b/db.py new file mode 100755 index 0000000..91d00c5 --- /dev/null +++ b/db.py @@ -0,0 +1,46 @@ +#!/usr/bin/env python3 + +import argparse +import database +import time +import os + +if __name__ == '__main__': + + # Parsing arguments + parser = argparse.ArgumentParser( + description="Database operations") + parser.add_argument( + '-i', '--initialize', action='store_true', + 
help="Reconstruct the whole database") + parser.add_argument( + '-p', '--prune', action='store_true', + help="Remove old entries from database") + parser.add_argument( + '-b', '--prune-base', action='store_true', + help="With --prune, only prune base rules " + "(the ones added by ./feed_rules.py)") + parser.add_argument( + '-s', '--prune-before', type=int, + default=(int(time.time()) - 60*60*24*31*6), + help="With --prune, only rules updated before " + "this UNIX timestamp will be deleted") + parser.add_argument( + '-r', '--references', action='store_true', + help="DEBUG: Update the reference count") + args = parser.parse_args() + + if not args.initialize: + DB = database.Database() + else: + if os.path.isfile(database.Database.PATH): + os.unlink(database.Database.PATH) + DB = database.Database() + + DB.enter_step('main') + if args.prune: + DB.prune(before=args.prune_before, base_only=args.prune_base) + if args.references: + DB.update_references() + + DB.save() diff --git a/dist/README.md b/dist/README.md new file mode 100644 index 0000000..31db01f --- /dev/null +++ b/dist/README.md @@ -0,0 +1,74 @@ +# Geoffrey Frogeye's block list of first-party trackers + +## What's a first-party tracker? + +A tracker is a script put on many websites to gather informations about the visitor. +They can be used for multiple reasons: statistics, risk management, marketing, ads serving… +In any case, they are a threat to Internet users' privacy and many may want to block them. + +Traditionnaly, trackers are served from a third-party. +For example, `website1.com` and `website2.com` both load their tracking script from `https://trackercompany.com/trackerscript.js`. +In order to block those, one can simply block the hostname `trackercompany.com`, which is what most ad blockers do. + +However, to circumvent this block, tracker companies made the websites using them load trackers from `somestring.website1.com`. +The latter is a DNS redirection to `website1.trackercompany.com`, directly to an IP address belonging to the tracking company. +Those are called first-party trackers. + +In order to block those trackers, ad blockers would need to block every subdomain pointing to anything under `trackercompany.com` or to their network. +Unfortunately, most don't support those blocking methods as they are not DNS-aware, e.g. they only see `somestring.website1.com`. + +This list is an inventory of every `somestring.website1.com` found to allow non DNS-aware ad blocker to still block first-party trackers. + +## List variants + +### First-party trackers (recommended) + +- Hosts file: +- Raw list: + +This list contains every hostname redirecting to [a hand-picked list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/rules/first-party.list). +It should be safe from false-positives. +Don't be afraid of the size of the list, as this is due to the nature of first-party trackers: a single tracker generates at least one hostname per client (typically two). + +### First-party only trackers + +- Hosts file: +- Raw list: + +This is the same list as above, albeit not containing the hostnames under the tracking company domains. +This reduces the size of the list, but it doesn't prevent from third-party tracking too. +Use in conjunction with other block lists. 
+ +### Multi-party trackers + +- Hosts file: +- Raw list: + +As first-party trackers usually evolve from third-party trackers, this list contains every hostname redirecting to trackers found in existing lists of third-party trackers (see next section). +Since the latter were not designed with first-party trackers in mind, they are likely to contain false-positives. +In the other hand, they might protect against first-party tracker that we're not aware of / have not yet confirmed. + +#### Source of third-party trackers + +- [EasyPrivacy](https://easylist.to/easylist/easyprivacy.txt) + +(yes there's only one for now. A lot of existing ones cause a lot of false positives) + +### Multi-party only trackers + +- Hosts file: +- Raw list: + +This is the same list as above, albeit not containing the hostnames under the tracking company domains. +This reduces the size of the list, but it doesn't prevent from third-party tracking too. +Use in conjunction with other block lists, especially the ones used to generate this list in the previous section. + +## Meta + +In case of false positives/negatives, or any other question contact me the way you like: + +The software used to generate this list is available here: + +Some of the first-party tracker included in this list have been found by: +- [Aeris](https://imirhil.fr/) +- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors diff --git a/export.py b/export.py new file mode 100755 index 0000000..c5eefb2 --- /dev/null +++ b/export.py @@ -0,0 +1,64 @@ +#!/usr/bin/env python3 + +import database +import argparse +import sys + + +if __name__ == '__main__': + + # Parsing arguments + parser = argparse.ArgumentParser( + description="Export the hostnames rules stored " + "in the Database as plain text") + parser.add_argument( + '-o', '--output', type=argparse.FileType('w'), default=sys.stdout, + help="Output file, one rule per line") + parser.add_argument( + '-f', '--first-party', action='store_true', + help="Only output rules issued from first-party sources") + parser.add_argument( + '-e', '--end-chain', action='store_true', + help="Only output rules that are not referenced by any other") + parser.add_argument( + '-r', '--rules', action='store_true', + help="Output all kinds of rules, not just hostnames") + parser.add_argument( + '-b', '--base-rules', action='store_true', + help="Output base rules " + "(the ones added by ./feed_rules.py) " + "(implies --rules)") + parser.add_argument( + '-d', '--no-dupplicates', action='store_true', + help="Do not output rules that already match a zone/network rule " + "(e.g. 
dummy.example.com when there's a zone example.com rule)") + parser.add_argument( + '-x', '--explain', action='store_true', + help="Show the chain of rules leading to one " + "(and the number of references they have)") + parser.add_argument( + '-c', '--count', action='store_true', + help="Show the number of rules per type instead of listing them") + args = parser.parse_args() + + DB = database.Database() + + if args.count: + assert not args.explain + print(DB.count_records( + first_party_only=args.first_party, + end_chain_only=args.end_chain, + no_dupplicates=args.no_dupplicates, + rules_only=args.base_rules, + hostnames_only=not (args.rules or args.base_rules), + )) + else: + for domain in DB.list_records( + first_party_only=args.first_party, + end_chain_only=args.end_chain, + no_dupplicates=args.no_dupplicates, + rules_only=args.base_rules, + hostnames_only=not (args.rules or args.base_rules), + explain=args.explain, + ): + print(domain, file=args.output) diff --git a/export_lists.sh b/export_lists.sh new file mode 100755 index 0000000..5120562 --- /dev/null +++ b/export_lists.sh @@ -0,0 +1,98 @@ +#!/usr/bin/env bash + +function log() { + echo -e "\033[33m$@\033[0m" +} + +log "Calculating statistics…" +gen_date=$(date -Isec) +gen_software=$(git describe --tags) +number_websites=$(wc -l < temp/all_websites.list) +number_subdomains=$(wc -l < temp/all_subdomains.list) +number_dns=$(grep '^$' temp/all_resolved.txt | wc -l) + +for partyness in {first,multi} +do + if [ $partyness = "first" ] + then + partyness_flags="--first-party" + else + partyness_flags="" + fi + + echo "Statistics for ${partyness}-party trackers" + echo "Input rules: $(./export.py --count --base-rules $partyness_flags)" + echo "Subsequent rules: $(./export.py --count --rules $partyness_flags)" + echo "Subsequent rules (no dupplicate): $(./export.py --count --rules --no-dupplicates $partyness_flags)" + echo "Output hostnames: $(./export.py --count $partyness_flags)" + echo "Output hostnames (no dupplicate): $(./export.py --count --no-dupplicates $partyness_flags)" + echo "Output hostnames (end-chain only): $(./export.py --count --end-chain $partyness_flags)" + echo "Output hostnames (no dupplicate, end-chain only): $(./export.py --count --no-dupplicates --end-chain $partyness_flags)" + echo + + for trackerness in {trackers,only-trackers} + do + if [ $trackerness = "trackers" ] + then + trackerness_flags="" + else + trackerness_flags="--end-chain --no-dupplicates" + fi + file_list="dist/${partyness}party-${trackerness}.txt" + file_host="dist/${partyness}party-${trackerness}-hosts.txt" + + log "Generating lists for variant ${partyness}-party ${trackerness}…" + + # Real export heeere + ./export.py $partyness_flags $trackerness_flags > $file_list + # Sometimes a bit heavy to have the DB open and sort the output + # so this is done in two steps + sort -u $file_list -o $file_list + + rules_input=$(./export.py --count --base-rules $partyness_flags) + rules_found=$(./export.py --count --rules $partyness_flags) + rules_output=$(./export.py --count $partyness_flags $trackerness_flags) + + function link() { # link partyness, link trackerness + url="https://hostfiles.frogeye.fr/${1}party-${2}-hosts.txt" + if [ "$1" = "$partyness" ] && [ "$2" = "$trackerness" ] + then + url="$url (this one)" + fi + echo $url + } + + ( + echo "# First-party trackers host list" + echo "# Variant: ${partyness}-party ${trackerness}" + echo "#" + echo "# About first-party trackers: TODO" + echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien" + 
echo "#" + echo "# In case of false positives/negatives, or any other question," + echo "# contact me the way you like: https://geoffrey.frogeye.fr" + echo "#" + echo "# Latest versions and variants:" + echo "# - First-party trackers : $(link first trackers)" + echo "# - … excluding redirected: $(link first only-trackers)" + echo "# - First and third party : $(link multi trackers)" + echo "# - … excluding redirected: $(link multi only-trackers)" + echo '# (variants informations: TODO)' + echo '# (you can remove `-hosts` to get the raw list)' + echo "#" + echo "# Generation date: $gen_date" + echo "# Generation software: eulaurarien $gen_software" + echo "# Number of source websites: $number_websites" + echo "# Number of source subdomains: $number_subdomains" + echo "# Number of source DNS records: ~2E9 + $number_dns" + echo "#" + echo "# Input rules: $rules_input" + echo "# Subsequent rules: $rules_found" + echo "# Output rules: $rules_output" + echo "#" + echo + sed 's|^|0.0.0.0 |' "$file_list" + ) > "$file_host" + + done +done diff --git a/feed_asn.py b/feed_asn.py new file mode 100755 index 0000000..25a35e2 --- /dev/null +++ b/feed_asn.py @@ -0,0 +1,71 @@ +#!/usr/bin/env python3 + +import database +import argparse +import requests +import typing +import ipaddress +import logging +import time + +IPNetwork = typing.Union[ipaddress.IPv4Network, ipaddress.IPv6Network] + + +def get_ranges(asn: str) -> typing.Iterable[str]: + req = requests.get( + 'https://stat.ripe.net/data/as-routing-consistency/data.json', + params={'resource': asn} + ) + data = req.json() + for pref in data['data']['prefixes']: + yield pref['prefix'] + + +def get_name(asn: str) -> str: + req = requests.get( + 'https://stat.ripe.net/data/as-overview/data.json', + params={'resource': asn} + ) + data = req.json() + return data['data']['holder'] + + +if __name__ == '__main__': + + log = logging.getLogger('feed_asn') + + # Parsing arguments + parser = argparse.ArgumentParser( + description="Add the IP ranges associated to the AS in the database") + args = parser.parse_args() + + DB = database.Database() + + def add_ranges(path: database.Path, + match: database.Match, + ) -> None: + assert isinstance(path, database.AsnPath) + assert isinstance(match, database.AsnNode) + asn_str = database.Database.unpack_asn(path) + DB.enter_step('asn_get_name') + name = get_name(asn_str) + match.name = name + DB.enter_step('asn_get_ranges') + for prefix in get_ranges(asn_str): + parsed_prefix: IPNetwork = ipaddress.ip_network(prefix) + if parsed_prefix.version == 4: + DB.set_ip4network( + prefix, + source=path, + updated=int(time.time()) + ) + log.info('Added %s from %s (%s)', prefix, path, name) + elif parsed_prefix.version == 6: + log.warning('Unimplemented prefix version: %s', prefix) + else: + log.error('Unknown prefix version: %s', prefix) + + for _ in DB.exec_each_asn(add_ranges): + pass + + DB.save() diff --git a/feed_dns.py b/feed_dns.py new file mode 100755 index 0000000..74fe1dd --- /dev/null +++ b/feed_dns.py @@ -0,0 +1,227 @@ +#!/usr/bin/env python3 + +import argparse +import database +import logging +import sys +import typing +import multiprocessing +import time + +Record = typing.Tuple[typing.Callable, typing.Callable, int, str, str] + +# select, write +FUNCTION_MAP: typing.Any = { + 'a': ( + database.Database.get_ip4, + database.Database.set_hostname, + ), + 'cname': ( + database.Database.get_domain, + database.Database.set_hostname, + ), + 'ptr': ( + database.Database.get_domain, + database.Database.set_ip4address, + ), +} + + 
+class Writer(multiprocessing.Process): + def __init__(self, + recs_queue: multiprocessing.Queue, + autosave_interval: int = 0, + ip4_cache: int = 0, + ): + super(Writer, self).__init__() + self.log = logging.getLogger(f'wr') + self.recs_queue = recs_queue + self.autosave_interval = autosave_interval + self.ip4_cache = ip4_cache + + def run(self) -> None: + self.db = database.Database() + self.db.log = logging.getLogger(f'wr') + self.db.fill_ip4cache(max_size=self.ip4_cache) + if self.autosave_interval > 0: + next_save = time.time() + self.autosave_interval + else: + next_save = 0 + + self.db.enter_step('block_wait') + block: typing.List[Record] + for block in iter(self.recs_queue.get, None): + + record: Record + for record in block: + + select, write, updated, name, value = record + self.db.enter_step('feed_switch') + + try: + for source in select(self.db, value): + write(self.db, name, updated, source=source) + except ValueError: + self.log.exception("Cannot execute: %s", record) + + if next_save > 0 and time.time() > next_save: + self.log.info("Saving database...") + self.db.save() + self.log.info("Done!") + next_save = time.time() + self.autosave_interval + + self.db.enter_step('block_wait') + + self.db.enter_step('end') + self.db.save() + + +class Parser(): + def __init__(self, + buf: typing.Any, + recs_queue: multiprocessing.Queue, + block_size: int, + ): + super(Parser, self).__init__() + self.buf = buf + self.log = logging.getLogger('pr') + self.recs_queue = recs_queue + self.block: typing.List[Record] = list() + self.block_size = block_size + self.prof = database.Profiler() + self.prof.log = logging.getLogger('pr') + + def register(self, record: Record) -> None: + self.prof.enter_step('register') + self.block.append(record) + if len(self.block) >= self.block_size: + self.prof.enter_step('put_block') + self.recs_queue.put(self.block) + self.block = list() + + def run(self) -> None: + self.consume() + self.recs_queue.put(self.block) + self.prof.profile() + + def consume(self) -> None: + raise NotImplementedError + + +class Rapid7Parser(Parser): + def consume(self) -> None: + data = dict() + for line in self.buf: + self.prof.enter_step('parse_rapid7') + split = line.split('"') + + try: + for k in range(1, 14, 4): + key = split[k] + val = split[k+2] + data[key] = val + + select, writer = FUNCTION_MAP[data['type']] + record = ( + select, + writer, + int(data['timestamp']), + data['name'], + data['value'] + ) + except IndexError: + self.log.exception("Cannot parse: %s", line) + self.register(record) + + +class MassDnsParser(Parser): + # massdns --output Snrql + # --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4 + TYPES = { + 'A': (FUNCTION_MAP['a'][0], FUNCTION_MAP['a'][1], -1, None), + # 'AAAA': (FUNCTION_MAP['aaaa'][0], FUNCTION_MAP['aaaa'][1], -1, None), + 'CNAME': (FUNCTION_MAP['cname'][0], FUNCTION_MAP['cname'][1], -1, -1), + } + + def consume(self) -> None: + self.prof.enter_step('parse_massdns') + timestamp = 0 + header = True + for line in self.buf: + line = line[:-1] + if not line: + header = True + continue + + split = line.split(' ') + try: + if header: + timestamp = int(split[1]) + header = False + else: + select, write, name_offset, value_offset = \ + MassDnsParser.TYPES[split[1]] + record = ( + select, + write, + timestamp, + split[0][:name_offset], + split[2][:value_offset], + ) + self.register(record) + self.prof.enter_step('parse_massdns') + except KeyError: + continue + + +PARSERS = { + 'rapid7': Rapid7Parser, + 'massdns': MassDnsParser, +} + +if __name__ == 
'__main__': + + # Parsing arguments + log = logging.getLogger('feed_dns') + args_parser = argparse.ArgumentParser( + description="Read DNS records and import " + "tracking-relevant data into the database") + args_parser.add_argument( + 'parser', + choices=PARSERS.keys(), + help="Input format") + args_parser.add_argument( + '-i', '--input', type=argparse.FileType('r'), default=sys.stdin, + help="Input file") + args_parser.add_argument( + '-b', '--block-size', type=int, default=1024, + help="Performance tuning value") + args_parser.add_argument( + '-q', '--queue-size', type=int, default=128, + help="Performance tuning value") + args_parser.add_argument( + '-a', '--autosave-interval', type=int, default=900, + help="Interval to which the database will save in seconds. " + "0 to disable.") + args_parser.add_argument( + '-4', '--ip4-cache', type=int, default=0, + help="RAM cache for faster IPv4 lookup. " + "Maximum useful value: 512 MiB (536870912). " + "Warning: Depending on the rules, this might already " + "be a memory-heavy process, even without the cache.") + args = args_parser.parse_args() + + recs_queue: multiprocessing.Queue = multiprocessing.Queue( + maxsize=args.queue_size) + + writer = Writer(recs_queue, + autosave_interval=args.autosave_interval, + ip4_cache=args.ip4_cache + ) + writer.start() + + parser = PARSERS[args.parser](args.input, recs_queue, args.block_size) + parser.run() + + recs_queue.put(None) + writer.join() diff --git a/feed_rules.py b/feed_rules.py new file mode 100755 index 0000000..9d0365f --- /dev/null +++ b/feed_rules.py @@ -0,0 +1,54 @@ +#!/usr/bin/env python3 + +import database +import argparse +import sys +import time + +FUNCTION_MAP = { + 'zone': database.Database.set_zone, + 'hostname': database.Database.set_hostname, + 'asn': database.Database.set_asn, + 'ip4network': database.Database.set_ip4network, + 'ip4address': database.Database.set_ip4address, +} + +if __name__ == '__main__': + + # Parsing arguments + parser = argparse.ArgumentParser( + description="Import base rules to the database") + parser.add_argument( + 'type', + choices=FUNCTION_MAP.keys(), + help="Type of rule inputed") + parser.add_argument( + '-i', '--input', type=argparse.FileType('r'), default=sys.stdin, + help="File with one rule per line") + parser.add_argument( + '-f', '--first-party', action='store_true', + help="The input only comes from verified first-party sources") + args = parser.parse_args() + + DB = database.Database() + + fun = FUNCTION_MAP[args.type] + + source: database.RulePath + if args.first_party: + source = database.RuleFirstPath() + else: + source = database.RuleMultiPath() + + for rule in args.input: + rule = rule.strip() + try: + fun(DB, + rule, + source=source, + updated=int(time.time()), + ) + except ValueError: + DB.log.error(f"Could not add rule: {rule}") + + DB.save() diff --git a/fetch_resources.sh b/fetch_resources.sh index bd6aa12..393d8e1 100644 --- a/fetch_resources.sh +++ b/fetch_resources.sh @@ -17,26 +17,13 @@ function dl() { log "Retrieving rules…" rm -f rules*/*.cache.* dl https://easylist.to/easylist/easyprivacy.txt rules_adblock/easyprivacy.cache.txt -# From firebog.net Tracking & Telemetry Lists -dl https://v.firebog.net/hosts/Prigent-Ads.txt rules/prigent-ads.cache.list -# dl https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-blocklist.txt rules/notrack-blocklist.cache.list -# False positives: https://github.com/WaLLy3K/wally3k.github.io/issues/73 -> 69.media.tumblr.com chicdn.net -dl 
https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.2o7Net/hosts rules_hosts/add2o7.cache.txt -dl https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt rules_hosts/spy.cache.txt -# dl https://raw.githubusercontent.com/Kees1958/WS3_annual_most_used_survey_blocklist/master/w3tech_hostfile.txt rules/w3tech.cache.list -# False positives: agreements.apple.com -> edgekey.net -# dl https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt rules_hosts/ads-and-tracking-extended.cache.txt # Lots of false-positives -# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/android-tracking.txt rules_hosts/android-tracking.cache.txt -# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/SmartTV.txt rules_hosts/smart-tv.cache.txt -# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/AmazonFireTV.txt rules_hosts/amazon-fire-tv.cache.txt + +log "Retrieving TLD list…" +dl http://data.iana.org/TLD/tlds-alpha-by-domain.txt temp/all_tld.temp.list +grep -v '^#' temp/all_tld.temp.list | awk '{print tolower($0)}' > temp/all_tld.list log "Retrieving nameservers…" -rm -f nameservers -touch nameservers -[ -f nameservers.head ] && cat nameservers.head >> nameservers -dl https://public-dns.info/nameservers.txt nameservers.temp -sort -R nameservers.temp >> nameservers -rm nameservers.temp +dl https://public-dns.info/nameservers.txt nameservers/public-dns.cache.list log "Retrieving top subdomains…" dl http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip top-1m.csv.zip @@ -51,4 +38,3 @@ then else mv temp/cisco-umbrella_popularity.fresh.list subdomains/cisco-umbrella_popularity.cache.list fi -dl https://www.orwell1984.today/cname/eulerian.net.txt subdomains/orwell-eulerian-cname-list.cache.list diff --git a/filter_subdomains.py b/filter_subdomains.py deleted file mode 100755 index 601a031..0000000 --- a/filter_subdomains.py +++ /dev/null @@ -1,160 +0,0 @@ -#!/usr/bin/env python3 -# pylint: disable=C0103 - -""" -From a list of subdomains, output only -the ones resolving to a first-party tracker. 
-""" - -import argparse -import sys -import progressbar -import csv -import typing -import ipaddress - -# DomainRule = typing.Union[bool, typing.Dict[str, 'DomainRule']] -DomainRule = typing.Union[bool, typing.Dict] -# IpRule = typing.Union[bool, typing.Dict[int, 'DomainRule']] -IpRule = typing.Union[bool, typing.Dict] - -RULES_DICT: DomainRule = dict() -RULES_IP_DICT: IpRule = dict() - - -def get_bits(address: ipaddress.IPv4Address) -> typing.Iterator[int]: - for char in address.packed: - for i in range(7, -1, -1): - yield (char >> i) & 0b1 - - -def subdomain_matching(subdomain: str) -> bool: - parts = subdomain.split('.') - parts.reverse() - dic = RULES_DICT - for part in parts: - if isinstance(dic, bool) or part not in dic: - break - dic = dic[part] - if isinstance(dic, bool): - return dic - return False - - -def ip_matching(ip_str: str) -> bool: - ip = ipaddress.ip_address(ip_str) - dic = RULES_IP_DICT - i = 0 - for bit in get_bits(ip): - i += 1 - if isinstance(dic, bool) or bit not in dic: - break - dic = dic[bit] - if isinstance(dic, bool): - return dic - return False - - -def get_matching(chain: typing.List[str], no_explicit: bool = False - ) -> typing.Iterable[str]: - if len(chain) <= 1: - return - initial = chain[0] - cname_destinations = chain[1:-1] - a_destination = chain[-1] - initial_matching = subdomain_matching(initial) - if no_explicit and initial_matching: - return - cname_matching = any(map(subdomain_matching, cname_destinations)) - if cname_matching or initial_matching or ip_matching(a_destination): - yield initial - - -def register_rule(subdomain: str) -> None: - # Make a tree with domain parts - parts = subdomain.split('.') - parts.reverse() - dic = RULES_DICT - last_part = len(parts) - 1 - for p, part in enumerate(parts): - if isinstance(dic, bool): - return - if p == last_part: - dic[part] = True - else: - dic.setdefault(part, dict()) - dic = dic[part] - - -def register_rule_ip(network: str) -> None: - net = ipaddress.ip_network(network) - ip = net.network_address - dic = RULES_IP_DICT - last_bit = net.prefixlen - 1 - for b, bit in enumerate(get_bits(ip)): - if isinstance(dic, bool): - return - if b == last_bit: - dic[bit] = True - else: - dic.setdefault(bit, dict()) - dic = dic[bit] - - -if __name__ == '__main__': - - # Parsing arguments - parser = argparse.ArgumentParser( - description="Filter first-party trackers from a list of subdomains") - parser.add_argument( - '-i', '--input', type=argparse.FileType('r'), default=sys.stdin, - help="Input file with DNS chains") - parser.add_argument( - '-o', '--output', type=argparse.FileType('w'), default=sys.stdout, - help="Outptut file with one tracking subdomain per line") - parser.add_argument( - '-n', '--no-explicit', action='store_true', - help="Don't output domains already blocked with rules without CNAME") - parser.add_argument( - '-r', '--rules', type=argparse.FileType('r'), - help="List of domains domains to block (with their subdomains)") - parser.add_argument( - '-p', '--rules-ip', type=argparse.FileType('r'), - help="List of IPs ranges to block") - args = parser.parse_args() - - # Progress bar - widgets = [ - progressbar.Percentage(), - ' ', progressbar.SimpleProgress(), - ' ', progressbar.Bar(), - ' ', progressbar.Timer(), - ' ', progressbar.AdaptiveTransferSpeed(unit='req'), - ' ', progressbar.AdaptiveETA(), - ] - progress = progressbar.ProgressBar(widgets=widgets) - - # Reading rules - if args.rules: - for rule in args.rules: - register_rule(rule.strip()) - if args.rules_ip: - for rule in args.rules_ip: - 
register_rule_ip(rule.strip()) - - # Approximating line count - if args.input.seekable(): - lines = 0 - for line in args.input: - lines += 1 - progress.max_value = lines - args.input.seek(0) - - # Reading domains to filter - reader = csv.reader(args.input) - progress.start() - for chain in reader: - for match in get_matching(chain, no_explicit=args.no_explicit): - print(match, file=args.output) - progress.update(progress.value + 1) - progress.finish() diff --git a/filter_subdomains.sh b/filter_subdomains.sh deleted file mode 100755 index 9a09b9a..0000000 --- a/filter_subdomains.sh +++ /dev/null @@ -1,85 +0,0 @@ -#!/usr/bin/env bash - -function log() { - echo -e "\033[33m$@\033[0m" -} - -if [ ! -f temp/all_resolved.csv ] -then - echo "Run ./resolve_subdomains.sh first!" - exit 1 -fi - -# Gather all the rules for filtering -log "Compiling rules…" -cat rules_adblock/*.txt | grep -v '^!' | grep -v '^\[Adblock' | sort -u > temp/all_rules_adblock.txt -./adblock_to_domain_list.py --input temp/all_rules_adblock.txt --output rules/from_adblock.cache.list -cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 > rules/from_hosts.cache.list -cat rules/*.list | grep -v '^#' | grep -v '^$' | sort -u > temp/all_rules_multi.list -cat rules/first-party.list | grep -v '^#' | grep -v '^$' | sort -u > temp/all_rules_first.list -cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | sort -u > temp/all_ip_rules_multi.txt -cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | sort -u > temp/all_ip_rules_first.txt - -log "Filtering first-party tracking domains…" -./filter_subdomains.py --rules temp/all_rules_first.list --rules-ip temp/all_ip_rules_first.txt --input temp/all_resolved_sorted.csv --output temp/firstparty-trackers.list -sort -u temp/firstparty-trackers.list > dist/firstparty-trackers.txt - -log "Filtering first-party curated tracking domains…" -./filter_subdomains.py --rules temp/all_rules_first.list --rules-ip temp/all_ip_rules_first.txt --input temp/all_resolved_sorted.csv --no-explicit --output temp/firstparty-only-trackers.list -sort -u temp/firstparty-only-trackers.list > dist/firstparty-only-trackers.txt - -log "Filtering multi-party tracking domains…" -./filter_subdomains.py --rules temp/all_rules_multi.list --rules-ip temp/all_ip_rules_multi.txt --input temp/all_resolved_sorted.csv --output temp/multiparty-trackers.list -sort -u temp/multiparty-trackers.list > dist/multiparty-trackers.txt - -log "Filtering multi-party curated tracking domains…" -./filter_subdomains.py --rules temp/all_rules_multi.list --rules-ip temp/all_ip_rules_multi.txt --input temp/all_resolved_sorted.csv --no-explicit --output temp/multiparty-only-trackers.list -sort -u temp/multiparty-only-trackers.list > dist/multiparty-only-trackers.txt - -# Format the blocklist so it can be used as a hostlist -function generate_hosts { - basename="$1" - description="$2" - description2="$3" - - ( - echo "# First-party trackers host list" - echo "# $description" - echo "# $description2" - echo "#" - echo "# About first-party trackers: https://git.frogeye.fr/geoffrey/eulaurarien#whats-a-first-party-tracker" - echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien" - echo "#" - echo "# In case of false positives/negatives, or any other question," - echo "# contact me the way you like: https://geoffrey.frogeye.fr" - echo "#" - echo "# Latest version:" - echo "# - First-party trackers : https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt" - echo "# - … excluding redirected: 
https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt" - echo "# - First and third party : https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt" - echo "# - … excluding redirected: https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt" - echo "#" - echo "# Generation date: $(date -Isec)" - echo "# Generation software: eulaurarien $(git describe --tags)" - echo "# Number of source websites: $(wc -l temp/all_websites.list | cut -d' ' -f1)" - echo "# Number of source subdomains: $(wc -l temp/all_subdomains.list | cut -d' ' -f1)" - echo "#" - echo "# Number of known first-party trackers: $(wc -l temp/all_rules_first.list | cut -d' ' -f1)" - echo "# Number of first-party subdomains: $(wc -l dist/firstparty-trackers.txt | cut -d' ' -f1)" - echo "# … excluding redirected: $(wc -l dist/firstparty-only-trackers.txt | cut -d' ' -f1)" - echo "#" - echo "# Number of known multi-party trackers: $(wc -l temp/all_rules_multi.list | cut -d' ' -f1)" - echo "# Number of multi-party subdomains: $(wc -l dist/multiparty-trackers.txt | cut -d' ' -f1)" - echo "# … excluding redirected: $(wc -l dist/multiparty-only-trackers.txt | cut -d' ' -f1)" - echo - cat "dist/$basename.txt" | while read host; - do - echo "0.0.0.0 $host" - done - ) > "dist/$basename-hosts.txt" -} - -generate_hosts "firstparty-trackers" "Generated from a curated list of first-party trackers" "" -generate_hosts "firstparty-only-trackers" "Generated from a curated list of first-party trackers" "Only contain the first chain of redirection." -generate_hosts "multiparty-trackers" "Generated from known third-party trackers." "Also contains trackers used as third-party." -generate_hosts "multiparty-only-trackers" "Generated from known third-party trackers." "Do not contain trackers used in third-party. Use in combination with third-party lists." diff --git a/import_rapid7.sh b/import_rapid7.sh new file mode 100755 index 0000000..4b5714f --- /dev/null +++ b/import_rapid7.sh @@ -0,0 +1,26 @@ +#!/usr/bin/env bash + +function log() { + echo -e "\033[33m$@\033[0m" +} + +function feed_rapid7_fdns { # dataset + dataset=$1 + line=$(curl -s https://opendata.rapid7.com/sonar.fdns_v2/ | grep "href=\".\+-fdns_$dataset.json.gz\"") + link="https://opendata.rapid7.com$(echo "$line" | cut -d'"' -f2)" + log "Reading $(echo "$dataset" | awk '{print toupper($0)}') records from $link" + curl -L "$link" | gunzip +} + +function feed_rapid7_rdns { + dataset=$1 + line=$(curl -s https://opendata.rapid7.com/sonar.rdns_v2/ | grep "href=\".\+-rdns.json.gz\"") + link="https://opendata.rapid7.com$(echo "$line" | cut -d'"' -f2)" + log "Reading PTR records from $link" + curl -L "$link" | gunzip +} + +feed_rapid7_rdns | ./feed_dns.py rapid7 +feed_rapid7_fdns a | ./feed_dns.py rapid7 --ip4-cache 536870912 +# feed_rapid7_fdns aaaa | ./feed_dns.py rapid7 --ip6-cache 536870912 +feed_rapid7_fdns cname | ./feed_dns.py rapid7 diff --git a/import_rules.sh b/import_rules.sh new file mode 100755 index 0000000..cbcfbd8 --- /dev/null +++ b/import_rules.sh @@ -0,0 +1,22 @@ +#!/usr/bin/env bash + +function log() { + echo -e "\033[33m$@\033[0m" +} + +log "Importing rules…" +BEFORE="$(date +%s)" +cat rules_adblock/*.txt | grep -v '^!' 
| grep -v '^\[Adblock' | ./adblock_to_domain_list.py | ./feed_rules.py zone +cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 | ./feed_rules.py zone +cat rules/*.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone +cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network +cat rules_asn/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn + +cat rules/first-party.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone --first-party +cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network --first-party +cat rules_asn/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn --first-party + +./feed_asn.py + +# log "Pruning old rules…" +# ./db.py --prune --prune-before "$BEFORE" --prune-base diff --git a/nameservers/.gitignore b/nameservers/.gitignore new file mode 100644 index 0000000..dbd03bc --- /dev/null +++ b/nameservers/.gitignore @@ -0,0 +1,2 @@ +*.custom.list +*.cache.list diff --git a/nameservers/popular.list b/nameservers/popular.list new file mode 100644 index 0000000..c35d391 --- /dev/null +++ b/nameservers/popular.list @@ -0,0 +1,24 @@ +8.8.8.8 +8.8.4.4 +2001:4860:4860:0:0:0:0:8888 +2001:4860:4860:0:0:0:0:8844 +208.67.222.222 +208.67.220.220 +2620:119:35::35 +2620:119:53::53 +4.2.2.1 +4.2.2.2 +8.26.56.26 +8.20.247.20 +84.200.69.80 +84.200.70.40 +2001:1608:10:25:0:0:1c04:b12f +2001:1608:10:25:0:0:9249:d69b +9.9.9.10 +149.112.112.10 +2620:fe::10 +2620:fe::fe:10 +1.1.1.1 +1.0.0.1 +2606:4700:4700::1111 +2606:4700:4700::1001 diff --git a/regexes.py b/regexes.py deleted file mode 100644 index 0e48441..0000000 --- a/regexes.py +++ /dev/null @@ -1,21 +0,0 @@ -#!/usr/bin/env python3 - -""" -List of regex matching first-party trackers. -""" - -# Syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax - -REGEXES = [ - r'^.+\.eulerian\.net\.$', # Eulerian - r'^.+\.criteo\.com\.$', # Criteo - r'^.+\.dnsdelegation\.io\.$', # Criteo - r'^.+\.keyade\.com\.$', # Keyade - r'^.+\.omtrdc\.net\.$', # Adobe Experience Cloud - r'^.+\.bp01\.net\.$', # NP6 - r'^.+\.ati-host\.net\.$', # Xiti (AT Internet) - r'^.+\.at-o\.net\.$', # Xiti (AT Internet) - r'^.+\.edgkey\.net\.$', # Edgekey (Akamai) - r'^.+\.akaimaiedge\.net\.$', # Edgekey (Akamai) - r'^.+\.storetail\.io\.$', # Storetail (Criteo) -] diff --git a/resolve_subdomains.py b/resolve_subdomains.py deleted file mode 100755 index ec10c47..0000000 --- a/resolve_subdomains.py +++ /dev/null @@ -1,284 +0,0 @@ -#!/usr/bin/env python3 - -""" -From a list of subdomains, output only -the ones resolving to a first-party tracker. -""" - -import argparse -import logging -import os -import queue -import sys -import threading -import typing -import csv - -import coloredlogs -import dns.exception -import dns.resolver -import progressbar - -DNS_TIMEOUT = 5.0 -NUMBER_THREADS = 512 -NUMBER_TRIES = 5 - -# TODO All the domains don't get treated, -# so it leaves with 4-5 subdomains not resolved - -glob = None - - -class Worker(threading.Thread): - """ - Worker process for a DNS resolver. - Will resolve DNS to match first-party subdomains. - """ - - def change_nameserver(self) -> None: - """ - Assign a this worker another nameserver from the queue. 
- """ - server = None - while server is None: - try: - server = self.orchestrator.nameservers_queue.get(block=False) - except queue.Empty: - self.orchestrator.refill_nameservers_queue() - self.log.info("Using nameserver: %s", server) - self.resolver.nameservers = [server] - - def __init__(self, - orchestrator: 'Orchestrator', - index: int = 0): - super(Worker, self).__init__() - self.log = logging.getLogger(f'worker{index:03d}') - self.orchestrator = orchestrator - - self.resolver = dns.resolver.Resolver() - self.change_nameserver() - - def resolve_subdomain(self, subdomain: str) -> typing.Optional[ - typing.List[ - str - ] - ]: - """ - Returns the resolution chain of the subdomain to an A record, - including any intermediary CNAME. - The last element is an IP address. - Returns None if the nameserver was unable to satisfy the request. - Returns [] if the requests points to nothing. - """ - self.log.debug("Querying %s", subdomain) - try: - query = self.resolver.query(subdomain, 'A', lifetime=DNS_TIMEOUT) - except dns.resolver.NXDOMAIN: - return [] - except dns.resolver.NoAnswer: - return [] - except dns.resolver.YXDOMAIN: - self.log.warning("Query name too long for %s", subdomain) - return None - except dns.resolver.NoNameservers: - # NOTE Most of the time this error message means that the domain - # does not exists, but sometimes it means the that the server - # itself is broken. So we count on the retry logic. - self.log.warning("All nameservers broken for %s", subdomain) - return None - except dns.exception.Timeout: - # NOTE Same as above - self.log.warning("Timeout for %s", subdomain) - return None - except dns.name.EmptyLabel: - self.log.warning("Empty label for %s", subdomain) - return None - resolved = list() - last = len(query.response.answer) - 1 - for a, answer in enumerate(query.response.answer): - if answer.rdtype == dns.rdatatype.CNAME: - assert a < last - resolved.append(answer.items[0].to_text()[:-1]) - elif answer.rdtype == dns.rdatatype.A: - assert a == last - resolved.append(answer.items[0].address) - else: - assert False - return resolved - - def run(self) -> None: - self.log.info("Started") - subdomain: str - for subdomain in iter(self.orchestrator.subdomains_queue.get, None): - - for _ in range(NUMBER_TRIES): - resolved = self.resolve_subdomain(subdomain) - # Retry with another nameserver if error - if resolved is None: - self.change_nameserver() - else: - break - - # If it wasn't found after multiple tries - if resolved is None: - self.log.error("Gave up on %s", subdomain) - resolved = [] - - resolved.insert(0, subdomain) - assert isinstance(resolved, list) - self.orchestrator.results_queue.put(resolved) - - self.orchestrator.results_queue.put(None) - self.log.info("Stopped") - - -class Orchestrator(): - """ - Orchestrator of the different Worker threads. - """ - - def refill_nameservers_queue(self) -> None: - """ - Re-fill the given nameservers into the nameservers queue. - Done every-time the queue is empty, making it - basically looping and infinite. 
- """ - # Might be in a race condition but that's probably fine - for nameserver in self.nameservers: - self.nameservers_queue.put(nameserver) - self.log.info("Refilled nameserver queue") - - def __init__(self, subdomains: typing.Iterable[str], - nameservers: typing.List[str] = None, - ): - self.log = logging.getLogger('orchestrator') - self.subdomains = subdomains - - # Use interal resolver by default - self.nameservers = nameservers or dns.resolver.Resolver().nameservers - - self.subdomains_queue: queue.Queue = queue.Queue( - maxsize=NUMBER_THREADS) - self.results_queue: queue.Queue = queue.Queue() - self.nameservers_queue: queue.Queue = queue.Queue() - - self.refill_nameservers_queue() - - def fill_subdomain_queue(self) -> None: - """ - Read the subdomains in input and put them into the queue. - Done in a thread so we can both: - - yield the results as they come - - not store all the subdomains at once - """ - self.log.info("Started reading subdomains") - # Send data to workers - for subdomain in self.subdomains: - self.subdomains_queue.put(subdomain) - - self.log.info("Finished reading subdomains") - # Send sentinel to each worker - # sentinel = None ~= EOF - for _ in range(NUMBER_THREADS): - self.subdomains_queue.put(None) - - def run(self) -> typing.Iterable[typing.List[str]]: - """ - Yield the results. - """ - # Create workers - self.log.info("Creating workers") - for i in range(NUMBER_THREADS): - Worker(self, i).start() - - fill_thread = threading.Thread(target=self.fill_subdomain_queue) - fill_thread.start() - - # Wait for one sentinel per worker - # In the meantime output results - for _ in range(NUMBER_THREADS): - result: typing.List[str] - for result in iter(self.results_queue.get, None): - yield result - - self.log.info("Waiting for reader thread") - fill_thread.join() - - self.log.info("Done!") - - -def main() -> None: - """ - Main function when used directly. - Read the subdomains provided and output it, - the last CNAME resolved and the IP adress it resolves to. - Takes as an input a filename (or nothing, for stdin), - and as an output a filename (or nothing, for stdout). - The input must be a subdomain per line, the output is a comma-sep - file with the columns source CNAME and A. - Use the file `nameservers` as the list of nameservers - to use, or else it will use the system defaults. - Also shows a nice progressbar. 
- """ - - # Initialization - coloredlogs.install( - level='DEBUG', - fmt='%(asctime)s %(name)s %(levelname)s %(message)s' - ) - - # Parsing arguments - parser = argparse.ArgumentParser( - description="Massively resolves subdomains and store them in a file.") - parser.add_argument( - '-i', '--input', type=argparse.FileType('r'), default=sys.stdin, - help="Input file with one subdomain per line") - parser.add_argument( - '-o', '--output', type=argparse.FileType('w'), default=sys.stdout, - help="Outptut file with DNS chains") - # parser.add_argument( - # '-n', '--nameserver', type=argparse.FileType('r'), - # default='nameservers', help="File with one nameserver per line") - # parser.add_argument( - # '-j', '--workers', type=int, default=512, - # help="Number of threads to use") - args = parser.parse_args() - - # Progress bar - widgets = [ - progressbar.Percentage(), - ' ', progressbar.SimpleProgress(), - ' ', progressbar.Bar(), - ' ', progressbar.Timer(), - ' ', progressbar.AdaptiveTransferSpeed(unit='req'), - ' ', progressbar.AdaptiveETA(), - ] - progress = progressbar.ProgressBar(widgets=widgets) - if args.input.seekable(): - progress.max_value = len(args.input.readlines()) - args.input.seek(0) - - # Cleaning input - iterator = iter(args.input) - iterator = map(str.strip, iterator) - iterator = filter(None, iterator) - - # Reading nameservers - servers: typing.List[str] = list() - if os.path.isfile('nameservers'): - servers = open('nameservers').readlines() - servers = list(filter(None, map(str.strip, servers))) - - writer = csv.writer(args.output) - - progress.start() - global glob - glob = Orchestrator(iterator, servers) - for resolved in glob.run(): - progress.update(progress.value + 1) - writer.writerow(resolved) - progress.finish() - - -if __name__ == '__main__': - main() diff --git a/resolve_subdomains.sh b/resolve_subdomains.sh index ed7af79..7a91337 100755 --- a/resolve_subdomains.sh +++ b/resolve_subdomains.sh @@ -4,11 +4,16 @@ function log() { echo -e "\033[33m$@\033[0m" } -# Resolve the CNAME chain of all the known subdomains for later analysis -log "Compiling subdomain lists..." -pv subdomains/*.list | sort -u > temp/all_subdomains.list -# Sort by last character to utilize the DNS server caching mechanism -pv temp/all_subdomains.list | rev | sort | rev > temp/all_subdomains_reversort.list -./resolve_subdomains.py --input temp/all_subdomains_reversort.list --output temp/all_resolved.csv -sort -u temp/all_resolved.csv > temp/all_resolved_sorted.csv +log "Compiling nameservers…" +pv nameservers/*.list | ./validate_list.py --ip4 | sort -u > temp/all_nameservers_ip4.list +log "Compiling subdomain…" +# Sort by last character to utilize the DNS server caching mechanism +# (not as efficient with massdns but it's almost free so why not) +pv subdomains/*.list | ./validate_list.py --domain | rev | sort -u | rev > temp/all_subdomains.list + +log "Resolving subdomain…" +massdns --output Snrql --retry REFUSED,SERVFAIL --resolvers temp/all_nameservers_ip4.list --outfile temp/all_resolved.txt temp/all_subdomains.list + +log "Importing into database…" +pv temp/all_resolved.txt | ./feed_dns.py massdns diff --git a/rules/first-party.list b/rules/first-party.list index b7c393e..3092397 100644 --- a/rules/first-party.list +++ b/rules/first-party.list @@ -18,7 +18,14 @@ omtrdc.net online-metrix.net # Webtrekk wt-eu02.net +webtrekk.net # Otto Group oghub.io -# ??? 
+# Intent.com partner.intentmedia.net +# Wizaly +wizaly.com +# Commanders Act +tagcommander.com +# Ingenious Technologies +affex.org diff --git a/rules_asn/.gitignore b/rules_asn/.gitignore new file mode 100644 index 0000000..d2df6a8 --- /dev/null +++ b/rules_asn/.gitignore @@ -0,0 +1,2 @@ +*.custom.txt +*.cache.txt diff --git a/rules_asn/first-party.txt b/rules_asn/first-party.txt new file mode 100644 index 0000000..e7b93fa --- /dev/null +++ b/rules_asn/first-party.txt @@ -0,0 +1,10 @@ +# Eulerian +AS50234 +# Criteo +AS44788 +AS19750 +AS55569 +# ThreatMetrix +AS30286 +# Webtrekk +AS60164 diff --git a/rules_ip/first-party.txt b/rules_ip/first-party.txt index 3561894..e69de29 100644 --- a/rules_ip/first-party.txt +++ b/rules_ip/first-party.txt @@ -1,51 +0,0 @@ -# Eulerian (AS50234 EULERIAN TECHNOLOGIES S.A.S.) -109.232.192.0/21 -# Criteo (AS44788 Criteo SA) -91.199.242.0/24 -91.212.98.0/24 -178.250.0.0/21 -178.250.0.0/24 -178.250.1.0/24 -178.250.2.0/24 -178.250.3.0/24 -178.250.4.0/24 -178.250.6.0/24 -185.235.84.0/24 -# Criteo (AS19750 Criteo Corp.) -74.119.116.0/22 -74.119.117.0/24 -74.119.118.0/24 -74.119.119.0/24 -91.199.242.0/24 -185.235.85.0/24 -199.204.168.0/22 -199.204.168.0/24 -199.204.169.0/24 -199.204.170.0/24 -199.204.171.0/24 -178.250.0.0/21 -91.212.98.0/24 -91.199.242.0/24 -185.235.84.0/24 -# Criteo (AS55569 Criteo APAC) -91.199.242.0/24 -116.213.20.0/22 -116.213.20.0/24 -116.213.21.0/24 -182.161.72.0/22 -182.161.72.0/24 -182.161.73.0/24 -185.235.86.0/24 -185.235.87.0/24 -# ThreatMetrix (AS30286 ThreatMetrix Inc.) -69.84.176.0/24 -173.254.179.0/24 -185.32.240.0/23 -185.32.242.0/23 -192.225.156.0/22 -199.101.156.0/23 -199.101.158.0/23 -# Webtrekk (AS60164 Webtrekk GmbH) -185.54.148.0/22 -185.54.150.0/24 -185.54.151.0/24 diff --git a/run_tests.py b/run_tests.py new file mode 100755 index 0000000..548b6eb --- /dev/null +++ b/run_tests.py @@ -0,0 +1,34 @@ +#!/usr/bin/env python3 + +import database +import os +import logging +import csv + +TESTS_DIR = 'tests' + +if __name__ == '__main__': + + DB = database.Database() + log = logging.getLogger('tests') + + for filename in os.listdir(TESTS_DIR): + log.info("") + log.info("Running tests from %s", filename) + path = os.path.join(TESTS_DIR, filename) + with open(path, 'rt') as fdesc: + reader = csv.DictReader(fdesc) + for test in reader: + log.info("Testing %s (%s)", test['url'], test['comment']) + + for white in test['white'].split(':'): + if not white: + continue + if any(DB.get_domain(white)): + log.error("False positive: %s", white) + + for black in test['black'].split(':'): + if not black: + continue + if not any(DB.get_domain(black)): + log.error("False negative: %s", black) diff --git a/tests/false-positives.csv b/tests/false-positives.csv index c20639a..664b630 100644 --- a/tests/false-positives.csv +++ b/tests/false-positives.csv @@ -1,6 +1,5 @@ url,white,black,comment https://support.apple.com,support.apple.com,,EdgeKey / AkamaiEdge https://www.pinterest.fr/,i.pinimg.com,,Cedexis -https://www.pinterest.fr/,i.pinimg.com,,Cedexis https://www.tumblr.com/,66.media.tumblr.com,,ChiCDN https://www.skype.com/fr/,www.skype.com,,TrafficManager diff --git a/tests/first-party.csv b/tests/first-party.csv index 5084bcb..92ff458 100644 --- a/tests/first-party.csv +++ b/tests/first-party.csv @@ -5,3 +5,6 @@ https://www.discover.com/,,content.discover.com,ThreatMetrix https://www.mytoys.de/,,web.mytoys.de,Webtrekk https://www.baur.de/,,tp.baur.de,Otto Group https://www.liligo.com/,,compare.liligo.com,??? 
+https://www.boulanger.com/,,tag.boulanger.fr,TagCommander +https://www.airfrance.fr/FR/,,tk.airfrance.fr,Wizaly +https://www.vsgamers.es/,,marketing.net.vsgamers.es,Affex diff --git a/validate_list.py b/validate_list.py new file mode 100755 index 0000000..23e46d7 --- /dev/null +++ b/validate_list.py @@ -0,0 +1,35 @@ +#!/usr/bin/env python3 +# pylint: disable=C0103 + +""" +Filter out invalid domain names +""" + +import database +import argparse +import sys + +if __name__ == '__main__': + + # Parsing arguments + parser = argparse.ArgumentParser( + description="Filter out invalid domain name/ip addresses from a list.") + parser.add_argument( + '-i', '--input', type=argparse.FileType('r'), default=sys.stdin, + help="Input file, one element per line") + parser.add_argument( + '-o', '--output', type=argparse.FileType('w'), default=sys.stdout, + help="Output file, one element per line") + parser.add_argument( + '-d', '--domain', action='store_true', + help="Can be domain name") + parser.add_argument( + '-4', '--ip4', action='store_true', + help="Can be IP4") + args = parser.parse_args() + + for line in args.input: + line = line.strip() + if (args.domain and database.Database.validate_domain(line)) or \ + (args.ip4 and database.Database.validate_ip4address(line)): + print(line, file=args.output)
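
For reference, a minimal usage sketch of the scripts introduced above, assuming the blocking database is already initialized and the rules*/, rules_ip/, rules_asn/ and subdomains/ source files are in place; the order shown is one plausible sequence and is not enforced by the scripts themselves:

#!/usr/bin/env bash
# Sketch only: ordering and file layout are assumptions, not requirements of this patch.
./import_rules.sh           # feed zone, IP-network and ASN rules into the database
./import_rapid7.sh          # stream Rapid7 PTR/A/CNAME records through feed_dns.py
./resolve_subdomains.sh     # alternatively: resolve local subdomain lists with massdns
./run_tests.py              # check the known false positives/negatives in tests/*.csv
# validate_list.py is used as a filter; this line is taken verbatim from
# resolve_subdomains.sh above:
pv subdomains/*.list | ./validate_list.py --domain | rev | sort -u | rev > temp/all_subdomains.list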