Merge branch 'newworkflow'

2019-12-20 17:18:42 +01:00 · 2019-12-20 17:18:42 +01:00 · cd46b39756
commit cd46b39756
parent f5c60c482a 38cf532854
28 changed files with 1659 additions and 698 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,2 @@
 *.log
-nameservers
-nameservers.head
+*.p
--- a/README.md
+++ b/README.md
@@ -1,98 +1,133 @@
 # eulaurarien

-Generates a host list of first-party trackers for ad-blocking.
+This program generates a list of every hostname that is a DNS redirection to one of a list of DNS zones and IP networks.

-The latest list is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
+It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md) (learn about first-party trackers by following this link).

-**DISCLAIMER:** I'm by no way an expert on this subject so my vocabulary or other stuff might be wrong. Use at your own risk.
+If you want to contribute but don't want to create an account on this forge, contact me the way you like: <https://geoffrey.frogeye.fr>

-## What's a first-party tracker?
+## How does this work

-Traditionally, websites load trackers scripts directly.
-For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users.
-In order to block those, one can simply block the host `trackercompany.com`.
+This program takes as input:

-However, to circumvent this easy block, tracker companies made the website using them load trackers from `somethingirelevant.website1.com`.
-The latter being a DNS redirection to `website1.trackercompany.com`, directly pointing to a server serving the tracking script.
-Those are the first-party trackers.
+- Lists of hostnames to match
+- Lists of DNS zones to match (a domain and its subdomains)
+- Lists of IP addresses / IP networks to match
+- Lists of Autonomous System numbers to match
+- An enormous quantity of DNS records

-Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since:
+It outputs the hostnames that are a DNS redirection to any item in the lists provided.

-1. Most ad-blocker don't support wildcards
-2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com`
+DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).

-So the only solution is to block every `somethingirelevant.website1.com`-like subdomains known, which is a lot.
-That's where this scripts comes in, to generate a list of such subdomains.
-
-## How does this script work
-
-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
-
-It takes an input a list of websites with trackers included.
-So far, this list is manually-generated from the list of clients of such first-party trackers
-(latter we should use a general list of websites to be more exhaustive).
-It open each ones of those websites (just the homepage) in a web browser, and record the domains of the network requests the page makes.
-
-Additionaly, or alternatively, you can feed the script some browsing history and get domains from there.
-
-It then find the DNS redirections of those domains, and compare with regexes of known tracking domains.
-It finally outputs the matching ones.
-
-## Requirements
-
-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
-
-Just to build the list, you can find an already-built list in the releases.
-
- Bash
- [Python 3.4+](https://www.python.org/)
- [progressbar2](https://pypi.org/project/progressbar2/)
- dnspython
- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up)
-
-(if you don't want to collect the subdomains, you can skip the following) 
-
- Firefox
- Selenium
- seleniumwire
+Those subdomains can either be provided as is, come from the [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that).

 ## Usage

-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
+Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md).

-This is only if you want to build the list yourself.
-If you just want to use the list, the latest build is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
-It was build using additional sources not included in this repository for privacy reasons.
+The following is for people who want to build their own list.

-### Add personal sources
+### Requirements

-The list of websites provided in this script is by no mean exhaustive,
-so adding your own browsing history will help create a better list.
+Depending on the sources you'll be using to generate the list, you'll need to install some of the following:
+
+- [Bash](https://www.gnu.org/software/bash/bash.html)
+- [Coreutils](https://www.gnu.org/software/coreutils/)
+- [curl](https://curl.haxx.se)
+- [pv](http://www.ivarch.com/programs/pv.shtml)
+- [Python 3.4+](https://www.python.org/)
+- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
+- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
+- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
+- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
+- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source)
+
+### Create a new database
+
+The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASN, IPs, hostnames, zones…) and every entity leading to it.
+For now there's no way to remove data from it, so here's the command to recreate it: `./db.py --initialize`.
+
+### Gather external sources
+
+External sources are not stored in this repository.
+You'll need to fetch them by running `./fetch_resources.sh`.
+Those include:
+
+- Third-party trackers lists
+- TLD lists (used to test the validity of hostnames)
+- List of public DNS resolvers (for DNS resolving from subdomains)
+- Top 1M subdomains
+
+### Import rules into the database
+
+You need to put the lists of rules for matching in the different subfolders:
+
+- `rules`: Lists of DNS zones
+- `rules_ip`: Lists of IP networks (for IP addresses append `/32`)
+- `rules_asn`: Lists of Autonomous System numbers (IP ranges will be deduced from them)
+- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted)
+- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists
+
+See the provided examples for syntax.
+
+In each folder:
+
+- `first-party.ext` files will be the only ones considered for the first-party variant of the list
+- `*.cache.ext` files come from external sources, and thus might be deleted / overwritten
+- `*.custom.ext` files are for sources that you don't want committed
+
+Then, run `./import_rules.sh`.
+
+### Add subdomains
+
+If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive),
+the top 1M subdomains provided might not be enough.
+
+You can add more subdomains to the `subdomains` folder.
+The same conventions as in the rules folders apply to `*.cache.ext` and `*.custom.ext` files.
+
+#### Add personal sources
+
+Adding your own browsing history will help create a better-suited subdomains list.
 Here are reference commands for possible sources:

 - **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
 - **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`

-### Collect subdomains from websites
+#### Collect subdomains from websites

-Just run `collect_subdomain.sh`.
+You can add website URLs to the `websites` folder.
+The same conventions as in the rules folders apply to `*.cache.ext` and `*.custom.ext` files.
+
+Then, run `collect_subdomain.sh`.
 This is a long step, and might be memory-intensive from time to time.

-This step is optional if you already added personal sources.
-Alternatively, you can get just download the list of subdomains used to generate the official block list here: <https://hostfiles.frogeye.fr/from_websites.cache.list> (put it in the `subdomains` folder).
+> **Note:** For first-party tracking, a list of subdomains derived from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list>

-### Extract tracking domains
+### Resolve DNS records

-Make sure your system is configured with a DNS server without limitation.
-Then, run `filter_subdomain.sh`.
-The files you need will be in the folder `dist`.
+Once you've added subdomains, you'll need to resolve them to get their DNS records.
+The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory.

-## Contributing
+Then, run `./resolve_subdomains.sh`.
+Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet count.

-### Adding websites
+> Some VPS providers might detect this as a DDoS attack and cut the network access.
+> Some Wi-Fi connections can be rendered unusable for other uses, some routers might cease to work.
+> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow Ethernet link (Raspberry Pi < 4).

-Just add the URL to the relevant list: `websites/<source>.list`.
+The DNS records will automatically be imported into the database.
+If you want to re-import the records without resolving them again, just run the last line of the `./resolve_subdomains.sh` script.

-### Adding first-party trackers regex
+### Import DNS records from Rapid7
+
+Just run `./import_rapid7.sh`.
+This will download about 35 GiB of data, but only the matching records will be stored (a few MiB for the tracking rules).
+Note that the download speed will most likely be limited by the database operation throughput (fast RAM will help).
+
+### Export the lists
+
+For the tracking list, use `./export_lists.sh`; the output will be in the `dist` folder (please change the links before distributing them).
+For other purposes, tinker with the `./export.py` program.

-Just add them to `regexes.py`.
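
The new README describes a DNS zone as "a domain and its subdomains", and the `database.py` diff that follows stores domains as reversed label lists (see `pack_domain`) inside a tree. The block below is a minimal sketch of that idea, not the author's implementation: `ZoneTrie`, `add_zone`, `matching_zones` and `is_zone` are hypothetical names introduced only for illustration, and the sketch ignores timestamps, sources and first-party flags.

```python
import typing


def pack_domain(domain: str) -> typing.List[str]:
    """Split a domain into labels, most significant (TLD) first."""
    return domain.lower().rstrip('.').split('.')[::-1]


class ZoneTrie:
    """Hypothetical, simplified stand-in for a domain tree."""

    def __init__(self) -> None:
        self.children: typing.Dict[str, 'ZoneTrie'] = {}
        self.is_zone = False

    def add_zone(self, zone: str) -> None:
        # Walk (and create) one node per label, starting from the TLD.
        node = self
        for label in pack_domain(zone):
            node = node.children.setdefault(label, ZoneTrie())
        node.is_zone = True

    def matching_zones(self, hostname: str) -> typing.List[str]:
        """Return every registered zone that contains the hostname."""
        labels = pack_domain(hostname)
        found: typing.List[str] = []
        node = self
        for depth, label in enumerate(labels, start=1):
            child = node.children.get(label)
            if child is None:
                break
            node = child
            if node.is_zone:
                # Rebuild the zone name from the labels walked so far.
                found.append('.'.join(labels[:depth][::-1]))
        return found


trie = ZoneTrie()
trie.add_zone('trackercompany.com')
print(trie.matching_zones('website1.trackercompany.com'))
print(trie.matching_zones('nottrackercompany.com'))
```

Reversing the labels means a zone and all of its subdomains share a common path from the root, so one walk down the tree finds every enclosing zone without wildcard or regex matching.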
--- a/database.py
+++ b/database.py
@@ -0,0 +1,739 @@
+#!/usr/bin/env python3
+
+"""
+Utility functions to interact with the database.
+"""
+
+import typing
+import time
+import logging
+import coloredlogs
+import pickle
+import numpy
+import math
+
+TLD_LIST: typing.Set[str] = set()
+
+coloredlogs.install(
+    level='DEBUG',
+    fmt='%(asctime)s %(name)s %(levelname)s %(message)s'
+)
+
+Asn = int
+Timestamp = int
+Level = int
+
+
+class Path():
+    # FP add boolean here
+    pass
+
+
+class RulePath(Path):
+    def __str__(self) -> str:
+        return '(rule)'
+
+
+class RuleFirstPath(RulePath):
+    def __str__(self) -> str:
+        return '(first-party rule)'
+
+
+class RuleMultiPath(RulePath):
+    def __str__(self) -> str:
+        return '(multi-party rule)'
+
+
+class DomainPath(Path):
+    def __init__(self, parts: typing.List[str]):
+        self.parts = parts
+
+    def __str__(self) -> str:
+        return '?.' + Database.unpack_domain(self)
+
+
+class HostnamePath(DomainPath):
+    def __str__(self) -> str:
+        return Database.unpack_domain(self)
+
+
+class ZonePath(DomainPath):
+    def __str__(self) -> str:
+        return '*.' + Database.unpack_domain(self)
+
+
+class AsnPath(Path):
+    def __init__(self, asn: Asn):
+        self.asn = asn
+
+    def __str__(self) -> str:
+        return Database.unpack_asn(self)
+
+
+class Ip4Path(Path):
+    def __init__(self, value: int, prefixlen: int):
+        self.value = value
+        self.prefixlen = prefixlen
+
+    def __str__(self) -> str:
+        return Database.unpack_ip4network(self)
+
+
+class Match():
+    def __init__(self) -> None:
+        self.source: typing.Optional[Path] = None
+        self.updated: int = 0
+        self.dupplicate: bool = False
+
+        # Cache
+        self.level: int = 0
+        self.first_party: bool = False
+        self.references: int = 0
+
+    def active(self, first_party: typing.Optional[bool] = None) -> bool:
+        if self.updated == 0 or (first_party and not self.first_party):
+            return False
+        return True
+
+
+class AsnNode(Match):
+    def __init__(self) -> None:
+        Match.__init__(self)
+        self.name = ''
+
+
+class DomainTreeNode():
+    def __init__(self) -> None:
+        self.children: typing.Dict[str, DomainTreeNode] = dict()
+        self.match_zone = Match()
+        self.match_hostname = Match()
+
+
+class IpTreeNode(Match):
+    def __init__(self) -> None:
+        Match.__init__(self)
+        self.zero: typing.Optional[IpTreeNode] = None
+        self.one: typing.Optional[IpTreeNode] = None
+
+
+Node = typing.Union[DomainTreeNode, IpTreeNode, AsnNode]
+MatchCallable = typing.Callable[[Path,
+                                 Match],
+                                typing.Any]
+
+
+class Profiler():
+    def __init__(self) -> None:
+        self.log = logging.getLogger('profiler')
+        self.time_last = time.perf_counter()
+        self.time_step = 'init'
+        self.time_dict: typing.Dict[str, float] = dict()
+        self.step_dict: typing.Dict[str, int] = dict()
+
+    def enter_step(self, name: str) -> None:
+        now = time.perf_counter()
+        try:
+            self.time_dict[self.time_step] += now - self.time_last
+            self.step_dict[self.time_step] += int(name != self.time_step)
+        except KeyError:
+            self.time_dict[self.time_step] = now - self.time_last
+            self.step_dict[self.time_step] = 1
+        self.time_step = name
+        self.time_last = time.perf_counter()
+
+    def profile(self) -> None:
+        self.enter_step('profile')
+        total = sum(self.time_dict.values())
+        for key, secs in sorted(self.time_dict.items(), key=lambda t: t[1]):
+            times = self.step_dict[key]
+            self.log.debug(f"{key:<20}: {times:9d} × {secs/times:5.3e} "
+                           f"= {secs:9.2f} s ({secs/total:7.2%}) ")
+        self.log.debug(f"{'total':<20}:                         "
+                       f"{total:9.2f} s ({1:7.2%})")
+
+
+class Database(Profiler):
+    VERSION = 18
+    PATH = "blocking.p"
+
+    def initialize(self) -> None:
+        self.log.warning(
+            "Creating database version: %d ",
+            Database.VERSION)
+        # Dummy match objects that everything refers to
+        self.rules: typing.List[Match] = list()
+        for first_party in (False, True):
+            m = Match()
+            m.updated = 1
+            m.level = 0
+            m.first_party = first_party
+            self.rules.append(m)
+        self.domtree = DomainTreeNode()
+        self.asns: typing.Dict[Asn, AsnNode] = dict()
+        self.ip4tree = IpTreeNode()
+
+    def load(self) -> None:
+        self.enter_step('load')
+        try:
+            with open(self.PATH, 'rb') as db_fdsec:
+                version, data = pickle.load(db_fdsec)
+                if version == Database.VERSION:
+                    self.rules, self.domtree, self.asns, self.ip4tree = data
+                    return
+                self.log.warning(
+                    "Outdated database version found: %d, "
+                    "it will be rebuilt.",
+                    version)
+        except (TypeError, AttributeError, EOFError):
+            self.log.error(
+                "Corrupt (or heavily outdated) database found, "
+                "it will be rebuilt.")
+        except FileNotFoundError:
+            pass
+        self.initialize()
+
+    def save(self) -> None:
+        self.enter_step('save')
+        with open(self.PATH, 'wb') as db_fdsec:
+            data = self.rules, self.domtree, self.asns, self.ip4tree
+            pickle.dump((self.VERSION, data), db_fdsec)
+        self.profile()
+
+    def __init__(self) -> None:
+        Profiler.__init__(self)
+        self.log = logging.getLogger('db')
+        self.load()
+        self.ip4cache_shift: int = 32
+        self.ip4cache = numpy.ones(1)
+
+    def _set_ip4cache(self, path: Path, _: Match) -> None:
+        assert isinstance(path, Ip4Path)
+        self.enter_step('set_ip4cache')
+        mini = path.value >> self.ip4cache_shift
+        maxi = (path.value + 2**(32-path.prefixlen)) >> self.ip4cache_shift
+        if mini == maxi:
+            self.ip4cache[mini] = True
+        else:
+            self.ip4cache[mini:maxi] = True
+
+    def fill_ip4cache(self, max_size: int = 512*1024**2) -> None:
+        """
+        Size in bytes
+        """
+        if max_size > 2**32/8:
+            self.log.warning("Allocating more than 512 MiB of RAM for "
+                             "the Ip4 cache is not necessary.")
+        max_cache_width = int(math.log2(max(1, max_size*8)))
+        # An IPv4 address has 32 bits, so the cache index never needs more
+        cache_width = min(32, max_cache_width)
+        self.ip4cache_shift = 32-cache_width
+        cache_size = 2**cache_width
+        # numpy.bool was removed in NumPy 1.24; the builtin bool is equivalent
+        self.ip4cache = numpy.zeros(cache_size, dtype=bool)
+        for _ in self.exec_each_ip4(self._set_ip4cache):
+            pass
+
+    @staticmethod
+    def populate_tld_list() -> None:
+        with open('temp/all_tld.list', 'r') as tld_fdesc:
+            for tld in tld_fdesc:
+                tld = tld.strip()
+                TLD_LIST.add(tld)
+
+    @staticmethod
+    def validate_domain(path: str) -> bool:
+        if len(path) > 255:
+            return False
+        splits = path.split('.')
+        if not TLD_LIST:
+            Database.populate_tld_list()
+        if splits[-1] not in TLD_LIST:
+            return False
+        for split in splits:
+            if not 1 <= len(split) <= 63:
+                return False
+        return True
+
+    @staticmethod
+    def pack_domain(domain: str) -> DomainPath:
+        return DomainPath(domain.split('.')[::-1])
+
+    @staticmethod
+    def unpack_domain(domain: DomainPath) -> str:
+        return '.'.join(domain.parts[::-1])
+
+    @staticmethod
+    def pack_asn(asn: str) -> AsnPath:
+        asn = asn.upper()
+        if asn.startswith('AS'):
+            asn = asn[2:]
+        return AsnPath(int(asn))
+
+    @staticmethod
+    def unpack_asn(asn: AsnPath) -> str:
+        return f'AS{asn.asn}'
+
+    @staticmethod
+    def validate_ip4address(path: str) -> bool:
+        splits = path.split('.')
+        if len(splits) != 4:
+            return False
+        for split in splits:
+            try:
+                if not 0 <= int(split) <= 255:
+                    return False
+            except ValueError:
+                return False
+        return True
+
+    @staticmethod
+    def pack_ip4address(address: str) -> Ip4Path:
+        addr = 0
+        for split in address.split('.'):
+            addr = (addr << 8) + int(split)
+        return Ip4Path(addr, 32)
+
+    @staticmethod
+    def unpack_ip4address(address: Ip4Path) -> str:
+        addr = address.value
+        assert address.prefixlen == 32
+        octets: typing.List[int] = [0] * 4
+        for o in reversed(range(4)):
+            octets[o] = addr & 0xFF
+            addr >>= 8
+        return '.'.join(map(str, octets))
+
+    @staticmethod
+    def validate_ip4network(path: str) -> bool:
+        # A bit generous but ok for our usage
+        splits = path.split('/')
+        if len(splits) != 2:
+            return False
+        if not Database.validate_ip4address(splits[0]):
+            return False
+        try:
+            if not 0 <= int(splits[1]) <= 32:
+                return False
+        except ValueError:
+            return False
+        return True
+
+    @staticmethod
+    def pack_ip4network(network: str) -> Ip4Path:
+        address, prefixlen_str = network.split('/')
+        prefixlen = int(prefixlen_str)
+        addr = Database.pack_ip4address(address)
+        addr.prefixlen = prefixlen
+        return addr
+
+    @staticmethod
+    def unpack_ip4network(network: Ip4Path) -> str:
+        addr = network.value
+        octets: typing.List[int] = [0] * 4
+        for o in reversed(range(4)):
+            octets[o] = addr & 0xFF
+            addr >>= 8
+        return '.'.join(map(str, octets)) + '/' + str(network.prefixlen)
+
+    def get_match(self, path: Path) -> Match:
+        if isinstance(path, RuleMultiPath):
+            return self.rules[0]
+        elif isinstance(path, RuleFirstPath):
+            return self.rules[1]
+        elif isinstance(path, AsnPath):
+            return self.asns[path.asn]
+        elif isinstance(path, DomainPath):
+            dicd = self.domtree
+            for part in path.parts:
+                dicd = dicd.children[part]
+            if isinstance(path, HostnamePath):
+                return dicd.match_hostname
+            elif isinstance(path, ZonePath):
+                return dicd.match_zone
+            else:
+                raise ValueError
+        elif isinstance(path, Ip4Path):
+            dici = self.ip4tree
+            for i in range(31, 31-path.prefixlen, -1):
+                bit = (path.value >> i) & 0b1
+                dici_next = dici.one if bit else dici.zero
+                if not dici_next:
+                    raise IndexError
+                dici = dici_next
+            return dici
+        else:
+            raise ValueError
+
+    def exec_each_asn(self,
+                      callback: MatchCallable,
+                      ) -> typing.Any:
+        for asn in self.asns:
+            match = self.asns[asn]
+            if match.active():
+                c = callback(
+                    AsnPath(asn),
+                    match,
+                )
+                try:
+                    yield from c
+                except TypeError:  # not iterable
+                    pass
+
+    def exec_each_domain(self,
+                         callback: MatchCallable,
+                         _dic: typing.Optional[DomainTreeNode] = None,
+                         _par: typing.Optional[DomainPath] = None,
+                         ) -> typing.Any:
+        _dic = _dic or self.domtree
+        _par = _par or DomainPath([])
+        if _dic.match_hostname.active():
+            c = callback(
+                HostnamePath(_par.parts),
+                _dic.match_hostname,
+            )
+            try:
+                yield from c
+            except TypeError:  # not iterable
+                pass
+        if _dic.match_zone.active():
+            c = callback(
+                ZonePath(_par.parts),
+                _dic.match_zone,
+            )
+            try:
+                yield from c
+            except TypeError:  # not iterable
+                pass
+        for part in _dic.children:
+            dic = _dic.children[part]
+            yield from self.exec_each_domain(
+                callback,
+                _dic=dic,
+                _par=DomainPath(_par.parts + [part])
+            )
+
+    def exec_each_ip4(self,
+                      callback: MatchCallable,
+                      _dic: typing.Optional[IpTreeNode] = None,
+                      _par: typing.Optional[Ip4Path] = None,
+                      ) -> typing.Any:
+        _dic = _dic or self.ip4tree
+        _par = _par or Ip4Path(0, 0)
+        if _dic.active():
+            c = callback(
+                _par,
+                _dic,
+            )
+            try:
+                yield from c
+            except TypeError:  # not iterable
+                pass
+
+        # 0
+        pref = _par.prefixlen + 1
+        dic = _dic.zero
+        if dic:
+            # addr0 = _par.value & (0xFFFFFFFF ^ (1 << (32-pref)))
+            # assert addr0 == _par.value
+            addr0 = _par.value
+            yield from self.exec_each_ip4(
+                callback,
+                _dic=dic,
+                _par=Ip4Path(addr0, pref)
+            )
+        # 1
+        dic = _dic.one
+        if dic:
+            addr1 = _par.value | (1 << (32-pref))
+            # assert addr1 != _par.value
+            yield from self.exec_each_ip4(
+                callback,
+                _dic=dic,
+                _par=Ip4Path(addr1, pref)
+            )
+
+    def exec_each(self,
+                  callback: MatchCallable,
+                  ) -> typing.Any:
+        yield from self.exec_each_domain(callback)
+        yield from self.exec_each_ip4(callback)
+        yield from self.exec_each_asn(callback)
+
+    def update_references(self) -> None:
+        # Should be correctly calculated normally,
+        # keeping this just in case
+        def reset_references_cb(path: Path,
+                                match: Match
+                                ) -> None:
+            match.references = 0
+        for _ in self.exec_each(reset_references_cb):
+            pass
+
+        def increment_references_cb(path: Path,
+                                    match: Match
+                                    ) -> None:
+            if match.source:
+                source = self.get_match(match.source)
+                source.references += 1
+        for _ in self.exec_each(increment_references_cb):
+            pass
+
+    def prune(self, before: int, base_only: bool = False) -> None:
+        raise NotImplementedError
+
+    def explain(self, path: Path) -> str:
+        match = self.get_match(path)
+        if isinstance(match, AsnNode):
+            string = f'{path} ({match.name}) #{match.references}'
+        else:
+            string = f'{path} #{match.references}'
+        if match.source:
+            string += f' ← {self.explain(match.source)}'
+        return string
+
+    def list_records(self,
+                     first_party_only: bool = False,
+                     end_chain_only: bool = False,
+                     no_dupplicates: bool = False,
+                     rules_only: bool = False,
+                     hostnames_only: bool = False,
+                     explain: bool = False,
+                     ) -> typing.Iterable[str]:
+
+        def export_cb(path: Path, match: Match
+                      ) -> typing.Iterable[str]:
+            if first_party_only and not match.first_party:
+                return
+            if end_chain_only and match.references > 0:
+                return
+            if no_dupplicates and match.dupplicate:
+                return
+            if rules_only and match.level > 1:
+                return
+            if hostnames_only and not isinstance(path, HostnamePath):
+                return
+
+            if explain:
+                yield self.explain(path)
+            else:
+                yield str(path)
+
+        yield from self.exec_each(export_cb)
+
+    def count_records(self,
+                      first_party_only: bool = False,
+                      end_chain_only: bool = False,
+                      no_dupplicates: bool = False,
+                      rules_only: bool = False,
+                      hostnames_only: bool = False,
+                      ) -> str:
+        memo: typing.Dict[str, int] = dict()
+
+        def count_records_cb(path: Path, match: Match) -> None:
+            if first_party_only and not match.first_party:
+                return
+            if end_chain_only and match.references > 0:
+                return
+            if no_dupplicates and match.dupplicate:
+                return
+            if rules_only and match.level > 1:
+                return
+            if hostnames_only and not isinstance(path, HostnamePath):
+                return
+
+            try:
+                memo[path.__class__.__name__] += 1
+            except KeyError:
+                memo[path.__class__.__name__] = 1
+
+        for _ in self.exec_each(count_records_cb):
+            pass
+
+        split: typing.List[str] = list()
+        for key, value in sorted(memo.items(), key=lambda s: s[0]):
+            split.append(f'{key[:-4].lower()}s: {value}')
+        return ', '.join(split)
+
+    def get_domain(self, domain_str: str) -> typing.Iterable[DomainPath]:
+        self.enter_step('get_domain_pack')
+        domain = self.pack_domain(domain_str)
+        self.enter_step('get_domain_brws')
+        dic = self.domtree
+        depth = 0
+        for part in domain.parts:
+            if dic.match_zone.active():
+                self.enter_step('get_domain_yield')
+                yield ZonePath(domain.parts[:depth])
+            self.enter_step('get_domain_brws')
+            if part not in dic.children:
+                return
+            dic = dic.children[part]
+            depth += 1
+        if dic.match_zone.active():
+            self.enter_step('get_domain_yield')
+            yield ZonePath(domain.parts)
+        if dic.match_hostname.active():
+            self.enter_step('get_domain_yield')
+            yield HostnamePath(domain.parts)
+
+    def get_ip4(self, ip4_str: str) -> typing.Iterable[Path]:
+        self.enter_step('get_ip4_pack')
+        ip4 = self.pack_ip4address(ip4_str)
+        self.enter_step('get_ip4_cache')
+        if not self.ip4cache[ip4.value >> self.ip4cache_shift]:
+            return
+        self.enter_step('get_ip4_brws')
+        dic = self.ip4tree
+        for i in range(31, 31-ip4.prefixlen, -1):
+            bit = (ip4.value >> i) & 0b1
+            if dic.active():
+                self.enter_step('get_ip4_yield')
+                yield Ip4Path(ip4.value >> (i+1) << (i+1), 31-i)
+                self.enter_step('get_ip4_brws')
+            next_dic = dic.one if bit else dic.zero
+            if next_dic is None:
+                return
+            dic = next_dic
+        if dic.active():
+            self.enter_step('get_ip4_yield')
+            yield ip4
+
+    def _set_match(self,
+                   match: Match,
+                   updated: int,
+                   source: Path,
+                   source_match: typing.Optional[Match] = None,
+                   dupplicate: bool = False,
+                   ) -> None:
+        # source_match is in parameters because most of the time
+        # its parent function needs it too,
+        # so it can pass it to save a traversal
+        source_match = source_match or self.get_match(source)
+        new_level = source_match.level + 1
+        if updated > match.updated or new_level < match.level \
+                or source_match.first_party > match.first_party:
+            # NOTE FP and level of matches referencing this one
+            # won't be updated until run or prune
+            if match.source:
+                old_source = self.get_match(match.source)
+                old_source.references -= 1
+            match.updated = updated
+            match.level = new_level
+            match.first_party = source_match.first_party
+            match.source = source
+            source_match.references += 1
+            match.dupplicate = dupplicate
+
+    def _set_domain(self,
+                    hostname: bool,
+                    domain_str: str,
+                    updated: int,
+                    source: Path) -> None:
+        self.enter_step('set_domain_val')
+        if not Database.validate_domain(domain_str):
+            raise ValueError(f"Invalid domain: {domain_str}")
+        self.enter_step('set_domain_pack')
+        domain = self.pack_domain(domain_str)
+        self.enter_step('set_domain_fp')
+        source_match = self.get_match(source)
+        is_first_party = source_match.first_party
+        self.enter_step('set_domain_brws')
+        dic = self.domtree
+        dupplicate = False
+        for part in domain.parts:
+            if part not in dic.children:
+                dic.children[part] = DomainTreeNode()
+            dic = dic.children[part]
+            if dic.match_zone.active(is_first_party):
+                dupplicate = True
+        if hostname:
+            match = dic.match_hostname
+        else:
+            match = dic.match_zone
+        self._set_match(
+            match,
+            updated,
+            source,
+            source_match=source_match,
+            dupplicate=dupplicate,
+        )
+
+    def set_hostname(self,
+                     *args: typing.Any, **kwargs: typing.Any
+                     ) -> None:
+        self._set_domain(True, *args, **kwargs)
+
+    def set_zone(self,
+                 *args: typing.Any, **kwargs: typing.Any
+                 ) -> None:
+        self._set_domain(False, *args, **kwargs)
+
+    def set_asn(self,
+                asn_str: str,
+                updated: int,
+                source: Path) -> None:
+        self.enter_step('set_asn')
+        path = self.pack_asn(asn_str)
+        if path.asn in self.asns:
+            match = self.asns[path.asn]
+        else:
+            match = AsnNode()
+            self.asns[path.asn] = match
+        self._set_match(
+            match,
+            updated,
+            source,
+        )
+
+    def _set_ip4(self,
+                 ip4: Ip4Path,
+                 updated: int,
+                 source: Path) -> None:
+        self.enter_step('set_ip4_fp')
+        source_match = self.get_match(source)
+        is_first_party = source_match.first_party
+        self.enter_step('set_ip4_brws')
+        dic = self.ip4tree
+        dupplicate = False
+        for i in range(31, 31-ip4.prefixlen, -1):
+            bit = (ip4.value >> i) & 0b1
+            next_dic = dic.one if bit else dic.zero
+            if next_dic is None:
+                next_dic = IpTreeNode()
+                if bit:
+                    dic.one = next_dic
+                else:
+                    dic.zero = next_dic
+            dic = next_dic
+            if dic.active(is_first_party):
+                dupplicate = True
+        self._set_match(
+            dic,
+            updated,
+            source,
+            source_match=source_match,
+            dupplicate=dupplicate,
+        )
+        self._set_ip4cache(ip4, dic)
+
+    def set_ip4address(self,
+                       ip4address_str: str,
+                       *args: typing.Any, **kwargs: typing.Any
+                       ) -> None:
+        self.enter_step('set_ip4add_val')
+        if not Database.validate_ip4address(ip4address_str):
+            raise ValueError(f"Invalid ip4address: {ip4address_str}")
+        self.enter_step('set_ip4add_pack')
+        ip4 = self.pack_ip4address(ip4address_str)
+        self._set_ip4(ip4, *args, **kwargs)
+
+    def set_ip4network(self,
+                       ip4network_str: str,
+                       *args: typing.Any, **kwargs: typing.Any
+                       ) -> None:
+        self.enter_step('set_ip4net_val')
+        if not Database.validate_ip4network(ip4network_str):
+            raise ValueError(f"Invalid ip4network: {ip4network_str}")
+        self.enter_step('set_ip4net_pack')
+        ip4 = self.pack_ip4network(ip4network_str)
+        self._set_ip4(ip4, *args, **kwargs)
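
Note on `get_ip4` / `_set_ip4`: both walk a binary trie keyed on the bits of the IPv4 address, most significant bit first. The following is a minimal standalone sketch of that technique, not the code from `database.py`: the `Node`, `insert` and `lookup` names are hypothetical, and the `active` flag stands in for the real match metadata (`updated`, `level`, `first_party`, and so on).

```python
from typing import Iterator, Optional, Tuple


class Node:
    """One trie node; `active` marks the end of a stored prefix."""

    def __init__(self) -> None:
        self.zero: Optional["Node"] = None
        self.one: Optional["Node"] = None
        self.active = False


def insert(root: Node, value: int, prefixlen: int) -> None:
    """Store the network made of the top `prefixlen` bits of `value`."""
    node = root
    for i in range(31, 31 - prefixlen, -1):
        bit = (value >> i) & 0b1
        child = node.one if bit else node.zero
        if child is None:
            # Create the missing branch on the way down
            child = Node()
            if bit:
                node.one = child
            else:
                node.zero = child
        node = child
    node.active = True


def lookup(root: Node, value: int) -> Iterator[Tuple[int, int]]:
    """Yield (network, prefixlen) for every stored prefix covering `value`."""
    node = root
    for i in range(31, -1, -1):
        if node.active:
            # Clear the bits not consumed yet to get the network address
            yield (value >> (i + 1) << (i + 1), 31 - i)
        child = node.one if (value >> i) & 0b1 else node.zero
        if child is None:
            return
        node = child
    if node.active:
        # All 32 bits consumed: an exact address match
        yield (value, 32)
```

With `10.0.0.0/8` and `10.1.0.0/16` inserted, looking up `10.1.2.3` yields both networks, shortest prefix first, which is why a single address can produce several sources in `feed_dns.py`.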
--- a/db.py
+++ b/db.py
@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+
+import argparse
+import database
+import time
+import os
+
+if __name__ == '__main__':
+
+    # Parsing arguments
+    parser = argparse.ArgumentParser(
+        description="Database operations")
+    parser.add_argument(
+        '-i', '--initialize', action='store_true',
+        help="Reconstruct the whole database")
+    parser.add_argument(
+        '-p', '--prune', action='store_true',
+        help="Remove old entries from database")
+    parser.add_argument(
+        '-b', '--prune-base', action='store_true',
+        help="With --prune, only prune base rules "
+        "(the ones added by ./feed_rules.py)")
+    parser.add_argument(
+        '-s', '--prune-before', type=int,
+        default=(int(time.time()) - 60*60*24*31*6),
+        help="With --prune, only rules updated before "
+        "this UNIX timestamp will be deleted")
+    parser.add_argument(
+        '-r', '--references', action='store_true',
+        help="DEBUG: Update the reference count")
+    args = parser.parse_args()
+
+    if not args.initialize:
+        DB = database.Database()
+    else:
+        if os.path.isfile(database.Database.PATH):
+            os.unlink(database.Database.PATH)
+        DB = database.Database()
+
+    DB.enter_step('main')
+    if args.prune:
+        DB.prune(before=args.prune_before, base_only=args.prune_base)
+    if args.references:
+        DB.update_references()
+
+    DB.save()
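
Note on `--prune-before`: the default cutoff is "now minus six 31-day months". The sketch below only reproduces that arithmetic; `default_prune_before` and `should_prune` are hypothetical helpers, and the comparison in `should_prune` is an assumption based on the help text ("rules updated before this UNIX timestamp will be deleted"), not on the body of `Database.prune`.

```python
import time
from typing import Optional

# Six months of 31 days each, the same constant as the --prune-before default
PRUNE_AGE_SECONDS = 60 * 60 * 24 * 31 * 6


def default_prune_before(now: Optional[int] = None) -> int:
    """Return the UNIX timestamp before which rules are considered stale."""
    if now is None:
        now = int(time.time())
    return now - PRUNE_AGE_SECONDS


def should_prune(updated: int, before: int) -> bool:
    """Assumed predicate: a rule is stale if last updated before the cutoff."""
    return updated < before
```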
--- a/dist/README.md
+++ b/dist/README.md
@ -0,0 +1,74 @@
+# Geoffrey Frogeye's block list of first-party trackers
+
+## What's a first-party tracker?
+
+A tracker is a script put on many websites to gather information about the visitor.
+They can be used for multiple purposes: statistics, risk management, marketing, ad serving…
+In any case, they are a threat to Internet users' privacy and many may want to block them.
+
+Traditionally, trackers are served from a third-party.
+For example, `website1.com` and `website2.com` both load their tracking script from `https://trackercompany.com/trackerscript.js`.
+In order to block those, one can simply block the hostname `trackercompany.com`, which is what most ad blockers do.
+
+However, to circumvent this block, tracker companies made the websites using them load trackers from `somestring.website1.com`.
+The latter is a DNS redirection to `website1.trackercompany.com`, directly to an IP address belonging to the tracking company.
+Those are called first-party trackers.
+
+In order to block those trackers, ad blockers would need to block every subdomain pointing to anything under `trackercompany.com` or to their network.
+Unfortunately, most don't support those blocking methods as they are not DNS-aware, i.e. they only see `somestring.website1.com`.
+
+This list is an inventory of every `somestring.website1.com` found, so that ad blockers that are not DNS-aware can still block first-party trackers.
+
+## List variants
+
+### First-party trackers (recommended)
+
+- Hosts file: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
+- Raw list: <https://hostfiles.frogeye.fr/firstparty-trackers.txt>
+
+This list contains every hostname redirecting to [a hand-picked list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/rules/first-party.list).
+It should be safe from false-positives.
+Don't be afraid of the size of the list, as this is due to the nature of first-party trackers: a single tracker generates at least one hostname per client (typically two).
+
+### First-party only trackers
+
+- Hosts file: <https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt>
+- Raw list: <https://hostfiles.frogeye.fr/firstparty-only-trackers.txt>
+
+This is the same list as above, albeit not containing the hostnames under the tracking company domains.
+This reduces the size of the list, but the list then no longer protects against third-party tracking.
+Use in conjunction with other block lists.
+
+### Multi-party trackers
+
+- Hosts file: <https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt>
+- Raw list: <https://hostfiles.frogeye.fr/multiparty-trackers.txt>
+
+As first-party trackers usually evolve from third-party trackers, this list contains every hostname redirecting to trackers found in existing lists of third-party trackers (see next section).
+Since the latter were not designed with first-party trackers in mind, they are likely to contain false-positives.
+On the other hand, they might protect against first-party trackers that we're not aware of or have not yet confirmed.
+
+#### Source of third-party trackers
+
+- [EasyPrivacy](https://easylist.to/easylist/easyprivacy.txt)
+
+(Yes, there is only one for now: many existing lists cause a lot of false positives.)
+
+### Multi-party only trackers
+
+- Hosts file: <https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt>
+- Raw list: <https://hostfiles.frogeye.fr/multiparty-only-trackers.txt>
+
+This is the same list as above, albeit not containing the hostnames under the tracking company domains.
+This reduces the size of the list, but the list then no longer protects against third-party tracking.
+Use in conjunction with other block lists, especially the ones used to generate this list in the previous section.
+
+## Meta
+
+In case of false positives/negatives, or any other question contact me the way you like: <https://geoffrey.frogeye.fr>
+
+The software used to generate this list is available here: <https://git.frogeye.fr/geoffrey/eulaurarien>
+
+Some of the first-party trackers included in this list have been found by:
+- [Aeris](https://imirhil.fr/)
+- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors
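
Note on the README above: the core idea (a hostname is a first-party tracker if its DNS redirection lands inside a tracker's zone) can be sketched with a tree of domain labels, read from the TLD down. This is an illustrative sketch only; `ZoneNode`, `add_zone`, `in_zone` and `is_first_party_tracker` are hypothetical names, and the real `database.py` additionally tracks IP networks, ASNs, timestamps and reference counts.

```python
from typing import Dict, List


class ZoneNode:
    """One label of a domain name; `is_zone` marks a blocked zone."""

    def __init__(self) -> None:
        self.children: Dict[str, "ZoneNode"] = {}
        self.is_zone = False


def add_zone(root: ZoneNode, zone: str) -> None:
    """Register a zone, i.e. a domain and all of its subdomains."""
    node = root
    for part in reversed(zone.split('.')):
        node = node.children.setdefault(part, ZoneNode())
    node.is_zone = True


def in_zone(root: ZoneNode, hostname: str) -> bool:
    """True if `hostname` is a registered zone or lies under one."""
    node = root
    for part in reversed(hostname.split('.')):
        child = node.children.get(part)
        if child is None:
            return False
        if child.is_zone:
            return True
        node = child
    return False


def is_first_party_tracker(root: ZoneNode, chain: List[str]) -> bool:
    """`chain[0]` is the hostname the browser sees; the rest are CNAME targets."""
    return any(in_zone(root, target) for target in chain[1:])
```

With the zone `trackercompany.com` registered, the chain `somestring.website1.com` → `website1.trackercompany.com` is flagged, while a chain ending at an unrelated CDN is not.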
--- a/export.py
+++ b/export.py
@ -0,0 +1,64 @@
+#!/usr/bin/env python3
+
+import database
+import argparse
+import sys
+
+
+if __name__ == '__main__':
+
+    # Parsing arguments
+    parser = argparse.ArgumentParser(
+        description="Export the hostname rules stored "
+        "in the database as plain text")
+    parser.add_argument(
+        '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
+        help="Output file, one rule per line")
+    parser.add_argument(
+        '-f', '--first-party', action='store_true',
+        help="Only output rules issued from first-party sources")
+    parser.add_argument(
+        '-e', '--end-chain', action='store_true',
+        help="Only output rules that are not referenced by any other")
+    parser.add_argument(
+        '-r', '--rules', action='store_true',
+        help="Output all kinds of rules, not just hostnames")
+    parser.add_argument(
+        '-b', '--base-rules', action='store_true',
+        help="Output base rules "
+        "(the ones added by ./feed_rules.py) "
+        "(implies --rules)")
+    parser.add_argument(
+        '-d', '--no-dupplicates', action='store_true',
+        help="Do not output rules that already match a zone/network rule "
+        "(e.g. dummy.example.com when there's a zone example.com rule)")
+    parser.add_argument(
+        '-x', '--explain', action='store_true',
+        help="Show the chain of rules leading to one "
+        "(and the number of references they have)")
+    parser.add_argument(
+        '-c', '--count', action='store_true',
+        help="Show the number of rules per type instead of listing them")
+    args = parser.parse_args()
+
+    DB = database.Database()
+
+    if args.count:
+        assert not args.explain
+        print(DB.count_records(
+            first_party_only=args.first_party,
+            end_chain_only=args.end_chain,
+            no_dupplicates=args.no_dupplicates,
+            rules_only=args.base_rules,
+            hostnames_only=not (args.rules or args.base_rules),
+        ))
+    else:
+        for domain in DB.list_records(
+            first_party_only=args.first_party,
+            end_chain_only=args.end_chain,
+            no_dupplicates=args.no_dupplicates,
+            rules_only=args.base_rules,
+            hostnames_only=not (args.rules or args.base_rules),
+            explain=args.explain,
+        ):
+            print(domain, file=args.output)
--- a/export_lists.sh
+++ b/export_lists.sh
@ -0,0 +1,98 @@
+#!/usr/bin/env bash
+
+function log() {
+    echo -e "\033[33m$*\033[0m"
+}
+
+log "Calculating statistics…"
+gen_date=$(date -Isec)
+gen_software=$(git describe --tags)
+number_websites=$(wc -l < temp/all_websites.list)
+number_subdomains=$(wc -l < temp/all_subdomains.list)
+number_dns=$(grep '^$' temp/all_resolved.txt | wc -l)
+
+for partyness in {first,multi}
+do
+    if [ $partyness = "first" ]
+    then
+        partyness_flags="--first-party"
+    else
+        partyness_flags=""
+    fi
+
+    echo "Statistics for ${partyness}-party trackers"
+    echo "Input rules: $(./export.py --count --base-rules $partyness_flags)"
+    echo "Subsequent rules: $(./export.py --count --rules $partyness_flags)"
+    echo "Subsequent rules (no dupplicate): $(./export.py --count --rules --no-dupplicates $partyness_flags)"
+    echo "Output hostnames: $(./export.py --count $partyness_flags)"
+    echo "Output hostnames (no dupplicate): $(./export.py --count --no-dupplicates $partyness_flags)"
+    echo "Output hostnames (end-chain only): $(./export.py --count --end-chain $partyness_flags)"
+    echo "Output hostnames (no dupplicate, end-chain only): $(./export.py --count --no-dupplicates --end-chain $partyness_flags)"
+    echo
+
+    for trackerness in {trackers,only-trackers}
+    do
+        if [ $trackerness = "trackers" ]
+        then
+            trackerness_flags=""
+        else
+            trackerness_flags="--end-chain --no-dupplicates"
+        fi
+        file_list="dist/${partyness}party-${trackerness}.txt"
+        file_host="dist/${partyness}party-${trackerness}-hosts.txt"
+
+        log "Generating lists for variant ${partyness}-party ${trackerness}…"
+
+        # The actual export happens here
+        ./export.py $partyness_flags $trackerness_flags > $file_list
+        # Sometimes a bit heavy to have the DB open and sort the output
+        # so this is done in two steps
+        sort -u $file_list -o $file_list
+
+        rules_input=$(./export.py --count --base-rules $partyness_flags)
+        rules_found=$(./export.py --count --rules $partyness_flags)
+        rules_output=$(./export.py --count $partyness_flags $trackerness_flags)
+
+        function link() { # link partyness, link trackerness
+            url="https://hostfiles.frogeye.fr/${1}party-${2}-hosts.txt"
+            if [ "$1" = "$partyness" ] && [ "$2" = "$trackerness" ]
+            then
+                url="$url (this one)"
+            fi
+            echo $url
+        }
+
+        (
+            echo "# First-party trackers host list"
+            echo "# Variant: ${partyness}-party ${trackerness}"
+            echo "#"
+            echo "# About first-party trackers: TODO"
+            echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
+            echo "#"
+            echo "# In case of false positives/negatives, or any other question,"
+            echo "# contact me the way you like: https://geoffrey.frogeye.fr"
+            echo "#"
+            echo "# Latest versions and variants:"
+            echo "# - First-party trackers  : $(link first trackers)"
+            echo "# - … excluding redirected: $(link first only-trackers)"
+            echo "# - First and third party : $(link multi trackers)"
+            echo "# - … excluding redirected: $(link multi only-trackers)"
+            echo '# (variants informations: TODO)'
+            echo '# (you can remove `-hosts` to get the raw list)'
+            echo "#"
+            echo "# Generation date: $gen_date"
+            echo "# Generation software: eulaurarien $gen_software"
+            echo "# Number of source websites: $number_websites"
+            echo "# Number of source subdomains: $number_subdomains"
+            echo "# Number of source DNS records: ~2E9 + $number_dns"
+            echo "#"
+            echo "# Input rules: $rules_input"
+            echo "# Subsequent rules: $rules_found"
+            echo "# Output rules: $rules_output"
+            echo "#"
+            echo
+            sed 's|^|0.0.0.0 |' "$file_list"
+        ) > "$file_host"
+
+    done
+done
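
Note on the hosts-file generation: the `sort -u` and `sed 's|^|0.0.0.0 |'` steps above amount to deduplicating, sorting and prefixing each hostname. A rough Python equivalent, assuming a hypothetical `to_hosts_lines` helper, could look like this. Python's `sorted` orders by code point, which may differ from `sort -u` under some locales.

```python
from typing import Iterable, Iterator


def to_hosts_lines(hostnames: Iterable[str]) -> Iterator[str]:
    """Deduplicate, sort, and prefix each hostname with 0.0.0.0."""
    # Strip line endings and drop empty lines before deduplicating
    cleaned = {h.strip() for h in hostnames if h.strip()}
    for hostname in sorted(cleaned):
        yield f"0.0.0.0 {hostname}"
```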
--- a/feed_asn.py
+++ b/feed_asn.py
@ -0,0 +1,71 @@
+#!/usr/bin/env python3
+
+import database
+import argparse
+import requests
+import typing
+import ipaddress
+import logging
+import time
+
+IPNetwork = typing.Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
+
+
+def get_ranges(asn: str) -> typing.Iterable[str]:
+    req = requests.get(
+        'https://stat.ripe.net/data/as-routing-consistency/data.json',
+        params={'resource': asn}
+    )
+    data = req.json()
+    for pref in data['data']['prefixes']:
+        yield pref['prefix']
+
+
+def get_name(asn: str) -> str:
+    req = requests.get(
+        'https://stat.ripe.net/data/as-overview/data.json',
+        params={'resource': asn}
+    )
+    data = req.json()
+    return data['data']['holder']
+
+
+if __name__ == '__main__':
+
+    log = logging.getLogger('feed_asn')
+
+    # Parsing arguments
+    parser = argparse.ArgumentParser(
+        description="Add the IP ranges of every AS stored in the database")
+    args = parser.parse_args()
+
+    DB = database.Database()
+
+    def add_ranges(path: database.Path,
+                   match: database.Match,
+                   ) -> None:
+        assert isinstance(path, database.AsnPath)
+        assert isinstance(match, database.AsnNode)
+        asn_str = database.Database.unpack_asn(path)
+        DB.enter_step('asn_get_name')
+        name = get_name(asn_str)
+        match.name = name
+        DB.enter_step('asn_get_ranges')
+        for prefix in get_ranges(asn_str):
+            parsed_prefix: IPNetwork = ipaddress.ip_network(prefix)
+            if parsed_prefix.version == 4:
+                DB.set_ip4network(
+                    prefix,
+                    source=path,
+                    updated=int(time.time())
+                )
+                log.info('Added %s from %s (%s)', prefix, path, name)
+            elif parsed_prefix.version == 6:
+                log.warning('Unimplemented prefix version: %s', prefix)
+            else:
+                log.error('Unknown prefix version: %s', prefix)
+
+    for _ in DB.exec_each_asn(add_ranges):
+        pass
+
+    DB.save()
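
Note on `add_ranges`: the IPv4/IPv6 branch relies on the standard library's `ipaddress.ip_network`, which returns an `IPv4Network` or `IPv6Network` carrying a `version` attribute. A small offline sketch of the same dispatch, with a hypothetical `split_by_version` helper and documentation-range prefixes instead of RIPE data:

```python
import ipaddress
from typing import Iterable, List, Tuple


def split_by_version(prefixes: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate CIDR prefixes into (IPv4, IPv6) lists."""
    ip4: List[str] = []
    ip6: List[str] = []
    for prefix in prefixes:
        # Raises ValueError on malformed input or if host bits are set
        network = ipaddress.ip_network(prefix)
        if network.version == 4:
            ip4.append(prefix)
        else:
            ip6.append(prefix)
    return ip4, ip6
```

Only the IPv4 list would be passed on to `set_ip4network` in this commit, since IPv6 prefixes are logged as unimplemented.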
--- a/feed_dns.py
+++ b/feed_dns.py
@ -0,0 +1,227 @@
+#!/usr/bin/env python3
+
+import argparse
+import database
+import logging
+import sys
+import typing
+import multiprocessing
+import time
+
+Record = typing.Tuple[typing.Callable, typing.Callable, int, str, str]
+
+# select, write
+FUNCTION_MAP: typing.Any = {
+    'a': (
+        database.Database.get_ip4,
+        database.Database.set_hostname,
+    ),
+    'cname': (
+        database.Database.get_domain,
+        database.Database.set_hostname,
+    ),
+    'ptr': (
+        database.Database.get_domain,
+        database.Database.set_ip4address,
+    ),
+}
+
+
+class Writer(multiprocessing.Process):
+    def __init__(self,
+                 recs_queue: multiprocessing.Queue,
+                 autosave_interval: int = 0,
+                 ip4_cache: int = 0,
+                 ):
+        super(Writer, self).__init__()
+        self.log = logging.getLogger('wr')
+        self.recs_queue = recs_queue
+        self.autosave_interval = autosave_interval
+        self.ip4_cache = ip4_cache
+
+    def run(self) -> None:
+        self.db = database.Database()
+        self.db.log = logging.getLogger('wr')
+        self.db.fill_ip4cache(max_size=self.ip4_cache)
+        if self.autosave_interval > 0:
+            next_save = time.time() + self.autosave_interval
+        else:
+            next_save = 0
+
+        self.db.enter_step('block_wait')
+        block: typing.List[Record]
+        for block in iter(self.recs_queue.get, None):
+
+            record: Record
+            for record in block:
+
+                select, write, updated, name, value = record
+                self.db.enter_step('feed_switch')
+
+                try:
+                    for source in select(self.db, value):
+                        write(self.db, name, updated, source=source)
+                except ValueError:
+                    self.log.exception("Cannot execute: %s", record)
+
+            if next_save > 0 and time.time() > next_save:
+                self.log.info("Saving database...")
+                self.db.save()
+                self.log.info("Done!")
+                next_save = time.time() + self.autosave_interval
+
+            self.db.enter_step('block_wait')
+
+        self.db.enter_step('end')
+        self.db.save()
+
+
+class Parser():
+    def __init__(self,
+                 buf: typing.Any,
+                 recs_queue: multiprocessing.Queue,
+                 block_size: int,
+                 ):
+        super(Parser, self).__init__()
+        self.buf = buf
+        self.log = logging.getLogger('pr')
+        self.recs_queue = recs_queue
+        self.block: typing.List[Record] = list()
+        self.block_size = block_size
+        self.prof = database.Profiler()
+        self.prof.log = logging.getLogger('pr')
+
+    def register(self, record: Record) -> None:
+        self.prof.enter_step('register')
+        self.block.append(record)
+        if len(self.block) >= self.block_size:
+            self.prof.enter_step('put_block')
+            self.recs_queue.put(self.block)
+            self.block = list()
+
+    def run(self) -> None:
+        self.consume()
+        self.recs_queue.put(self.block)
+        self.prof.profile()
+
+    def consume(self) -> None:
+        raise NotImplementedError
+
+
+class Rapid7Parser(Parser):
+    def consume(self) -> None:
+        data = dict()
+        for line in self.buf:
+            self.prof.enter_step('parse_rapid7')
+            split = line.split('"')
+
+            try:
+                for k in range(1, 14, 4):
+                    key = split[k]
+                    val = split[k+2]
+                    data[key] = val
+
+                select, writer = FUNCTION_MAP[data['type']]
+                record = (
+                    select,
+                    writer,
+                    int(data['timestamp']),
+                    data['name'],
+                    data['value']
+                )
+            except IndexError:
+                self.log.exception("Cannot parse: %s", line)
+                # Skip malformed lines instead of registering a stale
+                # (or not yet defined) record
+                continue
+            self.register(record)
+
+
+class MassDnsParser(Parser):
+    # massdns --output Snrql
+    # --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
+    TYPES = {
+        'A': (FUNCTION_MAP['a'][0], FUNCTION_MAP['a'][1], -1, None),
+        # 'AAAA': (FUNCTION_MAP['aaaa'][0], FUNCTION_MAP['aaaa'][1], -1, None),
+        'CNAME': (FUNCTION_MAP['cname'][0], FUNCTION_MAP['cname'][1], -1, -1),
+    }
+
+    def consume(self) -> None:
+        self.prof.enter_step('parse_massdns')
+        timestamp = 0
+        header = True
+        for line in self.buf:
+            line = line[:-1]
+            if not line:
+                header = True
+                continue
+
+            split = line.split(' ')
+            try:
+                if header:
+                    timestamp = int(split[1])
+                    header = False
+                else:
+                    select, write, name_offset, value_offset = \
+                        MassDnsParser.TYPES[split[1]]
+                    record = (
+                        select,
+                        write,
+                        timestamp,
+                        split[0][:name_offset],
+                        split[2][:value_offset],
+                    )
+                    self.register(record)
+                    self.prof.enter_step('parse_massdns')
+            except KeyError:
+                continue
+
+
+PARSERS = {
+    'rapid7': Rapid7Parser,
+    'massdns': MassDnsParser,
+}
+
+if __name__ == '__main__':
+
+    # Parsing arguments
+    log = logging.getLogger('feed_dns')
+    args_parser = argparse.ArgumentParser(
+        description="Read DNS records and import "
+        "tracking-relevant data into the database")
+    args_parser.add_argument(
+        'parser',
+        choices=PARSERS.keys(),
+        help="Input format")
+    args_parser.add_argument(
+        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
+        help="Input file")
+    args_parser.add_argument(
+        '-b', '--block-size', type=int, default=1024,
+        help="Performance tuning value")
+    args_parser.add_argument(
+        '-q', '--queue-size', type=int, default=128,
+        help="Performance tuning value")
+    args_parser.add_argument(
+        '-a', '--autosave-interval', type=int, default=900,
+        help="Interval, in seconds, at which the database is saved. "
+        "0 to disable.")
+    args_parser.add_argument(
+        '-4', '--ip4-cache', type=int, default=0,
+        help="RAM cache for faster IPv4 lookup. "
+        "Maximum useful value: 512 MiB (536870912). "
+        "Warning: Depending on the rules, this might already "
+        "be a memory-heavy process, even without the cache.")
+    args = args_parser.parse_args()
+
+    recs_queue: multiprocessing.Queue = multiprocessing.Queue(
+        maxsize=args.queue_size)
+
+    writer = Writer(recs_queue,
+                    autosave_interval=args.autosave_interval,
+                    ip4_cache=args.ip4_cache
+                    )
+    writer.start()
+
+    parser = PARSERS[args.parser](args.input, recs_queue, args.block_size)
+    parser.run()
+
+    recs_queue.put(None)
+    writer.join()
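
Note on the `Parser` / `Writer` split: records are grouped into blocks before being queued, and a `None` sentinel tells the consumer to stop (`iter(queue.get, None)`). The sketch below shows that pattern in isolation. It is an assumption-laden simplification: it uses `threading` and `queue.Queue` rather than `multiprocessing` so it runs anywhere, and `produce`, `consume` and `run` are hypothetical names.

```python
import queue
import threading
from typing import Iterable, List


def produce(records: Iterable[str], out: queue.Queue, block_size: int) -> None:
    """Group records into blocks, then signal the end with None."""
    block: List[str] = []
    for record in records:
        block.append(record)
        if len(block) >= block_size:
            out.put(block)
            block = []
    out.put(block)  # Flush the last, possibly partial, block
    out.put(None)   # Sentinel: nothing more to read


def consume(inp: queue.Queue, sink: List[str]) -> None:
    """Process blocks until the sentinel is received."""
    for block in iter(inp.get, None):
        sink.extend(block)


def run(records: Iterable[str], block_size: int = 2,
        queue_size: int = 4) -> List[str]:
    """Wire a producer and a consumer together through a bounded queue."""
    q: queue.Queue = queue.Queue(maxsize=queue_size)
    sink: List[str] = []
    worker = threading.Thread(target=consume, args=(q, sink))
    worker.start()
    produce(records, q, block_size)
    worker.join()
    return sink
```

Batching keeps the per-item queue overhead low, and the bounded queue makes the producer wait when the consumer falls behind, which is presumably what `--block-size` and `--queue-size` tune.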
--- a/feed_rules.py
+++ b/feed_rules.py
@ -0,0 +1,54 @@
+#!/usr/bin/env python3
+
+import database
+import argparse
+import sys
+import time
+
+FUNCTION_MAP = {
+    'zone': database.Database.set_zone,
+    'hostname': database.Database.set_hostname,
+    'asn': database.Database.set_asn,
+    'ip4network': database.Database.set_ip4network,
+    'ip4address': database.Database.set_ip4address,
+}
+
+if __name__ == '__main__':
+
+    # Parsing arguments
+    parser = argparse.ArgumentParser(
+        description="Import base rules to the database")
+    parser.add_argument(
+        'type',
+        choices=FUNCTION_MAP.keys(),
+        help="Type of the rules given as input")
+    parser.add_argument(
+        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
+        help="File with one rule per line")
+    parser.add_argument(
+        '-f', '--first-party', action='store_true',
+        help="The input only comes from verified first-party sources")
+    args = parser.parse_args()
+
+    DB = database.Database()
+
+    fun = FUNCTION_MAP[args.type]
+
+    source: database.RulePath
+    if args.first_party:
+        source = database.RuleFirstPath()
+    else:
+        source = database.RuleMultiPath()
+
+    for rule in args.input:
+        rule = rule.strip()
+        try:
+            fun(DB,
+                rule,
+                source=source,
+                updated=int(time.time()),
+                )
+        except ValueError:
+            DB.log.error(f"Could not add rule: {rule}")
+
+    DB.save()
--- a/fetch_resources.sh
+++ b/fetch_resources.sh
@ -17,26 +17,13 @@ function dl() {
 log "Retrieving rules…"
 rm -f rules*/*.cache.*
 dl https://easylist.to/easylist/easyprivacy.txt rules_adblock/easyprivacy.cache.txt
-# From firebog.net Tracking & Telemetry Lists
-dl https://v.firebog.net/hosts/Prigent-Ads.txt rules/prigent-ads.cache.list
-# dl https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-blocklist.txt rules/notrack-blocklist.cache.list
-# False positives: https://github.com/WaLLy3K/wally3k.github.io/issues/73 -> 69.media.tumblr.com chicdn.net
-dl https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.2o7Net/hosts rules_hosts/add2o7.cache.txt
-dl https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt rules_hosts/spy.cache.txt
-# dl https://raw.githubusercontent.com/Kees1958/WS3_annual_most_used_survey_blocklist/master/w3tech_hostfile.txt rules/w3tech.cache.list
-# False positives: agreements.apple.com -> edgekey.net
-# dl https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt rules_hosts/ads-and-tracking-extended.cache.txt # Lots of false-positives
-# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/android-tracking.txt rules_hosts/android-tracking.cache.txt
-# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/SmartTV.txt rules_hosts/smart-tv.cache.txt
-# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/AmazonFireTV.txt rules_hosts/amazon-fire-tv.cache.txt
+
+log "Retrieving TLD list…"
+dl http://data.iana.org/TLD/tlds-alpha-by-domain.txt temp/all_tld.temp.list
+grep -v '^#' temp/all_tld.temp.list | awk '{print tolower($0)}' > temp/all_tld.list

 log "Retrieving nameservers…"
-rm -f nameservers
-touch nameservers
-[ -f nameservers.head ] && cat nameservers.head >> nameservers
-dl https://public-dns.info/nameservers.txt nameservers.temp
-sort -R nameservers.temp >> nameservers
-rm nameservers.temp
+dl https://public-dns.info/nameservers.txt nameservers/public-dns.cache.list

 log "Retrieving top subdomains…"
 dl http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip top-1m.csv.zip
@ -51,4 +38,3 @@ then
 else
    mv temp/cisco-umbrella_popularity.fresh.list subdomains/cisco-umbrella_popularity.cache.list
 fi
-dl https://www.orwell1984.today/cname/eulerian.net.txt subdomains/orwell-eulerian-cname-list.cache.list
--- a/filter_subdomains.py
+++ b/filter_subdomains.py
@ -1,160 +0,0 @@
-#!/usr/bin/env python3
-# pylint: disable=C0103
-
-"""
-From a list of subdomains, output only
-the ones resolving to a first-party tracker.
-"""
-
-import argparse
-import sys
-import progressbar
-import csv
-import typing
-import ipaddress
-
-# DomainRule = typing.Union[bool, typing.Dict[str, 'DomainRule']]
-DomainRule = typing.Union[bool, typing.Dict]
-# IpRule = typing.Union[bool, typing.Dict[int, 'DomainRule']]
-IpRule = typing.Union[bool, typing.Dict]
-
-RULES_DICT: DomainRule = dict()
-RULES_IP_DICT: IpRule = dict()
-
-
-def get_bits(address: ipaddress.IPv4Address) -> typing.Iterator[int]:
-    for char in address.packed:
-        for i in range(7, -1, -1):
-            yield (char >> i) & 0b1
-
-
-def subdomain_matching(subdomain: str) -> bool:
-    parts = subdomain.split('.')
-    parts.reverse()
-    dic = RULES_DICT
-    for part in parts:
-        if isinstance(dic, bool) or part not in dic:
-            break
-        dic = dic[part]
-    if isinstance(dic, bool):
-        return dic
-    return False
-
-
-def ip_matching(ip_str: str) -> bool:
-    ip = ipaddress.ip_address(ip_str)
-    dic = RULES_IP_DICT
-    i = 0
-    for bit in get_bits(ip):
-        i += 1
-        if isinstance(dic, bool) or bit not in dic:
-            break
-        dic = dic[bit]
-    if isinstance(dic, bool):
-        return dic
-    return False
-
-
-def get_matching(chain: typing.List[str], no_explicit: bool = False
-                 ) -> typing.Iterable[str]:
-    if len(chain) <= 1:
-        return
-    initial = chain[0]
-    cname_destinations = chain[1:-1]
-    a_destination = chain[-1]
-    initial_matching = subdomain_matching(initial)
-    if no_explicit and initial_matching:
-        return
-    cname_matching = any(map(subdomain_matching, cname_destinations))
-    if cname_matching or initial_matching or ip_matching(a_destination):
-        yield initial
-
-
-def register_rule(subdomain: str) -> None:
-    # Make a tree with domain parts
-    parts = subdomain.split('.')
-    parts.reverse()
-    dic = RULES_DICT
-    last_part = len(parts) - 1
-    for p, part in enumerate(parts):
-        if isinstance(dic, bool):
-            return
-        if p == last_part:
-            dic[part] = True
-        else:
-            dic.setdefault(part, dict())
-            dic = dic[part]
-
-
-def register_rule_ip(network: str) -> None:
-    net = ipaddress.ip_network(network)
-    ip = net.network_address
-    dic = RULES_IP_DICT
-    last_bit = net.prefixlen - 1
-    for b, bit in enumerate(get_bits(ip)):
-        if isinstance(dic, bool):
-            return
-        if b == last_bit:
-            dic[bit] = True
-        else:
-            dic.setdefault(bit, dict())
-            dic = dic[bit]
-
-
-if __name__ == '__main__':
-
-    # Parsing arguments
-    parser = argparse.ArgumentParser(
-        description="Filter first-party trackers from a list of subdomains")
-    parser.add_argument(
-        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
-        help="Input file with DNS chains")
-    parser.add_argument(
-        '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
-        help="Output file with one tracking subdomain per line")
-    parser.add_argument(
-        '-n', '--no-explicit', action='store_true',
-        help="Don't output domains already blocked with rules without CNAME")
-    parser.add_argument(
-        '-r', '--rules', type=argparse.FileType('r'),
-        help="List of domains to block (with their subdomains)")
-    parser.add_argument(
-        '-p', '--rules-ip', type=argparse.FileType('r'),
-        help="List of IP ranges to block")
-    args = parser.parse_args()
-
-    # Progress bar
-    widgets = [
-        progressbar.Percentage(),
-        ' ', progressbar.SimpleProgress(),
-        ' ', progressbar.Bar(),
-        ' ', progressbar.Timer(),
-        ' ', progressbar.AdaptiveTransferSpeed(unit='req'),
-        ' ', progressbar.AdaptiveETA(),
-    ]
-    progress = progressbar.ProgressBar(widgets=widgets)
-
-    # Reading rules
-    if args.rules:
-        for rule in args.rules:
-            register_rule(rule.strip())
-    if args.rules_ip:
-        for rule in args.rules_ip:
-            register_rule_ip(rule.strip())
-
-    # Approximating line count
-    if args.input.seekable():
-        lines = 0
-        for line in args.input:
-            lines += 1
-        progress.max_value = lines
-        args.input.seek(0)
-
-    # Reading domains to filter
-    reader = csv.reader(args.input)
-    progress.start()
-    for chain in reader:
-        for match in get_matching(chain, no_explicit=args.no_explicit):
-            print(match, file=args.output)
-        progress.update(progress.value + 1)
-    progress.finish()
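Reviewer note: the removed `filter_subdomains.py` stored its rules in two nested-dict tries, one keyed by reversed domain labels and one keyed by address bits, where a `True` leaf means "this node and everything below it matches". The sketch below restates that idea in one place. The names `register`, `matches`, `domain_keys` and `network_keys` are illustrative and do not appear in the repository; only `get_bits` mirrors the removed code.

```python
import ipaddress
from typing import Dict, Iterable, Iterator, List, Union

Node = Union[bool, Dict]


def get_bits(address: ipaddress.IPv4Address) -> Iterator[int]:
    """Yield the 32 bits of an IPv4 address, most significant first."""
    for byte in address.packed:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def domain_keys(domain: str) -> List[str]:
    """'a.example.com' -> ['com', 'example', 'a'] (hypothetical helper)."""
    return domain.split('.')[::-1]


def network_keys(network: str) -> List[int]:
    """Keep only the prefix bits of a network (hypothetical helper)."""
    net = ipaddress.ip_network(network)
    return list(get_bits(net.network_address))[:net.prefixlen]


def register(tree: Dict, keys: List) -> None:
    """Mark the path `keys` as matching, along with everything below it."""
    node: Node = tree
    for index, key in enumerate(keys):
        if isinstance(node, bool):
            return  # A broader rule already covers this path
        if index == len(keys) - 1:
            node[key] = True
        else:
            node = node.setdefault(key, {})


def matches(tree: Dict, keys: Iterable) -> bool:
    """Walk the trie; the lookup matches only if it ends on a True leaf."""
    node: Node = tree
    for key in keys:
        if isinstance(node, bool) or key not in node:
            break
        node = node[key]
    return node if isinstance(node, bool) else False


zones: Dict = {}
register(zones, domain_keys('eulerian.net'))
networks: Dict = {}
register(networks, network_keys('109.232.192.0/21'))
```

With these rules, `matches(zones, domain_keys('abc.eulerian.net'))` is true because the walk stops on the `True` leaf stored under `net` then `eulerian`, while a lookup that runs out of known keys ends on a dict and is rejected.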
--- a/filter_subdomains.sh
+++ b/filter_subdomains.sh
@ -1,85 +0,0 @@
-#!/usr/bin/env bash
-
-function log() {
-    echo -e "\033[33m$@\033[0m"
-}
-
-if [ ! -f temp/all_resolved.csv ]
-then
-    echo "Run ./resolve_subdomains.sh first!"
-    exit 1
-fi
-
-# Gather all the rules for filtering
-log "Compiling rules…"
-cat rules_adblock/*.txt | grep -v '^!' | grep -v '^\[Adblock' | sort -u > temp/all_rules_adblock.txt
-./adblock_to_domain_list.py --input temp/all_rules_adblock.txt --output rules/from_adblock.cache.list
-cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 > rules/from_hosts.cache.list
-cat rules/*.list | grep -v '^#' | grep -v '^$' | sort -u > temp/all_rules_multi.list
-cat rules/first-party.list | grep -v '^#' | grep -v '^$' | sort -u > temp/all_rules_first.list
-cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | sort -u > temp/all_ip_rules_multi.txt
-cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | sort -u > temp/all_ip_rules_first.txt
-
-log "Filtering first-party tracking domains…"
-./filter_subdomains.py --rules temp/all_rules_first.list --rules-ip temp/all_ip_rules_first.txt --input temp/all_resolved_sorted.csv --output temp/firstparty-trackers.list
-sort -u temp/firstparty-trackers.list > dist/firstparty-trackers.txt
-
-log "Filtering first-party curated tracking domains…"
-./filter_subdomains.py --rules temp/all_rules_first.list --rules-ip temp/all_ip_rules_first.txt --input temp/all_resolved_sorted.csv --no-explicit --output temp/firstparty-only-trackers.list
-sort -u temp/firstparty-only-trackers.list > dist/firstparty-only-trackers.txt
-
-log "Filtering multi-party tracking domains…"
-./filter_subdomains.py --rules temp/all_rules_multi.list --rules-ip temp/all_ip_rules_multi.txt --input temp/all_resolved_sorted.csv --output temp/multiparty-trackers.list
-sort -u temp/multiparty-trackers.list > dist/multiparty-trackers.txt
-
-log "Filtering multi-party curated tracking domains…"
-./filter_subdomains.py --rules temp/all_rules_multi.list --rules-ip temp/all_ip_rules_multi.txt --input temp/all_resolved_sorted.csv --no-explicit --output temp/multiparty-only-trackers.list
-sort -u temp/multiparty-only-trackers.list > dist/multiparty-only-trackers.txt
-
-# Format the blocklist so it can be used as a hostlist
-function generate_hosts {
-    basename="$1"
-    description="$2"
-    description2="$3"
-
-    (
-        echo "# First-party trackers host list"
-        echo "# $description"
-        echo "# $description2"
-        echo "#"
-        echo "# About first-party trackers: https://git.frogeye.fr/geoffrey/eulaurarien#whats-a-first-party-tracker"
-        echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
-        echo "#"
-        echo "# In case of false positives/negatives, or any other question,"
-        echo "# contact me the way you like: https://geoffrey.frogeye.fr"
-        echo "#"
-        echo "# Latest version:"
-        echo "# - First-party trackers  : https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt"
-        echo "# - … excluding redirected: https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt"
-        echo "# - First and third party : https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt"
-        echo "# - … excluding redirected: https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt"
-        echo "#"
-        echo "# Generation date: $(date -Isec)"
-        echo "# Generation software: eulaurarien $(git describe --tags)"
-        echo "# Number of source websites: $(wc -l temp/all_websites.list | cut -d' ' -f1)"
-        echo "# Number of source subdomains: $(wc -l temp/all_subdomains.list | cut -d' ' -f1)"
-        echo "#"
-        echo "# Number of known first-party trackers: $(wc -l temp/all_rules_first.list | cut -d' ' -f1)"
-        echo "# Number of first-party subdomains: $(wc -l dist/firstparty-trackers.txt | cut -d' ' -f1)"
-        echo "# … excluding redirected: $(wc -l dist/firstparty-only-trackers.txt | cut -d' ' -f1)"
-        echo "#"
-        echo "# Number of known multi-party trackers: $(wc -l temp/all_rules_multi.list | cut -d' ' -f1)"
-        echo "# Number of multi-party subdomains: $(wc -l dist/multiparty-trackers.txt | cut -d' ' -f1)"
-        echo "# … excluding redirected: $(wc -l dist/multiparty-only-trackers.txt | cut -d' ' -f1)"
-        echo
-        cat "dist/$basename.txt" | while read host;
-        do
-            echo "0.0.0.0 $host"
-        done
-    ) > "dist/$basename-hosts.txt"
-}
-
-generate_hosts "firstparty-trackers" "Generated from a curated list of first-party trackers" ""
-generate_hosts "firstparty-only-trackers" "Generated from a curated list of first-party trackers" "Only contain the first chain of redirection."
-generate_hosts "multiparty-trackers" "Generated from known third-party trackers." "Also contains trackers used as third-party."
-generate_hosts "multiparty-only-trackers" "Generated from known third-party trackers." "Do not contain trackers used in third-party. Use in combination with third-party lists."
--- a/import_rapid7.sh
+++ b/import_rapid7.sh
@ -0,0 +1,26 @@
+#!/usr/bin/env bash
+
+function log() {
+    echo -e "\033[33m$@\033[0m"
+}
+
+function feed_rapid7_fdns { # dataset
+    dataset=$1
+    line=$(curl -s https://opendata.rapid7.com/sonar.fdns_v2/ | grep "href=\".\+-fdns_$dataset.json.gz\"")
+    link="https://opendata.rapid7.com$(echo "$line" | cut -d'"' -f2)"
+    log "Reading $(echo "$dataset" | awk '{print toupper($0)}') records from $link"
+    curl -L "$link" | gunzip
+}
+
+function feed_rapid7_rdns {
+    dataset=$1
+    line=$(curl -s https://opendata.rapid7.com/sonar.rdns_v2/ | grep "href=\".\+-rdns.json.gz\"")
+    link="https://opendata.rapid7.com$(echo "$line" | cut -d'"' -f2)"
+    log "Reading PTR records from $link"
+    curl -L "$link" | gunzip
+}
+
+feed_rapid7_rdns | ./feed_dns.py rapid7
+feed_rapid7_fdns a | ./feed_dns.py rapid7 --ip4-cache 536870912
+# feed_rapid7_fdns aaaa | ./feed_dns.py rapid7 --ip6-cache 536870912
+feed_rapid7_fdns cname | ./feed_dns.py rapid7
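Reviewer note: `feed_rapid7_fdns` finds the dataset URL by grepping the index page for an `href` ending in `-fdns_<dataset>.json.gz` and taking the second `"`-delimited field. A rough Python equivalent is sketched below, assuming the index page contains plain `href="..."` attributes with site-relative paths. The function name and the sample file name are made up for illustration, and unlike the shell version this returns only the first match.

```python
import re
from typing import Optional

BASE_URL = "https://opendata.rapid7.com"


def find_dataset_link(index_html: str, dataset: str) -> Optional[str]:
    """Return the absolute URL of the first matching dataset, or None."""
    pattern = re.compile(
        r'href="([^"]+-fdns_' + re.escape(dataset) + r'\.json\.gz)"')
    match = pattern.search(index_html)
    if match is None:
        return None
    return BASE_URL + match.group(1)


# Hypothetical index line; real file names will differ.
sample = '<a href="/sonar.fdns_v2/2019-12-20-1576800000-fdns_a.json.gz">a</a>'
link = find_dataset_link(sample, 'a')
```

Escaping the dots keeps the `a` dataset from matching unrelated file names, which the unescaped `.` in the shell pattern technically allows.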
--- a/import_rules.sh
+++ b/import_rules.sh
@ -0,0 +1,22 @@
+#!/usr/bin/env bash
+
+function log() {
+    echo -e "\033[33m$@\033[0m"
+}
+
+log "Importing rules…"
+BEFORE="$(date +%s)"
+cat rules_adblock/*.txt | grep -v '^!' | grep -v '^\[Adblock' | ./adblock_to_domain_list.py | ./feed_rules.py zone
+cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 | ./feed_rules.py zone
+cat rules/*.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone
+cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network
+cat rules_asn/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn
+
+cat rules/first-party.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone --first-party
+cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network --first-party
+cat rules_asn/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn --first-party
+
+./feed_asn.py
+
+# log "Pruning old rules…"
+# ./db.py --prune --prune-before "$BEFORE" --prune-base
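Reviewer note: every rule source in `import_rules.sh` goes through the same cleanup before reaching `feed_rules.py`: drop comment lines, drop empty lines, and for hosts files keep the second space-separated field. A minimal Python sketch of the hosts-file variant follows. `hosts_to_domains` is a hypothetical name, and the fallback to the whole line mirrors how `cut -d ' ' -f2` behaves when a line has no delimiter.

```python
from typing import Iterable, Iterator


def hosts_to_domains(lines: Iterable[str]) -> Iterator[str]:
    """Yield the hostname column of a hosts file, skipping noise."""
    for line in lines:
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue  # Same effect as grep -v '^#' | grep -v '^$'
        fields = line.split(' ')
        # cut prints the whole line when there is no delimiter
        yield fields[1] if len(fields) > 1 else line


example = [
    '# comment\n',
    '\n',
    '0.0.0.0 tracker.example\n',
    '127.0.0.1 ads.example # note\n',
]
domains = list(hosts_to_domains(example))
```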
--- a/nameservers/.gitignore
+++ b/nameservers/.gitignore
@ -0,0 +1,2 @@
+*.custom.list
+*.cache.list
--- a/nameservers/popular.list
+++ b/nameservers/popular.list
@ -0,0 +1,24 @@
+8.8.8.8
+8.8.4.4
+2001:4860:4860:0:0:0:0:8888
+2001:4860:4860:0:0:0:0:8844
+208.67.222.222
+208.67.220.220
+2620:119:35::35
+2620:119:53::53
+4.2.2.1
+4.2.2.2
+8.26.56.26
+8.20.247.20
+84.200.69.80
+84.200.70.40
+2001:1608:10:25:0:0:1c04:b12f
+2001:1608:10:25:0:0:9249:d69b
+9.9.9.10
+149.112.112.10
+2620:fe::10
+2620:fe::fe:10
+1.1.1.1
+1.0.0.1
+2606:4700:4700::1111
+2606:4700:4700::1001
--- a/regexes.py
+++ b/regexes.py
@ -1,21 +0,0 @@
-#!/usr/bin/env python3
-
-"""
-List of regex matching first-party trackers.
-"""
-
-# Syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax
-
-REGEXES = [
-    r'^.+\.eulerian\.net\.$',  # Eulerian
-    r'^.+\.criteo\.com\.$',  # Criteo
-    r'^.+\.dnsdelegation\.io\.$',  # Criteo
-    r'^.+\.keyade\.com\.$',  # Keyade
-    r'^.+\.omtrdc\.net\.$',  # Adobe Experience Cloud
-    r'^.+\.bp01\.net\.$',  # NP6
-    r'^.+\.ati-host\.net\.$',  # Xiti (AT Internet)
-    r'^.+\.at-o\.net\.$',  # Xiti (AT Internet)
-    r'^.+\.edgkey\.net\.$',  # Edgekey (Akamai)
-    r'^.+\.akaimaiedge\.net\.$',  # Edgekey (Akamai)
-    r'^.+\.storetail\.io\.$',  # Storetail (Criteo)
-]
--- a/resolve_subdomains.py
+++ b/resolve_subdomains.py
@ -1,284 +0,0 @@
-#!/usr/bin/env python3
-
-"""
-From a list of subdomains, output only
-the ones resolving to a first-party tracker.
-"""
-
-import argparse
-import logging
-import os
-import queue
-import sys
-import threading
-import typing
-import csv
-
-import coloredlogs
-import dns.exception
-import dns.resolver
-import progressbar
-
-DNS_TIMEOUT = 5.0
-NUMBER_THREADS = 512
-NUMBER_TRIES = 5
-
-# TODO All the domains don't get treated,
-# so it leaves with 4-5 subdomains not resolved
-
-glob = None
-
-
-class Worker(threading.Thread):
-    """
-    Worker process for a DNS resolver.
-    Will resolve DNS to match first-party subdomains.
-    """
-
-    def change_nameserver(self) -> None:
-        """
-        Assign this worker another nameserver from the queue.
-        """
-        server = None
-        while server is None:
-            try:
-                server = self.orchestrator.nameservers_queue.get(block=False)
-            except queue.Empty:
-                self.orchestrator.refill_nameservers_queue()
-        self.log.info("Using nameserver: %s", server)
-        self.resolver.nameservers = [server]
-
-    def __init__(self,
-                 orchestrator: 'Orchestrator',
-                 index: int = 0):
-        super(Worker, self).__init__()
-        self.log = logging.getLogger(f'worker{index:03d}')
-        self.orchestrator = orchestrator
-
-        self.resolver = dns.resolver.Resolver()
-        self.change_nameserver()
-
-    def resolve_subdomain(self, subdomain: str) -> typing.Optional[
-        typing.List[
-            str
-        ]
-    ]:
-        """
-        Returns the resolution chain of the subdomain to an A record,
-        including any intermediary CNAME.
-        The last element is an IP address.
-        Returns None if the nameserver was unable to satisfy the request.
-        Returns [] if the requests points to nothing.
-        """
-        self.log.debug("Querying %s", subdomain)
-        try:
-            query = self.resolver.query(subdomain, 'A', lifetime=DNS_TIMEOUT)
-        except dns.resolver.NXDOMAIN:
-            return []
-        except dns.resolver.NoAnswer:
-            return []
-        except dns.resolver.YXDOMAIN:
-            self.log.warning("Query name too long for %s", subdomain)
-            return None
-        except dns.resolver.NoNameservers:
-            # NOTE Most of the time this error message means that the domain
-            # does not exist, but sometimes it means that the server
-            # itself is broken. So we count on the retry logic.
-            self.log.warning("All nameservers broken for %s", subdomain)
-            return None
-        except dns.exception.Timeout:
-            # NOTE Same as above
-            self.log.warning("Timeout for %s", subdomain)
-            return None
-        except dns.name.EmptyLabel:
-            self.log.warning("Empty label for %s", subdomain)
-            return None
-        resolved = list()
-        last = len(query.response.answer) - 1
-        for a, answer in enumerate(query.response.answer):
-            if answer.rdtype == dns.rdatatype.CNAME:
-                assert a < last
-                resolved.append(answer.items[0].to_text()[:-1])
-            elif answer.rdtype == dns.rdatatype.A:
-                assert a == last
-                resolved.append(answer.items[0].address)
-            else:
-                assert False
-        return resolved
-
-    def run(self) -> None:
-        self.log.info("Started")
-        subdomain: str
-        for subdomain in iter(self.orchestrator.subdomains_queue.get, None):
-
-            for _ in range(NUMBER_TRIES):
-                resolved = self.resolve_subdomain(subdomain)
-                # Retry with another nameserver if error
-                if resolved is None:
-                    self.change_nameserver()
-                else:
-                    break
-
-            # If it wasn't found after multiple tries
-            if resolved is None:
-                self.log.error("Gave up on %s", subdomain)
-                resolved = []
-
-            resolved.insert(0, subdomain)
-            assert isinstance(resolved, list)
-            self.orchestrator.results_queue.put(resolved)
-
-        self.orchestrator.results_queue.put(None)
-        self.log.info("Stopped")
-
-
-class Orchestrator():
-    """
-    Orchestrator of the different Worker threads.
-    """
-
-    def refill_nameservers_queue(self) -> None:
-        """
-        Re-fill the given nameservers into the nameservers queue.
-        Done every time the queue is empty, which makes the queue
-        effectively loop forever.
-        """
-        # Might be in a race condition but that's probably fine
-        for nameserver in self.nameservers:
-            self.nameservers_queue.put(nameserver)
-        self.log.info("Refilled nameserver queue")
-
-    def __init__(self, subdomains: typing.Iterable[str],
-                 nameservers: typing.List[str] = None,
-                 ):
-        self.log = logging.getLogger('orchestrator')
-        self.subdomains = subdomains
-
-        # Use internal resolver by default
-        self.nameservers = nameservers or dns.resolver.Resolver().nameservers
-
-        self.subdomains_queue: queue.Queue = queue.Queue(
-            maxsize=NUMBER_THREADS)
-        self.results_queue: queue.Queue = queue.Queue()
-        self.nameservers_queue: queue.Queue = queue.Queue()
-
-        self.refill_nameservers_queue()
-
-    def fill_subdomain_queue(self) -> None:
-        """
-        Read the subdomains in input and put them into the queue.
-        Done in a thread so we can both:
-        - yield the results as they come
-        - not store all the subdomains at once
-        """
-        self.log.info("Started reading subdomains")
-        # Send data to workers
-        for subdomain in self.subdomains:
-            self.subdomains_queue.put(subdomain)
-
-        self.log.info("Finished reading subdomains")
-        # Send sentinel to each worker
-        # sentinel = None ~= EOF
-        for _ in range(NUMBER_THREADS):
-            self.subdomains_queue.put(None)
-
-    def run(self) -> typing.Iterable[typing.List[str]]:
-        """
-        Yield the results.
-        """
-        # Create workers
-        self.log.info("Creating workers")
-        for i in range(NUMBER_THREADS):
-            Worker(self, i).start()
-
-        fill_thread = threading.Thread(target=self.fill_subdomain_queue)
-        fill_thread.start()
-
-        # Wait for one sentinel per worker
-        # In the meantime output results
-        for _ in range(NUMBER_THREADS):
-            result: typing.List[str]
-            for result in iter(self.results_queue.get, None):
-                yield result
-
-        self.log.info("Waiting for reader thread")
-        fill_thread.join()
-
-        self.log.info("Done!")
-
-
-def main() -> None:
-    """
-    Main function when used directly.
-    Read the subdomains provided and output it,
-    the last CNAME resolved and the IP address it resolves to.
-    Takes as an input a filename (or nothing, for stdin),
-    and as an output a filename (or nothing, for stdout).
-    The input must be a subdomain per line, the output is a comma-sep
-    file with the columns source CNAME and A.
-    Use the file `nameservers` as the list of nameservers
-    to use, or else it will use the system defaults.
-    Also shows a nice progressbar.
-    """
-
-    # Initialization
-    coloredlogs.install(
-        level='DEBUG',
-        fmt='%(asctime)s %(name)s %(levelname)s %(message)s'
-    )
-
-    # Parsing arguments
-    parser = argparse.ArgumentParser(
-        description="Massively resolves subdomains and store them in a file.")
-    parser.add_argument(
-        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
-        help="Input file with one subdomain per line")
-    parser.add_argument(
-        '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
-        help="Output file with DNS chains")
-    # parser.add_argument(
-    #     '-n', '--nameserver', type=argparse.FileType('r'),
-    #     default='nameservers', help="File with one nameserver per line")
-    # parser.add_argument(
-    #     '-j', '--workers', type=int, default=512,
-    #     help="Number of threads to use")
-    args = parser.parse_args()
-
-    # Progress bar
-    widgets = [
-        progressbar.Percentage(),
-        ' ', progressbar.SimpleProgress(),
-        ' ', progressbar.Bar(),
-        ' ', progressbar.Timer(),
-        ' ', progressbar.AdaptiveTransferSpeed(unit='req'),
-        ' ', progressbar.AdaptiveETA(),
-    ]
-    progress = progressbar.ProgressBar(widgets=widgets)
-    if args.input.seekable():
-        progress.max_value = len(args.input.readlines())
-        args.input.seek(0)
-
-    # Cleaning input
-    iterator = iter(args.input)
-    iterator = map(str.strip, iterator)
-    iterator = filter(None, iterator)
-
-    # Reading nameservers
-    servers: typing.List[str] = list()
-    if os.path.isfile('nameservers'):
-        servers = open('nameservers').readlines()
-        servers = list(filter(None, map(str.strip, servers)))
-
-    writer = csv.writer(args.output)
-
-    progress.start()
-    global glob
-    glob = Orchestrator(iterator, servers)
-    for resolved in glob.run():
-        progress.update(progress.value + 1)
-        writer.writerow(resolved)
-    progress.finish()
-
-
-if __name__ == '__main__':
-    main()
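Reviewer note: the removed resolver coordinated its threads with `None` sentinels: the feeder pushes one sentinel per worker once the input is exhausted, each worker forwards one sentinel on exit, and the consumer stops after seeing as many sentinels as there are workers. The sketch below isolates that pattern with a generic `work` callable in place of DNS resolution. `run_workers` is an illustrative name, and the sketch assumes `work` never returns `None`, since that value is reserved as the sentinel.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator


def run_workers(items: Iterable, work: Callable,
                workers: int = 4) -> Iterator:
    """Apply `work` to every item in parallel; yield results as they come."""
    tasks: queue.Queue = queue.Queue(maxsize=workers)
    results: queue.Queue = queue.Queue()

    def worker() -> None:
        # iter(get, None) stops when the sentinel is received
        for item in iter(tasks.get, None):
            results.put(work(item))
        results.put(None)  # Tell the consumer this worker is done

    def feeder() -> None:
        for item in items:
            tasks.put(item)
        for _ in range(workers):
            tasks.put(None)  # One sentinel per worker

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    feed = threading.Thread(target=feeder)
    feed.start()

    # Drain results until one sentinel per worker has been seen
    for _ in range(workers):
        for result in iter(results.get, None):
            yield result

    feed.join()
    for thread in threads:
        thread.join()


squares = sorted(run_workers(range(10), lambda x: x * x))
```

Bounding the task queue keeps the feeder from loading the whole input into memory, which is the same reason the original used `maxsize=NUMBER_THREADS`.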
--- a/resolve_subdomains.sh
+++ b/resolve_subdomains.sh
@ -4,11 +4,16 @@ function log() {
    echo -e "\033[33m$@\033[0m"
 }

-# Resolve the CNAME chain of all the known subdomains for later analysis
-log "Compiling subdomain lists..."
-pv subdomains/*.list | sort -u > temp/all_subdomains.list
-# Sort by last character to utilize the DNS server caching mechanism
-pv temp/all_subdomains.list | rev | sort | rev > temp/all_subdomains_reversort.list
-./resolve_subdomains.py --input temp/all_subdomains_reversort.list --output temp/all_resolved.csv
-sort -u temp/all_resolved.csv > temp/all_resolved_sorted.csv
+log "Compiling nameservers…"
+pv nameservers/*.list | ./validate_list.py --ip4 | sort -u > temp/all_nameservers_ip4.list

+log "Compiling subdomains…"
+# Sort by last character to utilize the DNS server caching mechanism
+# (not as efficient with massdns but it's almost free so why not)
+pv subdomains/*.list | ./validate_list.py --domain | rev | sort -u | rev > temp/all_subdomains.list
+
+log "Resolving subdomains…"
+massdns --output Snrql --retry REFUSED,SERVFAIL --resolvers temp/all_nameservers_ip4.list --outfile temp/all_resolved.txt temp/all_subdomains.list
+
+log "Importing into database…"
+pv temp/all_resolved.txt | ./feed_dns.py massdns
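Reviewer note: the `rev | sort -u | rev` pipeline orders hostnames by their reversed text, so names sharing a suffix (and therefore a zone) end up next to each other. A small Python sketch of the same ordering follows. `reverse_sort` is a hypothetical name, and Python compares by code point whereas `sort` follows the current locale, so the order of unusual characters may differ.

```python
from typing import Iterable, List


def reverse_sort(names: Iterable[str]) -> List[str]:
    """Deduplicate, then sort names by their reversed spelling."""
    return sorted(set(names), key=lambda name: name[::-1])


ordered = reverse_sort([
    'b.example.com',
    'a.example.org',
    'a.example.com',
    'a.example.com',
])
```

Here both `.com` names sit together after the `.org` one, because `gro.` sorts before `moc.`.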
--- a/rules/first-party.list
+++ b/rules/first-party.list
@ -18,7 +18,14 @@ omtrdc.net
 online-metrix.net
 # Webtrekk
 wt-eu02.net
+webtrekk.net
 # Otto Group
 oghub.io
-# ???
+# Intent.com
 partner.intentmedia.net
+# Wizaly
+wizaly.com
+# Commanders Act
+tagcommander.com
+# Ingenious Technologies
+affex.org
--- a/rules_asn/.gitignore
+++ b/rules_asn/.gitignore
@ -0,0 +1,2 @@
+*.custom.txt
+*.cache.txt
--- a/rules_asn/first-party.txt
+++ b/rules_asn/first-party.txt
@ -0,0 +1,10 @@
+# Eulerian
+AS50234
+# Criteo
+AS44788
+AS19750
+AS55569
+# ThreatMetrix
+AS30286
+# Webtrekk
+AS60164
--- a/rules_ip/first-party.txt
+++ b/rules_ip/first-party.txt
@ -1,51 +0,0 @@
-# Eulerian (AS50234 EULERIAN TECHNOLOGIES S.A.S.)
-109.232.192.0/21
-# Criteo (AS44788 Criteo SA)
-91.199.242.0/24
-91.212.98.0/24
-178.250.0.0/21
-178.250.0.0/24
-178.250.1.0/24
-178.250.2.0/24
-178.250.3.0/24
-178.250.4.0/24
-178.250.6.0/24
-185.235.84.0/24
-# Criteo (AS19750 Criteo Corp.)
-74.119.116.0/22
-74.119.117.0/24
-74.119.118.0/24
-74.119.119.0/24
-91.199.242.0/24
-185.235.85.0/24
-199.204.168.0/22
-199.204.168.0/24
-199.204.169.0/24
-199.204.170.0/24
-199.204.171.0/24
-178.250.0.0/21
-91.212.98.0/24
-91.199.242.0/24
-185.235.84.0/24
-# Criteo (AS55569 Criteo APAC)
-91.199.242.0/24
-116.213.20.0/22
-116.213.20.0/24
-116.213.21.0/24
-182.161.72.0/22
-182.161.72.0/24
-182.161.73.0/24
-185.235.86.0/24
-185.235.87.0/24
-# ThreatMetrix (AS30286 ThreatMetrix Inc.)
-69.84.176.0/24
-173.254.179.0/24
-185.32.240.0/23
-185.32.242.0/23
-192.225.156.0/22
-199.101.156.0/23
-199.101.158.0/23
-# Webtrekk (AS60164 Webtrekk GmbH)
-185.54.148.0/22
-185.54.150.0/24
-185.54.151.0/24
--- a/run_tests.py
+++ b/run_tests.py
@ -0,0 +1,34 @@
+#!/usr/bin/env python3
+
+import database
+import os
+import logging
+import csv
+
+TESTS_DIR = 'tests'
+
+if __name__ == '__main__':
+
+    DB = database.Database()
+    log = logging.getLogger('tests')
+
+    for filename in os.listdir(TESTS_DIR):
+        log.info("")
+        log.info("Running tests from %s", filename)
+        path = os.path.join(TESTS_DIR, filename)
+        with open(path, 'rt') as fdesc:
+            reader = csv.DictReader(fdesc)
+            for test in reader:
+                log.info("Testing %s (%s)", test['url'], test['comment'])
+
+                for white in test['white'].split(':'):
+                    if not white:
+                        continue
+                    if any(DB.get_domain(white)):
+                        log.error("False positive: %s", white)
+
+                for black in test['black'].split(':'):
+                    if not black:
+                        continue
+                    if not any(DB.get_domain(black)):
+                        log.error("False negative: %s", black)
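Reviewer note: `run_tests.py` reads each CSV row, splits the `white` and `black` columns on `:`, and reports a false positive when a whitelisted domain is matched or a false negative when a blacklisted one is not. The sketch below reproduces that logic against an in-memory CSV and a stand-in predicate instead of `database.Database`, so it can run without the database. `check_tests` and `is_blocked` are illustrative names.

```python
import csv
import io
from typing import Callable, List, Tuple


def check_tests(csv_text: str,
                is_blocked: Callable[[str], bool]) -> List[Tuple[str, str]]:
    """Return (kind, domain) pairs for every failed expectation."""
    errors: List[Tuple[str, str]] = []
    for test in csv.DictReader(io.StringIO(csv_text)):
        # Empty columns split into [''], which filter(None, ...) drops
        for white in filter(None, test['white'].split(':')):
            if is_blocked(white):
                errors.append(('false positive', white))
        for black in filter(None, test['black'].split(':')):
            if not is_blocked(black):
                errors.append(('false negative', black))
    return errors


SAMPLE = (
    "url,white,black,comment\n"
    "https://support.apple.com,support.apple.com,,EdgeKey / AkamaiEdge\n"
    "https://www.mytoys.de/,,web.mytoys.de,Webtrekk\n"
)
# Stand-in matcher that wrongly blocks only support.apple.com
failures = check_tests(SAMPLE, lambda domain: domain == 'support.apple.com')
```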
--- a/tests/false-positives.csv
+++ b/tests/false-positives.csv
@ -1,6 +1,5 @@
 url,white,black,comment
 https://support.apple.com,support.apple.com,,EdgeKey / AkamaiEdge
 https://www.pinterest.fr/,i.pinimg.com,,Cedexis
-https://www.pinterest.fr/,i.pinimg.com,,Cedexis
 https://www.tumblr.com/,66.media.tumblr.com,,ChiCDN
 https://www.skype.com/fr/,www.skype.com,,TrafficManager
--- a/tests/first-party.csv
+++ b/tests/first-party.csv
@ -5,3 +5,6 @@ https://www.discover.com/,,content.discover.com,ThreatMetrix
 https://www.mytoys.de/,,web.mytoys.de,Webtrekk
 https://www.baur.de/,,tp.baur.de,Otto Group
 https://www.liligo.com/,,compare.liligo.com,???
+https://www.boulanger.com/,,tag.boulanger.fr,TagCommander
+https://www.airfrance.fr/FR/,,tk.airfrance.fr,Wizaly
+https://www.vsgamers.es/,,marketing.net.vsgamers.es,Affex
--- a/validate_list.py
+++ b/validate_list.py
@ -0,0 +1,35 @@
+#!/usr/bin/env python3
+# pylint: disable=C0103
+
+"""
+Filter out invalid domain names
+"""
+
+import database
+import argparse
+import sys
+
+if __name__ == '__main__':
+
+    # Parsing arguments
+    parser = argparse.ArgumentParser(
+        description="Filter out invalid domain names/IP addresses from a list.")
+    parser.add_argument(
+        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
+        help="Input file, one element per line")
+    parser.add_argument(
+        '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
+        help="Output file, one element per line")
+    parser.add_argument(
+        '-d', '--domain', action='store_true',
+        help="Can be domain name")
+    parser.add_argument(
+        '-4', '--ip4', action='store_true',
+        help="Can be IP4")
+    args = parser.parse_args()
+
+    for line in args.input:
+        line = line.strip()
+        if (args.domain and database.Database.validate_domain(line)) or \
+                (args.ip4 and database.Database.validate_ip4address(line)):
+            print(line, file=args.output)
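Reviewer note: `validate_list.py` delegates to `database.Database.validate_domain` and `validate_ip4address`, whose bodies are not part of this excerpt. The sketch below shows one plausible shape for such checks and is an assumption, not the project's implementation: `looks_like_domain` and `looks_like_ip4` are hypothetical names, and the label rules (1 to 63 characters, no leading or trailing hyphen, at least two labels) are a common approximation rather than what `database.py` enforces.

```python
import ipaddress
import re

# One DNS label: letters, digits, '_' or '-', not starting/ending with '-'
LABEL = re.compile(r'^(?!-)[a-z0-9_-]{1,63}(?<!-)$')


def looks_like_domain(name: str) -> bool:
    """Approximate hostname check (assumed rules, see the note above)."""
    if len(name) > 253:
        return False
    labels = name.lower().split('.')
    return len(labels) >= 2 and all(LABEL.match(label) for label in labels)


def looks_like_ip4(address: str) -> bool:
    """True if the string parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(address)
    except ValueError:
        return False
    return True


kept = [line for line in ['8.8.8.8', '2620:fe::10', 'not an ip']
        if looks_like_ip4(line)]
```

Filtering the `nameservers/popular.list` entries this way would keep only the IPv4 resolvers, which matches the `--ip4` usage in `resolve_subdomains.sh`.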