Compare commits

...

20 commits
v2.3 ... master

Author SHA1 Message Date
Geoffrey Frogeye 3b6f7a58b3
Remove support for Rapid7
They changed their privacy / pricing model and as such I don't have
access to their massive DNS dataset anymore,
even after asking.

Since 2022-01-02, I put the list on freeze while looking for an alternative,
but couldn't find any.
To make the list update again with the remaining DNS sources I have,
I put the last version of the list generated with the Rapid7 dataset
as an input for subdomains, that will now get resolved with MassDNS.
2022-11-13 20:10:27 +01:00
Geoffrey Frogeye 49a36f32f2
Add requirements.txt file 2022-02-26 13:01:11 +01:00
Geoffrey Frogeye 29cf72ae92 Fix most of the README being bold
Why did I go with this Markdown generator again?
2021-08-28 20:58:34 +02:00
Geoffrey Frogeye 998c3faf8f
Add SAS.com 2021-08-22 18:02:37 +02:00
Geoffrey Frogeye c8a14a4e21
Add DataUnlocker 2021-08-22 17:07:25 +02:00
Geoffrey Frogeye 1ec26e7f96
Add Plausible.io 2021-08-22 16:53:58 +02:00
Geoffrey Frogeye 5b49441bc0 Add Branch.io tracker 2021-08-22 16:37:31 +02:00
Geoffrey Frogeye afd122f2ab
Update usage recommendations 2021-08-15 13:04:55 +02:00
Geoffrey Frogeye 6ae3d5fb55
Add Lead Forensics tracker 2021-08-15 11:39:37 +02:00
Geoffrey Frogeye 10a505d84f
Add Fathom 2021-08-15 11:18:35 +02:00
Geoffrey Frogeye c06648da53
Added Pardot tracker 2021-08-15 11:06:53 +02:00
Geoffrey Frogeye f165e5a094
Fix (most) mypy / flake8 errors 2021-08-14 23:35:51 +02:00
Geoffrey Frogeye 3dcccad39a
Black pass 2021-08-14 23:27:28 +02:00
Geoffrey Frogeye a023dc8322
Fix deprecated np.bool 2021-08-14 23:21:03 +02:00
Geoffrey Frogeye 389e83d492
Fix database maximum cache size cap 2021-08-14 23:19:12 +02:00
Geoffrey Frogeye edf444cc28
Add ad-cloud.jp and improve names of Japanese trackers
Closes #19

Names from https://github.com/AdguardTeam/cname-trackers/issues/1
2021-08-14 22:55:58 +02:00
Geoffrey Frogeye fa23d466d2
Actually remove ThreatMetrix
Forgot -i when grepping
2021-08-14 21:55:44 +02:00
Geoffrey Frogeye f5f9f88c42
Remove ThreatMetrix
I received a lot of false positives for this one,
and while I wasn't able to reproduce the issue in most of the cases,
I trust the community.
It's also not in any other CNAME tracker list, probably for the same reason.
Plus, it's apparently not very nasty.
So I'll let it go.

Closes #17
2021-08-14 21:24:48 +02:00
Geoffrey Frogeye 2997e41f98
Investigated >0.5% trackers from Fukuda paper 2020-12-19 13:41:07 +01:00
Geoffrey Frogeye 6cf1028174
Added other tracking source for Adobe
Found on the Adobe documentation and in the wild
https://experienceleague.adobe.com/docs/analytics/implementation/vars/config-vars/trackingserversecure.html?lang=en#s.trackingserversecure-in-appmeasurement-and-launch-custom-code-editor
2020-12-19 13:15:38 +01:00
20 changed files with 478 additions and 517 deletions

View file

@ -1,4 +1,3 @@
RAPID7_API_KEY=
CACHE_SIZE=536870912
MASSDNS_HASHMAP_SIZE=1000
PROFILE=0

View file

@ -18,7 +18,7 @@ This program takes as input:
It outputs hostnames that are DNS redirections to any item in the provided lists.
DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).
DNS records can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).
Those subdomains can either be provided as-is, come from the [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that).
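To make the matching idea concrete, here is a minimal, self-contained sketch of what "a DNS redirection to an item in the lists" means; it is an illustration only, not the repository's algorithm (the real matching lives in the `database.py` tree further down in this diff):
```
# Toy illustration: a hostname is a first-party tracker if any hop of its
# CNAME chain falls under a known tracker zone (eulerian.net is an example).
def is_redirection_to_tracker(cname_chain, rule_zones):
    for hop in cname_chain:
        if any(hop == zone or hop.endswith("." + zone) for zone in rule_zones):
            return True
    return False

print(is_redirection_to_tracker(
    ["metrics.example.org", "example.eulerian.net"], {"eulerian.net"}))  # True
```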
@ -41,7 +41,6 @@ Depending on the sources you'll be using to generate the list, you'll need to in
- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
- [numpy](https://www.numpy.org/)
- [python-abp](https://pypi.org/project/python-abp/) (only if you intend to use AdBlock rules as a rule source)
- [jq](http://stedolan.github.io/jq/) (only if you have a Rapid7 API key)
- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
@ -135,22 +134,6 @@ Note that this is a network intensive process, not in term of bandwith, but in t
The DNS records will automatically be imported into the database.
If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.
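In code terms, the re-import step corresponds to the single-process path of `feed_dns.py` shown further down in this diff; a sketch (the input path is an assumption, wherever `./resolve_subdomains.sh` wrote the MassDNS output):
```
# Re-import an existing MassDNS output without re-resolving (sketch).
import feed_dns

with open("temp/massdns_output.txt") as buf:         # hypothetical path
    writer = feed_dns.Writer(autosave_interval=900, ip4_cache=0)
    parser = feed_dns.MassDnsParser(buf, writer=writer)
    parser.run()
    writer.end()                                      # saves the database
```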
### Import DNS records from Rapid7
If you have a Rapid7 Organization API key, make sure to append to `.env`:
```
RAPID7_API_KEY=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```
Then, run `./import_rapid7.sh`.
This will download about 35 GiB of data the first time, but only the matching records will be stored (a few MiB for the tracking rules).
Note the download speed will most likely be limited by the database operation throughput (fast RAM will help).
The script remembers which sets were downloaded last, and will only import newer sets.
If the first-party rules changed, the corresponding sets will be re-imported anyway.
If you want to force re-importing, run `rm last_updates/rapid7_*.txt`.
### Export the lists
For the tracking list, use `./export_lists.sh`, the output will be in the `dist` folder (please change the links before distributing them).
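Under the hood the export script asks `export.py` (shown further down in this diff) to walk the database; a rough Python equivalent, with illustrative flag choices:
```
# Roughly what the export boils down to; flag values are examples.
import database

DB = database.Database()
for line in DB.list_records(first_party_only=False, hostnames_only=True):
    print(line)
```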

View file

@ -16,25 +16,36 @@ import abp.filters
def get_domains(rule: abp.filters.parser.Filter) -> typing.Iterable[str]:
if rule.options:
return
selector_type = rule.selector['type']
selector_value = rule.selector['value']
if selector_type == 'url-pattern' \
and selector_value.startswith('||') \
and selector_value.endswith('^'):
selector_type = rule.selector["type"]
selector_value = rule.selector["value"]
if (
selector_type == "url-pattern"
and selector_value.startswith("||")
and selector_value.endswith("^")
):
yield selector_value[2:-1]
if __name__ == '__main__':
if __name__ == "__main__":
# Parsing arguments
parser = argparse.ArgumentParser(
description="Extract whole domains from an AdBlock blocking list")
description="Extract whole domains from an AdBlock blocking list"
)
parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="Input file with AdBlock rules")
"-i",
"--input",
type=argparse.FileType("r"),
default=sys.stdin,
help="Input file with AdBlock rules",
)
parser.add_argument(
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
help="Outptut file with one rule tracking subdomain per line")
"-o",
"--output",
type=argparse.FileType("w"),
default=sys.stdout,
help="Outptut file with one rule tracking subdomain per line",
)
args = parser.parse_args()
# Reading rules
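In plain terms, the extractor above only keeps domain-anchored blocking rules without options; a tiny illustration (the rule string is an example):
```
# What get_domains keeps: "||domain^" patterns with no options.
value = "||tracker.example.com^"
if value.startswith("||") and value.endswith("^"):
    print(value[2:-1])        # tracker.example.com
```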

View file

@ -16,26 +16,25 @@ import selenium.webdriver.firefox.options
import seleniumwire.webdriver
import logging
log = logging.getLogger('cs')
log = logging.getLogger("cs")
DRIVER = None
SCROLL_TIME = 10.0
SCROLL_STEPS = 100
SCROLL_CMD = f'window.scrollBy(0,document.body.scrollHeight/{SCROLL_STEPS})'
SCROLL_CMD = f"window.scrollBy(0,document.body.scrollHeight/{SCROLL_STEPS})"
def new_driver() -> seleniumwire.webdriver.browser.Firefox:
profile = selenium.webdriver.FirefoxProfile()
profile.set_preference('privacy.trackingprotection.enabled', False)
profile.set_preference('network.cookie.cookieBehavior', 0)
profile.set_preference('privacy.trackingprotection.pbmode.enabled', False)
profile.set_preference(
'privacy.trackingprotection.cryptomining.enabled', False)
profile.set_preference(
'privacy.trackingprotection.fingerprinting.enabled', False)
profile.set_preference("privacy.trackingprotection.enabled", False)
profile.set_preference("network.cookie.cookieBehavior", 0)
profile.set_preference("privacy.trackingprotection.pbmode.enabled", False)
profile.set_preference("privacy.trackingprotection.cryptomining.enabled", False)
profile.set_preference("privacy.trackingprotection.fingerprinting.enabled", False)
options = selenium.webdriver.firefox.options.Options()
# options.add_argument('-headless')
driver = seleniumwire.webdriver.Firefox(profile,
executable_path='geckodriver', options=options)
driver = seleniumwire.webdriver.Firefox(
profile, executable_path="geckodriver", options=options
)
return driver
@ -60,11 +59,11 @@ def collect_subdomains(url: str) -> typing.Iterable[str]:
DRIVER.get(url)
for s in range(SCROLL_STEPS):
DRIVER.execute_script(SCROLL_CMD)
time.sleep(SCROLL_TIME/SCROLL_STEPS)
time.sleep(SCROLL_TIME / SCROLL_STEPS)
for request in DRIVER.requests:
if request.response:
yield subdomain_from_url(request.path)
except:
except Exception:
log.exception("Error")
DRIVER.quit()
DRIVER = None
@ -78,10 +77,10 @@ def collect_subdomains_standalone(url: str) -> None:
print(subdomain)
if __name__ == '__main__':
if __name__ == "__main__":
assert len(sys.argv) <= 2
filename = None
if len(sys.argv) == 2 and sys.argv[1] != '-':
if len(sys.argv) == 2 and sys.argv[1] != "-":
filename = sys.argv[1]
num_lines = sum(1 for line in open(filename))
iterator = progressbar.progressbar(open(filename), max_value=num_lines)
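`collect_subdomains` above yields one hostname per captured request via `subdomain_from_url`, which is outside this diff; a minimal equivalent, assuming it simply keeps the hostname of the request URL:
```
# Hedged sketch of subdomain_from_url (the real function is not in this diff).
import urllib.parse

def subdomain_from_url(url: str) -> str:
    return urllib.parse.urlparse(url).netloc

print(subdomain_from_url("https://metrics.example.org/pixel.gif"))  # metrics.example.org
```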

View file

@ -15,33 +15,30 @@ import os
TLD_LIST: typing.Set[str] = set()
coloredlogs.install(
level='DEBUG',
fmt='%(asctime)s %(name)s %(levelname)s %(message)s'
)
coloredlogs.install(level="DEBUG", fmt="%(asctime)s %(name)s %(levelname)s %(message)s")
Asn = int
Timestamp = int
Level = int
class Path():
class Path:
pass
class RulePath(Path):
def __str__(self) -> str:
return '(rule)'
return "(rule)"
class RuleFirstPath(RulePath):
def __str__(self) -> str:
return '(first-party rule)'
return "(first-party rule)"
class RuleMultiPath(RulePath):
def __str__(self) -> str:
return '(multi-party rule)'
return "(multi-party rule)"
class DomainPath(Path):
@ -49,7 +46,7 @@ class DomainPath(Path):
self.parts = parts
def __str__(self) -> str:
return '?.' + Database.unpack_domain(self)
return "?." + Database.unpack_domain(self)
class HostnamePath(DomainPath):
@ -59,7 +56,7 @@ class HostnamePath(DomainPath):
class ZonePath(DomainPath):
def __str__(self) -> str:
return '*.' + Database.unpack_domain(self)
return "*." + Database.unpack_domain(self)
class AsnPath(Path):
@ -79,7 +76,7 @@ class Ip4Path(Path):
return Database.unpack_ip4network(self)
class Match():
class Match:
def __init__(self) -> None:
self.source: typing.Optional[Path] = None
self.updated: int = 0
@ -102,10 +99,10 @@ class Match():
class AsnNode(Match):
def __init__(self) -> None:
Match.__init__(self)
self.name = ''
self.name = ""
class DomainTreeNode():
class DomainTreeNode:
def __init__(self) -> None:
self.children: typing.Dict[str, DomainTreeNode] = dict()
self.match_zone = Match()
@ -120,18 +117,16 @@ class IpTreeNode(Match):
Node = typing.Union[DomainTreeNode, IpTreeNode, AsnNode]
MatchCallable = typing.Callable[[Path,
Match],
typing.Any]
MatchCallable = typing.Callable[[Path, Match], typing.Any]
class Profiler():
class Profiler:
def __init__(self) -> None:
do_profile = int(os.environ.get('PROFILE', '0'))
do_profile = int(os.environ.get("PROFILE", "0"))
if do_profile:
self.log = logging.getLogger('profiler')
self.log = logging.getLogger("profiler")
self.time_last = time.perf_counter()
self.time_step = 'init'
self.time_step = "init"
self.time_dict: typing.Dict[str, float] = dict()
self.step_dict: typing.Dict[str, int] = dict()
self.enter_step = self.enter_step_real
@ -158,14 +153,17 @@ class Profiler():
return
def profile_real(self) -> None:
self.enter_step('profile')
self.enter_step("profile")
total = sum(self.time_dict.values())
for key, secs in sorted(self.time_dict.items(), key=lambda t: t[1]):
times = self.step_dict[key]
self.log.debug(f"{key:<20}: {times:9d} × {secs/times:5.3e} "
f"= {secs:9.2f} s ({secs/total:7.2%}) ")
self.log.debug(f"{'total':<20}: "
f"{total:9.2f} s ({1:7.2%})")
self.log.debug(
f"{key:<20}: {times:9d} × {secs/times:5.3e} "
f"= {secs:9.2f} s ({secs/total:7.2%}) "
)
self.log.debug(
f"{'total':<20}: " f"{total:9.2f} s ({1:7.2%})"
)
class Database(Profiler):
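The `Profiler` above is a no-op unless the `PROFILE` environment variable is set (see the `.env` defaults at the top of this diff); a minimal way to see its per-step report, with an illustrative step name and workload:
```
# Minimal profiler exercise; step name and workload are illustrative.
import os
os.environ["PROFILE"] = "1"     # must be set before the Database is created

import database
DB = database.Database()
DB.enter_step("example_step")
sum(range(10 ** 6))             # stand-in for real work
DB.save()                       # save() calls profile(), which logs per-step timings
```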
@ -173,9 +171,7 @@ class Database(Profiler):
PATH = "blocking.p"
def initialize(self) -> None:
self.log.warning(
"Creating database version: %d ",
Database.VERSION)
self.log.warning("Creating database version: %d ", Database.VERSION)
# Dummy match objects that everything refer to
self.rules: typing.List[Match] = list()
for first_party in (False, True):
@ -189,76 +185,77 @@ class Database(Profiler):
self.ip4tree = IpTreeNode()
def load(self) -> None:
self.enter_step('load')
self.enter_step("load")
try:
with open(self.PATH, 'rb') as db_fdsec:
with open(self.PATH, "rb") as db_fdsec:
version, data = pickle.load(db_fdsec)
if version == Database.VERSION:
self.rules, self.domtree, self.asns, self.ip4tree = data
return
self.log.warning(
"Outdated database version found: %d, "
"it will be rebuilt.",
version)
"Outdated database version found: %d, " "it will be rebuilt.",
version,
)
except (TypeError, AttributeError, EOFError):
self.log.error(
"Corrupt (or heavily outdated) database found, "
"it will be rebuilt.")
"Corrupt (or heavily outdated) database found, " "it will be rebuilt."
)
except FileNotFoundError:
pass
self.initialize()
def save(self) -> None:
self.enter_step('save')
with open(self.PATH, 'wb') as db_fdsec:
self.enter_step("save")
with open(self.PATH, "wb") as db_fdsec:
data = self.rules, self.domtree, self.asns, self.ip4tree
pickle.dump((self.VERSION, data), db_fdsec)
self.profile()
def __init__(self) -> None:
Profiler.__init__(self)
self.log = logging.getLogger('db')
self.log = logging.getLogger("db")
self.load()
self.ip4cache_shift: int = 32
self.ip4cache = numpy.ones(1)
def _set_ip4cache(self, path: Path, _: Match) -> None:
assert isinstance(path, Ip4Path)
self.enter_step('set_ip4cache')
self.enter_step("set_ip4cache")
mini = path.value >> self.ip4cache_shift
maxi = (path.value + 2**(32-path.prefixlen)) >> self.ip4cache_shift
maxi = (path.value + 2 ** (32 - path.prefixlen)) >> self.ip4cache_shift
if mini == maxi:
self.ip4cache[mini] = True
else:
self.ip4cache[mini:maxi] = True
def fill_ip4cache(self, max_size: int = 512*1024**2) -> None:
def fill_ip4cache(self, max_size: int = 512 * 1024 ** 2) -> None:
"""
Size in bytes
"""
if max_size > 2**32/8:
self.log.warning("Allocating more than 512 MiB of RAM for "
"the Ip4 cache is not necessary.")
max_cache_width = int(math.log2(max(1, max_size*8)))
if max_size > 2 ** 32 / 8:
self.log.warning(
"Allocating more than 512 MiB of RAM for "
"the Ip4 cache is not necessary."
)
max_cache_width = int(math.log2(max(1, max_size * 8)))
allocated = False
cache_width = min(2**32, max_cache_width)
cache_width = min(32, max_cache_width)
while not allocated:
cache_size = 2**cache_width
cache_size = 2 ** cache_width
try:
self.ip4cache = numpy.zeros(cache_size, dtype=numpy.bool)
self.ip4cache = numpy.zeros(cache_size, dtype=bool)
except MemoryError:
self.log.exception(
"Could not allocate cache. Retrying a smaller one.")
self.log.exception("Could not allocate cache. Retrying a smaller one.")
cache_width -= 1
continue
allocated = True
self.ip4cache_shift = 32-cache_width
self.ip4cache_shift = 32 - cache_width
for _ in self.exec_each_ip4(self._set_ip4cache):
pass
@staticmethod
def populate_tld_list() -> None:
with open('temp/all_tld.list', 'r') as tld_fdesc:
with open("temp/all_tld.list", "r") as tld_fdesc:
for tld in tld_fdesc:
tld = tld.strip()
TLD_LIST.add(tld)
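For scale, the sizing arithmetic in `fill_ip4cache` above works out as follows for a smaller, illustrative 1 MiB cap:
```
# Worked example of fill_ip4cache's sizing; the 1 MiB cap is an example value.
import math

max_size = 1024 ** 2                              # 1 MiB for the bitmap
max_cache_width = int(math.log2(max_size * 8))    # 8 388 608 bits -> width 23
cache_width = min(32, max_cache_width)            # never above 32 (the cap fixed above)
ip4cache_shift = 32 - cache_width                 # 9
print(2 ** ip4cache_shift)                        # 512 addresses share each cache bit
```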
@ -267,7 +264,7 @@ class Database(Profiler):
def validate_domain(path: str) -> bool:
if len(path) > 255:
return False
splits = path.split('.')
splits = path.split(".")
if not TLD_LIST:
Database.populate_tld_list()
if splits[-1] not in TLD_LIST:
@ -279,26 +276,26 @@ class Database(Profiler):
@staticmethod
def pack_domain(domain: str) -> DomainPath:
return DomainPath(domain.split('.')[::-1])
return DomainPath(domain.split(".")[::-1])
@staticmethod
def unpack_domain(domain: DomainPath) -> str:
return '.'.join(domain.parts[::-1])
return ".".join(domain.parts[::-1])
@staticmethod
def pack_asn(asn: str) -> AsnPath:
asn = asn.upper()
if asn.startswith('AS'):
if asn.startswith("AS"):
asn = asn[2:]
return AsnPath(int(asn))
@staticmethod
def unpack_asn(asn: AsnPath) -> str:
return f'AS{asn.asn}'
return f"AS{asn.asn}"
@staticmethod
def validate_ip4address(path: str) -> bool:
splits = path.split('.')
splits = path.split(".")
if len(splits) != 4:
return False
for split in splits:
@ -312,7 +309,7 @@ class Database(Profiler):
@staticmethod
def pack_ip4address_low(address: str) -> int:
addr = 0
for split in address.split('.'):
for split in address.split("."):
octet = int(split)
addr = (addr << 8) + octet
return addr
@ -330,12 +327,12 @@ class Database(Profiler):
for o in reversed(range(4)):
octets[o] = addr & 0xFF
addr >>= 8
return '.'.join(map(str, octets))
return ".".join(map(str, octets))
@staticmethod
def validate_ip4network(path: str) -> bool:
# A bit generous but ok for our usage
splits = path.split('/')
splits = path.split("/")
if len(splits) != 2:
return False
if not Database.validate_ip4address(splits[0]):
@ -349,7 +346,7 @@ class Database(Profiler):
@staticmethod
def pack_ip4network(network: str) -> Ip4Path:
address, prefixlen_str = network.split('/')
address, prefixlen_str = network.split("/")
prefixlen = int(prefixlen_str)
addr = Database.pack_ip4address(address)
addr.prefixlen = prefixlen
@ -363,7 +360,7 @@ class Database(Profiler):
for o in reversed(range(4)):
octets[o] = addr & 0xFF
addr >>= 8
return '.'.join(map(str, octets)) + '/' + str(network.prefixlen)
return ".".join(map(str, octets)) + "/" + str(network.prefixlen)
def get_match(self, path: Path) -> Match:
if isinstance(path, RuleMultiPath):
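The packing helpers above follow two simple conventions: domain labels are stored reversed (so the tree is keyed TLD-first) and IPv4 addresses are packed into integers. For example:
```
# Illustrating the packing conventions; the values are examples.
from database import Database

d = Database.pack_domain("tracker.example.com")
print(d.parts)                                    # ['com', 'example', 'tracker']
print(Database.unpack_domain(d))                  # tracker.example.com
print(Database.unpack_asn(Database.pack_asn("as1234")))   # AS1234
print(Database.pack_ip4address_low("192.0.2.1"))          # 3221225985
```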
@ -384,7 +381,7 @@ class Database(Profiler):
raise ValueError
elif isinstance(path, Ip4Path):
dici = self.ip4tree
for i in range(31, 31-path.prefixlen, -1):
for i in range(31, 31 - path.prefixlen, -1):
bit = (path.value >> i) & 0b1
dici_next = dici.one if bit else dici.zero
if not dici_next:
@ -394,9 +391,10 @@ class Database(Profiler):
else:
raise ValueError
def exec_each_asn(self,
callback: MatchCallable,
) -> typing.Any:
def exec_each_asn(
self,
callback: MatchCallable,
) -> typing.Any:
for asn in self.asns:
match = self.asns[asn]
if match.active():
@ -409,11 +407,12 @@ class Database(Profiler):
except TypeError: # not iterable
pass
def exec_each_domain(self,
callback: MatchCallable,
_dic: DomainTreeNode = None,
_par: DomainPath = None,
) -> typing.Any:
def exec_each_domain(
self,
callback: MatchCallable,
_dic: DomainTreeNode = None,
_par: DomainPath = None,
) -> typing.Any:
_dic = _dic or self.domtree
_par = _par or DomainPath([])
if _dic.match_hostname.active():
@ -437,16 +436,15 @@ class Database(Profiler):
for part in _dic.children:
dic = _dic.children[part]
yield from self.exec_each_domain(
callback,
_dic=dic,
_par=DomainPath(_par.parts + [part])
callback, _dic=dic, _par=DomainPath(_par.parts + [part])
)
def exec_each_ip4(self,
callback: MatchCallable,
_dic: IpTreeNode = None,
_par: Ip4Path = None,
) -> typing.Any:
def exec_each_ip4(
self,
callback: MatchCallable,
_dic: IpTreeNode = None,
_par: Ip4Path = None,
) -> typing.Any:
_dic = _dic or self.ip4tree
_par = _par or Ip4Path(0, 0)
if _dic.active():
@ -466,25 +464,18 @@ class Database(Profiler):
# addr0 = _par.value & (0xFFFFFFFF ^ (1 << (32-pref)))
# assert addr0 == _par.value
addr0 = _par.value
yield from self.exec_each_ip4(
callback,
_dic=dic,
_par=Ip4Path(addr0, pref)
)
yield from self.exec_each_ip4(callback, _dic=dic, _par=Ip4Path(addr0, pref))
# 1
dic = _dic.one
if dic:
addr1 = _par.value | (1 << (32-pref))
addr1 = _par.value | (1 << (32 - pref))
# assert addr1 != _par.value
yield from self.exec_each_ip4(
callback,
_dic=dic,
_par=Ip4Path(addr1, pref)
)
yield from self.exec_each_ip4(callback, _dic=dic, _par=Ip4Path(addr1, pref))
def exec_each(self,
callback: MatchCallable,
) -> typing.Any:
def exec_each(
self,
callback: MatchCallable,
) -> typing.Any:
yield from self.exec_each_domain(callback)
yield from self.exec_each_ip4(callback)
yield from self.exec_each_asn(callback)
@ -492,19 +483,17 @@ class Database(Profiler):
def update_references(self) -> None:
# Should be correctly calculated normally,
# keeping this just in case
def reset_references_cb(path: Path,
match: Match
) -> None:
def reset_references_cb(path: Path, match: Match) -> None:
match.references = 0
for _ in self.exec_each(reset_references_cb):
pass
def increment_references_cb(path: Path,
match: Match
) -> None:
def increment_references_cb(path: Path, match: Match) -> None:
if match.source:
source = self.get_match(match.source)
source.references += 1
for _ in self.exec_each(increment_references_cb):
pass
@ -513,9 +502,7 @@ class Database(Profiler):
# matches until all disabled matches reference count = 0
did_something = True
def clean_deps_cb(path: Path,
match: Match
) -> None:
def clean_deps_cb(path: Path, match: Match) -> None:
nonlocal did_something
if not match.source:
return
@ -530,15 +517,13 @@ class Database(Profiler):
while did_something:
did_something = False
self.enter_step('pass_clean_deps')
self.enter_step("pass_clean_deps")
for _ in self.exec_each(clean_deps_cb):
pass
def prune(self, before: int, base_only: bool = False) -> None:
# Disable the matches targeted
def prune_cb(path: Path,
match: Match
) -> None:
def prune_cb(path: Path, match: Match) -> None:
if base_only and match.level > 1:
return
if match.updated > before:
@ -546,7 +531,7 @@ class Database(Profiler):
self._unset_match(match)
self.log.debug("Print: disabled %s", path)
self.enter_step('pass_prune')
self.enter_step("pass_prune")
for _ in self.exec_each(prune_cb):
pass
@ -559,25 +544,24 @@ class Database(Profiler):
match = self.get_match(path)
string = str(path)
if isinstance(match, AsnNode):
string += f' ({match.name})'
party_char = 'F' if match.first_party else 'M'
dup_char = 'D' if match.dupplicate else '_'
string += f' {match.level}{party_char}{dup_char}{match.references}'
string += f" ({match.name})"
party_char = "F" if match.first_party else "M"
dup_char = "D" if match.dupplicate else "_"
string += f" {match.level}{party_char}{dup_char}{match.references}"
if match.source:
string += f'{self.explain(match.source)}'
string += f"{self.explain(match.source)}"
return string
def list_records(self,
first_party_only: bool = False,
end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
explain: bool = False,
) -> typing.Iterable[str]:
def export_cb(path: Path, match: Match
) -> typing.Iterable[str]:
def list_records(
self,
first_party_only: bool = False,
end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
explain: bool = False,
) -> typing.Iterable[str]:
def export_cb(path: Path, match: Match) -> typing.Iterable[str]:
if first_party_only and not match.first_party:
return
if end_chain_only and match.references > 0:
@ -596,13 +580,14 @@ class Database(Profiler):
yield from self.exec_each(export_cb)
def count_records(self,
first_party_only: bool = False,
end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
) -> str:
def count_records(
self,
first_party_only: bool = False,
end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
) -> str:
memo: typing.Dict[str, int] = dict()
def count_records_cb(path: Path, match: Match) -> None:
@ -627,75 +612,80 @@ class Database(Profiler):
split: typing.List[str] = list()
for key, value in sorted(memo.items(), key=lambda s: s[0]):
split.append(f'{key[:-4].lower()}s: {value}')
return ', '.join(split)
split.append(f"{key[:-4].lower()}s: {value}")
return ", ".join(split)
def get_domain(self, domain_str: str) -> typing.Iterable[DomainPath]:
self.enter_step('get_domain_pack')
self.enter_step("get_domain_pack")
domain = self.pack_domain(domain_str)
self.enter_step('get_domain_brws')
self.enter_step("get_domain_brws")
dic = self.domtree
depth = 0
for part in domain.parts:
if dic.match_zone.active():
self.enter_step('get_domain_yield')
self.enter_step("get_domain_yield")
yield ZonePath(domain.parts[:depth])
self.enter_step('get_domain_brws')
self.enter_step("get_domain_brws")
if part not in dic.children:
return
dic = dic.children[part]
depth += 1
if dic.match_zone.active():
self.enter_step('get_domain_yield')
self.enter_step("get_domain_yield")
yield ZonePath(domain.parts)
if dic.match_hostname.active():
self.enter_step('get_domain_yield')
self.enter_step("get_domain_yield")
yield HostnamePath(domain.parts)
def get_ip4(self, ip4_str: str) -> typing.Iterable[Path]:
self.enter_step('get_ip4_pack')
self.enter_step("get_ip4_pack")
ip4val = self.pack_ip4address_low(ip4_str)
self.enter_step('get_ip4_cache')
self.enter_step("get_ip4_cache")
if not self.ip4cache[ip4val >> self.ip4cache_shift]:
return
self.enter_step('get_ip4_brws')
self.enter_step("get_ip4_brws")
dic = self.ip4tree
for i in range(31, -1, -1):
bit = (ip4val >> i) & 0b1
if dic.active():
self.enter_step('get_ip4_yield')
yield Ip4Path(ip4val >> (i+1) << (i+1), 31-i)
self.enter_step('get_ip4_brws')
self.enter_step("get_ip4_yield")
yield Ip4Path(ip4val >> (i + 1) << (i + 1), 31 - i)
self.enter_step("get_ip4_brws")
next_dic = dic.one if bit else dic.zero
if next_dic is None:
return
dic = next_dic
if dic.active():
self.enter_step('get_ip4_yield')
self.enter_step("get_ip4_yield")
yield Ip4Path(ip4val, 32)
def _unset_match(self,
match: Match,
) -> None:
def _unset_match(
self,
match: Match,
) -> None:
match.disable()
if match.source:
source_match = self.get_match(match.source)
source_match.references -= 1
def _set_match(self,
match: Match,
updated: int,
source: Path,
source_match: Match = None,
dupplicate: bool = False,
) -> None:
def _set_match(
self,
match: Match,
updated: int,
source: Path,
source_match: Match = None,
dupplicate: bool = False,
) -> None:
# source_match is in parameters because most of the time
# its parent function needs it too,
# so it can pass it to save a traversal
source_match = source_match or self.get_match(source)
new_level = source_match.level + 1
if updated > match.updated or new_level < match.level \
or source_match.first_party > match.first_party:
if (
updated > match.updated
or new_level < match.level
or source_match.first_party > match.first_party
):
# NOTE FP and level of matches referencing this one
# won't be updated until run or prune
if match.source:
@ -708,20 +698,18 @@ class Database(Profiler):
source_match.references += 1
match.dupplicate = dupplicate
def _set_domain(self,
hostname: bool,
domain_str: str,
updated: int,
source: Path) -> None:
self.enter_step('set_domain_val')
def _set_domain(
self, hostname: bool, domain_str: str, updated: int, source: Path
) -> None:
self.enter_step("set_domain_val")
if not Database.validate_domain(domain_str):
raise ValueError(f"Invalid domain: {domain_str}")
self.enter_step('set_domain_pack')
self.enter_step("set_domain_pack")
domain = self.pack_domain(domain_str)
self.enter_step('set_domain_fp')
self.enter_step("set_domain_fp")
source_match = self.get_match(source)
is_first_party = source_match.first_party
self.enter_step('set_domain_brws')
self.enter_step("set_domain_brws")
dic = self.domtree
dupplicate = False
for part in domain.parts:
@ -742,21 +730,14 @@ class Database(Profiler):
dupplicate=dupplicate,
)
def set_hostname(self,
*args: typing.Any, **kwargs: typing.Any
) -> None:
def set_hostname(self, *args: typing.Any, **kwargs: typing.Any) -> None:
self._set_domain(True, *args, **kwargs)
def set_zone(self,
*args: typing.Any, **kwargs: typing.Any
) -> None:
def set_zone(self, *args: typing.Any, **kwargs: typing.Any) -> None:
self._set_domain(False, *args, **kwargs)
def set_asn(self,
asn_str: str,
updated: int,
source: Path) -> None:
self.enter_step('set_asn')
def set_asn(self, asn_str: str, updated: int, source: Path) -> None:
self.enter_step("set_asn")
path = self.pack_asn(asn_str)
if path.asn in self.asns:
match = self.asns[path.asn]
@ -769,17 +750,14 @@ class Database(Profiler):
source,
)
def _set_ip4(self,
ip4: Ip4Path,
updated: int,
source: Path) -> None:
self.enter_step('set_ip4_fp')
def _set_ip4(self, ip4: Ip4Path, updated: int, source: Path) -> None:
self.enter_step("set_ip4_fp")
source_match = self.get_match(source)
is_first_party = source_match.first_party
self.enter_step('set_ip4_brws')
self.enter_step("set_ip4_brws")
dic = self.ip4tree
dupplicate = False
for i in range(31, 31-ip4.prefixlen, -1):
for i in range(31, 31 - ip4.prefixlen, -1):
bit = (ip4.value >> i) & 0b1
next_dic = dic.one if bit else dic.zero
if next_dic is None:
@ -800,24 +778,22 @@ class Database(Profiler):
)
self._set_ip4cache(ip4, dic)
def set_ip4address(self,
ip4address_str: str,
*args: typing.Any, **kwargs: typing.Any
) -> None:
self.enter_step('set_ip4add_val')
def set_ip4address(
self, ip4address_str: str, *args: typing.Any, **kwargs: typing.Any
) -> None:
self.enter_step("set_ip4add_val")
if not Database.validate_ip4address(ip4address_str):
raise ValueError(f"Invalid ip4address: {ip4address_str}")
self.enter_step('set_ip4add_pack')
self.enter_step("set_ip4add_pack")
ip4 = self.pack_ip4address(ip4address_str)
self._set_ip4(ip4, *args, **kwargs)
def set_ip4network(self,
ip4network_str: str,
*args: typing.Any, **kwargs: typing.Any
) -> None:
self.enter_step('set_ip4net_val')
def set_ip4network(
self, ip4network_str: str, *args: typing.Any, **kwargs: typing.Any
) -> None:
self.enter_step("set_ip4net_val")
if not Database.validate_ip4network(ip4network_str):
raise ValueError(f"Invalid ip4network: {ip4network_str}")
self.enter_step('set_ip4net_pack')
self.enter_step("set_ip4net_pack")
ip4 = self.pack_ip4network(ip4network_str)
self._set_ip4(ip4, *args, **kwargs)

db.py (38 changed lines)
View file

@ -5,29 +5,37 @@ import database
import time
import os
if __name__ == '__main__':
if __name__ == "__main__":
# Parsing arguments
parser = argparse.ArgumentParser(
description="Database operations")
parser = argparse.ArgumentParser(description="Database operations")
parser.add_argument(
'-i', '--initialize', action='store_true',
help="Reconstruct the whole database")
"-i", "--initialize", action="store_true", help="Reconstruct the whole database"
)
parser.add_argument(
'-p', '--prune', action='store_true',
help="Remove old entries from database")
"-p", "--prune", action="store_true", help="Remove old entries from database"
)
parser.add_argument(
'-b', '--prune-base', action='store_true',
"-b",
"--prune-base",
action="store_true",
help="With --prune, only prune base rules "
"(the ones added by ./feed_rules.py)")
"(the ones added by ./feed_rules.py)",
)
parser.add_argument(
'-s', '--prune-before', type=int,
default=(int(time.time()) - 60*60*24*31*6),
"-s",
"--prune-before",
type=int,
default=(int(time.time()) - 60 * 60 * 24 * 31 * 6),
help="With --prune, only rules updated before "
"this UNIX timestamp will be deleted")
"this UNIX timestamp will be deleted",
)
parser.add_argument(
'-r', '--references', action='store_true',
help="DEBUG: Update the reference count")
"-r",
"--references",
action="store_true",
help="DEBUG: Update the reference count",
)
args = parser.parse_args()
if not args.initialize:
@ -37,7 +45,7 @@ if __name__ == '__main__':
os.unlink(database.Database.PATH)
DB = database.Database()
DB.enter_step('main')
DB.enter_step("main")
if args.prune:
DB.prune(before=args.prune_before, base_only=args.prune_base)
if args.references:
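The `--prune-before` default above is roughly six months expressed in seconds; the script's prune path boils down to the following sketch:
```
# Rough equivalent of `./db.py --prune` with the default cutoff.
import time
import database

DB = database.Database()
DB.prune(before=int(time.time()) - 60 * 60 * 24 * 31 * 6)   # ~6 months = 16 070 400 s
DB.save()
```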

dist/README.md (vendored, 7 changed lines)
View file

@ -35,7 +35,7 @@ This list is an inventory of every `somestring.website1.com` found to allow non
### First-party trackers
**Recommended for hostfiles-based ad blockers, such as [Pi-hole](https://pi-hole.net/).**
**Recommended for hostfiles-based ad blockers, such as [Pi-hole](https://pi-hole.net/) (&lt;v5.0, as it introduced CNAME blocking).**
**Recommended for Android ad blockers as applications, such as [Blokada](https://blokada.org/).**
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
@ -50,7 +50,7 @@ Don't be afraid of the size of the list, as this is due to the nature of first-p
### First-party only trackers
**Recommended for ad blockers as web browser extensions, such as [uBlock Origin](https://pi-hole.net/).**
**Recommended for ad blockers as web browser extensions, such as [uBlock Origin](https://ublockorigin.com/) (&lt;v1.25.0 or for Chromium-based browsers, as it introduced CNAME uncloaking for Firefox).**
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/firstparty-only-trackers.txt>
@ -98,10 +98,10 @@ Some of the first-party tracker included in this list have been found by:
- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors
- Yuki2718 from [Wilders Security Forums](https://www.wilderssecurity.com/threads/ublock-a-lean-and-fast-blocker.365273/page-168#post-2880361)
- Ha Dao, Johan Mazel, and Kensuke Fukuda, ["Characterizing CNAME Cloaking-Based Tracking on the Web", Proceedings of IFIP/IEEE Traffic Measurement Analysis Conference (TMA), 9 pages, 2020.](https://tma.ifip.org/2020/wp-content/uploads/sites/9/2020/06/tma2020-camera-paper66.pdf)
- AdGuard and [their blocklist](https://github.com/AdguardTeam/cname-trackers)'s contributors
The list was generated using data from
- [Rapid7 OpenData](https://opendata.rapid7.com/sonar.fdns_v2/), who kindly provided a free account
- [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html)
- [Public DNS Server List](https://public-dns.info/)
@ -110,4 +110,5 @@ Similar projects:
- [NextDNS blocklist](https://github.com/nextdns/cname-cloaking-blocklist): for DNS-aware ad blockers
- [Stefan Froberg's lists](https://www.orwell1984.today/cname/): subset of those lists grouped by tracker
- [AdGuard blocklist](https://github.com/AdguardTeam/cname-trackers): same thing with a bigger scope, maintained by a bigger team

View file

@ -8,7 +8,6 @@
./collect_subdomains.sh
./import_rules.sh
./resolve_subdomains.sh
./import_rapid7.sh
./prune.sh
./export_lists.sh
./generate_index.py

View file

@ -5,53 +5,80 @@ import argparse
import sys
if __name__ == '__main__':
if __name__ == "__main__":
# Parsing arguments
parser = argparse.ArgumentParser(
description="Export the hostnames rules stored "
"in the Database as plain text")
description="Export the hostnames rules stored " "in the Database as plain text"
)
parser.add_argument(
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
help="Output file, one rule per line")
"-o",
"--output",
type=argparse.FileType("w"),
default=sys.stdout,
help="Output file, one rule per line",
)
parser.add_argument(
'-f', '--first-party', action='store_true',
help="Only output rules issued from first-party sources")
"-f",
"--first-party",
action="store_true",
help="Only output rules issued from first-party sources",
)
parser.add_argument(
'-e', '--end-chain', action='store_true',
help="Only output rules that are not referenced by any other")
"-e",
"--end-chain",
action="store_true",
help="Only output rules that are not referenced by any other",
)
parser.add_argument(
'-r', '--rules', action='store_true',
help="Output all kinds of rules, not just hostnames")
"-r",
"--rules",
action="store_true",
help="Output all kinds of rules, not just hostnames",
)
parser.add_argument(
'-b', '--base-rules', action='store_true',
"-b",
"--base-rules",
action="store_true",
help="Output base rules "
"(the ones added by ./feed_rules.py) "
"(implies --rules)")
"(implies --rules)",
)
parser.add_argument(
'-d', '--no-dupplicates', action='store_true',
"-d",
"--no-dupplicates",
action="store_true",
help="Do not output rules that already match a zone/network rule "
"(e.g. dummy.example.com when there's a zone example.com rule)")
"(e.g. dummy.example.com when there's a zone example.com rule)",
)
parser.add_argument(
'-x', '--explain', action='store_true',
"-x",
"--explain",
action="store_true",
help="Show the chain of rules leading to one "
"(and the number of references they have)")
"(and the number of references they have)",
)
parser.add_argument(
'-c', '--count', action='store_true',
help="Show the number of rules per type instead of listing them")
"-c",
"--count",
action="store_true",
help="Show the number of rules per type instead of listing them",
)
args = parser.parse_args()
DB = database.Database()
if args.count:
assert not args.explain
print(DB.count_records(
first_party_only=args.first_party,
end_chain_only=args.end_chain,
no_dupplicates=args.no_dupplicates,
rules_only=args.base_rules,
hostnames_only=not (args.rules or args.base_rules),
))
print(
DB.count_records(
first_party_only=args.first_party,
end_chain_only=args.end_chain,
no_dupplicates=args.no_dupplicates,
rules_only=args.base_rules,
hostnames_only=not (args.rules or args.base_rules),
)
)
else:
for domain in DB.list_records(
first_party_only=args.first_party,

View file

@ -76,7 +76,7 @@ do
echo "# Oldest record: $oldest_date"
echo "# Number of source websites: $number_websites"
echo "# Number of source subdomains: $number_subdomains"
echo "# Number of source DNS records: ~2E9 + $number_dns"
echo "# Number of source DNS records: $number_dns"
echo "#"
echo "# Input rules: $rules_input"
echo "# Subsequent rules: $rules_found"

View file

@ -13,57 +13,54 @@ IPNetwork = typing.Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
def get_ranges(asn: str) -> typing.Iterable[str]:
req = requests.get(
'https://stat.ripe.net/data/as-routing-consistency/data.json',
params={'resource': asn}
"https://stat.ripe.net/data/as-routing-consistency/data.json",
params={"resource": asn},
)
data = req.json()
for pref in data['data']['prefixes']:
yield pref['prefix']
for pref in data["data"]["prefixes"]:
yield pref["prefix"]
def get_name(asn: str) -> str:
req = requests.get(
'https://stat.ripe.net/data/as-overview/data.json',
params={'resource': asn}
"https://stat.ripe.net/data/as-overview/data.json", params={"resource": asn}
)
data = req.json()
return data['data']['holder']
return data["data"]["holder"]
if __name__ == '__main__':
if __name__ == "__main__":
log = logging.getLogger('feed_asn')
log = logging.getLogger("feed_asn")
# Parsing arguments
parser = argparse.ArgumentParser(
description="Add the IP ranges associated to the AS in the database")
description="Add the IP ranges associated to the AS in the database"
)
args = parser.parse_args()
DB = database.Database()
def add_ranges(path: database.Path,
match: database.Match,
) -> None:
def add_ranges(
path: database.Path,
match: database.Match,
) -> None:
assert isinstance(path, database.AsnPath)
assert isinstance(match, database.AsnNode)
asn_str = database.Database.unpack_asn(path)
DB.enter_step('asn_get_name')
DB.enter_step("asn_get_name")
name = get_name(asn_str)
match.name = name
DB.enter_step('asn_get_ranges')
DB.enter_step("asn_get_ranges")
for prefix in get_ranges(asn_str):
parsed_prefix: IPNetwork = ipaddress.ip_network(prefix)
if parsed_prefix.version == 4:
DB.set_ip4network(
prefix,
source=path,
updated=int(time.time())
)
log.info('Added %s from %s (%s)', prefix, path, name)
DB.set_ip4network(prefix, source=path, updated=int(time.time()))
log.info("Added %s from %s (%s)", prefix, path, name)
elif parsed_prefix.version == 6:
log.warning('Unimplemented prefix version: %s', prefix)
log.warning("Unimplemented prefix version: %s", prefix)
else:
log.error('Unknown prefix version: %s', prefix)
log.error("Unknown prefix version: %s", prefix)
for _ in DB.exec_each_asn(add_ranges):
pass
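`get_name` and `get_ranges` above are thin wrappers around two RIPEstat endpoints; calling them directly looks like this (the AS number is only an example):
```
# Direct use of the helpers above; AS3333 (RIPE NCC) is an example value.
import feed_asn

print(feed_asn.get_name("AS3333"))
for prefix in feed_asn.get_ranges("AS3333"):
    print(prefix)          # announced prefixes, IPv4 and IPv6
```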

View file

@ -12,15 +12,15 @@ Record = typing.Tuple[typing.Callable, typing.Callable, int, str, str]
# select, write
FUNCTION_MAP: typing.Any = {
'a': (
"a": (
database.Database.get_ip4,
database.Database.set_hostname,
),
'cname': (
"cname": (
database.Database.get_domain,
database.Database.set_hostname,
),
'ptr': (
"ptr": (
database.Database.get_domain,
database.Database.set_ip4address,
),
@ -28,15 +28,16 @@ FUNCTION_MAP: typing.Any = {
class Writer(multiprocessing.Process):
def __init__(self,
recs_queue: multiprocessing.Queue = None,
autosave_interval: int = 0,
ip4_cache: int = 0,
):
def __init__(
self,
recs_queue: multiprocessing.Queue = None,
autosave_interval: int = 0,
ip4_cache: int = 0,
):
if recs_queue: # MP
super(Writer, self).__init__()
self.recs_queue = recs_queue
self.log = logging.getLogger(f'wr')
self.log = logging.getLogger("wr")
self.autosave_interval = autosave_interval
self.ip4_cache = ip4_cache
if not recs_queue: # No MP
@ -44,11 +45,11 @@ class Writer(multiprocessing.Process):
def open_db(self) -> None:
self.db = database.Database()
self.db.log = logging.getLogger(f'wr')
self.db.log = logging.getLogger("wr")
self.db.fill_ip4cache(max_size=self.ip4_cache)
def exec_record(self, record: Record) -> None:
self.db.enter_step('exec_record')
self.db.enter_step("exec_record")
select, write, updated, name, value = record
try:
for source in select(self.db, value):
@ -59,7 +60,7 @@ class Writer(multiprocessing.Process):
self.log.exception("Cannot execute: %s", record)
def end(self) -> None:
self.db.enter_step('end')
self.db.enter_step("end")
self.db.save()
def run(self) -> None:
@ -69,10 +70,11 @@ class Writer(multiprocessing.Process):
else:
next_save = 0
self.db.enter_step('block_wait')
self.db.enter_step("block_wait")
block: typing.List[Record]
for block in iter(self.recs_queue.get, None):
assert block
record: Record
for record in block:
self.exec_record(record)
@ -83,20 +85,21 @@ class Writer(multiprocessing.Process):
self.log.info("Done!")
next_save = time.time() + self.autosave_interval
self.db.enter_step('block_wait')
self.db.enter_step("block_wait")
self.end()
class Parser():
def __init__(self,
buf: typing.Any,
recs_queue: multiprocessing.Queue = None,
block_size: int = 0,
writer: Writer = None,
):
class Parser:
def __init__(
self,
buf: typing.Any,
recs_queue: multiprocessing.Queue = None,
block_size: int = 0,
writer: Writer = None,
):
assert bool(writer) ^ bool(block_size and recs_queue)
self.buf = buf
self.log = logging.getLogger('pr')
self.log = logging.getLogger("pr")
self.recs_queue = recs_queue
if writer: # No MP
self.prof: database.Profiler = writer.db
@ -105,14 +108,14 @@ class Parser():
self.block: typing.List[Record] = list()
self.block_size = block_size
self.prof = database.Profiler()
self.prof.log = logging.getLogger('pr')
self.prof.log = logging.getLogger("pr")
self.register = self.add_to_queue
def add_to_queue(self, record: Record) -> None:
self.prof.enter_step('register')
self.prof.enter_step("register")
self.block.append(record)
if len(self.block) >= self.block_size:
self.prof.enter_step('put_block')
self.prof.enter_step("put_block")
assert self.recs_queue
self.recs_queue.put(self.block)
self.block = list()
@ -127,45 +130,17 @@ class Parser():
raise NotImplementedError
class Rapid7Parser(Parser):
def consume(self) -> None:
data = dict()
for line in self.buf:
self.prof.enter_step('parse_rapid7')
split = line.split('"')
try:
for k in range(1, 14, 4):
key = split[k]
val = split[k+2]
data[key] = val
select, writer = FUNCTION_MAP[data['type']]
record = (
select,
writer,
int(data['timestamp']),
data['name'],
data['value']
)
except (IndexError, KeyError):
# IndexError: missing field
# KeyError: Unknown type field
self.log.exception("Cannot parse: %s", line)
self.register(record)
class MassDnsParser(Parser):
# massdns --output Snrql
# --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
TYPES = {
'A': (FUNCTION_MAP['a'][0], FUNCTION_MAP['a'][1], -1, None),
"A": (FUNCTION_MAP["a"][0], FUNCTION_MAP["a"][1], -1, None),
# 'AAAA': (FUNCTION_MAP['aaaa'][0], FUNCTION_MAP['aaaa'][1], -1, None),
'CNAME': (FUNCTION_MAP['cname'][0], FUNCTION_MAP['cname'][1], -1, -1),
"CNAME": (FUNCTION_MAP["cname"][0], FUNCTION_MAP["cname"][1], -1, -1),
}
def consume(self) -> None:
self.prof.enter_step('parse_massdns')
self.prof.enter_step("parse_massdns")
timestamp = 0
header = True
for line in self.buf:
@ -174,14 +149,15 @@ class MassDnsParser(Parser):
header = True
continue
split = line.split(' ')
split = line.split(" ")
try:
if header:
timestamp = int(split[1])
header = False
else:
select, write, name_offset, value_offset = \
MassDnsParser.TYPES[split[1]]
select, write, name_offset, value_offset = MassDnsParser.TYPES[
split[1]
]
record = (
select,
write,
@ -190,75 +166,85 @@ class MassDnsParser(Parser):
split[2][:value_offset].lower(),
)
self.register(record)
self.prof.enter_step('parse_massdns')
self.prof.enter_step("parse_massdns")
except KeyError:
continue
PARSERS = {
'rapid7': Rapid7Parser,
'massdns': MassDnsParser,
"massdns": MassDnsParser,
}
if __name__ == '__main__':
if __name__ == "__main__":
# Parsing arguments
log = logging.getLogger('feed_dns')
log = logging.getLogger("feed_dns")
args_parser = argparse.ArgumentParser(
description="Read DNS records and import "
"tracking-relevant data into the database")
"tracking-relevant data into the database"
)
args_parser.add_argument("parser", choices=PARSERS.keys(), help="Input format")
args_parser.add_argument(
'parser',
choices=PARSERS.keys(),
help="Input format")
"-i",
"--input",
type=argparse.FileType("r"),
default=sys.stdin,
help="Input file",
)
args_parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="Input file")
"-b", "--block-size", type=int, default=1024, help="Performance tuning value"
)
args_parser.add_argument(
'-b', '--block-size', type=int, default=1024,
help="Performance tuning value")
"-q", "--queue-size", type=int, default=128, help="Performance tuning value"
)
args_parser.add_argument(
'-q', '--queue-size', type=int, default=128,
help="Performance tuning value")
"-a",
"--autosave-interval",
type=int,
default=900,
help="Interval to which the database will save in seconds. " "0 to disable.",
)
args_parser.add_argument(
'-a', '--autosave-interval', type=int, default=900,
help="Interval to which the database will save in seconds. "
"0 to disable.")
"-s",
"--single-process",
action="store_true",
help="Only use one process. " "Might be useful for single core computers.",
)
args_parser.add_argument(
'-s', '--single-process', action='store_true',
help="Only use one process. "
"Might be useful for single core computers.")
args_parser.add_argument(
'-4', '--ip4-cache', type=int, default=0,
"-4",
"--ip4-cache",
type=int,
default=0,
help="RAM cache for faster IPv4 lookup. "
"Maximum useful value: 512 MiB (536870912). "
"Warning: Depending on the rules, this might already "
"be a memory-heavy process, even without the cache.")
"be a memory-heavy process, even without the cache.",
)
args = args_parser.parse_args()
parser_cls = PARSERS[args.parser]
if args.single_process:
writer = Writer(
autosave_interval=args.autosave_interval,
ip4_cache=args.ip4_cache
autosave_interval=args.autosave_interval, ip4_cache=args.ip4_cache
)
parser = parser_cls(args.input, writer=writer)
parser.run()
writer.end()
else:
recs_queue: multiprocessing.Queue = multiprocessing.Queue(
maxsize=args.queue_size)
maxsize=args.queue_size
)
writer = Writer(recs_queue,
autosave_interval=args.autosave_interval,
ip4_cache=args.ip4_cache
)
writer = Writer(
recs_queue,
autosave_interval=args.autosave_interval,
ip4_cache=args.ip4_cache,
)
writer.start()
parser = parser_cls(args.input,
recs_queue=recs_queue,
block_size=args.block_size
)
parser = parser_cls(
args.input, recs_queue=recs_queue, block_size=args.block_size
)
parser.run()
recs_queue.put(None)

View file

@ -4,30 +4,36 @@ import database
import argparse
import sys
import time
import typing
FUNCTION_MAP = {
'zone': database.Database.set_zone,
'hostname': database.Database.set_hostname,
'asn': database.Database.set_asn,
'ip4network': database.Database.set_ip4network,
'ip4address': database.Database.set_ip4address,
"zone": database.Database.set_zone,
"hostname": database.Database.set_hostname,
"asn": database.Database.set_asn,
"ip4network": database.Database.set_ip4network,
"ip4address": database.Database.set_ip4address,
}
if __name__ == '__main__':
if __name__ == "__main__":
# Parsing arguments
parser = argparse.ArgumentParser(
description="Import base rules to the database")
parser = argparse.ArgumentParser(description="Import base rules to the database")
parser.add_argument(
'type',
choices=FUNCTION_MAP.keys(),
help="Type of rule inputed")
"type", choices=FUNCTION_MAP.keys(), help="Type of rule inputed"
)
parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="File with one rule per line")
"-i",
"--input",
type=argparse.FileType("r"),
default=sys.stdin,
help="File with one rule per line",
)
parser.add_argument(
'-f', '--first-party', action='store_true',
help="The input only comes from verified first-party sources")
"-f",
"--first-party",
action="store_true",
help="The input only comes from verified first-party sources",
)
args = parser.parse_args()
DB = database.Database()
@ -43,11 +49,12 @@ if __name__ == '__main__':
for rule in args.input:
rule = rule.strip()
try:
fun(DB,
fun(
DB,
rule,
source=source,
updated=int(time.time()),
)
)
except ValueError:
DB.log.error(f"Could not add rule: {rule}")

View file

@ -2,11 +2,9 @@
import markdown2
extras = [
"header-ids"
]
extras = ["header-ids"]
with open('dist/README.md', 'r') as fdesc:
with open("dist/README.md", "r") as fdesc:
body = markdown2.markdown(fdesc.read(), extras=extras)
output = f"""<!DOCTYPE html>
@ -23,5 +21,5 @@ output = f"""<!DOCTYPE html>
</html>
"""
with open('dist/index.html', 'w') as fdesc:
with open("dist/index.html", "w") as fdesc:
fdesc.write(output)

View file

@ -1,81 +0,0 @@
#!/usr/bin/env bash
source .env.default
source .env
function log() {
echo -e "\033[33m$@\033[0m"
}
function api_call {
curl -s -H "X-Api-Key: $RAPID7_API_KEY" "https://us.api.insight.rapid7.com/opendata/studies/$1/"
}
function get_timestamp { # study, dataset
study="$1"
dataset="$2"
if [ -z "$RAPID7_API_KEY" ]
then
line=$(curl -s "https://opendata.rapid7.com/$study/" | grep "href=\".\+-$dataset.json.gz\"" | head -1)
echo "$line" | cut -d'"' -f2 | cut -d'/' -f3 | cut -d'-' -f4
else
filename=$(api_call "$study" | jq '.sonarfile_set[]' -r | grep "${dataset}.json.gz" | sort | tail -1)
echo $filename | cut -d'-' -f4
fi
}
function get_download_url { # study, dataset
study="$1"
dataset="$2"
if [ -z "$RAPID7_API_KEY" ]
then
line=$(curl -s "https://opendata.rapid7.com/$study/" | grep "href=\".\+-$dataset.json.gz\"" | head -1)
echo "https://opendata.rapid7.com$(echo "$line" | cut -d'"' -f2)"
else
filename=$(api_call "$study" | jq '.sonarfile_set[]' -r | grep "${dataset}.json.gz" | sort | tail -1)
api_call "$study/$filename/download" | jq '.url' -r
fi
}
function feed_rapid7 { # study, dataset, rule_file, ./feed_dns args
# The dataset will be imported if:
# none of this dataset was ever imported
# or
# the last dataset imported is older than the one to be imported
# or
# the rule_file is newer than when the last dataset was imported
#
# (note the difference between the age oft the dataset itself and
# the date when it is imported)
study="$1"
dataset="$2"
rule_file="$3"
shift; shift; shift
new_ts="$(get_timestamp $study $dataset)"
old_ts_file="last_updates/rapid7_${study}_${dataset}.txt"
if [ -f "$old_ts_file" ]
then
old_ts=$(cat "$old_ts_file")
else
old_ts="0"
fi
if [ $new_ts -gt $old_ts ] || [ $rule_file -nt $old_ts_file ]
then
link="$(get_download_url $study $dataset)"
log "Reading $dataset dataset from $link ($old_ts -> $new_ts)…"
[ $SINGLE_PROCESS -eq 1 ] && EXTRA_ARGS="--single-process"
curl -L "$link" | gunzip | ./feed_dns.py rapid7 $@ $EXTRA_ARGS
if [ $? -eq 0 ]
then
echo $new_ts > $old_ts_file
fi
else
log "Skipping $dataset as there is no new version since $old_ts"
fi
}
# feed_rapid7 sonar.rdns_v2 rdns rules_asn/first-party.list
feed_rapid7 sonar.fdns_v2 fdns_a rules_asn/first-party.list --ip4-cache "$CACHE_SIZE"
# feed_rapid7 sonar.fdns_v2 fdns_aaaa rules_asn/first-party.list --ip6-cache "$CACHE_SIZE"
feed_rapid7 sonar.fdns_v2 fdns_cname rules/first-party.list

requirements.txt (new file, 4 changed lines)
View file

@ -0,0 +1,4 @@
coloredlogs>=10
markdown2>=2.4<3
numpy>=1.21<2
python-abp>=0.2<0.3

View file

@ -12,10 +12,11 @@ storetail.io
# Keyade
keyade.com
# Adobe Experience Cloud
# https://experienceleague.adobe.com/docs/analytics/implementation/vars/config-vars/trackingserversecure.html?lang=en#ssl-tracking-server-in-adobe-experience-platform-launch
omtrdc.net
2o7.net
# ThreatMetrix
online-metrix.net
data.adobedc.net
sc.adobedc.net
# Webtrekk
wt-eu02.net
webtrekk.net
@ -36,10 +37,10 @@ a88045584548111e997c60ac8a4ec150-1610510072.eu-central-1.elb.amazonaws.com
afc4d9aa2a91d11e997c60ac8a4ec150-2082092489.eu-central-1.elb.amazonaws.com
# A8
trck.a8.net
# Ebis
# AD EBiS
# https://prtimes.jp/main/html/rd/p/000000215.000009812.html
ebis.ne.jp
# Geniesspv
# GENIEE
genieesspv.jp
# SP-Prod
sp-prod.net
@ -55,3 +56,36 @@ extole.com
hs.eloqua.com
# segment.com
xid.segment.com
# exponea.com
exponea.com
# adclear.net
adclear.net
# contentsfeed.com
contentsfeed.com
# postaffiliatepro.com
postaffiliatepro.com
# Sugar Market (Salesfusion)
msgapp.com
# Exactag
exactag.com
# GMO Internet Group
ad-cloud.jp
# Pardot
pardot.com
# Fathom
# https://usefathom.com/docs/settings/custom-domains
starman.fathomdns.com
# Lead Forensics
# https://www.reddit.com/r/pihole/comments/g7qv3e/leadforensics_tracking_domains_blacklist/
# No real-world data but the website doesn't hide what it does
ghochv3eng.trafficmanager.net
# Branch.io
thirdparty.bnc.lt
# Plausible.io
custom.plausible.io
# DataUnlocker
# Bit different as it is a proxy to non first-party trackers scripts
# but it fits I guess.
smartproxy.dataunlocker.com
# SAS
ci360.sas.com

View file

@ -4,8 +4,6 @@ AS50234
AS44788
AS19750
AS55569
# ThreatMetrix
AS30286
# Webtrekk
AS60164
# Act-On Software

View file

@ -57,7 +57,11 @@ if __name__ == "__main__":
perc_all = (100 * pass_all / count_all) if count_all else 100
perc_den = (100 * pass_den / count_den) if count_den else 100
log.info(
"%s: Entries %d/%d (%.2f%%) | Allow %d/%d (%.2f%%) | Deny %d/%d (%.2f%%)",
(
"%s: Entries %d/%d (%.2f%%)"
" | Allow %d/%d (%.2f%%)"
"| Deny %d/%d (%.2f%%)"
),
filename,
pass_ent,
count_ent,

View file

@ -1,7 +1,6 @@
url,allow,deny,comment
https://www.red-by-sfr.fr/,static.s-sfr.fr,nrg.red-by-sfr.fr,Eulerian
https://www.cbc.ca/,,smetrics.cbc.ca,2o7 | Ominuture | Adobe Experience Cloud
https://www.discover.com/,,content.discover.com,ThreatMetrix
https://www.mytoys.de/,,web.mytoys.de,Webtrekk
https://www.baur.de/,,tp.baur.de,Otto Group
https://www.liligo.com/,,compare.liligo.com,???
@ -9,9 +8,21 @@ https://www.boulanger.com/,,tag.boulanger.fr,TagCommander
https://www.airfrance.fr/FR/,,tk.airfrance.fr,Wizaly
https://www.vsgamers.es/,,marketing.net.vsgamers.es,Affex
https://www.vacansoleil.fr/,,tdep.vacansoleil.fr,TraceDock
https://www.ozmall.co.jp/,,js.enhance.co.jp,Genieesspv
https://www.ozmall.co.jp/,,js.enhance.co.jp,GENIEE
https://www.thetimes.co.uk/,,cmp.thetimes.co.uk,SP-Prod
https://agilent.com/,,seahorseinfo.agilent.com,Act-On Software
https://halifax.co.uk/,,cem.halifax.co.uk,eum-appdynamics.com
https://www.reallygoodstuff.com/,,refer.reallygoodstuff.com,Extole
https://unity.com/,,eloqua-trackings.unity.com,Eloqua
https://www.notino.gr/,,api.campaigns.notino.com,Exponea
https://www.mytoys.de/,,0815.mytoys.de.adclear.net
https://www.imbc.com/,,ads.imbc.com.contentsfeed.com
https://www.cbdbiocare.com/,,affiliate.cbdbiocare.com,postaffiliatepro.com
https://www.seatadvisor.com/,,marketing.seatadvisor.com,Sugar Market (Salesfusion)
https://www.tchibo.de/,,tagm.tchibo.de,Exactag
https://www.bouygues-immobilier.com/,,go.bouygues-immobilier.fr,Pardot
https://caddyserver.com/,,mule.caddysever.com,Fathom
Reddit.com mail notifications,,click.redditmail.com,Branch.io
https://www.phpliveregex.com/,,yolo.phpliveregex.xom,Plausible.io
https://www.earthclassmail.com/,,1avhg3kanx9.www.earthclassmail.com,DataUnlocker
https://paulfredrick.com/,,execution-ci360.paulfredrick.com,SAS
