Remove support for Rapid7

They changed their privacy/pricing model, and as a result I no longer
have access to their massive DNS dataset,
even after asking.

Since 2022-01-02, I have kept the list frozen while looking for an alternative,
but couldn't find any.
To get the list updating again with the remaining DNS sources I have,
the last version of the list generated with the Rapid7 dataset is now
used as a subdomain input, which gets resolved with MassDNS.
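
A minimal sketch of that bootstrap, assuming the generated hosts list lives at
`dist/firstparty-trackers-hosts.txt` and that subdomain sources are plain
`.list` files under `subdomains/` (these names are assumptions, not a
guaranteed match for the repository layout):

```
# Hypothetical seeding of the subdomain sources with the last Rapid7-era list;
# paths are assumptions, adjust to the actual repository layout.
grep -v '^#' dist/firstparty-trackers-hosts.txt \
    | awk 'NF >= 2 { print $2 }' \
    > subdomains/rapid7-legacy.list
./resolve_subdomains.sh  # re-resolves every subdomain source with MassDNS
```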
Geoffrey Frogeye 2022-11-13 20:10:27 +01:00
parent 49a36f32f2
commit 3b6f7a58b3
Signed by: geoffrey
GPG key ID: C72403E7F82E6AD8
7 changed files with 2 additions and 132 deletions


@@ -1,4 +1,3 @@
-RAPID7_API_KEY=
 CACHE_SIZE=536870912
 MASSDNS_HASHMAP_SIZE=1000
 PROFILE=0


@@ -18,7 +18,7 @@ This program takes as input:
 It will be able to output hostnames being a DNS redirection to any item in the lists provided.
-DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).
+DNS records can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).
 Those subdomains can either be provided as is, come from [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening an URL (the program provides utility to do all that).
@@ -41,7 +41,6 @@ Depending on the sources you'll be using to generate the list, you'll need to in
 - [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
 - [numpy](https://www.numpy.org/)
 - [python-abp](https://pypi.org/project/python-abp/) (only if you intend to use AdBlock rules as a rule source)
-- [jq](http://stedolan.github.io/jq/) (only if you have a Rapid7 API key)
 - [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
 - [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
 - [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
@@ -135,22 +134,6 @@ Note that this is a network intensive process, not in term of bandwith, but in t
 The DNS records will automatically be imported into the database.
 If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.
-### Import DNS records from Rapid7
-If you have a Rapid7 Organization API key, make sure to append to `.env`:
-```
-RAPID7_API_KEY=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
-```
-Then, run `./import_rapid7.sh`.
-This will download about 35 GiB of data the first time, but only the matching records will be stored (about a few MiB for the tracking rules).
-Note the download speed will most likely be limited by the database operation thoughput (a quick RAM will help).
-The script remembers which were the last sets downloaded, and will only newer sets.
-If the first-party rules changed, the corresponding sets will be re-imported anyway.
-If you want to force re-importing, run `rm last_updates/rapid7_*.txt`.
 ### Export the lists
 For the tracking list, use `./export_lists.sh`, the output will be in the `dist` folder (please change the links before distributing them).

dist/README.md vendored

@@ -102,7 +102,6 @@ Some of the first-party tracker included in this list have been found by:
 The list was generated using data from
-- [Rapid7 OpenData](https://opendata.rapid7.com/sonar.fdns_v2/), who kindly provided a free account
 - [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html)
 - [Public DNS Server List](https://public-dns.info/)


@@ -8,7 +8,6 @@
 ./collect_subdomains.sh
 ./import_rules.sh
 ./resolve_subdomains.sh
-./import_rapid7.sh
 ./prune.sh
 ./export_lists.sh
 ./generate_index.py


@@ -76,7 +76,7 @@ do
     echo "# Oldest record: $oldest_date"
     echo "# Number of source websites: $number_websites"
     echo "# Number of source subdomains: $number_subdomains"
-    echo "# Number of source DNS records: ~2E9 + $number_dns"
+    echo "# Number of source DNS records: $number_dns"
     echo "#"
     echo "# Input rules: $rules_input"
     echo "# Subsequent rules: $rules_found"


@@ -130,34 +130,6 @@ class Parser:
         raise NotImplementedError
-class Rapid7Parser(Parser):
-    def consume(self) -> None:
-        data = dict()
-        for line in self.buf:
-            self.prof.enter_step("parse_rapid7")
-            split = line.split('"')
-            try:
-                for k in range(1, 14, 4):
-                    key = split[k]
-                    val = split[k + 2]
-                    data[key] = val
-                select, writer = FUNCTION_MAP[data["type"]]
-                record = (
-                    select,
-                    writer,
-                    int(data["timestamp"]),
-                    data["name"],
-                    data["value"],
-                )
-            except (IndexError, KeyError):
-                # IndexError: missing field
-                # KeyError: Unknown type field
-                self.log.exception("Cannot parse: %s", line)
-            self.register(record)
 class MassDnsParser(Parser):
     # massdns --output Snrql
     # --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
@@ -200,7 +172,6 @@ class MassDnsParser(Parser):
 PARSERS = {
-    "rapid7": Rapid7Parser,
     "massdns": MassDnsParser,
 }
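
For reference, the resolving path that remains: the comment kept at the top of `MassDnsParser` documents the massdns flags whose output it expects. A hedged sketch of a matching invocation, piping the results into `./feed_dns.py massdns` (the `temp/` file names are placeholders, not the repository's actual paths):

```
# Hypothetical massdns run producing output in the format MassDnsParser reads;
# the temp/ paths are placeholders.
massdns --output Snrql \
    --retry REFUSED,SERVFAIL \
    --resolvers nameservers-ipv4 \
    --outfile temp/massdns-results.txt \
    temp/subdomains.list
./feed_dns.py massdns < temp/massdns-results.txt
```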


@@ -1,81 +0,0 @@
-#!/usr/bin/env bash
-source .env.default
-source .env
-function log() {
-    echo -e "\033[33m$@\033[0m"
-}
-function api_call {
-    curl -s -H "X-Api-Key: $RAPID7_API_KEY" "https://us.api.insight.rapid7.com/opendata/studies/$1/"
-}
-function get_timestamp { # study, dataset
-    study="$1"
-    dataset="$2"
-    if [ -z "$RAPID7_API_KEY" ]
-    then
-        line=$(curl -s "https://opendata.rapid7.com/$study/" | grep "href=\".\+-$dataset.json.gz\"" | head -1)
-        echo "$line" | cut -d'"' -f2 | cut -d'/' -f3 | cut -d'-' -f4
-    else
-        filename=$(api_call "$study" | jq '.sonarfile_set[]' -r | grep "${dataset}.json.gz" | sort | tail -1)
-        echo $filename | cut -d'-' -f4
-    fi
-}
-function get_download_url { # study, dataset
-    study="$1"
-    dataset="$2"
-    if [ -z "$RAPID7_API_KEY" ]
-    then
-        line=$(curl -s "https://opendata.rapid7.com/$study/" | grep "href=\".\+-$dataset.json.gz\"" | head -1)
-        echo "https://opendata.rapid7.com$(echo "$line" | cut -d'"' -f2)"
-    else
-        filename=$(api_call "$study" | jq '.sonarfile_set[]' -r | grep "${dataset}.json.gz" | sort | tail -1)
-        api_call "$study/$filename/download" | jq '.url' -r
-    fi
-}
-function feed_rapid7 { # study, dataset, rule_file, ./feed_dns args
-    # The dataset will be imported if:
-    # none of this dataset was ever imported
-    # or
-    # the last dataset imported is older than the one to be imported
-    # or
-    # the rule_file is newer than when the last dataset was imported
-    #
-    # (note the difference between the age of the dataset itself and
-    # the date when it is imported)
-    study="$1"
-    dataset="$2"
-    rule_file="$3"
-    shift; shift; shift
-    new_ts="$(get_timestamp $study $dataset)"
-    old_ts_file="last_updates/rapid7_${study}_${dataset}.txt"
-    if [ -f "$old_ts_file" ]
-    then
-        old_ts=$(cat "$old_ts_file")
-    else
-        old_ts="0"
-    fi
-    if [ $new_ts -gt $old_ts ] || [ $rule_file -nt $old_ts_file ]
-    then
-        link="$(get_download_url $study $dataset)"
-        log "Reading $dataset dataset from $link ($old_ts -> $new_ts)…"
-        [ $SINGLE_PROCESS -eq 1 ] && EXTRA_ARGS="--single-process"
-        curl -L "$link" | gunzip | ./feed_dns.py rapid7 $@ $EXTRA_ARGS
-        if [ $? -eq 0 ]
-        then
-            echo $new_ts > $old_ts_file
-        fi
-    else
-        log "Skipping $dataset as there is no new version since $old_ts"
-    fi
-}
-# feed_rapid7 sonar.rdns_v2 rdns rules_asn/first-party.list
-feed_rapid7 sonar.fdns_v2 fdns_a rules_asn/first-party.list --ip4-cache "$CACHE_SIZE"
-# feed_rapid7 sonar.fdns_v2 fdns_aaaa rules_asn/first-party.list --ip6-cache "$CACHE_SIZE"
-feed_rapid7 sonar.fdns_v2 fdns_cname rules/first-party.list