Added intermediate representation for DNS datasets

The intermediate representation (IR) is plain CSV.
The DNS entries in the datasets are not ordered consistently,
so they need to be parsed completely.
Converting to the IR before sending the data to ./feed_dns.py
through a pipe seems to be faster than decoding the JSON in ./feed_dns.py.
This also reduces the storage used by the resolved subdomains by
about 15% (compressed).
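
The conversion described above could look roughly like the sketch below. It is not the code from this commit. It assumes that the datasets are JSON lines, that "not ordered consistently" refers to the key order inside each record, and that the records carry the hypothetical fields `type`, `name` and `value`. The function name and the column order are also made up for illustration.

```python
import csv
import io
import json


def json_lines_to_csv(lines, out, fields=("type", "name", "value")):
    """Convert JSON-lines DNS records to CSV rows with a fixed column order.

    The field names are assumptions, not taken from the actual datasets.
    """
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Full JSON parse: the key order inside the object does not matter
        record = json.loads(line)
        writer.writerow([record[field] for field in fields])
        count += 1
    return count


# Two records with the same fields but a different key order
sample = [
    '{"name": "www.example.com", "type": "a", "value": "192.0.2.1"}',
    '{"value": "example.net", "name": "cdn.example.com", "type": "cname"}',
]
buffer = io.StringIO()
converted = json_lines_to_csv(sample, buffer)
csv_text = buffer.getvalue()
```

With a fixed column order, the consumer at the other end of the pipe only has to split each line instead of decoding a JSON object, which is presumably where the speed-up comes from.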
Geoffrey Frogeye 2019-12-13 21:59:35 +01:00
parent 269b8278b5
commit 5023b85d7c
Signed by: geoffrey
GPG key ID: D8A7ECA00A8CD3DD
5 changed files with 68 additions and 31 deletions

@@ -6,7 +6,7 @@ function log() {
 log "Compiling locally known subdomain…"
 # Sort by last character to utilize the DNS server caching mechanism
-pv subdomains/*.list | rev | sort -u | rev > temp/all_subdomains.list
+pv subdomains/*.list | sed 's/\r$//' | rev | sort -u | rev > temp/all_subdomains.list
 log "Resolving locally known subdomain…"
-pv temp/all_subdomains.list | ./resolve_subdomains.py --output temp/all_resolved.json
+pv temp/all_subdomains.list | ./resolve_subdomains.py --output temp/all_resolved.csv