Added intermediate representation for DNS datasets
It's just CSV. The DNS records in the datasets are not ordered consistently, so we need to parse them completely. Converting to an IR before sending data to ./feed_dns.py through a pipe turns out to be faster than decoding the JSON inside ./feed_dns.py. This also reduces the storage of the resolved subdomains by about 15% (compressed).
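A minimal sketch of what the `./json_to_csv.py` converter could look like: it reads one JSON object per line from stdin and emits a flat CSV row. The field names (`timestamp`, `name`, `type`, `value`) are assumptions for illustration; the real dataset schema may differ.

```python
#!/usr/bin/env python3
# Hypothetical sketch of ./json_to_csv.py: convert JSON-lines DNS records
# into a simple CSV intermediate representation.
# NOTE: the field names below (timestamp, name, type, value) are assumed,
# not taken from the actual dataset.
import csv
import json
import sys


def json_to_csv(in_file=sys.stdin, out_file=sys.stdout):
    writer = csv.writer(out_file)
    for line in in_file:
        record = json.loads(line)
        writer.writerow([
            record.get("timestamp", ""),
            record["name"],
            record["type"],
            record["value"],
        ])


if __name__ == "__main__":
    json_to_csv()
```

The downstream consumer then only has to split commas instead of running a full JSON decode per line, which is where the speed-up would come from.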
parent 269b8278b5
commit 5023b85d7c
5 changed files with 68 additions and 31 deletions
@@ -9,11 +9,11 @@ function log() {
 # TODO Fetch 'em
 log "Reading PTR records…"
-pv ptr.json.gz | gunzip | ./feed_dns.py
+pv ptr.json.gz | gunzip | ./json_to_csv.py | ./feed_dns.py
 log "Reading A records…"
-pv a.json.gz | gunzip | ./feed_dns.py
+pv a.json.gz | gunzip | ./json_to_csv.py | ./feed_dns.py
 log "Reading CNAME records…"
-pv cname.json.gz | gunzip | ./feed_dns.py
+pv cname.json.gz | gunzip | ./json_to_csv.py | ./feed_dns.py

 log "Pruning old data…"
 ./database.py --prune
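On the consuming side, a sketch of how `./feed_dns.py` could read the CSV intermediate representation from the pipe. The column layout (`timestamp`, `name`, `type`, `value`) is an assumption matching the hypothetical converter, not the project's actual format.

```python
# Hypothetical sketch of the reading side in ./feed_dns.py: iterate over
# CSV rows from the pipe instead of calling json.loads() per line.
# The column order (timestamp, name, type, value) is assumed.
import csv


def read_records(stream):
    for row in csv.reader(stream):
        timestamp, name, rtype, value = row
        yield {
            "timestamp": timestamp,
            "name": name,
            "type": rtype,
            "value": value,
        }
```

Because `csv.reader` does far less work per line than a JSON decoder, this is consistent with the commit message's claim that the IR pipeline is faster overall.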