Compare commits

...

20 commits
v2.3 ... master

Author SHA1 Message Date
Geoffrey Frogeye 3b6f7a58b3
Remove support for Rapid7
They changed their privacy / pricing model and as such I don't have
access to their massive DNS dataset anymore,
even after asking.

Since 2022-01-02, I put the list on freeze while looking for an alternative,
but couldn't find any.
To make the list update again with the remaining DNS sources I have,
I put the last version of the list generated with the Rapid7 dataset
as an input for subdomains, that will now get resolved with MassDNS.
2022-11-13 20:10:27 +01:00
Geoffrey Frogeye 49a36f32f2
Add requirements.txt file 2022-02-26 13:01:11 +01:00
Geoffrey Frogeye 29cf72ae92 Fix most of the README being bold
Why did I go with this Markdown generator again?
2021-08-28 20:58:34 +02:00
Geoffrey Frogeye 998c3faf8f
Add SAS.com 2021-08-22 18:02:37 +02:00
Geoffrey Frogeye c8a14a4e21
Add DataUnlocker 2021-08-22 17:07:25 +02:00
Geoffrey Frogeye 1ec26e7f96
Add Plausible.io 2021-08-22 16:53:58 +02:00
Geoffrey Frogeye 5b49441bc0 Add Branch.io tracker 2021-08-22 16:37:31 +02:00
Geoffrey Frogeye afd122f2ab
Update usage recommendations 2021-08-15 13:04:55 +02:00
Geoffrey Frogeye 6ae3d5fb55
Add Lead Forensics tracker 2021-08-15 11:39:37 +02:00
Geoffrey Frogeye 10a505d84f
Add Fathom 2021-08-15 11:18:35 +02:00
Geoffrey Frogeye c06648da53
Added Pardot tracker 2021-08-15 11:06:53 +02:00
Geoffrey Frogeye f165e5a094
Fix (most) mypy / flake8 errors 2021-08-14 23:35:51 +02:00
Geoffrey Frogeye 3dcccad39a
Black pass 2021-08-14 23:27:28 +02:00
Geoffrey Frogeye a023dc8322
Fix deprecated np.bool 2021-08-14 23:21:03 +02:00
Geoffrey Frogeye 389e83d492
Fix database maximum cache size cap 2021-08-14 23:19:12 +02:00
Geoffrey Frogeye edf444cc28
Add ad-cloud.jp and improve names of Japanese trackers
Closes #19

Names from https://github.com/AdguardTeam/cname-trackers/issues/1
2021-08-14 22:55:58 +02:00
Geoffrey Frogeye fa23d466d2
Actually remove ThreatMetrix
Forgot -i when grepping
2021-08-14 21:55:44 +02:00
Geoffrey Frogeye f5f9f88c42
Remove ThreatMetrix
I received a lot of false positives for this one,
and while I wasn't able to reproduce the issue in most of the cases,
I trust the community.
It's also not in any other CNAME tracker list, probably for the same reason.
Plus, it's apparently not very nasty.
So I'll let it go.

Closes #17
2021-08-14 21:24:48 +02:00
Geoffrey Frogeye 2997e41f98
Investigated >0.5% trackers from Fukuda paper 2020-12-19 13:41:07 +01:00
Geoffrey Frogeye 6cf1028174
Added other tracking source for Adobe
Found on the Adobe documentation and in the wild
https://experienceleague.adobe.com/docs/analytics/implementation/vars/config-vars/trackingserversecure.html?lang=en#s.trackingserversecure-in-appmeasurement-and-launch-custom-code-editor
2020-12-19 13:15:38 +01:00
20 changed files with 478 additions and 517 deletions

View file

@ -1,4 +1,3 @@
RAPID7_API_KEY=
CACHE_SIZE=536870912
MASSDNS_HASHMAP_SIZE=1000
PROFILE=0

View file

@ -18,7 +18,7 @@ This program takes as input:
It outputs hostnames that are DNS redirections to any item in the provided lists.
DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).
DNS records can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).
Those subdomains can either be provided as-is, come from the [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that).
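To make the matching idea concrete, here is a minimal, self-contained sketch of what "a DNS redirection to an item in the lists" means; it is an illustration only, not the repository's algorithm (the real matching lives in the `database.py` tree further down in this diff):
```
# Toy illustration: a hostname is a first-party tracker if any hop of its
# CNAME chain falls under a known tracker zone (eulerian.net is an example).
def is_redirection_to_tracker(cname_chain, rule_zones):
    for hop in cname_chain:
        if any(hop == zone or hop.endswith("." + zone) for zone in rule_zones):
            return True
    return False

print(is_redirection_to_tracker(
    ["metrics.example.org", "example.eulerian.net"], {"eulerian.net"}))  # True
```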
@ -41,7 +41,6 @@ Depending on the sources you'll be using to generate the list, you'll need to in
- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
- [numpy](https://www.numpy.org/)
- [python-abp](https://pypi.org/project/python-abp/) (only if you intend to use AdBlock rules as a rule source)
- [jq](http://stedolan.github.io/jq/) (only if you have a Rapid7 API key)
- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
@ -135,22 +134,6 @@ Note that this is a network intensive process, not in term of bandwith, but in t
The DNS records will automatically be imported into the database.
If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.
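In code terms, the re-import step corresponds to the single-process path of `feed_dns.py` shown further down in this diff; a sketch (the input path is an assumption, wherever `./resolve_subdomains.sh` wrote the MassDNS output):
```
# Re-import an existing MassDNS output without re-resolving (sketch).
import feed_dns

with open("temp/massdns_output.txt") as buf:         # hypothetical path
    writer = feed_dns.Writer(autosave_interval=900, ip4_cache=0)
    parser = feed_dns.MassDnsParser(buf, writer=writer)
    parser.run()
    writer.end()                                      # saves the database
```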
### Import DNS records from Rapid7
If you have a Rapid7 Organization API key, make sure to append to `.env`:
```
RAPID7_API_KEY=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```
Then, run `./import_rapid7.sh`.
This will download about 35 GiB of data the first time, but only the matching records will be stored (a few MiB for the tracking rules).
Note the download speed will most likely be limited by the database operation throughput (fast RAM will help).
The script remembers which sets were downloaded last, and will only import newer sets.
If the first-party rules changed, the corresponding sets will be re-imported anyway.
If you want to force re-importing, run `rm last_updates/rapid7_*.txt`.
### Export the lists
For the tracking list, use `./export_lists.sh`, the output will be in the `dist` folder (please change the links before distributing them).
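Under the hood the export script asks `export.py` (shown further down in this diff) to walk the database; a rough Python equivalent, with illustrative flag choices:
```
# Roughly what the export boils down to; flag values are examples.
import database

DB = database.Database()
for line in DB.list_records(first_party_only=False, hostnames_only=True):
    print(line)
```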

View file

@ -16,25 +16,36 @@ import abp.filters
def get_domains(rule: abp.filters.parser.Filter) -> typing.Iterable[str]:
if rule.options:
return
selector_type = rule.selector['type']
selector_value = rule.selector['value']
if selector_type == 'url-pattern' \
and selector_value.startswith('||') \
and selector_value.endswith('^'):
selector_type = rule.selector["type"]
selector_value = rule.selector["value"]
if (
selector_type == "url-pattern"
and selector_value.startswith("||")
and selector_value.endswith("^")
):
yield selector_value[2:-1]
if __name__ == '__main__':
if __name__ == "__main__":
# Parsing arguments
parser = argparse.ArgumentParser(
description="Extract whole domains from an AdBlock blocking list")
description="Extract whole domains from an AdBlock blocking list"
)
parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="Input file with AdBlock rules")
"-i",
"--input",
type=argparse.FileType("r"),
default=sys.stdin,
help="Input file with AdBlock rules",
)
parser.add_argument(
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
help="Outptut file with one rule tracking subdomain per line")
"-o",
"--output",
type=argparse.FileType("w"),
default=sys.stdout,
help="Outptut file with one rule tracking subdomain per line",
)
args = parser.parse_args()
# Reading rules
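In plain terms, the extractor above only keeps domain-anchored blocking rules without options; a tiny illustration (the rule string is an example):
```
# What get_domains keeps: "||domain^" patterns with no options.
value = "||tracker.example.com^"
if value.startswith("||") and value.endswith("^"):
    print(value[2:-1])        # tracker.example.com
```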

View file

@ -16,26 +16,25 @@ import selenium.webdriver.firefox.options
import seleniumwire.webdriver
import logging
log = logging.getLogger('cs')
log = logging.getLogger("cs")
DRIVER = None
SCROLL_TIME = 10.0
SCROLL_STEPS = 100
SCROLL_CMD = f'window.scrollBy(0,document.body.scrollHeight/{SCROLL_STEPS})'
SCROLL_CMD = f"window.scrollBy(0,document.body.scrollHeight/{SCROLL_STEPS})"
def new_driver() -> seleniumwire.webdriver.browser.Firefox:
profile = selenium.webdriver.FirefoxProfile()
profile.set_preference('privacy.trackingprotection.enabled', False)
profile.set_preference('network.cookie.cookieBehavior', 0)
profile.set_preference('privacy.trackingprotection.pbmode.enabled', False)
profile.set_preference(
'privacy.trackingprotection.cryptomining.enabled', False)
profile.set_preference(
'privacy.trackingprotection.fingerprinting.enabled', False)
profile.set_preference("privacy.trackingprotection.enabled", False)
profile.set_preference("network.cookie.cookieBehavior", 0)
profile.set_preference("privacy.trackingprotection.pbmode.enabled", False)
profile.set_preference("privacy.trackingprotection.cryptomining.enabled", False)
profile.set_preference("privacy.trackingprotection.fingerprinting.enabled", False)
options = selenium.webdriver.firefox.options.Options()
# options.add_argument('-headless')
driver = seleniumwire.webdriver.Firefox(profile,
executable_path='geckodriver', options=options)
driver = seleniumwire.webdriver.Firefox(
profile, executable_path="geckodriver", options=options
)
return driver
@ -60,11 +59,11 @@ def collect_subdomains(url: str) -> typing.Iterable[str]:
DRIVER.get(url)
for s in range(SCROLL_STEPS):
DRIVER.execute_script(SCROLL_CMD)
time.sleep(SCROLL_TIME/SCROLL_STEPS)
time.sleep(SCROLL_TIME / SCROLL_STEPS)
for request in DRIVER.requests:
if request.response:
yield subdomain_from_url(request.path)
except:
except Exception:
log.exception("Error")
DRIVER.quit()
DRIVER = None
@ -78,10 +77,10 @@ def collect_subdomains_standalone(url: str) -> None:
print(subdomain)
if __name__ == '__main__':
if __name__ == "__main__":
assert len(sys.argv) <= 2
filename = None
if len(sys.argv) == 2 and sys.argv[1] != '-':
if len(sys.argv) == 2 and sys.argv[1] != "-":
filename = sys.argv[1]
num_lines = sum(1 for line in open(filename))
iterator = progressbar.progressbar(open(filename), max_value=num_lines)
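`collect_subdomains` above yields one hostname per captured request via `subdomain_from_url`, which is outside this diff; a minimal equivalent, assuming it simply keeps the hostname of the request URL:
```
# Hedged sketch of subdomain_from_url (the real function is not in this diff).
import urllib.parse

def subdomain_from_url(url: str) -> str:
    return urllib.parse.urlparse(url).netloc

print(subdomain_from_url("https://metrics.example.org/pixel.gif"))  # metrics.example.org
```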

View file

@ -15,33 +15,30 @@ import os
TLD_LIST: typing.Set[str] = set()
coloredlogs.install(
level='DEBUG',
fmt='%(asctime)s %(name)s %(levelname)s %(message)s'
)
coloredlogs.install(level="DEBUG", fmt="%(asctime)s %(name)s %(levelname)s %(message)s")
Asn = int
Timestamp = int
Level = int
class Path():
class Path:
pass
class RulePath(Path):
def __str__(self) -> str:
return '(rule)'
return "(rule)"
class RuleFirstPath(RulePath):
def __str__(self) -> str:
return '(first-party rule)'
return "(first-party rule)"
class RuleMultiPath(RulePath):
def __str__(self) -> str:
return '(multi-party rule)'
return "(multi-party rule)"
class DomainPath(Path):
@ -49,7 +46,7 @@ class DomainPath(Path):
self.parts = parts
def __str__(self) -> str:
return '?.' + Database.unpack_domain(self)
return "?." + Database.unpack_domain(self)
class HostnamePath(DomainPath):
@ -59,7 +56,7 @@ class HostnamePath(DomainPath):
class ZonePath(DomainPath):
def __str__(self) -> str:
return '*.' + Database.unpack_domain(self)
return "*." + Database.unpack_domain(self)
class AsnPath(Path):
@ -79,7 +76,7 @@ class Ip4Path(Path):
return Database.unpack_ip4network(self)
class Match():
class Match:
def __init__(self) -> None:
self.source: typing.Optional[Path] = None
self.updated: int = 0
@ -102,10 +99,10 @@ class Match():
class AsnNode(Match):
def __init__(self) -> None:
Match.__init__(self)
self.name = ''
self.name = ""
class DomainTreeNode():
class DomainTreeNode:
def __init__(self) -> None:
self.children: typing.Dict[str, DomainTreeNode] = dict()
self.match_zone = Match()
@ -120,18 +117,16 @@ class IpTreeNode(Match):
Node = typing.Union[DomainTreeNode, IpTreeNode, AsnNode]
MatchCallable = typing.Callable[[Path,
Match],
typing.Any]
MatchCallable = typing.Callable[[Path, Match], typing.Any]
class Profiler():
class Profiler:
def __init__(self) -> None:
do_profile = int(os.environ.get('PROFILE', '0'))
do_profile = int(os.environ.get("PROFILE", "0"))
if do_profile:
self.log = logging.getLogger('profiler')
self.log = logging.getLogger("profiler")
self.time_last = time.perf_counter()
self.time_step = 'init'
self.time_step = "init"
self.time_dict: typing.Dict[str, float] = dict()
self.step_dict: typing.Dict[str, int] = dict()
self.enter_step = self.enter_step_real
@ -158,14 +153,17 @@ class Profiler():
return
def profile_real(self) -> None:
self.enter_step('profile')
self.enter_step("profile")
total = sum(self.time_dict.values())
for key, secs in sorted(self.time_dict.items(), key=lambda t: t[1]):
times = self.step_dict[key]
self.log.debug(f"{key:<20}: {times:9d} × {secs/times:5.3e} "
f"= {secs:9.2f} s ({secs/total:7.2%}) ")
self.log.debug(f"{'total':<20}: "
f"{total:9.2f} s ({1:7.2%})")
self.log.debug(
f"{key:<20}: {times:9d} × {secs/times:5.3e} "
f"= {secs:9.2f} s ({secs/total:7.2%}) "
)
self.log.debug(
f"{'total':<20}: " f"{total:9.2f} s ({1:7.2%})"
)
class Database(Profiler):
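The `Profiler` above is a no-op unless the `PROFILE` environment variable is set (see the `.env` defaults at the top of this diff); a minimal way to see its per-step report, with an illustrative step name and workload:
```
# Minimal profiler exercise; step name and workload are illustrative.
import os
os.environ["PROFILE"] = "1"     # must be set before the Database is created

import database
DB = database.Database()
DB.enter_step("example_step")
sum(range(10 ** 6))             # stand-in for real work
DB.save()                       # save() calls profile(), which logs per-step timings
```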
@ -173,9 +171,7 @@ class Database(Profiler):
PATH = "blocking.p"
def initialize(self) -> None:
self.log.warning(
"Creating database version: %d ",
Database.VERSION)
self.log.warning("Creating database version: %d ", Database.VERSION)
# Dummy match objects that everything refer to
self.rules: typing.List[Match] = list()
for first_party in (False, True):
@ -189,76 +185,77 @@ class Database(Profiler):
self.ip4tree = IpTreeNode()
def load(self) -> None:
self.enter_step('load')
self.enter_step("load")
try:
with open(self.PATH, 'rb') as db_fdsec:
with open(self.PATH, "rb") as db_fdsec:
version, data = pickle.load(db_fdsec)
if version == Database.VERSION:
self.rules, self.domtree, self.asns, self.ip4tree = data
return
self.log.warning(
"Outdated database version found: %d, "
"it will be rebuilt.",
version)
"Outdated database version found: %d, " "it will be rebuilt.",
version,
)
except (TypeError, AttributeError, EOFError):
self.log.error(
"Corrupt (or heavily outdated) database found, "
"it will be rebuilt.")
"Corrupt (or heavily outdated) database found, " "it will be rebuilt."
)
except FileNotFoundError:
pass
self.initialize()
def save(self) -> None:
self.enter_step('save')
with open(self.PATH, 'wb') as db_fdsec:
self.enter_step("save")
with open(self.PATH, "wb") as db_fdsec:
data = self.rules, self.domtree, self.asns, self.ip4tree
pickle.dump((self.VERSION, data), db_fdsec)
self.profile()
def __init__(self) -> None:
Profiler.__init__(self)
self.log = logging.getLogger('db')
self.log = logging.getLogger("db")
self.load()
self.ip4cache_shift: int = 32
self.ip4cache = numpy.ones(1)
def _set_ip4cache(self, path: Path, _: Match) -> None:
assert isinstance(path, Ip4Path)
self.enter_step('set_ip4cache')
self.enter_step("set_ip4cache")
mini = path.value >> self.ip4cache_shift
maxi = (path.value + 2**(32-path.prefixlen)) >> self.ip4cache_shift
maxi = (path.value + 2 ** (32 - path.prefixlen)) >> self.ip4cache_shift
if mini == maxi:
self.ip4cache[mini] = True
else:
self.ip4cache[mini:maxi] = True
def fill_ip4cache(self, max_size: int = 512*1024**2) -> None:
def fill_ip4cache(self, max_size: int = 512 * 1024 ** 2) -> None:
"""
Size in bytes
"""
if max_size > 2**32/8:
self.log.warning("Allocating more than 512 MiB of RAM for "
"the Ip4 cache is not necessary.")
max_cache_width = int(math.log2(max(1, max_size*8)))
if max_size > 2 ** 32 / 8:
self.log.warning(
"Allocating more than 512 MiB of RAM for "
"the Ip4 cache is not necessary."
)
max_cache_width = int(math.log2(max(1, max_size * 8)))
allocated = False
cache_width = min(2**32, max_cache_width)
cache_width = min(32, max_cache_width)
while not allocated:
cache_size = 2**cache_width
cache_size = 2 ** cache_width
try:
self.ip4cache = numpy.zeros(cache_size, dtype=numpy.bool)
self.ip4cache = numpy.zeros(cache_size, dtype=bool)
except MemoryError:
self.log.exception(
"Could not allocate cache. Retrying a smaller one.")
self.log.exception("Could not allocate cache. Retrying a smaller one.")
cache_width -= 1
continue
allocated = True
self.ip4cache_shift = 32-cache_width
self.ip4cache_shift = 32 - cache_width
for _ in self.exec_each_ip4(self._set_ip4cache):
pass
@staticmethod
def populate_tld_list() -> None:
with open('temp/all_tld.list', 'r') as tld_fdesc:
with open("temp/all_tld.list", "r") as tld_fdesc:
for tld in tld_fdesc:
tld = tld.strip()
TLD_LIST.add(tld)
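For scale, the sizing arithmetic in `fill_ip4cache` above works out as follows for a smaller, illustrative 1 MiB cap:
```
# Worked example of fill_ip4cache's sizing; the 1 MiB cap is an example value.
import math

max_size = 1024 ** 2                              # 1 MiB for the bitmap
max_cache_width = int(math.log2(max_size * 8))    # 8 388 608 bits -> width 23
cache_width = min(32, max_cache_width)            # never above 32 (the cap fixed above)
ip4cache_shift = 32 - cache_width                 # 9
print(2 ** ip4cache_shift)                        # 512 addresses share each cache bit
```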
@ -267,7 +264,7 @@ class Database(Profiler):
def validate_domain(path: str) -> bool:
if len(path) > 255:
return False
splits = path.split('.')
splits = path.split(".")
if not TLD_LIST:
Database.populate_tld_list()
if splits[-1] not in TLD_LIST:
@ -279,26 +276,26 @@ class Database(Profiler):
@staticmethod
def pack_domain(domain: str) -> DomainPath:
return DomainPath(domain.split('.')[::-1])
return DomainPath(domain.split(".")[::-1])
@staticmethod
def unpack_domain(domain: DomainPath) -> str:
return '.'.join(domain.parts[::-1])
return ".".join(domain.parts[::-1])
@staticmethod
def pack_asn(asn: str) -> AsnPath:
asn = asn.upper()
if asn.startswith('AS'):
if asn.startswith("AS"):
asn = asn[2:]
return AsnPath(int(asn))
@staticmethod
def unpack_asn(asn: AsnPath) -> str:
return f'AS{asn.asn}'
return f"AS{asn.asn}"
@staticmethod
def validate_ip4address(path: str) -> bool:
splits = path.split('.')
splits = path.split(".")
if len(splits) != 4:
return False
for split in splits:
@ -312,7 +309,7 @@ class Database(Profiler):
@staticmethod
def pack_ip4address_low(address: str) -> int:
addr = 0
for split in address.split('.'):
for split in address.split("."):
octet = int(split)
addr = (addr << 8) + octet
return addr
@ -330,12 +327,12 @@ class Database(Profiler):
for o in reversed(range(4)):
octets[o] = addr & 0xFF
addr >>= 8
return '.'.join(map(str, octets))
return ".".join(map(str, octets))
@staticmethod
def validate_ip4network(path: str) -> bool:
# A bit generous but ok for our usage
splits = path.split('/')
splits = path.split("/")
if len(splits) != 2:
return False
if not Database.validate_ip4address(splits[0]):
@ -349,7 +346,7 @@ class Database(Profiler):
@staticmethod
def pack_ip4network(network: str) -> Ip4Path:
address, prefixlen_str = network.split('/')
address, prefixlen_str = network.split("/")
prefixlen = int(prefixlen_str)
addr = Database.pack_ip4address(address)
addr.prefixlen = prefixlen
@ -363,7 +360,7 @@ class Database(Profiler):
for o in reversed(range(4)):
octets[o] = addr & 0xFF
addr >>= 8
return '.'.join(map(str, octets)) + '/' + str(network.prefixlen)
return ".".join(map(str, octets)) + "/" + str(network.prefixlen)
def get_match(self, path: Path) -> Match:
if isinstance(path, RuleMultiPath):
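The packing helpers above follow two simple conventions: domain labels are stored reversed (so the tree is keyed TLD-first) and IPv4 addresses are packed into integers. For example:
```
# Illustrating the packing conventions; the values are examples.
from database import Database

d = Database.pack_domain("tracker.example.com")
print(d.parts)                                    # ['com', 'example', 'tracker']
print(Database.unpack_domain(d))                  # tracker.example.com
print(Database.unpack_asn(Database.pack_asn("as1234")))   # AS1234
print(Database.pack_ip4address_low("192.0.2.1"))          # 3221225985
```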
@ -384,7 +381,7 @@ class Database(Profiler):
raise ValueError
elif isinstance(path, Ip4Path):
dici = self.ip4tree
for i in range(31, 31-path.prefixlen, -1):
for i in range(31, 31 - path.prefixlen, -1):
bit = (path.value >> i) & 0b1
dici_next = dici.one if bit else dici.zero
if not dici_next:
@ -394,9 +391,10 @@ class Database(Profiler):
else:
raise ValueError
def exec_each_asn(self,
callback: MatchCallable,
) -> typing.Any:
def exec_each_asn(
self,
callback: MatchCallable,
) -> typing.Any:
for asn in self.asns:
match = self.asns[asn]
if match.active():
@ -409,11 +407,12 @@ class Database(Profiler):
except TypeError: # not iterable
pass
def exec_each_domain(self,
callback: MatchCallable,
_dic: DomainTreeNode = None,
_par: DomainPath = None,
) -> typing.Any:
def exec_each_domain(
self,
callback: MatchCallable,
_dic: DomainTreeNode = None,
_par: DomainPath = None,
) -> typing.Any:
_dic = _dic or self.domtree
_par = _par or DomainPath([])
if _dic.match_hostname.active():
@ -437,16 +436,15 @@ class Database(Profiler):
for part in _dic.children:
dic = _dic.children[part]
yield from self.exec_each_domain(
callback,
_dic=dic,
_par=DomainPath(_par.parts + [part])
callback, _dic=dic, _par=DomainPath(_par.parts + [part])
)
def exec_each_ip4(self,
callback: MatchCallable,
_dic: IpTreeNode = None,
_par: Ip4Path = None,
) -> typing.Any:
def exec_each_ip4(
self,
callback: MatchCallable,
_dic: IpTreeNode = None,
_par: Ip4Path = None,
) -> typing.Any:
_dic = _dic or self.ip4tree
_par = _par or Ip4Path(0, 0)
if _dic.active():
@ -466,25 +464,18 @@ class Database(Profiler):
# addr0 = _par.value & (0xFFFFFFFF ^ (1 << (32-pref)))
# assert addr0 == _par.value
addr0 = _par.value
yield from self.exec_each_ip4(
callback,
_dic=dic,
_par=Ip4Path(addr0, pref)
)
yield from self.exec_each_ip4(callback, _dic=dic, _par=Ip4Path(addr0, pref))
# 1
dic = _dic.one
if dic:
addr1 = _par.value | (1 << (32-pref))
addr1 = _par.value | (1 << (32 - pref))
# assert addr1 != _par.value
yield from self.exec_each_ip4(
callback,
_dic=dic,
_par=Ip4Path(addr1, pref)
)
yield from self.exec_each_ip4(callback, _dic=dic, _par=Ip4Path(addr1, pref))
def exec_each(self,
callback: MatchCallable,
) -> typing.Any:
def exec_each(
self,
callback: MatchCallable,
) -> typing.Any:
yield from self.exec_each_domain(callback)
yield from self.exec_each_ip4(callback)
yield from self.exec_each_asn(callback)
@ -492,19 +483,17 @@ class Database(Profiler):
def update_references(self) -> None:
# Should be correctly calculated normally,
# keeping this just in case
def reset_references_cb(path: Path,
match: Match
) -> None:
def reset_references_cb(path: Path, match: Match) -> None:
match.references = 0
for _ in self.exec_each(reset_references_cb):
pass
def increment_references_cb(path: Path,
match: Match
) -> None:
def increment_references_cb(path: Path, match: Match) -> None:
if match.source:
source = self.get_match(match.source)
source.references += 1
for _ in self.exec_each(increment_references_cb):
pass
@ -513,9 +502,7 @@ class Database(Profiler):
# matches until all disabled matches reference count = 0
did_something = True
def clean_deps_cb(path: Path,
match: Match
) -> None:
def clean_deps_cb(path: Path, match: Match) -> None:
nonlocal did_something
if not match.source:
return
@ -530,15 +517,13 @@ class Database(Profiler):
while did_something:
did_something = False
self.enter_step('pass_clean_deps')
self.enter_step("pass_clean_deps")
for _ in self.exec_each(clean_deps_cb):
pass
def prune(self, before: int, base_only: bool = False) -> None:
# Disable the matches targeted
def prune_cb(path: Path,
match: Match
) -> None:
def prune_cb(path: Path, match: Match) -> None:
if base_only and match.level > 1:
return
if match.updated > before:
@ -546,7 +531,7 @@ class Database(Profiler):
self._unset_match(match)
self.log.debug("Print: disabled %s", path)
self.enter_step('pass_prune')
self.enter_step("pass_prune")
for _ in self.exec_each(prune_cb):
pass
@ -559,25 +544,24 @@ class Database(Profiler):
match = self.get_match(path)
string = str(path)
if isinstance(match, AsnNode):
string += f' ({match.name})'
party_char = 'F' if match.first_party else 'M'
dup_char = 'D' if match.dupplicate else '_'
string += f' {match.level}{party_char}{dup_char}{match.references}'
string += f" ({match.name})"
party_char = "F" if match.first_party else "M"
dup_char = "D" if match.dupplicate else "_"
string += f" {match.level}{party_char}{dup_char}{match.references}"
if match.source:
string += f'{self.explain(match.source)}'
string += f"{self.explain(match.source)}"
return string
def list_records(self,
first_party_only: bool = False,
end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
explain: bool = False,
) -> typing.Iterable[str]:
def export_cb(path: Path, match: Match
) -> typing.Iterable[str]:
def list_records(
self,
first_party_only: bool = False,
end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
explain: bool = False,
) -> typing.Iterable[str]:
def export_cb(path: Path, match: Match) -> typing.Iterable[str]:
if first_party_only and not match.first_party:
return
if end_chain_only and match.references > 0:
@ -596,13 +580,14 @@ class Database(Profiler):
yield from self.exec_each(export_cb)
def count_records(self,
first_party_only: bool = False,
end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
) -> str:
def count_records(
self,
first_party_only: bool = False,
end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
) -> str:
memo: typing.Dict[str, int] = dict()
def count_records_cb(path: Path, match: Match) -> None:
@ -627,75 +612,80 @@ class Database(Profiler):
split: typing.List[str] = list()
for key, value in sorted(memo.items(), key=lambda s: s[0]):
split.append(f'{key[:-4].lower()}s: {value}')
return ', '.join(split)
split.append(f"{key[:-4].lower()}s: {value}")
return ", ".join(split)
def get_domain(self, domain_str: str) -> typing.Iterable[DomainPath]:
self.enter_step('get_domain_pack')
self.enter_step("get_domain_pack")
domain = self.pack_domain(domain_str)
self.enter_step('get_domain_brws')
self.enter_step("get_domain_brws")
dic = self.domtree
depth = 0
for part in domain.parts:
if dic.match_zone.active():
self.enter_step('get_domain_yield')
self.enter_step("get_domain_yield")
yield ZonePath(domain.parts[:depth])
self.enter_step('get_domain_brws')
self.enter_step("get_domain_brws")
if part not in dic.children:
return
dic = dic.children[part]
depth += 1
if dic.match_zone.active():
self.enter_step('get_domain_yield')
self.enter_step("get_domain_yield")
yield ZonePath(domain.parts)
if dic.match_hostname.active():
self.enter_step('get_domain_yield')
self.enter_step("get_domain_yield")
yield HostnamePath(domain.parts)
def get_ip4(self, ip4_str: str) -> typing.Iterable[Path]:
self.enter_step('get_ip4_pack')
self.enter_step("get_ip4_pack")
ip4val = self.pack_ip4address_low(ip4_str)
self.enter_step('get_ip4_cache')
self.enter_step("get_ip4_cache")
if not self.ip4cache[ip4val >> self.ip4cache_shift]:
return
self.enter_step('get_ip4_brws')
self.enter_step("get_ip4_brws")
dic = self.ip4tree
for i in range(31, -1, -1):
bit = (ip4val >> i) & 0b1
if dic.active():
self.enter_step('get_ip4_yield')
yield Ip4Path(ip4val >> (i+1) << (i+1), 31-i)
self.enter_step('get_ip4_brws')
self.enter_step("get_ip4_yield")
yield Ip4Path(ip4val >> (i + 1) << (i + 1), 31 - i)
self.enter_step("get_ip4_brws")
next_dic = dic.one if bit else dic.zero
if next_dic is None:
return
dic = next_dic
if dic.active():
self.enter_step('get_ip4_yield')
self.enter_step("get_ip4_yield")
yield Ip4Path(ip4val, 32)
def _unset_match(self,
match: Match,
) -> None:
def _unset_match(
self,
match: Match,
) -> None:
match.disable()
if match.source:
source_match = self.get_match(match.source)
source_match.references -= 1
def _set_match(self,
match: Match,
updated: int,
source: Path,
source_match: Match = None,
dupplicate: bool = False,
) -> None:
def _set_match(
self,
match: Match,
updated: int,
source: Path,
source_match: Match = None,
dupplicate: bool = False,
) -> None:
# source_match is in parameters because most of the time
# its parent function needs it too,
# so it can pass it to save a traversal
source_match = source_match or self.get_match(source)
new_level = source_match.level + 1
if updated > match.updated or new_level < match.level \
or source_match.first_party > match.first_party:
if (
updated > match.updated
or new_level < match.level
or source_match.first_party > match.first_party
):
# NOTE FP and level of matches referencing this one
# won't be updated until run or prune
if match.source:
@ -708,20 +698,18 @@ class Database(Profiler):
source_match.references += 1
match.dupplicate = dupplicate
def _set_domain(self,
hostname: bool,
domain_str: str,
updated: int,
source: Path) -> None:
self.enter_step('set_domain_val')
def _set_domain(
self, hostname: bool, domain_str: str, updated: int, source: Path
) -> None:
self.enter_step("set_domain_val")
if not Database.validate_domain(domain_str):
raise ValueError(f"Invalid domain: {domain_str}")
self.enter_step('set_domain_pack')
self.enter_step("set_domain_pack")
domain = self.pack_domain(domain_str)
self.enter_step('set_domain_fp')
self.enter_step("set_domain_fp")
source_match = self.get_match(source)
is_first_party = source_match.first_party
self.enter_step('set_domain_brws')
self.enter_step("set_domain_brws")
dic = self.domtree
dupplicate = False
for part in domain.parts:
@ -742,21 +730,14 @@ class Database(Profiler):
dupplicate=dupplicate,
)
def set_hostname(self,
*args: typing.Any, **kwargs: typing.Any
) -> None:
def set_hostname(self, *args: typing.Any, **kwargs: typing.Any) -> None:
self._set_domain(True, *args, **kwargs)
def set_zone(self,
*args: typing.Any, **kwargs: typing.Any
) -> None:
def set_zone(self, *args: typing.Any, **kwargs: typing.Any) -> None:
self._set_domain(False, *args, **kwargs)
def set_asn(self,
asn_str: str,
updated: int,
source: Path) -> None:
self.enter_step('set_asn')
def set_asn(self, asn_str: str, updated: int, source: Path) -> None:
self.enter_step("set_asn")
path = self.pack_asn(asn_str)
if path.asn in self.asns:
match = self.asns[path.asn]
@ -769,17 +750,14 @@ class Database(Profiler):
source,
)
def _set_ip4(self,
ip4: Ip4Path,
updated: int,
source: Path) -> None:
self.enter_step('set_ip4_fp')
def _set_ip4(self, ip4: Ip4Path, updated: int, source: Path) -> None:
self.enter_step("set_ip4_fp")
source_match = self.get_match(source)
is_first_party = source_match.first_party
self.enter_step('set_ip4_brws')
self.enter_step("set_ip4_brws")
dic = self.ip4tree
dupplicate = False
for i in range(31, 31-ip4.prefixlen, -1):
for i in range(31, 31 - ip4.prefixlen, -1):
bit = (ip4.value >> i) & 0b1
next_dic = dic.one if bit else dic.zero
if next_dic is None:
@ -800,24 +778,22 @@ class Database(Profiler):
)
self._set_ip4cache(ip4, dic)
def set_ip4address(self,
ip4address_str: str,
*args: typing.Any, **kwargs: typing.Any
) -> None:
self.enter_step('set_ip4add_val')
def set_ip4address(
self, ip4address_str: str, *args: typing.Any, **kwargs: typing.Any
) -> None:
self.enter_step("set_ip4add_val")
if not Database.validate_ip4address(ip4address_str):
raise ValueError(f"Invalid ip4address: {ip4address_str}")
self.enter_step('set_ip4add_pack')
self.enter_step("set_ip4add_pack")
ip4 = self.pack_ip4address(ip4address_str)
self._set_ip4(ip4, *args, **kwargs)
def set_ip4network(self,
ip4network_str: str,
*args: typing.Any, **kwargs: typing.Any
) -> None:
self.enter_step('set_ip4net_val')
def set_ip4network(
self, ip4network_str: str, *args: typing.Any, **kwargs: typing.Any
) -> None:
self.enter_step("set_ip4net_val")
if not Database.validate_ip4network(ip4network_str):
raise ValueError(f"Invalid ip4network: {ip4network_str}")
self.enter_step('set_ip4net_pack')
self.enter_step("set_ip4net_pack")
ip4 = self.pack_ip4network(ip4network_str)
self._set_ip4(ip4, *args, **kwargs)

db.py (38 changed lines)
View file

@ -5,29 +5,37 @@ import database
import time
import os
if __name__ == '__main__':
if __name__ == "__main__":
# Parsing arguments
parser = argparse.ArgumentParser(
description="Database operations")
parser = argparse.ArgumentParser(description="Database operations")
parser.add_argument(
'-i', '--initialize', action='store_true',
help="Reconstruct the whole database")
"-i", "--initialize", action="store_true", help="Reconstruct the whole database"
)
parser.add_argument(
'-p', '--prune', action='store_true',
help="Remove old entries from database")
"-p", "--prune", action="store_true", help="Remove old entries from database"
)
parser.add_argument(
'-b', '--prune-base', action='store_true',
"-b",
"--prune-base",
action="store_true",
help="With --prune, only prune base rules "
"(the ones added by ./feed_rules.py)")
"(the ones added by ./feed_rules.py)",
)
parser.add_argument(
'-s', '--prune-before', type=int,
default=(int(time.time()) - 60*60*24*31*6),
"-s",
"--prune-before",
type=int,
default=(int(time.time()) - 60 * 60 * 24 * 31 * 6),
help="With --prune, only rules updated before "
"this UNIX timestamp will be deleted")
"this UNIX timestamp will be deleted",
)
parser.add_argument(
'-r', '--references', action='store_true',
help="DEBUG: Update the reference count")
"-r",
"--references",
action="store_true",
help="DEBUG: Update the reference count",
)
args = parser.parse_args()
if not args.initialize:
@ -37,7 +45,7 @@ if __name__ == '__main__':
os.unlink(database.Database.PATH)
DB = database.Database()
DB.enter_step('main')
DB.enter_step("main")
if args.prune:
DB.prune(before=args.prune_before, base_only=args.prune_base)
if args.references:
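The `--prune-before` default above is roughly six months expressed in seconds; the script's prune path boils down to the following sketch:
```
# Rough equivalent of `./db.py --prune` with the default cutoff.
import time
import database

DB = database.Database()
DB.prune(before=int(time.time()) - 60 * 60 * 24 * 31 * 6)   # ~6 months = 16 070 400 s
DB.save()
```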

dist/README.md (vendored, 7 changed lines)
View file

@ -35,7 +35,7 @@ This list is an inventory of every `somestring.website1.com` found to allow non
### First-party trackers
**Recommended for hostfiles-based ad blockers, such as [Pi-hole](https://pi-hole.net/).**
**Recommended for hostfiles-based ad blockers, such as [Pi-hole](https://pi-hole.net/) (&lt;v5.0, as it introduced CNAME blocking).**
**Recommended for Android ad blockers as applications, such as [Blokada](https://blokada.org/).**
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
@ -50,7 +50,7 @@ Don't be afraid of the size of the list, as this is due to the nature of first-p
### First-party only trackers
**Recommended for ad blockers as web browser extensions, such as [uBlock Origin](https://pi-hole.net/).**
**Recommended for ad blockers as web browser extensions, such as [uBlock Origin](https://ublockorigin.com/) (&lt;v1.25.0 or for Chromium-based browsers, as it introduced CNAME uncloaking for Firefox).**
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/firstparty-only-trackers.txt>
@ -98,10 +98,10 @@ Some of the first-party tracker included in this list have been found by:
- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors
- Yuki2718 from [Wilders Security Forums](https://www.wilderssecurity.com/threads/ublock-a-lean-and-fast-blocker.365273/page-168#post-2880361)
- Ha Dao, Johan Mazel, and Kensuke Fukuda, ["Characterizing CNAME Cloaking-Based Tracking on the Web", Proceedings of IFIP/IEEE Traffic Measurement Analysis Conference (TMA), 9 pages, 2020.](https://tma.ifip.org/2020/wp-content/uploads/sites/9/2020/06/tma2020-camera-paper66.pdf)
- AdGuard and [their blocklist](https://github.com/AdguardTeam/cname-trackers)'s contributors
The list was generated using data from
- [Rapid7 OpenData](https://opendata.rapid7.com/sonar.fdns_v2/), who kindly provided a free account
- [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html)
- [Public DNS Server List](https://public-dns.info/)
@ -110,4 +110,5 @@ Similar projects:
- [NextDNS blocklist](https://github.com/nextdns/cname-cloaking-blocklist): for DNS-aware ad blockers
- [Stefan Froberg's lists](https://www.orwell1984.today/cname/): subset of those lists grouped by tracker
- [AdGuard blocklist](https://github.com/AdguardTeam/cname-trackers): same thing with a bigger scope, maintained by a bigger team

View file

@ -8,7 +8,6 @@
./collect_subdomains.sh
./import_rules.sh
./resolve_subdomains.sh
./import_rapid7.sh
./prune.sh
./export_lists.sh
./generate_index.py

View file

@ -5,53 +5,80 @@ import argparse
import sys
if __name__ == '__main__':
if __name__ == "__main__":
# Parsing arguments
parser = argparse.ArgumentParser(
description="Export the hostnames rules stored "
"in the Database as plain text")
description="Export the hostnames rules stored " "in the Database as plain text"
)
parser.add_argument(
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
help="Output file, one rule per line")
"-o",
"--output",
type=argparse.FileType("w"),
default=sys.stdout,
help="Output file, one rule per line",
)
parser.add_argument(
'-f', '--first-party', action='store_true',
help="Only output rules issued from first-party sources")
"-f",
"--first-party",
action="store_true",
help="Only output rules issued from first-party sources",
)
parser.add_argument(
'-e', '--end-chain', action='store_true',
help="Only output rules that are not referenced by any other")
"-e",
"--end-chain",
action="store_true",
help="Only output rules that are not referenced by any other",
)
parser.add_argument(
'-r', '--rules', action='store_true',
help="Output all kinds of rules, not just hostnames")
"-r",
"--rules",
action="store_true",
help="Output all kinds of rules, not just hostnames",
)
parser.add_argument(
'-b', '--base-rules', action='store_true',
"-b",
"--base-rules",
action="store_true",
help="Output base rules "
"(the ones added by ./feed_rules.py) "
"(implies --rules)")
"(implies --rules)",
)
parser.add_argument(
'-d', '--no-dupplicates', action='store_true',
"-d",
"--no-dupplicates",
action="store_true",
help="Do not output rules that already match a zone/network rule "
"(e.g. dummy.example.com when there's a zone example.com rule)")
"(e.g. dummy.example.com when there's a zone example.com rule)",
)
parser.add_argument(
'-x', '--explain', action='store_true',
"-x",
"--explain",
action="store_true",
help="Show the chain of rules leading to one "
"(and the number of references they have)")
"(and the number of references they have)",
)
parser.add_argument(
'-c', '--count', action='store_true',
help="Show the number of rules per type instead of listing them")
"-c",
"--count",
action="store_true",
help="Show the number of rules per type instead of listing them",
)
args = parser.parse_args()
DB = database.Database()
if args.count:
assert not args.explain
print(DB.count_records(
first_party_only=args.first_party,
end_chain_only=args.end_chain,
no_dupplicates=args.no_dupplicates,
rules_only=args.base_rules,
hostnames_only=not (args.rules or args.base_rules),
))
print(
DB.count_records(
first_party_only=args.first_party,
end_chain_only=args.end_chain,
no_dupplicates=args.no_dupplicates,
rules_only=args.base_rules,
hostnames_only=not (args.rules or args.base_rules),
)
)
else:
for domain in DB.list_records(
first_party_only=args.first_party,

View file

@ -76,7 +76,7 @@ do
echo "# Oldest record: $oldest_date"
echo "# Number of source websites: $number_websites"
echo "# Number of source subdomains: $number_subdomains"
echo "# Number of source DNS records: ~2E9 + $number_dns"
echo "# Number of source DNS records: $number_dns"
echo "#"
echo "# Input rules: $rules_input"
echo "# Subsequent rules: $rules_found"

View file

@ -13,57 +13,54 @@ IPNetwork = typing.Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
def get_ranges(asn: str) -> typing.Iterable[str]:
req = requests.get(
'https://stat.ripe.net/data/as-routing-consistency/data.json',
params={'resource': asn}
"https://stat.ripe.net/data/as-routing-consistency/data.json",
params={"resource": asn},
)
data = req.json()
for pref in data['data']['prefixes']:
yield pref['prefix']
for pref in data["data"]["prefixes"]:
yield pref["prefix"]
def get_name(asn: str) -> str:
req = requests.get(
'https://stat.ripe.net/data/as-overview/data.json',
params={'resource': asn}
"https://stat.ripe.net/data/as-overview/data.json", params={"resource": asn}
)
data = req.json()
return data['data']['holder']
return data["data"]["holder"]
if __name__ == '__main__':
if __name__ == "__main__":
log = logging.getLogger('feed_asn')
log = logging.getLogger("feed_asn")
# Parsing arguments
parser = argparse.ArgumentParser(
description="Add the IP ranges associated to the AS in the database")
description="Add the IP ranges associated to the AS in the database"
)
args = parser.parse_args()
DB = database.Database()
def add_ranges(path: database.Path,
match: database.Match,
) -> None:
def add_ranges(
path: database.Path,
match: database.Match,
) -> None:
assert isinstance(path, database.AsnPath)
assert isinstance(match, database.AsnNode)
asn_str = database.Database.unpack_asn(path)
DB.enter_step('asn_get_name')
DB.enter_step("asn_get_name")
name = get_name(asn_str)
match.name = name
DB.enter_step('asn_get_ranges')
DB.enter_step("asn_get_ranges")
for prefix in get_ranges(asn_str):
parsed_prefix: IPNetwork = ipaddress.ip_network(prefix)
if parsed_prefix.version == 4:
DB.set_ip4network(
prefix,
source=path,
updated=int(time.time())
)
log.info('Added %s from %s (%s)', prefix, path, name)
DB.set_ip4network(prefix, source=path, updated=int(time.time()))
log.info("Added %s from %s (%s)", prefix, path, name)
elif parsed_prefix.version == 6:
log.warning('Unimplemented prefix version: %s', prefix)
log.warning("Unimplemented prefix version: %s", prefix)
else:
log.error('Unknown prefix version: %s', prefix)
log.error("Unknown prefix version: %s", prefix)
for _ in DB.exec_each_asn(add_ranges):
pass
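`get_name` and `get_ranges` above are thin wrappers around two RIPEstat endpoints; calling them directly looks like this (the AS number is only an example):
```
# Direct use of the helpers above; AS3333 (RIPE NCC) is an example value.
import feed_asn

print(feed_asn.get_name("AS3333"))
for prefix in feed_asn.get_ranges("AS3333"):
    print(prefix)          # announced prefixes, IPv4 and IPv6
```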

View file

@ -12,15 +12,15 @@ Record = typing.Tuple[typing.Callable, typing.Callable, int, str, str]
# select, write
FUNCTION_MAP: typing.Any = {
'a': (
"a": (
database.Database.get_ip4,
database.Database.set_hostname,
),
'cname': (
"cname": (
database.Database.get_domain,
database.Database.set_hostname,
),
'ptr': (
"ptr": (
database.Database.get_domain,
database.Database.set_ip4address,
),
@ -28,15 +28,16 @@ FUNCTION_MAP: typing.Any = {
class Writer(multiprocessing.Process):
def __init__(self,
recs_queue: multiprocessing.Queue = None,
autosave_interval: int = 0,
ip4_cache: int = 0,
):
def __init__(
self,
recs_queue: multiprocessing.Queue = None,
autosave_interval: int = 0,
ip4_cache: int = 0,
):
if recs_queue: # MP
super(Writer, self).__init__()
self.recs_queue = recs_queue
self.log = logging.getLogger(f'wr')
self.log = logging.getLogger("wr")
self.autosave_interval = autosave_interval
self.ip4_cache = ip4_cache
if not recs_queue: # No MP
@ -44,11 +45,11 @@ class Writer(multiprocessing.Process):
def open_db(self) -> None:
self.db = database.Database()
self.db.log = logging.getLogger(f'wr')
self.db.log = logging.getLogger("wr")
self.db.fill_ip4cache(max_size=self.ip4_cache)
def exec_record(self, record: Record) -> None:
self.db.enter_step('exec_record')
self.db.enter_step("exec_record")
select, write, updated, name, value = record
try:
for source in select(self.db, value):
@ -59,7 +60,7 @@ class Writer(multiprocessing.Process):
self.log.exception("Cannot execute: %s", record)
def end(self) -> None:
self.db.enter_step('end')
self.db.enter_step("end")
self.db.save()
def run(self) -> None:
@ -69,10 +70,11 @@ class Writer(multiprocessing.Process):
else:
next_save = 0
self.db.enter_step('block_wait')
self.db.enter_step("block_wait")
block: typing.List[Record]
for block in iter(self.recs_queue.get, None):
assert block
record: Record
for record in block:
self.exec_record(record)
@ -83,20 +85,21 @@ class Writer(multiprocessing.Process):
self.log.info("Done!")
next_save = time.time() + self.autosave_interval
self.db.enter_step('block_wait')
self.db.enter_step("block_wait")
self.end()
class Parser():
def __init__(self,
buf: typing.Any,
recs_queue: multiprocessing.Queue = None,
block_size: int = 0,
writer: Writer = None,
):
class Parser:
def __init__(
self,
buf: typing.Any,
recs_queue: multiprocessing.Queue = None,
block_size: int = 0,
writer: Writer = None,
):
assert bool(writer) ^ bool(block_size and recs_queue)
self.buf = buf
self.log = logging.getLogger('pr')
self.log = logging.getLogger("pr")
self.recs_queue = recs_queue
if writer: # No MP
self.prof: database.Profiler = writer.db
@ -105,14 +108,14 @@ class Parser():
self.block: typing.List[Record] = list()
self.block_size = block_size
self.prof = database.Profiler()
self.prof.log = logging.getLogger('pr')
self.prof.log = logging.getLogger("pr")
self.register = self.add_to_queue
def add_to_queue(self, record: Record) -> None:
self.prof.enter_step('register')
self.prof.enter_step("register")
self.block.append(record)
if len(self.block) >= self.block_size:
self.prof.enter_step('put_block')
self.prof.enter_step("put_block")
assert self.recs_queue
self.recs_queue.put(self.block)
self.block = list()
@ -127,45 +130,17 @@ class Parser():
raise NotImplementedError
class Rapid7Parser(Parser):
def consume(self) -> None:
data = dict()
for line in self.buf:
self.prof.enter_step('parse_rapid7')
split = line.split('"')
try:
for k in range(1, 14, 4):
key = split[k]
val = split[k+2]
data[key] = val
select, writer = FUNCTION_MAP[data['type']]
record = (
select,
writer,
int(data['timestamp']),
data['name'],
data['value']
)
except (IndexError, KeyError):
# IndexError: missing field
# KeyError: Unknown type field
self.log.exception("Cannot parse: %s", line)
self.register(record)
class MassDnsParser(Parser):
# massdns --output Snrql
# --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
TYPES = {
'A': (FUNCTION_MAP['a'][0], FUNCTION_MAP['a'][1], -1, None),
"A": (FUNCTION_MAP["a"][0], FUNCTION_MAP["a"][1], -1, None),
# 'AAAA': (FUNCTION_MAP['aaaa'][0], FUNCTION_MAP['aaaa'][1], -1, None),
'CNAME': (FUNCTION_MAP['cname'][0], FUNCTION_MAP['cname'][1], -1, -1),
"CNAME": (FUNCTION_MAP["cname"][0], FUNCTION_MAP["cname"][1], -1, -1),
}
def consume(self) -> None:
self.prof.enter_step('parse_massdns')
self.prof.enter_step("parse_massdns")
timestamp = 0
header = True
for line in self.buf:
@ -174,14 +149,15 @@ class MassDnsParser(Parser):
header = True
continue
split = line.split(' ')
split = line.split(" ")
try:
if header:
timestamp = int(split[1])
header = False
else:
select, write, name_offset, value_offset = \
MassDnsParser.TYPES[split[1]]
select, write, name_offset, value_offset = MassDnsParser.TYPES[
split[1]
]
record = (
select,
write,
@ -190,75 +166,85 @@ class MassDnsParser(Parser):
split[2][:value_offset].lower(),
)
self.register(record)
self.prof.enter_step('parse_massdns')
self.prof.enter_step("parse_massdns")
except KeyError:
continue
PARSERS = {
'rapid7': Rapid7Parser,
'massdns': MassDnsParser,
"massdns": MassDnsParser,
}
if __name__ == '__main__':
if __name__ == "__main__":
# Parsing arguments
log = logging.getLogger('feed_dns')
log = logging.getLogger("feed_dns")
args_parser = argparse.ArgumentParser(
description="Read DNS records and import "
"tracking-relevant data into the database")
"tracking-relevant data into the database"
)
args_parser.add_argument("parser", choices=PARSERS.keys(), help="Input format")
args_parser.add_argument(
'parser',
choices=PARSERS.keys(),
help="Input format")
"-i",
"--input",
type=argparse.FileType("r"),
default=sys.stdin,
help="Input file",
)
args_parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="Input file")
"-b", "--block-size", type=int, default=1024, help="Performance tuning value"
)
args_parser.add_argument(
'-b', '--block-size', type=int, default=1024,
help="Performance tuning value")
"-q", "--queue-size", type=int, default=128, help="Performance tuning value"
)
args_parser.add_argument(
'-q', '--queue-size', type=int, default=128,
help="Performance tuning value")
"-a",
"--autosave-interval",
type=int,
default=900,
help="Interval to which the database will save in seconds. " "0 to disable.",
)
args_parser.add_argument(
'-a', '--autosave-interval', type=int, default=900,
help="Interval to which the database will save in seconds. "
"0 to disable.")
"-s",
"--single-process",
action="store_true",
help="Only use one process. " "Might be useful for single core computers.",
)
args_parser.add_argument(
'-s', '--single-process', action='store_true',
help="Only use one process. "
"Might be useful for single core computers.")
args_parser.add_argument(
'-4', '--ip4-cache', type=int, default=0,
"-4",
"--ip4-cache",
type=int,
default=0,
help="RAM cache for faster IPv4 lookup. "
"Maximum useful value: 512 MiB (536870912). "
"Warning: Depending on the rules, this might already "
"be a memory-heavy process, even without the cache.")
"be a memory-heavy process, even without the cache.",
)
args = args_parser.parse_args()
parser_cls = PARSERS[args.parser]
if args.single_process:
writer = Writer(
autosave_interval=args.autosave_interval,
ip4_cache=args.ip4_cache
autosave_interval=args.autosave_interval, ip4_cache=args.ip4_cache
)
parser = parser_cls(args.input, writer=writer)
parser.run()
writer.end()
else:
recs_queue: multiprocessing.Queue = multiprocessing.Queue(
maxsize=args.queue_size)
maxsize=args.queue_size
)
writer = Writer(recs_queue,
autosave_interval=args.autosave_interval,
ip4_cache=args.ip4_cache
)
writer = Writer(
recs_queue,
autosave_interval=args.autosave_interval,
ip4_cache=args.ip4_cache,
)
writer.start()
parser = parser_cls(args.input,
recs_queue=recs_queue,
block_size=args.block_size
)
parser = parser_cls(
args.input, recs_queue=recs_queue, block_size=args.block_size
)
parser.run()
recs_queue.put(None)

View file

@ -4,30 +4,36 @@ import database
import argparse
import sys
import time
import typing
FUNCTION_MAP = {
'zone': database.Database.set_zone,
'hostname': database.Database.set_hostname,
'asn': database.Database.set_asn,
'ip4network': database.Database.set_ip4network,
'ip4address': database.Database.set_ip4address,
"zone": database.Database.set_zone,
"hostname": database.Database.set_hostname,
"asn": database.Database.set_asn,
"ip4network": database.Database.set_ip4network,
"ip4address": database.Database.set_ip4address,
}
if __name__ == '__main__':
if __name__ == "__main__":
# Parsing arguments
parser = argparse.ArgumentParser(
description="Import base rules to the database")
parser = argparse.ArgumentParser(description="Import base rules to the database")
parser.add_argument(
'type',
choices=FUNCTION_MAP.keys(),
help="Type of rule inputed")
"type", choices=FUNCTION_MAP.keys(), help="Type of rule inputed"
)
parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="File with one rule per line")
"-i",
"--input",
type=argparse.FileType("r"),
default=sys.stdin,
help="File with one rule per line",
)
parser.add_argument(
'-f', '--first-party', action='store_true',
help="The input only comes from verified first-party sources")
"-f",
"--first-party",
action="store_true",
help="The input only comes from verified first-party sources",
)
args = parser.parse_args()
DB = database.Database()
@ -43,11 +49,12 @@ if __name__ == '__main__':
for rule in args.input:
rule = rule.strip()
try:
fun(DB,
fun(
DB,
rule,
source=source,
updated=int(time.time()),
)
)
except ValueError:
DB.log.error(f"Could not add rule: {rule}")

View file

@ -2,11 +2,9 @@
import markdown2
extras = [
"header-ids"
]
extras = ["header-ids"]
with open('dist/README.md', 'r') as fdesc:
with open("dist/README.md", "r") as fdesc:
body = markdown2.markdown(fdesc.read(), extras=extras)
output = f"""<!DOCTYPE html>
@ -23,5 +21,5 @@ output = f"""<!DOCTYPE html>
</html>
"""
with open('dist/index.html', 'w') as fdesc:
with open("dist/index.html", "w") as fdesc:
fdesc.write(output)

View file

@ -1,81 +0,0 @@
#!/usr/bin/env bash
source .env.default
source .env
function log() {
echo -e "\033[33m$@\033[0m"
}
function api_call {
curl -s -H "X-Api-Key: $RAPID7_API_KEY" "https://us.api.insight.rapid7.com/opendata/studies/$1/"
}
function get_timestamp { # study, dataset
study="$1"
dataset="$2"
if [ -z "$RAPID7_API_KEY" ]
then
line=$(curl -s "https://opendata.rapid7.com/$study/" | grep "href=\".\+-$dataset.json.gz\"" | head -1)
echo "$line" | cut -d'"' -f2 | cut -d'/' -f3 | cut -d'-' -f4
else
filename=$(api_call "$study" | jq '.sonarfile_set[]' -r | grep "${dataset}.json.gz" | sort | tail -1)
echo $filename | cut -d'-' -f4
fi
}
function get_download_url { # study, dataset
study="$1"
dataset="$2"
if [ -z "$RAPID7_API_KEY" ]
then
line=$(curl -s "https://opendata.rapid7.com/$study/" | grep "href=\".\+-$dataset.json.gz\"" | head -1)
echo "https://opendata.rapid7.com$(echo "$line" | cut -d'"' -f2)"
else
filename=$(api_call "$study" | jq '.sonarfile_set[]' -r | grep "${dataset}.json.gz" | sort | tail -1)
api_call "$study/$filename/download" | jq '.url' -r
fi
}
function feed_rapid7 { # study, dataset, rule_file, ./feed_dns args
# The dataset will be imported if:
# none of this dataset was ever imported
# or
# the last dataset imported is older than the one to be imported
# or
# the rule_file is newer than when the last dataset was imported
#
# (note the difference between the age oft the dataset itself and
# the date when it is imported)
study="$1"
dataset="$2"
rule_file="$3"
shift; shift; shift
new_ts="$(get_timestamp $study $dataset)"
old_ts_file="last_updates/rapid7_${study}_${dataset}.txt"
if [ -f "$old_ts_file" ]
then
old_ts=$(cat "$old_ts_file")
else
old_ts="0"
fi
if [ $new_ts -gt $old_ts ] || [ $rule_file -nt $old_ts_file ]
then
link="$(get_download_url $study $dataset)"
log "Reading $dataset dataset from $link ($old_ts -> $new_ts)…"
[ $SINGLE_PROCESS -eq 1 ] && EXTRA_ARGS="--single-process"
curl -L "$link" | gunzip | ./feed_dns.py rapid7 $@ $EXTRA_ARGS
if [ $? -eq 0 ]
then
echo $new_ts > $old_ts_file
fi
else
log "Skipping $dataset as there is no new version since $old_ts"
fi
}
# feed_rapid7 sonar.rdns_v2 rdns rules_asn/first-party.list
feed_rapid7 sonar.fdns_v2 fdns_a rules_asn/first-party.list --ip4-cache "$CACHE_SIZE"
# feed_rapid7 sonar.fdns_v2 fdns_aaaa rules_asn/first-party.list --ip6-cache "$CACHE_SIZE"
feed_rapid7 sonar.fdns_v2 fdns_cname rules/first-party.list

requirements.txt (new file, 4 changed lines)
View file

@ -0,0 +1,4 @@
coloredlogs>=10
markdown2>=2.4<3
numpy>=1.21<2
python-abp>=0.2<0.3

View file

@ -12,10 +12,11 @@ storetail.io
# Keyade
keyade.com
# Adobe Experience Cloud
# https://experienceleague.adobe.com/docs/analytics/implementation/vars/config-vars/trackingserversecure.html?lang=en#ssl-tracking-server-in-adobe-experience-platform-launch
omtrdc.net
2o7.net
# ThreatMetrix
online-metrix.net
data.adobedc.net
sc.adobedc.net
# Webtrekk
wt-eu02.net
webtrekk.net
@ -36,10 +37,10 @@ a88045584548111e997c60ac8a4ec150-1610510072.eu-central-1.elb.amazonaws.com
afc4d9aa2a91d11e997c60ac8a4ec150-2082092489.eu-central-1.elb.amazonaws.com
# A8
trck.a8.net
# Ebis
# AD EBiS
# https://prtimes.jp/main/html/rd/p/000000215.000009812.html
ebis.ne.jp
# Geniesspv
# GENIEE
genieesspv.jp
# SP-Prod
sp-prod.net
@ -55,3 +56,36 @@ extole.com
hs.eloqua.com
# segment.com
xid.segment.com
# exponea.com
exponea.com
# adclear.net
adclear.net
# contentsfeed.com
contentsfeed.com
# postaffiliatepro.com
postaffiliatepro.com
# Sugar Market (Salesfusion)
msgapp.com
# Exactag
exactag.com
# GMO Internet Group
ad-cloud.jp
# Pardot
pardot.com
# Fathom
# https://usefathom.com/docs/settings/custom-domains
starman.fathomdns.com
# Lead Forensics
# https://www.reddit.com/r/pihole/comments/g7qv3e/leadforensics_tracking_domains_blacklist/
# No real-world data but the website doesn't hide what it does
ghochv3eng.trafficmanager.net
# Branch.io
thirdparty.bnc.lt
# Plausible.io
custom.plausible.io
# DataUnlocker
# Bit different as it is a proxy to non first-party trackers scripts
# but it fits I guess.
smartproxy.dataunlocker.com
# SAS
ci360.sas.com

View file

@ -4,8 +4,6 @@ AS50234
AS44788
AS19750
AS55569
# ThreatMetrix
AS30286
# Webtrekk
AS60164
# Act-On Software

View file

@ -57,7 +57,11 @@ if __name__ == "__main__":
perc_all = (100 * pass_all / count_all) if count_all else 100
perc_den = (100 * pass_den / count_den) if count_den else 100
log.info(
"%s: Entries %d/%d (%.2f%%) | Allow %d/%d (%.2f%%) | Deny %d/%d (%.2f%%)",
(
"%s: Entries %d/%d (%.2f%%)"
" | Allow %d/%d (%.2f%%)"
"| Deny %d/%d (%.2f%%)"
),
filename,
pass_ent,
count_ent,

View file

@ -1,7 +1,6 @@
url,allow,deny,comment
https://www.red-by-sfr.fr/,static.s-sfr.fr,nrg.red-by-sfr.fr,Eulerian
https://www.cbc.ca/,,smetrics.cbc.ca,2o7 | Ominuture | Adobe Experience Cloud
https://www.discover.com/,,content.discover.com,ThreatMetrix
https://www.mytoys.de/,,web.mytoys.de,Webtrekk
https://www.baur.de/,,tp.baur.de,Otto Group
https://www.liligo.com/,,compare.liligo.com,???
@ -9,9 +8,21 @@ https://www.boulanger.com/,,tag.boulanger.fr,TagCommander
https://www.airfrance.fr/FR/,,tk.airfrance.fr,Wizaly
https://www.vsgamers.es/,,marketing.net.vsgamers.es,Affex
https://www.vacansoleil.fr/,,tdep.vacansoleil.fr,TraceDock
https://www.ozmall.co.jp/,,js.enhance.co.jp,Genieesspv
https://www.ozmall.co.jp/,,js.enhance.co.jp,GENIEE
https://www.thetimes.co.uk/,,cmp.thetimes.co.uk,SP-Prod
https://agilent.com/,,seahorseinfo.agilent.com,Act-On Software
https://halifax.co.uk/,,cem.halifax.co.uk,eum-appdynamics.com
https://www.reallygoodstuff.com/,,refer.reallygoodstuff.com,Extole
https://unity.com/,,eloqua-trackings.unity.com,Eloqua
https://www.notino.gr/,,api.campaigns.notino.com,Exponea
https://www.mytoys.de/,,0815.mytoys.de.adclear.net
https://www.imbc.com/,,ads.imbc.com.contentsfeed.com
https://www.cbdbiocare.com/,,affiliate.cbdbiocare.com,postaffiliatepro.com
https://www.seatadvisor.com/,,marketing.seatadvisor.com,Sugar Market (Salesfusion)
https://www.tchibo.de/,,tagm.tchibo.de,Exactag
https://www.bouygues-immobilier.com/,,go.bouygues-immobilier.fr,Pardot
https://caddyserver.com/,,mule.caddysever.com,Fathom
Reddit.com mail notifications,,click.redditmail.com,Branch.io
https://www.phpliveregex.com/,,yolo.phpliveregex.xom,Plausible.io
https://www.earthclassmail.com/,,1avhg3kanx9.www.earthclassmail.com,DataUnlocker
https://paulfredrick.com/,,execution-ci360.paulfredrick.com,SAS
