Merge branch 'newworkflow'

Geoffrey Frogeye 2019-12-20 17:18:42 +01:00
commit cd46b39756
Signed by: geoffrey
GPG key ID: D8A7ECA00A8CD3DD
28 changed files with 1659 additions and 698 deletions

3
.gitignore vendored
View file

@ -1,3 +1,2 @@
*.log
nameservers
nameservers.head
*.p

169
README.md
View file

@ -1,98 +1,133 @@
# eulaurarien
Generates a host list of first-party trackers for ad-blocking.
This program is able to generate a list of every hostname that is a DNS redirection to anything in a given list of DNS zones and IP networks.
The latest list is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md) (learn about first-party trackers by following this link).
**DISCLAIMER:** I'm by no means an expert on this subject, so my vocabulary or other stuff might be wrong. Use at your own risk.
If you want to contribute but don't want to create an account on this forge, contact me the way you like: <https://geoffrey.frogeye.fr>
## What's a first-party tracker?
## How does this work
Traditionally, websites load tracker scripts directly.
For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users.
In order to block those, one can simply block the host `trackercompany.com`.
This program takes as input:
However, to circumvent this easy block, tracker companies made the websites using them load trackers from `somethingirelevant.website1.com`.
The latter is a DNS redirection to `website1.trackercompany.com`, pointing directly to a server serving the tracking script.
Those are the first-party trackers.
- Lists of hostnames to match
- Lists of DNS zones to match (a domain and its subdomains)
- Lists of IP addresses / IP networks to match
- Lists of Autonomous System numbers to match
- An enormous quantity of DNS records
Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since:
It will output every hostname that is a DNS redirection to any item in the lists provided.
1. Most ad-blockers don't support wildcards
2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com`
DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).
So the only solution is to block every known `somethingirelevant.website1.com`-like subdomain, which is a lot of them.
That's where this script comes in: it generates a list of such subdomains.
## How does this script work
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
It takes as input a list of websites with trackers included.
So far, this list is manually generated from the list of clients of such first-party trackers
(later we should use a general list of websites to be more exhaustive).
It opens each of those websites (just the homepage) in a web browser, and records the domains of the network requests the page makes.
Additionally, or alternatively, you can feed the script some browsing history and get domains from there.
It then finds the DNS redirections of those domains, and compares them with regexes of known tracking domains.
It finally outputs the matching ones.
## Requirements
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
If you just want the list, you can find an already-built one in the releases; the following is only needed to build it yourself.
- Bash
- [Python 3.4+](https://www.python.org/)
- [progressbar2](https://pypi.org/project/progressbar2/)
- dnspython
- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up)
(if you don't want to collect the subdomains, you can skip the following)
- Firefox
- Selenium
- seleniumwire
Those subdomains can either be provided as is, come from the [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), be extracted from your browsing history, or come from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that).
## Usage
> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done with this.
Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md).
This is only if you want to build the list yourself.
If you just want to use the list, the latest build is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
It was built using additional sources not included in this repository for privacy reasons.
The following is for the people wanting to build their own list.
### Add personal sources
### Requirements
The list of websites provided in this script is by no means exhaustive,
so adding your own browsing history will help create a better list.
Depending on the sources you'll be using to generate the list, you'll need to install some of the following:
- [Bash](https://www.gnu.org/software/bash/bash.html)
- [Coreutils](https://www.gnu.org/software/coreutils/)
- [curl](https://curl.haxx.se)
- [pv](http://www.ivarch.com/programs/pv.shtml)
- [Python 3.4+](https://www.python.org/)
- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source)
### Create a new database
The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASNs, IPs, hostnames, zones…) and every entity leading to them.
For now there's no way to remove data from it, so here's the command to recreate it: `./db.py --initialize`.
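To peek at the resulting file, you can also use the `database` module from this commit directly. A minimal sketch, assuming it is run from the repository root (next to `database.py`):

```python
#!/usr/bin/env python3
# Minimal sketch: open (or create) blocking.p and print a per-type summary.
import database

db = database.Database()   # loads blocking.p, initializes it if absent or outdated
print(db.count_records())  # e.g. "asns: …, hostnames: …, zones: …" (empty on a fresh database)
db.save()
```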
### Gather external sources
External sources are not stored in this repository.
You'll need to fetch them by running `./fetch_resources.sh`.
Those include:
- Third-party trackers lists
- TLD lists (used to test the validity of hostnames)
- List of public DNS resolvers (for DNS resolving from subdomains)
- Top 1M subdomains
### Import rules into the database
You need to put the lists of rules for matching in the different subfolders:
- `rules`: Lists of DNS zones
- `rules_ip`: Lists of IP networks (for IP addresses append `/32`)
- `rules_asn`: Lists of Autonomous System numbers (their IP ranges will be deduced from them)
- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted)
- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists
See the provided examples for syntax.
In each folder:
- `first-party.ext` will be the only files considered for the first-party variant of the list
- `*.cache.ext` are from external sources, and thus might be deleted / overwritten
- `*.custom.ext` are for sources that you don't want committed
Then, run `./import_rules.sh`.
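For reference, here is a minimal sketch of what one imported rule boils down to, using the `database` module from this commit (the zone and network below are made-up examples, and `temp/all_tld.list` from `./fetch_resources.sh` must exist for hostname validation):

```python
#!/usr/bin/env python3
# Minimal sketch: import one first-party zone rule and one IP network rule.
import time
import database

db = database.Database()
source = database.RuleFirstPath()   # use database.RuleMultiPath() for non-first-party rules
now = int(time.time())
db.set_zone('trackercompany.com', updated=now, source=source)     # as one line of rules/first-party.list would be
db.set_ip4network('203.0.113.0/24', updated=now, source=source)   # as one line of rules_ip/first-party.txt would be
db.save()
```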
### Add subdomains
If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive),
the top 1M subdomains provided might not be enough.
You can add them into the `subdomains` folder.
It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.
#### Add personal sources
Adding your own browsing history will help create a subdomain list better suited to your needs.
Here are reference commands for possible sources:
- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`
### Collect subdomains from websites
#### Collect subdomains from websites
Just run `collect_subdomain.sh`.
You can add the website URLs into the `websites` folder.
It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.
Then, run `collect_subdomain.sh`.
This is a long step, and might be memory-intensive from time to time.
This step is optional if you already added personal sources.
Alternatively, you can just download the list of subdomains used to generate the official block list here: <https://hostfiles.frogeye.fr/from_websites.cache.list> (put it in the `subdomains` folder).
> **Note:** For first-party tracking, a list of subdomains issued from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list>
### Extract tracking domains
### Resolve DNS records
Make sure your system is configured with a DNS server without limitation.
Then, run `filter_subdomain.sh`.
The files you need will be in the folder `dist`.
Once you've added subdomains, you'll need to resolve them to get their DNS records.
The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory.
## Contributing
Then, run `./resolve_subdomains.sh`.
Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet count.
### Adding websites
> Some VPS providers might detect this as a DDoS attack and cut the network access.
> Some Wi-Fi connections can be rendered unusable for other uses, some routers might cease to work.
> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow Ethernet link (Raspberry Pi < 4).
Just add the URL to the relevant list: `websites/<source>.list`.
The DNS records will automatically be imported into the database.
If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.
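Under the hood, importing one resolved record boils down to the following (a sketch using the `database` module from this commit; the hostnames are the made-up examples from the introduction, and `temp/all_tld.list` must exist for hostname validation):

```python
#!/usr/bin/env python3
# Minimal sketch of what feed_dns.py does with one resolved CNAME record:
# if the record's value matches something already in the database (here the
# trackercompany.com zone rule), the record's name is stored as matching too.
import time
import database

db = database.Database()
updated = int(time.time())
# CNAME record: somethingirelevant.website1.com -> website1.trackercompany.com
for source in db.get_domain('website1.trackercompany.com'):
    db.set_hostname('somethingirelevant.website1.com', updated=updated, source=source)
db.save()
```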
### Adding first-party trackers regex
### Import DNS records from Rapid7
Just run `./import_rapid7.sh`.
This will download about 35 GiB of data, but only the matching records will be stored (a few MiB for the tracking rules).
Note that the download speed will most likely be limited by database operation throughput (fast RAM will help).
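For the curious: rather than a full JSON parser, the Rapid7 reader in `feed_dns.py` splits each line on double quotes, which is much faster over billions of records. A sketch with a made-up record following the four-field FDNS layout:

```python
#!/usr/bin/env python3
# Minimal sketch of the quote-splitting used by Rapid7Parser in feed_dns.py.
line = ('{"timestamp":"1576800000","name":"somethingirelevant.website1.com",'
        '"type":"cname","value":"website1.trackercompany.com"}')
split = line.split('"')
data = {split[k]: split[k + 2] for k in range(1, 14, 4)}
print(data['type'], data['name'], '->', data['value'])
# cname somethingirelevant.website1.com -> website1.trackercompany.com
```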
### Export the lists
For the tracking lists, use `./export_lists.sh`; the output will be in the `dist` folder (please change the links before distributing them).
For other purposes, tinker with the `./export.py` program.
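`./export.py` is a thin wrapper around `Database.list_records()`; a minimal sketch of the recommended first-party, hostnames-only variant (the output path is just an example):

```python
#!/usr/bin/env python3
# Minimal sketch: dump first-party matching hostnames, one per line.
import database

db = database.Database()
with open('dist/firstparty-trackers.txt', 'w') as out:   # assumes the dist folder exists
    for hostname in db.list_records(first_party_only=True, hostnames_only=True):
        print(hostname, file=out)
```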
Just add them to `regexes.py`.

739
database.py Normal file
View file

@ -0,0 +1,739 @@
#!/usr/bin/env python3
"""
Utility functions to interact with the database.
"""
import typing
import time
import logging
import coloredlogs
import pickle
import numpy
import math
TLD_LIST: typing.Set[str] = set()
coloredlogs.install(
level='DEBUG',
fmt='%(asctime)s %(name)s %(levelname)s %(message)s'
)
Asn = int
Timestamp = int
Level = int
class Path():
# FP add boolean here
pass
class RulePath(Path):
def __str__(self) -> str:
return '(rule)'
class RuleFirstPath(RulePath):
def __str__(self) -> str:
return '(first-party rule)'
class RuleMultiPath(RulePath):
def __str__(self) -> str:
return '(multi-party rule)'
class DomainPath(Path):
def __init__(self, parts: typing.List[str]):
self.parts = parts
def __str__(self) -> str:
return '?.' + Database.unpack_domain(self)
class HostnamePath(DomainPath):
def __str__(self) -> str:
return Database.unpack_domain(self)
class ZonePath(DomainPath):
def __str__(self) -> str:
return '*.' + Database.unpack_domain(self)
class AsnPath(Path):
def __init__(self, asn: Asn):
self.asn = asn
def __str__(self) -> str:
return Database.unpack_asn(self)
class Ip4Path(Path):
def __init__(self, value: int, prefixlen: int):
self.value = value
self.prefixlen = prefixlen
def __str__(self) -> str:
return Database.unpack_ip4network(self)
class Match():
def __init__(self) -> None:
self.source: typing.Optional[Path] = None
self.updated: int = 0
self.dupplicate: bool = False
# Cache
self.level: int = 0
self.first_party: bool = False
self.references: int = 0
def active(self, first_party: bool = None) -> bool:
if self.updated == 0 or (first_party and not self.first_party):
return False
return True
class AsnNode(Match):
def __init__(self) -> None:
Match.__init__(self)
self.name = ''
class DomainTreeNode():
def __init__(self) -> None:
self.children: typing.Dict[str, DomainTreeNode] = dict()
self.match_zone = Match()
self.match_hostname = Match()
class IpTreeNode(Match):
def __init__(self) -> None:
Match.__init__(self)
self.zero: typing.Optional[IpTreeNode] = None
self.one: typing.Optional[IpTreeNode] = None
Node = typing.Union[DomainTreeNode, IpTreeNode, AsnNode]
MatchCallable = typing.Callable[[Path,
Match],
typing.Any]
class Profiler():
def __init__(self) -> None:
self.log = logging.getLogger('profiler')
self.time_last = time.perf_counter()
self.time_step = 'init'
self.time_dict: typing.Dict[str, float] = dict()
self.step_dict: typing.Dict[str, int] = dict()
def enter_step(self, name: str) -> None:
now = time.perf_counter()
try:
self.time_dict[self.time_step] += now - self.time_last
self.step_dict[self.time_step] += int(name != self.time_step)
except KeyError:
self.time_dict[self.time_step] = now - self.time_last
self.step_dict[self.time_step] = 1
self.time_step = name
self.time_last = time.perf_counter()
def profile(self) -> None:
self.enter_step('profile')
total = sum(self.time_dict.values())
for key, secs in sorted(self.time_dict.items(), key=lambda t: t[1]):
times = self.step_dict[key]
self.log.debug(f"{key:<20}: {times:9d} × {secs/times:5.3e} "
f"= {secs:9.2f} s ({secs/total:7.2%}) ")
self.log.debug(f"{'total':<20}: "
f"{total:9.2f} s ({1:7.2%})")
class Database(Profiler):
VERSION = 18
PATH = "blocking.p"
def initialize(self) -> None:
self.log.warning(
"Creating database version: %d ",
Database.VERSION)
# Dummy match objects that everything refers to
self.rules: typing.List[Match] = list()
for first_party in (False, True):
m = Match()
m.updated = 1
m.level = 0
m.first_party = first_party
self.rules.append(m)
self.domtree = DomainTreeNode()
self.asns: typing.Dict[Asn, AsnNode] = dict()
self.ip4tree = IpTreeNode()
def load(self) -> None:
self.enter_step('load')
try:
with open(self.PATH, 'rb') as db_fdsec:
version, data = pickle.load(db_fdsec)
if version == Database.VERSION:
self.rules, self.domtree, self.asns, self.ip4tree = data
return
self.log.warning(
"Outdated database version found: %d, "
"it will be rebuilt.",
version)
except (TypeError, AttributeError, EOFError):
self.log.error(
"Corrupt (or heavily outdated) database found, "
"it will be rebuilt.")
except FileNotFoundError:
pass
self.initialize()
def save(self) -> None:
self.enter_step('save')
with open(self.PATH, 'wb') as db_fdsec:
data = self.rules, self.domtree, self.asns, self.ip4tree
pickle.dump((self.VERSION, data), db_fdsec)
self.profile()
def __init__(self) -> None:
Profiler.__init__(self)
self.log = logging.getLogger('db')
self.load()
self.ip4cache_shift: int = 32
self.ip4cache = numpy.ones(1)
def _set_ip4cache(self, path: Path, _: Match) -> None:
assert isinstance(path, Ip4Path)
self.enter_step('set_ip4cache')
mini = path.value >> self.ip4cache_shift
maxi = (path.value + 2**(32-path.prefixlen)) >> self.ip4cache_shift
if mini == maxi:
self.ip4cache[mini] = True
else:
self.ip4cache[mini:maxi] = True
def fill_ip4cache(self, max_size: int = 512*1024**2) -> None:
"""
Size in bytes
"""
if max_size > 2**32/8:
self.log.warning("Allocating more than 512 MiB of RAM for "
"the Ip4 cache is not necessary.")
max_cache_width = int(math.log2(max(1, max_size*8)))
cache_width = min(32, max_cache_width)  # at most one cache bit per /32
self.ip4cache_shift = 32-cache_width
cache_size = 2**cache_width
self.ip4cache = numpy.zeros(cache_size, dtype=numpy.bool)
for _ in self.exec_each_ip4(self._set_ip4cache):
pass
@staticmethod
def populate_tld_list() -> None:
with open('temp/all_tld.list', 'r') as tld_fdesc:
for tld in tld_fdesc:
tld = tld.strip()
TLD_LIST.add(tld)
@staticmethod
def validate_domain(path: str) -> bool:
if len(path) > 255:
return False
splits = path.split('.')
if not TLD_LIST:
Database.populate_tld_list()
if splits[-1] not in TLD_LIST:
return False
for split in splits:
if not 1 <= len(split) <= 63:
return False
return True
@staticmethod
def pack_domain(domain: str) -> DomainPath:
return DomainPath(domain.split('.')[::-1])
@staticmethod
def unpack_domain(domain: DomainPath) -> str:
return '.'.join(domain.parts[::-1])
@staticmethod
def pack_asn(asn: str) -> AsnPath:
asn = asn.upper()
if asn.startswith('AS'):
asn = asn[2:]
return AsnPath(int(asn))
@staticmethod
def unpack_asn(asn: AsnPath) -> str:
return f'AS{asn.asn}'
@staticmethod
def validate_ip4address(path: str) -> bool:
splits = path.split('.')
if len(splits) != 4:
return False
for split in splits:
try:
if not 0 <= int(split) <= 255:
return False
except ValueError:
return False
return True
@staticmethod
def pack_ip4address(address: str) -> Ip4Path:
addr = 0
for split in address.split('.'):
addr = (addr << 8) + int(split)
return Ip4Path(addr, 32)
@staticmethod
def unpack_ip4address(address: Ip4Path) -> str:
addr = address.value
assert address.prefixlen == 32
octets: typing.List[int] = list()
octets = [0] * 4
for o in reversed(range(4)):
octets[o] = addr & 0xFF
addr >>= 8
return '.'.join(map(str, octets))
@staticmethod
def validate_ip4network(path: str) -> bool:
# A bit generous but ok for our usage
splits = path.split('/')
if len(splits) != 2:
return False
if not Database.validate_ip4address(splits[0]):
return False
try:
if not 0 <= int(splits[1]) <= 32:
return False
except ValueError:
return False
return True
@staticmethod
def pack_ip4network(network: str) -> Ip4Path:
address, prefixlen_str = network.split('/')
prefixlen = int(prefixlen_str)
addr = Database.pack_ip4address(address)
addr.prefixlen = prefixlen
return addr
@staticmethod
def unpack_ip4network(network: Ip4Path) -> str:
addr = network.value
octets: typing.List[int] = list()
octets = [0] * 4
for o in reversed(range(4)):
octets[o] = addr & 0xFF
addr >>= 8
return '.'.join(map(str, octets)) + '/' + str(network.prefixlen)
def get_match(self, path: Path) -> Match:
if isinstance(path, RuleMultiPath):
return self.rules[0]
elif isinstance(path, RuleFirstPath):
return self.rules[1]
elif isinstance(path, AsnPath):
return self.asns[path.asn]
elif isinstance(path, DomainPath):
dicd = self.domtree
for part in path.parts:
dicd = dicd.children[part]
if isinstance(path, HostnamePath):
return dicd.match_hostname
elif isinstance(path, ZonePath):
return dicd.match_zone
else:
raise ValueError
elif isinstance(path, Ip4Path):
dici = self.ip4tree
for i in range(31, 31-path.prefixlen, -1):
bit = (path.value >> i) & 0b1
dici_next = dici.one if bit else dici.zero
if not dici_next:
raise IndexError
dici = dici_next
return dici
else:
raise ValueError
def exec_each_asn(self,
callback: MatchCallable,
) -> typing.Any:
for asn in self.asns:
match = self.asns[asn]
if match.active():
c = callback(
AsnPath(asn),
match,
)
try:
yield from c
except TypeError: # not iterable
pass
def exec_each_domain(self,
callback: MatchCallable,
_dic: DomainTreeNode = None,
_par: DomainPath = None,
) -> typing.Any:
_dic = _dic or self.domtree
_par = _par or DomainPath([])
if _dic.match_hostname.active():
c = callback(
HostnamePath(_par.parts),
_dic.match_hostname,
)
try:
yield from c
except TypeError: # not iterable
pass
if _dic.match_zone.active():
c = callback(
ZonePath(_par.parts),
_dic.match_zone,
)
try:
yield from c
except TypeError: # not iterable
pass
for part in _dic.children:
dic = _dic.children[part]
yield from self.exec_each_domain(
callback,
_dic=dic,
_par=DomainPath(_par.parts + [part])
)
def exec_each_ip4(self,
callback: MatchCallable,
_dic: IpTreeNode = None,
_par: Ip4Path = None,
) -> typing.Any:
_dic = _dic or self.ip4tree
_par = _par or Ip4Path(0, 0)
if _dic.active():
c = callback(
_par,
_dic,
)
try:
yield from c
except TypeError: # not iterable
pass
# 0
pref = _par.prefixlen + 1
dic = _dic.zero
if dic:
# addr0 = _par.value & (0xFFFFFFFF ^ (1 << (32-pref)))
# assert addr0 == _par.value
addr0 = _par.value
yield from self.exec_each_ip4(
callback,
_dic=dic,
_par=Ip4Path(addr0, pref)
)
# 1
dic = _dic.one
if dic:
addr1 = _par.value | (1 << (32-pref))
# assert addr1 != _par.value
yield from self.exec_each_ip4(
callback,
_dic=dic,
_par=Ip4Path(addr1, pref)
)
def exec_each(self,
callback: MatchCallable,
) -> typing.Any:
yield from self.exec_each_domain(callback)
yield from self.exec_each_ip4(callback)
yield from self.exec_each_asn(callback)
def update_references(self) -> None:
# Should be correctly calculated normally,
# keeping this just in case
def reset_references_cb(path: Path,
match: Match
) -> None:
match.references = 0
for _ in self.exec_each(reset_references_cb):
pass
def increment_references_cb(path: Path,
match: Match
) -> None:
if match.source:
source = self.get_match(match.source)
source.references += 1
for _ in self.exec_each(increment_references_cb):
pass
def prune(self, before: int, base_only: bool = False) -> None:
raise NotImplementedError
def explain(self, path: Path) -> str:
match = self.get_match(path)
if isinstance(match, AsnNode):
string = f'{path} ({match.name}) #{match.references}'
else:
string = f'{path} #{match.references}'
if match.source:
string += f' ← {self.explain(match.source)}'
return string
def list_records(self,
first_party_only: bool = False,
end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
explain: bool = False,
) -> typing.Iterable[str]:
def export_cb(path: Path, match: Match
) -> typing.Iterable[str]:
if first_party_only and not match.first_party:
return
if end_chain_only and match.references > 0:
return
if no_dupplicates and match.dupplicate:
return
if rules_only and match.level > 1:
return
if hostnames_only and not isinstance(path, HostnamePath):
return
if explain:
yield self.explain(path)
else:
yield str(path)
yield from self.exec_each(export_cb)
def count_records(self,
first_party_only: bool = False,
end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
) -> str:
memo: typing.Dict[str, int] = dict()
def count_records_cb(path: Path, match: Match) -> None:
if first_party_only and not match.first_party:
return
if end_chain_only and match.references > 0:
return
if no_dupplicates and match.dupplicate:
return
if rules_only and match.level > 1:
return
if hostnames_only and not isinstance(path, HostnamePath):
return
try:
memo[path.__class__.__name__] += 1
except KeyError:
memo[path.__class__.__name__] = 1
for _ in self.exec_each(count_records_cb):
pass
split: typing.List[str] = list()
for key, value in sorted(memo.items(), key=lambda s: s[0]):
split.append(f'{key[:-4].lower()}s: {value}')
return ', '.join(split)
def get_domain(self, domain_str: str) -> typing.Iterable[DomainPath]:
self.enter_step('get_domain_pack')
domain = self.pack_domain(domain_str)
self.enter_step('get_domain_brws')
dic = self.domtree
depth = 0
for part in domain.parts:
if dic.match_zone.active():
self.enter_step('get_domain_yield')
yield ZonePath(domain.parts[:depth])
self.enter_step('get_domain_brws')
if part not in dic.children:
return
dic = dic.children[part]
depth += 1
if dic.match_zone.active():
self.enter_step('get_domain_yield')
yield ZonePath(domain.parts)
if dic.match_hostname.active():
self.enter_step('get_domain_yield')
yield HostnamePath(domain.parts)
def get_ip4(self, ip4_str: str) -> typing.Iterable[Path]:
self.enter_step('get_ip4_pack')
ip4 = self.pack_ip4address(ip4_str)
self.enter_step('get_ip4_cache')
if not self.ip4cache[ip4.value >> self.ip4cache_shift]:
return
self.enter_step('get_ip4_brws')
dic = self.ip4tree
for i in range(31, 31-ip4.prefixlen, -1):
bit = (ip4.value >> i) & 0b1
if dic.active():
self.enter_step('get_ip4_yield')
yield Ip4Path(ip4.value >> (i+1) << (i+1), 31-i)
self.enter_step('get_ip4_brws')
next_dic = dic.one if bit else dic.zero
if next_dic is None:
return
dic = next_dic
if dic.active():
self.enter_step('get_ip4_yield')
yield ip4
def _set_match(self,
match: Match,
updated: int,
source: Path,
source_match: Match = None,
dupplicate: bool = False,
) -> None:
# source_match is in parameters because most of the time
# its parent function needs it too,
# so it can pass it to save a traversal
source_match = source_match or self.get_match(source)
new_level = source_match.level + 1
if updated > match.updated or new_level < match.level \
or source_match.first_party > match.first_party:
# NOTE FP and level of matches referencing this one
# won't be updated until run or prune
if match.source:
old_source = self.get_match(match.source)
old_source.references -= 1
match.updated = updated
match.level = new_level
match.first_party = source_match.first_party
match.source = source
source_match.references += 1
match.dupplicate = dupplicate
def _set_domain(self,
hostname: bool,
domain_str: str,
updated: int,
source: Path) -> None:
self.enter_step('set_domain_val')
if not Database.validate_domain(domain_str):
raise ValueError(f"Invalid domain: {domain_str}")
self.enter_step('set_domain_pack')
domain = self.pack_domain(domain_str)
self.enter_step('set_domain_fp')
source_match = self.get_match(source)
is_first_party = source_match.first_party
self.enter_step('set_domain_brws')
dic = self.domtree
dupplicate = False
for part in domain.parts:
if part not in dic.children:
dic.children[part] = DomainTreeNode()
dic = dic.children[part]
if dic.match_zone.active(is_first_party):
dupplicate = True
if hostname:
match = dic.match_hostname
else:
match = dic.match_zone
self._set_match(
match,
updated,
source,
source_match=source_match,
dupplicate=dupplicate,
)
def set_hostname(self,
*args: typing.Any, **kwargs: typing.Any
) -> None:
self._set_domain(True, *args, **kwargs)
def set_zone(self,
*args: typing.Any, **kwargs: typing.Any
) -> None:
self._set_domain(False, *args, **kwargs)
def set_asn(self,
asn_str: str,
updated: int,
source: Path) -> None:
self.enter_step('set_asn')
path = self.pack_asn(asn_str)
if path.asn in self.asns:
match = self.asns[path.asn]
else:
match = AsnNode()
self.asns[path.asn] = match
self._set_match(
match,
updated,
source,
)
def _set_ip4(self,
ip4: Ip4Path,
updated: int,
source: Path) -> None:
self.enter_step('set_ip4_fp')
source_match = self.get_match(source)
is_first_party = source_match.first_party
self.enter_step('set_ip4_brws')
dic = self.ip4tree
dupplicate = False
for i in range(31, 31-ip4.prefixlen, -1):
bit = (ip4.value >> i) & 0b1
next_dic = dic.one if bit else dic.zero
if next_dic is None:
next_dic = IpTreeNode()
if bit:
dic.one = next_dic
else:
dic.zero = next_dic
dic = next_dic
if dic.active(is_first_party):
dupplicate = True
self._set_match(
dic,
updated,
source,
source_match=source_match,
dupplicate=dupplicate,
)
self._set_ip4cache(ip4, dic)
def set_ip4address(self,
ip4address_str: str,
*args: typing.Any, **kwargs: typing.Any
) -> None:
self.enter_step('set_ip4add_val')
if not Database.validate_ip4address(ip4address_str):
raise ValueError(f"Invalid ip4address: {ip4address_str}")
self.enter_step('set_ip4add_pack')
ip4 = self.pack_ip4address(ip4address_str)
self._set_ip4(ip4, *args, **kwargs)
def set_ip4network(self,
ip4network_str: str,
*args: typing.Any, **kwargs: typing.Any
) -> None:
self.enter_step('set_ip4net_val')
if not Database.validate_ip4network(ip4network_str):
raise ValueError(f"Invalid ip4network: {ip4network_str}")
self.enter_step('set_ip4net_pack')
ip4 = self.pack_ip4network(ip4network_str)
self._set_ip4(ip4, *args, **kwargs)

46
db.py Executable file
View file

@ -0,0 +1,46 @@
#!/usr/bin/env python3
import argparse
import database
import time
import os
if __name__ == '__main__':
# Parsing arguments
parser = argparse.ArgumentParser(
description="Database operations")
parser.add_argument(
'-i', '--initialize', action='store_true',
help="Reconstruct the whole database")
parser.add_argument(
'-p', '--prune', action='store_true',
help="Remove old entries from database")
parser.add_argument(
'-b', '--prune-base', action='store_true',
help="With --prune, only prune base rules "
"(the ones added by ./feed_rules.py)")
parser.add_argument(
'-s', '--prune-before', type=int,
default=(int(time.time()) - 60*60*24*31*6),
help="With --prune, only rules updated before "
"this UNIX timestamp will be deleted")
parser.add_argument(
'-r', '--references', action='store_true',
help="DEBUG: Update the reference count")
args = parser.parse_args()
if not args.initialize:
DB = database.Database()
else:
if os.path.isfile(database.Database.PATH):
os.unlink(database.Database.PATH)
DB = database.Database()
DB.enter_step('main')
if args.prune:
DB.prune(before=args.prune_before, base_only=args.prune_base)
if args.references:
DB.update_references()
DB.save()

74
dist/README.md vendored Normal file
View file

@ -0,0 +1,74 @@
# Geoffrey Frogeye's block list of first-party trackers
## What's a first-party tracker?
A tracker is a script put on many websites to gather information about the visitor.
They can be used for multiple reasons: statistics, risk management, marketing, ad serving…
In any case, they are a threat to Internet users' privacy and many may want to block them.
Traditionally, trackers are served from a third party.
For example, `website1.com` and `website2.com` both load their tracking script from `https://trackercompany.com/trackerscript.js`.
In order to block those, one can simply block the hostname `trackercompany.com`, which is what most ad blockers do.
However, to circumvent this block, tracker companies made the websites using them load trackers from `somestring.website1.com`.
The latter is a DNS redirection to `website1.trackercompany.com`, pointing directly to an IP address belonging to the tracking company.
Those are called first-party trackers.
In order to block those trackers, ad blockers would need to block every subdomain pointing to anything under `trackercompany.com` or to their network.
Unfortunately, most don't support those blocking methods as they are not DNS-aware, i.e. they only see `somestring.website1.com`.
This list is an inventory of every `somestring.website1.com` found, to allow non-DNS-aware ad blockers to still block first-party trackers.
## List variants
### First-party trackers (recommended)
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/firstparty-trackers.txt>
This list contains every hostname redirecting to [a hand-picked list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/rules/first-party.list).
It should be safe from false-positives.
Don't be afraid of the size of the list, as this is due to the nature of first-party trackers: a single tracker generates at least one hostname per client (typically two).
### First-party only trackers
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/firstparty-only-trackers.txt>
This is the same list as above, albeit not containing the hostnames under the tracking company domains.
This reduces the size of the list, but it no longer protects against third-party tracking on its own.
Use in conjunction with other block lists.
### Multi-party trackers
- Hosts file: <https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/multiparty-trackers.txt>
As first-party trackers usually evolve from third-party trackers, this list contains every hostname redirecting to trackers found in existing lists of third-party trackers (see next section).
Since the latter were not designed with first-party trackers in mind, they are likely to contain false-positives.
On the other hand, they might protect against first-party trackers that we're not aware of / have not yet confirmed.
#### Source of third-party trackers
- [EasyPrivacy](https://easylist.to/easylist/easyprivacy.txt)
(yes, there's only one for now; a lot of existing lists cause a lot of false positives)
### Multi-party only trackers
- Hosts file: <https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/multiparty-only-trackers.txt>
This is the same list as above, albeit not containing the hostnames under the tracking company domains.
This reduces the size of the list, but it no longer protects against third-party tracking on its own.
Use in conjunction with other block lists, especially the ones used to generate this list in the previous section.
## Meta
In case of false positives/negatives, or any other question, contact me the way you like: <https://geoffrey.frogeye.fr>
The software used to generate this list is available here: <https://git.frogeye.fr/geoffrey/eulaurarien>
Some of the first-party trackers included in this list have been found by:
- [Aeris](https://imirhil.fr/)
- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors

64
export.py Executable file
View file

@ -0,0 +1,64 @@
#!/usr/bin/env python3
import database
import argparse
import sys
if __name__ == '__main__':
# Parsing arguments
parser = argparse.ArgumentParser(
description="Export the hostnames rules stored "
"in the Database as plain text")
parser.add_argument(
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
help="Output file, one rule per line")
parser.add_argument(
'-f', '--first-party', action='store_true',
help="Only output rules issued from first-party sources")
parser.add_argument(
'-e', '--end-chain', action='store_true',
help="Only output rules that are not referenced by any other")
parser.add_argument(
'-r', '--rules', action='store_true',
help="Output all kinds of rules, not just hostnames")
parser.add_argument(
'-b', '--base-rules', action='store_true',
help="Output base rules "
"(the ones added by ./feed_rules.py) "
"(implies --rules)")
parser.add_argument(
'-d', '--no-dupplicates', action='store_true',
help="Do not output rules that already match a zone/network rule "
"(e.g. dummy.example.com when there's a zone example.com rule)")
parser.add_argument(
'-x', '--explain', action='store_true',
help="Show the chain of rules leading to one "
"(and the number of references they have)")
parser.add_argument(
'-c', '--count', action='store_true',
help="Show the number of rules per type instead of listing them")
args = parser.parse_args()
DB = database.Database()
if args.count:
assert not args.explain
print(DB.count_records(
first_party_only=args.first_party,
end_chain_only=args.end_chain,
no_dupplicates=args.no_dupplicates,
rules_only=args.base_rules,
hostnames_only=not (args.rules or args.base_rules),
))
else:
for domain in DB.list_records(
first_party_only=args.first_party,
end_chain_only=args.end_chain,
no_dupplicates=args.no_dupplicates,
rules_only=args.base_rules,
hostnames_only=not (args.rules or args.base_rules),
explain=args.explain,
):
print(domain, file=args.output)

98
export_lists.sh Executable file
View file

@ -0,0 +1,98 @@
#!/usr/bin/env bash
function log() {
echo -e "\033[33m$@\033[0m"
}
log "Calculating statistics…"
gen_date=$(date -Isec)
gen_software=$(git describe --tags)
number_websites=$(wc -l < temp/all_websites.list)
number_subdomains=$(wc -l < temp/all_subdomains.list)
number_dns=$(grep '^$' temp/all_resolved.txt | wc -l)
for partyness in {first,multi}
do
if [ $partyness = "first" ]
then
partyness_flags="--first-party"
else
partyness_flags=""
fi
echo "Statistics for ${partyness}-party trackers"
echo "Input rules: $(./export.py --count --base-rules $partyness_flags)"
echo "Subsequent rules: $(./export.py --count --rules $partyness_flags)"
echo "Subsequent rules (no dupplicate): $(./export.py --count --rules --no-dupplicates $partyness_flags)"
echo "Output hostnames: $(./export.py --count $partyness_flags)"
echo "Output hostnames (no dupplicate): $(./export.py --count --no-dupplicates $partyness_flags)"
echo "Output hostnames (end-chain only): $(./export.py --count --end-chain $partyness_flags)"
echo "Output hostnames (no dupplicate, end-chain only): $(./export.py --count --no-dupplicates --end-chain $partyness_flags)"
echo
for trackerness in {trackers,only-trackers}
do
if [ $trackerness = "trackers" ]
then
trackerness_flags=""
else
trackerness_flags="--end-chain --no-dupplicates"
fi
file_list="dist/${partyness}party-${trackerness}.txt"
file_host="dist/${partyness}party-${trackerness}-hosts.txt"
log "Generating lists for variant ${partyness}-party ${trackerness}"
# Real export heeere
./export.py $partyness_flags $trackerness_flags > $file_list
# Sometimes a bit heavy to have the DB open and sort the output
# so this is done in two steps
sort -u $file_list -o $file_list
rules_input=$(./export.py --count --base-rules $partyness_flags)
rules_found=$(./export.py --count --rules $partyness_flags)
rules_output=$(./export.py --count $partyness_flags $trackerness_flags)
function link() { # link partyness, link trackerness
url="https://hostfiles.frogeye.fr/${1}party-${2}-hosts.txt"
if [ "$1" = "$partyness" ] && [ "$2" = "$trackerness" ]
then
url="$url (this one)"
fi
echo $url
}
(
echo "# First-party trackers host list"
echo "# Variant: ${partyness}-party ${trackerness}"
echo "#"
echo "# About first-party trackers: TODO"
echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
echo "#"
echo "# In case of false positives/negatives, or any other question,"
echo "# contact me the way you like: https://geoffrey.frogeye.fr"
echo "#"
echo "# Latest versions and variants:"
echo "# - First-party trackers : $(link first trackers)"
echo "# - … excluding redirected: $(link first only-trackers)"
echo "# - First and third party : $(link multi trackers)"
echo "# - … excluding redirected: $(link multi only-trackers)"
echo '# (variants informations: TODO)'
echo '# (you can remove `-hosts` to get the raw list)'
echo "#"
echo "# Generation date: $gen_date"
echo "# Generation software: eulaurarien $gen_software"
echo "# Number of source websites: $number_websites"
echo "# Number of source subdomains: $number_subdomains"
echo "# Number of source DNS records: ~2E9 + $number_dns"
echo "#"
echo "# Input rules: $rules_input"
echo "# Subsequent rules: $rules_found"
echo "# Output rules: $rules_output"
echo "#"
echo
sed 's|^|0.0.0.0 |' "$file_list"
) > "$file_host"
done
done

71
feed_asn.py Executable file
View file

@ -0,0 +1,71 @@
#!/usr/bin/env python3
import database
import argparse
import requests
import typing
import ipaddress
import logging
import time
IPNetwork = typing.Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
def get_ranges(asn: str) -> typing.Iterable[str]:
req = requests.get(
'https://stat.ripe.net/data/as-routing-consistency/data.json',
params={'resource': asn}
)
data = req.json()
for pref in data['data']['prefixes']:
yield pref['prefix']
def get_name(asn: str) -> str:
req = requests.get(
'https://stat.ripe.net/data/as-overview/data.json',
params={'resource': asn}
)
data = req.json()
return data['data']['holder']
if __name__ == '__main__':
log = logging.getLogger('feed_asn')
# Parsing arguments
parser = argparse.ArgumentParser(
description="Add the IP ranges associated to the AS in the database")
args = parser.parse_args()
DB = database.Database()
def add_ranges(path: database.Path,
match: database.Match,
) -> None:
assert isinstance(path, database.AsnPath)
assert isinstance(match, database.AsnNode)
asn_str = database.Database.unpack_asn(path)
DB.enter_step('asn_get_name')
name = get_name(asn_str)
match.name = name
DB.enter_step('asn_get_ranges')
for prefix in get_ranges(asn_str):
parsed_prefix: IPNetwork = ipaddress.ip_network(prefix)
if parsed_prefix.version == 4:
DB.set_ip4network(
prefix,
source=path,
updated=int(time.time())
)
log.info('Added %s from %s (%s)', prefix, path, name)
elif parsed_prefix.version == 6:
log.warning('Unimplemented prefix version: %s', prefix)
else:
log.error('Unknown prefix version: %s', prefix)
for _ in DB.exec_each_asn(add_ranges):
pass
DB.save()

227
feed_dns.py Executable file
View file

@ -0,0 +1,227 @@
#!/usr/bin/env python3
import argparse
import database
import logging
import sys
import typing
import multiprocessing
import time
Record = typing.Tuple[typing.Callable, typing.Callable, int, str, str]
# select, write
FUNCTION_MAP: typing.Any = {
'a': (
database.Database.get_ip4,
database.Database.set_hostname,
),
'cname': (
database.Database.get_domain,
database.Database.set_hostname,
),
'ptr': (
database.Database.get_domain,
database.Database.set_ip4address,
),
}
class Writer(multiprocessing.Process):
def __init__(self,
recs_queue: multiprocessing.Queue,
autosave_interval: int = 0,
ip4_cache: int = 0,
):
super(Writer, self).__init__()
self.log = logging.getLogger(f'wr')
self.recs_queue = recs_queue
self.autosave_interval = autosave_interval
self.ip4_cache = ip4_cache
def run(self) -> None:
self.db = database.Database()
self.db.log = logging.getLogger(f'wr')
self.db.fill_ip4cache(max_size=self.ip4_cache)
if self.autosave_interval > 0:
next_save = time.time() + self.autosave_interval
else:
next_save = 0
self.db.enter_step('block_wait')
block: typing.List[Record]
for block in iter(self.recs_queue.get, None):
record: Record
for record in block:
select, write, updated, name, value = record
self.db.enter_step('feed_switch')
try:
for source in select(self.db, value):
write(self.db, name, updated, source=source)
except ValueError:
self.log.exception("Cannot execute: %s", record)
if next_save > 0 and time.time() > next_save:
self.log.info("Saving database...")
self.db.save()
self.log.info("Done!")
next_save = time.time() + self.autosave_interval
self.db.enter_step('block_wait')
self.db.enter_step('end')
self.db.save()
class Parser():
def __init__(self,
buf: typing.Any,
recs_queue: multiprocessing.Queue,
block_size: int,
):
super(Parser, self).__init__()
self.buf = buf
self.log = logging.getLogger('pr')
self.recs_queue = recs_queue
self.block: typing.List[Record] = list()
self.block_size = block_size
self.prof = database.Profiler()
self.prof.log = logging.getLogger('pr')
def register(self, record: Record) -> None:
self.prof.enter_step('register')
self.block.append(record)
if len(self.block) >= self.block_size:
self.prof.enter_step('put_block')
self.recs_queue.put(self.block)
self.block = list()
def run(self) -> None:
self.consume()
self.recs_queue.put(self.block)
self.prof.profile()
def consume(self) -> None:
raise NotImplementedError
class Rapid7Parser(Parser):
def consume(self) -> None:
data = dict()
for line in self.buf:
self.prof.enter_step('parse_rapid7')
split = line.split('"')
try:
for k in range(1, 14, 4):
key = split[k]
val = split[k+2]
data[key] = val
select, writer = FUNCTION_MAP[data['type']]
record = (
select,
writer,
int(data['timestamp']),
data['name'],
data['value']
)
except IndexError:
self.log.exception("Cannot parse: %s", line)
self.register(record)
class MassDnsParser(Parser):
# massdns --output Snrql
# --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
TYPES = {
'A': (FUNCTION_MAP['a'][0], FUNCTION_MAP['a'][1], -1, None),
# 'AAAA': (FUNCTION_MAP['aaaa'][0], FUNCTION_MAP['aaaa'][1], -1, None),
'CNAME': (FUNCTION_MAP['cname'][0], FUNCTION_MAP['cname'][1], -1, -1),
}
def consume(self) -> None:
self.prof.enter_step('parse_massdns')
timestamp = 0
header = True
for line in self.buf:
line = line[:-1]
if not line:
header = True
continue
split = line.split(' ')
try:
if header:
timestamp = int(split[1])
header = False
else:
select, write, name_offset, value_offset = \
MassDnsParser.TYPES[split[1]]
record = (
select,
write,
timestamp,
split[0][:name_offset],
split[2][:value_offset],
)
self.register(record)
self.prof.enter_step('parse_massdns')
except KeyError:
continue
PARSERS = {
'rapid7': Rapid7Parser,
'massdns': MassDnsParser,
}
if __name__ == '__main__':
# Parsing arguments
log = logging.getLogger('feed_dns')
args_parser = argparse.ArgumentParser(
description="Read DNS records and import "
"tracking-relevant data into the database")
args_parser.add_argument(
'parser',
choices=PARSERS.keys(),
help="Input format")
args_parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="Input file")
args_parser.add_argument(
'-b', '--block-size', type=int, default=1024,
help="Performance tuning value")
args_parser.add_argument(
'-q', '--queue-size', type=int, default=128,
help="Performance tuning value")
args_parser.add_argument(
'-a', '--autosave-interval', type=int, default=900,
help="Interval to which the database will save in seconds. "
"0 to disable.")
args_parser.add_argument(
'-4', '--ip4-cache', type=int, default=0,
help="RAM cache for faster IPv4 lookup. "
"Maximum useful value: 512 MiB (536870912). "
"Warning: Depending on the rules, this might already "
"be a memory-heavy process, even without the cache.")
args = args_parser.parse_args()
recs_queue: multiprocessing.Queue = multiprocessing.Queue(
maxsize=args.queue_size)
writer = Writer(recs_queue,
autosave_interval=args.autosave_interval,
ip4_cache=args.ip4_cache
)
writer.start()
parser = PARSERS[args.parser](args.input, recs_queue, args.block_size)
parser.run()
recs_queue.put(None)
writer.join()

54
feed_rules.py Executable file
View file

@ -0,0 +1,54 @@
#!/usr/bin/env python3
import database
import argparse
import sys
import time
FUNCTION_MAP = {
'zone': database.Database.set_zone,
'hostname': database.Database.set_hostname,
'asn': database.Database.set_asn,
'ip4network': database.Database.set_ip4network,
'ip4address': database.Database.set_ip4address,
}
if __name__ == '__main__':
# Parsing arguments
parser = argparse.ArgumentParser(
description="Import base rules to the database")
parser.add_argument(
'type',
choices=FUNCTION_MAP.keys(),
help="Type of rule inputed")
parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="File with one rule per line")
parser.add_argument(
'-f', '--first-party', action='store_true',
help="The input only comes from verified first-party sources")
args = parser.parse_args()
DB = database.Database()
fun = FUNCTION_MAP[args.type]
source: database.RulePath
if args.first_party:
source = database.RuleFirstPath()
else:
source = database.RuleMultiPath()
for rule in args.input:
rule = rule.strip()
try:
fun(DB,
rule,
source=source,
updated=int(time.time()),
)
except ValueError:
DB.log.error(f"Could not add rule: {rule}")
DB.save()

View file

@ -17,26 +17,13 @@ function dl() {
log "Retrieving rules…"
rm -f rules*/*.cache.*
dl https://easylist.to/easylist/easyprivacy.txt rules_adblock/easyprivacy.cache.txt
# From firebog.net Tracking & Telemetry Lists
dl https://v.firebog.net/hosts/Prigent-Ads.txt rules/prigent-ads.cache.list
# dl https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-blocklist.txt rules/notrack-blocklist.cache.list
# False positives: https://github.com/WaLLy3K/wally3k.github.io/issues/73 -> 69.media.tumblr.com chicdn.net
dl https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.2o7Net/hosts rules_hosts/add2o7.cache.txt
dl https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt rules_hosts/spy.cache.txt
# dl https://raw.githubusercontent.com/Kees1958/WS3_annual_most_used_survey_blocklist/master/w3tech_hostfile.txt rules/w3tech.cache.list
# False positives: agreements.apple.com -> edgekey.net
# dl https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt rules_hosts/ads-and-tracking-extended.cache.txt # Lots of false-positives
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/android-tracking.txt rules_hosts/android-tracking.cache.txt
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/SmartTV.txt rules_hosts/smart-tv.cache.txt
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/AmazonFireTV.txt rules_hosts/amazon-fire-tv.cache.txt
log "Retrieving TLD list…"
dl http://data.iana.org/TLD/tlds-alpha-by-domain.txt temp/all_tld.temp.list
grep -v '^#' temp/all_tld.temp.list | awk '{print tolower($0)}' > temp/all_tld.list
log "Retrieving nameservers…"
rm -f nameservers
touch nameservers
[ -f nameservers.head ] && cat nameservers.head >> nameservers
dl https://public-dns.info/nameservers.txt nameservers.temp
sort -R nameservers.temp >> nameservers
rm nameservers.temp
dl https://public-dns.info/nameservers.txt nameservers/public-dns.cache.list
log "Retrieving top subdomains…"
dl http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip top-1m.csv.zip
@ -51,4 +38,3 @@ then
else
mv temp/cisco-umbrella_popularity.fresh.list subdomains/cisco-umbrella_popularity.cache.list
fi
dl https://www.orwell1984.today/cname/eulerian.net.txt subdomains/orwell-eulerian-cname-list.cache.list

View file

@ -1,160 +0,0 @@
#!/usr/bin/env python3
# pylint: disable=C0103
"""
From a list of subdomains, output only
the ones resolving to a first-party tracker.
"""
import argparse
import sys
import progressbar
import csv
import typing
import ipaddress
# DomainRule = typing.Union[bool, typing.Dict[str, 'DomainRule']]
DomainRule = typing.Union[bool, typing.Dict]
# IpRule = typing.Union[bool, typing.Dict[int, 'DomainRule']]
IpRule = typing.Union[bool, typing.Dict]
RULES_DICT: DomainRule = dict()
RULES_IP_DICT: IpRule = dict()
def get_bits(address: ipaddress.IPv4Address) -> typing.Iterator[int]:
for char in address.packed:
for i in range(7, -1, -1):
yield (char >> i) & 0b1
def subdomain_matching(subdomain: str) -> bool:
parts = subdomain.split('.')
parts.reverse()
dic = RULES_DICT
for part in parts:
if isinstance(dic, bool) or part not in dic:
break
dic = dic[part]
if isinstance(dic, bool):
return dic
return False
def ip_matching(ip_str: str) -> bool:
ip = ipaddress.ip_address(ip_str)
dic = RULES_IP_DICT
i = 0
for bit in get_bits(ip):
i += 1
if isinstance(dic, bool) or bit not in dic:
break
dic = dic[bit]
if isinstance(dic, bool):
return dic
return False
def get_matching(chain: typing.List[str], no_explicit: bool = False
) -> typing.Iterable[str]:
if len(chain) <= 1:
return
initial = chain[0]
cname_destinations = chain[1:-1]
a_destination = chain[-1]
initial_matching = subdomain_matching(initial)
if no_explicit and initial_matching:
return
cname_matching = any(map(subdomain_matching, cname_destinations))
if cname_matching or initial_matching or ip_matching(a_destination):
yield initial
def register_rule(subdomain: str) -> None:
# Make a tree with domain parts
parts = subdomain.split('.')
parts.reverse()
dic = RULES_DICT
last_part = len(parts) - 1
for p, part in enumerate(parts):
if isinstance(dic, bool):
return
if p == last_part:
dic[part] = True
else:
dic.setdefault(part, dict())
dic = dic[part]
def register_rule_ip(network: str) -> None:
net = ipaddress.ip_network(network)
ip = net.network_address
dic = RULES_IP_DICT
last_bit = net.prefixlen - 1
for b, bit in enumerate(get_bits(ip)):
if isinstance(dic, bool):
return
if b == last_bit:
dic[bit] = True
else:
dic.setdefault(bit, dict())
dic = dic[bit]
if __name__ == '__main__':
# Parsing arguments
parser = argparse.ArgumentParser(
description="Filter first-party trackers from a list of subdomains")
parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="Input file with DNS chains")
parser.add_argument(
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
help="Outptut file with one tracking subdomain per line")
parser.add_argument(
'-n', '--no-explicit', action='store_true',
help="Don't output domains already blocked with rules without CNAME")
parser.add_argument(
'-r', '--rules', type=argparse.FileType('r'),
help="List of domains domains to block (with their subdomains)")
parser.add_argument(
'-p', '--rules-ip', type=argparse.FileType('r'),
help="List of IPs ranges to block")
args = parser.parse_args()
# Progress bar
widgets = [
progressbar.Percentage(),
' ', progressbar.SimpleProgress(),
' ', progressbar.Bar(),
' ', progressbar.Timer(),
' ', progressbar.AdaptiveTransferSpeed(unit='req'),
' ', progressbar.AdaptiveETA(),
]
progress = progressbar.ProgressBar(widgets=widgets)
# Reading rules
if args.rules:
for rule in args.rules:
register_rule(rule.strip())
if args.rules_ip:
for rule in args.rules_ip:
register_rule_ip(rule.strip())
# Approximating line count
if args.input.seekable():
lines = 0
for line in args.input:
lines += 1
progress.max_value = lines
args.input.seek(0)
# Reading domains to filter
reader = csv.reader(args.input)
progress.start()
for chain in reader:
for match in get_matching(chain, no_explicit=args.no_explicit):
print(match, file=args.output)
progress.update(progress.value + 1)
progress.finish()

View file

@ -1,85 +0,0 @@
#!/usr/bin/env bash
function log() {
echo -e "\033[33m$@\033[0m"
}
if [ ! -f temp/all_resolved.csv ]
then
echo "Run ./resolve_subdomains.sh first!"
exit 1
fi
# Gather all the rules for filtering
log "Compiling rules…"
cat rules_adblock/*.txt | grep -v '^!' | grep -v '^\[Adblock' | sort -u > temp/all_rules_adblock.txt
./adblock_to_domain_list.py --input temp/all_rules_adblock.txt --output rules/from_adblock.cache.list
cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 > rules/from_hosts.cache.list
cat rules/*.list | grep -v '^#' | grep -v '^$' | sort -u > temp/all_rules_multi.list
cat rules/first-party.list | grep -v '^#' | grep -v '^$' | sort -u > temp/all_rules_first.list
cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | sort -u > temp/all_ip_rules_multi.txt
cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | sort -u > temp/all_ip_rules_first.txt
log "Filtering first-party tracking domains…"
./filter_subdomains.py --rules temp/all_rules_first.list --rules-ip temp/all_ip_rules_first.txt --input temp/all_resolved_sorted.csv --output temp/firstparty-trackers.list
sort -u temp/firstparty-trackers.list > dist/firstparty-trackers.txt
log "Filtering first-party curated tracking domains…"
./filter_subdomains.py --rules temp/all_rules_first.list --rules-ip temp/all_ip_rules_first.txt --input temp/all_resolved_sorted.csv --no-explicit --output temp/firstparty-only-trackers.list
sort -u temp/firstparty-only-trackers.list > dist/firstparty-only-trackers.txt
log "Filtering multi-party tracking domains…"
./filter_subdomains.py --rules temp/all_rules_multi.list --rules-ip temp/all_ip_rules_multi.txt --input temp/all_resolved_sorted.csv --output temp/multiparty-trackers.list
sort -u temp/multiparty-trackers.list > dist/multiparty-trackers.txt
log "Filtering multi-party curated tracking domains…"
./filter_subdomains.py --rules temp/all_rules_multi.list --rules-ip temp/all_ip_rules_multi.txt --input temp/all_resolved_sorted.csv --no-explicit --output temp/multiparty-only-trackers.list
sort -u temp/multiparty-only-trackers.list > dist/multiparty-only-trackers.txt
# Format the blocklist so it can be used as a hostlist
function generate_hosts {
basename="$1"
description="$2"
description2="$3"
(
echo "# First-party trackers host list"
echo "# $description"
echo "# $description2"
echo "#"
echo "# About first-party trackers: https://git.frogeye.fr/geoffrey/eulaurarien#whats-a-first-party-tracker"
echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
echo "#"
echo "# In case of false positives/negatives, or any other question,"
echo "# contact me the way you like: https://geoffrey.frogeye.fr"
echo "#"
echo "# Latest version:"
echo "# - First-party trackers : https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt"
echo "# - … excluding redirected: https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt"
echo "# - First and third party : https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt"
echo "# - … excluding redirected: https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt"
echo "#"
echo "# Generation date: $(date -Isec)"
echo "# Generation software: eulaurarien $(git describe --tags)"
echo "# Number of source websites: $(wc -l temp/all_websites.list | cut -d' ' -f1)"
echo "# Number of source subdomains: $(wc -l temp/all_subdomains.list | cut -d' ' -f1)"
echo "#"
echo "# Number of known first-party trackers: $(wc -l temp/all_rules_first.list | cut -d' ' -f1)"
echo "# Number of first-party subdomains: $(wc -l dist/firstparty-trackers.txt | cut -d' ' -f1)"
echo "# … excluding redirected: $(wc -l dist/firstparty-only-trackers.txt | cut -d' ' -f1)"
echo "#"
echo "# Number of known multi-party trackers: $(wc -l temp/all_rules_multi.list | cut -d' ' -f1)"
echo "# Number of multi-party subdomains: $(wc -l dist/multiparty-trackers.txt | cut -d' ' -f1)"
echo "# … excluding redirected: $(wc -l dist/multiparty-only-trackers.txt | cut -d' ' -f1)"
echo
cat "dist/$basename.txt" | while read host;
do
echo "0.0.0.0 $host"
done
) > "dist/$basename-hosts.txt"
}
generate_hosts "firstparty-trackers" "Generated from a curated list of first-party trackers" ""
generate_hosts "firstparty-only-trackers" "Generated from a curated list of first-party trackers" "Only contain the first chain of redirection."
generate_hosts "multiparty-trackers" "Generated from known third-party trackers." "Also contains trackers used as third-party."
generate_hosts "multiparty-only-trackers" "Generated from known third-party trackers." "Do not contain trackers used in third-party. Use in combination with third-party lists."

26
import_rapid7.sh Executable file
View file

@ -0,0 +1,26 @@
#!/usr/bin/env bash
function log() {
echo -e "\033[33m$@\033[0m"
}
function feed_rapid7_fdns { # dataset
dataset=$1
line=$(curl -s https://opendata.rapid7.com/sonar.fdns_v2/ | grep "href=\".\+-fdns_$dataset.json.gz\"")
link="https://opendata.rapid7.com$(echo "$line" | cut -d'"' -f2)"
log "Reading $(echo "$dataset" | awk '{print toupper($0)}') records from $link"
curl -L "$link" | gunzip
}
function feed_rapid7_rdns {
dataset=$1
line=$(curl -s https://opendata.rapid7.com/sonar.rdns_v2/ | grep "href=\".\+-rdns.json.gz\"")
link="https://opendata.rapid7.com$(echo "$line" | cut -d'"' -f2)"
log "Reading PTR records from $link"
curl -L "$link" | gunzip
}
feed_rapid7_rdns | ./feed_dns.py rapid7
feed_rapid7_fdns a | ./feed_dns.py rapid7 --ip4-cache 536870912
# feed_rapid7_fdns aaaa | ./feed_dns.py rapid7 --ip6-cache 536870912
feed_rapid7_fdns cname | ./feed_dns.py rapid7
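
For context, every line piped into ./feed_dns.py rapid7 above is a single JSON document from the Rapid7 Sonar FDNS/RDNS datasets. The sketch below shows the kind of decoding involved; the field names (name, type, value) are assumed from the published dataset format, and parse_rapid7_line is a hypothetical helper, not something taken from feed_dns.py:

#!/usr/bin/env python3
# Hedged sketch: decode Rapid7 Sonar JSON lines into (name, type, value) tuples.
# Field names assume the public dataset format; this is not feed_dns.py itself.
import json
import sys
import typing


def parse_rapid7_line(line: str) -> typing.Optional[typing.Tuple[str, str, str]]:
    try:
        record = json.loads(line)
        return record['name'], record['type'], record['value']
    except (json.JSONDecodeError, KeyError):
        # Skip blank or malformed records instead of aborting the whole feed
        return None


if __name__ == '__main__':
    for line in sys.stdin:
        parsed = parse_rapid7_line(line)
        if parsed is not None:
            print(*parsed)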

22
import_rules.sh Executable file
View file

@ -0,0 +1,22 @@
#!/usr/bin/env bash
function log() {
echo -e "\033[33m$@\033[0m"
}
log "Importing rules…"
BEFORE="$(date +%s)"
cat rules_adblock/*.txt | grep -v '^!' | grep -v '^\[Adblock' | ./adblock_to_domain_list.py | ./feed_rules.py zone
cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 | ./feed_rules.py zone
cat rules/*.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone
cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network
cat rules_asn/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn
cat rules/first-party.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone --first-party
cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network --first-party
cat rules_asn/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn --first-party
./feed_asn.py
# log "Pruning old rules…"
# ./db.py --prune --prune-before "$BEFORE" --prune-base
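
The first pipeline above relies on ./adblock_to_domain_list.py to turn Adblock-style filters into bare domains before ./feed_rules.py zone ingests them. A minimal sketch of that conversion, assuming the common ||domain^ filter form (the real script may cover more of the syntax):

#!/usr/bin/env python3
# Hedged sketch: extract bare domains from Adblock-style filters such as
# ||tracker.example^ or ||tracker.example^$third-party.
# It only approximates adblock_to_domain_list.py; it is not that script.
import re
import sys

DOMAIN_FILTER = re.compile(r'^\|\|([a-z0-9.-]+)\^')

if __name__ == '__main__':
    for line in sys.stdin:
        match = DOMAIN_FILTER.match(line.strip().lower())
        if match:
            print(match.group(1))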

2
nameservers/.gitignore vendored Normal file
View file

@ -0,0 +1,2 @@
*.custom.list
*.cache.list

24
nameservers/popular.list Normal file
View file

@ -0,0 +1,24 @@
8.8.8.8
8.8.4.4
2001:4860:4860:0:0:0:0:8888
2001:4860:4860:0:0:0:0:8844
208.67.222.222
208.67.220.220
2620:119:35::35
2620:119:53::53
4.2.2.1
4.2.2.2
8.26.56.26
8.20.247.20
84.200.69.80
84.200.70.40
2001:1608:10:25:0:0:1c04:b12f
2001:1608:10:25:0:0:9249:d69b
9.9.9.10
149.112.112.10
2620:fe::10
2620:fe::fe:10
1.1.1.1
1.0.0.1
2606:4700:4700::1111
2606:4700:4700::1001

View file

@ -1,21 +0,0 @@
#!/usr/bin/env python3
"""
List of regex matching first-party trackers.
"""
# Syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax
REGEXES = [
r'^.+\.eulerian\.net\.$', # Eulerian
r'^.+\.criteo\.com\.$', # Criteo
r'^.+\.dnsdelegation\.io\.$', # Criteo
r'^.+\.keyade\.com\.$', # Keyade
r'^.+\.omtrdc\.net\.$', # Adobe Experience Cloud
r'^.+\.bp01\.net\.$', # NP6
r'^.+\.ati-host\.net\.$', # Xiti (AT Internet)
r'^.+\.at-o\.net\.$', # Xiti (AT Internet)
r'^.+\.edgkey\.net\.$', # Edgekey (Akamai)
r'^.+\.akaimaiedge\.net\.$', # Edgekey (Akamai)
r'^.+\.storetail\.io\.$', # Storetail (Criteo)
]

View file

@ -1,284 +0,0 @@
#!/usr/bin/env python3
"""
From a list of subdomains, output only
the ones resolving to a first-party tracker.
"""
import argparse
import logging
import os
import queue
import sys
import threading
import typing
import csv
import coloredlogs
import dns.exception
import dns.resolver
import progressbar
DNS_TIMEOUT = 5.0
NUMBER_THREADS = 512
NUMBER_TRIES = 5
# TODO All the domains don't get treated,
# so it leaves with 4-5 subdomains not resolved
glob = None
class Worker(threading.Thread):
"""
Worker process for a DNS resolver.
Will resolve DNS to match first-party subdomains.
"""
def change_nameserver(self) -> None:
"""
Assign this worker another nameserver from the queue.
"""
server = None
while server is None:
try:
server = self.orchestrator.nameservers_queue.get(block=False)
except queue.Empty:
self.orchestrator.refill_nameservers_queue()
self.log.info("Using nameserver: %s", server)
self.resolver.nameservers = [server]
def __init__(self,
orchestrator: 'Orchestrator',
index: int = 0):
super(Worker, self).__init__()
self.log = logging.getLogger(f'worker{index:03d}')
self.orchestrator = orchestrator
self.resolver = dns.resolver.Resolver()
self.change_nameserver()
def resolve_subdomain(self, subdomain: str) -> typing.Optional[
typing.List[
str
]
]:
"""
Returns the resolution chain of the subdomain to an A record,
including any intermediary CNAME.
The last element is an IP address.
Returns None if the nameserver was unable to satisfy the request.
Returns [] if the request points to nothing.
"""
self.log.debug("Querying %s", subdomain)
try:
query = self.resolver.query(subdomain, 'A', lifetime=DNS_TIMEOUT)
except dns.resolver.NXDOMAIN:
return []
except dns.resolver.NoAnswer:
return []
except dns.resolver.YXDOMAIN:
self.log.warning("Query name too long for %s", subdomain)
return None
except dns.resolver.NoNameservers:
# NOTE Most of the time this error message means that the domain
# does not exist, but sometimes it means that the server
# itself is broken. So we count on the retry logic.
self.log.warning("All nameservers broken for %s", subdomain)
return None
except dns.exception.Timeout:
# NOTE Same as above
self.log.warning("Timeout for %s", subdomain)
return None
except dns.name.EmptyLabel:
self.log.warning("Empty label for %s", subdomain)
return None
resolved = list()
last = len(query.response.answer) - 1
for a, answer in enumerate(query.response.answer):
if answer.rdtype == dns.rdatatype.CNAME:
assert a < last
resolved.append(answer.items[0].to_text()[:-1])
elif answer.rdtype == dns.rdatatype.A:
assert a == last
resolved.append(answer.items[0].address)
else:
assert False
return resolved
def run(self) -> None:
self.log.info("Started")
subdomain: str
for subdomain in iter(self.orchestrator.subdomains_queue.get, None):
for _ in range(NUMBER_TRIES):
resolved = self.resolve_subdomain(subdomain)
# Retry with another nameserver if error
if resolved is None:
self.change_nameserver()
else:
break
# If it wasn't found after multiple tries
if resolved is None:
self.log.error("Gave up on %s", subdomain)
resolved = []
resolved.insert(0, subdomain)
assert isinstance(resolved, list)
self.orchestrator.results_queue.put(resolved)
self.orchestrator.results_queue.put(None)
self.log.info("Stopped")
class Orchestrator():
"""
Orchestrator of the different Worker threads.
"""
def refill_nameservers_queue(self) -> None:
"""
Re-fill the given nameservers into the nameservers queue.
Done every time the queue is empty, making it
effectively an infinite loop.
"""
# Might be in a race condition but that's probably fine
for nameserver in self.nameservers:
self.nameservers_queue.put(nameserver)
self.log.info("Refilled nameserver queue")
def __init__(self, subdomains: typing.Iterable[str],
nameservers: typing.List[str] = None,
):
self.log = logging.getLogger('orchestrator')
self.subdomains = subdomains
# Use internal resolver by default
self.nameservers = nameservers or dns.resolver.Resolver().nameservers
self.subdomains_queue: queue.Queue = queue.Queue(
maxsize=NUMBER_THREADS)
self.results_queue: queue.Queue = queue.Queue()
self.nameservers_queue: queue.Queue = queue.Queue()
self.refill_nameservers_queue()
def fill_subdomain_queue(self) -> None:
"""
Read the subdomains in input and put them into the queue.
Done in a thread so we can both:
- yield the results as they come
- not store all the subdomains at once
"""
self.log.info("Started reading subdomains")
# Send data to workers
for subdomain in self.subdomains:
self.subdomains_queue.put(subdomain)
self.log.info("Finished reading subdomains")
# Send sentinel to each worker
# sentinel = None ~= EOF
for _ in range(NUMBER_THREADS):
self.subdomains_queue.put(None)
def run(self) -> typing.Iterable[typing.List[str]]:
"""
Yield the results.
"""
# Create workers
self.log.info("Creating workers")
for i in range(NUMBER_THREADS):
Worker(self, i).start()
fill_thread = threading.Thread(target=self.fill_subdomain_queue)
fill_thread.start()
# Wait for one sentinel per worker
# In the meantime output results
for _ in range(NUMBER_THREADS):
result: typing.List[str]
for result in iter(self.results_queue.get, None):
yield result
self.log.info("Waiting for reader thread")
fill_thread.join()
self.log.info("Done!")
def main() -> None:
"""
Main function when used directly.
For each subdomain provided, output the subdomain itself,
the last CNAME resolved and the IP address it resolves to.
Takes as an input a filename (or nothing, for stdin),
and as an output a filename (or nothing, for stdout).
The input must be one subdomain per line; the output is a comma-separated
file with the columns source, CNAME and A.
Use the file `nameservers` as the list of nameservers
to use, or else it will use the system defaults.
Also shows a nice progressbar.
"""
# Initialization
coloredlogs.install(
level='DEBUG',
fmt='%(asctime)s %(name)s %(levelname)s %(message)s'
)
# Parsing arguments
parser = argparse.ArgumentParser(
description="Massively resolves subdomains and store them in a file.")
parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="Input file with one subdomain per line")
parser.add_argument(
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
help="Outptut file with DNS chains")
# parser.add_argument(
# '-n', '--nameserver', type=argparse.FileType('r'),
# default='nameservers', help="File with one nameserver per line")
# parser.add_argument(
# '-j', '--workers', type=int, default=512,
# help="Number of threads to use")
args = parser.parse_args()
# Progress bar
widgets = [
progressbar.Percentage(),
' ', progressbar.SimpleProgress(),
' ', progressbar.Bar(),
' ', progressbar.Timer(),
' ', progressbar.AdaptiveTransferSpeed(unit='req'),
' ', progressbar.AdaptiveETA(),
]
progress = progressbar.ProgressBar(widgets=widgets)
if args.input.seekable():
progress.max_value = len(args.input.readlines())
args.input.seek(0)
# Cleaning input
iterator = iter(args.input)
iterator = map(str.strip, iterator)
iterator = filter(None, iterator)
# Reading nameservers
servers: typing.List[str] = list()
if os.path.isfile('nameservers'):
servers = open('nameservers').readlines()
servers = list(filter(None, map(str.strip, servers)))
writer = csv.writer(args.output)
progress.start()
global glob
glob = Orchestrator(iterator, servers)
for resolved in glob.run():
progress.update(progress.value + 1)
writer.writerow(resolved)
progress.finish()
if __name__ == '__main__':
main()

View file

@ -4,11 +4,16 @@ function log() {
echo -e "\033[33m$@\033[0m"
}
# Resolve the CNAME chain of all the known subdomains for later analysis
log "Compiling subdomain lists..."
pv subdomains/*.list | sort -u > temp/all_subdomains.list
# Sort by last character to utilize the DNS server caching mechanism
pv temp/all_subdomains.list | rev | sort | rev > temp/all_subdomains_reversort.list
./resolve_subdomains.py --input temp/all_subdomains_reversort.list --output temp/all_resolved.csv
sort -u temp/all_resolved.csv > temp/all_resolved_sorted.csv
log "Compiling nameservers…"
pv nameservers/*.list | ./validate_list.py --ip4 | sort -u > temp/all_nameservers_ip4.list
log "Compiling subdomain…"
# Sort by last character to utilize the DNS server caching mechanism
# (not as efficient with massdns but it's almost free so why not)
pv subdomains/*.list | ./validate_list.py --domain | rev | sort -u | rev > temp/all_subdomains.list
log "Resolving subdomain…"
massdns --output Snrql --retry REFUSED,SERVFAIL --resolvers temp/all_nameservers_ip4.list --outfile temp/all_resolved.txt temp/all_subdomains.list
log "Importing into database…"
pv temp/all_resolved.txt | ./feed_dns.py massdns
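
The rev | sort -u | rev step above orders hostnames by their last characters, so sibling subdomains of the same zone are queried back to back and the resolvers can reuse cached delegations. The same grouping, written out as a small illustration (the pipeline itself already does this with coreutils):

#!/usr/bin/env python3
# Hedged illustration of the rev | sort -u | rev trick above: sorting on the
# reversed hostname keeps subdomains of the same zone next to each other,
# which lets the resolvers reuse cached delegations between queries.
import sys

if __name__ == '__main__':
    hostnames = {line.strip() for line in sys.stdin if line.strip()}
    for hostname in sorted(hostnames, key=lambda name: name[::-1]):
        print(hostname)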

View file

@ -18,7 +18,14 @@ omtrdc.net
online-metrix.net
# Webtrekk
wt-eu02.net
webtrekk.net
# Otto Group
oghub.io
# ???
# Intent.com
partner.intentmedia.net
# Wizaly
wizaly.com
# Commanders Act
tagcommander.com
# Ingenious Technologies
affex.org

2
rules_asn/.gitignore vendored Normal file
View file

@ -0,0 +1,2 @@
*.custom.txt
*.cache.txt

10
rules_asn/first-party.txt Normal file
View file

@ -0,0 +1,10 @@
# Eulerian
AS50234
# Criteo
AS44788
AS19750
AS55569
# ThreatMetrix
AS30286
# Webtrekk
AS60164

View file

@ -1,51 +0,0 @@
# Eulerian (AS50234 EULERIAN TECHNOLOGIES S.A.S.)
109.232.192.0/21
# Criteo (AS44788 Criteo SA)
91.199.242.0/24
91.212.98.0/24
178.250.0.0/21
178.250.0.0/24
178.250.1.0/24
178.250.2.0/24
178.250.3.0/24
178.250.4.0/24
178.250.6.0/24
185.235.84.0/24
# Criteo (AS19750 Criteo Corp.)
74.119.116.0/22
74.119.117.0/24
74.119.118.0/24
74.119.119.0/24
91.199.242.0/24
185.235.85.0/24
199.204.168.0/22
199.204.168.0/24
199.204.169.0/24
199.204.170.0/24
199.204.171.0/24
178.250.0.0/21
91.212.98.0/24
91.199.242.0/24
185.235.84.0/24
# Criteo (AS55569 Criteo APAC)
91.199.242.0/24
116.213.20.0/22
116.213.20.0/24
116.213.21.0/24
182.161.72.0/22
182.161.72.0/24
182.161.73.0/24
185.235.86.0/24
185.235.87.0/24
# ThreatMetrix (AS30286 ThreatMetrix Inc.)
69.84.176.0/24
173.254.179.0/24
185.32.240.0/23
185.32.242.0/23
192.225.156.0/22
199.101.156.0/23
199.101.158.0/23
# Webtrekk (AS60164 Webtrekk GmbH)
185.54.148.0/22
185.54.150.0/24
185.54.151.0/24

34
run_tests.py Executable file
View file

@ -0,0 +1,34 @@
#!/usr/bin/env python3
import database
import os
import logging
import csv
TESTS_DIR = 'tests'
if __name__ == '__main__':
DB = database.Database()
log = logging.getLogger('tests')
for filename in os.listdir(TESTS_DIR):
log.info("")
log.info("Running tests from %s", filename)
path = os.path.join(TESTS_DIR, filename)
with open(path, 'rt') as fdesc:
reader = csv.DictReader(fdesc)
for test in reader:
log.info("Testing %s (%s)", test['url'], test['comment'])
for white in test['white'].split(':'):
if not white:
continue
if any(DB.get_domain(white)):
log.error("False positive: %s", white)
for black in test['black'].split(':'):
if not black:
continue
if not any(DB.get_domain(black)):
log.error("False negative: %s", black)

View file

@ -1,6 +1,5 @@
url,white,black,comment
https://support.apple.com,support.apple.com,,EdgeKey / AkamaiEdge
https://www.pinterest.fr/,i.pinimg.com,,Cedexis
https://www.pinterest.fr/,i.pinimg.com,,Cedexis
https://www.tumblr.com/,66.media.tumblr.com,,ChiCDN
https://www.skype.com/fr/,www.skype.com,,TrafficManager


View file

@ -5,3 +5,6 @@ https://www.discover.com/,,content.discover.com,ThreatMetrix
https://www.mytoys.de/,,web.mytoys.de,Webtrekk
https://www.baur.de/,,tp.baur.de,Otto Group
https://www.liligo.com/,,compare.liligo.com,???
https://www.boulanger.com/,,tag.boulanger.fr,TagCommander
https://www.airfrance.fr/FR/,,tk.airfrance.fr,Wizaly
https://www.vsgamers.es/,,marketing.net.vsgamers.es,Affex


35
validate_list.py Executable file
View file

@ -0,0 +1,35 @@
#!/usr/bin/env python3
# pylint: disable=C0103
"""
Filter out invalid domain names and/or IP addresses
"""
import database
import argparse
import sys
if __name__ == '__main__':
# Parsing arguments
parser = argparse.ArgumentParser(
description="Filter out invalid domain name/ip addresses from a list.")
parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="Input file, one element per line")
parser.add_argument(
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
help="Output file, one element per line")
parser.add_argument(
'-d', '--domain', action='store_true',
help="Can be domain name")
parser.add_argument(
'-4', '--ip4', action='store_true',
help="Can be IP4")
args = parser.parse_args()
for line in args.input:
line = line.strip()
if (args.domain and database.Database.validate_domain(line)) or \
(args.ip4 and database.Database.validate_ip4address(line)):
print(line, file=args.output)
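
validate_list.py is meant to be used as a stream filter, as resolve_subdomains.sh does above with --domain and --ip4. A hypothetical standalone invocation (paths illustrative):

cat subdomains/*.list | ./validate_list.py --domain | sort -u > temp/valid_subdomains.custom.list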