Merge branch 'newworkflow'
commit cd46b39756

.gitignore (vendored) — 3 lines changed
@@ -1,3 +1,2 @@
 *.log
-nameservers
-nameservers.head
+*.p
README.md — 169 lines changed

@@ -1,98 +1,133 @@
 # eulaurarien

-Generates a host list of first-party trackers for ad-blocking.
+This program is able to generate a list of every hostname that is a DNS redirection to a list of DNS zones and IP networks.

-The latest list is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
+It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md) (learn about first-party trackers by following this link).

-**DISCLAIMER:** I'm by no means an expert on this subject, so my vocabulary or other statements might be wrong. Use at your own risk.
+If you want to contribute but don't want to create an account on this forge, contact me the way you like: <https://geoffrey.frogeye.fr>

-## What's a first-party tracker?
+## How does this work

-Traditionally, websites load tracker scripts directly.
-For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users.
-In order to block those, one can simply block the host `trackercompany.com`.
+This program takes as input:

-However, to circumvent this easy block, tracker companies made the websites using them load trackers from `somethingirelevant.website1.com`,
-the latter being a DNS redirection to `website1.trackercompany.com`, directly pointing to a server serving the tracking script.
-Those are the first-party trackers.
+- Lists of hostnames to match
+- Lists of DNS zones to match (a domain and its subdomains)
+- Lists of IP addresses / IP networks to match
+- Lists of Autonomous System numbers to match
+- An enormous quantity of DNS records

-Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since:
+It will be able to output the hostnames that are a DNS redirection to any item in the lists provided.

-1. Most ad-blockers don't support wildcards
-2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com`
+DNS records can either come from [Rapid7 Open Data Sets](https://opendata.rapid7.com/sonar.fdns_v2/) or can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).

-So the only solution is to block every known `somethingirelevant.website1.com`-like subdomain, which is a lot.
-That's where this script comes in: it generates a list of such subdomains.
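The blocking approach described above (block any hostname whose own name, or any parent zone, belongs to a tracker) can be sketched in a few lines of Python. This is an illustrative sketch only; the `tracker_zones` set and the hostnames are hypothetical examples, not data from this repository:

```python
# Minimal sketch of zone matching: a hostname is blocked when it, or any
# parent domain of it, is a known tracker zone.
tracker_zones = {"trackercompany.com"}  # hypothetical example zone

def is_blocked(hostname: str) -> bool:
    labels = hostname.split(".")
    # Check every suffix: a.b.c -> a.b.c, b.c, c
    return any(".".join(labels[i:]) in tracker_zones
               for i in range(len(labels)))

print(is_blocked("website1.trackercompany.com"))  # True
print(is_blocked("website1.com"))                 # False
```

The real program applies this idea at scale, against millions of DNS records, by storing the zones in a tree keyed on reversed domain labels (see `database.py` in this commit).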
-## How does this script work
-
-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done.
-
-It takes as input a list of websites with trackers included.
-So far, this list is manually generated from the list of clients of such first-party trackers
-(later we should use a general list of websites to be more exhaustive).
-It opens each one of those websites (just the homepage) in a web browser, and records the domains of the network requests the page makes.
-
-Additionally, or alternatively, you can feed the script some browsing history and get domains from there.
-
-It then finds the DNS redirections of those domains, and compares them with regexes of known tracking domains.
-It finally outputs the matching ones.
-
-## Requirements
-
-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done.
-
-If you just want to use the list, you can find an already-built one in the releases.
-
-- Bash
-- [Python 3.4+](https://www.python.org/)
-- [progressbar2](https://pypi.org/project/progressbar2/)
-- dnspython
-- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up)
-
-(if you don't want to collect the subdomains, you can skip the following)
-
-- Firefox
-- Selenium
-- seleniumwire
+Those subdomains can either be provided as is, come from the [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that).

 ## Usage

-> **Notice:** This section is a tad outdated. I'm still experimenting to make the generation process better. I'll update this once I'm done.
-
-Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/dist/README.md).
+This is only if you want to build the list yourself.
+If you just want to use the list, the latest build is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
+It was built using additional sources not included in this repository for privacy reasons.
+The following is for the people wanting to build their own list.

-### Add personal sources
+### Requirements

-The list of websites provided in this script is by no means exhaustive,
-so adding your own browsing history will help create a better list.
+Depending on the sources you'll be using to generate the list, you'll need to install some of the following:

+- [Bash](https://www.gnu.org/software/bash/bash.html)
+- [Coreutils](https://www.gnu.org/software/coreutils/)
+- [curl](https://curl.haxx.se)
+- [pv](http://www.ivarch.com/programs/pv.shtml)
+- [Python 3.4+](https://www.python.org/)
+- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
+- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
+- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
+- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
+- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source)
+### Create a new database

+The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASNs, IPs, hostnames, zones…) and every entity leading to them.
+For now there's no way to remove data from it, so here's the command to recreate it: `./db.py --initialize`.
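The on-disk layout of that file can be illustrated with a small sketch: as `database.py` in this commit does, the program pickles a `(version, data)` tuple, so an outdated file can be detected and rebuilt. The file name and the `data` placeholder below are examples for illustration only:

```python
import os
import pickle
import tempfile

VERSION = 18  # bumped whenever the on-disk layout changes

# Stand-in for the real structures (rules, domain tree, ASNs, IP tree)
data = (["rules"], {"domtree": {}})
path = os.path.join(tempfile.mkdtemp(), "blocking.p")

# Save: store the version alongside the data
with open(path, "wb") as fd:
    pickle.dump((VERSION, data), fd)

# Load: compare the stored version before trusting the data
with open(path, "rb") as fd:
    version, loaded = pickle.load(fd)
if version != VERSION:
    print("outdated database, rebuilding")
```

This is why `./db.py --initialize` is enough to start over: the whole state lives in that single pickle file.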
+### Gather external sources

+External sources are not stored in this repository.
+You'll need to fetch them by running `./fetch_resources.sh`.
+Those include:

+- Third-party trackers lists
+- TLD lists (used to test the validity of hostnames)
+- List of public DNS resolvers (for DNS resolving from subdomains)
+- Top 1M subdomains

+### Import rules into the database

+You need to put the lists of rules for matching in the different subfolders:

+- `rules`: Lists of DNS zones
+- `rules_ip`: Lists of IP networks (for IP addresses append `/32`)
+- `rules_asn`: Lists of Autonomous System numbers (IP ranges will be deduced from them)
+- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted)
+- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists

+See the provided examples for syntax.
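The "append `/32`" convention works because internally an IP rule can be packed into an integer plus a prefix length, and membership then reduces to a bit shift, which is essentially what `database.py` in this commit does. A standalone sketch of the idea (the addresses are documentation examples):

```python
def pack_ip4network(network: str):
    """Pack 'a.b.c.d/p' into (integer value, prefix length)."""
    address, prefixlen = network.split('/')
    value = 0
    for octet in address.split('.'):
        value = (value << 8) + int(octet)
    return value, int(prefixlen)

def in_network(ip: str, network: str) -> bool:
    # An address is in a network when their first `prefixlen` bits agree.
    net_value, prefixlen = pack_ip4network(network)
    ip_value, _ = pack_ip4network(ip + '/32')
    shift = 32 - prefixlen
    return ip_value >> shift == net_value >> shift

print(in_network("203.0.113.7", "203.0.113.0/24"))  # True
print(in_network("203.0.114.7", "203.0.113.0/24"))  # False
```

A single `/32` rule is thus just a network whose prefix covers all 32 bits, i.e. exactly one address.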
+In each folder:

+- `first-party.ext` will be the only files considered for the first-party variant of the list
+- `*.cache.ext` are from external sources, and thus might be deleted / overwritten
+- `*.custom.ext` are for sources that you don't want committed

+Then, run `./import_rules.sh`.

+### Add subdomains

+If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive),
+the top 1M subdomains provided might not be enough.

+You can add them into the `subdomains` folder.
+It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.

+#### Add personal sources

+Adding your own browsing history will help create a more suited subdomains list.
+Here are reference commands for possible sources:

+- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
+- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`
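The Firefox one-liner needs the `rev | sed` step because, to my understanding, `moz_places.rev_host` stores hostnames reversed character by character with a trailing dot (e.g. `moc.elpmaxe.` for `example.com`). The same transformation in Python, with an illustrative sample value:

```python
def rev_host_to_host(rev_host: str) -> str:
    # Reverse the string, then drop the leading dot
    # (the equivalent of `rev | sed 's|^\.||'` in the shell pipeline).
    return rev_host[::-1].lstrip('.')

print(rev_host_to_host("moc.elpmaxe."))  # example.com
```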
-### Collect subdomains from websites
+#### Collect subdomains from websites

-Just run `collect_subdomain.sh`.
+You can add the website URLs into the `websites` folder.
+It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.

+Then, run `collect_subdomain.sh`.
 This is a long step, and might be memory-intensive from time to time.

+This step is optional if you already added personal sources.
-Alternatively, you can just download the list of subdomains used to generate the official block list here: <https://hostfiles.frogeye.fr/from_websites.cache.list> (put it in the `subdomains` folder).
+> **Note:** For first-party tracking, a list of subdomains issued from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list>

-### Extract tracking domains
+### Resolve DNS records

-Make sure your system is configured with a DNS server without limitation.
-Then, run `filter_subdomain.sh`.
-The files you need will be in the folder `dist`.
+Once you've added subdomains, you'll need to resolve them to get their DNS records.
+The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory.

-## Contributing
+Then, run `./resolve_subdomains.sh`.
+Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet count.

-### Adding websites
+> Some VPS providers might detect this as a DDoS attack and cut the network access.
+> Some Wi-Fi connections can be rendered unusable for other uses, and some routers might cease to work.
+> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow Ethernet link (Raspberry Pi < 4).

-Just add the URL to the relevant list: `websites/<source>.list`.
+The DNS records will automatically be imported into the database.
+If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.

-### Adding first-party trackers regex
+### Import DNS records from Rapid7

+Just run `./import_rapid7.sh`.
+This will download about 35 GiB of data, but only the matching records will be stored (about a few MiB for the tracking rules).
+Note the download speed will most likely be limited by the database operation throughput (fast RAM will help).

+### Export the lists

+For the tracking list, use `./export_lists.sh`; the output will be in the `dist` folder (please change the links before distributing them).
+For other purposes, tinker with the `./export.py` program.

-Just add them to `regexes.py`.
database.py — new file, 739 lines

@@ -0,0 +1,739 @@
#!/usr/bin/env python3

"""
Utility functions to interact with the database.
"""

import logging
import math
import pickle
import time
import typing

import coloredlogs
import numpy

TLD_LIST: typing.Set[str] = set()

coloredlogs.install(
    level='DEBUG',
    fmt='%(asctime)s %(name)s %(levelname)s %(message)s'
)

Asn = int
Timestamp = int
Level = int


class Path():
    # FP add boolean here
    pass


class RulePath(Path):
    def __str__(self) -> str:
        return '(rule)'


class RuleFirstPath(RulePath):
    def __str__(self) -> str:
        return '(first-party rule)'


class RuleMultiPath(RulePath):
    def __str__(self) -> str:
        return '(multi-party rule)'


class DomainPath(Path):
    def __init__(self, parts: typing.List[str]):
        self.parts = parts

    def __str__(self) -> str:
        return '?.' + Database.unpack_domain(self)


class HostnamePath(DomainPath):
    def __str__(self) -> str:
        return Database.unpack_domain(self)


class ZonePath(DomainPath):
    def __str__(self) -> str:
        return '*.' + Database.unpack_domain(self)


class AsnPath(Path):
    def __init__(self, asn: Asn):
        self.asn = asn

    def __str__(self) -> str:
        return Database.unpack_asn(self)


class Ip4Path(Path):
    def __init__(self, value: int, prefixlen: int):
        self.value = value
        self.prefixlen = prefixlen

    def __str__(self) -> str:
        return Database.unpack_ip4network(self)


class Match():
    def __init__(self) -> None:
        self.source: typing.Optional[Path] = None
        self.updated: int = 0
        self.dupplicate: bool = False

        # Cache
        self.level: int = 0
        self.first_party: bool = False
        self.references: int = 0

    def active(self, first_party: typing.Optional[bool] = None) -> bool:
        if self.updated == 0 or (first_party and not self.first_party):
            return False
        return True


class AsnNode(Match):
    def __init__(self) -> None:
        Match.__init__(self)
        self.name = ''


class DomainTreeNode():
    def __init__(self) -> None:
        self.children: typing.Dict[str, DomainTreeNode] = dict()
        self.match_zone = Match()
        self.match_hostname = Match()


class IpTreeNode(Match):
    def __init__(self) -> None:
        Match.__init__(self)
        self.zero: typing.Optional[IpTreeNode] = None
        self.one: typing.Optional[IpTreeNode] = None


Node = typing.Union[DomainTreeNode, IpTreeNode, AsnNode]
MatchCallable = typing.Callable[[Path, Match], typing.Any]


class Profiler():
    def __init__(self) -> None:
        self.log = logging.getLogger('profiler')
        self.time_last = time.perf_counter()
        self.time_step = 'init'
        self.time_dict: typing.Dict[str, float] = dict()
        self.step_dict: typing.Dict[str, int] = dict()

    def enter_step(self, name: str) -> None:
        now = time.perf_counter()
        try:
            self.time_dict[self.time_step] += now - self.time_last
            self.step_dict[self.time_step] += int(name != self.time_step)
        except KeyError:
            self.time_dict[self.time_step] = now - self.time_last
            self.step_dict[self.time_step] = 1
        self.time_step = name
        self.time_last = time.perf_counter()

    def profile(self) -> None:
        self.enter_step('profile')
        total = sum(self.time_dict.values())
        for key, secs in sorted(self.time_dict.items(), key=lambda t: t[1]):
            times = self.step_dict[key]
            self.log.debug(f"{key:<20}: {times:9d} × {secs/times:5.3e} "
                           f"= {secs:9.2f} s ({secs/total:7.2%}) ")
        self.log.debug(f"{'total':<20}: "
                       f"{total:9.2f} s ({1:7.2%})")


class Database(Profiler):
    VERSION = 18
    PATH = "blocking.p"

    def initialize(self) -> None:
        self.log.warning(
            "Creating database version: %d ",
            Database.VERSION)
        # Dummy match objects that everything refers to
        self.rules: typing.List[Match] = list()
        for first_party in (False, True):
            m = Match()
            m.updated = 1
            m.level = 0
            m.first_party = first_party
            self.rules.append(m)
        self.domtree = DomainTreeNode()
        self.asns: typing.Dict[Asn, AsnNode] = dict()
        self.ip4tree = IpTreeNode()

    def load(self) -> None:
        self.enter_step('load')
        try:
            with open(self.PATH, 'rb') as db_fdsec:
                version, data = pickle.load(db_fdsec)
                if version == Database.VERSION:
                    self.rules, self.domtree, self.asns, self.ip4tree = data
                    return
                self.log.warning(
                    "Outdated database version found: %d, "
                    "it will be rebuilt.",
                    version)
        except (TypeError, AttributeError, EOFError):
            self.log.error(
                "Corrupt (or heavily outdated) database found, "
                "it will be rebuilt.")
        except FileNotFoundError:
            pass
        self.initialize()

    def save(self) -> None:
        self.enter_step('save')
        with open(self.PATH, 'wb') as db_fdsec:
            data = self.rules, self.domtree, self.asns, self.ip4tree
            pickle.dump((self.VERSION, data), db_fdsec)
        self.profile()

    def __init__(self) -> None:
        Profiler.__init__(self)
        self.log = logging.getLogger('db')
        self.load()
        self.ip4cache_shift: int = 32
        self.ip4cache = numpy.ones(1)

    def _set_ip4cache(self, path: Path, _: Match) -> None:
        assert isinstance(path, Ip4Path)
        self.enter_step('set_ip4cache')
        mini = path.value >> self.ip4cache_shift
        maxi = (path.value + 2**(32-path.prefixlen)) >> self.ip4cache_shift
        if mini == maxi:
            self.ip4cache[mini] = True
        else:
            self.ip4cache[mini:maxi] = True

    def fill_ip4cache(self, max_size: int = 512*1024**2) -> None:
        """
        Size in bytes
        """
        if max_size > 2**32/8:
            self.log.warning("Allocating more than 512 MiB of RAM for "
                             "the Ip4 cache is not necessary.")
        max_cache_width = int(math.log2(max(1, max_size*8)))
        cache_width = min(32, max_cache_width)
        self.ip4cache_shift = 32-cache_width
        cache_size = 2**cache_width
        self.ip4cache = numpy.zeros(cache_size, dtype=bool)
        for _ in self.exec_each_ip4(self._set_ip4cache):
            pass

    @staticmethod
    def populate_tld_list() -> None:
        with open('temp/all_tld.list', 'r') as tld_fdesc:
            for tld in tld_fdesc:
                tld = tld.strip()
                TLD_LIST.add(tld)

    @staticmethod
    def validate_domain(path: str) -> bool:
        if len(path) > 255:
            return False
        splits = path.split('.')
        if not TLD_LIST:
            Database.populate_tld_list()
        if splits[-1] not in TLD_LIST:
            return False
        for split in splits:
            if not 1 <= len(split) <= 63:
                return False
        return True

    @staticmethod
    def pack_domain(domain: str) -> DomainPath:
        return DomainPath(domain.split('.')[::-1])

    @staticmethod
    def unpack_domain(domain: DomainPath) -> str:
        return '.'.join(domain.parts[::-1])

    @staticmethod
    def pack_asn(asn: str) -> AsnPath:
        asn = asn.upper()
        if asn.startswith('AS'):
            asn = asn[2:]
        return AsnPath(int(asn))

    @staticmethod
    def unpack_asn(asn: AsnPath) -> str:
        return f'AS{asn.asn}'

    @staticmethod
    def validate_ip4address(path: str) -> bool:
        splits = path.split('.')
        if len(splits) != 4:
            return False
        for split in splits:
            try:
                if not 0 <= int(split) <= 255:
                    return False
            except ValueError:
                return False
        return True

    @staticmethod
    def pack_ip4address(address: str) -> Ip4Path:
        addr = 0
        for split in address.split('.'):
            addr = (addr << 8) + int(split)
        return Ip4Path(addr, 32)

    @staticmethod
    def unpack_ip4address(address: Ip4Path) -> str:
        addr = address.value
        assert address.prefixlen == 32
        octets: typing.List[int] = [0] * 4
        for o in reversed(range(4)):
            octets[o] = addr & 0xFF
            addr >>= 8
        return '.'.join(map(str, octets))

    @staticmethod
    def validate_ip4network(path: str) -> bool:
        # A bit generous but ok for our usage
        splits = path.split('/')
        if len(splits) != 2:
            return False
        if not Database.validate_ip4address(splits[0]):
            return False
        try:
            if not 0 <= int(splits[1]) <= 32:
                return False
        except ValueError:
            return False
        return True

    @staticmethod
    def pack_ip4network(network: str) -> Ip4Path:
        address, prefixlen_str = network.split('/')
        prefixlen = int(prefixlen_str)
        addr = Database.pack_ip4address(address)
        addr.prefixlen = prefixlen
        return addr

    @staticmethod
    def unpack_ip4network(network: Ip4Path) -> str:
        addr = network.value
        octets: typing.List[int] = [0] * 4
        for o in reversed(range(4)):
            octets[o] = addr & 0xFF
            addr >>= 8
        return '.'.join(map(str, octets)) + '/' + str(network.prefixlen)

    def get_match(self, path: Path) -> Match:
        if isinstance(path, RuleMultiPath):
            return self.rules[0]
        elif isinstance(path, RuleFirstPath):
            return self.rules[1]
        elif isinstance(path, AsnPath):
            return self.asns[path.asn]
        elif isinstance(path, DomainPath):
            dicd = self.domtree
            for part in path.parts:
                dicd = dicd.children[part]
            if isinstance(path, HostnamePath):
                return dicd.match_hostname
            elif isinstance(path, ZonePath):
                return dicd.match_zone
            else:
                raise ValueError
        elif isinstance(path, Ip4Path):
            dici = self.ip4tree
            for i in range(31, 31-path.prefixlen, -1):
                bit = (path.value >> i) & 0b1
                dici_next = dici.one if bit else dici.zero
                if not dici_next:
                    raise IndexError
                dici = dici_next
            return dici
        else:
            raise ValueError

    def exec_each_asn(self,
                      callback: MatchCallable,
                      ) -> typing.Any:
        for asn in self.asns:
            match = self.asns[asn]
            if match.active():
                c = callback(
                    AsnPath(asn),
                    match,
                )
                try:
                    yield from c
                except TypeError:  # not iterable
                    pass

    def exec_each_domain(self,
                         callback: MatchCallable,
                         _dic: typing.Optional[DomainTreeNode] = None,
                         _par: typing.Optional[DomainPath] = None,
                         ) -> typing.Any:
        _dic = _dic or self.domtree
        _par = _par or DomainPath([])
        if _dic.match_hostname.active():
            c = callback(
                HostnamePath(_par.parts),
                _dic.match_hostname,
            )
            try:
                yield from c
            except TypeError:  # not iterable
                pass
        if _dic.match_zone.active():
            c = callback(
                ZonePath(_par.parts),
                _dic.match_zone,
            )
            try:
                yield from c
            except TypeError:  # not iterable
                pass
        for part in _dic.children:
            dic = _dic.children[part]
            yield from self.exec_each_domain(
                callback,
                _dic=dic,
                _par=DomainPath(_par.parts + [part])
            )

    def exec_each_ip4(self,
                      callback: MatchCallable,
                      _dic: typing.Optional[IpTreeNode] = None,
                      _par: typing.Optional[Ip4Path] = None,
                      ) -> typing.Any:
        _dic = _dic or self.ip4tree
        _par = _par or Ip4Path(0, 0)
        if _dic.active():
            c = callback(
                _par,
                _dic,
            )
            try:
                yield from c
            except TypeError:  # not iterable
                pass

        # 0
        pref = _par.prefixlen + 1
        dic = _dic.zero
        if dic:
            # addr0 = _par.value & (0xFFFFFFFF ^ (1 << (32-pref)))
            # assert addr0 == _par.value
            addr0 = _par.value
            yield from self.exec_each_ip4(
                callback,
                _dic=dic,
                _par=Ip4Path(addr0, pref)
            )
        # 1
        dic = _dic.one
        if dic:
            addr1 = _par.value | (1 << (32-pref))
            # assert addr1 != _par.value
            yield from self.exec_each_ip4(
                callback,
                _dic=dic,
                _par=Ip4Path(addr1, pref)
            )

    def exec_each(self,
                  callback: MatchCallable,
                  ) -> typing.Any:
        yield from self.exec_each_domain(callback)
        yield from self.exec_each_ip4(callback)
        yield from self.exec_each_asn(callback)

    def update_references(self) -> None:
        # Should be correctly calculated normally,
        # keeping this just in case
        def reset_references_cb(path: Path,
                                match: Match
                                ) -> None:
            match.references = 0
        for _ in self.exec_each(reset_references_cb):
            pass

        def increment_references_cb(path: Path,
                                    match: Match
                                    ) -> None:
            if match.source:
                source = self.get_match(match.source)
                source.references += 1
        for _ in self.exec_each(increment_references_cb):
            pass

    def prune(self, before: int, base_only: bool = False) -> None:
        raise NotImplementedError

    def explain(self, path: Path) -> str:
        match = self.get_match(path)
        if isinstance(match, AsnNode):
            string = f'{path} ({match.name}) #{match.references}'
        else:
            string = f'{path} #{match.references}'
        if match.source:
            string += f' ← {self.explain(match.source)}'
        return string

    def list_records(self,
                     first_party_only: bool = False,
                     end_chain_only: bool = False,
                     no_dupplicates: bool = False,
                     rules_only: bool = False,
                     hostnames_only: bool = False,
                     explain: bool = False,
                     ) -> typing.Iterable[str]:

        def export_cb(path: Path, match: Match
                      ) -> typing.Iterable[str]:
            if first_party_only and not match.first_party:
                return
            if end_chain_only and match.references > 0:
                return
            if no_dupplicates and match.dupplicate:
                return
            if rules_only and match.level > 1:
                return
            if hostnames_only and not isinstance(path, HostnamePath):
                return

            if explain:
                yield self.explain(path)
            else:
                yield str(path)

        yield from self.exec_each(export_cb)

    def count_records(self,
                      first_party_only: bool = False,
                      end_chain_only: bool = False,
                      no_dupplicates: bool = False,
                      rules_only: bool = False,
                      hostnames_only: bool = False,
                      ) -> str:
        memo: typing.Dict[str, int] = dict()

        def count_records_cb(path: Path, match: Match) -> None:
            if first_party_only and not match.first_party:
                return
            if end_chain_only and match.references > 0:
                return
            if no_dupplicates and match.dupplicate:
                return
            if rules_only and match.level > 1:
                return
            if hostnames_only and not isinstance(path, HostnamePath):
                return

            try:
                memo[path.__class__.__name__] += 1
            except KeyError:
                memo[path.__class__.__name__] = 1

        for _ in self.exec_each(count_records_cb):
            pass

        split: typing.List[str] = list()
        for key, value in sorted(memo.items(), key=lambda s: s[0]):
            split.append(f'{key[:-4].lower()}s: {value}')
        return ', '.join(split)

    def get_domain(self, domain_str: str) -> typing.Iterable[DomainPath]:
        self.enter_step('get_domain_pack')
        domain = self.pack_domain(domain_str)
        self.enter_step('get_domain_brws')
        dic = self.domtree
        depth = 0
        for part in domain.parts:
            if dic.match_zone.active():
                self.enter_step('get_domain_yield')
                yield ZonePath(domain.parts[:depth])
                self.enter_step('get_domain_brws')
            if part not in dic.children:
                return
            dic = dic.children[part]
            depth += 1
        if dic.match_zone.active():
            self.enter_step('get_domain_yield')
            yield ZonePath(domain.parts)
        if dic.match_hostname.active():
            self.enter_step('get_domain_yield')
            yield HostnamePath(domain.parts)

    def get_ip4(self, ip4_str: str) -> typing.Iterable[Path]:
        self.enter_step('get_ip4_pack')
        ip4 = self.pack_ip4address(ip4_str)
        self.enter_step('get_ip4_cache')
        if not self.ip4cache[ip4.value >> self.ip4cache_shift]:
            return
        self.enter_step('get_ip4_brws')
        dic = self.ip4tree
        for i in range(31, 31-ip4.prefixlen, -1):
            bit = (ip4.value >> i) & 0b1
            if dic.active():
                self.enter_step('get_ip4_yield')
                yield Ip4Path(ip4.value >> (i+1) << (i+1), 31-i)
                self.enter_step('get_ip4_brws')
            next_dic = dic.one if bit else dic.zero
            if next_dic is None:
                return
            dic = next_dic
        if dic.active():
            self.enter_step('get_ip4_yield')
            yield ip4

    def _set_match(self,
                   match: Match,
                   updated: int,
                   source: Path,
                   source_match: typing.Optional[Match] = None,
                   dupplicate: bool = False,
                   ) -> None:
        # source_match is in parameters because most of the time
        # its parent function needs it too,
        # so it can pass it to save a traversal
        source_match = source_match or self.get_match(source)
        new_level = source_match.level + 1
        if updated > match.updated or new_level < match.level \
                or source_match.first_party > match.first_party:
            # NOTE FP and level of matches referencing this one
            # won't be updated until run or prune
            if match.source:
                old_source = self.get_match(match.source)
                old_source.references -= 1
            match.updated = updated
            match.level = new_level
            match.first_party = source_match.first_party
            match.source = source
            source_match.references += 1
            match.dupplicate = dupplicate

    def _set_domain(self,
                    hostname: bool,
                    domain_str: str,
                    updated: int,
                    source: Path) -> None:
        self.enter_step('set_domain_val')
        if not Database.validate_domain(domain_str):
            raise ValueError(f"Invalid domain: {domain_str}")
        self.enter_step('set_domain_pack')
        domain = self.pack_domain(domain_str)
        self.enter_step('set_domain_fp')
        source_match = self.get_match(source)
        is_first_party = source_match.first_party
        self.enter_step('set_domain_brws')
        dic = self.domtree
        dupplicate = False
        for part in domain.parts:
            if part not in dic.children:
                dic.children[part] = DomainTreeNode()
            dic = dic.children[part]
            if dic.match_zone.active(is_first_party):
                dupplicate = True
        if hostname:
            match = dic.match_hostname
        else:
            match = dic.match_zone
        self._set_match(
            match,
            updated,
            source,
            source_match=source_match,
            dupplicate=dupplicate,
        )

    def set_hostname(self,
                     *args: typing.Any, **kwargs: typing.Any
                     ) -> None:
        self._set_domain(True, *args, **kwargs)

    def set_zone(self,
                 *args: typing.Any, **kwargs: typing.Any
                 ) -> None:
        self._set_domain(False, *args, **kwargs)

    def set_asn(self,
                asn_str: str,
                updated: int,
                source: Path) -> None:
        self.enter_step('set_asn')
        path = self.pack_asn(asn_str)
        if path.asn in self.asns:
            match = self.asns[path.asn]
        else:
            match = AsnNode()
            self.asns[path.asn] = match
        self._set_match(
            match,
            updated,
            source,
        )

    def _set_ip4(self,
                 ip4: Ip4Path,
                 updated: int,
                 source: Path) -> None:
        self.enter_step('set_ip4_fp')
        source_match = self.get_match(source)
        is_first_party = source_match.first_party
        self.enter_step('set_ip4_brws')
        dic = self.ip4tree
        dupplicate = False
        for i in range(31, 31-ip4.prefixlen, -1):
            bit = (ip4.value >> i) & 0b1
            next_dic = dic.one if bit else dic.zero
            if next_dic is None:
                next_dic = IpTreeNode()
                if bit:
                    dic.one = next_dic
                else:
                    dic.zero = next_dic
            dic = next_dic
            if dic.active(is_first_party):
                dupplicate = True
        self._set_match(
            dic,
            updated,
            source,
            source_match=source_match,
            dupplicate=dupplicate,
        )
        self._set_ip4cache(ip4, dic)

    def set_ip4address(self,
                       ip4address_str: str,
                       *args: typing.Any, **kwargs: typing.Any
                       ) -> None:
        self.enter_step('set_ip4add_val')
        if not Database.validate_ip4address(ip4address_str):
            raise ValueError(f"Invalid ip4address: {ip4address_str}")
        self.enter_step('set_ip4add_pack')
        ip4 = self.pack_ip4address(ip4address_str)
        self._set_ip4(ip4, *args, **kwargs)
|
||||
|
||||
def set_ip4network(self,
|
||||
ip4network_str: str,
|
||||
*args: typing.Any, **kwargs: typing.Any
|
||||
) -> None:
|
||||
self.enter_step('set_ip4net_val')
|
||||
if not Database.validate_ip4network(ip4network_str):
|
||||
raise ValueError(f"Invalid ip4network: {ip4network_str}")
|
||||
self.enter_step('set_ip4net_pack')
|
||||
ip4 = self.pack_ip4network(ip4network_str)
|
||||
self._set_ip4(ip4, *args, **kwargs)
|
46 db.py Executable file
@@ -0,0 +1,46 @@
#!/usr/bin/env python3

import argparse
import database
import time
import os

if __name__ == '__main__':

    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Database operations")
    parser.add_argument(
        '-i', '--initialize', action='store_true',
        help="Reconstruct the whole database")
    parser.add_argument(
        '-p', '--prune', action='store_true',
        help="Remove old entries from database")
    parser.add_argument(
        '-b', '--prune-base', action='store_true',
        help="With --prune, only prune base rules "
        "(the ones added by ./feed_rules.py)")
    parser.add_argument(
        '-s', '--prune-before', type=int,
        default=(int(time.time()) - 60*60*24*31*6),
        help="With --prune, only rules updated before "
        "this UNIX timestamp will be deleted")
    parser.add_argument(
        '-r', '--references', action='store_true',
        help="DEBUG: Update the reference count")
    args = parser.parse_args()

    if not args.initialize:
        DB = database.Database()
    else:
        if os.path.isfile(database.Database.PATH):
            os.unlink(database.Database.PATH)
        DB = database.Database()

    DB.enter_step('main')
    if args.prune:
        DB.prune(before=args.prune_before, base_only=args.prune_base)
    if args.references:
        DB.update_references()

    DB.save()
74 dist/README.md vendored Normal file
@@ -0,0 +1,74 @@
# Geoffrey Frogeye's block list of first-party trackers

## What's a first-party tracker?

A tracker is a script put on many websites to gather information about the visitors.
They can be used for multiple reasons: statistics, risk management, marketing, ads serving…
In any case, they are a threat to Internet users' privacy and many may want to block them.

Traditionally, trackers are served from a third party.
For example, `website1.com` and `website2.com` both load their tracking script from `https://trackercompany.com/trackerscript.js`.
In order to block those, one can simply block the hostname `trackercompany.com`, which is what most ad blockers do.

However, to circumvent this block, tracker companies made the websites using them load trackers from `somestring.website1.com`.
The latter is a DNS redirection to `website1.trackercompany.com`, pointing directly to an IP address belonging to the tracking company.
Those are called first-party trackers.

In order to block those trackers, ad blockers would need to block every subdomain pointing to anything under `trackercompany.com` or to their network.
Unfortunately, most don't support those blocking methods as they are not DNS-aware, i.e. they only see `somestring.website1.com`.

This list is an inventory of every `somestring.website1.com` found, to allow non-DNS-aware ad blockers to still block first-party trackers.
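The zone-style matching described above (a rule for `trackercompany.com` also covering every subdomain under it) can be sketched in a few lines of Python. `matches_zone` and the sample zone set are illustrative only, not part of this project's code:

```python
def matches_zone(hostname: str, zones: set) -> bool:
    """True if hostname is a blocked zone or a subdomain of one."""
    parts = hostname.split('.')
    # Test the hostname itself, then each parent domain, against the zone set
    return any('.'.join(parts[i:]) in zones for i in range(len(parts)))


# Example: a zone rule for the tracking company's domain
zones = {'trackercompany.com'}
print(matches_zone('website1.trackercompany.com', zones))  # True
print(matches_zone('somestring.website1.com', zones))      # False: needs DNS data
```

This is exactly why DNS data is required: `somestring.website1.com` only matches once its CNAME destination is known.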
## List variants

### First-party trackers (recommended)

- Hosts file: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/firstparty-trackers.txt>

This list contains every hostname redirecting to [a hand-picked list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/rules/first-party.list).
It should be safe from false positives.
Don't be afraid of the size of the list: it is due to the nature of first-party trackers, as a single tracker generates at least one hostname per client (typically two).

### First-party only trackers

- Hosts file: <https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/firstparty-only-trackers.txt>

This is the same list as above, albeit without the hostnames under the tracking companies' own domains.
This reduces the size of the list, but it no longer protects against third-party tracking by those companies.
Use it in conjunction with other block lists.

### Multi-party trackers

- Hosts file: <https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/multiparty-trackers.txt>

As first-party trackers usually evolve from third-party trackers, this list contains every hostname redirecting to trackers found in existing lists of third-party trackers (see next section).
Since the latter were not designed with first-party trackers in mind, they are likely to contain false positives.
On the other hand, they might protect against first-party trackers that we're not aware of or have not yet confirmed.

#### Sources of third-party trackers

- [EasyPrivacy](https://easylist.to/easylist/easyprivacy.txt)

(Yes, there's only one for now. A lot of existing ones cause a lot of false positives.)

### Multi-party only trackers

- Hosts file: <https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/multiparty-only-trackers.txt>

This is the same list as above, albeit without the hostnames under the tracking companies' own domains.
This reduces the size of the list, but it no longer protects against third-party tracking by those companies.
Use it in conjunction with other block lists, especially the ones used to generate this list (see the previous section).

## Meta

In case of false positives/negatives, or any other question, contact me the way you like: <https://geoffrey.frogeye.fr>

The software used to generate this list is available here: <https://git.frogeye.fr/geoffrey/eulaurarien>

Some of the first-party trackers included in this list have been found by:

- [Aeris](https://imirhil.fr/)
- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors
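The hosts-file variants above pair each blocked hostname with the unroutable address `0.0.0.0`. A minimal sketch of reading such a file back into a set of hostnames (`parse_hosts` is illustrative, not part of the project's tooling):

```python
def parse_hosts(text: str) -> set:
    """Extract blocked hostnames from hosts-file formatted text."""
    hosts = set()
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, _, hostname = line.partition(' ')
        if address == '0.0.0.0' and hostname:
            hosts.add(hostname.strip())
    return hosts


sample = "# comment\n0.0.0.0 tracker.example.com\n\n0.0.0.0 cdn.tracker.example.com\n"
print(sorted(parse_hosts(sample)))
# ['cdn.tracker.example.com', 'tracker.example.com']
```

This is also why the raw-list variants exist: stripping the leading `0.0.0.0 ` from each line of a hosts file yields the raw list.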
64 export.py Executable file
@@ -0,0 +1,64 @@
#!/usr/bin/env python3

import database
import argparse
import sys


if __name__ == '__main__':

    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Export the hostnames rules stored "
        "in the Database as plain text")
    parser.add_argument(
        '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
        help="Output file, one rule per line")
    parser.add_argument(
        '-f', '--first-party', action='store_true',
        help="Only output rules issued from first-party sources")
    parser.add_argument(
        '-e', '--end-chain', action='store_true',
        help="Only output rules that are not referenced by any other")
    parser.add_argument(
        '-r', '--rules', action='store_true',
        help="Output all kinds of rules, not just hostnames")
    parser.add_argument(
        '-b', '--base-rules', action='store_true',
        help="Output base rules "
        "(the ones added by ./feed_rules.py) "
        "(implies --rules)")
    parser.add_argument(
        '-d', '--no-dupplicates', action='store_true',
        help="Do not output rules that already match a zone/network rule "
        "(e.g. dummy.example.com when there's a zone example.com rule)")
    parser.add_argument(
        '-x', '--explain', action='store_true',
        help="Show the chain of rules leading to one "
        "(and the number of references they have)")
    parser.add_argument(
        '-c', '--count', action='store_true',
        help="Show the number of rules per type instead of listing them")
    args = parser.parse_args()

    DB = database.Database()

    if args.count:
        assert not args.explain
        print(DB.count_records(
            first_party_only=args.first_party,
            end_chain_only=args.end_chain,
            no_dupplicates=args.no_dupplicates,
            rules_only=args.base_rules,
            hostnames_only=not (args.rules or args.base_rules),
        ))
    else:
        for domain in DB.list_records(
            first_party_only=args.first_party,
            end_chain_only=args.end_chain,
            no_dupplicates=args.no_dupplicates,
            rules_only=args.base_rules,
            hostnames_only=not (args.rules or args.base_rules),
            explain=args.explain,
        ):
            print(domain, file=args.output)
98 export_lists.sh Executable file
@@ -0,0 +1,98 @@
#!/usr/bin/env bash

function log() {
    echo -e "\033[33m$@\033[0m"
}

log "Calculating statistics…"
gen_date=$(date -Isec)
gen_software=$(git describe --tags)
number_websites=$(wc -l < temp/all_websites.list)
number_subdomains=$(wc -l < temp/all_subdomains.list)
number_dns=$(grep '^$' temp/all_resolved.txt | wc -l)

for partyness in {first,multi}
do
    if [ $partyness = "first" ]
    then
        partyness_flags="--first-party"
    else
        partyness_flags=""
    fi

    echo "Statistics for ${partyness}-party trackers"
    echo "Input rules: $(./export.py --count --base-rules $partyness_flags)"
    echo "Subsequent rules: $(./export.py --count --rules $partyness_flags)"
    echo "Subsequent rules (no dupplicate): $(./export.py --count --rules --no-dupplicates $partyness_flags)"
    echo "Output hostnames: $(./export.py --count $partyness_flags)"
    echo "Output hostnames (no dupplicate): $(./export.py --count --no-dupplicates $partyness_flags)"
    echo "Output hostnames (end-chain only): $(./export.py --count --end-chain $partyness_flags)"
    echo "Output hostnames (no dupplicate, end-chain only): $(./export.py --count --no-dupplicates --end-chain $partyness_flags)"
    echo

    for trackerness in {trackers,only-trackers}
    do
        if [ $trackerness = "trackers" ]
        then
            trackerness_flags=""
        else
            trackerness_flags="--end-chain --no-dupplicates"
        fi
        file_list="dist/${partyness}party-${trackerness}.txt"
        file_host="dist/${partyness}party-${trackerness}-hosts.txt"

        log "Generating lists for variant ${partyness}-party ${trackerness}…"

        # Real export heeere
        ./export.py $partyness_flags $trackerness_flags > $file_list
        # Sometimes a bit heavy to have the DB open and sort the output
        # so this is done in two steps
        sort -u $file_list -o $file_list

        rules_input=$(./export.py --count --base-rules $partyness_flags)
        rules_found=$(./export.py --count --rules $partyness_flags)
        rules_output=$(./export.py --count $partyness_flags $trackerness_flags)

        function link() { # link partyness, link trackerness
            url="https://hostfiles.frogeye.fr/${1}party-${2}-hosts.txt"
            if [ "$1" = "$partyness" ] && [ "$2" = "$trackerness" ]
            then
                url="$url (this one)"
            fi
            echo $url
        }

        (
            echo "# First-party trackers host list"
            echo "# Variant: ${partyness}-party ${trackerness}"
            echo "#"
            echo "# About first-party trackers: TODO"
            echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
            echo "#"
            echo "# In case of false positives/negatives, or any other question,"
            echo "# contact me the way you like: https://geoffrey.frogeye.fr"
            echo "#"
            echo "# Latest versions and variants:"
            echo "# - First-party trackers : $(link first trackers)"
            echo "# - … excluding redirected: $(link first only-trackers)"
            echo "# - First and third party : $(link multi trackers)"
            echo "# - … excluding redirected: $(link multi only-trackers)"
            echo '# (variants informations: TODO)'
            echo '# (you can remove `-hosts` to get the raw list)'
            echo "#"
            echo "# Generation date: $gen_date"
            echo "# Generation software: eulaurarien $gen_software"
            echo "# Number of source websites: $number_websites"
            echo "# Number of source subdomains: $number_subdomains"
            echo "# Number of source DNS records: ~2E9 + $number_dns"
            echo "#"
            echo "# Input rules: $rules_input"
            echo "# Subsequent rules: $rules_found"
            echo "# Output rules: $rules_output"
            echo "#"
            echo
            sed 's|^|0.0.0.0 |' "$file_list"
        ) > "$file_host"

    done
done
71 feed_asn.py Executable file
@@ -0,0 +1,71 @@
#!/usr/bin/env python3

import database
import argparse
import requests
import typing
import ipaddress
import logging
import time

IPNetwork = typing.Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def get_ranges(asn: str) -> typing.Iterable[str]:
    req = requests.get(
        'https://stat.ripe.net/data/as-routing-consistency/data.json',
        params={'resource': asn}
    )
    data = req.json()
    for pref in data['data']['prefixes']:
        yield pref['prefix']


def get_name(asn: str) -> str:
    req = requests.get(
        'https://stat.ripe.net/data/as-overview/data.json',
        params={'resource': asn}
    )
    data = req.json()
    return data['data']['holder']


if __name__ == '__main__':

    log = logging.getLogger('feed_asn')

    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Add the IP ranges associated to the AS in the database")
    args = parser.parse_args()

    DB = database.Database()

    def add_ranges(path: database.Path,
                   match: database.Match,
                   ) -> None:
        assert isinstance(path, database.AsnPath)
        assert isinstance(match, database.AsnNode)
        asn_str = database.Database.unpack_asn(path)
        DB.enter_step('asn_get_name')
        name = get_name(asn_str)
        match.name = name
        DB.enter_step('asn_get_ranges')
        for prefix in get_ranges(asn_str):
            parsed_prefix: IPNetwork = ipaddress.ip_network(prefix)
            if parsed_prefix.version == 4:
                DB.set_ip4network(
                    prefix,
                    source=path,
                    updated=int(time.time())
                )
                log.info('Added %s from %s (%s)', prefix, path, name)
            elif parsed_prefix.version == 6:
                log.warning('Unimplemented prefix version: %s', prefix)
            else:
                log.error('Unknown prefix version: %s', prefix)

    for _ in DB.exec_each_asn(add_ranges):
        pass

    DB.save()
227 feed_dns.py Executable file
@@ -0,0 +1,227 @@
#!/usr/bin/env python3

import argparse
import database
import logging
import sys
import typing
import multiprocessing
import time

Record = typing.Tuple[typing.Callable, typing.Callable, int, str, str]

# select, write
FUNCTION_MAP: typing.Any = {
    'a': (
        database.Database.get_ip4,
        database.Database.set_hostname,
    ),
    'cname': (
        database.Database.get_domain,
        database.Database.set_hostname,
    ),
    'ptr': (
        database.Database.get_domain,
        database.Database.set_ip4address,
    ),
}


class Writer(multiprocessing.Process):
    def __init__(self,
                 recs_queue: multiprocessing.Queue,
                 autosave_interval: int = 0,
                 ip4_cache: int = 0,
                 ):
        super(Writer, self).__init__()
        self.log = logging.getLogger('wr')
        self.recs_queue = recs_queue
        self.autosave_interval = autosave_interval
        self.ip4_cache = ip4_cache

    def run(self) -> None:
        self.db = database.Database()
        self.db.log = logging.getLogger('wr')
        self.db.fill_ip4cache(max_size=self.ip4_cache)
        if self.autosave_interval > 0:
            next_save = time.time() + self.autosave_interval
        else:
            next_save = 0

        self.db.enter_step('block_wait')
        block: typing.List[Record]
        for block in iter(self.recs_queue.get, None):

            record: Record
            for record in block:

                select, write, updated, name, value = record
                self.db.enter_step('feed_switch')

                try:
                    for source in select(self.db, value):
                        write(self.db, name, updated, source=source)
                except ValueError:
                    self.log.exception("Cannot execute: %s", record)

            if next_save > 0 and time.time() > next_save:
                self.log.info("Saving database...")
                self.db.save()
                self.log.info("Done!")
                next_save = time.time() + self.autosave_interval

            self.db.enter_step('block_wait')

        self.db.enter_step('end')
        self.db.save()


class Parser():
    def __init__(self,
                 buf: typing.Any,
                 recs_queue: multiprocessing.Queue,
                 block_size: int,
                 ):
        super(Parser, self).__init__()
        self.buf = buf
        self.log = logging.getLogger('pr')
        self.recs_queue = recs_queue
        self.block: typing.List[Record] = list()
        self.block_size = block_size
        self.prof = database.Profiler()
        self.prof.log = logging.getLogger('pr')

    def register(self, record: Record) -> None:
        self.prof.enter_step('register')
        self.block.append(record)
        if len(self.block) >= self.block_size:
            self.prof.enter_step('put_block')
            self.recs_queue.put(self.block)
            self.block = list()

    def run(self) -> None:
        self.consume()
        self.recs_queue.put(self.block)
        self.prof.profile()

    def consume(self) -> None:
        raise NotImplementedError


class Rapid7Parser(Parser):
    def consume(self) -> None:
        data = dict()
        for line in self.buf:
            self.prof.enter_step('parse_rapid7')
            split = line.split('"')

            try:
                for k in range(1, 14, 4):
                    key = split[k]
                    val = split[k+2]
                    data[key] = val

                select, writer = FUNCTION_MAP[data['type']]
                record = (
                    select,
                    writer,
                    int(data['timestamp']),
                    data['name'],
                    data['value']
                )
            except IndexError:
                self.log.exception("Cannot parse: %s", line)
                continue  # skip unparsable lines: record would be unset/stale
            self.register(record)


class MassDnsParser(Parser):
    # massdns --output Snrql
    # --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
    TYPES = {
        'A': (FUNCTION_MAP['a'][0], FUNCTION_MAP['a'][1], -1, None),
        # 'AAAA': (FUNCTION_MAP['aaaa'][0], FUNCTION_MAP['aaaa'][1], -1, None),
        'CNAME': (FUNCTION_MAP['cname'][0], FUNCTION_MAP['cname'][1], -1, -1),
    }

    def consume(self) -> None:
        self.prof.enter_step('parse_massdns')
        timestamp = 0
        header = True
        for line in self.buf:
            line = line[:-1]
            if not line:
                header = True
                continue

            split = line.split(' ')
            try:
                if header:
                    timestamp = int(split[1])
                    header = False
                else:
                    select, write, name_offset, value_offset = \
                        MassDnsParser.TYPES[split[1]]
                    record = (
                        select,
                        write,
                        timestamp,
                        split[0][:name_offset],
                        split[2][:value_offset],
                    )
                    self.register(record)
                    self.prof.enter_step('parse_massdns')
            except KeyError:
                continue


PARSERS = {
    'rapid7': Rapid7Parser,
    'massdns': MassDnsParser,
}

if __name__ == '__main__':

    # Parsing arguments
    log = logging.getLogger('feed_dns')
    args_parser = argparse.ArgumentParser(
        description="Read DNS records and import "
        "tracking-relevant data into the database")
    args_parser.add_argument(
        'parser',
        choices=PARSERS.keys(),
        help="Input format")
    args_parser.add_argument(
        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
        help="Input file")
    args_parser.add_argument(
        '-b', '--block-size', type=int, default=1024,
        help="Performance tuning value")
    args_parser.add_argument(
        '-q', '--queue-size', type=int, default=128,
        help="Performance tuning value")
    args_parser.add_argument(
        '-a', '--autosave-interval', type=int, default=900,
        help="Interval at which the database will save, in seconds. "
        "0 to disable.")
    args_parser.add_argument(
        '-4', '--ip4-cache', type=int, default=0,
        help="RAM cache for faster IPv4 lookup. "
        "Maximum useful value: 512 MiB (536870912). "
        "Warning: Depending on the rules, this might already "
        "be a memory-heavy process, even without the cache.")
    args = args_parser.parse_args()

    recs_queue: multiprocessing.Queue = multiprocessing.Queue(
        maxsize=args.queue_size)

    writer = Writer(recs_queue,
                    autosave_interval=args.autosave_interval,
                    ip4_cache=args.ip4_cache
                    )
    writer.start()

    parser = PARSERS[args.parser](args.input, recs_queue, args.block_size)
    parser.run()

    recs_queue.put(None)
    writer.join()
54 feed_rules.py Executable file
@@ -0,0 +1,54 @@
#!/usr/bin/env python3

import database
import argparse
import sys
import time

FUNCTION_MAP = {
    'zone': database.Database.set_zone,
    'hostname': database.Database.set_hostname,
    'asn': database.Database.set_asn,
    'ip4network': database.Database.set_ip4network,
    'ip4address': database.Database.set_ip4address,
}

if __name__ == '__main__':

    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Import base rules to the database")
    parser.add_argument(
        'type',
        choices=FUNCTION_MAP.keys(),
        help="Type of rule inputted")
    parser.add_argument(
        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
        help="File with one rule per line")
    parser.add_argument(
        '-f', '--first-party', action='store_true',
        help="The input only comes from verified first-party sources")
    args = parser.parse_args()

    DB = database.Database()

    fun = FUNCTION_MAP[args.type]

    source: database.RulePath
    if args.first_party:
        source = database.RuleFirstPath()
    else:
        source = database.RuleMultiPath()

    for rule in args.input:
        rule = rule.strip()
        try:
            fun(DB,
                rule,
                source=source,
                updated=int(time.time()),
                )
        except ValueError:
            DB.log.error(f"Could not add rule: {rule}")

    DB.save()
@@ -17,26 +17,13 @@ function dl() {
log "Retrieving rules…"
rm -f rules*/*.cache.*
dl https://easylist.to/easylist/easyprivacy.txt rules_adblock/easyprivacy.cache.txt
# From firebog.net Tracking & Telemetry Lists
dl https://v.firebog.net/hosts/Prigent-Ads.txt rules/prigent-ads.cache.list
# dl https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-blocklist.txt rules/notrack-blocklist.cache.list
# False positives: https://github.com/WaLLy3K/wally3k.github.io/issues/73 -> 69.media.tumblr.com chicdn.net
dl https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.2o7Net/hosts rules_hosts/add2o7.cache.txt
dl https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt rules_hosts/spy.cache.txt
# dl https://raw.githubusercontent.com/Kees1958/WS3_annual_most_used_survey_blocklist/master/w3tech_hostfile.txt rules/w3tech.cache.list
# False positives: agreements.apple.com -> edgekey.net
# dl https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt rules_hosts/ads-and-tracking-extended.cache.txt # Lots of false-positives
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/android-tracking.txt rules_hosts/android-tracking.cache.txt
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/SmartTV.txt rules_hosts/smart-tv.cache.txt
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/AmazonFireTV.txt rules_hosts/amazon-fire-tv.cache.txt

log "Retrieving TLD list…"
dl http://data.iana.org/TLD/tlds-alpha-by-domain.txt temp/all_tld.temp.list
grep -v '^#' temp/all_tld.temp.list | awk '{print tolower($0)}' > temp/all_tld.list

log "Retrieving nameservers…"
rm -f nameservers
touch nameservers
[ -f nameservers.head ] && cat nameservers.head >> nameservers
dl https://public-dns.info/nameservers.txt nameservers.temp
sort -R nameservers.temp >> nameservers
rm nameservers.temp
dl https://public-dns.info/nameservers.txt nameservers/public-dns.cache.list

log "Retrieving top subdomains…"
dl http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip top-1m.csv.zip
@@ -51,4 +38,3 @@ then
else
    mv temp/cisco-umbrella_popularity.fresh.list subdomains/cisco-umbrella_popularity.cache.list
fi
dl https://www.orwell1984.today/cname/eulerian.net.txt subdomains/orwell-eulerian-cname-list.cache.list
@@ -1,160 +0,0 @@
#!/usr/bin/env python3
# pylint: disable=C0103

"""
From a list of subdomains, output only
the ones resolving to a first-party tracker.
"""

import argparse
import sys
import progressbar
import csv
import typing
import ipaddress

# DomainRule = typing.Union[bool, typing.Dict[str, 'DomainRule']]
DomainRule = typing.Union[bool, typing.Dict]
# IpRule = typing.Union[bool, typing.Dict[int, 'DomainRule']]
IpRule = typing.Union[bool, typing.Dict]

RULES_DICT: DomainRule = dict()
RULES_IP_DICT: IpRule = dict()


def get_bits(address: ipaddress.IPv4Address) -> typing.Iterator[int]:
    for char in address.packed:
        for i in range(7, -1, -1):
            yield (char >> i) & 0b1


def subdomain_matching(subdomain: str) -> bool:
    parts = subdomain.split('.')
    parts.reverse()
    dic = RULES_DICT
    for part in parts:
        if isinstance(dic, bool) or part not in dic:
            break
        dic = dic[part]
    if isinstance(dic, bool):
        return dic
    return False


def ip_matching(ip_str: str) -> bool:
    ip = ipaddress.ip_address(ip_str)
    dic = RULES_IP_DICT
    i = 0
    for bit in get_bits(ip):
        i += 1
        if isinstance(dic, bool) or bit not in dic:
            break
        dic = dic[bit]
    if isinstance(dic, bool):
        return dic
    return False


def get_matching(chain: typing.List[str], no_explicit: bool = False
                 ) -> typing.Iterable[str]:
    if len(chain) <= 1:
        return
    initial = chain[0]
    cname_destinations = chain[1:-1]
    a_destination = chain[-1]
    initial_matching = subdomain_matching(initial)
    if no_explicit and initial_matching:
        return
    cname_matching = any(map(subdomain_matching, cname_destinations))
    if cname_matching or initial_matching or ip_matching(a_destination):
        yield initial


def register_rule(subdomain: str) -> None:
    # Make a tree with domain parts
    parts = subdomain.split('.')
    parts.reverse()
    dic = RULES_DICT
    last_part = len(parts) - 1
    for p, part in enumerate(parts):
        if isinstance(dic, bool):
            return
        if p == last_part:
            dic[part] = True
        else:
            dic.setdefault(part, dict())
            dic = dic[part]


def register_rule_ip(network: str) -> None:
    net = ipaddress.ip_network(network)
    ip = net.network_address
    dic = RULES_IP_DICT
    last_bit = net.prefixlen - 1
    for b, bit in enumerate(get_bits(ip)):
        if isinstance(dic, bool):
            return
        if b == last_bit:
            dic[bit] = True
        else:
            dic.setdefault(bit, dict())
            dic = dic[bit]


if __name__ == '__main__':

    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Filter first-party trackers from a list of subdomains")
    parser.add_argument(
        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
        help="Input file with DNS chains")
    parser.add_argument(
        '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
        help="Output file with one tracking subdomain per line")
    parser.add_argument(
        '-n', '--no-explicit', action='store_true',
        help="Don't output domains already blocked with rules without CNAME")
    parser.add_argument(
        '-r', '--rules', type=argparse.FileType('r'),
        help="List of domains to block (with their subdomains)")
    parser.add_argument(
        '-p', '--rules-ip', type=argparse.FileType('r'),
        help="List of IP ranges to block")
    args = parser.parse_args()

    # Progress bar
    widgets = [
        progressbar.Percentage(),
        ' ', progressbar.SimpleProgress(),
        ' ', progressbar.Bar(),
        ' ', progressbar.Timer(),
        ' ', progressbar.AdaptiveTransferSpeed(unit='req'),
        ' ', progressbar.AdaptiveETA(),
    ]
    progress = progressbar.ProgressBar(widgets=widgets)

    # Reading rules
    if args.rules:
        for rule in args.rules:
            register_rule(rule.strip())
    if args.rules_ip:
        for rule in args.rules_ip:
            register_rule_ip(rule.strip())

    # Approximating line count
    if args.input.seekable():
        lines = 0
        for line in args.input:
            lines += 1
        progress.max_value = lines
        args.input.seek(0)

    # Reading domains to filter
    reader = csv.reader(args.input)
    progress.start()
    for chain in reader:
        for match in get_matching(chain, no_explicit=args.no_explicit):
            print(match, file=args.output)
        progress.update(progress.value + 1)
    progress.finish()
|
@@ -1,85 +0,0 @@
#!/usr/bin/env bash

function log() {
    echo -e "\033[33m$@\033[0m"
}

if [ ! -f temp/all_resolved.csv ]
then
    echo "Run ./resolve_subdomains.sh first!"
    exit 1
fi

# Gather all the rules for filtering
log "Compiling rules…"
cat rules_adblock/*.txt | grep -v '^!' | grep -v '^\[Adblock' | sort -u > temp/all_rules_adblock.txt
./adblock_to_domain_list.py --input temp/all_rules_adblock.txt --output rules/from_adblock.cache.list
cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 > rules/from_hosts.cache.list
cat rules/*.list | grep -v '^#' | grep -v '^$' | sort -u > temp/all_rules_multi.list
cat rules/first-party.list | grep -v '^#' | grep -v '^$' | sort -u > temp/all_rules_first.list
cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | sort -u > temp/all_ip_rules_multi.txt
cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | sort -u > temp/all_ip_rules_first.txt

log "Filtering first-party tracking domains…"
./filter_subdomains.py --rules temp/all_rules_first.list --rules-ip temp/all_ip_rules_first.txt --input temp/all_resolved_sorted.csv --output temp/firstparty-trackers.list
sort -u temp/firstparty-trackers.list > dist/firstparty-trackers.txt

log "Filtering first-party curated tracking domains…"
./filter_subdomains.py --rules temp/all_rules_first.list --rules-ip temp/all_ip_rules_first.txt --input temp/all_resolved_sorted.csv --no-explicit --output temp/firstparty-only-trackers.list
sort -u temp/firstparty-only-trackers.list > dist/firstparty-only-trackers.txt

log "Filtering multi-party tracking domains…"
./filter_subdomains.py --rules temp/all_rules_multi.list --rules-ip temp/all_ip_rules_multi.txt --input temp/all_resolved_sorted.csv --output temp/multiparty-trackers.list
sort -u temp/multiparty-trackers.list > dist/multiparty-trackers.txt

log "Filtering multi-party curated tracking domains…"
./filter_subdomains.py --rules temp/all_rules_multi.list --rules-ip temp/all_ip_rules_multi.txt --input temp/all_resolved_sorted.csv --no-explicit --output temp/multiparty-only-trackers.list
sort -u temp/multiparty-only-trackers.list > dist/multiparty-only-trackers.txt

# Format the blocklist so it can be used as a hosts list
function generate_hosts {
    basename="$1"
    description="$2"
    description2="$3"

    (
        echo "# First-party trackers host list"
        echo "# $description"
        echo "# $description2"
        echo "#"
        echo "# About first-party trackers: https://git.frogeye.fr/geoffrey/eulaurarien#whats-a-first-party-tracker"
        echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
        echo "#"
        echo "# In case of false positives/negatives, or any other question,"
        echo "# contact me the way you like: https://geoffrey.frogeye.fr"
        echo "#"
        echo "# Latest version:"
        echo "# - First-party trackers : https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt"
        echo "# - … excluding redirected: https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt"
        echo "# - First and third party : https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt"
        echo "# - … excluding redirected: https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt"
        echo "#"
        echo "# Generation date: $(date -Isec)"
        echo "# Generation software: eulaurarien $(git describe --tags)"
        echo "# Number of source websites: $(wc -l temp/all_websites.list | cut -d' ' -f1)"
        echo "# Number of source subdomains: $(wc -l temp/all_subdomains.list | cut -d' ' -f1)"
        echo "#"
        echo "# Number of known first-party trackers: $(wc -l temp/all_rules_first.list | cut -d' ' -f1)"
        echo "# Number of first-party subdomains: $(wc -l dist/firstparty-trackers.txt | cut -d' ' -f1)"
        echo "# … excluding redirected: $(wc -l dist/firstparty-only-trackers.txt | cut -d' ' -f1)"
        echo "#"
        echo "# Number of known multi-party trackers: $(wc -l temp/all_rules_multi.list | cut -d' ' -f1)"
        echo "# Number of multi-party subdomains: $(wc -l dist/multiparty-trackers.txt | cut -d' ' -f1)"
        echo "# … excluding redirected: $(wc -l dist/multiparty-only-trackers.txt | cut -d' ' -f1)"
        echo
        cat "dist/$basename.txt" | while read host;
        do
            echo "0.0.0.0 $host"
        done
    ) > "dist/$basename-hosts.txt"
}

generate_hosts "firstparty-trackers" "Generated from a curated list of first-party trackers" ""
generate_hosts "firstparty-only-trackers" "Generated from a curated list of first-party trackers" "Only contains the first chain of redirection."
generate_hosts "multiparty-trackers" "Generated from known third-party trackers." "Also contains trackers used as third-party."
generate_hosts "multiparty-only-trackers" "Generated from known third-party trackers." "Does not contain trackers used as third-party. Use in combination with third-party lists."
26
import_rapid7.sh
Executable file
@@ -0,0 +1,26 @@
#!/usr/bin/env bash

function log() {
    echo -e "\033[33m$@\033[0m"
}

function feed_rapid7_fdns { # dataset
    dataset=$1
    line=$(curl -s https://opendata.rapid7.com/sonar.fdns_v2/ | grep "href=\".\+-fdns_$dataset.json.gz\"")
    link="https://opendata.rapid7.com$(echo "$line" | cut -d'"' -f2)"
    log "Reading $(echo "$dataset" | awk '{print toupper($0)}') records from $link"
    curl -L "$link" | gunzip
}

function feed_rapid7_rdns {
    dataset=$1
    line=$(curl -s https://opendata.rapid7.com/sonar.rdns_v2/ | grep "href=\".\+-rdns.json.gz\"")
    link="https://opendata.rapid7.com$(echo "$line" | cut -d'"' -f2)"
    log "Reading PTR records from $link"
    curl -L "$link" | gunzip
}

feed_rapid7_rdns | ./feed_dns.py rapid7
feed_rapid7_fdns a | ./feed_dns.py rapid7 --ip4-cache 536870912
# feed_rapid7_fdns aaaa | ./feed_dns.py rapid7 --ip6-cache 536870912
feed_rapid7_fdns cname | ./feed_dns.py rapid7
22
import_rules.sh
Executable file
@@ -0,0 +1,22 @@
#!/usr/bin/env bash

function log() {
    echo -e "\033[33m$@\033[0m"
}

log "Importing rules…"
BEFORE="$(date +%s)"
cat rules_adblock/*.txt | grep -v '^!' | grep -v '^\[Adblock' | ./adblock_to_domain_list.py | ./feed_rules.py zone
cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 | ./feed_rules.py zone
cat rules/*.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone
cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network
cat rules_asn/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn

cat rules/first-party.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone --first-party
cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network --first-party
cat rules_asn/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn --first-party

./feed_asn.py

# log "Pruning old rules…"
# ./db.py --prune --prune-before "$BEFORE" --prune-base
2
nameservers/.gitignore
vendored
Normal file
@@ -0,0 +1,2 @@
*.custom.list
*.cache.list
24
nameservers/popular.list
Normal file
@@ -0,0 +1,24 @@
8.8.8.8
8.8.4.4
2001:4860:4860:0:0:0:0:8888
2001:4860:4860:0:0:0:0:8844
208.67.222.222
208.67.220.220
2620:119:35::35
2620:119:53::53
4.2.2.1
4.2.2.2
8.26.56.26
8.20.247.20
84.200.69.80
84.200.70.40
2001:1608:10:25:0:0:1c04:b12f
2001:1608:10:25:0:0:9249:d69b
9.9.9.10
149.112.112.10
2620:fe::10
2620:fe::fe:10
1.1.1.1
1.0.0.1
2606:4700:4700::1111
2606:4700:4700::1001
21
regexes.py
@@ -1,21 +0,0 @@
#!/usr/bin/env python3

"""
List of regexes matching first-party trackers.
"""

# Syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax

REGEXES = [
    r'^.+\.eulerian\.net\.$',  # Eulerian
    r'^.+\.criteo\.com\.$',  # Criteo
    r'^.+\.dnsdelegation\.io\.$',  # Criteo
    r'^.+\.keyade\.com\.$',  # Keyade
    r'^.+\.omtrdc\.net\.$',  # Adobe Experience Cloud
    r'^.+\.bp01\.net\.$',  # NP6
    r'^.+\.ati-host\.net\.$',  # Xiti (AT Internet)
    r'^.+\.at-o\.net\.$',  # Xiti (AT Internet)
    r'^.+\.edgekey\.net\.$',  # EdgeKey (Akamai)
    r'^.+\.akamaiedge\.net\.$',  # AkamaiEdge (Akamai)
    r'^.+\.storetail\.io\.$',  # Storetail (Criteo)
]
@@ -1,284 +0,0 @@
#!/usr/bin/env python3

"""
From a list of subdomains, output only
the ones resolving to a first-party tracker.
"""

import argparse
import logging
import os
import queue
import sys
import threading
import typing
import csv

import coloredlogs
import dns.exception
import dns.resolver
import progressbar

DNS_TIMEOUT = 5.0
NUMBER_THREADS = 512
NUMBER_TRIES = 5

# TODO Not all the domains get treated,
# so it ends with 4-5 subdomains not resolved

glob = None


class Worker(threading.Thread):
    """
    Worker process for a DNS resolver.
    Will resolve DNS to match first-party subdomains.
    """

    def change_nameserver(self) -> None:
        """
        Assign this worker another nameserver from the queue.
        """
        server = None
        while server is None:
            try:
                server = self.orchestrator.nameservers_queue.get(block=False)
            except queue.Empty:
                self.orchestrator.refill_nameservers_queue()
        self.log.info("Using nameserver: %s", server)
        self.resolver.nameservers = [server]

    def __init__(self,
                 orchestrator: 'Orchestrator',
                 index: int = 0):
        super(Worker, self).__init__()
        self.log = logging.getLogger(f'worker{index:03d}')
        self.orchestrator = orchestrator

        self.resolver = dns.resolver.Resolver()
        self.change_nameserver()

    def resolve_subdomain(self, subdomain: str) -> typing.Optional[
            typing.List[
                str
            ]
    ]:
        """
        Returns the resolution chain of the subdomain to an A record,
        including any intermediary CNAME.
        The last element is an IP address.
        Returns None if the nameserver was unable to satisfy the request.
        Returns [] if the request points to nothing.
        """
        self.log.debug("Querying %s", subdomain)
        try:
            query = self.resolver.query(subdomain, 'A', lifetime=DNS_TIMEOUT)
        except dns.resolver.NXDOMAIN:
            return []
        except dns.resolver.NoAnswer:
            return []
        except dns.resolver.YXDOMAIN:
            self.log.warning("Query name too long for %s", subdomain)
            return None
        except dns.resolver.NoNameservers:
            # NOTE Most of the time this error message means that the domain
            # does not exist, but sometimes it means that the server
            # itself is broken. So we count on the retry logic.
            self.log.warning("All nameservers broken for %s", subdomain)
            return None
        except dns.exception.Timeout:
            # NOTE Same as above
            self.log.warning("Timeout for %s", subdomain)
            return None
        except dns.name.EmptyLabel:
            self.log.warning("Empty label for %s", subdomain)
            return None
        resolved = list()
        last = len(query.response.answer) - 1
        for a, answer in enumerate(query.response.answer):
            if answer.rdtype == dns.rdatatype.CNAME:
                assert a < last
                resolved.append(answer.items[0].to_text()[:-1])
            elif answer.rdtype == dns.rdatatype.A:
                assert a == last
                resolved.append(answer.items[0].address)
            else:
                assert False
        return resolved

    def run(self) -> None:
        self.log.info("Started")
        subdomain: str
        for subdomain in iter(self.orchestrator.subdomains_queue.get, None):

            for _ in range(NUMBER_TRIES):
                resolved = self.resolve_subdomain(subdomain)
                # Retry with another nameserver if error
                if resolved is None:
                    self.change_nameserver()
                else:
                    break

            # If it wasn't found after multiple tries
            if resolved is None:
                self.log.error("Gave up on %s", subdomain)
                resolved = []

            resolved.insert(0, subdomain)
            assert isinstance(resolved, list)
            self.orchestrator.results_queue.put(resolved)

        self.orchestrator.results_queue.put(None)
        self.log.info("Stopped")


class Orchestrator():
    """
    Orchestrator of the different Worker threads.
    """

    def refill_nameservers_queue(self) -> None:
        """
        Re-fill the given nameservers into the nameservers queue.
        Done every time the queue is empty, making it
        basically looping and infinite.
        """
        # Might be in a race condition but that's probably fine
        for nameserver in self.nameservers:
            self.nameservers_queue.put(nameserver)
        self.log.info("Refilled nameserver queue")

    def __init__(self, subdomains: typing.Iterable[str],
                 nameservers: typing.List[str] = None,
                 ):
        self.log = logging.getLogger('orchestrator')
        self.subdomains = subdomains

        # Use internal resolver by default
        self.nameservers = nameservers or dns.resolver.Resolver().nameservers

        self.subdomains_queue: queue.Queue = queue.Queue(
            maxsize=NUMBER_THREADS)
        self.results_queue: queue.Queue = queue.Queue()
        self.nameservers_queue: queue.Queue = queue.Queue()

        self.refill_nameservers_queue()

    def fill_subdomain_queue(self) -> None:
        """
        Read the subdomains in input and put them into the queue.
        Done in a thread so we can both:
        - yield the results as they come
        - not store all the subdomains at once
        """
        self.log.info("Started reading subdomains")
        # Send data to workers
        for subdomain in self.subdomains:
            self.subdomains_queue.put(subdomain)

        self.log.info("Finished reading subdomains")
        # Send sentinel to each worker
        # sentinel = None ~= EOF
        for _ in range(NUMBER_THREADS):
            self.subdomains_queue.put(None)

    def run(self) -> typing.Iterable[typing.List[str]]:
        """
        Yield the results.
        """
        # Create workers
        self.log.info("Creating workers")
        for i in range(NUMBER_THREADS):
            Worker(self, i).start()

        fill_thread = threading.Thread(target=self.fill_subdomain_queue)
        fill_thread.start()

        # Wait for one sentinel per worker
        # In the meantime output results
        for _ in range(NUMBER_THREADS):
            result: typing.List[str]
            for result in iter(self.results_queue.get, None):
                yield result

        self.log.info("Waiting for reader thread")
        fill_thread.join()

        self.log.info("Done!")


def main() -> None:
    """
    Main function when used directly.
    Read the subdomains provided and output them,
    with the last CNAME resolved and the IP address they resolve to.
    Takes as an input a filename (or nothing, for stdin),
    and as an output a filename (or nothing, for stdout).
    The input must be one subdomain per line; the output is a comma-separated
    file with the columns source, CNAME and A.
    Uses the file `nameservers` as the list of nameservers
    to use, or else the system defaults.
    Also shows a nice progress bar.
    """

    # Initialization
    coloredlogs.install(
        level='DEBUG',
        fmt='%(asctime)s %(name)s %(levelname)s %(message)s'
    )

    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Massively resolves subdomains and stores them in a file.")
    parser.add_argument(
        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
        help="Input file with one subdomain per line")
    parser.add_argument(
        '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
        help="Output file with DNS chains")
    # parser.add_argument(
    #     '-n', '--nameserver', type=argparse.FileType('r'),
    #     default='nameservers', help="File with one nameserver per line")
    # parser.add_argument(
    #     '-j', '--workers', type=int, default=512,
    #     help="Number of threads to use")
    args = parser.parse_args()

    # Progress bar
    widgets = [
        progressbar.Percentage(),
        ' ', progressbar.SimpleProgress(),
        ' ', progressbar.Bar(),
        ' ', progressbar.Timer(),
        ' ', progressbar.AdaptiveTransferSpeed(unit='req'),
        ' ', progressbar.AdaptiveETA(),
    ]
    progress = progressbar.ProgressBar(widgets=widgets)
    if args.input.seekable():
        progress.max_value = len(args.input.readlines())
        args.input.seek(0)

    # Cleaning input
    iterator = iter(args.input)
    iterator = map(str.strip, iterator)
    iterator = filter(None, iterator)

    # Reading nameservers
    servers: typing.List[str] = list()
    if os.path.isfile('nameservers'):
        servers = open('nameservers').readlines()
        servers = list(filter(None, map(str.strip, servers)))

    writer = csv.writer(args.output)

    progress.start()
    global glob
    glob = Orchestrator(iterator, servers)
    for resolved in glob.run():
        progress.update(progress.value + 1)
        writer.writerow(resolved)
    progress.finish()


if __name__ == '__main__':
    main()
@@ -4,11 +4,16 @@ function log() {
    echo -e "\033[33m$@\033[0m"
}

# Resolve the CNAME chain of all the known subdomains for later analysis
log "Compiling subdomain lists..."
pv subdomains/*.list | sort -u > temp/all_subdomains.list
# Sort by last character to utilize the DNS server caching mechanism
pv temp/all_subdomains.list | rev | sort | rev > temp/all_subdomains_reversort.list
./resolve_subdomains.py --input temp/all_subdomains_reversort.list --output temp/all_resolved.csv
sort -u temp/all_resolved.csv > temp/all_resolved_sorted.csv
log "Compiling nameservers…"
pv nameservers/*.list | ./validate_list.py --ip4 | sort -u > temp/all_nameservers_ip4.list

log "Compiling subdomains…"
# Sort by last character to utilize the DNS server caching mechanism
# (not as efficient with massdns but it's almost free so why not)
pv subdomains/*.list | ./validate_list.py --domain | rev | sort -u | rev > temp/all_subdomains.list

log "Resolving subdomains…"
massdns --output Snrql --retry REFUSED,SERVFAIL --resolvers temp/all_nameservers_ip4.list --outfile temp/all_resolved.txt temp/all_subdomains.list

log "Importing into database…"
pv temp/all_resolved.txt | ./feed_dns.py massdns
@@ -18,7 +18,14 @@ omtrdc.net
online-metrix.net
# Webtrekk
wt-eu02.net
webtrekk.net
# Otto Group
oghub.io
# ???
# Intent.com
partner.intentmedia.net
# Wizaly
wizaly.com
# Commanders Act
tagcommander.com
# Ingenious Technologies
affex.org
2
rules_asn/.gitignore
vendored
Normal file
@@ -0,0 +1,2 @@
*.custom.txt
*.cache.txt
10
rules_asn/first-party.txt
Normal file
@@ -0,0 +1,10 @@
# Eulerian
AS50234
# Criteo
AS44788
AS19750
AS55569
# ThreatMetrix
AS30286
# Webtrekk
AS60164
@@ -1,51 +0,0 @@
# Eulerian (AS50234 EULERIAN TECHNOLOGIES S.A.S.)
109.232.192.0/21
# Criteo (AS44788 Criteo SA)
91.199.242.0/24
91.212.98.0/24
178.250.0.0/21
178.250.0.0/24
178.250.1.0/24
178.250.2.0/24
178.250.3.0/24
178.250.4.0/24
178.250.6.0/24
185.235.84.0/24
# Criteo (AS19750 Criteo Corp.)
74.119.116.0/22
74.119.117.0/24
74.119.118.0/24
74.119.119.0/24
91.199.242.0/24
185.235.85.0/24
199.204.168.0/22
199.204.168.0/24
199.204.169.0/24
199.204.170.0/24
199.204.171.0/24
178.250.0.0/21
91.212.98.0/24
91.199.242.0/24
185.235.84.0/24
# Criteo (AS55569 Criteo APAC)
91.199.242.0/24
116.213.20.0/22
116.213.20.0/24
116.213.21.0/24
182.161.72.0/22
182.161.72.0/24
182.161.73.0/24
185.235.86.0/24
185.235.87.0/24
# ThreatMetrix (AS30286 ThreatMetrix Inc.)
69.84.176.0/24
173.254.179.0/24
185.32.240.0/23
185.32.242.0/23
192.225.156.0/22
199.101.156.0/23
199.101.158.0/23
# Webtrekk (AS60164 Webtrekk GmbH)
185.54.148.0/22
185.54.150.0/24
185.54.151.0/24
34
run_tests.py
Executable file
@@ -0,0 +1,34 @@
#!/usr/bin/env python3

import database
import os
import logging
import csv

TESTS_DIR = 'tests'

if __name__ == '__main__':

    DB = database.Database()
    log = logging.getLogger('tests')

    for filename in os.listdir(TESTS_DIR):
        log.info("")
        log.info("Running tests from %s", filename)
        path = os.path.join(TESTS_DIR, filename)
        with open(path, 'rt') as fdesc:
            reader = csv.DictReader(fdesc)
            for test in reader:
                log.info("Testing %s (%s)", test['url'], test['comment'])

                for white in test['white'].split(':'):
                    if not white:
                        continue
                    if any(DB.get_domain(white)):
                        log.error("False positive: %s", white)

                for black in test['black'].split(':'):
                    if not black:
                        continue
                    if not any(DB.get_domain(black)):
                        log.error("False negative: %s", black)
@@ -1,6 +1,5 @@
url,white,black,comment
https://support.apple.com,support.apple.com,,EdgeKey / AkamaiEdge
https://www.pinterest.fr/,i.pinimg.com,,Cedexis
https://www.pinterest.fr/,i.pinimg.com,,Cedexis
https://www.tumblr.com/,66.media.tumblr.com,,ChiCDN
https://www.skype.com/fr/,www.skype.com,,TrafficManager
@@ -5,3 +5,6 @@ https://www.discover.com/,,content.discover.com,ThreatMetrix
https://www.mytoys.de/,,web.mytoys.de,Webtrekk
https://www.baur.de/,,tp.baur.de,Otto Group
https://www.liligo.com/,,compare.liligo.com,???
https://www.boulanger.com/,,tag.boulanger.fr,TagCommander
https://www.airfrance.fr/FR/,,tk.airfrance.fr,Wizaly
https://www.vsgamers.es/,,marketing.net.vsgamers.es,Affex
35
validate_list.py
Executable file
@@ -0,0 +1,35 @@
#!/usr/bin/env python3
# pylint: disable=C0103

"""
Filter out invalid domain names
"""

import database
import argparse
import sys

if __name__ == '__main__':

    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Filter out invalid domain names/IP addresses from a list.")
    parser.add_argument(
        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
        help="Input file, one element per line")
    parser.add_argument(
        '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
        help="Output file, one element per line")
    parser.add_argument(
        '-d', '--domain', action='store_true',
        help="Can be a domain name")
    parser.add_argument(
        '-4', '--ip4', action='store_true',
        help="Can be an IPv4 address")
    args = parser.parse_args()

    for line in args.input:
        line = line.strip()
        if (args.domain and database.Database.validate_domain(line)) or \
                (args.ip4 and database.Database.validate_ip4address(line)):
            print(line, file=args.output)