Compare commits

1 commit

Author: Geoffrey Frogeye
SHA1: dcf39c9582
Date: 2019-12-16 10:38:37 +01:00
Message:
Put packing in parsing thread
Why did I think this would be a good idea?
- value don't need to be packed most of the time, but we don't know that early
- packed domain (it's one most of the time) is way larger than its unpacked counterpart
38 changed files with 983 additions and 1598 deletions


@ -1,5 +0,0 @@
CACHE_SIZE=536870912
MASSDNS_HASHMAP_SIZE=1000
PROFILE=0
SINGLE_PROCESS=0
MASSDNS_BINARY=massdns

5
.gitignore vendored

@ -1,5 +1,4 @@
*.log *.log
*.p *.p
.env nameservers
__pycache__ nameservers.head
explanations

21
LICENSE

@ -1,21 +0,0 @@
MIT License
Copyright (c) 2019 Geoffrey 'Frogeye' Preud'homme
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

194
README.md

@ -1,162 +1,92 @@
# eulaurarien # eulaurarien
This program is able to generate a list of every hostnames being a DNS redirection to a list of DNS zones and IP networks. Generates a host list of first-party trackers for ad-blocking.
It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://hostfiles.frogeye.fr) (learn about first-party trackers by following this link). The latest list is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
If you want to contribute but don't want to create an account on this forge, contact me the way you like: <https://geoffrey.frogeye.fr> **DISCLAIMER:** I'm by no way an expert on this subject so my vocabulary or other stuff might be wrong. Use at your own risk.
## How does this work ## What's a first-party tracker?
This program takes as input: Traditionally, websites load trackers scripts directly.
For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users.
In order to block those, one can simply block the host `trackercompany.com`.
- Lists of hostnames to match However, to circumvent this easy block, tracker companies made the website using them load trackers from `somethingirelevant.website1.com`.
- Lists of DNS zone to match (a domain and their subdomains) The latter being a DNS redirection to `website1.trackercompany.com`, directly pointing to a server serving the tracking script.
- Lists of IP address / IP networks to match Those are the first-party trackers.
- Lists of Autonomous System numbers to match
- An enormous quantity of DNS records
It will be able to output hostnames being a DNS redirection to any item in the lists provided. Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since:
DNS records can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns). 1. Most ad-blocker don't support wildcards
2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com`
Those subdomains can either be provided as is, come from [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening an URL (the program provides utility to do all that). So the only solution is to block every `somethingirelevant.website1.com`-like subdomains known, which is a lot.
That's where this scripts comes in, to generate a list of such subdomains.
## How does this script work
It takes an input a list of websites with trackers included.
So far, this list is manually-generated from the list of clients of such first-party trackers
(latter we should use a general list of websites to be more exhaustive).
It open each ones of those websites (just the homepage) in a web browser, and record the domains of the network requests the page makes.
Additionaly, or alternatively, you can feed the script some browsing history and get domains from there.
It then find the DNS redirections of those domains, and compare with regexes of known tracking domains.
It finally outputs the matching ones.
## Requirements
Just to build the list, you can find an already-built list in the releases.
- Bash
- [Python 3.4+](https://www.python.org/)
- [progressbar2](https://pypi.org/project/progressbar2/)
- dnspython
- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up)
(if you don't want to collect the subdomains, you can skip the following)
- Firefox
- Selenium
- seleniumwire
## Usage ## Usage
Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://hostfiles.frogeye.fr). This is only if you want to build the list yourself.
If you just want to use the list, the latest build is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
It was build using additional sources not included in this repository for privacy reasons.
The following is for the people wanting to build their own list. ### Add personal sources
### Requirements The list of websites provided in this script is by no mean exhaustive,
so adding your own browsing history will help create a better list.
Depending on the sources you'll be using to generate the list, you'll need to install some of the following:
- [Bash](https://www.gnu.org/software/bash/bash.html)
- [Coreutils](https://www.gnu.org/software/coreutils/)
- [Gawk](https://www.gnu.org/software/gawk/)
- [curl](https://curl.haxx.se)
- [pv](http://www.ivarch.com/programs/pv.shtml)
- [Python 3.4+](https://www.python.org/)
- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
- [numpy](https://www.numpy.org/)
- [python-abp](https://pypi.org/project/python-abp/) (only if you intend to use AdBlock rules as a rule source)
- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source)
- [markdown2](https://pypi.org/project/markdown2/) (only if you intend to generate the index webpage)
### Create a new database
The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASN, IPs, hostnames, zones…) and every entity leading to it.
It exists because the list cannot be generated in one pass, as the links of a DNS redirection chain do not have to be input in order.
You can purge old records from the database by running `./prune.sh`.
When you remove a source of data, remove its corresponding file in `last_updates` to fix the pruning process.
### Gather external sources
External sources are not stored in this repository.
You'll need to fetch them by running `./fetch_resources.sh`.
Those include:
- Third-party trackers lists
- TLD lists (used to test the validity of hostnames)
- List of public DNS resolvers (for DNS resolving from subdomains)
- Top 1M subdomains
### Import rules into the database
You need to put the lists of rules for matching in the different subfolders:
- `rules`: Lists of DNS zones
- `rules_ip`: Lists of IP networks (for IP addresses append `/32`)
- `rules_asn`: Lists of Autonomous Systems numbers (IP ranges will be deducted from them)
- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted)
- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists
See the provided examples for syntax.
In each folder:
- `first-party.ext` will be the only files considered for the first-party variant of the list
- `*.cache.ext` are from external sources, and thus might be deleted / overwritten
- `*.custom.ext` are for sources that you don't want committed
Then, run `./import_rules.sh`.
If you removed rules and you want to remove every record depending on those rules immediately,
run the following command:
```
./db.py --prune --prune-before "$(cat "last_updates/rules.txt")" --prune-base
```
### Add subdomains
If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive),
the top 1M subdomains provided might not be enough.
You can add them into the `subdomains` folder.
It follows the same specificities as the rules folder for `*.cache.ext` and `*.custom.ext` files.
#### Add personal sources
Adding your own browsing history will help create a more suited subdomains list.
Here's reference command for possible sources: Here's reference command for possible sources:
- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list` - **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp` - **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`
#### Collect subdomains from websites ### Collect subdomains from websites
You can add the websites URLs into the `websites` folder. Just run `collect_subdomain.sh`.
It follows the same specificities as the rules folder for `*.cache.ext` and `*.custom.ext` files.
Then, run `collect_subdomain.sh`.
This is a long step, and might be memory-intensive from time to time. This is a long step, and might be memory-intensive from time to time.
> **Note:** For first-party tracking, a list of subdomains issued from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list> This step is optional if you already added personal sources.
Alternatively, you can get just download the list of subdomains used to generate the official block list here: <https://hostfiles.frogeye.fr/from_websites.cache.list> (put it in the `subdomains` folder).
### Resolve DNS records ### Extract tracking domains
Once you've added subdomains, you'll need to resolve them to get their DNS records. Make sure your system is configured with a DNS server without limitation.
The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory. Then, run `filter_subdomain.sh`.
The files you need will be in the folder `dist`.
Then, run `./resolve_subdomains.sh`. ## Contributing
Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet count.
> **Note:** Some VPS providers might detect this as a DDoS attack and cut the network access. ### Adding websites
> Some Wi-Fi connections can be rendered unusable for other uses, some routers might cease to work.
> Since massdns does not support yet rate limiting, my best bet was a Raspberry Pi with a slow ethernet link (Raspberry Pi < 4).
The DNS records will automatically be imported into the database. Just add the URL to the relevant list: `websites/<source>.list`.
If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.
### Export the lists ### Adding first-party trackers regex
For the tracking list, use `./export_lists.sh`, the output will be in the `dist` folder (please change the links before distributing them). Just add them to `regexes.py`.
For other purposes, tinker with the `./export.py` program.
#### Explanations
Note that if you created an `explanations` folder at the root of the project, a file with a timestamp will be created in it.
It contains every rule in the database and the reason of their presence (i.e. their dependency).
This might be useful to track changes between runs.
Every rule has an associated tag with four components:
1. A number: the level of the rule (1 if it is a rule present in the `rules*` folders)
2. A letter: `F` if first-party, `M` if multi-party.
3. A letter: `D` if a duplicate (e.g. `foo.bar.com` if `*.bar.com` is already a rule), `_` if not.
4. A number: the number of rules relying on this one
### Generate the index webpage
This is the one served on <https://hostfiles.frogeye.fr>.
Just run `./generate_index.py`.
### Everything
Once you've made sure every step runs fine, you can use `./eulaurarien.sh` to run every step consecutively.
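
The README hunk above describes the core idea: a hostname is a first-party tracker when its DNS redirection chain leads into a known tracker zone. The following is a minimal sketch of that idea only, not the project's implementation. The `cnames` mapping, the function names and the `max_depth` guard are assumptions made for illustration; the hostnames are the README's own examples.

```python
from typing import Dict, Iterable, List, Set


def is_in_zone(hostname: str, zone: str) -> bool:
    """True if hostname is the zone itself or one of its subdomains."""
    return hostname == zone or hostname.endswith("." + zone)


def first_party_trackers(
    cnames: Dict[str, str], zones: Iterable[str], max_depth: int = 16
) -> List[str]:
    """Return hostnames whose redirection chain reaches a tracker zone."""
    zone_list = list(zones)
    matches = []
    for hostname in cnames:
        current = hostname
        seen: Set[str] = set()
        # Follow the chain link by link, guarding against loops.
        while current in cnames and current not in seen and len(seen) < max_depth:
            seen.add(current)
            current = cnames[current]
            if any(is_in_zone(current, zone) for zone in zone_list):
                matches.append(hostname)
                break
    return sorted(matches)


# Hypothetical DNS records (hostname -> redirection target).
records = {
    "somethingirelevant.website1.com": "website1.trackercompany.com",
    "www.website1.com": "cdn.example.net",
}
print(first_party_trackers(records, ["trackercompany.com"]))
```

Blocking the returned hostnames, rather than `trackercompany.com` itself, is what lets a hosts-based blocker catch the redirection.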


@ -16,36 +16,25 @@ import abp.filters
def get_domains(rule: abp.filters.parser.Filter) -> typing.Iterable[str]: def get_domains(rule: abp.filters.parser.Filter) -> typing.Iterable[str]:
if rule.options: if rule.options:
return return
selector_type = rule.selector["type"] selector_type = rule.selector['type']
selector_value = rule.selector["value"] selector_value = rule.selector['value']
if ( if selector_type == 'url-pattern' \
selector_type == "url-pattern" and selector_value.startswith('||') \
and selector_value.startswith("||") and selector_value.endswith('^'):
and selector_value.endswith("^")
):
yield selector_value[2:-1] yield selector_value[2:-1]
if __name__ == "__main__": if __name__ == '__main__':
# Parsing arguments # Parsing arguments
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Extract whole domains from an AdBlock blocking list" description="Extract whole domains from an AdBlock blocking list")
)
parser.add_argument( parser.add_argument(
"-i", '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
"--input", help="Input file with AdBlock rules")
type=argparse.FileType("r"),
default=sys.stdin,
help="Input file with AdBlock rules",
)
parser.add_argument( parser.add_argument(
"-o", '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
"--output", help="Outptut file with one rule tracking subdomain per line")
type=argparse.FileType("w"),
default=sys.stdout,
help="Outptut file with one rule tracking subdomain per line",
)
args = parser.parse_args() args = parser.parse_args()
# Reading rules # Reading rules
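
Both sides of the hunk above keep only rules of the form `||domain^` that carry no options. As a rough illustration, the same filter can be sketched on raw rule text without `python-abp`. This is a simplification under stated assumptions: the function name is hypothetical, options are detected by the presence of `$`, and the real script relies on the parsed `selector` instead.

```python
from typing import Iterable, Iterator


def domains_from_rules(lines: Iterable[str]) -> Iterator[str]:
    """Yield the domain of every plain `||domain^` blocking rule."""
    for line in lines:
        rule = line.strip()
        # Skip blanks, comments, exception rules and rules carrying options.
        if not rule or rule.startswith("!") or rule.startswith("@@") or "$" in rule:
            continue
        if rule.startswith("||") and rule.endswith("^"):
            # Drop the leading "||" and the trailing "^".
            yield rule[2:-1]


sample = [
    "! a comment",
    "||tracker.example^",
    "||ads.example^$third-party",
    "@@||allowed.example^",
    "/banner/*",
]
print(list(domains_from_rules(sample)))
```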


@ -14,28 +14,6 @@ import time
import progressbar import progressbar
import selenium.webdriver.firefox.options import selenium.webdriver.firefox.options
import seleniumwire.webdriver import seleniumwire.webdriver
import logging
log = logging.getLogger("cs")
DRIVER = None
SCROLL_TIME = 10.0
SCROLL_STEPS = 100
SCROLL_CMD = f"window.scrollBy(0,document.body.scrollHeight/{SCROLL_STEPS})"
def new_driver() -> seleniumwire.webdriver.browser.Firefox:
profile = selenium.webdriver.FirefoxProfile()
profile.set_preference("privacy.trackingprotection.enabled", False)
profile.set_preference("network.cookie.cookieBehavior", 0)
profile.set_preference("privacy.trackingprotection.pbmode.enabled", False)
profile.set_preference("privacy.trackingprotection.cryptomining.enabled", False)
profile.set_preference("privacy.trackingprotection.fingerprinting.enabled", False)
options = selenium.webdriver.firefox.options.Options()
# options.add_argument('-headless')
driver = seleniumwire.webdriver.Firefox(
profile, executable_path="geckodriver", options=options
)
return driver
def subdomain_from_url(url: str) -> str: def subdomain_from_url(url: str) -> str:
@ -51,36 +29,34 @@ def collect_subdomains(url: str) -> typing.Iterable[str]:
Load an URL into an headless browser and return all the domains Load an URL into an headless browser and return all the domains
it tried to access. it tried to access.
""" """
global DRIVER options = selenium.webdriver.firefox.options.Options()
if not DRIVER: options.add_argument('-headless')
DRIVER = new_driver() driver = seleniumwire.webdriver.Firefox(
executable_path='geckodriver', options=options)
try: driver.get(url)
DRIVER.get(url) time.sleep(10)
for s in range(SCROLL_STEPS): for request in driver.requests:
DRIVER.execute_script(SCROLL_CMD)
time.sleep(SCROLL_TIME / SCROLL_STEPS)
for request in DRIVER.requests:
if request.response: if request.response:
yield subdomain_from_url(request.path) yield subdomain_from_url(request.path)
except Exception: driver.close()
log.exception("Error")
DRIVER.quit()
DRIVER = None
def collect_subdomains_standalone(url: str) -> None: def collect_subdomains_standalone(url: str) -> None:
url = url.strip() url = url.strip()
if not url: if not url:
return return
try:
for subdomain in collect_subdomains(url): for subdomain in collect_subdomains(url):
print(subdomain) print(subdomain)
except:
pass
if __name__ == "__main__": if __name__ == '__main__':
assert len(sys.argv) <= 2 assert len(sys.argv) <= 2
filename = None filename = None
if len(sys.argv) == 2 and sys.argv[1] != "-": if len(sys.argv) == 2 and sys.argv[1] != '-':
filename = sys.argv[1] filename = sys.argv[1]
num_lines = sum(1 for line in open(filename)) num_lines = sum(1 for line in open(filename))
iterator = progressbar.progressbar(open(filename), max_value=num_lines) iterator = progressbar.progressbar(open(filename), max_value=num_lines)
@ -90,8 +66,5 @@ if __name__ == "__main__":
for line in iterator: for line in iterator:
collect_subdomains_standalone(line) collect_subdomains_standalone(line)
if DRIVER:
DRIVER.quit()
if filename: if filename:
iterator.close() iterator.close()
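
The body of `subdomain_from_url` is not part of this diff. A plausible sketch, assuming it reduces a request URL to its hostname, can be written with the standard library alone. The function names below are hypothetical, and the de-duplication step is an addition for illustration that the script itself does not perform.

```python
import urllib.parse
from typing import Iterable, Iterator


def hostname_of(url: str) -> str:
    """Return the lowercase hostname of a URL, or '' if it has none."""
    return urllib.parse.urlparse(url).hostname or ""


def unique_hostnames(urls: Iterable[str]) -> Iterator[str]:
    """Yield each hostname once, in order of first appearance."""
    seen = set()
    for url in urls:
        host = hostname_of(url)
        if host and host not in seen:
            seen.add(host)
            yield host


requests = [
    "https://www.website1.com/index.html",
    "https://somethingirelevant.website1.com/t.js",
    "https://www.website1.com/style.css",
]
print(list(unique_hostnames(requests)))
```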


@ -9,36 +9,25 @@ import time
import logging import logging
import coloredlogs import coloredlogs
import pickle import pickle
import numpy
import math
import os
TLD_LIST: typing.Set[str] = set() coloredlogs.install(
level='DEBUG',
coloredlogs.install(level="DEBUG", fmt="%(asctime)s %(name)s %(levelname)s %(message)s") fmt='%(asctime)s %(name)s %(levelname)s %(message)s'
)
Asn = int Asn = int
Timestamp = int Timestamp = int
Level = int Level = int
class Path: class Path():
# FP add boolean here
pass pass
class RulePath(Path): class RulePath(Path):
def __str__(self) -> str: def __str__(self) -> str:
return "(rule)" return '(rules)'
class RuleFirstPath(RulePath):
def __str__(self) -> str:
return "(first-party rule)"
class RuleMultiPath(RulePath):
def __str__(self) -> str:
return "(multi-party rule)"
class DomainPath(Path): class DomainPath(Path):
@ -46,7 +35,7 @@ class DomainPath(Path):
self.parts = parts self.parts = parts
def __str__(self) -> str: def __str__(self) -> str:
return "?." + Database.unpack_domain(self) return '?.' + Database.unpack_domain(self)
class HostnamePath(DomainPath): class HostnamePath(DomainPath):
@ -56,7 +45,7 @@ class HostnamePath(DomainPath):
class ZonePath(DomainPath): class ZonePath(DomainPath):
def __str__(self) -> str: def __str__(self) -> str:
return "*." + Database.unpack_domain(self) return '*.' + Database.unpack_domain(self)
class AsnPath(Path): class AsnPath(Path):
@ -76,33 +65,33 @@ class Ip4Path(Path):
return Database.unpack_ip4network(self) return Database.unpack_ip4network(self)
class Match: class Match():
def __init__(self) -> None: def __init__(self) -> None:
self.source: typing.Optional[Path] = None
self.updated: int = 0 self.updated: int = 0
self.dupplicate: bool = False
# Cache
self.level: int = 0 self.level: int = 0
self.first_party: bool = False self.source: typing.Optional[Path] = None
self.references: int = 0 # FP dupplicate args
def active(self, first_party: bool = None) -> bool: def set(self,
if self.updated == 0 or (first_party and not self.first_party): updated: int,
return False level: int,
return True source: Path,
) -> None:
if updated > self.updated or level > self.level:
self.updated = updated
self.level = level
self.source = source
# FP dupplicate function
def disable(self) -> None: def active(self) -> bool:
self.updated = 0 return self.updated > 0
class AsnNode(Match): class AsnNode(Match):
def __init__(self) -> None: pass
Match.__init__(self)
self.name = ""
class DomainTreeNode: class DomainTreeNode():
def __init__(self) -> None: def __init__(self) -> None:
self.children: typing.Dict[str, DomainTreeNode] = dict() self.children: typing.Dict[str, DomainTreeNode] = dict()
self.match_zone = Match() self.match_zone = Match()
@ -117,28 +106,21 @@ class IpTreeNode(Match):
Node = typing.Union[DomainTreeNode, IpTreeNode, AsnNode] Node = typing.Union[DomainTreeNode, IpTreeNode, AsnNode]
MatchCallable = typing.Callable[[Path, Match], typing.Any] MatchCallable = typing.Callable[[Path,
Match,
typing.Optional[typing.Any]],
typing.Any]
class Profiler: class Profiler():
def __init__(self) -> None: def __init__(self) -> None:
do_profile = int(os.environ.get("PROFILE", "0")) self.log = logging.getLogger('profiler')
if do_profile:
self.log = logging.getLogger("profiler")
self.time_last = time.perf_counter() self.time_last = time.perf_counter()
self.time_step = "init" self.time_step = 'init'
self.time_dict: typing.Dict[str, float] = dict() self.time_dict: typing.Dict[str, float] = dict()
self.step_dict: typing.Dict[str, int] = dict() self.step_dict: typing.Dict[str, int] = dict()
self.enter_step = self.enter_step_real
self.profile = self.profile_real
else:
self.enter_step = self.enter_step_dummy
self.profile = self.profile_dummy
def enter_step_dummy(self, name: str) -> None: def enter_step(self, name: str) -> None:
return
def enter_step_real(self, name: str) -> None:
now = time.perf_counter() now = time.perf_counter()
try: try:
self.time_dict[self.time_step] += now - self.time_last self.time_dict[self.time_step] += now - self.time_last
@ -149,174 +131,86 @@ class Profiler:
self.time_step = name self.time_step = name
self.time_last = time.perf_counter() self.time_last = time.perf_counter()
def profile_dummy(self) -> None: def profile(self) -> None:
return self.enter_step('profile')
def profile_real(self) -> None:
self.enter_step("profile")
total = sum(self.time_dict.values()) total = sum(self.time_dict.values())
for key, secs in sorted(self.time_dict.items(), key=lambda t: t[1]): for key, secs in sorted(self.time_dict.items(), key=lambda t: t[1]):
times = self.step_dict[key] times = self.step_dict[key]
self.log.debug( self.log.debug(f"{key:<20}: {times:9d} × {secs/times:5.3e} "
f"{key:<20}: {times:9d} × {secs/times:5.3e} " f"= {secs:9.2f} s ({secs/total:7.2%}) ")
f"= {secs:9.2f} s ({secs/total:7.2%}) " self.log.debug(f"{'total':<20}: "
) f"{total:9.2f} s ({1:7.2%})")
self.log.debug(
f"{'total':<20}: " f"{total:9.2f} s ({1:7.2%})"
)
class Database(Profiler): class Database(Profiler):
VERSION = 18 VERSION = 13
PATH = "blocking.p" PATH = "blocking.p"
def initialize(self) -> None: def initialize(self) -> None:
self.log.warning("Creating database version: %d ", Database.VERSION) self.log.warning(
# Dummy match objects that everything refer to "Creating database version: %d ",
self.rules: typing.List[Match] = list() Database.VERSION)
for first_party in (False, True):
m = Match()
m.updated = 1
m.level = 0
m.first_party = first_party
self.rules.append(m)
self.domtree = DomainTreeNode() self.domtree = DomainTreeNode()
self.asns: typing.Dict[Asn, AsnNode] = dict() self.asns: typing.Dict[Asn, AsnNode] = dict()
self.ip4tree = IpTreeNode() self.ip4tree = IpTreeNode()
def load(self) -> None: def load(self) -> None:
self.enter_step("load") self.enter_step('load')
try: try:
with open(self.PATH, "rb") as db_fdsec: with open(self.PATH, 'rb') as db_fdsec:
version, data = pickle.load(db_fdsec) version, data = pickle.load(db_fdsec)
if version == Database.VERSION: if version == Database.VERSION:
self.rules, self.domtree, self.asns, self.ip4tree = data self.domtree, self.asns, self.ip4tree = data
return return
self.log.warning( self.log.warning(
"Outdated database version found: %d, " "it will be rebuilt.", "Outdated database version found: %d, "
version, "it will be rebuilt.",
) version)
except (TypeError, AttributeError, EOFError): except (TypeError, AttributeError, EOFError):
self.log.error( self.log.error(
"Corrupt (or heavily outdated) database found, " "it will be rebuilt." "Corrupt (or heavily outdated) database found, "
) "it will be rebuilt.")
except FileNotFoundError: except FileNotFoundError:
pass pass
self.initialize() self.initialize()
def save(self) -> None: def save(self) -> None:
self.enter_step("save") self.enter_step('save')
with open(self.PATH, "wb") as db_fdsec: with open(self.PATH, 'wb') as db_fdsec:
data = self.rules, self.domtree, self.asns, self.ip4tree data = self.domtree, self.asns, self.ip4tree
pickle.dump((self.VERSION, data), db_fdsec) pickle.dump((self.VERSION, data), db_fdsec)
self.profile() self.profile()
def __init__(self) -> None: def __init__(self) -> None:
Profiler.__init__(self) Profiler.__init__(self)
self.log = logging.getLogger("db") self.log = logging.getLogger('db')
self.load() self.load()
self.ip4cache_shift: int = 32
self.ip4cache = numpy.ones(1)
def _set_ip4cache(self, path: Path, _: Match) -> None:
assert isinstance(path, Ip4Path)
self.enter_step("set_ip4cache")
mini = path.value >> self.ip4cache_shift
maxi = (path.value + 2 ** (32 - path.prefixlen)) >> self.ip4cache_shift
if mini == maxi:
self.ip4cache[mini] = True
else:
self.ip4cache[mini:maxi] = True
def fill_ip4cache(self, max_size: int = 512 * 1024 ** 2) -> None:
"""
Size in bytes
"""
if max_size > 2 ** 32 / 8:
self.log.warning(
"Allocating more than 512 MiB of RAM for "
"the Ip4 cache is not necessary."
)
max_cache_width = int(math.log2(max(1, max_size * 8)))
allocated = False
cache_width = min(32, max_cache_width)
while not allocated:
cache_size = 2 ** cache_width
try:
self.ip4cache = numpy.zeros(cache_size, dtype=bool)
except MemoryError:
self.log.exception("Could not allocate cache. Retrying a smaller one.")
cache_width -= 1
continue
allocated = True
self.ip4cache_shift = 32 - cache_width
for _ in self.exec_each_ip4(self._set_ip4cache):
pass
@staticmethod
def populate_tld_list() -> None:
with open("temp/all_tld.list", "r") as tld_fdesc:
for tld in tld_fdesc:
tld = tld.strip()
TLD_LIST.add(tld)
@staticmethod
def validate_domain(path: str) -> bool:
if len(path) > 255:
return False
splits = path.split(".")
if not TLD_LIST:
Database.populate_tld_list()
if splits[-1] not in TLD_LIST:
return False
for split in splits:
if not 1 <= len(split) <= 63:
return False
return True
@staticmethod @staticmethod
def pack_domain(domain: str) -> DomainPath: def pack_domain(domain: str) -> DomainPath:
return DomainPath(domain.split(".")[::-1]) return DomainPath(domain.split('.')[::-1])
@staticmethod @staticmethod
def unpack_domain(domain: DomainPath) -> str: def unpack_domain(domain: DomainPath) -> str:
return ".".join(domain.parts[::-1]) return '.'.join(domain.parts[::-1])
@staticmethod @staticmethod
def pack_asn(asn: str) -> AsnPath: def pack_asn(asn: str) -> AsnPath:
asn = asn.upper() asn = asn.upper()
if asn.startswith("AS"): if asn.startswith('AS'):
asn = asn[2:] asn = asn[2:]
return AsnPath(int(asn)) return AsnPath(int(asn))
@staticmethod @staticmethod
def unpack_asn(asn: AsnPath) -> str: def unpack_asn(asn: AsnPath) -> str:
return f"AS{asn.asn}" return f'AS{asn.asn}'
@staticmethod
def validate_ip4address(path: str) -> bool:
splits = path.split(".")
if len(splits) != 4:
return False
for split in splits:
try:
if not 0 <= int(split) <= 255:
return False
except ValueError:
return False
return True
@staticmethod
def pack_ip4address_low(address: str) -> int:
addr = 0
for split in address.split("."):
octet = int(split)
addr = (addr << 8) + octet
return addr
@staticmethod @staticmethod
def pack_ip4address(address: str) -> Ip4Path: def pack_ip4address(address: str) -> Ip4Path:
return Ip4Path(Database.pack_ip4address_low(address), 32) addr = 0
for split in address.split('.'):
addr = (addr << 8) + int(split)
return Ip4Path(addr, 32)
@staticmethod @staticmethod
def unpack_ip4address(address: Ip4Path) -> str: def unpack_ip4address(address: Ip4Path) -> str:
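
The packing helpers in the hunk above store a domain as its labels in reverse order and an IPv4 address as a 32-bit integer. The round trip can be checked with a standalone sketch. It uses plain lists and integers instead of the `DomainPath` and `Ip4Path` classes, so the function names here are illustrative only.

```python
from typing import List


def pack_domain(domain: str) -> List[str]:
    """Split a domain into labels, most significant (TLD) first."""
    return domain.split(".")[::-1]


def unpack_domain(parts: List[str]) -> str:
    """Inverse of pack_domain."""
    return ".".join(parts[::-1])


def pack_ip4(address: str) -> int:
    """Fold the four octets of a dotted address into one integer."""
    value = 0
    for octet in address.split("."):
        value = (value << 8) + int(octet)
    return value


def unpack_ip4(value: int) -> str:
    """Inverse of pack_ip4: peel off octets from the low end."""
    octets = [0] * 4
    for i in reversed(range(4)):
        octets[i] = value & 0xFF
        value >>= 8
    return ".".join(map(str, octets))


print(pack_domain("foo.bar.com"))
print(pack_ip4("192.168.0.1"))
```

Reversing the labels is what allows a zone such as `*.bar.com` and a hostname such as `foo.bar.com` to share a prefix in the domain tree.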
@ -327,26 +221,11 @@ class Database(Profiler):
for o in reversed(range(4)): for o in reversed(range(4)):
octets[o] = addr & 0xFF octets[o] = addr & 0xFF
addr >>= 8 addr >>= 8
return ".".join(map(str, octets)) return '.'.join(map(str, octets))
@staticmethod
def validate_ip4network(path: str) -> bool:
# A bit generous but ok for our usage
splits = path.split("/")
if len(splits) != 2:
return False
if not Database.validate_ip4address(splits[0]):
return False
try:
if not 0 <= int(splits[1]) <= 32:
return False
except ValueError:
return False
return True
@staticmethod @staticmethod
def pack_ip4network(network: str) -> Ip4Path: def pack_ip4network(network: str) -> Ip4Path:
address, prefixlen_str = network.split("/") address, prefixlen_str = network.split('/')
prefixlen = int(prefixlen_str) prefixlen = int(prefixlen_str)
addr = Database.pack_ip4address(address) addr = Database.pack_ip4address(address)
addr.prefixlen = prefixlen addr.prefixlen = prefixlen
@ -360,13 +239,11 @@ class Database(Profiler):
for o in reversed(range(4)): for o in reversed(range(4)):
octets[o] = addr & 0xFF octets[o] = addr & 0xFF
addr >>= 8 addr >>= 8
return ".".join(map(str, octets)) + "/" + str(network.prefixlen) return '.'.join(map(str, octets)) + '/' + str(network.prefixlen)
def get_match(self, path: Path) -> Match: def get_match(self, path: Path) -> Match:
if isinstance(path, RuleMultiPath): if isinstance(path, RulePath):
return self.rules[0] return Match()
elif isinstance(path, RuleFirstPath):
return self.rules[1]
elif isinstance(path, AsnPath): elif isinstance(path, AsnPath):
return self.asns[path.asn] return self.asns[path.asn]
elif isinstance(path, DomainPath): elif isinstance(path, DomainPath):
@ -381,7 +258,7 @@ class Database(Profiler):
raise ValueError raise ValueError
elif isinstance(path, Ip4Path): elif isinstance(path, Ip4Path):
dici = self.ip4tree dici = self.ip4tree
for i in range(31, 31 - path.prefixlen, -1): for i in range(31, 31-path.prefixlen, -1):
bit = (path.value >> i) & 0b1 bit = (path.value >> i) & 0b1
dici_next = dici.one if bit else dici.zero dici_next = dici.one if bit else dici.zero
if not dici_next: if not dici_next:
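
The `get_match` branch above walks the IPv4 tree one bit at a time, from the most significant bit down to the prefix length, choosing the `one` or `zero` child at each step. A self-contained sketch of such a binary prefix tree follows. `TrieNode`, `insert` and `covered` are hypothetical names, and a boolean `active` flag stands in for the `Match` state kept by `IpTreeNode`.

```python
from typing import Optional


class TrieNode:
    def __init__(self) -> None:
        self.zero: Optional["TrieNode"] = None
        self.one: Optional["TrieNode"] = None
        self.active = False


def insert(root: TrieNode, value: int, prefixlen: int) -> None:
    """Mark the network value/prefixlen as active."""
    node = root
    # Same bit order as in the diff: bit 31 first, prefixlen bits in total.
    for i in range(31, 31 - prefixlen, -1):
        if (value >> i) & 0b1:
            if node.one is None:
                node.one = TrieNode()
            node = node.one
        else:
            if node.zero is None:
                node.zero = TrieNode()
            node = node.zero
    node.active = True


def covered(root: TrieNode, address: int) -> bool:
    """True if any active network on the path contains the address."""
    node: Optional[TrieNode] = root
    for i in range(31, -1, -1):
        if node is None:
            return False
        if node.active:
            return True
        node = node.one if (address >> i) & 0b1 else node.zero
    return node is not None and node.active


root = TrieNode()
insert(root, 10 << 24, 8)  # 10.0.0.0/8
print(covered(root, (10 << 24) + 5), covered(root, 11 << 24))
```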
@ -391,374 +268,265 @@ class Database(Profiler):
else: else:
raise ValueError raise ValueError
def exec_each_asn( def exec_each_domain(self,
self,
callback: MatchCallable,
) -> typing.Any:
for asn in self.asns:
match = self.asns[asn]
if match.active():
c = callback(
AsnPath(asn),
match,
)
try:
yield from c
except TypeError: # not iterable
pass
def exec_each_domain(
self,
callback: MatchCallable, callback: MatchCallable,
arg: typing.Any = None,
_dic: DomainTreeNode = None, _dic: DomainTreeNode = None,
_par: DomainPath = None, _par: DomainPath = None,
) -> typing.Any: ) -> typing.Any:
_dic = _dic or self.domtree _dic = _dic or self.domtree
_par = _par or DomainPath([]) _par = _par or DomainPath([])
if _dic.match_hostname.active(): if _dic.match_hostname.active():
c = callback( yield from callback(
HostnamePath(_par.parts), HostnamePath(_par.parts),
_dic.match_hostname, _dic.match_hostname,
arg
) )
try:
yield from c
except TypeError: # not iterable
pass
if _dic.match_zone.active(): if _dic.match_zone.active():
c = callback( yield from callback(
ZonePath(_par.parts), ZonePath(_par.parts),
_dic.match_zone, _dic.match_zone,
arg
) )
try:
yield from c
except TypeError: # not iterable
pass
for part in _dic.children: for part in _dic.children:
dic = _dic.children[part] dic = _dic.children[part]
yield from self.exec_each_domain( yield from self.exec_each_domain(
callback, _dic=dic, _par=DomainPath(_par.parts + [part]) callback,
arg,
_dic=dic,
_par=DomainPath(_par.parts + [part])
) )
def exec_each_ip4( def exec_each_ip4(self,
self,
callback: MatchCallable, callback: MatchCallable,
arg: typing.Any = None,
_dic: IpTreeNode = None, _dic: IpTreeNode = None,
_par: Ip4Path = None, _par: Ip4Path = None,
) -> typing.Any: ) -> typing.Any:
_dic = _dic or self.ip4tree _dic = _dic or self.ip4tree
_par = _par or Ip4Path(0, 0) _par = _par or Ip4Path(0, 0)
if _dic.active(): if _dic.active():
c = callback( yield from callback(
_par, _par,
_dic, _dic,
arg
) )
try:
yield from c
except TypeError: # not iterable
pass
# 0 # 0
pref = _par.prefixlen + 1 pref = _par.prefixlen + 1
dic = _dic.zero dic = _dic.zero
if dic: if dic:
# addr0 = _par.value & (0xFFFFFFFF ^ (1 << (32-pref))) addr0 = _par.value & (0xFFFFFFFF ^ (1 << (32-pref)))
# assert addr0 == _par.value assert addr0 == _par.value
addr0 = _par.value yield from self.exec_each_ip4(
yield from self.exec_each_ip4(callback, _dic=dic, _par=Ip4Path(addr0, pref)) callback,
arg,
_dic=dic,
_par=Ip4Path(addr0, pref)
)
# 1 # 1
dic = _dic.one dic = _dic.one
if dic: if dic:
addr1 = _par.value | (1 << (32 - pref)) addr1 = _par.value | (1 << (32-pref))
# assert addr1 != _par.value yield from self.exec_each_ip4(
yield from self.exec_each_ip4(callback, _dic=dic, _par=Ip4Path(addr1, pref)) callback,
arg,
_dic=dic,
_par=Ip4Path(addr1, pref)
)
def exec_each( def exec_each(self,
self,
callback: MatchCallable, callback: MatchCallable,
arg: typing.Any = None,
) -> typing.Any: ) -> typing.Any:
yield from self.exec_each_domain(callback) yield from self.exec_each_domain(callback)
yield from self.exec_each_ip4(callback) yield from self.exec_each_ip4(callback)
yield from self.exec_each_asn(callback) # TODO ASN
def update_references(self) -> None: def update_references(self) -> None:
# Should be correctly calculated normally, raise NotImplementedError
# keeping this just in case
def reset_references_cb(path: Path, match: Match) -> None:
match.references = 0
for _ in self.exec_each(reset_references_cb):
pass
def increment_references_cb(path: Path, match: Match) -> None:
if match.source:
source = self.get_match(match.source)
source.references += 1
for _ in self.exec_each(increment_references_cb):
pass
def _clean_deps(self) -> None:
# Disable the matches that depends on the targeted
# matches until all disabled matches reference count = 0
did_something = True
def clean_deps_cb(path: Path, match: Match) -> None:
nonlocal did_something
if not match.source:
return
source = self.get_match(match.source)
if not source.active():
self._unset_match(match)
elif match.first_party > source.first_party:
match.first_party = source.first_party
else:
return
did_something = True
while did_something:
did_something = False
self.enter_step("pass_clean_deps")
for _ in self.exec_each(clean_deps_cb):
pass
def prune(self, before: int, base_only: bool = False) -> None: def prune(self, before: int, base_only: bool = False) -> None:
# Disable the matches targeted raise NotImplementedError
def prune_cb(path: Path, match: Match) -> None:
if base_only and match.level > 1:
return
if match.updated > before:
return
self._unset_match(match)
self.log.debug("Prune: disabled %s", path)
self.enter_step("pass_prune")
for _ in self.exec_each(prune_cb):
pass
self._clean_deps()
# Remove branches with no match
# TODO
def explain(self, path: Path) -> str: def explain(self, path: Path) -> str:
match = self.get_match(path)
string = str(path) string = str(path)
if isinstance(match, AsnNode): match = self.get_match(path)
string += f" ({match.name})"
party_char = "F" if match.first_party else "M"
dup_char = "D" if match.dupplicate else "_"
string += f" {match.level}{party_char}{dup_char}{match.references}"
if match.source: if match.source:
string += f"{self.explain(match.source)}" string += f'{self.explain(match.source)}'
return string return string
def list_records( def export(self,
self,
first_party_only: bool = False, first_party_only: bool = False,
end_chain_only: bool = False, end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
explain: bool = False, explain: bool = False,
) -> typing.Iterable[str]: ) -> typing.Iterable[str]:
def export_cb(path: Path, match: Match) -> typing.Iterable[str]: if first_party_only or end_chain_only:
if first_party_only and not match.first_party: raise NotImplementedError
return
if end_chain_only and match.references > 0:
return
if no_dupplicates and match.dupplicate:
return
if rules_only and match.level > 1:
return
if hostnames_only and not isinstance(path, HostnamePath):
return
def export_cb(path: Path, match: Match, _: typing.Any
) -> typing.Iterable[str]:
assert isinstance(path, DomainPath)
if isinstance(path, HostnamePath):
if explain: if explain:
yield self.explain(path) yield self.explain(path)
else: else:
yield str(path) yield self.unpack_domain(path)
yield from self.exec_each(export_cb) yield from self.exec_each_domain(export_cb, None)
def count_records( def list_rules(self,
self, first_party_only: bool = False,
) -> typing.Iterable[str]:
if first_party_only:
raise NotImplementedError
def list_rules_cb(path: Path, match: Match, _: typing.Any
) -> typing.Iterable[str]:
if isinstance(path, ZonePath) \
or (isinstance(path, Ip4Path) and path.prefixlen < 32):
# if match.level == 0:
yield self.explain(path)
yield from self.exec_each(list_rules_cb, None)
def count_rules(self,
first_party_only: bool = False, first_party_only: bool = False,
end_chain_only: bool = False,
no_dupplicates: bool = False,
rules_only: bool = False,
hostnames_only: bool = False,
) -> str: ) -> str:
memo: typing.Dict[str, int] = dict() raise NotImplementedError
def count_records_cb(path: Path, match: Match) -> None: def get_domain(self, domain: DomainPath) -> typing.Iterable[DomainPath]:
if first_party_only and not match.first_party: self.enter_step('get_domain_brws')
return
if end_chain_only and match.references > 0:
return
if no_dupplicates and match.dupplicate:
return
if rules_only and match.level > 1:
return
if hostnames_only and not isinstance(path, HostnamePath):
return
try:
memo[path.__class__.__name__] += 1
except KeyError:
memo[path.__class__.__name__] = 1
for _ in self.exec_each(count_records_cb):
pass
split: typing.List[str] = list()
for key, value in sorted(memo.items(), key=lambda s: s[0]):
split.append(f"{key[:-4].lower()}s: {value}")
return ", ".join(split)
def get_domain(self, domain_str: str) -> typing.Iterable[DomainPath]:
self.enter_step("get_domain_pack")
domain = self.pack_domain(domain_str)
self.enter_step("get_domain_brws")
dic = self.domtree dic = self.domtree
depth = 0 depth = 0
for part in domain.parts: for part in domain.parts:
if dic.match_zone.active(): if dic.match_zone.active():
self.enter_step("get_domain_yield") self.enter_step('get_domain_yield')
yield ZonePath(domain.parts[:depth]) yield ZonePath(domain.parts[:depth])
self.enter_step("get_domain_brws") self.enter_step('get_domain_brws')
if part not in dic.children: if part not in dic.children:
return return
dic = dic.children[part] dic = dic.children[part]
depth += 1 depth += 1
if dic.match_zone.active(): if dic.match_zone.active():
self.enter_step("get_domain_yield") self.enter_step('get_domain_yield')
yield ZonePath(domain.parts) yield ZonePath(domain.parts)
if dic.match_hostname.active(): if dic.match_hostname.active():
self.enter_step("get_domain_yield") self.enter_step('get_domain_yield')
yield HostnamePath(domain.parts) yield HostnamePath(domain.parts)
def get_ip4(self, ip4_str: str) -> typing.Iterable[Path]: def get_ip4(self, ip4: Ip4Path) -> typing.Iterable[Path]:
self.enter_step("get_ip4_pack") self.enter_step('get_ip4_brws')
ip4val = self.pack_ip4address_low(ip4_str)
self.enter_step("get_ip4_cache")
if not self.ip4cache[ip4val >> self.ip4cache_shift]:
return
self.enter_step("get_ip4_brws")
dic = self.ip4tree dic = self.ip4tree
for i in range(31, -1, -1): for i in range(31, 31-ip4.prefixlen, -1):
bit = (ip4val >> i) & 0b1 bit = (ip4.value >> i) & 0b1
if dic.active(): if dic.active():
self.enter_step("get_ip4_yield") self.enter_step('get_ip4_yield')
yield Ip4Path(ip4val >> (i + 1) << (i + 1), 31 - i) a = Ip4Path(ip4.value >> (i+1) << (i+1), 31-i)
self.enter_step("get_ip4_brws") yield a
self.enter_step('get_ip4_brws')
next_dic = dic.one if bit else dic.zero next_dic = dic.one if bit else dic.zero
if next_dic is None: if next_dic is None:
return return
dic = next_dic dic = next_dic
if dic.active(): if dic.active():
self.enter_step("get_ip4_yield") self.enter_step('get_ip4_yield')
yield Ip4Path(ip4val, 32) yield ip4
def _unset_match( def list_asn(self) -> typing.Iterable[AsnPath]:
self, for asn in self.asns:
match: Match, yield AsnPath(asn)
) -> None:
match.disable()
if match.source:
source_match = self.get_match(match.source)
source_match.references -= 1
def _set_match( def _set_domain(self,
self, hostname: bool,
match: Match, domain: DomainPath,
updated: int, updated: int,
source: Path, is_first_party: bool = None,
source_match: Match = None, source: Path = None) -> None:
dupplicate: bool = False, if is_first_party:
) -> None: raise NotImplementedError
# source_match is in parameters because most of the time self.enter_step('set_domain_src')
# its parent function needs it too, if source is None:
# so it can pass it to save a traversal level = 0
source_match = source_match or self.get_match(source) source = RulePath()
new_level = source_match.level + 1 else:
if ( match = self.get_match(source)
updated > match.updated level = match.level + 1
or new_level < match.level self.enter_step('set_domain_brws')
or source_match.first_party > match.first_party
):
# NOTE FP and level of matches referencing this one
# won't be updated until run or prune
if match.source:
old_source = self.get_match(match.source)
old_source.references -= 1
match.updated = updated
match.level = new_level
match.first_party = source_match.first_party
match.source = source
source_match.references += 1
match.dupplicate = dupplicate
def _set_domain(
self, hostname: bool, domain_str: str, updated: int, source: Path
) -> None:
self.enter_step("set_domain_val")
if not Database.validate_domain(domain_str):
raise ValueError(f"Invalid domain: {domain_str}")
self.enter_step("set_domain_pack")
domain = self.pack_domain(domain_str)
self.enter_step("set_domain_fp")
source_match = self.get_match(source)
is_first_party = source_match.first_party
self.enter_step("set_domain_brws")
dic = self.domtree dic = self.domtree
dupplicate = False
for part in domain.parts: for part in domain.parts:
if dic.match_zone.active():
# Refuse to add domain whose zone is already matching
return
if part not in dic.children: if part not in dic.children:
dic.children[part] = DomainTreeNode() dic.children[part] = DomainTreeNode()
dic = dic.children[part] dic = dic.children[part]
if dic.match_zone.active(is_first_party):
dupplicate = True
if hostname: if hostname:
match = dic.match_hostname match = dic.match_hostname
else: else:
match = dic.match_zone match = dic.match_zone
self._set_match( match.set(
match,
updated, updated,
level,
source, source,
source_match=source_match,
dupplicate=dupplicate,
) )
def set_hostname(self, *args: typing.Any, **kwargs: typing.Any) -> None: def set_hostname(self,
*args: typing.Any, **kwargs: typing.Any
) -> None:
self._set_domain(True, *args, **kwargs) self._set_domain(True, *args, **kwargs)
def set_zone(self, *args: typing.Any, **kwargs: typing.Any) -> None: def set_zone(self,
*args: typing.Any, **kwargs: typing.Any
) -> None:
self._set_domain(False, *args, **kwargs) self._set_domain(False, *args, **kwargs)
def set_asn(self, asn_str: str, updated: int, source: Path) -> None: def set_asn(self,
self.enter_step("set_asn") asn: AsnPath,
path = self.pack_asn(asn_str) updated: int,
if path.asn in self.asns: is_first_party: bool = None,
match = self.asns[path.asn] source: Path = None) -> None:
self.enter_step('set_asn')
if is_first_party:
raise NotImplementedError
if source is None:
level = 0
source = RulePath()
else:
match = self.get_match(source)
level = match.level + 1
if asn.asn in self.asns:
match = self.asns[asn.asn]
else: else:
match = AsnNode() match = AsnNode()
self.asns[path.asn] = match self.asns[asn.asn] = match
self._set_match( match.set(
match,
updated, updated,
level,
source, source,
) )
def _set_ip4(self, ip4: Ip4Path, updated: int, source: Path) -> None: def set_ip4network(self,
self.enter_step("set_ip4_fp") ip4: Ip4Path,
source_match = self.get_match(source) updated: int,
is_first_party = source_match.first_party is_first_party: bool = None,
self.enter_step("set_ip4_brws") source: Path = None) -> None:
if is_first_party:
raise NotImplementedError
self.enter_step('set_ip4_src')
if source is None:
level = 0
source = RulePath()
else:
match = self.get_match(source)
level = match.level + 1
self.enter_step('set_ip4_brws')
dic = self.ip4tree dic = self.ip4tree
dupplicate = False for i in range(31, 31-ip4.prefixlen, -1):
for i in range(31, 31 - ip4.prefixlen, -1):
bit = (ip4.value >> i) & 0b1 bit = (ip4.value >> i) & 0b1
if dic.active():
# Refuse to add ip4* whose network is already matching
return
next_dic = dic.one if bit else dic.zero next_dic = dic.one if bit else dic.zero
if next_dic is None: if next_dic is None:
next_dic = IpTreeNode() next_dic = IpTreeNode()
@ -767,33 +535,15 @@ class Database(Profiler):
else: else:
dic.zero = next_dic dic.zero = next_dic
dic = next_dic dic = next_dic
if dic.active(is_first_party): dic.set(
dupplicate = True
self._set_match(
dic,
updated, updated,
level,
source, source,
source_match=source_match,
dupplicate=dupplicate,
) )
self._set_ip4cache(ip4, dic)
def set_ip4address( def set_ip4address(self,
self, ip4address_str: str, *args: typing.Any, **kwargs: typing.Any ip4: Ip4Path,
*args: typing.Any, **kwargs: typing.Any
) -> None: ) -> None:
self.enter_step("set_ip4add_val") assert ip4.prefixlen == 32
if not Database.validate_ip4address(ip4address_str): self.set_ip4network(ip4, *args, **kwargs)
raise ValueError(f"Invalid ip4address: {ip4address_str}")
self.enter_step("set_ip4add_pack")
ip4 = self.pack_ip4address(ip4address_str)
self._set_ip4(ip4, *args, **kwargs)
def set_ip4network(
self, ip4network_str: str, *args: typing.Any, **kwargs: typing.Any
) -> None:
self.enter_step("set_ip4net_val")
if not Database.validate_ip4network(ip4network_str):
raise ValueError(f"Invalid ip4network: {ip4network_str}")
self.enter_step("set_ip4net_pack")
ip4 = self.pack_ip4network(ip4network_str)
self._set_ip4(ip4, *args, **kwargs)
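The IPv4 lookups above walk a binary tree one address bit at a time, from the most significant bit down, and report every active node met on the way. A minimal standalone sketch of that idea follows. It is not the project's `Database` class: the `Node`, `insert` and `lookup` names are hypothetical, and match metadata (source, level, update time) is reduced to a single `active` flag.

```python
import ipaddress
from typing import Iterator, Optional, Tuple


class Node:
    """One bit of an IPv4 prefix; hypothetical stand-in for IpTreeNode."""

    def __init__(self) -> None:
        self.zero: Optional["Node"] = None
        self.one: Optional["Node"] = None
        self.active = False


def insert(root: Node, value: int, prefixlen: int) -> None:
    """Mark the network value/prefixlen as matching."""
    node = root
    for i in range(31, 31 - prefixlen, -1):
        bit = (value >> i) & 1
        nxt = node.one if bit else node.zero
        if nxt is None:
            nxt = Node()
            if bit:
                node.one = nxt
            else:
                node.zero = nxt
        node = nxt
    node.active = True


def lookup(root: Node, value: int) -> Iterator[Tuple[int, int]]:
    """Yield (network, prefixlen) for every stored network containing value."""
    node = root
    for i in range(31, -1, -1):
        if node.active:
            # Clear the bits not consumed yet to get the network address.
            yield (value >> (i + 1) << (i + 1), 31 - i)
        bit = (value >> i) & 1
        nxt = node.one if bit else node.zero
        if nxt is None:
            return
        node = nxt
    if node.active:
        yield (value, 32)


def ip(text: str) -> int:
    return int(ipaddress.IPv4Address(text))


root = Node()
insert(root, ip("10.0.0.0"), 8)
insert(root, ip("10.1.2.3"), 32)
matches = list(lookup(root, ip("10.1.2.3")))
```

With these two entries, `matches` holds the /8 network first and the /32 address second, since shorter prefixes are met earlier on the walk.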

54
db.py

@ -1,54 +0,0 @@
#!/usr/bin/env python3
import argparse
import database
import time
import os
if __name__ == "__main__":
# Parsing arguments
parser = argparse.ArgumentParser(description="Database operations")
parser.add_argument(
"-i", "--initialize", action="store_true", help="Reconstruct the whole database"
)
parser.add_argument(
"-p", "--prune", action="store_true", help="Remove old entries from database"
)
parser.add_argument(
"-b",
"--prune-base",
action="store_true",
help="With --prune, only prune base rules "
"(the ones added by ./feed_rules.py)",
)
parser.add_argument(
"-s",
"--prune-before",
type=int,
default=(int(time.time()) - 60 * 60 * 24 * 31 * 6),
help="With --prune, only rules updated before "
"this UNIX timestamp will be deleted",
)
parser.add_argument(
"-r",
"--references",
action="store_true",
help="DEBUG: Update the reference count",
)
args = parser.parse_args()
if not args.initialize:
DB = database.Database()
else:
if os.path.isfile(database.Database.PATH):
os.unlink(database.Database.PATH)
DB = database.Database()
DB.enter_step("main")
if args.prune:
DB.prune(before=args.prune_before, base_only=args.prune_base)
if args.references:
DB.update_references()
DB.save()
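The default of `--prune-before` in the script above is the current time minus `60 * 60 * 24 * 31 * 6` seconds, that is six 31-day months, or 186 days. The sketch below only restates that arithmetic and the comparison used by the prune callback in `database.py` (entries whose `updated` value is greater than the cutoff are kept). The `default_cutoff` and `is_stale` names are hypothetical helpers, not part of the project.

```python
import time
from typing import Optional

# Same constant as the argparse default: six 31-day months, in seconds.
SIX_MONTHS = 60 * 60 * 24 * 31 * 6


def default_cutoff(now: Optional[int] = None) -> int:
    """UNIX timestamp before which entries are considered old."""
    if now is None:
        now = int(time.time())
    return now - SIX_MONTHS


def is_stale(updated: int, before: int) -> bool:
    """Mirror of the prune test: keep only entries updated after the cutoff."""
    return updated <= before


cutoff = default_cutoff(now=1_600_000_000)
```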

1
dist/.gitignore vendored

@ -1,2 +1 @@
*.txt *.txt
*.html

114
dist/README.md vendored

@ -1,114 +0,0 @@
# Geoffrey Frogeye's block list of first-party trackers
## What's a first-party tracker?
A tracker is a script put on many websites to gather information about visitors.
They can be used for multiple reasons: statistics, risk management, marketing, ad serving…
In any case, they are a threat to Internet users' privacy and many may want to block them.
Traditionally, trackers are served from a third party.
For example, `website1.com` and `website2.com` both load their tracking script from `https://trackercompany.com/trackerscript.js`.
In order to block those, one can simply block the hostname `trackercompany.com`, which is what most ad blockers do.
However, to circumvent this block, tracker companies made the websites using them load trackers from `somestring.website1.com`.
The latter is a DNS redirection to `website1.trackercompany.com`, directly to an IP address belonging to the tracking company.
Those are called first-party trackers.
On top of the aforementioned privacy issues, they also cause security issues, as websites usually trust those scripts more.
For more information, learn about [Content Security Policy](https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP), [same-origin policy](https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy) and [Cross-Origin Resource Sharing](https://enable-cors.org/).
In order to block those trackers, ad blockers would need to block every subdomain pointing to anything under `trackercompany.com` or to their network.
Unfortunately, most don't support those blocking methods as they are not DNS-aware, i.e. they only see `somestring.website1.com`.
This list is an inventory of every `somestring.website1.com` found, so that ad blockers that are not DNS-aware can still block first-party trackers.
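The detection idea described in this section can be sketched in a few lines: follow the DNS redirections of a hostname and check whether any target falls under a known tracker zone. This is an illustration only, not the code that generates the list. The `cnames` mapping stands in for already-resolved DNS records, and `cname_chain`, `in_zone` and `redirects_to_tracker` are hypothetical names.

```python
from typing import Dict, Iterable, List


def cname_chain(hostname: str, cnames: Dict[str, str], max_depth: int = 16) -> List[str]:
    """Return the hostname followed by every name it redirects to."""
    chain = [hostname]
    seen = {hostname}
    while chain[-1] in cnames and len(chain) <= max_depth:
        target = cnames[chain[-1]]
        if target in seen:  # guard against redirection loops
            break
        chain.append(target)
        seen.add(target)
    return chain


def in_zone(hostname: str, zone: str) -> bool:
    """True if hostname is the zone itself or one of its subdomains."""
    return hostname == zone or hostname.endswith("." + zone)


def redirects_to_tracker(
    hostname: str, cnames: Dict[str, str], tracker_zones: Iterable[str]
) -> bool:
    """True if any redirection target of hostname is under a tracker zone."""
    zones = list(tracker_zones)
    targets = cname_chain(hostname, cnames)[1:]
    return any(in_zone(target, zone) for target in targets for zone in zones)


# Records taken from the example in the text above.
records = {"somestring.website1.com": "website1.trackercompany.com"}
zones = ["trackercompany.com"]
flagged = redirects_to_tracker("somestring.website1.com", records, zones)
```

Here `flagged` is true for `somestring.website1.com`, while a hostname without a redirection into `trackercompany.com` is left alone.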
### Learn more
- [CNAME Cloaking, the dangerous disguise of third-party trackers](https://medium.com/nextdns/cname-cloaking-the-dangerous-disguise-of-third-party-trackers-195205dc522a) from NextDNS
- [Trackers first-party](https://blog.imirhil.fr/2019/11/13/first-party-tracker.html) from Aeris, in French
- [uBlock Origin issue](https://github.com/uBlockOrigin/uBlock-issues/issues/780)
- [CNAME Cloaking and Bounce Tracking Defense](https://webkit.org/blog/11338/cname-cloaking-and-bounce-tracking-defense/) on WebKit's blog
- [Characterizing CNAME cloaking-based tracking](https://blog.apnic.net/2020/08/04/characterizing-cname-cloaking-based-tracking/) on APNIC's website
- [Characterizing CNAME Cloaking-Based Tracking on the Web](https://tma.ifip.org/2020/wp-content/uploads/sites/9/2020/06/tma2020-camera-paper66.pdf), a research paper from Sokendai and ANSSI
## List variants
### First-party trackers
**Recommended for hostfiles-based ad blockers, such as [Pi-hole](https://pi-hole.net/) (&lt;v5.0, as it introduced CNAME blocking).**
**Recommended for Android ad blockers as applications, such as [Blokada](https://blokada.org/).**
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/firstparty-trackers.txt>
This list contains every hostname redirecting to [a hand-picked list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/rules/first-party.list).
It should be safe from false-positives.
It also contains all tracking hostnames under company domains (e.g. `website1.trackercompany.com`),
useful for ad blockers that don't support mass regex blocking,
while still preventing fallback to third-party trackers.
Don't be afraid of the size of the list, as this is due to the nature of first-party trackers: a single tracker generates at least one hostname per client (typically two).
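Each variant is published both as a raw list (one hostname per line) and as a hosts file. A minimal sketch of turning the former into the latter is given below. The `0.0.0.0` sink address and the `to_hosts_lines` name are assumptions made for illustration; the exact layout of the published hosts files is not specified in this document.

```python
from typing import Iterable, Iterator


def to_hosts_lines(hostnames: Iterable[str], sink: str = "0.0.0.0") -> Iterator[str]:
    """Yield one hosts-file entry per unique, non-empty hostname, sorted."""
    cleaned = {name.strip() for name in hostnames}
    cleaned.discard("")
    for name in sorted(cleaned):
        # A hosts-file entry is an address followed by the name it answers for.
        yield f"{sink} {name}"


raw = ["b.example.com\n", "a.example.com", "", "a.example.com"]
hosts = list(to_hosts_lines(raw))
```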
### First-party only trackers
**Recommended for ad blockers as web browser extensions, such as [uBlock Origin](https://ublockorigin.com/) (&lt;v1.25.0 or for Chromium-based browsers, as it introduced CNAME uncloaking for Firefox).**
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/firstparty-only-trackers.txt>
This is the same list as above, albeit not containing the hostnames under the tracking company domains (e.g. `website1.trackercompany.com`).
This allows for reducing the size of the list for ad-blockers that already block those third-party trackers with their support of regex blocking.
Use in conjunction with other block lists used in regex-mode, such as [Peter Lowe's](https://pgl.yoyo.org/adservers/).
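The difference between a list and its "only" variant can be pictured as a filter: hostnames that already sit under a blocked zone are dropped, since a blocker with zone or regex support catches them anyway. The sketch below is a simplified illustration under that assumption; `drop_covered` is a hypothetical helper and not the export code itself.

```python
from typing import Iterable, List


def drop_covered(hostnames: Iterable[str], zones: Iterable[str]) -> List[str]:
    """Keep only hostnames that are not inside any of the given zones."""
    zone_set = set(zones)

    def covered(hostname: str) -> bool:
        parts = hostname.split(".")
        # Test the hostname itself and every parent domain against the zones.
        return any(".".join(parts[i:]) in zone_set for i in range(len(parts)))

    return [name for name in hostnames if not covered(name)]


full_list = ["somestring.website1.com", "website1.trackercompany.com"]
only_list = drop_covered(full_list, ["trackercompany.com"])
```

In this example `only_list` keeps `somestring.website1.com` and drops `website1.trackercompany.com`, which the zone `trackercompany.com` already covers.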
### Multi-party trackers
- Hosts file: <https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/multiparty-trackers.txt>
As first-party trackers usually evolve from third-party trackers, this list contains every hostname redirecting to trackers found in existing lists of third-party trackers (see next section).
Since the latter were not designed with first-party trackers in mind, they are likely to contain false-positives.
On the other hand, they might protect against first-party trackers that we're not aware of or have not yet confirmed.
#### Source of third-party trackers
- [EasyPrivacy](https://easylist.to/easylist/easyprivacy.txt)
- [AdGuard](https://github.com/AdguardTeam/AdguardFilters)
(Yes, there are only two for now. Many of the existing lists cause a lot of false positives.)
### Multi-party only trackers
- Hosts file: <https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt>
- Raw list: <https://hostfiles.frogeye.fr/multiparty-only-trackers.txt>
This is the same list as above, albeit not containing the hostnames under the tracking company domains (e.g. `website1.trackercompany.com`).
This allows for reducing the size of the list for ad-blockers that already block those third-party trackers with their support of regex blocking.
Use in conjunction with other block lists used in regex-mode, such as the ones in the previous section.
## Meta
In case of false positives/negatives, or any other question contact me the way you like: <https://geoffrey.frogeye.fr>
The software used to generate this list is available here: <https://git.frogeye.fr/geoffrey/eulaurarien>
## Acknowledgements
Some of the first-party trackers included in this list have been found by:
- [Aeris](https://imirhil.fr/)
- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors
- Yuki2718 from [Wilders Security Forums](https://www.wilderssecurity.com/threads/ublock-a-lean-and-fast-blocker.365273/page-168#post-2880361)
- Ha Dao, Johan Mazel, and Kensuke Fukuda, ["Characterizing CNAME Cloaking-Based Tracking on the Web", Proceedings of IFIP/IEEE Traffic Measurement Analysis Conference (TMA), 9 pages, 2020.](https://tma.ifip.org/2020/wp-content/uploads/sites/9/2020/06/tma2020-camera-paper66.pdf)
- AdGuard and [their blocklist](https://github.com/AdguardTeam/cname-trackers)'s contributors
The list was generated using data from
- [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html)
- [Public DNS Server List](https://public-dns.info/)
Similar projects:
- [NextDNS blocklist](https://github.com/nextdns/cname-cloaking-blocklist): for DNS-aware ad blockers
- [Stefan Froberg's lists](https://www.orwell1984.today/cname/): subset of those lists grouped by tracker
- [AdGuard blocklist](https://github.com/AdguardTeam/cname-trackers): same thing with a bigger scope, maintained by a bigger team


@ -1,2 +0,0 @@
/* Source: https://github.com/jasonm23/markdown-css-themes */
body{font-family:Helvetica,arial,sans-serif;font-size:14px;line-height:1.6;padding-top:10px;padding-bottom:10px;background-color:#fff;padding:30px}body>:first-child{margin-top:0!important}body>:last-child{margin-bottom:0!important}a{color:#4183c4}a.absent{color:#c00}a.anchor{display:block;padding-left:30px;margin-left:-30px;cursor:pointer;position:absolute;top:0;left:0;bottom:0}h1,h2,h3,h4,h5,h6{margin:20px 0 10px;padding:0;font-weight:700;-webkit-font-smoothing:antialiased;cursor:text;position:relative}h1:hover a.anchor,h2:hover a.anchor,h3:hover a.anchor,h4:hover a.anchor,h5:hover a.anchor,h6:hover a.anchor{text-decoration:none}h1 code,h1 tt{font-size:inherit}h2 code,h2 tt{font-size:inherit}h3 code,h3 tt{font-size:inherit}h4 code,h4 tt{font-size:inherit}h5 code,h5 tt{font-size:inherit}h6 code,h6 tt{font-size:inherit}h1{font-size:28px;color:#000}h2{font-size:24px;border-bottom:1px solid #ccc;color:#000}h3{font-size:18px}h4{font-size:16px}h5{font-size:14px}h6{color:#777;font-size:14px}blockquote,dl,li,ol,p,pre,table,ul{margin:15px 0}hr{border:0 none;color:#ccc;height:4px;padding:0}body>h2:first-child{margin-top:0;padding-top:0}body>h1:first-child{margin-top:0;padding-top:0}body>h1:first-child+h2{margin-top:0;padding-top:0}body>h3:first-child,body>h4:first-child,body>h5:first-child,body>h6:first-child{margin-top:0;padding-top:0}a:first-child h1,a:first-child h2,a:first-child h3,a:first-child h4,a:first-child h5,a:first-child h6{margin-top:0;padding-top:0}h1 p,h2 p,h3 p,h4 p,h5 p,h6 p{margin-top:0}li p.first{display:inline-block}li{margin:0}ol,ul{padding-left:30px}ol :first-child,ul :first-child{margin-top:0}dl{padding:0}dl dt{font-size:14px;font-weight:700;font-style:italic;padding:0;margin:15px 0 5px}dl dt:first-child{padding:0}dl dt>:first-child{margin-top:0}dl dt>:last-child{margin-bottom:0}dl dd{margin:0 0 15px;padding:0 15px}dl dd>:first-child{margin-top:0}dl dd>:last-child{margin-bottom:0}blockquote{border-left:4px solid #ddd;padding:0 
15px;color:#777}blockquote>:first-child{margin-top:0}blockquote>:last-child{margin-bottom:0}table{padding:0;border-collapse:collapse}table tr{border-top:1px solid #ccc;background-color:#fff;margin:0;padding:0}table tr:nth-child(2n){background-color:#f8f8f8}table tr th{font-weight:700;border:1px solid #ccc;margin:0;padding:6px 13px}table tr td{border:1px solid #ccc;margin:0;padding:6px 13px}table tr td :first-child,table tr th :first-child{margin-top:0}table tr td :last-child,table tr th :last-child{margin-bottom:0}img{max-width:100%}span.frame{display:block;overflow:hidden}span.frame>span{border:1px solid #ddd;display:block;float:left;overflow:hidden;margin:13px 0 0;padding:7px;width:auto}span.frame span img{display:block;float:left}span.frame span span{clear:both;color:#333;display:block;padding:5px 0 0}span.align-center{display:block;overflow:hidden;clear:both}span.align-center>span{display:block;overflow:hidden;margin:13px auto 0;text-align:center}span.align-center span img{margin:0 auto;text-align:center}span.align-right{display:block;overflow:hidden;clear:both}span.align-right>span{display:block;overflow:hidden;margin:13px 0 0;text-align:right}span.align-right span img{margin:0;text-align:right}span.float-left{display:block;margin-right:13px;overflow:hidden;float:left}span.float-left span{margin:13px 0 0}span.float-right{display:block;margin-left:13px;overflow:hidden;float:right}span.float-right>span{display:block;overflow:hidden;margin:13px auto 0;text-align:right}code,tt{margin:0 2px;padding:0 5px;white-space:nowrap;border:1px solid #eaeaea;background-color:#f8f8f8;border-radius:3px}pre code{margin:0;padding:0;white-space:pre;border:none;background:0 0}.highlight pre{background-color:#f8f8f8;border:1px solid #ccc;font-size:13px;line-height:19px;overflow:auto;padding:6px 10px;border-radius:3px}pre{background-color:#f8f8f8;border:1px solid #ccc;font-size:13px;line-height:19px;overflow:auto;padding:6px 10px;border-radius:3px}pre code,pre 
tt{background-color:transparent;border:none}sup{font-size:.83em;vertical-align:super;line-height:0}*{-webkit-print-color-adjust:exact}@media screen and (min-width:914px){body{width:854px;margin:0 auto}}@media print{pre,table{page-break-inside:avoid}pre{word-wrap:break-word}}


@ -2,13 +2,8 @@
# Main script for eulaurarien # Main script for eulaurarien
[ ! -f .env ] && touch .env
./fetch_resources.sh ./fetch_resources.sh
./collect_subdomains.sh ./collect_subdomains.sh
./import_rules.sh
./resolve_subdomains.sh ./resolve_subdomains.sh
./prune.sh ./filter_subdomains.sh
./export_lists.sh
./generate_index.py


@ -5,87 +5,45 @@ import argparse
import sys import sys
if __name__ == "__main__": if __name__ == '__main__':
# Parsing arguments # Parsing arguments
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Export the hostnames rules stored " "in the Database as plain text" description="TODO")
)
parser.add_argument( parser.add_argument(
"-o", '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
"--output", help="TODO")
type=argparse.FileType("w"),
default=sys.stdout,
help="Output file, one rule per line",
)
parser.add_argument( parser.add_argument(
"-f", '-f', '--first-party', action='store_true',
"--first-party", help="TODO")
action="store_true",
help="Only output rules issued from first-party sources",
)
parser.add_argument( parser.add_argument(
"-e", '-e', '--end-chain', action='store_true',
"--end-chain", help="TODO")
action="store_true",
help="Only output rules that are not referenced by any other",
)
parser.add_argument( parser.add_argument(
"-r", '-x', '--explain', action='store_true',
"--rules", help="TODO")
action="store_true",
help="Output all kinds of rules, not just hostnames",
)
parser.add_argument( parser.add_argument(
"-b", '-r', '--rules', action='store_true',
"--base-rules", help="TODO")
action="store_true",
help="Output base rules "
"(the ones added by ./feed_rules.py) "
"(implies --rules)",
)
parser.add_argument( parser.add_argument(
"-d", '-c', '--count', action='store_true',
"--no-dupplicates", help="TODO")
action="store_true",
help="Do not output rules that already match a zone/network rule "
"(e.g. dummy.example.com when there's a zone example.com rule)",
)
parser.add_argument(
"-x",
"--explain",
action="store_true",
help="Show the chain of rules leading to one "
"(and the number of references they have)",
)
parser.add_argument(
"-c",
"--count",
action="store_true",
help="Show the number of rules per type instead of listing them",
)
args = parser.parse_args() args = parser.parse_args()
DB = database.Database() DB = database.Database()
if args.rules:
if args.count: if args.count:
assert not args.explain print(DB.count_rules(first_party_only=args.first_party))
print(
DB.count_records(
first_party_only=args.first_party,
end_chain_only=args.end_chain,
no_dupplicates=args.no_dupplicates,
rules_only=args.base_rules,
hostnames_only=not (args.rules or args.base_rules),
)
)
else: else:
for domain in DB.list_records( for line in DB.list_rules():
print(line)
else:
if args.count:
raise NotImplementedError
for domain in DB.export(
first_party_only=args.first_party, first_party_only=args.first_party,
end_chain_only=args.end_chain, end_chain_only=args.end_chain,
no_dupplicates=args.no_dupplicates,
rules_only=args.base_rules,
hostnames_only=not (args.rules or args.base_rules),
explain=args.explain, explain=args.explain,
): ):
print(domain, file=args.output) print(domain, file=args.output)


@ -1,98 +0,0 @@
#!/usr/bin/env bash
function log() {
echo -e "\033[33m$@\033[0m"
}
log "Calculating statistics…"
oldest="$(cat last_updates/*.txt | sort -n | head -1)"
oldest_date=$(date -Isec -d @$oldest)
gen_date=$(date -Isec)
gen_software=$(git describe --tags)
number_websites=$(wc -l < temp/all_websites.list)
number_subdomains=$(wc -l < temp/all_subdomains.list)
number_dns=$(grep 'NOERROR' temp/all_resolved.txt | wc -l)
for partyness in {first,multi}
do
if [ $partyness = "first" ]
then
partyness_flags="--first-party"
else
partyness_flags=""
fi
rules_input=$(./export.py --count --base-rules $partyness_flags)
rules_found=$(./export.py --count --rules $partyness_flags)
rules_found_nd=$(./export.py --count --rules --no-dupplicates $partyness_flags)
echo
echo "Statistics for ${partyness}-party trackers"
echo "Input rules: $rules_input"
echo "Subsequent rules: $rules_found"
echo "Subsequent rules (no duplicate): $rules_found_nd"
echo "Output hostnames: $(./export.py --count $partyness_flags)"
echo "Output hostnames (no duplicate): $(./export.py --count --no-dupplicates $partyness_flags)"
echo "Output hostnames (end-chain only): $(./export.py --count --end-chain $partyness_flags)"
echo "Output hostnames (no duplicate, end-chain only): $(./export.py --count --no-dupplicates --end-chain $partyness_flags)"
for trackerness in {trackers,only-trackers}
do
if [ $trackerness = "trackers" ]
then
trackerness_flags=""
else
trackerness_flags="--no-dupplicates"
fi
file_list="dist/${partyness}party-${trackerness}.txt"
file_host="dist/${partyness}party-${trackerness}-hosts.txt"
log "Generating lists for variant ${partyness}-party ${trackerness}"
# Real export heeere
./export.py $partyness_flags $trackerness_flags > $file_list
# Sometimes a bit heavy to have the DB open and sort the output
# so this is done in two steps
sort -u $file_list -o $file_list
rules_output=$(./export.py --count $partyness_flags $trackerness_flags)
(
echo "# First-party trackers host list"
echo "# Variant: ${partyness}-party ${trackerness}"
echo "#"
echo "# About first-party trackers: https://hostfiles.frogeye.fr/#whats-a-first-party-tracker"
echo "#"
echo "# In case of false positives/negatives, or any other question,"
echo "# contact me the way you like: https://geoffrey.frogeye.fr"
echo "#"
echo "# Latest versions and variants: https://hostfiles.frogeye.fr/#list-variants"
echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
echo "# License: https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/LICENSE"
echo "# Acknowledgements: https://hostfiles.frogeye.fr/#acknowledgements"
echo "#"
echo "# Generation software: eulaurarien $gen_software"
echo "# List generation date: $gen_date"
echo "# Oldest record: $oldest_date"
echo "# Number of source websites: $number_websites"
echo "# Number of source subdomains: $number_subdomains"
echo "# Number of source DNS records: $number_dns"
echo "#"
echo "# Input rules: $rules_input"
echo "# Subsequent rules: $rules_found"
echo "# … no duplicates: $rules_found_nd"
echo "# Output rules: $rules_output"
echo "#"
echo
sed 's|^|0.0.0.0 |' "$file_list"
) > "$file_host"
done
done
if [ -d explanations ]
then
filename="$(date -Isec).txt"
./export.py --explain > "explanations/$filename"
ln --force --symbolic "$filename" "explanations/latest.txt"
fi
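
The last step of the removed script (comment header, blank line, then `sed 's|^|0.0.0.0 |'` over the sorted list) can be sketched in Python as below. This is an illustration only, not code from the repository: `write_hosts` and the sample hostnames are hypothetical, and `sorted(set(...))` only approximates `sort -u`, whose ordering depends on the locale.

```python
import io
import typing


def write_hosts(
    hostnames: typing.Iterable[str],
    header: typing.Iterable[str],
    out: typing.TextIO,
) -> None:
    """Write a hosts-format block list: commented header, blank line, entries."""
    for line in header:
        # Mirrors the script's `echo "# ..."` lines; an empty entry gives a bare "#".
        out.write(f"# {line}\n" if line else "#\n")
    out.write("\n")
    # Deduplicate and sort, roughly what `sort -u` does before the sed step.
    for hostname in sorted(set(hostnames)):
        out.write(f"0.0.0.0 {hostname}\n")


# Hypothetical sample data, written to an in-memory buffer instead of dist/.
buf = io.StringIO()
write_hosts(
    ["b.example.com", "a.example.com", "b.example.com"],
    ["First-party trackers host list", ""],
    buf,
)
hosts_text = buf.getvalue()
```

Pointing every hostname at `0.0.0.0` is what makes the raw list usable as a hosts file; the raw list itself stays one hostname per line.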


@ -13,56 +13,40 @@ IPNetwork = typing.Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
def get_ranges(asn: str) -> typing.Iterable[str]: def get_ranges(asn: str) -> typing.Iterable[str]:
req = requests.get( req = requests.get(
"https://stat.ripe.net/data/as-routing-consistency/data.json", 'https://stat.ripe.net/data/as-routing-consistency/data.json',
params={"resource": asn}, params={'resource': asn}
) )
data = req.json() data = req.json()
for pref in data["data"]["prefixes"]: for pref in data['data']['prefixes']:
yield pref["prefix"] yield pref['prefix']
def get_name(asn: str) -> str: if __name__ == '__main__':
req = requests.get(
"https://stat.ripe.net/data/as-overview/data.json", params={"resource": asn}
)
data = req.json()
return data["data"]["holder"]
log = logging.getLogger('feed_asn')
if __name__ == "__main__":
log = logging.getLogger("feed_asn")
# Parsing arguments # Parsing arguments
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Add the IP ranges associated to the AS in the database" description="TODO")
)
args = parser.parse_args() args = parser.parse_args()
DB = database.Database() DB = database.Database()
def add_ranges( for path in DB.list_asn():
path: database.Path,
match: database.Match,
) -> None:
assert isinstance(path, database.AsnPath)
assert isinstance(match, database.AsnNode)
asn_str = database.Database.unpack_asn(path) asn_str = database.Database.unpack_asn(path)
DB.enter_step("asn_get_name") DB.enter_step('asn_get_ranges')
name = get_name(asn_str)
match.name = name
DB.enter_step("asn_get_ranges")
for prefix in get_ranges(asn_str): for prefix in get_ranges(asn_str):
parsed_prefix: IPNetwork = ipaddress.ip_network(prefix) parsed_prefix: IPNetwork = ipaddress.ip_network(prefix)
if parsed_prefix.version == 4: if parsed_prefix.version == 4:
DB.set_ip4network(prefix, source=path, updated=int(time.time())) DB.set_ip4network(
log.info("Added %s from %s (%s)", prefix, path, name) prefix,
source=path,
updated=int(time.time())
)
log.info('Added %s from %s (%s)', prefix, asn_str, path)
elif parsed_prefix.version == 6: elif parsed_prefix.version == 6:
log.warning("Unimplemented prefix version: %s", prefix) log.warning('Unimplemented prefix version: %s', prefix)
else: else:
log.error("Unknown prefix version: %s", prefix) log.error('Unknown prefix version: %s', prefix)
for _ in DB.exec_each_asn(add_ranges):
pass
DB.save() DB.save()
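
Both sides of this diff branch on `parsed_prefix.version` and only store IPv4 prefixes. A minimal offline sketch of that classification, assuming the prefixes were already fetched (no request to the RIPE API is made here), could look like the following. `split_by_version` is a hypothetical helper and the prefixes are documentation ranges, not real data.

```python
import ipaddress
import typing

IPNetwork = typing.Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def split_by_version(
    prefixes: typing.Iterable[str],
) -> typing.Tuple[typing.List[str], typing.List[str]]:
    """Separate textual prefixes into IPv4 and IPv6 lists."""
    ip4: typing.List[str] = []
    ip6: typing.List[str] = []
    for prefix in prefixes:
        # ip_network() raises ValueError on malformed input or set host bits.
        parsed: IPNetwork = ipaddress.ip_network(prefix)
        if parsed.version == 4:
            ip4.append(prefix)
        else:
            ip6.append(prefix)
    return ip4, ip6


# Documentation ranges stand in for the prefixes announced by an AS.
ip4, ip6 = split_by_version(["192.0.2.0/24", "2001:db8::/32"])
```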

147
feed_dns.old.py Executable file

@ -0,0 +1,147 @@
#!/usr/bin/env python3
import argparse
import database
import logging
import sys
import typing
import enum
RecordType = enum.Enum('RecordType', 'A AAAA CNAME PTR')
Record = typing.Tuple[RecordType, int, str, str]
# select, write
FUNCTION_MAP: typing.Any = {
RecordType.A: (
database.Database.get_ip4,
database.Database.set_hostname,
),
RecordType.CNAME: (
database.Database.get_domain,
database.Database.set_hostname,
),
RecordType.PTR: (
database.Database.get_domain,
database.Database.set_ip4address,
),
}
class Parser():
def __init__(self, buf: typing.Any) -> None:
self.buf = buf
self.log = logging.getLogger('parser')
self.db = database.Database()
def end(self) -> None:
self.db.save()
def register(self,
rtype: RecordType,
updated: int,
name: str,
value: str
) -> None:
self.db.enter_step('register')
select, write = FUNCTION_MAP[rtype]
for source in select(self.db, value):
# write(self.db, name, updated, source=source)
write(self.db, name, updated)
def consume(self) -> None:
raise NotImplementedError
class Rapid7Parser(Parser):
TYPES = {
'a': RecordType.A,
'aaaa': RecordType.AAAA,
'cname': RecordType.CNAME,
'ptr': RecordType.PTR,
}
def consume(self) -> None:
data = dict()
for line in self.buf:
self.db.enter_step('parse_rapid7')
split = line.split('"')
for k in range(1, 14, 4):
key = split[k]
val = split[k+2]
data[key] = val
self.register(
Rapid7Parser.TYPES[data['type']],
int(data['timestamp']),
data['name'],
data['value']
)
class DnsMassParser(Parser):
# dnsmass --output Snrql
# --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
TYPES = {
'A': (RecordType.A, -1, None),
'AAAA': (RecordType.AAAA, -1, None),
'CNAME': (RecordType.CNAME, -1, -1),
}
def consume(self) -> None:
self.db.enter_step('parse_dnsmass')
timestamp = 0
header = True
for line in self.buf:
line = line[:-1]
if not line:
header = True
continue
split = line.split(' ')
try:
if header:
timestamp = int(split[1])
header = False
else:
dtype, name_offset, value_offset = \
DnsMassParser.TYPES[split[1]]
self.register(
dtype,
timestamp,
split[0][:name_offset],
split[2][:value_offset],
)
self.db.enter_step('parse_dnsmass')
except KeyError:
continue
PARSERS = {
'rapid7': Rapid7Parser,
'dnsmass': DnsMassParser,
}
if __name__ == '__main__':
# Parsing arguments
log = logging.getLogger('feed_dns')
args_parser = argparse.ArgumentParser(
description="TODO")
args_parser.add_argument(
'parser',
choices=PARSERS.keys(),
help="TODO")
args_parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="TODO")
args = args_parser.parse_args()
parser = PARSERS[args.parser](args.input)
try:
parser.consume()
except KeyboardInterrupt:
pass
parser.end()
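
`Rapid7Parser.consume` avoids a JSON parser by splitting each line on double quotes and reading keys at indices 1, 5, 9, 13, with each value two positions later. The sketch below shows why this works on a flat record with four string fields. The sample line is an assumption about the input shape, and `parse_flat_record` is a hypothetical name.

```python
import json
import typing


def parse_flat_record(line: str) -> typing.Dict[str, str]:
    """Extract four key/value pairs from a flat JSON line by splitting on quotes."""
    split = line.split('"')
    data: typing.Dict[str, str] = {}
    # Odd indices hold quoted strings: key at k, ":" at k+1, value at k+2.
    for k in range(1, 14, 4):
        data[split[k]] = split[k + 2]
    return data


# Assumed record shape: exactly four string-valued fields, no escaped quotes.
sample = '{"timestamp":"1575000000","name":"example.com","type":"a","value":"192.0.2.1"}'
record = parse_flat_record(sample)
```

The shortcut only holds while every value is a quoted string without escaped quotes and the record has exactly four fields; anything else shifts the indices, which is presumably why the newer parser wraps the loop in a `try` block.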


@ -6,123 +6,115 @@ import logging
import sys import sys
import typing import typing
import multiprocessing import multiprocessing
import time import enum
Record = typing.Tuple[typing.Callable, typing.Callable, int, str, str] Record = typing.Tuple[typing.Callable,
typing.Callable, int, database.Path, database.Path]
# select, write # select, write, name_packer, value_packer
FUNCTION_MAP: typing.Any = { FUNCTION_MAP: typing.Any = {
"a": ( 'a': (
database.Database.get_ip4, database.Database.get_ip4,
database.Database.set_hostname, database.Database.set_hostname,
database.Database.pack_domain,
database.Database.pack_ip4address,
), ),
"cname": ( 'cname': (
database.Database.get_domain, database.Database.get_domain,
database.Database.set_hostname, database.Database.set_hostname,
database.Database.pack_domain,
database.Database.pack_domain,
), ),
"ptr": ( 'ptr': (
database.Database.get_domain, database.Database.get_domain,
database.Database.set_ip4address, database.Database.set_ip4address,
database.Database.pack_ip4address,
database.Database.pack_domain,
), ),
} }
class Writer(multiprocessing.Process): class Writer(multiprocessing.Process):
def __init__( def __init__(self,
self, recs_queue: multiprocessing.Queue,
recs_queue: multiprocessing.Queue = None, index: int = 0):
autosave_interval: int = 0,
ip4_cache: int = 0,
):
if recs_queue: # MP
super(Writer, self).__init__() super(Writer, self).__init__()
self.log = logging.getLogger(f'wr')
self.recs_queue = recs_queue self.recs_queue = recs_queue
self.log = logging.getLogger("wr")
self.autosave_interval = autosave_interval
self.ip4_cache = ip4_cache
if not recs_queue: # No MP
self.open_db()
def open_db(self) -> None:
self.db = database.Database()
self.db.log = logging.getLogger("wr")
self.db.fill_ip4cache(max_size=self.ip4_cache)
def exec_record(self, record: Record) -> None:
self.db.enter_step("exec_record")
select, write, updated, name, value = record
try:
for source in select(self.db, value):
write(self.db, name, updated, source=source)
except (ValueError, IndexError):
# ValueError: non-number in IP
# IndexError: IP too big
self.log.exception("Cannot execute: %s", record)
def end(self) -> None:
self.db.enter_step("end")
self.db.save()
def run(self) -> None: def run(self) -> None:
self.open_db() self.db = database.Database()
if self.autosave_interval > 0: self.db.log = logging.getLogger(f'wr')
next_save = time.time() + self.autosave_interval
else:
next_save = 0
self.db.enter_step("block_wait") self.db.enter_step('block_wait')
block: typing.List[Record] block: typing.List[Record]
for block in iter(self.recs_queue.get, None): for block in iter(self.recs_queue.get, None):
assert block
record: Record record: Record
for record in block: for record in block:
self.exec_record(record)
if next_save > 0 and time.time() > next_save: select, write, updated, name, value = record
self.log.info("Saving database...") self.db.enter_step('feed_switch')
for source in select(self.db, value):
write(self.db, name, updated, source=source)
self.db.enter_step('block_wait')
self.db.enter_step('end')
self.db.save() self.db.save()
self.log.info("Done!")
next_save = time.time() + self.autosave_interval
self.db.enter_step("block_wait")
self.end()
class Parser: class Parser():
def __init__( def __init__(self,
self,
buf: typing.Any, buf: typing.Any,
recs_queue: multiprocessing.Queue = None, recs_queue: multiprocessing.Queue,
block_size: int = 0, block_size: int,
writer: Writer = None,
): ):
assert bool(writer) ^ bool(block_size and recs_queue) super(Parser, self).__init__()
self.buf = buf self.buf = buf
self.log = logging.getLogger("pr") self.log = logging.getLogger('pr')
self.recs_queue = recs_queue self.recs_queue = recs_queue
if writer: # No MP
self.prof: database.Profiler = writer.db
self.register = writer.exec_record
else: # MP
self.block: typing.List[Record] = list() self.block: typing.List[Record] = list()
self.block_size = block_size self.block_size = block_size
self.prof = database.Profiler() self.prof = database.Profiler()
self.prof.log = logging.getLogger("pr") self.prof.log = logging.getLogger('pr')
self.register = self.add_to_queue
def add_to_queue(self, record: Record) -> None: def register(self,
self.prof.enter_step("register") rtype: str,
timestamp: int,
name_str: str,
value_str: str,
) -> None:
self.prof.enter_step('pack')
try:
select, write, name_packer, value_packer = FUNCTION_MAP[rtype]
except KeyError:
self.log.exception("Unknown record type")
return
try:
name = name_packer(name_str)
except ValueError:
self.log.exception("Cannot parse name ('%s' with %s)",
name_str, name_packer)
return
try:
value = value_packer(value_str)
except ValueError:
self.log.exception("Cannot parse value ('%s' with %s)",
value_str, value_packer)
return
record = (select, write, timestamp, name, value)
self.prof.enter_step('grow_block')
self.block.append(record) self.block.append(record)
if len(self.block) >= self.block_size: if len(self.block) >= self.block_size:
self.prof.enter_step("put_block") self.prof.enter_step('put_block')
assert self.recs_queue
self.recs_queue.put(self.block) self.recs_queue.put(self.block)
self.block = list() self.block = list()
def run(self) -> None: def run(self) -> None:
self.consume() self.consume()
if self.recs_queue:
self.recs_queue.put(self.block) self.recs_queue.put(self.block)
self.prof.profile() self.prof.profile()
@ -130,17 +122,43 @@ class Parser:
raise NotImplementedError raise NotImplementedError
class MassDnsParser(Parser): class Rapid7Parser(Parser):
# massdns --output Snrql def consume(self) -> None:
data = dict()
self.prof.enter_step('iowait')
for line in self.buf:
self.prof.enter_step('parse_rapid7')
split = line.split('"')
try:
for k in range(1, 14, 4):
key = split[k]
val = split[k+2]
data[key] = val
self.register(
data['type'],
int(data['timestamp']),
data['name'],
data['value'],
)
self.prof.enter_step('iowait')
except KeyError:
# Sometimes JSON records are off the place
self.log.exception("Cannot parse: %s", line)
class DnsMassParser(Parser):
# dnsmass --output Snrql
# --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4 # --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
TYPES = { TYPES = {
"A": (FUNCTION_MAP["a"][0], FUNCTION_MAP["a"][1], -1, None), 'A': ('a', -1, None),
# 'AAAA': (FUNCTION_MAP['aaaa'][0], FUNCTION_MAP['aaaa'][1], -1, None), # 'AAAA': ('aaaa', -1, None),
"CNAME": (FUNCTION_MAP["cname"][0], FUNCTION_MAP["cname"][1], -1, -1), 'CNAME': ('cname', -1, -1),
} }
def consume(self) -> None: def consume(self) -> None:
self.prof.enter_step("parse_massdns") self.prof.enter_step('parse_dnsmass')
timestamp = 0 timestamp = 0
header = True header = True
for line in self.buf: for line in self.buf:
@ -149,102 +167,63 @@ class MassDnsParser(Parser):
header = True header = True
continue continue
split = line.split(" ") split = line.split(' ')
try: try:
if header: if header:
timestamp = int(split[1]) timestamp = int(split[1])
header = False header = False
else: else:
select, write, name_offset, value_offset = MassDnsParser.TYPES[ rtype, name_offset, value_offset = \
split[1] DnsMassParser.TYPES[split[1]]
] self.register(
record = ( rtype,
select,
write,
timestamp, timestamp,
split[0][:name_offset].lower(), split[0][:name_offset],
split[2][:value_offset].lower(), split[2][:value_offset],
) )
self.register(record) self.prof.enter_step('parse_dnsmass')
self.prof.enter_step("parse_massdns")
except KeyError: except KeyError:
continue # Malformed records are less likely to happen,
# but we may never be sure
self.log.exception("Cannot parse: %s", line)
PARSERS = { PARSERS = {
"massdns": MassDnsParser, 'rapid7': Rapid7Parser,
'dnsmass': DnsMassParser,
} }
if __name__ == "__main__": if __name__ == '__main__':
# Parsing arguments # Parsing arguments
log = logging.getLogger("feed_dns") log = logging.getLogger('feed_dns')
args_parser = argparse.ArgumentParser( args_parser = argparse.ArgumentParser(
description="Read DNS records and import " description="TODO")
"tracking-relevant data into the database"
)
args_parser.add_argument("parser", choices=PARSERS.keys(), help="Input format")
args_parser.add_argument( args_parser.add_argument(
"-i", 'parser',
"--input", choices=PARSERS.keys(),
type=argparse.FileType("r"), help="TODO")
default=sys.stdin,
help="Input file",
)
args_parser.add_argument( args_parser.add_argument(
"-b", "--block-size", type=int, default=1024, help="Performance tuning value" '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
) help="TODO")
args_parser.add_argument( args_parser.add_argument(
"-q", "--queue-size", type=int, default=128, help="Performance tuning value" '-j', '--workers', type=int, default=4,
) help="TODO")
args_parser.add_argument( args_parser.add_argument(
"-a", '-b', '--block-size', type=int, default=100,
"--autosave-interval", help="TODO")
type=int,
default=900,
help="Interval to which the database will save in seconds. " "0 to disable.",
)
args_parser.add_argument( args_parser.add_argument(
"-s", '-q', '--queue-size', type=int, default=10,
"--single-process", help="TODO")
action="store_true",
help="Only use one process. " "Might be useful for single core computers.",
)
args_parser.add_argument(
"-4",
"--ip4-cache",
type=int,
default=0,
help="RAM cache for faster IPv4 lookup. "
"Maximum useful value: 512 MiB (536870912). "
"Warning: Depending on the rules, this might already "
"be a memory-heavy process, even without the cache.",
)
args = args_parser.parse_args() args = args_parser.parse_args()
parser_cls = PARSERS[args.parser]
if args.single_process:
writer = Writer(
autosave_interval=args.autosave_interval, ip4_cache=args.ip4_cache
)
parser = parser_cls(args.input, writer=writer)
parser.run()
writer.end()
else:
recs_queue: multiprocessing.Queue = multiprocessing.Queue( recs_queue: multiprocessing.Queue = multiprocessing.Queue(
maxsize=args.queue_size maxsize=args.queue_size)
)
writer = Writer( writer = Writer(recs_queue)
recs_queue,
autosave_interval=args.autosave_interval,
ip4_cache=args.ip4_cache,
)
writer.start() writer.start()
parser = parser_cls( parser = PARSERS[args.parser](args.input, recs_queue, args.block_size)
args.input, recs_queue=recs_queue, block_size=args.block_size
)
parser.run() parser.run()
recs_queue.put(None) recs_queue.put(None)
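
The parser/writer hand-off in this file batches records into blocks, pushes them through a bounded queue, and ends the writer with a `None` sentinel consumed by `iter(recs_queue.get, None)`. The sketch below reproduces that pattern in isolation. It swaps in `threading` and `queue.Queue` for `multiprocessing` so that it runs anywhere without a `__main__` guard, and the record tuples are made up.

```python
import queue
import threading
import typing

Record = typing.Tuple[int, str, str]
received: typing.List[Record] = []


def writer(recs_queue: "queue.Queue") -> None:
    # iter(get, None) keeps pulling blocks until the None sentinel arrives.
    for block in iter(recs_queue.get, None):
        received.extend(block)


def parser(
    records: typing.Iterable[Record],
    recs_queue: "queue.Queue",
    block_size: int,
) -> None:
    block: typing.List[Record] = []
    for record in records:
        block.append(record)
        if len(block) >= block_size:
            # put() blocks when the queue is full, throttling the parser.
            recs_queue.put(block)
            block = []
    # Flush whatever is left, as Parser.run does after consume().
    recs_queue.put(block)


recs_queue: "queue.Queue" = queue.Queue(maxsize=2)
thread = threading.Thread(target=writer, args=(recs_queue,))
thread.start()
records = [(i, f"host{i}.example.com", "192.0.2.1") for i in range(5)]
parser(records, recs_queue, block_size=2)
recs_queue.put(None)  # sentinel: tells the writer to stop
thread.join()
```

Sending blocks rather than single records amortises the per-item cost of the queue, which is what the `--block-size` and `--queue-size` options tune.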


@ -6,56 +6,49 @@ import sys
import time import time
import typing import typing
FUNCTION_MAP = { FUNCTION_MAP: typing.Dict[str, typing.Tuple[
"zone": database.Database.set_zone, typing.Callable[[database.Database, database.Path, int], None],
"hostname": database.Database.set_hostname, typing.Callable[[str], database.Path],
"asn": database.Database.set_asn, ]] = {
"ip4network": database.Database.set_ip4network, 'hostname': (database.Database.set_hostname,
"ip4address": database.Database.set_ip4address, database.Database.pack_domain),
'zone': (database.Database.set_zone,
database.Database.pack_domain),
'asn': (database.Database.set_asn,
database.Database.pack_asn),
'ip4address': (database.Database.set_ip4address,
database.Database.pack_ip4address),
'ip4network': (database.Database.set_ip4network,
database.Database.pack_ip4network),
} }
if __name__ == "__main__": if __name__ == '__main__':
# Parsing arguments # Parsing arguments
parser = argparse.ArgumentParser(description="Import base rules to the database") parser = argparse.ArgumentParser(
description="TODO")
parser.add_argument( parser.add_argument(
"type", choices=FUNCTION_MAP.keys(), help="Type of rule inputed" 'type',
) choices=FUNCTION_MAP.keys(),
help="Type of rule inputed")
parser.add_argument( parser.add_argument(
"-i", '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
"--input", help="List of domains domains to block (with their subdomains)")
type=argparse.FileType("r"),
default=sys.stdin,
help="File with one rule per line",
)
parser.add_argument( parser.add_argument(
"-f", '-f', '--first-party', action='store_true',
"--first-party", help="The input only comes from verified first-party sources")
action="store_true",
help="The input only comes from verified first-party sources",
)
args = parser.parse_args() args = parser.parse_args()
DB = database.Database() DB = database.Database()
fun = FUNCTION_MAP[args.type] fun, packer = FUNCTION_MAP[args.type]
source: database.RulePath
if args.first_party:
source = database.RuleFirstPath()
else:
source = database.RuleMultiPath()
for rule in args.input: for rule in args.input:
rule = rule.strip() packed = packer(rule.strip())
try: fun(DB,
fun( packed,
DB, # is_first_party=args.first_party,
rule,
source=source,
updated=int(time.time()), updated=int(time.time()),
) )
except ValueError:
DB.log.error(f"Could not add rule: {rule}")
DB.save() DB.save()


@ -13,22 +13,30 @@ function dl() {
fi fi
} }
log "Retrieving tests…"
rm -f tests/*.cache.csv
dl https://raw.githubusercontent.com/fukuda-lab/cname_cloaking/master/Subdomain_CNAME-cloaking-based-tracking.csv temp/fukuda.csv
(echo "url,allow,deny,comment"; tail -n +2 temp/fukuda.csv | awk -F, '{ print "https://" $2 "/,," $3 "," $5 }') > tests/fukuda.cache.csv
log "Retrieving rules…" log "Retrieving rules…"
rm -f rules*/*.cache.* rm -f rules*/*.cache.*
dl https://easylist.to/easylist/easyprivacy.txt rules_adblock/easyprivacy.cache.txt dl https://easylist.to/easylist/easyprivacy.txt rules_adblock/easyprivacy.cache.txt
dl https://filters.adtidy.org/extension/chromium/filters/3.txt rules_adblock/adguard.cache.txt # From firebog.net Tracking & Telemetry Lists
# dl https://v.firebog.net/hosts/Prigent-Ads.txt rules/prigent-ads.cache.list
log "Retrieving TLD list…" # dl https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-blocklist.txt rules/notrack-blocklist.cache.list
dl http://data.iana.org/TLD/tlds-alpha-by-domain.txt temp/all_tld.temp.list # False positives: https://github.com/WaLLy3K/wally3k.github.io/issues/73 -> 69.media.tumblr.com chicdn.net
grep -v '^#' temp/all_tld.temp.list | awk '{print tolower($0)}' > temp/all_tld.list dl https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.2o7Net/hosts rules_hosts/add2o7.cache.txt
dl https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt rules_hosts/spy.cache.txt
# dl https://raw.githubusercontent.com/Kees1958/WS3_annual_most_used_survey_blocklist/master/w3tech_hostfile.txt rules/w3tech.cache.list
# False positives: agreements.apple.com -> edgekey.net
# dl https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt rules_hosts/ads-and-tracking-extended.cache.txt # Lots of false-positives
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/android-tracking.txt rules_hosts/android-tracking.cache.txt
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/SmartTV.txt rules_hosts/smart-tv.cache.txt
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/AmazonFireTV.txt rules_hosts/amazon-fire-tv.cache.txt
log "Retrieving nameservers…" log "Retrieving nameservers…"
dl https://public-dns.info/nameservers.txt nameservers/public-dns.cache.list rm -f nameservers
touch nameservers
[ -f nameservers.head ] && cat nameservers.head >> nameservers
dl https://public-dns.info/nameservers.txt nameservers.temp
sort -R nameservers.temp >> nameservers
rm nameservers.temp
log "Retrieving top subdomains…" log "Retrieving top subdomains…"
dl http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip top-1m.csv.zip dl http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip top-1m.csv.zip
@ -38,8 +46,9 @@ rm top-1m.csv top-1m.csv.zip
if [ -f subdomains/cisco-umbrella_popularity.cache.list ] if [ -f subdomains/cisco-umbrella_popularity.cache.list ]
then then
cp subdomains/cisco-umbrella_popularity.cache.list temp/cisco-umbrella_popularity.old.list cp subdomains/cisco-umbrella_popularity.cache.list temp/cisco-umbrella_popularity.old.list
pv -f temp/cisco-umbrella_popularity.old.list temp/cisco-umbrella_popularity.fresh.list | sort -u > subdomains/cisco-umbrella_popularity.cache.list pv temp/cisco-umbrella_popularity.old.list temp/cisco-umbrella_popularity.fresh.list | sort -u > subdomains/cisco-umbrella_popularity.cache.list
rm temp/cisco-umbrella_popularity.old.list temp/cisco-umbrella_popularity.fresh.list rm temp/cisco-umbrella_popularity.old.list temp/cisco-umbrella_popularity.fresh.list
else else
mv temp/cisco-umbrella_popularity.fresh.list subdomains/cisco-umbrella_popularity.cache.list mv temp/cisco-umbrella_popularity.fresh.list subdomains/cisco-umbrella_popularity.cache.list
fi fi
dl https://www.orwell1984.today/cname/eulerian.net.txt subdomains/orwell-eulerian-cname-list.cache.list

160
filter_subdomains.py Executable file

@ -0,0 +1,160 @@
#!/usr/bin/env python3
# pylint: disable=C0103
"""
From a list of subdomains, output only
the ones resolving to a first-party tracker.
"""
import argparse
import sys
import progressbar
import csv
import typing
import ipaddress
# DomainRule = typing.Union[bool, typing.Dict[str, 'DomainRule']]
DomainRule = typing.Union[bool, typing.Dict]
# IpRule = typing.Union[bool, typing.Dict[int, 'DomainRule']]
IpRule = typing.Union[bool, typing.Dict]
RULES_DICT: DomainRule = dict()
RULES_IP_DICT: IpRule = dict()
def get_bits(address: ipaddress.IPv4Address) -> typing.Iterator[int]:
for char in address.packed:
for i in range(7, -1, -1):
yield (char >> i) & 0b1
def subdomain_matching(subdomain: str) -> bool:
parts = subdomain.split('.')
parts.reverse()
dic = RULES_DICT
for part in parts:
if isinstance(dic, bool) or part not in dic:
break
dic = dic[part]
if isinstance(dic, bool):
return dic
return False
def ip_matching(ip_str: str) -> bool:
ip = ipaddress.ip_address(ip_str)
dic = RULES_IP_DICT
i = 0
for bit in get_bits(ip):
i += 1
if isinstance(dic, bool) or bit not in dic:
break
dic = dic[bit]
if isinstance(dic, bool):
return dic
return False
def get_matching(chain: typing.List[str], no_explicit: bool = False
) -> typing.Iterable[str]:
if len(chain) <= 1:
return
initial = chain[0]
cname_destinations = chain[1:-1]
a_destination = chain[-1]
initial_matching = subdomain_matching(initial)
if no_explicit and initial_matching:
return
cname_matching = any(map(subdomain_matching, cname_destinations))
if cname_matching or initial_matching or ip_matching(a_destination):
yield initial
def register_rule(subdomain: str) -> None:
# Make a tree with domain parts
parts = subdomain.split('.')
parts.reverse()
dic = RULES_DICT
last_part = len(parts) - 1
for p, part in enumerate(parts):
if isinstance(dic, bool):
return
if p == last_part:
dic[part] = True
else:
dic.setdefault(part, dict())
dic = dic[part]
def register_rule_ip(network: str) -> None:
net = ipaddress.ip_network(network)
ip = net.network_address
dic = RULES_IP_DICT
last_bit = net.prefixlen - 1
for b, bit in enumerate(get_bits(ip)):
if isinstance(dic, bool):
return
if b == last_bit:
dic[bit] = True
else:
dic.setdefault(bit, dict())
dic = dic[bit]
if __name__ == '__main__':
# Parsing arguments
parser = argparse.ArgumentParser(
description="Filter first-party trackers from a list of subdomains")
parser.add_argument(
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
help="Input file with DNS chains")
parser.add_argument(
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
help="Output file with one tracking subdomain per line")
parser.add_argument(
'-n', '--no-explicit', action='store_true',
help="Don't output domains already blocked with rules without CNAME")
parser.add_argument(
'-r', '--rules', type=argparse.FileType('r'),
help="List of domains to block (with their subdomains)")
parser.add_argument(
'-p', '--rules-ip', type=argparse.FileType('r'),
help="List of IP ranges to block")
args = parser.parse_args()
# Progress bar
widgets = [
progressbar.Percentage(),
' ', progressbar.SimpleProgress(),
' ', progressbar.Bar(),
' ', progressbar.Timer(),
' ', progressbar.AdaptiveTransferSpeed(unit='req'),
' ', progressbar.AdaptiveETA(),
]
progress = progressbar.ProgressBar(widgets=widgets)
# Reading rules
if args.rules:
for rule in args.rules:
register_rule(rule.strip())
if args.rules_ip:
for rule in args.rules_ip:
register_rule_ip(rule.strip())
# Approximating line count
if args.input.seekable():
lines = 0
for line in args.input:
lines += 1
progress.max_value = lines
args.input.seek(0)
# Reading domains to filter
reader = csv.reader(args.input)
progress.start()
for chain in reader:
for match in get_matching(chain, no_explicit=args.no_explicit):
print(match, file=args.output)
progress.update(progress.value + 1)
progress.finish()
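
`register_rule` and `register_rule_ip` build the same kind of prefix tree: nested dicts keyed by reversed domain labels or by address bits, with `True` as a leaf meaning "everything below matches". The sketch below condenses both into two generic helpers to show the idea. The names `register`, `matches`, `domain_keys` and `ip_bits` are hypothetical, and the rules are placeholder values.

```python
import ipaddress
import typing


def register(tree: typing.Dict, keys: typing.Sequence) -> None:
    """Insert a rule; its last key becomes a True leaf covering all descendants."""
    node: typing.Union[bool, typing.Dict] = tree
    for i, key in enumerate(keys):
        if isinstance(node, bool):
            return  # a broader rule already covers this one
        if i == len(keys) - 1:
            node[key] = True
        else:
            node = node.setdefault(key, dict())


def matches(tree: typing.Dict, keys: typing.Sequence) -> bool:
    """Walk the tree along keys; match only if a True leaf is reached."""
    node: typing.Union[bool, typing.Dict] = tree
    for key in keys:
        if isinstance(node, bool) or key not in node:
            break
        node = node[key]
    return node if isinstance(node, bool) else False


def domain_keys(name: str) -> typing.List[str]:
    # Most significant label first: "a.b.example" -> ["example", "b", "a"].
    return name.split(".")[::-1]


def ip_bits(packed: bytes, length: int) -> typing.List[int]:
    # Most significant bit first, truncated to the prefix length.
    bits = [(byte >> i) & 1 for byte in packed for i in range(7, -1, -1)]
    return bits[:length]


# Placeholder rules: one zone and one IPv4 network.
domains: typing.Dict = {}
register(domains, domain_keys("tracker.example"))

networks: typing.Dict = {}
net = ipaddress.ip_network("192.0.2.0/24")
register(networks, ip_bits(net.network_address.packed, net.prefixlen))


def ip_matches(address: str) -> bool:
    return matches(networks, ip_bits(ipaddress.ip_address(address).packed, 32))
```

A lookup costs at most one dict access per label or per bit, regardless of how many rules are loaded.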

66
filter_subdomains.sh Executable file

@ -0,0 +1,66 @@
#!/usr/bin/env bash
function log() {
echo -e "\033[33m$@\033[0m"
}
log "Pruning old data…"
./database.py --prune
log "Recounting references…"
./database.py --references
log "Exporting lists…"
./export.py --first-party --output dist/firstparty-trackers.txt
./export.py --first-party --end-chain --output dist/firstparty-only-trackers.txt
./export.py --output dist/multiparty-trackers.txt
./export.py --end-chain --output dist/multiparty-only-trackers.txt
log "Generating hosts lists…"
./export.py --rules --count --first-party > temp/count_rules_firstparty.txt
./export.py --rules --count > temp/count_rules_multiparty.txt
function generate_hosts {
basename="$1"
description="$2"
description2="$3"
(
echo "# First-party trackers host list"
echo "# $description"
echo "# $description2"
echo "#"
echo "# About first-party trackers: https://git.frogeye.fr/geoffrey/eulaurarien#whats-a-first-party-tracker"
echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
echo "#"
echo "# In case of false positives/negatives, or any other question,"
echo "# contact me the way you like: https://geoffrey.frogeye.fr"
echo "#"
echo "# Latest version:"
echo "# - First-party trackers : https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt"
echo "# - … excluding redirected: https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt"
echo "# - First and third party : https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt"
echo "# - … excluding redirected: https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt"
echo '# (you can remove `-hosts` to get the raw list)'
echo "#"
echo "# Generation date: $(date -Isec)"
echo "# Generation software: eulaurarien $(git describe --tags)"
echo "# Number of source websites: $(wc -l temp/all_websites.list | cut -d' ' -f1)"
echo "# Number of source subdomains: $(wc -l temp/all_subdomains.list | cut -d' ' -f1)"
echo "# Number of source DNS records: ~2M + $(wc -l temp/all_resolved.json | cut -d' ' -f1)"
echo "#"
echo "# Known first-party trackers: $(cat temp/count_rules_firstparty.txt)"
echo "# Number of first-party hostnames: $(wc -l dist/firstparty-trackers.txt | cut -d' ' -f1)"
echo "# … excluding redirected: $(wc -l dist/firstparty-only-trackers.txt | cut -d' ' -f1)"
echo "#"
echo "# Known multi-party trackers: $(cat temp/count_rules_multiparty.txt)"
echo "# Number of multi-party hostnames: $(wc -l dist/multiparty-trackers.txt | cut -d' ' -f1)"
echo "# … excluding redirected: $(wc -l dist/multiparty-only-trackers.txt | cut -d' ' -f1)"
echo
sed 's|^|0.0.0.0 |' "dist/$basename.txt"
) > "dist/$basename-hosts.txt"
}
generate_hosts "firstparty-trackers" "Generated from a curated list of first-party trackers" ""
generate_hosts "firstparty-only-trackers" "Generated from a curated list of first-party trackers" "Only contains the first chain of redirection."
generate_hosts "multiparty-trackers" "Generated from known third-party trackers." "Also contains trackers used as third-party."
generate_hosts "multiparty-only-trackers" "Generated from known third-party trackers." "Does not contain trackers used as third-party. Use in combination with third-party lists."


@ -1,25 +0,0 @@
#!/usr/bin/env python3
import markdown2
extras = ["header-ids"]
with open("dist/README.md", "r") as fdesc:
body = markdown2.markdown(fdesc.read(), extras=extras)
output = f"""<!DOCTYPE html>
<html lang="en">
<head>
<title>Geoffrey Frogeye's block list of first-party trackers</title>
<meta charset="utf-8">
<meta name="author" content="Geoffrey 'Frogeye' Preud'homme" />
<link rel="stylesheet" type="text/css" href="markdown7.min.css">
</head>
<body>
{body}
</body>
</html>
"""
with open("dist/index.html", "w") as fdesc:
fdesc.write(output)


@ -5,12 +5,12 @@ function log() {
} }
log "Importing rules…" log "Importing rules…"
date +%s > "last_updates/rules.txt" BEFORE="$(date +%s)"
cat rules_adblock/*.txt | grep -v '^!' | grep -v '^\[Adblock' | ./adblock_to_domain_list.py | ./feed_rules.py zone # cat rules_adblock/*.txt | grep -v '^!' | grep -v '^\[Adblock' | ./adblock_to_domain_list.py | ./feed_rules.py zone
cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 | ./feed_rules.py zone # cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 | ./feed_rules.py zone
cat rules/*.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone # cat rules/*.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone
cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network # cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network
cat rules_asn/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn # cat rules_asn/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn
cat rules/first-party.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone --first-party cat rules/first-party.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone --first-party
cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network --first-party cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network --first-party
@ -18,3 +18,5 @@ cat rules_asn/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py as
./feed_asn.py ./feed_asn.py
log "Pruning old rules…"
./db.py --prune --prune-before "$BEFORE" --prune-base
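
Every import line in this script uses the same filter, `grep -v '^#' | grep -v '^$'`, to drop comment lines and empty lines before the rules reach `feed_rules.py`; the hosts variant additionally keeps the second space-separated field with `cut -d ' ' -f2`. The Python sketch below mirrors that behaviour under those assumptions. The function names `rule_lines` and `hosts_second_field` are hypothetical and do not exist in the repository.

```python
from typing import Iterable, Iterator


def rule_lines(lines: Iterable[str], comment_prefix: str = "#") -> Iterator[str]:
    """Yield lines that are neither empty nor comments.

    Mirrors: grep -v '^#' | grep -v '^$'
    Only a prefix at column 0 counts as a comment, as with the '^#' pattern.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if line and not line.startswith(comment_prefix):
            yield line


def hosts_second_field(lines: Iterable[str]) -> Iterator[str]:
    """Yield the second space-separated field of each hosts-style rule.

    Mirrors: cut -d ' ' -f2 (a line without any space is passed through whole,
    which is what cut does when the -s flag is not given).
    """
    for line in rule_lines(lines):
        fields = line.split(" ")
        yield fields[1] if len(fields) > 1 else line
```

For the Adblock lists, the same function could be called with `comment_prefix="!"`; the extra `'^\[Adblock'` header filter is not covered by this sketch.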


@ -1 +0,0 @@
*.txt


@ -1,2 +0,0 @@
*.custom.list
*.cache.list


@ -1,24 +0,0 @@
8.8.8.8
8.8.4.4
2001:4860:4860:0:0:0:0:8888
2001:4860:4860:0:0:0:0:8844
208.67.222.222
208.67.220.220
2620:119:35::35
2620:119:53::53
4.2.2.1
4.2.2.2
8.26.56.26
8.20.247.20
84.200.69.80
84.200.70.40
2001:1608:10:25:0:0:1c04:b12f
2001:1608:10:25:0:0:9249:d69b
9.9.9.10
149.112.112.10
2620:fe::10
2620:fe::fe:10
1.1.1.1
1.0.0.1
2606:4700:4700::1111
2606:4700:4700::1001

22
new_workflow.sh Executable file

@ -0,0 +1,22 @@
#!/usr/bin/env bash
function log() {
echo -e "\033[33m$@\033[0m"
}
./fetch_resources.sh
./import_rules.sh
# TODO Fetch 'em
log "Reading PTR records…"
pv ptr.json.gz | gunzip | ./feed_dns.py
log "Reading A records…"
pv a.json.gz | gunzip | ./feed_dns.py
log "Reading CNAME records…"
pv cname.json.gz | gunzip | ./feed_dns.py
log "Pruning old data…"
./database.py --prune
./filter_subdomains.sh


@ -1,9 +0,0 @@
#!/usr/bin/env bash
function log() {
echo -e "\033[33m$@\033[0m"
}
oldest="$(cat last_updates/*.txt | sort -n | head -1)"
log "Pruning every record before ${oldest}"
./db.py --prune --prune-before "$oldest"


@ -1,4 +0,0 @@
coloredlogs>=10
markdown2>=2.4<3
numpy>=1.21<2
python-abp>=0.2<0.3


@ -1,24 +1,12 @@
#!/usr/bin/env bash #!/usr/bin/env bash
source .env.default
source .env
function log() { function log() {
echo -e "\033[33m$@\033[0m" echo -e "\033[33m$@\033[0m"
} }
log "Compiling nameservers…" log "Compiling locally known subdomain…"
pv -f nameservers/*.list | ./validate_list.py --ip4 | sort -u > temp/all_nameservers_ip4.list
log "Compiling subdomains…"
# Sort by last character to utilize the DNS server caching mechanism # Sort by last character to utilize the DNS server caching mechanism
# (not as efficient with massdns but it's almost free so why not) pv subdomains/*.list | sed 's/\r$//' | rev | sort -u | rev > temp/all_subdomains.list
pv -f subdomains/*.list | ./validate_list.py --domain | rev | sort -u | rev > temp/all_subdomains.list log "Resolving locally known subdomain…"
pv temp/all_subdomains.list | ./resolve_subdomains.py --output temp/all_resolved.csv
log "Resolving subdomain…"
date +%s > "last_updates/massdns.txt"
"$MASSDNS_BINARY" --output Snrql --hashmap-size "$MASSDNS_HASHMAP_SIZE" --resolvers temp/all_nameservers_ip4.list --outfile temp/all_resolved.txt temp/all_subdomains.list
log "Importing into database…"
[ $SINGLE_PROCESS -eq 1 ] && EXTRA_ARGS="--single-process"
pv -f temp/all_resolved.txt | ./feed_dns.py massdns --ip4-cache "$CACHE_SIZE" $EXTRA_ARGS
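
The `rev | sort -u | rev` pipeline in this script sorts hostnames by their reversed spelling, so that names sharing a suffix (and therefore likely the same zone) end up next to each other, which is what the "DNS server caching" comment refers to. A minimal Python sketch of the same ordering follows. `sort_by_suffix` is a hypothetical name, and the sketch compares by code point, which matches `sort` only under a C-like locale.

```python
from typing import Iterable, List


def sort_by_suffix(hostnames: Iterable[str]) -> List[str]:
    """Deduplicate hostnames and order them by their reversed spelling.

    Roughly equivalent to: rev | sort -u | rev
    (assumption: byte-wise collation, as with LC_ALL=C).
    """
    return sorted(set(hostnames), key=lambda name: name[::-1])
```

With this ordering, `a.example.com` and `b.example.com` become neighbours, while `a.example.org` is grouped with other `.org` names.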


@ -12,80 +12,13 @@ storetail.io
# Keyade # Keyade
keyade.com keyade.com
# Adobe Experience Cloud # Adobe Experience Cloud
# https://experienceleague.adobe.com/docs/analytics/implementation/vars/config-vars/trackingserversecure.html?lang=en#ssl-tracking-server-in-adobe-experience-platform-launch
omtrdc.net omtrdc.net
2o7.net 2o7.net
data.adobedc.net # ThreatMetrix
sc.adobedc.net online-metrix.net
# Webtrekk # Webtrekk
wt-eu02.net wt-eu02.net
webtrekk.net
# Otto Group # Otto Group
oghub.io oghub.io
# Intent Media # ???
partner.intentmedia.net partner.intentmedia.net
# Wizaly
wizaly.com
# Commanders Act
tagcommander.com
# Ingenious Technologies
affex.org
# TraceDock
a351fec2c318c11ea9b9b0a0ae18fb0b-1529426863.eu-central-1.elb.amazonaws.com
a5e652663674a11e997c60ac8a4ec150-1684524385.eu-central-1.elb.amazonaws.com
a88045584548111e997c60ac8a4ec150-1610510072.eu-central-1.elb.amazonaws.com
afc4d9aa2a91d11e997c60ac8a4ec150-2082092489.eu-central-1.elb.amazonaws.com
# A8
trck.a8.net
# AD EBiS
# https://prtimes.jp/main/html/rd/p/000000215.000009812.html
ebis.ne.jp
# GENIEE
genieesspv.jp
# SP-Prod
sp-prod.net
# Act-On Software
actonsoftware.com
actonservice.com
# eum-appdynamics.com
eum-appdynamics.com
# Extole
extole.io
extole.com
# Eloqua
hs.eloqua.com
# segment.com
xid.segment.com
# exponea.com
exponea.com
# adclear.net
adclear.net
# contentsfeed.com
contentsfeed.com
# postaffiliatepro.com
postaffiliatepro.com
# Sugar Market (Salesfusion)
msgapp.com
# Exactag
exactag.com
# GMO Internet Group
ad-cloud.jp
# Pardot
pardot.com
# Fathom
# https://usefathom.com/docs/settings/custom-domains
starman.fathomdns.com
# Lead Forensics
# https://www.reddit.com/r/pihole/comments/g7qv3e/leadforensics_tracking_domains_blacklist/
# No real-world data but the website doesn't hide what it does
ghochv3eng.trafficmanager.net
# Branch.io
thirdparty.bnc.lt
# Plausible.io
custom.plausible.io
# DataUnlocker
# Bit different as it is a proxy to non first-party trackers scripts
# but it fits I guess.
smartproxy.dataunlocker.com
# SAS
ci360.sas.com


@ -4,7 +4,7 @@ AS50234
AS44788 AS44788
AS19750 AS19750
AS55569 AS55569
# ThreatMetrix
AS30286
# Webtrekk # Webtrekk
AS60164 AS60164
# Act-On Software
AS393648

0
rules_ip/first-party.txt Normal file

@ -1,75 +0,0 @@
#!/usr/bin/env python3
import database
import os
import logging
import csv
TESTS_DIR = "tests"
if __name__ == "__main__":
    DB = database.Database()
    log = logging.getLogger("tests")
    for filename in os.listdir(TESTS_DIR):
        if not filename.lower().endswith(".csv"):
            continue
        log.info("")
        log.info("Running tests from %s", filename)
        path = os.path.join(TESTS_DIR, filename)
        with open(path, "rt") as fdesc:
            count_ent = 0
            count_all = 0
            count_den = 0
            pass_ent = 0
            pass_all = 0
            pass_den = 0
            reader = csv.DictReader(fdesc)
            for test in reader:
                log.debug("Testing %s (%s)", test["url"], test["comment"])
                count_ent += 1
                passed = True
                for allow in test["allow"].split(":"):
                    if not allow:
                        continue
                    count_all += 1
                    if any(DB.get_domain(allow)):
                        log.error("False positive: %s", allow)
                        passed = False
                    else:
                        pass_all += 1
                for deny in test["deny"].split(":"):
                    if not deny:
                        continue
                    count_den += 1
                    if not any(DB.get_domain(deny)):
                        log.error("False negative: %s", deny)
                        passed = False
                    else:
                        pass_den += 1
                if passed:
                    pass_ent += 1
            perc_ent = (100 * pass_ent / count_ent) if count_ent else 100
            perc_all = (100 * pass_all / count_all) if count_all else 100
            perc_den = (100 * pass_den / count_den) if count_den else 100
            log.info(
                (
                    "%s: Entries %d/%d (%.2f%%)"
                    " | Allow %d/%d (%.2f%%)"
                    "| Deny %d/%d (%.2f%%)"
                ),
                filename,
                pass_ent,
                count_ent,
                perc_ent,
                pass_all,
                count_all,
                perc_all,
                pass_den,
                count_den,
                perc_den,
            )
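
The test runner reads each CSV row's `allow` and `deny` columns as colon-separated hostname lists and skips empty entries. The sketch below isolates that parsing step so it can be tried without the `database` module. `split_expectations` is a hypothetical helper, not a function from the repository, and the sample rows are copied from the test files shown in this diff.

```python
import csv
import io
from typing import Dict, List, Optional


def split_expectations(row: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Split the colon-separated 'allow' and 'deny' cells into hostname lists.

    Empty cells (or cells missing from a short row) produce an empty list.
    """
    return {
        key: [name for name in (row.get(key) or "").split(":") if name]
        for key in ("allow", "deny")
    }


SAMPLE = (
    "url,allow,deny,comment\n"
    "https://www.red-by-sfr.fr/,static.s-sfr.fr,nrg.red-by-sfr.fr,Eulerian\n"
    "https://www.cbc.ca/,,smetrics.cbc.ca,2o7\n"
)

# One dict of expectations per CSV row, in file order.
rows = [split_expectations(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
```

Each resulting hostname would then be looked up in the database, with a hit on an `allow` name counted as a false positive and a miss on a `deny` name as a false negative.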

1
tests/.gitignore vendored

@ -1 +0,0 @@
*.cache.csv

View file

@ -1,6 +1,6 @@
url,allow,deny,comment url,white,black,comment
https://support.apple.com,support.apple.com,,EdgeKey / AkamaiEdge https://support.apple.com,support.apple.com,,EdgeKey / AkamaiEdge
https://www.pinterest.fr/,i.pinimg.com,,Cedexis https://www.pinterest.fr/,i.pinimg.com,,Cedexis
https://www.pinterest.fr/,i.pinimg.com,,Cedexis
https://www.tumblr.com/,66.media.tumblr.com,,ChiCDN https://www.tumblr.com/,66.media.tumblr.com,,ChiCDN
https://www.skype.com/fr/,www.skype.com,,TrafficManager https://www.skype.com/fr/,www.skype.com,,TrafficManager
https://www.mitsubishicars.com/,www.mitsubishicars.com,,Tracking domain as reverse DNS



@ -1,28 +1,7 @@
url,allow,deny,comment url,white,black,comment
https://www.red-by-sfr.fr/,static.s-sfr.fr,nrg.red-by-sfr.fr,Eulerian https://www.red-by-sfr.fr/,static.s-sfr.fr,nrg.red-by-sfr.fr,Eulerian
https://www.cbc.ca/,,smetrics.cbc.ca,2o7 | Ominuture | Adobe Experience Cloud https://www.cbc.ca/,,smetrics.cbc.ca,2o7 | Ominuture | Adobe Experience Cloud
https://www.discover.com/,,content.discover.com,ThreatMetrix
https://www.mytoys.de/,,web.mytoys.de,Webtrekk https://www.mytoys.de/,,web.mytoys.de,Webtrekk
https://www.baur.de/,,tp.baur.de,Otto Group https://www.baur.de/,,tp.baur.de,Otto Group
https://www.liligo.com/,,compare.liligo.com,??? https://www.liligo.com/,,compare.liligo.com,???
https://www.boulanger.com/,,tag.boulanger.fr,TagCommander
https://www.airfrance.fr/FR/,,tk.airfrance.fr,Wizaly
https://www.vsgamers.es/,,marketing.net.vsgamers.es,Affex
https://www.vacansoleil.fr/,,tdep.vacansoleil.fr,TraceDock
https://www.ozmall.co.jp/,,js.enhance.co.jp,GENIEE
https://www.thetimes.co.uk/,,cmp.thetimes.co.uk,SP-Prod
https://agilent.com/,,seahorseinfo.agilent.com,Act-On Software
https://halifax.co.uk/,,cem.halifax.co.uk,eum-appdynamics.com
https://www.reallygoodstuff.com/,,refer.reallygoodstuff.com,Extole
https://unity.com/,,eloqua-trackings.unity.com,Eloqua
https://www.notino.gr/,,api.campaigns.notino.com,Exponea
https://www.mytoys.de/,,0815.mytoys.de.adclear.net
https://www.imbc.com/,,ads.imbc.com.contentsfeed.com
https://www.cbdbiocare.com/,,affiliate.cbdbiocare.com,postaffiliatepro.com
https://www.seatadvisor.com/,,marketing.seatadvisor.com,Sugar Market (Salesfusion)
https://www.tchibo.de/,,tagm.tchibo.de,Exactag
https://www.bouygues-immobilier.com/,,go.bouygues-immobilier.fr,Pardot
https://caddyserver.com/,,mule.caddysever.com,Fathom
Reddit.com mail notifications,,click.redditmail.com,Branch.io
https://www.phpliveregex.com/,,yolo.phpliveregex.xom,Plausible.io
https://www.earthclassmail.com/,,1avhg3kanx9.www.earthclassmail.com,DataUnlocker
https://paulfredrick.com/,,execution-ci360.paulfredrick.com,SAS

Can't render this file because it has a wrong number of fields in line 18.
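
The rendering error above is consistent with rows such as `https://www.mytoys.de/,,0815.mytoys.de.adclear.net`, which carry three fields instead of the four named in the header (the `comment` column is missing). A small Python sketch for spotting such rows follows. `find_malformed_rows` is a hypothetical helper and the sample text is a shortened excerpt, so the reported line number differs from the one in the real file.

```python
import csv
import io
from typing import List, Tuple


def find_malformed_rows(text: str, expected_fields: int = 4) -> List[Tuple[int, List[str]]]:
    """Return (line number, row) for every non-empty row with the wrong field count."""
    reader = csv.reader(io.StringIO(text))
    malformed = []
    for row in reader:
        # csv.reader yields [] for blank lines; those are not reported.
        if row and len(row) != expected_fields:
            malformed.append((reader.line_num, row))
    return malformed


SAMPLE_CSV = (
    "url,allow,deny,comment\n"
    "https://www.notino.gr/,,api.campaigns.notino.com,Exponea\n"
    "https://www.mytoys.de/,,0815.mytoys.de.adclear.net\n"
)

problems = find_malformed_rows(SAMPLE_CSV)
```

Note that `csv.DictReader`, which the test runner uses, does not fail on short rows: missing columns are filled with `None`, so a check like this one has to be explicit.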


@ -1,35 +0,0 @@
#!/usr/bin/env python3
# pylint: disable=C0103
"""
Filter out invalid domain names
"""
import database
import argparse
import sys
if __name__ == '__main__':
    # Parsing arguments
    parser = argparse.ArgumentParser(
        description="Filter out invalid domain name/ip addresses from a list.")
    parser.add_argument(
        '-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
        help="Input file, one element per line")
    parser.add_argument(
        '-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
        help="Output file, one element per line")
    parser.add_argument(
        '-d', '--domain', action='store_true',
        help="Can be domain name")
    parser.add_argument(
        '-4', '--ip4', action='store_true',
        help="Can be IP4")
    args = parser.parse_args()

    for line in args.input:
        line = line[:-1].lower()
        if (args.domain and database.Database.validate_domain(line)) or \
                (args.ip4 and database.Database.validate_ip4address(line)):
            print(line, file=args.output)
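
The removed script delegates validation to `database.Database.validate_domain` and `validate_ip4address`, whose implementations are not part of this diff. As a stand-in for the `--ip4` mode only, the sketch below uses the standard-library `ipaddress` module. `is_ip4` and `filter_ip4` are hypothetical names, and their behaviour may differ from the repository's own validator. The sketch also strips the newline with `rstrip("\n")` rather than `line[:-1]`, which would cut a real character from a final line that has no trailing newline.

```python
import ipaddress
from typing import Iterable, List


def is_ip4(value: str) -> bool:
    """Return True if value parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True


def filter_ip4(lines: Iterable[str]) -> List[str]:
    """Keep only the lines that are valid IPv4 addresses, lowercased and stripped."""
    kept = []
    for raw in lines:
        line = raw.rstrip("\n").lower()
        if is_ip4(line):
            kept.append(line)
    return kept
```

Applied to the nameserver list removed in this diff, such a filter would keep entries like `8.8.8.8` and drop the IPv6 resolvers, which matches the `--ip4` usage in the old resolution script.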