Compare commits: 1 commit, master...newworkflo
Author | SHA1 | Date | |
---|---|---|---|
Geoffrey Frogeye | dcf39c9582 |
|
@ -1,5 +0,0 @@
|
||||||
CACHE_SIZE=536870912
|
|
||||||
MASSDNS_HASHMAP_SIZE=1000
|
|
||||||
PROFILE=0
|
|
||||||
SINGLE_PROCESS=0
|
|
||||||
MASSDNS_BINARY=massdns
|
|
5
.gitignore
vendored
|
@ -1,5 +1,4 @@
|
||||||
*.log
|
*.log
|
||||||
*.p
|
*.p
|
||||||
.env
|
nameservers
|
||||||
__pycache__
|
nameservers.head
|
||||||
explanations
|
|
||||||
|
|
21
LICENSE
|
@ -1,21 +0,0 @@
|
||||||
MIT License
|
|
||||||
|
|
||||||
Copyright (c) 2019 Geoffrey 'Frogeye' Preud'homme
|
|
||||||
|
|
||||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
||||||
of this software and associated documentation files (the "Software"), to deal
|
|
||||||
in the Software without restriction, including without limitation the rights
|
|
||||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
||||||
copies of the Software, and to permit persons to whom the Software is
|
|
||||||
furnished to do so, subject to the following conditions:
|
|
||||||
|
|
||||||
The above copyright notice and this permission notice shall be included in all
|
|
||||||
copies or substantial portions of the Software.
|
|
||||||
|
|
||||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
||||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
||||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
||||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
||||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
||||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
||||||
SOFTWARE.
|
|
194
README.md
|
@ -1,162 +1,92 @@
|
||||||
# eulaurarien
|
# eulaurarien
|
||||||
|
|
||||||
This program generates a list of every hostname that is a DNS redirection to an entry in a list of DNS zones and IP networks.
|
Generates a host list of first-party trackers for ad-blocking.
|
||||||
|
|
||||||
It is primarily used to generate [Geoffrey Frogeye's block list of first-party trackers](https://hostfiles.frogeye.fr) (learn about first-party trackers by following this link).
|
The latest list is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
|
||||||
|
|
||||||
If you want to contribute but don't want to create an account on this forge, contact me the way you like: <https://geoffrey.frogeye.fr>
|
**DISCLAIMER:** I'm by no means an expert on this subject so my vocabulary or other stuff might be wrong. Use at your own risk.
|
||||||
|
|
||||||
## How does this work
|
## What's a first-party tracker?
|
||||||
|
|
||||||
This program takes as input:
|
Traditionally, websites load tracker scripts directly.
|
||||||
|
For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users.
|
||||||
|
In order to block those, one can simply block the host `trackercompany.com`.
|
||||||
|
|
||||||
- Lists of hostnames to match
|
However, to circumvent this easy block, tracker companies made the websites using them load trackers from `somethingirelevant.website1.com`.
|
||||||
- Lists of DNS zones to match (a domain and its subdomains)
|
The latter is a DNS redirection to `website1.trackercompany.com`, which points directly to a server serving the tracking script.
|
||||||
- Lists of IP address / IP networks to match
|
Those are the first-party trackers.
|
||||||
- Lists of Autonomous System numbers to match
|
|
||||||
- An enormous quantity of DNS records
|
|
||||||
|
|
||||||
It outputs the hostnames that are a DNS redirection to any item in the provided lists.
|
Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since:
|
||||||
|
|
||||||
DNS records can be locally resolved from a list of subdomains using [MassDNS](https://github.com/blechschmidt/massdns).
|
1. Most ad-blockers don't support wildcards
|
||||||
|
2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com`
|
||||||
|
|
||||||
Those subdomains can either be provided as is, come from [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html), from your browsing history, or from analyzing the traffic a web browser makes when opening a URL (the program provides utilities to do all that).
|
So the only solution is to block every known `somethingirelevant.website1.com`-like subdomain, which is a lot.
|
||||||
|
That's where this script comes in, to generate a list of such subdomains.
|
||||||
|
|
||||||
|
## How does this script work
|
||||||
|
|
||||||
|
It takes as input a list of websites with trackers included.
|
||||||
|
So far, this list is manually-generated from the list of clients of such first-party trackers
|
||||||
|
(later we should use a general list of websites to be more exhaustive).
|
||||||
|
It opens each of those websites (just the homepage) in a web browser, and records the domains of the network requests the page makes.
|
||||||
|
|
||||||
|
Additionally, or alternatively, you can feed the script some browsing history and get domains from there.
|
||||||
|
|
||||||
|
It then finds the DNS redirections of those domains, and compares them with regexes of known tracking domains.
|
||||||
|
It finally outputs the matching ones.
|
||||||
|
|
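The matching step described above can be sketched roughly as follows. This is only an illustration, not the project's code: `TRACKER_PATTERNS`, `matching_subdomains` and the sample data are hypothetical names and values, and the real patterns are the ones kept in `regexes.py`.

```python
import re
import typing

# Hypothetical pattern; the real ones live in the project's regexes.py.
TRACKER_PATTERNS = [
    re.compile(r"^.+\.trackercompany\.com\.?$"),
]


def matching_subdomains(
    redirections: typing.Mapping[str, str],
) -> typing.Iterator[str]:
    """Yield hostnames whose redirection target matches a tracker pattern."""
    for hostname, target in redirections.items():
        if any(pattern.match(target) for pattern in TRACKER_PATTERNS):
            yield hostname


# Hypothetical, hand-written resolution results (hostname -> redirection target).
example = {
    "somethingirelevant.website1.com": "website1.trackercompany.com",
    "www.website1.com": "website1.com",
}
print(sorted(matching_subdomains(example)))
```

Only the first hostname is reported, since its redirection target falls under the tracker's domain.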
||||||
|
## Requirements
|
||||||
|
|
||||||
|
These are only needed to build the list; you can find an already-built list in the releases.
|
||||||
|
|
||||||
|
- Bash
|
||||||
|
- [Python 3.4+](https://www.python.org/)
|
||||||
|
- [progressbar2](https://pypi.org/project/progressbar2/)
|
||||||
|
- dnspython
|
||||||
|
- [A Python wrapper for re2](https://pypi.org/project/google-re2/) (optional, just speeds things up)
|
||||||
|
|
||||||
|
(if you don't want to collect the subdomains, you can skip the following)
|
||||||
|
|
||||||
|
- Firefox
|
||||||
|
- Selenium
|
||||||
|
- seleniumwire
|
||||||
|
|
||||||
## Usage
|
## Usage
|
||||||
|
|
||||||
Remember you can get an already generated and up-to-date list of first-party trackers from [here](https://hostfiles.frogeye.fr).
|
This is only if you want to build the list yourself.
|
||||||
|
If you just want to use the list, the latest build is available here: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
|
||||||
|
It was built using additional sources not included in this repository for privacy reasons.
|
||||||
|
|
||||||
The following is for the people wanting to build their own list.
|
### Add personal sources
|
||||||
|
|
||||||
### Requirements
|
The list of websites provided in this script is by no means exhaustive,
|
||||||
|
so adding your own browsing history will help create a better list.
|
||||||
Depending on the sources you'll be using to generate the list, you'll need to install some of the following:
|
|
||||||
|
|
||||||
- [Bash](https://www.gnu.org/software/bash/bash.html)
|
|
||||||
- [Coreutils](https://www.gnu.org/software/coreutils/)
|
|
||||||
- [Gawk](https://www.gnu.org/software/gawk/)
|
|
||||||
- [curl](https://curl.haxx.se)
|
|
||||||
- [pv](http://www.ivarch.com/programs/pv.shtml)
|
|
||||||
- [Python 3.4+](https://www.python.org/)
|
|
||||||
- [coloredlogs](https://pypi.org/project/coloredlogs/) (sorry I can't help myself)
|
|
||||||
- [numpy](https://www.numpy.org/)
|
|
||||||
- [python-abp](https://pypi.org/project/python-abp/) (only if you intend to use AdBlock rules as a rule source)
|
|
||||||
- [massdns](https://github.com/blechschmidt/massdns) in your `$PATH` (only if you have subdomains as a source)
|
|
||||||
- [Firefox](https://www.mozilla.org/firefox/) (only if you have websites as a source)
|
|
||||||
- [selenium (Python bindings)](https://pypi.python.org/pypi/selenium) (only if you have websites as a source)
|
|
||||||
- [selenium-wire](https://pypi.org/project/selenium-wire/) (only if you have websites as a source)
|
|
||||||
- [markdown2](https://pypi.org/project/markdown2/) (only if you intend to generate the index webpage)
|
|
||||||
|
|
||||||
### Create a new database
|
|
||||||
|
|
||||||
The so-called database (in the form of `blocking.p`) is a file storing all the matching entities (ASN, IPs, hostnames, zones…) and every entity leading to it.
|
|
||||||
It exists because the list cannot be generated in one pass, as the links of a DNS redirection chain do not have to be input in order.
|
|
||||||
|
|
||||||
You can purge old records from the database by running `./prune.sh`.
|
|
||||||
When you remove a source of data, remove its corresponding file in `last_updates` to fix the pruning process.
|
|
||||||
|
|
||||||
### Gather external sources
|
|
||||||
|
|
||||||
External sources are not stored in this repository.
|
|
||||||
You'll need to fetch them by running `./fetch_resources.sh`.
|
|
||||||
Those include:
|
|
||||||
|
|
||||||
- Third-party trackers lists
|
|
||||||
- TLD lists (used to test the validity of hostnames)
|
|
||||||
- List of public DNS resolvers (for DNS resolving from subdomains)
|
|
||||||
- Top 1M subdomains
|
|
||||||
|
|
||||||
### Import rules into the database
|
|
||||||
|
|
||||||
You need to put the lists of rules for matching in the different subfolders:
|
|
||||||
|
|
||||||
- `rules`: Lists of DNS zones
|
|
||||||
- `rules_ip`: Lists of IP networks (for IP addresses append `/32`)
|
|
||||||
- `rules_asn`: Lists of Autonomous System numbers (IP ranges will be deduced from them)
|
|
||||||
- `rules_adblock`: Lists of DNS zones, but in the form of AdBlock lists (only the ones concerning domains will be extracted)
|
|
||||||
- `rules_hosts`: Lists of DNS zones, but in the form of hosts lists
|
|
||||||
|
|
||||||
See the provided examples for syntax.
|
|
||||||
|
|
||||||
In each folder:
|
|
||||||
|
|
||||||
- `first-party.ext` will be the only files considered for the first-party variant of the list
|
|
||||||
- `*.cache.ext` are from external sources, and thus might be deleted / overwritten
|
|
||||||
- `*.custom.ext` are for sources that you don't want committed
|
|
||||||
|
|
||||||
Then, run `./import_rules.sh`.
|
|
||||||
|
|
||||||
If you removed rules and you want to remove every record depending on those rules immediately,
|
|
||||||
run the following command:
|
|
||||||
|
|
||||||
```
|
|
||||||
./db.py --prune --prune-before "$(cat "last_updates/rules.txt")" --prune-base
|
|
||||||
```
|
|
||||||
|
|
||||||
### Add subdomains
|
|
||||||
|
|
||||||
If you plan to resolve DNS records yourself (as the DNS records datasets are not exhaustive),
|
|
||||||
the top 1M subdomains provided might not be enough.
|
|
||||||
|
|
||||||
You can add them into the `subdomains` folder.
|
|
||||||
It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.
|
|
||||||
|
|
||||||
#### Add personal sources
|
|
||||||
|
|
||||||
Adding your own browsing history will help create a better-suited subdomain list.
|
|
||||||
Here are reference commands for possible sources:
|
Here are reference commands for possible sources:
|
||||||
|
|
||||||
- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
|
- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
|
||||||
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`
|
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list; rm temp`
|
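For readers who prefer not to use the shell pipeline, the `rev | sed` part of the Firefox command can be sketched in Python. This is a minimal illustration; `host_from_rev_host` and the sample value are hypothetical and not part of the repository.

```python
def host_from_rev_host(rev_host: str) -> str:
    # Same effect as the shell pipeline: reverse the string,
    # then strip one leading dot if there is one.
    host = rev_host[::-1]
    return host[1:] if host.startswith(".") else host


# Hypothetical sample value in the reversed form the command expects.
print(host_from_rev_host("moc.elpmaxe.www."))
```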
||||||
|
|
||||||
#### Collect subdomains from websites
|
### Collect subdomains from websites
|
||||||
|
|
||||||
You can add the websites URLs into the `websites` folder.
|
Just run `collect_subdomain.sh`.
|
||||||
It follows the same conventions as the rules folders for `*.cache.ext` and `*.custom.ext` files.
|
|
||||||
|
|
||||||
Then, run `collect_subdomain.sh`.
|
|
||||||
This is a long step, and might be memory-intensive from time to time.
|
This is a long step, and might be memory-intensive from time to time.
|
||||||
|
|
||||||
> **Note:** For first-party tracking, a list of subdomains issued from the websites in the repository is available here: <https://hostfiles.frogeye.fr/from_websites.cache.list>
|
This step is optional if you already added personal sources.
|
||||||
|
Alternatively, you can just download the list of subdomains used to generate the official block list here: <https://hostfiles.frogeye.fr/from_websites.cache.list> (put it in the `subdomains` folder).
|
||||||
|
|
||||||
### Resolve DNS records
|
### Extract tracking domains
|
||||||
|
|
||||||
Once you've added subdomains, you'll need to resolve them to get their DNS records.
|
Make sure your system is configured with a DNS server without limitation.
|
||||||
The program will use a list of public nameservers to do that, but you can add your own in the `nameservers` directory.
|
Then, run `filter_subdomain.sh`.
|
||||||
|
The files you need will be in the folder `dist`.
|
||||||
|
|
||||||
Then, run `./resolve_subdomains.sh`.
|
## Contributing
|
||||||
Note that this is a network-intensive process, not in terms of bandwidth, but in terms of packet count.
|
|
||||||
|
|
||||||
> **Note:** Some VPS providers might detect this as a DDoS attack and cut the network access.
|
### Adding websites
|
||||||
> Some Wi-Fi connections can be rendered unusable for other uses, some routers might cease to work.
|
|
||||||
> Since massdns does not yet support rate limiting, my best bet was a Raspberry Pi with a slow ethernet link (Raspberry Pi < 4).
|
|
||||||
|
|
||||||
The DNS records will automatically be imported into the database.
|
Just add the URL to the relevant list: `websites/<source>.list`.
|
||||||
If you want to re-import the records without re-doing the resolving, just run the last line of the `./resolve_subdomains.sh` script.
|
|
||||||
|
|
||||||
### Export the lists
|
### Adding first-party trackers regex
|
||||||
|
|
||||||
For the tracking list, use `./export_lists.sh`, the output will be in the `dist` folder (please change the links before distributing them).
|
Just add them to `regexes.py`.
|
||||||
For other purposes, tinker with the `./export.py` program.
|
|
||||||
|
|
||||||
#### Explanations
|
|
||||||
|
|
||||||
Note that if you created an `explanations` folder at the root of the project, a file with a timestamp will be created in it.
|
|
||||||
It contains every rule in the database and the reason for their presence (i.e. their dependency).
|
|
||||||
This might be useful to track changes between runs.
|
|
||||||
|
|
||||||
Every rule has an associated tag with four components:
|
|
||||||
|
|
||||||
1. A number: the level of the rule (1 if it is a rule present in the `rules*` folders)
|
|
||||||
2. A letter: `F` if first-party, `M` if multi-party.
|
|
||||||
3. A letter: `D` if a duplicate (e.g. `foo.bar.com` if `*.bar.com` is already a rule), `_` if not.
|
|
||||||
4. A number: the number of rules relying on this one
|
|
||||||
|
|
||||||
### Generate the index webpage
|
|
||||||
|
|
||||||
This is the one served on <https://hostfiles.frogeye.fr>.
|
|
||||||
Just run `./generate_index.py`.
|
|
||||||
|
|
||||||
### Everything
|
|
||||||
|
|
||||||
Once you've made sure every step runs fine, you can use `./eulaurarien.sh` to run every step consecutively.
|
|
||||||
|
|
|
@ -16,36 +16,25 @@ import abp.filters
|
||||||
def get_domains(rule: abp.filters.parser.Filter) -> typing.Iterable[str]:
|
def get_domains(rule: abp.filters.parser.Filter) -> typing.Iterable[str]:
|
||||||
if rule.options:
|
if rule.options:
|
||||||
return
|
return
|
||||||
selector_type = rule.selector["type"]
|
selector_type = rule.selector['type']
|
||||||
selector_value = rule.selector["value"]
|
selector_value = rule.selector['value']
|
||||||
if (
|
if selector_type == 'url-pattern' \
|
||||||
selector_type == "url-pattern"
|
and selector_value.startswith('||') \
|
||||||
and selector_value.startswith("||")
|
and selector_value.endswith('^'):
|
||||||
and selector_value.endswith("^")
|
|
||||||
):
|
|
||||||
yield selector_value[2:-1]
|
yield selector_value[2:-1]
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == '__main__':
|
||||||
|
|
||||||
# Parsing arguments
|
# Parsing arguments
|
||||||
parser = argparse.ArgumentParser(
|
parser = argparse.ArgumentParser(
|
||||||
description="Extract whole domains from an AdBlock blocking list"
|
description="Extract whole domains from an AdBlock blocking list")
|
||||||
)
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-i",
|
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
|
||||||
"--input",
|
help="Input file with AdBlock rules")
|
||||||
type=argparse.FileType("r"),
|
|
||||||
default=sys.stdin,
|
|
||||||
help="Input file with AdBlock rules",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-o",
|
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
|
||||||
"--output",
|
help="Outptut file with one rule tracking subdomain per line")
|
||||||
type=argparse.FileType("w"),
|
|
||||||
default=sys.stdout,
|
|
||||||
help="Outptut file with one rule tracking subdomain per line",
|
|
||||||
)
|
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
# Reading rules
|
# Reading rules
|
||||||
|
|
|
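The check inside `get_domains` only keeps rules of the form `||domain^`. A minimal sketch of that test on plain strings, without the `abp` parser, might look like this. `domain_from_pattern` is a hypothetical helper, and the option and selector-type checks from the diff are left out.

```python
import typing


def domain_from_pattern(pattern: str) -> typing.Optional[str]:
    """Return the domain of a `||domain^` pattern, or None for anything else."""
    if pattern.startswith("||") and pattern.endswith("^"):
        # Drop the leading "||" and the trailing "^".
        return pattern[2:-1]
    return None


print(domain_from_pattern("||tracker.example^"))
print(domain_from_pattern("||tracker.example/path"))
```

Patterns that carry a path or lack the anchors are ignored, which matches the intent of extracting whole domains only.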
@ -14,28 +14,6 @@ import time
|
||||||
import progressbar
|
import progressbar
|
||||||
import selenium.webdriver.firefox.options
|
import selenium.webdriver.firefox.options
|
||||||
import seleniumwire.webdriver
|
import seleniumwire.webdriver
|
||||||
import logging
|
|
||||||
|
|
||||||
log = logging.getLogger("cs")
|
|
||||||
DRIVER = None
|
|
||||||
SCROLL_TIME = 10.0
|
|
||||||
SCROLL_STEPS = 100
|
|
||||||
SCROLL_CMD = f"window.scrollBy(0,document.body.scrollHeight/{SCROLL_STEPS})"
|
|
||||||
|
|
||||||
|
|
||||||
def new_driver() -> seleniumwire.webdriver.browser.Firefox:
|
|
||||||
profile = selenium.webdriver.FirefoxProfile()
|
|
||||||
profile.set_preference("privacy.trackingprotection.enabled", False)
|
|
||||||
profile.set_preference("network.cookie.cookieBehavior", 0)
|
|
||||||
profile.set_preference("privacy.trackingprotection.pbmode.enabled", False)
|
|
||||||
profile.set_preference("privacy.trackingprotection.cryptomining.enabled", False)
|
|
||||||
profile.set_preference("privacy.trackingprotection.fingerprinting.enabled", False)
|
|
||||||
options = selenium.webdriver.firefox.options.Options()
|
|
||||||
# options.add_argument('-headless')
|
|
||||||
driver = seleniumwire.webdriver.Firefox(
|
|
||||||
profile, executable_path="geckodriver", options=options
|
|
||||||
)
|
|
||||||
return driver
|
|
||||||
|
|
||||||
|
|
||||||
def subdomain_from_url(url: str) -> str:
|
def subdomain_from_url(url: str) -> str:
|
||||||
|
@ -51,36 +29,34 @@ def collect_subdomains(url: str) -> typing.Iterable[str]:
|
||||||
Load an URL into an headless browser and return all the domains
|
Load an URL into an headless browser and return all the domains
|
||||||
it tried to access.
|
it tried to access.
|
||||||
"""
|
"""
|
||||||
global DRIVER
|
options = selenium.webdriver.firefox.options.Options()
|
||||||
if not DRIVER:
|
options.add_argument('-headless')
|
||||||
DRIVER = new_driver()
|
driver = seleniumwire.webdriver.Firefox(
|
||||||
|
executable_path='geckodriver', options=options)
|
||||||
|
|
||||||
try:
|
driver.get(url)
|
||||||
DRIVER.get(url)
|
time.sleep(10)
|
||||||
for s in range(SCROLL_STEPS):
|
for request in driver.requests:
|
||||||
DRIVER.execute_script(SCROLL_CMD)
|
|
||||||
time.sleep(SCROLL_TIME / SCROLL_STEPS)
|
|
||||||
for request in DRIVER.requests:
|
|
||||||
if request.response:
|
if request.response:
|
||||||
yield subdomain_from_url(request.path)
|
yield subdomain_from_url(request.path)
|
||||||
except Exception:
|
driver.close()
|
||||||
log.exception("Error")
|
|
||||||
DRIVER.quit()
|
|
||||||
DRIVER = None
|
|
||||||
|
|
||||||
|
|
||||||
def collect_subdomains_standalone(url: str) -> None:
|
def collect_subdomains_standalone(url: str) -> None:
|
||||||
url = url.strip()
|
url = url.strip()
|
||||||
if not url:
|
if not url:
|
||||||
return
|
return
|
||||||
|
try:
|
||||||
for subdomain in collect_subdomains(url):
|
for subdomain in collect_subdomains(url):
|
||||||
print(subdomain)
|
print(subdomain)
|
||||||
|
except:
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == '__main__':
|
||||||
assert len(sys.argv) <= 2
|
assert len(sys.argv) <= 2
|
||||||
filename = None
|
filename = None
|
||||||
if len(sys.argv) == 2 and sys.argv[1] != "-":
|
if len(sys.argv) == 2 and sys.argv[1] != '-':
|
||||||
filename = sys.argv[1]
|
filename = sys.argv[1]
|
||||||
num_lines = sum(1 for line in open(filename))
|
num_lines = sum(1 for line in open(filename))
|
||||||
iterator = progressbar.progressbar(open(filename), max_value=num_lines)
|
iterator = progressbar.progressbar(open(filename), max_value=num_lines)
|
||||||
|
@ -90,8 +66,5 @@ if __name__ == "__main__":
|
||||||
for line in iterator:
|
for line in iterator:
|
||||||
collect_subdomains_standalone(line)
|
collect_subdomains_standalone(line)
|
||||||
|
|
||||||
if DRIVER:
|
|
||||||
DRIVER.quit()
|
|
||||||
|
|
||||||
if filename:
|
if filename:
|
||||||
iterator.close()
|
iterator.close()
|
||||||
|
|
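The body of `subdomain_from_url` is not shown in this diff. As a rough idea of what turning request URLs into hostnames can look like, here is a sketch using only the standard library. `hostname_of` and `unique_hostnames` are hypothetical names, and the de-duplication is an addition that the script itself does not appear to perform.

```python
import typing
import urllib.parse


def hostname_of(url: str) -> typing.Optional[str]:
    """Return the hostname part of a URL, or None if there is none."""
    return urllib.parse.urlparse(url).hostname


def unique_hostnames(urls: typing.Iterable[str]) -> typing.Iterator[str]:
    """Yield each hostname once, in order of first appearance."""
    seen: typing.Set[str] = set()
    for url in urls:
        host = hostname_of(url)
        if host and host not in seen:
            seen.add(host)
            yield host


# Hypothetical request URLs, standing in for what a page load might record.
requests = [
    "https://a.example/x.js",
    "https://b.example:8443/y",
    "https://a.example/z",
]
print(list(unique_hostnames(requests)))
```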
688
database.py
|
@ -9,36 +9,25 @@ import time
|
||||||
import logging
|
import logging
|
||||||
import coloredlogs
|
import coloredlogs
|
||||||
import pickle
|
import pickle
|
||||||
import numpy
|
|
||||||
import math
|
|
||||||
import os
|
|
||||||
|
|
||||||
TLD_LIST: typing.Set[str] = set()
|
coloredlogs.install(
|
||||||
|
level='DEBUG',
|
||||||
coloredlogs.install(level="DEBUG", fmt="%(asctime)s %(name)s %(levelname)s %(message)s")
|
fmt='%(asctime)s %(name)s %(levelname)s %(message)s'
|
||||||
|
)
|
||||||
|
|
||||||
Asn = int
|
Asn = int
|
||||||
Timestamp = int
|
Timestamp = int
|
||||||
Level = int
|
Level = int
|
||||||
|
|
||||||
|
|
||||||
class Path:
|
class Path():
|
||||||
|
# FP add boolean here
|
||||||
pass
|
pass
|
||||||
|
|
||||||
|
|
||||||
class RulePath(Path):
|
class RulePath(Path):
|
||||||
def __str__(self) -> str:
|
def __str__(self) -> str:
|
||||||
return "(rule)"
|
return '(rules)'
|
||||||
|
|
||||||
|
|
||||||
class RuleFirstPath(RulePath):
|
|
||||||
def __str__(self) -> str:
|
|
||||||
return "(first-party rule)"
|
|
||||||
|
|
||||||
|
|
||||||
class RuleMultiPath(RulePath):
|
|
||||||
def __str__(self) -> str:
|
|
||||||
return "(multi-party rule)"
|
|
||||||
|
|
||||||
|
|
||||||
class DomainPath(Path):
|
class DomainPath(Path):
|
||||||
|
@ -46,7 +35,7 @@ class DomainPath(Path):
|
||||||
self.parts = parts
|
self.parts = parts
|
||||||
|
|
||||||
def __str__(self) -> str:
|
def __str__(self) -> str:
|
||||||
return "?." + Database.unpack_domain(self)
|
return '?.' + Database.unpack_domain(self)
|
||||||
|
|
||||||
|
|
||||||
class HostnamePath(DomainPath):
|
class HostnamePath(DomainPath):
|
||||||
|
@ -56,7 +45,7 @@ class HostnamePath(DomainPath):
|
||||||
|
|
||||||
class ZonePath(DomainPath):
|
class ZonePath(DomainPath):
|
||||||
def __str__(self) -> str:
|
def __str__(self) -> str:
|
||||||
return "*." + Database.unpack_domain(self)
|
return '*.' + Database.unpack_domain(self)
|
||||||
|
|
||||||
|
|
||||||
class AsnPath(Path):
|
class AsnPath(Path):
|
||||||
|
@ -76,33 +65,33 @@ class Ip4Path(Path):
|
||||||
return Database.unpack_ip4network(self)
|
return Database.unpack_ip4network(self)
|
||||||
|
|
||||||
|
|
||||||
class Match:
|
class Match():
|
||||||
def __init__(self) -> None:
|
def __init__(self) -> None:
|
||||||
self.source: typing.Optional[Path] = None
|
|
||||||
self.updated: int = 0
|
self.updated: int = 0
|
||||||
self.dupplicate: bool = False
|
|
||||||
|
|
||||||
# Cache
|
|
||||||
self.level: int = 0
|
self.level: int = 0
|
||||||
self.first_party: bool = False
|
self.source: typing.Optional[Path] = None
|
||||||
self.references: int = 0
|
# FP dupplicate args
|
||||||
|
|
||||||
def active(self, first_party: bool = None) -> bool:
|
def set(self,
|
||||||
if self.updated == 0 or (first_party and not self.first_party):
|
updated: int,
|
||||||
return False
|
level: int,
|
||||||
return True
|
source: Path,
|
||||||
|
) -> None:
|
||||||
|
if updated > self.updated or level > self.level:
|
||||||
|
self.updated = updated
|
||||||
|
self.level = level
|
||||||
|
self.source = source
|
||||||
|
# FP dupplicate function
|
||||||
|
|
||||||
def disable(self) -> None:
|
def active(self) -> bool:
|
||||||
self.updated = 0
|
return self.updated > 0
|
||||||
|
|
||||||
|
|
||||||
class AsnNode(Match):
|
class AsnNode(Match):
|
||||||
def __init__(self) -> None:
|
pass
|
||||||
Match.__init__(self)
|
|
||||||
self.name = ""
|
|
||||||
|
|
||||||
|
|
||||||
class DomainTreeNode:
|
class DomainTreeNode():
|
||||||
def __init__(self) -> None:
|
def __init__(self) -> None:
|
||||||
self.children: typing.Dict[str, DomainTreeNode] = dict()
|
self.children: typing.Dict[str, DomainTreeNode] = dict()
|
||||||
self.match_zone = Match()
|
self.match_zone = Match()
|
||||||
|
@ -117,28 +106,21 @@ class IpTreeNode(Match):
|
||||||
|
|
||||||
|
|
||||||
Node = typing.Union[DomainTreeNode, IpTreeNode, AsnNode]
|
Node = typing.Union[DomainTreeNode, IpTreeNode, AsnNode]
|
||||||
MatchCallable = typing.Callable[[Path, Match], typing.Any]
|
MatchCallable = typing.Callable[[Path,
|
||||||
|
Match,
|
||||||
|
typing.Optional[typing.Any]],
|
||||||
|
typing.Any]
|
||||||
|
|
||||||
|
|
||||||
class Profiler:
|
class Profiler():
|
||||||
def __init__(self) -> None:
|
def __init__(self) -> None:
|
||||||
do_profile = int(os.environ.get("PROFILE", "0"))
|
self.log = logging.getLogger('profiler')
|
||||||
if do_profile:
|
|
||||||
self.log = logging.getLogger("profiler")
|
|
||||||
self.time_last = time.perf_counter()
|
self.time_last = time.perf_counter()
|
||||||
self.time_step = "init"
|
self.time_step = 'init'
|
||||||
self.time_dict: typing.Dict[str, float] = dict()
|
self.time_dict: typing.Dict[str, float] = dict()
|
||||||
self.step_dict: typing.Dict[str, int] = dict()
|
self.step_dict: typing.Dict[str, int] = dict()
|
||||||
self.enter_step = self.enter_step_real
|
|
||||||
self.profile = self.profile_real
|
|
||||||
else:
|
|
||||||
self.enter_step = self.enter_step_dummy
|
|
||||||
self.profile = self.profile_dummy
|
|
||||||
|
|
||||||
def enter_step_dummy(self, name: str) -> None:
|
def enter_step(self, name: str) -> None:
|
||||||
return
|
|
||||||
|
|
||||||
def enter_step_real(self, name: str) -> None:
|
|
||||||
now = time.perf_counter()
|
now = time.perf_counter()
|
||||||
try:
|
try:
|
||||||
self.time_dict[self.time_step] += now - self.time_last
|
self.time_dict[self.time_step] += now - self.time_last
|
||||||
|
@ -149,174 +131,86 @@ class Profiler:
|
||||||
self.time_step = name
|
self.time_step = name
|
||||||
self.time_last = time.perf_counter()
|
self.time_last = time.perf_counter()
|
||||||
|
|
||||||
def profile_dummy(self) -> None:
|
def profile(self) -> None:
|
||||||
return
|
self.enter_step('profile')
|
||||||
|
|
||||||
def profile_real(self) -> None:
|
|
||||||
self.enter_step("profile")
|
|
||||||
total = sum(self.time_dict.values())
|
total = sum(self.time_dict.values())
|
||||||
for key, secs in sorted(self.time_dict.items(), key=lambda t: t[1]):
|
for key, secs in sorted(self.time_dict.items(), key=lambda t: t[1]):
|
||||||
times = self.step_dict[key]
|
times = self.step_dict[key]
|
||||||
self.log.debug(
|
self.log.debug(f"{key:<20}: {times:9d} × {secs/times:5.3e} "
|
||||||
f"{key:<20}: {times:9d} × {secs/times:5.3e} "
|
f"= {secs:9.2f} s ({secs/total:7.2%}) ")
|
||||||
f"= {secs:9.2f} s ({secs/total:7.2%}) "
|
self.log.debug(f"{'total':<20}: "
|
||||||
)
|
f"{total:9.2f} s ({1:7.2%})")
|
||||||
self.log.debug(
|
|
||||||
f"{'total':<20}: " f"{total:9.2f} s ({1:7.2%})"
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
class Database(Profiler):
|
class Database(Profiler):
|
||||||
VERSION = 18
|
VERSION = 13
|
||||||
PATH = "blocking.p"
|
PATH = "blocking.p"
|
||||||
|
|
||||||
def initialize(self) -> None:
|
def initialize(self) -> None:
|
||||||
self.log.warning("Creating database version: %d ", Database.VERSION)
|
self.log.warning(
|
||||||
# Dummy match objects that everything refer to
|
"Creating database version: %d ",
|
||||||
self.rules: typing.List[Match] = list()
|
Database.VERSION)
|
||||||
for first_party in (False, True):
|
|
||||||
m = Match()
|
|
||||||
m.updated = 1
|
|
||||||
m.level = 0
|
|
||||||
m.first_party = first_party
|
|
||||||
self.rules.append(m)
|
|
||||||
self.domtree = DomainTreeNode()
|
self.domtree = DomainTreeNode()
|
||||||
self.asns: typing.Dict[Asn, AsnNode] = dict()
|
self.asns: typing.Dict[Asn, AsnNode] = dict()
|
||||||
self.ip4tree = IpTreeNode()
|
self.ip4tree = IpTreeNode()
|
||||||
|
|
||||||
def load(self) -> None:
|
def load(self) -> None:
|
||||||
self.enter_step("load")
|
self.enter_step('load')
|
||||||
try:
|
try:
|
||||||
with open(self.PATH, "rb") as db_fdsec:
|
with open(self.PATH, 'rb') as db_fdsec:
|
||||||
version, data = pickle.load(db_fdsec)
|
version, data = pickle.load(db_fdsec)
|
||||||
if version == Database.VERSION:
|
if version == Database.VERSION:
|
||||||
self.rules, self.domtree, self.asns, self.ip4tree = data
|
self.domtree, self.asns, self.ip4tree = data
|
||||||
return
|
return
|
||||||
self.log.warning(
|
self.log.warning(
|
||||||
"Outdated database version found: %d, " "it will be rebuilt.",
|
"Outdated database version found: %d, "
|
||||||
version,
|
"it will be rebuilt.",
|
||||||
)
|
version)
|
||||||
except (TypeError, AttributeError, EOFError):
|
except (TypeError, AttributeError, EOFError):
|
||||||
self.log.error(
|
self.log.error(
|
||||||
"Corrupt (or heavily outdated) database found, " "it will be rebuilt."
|
"Corrupt (or heavily outdated) database found, "
|
||||||
)
|
"it will be rebuilt.")
|
||||||
except FileNotFoundError:
|
except FileNotFoundError:
|
||||||
pass
|
pass
|
||||||
self.initialize()
|
self.initialize()
|
||||||
|
|
||||||
def save(self) -> None:
|
def save(self) -> None:
|
||||||
self.enter_step("save")
|
self.enter_step('save')
|
||||||
with open(self.PATH, "wb") as db_fdsec:
|
with open(self.PATH, 'wb') as db_fdsec:
|
||||||
data = self.rules, self.domtree, self.asns, self.ip4tree
|
data = self.domtree, self.asns, self.ip4tree
|
||||||
pickle.dump((self.VERSION, data), db_fdsec)
|
pickle.dump((self.VERSION, data), db_fdsec)
|
||||||
self.profile()
|
self.profile()
|
||||||
|
|
||||||
def __init__(self) -> None:
|
def __init__(self) -> None:
|
||||||
Profiler.__init__(self)
|
Profiler.__init__(self)
|
||||||
self.log = logging.getLogger("db")
|
self.log = logging.getLogger('db')
|
||||||
self.load()
|
self.load()
|
||||||
self.ip4cache_shift: int = 32
|
|
||||||
self.ip4cache = numpy.ones(1)
|
|
||||||
|
|
||||||
def _set_ip4cache(self, path: Path, _: Match) -> None:
|
|
||||||
assert isinstance(path, Ip4Path)
|
|
||||||
self.enter_step("set_ip4cache")
|
|
||||||
mini = path.value >> self.ip4cache_shift
|
|
||||||
maxi = (path.value + 2 ** (32 - path.prefixlen)) >> self.ip4cache_shift
|
|
||||||
if mini == maxi:
|
|
||||||
self.ip4cache[mini] = True
|
|
||||||
else:
|
|
||||||
self.ip4cache[mini:maxi] = True
|
|
||||||
|
|
||||||
def fill_ip4cache(self, max_size: int = 512 * 1024 ** 2) -> None:
|
|
||||||
"""
|
|
||||||
Size in bytes
|
|
||||||
"""
|
|
||||||
if max_size > 2 ** 32 / 8:
|
|
||||||
self.log.warning(
|
|
||||||
"Allocating more than 512 MiB of RAM for "
|
|
||||||
"the Ip4 cache is not necessary."
|
|
||||||
)
|
|
||||||
max_cache_width = int(math.log2(max(1, max_size * 8)))
|
|
||||||
allocated = False
|
|
||||||
cache_width = min(32, max_cache_width)
|
|
||||||
while not allocated:
|
|
||||||
cache_size = 2 ** cache_width
|
|
||||||
try:
|
|
||||||
self.ip4cache = numpy.zeros(cache_size, dtype=bool)
|
|
||||||
except MemoryError:
|
|
||||||
self.log.exception("Could not allocate cache. Retrying a smaller one.")
|
|
||||||
cache_width -= 1
|
|
||||||
continue
|
|
||||||
allocated = True
|
|
||||||
self.ip4cache_shift = 32 - cache_width
|
|
||||||
for _ in self.exec_each_ip4(self._set_ip4cache):
|
|
||||||
pass
|
|
||||||
|
|
||||||
@staticmethod
|
|
||||||
def populate_tld_list() -> None:
|
|
||||||
with open("temp/all_tld.list", "r") as tld_fdesc:
|
|
||||||
for tld in tld_fdesc:
|
|
||||||
tld = tld.strip()
|
|
||||||
TLD_LIST.add(tld)
|
|
||||||
|
|
||||||
@staticmethod
|
|
||||||
def validate_domain(path: str) -> bool:
|
|
||||||
if len(path) > 255:
|
|
||||||
return False
|
|
||||||
splits = path.split(".")
|
|
||||||
if not TLD_LIST:
|
|
||||||
Database.populate_tld_list()
|
|
||||||
if splits[-1] not in TLD_LIST:
|
|
||||||
return False
|
|
||||||
for split in splits:
|
|
||||||
if not 1 <= len(split) <= 63:
|
|
||||||
return False
|
|
||||||
return True
|
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def pack_domain(domain: str) -> DomainPath:
|
def pack_domain(domain: str) -> DomainPath:
|
||||||
return DomainPath(domain.split(".")[::-1])
|
return DomainPath(domain.split('.')[::-1])
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def unpack_domain(domain: DomainPath) -> str:
|
def unpack_domain(domain: DomainPath) -> str:
|
||||||
return ".".join(domain.parts[::-1])
|
return '.'.join(domain.parts[::-1])
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def pack_asn(asn: str) -> AsnPath:
|
def pack_asn(asn: str) -> AsnPath:
|
||||||
asn = asn.upper()
|
asn = asn.upper()
|
||||||
if asn.startswith("AS"):
|
if asn.startswith('AS'):
|
||||||
asn = asn[2:]
|
asn = asn[2:]
|
||||||
return AsnPath(int(asn))
|
return AsnPath(int(asn))
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def unpack_asn(asn: AsnPath) -> str:
|
def unpack_asn(asn: AsnPath) -> str:
|
||||||
return f"AS{asn.asn}"
|
return f'AS{asn.asn}'
|
||||||
|
|
||||||
@staticmethod
|
|
||||||
def validate_ip4address(path: str) -> bool:
|
|
||||||
splits = path.split(".")
|
|
||||||
if len(splits) != 4:
|
|
||||||
return False
|
|
||||||
for split in splits:
|
|
||||||
try:
|
|
||||||
if not 0 <= int(split) <= 255:
|
|
||||||
return False
|
|
||||||
except ValueError:
|
|
||||||
return False
|
|
||||||
return True
|
|
||||||
|
|
||||||
@staticmethod
|
|
||||||
def pack_ip4address_low(address: str) -> int:
|
|
||||||
addr = 0
|
|
||||||
for split in address.split("."):
|
|
||||||
octet = int(split)
|
|
||||||
addr = (addr << 8) + octet
|
|
||||||
return addr
|
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def pack_ip4address(address: str) -> Ip4Path:
|
def pack_ip4address(address: str) -> Ip4Path:
|
||||||
return Ip4Path(Database.pack_ip4address_low(address), 32)
|
addr = 0
|
||||||
|
for split in address.split('.'):
|
||||||
|
addr = (addr << 8) + int(split)
|
||||||
|
return Ip4Path(addr, 32)
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def unpack_ip4address(address: Ip4Path) -> str:
|
def unpack_ip4address(address: Ip4Path) -> str:
|
||||||
|
@ -327,26 +221,11 @@ class Database(Profiler):
|
||||||
for o in reversed(range(4)):
|
for o in reversed(range(4)):
|
||||||
octets[o] = addr & 0xFF
|
octets[o] = addr & 0xFF
|
||||||
addr >>= 8
|
addr >>= 8
|
||||||
return ".".join(map(str, octets))
|
return '.'.join(map(str, octets))
|
||||||
|
|
||||||
@staticmethod
|
|
||||||
def validate_ip4network(path: str) -> bool:
|
|
||||||
# A bit generous but ok for our usage
|
|
||||||
splits = path.split("/")
|
|
||||||
if len(splits) != 2:
|
|
||||||
return False
|
|
||||||
if not Database.validate_ip4address(splits[0]):
|
|
||||||
return False
|
|
||||||
try:
|
|
||||||
if not 0 <= int(splits[1]) <= 32:
|
|
||||||
return False
|
|
||||||
except ValueError:
|
|
||||||
return False
|
|
||||||
return True
|
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def pack_ip4network(network: str) -> Ip4Path:
|
def pack_ip4network(network: str) -> Ip4Path:
|
||||||
address, prefixlen_str = network.split("/")
|
address, prefixlen_str = network.split('/')
|
||||||
prefixlen = int(prefixlen_str)
|
prefixlen = int(prefixlen_str)
|
||||||
addr = Database.pack_ip4address(address)
|
addr = Database.pack_ip4address(address)
|
||||||
addr.prefixlen = prefixlen
|
addr.prefixlen = prefixlen
|
||||||
|
@ -360,13 +239,11 @@ class Database(Profiler):
|
||||||
for o in reversed(range(4)):
|
for o in reversed(range(4)):
|
||||||
octets[o] = addr & 0xFF
|
octets[o] = addr & 0xFF
|
||||||
addr >>= 8
|
addr >>= 8
|
||||||
return ".".join(map(str, octets)) + "/" + str(network.prefixlen)
|
return '.'.join(map(str, octets)) + '/' + str(network.prefixlen)
|
||||||
|
|
||||||
def get_match(self, path: Path) -> Match:
|
def get_match(self, path: Path) -> Match:
|
||||||
if isinstance(path, RuleMultiPath):
|
if isinstance(path, RulePath):
|
||||||
return self.rules[0]
|
return Match()
|
||||||
elif isinstance(path, RuleFirstPath):
|
|
||||||
return self.rules[1]
|
|
||||||
elif isinstance(path, AsnPath):
|
elif isinstance(path, AsnPath):
|
||||||
return self.asns[path.asn]
|
return self.asns[path.asn]
|
||||||
elif isinstance(path, DomainPath):
|
elif isinstance(path, DomainPath):
|
||||||
|
@ -391,374 +268,265 @@ class Database(Profiler):
|
||||||
else:
|
else:
|
||||||
raise ValueError
|
raise ValueError
|
||||||
|
|
||||||
def exec_each_asn(
|
def exec_each_domain(self,
|
||||||
self,
|
|
||||||
callback: MatchCallable,
|
|
||||||
) -> typing.Any:
|
|
||||||
for asn in self.asns:
|
|
||||||
match = self.asns[asn]
|
|
||||||
if match.active():
|
|
||||||
c = callback(
|
|
||||||
AsnPath(asn),
|
|
||||||
match,
|
|
||||||
)
|
|
||||||
try:
|
|
||||||
yield from c
|
|
||||||
except TypeError: # not iterable
|
|
||||||
pass
|
|
||||||
|
|
||||||
def exec_each_domain(
|
|
||||||
self,
|
|
||||||
callback: MatchCallable,
|
callback: MatchCallable,
|
||||||
|
arg: typing.Any = None,
|
||||||
_dic: DomainTreeNode = None,
|
_dic: DomainTreeNode = None,
|
||||||
_par: DomainPath = None,
|
_par: DomainPath = None,
|
||||||
) -> typing.Any:
|
) -> typing.Any:
|
||||||
_dic = _dic or self.domtree
|
_dic = _dic or self.domtree
|
||||||
_par = _par or DomainPath([])
|
_par = _par or DomainPath([])
|
||||||
if _dic.match_hostname.active():
|
if _dic.match_hostname.active():
|
||||||
c = callback(
|
yield from callback(
|
||||||
HostnamePath(_par.parts),
|
HostnamePath(_par.parts),
|
||||||
_dic.match_hostname,
|
_dic.match_hostname,
|
||||||
|
arg
|
||||||
)
|
)
|
||||||
try:
|
|
||||||
yield from c
|
|
||||||
except TypeError: # not iterable
|
|
||||||
pass
|
|
||||||
if _dic.match_zone.active():
|
if _dic.match_zone.active():
|
||||||
c = callback(
|
yield from callback(
|
||||||
ZonePath(_par.parts),
|
ZonePath(_par.parts),
|
||||||
_dic.match_zone,
|
_dic.match_zone,
|
||||||
|
arg
|
||||||
)
|
)
|
||||||
try:
|
|
||||||
yield from c
|
|
||||||
except TypeError: # not iterable
|
|
||||||
pass
|
|
||||||
for part in _dic.children:
|
for part in _dic.children:
|
||||||
dic = _dic.children[part]
|
dic = _dic.children[part]
|
||||||
yield from self.exec_each_domain(
|
yield from self.exec_each_domain(
|
||||||
callback, _dic=dic, _par=DomainPath(_par.parts + [part])
|
callback,
|
||||||
|
arg,
|
||||||
|
_dic=dic,
|
||||||
|
_par=DomainPath(_par.parts + [part])
|
||||||
)
|
)
|
||||||
|
|
||||||
def exec_each_ip4(
|
def exec_each_ip4(self,
|
||||||
self,
|
|
||||||
callback: MatchCallable,
|
callback: MatchCallable,
|
||||||
|
arg: typing.Any = None,
|
||||||
_dic: IpTreeNode = None,
|
_dic: IpTreeNode = None,
|
||||||
_par: Ip4Path = None,
|
_par: Ip4Path = None,
|
||||||
) -> typing.Any:
|
) -> typing.Any:
|
||||||
_dic = _dic or self.ip4tree
|
_dic = _dic or self.ip4tree
|
||||||
_par = _par or Ip4Path(0, 0)
|
_par = _par or Ip4Path(0, 0)
|
||||||
if _dic.active():
|
if _dic.active():
|
||||||
c = callback(
|
yield from callback(
|
||||||
_par,
|
_par,
|
||||||
_dic,
|
_dic,
|
||||||
|
arg
|
||||||
)
|
)
|
||||||
try:
|
|
||||||
yield from c
|
|
||||||
except TypeError: # not iterable
|
|
||||||
pass
|
|
||||||
|
|
||||||
# 0
|
# 0
|
||||||
pref = _par.prefixlen + 1
|
pref = _par.prefixlen + 1
|
||||||
dic = _dic.zero
|
dic = _dic.zero
|
||||||
if dic:
|
if dic:
|
||||||
# addr0 = _par.value & (0xFFFFFFFF ^ (1 << (32-pref)))
|
addr0 = _par.value & (0xFFFFFFFF ^ (1 << (32-pref)))
|
||||||
# assert addr0 == _par.value
|
assert addr0 == _par.value
|
||||||
addr0 = _par.value
|
yield from self.exec_each_ip4(
|
||||||
yield from self.exec_each_ip4(callback, _dic=dic, _par=Ip4Path(addr0, pref))
|
callback,
|
||||||
|
arg,
|
||||||
|
_dic=dic,
|
||||||
|
_par=Ip4Path(addr0, pref)
|
||||||
|
)
|
||||||
# 1
|
# 1
|
||||||
dic = _dic.one
|
dic = _dic.one
|
||||||
if dic:
|
if dic:
|
||||||
addr1 = _par.value | (1 << (32-pref))
|
addr1 = _par.value | (1 << (32-pref))
|
||||||
# assert addr1 != _par.value
|
yield from self.exec_each_ip4(
|
||||||
yield from self.exec_each_ip4(callback, _dic=dic, _par=Ip4Path(addr1, pref))
|
callback,
|
||||||
|
arg,
|
||||||
|
_dic=dic,
|
||||||
|
_par=Ip4Path(addr1, pref)
|
||||||
|
)
|
||||||
|
|
||||||
def exec_each(
|
def exec_each(self,
|
||||||
self,
|
|
||||||
callback: MatchCallable,
|
callback: MatchCallable,
|
||||||
|
arg: typing.Any = None,
|
||||||
) -> typing.Any:
|
) -> typing.Any:
|
||||||
yield from self.exec_each_domain(callback)
|
yield from self.exec_each_domain(callback)
|
||||||
yield from self.exec_each_ip4(callback)
|
yield from self.exec_each_ip4(callback)
|
||||||
yield from self.exec_each_asn(callback)
|
# TODO ASN
|
||||||
|
|
||||||
def update_references(self) -> None:
|
def update_references(self) -> None:
|
||||||
# Should be correctly calculated normally,
|
raise NotImplementedError
|
||||||
# keeping this just in case
|
|
||||||
def reset_references_cb(path: Path, match: Match) -> None:
|
|
||||||
match.references = 0
|
|
||||||
|
|
||||||
for _ in self.exec_each(reset_references_cb):
|
|
||||||
pass
|
|
||||||
|
|
||||||
def increment_references_cb(path: Path, match: Match) -> None:
|
|
||||||
if match.source:
|
|
||||||
source = self.get_match(match.source)
|
|
||||||
source.references += 1
|
|
||||||
|
|
||||||
for _ in self.exec_each(increment_references_cb):
|
|
||||||
pass
|
|
||||||
|
|
||||||
def _clean_deps(self) -> None:
|
|
||||||
# Disable the matches that depends on the targeted
|
|
||||||
# matches until all disabled matches reference count = 0
|
|
||||||
did_something = True
|
|
||||||
|
|
||||||
def clean_deps_cb(path: Path, match: Match) -> None:
|
|
||||||
nonlocal did_something
|
|
||||||
if not match.source:
|
|
||||||
return
|
|
||||||
source = self.get_match(match.source)
|
|
||||||
if not source.active():
|
|
||||||
self._unset_match(match)
|
|
||||||
elif match.first_party > source.first_party:
|
|
||||||
match.first_party = source.first_party
|
|
||||||
else:
|
|
||||||
return
|
|
||||||
did_something = True
|
|
||||||
|
|
||||||
while did_something:
|
|
||||||
did_something = False
|
|
||||||
self.enter_step("pass_clean_deps")
|
|
||||||
for _ in self.exec_each(clean_deps_cb):
|
|
||||||
pass
|
|
||||||
|
|
||||||
def prune(self, before: int, base_only: bool = False) -> None:
|
def prune(self, before: int, base_only: bool = False) -> None:
|
||||||
# Disable the matches targeted
|
raise NotImplementedError
|
||||||
def prune_cb(path: Path, match: Match) -> None:
|
|
||||||
if base_only and match.level > 1:
|
|
||||||
return
|
|
||||||
if match.updated > before:
|
|
||||||
return
|
|
||||||
self._unset_match(match)
|
|
||||||
self.log.debug("Print: disabled %s", path)
|
|
||||||
|
|
||||||
self.enter_step("pass_prune")
|
|
||||||
for _ in self.exec_each(prune_cb):
|
|
||||||
pass
|
|
||||||
|
|
||||||
self._clean_deps()
|
|
||||||
|
|
||||||
# Remove branches with no match
|
|
||||||
# TODO
|
|
||||||
|
|
||||||
def explain(self, path: Path) -> str:
|
def explain(self, path: Path) -> str:
|
||||||
match = self.get_match(path)
|
|
||||||
string = str(path)
|
string = str(path)
|
||||||
if isinstance(match, AsnNode):
|
match = self.get_match(path)
|
||||||
string += f" ({match.name})"
|
|
||||||
party_char = "F" if match.first_party else "M"
|
|
||||||
dup_char = "D" if match.dupplicate else "_"
|
|
||||||
string += f" {match.level}{party_char}{dup_char}{match.references}"
|
|
||||||
if match.source:
|
if match.source:
|
||||||
string += f" ← {self.explain(match.source)}"
|
string += f' ← {self.explain(match.source)}'
|
||||||
return string
|
return string
|
||||||
|
|
||||||
def list_records(
|
def export(self,
|
||||||
self,
|
|
||||||
first_party_only: bool = False,
|
first_party_only: bool = False,
|
||||||
end_chain_only: bool = False,
|
end_chain_only: bool = False,
|
||||||
no_dupplicates: bool = False,
|
|
||||||
rules_only: bool = False,
|
|
||||||
hostnames_only: bool = False,
|
|
||||||
explain: bool = False,
|
explain: bool = False,
|
||||||
) -> typing.Iterable[str]:
|
) -> typing.Iterable[str]:
|
||||||
def export_cb(path: Path, match: Match) -> typing.Iterable[str]:
|
if first_party_only or end_chain_only:
|
||||||
if first_party_only and not match.first_party:
|
raise NotImplementedError
|
||||||
return
|
|
||||||
if end_chain_only and match.references > 0:
|
|
||||||
return
|
|
||||||
if no_dupplicates and match.dupplicate:
|
|
||||||
return
|
|
||||||
if rules_only and match.level > 1:
|
|
||||||
return
|
|
||||||
if hostnames_only and not isinstance(path, HostnamePath):
|
|
||||||
return
|
|
||||||
|
|
||||||
|
def export_cb(path: Path, match: Match, _: typing.Any
|
||||||
|
) -> typing.Iterable[str]:
|
||||||
|
assert isinstance(path, DomainPath)
|
||||||
|
if isinstance(path, HostnamePath):
|
||||||
if explain:
|
if explain:
|
||||||
yield self.explain(path)
|
yield self.explain(path)
|
||||||
else:
|
else:
|
||||||
yield str(path)
|
yield self.unpack_domain(path)
|
||||||
|
|
||||||
yield from self.exec_each(export_cb)
|
yield from self.exec_each_domain(export_cb, None)
|
||||||
|
|
||||||
def count_records(
|
def list_rules(self,
|
||||||
self,
|
first_party_only: bool = False,
|
||||||
|
) -> typing.Iterable[str]:
|
||||||
|
if first_party_only:
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def list_rules_cb(path: Path, match: Match, _: typing.Any
|
||||||
|
) -> typing.Iterable[str]:
|
||||||
|
if isinstance(path, ZonePath) \
|
||||||
|
or (isinstance(path, Ip4Path) and path.prefixlen < 32):
|
||||||
|
# if match.level == 0:
|
||||||
|
yield self.explain(path)
|
||||||
|
|
||||||
|
yield from self.exec_each(list_rules_cb, None)
|
||||||
|
|
||||||
|
def count_rules(self,
|
||||||
first_party_only: bool = False,
|
first_party_only: bool = False,
|
||||||
end_chain_only: bool = False,
|
|
||||||
no_dupplicates: bool = False,
|
|
||||||
rules_only: bool = False,
|
|
||||||
hostnames_only: bool = False,
|
|
||||||
) -> str:
|
) -> str:
|
||||||
memo: typing.Dict[str, int] = dict()
|
raise NotImplementedError
|
||||||
|
|
||||||
def count_records_cb(path: Path, match: Match) -> None:
|
def get_domain(self, domain: DomainPath) -> typing.Iterable[DomainPath]:
|
||||||
if first_party_only and not match.first_party:
|
self.enter_step('get_domain_brws')
|
||||||
return
|
|
||||||
if end_chain_only and match.references > 0:
|
|
||||||
return
|
|
||||||
if no_dupplicates and match.dupplicate:
|
|
||||||
return
|
|
||||||
if rules_only and match.level > 1:
|
|
||||||
return
|
|
||||||
if hostnames_only and not isinstance(path, HostnamePath):
|
|
||||||
return
|
|
||||||
|
|
||||||
try:
|
|
||||||
memo[path.__class__.__name__] += 1
|
|
||||||
except KeyError:
|
|
||||||
memo[path.__class__.__name__] = 1
|
|
||||||
|
|
||||||
for _ in self.exec_each(count_records_cb):
|
|
||||||
pass
|
|
||||||
|
|
||||||
split: typing.List[str] = list()
|
|
||||||
for key, value in sorted(memo.items(), key=lambda s: s[0]):
|
|
||||||
split.append(f"{key[:-4].lower()}s: {value}")
|
|
||||||
return ", ".join(split)
|
|
||||||
|
|
||||||
def get_domain(self, domain_str: str) -> typing.Iterable[DomainPath]:
|
|
||||||
self.enter_step("get_domain_pack")
|
|
||||||
domain = self.pack_domain(domain_str)
|
|
||||||
self.enter_step("get_domain_brws")
|
|
||||||
dic = self.domtree
|
dic = self.domtree
|
||||||
depth = 0
|
depth = 0
|
||||||
for part in domain.parts:
|
for part in domain.parts:
|
||||||
if dic.match_zone.active():
|
if dic.match_zone.active():
|
||||||
self.enter_step("get_domain_yield")
|
self.enter_step('get_domain_yield')
|
||||||
yield ZonePath(domain.parts[:depth])
|
yield ZonePath(domain.parts[:depth])
|
||||||
self.enter_step("get_domain_brws")
|
self.enter_step('get_domain_brws')
|
||||||
if part not in dic.children:
|
if part not in dic.children:
|
||||||
return
|
return
|
||||||
dic = dic.children[part]
|
dic = dic.children[part]
|
||||||
depth += 1
|
depth += 1
|
||||||
if dic.match_zone.active():
|
if dic.match_zone.active():
|
||||||
self.enter_step("get_domain_yield")
|
self.enter_step('get_domain_yield')
|
||||||
yield ZonePath(domain.parts)
|
yield ZonePath(domain.parts)
|
||||||
if dic.match_hostname.active():
|
if dic.match_hostname.active():
|
||||||
self.enter_step("get_domain_yield")
|
self.enter_step('get_domain_yield')
|
||||||
yield HostnamePath(domain.parts)
|
yield HostnamePath(domain.parts)
|
||||||
|
|
||||||
def get_ip4(self, ip4_str: str) -> typing.Iterable[Path]:
|
def get_ip4(self, ip4: Ip4Path) -> typing.Iterable[Path]:
|
||||||
self.enter_step("get_ip4_pack")
|
self.enter_step('get_ip4_brws')
|
||||||
ip4val = self.pack_ip4address_low(ip4_str)
|
|
||||||
self.enter_step("get_ip4_cache")
|
|
||||||
if not self.ip4cache[ip4val >> self.ip4cache_shift]:
|
|
||||||
return
|
|
||||||
self.enter_step("get_ip4_brws")
|
|
||||||
dic = self.ip4tree
|
dic = self.ip4tree
|
||||||
for i in range(31, -1, -1):
|
for i in range(31, 31-ip4.prefixlen, -1):
|
||||||
bit = (ip4val >> i) & 0b1
|
bit = (ip4.value >> i) & 0b1
|
||||||
if dic.active():
|
if dic.active():
|
||||||
self.enter_step("get_ip4_yield")
|
self.enter_step('get_ip4_yield')
|
||||||
yield Ip4Path(ip4val >> (i + 1) << (i + 1), 31 - i)
|
a = Ip4Path(ip4.value >> (i+1) << (i+1), 31-i)
|
||||||
self.enter_step("get_ip4_brws")
|
yield a
|
||||||
|
self.enter_step('get_ip4_brws')
|
||||||
next_dic = dic.one if bit else dic.zero
|
next_dic = dic.one if bit else dic.zero
|
||||||
if next_dic is None:
|
if next_dic is None:
|
||||||
return
|
return
|
||||||
dic = next_dic
|
dic = next_dic
|
||||||
if dic.active():
|
if dic.active():
|
||||||
self.enter_step("get_ip4_yield")
|
self.enter_step('get_ip4_yield')
|
||||||
yield Ip4Path(ip4val, 32)
|
yield ip4
|
||||||
|
|
||||||
def _unset_match(
|
def list_asn(self) -> typing.Iterable[AsnPath]:
|
||||||
self,
|
for asn in self.asns:
|
||||||
match: Match,
|
yield AsnPath(asn)
|
||||||
) -> None:
|
|
||||||
match.disable()
|
|
||||||
if match.source:
|
|
||||||
source_match = self.get_match(match.source)
|
|
||||||
source_match.references -= 1
|
|
||||||
|
|
||||||
def _set_match(
|
def _set_domain(self,
|
||||||
self,
|
hostname: bool,
|
||||||
match: Match,
|
domain: DomainPath,
|
||||||
updated: int,
|
updated: int,
|
||||||
source: Path,
|
is_first_party: bool = None,
|
||||||
source_match: Match = None,
|
source: Path = None) -> None:
|
||||||
dupplicate: bool = False,
|
if is_first_party:
|
||||||
) -> None:
|
raise NotImplementedError
|
||||||
# source_match is in parameters because most of the time
|
self.enter_step('set_domain_src')
|
||||||
# its parent function needs it too,
|
if source is None:
|
||||||
# so it can pass it to save a traversal
|
level = 0
|
||||||
source_match = source_match or self.get_match(source)
|
source = RulePath()
|
||||||
new_level = source_match.level + 1
|
else:
|
||||||
if (
|
match = self.get_match(source)
|
||||||
updated > match.updated
|
level = match.level + 1
|
||||||
or new_level < match.level
|
self.enter_step('set_domain_brws')
|
||||||
or source_match.first_party > match.first_party
|
|
||||||
):
|
|
||||||
# NOTE FP and level of matches referencing this one
|
|
||||||
# won't be updated until run or prune
|
|
||||||
if match.source:
|
|
||||||
old_source = self.get_match(match.source)
|
|
||||||
old_source.references -= 1
|
|
||||||
match.updated = updated
|
|
||||||
match.level = new_level
|
|
||||||
match.first_party = source_match.first_party
|
|
||||||
match.source = source
|
|
||||||
source_match.references += 1
|
|
||||||
match.dupplicate = dupplicate
|
|
||||||
|
|
||||||
def _set_domain(
|
|
||||||
self, hostname: bool, domain_str: str, updated: int, source: Path
|
|
||||||
) -> None:
|
|
||||||
self.enter_step("set_domain_val")
|
|
||||||
if not Database.validate_domain(domain_str):
|
|
||||||
raise ValueError(f"Invalid domain: {domain_str}")
|
|
||||||
self.enter_step("set_domain_pack")
|
|
||||||
domain = self.pack_domain(domain_str)
|
|
||||||
self.enter_step("set_domain_fp")
|
|
||||||
source_match = self.get_match(source)
|
|
||||||
is_first_party = source_match.first_party
|
|
||||||
self.enter_step("set_domain_brws")
|
|
||||||
dic = self.domtree
|
dic = self.domtree
|
||||||
dupplicate = False
|
|
||||||
for part in domain.parts:
|
for part in domain.parts:
|
||||||
|
if dic.match_zone.active():
|
||||||
|
# Refuse to add domain whose zone is already matching
|
||||||
|
return
|
||||||
if part not in dic.children:
|
if part not in dic.children:
|
||||||
dic.children[part] = DomainTreeNode()
|
dic.children[part] = DomainTreeNode()
|
||||||
dic = dic.children[part]
|
dic = dic.children[part]
|
||||||
if dic.match_zone.active(is_first_party):
|
|
||||||
dupplicate = True
|
|
||||||
if hostname:
|
if hostname:
|
||||||
match = dic.match_hostname
|
match = dic.match_hostname
|
||||||
else:
|
else:
|
||||||
match = dic.match_zone
|
match = dic.match_zone
|
||||||
self._set_match(
|
match.set(
|
||||||
match,
|
|
||||||
updated,
|
updated,
|
||||||
|
level,
|
||||||
source,
|
source,
|
||||||
source_match=source_match,
|
|
||||||
dupplicate=dupplicate,
|
|
||||||
)
|
)
|
||||||
|
|
||||||
def set_hostname(self, *args: typing.Any, **kwargs: typing.Any) -> None:
|
def set_hostname(self,
|
||||||
|
*args: typing.Any, **kwargs: typing.Any
|
||||||
|
) -> None:
|
||||||
self._set_domain(True, *args, **kwargs)
|
self._set_domain(True, *args, **kwargs)
|
||||||
|
|
||||||
def set_zone(self, *args: typing.Any, **kwargs: typing.Any) -> None:
|
def set_zone(self,
|
||||||
|
*args: typing.Any, **kwargs: typing.Any
|
||||||
|
) -> None:
|
||||||
self._set_domain(False, *args, **kwargs)
|
self._set_domain(False, *args, **kwargs)
|
||||||
|
|
||||||
def set_asn(self, asn_str: str, updated: int, source: Path) -> None:
|
def set_asn(self,
|
||||||
self.enter_step("set_asn")
|
asn: AsnPath,
|
||||||
path = self.pack_asn(asn_str)
|
updated: int,
|
||||||
if path.asn in self.asns:
|
is_first_party: bool = None,
|
||||||
match = self.asns[path.asn]
|
source: Path = None) -> None:
|
||||||
|
self.enter_step('set_asn')
|
||||||
|
if is_first_party:
|
||||||
|
raise NotImplementedError
|
||||||
|
if source is None:
|
||||||
|
level = 0
|
||||||
|
source = RulePath()
|
||||||
|
else:
|
||||||
|
match = self.get_match(source)
|
||||||
|
level = match.level + 1
|
||||||
|
if asn.asn in self.asns:
|
||||||
|
match = self.asns[asn.asn]
|
||||||
else:
|
else:
|
||||||
match = AsnNode()
|
match = AsnNode()
|
||||||
self.asns[path.asn] = match
|
self.asns[asn.asn] = match
|
||||||
self._set_match(
|
match.set(
|
||||||
match,
|
|
||||||
updated,
|
updated,
|
||||||
|
level,
|
||||||
source,
|
source,
|
||||||
)
|
)
|
||||||
|
|
||||||
def _set_ip4(self, ip4: Ip4Path, updated: int, source: Path) -> None:
|
def set_ip4network(self,
|
||||||
self.enter_step("set_ip4_fp")
|
ip4: Ip4Path,
|
||||||
source_match = self.get_match(source)
|
updated: int,
|
||||||
is_first_party = source_match.first_party
|
is_first_party: bool = None,
|
||||||
self.enter_step("set_ip4_brws")
|
source: Path = None) -> None:
|
||||||
|
if is_first_party:
|
||||||
|
raise NotImplementedError
|
||||||
|
self.enter_step('set_ip4_src')
|
||||||
|
if source is None:
|
||||||
|
level = 0
|
||||||
|
source = RulePath()
|
||||||
|
else:
|
||||||
|
match = self.get_match(source)
|
||||||
|
level = match.level + 1
|
||||||
|
self.enter_step('set_ip4_brws')
|
||||||
dic = self.ip4tree
|
dic = self.ip4tree
|
||||||
dupplicate = False
|
|
||||||
for i in range(31, 31-ip4.prefixlen, -1):
|
for i in range(31, 31-ip4.prefixlen, -1):
|
||||||
bit = (ip4.value >> i) & 0b1
|
bit = (ip4.value >> i) & 0b1
|
||||||
|
if dic.active():
|
||||||
|
# Refuse to add ip4* whose network is already matching
|
||||||
|
return
|
||||||
next_dic = dic.one if bit else dic.zero
|
next_dic = dic.one if bit else dic.zero
|
||||||
if next_dic is None:
|
if next_dic is None:
|
||||||
next_dic = IpTreeNode()
|
next_dic = IpTreeNode()
|
||||||
|
@ -767,33 +535,15 @@ class Database(Profiler):
|
||||||
else:
|
else:
|
||||||
dic.zero = next_dic
|
dic.zero = next_dic
|
||||||
dic = next_dic
|
dic = next_dic
|
||||||
if dic.active(is_first_party):
|
dic.set(
|
||||||
dupplicate = True
|
|
||||||
self._set_match(
|
|
||||||
dic,
|
|
||||||
updated,
|
updated,
|
||||||
|
level,
|
||||||
source,
|
source,
|
||||||
source_match=source_match,
|
|
||||||
dupplicate=dupplicate,
|
|
||||||
)
|
)
|
||||||
self._set_ip4cache(ip4, dic)
|
|
||||||
|
|
||||||
def set_ip4address(
|
def set_ip4address(self,
|
||||||
self, ip4address_str: str, *args: typing.Any, **kwargs: typing.Any
|
ip4: Ip4Path,
|
||||||
|
*args: typing.Any, **kwargs: typing.Any
|
||||||
) -> None:
|
) -> None:
|
||||||
self.enter_step("set_ip4add_val")
|
assert ip4.prefixlen == 32
|
||||||
if not Database.validate_ip4address(ip4address_str):
|
self.set_ip4network(ip4, *args, **kwargs)
|
||||||
raise ValueError(f"Invalid ip4address: {ip4address_str}")
|
|
||||||
self.enter_step("set_ip4add_pack")
|
|
||||||
ip4 = self.pack_ip4address(ip4address_str)
|
|
||||||
self._set_ip4(ip4, *args, **kwargs)
|
|
||||||
|
|
||||||
def set_ip4network(
|
|
||||||
self, ip4network_str: str, *args: typing.Any, **kwargs: typing.Any
|
|
||||||
) -> None:
|
|
||||||
self.enter_step("set_ip4net_val")
|
|
||||||
if not Database.validate_ip4network(ip4network_str):
|
|
||||||
raise ValueError(f"Invalid ip4network: {ip4network_str}")
|
|
||||||
self.enter_step("set_ip4net_pack")
|
|
||||||
ip4 = self.pack_ip4network(ip4network_str)
|
|
||||||
self._set_ip4(ip4, *args, **kwargs)
|
|
||||||
|
|
54
db.py
54
db.py
|
@ -1,54 +0,0 @@
|
||||||
#!/usr/bin/env python3
|
|
||||||
|
|
||||||
import argparse
|
|
||||||
import database
|
|
||||||
import time
|
|
||||||
import os
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
|
|
||||||
# Parsing arguments
|
|
||||||
parser = argparse.ArgumentParser(description="Database operations")
|
|
||||||
parser.add_argument(
|
|
||||||
"-i", "--initialize", action="store_true", help="Reconstruct the whole database"
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"-p", "--prune", action="store_true", help="Remove old entries from database"
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"-b",
|
|
||||||
"--prune-base",
|
|
||||||
action="store_true",
|
|
||||||
help="With --prune, only prune base rules "
|
|
||||||
"(the ones added by ./feed_rules.py)",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"-s",
|
|
||||||
"--prune-before",
|
|
||||||
type=int,
|
|
||||||
default=(int(time.time()) - 60 * 60 * 24 * 31 * 6),
|
|
||||||
help="With --prune, only rules updated before "
|
|
||||||
"this UNIX timestamp will be deleted",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"-r",
|
|
||||||
"--references",
|
|
||||||
action="store_true",
|
|
||||||
help="DEBUG: Update the reference count",
|
|
||||||
)
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
if not args.initialize:
|
|
||||||
DB = database.Database()
|
|
||||||
else:
|
|
||||||
if os.path.isfile(database.Database.PATH):
|
|
||||||
os.unlink(database.Database.PATH)
|
|
||||||
DB = database.Database()
|
|
||||||
|
|
||||||
DB.enter_step("main")
|
|
||||||
if args.prune:
|
|
||||||
DB.prune(before=args.prune_before, base_only=args.prune_base)
|
|
||||||
if args.references:
|
|
||||||
DB.update_references()
|
|
||||||
|
|
||||||
DB.save()
|
|
1
dist/.gitignore
vendored
1
dist/.gitignore
vendored
|
@ -1,2 +1 @@
|
||||||
*.txt
|
*.txt
|
||||||
*.html
|
|
||||||
|
|
114
dist/README.md
vendored
114
dist/README.md
vendored
|
@ -1,114 +0,0 @@
|
||||||
# Geoffrey Frogeye's block list of first-party trackers
|
|
||||||
|
|
||||||
## What's a first-party tracker?
|
|
||||||
|
|
||||||
A tracker is a script placed on many websites to gather information about the visitor.
|
|
||||||
They can be used for many purposes: statistics, risk management, marketing, ad serving…
|
|
||||||
In any case, they are a threat to Internet users' privacy and many may want to block them.
|
|
||||||
|
|
||||||
Traditionally, trackers are served from a third party.
|
|
||||||
For example, `website1.com` and `website2.com` both load their tracking script from `https://trackercompany.com/trackerscript.js`.
|
|
||||||
In order to block those, one can simply block the hostname `trackercompany.com`, which is what most ad blockers do.
|
|
||||||
|
|
||||||
However, to circumvent this block, tracker companies made the websites using them load trackers from `somestring.website1.com`.
|
|
||||||
The latter is a DNS redirection to `website1.trackercompany.com`, or directly to an IP address belonging to the tracking company.
|
|
||||||
|
|
||||||
Those are called first-party trackers.
|
|
||||||
On top of the aforementioned privacy issues, they also cause security issues, as websites usually trust those scripts more.
|
|
||||||
For more information, learn about [Content Security Policy](https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP), [same-origin policy](https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy) and [Cross-Origin Resource Sharing](https://enable-cors.org/).
|
|
||||||
|
|
||||||
In order to block those trackers, ad blockers would need to block every subdomain pointing to anything under `trackercompany.com` or to their network.
|
|
||||||
Unfortunately, most don't support those blocking methods as they are not DNS-aware, i.e. they only see `somestring.website1.com`.
|
|
||||||
|
|
||||||
This list is an inventory of every `somestring.website1.com` found, so that non-DNS-aware ad blockers can still block first-party trackers.
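
The matching idea described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names, the in-memory `CNAMES` mapping and the sample records are all hypothetical, and it only assumes that CNAME records are available as a hostname-to-target dictionary.

```python
from typing import Dict, Iterable, List, Set


def in_zone(hostname: str, zone: str) -> bool:
    """Return True if hostname is the zone itself or one of its subdomains."""
    return hostname == zone or hostname.endswith("." + zone)


def find_first_party_trackers(
    cnames: Dict[str, str], tracker_zones: Iterable[str]
) -> List[str]:
    """List hostnames whose CNAME chain reaches a tracker zone (hypothetical helper)."""
    zones = list(tracker_zones)
    found = []
    for hostname in cnames:
        seen: Set[str] = set()
        current = hostname
        # Follow the CNAME chain, guarding against redirection loops.
        while current in cnames and current not in seen:
            seen.add(current)
            current = cnames[current]
            if any(in_zone(current, zone) for zone in zones):
                found.append(hostname)
                break
    return sorted(found)


# Hypothetical records mirroring the example used in this README.
CNAMES = {
    "somestring.website1.com": "website1.trackercompany.com",
    "www.website1.com": "cdn.example.net",
}
print(find_first_party_trackers(CNAMES, ["trackercompany.com"]))
```

Only `somestring.website1.com` is reported here, since it is the only hostname that redirects into the tracker's zone. A non-DNS-aware ad blocker can then block that hostname directly.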
|
|
||||||
|
|
||||||
### Learn more
|
|
||||||
|
|
||||||
- [CNAME Cloaking, the dangerous disguise of third-party trackers](https://medium.com/nextdns/cname-cloaking-the-dangerous-disguise-of-third-party-trackers-195205dc522a) from NextDNS
|
|
||||||
- [Trackers first-party](https://blog.imirhil.fr/2019/11/13/first-party-tracker.html) from Aeris, in French
|
|
||||||
- [uBlock Origin issue](https://github.com/uBlockOrigin/uBlock-issues/issues/780)
|
|
||||||
- [CNAME Cloaking and Bounce Tracking Defense](https://webkit.org/blog/11338/cname-cloaking-and-bounce-tracking-defense/) on WebKit's blog
|
|
||||||
- [Characterizing CNAME cloaking-based tracking](https://blog.apnic.net/2020/08/04/characterizing-cname-cloaking-based-tracking/) on APNIC's website
|
|
||||||
- [Characterizing CNAME Cloaking-Based Tracking on the Web](https://tma.ifip.org/2020/wp-content/uploads/sites/9/2020/06/tma2020-camera-paper66.pdf) is a research paper from Sokendai and ANSSI
|
|
||||||
|
|
||||||
## List variants
|
|
||||||
|
|
||||||
### First-party trackers
|
|
||||||
|
|
||||||
**Recommended for hostfiles-based ad blockers, such as [Pi-hole](https://pi-hole.net/) (<v5.0, as it introduced CNAME blocking).**
|
|
||||||
**Recommended for Android ad blockers as applications, such as [Blokada](https://blokada.org/).**
|
|
||||||
|
|
||||||
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt>
|
|
||||||
- Raw list: <https://hostfiles.frogeye.fr/firstparty-trackers.txt>
|
|
||||||
|
|
||||||
This list contains every hostname redirecting to [a hand-picked list of first-party trackers](https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/rules/first-party.list).
|
|
||||||
It should be safe from false positives.
|
|
||||||
It also contains all tracking hostnames under company domains (e.g. `website1.trackercompany.com`),
|
|
||||||
useful for ad blockers that don't support mass regex blocking,
|
|
||||||
while still preventing fallback to third-party trackers.
|
|
||||||
Don't be afraid of the size of the list, as this is due to the nature of first-party trackers: a single tracker generates at least one hostname per client (typically two).
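
As an illustration of how the two files relate, here is a minimal sketch that turns a raw list into hosts-file lines. It assumes each hosts entry is simply the raw hostname prefixed with the `0.0.0.0` sink address; the function name and the sample hostnames are hypothetical.

```python
from typing import Iterable, Iterator


def to_hosts_lines(hostnames: Iterable[str], sink: str = "0.0.0.0") -> Iterator[str]:
    """Yield one hosts-file line per non-empty, non-comment hostname."""
    for line in hostnames:
        hostname = line.strip()
        if not hostname or hostname.startswith("#"):
            continue
        yield f"{sink} {hostname}"


# Placeholder entries standing in for the content of the raw list
raw_list = ["somestring.website1.com", "", "website1.trackercompany.com"]
hosts = list(to_hosts_lines(raw_list))
print("\n".join(hosts))
```

Blockers that accept plain domain lists can use the raw list directly; the hosts variant only exists for tools that expect the `address hostname` format.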
|
|
||||||
|
|
||||||
### First-party only trackers
|
|
||||||
|
|
||||||
**Recommended for ad blockers as web browser extensions, such as [uBlock Origin](https://ublockorigin.com/) (<v1.25.0 or for Chromium-based browsers, as it introduced CNAME uncloaking for Firefox).**
|
|
||||||
|
|
||||||
- Hosts file: <https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt>
|
|
||||||
- Raw list: <https://hostfiles.frogeye.fr/firstparty-only-trackers.txt>
|
|
||||||
|
|
||||||
This is the same list as above, albeit not containing the hostnames under the tracking company domains (e.g. `website1.trackercompany.com`).
|
|
||||||
This allows for reducing the size of the list for ad-blockers that already block those third-party trackers with their support of regex blocking.
|
|
||||||
Use in conjunction with other block lists used in regex-mode, such as [Peter Lowe's](https://pgl.yoyo.org/adservers/).
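
The relation between the full list and its "only" variant can be sketched as a filter that drops every hostname already covered by a tracker zone. This is an illustrative sketch, not the project's exporter: `drop_covered` and the sample data are hypothetical.

```python
from typing import Iterable, List


def drop_covered(hostnames: Iterable[str], zones: Iterable[str]) -> List[str]:
    """Keep only hostnames that are not inside any of the given zones."""
    suffixes = tuple("." + zone.lower().rstrip(".") for zone in zones)
    exact = {suffix[1:] for suffix in suffixes}
    kept = []
    for hostname in hostnames:
        name = hostname.lower().rstrip(".")
        # Covered if it is the zone itself or any subdomain of it
        if name in exact or name.endswith(suffixes):
            continue
        kept.append(hostname)
    return kept


full_list = ["somestring.website1.com", "website1.trackercompany.com"]
print(drop_covered(full_list, ["trackercompany.com"]))
```

The remaining entries are the ones a zone-based rule cannot catch, so the shorter list loses no coverage as long as the blocker already blocks the tracker zones themselves.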
|
|
||||||
|
|
||||||
### Multi-party trackers
|
|
||||||
|
|
||||||
- Hosts file: <https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt>
|
|
||||||
- Raw list: <https://hostfiles.frogeye.fr/multiparty-trackers.txt>
|
|
||||||
|
|
||||||
As first-party trackers usually evolve from third-party trackers, this list contains every hostname redirecting to trackers found in existing lists of third-party trackers (see next section).
|
|
||||||
Since the latter were not designed with first-party trackers in mind, they are likely to contain false-positives.
|
|
||||||
On the other hand, they might protect against first-party trackers that we are not aware of or have not yet confirmed.
|
|
||||||
|
|
||||||
#### Source of third-party trackers
|
|
||||||
|
|
||||||
- [EasyPrivacy](https://easylist.to/easylist/easyprivacy.txt)
|
|
||||||
- [AdGuard](https://github.com/AdguardTeam/AdguardFilters)
|
|
||||||
|
|
||||||
(Yes, there are only two for now. Many existing lists cause a lot of false positives.)
|
|
||||||
|
|
||||||
### Multi-party only trackers
|
|
||||||
|
|
||||||
- Hosts file: <https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt>
|
|
||||||
- Raw list: <https://hostfiles.frogeye.fr/multiparty-only-trackers.txt>
|
|
||||||
|
|
||||||
This is the same list as above, albeit not containing the hostnames under the tracking company domains (e.g. `website1.trackercompany.com`).
|
|
||||||
This allows for reducing the size of the list for ad-blockers that already block those third-party trackers with their support of regex blocking.
|
|
||||||
Use in conjunction with other block lists used in regex-mode, such as the ones in the previous section.
|
|
||||||
|
|
||||||
## Meta
|
|
||||||
|
|
||||||
In case of false positives/negatives, or any other question, contact me the way you like: <https://geoffrey.frogeye.fr>
|
|
||||||
|
|
||||||
The software used to generate this list is available here: <https://git.frogeye.fr/geoffrey/eulaurarien>
|
|
||||||
|
|
||||||
## Acknowledgements
|
|
||||||
|
|
||||||
Some of the first-party trackers included in this list have been found by:
|
|
||||||
|
|
||||||
- [Aeris](https://imirhil.fr/)
|
|
||||||
- NextDNS and [their blocklist](https://github.com/nextdns/cname-cloaking-blocklist)'s contributors
|
|
||||||
- Yuki2718 from [Wilders Security Forums](https://www.wilderssecurity.com/threads/ublock-a-lean-and-fast-blocker.365273/page-168#post-2880361)
|
|
||||||
- Ha Dao, Johan Mazel, and Kensuke Fukuda, ["Characterizing CNAME Cloaking-Based Tracking on the Web", Proceedings of IFIP/IEEE Traffic Measurement Analysis Conference (TMA), 9 pages, 2020.](https://tma.ifip.org/2020/wp-content/uploads/sites/9/2020/06/tma2020-camera-paper66.pdf)
|
|
||||||
- AdGuard and [their blocklist](https://github.com/AdguardTeam/cname-trackers)'s contributors
|
|
||||||
|
|
||||||
The list was generated using data from:
|
|
||||||
|
|
||||||
- [Cisco Umbrella Popularity List](http://s3-us-west-1.amazonaws.com/umbrella-static/index.html)
|
|
||||||
- [Public DNS Server List](https://public-dns.info/)
|
|
||||||
|
|
||||||
|
|
||||||
Similar projects:
|
|
||||||
|
|
||||||
- [NextDNS blocklist](https://github.com/nextdns/cname-cloaking-blocklist): for DNS-aware ad blockers
|
|
||||||
- [Stefan Froberg's lists](https://www.orwell1984.today/cname/): subset of those lists grouped by tracker
|
|
||||||
- [AdGuard blocklist](https://github.com/AdguardTeam/cname-trackers): same thing with a bigger scope, maintained by a bigger team
|
|
||||||
|
|
2
dist/markdown7.min.css
vendored
|
@ -1,2 +0,0 @@
|
||||||
/* Source: https://github.com/jasonm23/markdown-css-themes */
|
|
||||||
body{font-family:Helvetica,arial,sans-serif;font-size:14px;line-height:1.6;padding-top:10px;padding-bottom:10px;background-color:#fff;padding:30px}body>:first-child{margin-top:0!important}body>:last-child{margin-bottom:0!important}a{color:#4183c4}a.absent{color:#c00}a.anchor{display:block;padding-left:30px;margin-left:-30px;cursor:pointer;position:absolute;top:0;left:0;bottom:0}h1,h2,h3,h4,h5,h6{margin:20px 0 10px;padding:0;font-weight:700;-webkit-font-smoothing:antialiased;cursor:text;position:relative}h1:hover a.anchor,h2:hover a.anchor,h3:hover a.anchor,h4:hover a.anchor,h5:hover a.anchor,h6:hover a.anchor{text-decoration:none}h1 code,h1 tt{font-size:inherit}h2 code,h2 tt{font-size:inherit}h3 code,h3 tt{font-size:inherit}h4 code,h4 tt{font-size:inherit}h5 code,h5 tt{font-size:inherit}h6 code,h6 tt{font-size:inherit}h1{font-size:28px;color:#000}h2{font-size:24px;border-bottom:1px solid #ccc;color:#000}h3{font-size:18px}h4{font-size:16px}h5{font-size:14px}h6{color:#777;font-size:14px}blockquote,dl,li,ol,p,pre,table,ul{margin:15px 0}hr{border:0 none;color:#ccc;height:4px;padding:0}body>h2:first-child{margin-top:0;padding-top:0}body>h1:first-child{margin-top:0;padding-top:0}body>h1:first-child+h2{margin-top:0;padding-top:0}body>h3:first-child,body>h4:first-child,body>h5:first-child,body>h6:first-child{margin-top:0;padding-top:0}a:first-child h1,a:first-child h2,a:first-child h3,a:first-child h4,a:first-child h5,a:first-child h6{margin-top:0;padding-top:0}h1 p,h2 p,h3 p,h4 p,h5 p,h6 p{margin-top:0}li p.first{display:inline-block}li{margin:0}ol,ul{padding-left:30px}ol :first-child,ul :first-child{margin-top:0}dl{padding:0}dl dt{font-size:14px;font-weight:700;font-style:italic;padding:0;margin:15px 0 5px}dl dt:first-child{padding:0}dl dt>:first-child{margin-top:0}dl dt>:last-child{margin-bottom:0}dl dd{margin:0 0 15px;padding:0 15px}dl dd>:first-child{margin-top:0}dl dd>:last-child{margin-bottom:0}blockquote{border-left:4px solid #ddd;padding:0 
15px;color:#777}blockquote>:first-child{margin-top:0}blockquote>:last-child{margin-bottom:0}table{padding:0;border-collapse:collapse}table tr{border-top:1px solid #ccc;background-color:#fff;margin:0;padding:0}table tr:nth-child(2n){background-color:#f8f8f8}table tr th{font-weight:700;border:1px solid #ccc;margin:0;padding:6px 13px}table tr td{border:1px solid #ccc;margin:0;padding:6px 13px}table tr td :first-child,table tr th :first-child{margin-top:0}table tr td :last-child,table tr th :last-child{margin-bottom:0}img{max-width:100%}span.frame{display:block;overflow:hidden}span.frame>span{border:1px solid #ddd;display:block;float:left;overflow:hidden;margin:13px 0 0;padding:7px;width:auto}span.frame span img{display:block;float:left}span.frame span span{clear:both;color:#333;display:block;padding:5px 0 0}span.align-center{display:block;overflow:hidden;clear:both}span.align-center>span{display:block;overflow:hidden;margin:13px auto 0;text-align:center}span.align-center span img{margin:0 auto;text-align:center}span.align-right{display:block;overflow:hidden;clear:both}span.align-right>span{display:block;overflow:hidden;margin:13px 0 0;text-align:right}span.align-right span img{margin:0;text-align:right}span.float-left{display:block;margin-right:13px;overflow:hidden;float:left}span.float-left span{margin:13px 0 0}span.float-right{display:block;margin-left:13px;overflow:hidden;float:right}span.float-right>span{display:block;overflow:hidden;margin:13px auto 0;text-align:right}code,tt{margin:0 2px;padding:0 5px;white-space:nowrap;border:1px solid #eaeaea;background-color:#f8f8f8;border-radius:3px}pre code{margin:0;padding:0;white-space:pre;border:none;background:0 0}.highlight pre{background-color:#f8f8f8;border:1px solid #ccc;font-size:13px;line-height:19px;overflow:auto;padding:6px 10px;border-radius:3px}pre{background-color:#f8f8f8;border:1px solid #ccc;font-size:13px;line-height:19px;overflow:auto;padding:6px 10px;border-radius:3px}pre code,pre 
tt{background-color:transparent;border:none}sup{font-size:.83em;vertical-align:super;line-height:0}*{-webkit-print-color-adjust:exact}@media screen and (min-width:914px){body{width:854px;margin:0 auto}}@media print{pre,table{page-break-inside:avoid}pre{word-wrap:break-word}}
|
|
|
@ -2,13 +2,8 @@
|
||||||
|
|
||||||
# Main script for eulaurarien
|
# Main script for eulaurarien
|
||||||
|
|
||||||
[ ! -f .env ] && touch .env
|
|
||||||
|
|
||||||
./fetch_resources.sh
|
./fetch_resources.sh
|
||||||
./collect_subdomains.sh
|
./collect_subdomains.sh
|
||||||
./import_rules.sh
|
|
||||||
./resolve_subdomains.sh
|
./resolve_subdomains.sh
|
||||||
./prune.sh
|
./filter_subdomains.sh
|
||||||
./export_lists.sh
|
|
||||||
./generate_index.py
|
|
||||||
|
|
||||||
|
|
86
export.py
|
@ -5,87 +5,45 @@ import argparse
|
||||||
import sys
|
import sys
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == '__main__':
|
||||||
|
|
||||||
# Parsing arguments
|
# Parsing arguments
|
||||||
parser = argparse.ArgumentParser(
|
parser = argparse.ArgumentParser(
|
||||||
description="Export the hostnames rules stored " "in the Database as plain text"
|
description="TODO")
|
||||||
)
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-o",
|
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
|
||||||
"--output",
|
help="TODO")
|
||||||
type=argparse.FileType("w"),
|
|
||||||
default=sys.stdout,
|
|
||||||
help="Output file, one rule per line",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-f",
|
'-f', '--first-party', action='store_true',
|
||||||
"--first-party",
|
help="TODO")
|
||||||
action="store_true",
|
|
||||||
help="Only output rules issued from first-party sources",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-e",
|
'-e', '--end-chain', action='store_true',
|
||||||
"--end-chain",
|
help="TODO")
|
||||||
action="store_true",
|
|
||||||
help="Only output rules that are not referenced by any other",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-r",
|
'-x', '--explain', action='store_true',
|
||||||
"--rules",
|
help="TODO")
|
||||||
action="store_true",
|
|
||||||
help="Output all kinds of rules, not just hostnames",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-b",
|
'-r', '--rules', action='store_true',
|
||||||
"--base-rules",
|
help="TODO")
|
||||||
action="store_true",
|
|
||||||
help="Output base rules "
|
|
||||||
"(the ones added by ./feed_rules.py) "
|
|
||||||
"(implies --rules)",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-d",
|
'-c', '--count', action='store_true',
|
||||||
"--no-dupplicates",
|
help="TODO")
|
||||||
action="store_true",
|
|
||||||
help="Do not output rules that already match a zone/network rule "
|
|
||||||
"(e.g. dummy.example.com when there's a zone example.com rule)",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"-x",
|
|
||||||
"--explain",
|
|
||||||
action="store_true",
|
|
||||||
help="Show the chain of rules leading to one "
|
|
||||||
"(and the number of references they have)",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"-c",
|
|
||||||
"--count",
|
|
||||||
action="store_true",
|
|
||||||
help="Show the number of rules per type instead of listing them",
|
|
||||||
)
|
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
DB = database.Database()
|
DB = database.Database()
|
||||||
|
|
||||||
|
if args.rules:
|
||||||
if args.count:
|
if args.count:
|
||||||
assert not args.explain
|
print(DB.count_rules(first_party_only=args.first_party))
|
||||||
print(
|
|
||||||
DB.count_records(
|
|
||||||
first_party_only=args.first_party,
|
|
||||||
end_chain_only=args.end_chain,
|
|
||||||
no_dupplicates=args.no_dupplicates,
|
|
||||||
rules_only=args.base_rules,
|
|
||||||
hostnames_only=not (args.rules or args.base_rules),
|
|
||||||
)
|
|
||||||
)
|
|
||||||
else:
|
else:
|
||||||
for domain in DB.list_records(
|
for line in DB.list_rules():
|
||||||
|
print(line)
|
||||||
|
else:
|
||||||
|
if args.count:
|
||||||
|
raise NotImplementedError
|
||||||
|
for domain in DB.export(
|
||||||
first_party_only=args.first_party,
|
first_party_only=args.first_party,
|
||||||
end_chain_only=args.end_chain,
|
end_chain_only=args.end_chain,
|
||||||
no_dupplicates=args.no_dupplicates,
|
|
||||||
rules_only=args.base_rules,
|
|
||||||
hostnames_only=not (args.rules or args.base_rules),
|
|
||||||
explain=args.explain,
|
explain=args.explain,
|
||||||
):
|
):
|
||||||
print(domain, file=args.output)
|
print(domain, file=args.output)
|
||||||
|
|
|
@ -1,98 +0,0 @@
|
||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
function log() {
|
|
||||||
echo -e "\033[33m$@\033[0m"
|
|
||||||
}
|
|
||||||
|
|
||||||
log "Calculating statistics…"
|
|
||||||
oldest="$(cat last_updates/*.txt | sort -n | head -1)"
|
|
||||||
oldest_date=$(date -Isec -d @$oldest)
|
|
||||||
gen_date=$(date -Isec)
|
|
||||||
gen_software=$(git describe --tags)
|
|
||||||
number_websites=$(wc -l < temp/all_websites.list)
|
|
||||||
number_subdomains=$(wc -l < temp/all_subdomains.list)
|
|
||||||
number_dns=$(grep 'NOERROR' temp/all_resolved.txt | wc -l)
|
|
||||||
|
|
||||||
for partyness in {first,multi}
|
|
||||||
do
|
|
||||||
if [ $partyness = "first" ]
|
|
||||||
then
|
|
||||||
partyness_flags="--first-party"
|
|
||||||
else
|
|
||||||
partyness_flags=""
|
|
||||||
fi
|
|
||||||
|
|
||||||
rules_input=$(./export.py --count --base-rules $partyness_flags)
|
|
||||||
rules_found=$(./export.py --count --rules $partyness_flags)
|
|
||||||
rules_found_nd=$(./export.py --count --rules --no-dupplicates $partyness_flags)
|
|
||||||
|
|
||||||
echo
|
|
||||||
echo "Statistics for ${partyness}-party trackers"
|
|
||||||
echo "Input rules: $rules_input"
|
|
||||||
echo "Subsequent rules: $rules_found"
|
|
||||||
echo "Subsequent rules (no dupplicate): $rules_found_nd"
|
|
||||||
echo "Output hostnames: $(./export.py --count $partyness_flags)"
|
|
||||||
echo "Output hostnames (no dupplicate): $(./export.py --count --no-dupplicates $partyness_flags)"
|
|
||||||
echo "Output hostnames (end-chain only): $(./export.py --count --end-chain $partyness_flags)"
|
|
||||||
echo "Output hostnames (no dupplicate, end-chain only): $(./export.py --count --no-dupplicates --end-chain $partyness_flags)"
|
|
||||||
|
|
||||||
for trackerness in {trackers,only-trackers}
|
|
||||||
do
|
|
||||||
if [ $trackerness = "trackers" ]
|
|
||||||
then
|
|
||||||
trackerness_flags=""
|
|
||||||
else
|
|
||||||
trackerness_flags="--no-dupplicates"
|
|
||||||
fi
|
|
||||||
file_list="dist/${partyness}party-${trackerness}.txt"
|
|
||||||
file_host="dist/${partyness}party-${trackerness}-hosts.txt"
|
|
||||||
|
|
||||||
log "Generating lists for variant ${partyness}-party ${trackerness}…"
|
|
||||||
|
|
||||||
# Real export heeere
|
|
||||||
./export.py $partyness_flags $trackerness_flags > $file_list
|
|
||||||
# Sometimes a bit heavy to have the DB open and sort the output
|
|
||||||
# so this is done in two steps
|
|
||||||
sort -u $file_list -o $file_list
|
|
||||||
|
|
||||||
rules_output=$(./export.py --count $partyness_flags $trackerness_flags)
|
|
||||||
|
|
||||||
(
|
|
||||||
echo "# First-party trackers host list"
|
|
||||||
echo "# Variant: ${partyness}-party ${trackerness}"
|
|
||||||
echo "#"
|
|
||||||
echo "# About first-party trackers: https://hostfiles.frogeye.fr/#whats-a-first-party-tracker"
|
|
||||||
echo "#"
|
|
||||||
echo "# In case of false positives/negatives, or any other question,"
|
|
||||||
echo "# contact me the way you like: https://geoffrey.frogeye.fr"
|
|
||||||
echo "#"
|
|
||||||
echo "# Latest versions and variants: https://hostfiles.frogeye.fr/#list-variants"
|
|
||||||
echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
|
|
||||||
echo "# License: https://git.frogeye.fr/geoffrey/eulaurarien/src/branch/master/LICENSE"
|
|
||||||
echo "# Acknowledgements: https://hostfiles.frogeye.fr/#acknowledgements"
|
|
||||||
echo "#"
|
|
||||||
echo "# Generation software: eulaurarien $gen_software"
|
|
||||||
echo "# List generation date: $gen_date"
|
|
||||||
echo "# Oldest record: $oldest_date"
|
|
||||||
echo "# Number of source websites: $number_websites"
|
|
||||||
echo "# Number of source subdomains: $number_subdomains"
|
|
||||||
echo "# Number of source DNS records: $number_dns"
|
|
||||||
echo "#"
|
|
||||||
echo "# Input rules: $rules_input"
|
|
||||||
echo "# Subsequent rules: $rules_found"
|
|
||||||
echo "# … no dupplicates: $rules_found_nd"
|
|
||||||
echo "# Output rules: $rules_output"
|
|
||||||
echo "#"
|
|
||||||
echo
|
|
||||||
sed 's|^|0.0.0.0 |' "$file_list"
|
|
||||||
) > "$file_host"
|
|
||||||
|
|
||||||
done
|
|
||||||
done
|
|
||||||
|
|
||||||
if [ -d explanations ]
|
|
||||||
then
|
|
||||||
filename="$(date -Isec).txt"
|
|
||||||
./export.py --explain > "explanations/$filename"
|
|
||||||
ln --force --symbolic "$filename" "explanations/latest.txt"
|
|
||||||
fi
|
|
50
feed_asn.py
|
@ -13,56 +13,40 @@ IPNetwork = typing.Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
|
||||||
|
|
||||||
def get_ranges(asn: str) -> typing.Iterable[str]:
|
def get_ranges(asn: str) -> typing.Iterable[str]:
|
||||||
req = requests.get(
|
req = requests.get(
|
||||||
"https://stat.ripe.net/data/as-routing-consistency/data.json",
|
'https://stat.ripe.net/data/as-routing-consistency/data.json',
|
||||||
params={"resource": asn},
|
params={'resource': asn}
|
||||||
)
|
)
|
||||||
data = req.json()
|
data = req.json()
|
||||||
for pref in data["data"]["prefixes"]:
|
for pref in data['data']['prefixes']:
|
||||||
yield pref["prefix"]
|
yield pref['prefix']
|
||||||
|
|
||||||
|
|
||||||
def get_name(asn: str) -> str:
|
if __name__ == '__main__':
|
||||||
req = requests.get(
|
|
||||||
"https://stat.ripe.net/data/as-overview/data.json", params={"resource": asn}
|
|
||||||
)
|
|
||||||
data = req.json()
|
|
||||||
return data["data"]["holder"]
|
|
||||||
|
|
||||||
|
log = logging.getLogger('feed_asn')
|
||||||
if __name__ == "__main__":
|
|
||||||
|
|
||||||
log = logging.getLogger("feed_asn")
|
|
||||||
|
|
||||||
# Parsing arguments
|
# Parsing arguments
|
||||||
parser = argparse.ArgumentParser(
|
parser = argparse.ArgumentParser(
|
||||||
description="Add the IP ranges associated to the AS in the database"
|
description="TODO")
|
||||||
)
|
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
DB = database.Database()
|
DB = database.Database()
|
||||||
|
|
||||||
def add_ranges(
|
for path in DB.list_asn():
|
||||||
path: database.Path,
|
|
||||||
match: database.Match,
|
|
||||||
) -> None:
|
|
||||||
assert isinstance(path, database.AsnPath)
|
|
||||||
assert isinstance(match, database.AsnNode)
|
|
||||||
asn_str = database.Database.unpack_asn(path)
|
asn_str = database.Database.unpack_asn(path)
|
||||||
DB.enter_step("asn_get_name")
|
DB.enter_step('asn_get_ranges')
|
||||||
name = get_name(asn_str)
|
|
||||||
match.name = name
|
|
||||||
DB.enter_step("asn_get_ranges")
|
|
||||||
for prefix in get_ranges(asn_str):
|
for prefix in get_ranges(asn_str):
|
||||||
parsed_prefix: IPNetwork = ipaddress.ip_network(prefix)
|
parsed_prefix: IPNetwork = ipaddress.ip_network(prefix)
|
||||||
if parsed_prefix.version == 4:
|
if parsed_prefix.version == 4:
|
||||||
DB.set_ip4network(prefix, source=path, updated=int(time.time()))
|
DB.set_ip4network(
|
||||||
log.info("Added %s from %s (%s)", prefix, path, name)
|
prefix,
|
||||||
|
source=path,
|
||||||
|
updated=int(time.time())
|
||||||
|
)
|
||||||
|
log.info('Added %s from %s (%s)', prefix, asn_str, path)
|
||||||
elif parsed_prefix.version == 6:
|
elif parsed_prefix.version == 6:
|
||||||
log.warning("Unimplemented prefix version: %s", prefix)
|
log.warning('Unimplemented prefix version: %s', prefix)
|
||||||
else:
|
else:
|
||||||
log.error("Unknown prefix version: %s", prefix)
|
log.error('Unknown prefix version: %s', prefix)
|
||||||
|
|
||||||
for _ in DB.exec_each_asn(add_ranges):
|
|
||||||
pass
|
|
||||||
|
|
||||||
DB.save()
|
DB.save()
|
||||||
|
|
147
feed_dns.old.py
Executable file
|
@ -0,0 +1,147 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import database
|
||||||
|
import logging
|
||||||
|
import sys
|
||||||
|
import typing
|
||||||
|
import enum
|
||||||
|
|
||||||
|
RecordType = enum.Enum('RecordType', 'A AAAA CNAME PTR')
|
||||||
|
Record = typing.Tuple[RecordType, int, str, str]
|
||||||
|
|
||||||
|
# select, write
|
||||||
|
FUNCTION_MAP: typing.Any = {
|
||||||
|
RecordType.A: (
|
||||||
|
database.Database.get_ip4,
|
||||||
|
database.Database.set_hostname,
|
||||||
|
),
|
||||||
|
RecordType.CNAME: (
|
||||||
|
database.Database.get_domain,
|
||||||
|
database.Database.set_hostname,
|
||||||
|
),
|
||||||
|
RecordType.PTR: (
|
||||||
|
database.Database.get_domain,
|
||||||
|
database.Database.set_ip4address,
|
||||||
|
),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class Parser():
|
||||||
|
def __init__(self, buf: typing.Any) -> None:
|
||||||
|
self.buf = buf
|
||||||
|
self.log = logging.getLogger('parser')
|
||||||
|
self.db = database.Database()
|
||||||
|
|
||||||
|
def end(self) -> None:
|
||||||
|
self.db.save()
|
||||||
|
|
||||||
|
def register(self,
|
||||||
|
rtype: RecordType,
|
||||||
|
updated: int,
|
||||||
|
name: str,
|
||||||
|
value: str
|
||||||
|
) -> None:
|
||||||
|
|
||||||
|
self.db.enter_step('register')
|
||||||
|
select, write = FUNCTION_MAP[rtype]
|
||||||
|
for source in select(self.db, value):
|
||||||
|
# write(self.db, name, updated, source=source)
|
||||||
|
write(self.db, name, updated)
|
||||||
|
|
||||||
|
def consume(self) -> None:
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
|
||||||
|
class Rapid7Parser(Parser):
|
||||||
|
TYPES = {
|
||||||
|
'a': RecordType.A,
|
||||||
|
'aaaa': RecordType.AAAA,
|
||||||
|
'cname': RecordType.CNAME,
|
||||||
|
'ptr': RecordType.PTR,
|
||||||
|
}
|
||||||
|
|
||||||
|
def consume(self) -> None:
|
||||||
|
data = dict()
|
||||||
|
for line in self.buf:
|
||||||
|
self.db.enter_step('parse_rapid7')
|
||||||
|
split = line.split('"')
|
||||||
|
|
||||||
|
for k in range(1, 14, 4):
|
||||||
|
key = split[k]
|
||||||
|
val = split[k+2]
|
||||||
|
data[key] = val
|
||||||
|
|
||||||
|
self.register(
|
||||||
|
Rapid7Parser.TYPES[data['type']],
|
||||||
|
int(data['timestamp']),
|
||||||
|
data['name'],
|
||||||
|
data['value']
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class DnsMassParser(Parser):
|
||||||
|
# dnsmass --output Snrql
|
||||||
|
# --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
|
||||||
|
TYPES = {
|
||||||
|
'A': (RecordType.A, -1, None),
|
||||||
|
'AAAA': (RecordType.AAAA, -1, None),
|
||||||
|
'CNAME': (RecordType.CNAME, -1, -1),
|
||||||
|
}
|
||||||
|
|
||||||
|
def consume(self) -> None:
|
||||||
|
self.db.enter_step('parse_dnsmass')
|
||||||
|
timestamp = 0
|
||||||
|
header = True
|
||||||
|
for line in self.buf:
|
||||||
|
line = line[:-1]
|
||||||
|
if not line:
|
||||||
|
header = True
|
||||||
|
continue
|
||||||
|
|
||||||
|
split = line.split(' ')
|
||||||
|
try:
|
||||||
|
if header:
|
||||||
|
timestamp = int(split[1])
|
||||||
|
header = False
|
||||||
|
else:
|
||||||
|
dtype, name_offset, value_offset = \
|
||||||
|
DnsMassParser.TYPES[split[1]]
|
||||||
|
self.register(
|
||||||
|
dtype,
|
||||||
|
timestamp,
|
||||||
|
split[0][:name_offset],
|
||||||
|
split[2][:value_offset],
|
||||||
|
)
|
||||||
|
self.db.enter_step('parse_dnsmass')
|
||||||
|
except KeyError:
|
||||||
|
continue
|
||||||
|
|
||||||
|
|
||||||
|
PARSERS = {
|
||||||
|
'rapid7': Rapid7Parser,
|
||||||
|
'dnsmass': DnsMassParser,
|
||||||
|
}
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
|
||||||
|
# Parsing arguments
|
||||||
|
log = logging.getLogger('feed_dns')
|
||||||
|
args_parser = argparse.ArgumentParser(
|
||||||
|
description="TODO")
|
||||||
|
args_parser.add_argument(
|
||||||
|
'parser',
|
||||||
|
choices=PARSERS.keys(),
|
||||||
|
help="TODO")
|
||||||
|
args_parser.add_argument(
|
||||||
|
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
|
||||||
|
help="TODO")
|
||||||
|
args = args_parser.parse_args()
|
||||||
|
|
||||||
|
parser = PARSERS[args.parser](args.input)
|
||||||
|
try:
|
||||||
|
parser.consume()
|
||||||
|
except KeyboardInterrupt:
|
||||||
|
pass
|
||||||
|
parser.end()
|
||||||
|
|
273
feed_dns.py
|
@ -6,123 +6,115 @@ import logging
|
||||||
import sys
|
import sys
|
||||||
import typing
|
import typing
|
||||||
import multiprocessing
|
import multiprocessing
|
||||||
import time
|
import enum
|
||||||
|
|
||||||
Record = typing.Tuple[typing.Callable, typing.Callable, int, str, str]
|
Record = typing.Tuple[typing.Callable,
|
||||||
|
typing.Callable, int, database.Path, database.Path]
|
||||||
|
|
||||||
# select, write
|
# select, write, name_packer, value_packer
|
||||||
FUNCTION_MAP: typing.Any = {
|
FUNCTION_MAP: typing.Any = {
|
||||||
"a": (
|
'a': (
|
||||||
database.Database.get_ip4,
|
database.Database.get_ip4,
|
||||||
database.Database.set_hostname,
|
database.Database.set_hostname,
|
||||||
|
database.Database.pack_domain,
|
||||||
|
database.Database.pack_ip4address,
|
||||||
),
|
),
|
||||||
"cname": (
|
'cname': (
|
||||||
database.Database.get_domain,
|
database.Database.get_domain,
|
||||||
database.Database.set_hostname,
|
database.Database.set_hostname,
|
||||||
|
database.Database.pack_domain,
|
||||||
|
database.Database.pack_domain,
|
||||||
),
|
),
|
||||||
"ptr": (
|
'ptr': (
|
||||||
database.Database.get_domain,
|
database.Database.get_domain,
|
||||||
database.Database.set_ip4address,
|
database.Database.set_ip4address,
|
||||||
|
database.Database.pack_ip4address,
|
||||||
|
database.Database.pack_domain,
|
||||||
),
|
),
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
class Writer(multiprocessing.Process):
|
class Writer(multiprocessing.Process):
|
||||||
def __init__(
|
def __init__(self,
|
||||||
self,
|
recs_queue: multiprocessing.Queue,
|
||||||
recs_queue: multiprocessing.Queue = None,
|
index: int = 0):
|
||||||
autosave_interval: int = 0,
|
|
||||||
ip4_cache: int = 0,
|
|
||||||
):
|
|
||||||
if recs_queue: # MP
|
|
||||||
super(Writer, self).__init__()
|
super(Writer, self).__init__()
|
||||||
|
self.log = logging.getLogger(f'wr')
|
||||||
self.recs_queue = recs_queue
|
self.recs_queue = recs_queue
|
||||||
self.log = logging.getLogger("wr")
|
|
||||||
self.autosave_interval = autosave_interval
|
|
||||||
self.ip4_cache = ip4_cache
|
|
||||||
if not recs_queue: # No MP
|
|
||||||
self.open_db()
|
|
||||||
|
|
||||||
def open_db(self) -> None:
|
|
||||||
self.db = database.Database()
|
|
||||||
self.db.log = logging.getLogger("wr")
|
|
||||||
self.db.fill_ip4cache(max_size=self.ip4_cache)
|
|
||||||
|
|
||||||
def exec_record(self, record: Record) -> None:
|
|
||||||
self.db.enter_step("exec_record")
|
|
||||||
select, write, updated, name, value = record
|
|
||||||
try:
|
|
||||||
for source in select(self.db, value):
|
|
||||||
write(self.db, name, updated, source=source)
|
|
||||||
except (ValueError, IndexError):
|
|
||||||
# ValueError: non-number in IP
|
|
||||||
# IndexError: IP too big
|
|
||||||
self.log.exception("Cannot execute: %s", record)
|
|
||||||
|
|
||||||
def end(self) -> None:
|
|
||||||
self.db.enter_step("end")
|
|
||||||
self.db.save()
|
|
||||||
|
|
||||||
def run(self) -> None:
|
def run(self) -> None:
|
||||||
self.open_db()
|
self.db = database.Database()
|
||||||
if self.autosave_interval > 0:
|
self.db.log = logging.getLogger(f'wr')
|
||||||
next_save = time.time() + self.autosave_interval
|
|
||||||
else:
|
|
||||||
next_save = 0
|
|
||||||
|
|
||||||
self.db.enter_step("block_wait")
|
self.db.enter_step('block_wait')
|
||||||
block: typing.List[Record]
|
block: typing.List[Record]
|
||||||
for block in iter(self.recs_queue.get, None):
|
for block in iter(self.recs_queue.get, None):
|
||||||
|
|
||||||
assert block
|
|
||||||
record: Record
|
record: Record
|
||||||
for record in block:
|
for record in block:
|
||||||
self.exec_record(record)
|
|
||||||
|
|
||||||
if next_save > 0 and time.time() > next_save:
|
select, write, updated, name, value = record
|
||||||
self.log.info("Saving database...")
|
self.db.enter_step('feed_switch')
|
||||||
|
|
||||||
|
for source in select(self.db, value):
|
||||||
|
write(self.db, name, updated, source=source)
|
||||||
|
|
||||||
|
self.db.enter_step('block_wait')
|
||||||
|
|
||||||
|
self.db.enter_step('end')
|
||||||
self.db.save()
|
self.db.save()
|
||||||
self.log.info("Done!")
|
|
||||||
next_save = time.time() + self.autosave_interval
|
|
||||||
|
|
||||||
self.db.enter_step("block_wait")
|
|
||||||
self.end()
|
|
||||||
|
|
||||||
|
|
||||||
class Parser:
|
class Parser():
|
||||||
def __init__(
|
def __init__(self,
|
||||||
self,
|
|
||||||
buf: typing.Any,
|
buf: typing.Any,
|
||||||
recs_queue: multiprocessing.Queue = None,
|
recs_queue: multiprocessing.Queue,
|
||||||
block_size: int = 0,
|
block_size: int,
|
||||||
writer: Writer = None,
|
|
||||||
):
|
):
|
||||||
assert bool(writer) ^ bool(block_size and recs_queue)
|
super(Parser, self).__init__()
|
||||||
self.buf = buf
|
self.buf = buf
|
||||||
self.log = logging.getLogger("pr")
|
self.log = logging.getLogger('pr')
|
||||||
self.recs_queue = recs_queue
|
self.recs_queue = recs_queue
|
||||||
if writer: # No MP
|
|
||||||
self.prof: database.Profiler = writer.db
|
|
||||||
self.register = writer.exec_record
|
|
||||||
else: # MP
|
|
||||||
self.block: typing.List[Record] = list()
|
self.block: typing.List[Record] = list()
|
||||||
self.block_size = block_size
|
self.block_size = block_size
|
||||||
self.prof = database.Profiler()
|
self.prof = database.Profiler()
|
||||||
self.prof.log = logging.getLogger("pr")
|
self.prof.log = logging.getLogger('pr')
|
||||||
self.register = self.add_to_queue
|
|
||||||
|
|
||||||
def add_to_queue(self, record: Record) -> None:
|
def register(self,
|
||||||
self.prof.enter_step("register")
|
rtype: str,
|
||||||
|
timestamp: int,
|
||||||
|
name_str: str,
|
||||||
|
value_str: str,
|
||||||
|
) -> None:
|
||||||
|
self.prof.enter_step('pack')
|
||||||
|
try:
|
||||||
|
select, write, name_packer, value_packer = FUNCTION_MAP[rtype]
|
||||||
|
except KeyError:
|
||||||
|
self.log.exception("Unknown record type")
|
||||||
|
return
|
||||||
|
try:
|
||||||
|
name = name_packer(name_str)
|
||||||
|
except ValueError:
|
||||||
|
self.log.exception("Cannot parse name ('%s' with %s)",
|
||||||
|
name_str, name_packer)
|
||||||
|
return
|
||||||
|
try:
|
||||||
|
value = value_packer(value_str)
|
||||||
|
except ValueError:
|
||||||
|
self.log.exception("Cannot parse value ('%s' with %s)",
|
||||||
|
value_str, value_packer)
|
||||||
|
return
|
||||||
|
record = (select, write, timestamp, name, value)
|
||||||
|
|
||||||
|
self.prof.enter_step('grow_block')
|
||||||
self.block.append(record)
|
self.block.append(record)
|
||||||
if len(self.block) >= self.block_size:
|
if len(self.block) >= self.block_size:
|
||||||
self.prof.enter_step("put_block")
|
self.prof.enter_step('put_block')
|
||||||
assert self.recs_queue
|
|
||||||
self.recs_queue.put(self.block)
|
self.recs_queue.put(self.block)
|
||||||
self.block = list()
|
self.block = list()
|
||||||
|
|
||||||
def run(self) -> None:
|
def run(self) -> None:
|
||||||
self.consume()
|
self.consume()
|
||||||
if self.recs_queue:
|
|
||||||
self.recs_queue.put(self.block)
|
self.recs_queue.put(self.block)
|
||||||
self.prof.profile()
|
self.prof.profile()
|
||||||
|
|
||||||
|
@ -130,17 +122,43 @@ class Parser:
|
||||||
raise NotImplementedError
|
raise NotImplementedError
|
||||||
|
|
||||||
|
|
||||||
class MassDnsParser(Parser):
|
class Rapid7Parser(Parser):
|
||||||
# massdns --output Snrql
|
def consume(self) -> None:
|
||||||
|
data = dict()
|
||||||
|
self.prof.enter_step('iowait')
|
||||||
|
for line in self.buf:
|
||||||
|
self.prof.enter_step('parse_rapid7')
|
||||||
|
split = line.split('"')
|
||||||
|
|
||||||
|
try:
|
||||||
|
for k in range(1, 14, 4):
|
||||||
|
key = split[k]
|
||||||
|
val = split[k+2]
|
||||||
|
data[key] = val
|
||||||
|
|
||||||
|
self.register(
|
||||||
|
data['type'],
|
||||||
|
int(data['timestamp']),
|
||||||
|
data['name'],
|
||||||
|
data['value'],
|
||||||
|
)
|
||||||
|
self.prof.enter_step('iowait')
|
||||||
|
except KeyError:
|
||||||
|
# Sometimes JSON fields are not at the expected positions
|
||||||
|
self.log.exception("Cannot parse: %s", line)
|
||||||
|
|
||||||
|
|
||||||
|
class DnsMassParser(Parser):
|
||||||
|
# dnsmass --output Snrql
|
||||||
# --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
|
# --retry REFUSED,SERVFAIL --resolvers nameservers-ipv4
|
||||||
TYPES = {
|
TYPES = {
|
||||||
"A": (FUNCTION_MAP["a"][0], FUNCTION_MAP["a"][1], -1, None),
|
'A': ('a', -1, None),
|
||||||
# 'AAAA': (FUNCTION_MAP['aaaa'][0], FUNCTION_MAP['aaaa'][1], -1, None),
|
# 'AAAA': ('aaaa', -1, None),
|
||||||
"CNAME": (FUNCTION_MAP["cname"][0], FUNCTION_MAP["cname"][1], -1, -1),
|
'CNAME': ('cname', -1, -1),
|
||||||
}
|
}
|
||||||
|
|
||||||
def consume(self) -> None:
|
def consume(self) -> None:
|
||||||
self.prof.enter_step("parse_massdns")
|
self.prof.enter_step('parse_dnsmass')
|
||||||
timestamp = 0
|
timestamp = 0
|
||||||
header = True
|
header = True
|
||||||
for line in self.buf:
|
for line in self.buf:
|
||||||
|
@ -149,102 +167,63 @@ class MassDnsParser(Parser):
|
||||||
header = True
|
header = True
|
||||||
continue
|
continue
|
||||||
|
|
||||||
split = line.split(" ")
|
split = line.split(' ')
|
||||||
try:
|
try:
|
||||||
if header:
|
if header:
|
||||||
timestamp = int(split[1])
|
timestamp = int(split[1])
|
||||||
header = False
|
header = False
|
||||||
else:
|
else:
|
||||||
select, write, name_offset, value_offset = MassDnsParser.TYPES[
|
rtype, name_offset, value_offset = \
|
||||||
split[1]
|
DnsMassParser.TYPES[split[1]]
|
||||||
]
|
self.register(
|
||||||
record = (
|
rtype,
|
||||||
select,
|
|
||||||
write,
|
|
||||||
timestamp,
|
timestamp,
|
||||||
split[0][:name_offset].lower(),
|
split[0][:name_offset],
|
||||||
split[2][:value_offset].lower(),
|
split[2][:value_offset],
|
||||||
)
|
)
|
||||||
self.register(record)
|
self.prof.enter_step('parse_dnsmass')
|
||||||
self.prof.enter_step("parse_massdns")
|
|
||||||
except KeyError:
|
except KeyError:
|
||||||
continue
|
# Malformed records are less likely to happen,
|
||||||
|
# but we may never be sure
|
||||||
|
self.log.exception("Cannot parse: %s", line)
|
||||||
|
|
||||||
|
|
||||||
PARSERS = {
|
PARSERS = {
|
||||||
"massdns": MassDnsParser,
|
'rapid7': Rapid7Parser,
|
||||||
|
'dnsmass': DnsMassParser,
|
||||||
}
|
}
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == '__main__':
|
||||||
|
|
||||||
# Parsing arguments
|
# Parsing arguments
|
||||||
log = logging.getLogger("feed_dns")
|
log = logging.getLogger('feed_dns')
|
||||||
args_parser = argparse.ArgumentParser(
|
args_parser = argparse.ArgumentParser(
|
||||||
description="Read DNS records and import "
|
description="TODO")
|
||||||
"tracking-relevant data into the database"
|
|
||||||
)
|
|
||||||
args_parser.add_argument("parser", choices=PARSERS.keys(), help="Input format")
|
|
||||||
args_parser.add_argument(
|
args_parser.add_argument(
|
||||||
"-i",
|
'parser',
|
||||||
"--input",
|
choices=PARSERS.keys(),
|
||||||
type=argparse.FileType("r"),
|
help="TODO")
|
||||||
default=sys.stdin,
|
|
||||||
help="Input file",
|
|
||||||
)
|
|
||||||
args_parser.add_argument(
|
args_parser.add_argument(
|
||||||
"-b", "--block-size", type=int, default=1024, help="Performance tuning value"
|
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
|
||||||
)
|
help="TODO")
|
||||||
args_parser.add_argument(
|
args_parser.add_argument(
|
||||||
"-q", "--queue-size", type=int, default=128, help="Performance tuning value"
|
'-j', '--workers', type=int, default=4,
|
||||||
)
|
help="TODO")
|
||||||
args_parser.add_argument(
|
args_parser.add_argument(
|
||||||
"-a",
|
'-b', '--block-size', type=int, default=100,
|
||||||
"--autosave-interval",
|
help="TODO")
|
||||||
type=int,
|
|
||||||
default=900,
|
|
||||||
help="Interval at which the database is saved, in seconds. " "0 to disable.",
|
|
||||||
)
|
|
||||||
args_parser.add_argument(
|
args_parser.add_argument(
|
||||||
"-s",
|
'-q', '--queue-size', type=int, default=10,
|
||||||
"--single-process",
|
help="TODO")
|
||||||
action="store_true",
|
|
||||||
help="Only use one process. " "Might be useful for single core computers.",
|
|
||||||
)
|
|
||||||
args_parser.add_argument(
|
|
||||||
"-4",
|
|
||||||
"--ip4-cache",
|
|
||||||
type=int,
|
|
||||||
default=0,
|
|
||||||
help="RAM cache for faster IPv4 lookup. "
|
|
||||||
"Maximum useful value: 512 MiB (536870912). "
|
|
||||||
"Warning: Depending on the rules, this might already "
|
|
||||||
"be a memory-heavy process, even without the cache.",
|
|
||||||
)
|
|
||||||
args = args_parser.parse_args()
|
args = args_parser.parse_args()
|
||||||
|
|
||||||
parser_cls = PARSERS[args.parser]
|
|
||||||
if args.single_process:
|
|
||||||
writer = Writer(
|
|
||||||
autosave_interval=args.autosave_interval, ip4_cache=args.ip4_cache
|
|
||||||
)
|
|
||||||
parser = parser_cls(args.input, writer=writer)
|
|
||||||
parser.run()
|
|
||||||
writer.end()
|
|
||||||
else:
|
|
||||||
recs_queue: multiprocessing.Queue = multiprocessing.Queue(
|
recs_queue: multiprocessing.Queue = multiprocessing.Queue(
|
||||||
maxsize=args.queue_size
|
maxsize=args.queue_size)
|
||||||
)
|
|
||||||
|
|
||||||
writer = Writer(
|
writer = Writer(recs_queue)
|
||||||
recs_queue,
|
|
||||||
autosave_interval=args.autosave_interval,
|
|
||||||
ip4_cache=args.ip4_cache,
|
|
||||||
)
|
|
||||||
writer.start()
|
writer.start()
|
||||||
|
|
||||||
parser = parser_cls(
|
parser = PARSERS[args.parser](args.input, recs_queue, args.block_size)
|
||||||
args.input, recs_queue=recs_queue, block_size=args.block_size
|
|
||||||
)
|
|
||||||
parser.run()
|
parser.run()
|
||||||
|
|
||||||
recs_queue.put(None)
|
recs_queue.put(None)
|
||||||
|
|
|
@ -6,56 +6,49 @@ import sys
|
||||||
import time
|
import time
|
||||||
import typing
|
import typing
|
||||||
|
|
||||||
FUNCTION_MAP = {
|
FUNCTION_MAP: typing.Dict[str, typing.Tuple[
|
||||||
"zone": database.Database.set_zone,
|
typing.Callable[[database.Database, database.Path, int], None],
|
||||||
"hostname": database.Database.set_hostname,
|
typing.Callable[[str], database.Path],
|
||||||
"asn": database.Database.set_asn,
|
]] = {
|
||||||
"ip4network": database.Database.set_ip4network,
|
'hostname': (database.Database.set_hostname,
|
||||||
"ip4address": database.Database.set_ip4address,
|
database.Database.pack_domain),
|
||||||
|
'zone': (database.Database.set_zone,
|
||||||
|
database.Database.pack_domain),
|
||||||
|
'asn': (database.Database.set_asn,
|
||||||
|
database.Database.pack_asn),
|
||||||
|
'ip4address': (database.Database.set_ip4address,
|
||||||
|
database.Database.pack_ip4address),
|
||||||
|
'ip4network': (database.Database.set_ip4network,
|
||||||
|
database.Database.pack_ip4network),
|
||||||
}
|
}
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == '__main__':
|
||||||
|
|
||||||
# Parsing arguments
|
# Parsing arguments
|
||||||
parser = argparse.ArgumentParser(description="Import base rules to the database")
|
parser = argparse.ArgumentParser(
|
||||||
|
description="TODO")
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"type", choices=FUNCTION_MAP.keys(), help="Type of the input rules"
|
'type',
|
||||||
)
|
choices=FUNCTION_MAP.keys(),
|
||||||
|
help="Type of the input rules")
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-i",
|
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
|
||||||
"--input",
|
help="List of domains to block (with their subdomains)")
|
||||||
type=argparse.FileType("r"),
|
|
||||||
default=sys.stdin,
|
|
||||||
help="File with one rule per line",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-f",
|
'-f', '--first-party', action='store_true',
|
||||||
"--first-party",
|
help="The input only comes from verified first-party sources")
|
||||||
action="store_true",
|
|
||||||
help="The input only comes from verified first-party sources",
|
|
||||||
)
|
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
DB = database.Database()
|
DB = database.Database()
|
||||||
|
|
||||||
fun = FUNCTION_MAP[args.type]
|
fun, packer = FUNCTION_MAP[args.type]
|
||||||
|
|
||||||
source: database.RulePath
|
|
||||||
if args.first_party:
|
|
||||||
source = database.RuleFirstPath()
|
|
||||||
else:
|
|
||||||
source = database.RuleMultiPath()
|
|
||||||
|
|
||||||
for rule in args.input:
|
for rule in args.input:
|
||||||
rule = rule.strip()
|
packed = packer(rule.strip())
|
||||||
try:
|
fun(DB,
|
||||||
fun(
|
packed,
|
||||||
DB,
|
# is_first_party=args.first_party,
|
||||||
rule,
|
|
||||||
source=source,
|
|
||||||
updated=int(time.time()),
|
updated=int(time.time()),
|
||||||
)
|
)
|
||||||
except ValueError:
|
|
||||||
DB.log.error(f"Could not add rule: {rule}")
|
|
||||||
|
|
||||||
DB.save()
|
DB.save()
|
||||||
|
|
|
@ -13,22 +13,30 @@ function dl() {
|
||||||
fi
|
fi
|
||||||
}
|
}
|
||||||
|
|
||||||
log "Retrieving tests…"
|
|
||||||
rm -f tests/*.cache.csv
|
|
||||||
dl https://raw.githubusercontent.com/fukuda-lab/cname_cloaking/master/Subdomain_CNAME-cloaking-based-tracking.csv temp/fukuda.csv
|
|
||||||
(echo "url,allow,deny,comment"; tail -n +2 temp/fukuda.csv | awk -F, '{ print "https://" $2 "/,," $3 "," $5 }') > tests/fukuda.cache.csv
|
|
||||||
|
|
||||||
log "Retrieving rules…"
|
log "Retrieving rules…"
|
||||||
rm -f rules*/*.cache.*
|
rm -f rules*/*.cache.*
|
||||||
dl https://easylist.to/easylist/easyprivacy.txt rules_adblock/easyprivacy.cache.txt
|
dl https://easylist.to/easylist/easyprivacy.txt rules_adblock/easyprivacy.cache.txt
|
||||||
dl https://filters.adtidy.org/extension/chromium/filters/3.txt rules_adblock/adguard.cache.txt
|
# From firebog.net Tracking & Telemetry Lists
|
||||||
|
# dl https://v.firebog.net/hosts/Prigent-Ads.txt rules/prigent-ads.cache.list
|
||||||
log "Retrieving TLD list…"
|
# dl https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-blocklist.txt rules/notrack-blocklist.cache.list
|
||||||
dl http://data.iana.org/TLD/tlds-alpha-by-domain.txt temp/all_tld.temp.list
|
# False positives: https://github.com/WaLLy3K/wally3k.github.io/issues/73 -> 69.media.tumblr.com chicdn.net
|
||||||
grep -v '^#' temp/all_tld.temp.list | awk '{print tolower($0)}' > temp/all_tld.list
|
dl https://raw.githubusercontent.com/StevenBlack/hosts/master/data/add.2o7Net/hosts rules_hosts/add2o7.cache.txt
|
||||||
|
dl https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt rules_hosts/spy.cache.txt
|
||||||
|
# dl https://raw.githubusercontent.com/Kees1958/WS3_annual_most_used_survey_blocklist/master/w3tech_hostfile.txt rules/w3tech.cache.list
|
||||||
|
# False positives: agreements.apple.com -> edgekey.net
|
||||||
|
# dl https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt rules_hosts/ads-and-tracking-extended.cache.txt # Lots of false-positives
|
||||||
|
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/android-tracking.txt rules_hosts/android-tracking.cache.txt
|
||||||
|
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/SmartTV.txt rules_hosts/smart-tv.cache.txt
|
||||||
|
# dl https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/AmazonFireTV.txt rules_hosts/amazon-fire-tv.cache.txt
|
||||||
|
|
||||||
log "Retrieving nameservers…"
|
log "Retrieving nameservers…"
|
||||||
dl https://public-dns.info/nameservers.txt nameservers/public-dns.cache.list
|
rm -f nameservers
|
||||||
|
touch nameservers
|
||||||
|
[ -f nameservers.head ] && cat nameservers.head >> nameservers
|
||||||
|
dl https://public-dns.info/nameservers.txt nameservers.temp
|
||||||
|
sort -R nameservers.temp >> nameservers
|
||||||
|
rm nameservers.temp
|
||||||
|
|
||||||
log "Retrieving top subdomains…"
|
log "Retrieving top subdomains…"
|
||||||
dl http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip top-1m.csv.zip
|
dl http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip top-1m.csv.zip
|
||||||
|
@ -38,8 +46,9 @@ rm top-1m.csv top-1m.csv.zip
|
||||||
if [ -f subdomains/cisco-umbrella_popularity.cache.list ]
|
if [ -f subdomains/cisco-umbrella_popularity.cache.list ]
|
||||||
then
|
then
|
||||||
cp subdomains/cisco-umbrella_popularity.cache.list temp/cisco-umbrella_popularity.old.list
|
cp subdomains/cisco-umbrella_popularity.cache.list temp/cisco-umbrella_popularity.old.list
|
||||||
pv -f temp/cisco-umbrella_popularity.old.list temp/cisco-umbrella_popularity.fresh.list | sort -u > subdomains/cisco-umbrella_popularity.cache.list
|
pv temp/cisco-umbrella_popularity.old.list temp/cisco-umbrella_popularity.fresh.list | sort -u > subdomains/cisco-umbrella_popularity.cache.list
|
||||||
rm temp/cisco-umbrella_popularity.old.list temp/cisco-umbrella_popularity.fresh.list
|
rm temp/cisco-umbrella_popularity.old.list temp/cisco-umbrella_popularity.fresh.list
|
||||||
else
|
else
|
||||||
mv temp/cisco-umbrella_popularity.fresh.list subdomains/cisco-umbrella_popularity.cache.list
|
mv temp/cisco-umbrella_popularity.fresh.list subdomains/cisco-umbrella_popularity.cache.list
|
||||||
fi
|
fi
|
||||||
|
dl https://www.orwell1984.today/cname/eulerian.net.txt subdomains/orwell-eulerian-cname-list.cache.list
|
||||||
|
|
160
filter_subdomains.py
Executable file
|
@ -0,0 +1,160 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
# pylint: disable=C0103
|
||||||
|
|
||||||
|
"""
|
||||||
|
From a list of subdomains, output only
|
||||||
|
the ones resolving to a first-party tracker.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import sys
|
||||||
|
import progressbar
|
||||||
|
import csv
|
||||||
|
import typing
|
||||||
|
import ipaddress
|
||||||
|
|
||||||
|
# DomainRule = typing.Union[bool, typing.Dict[str, 'DomainRule']]
|
||||||
|
DomainRule = typing.Union[bool, typing.Dict]
|
||||||
|
# IpRule = typing.Union[bool, typing.Dict[int, 'DomainRule']]
|
||||||
|
IpRule = typing.Union[bool, typing.Dict]
|
||||||
|
|
||||||
|
RULES_DICT: DomainRule = dict()
|
||||||
|
RULES_IP_DICT: IpRule = dict()
|
||||||
|
|
||||||
|
|
||||||
|
def get_bits(address: ipaddress.IPv4Address) -> typing.Iterator[int]:
|
||||||
|
for char in address.packed:
|
||||||
|
for i in range(7, -1, -1):
|
||||||
|
yield (char >> i) & 0b1
|
||||||
|
|
||||||
|
|
||||||
|
def subdomain_matching(subdomain: str) -> bool:
|
||||||
|
parts = subdomain.split('.')
|
||||||
|
parts.reverse()
|
||||||
|
dic = RULES_DICT
|
||||||
|
for part in parts:
|
||||||
|
if isinstance(dic, bool) or part not in dic:
|
||||||
|
break
|
||||||
|
dic = dic[part]
|
||||||
|
if isinstance(dic, bool):
|
||||||
|
return dic
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def ip_matching(ip_str: str) -> bool:
|
||||||
|
ip = ipaddress.ip_address(ip_str)
|
||||||
|
dic = RULES_IP_DICT
|
||||||
|
i = 0
|
||||||
|
for bit in get_bits(ip):
|
||||||
|
i += 1
|
||||||
|
if isinstance(dic, bool) or bit not in dic:
|
||||||
|
break
|
||||||
|
dic = dic[bit]
|
||||||
|
if isinstance(dic, bool):
|
||||||
|
return dic
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def get_matching(chain: typing.List[str], no_explicit: bool = False
|
||||||
|
) -> typing.Iterable[str]:
|
||||||
|
if len(chain) <= 1:
|
||||||
|
return
|
||||||
|
initial = chain[0]
|
||||||
|
cname_destinations = chain[1:-1]
|
||||||
|
a_destination = chain[-1]
|
||||||
|
initial_matching = subdomain_matching(initial)
|
||||||
|
if no_explicit and initial_matching:
|
||||||
|
return
|
||||||
|
cname_matching = any(map(subdomain_matching, cname_destinations))
|
||||||
|
if cname_matching or initial_matching or ip_matching(a_destination):
|
||||||
|
yield initial
|
||||||
|
|
||||||
|
|
||||||
|
def register_rule(subdomain: str) -> None:
|
||||||
|
# Make a tree with domain parts
|
||||||
|
parts = subdomain.split('.')
|
||||||
|
parts.reverse()
|
||||||
|
dic = RULES_DICT
|
||||||
|
last_part = len(parts) - 1
|
||||||
|
for p, part in enumerate(parts):
|
||||||
|
if isinstance(dic, bool):
|
||||||
|
return
|
||||||
|
if p == last_part:
|
||||||
|
dic[part] = True
|
||||||
|
else:
|
||||||
|
dic.setdefault(part, dict())
|
||||||
|
dic = dic[part]
|
||||||
|
|
||||||
|
|
||||||
|
def register_rule_ip(network: str) -> None:
|
||||||
|
net = ipaddress.ip_network(network)
|
||||||
|
ip = net.network_address
|
||||||
|
dic = RULES_IP_DICT
|
||||||
|
last_bit = net.prefixlen - 1
|
||||||
|
for b, bit in enumerate(get_bits(ip)):
|
||||||
|
if isinstance(dic, bool):
|
||||||
|
return
|
||||||
|
if b == last_bit:
|
||||||
|
dic[bit] = True
|
||||||
|
else:
|
||||||
|
dic.setdefault(bit, dict())
|
||||||
|
dic = dic[bit]
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
|
||||||
|
# Parsing arguments
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description="Filter first-party trackers from a list of subdomains")
|
||||||
|
parser.add_argument(
|
||||||
|
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
|
||||||
|
help="Input file with DNS chains")
|
||||||
|
parser.add_argument(
|
||||||
|
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
|
||||||
|
help="Output file with one tracking subdomain per line")
|
||||||
|
parser.add_argument(
|
||||||
|
'-n', '--no-explicit', action='store_true',
|
||||||
|
help="Don't output domains already blocked with rules without CNAME")
|
||||||
|
parser.add_argument(
|
||||||
|
'-r', '--rules', type=argparse.FileType('r'),
|
||||||
|
help="List of domains to block (with their subdomains)")
|
||||||
|
parser.add_argument(
|
||||||
|
'-p', '--rules-ip', type=argparse.FileType('r'),
|
||||||
|
help="List of IP ranges to block")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
# Progress bar
|
||||||
|
widgets = [
|
||||||
|
progressbar.Percentage(),
|
||||||
|
' ', progressbar.SimpleProgress(),
|
||||||
|
' ', progressbar.Bar(),
|
||||||
|
' ', progressbar.Timer(),
|
||||||
|
' ', progressbar.AdaptiveTransferSpeed(unit='req'),
|
||||||
|
' ', progressbar.AdaptiveETA(),
|
||||||
|
]
|
||||||
|
progress = progressbar.ProgressBar(widgets=widgets)
|
||||||
|
|
||||||
|
# Reading rules
|
||||||
|
if args.rules:
|
||||||
|
for rule in args.rules:
|
||||||
|
register_rule(rule.strip())
|
||||||
|
if args.rules_ip:
|
||||||
|
for rule in args.rules_ip:
|
||||||
|
register_rule_ip(rule.strip())
|
||||||
|
|
||||||
|
# Approximating line count
|
||||||
|
if args.input.seekable():
|
||||||
|
lines = 0
|
||||||
|
for line in args.input:
|
||||||
|
lines += 1
|
||||||
|
progress.max_value = lines
|
||||||
|
args.input.seek(0)
|
||||||
|
|
||||||
|
# Reading domains to filter
|
||||||
|
reader = csv.reader(args.input)
|
||||||
|
progress.start()
|
||||||
|
for chain in reader:
|
||||||
|
for match in get_matching(chain, no_explicit=args.no_explicit):
|
||||||
|
print(match, file=args.output)
|
||||||
|
progress.update(progress.value + 1)
|
||||||
|
progress.finish()
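Review note on `filter_subdomains.py`: the rule lookup above walks a tree keyed by domain labels, starting from the TLD, so a rule on a zone also covers all of its subdomains. The following is a minimal stand-alone sketch of that idea, not the file's exact code. It assumes the rule tree is passed explicitly as a `rules` argument (the file itself uses the module-level `RULES_DICT`), and the `.example` domains are made up for illustration.

```python
import typing

# A node is either True (everything below this point is blocked)
# or a dict mapping the next domain label to a child node.
DomainRule = typing.Union[bool, typing.Dict[str, typing.Any]]


def register_rule(rules: typing.Dict[str, typing.Any], domain: str) -> None:
    """Insert a blocked domain into the tree, TLD first."""
    parts = domain.split('.')
    parts.reverse()
    node = rules
    last_part = len(parts) - 1
    for p, part in enumerate(parts):
        if p == last_part:
            # Blocking a zone replaces any more specific rules below it
            node[part] = True
            return
        child = node.setdefault(part, dict())
        if child is True:
            # A parent zone is already blocked, nothing to add
            return
        node = child


def subdomain_matching(rules: DomainRule, subdomain: str) -> bool:
    """Return True if the subdomain, or one of its parent zones, is blocked."""
    node = rules
    for part in reversed(subdomain.split('.')):
        if isinstance(node, bool) or part not in node:
            break
        node = node[part]
    return node if isinstance(node, bool) else False


if __name__ == '__main__':
    tree: typing.Dict[str, typing.Any] = dict()
    register_rule(tree, 'tracker.example')
    print(subdomain_matching(tree, 'a.b.tracker.example'))
    print(subdomain_matching(tree, 'other.example'))
```

The IP matching in the same file follows the same pattern, with the bits of the address taking the place of domain labels.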
|
66
filter_subdomains.sh
Executable file
|
@ -0,0 +1,66 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
|
||||||
|
function log() {
|
||||||
|
echo -e "\033[33m$@\033[0m"
|
||||||
|
}
|
||||||
|
|
||||||
|
log "Pruning old data…"
|
||||||
|
./database.py --prune
|
||||||
|
|
||||||
|
log "Recounting references…"
|
||||||
|
./database.py --references
|
||||||
|
|
||||||
|
log "Exporting lists…"
|
||||||
|
./export.py --first-party --output dist/firstparty-trackers.txt
|
||||||
|
./export.py --first-party --end-chain --output dist/firstparty-only-trackers.txt
|
||||||
|
./export.py --output dist/multiparty-trackers.txt
|
||||||
|
./export.py --end-chain --output dist/multiparty-only-trackers.txt
|
||||||
|
|
||||||
|
log "Generating hosts lists…"
|
||||||
|
./export.py --rules --count --first-party > temp/count_rules_firstparty.txt
|
||||||
|
./export.py --rules --count > temp/count_rules_multiparty.txt
|
||||||
|
function generate_hosts {
|
||||||
|
basename="$1"
|
||||||
|
description="$2"
|
||||||
|
description2="$3"
|
||||||
|
|
||||||
|
(
|
||||||
|
echo "# First-party trackers host list"
|
||||||
|
echo "# $description"
|
||||||
|
echo "# $description2"
|
||||||
|
echo "#"
|
||||||
|
echo "# About first-party trackers: https://git.frogeye.fr/geoffrey/eulaurarien#whats-a-first-party-tracker"
|
||||||
|
echo "# Source code: https://git.frogeye.fr/geoffrey/eulaurarien"
|
||||||
|
echo "#"
|
||||||
|
echo "# In case of false positives/negatives, or any other question,"
|
||||||
|
echo "# contact me the way you like: https://geoffrey.frogeye.fr"
|
||||||
|
echo "#"
|
||||||
|
echo "# Latest version:"
|
||||||
|
echo "# - First-party trackers : https://hostfiles.frogeye.fr/firstparty-trackers-hosts.txt"
|
||||||
|
echo "# - … excluding redirected: https://hostfiles.frogeye.fr/firstparty-only-trackers-hosts.txt"
|
||||||
|
echo "# - First and third party : https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt"
|
||||||
|
echo "# - … excluding redirected: https://hostfiles.frogeye.fr/multiparty-only-trackers-hosts.txt"
|
||||||
|
echo '# (you can remove `-hosts` to get the raw list)'
|
||||||
|
echo "#"
|
||||||
|
echo "# Generation date: $(date -Isec)"
|
||||||
|
echo "# Generation software: eulaurarien $(git describe --tags)"
|
||||||
|
echo "# Number of source websites: $(wc -l temp/all_websites.list | cut -d' ' -f1)"
|
||||||
|
echo "# Number of source subdomains: $(wc -l temp/all_subdomains.list | cut -d' ' -f1)"
|
||||||
|
echo "# Number of source DNS records: ~2M + $(wc -l temp/all_resolved.json | cut -d' ' -f1)"
|
||||||
|
echo "#"
|
||||||
|
echo "# Known first-party trackers: $(cat temp/count_rules_firstparty.txt)"
|
||||||
|
echo "# Number of first-party hostnames: $(wc -l dist/firstparty-trackers.txt | cut -d' ' -f1)"
|
||||||
|
echo "# … excluding redirected: $(wc -l dist/firstparty-only-trackers.txt | cut -d' ' -f1)"
|
||||||
|
echo "#"
|
||||||
|
echo "# Known multi-party trackers: $(cat temp/count_rules_multiparty.txt)"
|
||||||
|
echo "# Number of multi-party hostnames: $(wc -l dist/multiparty-trackers.txt | cut -d' ' -f1)"
|
||||||
|
echo "# … excluding redirected: $(wc -l dist/multiparty-only-trackers.txt | cut -d' ' -f1)"
|
||||||
|
echo
|
||||||
|
sed 's|^|0.0.0.0 |' "dist/$basename.txt"
|
||||||
|
) > "dist/$basename-hosts.txt"
|
||||||
|
}
|
||||||
|
|
||||||
|
generate_hosts "firstparty-trackers" "Generated from a curated list of first-party trackers" ""
|
||||||
|
generate_hosts "firstparty-only-trackers" "Generated from a curated list of first-party trackers" "Only contain the first chain of redirection."
|
||||||
|
generate_hosts "multiparty-trackers" "Generated from known third-party trackers." "Also contains trackers used as third-party."
|
||||||
|
generate_hosts "multiparty-only-trackers" "Generated from known third-party trackers." "Do not contain trackers used in third-party. Use in combination with third-party lists."
|
|
@ -1,25 +0,0 @@
|
||||||
#!/usr/bin/env python3
|
|
||||||
|
|
||||||
import markdown2
|
|
||||||
|
|
||||||
extras = ["header-ids"]
|
|
||||||
|
|
||||||
with open("dist/README.md", "r") as fdesc:
|
|
||||||
body = markdown2.markdown(fdesc.read(), extras=extras)
|
|
||||||
|
|
||||||
output = f"""<!DOCTYPE html>
|
|
||||||
<html lang="en">
|
|
||||||
<head>
|
|
||||||
<title>Geoffrey Frogeye's block list of first-party trackers</title>
|
|
||||||
<meta charset="utf-8">
|
|
||||||
<meta name="author" content="Geoffrey 'Frogeye' Preud'homme" />
|
|
||||||
<link rel="stylesheet" type="text/css" href="markdown7.min.css">
|
|
||||||
</head>
|
|
||||||
<body>
|
|
||||||
{body}
|
|
||||||
</body>
|
|
||||||
</html>
|
|
||||||
"""
|
|
||||||
|
|
||||||
with open("dist/index.html", "w") as fdesc:
|
|
||||||
fdesc.write(output)
|
|
|
@ -5,12 +5,12 @@ function log() {
|
||||||
}
|
}
|
||||||
|
|
||||||
log "Importing rules…"
|
log "Importing rules…"
|
||||||
date +%s > "last_updates/rules.txt"
|
BEFORE="$(date +%s)"
|
||||||
cat rules_adblock/*.txt | grep -v '^!' | grep -v '^\[Adblock' | ./adblock_to_domain_list.py | ./feed_rules.py zone
|
# cat rules_adblock/*.txt | grep -v '^!' | grep -v '^\[Adblock' | ./adblock_to_domain_list.py | ./feed_rules.py zone
|
||||||
cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 | ./feed_rules.py zone
|
# cat rules_hosts/*.txt | grep -v '^#' | grep -v '^$' | cut -d ' ' -f2 | ./feed_rules.py zone
|
||||||
cat rules/*.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone
|
# cat rules/*.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone
|
||||||
cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network
|
# cat rules_ip/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network
|
||||||
cat rules_asn/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn
|
# cat rules_asn/*.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py asn
|
||||||
|
|
||||||
cat rules/first-party.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone --first-party
|
cat rules/first-party.list | grep -v '^#' | grep -v '^$' | ./feed_rules.py zone --first-party
|
||||||
cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network --first-party
|
cat rules_ip/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py ip4network --first-party
|
||||||
|
@ -18,3 +18,5 @@ cat rules_asn/first-party.txt | grep -v '^#' | grep -v '^$' | ./feed_rules.py as
|
||||||
|
|
||||||
./feed_asn.py
|
./feed_asn.py
|
||||||
|
|
||||||
|
log "Pruning old rules…"
|
||||||
|
./db.py --prune --prune-before "$BEFORE" --prune-base
|
||||||
|
|
1
last_updates/.gitignore
vendored
|
@ -1 +0,0 @@
|
||||||
*.txt
|
|
2
nameservers/.gitignore
vendored
|
@ -1,2 +0,0 @@
|
||||||
*.custom.list
|
|
||||||
*.cache.list
|
|
|
@ -1,24 +0,0 @@
|
||||||
8.8.8.8
|
|
||||||
8.8.4.4
|
|
||||||
2001:4860:4860:0:0:0:0:8888
|
|
||||||
2001:4860:4860:0:0:0:0:8844
|
|
||||||
208.67.222.222
|
|
||||||
208.67.220.220
|
|
||||||
2620:119:35::35
|
|
||||||
2620:119:53::53
|
|
||||||
4.2.2.1
|
|
||||||
4.2.2.2
|
|
||||||
8.26.56.26
|
|
||||||
8.20.247.20
|
|
||||||
84.200.69.80
|
|
||||||
84.200.70.40
|
|
||||||
2001:1608:10:25:0:0:1c04:b12f
|
|
||||||
2001:1608:10:25:0:0:9249:d69b
|
|
||||||
9.9.9.10
|
|
||||||
149.112.112.10
|
|
||||||
2620:fe::10
|
|
||||||
2620:fe::fe:10
|
|
||||||
1.1.1.1
|
|
||||||
1.0.0.1
|
|
||||||
2606:4700:4700::1111
|
|
||||||
2606:4700:4700::1001
|
|
22
new_workflow.sh
Executable file
|
@ -0,0 +1,22 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
|
||||||
|
function log() {
|
||||||
|
echo -e "\033[33m$@\033[0m"
|
||||||
|
}
|
||||||
|
|
||||||
|
./fetch_resources.sh
|
||||||
|
./import_rules.sh
|
||||||
|
|
||||||
|
# TODO Fetch 'em
|
||||||
|
log "Reading PTR records…"
|
||||||
|
pv ptr.json.gz | gunzip | ./feed_dns.py
|
||||||
|
log "Reading A records…"
|
||||||
|
pv a.json.gz | gunzip | ./feed_dns.py
|
||||||
|
log "Reading CNAME records…"
|
||||||
|
pv cname.json.gz | gunzip | ./feed_dns.py
|
||||||
|
|
||||||
|
log "Pruning old data…"
|
||||||
|
./database.py --prune
|
||||||
|
|
||||||
|
./filter_subdomains.sh
|
||||||
|
|
9
prune.sh
|
@ -1,9 +0,0 @@
|
||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
function log() {
|
|
||||||
echo -e "\033[33m$@\033[0m"
|
|
||||||
}
|
|
||||||
|
|
||||||
oldest="$(cat last_updates/*.txt | sort -n | head -1)"
|
|
||||||
log "Pruning every record before ${oldest}…"
|
|
||||||
./db.py --prune --prune-before "$oldest"
|
|
|
@ -1,4 +0,0 @@
|
||||||
coloredlogs>=10
|
|
||||||
markdown2>=2.4,<3
|
|
||||||
numpy>=1.21,<2
|
|
||||||
python-abp>=0.2,<0.3
|
|
|
@ -1,24 +1,12 @@
|
||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
|
|
||||||
source .env.default
|
|
||||||
source .env
|
|
||||||
|
|
||||||
function log() {
|
function log() {
|
||||||
echo -e "\033[33m$@\033[0m"
|
echo -e "\033[33m$@\033[0m"
|
||||||
}
|
}
|
||||||
|
|
||||||
log "Compiling nameservers…"
|
log "Compiling locally known subdomain…"
|
||||||
pv -f nameservers/*.list | ./validate_list.py --ip4 | sort -u > temp/all_nameservers_ip4.list
|
|
||||||
|
|
||||||
log "Compiling subdomains…"
|
|
||||||
# Sort by last character to utilize the DNS server caching mechanism
|
# Sort by last character to utilize the DNS server caching mechanism
|
||||||
# (not as efficient with massdns but it's almost free so why not)
|
pv subdomains/*.list | sed 's/\r$//' | rev | sort -u | rev > temp/all_subdomains.list
|
||||||
pv -f subdomains/*.list | ./validate_list.py --domain | rev | sort -u | rev > temp/all_subdomains.list
|
log "Resolving locally known subdomain…"
|
||||||
|
pv temp/all_subdomains.list | ./resolve_subdomains.py --output temp/all_resolved.csv
|
||||||
|
|
||||||
log "Resolving subdomain…"
|
|
||||||
date +%s > "last_updates/massdns.txt"
|
|
||||||
"$MASSDNS_BINARY" --output Snrql --hashmap-size "$MASSDNS_HASHMAP_SIZE" --resolvers temp/all_nameservers_ip4.list --outfile temp/all_resolved.txt temp/all_subdomains.list
|
|
||||||
|
|
||||||
log "Importing into database…"
|
|
||||||
[ $SINGLE_PROCESS -eq 1 ] && EXTRA_ARGS="--single-process"
|
|
||||||
pv -f temp/all_resolved.txt | ./feed_dns.py massdns --ip4-cache "$CACHE_SIZE" $EXTRA_ARGS
|
|
||||||
|
|
|
@ -12,80 +12,13 @@ storetail.io
|
||||||
# Keyade
|
# Keyade
|
||||||
keyade.com
|
keyade.com
|
||||||
# Adobe Experience Cloud
|
# Adobe Experience Cloud
|
||||||
# https://experienceleague.adobe.com/docs/analytics/implementation/vars/config-vars/trackingserversecure.html?lang=en#ssl-tracking-server-in-adobe-experience-platform-launch
|
|
||||||
omtrdc.net
|
omtrdc.net
|
||||||
2o7.net
|
2o7.net
|
||||||
data.adobedc.net
|
# ThreatMetrix
|
||||||
sc.adobedc.net
|
online-metrix.net
|
||||||
# Webtrekk
|
# Webtrekk
|
||||||
wt-eu02.net
|
wt-eu02.net
|
||||||
webtrekk.net
|
|
||||||
# Otto Group
|
# Otto Group
|
||||||
oghub.io
|
oghub.io
|
||||||
# Intent Media
|
# ???
|
||||||
partner.intentmedia.net
|
partner.intentmedia.net
|
||||||
# Wizaly
|
|
||||||
wizaly.com
|
|
||||||
# Commanders Act
|
|
||||||
tagcommander.com
|
|
||||||
# Ingenious Technologies
|
|
||||||
affex.org
|
|
||||||
# TraceDock
|
|
||||||
a351fec2c318c11ea9b9b0a0ae18fb0b-1529426863.eu-central-1.elb.amazonaws.com
|
|
||||||
a5e652663674a11e997c60ac8a4ec150-1684524385.eu-central-1.elb.amazonaws.com
|
|
||||||
a88045584548111e997c60ac8a4ec150-1610510072.eu-central-1.elb.amazonaws.com
|
|
||||||
afc4d9aa2a91d11e997c60ac8a4ec150-2082092489.eu-central-1.elb.amazonaws.com
|
|
||||||
# A8
|
|
||||||
trck.a8.net
|
|
||||||
# AD EBiS
|
|
||||||
# https://prtimes.jp/main/html/rd/p/000000215.000009812.html
|
|
||||||
ebis.ne.jp
|
|
||||||
# GENIEE
|
|
||||||
genieesspv.jp
|
|
||||||
# SP-Prod
|
|
||||||
sp-prod.net
|
|
||||||
# Act-On Software
|
|
||||||
actonsoftware.com
|
|
||||||
actonservice.com
|
|
||||||
# eum-appdynamics.com
|
|
||||||
eum-appdynamics.com
|
|
||||||
# Extole
|
|
||||||
extole.io
|
|
||||||
extole.com
|
|
||||||
# Eloqua
|
|
||||||
hs.eloqua.com
|
|
||||||
# segment.com
|
|
||||||
xid.segment.com
|
|
||||||
# exponea.com
|
|
||||||
exponea.com
|
|
||||||
# adclear.net
|
|
||||||
adclear.net
|
|
||||||
# contentsfeed.com
|
|
||||||
contentsfeed.com
|
|
||||||
# postaffiliatepro.com
|
|
||||||
postaffiliatepro.com
|
|
||||||
# Sugar Market (Salesfusion)
|
|
||||||
msgapp.com
|
|
||||||
# Exactag
|
|
||||||
exactag.com
|
|
||||||
# GMO Internet Group
|
|
||||||
ad-cloud.jp
|
|
||||||
# Pardot
|
|
||||||
pardot.com
|
|
||||||
# Fathom
|
|
||||||
# https://usefathom.com/docs/settings/custom-domains
|
|
||||||
starman.fathomdns.com
|
|
||||||
# Lead Forensics
|
|
||||||
# https://www.reddit.com/r/pihole/comments/g7qv3e/leadforensics_tracking_domains_blacklist/
|
|
||||||
# No real-world data but the website doesn't hide what it does
|
|
||||||
ghochv3eng.trafficmanager.net
|
|
||||||
# Branch.io
|
|
||||||
thirdparty.bnc.lt
|
|
||||||
# Plausible.io
|
|
||||||
custom.plausible.io
|
|
||||||
# DataUnlocker
|
|
||||||
# Bit different as it is a proxy to non first-party trackers scripts
|
|
||||||
# but it fits I guess.
|
|
||||||
smartproxy.dataunlocker.com
|
|
||||||
# SAS
|
|
||||||
ci360.sas.com
|
|
||||||
|
|
|
@ -4,7 +4,7 @@ AS50234
|
||||||
AS44788
|
AS44788
|
||||||
AS19750
|
AS19750
|
||||||
AS55569
|
AS55569
|
||||||
|
# ThreatMetrix
|
||||||
|
AS30286
|
||||||
# Webtrekk
|
# Webtrekk
|
||||||
AS60164
|
AS60164
|
||||||
# Act-On Software
|
|
||||||
AS393648
|
|
||||||
|
|
0
rules_ip/first-party.txt
Normal file
75
run_tests.py
|
@ -1,75 +0,0 @@
|
||||||
#!/usr/bin/env python3
|
|
||||||
|
|
||||||
import database
|
|
||||||
import os
|
|
||||||
import logging
|
|
||||||
import csv
|
|
||||||
|
|
||||||
TESTS_DIR = "tests"
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
|
|
||||||
DB = database.Database()
|
|
||||||
log = logging.getLogger("tests")
|
|
||||||
|
|
||||||
for filename in os.listdir(TESTS_DIR):
|
|
||||||
if not filename.lower().endswith(".csv"):
|
|
||||||
continue
|
|
||||||
log.info("")
|
|
||||||
log.info("Running tests from %s", filename)
|
|
||||||
path = os.path.join(TESTS_DIR, filename)
|
|
||||||
with open(path, "rt") as fdesc:
|
|
||||||
count_ent = 0
|
|
||||||
count_all = 0
|
|
||||||
count_den = 0
|
|
||||||
pass_ent = 0
|
|
||||||
pass_all = 0
|
|
||||||
pass_den = 0
|
|
||||||
reader = csv.DictReader(fdesc)
|
|
||||||
for test in reader:
|
|
||||||
log.debug("Testing %s (%s)", test["url"], test["comment"])
|
|
||||||
count_ent += 1
|
|
||||||
passed = True
|
|
||||||
|
|
||||||
for allow in test["allow"].split(":"):
|
|
||||||
if not allow:
|
|
||||||
continue
|
|
||||||
count_all += 1
|
|
||||||
if any(DB.get_domain(allow)):
|
|
||||||
log.error("False positive: %s", allow)
|
|
||||||
passed = False
|
|
||||||
else:
|
|
||||||
pass_all += 1
|
|
||||||
|
|
||||||
for deny in test["deny"].split(":"):
|
|
||||||
if not deny:
|
|
||||||
continue
|
|
||||||
count_den += 1
|
|
||||||
if not any(DB.get_domain(deny)):
|
|
||||||
log.error("False negative: %s", deny)
|
|
||||||
passed = False
|
|
||||||
else:
|
|
||||||
pass_den += 1
|
|
||||||
|
|
||||||
if passed:
|
|
||||||
pass_ent += 1
|
|
||||||
perc_ent = (100 * pass_ent / count_ent) if count_ent else 100
|
|
||||||
perc_all = (100 * pass_all / count_all) if count_all else 100
|
|
||||||
perc_den = (100 * pass_den / count_den) if count_den else 100
|
|
||||||
log.info(
|
|
||||||
(
|
|
||||||
"%s: Entries %d/%d (%.2f%%)"
|
|
||||||
" | Allow %d/%d (%.2f%%)"
|
|
||||||
"| Deny %d/%d (%.2f%%)"
|
|
||||||
),
|
|
||||||
filename,
|
|
||||||
pass_ent,
|
|
||||||
count_ent,
|
|
||||||
perc_ent,
|
|
||||||
pass_all,
|
|
||||||
count_all,
|
|
||||||
perc_all,
|
|
||||||
pass_den,
|
|
||||||
count_den,
|
|
||||||
perc_den,
|
|
||||||
)
|
|
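Note on the removed `run_tests.py`: each CSV row lists colon-separated `allow` and `deny` hostnames, and a row passes only if no `allow` entry is matched and every `deny` entry is matched. The sketch below reproduces that check in isolation. It assumes a hypothetical `is_blocked` callable in place of `database.Database.get_domain`, whose implementation is not shown in this diff, and the `BLOCKED` set is made-up sample data.

```python
import csv
import io


def check_row(row, is_blocked):
    """Return (false_positives, false_negatives) for one test row."""
    false_pos = [d for d in row["allow"].split(":") if d and is_blocked(d)]
    false_neg = [d for d in row["deny"].split(":") if d and not is_blocked(d)]
    return false_pos, false_neg


# Two rows copied from the test CSV in this diff.
SAMPLE = """url,allow,deny,comment
https://www.red-by-sfr.fr/,static.s-sfr.fr,nrg.red-by-sfr.fr,Eulerian
https://www.mytoys.de/,,web.mytoys.de,Webtrekk
"""

# Hypothetical stand-in for the database: only this hostname is "blocked".
BLOCKED = {"nrg.red-by-sfr.fr"}


def run(sample=SAMPLE, blocked=BLOCKED):
    """Map each tested URL to its (false_positives, false_negatives)."""
    reader = csv.DictReader(io.StringIO(sample))
    return {row["url"]: check_row(row, blocked.__contains__) for row in reader}


if __name__ == "__main__":
    for url, (false_pos, false_neg) in run().items():
        status = "ok" if not (false_pos or false_neg) else "FAIL"
        print(status, url, false_pos, false_neg)
```

Empty fields are skipped, so a row with no `allow` entries contributes nothing to the allow count, as in the original loop.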
1
tests/.gitignore
vendored
|
@ -1 +0,0 @@
|
||||||
*.cache.csv
|
|
|
@ -1,6 +1,6 @@
|
||||||
url,allow,deny,comment
|
url,white,black,comment
|
||||||
https://support.apple.com,support.apple.com,,EdgeKey / AkamaiEdge
|
https://support.apple.com,support.apple.com,,EdgeKey / AkamaiEdge
|
||||||
https://www.pinterest.fr/,i.pinimg.com,,Cedexis
|
https://www.pinterest.fr/,i.pinimg.com,,Cedexis
|
||||||
|
https://www.pinterest.fr/,i.pinimg.com,,Cedexis
|
||||||
https://www.tumblr.com/,66.media.tumblr.com,,ChiCDN
|
https://www.tumblr.com/,66.media.tumblr.com,,ChiCDN
|
||||||
https://www.skype.com/fr/,www.skype.com,,TrafficManager
|
https://www.skype.com/fr/,www.skype.com,,TrafficManager
|
||||||
https://www.mitsubishicars.com/,www.mitsubishicars.com,,Tracking domain as reverse DNS
|
|
||||||
|
|
|
|
@ -1,28 +1,7 @@
|
||||||
url,allow,deny,comment
|
url,white,black,comment
|
||||||
https://www.red-by-sfr.fr/,static.s-sfr.fr,nrg.red-by-sfr.fr,Eulerian
|
https://www.red-by-sfr.fr/,static.s-sfr.fr,nrg.red-by-sfr.fr,Eulerian
|
||||||
https://www.cbc.ca/,,smetrics.cbc.ca,2o7 | Ominuture | Adobe Experience Cloud
|
https://www.cbc.ca/,,smetrics.cbc.ca,2o7 | Ominuture | Adobe Experience Cloud
|
||||||
|
https://www.discover.com/,,content.discover.com,ThreatMetrix
|
||||||
https://www.mytoys.de/,,web.mytoys.de,Webtrekk
|
https://www.mytoys.de/,,web.mytoys.de,Webtrekk
|
||||||
https://www.baur.de/,,tp.baur.de,Otto Group
|
https://www.baur.de/,,tp.baur.de,Otto Group
|
||||||
https://www.liligo.com/,,compare.liligo.com,???
|
https://www.liligo.com/,,compare.liligo.com,???
|
||||||
https://www.boulanger.com/,,tag.boulanger.fr,TagCommander
|
|
||||||
https://www.airfrance.fr/FR/,,tk.airfrance.fr,Wizaly
|
|
||||||
https://www.vsgamers.es/,,marketing.net.vsgamers.es,Affex
|
|
||||||
https://www.vacansoleil.fr/,,tdep.vacansoleil.fr,TraceDock
|
|
||||||
https://www.ozmall.co.jp/,,js.enhance.co.jp,GENIEE
|
|
||||||
https://www.thetimes.co.uk/,,cmp.thetimes.co.uk,SP-Prod
|
|
||||||
https://agilent.com/,,seahorseinfo.agilent.com,Act-On Software
|
|
||||||
https://halifax.co.uk/,,cem.halifax.co.uk,eum-appdynamics.com
|
|
||||||
https://www.reallygoodstuff.com/,,refer.reallygoodstuff.com,Extole
|
|
||||||
https://unity.com/,,eloqua-trackings.unity.com,Eloqua
|
|
||||||
https://www.notino.gr/,,api.campaigns.notino.com,Exponea
|
|
||||||
https://www.mytoys.de/,,0815.mytoys.de.adclear.net
|
|
||||||
https://www.imbc.com/,,ads.imbc.com.contentsfeed.com
|
|
||||||
https://www.cbdbiocare.com/,,affiliate.cbdbiocare.com,postaffiliatepro.com
|
|
||||||
https://www.seatadvisor.com/,,marketing.seatadvisor.com,Sugar Market (Salesfusion)
|
|
||||||
https://www.tchibo.de/,,tagm.tchibo.de,Exactag
|
|
||||||
https://www.bouygues-immobilier.com/,,go.bouygues-immobilier.fr,Pardot
|
|
||||||
https://caddyserver.com/,,mule.caddysever.com,Fathom
|
|
||||||
Reddit.com mail notifications,,click.redditmail.com,Branch.io
|
|
||||||
https://www.phpliveregex.com/,,yolo.phpliveregex.xom,Plausible.io
|
|
||||||
https://www.earthclassmail.com/,,1avhg3kanx9.www.earthclassmail.com,DataUnlocker
|
|
||||||
https://paulfredrick.com/,,execution-ci360.paulfredrick.com,SAS
|
|
||||||
|
|
Can't render this file because it has a wrong number of fields in line 18.
|
|
@ -1,35 +0,0 @@
|
||||||
#!/usr/bin/env python3
|
|
||||||
# pylint: disable=C0103
|
|
||||||
|
|
||||||
"""
|
|
||||||
Filter out invalid domain names
|
|
||||||
"""
|
|
||||||
|
|
||||||
import database
|
|
||||||
import argparse
|
|
||||||
import sys
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
|
|
||||||
# Parsing arguments
|
|
||||||
parser = argparse.ArgumentParser(
|
|
||||||
description="Filter out invalid domain name/ip addresses from a list.")
|
|
||||||
parser.add_argument(
|
|
||||||
'-i', '--input', type=argparse.FileType('r'), default=sys.stdin,
|
|
||||||
help="Input file, one element per line")
|
|
||||||
parser.add_argument(
|
|
||||||
'-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
|
|
||||||
help="Output file, one element per line")
|
|
||||||
parser.add_argument(
|
|
||||||
'-d', '--domain', action='store_true',
|
|
||||||
help="Can be domain name")
|
|
||||||
parser.add_argument(
|
|
||||||
'-4', '--ip4', action='store_true',
|
|
||||||
help="Can be IP4")
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
for line in args.input:
|
|
||||||
line = line[:-1].lower()
|
|
||||||
if (args.domain and database.Database.validate_domain(line)) or \
|
|
||||||
(args.ip4 and database.Database.validate_ip4address(line)):
|
|
||||||
print(line, file=args.output)
|
|
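Note on the removed `validate_list.py`: it lowercases each input line and keeps it only if it validates as a domain name (`--domain`) or an IPv4 address (`--ip4`). The sketch below shows the same filter shape under stated assumptions: `DOMAIN_RE` is a loose, hypothetical pattern and not the rule used by `database.Database.validate_domain` (which this diff does not show), while IPv4 checking uses the standard `ipaddress` module. It also uses `rstrip("\n")` rather than `line[:-1]`, since the slice drops a real character when the final line has no trailing newline.

```python
import ipaddress
import re

# Assumption: dot-separated labels of 1 to 63 characters, final label starting with a letter.
DOMAIN_RE = re.compile(
    r"(?:[a-z0-9_](?:[a-z0-9_-]{0,61}[a-z0-9_])?\.)+[a-z][a-z0-9-]{0,61}[a-z0-9]"
)


def is_domain(text):
    """Loose syntactic check for a hostname (illustrative only)."""
    return len(text) <= 253 and DOMAIN_RE.fullmatch(text) is not None


def is_ip4(text):
    """True if text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True


def filter_lines(lines, domain=False, ip4=False):
    """Yield lowercased entries that pass at least one enabled check."""
    for line in lines:
        item = line.rstrip("\n").lower()
        if (domain and is_domain(item)) or (ip4 and is_ip4(item)):
            yield item


if __name__ == "__main__":
    sample = ["Example.COM\n", "1.2.3.4\n", "bad host\n"]
    print(list(filter_lines(sample, domain=True, ip4=True)))
```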