Added possibility to add personal sources

newworkflow_parseropti
Geoffrey Frogeye 2019-11-11 11:19:46 +01:00
parent 333ae4eb66
commit a0a2af281f
11 changed files with 65 additions and 24 deletions

2
.gitignore vendored

@ -1,3 +1 @@
*.list
!websites.list
*.log

View File

@ -27,8 +27,10 @@ That's where this scripts comes in, to generate a list of such subdomains.
It takes as input a list of websites with trackers included.
So far, this list is manually generated from the list of clients of such first-party trackers
(later, we should use a general list of websites to be more exhaustive).
It opens each of those websites (just the homepage) in a web browser and records the domains of the network requests the page makes.
Additionally, or alternatively, you can feed the script some browsing history and get domains from there.
It then finds the DNS redirections of those domains and compares them with regexes of known tracking domains.
It finally outputs the matching ones.
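The comparison step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `filter_subdomains.py`: the helper name `is_tracker_target` is hypothetical, the DNS lookup itself is left out, and only the two patterns visible in this commit's regex list are used. It assumes the redirection target is given as a fully qualified name, since those patterns end with a trailing dot.

```python
import re

# Only the two patterns visible in this commit; the real list may be longer.
REGEXES = [
    r'^.+\.eulerian\.net\.$',
    r'^.+\.criteo\.com\.$',
]
COMPILED = [re.compile(regex) for regex in REGEXES]


def is_tracker_target(cname):
    """Return True if a DNS redirection target matches a known tracker pattern.

    Hypothetical helper: `cname` is the name a subdomain redirects to,
    obtained elsewhere (for example with dnspython).
    """
    # The patterns expect the absolute form, which ends with a dot.
    if not cname.endswith('.'):
        cname += '.'
    return any(pattern.match(cname) for pattern in COMPILED)
```

A subdomain would then be kept in the output only when its redirection target makes this function return `True`.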
@ -38,19 +40,43 @@ Just to build the list, you can find an already-built list in the releases.
- Bash
- Python 3.4+
- [progressbar2](https://pypi.org/project/progressbar2/)
- dnspython
(if you don't want to collect the subdomains, you can skip the following)
- Firefox
- Selenium
- seleniumwire
- dnspython
- [progressbar2](https://pypi.org/project/progressbar2/)
And then just run `eulaurarien.sh`.
## Usage
### Add personal sources
The list of websites provided in this script is by no means exhaustive,
so adding your own browsing history will help create a better list.
Here are reference commands for possible sources:
- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list`
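For readers who prefer to avoid the `rev | sed` pipeline, the Firefox extraction can be approximated in Python. This is a sketch under assumptions: the function names are hypothetical, and it relies on `rev_host` holding the host reversed with a trailing dot, which is what the shell command above implies. Unlike the shell version, it also skips empty rows and sorts the result.

```python
import sqlite3


def host_from_rev_host(rev_host):
    """Undo the reversal, then drop the single leading dot (like sed 's|^\\.||')."""
    host = rev_host[::-1]
    return host[1:] if host.startswith('.') else host


def firefox_hosts(connection):
    """Return the sorted, de-duplicated hosts from an open places database.

    Hypothetical helper: `connection` is a sqlite3 connection to a *copy*
    of places.sqlite, as in the command above.
    """
    rows = connection.execute("select distinct rev_host from moz_places")
    return sorted({host_from_rev_host(row[0]) for row in rows if row[0]})
```

Writing one host per line to `subdomains/my-firefox.custom.list` would then give the same kind of file as the shell command.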
### Collect subdomains from websites
This step is optional if you already added personal sources.
Just run `collect_subdomains.sh`.
This is a long step, and might be memory-intensive from time to time.
### Extract tracking domains
Make sure your system is configured with a DNS server without limitation.
Then, run `filter_subdomains.sh`.
The files you need will be in the folder `dist`.
## Contributing
### Adding websites
Just add them to `websites.list`.
Just add the URL to the relevant list: `websites/<source>.list`.
### Adding first-party trackers regex

7
collect_subdomains.sh Executable file

@ -0,0 +1,7 @@
#!/usr/bin/env bash
# Get all subdomains accessed by each website in the website list
cat websites/*.list | sort -u > temp/all_websites.list
./collect_subdomains.py temp/all_websites.list > temp/subdomains_from_websites.list
sort -u temp/subdomains_from_websites.list > subdomains/from_websites.cache.list

1
dist/.gitignore vendored Normal file

@ -0,0 +1 @@
*.txt

View File

@ -2,21 +2,6 @@
# Main script for eulaurarien
# Get all subdomains accessed by each website in the website list
./collect_subdomains.py websites.list > subdomains.list
sort -u subdomains.list > subdomains.sorted.list
./collect_subdomains.sh
./filter_subdomains.sh
# Filter out the subdomains not pointing to a first-party tracker
./filter_subdomains.py subdomains.sorted.list > toblock.list
sort -u toblock.list > toblock.sorted.list
# Format the blocklist so it can be used as a hostlist
(
echo "# First party trackers"
echo "# List generated on $(date -Isec) by eulaurarien $(git describe --tags --dirty)"
cat toblock.sorted.list | while read host;
do
echo "0.0.0.0 $host"
done
) > toblock.hosts.list

18
filter_subdomains.sh Executable file

@ -0,0 +1,18 @@
#!/usr/bin/env bash
# Filter out the subdomains not pointing to a first-party tracker
cat subdomains/*.list | sort -u > temp/all_subdomains.list
./filter_subdomains.py temp/all_subdomains.list > temp/all_toblock.list
sort -u temp/all_toblock.list > dist/firstparty-trackers.txt
# Format the blocklist so it can be used as a hostlist
(
echo "# First-party trackers"
echo "# List generated on $(date -Isec) by eulaurarien $(git describe --tags --dirty)"
cat dist/firstparty-trackers.txt | while read host;
do
echo "0.0.0.0 $host"
done
) > dist/firstparty-trackers-hosts.txt
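
The hosts-file formatting done by the subshell above could equally be written in Python. The following is only an illustrative equivalent, not part of the commit: `format_hosts` is a hypothetical name, and the header text is passed in by the caller instead of being built from `date` and `git describe`.

```python
def format_hosts(hosts, header_lines=()):
    """Render hosts as hosts-file text, each one pointing to 0.0.0.0.

    Hypothetical helper: `header_lines` become '# ' comment lines at the top.
    """
    lines = ['# ' + line for line in header_lines]
    lines.extend('0.0.0.0 ' + host for host in hosts)
    return '\n'.join(lines) + '\n'
```

Calling it with the contents of `dist/firstparty-trackers.txt` and the two header lines would produce text in the same shape as `dist/firstparty-trackers-hosts.txt`.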

View File

@ -4,6 +4,8 @@
List of regex matching first-party trackers.
"""
# Syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax
REGEXES = [
r'^.+\.eulerian\.net\.$',
r'^.+\.criteo\.com\.$',

2
subdomains/.gitignore vendored Normal file

@ -0,0 +1,2 @@
*.custom.list
*.cache.list

1
temp/.gitignore vendored Normal file

@ -0,0 +1 @@
*.list

1
websites/.gitignore vendored Normal file

@ -0,0 +1 @@
*.custom.list