
Added possibility to add personal sources

newworkflow_parseropti
Geoffrey Frogeye 2 years ago
parent
commit
a0a2af281f
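  1. .gitignore (2 lines changed)
  2. README.md (36 lines changed)
  3. collect_subdomains.sh (7 lines changed)
  4. dist/.gitignore (1 line changed)
  5. eulaurarien.sh (19 lines changed)
  6. filter_subdomains.sh (18 lines changed)
  7. regexes.py (2 lines changed)
  8. subdomains/.gitignore (2 lines changed)
  9. temp/.gitignore (1 line changed)
  10. websites/.gitignore (1 line changed)
  11. websites/eulerian_clients.list (0 lines changed, renamed from websites.list)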

2
.gitignore

@@ -1,3 +1 @@
*.list
!websites.list
*.log

36
README.md

@@ -27,8 +27,10 @@ That's where this script comes in, to generate a list of such subdomains.
It takes as input a list of websites with trackers included.
So far, this list is manually-generated from the list of clients of such first-party trackers
(later we should use a general list of websites to be more exhaustive).
It opens each of those websites (just the homepage) in a web browser, and records the domains of the network requests the page makes.
Additionally, or alternatively, you can feed the script some browsing history and get domains from there.
It then finds the DNS redirections of those domains, and compares them with regexes of known tracking domains.
It finally outputs the matching ones.
@@ -38,19 +40,43 @@ Just to build the list, you can find an already-built list in the releases.
- Bash
- Python 3.4+
- [progressbar2](https://pypi.org/project/progressbar2/)
- dnspython
(if you don't want to collect the subdomains, you can skip the following)
- Firefox
- Selenium
- seleniumwire
- dnspython
- [progressbar2](https://pypi.org/project/progressbar2/)
And then just run `eulaurarien.sh`.
## Usage
### Add personal sources
The list of websites provided in this script is by no means exhaustive,
so adding your own browsing history will help create a better list.
Here are reference commands for possible sources:
- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list`
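As a rough illustration of what the `rev | sed 's|^\.||'` part of the Firefox command does to each line, here is a minimal Python sketch. It assumes `rev_host` values are stored reversed with a trailing dot (for example `moc.elpmaxe.www.`); the function name `host_from_rev_host` is hypothetical and not part of this repository.

```python
def host_from_rev_host(rev_host: str) -> str:
    """Turn one reversed host value back into a regular host name."""
    # Equivalent of `rev`: reverse the characters of the line.
    host = rev_host[::-1]
    # Equivalent of the sed expression: drop one leading dot, if present.
    if host.startswith("."):
        host = host[1:]
    return host


if __name__ == "__main__":
    # Illustrative value only, not taken from a real profile.
    print(host_from_rev_host("moc.elpmaxe.www."))
```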
### Collect subdomains from websites
This step is optional if you already added personal sources.
Just run `collect_subdomains.sh`.
This is a long step, and might be memory-intensive from time to time.
### Extract tracking domains
Make sure your system is configured with a DNS server without limitation.
Then, run `filter_subdomains.sh`.
The files you need will be in the folder `dist`.
## Contributing
### Adding websites
Just add them to `websites.list`.
Just add the URL to the relevant list: `websites/<source>.list`.
### Adding first-party trackers regex

7
collect_subdomains.sh

@@ -0,0 +1,7 @@
#!/usr/bin/env bash
# Get all subdomains accessed by each website in the website list
cat websites/*.list | sort -u > temp/all_websites.list
./collect_subdomains.py temp/all_websites.list > temp/subdomains_from_websites.list
sort -u temp/subdomains_from_websites.list > subdomains/from_websites.cache.list
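
The merge-and-deduplicate step of this script (`cat websites/*.list | sort -u`) can be sketched in Python as below. This is only an illustration under assumptions: the helper name `merge_lists` is hypothetical, and Python sorts by code point while `sort -u` follows the current locale, so the ordering may differ for some inputs.

```python
from pathlib import Path


def merge_lists(directory, pattern="*.list"):
    """Return the sorted, de-duplicated lines of every matching file."""
    entries = set()
    for path in sorted(Path(directory).glob(pattern)):
        # splitlines() drops the line endings, like reading line by line.
        entries.update(path.read_text().splitlines())
    return sorted(entries)


if __name__ == "__main__":
    # Assumes a `websites` folder exists in the current directory.
    for entry in merge_lists("websites"):
        print(entry)
```

Because the glob is `*.list`, files named `*.custom.list` are picked up too, which is how personal sources end up in the merged list.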

1
dist/.gitignore

@@ -0,0 +1 @@
*.txt

19
eulaurarien.sh

@@ -2,21 +2,6 @@
# Main script for eulaurarien
# Get all subdomains accessed by each website in the website list
./collect_subdomains.py websites.list > subdomains.list
sort -u subdomains.list > subdomains.sorted.list
./collect_subdomains.sh
./filter_subdomains.sh
# Filter out the subdomains not pointing to a first-party tracker
./filter_subdomains.py subdomains.sorted.list > toblock.list
sort -u toblock.list > toblock.sorted.list
# Format the blocklist so it can be used as a hostlist
(
echo "# First party trackers"
echo "# List generated on $(date -Isec) by eulaurarien $(git describe --tags --dirty)"
cat toblock.sorted.list | while read host;
do
echo "0.0.0.0 $host"
done
) > toblock.hosts.list

18
filter_subdomains.sh

@@ -0,0 +1,18 @@
#!/usr/bin/env bash
# Filter out the subdomains not pointing to a first-party tracker
cat subdomains/*.list | sort -u > temp/all_subdomains.list
./filter_subdomains.py temp/all_subdomains.list > temp/all_toblock.list
sort -u temp/all_toblock.list > dist/firstparty-trackers.txt
# Format the blocklist so it can be used as a hostlist
(
echo "# First-party trackers"
echo "# List generated on $(date -Isec) by eulaurarien $(git describe --tags --dirty)"
cat dist/firstparty-trackers.txt | while read host;
do
echo "0.0.0.0 $host"
done
) > dist/firstparty-trackers-hosts.txt
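
For readers less familiar with shell subshells, the hostlist formatting done by the block above can be approximated in Python as follows. This is a sketch, not the project's code: `format_hostlist` is a hypothetical name, and the date and version are passed in as plain strings instead of being read from `date -Isec` and `git describe`.

```python
def format_hostlist(hosts, generated_on, version):
    """Build the hosts-file text: two comment lines, then one entry per host."""
    lines = [
        "# First-party trackers",
        f"# List generated on {generated_on} by eulaurarien {version}",
    ]
    # Each host is mapped to 0.0.0.0, as in the shell loop.
    lines.extend(f"0.0.0.0 {host}" for host in hosts)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(format_hostlist(["tracker.example.com"], "2020-01-01T00:00:00+00:00", "v0.0"), end="")
```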

2
regexes.py

@@ -4,6 +4,8 @@
List of regex matching first-party trackers.
"""
# Syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax
REGEXES = [
r'^.+\.eulerian\.net\.$',
r'^.+\.criteo\.com\.$',
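
To make the pattern syntax concrete, here is a small sketch of how such regexes behave against fully qualified DNS names (note the trailing dot). It copies only the two patterns visible in this hunk; the real list may be longer, and how `filter_subdomains.py` actually applies them is not shown in this commit, so `is_tracker` is a hypothetical helper.

```python
import re

# The two patterns visible in the hunk; the full list may contain more.
REGEXES = [
    r'^.+\.eulerian\.net\.$',
    r'^.+\.criteo\.com\.$',
]


def is_tracker(fqdn: str) -> bool:
    """Return True if the name matches at least one known tracker pattern."""
    return any(re.match(pattern, fqdn) for pattern in REGEXES)


if __name__ == "__main__":
    # Example names are made up for illustration.
    for name in ("abc.eulerian.net.", "abc.eulerian.net", "example.com."):
        print(name, is_tracker(name))
```

The patterns require at least one label before the tracker domain and a trailing dot, so a bare `eulerian.net.` or a name without the final dot does not match.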

2
subdomains/.gitignore

@@ -0,0 +1,2 @@
*.custom.list
*.cache.list

1
temp/.gitignore

@@ -0,0 +1 @@
*.list

1
websites/.gitignore

@@ -0,0 +1 @@
*.custom.list

0
websites.list → websites/eulerian_clients.list
