
Added possibility to add personal sources

tags/v1.1.0
Geoffrey Frogeye 3 months ago
parent
commit
a0a2af281f
11 changed files with 65 additions and 24 deletions
  1. .gitignore (+0, -2)
  2. README.md (+31, -5)
  3. collect_subdomains.sh (+7, -0)
  4. dist/.gitignore (+1, -0)
  5. eulaurarien.sh (+2, -17)
  6. filter_subdomains.sh (+18, -0)
  7. regexes.py (+2, -0)
  8. subdomains/.gitignore (+2, -0)
  9. temp/.gitignore (+1, -0)
  10. websites/.gitignore (+1, -0)
  11. websites/eulerian_clients.list (+0, -0)

.gitignore (+0, -2)

@@ -1,3 +1 @@
*.list
!websites.list
*.log

README.md (+31, -5)

@@ -27,8 +27,10 @@ That's where this scripts comes in, to generate a list of such subdomains.
It takes as input a list of websites with trackers included.
So far, this list is manually-generated from the list of clients of such first-party trackers
(later we should use a general list of websites to be more exhaustive).

It opens each of those websites (just the homepage) in a web browser, and records the domains of the network requests the page makes.

Additionally, or alternatively, you can feed the script some browsing history and get domains from there.

It then finds the DNS redirections of those domains, and compares them with regexes of known tracking domains.
It finally outputs the matching ones.

@@ -38,19 +40,43 @@ Just to build the list, you can find an already-built list in the releases.

- Bash
- Python 3.4+
- [progressbar2](https://pypi.org/project/progressbar2/)
- dnspython

(if you don't want to collect the subdomains, you can skip the following)

- Firefox
- Selenium
- seleniumwire
- dnspython
- [progressbar2](https://pypi.org/project/progressbar2/)

And then just run `eulaurarien.sh`.
## Usage

### Add personal sources

The list of websites provided in this script is by no means exhaustive,
so adding your own browsing history will help create a better list.
Here are reference commands for possible sources:

- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list`
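
The `rev | sed 's|^\.||'` stage of the Firefox command can be sketched in Python. This assumes that `rev_host` holds the hostname reversed with a trailing dot (for example `moc.elpmaxe.www.`). The function name and the sample rows are hypothetical and only illustrate the transformation.

```python
def host_from_rev_host(rev_host: str) -> str:
    # Same idea as the shell pipeline: reverse the string (rev),
    # then drop a single leading dot if there is one (sed).
    host = rev_host[::-1]
    return host[1:] if host.startswith(".") else host


# Hypothetical rows, shaped like the output of
# `select distinct rev_host from moz_places`.
rows = ["moc.elpmaxe.www.", "gro.elpmaxe.", ""]
hosts = [host_from_rev_host(row) for row in rows]
print(hosts)
```

Empty rows pass through unchanged, as they would in the shell pipeline.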

### Collect subdomains from websites

This step is optional if you already added personal sources.
Just run `collect_subdomains.sh`.
This is a long step, and might be memory-intensive from time to time.

### Extract tracking domains

Make sure your system is configured with a DNS server without limitation.
Then, run `filter_subdomains.sh`.
The files you need will be in the folder `dist`.

## Contributing

### Adding websites

Just add them to `websites.list`.
Just add the URL to the relevant list: `websites/<source>.list`.

### Adding first-party trackers regex


collect_subdomains.sh (+7, -0)

@@ -0,0 +1,7 @@
#!/usr/bin/env bash

# Get all subdomains accessed by each website in the website list

cat websites/*.list | sort -u > temp/all_websites.list
./collect_subdomains.py temp/all_websites.list > temp/subdomains_from_websites.list
sort -u temp/subdomains_from_websites.list > subdomains/from_websites.cache.list

dist/.gitignore (+1, -0)

@@ -0,0 +1 @@
*.txt

eulaurarien.sh (+2, -17)

@@ -2,21 +2,6 @@

# Main script for eulaurarien

# Get all subdomains accessed by each website in the website list
./collect_subdomains.py websites.list > subdomains.list
sort -u subdomains.list > subdomains.sorted.list
./collect_subdomains.sh
./filter_subdomains.sh

# Filter out the subdomains not pointing to a first-party tracker
./filter_subdomains.py subdomains.sorted.list > toblock.list
sort -u toblock.list > toblock.sorted.list

# Format the blocklist so it can be used as a hostlist

(
echo "# First party trackers"
echo "# List generated on $(date -Isec) by eulaurarien $(git describe --tags --dirty)"
cat toblock.sorted.list | while read host;
do
echo "0.0.0.0 $host"
done
) > toblock.hosts.list

filter_subdomains.sh (+18, -0)

@@ -0,0 +1,18 @@
#!/usr/bin/env bash

# Filter out the subdomains not pointing to a first-party tracker

cat subdomains/*.list | sort -u > temp/all_subdomains.list
./filter_subdomains.py temp/all_subdomains.list > temp/all_toblock.list
sort -u temp/all_toblock.list > dist/firstparty-trackers.txt

# Format the blocklist so it can be used as a hostlist

(
echo "# First-party trackers"
echo "# List generated on $(date -Isec) by eulaurarien $(git describe --tags --dirty)"
cat dist/firstparty-trackers.txt | while read host;
do
echo "0.0.0.0 $host"
done
) > dist/firstparty-trackers-hosts.txt
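
For readers less familiar with shell subshells and redirection, the hosts-file step above can be sketched in Python. This is only an illustration of the output format, not part of the commit. The function name `format_hosts` and the sample inputs are hypothetical; in the script the timestamp comes from `date -Isec` and the version from `git describe --tags --dirty`.

```python
from datetime import datetime, timezone


def format_hosts(hosts, generated_on, version):
    # Two comment lines, then one "0.0.0.0 <host>" entry per domain,
    # matching what the shell loop writes.
    lines = [
        "# First-party trackers",
        f"# List generated on {generated_on} by eulaurarien {version}",
    ]
    lines += [f"0.0.0.0 {host}" for host in hosts]
    return "\n".join(lines) + "\n"


# Hypothetical inputs, for illustration only.
stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
print(format_hosts(["a.example.com", "b.example.com"], stamp, "v0.0.0"), end="")
```

Mapping every host to `0.0.0.0` is what lets the result be used as a hostlist, as the script comment says.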

regexes.py (+2, -0)

@@ -4,6 +4,8 @@
List of regex matching first-party trackers.
"""

# Syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax

REGEXES = [
r'^.+\.eulerian\.net\.$',
r'^.+\.criteo\.com\.$',
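
As a rough illustration of how such patterns might be applied, the sketch below tests a DNS redirection target against the two regexes visible in this hunk (the list may continue beyond the lines shown). The helper `is_first_party_tracker` is hypothetical, since `filter_subdomains.py` is not part of this diff. The patterns end in `\.$`, so the sketch assumes targets are written as fully qualified names with a trailing dot.

```python
import re

# Only the two patterns visible in the diff hunk.
REGEXES = [
    r'^.+\.eulerian\.net\.$',
    r'^.+\.criteo\.com\.$',
]
COMPILED = [re.compile(pattern) for pattern in REGEXES]


def is_first_party_tracker(cname_target: str) -> bool:
    # True when the redirection target matches any known tracker pattern.
    return any(regex.match(cname_target) for regex in COMPILED)


# Hypothetical redirection targets, for illustration only.
for target in ("abc.eulerian.net.", "abc.eulerian.net", "www.example.com."):
    print(target, is_first_party_tracker(target))
```

Because each pattern requires at least one label before the tracker domain and a final dot, the bare name `eulerian.net.` and the dot-less `abc.eulerian.net` do not match.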

subdomains/.gitignore (+2, -0)

@@ -0,0 +1,2 @@
*.custom.list
*.cache.list

temp/.gitignore (+1, -0)

@@ -0,0 +1 @@
*.list

websites/.gitignore (+1, -0)

@@ -0,0 +1 @@
*.custom.list

websites.list → websites/eulerian_clients.list (renamed, +0, -0)

