# eulaurarien

Generates a host list of first-party trackers for ad-blocking.

**DISCLAIMER:** I'm by no means an expert on this subject, so my vocabulary or other details might be wrong. Use at your own risk.
## What's a first-party tracker?

Traditionally, websites load tracker scripts directly from the tracker company's domain.
For example, `website1.com` and `website2.com` both load `https://trackercompany.com/trackerscript.js` to track their users.
To block those, one can simply block the host `trackercompany.com`.

However, to circumvent this easy block, tracker companies made the websites using them load trackers from `somethingirelevant.website1.com`.
That subdomain is a DNS redirection (typically a CNAME record) to `website1.trackercompany.com`, which points directly to a server serving the tracking script.
Those are the first-party trackers.

Blocking `trackercompany.com` doesn't work any more, and blocking `*.trackercompany.com` isn't really possible since:

1. Most ad-blockers don't support wildcards
2. It's a DNS redirection, meaning that most ad-blockers will only see `somethingirelevant.website1.com`

So the only solution is to block every known `somethingirelevant.website1.com`-like subdomain, which is a lot.
That's where this script comes in: it generates a list of such subdomains.
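To make the problem concrete, here is a minimal sketch that simulates the situation with a hard-coded dictionary instead of real DNS. The host names come from the example above; the dictionary, the block list and both function names are illustrative assumptions, not part of this project.

```python
# Simulated DNS data: requested host -> CNAME target.
# Hypothetical values, taken from the example in the text.
CNAME_RECORDS = {
    "somethingirelevant.website1.com": "website1.trackercompany.com",
}

# What a classic host list would contain.
BLOCKED_HOSTS = {"trackercompany.com"}


def is_blocked_by_host(host):
    """Exact host matching, which is all a plain host list can do."""
    return host in BLOCKED_HOSTS


def is_tracker_after_cname(host):
    """Follow the simulated CNAME, then check the target against the blocked domains."""
    target = CNAME_RECORDS.get(host, host)
    return any(
        target == blocked or target.endswith("." + blocked)
        for blocked in BLOCKED_HOSTS
    )


host = "somethingirelevant.website1.com"
print(is_blocked_by_host(host))      # the host list alone misses it
print(is_tracker_after_cname(host))  # looking at the CNAME target reveals it
```

The ad-blocker only ever sees the first check, which is why the subdomain itself has to end up in the list.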
## How does this script work

It takes as input a list of websites that include trackers.
So far, this list is manually generated from the list of clients of such first-party trackers
(later we should use a general list of websites to be more exhaustive).
It opens each of those websites (just the homepage) in a web browser and records the domains of the network requests the page makes.
Additionally, or alternatively, you can feed the script some browsing history and get domains from there.
It then finds the DNS redirections of those domains and compares them with regexes of known tracking domains.
It finally outputs the matching ones.
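The last two steps can be sketched roughly as follows. This is not the project's actual code: the pattern, the function names and the `resolve_cname` callable are assumptions made for illustration. In a real run, `resolve_cname` could be backed by a DNS library (for example dnspython's `dns.resolver.resolve(host, "CNAME")`); here it is any function that returns the CNAME target of a host, or `None` if there is none, so the sketch runs without network access.

```python
import re

# Hypothetical pattern; the real ones live in regexes.py and may look different.
# The optional trailing dot allows for fully qualified names in DNS answers.
TRACKER_PATTERNS = [re.compile(r"^.+\.trackercompany\.com\.?$")]


def follow_cnames(host, resolve_cname, max_depth=10):
    """Return the names reached from `host` by following CNAME records."""
    chain = []
    current = host
    for _ in range(max_depth):
        target = resolve_cname(current)
        if target is None or target in chain:
            break  # end of the chain, or a loop
        chain.append(target)
        current = target
    return chain


def find_first_party_trackers(hosts, resolve_cname):
    """Keep the hosts whose CNAME chain matches a known tracker pattern."""
    matches = []
    for host in hosts:
        chain = follow_cnames(host, resolve_cname)
        if any(p.match(name) for p in TRACKER_PATTERNS for name in chain):
            matches.append(host)
    return matches


# Usage with simulated DNS records (a dict's .get returns None when missing).
records = {
    "somethingirelevant.website1.com": "website1.trackercompany.com",
    "cdn.website2.com": "edge.cdnprovider.example",
}
print(find_first_party_trackers(list(records) + ["www.website1.com"], records.get))
```

Note that the output contains the subdomains as the browser requested them, not the tracker's own domain, since those are the names an ad-blocker actually sees.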
## Requirements

These are only needed to build the list yourself; an already-built list is available in the releases.

- Bash
- Python 3.4+
- [progressbar2](https://pypi.org/project/progressbar2/)
- dnspython

(if you don't want to collect the subdomains, you can skip the following)

- Firefox
- Selenium
- seleniumwire
## Usage

### Add personal sources

The list of websites provided in this script is by no means exhaustive,
so adding your own browsing history will help create a better list.
Here are reference commands for possible sources:

- **Pi-hole**: `sqlite3 /etc/pihole-FTL.db "select distinct domain from queries" > /path/to/eulaurarien/subdomains/my-pihole.custom.list`
- **Firefox**: `cp ~/.mozilla/firefox/<your_profile>.default/places.sqlite temp; sqlite3 temp "select distinct rev_host from moz_places" | rev | sed 's|^\.||' > /path/to/eulaurarien/subdomains/my-firefox.custom.list`
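If `rev` or `sed` are not available, the Firefox command can be approximated in Python with the standard `sqlite3` module. This is only a sketch: the function names are made up for illustration, and it assumes, as the `rev | sed` pipeline above implies, that `rev_host` stores the host name reversed with a trailing dot. Like the shell version, it should be pointed at a copy of `places.sqlite`, not at the live file.

```python
import sqlite3


def rev_host_to_domain(rev_host):
    """Undo Firefox's reversed host storage, e.g. 'moc.elpmaxe.www.' -> 'www.example.com'."""
    host = rev_host[::-1]                              # same as `rev`
    return host[1:] if host.startswith(".") else host  # same as `sed 's|^\.||'`


def firefox_domains(places_db_path):
    """Return the sorted distinct hosts found in a copy of places.sqlite."""
    connection = sqlite3.connect(places_db_path)
    try:
        rows = connection.execute(
            "select distinct rev_host from moz_places"
        ).fetchall()
    finally:
        connection.close()
    # Skip NULL or empty entries (e.g. non-web URLs).
    return sorted({rev_host_to_domain(row[0]) for row in rows if row[0]})
```

Writing the returned hosts one per line into `subdomains/my-firefox.custom.list` should give the same kind of file as the shell command.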
### Collect subdomains from websites

This step is optional if you already added personal sources.
Just run `collect_subdomain.sh`.
This is a long step, and might be memory-intensive from time to time.

### Extract tracking domains

Make sure your system is configured with a DNS server without limitation.
Then, run `filter_subdomain.sh`.
The files you need will be in the folder `dist`.
## Contributing

### Adding websites

Just add the URL to the relevant list: `websites/<source>.list`.

### Adding first-party tracker regexes

Just add them to `regexes.py`.
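Before adding a regex, it can help to try it against a few host names you expect it to match and a few it must not match. The helper below is a hypothetical sketch for that purpose only; it does not reflect how `regexes.py` is organised, and the pattern and host names are made-up examples.

```python
import re


def check_candidate(pattern, should_match, should_not_match):
    """Return (missed, false_hits) for a candidate tracker regex."""
    compiled = re.compile(pattern)
    missed = [h for h in should_match if not compiled.match(h)]
    false_hits = [h for h in should_not_match if compiled.match(h)]
    return missed, false_hits


# Hypothetical candidate: any subdomain of trackercompany.com,
# with an optional trailing dot as found in fully qualified DNS names.
candidate = r"^.+\.trackercompany\.com\.?$"

missed, false_hits = check_candidate(
    candidate,
    should_match=["website1.trackercompany.com", "website1.trackercompany.com."],
    should_not_match=["trackercompany.com.evil.example", "nottrackercompany.com"],
)
print(missed, false_hits)
```

Two empty lists mean the candidate behaved as expected on the samples; an unanchored pattern such as `trackercompany` would show up with false hits.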