With my browser setup I don't get to see the GDPR cookie banners. You can achieve the same by using uBlock Origin.
After an update, my default browser changed. I opened a link to a news website, and there it was, the cookie banner, asking me if I want to accept sharing my data with 1174 partners [1]. That's a lot of partners. There's probably some website out there that has more partners than this, and I need to find out.
I've got a list of websites, installed playwright, and I'm off to check how many cookies these websites self-report having.
Every website has a "we have 5 partners that ..." in the cookie banner, right? I wrote a regular expression to capture that number.
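    import re
    from typing import Optional

    def extract_numbers(text: str) -> Optional[str]:
        # Match "<number> partners/vendors" or "partners/vendors <number>"
        pattern = r'\b(\d+)\s+(?:partner.|vendor.)\b|\b(?:partner.|vendor.)\s+(\d+)\b'
        matches = re.findall(pattern, text)
        # Flatten the list of tuples and remove empty strings
        matches = [num for tup in matches for num in tup if num]
        # Return the first number found, or None when the banner reports no count
        return matches[0] if matches else None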
It turns out that most websites on my list don't have this partner count. I ended up with lots of missing or inaccurate data. Getting the self-reported data isn't as straightforward as I hoped: there is no requirement to self-report the count of all advertising partners. This approach was a dud.
The next best approach that I thought of was to find and click every "Opt-in" or "Accept" button, then count the number of cookies that are set. This isn't ideal either. I won't know who the advertising partners are; I'll only know how many cookies were set. It's a fair assumption that the count of cookie domains represents the count of advertising partners, more or less.
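A minimal sketch of the counting step, assuming the cookies arrive as a list of dicts with a "domain" key (the shape that playwright's `context.cookies()` returns). The function name `summarize_cookies` is made up for illustration, and domains are compared exactly as the browser reports them, leading dot included.

```python
def summarize_cookies(cookies):
    """Return (total number of cookies, number of distinct cookie domains)."""
    # Each cookie is a dict; only the "domain" key is used here.
    domains = {cookie["domain"] for cookie in cookies}
    return len(cookies), len(domains)


# Hypothetical sample data, shaped like the output of context.cookies()
sample = [
    {"name": "a", "domain": ".example.com"},
    {"name": "b", "domain": ".example.com"},
    {"name": "c", "domain": ".doubleclick.net"},
]
total, distinct_domains = summarize_cookies(sample)
```

The distinct-domain count is the stand-in for the number of advertising partners.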
The data was a lot more accurate; however, playwright gets confused when multiple buttons have the same text.
For example:
Error: strict mode violation: locator("button:has-text(\"Accept\"), button:has-text(\"Agree\")") resolved to 2 elements:
1) <button id="onetrust-accept-btn-handler">Accept All Cookies</button> aka get_by_role("button", name="Accept All Cookies")
2) <button id="accept-recommended-btn-handler">Accept All Cookies</button> aka get_by_label("ORCID Cookie Settings").get_by_text("Accept All Cookies")
Looking at some of the websites, the first button tended to be the "Accept" option, so I decided to tell playwright to always click the first button. I can't be sure that I'm clicking on the cookie banner rather than some other button on the page that says "Accept", but the chances of that happening are fairly low.
However, I still couldn't make sure that I clicked the "right" button.
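Clicking the first match could look roughly like the sketch below. It reuses the selector from the error message above; the helper name `click_first_accept` and the timeout value are assumptions, and `page` is expected to be a playwright `Page` from the sync API.

```python
ACCEPT_SELECTOR = 'button:has-text("Accept"), button:has-text("Agree")'


def click_first_accept(page, timeout_ms=5000):
    """Click the first matching consent button; return False if none is found."""
    buttons = page.locator(ACCEPT_SELECTOR)
    # count() does not wait, so a banner that renders late is reported as missing.
    if buttons.count() == 0:
        return False
    # .first narrows the locator to a single element, which avoids the
    # strict mode violation when several buttons share the same text.
    buttons.first.click(timeout=timeout_ms)
    return True
```

This only picks the first button in document order; it does not verify that the button belongs to the cookie banner.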
Continuing with this approach, I started playwright with Firefox using the "Desktop Firefox" User-Agent. Switching to Chromium with a mix of Desktop Edge and Desktop Chrome User-Agents proved to be a lot more successful, but there are still a lot of websites that I never managed to reach. Ultimately, what worked consistently was rotating between Firefox and Chromium and using playwright_stealth.
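One way to sketch the rotation, assuming a simple round-robin over browser and device pairs. `PROFILES`, `assign_profiles` and `open_context` are illustrative names, not the code used for this post, and `playwright` is the object yielded by `sync_playwright()`. playwright_stealth is left out of the sketch; check its documentation for the call that matches your installed version.

```python
from itertools import cycle

# (browser engine, playwright device descriptor) pairs to rotate through
PROFILES = [
    ("firefox", "Desktop Firefox"),
    ("chromium", "Desktop Chrome"),
    ("chromium", "Desktop Edge"),
]


def assign_profiles(urls):
    """Pair each URL with the next (browser, device) profile, round-robin."""
    return list(zip(urls, cycle(PROFILES)))


def open_context(playwright, browser_name, device_name):
    """Launch the named engine and open a context with that device's User-Agent."""
    device = playwright.devices[device_name]
    browser = getattr(playwright, browser_name).launch()
    context = browser.new_context(
        user_agent=device["user_agent"],
        viewport=device["viewport"],
    )
    return browser, context
```

A failed visit could then be retried with the next profile in the list instead of being dropped.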
Before I present the data, a recap of all the assumptions:
- I'm not logging in to any of these websites. Facebook.com seems to drop exactly one cookie from their own domain after accepting all cookies. I imagine this isn't the case after login [2].
- These are the cookies stored in the browser just after I click on - what I believe to be - "Accept all cookies", i.e. the data is still not accurate.
- There are websites on the list that don't have a cookie banner, e.g. vk.com.
Removing some of the duplicates and domains that no longer exist, I ended up with data from 962 websites.
Top 10 websites that give you the most cookies
The website that sets the most cookies after you accept all cookies is songkick.com, with a total of 172. This includes cookies set under the .songkick.com domain.
Domain | Number of cookies set |
---|---|
songkick.com | 172 |
hollywoodreporter.com | 132 |
boredpanda.com | 132 |
vanityfair.com | 128 |
newyorker.com | 121 |
businessinsider.com | 117 |
stocktwits.com | 112 |
wired.com | 111 |
vogue.com | 111 |
mlb.com | 111 |
Top 10 websites that give you cookies from different domains
Cookies from 62 different domains were set on songkick.com.
Domain | Number of domains |
---|---|
songkick.com | 62 |
hollywoodreporter.com | 49 |
boredpanda.com | 48 |
vanityfair.com | 44 |
aljazeera.com | 44 |
stocktwits.com | 44 |
lesechos.fr | 42 |
huffpost.com | 41 |
businessinsider.com | 39 |
huffingtonpost.com | 39 |
Top 10 cookie domains
The most common domain that cookies were set under was .linkedin.com, found on 330 websites. The top 10 most common domains:
Domain | Count |
---|---|
.linkedin.com | 330 |
.doubleclick.net | 176 |
.pubmatic.com | 165 |
.twitter.com | 104 |
.adobe.com | 100 |
.casalemedia.com | 96 |
.demdex.net | 91 |
.bing.com | 72 |
.adnxs.com | 70 |
.yahoo.com | 69 |
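A ranking like the one above can be computed with a `Counter`. This is a sketch under assumptions: `cookies_by_site` is a hypothetical dict mapping each website to its list of cookie dicts, and each cookie domain is counted once per website no matter how many cookies it sets there.

```python
from collections import Counter


def most_common_domains(cookies_by_site, top_n=10):
    """Rank cookie domains by the number of websites they appear on."""
    counts = Counter()
    for cookies in cookies_by_site.values():
        # A set, so a domain counts once per website
        counts.update({cookie["domain"] for cookie in cookies})
    return counts.most_common(top_n)


# Hypothetical sample data
sites = {
    "a.example": [{"domain": ".linkedin.com"}, {"domain": ".linkedin.com"}],
    "b.example": [{"domain": ".linkedin.com"}, {"domain": ".bing.com"}],
}
ranking = most_common_domains(sites, top_n=2)
```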
[1] It was the dailymail.co.uk.