Empirical research on Dark Patterns within cookie banners

On July 22 and 23, 2021, Philipp Hausner (dapde alumnus) conducted a study to examine the extent and types of dark patterns within cookie banners in more detail. The following part summarizes the study, with particular attention to the technical method.

Philipp’s research method was based on three steps: (1) the identification of the present cookie banner (or the lack thereof), (2) the extraction of buttons on the banner, i.e., web elements the user can interact with, and (3) the extraction of style information from the banner and its buttons.

Cookie Banner Identification

The cookie banner is identified in a bottom-up fashion: low-level elements that are potentially part of a cookie banner are identified first and are then expanded into larger segments of content that are suitable candidates for a complete cookie banner. Note that in the context of this algorithm, a (web page) element is equivalent to a node in the DOM tree. The general steps are as follows:

  1. Potential candidates are selected by searching for keywords from a predefined list, which includes, among others, “cookie”, “tracking-technologien” (tracking technologies), and “dsgvo” (GDPR). Elements with HTML tags such as “body” or “script”, as well as elements with a size of 0 × 0 pixels, are discarded.
  2. Candidate elements are then expanded, i.e., they are replaced by their respective parent or grandparent element in the DOM tree as long as the following criteria hold for the (grand-)parent element e: e is visible and displayed on the web page; its tag is none of “body”, “footer”, “header”, “html”, or “main”; its text is shorter than 2500 characters; e covers at most 75% of the web page’s size; and lastly, e visually encloses the original child element.
  3. Expanded candidate elements are then replaced by a child element as long as one of their child elements contains the same text as the candidate element itself. Steps 2 and 3 enable the extraction of larger segments while avoiding boilerplate elements that are not important for the structure of the cookie banner.
  4. Out of the candidate elements, a single one is chosen as the most probable cookie banner. To this end, a candidate is removed if (1) its text has fewer than 50 characters, (2) its parent has the HTML tag “footer”, or (3) it does not contain any of a certain set of keywords like “akzeptieren” (accept) that heuristically suggest that the candidate is in fact a cookie banner. If more than one candidate remains after this filtering step, the smallest element is chosen, i.e., the element enclosing the fewest pixels.
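
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the study's code: the Node class is a hypothetical stand-in for a rendered DOM node, the confirmation keyword list is an assumption (only “akzeptieren” is named above), expansion only climbs to the direct parent, and step 3 is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional

SEED_KEYWORDS = ("cookie", "tracking-technologien", "dsgvo")  # excerpt named in the text
CONFIRM_KEYWORDS = ("akzeptieren", "accept")  # assumption: only "akzeptieren" is named
DISCARDED_TAGS = {"body", "script"}
STOP_TAGS = {"body", "footer", "header", "html", "main"}


@dataclass(eq=False)  # identity-based equality keeps nodes hashable
class Node:
    """Hypothetical stand-in for a rendered DOM node."""
    tag: str
    text: str
    x: int
    y: int
    width: int
    height: int
    visible: bool = True
    parent: Optional["Node"] = None

    @property
    def area(self) -> int:
        return self.width * self.height

    def encloses(self, other: "Node") -> bool:
        return (self.x <= other.x and self.y <= other.y
                and self.x + self.width >= other.x + other.width
                and self.y + self.height >= other.y + other.height)


def is_seed(node: Node) -> bool:
    """Step 1: keyword match, skipping discarded tags and 0 x 0 elements."""
    if node.tag in DISCARDED_TAGS or (node.width == 0 and node.height == 0):
        return False
    return any(k in node.text.lower() for k in SEED_KEYWORDS)


def expand(node: Node, page_area: int) -> Node:
    """Step 2 (simplified): climb to the parent while all criteria hold."""
    current = node
    while current.parent is not None:
        p = current.parent
        if not (p.visible and p.tag not in STOP_TAGS and len(p.text) < 2500
                and p.area <= 0.75 * page_area and p.encloses(node)):
            break
        current = p
    return current


def pick_banner(candidates: List[Node]) -> Optional[Node]:
    """Step 4: filter candidates, then choose the one with the fewest pixels."""
    kept = [c for c in candidates
            if len(c.text) >= 50
            and not (c.parent is not None and c.parent.tag == "footer")
            and any(k in c.text.lower() for k in CONFIRM_KEYWORDS)]
    return min(kept, key=lambda c: c.area, default=None)


def find_banner(nodes: List[Node], page_area: int) -> Optional[Node]:
    seeds = [n for n in nodes if is_seed(n)]
    candidates = list(dict.fromkeys(expand(s, page_area) for s in seeds))
    return pick_banner(candidates)


# Toy page: a banner at the bottom of a 1000 x 2000 pixel page.
html = Node("html", "", 0, 0, 1000, 2000)
body = Node("body", "", 0, 0, 1000, 2000, parent=html)
banner = Node("div", "We use cookies to improve this site. "
              "Accept all or adjust your settings.", 0, 1800, 1000, 200, parent=body)
label = Node("span", "We use cookies", 10, 1810, 200, 20, parent=banner)
PAGE_AREA = html.area

result = find_banner([html, body, banner, label], PAGE_AREA)
print(result.tag if result else None)
```

In a real crawl, the geometry and visibility values would come from the browser rather than from hand-written nodes.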

Button Extraction

The buttons are then extracted from the cookie banner segment by employing various CSS selectors, e.g., by checking whether an element has the HTML tag “a” or “button”, or whether it contains attributes that are typical for interactive elements such as @onclick. This of course detects not only buttons that a human investigator would easily identify, but also, for example, hyperlinks that are present in the main text of a cookie banner. Since those hyperlinks often link to a settings page or to the privacy policy page of the web site, they are kept deliberately.
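
To illustrate the selection rule, the following minimal sketch collects interactive elements from a banner's HTML using only Python's standard library. It is a stand-in for the Selenium-based extraction, not the study's code: the class and function names are hypothetical, and the rule corresponds roughly to a CSS selector like "a, button, [onclick]", which is an assumption about the selectors actually used.

```python
from html.parser import HTMLParser


class ButtonCollector(HTMLParser):
    """Collects (tag, text) pairs for <a>, <button>, and elements with onclick."""

    def __init__(self):
        super().__init__()
        self.buttons = []  # finished (tag, text) pairs
        self._open = []    # stack of [tag, text parts, same-tag nesting depth]

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "button") or "onclick" in dict(attrs):
            self._open.append([tag, [], 0])
        elif self._open and self._open[-1][0] == tag:
            # Non-interactive element with the same tag nested inside.
            self._open[-1][2] += 1

    def handle_data(self, data):
        # Text belongs to every interactive element that is currently open.
        for entry in self._open:
            entry[1].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            if self._open[-1][2] > 0:
                self._open[-1][2] -= 1
            else:
                open_tag, parts, _ = self._open.pop()
                text = " ".join("".join(parts).split())  # collapse whitespace
                self.buttons.append((open_tag, text))


def extract_buttons(html):
    collector = ButtonCollector()
    collector.feed(html)
    collector.close()
    return collector.buttons


banner_html = """
<div id="banner">
  <p>We use cookies. See our <a href="/privacy">privacy policy</a>.</p>
  <button>Accept all</button>
  <span onclick="openSettings()">Settings</span>
</div>
"""

buttons = extract_buttons(banner_html)
print(buttons)
```

As described above, the hyperlink inside the banner text is returned alongside the two more obvious buttons.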

Button Classification

To classify the buttons with regard to their functionality within the banner, the text of all button elements was gathered and initially clustered using the k-means clustering algorithm with k = 10. Upon manual inspection, seven distinct classes were identified: (1) accept all, (2) reject all, (3) partial acceptance of cookies like “Accept essential cookies”, (4) settings, (5) link to privacy protection, (6) link to more information, and (7) other. After the button texts had been manually sorted into these seven classes, a support vector classifier was trained. With the data split into a training set comprising 80% of the data and a corresponding test set, the classifier yields an F1 score of 0.98 in both micro and macro average. For both clustering and classification, certain special characters (e.g., # or _) are removed from all texts, and the texts are subdivided into character n-grams of lengths between 2 and 10. Afterwards, the text of each button is vectorized using tf-idf, and the resulting vector serves as input for the respective machine learning method.
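
The vectorization and classification step could look roughly like the following scikit-learn sketch. The toy button texts, the three labels, the clean helper, and the choice of LinearSVC are assumptions for illustration; the text above only states that a support vector classifier was trained on tf-idf vectors of character n-grams.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def clean(text):
    """Replace special characters such as # or _ and normalize case."""
    return re.sub(r"[#_]", " ", text).lower().strip()


# Tiny hand-labelled toy set (hypothetical; the study used crawled button texts).
texts = [
    "Accept all", "Accept all cookies", "Alle akzeptieren",
    "Reject all", "Reject all cookies", "Alle ablehnen",
    "Settings", "Cookie settings", "Einstellungen",
]
labels = [
    "accept all", "accept all", "accept all",
    "reject all", "reject all", "reject all",
    "settings", "settings", "settings",
]

model = make_pipeline(
    # Character n-grams of length 2 to 10, weighted with tf-idf.
    TfidfVectorizer(preprocessor=clean, analyzer="char", ngram_range=(2, 10)),
    LinearSVC(),  # assumption: one possible support vector classifier
)
model.fit(texts, labels)

predicted = model.predict(["Accept_all cookies", "Cookie-Einstellungen"])
print(list(predicted))
```

Character n-grams make the features robust to small spelling variations and work across languages without a tokenizer, which suits short button texts.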

Feature Extraction

CSS properties of all elements are extracted using JavaScript, mostly via the getComputedStyle method available in most browsers. To translate the color of web page elements from its machine-readable form as an RGB triplet into a human-readable equivalent, k-means clustering is employed again. For this purpose, the background-color style information of all elements is collected in the form of RGB triplets and clustered using k = 2 to k = 10. Manual inspection of the clustering results shows that the most reasonable clustering is achieved using k = 6, resulting in the colors (1) white, (2) black, (3) red, (4) blue, (5) green, and (6) yellow. While these categories are able to classify the background color of elements, they are not necessarily suitable for capturing the different font colors. Consequently, the same experiment was repeated for the font-color CSS information, yielding the classes (1) white, (2) black, and (3) gray. This coincides with our experience that cookie banners rarely employ colorful fonts.
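
The color clustering can be sketched as follows. The sample colors and the helper names parse_rgb and cluster_colors are hypothetical; the sketch only shows how computed color strings might be turned into RGB triplets and clustered for several values of k.

```python
import re

import numpy as np
from sklearn.cluster import KMeans


def parse_rgb(css_color):
    """Parse 'rgb(r, g, b)' or 'rgba(r, g, b, a)' into an (r, g, b) tuple."""
    numbers = re.findall(r"\d+(?:\.\d+)?", css_color)
    return tuple(int(float(n)) for n in numbers[:3])


def cluster_colors(css_colors, k):
    """Cluster RGB triplets with k-means; returns (labels, cluster centers)."""
    X = np.array([parse_rgb(c) for c in css_colors], dtype=float)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.labels_, km.cluster_centers_


# Hypothetical background colors as a browser would report them.
colors = [
    "rgb(255, 255, 255)", "rgb(250, 250, 250)",  # whites
    "rgb(0, 0, 0)", "rgba(20, 20, 20, 0.9)",     # blacks
    "rgb(200, 30, 30)", "rgb(220, 10, 10)",      # reds
]

# The study scanned k = 2 to k = 10; the toy data only allows k up to 6.
for k in range(2, min(10, len(colors)) + 1):
    _, centers = cluster_colors(colors, k)
    print(k, centers.round().tolist())

labels3, _ = cluster_colors(colors, 3)
```

Naming the resulting clusters (white, black, red, and so on) remains a manual step, as in the study.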

Implementation

The above algorithms were implemented using Python 3.8.6, Selenium 3.141.0 with Mozilla’s geckodriver 0.26.0, and scikit-learn (sklearn) 0.24.1 for the clustering and classification algorithms. Selenium was used as the browser automation framework, since it allows not only for static analysis of HTML pages, but also processes dynamic content and behaves similarly to a standard web browser.
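
Because Selenium drives a real browser, style information can be read from the rendered page through injected JavaScript. The following is a minimal sketch under stated assumptions: the function name computed_styles and the property list are hypothetical, and the driver is passed in so that the snippet itself does not need a running browser. execute_script is Selenium's standard call for running JavaScript with arguments.

```python
# JavaScript executed in the page: reads selected computed CSS properties
# of the element passed as the first argument.
STYLE_SCRIPT = """
const style = window.getComputedStyle(arguments[0]);
const result = {};
for (const name of arguments[1]) {
    result[name] = style.getPropertyValue(name);
}
return result;
"""


def computed_styles(driver, element,
                    properties=("background-color", "color", "font-size")):
    """Return a dict mapping CSS property names to their computed values.

    `driver` is expected to be a Selenium WebDriver and `element` a
    WebElement located on the current page.
    """
    return driver.execute_script(STYLE_SCRIPT, element, list(properties))


# Possible usage with the Selenium 3 API (requires Firefox and geckodriver):
#
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get("https://example.org")
#   element = driver.find_element_by_css_selector("a")
#   print(computed_styles(driver, element))
#   driver.quit()
```

The returned color values are strings such as "rgb(0, 0, 0)", which is the form the color clustering described above starts from.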

We hope that this roundup gives a brief insight into our research method. A discussion of the results will be published soon. If you have any further questions regarding the study, feel free to message us via kamke@foev-speyer.de.