On July 22nd and 23rd, 2021, Philipp Hausner (dapde alumnus) conducted a study to further examine the extent and types of dark patterns in cookie banners. The following section provides a complete roundup of the study, with particular attention to the technical method.
Philipp’s research method consisted of three steps: (1) identifying the cookie banner (or its absence), (2) extracting the buttons on the banner, i.e., the web elements the user can interact with, and (3) extracting style information from the banner and its buttons.
The cookie banner is identified in a bottom-up fashion: first, low-level elements that are potentially part of a cookie banner are identified; these are then expanded into larger segments of content that are suitable candidates for the complete banner. Note that in the context of this algorithm, a (web page) element is equivalent to a node in the DOM tree. The general steps are as follows:
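To make the bottom-up idea concrete, here is a minimal, self-contained sketch. The keyword list, the `Node`/`TreeBuilder` helpers, and `find_banner_candidate` are illustrative assumptions on our part, not the study’s actual implementation (which operated on live pages via Selenium): low-level nodes whose text mentions a cookie-related keyword are collected, then expanded to their deepest common ancestor as the banner candidate.

```python
from html.parser import HTMLParser

# assumed trigger words; the study's actual keyword set may differ
KEYWORDS = ("cookie", "consent", "privacy")

class Node:
    """A minimal DOM node: tag, parent, children, and its own text."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""

    def full_text(self):
        return self.text + "".join(c.full_text() for c in self.children)

class TreeBuilder(HTMLParser):
    """Builds a minimal DOM tree from an HTML string."""
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.stack[-1])
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data

def find_banner_candidate(root):
    """Bottom-up: collect keyword hits, then expand to a common ancestor."""
    # step 1: low-level elements whose own text mentions a keyword
    hits = []
    def walk(node):
        if any(k in node.text.lower() for k in KEYWORDS):
            hits.append(node)
        for child in node.children:
            walk(child)
    walk(root)
    if not hits:
        return None
    # step 2: expand to the deepest ancestor shared by all hits,
    # i.e., the smallest segment containing all keyword elements
    def ancestors(node):
        chain = []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain
    common = set(ancestors(hits[0]))
    for hit in hits[1:]:
        common &= set(ancestors(hit))
    return max(common, key=lambda n: len(ancestors(n)))
```

In this toy version, a page containing a `<div>` with a cookie notice and a privacy link would yield that `<div>` as the banner candidate, since it is the smallest element enclosing all keyword matches.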
To classify the buttons with regard to their function within the banner, the texts of all button elements are gathered and initially clustered with the k-means algorithm (k = 10). Upon manual inspection, seven distinct classes emerged: (1) accept all, (2) reject all, (3) partial acceptance of cookies, e.g., “Accept essential cookies”, (4) settings, (5) link to privacy protection, (6) link to more information, and (7) other. After the button texts were manually sorted into these seven classes, a support vector classifier was trained. With an 80/20 split into training and test sets, the classifier yields an F1 score of 0.98 in both micro and macro average. For both clustering and classification, certain special characters (e.g., # or _) are removed from all texts, and each text is subdivided into a set of character n-grams of lengths between 2 and 10. The text of each button is then vectorized using tf-idf, and the resulting vector serves as input to the respective machine learning method.
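Assuming a labeled set of button texts, the preprocessing and classification pipeline could be sketched in sklearn roughly as follows. The sample texts and labels below are invented for illustration (the study used seven classes and a much larger labeled corpus), and the exact preprocessing details are our assumption:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def strip_special(text):
    # remove special characters such as '#' or '_' before n-gram extraction
    return re.sub(r"[#_]", " ", text)

# invented toy data; real training data would cover all seven classes
texts = [
    "Accept all", "Accept all cookies",
    "Reject all", "Decline all cookies",
    "Accept essential cookies", "Only necessary cookies",
    "Cookie settings", "Manage preferences",
]
labels = [
    "accept all", "accept all",
    "reject all", "reject all",
    "partial acceptance", "partial acceptance",
    "settings", "settings",
]

pipeline = Pipeline([
    # character n-grams of lengths 2 to 10, tf-idf weighted
    ("tfidf", TfidfVectorizer(preprocessor=strip_special,
                              analyzer="char", ngram_range=(2, 10))),
    # linear support vector classifier, as in the study
    ("svc", LinearSVC()),
])
pipeline.fit(texts, labels)
```

In the study, the same tf-idf vectors were first fed to k-means for the exploratory clustering; only after the manual labeling step was the support vector classifier trained and evaluated on the held-out 20% split.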
The above algorithms were implemented in Python 3.8.6, using Selenium 3.141.0 with Mozilla’s geckodriver 0.26.0 for browser automation and sklearn 0.24.1 for the clustering and classification algorithms. Selenium was chosen because it not only allows static analysis of HTML pages, but also processes dynamic content and behaves similarly to a standard web browser.
We hope this roundup has given you a short insight into our research method. A discussion of the results will be published soon. If you have any further questions regarding the study, feel free to message us via firstname.lastname@example.org.