
GDPR

Opsci is committed to protecting the data collected in the course of its research projects. As an institute specialized in the study of the digital public space, Opsci mainly collaborates with NGOs, foundations, scientific teams, and institutions on topics of public interest such as the climate crisis, ecological policies, and online democracy. Opsci uses new AI techniques to identify the main or emerging trends in public opinion.

Opsci's studies frequently focus on publicly accessible content on Very Large Online Platforms: Twitter, Facebook, YouTube, TikTok, or Instagram. This content may contain personal data as defined by the GDPR.

Opsci not only implements the major principles of the GDPR but also contributes to defining best practices for data processing, to reconciling the GDPR with other regulations (the Digital Services Act), and to weighing the opportunities and risks related to the use of AI.

General principles of data protection

Opsci's policy is based on the major European principles described in art. 5 of the GDPR:

  • Data is "processed lawfully, fairly, and transparently." Opsci's studies are research projects that result in publications on this site or partner sites. The studies describe the corpora used and the measures taken to frame the potential processing of personal data.

  • Data is "collected for specified, explicit, and legitimate purposes" within specific and documented projects.

  • Data is "adequate, relevant, and limited to what is necessary." Opsci systematically uses "data minimization" procedures: only data immediately useful for the research project is retained and processed. The following section describes more precisely the concrete implementation of this minimization strategy regarding the two main treatments performed (text analysis, community analysis).

  • Data is only kept "for as long as necessary for the purposes for which it is processed." Opsci's activity is structured around research projects with a predefined time frame.

  • Data is "processed in a manner that ensures appropriate security of the personal data." All our corpora are stored on a secure server located physically in France, owned by the company. For these treatments, we do not use cloud services.

Data processing

Opsci's studies focus exclusively on "publications" distributed and shared in the digital public space, that is, content accessible on the platforms and not restricted to a predefined private circle of users.

Although the dissemination of this content is subject to a minimal form of consent, we consider, in accordance with the GDPR, that the publication and open dissemination of content on social networks do not settle all issues related to personal data. This data may include personal identification elements (a name or other data specified by the user) or information about a person conveyed by a third party.

Various measures make it possible to remove identifiable data or to confine it to specific and documented uses. Opsci's studies rely on two processing operations, differentiated according to the status of the account:

  • Analysis of discourses and narratives on social networks using new AI models


    Opsci, for example, follows the evolution of discourses on climate change in several European countries. To be exhaustive, these approaches require the collection of large corpora (on Twitter in France, one year of climate debate represents nearly two million tweets). To limit the risks of personal data reuse, we apply data "minimization": only the texts posted by users and some non-identifying metadata (date of text creation, circulation and engagement metrics) are retained; all identification data is excluded (a schematic illustration of this step follows this list). Finally, only AI models "read" the corpus in full to classify it on a large scale, and small, minimized samples of the corpus are studied manually by analysts in accordance with the principles predefined by the data controller.

  • Analysis of media, organizations, political figures, and opinion leaders


    The contents observed are those published by public figures who carry a public voice. In accordance with the exceptions provided for by the GDPR, and in particular the public-interest and research provisions discussed below, identification data relating to these public accounts may be retained and analyzed within the documented scope of each study.
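
The "minimization" step applied in the first type of processing can be sketched schematically as follows. This is a minimal illustration and not Opsci's actual pipeline: the field names (author_id, author_name, retweet_count, etc.) are hypothetical placeholders and do not correspond to any specific platform API.

```python
# Minimal sketch of data "minimization": identification fields are dropped and
# only the text plus non-identifying metadata are retained before storage.
# Field names below are hypothetical placeholders, not a real platform schema.

RETAINED_FIELDS = {"text", "created_at", "retweet_count", "like_count", "reply_count"}


def minimize_post(raw_post: dict) -> dict:
    """Return a copy of a collected post stripped of identification data."""
    return {key: value for key, value in raw_post.items() if key in RETAINED_FIELDS}


def minimize_corpus(raw_posts: list[dict]) -> list[dict]:
    """Apply minimization to the whole collected corpus before it is stored."""
    return [minimize_post(post) for post in raw_posts]


example = {
    "text": "Il faut accélérer la transition énergétique.",
    "created_at": "2023-05-02T10:15:00Z",
    "retweet_count": 12,
    "author_id": "123456",        # identification data: removed
    "author_name": "Jane Doe",    # identification data: removed
}
print(minimize_post(example))     # only text and non-identifying metadata remain
```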

Public interest data: contributing to the development of "good practices"

In 2023, regulations on data protection related to social media analysis remain unclear. Questions remain, in particular, about how to reconcile the GDPR with other fundamental principles (such as the right to information) and, more recently, with the new regulations governing large online platforms.

The GDPR already provides a general exception for the processing of personal data for "reasons of important public interest" (Art. 6 & 9). All of our studies fall within this framework. However, these exceptions still need to be implemented in the law of EU member states and, in this context, it remains difficult to rely on "important public interest" as a basis for processing sensitive data for scientific research purposes (Wiewiorówski, 2020).
The GDPR also includes derogations from the fundamental principles of personal data protection if they are "processed for scientific or historical research purposes" and if the application of these principles "seriously hinders" such processing (Art. 89).

Opsci is fully committed to developing good practices adapted to this new field of research. We are particularly focused on giving concrete form to the balance between data protection and the public interest. One of our main projects is to define clear criteria differentiating public figures speaking publicly from the vast majority of users, whose identity and privacy must be protected.

Opsci is also closely monitoring the evolution of European regulation. The Digital Services Act notably provides that external access to platform data should be generalized when platforms present a "systemic risk", particularly in the field of disinformation (Art. 40). Opsci has conducted studies on online disinformation and the circulation of "fake news" (on Twitter, Facebook, and TikTok) that fall within this framework. Fine-grained analysis of disinformation requires the processing of identification data, particularly to reconstruct upstream forms of coordination.

AI and data protection

Since 2021, Opsci has specialized in the use of new text analysis technologies based on artificial intelligence. Our projects rely on automated classification with BERT-type models. This type of model offers a fine-grained understanding of text that goes beyond counting word occurrences: it can identify recurring sentence structures, arguments, and positions.
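
As an illustration of this kind of automated classification, here is a minimal sketch using the Hugging Face transformers library. The model named below is a publicly available sentiment classifier used purely as a stand-in: Opsci's studies rely on project-specific BERT-type models and label sets that are not reproduced here.

```python
# Minimal sketch of large-scale text classification with a BERT-type model.
# The model below is a public stand-in; actual project models and labels differ.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

posts = [
    "La transition énergétique doit s'accélérer dès maintenant.",
    "Encore une taxe déguisée qui pèsera sur les ménages.",
]

# Only the model "reads" every post in full; analysts work on the resulting labels.
for post, prediction in zip(posts, classifier(posts)):
    print(prediction["label"], round(prediction["score"], 3), post)
```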

The rapid developments in artificial intelligence applied to the study of social networks raise new questions and provide new answers to the challenges of protecting personal data:

  • The new classification models present risks that are well identified by specialists in the ethics of artificial intelligence: the results can be biased or can amplify pre-existing stereotypes. For Opsci's research projects, this risk is structurally limited: we do not carry out classification based on personal or demographic data.

  • Large-scale classification means that only a small part of the corpus needs to be consulted during the preparation and interpretation of the classification models. The final results of the classification consist of aggregated data (number of posts or engagements on a subject per day) and simplified semantic representations of the original text (the "embeddings"), as sketched below. Concretely, personal data indirectly present in the original corpus is not retained after this processing. Artificial intelligence thus makes it possible to reconcile the principles of data protection with the openness of research results (Open Science).
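
The kind of output retained after processing can be sketched as follows: aggregated daily counts per classified topic, plus embeddings that replace the original wording. This is an illustration under simplifying assumptions; the sentence-transformers model named below is a public model used here only as an example, and the toy posts and topic labels are invented.

```python
# Minimal sketch of the retained outputs: aggregated counts per topic and per day,
# and embeddings standing in for the original text. Illustrative only; the
# sentence-transformers model below is a public example, not a project model.
from collections import Counter

from sentence_transformers import SentenceTransformer

classified_posts = [  # toy output of a prior classification step
    {"date": "2023-05-02", "topic": "energy_policy", "text": "Il faut sortir du charbon."},
    {"date": "2023-05-02", "topic": "climate_denial", "text": "Le climat a toujours changé."},
    {"date": "2023-05-03", "topic": "energy_policy", "text": "Développons l'éolien offshore."},
]

# 1. Aggregated results: number of posts per topic and per day.
daily_counts = Counter((post["date"], post["topic"]) for post in classified_posts)

# 2. Simplified semantic representations: embeddings replace the raw wording,
#    so personal data carried by the original text is not retained afterwards.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode([post["text"] for post in classified_posts])

print(daily_counts)        # e.g. Counter({('2023-05-02', 'energy_policy'): 1, ...})
print(embeddings.shape)    # (3, 384) — numeric vectors, no readable text
```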
