Labeling a Dataset for Misinformation on WhatsApp

8.2 Detecting Misinformation without Violating Users’ Privacy

8.2.4 Labeling a Dataset for Misinformation on WhatsApp

The proposal presented in the previous section requires that a database of fact-checked images be matched against WhatsApp content in order to flag misinformation. To evaluate the practical potential of the proposed architecture and to simulate this process, we built two datasets: one of WhatsApp messages containing misinformation, and one from real fact-checkers identifying which content is fake. To obtain a large WhatsApp misinformation dataset, we gathered data from public WhatsApp groups discussing politics in Brazil and India, as explained in detail in the data collection chapter of this thesis (Chapter 5). In addition, we collected a dataset of fact-checked misinformation images from well-known, publicly available fact-checking websites.


Table 8.2: WhatsApp collection.

Country   #Users   #Groups   Unique Images   Total Images   Time Span
Brazil    17,465   414       4,524           34,109         2018/08 - 2018/11
India     63,500   4,250     509k            810k           2019/02 - 2019/06

8.2.4.1 WhatsApp Data.

To gather the data explored in this work, we use an available collection methodology to access messages posted in public WhatsApp groups. We selected over 400 groups from Brazil and over 4,200 groups from India, all dedicated to political discussions. The data collection period for each country includes its national elections. For this part of the work, we keep only messages containing images. We restrict the evaluation of our architecture to images because this kind of media content is easier to track and tends to remain unchanged, whereas text messages tend to change slightly as they are disseminated through the network (e.g., an emoji or a URL is added or removed), which makes it harder to check whether two messages represent the same piece of misinformation.

An overview of the dataset, including the total number of users, groups, and distinct images, is given in Table 8.2. Note that the Indian collection is much larger than the Brazilian one: it covers roughly ten times as many groups and over twenty times as many images in total.

8.2.4.2 Fact-checking Agencies Dataset.

As our methodology relies on the fact-checking performed by specialized agencies, we need to build a broad dataset of content previously labeled as misinformation in order to compare it against the WhatsApp data. The company could adopt this same process in the design of the application, but it could also be replaced by a broader partnership between WhatsApp and fact-checking agencies in which the agencies provide the labeled content directly to WhatsApp, as they already do for Facebook (HUNT, 2017).

First, we created a list of the main, well-known fact-checking agencies in Brazil (“Folha–Lupa” <piaui.folha.uol.com.br/lupa/>, “Aos Fatos” <aosfatos.org>, G1–“É ou Não É?” <g1.globo.com/e-ou-nao-e/>, “e-Farsas” <www.e-farsas.com>, Veja–“Me Engana que eu Posto” <veja.abril.com.br/blog/me-engana-que-eu-posto/>, and “Boatos.org” <www.boatos.org>) and of some fact-checkers from India (<altnews.in>, <boomlive.in>, <smhoaxslayer.com>, <factchecker.in>, <factly.in>, <fakenewscounter.com>, and <check4spam.com>).

Then, for each agency, we developed a web crawler that navigates through the content of the agency's website and collects all news pages that contain fact-checked content. Furthermore, for each labeled piece of news, we collected the images shared on the page, as well as the label given by the fact-checker and the date of the fact-check. In total, we collected over 100k fact-checked images from Brazil and about 20k images from India.
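As an illustration of the extraction step, the sketch below pulls image URLs out of an already-downloaded fact-checking page and stores them with the label and date. It uses only Python's standard library. The record fields (`label`, `checked_on`), the class and function names, and the idea of passing the label and date in as arguments are assumptions made for this example; in practice each agency's site needs its own rules for locating the verdict and the date.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the absolute URL of every <img> tag found in a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.image_urls.append(urljoin(self.base_url, src))


@dataclass
class FactCheckRecord:
    """One fact-checked news page (hypothetical record layout)."""
    page_url: str
    label: str        # verdict given by the fact-checker
    checked_on: str   # date of the fact-check
    image_urls: list = field(default_factory=list)


def extract_record(page_url, html, label, checked_on):
    """Build a record from the HTML of one fact-checking page."""
    collector = ImageCollector(page_url)
    collector.feed(html)
    return FactCheckRecord(page_url, label, checked_on, collector.image_urls)
```

Images without a `src` attribute are skipped, and relative image paths are resolved against the page URL so that the images can be downloaded later.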

Moreover, for this work specifically, we used a state-of-the-art image matching technique based on perceptual hashing, PDQ hashing, to look for occurrences of the fact-checked images in our data from public groups. The PDQ hashing algorithm is an improvement over the commonly used pHash and produces a 256-bit hash using a discrete cosine transform. PDQ is currently the method used by Facebook to detect similar content, and it is among the best-known approaches for clustering similar images together. The algorithm can detect near-duplicate images even if they were cropped differently or have small amounts of text overlaid on them, so it finds more duplicates than pHash. By comparing the hashes of the images from fact-checking agencies with those of the WhatsApp images, we can identify the WhatsApp images that are fake. We can also compare the date on which an image was shared on WhatsApp with the date on which it was fact-checked. As perceptual hashes allow similarity comparison and not only exact matching, we treat two images as the same when their similarity is greater than 0.8.
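The matching rule can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes each PDQ hash is already available as a 64-character hexadecimal string (256 bits), and it assumes that similarity is defined as the fraction of bits on which two hashes agree, i.e. one minus the normalized Hamming distance. All function names are hypothetical.

```python
HASH_BITS = 256  # PDQ produces 256-bit hashes


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two hex-encoded 256-bit hashes."""
    if len(hash_a) != HASH_BITS // 4 or len(hash_b) != HASH_BITS // 4:
        raise ValueError("expected 64 hexadecimal characters per hash")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def similarity(hash_a: str, hash_b: str) -> float:
    """Assumed similarity measure: fraction of bits on which the hashes agree."""
    return 1.0 - hamming_distance(hash_a, hash_b) / HASH_BITS


def is_same_image(hash_a: str, hash_b: str, threshold: float = 0.8) -> bool:
    """Treat two images as the same if their similarity exceeds the threshold."""
    return similarity(hash_a, hash_b) > threshold


def find_matches(whatsapp_hashes, factchecked_hashes, threshold=0.8):
    """Return (whatsapp_id, factcheck_id, similarity) for every matching pair.

    Both arguments are dicts mapping an image identifier to its hex hash.
    """
    matches = []
    for wa_id, wa_hash in whatsapp_hashes.items():
        for fc_id, fc_hash in factchecked_hashes.items():
            score = similarity(wa_hash, fc_hash)
            if score > threshold:
                matches.append((wa_id, fc_id, score))
    return matches
```

Under this assumed definition, a threshold of 0.8 on 256-bit hashes accepts pairs that differ in at most 51 bits. The pairwise loop is quadratic in the number of images, so a real system at the scale of Table 8.2 would likely need an index over the hashes.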

We also used a second strategy to expand and validate our fact-checking dataset of labeled misinformation images. For each WhatsApp image in our dataset, we used Google reverse image search to check whether any of the main fact-checking domains was returned among the results. If so, we parsed the fact-checking page and automatically labeled the image according to how it was tagged on that page. Finally, to make sure our dataset was accurately built, we manually verified each image that appears both on the fact-checking websites and in the WhatsApp data.
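The domain check in this second strategy can be sketched as below. The sketch assumes the reverse image search has already returned a list of result URLs (how those results are obtained is outside its scope), and the function names are hypothetical. The site list is the one given earlier in this section; matching on host plus path prefix is an assumption made here so that, for example, only the Lupa section of the Folha site counts as a hit.

```python
from urllib.parse import urlparse

# Fact-checking sites listed earlier in this section (host, optionally with a path).
FACT_CHECK_SITES = [
    "piaui.folha.uol.com.br/lupa/",
    "aosfatos.org",
    "g1.globo.com/e-ou-nao-e/",
    "www.e-farsas.com",
    "veja.abril.com.br/blog/me-engana-que-eu-posto/",
    "www.boatos.org",
    "altnews.in",
    "boomlive.in",
    "smhoaxslayer.com",
    "factchecker.in",
    "factly.in",
    "fakenewscounter.com",
    "check4spam.com",
]


def _strip_www(host: str) -> str:
    """Lower-case a host name and drop a leading 'www.'."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def _split_site(site: str):
    """Split 'host/path' into a normalized host and a path prefix."""
    host, _, path = site.partition("/")
    return _strip_www(host), "/" + path


_SITES = [_split_site(site) for site in FACT_CHECK_SITES]


def fact_check_hits(result_urls):
    """Return the result URLs that point to one of the fact-checking sites."""
    hits = []
    for url in result_urls:
        parsed = urlparse(url)
        host = _strip_www(parsed.hostname or "")
        path = parsed.path or "/"
        if any(host == h and path.startswith(p) for h, p in _SITES):
            hits.append(url)
    return hits
```

Each returned URL would then be fetched and parsed to read the verdict, followed by the manual verification described above.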

As shown in Table 8.3, this dataset of previously fact-checked images contains 135 images from Brazil and 205 images from India that were shown to contain misinformation. It is important to highlight that many fact-checking agencies do not post the actual image that has been disseminated. Often only altered versions of the image are posted, and other versions of the false story are omitted to avoid contributing to the spread of misinformation. This leaves us with a small number of matches compared to the total number of fact-checked images we obtained, but it is sufficient to investigate the feasibility of the proposed architecture. Direct contact with the fact-checking agencies, as Facebook already has, would likely increase the size of the fact-checked set considerably.

Table 8.3: Amount of misinformation images shared on WhatsApp and comparison of shares before and after the checking date of fact-checking agencies.

Country   Misinformation   100% Exact   Total    % Shares         Max Shares
          Images Found     Matches      Shares   After Checking   After Checking
Brazil    135              7            2,209    40.7             96
India     205              83           2,944    82.2             1,089

Figure 8.11: Distributions of WhatsApp misinformation images labeled per fact-checking agency. Horizontal axis: Percentage (%). (a) Brazil: Me Engana que eu Posto, Comprova, Fato ou Fake, e-Farsas, Aos Fatos, Boatos.org, Lupa. (b) India: factchecker, factly, fakenewscounter, check4spam, smhoaxslayer, boomlive, altnews. Source: The Author.

Note that even though the set of fact-checked images is small, the fact that these images have been fact-checked means that they were popular and spread widely. Table 8.3 shows a summary of the fact-checked images and their activity in our dataset. It also shows the drawbacks of using a 100% exact match for hash comparison. While similarity-based matching of perceptual hashes identifies more than a hundred images in each country, requiring exactly identical hashes retrieves only 5.1% of the fact-checked images from Brazil and 40% of those from India.
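The share statistics reported in Table 8.3 can be derived along the following lines. This is a sketch under stated assumptions: each matched image is assumed to come with the date on which it was fact-checked and the list of dates on which it was shared in the groups; only shares strictly after the checking date count as "after checking"; and "Max Shares After Checking" is interpreted as the largest number of post-check shares of any single image. The function name and data layout are hypothetical.

```python
from datetime import date


def share_stats(images):
    """Summarize shares for a list of (checked_on, share_dates) tuples.

    checked_on is the fact-check date of one image and share_dates lists
    the dates on which that image was shared.
    """
    total = 0        # all shares of all images
    after_total = 0  # shares made after the image was fact-checked
    max_after = 0    # most post-check shares of any single image
    for checked_on, share_dates in images:
        after = sum(1 for shared_on in share_dates if shared_on > checked_on)
        total += len(share_dates)
        after_total += after
        max_after = max(max_after, after)
    pct_after = 100.0 * after_total / total if total else 0.0
    return {"total_shares": total, "pct_after": pct_after, "max_after": max_after}
```

For instance, an image checked on 2018-10-01 and shared once before and twice after that date contributes three shares to the total and two to the post-check count.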

Figure 8.11 shows the breakdown, by fact-checking agency, of the images retrieved from WhatsApp. We find that Lupa is the fact-checking agency with the most matched misinformation images in Brazil (28.9%), whereas in India, Alt News has the highest number of matches (24.4%).
