Using Perceptual Hash to Process Image Data

5.4 Measuring Popularity for WhatsApp Content

5.4.1 Using Perceptual Hash to Process Image Data

To merge two pieces of information that contain the same image into a single item, this approach uses the perceptual hash algorithm pHash (ZAUNER, 2010) to calculate a fingerprint for every image. Unlike cryptographic hashing algorithms (e.g., MD5, SHA), this kind of hash takes into account visual attributes of the file, such as the pixels and RGB layers, to generate a code that represents the image. Therefore, small divergences in the content result in only slight differences between hashes, making it possible to compare hashes and find similar content. These hashing methods provide an efficient way to store files as short digital hashes that can determine whether two files are the same or similar, even without the original image.
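As a rough illustration of how such a fingerprint can be computed, the sketch below follows one common DCT-based formulation of a perceptual hash: take a small grayscale version of the image, apply a 2D discrete cosine transform, keep the low-frequency 8x8 block, and set one bit per coefficient depending on whether it exceeds the median of that block. It is written in plain Python and assumes the image has already been reduced to a 32x32 grayscale matrix. The helper names and the synthetic test image are illustrative only and are not taken from ZAUNER (2010), whose implementation may differ in its details.

```python
import math
from statistics import median


def dct_1d(values):
    """Naive, unnormalized DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_2d(matrix):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def phash_bits(gray, hash_size=8):
    """Fingerprint (hash_size * hash_size bits) from the low-frequency DCT block."""
    freq = dct_2d(gray)
    low = [value for row in freq[:hash_size] for value in row[:hash_size]]
    threshold = median(low)
    return [1 if value > threshold else 0 for value in low]


def hamming(bits_a, bits_b):
    """Number of positions at which two bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Synthetic 32x32 grayscale "image" (assumed to be already resized).
original = [[(x * 7 + y * 13 + x * y) % 256 for x in range(32)] for y in range(32)]

# Copy with a small 2x2 patch brightened, standing in for a minor edit.
edited = [row[:] for row in original]
for y in (10, 11):
    for x in (10, 11):
        edited[y][x] = min(255, edited[y][x] + 10)

print("bits in fingerprint:", len(phash_bits(original)))
print("Hamming distance after edit:", hamming(phash_bits(original), phash_bits(edited)))
```

The Hamming distance counts differing bits, so a value of 0 means identical fingerprints and larger values mean the two fingerprints disagree in more positions.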

Using visual hashing to detect similar images is an approach widely applied by researchers and in industry, especially for digital forensics and cybercrime studies on the Web (HAO et al., 2021). Their structure and ease of computation allow these hashes to be used to explore the immense universe of online images and to detect and analyze abusive content on the Web in various contexts. For example, perceptual hashes have been applied to detect malicious advertisements online based on webpage screenshots (VADREVU; PERDISCI, 2019), to detect phishing using Haar wavelet hashes (wHash) (MEDVET; KIRDA; KRUEGEL, 2008), to mark potential survey scams on the Web by clustering perceptual difference hashes of their websites (KHARRAZ; ROBERTSON; KIRDA, 2018), to analyze displayed ads in order to detect abusive and fraudulent services (NIKIFORAKIS et al., 2014; RAFIQUE et al., 2016), to build perceptual ad-blockers (TRAMÈR et al., 2019), to find website vandalism and website defacement attacks (BORGOLTE; KRUEGEL; VIGNA, 2015), and to combat different kinds of rogue software by grouping visually corresponding icons (NAPPA; RAFIQUE; CABALLERO, 2013) and screenshots (DIETRICH; ROSSOW; POHLMANN, 2013) of malicious software, using average and perceptual hashes to find similar malware distribution campaigns online. Different perceptual hashing methods are also used to analyze the graphical user interface of apps in order to detect counterfeit apps that impersonate popular mobile apps in attempts to mislead users (RAJASEGARAN et al., 2019) and to study other app vulnerabilities, such as those in authentication schemes (BIANCHI et al., 2017; SHI; WANG; LAU, 2019). Hash comparison can also be used to identify fake profiles and to recognize identity impersonation in online social networks by matching duplicated profile photos (GOGA; VENKATADRI; GUMMADI, 2015).

Perceptual hashing is also deployed in search engine studies, for example to investigate malicious black-hat search engine optimization by comparing websites that look visually similar (GOETHEM et al., 2019), and different forms of hashing are also heavily used in real-world reverse image search (RIVAS et al., 2017; CHAMOSO et al., 2018).

PASTRANA et al. (2019) also demonstrated the importance of this kind of hashing technique for detecting pornography and illicit content and, particularly, for combating eWhoring, a type of online fraud in which cybersexual encounters are simulated for financial gain. It has also been used to study images edited to evade inappropriate-image detectors while promoting illicit products such as sexual products or gambling websites (YUAN et al., 2019), and to flag child abuse content shared online (BURSZTEIN et al., 2019). More recently, perceptual hashing has also been applied to analyze the images and memes distributed in misinformation campaigns and hate speech over the Web and social networks (AGARWAL et al., 2020; HUCKLE; WHITE, 2017; SAMANTA; JAIN, 2021; MITTOS et al., 2020; WANG et al., 2021; ZANNETTOU et al., 2018b, 2020; ABILOV et al., 2021). Therefore, it is a well-established methodology for comparing and detecting copies of image content.

However, this technique is not without its flaws. There are limitations regarding its application, as it is possible to manipulate an image in such a way that its hash diverges substantially from that of the original source and ends up close to that of a totally different image (STRUPPEK et al., 2022), which can be used to fool systems based on perceptual hashes and even real-world search engines (HAO et al., 2021). Furthermore, in the context of misinformation imagery spread online, the false information in a fabricated image can arise from just minor alterations made to the original source. Thus, by using a hashing method with great power of image generalization, we can end up treating the original source and its edited copy as if they were exactly the same content and, consequently, merging legitimate and fake content into one.

For example, Figure 5.6 shows a fake image shared in the political context of the 2018 Brazilian elections, with the original source on the right and an edited false copy of it on the left. In this example, extracting the pHash of each image yields similar but not equal hashes, with a Hamming distance of 2 between them. Using the average hash method, we obtain exactly the same hash for both images, and thus a Hamming distance of 0. Using a checksum, on the other hand, we obtain totally different hashes, as it is a cryptographic hashing method.
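The contrast between the average hash and the checksum can be reproduced on a toy example. The sketch below is an illustration under stated assumptions, not the experiment described here: it uses hypothetical 8x8 grayscale matrices in place of the images of Figure 5.6, a minimal average hash (one bit per pixel, set when the pixel is brighter than the mean), and MD5 over the raw pixel bytes as the checksum.

```python
import hashlib


def average_hash(gray):
    """aHash sketch: one bit per pixel, set when the pixel is brighter than the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hamming(bits_a, bits_b):
    """Number of positions at which two bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


def md5_hex(gray):
    """Checksum of the raw pixel values (0..255), one byte per pixel."""
    return hashlib.md5(bytes(p for row in gray for p in row)).hexdigest()


# Hypothetical 8x8 thumbnail: dark left half, bright right half.
original = [[50] * 4 + [200] * 4 for _ in range(8)]

# Tiny edit: a single pixel becomes slightly brighter.
slight_edit = [row[:] for row in original]
slight_edit[3][1] = 60

# Larger local edit: two dark pixels are painted bright.
local_edit = [row[:] for row in original]
local_edit[3][1] = 220
local_edit[3][2] = 220

print(hamming(average_hash(original), average_hash(slight_edit)))  # 0
print(hamming(average_hash(original), average_hash(local_edit)))   # 2
print(md5_hex(original) == md5_hex(slight_edit))                   # False
```

In this toy case the average hash cannot tell the slightly edited copy from the original, while the checksum treats them as unrelated files.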

Therefore, we observe that pHash can relate both images (as they are similar) while still being able to differentiate them. To avoid merging false and original images into a single piece of data, as in the example above, we use the pHash algorithm and consider only images with exactly the same hash to be the same image. In this way, we can track duplicates of images while keeping the peculiarities of each one.
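The difference between exact matching and distance-based matching can be sketched as follows. The file names and 64-bit hash values are hypothetical, chosen so that the edited image sits at a Hamming distance of 2 from the original, as in the example; the helper functions are illustrative and are not the code used in this work.

```python
from collections import defaultdict


def hamming_hex(hash_a, hash_b):
    """Number of differing bits between two equal-length hex-encoded hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def group_exact(hashes_by_file):
    """Map each distinct hash to the list of files that share it exactly."""
    groups = defaultdict(list)
    for name, digest in hashes_by_file.items():
        groups[digest].append(name)
    return dict(groups)


# Hypothetical file names and hash values, made up for illustration.
hashes_by_file = {
    "original.jpg": "c3a5f0f0e1d2b487",
    "original_copy.jpg": "c3a5f0f0e1d2b487",
    "edited_fake.jpg": "c3a5f0f0e1d2b484",
}

groups = group_exact(hashes_by_file)
distance = hamming_hex(hashes_by_file["original.jpg"], hashes_by_file["edited_fake.jpg"])

print(len(groups))  # 2: the edited image keeps its own group
print(distance)     # 2: a rule such as "distance <= 2" would have merged them
```

Exact matching keeps the two copies of the original together and leaves the edited version as a separate item, which is the behavior the text argues for.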

Figure 5.6: Comparison between the original source image and the edited fake version shared on WhatsApp.

(a) Original Source Image (b) Fake Edited Image

Source: The Author.

Next, we evaluate the impact of the hashing algorithm on data collection time. Given the volume of multimedia files sent every day on WhatsApp, another important criterion for choosing a hash algorithm is the time required to process the media file attached to each message. For these experiments, a total of 905K image files, 375K video files, and 80K audio files collected from WhatsApp were processed using different hash algorithms.

Table 5.1 summarizes the experiments with different hash methods. Besides the pHash algorithm, we compare other popular visual hashing methods, such as Average Hash (aHash), Differential Hash (dHash), Wavelet image hash (wHash), and Facebook PDQ⁴.

Table 5.1: Comparison of different visual hashing methods for processing image data.

| Method         | Unique Hashes | Matching Content | Total Time Spent (min) |
|----------------|---------------|------------------|------------------------|
| checksum (MD5) | 867,857       | 4%               | 316                    |
| pHash          | 714,114       | 21%              | 395                    |
| aHash          | 646,533       | 29%              | 359                    |
| dHash          | 741,533       | 18%              | 361                    |
| wHash (haar)   | 652,472       | 28%              | 1,202                  |
| wHash (db4)    | 729,840       | 19%              | 1,308                  |
| PDQ            | 783,205       | 14%              | 14,044                 |

Total Files: 905,671.

Perceptual hashes are clearly more powerful than the checksum hash at detecting similar content. While the cryptographic hash reduces the number of unique images by only 4%, perceptual hashes match up to 29% of the files. Moreover, most perceptual hashes do not take much more time to process than the checksum method.
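The Matching Content values in Table 5.1 appear consistent with computing, for each method, the share of files that did not produce a new unique hash, that is, 1 - unique/total. The short sketch below checks this reading against the reported counts; the formula is an assumed interpretation, since the text does not state how the percentages were obtained.

```python
TOTAL_FILES = 905_671

# Unique hash counts as reported in Table 5.1.
unique_hashes = {
    "checksum (MD5)": 867_857,
    "pHash": 714_114,
    "aHash": 646_533,
    "dHash": 741_533,
    "wHash (haar)": 652_472,
    "wHash (db4)": 729_840,
    "PDQ": 783_205,
}


def matching_share(unique, total=TOTAL_FILES):
    """Fraction of files whose hash was already produced by another file."""
    return 1 - unique / total


for method, unique in unique_hashes.items():
    print(f"{method:15s} {matching_share(unique):.1%}")
```

Rounded to whole percentages, these shares reproduce the values listed in the table.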

Figure 5.7 shows the time required to compute the checksum hash for each media format, highlighting the differences between image, audio, and video files. On average, images are the fastest media to hash, mostly taking between 0.01 and 0.03 seconds. Videos, on the other hand, are larger files and therefore take more time to process. Figure 5.7(b) reports the cumulative time spent processing the entire dataset of each kind of media using the checksum. Even though there are more than twice as many image files as video files, the time required to process the videos was practically three times the total time for the images (5.2 hours for all images and 16 hours for all videos). Given the small number of audio files compared to the other two media types, the time spent processing them is even smaller, taking only two hours to complete all files during the experiment.
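A measurement of this kind can be sketched as below: each item is hashed with MD5 while a timer records the per-item cost, and a running sum gives the cumulative time. The payloads are hypothetical in-memory byte strings of arbitrary sizes standing in for media files, so the printed timings say nothing about the actual dataset; the function name and sizes are assumptions made for illustration.

```python
import hashlib
import time
from itertools import accumulate


def time_checksum(payloads):
    """Return the MD5 computation time (in seconds) for each byte string."""
    timings = []
    for data in payloads:
        start = time.perf_counter()
        hashlib.md5(data).hexdigest()
        timings.append(time.perf_counter() - start)
    return timings


# Hypothetical stand-ins for media files; the sizes are arbitrary.
payloads = {
    "image": [bytes(100_000)] * 20,
    "audio": [bytes(300_000)] * 20,
    "video": [bytes(3_000_000)] * 20,
}

for media, items in payloads.items():
    timings = time_checksum(items)
    cumulative = list(accumulate(timings))  # running total, as in a cumulative plot
    mean = sum(timings) / len(timings)
    print(f"{media}: mean {mean:.5f}s per file, total {cumulative[-1]:.3f}s")
```

The per-item timings correspond to the kind of data behind a CDF of processing time, and the running total to a cumulative time curve.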

Figure 5.7: Time needed to process the checksum hash for each multimedia type within the WhatsApp data.

(a) CDF of Average Media Time (x-axis: Time (seconds), log scale; y-axis: CDF; series: Audio, Video, Image) (b) Accumulative Time Spent (x-axis: #Files; y-axis: Total Time Spent (min); series: Audio, Video, Image)

Source: The Author.

⁴ <https://github.com/facebook/ThreatExchange/blob/master/hashing/hashing.pdf>

Figure 5.8 shows the time required by different perceptual hash algorithms to process the image data from WhatsApp. Figure 5.8(a) shows the average time to process a single image file: the checksum is the fastest method, with most files taking less than 0.03 seconds to process. However, it is not a perceptual hash. Average hash (aHash) and pHash are also fast methods; both are calculated in less than 0.1 second. The Facebook algorithm PDQ is also a very good hash method for detecting images, but its computation requires around a second. Looking at the cumulative time spent calculating hashes, we can see the impact of these time differences on processing the entire dataset. The total time required to process all 900K images using PDQ was almost ten days, while pHash took about 6 hours. This is not far from the checksum, which took a similar amount of time.
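These figures can be related to the totals in Table 5.1, which are reported in minutes. The small sketch below converts three of those totals to hours and days; the helper names are illustrative.

```python
# Total processing time in minutes, as reported in Table 5.1.
total_minutes = {"checksum (MD5)": 316, "pHash": 395, "PDQ": 14_044}


def minutes_to_hours(minutes):
    return minutes / 60


def minutes_to_days(minutes):
    return minutes / (60 * 24)


for method, minutes in total_minutes.items():
    print(f"{method:15s} {minutes_to_hours(minutes):7.1f} h  {minutes_to_days(minutes):5.2f} days")

# How many times longer PDQ took than pHash on the same images.
print(f"PDQ / pHash: {total_minutes['PDQ'] / total_minutes['pHash']:.1f}x")
```

With these totals, PDQ comes to roughly 9.75 days, pHash to about 6.6 hours, and the checksum to about 5.3 hours.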

Figure 5.8: Time needed by different perceptual hash methods to process image data.

(a) Average Hash Time (x-axis: Time (seconds), log scale; y-axis: CDF; series: dhash, ahash, whash-haar, whash-db4, checksum, phash, pdq) (b) Total Time Spent (x-axis: #Files; y-axis: Total Time Spent (min); same series)

Source: The Author.

For those reasons, pHash was the method selected to process image data on WhatsApp: it is a good match for detecting duplicates of the content shared in public groups, it is faster than more complex hash algorithms such as wHash and PDQ, and it is about as fast as simple hash methods such as average hash. Although it is possible to measure the distance between two pHashes, the methodology adopted in this work groups only images with exactly the same hash. This strict threshold is used to distinguish problematic images (e.g., false images) created through very small manipulations that slightly change the content to mislead the user. Therefore, the original and the slightly manipulated images should be stored separately.

From the document Activism and misinformation on WhatsApp: measurement, analysis, and countermeasures (pages 70-75)