5.3 Architecture of Data Collection
Figure 5.4: WhatsApp Web connection flowchart.
[Flowchart. Recoverable labels: smartphone with a valid SIM card; login to WhatsApp Web; server running WhatsApp Web; collecting script (get_messages.py, built on the webwhatsapi code); extracted message data (ID, user, group, datetime, text, media, ...); save_message( ); media download via download_image( ), download_video( ), and download_audio( ); multimedia files stored on the server; merge of multimedia content by hash (process_hashes.py); WhatsApp database (JSON files); WhatsApp Monitor interface (online web system).]
Source: The Author.
Next, with the WhatsApp account set up and the groups selected for the research, data extraction can start collecting all data from the chats on the server. The main architecture of the collection is explained through the flowchart in Figure 5.4. This figure gives an overview of all the steps performed by the collection, as well as the objects involved during execution: (1) The mobile phone and the Firefox browser, both connected to WhatsApp, are the source of data. (2) A script looks at the data received in the WhatsApp groups and extracts the messages. (3) Messages are structured and saved in JSON. (4) Media files (image, video, and audio) are downloaded and saved on the server. (5) Another script analyzes these files and groups messages by similarity. (6) Grouped messages are saved in JSON, together with every occasion on which they were shared.²
With this architecture, we are able to access the WhatsApp account with the Selenium-based script, navigate through the joined groups, and collect the messages.
Note that, in WhatsApp, when someone joins a group, the new member has access only to messages sent after the date of joining; it is not possible to retrieve the group message history shared before that. Given an initial date for collection, for each group the script scrolls back through the chat until it reaches the content from the start date, and then saves each message individually in local storage. When a message has media attached to it, the script, using the WhatsAppWeb API library, needs to request the media file from the WhatsApp server. This file is downloaded, decrypted (using the WhatsApp hash), and also saved. At the end of this process, we obtain some metadata for each group, as well as information about the messages sent in the groups.
² All scripts used in this work for collecting WhatsApp data are accessible through the repository
<https://github.com/Phlop/WhatsApp_Crawler>
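The scroll-back step described above can be illustrated with a minimal sketch. It is not the code from the repository: the function name `collect_since`, the dictionary keys, and the assumption that messages arrive newest first (as they would while scrolling up in a chat) are all hypothetical, and the browser automation itself is left out.

```python
import json
from datetime import datetime
from pathlib import Path


def collect_since(messages_newest_first, start, out_dir):
    """Save each message dated on or after `start` as its own JSON file.

    Assumption: `messages_newest_first` yields dicts with "id" and an
    ISO-formatted "datetime" key, ordered from newest to oldest.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for msg in messages_newest_first:
        sent_at = datetime.fromisoformat(msg["datetime"])
        if sent_at < start:
            # Reached content older than the start date: stop scrolling back.
            break
        # Assumption: message IDs are safe to use as file names.
        target = out_path / f"{msg['id']}.json"
        target.write_text(json.dumps(msg, ensure_ascii=False), encoding="utf-8")
        saved.append(target)
    return saved
```

Stopping at the first message older than the start date mirrors the idea of scrolling back only as far as needed, rather than loading the whole chat history.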
5.3.1 Group Metadata
For each group, some metadata is available: a unique group ID; a title, which most groups have; a description (only for groups whose administrator wrote one, which is not the case for many groups); the profile image (only a link to the actual file on the server, as the file itself was not stored); the creation date; the user who created the group; and a list of all members of the group at the time of collection.
Interestingly, the structure of the collected data shows that the unique identifiers of WhatsApp groups follow the pattern 55319999XXXX-15928XXXX.
In this format, the first half is the phone number of the group creator, and the second half stands for a timestamp that represents when the group was created. Therefore, with only the group ID, one can also infer who created that group and when it was created.
This unique ID is also useful to distinguish the groups, as two groups can share the same title and a single group can change the title over time.
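As an illustration of how the two halves of a group ID could be separated, consider the sketch below. The function name `parse_group_id` is hypothetical, and treating the second half as a Unix timestamp in seconds (UTC) is an assumption; the text only states that it represents the creation time.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class GroupIdParts(NamedTuple):
    creator_phone: str
    created_at: datetime


def parse_group_id(group_id: str) -> GroupIdParts:
    """Split '<creator phone>-<creation timestamp>' into its two halves."""
    phone, sep, stamp = group_id.partition("-")
    if not sep or not phone.isdigit() or not stamp.isdigit():
        raise ValueError(f"unexpected group ID format: {group_id!r}")
    # Assumption: the timestamp counts seconds since the Unix epoch.
    created_at = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
    return GroupIdParts(creator_phone=phone, created_at=created_at)


# Made-up ID that follows the pattern above; it is not a real group.
parts = parse_group_id("553199990000-1592800000")
print(parts.creator_phone, parts.created_at.isoformat())
```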
For the other group members, we have a list of the phone numbers of those users. Notably, every WhatsApp group exposes the full list of its members. This can be understood as a security problem in WhatsApp, as malicious scripts can exploit this attribute to parse a huge list of groups, collect their members' phone numbers, and build a large dataset of exposed phones. This issue is investigated further in Chapter 6.
5.3.2 Messages Data
For messages, we have the unique identifier of the message, the ID and title of the group where it was sent, the user (by phone number) who sent it, the date and time, the text content, the kind of message (i.e., text, video, audio, image, document), and the filename of any media attached to it.
Note that, in this format, a unique piece of information that was shared in multiple groups and/or by multiple users is stored as totally distinct objects. In order to see how information flows through the network, another step is necessary to aggregate identical messages and track this dissemination.
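A minimal sketch of such an aggregation step is given below, assuming exact-match grouping by a SHA-256 digest of the content bytes. The choice of hash function, the function names, and the dictionary keys are assumptions made for illustration; they are not taken from process_hashes.py, and near-duplicate content (for example, a re-encoded image) would not be merged by this approach.

```python
import hashlib
from collections import defaultdict


def content_hash(payload: bytes) -> str:
    """Return a hex digest that identifies byte-identical content."""
    return hashlib.sha256(payload).hexdigest()


def group_by_content_hash(messages):
    """Map each content digest to the list of times it was shared.

    Assumption: each message is a dict with "id", "group", "user",
    "datetime", and "content" (the raw bytes of the media or text).
    """
    shares = defaultdict(list)
    for msg in messages:
        digest = content_hash(msg["content"])
        shares[digest].append(
            {
                "message_id": msg["id"],
                "group": msg["group"],
                "user": msg["user"],
                "datetime": msg["datetime"],
            }
        )
    return dict(shares)
```

Each value in the resulting dictionary lists every share of one piece of content, which is the information needed to follow its dissemination across groups and users.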
5.3.3 Frequency of Collecting WhatsApp Data
This step of going through the groups and getting the messages can be very time-consuming. Due to the design of WhatsApp Web, scrolling to an earlier message requires simultaneously loading all later messages, which imposes heavy resource (CPU/memory) usage.
CHANG, 2020 performed an experiment to evaluate the CPU resources required to scroll and download messages from WhatsApp. Their results show that checking for messages every n hours has a cost greater than O(n) each time, since the CPU and memory are strained by having to load n hours of messages all at once and cannot read and log messages as efficiently. That happens because, for example, in a group with 1000 messages, when one opens the group, only the most recent messages are loaded at first (e.g., messages #990-#1000). If the user wants to load message #1 of that chat, they need to keep scrolling through all the messages from #990 down to #2 before getting access to message #1. Using a sample of data, their work checked around 200 groups every three hours (for 48 hours in total), and for each group-time pair, they recorded the number of new messages in that group, how long it took to read those messages, and how long it took to scroll to those messages. With this, they estimated the processing time to be around O(n^2).
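To see what a superlinear per-check cost implies for the choice of collection interval, the following sketch compares total cost over a fixed window. It is a toy model under stated assumptions: each check is taken to cost exactly n^2 abstract units for an interval of n hours, fixed per-check overhead (login, page load) is ignored, and the function name `total_scroll_cost` is hypothetical.

```python
def total_scroll_cost(total_hours: float, interval_hours: float,
                      exponent: float = 2.0) -> float:
    """Total cost of checking every `interval_hours` over `total_hours`.

    Assumption: one check costs interval_hours ** exponent abstract units.
    """
    if total_hours <= 0 or interval_hours <= 0:
        raise ValueError("hours must be positive")
    checks = total_hours / interval_hours
    return checks * interval_hours ** exponent


# Compare several intervals over the same 48-hour window.
for interval in (3, 6, 12):
    print(interval, total_scroll_cost(48, interval))
```

Under this model the total cost grows linearly with the interval, so more frequent checks are cheaper overall; in practice the fixed overhead of each check would set a lower bound on how short the interval can usefully be.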
Another problem with collecting WhatsApp data continuously is the need for constant Internet connectivity on both the server and the smartphone where WhatsApp is installed.
The design of WhatsApp requires that, for WhatsApp Web to work, the smartphone must be connected as well. As the smartphone connection relies on Wi-Fi, it is highly susceptible to regular disconnections and interruptions, which directly affect the scraper. Furthermore, WhatsApp Web frequently changes its HTML and the structure of how it works, which also causes pauses in the collection. Finally, WhatsApp has an ephemeral design, in which content is not stored for an unlimited period. With the “Temporary chat” functionality, all messages from a group that uses it are erased after a week. Moreover, for the remaining groups, WhatsApp does not store the media³ for more than 15 days. Thus, even a regular user trying to see this content on their smartphone will no longer be able to retrieve it.
This shows that continuously collecting data from groups is not a trivial task. It can often lead to some data loss, especially in long-term collections.
³ All media files are stored encrypted on the WhatsApp server. When one wants to access a file through the app or any other client, the client makes a request to the server, then downloads and decrypts the file in order to display the content to the user.