Future Work - Using GANs to create synthetic datasets for fake news detection models

Building on the previous section, there are several ways to extend this work. First, hyperparameter tuning could cover a larger set of hyperparameters, and the GANs could be trained for more epochs. This would allow for better-trained GANs and, probably, higher-quality synthetic data. Additionally, other generative models, such as Autoencoders (AEs), flow-based models, or diffusion models, could be used and compared with GANs.
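As a rough illustration of such a wider search, the following sketch enumerates a larger hyperparameter grid and samples configurations from it at random. The parameter names and value ranges are assumptions chosen for illustration, not the settings used in this work, and `train_and_score` is a hypothetical placeholder for a function that trains a GAN with a given configuration and returns a quality score.

```python
import itertools
import random

# Hypothetical search space: names and values are illustrative assumptions,
# not the configuration used in this work.
SEARCH_SPACE = {
    "epochs": [300, 500, 1000],
    "batch_size": [128, 256, 512],
    "generator_lr": [1e-4, 2e-4, 5e-4],
    "discriminator_lr": [1e-4, 2e-4, 5e-4],
}


def sample_configs(space, n_configs, seed=0):
    """Draw up to n_configs distinct configurations uniformly from the full grid."""
    keys = list(space)
    grid = [dict(zip(keys, values)) for values in itertools.product(*space.values())]
    rng = random.Random(seed)
    return rng.sample(grid, min(n_configs, len(grid)))


def random_search(space, n_configs, train_and_score, seed=0):
    """Return (best_config, best_score); higher scores are assumed to be better."""
    scored = [(train_and_score(cfg), cfg) for cfg in sample_configs(space, n_configs, seed)]
    best_score, best_config = max(scored, key=lambda pair: pair[0])
    return best_config, best_score
```

Random sampling is used here only because the full grid grows quickly as hyperparameters are added; an exhaustive or model-based search would fit the same interface.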

Another way we could improve this work is by considering news datasets in their raw form (text format) and handling them as they are. When news items are represented as tabular data, some information is lost in the process. By investigating GANs for text generation, perhaps we can improve the results.

As seen in section 5.4, the GAN that showed the best results on the multiple utility metrics used (see the radar chart in figure 5.6 for a quick refresher) was not the one that provided the best results in the data augmentation task. This was quite interesting, and we believe it should be explored further, not only to better understand the relation between intrinsic data quality and classification performance under data augmentation, but also to assess whether there are more informative measures for the data augmentation task.
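One possible way to quantify this relation, offered here only as a sketch, is a rank correlation between a utility metric and the downstream classification score across the trained GANs. The code below computes Spearman's rank correlation with the standard library only; the choice of Spearman's coefficient and the function names are assumptions, not part of the analysis carried out in this work.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / (var_x * var_y) ** 0.5
```

With one entry per GAN in each list (for example, a utility metric and the minority class F1 after augmentation), a value near zero or below would indicate that the utility metric is a poor guide to augmentation performance. With only a handful of models, any such coefficient should be read with caution.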

In section 5.3, we noted that the features capital-gain and capital-loss were not included in the synthetic data evaluation because they were poorly generated by all the GAN models (see figure 5.1). Future work may investigate what caused this and, more generally, explore how features with certain distributions affect the performance of GANs.
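A possible starting point for such an investigation, given only as a hypothetical sketch, is to measure how strongly each feature is concentrated on a single value and then compare this with how well each feature was generated. Whether concentration is actually the cause of the poor generation is an open question; the function names and the threshold below are assumptions made for illustration.

```python
from collections import Counter


def dominant_value_share(values):
    """Return (most common value, fraction of rows that take that value)."""
    if not values:
        raise ValueError("values must not be empty")
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


def flag_concentrated_features(table, threshold=0.9):
    """List columns whose most common value covers at least `threshold` of the rows.

    `table` maps column names to lists of values. The default threshold of 0.9
    is an arbitrary illustrative cut-off, not a value taken from this work.
    """
    return [
        name
        for name, values in table.items()
        if dominant_value_share(values)[1] >= threshold
    ]
```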

Another aspect we want to improve is the use of more test sets, in order to decrease any bias that may have been (inadvertently) introduced. As mentioned in section 5.4, both the Adult and the LIAR-PLUS datasets were split into train and test sets. Although the split was random, a single split may still have introduced some bias. Replicating the experiments with more train-test splits could therefore help address this.
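The replication over several splits could be organised roughly as follows. This is a minimal sketch under stated assumptions: `evaluate` is a hypothetical callback that trains and scores a model on the given row indices, and the number of repetitions and the test fraction are illustrative defaults rather than the values used in this work.

```python
import random
from statistics import mean, stdev


def repeated_holdout(n_samples, evaluate, n_repeats=10, test_fraction=0.2, seed=0):
    """Evaluate a model over several random train-test splits.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains and
    scores a model on the given row indices and returns a single number.
    Returns the mean score and its sample standard deviation across splits.
    """
    if n_samples < 2:
        raise ValueError("need at least two samples to split")
    rng = random.Random(seed)
    n_test = min(n_samples - 1, max(1, round(n_samples * test_fraction)))
    scores = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repetition
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        scores.append(evaluate(train_idx, test_idx))
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread
```

Reporting the spread alongside the mean makes it visible how much of an observed difference could be due to the split alone. The splits above are not stratified; for strongly imbalanced data, a stratified variant may be preferable.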

Regarding the experiments in section 5.4, further analysis could shed light on what causes abrupt gains and losses in ML performance as the number of samples increases. More specifically, we want to fully understand what caused, in the Adult dataset, the gains in the accuracy and F1 of the logistic regression (in the minority class) when the proportion of minority class samples generated by TabFairGAN increased from 0.9 to 1, as well as the abrupt losses in precision and recall (see figure 5.7). Moreover, in the LIAR-PLUS dataset, the wild oscillations in the minority class precision of the decision tree for the samples generated by CTGAN 4 (see figure 5.11) are also worthy of further exploration.

Furthermore, it became clear from our experience during this study that GANs for generating tabular data are under-researched compared to GANs for synthesizing image data. Therefore, we could extend this work by creating our own GAN for generating synthetic tabular data. In addition, we would be interested in creating a package that other researchers can easily use. Such a package could provide several built-in tabular GAN architectures, removing the need for users to build them from scratch. Moreover, it could have a module for synthetic data evaluation, which would give users immediate feedback as soon as their GANs are trained. This way, users could rapidly try out several GAN architectures and see which best fits their use case. This would be very useful, since the existing packages still have a lot of room for improvement.
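To make the idea of such a package more concrete, the sketch below shows one way its core interface might look. Everything in it is hypothetical: the `TabularSynthesizer` protocol, the toy `MarginalSampler` (a stand-in for a real GAN that simply resamples each column independently), and the mean-gap report are illustrative assumptions, not an existing library or the design of this work.

```python
import random
from statistics import mean
from typing import Dict, List, Protocol

Table = Dict[str, List[float]]  # column name -> numeric values (a simplification)


class TabularSynthesizer(Protocol):
    """Hypothetical common interface that every built-in architecture would expose."""

    def fit(self, data: Table) -> None: ...
    def sample(self, n_rows: int) -> Table: ...


class MarginalSampler:
    """Toy stand-in for a GAN: resamples each column independently."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._columns: Table = {}

    def fit(self, data: Table) -> None:
        self._columns = {name: list(values) for name, values in data.items()}

    def sample(self, n_rows: int) -> Table:
        return {
            name: [self._rng.choice(values) for _ in range(n_rows)]
            for name, values in self._columns.items()
        }


def mean_gap_report(real: Table, synthetic: Table) -> Dict[str, float]:
    """Absolute difference between real and synthetic column means."""
    return {name: abs(mean(real[name]) - mean(synthetic[name])) for name in real}


def fit_and_evaluate(model: TabularSynthesizer, real: Table, n_rows: int) -> Dict[str, float]:
    """Train a synthesizer, sample from it, and return immediate feedback."""
    model.fit(real)
    return mean_gap_report(real, model.sample(n_rows))
```

Because every architecture would share the same `fit` and `sample` methods, a user could pass several models through `fit_and_evaluate` in a loop and compare the reports. A real evaluation module would need richer metrics than a difference of means.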

We believe that the approach proposed in this paper has several applications. Since the architectures we use are not limited to news datasets, they can be applied to other domains where tabular data is used. In addition, they could help detect fake news that would not be captured if the ML models were trained only on an unexpanded dataset. Clearly, this can have a positive impact and help mitigate a serious problem that affects us all: Fake News.
