FireEye’s Data Science and Information Operations Analysis teams released this blog post to coincide with our Black Hat USA 2020 Briefing, which details how open source, pre-trained neural networks can be leveraged to generate synthetic media for malicious purposes. To summarize our presentation, we first demonstrate three successive proof of concepts for how machine learning models can be fine-tuned in order to generate customizable synthetic media in the text, image, and audio domains. Next, we illustrate examples in which synthetically generated media have been weaponized for information operations (IO), as detected on the front lines by Mandiant Threat Intelligence. Finally, we outline challenges in detecting synthetically generated content, and lay out potential paths forward in a future where synthetically generated media will increasingly look, speak, and write like us.
Background: Synthetic Media, Generative Models, and Transfer Learning
Synthetic media is by no means a new development; methods for manipulating media for specific agendas are as old as the media themselves. In the 1930’s, the chief of the Soviet secret police was photographed walking alongside Joseph Stalin before being retouched out of an official press photo, after he himself was arrested and executed during the Great Purge. Digital graphic manipulation like this became prominent with the advent of Photoshop. Then later in the 2010’s, the term “deepfake” was coined. While deepfake videos, including techniques like face swapping and lip syncing, are concerning in the long term, this blog post focuses on more basic, but we argue more believable, synthetic media generation advancements in the text, static image, and audio domains. Machine learning approaches for creating synthetic media are underpinned by generative models, which have been effectively misused to fabricate high volume submissions to federal public comment websites and clone a voice to trick an executive into handing over $240,000.
The pre-training required to produce models capable of synthetic media generation can cost thousands of dollars, take weeks or months of time, and require access to expensive GPU clusters. However, the application of transfer learning can drastically reduce the amount of time and effort involved. In transfer learning, we start from a large generic model that has been pre-trained for an initial task where copious data is available. We then leverage the model’s acquired knowledge to train it further on a different, smaller dataset so that it excels at a subsequent, related task. This process of training the model further is referred to as fine-tuning, which typically requires less resources compared to pre-training from scratch. You can think of this in more relatable terms—if you’re a professional tennis player, you don’t need to completely relearn how to swing a racket in order to excel at badminton.