
The road to realistic full-body deepfakes

It’s been almost five years since the advent of deepfakes put the ability to change people’s facial identities into the public domain – at first only in recorded video, and now even in live streams, via DeepFaceLive.

You still need to find someone who looks a bit like the person you want to imitate, though. The more closely they resemble the “target” identity, the more convincing the illusion.

If you have the right face shape, wear it! Left, Miles Fisher fits perfectly as Tom Cruise’s deepfake “canvas”, while right, Alexis Arquette proves a fit subject for Jerry Seinfeld in a deepfake parody that can be seen at /watch?v=S1MBVXkQbWU.

Since autoencoder-based deepfake systems are trained for long periods on a single pair of relatively similar “opposite” identities, the authenticity of the resulting performance will suffer according to the degree of physical difference between the “host” and the superimposed personality.

So it can be difficult to find the right person to act as a “canvas” for a deepfake personality. Even if peripheral features such as hair, ears, neck height, basic skin tone and physique (and age) can be modified to be “close enough”, the chances of a full face and full body “match” in one person all but disappear.

What if, instead, you could recreate an entire person with machine learning, without expensive and complex professional CGI techniques?

1990s Jennifer Connelly (and Henry Cavill, illustrated) recreated by Stable Diffusion and EbSynth, based on the actual movements of a female performer (bottom-left “source” image). The entire body of the performer has been reinterpreted from the source footage, based on Stable Diffusion’s knowledge of the faces and physiques of the two personalities reconstructed here – both of which are well represented in the database on which the model was trained. Predictably, the AI finds it easier to turn one woman into another woman than into a muscular man like Henry Cavill.

Shortly, we’ll look at the possibilities and limitations of Stable Diffusion and of EbSynth, a non-AI “tweening” and style-transfer application; and (if you’re wondering) at why clothing represents such a tough challenge in any such attempt.

First, though, we should consider a few reasons why a truly effective, 100% neural text-to-video system for full-body deepfakes could be years or even decades away – not months, as many avid Stable Diffusion fans currently seem to believe.

The (Slow) Future of 100% Neural Full-Body Deepfakes

Though the video clip above, for all its rough edges, is just a cheap combination of open source software, this kind of tangible “deepfake puppetry” is likely to be the earliest form of consumer-grade full-body deepfake avatar in the Metaverse, as well as in other potential virtual environments and contexts where the bodily movements and overall appearance of participants will eventually be changeable in real time.

This is because it’s easy for humans (such as the performers powering “Jennifer Connelly” and “Henry Cavill” in the clip above) to string concepts or instructions into a sequence of actions; and because face/body capture AI systems are now advanced enough to “map” human movement in real time, so that image or video synthesis systems such as DeepFaceLive can “overwrite” the original identity at extremely low latency.

But if you want to describe human activity in text-to-video prompts (instead of using footage of real people as a guide), and you expect convincing, realistic results lasting more than 2-3 seconds, the system in question will require an extraordinary, almost Akashic store of knowledge – far more than Stable Diffusion (or any other existing or planned deepfake system) possesses.

These domains include anatomy, psychology, basic anthropology, probability, gravity, kinematics, inverse kinematics and physics, to name a few. To make matters worse, the system will require a temporal understanding of such events and concepts, rather than the fixed and temporally static embeddings contained in Stable Diffusion, which are based on the 4.2 billion still images on which it was trained.
The “simple” joint representation of image and text (as in a Stable Diffusion-style system) differs by orders of magnitude in complexity from the vast amount of information that would have to be represented in an equivalent system embedding motion. The topmost image above represents classes and domains clustered together in a “latent noise” or trained-input “cloud” in a GAN-style system; here, the images have been trained into searchable, retrievable embeddings in the model’s latent space. Below, we see just one example of a single video clip that might feature in an equivalent temporal system. Clips and their associated information require at least as many textual annotations as “static” systems, and must likewise form relationships with similar clips, classes and domains in the broader dataset – albeit with a far more complex set of possible parameters. For a start, far more storage, processing and compute power is needed to derive and apply the applicable embeddings from the source data. Source: | nvidia-optical-flow-sdk/

And that’s before the hypothetical text-to-video system even starts to consider which textures, lighting, geometry and other visible factors and facets might fit the scene, or how to generate a suitable accompanying soundtrack (another, almost equally complex database and set of auxiliary models that would need to be developed).

Business logistics of text-to-video investment

So it will be far more difficult to author a truly comprehensive and versatile “body movement” equivalent of the LAION database that powers Stable Diffusion; and the schemas and protocols that do emerge are more likely to be developed for the academic and private research sectors than to appear as “full” deepfake functionality in consumer apps competing for VFX and licensed entertainment work.

Sharing a bare open source architecture on GitHub is one thing; releasing a fully trained open source model is another. Trained models cost millions of dollars to create, as was the case with Stable Diffusion. From a market-share and general business-logic standpoint, it’s hard to say whether such a generous event could happen again – and it may depend on the extent to which the open-sourcing of Stable Diffusion ultimately undermines OpenAI’s investment in DALL-E 2, and/or brings Stability.ai more financial traction than putting its remarkable product behind a commercial API would have done.

In any case, the earliest such text-to-video system to make notable progress is the 9-billion-parameter, transformer-based CogVideo architecture, which we covered in a recent article on the future of Stable Diffusion video, and which was released in May 2022.

How CogVideo addresses the data starvation of text-to-video

While CogVideo is the foremost text-to-video system currently available (and the only method I know of that “invents” fully neural, free-roaming animated humans from text alone, without any CGI involvement), its authors observe that it and similar systems are limited by the cost and logistics of curating and training on a suitable motion-based dataset – just one factor suggesting that Stable Diffusion fans may need to adjust their current expectations for hyperscale text-to-video a little.

As pointed out elsewhere, the largest current multilingual video-description dataset (video clips must be annotated with text descriptions to acquire semantics – the task that OpenAI’s CLIP performs in Stable Diffusion’s architecture) is VATEX, which contains only 41,250 videos, supported by 825,000 captions.

In effect, this means attempting a task at least ten times as difficult as Stable Diffusion’s generative feat, with far less than a tenth of the necessary data.

To address this, CogVideo adapted CogView2, a Chinese static generative-art transformer, to the text-to-video task; the resulting CogVideo dataset contains 5.4 million text/video pairs – still arguably insufficient data for so arduous a task.

However, if I’m a little sceptical about the enormous challenge of creating a really good text-to-video framework without FAANG-level resources (which inevitably bring commercialization and gatekeeping of the final product), my pessimism is not shared by Wenyi Hong, one of CogVideo’s contributing authors, with whom I recently had a chance to speak.

“I don’t think it has to be that expensive,” Hong told me.


While she concedes that a temporal video synthesis system comparable to the generative power of Stable Diffusion or DALL-E 2 might be five to ten times more expensive to develop and train, she suggests that initial viral video-synthesis clips could be short, and require less exorbitant resources.

Since Hong and her colleagues are developing CogVideo integrations for social media platforms, the earliest and most widespread CogVideo output seems likely to come in the form of short, shareable videos lasting a few seconds, which don’t require hyperscale resources from the outset.

“You can use a much smaller dataset than LAION,” she said, “to train a video model like CogVideo, which typically produces videos lasting a few seconds. However, if we want to generate more complex videos, we need larger datasets.” As in the wider field of machine learning research, the logistics and availability of annotated video, and of GPU memory (VRAM), represent core challenges:

“If we want to generate a high-resolution video, we must specify the resolution, frame rate and length of the video. This is the biggest problem. If we had enough resources, enough memory, we could generate videos of any length.

“But if the captions and the video are not very closely related, or not extensive enough, or not detailed enough, there will be problems.

“Most captions will only describe one action in one video, such as ‘a person holding a cup’. But if the video is long, maybe a minute, the person won’t hold the object for that long. Maybe they’ll let it go, or they’ll start doing something else. The need to accommodate this level of sophistication will make the entire training process difficult.

“However, we have open-sourced CogVideo on GitHub, and I will try to develop an API where people can enter their own sentences. For this, we are working with Hugging Face, who have created the API for us.”

So it may be that the development of an effective and powerful text-to-video system will come about through global participation – perhaps through federated learning, in some SETI-style, at-home implementation, now that Ethereum’s move to Proof-of-Stake promises to free up GPU capacity and availability worldwide.

If text-to-video is as desperately wanted as it appears to be, such a system may be eagerly adopted by the growing ranks of image and video synthesis enthusiasts.

When I asked Hong how far we are from a neural system that can effectively parse a script or a book into a movie, she replied: “Well, maybe ten to twenty years.”

Beyond Stylized Transformations in Stable Diffusion Video

However, because Stable Diffusion makes it so easy for anyone to create stunning images, and because it has captured the public imagination only months after OpenAI’s earlier and more “locked-down” DALL-E 2, a growing public now expects prompt-driven, photoreal video – open to everyone, and running for more than a few seconds – rather sooner than a decade or two from now.

Indeed, Runway – the AI VFX company that participated in the development of Stable Diffusion – is currently teasing a similar, prompt-driven video creation system, to be released soon.

Runway’s text-to-video trailer previews some impressive features, but the only movement shown seems to derive from real source footage, and it remains to be seen to what extent neural humans may or may not be featured in the system. Source:


The AI element missing from the Runway trailer (and from any mature, available current product) is people – the domain we know best, and the most challenging area of AI-based image and video synthesis: walking, acting, interacting, running, tripping, lying down, standing, swimming, kissing, punching, listless, laughing, crying, jumping, posing, talking people.


While Stable Diffusion can generate static figures and humanoids very convincingly, even photorealistically, most of the videos emerging from the frenetic efforts of the SD community are either stylized (i.e. cartoon-style, usually achieved through Stable Diffusion’s noise-based pipeline), “psychedelic” (usually made with Stable WarpFusion or Deforum), or display very limited motion (usually made with EbSynth, which we’ll come to later).

Some of the more stylized or even psychedelic motion implementations in Stable Diffusion. Source (clockwise): | | .com/watch?v=_MDsKJYqaoY |

As we’ll see, using EbSynth to animate Stable Diffusion output can produce more realistic imagery; however, both Stable Diffusion and EbSynth have implicit limitations that restrict the movement of any realistic human (or humanoid) – and it’s tempting to place such mock-ups in the limited “let’s animate that static head a bit” category represented by DeepNostalgia, and by the numerous scientific attempts over the past 4-5 years to endow static human representations with “limited life”:
Some GAN-based methods can produce limited human movement and limited dynamic faces. Source, clockwise: | | /rendering-with-style-combining-traditional-and-neural-approaches-for-high-quality-face-rendering/

Many of these systems rely on interpreting existing, real-world human motion, and using that motion information to drive the transformation, rather than drawing on a refined database of knowledge about human movement, as CogVideo does.

For example, for the aforementioned Connelly/Cavill full-body deepfakes, I used Stable Diffusion’s Img2Img function to convert footage of a single performer into the two characters. With Img2Img, you supply Stable Diffusion with a source image (anything from a rough sketch to an ordinary photo), along with a text prompt that tells the system how it should change that image (e.g. ‘Jennifer Connelly in the 1990s’, or ‘Henry Cavill, topless’).


A source image and some prompts (with negative prompts in the lower box) result in a fairly accurate Img2Img conversion of a woman into actor Henry Cavill, in the very popular AUTOMATIC1111 distribution of Stable Diffusion.

As with autoencoder-based deepfakes (i.e. the kind that have powered viral deepfake videos for the past five years), machine learning systems are more likely to achieve a transformation when source and target have more in common – for example, in the image above, Henry Cavill has his hands in his pockets, which doesn’t exactly reflect the source pose.

By contrast, the image below shows that Stable Diffusion can convert the source image of the woman doing yoga into a more pose-accurate approximation of Jennifer Connelly:

Even at lower settings, Stable Diffusion finds it easier to turn the woman in the source image into another woman than into a man – in this case, a representation of actress Jennifer Connelly in the late 1990s.

Controlling “Strength” and “Guidance” in Stable Diffusion

The two defining forces in a Stable Diffusion transformation are the CFG scale and the denoising strength.


CFG stands for Classifier-Free Guidance. The higher you set this scale, the more closely the system will follow the instructions in the prompt – even though this may cause artifacts and other visual anomalies.

In many cases, the LAION dataset on which the model was trained is so authoritative that even a short additional Img2Img instruction can produce valid results without setting this value very high.

But if you try to make things happen that Stable Diffusion has no prior knowledge of; provide source images that are difficult to parse; or combine things, people or concepts that are difficult to combine coherently; then you may have to turn up the CFG scale or denoising strength, forcing Stable Diffusion to be more “imaginative” – though usually at the expense of some aspect of image quality.
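The mechanics behind these two controls can be sketched in a few lines. The guidance formula below is the standard classifier-free guidance combination used by diffusion samplers, and the strength-to-steps mapping mirrors the behavior of common Img2Img pipelines; the function names and the exact truncation are illustrative assumptions, not the internals of any particular distribution.

```python
def cfg_combine(uncond_pred, cond_pred, guidance_scale):
    """Classifier-Free Guidance: push the model's noise prediction away from
    the unconditional result and toward the prompt-conditioned one.
    A higher scale follows the prompt more literally, at the risk of
    artifacts. Predictions are modelled here as flat lists of floats."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond_pred, cond_pred)]

def img2img_steps(num_inference_steps, denoising_strength):
    """Denoising strength decides how much of the diffusion schedule actually
    runs over the source image: 0.0 leaves the source untouched, while 1.0
    re-noises and regenerates it almost completely. (The int() truncation is
    an assumption about the rounding behavior.)"""
    return min(int(num_inference_steps * denoising_strength), num_inference_steps)
```

At a guidance scale of 1.0, `cfg_combine` simply returns the conditioned prediction; the commonly used scales of 7-13 extrapolate well beyond it, which is why very high CFG values start to distort the image.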

For example, while Stable Diffusion can turn a slender woman into a generically muscular man like Henry Cavill, it has extraordinary difficulty simply changing the color of clothes (part of Stable Diffusion’s general problem with clothing, which we’ll look at later).


Even with an above-average CFG of 13.5 and a denoising strength of 0.58, the dress does not change color – even with ‘red’ set as a forbidden (negative) word.

In one experiment, I attempted to change the color of the clothes worn by the female performer in the source footage, within a Stable Diffusion transformation intended to yield actress Salma Hayek.

However, I found that no combination of settings, plugins or other tricks could accomplish this apparently minor task. In the end, the CFG and denoising settings had to be pushed almost to maximum before Stable Diffusion would convert the clothing color – and 90-95% of the converted pose fidelity, style and coherence was lost in the process:

In general – much as traditional autoencoder deepfakes tend to work best with a “host” that resembles the identity being imposed – it’s usually easier to start from source material that’s at least closer to what you ultimately want to render (i.e. requiring your performer simply to wear a red dress in the first place).

While at least one supplementary script can use CLIP to identify, mask and change specific elements, such as an item of clothing, it’s too inconsistent to be useful for generating temporally coherent full-body deepfake video.


Stable Diffusion’s Txt2Mask plugin can isolate and change clothes, but – characteristically of many of Stable Diffusion’s most “cutting edge” features – it’s currently hit-and-miss. Source:
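The principle behind Txt2Mask-style selective editing can be illustrated with a toy sketch: a text-guided segmentation model (CLIP-driven, in Txt2Mask’s case) scores each region of the image for relevance to a phrase like “a dress”, the scores are thresholded into a binary mask, and edited pixels are composited back only where the mask fires. Everything below – the plain-float relevance maps, the threshold, the tiny “images” – is an illustrative stand-in for the real model outputs, not the plugin’s actual code.

```python
def relevance_to_mask(relevance, threshold=0.5):
    """Turn a per-pixel text-relevance map (assumed here to be plain floats
    in [0, 1], as a CLIP-guided segmenter might produce) into a binary mask
    selecting the region to re-render, e.g. 'a red dress'."""
    return [[1 if v >= threshold else 0 for v in row] for row in relevance]

def apply_mask(source, replacement, mask):
    """Composite: take replacement pixels where the mask fires, and keep the
    source everywhere else - the basis of selective, text-driven editing."""
    return [[r if m else s for s, r, m in zip(srow, rrow, mrow)]
            for srow, rrow, mrow in zip(source, replacement, mask)]
```

The temporal problem the article describes lives in the first step: if the relevance map flickers from frame to frame, the mask (and therefore the edit) flickers with it.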

Clothing chaos in Stable Diffusion

If you’re wondering why there’s so much bare skin in these examples, it’s because Stable Diffusion has further issues with clothing and attire.

Surprisingly, there is very little “famous clothing” that has generalized so strongly into a LAION-based Stable Diffusion model that you can rely on it to appear consistently across a series of consecutively rendered frames.

Even Levi’s 501 jeans (of which there are countless examples in the LAION database), which topped a 2020 list of the most iconic items of clothing, cannot be relied on to render consistently in a Stable Diffusion Img2Img full-body deepfake sequence.

In terms of temporal coherence, and with a fixed seed (i.e. Stable Diffusion doesn’t “randomize” how it represents the jeans, but sticks to a previously chosen, well-performing rendering), the world’s most recognizable item of clothing performs well above average – but there are still random tears and glitches.


Jennifer Connelly’s face and body? No problem – LAION-trained Stable Diffusion has ingested nearly 40 years of Connelly photos – event pictures, paparazzi beach snaps, promotional stills, extracted video frames, and many other sources – allowing the system to generalize the actress’s core identity, face and physique across a range of ages.

Helpfully, Connelly’s hairstyles have been relatively consistent over the years – which is not always the case for women (because of fashion and aging) or men (because of fashion, aging and male pattern baldness).

Despite her having become a star only recently, Stable Diffusion has internalized a variety of hairstyles for actress Margot Robbie – many of which would be very challenging to stabilize via prompts into a coherent temporal video.

However – not least because of the sheer amount of material in the database – Connelly is wearing different clothes in almost every one of her LAION photos:

Over the years, Jennifer Connelly has worn a wide variety of clothing, as shown in the LAION database, and as subsequently trained into Stable Diffusion. Source:

So if you ask Stable Diffusion for ‘Jennifer Connelly’, does it choose a specific outfit that’s statistically over-represented in her LAION pictures? Might it summarize every outfit she wears in LAION into something “representatively generic”? Does it select from the garments with the highest LAION aesthetic scores? And to what extent does the prompt itself affect the choice or continuity of clothing depicted across a series of rendered frames?
Different renders from a Stable Diffusion prompt associated with ‘Jennifer Connelly’, showing a mostly random range of clothing.

Stable Diffusion was open-sourced only a little over a month ago, and these are among the many questions that remain unanswered; but in practice, even using a fixed seed (which we’ll come to later), it’s hard to get consistent clothing out of a latent diffusion model such as Stable Diffusion or DALL-E 2 – unless the garment in question is distinctive, unchanged over the years, and well represented in the model’s training database.

Full-body deepfake consistency via Textual Inversion

In this case, one solution for consistent clothing might be a Textual Inversion model – a small piece of additional code that encapsulates the appearance and semantic meaning of an object or entity, through brief training on a limited number of annotated photos.

At inference time, a Textual Inversion sits “adjacent” to a standard trained model, and can act almost as effectively as if the system had originally been trained on the material.

In this way, in theory, one could create a Textual Inversion of Levi’s 501s (or of a specific hairstyle) consistent enough to support temporal video – and likewise truly “stable” models of more humble outfits.
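The core idea can be sketched in miniature: Textual Inversion learns an embedding vector for a brand-new pseudo-token (e.g. a made-up word for your specific pair of jeans) while the model itself stays frozen; at prompt time the pseudo-token is looked up exactly like any ordinary word. Everything here is a toy stand-in – the token name, the tiny 2-dimensional vectors, and the “training” (collapsed to a simple average) are illustrative assumptions, not the actual optimization loop.

```python
def learn_pseudo_token(reference_embeddings):
    """Stand-in for the real optimisation: condense the embeddings of a
    handful of reference photos into a single vector for the new token.
    (In practice this vector is learned by gradient descent against the
    frozen diffusion model's reconstruction loss.)"""
    dims = len(reference_embeddings[0])
    return [sum(e[i] for e in reference_embeddings) / len(reference_embeddings)
            for i in range(dims)]

def embed_prompt(prompt_tokens, embedding_table):
    """The learned pseudo-token sits 'adjacent' to the frozen vocabulary and
    is resolved exactly like any ordinary word at prompt time."""
    return [embedding_table[t] for t in prompt_tokens]
```

Because the new concept lives entirely in one embedding vector, it can be shipped as a tiny file and dropped into anyone’s copy of the model – which is what makes the trading scenario below plausible.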

If this becomes an established solution, expect a trading scene a little like the early heyday of the Renderosity marketplace, where users still buy and sell content for Poser and Daz 3D.

Ultimately, Textual Inversion may represent the only reasonable way to obtain temporal object consistency in Stable Diffusion, and to easily insert “unknown” people into the system, with a view to creating full-body deepfakes in latent diffusion systems. Some Reddit users are currently putting themselves (and some of the more obscure public figures) into Stable Diffusion via this route:

None of these people are in your copy of Stable Diffusion – or not at all, or not at this resolution – but have been added by hobbyists using Textual Inversion. The top image is a self-portrait of Reddit user ‘Dalle2Pictures’, who, despite his username, used Textual Inversion with Stable Diffusion in this case; the middle row is another Reddit user, sEi_, who likewise inserted his own portrait into the system via Textual Inversion; the bottom row is a Stable Diffusion rendering of former U.S. Rep. Tulsi Gabbard, who is not rendered at this level of detail in the standard Stable Diffusion distribution – in this case, Reddit user Visual-Ad-8655 reportedly generated a Textual Inversion for Gabbard in just two hours. Source, top to bottom: | | ://

While the creation process is currently hardware-intensive, users can accomplish it through web-based Google Colabs and Hugging Face APIs.

Moreover, the pace of development and optimization in the fast-moving Stable Diffusion developer community means it may soon become easier to put oneself (or any celebrity absent from, or underrepresented in, LAION) into the Stable Diffusion world on a local consumer-grade video card.

(For more on Textual Inversion, check out our August feature on the future of “generic” video synthesis in Stable Diffusion.)

Full-Body Video Deepfakes with Stable Diffusion and EbSynth

To create the Stable Diffusion-based Jennifer Connelly and Henry Cavill full-body deepfakes shown at the start of this article, I took a short clip from a custom shoot with the performer and extracted the video into its constituent frames.
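Splitting a clip into frames is usually done with FFmpeg. The sketch below builds such an invocation; it assumes FFmpeg is installed, and the output naming pattern is just a common convention (numbered PNGs) rather than anything the article prescribes.

```python
def ffmpeg_extract_cmd(video_path, out_dir, fps=None):
    """Build an ffmpeg command that splits a clip into numbered PNG frames,
    ready for per-frame Img2Img processing. If fps is given, the clip is
    resampled to that frame rate via ffmpeg's fps filter first."""
    cmd = ["ffmpeg", "-i", video_path]
    if fps is not None:
        cmd += ["-vf", f"fps={fps}"]
    cmd.append(f"{out_dir}/frame_%05d.png")
    return cmd

# e.g. run with: subprocess.run(ffmpeg_extract_cmd("clip.mp4", "frames"), check=True)
```

After processing each frame through Img2Img, the reverse operation (`ffmpeg -i frame_%05d.png out.mp4`, roughly) reassembles the sequence into video.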

As you can see from the clip, both transformations use the same short snippet.

Then I ran tests on some of the original source frames, and eventually found a combination of settings (in this example, for Jennifer Connelly) that seemed to yield a good, more-or-less faithful transposition from the real-world model to the target personality.

We’ve already seen that Stable Diffusion interprets an Img2Img text prompt somewhat “randomly”. In fact, to generate novel and diverse results, the system filters the text prompt for each individual image through a random seed – a single, unique route into Stable Diffusion’s latent space, represented by a number. Without this feature, it would be hard to explore the software’s potential, or to vary the results for a given prompt.

All distributions of Stable Diffusion allow you to “freeze” a seed that turns out to work well – an ability that’s absolutely necessary for any hope of temporal coherence when processing a continuous sequence of images, as here.
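The mechanism is simple determinism: the seed pins down the initial latent noise, so the same prompt and settings reproduce the same image. The sketch below uses Python’s stdlib `random` as a stand-in for the much larger latent noise tensor a real pipeline would draw (typically via a seeded `torch.Generator`).

```python
import random

def latent_noise(seed, n=4):
    """A fixed seed pins down the initial latent noise, so identical settings
    reproduce identical output - the precondition for temporal coherence
    across a frame sequence. (stdlib random stands in for the PRNG that
    fills a real latent tensor.)"""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```

Re-running with the same seed yields the identical noise vector, while any other seed yields a different one – which is exactly why “freezing” a good seed keeps a rendered character from mutating between frames.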

But if the subject moves around a lot in the video, a seed that performs well on a single frame is unlikely to perform equally well across the whole clip:



The seed that produced the first image transformation proved very effective, and was chosen as the “fixed seed” for the entire sequence. But it doesn’t work well for the second image, which is also part of the video sequence. Here, the quality difference is exaggerated for illustration – though it can be worse, depending on how much the performer moves in the clip.

As I write, a new Stable Diffusion script has emerged that ostensibly “morphs” between the two best seeds across a render sequence. While such a solution won’t solve every “seed travel” problem, it could allow the performer to move more in a Stable Diffusion/EbSynth transformation – currently, most “realistic” examples of SD/EbSynth video are characterized by very limited character movement.

Back to our celebrity transformations: enter the aforementioned EbSynth – an innovative, obscure, poorly documented non-AI application originally designed to apply painting styles to short video clips, but increasingly popular as a “tweening” tool for Stable Diffusion video output.

EbSynth in action.

To see the added smoothness that EbSynth can bring to a full-body Stable Diffusion deepfake, compare the original Jennifer Connelly transformation generated by Stable Diffusion on the left of the video below with the version on the right. EbSynth creates a smoother video by “warping” between a few carefully chosen keyframes, using only these (apparently the maximum allowed per clip is 24) to recreate the full video – though this can unavoidably shorten the clip’s running time:

On the left, the original Stable Diffusion output “sizzles”, because even with a fixed seed, temporal consistency is hard to achieve by simply stitching the raw output frames together. On the right, EbSynth achieves better temporal consistency, converting only 24 frames (out of the clip’s original 200) into a smoother reconstruction. To improve the quality of the faces in the final video on the right, a publicly shared autoencoder model was used – though better results could be obtained by upscaling the face and fully re-rendering it in Stable Diffusion (a process that is currently relatively time-consuming).
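With a hard cap on keyframes, the practical question becomes where to spend the budget. A simple baseline – sketched below, as an illustration rather than EbSynth’s actual algorithm – is to spread the keyframes evenly across the sequence; in practice (and as the article notes), extra keyframes should go wherever the motion is most abrupt.

```python
def pick_keyframes(total_frames, max_keyframes=24):
    """Spread a limited keyframe budget (EbSynth reportedly caps it at 24
    per clip) evenly across a sequence. Every in-between frame is then
    reconstructed by warping from its nearest keyframes."""
    if total_frames <= max_keyframes:
        return list(range(total_frames))
    step = (total_frames - 1) / (max_keyframes - 1)
    return [round(i * step) for i in range(max_keyframes)]
```

For the 200-frame clip described above, this yields 24 indices spanning frame 0 to frame 199, with roughly 8-9 frames of warping between each pair – which is why sudden movement inside one of those gaps degrades the reconstruction.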

As great as EbSynth is, it’s a frustrating tool, owing to many confusing interface quirks, the lack of cohesive or centralized documentation, a minimal and restrictive subreddit, and conflicting opinions about what the key settings in the app’s “advanced” section actually do – whether for style-transfer purposes or for this jury-rigged kind of use.

Also, the very small number of keyframes you’re allowed to set in EbSynth means that a) clips may need to be very short, and b) the people in them may need to avoid any abrupt movements, as every extra motion consumes precious keyframe allocations.

However, as a general workflow, the basic principles and capabilities of EbSynth could carry over to new software with a larger keyframe capacity, the ability to detect where extra keyframes should be assigned (in EbSynth you need to manage these very carefully), and more transparent tools for controlling interpolation settings.

Other Paths to Full-Body Deepfakes

Besides the winding and challenging road to efficient CogVideo-style neural text-to-video systems, and these highly limited “hacks” for temporally coherent Img2Img Stable Diffusion full-body deepfakes, there are of course other paths toward transforming identities beyond the face area.

I have covered most of these alternatives extensively in previous Metaphysic features, including each one’s ability and potential to generate full-body deepfakes. So I refer you to those features: on the future of autoencoder-based deepfakes; on Neural Radiance Fields (NeRF) as a possible eventual successor to autoencoders; and on the future of GAN deepfakes.

With that said, let’s briefly review these alternatives.

Neural Radiance Fields (NeRF)

No video synthesis technology deals more extensively with whole-body neural representations of people than Neural Radiance Fields. By training on images and videos, NeRF can recreate temporally accurate video, as well as “frozen”, explorable 3D representations of scenes and objects.

For example, Neural Human Performer can perform a kind of deepfake puppetry, though currently at very low resolution (a common limitation of most NeRF approaches):

As I mentioned in my previous NeRF article, there are countless other projects dealing with neural people directly in NeRF, including MirrorNeRF, A-NeRF, Animatable Neural Radiance Fields, Neural Actor, DFA-NeRF, Portrait NeRF, DD-NeRF, H-NeRF, and Surface-Aligned Neural Radiance Fields.

Another example of a NeRF-based deepfake puppet is NeRF-Editing, which uses a signed distance function/field (SDF) as an interpretation layer between a human performer (or, theoretically, priors from a CogVideo-style database) and the normally inaccessible parameters of a NeRF object – or possibly of a different identity:

Deepfake puppetry with NeRF-Editing. Source:

Some body-synthesis projects are beginning to integrate NeRF into broader, more complex workflows, such as texturing – including Disney Research’s Morphable Radiance Fields – or to use NeRF for swapping faces rather than rendering entire bodies. An example of the latter is RigNeRF, a NeRF-based face-swapping method that provides deepfakes very similar to DeepFaceLive’s puppetry, although it does not run live.

I could continue all day, as this is a fertile and well-funded strand of video synthesis research. The commercial and academic sectors are highly enthusiastic about developing neural humans with this technology, and NVIDIA’s recent forays into more efficient NeRF generation have reinvigorated interest.

Nonetheless, the challenges and inherent limitations of NeRF are considerable: Neural Radiance Fields are difficult to edit, and often expensive and time-consuming to train; and NeRF-based neural humans are characterized by limited resolution, which undermines the technology’s potential to create maximally realistic neural humans from real-world images and videos.

Still, as Stable Diffusion attests, and DALL-E 2 foreshadowed, giant leaps in image synthesis tend to take us by surprise – so NeRF may yet, at a stroke, improve on its currently struggling status as a viable method of simulating the full human body.

Autoencoders

Open source autoencoder-based repositories such as DeepFaceLab and FaceSwap (both derived from the controversial code that caused a sensation on Reddit in 2017) are what most of us picture when we hear the term “deepfakes” – a model is trained on thousands of images of a celebrity, and can then impose that learned face on the central facial region of another person, effectively changing their facial identity.


Autoencoder deepfake systems only swap faces, not bodies. Still, fans and developers occasionally speculate about designing an autoencoder system along the same lines that would use the kind of full-body motion capture software that can create deepfaked dancers, and thereby achieve full-body deepfakes.

However, even if such a system could be devised, it would face many of the same clothing problems that Stable Diffusion faces when generating temporal deepfake content – problems that make creating a usable training dataset practically impossible.

Unless, of course, clothing never enters the equation, and the hypothetical system is trained on nude images with the intention of making full-body deepfake porn.

But whose face would appear in those training pictures? In the case of a particular celebrity, virtually the entire content of the dataset would need to be synthesized, i.e. Photoshopped; and, after training, it would effectively double the difficulty of finding a “body match”. At best, porn deepfakers would now have to work with two different frameworks, doubling the initial effort compared to what is currently achievable, for relatively modest gains.

Also, truly distinctive physiques are relatively rare; and given the demands and relatively low standards of the deepfake porn community, such an effort would arguably be overkill.

Given these factors, and how unlikely it is that any substantial corporate entity would fund such an effort, there appears to be no obvious path for autoencoders in the production of full-body deepfakes – other than as a possible assistive technique, retaining a face-swapping focus within the broader context of full-body deepfakes produced by other methods (assuming those methods can’t handle the face at least as well themselves).

Generative Adversarial Networks (GANs)

The primary use of Generative Adversarial Networks in full-body deepfake initiatives lies in well-funded industry interest in fashion-based body and garment synthesis – especially in systems that can enable “virtual try-ons”, mainly for the women’s market.

While projects such as InsetGAN and StyleGAN-Human (see video below) are keen to develop commercial applications of this nature, the resulting renders are always static or near-static:

Though GANs have earned public acclaim and notoriety over the past five years, and are capable of the best image quality of any image synthesis system (including DALL-E 2 and Stable Diffusion), the architecture lacks any temporal framework or tooling that might suit it to the production of full-body deepfakes.

After years of almost futile efforts to realistically animate human faces in the latent space of a GAN through purely neural approaches, research strands – exemplified by Disney Research’s efforts – increasingly accept that GANs may end up serving merely as texture generators, driven by disparate, often CGI-based older technologies such as 3D Morphable Models (3DMMs).

If there is any real “race” to develop effective and versatile full-body deepfakes, Generative Adversarial Networks seem, at least for now, to be stuck at the starting line.

The author is a regular freelance writer for the Metaphysic blog, and not a Metaphysic employee. The original full-body deepfake examples in this feature are the author’s own experiments, and are entirely unconnected with Metaphysic’s work, techniques and output.

From a Zoom conversation conducted on August 10, 2022.



