You still need to find someone who looks at least a little like the person you want to imitate, though. The more they resemble the ‘target’ identity, the more convincing the illusion, since autoencoder-based deepfake systems only swap out the central region of the face.
It can therefore be difficult to find the right person to act as a ‘canvas’ for a deepfake personality. Even if peripheral features such as hair, ears, neck height, basic skin tone, physique and age are ‘close enough’, or can be modified, the chances of finding someone whose full face and full body ‘match’ the target are vanishingly small.
If, instead, you could recreate an entire person from scratch, completely without expensive and complex professional CGI techniques, that problem would disappear. We will look at the current possibilities shortly; first, though, we should consider a few reasons why a truly effective, 100% neural, Stable Diffusion-style text-to-video system for full-body deepfakes could be years or even decades away, not months, as many avid Stable Diffusion fans currently seem to believe.
The (Slow) Future of 100% Neural Full-Body Deepfakes
While the video clip above*, despite all its rough edges, is just a cheap experiment with open source software, this kind of tangible ‘deepfake puppetry’ is likely to provide the earliest consumer-grade full-body deepfake avatars in the metaverse, as well as in other potential virtual environments and contexts of the future (where participants’ body movements and overall appearance will eventually be changeable in real time).
This is because it is easy for humans (like the performer powering ‘Jennifer Connelly’ and ‘Henry Cavill’ in the clip above) to string a sequence of concepts or instructions together into a sequence of actions; and face/body capture AI systems are now advanced enough to ‘map’ human movement in real time, so that image or video synthesis systems such as DeepFaceLive can ‘overwrite’ the original identity at extremely low latency.
But if you want to describe a scene purely in text, the generating system needs prior knowledge of a daunting range of domains. These include anatomy, psychology, basic anthropology, probability, gravity, kinematics, inverse kinematics and physics, to name a few. To make matters worse, the system will require much of that knowledge up front. That is a great deal of ‘thinking’ to do before the hypothetical text-to-video system starts rendering a single frame.
The Business Logistics of Text-to-Video Investment
Sharing a bare open source architecture on GitHub is one thing; releasing fully trained models that cost millions to create, as Stability.ai did with Stable Diffusion, is another. From a market-share and general business-logic standpoint, it is hard to say whether such a generous event could happen again, and it may depend on the extent to which the steady proliferation of open source ultimately undermines OpenAI’s investment in DALL-E 2, and/or on whether Stability.ai can turn its approach into better financial results than OpenAI has achieved by putting its own remarkable product behind a commercial API.
In any case, the earliest such text-to-video system to make notable progress is the CogVideo architecture, based on a 9-billion-parameter transformer, which we covered in a recent article on the future of Stable Diffusion video, and which was released in May 2022.
While CogVideo is the premier text-to-video offering currently available (and, to my knowledge, the only method that ‘invents’ fully neural, free-roaming animated humans from text alone, without any CGI involvement), its authors observe that it and similar systems are limited by the cost and logistics of assembling and training on a suitable motion-based dataset – just one factor suggesting that Stable Diffusion fans may need to adjust their current expectations for hyperscale text-to-video a little.
As pointed out elsewhere, the largest current multilingual video-description dataset (video clips must be annotated with text descriptions to provide semantics – the task that OpenAI’s CLIP performs in Stable Diffusion’s architecture) is VATEX, which contains only 41,250 videos, supported by 825,000 captions.
In effect, this means that a task at least ten times as difficult as matching Stable Diffusion’s generative capabilities currently has far less than a tenth of the necessary data.
To address this, CogVideo adapted CogView2, a Chinese static generative-art transformer, to the text-to-video task; the resulting CogVideo dataset contains 5.4 million text/video pairs – still arguably insufficient data for such an arduous undertaking.
However, if I am a little sceptical about the enormity of the challenge of creating a really good text-to-video framework without FAANG-level resources (which inevitably bring commercialization and, eventually, product gatekeeping), my pessimism is not shared by Wenyi Hong, one of CogVideo’s contributing authors, with whom I recently had a chance to speak.
“I think it’s not as expensive as you might think,” Hong told me.
While she concedes that a temporal video-synthesis system comparable in generative power to Stable Diffusion or DALL-E 2 might be five to ten times more expensive to develop and train, she suggests that the earliest viral video-synthesis clips could be short, and would require far less extravagant resources.
Since Hong and her colleagues are developing CogVideo integrations for social media platforms, the earliest and most widespread CogVideo output seems likely to take the form of short videos lasting only a few seconds, which users can share, and which would not require hyperscale resources from the outset.
“You can use a much smaller dataset than LAION to train a video model like CogVideo, which typically produces videos lasting a few seconds,” she said. “However, if we want to generate more complex videos, we need larger datasets.” As in the entire field of machine learning research, the logistics and availability of annotated video, and of GPU memory (VRAM), represent core challenges:
“Most captions will only describe one action in a video, such as ‘a person holding a cup’. But if the video is long, maybe a minute, the person won’t hold the object for that long. Maybe they’ll let it go, or they’ll start doing something else. The need to accommodate this level of sophistication will make the entire training process difficult.”
If, as it seems, text-to-video is something we desperately want, then such a system may be eagerly adopted by more and more image and video synthesis enthusiasts.
When I asked Wenyi how far we are from a neural system that can effectively parse a script or a book into a movie, she replied: “Well, maybe ten to twenty years.”
However, because Stable Diffusion is so powerful that anyone can now easily create stunning images, and because it has gripped a public imagination previously offered only a few cautionary glimpses of OpenAI’s earlier and more ‘locked-down’ DALL-E 2, the growing public expectation of text-prompted, photoreal video that is open to everyone and runs for more than a few seconds may be a decade or two ahead of reality.
Indeed Runway, the AI VFX company that participated in the development of Stable Diffusion, is currently teasing a similar, prompt-driven video-creation system due for release soon.
Runway’s trailer for its text-to-video system previews some impressive features, but the only humans in evidence seem to come from real source footage, and it remains to be seen to what extent neural humans may or may not feature in the system. Source: https://twitter.com/runwayml/status/1568220303808991232

What is missing from the Runway trailer (and from any mature, currently available product) is realistic, fully neural human motion.
While Stable Diffusion can generate static figures and humanoids very convincingly, even photorealistically, most of the videos emerging from the frenetic efforts of the SD community are either stylized (i.e. cartoon-like, usually derived from noise-based pipelines in Stable Diffusion), ‘psychedelic’ (usually made with Stable WarpFusion or Deforum), or display very limited motion (usually achieved with EbSynth, which we will come to later).
As we will see, using EbSynth to animate Stable Diffusion output can produce more realistic imagery; however, both Stable Diffusion and EbSynth have implicit limitations that constrain how much any realistic human (or humanoid) can actually move, which makes it easy to dismiss such mock-ups as a limited category.
Many of these systems rely on interpreting existing, real-world human motion and using that motion information to drive the transformation, rather than drawing on a distilled database of knowledge about human movement, as CogVideo does.
For example, for the aforementioned Connelly/Cavill full-body deepfakes, I used Stable Diffusion’s Img2Img functionality to convert footage of myself, as the performer, into the two characters. With Img2Img, you provide Stable Diffusion with a source image (anything from a rough sketch to an ordinary photo), along with a text prompt that suggests how the system should change the image (e.g. ‘Jennifer Connelly in the 1990s’, or an equivalent prompt for the second identity).
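As a rough illustration of that workflow (not the exact pipeline used for the clips in this article), an Img2Img call via the Hugging Face diffusers library might look like the sketch below; the model ID, file names and settings are placeholder assumptions:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Hypothetical sketch of a single Img2Img transformation with diffusers.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # placeholder model ID
    torch_dtype=torch.float16,
).to("cuda")

# A frame of the performer, used as the source image.
source = Image.open("performer_frame.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="Jennifer Connelly in the 1990s",
    image=source,
    strength=0.6,        # denoising strength: how far to depart from the source
    guidance_scale=9.0,  # CFG: how strictly to follow the prompt
    generator=torch.Generator("cuda").manual_seed(1234),
).images[0]
result.save("connelly_frame.png")
```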

As with autoencoder-based deepfakes (i.e. the kind that have been used to make viral deepfake videos for the past five years), the machine learning system is more likely to achieve a convincing transformation when source and target images have more in common – for example, in the image above, Henry Cavill has his hands in his pockets, which does not exactly reflect the source pose.
In contrast, as the image below shows, Stable Diffusion can convert the source image of a woman doing yoga into a more accurate approximation of the pose as ‘Jennifer Connelly’:


The two defining forces in a Stable Diffusion transformation are the CFG scale and the denoising strength.


CFG stands for Classifier-Free Guidance. The higher you set this scale, the more closely the system will follow the instructions in the prompt, even though this may cause artifacts and other visual anomalies.
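Under the hood, classifier-free guidance blends an unconditional and a prompt-conditioned noise prediction at each denoising step. The sketch below is a schematic of that combination step, not Stable Diffusion’s actual source code:

```python
import torch

def cfg_combine(noise_uncond: torch.Tensor,
                noise_text: torch.Tensor,
                guidance_scale: float) -> torch.Tensor:
    # Schematic classifier-free guidance: push the prediction away from the
    # unconditional estimate, toward the prompt-conditioned one. Higher
    # guidance_scale values follow the prompt more aggressively.
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)
```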
In many cases, the LAION dataset on which the model was trained is so authoritative that even a brief additional Img2Img instruction can produce effective results without setting this value very high.
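When it does need tuning, the simplest approach is a brute-force sweep. The hypothetical snippet below, which re-uses the `pipe` and `source` objects from the earlier Img2Img sketch, renders the same frame across a small grid of CFG and denoising values:

```python
import itertools
import torch

# Hypothetical settings sweep to find a workable CFG / denoising combination;
# assumes `pipe` and `source` from the earlier Img2Img sketch.
cfg_values = [6.0, 9.0, 12.0, 15.0]
strength_values = [0.3, 0.5, 0.7, 0.9]

for cfg, strength in itertools.product(cfg_values, strength_values):
    out = pipe(
        prompt="Jennifer Connelly in the 1990s",
        image=source,
        strength=strength,
        guidance_scale=cfg,
        generator=torch.Generator("cuda").manual_seed(1234),
    ).images[0]
    out.save(f"sweep_cfg{cfg}_str{strength}.png")
```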
But if you try to make something happen that Stable Diffusion has no prior knowledge of, you have to push these settings much harder. For example, while Stable Diffusion can turn a slender woman into a generalized, muscular man such as Henry Cavill, it has extraordinary difficulty with apparently simpler requests, such as changing the color of the performer’s clothing.

However, I found that no setting, combination of plugins or other trick could reliably accomplish this notionally far smaller task. In the end, the CFG and denoising settings had to be pushed almost to maximum before Stable Diffusion would change the clothing color, and 90-95% of the pose fidelity, style and coherence of the transformation was lost in the process:
In general, much as traditional autoencoder deepfakes tend to impose the desired identity onto a host-like ‘canvas’, it is usually easier to start with source material that is at least closer to what you ultimately want to render (i.e. simply requiring your performer to wear a red dress in the first place).
While at least one supplementary script can use CLIP to identify, mask and change specific elements, such as an item of clothing, it is too inconsistent to be useful for generating temporally coherent video for full-body deepfakes.
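To make the general idea concrete, the sketch below shows one way such a mask-and-inpaint step could be wired up, using the CLIPSeg model available through Hugging Face transformers and a Stable Diffusion inpainting pipeline. This is not the specific script referred to above; the model IDs, file names, prompt and threshold are assumptions:

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
from diffusers import StableDiffusionInpaintPipeline

# 1) Use CLIPSeg to get a rough mask for the clothing item.
seg_processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
seg_model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

frame = Image.open("connelly_frame.png").convert("RGB").resize((512, 512))
inputs = seg_processor(text=["a dress"], images=[frame], return_tensors="pt")
with torch.no_grad():
    logits = seg_model(**inputs).logits  # low-resolution relevance map

mask = torch.sigmoid(logits).squeeze().numpy()
mask_img = Image.fromarray(((mask > 0.4) * 255).astype(np.uint8)).resize((512, 512))

# 2) Inpaint only the masked region with a Stable Diffusion inpainting model.
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

result = inpaint(prompt="a red dress", image=frame, mask_image=mask_img).images[0]
result.save("recolored_frame.png")
```

Even with a step like this, the mask and the inpainted texture tend to shift from frame to frame, which is exactly the inconsistency described above.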




Surprisingly, there is very little ‘famous clothing’ that has generalized so strongly into a LAION-based Stable Diffusion model that you can rely on it to appear consistently across a series of consecutively rendered frames.
Even Levi’s 501 jeans (of which there are countless examples in the LAION database, and which were voted among the most iconic items of clothing of all time in 2020) cannot be relied upon to render consistently across a Stable Diffusion Img2Img full-body deepfake sequence.
Jennifer Connelly’s face and body? Fine – LAION-trained Stable Diffusion has ingested nearly 40 years of Connelly photos: event pictures, paparazzi beach snaps, promotional stills, extracted video frames, and many other sources that allow the system to generalize the actress’s core identity, face and physique across a range of ages.
Helpfully, Connelly’s hairstyle has stayed relatively consistent over the years, which is not always the case for women (because of fashion and aging) or for men (because of fashion, aging and male-pattern baldness).
However, not least because of the sheer amount of material in the database, Connelly is wearing different clothes in almost all of her LAION photos:

So if you ask Stable Diffusion for ‘Jennifer Connelly’, there is no telling what clothing the system will choose to dress her in from one render to the next. Stable Diffusion was only open-sourced a little over a month ago, and this is among the many problems that have not yet been solved; in practice, even highly specific clothing prompts will not produce the same outfit consistently across a sequence of frames.

In this case, one solution for consistent clothing might be to use a Textual Inversion model – a small additional embedding file that encapsulates the appearance and semantic meaning of an object or entity, created through brief training on a limited number of annotated photos.
Textual inversions that users create can be placed ‘adjacent’ to the standard trained model at inference time, and can act almost as effectively as if the system had originally been trained on them.
In this way, in theory, a textual inversion could be created for Levi’s 501s (or for a specific hairstyle) whose appearance is consistent enough to support temporal video; and truly ‘stable’ models could likewise be created for more humble outfits.
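As a sketch of how such an embedding would be used once trained (assuming the load_textual_inversion helper in recent versions of the diffusers library; the embedding file and placeholder token below are hypothetical):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a hypothetical embedding trained on a handful of annotated photos of
# the garment, and bind it to a placeholder token for use in prompts.
pipe.load_textual_inversion("levi-501-embedding.bin", token="<levi-501>")

image = pipe(
    "Jennifer Connelly in the 1990s wearing <levi-501> jeans",
    guidance_scale=8.0,
    generator=torch.Generator("cuda").manual_seed(1234),
).images[0]
image.save("consistent_jeans_frame.png")
```

Because the same embedding is consulted on every frame, the garment stands a better chance of looking the same across a rendered sequence than a free-text description does.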
If this becomes an established solution, it could come to resemble the early heyday of the Renderosity marketplace, where users still trade add-on clothing and content for Poser and Daz 3D.
Ultimately, textual inversion may represent the only reasonable way to obtain temporal consistency of objects in Stable Diffusion, and to easily insert ‘unknown’ people into the system with the aim of creating full-body deepfakes via latent diffusion. Some Reddit users are currently putting themselves (and some of the more obscure public figures) into Stable Diffusion via this route:

While the creation process is currently hardware-intensive, users can also accomplish it through web-based Google Colabs and Hugging Face APIs.
In addition, the pace of development and optimization in the fast-moving Stable Diffusion developer community means that it may soon become easier to insert yourself (or any celebrity absent or under-represented in LAION) into the Stable Diffusion world on a local, consumer-grade video card.
(See our recent article for more on the future of ‘generic’ video synthesis in Stable Diffusion.)

Full Body Video Deepfake with Stable Diffusion and EbSynth
As you can see from the clip, both transformations use the same short snippet.
I then ran tests on some of the original source frames, and eventually found a combination of settings (in this example, for ‘Jennifer Connelly’) that seemed to yield good results.
We have already seen that Stable Diffusion gives a ‘random’ interpretation of an Img2Img text prompt. In fact, to generate novel and diverse results, the system filters the text prompt for each individual image through a random seed – a single, unique route into Stable Diffusion’s latent space, represented by a hash-like number. Without this feature, it would be hard to explore the software’s potential, or to make changes to a prompt’s outcome.
All distributions of Stable Diffusion let you ‘freeze’ a seed once you find one that actually works well – an ability that is absolutely necessary for any hope of temporal coherence when dealing with a continuous sequence of images, as in this case.
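In practice, ‘freezing’ a seed across a frame sequence can be as simple as re-creating the generator with the same seed for every frame. The following is a minimal, hypothetical sketch using the diffusers Img2Img pipeline, with placeholder paths, prompt and settings:

```python
import glob
from pathlib import Path

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

GOOD_SEED = 1234  # a seed found by testing on a few representative frames
Path("out").mkdir(exist_ok=True)

for i, path in enumerate(sorted(glob.glob("frames/*.png"))):
    frame = Image.open(path).convert("RGB").resize((512, 512))
    # Re-creating the generator each iteration pins every frame to the same
    # route into latent space.
    generator = torch.Generator("cuda").manual_seed(GOOD_SEED)
    out = pipe(
        prompt="Jennifer Connelly in the 1990s",
        image=frame,
        strength=0.5,
        guidance_scale=8.0,
        generator=generator,
    ).images[0]
    out.save(f"out/{i:05d}.png")
```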
But if the subject moves around a lot in the video, a seed that works well on a single frame is unlikely to hold up across the whole clip:

As I write, a new Stable Diffusion script has been developed that ostensibly ‘morphs’ between the two best seeds in a rendered sequence. While such a solution would not solve all the problems of ‘seed transfer’, it could allow the performer to move more in a Stable Diffusion/EbSynth transformation, since most ‘realistic’ examples of SD/EbSynth video clips are currently characterized by very limited character movement.
Back to our celebrity transformation: enter the aforementioned EbSynth – an innovative, obscure and poorly documented non-AI application originally designed to apply painting styles to short video clips, but increasingly popular as a ‘tweening’ tool for Stable Diffusion video output.
To see the added smoothness that EbSynth can provide for a full-body Stable Diffusion deepfake, compare the original Jennifer Connelly transformation generated by Stable Diffusion on the left of the video below with the version on the right. EbSynth creates a smoother video by ‘warping’ between a few carefully chosen keyframes and using only these (the maximum allowed per clip appears to be 24) to recreate the full video, but this unavoidably shortens the running time of the clip:
As great as EbSynth is, it is a frustrating tool to use, thanks to its many confusing interface quirks, the lack of cohesive or centralized documentation, a minimal and restrictive subreddit, and conflicting opinions about what the key settings in the ‘advanced’ section of the app actually do, whether for style-transfer purposes or for this jury-rigged kind of puppetry.
Also, the very small number of keyframes that EbSynth allows you to set means that a) clips may need to be very short, and b) the people in the clips may need to avoid any abrupt movements, since every extra movement consumes precious keyframe allocations.
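One way to manage that budget is simply to cap and spread the keyframes programmatically before bringing them into EbSynth. The sketch below is a hypothetical helper, assuming ffmpeg is installed; the clip name, folder names and 24-keyframe ceiling are placeholders:

```python
import glob
import math
import shutil
import subprocess
from pathlib import Path

Path("frames").mkdir(exist_ok=True)
Path("keys").mkdir(exist_ok=True)

# Extract every frame of the clip with ffmpeg.
subprocess.run(["ffmpeg", "-i", "clip.mp4", "frames/%05d.png"], check=True)

# Copy an evenly spaced subset, capped at EbSynth's apparent per-clip limit,
# to a separate folder; these are the frames to stylize with Stable Diffusion
# and hand to EbSynth as keyframes.
frames = sorted(glob.glob("frames/*.png"))
MAX_KEYS = 24
step = max(1, math.ceil(len(frames) / MAX_KEYS))

for path in frames[::step][:MAX_KEYS]:
    shutil.copy(path, Path("keys") / Path(path).name)
```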
However, as a broad workflow, EbSynth’s basic principles and capabilities could carry over into new software with a larger keyframe capacity, the ability to detect where extra keyframes should be assigned (in EbSynth you need to manage this very carefully yourself), and more transparent tools for controlling interpolation settings.
Besides the winding and challenging road toward effective CogVideo-style neural text-to-video systems, and these extremely limited ‘hacks’ for temporally coherent Img2Img Stable Diffusion full-body deepfakes, there are of course other paths toward transforming identities beyond the face area.
I have covered most of these alternatives extensively in previous Metaphysic features, including each one’s ability and potential to generate full-body deepfakes, so I refer you to those pieces: on the future of autoencoder-based deepfakes; on Neural Radiance Fields (NeRF) as a possible eventual successor to autoencoders; and on the future of GAN-based deepfakes.
With that said, let’s briefly review these alternatives.
No video-synthesis technique deals more extensively with whole-body neural representations of people than Neural Radiance Fields. By training images and videos into neural representations of scenes and objects, NeRF can recreate temporally accurate video as well as ‘frozen’, explorable 3D representations.
For example, Neural Human Performer can carry out a kind of deepfake puppetry, albeit currently at very low resolution (a common limitation of most NeRF approaches):
As I mentioned in my previous NeRF article, there are countless other projects dealing directly with neural people in NeRF, including MirrorNeRF, A-NeRF, Animatable Neural Radiance Fields, Neural Actor, DFA-NeRF, Portrait NeRF, DD-NeRF, H-NeRF, and Surface-Aligned Neural Radiance Fields.
Another example of NeRF-based deepfake puppetry is NeRF-Editing, which uses a signed distance function/field (SDF) as an interpretive layer between a human performer (or, theoretically, priors from a CogVideo-style database) and the normally inaccessible parameters of a NeRF object – possibly one of a different identity:
Some body-synthesis projects are beginning to integrate NeRF into broader, more complex workflows involving stages such as texturing, including Disney Research’s Morphable Radiance Fields, or are starting to use NeRF for swapping faces rather than rendering entire bodies. An example of the latter is RigNeRF, a NeRF-based face-swapping method that delivers deepfakes very similar to DeepFaceLive’s puppetry, although it is not yet publicly available.
I could go on all day, as this is a fertile and well-funded branch of video-synthesis research. The commercial and academic sectors are very enthusiastic about using this technology to develop neural humans, and NVIDIA’s recent forays into more efficient NeRF generation have reinvigorated interest in the area.
Nonetheless, the challenges and inherent limitations of NeRF are considerable: neural radiance fields are difficult to edit, often expensive and time-consuming to train, and NeRF-based neural humans are characterized by limited resolution, which often undermines the approach’s potential to create maximally realistic neural humans from real-world images and videos.
Even so, as Stable Diffusion attests and DALL-E 2 foreshadowed, giant leaps in image synthesis tend to take us by surprise, so NeRF might yet escape, in a single bound, its currently struggling status as a viable method for simulating the full human body.
Open-source autoencoder-based repositories such as DeepFaceLab and FaceSwap (both based on the controversial code that caused a sensation on Reddit in 2017) are what come to mind for most of us when we hear the term ‘deepfakes’: a model is trained on thousands of images of a celebrity, and can then impose that learned face onto the central facial region of another person, effectively changing their facial identity.
Autoencoder deepfake systems only swap faces, not bodies. Still, fans and developers occasionally speculate about designing an autoencoder system along the same lines that models entire bodies, using the kind of full-body motion-capture software that has already been used to create deepfaked dancers, and thereby achieving full-body deepfakes.
Unless, of course, clothing does not enter the equation at all, and the hypothetical system is to be trained on nude images with the intention of making full-body deepfake porn.
But whose face would appear in those training pictures? In the case of a particular celebrity, virtually the entire content of the dataset would need to be synthesized, i.e. Photoshopped; and, after training, it would effectively double the difficulty of finding a ‘body match’, a problem that porn deepfakers already have to contend with as it is.
Also, truly distinctive physiques are relatively rare, and given the demands and relatively low standards of the deepfake porn community, such an effort would arguably be ‘overkill’.
Given these factors, and how unlikely it is that any substantial corporate entity would fund such an effort, there appears to be no obvious path for autoencoders in the production of full-body deepfakes, other than as a possible assistive technique that keeps its focus on face-swapping within the broader context of full-body deepfakes produced by other methods (assuming those methods cannot already handle the face at least as well, if not better).
The primary use of Generative Adversarial Networks in full-body deepfake programs is driven by well-funded industry interest in applications such as virtual fashion models and clothing try-on.
While projects such as InsetGAN and StyleGAN-Human (see video below) are keen to develop commercial applications of this nature, the resulting renderings are always static or near-static:
After years of almost fruitless attempts to realistically animate human faces in the latent space of GANs through purely neural approaches, exemplified by the efforts of Disney Research, it is increasingly accepted that GANs may end up serving only as texture generators driven by disparate, often CGI-based, older techniques such as 3D Morphable Models (3DMMs).
If there is any real ‘race’ to develop effective and versatile full-body deepfakes, generative adversarial networks seem, at least for now, to be stuck at the starting line.
The author is a regular freelance contributor to the Metaphysic blog and is not a Metaphysic employee. The original full-body deepfake examples in this feature are the author’s own experiments and have nothing to do with Metaphysic’s work, techniques or output.