Detail Versus Fidelity: CEO of AI Video Footage Enhancer Topaz Labs Strongly Advises “Against Forensic and Medical Uses”
Before (inset) and after the AI upscaling in DIG! XX.
Revolutionary! Magical! Game-changing! These are some of the words being used to describe the results of AI-generated upscaling tools for archival and low-resolution footage.
Some documentarians have turned to Video AI from Topaz Labs, a software featuring a suite of 30-plus AI models for video enhancement. To make Secret Mall Apartment (2024), Jeremy Workman upscaled 25 hours of low-quality footage (shot on Pentax Optio S4i cameras with 320x240 resolution at 8–10 fps) up to 4K. For The New York Times 2023 Op-Doc The Army We Had, Michael Tucker upscaled interlaced DV footage, while filmmakers like Robert Stone and Ondi Timoner are relying on Topaz’s Video AI to restore and upres their older films for screenings.
The practice holds both promise and peril, especially in a context of few industry guidelines. Neither Workman nor Tucker labeled the footage or disclosed the use of the tool in their films or credits. Meanwhile, the potential for misuse is only growing as the tools improve.
To learn more about Video AI and the future of AI-assisted video upscaling, Documentary interviewed CEO and co-founder Eric Yang on Zoom from the Topaz Labs offices in Texas, where the company was founded in 2008. This interview has been edited for length and clarity.
DOCUMENTARY: How is Topaz being used in documentary productions?
ERIC YANG: In upscaling, you have low-quality source footage, and you want it to be higher quality. You want to expand the number of pixels, and you want what is generated to be as realistic and as faithful to the original as possible. Technically speaking, this is not possible. You can’t actually go back to the source and fill in what was there. So you’re always generating pixels.
There is, however, a way to generate them that tries to maintain as much fidelity to the original as possible: it tries not to change any face details in the name of adding detail, and not to change any of the structure. You take a little bit of a sacrifice, where you don’t get quite as much detail. If you have high fidelity, you generally have less detail. You can also get a lot of detail, especially with some of the new diffusion models, but a lot of times that will sacrifice fidelity. It’s a spectrum.
We make tools for both ends of the spectrum, although historically we’re much stronger on the high-fidelity side. When it comes to the documentary use case, archival footage and things like that, we are always very insistent, or opinionated, that we apply the high-fidelity technology to it, instead of the more creative stuff.
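As a rough back-of-the-envelope illustration (ours, not Topaz’s), consider the 320x240 footage from Secret Mall Apartment mentioned above: upscaling it to 4K UHD means the software has to generate roughly 108 output pixels for every source pixel.

```python
# Illustrative arithmetic only: how much of a 4K frame must be generated
# when the source is 320x240, as with the Secret Mall Apartment footage.
src_w, src_h = 320, 240        # source resolution
dst_w, dst_h = 3840, 2160      # 4K UHD target

src_pixels = src_w * src_h                     # 76,800 pixels per frame
dst_pixels = dst_w * dst_h                     # 8,294,400 pixels per frame
scale_factor = dst_pixels / src_pixels         # 108x more pixels
generated_share = 1 - src_pixels / dst_pixels  # ~99.1% of the output is new

print(f"{scale_factor:.0f}x more pixels; "
      f"{generated_share:.1%} of each 4K frame has no direct source pixel")
```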
D: You’re saying there’s a high-fidelity version where you get less detail, where the image looks smoother or enhanced, less fuzzy. What’s actually going on there? What is the fidelity toward?
EY: It depends on the differences in the method that you use to generate pixels. Let’s say that I have my face right here, and we want to blow it up into a billboard-sized picture. We would need to generate a hundred times as many pixels as we have right now. There are a couple of ways that you can generate the pixels.
One would be high fidelity: for all the pixels that currently exist, we only use them to infer the other pixels around them, based on our understanding of “This is a face. It has skin texture. We should sort of make the texture look like skin.” That’s how you upscale it. That will generally give you pretty high fidelity, because it doesn’t actually touch any of the existing pixels.
It’s not going to be able to generate detail that looks as natural as the second method, which is when you give the model the ability to change the existing texture a little bit. It won’t do it that much, but it can kind of generate things a little bit to create a much more natural-looking and more detailed result. But it might not look exactly like me anymore, right? It won’t be true to the original.
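A minimal sketch of the two approaches Yang describes, assuming a Python/OpenCV environment. Classical bicubic interpolation stands in for the high-fidelity end of the spectrum; the commented-out generative_upscale call is a hypothetical placeholder for a diffusion-style enhancer, not an actual Topaz API.

```python
import cv2

frame = cv2.imread("archival_frame.png")  # a low-resolution source frame
h, w = frame.shape[:2]
scale = 4

# Method 1 (high fidelity): only interpolate between the samples that already
# exist. Bicubic interpolation never invents texture, so faces stay true to
# the original, but the result looks soft rather than detailed.
faithful = cv2.resize(frame, (w * scale, h * scale), interpolation=cv2.INTER_CUBIC)
cv2.imwrite("faithful_4x.png", faithful)

# Method 2 (more generative): a learned model is allowed to alter existing
# texture while hallucinating plausible detail. `generative_upscale` is a
# hypothetical placeholder for such a model, not a real Topaz function.
# detailed = generative_upscale(frame, scale=4)  # sharper, but may drift from the original
```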
D: What is the model training on?
EY: Mostly, it is an understanding based on learning from before-and-after images and videos. We have a proprietary data set of images and videos that we downscale, and we then train the model to infer the detail.
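The standard recipe behind that description, in rough outline: take high-resolution material, downscale it to create before/after pairs, and train a network to recover the detail. The sketch below is a generic illustration of that idea, not Topaz’s actual models or pipeline; dataloader_of_hr_frames is a hypothetical loader of high-resolution frame batches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySR(nn.Module):
    """Toy super-resolution network: predicts a detail residual on top of bicubic."""
    def __init__(self, scale=4):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, lr):
        up = F.interpolate(lr, scale_factor=self.scale, mode="bicubic", align_corners=False)
        return up + self.body(up)

model = TinySR()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for hr in dataloader_of_hr_frames:  # hypothetical loader of (N, C, H, W) high-res batches
    # Downscale the "after" to synthesize the "before"...
    lr = F.interpolate(hr, scale_factor=0.25, mode="bicubic", align_corners=False)
    # ...then train the network to infer the missing detail.
    loss = F.l1_loss(model(lr), hr)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```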
D: Can you tell us a little bit about that proprietary set? What is the volume, source, or IP of these images?
EY: I can’t really go too much into the data set, because it actually contains some of the secret sauce of how we train our models to behave the way they do. We are okay to use it. The data provenance is fine.
What I can say is that, in general, the different kinds of models use pretty different approaches. The high-fidelity model generally learns the small textures of the image, so it won’t have a semantic understanding of “This is a face” or “That’s a window,” or something like that. It just looks at the texture around it and infers.
D: It works section by section rather than analyzing the whole thing.
EY: Exactly. Whereas the very creative models will be like, “This is a man, probably an Asian man,” right? Whenever you upscale or restore anything, it’s all generated pixels. What’s important is that certain parts of those generated pixels are true to the original. What we found through talking with people is that it actually doesn’t have to be every pixel. People mostly care that people don’t get changed, that faces don’t get changed. In general, what we find is that if there is some foliage or some trees, a lot of times it’s okay to be a little bit more generative there. Even if those parts don’t specifically match the original scene, it’s still sort of okay, as long as we preserve the subjects and the meaning of the original footage. There definitely is some nuance there.
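One way to picture a purely local, texture-based model is tile-by-tile processing: each patch is enhanced from its own pixels, with no notion of what object it belongs to. A minimal sketch, where upscale_tile is a hypothetical per-patch enhancer, not any specific Topaz model:

```python
import numpy as np

def upscale_by_tiles(frame: np.ndarray, upscale_tile, tile: int = 64) -> np.ndarray:
    """Enhance a frame patch by patch, with no semantic understanding of the scene.

    `upscale_tile` is a hypothetical per-patch enhancer; it must enlarge every
    patch by the same fixed factor so the tiles reassemble cleanly.
    """
    h, w = frame.shape[:2]
    rows = []
    for y in range(0, h, tile):
        row = [upscale_tile(frame[y:y + tile, x:x + tile]) for x in range(0, w, tile)]
        rows.append(np.hstack(row))
    return np.vstack(rows)
```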
D: Can your tool do substantive infilling?
EY: It’s all technically possible right now.
D: A bad actor could take that off the shelf and use it.
EY: Absolutely. If somebody wanted to run documentary footage through one of the more creative tools meant for AI filmmakers, then, yeah, we wouldn’t be able to detect it and stop it.
D: There is no tracing material produced by your tool once it’s out in the wild.
EY: There is no trace. But I wouldn’t say we’re at the point where you could just create an HD restoration of archival footage that is indistinguishable and looks very natural. There’s actually a technical barrier right now: it kind of looks like AI. As an industry, we have to figure this out in the future, because I don’t think that’s going to be the case forever. Probably in another year there will be tools that can make it indistinguishable from reality. But we are not quite there yet.
D: It’s coming fast.
EY: Currently, when quality is really important and you don’t want any artifacts, it’s still a fairly manual process. The AI tools will get you there for certain clips and certain parts of the footage. But right now, you still have to selectively apply them. You have to cut them in and out, and you have to apply a lot of human judgment. A lot of humans are in the loop right now, acting as the creative director. Now, as technology changes, this might also change.
D: Can you talk about your company’s position on disclosure, labeling, or a mention in the credits? What responsibilities do you think you have, as a toolmaker, for letting audiences know about these new ways of dealing with footage?
EY: For Topaz in general, one of our philosophies is to build tools that help other people be superheroes. Currently, for the more generative use cases, it’s usually possible to tell by looking at the footage that it has been generated, and we haven’t seen any labeling or watermarking standards that are widely accepted right now. We just haven’t looked at that problem.
Soon, we are going to live in a world where you are just not going to be able to tell. From a social side, it’ll be very important to create safeguards. An analogy I use a lot is when Photoshop came out, [there was concern that you] can’t really trust images. But you can actually still trust images, provided you have good context about those images, such as the reputation of the person asking you to believe what they say. I think, in the long run, as a society, we’ll evolve those safeguards for [AI-generated video], although I feel like we should have some technological ones [such as digital watermarks] as well.
D: A buyer-beware approach seems dangerous. Things get decoupled from their context too often and too quickly. It’s also about the public record. Are you having more discussions with the community about basic standards and procedures in journalism and documentary?
EY: That definitely makes sense. We include “used video AI” in the output metadata, but obviously that’s super easy to remove if you want to, and platforms remove it all the time.
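For anyone curious what their own output carries, container-level tags can be dumped with ffprobe. The exact tag name Topaz writes isn’t specified in this conversation, so the sketch below simply prints whatever tags a file happens to have, assuming FFmpeg is installed and the file name is hypothetical.

```python
import json
import subprocess

def container_tags(path: str) -> dict:
    """Return the container-level metadata tags of a video file via ffprobe."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout).get("format", {}).get("tags", {})

print(container_tags("upscaled_clip.mp4"))  # hypothetical output file
```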
The difficult part for us is that creating a tamper-proof trail of evidence is actually not super easy. It’s not really our specialty; there are other people who are a lot better at it than we are. Because of the rate of development in AI, it’s really important to focus on what we do specialize in. Long story short, it’s not due to a lack of desire to work on this. My wish is that somebody else could do it, and then we can integrate it.
D: There are also other standards, such as disclosure. There are multiple ways of disclosing that footage has been enhanced, as simple as the narrator saying something in the film or a mention in the credits, and you could make that a standard that you request.
EY: We don’t have anything specific about that for documentary, but we do specifically strongly advise against any forensic or medical use cases. We did have a court case [in which a 2021 shooting was captured on a bystander’s smartphone and the unaltered 10-second-long source video of the shooting was entered into evidence in a Washington courtroom]. It was a murder trial where somebody on the defense team used our software to enhance [the cellphone footage] to prove it wasn’t the person. We said, “You cannot use this,” and eventually they didn’t.
D: What do you see coming down that you can tell us about for documentary, for news, for archives, for the public record? What’s your research team working on?
EY: One of the biggest problems that we’re trying to solve right now is the absence of information in really low-resolution or degraded footage. We’re planning to create a way, especially for people, to get around that trade-off: you give the model reference material of the person in the archival footage, a little bit more information, and it can actually make a really high-fidelity and fairly detailed version of that person. Across all of our customer use cases, not just documentary ones, humans seem to have a really deep affinity for making sure people look the same. It’s a kind of core human trait. You don’t want to change the identity of people, ever. And it’s a pretty hard problem.
D: Do you have a safety committee or an ethics review, or any kind of guardrail within the company itself?
EY: We are about 50 people right now, so we don’t have any committees. Most of our effort aligns with what our customers are saying and what they’re telling us. People are pretty ethical; they want to make sure that this stuff is thought about and handled. That’s the biggest motivator that we have toward it, even though we don’t have a committee on it.
D: Is there anything else that you’d like to say to the documentary community about the potential of upscaling?
EY: The rate of technology improvement, at least from what we’ve been seeing internally, is extraordinarily high right now, and I do feel pretty strongly that a couple of years from now, somebody is going to make a model that can just simulate anything very photorealistically. In a world like that, I think we have to decide what kind of context and societal safeguards we want to put around this stuff. Because there’s no putting the cat back in the bag after that happens, and I would hope that we can work together, both from the tech side and the usage side, to figure out something while this is being built, rather than after it happens. It’s blazing right now.
Katerina Cizek is a Peabody- and Emmy-winning documentarian, author, producer, and researcher working with collective processes and emergent technologies. She is a research scientist and co-founder of the Co-Creation Studio at MIT Open Documentary Lab. She is lead author (with Uricchio et al.) of Collective Wisdom: Co-Creating Media for Equity and Justice, published by MIT Press in 2022.