ECCV 2026

Anti-Prompt: Image Protection against Text-Guided Image-to-Video Generation

Yeonghwan Song¹, Chanhui Lee², Jinsoo Park², Jeany Son²
¹GIST ²POSTECH

Abstract

Recent advances in Image-to-Video (I2V) generation allow a single image to be animated into a convincing video under text guidance, raising serious copyright and privacy risks. We propose Anti-Prompt, an image protection approach that injects imperceptible perturbations into an image, inducing visible inconsistencies and structural failures in text-guided I2V generation. Our method is motivated by a simple empirical observation: when text guidance is removed from modern I2V models, generation quality degrades markedly, not only in motion realism but also in subject preservation, structural coherence, and temporal consistency.

Building on this insight, Anti-Prompt attenuates text-conditioned interactions during denoising while strengthening visual-only pathways. To evaluate protection behavior, we also introduce a Video-LLM protocol that scores subject preservation, structural consistency, dynamic consistency, and artifact suppression using frame-grounded observations.

Method Overview

Anti-Prompt is built on a simple but strong observation: modern I2V systems rely heavily on text guidance not only for semantics, but also for preserving subject identity, spatial layout, and temporal coherence during denoising.

We first analyze prompt-free and weakly conditioned I2V generation and find that removing or attenuating text guidance does not merely reduce motion expressiveness. It often causes broader failures, including subject drift, malformed geometry, unstable scene structure, and visually obvious artifacts. This suggests that text conditioning acts as a stabilizing signal throughout the generation process.

Anti-Prompt exploits this dependency by crafting imperceptible image perturbations that suppress text-conditioned interactions while comparatively strengthening visual-only pathways. As a result, the protected image remains visually natural to human viewers, but becomes significantly harder to animate into a clean, convincing video when paired with a text prompt.
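As a rough illustration of the idea, the sketch below runs a signed-gradient, projected update on a toy objective. It is not the paper's objective or optimizer, which this page does not specify. The attention function, the key vectors, the L-infinity budget `eps`, the step size `alpha`, and every function name are assumptions made for the example. The toy loss is the share of softmax attention that lands on text tokens; because the weights sum to one, lowering it raises the share given to image tokens.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def text_attention_mass(pixels, text_keys, image_keys):
    """Toy stand-in: share of attention a query built from `pixels` gives to text tokens."""
    keys = text_keys + image_keys
    logits = [sum(p * k for p, k in zip(pixels, key)) for key in keys]
    weights = softmax(logits)
    return sum(weights[:len(text_keys)])

def protect(pixels, text_keys, image_keys, eps=8 / 255, alpha=2 / 255, steps=10, h=1e-5):
    """Signed-gradient descent on the toy loss, projected onto an L-inf ball of radius eps."""
    delta = [0.0] * len(pixels)
    for _ in range(steps):
        current = [p + d for p, d in zip(pixels, delta)]
        for i in range(len(pixels)):
            # Central finite difference; a real attack would backpropagate through the I2V model.
            up, down = current[:], current[:]
            up[i] += h
            down[i] -= h
            grad = (text_attention_mass(up, text_keys, image_keys)
                    - text_attention_mass(down, text_keys, image_keys)) / (2 * h)
            step = -alpha * ((grad > 0) - (grad < 0))
            # Project onto the budget, then keep the perturbed pixel inside [0, 1].
            bounded = max(-eps, min(eps, delta[i] + step))
            delta[i] = max(-pixels[i], min(1.0 - pixels[i], bounded))
    return [p + d for p, d in zip(pixels, delta)]

# Hypothetical 4-value "image" and hand-picked key vectors, for illustration only.
pixels = [0.5, 0.2, 0.8, 0.4]
text_keys = [[1.0, 0.5, -0.3, 0.8], [0.2, 1.1, 0.4, -0.5]]
image_keys = [[-0.7, 0.3, 0.9, 0.1], [0.4, -0.6, 0.2, 1.0]]

before = text_attention_mass(pixels, text_keys, image_keys)
protected = protect(pixels, text_keys, image_keys)
after = text_attention_mass(protected, text_keys, image_keys)
print(f"text attention mass: {before:.4f} -> {after:.4f}")
```

The projection step is what keeps the change small: no value moves by more than `eps` from the original, which is the usual way such methods enforce imperceptibility.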

What The Paper Shows

The paper studies both white-box and black-box transfer settings, and also evaluates robustness under common image purification operations.

Across CogVideoX and LTX-Video, Anti-Prompt consistently pushes generated videos toward visible degradation while preserving the appearance of the shared image itself. We compare against I2VGuard and show stronger disruption in direct attack settings, under cross-model transfer, and after purification operations such as crop-and-resize, JPEG compression, and ADVClean.
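To make the purification setting concrete, here is a minimal sketch of a crop-and-resize operation on a grayscale image stored as a nested list. The margin and the nearest-neighbor resampling are assumptions for illustration. The page does not state the crop ratio or interpolation used in the paper, and a practical pipeline would more likely use an image library with bilinear or bicubic resizing.

```python
def crop_and_resize(image, margin):
    """Crop `margin` pixels from every border, then resize back to the original
    size with nearest-neighbor sampling."""
    height, width = len(image), len(image[0])
    if margin < 0 or 2 * margin >= min(height, width):
        raise ValueError("margin must leave at least one pixel in each dimension")
    cropped = [row[margin:width - margin] for row in image[margin:height - margin]]
    crop_h, crop_w = len(cropped), len(cropped[0])
    return [
        [cropped[y * crop_h // height][x * crop_w // width] for x in range(width)]
        for y in range(height)
    ]

# Hypothetical 4x4 image with pixel values 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
purified = crop_and_resize(image, margin=1)
```

Resampling of this kind shifts and blends pixel values, which is why it can weaken a perturbation tuned to exact pixel positions; a protection is robust to it if the purified image still disrupts generation.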

The qualitative examples below are organized to show exactly this behavior: the clean image still produces a plausible video, I2VGuard often leaves more usable generation quality, and our protected image more frequently leads to broken structure, unstable dynamics, or content that no longer supports faithful reuse.

CogVideoX Qualitative Results

[Video viewer: CogVideoX results for white-box, black-box, and purification settings.]

LTX-Video Qualitative Results

[Video carousel: matched three-way comparisons for LTX-Video.]

Video-LLM Evaluator Analysis

These examples motivate our Video-LLM protocol. In several cases, aggregate benchmark signals suggest that a video is still acceptable, while frame-level inspection reveals subject drift, structural collapse, temporal incoherence, or visible artifacts that make the generated result unsuitable for convincing reuse.

By scoring Subject Preservation, Structural Consistency, Dynamic Consistency, and Artifact Suppression with explicit visual evidence, our protocol captures protection failures that are easy to miss when only overall quality metrics are reported.
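One way to picture the output of such a protocol is a per-video report that pairs each criterion's score with the frame-level observations supporting it. The sketch below is a hypothetical data structure, not the paper's format: the 1 to 5 scale, the field names, and the `summarize` helper are assumptions, and the example observations are invented for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean

CRITERIA = (
    "subject_preservation",
    "structural_consistency",
    "dynamic_consistency",
    "artifact_suppression",
)

@dataclass
class CriterionScore:
    score: int  # assumed scale: 1 (worst) to 5 (best)
    evidence: list = field(default_factory=list)  # frame-grounded observations

def summarize(report):
    """Validate a four-criterion report and return per-criterion scores,
    their mean, and the weakest criterion."""
    for name in CRITERIA:
        if name not in report:
            raise ValueError(f"missing criterion: {name}")
        entry = report[name]
        if not 1 <= entry.score <= 5:
            raise ValueError(f"score out of range for {name}")
        if not entry.evidence:
            raise ValueError(f"no frame-level evidence for {name}")
    scores = {name: report[name].score for name in CRITERIA}
    return {
        "scores": scores,
        "mean": mean(list(scores.values())),
        "weakest": min(CRITERIA, key=lambda name: scores[name]),
    }

# Invented example report for a single generated video.
report = {
    "subject_preservation": CriterionScore(4, ["frames 0-12: face shape stable"]),
    "structural_consistency": CriterionScore(2, ["frame 20: left arm merges with torso"]),
    "dynamic_consistency": CriterionScore(3, ["frames 14-18: motion stutters"]),
    "artifact_suppression": CriterionScore(1, ["frames 22-30: heavy texture smearing"]),
}
summary = summarize(report)
```

Requiring evidence for every score mirrors the point made above: a single overall number can hide a failure that a specific frame makes obvious.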

BibTeX

@inproceedings{song2026antiprompt,
  title     = {Anti-Prompt: Image Protection against Text-Guided Image-to-Video Generation},
  author    = {Song, Yeonghwan and Lee, Chanhui and Park, Jinsoo and Son, Jeany},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}