We first analyze prompt-free and weakly conditioned I2V generation and find that removing or attenuating text guidance does not merely reduce motion expressiveness. It often causes broader failures, including subject drift, malformed geometry, unstable scene structure, and visually obvious artifacts. This suggests that text conditioning acts as a stabilizing signal throughout the generation process.
Anti-Prompt exploits this dependency by crafting imperceptible image perturbations that suppress text-conditioned interactions while relatively strengthening visual-only pathways. As a result, the protected image remains visually natural to human viewers but becomes substantially harder to animate into a clean, convincing video when paired with a text prompt.
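To make the idea concrete, the following toy sketch shows one way a bounded perturbation of this kind could be computed with projected signed-gradient steps. It is an illustration under strong assumptions, not the Anti-Prompt objective itself: the two linear "pathways" (`w_text`, `w_vis`), the squared-response scores, the trade-off weight `lam`, and the function name `protect` are all hypothetical stand-ins for the text-conditioned and visual-only components of a real I2V model, and the L-infinity budget `eps` is just one possible way to model imperceptibility.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def text_score(x, w_text):
    """Hypothetical proxy for the strength of text-conditioned interaction."""
    return dot(w_text, x) ** 2


def visual_score(x, w_vis):
    """Hypothetical proxy for the strength of the visual-only pathway."""
    return dot(w_vis, x) ** 2


def protect(x, w_text, w_vis, eps=0.03, step=0.005, iters=50, lam=0.1):
    """Toy projected signed-gradient descent on
    loss = text_score - lam * visual_score,
    subject to |delta_i| <= eps and pixel values staying in [0, 1].
    All hyperparameters here are illustrative assumptions."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        t = dot(w_text, x_adv)
        v = dot(w_vis, x_adv)
        # Analytic gradient of the toy loss with respect to each pixel.
        grad = [2 * t * wt - 2 * lam * v * wv for wt, wv in zip(w_text, w_vis)]
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            d = delta[i] - step * sign
            d = max(-eps, min(eps, d))            # project onto the eps-ball
            d = max(-x[i], min(1.0 - x[i], d))    # keep the pixel in [0, 1]
            delta[i] = d
    return [xi + di for xi, di in zip(x, delta)]


# Tiny synthetic example: an 8-"pixel" image and two fixed linear pathways.
image = [0.5] * 8
w_text = [1, -1, 1, -1, 1, 1, 1, 1]
w_vis = [1, 1, 1, 1, -1, -1, 1, 1]
protected = protect(image, w_text, w_vis)
```

In this toy setting the perturbation lowers the text-pathway response while staying within the `eps` budget. A real attack would replace the two linear scores with quantities measured inside the video model (for example, responses of its text-conditioned and visual-only layers) and obtain gradients by automatic differentiation.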