ID-LoRA

Identity-Driven Audio-Video Personalization with In-Context LoRA

Generate video and audio of a specific person from a single text prompt, a reference image, and a short audio clip — all in one model. Now supporting LTX 2.3.

Aviad Dahan*   Moran Yanuka*   Noa Kraicer

Lior Wolf   Raja Giryes

*Equal contribution

73% preferred over Kling 2.6 Pro for voice similarity

81% preferred over ElevenLabs v3 + Wan2.2 for voice similarity

~3K training pairs, single GPU

Side-by-Side Comparisons

Same prompt and reference, different methods. Click any video to play — unmute to hear the difference.

User Study

A/B preference test on Amazon Mechanical Turk (hard split — cross-video reference-target pairs).

Preference chart: Ours vs. Kling 2.6 Pro

vs. Kling 2.6 Pro. ID-LoRA is preferred 73% for voice similarity, 65% for speaking style, and 55% for environment sounds.

Preference chart: Ours vs. ElevenLabs v3 + Wan2.2

vs. ElevenLabs v3 + Wan2.2. ID-LoRA is preferred 81% for voice similarity, 56% for speaking style, and 69% for environment sounds.

MOS plot

Mean Opinion Scores. ID-LoRA achieves an overall MOS of 3.05 vs. 2.90 for Kling 2.6 Pro, winning on 8 of 10 physical interaction scenarios.

Abstract

Existing video personalization methods preserve visual likeness but treat video and audio separately. Because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment; and although prompt-conditioned audio models could offer such control, they lack access to the visual scene and therefore cannot synchronize sounds with on-screen actions.

We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice within a single generative pass. Two challenges arise from this formulation. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, which place reference tokens in a disjoint region of the RoPE space while preserving their internal temporal structure. Furthermore, speaker characteristics tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal.

In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro, the leading commercial unified model with voice-personalization capabilities, by 73% of annotators for voice similarity and 65% for speaking style. Automatic metrics confirm these gains: in cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as reference and target conditions diverge. ID-LoRA achieves these results with only approximately 3K training pairs on a single GPU.

Method

ID-LoRA architecture diagram

Architecture overview. ID-LoRA adapts the LTX-2 dual-stream DiT via In-Context LoRA. A reference audio clip is encoded and concatenated with noisy target latents, while the video stream uses standard text-to-video generation with first-frame conditioning.

1. In-Context Audio Conditioning

Reference audio is encoded into latents and concatenated with noisy target audio along the sequence dimension. The model learns to extract speaker identity from the reference during denoising.
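
A minimal sketch of this conditioning in PyTorch (the tensor layout and the helper name build_audio_stream are illustrative assumptions, not the released ID-LoRA code):

import torch

def build_audio_stream(ref_latents: torch.Tensor,
                       noisy_target: torch.Tensor) -> torch.Tensor:
    """Concatenate clean reference latents with noisy target latents.

    ref_latents:  (B, T_ref, D) encoded reference audio clip
    noisy_target: (B, T_tgt, D) target latents at the current noise level
    """
    # Self-attention runs over the joint sequence, so target tokens can
    # read speaker identity off the clean reference tokens in-context.
    # The denoising loss is applied to the target positions only (an
    # assumption consistent with this in-context conditioning setup).
    return torch.cat([ref_latents, noisy_target], dim=1)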

2. Negative Temporal Positions

Reference tokens receive negative RoPE positions [-T, 0), while target tokens occupy [0, T). This cleanly separates reference from target in attention while preserving temporal structure.
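
A sketch of the position assignment, assuming one RoPE position per latent audio frame (the details of LTX-2's positional grid are simplified here):

import torch

def audio_position_ids(t_ref: int, t_tgt: int) -> torch.Tensor:
    """Temporal positions for the concatenated [reference | target] sequence.

    Reference tokens get negative positions [-t_ref, 0) and target tokens
    [0, t_tgt), so the two segments occupy disjoint regions of the RoPE
    space while each keeps its internal temporal ordering.
    """
    ref_pos = torch.arange(-t_ref, 0)   # -t_ref, ..., -2, -1
    tgt_pos = torch.arange(0, t_tgt)    # 0, 1, ..., t_tgt - 1
    return torch.cat([ref_pos, tgt_pos])

Because RoPE rotations are well defined for negative angles, no architectural change is needed beyond the position ids themselves.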

3. Identity Guidance

A CFG variant that amplifies speaker-specific features by extrapolating between predictions with and without the reference signal. Scale 4.0 yields a +9% speaker similarity improvement.
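
Identity guidance follows the usual classifier-free guidance recipe, contrasting reference-conditioned and reference-free predictions. A sketch of one denoising step (the model call signature and the ref=None dropout convention are assumptions for illustration):

def identity_guidance(model, x_t, t, text_emb, ref_emb, scale: float = 4.0):
    """Extrapolate toward the reference-conditioned prediction.

    With scale = 1.0 this reduces to the plain conditioned prediction;
    scale > 1 amplifies speaker-specific features carried by the reference.
    """
    pred_no_ref = model(x_t, t, text=text_emb, ref=None)   # reference dropped
    pred_ref = model(x_t, t, text=text_emb, ref=ref_emb)   # reference kept
    return pred_no_ref + scale * (pred_ref - pred_no_ref)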

Quantitative Results

Comparison on the hard split, one of three evaluation splits: cross-video reference-target pairs where reference and target environments differ.

Method                     Spk Sim ↑   Face Sim ↑   LSE-D ↓   LSE-C ↑   CLAP ↑   WER ↓
ID-LoRA (Ours)             0.477       0.874       8.49      3.90      0.363    0.113
Kling 2.6 Pro              0.385       0.854       9.49      3.47      0.316    0.121
CosyVoice 3.0 + Wan2.2     0.391       0.890       11.40     1.50      0.249    0.362
VoiceCraft + Wan2.2        0.344       0.892       10.60     1.33      0.258    0.427
ElevenLabs v3 + Wan2.2     0.357       0.894       11.86     1.72      0.238    0.154

24% speaker similarity improvement over Kling 2.6 Pro on the hard split, where reference and target environments diverge. The gap widens on harder cross-environment settings — demonstrating the advantage of unified generation over cascaded pipelines. ID-LoRA also leads in lip synchronization (LSE-D/C) and audio prompt adherence (CLAP).
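
For reference, speaker similarity is conventionally computed as the cosine similarity between speaker embeddings of the generated and reference audio; the sketch below assumes a generic pretrained speaker-verification encoder producing the embeddings (the paper's exact choice of encoder is not specified here):

import torch
import torch.nn.functional as F

def speaker_similarity(gen_emb: torch.Tensor, ref_emb: torch.Tensor) -> float:
    """Cosine similarity between (D,) speaker embeddings of the generated
    and reference audio, as produced by a pretrained speaker-verification
    encoder (which encoder is used is an assumption of this sketch)."""
    return F.cosine_similarity(gen_emb, ref_emb, dim=-1).item()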

Citation

@misc{dahan2026idloraidentitydrivenaudiovideopersonalization,
  title     = {ID-LoRA: Identity-Driven Audio-Video Personalization
               with In-Context LoRA},
  author    = {Aviad Dahan and Moran Yanuka and Noa Kraicer and Lior Wolf and Raja Giryes},
  year      = {2026},
  eprint    = {2603.10256},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url       = {https://arxiv.org/abs/2603.10256}
}