Identity-Driven Audio-Video Personalization with In-Context LoRA
Generate video and audio of a specific person from a single text prompt, a reference image, and a short audio clip — all in one model. Now supporting LTX 2.3.
Aviad Dahan* Moran Yanuka* Noa Kraicer
Lior Wolf Raja Giryes
*Equal contribution
73% preferred over Kling 2.6 Pro for voice similarity
81% preferred over ElevenLabs v3 + Wan2.2 for voice similarity
~3K training pairs, single GPU
Each video is fully generated — placing a real identity in a new scene with physical interactions that produce sound, all from a single text prompt, a reference image, and a short audio clip.
Same prompt and reference, different methods. Click any video to play — unmute to hear the difference.
A/B preference test on Amazon Mechanical Turk (hard split — cross-video reference-target pairs).
vs. Kling 2.6 Pro. ID-LoRA is preferred by 73% of annotators for voice similarity, 65% for speaking style, and 55% for environment sounds.
vs. ElevenLabs v3 + Wan2.2. ID-LoRA is preferred by 81% of annotators for voice similarity, 56% for speaking style, and 69% for environment sounds.
Mean Opinion Scores. ID-LoRA achieves an overall MOS of 3.05 vs. 2.90 for Kling 2.6 Pro, winning on 8 of 10 physical interaction scenarios.
Existing video personalization methods preserve visual likeness but treat video and audio separately. Classical voice-cloning models condition only on a reference recording, so a text prompt cannot redirect speaking style or acoustic environment; prompt-conditioned audio models could offer such control, but without access to the visual scene they cannot synchronize sounds with on-screen actions.
We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice within a single generative pass. Two challenges arise from this formulation. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, which place reference tokens in a disjoint region of the RoPE space while preserving their internal temporal structure. Furthermore, speaker characteristics tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal.
In human preference studies ID-LoRA is preferred over Kling 2.6 Pro, the leading commercial unified model with voice personalization capabilities, by 73% of annotators for voice similarity and 65% for speaking style. Automatic metrics confirm these gains: on cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as reference and target conditions diverge. ID-LoRA achieves these results with only approximately 3K training pairs on a single GPU.
Architecture overview. ID-LoRA adapts the LTX-2 dual-stream DiT via In-Context LoRA. A reference audio clip is encoded and concatenated with noisy target latents, while the video stream uses standard text-to-video generation with first-frame conditioning.
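The LoRA adaptation itself is standard: the backbone weights stay frozen and only low-rank factors are trained. A minimal NumPy sketch of one adapted linear layer (the layer sizes, `alpha`, and the zero initialization of `B` are generic LoRA conventions, not details taken from the paper):

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16.0):
    """LoRA-adapted linear layer: y = x W + (alpha / r) * x A B.

    W is the frozen base weight; only the low-rank factors
    A (d_in x r) and B (r x d_out) receive gradients.
    """
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A @ B)

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8
W = rng.standard_normal((d_in, d_out)) * 0.02  # frozen backbone weight
A = rng.standard_normal((d_in, r)) * 0.02      # trainable down-projection
B = np.zeros((r, d_out))                       # zero init: adapter starts as a no-op

x = rng.standard_normal((2, d_in))
y = lora_linear(x, W, A, B)
```

With `B` initialized to zero, the adapted layer is exactly the frozen layer at the start of training, which is the usual LoRA recipe for keeping the pretrained backbone's behavior intact.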
Reference audio is encoded into latents and concatenated with noisy target audio along the sequence dimension. The model learns to extract speaker identity from the reference during denoising.
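The in-context conditioning above can be sketched in a few lines. All shapes, the rectified-flow-style noising, and names such as `ref_latents` are illustrative assumptions; only the concatenation scheme (clean reference, noisy target, joined along the sequence axis) comes from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
B, T_ref, T_tgt, D = 2, 8, 24, 64              # hypothetical batch/length/channel sizes

ref_latents = rng.standard_normal((B, T_ref, D))  # encoded reference audio (kept clean)
tgt_latents = rng.standard_normal((B, T_tgt, D))  # encoded target audio
noise = rng.standard_normal((B, T_tgt, D))
t = 0.7                                           # diffusion timestep in [0, 1]

# Noise is applied ONLY to the target; the reference stays clean so the
# model can read speaker identity from it at every denoising step.
noisy_tgt = (1 - t) * tgt_latents + t * noise

# Concatenate along the sequence dimension: [ reference | noisy target ].
model_input = np.concatenate([ref_latents, noisy_tgt], axis=1)

# The denoising loss would be computed only on the target slice.
target_slice = model_input[:, T_ref:, :]
```

Restricting the loss to the target slice means the model is never asked to reconstruct the reference, only to exploit it.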
Reference tokens receive negative RoPE positions [-T, 0), while target tokens occupy [0, T). This cleanly separates reference from target in attention while preserving temporal structure.
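A minimal sketch of this positioning scheme with a standard 1D RoPE (the `rope_rotate` helper and all dimensions are hypothetical; only the sign convention for the positions follows the text):

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply standard 1D rotary position embedding to x of shape (seq, dim)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))   # (half,)
    angles = positions[:, None] * freqs[None, :]       # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
T_ref, T_tgt, dim = 8, 24, 32
tokens = rng.standard_normal((T_ref + T_tgt, dim))

# Reference tokens get positions [-T_ref, 0); target tokens get [0, T_tgt).
# The two ranges are disjoint, but each keeps consecutive integer spacing,
# so internal temporal structure is preserved on both sides.
positions = np.concatenate([np.arange(-T_ref, 0),
                            np.arange(0, T_tgt)]).astype(float)
rotated = rope_rotate(tokens, positions)
```

Because RoPE attention depends only on relative positions, shifting the reference into a negative range keeps it ordered internally while pushing it away from every target position.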
A CFG variant that amplifies speaker-specific features by extrapolating between predictions with and without the reference signal. Scale 4.0 yields a +9% speaker similarity improvement.
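The guidance rule is the usual classifier-free-guidance extrapolation applied to the reference signal. A sketch, assuming the model returns noise predictions and that dropping the reference (e.g. masking it out) gives the unconditional branch:

```python
import numpy as np

def identity_guidance(eps_ref, eps_no_ref, scale=4.0):
    """CFG-style extrapolation that amplifies speaker-specific features.

    eps_ref:    model prediction WITH the reference audio in context
    eps_no_ref: prediction with the reference dropped from the context
    scale:      guidance strength (the page reports scale 4.0 working well)
    """
    return eps_no_ref + scale * (eps_ref - eps_no_ref)

rng = np.random.default_rng(0)
eps_ref = rng.standard_normal((4, 16))
eps_no_ref = rng.standard_normal((4, 16))
guided = identity_guidance(eps_ref, eps_no_ref, scale=4.0)
```

At scale 1.0 this reduces to the ordinary conditional prediction; scales above 1.0 push the sample further in the direction the reference contributes, which is what amplifies speaker identity.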
Comparison across three evaluation splits. Best in bold per column.
Cross-video reference-target pairs where reference and target environments differ.
| Method | Spk Sim ↑ | Face Sim ↑ | LSE-D ↓ | LSE-C ↑ | CLAP ↑ | WER ↓ |
|---|---|---|---|---|---|---|
| ID-LoRA (Ours) | **0.477** | 0.874 | **8.49** | **3.90** | **0.363** | **0.113** |
| Kling 2.6 Pro | 0.385 | 0.854 | 9.49 | 3.47 | 0.316 | 0.121 |
| CosyVoice 3.0 + Wan2.2 | 0.391 | 0.890 | 11.40 | 1.50 | 0.249 | 0.362 |
| VoiceCraft + Wan2.2 | 0.344 | 0.892 | 10.60 | 1.33 | 0.258 | 0.427 |
| ElevenLabs v3 + Wan2.2 | 0.357 | **0.894** | 11.86 | 1.72 | 0.238 | 0.154 |
24% speaker similarity improvement over Kling 2.6 Pro on the hard split, where reference and target environments diverge. The gap widens on harder cross-environment settings — demonstrating the advantage of unified generation over cascaded pipelines. ID-LoRA also leads in lip synchronization (LSE-D/C) and audio prompt adherence (CLAP).
@misc{dahan2026idloraidentitydrivenaudiovideopersonalization,
  title         = {ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA},
  author        = {Aviad Dahan and Moran Yanuka and Noa Kraicer and Lior Wolf and Raja Giryes},
  year          = {2026},
  eprint        = {2603.10256},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2603.10256}
}