AtteConDA | Attention-Based Conflict Suppression in Multi-Condition Diffusion Models

Why this project exists

Conventional augmentation changes pixels or geometry, but it usually cannot create semantically meaningful weather or time-of-day changes while preserving the detailed structure required by high-level driving tasks. AtteConDA targets exactly that gap.

Conceptual comparison: conventional image augmentation versus structure-preserving generative augmentation.

Abstract

High-level autonomous-driving tasks require more than class masks. They depend on road geometry, distant structure, object presence, lane continuity, and traffic-scene coherence. Existing annotation-conditioned diffusion approaches are promising, but semantic-only control is often insufficient and multi-condition control can introduce destructive conflicts.

AtteConDA addresses this by combining semantic segmentation, depth, and edge conditions in a Uni-ControlNet-style diffusion framework, while introducing a Patch-wise Adaptation Module (PAM) that performs conflict-aware local condition selection. The repository organizes the practical pipeline (preparation, prompt generation, training, Waymo inference, and evaluation) so that new methods can be compared on a shared structure-preservation benchmark.

Method

The method is built around a reusable generation pipeline, Uni-ControlNet-compatible initialization, and explicit condition-conflict suppression through PAM.

1. Multi-condition pipeline

RGB images are converted into semantic segmentation, depth, and edge conditions. Prompts are generated to change appearance without rewriting layout.

2. Pretrained control prior

Strong controllable diffusion representations are reused instead of relearning everything from scratch on a smaller collection of autonomous-driving datasets.

3. PAM conflict suppression

PAM selects the locally effective conditions so that low-frequency geometry and high-frequency contours do not interfere destructively with each other in the shared feature space.

Overall multi-condition generation pipeline.
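
To make the data flow concrete, here is a minimal sketch of how the three conditions for one frame might be bundled and checked for spatial alignment. Everything in it is an assumption for illustration: the `ConditionSet` container, the `gradient_edges` helper (a simple thresholded gradient standing in for whatever edge extractor the pipeline uses), and the toy values. In the real pipeline the segmentation and depth maps would come from dedicated models rather than hand-written lists.

```python
from dataclasses import dataclass
from typing import List

Grid = List[List[float]]  # one value per pixel, row-major


def gradient_edges(gray: Grid, threshold: float = 0.25) -> Grid:
    """Stand-in edge extractor: thresholded forward-difference gradient magnitude."""
    h, w = len(gray), len(gray[0])
    edges = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp at the border so the last row/column compares with itself.
            dx = gray[y][min(x + 1, w - 1)] - gray[y][x]
            dy = gray[min(y + 1, h - 1)][x] - gray[y][x]
            if (dx * dx + dy * dy) ** 0.5 > threshold:
                edges[y][x] = 1.0
    return edges


@dataclass
class ConditionSet:
    """Hypothetical container for the three aligned conditions of one RGB frame."""
    segmentation: List[List[int]]  # class id per pixel
    depth: Grid
    edge: Grid

    def __post_init__(self) -> None:
        maps = (self.segmentation, self.depth, self.edge)
        shapes = {(len(m), len(m[0])) for m in maps}
        if len(shapes) != 1:
            raise ValueError(f"condition maps must share one resolution, got {shapes}")


# Toy 3x4 frame: dark left half, bright right half (all values are made up).
gray = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
conditions = ConditionSet(
    segmentation=[[0, 0, 1, 1] for _ in range(3)],
    depth=[[10.0, 10.0, 40.0, 40.0] for _ in range(3)],
    edge=gradient_edges(gray),
)
print(conditions.edge[0])
```

The shape check reflects the one property any multi-condition setup needs: the maps must describe the same pixels, otherwise per-patch comparisons between conditions are meaningless.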

Model detail based on a Uni-ControlNet-compatible controllable diffusion backbone.

PAM explicitly targets local inter-condition conflict suppression.
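
The idea of selecting conditions per patch can be illustrated with a toy attention-style fusion. This is a sketch under stated assumptions, not the PAM architecture from the paper: the scaled dot-product scoring, the `fuse_patch` function, the query vector, and the 2-D features are all hypothetical. It only shows what "local condition selection" could mean mechanically: each patch gets its own softmax weights over the three conditions.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_patch(query: Vector, features: Dict[str, Vector]) -> Tuple[Dict[str, float], Vector]:
    """Score each condition's patch feature against the query, then blend by softmax weight."""
    names = list(features)
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, features[n])) / scale for n in names]
    weights = softmax(scores)
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


# Hypothetical 2-D features for a single patch.
patch_features = {
    "segmentation": [0.2, 0.1],
    "depth": [2.0, 0.0],
    "edge": [0.1, 1.0],
}
patch_query = [1.0, 0.0]  # hypothetical query derived from the backbone at this patch

weights, fused = fuse_patch(patch_query, patch_features)
print(max(weights, key=weights.get), [round(v, 3) for v in fused])
```

In this toy patch the depth feature aligns best with the query, so it receives the largest weight. A different patch with a different query would weight the conditions differently, which is the property a patch-wise module relies on.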

Prompt generation pipeline with CLIP/open_clip classification and Qwen3-VL captioning.
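
One way to picture "changing appearance without rewriting layout" is to keep the scene caption fixed and vary only the appearance attributes. The sketch below assumes that classification yields a source weather and time-of-day label and that captioning yields a layout description. The label sets, function names, and caption are hypothetical and are not taken from the repository.

```python
from itertools import product
from typing import Iterator, Tuple

WEATHER = ("clear", "rainy", "snowy", "foggy")  # hypothetical label set
TIME_OF_DAY = ("daytime", "dusk", "night")  # hypothetical label set


def appearance_targets(source_weather: str, source_time: str) -> Iterator[Tuple[str, str]]:
    """Yield every (weather, time) pair that differs from the source frame's labels."""
    for weather, time in product(WEATHER, TIME_OF_DAY):
        if (weather, time) != (source_weather, source_time):
            yield weather, time


def compose_prompt(caption: str, weather: str, time: str) -> str:
    """Keep the layout caption unchanged and append only appearance attributes."""
    return f"{caption.rstrip('. ')}, {weather} weather, {time}"


caption = "A two-lane road with parked cars and a distant overpass."  # hypothetical caption
prompts = [
    compose_prompt(caption, weather, time)
    for weather, time in appearance_targets("clear", "daytime")
]
print(len(prompts))
print(prompts[0])
```

Because the caption text is reused verbatim, every variant describes the same layout. Only the trailing weather and time-of-day attributes change.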

What PAM changes

+18.89%

Relative improvement in depth RMSE of PAM60K over Tune60K. Lower RMSE is better, so the gain corresponds to a reduction in error.

+17.04%

Object-preservation F1 improvement of PAM60K over Tune60K.

All four

Semantic, depth, edge, and object-preservation structure metrics improve when PAM is added.
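
A relative gain depends on which value it is normalized by, and this page does not state the convention. As a hedged worked example, the snippet below takes the rounded PAM60K and Tune60K values from the quantitative table and computes the change relative to each model in turn. The helper name is hypothetical, and rounded inputs will not necessarily reproduce the headline percentages to the last digit.

```python
def relative_change_pct(delta: float, reference: float) -> float:
    """Express an absolute change as a percentage of a chosen reference value."""
    return 100.0 * delta / reference


# Rounded values as listed in the quantitative table on this page.
pam_rmse, tune_rmse = 27.77, 33.02  # depth RMSE, lower is better
pam_f1, tune_f1 = 0.1071, 0.0889    # object-preservation F1, higher is better

rmse_drop = tune_rmse - pam_rmse    # error removed by adding PAM
f1_rise = pam_f1 - tune_f1          # F1 gained by adding PAM

for name, delta, pam, tune in (
    ("Depth RMSE", rmse_drop, pam_rmse, tune_rmse),
    ("Object F1", f1_rise, pam_f1, tune_f1),
):
    print(
        f"{name}: {relative_change_pct(delta, tune):.2f}% of Tune60K, "
        f"{relative_change_pct(delta, pam):.2f}% of PAM60K"
    )
```

Stating the reference value alongside a percentage avoids ambiguity when comparing against other reports.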

Results

The project focuses on structure-preserving augmentation for high-level driving tasks, so semantic-only scores are not the whole story. The important question is whether geometry, object presence, contours, and realism are preserved together.
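
For readers unfamiliar with the structure metrics named on this page, the following sketch gives textbook definitions of mIoU, RMSE, and F1 on flat lists. It assumes the standard formulas only. How the paper obtains predictions and references (which models, which matching rules for objects, which depth units) is not described here, so the exact evaluation protocol may differ.

```python
import math
from typing import Sequence


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root-mean-square error between two equally long value sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union, skipping classes absent from both inputs."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy inputs (made up) to show the call shapes.
print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), 4))
print(round(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2), 4))
print(f1_score(tp=8, fp=2, fn=2))
```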

Qualitative comparison across training progress and prior-work baselines.

Qualitative comparison showing the effect of PAM on distant structure preservation.

Pretraining improves structure preservation relative to the from-scratch setting.

Zoomed example: PAM improves distant road continuity and local structural consistency.

Quantitative comparison against prior work

Category	Metric	PAM60K	Tune60K	DGInStyle	Best among ours
Semantic Segmentation	mIoU ↑	0.3310	0.3115	0.3722	0.3310
Depth	RMSE ↓	27.77	33.02	36.71	27.77
Edge	L1 Error ↓	0.04493	0.04561	0.09176	0.04493
Object Preservation	F1 ↑	0.1071	0.0889	0.0790	0.1071
Reality	CLIP-CMMD ↓	0.1794	0.1738	0.2710	0.1738
Diversity	1-MS-SSIM ↑	0.8480	0.8497	0.9240	0.8497
Text Alignment	R-Precision@1 ↑	0.3258	0.3563	0.3606	0.3563
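
Because the arrows in the Metric column flip the meaning of "best", the last column has to be read with the direction in mind. As a small sanity check, the sketch below re-derives "Best among ours" from the PAM60K and Tune60K columns. It assumes those two columns are the only "ours" entries compared in this table. The `Row` type and function names are hypothetical.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    metric: str
    higher_is_better: bool
    pam60k: float
    tune60k: float


# Values copied from the table above; direction follows the arrows.
ROWS: List[Row] = [
    Row("mIoU", True, 0.3310, 0.3115),
    Row("RMSE", False, 27.77, 33.02),
    Row("L1 Error", False, 0.04493, 0.04561),
    Row("F1", True, 0.1071, 0.0889),
    Row("CLIP-CMMD", False, 0.1794, 0.1738),
    Row("1-MS-SSIM", True, 0.8480, 0.8497),
    Row("R-Precision@1", True, 0.3258, 0.3563),
]


def best_among_ours(row: Row) -> float:
    """Pick the better of the two values, respecting the metric direction."""
    pick = max if row.higher_is_better else min
    return pick(row.pam60k, row.tune60k)


def winner(row: Row) -> str:
    """Name the model that holds the best value for this row."""
    return "PAM60K" if best_among_ours(row) == row.pam60k else "Tune60K"


for row in ROWS:
    print(f"{row.metric}: {best_among_ours(row)} ({winner(row)})")
```

Read this way, PAM60K holds the best value on the four structure rows, while Tune60K holds it on the reality, diversity, and text-alignment rows.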

Interpretation: AtteConDA is strongest when the target is not only semantic layout fidelity but also geometry, contour preservation, object presence, and realism.

Scaling behavior for structure-related metrics.

Scaling behavior for quality, realism, diversity, and text alignment.

Released models

The Hugging Face collection groups the released checkpoints. Model-card templates are included in the repository for each public release.

Paper name	Model name
FullScratch30K	AtteConDA-SDE-Scratch-30K
Tune30K	AtteConDA-SDE-UniCon-30K
Tune60K	AtteConDA-SDE-UniCon-60K
Tune90K	AtteConDA-SDE-UniCon-90K
PAM60K	AtteConDA-SDE-UniCon-60K-PAM
Initialization checkpoint	AtteConDA-SDE-UniCon-Init

Acknowledgements and provenance

Direct upstream codebases

Uni-ControlNet, DGInStyle

Runtime models and tools

Stable Diffusion v1.5 family, OneFormer, Metric3D / Metric3Dv2, Grounding DINO, CLIP / open_clip, Qwen3-VL, LPIPS / AlexNet

PixelPonder is acknowledged as paper-level inspiration for dynamic multi-condition conflict handling. This release does not claim code provenance from an unlicensed source tree.

Citation

If you use AtteConDA, please cite the arXiv paper:

@article{noguchi2026atteconda,
  title   = {AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation},
  author  = {Noguchi, Shogo},
  journal = {arXiv preprint arXiv:2605.09425},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.09425}
}