Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation.
AtteConDA keeps road-scene structure while changing appearance, and makes conflict suppression among semantic segmentation, depth, and edge conditions explicit through PAM.
Conventional augmentation changes pixels or geometry, but it usually cannot create semantically meaningful weather or time-of-day changes while preserving the detailed structure required by high-level driving tasks. AtteConDA targets exactly that gap.

Conceptual comparison: conventional image augmentation versus structure-preserving generative augmentation.
High-level autonomous-driving tasks require more than class masks. They depend on road geometry, distant structure, object presence, lane continuity, and traffic-scene coherence. Existing annotation-conditioned diffusion approaches are promising, but semantic-only control is often insufficient and multi-condition control can introduce destructive conflicts.
AtteConDA addresses this by combining semantic segmentation, depth, and edge conditions in a Uni-ControlNet-style diffusion framework, and by introducing a Patch-wise Adaptation Module (PAM) that performs conflict-aware local condition selection. The repository organizes the practical pipeline (preparation, prompt generation, training, Waymo inference, and evaluation) so that new methods can be compared on a shared structure-preservation benchmark.
The method is built around a reusable generation pipeline, Uni-ControlNet-compatible initialization, and explicit condition-conflict suppression through PAM.
RGB images are converted into semantic segmentation, depth, and edge conditions. Prompts are generated to change appearance without rewriting layout.
Pretrained controllable-diffusion representations are reused rather than relearned from scratch on a comparatively small collection of autonomous-driving datasets.
PAM selects the locally effective conditions so that low-frequency geometry and high-frequency contours do not suppress each other in the shared feature space.
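To make the idea of local condition selection concrete, here is a toy sketch of patch-wise weighting. It is not the PAM implementation from the paper or the repository: the function names, the relevance scores, and the softmax-weighted sum are all assumptions chosen for illustration, and the feature vectors are made up.

```python
import math

CONDITIONS = ("segmentation", "depth", "edge")

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_patch(features, scores):
    """Blend the per-condition feature vectors of ONE patch.

    features: condition name -> feature vector (lists of equal length)
    scores:   condition name -> hypothetical relevance score for this patch
    Returns the fused vector and the weight given to each condition.
    """
    weights = softmax([scores[c] for c in CONDITIONS])
    dim = len(features[CONDITIONS[0]])
    fused = [
        sum(w * features[c][i] for c, w in zip(CONDITIONS, weights))
        for i in range(dim)
    ]
    return fused, dict(zip(CONDITIONS, weights))

# Made-up features shared by both toy patches.
toy_features = {
    "segmentation": [1.0, 0.0],
    "depth": [0.0, 1.0],
    "edge": [0.5, 0.5],
}

# A smooth road-surface patch: the depth score is set highest.
flat_road = fuse_patch(
    toy_features, {"segmentation": 0.5, "depth": 3.0, "edge": -1.0}
)
# A thin distant pole: the edge score is set highest.
thin_pole = fuse_patch(
    toy_features, {"segmentation": 0.5, "depth": -1.0, "edge": 3.0}
)

for name, (fused, weights) in (("flat_road", flat_road), ("thin_pole", thin_pole)):
    dominant = max(weights, key=weights.get)
    print(name, "dominant condition:", dominant)
```

Because the weights are computed separately for each patch, one condition can dominate on the road surface while another dominates on a thin structure, instead of a single global mixing ratio being applied to the whole image.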
Overall multi-condition generation pipeline.

Model detail based on a Uni-ControlNet-compatible controllable diffusion backbone.

PAM explicitly targets local inter-condition conflict suppression.

Prompt generation pipeline with CLIP/open_clip classification and Qwen3-VL captioning.
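Depth RMSE improvement of PAM60K over Tune60K, shown as a relative gain: RMSE falls from 33.02 to 27.77 in the results table, about 15.9% lower (lower is better).
Object-preservation F1 improvement of PAM60K over Tune60K: F1 rises from 0.0889 to 0.1071 in the results table, about 20.5% higher.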
Semantic, depth, edge, and object-preservation structure metrics improve when PAM is added.
The project focuses on structure-preserving augmentation for high-level driving tasks, so semantic-only scores are not the whole story. The important question is whether geometry, object presence, contours, and realism are preserved together.
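As a rough illustration of how structure preservation can be scored, the sketch below implements three of the metric families named on this page (RMSE, mean L1 error, and F1) on plain Python lists. The exact evaluation protocol of the repository is not described here, so the inputs are assumptions: a reference signal taken from the source image, an estimate taken from the generated image, and detection counts that have already been matched.

```python
import math

def rmse(reference, estimate):
    """Root-mean-square error between two equal-length sequences."""
    pairs = list(zip(reference, estimate))
    return math.sqrt(sum((r - e) ** 2 for r, e in pairs) / len(pairs))

def mean_l1(reference, estimate):
    """Mean absolute error between two equal-length sequences."""
    pairs = list(zip(reference, estimate))
    return sum(abs(r - e) for r, e in pairs) / len(pairs)

def f1_score(true_pos, false_pos, false_neg):
    """F1 from matched-detection counts; 0.0 when nothing was detected."""
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0

# Toy values only: flattened depth (metres), flattened edge maps in [0, 1],
# and hypothetical counts of objects kept, hallucinated, and lost.
depth_error = rmse([10.0, 20.0, 40.0], [11.0, 18.0, 43.0])
edge_error = mean_l1([0.0, 1.0, 1.0, 0.0], [0.1, 0.8, 1.0, 0.0])
object_f1 = f1_score(true_pos=2, false_pos=1, false_neg=1)

print(f"depth RMSE  {depth_error:.3f}")
print(f"edge L1     {edge_error:.3f}")
print(f"object F1   {object_f1:.3f}")
```

Reporting these side by side, rather than a single semantic score, is what lets geometry, contours, and object presence be checked together.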

Qualitative comparison across training progress and prior-work baselines.

Qualitative comparison showing the effect of PAM on distant structure preservation.

Pretraining improves structure preservation relative to the from-scratch setting.

Zoomed example: PAM improves distant road continuity and local structural consistency.
| Category | Metric | PAM60K | Tune60K | DGInStyle | Best among ours |
|---|---|---|---|---|---|
| Semantic Segmentation | mIoU ↑ | 0.3310 | 0.3115 | 0.3722 | 0.3310 |
| Depth | RMSE ↓ | 27.77 | 33.02 | 36.71 | 27.77 |
| Edge | L1 Error ↓ | 0.04493 | 0.04561 | 0.09176 | 0.04493 |
| Object Preservation | F1 ↑ | 0.1071 | 0.0889 | 0.0790 | 0.1071 |
| Reality | CLIP-CMMD ↓ | 0.1794 | 0.1738 | 0.2710 | 0.1738 |
| Diversity | 1-MS-SSIM ↑ | 0.8480 | 0.8497 | 0.9240 | 0.8497 |
| Text Alignment | R-Precision@1 ↑ | 0.3258 | 0.3563 | 0.3606 | 0.3563 |
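Interpretation: DGInStyle scores higher on mIoU, diversity, and text alignment, while the AtteConDA variants lead on depth RMSE, edge L1 error, object-preservation F1, and CLIP-CMMD. AtteConDA is therefore strongest when the target is not only semantic layout fidelity but also geometry, contour preservation, object presence, and realism.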
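The relative gains of PAM60K over Tune60K can be recomputed from the table. The sketch below copies the two columns verbatim and assumes that relative gain means the signed change divided by the Tune60K value, flipped for lower-is-better metrics so that a positive number always means PAM60K is better. The helper name `relative_gain` is illustrative and not taken from the repository.

```python
# (PAM60K, Tune60K, higher_is_better), copied from the results table.
RESULTS = {
    "mIoU": (0.3310, 0.3115, True),
    "Depth RMSE": (27.77, 33.02, False),
    "Edge L1": (0.04493, 0.04561, False),
    "Object F1": (0.1071, 0.0889, True),
    "CLIP-CMMD": (0.1794, 0.1738, False),
    "1-MS-SSIM": (0.8480, 0.8497, True),
    "R-Precision@1": (0.3258, 0.3563, True),
}

def relative_gain(new, baseline, higher_is_better):
    """Signed relative change vs. baseline; positive means `new` is better."""
    change = (new - baseline) / baseline
    return change if higher_is_better else -change

for metric, (pam, tune, higher) in RESULTS.items():
    gain = relative_gain(pam, tune, higher)
    print(f"{metric:>14}: {gain:+.1%}")
```

Under this definition the four structure metrics (mIoU, depth RMSE, edge L1, object F1) come out positive, while CLIP-CMMD, 1-MS-SSIM, and R-Precision@1 come out negative, which matches the "Best among ours" column.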

Scaling behavior for structure-related metrics.

Scaling behavior for quality, realism, diversity, and text alignment.
The Hugging Face collection groups the released checkpoints. Model-card templates are included in the repository for each public release.
Uni-ControlNet, DGInStyle
Stable Diffusion v1.5 family, OneFormer, Metric3D / Metric3Dv2, Grounding DINO, CLIP / open_clip, Qwen3-VL, LPIPS / AlexNet
PixelPonder is acknowledged as paper-level inspiration for dynamic multi-condition conflict handling. This release does not claim code provenance from an unlicensed source tree.
If you use AtteConDA, please cite the arXiv paper:
@article{noguchi2026atteconda,
title = {AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation},
author = {Noguchi, Shogo},
journal = {arXiv preprint arXiv:2605.09425},
year = {2026},
url = {https://arxiv.org/abs/2605.09425}
}