Autonomous-driving synthetic data augmentation

AtteConDA

Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation.

Shogo Noguchi · Gunma University

AtteConDA preserves road-scene structure while changing appearance, and it explicitly suppresses conflicts among semantic segmentation, depth, and edge conditions through a Patch-wise Adaptation Module (PAM).

Why this project exists

Conventional augmentation changes pixels or geometry, but it usually cannot create semantically meaningful weather or time-of-day changes while preserving the detailed structure required by high-level driving tasks. AtteConDA targets exactly that gap.

Comparison between conventional augmentation and generative augmentation

Conceptual comparison: conventional image augmentation versus structure-preserving generative augmentation.

Abstract

High-level autonomous-driving tasks require more than class masks. They depend on road geometry, distant structure, object presence, lane continuity, and traffic-scene coherence. Existing annotation-conditioned diffusion approaches are promising, but semantic-only control is often insufficient and multi-condition control can introduce destructive conflicts.

AtteConDA addresses this by combining semantic segmentation, depth, and edge conditions in a Uni-ControlNet-style diffusion framework, while introducing a Patch-wise Adaptation Module (PAM) that performs conflict-aware local condition selection. The repository organizes the practical pipeline -- preparation, prompt generation, training, Waymo inference, and evaluation -- so that new methods can be compared on a shared structure-preservation benchmark.

Method

The method is built around a reusable generation pipeline, Uni-ControlNet-compatible initialization, and explicit condition-conflict suppression through PAM.

1. Multi-condition pipeline

RGB images are converted into semantic segmentation, depth, and edge conditions. Prompts are generated to change appearance without rewriting layout.

2. Pretrained control prior

Strong pretrained controllable-diffusion representations are reused rather than relearned from scratch on the comparatively small collection of autonomous-driving datasets.

3. PAM conflict suppression

PAM selects the locally effective conditions so that low-frequency geometry and high-frequency contours do not interfere destructively with each other in the shared feature space.

Overall multi-condition generation pipeline.
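As a rough illustration of the conditioning step, the sketch below derives an edge map from an RGB array with a Sobel filter and stacks it with segmentation and depth maps into a single condition array. It is a toy stand-in under stated assumptions: in the actual pipeline the segmentation and depth maps presumably come from dedicated models (the acknowledgements list OneFormer, a segmentation model, and Metric3D, a depth estimator), and the function names `sobel_edges` and `stack_conditions`, the channel order, and the normalization are hypothetical.

```python
import numpy as np

def sobel_edges(rgb: np.ndarray) -> np.ndarray:
    """Return a [0, 1] edge-magnitude map for an HxWx3 uint8 image."""
    gray = rgb.astype(np.float32).mean(axis=2) / 255.0
    padded = np.pad(gray, 1, mode="edge")
    # Horizontal and vertical Sobel responses built from shifted views.
    gx = (padded[:-2, 2:] + 2 * padded[1:-1, 2:] + padded[2:, 2:]
          - padded[:-2, :-2] - 2 * padded[1:-1, :-2] - padded[2:, :-2])
    gy = (padded[2:, :-2] + 2 * padded[2:, 1:-1] + padded[2:, 2:]
          - padded[:-2, :-2] - 2 * padded[:-2, 1:-1] - padded[:-2, 2:])
    mag = np.hypot(gx, gy)
    peak = mag.max()
    return mag / peak if peak > 0 else mag

def stack_conditions(seg_ids, depth, edges, num_classes):
    """Stack per-pixel conditions into a (3, H, W) float32 array in [0, 1].

    Assumed channel order (hypothetical): segmentation, depth, edge.
    """
    seg = seg_ids.astype(np.float32) / max(num_classes - 1, 1)
    d = depth.astype(np.float32)
    span = d.max() - d.min()
    d = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    return np.stack([seg, d, edges.astype(np.float32)], axis=0)
```

The three maps stay pixel-aligned with the source image, which is what lets the generator change appearance while the layout is held fixed.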

Model detail

Model detail based on a Uni-ControlNet-compatible controllable diffusion backbone.
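One common way to realize a compatible initialization is to copy every pretrained tensor whose name and shape match the new model and to leave newly added parameters at their fresh values. The sketch below shows that general pattern on plain dictionaries of NumPy arrays; it is an assumption about how such an initialization could work, not the repository's loading code, and the function name `init_from_pretrained` and the parameter names used in the example are hypothetical.

```python
import numpy as np

def init_from_pretrained(model_state, pretrained_state):
    """Copy tensors whose name and shape match; keep the rest at their fresh init."""
    merged, loaded, skipped = {}, [], []
    for name, value in model_state.items():
        src = pretrained_state.get(name)
        if src is not None and src.shape == value.shape:
            merged[name] = src.copy()
            loaded.append(name)
        else:
            merged[name] = value
            skipped.append(name)
    return merged, loaded, skipped

# Hypothetical parameter names: one shared with the pretrained model, one new.
model_state = {
    "control.conv.weight": np.zeros((4, 3)),
    "pam.gate.weight": np.full((2, 2), 0.5),
}
pretrained_state = {"control.conv.weight": np.ones((4, 3))}
merged, loaded, skipped = init_from_pretrained(model_state, pretrained_state)
```

In this toy example the shared tensor is taken from the pretrained dictionary, while the new module's tensor is reported in `skipped` and keeps its initial values.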

PAM architecture

PAM explicitly targets local inter-condition conflict suppression.
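This page does not spell out PAM's internals, so the following is only a conceptual sketch of patch-wise condition selection in general: each condition receives a score per patch, a softmax across conditions turns the scores into weights, and the fused feature is the weighted sum. The function name `patchwise_fuse`, the scoring rule (mean absolute activation per patch), the patch size, and the temperature are assumptions made for illustration, not the AtteConDA implementation.

```python
import numpy as np

def patchwise_fuse(features: np.ndarray, patch: int = 4, temperature: float = 1.0):
    """Fuse K condition feature maps with per-patch softmax weights.

    features: array of shape (K, C, H, W); H and W must be divisible by `patch`.
    Returns (fused, weights) with shapes (C, H, W) and (K, H // patch, W // patch).
    """
    k, c, h, w = features.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    ph, pw = h // patch, w // patch
    # Hypothetical score: mean absolute activation of each condition in each patch.
    blocks = np.abs(features).reshape(k, c, ph, patch, pw, patch)
    scores = blocks.mean(axis=(1, 3, 5))                        # (K, ph, pw)
    # Softmax across the K conditions, computed independently for every patch.
    logits = scores / temperature
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)
    # Broadcast the patch weights back to pixel resolution and blend.
    full = weights.repeat(patch, axis=1).repeat(patch, axis=2)  # (K, H, W)
    fused = (features * full[:, None, :, :]).sum(axis=0)
    return fused, weights
```

A lower temperature pushes the weights toward a hard per-patch choice of one condition, while a higher one approaches plain averaging; a learned module would replace the hand-written score with trained parameters.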

Prompt generation pipeline

Prompt generation pipeline with CLIP/open_clip classification and Qwen3-VL captioning.
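The classification half of this pipeline can be pictured as CLIP-style zero-shot labeling: an image embedding is compared with text embeddings of candidate labels, and the best match identifies the attribute to change in the prompt. In the sketch below the embeddings are hand-made placeholder vectors rather than real CLIP/open_clip outputs, the caption is a made-up string rather than a Qwen3-VL result, and the `zero_shot_label` and `build_prompt` helpers, the label set, and the prompt handling are all hypothetical.

```python
import numpy as np

def zero_shot_label(image_emb, text_embs, labels):
    """Pick the label whose text embedding has the highest cosine similarity."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img
    return labels[int(np.argmax(sims))], sims

def build_prompt(caption, source_attr, target_attr):
    """Swap one appearance attribute; the layout description stays untouched."""
    if source_attr in caption:
        return caption.replace(source_attr, target_attr)
    return f"{caption}, {target_attr}"

# Placeholder embeddings standing in for encoder outputs (assumption).
labels = ["sunny", "rainy", "snowy"]
text_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
image_emb = np.array([0.9, 0.2, 0.1])

label, sims = zero_shot_label(image_emb, text_embs, labels)
prompt = build_prompt("a sunny urban street with parked cars", label, "rainy")
```

Here the detected attribute is `sunny`, and only that word is replaced, so the described scene content is left as the caption stated it.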

What PAM changes

+18.89%

Relative improvement in depth RMSE of PAM60K over Tune60K (lower RMSE is better).

+17.04%

Object-preservation F1 improvement of PAM60K over Tune60K.

All four

Semantic, depth, edge, and object-preservation structure metrics improve when PAM is added.

Results

The project focuses on structure-preserving augmentation for high-level driving tasks, so semantic-only scores are not the whole story. The important question is whether geometry, object presence, contours, and realism are preserved together.
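For orientation, the structure metrics named on this page have standard definitions that fit in a few lines. The sketch below uses the textbook formulas (mean IoU over the classes that occur, root-mean-square error, mean absolute error, and F1 from detection counts). The function names are illustrative, and the repository's evaluation code may differ in details such as class handling, depth masking, or how predictions are extracted from generated images.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union:
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

def depth_rmse(pred, target):
    """Root-mean-square error between two depth maps (lower is better)."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def edge_l1(pred, target):
    """Mean absolute difference between two edge maps (lower is better)."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.mean(np.abs(diff)))

def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Each function compares a quantity estimated from the generated image with the same quantity from the source image, so a generator that repaints appearance but keeps structure should score well on all of them at once.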

Qualitative tuning comparison

Qualitative comparison across training progress and prior-work baselines.

Qualitative PAM comparison

Qualitative comparison showing the effect of PAM on distant structure preservation.

Pretraining impact

Pretraining improves structure preservation relative to the from-scratch setting.

Zoomed PAM improvement

Zoomed example: PAM improves distant road continuity and local structural consistency.

Quantitative comparison against prior work

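| Category | Metric | PAM60K | Tune60K | DGInStyle | Best among ours |
| --- | --- | --- | --- | --- | --- |
| Semantic Segmentation | mIoU ↑ | 0.3310 | 0.3115 | 0.3722 | 0.3310 |
| Depth | RMSE ↓ | 27.77 | 33.02 | 36.71 | 27.77 |
| Edge | L1 Error ↓ | 0.04493 | 0.04561 | 0.09176 | 0.04493 |
| Object Preservation | F1 ↑ | 0.1071 | 0.0889 | 0.0790 | 0.1071 |
| Reality | CLIP-CMMD ↓ | 0.1794 | 0.1738 | 0.2710 | 0.1738 |
| Diversity | 1-MS-SSIM ↑ | 0.8480 | 0.8497 | 0.9240 | 0.8497 |
| Text Alignment | R-Precision@1 ↑ | 0.3258 | 0.3563 | 0.3606 | 0.3563 |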

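Interpretation: DGInStyle scores higher on semantic mIoU, diversity, and text alignment, while the AtteConDA variants lead on depth RMSE, edge L1 error, object-preservation F1, and CLIP-CMMD. AtteConDA is therefore strongest when the target is not only semantic layout fidelity but also geometry, contour preservation, object presence, and realism.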
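The PAM60K versus Tune60K gaps on the structure metrics can be recomputed from the rounded table values. The sketch below reports the relative change under two normalizations, because this page does not state which denominator its headline percentages use. The helper name `relative_change` is hypothetical, and results from rounded inputs can differ from the headline figures in the last digits.

```python
def relative_change(pam, tune, lower_is_better, baseline="tune"):
    """Signed relative gain of PAM60K over Tune60K in percent (positive = better)."""
    gain = (tune - pam) if lower_is_better else (pam - tune)
    denom = tune if baseline == "tune" else pam
    return 100.0 * gain / denom

# Rounded values from the comparison table: (PAM60K, Tune60K, lower_is_better).
rows = {
    "mIoU": (0.3310, 0.3115, False),
    "Depth RMSE": (27.77, 33.02, True),
    "Edge L1": (0.04493, 0.04561, True),
    "Object F1": (0.1071, 0.0889, False),
}
for name, (pam, tune, lower) in rows.items():
    vs_tune = relative_change(pam, tune, lower, baseline="tune")
    vs_pam = relative_change(pam, tune, lower, baseline="pam")
    print(f"{name}: {vs_tune:+.2f}% (Tune60K base), {vs_pam:+.2f}% (PAM60K base)")
```

With these rounded inputs, normalizing by the PAM60K value gives roughly 18.9% for depth RMSE and 17.0% for object-preservation F1, close to the headline figures, whereas normalizing by the Tune60K value gives about 15.9% and 20.5%. All four structure metrics come out positive under either convention.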

Scaling structure metrics

Scaling behavior for structure-related metrics.

Scaling quality metrics

Scaling behavior for quality, realism, diversity, and text alignment.

Released models

The Hugging Face collection groups the released checkpoints. Model-card templates are included in the repository for each public release.

| Paper name | Model name | Link |
| --- | --- | --- |
| FullScratch30K | AtteConDA-SDE-Scratch-30K | Open |
| Tune30K | AtteConDA-SDE-UniCon-30K | Open |
| Tune60K | AtteConDA-SDE-UniCon-60K | Open |
| Tune90K | AtteConDA-SDE-UniCon-90K | Open |
| PAM60K | AtteConDA-SDE-UniCon-60K-PAM | Open |
| Initialization checkpoint | AtteConDA-SDE-UniCon-Init | Open |

Acknowledgements and provenance

Direct upstream codebases

Uni-ControlNet, DGInStyle

Runtime models and tools

Stable Diffusion v1.5 family, OneFormer, Metric3D / Metric3Dv2, Grounding DINO, CLIP / open_clip, Qwen3-VL, LPIPS / AlexNet

PixelPonder is acknowledged as paper-level inspiration for dynamic multi-condition conflict handling. This release does not claim code provenance from an unlicensed source tree.

Citation

If you use AtteConDA, please cite the arXiv paper:

@article{noguchi2026atteconda,
  title   = {AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation},
  author  = {Noguchi, Shogo},
  journal = {arXiv preprint arXiv:2605.09425},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.09425}
}