Autonomous-driving synthetic data augmentation

AtteConDA

Shogo Noguchi
Gunma University

A multi-condition diffusion framework that keeps road-scene structure while changing appearance — and makes condition conflict suppression explicit through PAM (Patch-wise Adaptation Module).

Instead of treating multi-condition control as simple early fusion, AtteConDA explicitly handles inter-condition conflict so that semantic layout, depth geometry, and edge structure can support each other rather than destroy each other.

👨‍💻 Release status

✅ Release the training code
✅ Release the inference code
✅ Release the model
□ Release the arXiv paper

Why this project exists

Conventional augmentation changes pixels or geometry, but it usually cannot create semantically meaningful weather or time-of-day changes while preserving the detailed structure required by high-level driving tasks. AtteConDA targets exactly that gap.

Figure (Conventional vs. generative augmentation): Conceptual comparison of conventional image augmentation and structure-preserving generative augmentation.

Abstract

High-level autonomous-driving tasks require more than class masks: they depend on road geometry, distant structure, object presence, lane continuity, and traffic-scene coherence. Existing annotation-conditioned diffusion approaches are promising, but semantic-only control is often insufficient and multi-condition control can introduce destructive conflicts. AtteConDA addresses this by combining semantic segmentation, depth, and edge conditions in a Uni-ControlNet-style diffusion framework, while introducing a Patch-wise Adaptation Module (PAM) that performs conflict-aware local condition selection. The repository further organizes the complete practical pipeline — preparation, prompt generation, training, Waymo inference, and evaluation — so that new methods can be compared on a shared structure-preservation benchmark.

Method

The method is built around a reusable generation pipeline, Uni-ControlNet-compatible initialization, and explicit condition-conflict suppression through PAM.

1. Multi-condition generation pipeline

RGB images are converted into semantic segmentation, depth, and edge conditions, while prompts are generated to change appearance without rewriting layout.

2. Uni-ControlNet-compatible initialization

Strong controllable-diffusion priors are reused rather than relearned from scratch on a comparatively small collection of autonomous-driving datasets.

3. PAM for conflict suppression

PAM selects the locally effective conditions for each patch, so that low-frequency geometry (depth) and high-frequency contours (edges) do not overwrite each other in the shared feature space.
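To make the idea of patch-wise condition selection concrete, here is a minimal sketch in plain Python. It is not the PAM implementation from this repository: the function name `patchwise_select`, the single-channel grids, and the scoring rule (mean absolute activation per patch) are all assumptions chosen only to illustrate how per-patch softmax weights can let one condition dominate locally without discarding the others.

```python
import math

def patchwise_select(cond_maps, patch=2, temperature=1.0):
    """Blend K condition maps using per-patch softmax weights.

    cond_maps: list of K grids (H x W nested lists of floats), one per condition.
    The relevance score (mean absolute activation) is a placeholder assumption,
    not the scoring used by AtteConDA's PAM.
    """
    h, w = len(cond_maps[0]), len(cond_maps[0][0])
    fused = [[0.0] * w for _ in range(h)]
    weights = {}  # (patch_row, patch_col) -> list of K weights
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            rows = range(top, min(top + patch, h))
            cols = range(left, min(left + patch, w))
            n = len(rows) * len(cols)
            # Hypothetical relevance score of each condition inside this patch.
            scores = [
                sum(abs(m[r][c]) for r in rows for c in cols) / n
                for m in cond_maps
            ]
            # Numerically stable softmax over the K conditions.
            peak = max(scores)
            exps = [math.exp((s - peak) / temperature) for s in scores]
            total = sum(exps)
            wts = [e / total for e in exps]
            weights[(top // patch, left // patch)] = wts
            # Weighted sum of the condition maps inside the patch.
            for r in rows:
                for c in cols:
                    fused[r][c] = sum(wt * m[r][c] for wt, m in zip(wts, cond_maps))
    return fused, weights

# Toy example: "depth" is active on the left patch, "edge" on the right patch.
depth = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
edge = [[0.0, 0.0, 5.0, 5.0], [0.0, 0.0, 5.0, 5.0]]
fused, weights = patchwise_select([depth, edge], patch=2)
```

In the toy example the left patch weights depth more heavily and the right patch is dominated by the edge map. A real module would presumably operate on multi-channel feature tensors with learned scores rather than a fixed heuristic.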

Figure (Pipeline overview): Overall multi-condition generation pipeline.
Figure (Model detail): Model detail based on a Uni-ControlNet-compatible controllable diffusion backbone.
Figure (PAM architecture): PAM explicitly targets local inter-condition conflict suppression.
Figure (Prompt generation pipeline): Prompt generation pipeline with CLIP/open_clip classification and Qwen3-VL captioning.

What PAM changes

  • +18.89%: relative depth RMSE improvement of PAM60K over Tune60K (RMSE is lower-is-better, so the gain is a reduction in error).
  • +17.04%: relative object-preservation F1 improvement of PAM60K over Tune60K.
  • All four: semantic, depth, edge, and object-preservation structure metrics improve when PAM is added.
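The headline percentages can be sanity-checked against the rounded PAM60K and Tune60K values reported in the Results table. The page does not state which denominator was used for the relative gain, so the sketch below computes both common conventions. The helper name `relative_change` is illustrative and not part of the repository.

```python
def relative_change(new, old, denominator):
    """Return |new - old| / denominator as a percentage."""
    return abs(new - old) / denominator * 100.0

# Rounded values copied from the Results table.
depth_rmse = {"PAM60K": 27.77, "Tune60K": 33.02}   # lower is better
object_f1 = {"PAM60K": 0.1071, "Tune60K": 0.0889}  # higher is better

for name, vals in (("Depth RMSE", depth_rmse), ("Object F1", object_f1)):
    pam, tune = vals["PAM60K"], vals["Tune60K"]
    vs_tune = relative_change(pam, tune, tune)  # gain relative to the baseline
    vs_pam = relative_change(pam, tune, pam)    # gain relative to the PAM value
    print(f"{name}: {vs_tune:.2f}% of Tune60K, {vs_pam:.2f}% of PAM60K")
```

With the rounded table values, dividing by the PAM60K value gives roughly 18.9% for depth RMSE and 17.0% for object-preservation F1, close to the headline figures. Dividing by the Tune60K baseline gives roughly 15.9% and 20.5%. The remaining small differences are presumably due to the rounding in the table.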

Results

The project focuses on structure-preserving augmentation for high-level driving tasks, so semantic-only scores are not the whole story. The important question is whether geometry, object presence, contours, and realism are preserved together.

Figure (Qualitative tuning comparison): Qualitative comparison across training progress and prior-work baselines.
Figure (Qualitative PAM comparison): Qualitative comparison showing the effect of PAM on distant structure preservation.
Figure (Pretraining impact): Pretraining improves structure preservation relative to the from-scratch setting.
Figure (Zoomed PAM improvement): Zoomed example in which PAM improves distant road continuity and local structural consistency.
| Category | Metric | PAM60K | Tune60K | DGInStyle | Best among ours |
|---|---|---|---|---|---|
| Semantic Segmentation | mIoU ↑ | 0.3310 | 0.3115 | 0.3722 | 0.3310 |
| Depth | RMSE ↓ | 27.77 | 33.02 | 36.71 | 27.77 |
| Edge | L1 Error ↓ | 0.04493 | 0.04561 | 0.09176 | 0.04493 |
| Object Preservation | F1 ↑ | 0.1071 | 0.0889 | 0.0790 | 0.1071 |
| Reality | CLIP-CMMD ↓ | 0.1794 | 0.1738 | 0.2710 | 0.1738 |
| Diversity | 1-MS-SSIM ↑ | 0.8480 | 0.8497 | 0.9240 | 0.8497 |
| Text Alignment | R-Precision@1 ↑ | 0.3258 | 0.3563 | 0.3606 | 0.3563 |

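Interpretation: DGInStyle scores higher on semantic mIoU, diversity, and text alignment, whereas the AtteConDA models lead on depth RMSE, edge L1 error, object-preservation F1, and CLIP-CMMD. AtteConDA is therefore strongest when the target is not only semantic layout fidelity but also geometry, contour preservation, object presence, and realism.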
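For readers unfamiliar with the structure metrics in the table, the following sketch shows their textbook definitions on flattened pixel lists. It is an illustration only: the function names are hypothetical, and the repository's actual evaluation protocol (valid-pixel masking, depth units, and the detection matching rule behind the object-preservation F1) is not specified on this page.

```python
import math

def miou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target."""
    ious = []
    for cls in range(num_classes):
        inter = sum(p == cls and t == cls for p, t in zip(pred, target))
        union = sum(p == cls or t == cls for p, t in zip(pred, target))
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def rmse(pred, target):
    """Root-mean-square error between two depth maps."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def l1_error(pred, target):
    """Mean absolute difference between two edge maps."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def f1_score(tp, fp, fn):
    """F1 from matched (tp), spurious (fp), and missed (fn) detections."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy inputs: labels/depths from the source image vs. the generated image.
seg_iou = miou([0, 1, 1, 1], [0, 0, 1, 1], num_classes=2)
depth_err = rmse([10.0, 20.0], [13.0, 24.0])
edge_err = l1_error([0.0, 1.0], [0.5, 1.0])
obj_f1 = f1_score(tp=3, fp=1, fn=2)
```

In each case the generated image is compared against annotations or predictions derived from the source image, so higher mIoU and F1 and lower RMSE and L1 error indicate that more of the original structure survived the appearance change.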

Figure (Scaling structure metrics): Scaling behavior for structure-related metrics.
Figure (Scaling quality metrics): Scaling behavior for quality, realism, diversity, and text-alignment metrics.

Released models

The released checkpoints are grouped in a Hugging Face collection. The repository includes a model-card template for each public release.

| Paper name | Model name | Link |
|---|---|---|
| FullScratch30K | AtteConDA-SDE-Scratch-30K | Open |
| Tune30K | AtteConDA-SDE-UniCon-30K | Open |
| Tune60K | AtteConDA-SDE-UniCon-60K | Open |
| Tune90K | AtteConDA-SDE-UniCon-90K | Open |
| PAM60K | AtteConDA-SDE-UniCon-60K-PAM | Open |
| Initialization checkpoint | AtteConDA-SDE-UniCon-Init | Open |

Acknowledgements

Direct upstream codebases

  • Uni-ControlNet
  • DGInStyle

Runtime models and tools

  • Stable Diffusion v1.5 family
  • OneFormer
  • Metric3D / Metric3Dv2
  • Grounding DINO
  • CLIP / open_clip
  • Qwen3-VL
  • LPIPS / AlexNet
  • Tesseract OCR

PixelPonder is acknowledged as paper-level inspiration for dynamic multi-condition conflict handling. This release does not claim code provenance from an unlicensed source tree.

Citation

The paper will be released later. Until then, please cite the software / project release.

@misc{noguchi2026atteconda,
  title        = {AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation},
  author       = {Shogo Noguchi},
  year         = {2026},
  howpublished = {GitHub repository},
  note         = {Gunma University}
}

Some figure labels and internal filenames are still being translated from the original Japanese thesis draft. The repository keeps them for reproducibility first, then improves the English presentation incrementally.