In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the need for dense supervision, which unfortunately may limit their ability to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been employed; yet their over-relaxed regularization often leads to imprecise placement. We propose BootPlace, a novel paradigm that formulates object placement as a placement-by-detection problem. Our method first identifies regions of interest suitable for object placement by training a dedicated detection transformer on object-subtracted backgrounds with multi-object supervision. It then associates each target compositing object with detected regions based on semantic complementarity. Using a bootstrapped training approach on randomly object-subtracted images, our model regularizes meaningful placements through richly paired data augmentation. Experimental results on standard benchmarks demonstrate BootPlace's superior performance in object repositioning, significantly outperforming state-of-the-art baselines on the Cityscapes and OPA datasets with notable improvements in IoU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user study evaluations.