TL;DR: this is a bit more complex than fixing the training size. The analysis conflates two things: one is the requirement that models have identical tensor shapes for the forward pass (which already happens under the hood); the more subtle issue is how models handle differing image scales at train/test time (which we should care about, a lot). Batched training and validation are identical: they use the same dataset class, except that validation has augmentations turned off. In both cases the models cannot deal with variable-size images themselves; they have a preprocessing step that fixes it. The next question is what that size should be. I have opinions...
RetinaNet just hides the details, but it should be doing that batching internally. Both models do the same thing: anything between 800 and 1333px is untouched, and anything resized has its aspect ratio preserved. So for a default of 800px the model doesn't do anything to an 800px image, but we're assuming it's appropriate to resize other images to 800px without the scale being off. I came up against this recently: I trained a model on LIDAR with 1024px images. When I tried to transfer to NeonTreeEval, it predicted garbage small trees everywhere until I realised it was scaling the 400px images up to 800px internally.
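The resizing rule described above can be sketched as follows. This mirrors the min/max-size logic of torchvision's `GeneralizedRCNNTransform` (defaults `min_size=800`, `max_size=1333`); DeepForest may configure it differently, so treat this as an illustration rather than its exact behaviour:

```python
def detection_resize(height, width, min_size=800, max_size=1333):
    """Sketch of the torchvision-style resize rule: scale so the short
    side reaches min_size, unless that would push the long side past
    max_size. Aspect ratio is always preserved."""
    scale = min_size / min(height, width)
    if max(height, width) * scale > max_size:
        scale = max_size / max(height, width)
    return round(height * scale), round(width * scale)

# A 400px NeonTreeEval tile is silently doubled to 800px...
print(detection_resize(400, 400))    # (800, 800)
# ...while a 1024px LIDAR training tile is shrunk to 800px.
print(detection_resize(1024, 1024))  # (800, 800)
```

This is why scale mismatch is easy to miss: the model never errors, it just resizes, and objects end up at a different apparent scale than they had at training time.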
For semantic segmentation/pixel-regression models, turning off auto-scaling lets you train on 1024px and predict on 2048px, or as large as your GPU will allow, in one go. This is theoretically true for convolutional models at least, and the main benefit is less stitching. I'm less sure what the behaviour is for RetinaNet/DETR if you input a huge image, but remember these models are benchmarked on datasets like COCO, where 100 objects is a lot.

Recommendations?
If possible, avoid having to manually figure out what the prediction size should be.
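The earlier point about convolutional models accepting larger inputs at prediction time is easy to demonstrate with a toy fully-convolutional model (a two-layer sketch, not a real segmentation head): with no fully-connected layers, the same weights run at any spatial size.

```python
import torch
import torch.nn as nn

# Toy fully-convolutional "segmentation" model: only conv layers,
# so any input spatial size works with the same weights.
fcn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=3, padding=1),
)

small = fcn(torch.rand(1, 3, 128, 128))
large = fcn(torch.rand(1, 3, 256, 256))
print(small.shape, large.shape)  # output spatial size tracks input size
```

Sizes are kept small here for speed; the same property is what lets you train at 1024px and predict at 2048px, GPU memory permitting.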
There are two factors that continue to bother me.

Understanding Variable Image Sizes in DeepForest: Training vs Evaluation
Training needs no `size` config; evaluation needs `validation.size` to resize images to a common size.

The Question
The Answer
Training with Variable-Sized Images
Training in DeepForest automatically handles images of different dimensions without requiring a `size` configuration parameter. This works because:

1. List-Based Batching
The training dataset's `collate_fn` returns images as a list of tensors (not a single stacked tensor), so each image can have different dimensions:

`[(3, 400, 600), (3, 500, 700), (3, 450, 650)]`

2. Model Support
The underlying detection models (RetinaNet, DeformableDETR) are designed to accept lists of variable-sized images:
Modern detection models from torchvision (like RetinaNet) and transformers (like DeformableDETR) natively support this pattern. The model processes each image independently and computes per-image losses.
Evaluation/Prediction with Same-Sized Images
Evaluation and prediction require all images to have the same dimensions. This is enforced by:
1. Tensor-Based Batching
The prediction dataset's `collate_fn` uses `default_collate`, which attempts to stack images into a single tensor. This requires all images to have identical dimensions to create a proper batch tensor.
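`default_collate` ultimately stacks the images along a new batch dimension; the constraint is the same as stacking NumPy arrays, so it can be illustrated without torch:

```python
import numpy as np

# Identical shapes stack into one batch tensor.
same = [np.zeros((3, 400, 400)), np.zeros((3, 400, 400))]
batch = np.stack(same)
print(batch.shape)  # (2, 3, 400, 400)

# Mismatched shapes cannot be stacked -- hence validation.size.
mixed = [np.zeros((3, 400, 400)), np.zeros((3, 500, 500))]
try:
    np.stack(mixed)
    stack_failed = False
except ValueError:
    stack_failed = True
print(stack_failed)  # True
```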
2. Why the Difference?
The different batching strategies exist for good reasons:
Training (List-based): tolerates variable image sizes, at the cost of processing images one at a time.
Evaluation/Prediction (Tensor-based): a single stacked tensor enables fast batched inference, but requires uniform image dimensions.
Configuration
For Training
No `train.size` parameter exists in the config.

For Evaluation/Prediction
The `validation.size` parameter must be set, then used in your code.
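The original code snippets were not recoverable from this page. As a rough, unverified sketch of the pattern being described (the exact config keys and access style may differ across DeepForest versions, so check your version's config schema):

```python
# Hypothetical usage sketch -- not verified against a specific release.
from deepforest import main

m = main.deepforest()
m.config["validation"]["size"] = 800  # resize all eval images to a common size
```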
Questions
Does this have model performance implications? It's faster to batch, so should we resize and batch in training as well? This seems like a clear tradeoff between speed and accuracy.
@jveitchmichaelis can you comment here and let's think about whether we need a change.