Description
Hi, I have been trying out different things with EfficientSAM3, and most of the time it works great. Thanks for the great work.
However, I have noticed that when using the distilled image encoder + text encoder, there are sometimes very low confidence-score masks even for simple prompts and images.
Using the sam3/efficientsam3_examples/efficientsam3_image_predictor_example.ipynb notebook with the provided example image and prompt ("a shoe"), the results are quite good.
But with a different image of a dog and the text prompt "dog", I had to lower the confidence threshold by a lot to get any meaningful masks.
from PIL import Image

dog_image = "dog6/01.jpg"
image = Image.open(dog_image)
width, height = image.size

# Confidence threshold lowered well below the default to get any masks at all
processor = Sam3Processor(model, confidence_threshold=0.02)
inference_state = processor.set_image(image)
processor.reset_all_prompts(inference_state)
inference_state = processor.set_text_prompt(state=inference_state, prompt="dog")

plot_results(image, inference_state)
After taking a closer look, I see that this is mostly due to presence_logit_dec being too low, even for simple cases like this one.
Prompt: 'dog'
Presence score (single value): 0.0320
Top-5 classification probs: [0.7695, 0.4902, 0.3438, 0.332, 0.2734]
Top-5 final probs (class × presence): [0.0247, 0.0156, 0.011, 0.0106, 0.0087]
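For clarity, here is how the numbers above relate, as I understand the scoring: the presence logit goes through a sigmoid, and the resulting presence probability multiplicatively down-weights every per-query classification probability. This is a minimal sketch; the logit value of -3.41 is back-computed from the reported presence score of 0.0320, not read from the model, and the actual internals of Sam3Processor may differ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logit back-computed from the reported presence score of 0.0320
presence_logit_dec = -3.41
class_probs = [0.7695, 0.4902, 0.3438, 0.3320, 0.2734]  # top-5 from the run above

presence_prob = sigmoid(presence_logit_dec)             # ~0.032
final_probs = [p * presence_prob for p in class_probs]  # ~[0.0246, 0.0157, ...]
```

So even a strong 0.77 classification probability ends up near the 0.02 threshold once multiplied by the presence score.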
Questions/Discussion:
- Is this expected, and is it an effect of the distillation itself or of the dataset used for distillation?
- Perhaps EfficientSAM3 needs a less aggressive function applied to presence_logit_dec. Currently it is a sigmoid, which pushes the presence probability toward either extreme. Have you already considered or run experiments around this?
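To make the second point concrete, one option would be a temperature-scaled sigmoid. This is purely an illustrative sketch of the idea, not something the repo implements; the temperature value 3.0 is an arbitrary example, and the logit -3.41 is the back-computed value from my run above.

```python
import math

def tempered_sigmoid(x, temperature=1.0):
    # Temperature > 1 flattens the curve, keeping the output
    # away from the extremes 0 and 1.
    return 1.0 / (1.0 + math.exp(-x / temperature))

logit = -3.41                              # presence logit from the example above
standard = tempered_sigmoid(logit)         # ~0.032, strongly suppresses all masks
softened = tempered_sigmoid(logit, 3.0)    # ~0.243, a much milder down-weighting
```

With a softer mapping like this, a mildly negative presence logit would dampen the final scores without pushing them below any reasonable confidence threshold.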