- ferret version: 0.4.0
- Python version: 3.9.2
- Operating System: Linux Debian
Description
When comparing the feature attribution scores produced by ferret's Integrated Gradients (plain) explainer with those from the transformers_interpret library (MultiLabelClassificationExplainer), I get significantly different results. For example, a token may receive a high positive score of 0.5 with transformers_interpret but a negative attribution with ferret.
Why could that be?
Of course, I tested both transformers_interpret and ferret under the same conditions (the same pretrained local multi-label BertForSequenceClassification model, the bert-base-german-cased tokenizer, and the same sample).
What I Did
from transformers_interpret import MultiLabelClassificationExplainer
from ferret import Benchmark

# transformers_interpret: attributions for the predicted class
cls_explainer = MultiLabelClassificationExplainer(model, tokenizer, custom_labels=labels)
word_attrib = cls_explainer(<SAMPLE>)
pred = cls_explainer.predicted_class_name
print(word_attrib[pred])

# ferret: Integrated Gradients (plain) explanation for the same sample
bench = Benchmark(model, tokenizer)
score = bench.score(<SAMPLE>)
metr = bench.explain(<SAMPLE>, target=target)[4]  ### IG (plain) ###
print(metr.scores)
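One thing worth checking before comparing raw numbers: as far as I can tell, transformers_interpret normalizes its Layer Integrated Gradients attributions (dividing by the L2 norm of the score vector), and ferret may aggregate embedding-level attributions differently, so the two score vectors are not directly comparable in magnitude. Below is a minimal diagnostic sketch, assuming `word_attrib`, `pred`, and `metr` come from the snippet above; the names `ti_scores` and `ferret_scores` are mine, not part of either library's API:

```python
# Diagnostic sketch, not part of either library's API: put both attribution
# vectors on the same scale so sign/direction differences stand out.
# Assumes word_attrib, pred, metr come from the snippet above, and that the
# two tokenizations line up (if they don't, that alone explains mismatches).
import numpy as np

ti_scores = np.array([score for _, score in word_attrib[pred]])  # (token, score) pairs
ferret_scores = np.array(metr.scores)

# L2-normalize both vectors so only direction/sign differences remain
ti_unit = ti_scores / (np.linalg.norm(ti_scores) or 1.0)
ferret_unit = ferret_scores / (np.linalg.norm(ferret_scores) or 1.0)

if ti_unit.shape == ferret_unit.shape:
    print("cosine similarity:", float(ti_unit @ ferret_unit))
else:
    # Different token counts mean the explainers tokenized differently,
    # so per-token scores cannot be compared position by position.
    print("token counts differ:", ti_unit.shape, ferret_unit.shape)
```

If the token counts differ, or the target class passed to ferret is not the same class as `predicted_class_name`, the per-token scores would not be expected to match in the first place.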