
Opposing results when comparing to transformers_interpret #21

@racia

  • ferret version: 0.4.0
  • Python version: 3.9.2
  • Operating System: Linux Debian

Description

When comparing the feature attribution scores of the Integrated Gradients (plain) explanation from ferret with those from the transformers_interpret library (MultiLabelClassificationExplainer), I get significantly different results. For example, a token may receive a high score of 0.5 with transformers_interpret but a negative attribution with ferret.
Why could that be?
Of course, I tested this under the same conditions for both transformers_interpret and ferret (e.g., the same pretrained local multi-label BertForSequenceClassification model, the bert-base-german-cased tokenizer, and the same sample).
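
For completeness, both snippets below assume a shared setup along these lines (the checkpoint path, label names, and target index are placeholders for my local model):

from transformers import AutoTokenizer, BertForSequenceClassification

# Placeholders standing in for my local multi-label checkpoint and its label set
model = BertForSequenceClassification.from_pretrained("path/to/local/checkpoint")
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
labels = ["label_0", "label_1", "label_2"]  # label names in the model's output order
target = 0  # index of the label to explain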

What I Did

  • transformers_interpret:
from transformers_interpret import MultiLabelClassificationExplainer

cls_explainer = MultiLabelClassificationExplainer(model, tokenizer, custom_labels=labels)
word_attrib = cls_explainer(<SAMPLE>)  # per-label word attributions
pred = cls_explainer.predicted_class_name
print(word_attrib[pred])  # (token, score) pairs for the predicted label
  • ferret:
from ferret import Benchmark

bench = Benchmark(model, tokenizer)
score = bench.score(<SAMPLE>)  # prediction scores
expl = bench.explain(<SAMPLE>, target=target)[4]  # index 4 = Integrated Gradients (plain)
print(expl.scores)
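
To compare the two outputs side by side, I roughly align them on token text (a sketch only; the two libraries tokenize and aggregate word attributions slightly differently, so the alignment is approximate):

ti_scores = dict(word_attrib[pred])  # (token, score) pairs from transformers_interpret
for token, ferret_score in zip(expl.tokens, expl.scores):
    ti_score = ti_scores.get(token)  # None if the token strings don't match exactly
    print(f"{token:>15}  transformers_interpret={ti_score}  ferret={ferret_score:+.4f}")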

Labels

question (Further information is requested)
