EmbeddingRetriever raises error for already tokenized input #2

@KrisHeylen

Description

When running embedding_retriever = EmbeddingRetriever(bert_model, tokenizer, nlp, [ ["Il", "gatto", "beve"], ["Le", "gatte", "bevono"] ]), I get a ValueError (see below).
It seems EmbeddingRetriever hands the input to friendly_tokenizer.tokenize, which passes each already tokenized sentence (a list) straight to self.nlp(...), but further down the line spaCy expects a string, Doc, or bytes.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [71], line 1
----> 1 EmbeddingRetriever(bert_model, tokenizer, nlp, [ ["Il", "gatto", "beve"], ["Le", "gatte", "bevono"] ])

File /vol1/qlvlCode/NephoNeural/anthevec/anthevec/embedding_retriever.py:11, in EmbeddingRetriever.__init__(self, model, tokenizer, nlp, input_sentences, mask_special_tokens)
      9 self.friendly_tokenizer = FriendlyTokenizer(tokenizer, nlp)
     10 # Tokenise the input sentence
---> 11 tokenized_outputs = self.friendly_tokenizer.tokenize(input_sentences)
     12 #rint(tokenized_outputs)
     13 # Save the correspondence dict
     14 self.correspondence = tokenized_outputs["correspondence"]

File /vol1/qlvlCode/NephoNeural/anthevec/anthevec/friendly_tokenizer.py:48, in FriendlyTokenizer.tokenize(self, input_strings)
     45 if type(input_strings[sentence_id]) == list:
     46     self.nlp.tokenizer = PreTokenizer(self.nlp.vocab)
---> 48 doc = self.nlp(input_strings[sentence_id])
     50 # Flatten the sentence structure
     51 for sentence in doc.sents:

File ~/krisVenv/lib/python3.10/site-packages/spacy/language.py:1014, in Language.__call__(self, text, disable, component_cfg)
    993 def __call__(
    994     self,
    995     text: Union[str, Doc],
   (...)
    998     component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
    999 ) -> Doc:
   1000     """Apply the pipeline to some text. The text can span multiple sentences,
   1001     and can contain arbitrary whitespace. Alignment into the original string
   1002     is preserved.
   (...)
   1012     DOCS: https://spacy.io/api/language#call
   1013     """
-> 1014     doc = self._ensure_doc(text)
   1015     if component_cfg is None:
   1016         component_cfg = {}

File ~/krisVenv/lib/python3.10/site-packages/spacy/language.py:1108, in Language._ensure_doc(self, doc_like)
   1106 if isinstance(doc_like, bytes):
   1107     return Doc(self.vocab).from_bytes(doc_like)
-> 1108 raise ValueError(Errors.E1041.format(type=type(doc_like)))

ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'list'>
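
A possible workaround, sketched under assumptions rather than taken from the anthevec code: since the traceback shows that `Language.__call__` accepts `Union[str, Doc]`, a token list can be wrapped in a `Doc` before it reaches `self.nlp(...)`. The helper name `to_pipeline_input` is hypothetical, and `spacy.blank("it")` is only a stand-in for whatever `nlp` object is actually passed to `EmbeddingRetriever`.

```python
import spacy
from spacy.tokens import Doc


def to_pipeline_input(nlp, sentence):
    """Hypothetical helper: return something that nlp() accepts.

    Plain strings pass through unchanged; an already tokenized sentence
    (a list of strings) is wrapped in a Doc so that spaCy keeps the given
    token boundaries instead of raising E1041.
    """
    if isinstance(sentence, list):
        return Doc(nlp.vocab, words=sentence)
    return sentence


# Assumption: a blank Italian pipeline stands in for the real `nlp` object.
nlp = spacy.blank("it")

sentences = [["Il", "gatto", "beve"], ["Le", "gatte", "bevono"]]
docs = [nlp(to_pipeline_input(nlp, sentence)) for sentence in sentences]

# Token texts of each resulting Doc, for comparison with the input lists.
tokens_per_doc = [[token.text for token in doc] for doc in docs]
```

If this approach fits, the equivalent change inside FriendlyTokenizer.tokenize would presumably be to build the Doc at the point where the list check happens, instead of swapping self.nlp.tokenizer and passing the list through.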
