EmbeddingRetriever raises error for already tokenized input #2
Open
Description
When running `embedding_retriever = EmbeddingRetriever(bert_model, tokenizer, nlp, [ ["Il", "gatto", "beve"], ["Le", "gatte", "bevono"] ])` I get a `ValueError` (see the traceback below).
It seems `EmbeddingRetriever` passes each pre-tokenized sentence (a list) through `friendly_tokenizer.tokenize` to `self.nlp(...)`, but further down the line spaCy expects a string, `Doc`, or bytes...
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [71], line 1
----> 1 EmbeddingRetriever(bert_model, tokenizer, nlp, [ ["Il", "gatto", "beve"], ["Le", "gatte", "bevono"] ])

File /vol1/qlvlCode/NephoNeural/anthevec/anthevec/embedding_retriever.py:11, in EmbeddingRetriever.__init__(self, model, tokenizer, nlp, input_sentences, mask_special_tokens)
      9 self.friendly_tokenizer = FriendlyTokenizer(tokenizer, nlp)
     10 # Tokenise the input sentence
---> 11 tokenized_outputs = self.friendly_tokenizer.tokenize(input_sentences)
     12 #rint(tokenized_outputs)
     13 # Save the correspondence dict
     14 self.correspondence = tokenized_outputs["correspondence"]

File /vol1/qlvlCode/NephoNeural/anthevec/anthevec/friendly_tokenizer.py:48, in FriendlyTokenizer.tokenize(self, input_strings)
     45 if type(input_strings[sentence_id]) == list:
     46 self.nlp.tokenizer = PreTokenizer(self.nlp.vocab)
---> 48 doc = self.nlp(input_strings[sentence_id])
     50 # Flatten the sentence structure
     51 for sentence in doc.sents:

File ~/krisVenv/lib/python3.10/site-packages/spacy/language.py:1014, in Language.__call__(self, text, disable, component_cfg)
    993 def __call__(
    994 self,
    995 text: Union[str, Doc],
   (...)
    998 component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
    999 ) -> Doc:
   1000 """Apply the pipeline to some text. The text can span multiple sentences,
   1001 and can contain arbitrary whitespace. Alignment into the original string
   1002 is preserved.
   (...)
   1012 DOCS: https://spacy.io/api/language#call
   1013 """
-> 1014 doc = self._ensure_doc(text)
   1015 if component_cfg is None:
   1016 component_cfg = {}

File ~/krisVenv/lib/python3.10/site-packages/spacy/language.py:1108, in Language._ensure_doc(self, doc_like)
   1106 if isinstance(doc_like, bytes):
   1107 return Doc(self.vocab).from_bytes(doc_like)
-> 1108 raise ValueError(Errors.E1041.format(type=type(doc_like)))

ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'list'>
```
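
A possible workaround, sketched under assumptions rather than taken from the anthevec code: the traceback shows that `Language.__call__` accepts `Union[str, Doc]` and that `_ensure_doc` rejects the list before any custom tokenizer gets to see it. Building a `Doc` directly from the pre-tokenized words and passing that to `nlp` should therefore avoid E1041. The helper name `to_doc` is hypothetical, and the blank Italian pipeline with a `sentencizer` only stands in for the real `nlp` object so that `doc.sents` is available.

```python
import spacy
from spacy.tokens import Doc

# Stand-in pipeline (assumption): the real `nlp` would be a trained model.
nlp = spacy.blank("it")
nlp.add_pipe("sentencizer")  # sets sentence boundaries so doc.sents works


def to_doc(nlp, sentence):
    """Hypothetical helper: run `nlp` on a raw string or a list of tokens."""
    if isinstance(sentence, list):
        # Build the Doc from the given tokens, bypassing the tokenizer,
        # then let the remaining pipeline components process it.
        return nlp(Doc(nlp.vocab, words=sentence))
    return nlp(sentence)


sentences = [["Il", "gatto", "beve"], ["Le", "gatte", "bevono"]]
docs = [to_doc(nlp, s) for s in sentences]

# The original token boundaries are preserved.
tokens = [[token.text for token in doc] for doc in docs]
sentence_counts = [len(list(doc.sents)) for doc in docs]
```

If this approach holds up, the `if type(...) == list` branch in `FriendlyTokenizer.tokenize` could construct the `Doc` this way instead of swapping `self.nlp.tokenizer`, which would also avoid mutating the shared pipeline.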