Unable to align entity mentions with original texts on many samples #2

Hi @YoumiMa, I am using the following code to verify that every entity mention appears in the de-tokenized text:
```
import json

def get_json(path):
    # Read a JSON file (assumed implementation of the helper used below).
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Relation ID -> relation name mapping; loaded once instead of once per sample.
rel_name_dict = get_json("JacRED/meta/rel_info.json")

for nsplit in ['train', 'dev', 'test']:
    data = get_json("JacRED/{}.json".format(nsplit))
    list_samples = []
    for raw_sample in data:
        entities = raw_sample['vertexSet']
        # De-tokenize: join the tokens of each sentence, then join the sentences.
        sample_text = "".join("".join(x) for x in raw_sample['sents'])

        # Map each entity index to its distinct mention names.
        ent2str = dict()
        for ient, entity in enumerate(entities):
            ent2str[ient] = list(set(x['name'] for x in entity))

        rel_list = []
        for rel_ in raw_sample['labels']:
            h_list = ent2str[rel_['h']]
            t_list = ent2str[rel_['t']]
            r_str = rel_name_dict[rel_['r']]
            for h in h_list:
                for t in t_list:
                    assert h in sample_text and t in sample_text  # verify de-tokenization
                    rel_list.append({"head": h, "tail": t, "relation": r_str})

        sample = {"text": sample_text, "relation": rel_list}
        list_samples.append(sample)
```

It turns out that this check fails for 478/104/88 samples in the train/dev/test splits, respectively. Could you help me look into this?
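
To make the failures easier to inspect, here is a minimal sketch that lists the mention names missing from the joined text instead of stopping at the first assertion error. The function name `find_unaligned_mentions` and the toy sample are hypothetical and only for illustration; the sketch relies solely on the `sents`, `vertexSet`, and `name` fields used above. Unlike the snippet above, it checks every entity in `vertexSet`, not only those that take part in a relation.

```python
def find_unaligned_mentions(sample):
    """Return (entity_index, mention_name) pairs whose name is not a
    substring of the de-tokenized text of the sample."""
    # Same de-tokenization as above: join tokens, then join sentences.
    text = "".join("".join(sent) for sent in sample["sents"])
    missing = []
    for ient, entity in enumerate(sample["vertexSet"]):
        # Check each distinct mention name once, in a stable order.
        for name in sorted({mention["name"] for mention in entity}):
            if name not in text:
                missing.append((ient, name))
    return missing


# Toy sample in the same layout as the data above (not real JacRED data).
toy_sample = {
    "sents": [["東京", "は", "日本", "の", "首都", "。"]],
    "vertexSet": [
        [{"name": "東京"}],
        [{"name": "日本"}, {"name": "日本 国"}],
    ],
}

print(find_unaligned_mentions(toy_sample))
```

Running this over each split and collecting the non-empty results should show which mention strings differ from the joined tokens.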
