Unable to align entity mentions with original texts on many samples #2

@thangld201

Hi @YoumiMa, I am using the following code to verify that entity mentions are present in the text:

    import json

    def get_json(path):
        # read json file (assumed implementation of the helper)
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    # relation id -> relation info; loaded once rather than per sample
    rel_name_dict = get_json("JacRED/meta/rel_info.json")

    nsplits = ['train', 'dev', 'test']
    for nsplit in nsplits:
        root_path = "JacRED/{}.json".format(nsplit)
        data = get_json(root_path)
        list_samples = []
        for raw_sample in data:
            entities = raw_sample['vertexSet']
            # de-tokenize: join tokens within each sentence, then join sentences
            sample_text = "".join(["".join(x) for x in raw_sample['sents']])

            # entity index -> unique mention strings
            ent2str = dict()
            for ient, entity in enumerate(entities):
                ent2str[ient] = list(set(x['name'] for x in entity))

            rel_list = []
            for rel_ in raw_sample['labels']:
                h_list = ent2str[rel_['h']]
                t_list = ent2str[rel_['t']]
                r_str = rel_name_dict[rel_['r']]
                for h in h_list:
                    for t in t_list:
                        assert h in sample_text and t in sample_text # verify de-tokenization
                        rel_json = {"head": h, "tail": t, "relation": r_str}
                        rel_list.append(rel_json)

            sample = {"text": sample_text, "relation": rel_list}
            list_samples.append(sample)

It turns out that the assertion fails for 478/104/88 samples on the train/dev/test splits, respectively. Could you help me check?
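
To narrow down which mentions are affected, something like the sketch below could replace the assertion. It assumes that each mention in `vertexSet` also carries DocRED-style `sent_id` and `pos` fields (token offsets, end-exclusive); that is an assumption about the data layout, and `find_unaligned_mentions` is a hypothetical helper name, not part of the repository.

```python
def find_unaligned_mentions(sample):
    """Collect mentions whose `name` is not a substring of the joined text.

    Returns a list of (entity_index, name, span_text) tuples, where
    span_text is the string rebuilt from the token span, or None if the
    mention has no `sent_id`/`pos` fields (assumed DocRED-style layout).
    """
    # Same de-tokenization as in the snippet above.
    text = "".join("".join(sent) for sent in sample["sents"])
    misses = []
    for ient, entity in enumerate(sample["vertexSet"]):
        for mention in entity:
            name = mention["name"]
            if name in text:
                continue
            span_text = None
            if "sent_id" in mention and "pos" in mention:
                start, end = mention["pos"]
                tokens = sample["sents"][mention["sent_id"]][start:end]
                span_text = "".join(tokens)
            misses.append((ient, name, span_text))
    return misses
```

Comparing `name` with the rebuilt `span_text` for each miss should show whether the difference comes from whitespace, character normalization, or something else.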
