@@ -21,10 +22,14 @@ class General(MixingConfig, BaseModel):
     window_size: int = 300
     """Max acceptable distance between entities (in characters). Take care when using this, as it can produce sentences of more than 512 tokens (the limit imposed by the tokenizer)."""

-    mct_export_max_non_rel_sample_size: int = 200
+    limit_samples_per_class: int = -1
+    """Number of samples per class. This limit is applied to train samples, so if there are 100 train samples then test would have 20."""
+    addl_rels_max_sample_size: int = 200
     """Limit the number of 'Other' samples selected for training/test. This is applied per encountered medcat project, sample_size/num_projects."""
-    mct_export_create_addl_rels: bool = False
-    """When processing relations from a MedCAT export, relations labeled as 'Other' are created from all the annotation pairs available."""
+    create_addl_rels: bool = False
+    """When processing relations from a MedCAT export/docs, relations labeled as 'Other' are created from all the annotation pairs available."""
+    create_addl_rels_by_type: bool = False
+    """When creating the 'Other' relation class, split this class into subclasses based on concept types."""

     tokenizer_name: str = "bert"
     """The name of the tokenizer used.
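To make the renamed fields above concrete, here is a minimal, self-contained sketch of the post-change config. `GeneralSketch` is a hypothetical stand-in: the real class derives from `MixingConfig` and pydantic's `BaseModel`, which are replaced here with a plain dataclass so the example runs on its own, and the interpretation of `-1` as "no limit" is an assumption.

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-in for the General config in the diff above;
# the real class mixes in MixingConfig and pydantic's BaseModel.
@dataclass
class GeneralSketch:
    window_size: int = 300              # max entity distance, in characters
    limit_samples_per_class: int = -1   # -1 assumed to mean "no per-class limit"
    addl_rels_max_sample_size: int = 200
    create_addl_rels: bool = False
    create_addl_rels_by_type: bool = False
    tokenizer_name: str = "bert"

# Overriding a single field, as one might when capping per-class sample counts:
cfg = GeneralSketch(limit_samples_per_class=100)
```

With pydantic in place of the dataclass, the same defaults and overrides apply, with type validation added on assignment.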
@@ -46,21 +51,47 @@ class General(MixingConfig, BaseModel):
     """Tokenizer.

     NB! For these changes to take effect, the pipe would need to be recreated."""
     """If a foreign non-MCAT trainer dataset is used, you can insert your own Rel entity token delimiters into the tokenizer, \
-    copy those token IDs here, and also resize your tokenizer embeddings and adjust the hidden_size of the model, this will depend on the number of tokens you introduce"""
-    labels2idx: Dict = {}
-    idx2labels: Dict = {}
+    copy those token IDs here, and also resize your tokenizer embeddings and adjust the hidden_size of the model; this will depend on the number of tokens you introduce.
+    Please note that the tokenizer special tokens are supposed to come in pairs, for example [s1] and [e1], [s2] and [e2]; the [BLANK] token is just an example placeholder.
+    If you have more than four tokens here then you need to make sure they are present in the text,
+    otherwise the pipeline will throw an error in the get_annotation_schema_tag() function.
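The docstring above warns that delimiter tokens must occur in start/end pairs and must actually appear in the text. The following is an illustrative sketch of such a check; `check_schema_tokens` is a hypothetical helper written for this example, not the library's `get_annotation_schema_tag()`, whose actual error behavior may differ.

```python
from typing import List, Tuple

def check_schema_tokens(text: str, token_pairs: List[Tuple[str, str]]) -> None:
    """Raise ValueError if any start/end delimiter pair is unpaired or out of order."""
    for start_tok, end_tok in token_pairs:
        # A start token without its matching end token (or vice versa) is an error.
        if (start_tok in text) != (end_tok in text):
            raise ValueError(f"unpaired schema token: {start_tok} / {end_tok}")
        # The end delimiter must come after the start delimiter.
        if start_tok in text and text.index(end_tok) < text.index(start_tok):
            raise ValueError(f"{end_tok} appears before {start_tok}")

# Both pairs are present and well-ordered, so this passes silently:
check_schema_tokens(
    "The [s1]aspirin[e1] relieved the [s2]headache[e2].",
    [("[s1]", "[e1]"), ("[s2]", "[e2]")],
)
```

Running a check like this on input text before tokenization would surface missing delimiters early, rather than failing later inside the pipeline.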