medcat.config.config

Attributes:

C module-attribute

C = TypeVar('C', bound=Iterable)

T module-attribute

T = TypeVar('T')

logger module-attribute

logger = getLogger(__name__)

AnnotationOutput

Bases: SerialisableBaseModel

The annotation output part of the config

Attributes:

context_left class-attribute instance-attribute

context_left: int = -1

context_right class-attribute instance-attribute

context_right: int = -1

include_text_in_output class-attribute instance-attribute

include_text_in_output: bool = False

lowercase_context class-attribute instance-attribute

lowercase_context: bool = True

CDBMaker

Bases: SerialisableBaseModel

The Context Database (CDB) making part of the config

Attributes:

min_letters_required class-attribute instance-attribute

min_letters_required: int = 2

Minimum number of letters required in a name to be accepted for a concept

multi_separator class-attribute instance-attribute

multi_separator: str = '|'

If multiple names or type_ids for a concept present in one row of a CSV, they are separated by the specified character.
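As a rough illustration of splitting a multi-valued CSV cell on `multi_separator`, here is a minimal sketch. The helper `split_multi` is hypothetical and not part of MedCAT; only the default separator `'|'` comes from the config above.

```python
def split_multi(cell: str, multi_separator: str = "|") -> list[str]:
    """Split one CSV cell into individual values (hypothetical helper)."""
    # Strip surrounding whitespace and drop empty parts left by stray separators.
    return [part.strip() for part in cell.split(multi_separator) if part.strip()]


names = split_multi("heart attack|myocardial infarction| MI")
print(names)
```

A different separator can be passed explicitly, e.g. `split_multi("a;b", ";")`.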

name_versions class-attribute instance-attribute

name_versions: list = ['LOWER', 'CLEAN']

Name versions to be generated.

remove_parenthesis class-attribute instance-attribute

remove_parenthesis: int = 5

Whether preferred names containing parentheses should be cleaned. 0 means no; otherwise a name is cleaned if it is longer than or equal to this value, e.g. Head (Body part) -> Head

ComponentConfig

Bases: DirtiableBaseModel

Attributes:

comp_name class-attribute instance-attribute

comp_name: str = 'default'

The name of the component.

If a custom implementation is required, it needs to be registered using `medcat.components.types.register_core_component`. By default, only the 'default' component is registered.

Components

Bases: SerialisableBaseModel

Attributes:

addons class-attribute instance-attribute

addons: list[ComponentConfig] = []

comp_order class-attribute instance-attribute

comp_order: list[str] = ['tagging', 'token_normalizing', 'ner', 'linking']

linking class-attribute instance-attribute

linking: Linking = Linking()

ner class-attribute instance-attribute

ner: Ner = Ner()

tagging class-attribute instance-attribute

token_normalizing class-attribute instance-attribute

token_normalizing: ComponentConfig = ComponentConfig()

Config

Bases: SerialisableBaseModel

Attributes:

annotation_output class-attribute instance-attribute

annotation_output: AnnotationOutput = AnnotationOutput()

cdb_maker class-attribute instance-attribute

cdb_maker: CDBMaker = CDBMaker()

components class-attribute instance-attribute

components: Components = Components()

general class-attribute instance-attribute

general: General = General()

meta class-attribute instance-attribute

meta: ModelMeta = Field(default_factory=ModelMeta)

preprocessing class-attribute instance-attribute

preprocessing: Preprocessing = Preprocessing()

DirtiableBaseModel

Bases: SerialisableBaseModel

Methods:

Attributes:

is_dirty property

is_dirty: bool

mark_clean

mark_clean()
Source code in medcat-v2/medcat/config/config.py
def mark_clean(self):
    self._is_dirty = False
    for part in self.__dict__.values():
        if isinstance(part, PotentiallyDirty):
            part.mark_clean()
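
The recursive clean-up above can be illustrated with a small standalone sketch. `DirtyNode` and its `touch` method are hypothetical stand-ins written for this example (the page does not show how MedCAT sets the dirty flag); only the `mark_clean` recursion mirrors the source shown.

```python
class DirtyNode:
    """Hypothetical stand-in for a config part that tracks modification."""

    def __init__(self, **parts):
        self._is_dirty = False
        # Store child parts as plain instance attributes.
        self.__dict__.update(parts)

    @property
    def is_dirty(self) -> bool:
        return self._is_dirty

    def touch(self) -> None:
        # Assumption: something marks the node dirty when it changes.
        self._is_dirty = True

    def mark_clean(self) -> None:
        # Reset this node, then recurse into any child that is also a DirtyNode.
        self._is_dirty = False
        for part in self.__dict__.values():
            if isinstance(part, DirtyNode):
                part.mark_clean()


child = DirtyNode()
parent = DirtyNode(child=child)
child.touch()
parent.touch()
parent.mark_clean()
print(parent.is_dirty, child.is_dirty)
```

Calling `mark_clean` on the parent resets the flag on the whole tree, so a caller only needs to hold the top-level object.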

General

Bases: SerialisableBaseModel

The general part of the config

Attributes:

diacritics class-attribute instance-attribute

diacritics: bool = False

Should we process diacritics - for languages other than English, symbols such as 'é, ë, ö' can be relevant. Note that this makes spell_check slower.

full_unlink class-attribute instance-attribute

full_unlink: bool = False

When unlinking a name from a concept, should we do a full unlink (i.e. unlink the name from all concepts, not just the one in question).

log_format class-attribute instance-attribute

log_format: str = '%(levelname)s:%(name)s: %(message)s'

log_level class-attribute instance-attribute

log_level: int = INFO

Logging config for everything | 'tagger' can be disabled, but will cause a drop in performance

log_path class-attribute instance-attribute

log_path: str = './medcat.log'

make_pretty_labels class-attribute instance-attribute

make_pretty_labels: Optional[str] = None

Should the labels of entities (shown in displacy) be pretty or just 'concept'. This slows down the annotation pipeline and should not be used when annotating millions of documents. If None, the label will be the string "concept"; if short, it will be the CUI; if long, it will be CUI | Name | Confidence.

map_cui_to_group class-attribute instance-attribute

map_cui_to_group: bool = False

If the cdb.addl_info['cui2group'] is provided and this option enabled, each CUI will be mapped to the group

map_to_other_ontologies class-attribute instance-attribute

map_to_other_ontologies: Union[Literal['auto'], list[str]] = 'auto'

Which other ontologies to map to if possible.

This will force medcat to include mapping for other ontologies in its outputs. It will use the mappings in cdb.addl_info["cui2<ont>"] if they are present.

If set to "auto" (or missing), the value will be inferred from available data at first init time. That is to say, it'll map to all ontologies available.

NB! This will only work if the cdb.addl_info["cui2<ont>"] exists. Otherwise, no mapping will be done.

model_config class-attribute instance-attribute

model_config = ConfigDict(extra='allow')

nlp class-attribute instance-attribute

separator class-attribute instance-attribute

separator: str = '~'

Separator that will be used to merge tokens of a name. Once a CDB is built this should always stay the same.

show_nested_entities class-attribute instance-attribute

show_nested_entities: bool = False

If set to True functions like get_entities and get_json will return nested_entities and overlaps

spell_check class-attribute instance-attribute

spell_check: bool = True

Should we check spelling - note that this makes things much slower, use only if necessary. The only thing necessary for the spell checker to work is vocab.dat and cdb.dat built with concepts in the respective language.

spell_check_deep class-attribute instance-attribute

spell_check_deep: bool = False

If True, the spell checker will try harder to find mistakes. This can slow things down drastically.

spell_check_len_limit class-attribute instance-attribute

spell_check_len_limit: int = 7

Spelling will not be checked for words with length less than this

usage_monitor class-attribute instance-attribute

usage_monitor: UsageMonitor = UsageMonitor()

Checkpointing config

workers class-attribute instance-attribute

workers: int = workers()

Number of workers used by a parallelizable pipeline component

IncorrectConfigValues

IncorrectConfigValues(cls: Type, attr_name: str, exp_type: Type, got: Any)

Bases: ValueError

Source code in medcat-v2/medcat/config/config.py
def __init__(self, cls: Type, attr_name: str,
             exp_type: Type, got: Any):
    super().__init__(f"Incorrect attribute set for {cls}.{attr_name}. "
                     f"Expected {exp_type}, but got {type(got)}: {got}")

Linking

Bases: ComponentConfig

The linking part of the config

Attributes:

additional class-attribute instance-attribute

additional: Optional[Any] = None

Some additional config for non-default linkers. E.g the 2-step linker uses this for alpha calculations and learning rate for type contexts.

always_calculate_similarity class-attribute instance-attribute

always_calculate_similarity: bool = False

Do we want to calculate context similarity even for concepts that are not ambiguous.

calculate_dynamic_threshold class-attribute instance-attribute

calculate_dynamic_threshold: bool = False

Concepts below this similarity will be ignored. Type can be static/dynamic - if dynamic each CUI has a different TH and it is calculated as the average confidence for that CUI * similarity_threshold. Take care that dynamic works only if the cdb was trained with calculate_dynamic_threshold = True.
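
To make the static/dynamic distinction concrete, here is a minimal sketch of how the effective threshold could be derived from the description above. The function name `effective_threshold` and the parameter `avg_cui_confidence` are hypothetical; only the formula (average confidence for the CUI multiplied by `similarity_threshold`) comes from the text.

```python
def effective_threshold(similarity_threshold: float,
                        threshold_type: str,
                        avg_cui_confidence: float) -> float:
    """Return the similarity cut-off for one CUI (hypothetical helper)."""
    if threshold_type == "static":
        # The same threshold applies to every CUI.
        return similarity_threshold
    if threshold_type == "dynamic":
        # Per-CUI threshold: average confidence for that CUI * similarity_threshold.
        return avg_cui_confidence * similarity_threshold
    raise ValueError(f"Unknown threshold type: {threshold_type!r}")


static_th = effective_threshold(0.25, "static", avg_cui_confidence=0.8)
dynamic_th = effective_threshold(0.25, "dynamic", avg_cui_confidence=0.8)
print(static_th, dynamic_th)
```

With the default `similarity_threshold` of 0.25, a CUI with an average confidence of 0.8 would get a dynamic threshold of roughly 0.2 under this reading.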

context_ignore_center_tokens class-attribute instance-attribute

context_ignore_center_tokens: bool = False

If true, the words making up a concept are not taken into account when the context (embedding) of that concept is calculated

context_vector_sizes class-attribute instance-attribute

context_vector_sizes: dict = {'xlong': 27, 'long': 18, 'medium': 9, 'short': 3}

Context vector sizes that will be calculated and used for linking

context_vector_weights class-attribute instance-attribute

context_vector_weights: dict = {'xlong': 0.1, 'long': 0.4, 'medium': 0.4, 'short': 0.1}

Weight of each vector in the similarity score - make trainable at some point. Should add up to 1.
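
As an illustration of what "weight of each vector in the similarity score" could mean in practice, the sketch below combines per-context similarities as a weighted sum. This is an assumption about the combination step, not MedCAT's confirmed implementation; `weighted_similarity` and the sample similarity values are made up for the example.

```python
import math

# Default weights as documented above.
context_vector_weights = {"xlong": 0.1, "long": 0.4, "medium": 0.4, "short": 0.1}


def weighted_similarity(similarities: dict[str, float],
                        weights: dict[str, float]) -> float:
    """Combine per-context similarities into one score (hypothetical helper)."""
    # The documentation states the weights should add up to 1.
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("context vector weights should add up to 1")
    return sum(weights[name] * similarities[name] for name in weights)


# Illustrative similarity values, one per context size.
sample = {"xlong": 0.2, "long": 0.6, "medium": 0.8, "short": 0.4}
score = weighted_similarity(sample, context_vector_weights)
print(round(score, 2))
```

The check on the weight sum reflects the "should add up to 1" note; a config with weights that do not sum to 1 would shift the scale of every score.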

devalue_linked_concepts class-attribute instance-attribute

devalue_linked_concepts: bool = False

When adding a positive example, should it also be treated as Negative for concepts which link to the positive one via names (ambiguous names).

disamb_length_limit class-attribute instance-attribute

disamb_length_limit: int = 3

All concepts below this will always be disambiguated

filter_before_disamb class-attribute instance-attribute

filter_before_disamb: bool = False

If True it will filter before doing disamb. Useful for the trainer.

filters class-attribute instance-attribute

Filters

model_config class-attribute instance-attribute

model_config = ConfigDict(extra='allow')

negative_ignore_punct_and_num class-attribute instance-attribute

negative_ignore_punct_and_num: bool = True

Do we ignore punct/num when negative sampling

negative_probability class-attribute instance-attribute

negative_probability: float = 0.5

Probability for the negative context to be added for each positive addition

optim class-attribute instance-attribute

optim: dict = {'type': 'linear', 'base_lr': 1, 'min_lr': 5e-05}

Linear anneal

prefer_frequent_concepts class-attribute instance-attribute

prefer_frequent_concepts: float = 0.35

If >0, concepts that are more frequent will be preferred by a multiple of this amount

prefer_primary_name class-attribute instance-attribute

prefer_primary_name: float = 0.35

If >0 concepts for which a detection is its primary name will be preferred by that amount (0 to 1)

random_replacement_unsupervised class-attribute instance-attribute

random_replacement_unsupervised: float = 0.8

If <1, during unsupervised training the detected term will be randomly replaced with a probability of 1 - random_replacement_unsupervised. It is replaced with a synonym used for that term.

similarity_threshold class-attribute instance-attribute

similarity_threshold: float = 0.25

similarity_threshold_type class-attribute instance-attribute

similarity_threshold_type: str = 'static'

subsample_after class-attribute instance-attribute

subsample_after: int = 30000

DISABLED in code permanently: Subsample during unsupervised training if a concept has received more than

train class-attribute instance-attribute

train: bool = True

Whether it should train or not. This is set automatically; ignore it in 99% of cases and do not set it manually.

train_count_threshold class-attribute instance-attribute

train_count_threshold: int = 1

Concepts that have seen fewer training examples than this will not be used for similarity calculation and will have a similarity of -1.

LinkingFilters

LinkingFilters(**data)

Bases: SerialisableBaseModel

These describe the linking filters used alongside the model.

When no CUIs nor excluded CUIs are specified (the sets are empty), all CUIs are accepted. If there are CUIs specified then only those will be accepted. If there are excluded CUIs specified, they are excluded.

In some cases, there are extra filters as well as MedCATtrainer (MCT) export filters. These are expected to follow the following: extra_cui_filter ⊆ MCT filter ⊆ Model/config filter

While any other CUIs can be included in the extra CUI filter or the MCT filter, they would not have any real effect.

Methods:

Attributes:

Source code in medcat-v2/medcat/config/config.py
def __init__(self, **data):
    if 'cuis' in data:
        cuis = data['cuis']
        if isinstance(cuis, dict) and len(cuis) == 0:
            logger.warning("Loading an old model where "
                           "config.linking.filters.cuis has been "
                           "dict to an empty dict instead of an empty "
                           "set. Converting the dict to a set in memory "
                           "as that is what is expected. Please consider "
                           "saving the model again.")
            data['cuis'] = set(cuis.keys())
    super().__init__(**data)

cuis class-attribute instance-attribute

cuis: set[str] = set()

cuis_exclude class-attribute instance-attribute

cuis_exclude: set[str] = set()

check_filters

check_filters(cui: str) -> bool

Checks whether a CUI is in the filters

Parameters:

  • cui

    (str) –

    The CUI in question

Returns:

  • bool ( bool ) –

    True if the CUI is allowed

Source code in medcat-v2/medcat/config/config.py
def check_filters(self, cui: str) -> bool:
    """Checks is a CUI in the filters

    Args:
        cui (str): The CUI in question

    Returns:
        bool: True if the CUI is allowed
    """
    if cui in self.cuis or not self.cuis:
        return cui not in self.cuis_exclude
    else:
        return False
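
The filter logic above can be exercised in isolation. The sketch below reimplements the same condition as a free function over plain sets so it runs without MedCAT installed; the function signature and the CUI strings are illustrative only.

```python
def check_filters(cui: str, cuis: set[str], cuis_exclude: set[str]) -> bool:
    """Standalone copy of the inclusion/exclusion rule shown above."""
    # An empty include set accepts every CUI; otherwise the CUI must be listed.
    if cui in cuis or not cuis:
        # Exclusions always win, even over an explicit include.
        return cui not in cuis_exclude
    return False


print(check_filters("C001", set(), set()))        # no filters at all
print(check_filters("C001", {"C002"}, set()))     # include set without C001
print(check_filters("C001", {"C001"}, {"C001"}))  # included and excluded
```

Note that a CUI present in both sets is rejected, since the exclusion check runs after the inclusion check passes.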

ModelMeta

Bases: SerialisableBaseModel

Methods:

Attributes:

description class-attribute instance-attribute

description: str = 'N/A'

hash class-attribute instance-attribute

hash: str = ''

history class-attribute instance-attribute

history: list[str] = Field(default_factory=list)

last_saved class-attribute instance-attribute

last_saved: datetime = Field(default_factory=now)

location class-attribute instance-attribute

location: str = 'N/A'

medcat_version class-attribute instance-attribute

medcat_version: str = ''

ontology class-attribute instance-attribute

ontology: list[str] = []

saved_environ class-attribute instance-attribute

saved_environ: Environment = Field(default_factory=get_environment_info)

sup_trained class-attribute instance-attribute

sup_trained: list[TrainingDescriptor] = []

unsup_trained class-attribute instance-attribute

unsup_trained: list[TrainingDescriptor] = []

add_sup_training

add_sup_training(start_time: datetime, num_docs: int, project_name: str) -> None

Add supervised training information based on data.

This will mark down the time taken for training by comparing the start time to the current time.

This will be called for every project being trained separately. So if there's an MCT export being trained with multiple projects, multiple different training instances will be recorded.

Parameters:

  • start_time

    (datetime) –

    The time at which the training was started.

  • num_docs

    (int) –

    The number of documents that were trained.

  • project_name

    (str) –

    The project name.

Source code in medcat-v2/medcat/config/config.py
def add_sup_training(self, start_time: datetime, num_docs: int,
                     project_name: str) -> None:
    """Add supervised training information based on data.

    NOTE: This will mark down the time taken for training by comparing
          the start time to the current time.

    NOTE: This will be called for every project being trained separately.
          So if there's a MCT export being trained with multiple projects,
          multiple different training instances will be recorded.

    Args:
        start_time (datetime): The time at which the training was started.
        num_docs (int): The number of documents that were trained.
        project_name (str): The project name.
    """
    self.sup_trained.append(TrainingDescriptor(
        train_time_start=start_time, train_time_end=datetime.now(),
        project_name=project_name, num_docs=num_docs, num_epochs=1
    ))

add_unsup_training

add_unsup_training(start_time: datetime, num_docs: int, num_epochs: int = 1, project_name: str = 'N/A')

Add unsupervised training information based on data.

This will mark down the time taken for training by comparing the start time to the current time.

Parameters:

  • start_time

    (datetime) –

    The time at which the training was started.

  • num_docs

    (int) –

    The number of documents trained.

  • num_epochs

    (int, default: 1 ) –

    The number of epochs. Defaults to 1.

  • project_name

    (str, default: 'N/A' ) –

    The project name. Defaults to 'N/A'.

Source code in medcat-v2/medcat/config/config.py
def add_unsup_training(self, start_time: datetime, num_docs: int,
                       num_epochs: int = 1, project_name: str = 'N/A'):
    """Add unsupervised training information based on data.

    NOTE: This will mark down the time taken for training by comparing
          the start time to the current time.

    Args:
        start_time (datetime): The time at which the training was started.
        num_docs (int): The number of documents trained.
        num_epochs (int, optional): The number of epochs. Defaults to 1.
        project_name (str, optional): The project name. Defaults to 'N/A'.
    """
    self.unsup_trained.append(TrainingDescriptor(
        train_time_start=start_time, train_time_end=datetime.now(),
        project_name=project_name, num_docs=num_docs,
        num_epochs=num_epochs))

mark_saved_now

mark_saved_now()
Source code in medcat-v2/medcat/config/config.py
def mark_saved_now(self):
    self.last_saved = datetime.now()
    self.saved_environ = get_environment_info()
    self.medcat_version = medcat_version

prepare_and_report_training

prepare_and_report_training(data_iterator: C, num_epochs: int, supervised: bool = False, project_name: str = 'N/A') -> Iterator[C]

Context manager for preparing training.

This is used so that we can get the number of items in the data during training.

Parameters:

  • data_iterator

    (C) –

    The data to be trained.

  • num_epochs

    (int) –

    The number of epochs to be used.

  • supervised

    (bool, default: False ) –

    Whether training is supervised. Defaults to False.

  • project_name

    (str, default: 'N/A' ) –

    The project name. Defaults to 'N/A'.

Yields:

  • C

    Iterator[C]: The same data that was input.

Source code in medcat-v2/medcat/config/config.py
@contextmanager
def prepare_and_report_training(self,
                                data_iterator: C,
                                num_epochs: int,
                                supervised: bool = False,
                                project_name: str = 'N/A'
                                ) -> Iterator[C]:
    """Context manager for preparing training.

    This is used so that we can get the number of items in the data
    during training.

    Args:
        data_iterator (C): The data to be trained.
        num_epochs (int): The number of epochs to be used.
        supervised (bool, optional): Whether training is supervised.
            Defaults to False.
        project_name (str, optional): The project name. Defaults to 'N/A'.

    Yields:
        Iterator[C]: The same data that was input.
    """
    _names, _counts = [], [0]  # NOTE: 0 count for fallback

    def callback(name: str, count: int) -> None:
        _names.append(name)
        _counts.append(count)
    wrapped = callback_iterator(f"TRAIN-{id(data_iterator)}",
                                data_iterator, callback)
    start_time = datetime.now()
    try:
        yield cast(C, wrapped)
    finally:
        # even if something fails, log the count
        num_docs = _counts[-1]
        if supervised:
            self.add_sup_training(start_time=start_time,
                                  num_docs=num_docs,
                                  project_name=project_name)
        else:
            self.add_unsup_training(start_time=start_time,
                                    num_docs=num_docs,
                                    num_epochs=num_epochs,
                                    project_name=project_name)
        if len(_names) != 1:
            logger.warning(
                "Something went wrong during %ssupervised training. "
                "The number of documents trained was unable to be "
                "clearly obtained. Counted %d names (%s) at %s",
                'un' if not supervised else '', len(_names), _names,
                _counts)
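
The general pattern used here (wrap the data in a counting iterator, then report the count when the `with` block exits) can be sketched with the standard library alone. `count_while_iterating` and the `report` list are hypothetical names for this example; MedCAT's own `callback_iterator` helper is not reproduced.

```python
from contextlib import contextmanager
from typing import Iterable, Iterator


@contextmanager
def count_while_iterating(data: Iterable, report: list[int]) -> Iterator[Iterator]:
    """Yield a wrapped iterator and record how many items were consumed."""
    count = 0

    def counting():
        nonlocal count
        for item in data:
            count += 1
            yield item

    try:
        yield counting()
    finally:
        # Runs even if the body of the with-block raises.
        report.append(count)


report: list[int] = []
with count_while_iterating(["doc one", "doc two", "doc three"], report) as docs:
    seen = [doc.upper() for doc in docs]
print(report)
```

Because the count is recorded in `finally`, a partial count is still reported when training fails midway, which matches the "even if something fails, log the count" comment in the source above.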

NLPConfig

Bases: SerialisableBaseModel

Attributes:

disabled_components class-attribute instance-attribute

disabled_components: list = ['ner', 'parser', 'vectors', 'textcat', 'entity_linker', 'sentencizer', 'entity_ruler', 'merge_noun_chunks', 'merge_entities', 'merge_subtokens']

The list of components that will be disabled for the NLP.

NB! For these changes to take effect, the pipe would need to be recreated.

faster_spacy_tokenization class-attribute instance-attribute

faster_spacy_tokenization: bool = False

Allow skipping the spacy pipeline.

If True, uses basic tokenization only (spacy.make_doc) for ~3-4x overall speedup. If False, uses full linguistic pipeline including POS tagging, lemmatization, and stopword detection.

Impact of faster_spacy_tokenization=True:

  • No part-of-speech tags: all tokens are treated uniformly during normalization
  • No lemmatization: words are used in surface form (e.g., "running" vs "run")
  • No stopword detection: all tokens in multi-token spans are considered; all tokens are used in context vector calculation
  • Real-world performance (in terms of precision and recall) is likely to be lower

When to use fast mode:

  • Processing very large datasets where speed is critical
  • Text is already clean/normalized
  • Minor drops in precision/recall (typically 1-3%) are acceptable

When to use full mode (default):

  • Maximum accuracy is required
  • Working with noisy or varied text
  • Proper linguistic analysis improves your specific use case

Benchmark on your data to determine if the speedup justifies the accuracy tradeoff.

PS: Only applicable for spacy based tokenizer.

NB! For these changes to take effect, the pipe would need to be recreated.

model_config class-attribute instance-attribute

model_config = ConfigDict(extra='allow', validate_assignment=True)

modelname class-attribute instance-attribute

modelname: str = 'en_core_web_md'

What model will be used for tokenization.

NB! For these changes to take effect, the pipe would need to be recreated.

provider class-attribute instance-attribute

provider: str = 'regex'

The NLP provider.

Currently only regex and spacy are natively supported.

NB! For these changes to take effect, the pipe would need to be recreated.

Ner

Bases: ComponentConfig

The NER part of the config

Attributes:

check_upper_case_names class-attribute instance-attribute

check_upper_case_names: bool = False

Check uppercase to distinguish uppercase and lowercase words that have a different meaning.

custom_cnf class-attribute instance-attribute

custom_cnf: Optional[Any] = None

The custom config for the component.

max_skip_tokens class-attribute instance-attribute

max_skip_tokens: int = 2

When checking tokens for concepts you can have skipped tokens between used ones (usually spaces, new lines etc). This number tells you how many skipped can you have.

min_name_len class-attribute instance-attribute

min_name_len: int = 3

Do not detect names below this limit, skip them

model_config class-attribute instance-attribute

model_config = ConfigDict(extra='allow')

try_reverse_word_order class-attribute instance-attribute

try_reverse_word_order: bool = False

Try reverse word order for short concepts (2 words max), e.g. heart disease -> disease heart

upper_case_limit_len class-attribute instance-attribute

upper_case_limit_len: int = 4

Any name shorter than this must be uppercase in the text to be considered. If it is not uppercase it will be skipped.
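
The interplay between `min_name_len` and `upper_case_limit_len` can be sketched as a simple predicate. `keep_detection` is a hypothetical helper written for this example, and measuring length in characters of the matched text is an assumption; only the two default values and the rules come from the descriptions above.

```python
def keep_detection(text_span: str,
                   min_name_len: int = 3,
                   upper_case_limit_len: int = 4) -> bool:
    """Decide whether a detected name passes the length rules (hypothetical)."""
    # Names below the minimum length are never detected.
    if len(text_span) < min_name_len:
        return False
    # Short names must appear in uppercase to be considered.
    if len(text_span) < upper_case_limit_len and not text_span.isupper():
        return False
    return True


for span in ["MI", "CAD", "cad", "pain"]:
    print(span, keep_detection(span))
```

With the defaults, only three-character names are subject to the uppercase rule, since anything shorter is already dropped by `min_name_len`.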

PotentiallyDirty

Bases: Protocol

Methods:

Attributes:

is_dirty property

is_dirty: bool

mark_clean

mark_clean() -> None
Source code in medcat-v2/medcat/config/config.py
def mark_clean(self) -> None:
    pass

Preprocessing

Bases: SerialisableBaseModel

The preprocessing part of the config

Attributes:

do_not_normalize class-attribute instance-attribute

do_not_normalize: set[str] = {'VBD', 'VBG', 'VBN', 'VBP', 'JJS', 'JJR'}

Should specific word types be normalized, e.g. running -> run. Values are detailed part-of-speech tags. See:

  • https://spacy.io/usage/linguistic-features#pos-tagging
  • Label scheme section per model at https://spacy.io/models/en

keep_punct class-attribute instance-attribute

keep_punct: set = {'.', ':'}

All punct will be skipped by default, here you can set what will be kept

max_document_length class-attribute instance-attribute

max_document_length: int = 1000000

Documents longer than this will be trimmed.

NB! For these changes to take effect, the pipe would need to be recreated.

min_len_normalize class-attribute instance-attribute

min_len_normalize: int = 5

Nothing below this length will ever be normalized (input tokens or concept names), normalized means lemmatized in this case
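
One way to read `do_not_normalize` and `min_len_normalize` together is as a gate in front of lemmatization. The sketch below is an assumption about how the two settings combine, not MedCAT's confirmed logic; `should_normalize` is a hypothetical helper, and the defaults are copied from this page.

```python
# Default detailed POS tags that are excluded from normalization (from this page).
DO_NOT_NORMALIZE = frozenset({"VBD", "VBG", "VBN", "VBP", "JJS", "JJR"})


def should_normalize(token: str, tag: str,
                     min_len_normalize: int = 5,
                     do_not_normalize: frozenset = DO_NOT_NORMALIZE) -> bool:
    """Return True if the token would be lemmatized (hypothetical helper)."""
    # Tokens shorter than the minimum length are never normalized.
    if len(token) < min_len_normalize:
        return False
    # Tokens with an excluded part-of-speech tag keep their surface form.
    return tag not in do_not_normalize


print(should_normalize("running", "VBG"))
print(should_normalize("diseases", "NNS"))
print(should_normalize("legs", "NNS"))
```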

skip_stopwords class-attribute instance-attribute

skip_stopwords: bool = False

Should stopwords be skipped/ignored when processing input

stopwords class-attribute instance-attribute

stopwords: Optional[set] = None

If None, the default set of stopwords from spacy will be used. This must be a Set.

NB! For these changes to take effect, the pipe would need to be recreated.

words_to_skip class-attribute instance-attribute

words_to_skip: set = {'nos'}

These words will be completely ignored from concepts and from the text (must be a Set)

SerialisableBaseModel

Bases: BaseModel

The base serialisable config.

Methods:

get_init_attrs classmethod

get_init_attrs() -> list[str]
Source code in medcat-v2/medcat/config/config.py
@classmethod
def get_init_attrs(cls) -> list[str]:
    return []

get_strategy

get_strategy() -> SerialisingStrategy
Source code in medcat-v2/medcat/config/config.py
def get_strategy(self) -> SerialisingStrategy:
    return SerialisingStrategy.SERIALISABLES_AND_DICT

ignore_attrs classmethod

ignore_attrs() -> list[str]
Source code in medcat-v2/medcat/config/config.py
@classmethod
def ignore_attrs(cls) -> list[str]:
    return []

include_properties classmethod

include_properties() -> list[str]
Source code in medcat-v2/medcat/config/config.py
@classmethod
def include_properties(cls) -> list[str]:
    return []

load classmethod

load(path: str) -> Self
Source code in medcat-v2/medcat/config/config.py
@classmethod
def load(cls, path: str) -> Self:
    if os.path.isfile(path) and path.endswith(".dat"):
        if avoid_legacy_conversion():
            raise LegacyConversionDisabledError(cls.__name__)
        doing_legacy_conversion_message(logger, cls.__name__, path)
        from medcat.utils.legacy.convert_config import (
            get_config_from_old_per_cls)
        return cast(Self, get_config_from_old_per_cls(path, cls))
    obj = deserialise(path)
    if not isinstance(obj, cls):
        raise ValueError(f"The path '{path}' is not a {cls.__name__}!" +
                         str(("Instead of", cls, "Got", type(obj))))
    return obj

merge_config

merge_config(other: dict)

Merge this config with another config's (partial) model dump.

The expectation is that the other dict is a partial model dump. Values specified there are overwritten into the current config. Values not specified there are left intact.

The other config can have keys/values that do not exist in the config or sub-config. And they will be added where possible.

Parameters:

  • other

    (dict) –

    The model dump

Raises:

  • IncorrectConfigValues

    If unable to set the attribute, trying to set incorrect value, or trying to set sub-config values in an incorrect format (non-dict).

Source code in medcat-v2/medcat/config/config.py
def merge_config(self, other: dict):
    """Merge this config with another config's (partial) model dump.

    The exepctation is that the `other` dict is a partial model dump.
    Values specified there are overwritten into the current config.
    Values not specified there are left intact.

    The `other` config can have keys/values that do not exist in the
    config or sub-config. And they will be added where possible.

    Args:
        other (dict): The model dump

    Raises:
        IncorrectConfigValues: If unable to set the attribute,
            trying to set incorrect value, or trying to set sub-config
            values in an incorrect format (non-dict).
    """
    for k, v in other.items():
        if not hasattr(self, k):
            try:
                setattr(self, k, v)
            except (ValidationError, ValueError) as e:
                raise IncorrectConfigValues(
                    type(self), k, type(None), v
                ) from e
            continue
        cur_v = getattr(self, k)
        if isinstance(cur_v, SerialisableBaseModel):
            if not isinstance(v, dict):
                raise IncorrectConfigValues(
                    type(self), k, type(cur_v), v)
            cur_v.merge_config(v)
        else:
            try:
                setattr(self, k, v)
            except ValidationError as e:
                raise IncorrectConfigValues(
                    type(self), k, type(cur_v), v
                ) from e
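
The merge semantics (overwrite what is specified, recurse into sub-configs, leave everything else intact) can be shown with a dependency-free sketch. `Section` is a hypothetical stand-in for a config model; it raises a plain `ValueError` where the real method raises `IncorrectConfigValues`, and it performs no type validation.

```python
class Section:
    """Hypothetical stand-in for a config (sub-)section."""

    def __init__(self, **values):
        self.__dict__.update(values)

    def merge(self, other: dict) -> None:
        for key, value in other.items():
            current = getattr(self, key, None)
            if isinstance(current, Section):
                # Sub-sections can only be merged from a (partial) dict.
                if not isinstance(value, dict):
                    raise ValueError(
                        f"{key} expects a dict, got {type(value).__name__}")
                current.merge(value)
            else:
                # Plain values are overwritten; unknown keys are added.
                setattr(self, key, value)


config = Section(general=Section(spell_check=True, separator="~"))
config.merge({"general": {"spell_check": False}, "note": "added"})
print(config.general.spell_check, config.general.separator, config.note)
```

Only `spell_check` changes in the nested section; `separator` keeps its previous value because the partial dump does not mention it.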

TrainingDescriptor

Bases: SerialisableBaseModel

Attributes:

num_docs instance-attribute

num_docs: int

num_epochs class-attribute instance-attribute

num_epochs: int = 1

project_name instance-attribute

project_name: Optional[str]

train_time_end instance-attribute

train_time_end: datetime

train_time_start instance-attribute

train_time_start: datetime

UsageMonitor

Bases: SerialisableBaseModel

Attributes:

  • batch_size (int) –

    Number of logged events to write at once.

  • enabled (Literal[True, False, 'auto']) –

    Whether usage monitoring is enabled (True), disabled (False), or automatic ('auto').

  • file_prefix (str) –

    The prefix for logged files. The suffix will be the model hash.

  • log_folder (str) –

    The folder which contains the usage logs. In certain situations, it may make sense to keep this separate from the overall logs.

batch_size class-attribute instance-attribute

batch_size: int = 100

Number of logged events to write at once.

enabled class-attribute instance-attribute

enabled: Literal[True, False, 'auto'] = False

Whether usage monitoring is enabled (True), disabled (False), or automatic ('auto'). If set to False, no logging is performed. If set to True, logs are saved in the location specified by log_folder. If set to 'auto', logs will be automatically enabled or disabled based on an environment variable (MEDCAT_LOGS; setting it to False or 0 disables logging) and distributed according to the OS-preferred logs location (MEDCAT_LOGS_LOCATION). The defaults for the location are:

  • For Linux: ~/.local/share/medcat/logs/
  • For Windows: C:\Users\%USERNAME%.cache\medcat\logs\

file_prefix class-attribute instance-attribute

file_prefix: str = 'usage_'

The prefix for logged files. The suffix will be the model hash.

log_folder class-attribute instance-attribute

log_folder: str = '.'

The folder which contains the usage logs. In certain situations, it may make sense to keep this separate from the overall logs. NOTE: Does not take effect if enabled is set to 'auto'

get_important_config_parameters

get_important_config_parameters(config: Config) -> dict[str, Any]
Source code in medcat-v2/medcat/config/config.py
def get_important_config_parameters(config: Config) -> dict[str, Any]:
    return {
        "config.components.ner.min_name_len": {
            'value': config.components.ner.min_name_len,
            'description': ("Minimum detection length (found terms/mentions "
                            "shorter than this will not be detected).")
            },
        "config.components.ner.upper_case_limit_len": {
            'value': config.components.ner.upper_case_limit_len,
            'description': ("All detected terms shorter than this value have "
                            "to be uppercase, otherwise they will be ignored.")
            },
        "config.components.linking.similarity_threshold": {
            'value': config.components.linking.similarity_threshold,
            'description': ("If the confidence of the model is lower than "
                            "this a detection will be ignore.")
            },
        "config.components.linking.filters.cuis": {
            'value': len(config.components.linking.filters.cuis),
            'description': ("Length of the CUIs filter to be included in "
                            "outputs. If this is not 0 (i.e. not empty) its "
                            "best to check what is included before using the "
                            "model")
        },
        "config.general.spell_check": {
            'value': config.general.spell_check,
            'description': "Is spell checking enabled."
            },
        "config.general.spell_check_len_limit": {
            'value': config.general.spell_check_len_limit,
            'description': "Words shorter than this will not be spell checked."
            },
    }