medcat.config.config

Classes:

AnnotationOutput –

The annotation output part of the config
CDBMaker –

The Context Database (CDB) making part of the config
ComponentConfig –
Components –
Config –
DirtiableBaseModel –
General –

The general part of the config
IncorrectConfigValues –
Linking –

The linking part of the config
LinkingFilters –

These describe the linking filters used alongside the model.
ModelMeta –
NLPConfig –
Ner –

The NER part of the config
PotentiallyDirty –
Preprocessing –

The preprocessing part of the config
SerialisableBaseModel –

The base serialisable config.
TrainingDescriptor –
UsageMonitor –

Functions:

get_important_config_parameters –

Attributes:

C –
T –
logger –

C `module-attribute`

C = TypeVar('C', bound=Iterable)

T `module-attribute`

T = TypeVar('T')

logger `module-attribute`

logger = getLogger(__name__)

AnnotationOutput

Bases: SerialisableBaseModel

The annotation output part of the config

Attributes:

context_left (int) –
context_right (int) –
include_text_in_output (bool) –
lowercase_context (bool) –

context_left `class-attribute` `instance-attribute`

context_left: int = -1

context_right `class-attribute` `instance-attribute`

context_right: int = -1

include_text_in_output `class-attribute` `instance-attribute`

include_text_in_output: bool = False

lowercase_context `class-attribute` `instance-attribute`

lowercase_context: bool = True

CDBMaker

Bases: SerialisableBaseModel

The Context Database (CDB) making part of the config

Attributes:

min_letters_required (int) –

Minimum number of letters required in a name to be accepted
multi_separator (str) –

If multiple names or type_ids for a concept present in one row of a CSV,
name_versions (list) –

Name versions to be generated.
remove_parenthesis (int) –

Should preferred names with parentheses be cleaned? 0 means no,

min_letters_required `class-attribute` `instance-attribute`

min_letters_required: int = 2

Minimum number of letters required in a name to be accepted for a concept

multi_separator `class-attribute` `instance-attribute`

multi_separator: str = '|'

If multiple names or type_ids for a concept present in one row of a CSV, they are separated by the specified character.

name_versions `class-attribute` `instance-attribute`

name_versions: list = ['LOWER', 'CLEAN']

Name versions to be generated.

remove_parenthesis `class-attribute` `instance-attribute`

remove_parenthesis: int = 5

Should preferred names with parentheses be cleaned? 0 means no; any other value means yes, if the name is longer than or equal to that value, e.g. Head (Body part) -> Head
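The cleaning rule described above can be sketched as follows. This is only an illustration, not MedCAT's implementation: the helper name `clean_preferred_name` is hypothetical, and the sketch assumes the threshold is compared against the full name length and that only a trailing parenthesised qualifier is removed.

```python
import re


def clean_preferred_name(name: str, remove_parenthesis: int = 5) -> str:
    """Hypothetical helper: strip a trailing '(...)' qualifier from a name."""
    # 0 (or anything non-positive) disables the cleaning entirely
    if remove_parenthesis <= 0:
        return name
    # Assumption: names shorter than the threshold are left untouched
    if len(name) < remove_parenthesis:
        return name
    # Remove one trailing parenthesised group and surrounding whitespace
    return re.sub(r"\s*\([^()]*\)\s*$", "", name)


cleaned = clean_preferred_name("Head (Body part)")
untouched = clean_preferred_name("Head (Body part)", remove_parenthesis=0)
```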

ComponentConfig

Bases: DirtiableBaseModel

Attributes:

comp_name (str) –

The name of the component.

comp_name `class-attribute` `instance-attribute`

comp_name: str = 'default'

The name of the component.

If a custom implementation is required, it needs to be registered using `medcat.components.types.register_core_component( , , )`. By default, only the 'default' component is registered.

Components

Bases: SerialisableBaseModel

Attributes:

addons (list[ComponentConfig]) –
comp_order (list[str]) –
linking (Linking) –
ner (Ner) –
tagging (ComponentConfig) –
token_normalizing (ComponentConfig) –

addons `class-attribute` `instance-attribute`

addons: list[ComponentConfig] = []

comp_order `class-attribute` `instance-attribute`

comp_order: list[str] = ['tagging', 'token_normalizing', 'ner', 'linking']

linking `class-attribute` `instance-attribute`

linking: Linking = Linking()

ner `class-attribute` `instance-attribute`

ner: Ner = Ner()

tagging `class-attribute` `instance-attribute`

tagging: ComponentConfig = ComponentConfig()

token_normalizing `class-attribute` `instance-attribute`

token_normalizing: ComponentConfig = ComponentConfig()

Config

Bases: SerialisableBaseModel

Attributes:

annotation_output (AnnotationOutput) –
cdb_maker (CDBMaker) –
components (Components) –
general (General) –
meta (ModelMeta) –
preprocessing (Preprocessing) –

annotation_output `class-attribute` `instance-attribute`

annotation_output: AnnotationOutput = AnnotationOutput()

cdb_maker `class-attribute` `instance-attribute`

cdb_maker: CDBMaker = CDBMaker()

components `class-attribute` `instance-attribute`

components: Components = Components()

general `class-attribute` `instance-attribute`

general: General = General()

meta `class-attribute` `instance-attribute`

meta: ModelMeta = Field(default_factory=ModelMeta)

preprocessing `class-attribute` `instance-attribute`

preprocessing: Preprocessing = Preprocessing()

DirtiableBaseModel

Bases: SerialisableBaseModel

Methods:

mark_clean –

Attributes:

is_dirty (bool) –

is_dirty `property`

is_dirty: bool

mark_clean

mark_clean()

Source code in medcat-v2/medcat/config/config.py

def mark_clean(self):
    self._is_dirty = False
    for part in self.__dict__.values():
        if isinstance(part, PotentiallyDirty):
            part.mark_clean()
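To illustrate how `mark_clean` cascades through nested parts, here is a self-contained sketch using only the standard library. The class `TrackedSection` is a hypothetical stand-in for a dirtiable config section; how the real model sets its dirty flag is not shown in the source, so the `__setattr__` hook below is an assumption.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class PotentiallyDirty(Protocol):
    @property
    def is_dirty(self) -> bool: ...

    def mark_clean(self) -> None: ...


class TrackedSection:
    """Hypothetical stand-in for a dirtiable config section."""

    def __init__(self, **parts):
        self.__dict__["_is_dirty"] = False
        for key, value in parts.items():
            # Bypass the dirty-tracking hook while initialising
            object.__setattr__(self, key, value)

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        if name != "_is_dirty":
            # Assumption: any attribute assignment marks the section dirty
            object.__setattr__(self, "_is_dirty", True)

    @property
    def is_dirty(self) -> bool:
        return self._is_dirty or any(
            isinstance(part, PotentiallyDirty) and part.is_dirty
            for part in self.__dict__.values())

    def mark_clean(self) -> None:
        self._is_dirty = False
        # Same cascade as in the source above: clean every nested part
        for part in self.__dict__.values():
            if isinstance(part, PotentiallyDirty):
                part.mark_clean()


inner = TrackedSection()
outer = TrackedSection(inner=inner)
inner.threshold = 0.5        # dirties inner, and therefore outer
was_dirty = outer.is_dirty
outer.mark_clean()           # cleans outer and, recursively, inner
```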

General

Bases: SerialisableBaseModel

The general part of the config

Attributes:

diacritics (bool) –

Should we process diacritics - for languages other than English,
full_unlink (bool) –

When unlinking a name from a concept should we do full_unlink
log_format (str) –
log_level (int) –

Logging config for everything | 'tagger' can be disabled,
log_path (str) –
make_pretty_labels (Optional[str]) –

Should the labels of entities (shown in displacy) be pretty
map_cui_to_group (bool) –

If the cdb.addl_info['cui2group'] is provided and this option enabled,
map_to_other_ontologies (Union[Literal['auto'], list[str]]) –

Which other ontologies to map to if possible.
model_config –
nlp (NLPConfig) –
separator (str) –

Separator that will be used to merge tokens of a name.
show_nested_entities (bool) –

If set to True functions like get_entities and get_json will return
spell_check (bool) –

Should we check spelling - note that this makes things much slower,
spell_check_deep (bool) –

If True the spell checker will try harder to find mistakes,
spell_check_len_limit (int) –

Spelling will not be checked for words with length less than this
usage_monitor (UsageMonitor) –

Checkpointing config
workers (int) –

Number of workers used by a parallelizable pipeline component

diacritics `class-attribute` `instance-attribute`

diacritics: bool = False

Should we process diacritics - for languages other than English, symbols such as 'é, ë, ö' can be relevant. Note that this makes spell_check slower.

full_unlink `class-attribute` `instance-attribute`

full_unlink: bool = False

When unlinking a name from a concept should we do full_unlink (means unlink a name from all concepts, not just the one in question)

log_format `class-attribute` `instance-attribute`

log_format: str = '%(levelname)s:%(name)s: %(message)s'

log_level `class-attribute` `instance-attribute`

log_level: int = INFO

Logging config for everything | 'tagger' can be disabled, but will cause a drop in performance

log_path `class-attribute` `instance-attribute`

log_path: str = './medcat.log'

make_pretty_labels `class-attribute` `instance-attribute`

make_pretty_labels: Optional[str] = None

Should the labels of entities (shown in displacy) be pretty or just 'concept'. Slows down the annotation pipeline and should not be used when annotating millions of documents. If None it will be the string "concept", if short it will be CUI, if long it will be CUI | Name | Confidence
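The three label modes can be sketched like this. The function `make_label` is hypothetical, and the two-decimal confidence format is an assumption; only the three modes themselves come from the description above.

```python
from typing import Optional


def make_label(cui: str, name: str, confidence: float,
               make_pretty_labels: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the three documented label modes."""
    if make_pretty_labels is None:
        return "concept"
    if make_pretty_labels == "short":
        return cui
    if make_pretty_labels == "long":
        # Assumption: confidence is shown with two decimals
        return f"{cui} | {name} | {confidence:.2f}"
    raise ValueError(f"Unknown make_pretty_labels value: {make_pretty_labels}")


plain = make_label("C0011849", "Diabetes", 0.987)
short = make_label("C0011849", "Diabetes", 0.987, "short")
long_label = make_label("C0011849", "Diabetes", 0.987, "long")
```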

map_cui_to_group `class-attribute` `instance-attribute`

map_cui_to_group: bool = False

If the cdb.addl_info['cui2group'] is provided and this option enabled, each CUI will be mapped to the group

map_to_other_ontologies `class-attribute` `instance-attribute`

map_to_other_ontologies: Union[Literal['auto'], list[str]] = 'auto'

Which other ontologies to map to if possible.

This will force medcat to include mapping for other ontologies in its outputs. It will use the mappings in cdb.addl_info["cui2<ont>"] if they are present.

If set to "auto" (or missing), the value will be inferred from available data at first init time. That is to say, it'll map to all ontologies available.

NB! This will only work if the cdb.addl_info["cui2<ont>"] exists. Otherwise, no mapping will be done.

model_config `class-attribute` `instance-attribute`

model_config = ConfigDict(extra='allow')

nlp `class-attribute` `instance-attribute`

nlp: NLPConfig = NLPConfig()

separator `class-attribute` `instance-attribute`

separator: str = '~'

Separator that will be used to merge tokens of a name. Once a CDB is built this should always stay the same.

show_nested_entities `class-attribute` `instance-attribute`

show_nested_entities: bool = False

If set to True functions like get_entities and get_json will return nested_entities and overlaps

spell_check `class-attribute` `instance-attribute`

spell_check: bool = True

Should we check spelling - note that this makes things much slower, use only if necessary. The only thing necessary for the spell checker to work is vocab.dat and cdb.dat built with concepts in the respective language.

spell_check_deep `class-attribute` `instance-attribute`

spell_check_deep: bool = False

If True the spell checker will try harder to find mistakes; this can slow things down drastically.

spell_check_len_limit `class-attribute` `instance-attribute`

spell_check_len_limit: int = 7

Spelling will not be checked for words with length less than this

usage_monitor `class-attribute` `instance-attribute`

usage_monitor: UsageMonitor = UsageMonitor()

Checkpointing config

workers `class-attribute` `instance-attribute`

workers: int = workers()

Number of workers used by a parallelizable pipeline component

IncorrectConfigValues

IncorrectConfigValues(cls: Type, attr_name: str, exp_type: Type, got: Any)

Bases: ValueError

Source code in medcat-v2/medcat/config/config.py

def __init__(self, cls: Type, attr_name: str,
             exp_type: Type, got: Any):
    super().__init__(f"Incorrect attribute set for {cls}.{attr_name}. "
                     f"Expected {exp_type}, but got {type(got)}: {got}")

Linking

Bases: ComponentConfig

The linking part of the config

Attributes:

additional (Optional[Any]) –

Some additional config for non-default linkers.
always_calculate_similarity (bool) –

Do we want to calculate context similarity even for concepts that are
calculate_dynamic_threshold (bool) –

Concepts below this similarity will be ignored. Type can be
context_ignore_center_tokens (bool) –

If true when the context of a concept is calculated (embedding)
context_vector_sizes (dict) –

Context vector sizes that will be calculated and used for linking
context_vector_weights (dict) –

Weight of each vector in the similarity score - make trainable at
devalue_linked_concepts (bool) –

When adding a positive example, should it also be treated as Negative
disamb_length_limit (int) –

All concepts below this will always be disambiguated
filter_before_disamb (bool) –

If True it will filter before doing disamb. Useful for the trainer.
filters (LinkingFilters) –

Filters
model_config –
negative_ignore_punct_and_num (bool) –

Do we ignore punct/num when negative sampling
negative_probability (float) –

Probability for the negative context to be added for each
optim (dict) –

Linear anneal
prefer_frequent_concepts (float) –

If >0 concepts that are more frequent will be preferred
prefer_primary_name (float) –

If >0 concepts for which a detection is its primary name
random_replacement_unsupervised (float) –

If <1 during unsupervised training the detected term will be randomly
similarity_threshold (float) –
similarity_threshold_type (str) –
subsample_after (int) –

DISABLED in code permanently: Subsample during unsupervised
train (bool) –

Whether to train. This is set automatically and should not be set manually.
train_count_threshold (int) –

Concepts that have seen fewer training examples than this will not be

additional `class-attribute` `instance-attribute`

additional: Optional[Any] = None

Some additional config for non-default linkers. E.g the 2-step linker uses this for alpha calculations and learning rate for type contexts.

always_calculate_similarity `class-attribute` `instance-attribute`

always_calculate_similarity: bool = False

Do we want to calculate context similarity even for concepts that are not ambiguous.

calculate_dynamic_threshold `class-attribute` `instance-attribute`

calculate_dynamic_threshold: bool = False

Concepts below this similarity will be ignored. Type can be static/dynamic - if dynamic each CUI has a different TH and it is calculated as the average confidence for that CUI * similarity_threshold. Take care that dynamic works only if the cdb was trained with calculate_dynamic_threshold = True.
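As a small worked example of the static versus dynamic threshold described above: the helper `effective_threshold` and its parameter `avg_cui_confidence` are hypothetical names, and the sketch assumes the per-CUI average confidence is already available.

```python
from typing import Optional


def effective_threshold(similarity_threshold: float = 0.25,
                        threshold_type: str = "static",
                        avg_cui_confidence: Optional[float] = None) -> float:
    """Hypothetical helper: similarity cut-off applied to one CUI."""
    if threshold_type == "static":
        # Same threshold for every CUI
        return similarity_threshold
    if threshold_type == "dynamic":
        if avg_cui_confidence is None:
            raise ValueError("dynamic thresholds need a per-CUI average")
        # Average confidence for that CUI * similarity_threshold
        return avg_cui_confidence * similarity_threshold
    raise ValueError(f"Unknown threshold type: {threshold_type}")


static_th = effective_threshold()
dynamic_th = effective_threshold(threshold_type="dynamic",
                                 avg_cui_confidence=0.8)
```

With the default `similarity_threshold` of 0.25, a CUI whose average confidence is 0.8 gets a dynamic threshold of about 0.2.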

context_ignore_center_tokens `class-attribute` `instance-attribute`

context_ignore_center_tokens: bool = False

If true when the context of a concept is calculated (embedding) the words making that concept are not taken into account

context_vector_sizes `class-attribute` `instance-attribute`

context_vector_sizes: dict = {'xlong': 27, 'long': 18, 'medium': 9, 'short': 3}

Context vector sizes that will be calculated and used for linking

context_vector_weights `class-attribute` `instance-attribute`

context_vector_weights: dict = {'xlong': 0.1, 'long': 0.4, 'medium': 0.4, 'short': 0.1}

Weight of each vector in the similarity score - make trainable at some point. Should add up to 1.
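One way to picture how these weights might be used is a weighted sum of per-window cosine similarities. The sketch below is an assumption about the combination rule, not MedCAT's code; `cosine` and `weighted_similarity` are hypothetical helpers.

```python
import math

CONTEXT_VECTOR_WEIGHTS = {"xlong": 0.1, "long": 0.4, "medium": 0.4, "short": 0.1}


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_similarity(doc_vectors, concept_vectors,
                        weights=CONTEXT_VECTOR_WEIGHTS) -> float:
    """Hypothetical helper: sum of weight * similarity per context size."""
    total = 0.0
    for context_type, weight in weights.items():
        # Skip context sizes missing on either side
        if context_type in doc_vectors and context_type in concept_vectors:
            total += weight * cosine(doc_vectors[context_type],
                                     concept_vectors[context_type])
    return total


same = {k: [1.0, 2.0] for k in CONTEXT_VECTOR_WEIGHTS}
score_identical = weighted_similarity(same, same)
```

Because the default weights add up to 1, identical context vectors give a combined score of approximately 1.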

devalue_linked_concepts `class-attribute` `instance-attribute`

devalue_linked_concepts: bool = False

When adding a positive example, should it also be treated as Negative for concepts which link to the positive one via names (ambiguous names).

disamb_length_limit `class-attribute` `instance-attribute`

disamb_length_limit: int = 3

All concepts below this will always be disambiguated

filter_before_disamb `class-attribute` `instance-attribute`

filter_before_disamb: bool = False

If True it will filter before doing disamb. Useful for the trainer.

filters `class-attribute` `instance-attribute`

filters: LinkingFilters = LinkingFilters()

Filters

model_config `class-attribute` `instance-attribute`

model_config = ConfigDict(extra='allow')

negative_ignore_punct_and_num `class-attribute` `instance-attribute`

negative_ignore_punct_and_num: bool = True

Do we ignore punct/num when negative sampling

negative_probability `class-attribute` `instance-attribute`

negative_probability: float = 0.5

Probability for the negative context to be added for each positive addition

optim `class-attribute` `instance-attribute`

optim: dict = {'type': 'linear', 'base_lr': 1, 'min_lr': 5e-05}

Linear anneal
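The source says only "Linear anneal", so the schedule below is one possible reading rather than the documented behaviour: it assumes the learning rate shrinks as `base_lr / (examples_seen + 1)` and never drops below `min_lr`. The function name `annealed_lr` is hypothetical.

```python
def annealed_lr(examples_seen: int, base_lr: float = 1.0,
                min_lr: float = 5e-05) -> float:
    """Hypothetical schedule: decays with examples seen, floored at min_lr."""
    # Assumption: the decay divides base_lr by the number of examples so far
    return max(base_lr / (examples_seen + 1), min_lr)


first = annealed_lr(0)          # starts at base_lr
later = annealed_lr(3)          # 1 / 4
floor = annealed_lr(10 ** 9)    # clamped to min_lr
```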

prefer_frequent_concepts `class-attribute` `instance-attribute`

prefer_frequent_concepts: float = 0.35

If >0 concepts that are more frequent will be preferred by a multiple of this amount

prefer_primary_name `class-attribute` `instance-attribute`

prefer_primary_name: float = 0.35

If >0 concepts for which a detection is its primary name will be preferred by that amount (0 to 1)

random_replacement_unsupervised `class-attribute` `instance-attribute`

random_replacement_unsupervised: float = 0.8

If <1, during unsupervised training the detected term will be randomly replaced with a probability of 1 - random_replacement_unsupervised. It is replaced with a synonym used for that term.
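The replacement probability can be sketched with the standard `random` module. The helper `maybe_replace` and the way synonyms are passed in are assumptions made for illustration only.

```python
import random
from typing import Optional, Sequence


def maybe_replace(term: str, synonyms: Sequence[str],
                  random_replacement_unsupervised: float = 0.8,
                  rng: Optional[random.Random] = None) -> str:
    """Hypothetical helper: swap the term for a synonym with
    probability 1 - random_replacement_unsupervised."""
    if rng is None:
        rng = random.Random()
    # rng.random() is uniform on [0, 1), so this branch is taken with
    # probability 1 - random_replacement_unsupervised
    if synonyms and rng.random() > random_replacement_unsupervised:
        return rng.choice(list(synonyms))
    return term


never = maybe_replace("MI", ["myocardial infarction"], 1.0)
```

With the default of 0.8, roughly one in five detected terms would be swapped for a synonym.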

similarity_threshold `class-attribute` `instance-attribute`

similarity_threshold: float = 0.25

similarity_threshold_type `class-attribute` `instance-attribute`

similarity_threshold_type: str = 'static'

subsample_after `class-attribute` `instance-attribute`

subsample_after: int = 30000

DISABLED in code permanently: Subsample during unsupervised training if a concept has received more than

train `class-attribute` `instance-attribute`

train: bool = True

Whether to train. This is set automatically; ignore it in 99% of cases and do not set it manually.

train_count_threshold `class-attribute` `instance-attribute`

train_count_threshold: int = 1

Concepts that have seen fewer training examples than this will not be used for similarity calculation and will have a similarity of -1.

LinkingFilters

LinkingFilters(**data)

Bases: SerialisableBaseModel

These describe the linking filters used alongside the model.

When neither CUIs nor excluded CUIs are specified (the sets are empty), all CUIs are accepted. If there are CUIs specified then only those will be accepted. If there are excluded CUIs specified, they are excluded.

In some cases, there are extra filters as well as MedCATtrainer (MCT) export filters. These are expected to follow the following: extra_cui_filter ⊆ MCT filter ⊆ Model/config filter

While any other CUIs can be included in the extra CUI filter or the MCT filter, they would not have any real effect.

Methods:

check_filters –

Checks whether a CUI is in the filters

Attributes:

cuis (set[str]) –
cuis_exclude (set[str]) –

Source code in medcat-v2/medcat/config/config.py

def __init__(self, **data):
    if 'cuis' in data:
        cuis = data['cuis']
        if isinstance(cuis, dict) and len(cuis) == 0:
            logger.warning("Loading an old model where "
                           "config.linking.filters.cuis has been "
                           "set to an empty dict instead of an empty "
                           "set. Converting the dict to a set in memory "
                           "as that is what is expected. Please consider "
                           "saving the model again.")
            data['cuis'] = set(cuis.keys())
    super().__init__(**data)

cuis `class-attribute` `instance-attribute`

cuis: set[str] = set()

cuis_exclude `class-attribute` `instance-attribute`

cuis_exclude: set[str] = set()

check_filters

check_filters(cui: str) -> bool

Checks whether a CUI is in the filters

Parameters:

cui
(str) –

The CUI in question

Returns:

bool ( bool ) –

True if the CUI is allowed

Source code in medcat-v2/medcat/config/config.py

def check_filters(self, cui: str) -> bool:
    """Checks whether a CUI is in the filters

    Args:
        cui (str): The CUI in question

    Returns:
        bool: True if the CUI is allowed
    """
    if cui in self.cuis or not self.cuis:
        return cui not in self.cuis_exclude
    else:
        return False
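The filter logic above can be tried without MedCAT installed. The sketch below copies the same condition into a standalone function that takes the two sets as arguments; the free-function form and the example CUIs are illustrative only.

```python
def check_filters(cui: str, cuis: set, cuis_exclude: set) -> bool:
    """Standalone copy of the filter check shown above."""
    # An empty include set means "accept everything not excluded"
    if cui in cuis or not cuis:
        return cui not in cuis_exclude
    return False


no_filters = check_filters("C1", set(), set())            # everything allowed
not_included = check_filters("C1", {"C2"}, set())         # include set excludes C1
excluded_wins = check_filters("C1", {"C1"}, {"C1"})       # exclusion takes priority
```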

ModelMeta

Bases: SerialisableBaseModel

Methods:

add_sup_training –

Add supervised training information based on data.
add_unsup_training –

Add unsupervised training information based on data.
mark_saved_now –
prepare_and_report_training –

Context manager for preparing training.

Attributes:

description (str) –
hash (str) –
history (list[str]) –
last_saved (datetime) –
location (str) –
medcat_version (str) –
ontology (list[str]) –
saved_environ (Environment) –
sup_trained (list[TrainingDescriptor]) –
unsup_trained (list[TrainingDescriptor]) –

description `class-attribute` `instance-attribute`

description: str = 'N/A'

hash `class-attribute` `instance-attribute`

hash: str = ''

history `class-attribute` `instance-attribute`

history: list[str] = Field(default_factory=list)

last_saved `class-attribute` `instance-attribute`

last_saved: datetime = Field(default_factory=now)

location `class-attribute` `instance-attribute`

location: str = 'N/A'

medcat_version `class-attribute` `instance-attribute`

medcat_version: str = ''

ontology `class-attribute` `instance-attribute`

ontology: list[str] = []

saved_environ `class-attribute` `instance-attribute`

saved_environ: Environment = Field(default_factory=get_environment_info)

sup_trained `class-attribute` `instance-attribute`

sup_trained: list[TrainingDescriptor] = []

unsup_trained `class-attribute` `instance-attribute`

unsup_trained: list[TrainingDescriptor] = []

add_sup_training

add_sup_training(start_time: datetime, num_docs: int, project_name: str) -> None

Add supervised training information based on data.

This will mark down the time taken for training by comparing the start time to the current time.

This will be called for every project being trained separately. So if there's an MCT export being trained with multiple projects, multiple different training instances will be recorded.

Parameters:

start_time
(datetime) –

The time at which the training was started.
num_docs
(int) –

The number of documents that were trained.
project_name
(str) –

The project name.

Source code in medcat-v2/medcat/config/config.py

def add_sup_training(self, start_time: datetime, num_docs: int,
                     project_name: str) -> None:
    """Add supervised training information based on data.

    NOTE: This will mark down the time taken for training by comparing
          the start time to the current time.

    NOTE: This will be called for every project being trained separately.
          So if there's a MCT export being trained with multiple projects,
          multiple different training instances will be recorded.

    Args:
        start_time (datetime): The time at which the training was started.
        num_docs (int): The number of documents that were trained.
        project_name (str): The project name.
    """
    self.sup_trained.append(TrainingDescriptor(
        train_time_start=start_time, train_time_end=datetime.now(),
        project_name=project_name, num_docs=num_docs, num_epochs=1
    ))

add_unsup_training

add_unsup_training(start_time: datetime, num_docs: int, num_epochs: int = 1, project_name: str = 'N/A')

Add unsupervised training information based on data.

This will mark down the time taken for training by comparing the start time to the current time.

Parameters:

start_time
(datetime) –

The time at which the training was started.
num_docs
(int) –

The number of documents trained.
num_epochs
(int, default: 1 ) –

The number of epochs. Defaults to 1.
project_name
(str, default: 'N/A' ) –

The project name. Defaults to 'N/A'.

Source code in medcat-v2/medcat/config/config.py

def add_unsup_training(self, start_time: datetime, num_docs: int,
                       num_epochs: int = 1, project_name: str = 'N/A'):
    """Add unsupervised training information based on data.

    NOTE: This will mark down the time taken for training by comparing
          the start time to the current time.

    Args:
        start_time (datetime): The time at which the training was started.
        num_docs (int): The number of documents trained.
        num_epochs (int, optional): The number of epochs. Defaults to 1.
        project_name (str, optional): The project name. Defaults to 'N/A'.
    """
    self.unsup_trained.append(TrainingDescriptor(
        train_time_start=start_time, train_time_end=datetime.now(),
        project_name=project_name, num_docs=num_docs,
        num_epochs=num_epochs))
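As a rough illustration of what gets recorded, the following sketch rebuilds the bookkeeping with plain dataclasses. `TrainingLog` is a hypothetical stand-in for `ModelMeta`; the `TrainingDescriptor` field names are taken from the source above, but the real classes are not dataclasses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TrainingDescriptor:
    train_time_start: datetime
    train_time_end: datetime
    project_name: str
    num_docs: int
    num_epochs: int


@dataclass
class TrainingLog:
    """Hypothetical stand-in for the training part of ModelMeta."""
    unsup_trained: list = field(default_factory=list)

    def add_unsup_training(self, start_time: datetime, num_docs: int,
                           num_epochs: int = 1,
                           project_name: str = "N/A") -> None:
        # The end time is taken at call time, so the call should
        # happen right after training finishes
        self.unsup_trained.append(TrainingDescriptor(
            train_time_start=start_time, train_time_end=datetime.now(),
            project_name=project_name, num_docs=num_docs,
            num_epochs=num_epochs))


log = TrainingLog()
started = datetime.now() - timedelta(minutes=5)   # pretend training took 5 min
log.add_unsup_training(started, num_docs=100, num_epochs=2)
entry = log.unsup_trained[0]
duration = entry.train_time_end - entry.train_time_start
```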

mark_saved_now

mark_saved_now()

Source code in medcat-v2/medcat/config/config.py

def mark_saved_now(self):
    self.last_saved = datetime.now()
    self.saved_environ = get_environment_info()
    self.medcat_version = medcat_version

prepare_and_report_training

prepare_and_report_training(data_iterator: C, num_epochs: int, supervised: bool = False, project_name: str = 'N/A') -> Iterator[C]

Context manager for preparing training.

This is used so that we can get the number of items in the data during training.

Parameters:

data_iterator
(C) –

The data to be trained.
num_epochs
(int) –

The number of epochs to be used.
supervised
(bool, default: False ) –

Whether training is supervised. Defaults to False.
project_name
(str, default: 'N/A' ) –

The project name. Defaults to 'N/A'.

Yields:

C –

Iterator[C]: The same data that was input.

Source code in medcat-v2/medcat/config/config.py

@contextmanager
def prepare_and_report_training(self,
                                data_iterator: C,
                                num_epochs: int,
                                supervised: bool = False,
                                project_name: str = 'N/A'
                                ) -> Iterator[C]:
    """Context manager for preparing training.

    This is used so that we can get the number of items in the data
    during training.

    Args:
        data_iterator (C): The data to be trained.
        num_epochs (int): The number of epochs to be used.
        supervised (bool, optional): Whether training is supervised.
            Defaults to False.
        project_name (str, optional): The project name. Defaults to 'N/A'.

    Yields:
        Iterator[C]: The same data that was input.
    """
    _names, _counts = [], [0]  # NOTE: 0 count for fallback

    def callback(name: str, count: int) -> None:
        _names.append(name)
        _counts.append(count)
    wrapped = callback_iterator(f"TRAIN-{id(data_iterator)}",
                                data_iterator, callback)
    start_time = datetime.now()
    try:
        yield cast(C, wrapped)
    finally:
        # even if something fails, log the count
        num_docs = _counts[-1]  # last recorded count, or the 0 fallback
        if supervised:
            self.add_sup_training(start_time=start_time,
                                  num_docs=num_docs,
                                  project_name=project_name)
        else:
            self.add_unsup_training(start_time=start_time,
                                    num_docs=num_docs,
                                    num_epochs=num_epochs,
                                    project_name=project_name)
        if len(_names) != 1:
            logger.warning(
                "Something went wrong during %ssupervised training. "
                "The number of documents trained was unable to be "
                "clearly obtained. Counted %d names (%s) at %s",
                'un' if not supervised else '', len(_names), _names,
                _counts)
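The counting pattern used by this context manager can be sketched in isolation. Both `counting_iterator` (standing in for `callback_iterator`, whose implementation is not shown here) and `report_training` are hypothetical; the sketch assumes the callback fires once the wrapped data has been fully consumed.

```python
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator, List


def counting_iterator(name: str, data: Iterable,
                      callback: Callable[[str, int], None]) -> Iterator:
    """Hypothetical stand-in: yield items, then report how many there were."""
    count = 0
    for item in data:
        count += 1
        yield item
    callback(name, count)


@contextmanager
def report_training(data: Iterable, recorded: List[int]):
    counts = [0]  # 0 is the fallback if the data is never fully consumed

    def callback(name: str, count: int) -> None:
        counts.append(count)

    wrapped = counting_iterator("TRAIN", data, callback)
    try:
        yield wrapped
    finally:
        # Runs even if training fails; takes the last known count
        recorded.append(counts[-1])


recorded: List[int] = []
with report_training(["doc a", "doc b", "doc c"], recorded) as docs:
    seen = [doc.upper() for doc in docs]
```

Reading the last element of `counts` is what makes the fallback work: if the iterator is never exhausted, the list still holds only the initial 0.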

NLPConfig

Bases: SerialisableBaseModel

Attributes:

disabled_components (list) –

The list of components that will be disabled for the NLP.
faster_spacy_tokenization (bool) –

Allow skipping the spacy pipeline.
model_config –
modelname (str) –

What model will be used for tokenization.
provider (str) –

The NLP provider.

disabled_components `class-attribute` `instance-attribute`

disabled_components: list = ['ner', 'parser', 'vectors', 'textcat', 'entity_linker', 'sentencizer', 'entity_ruler', 'merge_noun_chunks', 'merge_entities', 'merge_subtokens']

The list of components that will be disabled for the NLP.

NB! For these changes to take effect, the pipe would need to be recreated.

faster_spacy_tokenization `class-attribute` `instance-attribute`

faster_spacy_tokenization: bool = False

Allow skipping the spacy pipeline.

If True, uses basic tokenization only (spacy.make_doc) for ~3-4x overall speedup. If False, uses full linguistic pipeline including POS tagging, lemmatization, and stopword detection.

Impact of faster_spacy_tokenization=True:

- No part-of-speech tags: All tokens treated uniformly during normalization
- No lemmatization: Words used in surface form (e.g., "running" vs "run")
- No stopword detection: All tokens in multi-token spans considered; all tokens used in context vector calculation
- Real world performance (in terms of precision and recall) is likely to be lower

When to use fast mode:

- Processing very large datasets where speed is critical
- Text is already clean/normalized
- Minor drops in precision/recall (typically 1-3%) are acceptable

When to use full mode (default):

- Maximum accuracy is required
- Working with noisy or varied text
- Proper linguistic analysis improves your specific use case

Benchmark on your data to determine if the speedup justifies the accuracy tradeoff.

PS: Only applicable for spacy based tokenizer.

NB! For these changes to take effect, the pipe would need to be recreated.

model_config `class-attribute` `instance-attribute`

model_config = ConfigDict(extra='allow', validate_assignment=True)

modelname `class-attribute` `instance-attribute`

modelname: str = 'en_core_web_md'

What model will be used for tokenization.

NB! For these changes to take effect, the pipe would need to be recreated.

provider `class-attribute` `instance-attribute`

provider: str = 'regex'

The NLP provider.

Currently only regex and spacy are natively supported.

NB! For these changes to take effect, the pipe would need to be recreated.

Ner

Bases: ComponentConfig

The NER part of the config

Attributes:

check_upper_case_names (bool) –

Check uppercase to distinguish uppercase and lowercase words that have
custom_cnf (Optional[Any]) –

The custom config for the component.
max_skip_tokens (int) –

When checking tokens for concepts you can have skipped tokens between
min_name_len (int) –

Do not detect names below this limit, skip them
model_config –
try_reverse_word_order (bool) –

Try reverse word order for short concepts (2 words max),
upper_case_limit_len (int) –

Any name shorter than this must be uppercase in the text to be

check_upper_case_names `class-attribute` `instance-attribute`

check_upper_case_names: bool = False

Check uppercase to distinguish uppercase and lowercase words that have a different meaning.

custom_cnf `class-attribute` `instance-attribute`

custom_cnf: Optional[Any] = None

The custom config for the component.

max_skip_tokens `class-attribute` `instance-attribute`

max_skip_tokens: int = 2

When checking tokens for concepts you can have skipped tokens between used ones (usually spaces, new lines etc). This number tells you how many skipped tokens you can have.

min_name_len `class-attribute` `instance-attribute`

min_name_len: int = 3

Do not detect names below this limit, skip them

model_config `class-attribute` `instance-attribute`

model_config = ConfigDict(extra='allow')

try_reverse_word_order `class-attribute` `instance-attribute`

try_reverse_word_order: bool = False

Try reverse word order for short concepts (2 words max), e.g. heart disease -> disease heart

upper_case_limit_len `class-attribute` `instance-attribute`

upper_case_limit_len: int = 4

Any name shorter than this must be uppercase in the text to be considered. If it is not uppercase it will be skipped.
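The rule can be expressed as a small predicate. The function name `passes_upper_case_check` is hypothetical, and the sketch assumes "shorter than" is a strict comparison on the detected text's length.

```python
def passes_upper_case_check(detected_text: str,
                            upper_case_limit_len: int = 4) -> bool:
    """Hypothetical helper: short names only count when written in uppercase."""
    # Names at or above the limit are not subject to the check
    if len(detected_text) >= upper_case_limit_len:
        return True
    return detected_text.isupper()


kept = passes_upper_case_check("MS")         # short and uppercase
skipped = passes_upper_case_check("ms")      # short and lowercase
long_name = passes_upper_case_check("asthma")
```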

PotentiallyDirty

Bases: Protocol

Methods:

mark_clean –

Attributes:

is_dirty (bool) –

is_dirty `property`

is_dirty: bool

mark_clean

mark_clean() -> None

Source code in medcat-v2/medcat/config/config.py

def mark_clean(self) -> None:
    pass

Preprocessing

Bases: SerialisableBaseModel

The preprocessing part of the config

Attributes:

do_not_normalize (set[str]) –

Should specific word types be normalized: e.g. running -> run
keep_punct (set) –

All punct will be skipped by default, here you can set what
max_document_length (int) –

Documents longer than this will be trimmed.
min_len_normalize (int) –

Nothing below this length will ever be normalized (input tokens or
skip_stopwords (bool) –

Should stopwords be skipped/ignored when processing input
stopwords (Optional[set]) –

If None the default set of stopwords from spacy will be used.
words_to_skip (set) –

These words will be completely ignored from concepts and from the text

do_not_normalize `class-attribute` `instance-attribute`

do_not_normalize: set[str] = {'VBD', 'VBG', 'VBN', 'VBP', 'JJS', 'JJR'}

Should specific word types be normalized: e.g. running -> run. Values are detailed part-of-speech tags. See:

- https://spacy.io/usage/linguistic-features#pos-tagging
- Label scheme section per model at https://spacy.io/models/en

keep_punct `class-attribute` `instance-attribute`

keep_punct: set = {'.', ':'}

All punct will be skipped by default, here you can set what will be kept

max_document_length `class-attribute` `instance-attribute`

max_document_length: int = 1000000

Documents longer than this will be trimmed.

NB! For these changes to take effect, the pipe would need to be recreated.

min_len_normalize `class-attribute` `instance-attribute`

min_len_normalize: int = 5

Nothing below this length will ever be normalized (input tokens or concept names), normalized means lemmatized in this case
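Taken together, `do_not_normalize` and `min_len_normalize` decide whether a token is lemmatized. A minimal sketch of that decision, assuming both checks are applied per token and that the part-of-speech tag is already known; `should_normalize` is a hypothetical helper.

```python
DO_NOT_NORMALIZE = {"VBD", "VBG", "VBN", "VBP", "JJS", "JJR"}


def should_normalize(token_text: str, tag: str,
                     min_len_normalize: int = 5,
                     do_not_normalize: frozenset = frozenset(DO_NOT_NORMALIZE)
                     ) -> bool:
    """Hypothetical helper: should this token be lemmatized?"""
    # Nothing below the minimum length is ever normalized
    if len(token_text) < min_len_normalize:
        return False
    # Tokens with an excluded detailed POS tag keep their surface form
    return tag not in do_not_normalize


plural_noun = should_normalize("tumours", "NNS")
gerund = should_normalize("running", "VBG")
too_short = should_normalize("eyes", "NNS")
```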

skip_stopwords `class-attribute` `instance-attribute`

skip_stopwords: bool = False

Should stopwords be skipped/ignored when processing input

stopwords `class-attribute` `instance-attribute`

stopwords: Optional[set] = None

If None, the default set of stopwords from spaCy will be used. This must be a set.

NB! For these changes to take effect, the pipe would need to be recreated.

words_to_skip `class-attribute` `instance-attribute`

words_to_skip: set = {'nos'}

These words will be completely ignored in concepts and in the text (must be a set).
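As an illustration of how do_not_normalize and min_len_normalize interact, the following is a minimal sketch of the documented rules. It is not MedCAT's internal code: the function name should_normalize and its parameters are hypothetical, and only the default values are taken from the attributes above.

```python
# Illustrative only: mirrors the documented semantics, not MedCAT internals.
DO_NOT_NORMALIZE = frozenset({"VBD", "VBG", "VBN", "VBP", "JJS", "JJR"})
MIN_LEN_NORMALIZE = 5


def should_normalize(text: str, tag: str,
                     min_len: int = MIN_LEN_NORMALIZE,
                     skip_tags: frozenset = DO_NOT_NORMALIZE) -> bool:
    """Return True if a token is a candidate for lemmatization."""
    # Nothing shorter than min_len is ever normalized.
    if len(text) < min_len:
        return False
    # Tokens carrying one of the listed detailed POS tags are left as-is.
    return tag not in skip_tags


print(should_normalize("patients", "NNS"))  # long enough, tag not excluded
print(should_normalize("running", "VBG"))   # tag is in the excluded set
print(should_normalize("cats", "NNS"))      # shorter than min_len
```

Under these assumptions, only the first call reports a token as eligible for normalization.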

SerialisableBaseModel

Bases: BaseModel

The base serialisable config.

Methods:

get_init_attrs –
get_strategy –
ignore_attrs –
include_properties –
load –
merge_config –

Merge this config with another config's (partial) model dump.

get_init_attrs `classmethod`

get_init_attrs() -> list[str]

Source code in medcat-v2/medcat/config/config.py

@classmethod
def get_init_attrs(cls) -> list[str]:
    return []

get_strategy

get_strategy() -> SerialisingStrategy

Source code in medcat-v2/medcat/config/config.py

def get_strategy(self) -> SerialisingStrategy:
    return SerialisingStrategy.SERIALISABLES_AND_DICT

ignore_attrs `classmethod`

ignore_attrs() -> list[str]

Source code in medcat-v2/medcat/config/config.py

@classmethod
def ignore_attrs(cls) -> list[str]:
    return []

include_properties `classmethod`

include_properties() -> list[str]

Source code in medcat-v2/medcat/config/config.py

@classmethod
def include_properties(cls) -> list[str]:
    return []

load `classmethod`

load(path: str) -> Self

Source code in medcat-v2/medcat/config/config.py

@classmethod
def load(cls, path: str) -> Self:
    if os.path.isfile(path) and path.endswith(".dat"):
        if avoid_legacy_conversion():
            raise LegacyConversionDisabledError(cls.__name__)
        doing_legacy_conversion_message(logger, cls.__name__, path)
        from medcat.utils.legacy.convert_config import (
            get_config_from_old_per_cls)
        return cast(Self, get_config_from_old_per_cls(path, cls))
    obj = deserialise(path)
    if not isinstance(obj, cls):
        raise ValueError(f"The path '{path}' is not a {cls.__name__}!" +
                         str(("Instead of", cls, "Got", type(obj))))
    return obj

merge_config

merge_config(other: dict)

Merge this config with another config's (partial) model dump.

The expectation is that the other dict is a partial model dump. Values specified there overwrite those in the current config. Values not specified there are left intact.

The other config can have keys/values that do not exist in the config or sub-config; these will be added where possible.

Parameters:

other
(dict) –

The model dump

Raises:

IncorrectConfigValues –

If unable to set the attribute, trying to set incorrect value, or trying to set sub-config values in an incorrect format (non-dict).

Source code in medcat-v2/medcat/config/config.py

def merge_config(self, other: dict):
    """Merge this config with another config's (partial) model dump.

    The expectation is that the `other` dict is a partial model dump.
    Values specified there are overwritten into the current config.
    Values not specified there are left intact.

    The `other` config can have keys/values that do not exist in the
    config or sub-config. And they will be added where possible.

    Args:
        other (dict): The model dump

    Raises:
        IncorrectConfigValues: If unable to set the attribute,
            trying to set incorrect value, or trying to set sub-config
            values in an incorrect format (non-dict).
    """
    for k, v in other.items():
        if not hasattr(self, k):
            try:
                setattr(self, k, v)
            except (ValidationError, ValueError) as e:
                raise IncorrectConfigValues(
                    type(self), k, type(None), v
                ) from e
            continue
        cur_v = getattr(self, k)
        if isinstance(cur_v, SerialisableBaseModel):
            if not isinstance(v, dict):
                raise IncorrectConfigValues(
                    type(self), k, type(cur_v), v)
            cur_v.merge_config(v)
        else:
            try:
                setattr(self, k, v)
            except ValidationError as e:
                raise IncorrectConfigValues(
                    type(self), k, type(cur_v), v
                ) from e
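The merge rules of merge_config (nested sections are merged recursively, plain values are overwritten, unknown keys are added where possible) can be illustrated with a small stand-in. This is a minimal sketch using a hypothetical Section class rather than MedCAT's pydantic models, so it skips validation and raises a plain ValueError where merge_config raises IncorrectConfigValues. The field values are arbitrary.

```python
class Section:
    """Hypothetical stand-in for a config section (not a MedCAT class)."""

    def __init__(self, **values):
        self.__dict__.update(values)

    def merge(self, other: dict) -> None:
        for key, value in other.items():
            current = getattr(self, key, None)
            if isinstance(current, Section):
                # Sub-sections only accept a (partial) dict to merge in.
                if not isinstance(value, dict):
                    raise ValueError(
                        f"{key!r} expects a dict, got {type(value).__name__}")
                current.merge(value)
            else:
                # Plain values are overwritten; unknown keys are added.
                setattr(self, key, value)


config = Section(
    general=Section(spell_check=True, spell_check_len_limit=7),
    preprocessing=Section(min_len_normalize=5),
)
config.merge({"general": {"spell_check": False}, "extra_key": 1})
print(config.general.spell_check, config.general.spell_check_len_limit)
```

Only general.spell_check changes here; spell_check_len_limit and the preprocessing section are left intact, and extra_key is added at the top level.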

TrainingDescriptor

Bases: SerialisableBaseModel

Attributes:

num_docs (int) –
num_epochs (int) –
project_name (Optional[str]) –
train_time_end (datetime) –
train_time_start (datetime) –

num_docs `instance-attribute`

num_docs: int

num_epochs `class-attribute` `instance-attribute`

num_epochs: int = 1

project_name `instance-attribute`

project_name: Optional[str]

train_time_end `instance-attribute`

train_time_end: datetime

train_time_start `instance-attribute`

train_time_start: datetime

UsageMonitor

Bases: SerialisableBaseModel

Attributes:

batch_size (int) –

Number of logged events to write at once.
enabled (Literal[True, False, 'auto']) –

Whether usage monitoring is enabled (True), disabled (False),
file_prefix (str) –

The prefix for logged files. The suffix will be the model hash.
log_folder (str) –

The folder which contains the usage logs. In certain situations,

batch_size `class-attribute` `instance-attribute`

batch_size: int = 100

Number of logged events to write at once.

enabled `class-attribute` `instance-attribute`

enabled: Literal[True, False, 'auto'] = False

Whether usage monitoring is enabled (True), disabled (False), or automatic ('auto').

- If set to False, no logging is performed.
- If set to True, logs are saved in the location specified by log_folder.
- If set to 'auto', logging is enabled or disabled based on an environment variable (MEDCAT_LOGS; setting it to False or 0 disables logging), and logs are placed according to the OS-preferred logs location (MEDCAT_LOGS_LOCATION).

The defaults for the location are:

- For Linux: ~/.local/share/medcat/logs/
- For Windows: C:\Users\%USERNAME%.cache\medcat\logs\

file_prefix `class-attribute` `instance-attribute`

file_prefix: str = 'usage_'

The prefix for logged files. The suffix will be the model hash.

log_folder `class-attribute` `instance-attribute`

log_folder: str = '.'

The folder which contains the usage logs. In certain situations, it may make sense to keep this separate from the overall logs. NOTE: Does not take effect if enabled is set to 'auto'.
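The three-way enabled setting can be read as a small decision rule. The sketch below is an assumption-laden illustration, not MedCAT's implementation: the function resolve_usage_logging is hypothetical, the comparison is assumed to be case-insensitive, and the behaviour when MEDCAT_LOGS is unset is not documented above, so it is exposed as an explicit default_when_unset parameter.

```python
import os
from typing import Literal, Mapping, Union


def resolve_usage_logging(enabled: Union[bool, Literal["auto"]],
                          environ: Mapping[str, str] = os.environ,
                          default_when_unset: bool = False) -> bool:
    """Decide whether usage logging should be active (hypothetical helper)."""
    if enabled != "auto":
        # Explicit True/False wins; the environment is not consulted.
        return bool(enabled)
    raw = environ.get("MEDCAT_LOGS")
    if raw is None:
        # Assumption: the fallback for an unset variable is caller-defined.
        return default_when_unset
    # Documented: setting the variable to False or 0 disables logging.
    return raw.strip().lower() not in {"false", "0"}


print(resolve_usage_logging("auto", {"MEDCAT_LOGS": "0"}))
print(resolve_usage_logging("auto", {"MEDCAT_LOGS": "1"}))
```

Passing the environment as a mapping keeps the rule easy to exercise without touching the real process environment.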

get_important_config_parameters

get_important_config_parameters(config: Config) -> dict[str, Any]

Source code in medcat-v2/medcat/config/config.py

def get_important_config_parameters(config: Config) -> dict[str, Any]:
    # Maps a dotted config path to its current value and a short description.
    return {
        "config.components.ner.min_name_len": {
            'value': config.components.ner.min_name_len,
            'description': ("Minimum detection length (found terms/mentions "
                            "shorter than this will not be detected).")
        },
        "config.components.ner.upper_case_limit_len": {
            'value': config.components.ner.upper_case_limit_len,
            'description': ("All detected terms shorter than this value have "
                            "to be uppercase, otherwise they will be ignored.")
        },
        "config.components.linking.similarity_threshold": {
            'value': config.components.linking.similarity_threshold,
            'description': ("If the confidence of the model is lower than "
                            "this a detection will be ignored.")
        },
        "config.components.linking.filters.cuis": {
            'value': len(config.components.linking.filters.cuis),
            'description': ("Length of the CUIs filter to be included in "
                            "outputs. If this is not 0 (i.e. not empty) it's "
                            "best to check what is included before using the "
                            "model")
        },
        "config.general.spell_check": {
            'value': config.general.spell_check,
            'description': "Is spell checking enabled."
        },
        "config.general.spell_check_len_limit": {
            'value': config.general.spell_check_len_limit,
            'description': "Words shorter than this will not be spell checked."
        },
    }
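The returned mapping pairs each dotted parameter path with a 'value' and a 'description', which makes it straightforward to render as a short report. The following sketch assumes only that shape; format_parameters is a hypothetical helper, and the example dict is hand-built rather than produced from a real Config.

```python
from typing import Any


def format_parameters(params: dict[str, dict[str, Any]]) -> str:
    """Render a {path: {'value', 'description'}} mapping as plain text."""
    lines = []
    for name, info in params.items():
        lines.append(f"{name} = {info['value']!r}")
        lines.append(f"    {info['description']}")
    return "\n".join(lines)


# Hand-built example with the same shape as the function's return value.
example = {
    "config.general.spell_check": {
        "value": True,
        "description": "Is spell checking enabled.",
    },
}
print(format_parameters(example))
```

With a loaded model, the same helper could be applied to the result of get_important_config_parameters(config).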
