mxnet.contrib.text.vocab

Text token indexer.

Classes

Vocabulary([counter, most_freq_count, ...])

Indexing for text tokens.

class mxnet.contrib.text.vocab.Vocabulary(counter=None, most_freq_count=None, min_freq=1, unknown_token='<unk>', reserved_tokens=None)

Bases: object

Indexing for text tokens.

Build indices for the unknown token, reserved tokens, and input counter keys. Indexed tokens can be used by token embeddings.

Parameters:
  • counter (collections.Counter or None, default None) – Counts text token frequencies in the text data. Its keys will be indexed according to frequency thresholds such as most_freq_count and min_freq. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples: str, int, and tuple.

  • most_freq_count (None or int, default None) – The maximum number of the most frequent tokens in the keys of counter that can be indexed. This argument does not count any token from reserved_tokens. If several keys of counter have the same frequency and indexing all of them would exceed this argument's value, those keys are indexed one by one in their comparison (__cmp__()) order until the limit set by this argument is reached. If this argument is None or larger than the largest possible value permitted by counter and reserved_tokens, it has no effect.

  • min_freq (int, default 1) – The minimum frequency required for a token in the keys of counter to be indexed.

  • unknown_token (hashable object, default '<unk>') – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples: str, int, and tuple.

  • reserved_tokens (list of hashable objects or None, default None) – A list of reserved tokens that will always be indexed, such as special symbols representing padding, beginning of sentence, and end of sentence. It must not contain unknown_token or duplicate entries. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples: str, int, and tuple.
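
The interaction between most_freq_count, min_freq, and reserved_tokens described above can be sketched in plain Python. This is an illustration only, not the library's implementation: build_index is a hypothetical helper, and placing the unknown token at index 0 followed by the reserved tokens is an assumption made for the example.

```python
from collections import Counter


def build_index(counter, most_freq_count=None, min_freq=1,
                unknown_token="<unk>", reserved_tokens=None):
    """Hypothetical helper: returns (idx_to_token, token_to_idx)."""
    reserved = list(reserved_tokens or [])
    if unknown_token in reserved or len(set(reserved)) != len(reserved):
        raise ValueError("reserved_tokens must be unique and exclude unknown_token")

    # Assumed layout: unknown token first, then the reserved tokens.
    idx_to_token = [unknown_token] + reserved
    special = set(idx_to_token)

    # Rank by descending frequency; ties fall back to the tokens' own order
    # because the second (stable) sort preserves the first sort's ordering.
    ranked = sorted(counter.items(), key=lambda kv: kv[0])
    ranked.sort(key=lambda kv: kv[1], reverse=True)

    # most_freq_count limits counter keys only, not reserved tokens.
    limit = len(ranked) if most_freq_count is None else most_freq_count
    taken = 0
    for token, freq in ranked:
        if freq < min_freq or taken >= limit:
            break
        if token in special:
            continue
        idx_to_token.append(token)
        taken += 1

    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx


counter = Counter(["a", "b", "b", "c", "c", "c", "d"])
idx_to_token, token_to_idx = build_index(
    counter, most_freq_count=2, min_freq=2, reserved_tokens=["<pad>"])
print(idx_to_token)  # ['<unk>', '<pad>', 'c', 'b']
```

Here "a" and "d" are dropped because their frequency is below min_freq, and the reserved token '<pad>' is indexed without consuming any of the most_freq_count budget.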

unknown_token

The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.

Type:

hashable object

reserved_tokens

A list of reserved tokens that will always be indexed.

Type:

list of strs or None

property idx_to_token

A list of indexed tokens where the list indices and the token indices are aligned.

Type:

list of strs

to_indices(tokens)

Converts tokens to indices according to the vocabulary.

Parameters:

tokens (str or list of strs) – A source token or tokens to be converted.

Returns:

A token index or a list of token indices according to the vocabulary.

Return type:

int or list of ints
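
A minimal sketch of this lookup, assuming the vocabulary is backed by a plain dict like token_to_idx. The function lookup_indices and the sample mapping are hypothetical, and the fallback to the unknown token's index follows from the statement that any unknown token is indexed as the same representation.

```python
def lookup_indices(tokens, token_to_idx, unknown_token="<unk>"):
    """Hypothetical stand-in for a token-to-index lookup."""
    unk_idx = token_to_idx[unknown_token]
    if isinstance(tokens, list):
        # A list of tokens yields a list of indices.
        return [token_to_idx.get(tok, unk_idx) for tok in tokens]
    # A single token yields a single index.
    return token_to_idx.get(tokens, unk_idx)


# Sample mapping invented for this example.
token_to_idx = {"<unk>": 0, "<pad>": 1, "c": 2, "b": 3}
print(lookup_indices("c", token_to_idx))                      # 2
print(lookup_indices(["b", "zzz", "<pad>"], token_to_idx))    # [3, 0, 1]
```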

to_tokens(indices)

Converts token indices to tokens according to the vocabulary.

Parameters:

indices (int or list of ints) – A source token index or token indices to be converted.

Returns:

A token or a list of tokens according to the vocabulary.

Return type:

str or list of strs
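
The reverse lookup can be sketched the same way against a list like idx_to_token. The function lookup_tokens is hypothetical, and raising ValueError for an out-of-range index is an assumption of this sketch; the text above does not state how invalid indices are handled.

```python
def lookup_tokens(indices, idx_to_token):
    """Hypothetical stand-in for an index-to-token lookup."""
    single = not isinstance(indices, list)
    items = [indices] if single else indices
    tokens = []
    for idx in items:
        # Assumption: reject anything that is not a valid list position.
        if not isinstance(idx, int) or not 0 <= idx < len(idx_to_token):
            raise ValueError("token index %r is invalid" % (idx,))
        tokens.append(idx_to_token[idx])
    return tokens[0] if single else tokens


# Sample list invented for this example.
idx_to_token = ["<unk>", "<pad>", "c", "b"]
print(lookup_tokens(2, idx_to_token))        # c
print(lookup_tokens([3, 0], idx_to_token))   # ['b', '<unk>']
```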

property token_to_idx

A dict mapping each token to its index integer.

Type:

dict mapping str to int
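
The relationship between idx_to_token and token_to_idx can be illustrated with plain Python containers. The sample tokens below are invented for the example; the point is only that the dict is the inverse of the list, so each token's index equals its position in the list.

```python
# Sample vocabulary contents, invented for this illustration.
idx_to_token = ["<unk>", "<pad>", "c", "b"]

# The token-to-index mapping is the inverse of the list.
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}

# Alignment check: every token's index points back at that token.
aligned = all(idx_to_token[idx] == token for token, idx in token_to_idx.items())

print(token_to_idx)  # {'<unk>': 0, '<pad>': 1, 'c': 2, 'b': 3}
print(aligned)       # True
```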