mxnet.contrib.text.utils

Provide utilities for text data processing.

Functions

count_tokens_from_str(source_str[, ...])

Counts tokens in the specified string.

mxnet.contrib.text.utils.count_tokens_from_str(source_str, token_delim=' ', seq_delim='\n', to_lower=False, counter_to_update=None)

Counts tokens in the specified string.

For token_delim='<td>' and seq_delim='<sd>', a source string containing two sequences of tokens may look like:

<td>token1<td>token2<td>token3<td><sd><td>token4<td>token5<td><sd>

<td> and <sd> are treated as regular expressions. To use a character that has a special meaning in regular expressions (such as * or .) as a delimiter, escape it with a backslash (\). The list of special characters can be found at https://docs.python.org/3/library/re.html.
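
Because the delimiters are interpreted as regular expressions, a literal delimiter such as * needs escaping. The following is a minimal sketch using only the standard library, not the library's own implementation. split_tokens is a hypothetical helper; it assumes the two delimiters are combined into a single alternation pattern and that empty strings between adjacent delimiters are discarded.

```python
import re

def split_tokens(source_str, token_delim=' ', seq_delim='\n'):
    """Hypothetical helper: split on either delimiter, both treated as regexes."""
    pattern = token_delim + '|' + seq_delim
    # Adjacent delimiters produce empty strings; drop them.
    return [tok for tok in re.split(pattern, source_str) if tok]

# '<' and '>' have no special meaning in a regex, so these work unescaped.
tagged = split_tokens(
    '<td>token1<td>token2<td>token3<td><sd><td>token4<td>token5<td><sd>',
    '<td>', '<sd>')

# '*' is a regex metacharacter, so escape it (re.escape('*') gives '\\*').
starred = split_tokens('*Life*is*great*!*\n*life*is*good*.*\n',
                       re.escape('*'), '\n')
```

Under these assumptions, tagged holds the five tokens token1 through token5, and starred holds the eight tokens of the two sentences. Passing '*' unescaped would instead make re.split raise re.error, since a pattern cannot begin with a repetition operator.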

Parameters:

source_str (str) – A source string of tokens.
token_delim (str, default ' ') – A token delimiter.
seq_delim (str, default '\n') – A sequence delimiter.
to_lower (bool, default False) – Whether to convert source_str to lower case.
counter_to_update (collections.Counter or None, default None) – The collections.Counter instance to be updated with the token counts of source_str. If None, return a new collections.Counter instance counting tokens from source_str.

Returns:

The counter_to_update collections.Counter instance after being updated with the token counts of source_str. If counter_to_update is None, return a new collections.Counter instance counting tokens from source_str.

Return type:

collections.Counter

Examples

>>> from mxnet.contrib.text.utils import count_tokens_from_str
>>> source_str = ' Life is great ! \n life is good . \n'
>>> count_tokens_from_str(source_str, ' ', '\n', True)
Counter({'!': 1, '.': 1, 'good': 1, 'great': 1, 'is': 2, 'life': 2})

>>> source_str = '*Life*is*great*!*\n*life*is*good*.*\n'
>>> count_tokens_from_str(source_str, r'\*', '\n', True)
Counter({'is': 2, 'life': 2, '!': 1, 'great': 1, 'good': 1, '.': 1})
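
The counter_to_update parameter makes it possible to accumulate counts over several strings. The sketch below illustrates the documented behavior with a standard-library stand-in so that it runs without MXNet installed. count_tokens_sketch is a hypothetical function that mirrors the documented signature; its body is an assumption, not the library's source.

```python
import collections
import re

def count_tokens_sketch(source_str, token_delim=' ', seq_delim='\n',
                        to_lower=False, counter_to_update=None):
    """Hypothetical stand-in that follows the documented parameters."""
    # Assumption: split on either delimiter and ignore empty pieces.
    tokens = [t for t in re.split(token_delim + '|' + seq_delim, source_str) if t]
    if to_lower:
        tokens = [t.lower() for t in tokens]
    if counter_to_update is None:
        # No counter supplied: return a fresh one.
        return collections.Counter(tokens)
    # Counter supplied: update it in place and return the same object.
    counter_to_update.update(tokens)
    return counter_to_update

# Accumulate counts across two strings into one running counter.
running = collections.Counter()
for line in [' Life is great ! \n', ' life is good . \n']:
    result = count_tokens_sketch(line, to_lower=True, counter_to_update=running)

# The returned object is the counter that was passed in.
same_object = result is running
```

In this sketch, running ends up with the same counts as the first example above ('life' and 'is' twice, every other token once), and same_object is True because the supplied counter is updated rather than copied.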