mxnet.contrib.text.utils
Provide utilities for text data processing.
Functions
count_tokens_from_str | Counts tokens in the specified string.
- mxnet.contrib.text.utils.count_tokens_from_str(source_str, token_delim=' ', seq_delim='\n', to_lower=False, counter_to_update=None)
Counts tokens in the specified string.
For token_delim='<td>' and seq_delim='<sd>', a string containing two sequences of tokens may look like:
<td>token1<td>token2<td>token3<td><sd><td>token4<td>token5<td><sd>
Both <td> and <sd> are interpreted as regular expressions. To use a regular-expression special character as a literal delimiter, escape it with a backslash (\). The special characters are listed at https://docs.python.org/3/library/re.html.
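Because the delimiters are regular expressions, a literal delimiter such as * or | has to be escaped. The sketch below illustrates the idea with the standard re module only. It assumes the two delimiters are combined into one alternation pattern and that empty strings are discarded; split_tokens is a hypothetical helper written for this illustration, not part of MXNet.

```python
import re

def split_tokens(source_str, token_delim=' ', seq_delim='\n'):
    """Hypothetical helper: split on either delimiter and drop empty strings."""
    # Assumption: both delimiters are regex patterns joined by alternation.
    pattern = token_delim + '|' + seq_delim
    return [tok for tok in re.split(pattern, source_str) if tok]

# '*' is a regex metacharacter, so it is escaped to be matched literally.
star_tokens = split_tokens('*Life*is*great*!*\n*life*is*good*.*\n', r'\*', '\n')

# re.escape produces the escaped form of an arbitrary literal delimiter.
pipe_tokens = split_tokens('a|b|c\n', re.escape('|'), '\n')

print(star_tokens)
print(pipe_tokens)
```

Passing the delimiter through re.escape avoids having to remember which characters are special.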
- Parameters:
source_str (str) – A source string of tokens.
token_delim (str, default ' ') – A token delimiter.
seq_delim (str, default '\n') – A sequence delimiter.
to_lower (bool, default False) – Whether to convert source_str to lower case before counting.
counter_to_update (collections.Counter or None, default None) – The collections.Counter instance to be updated with the token counts of source_str. If None, return a new collections.Counter instance counting tokens from source_str.
- Returns:
The counter_to_update collections.Counter instance after being updated with the token counts of source_str. If counter_to_update is None, return a new collections.Counter instance counting tokens from source_str.
- Return type:
collections.Counter
Examples
>>> from mxnet.contrib.text.utils import count_tokens_from_str
>>> source_str = ' Life is great ! \n life is good . \n'
>>> count_tokens_from_str(source_str, ' ', '\n', True)
Counter({'!': 1, '.': 1, 'good': 1, 'great': 1, 'is': 2, 'life': 2})
>>> source_str = '*Life*is*great*!*\n*life*is*good*.*\n'
>>> count_tokens_from_str(source_str, r'\*', '\n', True)
Counter({'is': 2, 'life': 2, '!': 1, 'great': 1, 'good': 1, '.': 1})
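The counter_to_update parameter makes it possible to accumulate counts over several strings, for example when reading a corpus in chunks. The following is a minimal sketch of that pattern. To keep it runnable without MXNet installed, it uses count_tokens_sketch, a hypothetical stand-in written to follow the behaviour documented above; it is an assumption about that behaviour, not the library source.

```python
import collections
import re

def count_tokens_sketch(source_str, token_delim=' ', seq_delim='\n',
                        to_lower=False, counter_to_update=None):
    """Hypothetical stand-in following the documented behaviour."""
    tokens = [t for t in re.split(token_delim + '|' + seq_delim, source_str) if t]
    if to_lower:
        tokens = [t.lower() for t in tokens]
    if counter_to_update is None:
        return collections.Counter(tokens)
    counter_to_update.update(tokens)  # updates the caller's counter in place
    return counter_to_update

chunks = [' Life is great ! \n', ' life is good . \n']
running = collections.Counter()
for chunk in chunks:
    result = count_tokens_sketch(chunk, ' ', '\n', to_lower=True,
                                 counter_to_update=running)

# The returned object is the same instance that was passed in.
same_object = result is running
print(running['life'], running['is'], same_object)
```

Since the counter is updated in place, the return value can be ignored inside the loop; it matters mainly when counter_to_update is None and a new counter is created.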