mxnet.contrib.text.embedding

Text token embeddings.

Functions

`create`(embedding_name, **kwargs)	Creates an instance of token embedding.
`get_pretrained_file_names`([embedding_name])	Get valid token embedding names and their pre-trained file names.
`register`(embedding_cls)	Registers a new token embedding.

Classes

`CompositeEmbedding`(vocabulary, token_embeddings)	Composite token embeddings.
`CustomEmbedding`(pretrained_file_path[, ...])	User-defined token embedding.
`FastText`([pretrained_file_name, ...])	The fastText word embedding.
`GloVe`([pretrained_file_name, ...])	The GloVe word embedding.

class mxnet.contrib.text.embedding.CompositeEmbedding(vocabulary, token_embeddings)

Bases: _TokenEmbedding

Composite token embeddings.

Each indexed token in a vocabulary is associated with one or more embedding vectors; when several token embeddings are supplied, their vectors are concatenated per token. The vectors can be loaded from externally hosted or custom pre-trained token embedding files via token embedding instances.

Parameters:

vocabulary (Vocabulary) – The vocabulary whose indexed tokens are each associated with the (possibly concatenated) embedding vectors.
token_embeddings (instance or list of mxnet.contrib.text.embedding._TokenEmbedding) – One or more pre-trained token embeddings to load. If a list of embeddings is given, their vectors are concatenated for each token.
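
The per-token concatenation described above can be illustrated with a small plain-Python sketch. This is not MXNet's implementation: the helper `concat_embeddings`, the toy lookup tables, and the choice to fill a missing token with zeros are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def concat_embeddings(
    tokens: Sequence[str],
    tables: Sequence[Dict[str, List[float]]],
    vec_lens: Sequence[int],
) -> Dict[str, List[float]]:
    """Concatenate, per token, the vectors found in each lookup table.

    Assumption: a token missing from a table contributes a zero vector
    of that table's length.
    """
    composite: Dict[str, List[float]] = {}
    for token in tokens:
        vec: List[float] = []
        for table, vec_len in zip(tables, vec_lens):
            vec.extend(table.get(token, [0.0] * vec_len))
        composite[token] = vec
    return composite


# Hypothetical toy tables standing in for two pre-trained embeddings.
table_a = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}
table_b = {"hello": [1.0, 2.0, 3.0]}

composite = concat_embeddings(["hello", "world"], [table_a, table_b], [2, 3])
print(composite["hello"])
print(composite["world"])
```

Every token ends up with a vector whose length is the sum of the individual embedding lengths (here 2 + 3 = 5).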

class mxnet.contrib.text.embedding.CustomEmbedding(pretrained_file_path, elem_delim=' ', encoding='utf8', init_unknown_vec=<function zeros>, vocabulary=None, **kwargs)

Bases: _TokenEmbedding

User-defined token embedding.

Loads embedding vectors from a user-defined pre-trained text embedding file.

Let [ed] denote the argument elem_delim, and let [v_ij] denote the j-th element of the token embedding vector for [token_i]. The expected format of a custom pre-trained token embedding file is:

‘[token_1][ed][v_11][ed][v_12][ed]…[ed][v_1k]\n[token_2][ed][v_21][ed][v_22][ed]…[ed][v_2k]\n…’

where k is the length of the embedding vector (vec_len).
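
As a rough illustration of this file format, the sketch below parses such lines in plain Python. It is not the loader MXNet uses; the function name `parse_embedding_lines`, the sample data, and the decision to skip lines that carry no vector elements are assumptions.

```python
import io
from typing import Dict, Iterable, List, Optional


def parse_embedding_lines(
    lines: Iterable[str], elem_delim: str = " "
) -> Dict[str, List[float]]:
    """Parse '[token][ed][v_1][ed]...[ed][v_k]' lines into a token -> vector map."""
    vectors: Dict[str, List[float]] = {}
    vec_len: Optional[int] = None
    for line_num, line in enumerate(lines, start=1):
        elems = line.rstrip("\n").split(elem_delim)
        if len(elems) < 2:
            continue  # assumption: skip blank lines or lines without a vector
        token, values = elems[0], [float(v) for v in elems[1:]]
        if vec_len is None:
            vec_len = len(values)  # k is fixed by the first valid line
        elif len(values) != vec_len:
            raise ValueError(
                f"line {line_num}: expected {vec_len} elements, got {len(values)}"
            )
        vectors[token] = values
    return vectors


# Hypothetical two-token file with k = 3 and the default delimiter.
sample = io.StringIO("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
vectors = parse_embedding_lines(sample)
print(vectors["world"])
```

To read a real file, the same function could be given a file object opened with the desired `encoding`.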

Parameters:

pretrained_file_path (str) – The path to the custom pre-trained token embedding file.
elem_delim (str, default ' ') – The delimiter for splitting a token and every embedding vector element value on the same line of the custom pre-trained token embedding file.
encoding (str, default 'utf8') – The encoding scheme for reading the custom pre-trained token embedding file.
init_unknown_vec (callback) – The callback used to initialize the embedding vector for the unknown token.
vocabulary (Vocabulary, default None) – The tokens to index. Each indexed token is associated with its embedding vector loaded from the pre-trained token embedding file. If None, all tokens in the loaded file are indexed.
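
The interplay between `vocabulary` and `init_unknown_vec` can be sketched as follows. This is a simplified plain-Python model, not MXNet code: the helper `index_tokens`, the `'<unk>'` token string, and placing the unknown token at index 0 are assumptions for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def zeros(vec_len: int) -> List[float]:
    """Stand-in for the default unknown-vector initializer."""
    return [0.0] * vec_len


def index_tokens(
    loaded: Dict[str, List[float]],
    vocabulary: Optional[Sequence[str]] = None,
    init_unknown_vec: Callable[[int], List[float]] = zeros,
    unknown_token: str = "<unk>",  # assumption: name of the unknown token
) -> Tuple[List[str], List[List[float]]]:
    """Return (idx_to_token, idx_to_vec) for the tokens to index."""
    vec_len = len(next(iter(loaded.values())))
    # With no vocabulary, index every token found in the loaded file.
    source = list(loaded) if vocabulary is None else list(vocabulary)
    idx_to_token = [unknown_token] + [t for t in source if t != unknown_token]
    unknown_vec = init_unknown_vec(vec_len)
    # Tokens absent from the loaded file fall back to the unknown vector.
    idx_to_vec = [list(loaded.get(t, unknown_vec)) for t in idx_to_token]
    return idx_to_token, idx_to_vec


loaded = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}
all_tokens, all_vecs = index_tokens(loaded)
some_tokens, some_vecs = index_tokens(loaded, vocabulary=["world", "unseen"])
print(all_tokens)
print(some_tokens)
```

Passing a vocabulary restricts the index to its tokens, while `vocabulary=None` indexes everything that was loaded.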

class mxnet.contrib.text.embedding.FastText(pretrained_file_name='wiki.simple.vec', embedding_root=None, init_unknown_vec=<function zeros>, vocabulary=None, **kwargs)

Bases: _TokenEmbedding

The fastText word embedding.

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. (Source: https://fasttext.cc/)

References

Enriching Word Vectors with Subword Information. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. https://arxiv.org/abs/1607.04606

Bag of Tricks for Efficient Text Classification. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. https://arxiv.org/abs/1607.01759

FastText.zip: Compressing text classification models. Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve Jegou, and Tomas Mikolov. https://arxiv.org/abs/1612.03651

For ‘wiki.multi’ embeddings: Word Translation Without Parallel Data. Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. https://arxiv.org/abs/1710.04087

Website:

https://fasttext.cc/

To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md

License for pre-trained embeddings:

https://creativecommons.org/licenses/by-sa/3.0/

Parameters:

pretrained_file_name (str, default 'wiki.simple.vec') – The name of the pre-trained token embedding file.
embedding_root (str, default $MXNET_HOME/embeddings) – The root directory for storing embedding-related files.
init_unknown_vec (callback) – The callback used to initialize the embedding vector for the unknown token.
vocabulary (Vocabulary, default None) – The tokens to index. Each indexed token is associated with its embedding vector loaded from the pre-trained token embedding file. If None, all tokens in the loaded file are indexed.

class mxnet.contrib.text.embedding.GloVe(pretrained_file_name='glove.840B.300d.txt', embedding_root=None, init_unknown_vec=<function zeros>, vocabulary=None, **kwargs)

Bases: _TokenEmbedding

The GloVe word embedding.

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. (Source: https://nlp.stanford.edu/projects/glove/)

References

GloVe: Global Vectors for Word Representation. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. https://nlp.stanford.edu/pubs/glove.pdf

Website:

https://nlp.stanford.edu/projects/glove/

To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://nlp.stanford.edu/projects/glove/

License for pre-trained embeddings:

https://fedoraproject.org/wiki/Licensing/PDDL

Parameters:

pretrained_file_name (str, default 'glove.840B.300d.txt') – The name of the pre-trained token embedding file.
embedding_root (str, default $MXNET_HOME/embeddings) – The root directory for storing embedding-related files.
init_unknown_vec (callback) – The callback used to initialize the embedding vector for the unknown token.
vocabulary (Vocabulary, default None) – The tokens to index. Each indexed token is associated with its embedding vector loaded from the pre-trained token embedding file. If None, all tokens in the loaded file are indexed.

mxnet.contrib.text.embedding.create(embedding_name, **kwargs)

Creates an instance of token embedding.

Creates a token embedding instance by loading embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText. To list all valid embedding_name and pretrained_file_name values, use mxnet.contrib.text.embedding.get_pretrained_file_names().

Parameters: embedding_name (str) – The token embedding name (case-insensitive).
Returns: A token embedding instance that loads embedding vectors from an externally hosted pre-trained token embedding file.
Return type: An instance of mxnet.contrib.text.embedding._TokenEmbedding

mxnet.contrib.text.embedding.get_pretrained_file_names(embedding_name=None)

Get valid token embedding names and their pre-trained file names.

To load token embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, one should use mxnet.contrib.text.embedding.create(embedding_name, pretrained_file_name). This method returns all the valid names of pretrained_file_name for the specified embedding_name. If embedding_name is set to None, this method returns all the valid names of embedding_name with their associated pretrained_file_name.

Parameters: embedding_name (str or None, default None) – The pre-trained token embedding name.
Returns: A list of all valid pre-trained token embedding file names (pretrained_file_name) for the specified token embedding name (embedding_name). If embedding_name is None, a dict mapping each valid token embedding name to a list of its valid pre-trained file names. These values can be passed to mxnet.contrib.text.embedding.create(embedding_name, pretrained_file_name).
Return type: dict or list
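
The dict-or-list return behaviour can be modelled with a short plain-Python sketch. It does not call MXNet: the function `pretrained_file_names`, the table `_PRETRAINED_FILES` and its abbreviated contents, and the case-insensitive lookup are assumptions; only the two file names are taken from the class defaults documented above.

```python
from typing import Dict, List, Optional, Union

# Hypothetical, abbreviated table; the real lists contain many more files.
_PRETRAINED_FILES: Dict[str, List[str]] = {
    "glove": ["glove.840B.300d.txt"],
    "fasttext": ["wiki.simple.vec"],
}


def pretrained_file_names(
    embedding_name: Optional[str] = None,
) -> Union[Dict[str, List[str]], List[str]]:
    """Return all names and files, or the files for one embedding name."""
    if embedding_name is None:
        # No name given: map every embedding name to its file list.
        return {name: list(files) for name, files in _PRETRAINED_FILES.items()}
    key = embedding_name.lower()  # assumption: lookup ignores case
    if key not in _PRETRAINED_FILES:
        raise KeyError(f"unknown token embedding name: {embedding_name!r}")
    return list(_PRETRAINED_FILES[key])


everything = pretrained_file_names()
glove_files = pretrained_file_names("GloVe")
print(sorted(everything))
print(glove_files)
```

A caller would typically pick one embedding name and one of its file names from this result and pass both to create().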

mxnet.contrib.text.embedding.register(embedding_cls)

Registers a new token embedding.

Once an embedding is registered, we can create an instance of this embedding with create().

Examples

>>> @mxnet.contrib.text.embedding.register
... class MyTokenEmbed(mxnet.contrib.text.embedding._TokenEmbedding):
...     def __init__(self, pretrained_file_name='my_pretrain_file'):
...         pass
>>> embed = mxnet.contrib.text.embedding.create('MyTokenEmbed')
>>> print(type(embed))
<class '__main__.MyTokenEmbed'>
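
For readers unfamiliar with the pattern, the relationship between register() and create() can be sketched as a decorator-based registry in plain Python. This is a stand-alone model, not MXNet's source: the `_REGISTRY` dict and keying classes by lower-cased class name are assumptions, and the class here does not derive from `_TokenEmbedding`.

```python
from typing import Any, Dict

_REGISTRY: Dict[str, type] = {}  # hypothetical name -> class table


def register(embedding_cls: type) -> type:
    """Class decorator: record the class under its lower-cased name."""
    _REGISTRY[embedding_cls.__name__.lower()] = embedding_cls
    return embedding_cls


def create(embedding_name: str, **kwargs: Any) -> Any:
    """Instantiate a registered class by (case-insensitive) name."""
    key = embedding_name.lower()
    if key not in _REGISTRY:
        raise KeyError(f"embedding {embedding_name!r} is not registered")
    return _REGISTRY[key](**kwargs)


@register
class MyTokenEmbed:
    def __init__(self, pretrained_file_name: str = "my_pretrain_file"):
        self.pretrained_file_name = pretrained_file_name


embed = create("mytokenembed", pretrained_file_name="other_file")
print(type(embed).__name__)
```

Because the decorator returns the class unchanged, a registered class can still be instantiated directly as well as through create().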