Adaptive Skip-gram implementation in Julia


The Adaptive Skip-gram (AdaGram) model is a nonparametric extension of the well-known Skip-gram model implemented in the word2vec software. It can learn multiple representations per word, each capturing a different word meaning. This project implements AdaGram in the Julia language.


AdaGram is not in the Julia package repository yet, so it should be installed in the following way:

using Pkg

Training a model

The most straightforward way to train a model is to use the training script. If you run it with no parameters or with the --help option, it prints usage information:

usage: train.jl [--window WINDOW] [--workers WORKERS]
                [--min-freq MIN-FREQ] [--remove-top-k REMOVE-TOP-K]
                [--dim DIM] [--prototypes PROTOTYPES] [--alpha ALPHA]
                [--d D] [--subsample SUBSAMPLE] [--context-cut]
                [--epochs EPOCHS] [--init-count INIT-COUNT]
                [--stopwords STOPWORDS]
                [--sense-treshold SENSE-TRESHOLD] [--regex REGEX] [-h]
                train dict output

Here is the description of all parameters:

  • WINDOW is the half-context size. Useful values are 3-10.
  • WORKERS is the number of parallel processes used for training.
  • MIN-FREQ specifies the minimum word frequency below which a word is ignored. Useful values are 5-50, depending on the corpus.
  • REMOVE-TOP-K additionally ignores the K most frequent words.
  • DIM is the dimensionality of the learned representations.
  • PROTOTYPES sets the maximum number of learned prototypes. This is the truncation level used in truncated stick-breaking, so memory usage grows linearly with this number.
  • ALPHA is the parameter of the underlying Dirichlet process. Larger values of ALPHA lead to more meanings being discovered. Useful values are 0.05-0.2.
  • D is used together with ALPHA in the Pitman-Yor process; D=0 turns it into a Dirichlet process. We couldn't get reasonable results with Pitman-Yor, but left the option to change D.
  • SUBSAMPLE is a threshold for subsampling frequent words, similar to how this is done in word2vec.
  • --context-cut randomly decreases WINDOW during training, which increases training speed with almost no effect on the model's performance.
  • EPOCHS specifies the number of passes over the training text. One epoch is usually enough; more epochs are usually required only on small corpora.
  • INIT-COUNT is used to initialize the variational stick-breaking distribution. All prototypes are assigned zero occurrences except the first one, which is assigned INIT-COUNT. A zero value means that the first prototype gets all occurrences.
  • STOPWORDS is a path to a newline-separated file listing words that must be ignored during training.
  • SENSE-THRESHOLD (spelled --sense-treshold in the usage output above) sparsifies gradients to speed up training. If the posterior probability of a prototype is below this threshold, it does not contribute to the parameters' gradients.
  • REGEX is used to filter out words from the provided dictionary that do not match it.
  • train — path to training text (see Format section below)
  • dict — path to dictionary file (see Format section below)
  • output — path for saving trained model.
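
To get a feel for how ALPHA, D and PROTOTYPES interact, the sketch below computes the expected prior weights of a truncated stick-breaking process. It assumes the standard construction with stick proportions drawn from Beta(1 - D, ALPHA + k*D), and it shows the prior only, not the variational posterior that training fits. The function name `expected_prior_weights` is hypothetical and not part of AdaGram.

```julia
# Expected prior sense weights under truncated stick-breaking.
# Assumption: beta_k ~ Beta(1 - d, alpha + k*d) (Pitman-Yor; d = 0 gives
# the Dirichlet process). Prior only, not AdaGram's learned posterior.
function expected_prior_weights(alpha::Real, d::Real, prototypes::Int)
    weights = zeros(prototypes)
    remaining = 1.0                      # expected stick length not yet assigned
    for k in 1:prototypes
        a = 1.0 - d
        b = alpha + k * d
        mean_beta = a / (a + b)          # mean of Beta(a, b)
        # the last prototype absorbs whatever is left (truncation)
        weights[k] = k == prototypes ? remaining : remaining * mean_beta
        remaining -= weights[k]
    end
    return weights
end

# Larger alpha moves more prior mass to later prototypes.
for alpha in (0.05, 0.2)
    prior = expected_prior_weights(alpha, 0.0, 5)
    println("alpha = ", alpha, " -> ", round.(prior; digits=4))
end
```

With D=0 the weights decay geometrically, which is one way to see why larger ALPHA values tend to produce more meanings per word.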

Input format

Training text should be formatted as for word2vec. Words are case-sensitive and are assumed to be separated by space characters. All punctuation should be removed unless it is specifically intended to be preserved. You may use utils/ INPUT_FILE OUTPUT_FILE for simple tokenization with UNIX utils.
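
If you prefer to prepare the text from Julia, a minimal tokenizer could look like the sketch below. It is not a port of the script in utils/; `simple_tokenize` and `tokenize_file` are hypothetical helpers, and the rule "everything that is not a letter, digit or whitespace becomes a space" is an assumption about what counts as punctuation.

```julia
# Minimal tokenizer sketch: replace punctuation with spaces and collapse
# whitespace. Case is left untouched because words are case-sensitive.
# Both helpers are hypothetical and not part of AdaGram.
function simple_tokenize(line::AbstractString)
    # keep letters, digits and whitespace; everything else becomes a space
    cleaned = replace(line, r"[^\p{L}\p{N}\s]" => " ")
    return join(split(cleaned), " ")
end

# Stream a raw text file into a tokenized one, line by line.
function tokenize_file(input_path::AbstractString, output_path::AbstractString)
    open(output_path, "w") do out
        for line in eachline(input_path)
            println(out, simple_tokenize(line))
        end
    end
end

println(simple_tokenize("Hello, world! It's 2015."))  # prints "Hello world It s 2015"
```

Note that this splits contractions such as "It's" into two tokens; adjust the regular expression if such characters should be preserved.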

In order to train a model you should also provide a dictionary file with word frequency statistics in the following format:

word1   34
word2   456
wordN   83

AdaGram will assume that the provided word frequencies were actually obtained from the training file. You may build a dictionary file using utils/ INPUT_FILE DICT_FILE.
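
As an alternative, the dictionary can be built directly in Julia. The sketch below counts words in an already tokenized file and writes one word and its count per line. `count_words` and `build_dictionary` are hypothetical helpers, and both the tab separator and the frequency ordering are assumptions rather than documented requirements.

```julia
# Count word frequencies in a tokenized text file.
# Hypothetical helper, not part of AdaGram.
function count_words(input_path::AbstractString)
    counts = Dict{String,Int}()
    for line in eachline(input_path)
        for word in split(line)
            counts[word] = get(counts, word, 0) + 1
        end
    end
    return counts
end

# Write one "word<TAB>count" pair per line (separator is an assumption).
function build_dictionary(input_path::AbstractString, dict_path::AbstractString)
    counts = count_words(input_path)
    open(dict_path, "w") do out
        # most frequent words first; the ordering is only a convenience
        for (word, n) in sort(collect(counts); by = last, rev = true)
            println(out, word, "\t", n)
        end
    end
    return counts
end
```

Calling `build_dictionary("INPUT_FILE", "DICT_FILE")` on the same file that is later used for training keeps the frequencies consistent with the training text.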

Playing with a model

After the model is trained, you may use the learned word vectors in the same way as those learned by word2vec. However, since AdaGram learns several vectors for each word, you may first need to disambiguate a word using its context in order to determine which vector should be used.

First, load the model and the dictionary:

julia> using AdaGram
julia> vm, dict = load_model("PATH_TO_THE_MODEL");

To examine how many prototypes were learned for a word, use the expected_pi function:

julia> expected_pi(vm, dict.word2id["apple"])
30-element Array{Float64,1}:

This function returns a --prototypes-sized array with the prior probability of each prototype. In this example, only the first two prototypes have probabilities significantly larger than zero, so we may conclude that only two meanings of the word "apple" were discovered. We may examine each prototype by looking at its 10 nearest neighbours:

julia> nearest_neighbors(vm, dict, "apple", 1, 10)
10-element Array{(Any,Any,Any),1}:
julia> nearest_neighbors(vm, dict, "apple", 2, 10)
10-element Array{(Any,Any,Any),1}:
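
To illustrate what a nearest-neighbour query does conceptually, here is a self-contained sketch that ranks toy vectors by cosine similarity. The choice of cosine similarity is an assumption, the vectors are made up, and `top_k_neighbors` is a hypothetical helper rather than the package's nearest_neighbors function.

```julia
using LinearAlgebra

# Cosine similarity between two vectors (assumed metric for this sketch).
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Rank candidate vectors by similarity to a query and keep the top k.
# Hypothetical helper operating on toy data, not on an AdaGram model.
function top_k_neighbors(query::AbstractVector, candidates::Dict{String,Vector{Float64}}, k::Int)
    scored = [(word, cosine(query, v)) for (word, v) in candidates]
    sort!(scored; by = last, rev = true)
    return scored[1:min(k, length(scored))]
end

# Made-up 2-dimensional vectors standing in for learned prototypes.
toy = Dict("fruit" => [1.0, 0.1], "pear" => [0.9, 0.2], "phone" => [0.1, 1.0])
for (word, sim) in top_k_neighbors([1.0, 0.0], toy, 2)
    println(word, "  ", round(sim; digits=3))
end
```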

Now, if we provide a context for the word "apple", we may obtain the posterior probability of each prototype:

julia> disambiguate(vm, dict, "apple", split("new iphone was announced today"))
30-element Array{Float64,1}:
julia> disambiguate(vm, dict, "apple", split("fresh tasty breakfast"))
30-element Array{Float64,1}:

As one may see, the model correctly estimated the probability of each sense with high confidence. The vector corresponding to the second prototype of the word "apple" can be obtained from vm.In[:, 2, dict.word2id["apple"]] and then used as context-aware features for the word "apple".
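
The idea behind disambiguation can be sketched with Bayes' rule: the posterior of a sense is proportional to its prior probability times the likelihood of the context under that sense. The example below uses made-up numbers and a random array in place of vm.In; `sense_posterior` is a hypothetical helper, and the way the package computes the likelihood is not reproduced here.

```julia
# Posterior over senses: p(sense | context) ∝ p(sense) * p(context | sense).
# Hypothetical helper; all numbers below are made up for illustration.
function sense_posterior(prior::AbstractVector{<:Real}, loglik::AbstractVector{<:Real})
    logpost = log.(prior) .+ loglik
    logpost .-= maximum(logpost)        # subtract the max for numerical stability
    post = exp.(logpost)
    return post ./ sum(post)
end

prior  = [0.35, 0.65]                   # hypothetical prior sense probabilities
loglik = [-12.0, -20.0]                 # hypothetical log p(context | sense)
post   = sense_posterior(prior, loglik)
best   = argmax(post)                   # index of the most probable sense

# Toy stand-in for vm.In, indexed as [dimension, prototype, word id].
toy_In  = rand(4, 2, 3)
word_id = 1
feature = toy_In[:, best, word_id]      # context-aware vector for the word
println("posterior = ", round.(post; digits=4), ", chosen sense = ", best)
```

With a trained model, the same selection step would use the output of disambiguate in place of `post` and vm.In in place of `toy_In`.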

Please refer to the API documentation for more detailed usage information.

Future work

  • Full API documentation
  • C and python bindings
  • Disambiguation into user-provided sense inventory


References

  1. Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, Dmitry Vetrov. Breaking Sticks and Ambiguities with Adaptive Skip-gram. ArXiv preprint, 2015.
  2. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.