The literate "Kong Yiji" (孔乙己) -- a simple, fine-tuned Chinese tokenizer
- Trained on Chinese Treebank 8.0 (CTB). Now at version 1, which uses an extended word-level Hidden Markov Model (HMM), in contrast to the earlier character-level HMM.
- Fine-tuned to handle out-of-vocabulary (OOV) words (未登录词, 网络新词: unregistered words and new internet slang). If the algorithm cannot find them, add them to the user dictionary (see Constructor) and adjust user_dict_weight if necessary.
- Debug information is fully exported through the functions below:
  - postable : table of part-of-speech (POS) tags used in CTB
  - h2vtable : table of hidden states (POS tags) to visible observations (words), i.e., the emission matrix
  - v2htable : the reverse of the above
  - h2htable : table of hidden to hidden, i.e., the transition matrix
  - hprtable : table of priors over hidden states, i.e., the prior probabilities
 
- Digit characters are masked to reduce parameter overfitting.
- Less discriminative word-to-postag probabilities are removed (only the top 2 are kept for each word).
- POS tag neural language model (RNN) to model an unbounded history of POS tags.
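The digit masking and top-2 pruning mentioned in the list above can be illustrated with a minimal sketch. This is not the package's own code: `DIGIT_MASK`, `mask_digits`, and `keep_top_tags` are hypothetical names, the placeholder character is an assumption, and the per-word tag counts are made-up toy data.

```julia
# Hypothetical helpers for illustration only; not part of the package API.

# Assumed placeholder character; the package may mask digits differently.
const DIGIT_MASK = '0'

# Replace every ASCII or full-width digit with DIGIT_MASK, so that tokens
# such as "2019年" and "1987年" collapse into one vocabulary entry.
function mask_digits(word::AbstractString)
    is_digit_char(c) = isdigit(c) || ('０' <= c <= '９')
    return map(c -> is_digit_char(c) ? DIGIT_MASK : c, word)
end

# Keep only the k most frequent postags for one word (k = 2 mirrors "top 2").
function keep_top_tags(tag_counts::Dict{String,Int}, k::Int=2)
    ranked = sort(collect(tag_counts); by = last, rev = true)
    return Dict(ranked[1:min(k, length(ranked))])
end

masked = mask_digits("2019年")
pruned = keep_top_tags(Dict("NN" => 50, "VV" => 30, "JJ" => 2))
```

With this toy input, `masked` is `"0000年"` and `pruned` keeps only the `NN` and `VV` entries. Dropping rare tags trades a little recall for a smaller table of word-to-postag parameters.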
kong(; user_dict_path="", user_dict_array=[], user_dict_weight=1)

- user_dict_path : path to a user dictionary file; each line holds one word, optionally preceded by a part-of-speech tag (postag). If the postag is not supplied, NR (proper noun, 专有名词) is inserted automatically.
- user_dict_array : a Vector{Tuple{String, String}} representing [(postag, word)]
- user_dict_weight : if the value is m, the frequency of (postag, word) in the user dictionary will be $ m * maximum(values(h2v[postag])) $

Note that all user-supplied postags MUST conform to the Chinese Treebank specification.
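As a rough illustration of the parameters above, the sketch below parses user dictionary lines into (postag, word) tuples and applies the weight formula. It is not the package's implementation: `parse_user_dict_line` and `user_entry_frequency` are hypothetical names, the whitespace separator between postag and word is an assumption, and `h2v` is a toy table whose nested-dictionary layout is also assumed.

```julia
# Hypothetical helpers for illustration only; not part of the package API.

# Parse one user dictionary line into (postag, word).
# Assumption: postag and word are separated by whitespace.
function parse_user_dict_line(line::AbstractString)
    fields = split(strip(line))
    if length(fields) == 1
        return ("NR", String(fields[1]))   # no postag given: default to NR
    elseif length(fields) == 2
        return (String(fields[1]), String(fields[2]))
    else
        error("expected `word` or `postag word`, got: $line")
    end
end

# Frequency given to a user entry: m * maximum(values(h2v[postag])).
user_entry_frequency(h2v, postag, m) = m * maximum(values(h2v[postag]))

# Toy emission counts (made-up numbers), laid out as postag => (word => count).
h2v = Dict(
    "NR" => Dict("北京" => 120, "鲁迅" => 40),
    "NN" => Dict("工具" => 75),
)

entries = [parse_user_dict_line(l) for l in ["孔乙己", "NN 分词器"]]
freq = user_entry_frequency(h2v, "NR", 2)   # 2 * 120
```

Here `entries` has the same shape as `user_dict_array`, and `freq` shows why a weight of 1 already makes a user entry as frequent as the most common word carrying that tag.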
See test/runtests.jl
- Filter low-frequency words from CTB
- Expand the summary of the POS table: add an example column and contrast it with other POS schemes (PKU, etc.)
- Explore maximum entropy and CRF-related algorithms