langid: Language Detection Based on a Machine Learning Model

Determining which language a piece of text is written in

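Motivation

I recently scraped a large amount of text from the web in several languages, including Chinese, English, and Japanese, and I only want to keep the Chinese text. English is easy to deal with: filtering on the Latin alphabet is enough. My first idea was to find the range that Chinese characters occupy in Unicode and filter on that, but all I could find was the CJK character set, which is shared by Chinese, Japanese, and Korean, so a range check alone cannot tell them apart.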
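To make the problem concrete, here is a minimal sketch of the naive range check. The function name `has_cjk_ideograph` and the sample strings are my own illustration, and the check only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), not the extension blocks.

```python
def has_cjk_ideograph(text):
    """Return True if text contains a character from the basic
    CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


# Hypothetical sample texts, one per language.
samples = {
    "chinese": "我喜欢读书",
    "japanese": "日本語を勉強します",
    "english": "I like reading",
}

for name, text in samples.items():
    print(name, has_cjk_ideograph(text))
# chinese True
# japanese True
# english False
```

The Japanese sentence passes the check because its kanji live in the same block as Chinese characters, which is why a range filter cannot isolate Chinese text.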

langid.py: Language Detection Based on a Machine Learning Model

GitHub repository

https://github.com/saffsd/langid.py

Usage

>>> import langid
>>> langid.classify("I do not speak english")
('en', 0.57133487679900674)
>>> langid.set_languages(['de','fr','it'])
>>> langid.classify("I do not speak english")
('it', 0.99999835791478453)
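Applied to the original problem, the same `classify` call can be used to keep only Chinese text. This is a minimal sketch: the helper `keep_chinese` is my own name, not part of langid, and it assumes `'zh'` is the code langid returns for Chinese. The classifier is passed in as a parameter so the filter can also be tried with a stand-in function.

```python
def keep_chinese(texts, classify=None):
    """Return only the texts whose predicted language code is 'zh'.

    classify is any callable returning a (language_code, score) tuple;
    by default langid.classify is used.
    """
    if classify is None:
        import langid  # third-party package: pip install langid
        classify = langid.classify
    return [t for t in texts if classify(t)[0] == "zh"]


# Example call (requires langid to be installed):
#   keep_chinese(["今天天气很好", "I do not speak english"])
```

Short texts give the classifier little evidence, so it may be worth checking the second element of the tuple before trusting a prediction. The scale of that score can differ between langid versions, so treat any threshold as something to tune on your own data.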

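How it works

1) Model
A naive Bayes model with a multinomial distribution (multinomial naive Bayes).
The detailed derivation will be added later.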
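Until the derivation is written up, the decision rule can be sketched in a few lines: pick the language that maximises log P(lang) plus the sum of log P(feature | lang) over the features in the text. The code below is a toy illustration only, not langid.py's implementation. The training sentences, the character bigram features, and the add-one smoothing are all my own assumptions.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labelled_texts, n=2):
    """Count n-grams per language; returns (counts, doc_totals, vocabulary)."""
    counts = defaultdict(Counter)
    docs = Counter()
    for lang, text in labelled_texts:
        counts[lang].update(char_ngrams(text, n))
        docs[lang] += 1
    vocab = {gram for c in counts.values() for gram in c}
    return counts, docs, vocab


def classify(text, model, n=2):
    """Return (language, log score) with the highest multinomial NB score."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_lang, best_score = None, -math.inf
    for lang in counts:
        score = math.log(docs[lang] / total_docs)        # log prior
        denom = sum(counts[lang].values()) + len(vocab)  # add-one smoothing
        for gram in char_ngrams(text, n):
            if gram in vocab:  # n-grams never seen in training are ignored
                score += math.log((counts[lang][gram] + 1) / denom)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang, best_score


# Toy training data (assumption: four short sentences, two per language).
model = train([
    ("en", "the cat sat on the mat"),
    ("en", "this is the thing"),
    ("de", "der hund und die katze"),
    ("de", "das ist ein kleines haus"),
])

print(classify("the thing is on the mat", model)[0])  # en
print(classify("die katze und das haus", model)[0])   # de
```

Looping over every n-gram occurrence is equivalent to multiplying each log-likelihood by its count, which is what makes the model multinomial rather than Bernoulli.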

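2) Feature engineering
This part matters more. It appears to use n-grams together with information gain for feature selection, plus a few other tricks. Details will be added later.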
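As a placeholder for those details, here is a minimal sketch of plain information gain used to rank candidate n-grams. It treats each feature as a binary "occurs in the text" indicator and measures how much knowing it reduces the entropy of the language label. The toy documents and function names are my own assumptions, and the cross-domain refinements from the cited papers are not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(docs, feature):
    """IG of 'feature occurs in the text' with respect to the language label."""
    labels = [lang for lang, _ in docs]
    present = [lang for lang, text in docs if feature in text]
    absent = [lang for lang, text in docs if feature not in text]
    n = len(docs)
    conditional = ((len(present) / n) * entropy(present)
                   + (len(absent) / n) * entropy(absent))
    return entropy(labels) - conditional


# Toy labelled documents (assumption: same four sentences, two per language).
docs = [
    ("en", "the cat sat on the mat"),
    ("en", "this is the thing"),
    ("de", "der hund und die katze"),
    ("de", "das ist ein kleines haus"),
]

# Rank a few candidate bigrams by information gain, highest first.
candidates = ["th", "is", "e "]
ranked = sorted(candidates, key=lambda g: information_gain(docs, g), reverse=True)
print(ranked)  # ['th', 'e ', 'is']
```

Here "th" appears in every English sentence and in no German one, so it gets the maximum gain of 1 bit, while "is" appears once in each language and tells us nothing. Feature selection keeps the top-ranked n-grams and discards the rest.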

References

These are the papers listed in the original repo. Language detection as a task would be worth a survey when I have time.

[1] Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553-56

[2] Lui, Marco and Timothy Baldwin (2012) langid.py: An Off-the-shelf Language Identification Tool, In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Demo Session, Jeju, Republic of Korea.

[3] Kenneth Heafield and Rohan Kshirsagar and Santiago Barona (2015) Language Identification and Modeling in Specialized Hardware, In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers).