标签归档:Keras

深度学习实践:从零开始做电影评论文本情感分析

Deep Learning Specialization on Coursera

最近读了《Python深度学习》, 是一本好书,很棒,隆重推荐。

本书由Keras之父、现任Google人工智能研究员的弗朗索瓦•肖莱(François Chollet)执笔,详尽介绍了用Python和Keras进行深度学习的探索实践,涉及计算机视觉、自然语言处理、生成式模型等应用。书中包含30多个代码示例,步骤讲解详细透彻。由于本书立足于人工智能的可达性和大众化,读者无须具备机器学习相关背景知识即可展开阅读。在学习完本书后,读者将具备搭建自己的深度学习环境、建立图像识别模型、生成图像和文字等能力。

各方面都很好,但是总感觉哪里有点欠缺,后来想想,可能是作者做得太好了,把数据预处理都做得好好的,所以你才能“20行搞定情感分析”,这可能也是学习其他深度学习工具过程中要面临的一个问题,很多工具都提供了预处理好的数据,导致学习过程中只需要调用相关接口即可。不过在实际工作中,数据的预处理是非常重要的,从数据获取,到数据清洗,再到基本的数据处理,例如中文需要分词,英文需要Tokenize, Truecase或者Lowercase等,还有去停用词等等,在将数据“喂”给工具之前,有很多事情要做。这个部分,貌似是当前一些教程有所欠缺的地方,所以才有了这个“从零开始做”的想法和系列,准备弥补一下这个缺失,第一个例子就拿《Python深度学习》这本书第一个文本挖掘例子练手:电影评论文本分类-二分类问题,这也可以归结为一个情感分析任务。

首先介绍一下这个原始的电影评论数据集aclIMDB: Large Movie Review Dataset, 这个数据集由斯坦福大学人工智能实验室于2011年推出,包含25000条训练数据和25000条测试数据,另外包含约50000条没有标签的辅助数据。训练集和测试集又分别包含12500条正例(正向评价pos)和12500负例(负向评价neg)。2018免费送彩金游戏数据,更详细的介绍可参考该数据集的官网:http://ai.stanford.edu/~amaas/data/sentiment/, paper: Learning Word Vectors for Sentiment Analysis, 和数据集里的readme。

然后下载和处理这份数据:Large Movie Review Dataset v1.0,下载链接;

http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

下载之后进行解压:tar -zxvf aclImdb.tar.gz,可以用tree命令看一下aclImdb的目录结构:

tree aclImdb -L 2

继续进入训练集正例的目录看一下: cd aclImdb/train/pos/:

这个里面包含了12500篇英文评论,我们随机打开一个看一下里面的文本内容:

vim 1234_10.txt

I grew up watching this movie ,and I still love it just as much today as when i was a kid. Don't listen to the critic reviews. They are not accurate on this film.Eddie Murphy really shines in his roll.You can sit down with your whole family and everybody will enjoy it.I recommend this movie to everybody to see. It is a comedy with a touch of fantasy.With demons ,dragons,and a little bald kid with God like powers.This movie takes you from L.A. to Tibet , of into the amazing view of the wondrous temples of the mountains in Tibet.Just a beautiful view! So go do your self a favor and snatch this one up! You wont regret it!

继续阅读

从零开始搭建深度学习服务器: 1080TI四卡并行(Ubuntu16.04+CUDA9.2+cuDNN7.1+TensorFlow+Keras)

Deep Learning Specialization on Coursera

这个系列写了好几篇文章,这是相关文章的索引,仅供参考:

最近公司又弄了一套4卡1080TI机器,配置基本上和之前是一致的,只是显卡换成了技嘉的伪公版1080TI:技嘉GIGABYTE GTX1080Ti 涡轮风扇108TTURBO-11GD

部件	型号	价格	链接	备注CPU	英特尔(Intel)酷睿六核i7-6850K 盒装CPU处理器 	4599	http://item.jd.com/11814000696.html	散热器	美商海盗船 H55 水冷	449	https://item.jd.com/10850633518.html	主板	华硕(ASUS)华硕 X99-E WS/USB 3.1工作站主板	4759	内存	美商海盗船(USCORSAIR) 复仇者LPX DDR4 3000 32GB(16Gx4条)  	2799 * 2	https://item.jd.com/1990572.html	SSD	三星(SAMSUNG) 960 EVO 250G M.2 NVMe 固态硬盘	599	https://item.jd.com/3739097.html		硬盘	希捷(SEAGATE)酷鱼系列 4TB 5900转 台式机机械硬盘 * 2 	629 * 2	https://item.jd.com/4220257.html	电源	美商海盗船 AX1500i 全模组电源 80Plus金牌	3699	https://item.jd.com/10783917878.html机箱	美商海盗船 AIR540 USB3.0 	949	http://item.jd.com/12173900062.html显卡	技嘉(GIGABYTE) GTX1080Ti 11GB 非公版高端游戏显卡深度学习涡轮 * 4 7400 * 4    https://item.jd.com/10583752777.html

这台深度学习主机大概是这样的:

深度学习主机

安装完Ubuntu16.04之后,我又开始了CUDA、cuDnn等深度学习环境和工具的安装之旅,时隔大半年,又有了很多变化,特别是CUDA9.x和cuDnn7.x已经成了标配,这里记录一下。

安装CUDA9.x

注:如果还需要安装Tensorflow1.8,建议这里安装CUDA9.0,我在另一台机器上遇到了一点问题,怀疑和我这台机器先安装CUDA9.0,再安装CUDA9.2有关。

依然从英伟达官方下载当前的CUDA版本,我选择了最新的CUDA9.2:

点选完对应Ubuntu16.04的CUDA9.2 deb版本之后,英伟达官方主页会给出安装提示:

Installation Instructions:
`sudo dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb`
`sudo apt-key add /var/cuda-repo-/7fa2af80.pub`
`sudo apt-get update`
`sudo apt-get install cuda`

在下载完大概1.2G的cuda deb版本之后,实际安装命令是这样的:

sudo dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.debsudo apt-key add /var/cuda-repo-9-2-local/7fa2af80.pubsudo apt-get updatesudo apt-get install cuda

官方CUDA下载下载页面还附带了一个cuBLAS 9.2 Patch更新,官方强烈建议安装:

This update includes fix to cublas GEMM APIs on V100 Tensor Core GPUs when used with default algorithm CUBLAS_GEMM_DEFAULT_TENSOR_OP. We strongly recommend installing this update as part of CUDA Toolkit 9.2 installation.

可以用如下方式安装这个Patch更新:

sudo dpkg -i cuda-repo-ubuntu1604-9-2-local-cublas-update-1_1.0-1_amd64.deb sudo apt-get update  sudo apt-get upgrade cuda

CUDA9.2安装完毕之后,1080TI的显卡驱动也附带安装了,可以重启机器,然后用 nvidia-smi 命令查看一下:

最后在在 ~/.bashrc 中设置环境变量:

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}export CUDA_HOME=/usr/local/cuda

运行 source ~/.bashrc 使其生效。

安装cuDNN7.x

同样去英伟达官网的cuDNN下载页面:https://developer.nvidia.com/rdp/cudnn-download,最新版本是cuDNN7.1.4,有三个版本可以选择,分别面向CUDA8.0, CUDA9.0, CUDA9.2:

cudnn7.1.4 cuda9.2 ubuntu16.04

下载完cuDNN7.1的压缩包之后解压,然后将相关文件拷贝到cuda的系统路径下即可:

tar -zxvf cudnn-9.2-linux-x64-v7.1.tgzsudo cp cuda/include/cudnn.h /usr/local/cuda/include/sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/ -d sudo chmod a+r /usr/local/cuda/include/cudnn.hsudo chmod a+r /usr/local/cuda/lib64/libcudnn*

安装TensorFlow 1.8

TensorFlow的安装变得越来越简单,现在TensorFlow的官网也有中文安装文档了:https://www.tensorflow.org/install/install_linux?hl=zh-cn , 我们Follow这个文档,用Virtualenv的安装方式进行TensorFlow的安装,不过首先要配置一下基础环境。

首先在Ubuntu16.04里安装 libcupti-dev 库:

这是 NVIDIA CUDA 分析工具接口。此库提供高级分析支持。要安装此库,请针对 CUDA 工具包 8.0 或更高版本发出以下命令:

$ sudo apt-get install cuda-command-line-tools
并将其路径添加到您的 LD_LIBRARY_PATH 环境变量中:

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64
对于 CUDA 工具包 7.5 或更低版本,请发出以下命令:

$ sudo apt-get install libcupti-dev

然而我运行“sudo apt-get install cuda-command-line-tools”命令后得到的却是:

E: 无法定位软件包 cuda-command-line-tools

Google后发现其实在安装CUDA9.2的时候,这个包已经安装了,在CUDA的路径下这个库已经有了:

/usr/local/cuda/extras/CUPTI/lib64$ lslibcupti.so  libcupti.so.9.2  libcupti.so.9.2.88

现在只需要将其加入到环境变量中,在~/.bashrc中添加如下声明并令source ~/.bashrc另其生效即可:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64

剩下的就更简单了:

sudo apt-get install python-pip python-dev python-virtualenv virtualenv --system-site-packages tensorflow1.8source tensorflow1.8/bin/activateeasy_install -U pippip install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade tensorflow-gpu

强烈建议将清华的pip源写到配置文件里,这样就更方便快捷了。

最后测试一下TensorFlow1.8:

Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2018-06-17 12:15:34.158680: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-17 12:15:34.381812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:05:00.0
totalMemory: 10.91GiB freeMemory: 5.53GiB
2018-06-17 12:15:34.551451: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
totalMemory: 10.92GiB freeMemory: 5.80GiB
2018-06-17 12:15:34.780350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
totalMemory: 10.92GiB freeMemory: 5.80GiB
2018-06-17 12:15:34.959199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 3 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
totalMemory: 10.92GiB freeMemory: 5.80GiB
2018-06-17 12:15:34.966403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2018-06-17 12:15:36.373745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-17 12:15:36.373785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3 
2018-06-17 12:15:36.373798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y Y Y 
2018-06-17 12:15:36.373804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N Y Y 
2018-06-17 12:15:36.373808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   Y Y N Y 
2018-06-17 12:15:36.373814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   Y Y Y N 
2018-06-17 12:15:36.374516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5307 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2018-06-17 12:15:36.444426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 5582 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
2018-06-17 12:15:36.506340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 5582 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2018-06-17 12:15:36.614736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 5582 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1
2018-06-17 12:15:36.689345: I tensorflow/core/common_runtime/direct_session.cc:284] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1

安装Keras2.1.x

Keras的后端支持TensorFlow, Theano, CNTK,在安装完TensorFlow GPU版本之后,继续安装Keras非常简单,在TensorFlow的虚拟环境中,直接"pip install keras"即可,安装的版本是Keras2.1.6:

Installing collected packages: h5py, scipy, pyyaml, keras
Successfully installed h5py-2.7.1 keras-2.1.6 pyyaml-3.12 scipy-1.1.0

测试一下:

Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import keras
Using TensorFlow backend.

注:原创文章,转载请注明出处及保留链接“我爱自然语言处理”:

本文链接地址:从零开始搭建深度学习服务器: 1080TI四卡并行(Ubuntu16.04+CUDA9.2+cuDNN7.1+TensorFlow+Keras) /?p=10334

自然语言处理工具包spaCy介绍

Deep Learning Specialization on Coursera

spaCy 是一个Python自然语言处理工具包,诞生于2014年年中,号称“Industrial-Strength Natural Language Processing in Python”,是具有工业级强度的Python NLP工具包。spaCy里大量使用了 Cython 来提高相关模块的性能,这个区别于学术性质更浓的Python NLTK,因此具有了业界应用的实际价值。

安装和编译 spaCy 比较方便,在ubuntu环境下,直接用pip安装即可:

sudo apt-get install build-essential python-dev git
sudo pip install -U spacy

不过安装完毕之后,需要下载相关的模型数据,以英文模型数据为例,可以用"all"参数下载所有的数据:

sudo python -m spacy.en.download all

或者可以分别下载相关的模型和用glove训练好的词向量数据:


# 这个过程下载英文tokenizer,词性标注,句法分析,命名实体识别相关的模型
python -m spacy.en.download parser

# 这个过程下载glove训练好的词向量数据
python -m spacy.en.download glove

下载好的数据放在spacy安装目录下的data里,以我的ubuntu为例:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data$ du -sh *
776M en-1.1.0
774M en_glove_cc_300_1m_vectors-1.0.0

进入到英文数据模型下:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0$ du -sh *
424M deps
8.0K meta.json
35M ner
12M pos
84K tokenizer
300M vocab
6.3M wordnet

可以用如下命令检查模型数据是否安装成功:


textminer@textminer:~$ python -c "import spacy; spacy.load('en'); print('OK')"
OK

也可以用pytest进行测试:


# 首先找到spacy的安装路径:
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
/usr/local/lib/python2.7/dist-packages/spacy

# 再安装pytest:
sudo python -m pip install -U pytest

# 最后进行测试:
python -m pytest /usr/local/lib/python2.7/dist-packages/spacy --vectors --model --slow
============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-3.0.4, py-1.4.31, pluggy-0.4.0
rootdir: /usr/local/lib/python2.7/dist-packages/spacy, inifile:
collected 318 items

../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_matcher.py ........
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_entity_id.py ....
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_matcher_bugfixes.py .....
......
../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_vocab.py .......Xx
../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_api.py x...............
../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_home.py ............

============== 310 passed, 5 xfailed, 3 xpassed in 53.95 seconds ===============

现在可以快速测试一下spaCy的相关功能,我们以英文数据为例,spaCy目前主要支持英文和德文,对其他语言的支持正在陆续加入:


textminer@textminer:~$ ipython
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]: import spacy

# 加载英文模型数据,稍许等待
In [2]: nlp = spacy.load('en')

Word tokenize功能,spaCy 1.2版本加了中文tokenize接口,基于Jieba中文分词:

In [3]: test_doc = nlp(u"it's word tokenize test for spacy")

In [4]: print(test_doc)
it's word tokenize test for spacy

In [5]: for token in test_doc:
print(token)
...:
it
's
word
tokenize
test
for
spacy

英文断句:


In [6]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [7]: for sent in test_doc.sents:
print(sent)
...:
Natural language processing (NLP) deals with the application of computational models to text or speech data.
Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways.
NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form.
From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.


词干化(Lemmatize):


In [8]: test_doc = nlp(u"you are best. it is lemmatize test for spacy. I love these books")

In [9]: for token in test_doc:
print(token, token.lemma_, token.lemma)
...:
(you, u'you', 472)
(are, u'be', 488)
(best, u'good', 556)
(., u'.', 419)
(it, u'it', 473)
(is, u'be', 488)
(lemmatize, u'lemmatize', 1510296)
(test, u'test', 1351)
(for, u'for', 480)
(spacy, u'spacy', 173783)
(., u'.', 419)
(I, u'i', 570)
(love, u'love', 644)
(these, u'these', 642)
(books, u'book', 1011)

词性标注(POS Tagging):


In [10]: for token in test_doc:
print(token, token.pos_, token.pos)
....:
(you, u'PRON', 92)
(are, u'VERB', 97)
(best, u'ADJ', 82)
(., u'PUNCT', 94)
(it, u'PRON', 92)
(is, u'VERB', 97)
(lemmatize, u'ADJ', 82)
(test, u'NOUN', 89)
(for, u'ADP', 83)
(spacy, u'NOUN', 89)
(., u'PUNCT', 94)
(I, u'PRON', 92)
(love, u'VERB', 97)
(these, u'DET', 87)
(books, u'NOUN', 89)

命名实体识别(NER):


In [11]: test_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")

In [12]: for ent in test_doc.ents:
print(ent, ent.label_, ent.label)
....:
(Rami Eid, u'PERSON', 346)
(Stony Brook University, u'ORG', 349)
(New York, u'GPE', 350)

名词短语提取:


In [13]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [14]: for np in test_doc.noun_chunks:
print(np)
....:
Natural language processing
Natural language processing (NLP) deals
the application
computational models
text
speech
data
Application areas
NLP
automatic (machine) translation
languages
dialogue systems
a human
a machine
natural language
information extraction
the goal
unstructured text
structured (database) representations
flexible ways
NLP technologies
a dramatic impact
the way
people
computers
the way
people
the use
language
the way
people
the vast amount
linguistic data
electronic form
a scientific viewpoint
NLP
fundamental questions
formal models
example
natural language phenomena
algorithms
these models

基于词向量计算两个单词的相似度:


In [15]: test_doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

In [16]: apples = test_doc[0]

In [17]: print(apples)
Apples

In [18]: oranges = test_doc[2]

In [19]: print(oranges)
oranges

In [20]: boots = test_doc[6]

In [21]: print(boots)
Boots

In [22]: hippos = test_doc[8]

In [23]: print(hippos)
hippos

In [24]: apples.similarity(oranges)
Out[24]: 0.77809414836023805

In [25]: boots.similarity(hippos)
Out[25]: 0.038474555379008429

当然,spaCy还包括句法分析的相关功能等。另外值得关注的是 spaCy 从1.0版本起,加入了对深度学习工具的支持,例如 Tensorflow 和 Keras 等,这方面具体可以参考官方文档给出的一个对情感分析(Sentiment Analysis)模型进行分析的例子:Hooking a deep learning model into spaCy.

参考:
spaCy官方文档
Getting Started with spaCy

注:原创文章,转载请注明出处及保留链接“我爱自然语言处理”:

本文链接地址:自然语言处理工具包spaCy介绍 /?p=9386