Named-Entity Recognition using Deep Learning


In Natural Language Processing, named-entity recognition (NER) is an information extraction task that seeks to locate and classify elements of text into pre-defined categories. The following graph is taken from the Maluuba website; it nicely demonstrates what NER does.



Given a sentence, we want to know which category each word belongs to. By doing NER, we can get an annotated block of text that highlights the names of entities. The following example is from Wikipedia:

Given: Jim bought 300 shares of Acme Corp. in 2006.
Get: [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.

This is obviously a machine learning problem, and in this post I’ll talk about what I’ve tried and learned over the past couple of days.


MITIE is an open-source information extraction tool developed by the MIT NLP lab, and it comes with trained models for English and Spanish. The English named-entity recognition model is trained on data from the English Gigaword news corpus, the CoNLL 2003 named-entity recognition task, and ACE data.

The MITIE library provides APIs in C, C++, Java, R, and Python 2.7. It’s super easy and friendly to use from C++; the workflow is:

Add entities –> Set trainer –> Train –> Serialize and save model into file –> Read and deserialize model from file –> Do prediction

If we already have the trained model file, we can directly start from the “Read and deserialize” step.

More details will be mentioned in the EXPERIMENT AND RESULT section.


CRF++ is a simple, customizable, open-source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. CRF++ was developed by Taku Kudo; it supports multi-language models and is very fast to train and test. After compiling and installing it, it can be used directly from C++ by calling system commands. I wrote some very simple interface functions for calling CRF++ from C++ for both training and testing.


It is universally acknowledged that CNNs work very well on image problems and have outperformed most other methods in image classification tasks. However, I had never tried one on an NLP problem, so first, let’s find out what’s different between image problems and NLP problems (specifically, the NER problem).

  1. Image datasets are uniformly sized: every image has the same width and height. Sentences in NLP datasets, however, don’t have the same length (both the number of characters per word and the number of words per sentence vary).
  2. An image (to simplify the problem, let’s assume all images in this post are 1-channel) has 2 dimensions, vertical and horizontal; a sentence has only 1 dimension.
  3. Most images are robust to rotation, scaling, affine transformations, and perhaps noise; words and sentences are quite sensitive to such transforms (a one-character change in a word may lead to a very different meaning, say “dog” and “dig”).


For the 1st difference, we can use feature-extraction methods to solve it. By encoding the data, we not only give it a uniform size, which makes it easy to feed into neural networks, but also extract features that help the training process. Below are two of the many ways to encode words/sentences.

1. one-of-m coding

I saw this coding method in Zhang & LeCun’s paper. The encoding is done by prescribing an alphabet of size m for the input language and then quantizing each character using 1-of-m encoding. The sequence of characters is thus transformed into a sequence of such m-sized vectors with a fixed length l. Any character exceeding length l is ignored, and any character not in the alphabet (including blank characters) is quantized as an all-zero vector. Zhang & LeCun used 69 characters in their model: 26 English letters, 10 digits, and 33 other characters. After encoding, each character of the input becomes a uniform vector of length 69.

One advantage of this method is that the encoded vectors are sparse: most of their elements are zero.
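As a sketch, the 1-of-m character encoding can be implemented like this. The 69-symbol alphabet below follows the letters + digits + punctuation scheme described above, but the exact punctuation set is my assumption, not necessarily Zhang & LeCun’s:

```python
import numpy as np

# 69 symbols: 26 letters, 10 digits, 33 punctuation/other characters.
# The exact punctuation set here is illustrative, not the paper's original.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}\n"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def one_of_m_encode(text, max_len=16):
    """Encode `text` as a (max_len, 69) one-hot matrix.

    Characters beyond max_len are ignored; characters outside the
    alphabet (including blanks) become all-zero rows.
    """
    mat = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            mat[pos, idx] = 1.0
    return mat

m = one_of_m_encode("dog vs dig")
```

Note how sparse the result is: each row has at most one non-zero entry.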

2. word2vec

Word2vec was originally conceived by Tomas Mikolov’s team at Google. It provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. Word2vec is a neural network that learns word meanings; fed a huge training corpus, it can make highly accurate guesses about what words mean and cluster words by meaning. It outputs a fixed-length vector as the representation of each input word, so it also neatly solves the “different length” problem.

It has been shown that word vectors capture many linguistic regularities; for example, vector(‘Paris’) – vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’), and vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to vector(‘queen’).
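A toy illustration of this vector arithmetic: the 2-D vectors below are hand-crafted so the analogy holds by construction; real word2vec vectors are learned from corpora and are typically 100–300 dimensional.

```python
import numpy as np

# Hand-crafted toy vectors: dimension 0 ~ "royalty", dimension 1 ~ "gender".
# Real word2vec embeddings are learned, not constructed like this.
vec = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
}

def nearest(target, candidates):
    """Return the candidate word whose vector has the highest cosine
    similarity to `target` (the query words themselves are excluded,
    as is standard in analogy evaluation)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(candidates, key=lambda w: cos(vec[w], target))

# king - man + woman should land near queen
result = nearest(vec["king"] - vec["man"] + vec["woman"], ["queen", "man", "woman"])
```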


After encoding the data, each word is represented by a vector, so each sentence becomes a 2-D matrix. N-grams go a step further, slicing sentences into pieces of equal length. For example:

Jim bought 300 shares of Acme Corp in 2006

When n = 3, we get the following sub-sentences:

  • Jim bought 300
  • bought 300 shares
  • 300 shares of
  • shares of Acme
  • of Acme Corp
  • Acme Corp in
  • Corp in 2006

Say we’re using the 69-character one-of-m encoding; then we get 7 matrices, all of size 3*69.

Furthermore, we can use different values of n, say 3, 5, 7, and 9, train 4 different networks (one per n-value), and use something like a mixture of experts when making predictions.
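The slicing step above can be sketched in a few lines (an illustration, not the code I actually used):

```python
def ngrams(sentence, n):
    """Slice a sentence into overlapping n-word sub-sentences."""
    words = sentence.split()
    return [words[i:i + n] for i in range(len(words) - n + 1)]

subs = ngrams("Jim bought 300 shares of Acme Corp in 2006", 3)
```

A 9-word sentence yields 9 − 3 + 1 = 7 sub-sentences, matching the list above.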

Data Augmentation

Data augmentation is one of several ways to reduce overfitting. For image data, it’s easy to enlarge the dataset by adding rotation, scaling, or noise; for text we must find other ways, and synonym replacement is a pretty good choice. Whatever the language, synonym dictionaries already exist, so we can simply replace words in a sentence with their synonyms.

Also, since word2vec vectors capture many linguistic regularities and the distance between synonyms should be small, I think we could add a little noise to the word vectors to enlarge the dataset (I’m not sure; if this is wrong, correct me!).
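A minimal sketch of synonym replacement; the toy dictionary below is illustrative, and in practice the entries would come from a real thesaurus such as WordNet:

```python
import random

# Toy synonym dictionary -- illustrative only; a real one would come
# from an existing thesaurus for the target language.
SYNONYMS = {
    "bought": ["purchased", "acquired"],
    "shares": ["stocks"],
}

def augment(sentence, synonyms, rng=None):
    """Return a copy of `sentence` with each word replaced by a random
    synonym when one is available. Entity labels stay aligned because
    replacement is strictly one word for one word."""
    rng = rng or random.Random(0)
    return " ".join(rng.choice(synonyms[w]) if w in synonyms else w
                    for w in sentence.split())

aug = augment("Jim bought 300 shares of Acme Corp in 2006", SYNONYMS)
```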


In the example code I’m using a convnet which has the following structure:

Convolutional(1) –> Pooling(1) –>Convolutional(2) –> Pooling(2) –> Fully Connected –> Softmax

  • I only use 3-grams of word2vec-300 vectors, so for each sub-sentence the input is a 3*300 matrix.
  • The Convolutional(1) layer has 4 kernels, all of size 2*41 (with local response normalization).
  • In the Pooling(1) layer, both the horizontal and vertical pooling dimensions are 2.
  • The Convolutional(2) layer has 8 kernels, all of size 1*21 (with local response normalization).
  • In the Pooling(2) layer, the horizontal pooling dimension is 2 and the vertical pooling dimension is 1 (do nothing vertically).
  • The Fully Connected layer has 256 hidden units (with dropout, dropout probability 0.5).
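As a sanity check, here is the shape arithmetic for the layers above, assuming “valid” (no-padding) convolutions and non-overlapping pooling (the post doesn’t state the padding scheme, so that part is my assumption):

```python
def conv_valid(shape, kernel):
    """Output shape of a 'valid' 2-D convolution."""
    h, w = shape
    kh, kw = kernel
    return (h - kh + 1, w - kw + 1)

def pool(shape, dims):
    """Output shape of non-overlapping pooling."""
    h, w = shape
    ph, pw = dims
    return (h // ph, w // pw)

shape = (3, 300)                    # 3-gram of word2vec-300 vectors
shape = conv_valid(shape, (2, 41))  # Conv(1): kernels of size 2*41 -> (2, 260)
shape = pool(shape, (2, 2))         # Pool(1): 2 both ways          -> (1, 130)
shape = conv_valid(shape, (1, 21))  # Conv(2): kernels of size 1*21 -> (1, 110)
shape = pool(shape, (1, 2))         # Pool(2): vertical 1, horiz. 2 -> (1, 55)
```

The 1*55 maps (times 8 kernels) then feed the 256-unit fully connected layer.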


MITIE & CNN code and dataset can be found HERE.

CRF++ code and dataset can be found HERE.


1000 news sentences, labeled word by word, for example:

  • breaking B-NEWSTYPE
  • news I-NEWSTYPE
  • dealing O
  • with O
  • prime B-KEYWORDS
  • minister I-KEYWORDS
  • greece I-KEYWORDS
  • visit I-KEYWORDS
  • tunis I-KEYWORDS
  • in O
  • west B-PROVIDER
  • shore I-PROVIDER
  • news I-PROVIDER

There are 9 categories in total. I use 800 of the sentences as the training set and 200 as the test set.


  • Sentence test accuracy : 163/200 = 0.815
  • Single word test accuracy :  2057/2123 = 0.969
  • Training time : 7570 seconds

Here, the sentence test means that if the predicted label of any word in a sentence is incorrect, the whole sentence counts as incorrect; the single-word test measures word-level accuracy.
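The two metrics can be made precise with a small scoring helper (a sketch, not the evaluation code I actually used):

```python
def scores(gold, pred):
    """gold/pred: lists of sentences, each a list of tags.

    Sentence accuracy counts a sentence as correct only if every tag
    matches; word accuracy is computed over all tags."""
    sent_ok = sum(g == p for g, p in zip(gold, pred))
    pairs = [(gt, pt) for g, p in zip(gold, pred) for gt, pt in zip(g, p)]
    word_ok = sum(gt == pt for gt, pt in pairs)
    return sent_ok / len(gold), word_ok / len(pairs)

sent_acc, word_acc = scores(
    [["B-KEYWORDS", "I-KEYWORDS", "O"], ["O", "B-PROVIDER"]],
    [["B-KEYWORDS", "O",          "O"], ["O", "B-PROVIDER"]],
)
```

One wrong tag in the first sentence sinks the whole sentence, but costs only one word at the word level.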


This is the template I use for the toy dataset:

# Unigram

# Bigram

For each token, I look at the 3 positions before and the 3 positions after it. The performance follows:
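A CRF++ unigram/bigram template with a ±3-token window would look something like this (an illustrative reconstruction; the exact feature lines of my template are not shown above):

```
# Unigram
U00:%x[-3,0]
U01:%x[-2,0]
U02:%x[-1,0]
U03:%x[0,0]
U04:%x[1,0]
U05:%x[2,0]
U06:%x[3,0]

# Bigram
B
```

Each `%x[row,col]` macro picks the token `row` positions away in column `col` of the training file; `B` adds bigram features over adjacent output labels.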

  • Sentence test accuracy : 183/200 = 0.915
  • Single word test accuracy :  2171/2199 = 0.987267
  • Training time : 4.95 seconds

The definitions of the sentence test and single-word test are the same as in the MITIE experiment.


The 0th-layer kernels look like:


The 1st-layer kernels look like:


  • Training 3-Gram accuracy : 6975/7301 = 0.955349
  • Test 3-Gram accuracy : 1553/1799 = 0.863257
  • Test single word accuracy : 2109/2195 = 0.96082
  • Training time : 19096 seconds (40,000 iterations)


For the above toy dataset, the convolutional neural network performs about as well as MITIE, although I spent more time training the CNN. Actually, here’s the accuracy at the 10,000th iteration:

  • Training 3-Gram accuracy : 7023/7301 = 0.961923
  • Test 3-Gram accuracy : 1545/1799 = 0.85881

The accuracy doesn’t actually change much after the 6,000th iteration.

Moreover, the CNN model file (12 MB) is much smaller than the MITIE model file (352 MB); it contains only the weight matrices and bias vectors of each layer of the network. The small model file also leads to fast loading: for example, running NER on a single sentence takes approximately 3 seconds with the trained MITIE model versus about 0.5 seconds with the trained CNN model.

CRF++ gives the best result on the toy dataset. Although I still don’t clearly understand how CRFs work, CRF++ is not only fast but also very accurate, and its trained model file is only 2.8 MB.

I think CNNs will perform better on larger datasets: first, well-trained CNNs automatically reveal structure inside the data; we don’t need any prior knowledge about the data and can just wait for the network itself to learn features. Second, we can use different kernel sizes in one network and learn multiple ranges of features in parallel. Third, we can use larger and deeper CNNs for larger datasets, and we can also use GPUs to speed up training.

This entry was posted in Machine Learning, NLP.


  1. Ted
    Posted April 20, 2016 at 6:04 am | Permalink

    MITIE model file (352MB) contains the extractor which is more than 300 MB.

    So normally the size is only around 10 ~20 MB.

  2. Ali Mollahosseini
    Posted June 27, 2016 at 2:54 am | Permalink

    How did you measure the accuracy? And also, may I ask how did you implement your CNN?
