An intro to processing Natural Language for Machine Learning.

The aim of this post is to cover how algorithms can understand text. One mechanism, One Hot Vectors, is described in this post; the other two, more advanced, mechanisms are described in the Advanced Post.

Imagine you have three documents that contain the following text:

  • Document 1: Maria is going to buy apples.
  • Document 2: Maria is going to get grapes.
  • Document 3: Maria is going to work.

You are using two classes for classification:

  • Class 0: Fruit Buying Trip
  • Class 1: Work Trip

As you probably know by now, Machine Learning models rely on numbers and not on text, and you usually have data for your algorithms in the following format:

  Data ID   Feature 1   Feature 2   Class
  1         0.92        0.2         0
  2         0.57        0.33        1

Table 1

So how can text be converted to something like the above?

Step 1: One Hot Vectors

Using One Hot Vectors means building a list of all the words in your language as possible features and then putting a ‘1’ against the words that occur in your text. Let’s take a look at an example using our 3 documents above:

  Data ID      a   apples   buy   get   going   grapes   is   Maria   to   work   Class
  Document 1   0   1        1     0     1       0        1    1       1    0      0
  Document 2   0   0        0     1     1       1        1    1       1    0      0
  Document 3   0   0        0     0     1       0        1    1       1    1      1

Table 2

Above we can see that the word ‘a’ was never mentioned, so it has ‘0’s against it in every document, while each document has ‘1’s against the words it does contain.
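The table above can be produced with a few lines of code. Below is a minimal sketch in Python, assuming simple whitespace tokenization with the trailing full stop stripped; the vocabulary is written out by hand to match Table 2 (including ‘a’, which never occurs in the documents):

```python
docs = [
    "Maria is going to buy apples.",
    "Maria is going to get grapes.",
    "Maria is going to work.",
]

# The vocabulary lists every candidate word as a feature, not just the
# words that actually occur -- 'a' stays all-zeros, as in Table 2.
vocab = ["a", "apples", "buy", "get", "going", "grapes", "is", "Maria", "to", "work"]

def one_hot(doc, vocab):
    # Strip the trailing full stop and split on whitespace,
    # then mark each vocabulary word present in the document with a 1.
    words = doc.rstrip(".").split()
    return [1 if word in words else 0 for word in vocab]

for doc in docs:
    print(one_hot(doc, vocab))
```

A real pipeline would also lowercase the text and handle punctuation properly (e.g. with a tokenizer), but the idea is the same: each document becomes a fixed-length row of 0s and 1s.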

This method works to represent your documents as numbers, but if you look at the three documents, not all the words are important for distinguishing our classes:

  • Document 1: Maria is going to buy apples.
  • Document 2: Maria is going to get grapes.
  • Document 3: Maria is going to work.

If you’re interested in reading more about how to hint at word importance as well as embeddings, please go to the Advanced Post.