The aim of this post is to cover how algorithms can understand text. One mechanism, One Hot Vectors, is described in this post; two more advanced ones are covered in the Advanced Post.
Imagine you have three documents that contain the following text:
- Document 1: “Maria is going to buy apples”
- Document 2: “Maria is going to get grapes”
- Document 3: “Maria is going to work”

You are using two classes for classification, labelled 0 and 1 in the tables below.
As you probably know by now, machine learning models rely on numbers, not text, and the data you feed your algorithms usually looks like this:
Data ID | Feature 1 | Feature 2 | Class |
---|---|---|---|
1 | 0.92 | 0.2 | 0 |
2 | 0.57 | 0.33 | 1 |
Table 1
So how can text be converted into something like the table above?
Step 1: One Hot Vectors
Using One Hot Vectors means building a list of all the words in your language as possible features and then putting a ‘1’ against the words that occur in your text. Let’s take a look at an example using our 3 documents above:
Data ID | a | … | apples | … | buy | … | get | … | going | … | grapes | … | is | … | Maria | … | to | … | work | … | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Document 1 | 0 | … | 1 | … | 1 | … | 0 | … | 1 | … | 0 | … | 1 | … | 1 | … | 1 | … | 0 | … | 0 |
Document 2 | 0 | … | 0 | … | 0 | … | 1 | … | 1 | … | 1 | … | 1 | … | 1 | … | 1 | … | 0 | … | 0 |
Document 3 | 0 | … | 0 | … | 0 | … | 0 | … | 1 | … | 0 | … | 1 | … | 1 | … | 1 | … | 1 | … | 1 |
Table 2
Above we can see that the word ‘a’ is not mentioned in any of the documents, so its column contains only ‘0’s, while every document has a ‘1’ against each word it does mention.
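If you would like to build this kind of table in code, here is a minimal Python sketch. It assumes simple lower-casing and whitespace splitting, and it builds the vocabulary only from the three documents themselves (so a word like ‘a’ would not get a column at all), rather than from a list of every word in the language as described above:

```python
# A minimal sketch of turning the three documents into one-hot (0/1) vectors.
documents = [
    "Maria is going to buy apples",   # Document 1
    "Maria is going to get grapes",   # Document 2
    "Maria is going to work",         # Document 3
]

# Vocabulary: every distinct lower-cased word across all documents, in sorted order.
vocabulary = sorted({word.lower() for doc in documents for word in doc.split()})

def one_hot(doc):
    """Return a 0/1 vector with a 1 for every vocabulary word the document contains."""
    words = {word.lower() for word in doc.split()}
    return [1 if term in words else 0 for term in vocabulary]

print(vocabulary)
for i, doc in enumerate(documents, start=1):
    print(f"Document {i}: {one_hot(doc)}")
```

If you prefer not to do this by hand, scikit-learn’s `CountVectorizer` with `binary=True` produces the same kind of 0/1 matrix and builds the vocabulary for you.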
This method works for turning your documents into numbers, but if you look at the three documents, not all of the words are equally useful for telling the classes apart: ‘Maria’, ‘is’, ‘going’, and ‘to’ appear in every document, so they say nothing about which class a document belongs to, whereas words such as ‘buy’, ‘apples’, ‘get’, ‘grapes’, and ‘work’ are the ones that actually distinguish them.
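To make that concrete, here is a small self-contained sketch that takes the 0/1 columns of Table 2 (typed out by hand below) and flags the words whose column is identical for every document, and which therefore cannot help a classifier tell the classes apart:

```python
# Columns of Table 2, one 0/1 list per word, with one entry per document
# (Document 1, Document 2, Document 3).
columns = {
    "a":      [0, 0, 0],
    "apples": [1, 0, 0],
    "buy":    [1, 0, 0],
    "get":    [0, 1, 0],
    "going":  [1, 1, 1],
    "grapes": [0, 1, 0],
    "is":     [1, 1, 1],
    "Maria":  [1, 1, 1],
    "to":     [1, 1, 1],
    "work":   [0, 0, 1],
}

# A word whose column holds the same value for every document (all 0s or all 1s)
# carries no information about the class.
uninformative = [word for word, column in columns.items() if len(set(column)) == 1]
print(uninformative)  # ['a', 'going', 'is', 'Maria', 'to']
```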
If you’re interested in reading more about how to hint at word importance as well as embeddings, please go to the Advanced Post.