Converting Text into Data for ML Algorithms – Advanced

An explanation of how Tf-Idf and Word2Vec work.

Remember the Documents we started with:

  • Document 1: Maria is going to buy apples.
  • Document 2: Maria is going to get grapes.
  • Document 3: Maria is going to work.

And their classes:

  • Class 0: Fruit Buying Trip
  • Class 1: Work Trip

It would be great if a method were available to mark words as ‘not important’ when converting them into numbers. Maybe their score could be lowered from 1 to a smaller number, so as to reflect this lower importance.

Step 2: Term Frequency * Inverse Document Frequency (Tf-Idf)

Tf-Idf does that by looking at:

  • Term Frequency (Tf) in your document – how often does the word occur in your document – is it important to the document?
  • Inverse Document Frequency (Idf) – the number of documents divided by the number of documents the word appears in – is it really common across your dataset? (p.s. the formula also takes a log of this ratio – you will see it pop up in the calculations below)
  • Multiplying Term Frequency (Tf) by Inverse Document Frequency (Idf)

Looking at the word ‘Maria’: it occurs in all 3 documents, so in our case its Idf = log(3/3) = 0. Let’s now look at how we would record it per document:

  • Document 1: Tf = 1/6 (Maria occurs once in a sentence of 6 words) → Tf*Idf = 1/6 * 0 = 0
  • Document 2: Tf = 1/6 (Maria occurs once in a sentence of 6 words) → Tf*Idf = 1/6 * 0 = 0
  • Document 3: Tf = 1/5 (Maria occurs once in a sentence of 5 words) → Tf*Idf = 1/5 * 0 = 0

But what about the word ‘work’? It appears in only one document, so its Idf = log(3/1), which is approximately 1.09.

  • Document 1: Tf = 0 (work does not occur) → Tf*Idf = 0 * 1.09 = 0
  • Document 2: Tf = 0 (work does not occur) → Tf*Idf = 0 * 1.09 = 0
  • Document 3: Tf = 1 (work occurs once; the raw count is used here to keep the table below simple) → Tf*Idf = 1 * 1.09 = 1.09

Doing this for every word yields:

Data ID      a     apples  buy   get   going  grapes  is    Maria  to    work  Class
Document 1   0     1.09    1.09  0     0      0       0     0      0     0     0
Document 2   0     0       0     1.09  0      1.09    0     0      0     0     0
Document 3   0     0       0     0     0      0       0     0      0     1.09  1

Table 3
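
If you want to check these numbers yourself, here is a minimal Python sketch that reproduces Table 3. It keeps the simplifications used above – Tf is the raw count of the word in the document, and Idf is the natural log of (number of documents / documents containing the word); real libraries such as scikit-learn add smoothing, so their exact numbers will differ slightly.

```python
import math

# The three documents from the example, lower-cased and with punctuation removed
docs = [
    "maria is going to buy apples",
    "maria is going to get grapes",
    "maria is going to work",
]
tokenised = [d.split() for d in docs]
vocab = sorted({word for doc in tokenised for word in doc})

n_docs = len(tokenised)
# Idf = log(number of documents / number of documents the word appears in)
idf = {w: math.log(n_docs / sum(w in doc for doc in tokenised)) for w in vocab}

# Tf here is simply the raw count of the word in the document (as in Table 3)
for i, doc in enumerate(tokenised, start=1):
    tfidf = {w: doc.count(w) * idf[w] for w in vocab}
    print(f"Document {i}:", {w: round(score, 2) for w, score in tfidf.items() if score > 0})
```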

Words don’t get removed, but the ones that are important to a particular document, and rare across the dataset, get a higher weighting.

Wouldn’t it be great if the algorithm actually understood that ‘buy’ and ‘get’ are similar? And that ‘apples’ and ‘grapes’ are also similar? And that these two documents therefore very much mean Class 0 – Fruit Buying Trip?

Step 3: Word Embeddings (Word2Vec/GloVe)

Word Embeddings are just a fancy name for descriptions of words as lists of numbers. The way to understand word embeddings is to think about the famous game Articulate, where your team guesses the word you are trying to describe without actually naming it.

So how would you describe ‘Apples’ and ‘Grapes’? And how about ‘Work’?

Apples and Grapes are:

  • like food, they come from plants, they are fruits, sweet-ish
  • they are not cars, buildings, planets, or countries, they are not an action, and they have a small monetary value

Work is:

  • maybe you drive to it, you do it in a building, it provides food, it’s an action, and it does have a higher monetary value
  • it’s not a planet, country, plant, a fruit, and not sweet

Based on the above, here’s an indication of how you might convert this information into a description of every word:

Word    Car?  Building?  Food?  Planet?  Country?  Plant?  Fruit?  Sweet?  Action?  Money related?
Apples  0     0          0.7    0        0         0.4     0.8     0.3     0        0.15
Grapes  0     0          0.7    0        0         0.4     0.8     0.8     0        0.15
Work    0.1   0.1        0.1    0        0         0       0       0       0.7      0.8
Buy     0     0          0      0        0         0       0       0       0.8      0.15
Get     0     0          0      0        0         0       0       0       0.8      0.12

Table 4

We can see from this table that ‘Apples’ and ‘Grapes’ are actually quite similar things, but very different to ‘Work’.
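
To put a number on ‘quite similar’, we can compare the rows of Table 4 using cosine similarity, the standard way of comparing embedding vectors: the closer the score is to 1, the more alike two words are. Here is a minimal sketch using the hand-made vectors above:

```python
import math

# Hand-made "embeddings" from Table 4, columns:
# [Car, Building, Food, Planet, Country, Plant, Fruit, Sweet, Action, Money related]
vectors = {
    "apples": [0, 0, 0.7, 0, 0, 0.4, 0.8, 0.3, 0, 0.15],
    "grapes": [0, 0, 0.7, 0, 0, 0.4, 0.8, 0.8, 0, 0.15],
    "work":   [0.1, 0.1, 0.1, 0, 0, 0, 0, 0, 0.7, 0.8],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors["apples"], vectors["grapes"]))  # ≈ 0.94 – very similar
print(cosine(vectors["apples"], vectors["work"]))    # ≈ 0.15 – very different
```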

This is what Word2Vec/GloVe do for you. They provide you with these descriptions (embeddings) of the words in very large dimensions (imagine 300 different categories that the algorithm creates on its own).
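
In practice you don’t fill in Table 4 by hand – you load vectors that Word2Vec or GloVe have already learned from huge amounts of text. Below is a hedged sketch assuming the gensim library and its downloader are installed; ‘glove-wiki-gigaword-100’ is one of gensim’s pre-packaged GloVe models (100 dimensions rather than 300), and the exact similarity scores will depend on the model you load.

```python
import gensim.downloader as api

# One-off download of 100-dimensional GloVe vectors pre-trained on Wikipedia text
glove = api.load("glove-wiki-gigaword-100")

print(glove["apples"].shape)                 # (100,) – one number per learned 'category'
print(glove.similarity("apples", "grapes"))  # relatively high – similar words
print(glove.similarity("apples", "work"))    # relatively low – dissimilar words
```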

Remember our original sentences?

  • Document 1: Maria is going to buy apples.
  • Document 2: Maria is going to get grapes.
  • Document 3: Maria is going to work.

How about we use the description information for every word together with Tf-Idf from earlier (simplified*):

  • Document 1: Maria is going to do some money related action with some plant, a fruit, somewhat sweet, not expensive.
  • Document 2: Maria is going to do some money related action with some plant, a fruit, sweet, not expensive.
  • Document 3: Maria is going to maybe use a car, in a building, food is mentioned, it’s an action and money related.

The 3 documents were very similar, with 4 words in common. We have helped the algorithm weigh down common words, whilst pushing up words that occur often within a single document but rarely across the dataset (Tf-Idf). Moreover, we explained to it what apples, grapes and work mean (Word2Vec/GloVe).

Do you think the Algorithm might find it easier now to classify your 3 documents into the 2 Classes – Fruit Buying Trip and Work Trip?

What about an entirely new Document:

  • Document 4: Maria is going to purchase bananas?

Do you think that you could explain ‘purchase’ or maybe ‘bananas’ in such a way as to make it simpler for the algorithm to understand what sort of Trip it will be?
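
With pre-trained embeddings you don’t have to explain anything yourself: ‘purchase’ already sits close to ‘buy’, and ‘bananas’ close to ‘apples’, in the embedding space. A quick check, again assuming gensim’s pre-packaged GloVe vectors from the earlier sketch:

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # same pre-trained vectors as before

print(glove.similarity("purchase", "buy"))    # expected to be relatively high
print(glove.similarity("bananas", "apples"))  # expected to be relatively high
print(glove.most_similar("bananas", topn=3))  # nearest neighbours in the embedding space
```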

Here’s a hint:

Having both Tf-Idf and Word Embeddings is possible by multiplying the 2 matrices: the Tf-Idf matrix (one row per document, one column per word) times the embedding matrix (one row per word, one column per category/dimension) gives you one row per document and one column per category.

If you’ve been following up to here, this is what the final training dataset looks like if you apply Tf-Idf to Word2Vec.
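
Here is a minimal sketch of that multiplication, using the toy numbers from Table 3 and Table 4 (only the five words that have a row in Table 4 are kept, to stay readable). The assumption is simply that the Tf-Idf columns and the embedding rows are listed in the same word order:

```python
import numpy as np

# Tf-Idf matrix from Table 3 – columns: apples, buy, get, grapes, work
tfidf = np.array([
    [1.09, 1.09, 0.00, 0.00, 0.00],  # Document 1
    [0.00, 0.00, 1.09, 1.09, 0.00],  # Document 2
    [0.00, 0.00, 0.00, 0.00, 1.09],  # Document 3
])

# Embedding matrix from Table 4 – one row per word, same order as the columns above;
# columns: Car, Building, Food, Planet, Country, Plant, Fruit, Sweet, Action, Money related
embeddings = np.array([
    [0.0, 0.0, 0.7, 0, 0, 0.4, 0.8, 0.3, 0.0, 0.15],  # apples
    [0.0, 0.0, 0.0, 0, 0, 0.0, 0.0, 0.0, 0.8, 0.15],  # buy
    [0.0, 0.0, 0.0, 0, 0, 0.0, 0.0, 0.0, 0.8, 0.12],  # get
    [0.0, 0.0, 0.7, 0, 0, 0.4, 0.8, 0.8, 0.0, 0.15],  # grapes
    [0.1, 0.1, 0.1, 0, 0, 0.0, 0.0, 0.0, 0.7, 0.80],  # work
])

# (3 documents x 5 words) @ (5 words x 10 categories) = (3 documents x 10 categories)
doc_vectors = tfidf @ embeddings
print(doc_vectors.round(2))
```

Each document row is now a Tf-Idf-weighted sum of the vectors of its words, so Documents 1 and 2 end up with very similar rows while Document 3 looks quite different – exactly the separation between the Fruit Buying Trip and Work Trip classes we were after.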

The inspiration for this post comes from Andrew Ng’s deeplearning.ai course on Sequence Models.