Converting Text into Data for ML Algorithms – Advanced

An explanation of how Tf-Idf and Word2Vec work.

Remember the Documents we started with:

  • Document 1: Maria is going to buy apples.
  • Document 2: Maria is going to get grapes.
  • Document 3: Maria is going to work.

And their classes:

  • Class 0: Fruit Buying Trip
  • Class 1: Work Trip

It would be great if there were a way to mark words as ‘not important’ when converting them into numbers. Their score could then be lowered so that, instead of 1, they get a smaller number that reflects their lower importance.
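Tf-Idf weighting is one way to get that effect. Below is a minimal sketch, assuming the plain textbook form tf × log(N / df) and simple whitespace tokenisation. The helper names `tokenize` and `tf_idf` are made up for illustration, and libraries usually apply smoothed variants, so their numbers will differ.

```python
import math

documents = [
    "Maria is going to buy apples",
    "Maria is going to get grapes",
    "Maria is going to work",
]

def tokenize(text):
    # Lowercase and split on whitespace; real pipelines do more cleaning.
    return text.lower().split()

def tf_idf(docs):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word appears.
    df = {}
    for tokens in tokenized:
        for word in set(tokens):
            df[word] = df.get(word, 0) + 1

    scores = []
    for tokens in tokenized:
        doc_scores = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)   # term frequency
            idf = math.log(n_docs / df[word])       # inverse document frequency
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores

scores = tf_idf(documents)
```

With this unsmoothed form, words that occur in every document ("maria", "is", "going", "to") score 0, while words such as "apples" or "work" that occur in only one document keep a positive score.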

Converting Text into Data for ML – Intuition

An intro to processing Natural Language for Machine Learning.

The aim of this post is to cover how algorithms can understand text. One mechanism, One Hot Vectors, is described in this post; the other two, more advanced mechanisms are described in the Advanced post.

Imagine you have three documents that contain the following text:

  • Document 1: Maria is going to buy apples.
  • Document 2: Maria is going to get grapes.
  • Document 3: Maria is going to work.

You are using two classes for classification:

  • Class 0: Fruit Buying Trip
  • Class 1: Work Trip
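As a rough sketch of where this is heading, the documents can be turned into vectors with one position per vocabulary word: 1 if the word appears in the document, 0 otherwise. This assumes a simple lowercase, whitespace-based tokeniser. The names `tokenize`, `encode` and the `labels` list (Documents 1 and 2 as Class 0, Document 3 as Class 1) are illustrative assumptions, not taken from the post.

```python
documents = [
    "Maria is going to buy apples.",
    "Maria is going to get grapes.",
    "Maria is going to work.",
]
# Assumed labelling: 0 = Fruit Buying Trip, 1 = Work Trip.
labels = [0, 0, 1]

def tokenize(text):
    # Lowercase, drop the trailing full stop, split on whitespace.
    return text.lower().rstrip(".").split()

# Vocabulary: every distinct word across all documents, in sorted order.
vocabulary = sorted({word for doc in documents for word in tokenize(doc)})

def encode(text):
    # 1 if the vocabulary word occurs in the text, otherwise 0.
    words = set(tokenize(text))
    return [1 if word in words else 0 for word in vocabulary]

vectors = [encode(doc) for doc in documents]

for doc, vector, label in zip(documents, vectors, labels):
    print(label, vector, doc)
```

Every vector has the same length as the vocabulary, so the three documents become rows of equal width that a classifier can consume.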

Fixing Common Issues with Cloudera HDP Docker

Hortonworks Data Platform (HDP) is a very useful out-of-the-box system with everything you need to get started with Big Data Machine Learning.

It comes packed with awesome tools such as:

  • Hadoop – big data file store
  • Hive – SQL translation library to access the data in the big data file store
  • HBase – Big Data Database
  • Spark – Distributed Compute Engine
  • Ambari – to manage everything else

If you are having issues with your Docker installation after following https://www.cloudera.com/tutorials/sandbox-deployment-and-install-guide/3.html, here are some fixes:

Less confusing Confusion Matrices

An intro to Confusion Matrices

This post explains Confusion Matrices for Classification to non-technical readers.

So what are Confusion Matrices?
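In short, a confusion matrix counts how often each actual class was predicted as each class. Here is a minimal sketch with made-up labels for a two-class problem. The `actual` and `predicted` lists are purely illustrative, and `confusion_matrix` is a hypothetical helper written for this example, not a library call.

```python
def confusion_matrix(actual, predicted, n_classes):
    # Rows are actual classes, columns are predicted classes.
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Illustrative labels only: 0 and 1 are two arbitrary classes.
actual = [0, 0, 1, 1, 0, 1]
predicted = [0, 1, 1, 1, 0, 0]

matrix = confusion_matrix(actual, predicted, 2)

for row in matrix:
    print(row)
```

The diagonal holds the correct predictions; everything off the diagonal is one class being confused with another.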

Random Forests Explained

Random Forests is an ensemble algorithm, built out of decision trees (or voters) which:

  1. Have been shown the characteristics of the data.
  2. Have been shown part of the data.

Or, if you want more technical terms:

  1. Have been shown the features.
  2. Have been shown subsets of the data, each containing at most all the samples.
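The two points above can be sketched as the inputs each tree receives. This assumes rows are drawn with replacement (bootstrap sampling) and that each tree gets a random subset of the features, which is a simplification, since many implementations pick the feature subset at every split instead. The function `draw_tree_inputs` and its parameters are hypothetical names used only for this illustration.

```python
import random

def draw_tree_inputs(n_samples, n_features, max_features, rng):
    # Bootstrap: draw n_samples row indices with replacement, so a tree
    # sees at most all the samples and usually fewer distinct ones.
    row_indices = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Feature subset: pick max_features distinct column indices.
    feature_indices = sorted(rng.sample(range(n_features), max_features))
    return row_indices, feature_indices

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Inputs for three trees over a dataset of 10 rows and 4 features.
tree_inputs = [draw_tree_inputs(10, 4, 2, rng) for _ in range(3)]

for rows, features in tree_inputs:
    print("rows:", rows, "features:", features)
```

Because each tree is trained on a different draw, the trees disagree in useful ways, and the forest combines their votes into one prediction.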