Author: Radu-Andrei Nedelcu

Converting Text into Data for ML Algorithms – Advanced

Post author By Radu-Andrei Nedelcu
Post date 12/02/2021

An explanation of how Tf-Idf and Word2Vec work.

Remember the Documents we started with:

Document 1: Maria is going to buy apples.
Document 2: Maria is going to get grapes.
Document 3: Maria is going to work.

And their classes:

Class 0: Fruit Buying Trip
Class 1: Work Trip

It would be great if there were a way to mark words as ‘not important’ when converting them into numbers. Their score could be lowered, so that instead of 1 they get a smaller number that reflects their lower importance.
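As a rough illustration of that idea, here is a minimal Tf-Idf sketch over the three documents. It assumes the plain textbook weighting (term frequency times log of total documents over documents containing the word); the function name `tf_idf` and the simple tokenization are made up for this example, and libraries often use smoothed variants that give slightly different numbers.

```python
import math

documents = [
    "Maria is going to buy apples.",
    "Maria is going to get grapes.",
    "Maria is going to work.",
]

def tf_idf(docs):
    """Return one {word: score} dict per document (unsmoothed Tf-Idf)."""
    # Assumed tokenization: lowercase, drop the final period, split on spaces.
    tokenized = [doc.lower().rstrip(".").split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for tokens in tokenized:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    scores = []
    for tokens in tokenized:
        doc_scores = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)       # how often in this document
            idf = math.log(n_docs / doc_freq[word])     # how rare across documents
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores

scores = tf_idf(documents)
for i, doc_scores in enumerate(scores, start=1):
    print(f"Document {i}:", {w: round(s, 3) for w, s in sorted(doc_scores.items())})
```

Words that appear in every document ("maria", "is", "going", "to") get a score of 0, while words such as "apples" or "work" that appear in only one document keep a positive score.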

Converting Text into Data for ML – Intuition

Post author By Radu-Andrei Nedelcu
Post date 12/02/2021

An intro to processing Natural Language for Machine Learning.

This post covers how algorithms can understand text. It describes one mechanism, One Hot Vectors; the other two, more advanced mechanisms are described in the Advanced post.

Imagine you have three documents that contain the following text:

Document 1: Maria is going to buy apples.
Document 2: Maria is going to get grapes.
Document 3: Maria is going to work.

You are using two classes for classification:

Class 0: Fruit Buying Trip
Class 1: Work Trip
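To make the setup concrete, here is a minimal sketch of turning these documents into one-hot style vectors. It assumes that each document becomes a vector with a 1 for every vocabulary word it contains and a 0 otherwise; the helper name `encode`, the alphabetical vocabulary order, and the label assignment (documents 1 and 2 as Class 0, document 3 as Class 1) are assumptions for illustration.

```python
documents = [
    "Maria is going to buy apples.",
    "Maria is going to get grapes.",
    "Maria is going to work.",
]
labels = [0, 0, 1]  # assumed: 0 = Fruit Buying Trip, 1 = Work Trip

# Assumed tokenization: lowercase, drop the final period, split on spaces.
tokenized = [doc.lower().rstrip(".").split() for doc in documents]

# The vocabulary is every distinct word, in a fixed (alphabetical) order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def encode(tokens, vocabulary):
    """Return a 0/1 vector: 1 where the vocabulary word occurs in the document."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

vectors = [encode(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector, label in zip(vectors, labels):
    print(vector, "-> class", label)
```

Every document gets a vector of the same length (one position per vocabulary word), which is the shape most classification algorithms expect as input.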

Fixing Common Issues with Cloudera HDP Docker

Post author By Radu-Andrei Nedelcu
Post date 12/02/2021

Fixing Common Issues with Cloudera HDP Docker

Hortonworks Data Platform (HDP) is a very useful out-of-the-box system with everything you need to get started doing Big Data Machine Learning.

It comes packed with awesome tools such as:

Hadoop – big data file store
Hive – SQL layer for accessing the data in the big data file store
HBase – Big Data (NoSQL) Database
Spark – distributed compute engine
Ambari – to manage everything else

If you are having issues with your Docker installation after following https://www.cloudera.com/tutorials/sandbox-deployment-and-install-guide/3.html, here are some fixes:

Less confusing Confusion Matrices

Post author By Radu-Andrei Nedelcu
Post date 12/02/2021

An intro to Confusion Matrices

This post explains Confusion Matrices for Classification to a non-technical audience.

So what are Confusion Matrices?
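In short, a confusion matrix counts how often each actual class was predicted as each class. Here is a minimal sketch for two classes; the `actual` and `predicted` labels are made-up example data, the function name is hypothetical, and the layout (rows are actual classes, columns are predicted classes) is one common convention rather than the only one.

```python
def confusion_matrix(actual, predicted, n_classes=2):
    """Count (actual, predicted) pairs: rows = actual class, columns = predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Hypothetical labels, invented for this example only.
actual = [0, 0, 1, 1, 0, 1]
predicted = [0, 1, 1, 1, 0, 0]

matrix = confusion_matrix(actual, predicted)
for row in matrix:
    print(row)
```

The diagonal holds the correct predictions; everything off the diagonal is a case where the classifier confused one class for another.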

Random Forests Explained

Post author By Radu-Andrei Nedelcu
Post date 12/02/2021

Random Forests is an ensemble algorithm, built out of decision trees (or voters) which:

Have been shown the characteristics of the data.
Have been shown part of the data.

Or, if you want more technical terms:

Have been shown the features.
Have been shown a subset of the data, which contains at most all of the samples.
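The voting idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a full Random Forest: each "tree" is a one-split stump rather than a full decision tree, each stump gets a random subset of the features for its whole lifetime, the rows are sampled with replacement, and all names (`train_stump`, `train_forest`, `predict`) and the toy data are made up for this example.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature_ids):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    _, f, threshold, left_label, right_label = best
    return lambda row: left_label if row[f] <= threshold else right_label

def train_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on its own sample of rows and subset of features."""
    rng = random.Random(seed)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        # Part of the data: draw as many rows as the dataset has, with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Part of the characteristics: a random subset of the features.
        feature_ids = rng.sample(all_features, n_features)
        forest.append(train_stump(sample_rows, sample_labels, feature_ids))
    return forest

def predict(forest, row):
    """Every tree votes; the most common vote wins."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy data, invented for this example: class 0 has small values, class 1 large ones.
rows = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 7.0), (7.0, 6.0), (6.5, 6.5)]
labels = [0, 0, 0, 1, 1, 1]

forest = train_forest(rows, labels)
print(predict(forest, (0.5, 0.5)), predict(forest, (8.0, 8.0)))
```

An individual stump can be wrong, for example when its sample happens to contain only one class, but the majority vote across many stumps smooths those mistakes out. That is the main reason for building an ensemble in the first place.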