Text classification and segmentation are common problems across many businesses. To solve them, we need to convert text into numbers and feed those numbers to machine learning or deep learning algorithms. Let us analyze this process of converting words into numbers, better known as word embedding.
Word embedding is a term for any method that converts words or text into numbers. We cannot feed raw text directly to our ML/DL models; we first need to represent the text numerically. These numbers may simply indicate whether a certain word appears in a sentence (the bag-of-words approach), or they may take into account the frequency of a word in the whole corpus as well as in the sentence (as done in TF-IDF). Other techniques represent each word by a vector rather than representing the whole sentence as a single embedded vector; these include Word2Vec, GloVe, and FastText. We will discuss two of the aforementioned techniques, namely TF-IDF and Word2Vec.
Let us see exactly how TF-IDF encodes a sentence into a vector.
TF-IDF consists of two parts:
1. TF (Term Frequency): TF measures how frequently a word appears in a given sentence.
TF = (Number of times term t appears in a sentence) / (Total number of terms in the sentence)
2. IDF (Inverse Document Frequency): IDF measures how rare a word is across the whole corpus; it is the logarithm of the inverse fraction of sentences that contain the word.
IDF = log_e(Total number of sentences / Number of sentences containing the word)
If only TF is used, we get high values for words like 'a', 'the', 'and', etc., which appear in almost all sentences, and often more than once. These words don't carry much meaning, so we must give them small weights. This is what IDF does: it penalizes words that appear in many sentences, since such words won't help us classify a sentence into a particular intent/group.
Let us understand with an example. Consider a document containing 100 words in which the word 'car' appears 3 times. The term frequency for 'car' is 3/100 = 0.03. Now, if we have 10,000 such documents and the word 'car' appears in 100 of them, the IDF is log_e(10,000/100) = log_e(100) ≈ 4.6. So the TF-IDF weight for the word 'car' is the product of the two, i.e. 0.03 × 4.6 ≈ 0.14.
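The arithmetic above can be sketched in a few lines of Python, using the natural log as in the IDF formula given earlier:

```python
import math

# TF: the word 'car' appears 3 times in a 100-word document
tf = 3 / 100  # 0.03

# IDF: 'car' appears in 100 out of 10,000 documents
idf = math.log(10_000 / 100)  # log_e(100), about 4.6

# TF-IDF weight is the product of the two
tfidf = tf * idf
print(tf, idf, tfidf)
```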
Consider the following sentences and their TF-IDF transformed feature set, an example from my book:
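To illustrate, here is a minimal pure-Python sketch of TF-IDF over a tiny corpus (the two example sentences here are my own, not the ones from the book):

```python
import math

sentences = [
    "the car is driven on the road",
    "the truck is driven on the highway",
]
docs = [s.split() for s in sentences]
vocab = sorted({w for d in docs for w in d})

def tf(word, doc):
    # fraction of the terms in the sentence that are `word`
    return doc.count(word) / len(doc)

def idf(word):
    # log of (total sentences / sentences containing the word)
    n = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n)

# one TF-IDF vector per sentence, indexed by vocab
vectors = [[tf(w, d) * idf(w) for w in vocab] for d in docs]
```

Note that words appearing in both sentences ('the', 'is', 'driven', 'on') get IDF = log_e(1) = 0 and hence zero weight: exactly the penalization of common words described above.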
Word2Vec is a two-layer neural network that maps words to a vector space. There are different methods of obtaining word vectors for a sentence, but the main idea behind all of them is to give similar words similar vector representations, i.e. to represent similarity between words mathematically.
Given a corpus of text, Word2Vec trains words against the words that neighbour them in the corpus, either by predicting a target word from its context (known as Continuous Bag of Words, CBOW) or by predicting the context given a target word (known as skip-gram). On larger datasets we prefer the latter, as it gives better accuracy.
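To make the training setup concrete, here is a small sketch of how skip-gram style (target, context) pairs can be generated from a sentence (the sentence and window size are illustrative):

```python
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 1  # number of neighbouring words on each side

pairs = []  # (target, context) pairs for skip-gram training
for i, target in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            pairs.append((target, sentence[j]))
```

CBOW simply flips the direction: the model predicts the target from its surrounding context words instead.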
Words clustered together after embedding using Word2Vec
Word2Vec can be used by installing and importing the gensim package in Python.
For text classification and text segmentation, we need to convert a text corpus (sentence, paragraph, etc.) into a sequence of numbers, which we then feed to deep learning (or machine learning) algorithms such as LSTMs.
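As a sketch, this conversion can be as simple as mapping each word to an integer index (reserving index 0 for padding is a common convention; the corpus here is illustrative):

```python
corpus = ["i love nlp", "nlp is fun"]

vocab = {}       # word -> integer index (0 reserved for padding)
sequences = []
for sent in corpus:
    seq = []
    for w in sent.split():
        if w not in vocab:
            vocab[w] = len(vocab) + 1
        seq.append(vocab[w])
    sequences.append(seq)
# sequences: [[1, 2, 3], [3, 4, 5]]
```

These integer sequences are usually padded to equal length and then passed through an embedding layer before reaching the LSTM.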
The above content is taken from this book.