PCA, which stands for Principal Component Analysis, is a method used for dimensionality reduction of a dataset containing many variables that are correlated with each other, either heavily or lightly, while retaining as much of the variation present in the dataset as possible.
For a lot of machine learning applications it helps to visualize the data. Visualizing 2- or 3-dimensional data is not that challenging, but visualizing higher-dimensional data directly is practically impossible.
PCA works better if the data is mean centered, and it is based on eigenvalues and eigenvectors. We can use either SVD or eigendecomposition. The eigenvectors of such a decomposition are used as a rotation matrix, arranged in decreasing order of their explained variance.
Mean centering the data ensures that the first principal component captures the direction of maximum variance of the input data.
Singular Value Decomposition:
The SVD of a matrix A with m rows, n columns and rank r factorizes it as A = U S V^T, where U is an m x r matrix with orthonormal columns, S is an r x r diagonal matrix of singular values, and V is an n x r matrix with orthonormal columns.
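As a quick illustration, numpy's linalg.svd returns these factors directly:

```python
import numpy as np

A = np.random.rand(5, 3)            # example matrix with m=5 rows, n=3 columns

# full_matrices=False gives the "thin" SVD: U is (m, r), S holds the r singular
# values in decreasing order, and Vt is (r, n), with r = min(m, n) here
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the factors back together recovers the original matrix
A_reconstructed = U @ np.diag(S) @ Vt
print(np.allclose(A, A_reconstructed))   # True
```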
When Not to Use PCA:
If the relationship (correlation) between the variables is weak, then PCA does not work well for reducing the data.
Python Code:
To keep it simple, I have used eigendecomposition instead of SVD.
Importing Libraries:
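Only numpy is strictly needed for the eigendecomposition; matplotlib is assumed here just for visualizing the reduced data:

```python
import numpy as np                 # cov, eigh and general array operations
import matplotlib.pyplot as plt    # optional: to plot the reduced data
```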
PCA Function:
First I mean centered the data and then found the covariance matrix (whether you compute it on the original data or on the mean centered data, the result is the same, since the covariance calculation subtracts the mean internally). Using numpy's eigh on this matrix, we get the eigenvalues and eigenvectors.
Next I sorted the eigenvalues' indices (argsort will do it) and reversed them so that the higher values come first. Then, using these sorted indices, I sorted the eigenvalues and eigenvectors. Now I take only the first n eigenvectors for reducing the data. Finally, the dot product between the chosen eigenvectors and the mean centered data gives us the dimension-reduced data.
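Putting those steps together, a minimal sketch of the PCA function (assuming the input X has one sample per row):

```python
import numpy as np

def pca(X, n_components=2):
    """Reduce X (samples x features) to n_components dimensions."""
    # Step 1: mean center the data
    X_demeaned = X - np.mean(X, axis=0)

    # Step 2: covariance matrix of the features
    covariance = np.cov(X_demeaned, rowvar=False)

    # Step 3: eigenvalues and eigenvectors (eigh works on symmetric matrices)
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance)

    # Step 4: sort the eigenvalue indices in decreasing order
    idx = np.argsort(eigen_vals)[::-1]
    eigen_vecs = eigen_vecs[:, idx]

    # Step 5: keep only the first n eigenvectors and project the data
    components = eigen_vecs[:, :n_components]
    return np.dot(X_demeaned, components)
```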
In this blog, let us see how to predict a country using analogies between words with a Vector Space Model (VSM). Given a pair of a city and its country, along with another city, we need to predict the country to which that second city belongs.
Word Embeddings:
It is the collective name for a set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers.
Vector Space Model:
It is used to represent words and documents as vectors. To build a VSM with a word-by-word design, we construct a co-occurrence matrix and extract vector representations for the words in the corpus. A similar approach is used for the word-by-document design. It also allows us to capture dependencies between words.
Example: I like simple data. I like simple raw data.
Here, if we choose the window length to be two, then the co-occurrence count of 'simple' and 'data' will be two, since the two words appear within that window in both sentences.
For simple and data,
Simple data – distance is 1
Simple raw data – distance is 2
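A small sketch of this window-based co-occurrence count, assuming a plain whitespace tokenizer:

```python
from itertools import combinations
from collections import Counter

def cooccurrence_counts(corpus, window=2):
    """Count how often each word pair appears within `window` words of each other."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, j in combinations(range(len(tokens)), 2):
            if j - i <= window:
                counts[frozenset((tokens[i], tokens[j]))] += 1
    return counts

corpus = ["I like simple data", "I like simple raw data"]
counts = cooccurrence_counts(corpus, window=2)
print(counts[frozenset(("simple", "data"))])   # 2
```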
Applications of VSM:
Used to identify similarities
Information extraction
Machine Translation
Chatbots, etc..
Libraries Used:
numpy
pandas
matplotlib
nltk
gensim – used to convert words to vectors.
pickle – used to read pickle files (Optional)
Importing Libraries:
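A sketch of the imports based on the libraries listed above; gensim's KeyedVectors is assumed for loading the pre-trained word vectors:

```python
import pickle                             # optional: to read pickle files
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from gensim.models import KeyedVectors   # to load the pre-trained word vectors
```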
Word Embeddings:
Since we have around 30 lakh (3 million) words in the original dataset, we will be using only selected words (capitals and countries). If you need the full dataset's word embeddings, set complete to True and set_words to an empty list ([]).
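A sketch of how such a helper could look, assuming the pre-trained Google News word2vec binary (the file name and the function signature here are assumptions):

```python
from gensim.models import KeyedVectors

def get_word_embeddings(path, set_words, complete=False):
    """Return a {word: 300-d vector} dictionary for the chosen words."""
    model = KeyedVectors.load_word2vec_format(path, binary=True)
    if complete:
        set_words = list(model.key_to_index)        # every word in the model (gensim 4.x)
    return {word: model[word] for word in set_words if word in model}

embeddings = get_word_embeddings(
    "GoogleNews-vectors-negative300.bin",           # assumed file name
    set_words=["Athens", "Greece", "Bangkok", "Thailand"],
    complete=False,
)
```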
Cosine Similarities:
It is used to check how similar two vectors are. The wider the angle between the vectors, the lower the cosine value; the higher the cosine value, the more similar the vectors. Two vectors are said to be very similar when the distance between them is small compared to other vectors. We can use either cosine similarity or Euclidean distance to measure the similarity between vectors.
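A minimal numpy sketch of cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```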
Predict Country:
The inputs are city1, country1, city2 and the embeddings. city1 is the capital of country1, and city2 is the capital of country2, which we want to predict. This can also be described as finding a word using analogies between words.
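A sketch of this analogy-based prediction, reusing the embeddings dictionary and the cosine_similarity helper above:

```python
def predict_country(city1, country1, city2, embeddings):
    """Predict country2 such that city1 : country1 :: city2 : country2."""
    # analogy vector: country1 - city1 + city2
    target_vec = embeddings[country1] - embeddings[city1] + embeddings[city2]

    best_word, best_score = None, -1.0
    for word, vec in embeddings.items():
        if word in (city1, country1, city2):
            continue                            # skip the input words themselves
        score = cosine_similarity(target_vec, vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

# e.g. predict_country("Athens", "Greece", "Bangkok", embeddings) -> ("Thailand", ...)
```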
Each word's embedding has a shape of (1, 300), so visualizing it directly will be very difficult. So we can use a method called PCA (Principal Component Analysis), which will be covered in my next blog.
In the previous blog we saw sentiment analysis of a tweet using logistic regression. In this blog we will see the Naïve Bayes approach for spam detection (whether a message is spam or not). We can also use this for sentiment analysis, author identification, word disambiguation, etc.
Concept:
Before we move into the concept, let us see the assumptions made by this method. Naïve Bayes assumes that the features we use for classification are all independent, which is rarely the case in reality. In spirit, this method is similar to logistic regression.
For building this model, we will be using probability and Bayes' theorem.
With respect to the Bayes' theorem formula above (P(A|B) = P(B|A) * P(A) / P(B)), A will be a word and B will be the label (1 or 0). But the problem with this formula is that it returns zero for any word that has not occurred in the corpus we used. Since we assume the words are independent, we multiply the probabilities, so the whole product becomes zero, and the log of zero is undefined (it tends to negative infinity). To overcome this problem we will be using Laplacian smoothing.
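Laplacian (add-one) smoothing replaces the raw conditional probability with:

P(w | class) = (freq(w, class) + 1) / (N_class + V)

where N_class is the total count of all words in that class and V is the number of unique words in the vocabulary. Adding 1 to every count guarantees that no word ever gets a probability of exactly zero.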
Since we risk an underflow problem due to the multiplication of many small probabilities, we use log(value) to overcome it.
Here the first part is known as the log prior and the second part is known as the loglikelihood. With the help of the loglikelihood, we can tell how strongly each word points towards one class or the other.
Training Naïve Bayes in Summary:
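A compact sketch of the training step, assuming freqs is the {(word, label): count} frequency dictionary (built in the Frequency Dictionary section below) and train_y is a numpy array of 1/0 labels:

```python
import numpy as np

def train_naive_bayes(freqs, train_y):
    """Return the log prior and a {word: loglikelihood} dictionary."""
    vocab = {word for word, _ in freqs}
    V = len(vocab)

    # total number of words seen in each class
    n_pos = sum(count for (_, label), count in freqs.items() if label == 1)
    n_neg = sum(count for (_, label), count in freqs.items() if label == 0)

    # log prior: ratio of positive (spam) to negative (not spam) documents
    d_pos = np.sum(train_y == 1)
    d_neg = np.sum(train_y == 0)
    logprior = np.log(d_pos) - np.log(d_neg)

    # loglikelihood per word, with Laplacian smoothing
    loglikelihood = {}
    for word in vocab:
        p_w_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + V)
        p_w_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + V)
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)
    return logprior, loglikelihood
```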
Testing in Summary:
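And a matching sketch for prediction: add the log prior to the sum of the loglikelihoods of the words in the message, and classify it as spam when the score is positive (process_tweet is the assumed preprocessing helper):

```python
def naive_bayes_predict(text, logprior, loglikelihood):
    """Return the Naive Bayes score; > 0 means spam (positive class)."""
    words = process_tweet(text)               # assumed tokenizing/cleaning helper
    score = logprior
    for word in words:
        score += loglikelihood.get(word, 0)   # unseen words contribute nothing
    return score

# label = 1 (spam) if naive_bayes_predict(msg, logprior, loglikelihood) > 0 else 0
```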
Applications:
Spam detection
Sentiment Analysis
Author Identification, etc.
Libraries Used:
csv – for reading csv files
nltk
numpy
re
math – for math operations
I have not described the libraries that were already used in the previous blog, so check out the previous blog for a short description.
Importing Libraries:
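A sketch of the imports based on the libraries listed above (the specific nltk submodules are assumptions, based on the preprocessing used):

```python
import csv
import re
import math
import numpy as np
import nltk
from nltk.corpus import stopwords      # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
```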
Process Tweet:
I have explained data preprocessing in my previous blog. Check that out for clarification. Link is at the bottom of the blog.
Reading .csv File:
Convert Labels:
We will be converting the labels to positive (1, for spam) and negative (0, for not spam).
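A minimal sketch covering both steps, reading the messages from the .csv file and converting the labels, using the csv and numpy imports above; the file name, encoding and column order are assumptions:

```python
texts, labels = [], []
with open("spam.csv", encoding="latin-1") as f:        # assumed file name and encoding
    reader = csv.reader(f)
    next(reader)                                       # skip the assumed header row
    for row in reader:
        label, text = row[0], row[1]                   # assumed column order
        labels.append(1 if label == "spam" else 0)     # spam -> positive (1)
        texts.append(text)

labels = np.array(labels)
```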
Frequency Dictionary:
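A sketch of building the frequency dictionary as a {(word, label): count} mapping (process_tweet is again the assumed preprocessing helper):

```python
def build_freqs(texts, labels):
    """Count how often each (word, label) pair occurs in the data."""
    freqs = {}
    for text, label in zip(texts, labels):
        for word in process_tweet(text):
            pair = (word, label)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs
```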
Splitting the Dataset:
Since we are using a small dataset (about 5,500 messages), we split it 80-20 into training and testing sets respectively.
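A simple sketch of that split, reusing the helpers above:

```python
split = int(0.8 * len(texts))                  # 80% for training
train_x, test_x = texts[:split], texts[split:]
train_y, test_y = labels[:split], labels[split:]

freqs = build_freqs(train_x, train_y)          # frequencies from the training set only
logprior, loglikelihood = train_naive_bayes(freqs, train_y)
```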
NLP, which stands for Natural Language Processing, is a subset of machine learning concerned with the interaction between human languages and computers. We know that many people share their feelings on social media, especially on Twitter. So in this blog we are going to create a model that predicts whether a tweet is positive ( 🙂 ) or negative ( 😦 ) using Python.
Libraries Used:
nltk – Natural Language ToolKit used especially for NLP tasks
numpy – For array operations
re – A regular expression (or RE) specifies a set of strings that matches it, and the functions in this module help us check whether a particular string matches a given regular expression. This is mainly used for preprocessing the data.
Before going into the coding part, let us see some concept behind the sentiment analysis model.
Sparse Representation Vs Positive Negative Counts Representation
Before going into the representations, let us understand what vocabulary and feature extraction are. The vocabulary is a set containing all the words in the dataset, and feature extraction is the step that produces a feature representation describing the given data. Example: consider the vocabulary below.
V=[‘I’,’am’,……..,’good’,…,’sad’,…,’hated’].
feature = [1,1,……..,1,…,0,…,0] for ‘I am good’
This is known as feature extraction. The feature representation in the above example is called a sparse representation. Since a sparse representation contains n features, where n equals the size of the vocabulary, the time taken for training as well as prediction will be larger. So we will be using another type of representation called the positive-negative counts representation.
Consider the corpus given below.
corpus = [‘I am happy’, ‘I am sad’ , ‘I am good’, ‘I am an arrogant’]
Here the 1st and 3rd sentences are positive whereas other two are negative.
We will be using a logistic regression model because it returns a value between 0 and 1, from which we can get one of two outputs (positive or negative) based on the threshold we choose. From the training set, we count how often each word appears in positive tweets and in negative tweets, as mentioned below, and for a given tweet we take the sum of its words' positive frequencies and the sum of their negative frequencies.
NOTE: LOGISTIC REGRESSION CAN ALSO BE USED FOR GETTING MORE THAN TWO OUTPUT CLASSES.
This is known as frequency dictionary.
Example: ‘I am sad. I am arrogant’ == [‘I’, ‘am’, ‘sad’, ‘arrogant’]
In positive: sum_positive_frequency = 2+2+0+0=4
In negative: sum_negative_frequency = 2+2+1+1=6
feature of the above tweet = [1, 4, 6]
This representation is known as the positive-negative counts representation. As we can see, the number of features in this representation is only three, so training and prediction will be faster. The leading 1 represents the bias unit, which is equivalent to the constant term we add to linear equations.
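A sketch of this feature extraction, assuming freqs is the {(word, label): count} frequency dictionary and process_tweet is the preprocessing step described next:

```python
import numpy as np

def extract_features(tweet, freqs):
    """Return the 3-element feature vector [bias, positive count, negative count]."""
    words = process_tweet(tweet)                 # assumed preprocessing helper
    x = np.zeros(3)
    x[0] = 1                                     # bias unit
    for word in words:
        x[1] += freqs.get((word, 1), 0)          # count in positive tweets
        x[2] += freqs.get((word, 0), 0)          # count in negative tweets
    return x
```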
Data Preprocessing
It includes two main steps, namely stemming and removing stopwords. Stemming is a process in which only the root of a word is kept. Example: 'learning' and 'learned' both become 'learn' after stemming.
Stopwords are words that do not add much meaning to a sentence, such as 'and', 'a', 'an', etc., and we usually remove punctuation as well. But sometimes punctuation adds meaning to the sentence. Consider the example 'I am :(': this tweet expresses sadness, but if we remove the punctuation it becomes a sentence without meaning. So this case should be handled carefully; it can be solved using tokenizers or libraries that understand such emoticons.
Preprocessing also includes converting tweets to lowercase.
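A sketch of such a preprocessing function using nltk; the exact regex patterns and tokenizer settings here are one reasonable set of choices:

```python
import re
import string
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """Clean, tokenize, lowercase, remove stopwords/punctuation and stem a tweet."""
    tweet = re.sub(r'https?://\S+', '', tweet)          # remove URLs
    tweet = re.sub(r'#', '', tweet)                     # drop '#' but keep the word
    tokenizer = TweetTokenizer(preserve_case=False,     # lowercase everything
                               strip_handles=True,      # remove @handles
                               reduce_len=True)
    tokens = tokenizer.tokenize(tweet)

    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    # emoticons such as ':(' survive because they are not in string.punctuation
    return [stemmer.stem(tok) for tok in tokens
            if tok not in stop_words and tok not in string.punctuation]
```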
Logistic Regression:
Sigmoid function => h(x(i), ϴ) = 1 / (1 + exp(-(ϴ^T x(i))))
The sigmoid function maps any real value to a value between 0 and 1, and its curve is shaped like the letter S. It is also known as the logistic function.
Sigmoid Curve
Cost Function => J = -(1/m) * sum(y * log(h) + (1-y) * log(1-h))
The cost function is the average of the errors produced when comparing the true labels with the predicted labels. When the label is 0 and the prediction is also 0 (y=0 and h≈0), the contribution to J is zero, because y multiplied by anything is zero and log(1-h) ≈ log(1) = 0. Similarly, if the label is 1 and the prediction is 1 (y=1 and h≈1), the contribution is zero. But what if the label is 1 and the predicted value is close to 0? In that case log(h) tends to negative infinity, and with the minus sign outside the cost blows up towards positive infinity; the same happens in the opposite case. Thus the left part (y * log(h)) penalizes mistakes on label 1 (positive) and the right part ((1-y) * log(1-h)) penalizes mistakes on label 0 (negative). Since we take the log of a value between 0 and 1, each term is negative, so the minus sign in front ensures that the cost is always positive.
One of the objectives in creating the model is to minimize the cost function to a negligible value. We will be using gradient descent for this.
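A minimal numpy sketch of the sigmoid, the cost and the gradient descent loop described above (the learning rate and iteration count are example values):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, y, theta, alpha=1e-9, num_iters=1500):
    """X: (m, 3) feature matrix, y: (m, 1) labels, theta: (3, 1) weights."""
    m = X.shape[0]
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                                       # predictions
        J = -(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / m       # cost
        theta = theta - (alpha / m) * (X.T @ (h - y))                # update step
    return J.item(), theta
```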