How to Run Text Classification Using Support Vector Machines, Naive Bayes, and Python
Is your quest for text classification knowledge getting you down? We felt the same way: there is not much out there to help those who are new to natural language processing and text classification algorithms.
Learning text classification typically requires researching many articles, books, and videos. This is my take on explaining the technique with just enough content to get you working. By the end, you will have the working knowledge required to take on the interesting world of Natural Language Processing with Python.
What is Text Classification?
Since we’re all new to this, let’s start with the basics: text classification is the automated process of assigning text to predefined categories. We can classify emails into spam or non-spam, foods into hot dog or not hot dog, etc.
Text classification can be done with the help of Natural Language Processing and different classification algorithms, such as Naive Bayes and Support Vector Machines (SVM), which we will use in this post.
What is Natural Language Processing?
Natural Language Processing (NLP) is a branch of AI that focuses on helping computers understand and interpret human language. Unlike a programming language, human language is not a fixed set of rules that can be fed into a system; it contains a great deal of ambiguity and nuance that does not follow any logical rules.
However, doors are opening thanks to machine learning, deep learning, and neural networks. These tools and algorithms are built to analyze, and cope with, the ambiguity inherent in systems such as human language.
Machine Learning Model to Classify Text
To this end, I will be using the Amazon Review Data set, which contains 10,000 rows of text data. The data set has two columns, "text" and "label". For those following along, download the Amazon Review Data set here.
1. Add the Required Libraries
Before coding, we will import and use the following libraries throughout this post. These can be easily installed with pip or downloaded through their respective websites. Note that NLTK also needs a few of its data packages for the script below (for example punkt, stopwords, wordnet, and the averaged perceptron tagger), which you can fetch once with nltk.download().
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score
2. Set Random Seed
Setting a random seed lets you reproduce the same result every time you run the script unchanged; otherwise, each run will produce different results. The exact value is up to you: you can set the seed to any number.
np.random.seed(500)
3. Add the Corpus
Next, you can easily load the data set as a pandas data frame with the help of the read_csv function. I have set the encoding to 'latin-1' as the text contains many special characters.
Corpus = pd.read_csv(r"...\NLP Project\corpus.csv",encoding='latin-1')
4. Data Preprocessing
Data preprocessing involves transforming raw data into a format NLP models can work with. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven way of resolving such issues, and it helps the classification algorithms produce better results.
Below, I explain two of the more common data preprocessing techniques:
Technique 1: Tokenization
Firstly, tokenization is a process of breaking text up into words, phrases, symbols, or other tokens. The list of tokens becomes input for further processing. The NLTK Library has word_tokenize and sent_tokenize to easily break a stream of text into a list of words or sentences, respectively.
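For example, here is a minimal sketch of both functions (NLTK's 'punkt' tokenizer models must be downloaded once, e.g. with nltk.download('punkt')):

from nltk.tokenize import word_tokenize, sent_tokenize

sample = "Natural Language Processing is fun. Tokenization is the first step."
# Split into sentences: ['Natural Language Processing is fun.', 'Tokenization is the first step.']
print(sent_tokenize(sample))
# Split into word-level tokens; punctuation becomes its own token:
# ['Natural', 'Language', 'Processing', 'is', 'fun', '.', 'Tokenization', 'is', 'the', 'first', 'step', '.']
print(word_tokenize(sample))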
Technique 2: Word Stemming/Lemmatization
Similarly, the aim of both stemming and lemmatization is the same: reduce the inflectional forms of each word to a common base or root. Lemmatization is closely related to stemming, but stemmers operate on a single word without knowledge of the context, so they cannot discriminate between words which have different meanings depending on the part of speech. On the other hand, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications. The table and the short sketch below illustrate the difference.
| Form        | Stem   | Lemma     |
|-------------|--------|-----------|
| Studies     | Studi  | Study     |
| Studying    | Study  | Study     |
| beautiful   | beauti | beautiful |
| beautifully | beauti | beautiful |
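To see the difference in practice, here is a minimal sketch using NLTK's PorterStemmer and WordNetLemmatizer. Note that without a part-of-speech tag the lemmatizer treats every word as a noun, which is why the full script later passes POS tags to get results like those in the table:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "beautiful", "beautifully"]:
    # The stemmer chops suffixes mechanically (e.g. 'studies' -> 'studi'),
    # while the lemmatizer looks the word up in WordNet; without a POS tag
    # it treats every word as a noun (e.g. 'studies' -> 'study')
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))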
Data Preprocessing Script
To conclude this section, here is the complete script that performs the data preprocessing steps. You can always add or remove steps to best suit the data set you are dealing with:
- Remove Blank rows in the data, if any
- Change all the text to lower case
- Word tokenization
- Remove stop words
- Remove non-alpha text
- Word lemmatization
# Step - a : Remove blank rows, if any (reset the index so it stays aligned with the loop below)
Corpus = Corpus.dropna(subset=['text']).reset_index(drop=True)
# Step - b : Change all the text to lower case. This is required as Python interprets 'dog' and 'DOG' differently
Corpus['text'] = [entry.lower() for entry in Corpus['text']]
# Step - c : Tokenization : each entry in the corpus will be broken into a list of words
Corpus['text'] = [word_tokenize(entry) for entry in Corpus['text']]
# Step - d : Remove stop words and non-alphabetic tokens, and perform word lemmatization.
# WordNetLemmatizer requires POS tags to know whether a word is a noun, verb, adjective, etc. By default it is set to noun.
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
for index, entry in enumerate(Corpus['text']):
    # Declare an empty list to store the words that follow the rules for this step
    Final_words = []
    # Initialize WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag provides the 'tag', i.e. whether the word is a Noun (N), Verb (V), or something else
    for word, tag in pos_tag(entry):
        # The condition below skips stop words and keeps only alphabetic tokens
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each entry is stored in 'text_final'
    Corpus.loc[index, 'text_final'] = str(Final_words)
5. Prepare, Train, and Test Data Sets
The corpus is split into two data sets: training and test. The training data set is used to fit the model, and predictions are performed on the test data set. This can be done with train_test_split from the sklearn library. The training data will contain 70% of the corpus and the test data the remaining 30%, as we have set the parameter test_size=0.3.
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'],Corpus['label'],test_size=0.3)
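Purely optionally, if you want the split itself to be reproducible independently of the global NumPy seed, you can pass random_state (and stratify, to keep the label proportions similar in both sets); this is just a variation on the line above:

Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
    Corpus['text_final'], Corpus['label'],
    test_size=0.3, random_state=500, stratify=Corpus['label'])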
6. Encoding
Label encode the target variable to transform its categorical string values into numerical values, thus creating data the model can understand.
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
# Reuse the mapping learned on the training labels for the test labels
Test_Y = Encoder.transform(Test_Y)
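If you want to see which string label was mapped to which integer, you can inspect the encoder's classes_ attribute:

# classes_ lists the original string labels; their position is the encoded integer value
print(Encoder.classes_)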
7. Word Vectorization
Word vectorization is the general process of turning a collection of text documents into numerical feature vectors. There are many methods to convert text data to vectors which the model can understand, but the most popular method is TF-IDF, an acronym that stands for "Term Frequency - Inverse Document Frequency". These are the two components of the resulting score assigned to each word.
- Term Frequency: This summarizes how often a given word appears within a document.
- Inverse Document Frequency: This down scales words that appear a lot across documents.
Without going into the math, TF-IDF scores are word frequency scores that try to highlight words that are more interesting, i.e. frequent in a document but not across documents.
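To make this concrete, here is a minimal, self-contained sketch on a made-up three-document corpus (the documents are purely illustrative), showing how the vectorizer assigns the lowest IDF to words that appear everywhere and the highest to words that appear in only one document:

from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents made up purely for illustration
toy_docs = ["the dog barks", "the dog sleeps", "the cat sleeps"]
toy_vect = TfidfVectorizer()
toy_vect.fit(toy_docs)

# 'the' appears in every document and gets the lowest IDF;
# 'barks' and 'cat' appear in only one document and get the highest
for word, idx in sorted(toy_vect.vocabulary_.items()):
    print(word, round(toy_vect.idf_[idx], 3))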
The syntax below first fits the TF-IDF model on the whole corpus. This lets TF-IDF build a vocabulary of the words it learned from the corpus data, assigning a unique integer index to each of these words. There will be a maximum of 5000 unique words/features, as we have set the parameter max_features=5000.
Finally, we will transform Train_X and Test_X into the vectorized Train_X_Tfidf and Test_X_Tfidf. As a result, each row will now contain a set of unique integer indices and their associated importance as calculated by TF-IDF.
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
You can use the below syntax to see the vocabulary that it has learned from the corpus:
print(Tfidf_vect.vocabulary_)
This will print a dictionary mapping each word to its integer index, for example:
{'even': 1459, 'sound': 4067, 'track': 4494, 'beautiful': 346, 'paint': 3045, 'mind': 2740, 'well': 4864, 'would': 4952, 'recommend': 3493, 'people': 3115, 'hate': 1961, 'video': 4761, ...}
Additionally, you can directly print the vectorized data to see how it looks:
print(Train_X_Tfidf)
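Each printed row of this sparse representation is roughly of the form (document_index, feature_index) followed by the TF-IDF weight. To sanity-check the dimensions, you can also print the shape:

# (number of training documents, number of features, capped at 5000 by max_features)
print(Train_X_Tfidf.shape)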
Our data sets are now ready to be fed into different classification algorithms.
8. Use the ML Algorithms to Predict the Outcome
Firstly, let’s try the Naive Bayes Classifier Algorithm. You can read more about Naive Bayes here.
# Fit the training dataset on the Naive Bayes classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf, Train_Y)
# Predict the labels on the validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
# Use the accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ", accuracy_score(predictions_NB, Test_Y)*100)
Output:
Naive Bayes Accuracy Score -> 83.1%
Next, let's try the SVM (Support Vector Machine). You can read more about SVMs here.
# Classifier - Algorithm - SVM
# Fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf, Train_Y)
# Predict the labels on the validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
# Use the accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ", accuracy_score(predictions_SVM, Test_Y)*100)
Output:
SVM Accuracy Score -> 84.6%
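Accuracy alone can hide class imbalance, so it can also be worth looking at per-class precision and recall; here is a short sketch using scikit-learn's built-in report for the SVM predictions:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1 for the SVM predictions
print(classification_report(Test_Y, predictions_SVM))
# Rows are the true classes, columns the predicted classes
print(confusion_matrix(Test_Y, predictions_SVM))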
Finishing Up
In conclusion, I hope this has explained what text classification is and how it can be easily implemented in Python.
As a next step you can try the following:
- Play around with the data preprocessing steps and see how it affects the accuracy.
- Try other word vectorization techniques such as Count Vectorizer and Word2Vec.
- Try parameter tuning with the help of GridSearchCV on these algorithms (a small sketch follows this list).
- Try other classification algorithms such as Linear Classifier, Boosting Models, and even Neural Networks.
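As a rough sketch of the parameter-tuning idea mentioned in the list above (the grid values below are just examples, not tuned recommendations):

from sklearn.model_selection import GridSearchCV

# Example grid: regularization strength and kernel for the SVM
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(svm.SVC(gamma='auto'), param_grid, cv=5, scoring='accuracy')
grid.fit(Train_X_Tfidf, Train_Y)
print("Best parameters: ", grid.best_params_)
print("Best cross-validated accuracy: ", grid.best_score_)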