How to build your own Machine Learning Model with TensorFlow
A tutorial to get hands-on experience with ML
Hello beautiful people!
Now that everyone is excited about AI and uses it in their daily lives, it's a great time to dig deep into the foundation of AI: Machine Learning. And the best way to do that is to get hands-on experience by building an ML model yourself!
In this tutorial, you'll learn the basics of Machine Learning with Python and build your own Machine Learning model with TensorFlow, a Python library. The model will be able to detect the tone/emotion of input text by studying and learning from the sample dataset provided.
Prerequisites
You should have a fundamental knowledge of data science concepts and the libraries used in Python Machine Learning. This tutorial will use the Pandas, NumPy, and TensorFlow libraries.
Make sure you have Python installed on your system (Python 3.9+ is recommended).
What is Machine Learning?
In simple terms, Machine Learning (ML) means making a computer learn and make predictions by studying data and statistics. By studying the provided data, the computer can identify and extract patterns and then make predictions based on them. Spam email identification, speech recognition, and traffic prediction are some real-life use cases of Machine Learning.
(How ML Models detect spam and ham emails by studying the data in sample datasets.)
Even though the Python programming language was not built specifically for ML or Data Science, it's considered a great language for ML due to its versatility. With hundreds of libraries available for free download, anyone can easily build ML models using a pre-built library without having to program the complete procedure from scratch.
TensorFlow is a widely recognized, free, and open-source library built by Google for Machine Learning and Artificial Intelligence. TensorFlow is often used by data scientists, data engineers, and other developers to build Machine Learning models easily, as TensorFlow consists of a variety of machine learning and AI algorithms.
Visit the Official TensorFlow Website
Installation
To install TensorFlow, run the following command in your terminal.
pip install tensorflow
To install Pandas and Numpy:
pip install pandas numpy
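You can verify the installation with a quick version check in your terminal (the printed version will vary with your setup):
python -c "import tensorflow as tf; print(tf.__version__)"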
Download the sample CSV file: GitHub Repository - TensorFlow ML Model
The sample dataset used in this tutorial has two columns: "text" and "sentiment". While the "text" column holds a variety of statements made about movies, books, etc., the "sentiment" column shows whether each text is positive, neutral, or negative, using the numbers 1, 2, and 0 respectively.
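For illustration, the first few rows of such a dataset might look like this (hypothetical example rows, not the exact contents of the file):
text,sentiment
"I loved this movie",1
"The book was okay",2
"What a terrible film",0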
Data Preparation
Note: Before analyzing any large dataset, it’s crucial to clean the duplicates and remove null values. However, since the given dataset is fairly small and doesn’t contain duplicates or null values, you can skip the data-cleaning process.
First, you should gather and prepare the dataset for training the sentiment analysis model. Pandas, a popular library for data analysis and manipulation, can be used for this task.
import pandas as pd
# Load data from CSV
data = pd.read_csv('sentiment.csv')
# Text data and labels
texts = data['text'].tolist()
labels = data['sentiment'].values
The above code loads the CSV file into a DataFrame using the pandas.read_csv() function. Next, it converts the values of the "text" column into a Python list using tolist() and creates a NumPy array from the values of the "sentiment" column.
Using a list for the text data makes sequential access and manipulation straightforward, while a NumPy array handles the numerical labels efficiently for machine learning tasks.
Processing Text
After preparing the sample data, you need to preprocess the text, which involves tokenization. Tokenization is the process of splitting each text sample into individual words, or tokens, so that the raw text data is converted into a format the model can process, allowing it to understand and learn from the individual words in each text sample. Refer to the image below to see how tokenization works.
(How Tokenization works and possible errors.)
In this project, it's best to use manual tokenization instead of a pre-built tokenizer, as it provides finer control over the tokenization process, ensures compatibility with specific data formats, and allows for tailored preprocessing steps.
Note: In Manual Tokenization, you write code to split text into words, which is highly customizable according to the needs of the project. However, other methods, such as TensorFlow Keras Tokenizer, come with ready-made tools and functions for splitting text automatically, which is easier to implement but less customizable.
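For comparison, here is a minimal sketch of the pre-built alternative using the Keras Tokenizer (not used in this tutorial; the manual approach follows below):
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit the tokenizer on the texts and convert them to integer sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
keras_sequences = tokenizer.texts_to_sequences(texts)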
word_index = {}
sequences = []
for text in texts:
    words = text.lower().split()
    sequence = []
    for word in words:
        if word not in word_index:
            word_index[word] = len(word_index) + 1
        sequence.append(word_index[word])
    sequences.append(sequence)
In the above code:
word_index: An empty dictionary created to store each unique word in the dataset along with its index.
sequences: An empty list that stores the sequence of numerical word representations for each text sample.
for text in texts: Loops through each text sample in the "texts" list (created earlier).
words = text.lower().split(): Converts each text sample to lowercase and splits it into individual words based on whitespace.
for word in words: A nested loop that iterates over each word in the "words" list, which contains the tokenized words of the current text sample.
if word not in word_index: If the word is not already present in the word_index dictionary, it's added along with a unique index, obtained by adding 1 to the current length of the dictionary.
sequence.append(word_index[word]): After determining the index of the current word, it's appended to the "sequence" list. This converts each word in the text sample to its corresponding index based on the "word_index" dictionary.
sequences.append(sequence): After all the words in the text sample are converted to numerical indices and stored in the "sequence" list, this list is appended to the "sequences" list.
In summary, the above code tokenizes the text data by converting each word to its numerical representation based on the dictionary word_index, which maps words to unique indices. It creates sequences of numerical representations for each text sample, which can be used as input data for the model.
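For example, tokenizing two short hypothetical texts with the loop above would produce:
# texts = ["great movie", "bad movie"]   (hypothetical input)
# word_index -> {'great': 1, 'movie': 2, 'bad': 3}
# sequences  -> [[1, 2], [3, 2]]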
Model Architecture
The architecture of a certain model is the arrangement of layers, components, and connections that determine how data flows through it. The architecture of the model has a significant impact on the model’s training speed, performance, and generalization ability.
After processing the input data, you can define the architecture of the model as in the example below. Note that it uses max_length, the length of the longest sequence, which is computed in the padding step later in the tutorial.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(word_index) + 1, 16, input_length=max_length),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(3, activation='softmax')
])
In the above code, we use TensorFlow Keras, a high-level neural networks API built for fast experimentation and prototyping of Deep Learning models, which simplifies the process of constructing and compiling machine learning models.
tf.keras.Sequential(): Defines a sequential model, which is a linear stack of layers.
tf.keras.layers.Embedding(len(word_index) + 1, 16, input_length=max_length): This layer performs word embedding, which converts words into dense vectors of fixed size. len(word_index) + 1 specifies the vocabulary size, 16 is the dimensionality of the embedding, and input_length=max_length sets the input length for each sequence.
tf.keras.layers.LSTM(64): A Long Short-Term Memory (LSTM) layer, which is a type of recurrent neural network (RNN) layer. It has 64 units, which determine the dimensionality of the output space.
tf.keras.layers.Dense(3, activation='softmax'): A densely connected layer with 3 units and a softmax activation function. It's the output layer of the model, producing a probability distribution over the three sentiment classes.
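Once the model is defined, you can inspect its layer arrangement and parameter counts with Keras's built-in summary method:
model.summary()  # prints each layer, its output shape, and its parameter count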
Compilation
In Machine Learning with TensorFlow, compilation refers to the process of configuring the model for training by specifying three key components: the loss function, the optimizer, and the metrics.
Loss Function: Calculates how well the model's predictions match the actual targets, guiding model improvement.
Optimizer: Adjusts the model's parameters to minimize the loss function, enabling efficient learning.
Metrics: Provides performance evaluation beyond loss, such as accuracy or precision, aiding in model assessment.
The below code can be used to compile the Sentiment Analysis Model:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
Here,
loss='sparse_categorical_crossentropy': A loss function generally used for classification tasks where the target labels are integers and the model's output is a probability distribution over multiple classes. It measures the difference between the true labels and the predictions, which the model aims to minimize during training.
optimizer='adam': Adam is an optimization algorithm that adapts the learning rate dynamically during training. It's widely used in practice because of its efficiency, robustness, and effectiveness across a wide range of tasks compared to other optimizers.
metrics=['accuracy']: Accuracy is a common metric used to evaluate classification models. It provides a straightforward measure of the model's overall performance on the task: the percentage of samples for which the model's predictions match the true labels.
Training the Model
Now that the input data is processed and ready and the model's architecture is defined, you can train the model using the model.fit() method.
model.fit(padded_sequences, labels, epochs=15, verbose=1)
padded_sequences: The input data for training the model, which consists of sequences of uniform length (padding will be discussed later in the tutorial).
labels: The target labels corresponding to the input data (i.e., the sentiment categories assigned to each text sample).
epochs=15: An epoch is one complete pass through the entire training dataset. Accordingly, in this program, the model iterates over the complete dataset 15 times during training. Increasing the number of epochs can improve performance, as the model learns more complex patterns from the data. However, with too many epochs the model may memorize the training data (which is called "overfitting"), leading to poor generalization on new data. Training time also grows as the number of epochs increases, and vice versa. A common safeguard is early stopping, shown in the sketch after this list.
verbose=1: A parameter that controls how much output the fit method produces while training. A value of 1 means progress bars are displayed in your console as the model trains, 0 means no output, and 2 means one line per epoch. Since it's useful to see the accuracy, loss, and time taken for each epoch, it's set to 1.
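As mentioned above, early stopping halts training once the monitored metric stops improving, guarding against overfitting. Here is a minimal sketch (the patience value is an arbitrary choice, and since this tutorial doesn't use a validation set, the training loss is monitored):
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
model.fit(padded_sequences, labels, epochs=50, verbose=1, callbacks=[early_stop])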
Making Predictions
After compiling and training the model, it can finally make predictions, simply by using the predict() method. However, we need input data in order to test the model and receive output. To do so, we input some text statements and then ask the model to predict their sentiment. For example:
test_texts = ["I loved this movie", "This was terrible"]  # example inputs
test_sequences = [[word_index.get(word, 0) for word in text.lower().split()] for text in test_texts]
Here, test_texts stores some input data, while the test_sequences list stores the tokenized test data: each text is lowercased, split on whitespace, and each word is mapped to its index in word_index (words not seen during training map to 0). But test_sequences still can't act as input data for the model.
The reason is that many deep learning frameworks, including TensorFlow, usually require input data to have uniform dimensions (meaning every sequence should be the same length) so that batches of data can be processed efficiently. To achieve this, you can use techniques like padding, where sequences are extended to match the length of the longest sequence in the dataset using a special token such as # or 0 (0, in this example).
import numpy as np
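# A minimal sketch of the padding step described below, assuming max_length
# is the length of the longest training sequence
max_length = max(len(sequence) for sequence in sequences)

# Pad (or truncate) the training sequences to a uniform length
padded_sequences = np.array(
    [sequence[:max_length] + [0] * (max_length - len(sequence)) for sequence in sequences]
)

# Pad the test sequences the same way
padded_test_sequences = []
for sequence in test_sequences:
    padded_sequence = sequence[:max_length] + [0] * (max_length - len(sequence))
    padded_test_sequences.append(padded_sequence)
padded_test_sequences = np.array(padded_test_sequences)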
In the given code:
padded_test_sequences: An empty list to store the padded sequences that will be used to test the model.
for sequence in test_sequences: Loops through each sequence in the "test_sequences" list.
padded_sequence: Creates a new padded sequence for each sequence, truncating the original sequence to the first max_length elements to ensure consistency, then padding it with zeros if it's shorter, effectively making all sequences the same length.
padded_test_sequences.append(): Adds each padded sequence to the list that will be used for testing.
np.array(): Converts the lists of padded sequences (both training and test) into NumPy arrays.
Note: NumPy arrays are efficient data structures for numerical operations and are commonly used in Machine Learning and Data Science.
Now, since the input data is ready to use, the model can finally predict the sentiment of input texts.
predictions = model.predict(padded_test_sequences)
# Print predicted sentiments
for i, text in enumerate(test_texts):
print(f"Text: {text}, Predicted Sentiment: {np.argmax(predictions[i])}")
In the above code, the model.predict() method generates predictions for each test sequence, producing an array of predicted probabilities for each sentiment category. The loop then iterates through each element of test_texts, and np.argmax(predictions[i]) returns the index of the highest probability in the predicted probabilities array for the i-th test sample. This index corresponds to the sentiment category with the highest predicted probability, meaning the model's best prediction is extracted and shown as the main output.
Special Note: np.argmax() is a NumPy function that finds the index of the maximum value in an array. In this context, np.argmax(predictions[i]) determines the sentiment category with the highest predicted probability for each test sample.
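With the example inputs above, a hypothetical output might look like this (the actual predictions depend on the trained model):
Text: I loved this movie, Predicted Sentiment: 1
Text: This was terrible, Predicted Sentiment: 0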
The program is now ready to run. After compiling and training the model, it will print out its predictions for the input data.
In the model's output, you can see "accuracy" and "loss" values for each epoch. In Machine Learning, accuracy is the percentage of correct predictions out of total predictions; higher accuracy is better. An accuracy of 1.0, meaning 100%, indicates that the model made correct predictions in every instance. Similarly, 0.5 means the model was correct half the time, 0.25 a quarter of the time, and so on.
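For instance, if a model classifies 12 out of 17 samples correctly, its accuracy is 12/17 ≈ 0.7059, which is the sort of value you'll see in the evaluation below.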
Loss, on the other hand, shows how badly the model's predictions match the true values. A lower loss value means a better model with fewer errors, with 0 being the perfect loss value, meaning no errors were made at all.
However, you can't determine the overall accuracy and loss of the model from the per-epoch values alone. To do so, you can evaluate the model using the evaluate() method and print its accuracy and loss.
evaluation = model.evaluate(padded_sequences, labels, verbose=0)
# Extract loss and accuracy
loss = evaluation[0]
accuracy = evaluation[1]
# Print loss and accuracy
print("Loss:", loss)
print("Accuracy:", accuracy)
The above code may produce accuracy and loss values like these:
Loss: 0.6483516097068787
Accuracy: 0.7058823704719543
Accordingly, in this model, the loss value is about 0.6483, which means the model still makes some errors. The accuracy of the model is about 70%, which means its predictions are correct roughly seven times out of ten. Overall, this can be considered a "good" model; however, note that what counts as "good" loss and accuracy values depends heavily on the type of model, the size of the dataset, and the purpose of the particular Machine Learning model.
Summary
In this tutorial, we built a TensorFlow Machine Learning model that can predict the sentiment of a given text after analyzing the sample dataset.
The full code and sample CSV file can be found in the GitHub repository: Buzzpy/Tensorflow-ML-Model