{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "25a0cd07", "metadata": {}, "source": [ "# Text Sentiment Analysis of IMDB Movie Reviews using NLP (Word2Vec and RNN) for MYM Intern Assessment" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a6ca3e39", "metadata": {}, "source": [ "## Assessment Objectives: \n", "### - Preprocessing the data\n", "### - Converting text (words) to vectors using word2vec \n", "### - Feeding the word representations given by word2vec to an RNN and training the model\n", "### - Evaluating the model and plotting the performance graphs\n", "### - Improving the model by transfer learning\n", "### - Comparing the accuracy of the baseline model, the model and the improved model\n", "### - Testing the model (predicting the sentiment of a new review)\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "84781652", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gensim==4.2.0\n" ] } ], "source": [ "!pip freeze | grep gensim  # Check the installed version of Gensim (provides Word2Vec)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "12e2dbaa", "metadata": {}, "source": [ "## The Data" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7b25eb77", "metadata": {}, "source": [ "### We start with 20% of the IMDB reviews from TensorFlow Datasets so that the data fits in the RAM of the PC and the model trains faster. The data is split into X_train, y_train, X_test and y_test.\n", "### Then we preprocess the text to create input features for a natural language processing (NLP) model.\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "2079b965", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-06-05 06:40:17.849061: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import tensorflow_datasets as tfds\n", 
"from tensorflow.keras.preprocessing.text import text_to_word_sequence\n", "\n", "def load_data(percentage_of_sentences=None):\n", "    # Load the full IMDB train and test splits as (text, label) pairs\n", "    train_data, test_data = tfds.load(name=\"imdb_reviews\", split=[\"train\", \"test\"], batch_size=-1, as_supervised=True)\n", "\n", "    train_sentences, y_train = tfds.as_numpy(train_data)\n", "    test_sentences, y_test = tfds.as_numpy(test_data)\n", "\n", "    # Take only a given percentage of the entire data\n", "    if percentage_of_sentences is not None:\n", "        assert 0 < percentage_of_sentences <= 100\n", "\n", "        len_train = int(percentage_of_sentences/100*len(train_sentences))\n", "        train_sentences, y_train = train_sentences[:len_train], y_train[:len_train]\n", "\n", "        len_test = int(percentage_of_sentences/100*len(test_sentences))\n", "        test_sentences, y_test = test_sentences[:len_test], y_test[:len_test]\n", "\n", "    # Decode each review and split it into a list of lowercase words\n", "    X_train = [text_to_word_sequence(_.decode(\"utf-8\")) for _ in train_sentences]\n", "    X_test = [text_to_word_sequence(_.decode(\"utf-8\")) for _ in test_sentences]\n", "\n", "    return X_train, y_train, X_test, y_test\n", "\n", "X_train, y_train, X_test, y_test = load_data(percentage_of_sentences=20)" ] }, { "cell_type": "code", "execution_count": 4, "id": "4352850a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 1, 0, ..., 1, 0, 0])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test" ] }, { "cell_type": "code", "execution_count": null, "id": "2d707253", "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "id": "fdffccea", "metadata": {}, "source": [ "## First, train a word2vec model (with the arguments we want) on the training sentences and store it in the `word2vec` variable. 
" ] }, { "cell_type": "code", "execution_count": 5, "id": "f5c2e1b0", "metadata": {}, "outputs": [], "source": [ "from gensim.models import Word2Vec\n", "\n", "word2vec = Word2Vec(sentences=X_train, vector_size=60, min_count=10, window=10)\n", "word2vec.save(\"word2vec.model\")\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "81a82d8e", "metadata": {}, "source": [ "## Embedding the training and test sentences." ] }, { "cell_type": "code", "execution_count": 6, "id": "62a835d9", "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "import numpy as np\n", "\n", "# Function to convert a sentence (list of words) into a matrix representing the words in the embedding space\n", "def embed_sentence(word2vec, sentence):\n", "    embedded_sentence = []\n", "    for word in sentence:\n", "        # Words that are not in the word2vec vocabulary are skipped\n", "        if word in word2vec.wv:\n", "            embedded_sentence.append(word2vec.wv[word])\n", "\n", "    return np.array(embedded_sentence)\n", "\n", "# Function that converts a list of sentences into a list of matrices\n", "def embedding(word2vec, sentences):\n", "    embed = []\n", "\n", "    for sentence in sentences:\n", "        embedded_sentence = embed_sentence(word2vec, sentence)\n", "        embed.append(embedded_sentence)\n", "\n", "    return embed\n", "\n", "# Embed the training and test sentences\n", "X_train_embed = embedding(word2vec, X_train)\n", "X_test_embed = embedding(word2vec, X_test)\n", "\n", "\n", "# Pad the training and test embedded sentences\n", "X_train_pad = pad_sequences(X_train_embed, dtype='float32', padding='post', maxlen=200)\n", "X_test_pad = pad_sequences(X_test_embed, dtype='float32', padding='post', maxlen=200)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ea1e76af", "metadata": {}, "source": [ "### It's good practice to check the following for `X_train_pad` and `X_test_pad`:\n", "#### - they are numpy arrays\n", "#### - they are 3-dimensional\n", "#### - the last dimension is of the size of your 
word2vec embedding space (you can get it with `word2vec.wv.vector_size`)\n", "#### - the first dimension is of the size of your `X_train` and `X_test`" ] }, { "cell_type": "code", "execution_count": 7, "id": "d4770855", "metadata": {}, "outputs": [], "source": [ "for X in [X_train_pad, X_test_pad]:\n", "    assert isinstance(X, np.ndarray)\n", "    assert X.ndim == 3\n", "    assert X.shape[-1] == word2vec.wv.vector_size\n", "\n", "\n", "assert X_train_pad.shape[0] == len(X_train)\n", "assert X_test_pad.shape[0] == len(X_test)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6418b6d1", "metadata": {}, "source": [ "## Baseline Model" ] }, { "cell_type": "code", "execution_count": 8, "id": "3b477d35", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of labels in train set {0: 2474, 1: 2526}\n", "Baseline accuracy: 0.499\n" ] } ], "source": [ "# It is always good to have a very simple model to compare your own model against.\n", "# A baseline can be to always predict the label that is the most frequent in `y_train`.\n", "from sklearn.metrics import accuracy_score\n", "\n", "unique, counts = np.unique(y_train, return_counts=True)\n", "counts = dict(zip(unique, counts))\n", "print('Number of labels in train set', counts)\n", "\n", "y_pred = 0 if counts[0] > counts[1] else 1\n", "\n", "# Accuracy on the test set when always predicting the majority label\n", "baseline_acc = accuracy_score(y_test, [y_pred]*len(y_test))\n", "print('Baseline accuracy: ', baseline_acc)\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b7a80da2", "metadata": {}, "source": [ "## The Model" ] }, { "cell_type": "code", "execution_count": 9, "id": "523cd8a1", "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras import Sequential\n", "from tensorflow.keras import layers\n", "\n", "# An RNN model with Masking, LSTM and Dense layers.\n", "# The Masking layer makes the LSTM skip the all-zero padding steps.\n", "\n", "def init_model():\n", " model = Sequential()\n", " model.add(layers.Masking())\n", " model.add(layers.LSTM(20, activation='tanh'))\n", " 
model.add(layers.Dense(15, activation='relu'))\n", " model.add(layers.Dense(1, activation='sigmoid'))\n", "\n", " model.compile(loss='binary_crossentropy', #compiling the model with rmsprop optimizer\n", " optimizer='rmsprop',\n", " metrics=['accuracy'])\n", " \n", " return model\n", "\n", "model = init_model()" ] }, { "cell_type": "code", "execution_count": 10, "id": "5317ce64", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/100\n", "110/110 [==============================] - 9s 50ms/step - loss: 0.6751 - accuracy: 0.5749 - val_loss: 0.6679 - val_accuracy: 0.5980\n", "Epoch 2/100\n", "110/110 [==============================] - 4s 40ms/step - loss: 0.6271 - accuracy: 0.6486 - val_loss: 0.6284 - val_accuracy: 0.6433\n", "Epoch 3/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.5775 - accuracy: 0.6980 - val_loss: 0.5818 - val_accuracy: 0.7040\n", "Epoch 4/100\n", "110/110 [==============================] - 4s 37ms/step - loss: 0.5377 - accuracy: 0.7329 - val_loss: 0.6517 - val_accuracy: 0.6653\n", "Epoch 5/100\n", "110/110 [==============================] - 4s 35ms/step - loss: 0.5152 - accuracy: 0.7520 - val_loss: 0.5543 - val_accuracy: 0.7300\n", "Epoch 6/100\n", "110/110 [==============================] - 4s 35ms/step - loss: 0.4946 - accuracy: 0.7657 - val_loss: 0.5921 - val_accuracy: 0.7267\n", "Epoch 7/100\n", "110/110 [==============================] - 4s 35ms/step - loss: 0.4811 - accuracy: 0.7769 - val_loss: 0.5399 - val_accuracy: 0.7460\n", "Epoch 8/100\n", "110/110 [==============================] - 4s 35ms/step - loss: 0.4625 - accuracy: 0.7874 - val_loss: 0.5263 - val_accuracy: 0.7587\n", "Epoch 9/100\n", "110/110 [==============================] - 4s 35ms/step - loss: 0.4564 - accuracy: 0.7880 - val_loss: 0.5222 - val_accuracy: 0.7433\n", "Epoch 10/100\n", "110/110 [==============================] - 4s 35ms/step - loss: 0.4375 - accuracy: 0.8023 - val_loss: 0.5429 - val_accuracy: 
0.7353\n", "Epoch 11/100\n", "110/110 [==============================] - 4s 37ms/step - loss: 0.4271 - accuracy: 0.8109 - val_loss: 0.5867 - val_accuracy: 0.7433\n", "Epoch 12/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.4118 - accuracy: 0.8151 - val_loss: 0.5874 - val_accuracy: 0.7247\n", "Epoch 13/100\n", "110/110 [==============================] - 4s 39ms/step - loss: 0.3998 - accuracy: 0.8229 - val_loss: 0.5303 - val_accuracy: 0.7547\n", "Epoch 14/100\n", "110/110 [==============================] - 4s 40ms/step - loss: 0.4012 - accuracy: 0.8214 - val_loss: 0.5168 - val_accuracy: 0.7653\n", "Epoch 15/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.3857 - accuracy: 0.8297 - val_loss: 0.5377 - val_accuracy: 0.7500\n", "Epoch 16/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.3727 - accuracy: 0.8403 - val_loss: 0.5561 - val_accuracy: 0.7640\n", "Epoch 17/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.3616 - accuracy: 0.8414 - val_loss: 0.5625 - val_accuracy: 0.7460\n", "Epoch 18/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.3502 - accuracy: 0.8483 - val_loss: 0.6253 - val_accuracy: 0.7247\n", "Epoch 19/100\n", "110/110 [==============================] - 4s 34ms/step - loss: 0.3420 - accuracy: 0.8517 - val_loss: 0.5606 - val_accuracy: 0.7540\n" ] } ], "source": [ "# Fitting the model on the embedded and padded data with an early stopping criterion.\n", "\n", "from tensorflow.keras.callbacks import EarlyStopping\n", "\n", "# Stop when the validation loss has not improved for 5 epochs and restore the best weights\n", "es = EarlyStopping(patience=5, restore_best_weights=True)\n", "\n", "history = model.fit(X_train_pad, y_train,\n", "                    batch_size=32,\n", "                    epochs=100,\n", "                    validation_split=0.3,\n", "                    callbacks=[es]\n", "                   )\n", "# Note: this is the training accuracy of the last epoch, not the accuracy on the test set\n", "the_model_acc = history.history['accuracy'][-1]\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "26e4350b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": 
[ "The accuracy evaluated on the test set is of 76.820%\n" ] } ], "source": [ "# Evaluating the model on the test set.\n", "\n", "result = model.evaluate(X_test_pad, y_test, verbose=0)\n", "\n", "print(f'The accuracy evaluated on the test set is of {result[1]*100:.3f}%')" ] }, { "cell_type": "code", "execution_count": 12, "id": "8d663411", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# Plot the accuracy and loss curves\n", "plt.figure(figsize=(12, 6))\n", "plt.subplot(1, 2, 1)\n", "plt.plot(history.history['accuracy'], label='Training accuracy')\n", "plt.plot(history.history['val_accuracy'], label='Validation accuracy')\n", "plt.title('Accuracy')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Accuracy')\n", "plt.legend()\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.plot(history.history['loss'], label='Training loss')\n", "plt.plot(history.history['val_loss'], label='Validation loss')\n", "plt.title('Loss')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Loss')\n", "plt.legend()\n", "\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d9c43269", "metadata": {}, "source": [ "## Pre-trained Embeddings - Transfer Learning\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "92b6edab", "metadata": {}, "source": [ "### The accuracy of the model above, although better than the baseline, may still be quite low. Improving the quality of the embedding can improve the accuracy of the model." ] }, { "attachments": {}, "cell_type": "markdown", "id": "1e80388e", "metadata": {}, "source": [ "### Instead of training our own embedding on a larger corpus, let's benefit from an embedding that others have already learned. The quality of an embedding, i.e. how well the proximity of word vectors reflects the similarity of the words, does not depend on our task, so an embedding learned on a different task or corpus can be reused here. This is exactly what transfer learning is." ] }, { "attachments": {}, "cell_type": "markdown", "id": "97d4ebeb", "metadata": {}, "source": [ "### Listing all the pre-trained models available through the gensim downloader API." 
] }, { "cell_type": "code", "execution_count": 13, "id": "b07b77ad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']\n" ] } ], "source": [ "import gensim.downloader as api\n", "print(list(api.info()['models'].keys()))" ] }, { "cell_type": "code", "execution_count": 14, "id": "bd25af6a", "metadata": {}, "outputs": [], "source": [ "# Load one of the pre-trained embedding spaces.\n", "# Note: \"glove-wiki-gigaword-100\" holds 100-dimensional GloVe vectors (not word2vec vectors), returned as gensim KeyedVectors.\n", "\n", "word2vec_transfer = api.load(\"glove-wiki-gigaword-100\")" ] }, { "cell_type": "code", "execution_count": 15, "id": "062fa47f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "400000\n", "100\n" ] } ], "source": [ "# Vocabulary size, then the dimension of a single word vector\n", "print(len(word2vec_transfer.key_to_index))\n", "print(len(word2vec_transfer['dog']))" ] }, { "cell_type": "code", "execution_count": 16, "id": "bc0a78fc", "metadata": {}, "outputs": [], "source": [ "# Function to convert a sentence (list of words) into a matrix representing the words in the embedding space\n", "# Here `word2vec` is a KeyedVectors object, so words are looked up directly (no `.wv` attribute)\n", "def embed_sentence_with_TF(word2vec, sentence):\n", "    embedded_sentence = []\n", "    for word in sentence:\n", "        # Words that are not in the pre-trained vocabulary are skipped\n", "        if word in word2vec:\n", "            embedded_sentence.append(word2vec[word])\n", "\n", "    return np.array(embedded_sentence)\n", "\n", "# Function that converts a list of sentences into a list of matrices\n", "def embedding(word2vec, sentences):\n", "    embed = []\n", "\n", "    for sentence in sentences:\n", "        embedded_sentence = embed_sentence_with_TF(word2vec, sentence)\n", "        embed.append(embedded_sentence)\n", "\n", "    return embed\n", "\n", "# Embed the training and test sentences\n", "X_train_embed_2 = 
embedding(word2vec_transfer, X_train)\n", "X_test_embed_2 = embedding(word2vec_transfer, X_test)" ] }, { "cell_type": "code", "execution_count": 17, "id": "9270ed0e", "metadata": {}, "outputs": [], "source": [ "# Pad the training and test embedded sentences\n", "X_train_pad_2 = pad_sequences(X_train_embed_2, dtype='float32', padding='post', maxlen=200)\n", "X_test_pad_2 = pad_sequences(X_test_embed_2, dtype='float32', padding='post', maxlen=200)" ] }, { "cell_type": "code", "execution_count": 18, "id": "cdee2b55", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/30\n", "110/110 [==============================] - 9s 53ms/step - loss: 0.6714 - accuracy: 0.5849 - val_loss: 0.6512 - val_accuracy: 0.6227\n", "Epoch 2/30\n", "110/110 [==============================] - 5s 41ms/step - loss: 0.6156 - accuracy: 0.6651 - val_loss: 0.6046 - val_accuracy: 0.6807\n", "Epoch 3/30\n", "110/110 [==============================] - 5s 41ms/step - loss: 0.5599 - accuracy: 0.7229 - val_loss: 0.5968 - val_accuracy: 0.6927\n", "Epoch 4/30\n", "110/110 [==============================] - 4s 39ms/step - loss: 0.5353 - accuracy: 0.7369 - val_loss: 0.6922 - val_accuracy: 0.6367\n", "Epoch 5/30\n", "110/110 [==============================] - 4s 40ms/step - loss: 0.5256 - accuracy: 0.7431 - val_loss: 0.5610 - val_accuracy: 0.7127\n", "Epoch 6/30\n", "110/110 [==============================] - 4s 38ms/step - loss: 0.4920 - accuracy: 0.7686 - val_loss: 0.6102 - val_accuracy: 0.6900\n", "Epoch 7/30\n", "110/110 [==============================] - 4s 38ms/step - loss: 0.4743 - accuracy: 0.7831 - val_loss: 0.5528 - val_accuracy: 0.7380\n", "Epoch 8/30\n", "110/110 [==============================] - 4s 38ms/step - loss: 0.4548 - accuracy: 0.7960 - val_loss: 0.4919 - val_accuracy: 0.7720\n", "Epoch 9/30\n", "110/110 [==============================] - 4s 38ms/step - loss: 0.4294 - accuracy: 0.8089 - val_loss: 0.5748 - val_accuracy: 0.7373\n", "Epoch 10/30\n", 
"110/110 [==============================] - 4s 38ms/step - loss: 0.4180 - accuracy: 0.8217 - val_loss: 0.5599 - val_accuracy: 0.7340\n", "Epoch 11/30\n", "110/110 [==============================] - 4s 39ms/step - loss: 0.3970 - accuracy: 0.8303 - val_loss: 0.4763 - val_accuracy: 0.7853\n", "Epoch 12/30\n", "110/110 [==============================] - 4s 37ms/step - loss: 0.3878 - accuracy: 0.8349 - val_loss: 0.7456 - val_accuracy: 0.6800\n", "Epoch 13/30\n", "110/110 [==============================] - 4s 38ms/step - loss: 0.3667 - accuracy: 0.8449 - val_loss: 0.4732 - val_accuracy: 0.7947\n", "Epoch 14/30\n", "110/110 [==============================] - 4s 38ms/step - loss: 0.3508 - accuracy: 0.8469 - val_loss: 0.5956 - val_accuracy: 0.7493\n", "Epoch 15/30\n", "110/110 [==============================] - 4s 40ms/step - loss: 0.3378 - accuracy: 0.8563 - val_loss: 0.5706 - val_accuracy: 0.7573\n", "Epoch 16/30\n", "110/110 [==============================] - 5s 42ms/step - loss: 0.3188 - accuracy: 0.8671 - val_loss: 0.4941 - val_accuracy: 0.7760\n", "Epoch 17/30\n", "110/110 [==============================] - 5s 43ms/step - loss: 0.3078 - accuracy: 0.8774 - val_loss: 0.5214 - val_accuracy: 0.7733\n", "Epoch 18/30\n", "110/110 [==============================] - 5s 42ms/step - loss: 0.2959 - accuracy: 0.8763 - val_loss: 0.6381 - val_accuracy: 0.7533\n" ] } ], "source": [ "from tensorflow.keras.callbacks import EarlyStopping\n", "import tensorflow as tf\n", "\n", "es = EarlyStopping(patience=5, restore_best_weights=True)\n", "\n", "model = init_model()\n", "\n", "history = model.fit(X_train_pad_2, y_train, \n", " batch_size = 32,\n", " epochs=30,\n", " validation_split=0.3,\n", " callbacks=[es]\n", " )\n", "model.save('my_model.h5')\n", "\n", "improved_model_acc = history.history['accuracy'][-1]\n", "\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "d8297abb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The accuracy 
evaluated on the test set is of 80.120%\n" ] } ], "source": [ "result = model.evaluate(X_test_pad_2, y_test, verbose=0)\n", "\n", "print(f'The accuracy evaluated on the test set is of {result[1]*100:.3f}%')" ] }, { "cell_type": "code", "execution_count": 20, "id": "070d098d", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the accuracy and loss curves\n", "plt.figure(figsize=(12, 6))\n", "plt.subplot(1, 2, 1)\n", "plt.plot(history.history['accuracy'], label='Training accuracy')\n", "plt.plot(history.history['val_accuracy'], label='Validation accuracy')\n", "plt.title('Accuracy')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Accuracy')\n", "plt.legend()\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.plot(history.history['loss'], label='Training loss')\n", "plt.plot(history.history['val_loss'], label='Validation loss')\n", "plt.title('Loss')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Loss')\n", "plt.legend()\n", "\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "81d63cb7", "metadata": {}, "source": [ "### With transfer learning, the accuracy on the test set improves from 76.8% to 80.1%." ] }, { "attachments": {}, "cell_type": "markdown", "id": "379655e1", "metadata": {}, "source": [ "## Comparing the accuracy of the baseline model, the model and the improved model." ] }, { "cell_type": "code", "execution_count": 21, "id": "44b32cc6", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plotting the Graph\n", "plt.plot([baseline_acc, the_model_acc, improved_model_acc], marker='o')\n", "plt.xticks([0, 1, 2], ['Baseline Model', 'The Model', 'The Improved Model'])\n", "plt.ylabel('Accuracy')\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "043e1753", "metadata": {}, "source": [ "## Predicting the sentiment of a new review." ] }, { "cell_type": "code", "execution_count": 22, "id": "645e44d4", "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.models import load_model\n", "\n", "# Load the pre-trained embedding (GloVe vectors served through the gensim downloader)\n", "word2vec_transfer = api.load(\"glove-wiki-gigaword-100\")\n", "\n", "# Define the function to embed a sentence with the pre-trained embedding\n", "def embed_sentence_with_TF(word2vec, sentence):\n", "    embedded_sentence = []\n", "    for word in sentence:\n", "        if word in word2vec:\n", "            embedded_sentence.append(word2vec[word])\n", "    return np.array(embedded_sentence)\n", "\n", "# Define the function to preprocess a new movie review\n", "def preprocess_review(review):\n", "    # Tokenize the review\n", "    review = text_to_word_sequence(review)\n", "    # Embed the review with the pre-trained embedding\n", "    review_embedded = embed_sentence_with_TF(word2vec_transfer, review)\n", "    # Guard: if no word of the review is in the vocabulary (e.g. an empty review), padding would give a\n", "    # 2-D array that the LSTM cannot process, so return an all-zero (fully masked) sequence instead\n", "    if len(review_embedded) == 0:\n", "        return np.zeros((1, 200, word2vec_transfer.vector_size), dtype='float32')\n", "    # Pad the embedded review\n", "    review_padded = pad_sequences([review_embedded], dtype='float32', padding='post', maxlen=200)\n", "    return review_padded\n", "\n", "# Load the trained model; load_model restores the architecture, weights and compile settings,\n", "# so the network does not need to be rebuilt by hand first\n", "model = load_model('my_model.h5')\n", "def predict_sentiment(review):\n", " # 
Preprocess the review\n", " review_padded = preprocess_review(review)\n", " # Predict the sentiment\n", " sentiment = model.predict(review_padded)[0][0]\n", " return sentiment" ] }, { "cell_type": "code", "execution_count": 23, "id": "faf5685a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 1s 679ms/step\n", "Positive review\n" ] } ], "source": [ "review = input(\"Enter a review:\")\n", "sentiment = predict_sentiment(review)\n", "\n", "if sentiment > 0.5:\n", " print(\"Positive review\")\n", "elif sentiment == 0.5:\n", " print(\"Neutral review\")\n", "else:\n", " print(\"Negative review\")\n", " " ] }, { "cell_type": "code", "execution_count": 30, "id": "30d4a32e", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components\n", " warnings.warn(\n", "/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/inputs.py:30: UserWarning: `optional` parameter is deprecated, and it has no effect\n", " super().__init__(\n", "/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/inputs.py:30: UserWarning: `numeric` parameter is deprecated, and it has no effect\n", " super().__init__(\n", "/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/outputs.py:22: UserWarning: Usage of gradio.outputs is deprecated, and will not be supported in the future, please import your components from gradio.components\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Running on local URL: http://127.0.0.1:7861\n", "Running on public URL: https://ef9fee4af1830c79fa.gradio.live\n", "\n", "This share link 
expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 1s 743ms/step\n", "1/1 [==============================] - 0s 20ms/step\n", "1/1 [==============================] - 0s 23ms/step\n", "1/1 [==============================] - 0s 21ms/step\n", "1/1 [==============================] - 0s 23ms/step\n", "1/1 [==============================] - 0s 83ms/step\n", "1/1 [==============================] - 0s 18ms/step\n", "1/1 [==============================] - 0s 18ms/step\n", "1/1 [==============================] - 0s 21ms/step\n", "1/1 [==============================] - 0s 85ms/step\n", "1/1 [==============================] - 0s 106ms/step\n", "1/1 [==============================] - 0s 79ms/step\n", "1/1 [==============================] - 0s 22ms/step\n", "1/1 [==============================] - 0s 84ms/step\n", "1/1 [==============================] - 0s 25ms/step\n", "1/1 [==============================] - 0s 19ms/step\n", "1/1 [==============================] - 0s 20ms/step\n", "1/1 [==============================] - 0s 18ms/step\n", "1/1 [==============================] - 0s 23ms/step\n", "1/1 [==============================] - 0s 28ms/step\n", "1/1 [==============================] - 0s 17ms/step\n", "1/1 [==============================] - 0s 20ms/step\n", "1/1 [==============================] - 0s 22ms/step\n", "1/1 [==============================] - 0s 17ms/step\n", "1/1 [==============================] - 0s 15ms/step\n", "1/1 [==============================] - 0s 20ms/step\n", "1/1 [==============================] - 0s 21ms/step\n", "1/1 [==============================] - 0s 26ms/step\n", "1/1 [==============================] - 0s 42ms/step\n", "1/1 [==============================] - 0s 14ms/step\n", "1/1 [==============================] - 0s 
22ms/step\n", "1/1 [==============================] - 0s 19ms/step\n", "1/1 [==============================] - 0s 82ms/step\n", "1/1 [==============================] - 0s 32ms/step\n", "1/1 [==============================] - 0s 16ms/step\n", "1/1 [==============================] - 0s 19ms/step\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2023-06-05 17:41:09.620915: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at transpose_op.cc:142 : INVALID_ARGUMENT: transpose expects a vector of size 2. But input(1) is a vector of size 3\n", "2023-06-05 17:41:09.621087: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at transpose_op.cc:142 : INVALID_ARGUMENT: transpose expects a vector of size 2. But input(1) is a vector of size 3\n", "Traceback (most recent call last):\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/routes.py\", line 427, in run_predict\n", " output = await app.get_blocks().process_api(\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/blocks.py\", line 1323, in process_api\n", " result = await self.call_function(\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/blocks.py\", line 1051, in call_function\n", " prediction = await anyio.to_thread.run_sync(\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/anyio/to_thread.py\", line 31, in run_sync\n", " return await get_asynclib().run_sync_in_worker_thread(\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/anyio/_backends/_asyncio.py\", line 937, in run_sync_in_worker_thread\n", " return await future\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/anyio/_backends/_asyncio.py\", line 867, in run\n", " result = context.run(func, *args)\n", " File 
\"/var/folders/d2/90pdbpy11hx6fnbddlzkff340000gn/T/ipykernel_92516/1865062254.py\", line 35, in predict_sentiment\n", " sentiment = model.predict(review_padded)[0][0]\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/keras/utils/traceback_utils.py\", line 70, in error_handler\n", " raise e.with_traceback(filtered_tb) from None\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/tensorflow/python/eager/execute.py\", line 54, in quick_execute\n", " tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,\n", "tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:\n", "\n", "transpose expects a vector of size 2. But input(1) is a vector of size 3\n", "\t [[{{node transpose}}]]\n", "\t [[sequential_1/lstm_1/PartitionedCall]] [Op:__inference_predict_function_58960]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 0s 29ms/step\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2023-06-05 17:41:36.584429: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at transpose_op.cc:142 : INVALID_ARGUMENT: transpose expects a vector of size 2. But input(1) is a vector of size 3\n", "2023-06-05 17:41:36.584461: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at transpose_op.cc:142 : INVALID_ARGUMENT: transpose expects a vector of size 2. 
But input(1) is a vector of size 3\n", "Traceback (most recent call last):\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/routes.py\", line 427, in run_predict\n", " output = await app.get_blocks().process_api(\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/blocks.py\", line 1323, in process_api\n", " result = await self.call_function(\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/blocks.py\", line 1051, in call_function\n", " prediction = await anyio.to_thread.run_sync(\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/anyio/to_thread.py\", line 31, in run_sync\n", " return await get_asynclib().run_sync_in_worker_thread(\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/anyio/_backends/_asyncio.py\", line 937, in run_sync_in_worker_thread\n", " return await future\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/anyio/_backends/_asyncio.py\", line 867, in run\n", " result = context.run(func, *args)\n", " File \"/var/folders/d2/90pdbpy11hx6fnbddlzkff340000gn/T/ipykernel_92516/1865062254.py\", line 35, in predict_sentiment\n", " sentiment = model.predict(review_padded)[0][0]\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/keras/utils/traceback_utils.py\", line 70, in error_handler\n", " raise e.with_traceback(filtered_tb) from None\n", " File \"/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/tensorflow/python/eager/execute.py\", line 54, in quick_execute\n", " tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,\n", "tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:\n", "\n", "transpose expects a vector of size 2. 
But input(1) is a vector of size 3\n", "\t [[{{node transpose}}]]\n", "\t [[sequential_1/lstm_1/PartitionedCall]] [Op:__inference_predict_function_58960]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 0s 27ms/step\n", "1/1 [==============================] - 0s 21ms/step\n", "1/1 [==============================] - 0s 17ms/step\n", "1/1 [==============================] - 0s 17ms/step\n", "1/1 [==============================] - 0s 22ms/step\n", "1/1 [==============================] - 0s 23ms/step\n", "1/1 [==============================] - 0s 20ms/step\n" ] } ], "source": [ "import gradio as gr\n", "import numpy as np\n", "import gensim.downloader as api\n", "from tensorflow.keras.preprocessing.text import text_to_word_sequence\n", "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "from tensorflow.keras.models import load_model\n", "\n", "# Load the pre-trained 100-dimensional GloVe vectors through gensim's downloader\n", "word2vec_transfer = api.load(\"glove-wiki-gigaword-100\")\n", "\n", "# Embed a tokenized sentence with the pre-trained vectors, skipping unknown words\n", "def embed_sentence_with_TF(word2vec, sentence):\n", "    embedded_sentence = []\n", "    for word in sentence:\n", "        if word in word2vec:\n", "            embedded_sentence.append(word2vec[word])\n", "    # With no known word, np.array([]) is 1-D and the padded batch becomes 2-D,\n", "    # which appears to be what triggers the transpose error in the LSTM.\n", "    # Fall back to a single zero vector so the batch stays 3-D.\n", "    if not embedded_sentence:\n", "        return np.zeros((1, word2vec.vector_size), dtype='float32')\n", "    return np.array(embedded_sentence)\n", "\n", "# Preprocess a new movie review into a padded batch of shape (1, 200, 100)\n", "def preprocess_review(review):\n", "    # Tokenize the review\n", "    review = text_to_word_sequence(review)\n", "    # Embed the review with the pre-trained vectors\n", "    review_embedded = embed_sentence_with_TF(word2vec_transfer, review)\n", "    # Pad (or truncate) the embedded review to 200 timesteps\n", "    review_padded = pad_sequences([review_embedded], dtype='float32', padding='post', maxlen=200)\n", "    return review_padded\n", "\n", "# Load the trained model\n", "model = load_model('my_model.h5')\n", "\n", "def predict_sentiment(review):\n", "    # Preprocess the review\n", "    review_padded = preprocess_review(review)\n", "    # Predict the sentiment as a probability between 0 and 1\n", "    sentiment = model.predict(review_padded)[0][0]\n", "    if sentiment > 0.5:\n", "        return \"Positive\"\n", "    elif sentiment == 0.5:\n", "        return \"Neutral\"\n", "    else:\n", "        return \"Negative\"\n", "\n", "# Create a Gradio interface (gr.Textbox replaces the deprecated gr.inputs / gr.outputs namespaces)\n", "inputs = gr.Textbox(lines=5, label=\"Input Text\")\n", "outputs = gr.Textbox(label=\"Sentiment\")\n", "title = \"Sentiment Analysis\"\n", "description = \"Enter a text and get the sentiment prediction.\"\n", "gr.Interface(fn=predict_sentiment, inputs=inputs, outputs=outputs, title=title, description=description).launch(share=True)\n" ] }, { "cell_type": "code", "execution_count": 27, "id": "f45c4e95", "metadata": {}, "outputs": [], "source": [ "!git add Sentiment_Analysis_Gradio.ipynb" ] }, { "cell_type": "code", "execution_count": 28, "id": "bfaff667", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[main 8b4a38f] User Interface for custom trained model\n", " 1 file changed, 841 insertions(+)\n", " create mode 100644 Sentiment_Analysis_Gradio.ipynb\n" ] } ], "source": [ "!git commit -m 'User Interface for custom trained model'" ] }, { "cell_type": "code", "execution_count": 29, "id": "3e6d3355", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Enumerating objects: 4, done.\n", "Counting objects: 100% (4/4), done.\n", "Delta compression using up to 8 threads\n", "Compressing objects: 100% (3/3), done.\n", "Writing objects: 100% (3/3), 146.82 KiB | 1.44 MiB/s, done.\n", "Total 3 (delta 0), reused 0 (delta 0), pack-reused 0\n", "To github.com:MYM-Onboarding/Sentiment_Analyzer.git\n", " c6304a9..8b4a38f main -> main\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 0s 87ms/step\n" ] } ], "source": [ "!git push origin main" ] }, { "cell_type": "code", "execution_count": null, "id": "55498423", "metadata": {}, "outputs": [], "source": [] }
], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "toc": { "base_numbering": "1", "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }