{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "25a0cd07", "metadata": {}, "source": [ "# Text Sentiment Analysis of IMDB Movie Reviews using NLP (Word2Vec and RNN) for MYM Intern Assessment" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a6ca3e39", "metadata": {}, "source": [ "## Assessment Objectives:\n", "### - Preprocessing the data\n", "### - Converting text (words) to vectors using word2vec\n", "### - Feeding the word representations given by word2vec to an RNN and training the model\n", "### - Evaluating the model and plotting the performance graphs\n", "### - Improving the model by transfer learning\n", "### - Comparing the accuracy of the baseline model, the model and the improved model\n", "### - Testing the model (predicting the sentiment of a new review)\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "84781652", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gensim==4.2.0\n" ] } ], "source": [ "!pip freeze | grep gensim ##Checking the version of Gensim - Word2Vec" ] }, { "attachments": {}, "cell_type": "markdown", "id": "12e2dbaa", "metadata": {}, "source": [ "## The Data" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7b25eb77", "metadata": {}, "source": [ "### We start with 20% of the sentences from the IMDB reviews dataset in TensorFlow Datasets, which keeps memory usage within the PC's RAM and speeds up training, and split the data into X_train, y_train, X_test and y_test.\n", "### We then preprocess the text to create input features for a natural language processing (NLP) model.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "2079b965", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1mDownloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to ~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...\u001b[0m\n" ] }, { "data": { 
"application/vnd.jupyter.widget-view+json": { "model_id": "03ff9b6f48284e698019be61052b6fca", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dl Completed...: 0 url [00:00, ? url/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "702a447341f64056b5d9f1670a734acc", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dl Size...: 0 MiB [00:00, ? MiB/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e827a156b5f74ba585d28b3478514fa6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating splits...: 0%| | 0/3 [00:00 0 and percentage_of_sentences<=100)\n", " \n", " len_train = int(percentage_of_sentences/100*len(train_sentences))\n", " train_sentences, y_train = train_sentences[:len_train], y_train[:len_train]\n", " \n", " len_test = int(percentage_of_sentences/100*len(test_sentences))\n", " test_sentences, y_test = test_sentences[:len_test], y_test[:len_test]\n", " \n", " X_train = [text_to_word_sequence(_.decode(\"utf-8\")) for _ in train_sentences]\n", " X_test = [text_to_word_sequence(_.decode(\"utf-8\")) for _ in test_sentences]\n", " \n", " return X_train, y_train, X_test, y_test\n", "\n", "X_train, y_train, X_test, y_test = load_data(percentage_of_sentences=20)" ] }, { "cell_type": "code", "execution_count": 3, "id": "4352850a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 1, 0, ..., 1, 0, 0])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test" ] }, { "cell_type": "code", "execution_count": null, "id": "2d707253", "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "id": "fdffccea", "metadata": {}, "source": [ "## First, training a word2vec model (with the arguments that we want) on your training sentence. Store it into the `word2vec` variable. 
" ] }, { "cell_type": "code", "execution_count": 4, "id": "f5c2e1b0", "metadata": {}, "outputs": [], "source": [ "from gensim.models import Word2Vec\n", "\n", "word2vec = Word2Vec(sentences=X_train, vector_size=60, min_count=10, window=10)\n", "word2vec.save(\"word2vec.model\")\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "81a82d8e", "metadata": {}, "source": [ "## Embedding the training and test sentences." ] }, { "cell_type": "code", "execution_count": 5, "id": "62a835d9", "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "import numpy as np\n", "\n", "# Function to convert a sentence (list of words) into a matrix representing the words in the embedding space\n", "def embed_sentence(word2vec, sentence):\n", " embedded_sentence = []\n", " for word in sentence:\n", " if word in word2vec.wv:\n", " embedded_sentence.append(word2vec.wv[word])\n", " \n", " return np.array(embedded_sentence)\n", "\n", "# Function that converts a list of sentences into a list of matrices\n", "def embedding(word2vec, sentences):\n", " embed = []\n", " \n", " for sentence in sentences:\n", " embedded_sentence = embed_sentence(word2vec, sentence)\n", " embed.append(embedded_sentence)\n", " \n", " return embed\n", "\n", "# Embed the training and test sentences\n", "X_train_embed = embedding(word2vec, X_train)\n", "X_test_embed = embedding(word2vec, X_test)\n", "\n", "\n", "# Pad the training and test embedded sentences\n", "X_train_pad = pad_sequences(X_train_embed, dtype='float32', padding='post', maxlen=200)\n", "X_test_pad = pad_sequences(X_test_embed, dtype='float32', padding='post', maxlen=200)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ea1e76af", "metadata": {}, "source": [ "### It's a good practice to check check the following for `X_train_pad` and `X_test_pad`:\n", "#### - they are numpy arrays\n", "#### - they are 3-dimensional\n", "#### - the last dimension is of the size of your 
word2vec embedding space (you can get it with `word2vec.wv.vector_size`)\n", "#### - the first dimension is of the size of your `X_train` and `X_test`" ] }, { "cell_type": "code", "execution_count": 6, "id": "d4770855", "metadata": {}, "outputs": [], "source": [ "for X in [X_train_pad, X_test_pad]:\n", "    assert isinstance(X, np.ndarray)\n", "    assert X.ndim == 3\n", "    assert X.shape[-1] == word2vec.wv.vector_size\n", "\n", "\n", "assert X_train_pad.shape[0] == len(X_train)\n", "assert X_test_pad.shape[0] == len(X_test)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6418b6d1", "metadata": {}, "source": [ "## Baseline Model" ] }, { "cell_type": "code", "execution_count": 7, "id": "3b477d35", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of labels in train set {0: 2474, 1: 2526}\n", "Baseline accuracy: 0.499\n" ] } ], "source": [ "# It is always good to have a very simple model to test your own model against.\n", "# A simple baseline is to always predict the most frequent label in `y_train`.\n", "from sklearn.metrics import accuracy_score\n", "\n", "unique, counts = np.unique(y_train, return_counts=True)\n", "counts = dict(zip(unique, counts))\n", "print('Number of labels in train set', counts)\n", "\n", "y_pred = 0 if counts[0] > counts[1] else 1\n", "\n", "baseline_acc = accuracy_score(y_test, [y_pred]*len(y_test))\n", "\n", "print('Baseline accuracy: ', baseline_acc)\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b7a80da2", "metadata": {}, "source": [ "## The Model" ] }, { "cell_type": "code", "execution_count": 8, "id": "523cd8a1", "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras import Sequential\n", "from tensorflow.keras import layers\n", "\n", "# Build an RNN model with Masking, LSTM and Dense layers.\n", "\n", "def init_model():\n", "    model = Sequential()\n", "    model.add(layers.Masking())\n", "    model.add(layers.LSTM(20, activation='tanh'))\n", "    
model.add(layers.Dense(15, activation='relu'))\n", "    model.add(layers.Dense(1, activation='sigmoid'))\n", "\n", "    model.compile(loss='binary_crossentropy',  # compile the model with the rmsprop optimizer\n", "                  optimizer='rmsprop',\n", "                  metrics=['accuracy'])\n", "\n", "    return model\n", "\n", "model = init_model()" ] }, { "cell_type": "code", "execution_count": 9, "id": "5317ce64", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/100\n", "110/110 [==============================] - 8s 47ms/step - loss: 0.6795 - accuracy: 0.5709 - val_loss: 0.6679 - val_accuracy: 0.5707\n", "Epoch 2/100\n", "110/110 [==============================] - 4s 38ms/step - loss: 0.6325 - accuracy: 0.6534 - val_loss: 0.6465 - val_accuracy: 0.6007\n", "Epoch 3/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.5779 - accuracy: 0.7037 - val_loss: 0.5886 - val_accuracy: 0.6920\n", "Epoch 4/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.5413 - accuracy: 0.7349 - val_loss: 0.5846 - val_accuracy: 0.7033\n", "Epoch 5/100\n", "110/110 [==============================] - 5s 42ms/step - loss: 0.5257 - accuracy: 0.7486 - val_loss: 0.5924 - val_accuracy: 0.7167\n", "Epoch 6/100\n", "110/110 [==============================] - 5s 42ms/step - loss: 0.5123 - accuracy: 0.7580 - val_loss: 0.5703 - val_accuracy: 0.7167\n", "Epoch 7/100\n", "110/110 [==============================] - 4s 39ms/step - loss: 0.4767 - accuracy: 0.7826 - val_loss: 0.5412 - val_accuracy: 0.7360\n", "Epoch 8/100\n", "110/110 [==============================] - 4s 40ms/step - loss: 0.4607 - accuracy: 0.7880 - val_loss: 0.5545 - val_accuracy: 0.7427\n", "Epoch 9/100\n", "110/110 [==============================] - 4s 38ms/step - loss: 0.4473 - accuracy: 0.7960 - val_loss: 0.5799 - val_accuracy: 0.7233\n", "Epoch 10/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.4407 - accuracy: 0.8006 - val_loss: 0.5336 - val_accuracy: 
0.7493\n", "Epoch 11/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.4185 - accuracy: 0.8160 - val_loss: 0.5706 - val_accuracy: 0.7313\n", "Epoch 12/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.4051 - accuracy: 0.8214 - val_loss: 0.5340 - val_accuracy: 0.7527\n", "Epoch 13/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.3901 - accuracy: 0.8291 - val_loss: 0.6257 - val_accuracy: 0.7067\n", "Epoch 14/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.3861 - accuracy: 0.8286 - val_loss: 0.6041 - val_accuracy: 0.7413\n", "Epoch 15/100\n", "110/110 [==============================] - 4s 36ms/step - loss: 0.3764 - accuracy: 0.8420 - val_loss: 0.5992 - val_accuracy: 0.7227\n" ] } ], "source": [ "# Fitting the model on the embedded and padded data with an early stopping criterion.\n", "\n", "from tensorflow.keras.callbacks import EarlyStopping\n", "\n", "es = EarlyStopping(patience=5, restore_best_weights=True)\n", "\n", "history = model.fit(X_train_pad, y_train,\n", "                    batch_size=32,\n", "                    epochs=100,\n", "                    validation_split=0.3,\n", "                    callbacks=[es]\n", "                    )\n", "# Training accuracy of the final epoch, kept for the comparison plot further down\n", "the_model_acc = history.history['accuracy'][-1]\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "26e4350b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The accuracy evaluated on the test set is of 76.940%\n" ] } ], "source": [ "# Evaluating the model on the test set.\n", "\n", "result = model.evaluate(X_test_pad, y_test, verbose=0)\n", "\n", "print(f'The accuracy evaluated on the test set is of {result[1]*100:.3f}%')" ] }, { "cell_type": "code", "execution_count": 11, "id": "8d663411", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# Plot the accuracy and loss curves\n", "plt.figure(figsize=(12, 6))\n", "plt.subplot(1, 2, 1)\n", "plt.plot(history.history['accuracy'], label='Training accuracy')\n", "plt.plot(history.history['val_accuracy'], label='Validation accuracy')\n", "plt.title('Accuracy')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Accuracy')\n", "plt.legend()\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.plot(history.history['loss'], label='Training loss')\n", "plt.plot(history.history['val_loss'], label='Validation loss')\n", "plt.title('Loss')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Loss')\n", "plt.legend()\n", "\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d9c43269", "metadata": {}, "source": [ "## Trained Word2Vec - Transfer Learning\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "92b6edab", "metadata": {}, "source": [ "### The accuracy of the above the baseline model, might be quite low. By improving the quality of the embedding we can Improve accuracy of the model." ] }, { "attachments": {}, "cell_type": "markdown", "id": "1e80388e", "metadata": {}, "source": [ "### Let's improve the quality of our embedding, instead of just loading a larger corpus, let's benefit from the embedding that others have learned. Because, the quality of an embedding, i.e. the proximity of the words, can be derived from different tasks. This is exactly what transfer learning is." ] }, { "attachments": {}, "cell_type": "markdown", "id": "97d4ebeb", "metadata": {}, "source": [ "### Listing all the different models available in the word2vec using gensim api." 
] }, { "cell_type": "code", "execution_count": 12, "id": "b07b77ad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']\n" ] } ], "source": [ "import gensim.downloader as api\n", "print(list(api.info()['models'].keys()))" ] }, { "cell_type": "code", "execution_count": 13, "id": "bd25af6a", "metadata": {}, "outputs": [], "source": [ "# Let's load one of the pre-trained embedding spaces (GloVe, 100 dimensions).\n", "\n", "word2vec_transfer = api.load(\"glove-wiki-gigaword-100\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "062fa47f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "400000\n", "100\n" ] } ], "source": [ "print(len(word2vec_transfer.key_to_index))\n", "print(len(word2vec_transfer['dog']))" ] }, { "cell_type": "code", "execution_count": 15, "id": "bc0a78fc", "metadata": {}, "outputs": [], "source": [ "# Function to convert a sentence (list of words) into a matrix representing the words in the embedding space\n", "def embed_sentence_with_TF(word2vec, sentence):\n", "    embedded_sentence = []\n", "    for word in sentence:\n", "        if word in word2vec:\n", "            embedded_sentence.append(word2vec[word])\n", "\n", "    return np.array(embedded_sentence)\n", "\n", "# Function that converts a list of sentences into a list of matrices\n", "def embedding(word2vec, sentences):\n", "    embed = []\n", "\n", "    for sentence in sentences:\n", "        embedded_sentence = embed_sentence_with_TF(word2vec, sentence)\n", "        embed.append(embedded_sentence)\n", "\n", "    return embed\n", "\n", "# Embed the training and test sentences\n", "X_train_embed_2 = 
embedding(word2vec_transfer, X_train)\n", "X_test_embed_2 = embedding(word2vec_transfer, X_test)" ] }, { "cell_type": "code", "execution_count": 16, "id": "9270ed0e", "metadata": {}, "outputs": [], "source": [ "# Pad the training and test embedded sentences\n", "X_train_pad_2 = pad_sequences(X_train_embed_2, dtype='float32', padding='post', maxlen=200)\n", "X_test_pad_2 = pad_sequences(X_test_embed_2, dtype='float32', padding='post', maxlen=200)" ] }, { "cell_type": "code", "execution_count": 17, "id": "cdee2b55", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/30\n", "110/110 [==============================] - 10s 51ms/step - loss: 0.6670 - accuracy: 0.5871 - val_loss: 0.6609 - val_accuracy: 0.5913\n", "Epoch 2/30\n", "110/110 [==============================] - 5s 43ms/step - loss: 0.6081 - accuracy: 0.6703 - val_loss: 0.5912 - val_accuracy: 0.6993\n", "Epoch 3/30\n", "110/110 [==============================] - 4s 41ms/step - loss: 0.5683 - accuracy: 0.7154 - val_loss: 0.6572 - val_accuracy: 0.6453\n", "Epoch 4/30\n", "110/110 [==============================] - 4s 39ms/step - loss: 0.5293 - accuracy: 0.7437 - val_loss: 0.6096 - val_accuracy: 0.6787\n", "Epoch 5/30\n", "110/110 [==============================] - 5s 43ms/step - loss: 0.5061 - accuracy: 0.7611 - val_loss: 0.5034 - val_accuracy: 0.7593\n", "Epoch 6/30\n", "110/110 [==============================] - 4s 39ms/step - loss: 0.4810 - accuracy: 0.7749 - val_loss: 0.5051 - val_accuracy: 0.7607\n", "Epoch 7/30\n", "110/110 [==============================] - 5s 43ms/step - loss: 0.4605 - accuracy: 0.7886 - val_loss: 0.5320 - val_accuracy: 0.7527\n", "Epoch 8/30\n", "110/110 [==============================] - 4s 40ms/step - loss: 0.4320 - accuracy: 0.8046 - val_loss: 0.4712 - val_accuracy: 0.7853\n", "Epoch 9/30\n", "110/110 [==============================] - 5s 42ms/step - loss: 0.4204 - accuracy: 0.8040 - val_loss: 0.5763 - val_accuracy: 0.7193\n", "Epoch 10/30\n", 
"110/110 [==============================] - 5s 42ms/step - loss: 0.4013 - accuracy: 0.8180 - val_loss: 0.5636 - val_accuracy: 0.7553\n", "Epoch 11/30\n", "110/110 [==============================] - 5s 49ms/step - loss: 0.3897 - accuracy: 0.8286 - val_loss: 0.5568 - val_accuracy: 0.7420\n", "Epoch 12/30\n", "110/110 [==============================] - 6s 55ms/step - loss: 0.3762 - accuracy: 0.8357 - val_loss: 0.5653 - val_accuracy: 0.7493\n", "Epoch 13/30\n", "110/110 [==============================] - 5s 44ms/step - loss: 0.3592 - accuracy: 0.8486 - val_loss: 0.4731 - val_accuracy: 0.7860\n" ] } ], "source": [ "from tensorflow.keras.callbacks import EarlyStopping\n", "import tensorflow as tf\n", "\n", "es = EarlyStopping(patience=5, restore_best_weights=True)\n", "\n", "model = init_model()\n", "\n", "history = model.fit(X_train_pad_2, y_train, \n", " batch_size = 32,\n", " epochs=30,\n", " validation_split=0.3,\n", " callbacks=[es]\n", " )\n", "model.save('my_model.h5')\n", "\n", "improved_model_acc = history.history['accuracy'][-1]\n", "\n" ] }, { "cell_type": "code", "execution_count": 18, "id": "d8297abb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The accuracy evaluated on the test set is of 78.920%\n" ] } ], "source": [ "result = model.evaluate(X_test_pad_2, y_test, verbose=0)\n", "\n", "print(f'The accuracy evaluated on the test set is of {result[1]*100:.3f}%')" ] }, { "cell_type": "code", "execution_count": 19, "id": "070d098d", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the accuracy and loss curves\n", "plt.figure(figsize=(12, 6))\n", "plt.subplot(1, 2, 1)\n", "plt.plot(history.history['accuracy'], label='Training accuracy')\n", "plt.plot(history.history['val_accuracy'], label='Validation accuracy')\n", "plt.title('Accuracy')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Accuracy')\n", "plt.legend()\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.plot(history.history['loss'], label='Training loss')\n", "plt.plot(history.history['val_loss'], label='Validation loss')\n", "plt.title('Loss')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Loss')\n", "plt.legend()\n", "\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "81d63cb7", "metadata": {}, "source": [ "### There is a significant improvement in the accuracy after Transfer learning." ] }, { "attachments": {}, "cell_type": "markdown", "id": "379655e1", "metadata": {}, "source": [ "## Comparing Accuracy of Baseline model, The model and Improved model." ] }, { "cell_type": "code", "execution_count": 20, "id": "44b32cc6", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plotting the Graph\n", "plt.plot([baseline_acc, the_model_acc, improved_model_acc], marker='o')\n", "plt.xticks([0, 1, 2], ['Baseline Model', 'The Model', 'The Improved Model'])\n", "plt.ylabel('Accuracy')\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "043e1753", "metadata": {}, "source": [ "## Predicting the model for new review." ] }, { "cell_type": "code", "execution_count": 21, "id": "645e44d4", "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras import Sequential\n", "from tensorflow.keras.layers import Masking, LSTM, Dense\n", "from tensorflow.keras.models import load_model\n", "\n", "# Load the pre-trained Word2Vec model\n", "word2vec_transfer = api.load(\"glove-wiki-gigaword-100\")\n", "\n", "# Define the function to embed a sentence with the pre-trained Word2Vec model\n", "def embed_sentence_with_TF(word2vec, sentence):\n", " embedded_sentence = []\n", " for word in sentence:\n", " if word in word2vec:\n", " embedded_sentence.append(word2vec[word])\n", " return np.array(embedded_sentence)\n", "\n", "# Define the function to preprocess a new movie review\n", "def preprocess_review(review):\n", " # Tokenize the review\n", " review = text_to_word_sequence(review)\n", " # Embed the review with the pre-trained Word2Vec model\n", " review_embedded = embed_sentence_with_TF(word2vec_transfer, review)\n", " # Pad the embedded review\n", " review_padded = pad_sequences([review_embedded], dtype='float32', padding='post', maxlen=200)\n", " return review_padded\n", "\n", "# Load the trained model\n", "model = Sequential()\n", "model.add(Masking())\n", "model.add(LSTM(20, activation='tanh'))\n", "model.add(Dense(15, activation='relu'))\n", "model.add(Dense(1, activation='sigmoid'))\n", "model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])\n", "model = load_model('my_model.h5')\n", "def predict_sentiment(review):\n", " # 
Preprocess the review\n", "    review_padded = preprocess_review(review)\n", "    # Predict the sentiment\n", "    sentiment = model.predict(review_padded)[0][0]\n", "    return sentiment" ] }, { "cell_type": "code", "execution_count": 22, "id": "faf5685a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1/1 [==============================] - 1s 721ms/step\n", "Positive review\n" ] } ], "source": [ "review = input(\"Enter a review:\")\n", "sentiment = predict_sentiment(review)\n", "\n", "if sentiment > 0.5:\n", "    print(\"Positive review\")\n", "elif sentiment == 0.5:\n", "    print(\"Neutral review\")\n", "else:\n", "    print(\"Negative review\")\n" ] }, { "cell_type": "code", "execution_count": 23, "id": "d5f2bc35", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: gradio in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (3.33.1)\n", "Requirement already satisfied: aiofiles in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (23.1.0)\n", "Requirement already satisfied: aiohttp in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (3.8.4)\n", "Requirement already satisfied: altair>=4.2.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (4.2.0)\n", "Requirement already satisfied: fastapi in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (0.96.0)\n", "Requirement already satisfied: ffmpy in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (0.3.0)\n", "Requirement already satisfied: gradio-client>=0.2.4 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (0.2.5)\n", "Requirement already satisfied: httpx in 
/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (0.23.0)\n", "Requirement already satisfied: huggingface-hub>=0.14.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (0.15.1)\n", "Requirement already satisfied: jinja2 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (3.1.2)\n", "Requirement already satisfied: markdown-it-py[linkify]>=2.0.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (2.2.0)\n", "Requirement already satisfied: markupsafe in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (2.1.1)\n", "Requirement already satisfied: matplotlib in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (3.5.3)\n", "Requirement already satisfied: mdit-py-plugins<=0.3.3 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (0.3.3)\n", "Requirement already satisfied: numpy in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (1.23.4)\n", "Requirement already satisfied: orjson in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (3.9.0)\n", "Requirement already satisfied: pandas in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (1.4.4)\n", "Requirement already satisfied: pillow in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (9.1.1)\n", "Requirement already satisfied: pydantic in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (1.9.2)\n", "Requirement already satisfied: pydub in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from 
gradio) (0.25.1)\n", "Requirement already satisfied: pygments>=2.12.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (2.13.0)\n", "Requirement already satisfied: python-multipart in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (0.0.6)\n", "Requirement already satisfied: pyyaml in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (5.4.1)\n", "Requirement already satisfied: requests in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (2.28.1)\n", "Requirement already satisfied: semantic-version in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (2.10.0)\n", "Requirement already satisfied: typing-extensions in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (4.4.0)\n", "Requirement already satisfied: uvicorn>=0.14.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (0.22.0)\n", "Requirement already satisfied: websockets>=10.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio) (11.0.3)\n", "Requirement already satisfied: entrypoints in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from altair>=4.2.0->gradio) (0.4)\n", "Requirement already satisfied: jsonschema>=3.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from altair>=4.2.0->gradio) (4.16.0)\n", "Requirement already satisfied: toolz in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from altair>=4.2.0->gradio) (0.12.0)\n", "Requirement already satisfied: fsspec in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio-client>=0.2.4->gradio) 
(2022.10.0)\n", "Requirement already satisfied: packaging in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from gradio-client>=0.2.4->gradio) (21.3)\n", "Requirement already satisfied: filelock in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from huggingface-hub>=0.14.0->gradio) (3.12.0)\n", "Requirement already satisfied: tqdm>=4.42.1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from huggingface-hub>=0.14.0->gradio) (4.64.1)\n", "Requirement already satisfied: mdurl~=0.1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from markdown-it-py[linkify]>=2.0.0->gradio) (0.1.2)\n", "Requirement already satisfied: linkify-it-py<3,>=1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from markdown-it-py[linkify]>=2.0.0->gradio) (2.0.2)\n", "Requirement already satisfied: python-dateutil>=2.8.1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from pandas->gradio) (2.8.2)\n", "Requirement already satisfied: pytz>=2020.1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from pandas->gradio) (2022.1)\n", "Requirement already satisfied: click>=7.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from uvicorn>=0.14.0->gradio) (8.1.3)\n", "Requirement already satisfied: h11>=0.8 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from uvicorn>=0.14.0->gradio) (0.12.0)\n", "Requirement already satisfied: attrs>=17.3.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from aiohttp->gradio) (22.1.0)\n", "Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from aiohttp->gradio) (2.1.1)\n", 
"Requirement already satisfied: multidict<7.0,>=4.5 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from aiohttp->gradio) (6.0.4)\n", "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from aiohttp->gradio) (4.0.2)\n", "Requirement already satisfied: yarl<2.0,>=1.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from aiohttp->gradio) (1.9.2)\n", "Requirement already satisfied: frozenlist>=1.1.1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from aiohttp->gradio) (1.3.3)\n", "Requirement already satisfied: aiosignal>=1.1.2 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from aiohttp->gradio) (1.3.1)\n", "Requirement already satisfied: starlette<0.28.0,>=0.27.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from fastapi->gradio) (0.27.0)\n", "Requirement already satisfied: certifi in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from httpx->gradio) (2022.12.7)\n", "Requirement already satisfied: sniffio in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from httpx->gradio) (1.3.0)\n", "Requirement already satisfied: rfc3986[idna2008]<2,>=1.3 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from httpx->gradio) (1.5.0)\n", "Requirement already satisfied: httpcore<0.16.0,>=0.15.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from httpx->gradio) (0.15.0)\n", "Requirement already satisfied: cycler>=0.10 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from matplotlib->gradio) (0.10.0)\n", "Requirement already satisfied: fonttools>=4.22.0 in 
/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from matplotlib->gradio) (4.38.0)\n", "Requirement already satisfied: kiwisolver>=1.0.1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from matplotlib->gradio) (1.4.4)\n", "Requirement already satisfied: pyparsing>=2.2.1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from matplotlib->gradio) (2.4.7)\n", "Requirement already satisfied: idna<4,>=2.5 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from requests->gradio) (2.10)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from requests->gradio) (1.26.12)\n", "Requirement already satisfied: six in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from cycler>=0.10->matplotlib->gradio) (1.16.0)\n", "Requirement already satisfied: anyio==3.* in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from httpcore<0.16.0,>=0.15.0->httpx->gradio) (3.6.2)\n", "Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from jsonschema>=3.0->altair>=4.2.0->gradio) (0.18.1)\n", "Requirement already satisfied: uc-micro-py in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from linkify-it-py<3,>=1->markdown-it-py[linkify]>=2.0.0->gradio) (1.0.2)\n" ] } ], "source": [ "!pip install gradio\n" ] }, { "cell_type": "code", "execution_count": 24, "id": "7daa2754", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: huggingface_hub in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (0.15.1)\n", "Requirement already 
satisfied: filelock in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from huggingface_hub) (3.12.0)\n", "Requirement already satisfied: fsspec in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from huggingface_hub) (2022.10.0)\n", "Requirement already satisfied: requests in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from huggingface_hub) (2.28.1)\n", "Requirement already satisfied: tqdm>=4.42.1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from huggingface_hub) (4.64.1)\n", "Requirement already satisfied: pyyaml>=5.1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from huggingface_hub) (5.4.1)\n", "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from huggingface_hub) (4.4.0)\n", "Requirement already satisfied: packaging>=20.9 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from huggingface_hub) (21.3)\n", "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from packaging>=20.9->huggingface_hub) (2.4.7)\n", "Requirement already satisfied: charset-normalizer<3,>=2 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from requests->huggingface_hub) (2.1.1)\n", "Requirement already satisfied: idna<4,>=2.5 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from requests->huggingface_hub) (2.10)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from requests->huggingface_hub) (1.26.12)\n", "Requirement already satisfied: certifi>=2017.4.17 in 
/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages (from requests->huggingface_hub) (2022.12.7)\n" ] } ], "source": [ "!pip install --upgrade huggingface_hub\n" ] }, { "cell_type": "code", "execution_count": 25, "id": "30d4a32e", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components\n", " warnings.warn(\n", "/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/inputs.py:30: UserWarning: `optional` parameter is deprecated, and it has no effect\n", " super().__init__(\n", "/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/inputs.py:30: UserWarning: `numeric` parameter is deprecated, and it has no effect\n", " super().__init__(\n", "/Users/pavankumarhm/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/gradio/outputs.py:22: UserWarning: Usage of gradio.outputs is deprecated, and will not be supported in the future, please import your components from gradio.components\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Running on local URL: http://127.0.0.1:7862\n", "\n", "To create a public link, set `share=True` in `launch()`.\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import gradio as gr\n", "import numpy as np\n", "from tensorflow.keras.preprocessing.text import text_to_word_sequence\n", "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "from gensim.models import KeyedVectors\n", "from tensorflow.keras.models import load_model\n", "# Load the pre-trained Word2Vec model\n", "word2vec_transfer = api.load(\"glove-wiki-gigaword-100\")\n", "\n", "# Define the function to embed a sentence with the pre-trained Word2Vec model\n", "def embed_sentence_with_TF(word2vec, sentence):\n", " embedded_sentence = []\n", " for word in sentence:\n", " if word in word2vec:\n", " embedded_sentence.append(word2vec[word])\n", " return np.array(embedded_sentence)\n", "\n", "# Define the function to preprocess a new movie review\n", "def preprocess_review(review):\n", " # Tokenize the review\n", " review = text_to_word_sequence(review)\n", " # Embed the review with the pre-trained Word2Vec model\n", " review_embedded = embed_sentence_with_TF(word2vec_transfer, review)\n", " # Pad the embedded review\n", " review_padded = pad_sequences([review_embedded], dtype='float32', padding='post', maxlen=200)\n", " return review_padded\n", "\n", "# Load the trained model\n", "model = load_model('my_model.h5')\n", "\n", "def predict_sentiment(review):\n", " # Preprocess the review\n", " review_padded = preprocess_review(review)\n", " # Predict the sentiment\n", " sentiment = model.predict(review_padded)[0][0]\n", " if sentiment > 0.5:\n", " return \"Positive\"\n", " elif sentiment == 0.5:\n", " return \"Neutral\"\n", " else:\n", " return \"Negative\"\n", "\n", "# Create a Gradio interface\n", "inputs = gr.inputs.Textbox(lines=5, label=\"Input Text\")\n", "outputs = gr.outputs.Textbox(label=\"Sentiment\")\n", "title = \"Sentiment Analysis\"\n", 
"description = \"Enter a text and get the sentiment prediction.\"\n", "gr.Interface(fn=predict_sentiment, inputs=inputs, outputs=outputs, title=title, description=description).launch()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bfaff667", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "toc": { "base_numbering": "1", "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }