Simple Sentiment Analysis NN with GloVe Embeddings

This is a PyTorch-based neural network model for binary sentiment classification (positive/negative) on the IMDB dataset.

Model Description

The model is a lightweight feed-forward neural network that uses pre-trained GloVe embeddings as token representations. It average-pools the embedded tokens of a sequence into a single sentence-level vector, which then passes through two linear layers to produce a sentiment probability.

Architecture

Tokenization: Custom whitespace and punctuation tokenizer. Sequences are truncated to 150 tokens, or padded at the end to reach that length.
Embedding Layer: Pre-trained weights loaded from a custom tiny_glove.json dictionary. The embedding layer is frozen during training.
Pooling: Average pooling across the sequence dimension [batch_size, max_len, embed_dim] -> [batch_size, embed_dim].
Fully Connected Network:
- Linear(embed_dim, 64) + ReLU()
- Linear(64, 1) + Sigmoid()
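
The frozen embedding layer and the pooling step could be sketched as follows. This is a minimal sketch, not the author's training code: it assumes tiny_glove.json maps each token to a list of floats, and the helper name `build_embedding` and the toy vectors are hypothetical. Leaving tokens without a GloVe vector (including <PAD> and <UNK>) as zero rows is also an assumption.

```python
import torch
import torch.nn as nn

def build_embedding(vocab, glove, embed_dim):
    """Build a frozen nn.Embedding whose rows follow the vocab indices."""
    weights = torch.zeros(len(vocab), embed_dim)
    for token, idx in vocab.items():
        vector = glove.get(token)
        if vector is not None:
            weights[idx] = torch.tensor(vector, dtype=torch.float32)
    # freeze=True sets requires_grad=False, so training never updates the rows
    return nn.Embedding.from_pretrained(weights, freeze=True)

# Toy stand-ins for vocab.pkl and tiny_glove.json (hypothetical values)
vocab = {"<PAD>": 0, "<UNK>": 1, "good": 2, "bad": 3}
glove = {"good": [0.1, 0.2, 0.3], "bad": [-0.1, -0.2, -0.3]}

embedding = build_embedding(vocab, glove, embed_dim=3)
batch = torch.tensor([[2, 3, 0, 0]])    # [batch_size, max_len]
embedded = embedding(batch)             # [batch_size, max_len, embed_dim]
pooled = embedded.mean(dim=1)           # [batch_size, embed_dim]
print(embedded.shape, pooled.shape)
```

Note that a plain mean over the sequence dimension also averages over the padding positions.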

Training Data

The model was trained on a balanced subset of the IMDB movie reviews dataset containing 10,000 samples (imdb_balanced_10k.csv).

Train Split: 8,000 samples (80%)
Test Split: 2,000 samples (20%)
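
An 80/20 split of this kind could be produced along the following lines. The card does not say how the split was made, so the stratified shuffle, the fixed seed, and the helper name `stratified_split` are all assumptions, and the rows are synthetic stand-ins for the CSV contents.

```python
import random

def stratified_split(rows, test_ratio=0.2, seed=42):
    """Split (text, label) rows so each label keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    train, test = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_test = round(len(label_rows) * test_ratio)
        test.extend(label_rows[:n_test])
        train.extend(label_rows[n_test:])
    return train, test

# Synthetic stand-in for a balanced 10,000-row dataset
rows = [(f"review {i}", "positive" if i % 2 == 0 else "negative") for i in range(10_000)]
train_rows, test_rows = stratified_split(rows)
print(len(train_rows), len(test_rows))  # 8000 2000
```

Splitting per label keeps both splits balanced, which matters when accuracy is the only reported metric.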

Evaluation Results

Test Accuracy: 0.6870 (68.7%)
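
On 2,000 test samples, an accuracy of 0.6870 would correspond to 1,374 correct predictions. A minimal sketch of how such a figure can be computed from sigmoid outputs is shown below. The function name `accuracy` and the sample values are hypothetical, and the 0.5 threshold is taken from the inference code in this card.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical model outputs and ground-truth labels
probs = [0.91, 0.35, 0.62, 0.48, 0.07]
labels = [1, 0, 0, 1, 0]
print(accuracy(probs, labels))  # 3 of 5 correct
```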

Training Parameters

Loss Function: Binary Cross Entropy Loss (BCELoss)
Optimizer: Adam
Learning Rate: 0.001
Batch Size: 64
Epochs: 10
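
These hyperparameters could be wired together roughly as follows. This is a sketch under stated assumptions rather than the original training script: the data is random, the embedding is randomly initialised instead of loaded from GloVe, and `TinySentimentNN` is a hypothetical stand-in for the real model.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data: 256 sequences of 150 token ids with binary labels
VOCAB_SIZE, EMBED_DIM, MAX_LEN = 100, 16, 150
inputs = torch.randint(0, VOCAB_SIZE, (256, MAX_LEN))
targets = torch.randint(0, 2, (256, 1)).float()
loader = DataLoader(TensorDataset(inputs, targets), batch_size=64, shuffle=True)

class TinySentimentNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.embedding.weight.requires_grad = False  # frozen, as in the card
        self.fc1 = nn.Linear(EMBED_DIM, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, x):
        pooled = self.embedding(x).mean(dim=1)
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(pooled))))

model = TinySentimentNN()
criterion = nn.BCELoss()
# Only the trainable (non-frozen) parameters are handed to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=0.001)

for epoch in range(10):
    model.train()
    epoch_loss = 0.0
    for batch_inputs, batch_targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_inputs), batch_targets)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * batch_inputs.size(0)
    print(f"epoch {epoch + 1}: mean loss {epoch_loss / len(inputs):.4f}")
```

BCELoss expects probabilities in [0, 1] and float targets of the same shape as the model output, which is why the labels are stored as a [N, 1] float tensor.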

Artifacts included

sentiment_nn.pth: The PyTorch state_dict of the trained model.
vocab.pkl: A serialized dictionary mapping string tokens to integer indices (includes <PAD> and <UNK> tokens).
label_encoder.pkl: Scikit-learn LabelEncoder used to encode string labels to binary classes.

How to use

import re
import string

import joblib
import torch
import torch.nn as nn

# Load the vocabulary and the label encoder
vocab = joblib.load("vocab.pkl")
label_encoder = joblib.load("label_encoder.pkl")
embed_dim = 300  # Must match the dimension of the GloVe vectors used in training

# Recreate the PyTorch model
class SentimentNN(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(SentimentNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc1 = nn.Linear(embed_dim, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 1)
        self.sigmoid = nn.Sigmoid()

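    def forward(self, x):
        embeds = self.embedding(x)   # [batch_size, max_len, embed_dim]
        out = embeds.mean(dim=1)     # average over tokens -> [batch_size, embed_dim]
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)      # probability of class 1, [batch_size, 1]
        return out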

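model = SentimentNN(len(vocab), embed_dim)
# map_location="cpu" lets the checkpoint load on machines without a GPU.
# Note: strict=False silently skips missing or unexpected keys; remove it to
# verify that every weight in the checkpoint is actually loaded.
model.load_state_dict(torch.load("sentiment_nn.pth", map_location="cpu"), strict=False)
model.eval()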

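# Inference
MAX_LEN = 150
pad_idx = vocab.get("<PAD>", 0)
unk_idx = vocab.get("<UNK>", 1)

# Lowercase, replace punctuation with spaces, then split on whitespace
text = "This movie is amazing!"
text = str(text).lower()
text = re.sub(r'[' + string.punctuation + ']', ' ', text)
tokens = text.split()

# Map tokens to ids, truncate to MAX_LEN, then pad on the right
indices = [vocab.get(t, unk_idx) for t in tokens[:MAX_LEN]]
indices += [pad_idx] * (MAX_LEN - len(indices))

input_tensor = torch.tensor([indices], dtype=torch.long)

with torch.no_grad():
    prediction = model(input_tensor).item()  # sigmoid output: probability of class 1

# Get the string label and the probability of the predicted class
predicted_class = 1 if prediction > 0.5 else 0
confidence = prediction if predicted_class == 1 else 1 - prediction
print("Sentiment:", label_encoder.inverse_transform([predicted_class])[0])
print(f"Confidence: {confidence:.4f}")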
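
The preprocessing steps above (lowercasing, punctuation removal, truncation and padding to 150 tokens) could be wrapped in a reusable helper for scoring several texts at once. This is only a sketch: the function name `encode` and the toy vocabulary are hypothetical, and the fallback indices 0 and 1 for <PAD> and <UNK> are assumed from the hard-coded values in the snippet.

```python
import re
import string

MAX_LEN = 150
# Same pattern as the inference snippet: punctuation characters become spaces
PUNCT_PATTERN = re.compile(r'[' + string.punctuation + ']')

def encode(text, vocab, max_len=MAX_LEN):
    """Lowercase, strip punctuation, map tokens to ids, then truncate or pad."""
    pad_idx = vocab.get("<PAD>", 0)
    unk_idx = vocab.get("<UNK>", 1)
    tokens = PUNCT_PATTERN.sub(" ", str(text).lower()).split()
    indices = [vocab.get(token, unk_idx) for token in tokens[:max_len]]
    return indices + [pad_idx] * (max_len - len(indices))

# Toy vocabulary standing in for vocab.pkl (hypothetical values)
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "this": 2, "movie": 3, "is": 4, "amazing": 5}

batch = [encode(t, toy_vocab) for t in ["This movie is amazing!", "Dreadful movie..."]]
print(batch[0][:6])  # [2, 3, 4, 5, 0, 0]
print(batch[1][:6])  # [1, 3, 0, 0, 0, 0]
```

The resulting list of equal-length lists can be passed to `torch.tensor(batch, dtype=torch.long)` to score all texts in a single forward pass.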

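Simple Sentiment Analysis Neural Network Based on GloVe Word Embeddings

This is a PyTorch-based neural network model for binary sentiment classification (positive/negative), trained on the IMDB dataset.

Model Description

The model is designed as a lightweight feed-forward neural network that uses pre-trained GloVe word vectors as token embeddings. It applies average pooling over the embedding vectors of all tokens in a sentence to produce a global feature representation of the sentence, which then passes through two consecutive fully connected linear layers to output the sentiment probability for that sentence.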

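Model Architecture

Tokenization: Custom tokenizer that splits on whitespace and punctuation. Input sequences are fixed to a uniform length of 150 tokens (shorter sequences are padded at the end, longer ones are truncated).
Embedding Layer: Initial weights are loaded from the tiny_glove.json pre-trained word vector dictionary provided in the project. The parameters of this layer are frozen during training.
Pooling: Global average pooling along the sequence dimension, i.e. along the length of a single sentence: [batch_size, max_len, embed_dim] -> [batch_size, embed_dim].
Fully Connected Network:
- First layer: Linear(embed_dim, 64) + ReLU()
- Second layer: Linear(64, 1) + Sigmoid() activation, which outputs the final probability.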

Training Data

The model was trained on a balanced, high-quality subset of the IMDB movie reviews dataset (10,000 samples, provided in the project as imdb_balanced_10k.csv).

Train Split: 8,000 samples (80%)
Test Split: 2,000 samples (20%)

Evaluation Results

Test Accuracy: 0.6870 (68.7%)

Training Parameters

Loss Function: Binary Cross Entropy Loss (BCELoss)
Optimizer: Adam
Learning Rate: 0.001
Batch Size: 64
Epochs: 10

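Included Files

The following model weights and required inference components are generated automatically by the workflow:

sentiment_nn.pth: The PyTorch state_dict (weight dictionary) of the trained network.
vocab.pkl: A mapping dictionary that converts text tokens to integer IDs (retains the <PAD> and <UNK> entries).
label_encoder.pkl: A Scikit-learn LabelEncoder object, used after prediction to convert the binary class values back to the original string labels.

How to Use

The following Python inference template shows how to turn a text into a prediction: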

import re
import string

import joblib
import torch
import torch.nn as nn

# Load the vocabulary and the label encoder
vocab = joblib.load("vocab.pkl")
label_encoder = joblib.load("label_encoder.pkl")
embed_dim = 300  # Must match the dimension of the pre-trained word vectors

# Re-declare the PyTorch model structure
class SentimentNN(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(SentimentNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc1 = nn.Linear(embed_dim, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 1)
        self.sigmoid = nn.Sigmoid()

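    def forward(self, x):
        embeds = self.embedding(x)   # [batch_size, max_len, embed_dim]
        out = embeds.mean(dim=1)     # average over tokens -> [batch_size, embed_dim]
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)      # probability of class 1, [batch_size, 1]
        return out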

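model = SentimentNN(len(vocab), embed_dim)
# map_location="cpu" lets the checkpoint load on machines without a GPU.
# Note: strict=False silently skips missing or unexpected keys; remove it to
# verify that every weight in the checkpoint is actually loaded.
model.load_state_dict(torch.load("sentiment_nn.pth", map_location="cpu"), strict=False)
model.eval()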

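# Inference
MAX_LEN = 150
pad_idx = vocab.get("<PAD>", 0)
unk_idx = vocab.get("<UNK>", 1)

# Lowercase, replace punctuation with spaces, then split on whitespace
text = "This movie is amazing!"
text = str(text).lower()
text = re.sub(r'[' + string.punctuation + ']', ' ', text)
tokens = text.split()

# Map tokens to ids, truncate to MAX_LEN, then pad on the right
indices = [vocab.get(t, unk_idx) for t in tokens[:MAX_LEN]]
indices += [pad_idx] * (MAX_LEN - len(indices))

input_tensor = torch.tensor([indices], dtype=torch.long)

with torch.no_grad():
    prediction = model(input_tensor).item()  # sigmoid output: probability of class 1

# Get the string label and the probability of the predicted class
predicted_class = 1 if prediction > 0.5 else 0
confidence = prediction if predicted_class == 1 else 1 - prediction
print("Sentiment:", label_encoder.inverse_transform([predicted_class])[0])
print(f"Confidence: {confidence:.4f}")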
