Added: support for the latest model
- .gitignore +2 -1
- config.py +1 -1
- notebook/ai_vs_human/final_archi.md +0 -426
- notebook/ai_vs_human/main.ipynb +0 -1110
- notebook/ai_vs_human/mainv2.ipynb +0 -1170
- notebook/ai_vs_human/mainv3.ipynb +0 -0
- notebook/ai_vs_human_nepali/notebook/Nepali_Ai_vs_Human.ipynb +0 -1429
- notebook/ai_vs_human_nepali/notebook/documentation.md +0 -435
- notebook/ai_vs_human_nepali/notebook/final_main.ipynb +0 -0
- notebook/ai_vs_human_nepali/notebook/main.ipynb +0 -0
- notebook/ai_vs_human_nepali/notebook/working model.ipynb +0 -0
- notebook/ai_vs_human_nepali/topic_scrapper.ipynb +0 -455
.gitignore
CHANGED
|
@@ -17,7 +17,8 @@ __pycache__/
|
|
| 17 |
# ---- Jupyter / IPython ----
|
| 18 |
.ipynb_checkpoints/
|
| 19 |
*.ipynb
|
| 20 |
-
|
|
|
|
| 21 |
# ---- Model & Data Artifacts ----
|
| 22 |
*.pth
|
| 23 |
*.pt
|
|
|
|
| 17 |
# ---- Jupyter / IPython ----
|
| 18 |
.ipynb_checkpoints/
|
| 19 |
*.ipynb
|
| 20 |
+
notebook/
|
| 21 |
+
*.csv
|
| 22 |
# ---- Model & Data Artifacts ----
|
| 23 |
*.pth
|
| 24 |
*.pt
|
config.py
CHANGED
|
@@ -28,7 +28,7 @@ class Config:
|
|
| 28 |
REAL_FORGED_MODEL_LOCAL_PATH = os.getenv("REAL_FORGED_MODEL_LOCAL_PATH", "Model/real_forged/fft_cnn_model_78.pth")
|
| 29 |
DOCUMENT_FORGERY_MODEL_PATH = os.getenv(
|
| 30 |
"DOCUMENT_FORGERY_MODEL_PATH",
|
| 31 |
-
"features/Model/document_forgery/
|
| 32 |
)
|
| 33 |
# Decision thresholds for document forgery detector (probabilities in 0..1)
|
| 34 |
DOCUMENT_FORGERY_POSSIBLE_LOW = float(os.getenv("DOCUMENT_FORGERY_POSSIBLE_LOW", "0.40"))
|
|
|
|
| 28 |
REAL_FORGED_MODEL_LOCAL_PATH = os.getenv("REAL_FORGED_MODEL_LOCAL_PATH", "Model/real_forged/fft_cnn_model_78.pth")
|
| 29 |
DOCUMENT_FORGERY_MODEL_PATH = os.getenv(
|
| 30 |
"DOCUMENT_FORGERY_MODEL_PATH",
|
| 31 |
+
"features/Model/document_forgery/pixel_forgery_v3_best.pth",
|
| 32 |
)
|
| 33 |
# Decision thresholds for document forgery detector (probabilities in 0..1)
|
| 34 |
DOCUMENT_FORGERY_POSSIBLE_LOW = float(os.getenv("DOCUMENT_FORGERY_POSSIBLE_LOW", "0.40"))
|
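The settings in this hunk follow one pattern: read a value from the environment and fall back to a default string. The sketch below illustrates that pattern for the `DOCUMENT_FORGERY_POSSIBLE_LOW` threshold. The helper name `env_float`, its `env` parameter, and the 0..1 range check are assumptions added for illustration and are not part of `config.py`.

```python
import os


def env_float(name: str, default: str, env=None) -> float:
    """Read a probability-like setting, falling back to `default` (hypothetical helper)."""
    env = os.environ if env is None else env  # allow a plain dict for testing
    value = float(env.get(name, default))
    # The range check is an assumption; the hunk only documents "probabilities in 0..1".
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in 0..1, got {value}")
    return value


# Default applies when the variable is unset; an explicit value overrides it.
low_default = env_float("DOCUMENT_FORGERY_POSSIBLE_LOW", "0.40", env={})
low_override = env_float(
    "DOCUMENT_FORGERY_POSSIBLE_LOW", "0.40",
    env={"DOCUMENT_FORGERY_POSSIBLE_LOW": "0.55"},
)
```

Validating at load time would surface a mistyped threshold at startup instead of at the first prediction.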
notebook/ai_vs_human/final_archi.md
DELETED
|
@@ -1,426 +0,0 @@
|
|
| 1 |
-
# AI vs Human Text Detector V3 - Final Architecture Summary
|
| 2 |
-
dataset = "Pujan-Dev/english_aivshuman"
|
| 3 |
-
**Model Version**: V3
|
| 4 |
-
**Type**: Hybrid Feature Engineering + TF-IDF Classifier
|
| 5 |
-
**Output Directory**: `./v3_model/`
|
| 6 |
-
**Date**: March 2026
|
| 7 |
-
|
| 8 |
-
---
|
| 9 |
-
|
| 10 |
-
## 📊 Overview
|
| 11 |
-
|
| 12 |
-
The V3 model is a **non-transformer, feature-based ML classifier** that distinguishes between AI-generated and human-written text using a hybrid approach combining engineered linguistic features with TF-IDF text representations.
|
| 13 |
-
|
| 14 |
-
```
|
| 15 |
-
┌─────────────┐
|
| 16 |
-
│ Input Text │
|
| 17 |
-
└──────┬──────┘
|
| 18 |
-
│
|
| 19 |
-
├──────────────────────────────────┐
|
| 20 |
-
│ │
|
| 21 |
-
▼ ▼
|
| 22 |
-
┌──────────────────┐ ┌─────────────────┐
|
| 23 |
-
│ Text Features │ │ Engineered │
|
| 24 |
-
│ (TF-IDF) │ │ Features │
|
| 25 |
-
│ │ │ (16 features) │
|
| 26 |
-
│ • Word (1-2gram) │ │ │
|
| 27 |
-
│ • Char (3-5gram) │ │ • Perplexity │
|
| 28 |
-
│ │ │ • Burstiness │
|
| 29 |
-
│ Max 200k features│ │ • Stylometry │
|
| 30 |
-
└────────┬─────────┘ └─────────┬───────┘
|
| 31 |
-
│ │
|
| 32 |
-
│ ┌───────────────┐ │
|
| 33 |
-
└───────►│ StandardScaler│◄──────┘
|
| 34 |
-
└───────┬───────┘
|
| 35 |
-
│
|
| 36 |
-
┌───────▼──────────┐
|
| 37 |
-
│ Sparse Matrix │
|
| 38 |
-
│ Concat (hstack)│
|
| 39 |
-
└───────┬──────────┘
|
| 40 |
-
│
|
| 41 |
-
┌───────▼────────┐
|
| 42 |
-
│ Logistic │
|
| 43 |
-
│ Regression │
|
| 44 |
-
│ (GridSearchCV)│
|
| 45 |
-
└───────┬────────┘
|
| 46 |
-
│
|
| 47 |
-
┌───────▼────────┐
|
| 48 |
-
│ Prediction │
|
| 49 |
-
│ (Human vs AI) │
|
| 50 |
-
└────────────────┘
|
| 51 |
-
```
|
| 52 |
-
|
| 53 |
-
---
|
| 54 |
-
|
| 55 |
-
## 🏗️ Architecture Components
|
| 56 |
-
|
| 57 |
-
### 1. **Data Loading**
|
| 58 |
-
|
| 59 |
-
**Function**: `load_dataset_recursive(max_samples=20000)`
|
| 60 |
-
|
| 61 |
-
- **Source**: Recursively scans `./DATASET/` folder
|
| 62 |
-
- **Formats Supported**: `.jsonl`, `.json`, `.csv`
|
| 63 |
-
- **Schema Support**:
|
| 64 |
-
- Schema 1: `human_text` + `ai_text` columns
|
| 65 |
-
- Schema 2: `text` + `label`/`ai_gen` columns
|
| 66 |
-
- **Labels**:
|
| 67 |
-
- `0` = Human text
|
| 68 |
-
- `1` = AI-generated text
|
| 69 |
-
- **Preprocessing**: Text normalization (whitespace cleanup)
|
| 70 |
-
- **Max Samples**: 20,000 (configurable)
|
| 71 |
-
- **Random State**: 42
|
| 72 |
-
|
| 73 |
-
---
|
| 74 |
-
|
| 75 |
-
### 2. **Feature Extraction Pipeline**
|
| 76 |
-
|
| 77 |
-
The model extracts **3 types of features** in parallel:
|
| 78 |
-
|
| 79 |
-
#### 2.1 **Perplexity Features** (1 feature)
|
| 80 |
-
|
| 81 |
-
**Model**: `distilgpt2` (Hugging Face Transformers)
|
| 82 |
-
|
| 83 |
-
```python
|
| 84 |
-
class PerplexityCalculator:
|
| 85 |
-
- Model: distilgpt2
|
| 86 |
-
- Max Length: 512 tokens
|
| 87 |
-
- Metric: exp(cross_entropy_loss)
|
| 88 |
-
- Cap: 10,000 (outlier protection)
|
| 89 |
-
- Fallback: 100.0 on error
|
| 90 |
-
```
|
| 91 |
-
|
| 92 |
-
**What it measures**: Language model surprise/naturalness
|
| 93 |
-
- Lower perplexity → More predictable (often AI)
|
| 94 |
-
- Higher perplexity → Less predictable (often human)
|
| 95 |
-
|
| 96 |
-
---
|
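The perplexity rule described above (exp of the cross-entropy loss, capped at 10,000, with a fallback of 100.0) can be sketched without loading a language model. This version assumes the per-token negative log-likelihoods have already been computed, for example by `distilgpt2`. The function name `perplexity_from_nll` and the choice to treat empty or non-numeric input as the error case are assumptions.

```python
import math

PPL_CAP = 10_000.0    # outlier cap stated in the summary
PPL_FALLBACK = 100.0  # fallback value stated in the summary


def perplexity_from_nll(token_nlls):
    """Return exp(mean NLL), capped at PPL_CAP; PPL_FALLBACK on unusable input."""
    try:
        if not token_nlls:
            return PPL_FALLBACK
        mean_nll = sum(token_nlls) / len(token_nlls)
        # Compare in log space so very large losses hit the cap instead of overflowing.
        if mean_nll >= math.log(PPL_CAP):
            return PPL_CAP
        return math.exp(mean_nll)
    except TypeError:
        # Non-numeric entries are treated as an error (assumption).
        return PPL_FALLBACK
```

A mean loss of 0 gives a perplexity of 1 (fully predictable text), which matches the reading that lower values mean more predictable text.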
| 97 |
-
|
| 98 |
-
#### 2.2 **Burstiness Features** (5 features)
|
| 99 |
-
|
| 100 |
-
Measures sentence length variation patterns.
|
| 101 |
-
|
| 102 |
-
**Features**:
|
| 103 |
-
1. `burst_mean` - Average sentence length (words)
|
| 104 |
-
2. `burst_std` - Standard deviation of sentence lengths
|
| 105 |
-
3. `burst_max` - Maximum sentence length
|
| 106 |
-
4. `burst_min` - Minimum sentence length
|
| 107 |
-
5. `burst_range` - Range (max - min)
|
| 108 |
-
|
| 109 |
-
**Theory**: Human writing has more variation in sentence length (high burstiness), while AI text tends to be more uniform.
|
| 110 |
-
|
| 111 |
-
---
|
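The five burstiness features listed above can be sketched with the standard library alone. This version makes two assumptions: it uses a naive regex sentence splitter in place of the `nltk` tokenizer the summary names, and it uses the population standard deviation, since the summary does not say which variant the original used. The function name `burstiness_features` is hypothetical.

```python
import re
import statistics

FEATURE_NAMES = ("burst_mean", "burst_std", "burst_max", "burst_min", "burst_range")


def burstiness_features(text: str) -> dict:
    """Sentence-length statistics, measured in words per sentence."""
    # Naive split after ., ! or ? followed by whitespace (stand-in for nltk).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        return {name: 0.0 for name in FEATURE_NAMES}
    return {
        "burst_mean": statistics.fmean(lengths),
        "burst_std": statistics.pstdev(lengths),  # population std (assumption)
        "burst_max": float(max(lengths)),
        "burst_min": float(min(lengths)),
        "burst_range": float(max(lengths) - min(lengths)),
    }


example = burstiness_features("One two three. Four five. Six!")
```

Here the sentence lengths are 3, 2, and 1 words, so the mean is 2.0 and the range is 2.0.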
| 112 |
-
|
| 113 |
-
#### 2.3 **Stylometry Features** (10 features)
|
| 114 |
-
|
| 115 |
-
Writing style and readability metrics.
|
| 116 |
-
|
| 117 |
-
**Features**:
|
| 118 |
-
1. `num_words` - Total word count
|
| 119 |
-
2. `num_chars` - Total character count
|
| 120 |
-
3. `num_sentences` - Total sentence count
|
| 121 |
-
4. `avg_word_len` - Average word length
|
| 122 |
-
5. `avg_sent_len` - Average sentence length
|
| 123 |
-
6. `lexical_diversity` - Unique words / total words
|
| 124 |
-
7. `punct_ratio` - Punctuation density
|
| 125 |
-
8. `caps_ratio` - Capitalization ratio
|
| 126 |
-
9. `flesch_reading` - Flesch Reading Ease score
|
| 127 |
-
10. `flesch_grade` - Flesch-Kincaid Grade Level
|
| 128 |
-
|
| 129 |
-
**Library**: `textstat` + `nltk`
|
| 130 |
-
|
| 131 |
-
---
|
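The count-based stylometry features above can be illustrated as follows. The summary gives only the feature names, so the exact definitions here are assumptions: words are split on whitespace, and `punct_ratio` and `caps_ratio` are computed per character. The Flesch scores and sentence counts, which the original takes from `textstat` and `nltk`, are left out. The function name `stylometry_counts` is hypothetical.

```python
import string


def stylometry_counts(text: str) -> dict:
    """A subset of the stylometry features, using whitespace-split words."""
    words = text.split()
    num_words = len(words)
    num_chars = len(text)
    return {
        "num_words": num_words,
        "num_chars": num_chars,
        "avg_word_len": sum(len(w) for w in words) / num_words if num_words else 0.0,
        # Unique lowercased words over total words.
        "lexical_diversity": len({w.lower() for w in words}) / num_words if num_words else 0.0,
        # Per-character ratios (assumed definitions).
        "punct_ratio": sum(c in string.punctuation for c in text) / num_chars if num_chars else 0.0,
        "caps_ratio": sum(c.isupper() for c in text) / num_chars if num_chars else 0.0,
    }


sample = stylometry_counts("The cat. The dog!")
```

Because punctuation stays attached to the whitespace-split tokens, "cat." and "dog!" count as distinct words; a real tokenizer would give slightly different values.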
| 132 |
-
|
| 133 |
-
### 3. **TF-IDF Vectorization**
|
| 134 |
-
|
| 135 |
-
#### 3.1 **Word-Level TF-IDF**
|
| 136 |
-
|
| 137 |
-
```python
|
| 138 |
-
TfidfVectorizer(
|
| 139 |
-
analyzer="word",
|
| 140 |
-
ngram_range=(1, 2), # Unigrams + bigrams
|
| 141 |
-
min_df=3, # Minimum document frequency
|
| 142 |
-
max_df=0.98, # Maximum document frequency
|
| 143 |
-
max_features=120000, # Cap at 120k features
|
| 144 |
-
sublinear_tf=True # log(tf) scaling
|
| 145 |
-
)
|
| 146 |
-
```
|
| 147 |
-
|
| 148 |
-
**Output**: Sparse matrix of word/phrase importance scores
|
| 149 |
-
|
| 150 |
-
---
|
| 151 |
-
|
| 152 |
-
#### 3.2 **Character-Level TF-IDF**
|
| 153 |
-
|
| 154 |
-
```python
|
| 155 |
-
TfidfVectorizer(
|
| 156 |
-
analyzer="char_wb", # Character n-grams (word boundaries)
|
| 157 |
-
ngram_range=(3, 5), # 3-char to 5-char sequences
|
| 158 |
-
min_df=3,
|
| 159 |
-
max_df=0.98,
|
| 160 |
-
max_features=80000, # Cap at 80k features
|
| 161 |
-
sublinear_tf=True
|
| 162 |
-
)
|
| 163 |
-
```
|
| 164 |
-
|
| 165 |
-
**Purpose**: Captures sub-word patterns and stylistic signatures
|
| 166 |
-
|
| 167 |
-
---
|
| 168 |
-
|
| 169 |
-
### 4. **Feature Preprocessing**
|
| 170 |
-
|
| 171 |
-
**Engineered Features**:
|
| 172 |
-
- Scaled using `StandardScaler` (z-score normalization)
|
| 173 |
-
- Converted to sparse CSR matrix for memory efficiency
|
| 174 |
-
|
| 175 |
-
**Hybrid Feature Vector**:
|
| 176 |
-
```python
|
| 177 |
-
hybrid_vec = hstack([word_tfidf, char_tfidf, engineered_features_scaled])
|
| 178 |
-
```
|
| 179 |
-
|
| 180 |
-
**Final Feature Dimensionality**:
|
| 181 |
-
- Word TF-IDF: Up to 120,000 features
|
| 182 |
-
- Char TF-IDF: Up to 80,000 features
|
| 183 |
-
- Engineered: 16 features
|
| 184 |
-
- **Total**: Up to ~200,016 features (sparse)
|
| 185 |
-
|
| 186 |
-
---
|
| 187 |
-
|
| 188 |
-
### 5. **Model Training**
|
| 189 |
-
|
| 190 |
-
#### 5.1 **Train-Test Split**
|
| 191 |
-
```python
|
| 192 |
-
train_size: 80% (16,000 samples)
|
| 193 |
-
test_size: 20% (4,000 samples)
|
| 194 |
-
stratified: Yes (balanced across classes)
|
| 195 |
-
random_state: 42
|
| 196 |
-
```
|
| 197 |
-
|
| 198 |
-
#### 5.2 **Classifier**
|
| 199 |
-
|
| 200 |
-
**Algorithm**: Logistic Regression
|
| 201 |
-
|
| 202 |
-
**Hyperparameter Tuning**: GridSearchCV with 3-fold stratified cross-validation
|
| 203 |
-
|
| 204 |
-
**Search Space**:
|
| 205 |
-
```python
|
| 206 |
-
{
|
| 207 |
-
"C": [0.5, 1.0, 2.0, 4.0], # Regularization strength
|
| 208 |
-
"class_weight": [None, "balanced"], # Class balancing
|
| 209 |
-
"solver": "saga", # Stochastic Average Gradient
|
| 210 |
-
"penalty": "l2", # L2 regularization
|
| 211 |
-
"max_iter": 2500,
|
| 212 |
-
"n_jobs": -1 # Parallel processing
|
| 213 |
-
}
|
| 214 |
-
```
|
| 215 |
-
|
| 216 |
-
**Scoring Metric**: F1 Score (balanced for precision/recall)
|
| 217 |
-
|
| 218 |
-
---
|
| 219 |
-
|
| 220 |
-
### 6. **Model Evaluation**
|
| 221 |
-
|
| 222 |
-
**Metrics Tracked**:
|
| 223 |
-
- **Accuracy**: Overall correct predictions
|
| 224 |
-
- **F1 Score**: Harmonic mean of precision/recall
|
| 225 |
-
- **ROC-AUC**: Area under ROC curve
|
| 226 |
-
- **Confusion Matrix**: True/false positives/negatives
|
| 227 |
-
- **Classification Report**: Per-class precision/recall/F1
|
| 228 |
-
|
| 229 |
-
**Visualizations**:
|
| 230 |
-
1. Confusion Matrix
|
| 231 |
-
2. ROC Curve
|
| 232 |
-
3. Feature Importance (top engineered features)
|
| 233 |
-
4. Perplexity Distribution (Human vs AI)
|
| 234 |
-
5. Lexical Diversity Distribution
|
| 235 |
-
6. Burstiness STD Distribution
|
| 236 |
-
|
| 237 |
-
---
|
| 238 |
-
|
| 239 |
-
### 7. **Model Persistence**
|
| 240 |
-
|
| 241 |
-
**Output Directory**: `./v3_model/`
|
| 242 |
-
|
| 243 |
-
**Saved Artifacts**:
|
| 244 |
-
|
| 245 |
-
| File | Description |
|
| 246 |
-
|------|-------------|
|
| 247 |
-
| `classifier.pkl` | Trained Logistic Regression model |
|
| 248 |
-
| `scaler.pkl` | StandardScaler for engineered features |
|
| 249 |
-
| `word_vectorizer.pkl` | Word-level TF-IDF vectorizer |
|
| 250 |
-
| `char_vectorizer.pkl` | Character-level TF-IDF vectorizer |
|
| 251 |
-
| `feature_names.json` | List of engineered feature names (16 features) |
|
| 252 |
-
| `metadata.json` | Model performance metrics & configuration |
|
| 253 |
-
|
| 254 |
-
**Metadata Contents**:
|
| 255 |
-
```json
|
| 256 |
-
{
|
| 257 |
-
"selected_model": "hybrid_tfidf_logistic",
|
| 258 |
-
"cv_best_f1": 0.xxxx,
|
| 259 |
-
"num_engineered_features": 16,
|
| 260 |
-
"num_word_tfidf_features": 120000,
|
| 261 |
-
"num_char_tfidf_features": 80000,
|
| 262 |
-
"train_samples": 16000,
|
| 263 |
-
"test_samples": 4000,
|
| 264 |
-
"train_accuracy": 0.xxxx,
|
| 265 |
-
"train_f1": 0.xxxx,
|
| 266 |
-
"test_accuracy": 0.xxxx,
|
| 267 |
-
"test_f1": 0.xxxx
|
| 268 |
-
}
|
| 269 |
-
```
|
| 270 |
-
|
| 271 |
-
---
|
| 272 |
-
|
| 273 |
-
### 8. **Inference Pipeline**
|
| 274 |
-
|
| 275 |
-
**Function**: `predict_v3(text: str) -> dict`
|
| 276 |
-
|
| 277 |
-
**Process**:
|
| 278 |
-
```python
|
| 279 |
-
1. Normalize text (whitespace cleanup)
|
| 280 |
-
2. Extract engineered features (16 features)
|
| 281 |
-
3. Scale engineered features (StandardScaler)
|
| 282 |
-
4. Generate word TF-IDF vector
|
| 283 |
-
5. Generate char TF-IDF vector
|
| 284 |
-
6. Concatenate all features (sparse matrix)
|
| 285 |
-
7. Predict with Logistic Regression
|
| 286 |
-
8. Return prediction + probabilities + features
|
| 287 |
-
```
|
| 288 |
-
|
| 289 |
-
**Output Schema**:
|
| 290 |
-
```python
|
| 291 |
-
{
|
| 292 |
-
"text": str, # Truncated input (100 chars)
|
| 293 |
-
"word_count": int, # Number of words
|
| 294 |
-
"predicted_label": int, # 0=Human, 1=AI
|
| 295 |
-
"predicted_name": str, # "human" or "ai"
|
| 296 |
-
"probability_human": float, # P(Human) [0-1]
|
| 297 |
-
"probability_ai": float, # P(AI) [0-1]
|
| 298 |
-
"features": dict # All 16 engineered features
|
| 299 |
-
}
|
| 300 |
-
```
|
| 301 |
-
|
| 302 |
-
**Batch Function**: `predict_v3_batch(texts: list[str]) -> list[dict]`
|
| 303 |
-
|
| 304 |
-
---
|
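The output schema above can be sketched as a small formatting step that runs after the classifier has produced P(AI). The function name `format_prediction` and the 0.5 decision threshold are assumptions; the summary does not state which threshold `predict_v3` uses.

```python
def format_prediction(text: str, probability_ai: float, features: dict) -> dict:
    """Build a result dict in the documented schema from P(AI) (hypothetical helper)."""
    label = 1 if probability_ai >= 0.5 else 0  # 0.5 threshold is an assumption
    return {
        "text": text[:100],                      # truncated input (100 chars)
        "word_count": len(text.split()),
        "predicted_label": label,                # 0=Human, 1=AI
        "predicted_name": "ai" if label == 1 else "human",
        "probability_human": 1.0 - probability_ai,
        "probability_ai": probability_ai,
        "features": dict(features),              # copy of the engineered features
    }


result = format_prediction("a b c", 0.8, {"burst_mean": 3.0})
```

Keeping this step separate from feature extraction would let a batch function reuse it once per text.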
| 305 |
-
|
| 306 |
-
## 🔧 Configuration
|
| 307 |
-
|
| 308 |
-
```python
|
| 309 |
-
@dataclass
|
| 310 |
-
class V3Config:
|
| 311 |
-
max_samples: int = 20000 # Max training samples
|
| 312 |
-
test_size: float = 0.2 # Test split ratio
|
| 313 |
-
output_dir: str = "./v3_model" # Model save directory
|
| 314 |
-
random_state: int = 42 # Reproducibility seed
|
| 315 |
-
cv_folds: int = 3 # Cross-validation folds
|
| 316 |
-
```
|
| 317 |
-
|
| 318 |
-
---
|
| 319 |
-
|
| 320 |
-
## 📦 Dependencies
|
| 321 |
-
|
| 322 |
-
**Core Libraries**:
|
| 323 |
-
- `scikit-learn` - ML algorithms, TF-IDF, metrics
|
| 324 |
-
- `pandas` - Data manipulation
|
| 325 |
-
- `numpy` - Numerical operations
|
| 326 |
-
- `scipy` - Sparse matrix operations
|
| 327 |
-
|
| 328 |
-
**Feature Extraction**:
|
| 329 |
-
- `transformers` - DistilGPT2 for perplexity
|
| 330 |
-
- `torch` - PyTorch backend for transformers
|
| 331 |
-
- `nltk` - Sentence tokenization (`punkt_tab`)
|
| 332 |
-
- `textstat` - Readability metrics
|
| 333 |
-
|
| 334 |
-
**Visualization**:
|
| 335 |
-
- `matplotlib` - Plotting
|
| 336 |
-
- `seaborn` - Statistical visualizations
|
| 337 |
-
|
| 338 |
-
---
|
| 339 |
-
|
| 340 |
-
## 🎯 Key Design Decisions
|
| 341 |
-
|
| 342 |
-
### Why Not Transformers?
|
| 343 |
-
1. **Speed**: No GPU required, fast inference
|
| 344 |
-
2. **Interpretability**: Explainable features
|
| 345 |
-
3. **Efficiency**: Smaller model size (~500MB vs 5GB+)
|
| 346 |
-
4. **Robustness**: Works on any text length
|
| 347 |
-
|
| 348 |
-
### Why Hybrid Features?
|
| 349 |
-
1. **TF-IDF**: Captures content and vocabulary patterns
|
| 350 |
-
2. **Perplexity**: Measures language model naturalness
|
| 351 |
-
3. **Burstiness**: Detects sentence variation patterns
|
| 352 |
-
4. **Stylometry**: Analyzes writing style signatures
|
| 353 |
-
|
| 354 |
-
### Why Logistic Regression?
|
| 355 |
-
1. **Scalability**: Handles 200k+ sparse features efficiently
|
| 356 |
-
2. **Speed**: Fast training and inference
|
| 357 |
-
3. **Interpretability**: Clear feature importance via coefficients
|
| 358 |
-
4. **Robustness**: Well-suited for high-dimensional sparse data
|
| 359 |
-
|
| 360 |
-
---
|
| 361 |
-
|
| 362 |
-
## 📈 Expected Performance
|
| 363 |
-
|
| 364 |
-
**Typical Results** (20k samples):
|
| 365 |
-
- **Test Accuracy**: 85-95%
|
| 366 |
-
- **Test F1 Score**: 0.85-0.95
|
| 367 |
-
- **Inference Speed**: ~50-100 texts/second (CPU)
|
| 368 |
-
- **Model Size**: ~500 MB total
|
| 369 |
-
|
| 370 |
-
**Best For**:
|
| 371 |
-
- ✅ General English text classification
|
| 372 |
-
- ✅ Articles, essays, reviews
|
| 373 |
-
- ✅ Medium to long texts (50+ words)
|
| 374 |
-
|
| 375 |
-
**Limitations**:
|
| 376 |
-
- ⚠️ Very short texts (<10 words) may be unreliable
|
| 377 |
-
- ⚠️ Perplexity calculation is the bottleneck (uses GPU if available)
|
| 378 |
-
- ⚠️ Domain-specific jargon may affect performance
|
| 379 |
-
- ⚠️ Non-English text requires retraining
|
| 380 |
-
|
| 381 |
-
---
|
| 382 |
-
|
| 383 |
-
## 🔄 Model Loading Example
|
| 384 |
-
|
| 385 |
-
```python
|
| 386 |
-
from pathlib import Path
|
| 387 |
-
import pickle
|
| 388 |
-
import json
|
| 389 |
-
|
| 390 |
-
model_dir = Path("./v3_model")
|
| 391 |
-
|
| 392 |
-
# Load all artifacts
|
| 393 |
-
classifier = pickle.load(open(model_dir / "classifier.pkl", "rb"))
|
| 394 |
-
scaler = pickle.load(open(model_dir / "scaler.pkl", "rb"))
|
| 395 |
-
word_vectorizer = pickle.load(open(model_dir / "word_vectorizer.pkl", "rb"))
|
| 396 |
-
char_vectorizer = pickle.load(open(model_dir / "char_vectorizer.pkl", "rb"))
|
| 397 |
-
feature_names = json.load(open(model_dir / "feature_names.json", "r"))
|
| 398 |
-
metadata = json.load(open(model_dir / "metadata.json", "r"))
|
| 399 |
-
|
| 400 |
-
# Use predict_v3() function for inference
|
| 401 |
-
result = predict_v3("Your text here...")
|
| 402 |
-
```
|
| 403 |
-
|
| 404 |
-
---
|
| 405 |
-
|
| 406 |
-
## 💡 Future Improvements
|
| 407 |
-
|
| 408 |
-
1. **Model Versioning**: Add versioning system for model updates
|
| 409 |
-
2. **Confidence Thresholds**: Flag uncertain predictions
|
| 410 |
-
3. **Batch Optimization**: Vectorized batch inference
|
| 411 |
-
4. **Model Wrapper Class**: Encapsulate all logic in `AIPredictorV3` class
|
| 412 |
-
5. **Perplexity Caching**: Cache calculations for faster inference
|
| 413 |
-
6. **Ensemble Methods**: Combine multiple models for better accuracy
|
| 414 |
-
7. **Active Learning**: Iterative retraining with user feedback
|
| 415 |
-
8. **Multi-language Support**: Train separate models per language
|
| 416 |
-
|
| 417 |
-
---
|
| 418 |
-
|
| 419 |
-
## 📝 Citation & Credits
|
| 420 |
-
|
| 421 |
-
**Framework**: scikit-learn + HuggingFace Transformers
|
| 422 |
-
**Perplexity Model**: DistilGPT2 (OpenAI/Hugging Face)
|
| 423 |
-
**Readability Metrics**: textstat library
|
| 424 |
-
|
| 425 |
-
|
| 426 |
-
**Architecture Type**: Hybrid Feature Engineering + Logistic Regression
|
notebook/ai_vs_human/main.ipynb
DELETED
|
@@ -1,1110 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"cells": [
|
| 3 |
-
{
|
| 4 |
-
"cell_type": "markdown",
|
| 5 |
-
"id": "e522047b",
|
| 6 |
-
"metadata": {},
|
| 7 |
-
"source": [
|
| 8 |
-
"# AI vs Human Text Detector using BERT\n",
|
| 9 |
-
"Using google-bert/bert-base-cased with HC3 dataset or local data (~20k samples)"
|
| 10 |
-
]
|
| 11 |
-
},
|
| 12 |
-
{
|
| 13 |
-
"cell_type": "code",
|
| 14 |
-
"execution_count": 35,
|
| 15 |
-
"id": "16eddd36",
|
| 16 |
-
"metadata": {},
|
| 17 |
-
"outputs": [],
|
| 18 |
-
"source": [
|
| 19 |
-
"from functools import partial\n",
|
| 20 |
-
"\n",
|
| 21 |
-
"import datasets\n",
|
| 22 |
-
"from datasets import Dataset, DatasetDict, concatenate_datasets\n",
|
| 23 |
-
"import evaluate\n",
|
| 24 |
-
"import numpy as np\n",
|
| 25 |
-
"import torch\n",
|
| 26 |
-
"from transformers import (\n",
|
| 27 |
-
" AutoModelForSequenceClassification,\n",
|
| 28 |
-
" AutoTokenizer,\n",
|
| 29 |
-
" PreTrainedTokenizer,\n",
|
| 30 |
-
" BatchEncoding,\n",
|
| 31 |
-
" DataCollatorWithPadding,\n",
|
| 32 |
-
" Trainer,\n",
|
| 33 |
-
" TrainingArguments,\n",
|
| 34 |
-
")\n",
|
| 35 |
-
"from peft import LoraConfig, get_peft_model"
|
| 36 |
-
]
|
| 37 |
-
},
|
| 38 |
-
{
|
| 39 |
-
"cell_type": "markdown",
|
| 40 |
-
"id": "99bca750",
|
| 41 |
-
"metadata": {},
|
| 42 |
-
"source": [
|
| 43 |
-
"## Load AI Detection Dataset (~20k samples)"
|
| 44 |
-
]
|
| 45 |
-
},
|
| 46 |
-
{
|
| 47 |
-
"cell_type": "code",
|
| 48 |
-
"execution_count": 36,
|
| 49 |
-
"id": "2945f87a",
|
| 50 |
-
"metadata": {},
|
| 51 |
-
"outputs": [],
|
| 52 |
-
"source": [
|
| 53 |
-
"def get_raid_dataset(max_samples: int = 20000, use_local: bool = True) -> DatasetDict:\n",
|
| 54 |
-
" \"\"\"Load AI detection dataset and limit to ~20k samples\"\"\"\n",
|
| 55 |
-
" \n",
|
| 56 |
-
" print(\"Loading AI vs Human text dataset...\")\n",
|
| 57 |
-
" \n",
|
| 58 |
-
" all_texts = []\n",
|
| 59 |
-
" all_labels = []\n",
|
| 60 |
-
" \n",
|
| 61 |
-
" # Try loading HC3 dataset (Human ChatGPT Comparison Corpus)\n",
|
| 62 |
-
" try:\n",
|
| 63 |
-
" print(\"Attempting to load HC3 dataset...\")\n",
|
| 64 |
-
" dataset = datasets.load_dataset(\"Hello-SimpleAI/HC3\", \"all\", split=\"train\")\n",
|
| 65 |
-
" \n",
|
| 66 |
-
" # HC3 format: has 'question', 'human_answers', 'chatgpt_answers'\n",
|
| 67 |
-
" for item in dataset:\n",
|
| 68 |
-
" # Add human answers\n",
|
| 69 |
-
" if 'human_answers' in item and item['human_answers']:\n",
|
| 70 |
-
" for answer in item['human_answers'][:1]: # Take first answer\n",
|
| 71 |
-
" if answer and len(answer.strip()) > 0:\n",
|
| 72 |
-
" all_texts.append(answer)\n",
|
| 73 |
-
" all_labels.append(0) # 0 for human\n",
|
| 74 |
-
" \n",
|
| 75 |
-
" # Add AI answers\n",
|
| 76 |
-
" if 'chatgpt_answers' in item and item['chatgpt_answers']:\n",
|
| 77 |
-
" for answer in item['chatgpt_answers'][:1]: # Take first answer\n",
|
| 78 |
-
" if answer and len(answer.strip()) > 0:\n",
|
| 79 |
-
" all_texts.append(answer)\n",
|
| 80 |
-
" all_labels.append(1) # 1 for AI\n",
|
| 81 |
-
" \n",
|
| 82 |
-
" print(f\"✓ Loaded {len(all_texts)} samples from HC3 dataset\")\n",
|
| 83 |
-
" except Exception as e:\n",
|
| 84 |
-
" print(f\"⚠ Could not load HC3 dataset: {e}\")\n",
|
| 85 |
-
" \n",
|
| 86 |
-
" # Load local data and combine\n",
|
| 87 |
-
" if use_local:\n",
|
| 88 |
-
" try:\n",
|
| 89 |
-
" print(\"Loading local dataset...\")\n",
|
| 90 |
-
" import pandas as pd\n",
|
| 91 |
-
" df = pd.read_json(\"./DATASET/basic_Data.jsonl\", lines=True)\n",
|
| 92 |
-
" \n",
|
| 93 |
-
" # Build a proper binary classification dataset: human_text -> 0, ai_text -> 1\n",
|
| 94 |
-
" if {\"human_text\", \"ai_text\"}.issubset(df.columns):\n",
|
| 95 |
-
" local_texts = list(df[\"human_text\"].dropna()) + list(df[\"ai_text\"].dropna())\n",
|
| 96 |
-
" local_labels = [0] * len(df[\"human_text\"].dropna()) + [1] * len(df[\"ai_text\"].dropna())\n",
|
| 97 |
-
" \n",
|
| 98 |
-
" all_texts.extend(local_texts)\n",
|
| 99 |
-
" all_labels.extend(local_labels)\n",
|
| 100 |
-
" \n",
|
| 101 |
-
" print(f\"✓ Loaded {len(local_texts)} samples from local data\")\n",
|
| 102 |
-
" else:\n",
|
| 103 |
-
" print(\"⚠ Local dataset doesn't have required columns\")\n",
|
| 104 |
-
" except Exception as e:\n",
|
| 105 |
-
" print(f\"⚠ Could not load local dataset: {e}\")\n",
|
| 106 |
-
" \n",
|
| 107 |
-
" # Check if we have any data\n",
|
| 108 |
-
" if len(all_texts) == 0:\n",
|
| 109 |
-
" raise ValueError(\"No data loaded! Check HC3 dataset or local data availability\")\n",
|
| 110 |
-
" \n",
|
| 111 |
-
" # Create combined dataset\n",
|
| 112 |
-
" combined_dataset = Dataset.from_dict({\n",
|
| 113 |
-
" \"text\": all_texts,\n",
|
| 114 |
-
" \"label\": all_labels\n",
|
| 115 |
-
" })\n",
|
| 116 |
-
" \n",
|
| 117 |
-
" print(f\"Total combined samples: {len(combined_dataset)}\")\n",
|
| 118 |
-
" \n",
|
| 119 |
-
" # Shuffle and limit to max_samples\n",
|
| 120 |
-
" combined_dataset = combined_dataset.shuffle(seed=42)\n",
|
| 121 |
-
" if len(combined_dataset) > max_samples:\n",
|
| 122 |
-
" combined_dataset = combined_dataset.select(range(max_samples))\n",
|
| 123 |
-
" print(f\"Limited to {max_samples} samples\")\n",
|
| 124 |
-
" \n",
|
| 125 |
-
" # Filter out empty texts\n",
|
| 126 |
-
" combined_dataset = combined_dataset.filter(lambda x: x['text'] is not None and len(x['text'].strip()) > 0)\n",
|
| 127 |
-
" \n",
|
| 128 |
-
" # Split into train/test (95/5 split)\n",
|
| 129 |
-
" dataset_split = combined_dataset.train_test_split(test_size=0.05, seed=42)\n",
|
| 130 |
-
" \n",
|
| 131 |
-
" print(f\"\\n✓ Dataset ready!\")\n",
|
| 132 |
-
" print(f\" Train samples: {len(dataset_split['train'])}\")\n",
|
| 133 |
-
" print(f\" Test samples: {len(dataset_split['test'])}\")\n",
|
| 134 |
-
" \n",
|
| 135 |
-
" # Check label distribution\n",
|
| 136 |
-
" import numpy as np\n",
|
| 137 |
-
" train_labels = np.array(dataset_split['train']['label'])\n",
|
| 138 |
-
" print(f\" Label distribution (train):\")\n",
|
| 139 |
-
" print(f\" Human (0): {(train_labels == 0).sum()}\")\n",
|
| 140 |
-
" print(f\" AI (1): {(train_labels == 1).sum()}\")\n",
|
| 141 |
-
" \n",
|
| 142 |
-
" return dataset_split"
|
| 143 |
-
]
|
| 144 |
-
},
|
| 145 |
-
{
|
| 146 |
-
"cell_type": "code",
|
| 147 |
-
"execution_count": 37,
|
| 148 |
-
"id": "38d8478c",
|
| 149 |
-
"metadata": {},
|
| 150 |
-
"outputs": [
|
| 151 |
-
{
|
| 152 |
-
"name": "stdout",
|
| 153 |
-
"output_type": "stream",
|
| 154 |
-
"text": [
|
| 155 |
-
"Loading AI vs Human text dataset...\n",
|
| 156 |
-
"Attempting to load HC3 dataset...\n",
|
| 157 |
-
"⚠ Could not load HC3 dataset: Dataset scripts are no longer supported, but found HC3.py\n",
|
| 158 |
-
"Loading local dataset...\n",
|
| 159 |
-
"✓ Loaded 19940 samples from local data\n",
|
| 160 |
-
"Total combined samples: 19940\n"
|
| 161 |
-
]
|
| 162 |
-
},
|
| 163 |
-
{
|
| 164 |
-
"name": "stderr",
|
| 165 |
-
"output_type": "stream",
|
| 166 |
-
"text": [
|
| 167 |
-
"Filter: 100%|██████████| 19940/19940 [00:00<00:00, 95584.60 examples/s] \n"
|
| 168 |
-
]
|
| 169 |
-
},
|
| 170 |
-
{
|
| 171 |
-
"name": "stdout",
|
| 172 |
-
"output_type": "stream",
|
| 173 |
-
"text": [
|
| 174 |
-
"\n",
|
| 175 |
-
"✓ Dataset ready!\n",
|
| 176 |
-
" Train samples: 18943\n",
|
| 177 |
-
" Test samples: 997\n",
|
| 178 |
-
" Label distribution (train):\n",
|
| 179 |
-
" Human (0): 9477\n",
|
| 180 |
-
" AI (1): 9466\n"
|
| 181 |
-
]
|
| 182 |
-
}
|
| 183 |
-
],
|
| 184 |
-
"source": [
|
| 185 |
-
"# Load dataset\n",
|
| 186 |
-
"raw_datasets = get_raid_dataset(max_samples=20000)"
|
| 187 |
-
]
|
| 188 |
-
},
|
| 189 |
-
{
|
| 190 |
-
"cell_type": "markdown",
|
| 191 |
-
"id": "f60191f6",
|
| 192 |
-
"metadata": {},
|
| 193 |
-
"source": [
|
| 194 |
-
"## Initialize Model and Tokenizer"
|
| 195 |
-
]
|
| 196 |
-
},
|
| 197 |
-
{
|
| 198 |
-
"cell_type": "code",
|
| 199 |
-
"execution_count": 38,
|
| 200 |
-
"id": "315bb737",
|
| 201 |
-
"metadata": {},
|
| 202 |
-
"outputs": [
|
| 203 |
-
{
|
| 204 |
-
"name": "stderr",
|
| 205 |
-
"output_type": "stream",
|
| 206 |
-
"text": [
|
| 207 |
-
"Loading weights: 100%|██████████| 199/199 [00:00<00:00, 1208.24it/s, Materializing param=bert.pooler.dense.weight] \n",
|
| 208 |
-
"BertForSequenceClassification LOAD REPORT from: google-bert/bert-base-cased\n",
|
| 209 |
-
"Key | Status | \n",
|
| 210 |
-
"-------------------------------------------+------------+-\n",
|
| 211 |
-
"cls.predictions.transform.LayerNorm.bias | UNEXPECTED | \n",
|
| 212 |
-
"cls.seq_relationship.weight | UNEXPECTED | \n",
|
| 213 |
-
"cls.predictions.transform.dense.weight | UNEXPECTED | \n",
|
| 214 |
-
"cls.seq_relationship.bias | UNEXPECTED | \n",
|
| 215 |
-
"cls.predictions.bias | UNEXPECTED | \n",
|
| 216 |
-
"cls.predictions.transform.dense.bias | UNEXPECTED | \n",
|
| 217 |
-
"cls.predictions.transform.LayerNorm.weight | UNEXPECTED | \n",
|
| 218 |
-
"classifier.weight | MISSING | \n",
|
| 219 |
-
"classifier.bias | MISSING | \n",
|
| 220 |
-
"\n",
|
| 221 |
-
"Notes:\n",
|
| 222 |
-
"- UNEXPECTED\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\n",
|
| 223 |
-
"- MISSING\t:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.\n"
|
| 224 |
-
]
|
| 225 |
-
},
|
| 226 |
-
{
|
| 227 |
-
"name": "stdout",
|
| 228 |
-
"output_type": "stream",
|
| 229 |
-
"text": [
|
| 230 |
-
"Model loaded: google-bert/bert-base-cased\n",
|
| 231 |
-
"Device: cuda\n"
|
| 232 |
-
]
|
| 233 |
-
}
|
| 234 |
-
],
|
| 235 |
-
"source": [
|
| 236 |
-
"# Use google-bert/bert-base-cased\n",
|
| 237 |
-
"base_model_name = \"google-bert/bert-base-cased\"\n",
|
| 238 |
-
"\n",
|
| 239 |
-
"tokenizer = AutoTokenizer.from_pretrained(base_model_name)\n",
|
| 240 |
-
"model = AutoModelForSequenceClassification.from_pretrained(\n",
|
| 241 |
-
" base_model_name,\n",
|
| 242 |
-
" num_labels=2,\n",
|
| 243 |
-
").to(device='cuda' if torch.cuda.is_available() else 'cpu')\n",
|
| 244 |
-
"\n",
|
| 245 |
-
"print(f\"Model loaded: {base_model_name}\")\n",
|
| 246 |
-
"print(f\"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}\")"
|
| 247 |
-
]
|
| 248 |
-
},
|
| 249 |
-
{
|
| 250 |
-
"cell_type": "markdown",
|
| 251 |
-
"id": "0a772192",
|
| 252 |
-
"metadata": {},
|
| 253 |
-
"source": [
|
| 254 |
-
"## Apply LoRA for Parameter-Efficient Fine-tuning"
|
| 255 |
-
]
|
| 256 |
-
},
|
| 257 |
-
{
|
| 258 |
-
"cell_type": "code",
|
| 259 |
-
"execution_count": 39,
|
| 260 |
-
"id": "ba294e50",
|
| 261 |
-
"metadata": {},
|
| 262 |
-
"outputs": [
|
| 263 |
-
{
|
| 264 |
-
"name": "stdout",
|
| 265 |
-
"output_type": "stream",
|
| 266 |
-
"text": [
|
| 267 |
-
"trainable params: 2,680,322 || all params: 110,992,132 || trainable%: 2.4149\n"
|
| 268 |
-
]
|
| 269 |
-
}
|
| 270 |
-
],
|
| 271 |
-
"source": [
|
| 272 |
-
"peft_config = LoraConfig(\n",
|
| 273 |
-
" r=16,\n",
|
| 274 |
-
" target_modules=\"all-linear\",\n",
|
| 275 |
-
" lora_alpha=16,\n",
|
| 276 |
-
" bias=\"none\",\n",
|
| 277 |
-
" lora_dropout=0.05,\n",
|
| 278 |
-
" use_rslora=True,\n",
|
| 279 |
-
" modules_to_save=[\"classifier\"],\n",
|
| 280 |
-
")\n",
|
| 281 |
-
"\n",
|
| 282 |
-
"model = get_peft_model(model, peft_config)\n",
|
| 283 |
-
"model.print_trainable_parameters()"
|
| 284 |
-
]
|
| 285 |
-
},
|
| 286 |
-
{
|
| 287 |
-
"cell_type": "markdown",
|
| 288 |
-
"id": "3cf58dd8",
|
| 289 |
-
"metadata": {},
|
| 290 |
-
"source": [
|
| 291 |
-
"## Preprocessing and Tokenization"
|
| 292 |
-
]
|
| 293 |
-
},
|
| 294 |
-
{
|
| 295 |
-
"cell_type": "code",
|
| 296 |
-
"execution_count": 40,
|
| 297 |
-
"id": "c7992ba4",
|
| 298 |
-
"metadata": {},
|
| 299 |
-
"outputs": [
|
| 300 |
-
{
|
| 301 |
-
"name": "stderr",
|
| 302 |
-
"output_type": "stream",
|
| 303 |
-
"text": [
|
| 304 |
-
"Map: 100%|██████████| 18943/18943 [00:01<00:00, 10132.04 examples/s]\n",
|
| 305 |
-
"Map: 100%|██████████| 997/997 [00:00<00:00, 11498.07 examples/s]"
|
| 306 |
-
]
|
| 307 |
-
},
|
| 308 |
-
{
|
| 309 |
-
"name": "stdout",
|
| 310 |
-
"output_type": "stream",
|
| 311 |
-
"text": [
|
| 312 |
-
"Tokenization complete!\n",
|
| 313 |
-
"Tensor columns: ['input_ids', 'attention_mask', 'token_type_ids', 'labels']\n"
|
| 314 |
-
]
|
| 315 |
-
},
|
| 316 |
-
{
|
| 317 |
-
"name": "stderr",
|
| 318 |
-
"output_type": "stream",
|
| 319 |
-
"text": [
|
| 320 |
-
"\n"
|
| 321 |
-
]
|
| 322 |
-
}
|
| 323 |
-
],
|
| 324 |
-
"source": [
|
| 325 |
-
"def _preprocess_function(\n",
|
| 326 |
-
" batch: dict,\n",
|
| 327 |
-
" tokenizer: PreTrainedTokenizer,\n",
|
| 328 |
-
" max_length: int = 512,\n",
|
| 329 |
-
") -> BatchEncoding:\n",
|
| 330 |
-
" model_inputs = tokenizer(\n",
|
| 331 |
-
" batch[\"text\"],\n",
|
| 332 |
-
" max_length=max_length,\n",
|
| 333 |
-
" truncation=True,\n",
|
| 334 |
-
" )\n",
|
| 335 |
-
" model_inputs[\"labels\"] = batch[\"label\"]\n",
|
| 336 |
-
" return model_inputs\n",
|
| 337 |
-
"\n",
|
| 338 |
-
"\n",
|
| 339 |
-
"preprocess_function = partial(_preprocess_function, tokenizer=tokenizer)\n",
|
| 340 |
-
"tokenized_datasets = raw_datasets.map(\n",
|
| 341 |
-
" preprocess_function,\n",
|
| 342 |
-
" batched=True,\n",
|
| 343 |
-
" remove_columns=[\"text\", \"label\"],\n",
|
| 344 |
-
")\n",
|
| 345 |
-
"\n",
|
| 346 |
-
"# Ensure PyTorch tensors and expected columns\n",
|
| 347 |
-
"available_columns = tokenized_datasets[\"train\"].column_names\n",
|
| 348 |
-
"tensor_columns = [\n",
|
| 349 |
-
" column_name\n",
|
| 350 |
-
" for column_name in [\"input_ids\", \"attention_mask\", \"token_type_ids\", \"labels\"]\n",
|
| 351 |
-
" if column_name in available_columns\n",
|
| 352 |
-
"]\n",
|
| 353 |
-
"tokenized_datasets.set_format(type=\"torch\", columns=tensor_columns)\n",
|
| 354 |
-
"\n",
|
| 355 |
-
"print(\"Tokenization complete!\")\n",
|
| 356 |
-
"print(\"Tensor columns:\", tensor_columns)"
|
| 357 |
-
]
|
| 358 |
-
},
|
| 359 |
-
{
|
| 360 |
-
"cell_type": "markdown",
|
| 361 |
-
"id": "31db700b",
|
| 362 |
-
"metadata": {},
|
| 363 |
-
"source": [
|
| 364 |
-
"## Define Metrics"
|
| 365 |
-
]
|
| 366 |
-
},
|
| 367 |
-
{
|
| 368 |
-
"cell_type": "code",
|
| 369 |
-
"execution_count": 41,
|
| 370 |
-
"id": "899e4408",
|
| 371 |
-
"metadata": {},
|
| 372 |
-
"outputs": [],
|
| 373 |
-
"source": [
|
| 374 |
-
"metric_accuracy = evaluate.load(\"accuracy\")\n",
|
| 375 |
-
"metric_f1 = evaluate.load(\"f1\")\n",
|
| 376 |
-
"\n",
|
| 377 |
-
"\n",
|
| 378 |
-
"def _compute_metrics(\n",
|
| 379 |
-
" eval_pred: tuple[np.ndarray, np.ndarray],\n",
|
| 380 |
-
" metric_accuracy: evaluate.EvaluationModule,\n",
|
| 381 |
-
" metric_f1: evaluate.EvaluationModule,\n",
|
| 382 |
-
") -> dict[str, float]:\n",
|
| 383 |
-
" predictions, labels = eval_pred\n",
|
| 384 |
-
"\n",
|
| 385 |
-
" if isinstance(predictions, tuple):\n",
|
| 386 |
-
" predictions = predictions[0]\n",
|
| 387 |
-
"\n",
|
| 388 |
-
" predictions = np.argmax(predictions, axis=1)\n",
|
| 389 |
-
"\n",
|
| 390 |
-
" accuracy = metric_accuracy.compute(predictions=predictions, references=labels)\n",
|
| 391 |
-
" f1 = metric_f1.compute(predictions=predictions, references=labels)\n",
|
| 392 |
-
"\n",
|
| 393 |
-
" assert accuracy is not None and f1 is not None\n",
|
| 394 |
-
"\n",
|
| 395 |
-
" result = {\n",
|
| 396 |
-
" \"accuracy\": accuracy[\"accuracy\"],\n",
|
| 397 |
-
" \"f1\": f1[\"f1\"],\n",
|
| 398 |
-
" }\n",
|
| 399 |
-
"\n",
|
| 400 |
-
" return result\n",
|
| 401 |
-
"\n",
|
| 402 |
-
"\n",
|
| 403 |
-
"compute_metrics = partial(\n",
|
| 404 |
-
" _compute_metrics, metric_accuracy=metric_accuracy, metric_f1=metric_f1\n",
|
| 405 |
-
")"
|
| 406 |
-
]
|
| 407 |
-
},
|
| 408 |
-
{
|
| 409 |
-
"cell_type": "markdown",
|
| 410 |
-
"id": "34890c4d",
|
| 411 |
-
"metadata": {},
|
| 412 |
-
"source": [
|
| 413 |
-
"## Training Configuration"
|
| 414 |
-
]
|
| 415 |
-
},
|
| 416 |
-
{
|
| 417 |
-
"cell_type": "code",
|
| 418 |
-
"execution_count": 42,
|
| 419 |
-
"id": "9717d666",
|
| 420 |
-
"metadata": {},
|
| 421 |
-
"outputs": [],
|
| 422 |
-
"source": [
|
| 423 |
-
"train_batch_size = 4\n",
|
| 424 |
-
"gradient_accumulation_steps = 8  # effective train batch size: 4 * 8 = 32\n",
|
| 425 |
-
"eval_batch_size = 4\n",
|
| 426 |
-
"\n",
|
| 427 |
-
"training_args = TrainingArguments(\n",
|
| 428 |
-
" \"./models/bert-base-raid-classifier\",\n",
|
| 429 |
-
" num_train_epochs=5,\n",
|
| 430 |
-
" learning_rate=5e-5,\n",
|
| 431 |
-
" weight_decay=0.1,\n",
|
| 432 |
-
" per_device_train_batch_size=train_batch_size,\n",
|
| 433 |
-
" per_device_eval_batch_size=eval_batch_size,\n",
|
| 434 |
-
" gradient_accumulation_steps=gradient_accumulation_steps,\n",
|
| 435 |
-
" fp16=torch.cuda.is_available(),\n",
|
| 436 |
-
" save_strategy=\"steps\",\n",
|
| 437 |
-
" save_total_limit=2,\n",
|
| 438 |
-
" save_steps=64,\n",
|
| 439 |
-
" metric_for_best_model=\"eval_accuracy\",\n",
|
| 440 |
-
" load_best_model_at_end=True,\n",
|
| 441 |
-
" eval_strategy=\"steps\",\n",
|
| 442 |
-
" eval_steps=64,\n",
|
| 443 |
-
" logging_strategy=\"steps\",\n",
|
| 444 |
-
" logging_steps=16,\n",
|
| 445 |
-
" remove_unused_columns=False,\n",
|
| 446 |
-
")\n",
|
| 447 |
-
"\n",
|
| 448 |
-
"data_collator = DataCollatorWithPadding(tokenizer=tokenizer)"
|
| 449 |
-
]
|
| 450 |
-
},
|
| 451 |
-
{
|
| 452 |
-
"cell_type": "markdown",
|
| 453 |
-
"id": "e840a954",
|
| 454 |
-
"metadata": {},
|
| 455 |
-
"source": [
|
| 456 |
-
"## Initialize Trainer and Train"
|
| 457 |
-
]
|
| 458 |
-
},
|
| 459 |
-
{
|
| 460 |
-
"cell_type": "code",
|
| 461 |
-
"execution_count": 43,
|
| 462 |
-
"id": "0fa3ed58",
|
| 463 |
-
"metadata": {},
|
| 464 |
-
"outputs": [
|
| 465 |
-
{
|
| 466 |
-
"name": "stdout",
|
| 467 |
-
"output_type": "stream",
|
| 468 |
-
"text": [
|
| 469 |
-
"Starting training...\n"
|
| 470 |
-
]
|
| 471 |
-
},
|
| 472 |
-
{
|
| 473 |
-
"data": {
|
| 474 |
-
"text/html": [
|
| 475 |
-
"\n",
|
| 476 |
-
" <div>\n",
|
| 477 |
-
" \n",
|
| 478 |
-
" <progress value='2960' max='2960' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
|
| 479 |
-
" [2960/2960 1:03:52, Epoch 5/5]\n",
|
| 480 |
-
" </div>\n",
|
| 481 |
-
" <table border=\"1\" class=\"dataframe\">\n",
|
| 482 |
-
" <thead>\n",
|
| 483 |
-
" <tr style=\"text-align: left;\">\n",
|
| 484 |
-
" <th>Step</th>\n",
|
| 485 |
-
" <th>Training Loss</th>\n",
|
| 486 |
-
" <th>Validation Loss</th>\n",
|
| 487 |
-
" <th>Accuracy</th>\n",
|
| 488 |
-
" <th>F1</th>\n",
|
| 489 |
-
" </tr>\n",
|
| 490 |
-
" </thead>\n",
|
| 491 |
-
" <tbody>\n",
|
| 492 |
-
" <tr>\n",
|
| 493 |
-
" <td>64</td>\n",
|
| 494 |
-
" <td>5.212345</td>\n",
|
| 495 |
-
" <td>0.625602</td>\n",
|
| 496 |
-
" <td>0.661986</td>\n",
|
| 497 |
-
" <td>0.634093</td>\n",
|
| 498 |
-
" </tr>\n",
|
| 499 |
-
" <tr>\n",
|
| 500 |
-
" <td>128</td>\n",
|
| 501 |
-
" <td>3.753965</td>\n",
|
| 502 |
-
" <td>0.458432</td>\n",
|
| 503 |
-
" <td>0.771314</td>\n",
|
| 504 |
-
" <td>0.809045</td>\n",
|
| 505 |
-
" </tr>\n",
|
| 506 |
-
" <tr>\n",
|
| 507 |
-
" <td>192</td>\n",
|
| 508 |
-
" <td>3.100017</td>\n",
|
| 509 |
-
" <td>0.287685</td>\n",
|
| 510 |
-
" <td>0.889669</td>\n",
|
| 511 |
-
" <td>0.891089</td>\n",
|
| 512 |
-
" </tr>\n",
|
| 513 |
-
" <tr>\n",
|
| 514 |
-
" <td>256</td>\n",
|
| 515 |
-
" <td>2.328572</td>\n",
|
| 516 |
-
" <td>0.390553</td>\n",
|
| 517 |
-
" <td>0.830491</td>\n",
|
| 518 |
-
" <td>0.855432</td>\n",
|
| 519 |
-
" </tr>\n",
|
| 520 |
-
" <tr>\n",
|
| 521 |
-
" <td>320</td>\n",
|
| 522 |
-
" <td>2.129814</td>\n",
|
| 523 |
-
" <td>0.238838</td>\n",
|
| 524 |
-
" <td>0.911735</td>\n",
|
| 525 |
-
" <td>0.917757</td>\n",
|
| 526 |
-
" </tr>\n",
|
| 527 |
-
" <tr>\n",
|
| 528 |
-
" <td>384</td>\n",
|
| 529 |
-
" <td>1.657923</td>\n",
|
| 530 |
-
" <td>0.388610</td>\n",
|
| 531 |
-
" <td>0.856570</td>\n",
|
| 532 |
-
" <td>0.874671</td>\n",
|
| 533 |
-
" </tr>\n",
|
| 534 |
-
" <tr>\n",
|
| 535 |
-
" <td>448</td>\n",
|
| 536 |
-
" <td>1.758504</td>\n",
|
| 537 |
-
" <td>0.179176</td>\n",
|
| 538 |
-
" <td>0.933801</td>\n",
|
| 539 |
-
" <td>0.937262</td>\n",
|
| 540 |
-
" </tr>\n",
|
| 541 |
-
" <tr>\n",
|
| 542 |
-
" <td>512</td>\n",
|
| 543 |
-
" <td>1.352967</td>\n",
|
| 544 |
-
" <td>0.344061</td>\n",
|
| 545 |
-
" <td>0.867603</td>\n",
|
| 546 |
-
" <td>0.882979</td>\n",
|
| 547 |
-
" </tr>\n",
|
| 548 |
-
" <tr>\n",
|
| 549 |
-
" <td>576</td>\n",
|
| 550 |
-
" <td>1.528169</td>\n",
|
| 551 |
-
" <td>0.143238</td>\n",
|
| 552 |
-
" <td>0.945838</td>\n",
|
| 553 |
-
" <td>0.947368</td>\n",
|
| 554 |
-
" </tr>\n",
|
| 555 |
-
" <tr>\n",
|
| 556 |
-
" <td>640</td>\n",
|
| 557 |
-
" <td>1.692302</td>\n",
|
| 558 |
-
" <td>0.185934</td>\n",
|
| 559 |
-
" <td>0.925777</td>\n",
|
| 560 |
-
" <td>0.930582</td>\n",
|
| 561 |
-
" </tr>\n",
|
| 562 |
-
" <tr>\n",
|
| 563 |
-
" <td>704</td>\n",
|
| 564 |
-
" <td>1.194244</td>\n",
|
| 565 |
-
" <td>0.189194</td>\n",
|
| 566 |
-
" <td>0.927783</td>\n",
|
| 567 |
-
" <td>0.932203</td>\n",
|
| 568 |
-
" </tr>\n",
|
| 569 |
-
" <tr>\n",
|
| 570 |
-
" <td>768</td>\n",
|
| 571 |
-
" <td>1.089103</td>\n",
|
| 572 |
-
" <td>0.191697</td>\n",
|
| 573 |
-
" <td>0.926780</td>\n",
|
| 574 |
-
" <td>0.931455</td>\n",
|
| 575 |
-
" </tr>\n",
|
| 576 |
-
" <tr>\n",
|
| 577 |
-
" <td>832</td>\n",
|
| 578 |
-
" <td>1.313780</td>\n",
|
| 579 |
-
" <td>0.133464</td>\n",
|
| 580 |
-
" <td>0.949850</td>\n",
|
| 581 |
-
" <td>0.950593</td>\n",
|
| 582 |
-
" </tr>\n",
|
| 583 |
-
" <tr>\n",
|
| 584 |
-
" <td>896</td>\n",
|
| 585 |
-
" <td>1.144064</td>\n",
|
| 586 |
-
" <td>0.161593</td>\n",
|
| 587 |
-
" <td>0.943831</td>\n",
|
| 588 |
-
" <td>0.946463</td>\n",
|
| 589 |
-
" </tr>\n",
|
| 590 |
-
" <tr>\n",
|
| 591 |
-
" <td>960</td>\n",
|
| 592 |
-
" <td>1.503407</td>\n",
|
| 593 |
-
" <td>0.211920</td>\n",
|
| 594 |
-
" <td>0.921765</td>\n",
|
| 595 |
-
" <td>0.927374</td>\n",
|
| 596 |
-
" </tr>\n",
|
| 597 |
-
" <tr>\n",
|
| 598 |
-
" <td>1024</td>\n",
|
| 599 |
-
" <td>1.106765</td>\n",
|
| 600 |
-
" <td>0.182482</td>\n",
|
| 601 |
-
" <td>0.931795</td>\n",
|
| 602 |
-
" <td>0.935606</td>\n",
|
| 603 |
-
" </tr>\n",
|
| 604 |
-
" <tr>\n",
|
| 605 |
-
" <td>1088</td>\n",
|
| 606 |
-
" <td>1.450451</td>\n",
|
| 607 |
-
" <td>0.127360</td>\n",
|
| 608 |
-
" <td>0.956871</td>\n",
|
| 609 |
-
" <td>0.958212</td>\n",
|
| 610 |
-
" </tr>\n",
|
| 611 |
-
" <tr>\n",
|
| 612 |
-
" <td>1152</td>\n",
|
| 613 |
-
" <td>1.380015</td>\n",
|
| 614 |
-
" <td>0.131538</td>\n",
|
| 615 |
-
" <td>0.957874</td>\n",
|
| 616 |
-
" <td>0.959064</td>\n",
|
| 617 |
-
" </tr>\n",
|
| 618 |
-
" <tr>\n",
|
| 619 |
-
" <td>1216</td>\n",
|
| 620 |
-
" <td>0.755666</td>\n",
|
| 621 |
-
" <td>0.158870</td>\n",
|
| 622 |
-
" <td>0.940822</td>\n",
|
| 623 |
-
" <td>0.943432</td>\n",
|
| 624 |
-
" </tr>\n",
|
| 625 |
-
" <tr>\n",
|
| 626 |
-
" <td>1280</td>\n",
|
| 627 |
-
" <td>0.863713</td>\n",
|
| 628 |
-
" <td>0.157785</td>\n",
|
| 629 |
-
" <td>0.943831</td>\n",
|
| 630 |
-
" <td>0.946565</td>\n",
|
| 631 |
-
" </tr>\n",
|
| 632 |
-
" <tr>\n",
|
| 633 |
-
" <td>1344</td>\n",
|
| 634 |
-
" <td>0.821364</td>\n",
|
| 635 |
-
" <td>0.172321</td>\n",
|
| 636 |
-
" <td>0.944835</td>\n",
|
| 637 |
-
" <td>0.947469</td>\n",
|
| 638 |
-
" </tr>\n",
|
| 639 |
-
" <tr>\n",
|
| 640 |
-
" <td>1408</td>\n",
|
| 641 |
-
" <td>0.957095</td>\n",
|
| 642 |
-
" <td>0.226298</td>\n",
|
| 643 |
-
" <td>0.922768</td>\n",
|
| 644 |
-
" <td>0.927835</td>\n",
|
| 645 |
-
" </tr>\n",
|
| 646 |
-
" <tr>\n",
|
| 647 |
-
" <td>1472</td>\n",
|
| 648 |
-
" <td>0.868089</td>\n",
|
| 649 |
-
" <td>0.197520</td>\n",
|
| 650 |
-
" <td>0.934804</td>\n",
|
| 651 |
-
" <td>0.938505</td>\n",
|
| 652 |
-
" </tr>\n",
|
| 653 |
-
" <tr>\n",
|
| 654 |
-
" <td>1536</td>\n",
|
| 655 |
-
" <td>1.310811</td>\n",
|
| 656 |
-
" <td>0.140865</td>\n",
|
| 657 |
-
" <td>0.953862</td>\n",
|
| 658 |
-
" <td>0.955426</td>\n",
|
| 659 |
-
" </tr>\n",
|
| 660 |
-
" <tr>\n",
|
| 661 |
-
" <td>1600</td>\n",
|
| 662 |
-
" <td>0.708888</td>\n",
|
| 663 |
-
" <td>0.152195</td>\n",
|
| 664 |
-
" <td>0.943831</td>\n",
|
| 665 |
-
" <td>0.946565</td>\n",
|
| 666 |
-
" </tr>\n",
|
| 667 |
-
" <tr>\n",
|
| 668 |
-
" <td>1664</td>\n",
|
| 669 |
-
" <td>0.717255</td>\n",
|
| 670 |
-
" <td>0.176768</td>\n",
|
| 671 |
-
" <td>0.942828</td>\n",
|
| 672 |
-
" <td>0.945663</td>\n",
|
| 673 |
-
" </tr>\n",
|
| 674 |
-
" <tr>\n",
|
| 675 |
-
" <td>1728</td>\n",
|
| 676 |
-
" <td>1.143681</td>\n",
|
| 677 |
-
" <td>0.156816</td>\n",
|
| 678 |
-
" <td>0.951856</td>\n",
|
| 679 |
-
" <td>0.953757</td>\n",
|
| 680 |
-
" </tr>\n",
|
| 681 |
-
" <tr>\n",
|
| 682 |
-
" <td>1792</td>\n",
|
| 683 |
-
" <td>0.638254</td>\n",
|
| 684 |
-
" <td>0.176596</td>\n",
|
| 685 |
-
" <td>0.944835</td>\n",
|
| 686 |
-
" <td>0.947469</td>\n",
|
| 687 |
-
" </tr>\n",
|
| 688 |
-
" <tr>\n",
|
| 689 |
-
" <td>1856</td>\n",
|
| 690 |
-
" <td>1.133300</td>\n",
|
| 691 |
-
" <td>0.119119</td>\n",
|
| 692 |
-
" <td>0.967904</td>\n",
|
| 693 |
-
" <td>0.968317</td>\n",
|
| 694 |
-
" </tr>\n",
|
| 695 |
-
" <tr>\n",
|
| 696 |
-
" <td>1920</td>\n",
|
| 697 |
-
" <td>1.061837</td>\n",
|
| 698 |
-
" <td>0.140624</td>\n",
|
| 699 |
-
" <td>0.957874</td>\n",
|
| 700 |
-
" <td>0.959381</td>\n",
|
| 701 |
-
" </tr>\n",
|
| 702 |
-
" <tr>\n",
|
| 703 |
-
" <td>1984</td>\n",
|
| 704 |
-
" <td>0.708067</td>\n",
|
| 705 |
-
" <td>0.189490</td>\n",
|
| 706 |
-
" <td>0.940822</td>\n",
|
| 707 |
-
" <td>0.943863</td>\n",
|
| 708 |
-
" </tr>\n",
|
| 709 |
-
" <tr>\n",
|
| 710 |
-
" <td>2048</td>\n",
|
| 711 |
-
" <td>0.761451</td>\n",
|
| 712 |
-
" <td>0.150488</td>\n",
|
| 713 |
-
" <td>0.951856</td>\n",
|
| 714 |
-
" <td>0.953846</td>\n",
|
| 715 |
-
" </tr>\n",
|
| 716 |
-
" <tr>\n",
|
| 717 |
-
" <td>2112</td>\n",
|
| 718 |
-
" <td>0.609547</td>\n",
|
| 719 |
-
" <td>0.189622</td>\n",
|
| 720 |
-
" <td>0.940822</td>\n",
|
| 721 |
-
" <td>0.943863</td>\n",
|
| 722 |
-
" </tr>\n",
|
| 723 |
-
" <tr>\n",
|
| 724 |
-
" <td>2176</td>\n",
|
| 725 |
-
" <td>0.803254</td>\n",
|
| 726 |
-
" <td>0.173354</td>\n",
|
| 727 |
-
" <td>0.946841</td>\n",
|
| 728 |
-
" <td>0.949282</td>\n",
|
| 729 |
-
" </tr>\n",
|
| 730 |
-
" <tr>\n",
|
| 731 |
-
" <td>2240</td>\n",
|
| 732 |
-
" <td>0.664540</td>\n",
|
| 733 |
-
" <td>0.154308</td>\n",
|
| 734 |
-
" <td>0.952859</td>\n",
|
| 735 |
-
" <td>0.954764</td>\n",
|
| 736 |
-
" </tr>\n",
|
| 737 |
-
" <tr>\n",
|
| 738 |
-
" <td>2304</td>\n",
|
| 739 |
-
" <td>0.691763</td>\n",
|
| 740 |
-
" <td>0.144127</td>\n",
|
| 741 |
-
" <td>0.963892</td>\n",
|
| 742 |
-
" <td>0.964706</td>\n",
|
| 743 |
-
" </tr>\n",
|
| 744 |
-
" <tr>\n",
|
| 745 |
-
" <td>2368</td>\n",
|
| 746 |
-
" <td>1.092195</td>\n",
|
| 747 |
-
" <td>0.157182</td>\n",
|
| 748 |
-
" <td>0.957874</td>\n",
|
| 749 |
-
" <td>0.959381</td>\n",
|
| 750 |
-
" </tr>\n",
|
| 751 |
-
" <tr>\n",
|
| 752 |
-
" <td>2432</td>\n",
|
| 753 |
-
" <td>0.752286</td>\n",
|
| 754 |
-
" <td>0.231035</td>\n",
|
| 755 |
-
" <td>0.933801</td>\n",
|
| 756 |
-
" <td>0.937736</td>\n",
|
| 757 |
-
" </tr>\n",
|
| 758 |
-
" <tr>\n",
|
| 759 |
-
" <td>2496</td>\n",
|
| 760 |
-
" <td>0.757014</td>\n",
|
| 761 |
-
" <td>0.185019</td>\n",
|
| 762 |
-
" <td>0.948847</td>\n",
|
| 763 |
-
" <td>0.951103</td>\n",
|
| 764 |
-
" </tr>\n",
|
| 765 |
-
" <tr>\n",
|
| 766 |
-
" <td>2560</td>\n",
|
| 767 |
-
" <td>0.766771</td>\n",
|
| 768 |
-
" <td>0.153019</td>\n",
|
| 769 |
-
" <td>0.958877</td>\n",
|
| 770 |
-
" <td>0.960078</td>\n",
|
| 771 |
-
" </tr>\n",
|
| 772 |
-
" <tr>\n",
|
| 773 |
-
" <td>2624</td>\n",
|
| 774 |
-
" <td>0.434590</td>\n",
|
| 775 |
-
" <td>0.201383</td>\n",
|
| 776 |
-
" <td>0.946841</td>\n",
|
| 777 |
-
" <td>0.949282</td>\n",
|
| 778 |
-
" </tr>\n",
|
| 779 |
-
" <tr>\n",
|
| 780 |
-
" <td>2688</td>\n",
|
| 781 |
-
" <td>0.565482</td>\n",
|
| 782 |
-
" <td>0.181478</td>\n",
|
| 783 |
-
" <td>0.952859</td>\n",
|
| 784 |
-
" <td>0.954764</td>\n",
|
| 785 |
-
" </tr>\n",
|
| 786 |
-
" <tr>\n",
|
| 787 |
-
" <td>2752</td>\n",
|
| 788 |
-
" <td>0.568177</td>\n",
|
| 789 |
-
" <td>0.201250</td>\n",
|
| 790 |
-
" <td>0.946841</td>\n",
|
| 791 |
-
" <td>0.949282</td>\n",
|
| 792 |
-
" </tr>\n",
|
| 793 |
-
" <tr>\n",
|
| 794 |
-
" <td>2816</td>\n",
|
| 795 |
-
" <td>0.611295</td>\n",
|
| 796 |
-
" <td>0.173839</td>\n",
|
| 797 |
-
" <td>0.954865</td>\n",
|
| 798 |
-
" <td>0.956606</td>\n",
|
| 799 |
-
" </tr>\n",
|
| 800 |
-
" <tr>\n",
|
| 801 |
-
" <td>2880</td>\n",
|
| 802 |
-
" <td>0.716351</td>\n",
|
| 803 |
-
" <td>0.187448</td>\n",
|
| 804 |
-
" <td>0.948847</td>\n",
|
| 805 |
-
" <td>0.951103</td>\n",
|
| 806 |
-
" </tr>\n",
|
| 807 |
-
" <tr>\n",
|
| 808 |
-
" <td>2944</td>\n",
|
| 809 |
-
" <td>0.603852</td>\n",
|
| 810 |
-
" <td>0.184578</td>\n",
|
| 811 |
-
" <td>0.948847</td>\n",
|
| 812 |
-
" <td>0.951103</td>\n",
|
| 813 |
-
" </tr>\n",
|
| 814 |
-
" </tbody>\n",
|
| 815 |
-
"</table><p>"
|
| 816 |
-
],
|
| 817 |
-
"text/plain": [
|
| 818 |
-
"<IPython.core.display.HTML object>"
|
| 819 |
-
]
|
| 820 |
-
},
|
| 821 |
-
"metadata": {},
|
| 822 |
-
"output_type": "display_data"
|
| 823 |
-
},
|
| 824 |
-
{
|
| 825 |
-
"data": {
|
| 826 |
-
"text/plain": [
|
| 827 |
-
"TrainOutput(global_step=2960, training_loss=1.3125710455146995, metrics={'train_runtime': 3832.8474, 'train_samples_per_second': 24.711, 'train_steps_per_second': 0.772, 'total_flos': 8360830141838376.0, 'train_loss': 1.3125710455146995, 'epoch': 5.0})"
|
| 828 |
-
]
|
| 829 |
-
},
|
| 830 |
-
"execution_count": 43,
|
| 831 |
-
"metadata": {},
|
| 832 |
-
"output_type": "execute_result"
|
| 833 |
-
}
|
| 834 |
-
],
|
| 835 |
-
"source": [
|
| 836 |
-
"trainer = Trainer(\n",
|
| 837 |
-
" model,\n",
|
| 838 |
-
" training_args,\n",
|
| 839 |
-
" train_dataset=tokenized_datasets[\"train\"],\n",
|
| 840 |
-
" eval_dataset=tokenized_datasets[\"test\"],\n",
|
| 841 |
-
" data_collator=data_collator,\n",
|
| 842 |
-
" compute_metrics=compute_metrics,\n",
|
| 843 |
-
")\n",
|
| 844 |
-
"\n",
|
| 845 |
-
"print(\"Starting training...\")\n",
|
| 846 |
-
"trainer.train()"
|
| 847 |
-
]
|
| 848 |
-
},
|
| 849 |
-
{
|
| 850 |
-
"cell_type": "markdown",
|
| 851 |
-
"id": "cde9bbb1",
|
| 852 |
-
"metadata": {},
|
| 853 |
-
"source": [
|
| 854 |
-
"## Final Evaluation"
|
| 855 |
-
]
|
| 856 |
-
},
|
| 857 |
-
{
|
| 858 |
-
"cell_type": "code",
|
| 859 |
-
"execution_count": 44,
|
| 860 |
-
"id": "bb81afb9",
|
| 861 |
-
"metadata": {},
|
| 862 |
-
"outputs": [
|
| 863 |
-
{
|
| 864 |
-
"name": "stdout",
|
| 865 |
-
"output_type": "stream",
|
| 866 |
-
"text": [
|
| 867 |
-
"Evaluating model...\n"
|
| 868 |
-
]
|
| 869 |
-
},
|
| 870 |
-
{
|
| 871 |
-
"data": {
|
| 872 |
-
"text/html": [
|
| 873 |
-
"\n",
|
| 874 |
-
" <div>\n",
|
| 875 |
-
" \n",
|
| 876 |
-
" <progress value='250' max='250' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
|
| 877 |
-
" [250/250 00:14]\n",
|
| 878 |
-
" </div>\n",
|
| 879 |
-
" "
|
| 880 |
-
],
|
| 881 |
-
"text/plain": [
|
| 882 |
-
"<IPython.core.display.HTML object>"
|
| 883 |
-
]
|
| 884 |
-
},
|
| 885 |
-
"metadata": {},
|
| 886 |
-
"output_type": "display_data"
|
| 887 |
-
},
|
| 888 |
-
{
|
| 889 |
-
"name": "stdout",
|
| 890 |
-
"output_type": "stream",
|
| 891 |
-
"text": [
|
| 892 |
-
"\n",
|
| 893 |
-
"Final Results:\n",
|
| 894 |
-
"Accuracy: 0.9679\n",
|
| 895 |
-
"F1 Score: 0.9683\n"
|
| 896 |
-
]
|
| 897 |
-
}
|
| 898 |
-
],
|
| 899 |
-
"source": [
|
| 900 |
-
"print(\"Evaluating model...\")\n",
|
| 901 |
-
"eval_results = trainer.evaluate()\n",
|
| 902 |
-
"print(\"\\nFinal Results:\")\n",
|
| 903 |
-
"print(f\"Accuracy: {eval_results['eval_accuracy']:.4f}\")\n",
|
| 904 |
-
"print(f\"F1 Score: {eval_results['eval_f1']:.4f}\")"
|
| 905 |
-
]
|
| 906 |
-
},
|
| 907 |
-
{
|
| 908 |
-
"cell_type": "markdown",
|
| 909 |
-
"id": "8bf17a40",
|
| 910 |
-
"metadata": {},
|
| 911 |
-
"source": [
|
| 912 |
-
"## Save Model"
|
| 913 |
-
]
|
| 914 |
-
},
|
| 915 |
-
{
|
| 916 |
-
"cell_type": "code",
|
| 917 |
-
"execution_count": 45,
|
| 918 |
-
"id": "e580bfd6",
|
| 919 |
-
"metadata": {},
|
| 920 |
-
"outputs": [
|
| 921 |
-
{
|
| 922 |
-
"name": "stdout",
|
| 923 |
-
"output_type": "stream",
|
| 924 |
-
"text": [
|
| 925 |
-
"Model saved successfully!\n"
|
| 926 |
-
]
|
| 927 |
-
}
|
| 928 |
-
],
|
| 929 |
-
"source": [
|
| 930 |
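-
"# Save the final model; with a PEFT-wrapped model this writes the LoRA adapter and the classifier head listed in modules_to_save\n",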
|
| 931 |
-
"trainer.save_model(\"./trained_model/bert-base-raid-final\")\n",
|
| 932 |
-
"print(\"Model saved successfully!\")"
|
| 933 |
-
]
|
| 934 |
-
},
|
| 935 |
-
{
|
| 936 |
-
"cell_type": "markdown",
|
| 937 |
-
"id": "99c0a2f0",
|
| 938 |
-
"metadata": {},
|
| 939 |
-
"source": [
|
| 940 |
-
"## Test the Model\n"
|
| 941 |
-
]
|
| 942 |
-
},
|
| 943 |
-
{
|
| 944 |
-
"cell_type": "code",
|
| 945 |
-
"execution_count": 46,
|
| 946 |
-
"id": "016cc53e",
|
| 947 |
-
"metadata": {},
|
| 948 |
-
"outputs": [
|
| 949 |
-
{
|
| 950 |
-
"name": "stdout",
|
| 951 |
-
"output_type": "stream",
|
| 952 |
-
"text": [
|
| 953 |
-
"Prediction for human-written text:\n",
|
| 954 |
-
"{'predicted_label': 0, 'probability_human': 0.9988395571708679, 'probability_ai': 0.0011604195460677147}\n",
|
| 955 |
-
"\n",
|
| 956 |
-
"Prediction for AI-generated text:\n",
|
| 957 |
-
"{'predicted_label': 0, 'probability_human': 0.9988927245140076, 'probability_ai': 0.0011073390487581491}\n"
|
| 958 |
-
]
|
| 959 |
-
}
|
| 960 |
-
],
|
| 961 |
-
"source": [
|
| 962 |
-
"def predict(text: str) -> dict[str, int | float]:\n",
|
| 963 |
-
" inputs = tokenizer(\n",
|
| 964 |
-
" text,\n",
|
| 965 |
-
" max_length=512,\n",
|
| 966 |
-
" truncation=True,\n",
|
| 967 |
-
" return_tensors=\"pt\",\n",
|
| 968 |
-
" ).to(model.device)\n",
|
| 969 |
-
"\n",
|
| 970 |
-
" with torch.no_grad():\n",
|
| 971 |
-
" outputs = model(**inputs)\n",
|
| 972 |
-
" logits = outputs.logits\n",
|
| 973 |
-
" probabilities = torch.softmax(logits, dim=-1).cpu().numpy()[0]\n",
|
| 974 |
-
" predicted_label = np.argmax(probabilities)\n",
|
| 975 |
-
"\n",
|
| 976 |
-
" return {\n",
|
| 977 |
-
" \"predicted_label\": int(predicted_label),\n",
|
| 978 |
-
" \"probability_human\": float(probabilities[0]),\n",
|
| 979 |
-
" \"probability_ai\": float(probabilities[1]),\n",
|
| 980 |
-
" }\n",
|
| 981 |
-
"\n",
|
| 982 |
-
"text = \"Ai will replace this world. today in the nepal election someone might win by using ai.\"\n",
|
| 983 |
-
"text_by_ai = \"This is a sample text generated by AI.Also This is an long text by AI.\"\n",
|
| 984 |
-
"print(\"Prediction for human-written text:\")\n",
|
| 985 |
-
"print(predict(text))\n",
|
| 986 |
-
"print(\"\\nPrediction for AI-generated text:\")\n",
|
| 987 |
-
"print(predict(text_by_ai))\n"
|
| 988 |
-
]
|
| 989 |
-
},
|
| 990 |
-
{
|
| 991 |
-
"cell_type": "markdown",
|
| 992 |
-
"id": "7c6c2a5d",
|
| 993 |
-
"metadata": {},
|
| 994 |
-
"source": [
|
| 995 |
-
"## RAID Quick Test"
|
| 996 |
-
]
|
| 997 |
-
},
|
| 998 |
-
{
|
| 999 |
-
"cell_type": "code",
|
| 1000 |
-
"execution_count": 47,
|
| 1001 |
-
"id": "1b287605",
|
| 1002 |
-
"metadata": {},
|
| 1003 |
-
"outputs": [
|
| 1004 |
-
{
|
| 1005 |
-
"name": "stdout",
|
| 1006 |
-
"output_type": "stream",
|
| 1007 |
-
"text": [
|
| 1008 |
-
"Using 512 samples for RAID quick test\n"
|
| 1009 |
-
]
|
| 1010 |
-
},
|
| 1011 |
-
{
|
| 1012 |
-
"ename": "OutOfMemoryError",
|
| 1013 |
-
"evalue": "CUDA out of memory. Tried to allocate 768.00 MiB. GPU 0 has a total capacity of 3.68 GiB of which 719.12 MiB is free. Process 2034 has 46.03 MiB memory in use. Process 1961 has 6.78 MiB memory in use. Including non-PyTorch memory, this process has 2.90 GiB memory in use. Of the allocated memory 2.71 GiB is allocated by PyTorch, and 85.13 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)",
|
| 1014 |
-
"output_type": "error",
|
| 1015 |
-
"traceback": [
|
| 1016 |
-
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
|
| 1017 |
-
"\u001b[31mOutOfMemoryError\u001b[39m Traceback (most recent call last)",
|
| 1018 |
-
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[47]\u001b[39m\u001b[32m, line 32\u001b[39m\n\u001b[32m 28\u001b[39m \u001b[38;5;66;03m# Return AI-class probability for each input text\u001b[39;00m\n\u001b[32m 29\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m probabilities[:, \u001b[32m1\u001b[39m].astype(\u001b[38;5;28mfloat\u001b[39m).tolist()\n\u001b[32m---> \u001b[39m\u001b[32m32\u001b[39m predictions = \u001b[43mrun_detection\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmy_detector\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtest_df\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 33\u001b[39m evaluation_result = run_evaluation(predictions, test_df)\n\u001b[32m 35\u001b[39m evaluation_result\n",
|
| 1019 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/raid/detect.py:6\u001b[39m, in \u001b[36mrun_detection\u001b[39m\u001b[34m(f, df)\u001b[39m\n\u001b[32m 3\u001b[39m scores_df = df[[\u001b[33m\"\u001b[39m\u001b[33mid\u001b[39m\u001b[33m\"\u001b[39m]].copy()\n\u001b[32m 5\u001b[39m \u001b[38;5;66;03m# Run the detector function on the dataset and put output in score column\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m6\u001b[39m scores_df[\u001b[33m\"\u001b[39m\u001b[33mscore\u001b[39m\u001b[33m\"\u001b[39m] = \u001b[43mf\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdf\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mgeneration\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m.\u001b[49m\u001b[43mtolist\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 8\u001b[39m \u001b[38;5;66;03m# Convert scores and ids to dict in 'records' format for seralization\u001b[39;00m\n\u001b[32m 9\u001b[39m \u001b[38;5;66;03m# e.g. [{'id':'...', 'score':0}, {'id':'...', 'score':1}, ...]\u001b[39;00m\n\u001b[32m 10\u001b[39m results = scores_df[[\u001b[33m\"\u001b[39m\u001b[33mid\u001b[39m\u001b[33m\"\u001b[39m, \u001b[33m\"\u001b[39m\u001b[33mscore\u001b[39m\u001b[33m\"\u001b[39m]].to_dict(orient=\u001b[33m\"\u001b[39m\u001b[33mrecords\u001b[39m\u001b[33m\"\u001b[39m)\n",
|
| 1020 |
-
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[47]\u001b[39m\u001b[32m, line 24\u001b[39m, in \u001b[36mmy_detector\u001b[39m\u001b[34m(texts)\u001b[39m\n\u001b[32m 22\u001b[39m model.eval()\n\u001b[32m 23\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m torch.no_grad():\n\u001b[32m---> \u001b[39m\u001b[32m24\u001b[39m outputs = \u001b[43mmodel\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43minputs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 25\u001b[39m logits = outputs.logits\n\u001b[32m 26\u001b[39m probabilities = torch.softmax(logits, dim=-\u001b[32m1\u001b[39m).cpu().numpy()\n",
|
| 1021 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 1022 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
|
| 1023 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/accelerate/utils/operations.py:819\u001b[39m, in \u001b[36mconvert_outputs_to_fp32.<locals>.forward\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 818\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mforward\u001b[39m(*args, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m819\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mmodel_forward\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 1024 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/accelerate/utils/operations.py:807\u001b[39m, in \u001b[36mConvertOutputsToFp32.__call__\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 806\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m__call__\u001b[39m(\u001b[38;5;28mself\u001b[39m, *args, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m807\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m convert_to_fp32(\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmodel_forward\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m)\n",
|
| 1025 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/amp/autocast_mode.py:44\u001b[39m, in \u001b[36mautocast_decorator.<locals>.decorate_autocast\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 41\u001b[39m \u001b[38;5;129m@functools\u001b[39m.wraps(func)\n\u001b[32m 42\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mdecorate_autocast\u001b[39m(*args, **kwargs):\n\u001b[32m 43\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m autocast_instance:\n\u001b[32m---> \u001b[39m\u001b[32m44\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 1026 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/peft/peft_model.py:921\u001b[39m, in \u001b[36mPeftModel.forward\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 919\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mself\u001b[39m._enable_peft_forward_hooks(*args, **kwargs):\n\u001b[32m 920\u001b[39m kwargs = {k: v \u001b[38;5;28;01mfor\u001b[39;00m k, v \u001b[38;5;129;01min\u001b[39;00m kwargs.items() \u001b[38;5;28;01mif\u001b[39;00m k \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.special_peft_forward_args}\n\u001b[32m--> \u001b[39m\u001b[32m921\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mget_base_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 1027 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 1028 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
|
| 1029 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/generic.py:835\u001b[39m, in \u001b[36mcan_return_tuple.<locals>.wrapper\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 833\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m return_dict_passed \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 834\u001b[39m return_dict = return_dict_passed\n\u001b[32m--> \u001b[39m\u001b[32m835\u001b[39m output = \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 836\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m return_dict \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(output, \u001b[38;5;28mtuple\u001b[39m):\n\u001b[32m 837\u001b[39m output = output.to_tuple()\n",
|
| 1030 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py:1162\u001b[39m, in \u001b[36mBertForSequenceClassification.forward\u001b[39m\u001b[34m(self, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds, labels, **kwargs)\u001b[39m\n\u001b[32m 1144\u001b[39m \u001b[38;5;129m@can_return_tuple\u001b[39m\n\u001b[32m 1145\u001b[39m \u001b[38;5;129m@auto_docstring\u001b[39m\n\u001b[32m 1146\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mforward\u001b[39m(\n\u001b[32m (...)\u001b[39m\u001b[32m 1154\u001b[39m **kwargs: Unpack[TransformersKwargs],\n\u001b[32m 1155\u001b[39m ) -> \u001b[38;5;28mtuple\u001b[39m[torch.Tensor] | SequenceClassifierOutput:\n\u001b[32m 1156\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33mr\u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 1157\u001b[39m \u001b[33;03m labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):\u001b[39;00m\n\u001b[32m 1158\u001b[39m \u001b[33;03m Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,\u001b[39;00m\n\u001b[32m 1159\u001b[39m \u001b[33;03m config.num_labels - 1]`. 
If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If\u001b[39;00m\n\u001b[32m 1160\u001b[39m \u001b[33;03m `config.num_labels > 1` a classification loss is computed (Cross-Entropy).\u001b[39;00m\n\u001b[32m 1161\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1162\u001b[39m outputs = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mbert\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 1163\u001b[39m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1164\u001b[39m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m=\u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1165\u001b[39m \u001b[43m \u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1166\u001b[39m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1167\u001b[39m \u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m=\u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1168\u001b[39m \u001b[43m \u001b[49m\u001b[43mreturn_dict\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 1169\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1170\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1172\u001b[39m pooled_output = outputs[\u001b[32m1\u001b[39m]\n\u001b[32m 1174\u001b[39m pooled_output = \u001b[38;5;28mself\u001b[39m.dropout(pooled_output)\n",
|
| 1031 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 1032 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
|
| 1033 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/generic.py:1002\u001b[39m, in \u001b[36mcheck_model_inputs.<locals>.wrapped_fn.<locals>.wrapper\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1000\u001b[39m outputs = func(\u001b[38;5;28mself\u001b[39m, *args, **kwargs)\n\u001b[32m 1001\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1002\u001b[39m outputs = \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1003\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m original_exception:\n\u001b[32m 1004\u001b[39m \u001b[38;5;66;03m# If we get a TypeError, it's possible that the model is not receiving the recordable kwargs correctly.\u001b[39;00m\n\u001b[32m 1005\u001b[39m \u001b[38;5;66;03m# Get a TypeError even after removing the recordable kwargs -> re-raise the original exception\u001b[39;00m\n\u001b[32m 1006\u001b[39m \u001b[38;5;66;03m# Otherwise -> we're probably missing `**kwargs` in the decorated function\u001b[39;00m\n\u001b[32m 1007\u001b[39m kwargs_without_recordable = {k: v \u001b[38;5;28;01mfor\u001b[39;00m k, v \u001b[38;5;129;01min\u001b[39;00m kwargs.items() \u001b[38;5;28;01mif\u001b[39;00m k \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m recordable_keys}\n",
|
| 1034 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py:679\u001b[39m, in \u001b[36mBertModel.forward\u001b[39m\u001b[34m(self, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, cache_position, **kwargs)\u001b[39m\n\u001b[32m 676\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m cache_position \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 677\u001b[39m cache_position = torch.arange(past_key_values_length, past_key_values_length + seq_length, device=device)\n\u001b[32m--> \u001b[39m\u001b[32m679\u001b[39m embedding_output = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43membeddings\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 680\u001b[39m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 681\u001b[39m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 682\u001b[39m \u001b[43m \u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 683\u001b[39m \u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m=\u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 684\u001b[39m \u001b[43m \u001b[49m\u001b[43mpast_key_values_length\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpast_key_values_length\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 685\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 687\u001b[39m attention_mask, encoder_attention_mask = \u001b[38;5;28mself\u001b[39m._create_attention_masks(\n\u001b[32m 688\u001b[39m attention_mask=attention_mask,\n\u001b[32m 689\u001b[39m encoder_attention_mask=encoder_attention_mask,\n\u001b[32m (...)\u001b[39m\u001b[32m 693\u001b[39m 
past_key_values=past_key_values,\n\u001b[32m 694\u001b[39m )\n\u001b[32m 696\u001b[39m encoder_outputs = \u001b[38;5;28mself\u001b[39m.encoder(\n\u001b[32m 697\u001b[39m embedding_output,\n\u001b[32m 698\u001b[39m attention_mask=attention_mask,\n\u001b[32m (...)\u001b[39m\u001b[32m 705\u001b[39m **kwargs,\n\u001b[32m 706\u001b[39m )\n",
|
| 1035 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 1036 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
|
| 1037 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py:107\u001b[39m, in \u001b[36mBertEmbeddings.forward\u001b[39m\u001b[34m(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)\u001b[39m\n\u001b[32m 104\u001b[39m embeddings = inputs_embeds + token_type_embeddings\n\u001b[32m 106\u001b[39m position_embeddings = \u001b[38;5;28mself\u001b[39m.position_embeddings(position_ids)\n\u001b[32m--> \u001b[39m\u001b[32m107\u001b[39m embeddings = \u001b[43membeddings\u001b[49m\u001b[43m \u001b[49m\u001b[43m+\u001b[49m\u001b[43m \u001b[49m\u001b[43mposition_embeddings\u001b[49m\n\u001b[32m 109\u001b[39m embeddings = \u001b[38;5;28mself\u001b[39m.LayerNorm(embeddings)\n\u001b[32m 110\u001b[39m embeddings = \u001b[38;5;28mself\u001b[39m.dropout(embeddings)\n",
|
| 1038 |
-
"\u001b[31mOutOfMemoryError\u001b[39m: CUDA out of memory. Tried to allocate 768.00 MiB. GPU 0 has a total capacity of 3.68 GiB of which 719.12 MiB is free. Process 2034 has 46.03 MiB memory in use. Process 1961 has 6.78 MiB memory in use. Including non-PyTorch memory, this process has 2.90 GiB memory in use. Of the allocated memory 2.71 GiB is allocated by PyTorch, and 85.13 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"
|
| 1039 |
-
]
|
| 1040 |
-
}
|
| 1041 |
-
],
|
| 1042 |
-
"source": [
|
| 1043 |
-
"from raid import run_detection, run_evaluation\n",
|
| 1044 |
-
"from raid.utils import load_data\n",
|
| 1045 |
-
"\n",
|
| 1046 |
-
"# Use test split and cap sample size for a quick RAID validation\n",
|
| 1047 |
-
"test_df = load_data(split=\"test\")\n",
|
| 1048 |
-
"sample_size = min(int(len(test_df) * 0.02), 512)\n",
|
| 1049 |
-
"test_df = test_df.sample(n=sample_size, random_state=42)\n",
|
| 1050 |
-
"\n",
|
| 1051 |
-
"print(f\"Using {len(test_df)} samples for RAID quick test\")\n",
|
| 1052 |
-
"\n",
|
| 1053 |
-
"\n",
|
| 1054 |
-
"def my_detector(texts: list[str]) -> list[float]:\n",
|
| 1055 |
-
" # RAID passes a batch/list of strings and expects a list of scores\n",
|
| 1056 |
-
" inputs = tokenizer(\n",
|
| 1057 |
-
" texts,\n",
|
| 1058 |
-
" max_length=512,\n",
|
| 1059 |
-
" truncation=True,\n",
|
| 1060 |
-
" padding=True,\n",
|
| 1061 |
-
" return_tensors=\"pt\",\n",
|
| 1062 |
-
" ).to(model.device)\n",
|
| 1063 |
-
"\n",
|
| 1064 |
-
" model.eval()\n",
|
| 1065 |
-
" with torch.no_grad():\n",
|
| 1066 |
-
" outputs = model(**inputs)\n",
|
| 1067 |
-
" logits = outputs.logits\n",
|
| 1068 |
-
" probabilities = torch.softmax(logits, dim=-1).cpu().numpy()\n",
|
| 1069 |
-
"\n",
|
| 1070 |
-
" # Return AI-class probability for each input text\n",
|
| 1071 |
-
" return probabilities[:, 1].astype(float).tolist()\n",
|
| 1072 |
-
"\n",
|
| 1073 |
-
"\n",
|
| 1074 |
-
"predictions = run_detection(my_detector, test_df)\n",
|
| 1075 |
-
"evaluation_result = run_evaluation(predictions, test_df)\n",
|
| 1076 |
-
"\n",
|
| 1077 |
-
"evaluation_result"
|
| 1078 |
-
]
|
| 1079 |
-
},
|
| 1080 |
-
{
|
| 1081 |
-
"cell_type": "code",
|
| 1082 |
-
"execution_count": null,
|
| 1083 |
-
"id": "6b6eb543",
|
| 1084 |
-
"metadata": {},
|
| 1085 |
-
"outputs": [],
|
| 1086 |
-
"source": []
|
| 1087 |
-
}
|
| 1088 |
-
],
|
| 1089 |
-
"metadata": {
|
| 1090 |
-
"kernelspec": {
|
| 1091 |
-
"display_name": "ml",
|
| 1092 |
-
"language": "python",
|
| 1093 |
-
"name": "python3"
|
| 1094 |
-
},
|
| 1095 |
-
"language_info": {
|
| 1096 |
-
"codemirror_mode": {
|
| 1097 |
-
"name": "ipython",
|
| 1098 |
-
"version": 3
|
| 1099 |
-
},
|
| 1100 |
-
"file_extension": ".py",
|
| 1101 |
-
"mimetype": "text/x-python",
|
| 1102 |
-
"name": "python",
|
| 1103 |
-
"nbconvert_exporter": "python",
|
| 1104 |
-
"pygments_lexer": "ipython3",
|
| 1105 |
-
"version": "3.11.6"
|
| 1106 |
-
}
|
| 1107 |
-
},
|
| 1108 |
-
"nbformat": 4,
|
| 1109 |
-
"nbformat_minor": 5
|
| 1110 |
-
}
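A note on the `OutOfMemoryError` in the last executed cell above: the failed 768.00 MiB allocation equals 512 × 512 × 768 × 4 bytes, which is what one fp32 embedding tensor would need if all 512 sampled texts, padded to 512 tokens, went through the model in a single forward pass. That reading is an inference from the traceback, not something the notebook states, and it assumes `run_detection` hands the whole list to `my_detector` in one call. Under that assumption, a minimal sketch of a fix is to wrap the detector so it scores the texts in small chunks. The helper name `batched_detector` and the default `batch_size=16` are hypothetical and would need tuning for a GPU with about 3.7 GiB of memory.

```python
from typing import Callable, List, Sequence


def batched_detector(
    score_batch: Callable[[List[str]], List[float]],
    batch_size: int = 16,  # assumed value, not taken from the notebook
) -> Callable[[Sequence[str]], List[float]]:
    """Wrap a detector so it only ever sees `batch_size` texts at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")

    def detector(texts: Sequence[str]) -> List[float]:
        scores: List[float] = []
        # Walk the input in fixed-size slices; the last slice may be shorter.
        for start in range(0, len(texts), batch_size):
            chunk = list(texts[start:start + batch_size])
            scores.extend(score_batch(chunk))
        return scores

    return detector
```

With this wrapper the notebook call would become `run_detection(batched_detector(my_detector, batch_size=16), test_df)`. Scores come back in input order, so the evaluation step is unaffected; only peak GPU memory per forward pass changes.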
|
notebook/ai_vs_human/mainv2.ipynb
DELETED
|
@@ -1,1170 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"cells": [
|
| 3 |
-
{
|
| 4 |
-
"cell_type": "markdown",
|
| 5 |
-
"id": "464eefd0",
|
| 6 |
-
"metadata": {},
|
| 7 |
-
"source": [
|
| 8 |
-
"# AI vs Human Detector V2\n",
|
| 9 |
-
"This notebook trains a V2 model that explicitly supports short inputs (including sentences under 50 words) and saves artifacts in `v2_model/`."
|
| 10 |
-
]
|
| 11 |
-
},
|
| 12 |
-
{
|
| 13 |
-
"cell_type": "markdown",
|
| 14 |
-
"id": "0be0e8d9",
|
| 15 |
-
"metadata": {},
|
| 16 |
-
"source": [
|
| 17 |
-
"## ✅ Bug Fixes & Capabilities\n",
|
| 18 |
-
"\n",
|
| 19 |
-
"**Fixed Issues:**\n",
|
| 20 |
-
"1. ✅ Runtime error when calling `trainer.evaluate()` after training (removed duplicate evaluation)\n",
|
| 21 |
-
"2. ✅ Missing `accelerate` dependency (auto-installs if needed)\n",
|
| 22 |
-
"3. ✅ Recursive dataset loading from `./DATASET/` folder (supports `.jsonl`, `.json`, `.csv`)\n",
|
| 23 |
-
"4. ✅ Short sentence support (<50 words) with data augmentation\n",
|
| 24 |
-
"\n",
|
| 25 |
-
"**Model Capabilities:**\n",
|
| 26 |
-
"- ✅ Works with **all sentence types**: very short (1-10 words), short (10-50), medium (50-150), long (150+)\n",
|
| 27 |
-
"- ✅ Handles edge cases: single words, special characters, numbers, mixed formats\n",
|
| 28 |
-
"- ✅ Batch prediction support\n",
|
| 29 |
-
"- ✅ Saves to `v2_model/` with tokenizer, config, and label map\n",
|
| 30 |
-
"- ✅ Can be loaded independently after saving\n",
|
| 31 |
-
"\n",
|
| 32 |
-
"**Architecture:** DistilRoBERTa-base (faster, lighter than BERT)\n",
|
| 33 |
-
"\n",
|
| 34 |
-
"**Quick Start:**\n",
|
| 35 |
-
"1. Run cells 1-7 to prepare data\n",
|
| 36 |
-
"2. Run cell 8 to train (takes ~15-30 min on GPU)\n",
|
| 37 |
-
"3. Run cell 9 to save to `v2_model/`\n",
|
| 38 |
-
"4. Run cells 10-12 to test all sentence types"
|
| 39 |
-
]
|
| 40 |
-
},
|
| 41 |
-
{
|
| 42 |
-
"cell_type": "markdown",
|
| 43 |
-
"id": "3a8134db",
|
| 44 |
-
"metadata": {},
|
| 45 |
-
"source": [
|
| 46 |
-
"## Additional Testing: Extreme Edge Cases & Batch Prediction"
|
| 47 |
-
]
|
| 48 |
-
},
|
| 49 |
-
{
|
| 50 |
-
"cell_type": "code",
|
| 51 |
-
"execution_count": 1,
|
| 52 |
-
"id": "f400f763",
|
| 53 |
-
"metadata": {},
|
| 54 |
-
"outputs": [
|
| 55 |
-
{
|
| 56 |
-
"name": "stdout",
|
| 57 |
-
"output_type": "stream",
|
| 58 |
-
"text": [
|
| 59 |
-
"Note: you may need to restart the kernel to use updated packages.\n"
|
| 60 |
-
]
|
| 61 |
-
}
|
| 62 |
-
],
|
| 63 |
-
"source": [
|
| 64 |
-
"%pip install -q -U datasets evaluate transformers torch pandas scikit-learn accelerate"
|
| 65 |
-
]
|
| 66 |
-
},
|
| 67 |
-
{
|
| 68 |
-
"cell_type": "code",
|
| 69 |
-
"execution_count": 2,
|
| 70 |
-
"id": "0c3d4d6d",
|
| 71 |
-
"metadata": {},
|
| 72 |
-
"outputs": [
|
| 73 |
-
{
|
| 74 |
-
"name": "stderr",
|
| 75 |
-
"output_type": "stream",
|
| 76 |
-
"text": [
|
| 77 |
-
"/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
| 78 |
-
" from .autonotebook import tqdm as notebook_tqdm\n",
|
| 79 |
-
"/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'Could not load this library: /home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?\n",
|
| 80 |
-
" warn(\n",
|
| 81 |
-
"/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().\n",
|
| 82 |
-
" warnings.warn(_BETA_TRANSFORMS_WARNING)\n",
|
| 83 |
-
"/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().\n",
|
| 84 |
-
" warnings.warn(_BETA_TRANSFORMS_WARNING)\n"
|
| 85 |
-
]
|
| 86 |
-
}
|
| 87 |
-
],
|
| 88 |
-
"source": [
|
| 89 |
-
"from __future__ import annotations\n",
|
| 90 |
-
"\n",
|
| 91 |
-
"from dataclasses import dataclass\n",
|
| 92 |
-
"from functools import partial\n",
|
| 93 |
-
"from pathlib import Path\n",
|
| 94 |
-
"import json\n",
|
| 95 |
-
"import random\n",
|
| 96 |
-
"\n",
|
| 97 |
-
"import datasets\n",
|
| 98 |
-
"from datasets import Dataset, DatasetDict, concatenate_datasets\n",
|
| 99 |
-
"import evaluate\n",
|
| 100 |
-
"import numpy as np\n",
|
| 101 |
-
"import pandas as pd\n",
|
| 102 |
-
"import torch\n",
|
| 103 |
-
"from transformers import (\n",
|
| 104 |
-
" AutoModelForSequenceClassification,\n",
|
| 105 |
-
" AutoTokenizer,\n",
|
| 106 |
-
" BatchEncoding,\n",
|
| 107 |
-
" DataCollatorWithPadding,\n",
|
| 108 |
-
" PreTrainedTokenizer,\n",
|
| 109 |
-
" Trainer,\n",
|
| 110 |
-
" TrainingArguments,\n",
|
| 111 |
-
")\n",
|
| 112 |
-
"from packaging import version"
|
| 113 |
-
]
|
| 114 |
-
},
|
| 115 |
-
{
|
| 116 |
-
"cell_type": "code",
|
| 117 |
-
"execution_count": 3,
|
| 118 |
-
"id": "624d23ba",
|
| 119 |
-
"metadata": {},
|
| 120 |
-
"outputs": [
|
| 121 |
-
{
|
| 122 |
-
"name": "stdout",
|
| 123 |
-
"output_type": "stream",
|
| 124 |
-
"text": [
|
| 125 |
-
"Base model: distilroberta-base\n",
|
| 126 |
-
"Device: cuda\n",
|
| 127 |
-
"Output path: ./v2_model\n"
|
| 128 |
-
]
|
| 129 |
-
}
|
| 130 |
-
],
|
| 131 |
-
"source": [
|
| 132 |
-
"@dataclass\n",
|
| 133 |
-
"class V2Config:\n",
|
| 134 |
-
" base_model_name: str = \"distilroberta-base\"\n",
|
| 135 |
-
" max_samples: int = 20000\n",
|
| 136 |
-
" max_length: int = 256\n",
|
| 137 |
-
" short_word_limit: int = 50\n",
|
| 138 |
-
" short_aug_ratio: float = 0.35\n",
|
| 139 |
-
" output_dir: str = \"./v2_model\"\n",
|
| 140 |
-
" seed: int = 42\n",
|
| 141 |
-
"\n",
|
| 142 |
-
"\n",
|
| 143 |
-
"cfg = V2Config()\n",
|
| 144 |
-
"DEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
|
| 145 |
-
"random.seed(cfg.seed)\n",
|
| 146 |
-
"np.random.seed(cfg.seed)\n",
|
| 147 |
-
"torch.manual_seed(cfg.seed)\n",
|
| 148 |
-
"\n",
|
| 149 |
-
"print(f\"Base model: {cfg.base_model_name}\")\n",
|
| 150 |
-
"print(f\"Device: {DEVICE}\")\n",
|
| 151 |
-
"print(f\"Output path: {cfg.output_dir}\")"
|
| 152 |
-
]
|
| 153 |
-
},
|
| 154 |
-
{
|
| 155 |
-
"cell_type": "code",
|
| 156 |
-
"execution_count": 4,
|
| 157 |
-
"id": "0a1f860a",
|
| 158 |
-
"metadata": {},
|
| 159 |
-
"outputs": [],
|
| 160 |
-
"source": [
|
| 161 |
-
"def normalize_text(text: str) -> str:\n",
|
| 162 |
-
" return \" \".join(str(text).split()).strip()\n",
|
| 163 |
-
"\n",
|
| 164 |
-
"\n",
|
| 165 |
-
"def count_words(text: str) -> int:\n",
|
| 166 |
-
" return len(normalize_text(text).split())\n",
|
| 167 |
-
"\n",
|
| 168 |
-
"\n",
|
| 169 |
-
"def _load_local_file_to_text_labels(file_path: Path) -> tuple[list[str], list[int]]:\n",
|
| 170 |
-
" texts: list[str] = []\n",
|
| 171 |
-
" labels: list[int] = []\n",
|
| 172 |
-
"\n",
|
| 173 |
-
" try:\n",
|
| 174 |
-
" suffix = file_path.suffix.lower()\n",
|
| 175 |
-
" if suffix == \".jsonl\":\n",
|
| 176 |
-
" df = pd.read_json(file_path, lines=True)\n",
|
| 177 |
-
" elif suffix == \".json\":\n",
|
| 178 |
-
" df = pd.read_json(file_path)\n",
|
| 179 |
-
" elif suffix == \".csv\":\n",
|
| 180 |
-
" df = pd.read_csv(file_path)\n",
|
| 181 |
-
" else:\n",
|
| 182 |
-
" return texts, labels\n",
|
| 183 |
-
"\n",
|
| 184 |
-
" if {\"human_text\", \"ai_text\"}.issubset(df.columns):\n",
|
| 185 |
-
" human_texts = [normalize_text(x) for x in df[\"human_text\"].dropna().tolist()]\n",
|
| 186 |
-
" ai_texts = [normalize_text(x) for x in df[\"ai_text\"].dropna().tolist()]\n",
|
| 187 |
-
" human_texts = [x for x in human_texts if x]\n",
|
| 188 |
-
" ai_texts = [x for x in ai_texts if x]\n",
|
| 189 |
-
" texts.extend(human_texts)\n",
|
| 190 |
-
" labels.extend([0] * len(human_texts))\n",
|
| 191 |
-
" texts.extend(ai_texts)\n",
|
| 192 |
-
" labels.extend([1] * len(ai_texts))\n",
|
| 193 |
-
" return texts, labels\n",
|
| 194 |
-
"\n",
|
| 195 |
-
" # Alternative schema fallback: text + label/ai_gen columns.\n",
|
| 196 |
-
" if \"text\" in df.columns and (\"label\" in df.columns or \"ai_gen\" in df.columns):\n",
|
| 197 |
-
" label_col = \"label\" if \"label\" in df.columns else \"ai_gen\"\n",
|
| 198 |
-
" for _, row in df.iterrows():\n",
|
| 199 |
-
" text = normalize_text(row.get(\"text\", \"\"))\n",
|
| 200 |
-
" if not text:\n",
|
| 201 |
-
" continue\n",
|
| 202 |
-
" val = str(row.get(label_col, \"\")).strip().lower()\n",
|
| 203 |
-
" is_ai = val in {\"1\", \"true\", \"ai\", \"ai-generated\", \"ai_generated\"}\n",
|
| 204 |
-
" texts.append(text)\n",
|
| 205 |
-
" labels.append(1 if is_ai else 0)\n",
|
| 206 |
-
" return texts, labels\n",
|
| 207 |
-
"\n",
|
| 208 |
-
" except Exception as error:\n",
|
| 209 |
-
" print(f\"Skipped file due to parse error: {file_path} ({error})\")\n",
|
| 210 |
-
"\n",
|
| 211 |
-
" return texts, labels\n",
|
| 212 |
-
"\n",
|
| 213 |
-
"\n",
|
| 214 |
-
"def get_combined_dataset(max_samples: int = 20000, use_local: bool = True) -> DatasetDict:\n",
|
| 215 |
-
" all_texts: list[str] = []\n",
|
| 216 |
-
" all_labels: list[int] = []\n",
|
| 217 |
-
"\n",
|
| 218 |
-
" try:\n",
|
| 219 |
-
" hc3 = datasets.load_dataset(\"Hello-SimpleAI/HC3\", \"all\", split=\"train\")\n",
|
| 220 |
-
" for row in hc3:\n",
|
| 221 |
-
" for answer in row.get(\"human_answers\", [])[:1]:\n",
|
| 222 |
-
" text = normalize_text(answer)\n",
|
| 223 |
-
" if text:\n",
|
| 224 |
-
" all_texts.append(text)\n",
|
| 225 |
-
" all_labels.append(0)\n",
|
| 226 |
-
" for answer in row.get(\"chatgpt_answers\", [])[:1]:\n",
|
| 227 |
-
" text = normalize_text(answer)\n",
|
| 228 |
-
" if text:\n",
|
| 229 |
-
" all_texts.append(text)\n",
|
| 230 |
-
" all_labels.append(1)\n",
|
| 231 |
-
" print(f\"HC3 samples: {len(all_texts)}\")\n",
|
| 232 |
-
" except Exception as error:\n",
|
| 233 |
-
" print(f\"HC3 unavailable: {error}\")\n",
|
| 234 |
-
"\n",
|
| 235 |
-
" if use_local:\n",
|
| 236 |
-
" dataset_root = Path(\"./DATASET\")\n",
|
| 237 |
-
" candidates = list(dataset_root.rglob(\"*.jsonl\")) + list(dataset_root.rglob(\"*.json\")) + list(dataset_root.rglob(\"*.csv\"))\n",
|
| 238 |
-
"\n",
|
| 239 |
-
" local_before = len(all_texts)\n",
|
| 240 |
-
" for file_path in candidates:\n",
|
| 241 |
-
" texts, labels = _load_local_file_to_text_labels(file_path)\n",
|
| 242 |
-
" all_texts.extend(texts)\n",
|
| 243 |
-
" all_labels.extend(labels)\n",
|
| 244 |
-
"\n",
|
| 245 |
-
" print(f\"Local recursive files scanned: {len(candidates)}\")\n",
|
| 246 |
-
" print(f\"Local samples added: {len(all_texts) - local_before}\")\n",
|
| 247 |
-
"\n",
|
| 248 |
-
" if not all_texts:\n",
|
| 249 |
-
" raise ValueError(\"No training data loaded from HC3 or local dataset.\")\n",
|
| 250 |
-
"\n",
|
| 251 |
-
" ds = Dataset.from_dict({\"text\": all_texts, \"label\": all_labels})\n",
|
| 252 |
-
" ds = ds.filter(lambda x: x[\"text\"] is not None and len(normalize_text(x[\"text\"])) > 0)\n",
|
| 253 |
-
" ds = ds.shuffle(seed=cfg.seed)\n",
|
| 254 |
-
" if len(ds) > max_samples:\n",
|
| 255 |
-
" ds = ds.select(range(max_samples))\n",
|
| 256 |
-
"\n",
|
| 257 |
-
" split = ds.train_test_split(test_size=0.1, seed=cfg.seed)\n",
|
| 258 |
-
" return split\n",
|
| 259 |
-
"\n",
|
| 260 |
-
"\n",
|
| 261 |
-
"def add_short_text_variants(dataset: Dataset, short_word_limit: int = 50, ratio: float = 0.35) -> Dataset:\n",
|
| 262 |
-
" short_texts: list[str] = []\n",
|
| 263 |
-
" short_labels: list[int] = []\n",
|
| 264 |
-
"\n",
|
| 265 |
-
" for row in dataset:\n",
|
| 266 |
-
" text = normalize_text(row[\"text\"])\n",
|
| 267 |
-
" label = int(row[\"label\"])\n",
|
| 268 |
-
" words = text.split()\n",
|
| 269 |
-
"\n",
|
| 270 |
-
" if len(words) <= short_word_limit:\n",
|
| 271 |
-
" if random.random() < ratio:\n",
|
| 272 |
-
" short_texts.append(text)\n",
|
| 273 |
-
" short_labels.append(label)\n",
|
| 274 |
-
" continue\n",
|
| 275 |
-
"\n",
|
| 276 |
-
" # Keep first N words as a short variant to train behavior on short inputs.\n",
|
| 277 |
-
" if random.random() < ratio:\n",
|
| 278 |
-
" short_text = \" \".join(words[:short_word_limit])\n",
|
| 279 |
-
" short_texts.append(short_text)\n",
|
| 280 |
-
" short_labels.append(label)\n",
|
| 281 |
-
"\n",
|
| 282 |
-
" if not short_texts:\n",
|
| 283 |
-
" return dataset\n",
|
| 284 |
-
"\n",
|
| 285 |
-
" aug = Dataset.from_dict({\"text\": short_texts, \"label\": short_labels})\n",
|
| 286 |
-
" return concatenate_datasets([dataset, aug]).shuffle(seed=cfg.seed)"
|
| 287 |
-
]
|
| 288 |
-
},
|
| 289 |
-
{
|
| 290 |
-
"cell_type": "code",
|
| 291 |
-
"execution_count": 5,
|
| 292 |
-
"id": "889c5e58",
|
| 293 |
-
"metadata": {},
|
| 294 |
-
"outputs": [
|
| 295 |
-
{
|
| 296 |
-
"name": "stdout",
|
| 297 |
-
"output_type": "stream",
|
| 298 |
-
"text": [
|
| 299 |
-
"HC3 unavailable: Dataset scripts are no longer supported, but found HC3.py\n",
|
| 300 |
-
"Skipped file due to parse error: DATASET/test.csv (No columns to parse from file)\n",
|
| 301 |
-
"Local recursive files scanned: 2\n",
|
| 302 |
-
"Local samples added: 19940\n"
|
| 303 |
-
]
|
| 304 |
-
},
|
| 305 |
-
{
|
| 306 |
-
"name": "stderr",
|
| 307 |
-
"output_type": "stream",
|
| 308 |
-
"text": [
|
| 309 |
-
"Filter: 100%|██████████| 19940/19940 [00:00<00:00, 133317.22 examples/s]\n"
|
| 310 |
-
]
|
| 311 |
-
},
|
| 312 |
-
{
|
| 313 |
-
"name": "stdout",
|
| 314 |
-
"output_type": "stream",
|
| 315 |
-
"text": [
|
| 316 |
-
"Train samples: 24213\n",
|
| 317 |
-
"Eval samples: 1994\n",
|
| 318 |
-
"Train short (<50 words): 6839\n",
|
| 319 |
-
"Eval short (<50 words): 569\n"
|
| 320 |
-
]
|
| 321 |
-
}
|
| 322 |
-
],
|
| 323 |
-
"source": [
|
| 324 |
-
"raw_data = get_combined_dataset(max_samples=cfg.max_samples)\n",
|
| 325 |
-
"train_data = add_short_text_variants(\n",
|
| 326 |
-
" raw_data[\"train\"],\n",
|
| 327 |
-
" short_word_limit=cfg.short_word_limit,\n",
|
| 328 |
-
" ratio=cfg.short_aug_ratio,\n",
|
| 329 |
-
")\n",
|
| 330 |
-
"eval_data = raw_data[\"test\"]\n",
|
| 331 |
-
"\n",
|
| 332 |
-
"short_train = sum(count_words(t) < 50 for t in train_data[\"text\"])\n",
|
| 333 |
-
"short_eval = sum(count_words(t) < 50 for t in eval_data[\"text\"])\n",
|
| 334 |
-
"\n",
|
| 335 |
-
"print(f\"Train samples: {len(train_data)}\")\n",
|
| 336 |
-
"print(f\"Eval samples: {len(eval_data)}\")\n",
|
| 337 |
-
"print(f\"Train short (<50 words): {short_train}\")\n",
|
| 338 |
-
"print(f\"Eval short (<50 words): {short_eval}\")"
|
| 339 |
-
]
|
| 340 |
-
},
|
| 341 |
-
{
|
| 342 |
-
"cell_type": "code",
|
| 343 |
-
"execution_count": 7,
|
| 344 |
-
"id": "e8a2ff3e",
|
| 345 |
-
"metadata": {},
|
| 346 |
-
"outputs": [
|
| 347 |
-
{
|
| 348 |
-
"name": "stderr",
|
| 349 |
-
"output_type": "stream",
|
| 350 |
-
"text": [
|
| 351 |
-
"Loading weights: 100%|██████████| 101/101 [00:00<00:00, 8921.80it/s]\n",
|
| 352 |
-
"\u001b[1mRobertaForSequenceClassification LOAD REPORT\u001b[0m from: distilroberta-base\n",
|
| 353 |
-
"Key | Status | \n",
|
| 354 |
-
"----------------------------+------------+-\n",
|
| 355 |
-
"roberta.pooler.dense.weight | UNEXPECTED | \n",
|
| 356 |
-
"lm_head.dense.weight | UNEXPECTED | \n",
|
| 357 |
-
"roberta.pooler.dense.bias | UNEXPECTED | \n",
|
| 358 |
-
"lm_head.layer_norm.bias | UNEXPECTED | \n",
|
| 359 |
-
"lm_head.dense.bias | UNEXPECTED | \n",
|
| 360 |
-
"lm_head.layer_norm.weight | UNEXPECTED | \n",
|
| 361 |
-
"lm_head.bias | UNEXPECTED | \n",
|
| 362 |
-
"classifier.out_proj.bias | MISSING | \n",
|
| 363 |
-
"classifier.dense.weight | MISSING | \n",
|
| 364 |
-
"classifier.dense.bias | MISSING | \n",
|
| 365 |
-
"classifier.out_proj.weight | MISSING | \n",
|
| 366 |
-
"\n",
|
| 367 |
-
"\u001b[3mNotes:\n",
|
| 368 |
-
"- UNEXPECTED\u001b[3m\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\n",
|
| 369 |
-
"- MISSING\u001b[3m\t:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.\u001b[0m\n",
|
| 370 |
-
"Map: 100%|██████████| 24213/24213 [00:01<00:00, 12285.23 examples/s]\n",
|
| 371 |
-
"Map: 100%|██████████| 1994/1994 [00:00<00:00, 11737.65 examples/s]\n"
|
| 372 |
-
]
|
| 373 |
-
}
|
| 374 |
-
],
|
| 375 |
-
"source": [
|
| 376 |
-
"tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name)\n",
|
| 377 |
-
"model = AutoModelForSequenceClassification.from_pretrained(cfg.base_model_name, num_labels=2).to(DEVICE)\n",
|
| 378 |
-
"\n",
|
| 379 |
-
"\n",
|
| 380 |
-
"def preprocess_batch(batch: dict, tokenizer: PreTrainedTokenizer, max_length: int = 256) -> BatchEncoding:\n",
|
| 381 |
-
" encoded = tokenizer(\n",
|
| 382 |
-
" batch[\"text\"],\n",
|
| 383 |
-
" truncation=True,\n",
|
| 384 |
-
" max_length=max_length,\n",
|
| 385 |
-
" )\n",
|
| 386 |
-
" encoded[\"labels\"] = batch[\"label\"]\n",
|
| 387 |
-
" return encoded\n",
|
| 388 |
-
"\n",
|
| 389 |
-
"\n",
|
| 390 |
-
"tokenize_fn = partial(preprocess_batch, tokenizer=tokenizer, max_length=cfg.max_length)\n",
|
| 391 |
-
"tokenized_train = train_data.map(tokenize_fn, batched=True, remove_columns=[\"text\", \"label\"])\n",
|
| 392 |
-
"tokenized_eval = eval_data.map(tokenize_fn, batched=True, remove_columns=[\"text\", \"label\"])\n",
|
| 393 |
-
"\n",
|
| 394 |
-
"columns = tokenized_train.column_names\n",
|
| 395 |
-
"tensor_columns = [name for name in [\"input_ids\", \"attention_mask\", \"token_type_ids\", \"labels\"] if name in columns]\n",
|
| 396 |
-
"tokenized_train.set_format(type=\"torch\", columns=tensor_columns)\n",
|
| 397 |
-
"tokenized_eval.set_format(type=\"torch\", columns=tensor_columns)\n",
|
| 398 |
-
"\n",
|
| 399 |
-
"metric_accuracy = evaluate.load(\"accuracy\")\n",
|
| 400 |
-
"metric_f1 = evaluate.load(\"f1\")\n",
|
| 401 |
-
"\n",
|
| 402 |
-
"\n",
|
| 403 |
-
"def compute_metrics(eval_pred: tuple[np.ndarray, np.ndarray]) -> dict[str, float]:\n",
|
| 404 |
-
" logits, labels = eval_pred\n",
|
| 405 |
-
" if isinstance(logits, tuple):\n",
|
| 406 |
-
" logits = logits[0]\n",
|
| 407 |
-
" preds = np.argmax(logits, axis=1)\n",
|
| 408 |
-
" acc = metric_accuracy.compute(predictions=preds, references=labels)\n",
|
| 409 |
-
" f1 = metric_f1.compute(predictions=preds, references=labels)\n",
|
| 410 |
-
" return {\"accuracy\": float(acc[\"accuracy\"]), \"f1\": float(f1[\"f1\"])}"
|
| 411 |
-
]
|
| 412 |
-
},
|
| 413 |
-
{
|
| 414 |
-
"cell_type": "code",
|
| 415 |
-
"execution_count": null,
|
| 416 |
-
"id": "00f52ac8",
|
| 417 |
-
"metadata": {},
|
| 418 |
-
"outputs": [
|
| 419 |
-
{
|
| 420 |
-
"name": "stdout",
|
| 421 |
-
"output_type": "stream",
|
| 422 |
-
"text": [
|
| 423 |
-
"Start training V2 model...\n"
|
| 424 |
-
]
|
| 425 |
-
},
|
| 426 |
-
{
|
| 427 |
-
"data": {
|
| 428 |
-
"text/html": [
|
| 429 |
-
"\n",
|
| 430 |
-
" <div>\n",
|
| 431 |
-
" \n",
|
| 432 |
-
" <progress value='4542' max='4542' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
|
| 433 |
-
" [4542/4542 20:20, Epoch 3/3]\n",
|
| 434 |
-
" </div>\n",
|
| 435 |
-
" <table border=\"1\" class=\"dataframe\">\n",
|
| 436 |
-
" <thead>\n",
|
| 437 |
-
" <tr style=\"text-align: left;\">\n",
|
| 438 |
-
" <th>Step</th>\n",
|
| 439 |
-
" <th>Training Loss</th>\n",
|
| 440 |
-
" <th>Validation Loss</th>\n",
|
| 441 |
-
" <th>Accuracy</th>\n",
|
| 442 |
-
" <th>F1</th>\n",
|
| 443 |
-
" </tr>\n",
|
| 444 |
-
" </thead>\n",
|
| 445 |
-
" <tbody>\n",
|
| 446 |
-
" <tr>\n",
|
| 447 |
-
" <td>200</td>\n",
|
| 448 |
-
" <td>0.666410</td>\n",
|
| 449 |
-
" <td>0.350684</td>\n",
|
| 450 |
-
" <td>0.834504</td>\n",
|
| 451 |
-
" <td>0.855390</td>\n",
|
| 452 |
-
" </tr>\n",
|
| 453 |
-
" <tr>\n",
|
| 454 |
-
" <td>400</td>\n",
|
| 455 |
-
" <td>0.598755</td>\n",
|
| 456 |
-
" <td>0.256876</td>\n",
|
| 457 |
-
" <td>0.897192</td>\n",
|
| 458 |
-
" <td>0.904518</td>\n",
|
| 459 |
-
" </tr>\n",
|
| 460 |
-
" <tr>\n",
|
| 461 |
-
" <td>600</td>\n",
|
| 462 |
-
" <td>0.574993</td>\n",
|
| 463 |
-
" <td>0.198666</td>\n",
|
| 464 |
-
" <td>0.919258</td>\n",
|
| 465 |
-
" <td>0.917138</td>\n",
|
| 466 |
-
" </tr>\n",
|
| 467 |
-
" <tr>\n",
|
| 468 |
-
" <td>800</td>\n",
|
| 469 |
-
" <td>0.560090</td>\n",
|
| 470 |
-
" <td>0.555182</td>\n",
|
| 471 |
-
" <td>0.849047</td>\n",
|
| 472 |
-
" <td>0.868040</td>\n",
|
| 473 |
-
" </tr>\n",
|
| 474 |
-
" <tr>\n",
|
| 475 |
-
" <td>1000</td>\n",
|
| 476 |
-
" <td>0.387553</td>\n",
|
| 477 |
-
" <td>0.203730</td>\n",
|
| 478 |
-
" <td>0.929288</td>\n",
|
| 479 |
-
" <td>0.930848</td>\n",
|
| 480 |
-
" </tr>\n",
|
| 481 |
-
" <tr>\n",
|
| 482 |
-
" <td>1200</td>\n",
|
| 483 |
-
" <td>0.411762</td>\n",
|
| 484 |
-
" <td>0.521041</td>\n",
|
| 485 |
-
" <td>0.849047</td>\n",
|
| 486 |
-
" <td>0.868387</td>\n",
|
| 487 |
-
" </tr>\n",
|
| 488 |
-
" <tr>\n",
|
| 489 |
-
" <td>1400</td>\n",
|
| 490 |
-
" <td>0.386610</td>\n",
|
| 491 |
-
" <td>0.348940</td>\n",
|
| 492 |
-
" <td>0.902708</td>\n",
|
| 493 |
-
" <td>0.910434</td>\n",
|
| 494 |
-
" </tr>\n",
|
| 495 |
-
" <tr>\n",
|
| 496 |
-
" <td>1600</td>\n",
|
| 497 |
-
" <td>0.244696</td>\n",
|
| 498 |
-
" <td>0.346382</td>\n",
|
| 499 |
-
" <td>0.916249</td>\n",
|
| 500 |
-
" <td>0.921633</td>\n",
|
| 501 |
-
" </tr>\n",
|
| 502 |
-
" <tr>\n",
|
| 503 |
-
" <td>1800</td>\n",
|
| 504 |
-
" <td>0.223823</td>\n",
|
| 505 |
-
" <td>0.308763</td>\n",
|
| 506 |
-
" <td>0.924774</td>\n",
|
| 507 |
-
" <td>0.928977</td>\n",
|
| 508 |
-
" </tr>\n",
|
| 509 |
-
" <tr>\n",
|
| 510 |
-
" <td>2000</td>\n",
|
| 511 |
-
" <td>0.249242</td>\n",
|
| 512 |
-
" <td>0.358467</td>\n",
|
| 513 |
-
" <td>0.919258</td>\n",
|
| 514 |
-
" <td>0.924307</td>\n",
|
| 515 |
-
" </tr>\n",
|
| 516 |
-
" <tr>\n",
|
| 517 |
-
" <td>2200</td>\n",
|
| 518 |
-
" <td>0.221226</td>\n",
|
| 519 |
-
" <td>0.335397</td>\n",
|
| 520 |
-
" <td>0.919759</td>\n",
|
| 521 |
-
" <td>0.924599</td>\n",
|
| 522 |
-
" </tr>\n",
|
| 523 |
-
" <tr>\n",
|
| 524 |
-
" <td>2400</td>\n",
|
| 525 |
-
" <td>0.221417</td>\n",
|
| 526 |
-
" <td>0.587722</td>\n",
|
| 527 |
-
" <td>0.882648</td>\n",
|
| 528 |
-
" <td>0.894973</td>\n",
|
| 529 |
-
" </tr>\n",
|
| 530 |
-
" <tr>\n",
|
| 531 |
-
" <td>2600</td>\n",
|
| 532 |
-
" <td>0.191291</td>\n",
|
| 533 |
-
" <td>0.329566</td>\n",
|
| 534 |
-
" <td>0.928285</td>\n",
|
| 535 |
-
" <td>0.931677</td>\n",
|
| 536 |
-
" </tr>\n",
|
| 537 |
-
" <tr>\n",
|
| 538 |
-
" <td>2800</td>\n",
|
| 539 |
-
" <td>0.219115</td>\n",
|
| 540 |
-
" <td>0.368331</td>\n",
|
| 541 |
-
" <td>0.919759</td>\n",
|
| 542 |
-
" <td>0.925164</td>\n",
|
| 543 |
-
" </tr>\n",
|
| 544 |
-
" <tr>\n",
|
| 545 |
-
" <td>3000</td>\n",
|
| 546 |
-
" <td>0.308968</td>\n",
|
| 547 |
-
" <td>0.277328</td>\n",
|
| 548 |
-
" <td>0.931795</td>\n",
|
| 549 |
-
" <td>0.934928</td>\n",
|
| 550 |
-
" </tr>\n",
|
| 551 |
-
" <tr>\n",
|
| 552 |
-
" <td>3200</td>\n",
|
| 553 |
-
" <td>0.131352</td>\n",
|
| 554 |
-
" <td>0.585112</td>\n",
|
| 555 |
-
" <td>0.891174</td>\n",
|
| 556 |
-
" <td>0.901854</td>\n",
|
| 557 |
-
" </tr>\n",
|
| 558 |
-
" <tr>\n",
|
| 559 |
-
" <td>3400</td>\n",
|
| 560 |
-
" <td>0.152614</td>\n",
|
| 561 |
-
" <td>0.388915</td>\n",
|
| 562 |
-
" <td>0.924273</td>\n",
|
| 563 |
-
" <td>0.929208</td>\n",
|
| 564 |
-
" </tr>\n",
|
| 565 |
-
" <tr>\n",
|
| 566 |
-
" <td>3600</td>\n",
|
| 567 |
-
" <td>0.145248</td>\n",
|
| 568 |
-
" <td>0.439313</td>\n",
|
| 569 |
-
" <td>0.921765</td>\n",
|
| 570 |
-
" <td>0.926898</td>\n",
|
| 571 |
-
" </tr>\n",
|
| 572 |
-
" <tr>\n",
|
| 573 |
-
" <td>3800</td>\n",
|
| 574 |
-
" <td>0.086042</td>\n",
|
| 575 |
-
" <td>0.467167</td>\n",
|
| 576 |
-
" <td>0.920762</td>\n",
|
| 577 |
-
" <td>0.926099</td>\n",
|
| 578 |
-
" </tr>\n",
|
| 579 |
-
" <tr>\n",
|
| 580 |
-
" <td>4000</td>\n",
|
| 581 |
-
" <td>0.051121</td>\n",
|
| 582 |
-
" <td>0.561893</td>\n",
|
| 583 |
-
" <td>0.909729</td>\n",
|
| 584 |
-
" <td>0.916898</td>\n",
|
| 585 |
-
" </tr>\n",
|
| 586 |
-
" <tr>\n",
|
| 587 |
-
" <td>4200</td>\n",
|
| 588 |
-
" <td>0.141769</td>\n",
|
| 589 |
-
" <td>0.477382</td>\n",
|
| 590 |
-
" <td>0.920762</td>\n",
|
| 591 |
-
" <td>0.926168</td>\n",
|
| 592 |
-
" </tr>\n",
|
| 593 |
-
" <tr>\n",
|
| 594 |
-
" <td>4400</td>\n",
|
| 595 |
-
" <td>0.016825</td>\n",
|
| 596 |
-
" <td>0.506922</td>\n",
|
| 597 |
-
" <td>0.918255</td>\n",
|
| 598 |
-
" <td>0.924151</td>\n",
|
| 599 |
-
" </tr>\n",
|
| 600 |
-
" </tbody>\n",
|
| 601 |
-
"</table><p>"
|
| 602 |
-
],
|
| 603 |
-
"text/plain": [
|
| 604 |
-
"<IPython.core.display.HTML object>"
|
| 605 |
-
]
|
| 606 |
-
},
|
| 607 |
-
"metadata": {},
|
| 608 |
-
"output_type": "display_data"
|
| 609 |
-
},
|
| 610 |
-
{
|
| 611 |
-
"name": "stderr",
|
| 612 |
-
"output_type": "stream",
|
| 613 |
-
"text": [
|
| 614 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.48it/s]\n",
|
| 615 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.84it/s]\n",
|
| 616 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.64it/s]\n",
|
| 617 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.02it/s]\n",
|
| 618 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.96it/s]\n",
|
| 619 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.07it/s]\n",
|
| 620 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.79it/s]\n",
|
| 621 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.02it/s]\n",
|
| 622 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.03it/s]\n",
|
| 623 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.03it/s]\n",
|
| 624 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.00it/s]\n",
|
| 625 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.10it/s]\n",
|
| 626 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.59it/s]\n",
|
| 627 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.23it/s]\n",
|
| 628 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.16it/s]\n",
|
| 629 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.19it/s]\n",
|
| 630 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.14it/s]\n",
|
| 631 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.14it/s]\n",
|
| 632 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.00it/s]\n",
|
| 633 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.21it/s]\n",
|
| 634 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.17it/s]\n",
|
| 635 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.99it/s]\n",
|
| 636 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.22it/s]\n",
|
| 637 |
-
"There were missing keys in the checkpoint model loaded: ['roberta.embeddings.LayerNorm.weight', 'roberta.embeddings.LayerNorm.bias', 'roberta.encoder.layer.0.attention.output.LayerNorm.weight', 'roberta.encoder.layer.0.attention.output.LayerNorm.bias', 'roberta.encoder.layer.0.output.LayerNorm.weight', 'roberta.encoder.layer.0.output.LayerNorm.bias', 'roberta.encoder.layer.1.attention.output.LayerNorm.weight', 'roberta.encoder.layer.1.attention.output.LayerNorm.bias', 'roberta.encoder.layer.1.output.LayerNorm.weight', 'roberta.encoder.layer.1.output.LayerNorm.bias', 'roberta.encoder.layer.2.attention.output.LayerNorm.weight', 'roberta.encoder.layer.2.attention.output.LayerNorm.bias', 'roberta.encoder.layer.2.output.LayerNorm.weight', 'roberta.encoder.layer.2.output.LayerNorm.bias', 'roberta.encoder.layer.3.attention.output.LayerNorm.weight', 'roberta.encoder.layer.3.attention.output.LayerNorm.bias', 'roberta.encoder.layer.3.output.LayerNorm.weight', 'roberta.encoder.layer.3.output.LayerNorm.bias', 'roberta.encoder.layer.4.attention.output.LayerNorm.weight', 'roberta.encoder.layer.4.attention.output.LayerNorm.bias', 'roberta.encoder.layer.4.output.LayerNorm.weight', 'roberta.encoder.layer.4.output.LayerNorm.bias', 'roberta.encoder.layer.5.attention.output.LayerNorm.weight', 'roberta.encoder.layer.5.attention.output.LayerNorm.bias', 'roberta.encoder.layer.5.output.LayerNorm.weight', 'roberta.encoder.layer.5.output.LayerNorm.bias'].\n",
|
| 638 |
-
"There were unexpected keys in the checkpoint model loaded: ['roberta.embeddings.LayerNorm.beta', 'roberta.embeddings.LayerNorm.gamma', 'roberta.encoder.layer.0.attention.output.LayerNorm.beta', 'roberta.encoder.layer.0.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.0.output.LayerNorm.beta', 'roberta.encoder.layer.0.output.LayerNorm.gamma', 'roberta.encoder.layer.1.attention.output.LayerNorm.beta', 'roberta.encoder.layer.1.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.1.output.LayerNorm.beta', 'roberta.encoder.layer.1.output.LayerNorm.gamma', 'roberta.encoder.layer.2.attention.output.LayerNorm.beta', 'roberta.encoder.layer.2.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.2.output.LayerNorm.beta', 'roberta.encoder.layer.2.output.LayerNorm.gamma', 'roberta.encoder.layer.3.attention.output.LayerNorm.beta', 'roberta.encoder.layer.3.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.3.output.LayerNorm.beta', 'roberta.encoder.layer.3.output.LayerNorm.gamma', 'roberta.encoder.layer.4.attention.output.LayerNorm.beta', 'roberta.encoder.layer.4.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.4.output.LayerNorm.beta', 'roberta.encoder.layer.4.output.LayerNorm.gamma', 'roberta.encoder.layer.5.attention.output.LayerNorm.beta', 'roberta.encoder.layer.5.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.5.output.LayerNorm.beta', 'roberta.encoder.layer.5.output.LayerNorm.gamma'].\n"
|
| 639 |
-
]
|
| 640 |
-
},
|
| 641 |
-
{
|
| 642 |
-
"name": "stdout",
|
| 643 |
-
"output_type": "stream",
|
| 644 |
-
"text": [
|
| 645 |
-
"Final evaluation...\n"
|
| 646 |
-
]
|
| 647 |
-
},
|
| 648 |
-
{
|
| 649 |
-
"data": {
|
| 650 |
-
"text/html": [
|
| 651 |
-
"\n",
|
| 652 |
-
" <div>\n",
|
| 653 |
-
" \n",
|
| 654 |
-
" <progress value='250' max='250' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
|
| 655 |
-
" [250/250 00:07]\n",
|
| 656 |
-
" </div>\n",
|
| 657 |
-
" "
|
| 658 |
-
],
|
| 659 |
-
"text/plain": [
|
| 660 |
-
"<IPython.core.display.HTML object>"
|
| 661 |
-
]
|
| 662 |
-
},
|
| 663 |
-
"metadata": {},
|
| 664 |
-
"output_type": "display_data"
|
| 665 |
-
},
|
| 666 |
-
{
|
| 667 |
-
"ename": "RuntimeError",
|
| 668 |
-
"evalue": "on_train_begin must be called before on_evaluate",
|
| 669 |
-
"output_type": "error",
|
| 670 |
-
"traceback": [
|
| 671 |
-
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
|
| 672 |
-
"\u001b[31mRuntimeError\u001b[39m Traceback (most recent call last)",
|
| 673 |
-
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[8]\u001b[39m\u001b[32m, line 55\u001b[39m\n\u001b[32m 52\u001b[39m trainer.train()\n\u001b[32m 54\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33mFinal evaluation...\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m---> \u001b[39m\u001b[32m55\u001b[39m eval_result = \u001b[43mtrainer\u001b[49m\u001b[43m.\u001b[49m\u001b[43mevaluate\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 56\u001b[39m \u001b[38;5;28mprint\u001b[39m(json.dumps(eval_result, indent=\u001b[32m2\u001b[39m, default=\u001b[38;5;28mstr\u001b[39m))\n",
|
| 674 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/trainer.py:2602\u001b[39m, in \u001b[36mTrainer.evaluate\u001b[39m\u001b[34m(self, eval_dataset, ignore_keys, metric_key_prefix)\u001b[39m\n\u001b[32m 2599\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m DebugOption.TPU_METRICS_DEBUG \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.args.debug:\n\u001b[32m 2600\u001b[39m xm.master_print(met.metrics_report())\n\u001b[32m-> \u001b[39m\u001b[32m2602\u001b[39m \u001b[38;5;28mself\u001b[39m.control = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcallback_handler\u001b[49m\u001b[43m.\u001b[49m\u001b[43mon_evaluate\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mstate\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcontrol\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43moutput\u001b[49m\u001b[43m.\u001b[49m\u001b[43mmetrics\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 2604\u001b[39m \u001b[38;5;28mself\u001b[39m._memory_tracker.stop_and_update_metrics(output.metrics)\n\u001b[32m 2606\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m output.metrics\n",
|
| 675 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/trainer_callback.py:524\u001b[39m, in \u001b[36mCallbackHandler.on_evaluate\u001b[39m\u001b[34m(self, args, state, control, metrics)\u001b[39m\n\u001b[32m 522\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mon_evaluate\u001b[39m(\u001b[38;5;28mself\u001b[39m, args: TrainingArguments, state: TrainerState, control: TrainerControl, metrics):\n\u001b[32m 523\u001b[39m control.should_evaluate = \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m524\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcall_event\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mon_evaluate\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstate\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcontrol\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmetrics\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmetrics\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 676 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/trainer_callback.py:545\u001b[39m, in \u001b[36mCallbackHandler.call_event\u001b[39m\u001b[34m(self, event, args, state, control, **kwargs)\u001b[39m\n\u001b[32m 543\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mcall_event\u001b[39m(\u001b[38;5;28mself\u001b[39m, event, args, state, control, **kwargs):\n\u001b[32m 544\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m callback \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.callbacks:\n\u001b[32m--> \u001b[39m\u001b[32m545\u001b[39m result = \u001b[38;5;28;43mgetattr\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mcallback\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mevent\u001b[49m\u001b[43m)\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 546\u001b[39m \u001b[43m \u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 547\u001b[39m \u001b[43m \u001b[49m\u001b[43mstate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 548\u001b[39m \u001b[43m \u001b[49m\u001b[43mcontrol\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 549\u001b[39m \u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 550\u001b[39m \u001b[43m \u001b[49m\u001b[43mprocessing_class\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mprocessing_class\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 551\u001b[39m \u001b[43m \u001b[49m\u001b[43moptimizer\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43moptimizer\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 552\u001b[39m \u001b[43m \u001b[49m\u001b[43mlr_scheduler\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mlr_scheduler\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 553\u001b[39m \u001b[43m 
\u001b[49m\u001b[43mtrain_dataloader\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mtrain_dataloader\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 554\u001b[39m \u001b[43m \u001b[49m\u001b[43meval_dataloader\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43meval_dataloader\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 555\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 556\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 557\u001b[39m \u001b[38;5;66;03m# A Callback can skip the return of `control` if it doesn't change it.\u001b[39;00m\n\u001b[32m 558\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m result \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n",
|
| 677 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/notebook.py:354\u001b[39m, in \u001b[36mNotebookProgressCallback.on_evaluate\u001b[39m\u001b[34m(self, args, state, control, metrics, **kwargs)\u001b[39m\n\u001b[32m 353\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mon_evaluate\u001b[39m(\u001b[38;5;28mself\u001b[39m, args, state, control, metrics=\u001b[38;5;28;01mNone\u001b[39;00m, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m354\u001b[39m tt = \u001b[43m_require\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mtraining_tracker\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mon_train_begin must be called before on_evaluate\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[32m 356\u001b[39m values = {\u001b[33m\"\u001b[39m\u001b[33mTraining Loss\u001b[39m\u001b[33m\"\u001b[39m: \u001b[33m\"\u001b[39m\u001b[33mNo log\u001b[39m\u001b[33m\"\u001b[39m, \u001b[33m\"\u001b[39m\u001b[33mValidation Loss\u001b[39m\u001b[33m\"\u001b[39m: \u001b[33m\"\u001b[39m\u001b[33mNo log\u001b[39m\u001b[33m\"\u001b[39m}\n\u001b[32m 357\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m log \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mreversed\u001b[39m(state.log_history):\n",
|
| 678 |
-
"\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/notebook.py:31\u001b[39m, in \u001b[36m_require\u001b[39m\u001b[34m(x, msg)\u001b[39m\n\u001b[32m 29\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m_require\u001b[39m(x: _T | \u001b[38;5;28;01mNone\u001b[39;00m, msg: \u001b[38;5;28mstr\u001b[39m) -> _T:\n\u001b[32m 30\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m x \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m---> \u001b[39m\u001b[32m31\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(msg)\n\u001b[32m 32\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m x\n",
|
| 679 |
-
"\u001b[31mRuntimeError\u001b[39m: on_train_begin must be called before on_evaluate"
|
| 680 |
-
]
|
| 681 |
-
}
|
| 682 |
-
],
|
| 683 |
-
"source": [
|
| 684 |
-
"import sys\n",
|
| 685 |
-
"import subprocess\n",
|
| 686 |
-
"\n",
|
| 687 |
-
"\n",
|
| 688 |
-
"def _ensure_accelerate(min_version: str = \"1.1.0\") -> None:\n",
|
| 689 |
-
" try:\n",
|
| 690 |
-
" import accelerate # noqa: F401\n",
|
| 691 |
-
" from packaging import version\n",
|
| 692 |
-
"\n",
|
| 693 |
-
" if version.parse(accelerate.__version__) < version.parse(min_version):\n",
|
| 694 |
-
" raise ImportError(f\"accelerate version too old: {accelerate.__version__}\")\n",
|
| 695 |
-
" except Exception:\n",
|
| 696 |
-
" print(\"Installing/upgrading accelerate in current kernel environment...\")\n",
|
| 697 |
-
" subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", f\"accelerate>={min_version}\"])\n",
|
| 698 |
-
"\n",
|
| 699 |
-
"\n",
|
| 700 |
-
"_ensure_accelerate()\n",
|
| 701 |
-
"\n",
|
| 702 |
-
"train_args = TrainingArguments(\n",
|
| 703 |
-
" output_dir=\"./results/v2-distilroberta\",\n",
|
| 704 |
-
" num_train_epochs=3,\n",
|
| 705 |
-
" learning_rate=2e-5,\n",
|
| 706 |
-
" weight_decay=0.01,\n",
|
| 707 |
-
" per_device_train_batch_size=8,\n",
|
| 708 |
-
" per_device_eval_batch_size=8,\n",
|
| 709 |
-
" gradient_accumulation_steps=2,\n",
|
| 710 |
-
" fp16=torch.cuda.is_available(),\n",
|
| 711 |
-
" eval_strategy=\"steps\",\n",
|
| 712 |
-
" eval_steps=200,\n",
|
| 713 |
-
" save_strategy=\"steps\",\n",
|
| 714 |
-
" save_steps=200,\n",
|
| 715 |
-
" save_total_limit=2,\n",
|
| 716 |
-
" logging_steps=50,\n",
|
| 717 |
-
" metric_for_best_model=\"eval_f1\",\n",
|
| 718 |
-
" load_best_model_at_end=True,\n",
|
| 719 |
-
" remove_unused_columns=False,\n",
|
| 720 |
-
" report_to=\"none\",\n",
|
| 721 |
-
")\n",
|
| 722 |
-
"\n",
|
| 723 |
-
"data_collator = DataCollatorWithPadding(tokenizer=tokenizer)\n",
|
| 724 |
-
"\n",
|
| 725 |
-
"trainer = Trainer(\n",
|
| 726 |
-
" model=model,\n",
|
| 727 |
-
" args=train_args,\n",
|
| 728 |
-
" train_dataset=tokenized_train,\n",
|
| 729 |
-
" eval_dataset=tokenized_eval,\n",
|
| 730 |
-
" data_collator=data_collator,\n",
|
| 731 |
-
" compute_metrics=compute_metrics,\n",
|
| 732 |
-
")\n",
|
| 733 |
-
"\n",
|
| 734 |
-
"print(\"Start training V2 model...\")\n",
|
| 735 |
-
"train_result = trainer.train()\n",
|
| 736 |
-
"\n",
|
| 737 |
-
"print(\"\\n✓ Training complete!\")\n",
|
| 738 |
-
"print(f\"Final training metrics:\")\n",
|
| 739 |
-
"if hasattr(trainer.state, 'log_history') and trainer.state.log_history:\n",
|
| 740 |
-
" # Get the last evaluation metrics from log history\n",
|
| 741 |
-
" for log_entry in reversed(trainer.state.log_history):\n",
|
| 742 |
-
" if 'eval_loss' in log_entry:\n",
|
| 743 |
-
" print(f\" Eval Loss: {log_entry.get('eval_loss', 'N/A'):.4f}\")\n",
|
| 744 |
-
" print(f\" Eval Accuracy: {log_entry.get('eval_accuracy', 'N/A'):.4f}\")\n",
|
| 745 |
-
" print(f\" Eval F1: {log_entry.get('eval_f1', 'N/A'):.4f}\")\n",
|
| 746 |
-
" break"
|
| 747 |
-
]
|
| 748 |
-
},
|
| 749 |
-
{
|
| 750 |
-
"cell_type": "code",
|
| 751 |
-
"execution_count": 9,
|
| 752 |
-
"id": "1b601515",
|
| 753 |
-
"metadata": {},
|
| 754 |
-
"outputs": [
|
| 755 |
-
{
|
| 756 |
-
"name": "stderr",
|
| 757 |
-
"output_type": "stream",
|
| 758 |
-
"text": [
|
| 759 |
-
"Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.29it/s]"
|
| 760 |
-
]
|
| 761 |
-
},
|
| 762 |
-
{
|
| 763 |
-
"name": "stdout",
|
| 764 |
-
"output_type": "stream",
|
| 765 |
-
"text": [
|
| 766 |
-
"Saved V2 model to: /mnt/linux-data/Work/aiapi/notebook/ai_vs_human/v2_model\n"
|
| 767 |
-
]
|
| 768 |
-
},
|
| 769 |
-
{
|
| 770 |
-
"name": "stderr",
|
| 771 |
-
"output_type": "stream",
|
| 772 |
-
"text": [
|
| 773 |
-
"\n"
|
| 774 |
-
]
|
| 775 |
-
}
|
| 776 |
-
],
|
| 777 |
-
"source": [
|
| 778 |
-
"save_dir = Path(cfg.output_dir)\n",
|
| 779 |
-
"save_dir.mkdir(parents=True, exist_ok=True)\n",
|
| 780 |
-
"trainer.save_model(str(save_dir))\n",
|
| 781 |
-
"tokenizer.save_pretrained(str(save_dir))\n",
|
| 782 |
-
"\n",
|
| 783 |
-
"label_map = {\"0\": \"human\", \"1\": \"ai\"}\n",
|
| 784 |
-
"(save_dir / \"label_map.json\").write_text(json.dumps(label_map, indent=2), encoding=\"utf-8\")\n",
|
| 785 |
-
"\n",
|
| 786 |
-
"print(f\"Saved V2 model to: {save_dir.resolve()}\")"
|
| 787 |
-
]
|
| 788 |
-
},
|
| 789 |
-
{
|
| 790 |
-
"cell_type": "code",
|
| 791 |
-
"execution_count": 11,
|
| 792 |
-
"id": "93f0e5a0",
|
| 793 |
-
"metadata": {},
|
| 794 |
-
"outputs": [
|
| 795 |
-
{
|
| 796 |
-
"name": "stdout",
|
| 797 |
-
"output_type": "stream",
|
| 798 |
-
"text": [
|
| 799 |
-
"================================================================================\n",
|
| 800 |
-
"COMPREHENSIVE TEST: All Sentence Types\n",
|
| 801 |
-
"================================================================================\n",
|
| 802 |
-
"\n",
|
| 803 |
-
"1. VERY SHORT SENTENCES (< 10 words):\n",
|
| 804 |
-
" [2 words] human: Hello world.\n",
|
| 805 |
-
" [3 words] human: AI is powerful.\n",
|
| 806 |
-
" [3 words] human: I like coding.\n",
|
| 807 |
-
" [4 words] human: Machine learning works well.\n",
|
| 808 |
-
"\n",
|
| 809 |
-
"2. SHORT SENTENCES (10-50 words):\n",
|
| 810 |
-
" [10 words] human: AI writes fast, but humans add personal experience and emoti...\n",
|
| 811 |
-
" [14 words] human: I woke up late, missed the bus, and ran all the way to class...\n",
|
| 812 |
-
" [11 words] human: This response was generated by a language model in one pass....\n",
|
| 813 |
-
" [17 words] human: The field of data science combines statistics, programming, ...\n",
|
| 814 |
-
"\n",
|
| 815 |
-
"3. MEDIUM SENTENCES (50-150 words):\n",
|
| 816 |
-
" [74 words] human: Artificial intelligence systems can process massive amounts ...\n",
|
| 817 |
-
" [87 words] human: I once tried to learn guitar in a single weekend because I t...\n",
|
| 818 |
-
"\n",
|
| 819 |
-
"4. LONG SENTENCES (150+ words):\n",
|
| 820 |
-
" [153 words] ai: Machine learning represents a subset of artificial intellige...\n",
|
| 821 |
-
"\n",
|
| 822 |
-
"5. EDGE CASES:\n",
|
| 823 |
-
" [1 words] human: 'A'\n",
|
| 824 |
-
" [4 words] human: 'This is a test.'\n",
|
| 825 |
-
" [4 words] human: 'Multiple spaces between words'\n",
|
| 826 |
-
"\n",
|
| 827 |
-
"================================================================================\n",
|
| 828 |
-
"✓ All sentence types tested successfully!\n",
|
| 829 |
-
"================================================================================\n"
|
| 830 |
-
]
|
| 831 |
-
}
|
| 832 |
-
],
|
| 833 |
-
"source": [
|
| 834 |
-
"def predict_v2(text: str) -> dict[str, float | int | str]:\n",
|
| 835 |
-
" \"\"\"Predict whether text is AI or human-written. Works for all sentence lengths.\"\"\"\n",
|
| 836 |
-
" cleaned = normalize_text(text)\n",
|
| 837 |
-
" if not cleaned:\n",
|
| 838 |
-
" raise ValueError(\"Input text is empty.\")\n",
|
| 839 |
-
"\n",
|
| 840 |
-
" inputs = tokenizer(\n",
|
| 841 |
-
" cleaned,\n",
|
| 842 |
-
" truncation=True,\n",
|
| 843 |
-
" max_length=cfg.max_length,\n",
|
| 844 |
-
" return_tensors=\"pt\",\n",
|
| 845 |
-
" ).to(model.device)\n",
|
| 846 |
-
"\n",
|
| 847 |
-
" model.eval()\n",
|
| 848 |
-
" with torch.no_grad():\n",
|
| 849 |
-
" logits = model(**inputs).logits\n",
|
| 850 |
-
" probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]\n",
|
| 851 |
-
"\n",
|
| 852 |
-
" pred = int(np.argmax(probs))\n",
|
| 853 |
-
" wc = count_words(cleaned)\n",
|
| 854 |
-
"\n",
|
| 855 |
-
" return {\n",
|
| 856 |
-
" \"text\": cleaned,\n",
|
| 857 |
-
" \"word_count\": wc,\n",
|
| 858 |
-
" \"predicted_label\": pred,\n",
|
| 859 |
-
" \"predicted_name\": \"ai\" if pred == 1 else \"human\",\n",
|
| 860 |
-
" \"probability_human\": float(probs[0]),\n",
|
| 861 |
-
" \"probability_ai\": float(probs[1]),\n",
|
| 862 |
-
" \"short_text\": wc < 50,\n",
|
| 863 |
-
" }\n",
|
| 864 |
-
"\n",
|
| 865 |
-
"\n",
|
| 866 |
-
"print(\"=\" * 80)\n",
|
| 867 |
-
"print(\"COMPREHENSIVE TEST: All Sentence Types\")\n",
|
| 868 |
-
"print(\"=\" * 80)\n",
|
| 869 |
-
"\n",
|
| 870 |
-
"# Test 1: Very short sentences (under 10 words)\n",
|
| 871 |
-
"print(\"\\n1. VERY SHORT SENTENCES (< 10 words):\")\n",
|
| 872 |
-
"very_short = [\n",
|
| 873 |
-
" \"Hello world.\",\n",
|
| 874 |
-
" \"AI is powerful.\",\n",
|
| 875 |
-
" \"I like coding.\",\n",
|
| 876 |
-
" \"Machine learning works well.\",\n",
|
| 877 |
-
"]\n",
|
| 878 |
-
"for text in very_short:\n",
|
| 879 |
-
" result = predict_v2(text)\n",
|
| 880 |
-
" print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}\")\n",
|
| 881 |
-
"\n",
|
| 882 |
-
"# Test 2: Short sentences (10-50 words)\n",
|
| 883 |
-
"print(\"\\n2. SHORT SENTENCES (10-50 words):\")\n",
|
| 884 |
-
"short_examples = [\n",
|
| 885 |
-
" \"AI writes fast, but humans add personal experience and emotion.\",\n",
|
| 886 |
-
" \"I woke up late, missed the bus, and ran all the way to class.\",\n",
|
| 887 |
-
" \"This response was generated by a language model in one pass.\",\n",
|
| 888 |
-
" \"The field of data science combines statistics, programming, and domain knowledge to extract meaningful insights from data.\",\n",
|
| 889 |
-
"]\n",
|
| 890 |
-
"for text in short_examples:\n",
|
| 891 |
-
" result = predict_v2(text)\n",
|
| 892 |
-
" print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}...\")\n",
|
| 893 |
-
"\n",
|
| 894 |
-
"# Test 3: Medium sentences (50-150 words)\n",
|
| 895 |
-
"print(\"\\n3. MEDIUM SENTENCES (50-150 words):\")\n",
|
| 896 |
-
"medium_examples = [\n",
|
| 897 |
-
" \"Artificial intelligence systems can process massive amounts of data extremely quickly compared to humans. They are designed to analyze large datasets, identify patterns, and extract useful insights within seconds or minutes. Using advanced algorithms and machine learning models, AI systems can examine structured and unstructured data such as text, images, audio, and numerical information. By learning from historical data, these systems can recognize complex relationships between variables and make accurate predictions about future outcomes.\",\n",
|
| 898 |
-
" \"I once tried to learn guitar in a single weekend because I thought it would be easy. Turns out my fingers had other plans. After two hours of awkward chords and random noises, I realized that music requires patience, practice, and a lot more discipline than I originally expected. My friends laughed when they heard me trying to play, but I kept practicing anyway because I genuinely wanted to improve. Eventually, after weeks of consistent effort, I could finally play a simple song from start to finish.\",\n",
|
| 899 |
-
"]\n",
|
| 900 |
-
"for text in medium_examples:\n",
|
| 901 |
-
" result = predict_v2(text)\n",
|
| 902 |
-
" print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}...\")\n",
|
| 903 |
-
"\n",
|
| 904 |
-
"# Test 4: Long sentences (150+ words)\n",
|
| 905 |
-
"print(\"\\n4. LONG SENTENCES (150+ words):\")\n",
|
| 906 |
-
"long_examples = [\n",
|
| 907 |
-
" \"Machine learning represents a subset of artificial intelligence that enables computer systems to automatically learn and improve from experience without being explicitly programmed for every single task. The fundamental idea behind machine learning is to develop algorithms that can receive input data and use statistical analysis to predict an output while updating outputs as new data becomes available. This field has grown exponentially over the past few decades, driven by increases in computational power, the availability of large datasets, and breakthroughs in algorithmic approaches. Modern machine learning systems power everything from recommendation engines on streaming platforms to autonomous vehicles, medical diagnosis tools, and natural language processing applications. The three main categories of machine learning include supervised learning, where models are trained on labeled data; unsupervised learning, where patterns are discovered in unlabeled data; and reinforcement learning, where agents learn to make decisions by receiving rewards or penalties for their actions in an environment.\",\n",
|
| 908 |
-
"]\n",
|
| 909 |
-
"for text in long_examples:\n",
|
| 910 |
-
" result = predict_v2(text)\n",
|
| 911 |
-
" print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}...\")\n",
|
| 912 |
-
"\n",
|
| 913 |
-
"# Test 5: Edge cases\n",
|
| 914 |
-
"print(\"\\n5. EDGE CASES:\")\n",
|
| 915 |
-
"edge_cases = [\n",
|
| 916 |
-
" \"A\", # Single word\n",
|
| 917 |
-
" \"This is a test.\", # Very basic\n",
|
| 918 |
-
" \" Multiple spaces between words \", # Extra whitespace\n",
|
| 919 |
-
"]\n",
|
| 920 |
-
"for text in edge_cases:\n",
|
| 921 |
-
" try:\n",
|
| 922 |
-
" result = predict_v2(text)\n",
|
| 923 |
-
" print(f\" [{result['word_count']} words] {result['predicted_name']}: '{text.strip()}'\")\n",
|
| 924 |
-
" except Exception as e:\n",
|
| 925 |
-
" print(f\" ERROR: {text.strip()[:30]} - {str(e)}\")\n",
|
| 926 |
-
"\n",
|
| 927 |
-
"print(\"\\n\" + \"=\" * 80)\n",
|
| 928 |
-
"print(\"✓ All sentence types tested successfully!\")\n",
|
| 929 |
-
"print(\"=\" * 80)"
|
| 930 |
-
]
|
| 931 |
-
},
|
| 932 |
-
{
|
| 933 |
-
"cell_type": "code",
|
| 934 |
-
"execution_count": 12,
|
| 935 |
-
"id": "98ef7c7d",
|
| 936 |
-
"metadata": {},
|
| 937 |
-
"outputs": [
|
| 938 |
-
{
|
| 939 |
-
"name": "stdout",
|
| 940 |
-
"output_type": "stream",
|
| 941 |
-
"text": [
|
| 942 |
-
"================================================================================\n",
|
| 943 |
-
"TESTING SAVED V2 MODEL FROM DISK\n",
|
| 944 |
-
"================================================================================\n"
|
| 945 |
-
]
|
| 946 |
-
},
|
| 947 |
-
{
|
| 948 |
-
"name": "stderr",
|
| 949 |
-
"output_type": "stream",
|
| 950 |
-
"text": [
|
| 951 |
-
"Loading weights: 100%|██████████| 105/105 [00:00<00:00, 8556.64it/s]"
|
| 952 |
-
]
|
| 953 |
-
},
|
| 954 |
-
{
|
| 955 |
-
"name": "stdout",
|
| 956 |
-
"output_type": "stream",
|
| 957 |
-
"text": [
|
| 958 |
-
"\n",
|
| 959 |
-
"✓ Loaded model from: v2_model\n",
|
| 960 |
-
"\n",
|
| 961 |
-
"Running inference tests:\n",
|
| 962 |
-
" [very short ] human (AI: 0.50%): Hi there!\n",
|
| 963 |
-
" [short ] human (AI: 0.09%): I love programming and building cool projects.\n",
|
| 964 |
-
" [medium ] human (AI: 3.09%): Artificial intelligence has revolutionized many in\n",
|
| 965 |
-
"\n",
|
| 966 |
-
"✓ Saved model works correctly for all sentence types!\n"
|
| 967 |
-
]
|
| 968 |
-
},
|
| 969 |
-
{
|
| 970 |
-
"name": "stderr",
|
| 971 |
-
"output_type": "stream",
|
| 972 |
-
"text": [
|
| 973 |
-
"\n"
|
| 974 |
-
]
|
| 975 |
-
}
|
| 976 |
-
],
|
| 977 |
-
"source": [
|
| 978 |
-
"# Load and test the saved v2_model independently\n",
|
| 979 |
-
"print(\"=\" * 80)\n",
|
| 980 |
-
"print(\"TESTING SAVED V2 MODEL FROM DISK\")\n",
|
| 981 |
-
"print(\"=\" * 80)\n",
|
| 982 |
-
"\n",
|
| 983 |
-
"saved_model_path = Path(cfg.output_dir)\n",
|
| 984 |
-
"if saved_model_path.exists():\n",
|
| 985 |
-
" # Load fresh model and tokenizer from saved checkpoint\n",
|
| 986 |
-
" saved_tokenizer = AutoTokenizer.from_pretrained(str(saved_model_path))\n",
|
| 987 |
-
" saved_model = AutoModelForSequenceClassification.from_pretrained(str(saved_model_path)).to(DEVICE)\n",
|
| 988 |
-
" \n",
|
| 989 |
-
" print(f\"\\n✓ Loaded model from: {saved_model_path}\")\n",
|
| 990 |
-
" \n",
|
| 991 |
-
" # Test with diverse examples\n",
|
| 992 |
-
" test_cases = [\n",
|
| 993 |
-
" (\"Hi there!\", \"very short\"),\n",
|
| 994 |
-
" (\"I love programming and building cool projects.\", \"short\"),\n",
|
| 995 |
-
" (\"Artificial intelligence has revolutionized many industries by enabling automation, improving decision-making, and creating new opportunities for innovation.\", \"medium\"),\n",
|
| 996 |
-
" ]\n",
|
| 997 |
-
" \n",
|
| 998 |
-
" print(\"\\nRunning inference tests:\")\n",
|
| 999 |
-
" for text, category in test_cases:\n",
|
| 1000 |
-
" inputs = saved_tokenizer(text, truncation=True, max_length=256, return_tensors=\"pt\").to(DEVICE)\n",
|
| 1001 |
-
" saved_model.eval()\n",
|
| 1002 |
-
" with torch.no_grad():\n",
|
| 1003 |
-
" logits = saved_model(**inputs).logits\n",
|
| 1004 |
-
" probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]\n",
|
| 1005 |
-
" pred_label = int(np.argmax(probs))\n",
|
| 1006 |
-
" pred_name = \"ai\" if pred_label == 1 else \"human\"\n",
|
| 1007 |
-
" \n",
|
| 1008 |
-
" print(f\" [{category:12}] {pred_name:6} (AI: {probs[1]:.2%}): {text[:50]}\")\n",
|
| 1009 |
-
" \n",
|
| 1010 |
-
" print(\"\\n✓ Saved model works correctly for all sentence types!\")\n",
|
| 1011 |
-
"else:\n",
|
| 1012 |
-
" print(f\"⚠ Model not found at: {saved_model_path}\")\n",
|
| 1013 |
-
" print(\" Run the save cell first to create v2_model/\")"
|
| 1014 |
-
]
|
| 1015 |
-
},
|
| 1016 |
-
{
|
| 1017 |
-
"cell_type": "code",
|
| 1018 |
-
"execution_count": 13,
|
| 1019 |
-
"id": "2f63e591",
|
| 1020 |
-
"metadata": {},
|
| 1021 |
-
"outputs": [
|
| 1022 |
-
{
|
| 1023 |
-
"name": "stdout",
|
| 1024 |
-
"output_type": "stream",
|
| 1025 |
-
"text": [
|
| 1026 |
-
"================================================================================\n",
|
| 1027 |
-
"EXTREME EDGE CASE TESTING\n",
|
| 1028 |
-
"================================================================================\n",
|
| 1029 |
-
"\n",
|
| 1030 |
-
"Testing extreme edge cases:\n",
|
| 1031 |
-
" ✓ Single character [ 1w] human (99.3%): 'A'\n",
|
| 1032 |
-
" ✓ Single word [ 1w] human (99.4%): 'Hello'\n",
|
| 1033 |
-
" ✓ Two words [ 2w] human (99.6%): 'Hello world'\n",
|
| 1034 |
-
" ✓ Numbers only [ 3w] human (98.7%): '123 456 789'\n",
|
| 1035 |
-
" ✓ Special chars [ 4w] human (99.8%): '!!! ### $$$ ???'\n",
|
| 1036 |
-
" ✓ Mixed alphanumeric [ 3w] human (99.3%): 'Test123 ABC456 xyz789'\n",
|
| 1037 |
-
" ✓ Very long word [ 1w] human (99.1%): 'supercalifragilisticexpialidocious'\n",
|
| 1038 |
-
" ✓ Repeated words [ 5w] human (99.6%): 'test test test test test'\n",
|
| 1039 |
-
" ✓ Newlines [ 6w] human (99.4%): 'Line one\\nLine two\\nLine three'\n",
|
| 1040 |
-
" ✓ Tabs [ 3w] human (99.5%): 'Col1\\tCol2\\tCol3'\n",
|
| 1041 |
-
" ✓ Multiple spaces [ 3w] human (99.7%): 'Too many spaces'\n",
|
| 1042 |
-
" ✓ Punctuation heavy [ 5w] human (99.8%): 'Wow! Really? Yes! No... Maybe?'\n",
|
| 1043 |
-
" ✗ Empty-like ERROR: Input text is empty.\n",
|
| 1044 |
-
" ✓ Mixed case [ 5w] human (99.3%): 'ThIs Is MiXeD cAsE tExT'\n",
|
| 1045 |
-
" ✓ All caps [ 4w] human (99.3%): 'THIS IS ALL CAPITALS'\n",
|
| 1046 |
-
" ✓ All lower [ 4w] human (99.9%): 'this is all lowercase'\n",
|
| 1047 |
-
"\n",
|
| 1048 |
-
"Result: 15 passed, 1 failed\n",
|
| 1049 |
-
"\n",
|
| 1050 |
-
"================================================================================\n",
|
| 1051 |
-
"BATCH PREDICTION TEST\n",
|
| 1052 |
-
"================================================================================\n",
|
| 1053 |
-
"\n",
|
| 1054 |
-
"Predicting batch of mixed-length sentences:\n",
|
| 1055 |
-
"\n",
|
| 1056 |
-
" Sentence 1 (1 words):\n",
|
| 1057 |
-
" Text: Short....\n",
|
| 1058 |
-
" Prediction: human\n",
|
| 1059 |
-
" Confidence: AI=0.1%, Human=99.9%\n",
|
| 1060 |
-
"\n",
|
| 1061 |
-
" Sentence 2 (9 words):\n",
|
| 1062 |
-
" Text: This is a medium length sentence with some content....\n",
|
| 1063 |
-
" Prediction: human\n",
|
| 1064 |
-
" Confidence: AI=0.1%, Human=99.9%\n",
|
| 1065 |
-
"\n",
|
| 1066 |
-
" Sentence 3 (29 words):\n",
|
| 1067 |
-
" Text: This is a longer sentence that contains more words and provi...\n",
|
| 1068 |
-
" Prediction: human\n",
|
| 1069 |
-
" Confidence: AI=0.1%, Human=99.9%\n",
|
| 1070 |
-
"\n",
|
| 1071 |
-
"================================================================================\n",
|
| 1072 |
-
"✓ ALL EDGE CASES AND BATCH TESTS COMPLETE!\n",
|
| 1073 |
-
"================================================================================\n"
|
| 1074 |
-
]
|
| 1075 |
-
}
|
| 1076 |
-
],
|
| 1077 |
-
"source": [
|
| 1078 |
-
"print(\"=\" * 80)\n",
|
| 1079 |
-
"print(\"EXTREME EDGE CASE TESTING\")\n",
|
| 1080 |
-
"print(\"=\" * 80)\n",
|
| 1081 |
-
"\n",
|
| 1082 |
-
"# Test various edge cases that might break the model\n",
|
| 1083 |
-
"edge_test_cases = {\n",
|
| 1084 |
-
" \"Single character\": \"A\",\n",
|
| 1085 |
-
" \"Single word\": \"Hello\",\n",
|
| 1086 |
-
" \"Two words\": \"Hello world\",\n",
|
| 1087 |
-
" \"Numbers only\": \"123 456 789\",\n",
|
| 1088 |
-
" \"Special chars\": \"!!! ### $$$ ???\",\n",
|
| 1089 |
-
" \"Mixed alphanumeric\": \"Test123 ABC456 xyz789\",\n",
|
| 1090 |
-
" \"Very long word\": \"supercalifragilisticexpialidocious\",\n",
|
| 1091 |
-
" \"Repeated words\": \"test test test test test\",\n",
|
| 1092 |
-
" \"Newlines\": \"Line one\\nLine two\\nLine three\",\n",
|
| 1093 |
-
" \"Tabs\": \"Col1\\tCol2\\tCol3\",\n",
|
| 1094 |
-
" \"Multiple spaces\": \"Too many spaces\",\n",
|
| 1095 |
-
" \"Punctuation heavy\": \"Wow! Really? Yes! No... Maybe?\",\n",
|
| 1096 |
-
" \"Empty-like\": \" \",\n",
|
| 1097 |
-
" \"Mixed case\": \"ThIs Is MiXeD cAsE tExT\",\n",
|
| 1098 |
-
" \"All caps\": \"THIS IS ALL CAPITALS\",\n",
|
| 1099 |
-
" \"All lower\": \"this is all lowercase\",\n",
|
| 1100 |
-
"}\n",
|
| 1101 |
-
"\n",
|
| 1102 |
-
"print(\"\\nTesting extreme edge cases:\")\n",
|
| 1103 |
-
"passed = 0\n",
|
| 1104 |
-
"failed = 0\n",
|
| 1105 |
-
"\n",
|
| 1106 |
-
"for case_name, text in edge_test_cases.items():\n",
|
| 1107 |
-
" try:\n",
|
| 1108 |
-
" result = predict_v2(text)\n",
|
| 1109 |
-
" wc = result['word_count']\n",
|
| 1110 |
-
" pred = result['predicted_name']\n",
|
| 1111 |
-
" conf = result['probability_ai'] if pred == 'ai' else result['probability_human']\n",
|
| 1112 |
-
" \n",
|
| 1113 |
-
" # Handle display of text with special characters\n",
|
| 1114 |
-
" display_text = text.replace('\\n', '\\\\n').replace('\\t', '\\\\t')[:40]\n",
|
| 1115 |
-
" print(f\" ✓ {case_name:20} [{wc:2}w] {pred:6} ({conf:.1%}): '{display_text}'\")\n",
|
| 1116 |
-
" passed += 1\n",
|
| 1117 |
-
" except Exception as e:\n",
|
| 1118 |
-
" print(f\" ✗ {case_name:20} ERROR: {str(e)[:50]}\")\n",
|
| 1119 |
-
" failed += 1\n",
|
| 1120 |
-
"\n",
|
| 1121 |
-
"print(f\"\\nResult: {passed} passed, {failed} failed\")\n",
|
| 1122 |
-
"\n",
|
| 1123 |
-
"# Batch prediction test\n",
|
| 1124 |
-
"print(\"\\n\" + \"=\" * 80)\n",
|
| 1125 |
-
"print(\"BATCH PREDICTION TEST\")\n",
|
| 1126 |
-
"print(\"=\" * 80)\n",
|
| 1127 |
-
"\n",
|
| 1128 |
-
"batch_texts = [\n",
|
| 1129 |
-
" \"Short.\",\n",
|
| 1130 |
-
" \"This is a medium length sentence with some content.\",\n",
|
| 1131 |
-
" \"This is a longer sentence that contains more words and provides more context for the model to analyze and make predictions based on the patterns it learned during training.\",\n",
|
| 1132 |
-
"]\n",
|
| 1133 |
-
"\n",
|
| 1134 |
-
"print(\"\\nPredicting batch of mixed-length sentences:\")\n",
|
| 1135 |
-
"batch_results = [predict_v2(text) for text in batch_texts]\n",
|
| 1136 |
-
"\n",
|
| 1137 |
-
"for i, (text, result) in enumerate(zip(batch_texts, batch_results), 1):\n",
|
| 1138 |
-
" print(f\"\\n Sentence {i} ({result['word_count']} words):\")\n",
|
| 1139 |
-
" print(f\" Text: {text[:60]}...\")\n",
|
| 1140 |
-
" print(f\" Prediction: {result['predicted_name']}\")\n",
|
| 1141 |
-
" print(f\" Confidence: AI={result['probability_ai']:.1%}, Human={result['probability_human']:.1%}\")\n",
|
| 1142 |
-
"\n",
|
| 1143 |
-
"print(\"\\n\" + \"=\" * 80)\n",
|
| 1144 |
-
"print(\"✓ ALL EDGE CASES AND BATCH TESTS COMPLETE!\")\n",
|
| 1145 |
-
"print(\"=\" * 80)"
|
| 1146 |
-
]
|
| 1147 |
-
}
|
| 1148 |
-
],
|
| 1149 |
-
"metadata": {
|
| 1150 |
-
"kernelspec": {
|
| 1151 |
-
"display_name": "ml",
|
| 1152 |
-
"language": "python",
|
| 1153 |
-
"name": "python3"
|
| 1154 |
-
},
|
| 1155 |
-
"language_info": {
|
| 1156 |
-
"codemirror_mode": {
|
| 1157 |
-
"name": "ipython",
|
| 1158 |
-
"version": 3
|
| 1159 |
-
},
|
| 1160 |
-
"file_extension": ".py",
|
| 1161 |
-
"mimetype": "text/x-python",
|
| 1162 |
-
"name": "python",
|
| 1163 |
-
"nbconvert_exporter": "python",
|
| 1164 |
-
"pygments_lexer": "ipython3",
|
| 1165 |
-
"version": "3.11.14"
|
| 1166 |
-
}
|
| 1167 |
-
},
|
| 1168 |
-
"nbformat": 4,
|
| 1169 |
-
"nbformat_minor": 5
|
| 1170 |
-
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
notebook/ai_vs_human/mainv3.ipynb
DELETED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebook/ai_vs_human_nepali/notebook/Nepali_Ai_vs_Human.ipynb
DELETED
|
@@ -1,1429 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"cells": [
|
| 3 |
-
{
|
| 4 |
-
"cell_type": "code",
|
| 5 |
-
"execution_count": 1,
|
| 6 |
-
"id": "901fc22d",
|
| 7 |
-
"metadata": {
|
| 8 |
-
"id": "901fc22d"
|
| 9 |
-
},
|
| 10 |
-
"outputs": [
|
| 11 |
-
{
|
| 12 |
-
"name": "stderr",
|
| 13 |
-
"output_type": "stream",
|
| 14 |
-
"text": [
|
| 15 |
-
"/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
| 16 |
-
" from .autonotebook import tqdm as notebook_tqdm\n"
|
| 17 |
-
]
|
| 18 |
-
}
|
| 19 |
-
],
|
| 20 |
-
"source": [
|
| 21 |
-
"import os\n",
|
| 22 |
-
"os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'\n",
|
| 23 |
-
"\n",
|
| 24 |
-
"import math\n",
|
| 25 |
-
"import pandas as pd\n",
|
| 26 |
-
"import torch\n",
|
| 27 |
-
"from torch.utils.data import Dataset, DataLoader\n",
|
| 28 |
-
"from transformers import AutoTokenizer, AutoModel, get_linear_schedule_with_warmup\n",
|
| 29 |
-
"from sklearn.model_selection import train_test_split\n",
|
| 30 |
-
"from sklearn.metrics import classification_report, f1_score, accuracy_score\n",
|
| 31 |
-
"import torch.nn as nn\n",
|
| 32 |
-
"from torch.optim import AdamW"
|
| 33 |
-
]
|
| 34 |
-
},
|
| 35 |
-
{
|
| 36 |
-
"cell_type": "code",
|
| 37 |
-
"execution_count": 2,
|
| 38 |
-
"id": "70d3c048",
|
| 39 |
-
"metadata": {},
|
| 40 |
-
"outputs": [
|
| 41 |
-
{
|
| 42 |
-
"name": "stdout",
|
| 43 |
-
"output_type": "stream",
|
| 44 |
-
"text": [
|
| 45 |
-
"Columns: ['human_text', 'ai_generated_text']\n",
|
| 46 |
-
"Prepared dataset shape: (1986, 2)\n",
|
| 47 |
-
"label\n",
|
| 48 |
-
"1 996\n",
|
| 49 |
-
"0 990\n",
|
| 50 |
-
"Name: count, dtype: int64\n"
|
| 51 |
-
]
|
| 52 |
-
},
|
| 53 |
-
{
|
| 54 |
-
"data": {
|
| 55 |
-
"text/html": [
|
| 56 |
-
"<div>\n",
|
| 57 |
-
"<style scoped>\n",
|
| 58 |
-
" .dataframe tbody tr th:only-of-type {\n",
|
| 59 |
-
" vertical-align: middle;\n",
|
| 60 |
-
" }\n",
|
| 61 |
-
"\n",
|
| 62 |
-
" .dataframe tbody tr th {\n",
|
| 63 |
-
" vertical-align: top;\n",
|
| 64 |
-
" }\n",
|
| 65 |
-
"\n",
|
| 66 |
-
" .dataframe thead th {\n",
|
| 67 |
-
" text-align: right;\n",
|
| 68 |
-
" }\n",
|
| 69 |
-
"</style>\n",
|
| 70 |
-
"<table border=\"1\" class=\"dataframe\">\n",
|
| 71 |
-
" <thead>\n",
|
| 72 |
-
" <tr style=\"text-align: right;\">\n",
|
| 73 |
-
" <th></th>\n",
|
| 74 |
-
" <th>text</th>\n",
|
| 75 |
-
" <th>label</th>\n",
|
| 76 |
-
" </tr>\n",
|
| 77 |
-
" </thead>\n",
|
| 78 |
-
" <tbody>\n",
|
| 79 |
-
" <tr>\n",
|
| 80 |
-
" <th>0</th>\n",
|
| 81 |
-
" <td>हामीले पार्टी एकतापछि कि दुबै पार्टीको सिद्धान...</td>\n",
|
| 82 |
-
" <td>0</td>\n",
|
| 83 |
-
" </tr>\n",
|
| 84 |
-
" <tr>\n",
|
| 85 |
-
" <th>1</th>\n",
|
| 86 |
-
" <td>एमाले प्रतिनिधिसभाको प्रत्यक्षतर्फ ८० समानुपात...</td>\n",
|
| 87 |
-
" <td>0</td>\n",
|
| 88 |
-
" </tr>\n",
|
| 89 |
-
" <tr>\n",
|
| 90 |
-
" <th>2</th>\n",
|
| 91 |
-
" <td>नेकपा माओवादी केन्द्रका नेता रामनारायण विडारील...</td>\n",
|
| 92 |
-
" <td>1</td>\n",
|
| 93 |
-
" </tr>\n",
|
| 94 |
-
" <tr>\n",
|
| 95 |
-
" <th>3</th>\n",
|
| 96 |
-
" <td>प्रदेश नं २ का मुख्यमन्त्रीको रूपमा संघीय समाज...</td>\n",
|
| 97 |
-
" <td>1</td>\n",
|
| 98 |
-
" </tr>\n",
|
| 99 |
-
" <tr>\n",
|
| 100 |
-
" <th>4</th>\n",
|
| 101 |
-
" <td>बिहीबार एमालेका अध्यक्ष केपी शर्मा ओली र माओवा...</td>\n",
|
| 102 |
-
" <td>0</td>\n",
|
| 103 |
-
" </tr>\n",
|
| 104 |
-
" </tbody>\n",
|
| 105 |
-
"</table>\n",
|
| 106 |
-
"</div>"
|
| 107 |
-
],
|
| 108 |
-
"text/plain": [
|
| 109 |
-
" text label\n",
|
| 110 |
-
"0 हामीले पार्टी एकतापछि कि दुबै पार्टीको सिद्धान... 0\n",
|
| 111 |
-
"1 एमाले प्रतिनिधिसभाको प्रत्यक्षतर्फ ८० समानुपात... 0\n",
|
| 112 |
-
"2 नेकपा माओवादी केन्द्रका नेता रामनारायण विडारील... 1\n",
|
| 113 |
-
"3 प्रदेश नं २ का मुख्यमन्त्रीको रूपमा संघीय समाज... 1\n",
|
| 114 |
-
"4 बिहीबार एमालेका अध्यक्ष केपी शर्मा ओली र माओवा... 0"
|
| 115 |
-
]
|
| 116 |
-
},
|
| 117 |
-
"execution_count": 2,
|
| 118 |
-
"metadata": {},
|
| 119 |
-
"output_type": "execute_result"
|
| 120 |
-
}
|
| 121 |
-
],
|
| 122 |
-
"source": [
|
| 123 |
-
"# Load Dataset and convert to binary classification format\n",
|
| 124 |
-
"DATA_PATH = '../DATASET/new_data.csv'\n",
|
| 125 |
-
"raw_df = pd.read_csv(DATA_PATH)\n",
|
| 126 |
-
"print('Columns:', raw_df.columns.tolist())\n",
|
| 127 |
-
"\n",
|
| 128 |
-
"required_cols = ['human_text', 'ai_generated_text']\n",
|
| 129 |
-
"missing = [c for c in required_cols if c not in raw_df.columns]\n",
|
| 130 |
-
"if missing:\n",
|
| 131 |
-
" raise ValueError(f'Missing required columns: {missing}')\n",
|
| 132 |
-
"\n",
|
| 133 |
-
"# Build unified training dataframe: text + label (0=Human, 1=AI)\n",
|
| 134 |
-
"df_human = raw_df[['human_text']].dropna().rename(columns={'human_text': 'text'})\n",
|
| 135 |
-
"df_human['label'] = 0\n",
|
| 136 |
-
"\n",
|
| 137 |
-
"df_ai = raw_df[['ai_generated_text']].dropna().rename(columns={'ai_generated_text': 'text'})\n",
|
| 138 |
-
"df_ai['label'] = 1\n",
|
| 139 |
-
"\n",
|
| 140 |
-
"df = pd.concat([df_human, df_ai], ignore_index=True)\n",
|
| 141 |
-
"df['text'] = df['text'].astype(str).str.strip()\n",
|
| 142 |
-
"df = df[df['text'].str.len() > 10].drop_duplicates(subset=['text']).sample(frac=1, random_state=42).reset_index(drop=True)\n",
|
| 143 |
-
"\n",
|
| 144 |
-
"print('Prepared dataset shape:', df.shape)\n",
|
| 145 |
-
"print(df['label'].value_counts())\n",
|
| 146 |
-
"df.head()"
|
| 147 |
-
]
|
| 148 |
-
},
|
| 149 |
-
{
|
| 150 |
-
"cell_type": "code",
|
| 151 |
-
"execution_count": 3,
|
| 152 |
-
"id": "f93d4c7a",
|
| 153 |
-
"metadata": {
|
| 154 |
-
"id": "f93d4c7a"
|
| 155 |
-
},
|
| 156 |
-
"outputs": [
|
| 157 |
-
{
|
| 158 |
-
"name": "stdout",
|
| 159 |
-
"output_type": "stream",
|
| 160 |
-
"text": [
|
| 161 |
-
"Nulls in text: 0\n",
|
| 162 |
-
"Nulls in label: 0\n",
|
| 163 |
-
"Example text sample:\n",
|
| 164 |
-
"हामीले पार्टी एकतापछि कि दुबै पार्टीको सिद्धान्त राख्ने कि राख्ने माओवाद र जबज दुबै नराख्ने भन्दा उहाँहरु मान्नु भएन । एमालेका साथीहरुले जवजको विषय उठाउन चाहनुभएन । सिद्धान्तको विषय नै नमिलेपछि पार्टी एकता संयोजन समितिको बैठक रोकियो कार्यदलका एक सदस्\n"
|
| 165 |
-
]
|
| 166 |
-
}
|
| 167 |
-
],
|
| 168 |
-
"source": [
|
| 169 |
-
"# Quick sanity checks\n",
|
| 170 |
-
"print('Nulls in text:', int(df['text'].isnull().sum()))\n",
|
| 171 |
-
"print('Nulls in label:', int(df['label'].isnull().sum()))\n",
|
| 172 |
-
"print('Example text sample:')\n",
|
| 173 |
-
"print(df.loc[0, 'text'][:250])"
|
| 174 |
-
]
|
| 175 |
-
},
|
| 176 |
-
{
|
| 177 |
-
"cell_type": "code",
|
| 178 |
-
"execution_count": 4,
|
| 179 |
-
"id": "ba4a933f",
|
| 180 |
-
"metadata": {
|
| 181 |
-
"colab": {
|
| 182 |
-
"base_uri": "https://localhost:8080/",
|
| 183 |
-
"height": 206
|
| 184 |
-
},
|
| 185 |
-
"id": "ba4a933f",
|
| 186 |
-
"outputId": "9bf5f0a5-c547-43f1-b8f2-a580024d74a9"
|
| 187 |
-
},
|
| 188 |
-
"outputs": [
|
| 189 |
-
{
|
| 190 |
-
"name": "stdout",
|
| 191 |
-
"output_type": "stream",
|
| 192 |
-
"text": [
|
| 193 |
-
"label\n",
|
| 194 |
-
"AI 0.501511\n",
|
| 195 |
-
"Human 0.498489\n",
|
| 196 |
-
"Name: proportion, dtype: float64\n"
|
| 197 |
-
]
|
| 198 |
-
},
|
| 199 |
-
{
|
| 200 |
-
"data": {
|
| 201 |
-
"text/plain": [
|
| 202 |
-
"label \n",
|
| 203 |
-
"0 count 990.000000\n",
|
| 204 |
-
" mean 455.551515\n",
|
| 205 |
-
" std 56.825837\n",
|
| 206 |
-
" min 299.000000\n",
|
| 207 |
-
" 25% 418.000000\n",
|
| 208 |
-
" 50% 458.000000\n",
|
| 209 |
-
" 75% 494.000000\n",
|
| 210 |
-
" max 629.000000\n",
|
| 211 |
-
"1 count 996.000000\n",
|
| 212 |
-
" mean 284.231928\n",
|
| 213 |
-
" std 67.165254\n",
|
| 214 |
-
" min 103.000000\n",
|
| 215 |
-
" 25% 238.000000\n",
|
| 216 |
-
" 50% 282.000000\n",
|
| 217 |
-
" 75% 331.000000\n",
|
| 218 |
-
" max 433.000000\n",
|
| 219 |
-
"Name: text, dtype: float64"
|
| 220 |
-
]
|
| 221 |
-
},
|
| 222 |
-
"execution_count": 4,
|
| 223 |
-
"metadata": {},
|
| 224 |
-
"output_type": "execute_result"
|
| 225 |
-
}
|
| 226 |
-
],
|
| 227 |
-
"source": [
|
| 228 |
-
"# Class balance\n",
|
| 229 |
-
"print(df['label'].value_counts(normalize=True).rename({0: 'Human', 1: 'AI'}))\n",
|
| 230 |
-
"df.groupby('label')['text'].apply(lambda s: s.str.len().describe())"
|
| 231 |
-
]
|
| 232 |
-
},
|
| 233 |
-
{
|
| 234 |
-
"cell_type": "code",
|
| 235 |
-
"execution_count": 5,
|
| 236 |
-
"id": "d7b48175",
|
| 237 |
-
"metadata": {
|
| 238 |
-
"colab": {
|
| 239 |
-
"base_uri": "https://localhost:8080/",
|
| 240 |
-
"height": 206
|
| 241 |
-
},
|
| 242 |
-
"id": "d7b48175",
|
| 243 |
-
"outputId": "08bc4562-874c-40c1-d554-1d809a6d0e31"
|
| 244 |
-
},
|
| 245 |
-
"outputs": [
|
| 246 |
-
{
|
| 247 |
-
"data": {
|
| 248 |
-
"text/plain": [
|
| 249 |
-
"<matplotlib.legend.Legend at 0x7fef748b5290>"
|
| 250 |
-
]
|
| 251 |
-
},
|
| 252 |
-
"execution_count": 5,
|
| 253 |
-
"metadata": {},
|
| 254 |
-
"output_type": "execute_result"
|
| 255 |
-
},
|
| 256 |
-
{
|
| 257 |
-
"data": {
|
| 258 |
-
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAvwAAAGHCAYAAADMVYYQAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjgsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvwVt1zgAAAAlwSFlzAAAPYQAAD2EBqD+naQAARoNJREFUeJzt3X1cVGX+//H3gMMIAt7HTaJiIeVduVomVmgFu5aurtXWWq1Wa5bdaO5+7WtWDrsFZuVSa9nPttS2yG03c7Wvd6SJlbmp5epqmbspmkpkoaAoDMz1+6OYHAEdYGCY4+v5eMxDz3Wuuc7nzGdGPlxec47NGGMEAAAAwJJCAh0AAAAAgMZDwQ8AAABYGAU/AAAAYGEU/AAAAICFUfADAAAAFkbBDwAAAFgYBT8AAABgYRT8AAAAgIVR8AMAAAAWRsEPIOjZbDafHmvXrvXL8Q4cOCCn06ktW7b41H/t2rWy2Wz6+9//7pfj+1tpaamcTmeNr4/T6ZTNZtOhQ4fqNfbYsWO9ctCqVSt17dpVP//5zzVv3jyVlZVVe87gwYM1ePDgOh1nx44dcjqd2rNnT52ed+qx9uzZI5vNpqeffrpO45xJZmamFi9eXK296r3hr/cmANSkRaADAICG+uijj7y2//CHP+i9997TmjVrvNp79Ojhl+MdOHBAGRkZ6tq1qy6++GK/jBlIpaWlysjIkKQ6F9q+CA8P9+Ti+PHj2rdvn5YvX65x48bpmWee0YoVK9SpUydP/xdeeKHOx9ixY4cyMjI0ePBgde3a1efn1edY9ZGZmakbbrhBI0eO9Gr/yU9+oo8++shv700AqAkFP4Cgd9lll3ltd+zYUSEhIdXaERg15eLXv/61br/9dg0bNkw33HCDNmzY4NnXFMVvaWmpIiIiAl5oR0dH8z4F0OhY0gPgrFBeXq7HH39cF1xwgRwOhzp27Kjbb79d33zzjafPjBkzFBISoqVLl3o9d+zYsYqIiNC2bdu0du1aXXLJJZKk22+/3bNUxel0NjjGgoICjR8/Xp06dVJYWJgSExOVkZGhiooKT5+Tl5zMmjVLiYmJioyM1MCBA72K5iovvfSSunfvLofDoR49eignJ0djx471zILv2bNHHTt2lCRlZGR4zmfs2LFe43z99df61a9+pdatWysmJkZ33HGHjhw50qDzTU9P17hx4/TPf/5T69at87TXtKRnzpw5uuiiixQZGamoqChdcMEFevjhhyVJ8+fP14033ihJGjJkiOcc5s+f7xmvV69eWrdunVJSUhQREaE77rij1mNJktvt1hNPPKHOnTurZcuW6t+/v1avXu3V5+TX8WRVy6Cq2Gw2HTt2TAsWLPDEVnXM2pb0LFmyRAMHDlRERISioqKUlpZW7X+yqo6zfft2v+cGgLVQ8AOwPLfbrREjRmjGjBkaPXq0/u///k8zZsxQbm6uBg8erOPHj0uSHnroIQ0dOlRjxoxRfn6+JGnevHlasGCB/vSnP6l37976yU9+onnz5kmSHnnkEX300Uf66KOP9Jvf/KZBMRYUFOjSSy/VypUr9dhjj2n58uW68847lZWVpXHjxlXr//zzzys3N1fZ2dl6/fXXdezYMV177bVehd7cuXN11113qU+fPlq0aJEeeeQRZWRkeBWXcXFxWrFihSTpzjvv9JzPo48+6nW866+/Xt27d9dbb72l//3f/1VOTo4efPDBBp2zJP385z+XJK+C/1QLFy7UhAkTlJqaqrfffluLFy/Wgw8+qGPHjkmSrrvuOmVmZnpel6pzuO666zxjHDx4ULfeeqtGjx6tZcuWacKECaeNa/bs2VqxYoWys7P12muvKSQkREOHDq1WdPvio48+Unh4uK699lpPbKdbSpSTk6MRI0YoOjpab7zxhl5++WUVFRVp8ODB+uCDD6r1b6zcALAQAwAWM2bMGNOqVSvP9htvvGEkmbfeesur38aNG40k88ILL3jaDh06ZD
p16mQuvfRS88knn5iIiAhz66231vi8efPm+RTPe++9ZySZv/3tb7X2GT9+vImMjDT5+fle7U8//bSRZLZv326MMWb37t1Gkundu7epqKjw9Pv444+NJPPGG28YY4yprKw0sbGxZsCAAV7j5efnG7vdbrp06eJp++abb4wkM3369GpxTZ8+3UgyM2fO9GqfMGGCadmypXG73ac991NzcarPPvvMSDL33HOPpy01NdWkpqZ6tu+77z7Tpk2b0x7nb3/7m5Fk3nvvvWr7UlNTjSSzevXqGvedfKyq1zc+Pt4cP37c015cXGzatWtnrrnmGq9zO/l1rFL1mp2sVatWZsyYMdX6Vr03quKurKw08fHxpnfv3qaystLTr6SkxJxzzjkmJSWl2nHqmxsAZw9m+AFY3jvvvKM2bdpo+PDhqqio8DwuvvhixcbGes14t2/fXn/961/1ySefKCUlRZ07d9aLL77YJDEOGTJE8fHxXjEOHTpUkpSXl+fV/7rrrlNoaKhnu0+fPpLk+Z+JnTt3qqCgQL/85S+9nte5c2cNGjSozvFVzcSffLwTJ06osLCwzmOdzBhzxj6XXnqpDh8+rF/96lf6xz/+Ua8rBrVt21ZXXXWVz/1HjRqlli1berajoqI0fPhwrVu3TpWVlXU+vq927typAwcO6LbbblNIyI8/oiMjI3X99ddrw4YNKi0t9XpOY+UGgHVQ8AOwvK+//lqHDx9WWFiY7Ha716OgoKBaATlgwAD17NlTJ06c0D333KNWrVo1SYxLly6tFl/Pnj0lqVqM7du399p2OByS5Fme9O2330qSYmJiqh2rprYzOdPx6qvqF5T4+Pha+9x222165ZVXlJ+fr+uvv17nnHOOBgwYoNzcXJ+PExcXV6e4YmNja2wrLy/X0aNH6zRWXVTlraZ44+Pj5Xa7VVRU5NXeWLkBYB1cpQeA5XXo0EHt27f3rFU/VVRUlNf29OnTtW3bNvXr10+PPfaYhg0bpm7dujV6jH369NETTzxR4/7TFcQ1qSoCv/7662r7CgoK6h5gI1myZImkM18O9Pbbb9ftt9+uY8eOad26dZo+fbqGDRumL774Ql26dDnjcU7+Eq0vanqNCgoKFBYWpsjISElSy5Yta7yPQH3vWSD9mLeDBw9W23fgwAGFhISobdu29R4fwNmJGX4Aljds2DB9++23qqysVP/+/as9kpOTPX1zc3OVlZWlRx55RLm5uWrdurVuuukmlZeXe/o0xgzqsGHD9O9//1vnnXdejTHWteBPTk5WbGys3nzzTa/2vXv3av369V5tgZoRzs3N1Z///GelpKTo8ssv9+k5rVq10tChQzVt2jSVl5dr+/btkvx/DosWLdKJEyc82yUlJVq6dKmuuOIKz1Kqrl27qrCw0OuXqvLycq1cubLaeA6Hw6fYkpOTde655yonJ8drudOxY8f01ltvea7cAwB1wQw/AMu7+eab9frrr+vaa6/VxIkTdemll8put+urr77Se++9pxEjRugXv/iF50ouqampmj59ukJCQvTXv/5VV155paZMmaLs7GxJ0nnnnafw8HC9/vrruvDCCxUZGan4+PgzFuU1XTZTklJTU/X73/9eubm5SklJ0QMPPKDk5GSdOHFCe/bs0bJly/Tiiy963ZzqTEJCQpSRkaHx48frhhtu0B133KHDhw8rIyNDcXFxXuvDo6Ki1KVLF/3jH//Q1VdfrXbt2qlDhw51uoHV6bjdbs+5l5WVae/evVq+fLnefPNNXXjhhdV+KTnVuHHjFB4erkGDBikuLk4FBQXKyspS69atPZdI7dWrl6Tvr0wUFRWlli1bKjExsdpyF1+FhoYqLS1NkydPltvt1pNPPqni4mLPDcok6aabbtJjjz2mm2++Wf/zP/+jEydO6LnnnqtxjX/v3r21du1aLV26VHFxcYqKivL6RbNKSEiIZs6cqVtuuUXDhg3T+PHjVVZWpqeeekqHDx/WjBkz6nU+AM5ygf
7WMAD4W01XhnG5XObpp582F110kWnZsqWJjIw0F1xwgRk/frzZtWuXqaioMKmpqSYmJsYcPHjQ67lPPfWUkWTefvttT9sbb7xhLrjgAmO322u9wk2Vqiux1PaoukLLN998Yx544AGTmJho7Ha7adeunenXr5+ZNm2aOXr0qDHmx6vIPPXUU9WOU1Mcc+fONeeff74JCwsz3bt3N6+88ooZMWKE6du3r1e/d9991/Tt29c4HA4jyXNFmaorwXzzzTde/efNm2ckmd27d9d63sZ8n4uTzzU8PNx07tzZDB8+3LzyyiumrKys2nNOvXLOggULzJAhQ0xMTIwJCwsz8fHx5pe//KXZunWr1/Oys7NNYmKiCQ0N9bqKUmpqqunZs2eN8dV2lZ4nn3zSZGRkmE6dOpmwsDDTt29fs3LlymrPX7Zsmbn44otNeHi46datm5k9e3aNV+nZsmWLGTRokImIiDCSPMc89So9VRYvXmwGDBhgWrZsaVq1amWuvvpq8+GHH3r1aWhuAJw9bMb4cIkEAIAlHD58WN27d9fIkSM1d+7cQIcDAGgCLOkBAIsqKCjQE088oSFDhqh9+/bKz8/XH//4R5WUlGjixImBDg8A0EQo+AHAohwOh/bs2aMJEybou+++U0REhC677DK9+OKLnst9AgCsjyU9AAAAgIVxWU4AAADAwij4AQAAAAuj4AcAAAAszPJf2nW73Tpw4ICioqLqfGt1AAAAoLkyxqikpETx8fFeN1Q8leUL/gMHDighISHQYQAAAACNYt++fae9G7vlC/6oqChJ378Q0dHRjX48l8ulVatWKT09XXa7vdGPh6ZHjq2N/FofObY28mt95PhHxcXFSkhI8NS7tbF8wV+1jCc6OrrJCv6IiAhFR0ef9W9CqyLH1kZ+rY8cWxv5tT5yXN2Zlq3zpV0AAADAwij4AQAAAAuj4AcAAAAszPJr+AEAAFB3lZWVcrlcgQ6jGpfLpRYtWujEiROqrKwMdDiNKjQ0VC1atGjwpeUp+AEAAODl6NGj+uqrr2SMCXQo1RhjFBsbq3379p0V91iKiIhQXFycwsLC6j0GBT8AAAA8Kisr9dVXXykiIkIdO3ZsdkW12+3W0aNHFRkZedqbTQU7Y4zKy8v1zTffaPfu3UpKSqr3+VLwAwAAwMPlcskYo44dOyo8PDzQ4VTjdrtVXl6uli1bWrrgl6Tw8HDZ7Xbl5+d7zrk+rP0qAQAAoF6a28z+2cofv9RQ8AMAAAAWRsEPAAAAWBgFPwAAAGBhfGkXAACc0dRF23zqlzWqdyNHgkDx9T3gL3V9L40dO1aHDx/W4sWLvdrXrl2rIUOGqKioSG3atPFfgEGEGX4AAADAwij4AQAAcFZwOp26+OKLvdqys7PVtWtXz/bYsWM1cuRIZWZmKiYmRm3atFFGRoYqKir0P//zP2rXrp06deqkV155xWuchx56SN27d1dERIS6deumRx991OtOxVXH/stf/qKuXbuqdevWuvnmm1VSUtKYpyyJgh8AAADwsmbNGh04cEDr1q3TrFmz5HQ6NWzYMLVt21b//Oc/dffdd+vuu+/Wvn37PM+JiorS/PnztWPHDj377LN66aWX9Mc//tFr3P/+979avHix3nnnHb3zzjvKy8vTjBkzGv18KPgBAABgCe+8844iIyO9HkOHDq3zOO3atdNzzz2n5ORk3XHHHUpOTlZpaakefvhhJSUlaerUqQoLC9OHH37oec4jjzyilJQUde3aVcOHD9dvf/tbvfnmm17jut1uzZ8/X7169dIVV1yh2267TatXr27weZ9JQAv+rl27ymazVXvce++9kr6/pbDT6VR8fLzCw8M1ePBgbd++PZAhAwAAoJkaMmSItmzZ4vX485//XOdxevbs6XXDq5iYGPXu/eOXiENDQ9W+fXsVFhZ62v7+97/r8ssvV2xsrCIjI/Xoo49q7969XuN27dpVUVFRnu24uDivMRpLQAv+jRs36uDBg5
5Hbm6uJOnGG2+UJM2cOVOzZs3S7NmztXHjRsXGxiotLa1J1joBAAAguLRq1Urnn3++1+Pcc8/17A8JCZExxus5J6+zr2K32722bTZbjW1ut1uStGHDBt18880aOnSo3nnnHX366aeaNm2aysvLzzhu1RiNKaCX5ezYsaPX9owZM3TeeecpNTVVxhhlZ2dr2rRpGjVqlCRpwYIFiomJUU5OjsaPHx+IkAEAABCkOnbsqIKCAhljZLPZJElbtmxp8LgffvihunTpomnTpnna8vPzGzyuvzSb6/CXl5frtdde0+TJk2Wz2fTll1+qoKBA6enpnj4Oh0Opqalav359rQV/WVmZysrKPNvFxcWSvv/trabf4Pyt6hhNcSwEBjm2NvJrfeS4fkLl2yxkoF9X8ttwLpdLxhi53W7v2edTZsYbW20z31Uz9FUxntx+atvJ47jdbl155ZX65ptv9OSTT+r666/XypUrtXz5ckVHR3v61TbO6dq6deumvXv3KicnR5dccomWLVumt99+2+v4VXGfGvPpzrVqnzFGLpdLoaGhXvt8fZ83m4J/8eLFOnz4sMaOHStJKigokPT9mqmTxcTEnPY3pqysLGVkZFRrX7VqlSIiIvwX8BlULU+CdZFjayO/1keO6+aS0DP3kaRly/Y0ahy+Ir/116JFC8XGxuro0aNeS1Ieurpzk8ZRNWlbm1OXeLtcLlVUVFR7Xmlpqaf/ueeeq6efflqzZs3S448/ruHDh+vee+/VggULvCaJTx2noqJC5eXlXm1ut1snTpxQcXGxhgwZonvuuUf333+/ysvLlZaWpt/97neaMWOG5zllZWWqrKz0GuPEiRNyu92nPdfy8nIdP35c69atU0VFRY3ndiY2c+pCpgD56U9/qrCwMC1dulSStH79eg0aNEgHDhxQXFycp9+4ceO0b98+rVixosZxaprhT0hI0KFDhxQdHd24J6Hv3yS5ublKS0urtk4L1kCOrY38Wh85rp+MpTt86jd9eI9GjuT0yG/DnThxQvv27VPXrl3VsmXLQIdTjTFGJSUlioqK8izLsbITJ05oz549SkhIqJaP4uJidejQQUeOHDltndssZvjz8/P17rvvatGiRZ622NhYSd/P9J9c8BcWFlab9T+Zw+GQw+Go1m6325v0g9/Ux0PTI8fWRn6tjxzXTaWP1/loLq8p+a2/yspK2Ww2hYSEeF2pprmoWv5SFaPVhYSEeL40fOp72tf3eLN4lebNm6dzzjlH1113nactMTFRsbGxXv8lV15erry8PKWkpAQiTAAAACDoBHyG3+12a968eRozZoxatPgxHJvNpkmTJikzM1NJSUlKSkpSZmamIiIiNHr06ABGDAAAAASPgBf87777rvbu3as77rij2r4pU6bo+PHjmjBhgoqKijRgwACtWrXK64YFAAAAAGoX8II/PT292g0QqthsNjmdTjmdzqYNCgAAALCIZrGGHwAAAEDjoOAHAAAALIyCHwAAALAwCn4AAADAwgL+pV0AAAAEgaUTm/Z4w59t2uNZGDP8AAAAsIz169crNDRUP/vZz7za9+zZI5vNpi1btgQmsACi4AcAAIBlvPLKK7r//vv1wQcfaO/evYEOp1mg4AcAAIAlHDt2TG+++abuueceDRs2TPPnzw90SM0CBT8AAAAs4a9//auSk5OVnJysW2+9VfPmzav1Bq9nEwp+AAAAWMLLL7+sW2+9VZL0s5/9TEePHtXq1asDHFXgUfADAAAg6O3cuVMff/yxbr75ZklSixYtdNNNN+mVV14JcGSBx2U5AQAAEPRefvllVVRU6Nxzz/W0GWNkt9tVVFQUwMgCjxl+AAAABLWKigq9+uqreuaZZ7RlyxbP41//+pe6dOmi119/PdAhBhQz/AAAAAhq77zzjoqKinTnnXeqdevWXvtuuOEGvfzyyxo2bFiAogs8Cn4AAACcWTO+8+3LL7+sa665plqxL0nXX3+9MjMz9d133wUgsu
aBgh8AAABBbenSpbXu+8lPfuK5NOfZeolO1vADAAAAFkbBDwAAAFgYBT8AAABgYRT8AAAAgIVR8AMAAKCas/ULrs2NP/JAwQ8AAACP0NBQSVJ5eXmAI4EklZaWSpLsdnu9x+CynAAAAPBo0aKFIiIi9M0338hutyskpHnND7vdbpWXl+vEiRPNLjZ/MsaotLRUhYWFatOmjecXsfqg4AcAAICHzWZTXFycdu/erfz8/ECHU40xRsePH1d4eLhsNlugw2l0bdq0UWxsbIPGoOAHAACAl7CwMCUlJTXLZT0ul0vr1q3TlVde2aBlLsHAbrc3aGa/CgU/AABnsamLtgU6BDRTISEhatmyZaDDqCY0NFQVFRVq2bKl5Qt+f7HuwicAAAAAFPwAAACAlVHwAwAAABYW8IJ///79uvXWW9W+fXtFRETo4osv1ubNmz37jTFyOp2Kj49XeHi4Bg8erO3btwcwYgAAACB4BLTgLyoq0qBBg2S327V8+XLt2LFDzzzzjNq0aePpM3PmTM2aNUuzZ8/Wxo0bFRsbq7S0NJWUlAQucAAAACBIBPQqPU8++aQSEhI0b948T1vXrl09fzfGKDs7W9OmTdOoUaMkSQsWLFBMTIxycnI0fvz4pg4ZAAAACCoBLfiXLFmin/70p7rxxhuVl5enc889VxMmTNC4ceMkSbt371ZBQYHS09M9z3E4HEpNTdX69etrLPjLyspUVlbm2S4uLpb0/TVbXS5XI5+RPMdoimMhMMixtZFf6yPH3kLl9ut4gX5dya/1keMf+foa2IwxppFjqVXVtV0nT56sG2+8UR9//LEmTZqk//f//p9+/etfa/369Ro0aJD279+v+Ph4z/Puuusu5efna+XKldXGdDqdysjIqNaek5OjiIiIxjsZAAAAoAmVlpZq9OjROnLkiKKjo2vtF9AZfrfbrf79+yszM1OS1LdvX23fvl1z5szRr3/9a0+/U2+bbIyp9VbKU6dO1eTJkz3bxcXFSkhIUHp6+mlfCH9xuVzKzc1VWloaN4OwqKDK8fKHfOs39MnGjSOIBFV+US/k2FvG0h1+HW/68B5+Ha+uyK/1keMfVa1kOZOAFvxxcXHq0cP7H4YLL7xQb731liQpNjZWklRQUKC4uDhPn8LCQsXExNQ4psPhkMPhqNZut9ub9E3R1MdD0wuKHNsqfevX3M8jAIIiv2gQcvy9Sj9fv6O5vKbk1/rIse+ft4BepWfQoEHauXOnV9sXX3yhLl26SJISExMVGxur3Nxcz/7y8nLl5eUpJSWlSWMFAAAAglFAZ/gffPBBpaSkKDMzU7/85S/18ccfa+7cuZo7d66k75fyTJo0SZmZmUpKSlJSUpIyMzMVERGh0aNHBzJ0AAAAICgEtOC/5JJL9Pbbb2vq1Kn6/e9/r8TERGVnZ+uWW27x9JkyZYqOHz+uCRMmqKioSAMGDNCqVasUFRUVwMgBAACA4BDQgl+Shg0bpmHDhtW632azyel0yul0Nl1QAAAAgEUEdA0/AAAAgMZFwQ8AAABYGAU/AAAAYGEU/AAAAICFUfADAAAAFkbBDwAAAFhYwC/LCeAstnSib/2GP9u4cQBBZOqibT71yxrVu5EjARAsmOEHAAAALIyCHwAAALAwCn4AAADAwij4AQAAAAuj4AcAAAAsjIIfAAAAsDAKfgAAAMDCuA4/AABoctxPAGg6zPADAAAAFkbBDwAAAFgYBT8AAABgYazhB4CGWDrRt37Dn23cOAAAqAUz/AAAAICFUfADAAAAFkbBDwAAAFgYBT8AAABgYRT8AAAAgIVR8AMAAAAWRsEPAAAAWBjX4QcAoBmYumhboEMAYFHM8AMAAAAWRsEPAAAAWBgFPwAAAGBhAS34nU6nbDab1yM2Ntaz3xgjp9Op+Ph4hYeHa/Dgwdq+fXsAIwYAAACCS8Bn+Hv27KmDBw96Htu2/filpZkzZ2rWrFmaPXu2Nm7cqNjYWKWlpamkpCSAEQMAAA
DBI+AFf4sWLRQbG+t5dOzYUdL3s/vZ2dmaNm2aRo0apV69emnBggUqLS1VTk5OgKMGAAAAgkPAL8u5a9cuxcfHy+FwaMCAAcrMzFS3bt20e/duFRQUKD093dPX4XAoNTVV69ev1/jx42scr6ysTGVlZZ7t4uJiSZLL5ZLL5Wrck/nhOCf/CesJqhybUN/6BepcmmF8dc5vMzwHnF5z/QyHyu3X8Xw9v+Z+3LrmqbnmF/5Djn/k62tgM8aYRo6lVsuXL1dpaam6d++ur7/+Wo8//rg+//xzbd++XTt37tSgQYO0f/9+xcfHe55z1113KT8/XytXrqxxTKfTqYyMjGrtOTk5ioiIaLRzAQAAAJpSaWmpRo8erSNHjig6OrrWfgEt+E917NgxnXfeeZoyZYouu+wyDRo0SAcOHFBcXJynz7hx47Rv3z6tWLGixjFqmuFPSEjQoUOHTvtC+IvL5VJubq7S0tJkt9sb/XhoekGV4+UP+dZv6JONG0dtmmF8dc5vMzwHnF5z/QxnLN3h1/GmD+9hieP6Ol6V5ppf+A85/lFxcbE6dOhwxoI/4Et6TtaqVSv17t1bu3bt0siRIyVJBQUFXgV/YWGhYmJiah3D4XDI4XBUa7fb7U36pmjq46HpBUWObZW+9QvUeTTj+HzObzM+B5xec/sMV/r5a3W+nltzP259c9Tc8gv/I8e+fz4C/qXdk5WVlemzzz5TXFycEhMTFRsbq9zcXM/+8vJy5eXlKSUlJYBRAgAAAMEjoDP8v/vd7zR8+HB17txZhYWFevzxx1VcXKwxY8bIZrNp0qRJyszMVFJSkpKSkpSZmamIiAiNHj06kGEDOBssf8j32XvgNKYu2nbmTgDQiAJa8H/11Vf61a9+pUOHDqljx4667LLLtGHDBnXp0kWSNGXKFB0/flwTJkxQUVGRBgwYoFWrVikqKiqQYQMAAABBI6AF/8KFC0+732azyel0yul0Nk1AAAAAgMU0qzX8AAAAAPyrWV2lB0Azt3Sib/2GP9u4cQAAAJ8xww8AAABYGAU/AAAAYGEU/AAAAICFsYYfAAA0W3W9j0Go3LokVMpYuqPGu/lmjertr9CAoMEMPwAAAGBhFPwAAACAhVHwAwAAABbGGn4AACyormvfAVgXM/wAAACAhVHwAwAAABZGwQ8AAABYGGv4ATR/Syf61m/4s40bBwAAQYgZfgAAAMDCKPgBAAAAC6PgBwAAACyMNfwAzi5n+j6ACZWU2iShAFbE9f+B5ocZfgAAAMDCKPgBAAAAC6PgBwAAACyMgh8AAACwML60C8D/fL1RllWOCwBAM8YMPwAAAGBhFPwAAACAhdWr4O/WrZu+/fbbau2HDx9Wt27dGhwUAAAAAP+o1xr+PXv2qLKyslp7WVmZ9u/f3+CgAABo7rjBFIBgUaeCf8mSJZ6/r1y5Uq1bt/ZsV1ZWavXq1eratavfggMAAADQMHUq+EeOHClJstlsGjNmjNc+u92url276plnnvFbcAAAAAAapk4Fv9vtliQlJiZq48aN6tChQ6MEBQAAAMA/6rWGf/fu3f6OQ1lZWXr44Yc1ceJEZWdnS5KMMcrIyNDcuXNVVFSkAQMG6Pnnn1fPnj39fnwAAGB9vn73ImtU70aOBGg69b7x1urVq7V69WoVFhZ6Zv6rvPLKK3Uaa+PGjZo7d6769Onj1T5z5kzNmjVL8+fPV/fu3fX4448rLS1NO3fuVFRUVH1DBwAAAM4a9bosZ0ZGhtLT07V69WodOnRIRUVFXo+6OHr0qG655Ra99NJLatu2rafdGKPs7GxNmzZNo0aNUq9evbRgwQKVlpYqJyenPmEDAAAAZ516zfC/+OKLmj9/vm677bYGB3Dvvffquuuu0zXXXKPHH3/c0757924VFBQoPT3d0+ZwOJSamqr169dr/PjxNY5XVlamsrIyz3ZxcbEkyeVyyeVyNTjeM6k6RlMcC4ERVDk2ob718/VcfB0viLl+OEeXv8
81GN4vZwl/fYZD5T5zJzS5kB/yEtLA/ATFv/FnqaD6OdzIfH0N6lXwl5eXKyUlpT5P9bJw4UJ98skn2rhxY7V9BQUFkqSYmBiv9piYGOXn59c6ZlZWljIyMqq1r1q1ShEREQ2M2He5ublNdiwERnDkONW3bsuW+Xc8C8jV5ZLx44A+v8ZoKg39DF9i/d9/g1q/0L0Nev6yZXv8EwgaTXD8HG5cpaWlPvWrV8H/m9/8Rjk5OXr00Ufr83RJ0r59+zRx4kStWrVKLVu2rLWfzWbz2jbGVGs72dSpUzV58mTPdnFxsRISEpSenq7o6Oh6x+srl8ul3NxcpaWlyW63N/rx0PSCKsfLH/Kt39An/TteEHOZUOXqcqXpA9lt1W8wWG++vsZodP76DGcs3eHHqOAvIXKrX+heba7sLHf9Vi5LkqYP7+HHqOBPQfVzuJFVrWQ5k3oV/CdOnNDcuXP17rvvqk+fPtVe7FmzZp1xjM2bN6uwsFD9+vXztFVWVmrdunWaPXu2du7cKen7mf64uDhPn8LCwmqz/idzOBxyOBzV2u12e5O+KZr6eGh6QZFjXwtWX8/DnwVwc2Yku63SvwV/c3+vnIUa+hmubEAxicbnVkiDctTs/31HcPwcbmS+nn+9Cv6tW7fq4osvliT9+9//9tp3utn3k1199dXats370li33367LrjgAj300EPq1q2bYmNjlZubq759+0r6filRXl6ennySmTIAAADAF/Uq+N97770GHzgqKkq9evXyamvVqpXat2/vaZ80aZIyMzOVlJSkpKQkZWZmKiIiQqNHj27w8QEAAICzQb2vw98UpkyZouPHj2vChAmeG2+tWrWKa/ADAAAAPqpXwT9kyJDTLt1Zs2ZNvYJZu3at17bNZpPT6ZTT6azXeAAAAMDZrl4Ff9X6/Soul0tbtmzRv//9b40ZM8YfcQEAAADwg3oV/H/84x9rbHc6nTp69GiDAgIAAADgP369ptitt96qV155xZ9DAgAAAGgAvxb8H3300WlvogUAAACgadVrSc+oUaO8to0xOnjwoDZt2tSgu+8CAAAA8K96FfytW7f22g4JCVFycrJ+//vfKz093S+BAQAQCFMXbTtzJwAIIvUq+OfNm+fvOAAAAAA0ggbdeGvz5s367LPPZLPZ1KNHD/Xt29dfcQEAAADwg3oV/IWFhbr55pu1du1atWnTRsYYHTlyREOGDNHChQvVsWNHf8cJAAAAoB7qVfDff//9Ki4u1vbt23XhhRdKknbs2KExY8bogQce0BtvvOHXIAGcYunEQEcAAACCRL0K/hUrVujdd9/1FPuS1KNHDz3//PN8aRcAAABoRup1HX632y273V6t3W63y+12NzgoAAAAAP5Rr4L/qquu0sSJE3XgwAFP2/79+/Xggw/q6quv9ltwAAAAABqmXkt6Zs+erREjRqhr165KSEiQzWbT3r171bt3b7322mv+jhFAY+M7AQAAWFa9Cv6EhAR98sknys3N1eeffy5jjHr06KFrrrnG3/EBAAAAaIA6LelZs2aNevTooeLiYklSWlqa7r//fj3wwAO65JJL1LNnT73//vuNEigAAACAuqtTwZ+dna1x48YpOjq62r7WrVtr/PjxmjVrlt+CAwAAANAwdVrS869//UtPPvlkrfvT09P19NNPNzgoAMAZ+Pq9i+HPNm4cAIBmr04z/F9//XWNl+Os0qJFC33zzTcNDgoAAACAf9Sp4D/33HO1bdu2Wvdv3bpVcXFxDQ4KAAAAgH/UqeC/9tpr9dhjj+nEiRPV9h0/flzTp0/XsGHD/BYcAAAAgIap0xr+Rx55RIsWLVL37t113333KTk5WTabTZ999pmef/55VVZWatq0aY0VKxC8WG8N3gMBN3XR9/9DHSq3LgmVMpbuUGX97j8JAEGlTgV/TEyM1q9fr3vuuUdTp06VMUaSZLPZ9NOf/lQvvPCCYmJiGiVQAAAAAHVX5xtvdenSRcuWLVNRUZ
H+85//yBijpKQktW3btjHiAwAAANAA9brTriS1bdtWl1xyiT9jAQAAAOBnLF4EAAAALIyCHwAAALAwCn4AAADAwij4AQAAAAuj4AcAAAAsLKAF/5w5c9SnTx9FR0crOjpaAwcO1PLlyz37jTFyOp2Kj49XeHi4Bg8erO3btwcwYgAAACC4BLTg79Spk2bMmKFNmzZp06ZNuuqqqzRixAhPUT9z5kzNmjVLs2fP1saNGxUbG6u0tDSVlJQEMmwAAAAgaAS04B8+fLiuvfZade/eXd27d9cTTzyhyMhIbdiwQcYYZWdna9q0aRo1apR69eqlBQsWqLS0VDk5OYEMGwAAAAga9b7xlr9VVlbqb3/7m44dO6aBAwdq9+7dKigoUHp6uqePw+FQamqq1q9fr/Hjx9c4TllZmcrKyjzbxcXFkiSXyyWXy9W4J/HDcU7+E9ZTrxybUF8H9+94qDPXD6+tK1Cvsb/fA/xb5BEqtyQp5JQ/YS3+yi8/x5svaq0f+foa2IwxppFjOa1t27Zp4MCBOnHihCIjI5WTk6Nrr71W69ev16BBg7R//37Fx8d7+t91113Kz8/XypUraxzP6XQqIyOjWntOTo4iIiIa7TwAAACAplRaWqrRo0fryJEjio6OrrVfwGf4k5OTtWXLFh0+fFhvvfWWxowZo7y8PM9+m83m1d8YU63tZFOnTtXkyZM928XFxUpISFB6evppXwh/cblcys3NVVpamux2e6MfD02vXjle/pBv/YY+6d/xUGcuE6pcXa40fSC7rbLpA/D3e8DX8c4CGUt3SPp+5rdf6F5truwsNxersxx/5Xf68B5+jAr+RK31o6qVLGcS8II/LCxM559/viSpf//+2rhxo5599lk99ND3P8wKCgoUFxfn6V9YWKiYmJhax3M4HHI4HNXa7XZ7k74pmvp4aHp1yrGvhaO/x0P9GMluqwxMwR+o99RZoPKU4s+tkGptsI6G5pef4c0ftZbv79Nm9y+dMUZlZWVKTExUbGyscnNzPfvKy8uVl5enlJSUAEYIAAAABI+AzvA//PDDGjp0qBISElRSUqKFCxdq7dq1WrFihWw2myZNmqTMzEwlJSUpKSlJmZmZioiI0OjRowMZNgCguVo6sdZdI7/6TpLktrVQYeeRTRQQrG7qom1n7JM1qncTRALULqAF/9dff63bbrtNBw8eVOvWrdWnTx+tWLFCaWlpkqQpU6bo+PHjmjBhgoqKijRgwACtWrVKUVFRgQwbAAAACBoBLfhffvnl0+632WxyOp1yOp1NExAAAABgMc1uDT8AAAAA/wn4VXoAAI3oNGvavQx/tnHjqE1zjw8ALIAZfgAAAMDCKPgBAAAAC6PgBwAAACyMNfwAgKB28nXQq661DwD4ETP8AAAAgIVR8AMAAAAWRsEPAAAAWBhr+AEAAE5x8ndDgGDHDD8AAABgYRT8AAAAgIVR8AMAAAAWxhp+AIDf+LruOWtU70aOBABQhRl+AAAAwMIo+AEAAAALo+AHAAAALIw1/EBzsnRioCMAmkTVWv+RX31Xa58Bie1+3DjNZ+N0YwAAmOEHAAAALI2CHwAAALAwCn4AAADAwij4AQAAAAuj4AcAAAAsjIIfAAAAsDAKfgAAAMDCuA4/UJPTXQ/fhEpKlZY/JP18VpOFBASTkV/NDHQIAIAfMMMPAAAAWBgFPwAAAGBhFPwAAACAhQV0DX9WVpYWLVqkzz//XOHh4UpJSdGTTz6p5ORkTx9jjDIyMjR37lwVFRVpwIABev7559WzZ88ARg4AjeR03x85y/xz93eBDgFoUlMXbfOpX9ao3o0cCawmoDP8eXl5uvfee7Vhwwbl5uaqoqJC6enpOnbsmKfPzJkzNWvWLM2ePVsbN25UbGys0tLSVFJSEsDIAQAAgOAQ0Bn+FStWeG3PmzdP55xzjjZv3qwrr7xSxhhlZ2dr2rRpGjVqlCRpwYIFiomJUU5OjsaPHx+IsAEAAICg0awuy3
nkyBFJUrt27SRJu3fvVkFBgdLT0z19HA6HUlNTtX79+hoL/rKyMpWVlXm2i4uLJUkul0sul6sxw/cc5+Q/EaRMaK27XD/sc5lQydc8n2Y8NC9e+T2b+OnfrFC5JUluW7P68eKlKraQH2KFtVTltTnl19eaINTHmM/2GoNa60e+vgY2Y4xp5Fh8YozRiBEjVFRUpPfff1+StH79eg0aNEj79+9XfHy8p+9dd92l/Px8rVy5sto4TqdTGRkZ1dpzcnIUERHReCcAAAAANKHS0lKNHj1aR44cUXR0dK39ms0UzH333aetW7fqgw8+qLbPZrN5bRtjqrVVmTp1qiZPnuzZLi4uVkJCgtLT00/7QviLy+VSbm6u0tLSZLfbG/14aCTLH6p1l8uEKleXK00fyH5tZoPHQ/PilV9bZaDDaTpDn/TLMBlLd0iSrtuf7ZfxGoPb1kKHEoapw753FGIqTtv3/86d1DRBwW9C5Fa/0L3aXNlZbotejHD68B6BDiGgqLV+VLWS5UyaRcF///33a8mSJVq3bp06derkaY+NjZUkFRQUKC4uztNeWFiomJiYGsdyOBxyOBzV2u12e5O+KZr6ePCzMxV6RrLbKn3P8dlUOFpBVX7Pprz56d+ryh8KrDMV0s1BiKk4Y5yVFi0YzwZuhVg2f9QX36PW8v29ENBPgjFG9913nxYtWqQ1a9YoMTHRa39iYqJiY2OVm5vraSsvL1deXp5SUlKaOlwAAAAg6AR0hv/ee+9VTk6O/vGPfygqKkoFBQWSpNatWys8PFw2m02TJk1SZmamkpKSlJSUpMzMTEVERGj06NGBDB1Nxddrkg9/tnHjAKyOzxoAWFZAC/45c+ZIkgYPHuzVPm/ePI0dO1aSNGXKFB0/flwTJkzw3Hhr1apVioqKauJoAQAAgOAT0ILflwsE2Ww2OZ1OOZ3Oxg8IAAAAsBhrfpsFAAAAgKRmcpUeoMn4uk4ZAADAIpjhBwAAACyMgh8AAACwMAp+AAAAwMJYww8A8N0Zvgcz8qvvmigQAICvmOEHAAAALIyCHwAAALAwCn4AAADAwij4AQAAAAvjS7uwBm6oBdTLP3fzJVsAsDpm+AEAAAALo+AHAAAALIyCHwAAALAw1vADDcF3BwAATWzqom0+9csa1buRI0GwYIYfAAAAsDAKfgAAAMDCKPgBAAAAC6PgBwAAACyMgh8AAACwMAp+AAAAwMIo+AEAAAAL4zr8AADUYuRXM33qt7jTlEaOBKg7rtePKszwAwAAABZGwQ8AAABYGAU/AAAAYGEU/AAAAICFUfADAAAAFkbBDwAAAFgYBT8AAABgYQEt+NetW6fhw4crPj5eNptNixcv9tpvjJHT6VR8fLzCw8M1ePBgbd++PTDBAgAAAEEooAX/sWPHdNFFF2n27Nk17p85c6ZmzZql2bNna+PGjYqNjVVaWppKSkqaOFIAAAAgOAX0TrtDhw7V0KFDa9xnjFF2dramTZumUaNGSZIWLFigmJgY5eTkaPz48U0ZKgAAABCUAlrwn87u3btVUFCg9PR0T5vD4VBqaqrWr19fa8FfVlamsrIyz3ZxcbEkyeVyyeVyNW7QPxzn5D/RQCY00BFU4/ohJlczjA0Nd7bl121rtj8GGk3VOfvz3EPl9ttYaJiQH3IRQk58Fmw1C7XWj3x9DZrtv/QFBQWSpJiYGK/2mJgY5efn1/q8rKwsZWRkVGtftWqVIiIi/BvkaeTm5jbZsawtNdAB1CpXl0sm0FGgsZw1+e0c6AAC51DCML+NdYn2+G0s+Ee/0L2BDiFoLFu2J9Ah1Au1llRaWupTv2Zb8Fex2Wxe28aYam0nmzp1qiZPnuzZLi4uVkJCgtLT0xUdHd1ocVZxuVzKzc1VWlqa7HZ7ox/P8pY/FOgIqnGZUOXqcqXpA9ltlYEOB352tuV3U35RoENocm5bCx1KGKYO+95RiKnwy5j/d+4kv4yDhguRW/1C92pzZWe5uRihT6YP7x
HoEOqEWutHVStZzqTZFvyxsbGSvp/pj4uL87QXFhZWm/U/mcPhkMPhqNZut9ub9E3R1MezrOZacBnJbqs8KwrCs9JZlF9/FbzBKMRU+O38Kyksmx23QsiLj4K1XqHW8j13zfaTkJiYqNjYWK//rikvL1deXp5SUlICGBkAAAAQPAI6w3/06FH95z//8Wzv3r1bW7ZsUbt27dS5c2dNmjRJmZmZSkpKUlJSkjIzMxUREaHRo0cHMGr4xdKJgY4AAADgrBDQgn/Tpk0aMmSIZ7tq7f2YMWM0f/58TZkyRcePH9eECRNUVFSkAQMGaNWqVYqKigpUyAAAAEBQCWjBP3jwYBlT+2UwbDabnE6nnE5n0wUFAAAAWEizXcMPAAAAoOGa7VV6AAAA0PimLtrmU7+sUb0bORI0Fmb4AQAAAAuj4AcAAAAsjIIfAAAAsDDW8MO/uL4+UC//3P2dT/0GJLZr5EgAAFbDDD8AAABgYRT8AAAAgIVR8AMAAAAWxhp++LbufvizjR8HgDNirT+AQOF6/cGLGX4AAADAwij4AQAAAAuj4AcAAAAsjDX8TcHXa9OzTh6An/i61h/+MfKrmT71W9xpSiNHAgQPvhPQdJjhBwAAACyMgh8AAACwMAp+AAAAwMJYw9+c+Hutv6/jNfVYgAWwRh4AECyY4QcAAAAsjIIfAAAAsDAKfgAAAMDCKPgBAAAAC+NLuwBwkk35RVLn7/8MMRWBDgcAgo6vN9RC02GGHwAAALAwCn4AAADAwij4AQAAAAtjDT8AAE1k5Fczfeq3uNOURo4EwNmEGX4AAADAwij4AQAAAAuj4AcAAAAsLCjW8L/wwgt66qmndPDgQfXs2VPZ2dm64oorAh1W4CydGOgIAAAAmhVfr/+fNap3QMYLpGY/w//Xv/5VkyZN0rRp0/Tpp5/qiiuu0NChQ7V3795AhwYAAAA0e82+4J81a5buvPNO/eY3v9GFF16o7OxsJSQkaM6cOYEODQAAAGj2mvWSnvLycm3evFn/+7//69Wenp6u9evX1/icsrIylZWVebaPHDkiSfruu+/kcrkaL9gfuFwulZaW6ttvv5Xdbv++8VhFox8XTcdljEpVqm9VIbutMtDhwM9KTrhVWlqqkhNuhRh3oMNBI3Dbmn+OK0qLAx1C0HLLrdLQUrkqi+Vu/vOa8MG3337rtV1jrSXfPzenjlcbf4/XGEpKSiRJxpjT9mvWBf+hQ4dUWVmpmJgYr/aYmBgVFBTU+JysrCxlZGRUa09MTGyUGHG2+lOgA0CjejXQAaDRNfcc5wQ6AKDZePosG68+SkpK1Lp161r3N+uCv4rNZvPaNsZUa6sydepUTZ482bPtdrv13XffqX379rU+x5+Ki4uVkJCgffv2KTo6utGPh6ZHjq2N/FofObY28mt95PhHxhiVlJQoPj7+tP2adcHfoUMHhYaGVpvNLywsrDbrX8XhcMjhcHi1tWnTprFCrFV0dPRZ/ya0OnJsbeTX+sixtZFf6yPH3zvdzH6VZr24LSwsTP369VNubq5Xe25urlJSUgIUFQAAABA8mvUMvyRNnjxZt912m/r376+BAwdq7ty52rt3r+6+++5AhwYAAAA0e82+4L/pppv07bff6ve//70OHjyoXr16admyZerSpUugQ6uRw+HQ9OnTqy0rgnWQY2sjv9ZHjq2N/FofOa47mznTdXwAAAAABK1mvYYfAAAAQMNQ8AMAAAAWRsEPAAAAWBgFPwAAAGBhFPw+WLdunYYPH674+HjZbDYtXrzYa78xRk6nU/Hx8QoPD9fgwYO1fft2rz5lZWW6//771aFDB7Vq1Uo///nP9dVXXzXhWaA2WVlZuuSSSxQVFaVzzjlHI0eO1M6dO736kOPgNmfOHPXp08dzk5aBAwdq+fLlnv3k11qysrJks9k0adIkTxs5Dm5Op1M2m83rERsb69lPfq1h//79uvXWW9W+fXtFRETo4osv1ubNmz37yXP9UfD74N
[base64-encoded PNG data omitted: histogram "Text Length Distribution", x-axis "Characters", y-axis "Count", classes "Human" and "AI"]",
|
| 259 |
-
"text/plain": [
|
| 260 |
-
"<Figure size 900x400 with 1 Axes>"
|
| 261 |
-
]
|
| 262 |
-
},
|
| 263 |
-
"metadata": {},
|
| 264 |
-
"output_type": "display_data"
|
| 265 |
-
}
|
| 266 |
-
],
|
| 267 |
-
"source": [
|
| 268 |
-
"# Visualize text-length distributions by class\n",
|
| 269 |
-
"df['text_len'] = df['text'].str.len()\n",
|
| 270 |
-
"ax = df[df['label'] == 0]['text_len'].hist(bins=40, alpha=0.6, label='Human', figsize=(9, 4))\n",
|
| 271 |
-
"df[df['label'] == 1]['text_len'].hist(bins=40, alpha=0.6, label='AI', ax=ax)\n",
|
| 272 |
-
"ax.set_title('Text Length Distribution')\n",
|
| 273 |
-
"ax.set_xlabel('Characters')\n",
|
| 274 |
-
"ax.set_ylabel('Count')\n",
|
| 275 |
-
"ax.legend()"
|
| 276 |
-
]
|
| 277 |
-
},
|
| 278 |
-
{
|
| 279 |
-
"cell_type": "code",
|
| 280 |
-
"execution_count": 6,
|
| 281 |
-
"id": "59fe88ce",
|
| 282 |
-
"metadata": {
|
| 283 |
-
"id": "59fe88ce"
|
| 284 |
-
},
|
| 285 |
-
"outputs": [
|
| 286 |
-
{
|
| 287 |
-
"data": {
|
| 288 |
-
"text/html": [
|
| 289 |
-
"<div>\n",
|
| 290 |
-
"<style scoped>\n",
|
| 291 |
-
" .dataframe tbody tr th:only-of-type {\n",
|
| 292 |
-
" vertical-align: middle;\n",
|
| 293 |
-
" }\n",
|
| 294 |
-
"\n",
|
| 295 |
-
" .dataframe tbody tr th {\n",
|
| 296 |
-
" vertical-align: top;\n",
|
| 297 |
-
" }\n",
|
| 298 |
-
"\n",
|
| 299 |
-
" .dataframe thead th {\n",
|
| 300 |
-
" text-align: right;\n",
|
| 301 |
-
" }\n",
|
| 302 |
-
"</style>\n",
|
| 303 |
-
"<table border=\"1\" class=\"dataframe\">\n",
|
| 304 |
-
" <thead>\n",
|
| 305 |
-
" <tr style=\"text-align: right;\">\n",
|
| 306 |
-
" <th></th>\n",
|
| 307 |
-
" <th>text</th>\n",
|
| 308 |
-
" <th>label</th>\n",
|
| 309 |
-
" </tr>\n",
|
| 310 |
-
" </thead>\n",
|
| 311 |
-
" <tbody>\n",
|
| 312 |
-
" <tr>\n",
|
| 313 |
-
" <th>0</th>\n",
|
| 314 |
-
" <td>हामीले पार्टी एकतापछि कि दुबै पार्टीको सिद्धान...</td>\n",
|
| 315 |
-
" <td>0</td>\n",
|
| 316 |
-
" </tr>\n",
|
| 317 |
-
" <tr>\n",
|
| 318 |
-
" <th>1</th>\n",
|
| 319 |
-
" <td>एमाले प्रतिनिधिसभाको प्रत्यक्षतर्फ ८० समानुपात...</td>\n",
|
| 320 |
-
" <td>0</td>\n",
|
| 321 |
-
" </tr>\n",
|
| 322 |
-
" <tr>\n",
|
| 323 |
-
" <th>2</th>\n",
|
| 324 |
-
" <td>नेकपा माओवादी केन्द्रका नेता रामनारायण विडारील...</td>\n",
|
| 325 |
-
" <td>1</td>\n",
|
| 326 |
-
" </tr>\n",
|
| 327 |
-
" <tr>\n",
|
| 328 |
-
" <th>3</th>\n",
|
| 329 |
-
" <td>प्रदेश नं २ का मुख्यमन्त्रीको रूपमा संघीय समाज...</td>\n",
|
| 330 |
-
" <td>1</td>\n",
|
| 331 |
-
" </tr>\n",
|
| 332 |
-
" <tr>\n",
|
| 333 |
-
" <th>4</th>\n",
|
| 334 |
-
" <td>बिहीबार एमालेका अध्यक्ष केपी शर्मा ओली र माओवा...</td>\n",
|
| 335 |
-
" <td>0</td>\n",
|
| 336 |
-
" </tr>\n",
|
| 337 |
-
" </tbody>\n",
|
| 338 |
-
"</table>\n",
|
| 339 |
-
"</div>"
|
| 340 |
-
],
|
| 341 |
-
"text/plain": [
|
| 342 |
-
" text label\n",
|
| 343 |
-
"0 हामीले पार्टी एकतापछि कि दुबै पार्टीको सिद्धान... 0\n",
|
| 344 |
-
"1 एमाले प्रतिनिधिसभाको प्रत्यक्षतर्फ ८० समानुपात... 0\n",
|
| 345 |
-
"2 नेकपा माओवादी केन्द्रका नेता रामनारायण विडारील... 1\n",
|
| 346 |
-
"3 प्रदेश नं २ का मुख्यमन्त्रीको रूपमा संघीय समाज... 1\n",
|
| 347 |
-
"4 बिहीबार एमालेका अध्यक्ष केपी शर्मा ओली र माओवा... 0"
|
| 348 |
-
]
|
| 349 |
-
},
|
| 350 |
-
"execution_count": 6,
|
| 351 |
-
"metadata": {},
|
| 352 |
-
"output_type": "execute_result"
|
| 353 |
-
}
|
| 354 |
-
],
|
| 355 |
-
"source": [
|
| 356 |
-
"# Keep only columns needed for training\n",
|
| 357 |
-
"df = df[['text', 'label']].copy()\n",
|
| 358 |
-
"df.head()"
|
| 359 |
-
]
|
| 360 |
-
},
|
| 361 |
-
{
|
| 362 |
-
"cell_type": "code",
|
| 363 |
-
"execution_count": 7,
|
| 364 |
-
"id": "434df9a2",
|
| 365 |
-
"metadata": {
|
| 366 |
-
"id": "434df9a2"
|
| 367 |
-
},
|
| 368 |
-
"outputs": [
|
| 369 |
-
{
|
| 370 |
-
"name": "stdout",
|
| 371 |
-
"output_type": "stream",
|
| 372 |
-
"text": [
|
| 373 |
-
"Using model: distilbert-base-multilingual-cased\n"
|
| 374 |
-
]
|
| 375 |
-
}
|
| 376 |
-
],
|
| 377 |
-
"source": [
|
| 378 |
-
"# Model/tokenizer config (smaller multilingual model for low-VRAM GPU)\n",
|
| 379 |
-
"MODEL_NAME = 'distilbert-base-multilingual-cased'\n",
|
| 380 |
-
"MAX_LEN = 96\n",
|
| 381 |
-
"\n",
|
| 382 |
-
"tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\n",
|
| 383 |
-
"print('Using model:', MODEL_NAME)"
|
| 384 |
-
]
|
| 385 |
-
},
|
| 386 |
-
{
|
| 387 |
-
"cell_type": "code",
|
| 388 |
-
"execution_count": 8,
|
| 389 |
-
"id": "ef7d53f9",
|
| 390 |
-
"metadata": {
|
| 391 |
-
"id": "ef7d53f9"
|
| 392 |
-
},
|
| 393 |
-
"outputs": [],
|
| 394 |
-
"source": [
|
| 395 |
-
"class NepaliDataset(Dataset):\n",
|
| 396 |
-
" def __init__(self, texts, labels):\n",
|
| 397 |
-
" self.texts = texts\n",
|
| 398 |
-
" self.labels = labels\n",
|
| 399 |
-
"\n",
|
| 400 |
-
" def __len__(self):\n",
|
| 401 |
-
" return len(self.texts)\n",
|
| 402 |
-
"\n",
|
| 403 |
-
" def __getitem__(self, idx):\n",
|
| 404 |
-
" return {\n",
|
| 405 |
-
" 'text': self.texts[idx],\n",
|
| 406 |
-
" 'label': int(self.labels[idx]),\n",
|
| 407 |
-
" }"
|
| 408 |
-
]
|
| 409 |
-
},
|
| 410 |
-
{
|
| 411 |
-
"cell_type": "code",
|
| 412 |
-
"execution_count": 9,
|
| 413 |
-
"id": "134a3fc1",
|
| 414 |
-
"metadata": {
|
| 415 |
-
"id": "134a3fc1"
|
| 416 |
-
},
|
| 417 |
-
"outputs": [
|
| 418 |
-
{
|
| 419 |
-
"name": "stdout",
|
| 420 |
-
"output_type": "stream",
|
| 421 |
-
"text": [
|
| 422 |
-
"Train: 1588 | Val: 398\n"
|
| 423 |
-
]
|
| 424 |
-
}
|
| 425 |
-
],
|
| 426 |
-
"source": [
|
| 427 |
-
"# Train/Validation Split\n",
|
| 428 |
-
"train_texts, val_texts, train_labels, val_labels = train_test_split(\n",
|
| 429 |
-
" df['text'].tolist(),\n",
|
| 430 |
-
" df['label'].tolist(),\n",
|
| 431 |
-
" test_size=0.2,\n",
|
| 432 |
-
" random_state=42,\n",
|
| 433 |
-
" stratify=df['label'].tolist(),\n",
|
| 434 |
-
")\n",
|
| 435 |
-
"print(f'Train: {len(train_texts)} | Val: {len(val_texts)}')"
|
| 436 |
-
]
|
| 437 |
-
},
|
| 438 |
-
{
|
| 439 |
-
"cell_type": "code",
|
| 440 |
-
"execution_count": 10,
|
| 441 |
-
"id": "dd226ed1",
|
| 442 |
-
"metadata": {
|
| 443 |
-
"id": "dd226ed1"
|
| 444 |
-
},
|
| 445 |
-
"outputs": [
|
| 446 |
-
{
|
| 447 |
-
"name": "stdout",
|
| 448 |
-
"output_type": "stream",
|
| 449 |
-
"text": [
|
| 450 |
-
"Batch size: 2 | Max length: 96\n"
|
| 451 |
-
]
|
| 452 |
-
}
|
| 453 |
-
],
|
| 454 |
-
"source": [
|
| 455 |
-
"train_dataset = NepaliDataset(train_texts, train_labels)\n",
|
| 456 |
-
"val_dataset = NepaliDataset(val_texts, val_labels)\n",
|
| 457 |
-
"\n",
|
| 458 |
-
"def collate_batch(batch):\n",
|
| 459 |
-
" texts = [item['text'] for item in batch]\n",
|
| 460 |
-
" labels = torch.tensor([item['label'] for item in batch], dtype=torch.long)\n",
|
| 461 |
-
" enc = tokenizer(\n",
|
| 462 |
-
" texts,\n",
|
| 463 |
-
" padding=True,\n",
|
| 464 |
-
" truncation=True,\n",
|
| 465 |
-
" max_length=MAX_LEN,\n",
|
| 466 |
-
" return_tensors='pt',\n",
|
| 467 |
-
" )\n",
|
| 468 |
-
" return {\n",
|
| 469 |
-
" 'input_ids': enc['input_ids'],\n",
|
| 470 |
-
" 'attention_mask': enc['attention_mask'],\n",
|
| 471 |
-
" 'labels': labels,\n",
|
| 472 |
-
" }\n",
|
| 473 |
-
"\n",
|
| 474 |
-
"BATCH_SIZE = 2\n",
|
| 475 |
-
"train_loader = DataLoader(\n",
|
| 476 |
-
" train_dataset,\n",
|
| 477 |
-
" batch_size=BATCH_SIZE,\n",
|
| 478 |
-
" shuffle=True,\n",
|
| 479 |
-
" collate_fn=collate_batch,\n",
|
| 480 |
-
" pin_memory=(torch.cuda.is_available()),\n",
|
| 481 |
-
")\n",
|
| 482 |
-
"val_loader = DataLoader(\n",
|
| 483 |
-
" val_dataset,\n",
|
| 484 |
-
" batch_size=BATCH_SIZE,\n",
|
| 485 |
-
" shuffle=False,\n",
|
| 486 |
-
" collate_fn=collate_batch,\n",
|
| 487 |
-
" pin_memory=(torch.cuda.is_available()),\n",
|
| 488 |
-
")\n",
|
| 489 |
-
"print('Batch size:', BATCH_SIZE, '| Max length:', MAX_LEN)"
|
| 490 |
-
]
|
| 491 |
-
},
|
| 492 |
-
{
|
| 493 |
-
"cell_type": "code",
|
| 494 |
-
"execution_count": 11,
|
| 495 |
-
"id": "51320951",
|
| 496 |
-
"metadata": {
|
| 497 |
-
"id": "51320951"
|
| 498 |
-
},
|
| 499 |
-
"outputs": [],
|
| 500 |
-
"source": [
|
| 501 |
-
"# === Model ===\n",
|
| 502 |
-
"class IndicBERTClassifier(nn.Module):\n",
|
| 503 |
-
" def __init__(self, dropout=0.2):\n",
|
| 504 |
-
" super(IndicBERTClassifier, self).__init__()\n",
|
| 505 |
-
" self.bert = AutoModel.from_pretrained(MODEL_NAME)\n",
|
| 506 |
-
" if hasattr(self.bert, 'gradient_checkpointing_enable'):\n",
|
| 507 |
-
" self.bert.gradient_checkpointing_enable()\n",
|
| 508 |
-
" self.dropout = nn.Dropout(dropout)\n",
|
| 509 |
-
" self.classifier = nn.Linear(self.bert.config.hidden_size, 2)\n",
|
| 510 |
-
"\n",
|
| 511 |
-
" def forward(self, input_ids, attention_mask):\n",
|
| 512 |
-
" outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)\n",
|
| 513 |
-
" cls_output = outputs.last_hidden_state[:, 0, :]\n",
|
| 514 |
-
" cls_output = self.dropout(cls_output)\n",
|
| 515 |
-
" return self.classifier(cls_output)"
|
| 516 |
-
]
|
| 517 |
-
},
|
| 518 |
-
{
|
| 519 |
-
"cell_type": "code",
|
| 520 |
-
"execution_count": 12,
|
| 521 |
-
"id": "944f918e",
|
| 522 |
-
"metadata": {
|
| 523 |
-
"id": "944f918e"
|
| 524 |
-
},
|
| 525 |
-
"outputs": [],
|
| 526 |
-
"source": [
|
| 527 |
-
"# Step 8: Create a custom Dataset class\n",
|
| 528 |
-
"class NepaliTextDataset(Dataset):\n",
|
| 529 |
-
" def __init__(self, input_ids, attention_mask, labels):\n",
|
| 530 |
-
" self.input_ids = input_ids\n",
|
| 531 |
-
" self.attention_mask = attention_mask\n",
|
| 532 |
-
" self.labels = labels\n",
|
| 533 |
-
"\n",
|
| 534 |
-
" def __len__(self):\n",
|
| 535 |
-
" return len(self.labels)\n",
|
| 536 |
-
"\n",
|
| 537 |
-
" def __getitem__(self, idx):\n",
|
| 538 |
-
" return {\n",
|
| 539 |
-
" 'input_ids': torch.tensor(self.input_ids[idx]),\n",
|
| 540 |
-
" 'attention_mask': torch.tensor(self.attention_mask[idx]),\n",
|
| 541 |
-
" 'labels': torch.tensor(self.labels[idx])\n",
|
| 542 |
-
" }"
|
| 543 |
-
]
|
| 544 |
-
},
|
| 545 |
-
{
|
| 546 |
-
"cell_type": "code",
|
| 547 |
-
"execution_count": 13,
|
| 548 |
-
"id": "a9d426e1",
|
| 549 |
-
"metadata": {
|
| 550 |
-
"id": "a9d426e1"
|
| 551 |
-
},
|
| 552 |
-
"outputs": [
|
| 553 |
-
{
|
| 554 |
-
"name": "stderr",
|
| 555 |
-
"output_type": "stream",
|
| 556 |
-
"text": [
|
| 557 |
-
"Loading weights: 100%|██████████| 100/100 [00:00<00:00, 11666.08it/s]\n",
|
| 558 |
-
"\u001b[1mDistilBertModel LOAD REPORT\u001b[0m from: distilbert-base-multilingual-cased\n",
|
| 559 |
-
"Key | Status | | \n",
|
| 560 |
-
"------------------------+------------+--+-\n",
|
| 561 |
-
"vocab_layer_norm.bias | UNEXPECTED | | \n",
|
| 562 |
-
"vocab_transform.weight | UNEXPECTED | | \n",
|
| 563 |
-
"vocab_layer_norm.weight | UNEXPECTED | | \n",
|
| 564 |
-
"vocab_transform.bias | UNEXPECTED | | \n",
|
| 565 |
-
"vocab_projector.bias | UNEXPECTED | | \n",
|
| 566 |
-
"\n",
|
| 567 |
-
"\u001b[3mNotes:\n",
|
| 568 |
-
"- UNEXPECTED\u001b[3m\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\u001b[0m\n"
|
| 569 |
-
]
|
| 570 |
-
}
|
| 571 |
-
],
|
| 572 |
-
"source": [
|
| 573 |
-
"\n",
|
| 574 |
-
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
|
| 575 |
-
"model = IndicBERTClassifier().to(device)"
|
| 576 |
-
]
|
| 577 |
-
},
|
| 578 |
-
{
|
| 579 |
-
"cell_type": "code",
|
| 580 |
-
"execution_count": 14,
|
| 581 |
-
"id": "2740c14a",
|
| 582 |
-
"metadata": {
|
| 583 |
-
"id": "2740c14a"
|
| 584 |
-
},
|
| 585 |
-
"outputs": [
|
| 586 |
-
{
|
| 587 |
-
"name": "stdout",
|
| 588 |
-
"output_type": "stream",
|
| 589 |
-
"text": [
|
| 590 |
-
"Grad accumulation steps: 4\n"
|
| 591 |
-
]
|
| 592 |
-
}
|
| 593 |
-
],
|
| 594 |
-
"source": [
|
| 595 |
-
"# === Optimizer, Scheduler & Loss ===\n",
|
| 596 |
-
"optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)\n",
|
| 597 |
-
"loss_fn = nn.CrossEntropyLoss()\n",
|
| 598 |
-
"\n",
|
| 599 |
-
"max_epochs = 6\n",
|
| 600 |
-
"grad_accum_steps = 4 # effective batch = BATCH_SIZE * grad_accum_steps\n",
|
| 601 |
-
"steps_per_epoch = math.ceil(len(train_loader) / grad_accum_steps)\n",
|
| 602 |
-
"total_steps = steps_per_epoch * max_epochs\n",
|
| 603 |
-
"warmup_steps = int(0.1 * total_steps)\n",
|
| 604 |
-
"scheduler = get_linear_schedule_with_warmup(\n",
|
| 605 |
-
" optimizer,\n",
|
| 606 |
-
" num_warmup_steps=warmup_steps,\n",
|
| 607 |
-
" num_training_steps=total_steps,\n",
|
| 608 |
-
")\n",
|
| 609 |
-
"print('Grad accumulation steps:', grad_accum_steps)"
|
| 610 |
-
]
|
| 611 |
-
},
|
| 612 |
-
{
|
| 613 |
-
"cell_type": "code",
|
| 614 |
-
"execution_count": 15,
|
| 615 |
-
"id": "14ce04bd",
|
| 616 |
-
"metadata": {
|
| 617 |
-
"id": "14ce04bd"
|
| 618 |
-
},
|
| 619 |
-
"outputs": [],
|
| 620 |
-
"source": [
|
| 621 |
-
"# === Training Loop ===\n",
|
| 622 |
-
"def train(model, loader):\n",
|
| 623 |
-
" model.train()\n",
|
| 624 |
-
" total_loss = 0\n",
|
| 625 |
-
" for batch in loader:\n",
|
| 626 |
-
" input_ids = batch['input_ids'].to(device)\n",
|
| 627 |
-
" attention_mask = batch['attention_mask'].to(device)\n",
|
| 628 |
-
" labels = batch['labels'].to(device)\n",
|
| 629 |
-
"\n",
|
| 630 |
-
" optimizer.zero_grad()\n",
|
| 631 |
-
" outputs = model(input_ids, attention_mask)\n",
|
| 632 |
-
" loss = loss_fn(outputs, labels)\n",
|
| 633 |
-
" loss.backward()\n",
|
| 634 |
-
" optimizer.step()\n",
|
| 635 |
-
" total_loss += loss.item()\n",
|
| 636 |
-
" return total_loss / len(loader)\n",
|
| 637 |
-
"\n",
|
| 638 |
-
"# === Evaluation ===\n",
|
| 639 |
-
"def evaluate(model, loader):\n",
|
| 640 |
-
" model.eval()\n",
|
| 641 |
-
" preds, true = [], []\n",
|
| 642 |
-
" with torch.no_grad():\n",
|
| 643 |
-
" for batch in loader:\n",
|
| 644 |
-
" input_ids = batch['input_ids'].to(device)\n",
|
| 645 |
-
" attention_mask = batch['attention_mask'].to(device)\n",
|
| 646 |
-
" labels = batch['labels'].to(device)\n",
|
| 647 |
-
"\n",
|
| 648 |
-
" outputs = model(input_ids, attention_mask)\n",
|
| 649 |
-
" pred_labels = torch.argmax(outputs, dim=1)\n",
|
| 650 |
-
" preds.extend(pred_labels.cpu().numpy())\n",
|
| 651 |
-
" true.extend(labels.cpu().numpy())\n",
|
| 652 |
-
"\n",
|
| 653 |
-
" print(classification_report(true, preds, target_names=[\"Human\", \"AI\"]))\n"
|
| 654 |
-
]
|
| 655 |
-
},
|
| 656 |
-
{
|
| 657 |
-
"cell_type": "code",
|
| 658 |
-
"execution_count": null,
|
| 659 |
-
"id": "d24e91b7",
|
| 660 |
-
"metadata": {
|
| 661 |
-
"colab": {
|
| 662 |
-
"base_uri": "https://localhost:8080/"
|
| 663 |
-
},
|
| 664 |
-
"id": "d24e91b7",
|
| 665 |
-
"outputId": "33ef8227-5c71-4c0d-88e7-b1a9e30b45f4"
|
| 666 |
-
},
|
| 667 |
-
"outputs": [
|
| 668 |
-
{
|
| 669 |
-
"name": "stdout",
|
| 670 |
-
"output_type": "stream",
|
| 671 |
-
"text": [
|
| 672 |
-
"\n",
|
| 673 |
-
"Epoch 1/6\n"
|
| 674 |
-
]
|
| 675 |
-
},
|
| 676 |
-
{
|
| 677 |
-
"name": "stderr",
|
| 678 |
-
"output_type": "stream",
|
| 679 |
-
"text": [
|
| 680 |
-
"/tmp/ipykernel_155548/4183901742.py:4: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.\n",
|
| 681 |
-
" scaler = GradScaler(enabled=use_amp)\n",
|
| 682 |
-
"/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
|
| 683 |
-
" with autocast(enabled=use_amp):\n"
|
| 684 |
-
]
|
| 685 |
-
},
|
| 686 |
-
{
|
| 687 |
-
"name": "stdout",
|
| 688 |
-
"output_type": "stream",
|
| 689 |
-
"text": [
|
| 690 |
-
"Batch 0 | Loss: 0.8206\n",
|
| 691 |
-
"Batch 50 | Loss: 0.8677\n",
|
| 692 |
-
"Batch 100 | Loss: 0.8435\n",
|
| 693 |
-
"Batch 150 | Loss: 0.6523\n",
|
| 694 |
-
"Batch 200 | Loss: 0.7219\n",
|
| 695 |
-
"Batch 250 | Loss: 0.5793\n",
|
| 696 |
-
"Batch 300 | Loss: 0.6833\n",
|
| 697 |
-
"Batch 350 | Loss: 0.5742\n",
|
| 698 |
-
"Batch 400 | Loss: 0.4844\n",
|
| 699 |
-
"Batch 450 | Loss: 0.5671\n",
|
| 700 |
-
"Batch 500 | Loss: 0.5363\n",
|
| 701 |
-
"Batch 550 | Loss: 0.5386\n",
|
| 702 |
-
"Batch 600 | Loss: 0.5520\n",
|
| 703 |
-
"Batch 650 | Loss: 0.7692\n",
|
| 704 |
-
"Batch 700 | Loss: 0.4680\n",
|
| 705 |
-
"Batch 750 | Loss: 0.6353\n",
|
| 706 |
-
"Train | Loss: 0.6600 | Acc: 0.5913 | F1: 0.5895\n"
|
| 707 |
-
]
|
| 708 |
-
},
|
| 709 |
-
{
|
| 710 |
-
"name": "stderr",
|
| 711 |
-
"output_type": "stream",
|
| 712 |
-
"text": [
|
| 713 |
-
"/tmp/ipykernel_155548/4183901742.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
|
| 714 |
-
" with autocast(enabled=use_amp):\n"
|
| 715 |
-
]
|
| 716 |
-
},
|
| 717 |
-
{
|
| 718 |
-
"name": "stdout",
|
| 719 |
-
"output_type": "stream",
|
| 720 |
-
"text": [
|
| 721 |
-
"Validation | Loss: 0.5192 | Acc: 0.8015 | F1: 0.7812\n",
|
| 722 |
-
" precision recall f1-score support\n",
|
| 723 |
-
"\n",
|
| 724 |
-
" Human 0.75 0.90 0.82 198\n",
|
| 725 |
-
" AI 0.88 0.70 0.78 200\n",
|
| 726 |
-
"\n",
|
| 727 |
-
" accuracy 0.80 398\n",
|
| 728 |
-
" macro avg 0.81 0.80 0.80 398\n",
|
| 729 |
-
"weighted avg 0.81 0.80 0.80 398\n",
|
| 730 |
-
"\n",
|
| 731 |
-
"Saved improved checkpoint: model_best.pth\n",
|
| 732 |
-
"\n",
|
| 733 |
-
"Epoch 2/6\n"
|
| 734 |
-
]
|
| 735 |
-
},
|
| 736 |
-
{
|
| 737 |
-
"name": "stderr",
|
| 738 |
-
"output_type": "stream",
|
| 739 |
-
"text": [
|
| 740 |
-
"/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
|
| 741 |
-
" with autocast(enabled=use_amp):\n"
|
| 742 |
-
]
|
| 743 |
-
},
|
| 744 |
-
{
|
| 745 |
-
"name": "stdout",
|
| 746 |
-
"output_type": "stream",
|
| 747 |
-
"text": [
|
| 748 |
-
"Batch 0 | Loss: 0.6078\n",
|
| 749 |
-
"Batch 50 | Loss: 1.1135\n",
|
| 750 |
-
"Batch 100 | Loss: 0.3297\n",
|
| 751 |
-
"Batch 150 | Loss: 0.8473\n",
|
| 752 |
-
"Batch 200 | Loss: 0.9326\n",
|
| 753 |
-
"Batch 250 | Loss: 0.5112\n",
|
| 754 |
-
"Batch 300 | Loss: 0.1645\n",
|
| 755 |
-
"Batch 350 | Loss: 0.2250\n",
|
| 756 |
-
"Batch 400 | Loss: 0.7142\n",
|
| 757 |
-
"Batch 450 | Loss: 0.3741\n",
|
| 758 |
-
"Batch 500 | Loss: 0.3084\n",
|
| 759 |
-
"Batch 550 | Loss: 0.1472\n",
|
| 760 |
-
"Batch 600 | Loss: 0.0679\n",
|
| 761 |
-
"Batch 650 | Loss: 0.1234\n",
|
| 762 |
-
"Batch 700 | Loss: 1.1370\n",
|
| 763 |
-
"Batch 750 | Loss: 0.8843\n",
|
| 764 |
-
"Train | Loss: 0.4817 | Acc: 0.7720 | F1: 0.7665\n"
|
| 765 |
-
]
|
| 766 |
-
},
|
| 767 |
-
{
|
| 768 |
-
"name": "stderr",
|
| 769 |
-
"output_type": "stream",
|
| 770 |
-
"text": [
|
| 771 |
-
"/tmp/ipykernel_155548/4183901742.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
|
| 772 |
-
" with autocast(enabled=use_amp):\n"
|
| 773 |
-
]
|
| 774 |
-
},
|
| 775 |
-
{
|
| 776 |
-
"name": "stdout",
|
| 777 |
-
"output_type": "stream",
|
| 778 |
-
"text": [
|
| 779 |
-
"Validation | Loss: 0.3708 | Acc: 0.8417 | F1: 0.8225\n",
|
| 780 |
-
" precision recall f1-score support\n",
|
| 781 |
-
"\n",
|
| 782 |
-
" Human 0.78 0.95 0.86 198\n",
|
| 783 |
-
" AI 0.94 0.73 0.82 200\n",
|
| 784 |
-
"\n",
|
| 785 |
-
" accuracy 0.84 398\n",
|
| 786 |
-
" macro avg 0.86 0.84 0.84 398\n",
|
| 787 |
-
"weighted avg 0.86 0.84 0.84 398\n",
|
| 788 |
-
"\n",
|
| 789 |
-
"Saved improved checkpoint: model_best.pth\n",
|
| 790 |
-
"\n",
|
| 791 |
-
"Epoch 3/6\n"
|
| 792 |
-
]
|
| 793 |
-
},
|
| 794 |
-
{
|
| 795 |
-
"name": "stderr",
|
| 796 |
-
"output_type": "stream",
|
| 797 |
-
"text": [
|
| 798 |
-
"/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
|
| 799 |
-
" with autocast(enabled=use_amp):\n"
|
| 800 |
-
]
|
| 801 |
-
},
|
| 802 |
-
{
|
| 803 |
-
"name": "stdout",
|
| 804 |
-
"output_type": "stream",
|
| 805 |
-
"text": [
|
| 806 |
-
"Batch 0 | Loss: 0.0415\n",
|
| 807 |
-
"Batch 50 | Loss: 0.0845\n",
|
| 808 |
-
"Batch 100 | Loss: 0.0336\n",
|
| 809 |
-
"Batch 150 | Loss: 0.6389\n",
|
| 810 |
-
"Batch 200 | Loss: 1.6021\n",
|
| 811 |
-
"Batch 250 | Loss: 0.0696\n",
|
| 812 |
-
"Batch 300 | Loss: 0.5184\n",
|
| 813 |
-
"Batch 350 | Loss: 0.0569\n",
|
| 814 |
-
"Batch 400 | Loss: 0.8119\n",
|
| 815 |
-
"Batch 450 | Loss: 1.5121\n",
|
| 816 |
-
"Batch 500 | Loss: 0.0330\n",
|
| 817 |
-
"Batch 550 | Loss: 0.0208\n",
|
| 818 |
-
"Batch 600 | Loss: 1.1329\n",
|
| 819 |
-
"Batch 650 | Loss: 0.7745\n",
|
| 820 |
-
"Batch 700 | Loss: 0.0740\n",
|
| 821 |
-
"Batch 750 | Loss: 1.4907\n",
|
| 822 |
-
"Train | Loss: 0.3830 | Acc: 0.8495 | F1: 0.8488\n"
|
| 823 |
-
]
|
| 824 |
-
},
|
| 825 |
-
{
|
| 826 |
-
"name": "stderr",
|
| 827 |
-
"output_type": "stream",
|
| 828 |
-
"text": [
|
| 829 |
-
"/tmp/ipykernel_155548/4183901742.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
|
| 830 |
-
" with autocast(enabled=use_amp):\n"
|
| 831 |
-
]
|
| 832 |
-
},
|
| 833 |
-
{
|
| 834 |
-
"name": "stdout",
|
| 835 |
-
"output_type": "stream",
|
| 836 |
-
"text": [
|
| 837 |
-
"Validation | Loss: 0.3527 | Acc: 0.8668 | F1: 0.8515\n",
|
| 838 |
-
" precision recall f1-score support\n",
|
| 839 |
-
"\n",
|
| 840 |
-
" Human 0.80 0.97 0.88 198\n",
|
| 841 |
-
" AI 0.97 0.76 0.85 200\n",
|
| 842 |
-
"\n",
|
| 843 |
-
" accuracy 0.87 398\n",
|
| 844 |
-
" macro avg 0.88 0.87 0.87 398\n",
|
| 845 |
-
"weighted avg 0.88 0.87 0.87 398\n",
|
| 846 |
-
"\n",
|
| 847 |
-
"Saved improved checkpoint: model_best.pth\n",
|
| 848 |
-
"\n",
|
| 849 |
-
"Epoch 4/6\n"
|
| 850 |
-
]
|
| 851 |
-
},
|
| 852 |
-
{
|
| 853 |
-
"name": "stderr",
|
| 854 |
-
"output_type": "stream",
|
| 855 |
-
"text": [
|
| 856 |
-
"/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
|
| 857 |
-
" with autocast(enabled=use_amp):\n"
|
| 858 |
-
]
|
| 859 |
-
},
|
| 860 |
-
{
|
| 861 |
-
"name": "stdout",
|
| 862 |
-
"output_type": "stream",
|
| 863 |
-
"text": [
|
| 864 |
-
"Batch 0 | Loss: 1.2321\n",
|
| 865 |
-
"Batch 50 | Loss: 0.0369\n",
|
| 866 |
-
"Batch 100 | Loss: 0.0161\n",
|
| 867 |
-
"Batch 150 | Loss: 0.2000\n",
|
| 868 |
-
"Batch 200 | Loss: 0.0035\n",
|
| 869 |
-
"Batch 250 | Loss: 2.3207\n",
|
| 870 |
-
"Batch 300 | Loss: 0.0022\n",
|
| 871 |
-
"Batch 350 | Loss: 2.2738\n",
|
| 872 |
-
"Batch 400 | Loss: 0.0011\n",
|
| 873 |
-
"Batch 450 | Loss: 0.0075\n",
|
| 874 |
-
"Batch 500 | Loss: 2.4454\n",
|
| 875 |
-
"Batch 550 | Loss: 0.3863\n",
|
| 876 |
-
"Batch 600 | Loss: 0.0038\n",
|
| 877 |
-
"Batch 650 | Loss: 0.0061\n",
|
| 878 |
-
"Batch 700 | Loss: 0.0005\n",
|
| 879 |
-
"Batch 750 | Loss: 0.0182\n",
|
| 880 |
-
"Train | Loss: 0.4209 | Acc: 0.8923 | F1: 0.8903\n"
|
| 881 |
-
]
|
| 882 |
-
},
|
| 883 |
-
{
|
| 884 |
-
"name": "stderr",
|
| 885 |
-
"output_type": "stream",
|
| 886 |
-
"text": [
|
| 887 |
-
"/tmp/ipykernel_155548/4183901742.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
|
| 888 |
-
" with autocast(enabled=use_amp):\n"
|
| 889 |
-
]
|
| 890 |
-
},
|
| 891 |
-
{
|
| 892 |
-
"name": "stdout",
|
| 893 |
-
"output_type": "stream",
|
| 894 |
-
"text": [
|
| 895 |
-
"Validation | Loss: 0.4601 | Acc: 0.8769 | F1: 0.8831\n",
|
| 896 |
-
" precision recall f1-score support\n",
|
| 897 |
-
"\n",
|
| 898 |
-
" Human 0.92 0.83 0.87 198\n",
|
| 899 |
-
" AI 0.84 0.93 0.88 200\n",
|
| 900 |
-
"\n",
|
| 901 |
-
" accuracy 0.88 398\n",
|
| 902 |
-
" macro avg 0.88 0.88 0.88 398\n",
|
| 903 |
-
"weighted avg 0.88 0.88 0.88 398\n",
|
| 904 |
-
"\n",
|
| 905 |
-
"Saved improved checkpoint: model_best.pth\n",
|
| 906 |
-
"\n",
|
| 907 |
-
"Epoch 5/6\n"
|
| 908 |
-
]
|
| 909 |
-
},
|
| 910 |
-
{
|
| 911 |
-
"name": "stderr",
|
| 912 |
-
"output_type": "stream",
|
| 913 |
-
"text": [
|
| 914 |
-
"/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
|
| 915 |
-
" with autocast(enabled=use_amp):\n"
|
| 916 |
-
]
|
| 917 |
-
},
|
| 918 |
-
{
|
| 919 |
-
"name": "stdout",
|
| 920 |
-
"output_type": "stream",
|
| 921 |
-
"text": [
|
| 922 |
-
"Batch 0 | Loss: 0.0010\n",
|
| 923 |
-
"Batch 50 | Loss: 0.0061\n",
|
| 924 |
-
"Batch 100 | Loss: 0.0047\n",
|
| 925 |
-
"Batch 150 | Loss: 0.0201\n",
|
| 926 |
-
"Batch 200 | Loss: 0.0023\n",
|
| 927 |
-
"Batch 250 | Loss: 0.0395\n",
|
| 928 |
-
"Batch 300 | Loss: 0.0011\n",
|
| 929 |
-
"Batch 350 | Loss: 0.0002\n",
|
| 930 |
-
"Batch 400 | Loss: 3.2169\n",
|
| 931 |
-
"Batch 450 | Loss: 4.4883\n",
|
| 932 |
-
"Batch 500 | Loss: 0.0002\n",
|
| 933 |
-
"Batch 550 | Loss: 0.0003\n",
|
| 934 |
-
"Batch 600 | Loss: 0.0000\n",
|
| 935 |
-
"Batch 650 | Loss: 0.0002\n",
|
| 936 |
-
"Batch 700 | Loss: 0.0000\n",
|
| 937 |
-
"Batch 750 | Loss: 4.6367\n",
|
| 938 |
-
"Train | Loss: 0.5447 | Acc: 0.9011 | F1: 0.8990\n"
|
| 939 |
-
]
|
| 940 |
-
},
|
| 941 |
-
{
|
| 942 |
-
"name": "stderr",
|
| 943 |
-
"output_type": "stream",
|
| 944 |
-
"text": [
|
| 945 |
-
"/tmp/ipykernel_155548/4183901742.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
|
| 946 |
-
" with autocast(enabled=use_amp):\n"
|
| 947 |
-
]
|
| 948 |
-
},
|
| 949 |
-
{
|
| 950 |
-
"name": "stdout",
|
| 951 |
-
"output_type": "stream",
|
| 952 |
-
"text": [
|
| 953 |
-
"Validation | Loss: 0.5331 | Acc: 0.9271 | F1: 0.9266\n",
|
| 954 |
-
" precision recall f1-score support\n",
|
| 955 |
-
"\n",
|
| 956 |
-
" Human 0.92 0.94 0.93 198\n",
|
| 957 |
-
" AI 0.94 0.92 0.93 200\n",
|
| 958 |
-
"\n",
|
| 959 |
-
" accuracy 0.93 398\n",
|
| 960 |
-
" macro avg 0.93 0.93 0.93 398\n",
|
| 961 |
-
"weighted avg 0.93 0.93 0.93 398\n",
|
| 962 |
-
"\n",
|
| 963 |
-
"Saved improved checkpoint: model_best.pth\n",
|
| 964 |
-
"\n",
|
| 965 |
-
"Epoch 6/6\n"
|
| 966 |
-
]
|
| 967 |
-
},
|
| 968 |
-
{
|
| 969 |
-
"name": "stderr",
|
| 970 |
-
"output_type": "stream",
|
| 971 |
-
"text": [
|
| 972 |
-
"/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
|
| 973 |
-
" with autocast(enabled=use_amp):\n"
|
| 974 |
-
]
|
| 975 |
-
},
|
| 976 |
-
{
|
| 977 |
-
"name": "stdout",
|
| 978 |
-
"output_type": "stream",
|
| 979 |
-
"text": [
|
| 980 |
-
"Batch 0 | Loss: 0.0000\n"
|
| 981 |
-
]
|
| 982 |
-
}
|
| 983 |
-
],
|
| 984 |
-
"source": [
|
| 985 |
-
"from torch.cuda.amp import autocast, GradScaler\n",
|
| 986 |
-
"\n",
|
| 987 |
-
"use_amp = device.type == 'cuda'\n",
|
| 988 |
-
"scaler = GradScaler(enabled=use_amp)\n",
|
| 989 |
-
"\n",
|
| 990 |
-
"def train_one_epoch(model, loader):\n",
|
| 991 |
-
" model.train()\n",
|
| 992 |
-
" total_loss = 0.0\n",
|
| 993 |
-
" all_preds, all_true = [], []\n",
|
| 994 |
-
"\n",
|
| 995 |
-
" optimizer.zero_grad(set_to_none=True)\n",
|
| 996 |
-
" for batch_idx, batch in enumerate(loader):\n",
|
| 997 |
-
" input_ids = batch['input_ids'].to(device, non_blocking=True)\n",
|
| 998 |
-
" attention_mask = batch['attention_mask'].to(device, non_blocking=True)\n",
|
| 999 |
-
" labels = batch['labels'].to(device, non_blocking=True)\n",
|
| 1000 |
-
"\n",
|
| 1001 |
-
" with autocast(enabled=use_amp):\n",
|
| 1002 |
-
" logits = model(input_ids, attention_mask=attention_mask)\n",
|
| 1003 |
-
" loss = loss_fn(logits, labels) / grad_accum_steps\n",
|
| 1004 |
-
"\n",
|
| 1005 |
-
" scaler.scale(loss).backward()\n",
|
| 1006 |
-
"\n",
|
| 1007 |
-
" if (batch_idx + 1) % grad_accum_steps == 0 or (batch_idx + 1) == len(loader):\n",
|
| 1008 |
-
" torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n",
|
| 1009 |
-
" scaler.step(optimizer)\n",
|
| 1010 |
-
" scaler.update()\n",
|
| 1011 |
-
" scheduler.step()\n",
|
| 1012 |
-
" optimizer.zero_grad(set_to_none=True)\n",
|
| 1013 |
-
"\n",
|
| 1014 |
-
" total_loss += loss.item() * grad_accum_steps\n",
|
| 1015 |
-
" preds = torch.argmax(logits, dim=1)\n",
|
| 1016 |
-
" all_preds.extend(preds.detach().cpu().numpy())\n",
|
| 1017 |
-
" all_true.extend(labels.detach().cpu().numpy())\n",
|
| 1018 |
-
"\n",
|
| 1019 |
-
" if batch_idx % 50 == 0:\n",
|
| 1020 |
-
" print(f'Batch {batch_idx} | Loss: {(loss.item() * grad_accum_steps):.4f}')\n",
|
| 1021 |
-
"\n",
|
| 1022 |
-
" avg_loss = total_loss / max(len(loader), 1)\n",
|
| 1023 |
-
" train_acc = accuracy_score(all_true, all_preds)\n",
|
| 1024 |
-
" train_f1 = f1_score(all_true, all_preds)\n",
|
| 1025 |
-
" return avg_loss, train_acc, train_f1\n",
|
| 1026 |
-
"\n",
|
| 1027 |
-
"\n",
|
| 1028 |
-
"def evaluate(model, loader):\n",
|
| 1029 |
-
"    model.eval()\n",
|
| 1030 |
-
"    all_preds, all_true = [], []\n",
|
| 1031 |
-
"    total_loss = 0.0\n",
|
| 1032 |
-
"\n",
|
| 1033 |
-
"    with torch.no_grad():\n",
|
| 1034 |
-
"        for batch in loader:\n",
|
| 1035 |
-
"            input_ids = batch['input_ids'].to(device, non_blocking=True)\n",
|
| 1036 |
-
"            attention_mask = batch['attention_mask'].to(device, non_blocking=True)\n",
|
| 1037 |
-
"            labels = batch['labels'].to(device, non_blocking=True)\n",
|
| 1038 |
-
"\n",
|
| 1039 |
-
"            with autocast(enabled=use_amp):\n",
|
| 1040 |
-
"                logits = model(input_ids, attention_mask=attention_mask)\n",
|
| 1041 |
-
"                loss = loss_fn(logits, labels)\n",
|
| 1042 |
-
"\n",
|
| 1043 |
-
"            total_loss += loss.item()\n",
|
| 1044 |
-
"            preds = torch.argmax(logits, dim=1)\n",
|
| 1045 |
-
"            all_preds.extend(preds.cpu().numpy())\n",
|
| 1046 |
-
"            all_true.extend(labels.cpu().numpy())\n",
|
| 1047 |
-
"\n",
|
| 1048 |
-
"    val_loss = total_loss / max(len(loader), 1)\n",
|
| 1049 |
-
"    val_acc = accuracy_score(all_true, all_preds)\n",
|
| 1050 |
-
"    val_f1 = f1_score(all_true, all_preds)\n",
|
| 1051 |
-
"\n",
|
| 1052 |
-
"    print(f'Validation | Loss: {val_loss:.4f} | Acc: {val_acc:.4f} | F1: {val_f1:.4f}')\n",
|
| 1053 |
-
"    print(classification_report(all_true, all_preds, target_names=['Human', 'AI']))\n",
|
| 1054 |
-
"    return val_loss, val_acc, val_f1\n",
|
| 1055 |
-
"\n",
|
| 1056 |
-
"\n",
|
| 1057 |
-
"# Training with early stopping on validation F1\n",
|
| 1058 |
-
"patience = 2\n",
|
| 1059 |
-
"best_val_f1 = 0.0\n",
|
| 1060 |
-
"epochs_without_improve = 0\n",
|
| 1061 |
-
"\n",
|
| 1062 |
-
"for epoch in range(1, max_epochs + 1):\n",
|
| 1063 |
-
"    print(f'\\nEpoch {epoch}/{max_epochs}')\n",
|
| 1064 |
-
"    if device.type == 'cuda':\n",
|
| 1065 |
-
"        torch.cuda.empty_cache()\n",
|
| 1066 |
-
"\n",
|
| 1067 |
-
"    train_loss, train_acc, train_f1 = train_one_epoch(model, train_loader)\n",
|
| 1068 |
-
"    print(f'Train | Loss: {train_loss:.4f} | Acc: {train_acc:.4f} | F1: {train_f1:.4f}')\n",
|
| 1069 |
-
"\n",
|
| 1070 |
-
"    val_loss, val_acc, val_f1 = evaluate(model, val_loader)\n",
|
| 1071 |
-
"\n",
|
| 1072 |
-
"    if val_f1 > best_val_f1:\n",
|
| 1073 |
-
"        best_val_f1 = val_f1\n",
|
| 1074 |
-
"        epochs_without_improve = 0\n",
|
| 1075 |
-
"        torch.save(model.state_dict(), 'model_best.pth')\n",
|
| 1076 |
-
"        print('Saved improved checkpoint: model_best.pth')\n",
|
| 1077 |
-
"    else:\n",
|
| 1078 |
-
"        epochs_without_improve += 1\n",
|
| 1079 |
-
"        if epochs_without_improve >= patience:\n",
|
| 1080 |
-
"            print('Early stopping triggered.')\n",
|
| 1081 |
-
"            break\n",
|
| 1082 |
-
"\n",
|
| 1083 |
-
"print(f'Best validation F1: {best_val_f1:.4f}')"
|
| 1084 |
-
]
|
| 1085 |
-
},
|
| 1086 |
-
{
|
| 1087 |
-
"cell_type": "code",
|
| 1088 |
-
"execution_count": null,
|
| 1089 |
-
"id": "wBIT-kPaswqy",
|
| 1090 |
-
"metadata": {
|
| 1091 |
-
"id": "wBIT-kPaswqy"
|
| 1092 |
-
},
|
| 1093 |
-
"outputs": [],
|
| 1094 |
-
"source": [
|
| 1095 |
-
"# Optional: save current in-memory weights as latest checkpoint\n",
|
| 1096 |
-
"torch.save(model.state_dict(), 'model_latest.pth')\n",
|
| 1097 |
-
"print('Saved: model_latest.pth')"
|
| 1098 |
-
]
|
| 1099 |
-
},
|
| 1100 |
-
{
|
| 1101 |
-
"cell_type": "code",
|
| 1102 |
-
"execution_count": null,
|
| 1103 |
-
"id": "19b9652c",
|
| 1104 |
-
"metadata": {
|
| 1105 |
-
"colab": {
|
| 1106 |
-
"base_uri": "https://localhost:8080/"
|
| 1107 |
-
},
|
| 1108 |
-
"id": "19b9652c",
|
| 1109 |
-
"outputId": "e1b12835-b081-4d46-a909-c92cb3b6d230"
|
| 1110 |
-
},
|
| 1111 |
-
"outputs": [
|
| 1112 |
-
{
|
| 1113 |
-
"data": {
|
| 1114 |
-
"text/plain": [
|
| 1115 |
-
"('./nepali_xlmr_classifier/tokenizer_config.json',\n",
|
| 1116 |
-
" './nepali_xlmr_classifier/special_tokens_map.json',\n",
|
| 1117 |
-
" './nepali_xlmr_classifier/sentencepiece.bpe.model',\n",
|
| 1118 |
-
" './nepali_xlmr_classifier/added_tokens.json',\n",
|
| 1119 |
-
" './nepali_xlmr_classifier/tokenizer.json')"
|
| 1120 |
-
]
|
| 1121 |
-
},
|
| 1122 |
-
"execution_count": 41,
|
| 1123 |
-
"metadata": {},
|
| 1124 |
-
"output_type": "execute_result"
|
| 1125 |
-
}
|
| 1126 |
-
],
|
| 1127 |
-
"source": [
|
| 1128 |
-
"tokenizer.save_pretrained(\"./nepali_xlmr_classifier\")"
|
| 1129 |
-
]
|
| 1130 |
-
},
|
| 1131 |
-
{
|
| 1132 |
-
"cell_type": "code",
|
| 1133 |
-
"execution_count": null,
|
| 1134 |
-
"id": "eAnrw316iRw8",
|
| 1135 |
-
"metadata": {
|
| 1136 |
-
"colab": {
|
| 1137 |
-
"base_uri": "https://localhost:8080/"
|
| 1138 |
-
},
|
| 1139 |
-
"id": "eAnrw316iRw8",
|
| 1140 |
-
"outputId": "04885bb5-4f06-459b-a83c-40f5e00703fe"
|
| 1141 |
-
},
|
| 1142 |
-
"outputs": [
|
| 1143 |
-
{
|
| 1144 |
-
"name": "stdout",
|
| 1145 |
-
"output_type": "stream",
|
| 1146 |
-
"text": [
|
| 1147 |
-
"0\n"
|
| 1148 |
-
]
|
| 1149 |
-
}
|
| 1150 |
-
],
|
| 1151 |
-
"source": [
|
| 1152 |
-
"def predict(text):\n",
|
| 1153 |
-
"    model.eval()\n",
|
| 1154 |
-
"    inputs = tokenizer(\n",
|
| 1155 |
-
"        text,\n",
|
| 1156 |
-
"        return_tensors='pt',\n",
|
| 1157 |
-
"        truncation=True,\n",
|
| 1158 |
-
"        padding=True,\n",
|
| 1159 |
-
"        max_length=MAX_LEN,\n",
|
| 1160 |
-
"    )\n",
|
| 1161 |
-
"    inputs = {k: v.to(device) for k, v in inputs.items()}\n",
|
| 1162 |
-
"\n",
|
| 1163 |
-
"    with torch.no_grad():\n",
|
| 1164 |
-
"        logits = model(inputs['input_ids'], inputs['attention_mask'])\n",
|
| 1165 |
-
"        probs = torch.softmax(logits, dim=1)\n",
|
| 1166 |
-
"        pred = torch.argmax(probs, dim=1).item()\n",
|
| 1167 |
-
"        confidence = probs[0, pred].item()\n",
|
| 1168 |
-
"\n",
|
| 1169 |
-
"    label = 'AI' if pred == 1 else 'Human'\n",
|
| 1170 |
-
"    return label, confidence\n",
|
| 1171 |
-
"\n",
|
| 1172 |
-
"sample = 'अख्तियार दुरुपयोग अनुसन्धान आयोगले सिन्धुपाल्चोक–२ बाट प्रतिनिधिसभा सदस्य निर्वाचित सांसद तथा पूर्वमन्त्री बस्नेतसहित १६ जना र २ कम्पनी विरुद्ध ३ अर्ब २१ करोडभन्दा बढी बिगो कायम गरी बिहीबार विशेष अदालतमा भ्रष्टाचार मुद्दा दायर गरेको छ ।'\n",
|
| 1173 |
-
"label, conf = predict(sample)\n",
|
| 1174 |
-
"print(f'Prediction: {label} | Confidence: {conf:.4f}')"
|
| 1175 |
-
]
|
| 1176 |
-
},
|
| 1177 |
-
{
|
| 1178 |
-
"cell_type": "code",
|
| 1179 |
-
"execution_count": null,
|
| 1180 |
-
"id": "lqGrqG51NiQV",
|
| 1181 |
-
"metadata": {
|
| 1182 |
-
"colab": {
|
| 1183 |
-
"base_uri": "https://localhost:8080/"
|
| 1184 |
-
},
|
| 1185 |
-
"id": "lqGrqG51NiQV",
|
| 1186 |
-
"outputId": "6bdae59b-2684-4bd0-f804-d16ebd8272db"
|
| 1187 |
-
},
|
| 1188 |
-
"outputs": [
|
| 1189 |
-
{
|
| 1190 |
-
"name": "stdout",
|
| 1191 |
-
"output_type": "stream",
|
| 1192 |
-
"text": [
|
| 1193 |
-
"1\n",
|
| 1194 |
-
"1\n",
|
| 1195 |
-
"1\n",
|
| 1196 |
-
"1\n",
|
| 1197 |
-
"1\n",
|
| 1198 |
-
"1\n",
|
| 1199 |
-
"1\n",
|
| 1200 |
-
"1\n",
|
| 1201 |
-
"1\n",
|
| 1202 |
-
"0\n"
|
| 1203 |
-
]
|
| 1204 |
-
}
|
| 1205 |
-
],
|
| 1206 |
-
"source": [
|
| 1207 |
-
"print(predict(\"इन्टरनेटको सुरुवात सन् १९६९ मा अमेरिकी रक्षा मन्त्रालयले निर्माण गरेको ARPANET नामक प्रोजेक्टबाट भएको हो, जसको उद्देश्य आपसी संचारलाई सहज बनाउने थियो र जसले भविष्यमा इन्टरनेटको रूप लियो\"))\n",
|
| 1208 |
-
"\n",
|
| 1209 |
-
"print(predict(\"सुरुमा इन्टरनेट केही वैज्ञानिक तथा सरकारी संस्थाहरूमा सीमित रहेको भए पनि, समयक्रममा यसको पहुँच आम नागरिक, विद्यालय, र व्यवसायिक क्षेत्रमा विस्तार हुँदै गयो\"))\n",
|
| 1210 |
-
"\n",
|
| 1211 |
-
"print(predict(\"ARPANETले कम्प्युटरहरूलाई आपसमा जोड्ने सफल प्रयोग गरेपछि इन्टरनेटको सम्भावना प्रमाणित भयो, जसले गर्दा विश्वभरका अनुसन्धानकर्ताहरू यसप्रति आकर्षित हुन थाले\"))\n",
|
| 1212 |
-
"\n",
|
| 1213 |
-
"print(predict(\"सन् १९९० को दशकमा विश्वव्यापी रूपमा इन्टरनेट विस्तार हुन थालेपछि मानिसहरू सूचनाको आदान–प्रदान, इमेल, र वेबसाइटहरूको प्रयोगमार्फत डिजिटल संसारमा प्रवेश गर्न थाले।\"))\n",
|
| 1214 |
-
"\n",
|
| 1215 |
-
"print(predict(\"इन्टरनेटले शिक्षा, स्वास्थ्य, सञ्चार, मनोरञ्जन, तथा व्यापारजस्ता धेरै क्षेत्रहरूमा अभूतपूर्व परिवर्तन ल्याएको छ, जसले गर्दा मानव जीवन सरल, छरितो र प्रभावकारी बनेको छ।\"))\n",
|
| 1216 |
-
"\n",
|
| 1217 |
-
"print(predict(\"समयसँगै इन्टरनेट एक अत्यावश्यक सेवाको रूपमा विकास भएको छ, जसबिनाको आधुनिक जीवन लगभग असम्भवजस्तै लाग्ने अवस्था सिर्जना भएको छ।\"))\n",
|
| 1218 |
-
"\n",
|
| 1219 |
-
"print(predict(\"आजको युगमा इन्टरनेट केवल सूचना प्राप्तिको माध्यम मात्र नभई ज्ञानको भण्डार, रचनात्मकता प्रदर्शन गर्ने मंच, तथा रोजगार सृजनाको स्रोत पनि बनिसकेको छ।\"))\n",
|
| 1220 |
-
"\n",
|
| 1221 |
-
"print(predict(\"इन्टरनेटको प्रभाव त्यति गहिरो भएको छ कि विद्यालयका बालबालिकादेखि वृद्धसम्म यसको प्रयोगमा संलग्न छन्, जसले डिजिटल विभाजनको अवधारणा जन्माएको छ।\"))\n",
|
| 1222 |
-
"\n",
|
| 1223 |
-
"print(predict(\"इन्टरनेटले विश्वलाई एउटा सानो गाउँमा रूपान्तरण गरेको छ, जहाँ मानिसहरू हजारौं किलोमिटर टाढा भएर पनि एकअर्कासँग प्रत्यक्ष संवाद गर्न सक्छन्।\"))\n",
|
| 1224 |
-
"\n",
|
| 1225 |
-
"print(predict(\"संसदीय समितिले समन्वयकारी भूमिका निर्वाह गर्दै मनसुनजन्य विपद् जोखिम न्यूनीकरण, विपद् प्रतिकार्यका लागि तयारी गर्न तीन तहकै सरकारलाई निर्देशन दिएको छ।\"))\n"
|
| 1226 |
-
]
|
| 1227 |
-
},
|
| 1228 |
-
{
|
| 1229 |
-
"cell_type": "code",
|
| 1230 |
-
"execution_count": null,
|
| 1231 |
-
"id": "X2ePCc5Disrt",
|
| 1232 |
-
"metadata": {
|
| 1233 |
-
"colab": {
|
| 1234 |
-
"base_uri": "https://localhost:8080/",
|
| 1235 |
-
"height": 35
|
| 1236 |
-
},
|
| 1237 |
-
"id": "X2ePCc5Disrt",
|
| 1238 |
-
"outputId": "a4d27689-28cb-43c0-8333-67f2d3a6e097"
|
| 1239 |
-
},
|
| 1240 |
-
"outputs": [
|
| 1241 |
-
{
|
| 1242 |
-
"data": {
|
| 1243 |
-
"application/vnd.google.colaboratory.intrinsic+json": {
|
| 1244 |
-
"type": "string"
|
| 1245 |
-
},
|
| 1246 |
-
"text/plain": [
|
| 1247 |
-
"'/content/classifier.zip'"
|
| 1248 |
-
]
|
| 1249 |
-
},
|
| 1250 |
-
"execution_count": 42,
|
| 1251 |
-
"metadata": {},
|
| 1252 |
-
"output_type": "execute_result"
|
| 1253 |
-
}
|
| 1254 |
-
],
|
| 1255 |
-
"source": [
|
| 1256 |
-
"import shutil\n",
|
| 1257 |
-
"\n",
|
| 1258 |
-
"# Folder to archive and the destination zip path (adjust for your environment)\n",
|
| 1259 |
-
"folder_path = '/content/nepali_xlmr_classifier'\n",
|
| 1260 |
-
"zip_path = '/content/classifier.zip'\n",
|
| 1261 |
-
"\n",
|
| 1262 |
-
"shutil.make_archive(zip_path.replace('.zip', ''), 'zip', folder_path)\n"
|
| 1263 |
-
]
|
| 1264 |
-
},
|
| 1265 |
-
{
|
| 1266 |
-
"cell_type": "code",
|
| 1267 |
-
"execution_count": null,
|
| 1268 |
-
"id": "4BDzVg2gN7xi",
|
| 1269 |
-
"metadata": {
|
| 1270 |
-
"colab": {
|
| 1271 |
-
"base_uri": "https://localhost:8080/",
|
| 1272 |
-
"height": 17
|
| 1273 |
-
},
|
| 1274 |
-
"id": "4BDzVg2gN7xi",
|
| 1275 |
-
"outputId": "ef31798e-24f5-45ad-900f-7528b32ae39f"
|
| 1276 |
-
},
|
| 1277 |
-
"outputs": [
|
| 1278 |
-
{
|
| 1279 |
-
"data": {
|
| 1280 |
-
"application/javascript": "\n async function download(id, filename, size) {\n if (!google.colab.kernel.accessAllowed) {\n return;\n }\n const div = document.createElement('div');\n const label = document.createElement('label');\n label.textContent = `Downloading \"${filename}\": `;\n div.appendChild(label);\n const progress = document.createElement('progress');\n progress.max = size;\n div.appendChild(progress);\n document.body.appendChild(div);\n\n const buffers = [];\n let downloaded = 0;\n\n const channel = await google.colab.kernel.comms.open(id);\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n\n for await (const message of channel.messages) {\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n if (message.buffers) {\n for (const buffer of message.buffers) {\n buffers.push(buffer);\n downloaded += buffer.byteLength;\n progress.value = downloaded;\n }\n }\n }\n const blob = new Blob(buffers, {type: 'application/binary'});\n const a = document.createElement('a');\n a.href = window.URL.createObjectURL(blob);\n a.download = filename;\n div.appendChild(a);\n a.click();\n div.remove();\n }\n ",
|
| 1281 |
-
"text/plain": [
|
| 1282 |
-
"<IPython.core.display.Javascript object>"
|
| 1283 |
-
]
|
| 1284 |
-
},
|
| 1285 |
-
"metadata": {},
|
| 1286 |
-
"output_type": "display_data"
|
| 1287 |
-
},
|
| 1288 |
-
{
|
| 1289 |
-
"data": {
|
| 1290 |
-
"application/javascript": "download(\"download_33034c8f-76d5-48d0-b7cd-3d066ac8e32f\", \"classifier.zip\", 6596694)",
|
| 1291 |
-
"text/plain": [
|
| 1292 |
-
"<IPython.core.display.Javascript object>"
|
| 1293 |
-
]
|
| 1294 |
-
},
|
| 1295 |
-
"metadata": {},
|
| 1296 |
-
"output_type": "display_data"
|
| 1297 |
-
}
|
| 1298 |
-
],
|
| 1299 |
-
"source": [
|
| 1300 |
-
"from google.colab import files\n",
|
| 1301 |
-
"\n",
|
| 1302 |
-
"files.download(zip_path)\n"
|
| 1303 |
-
]
|
| 1304 |
-
},
|
| 1305 |
-
{
|
| 1306 |
-
"cell_type": "code",
|
| 1307 |
-
"execution_count": null,
|
| 1308 |
-
"id": "2jJkcOlw_R1k",
|
| 1309 |
-
"metadata": {
|
| 1310 |
-
"id": "2jJkcOlw_R1k"
|
| 1311 |
-
},
|
| 1312 |
-
"outputs": [],
|
| 1313 |
-
"source": [
|
| 1314 |
-
"torch.save(model.state_dict(), \"final_model.pth\") # AFTER training with classification head\n"
|
| 1315 |
-
]
|
| 1316 |
-
},
|
| 1317 |
-
{
|
| 1318 |
-
"cell_type": "code",
|
| 1319 |
-
"execution_count": null,
|
| 1320 |
-
"id": "xnHr1IDABebZ",
|
| 1321 |
-
"metadata": {
|
| 1322 |
-
"colab": {
|
| 1323 |
-
"base_uri": "https://localhost:8080/"
|
| 1324 |
-
},
|
| 1325 |
-
"id": "xnHr1IDABebZ",
|
| 1326 |
-
"outputId": "95761a2d-56fa-418c-de03-d66d1ae662ee"
|
| 1327 |
-
},
|
| 1328 |
-
"outputs": [
|
| 1329 |
-
{
|
| 1330 |
-
"name": "stdout",
|
| 1331 |
-
"output_type": "stream",
|
| 1332 |
-
"text": [
|
| 1333 |
-
"The text is predicted to be: Human\n",
|
| 1334 |
-
"1\n",
|
| 1335 |
-
"0\n",
|
| 1336 |
-
"1\n"
|
| 1337 |
-
]
|
| 1338 |
-
}
|
| 1339 |
-
],
|
| 1340 |
-
"source": [
|
| 1341 |
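-
"# Load the saved model and tokenizer for standalone inference.\n",
"# Assumes the IndicBERTClassifier class from the training cells is already defined.\n",
"import torch\n",
"from transformers import AutoTokenizer\n",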
|
| 1342 |
-
"\n",
|
| 1343 |
-
"# Define the device\n",
|
| 1344 |
-
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
|
| 1345 |
-
"\n",
|
| 1346 |
-
"# Instantiate the model\n",
|
| 1347 |
-
"model = IndicBERTClassifier().to(device)\n",
|
| 1348 |
-
"\n",
|
| 1349 |
-
"# Load the saved state dictionary\n",
|
| 1350 |
-
"# Make sure the path to your saved model file is correct\n",
|
| 1351 |
-
"model_path = \"final_model.pth\" # Or \"model_95_acc.pth\" if you saved that one last\n",
|
| 1352 |
-
"model.load_state_dict(torch.load(model_path, map_location=device))\n",
|
| 1353 |
-
"\n",
|
| 1354 |
-
"# Set the model to evaluation mode\n",
|
| 1355 |
-
"model.eval()\n",
|
| 1356 |
-
"\n",
|
| 1357 |
-
"# Load the tokenizer\n",
|
| 1358 |
-
"tokenizer_path = \"./nepali_xlmr_classifier\" # Make sure this path is correct\n",
|
| 1359 |
-
"tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)\n",
|
| 1360 |
-
"\n",
|
| 1361 |
-
"# Now the model and tokenizer are loaded and ready to be used for predictions.\n",
|
| 1362 |
-
"# You can use the existing `predict` function or write a new one.\n",
|
| 1363 |
-
"\n",
|
| 1364 |
-
"# Example of using the predict function with the loaded model and tokenizer\n",
|
| 1365 |
-
"def predict(text):\n",
|
| 1366 |
-
"    model.eval() # Ensure model is in evaluation mode\n",
|
| 1367 |
-
"    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)\n",
|
| 1368 |
-
"    inputs = {k: v.to(device) for k, v in inputs.items()}\n",
|
| 1369 |
-
"    with torch.no_grad():\n",
|
| 1370 |
-
"        outputs = model(**inputs)\n",
|
| 1371 |
-
"\n",
|
| 1372 |
-
"    # Handle if output is tensor (some versions/models return logits directly)\n",
|
| 1373 |
-
"    logits = outputs if isinstance(outputs, torch.Tensor) else outputs.logits\n",
|
| 1374 |
-
"\n",
|
| 1375 |
-
"    pred = torch.argmax(logits, dim=1).item()\n",
|
| 1376 |
-
"    return pred\n",
|
| 1377 |
-
"\n",
|
| 1378 |
-
"# Example usage with some text\n",
|
| 1379 |
-
"text_to_predict = \"This is a test sentence.\" # Replace with your Nepali text\n",
|
| 1380 |
-
"predicted_class = predict(text_to_predict)\n",
|
| 1381 |
-
"\n",
|
| 1382 |
-
"# Interpret the prediction (assuming 0 for Human, 1 for AI based on your previous code)\n",
|
| 1383 |
-
"class_label = \"Human\" if predicted_class == 0 else \"AI\"\n",
|
| 1384 |
-
"print(f\"The text is predicted to be: {class_label}\")\n",
|
| 1385 |
-
"\n",
|
| 1386 |
-
"# You can test with more examples as you did before\n",
|
| 1387 |
-
"print(predict(\"यी सबै वाक्यहरू इन्टरनेटको विकास, प्रभाव, र चुनौतीहरूको गहिरो सन्दर्भ समेटेर तयार पारिएका छन्। यदि तिमीलाई चाहिएको खण्डमा विशेष विषय (जस्तै शिक्षा, साइबर सुरक्षा, ग्रामीण प्रभाव आदि) चाहिएको हो भने, म त्यही विषयमा केन्द्रित लामो वाक्यहरू पनि दिन सक्छु।\"))\n",
|
| 1388 |
-
"print(predict(\"अख्तियार दुरुपयोग अनुसन्धान आयोगले सिन्धुपाल्चोक–२ बाट प्रतिनिधिसभा सदस्य निर्वाचित सांसद तथा पूर्वमन्त्री बस्नेतसहित १६ जना र २ कम्पनी विरुद्ध ३ अर्ब २१ करोडभन्दा बढी बिगो कायम गरी बिहीबार विशेष अदालतमा भ्रष्टाचार मुद्दा दायर गरेको छ । योसँगै बस्नेत सांसद पदबाट स्वतः निलम्बनमा परेका छन् ।\"))\n",
|
| 1389 |
-
"print(predict(\"इन्टरनेटको सुरुवात सन् १९६९ मा अमेरिकी रक्षा मन्त्रालयले निर्माण गरेको ARPANET नामक प्रोजेक्टबाट भएको हो, जसको उद्देश्य आपसी संचारलाई सहज बनाउने थियो र जसले भविष्यमा इन्टरनेटको रूप लियो\"))\n"
|
| 1390 |
-
]
|
| 1391 |
-
},
|
| 1392 |
-
{
|
| 1393 |
-
"cell_type": "code",
|
| 1394 |
-
"execution_count": null,
|
| 1395 |
-
"id": "gG8fnbqyDUpm",
|
| 1396 |
-
"metadata": {
|
| 1397 |
-
"id": "gG8fnbqyDUpm"
|
| 1398 |
-
},
|
| 1399 |
-
"outputs": [],
|
| 1400 |
-
"source": []
|
| 1401 |
-
}
|
| 1402 |
-
],
|
| 1403 |
-
"metadata": {
|
| 1404 |
-
"accelerator": "TPU",
|
| 1405 |
-
"colab": {
|
| 1406 |
-
"gpuType": "V28",
|
| 1407 |
-
"provenance": []
|
| 1408 |
-
},
|
| 1409 |
-
"kernelspec": {
|
| 1410 |
-
"display_name": "ml",
|
| 1411 |
-
"language": "python",
|
| 1412 |
-
"name": "python3"
|
| 1413 |
-
},
|
| 1414 |
-
"language_info": {
|
| 1415 |
-
"codemirror_mode": {
|
| 1416 |
-
"name": "ipython",
|
| 1417 |
-
"version": 3
|
| 1418 |
-
},
|
| 1419 |
-
"file_extension": ".py",
|
| 1420 |
-
"mimetype": "text/x-python",
|
| 1421 |
-
"name": "python",
|
| 1422 |
-
"nbconvert_exporter": "python",
|
| 1423 |
-
"pygments_lexer": "ipython3",
|
| 1424 |
-
"version": "3.11.14"
|
| 1425 |
-
}
|
| 1426 |
-
},
|
| 1427 |
-
"nbformat": 4,
|
| 1428 |
-
"nbformat_minor": 5
|
| 1429 |
-
}
|
notebook/ai_vs_human_nepali/notebook/documentation.md
DELETED
|
@@ -1,435 +0,0 @@
|
|
| 1 |
-
# Nepali AI vs Human Notebook Documentation
|
| 2 |
-
|
| 3 |
-
This folder contains a small notebook series for building an AI-vs-human text detector for Nepali text. The notebooks are not identical copies; they represent the evolution of the project from a lightweight scikit-learn baseline to a stronger hybrid model and a transformer-based experiment.
|
| 4 |
-
|
| 5 |
-
## Notebook Inventory
|
| 6 |
-
|
| 7 |
-
The notebooks in this directory are:
|
| 8 |
-
|
| 9 |
-
- [main.ipynb](main.ipynb)
|
| 10 |
-
- [working model.ipynb](working%20model.ipynb)
|
| 11 |
-
- [Nepali_Ai_vs_Human.ipynb](Nepali_Ai_vs_Human.ipynb)
|
| 12 |
-
- [final_main.ipynb](final_main.ipynb)
|
| 13 |
-
|
| 14 |
-
## Shared Goal
|
| 15 |
-
|
| 16 |
-
All notebooks solve the same binary classification task:
|
| 17 |
-
|
| 18 |
-
- Class 0 = Human-written Nepali text
|
| 19 |
-
- Class 1 = AI-generated Nepali text
|
| 20 |
-
|
| 21 |
-
The notebooks differ in how they prepare the data, which features they extract, and which model family they train.
|
| 22 |
-
|
| 23 |
-
## Shared Data Sources
|
| 24 |
-
|
| 25 |
-
Across the notebooks, the dataset is built from one or more CSV files under the notebook dataset folders. The common column pattern is:
|
| 26 |
-
|
| 27 |
-
- human_text
|
| 28 |
-
- ai_generated_text
|
| 29 |
-
|
| 30 |
-
Some notebooks also use:
|
| 31 |
-
|
| 32 |
-
- title
|
| 33 |
-
- label
|
| 34 |
-
- paragraph
|
| 35 |
-
|
| 36 |
-
The data preparation usually performs some combination of:
|
| 37 |
-
|
| 38 |
-
- dropping null rows
|
| 39 |
-
- stripping whitespace
|
| 40 |
-
- removing duplicates
|
| 41 |
-
- converting two source columns into one text column plus one label column
|
| 42 |
-
- balancing classes by sampling
|
| 43 |
-
- splitting long texts into smaller chunks
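
As a rough illustration of the column-to-label conversion and class balancing steps, the sketch below uses only the Python standard library and is not the notebooks' own dataframe code. The function name `build_labeled_rows` and the `seed` parameter are hypothetical; only the `human_text` and `ai_generated_text` column names come from the description above.

```python
import random


def build_labeled_rows(records, seed=42):
    """Turn two-column records into balanced (text, label) pairs.

    Hypothetical helper: label 0 = human, label 1 = AI.
    Assumes each value is either a string or None.
    """
    def unique_texts(column):
        # Drop nulls, strip whitespace, and remove duplicates.
        cleaned = {(r.get(column) or "").strip() for r in records}
        cleaned.discard("")
        return sorted(cleaned)  # sorted so that sampling is reproducible

    human = unique_texts("human_text")
    ai = unique_texts("ai_generated_text")

    # Balance the classes by downsampling the larger one.
    rng = random.Random(seed)
    n = min(len(human), len(ai))
    rows = [(text, 0) for text in rng.sample(human, n)]
    rows += [(text, 1) for text in rng.sample(ai, n)]
    rng.shuffle(rows)
    return rows
```

Downsampling is only one way to balance the classes; the notebooks may sample differently.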
|
| 44 |
-
|
| 45 |
-
## Notebook Relationship
|
| 46 |
-
|
| 47 |
-
The notebooks form a progression:
|
| 48 |
-
|
| 49 |
-
1. main.ipynb is the first lightweight sklearn baseline.
|
| 50 |
-
2. working model.ipynb refines the baseline with better text chunking.
|
| 51 |
-
3. Nepali_Ai_vs_Human.ipynb switches to a transformer-style neural model.
|
| 52 |
-
4. final_main.ipynb is the most complete hybrid notebook and is the closest thing to a production workflow.
|
| 53 |
-
|
| 54 |
-
## main.ipynb
|
| 55 |
-
|
| 56 |
-
### Purpose
|
| 57 |
-
|
| 58 |
-
This is the earliest baseline notebook. It focuses on a CPU-friendly approach using TF-IDF plus hand-crafted text features, then compares several classic machine learning models.
|
| 59 |
-
|
| 60 |
-
### Data Preparation
|
| 61 |
-
|
| 62 |
-
The notebook loads several CSV files and concatenates them into one dataframe. The data is drawn from:
|
| 63 |
-
|
| 64 |
-
- ../DATASET/data.csv
|
| 65 |
-
- ../DATASET/new_data.csv
|
| 66 |
-
- /mnt/linux-data/Work/aiapi/notebook/ai_vs_human_nepali/news_scrap_new2.fixed.csv
|
| 67 |
-
|
| 68 |
-
The notebook creates separate cleaned columns for human text and AI text, then stacks them into a single training dataframe with labels.
|
| 69 |
-
|
| 70 |
-
Important preprocessing steps:
|
| 71 |
-
|
| 72 |
-
- remove URLs
|
| 73 |
-
- keep only Nepali Unicode characters and whitespace
|
| 74 |
-
- lowercase the text
|
| 75 |
-
- remove consecutive repeated words
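
The preprocessing steps above might look roughly like the following sketch. The function name `clean_nepali_text` and the exact regular expressions are assumptions for illustration; the notebook's own implementation may differ, for example in which Unicode ranges it keeps.

```python
import re

# Assumed patterns: http(s)/www URLs, and anything outside the
# Devanagari block (U+0900-U+097F) that is not whitespace.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_DEVANAGARI_RE = re.compile(r"[^\u0900-\u097F\s]")


def clean_nepali_text(text: str) -> str:
    text = URL_RE.sub(" ", text)             # remove URLs
    text = NON_DEVANAGARI_RE.sub(" ", text)  # keep Devanagari and whitespace
    text = text.lower()                      # no effect on Devanagari itself
    words = text.split()
    # Drop a word when it repeats the word immediately before it.
    deduped = [w for i, w in enumerate(words) if i == 0 or w != words[i - 1]]
    return " ".join(deduped)
```

Note that the danda (U+0964) falls inside the Devanagari block, so sentence boundaries survive this cleaning step.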
|
| 76 |
-
|
| 77 |
-
### Feature Engineering
|
| 78 |
-
|
| 79 |
-
The notebook combines two feature families:
|
| 80 |
-
|
| 81 |
-
- Word-level TF-IDF with 1-2 gram features
|
| 82 |
-
- Dense, hand-crafted features based on text structure
|
| 83 |
-
|
| 84 |
-
The hand-crafted features include:
|
| 85 |
-
|
| 86 |
-
- burstiness statistics from sentence lengths
|
| 87 |
-
- average word length
|
| 88 |
-
- average sentence length
|
| 89 |
-
- lexical diversity
|
| 90 |
-
- punctuation ratio
|
| 91 |
-
- repeated bigram ratio
|
| 92 |
-
- Devanagari diacritic density
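
A minimal sketch of how several of these features could be computed is shown below. The formulas are assumptions, since the exact definitions used in the notebook are not given here: burstiness is taken as the standard deviation of sentence lengths divided by their mean, and diacritic density counts Unicode combining marks. The name `stylometric_features` is hypothetical.

```python
import re
import statistics
import unicodedata

# Assumed sentence delimiters: danda, question mark, exclamation mark.
SENTENCE_SPLIT_RE = re.compile(r"[।?!]+")


def stylometric_features(text: str) -> dict:
    sentences = [s.split() for s in SENTENCE_SPLIT_RE.split(text) if s.strip()]
    words = [w for sentence in sentences for w in sentence]
    sent_lens = [len(sentence) for sentence in sentences]

    mean_len = statistics.fmean(sent_lens) if sent_lens else 0.0
    std_len = statistics.pstdev(sent_lens) if len(sent_lens) > 1 else 0.0
    bigrams = list(zip(words, words[1:]))
    # Combining marks (categories Mn and Mc) cover Devanagari vowel signs.
    marks = sum(1 for ch in text if unicodedata.category(ch) in ("Mn", "Mc"))

    return {
        "avg_sentence_length": mean_len,
        "burstiness": std_len / mean_len if mean_len else 0.0,
        "avg_word_length": (
            statistics.fmean([len(w) for w in words]) if words else 0.0
        ),
        "lexical_diversity": len(set(words)) / len(words) if words else 0.0,
        "repeated_bigram_ratio": (
            1 - len(set(bigrams)) / len(bigrams) if bigrams else 0.0
        ),
        "diacritic_density": marks / len(text) if text else 0.0,
    }
```

Each text yields one small dense vector, which is the kind of matrix that gets stacked next to the sparse TF-IDF features.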
|
| 93 |
-
|
| 94 |
-
The sparse TF-IDF matrix is concatenated with the dense feature matrix using horizontal stacking.
|
| 95 |
-
|
| 96 |
-
### Models Trained
|
| 97 |
-
|
| 98 |
-
The notebook compares several standard classifiers:
|
| 99 |
-
|
| 100 |
-
- LogisticRegressionCV
|
| 101 |
-
- RidgeClassifierCV
|
| 102 |
-
- MultinomialNB
|
| 103 |
-
- BernoulliNB
|
| 104 |
-
- RandomForestClassifier
|
| 105 |
-
- GradientBoostingClassifier
|
| 106 |
-
- LinearSVC
|
| 107 |
-
- KNeighborsClassifier
|
| 108 |
-
|
| 109 |
-
Dense conversion is applied only where needed, such as for LinearSVC and KNeighborsClassifier.
|
| 110 |
-
|
| 111 |
-
### Evaluation
|
| 112 |
-
|
| 113 |
-
The notebook evaluates the models with:
|
| 114 |
-
|
| 115 |
-
- validation accuracy
|
| 116 |
-
- weighted F1 score
|
| 117 |
-
- classification reports
|
| 118 |
-
- confusion matrices
|
| 119 |
-
- ROC curves
|
| 120 |
-
|
| 121 |
-
The top models are selected by validation accuracy and re-used in later prediction cells.
|
| 122 |
-
|
| 123 |
-
### Prediction Demo
|
| 124 |
-
|
| 125 |
-
The notebook includes several sample Nepali texts for inference. It prints per-model predictions and, where possible, confidence values.
|
| 126 |
-
|
| 127 |
-
### Saved Artifacts
|
| 128 |
-
|
| 129 |
-
Each model is saved as a pickle file in a local saved_models directory.
|
| 130 |
-
|
| 131 |
-
### Known Issues
|
| 132 |
-
|
| 133 |
-
- Several cells are duplicated, especially the dataset loading cells.
|
| 134 |
-
- The vectorizer and the feature builder are not saved with the models, so full reloading is incomplete.
|
| 135 |
-
- There are repeated prediction sections, which makes the notebook harder to maintain.
|
| 136 |
-
- Some cells appear to be placeholders or empty.
|
| 137 |
-
|
| 138 |
-
## working model.ipynb
|
| 139 |
-
|
| 140 |
-
### Purpose
|
| 141 |
-
|
| 142 |
-
This notebook is a refinement of main.ipynb. It keeps the same overall classifier strategy but improves how long Nepali articles are handled.
|
| 143 |
-
|
| 144 |
-
### Main Difference From main.ipynb
|
| 145 |
-
|
| 146 |
-
The key improvement is sentence chunking:
|
| 147 |
-
|
| 148 |
-
- long texts are split into smaller chunks
|
| 149 |
-
- chunk boundaries prefer Nepali danda punctuation
|
| 150 |
-
- each chunk is limited to a small number of sentences or words
|
| 151 |
-
|
| 152 |
-
This makes the dataset more granular and helps the classifier train on smaller, more uniform samples.
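
The chunking idea can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the notebook's own `split_into_sentence_chunks`; the function name `chunk_by_danda` and the `max_sentences` / `max_words` limits are assumptions chosen for the example.

```python
import re

DANDA = "\u0964"  # Devanagari danda, the Nepali sentence terminator


def chunk_by_danda(text, max_sentences=3, max_words=80):
    """Split text into chunks that end on danda boundaries (hypothetical helper)."""
    # Each match is one sentence with its closing danda kept attached.
    sentences = [s.strip() for s in re.findall(r"[^\u0964]+\u0964?", text) if s.strip()]
    chunks, current, words = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed a limit.
        if current and (len(current) >= max_sentences or words + n > max_words):
            chunks.append(" ".join(current))
            current, words = [], 0
        current.append(sentence)
        words += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_words` is kept whole in this sketch, since splitting inside a sentence would defeat the point of preferring danda boundaries.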
|
| 153 |
-
|
| 154 |
-
### Preprocessing
|
| 155 |
-
|
| 156 |
-
The notebook defines:
|
| 157 |
-
|
| 158 |
-
- clean_text
|
| 159 |
-
- remove_auto_repeating
|
| 160 |
-
- split_into_sentence_chunks
|
| 161 |
-
- expand_texts_to_chunks
|
| 162 |
-
|
| 163 |
-
These functions preserve sentence punctuation for chunking, then normalize the cleaned chunks for downstream training.
|
| 164 |
-
|
| 165 |
-
### Feature Engineering and Models
|
| 166 |
-
|
| 167 |
-
The rest of the pipeline is essentially the same as main.ipynb:
|
| 168 |
-
|
| 169 |
-
- TF-IDF word n-grams
|
| 170 |
-
- burstiness and stylometric features
|
| 171 |
-
- concatenated sparse + dense representation
|
| 172 |
-
- the same family of sklearn classifiers
|
| 173 |
-
|
| 174 |
-
### Evaluation and Inference
|
| 175 |
-
|
| 176 |
-
The notebook follows the same model comparison, ROC plotting, confusion matrix plotting, and sample prediction pattern as the baseline notebook.
|
| 177 |
-
|
| 178 |
-
### Saved Artifacts
|
| 179 |
-
|
| 180 |
-
Like main.ipynb, the fitted sklearn models are stored under saved_models as individual pickle files.
|
| 181 |
-
|
| 182 |
-
### Known Issues
|
| 183 |
-
|
| 184 |
-
- The notebook has redundant cells and repeated code blocks.
|
| 185 |
-
- It still does not serialize the vectorizer and feature transformer together with the model artifacts.
|
| 186 |
-
- Some prediction logic is repeated more than once.
|
| 187 |
-
|
| 188 |
-
## Nepali_Ai_vs_Human.ipynb
|
| 189 |
-
|
| 190 |
-
### Purpose
|
| 191 |
-
|
| 192 |
-
This notebook is the deep learning branch of the project. Instead of hand-crafted features plus classical classifiers, it uses a transformer encoder with a classification head.
|
| 193 |
-
|
| 194 |
-
### Data Preparation
|
| 195 |
-
|
| 196 |
-
The notebook reads one CSV file and converts the two-column source format into a single text-label dataframe.
|
| 197 |
-
|
| 198 |
-
Important preparation steps:
|
| 199 |
-
|
| 200 |
-
- validate required columns
|
| 201 |
-
- drop nulls
|
| 202 |
-
- build a unified dataframe with text and label
|
| 203 |
-
- filter short texts
|
| 204 |
-
- drop duplicate text rows
|
| 205 |
-
- shuffle the dataset
|
| 206 |
-
|
| 207 |
-
The notebook keeps the raw text mostly intact rather than applying aggressive regex cleaning.
|
| 208 |
-
|
| 209 |
-
### Model Architecture
|
| 210 |
-
|
| 211 |
-
The model pipeline is built around Hugging Face transformers and PyTorch:
|
| 212 |
-
|
| 213 |
-
- tokenizer from a multilingual BERT-style model
|
| 214 |
-
- AutoModel backbone
|
| 215 |
-
- classification head with dropout
|
| 216 |
-
- binary output layer
|
| 217 |
-
|
| 218 |
-
The notebook defines a custom PyTorch module named IndicBERTClassifier.
|
| 219 |
-
|
| 220 |
-
### Training Setup
|
| 221 |
-
|
| 222 |
-
The notebook uses:
|
| 223 |
-
|
| 224 |
-
- train/validation split with stratification
|
| 225 |
-
- DataLoader-based batching
|
| 226 |
-
- AdamW optimizer
|
| 227 |
-
- cross-entropy loss
|
| 228 |
-
- linear warmup scheduler
|
| 229 |
-
- gradient accumulation
|
| 230 |
-
- mixed precision when CUDA is available
|
| 231 |
-
- early stopping on validation F1
|
| 232 |
-
|
| 233 |
-
This makes it more GPU-oriented than the sklearn notebooks.
|
| 234 |
-
|
| 235 |
-
### Evaluation
|
| 236 |
-
|
| 237 |
-
Per-epoch evaluation includes:
|
| 238 |
-
|
| 239 |
-
- accuracy
|
| 240 |
-
- F1 score
|
| 241 |
-
- classification report
|
| 242 |
-
|
| 243 |
-
The notebook also saves improved checkpoints when validation F1 improves.
|
| 244 |
-
|
| 245 |
-
### Prediction Demo
|
| 246 |
-
|
| 247 |
-
The notebook defines a predict function that:
|
| 248 |
-
|
| 249 |
-
- tokenizes the input text
|
| 250 |
-
- runs the transformer model
|
| 251 |
-
- applies softmax
|
| 252 |
-
- returns the predicted class and confidence
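
The last two steps (softmax, then class and confidence) can be illustrated without the model itself. The sketch below assumes the model has already produced two raw logits; the `LABELS` mapping and the name `decode_prediction` are hypothetical, since the documentation does not state which class id means which label.

```python
import math

# Hypothetical id-to-label mapping; the real notebook may use the opposite order.
LABELS = {0: "human", 1: "ai"}


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits):
    """Return (label, confidence) for the highest-probability class."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[class_id], probs[class_id]
```

Here the confidence is simply the softmax probability of the winning class, so it is never below 0.5 for a two-class output.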
|
| 253 |
-
|
| 254 |
-
Several sample Nepali sentences are passed through the predictor at the end of the notebook.
|
| 255 |
-
|
| 256 |
-
### Saved Artifacts
|
| 257 |
-
|
| 258 |
-
The notebook saves:
|
| 259 |
-
|
| 260 |
-
- model_best.pth
|
| 261 |
-
- model_latest.pth
|
| 262 |
-
- tokenizer files in ./nepali_xlmr_classifier
|
| 263 |
-
|
| 264 |
-
There is also a Colab-oriented zip export section.
|
| 265 |
-
|
| 266 |
-
### Known Issues
|
| 267 |
-
|
| 268 |
-
- The notebook mixes local notebook execution with Colab-specific code.
|
| 269 |
-
- Some cells show CUDA or environment-related warnings.
|
| 270 |
-
- The training flow is more complex and less polished than the final hybrid notebook.
|
| 271 |
-
- Paths are hard-coded in a few places.
|
| 272 |
-
|
| 273 |
-
## final_main.ipynb
|
| 274 |
-
|
| 275 |
-
### Purpose
|
| 276 |
-
|
| 277 |
-
This is the most complete notebook in the folder. It combines semantic embeddings from Sentence Transformers with stylometric features, then trains a linear model and an XGBoost model on the fused feature vector.
|
| 278 |
-
|
| 279 |
-
### Data Preparation
|
| 280 |
-
|
| 281 |
-
The notebook reads the dataset from:
|
| 282 |
-
|
| 283 |
-
- ../DATASET/Final_data/final_news345.csv
|
| 284 |
-
- /mnt/linux-data/Work/aiapi/notebook/ai_vs_human_nepali/Final_data/final_news345.csv
|
| 285 |
-
|
| 286 |
-
The notebook expects a label column with string values and maps them to binary classes.
|
| 287 |
-
|
| 288 |
-
It also includes a preprocessing utility that can:
|
| 289 |
-
|
| 290 |
-
- split very long Nepali texts into chunks
|
| 291 |
-
- preserve danda-based sentence boundaries
|
| 292 |
-
- filter out extremely short chunks
|
| 293 |
-
- balance the dataset by sampling each class to the same count
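
As an illustration of the balancing step, the sketch below downsamples every class to the size of the smallest one. Whether the notebook downsamples or uses another strategy is not stated here, so treat this as one plausible reading; `balance_by_downsampling`, the `label` key, and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def balance_by_downsampling(rows, label_key="label", seed=42):
    """Sample each class down to the size of the smallest class (hypothetical helper)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    if not by_label:
        return []
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced
```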
|
| 294 |
-
|
| 295 |
-
### Visualization
|
| 296 |
-
|
| 297 |
-
The notebook includes exploratory plots for:
|
| 298 |
-
|
| 299 |
-
- class distribution
|
| 300 |
-
- character count distribution
|
| 301 |
-
- word count distribution
|
| 302 |
-
- sentence count distribution
|
| 303 |
-
- cleaned text length distribution
|
| 304 |
-
- stylometric feature comparison plots
|
| 305 |
-
|
| 306 |
-
This makes it the most documented and inspection-friendly notebook in the folder.
|
| 307 |
-
|
| 308 |
-
### Text Cleaning
|
| 309 |
-
|
| 310 |
-
The notebook defines clean_nepali_text, which:
|
| 311 |
-
|
| 312 |
-
- lowercases the text
|
| 313 |
-
- normalizes Nepali and common Unicode punctuation
|
| 314 |
-
- removes unwanted characters
|
| 315 |
-
- collapses repeated whitespace
|
| 316 |
-
- trims the result
|
| 317 |
-
|
| 318 |
-
This cleaned text is used for both embeddings and stylometric extraction.
|
| 319 |
-
|
| 320 |
-
### Stylometric Features
|
| 321 |
-
|
| 322 |
-
The notebook uses six hand-crafted features:
|
| 323 |
-
|
| 324 |
-
- word_count
|
| 325 |
-
- sentence_count
|
| 326 |
-
- avg_word_length
|
| 327 |
-
- avg_sentence_length
|
| 328 |
-
- type_token_ratio
|
| 329 |
-
- punctuation_ratio
|
| 330 |
-
|
| 331 |
-
These features are extracted from the cleaned text and then standardized with StandardScaler.
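
A rough sketch of how six features with these names could be computed is shown below. The exact definitions used in the notebook are not given in this document, so the tokenization (whitespace split), the sentence terminators (danda, `?`, `!`), and the punctuation set are all assumptions.

```python
import re
import string

DANDA = "\u0964"  # Devanagari danda
PUNCTUATION = set(string.punctuation) | {DANDA}


def stylometric_features(text):
    """Compute six simple stylometric features from cleaned text (illustrative only)."""
    words = text.split()
    # Assumption: danda, '?' and '!' end a sentence.
    sentences = [s for s in re.split(r"[\u0964?!]+", text) if s.strip()]
    word_count = len(words)
    sentence_count = len(sentences)
    return {
        "word_count": word_count,
        "sentence_count": sentence_count,
        "avg_word_length": sum(len(w) for w in words) / word_count if word_count else 0.0,
        "avg_sentence_length": word_count / sentence_count if sentence_count else 0.0,
        "type_token_ratio": len(set(words)) / word_count if word_count else 0.0,
        "punctuation_ratio": sum(ch in PUNCTUATION for ch in text) / len(text) if text else 0.0,
    }
```

Note that a danda separated by spaces counts as a token under whitespace splitting, which slightly inflates `word_count`; a real implementation might strip punctuation before counting.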
|
| 332 |
-
|
| 333 |
-
### Semantic Embeddings
|
| 334 |
-
|
| 335 |
-
The notebook uses the Sentence Transformers model:
|
| 336 |
-
|
| 337 |
-
- sentence-transformers/paraphrase-multilingual-mpnet-base-v2
|
| 338 |
-
|
| 339 |
-
This produces 768-dimensional multilingual sentence embeddings. The notebook loads the embedder on CPU to reduce CUDA memory pressure.
|
| 340 |
-
|
| 341 |
-
### Feature Fusion
|
| 342 |
-
|
| 343 |
-
The final feature matrix is built by concatenating:
|
| 344 |
-
|
| 345 |
-
- 768 embedding dimensions
|
| 346 |
-
- 6 scaled stylometric dimensions
|
| 347 |
-
|
| 348 |
-
So each sample becomes a 774-dimensional vector.
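
The fusion step can be sketched in plain Python to make the 768 + 6 = 774 arithmetic concrete. This stand-in mimics standard scaling (population standard deviation, with zero-variance columns left unscaled) instead of calling scikit-learn, and the names `fit_scaler` and `fuse` are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Return per-column means and population standard deviations."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # A zero standard deviation would divide by zero, so fall back to 1.0.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def fuse(embedding, stylo, means, stds):
    """Concatenate an embedding with the standardized stylometric vector."""
    scaled = [(x - m) / s for x, m, s in zip(stylo, means, stds)]
    return list(embedding) + scaled
```

The scaler statistics should come from the training split only and then be reused unchanged at prediction time, which is why the saved bundle needs to include the fitted scaler.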
|
| 349 |
-
|
| 350 |
-
### Models Trained
|
| 351 |
-
|
| 352 |
-
Two models are trained on the fused features:
|
| 353 |
-
|
| 354 |
-
- Logistic Regression
|
| 355 |
-
- XGBoost
|
| 356 |
-
|
| 357 |
-
XGBoost is configured with class imbalance handling through scale_pos_weight.
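
For reference, a common way to choose `scale_pos_weight` is the ratio of negative to positive training samples. The sketch below shows only that arithmetic; how the notebook actually derives its value is not stated in this document, and the helper name and the choice of `1` as the positive class are assumptions.

```python
def negative_to_positive_ratio(labels, positive=1):
    """Ratio of negative to positive samples, a typical scale_pos_weight value."""
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("no positive samples; the ratio is undefined")
    return neg / pos
```

On a dataset that has already been balanced to equal class counts, this ratio is 1.0 and the setting has no effect.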
|
| 358 |
-
|
| 359 |
-
### Evaluation
|
| 360 |
-
|
| 361 |
-
The notebook evaluates both models using:
|
| 362 |
-
|
| 363 |
-
- accuracy
|
| 364 |
-
- precision
|
| 365 |
-
- recall
|
| 366 |
-
- F1 score
|
| 367 |
-
- confusion matrices
|
| 368 |
-
- ROC curves and AUC
|
| 369 |
-
|
| 370 |
-
It also computes and visualizes XGBoost feature importance.
|
| 371 |
-
|
| 372 |
-
### Prediction Flow
|
| 373 |
-
|
| 374 |
-
The prediction function follows this exact sequence:
|
| 375 |
-
|
| 376 |
-
1. clean the input
|
| 377 |
-
2. extract stylometric features
|
| 378 |
-
3. build the sentence embedding
|
| 379 |
-
4. scale the stylometric vector
|
| 380 |
-
5. concatenate the two feature blocks
|
| 381 |
-
6. predict with XGBoost
|
| 382 |
-
|
| 383 |
-
The function returns a dictionary containing the label, numeric class id, and probability.
|
| 384 |
-
|
| 385 |
-
### Saved Artifacts
|
| 386 |
-
|
| 387 |
-
The notebook saves a joblib bundle at:
|
| 388 |
-
|
| 389 |
-
- ../models/ai_text_detector_model.pkl
|
| 390 |
-
|
| 391 |
-
The saved artifact includes:
|
| 392 |
-
|
| 393 |
-
- xgb_model
|
| 394 |
-
- lr_model
|
| 395 |
-
- scaler
|
| 396 |
-
- embed_model name string
|
| 397 |
-
- stylo_cols
|
| 398 |
-
- label_map
|
| 399 |
-
|
| 400 |
-
### Known Issues
|
| 401 |
-
|
| 402 |
-
- The XGBoost fit call passes the test set as eval_set. That is acceptable for monitoring, but if early stopping or tuning relies on it, the reported test metrics are no longer a strictly held-out estimate.
|
| 403 |
-
- The embedding model name is saved, but the embedder itself is not serialized.
|
| 404 |
-
- The notebook is the strongest production candidate, but it still lacks a separate load-and-predict helper for end users.
|
| 405 |
-
|
| 406 |
-
## Comparison Summary
|
| 407 |
-
|
| 408 |
-
| Notebook | Main Approach | Strength | Weakness |
|
| 409 |
-
|---|---|---|---|
|
| 410 |
-
| main.ipynb | TF-IDF + stylometry + classic ML | Simple baseline, easy to inspect | Repetitive and not fully serializable |
|
| 411 |
-
| working model.ipynb | TF-IDF + stylometry + chunking | Better handling of long text | Still mostly a baseline notebook |
|
| 412 |
-
| Nepali_Ai_vs_Human.ipynb | Transformer classifier | Strong semantic modeling | Heavier, more environment-sensitive |
|
| 413 |
-
| final_main.ipynb | Sentence embeddings + stylometry + XGBoost | Best balance of performance, clarity, and deployability | Uses a saved model name string instead of serializing the embedder |
|
| 414 |
-
|
| 415 |
-
## Recommended Reading Order
|
| 416 |
-
|
| 417 |
-
If you want to understand the project evolution, read the notebooks in this order:
|
| 418 |
-
|
| 419 |
-
1. main.ipynb
|
| 420 |
-
2. working model.ipynb
|
| 421 |
-
3. Nepali_Ai_vs_Human.ipynb
|
| 422 |
-
4. final_main.ipynb
|
| 423 |
-
|
| 424 |
-
If you only want the most useful notebook for reuse or deployment, start with final_main.ipynb.
|
| 425 |
-
|
| 426 |
-
## Practical Notes
|
| 427 |
-
|
| 428 |
-
- Several notebooks contain duplicated or stale cells from experimentation.
|
| 429 |
-
- Not every cell has been executed successfully.
|
| 430 |
-
- Paths are sometimes hard-coded for the local workspace, so moving the folder may require path cleanup.
|
| 431 |
-
- The project alternates between three styles of modeling: classical sklearn, transformer fine-tuning, and hybrid embedding-based classification.
|
| 432 |
-
|
| 433 |
-
## Suggested Next Step
|
| 434 |
-
|
| 435 |
-
If you want, the next useful document to add is an inference guide that explains how to load the saved model bundle from final_main.ipynb and run predictions on new Nepali text.
|
|
|
notebook/ai_vs_human_nepali/notebook/final_main.ipynb
DELETED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebook/ai_vs_human_nepali/notebook/main.ipynb
DELETED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebook/ai_vs_human_nepali/notebook/working model.ipynb
DELETED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebook/ai_vs_human_nepali/topic_scrapper.ipynb
DELETED
|
@@ -1,455 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"cells": [
|
| 3 |
-
{
|
| 4 |
-
"cell_type": "code",
|
| 5 |
-
"execution_count": 15,
|
| 6 |
-
"id": "4b53d4bc",
|
| 7 |
-
"metadata": {},
|
| 8 |
-
"outputs": [],
|
| 9 |
-
"source": [
|
| 10 |
-
"# # Groq Nepali Rewriter\n",
|
| 11 |
-
"\n",
|
| 12 |
-
"# This notebook loads the dataset, builds a Nepali rewrite prompt, tests one sample, and then saves a batch output CSV using the Groq API.\n",
|
| 13 |
-
"\n",
|
| 14 |
-
"# Requirements:\n",
|
| 15 |
-
"# - `GROQ_API_KEY` must be available in `.env`\n",
|
| 16 |
-
"# - the input file must contain a `paragraph` column"
|
| 17 |
-
]
|
| 18 |
-
},
|
| 19 |
-
{
|
| 20 |
-
"cell_type": "code",
|
| 21 |
-
"execution_count": 16,
|
| 22 |
-
"id": "6c8dc1cb",
|
| 23 |
-
"metadata": {},
|
| 24 |
-
"outputs": [
|
| 25 |
-
{
|
| 26 |
-
"data": {
|
| 27 |
-
"text/plain": [
|
| 28 |
-
"True"
|
| 29 |
-
]
|
| 30 |
-
},
|
| 31 |
-
"execution_count": 16,
|
| 32 |
-
"metadata": {},
|
| 33 |
-
"output_type": "execute_result"
|
| 34 |
-
}
|
| 35 |
-
],
|
| 36 |
-
"source": [
|
| 37 |
-
"import os\n",
|
| 38 |
-
"import re\n",
|
| 39 |
-
"import time\n",
|
| 40 |
-
"from concurrent.futures import ThreadPoolExecutor, as_completed\n",
|
| 41 |
-
"\n",
|
| 42 |
-
"import pandas as pd\n",
|
| 43 |
-
"from dotenv import load_dotenv\n",
|
| 44 |
-
"from groq import Groq\n",
|
| 45 |
-
"\n",
|
| 46 |
-
"load_dotenv()"
|
| 47 |
-
]
|
| 48 |
-
},
|
| 49 |
-
{
|
| 50 |
-
"cell_type": "code",
|
| 51 |
-
"execution_count": 17,
|
| 52 |
-
"id": "019adfa8",
|
| 53 |
-
"metadata": {},
|
| 54 |
-
"outputs": [],
|
| 55 |
-
"source": [
|
| 56 |
-
"api_key = os.getenv(\"GROQ_API_KEY2\")\n",
|
| 57 |
-
"if not api_key:\n",
|
| 58 |
-
" raise ValueError(\"GROQ_API_KEY2 not found in .env or environment.\")\n",
|
| 59 |
-
"\n",
|
| 60 |
-
"client = Groq(api_key=api_key)\n",
|
| 61 |
-
"MODEL_NAME = \"llama-3.3-70b-versatile\""
|
| 62 |
-
]
|
| 63 |
-
},
|
| 64 |
-
{
|
| 65 |
-
"cell_type": "code",
|
| 66 |
-
"execution_count": 18,
|
| 67 |
-
"id": "4b4d2bbe",
|
| 68 |
-
"metadata": {},
|
| 69 |
-
"outputs": [],
|
| 70 |
-
"source": [
|
| 71 |
-
"data =pd.read_csv(\"DATASET/topics_1000.csv\")"
|
| 72 |
-
]
|
| 73 |
-
},
|
| 74 |
-
{
|
| 75 |
-
"cell_type": "code",
|
| 76 |
-
"execution_count": 19,
|
| 77 |
-
"id": "c36cfbbf",
|
| 78 |
-
"metadata": {},
|
| 79 |
-
"outputs": [
|
| 80 |
-
{
|
| 81 |
-
"data": {
|
| 82 |
-
"text/html": [
|
| 83 |
-
"<div>\n",
|
| 84 |
-
"<style scoped>\n",
|
| 85 |
-
" .dataframe tbody tr th:only-of-type {\n",
|
| 86 |
-
" vertical-align: middle;\n",
|
| 87 |
-
" }\n",
|
| 88 |
-
"\n",
|
| 89 |
-
" .dataframe tbody tr th {\n",
|
| 90 |
-
" vertical-align: top;\n",
|
| 91 |
-
" }\n",
|
| 92 |
-
"\n",
|
| 93 |
-
" .dataframe thead th {\n",
|
| 94 |
-
" text-align: right;\n",
|
| 95 |
-
" }\n",
|
| 96 |
-
"</style>\n",
|
| 97 |
-
"<table border=\"1\" class=\"dataframe\">\n",
|
| 98 |
-
" <thead>\n",
|
| 99 |
-
" <tr style=\"text-align: right;\">\n",
|
| 100 |
-
" <th></th>\n",
|
| 101 |
-
" <th>id</th>\n",
|
| 102 |
-
" <th>topic</th>\n",
|
| 103 |
-
" </tr>\n",
|
| 104 |
-
" </thead>\n",
|
| 105 |
-
" <tbody>\n",
|
| 106 |
-
" <tr>\n",
|
| 107 |
-
" <th>0</th>\n",
|
| 108 |
-
" <td>1</td>\n",
|
| 109 |
-
" <td>नेपालमा कृत्रिम बुद्धिमत्ता विकासको वर्तमान अव...</td>\n",
|
| 110 |
-
" </tr>\n",
|
| 111 |
-
" <tr>\n",
|
| 112 |
-
" <th>1</th>\n",
|
| 113 |
-
" <td>2</td>\n",
|
| 114 |
-
" <td>नेपालको शिक्षा प्रणालीमा डिजिटल प्रविधिको प्रभाव</td>\n",
|
| 115 |
-
" </tr>\n",
|
| 116 |
-
" <tr>\n",
|
| 117 |
-
" <th>2</th>\n",
|
| 118 |
-
" <td>3</td>\n",
|
| 119 |
-
" <td>काठमाडौँ उपत्यकाको वायु प्रदूषण समस्या</td>\n",
|
| 120 |
-
" </tr>\n",
|
| 121 |
-
" <tr>\n",
|
| 122 |
-
" <th>3</th>\n",
|
| 123 |
-
" <td>4</td>\n",
|
| 124 |
-
" <td>नेपालमा जलवायु परिवर्तनका असरहरू</td>\n",
|
| 125 |
-
" </tr>\n",
|
| 126 |
-
" <tr>\n",
|
| 127 |
-
" <th>4</th>\n",
|
| 128 |
-
" <td>5</td>\n",
|
| 129 |
-
" <td>ग्रामीण क्षेत्रमा इन्टरनेट पहुँचको विस्तार</td>\n",
|
| 130 |
-
" </tr>\n",
|
| 131 |
-
" </tbody>\n",
|
| 132 |
-
"</table>\n",
|
| 133 |
-
"</div>"
|
| 134 |
-
],
|
| 135 |
-
"text/plain": [
|
| 136 |
-
" id topic\n",
|
| 137 |
-
"0 1 नेपालमा कृत्रिम बुद्धिमत्ता विकासको वर्तमान अव...\n",
|
| 138 |
-
"1 2 नेपालको शिक्षा प्रणालीमा डिजिटल प्रविधिको प्रभाव\n",
|
| 139 |
-
"2 3 काठमाडौँ उपत्यकाको वायु प्रदूषण समस्या\n",
|
| 140 |
-
"3 4 नेपालमा जलवायु परिवर्तनका असरहरू\n",
|
| 141 |
-
"4 5 ग्रामीण क्षेत्रमा इन्टरनेट पहुँचको विस्तार"
|
| 142 |
-
]
|
| 143 |
-
},
|
| 144 |
-
"execution_count": 19,
|
| 145 |
-
"metadata": {},
|
| 146 |
-
"output_type": "execute_result"
|
| 147 |
-
}
|
| 148 |
-
],
|
| 149 |
-
"source": [
|
| 150 |
-
"data.head()"
|
| 151 |
-
]
|
| 152 |
-
},
|
| 153 |
-
{
|
| 154 |
-
"cell_type": "code",
|
| 155 |
-
"execution_count": 20,
|
| 156 |
-
"id": "b6e226b8",
|
| 157 |
-
"metadata": {},
|
| 158 |
-
"outputs": [],
|
| 159 |
-
"source": [
|
| 160 |
-
"import numpy as np\n",
|
| 161 |
-
"def build_prompt(paragraph):\n",
|
| 162 |
-
" style = [\n",
|
| 163 |
-
" \"Use simple and clear language.\",\n",
|
| 164 |
-
" \"Make it engaging and interesting to read.\",\n",
|
| 165 |
-
" \"Use a conversational tone.\",\n",
|
| 166 |
-
" \"Keep the original meaning intact.\",\n",
|
| 167 |
-
" \"Avoid complex jargon and technical terms.\",\n",
|
| 168 |
-
" \"Use short sentences and paragraphs.\",\n",
|
| 169 |
-
" \"Add examples or anecdotes to illustrate points.\",\n",
|
| 170 |
-
" \"Use active voice instead of passive voice.\",\n",
|
| 171 |
-
" \"Include a call to action or a thought-provoking question at the end.\",\n",
|
| 172 |
-
" ]\n",
|
| 173 |
-
" selected_style_random_single = np.random.choice(style, size=len(style), replace=False) # Shuffle all style guidelines into a random order\n",
|
| 174 |
-
" prompt = f\"\"\"\n",
|
| 175 |
-
" give me an essay for the following topics puree nepali ok no enlgish language:\n",
|
| 176 |
-
" {paragraph}\n",
|
| 177 |
-
" Rewrite the above paragraph in Nepali, following these style guidelines:\n",
|
| 178 |
-
" {', '.join(selected_style_random_single)}\n",
|
| 179 |
-
" \"\"\"\n",
|
| 180 |
-
" return prompt.strip()"
|
| 181 |
-
]
|
| 182 |
-
},
|
| 183 |
-
{
|
| 184 |
-
"cell_type": "code",
|
| 185 |
-
"execution_count": 21,
|
| 186 |
-
"id": "cf16922b",
|
| 187 |
-
"metadata": {},
|
| 188 |
-
"outputs": [
|
| 189 |
-
{
|
| 190 |
-
"name": "stdout",
|
| 191 |
-
"output_type": "stream",
|
| 192 |
-
"text": [
|
| 193 |
-
"नेपालमा कृत्रिम बुद्धिमत्ता विकासको वर्तमान अवस्था\n",
|
| 194 |
-
"\n",
|
| 195 |
-
"कृत्रिम बुद्धिमत्ता विकास नेपालको लागि एक नयाँ युग हो । यो प्राविधिक क्षेत्र दिन-प्रतिदिन विकसित हुने क्रममा छ । नेपालमा कृत्रिम बुद्धिमत्ताले विभिन्न क्षेत्रमा परिवर्तन ल्याउने क्षमता राख्दछ । जस्तै: स्वास्थ्य सेवामा, शिक्षामा, वित्तीय सेवामा, तथा उत्पादन क्षेत्रमा ।\n",
|
| 196 |
-
"\n",
|
| 197 |
-
"नेपालमा कृत्रिम बुद्धिमत्ताको विकासले नयाँ अवस्था प्राप्त गरिरहेको छ । यो देशमा विभिन्न प्राविधिक कम्पनीहरुले कृत्रिम बुद्धिमत्ताको विकासमा लगनशील छन् । तसर्थ, यसले नेपालमा रोजगारीको अवसर पनि बढाउने छ । उदाहरणको लागि, कृत्रिम बुद्धिमत्ताले स्वास्थ्य सेवामा रोग निदान गर्ने, रोगको उपचार सुझाउने, तथा व्यक्तिको स्वास्थ्य जाँच गर्ने काम गर्नसक्ने छ ।\n",
|
| 198 |
-
"\n",
|
| 199 |
-
"कृत्रिम बुद्धिमत्ताको विकासले नेपालको अर्थतन्त्रमा पनि परिवर्तन ल्याउने छ । यसले व्यवसायिक क्षेत्रमा उत्पादनशीलता बढाउने, उत्पादन मुल्य कम गर्ने, तथा गुणस्तर मापन गर्ने काम गर्नसक्ने छ । उदाहरणको लागि, कृत्रिम बुद्धिमत्ताले वित्तीय सेवामा लेनदेनको निरीक्षण गर्ने, धोकाधोकाको मुल्यांकन गर्ने, तथा वित्तीय संस्थाहरुलाई सुझाव दिने काम गर्नसक्ने छ ।\n",
|
| 200 |
-
"\n",
|
| 201 |
-
"नेपालमा कृत्रिम बुद्धिमत्ता विकासको वर्तमान अवस्थाले देशलाई एक नयाँ दिशामा लम्बने क्षमता राख्दछ । तर, यसको विकासमा चुनौतिहरु पनि छन् । जस्तै: डाटा सुरक्षा, निजताको हनन, तथा श्रमिकहरुको प्रतिस्पर्धी क्षमता । तसर्थ, नेपालमा कृत्रिम बुद्धिमत्ताको विकासलाई प्रोत्साहित गर्नको लागि, हामीले यसको विकासमा लगनशील कम्पनीहरुलाई साथ दिनु पर्छ । हामीले पनि कृत्रिम बुद्धिमत्ताको विकासमा योगदान पुर्याउनुपर्छ ।\n",
|
| 202 |
-
"\n",
|
| 203 |
-
"आह, नेपालमा कृत्रिम बुद्धिमत्ता विकासको वर्तमान अवस्थाले देशलाई एक नयाँ दिशामा लम्बने क्षमता राख्दछ । तर, यसको विकासमा हामी के गरिरहेका छौ? हामीले कृत्रिम बुद्धिमत्ताको विकासमा योगदान पुर्याउने छौ कि? हामीले यसको विकासमा चुनौतिहरुलाई मात गर्ने छौ कि? यस प्रश्नको उत्तर हामीसँग छ । आउनうभ, हामी नेपालमा कृत्रिम बुद्धिमत्ताको विकासलाई प्रोत्साहित गरौं । आउनूभ, हामी देशलाई एक नयाँ दिशामा लम्बौं ।\n"
|
| 204 |
-
]
|
| 205 |
-
}
|
| 206 |
-
],
|
| 207 |
-
"source": [
|
| 208 |
-
"build_prompt = build_prompt\n",
|
| 209 |
-
"\n",
|
| 210 |
-
"sample_title = str(data.iloc[0][\"topic\"])\n",
|
| 211 |
-
"\n",
|
| 212 |
-
"sample_response = client.chat.completions.create(\n",
|
| 213 |
-
" model=MODEL_NAME,\n",
|
| 214 |
-
" messages=[{\"role\": \"user\", \"content\": build_prompt(sample_title)}],\n",
|
| 215 |
-
")\n",
|
| 216 |
-
"\n",
|
| 217 |
-
"generated_text = sample_response.choices[0].message.content.strip()\n",
|
| 218 |
-
"print(generated_text)"
|
| 219 |
-
]
|
| 220 |
-
},
|
| 221 |
-
{
|
| 222 |
-
"cell_type": "code",
|
| 223 |
-
"execution_count": null,
|
| 224 |
-
"id": "c709f126",
|
| 225 |
-
"metadata": {},
|
| 226 |
-
"outputs": [],
|
| 227 |
-
"source": [
|
| 228 |
-
"def grok_step3_5_scraper(\n",
|
| 229 |
-
" input_file,\n",
|
| 230 |
-
" output_file=\"step3_5_grok_nepali.csv\",\n",
|
| 231 |
-
" limit=100,\n",
|
| 232 |
-
" model=MODEL_NAME,\n",
|
| 233 |
-
" requests_per_second=2,\n",
|
| 234 |
-
" max_workers=2,\n",
|
| 235 |
-
" max_retries=3,\n",
|
| 236 |
-
"):\n",
|
| 237 |
-
" working_df = pd.read_csv(input_file)\n",
|
| 238 |
-
" if limit is not None:\n",
|
| 239 |
-
" working_df = working_df.head(limit)\n",
|
| 240 |
-
"\n",
|
| 241 |
-
" cols = set(working_df.columns)\n",
|
| 242 |
-
" if \"Title\" in cols or \"शीर्षक\" in cols:\n",
|
| 243 |
-
" title_col = \"Title\" if \"Title\" in cols else \"शीर्षक\"\n",
|
| 244 |
-
" prompt_col = title_col\n",
|
| 245 |
-
" if \"Paragraph\" in cols:\n",
|
| 246 |
-
" human_col = \"Paragraph\"\n",
|
| 247 |
-
" elif \"विवरण\" in cols:\n",
|
| 248 |
-
" human_col = \"विवरण\"\n",
|
| 249 |
-
" elif \"paragraph\" in cols:\n",
|
| 250 |
-
" human_col = \"paragraph\"\n",
|
| 251 |
-
" else:\n",
|
| 252 |
-
" human_col = prompt_col\n",
|
| 253 |
-
" elif \"paragraph\" in cols or \"Paragraph\" in cols or \"विवरण\" in cols:\n",
|
| 254 |
-
" prompt_col = (\n",
|
| 255 |
-
" \"paragraph\" if \"paragraph\" in cols\n",
|
| 256 |
-
" else (\"Paragraph\" if \"Paragraph\" in cols else \"विवरण\")\n",
|
| 257 |
-
" )\n",
|
| 258 |
-
" human_col = prompt_col\n",
|
| 259 |
-
" title_col = prompt_col\n",
|
| 260 |
-
" else:\n",
|
| 261 |
-
" raise ValueError(\n",
|
| 262 |
-
" \"No supported text columns found. Expected one of: Title/शीर्षक with Paragraph/विवरण, or paragraph.\"\n",
|
| 263 |
-
" )\n",
|
| 264 |
-
"\n",
|
| 265 |
-
" working_df = working_df.dropna(subset=[human_col]).copy()\n",
|
| 266 |
-
"\n",
|
| 267 |
-
" total_input_rows = len(working_df)\n",
|
| 268 |
-
" already_done = 0\n",
|
| 269 |
-
"\n",
|
| 270 |
-
" if os.path.exists(output_file):\n",
|
| 271 |
-
" try:\n",
|
| 272 |
-
" existing_df = pd.read_csv(output_file)\n",
|
| 273 |
-
" already_done = len(existing_df)\n",
|
| 274 |
-
" except pd.errors.EmptyDataError:\n",
|
| 275 |
-
" already_done = 0\n",
|
| 276 |
-
"\n",
|
| 277 |
-
" if already_done >= total_input_rows:\n",
|
| 278 |
-
" print(\n",
|
| 279 |
-
" f\"Nothing to do. {already_done} rows already exist in {output_file} (input rows: {total_input_rows}).\"\n",
|
| 280 |
-
" )\n",
|
| 281 |
-
" return\n",
|
| 282 |
-
"\n",
|
| 283 |
-
" if already_done > 0:\n",
|
| 284 |
-
" working_df = working_df.iloc[already_done:].copy()\n",
|
| 285 |
-
" print(\n",
|
| 286 |
-
" f\"Resuming from row {already_done}. Processing remaining {len(working_df)} rows out of {total_input_rows}.\"\n",
|
| 287 |
-
" )\n",
|
| 288 |
-
" else:\n",
|
| 289 |
-
" print(f\"Loaded {total_input_rows} rows from {input_file}\")\n",
|
| 290 |
-
" print(\n",
|
| 291 |
-
" f\"Using title column: {title_col} | prompt column: {prompt_col} | human column: {human_col}\"\n",
|
| 292 |
-
" )\n",
|
| 293 |
-
"\n",
|
| 294 |
-
" results = []\n",
|
| 295 |
-
"\n",
|
| 296 |
-
" bad_markers = [\n",
|
| 297 |
-
" \"error\",\n",
|
| 298 |
-
" \"invalid\",\n",
|
| 299 |
-
" \"not found\",\n",
|
| 300 |
-
" \"decommissioned\",\n",
|
| 301 |
-
" \"rate limit\",\n",
|
| 302 |
-
" \"api key\",\n",
|
| 303 |
-
" ]\n",
|
| 304 |
-
"\n",
|
| 305 |
-
" def is_valid_ai_text(text: str) -> bool:\n",
|
| 306 |
-
" if not text:\n",
|
| 307 |
-
" return False\n",
|
| 308 |
-
" clean_text = text.strip()\n",
|
| 309 |
-
" if len(clean_text) < 20:\n",
|
| 310 |
-
" return False\n",
|
| 311 |
-
" lower_text = clean_text.lower()\n",
|
| 312 |
-
" return not any(marker in lower_text for marker in bad_markers)\n",
|
| 313 |
-
"\n",
|
| 314 |
-
" def extract_retry_wait_seconds(error_text: str) -> float:\n",
|
| 315 |
-
" match = re.search(r\"try again in\\s*(\\d+)ms\", error_text, re.IGNORECASE)\n",
|
| 316 |
-
" if match:\n",
|
| 317 |
-
" return int(match.group(1)) / 1000.0 + 0.2\n",
|
| 318 |
-
" return 1.5\n",
|
| 319 |
-
"\n",
|
| 320 |
-
" def process_one(idx, title_text, prompt_text, human_text):\n",
|
| 321 |
-
" local_client = Groq(api_key=api_key)\n",
|
| 322 |
-
"\n",
|
| 323 |
-
" for attempt in range(max_retries + 1):\n",
|
| 324 |
-
" try:\n",
|
| 325 |
-
" completion = local_client.chat.completions.create(\n",
|
| 326 |
-
" model=model,\n",
|
| 327 |
-
" messages=[{\"role\": \"user\", \"content\": build_prompt(str(prompt_text))}],\n",
|
| 328 |
-
" temperature=0.2,\n",
|
| 329 |
-
" max_tokens=500,\n",
|
| 330 |
-
" )\n",
|
| 331 |
-
" ai_text = completion.choices[0].message.content.strip()\n",
|
| 332 |
-
"\n",
|
| 333 |
-
" if not is_valid_ai_text(ai_text):\n",
|
| 334 |
-
" if attempt < max_retries:\n",
|
| 335 |
-
" continue\n",
|
| 336 |
-
" return {\n",
|
| 337 |
-
" \"idx\": idx,\n",
|
| 338 |
-
" \"ok\": False,\n",
|
| 339 |
-
" \"reason\": \"invalid_or_error_text\",\n",
|
| 340 |
-
" \"ai_text\": ai_text,\n",
|
| 341 |
-
" }\n",
|
| 342 |
-
"\n",
|
| 343 |
-
" return {\n",
|
| 344 |
-
" \"idx\": idx,\n",
|
| 345 |
-
" \"ok\": True,\n",
|
| 346 |
-
" \"title\": str(title_text),\n",
|
| 347 |
-
" \"human_text\": str(human_text),\n",
|
| 348 |
-
" \"ai_generated_text\": ai_text,\n",
|
| 349 |
-
" }\n",
|
| 350 |
-
" except Exception as error:\n",
|
| 351 |
-
" error_text = str(error)\n",
|
| 352 |
-
" is_rate_limited = (\n",
|
| 353 |
-
" \"rate_limit_exceeded\" in error_text.lower()\n",
|
| 354 |
-
" or \"rate limit reached\" in error_text.lower()\n",
|
| 355 |
-
" or \"429\" in error_text\n",
|
| 356 |
-
" )\n",
|
| 357 |
-
"\n",
|
| 358 |
-
" if is_rate_limited and attempt < max_retries:\n",
|
| 359 |
-
" wait_seconds = extract_retry_wait_seconds(error_text)\n",
|
| 360 |
-
" print(\n",
|
| 361 |
-
" f\"Row {idx} rate-limited, retry {attempt + 1}/{max_retries} after {wait_seconds:.2f}s\"\n",
|
| 362 |
-
" )\n",
|
| 363 |
-
" time.sleep(wait_seconds)\n",
|
| 364 |
-
" continue\n",
|
| 365 |
-
"\n",
|
| 366 |
-
" return {\n",
|
| 367 |
-
" \"idx\": idx,\n",
|
| 368 |
-
" \"ok\": False,\n",
|
| 369 |
-
" \"reason\": error_text,\n",
|
| 370 |
-
" \"ai_text\": \"\",\n",
|
| 371 |
-
" }\n",
|
| 372 |
-
"\n",
|
| 373 |
-
" rows = list(working_df[[title_col, prompt_col, human_col]].itertuples(index=True, name=None))\n",
|
| 374 |
-
" total = len(rows)\n",
|
| 375 |
-
"\n",
|
| 376 |
-
" for start in range(0, total, requests_per_second):\n",
|
| 377 |
-
" window = rows[start : start + requests_per_second]\n",
|
| 378 |
-
" tick_start = time.time()\n",
|
| 379 |
-
"\n",
|
| 380 |
-
" with ThreadPoolExecutor(max_workers=max_workers) as executor:\n",
|
| 381 |
-
" futures = {\n",
|
| 382 |
-
" executor.submit(process_one, idx, title_text, prompt_text, human_text): idx\n",
|
| 383 |
-
" for idx, title_text, prompt_text, human_text in window\n",
|
| 384 |
-
" }\n",
|
| 385 |
-
"\n",
|
| 386 |
-
" for future in as_completed(futures):\n",
|
| 387 |
-
" out = future.result()\n",
|
| 388 |
-
" if out[\"ok\"]:\n",
|
| 389 |
-
" # Save as id + ai_gen only\n",
|
| 390 |
-
" results.append({\n",
|
| 391 |
-
" \"id\": out[\"idx\"],\n",
|
| 392 |
-
" \"ai_gen\": out[\"ai_generated_text\"]\n",
|
| 393 |
-
" })\n",
|
| 394 |
-
" print(\n",
|
| 395 |
-
" f\"Row {out['idx']}: generated {len(out['ai_generated_text'].split())} words\"\n",
|
| 396 |
-
" )\n",
|
| 397 |
-
" else:\n",
|
| 398 |
-
" print(f\"Row {out['idx']} skipped: {out['reason']}\")\n",
|
| 399 |
-
"\n",
|
| 400 |
-
" if len(results) >= 10:\n",
|
| 401 |
-
" pd.DataFrame(results)[[\"id\", \"ai_gen\"]].to_csv(\n",
|
| 402 |
-
" output_file,\n",
|
| 403 |
-
" index=False,\n",
|
| 404 |
-
" mode=\"a\",\n",
|
| 405 |
-
" header=not os.path.exists(output_file),\n",
|
| 406 |
-
" )\n",
|
| 407 |
-
" print(f\"Saved {len(results)} valid rows to {output_file}\")\n",
|
| 408 |
-
" results = []\n",
|
| 409 |
-
"\n",
|
| 410 |
-
" elapsed = time.time() - tick_start\n",
|
| 411 |
-
" if elapsed < 1:\n",
|
| 412 |
-
" time.sleep(1 - elapsed)\n",
|
| 413 |
-
"\n",
|
| 414 |
-
" if results:\n",
|
| 415 |
-
" pd.DataFrame(results)[[\"id\", \"ai_gen\"]].to_csv(\n",
|
| 416 |
-
" output_file,\n",
|
| 417 |
-
" index=False,\n",
|
| 418 |
-
" mode=\"a\",\n",
|
| 419 |
-
" header=not os.path.exists(output_file),\n",
|
| 420 |
-
" )\n",
|
| 421 |
-
"\n",
|
| 422 |
-
" print(f\"Finished. Output saved to {output_file}\")"
|
| 423 |
-
]
|
| 424 |
-
},
|
| 425 |
-
{
|
| 426 |
-
"cell_type": "code",
|
| 427 |
-
"execution_count": null,
|
| 428 |
-
"id": "357ccb81",
|
| 429 |
-
"metadata": {},
|
| 430 |
-
"outputs": [],
|
| 431 |
-
"source": []
|
| 432 |
-
}
|
| 433 |
-
],
|
| 434 |
-
"metadata": {
|
| 435 |
-
"kernelspec": {
|
| 436 |
-
"display_name": "ml",
|
| 437 |
-
"language": "python",
|
| 438 |
-
"name": "python3"
|
| 439 |
-
},
|
| 440 |
-
"language_info": {
|
| 441 |
-
"codemirror_mode": {
|
| 442 |
-
"name": "ipython",
|
| 443 |
-
"version": 3
|
| 444 |
-
},
|
| 445 |
-
"file_extension": ".py",
|
| 446 |
-
"mimetype": "text/x-python",
|
| 447 |
-
"name": "python",
|
| 448 |
-
"nbconvert_exporter": "python",
|
| 449 |
-
"pygments_lexer": "ipython3",
|
| 450 |
-
"version": "3.11.14"
|
| 451 |
-
}
|
| 452 |
-
},
|
| 453 |
-
"nbformat": 4,
|
| 454 |
-
"nbformat_minor": 5
|
| 455 |
-
}
|