Commit 690f2c5 by Pujan-Dev · 1 Parent(s): 9b781c6

Added: support for the latest model

.gitignore CHANGED
@@ -17,7 +17,8 @@ __pycache__/
 # ---- Jupyter / IPython ----
 .ipynb_checkpoints/
 *.ipynb
-
+notebook/
+*.csv
 # ---- Model & Data Artifacts ----
 *.pth
 *.pt
config.py CHANGED
@@ -28,7 +28,7 @@ class Config:
     REAL_FORGED_MODEL_LOCAL_PATH = os.getenv("REAL_FORGED_MODEL_LOCAL_PATH", "Model/real_forged/fft_cnn_model_78.pth")
     DOCUMENT_FORGERY_MODEL_PATH = os.getenv(
         "DOCUMENT_FORGERY_MODEL_PATH",
-        "features/Model/document_forgery/ela_cnn_model.pth",
+        "features/Model/document_forgery/pixel_forgery_v3_best.pth",
     )
     # Decision thresholds for document forgery detector (probabilities in 0..1)
     DOCUMENT_FORGERY_POSSIBLE_LOW = float(os.getenv("DOCUMENT_FORGERY_POSSIBLE_LOW", "0.40"))
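Reviewer note: this hunk changes only the default checkpoint path. Any deployment that already sets `DOCUMENT_FORGERY_MODEL_PATH` in its environment will keep loading whatever file that variable names. A minimal sketch of the lookup, assuming nothing beyond the `os.getenv` pattern visible in the diff; the helpers `resolve_model_path` and `read_threshold` are hypothetical and are not part of `config.py`, and the 0..1 range check is an addition suggested by the threshold comment:

```python
import os

# New default introduced by this commit.
DEFAULT_MODEL_PATH = "features/Model/document_forgery/pixel_forgery_v3_best.pth"


def resolve_model_path(env=None):
    """Return the environment override if present, else the new default."""
    env = os.environ if env is None else env
    return env.get("DOCUMENT_FORGERY_MODEL_PATH", DEFAULT_MODEL_PATH)


def read_threshold(name, default, env=None):
    """Parse a probability threshold and reject values outside 0..1."""
    env = os.environ if env is None else env
    value = float(env.get(name, default))
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value}")
    return value
```

Passing a plain dict as `env` makes the behaviour easy to check without touching the real environment.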
notebook/ai_vs_human/final_archi.md DELETED
@@ -1,426 +0,0 @@
1
- # AI vs Human Text Detector V3 - Final Architecture Summary
2
- dataset = "Pujan-Dev/english_aivshuman"
3
- **Model Version**: V3
4
- **Type**: Hybrid Feature Engineering + TF-IDF Classifier
5
- **Output Directory**: `./v3_model/`
6
- **Date**: March 2026
7
-
8
- ---
9
-
10
- ## 📊 Overview
11
-
12
- The V3 model is a **non-transformer, feature-based ML classifier** that distinguishes between AI-generated and human-written text using a hybrid approach combining engineered linguistic features with TF-IDF text representations.
13
-
- ```
- ┌─────────────┐
- │ Input Text  │
- └──────┬──────┘
-
-        ├──────────────────────────────────┐
-        │                                  │
-        ▼                                  ▼
- ┌──────────────────┐            ┌─────────────────┐
- │  Text Features   │            │   Engineered    │
- │     (TF-IDF)     │            │    Features     │
- │                  │            │  (16 features)  │
- │ • Word (1-2gram) │            │                 │
- │ • Char (3-5gram) │            │ • Perplexity    │
- │                  │            │ • Burstiness    │
- │ Max 200k features│            │ • Stylometry    │
- └────────┬─────────┘            └─────────┬───────┘
-          │                                │
-          │        ┌───────────────┐       │
-          └───────►│ StandardScaler│◄──────┘
-                   └───────┬───────┘
-
-                   ┌───────▼──────────┐
-                   │   Sparse Matrix  │
-                   │   Concat (hstack)│
-                   └───────┬──────────┘
-
-                   ┌───────▼────────┐
-                   │    Logistic    │
-                   │   Regression   │
-                   │  (GridSearchCV)│
-                   └───────┬────────┘
-
-                   ┌───────▼────────┐
-                   │   Prediction   │
-                   │ (Human vs AI)  │
-                   └────────────────┘
- ```
52
-
53
- ---
54
-
55
- ## 🏗️ Architecture Components
56
-
57
- ### 1. **Data Loading**
58
-
59
- **Function**: `load_dataset_recursive(max_samples=20000)`
60
-
61
- - **Source**: Recursively scans `./DATASET/` folder
62
- - **Formats Supported**: `.jsonl`, `.json`, `.csv`
63
- - **Schema Support**:
64
- - Schema 1: `human_text` + `ai_text` columns
65
- - Schema 2: `text` + `label`/`ai_gen` columns
66
- - **Labels**:
67
- - `0` = Human text
68
- - `1` = AI-generated text
69
- - **Preprocessing**: Text normalization (whitespace cleanup)
70
- - **Max Samples**: 20,000 (configurable)
71
- - **Random State**: 42
72
-
73
- ---
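Reviewer note: the two schemas listed in the Data Loading section can be flattened into one `(text, label)` list. The sketch below is an illustration only, not the deleted `load_dataset_recursive` function; the helper names are hypothetical, and it assumes labels in Schema 2 are already numeric (0 or 1) and that "whitespace cleanup" means collapsing runs of whitespace.

```python
import json


def rows_to_examples(rows):
    """Flatten rows from either schema into (text, label) pairs; 0 = human, 1 = AI."""
    examples = []
    for row in rows:
        if "human_text" in row or "ai_text" in row:
            # Schema 1: paired columns, one example per non-empty column.
            for column, label in (("human_text", 0), ("ai_text", 1)):
                text = row.get(column)
                if text and text.strip():
                    examples.append((" ".join(text.split()), label))
        elif "text" in row:
            # Schema 2: explicit label stored in `label` or `ai_gen`.
            raw = row.get("label", row.get("ai_gen"))
            text = row["text"]
            if raw is not None and text and text.strip():
                examples.append((" ".join(text.split()), int(raw)))
    return examples


def load_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]
```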
74
-
75
- ### 2. **Feature Extraction Pipeline**
76
-
77
- The model extracts **3 types of features** in parallel:
78
-
79
- #### 2.1 **Perplexity Features** (1 feature)
80
-
81
- **Model**: `distilgpt2` (Hugging Face Transformers)
82
-
83
- ```python
84
- class PerplexityCalculator:
85
- - Model: distilgpt2
86
- - Max Length: 512 tokens
87
- - Metric: exp(cross_entropy_loss)
88
- - Cap: 10,000 (outlier protection)
89
- - Fallback: 100.0 on error
90
- ```
91
-
92
- **What it measures**: Language model surprise/naturalness
93
- - Lower perplexity → More predictable (often AI)
94
- - Higher perplexity → Less predictable (often human)
95
-
96
- ---
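Reviewer note: the metric described above, `exp(cross_entropy_loss)` with a cap of 10,000, can be shown without loading `distilgpt2`. This sketch takes per-token negative log-likelihoods as input; the function name is hypothetical, and returning the 100.0 fallback for empty input is an assumption about what "on error" covers.

```python
import math


def perplexity_from_nll(token_nlls, cap=10_000.0, fallback=100.0):
    """exp(mean negative log-likelihood), capped; fallback when there are no tokens."""
    if not token_nlls:
        return fallback
    mean_nll = sum(token_nlls) / len(token_nlls)
    try:
        return min(math.exp(mean_nll), cap)
    except OverflowError:
        # math.exp overflows for very large losses; treat that as the cap.
        return cap
```

A model that assigns every token probability 1/4 has a mean loss of `log(4)` and therefore a perplexity of 4.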
97
-
98
- #### 2.2 **Burstiness Features** (5 features)
99
-
100
- Measures sentence length variation patterns.
101
-
102
- **Features**:
103
- 1. `burst_mean` - Average sentence length (words)
104
- 2. `burst_std` - Standard deviation of sentence lengths
105
- 3. `burst_max` - Maximum sentence length
106
- 4. `burst_min` - Minimum sentence length
107
- 5. `burst_range` - Range (max - min)
108
-
109
- **Theory**: Human writing has more variation in sentence length (high burstiness), while AI text tends to be more uniform.
110
-
111
- ---
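Reviewer note: a standard-library sketch of the five burstiness features listed above. The deleted document names `nltk` for sentence tokenization; the regex splitter here is a simplified stand-in, and the use of the population standard deviation is an assumption, since the document does not say which variant it used.

```python
import re
import statistics

FEATURE_NAMES = ("burst_mean", "burst_std", "burst_max", "burst_min", "burst_range")


def burstiness_features(text):
    """Sentence-length statistics in words; all zeros when no sentence is found."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        return dict.fromkeys(FEATURE_NAMES, 0.0)
    return {
        "burst_mean": statistics.fmean(lengths),
        "burst_std": statistics.pstdev(lengths),
        "burst_max": float(max(lengths)),
        "burst_min": float(min(lengths)),
        "burst_range": float(max(lengths) - min(lengths)),
    }
```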
112
-
113
- #### 2.3 **Stylometry Features** (10 features)
114
-
115
- Writing style and readability metrics.
116
-
117
- **Features**:
118
- 1. `num_words` - Total word count
119
- 2. `num_chars` - Total character count
120
- 3. `num_sentences` - Total sentence count
121
- 4. `avg_word_len` - Average word length
122
- 5. `avg_sent_len` - Average sentence length
123
- 6. `lexical_diversity` - Unique words / total words
124
- 7. `punct_ratio` - Punctuation density
125
- 8. `caps_ratio` - Capitalization ratio
126
- 9. `flesch_reading` - Flesch Reading Ease score
127
- 10. `flesch_grade` - Flesch-Kincaid Grade Level
128
-
129
- **Library**: `textstat` + `nltk`
130
-
131
- ---
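Reviewer note: six of the ten stylometry features can be computed with the standard library alone, as sketched below. The exact definitions are assumptions: the deleted document does not state the denominators for `punct_ratio` and `caps_ratio` (per character here) or how words are tokenized (whitespace split with surrounding punctuation stripped here). The readability scores come from `textstat` in the original and are left out.

```python
import string


def basic_stylometry(text):
    """Count-based style features; ratios are per character (an assumption)."""
    words = text.split()
    num_chars = len(text)
    # Lowercase and strip surrounding punctuation before measuring vocabulary.
    tokens = [w for w in (w.strip(string.punctuation).lower() for w in words) if w]
    return {
        "num_words": len(words),
        "num_chars": num_chars,
        "avg_word_len": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
        "lexical_diversity": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "punct_ratio": sum(c in string.punctuation for c in text) / num_chars if num_chars else 0.0,
        "caps_ratio": sum(c.isupper() for c in text) / num_chars if num_chars else 0.0,
    }
```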
132
-
133
- ### 3. **TF-IDF Vectorization**
134
-
135
- #### 3.1 **Word-Level TF-IDF**
136
-
137
- ```python
138
- TfidfVectorizer(
139
- analyzer="word",
140
- ngram_range=(1, 2), # Unigrams + bigrams
141
- min_df=3, # Minimum document frequency
142
- max_df=0.98, # Maximum document frequency
143
- max_features=120000, # Cap at 120k features
144
- sublinear_tf=True # log(tf) scaling
145
- )
146
- ```
147
-
148
- **Output**: Sparse matrix of word/phrase importance scores
149
-
150
- ---
151
-
152
- #### 3.2 **Character-Level TF-IDF**
153
-
154
- ```python
155
- TfidfVectorizer(
156
- analyzer="char_wb", # Character n-grams (word boundaries)
157
- ngram_range=(3, 5), # 3-char to 5-char sequences
158
- min_df=3,
159
- max_df=0.98,
160
- max_features=80000, # Cap at 80k features
161
- sublinear_tf=True
162
- )
163
- ```
164
-
165
- **Purpose**: Captures sub-word patterns and stylistic signatures
166
-
167
- ---
168
-
169
- ### 4. **Feature Preprocessing**
170
-
171
- **Engineered Features**:
172
- - Scaled using `StandardScaler` (z-score normalization)
173
- - Converted to sparse CSR matrix for memory efficiency
174
-
175
- **Hybrid Feature Vector**:
176
- ```python
177
- hybrid_vec = hstack([word_tfidf, char_tfidf, engineered_features_scaled])
178
- ```
179
-
180
- **Final Feature Dimensionality**:
181
- - Word TF-IDF: Up to 120,000 features
182
- - Char TF-IDF: Up to 80,000 features
183
- - Engineered: 16 features
184
- - **Total**: Up to ~200,016 features (sparse)
185
-
186
- ---
187
-
188
- ### 5. **Model Training**
189
-
190
- #### 5.1 **Train-Test Split**
191
- ```python
192
- train_size: 80% (16,000 samples)
193
- test_size: 20% (4,000 samples)
194
- stratified: Yes (balanced across classes)
195
- random_state: 42
196
- ```
197
-
198
- #### 5.2 **Classifier**
199
-
200
- **Algorithm**: Logistic Regression
201
-
202
- **Hyperparameter Tuning**: GridSearchCV with 3-fold stratified cross-validation
203
-
204
- **Search Space**:
205
- ```python
206
- {
207
- "C": [0.5, 1.0, 2.0, 4.0], # Regularization strength
208
- "class_weight": [None, "balanced"], # Class balancing
209
- "solver": "saga", # Stochastic Average Gradient
210
- "penalty": "l2", # L2 regularization
211
- "max_iter": 2500,
212
- "n_jobs": -1 # Parallel processing
213
- }
214
- ```
215
-
216
- **Scoring Metric**: F1 Score (balanced for precision/recall)
217
-
218
- ---
219
-
220
- ### 6. **Model Evaluation**
221
-
222
- **Metrics Tracked**:
223
- - **Accuracy**: Overall correct predictions
224
- - **F1 Score**: Harmonic mean of precision/recall
225
- - **ROC-AUC**: Area under ROC curve
226
- - **Confusion Matrix**: True/false positives/negatives
227
- - **Classification Report**: Per-class precision/recall/F1
228
-
229
- **Visualizations**:
230
- 1. Confusion Matrix
231
- 2. ROC Curve
232
- 3. Feature Importance (top engineered features)
233
- 4. Perplexity Distribution (Human vs AI)
234
- 5. Lexical Diversity Distribution
235
- 6. Burstiness STD Distribution
236
-
237
- ---
238
-
239
- ### 7. **Model Persistence**
240
-
241
- **Output Directory**: `./v3_model/`
242
-
243
- **Saved Artifacts**:
244
-
245
- | File | Description |
246
- |------|-------------|
247
- | `classifier.pkl` | Trained Logistic Regression model |
248
- | `scaler.pkl` | StandardScaler for engineered features |
249
- | `word_vectorizer.pkl` | Word-level TF-IDF vectorizer |
250
- | `char_vectorizer.pkl` | Character-level TF-IDF vectorizer |
251
- | `feature_names.json` | List of engineered feature names (16 features) |
252
- | `metadata.json` | Model performance metrics & configuration |
253
-
254
- **Metadata Contents**:
255
- ```json
256
- {
257
- "selected_model": "hybrid_tfidf_logistic",
258
- "cv_best_f1": 0.xxxx,
259
- "num_engineered_features": 16,
260
- "num_word_tfidf_features": 120000,
261
- "num_char_tfidf_features": 80000,
262
- "train_samples": 16000,
263
- "test_samples": 4000,
264
- "train_accuracy": 0.xxxx,
265
- "train_f1": 0.xxxx,
266
- "test_accuracy": 0.xxxx,
267
- "test_f1": 0.xxxx
268
- }
269
- ```
270
-
271
- ---
272
-
273
- ### 8. **Inference Pipeline**
274
-
275
- **Function**: `predict_v3(text: str) -> dict`
276
-
277
- **Process**:
278
- ```python
279
- 1. Normalize text (whitespace cleanup)
280
- 2. Extract engineered features (16 features)
281
- 3. Scale engineered features (StandardScaler)
282
- 4. Generate word TF-IDF vector
283
- 5. Generate char TF-IDF vector
284
- 6. Concatenate all features (sparse matrix)
285
- 7. Predict with Logistic Regression
286
- 8. Return prediction + probabilities + features
287
- ```
288
-
289
- **Output Schema**:
290
- ```python
291
- {
292
- "text": str, # Truncated input (100 chars)
293
- "word_count": int, # Number of words
294
- "predicted_label": int, # 0=Human, 1=AI
295
- "predicted_name": str, # "human" or "ai"
296
- "probability_human": float, # P(Human) [0-1]
297
- "probability_ai": float, # P(AI) [0-1]
298
- "features": dict # All 16 engineered features
299
- }
300
- ```
301
-
302
- **Batch Function**: `predict_v3_batch(texts: list[str]) -> list[dict]`
303
-
304
- ---
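Reviewer note: the output schema above can be assembled from a single `P(AI)` value. This is a sketch of the documented dictionary shape only, not the deleted `predict_v3`; the function name `format_prediction` is hypothetical and the 0.5 decision threshold is an assumption.

```python
def format_prediction(text, prob_ai, features):
    """Assemble the documented output schema from P(AI) and a feature dict."""
    label = int(prob_ai >= 0.5)  # assumed decision threshold
    return {
        "text": text[:100],                  # truncated input
        "word_count": len(text.split()),
        "predicted_label": label,            # 0 = human, 1 = AI
        "predicted_name": "ai" if label else "human",
        "probability_human": 1.0 - prob_ai,
        "probability_ai": prob_ai,
        "features": dict(features),
    }
```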
305
-
306
- ## 🔧 Configuration
307
-
308
- ```python
309
- @dataclass
310
- class V3Config:
311
- max_samples: int = 20000 # Max training samples
312
- test_size: float = 0.2 # Test split ratio
313
- output_dir: str = "./v3_model" # Model save directory
314
- random_state: int = 42 # Reproducibility seed
315
- cv_folds: int = 3 # Cross-validation folds
316
- ```
317
-
318
- ---
319
-
320
- ## 📦 Dependencies
321
-
322
- **Core Libraries**:
323
- - `scikit-learn` - ML algorithms, TF-IDF, metrics
324
- - `pandas` - Data manipulation
325
- - `numpy` - Numerical operations
326
- - `scipy` - Sparse matrix operations
327
-
328
- **Feature Extraction**:
329
- - `transformers` - DistilGPT2 for perplexity
330
- - `torch` - PyTorch backend for transformers
331
- - `nltk` - Sentence tokenization (`punkt_tab`)
332
- - `textstat` - Readability metrics
333
-
334
- **Visualization**:
335
- - `matplotlib` - Plotting
336
- - `seaborn` - Statistical visualizations
337
-
338
- ---
339
-
340
- ## 🎯 Key Design Decisions
341
-
342
- ### Why Not Transformers?
343
- 1. **Speed**: No GPU required, fast inference
344
- 2. **Interpretability**: Explainable features
345
- 3. **Efficiency**: Smaller model size (~500MB vs 5GB+)
346
- 4. **Robustness**: Works on any text length
347
-
348
- ### Why Hybrid Features?
349
- 1. **TF-IDF**: Captures content and vocabulary patterns
350
- 2. **Perplexity**: Measures language model naturalness
351
- 3. **Burstiness**: Detects sentence variation patterns
352
- 4. **Stylometry**: Analyzes writing style signatures
353
-
354
- ### Why Logistic Regression?
355
- 1. **Scalability**: Handles 200k+ sparse features efficiently
356
- 2. **Speed**: Fast training and inference
357
- 3. **Interpretability**: Clear feature importance via coefficients
358
- 4. **Robustness**: Well-suited for high-dimensional sparse data
359
-
360
- ---
361
-
362
- ## 📈 Expected Performance
363
-
364
- **Typical Results** (20k samples):
365
- - **Test Accuracy**: 85-95%
366
- - **Test F1 Score**: 0.85-0.95
367
- - **Inference Speed**: ~50-100 texts/second (CPU)
368
- - **Model Size**: ~500 MB total
369
-
370
- **Best For**:
371
- - ✅ General English text classification
372
- - ✅ Articles, essays, reviews
373
- - ✅ Medium to long texts (50+ words)
374
-
375
- **Limitations**:
376
- - ⚠️ Very short texts (<10 words) may be unreliable
377
- - ⚠️ Perplexity calculation is the bottleneck (uses GPU if available)
378
- - ⚠️ Domain-specific jargon may affect performance
379
- - ⚠️ Non-English text requires retraining
380
-
381
- ---
382
-
383
- ## 🔄 Model Loading Example
384
-
385
- ```python
386
- from pathlib import Path
387
- import pickle
388
- import json
389
-
390
- model_dir = Path("./v3_model")
391
-
392
- # Load all artifacts
393
- classifier = pickle.load(open(model_dir / "classifier.pkl", "rb"))
394
- scaler = pickle.load(open(model_dir / "scaler.pkl", "rb"))
395
- word_vectorizer = pickle.load(open(model_dir / "word_vectorizer.pkl", "rb"))
396
- char_vectorizer = pickle.load(open(model_dir / "char_vectorizer.pkl", "rb"))
397
- feature_names = json.load(open(model_dir / "feature_names.json", "r"))
398
- metadata = json.load(open(model_dir / "metadata.json", "r"))
399
-
400
- # Use predict_v3() function for inference
401
- result = predict_v3("Your text here...")
402
- ```
403
-
404
- ---
405
-
406
- ## 💡 Future Improvements
407
-
408
- 1. **Model Versioning**: Add versioning system for model updates
409
- 2. **Confidence Thresholds**: Flag uncertain predictions
410
- 3. **Batch Optimization**: Vectorized batch inference
411
- 4. **Model Wrapper Class**: Encapsulate all logic in `AIPredictorV3` class
412
- 5. **Perplexity Caching**: Cache calculations for faster inference
413
- 6. **Ensemble Methods**: Combine multiple models for better accuracy
414
- 7. **Active Learning**: Iterative retraining with user feedback
415
- 8. **Multi-language Support**: Train separate models per language
416
-
417
- ---
418
-
419
- ## 📝 Citation & Credits
420
-
421
- **Framework**: scikit-learn + HuggingFace Transformers
422
- **Perplexity Model**: DistilGPT2 (OpenAI/Hugging Face)
423
- **Readability Metrics**: textstat library
424
-
425
-
426
- **Architecture Type**: Hybrid Feature Engineering + Logistic Regression
notebook/ai_vs_human/main.ipynb DELETED
@@ -1,1110 +0,0 @@
1
- {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "id": "e522047b",
6
- "metadata": {},
7
- "source": [
8
- "# AI vs Human Text Detector using BERT\n",
9
- "Using google-bert/bert-base-cased with HC3 dataset or local data (~20k samples)"
10
- ]
11
- },
12
- {
13
- "cell_type": "code",
14
- "execution_count": 35,
15
- "id": "16eddd36",
16
- "metadata": {},
17
- "outputs": [],
18
- "source": [
19
- "from functools import partial\n",
20
- "\n",
21
- "import datasets\n",
22
- "from datasets import Dataset, DatasetDict, concatenate_datasets\n",
23
- "import evaluate\n",
24
- "import numpy as np\n",
25
- "import torch\n",
26
- "from transformers import (\n",
27
- " AutoModelForSequenceClassification,\n",
28
- " AutoTokenizer,\n",
29
- " PreTrainedTokenizer,\n",
30
- " BatchEncoding,\n",
31
- " DataCollatorWithPadding,\n",
32
- " Trainer,\n",
33
- " TrainingArguments,\n",
34
- ")\n",
35
- "from peft import LoraConfig, get_peft_model"
36
- ]
37
- },
38
- {
39
- "cell_type": "markdown",
40
- "id": "99bca750",
41
- "metadata": {},
42
- "source": [
43
- "## Load AI Detection Dataset (~20k samples)"
44
- ]
45
- },
46
- {
47
- "cell_type": "code",
48
- "execution_count": 36,
49
- "id": "2945f87a",
50
- "metadata": {},
51
- "outputs": [],
52
- "source": [
53
- "def get_raid_dataset(max_samples: int = 20000, use_local: bool = True) -> DatasetDict:\n",
54
- " \"\"\"Load AI detection dataset and limit to ~20k samples\"\"\"\n",
55
- " \n",
56
- " print(\"Loading AI vs Human text dataset...\")\n",
57
- " \n",
58
- " all_texts = []\n",
59
- " all_labels = []\n",
60
- " \n",
61
- " # Try loading HC3 dataset (Human ChatGPT Comparison Corpus)\n",
62
- " try:\n",
63
- " print(\"Attempting to load HC3 dataset...\")\n",
64
- " dataset = datasets.load_dataset(\"Hello-SimpleAI/HC3\", \"all\", split=\"train\")\n",
65
- " \n",
66
- " # HC3 format: has 'question', 'human_answers', 'chatgpt_answers'\n",
67
- " for item in dataset:\n",
68
- " # Add human answers\n",
69
- " if 'human_answers' in item and item['human_answers']:\n",
70
- " for answer in item['human_answers'][:1]: # Take first answer\n",
71
- " if answer and len(answer.strip()) > 0:\n",
72
- " all_texts.append(answer)\n",
73
- " all_labels.append(0) # 0 for human\n",
74
- " \n",
75
- " # Add AI answers\n",
76
- " if 'chatgpt_answers' in item and item['chatgpt_answers']:\n",
77
- " for answer in item['chatgpt_answers'][:1]: # Take first answer\n",
78
- " if answer and len(answer.strip()) > 0:\n",
79
- " all_texts.append(answer)\n",
80
- " all_labels.append(1) # 1 for AI\n",
81
- " \n",
82
- " print(f\"✓ Loaded {len(all_texts)} samples from HC3 dataset\")\n",
83
- " except Exception as e:\n",
84
- " print(f\"⚠ Could not load HC3 dataset: {e}\")\n",
85
- " \n",
86
- " # Load local data and combine\n",
87
- " if use_local:\n",
88
- " try:\n",
89
- " print(\"Loading local dataset...\")\n",
90
- " import pandas as pd\n",
91
- " df = pd.read_json(\"./DATASET/basic_Data.jsonl\", lines=True)\n",
92
- " \n",
93
- " # Build a proper binary classification dataset: human_text -> 0, ai_text -> 1\n",
94
- " if {\"human_text\", \"ai_text\"}.issubset(df.columns):\n",
95
- " local_texts = list(df[\"human_text\"].dropna()) + list(df[\"ai_text\"].dropna())\n",
96
- " local_labels = [0] * len(df[\"human_text\"].dropna()) + [1] * len(df[\"ai_text\"].dropna())\n",
97
- " \n",
98
- " all_texts.extend(local_texts)\n",
99
- " all_labels.extend(local_labels)\n",
100
- " \n",
101
- " print(f\"✓ Loaded {len(local_texts)} samples from local data\")\n",
102
- " else:\n",
103
- " print(\"⚠ Local dataset doesn't have required columns\")\n",
104
- " except Exception as e:\n",
105
- " print(f\"⚠ Could not load local dataset: {e}\")\n",
106
- " \n",
107
- " # Check if we have any data\n",
108
- " if len(all_texts) == 0:\n",
109
- " raise ValueError(\"No data loaded! Check HC3 dataset or local data availability\")\n",
110
- " \n",
111
- " # Create combined dataset\n",
112
- " combined_dataset = Dataset.from_dict({\n",
113
- " \"text\": all_texts,\n",
114
- " \"label\": all_labels\n",
115
- " })\n",
116
- " \n",
117
- " print(f\"Total combined samples: {len(combined_dataset)}\")\n",
118
- " \n",
119
- " # Shuffle and limit to max_samples\n",
120
- " combined_dataset = combined_dataset.shuffle(seed=42)\n",
121
- " if len(combined_dataset) > max_samples:\n",
122
- " combined_dataset = combined_dataset.select(range(max_samples))\n",
123
- " print(f\"Limited to {max_samples} samples\")\n",
124
- " \n",
125
- " # Filter out empty texts\n",
126
- " combined_dataset = combined_dataset.filter(lambda x: x['text'] is not None and len(x['text'].strip()) > 0)\n",
127
- " \n",
128
- " # Split into train/test (95/5 split)\n",
129
- " dataset_split = combined_dataset.train_test_split(test_size=0.05, seed=42)\n",
130
- " \n",
131
- " print(f\"\\n✓ Dataset ready!\")\n",
132
- " print(f\" Train samples: {len(dataset_split['train'])}\")\n",
133
- " print(f\" Test samples: {len(dataset_split['test'])}\")\n",
134
- " \n",
135
- " # Check label distribution\n",
136
- " import numpy as np\n",
137
- " train_labels = np.array(dataset_split['train']['label'])\n",
138
- " print(f\" Label distribution (train):\")\n",
139
- " print(f\" Human (0): {(train_labels == 0).sum()}\")\n",
140
- " print(f\" AI (1): {(train_labels == 1).sum()}\")\n",
141
- " \n",
142
- " return dataset_split"
143
- ]
144
- },
145
- {
146
- "cell_type": "code",
147
- "execution_count": 37,
148
- "id": "38d8478c",
149
- "metadata": {},
150
- "outputs": [
151
- {
152
- "name": "stdout",
153
- "output_type": "stream",
154
- "text": [
155
- "Loading AI vs Human text dataset...\n",
156
- "Attempting to load HC3 dataset...\n",
157
- "⚠ Could not load HC3 dataset: Dataset scripts are no longer supported, but found HC3.py\n",
158
- "Loading local dataset...\n",
159
- "✓ Loaded 19940 samples from local data\n",
160
- "Total combined samples: 19940\n"
161
- ]
162
- },
163
- {
164
- "name": "stderr",
165
- "output_type": "stream",
166
- "text": [
167
- "Filter: 100%|██████████| 19940/19940 [00:00<00:00, 95584.60 examples/s] \n"
168
- ]
169
- },
170
- {
171
- "name": "stdout",
172
- "output_type": "stream",
173
- "text": [
174
- "\n",
175
- "✓ Dataset ready!\n",
176
- " Train samples: 18943\n",
177
- " Test samples: 997\n",
178
- " Label distribution (train):\n",
179
- " Human (0): 9477\n",
180
- " AI (1): 9466\n"
181
- ]
182
- }
183
- ],
184
- "source": [
185
- "# Load dataset\n",
186
- "raw_datasets = get_raid_dataset(max_samples=20000)"
187
- ]
188
- },
189
- {
190
- "cell_type": "markdown",
191
- "id": "f60191f6",
192
- "metadata": {},
193
- "source": [
194
- "## Initialize Model and Tokenizer"
195
- ]
196
- },
197
- {
198
- "cell_type": "code",
199
- "execution_count": 38,
200
- "id": "315bb737",
201
- "metadata": {},
202
- "outputs": [
203
- {
204
- "name": "stderr",
205
- "output_type": "stream",
206
- "text": [
207
- "Loading weights: 100%|██████████| 199/199 [00:00<00:00, 1208.24it/s, Materializing param=bert.pooler.dense.weight] \n",
208
- "BertForSequenceClassification LOAD REPORT from: google-bert/bert-base-cased\n",
209
- "Key | Status | \n",
210
- "-------------------------------------------+------------+-\n",
211
- "cls.predictions.transform.LayerNorm.bias | UNEXPECTED | \n",
212
- "cls.seq_relationship.weight | UNEXPECTED | \n",
213
- "cls.predictions.transform.dense.weight | UNEXPECTED | \n",
214
- "cls.seq_relationship.bias | UNEXPECTED | \n",
215
- "cls.predictions.bias | UNEXPECTED | \n",
216
- "cls.predictions.transform.dense.bias | UNEXPECTED | \n",
217
- "cls.predictions.transform.LayerNorm.weight | UNEXPECTED | \n",
218
- "classifier.weight | MISSING | \n",
219
- "classifier.bias | MISSING | \n",
220
- "\n",
221
- "Notes:\n",
222
- "- UNEXPECTED\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\n",
223
- "- MISSING\t:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.\n"
224
- ]
225
- },
226
- {
227
- "name": "stdout",
228
- "output_type": "stream",
229
- "text": [
230
- "Model loaded: google-bert/bert-base-cased\n",
231
- "Device: cuda\n"
232
- ]
233
- }
234
- ],
235
- "source": [
236
- "# Use google-bert/bert-base-cased\n",
237
- "base_model_name = \"google-bert/bert-base-cased\"\n",
238
- "\n",
239
- "tokenizer = AutoTokenizer.from_pretrained(base_model_name)\n",
240
- "model = AutoModelForSequenceClassification.from_pretrained(\n",
241
- " base_model_name,\n",
242
- " num_labels=2,\n",
243
- ").to(device='cuda' if torch.cuda.is_available() else 'cpu')\n",
244
- "\n",
245
- "print(f\"Model loaded: {base_model_name}\")\n",
246
- "print(f\"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}\")"
247
- ]
248
- },
249
- {
250
- "cell_type": "markdown",
251
- "id": "0a772192",
252
- "metadata": {},
253
- "source": [
254
- "## Apply LoRA for Parameter-Efficient Fine-tuning"
255
- ]
256
- },
257
- {
258
- "cell_type": "code",
259
- "execution_count": 39,
260
- "id": "ba294e50",
261
- "metadata": {},
262
- "outputs": [
263
- {
264
- "name": "stdout",
265
- "output_type": "stream",
266
- "text": [
267
- "trainable params: 2,680,322 || all params: 110,992,132 || trainable%: 2.4149\n"
268
- ]
269
- }
270
- ],
271
- "source": [
272
- "peft_config = LoraConfig(\n",
273
- " r=16,\n",
274
- " target_modules=\"all-linear\",\n",
275
- " lora_alpha=16,\n",
276
- " bias=\"none\",\n",
277
- " lora_dropout=0.05,\n",
278
- " use_rslora=True,\n",
279
- " modules_to_save=[\"classifier\"],\n",
280
- ")\n",
281
- "\n",
282
- "model = get_peft_model(model, peft_config)\n",
283
- "model.print_trainable_parameters()"
284
- ]
285
- },
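Reviewer note: the logged figure `trainable params: 2,680,322` can be reproduced by hand. The arithmetic below assumes BERT-base dimensions (hidden size 768, feed-forward size 3072, 12 layers), that `target_modules="all-linear"` places an adapter on each encoder linear layer and on the pooler, and that `modules_to_save=["classifier"]` keeps a full trainable copy of the two-class head. Under those assumptions the count matches the log exactly.

```python
def lora_params(in_features, out_features, r):
    """A LoRA adapter adds A (r x in) and B (out x r): r * (in + out) weights."""
    return r * (in_features + out_features)


HIDDEN, INTERMEDIATE, LAYERS, R = 768, 3072, 12, 16  # BERT-base sizes; r from the cell

per_layer = (
    4 * lora_params(HIDDEN, HIDDEN, R)        # query, key, value, attention output
    + lora_params(HIDDEN, INTERMEDIATE, R)    # feed-forward up projection
    + lora_params(INTERMEDIATE, HIDDEN, R)    # feed-forward down projection
)
pooler = lora_params(HIDDEN, HIDDEN, R)
classifier = HIDDEN * 2 + 2                   # weights + biases of the 2-class head

trainable = LAYERS * per_layer + pooler + classifier
print(trainable)                                # 2680322
print(round(100 * trainable / 110_992_132, 4))  # 2.4149
```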
286
- {
287
- "cell_type": "markdown",
288
- "id": "3cf58dd8",
289
- "metadata": {},
290
- "source": [
291
- "## Preprocessing and Tokenization"
292
- ]
293
- },
294
- {
295
- "cell_type": "code",
296
- "execution_count": 40,
297
- "id": "c7992ba4",
298
- "metadata": {},
299
- "outputs": [
300
- {
301
- "name": "stderr",
302
- "output_type": "stream",
303
- "text": [
304
- "Map: 100%|██████████| 18943/18943 [00:01<00:00, 10132.04 examples/s]\n",
305
- "Map: 100%|██████████| 997/997 [00:00<00:00, 11498.07 examples/s]"
306
- ]
307
- },
308
- {
309
- "name": "stdout",
310
- "output_type": "stream",
311
- "text": [
312
- "Tokenization complete!\n",
313
- "Tensor columns: ['input_ids', 'attention_mask', 'token_type_ids', 'labels']\n"
314
- ]
315
- },
316
- {
317
- "name": "stderr",
318
- "output_type": "stream",
319
- "text": [
320
- "\n"
321
- ]
322
- }
323
- ],
324
- "source": [
325
- "def _preprocess_function(\n",
326
- " batch: dict,\n",
327
- " tokenizer: PreTrainedTokenizer,\n",
328
- " max_length: int = 512,\n",
329
- ") -> BatchEncoding:\n",
330
- " model_inputs = tokenizer(\n",
331
- " batch[\"text\"],\n",
332
- " max_length=max_length,\n",
333
- " truncation=True,\n",
334
- " )\n",
335
- " model_inputs[\"labels\"] = batch[\"label\"]\n",
336
- " return model_inputs\n",
337
- "\n",
338
- "\n",
339
- "preprocess_function = partial(_preprocess_function, tokenizer=tokenizer)\n",
340
- "tokenized_datasets = raw_datasets.map(\n",
341
- " preprocess_function,\n",
342
- " batched=True,\n",
343
- " remove_columns=[\"text\", \"label\"],\n",
344
- ")\n",
345
- "\n",
346
- "# Ensure PyTorch tensors and expected columns\n",
347
- "available_columns = tokenized_datasets[\"train\"].column_names\n",
348
- "tensor_columns = [\n",
349
- " column_name\n",
350
- " for column_name in [\"input_ids\", \"attention_mask\", \"token_type_ids\", \"labels\"]\n",
351
- " if column_name in available_columns\n",
352
- "]\n",
353
- "tokenized_datasets.set_format(type=\"torch\", columns=tensor_columns)\n",
354
- "\n",
355
- "print(\"Tokenization complete!\")\n",
356
- "print(\"Tensor columns:\", tensor_columns)"
357
- ]
358
- },
359
- {
360
- "cell_type": "markdown",
361
- "id": "31db700b",
362
- "metadata": {},
363
- "source": [
364
- "## Define Metrics"
365
- ]
366
- },
367
- {
368
- "cell_type": "code",
369
- "execution_count": 41,
370
- "id": "899e4408",
371
- "metadata": {},
372
- "outputs": [],
373
- "source": [
374
- "metric_accuracy = evaluate.load(\"accuracy\")\n",
375
- "metric_f1 = evaluate.load(\"f1\")\n",
376
- "\n",
377
- "\n",
378
- "def _compute_metrics(\n",
379
- " eval_pred: tuple[np.ndarray, np.ndarray],\n",
380
- " metric_accuracy: evaluate.EvaluationModule,\n",
381
- " metric_f1: evaluate.EvaluationModule,\n",
382
- ") -> dict[str, float]:\n",
383
- " predictions, labels = eval_pred\n",
384
- "\n",
385
- " if isinstance(predictions, tuple):\n",
386
- " predictions = predictions[0]\n",
387
- "\n",
388
- " predictions = np.argmax(predictions, axis=1)\n",
389
- "\n",
390
- " accuracy = metric_accuracy.compute(predictions=predictions, references=labels)\n",
391
- " f1 = metric_f1.compute(predictions=predictions, references=labels)\n",
392
- "\n",
393
- " assert accuracy is not None and f1 is not None\n",
394
- "\n",
395
- " result = {\n",
396
- " \"accuracy\": accuracy[\"accuracy\"],\n",
397
- " \"f1\": f1[\"f1\"],\n",
398
- " }\n",
399
- "\n",
400
- " return result\n",
401
- "\n",
402
- "\n",
403
- "compute_metrics = partial(\n",
404
- " _compute_metrics, metric_accuracy=metric_accuracy, metric_f1=metric_f1\n",
405
- ")"
406
- ]
407
- },
408
- {
409
- "cell_type": "markdown",
410
- "id": "34890c4d",
411
- "metadata": {},
412
- "source": [
413
- "## Training Configuration"
414
- ]
415
- },
416
- {
417
- "cell_type": "code",
418
- "execution_count": 42,
419
- "id": "9717d666",
420
- "metadata": {},
421
- "outputs": [],
422
- "source": [
423
- "train_batch_size = 4\n",
424
- "gradient_accumulation_steps = 8\n",
425
- "eval_batch_size = 4\n",
426
- "\n",
427
- "training_args = TrainingArguments(\n",
428
- " \"./models/bert-base-raid-classifier\",\n",
429
- " num_train_epochs=5,\n",
430
- " learning_rate=5e-5,\n",
431
- " weight_decay=0.1,\n",
432
- " per_device_train_batch_size=train_batch_size,\n",
433
- " per_device_eval_batch_size=eval_batch_size,\n",
434
- " gradient_accumulation_steps=gradient_accumulation_steps,\n",
435
- " fp16=torch.cuda.is_available(),\n",
436
- " save_strategy=\"steps\",\n",
437
- " save_total_limit=2,\n",
438
- " save_steps=64,\n",
439
- " metric_for_best_model=\"eval_accuracy\",\n",
440
- " load_best_model_at_end=True,\n",
441
- " eval_strategy=\"steps\",\n",
442
- " eval_steps=64,\n",
443
- " logging_strategy=\"steps\",\n",
444
- " logging_steps=16,\n",
445
- " remove_unused_columns=False,\n",
446
- ")\n",
447
- "\n",
448
- "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)"
449
- ]
450
- },
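Reviewer note: with a per-device batch of 4 and 8 gradient-accumulation steps, each optimizer update sees 32 examples. The sketch below works out the total number of updates for the 18,943 training samples reported earlier in this notebook, assuming a single device. It is plain arithmetic rather than a call into `transformers`, and the helper name is hypothetical; the result agrees with the `2960/2960` shown in the training progress output.

```python
import math


def total_optimizer_steps(num_samples, per_device_batch, grad_accum, epochs):
    """Optimizer updates for a single-device run (an assumption)."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    updates_per_epoch = max(math.ceil(batches_per_epoch / grad_accum), 1)
    return updates_per_epoch * epochs


print(total_optimizer_steps(18943, 4, 8, 5))  # 2960
```

Because `eval_steps=64` and `save_steps=64`, that schedule evaluates and checkpoints 46 times over the run (2960 // 64).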
451
- {
452
- "cell_type": "markdown",
453
- "id": "e840a954",
454
- "metadata": {},
455
- "source": [
456
- "## Initialize Trainer and Train"
457
- ]
458
- },
459
- {
460
- "cell_type": "code",
461
- "execution_count": 43,
462
- "id": "0fa3ed58",
463
- "metadata": {},
464
- "outputs": [
465
- {
466
- "name": "stdout",
467
- "output_type": "stream",
468
- "text": [
469
- "Starting training...\n"
470
- ]
471
- },
472
- {
473
- "data": {
474
- "text/html": [
475
- "\n",
476
- " <div>\n",
477
- " \n",
478
- " <progress value='2960' max='2960' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
479
- " [2960/2960 1:03:52, Epoch 5/5]\n",
480
- " </div>\n",
481
- " <table border=\"1\" class=\"dataframe\">\n",
482
- " <thead>\n",
483
- " <tr style=\"text-align: left;\">\n",
484
- " <th>Step</th>\n",
485
- " <th>Training Loss</th>\n",
486
- " <th>Validation Loss</th>\n",
487
- " <th>Accuracy</th>\n",
488
- " <th>F1</th>\n",
489
- " </tr>\n",
490
- " </thead>\n",
491
- " <tbody>\n",
492
- " <tr>\n",
493
- " <td>64</td>\n",
494
- " <td>5.212345</td>\n",
495
- " <td>0.625602</td>\n",
496
- " <td>0.661986</td>\n",
497
- " <td>0.634093</td>\n",
498
- " </tr>\n",
499
- " <tr>\n",
500
- " <td>128</td>\n",
501
- " <td>3.753965</td>\n",
502
- " <td>0.458432</td>\n",
503
- " <td>0.771314</td>\n",
504
- " <td>0.809045</td>\n",
505
- " </tr>\n",
506
- " <tr>\n",
507
- " <td>192</td>\n",
508
- " <td>3.100017</td>\n",
509
- " <td>0.287685</td>\n",
510
- " <td>0.889669</td>\n",
511
- " <td>0.891089</td>\n",
512
- " </tr>\n",
513
- " <tr>\n",
514
- " <td>256</td>\n",
515
- " <td>2.328572</td>\n",
516
- " <td>0.390553</td>\n",
517
- " <td>0.830491</td>\n",
518
- " <td>0.855432</td>\n",
519
- " </tr>\n",
520
- " <tr>\n",
521
- " <td>320</td>\n",
522
- " <td>2.129814</td>\n",
523
- " <td>0.238838</td>\n",
524
- " <td>0.911735</td>\n",
525
- " <td>0.917757</td>\n",
526
- " </tr>\n",
527
- " <tr>\n",
528
- " <td>384</td>\n",
529
- " <td>1.657923</td>\n",
530
- " <td>0.388610</td>\n",
531
- " <td>0.856570</td>\n",
532
- " <td>0.874671</td>\n",
533
- " </tr>\n",
534
- " <tr>\n",
535
- " <td>448</td>\n",
536
- " <td>1.758504</td>\n",
537
- " <td>0.179176</td>\n",
538
- " <td>0.933801</td>\n",
539
- " <td>0.937262</td>\n",
540
- " </tr>\n",
541
- " <tr>\n",
542
- " <td>512</td>\n",
543
- " <td>1.352967</td>\n",
544
- " <td>0.344061</td>\n",
545
- " <td>0.867603</td>\n",
546
- " <td>0.882979</td>\n",
547
- " </tr>\n",
548
- " <tr>\n",
549
- " <td>576</td>\n",
550
- " <td>1.528169</td>\n",
551
- " <td>0.143238</td>\n",
552
- " <td>0.945838</td>\n",
553
- " <td>0.947368</td>\n",
554
- " </tr>\n",
555
- " <tr>\n",
556
- " <td>640</td>\n",
557
- " <td>1.692302</td>\n",
558
- " <td>0.185934</td>\n",
559
- " <td>0.925777</td>\n",
560
- " <td>0.930582</td>\n",
561
- " </tr>\n",
562
- " <tr>\n",
563
- " <td>704</td>\n",
564
- " <td>1.194244</td>\n",
565
- " <td>0.189194</td>\n",
566
- " <td>0.927783</td>\n",
567
- " <td>0.932203</td>\n",
568
- " </tr>\n",
569
- " <tr>\n",
570
- " <td>768</td>\n",
571
- " <td>1.089103</td>\n",
572
- " <td>0.191697</td>\n",
573
- " <td>0.926780</td>\n",
574
- " <td>0.931455</td>\n",
575
- " </tr>\n",
576
- " <tr>\n",
577
- " <td>832</td>\n",
578
- " <td>1.313780</td>\n",
579
- " <td>0.133464</td>\n",
580
- " <td>0.949850</td>\n",
581
- " <td>0.950593</td>\n",
582
- " </tr>\n",
583
- " <tr>\n",
584
- " <td>896</td>\n",
585
- " <td>1.144064</td>\n",
586
- " <td>0.161593</td>\n",
587
- " <td>0.943831</td>\n",
588
- " <td>0.946463</td>\n",
589
- " </tr>\n",
590
- " <tr>\n",
591
- " <td>960</td>\n",
592
- " <td>1.503407</td>\n",
593
- " <td>0.211920</td>\n",
594
- " <td>0.921765</td>\n",
595
- " <td>0.927374</td>\n",
596
- " </tr>\n",
597
- " <tr>\n",
598
- " <td>1024</td>\n",
599
- " <td>1.106765</td>\n",
600
- " <td>0.182482</td>\n",
601
- " <td>0.931795</td>\n",
602
- " <td>0.935606</td>\n",
603
- " </tr>\n",
604
- " <tr>\n",
605
- " <td>1088</td>\n",
606
- " <td>1.450451</td>\n",
607
- " <td>0.127360</td>\n",
608
- " <td>0.956871</td>\n",
609
- " <td>0.958212</td>\n",
610
- " </tr>\n",
611
- " <tr>\n",
612
- " <td>1152</td>\n",
613
- " <td>1.380015</td>\n",
614
- " <td>0.131538</td>\n",
615
- " <td>0.957874</td>\n",
616
- " <td>0.959064</td>\n",
617
- " </tr>\n",
618
- " <tr>\n",
619
- " <td>1216</td>\n",
620
- " <td>0.755666</td>\n",
621
- " <td>0.158870</td>\n",
622
- " <td>0.940822</td>\n",
623
- " <td>0.943432</td>\n",
624
- " </tr>\n",
625
- " <tr>\n",
626
- " <td>1280</td>\n",
627
- " <td>0.863713</td>\n",
628
- " <td>0.157785</td>\n",
629
- " <td>0.943831</td>\n",
630
- " <td>0.946565</td>\n",
631
- " </tr>\n",
632
- " <tr>\n",
633
- " <td>1344</td>\n",
634
- " <td>0.821364</td>\n",
635
- " <td>0.172321</td>\n",
636
- " <td>0.944835</td>\n",
637
- " <td>0.947469</td>\n",
638
- " </tr>\n",
639
- " <tr>\n",
640
- " <td>1408</td>\n",
641
- " <td>0.957095</td>\n",
642
- " <td>0.226298</td>\n",
643
- " <td>0.922768</td>\n",
644
- " <td>0.927835</td>\n",
645
- " </tr>\n",
646
- " <tr>\n",
647
- " <td>1472</td>\n",
648
- " <td>0.868089</td>\n",
649
- " <td>0.197520</td>\n",
650
- " <td>0.934804</td>\n",
651
- " <td>0.938505</td>\n",
652
- " </tr>\n",
653
- " <tr>\n",
654
- " <td>1536</td>\n",
655
- " <td>1.310811</td>\n",
656
- " <td>0.140865</td>\n",
657
- " <td>0.953862</td>\n",
658
- " <td>0.955426</td>\n",
659
- " </tr>\n",
660
- " <tr>\n",
661
- " <td>1600</td>\n",
662
- " <td>0.708888</td>\n",
663
- " <td>0.152195</td>\n",
664
- " <td>0.943831</td>\n",
665
- " <td>0.946565</td>\n",
666
- " </tr>\n",
667
- " <tr>\n",
668
- " <td>1664</td>\n",
669
- " <td>0.717255</td>\n",
670
- " <td>0.176768</td>\n",
671
- " <td>0.942828</td>\n",
672
- " <td>0.945663</td>\n",
673
- " </tr>\n",
674
- " <tr>\n",
675
- " <td>1728</td>\n",
676
- " <td>1.143681</td>\n",
677
- " <td>0.156816</td>\n",
678
- " <td>0.951856</td>\n",
679
- " <td>0.953757</td>\n",
680
- " </tr>\n",
681
- " <tr>\n",
682
- " <td>1792</td>\n",
683
- " <td>0.638254</td>\n",
684
- " <td>0.176596</td>\n",
685
- " <td>0.944835</td>\n",
686
- " <td>0.947469</td>\n",
687
- " </tr>\n",
688
- " <tr>\n",
689
- " <td>1856</td>\n",
690
- " <td>1.133300</td>\n",
691
- " <td>0.119119</td>\n",
692
- " <td>0.967904</td>\n",
693
- " <td>0.968317</td>\n",
694
- " </tr>\n",
695
- " <tr>\n",
696
- " <td>1920</td>\n",
697
- " <td>1.061837</td>\n",
698
- " <td>0.140624</td>\n",
699
- " <td>0.957874</td>\n",
700
- " <td>0.959381</td>\n",
701
- " </tr>\n",
702
- " <tr>\n",
703
- " <td>1984</td>\n",
704
- " <td>0.708067</td>\n",
705
- " <td>0.189490</td>\n",
706
- " <td>0.940822</td>\n",
707
- " <td>0.943863</td>\n",
708
- " </tr>\n",
709
- " <tr>\n",
710
- " <td>2048</td>\n",
711
- " <td>0.761451</td>\n",
712
- " <td>0.150488</td>\n",
713
- " <td>0.951856</td>\n",
714
- " <td>0.953846</td>\n",
715
- " </tr>\n",
716
- " <tr>\n",
717
- " <td>2112</td>\n",
718
- " <td>0.609547</td>\n",
719
- " <td>0.189622</td>\n",
720
- " <td>0.940822</td>\n",
721
- " <td>0.943863</td>\n",
722
- " </tr>\n",
723
- " <tr>\n",
724
- " <td>2176</td>\n",
725
- " <td>0.803254</td>\n",
726
- " <td>0.173354</td>\n",
727
- " <td>0.946841</td>\n",
728
- " <td>0.949282</td>\n",
729
- " </tr>\n",
730
- " <tr>\n",
731
- " <td>2240</td>\n",
732
- " <td>0.664540</td>\n",
733
- " <td>0.154308</td>\n",
734
- " <td>0.952859</td>\n",
735
- " <td>0.954764</td>\n",
736
- " </tr>\n",
737
- " <tr>\n",
738
- " <td>2304</td>\n",
739
- " <td>0.691763</td>\n",
740
- " <td>0.144127</td>\n",
741
- " <td>0.963892</td>\n",
742
- " <td>0.964706</td>\n",
743
- " </tr>\n",
744
- " <tr>\n",
745
- " <td>2368</td>\n",
746
- " <td>1.092195</td>\n",
747
- " <td>0.157182</td>\n",
748
- " <td>0.957874</td>\n",
749
- " <td>0.959381</td>\n",
750
- " </tr>\n",
751
- " <tr>\n",
752
- " <td>2432</td>\n",
753
- " <td>0.752286</td>\n",
754
- " <td>0.231035</td>\n",
755
- " <td>0.933801</td>\n",
756
- " <td>0.937736</td>\n",
757
- " </tr>\n",
758
- " <tr>\n",
759
- " <td>2496</td>\n",
760
- " <td>0.757014</td>\n",
761
- " <td>0.185019</td>\n",
762
- " <td>0.948847</td>\n",
763
- " <td>0.951103</td>\n",
764
- " </tr>\n",
765
- " <tr>\n",
766
- " <td>2560</td>\n",
767
- " <td>0.766771</td>\n",
768
- " <td>0.153019</td>\n",
769
- " <td>0.958877</td>\n",
770
- " <td>0.960078</td>\n",
771
- " </tr>\n",
772
- " <tr>\n",
773
- " <td>2624</td>\n",
774
- " <td>0.434590</td>\n",
775
- " <td>0.201383</td>\n",
776
- " <td>0.946841</td>\n",
777
- " <td>0.949282</td>\n",
778
- " </tr>\n",
779
- " <tr>\n",
780
- " <td>2688</td>\n",
781
- " <td>0.565482</td>\n",
782
- " <td>0.181478</td>\n",
783
- " <td>0.952859</td>\n",
784
- " <td>0.954764</td>\n",
785
- " </tr>\n",
786
- " <tr>\n",
787
- " <td>2752</td>\n",
788
- " <td>0.568177</td>\n",
789
- " <td>0.201250</td>\n",
790
- " <td>0.946841</td>\n",
791
- " <td>0.949282</td>\n",
792
- " </tr>\n",
793
- " <tr>\n",
794
- " <td>2816</td>\n",
795
- " <td>0.611295</td>\n",
796
- " <td>0.173839</td>\n",
797
- " <td>0.954865</td>\n",
798
- " <td>0.956606</td>\n",
799
- " </tr>\n",
800
- " <tr>\n",
801
- " <td>2880</td>\n",
802
- " <td>0.716351</td>\n",
803
- " <td>0.187448</td>\n",
804
- " <td>0.948847</td>\n",
805
- " <td>0.951103</td>\n",
806
- " </tr>\n",
807
- " <tr>\n",
808
- " <td>2944</td>\n",
809
- " <td>0.603852</td>\n",
810
- " <td>0.184578</td>\n",
811
- " <td>0.948847</td>\n",
812
- " <td>0.951103</td>\n",
813
- " </tr>\n",
814
- " </tbody>\n",
815
- "</table><p>"
816
- ],
817
- "text/plain": [
818
- "<IPython.core.display.HTML object>"
819
- ]
820
- },
821
- "metadata": {},
822
- "output_type": "display_data"
823
- },
824
- {
825
- "data": {
826
- "text/plain": [
827
- "TrainOutput(global_step=2960, training_loss=1.3125710455146995, metrics={'train_runtime': 3832.8474, 'train_samples_per_second': 24.711, 'train_steps_per_second': 0.772, 'total_flos': 8360830141838376.0, 'train_loss': 1.3125710455146995, 'epoch': 5.0})"
828
- ]
829
- },
830
- "execution_count": 43,
831
- "metadata": {},
832
- "output_type": "execute_result"
833
- }
834
- ],
835
- "source": [
836
- "trainer = Trainer(\n",
837
- " model,\n",
838
- " training_args,\n",
839
- " train_dataset=tokenized_datasets[\"train\"],\n",
840
- " eval_dataset=tokenized_datasets[\"test\"],\n",
841
- " data_collator=data_collator,\n",
842
- " compute_metrics=compute_metrics,\n",
843
- ")\n",
844
- "\n",
845
- "print(\"Starting training...\")\n",
846
- "trainer.train()"
847
- ]
848
- },
849
- {
850
- "cell_type": "markdown",
851
- "id": "cde9bbb1",
852
- "metadata": {},
853
- "source": [
854
- "## Final Evaluation"
855
- ]
856
- },
857
- {
858
- "cell_type": "code",
859
- "execution_count": 44,
860
- "id": "bb81afb9",
861
- "metadata": {},
862
- "outputs": [
863
- {
864
- "name": "stdout",
865
- "output_type": "stream",
866
- "text": [
867
- "Evaluating model...\n"
868
- ]
869
- },
870
- {
871
- "data": {
872
- "text/html": [
873
- "\n",
874
- " <div>\n",
875
- " \n",
876
- " <progress value='250' max='250' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
877
- " [250/250 00:14]\n",
878
- " </div>\n",
879
- " "
880
- ],
881
- "text/plain": [
882
- "<IPython.core.display.HTML object>"
883
- ]
884
- },
885
- "metadata": {},
886
- "output_type": "display_data"
887
- },
888
- {
889
- "name": "stdout",
890
- "output_type": "stream",
891
- "text": [
892
- "\n",
893
- "Final Results:\n",
894
- "Accuracy: 0.9679\n",
895
- "F1 Score: 0.9683\n"
896
- ]
897
- }
898
- ],
899
- "source": [
900
- "print(\"Evaluating model...\")\n",
901
- "eval_results = trainer.evaluate()\n",
902
- "print(\"\\nFinal Results:\")\n",
903
- "print(f\"Accuracy: {eval_results['eval_accuracy']:.4f}\")\n",
904
- "print(f\"F1 Score: {eval_results['eval_f1']:.4f}\")"
905
- ]
906
- },
907
- {
908
- "cell_type": "markdown",
909
- "id": "8bf17a40",
910
- "metadata": {},
911
- "source": [
912
- "## Save Model"
913
- ]
914
- },
915
- {
916
- "cell_type": "code",
917
- "execution_count": 45,
918
- "id": "e580bfd6",
919
- "metadata": {},
920
- "outputs": [
921
- {
922
- "name": "stdout",
923
- "output_type": "stream",
924
- "text": [
925
- "Model saved successfully!\n"
926
- ]
927
- }
928
- ],
929
- "source": [
930
- "# Save the final model\n",
931
- "trainer.save_model(\"./trained_model/bert-base-raid-final\")\n",
932
- "print(\"Model saved successfully!\")"
933
- ]
934
- },
935
- {
936
- "cell_type": "markdown",
937
- "id": "99c0a2f0",
938
- "metadata": {},
939
- "source": [
940
- "## Test the Model\n"
941
- ]
942
- },
943
- {
944
- "cell_type": "code",
945
- "execution_count": 46,
946
- "id": "016cc53e",
947
- "metadata": {},
948
- "outputs": [
949
- {
950
- "name": "stdout",
951
- "output_type": "stream",
952
- "text": [
953
- "Prediction for human-written text:\n",
954
- "{'predicted_label': 0, 'probability_human': 0.9988395571708679, 'probability_ai': 0.0011604195460677147}\n",
955
- "\n",
956
- "Prediction for AI-generated text:\n",
957
- "{'predicted_label': 0, 'probability_human': 0.9988927245140076, 'probability_ai': 0.0011073390487581491}\n"
958
- ]
959
- }
960
- ],
961
- "source": [
962
- "def predict(text: str) -> dict[str, float]:\n",
963
- " inputs = tokenizer(\n",
964
- " text,\n",
965
- " max_length=512,\n",
966
- " truncation=True,\n",
967
- " return_tensors=\"pt\",\n",
968
- " ).to(model.device)\n",
969
- "\n",
970
- " with torch.no_grad():\n",
971
- " outputs = model(**inputs)\n",
972
- " logits = outputs.logits\n",
973
- " probabilities = torch.softmax(logits, dim=-1).cpu().numpy()[0]\n",
974
- " predicted_label = np.argmax(probabilities)\n",
975
- "\n",
976
- " return {\n",
977
- " \"predicted_label\": int(predicted_label),\n",
978
- " \"probability_human\": float(probabilities[0]),\n",
979
- " \"probability_ai\": float(probabilities[1]),\n",
980
- " }\n",
981
- " \n",
982
- "text = \"Ai will replace this world. today in the nepal election someone might win by using ai.\"\n",
983
- "text_by_ai = \"This is a sample text generated by AI.Also This is an long text by AI.\"\n",
984
- "print(\"Prediction for human-written text:\")\n",
985
- "print(predict(text))\n",
986
- "print(\"\\nPrediction for AI-generated text:\")\n",
987
- "print(predict(text_by_ai))\n"
988
- ]
989
- },
990
- {
991
- "cell_type": "markdown",
992
- "id": "7c6c2a5d",
993
- "metadata": {},
994
- "source": [
995
- "def predict"
996
- ]
997
- },
998
- {
999
- "cell_type": "code",
1000
- "execution_count": 47,
1001
- "id": "1b287605",
1002
- "metadata": {},
1003
- "outputs": [
1004
- {
1005
- "name": "stdout",
1006
- "output_type": "stream",
1007
- "text": [
1008
- "Using 512 samples for RAID quick test\n"
1009
- ]
1010
- },
1011
- {
1012
- "ename": "OutOfMemoryError",
1013
- "evalue": "CUDA out of memory. Tried to allocate 768.00 MiB. GPU 0 has a total capacity of 3.68 GiB of which 719.12 MiB is free. Process 2034 has 46.03 MiB memory in use. Process 1961 has 6.78 MiB memory in use. Including non-PyTorch memory, this process has 2.90 GiB memory in use. Of the allocated memory 2.71 GiB is allocated by PyTorch, and 85.13 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)",
1014
- "output_type": "error",
1015
- "traceback": [
1016
- "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
1017
- "\u001b[31mOutOfMemoryError\u001b[39m Traceback (most recent call last)",
1018
- "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[47]\u001b[39m\u001b[32m, line 32\u001b[39m\n\u001b[32m 28\u001b[39m \u001b[38;5;66;03m# Return AI-class probability for each input text\u001b[39;00m\n\u001b[32m 29\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m probabilities[:, \u001b[32m1\u001b[39m].astype(\u001b[38;5;28mfloat\u001b[39m).tolist()\n\u001b[32m---> \u001b[39m\u001b[32m32\u001b[39m predictions = \u001b[43mrun_detection\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmy_detector\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtest_df\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 33\u001b[39m evaluation_result = run_evaluation(predictions, test_df)\n\u001b[32m 35\u001b[39m evaluation_result\n",
1019
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/raid/detect.py:6\u001b[39m, in \u001b[36mrun_detection\u001b[39m\u001b[34m(f, df)\u001b[39m\n\u001b[32m 3\u001b[39m scores_df = df[[\u001b[33m\"\u001b[39m\u001b[33mid\u001b[39m\u001b[33m\"\u001b[39m]].copy()\n\u001b[32m 5\u001b[39m \u001b[38;5;66;03m# Run the detector function on the dataset and put output in score column\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m6\u001b[39m scores_df[\u001b[33m\"\u001b[39m\u001b[33mscore\u001b[39m\u001b[33m\"\u001b[39m] = \u001b[43mf\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdf\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mgeneration\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m.\u001b[49m\u001b[43mtolist\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 8\u001b[39m \u001b[38;5;66;03m# Convert scores and ids to dict in 'records' format for seralization\u001b[39;00m\n\u001b[32m 9\u001b[39m \u001b[38;5;66;03m# e.g. [{'id':'...', 'score':0}, {'id':'...', 'score':1}, ...]\u001b[39;00m\n\u001b[32m 10\u001b[39m results = scores_df[[\u001b[33m\"\u001b[39m\u001b[33mid\u001b[39m\u001b[33m\"\u001b[39m, \u001b[33m\"\u001b[39m\u001b[33mscore\u001b[39m\u001b[33m\"\u001b[39m]].to_dict(orient=\u001b[33m\"\u001b[39m\u001b[33mrecords\u001b[39m\u001b[33m\"\u001b[39m)\n",
1020
- "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[47]\u001b[39m\u001b[32m, line 24\u001b[39m, in \u001b[36mmy_detector\u001b[39m\u001b[34m(texts)\u001b[39m\n\u001b[32m 22\u001b[39m model.eval()\n\u001b[32m 23\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m torch.no_grad():\n\u001b[32m---> \u001b[39m\u001b[32m24\u001b[39m outputs = \u001b[43mmodel\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43minputs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 25\u001b[39m logits = outputs.logits\n\u001b[32m 26\u001b[39m probabilities = torch.softmax(logits, dim=-\u001b[32m1\u001b[39m).cpu().numpy()\n",
1021
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1022
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
1023
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/accelerate/utils/operations.py:819\u001b[39m, in \u001b[36mconvert_outputs_to_fp32.<locals>.forward\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 818\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mforward\u001b[39m(*args, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m819\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mmodel_forward\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1024
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/accelerate/utils/operations.py:807\u001b[39m, in \u001b[36mConvertOutputsToFp32.__call__\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 806\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m__call__\u001b[39m(\u001b[38;5;28mself\u001b[39m, *args, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m807\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m convert_to_fp32(\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmodel_forward\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m)\n",
1025
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/amp/autocast_mode.py:44\u001b[39m, in \u001b[36mautocast_decorator.<locals>.decorate_autocast\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 41\u001b[39m \u001b[38;5;129m@functools\u001b[39m.wraps(func)\n\u001b[32m 42\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mdecorate_autocast\u001b[39m(*args, **kwargs):\n\u001b[32m 43\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m autocast_instance:\n\u001b[32m---> \u001b[39m\u001b[32m44\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1026
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/peft/peft_model.py:921\u001b[39m, in \u001b[36mPeftModel.forward\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 919\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mself\u001b[39m._enable_peft_forward_hooks(*args, **kwargs):\n\u001b[32m 920\u001b[39m kwargs = {k: v \u001b[38;5;28;01mfor\u001b[39;00m k, v \u001b[38;5;129;01min\u001b[39;00m kwargs.items() \u001b[38;5;28;01mif\u001b[39;00m k \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.special_peft_forward_args}\n\u001b[32m--> \u001b[39m\u001b[32m921\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mget_base_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1027
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1028
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
1029
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/generic.py:835\u001b[39m, in \u001b[36mcan_return_tuple.<locals>.wrapper\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 833\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m return_dict_passed \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 834\u001b[39m return_dict = return_dict_passed\n\u001b[32m--> \u001b[39m\u001b[32m835\u001b[39m output = \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 836\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m return_dict \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(output, \u001b[38;5;28mtuple\u001b[39m):\n\u001b[32m 837\u001b[39m output = output.to_tuple()\n",
1030
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py:1162\u001b[39m, in \u001b[36mBertForSequenceClassification.forward\u001b[39m\u001b[34m(self, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds, labels, **kwargs)\u001b[39m\n\u001b[32m 1144\u001b[39m \u001b[38;5;129m@can_return_tuple\u001b[39m\n\u001b[32m 1145\u001b[39m \u001b[38;5;129m@auto_docstring\u001b[39m\n\u001b[32m 1146\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mforward\u001b[39m(\n\u001b[32m (...)\u001b[39m\u001b[32m 1154\u001b[39m **kwargs: Unpack[TransformersKwargs],\n\u001b[32m 1155\u001b[39m ) -> \u001b[38;5;28mtuple\u001b[39m[torch.Tensor] | SequenceClassifierOutput:\n\u001b[32m 1156\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33mr\u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 1157\u001b[39m \u001b[33;03m labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):\u001b[39;00m\n\u001b[32m 1158\u001b[39m \u001b[33;03m Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,\u001b[39;00m\n\u001b[32m 1159\u001b[39m \u001b[33;03m config.num_labels - 1]`. 
If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If\u001b[39;00m\n\u001b[32m 1160\u001b[39m \u001b[33;03m `config.num_labels > 1` a classification loss is computed (Cross-Entropy).\u001b[39;00m\n\u001b[32m 1161\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1162\u001b[39m outputs = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mbert\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 1163\u001b[39m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1164\u001b[39m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m=\u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1165\u001b[39m \u001b[43m \u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1166\u001b[39m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1167\u001b[39m \u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m=\u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1168\u001b[39m \u001b[43m \u001b[49m\u001b[43mreturn_dict\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 1169\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1170\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1172\u001b[39m pooled_output = outputs[\u001b[32m1\u001b[39m]\n\u001b[32m 1174\u001b[39m pooled_output = \u001b[38;5;28mself\u001b[39m.dropout(pooled_output)\n",
1031
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1032
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
1033
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/generic.py:1002\u001b[39m, in \u001b[36mcheck_model_inputs.<locals>.wrapped_fn.<locals>.wrapper\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1000\u001b[39m outputs = func(\u001b[38;5;28mself\u001b[39m, *args, **kwargs)\n\u001b[32m 1001\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1002\u001b[39m outputs = \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1003\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m original_exception:\n\u001b[32m 1004\u001b[39m \u001b[38;5;66;03m# If we get a TypeError, it's possible that the model is not receiving the recordable kwargs correctly.\u001b[39;00m\n\u001b[32m 1005\u001b[39m \u001b[38;5;66;03m# Get a TypeError even after removing the recordable kwargs -> re-raise the original exception\u001b[39;00m\n\u001b[32m 1006\u001b[39m \u001b[38;5;66;03m# Otherwise -> we're probably missing `**kwargs` in the decorated function\u001b[39;00m\n\u001b[32m 1007\u001b[39m kwargs_without_recordable = {k: v \u001b[38;5;28;01mfor\u001b[39;00m k, v \u001b[38;5;129;01min\u001b[39;00m kwargs.items() \u001b[38;5;28;01mif\u001b[39;00m k \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m recordable_keys}\n",
1034
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py:679\u001b[39m, in \u001b[36mBertModel.forward\u001b[39m\u001b[34m(self, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, cache_position, **kwargs)\u001b[39m\n\u001b[32m 676\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m cache_position \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 677\u001b[39m cache_position = torch.arange(past_key_values_length, past_key_values_length + seq_length, device=device)\n\u001b[32m--> \u001b[39m\u001b[32m679\u001b[39m embedding_output = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43membeddings\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 680\u001b[39m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 681\u001b[39m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 682\u001b[39m \u001b[43m \u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 683\u001b[39m \u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m=\u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 684\u001b[39m \u001b[43m \u001b[49m\u001b[43mpast_key_values_length\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpast_key_values_length\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 685\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 687\u001b[39m attention_mask, encoder_attention_mask = \u001b[38;5;28mself\u001b[39m._create_attention_masks(\n\u001b[32m 688\u001b[39m attention_mask=attention_mask,\n\u001b[32m 689\u001b[39m encoder_attention_mask=encoder_attention_mask,\n\u001b[32m (...)\u001b[39m\u001b[32m 
693\u001b[39m past_key_values=past_key_values,\n\u001b[32m 694\u001b[39m )\n\u001b[32m 696\u001b[39m encoder_outputs = \u001b[38;5;28mself\u001b[39m.encoder(\n\u001b[32m 697\u001b[39m embedding_output,\n\u001b[32m 698\u001b[39m attention_mask=attention_mask,\n\u001b[32m (...)\u001b[39m\u001b[32m 705\u001b[39m **kwargs,\n\u001b[32m 706\u001b[39m )\n",
1035
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1036
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
1037
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py:107\u001b[39m, in \u001b[36mBertEmbeddings.forward\u001b[39m\u001b[34m(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)\u001b[39m\n\u001b[32m 104\u001b[39m embeddings = inputs_embeds + token_type_embeddings\n\u001b[32m 106\u001b[39m position_embeddings = \u001b[38;5;28mself\u001b[39m.position_embeddings(position_ids)\n\u001b[32m--> \u001b[39m\u001b[32m107\u001b[39m embeddings = \u001b[43membeddings\u001b[49m\u001b[43m \u001b[49m\u001b[43m+\u001b[49m\u001b[43m \u001b[49m\u001b[43mposition_embeddings\u001b[49m\n\u001b[32m 109\u001b[39m embeddings = \u001b[38;5;28mself\u001b[39m.LayerNorm(embeddings)\n\u001b[32m 110\u001b[39m embeddings = \u001b[38;5;28mself\u001b[39m.dropout(embeddings)\n",
1038
- "\u001b[31mOutOfMemoryError\u001b[39m: CUDA out of memory. Tried to allocate 768.00 MiB. GPU 0 has a total capacity of 3.68 GiB of which 719.12 MiB is free. Process 2034 has 46.03 MiB memory in use. Process 1961 has 6.78 MiB memory in use. Including non-PyTorch memory, this process has 2.90 GiB memory in use. Of the allocated memory 2.71 GiB is allocated by PyTorch, and 85.13 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"
1039
- ]
1040
- }
1041
- ],
1042
- "source": [
1043
- "from raid import run_detection, run_evaluation\n",
1044
- "from raid.utils import load_data\n",
1045
- "\n",
1046
- "# Use test split and cap sample size for a quick RAID validation\n",
1047
- "test_df = load_data(split=\"test\")\n",
1048
- "sample_size = min(int(len(test_df) * 0.02), 512)\n",
1049
- "test_df = test_df.sample(n=sample_size, random_state=42)\n",
1050
- "\n",
1051
- "print(f\"Using {len(test_df)} samples for RAID quick test\")\n",
1052
- "\n",
1053
- "\n",
1054
- "def my_detector(texts: list[str]) -> list[float]:\n",
1055
- " # RAID passes a batch/list of strings and expects a list of scores\n",
1056
- " inputs = tokenizer(\n",
1057
- " texts,\n",
1058
- " max_length=512,\n",
1059
- " truncation=True,\n",
1060
- " padding=True,\n",
1061
- " return_tensors=\"pt\",\n",
1062
- " ).to(model.device)\n",
1063
- "\n",
1064
- " model.eval()\n",
1065
- " with torch.no_grad():\n",
1066
- " outputs = model(**inputs)\n",
1067
- " logits = outputs.logits\n",
1068
- " probabilities = torch.softmax(logits, dim=-1).cpu().numpy()\n",
1069
- "\n",
1070
- " # Return AI-class probability for each input text\n",
1071
- " return probabilities[:, 1].astype(float).tolist()\n",
1072
- "\n",
1073
- "\n",
1074
- "predictions = run_detection(my_detector, test_df)\n",
1075
- "evaluation_result = run_evaluation(predictions, test_df)\n",
1076
- "\n",
1077
- "evaluation_result"
1078
- ]
1079
- },
1080
- {
1081
- "cell_type": "code",
1082
- "execution_count": null,
1083
- "id": "6b6eb543",
1084
- "metadata": {},
1085
- "outputs": [],
1086
- "source": []
1087
- }
1088
- ],
1089
- "metadata": {
1090
- "kernelspec": {
1091
- "display_name": "ml",
1092
- "language": "python",
1093
- "name": "python3"
1094
- },
1095
- "language_info": {
1096
- "codemirror_mode": {
1097
- "name": "ipython",
1098
- "version": 3
1099
- },
1100
- "file_extension": ".py",
1101
- "mimetype": "text/x-python",
1102
- "name": "python",
1103
- "nbconvert_exporter": "python",
1104
- "pygments_lexer": "ipython3",
1105
- "version": "3.11.6"
1106
- }
1107
- },
1108
- "nbformat": 4,
1109
- "nbformat_minor": 5
1110
- }
notebook/ai_vs_human/mainv2.ipynb DELETED
@@ -1,1170 +0,0 @@
1
- {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "id": "464eefd0",
6
- "metadata": {},
7
- "source": [
8
- "# AI vs Human Detector V2\n",
9
- "This notebook trains a V2 model that explicitly supports short inputs (including sentences under 50 words) and saves artifacts in `v2_model/`."
10
- ]
11
- },
12
- {
13
- "cell_type": "markdown",
14
- "id": "0be0e8d9",
15
- "metadata": {},
16
- "source": [
17
- "## ✅ Bug Fixes & Capabilities\n",
18
- "\n",
19
- "**Fixed Issues:**\n",
20
- "1. ✅ Runtime error when calling `trainer.evaluate()` after training (removed duplicate evaluation)\n",
21
- "2. ✅ Missing `accelerate` dependency (auto-installs if needed)\n",
22
- "3. ✅ Recursive dataset loading from `./DATASET/` folder (supports `.jsonl`, `.json`, `.csv`)\n",
23
- "4. ✅ Short sentence support (<50 words) with data augmentation\n",
24
- "\n",
25
- "**Model Capabilities:**\n",
26
- "- ✅ Works with **all sentence types**: very short (1-10 words), short (10-50), medium (50-150), long (150+)\n",
27
- "- ✅ Handles edge cases: single words, special characters, numbers, mixed formats\n",
28
- "- ✅ Batch prediction support\n",
29
- "- ✅ Saves to `v2_model/` with tokenizer, config, and label map\n",
30
- "- ✅ Can be loaded independently after saving\n",
31
- "\n",
32
- "**Architecture:** DistilRoBERTa-base (faster, lighter than BERT)\n",
33
- "\n",
34
- "**Quick Start:**\n",
35
- "1. Run cells 1-7 to prepare data\n",
36
- "2. Run cell 8 to train (takes ~15-30 min on GPU)\n",
37
- "3. Run cell 9 to save to `v2_model/`\n",
38
- "4. Run cells 10-12 to test all sentence types"
39
- ]
40
- },
41
- {
42
- "cell_type": "markdown",
43
- "id": "3a8134db",
44
- "metadata": {},
45
- "source": [
46
- "## Additional Testing: Extreme Edge Cases & Batch Prediction"
47
- ]
48
- },
49
- {
50
- "cell_type": "code",
51
- "execution_count": 1,
52
- "id": "f400f763",
53
- "metadata": {},
54
- "outputs": [
55
- {
56
- "name": "stdout",
57
- "output_type": "stream",
58
- "text": [
59
- "Note: you may need to restart the kernel to use updated packages.\n"
60
- ]
61
- }
62
- ],
63
- "source": [
64
- "%pip install -q -U datasets evaluate transformers torch pandas scikit-learn accelerate"
65
- ]
66
- },
67
- {
68
- "cell_type": "code",
69
- "execution_count": 2,
70
- "id": "0c3d4d6d",
71
- "metadata": {},
72
- "outputs": [
73
- {
74
- "name": "stderr",
75
- "output_type": "stream",
76
- "text": [
77
- "/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
78
- " from .autonotebook import tqdm as notebook_tqdm\n",
79
- "/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'Could not load this library: /home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?\n",
80
- " warn(\n",
81
- "/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().\n",
82
- " warnings.warn(_BETA_TRANSFORMS_WARNING)\n",
83
- "/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().\n",
84
- " warnings.warn(_BETA_TRANSFORMS_WARNING)\n"
85
- ]
86
- }
87
- ],
88
- "source": [
89
- "from __future__ import annotations\n",
90
- "\n",
91
- "from dataclasses import dataclass\n",
92
- "from functools import partial\n",
93
- "from pathlib import Path\n",
94
- "import json\n",
95
- "import random\n",
96
- "\n",
97
- "import datasets\n",
98
- "from datasets import Dataset, DatasetDict, concatenate_datasets\n",
99
- "import evaluate\n",
100
- "import numpy as np\n",
101
- "import pandas as pd\n",
102
- "import torch\n",
103
- "from transformers import (\n",
104
- " AutoModelForSequenceClassification,\n",
105
- " AutoTokenizer,\n",
106
- " BatchEncoding,\n",
107
- " DataCollatorWithPadding,\n",
108
- " PreTrainedTokenizer,\n",
109
- " Trainer,\n",
110
- " TrainingArguments,\n",
111
- ")\n",
112
- "from packaging import version"
113
- ]
114
- },
115
- {
116
- "cell_type": "code",
117
- "execution_count": 3,
118
- "id": "624d23ba",
119
- "metadata": {},
120
- "outputs": [
121
- {
122
- "name": "stdout",
123
- "output_type": "stream",
124
- "text": [
125
- "Base model: distilroberta-base\n",
126
- "Device: cuda\n",
127
- "Output path: ./v2_model\n"
128
- ]
129
- }
130
- ],
131
- "source": [
132
- "@dataclass\n",
133
- "class V2Config:\n",
134
- " base_model_name: str = \"distilroberta-base\"\n",
135
- " max_samples: int = 20000\n",
136
- " max_length: int = 256\n",
137
- " short_word_limit: int = 50\n",
138
- " short_aug_ratio: float = 0.35\n",
139
- " output_dir: str = \"./v2_model\"\n",
140
- " seed: int = 42\n",
141
- "\n",
142
- "\n",
143
- "cfg = V2Config()\n",
144
- "DEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
145
- "random.seed(cfg.seed)\n",
146
- "np.random.seed(cfg.seed)\n",
147
- "torch.manual_seed(cfg.seed)\n",
148
- "\n",
149
- "print(f\"Base model: {cfg.base_model_name}\")\n",
150
- "print(f\"Device: {DEVICE}\")\n",
151
- "print(f\"Output path: {cfg.output_dir}\")"
152
- ]
153
- },
154
- {
155
- "cell_type": "code",
156
- "execution_count": 4,
157
- "id": "0a1f860a",
158
- "metadata": {},
159
- "outputs": [],
160
- "source": [
161
- "def normalize_text(text: str) -> str:\n",
162
- " return \" \".join(str(text).split()).strip()\n",
163
- "\n",
164
- "\n",
165
- "def count_words(text: str) -> int:\n",
166
- " return len(normalize_text(text).split())\n",
167
- "\n",
168
- "\n",
169
- "def _load_local_file_to_text_labels(file_path: Path) -> tuple[list[str], list[int]]:\n",
170
- " texts: list[str] = []\n",
171
- " labels: list[int] = []\n",
172
- "\n",
173
- " try:\n",
174
- " suffix = file_path.suffix.lower()\n",
175
- " if suffix == \".jsonl\":\n",
176
- " df = pd.read_json(file_path, lines=True)\n",
177
- " elif suffix == \".json\":\n",
178
- " df = pd.read_json(file_path)\n",
179
- " elif suffix == \".csv\":\n",
180
- " df = pd.read_csv(file_path)\n",
181
- " else:\n",
182
- " return texts, labels\n",
183
- "\n",
184
- " if {\"human_text\", \"ai_text\"}.issubset(df.columns):\n",
185
- " human_texts = [normalize_text(x) for x in df[\"human_text\"].dropna().tolist()]\n",
186
- " ai_texts = [normalize_text(x) for x in df[\"ai_text\"].dropna().tolist()]\n",
187
- " human_texts = [x for x in human_texts if x]\n",
188
- " ai_texts = [x for x in ai_texts if x]\n",
189
- " texts.extend(human_texts)\n",
190
- " labels.extend([0] * len(human_texts))\n",
191
- " texts.extend(ai_texts)\n",
192
- " labels.extend([1] * len(ai_texts))\n",
193
- " return texts, labels\n",
194
- "\n",
195
- " # Alternative schema fallback: text + label/ai_gen columns.\n",
196
- " if \"text\" in df.columns and (\"label\" in df.columns or \"ai_gen\" in df.columns):\n",
197
- " label_col = \"label\" if \"label\" in df.columns else \"ai_gen\"\n",
198
- " for _, row in df.iterrows():\n",
199
- " text = normalize_text(row.get(\"text\", \"\"))\n",
200
- " if not text:\n",
201
- " continue\n",
202
- " val = str(row.get(label_col, \"\")).strip().lower()\n",
203
- " is_ai = val in {\"1\", \"true\", \"ai\", \"ai-generated\", \"ai_generated\"}\n",
204
- " texts.append(text)\n",
205
- " labels.append(1 if is_ai else 0)\n",
206
- " return texts, labels\n",
207
- "\n",
208
- " except Exception as error:\n",
209
- " print(f\"Skipped file due to parse error: {file_path} ({error})\")\n",
210
- "\n",
211
- " return texts, labels\n",
212
- "\n",
213
- "\n",
214
- "def get_combined_dataset(max_samples: int = 20000, use_local: bool = True) -> DatasetDict:\n",
215
- " all_texts: list[str] = []\n",
216
- " all_labels: list[int] = []\n",
217
- "\n",
218
- " try:\n",
219
- " hc3 = datasets.load_dataset(\"Hello-SimpleAI/HC3\", \"all\", split=\"train\")\n",
220
- " for row in hc3:\n",
221
- " for answer in row.get(\"human_answers\", [])[:1]:\n",
222
- " text = normalize_text(answer)\n",
223
- " if text:\n",
224
- " all_texts.append(text)\n",
225
- " all_labels.append(0)\n",
226
- " for answer in row.get(\"chatgpt_answers\", [])[:1]:\n",
227
- " text = normalize_text(answer)\n",
228
- " if text:\n",
229
- " all_texts.append(text)\n",
230
- " all_labels.append(1)\n",
231
- " print(f\"HC3 samples: {len(all_texts)}\")\n",
232
- " except Exception as error:\n",
233
- " print(f\"HC3 unavailable: {error}\")\n",
234
- "\n",
235
- " if use_local:\n",
236
- " dataset_root = Path(\"./DATASET\")\n",
237
- " candidates = list(dataset_root.rglob(\"*.jsonl\")) + list(dataset_root.rglob(\"*.json\")) + list(dataset_root.rglob(\"*.csv\"))\n",
238
- "\n",
239
- " local_before = len(all_texts)\n",
240
- " for file_path in candidates:\n",
241
- " texts, labels = _load_local_file_to_text_labels(file_path)\n",
242
- " all_texts.extend(texts)\n",
243
- " all_labels.extend(labels)\n",
244
- "\n",
245
- " print(f\"Local recursive files scanned: {len(candidates)}\")\n",
246
- " print(f\"Local samples added: {len(all_texts) - local_before}\")\n",
247
- "\n",
248
- " if not all_texts:\n",
249
- " raise ValueError(\"No training data loaded from HC3 or local dataset.\")\n",
250
- "\n",
251
- " ds = Dataset.from_dict({\"text\": all_texts, \"label\": all_labels})\n",
252
- " ds = ds.filter(lambda x: x[\"text\"] is not None and len(normalize_text(x[\"text\"])) > 0)\n",
253
- " ds = ds.shuffle(seed=cfg.seed)\n",
254
- " if len(ds) > max_samples:\n",
255
- " ds = ds.select(range(max_samples))\n",
256
- "\n",
257
- " split = ds.train_test_split(test_size=0.1, seed=cfg.seed)\n",
258
- " return split\n",
259
- "\n",
260
- "\n",
261
- "def add_short_text_variants(dataset: Dataset, short_word_limit: int = 50, ratio: float = 0.35) -> Dataset:\n",
262
- " short_texts: list[str] = []\n",
263
- " short_labels: list[int] = []\n",
264
- "\n",
265
- " for row in dataset:\n",
266
- " text = normalize_text(row[\"text\"])\n",
267
- " label = int(row[\"label\"])\n",
268
- " words = text.split()\n",
269
- "\n",
270
- " if len(words) <= short_word_limit:\n",
271
- " if random.random() < ratio:\n",
272
- " short_texts.append(text)\n",
273
- " short_labels.append(label)\n",
274
- " continue\n",
275
- "\n",
276
- " # Keep first N words as a short variant to train behavior on short inputs.\n",
277
- " if random.random() < ratio:\n",
278
- " short_text = \" \".join(words[:short_word_limit])\n",
279
- " short_texts.append(short_text)\n",
280
- " short_labels.append(label)\n",
281
- "\n",
282
- " if not short_texts:\n",
283
- " return dataset\n",
284
- "\n",
285
- " aug = Dataset.from_dict({\"text\": short_texts, \"label\": short_labels})\n",
286
- " return concatenate_datasets([dataset, aug]).shuffle(seed=cfg.seed)"
287
- ]
288
- },
289
- {
290
- "cell_type": "code",
291
- "execution_count": 5,
292
- "id": "889c5e58",
293
- "metadata": {},
294
- "outputs": [
295
- {
296
- "name": "stdout",
297
- "output_type": "stream",
298
- "text": [
299
- "HC3 unavailable: Dataset scripts are no longer supported, but found HC3.py\n",
300
- "Skipped file due to parse error: DATASET/test.csv (No columns to parse from file)\n",
301
- "Local recursive files scanned: 2\n",
302
- "Local samples added: 19940\n"
303
- ]
304
- },
305
- {
306
- "name": "stderr",
307
- "output_type": "stream",
308
- "text": [
309
- "Filter: 100%|██████████| 19940/19940 [00:00<00:00, 133317.22 examples/s]\n"
310
- ]
311
- },
312
- {
313
- "name": "stdout",
314
- "output_type": "stream",
315
- "text": [
316
- "Train samples: 24213\n",
317
- "Eval samples: 1994\n",
318
- "Train short (<50 words): 6839\n",
319
- "Eval short (<50 words): 569\n"
320
- ]
321
- }
322
- ],
323
- "source": [
324
- "raw_data = get_combined_dataset(max_samples=cfg.max_samples)\n",
325
- "train_data = add_short_text_variants(\n",
326
- " raw_data[\"train\"],\n",
327
- " short_word_limit=cfg.short_word_limit,\n",
328
- " ratio=cfg.short_aug_ratio,\n",
329
- ")\n",
330
- "eval_data = raw_data[\"test\"]\n",
331
- "\n",
332
- "short_train = sum(count_words(t) < 50 for t in train_data[\"text\"])\n",
333
- "short_eval = sum(count_words(t) < 50 for t in eval_data[\"text\"])\n",
334
- "\n",
335
- "print(f\"Train samples: {len(train_data)}\")\n",
336
- "print(f\"Eval samples: {len(eval_data)}\")\n",
337
- "print(f\"Train short (<50 words): {short_train}\")\n",
338
- "print(f\"Eval short (<50 words): {short_eval}\")"
339
- ]
340
- },
341
- {
342
- "cell_type": "code",
343
- "execution_count": 7,
344
- "id": "e8a2ff3e",
345
- "metadata": {},
346
- "outputs": [
347
- {
348
- "name": "stderr",
349
- "output_type": "stream",
350
- "text": [
351
- "Loading weights: 100%|██████████| 101/101 [00:00<00:00, 8921.80it/s]\n",
352
- "\u001b[1mRobertaForSequenceClassification LOAD REPORT\u001b[0m from: distilroberta-base\n",
353
- "Key | Status | \n",
354
- "----------------------------+------------+-\n",
355
- "roberta.pooler.dense.weight | UNEXPECTED | \n",
356
- "lm_head.dense.weight | UNEXPECTED | \n",
357
- "roberta.pooler.dense.bias | UNEXPECTED | \n",
358
- "lm_head.layer_norm.bias | UNEXPECTED | \n",
359
- "lm_head.dense.bias | UNEXPECTED | \n",
360
- "lm_head.layer_norm.weight | UNEXPECTED | \n",
361
- "lm_head.bias | UNEXPECTED | \n",
362
- "classifier.out_proj.bias | MISSING | \n",
363
- "classifier.dense.weight | MISSING | \n",
364
- "classifier.dense.bias | MISSING | \n",
365
- "classifier.out_proj.weight | MISSING | \n",
366
- "\n",
367
- "\u001b[3mNotes:\n",
368
- "- UNEXPECTED\u001b[3m\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\n",
369
- "- MISSING\u001b[3m\t:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.\u001b[0m\n",
370
- "Map: 100%|██████████| 24213/24213 [00:01<00:00, 12285.23 examples/s]\n",
371
- "Map: 100%|██████████| 1994/1994 [00:00<00:00, 11737.65 examples/s]\n"
372
- ]
373
- }
374
- ],
375
- "source": [
376
- "tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name)\n",
377
- "model = AutoModelForSequenceClassification.from_pretrained(cfg.base_model_name, num_labels=2).to(DEVICE)\n",
378
- "\n",
379
- "\n",
380
- "def preprocess_batch(batch: dict, tokenizer: PreTrainedTokenizer, max_length: int = 256) -> BatchEncoding:\n",
381
- " encoded = tokenizer(\n",
382
- " batch[\"text\"],\n",
383
- " truncation=True,\n",
384
- " max_length=max_length,\n",
385
- " )\n",
386
- " encoded[\"labels\"] = batch[\"label\"]\n",
387
- " return encoded\n",
388
- "\n",
389
- "\n",
390
- "tokenize_fn = partial(preprocess_batch, tokenizer=tokenizer, max_length=cfg.max_length)\n",
391
- "tokenized_train = train_data.map(tokenize_fn, batched=True, remove_columns=[\"text\", \"label\"])\n",
392
- "tokenized_eval = eval_data.map(tokenize_fn, batched=True, remove_columns=[\"text\", \"label\"])\n",
393
- "\n",
394
- "columns = tokenized_train.column_names\n",
395
- "tensor_columns = [name for name in [\"input_ids\", \"attention_mask\", \"token_type_ids\", \"labels\"] if name in columns]\n",
396
- "tokenized_train.set_format(type=\"torch\", columns=tensor_columns)\n",
397
- "tokenized_eval.set_format(type=\"torch\", columns=tensor_columns)\n",
398
- "\n",
399
- "metric_accuracy = evaluate.load(\"accuracy\")\n",
400
- "metric_f1 = evaluate.load(\"f1\")\n",
401
- "\n",
402
- "\n",
403
- "def compute_metrics(eval_pred: tuple[np.ndarray, np.ndarray]) -> dict[str, float]:\n",
404
- " logits, labels = eval_pred\n",
405
- " if isinstance(logits, tuple):\n",
406
- " logits = logits[0]\n",
407
- " preds = np.argmax(logits, axis=1)\n",
408
- " acc = metric_accuracy.compute(predictions=preds, references=labels)\n",
409
- " f1 = metric_f1.compute(predictions=preds, references=labels)\n",
410
- " return {\"accuracy\": float(acc[\"accuracy\"]), \"f1\": float(f1[\"f1\"])}"
411
- ]
412
- },
413
- {
414
- "cell_type": "code",
415
- "execution_count": null,
416
- "id": "00f52ac8",
417
- "metadata": {},
418
- "outputs": [
419
- {
420
- "name": "stdout",
421
- "output_type": "stream",
422
- "text": [
423
- "Start training V2 model...\n"
424
- ]
425
- },
426
- {
427
- "data": {
428
- "text/html": [
429
- "\n",
430
- " <div>\n",
431
- " \n",
432
- " <progress value='4542' max='4542' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
433
- " [4542/4542 20:20, Epoch 3/3]\n",
434
- " </div>\n",
435
- " <table border=\"1\" class=\"dataframe\">\n",
436
- " <thead>\n",
437
- " <tr style=\"text-align: left;\">\n",
438
- " <th>Step</th>\n",
439
- " <th>Training Loss</th>\n",
440
- " <th>Validation Loss</th>\n",
441
- " <th>Accuracy</th>\n",
442
- " <th>F1</th>\n",
443
- " </tr>\n",
444
- " </thead>\n",
445
- " <tbody>\n",
446
- " <tr>\n",
447
- " <td>200</td>\n",
448
- " <td>0.666410</td>\n",
449
- " <td>0.350684</td>\n",
450
- " <td>0.834504</td>\n",
451
- " <td>0.855390</td>\n",
452
- " </tr>\n",
453
- " <tr>\n",
454
- " <td>400</td>\n",
455
- " <td>0.598755</td>\n",
456
- " <td>0.256876</td>\n",
457
- " <td>0.897192</td>\n",
458
- " <td>0.904518</td>\n",
459
- " </tr>\n",
460
- " <tr>\n",
461
- " <td>600</td>\n",
462
- " <td>0.574993</td>\n",
463
- " <td>0.198666</td>\n",
464
- " <td>0.919258</td>\n",
465
- " <td>0.917138</td>\n",
466
- " </tr>\n",
467
- " <tr>\n",
468
- " <td>800</td>\n",
469
- " <td>0.560090</td>\n",
470
- " <td>0.555182</td>\n",
471
- " <td>0.849047</td>\n",
472
- " <td>0.868040</td>\n",
473
- " </tr>\n",
474
- " <tr>\n",
475
- " <td>1000</td>\n",
476
- " <td>0.387553</td>\n",
477
- " <td>0.203730</td>\n",
478
- " <td>0.929288</td>\n",
479
- " <td>0.930848</td>\n",
480
- " </tr>\n",
481
- " <tr>\n",
482
- " <td>1200</td>\n",
483
- " <td>0.411762</td>\n",
484
- " <td>0.521041</td>\n",
485
- " <td>0.849047</td>\n",
486
- " <td>0.868387</td>\n",
487
- " </tr>\n",
488
- " <tr>\n",
489
- " <td>1400</td>\n",
490
- " <td>0.386610</td>\n",
491
- " <td>0.348940</td>\n",
492
- " <td>0.902708</td>\n",
493
- " <td>0.910434</td>\n",
494
- " </tr>\n",
495
- " <tr>\n",
496
- " <td>1600</td>\n",
497
- " <td>0.244696</td>\n",
498
- " <td>0.346382</td>\n",
499
- " <td>0.916249</td>\n",
500
- " <td>0.921633</td>\n",
501
- " </tr>\n",
502
- " <tr>\n",
503
- " <td>1800</td>\n",
504
- " <td>0.223823</td>\n",
505
- " <td>0.308763</td>\n",
506
- " <td>0.924774</td>\n",
507
- " <td>0.928977</td>\n",
508
- " </tr>\n",
509
- " <tr>\n",
510
- " <td>2000</td>\n",
511
- " <td>0.249242</td>\n",
512
- " <td>0.358467</td>\n",
513
- " <td>0.919258</td>\n",
514
- " <td>0.924307</td>\n",
515
- " </tr>\n",
516
- " <tr>\n",
517
- " <td>2200</td>\n",
518
- " <td>0.221226</td>\n",
519
- " <td>0.335397</td>\n",
520
- " <td>0.919759</td>\n",
521
- " <td>0.924599</td>\n",
522
- " </tr>\n",
523
- " <tr>\n",
524
- " <td>2400</td>\n",
525
- " <td>0.221417</td>\n",
526
- " <td>0.587722</td>\n",
527
- " <td>0.882648</td>\n",
528
- " <td>0.894973</td>\n",
529
- " </tr>\n",
530
- " <tr>\n",
531
- " <td>2600</td>\n",
532
- " <td>0.191291</td>\n",
533
- " <td>0.329566</td>\n",
534
- " <td>0.928285</td>\n",
535
- " <td>0.931677</td>\n",
536
- " </tr>\n",
537
- " <tr>\n",
538
- " <td>2800</td>\n",
539
- " <td>0.219115</td>\n",
540
- " <td>0.368331</td>\n",
541
- " <td>0.919759</td>\n",
542
- " <td>0.925164</td>\n",
543
- " </tr>\n",
544
- " <tr>\n",
545
- " <td>3000</td>\n",
546
- " <td>0.308968</td>\n",
547
- " <td>0.277328</td>\n",
548
- " <td>0.931795</td>\n",
549
- " <td>0.934928</td>\n",
550
- " </tr>\n",
551
- " <tr>\n",
552
- " <td>3200</td>\n",
553
- " <td>0.131352</td>\n",
554
- " <td>0.585112</td>\n",
555
- " <td>0.891174</td>\n",
556
- " <td>0.901854</td>\n",
557
- " </tr>\n",
558
- " <tr>\n",
559
- " <td>3400</td>\n",
560
- " <td>0.152614</td>\n",
561
- " <td>0.388915</td>\n",
562
- " <td>0.924273</td>\n",
563
- " <td>0.929208</td>\n",
564
- " </tr>\n",
565
- " <tr>\n",
566
- " <td>3600</td>\n",
567
- " <td>0.145248</td>\n",
568
- " <td>0.439313</td>\n",
569
- " <td>0.921765</td>\n",
570
- " <td>0.926898</td>\n",
571
- " </tr>\n",
572
- " <tr>\n",
573
- " <td>3800</td>\n",
574
- " <td>0.086042</td>\n",
575
- " <td>0.467167</td>\n",
576
- " <td>0.920762</td>\n",
577
- " <td>0.926099</td>\n",
578
- " </tr>\n",
579
- " <tr>\n",
580
- " <td>4000</td>\n",
581
- " <td>0.051121</td>\n",
582
- " <td>0.561893</td>\n",
583
- " <td>0.909729</td>\n",
584
- " <td>0.916898</td>\n",
585
- " </tr>\n",
586
- " <tr>\n",
587
- " <td>4200</td>\n",
588
- " <td>0.141769</td>\n",
589
- " <td>0.477382</td>\n",
590
- " <td>0.920762</td>\n",
591
- " <td>0.926168</td>\n",
592
- " </tr>\n",
593
- " <tr>\n",
594
- " <td>4400</td>\n",
595
- " <td>0.016825</td>\n",
596
- " <td>0.506922</td>\n",
597
- " <td>0.918255</td>\n",
598
- " <td>0.924151</td>\n",
599
- " </tr>\n",
600
- " </tbody>\n",
601
- "</table><p>"
602
- ],
603
- "text/plain": [
604
- "<IPython.core.display.HTML object>"
605
- ]
606
- },
607
- "metadata": {},
608
- "output_type": "display_data"
609
- },
610
- {
611
- "name": "stderr",
612
- "output_type": "stream",
613
- "text": [
614
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.48it/s]\n",
615
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.84it/s]\n",
616
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.64it/s]\n",
617
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.02it/s]\n",
618
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.96it/s]\n",
619
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.07it/s]\n",
620
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.79it/s]\n",
621
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.02it/s]\n",
622
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.03it/s]\n",
623
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.03it/s]\n",
624
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.00it/s]\n",
625
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.10it/s]\n",
626
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.59it/s]\n",
627
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.23it/s]\n",
628
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.16it/s]\n",
629
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.19it/s]\n",
630
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.14it/s]\n",
631
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.14it/s]\n",
632
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.00it/s]\n",
633
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.21it/s]\n",
634
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.17it/s]\n",
635
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.99it/s]\n",
636
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.22it/s]\n",
637
- "There were missing keys in the checkpoint model loaded: ['roberta.embeddings.LayerNorm.weight', 'roberta.embeddings.LayerNorm.bias', 'roberta.encoder.layer.0.attention.output.LayerNorm.weight', 'roberta.encoder.layer.0.attention.output.LayerNorm.bias', 'roberta.encoder.layer.0.output.LayerNorm.weight', 'roberta.encoder.layer.0.output.LayerNorm.bias', 'roberta.encoder.layer.1.attention.output.LayerNorm.weight', 'roberta.encoder.layer.1.attention.output.LayerNorm.bias', 'roberta.encoder.layer.1.output.LayerNorm.weight', 'roberta.encoder.layer.1.output.LayerNorm.bias', 'roberta.encoder.layer.2.attention.output.LayerNorm.weight', 'roberta.encoder.layer.2.attention.output.LayerNorm.bias', 'roberta.encoder.layer.2.output.LayerNorm.weight', 'roberta.encoder.layer.2.output.LayerNorm.bias', 'roberta.encoder.layer.3.attention.output.LayerNorm.weight', 'roberta.encoder.layer.3.attention.output.LayerNorm.bias', 'roberta.encoder.layer.3.output.LayerNorm.weight', 'roberta.encoder.layer.3.output.LayerNorm.bias', 'roberta.encoder.layer.4.attention.output.LayerNorm.weight', 'roberta.encoder.layer.4.attention.output.LayerNorm.bias', 'roberta.encoder.layer.4.output.LayerNorm.weight', 'roberta.encoder.layer.4.output.LayerNorm.bias', 'roberta.encoder.layer.5.attention.output.LayerNorm.weight', 'roberta.encoder.layer.5.attention.output.LayerNorm.bias', 'roberta.encoder.layer.5.output.LayerNorm.weight', 'roberta.encoder.layer.5.output.LayerNorm.bias'].\n",
638
- "There were unexpected keys in the checkpoint model loaded: ['roberta.embeddings.LayerNorm.beta', 'roberta.embeddings.LayerNorm.gamma', 'roberta.encoder.layer.0.attention.output.LayerNorm.beta', 'roberta.encoder.layer.0.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.0.output.LayerNorm.beta', 'roberta.encoder.layer.0.output.LayerNorm.gamma', 'roberta.encoder.layer.1.attention.output.LayerNorm.beta', 'roberta.encoder.layer.1.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.1.output.LayerNorm.beta', 'roberta.encoder.layer.1.output.LayerNorm.gamma', 'roberta.encoder.layer.2.attention.output.LayerNorm.beta', 'roberta.encoder.layer.2.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.2.output.LayerNorm.beta', 'roberta.encoder.layer.2.output.LayerNorm.gamma', 'roberta.encoder.layer.3.attention.output.LayerNorm.beta', 'roberta.encoder.layer.3.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.3.output.LayerNorm.beta', 'roberta.encoder.layer.3.output.LayerNorm.gamma', 'roberta.encoder.layer.4.attention.output.LayerNorm.beta', 'roberta.encoder.layer.4.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.4.output.LayerNorm.beta', 'roberta.encoder.layer.4.output.LayerNorm.gamma', 'roberta.encoder.layer.5.attention.output.LayerNorm.beta', 'roberta.encoder.layer.5.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.5.output.LayerNorm.beta', 'roberta.encoder.layer.5.output.LayerNorm.gamma'].\n"
639
- ]
640
- },
641
- {
642
- "name": "stdout",
643
- "output_type": "stream",
644
- "text": [
645
- "Final evaluation...\n"
646
- ]
647
- },
648
- {
649
- "data": {
650
- "text/html": [
651
- "\n",
652
- " <div>\n",
653
- " \n",
654
- " <progress value='250' max='250' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
655
- " [250/250 00:07]\n",
656
- " </div>\n",
657
- " "
658
- ],
659
- "text/plain": [
660
- "<IPython.core.display.HTML object>"
661
- ]
662
- },
663
- "metadata": {},
664
- "output_type": "display_data"
665
- },
666
- {
667
- "ename": "RuntimeError",
668
- "evalue": "on_train_begin must be called before on_evaluate",
669
- "output_type": "error",
670
- "traceback": [
671
- "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
672
- "\u001b[31mRuntimeError\u001b[39m Traceback (most recent call last)",
673
- "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[8]\u001b[39m\u001b[32m, line 55\u001b[39m\n\u001b[32m 52\u001b[39m trainer.train()\n\u001b[32m 54\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33mFinal evaluation...\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m---> \u001b[39m\u001b[32m55\u001b[39m eval_result = \u001b[43mtrainer\u001b[49m\u001b[43m.\u001b[49m\u001b[43mevaluate\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 56\u001b[39m \u001b[38;5;28mprint\u001b[39m(json.dumps(eval_result, indent=\u001b[32m2\u001b[39m, default=\u001b[38;5;28mstr\u001b[39m))\n",
674
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/trainer.py:2602\u001b[39m, in \u001b[36mTrainer.evaluate\u001b[39m\u001b[34m(self, eval_dataset, ignore_keys, metric_key_prefix)\u001b[39m\n\u001b[32m 2599\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m DebugOption.TPU_METRICS_DEBUG \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.args.debug:\n\u001b[32m 2600\u001b[39m xm.master_print(met.metrics_report())\n\u001b[32m-> \u001b[39m\u001b[32m2602\u001b[39m \u001b[38;5;28mself\u001b[39m.control = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcallback_handler\u001b[49m\u001b[43m.\u001b[49m\u001b[43mon_evaluate\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mstate\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcontrol\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43moutput\u001b[49m\u001b[43m.\u001b[49m\u001b[43mmetrics\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 2604\u001b[39m \u001b[38;5;28mself\u001b[39m._memory_tracker.stop_and_update_metrics(output.metrics)\n\u001b[32m 2606\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m output.metrics\n",
675
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/trainer_callback.py:524\u001b[39m, in \u001b[36mCallbackHandler.on_evaluate\u001b[39m\u001b[34m(self, args, state, control, metrics)\u001b[39m\n\u001b[32m 522\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mon_evaluate\u001b[39m(\u001b[38;5;28mself\u001b[39m, args: TrainingArguments, state: TrainerState, control: TrainerControl, metrics):\n\u001b[32m 523\u001b[39m control.should_evaluate = \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m524\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcall_event\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mon_evaluate\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstate\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcontrol\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmetrics\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmetrics\u001b[49m\u001b[43m)\u001b[49m\n",
676
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/trainer_callback.py:545\u001b[39m, in \u001b[36mCallbackHandler.call_event\u001b[39m\u001b[34m(self, event, args, state, control, **kwargs)\u001b[39m\n\u001b[32m 543\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mcall_event\u001b[39m(\u001b[38;5;28mself\u001b[39m, event, args, state, control, **kwargs):\n\u001b[32m 544\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m callback \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.callbacks:\n\u001b[32m--> \u001b[39m\u001b[32m545\u001b[39m result = \u001b[38;5;28;43mgetattr\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mcallback\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mevent\u001b[49m\u001b[43m)\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 546\u001b[39m \u001b[43m \u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 547\u001b[39m \u001b[43m \u001b[49m\u001b[43mstate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 548\u001b[39m \u001b[43m \u001b[49m\u001b[43mcontrol\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 549\u001b[39m \u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 550\u001b[39m \u001b[43m \u001b[49m\u001b[43mprocessing_class\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mprocessing_class\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 551\u001b[39m \u001b[43m \u001b[49m\u001b[43moptimizer\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43moptimizer\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 552\u001b[39m \u001b[43m \u001b[49m\u001b[43mlr_scheduler\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mlr_scheduler\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 553\u001b[39m \u001b[43m \u001b[49m\u001b[43mtrain_dataloader\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mtrain_dataloader\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 554\u001b[39m \u001b[43m \u001b[49m\u001b[43meval_dataloader\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43meval_dataloader\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 555\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 556\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 557\u001b[39m \u001b[38;5;66;03m# A Callback can skip the return of `control` if it doesn't change it.\u001b[39;00m\n\u001b[32m 558\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m result \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n",
677
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/notebook.py:354\u001b[39m, in \u001b[36mNotebookProgressCallback.on_evaluate\u001b[39m\u001b[34m(self, args, state, control, metrics, **kwargs)\u001b[39m\n\u001b[32m 353\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mon_evaluate\u001b[39m(\u001b[38;5;28mself\u001b[39m, args, state, control, metrics=\u001b[38;5;28;01mNone\u001b[39;00m, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m354\u001b[39m tt = \u001b[43m_require\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mtraining_tracker\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mon_train_begin must be called before on_evaluate\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[32m 356\u001b[39m values = {\u001b[33m\"\u001b[39m\u001b[33mTraining Loss\u001b[39m\u001b[33m\"\u001b[39m: \u001b[33m\"\u001b[39m\u001b[33mNo log\u001b[39m\u001b[33m\"\u001b[39m, \u001b[33m\"\u001b[39m\u001b[33mValidation Loss\u001b[39m\u001b[33m\"\u001b[39m: \u001b[33m\"\u001b[39m\u001b[33mNo log\u001b[39m\u001b[33m\"\u001b[39m}\n\u001b[32m 357\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m log \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mreversed\u001b[39m(state.log_history):\n",
678
- "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/notebook.py:31\u001b[39m, in \u001b[36m_require\u001b[39m\u001b[34m(x, msg)\u001b[39m\n\u001b[32m 29\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m_require\u001b[39m(x: _T | \u001b[38;5;28;01mNone\u001b[39;00m, msg: \u001b[38;5;28mstr\u001b[39m) -> _T:\n\u001b[32m 30\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m x \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m---> \u001b[39m\u001b[32m31\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(msg)\n\u001b[32m 32\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m x\n",
679
- "\u001b[31mRuntimeError\u001b[39m: on_train_begin must be called before on_evaluate"
680
- ]
681
- }
682
- ],
683
- "source": [
684
- "import sys\n",
685
- "import subprocess\n",
686
- "\n",
687
- "\n",
688
- "def _ensure_accelerate(min_version: str = \"1.1.0\") -> None:\n",
689
- " try:\n",
690
- " import accelerate # noqa: F401\n",
691
- " from packaging import version\n",
692
- "\n",
693
- " if version.parse(accelerate.__version__) < version.parse(min_version):\n",
694
- " raise ImportError(f\"accelerate version too old: {accelerate.__version__}\")\n",
695
- " except Exception:\n",
696
- " print(\"Installing/upgrading accelerate in current kernel environment...\")\n",
697
- " subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", f\"accelerate>={min_version}\"])\n",
698
- "\n",
699
- "\n",
700
- "_ensure_accelerate()\n",
701
- "\n",
702
- "train_args = TrainingArguments(\n",
703
- " output_dir=\"./results/v2-distilroberta\",\n",
704
- " num_train_epochs=3,\n",
705
- " learning_rate=2e-5,\n",
706
- " weight_decay=0.01,\n",
707
- " per_device_train_batch_size=8,\n",
708
- " per_device_eval_batch_size=8,\n",
709
- " gradient_accumulation_steps=2,\n",
710
- " fp16=torch.cuda.is_available(),\n",
711
- " eval_strategy=\"steps\",\n",
712
- " eval_steps=200,\n",
713
- " save_strategy=\"steps\",\n",
714
- " save_steps=200,\n",
715
- " save_total_limit=2,\n",
716
- " logging_steps=50,\n",
717
- " metric_for_best_model=\"eval_f1\",\n",
718
- " load_best_model_at_end=True,\n",
719
- " remove_unused_columns=False,\n",
720
- " report_to=\"none\",\n",
721
- ")\n",
722
- "\n",
723
- "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)\n",
724
- "\n",
725
- "trainer = Trainer(\n",
726
- " model=model,\n",
727
- " args=train_args,\n",
728
- " train_dataset=tokenized_train,\n",
729
- " eval_dataset=tokenized_eval,\n",
730
- " data_collator=data_collator,\n",
731
- " compute_metrics=compute_metrics,\n",
732
- ")\n",
733
- "\n",
734
- "print(\"Start training V2 model...\")\n",
735
- "train_result = trainer.train()\n",
736
- "\n",
737
- "print(\"\\n✓ Training complete!\")\n",
738
- "print(f\"Final training metrics:\")\n",
739
- "if hasattr(trainer.state, 'log_history') and trainer.state.log_history:\n",
740
- " # Get the last evaluation metrics from log history\n",
741
- " for log_entry in reversed(trainer.state.log_history):\n",
742
- " if 'eval_loss' in log_entry:\n",
743
- " print(f\" Eval Loss: {log_entry.get('eval_loss', 'N/A'):.4f}\")\n",
744
- " print(f\" Eval Accuracy: {log_entry.get('eval_accuracy', 'N/A'):.4f}\")\n",
745
- " print(f\" Eval F1: {log_entry.get('eval_f1', 'N/A'):.4f}\")\n",
746
- " break"
747
- ]
748
- },
749
- {
750
- "cell_type": "code",
751
- "execution_count": 9,
752
- "id": "1b601515",
753
- "metadata": {},
754
- "outputs": [
755
- {
756
- "name": "stderr",
757
- "output_type": "stream",
758
- "text": [
759
- "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.29it/s]"
760
- ]
761
- },
762
- {
763
- "name": "stdout",
764
- "output_type": "stream",
765
- "text": [
766
- "Saved V2 model to: /mnt/linux-data/Work/aiapi/notebook/ai_vs_human/v2_model\n"
767
- ]
768
- },
769
- {
770
- "name": "stderr",
771
- "output_type": "stream",
772
- "text": [
773
- "\n"
774
- ]
775
- }
776
- ],
777
- "source": [
778
- "save_dir = Path(cfg.output_dir)\n",
779
- "save_dir.mkdir(parents=True, exist_ok=True)\n",
780
- "trainer.save_model(str(save_dir))\n",
781
- "tokenizer.save_pretrained(str(save_dir))\n",
782
- "\n",
783
- "label_map = {\"0\": \"human\", \"1\": \"ai\"}\n",
784
- "(save_dir / \"label_map.json\").write_text(json.dumps(label_map, indent=2), encoding=\"utf-8\")\n",
785
- "\n",
786
- "print(f\"Saved V2 model to: {save_dir.resolve()}\")"
787
- ]
788
- },
789
- {
790
- "cell_type": "code",
791
- "execution_count": 11,
792
- "id": "93f0e5a0",
793
- "metadata": {},
794
- "outputs": [
795
- {
796
- "name": "stdout",
797
- "output_type": "stream",
798
- "text": [
799
- "================================================================================\n",
800
- "COMPREHENSIVE TEST: All Sentence Types\n",
801
- "================================================================================\n",
802
- "\n",
803
- "1. VERY SHORT SENTENCES (< 10 words):\n",
804
- " [2 words] human: Hello world.\n",
805
- " [3 words] human: AI is powerful.\n",
806
- " [3 words] human: I like coding.\n",
807
- " [4 words] human: Machine learning works well.\n",
808
- "\n",
809
- "2. SHORT SENTENCES (10-50 words):\n",
810
- " [10 words] human: AI writes fast, but humans add personal experience and emoti...\n",
811
- " [14 words] human: I woke up late, missed the bus, and ran all the way to class...\n",
812
- " [11 words] human: This response was generated by a language model in one pass....\n",
813
- " [17 words] human: The field of data science combines statistics, programming, ...\n",
814
- "\n",
815
- "3. MEDIUM SENTENCES (50-150 words):\n",
816
- " [74 words] human: Artificial intelligence systems can process massive amounts ...\n",
817
- " [87 words] human: I once tried to learn guitar in a single weekend because I t...\n",
818
- "\n",
819
- "4. LONG SENTENCES (150+ words):\n",
820
- " [153 words] ai: Machine learning represents a subset of artificial intellige...\n",
821
- "\n",
822
- "5. EDGE CASES:\n",
823
- " [1 words] human: 'A'\n",
824
- " [4 words] human: 'This is a test.'\n",
825
- " [4 words] human: 'Multiple spaces between words'\n",
826
- "\n",
827
- "================================================================================\n",
828
- "✓ All sentence types tested successfully!\n",
829
- "================================================================================\n"
830
- ]
831
- }
832
- ],
833
- "source": [
834
- "def predict_v2(text: str) -> dict[str, float | int | str]:\n",
835
- " \"\"\"Predict whether text is AI or human-written. Works for all sentence lengths.\"\"\"\n",
836
- " cleaned = normalize_text(text)\n",
837
- " if not cleaned:\n",
838
- " raise ValueError(\"Input text is empty.\")\n",
839
- "\n",
840
- " inputs = tokenizer(\n",
841
- " cleaned,\n",
842
- " truncation=True,\n",
843
- " max_length=cfg.max_length,\n",
844
- " return_tensors=\"pt\",\n",
845
- " ).to(model.device)\n",
846
- "\n",
847
- " model.eval()\n",
848
- " with torch.no_grad():\n",
849
- " logits = model(**inputs).logits\n",
850
- " probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]\n",
851
- "\n",
852
- " pred = int(np.argmax(probs))\n",
853
- " wc = count_words(cleaned)\n",
854
- "\n",
855
- " return {\n",
856
- " \"text\": cleaned,\n",
857
- " \"word_count\": wc,\n",
858
- " \"predicted_label\": pred,\n",
859
- " \"predicted_name\": \"ai\" if pred == 1 else \"human\",\n",
860
- " \"probability_human\": float(probs[0]),\n",
861
- " \"probability_ai\": float(probs[1]),\n",
862
- " \"short_text\": wc < 50,\n",
863
- " }\n",
864
- "\n",
865
- "\n",
866
- "print(\"=\" * 80)\n",
867
- "print(\"COMPREHENSIVE TEST: All Sentence Types\")\n",
868
- "print(\"=\" * 80)\n",
869
- "\n",
870
- "# Test 1: Very short sentences (under 10 words)\n",
871
- "print(\"\\n1. VERY SHORT SENTENCES (< 10 words):\")\n",
872
- "very_short = [\n",
873
- " \"Hello world.\",\n",
874
- " \"AI is powerful.\",\n",
875
- " \"I like coding.\",\n",
876
- " \"Machine learning works well.\",\n",
877
- "]\n",
878
- "for text in very_short:\n",
879
- " result = predict_v2(text)\n",
880
- " print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}\")\n",
881
- "\n",
882
- "# Test 2: Short sentences (10-50 words)\n",
883
- "print(\"\\n2. SHORT SENTENCES (10-50 words):\")\n",
884
- "short_examples = [\n",
885
- " \"AI writes fast, but humans add personal experience and emotion.\",\n",
886
- " \"I woke up late, missed the bus, and ran all the way to class.\",\n",
887
- " \"This response was generated by a language model in one pass.\",\n",
888
- " \"The field of data science combines statistics, programming, and domain knowledge to extract meaningful insights from data.\",\n",
889
- "]\n",
890
- "for text in short_examples:\n",
891
- " result = predict_v2(text)\n",
892
- " print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}...\")\n",
893
- "\n",
894
- "# Test 3: Medium sentences (50-150 words)\n",
895
- "print(\"\\n3. MEDIUM SENTENCES (50-150 words):\")\n",
896
- "medium_examples = [\n",
897
- " \"Artificial intelligence systems can process massive amounts of data extremely quickly compared to humans. They are designed to analyze large datasets, identify patterns, and extract useful insights within seconds or minutes. Using advanced algorithms and machine learning models, AI systems can examine structured and unstructured data such as text, images, audio, and numerical information. By learning from historical data, these systems can recognize complex relationships between variables and make accurate predictions about future outcomes.\",\n",
898
- " \"I once tried to learn guitar in a single weekend because I thought it would be easy. Turns out my fingers had other plans. After two hours of awkward chords and random noises, I realized that music requires patience, practice, and a lot more discipline than I originally expected. My friends laughed when they heard me trying to play, but I kept practicing anyway because I genuinely wanted to improve. Eventually, after weeks of consistent effort, I could finally play a simple song from start to finish.\",\n",
899
- "]\n",
900
- "for text in medium_examples:\n",
901
- " result = predict_v2(text)\n",
902
- " print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}...\")\n",
903
- "\n",
904
- "# Test 4: Long sentences (150+ words)\n",
905
- "print(\"\\n4. LONG SENTENCES (150+ words):\")\n",
906
- "long_examples = [\n",
907
- " \"Machine learning represents a subset of artificial intelligence that enables computer systems to automatically learn and improve from experience without being explicitly programmed for every single task. The fundamental idea behind machine learning is to develop algorithms that can receive input data and use statistical analysis to predict an output while updating outputs as new data becomes available. This field has grown exponentially over the past few decades, driven by increases in computational power, the availability of large datasets, and breakthroughs in algorithmic approaches. Modern machine learning systems power everything from recommendation engines on streaming platforms to autonomous vehicles, medical diagnosis tools, and natural language processing applications. The three main categories of machine learning include supervised learning, where models are trained on labeled data; unsupervised learning, where patterns are discovered in unlabeled data; and reinforcement learning, where agents learn to make decisions by receiving rewards or penalties for their actions in an environment.\",\n",
908
- "]\n",
909
- "for text in long_examples:\n",
910
- " result = predict_v2(text)\n",
911
- " print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}...\")\n",
912
- "\n",
913
- "# Test 5: Edge cases\n",
914
- "print(\"\\n5. EDGE CASES:\")\n",
915
- "edge_cases = [\n",
916
- " \"A\", # Single word\n",
917
- " \"This is a test.\", # Very basic\n",
918
- " \" Multiple spaces between words \", # Extra whitespace\n",
919
- "]\n",
920
- "for text in edge_cases:\n",
921
- " try:\n",
922
- " result = predict_v2(text)\n",
923
- " print(f\" [{result['word_count']} words] {result['predicted_name']}: '{text.strip()}'\")\n",
924
- " except Exception as e:\n",
925
- " print(f\" ERROR: {text.strip()[:30]} - {str(e)}\")\n",
926
- "\n",
927
- "print(\"\\n\" + \"=\" * 80)\n",
928
- "print(\"✓ All sentence types tested successfully!\")\n",
929
- "print(\"=\" * 80)"
930
- ]
931
- },
932
- {
933
- "cell_type": "code",
934
- "execution_count": 12,
935
- "id": "98ef7c7d",
936
- "metadata": {},
937
- "outputs": [
938
- {
939
- "name": "stdout",
940
- "output_type": "stream",
941
- "text": [
942
- "================================================================================\n",
943
- "TESTING SAVED V2 MODEL FROM DISK\n",
944
- "================================================================================\n"
945
- ]
946
- },
947
- {
948
- "name": "stderr",
949
- "output_type": "stream",
950
- "text": [
951
- "Loading weights: 100%|██████████| 105/105 [00:00<00:00, 8556.64it/s]"
952
- ]
953
- },
954
- {
955
- "name": "stdout",
956
- "output_type": "stream",
957
- "text": [
958
- "\n",
959
- "✓ Loaded model from: v2_model\n",
960
- "\n",
961
- "Running inference tests:\n",
962
- " [very short ] human (AI: 0.50%): Hi there!\n",
963
- " [short ] human (AI: 0.09%): I love programming and building cool projects.\n",
964
- " [medium ] human (AI: 3.09%): Artificial intelligence has revolutionized many in\n",
965
- "\n",
966
- "✓ Saved model works correctly for all sentence types!\n"
967
- ]
968
- },
969
- {
970
- "name": "stderr",
971
- "output_type": "stream",
972
- "text": [
973
- "\n"
974
- ]
975
- }
976
- ],
977
- "source": [
978
- "# Load and test the saved v2_model independently\n",
979
- "print(\"=\" * 80)\n",
980
- "print(\"TESTING SAVED V2 MODEL FROM DISK\")\n",
981
- "print(\"=\" * 80)\n",
982
- "\n",
983
- "saved_model_path = Path(cfg.output_dir)\n",
984
- "if saved_model_path.exists():\n",
985
- " # Load fresh model and tokenizer from saved checkpoint\n",
986
- " saved_tokenizer = AutoTokenizer.from_pretrained(str(saved_model_path))\n",
987
- " saved_model = AutoModelForSequenceClassification.from_pretrained(str(saved_model_path)).to(DEVICE)\n",
988
- " \n",
989
- " print(f\"\\n✓ Loaded model from: {saved_model_path}\")\n",
990
- " \n",
991
- " # Test with diverse examples\n",
992
- " test_cases = [\n",
993
- " (\"Hi there!\", \"very short\"),\n",
994
- " (\"I love programming and building cool projects.\", \"short\"),\n",
995
- " (\"Artificial intelligence has revolutionized many industries by enabling automation, improving decision-making, and creating new opportunities for innovation.\", \"medium\"),\n",
996
- " ]\n",
997
- " \n",
998
- " print(\"\\nRunning inference tests:\")\n",
999
- " for text, category in test_cases:\n",
1000
- " inputs = saved_tokenizer(text, truncation=True, max_length=256, return_tensors=\"pt\").to(DEVICE)\n",
1001
- " saved_model.eval()\n",
1002
- " with torch.no_grad():\n",
1003
- " logits = saved_model(**inputs).logits\n",
1004
- " probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]\n",
1005
- " pred_label = int(np.argmax(probs))\n",
1006
- " pred_name = \"ai\" if pred_label == 1 else \"human\"\n",
1007
- " \n",
1008
- " print(f\" [{category:12}] {pred_name:6} (AI: {probs[1]:.2%}): {text[:50]}\")\n",
1009
- " \n",
1010
- " print(\"\\n✓ Saved model works correctly for all sentence types!\")\n",
1011
- "else:\n",
1012
- " print(f\"⚠ Model not found at: {saved_model_path}\")\n",
1013
- " print(\" Run the save cell first to create v2_model/\")"
1014
- ]
1015
- },
1016
- {
1017
- "cell_type": "code",
1018
- "execution_count": 13,
1019
- "id": "2f63e591",
1020
- "metadata": {},
1021
- "outputs": [
1022
- {
1023
- "name": "stdout",
1024
- "output_type": "stream",
1025
- "text": [
1026
- "================================================================================\n",
1027
- "EXTREME EDGE CASE TESTING\n",
1028
- "================================================================================\n",
1029
- "\n",
1030
- "Testing extreme edge cases:\n",
1031
- " ✓ Single character [ 1w] human (99.3%): 'A'\n",
1032
- " ✓ Single word [ 1w] human (99.4%): 'Hello'\n",
1033
- " ✓ Two words [ 2w] human (99.6%): 'Hello world'\n",
1034
- " ✓ Numbers only [ 3w] human (98.7%): '123 456 789'\n",
1035
- " ✓ Special chars [ 4w] human (99.8%): '!!! ### $$$ ???'\n",
1036
- " ✓ Mixed alphanumeric [ 3w] human (99.3%): 'Test123 ABC456 xyz789'\n",
1037
- " ✓ Very long word [ 1w] human (99.1%): 'supercalifragilisticexpialidocious'\n",
1038
- " ✓ Repeated words [ 5w] human (99.6%): 'test test test test test'\n",
1039
- " ✓ Newlines [ 6w] human (99.4%): 'Line one\\nLine two\\nLine three'\n",
1040
- " ✓ Tabs [ 3w] human (99.5%): 'Col1\\tCol2\\tCol3'\n",
1041
- " ✓ Multiple spaces [ 3w] human (99.7%): 'Too many spaces'\n",
1042
- " ✓ Punctuation heavy [ 5w] human (99.8%): 'Wow! Really? Yes! No... Maybe?'\n",
1043
- " ✗ Empty-like ERROR: Input text is empty.\n",
1044
- " ✓ Mixed case [ 5w] human (99.3%): 'ThIs Is MiXeD cAsE tExT'\n",
1045
- " ✓ All caps [ 4w] human (99.3%): 'THIS IS ALL CAPITALS'\n",
1046
- " ✓ All lower [ 4w] human (99.9%): 'this is all lowercase'\n",
1047
- "\n",
1048
- "Result: 15 passed, 1 failed\n",
1049
- "\n",
1050
- "================================================================================\n",
1051
- "BATCH PREDICTION TEST\n",
1052
- "================================================================================\n",
1053
- "\n",
1054
- "Predicting batch of mixed-length sentences:\n",
1055
- "\n",
1056
- " Sentence 1 (1 words):\n",
1057
- " Text: Short....\n",
1058
- " Prediction: human\n",
1059
- " Confidence: AI=0.1%, Human=99.9%\n",
1060
- "\n",
1061
- " Sentence 2 (9 words):\n",
1062
- " Text: This is a medium length sentence with some content....\n",
1063
- " Prediction: human\n",
1064
- " Confidence: AI=0.1%, Human=99.9%\n",
1065
- "\n",
1066
- " Sentence 3 (29 words):\n",
1067
- " Text: This is a longer sentence that contains more words and provi...\n",
1068
- " Prediction: human\n",
1069
- " Confidence: AI=0.1%, Human=99.9%\n",
1070
- "\n",
1071
- "================================================================================\n",
1072
- "✓ ALL EDGE CASES AND BATCH TESTS COMPLETE!\n",
1073
- "================================================================================\n"
1074
- ]
1075
- }
1076
- ],
1077
- "source": [
1078
- "print(\"=\" * 80)\n",
1079
- "print(\"EXTREME EDGE CASE TESTING\")\n",
1080
- "print(\"=\" * 80)\n",
1081
- "\n",
1082
- "# Test various edge cases that might break the model\n",
1083
- "edge_test_cases = {\n",
1084
- " \"Single character\": \"A\",\n",
1085
- " \"Single word\": \"Hello\",\n",
1086
- " \"Two words\": \"Hello world\",\n",
1087
- " \"Numbers only\": \"123 456 789\",\n",
1088
- " \"Special chars\": \"!!! ### $$$ ???\",\n",
1089
- " \"Mixed alphanumeric\": \"Test123 ABC456 xyz789\",\n",
1090
- " \"Very long word\": \"supercalifragilisticexpialidocious\",\n",
1091
- " \"Repeated words\": \"test test test test test\",\n",
1092
- " \"Newlines\": \"Line one\\nLine two\\nLine three\",\n",
1093
- " \"Tabs\": \"Col1\\tCol2\\tCol3\",\n",
1094
- " \"Multiple spaces\": \"Too many spaces\",\n",
1095
- " \"Punctuation heavy\": \"Wow! Really? Yes! No... Maybe?\",\n",
1096
- " \"Empty-like\": \" \",\n",
1097
- " \"Mixed case\": \"ThIs Is MiXeD cAsE tExT\",\n",
1098
- " \"All caps\": \"THIS IS ALL CAPITALS\",\n",
1099
- " \"All lower\": \"this is all lowercase\",\n",
1100
- "}\n",
1101
- "\n",
1102
- "print(\"\\nTesting extreme edge cases:\")\n",
1103
- "passed = 0\n",
1104
- "failed = 0\n",
1105
- "\n",
1106
- "for case_name, text in edge_test_cases.items():\n",
1107
- " try:\n",
1108
- " result = predict_v2(text)\n",
1109
- " wc = result['word_count']\n",
1110
- " pred = result['predicted_name']\n",
1111
- " conf = result['probability_ai'] if pred == 'ai' else result['probability_human']\n",
1112
- " \n",
1113
- " # Handle display of text with special characters\n",
1114
- " display_text = text.replace('\\n', '\\\\n').replace('\\t', '\\\\t')[:40]\n",
1115
- " print(f\" ✓ {case_name:20} [{wc:2}w] {pred:6} ({conf:.1%}): '{display_text}'\")\n",
1116
- " passed += 1\n",
1117
- " except Exception as e:\n",
1118
- " print(f\" ✗ {case_name:20} ERROR: {str(e)[:50]}\")\n",
1119
- " failed += 1\n",
1120
- "\n",
1121
- "print(f\"\\nResult: {passed} passed, {failed} failed\")\n",
1122
- "\n",
1123
- "# Batch prediction test\n",
1124
- "print(\"\\n\" + \"=\" * 80)\n",
1125
- "print(\"BATCH PREDICTION TEST\")\n",
1126
- "print(\"=\" * 80)\n",
1127
- "\n",
1128
- "batch_texts = [\n",
1129
- " \"Short.\",\n",
1130
- " \"This is a medium length sentence with some content.\",\n",
1131
- " \"This is a longer sentence that contains more words and provides more context for the model to analyze and make predictions based on the patterns it learned during training.\",\n",
1132
- "]\n",
1133
- "\n",
1134
- "print(\"\\nPredicting batch of mixed-length sentences:\")\n",
1135
- "batch_results = [predict_v2(text) for text in batch_texts]\n",
1136
- "\n",
1137
- "for i, (text, result) in enumerate(zip(batch_texts, batch_results), 1):\n",
1138
- " print(f\"\\n Sentence {i} ({result['word_count']} words):\")\n",
1139
- " print(f\" Text: {text[:60]}...\")\n",
1140
- " print(f\" Prediction: {result['predicted_name']}\")\n",
1141
- " print(f\" Confidence: AI={result['probability_ai']:.1%}, Human={result['probability_human']:.1%}\")\n",
1142
- "\n",
1143
- "print(\"\\n\" + \"=\" * 80)\n",
1144
- "print(\"✓ ALL EDGE CASES AND BATCH TESTS COMPLETE!\")\n",
1145
- "print(\"=\" * 80)"
1146
- ]
1147
- }
1148
- ],
1149
- "metadata": {
1150
- "kernelspec": {
1151
- "display_name": "ml",
1152
- "language": "python",
1153
- "name": "python3"
1154
- },
1155
- "language_info": {
1156
- "codemirror_mode": {
1157
- "name": "ipython",
1158
- "version": 3
1159
- },
1160
- "file_extension": ".py",
1161
- "mimetype": "text/x-python",
1162
- "name": "python",
1163
- "nbconvert_exporter": "python",
1164
- "pygments_lexer": "ipython3",
1165
- "version": "3.11.14"
1166
- }
1167
- },
1168
- "nbformat": 4,
1169
- "nbformat_minor": 5
1170
- }
notebook/ai_vs_human/mainv3.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
notebook/ai_vs_human_nepali/notebook/Nepali_Ai_vs_Human.ipynb DELETED
@@ -1,1429 +0,0 @@
1
- {
2
- "cells": [
3
- {
4
- "cell_type": "code",
5
- "execution_count": 1,
6
- "id": "901fc22d",
7
- "metadata": {
8
- "id": "901fc22d"
9
- },
10
- "outputs": [
11
- {
12
- "name": "stderr",
13
- "output_type": "stream",
14
- "text": [
15
- "/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
16
- " from .autonotebook import tqdm as notebook_tqdm\n"
17
- ]
18
- }
19
- ],
20
- "source": [
21
- "import os\n",
22
- "os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'\n",
23
- "\n",
24
- "import math\n",
25
- "import pandas as pd\n",
26
- "import torch\n",
27
- "from torch.utils.data import Dataset, DataLoader\n",
28
- "from transformers import AutoTokenizer, AutoModel, get_linear_schedule_with_warmup\n",
29
- "from sklearn.model_selection import train_test_split\n",
30
- "from sklearn.metrics import classification_report, f1_score, accuracy_score\n",
31
- "import torch.nn as nn\n",
32
- "from torch.optim import AdamW"
33
- ]
34
- },
35
- {
36
- "cell_type": "code",
37
- "execution_count": 2,
38
- "id": "70d3c048",
39
- "metadata": {},
40
- "outputs": [
41
- {
42
- "name": "stdout",
43
- "output_type": "stream",
44
- "text": [
45
- "Columns: ['human_text', 'ai_generated_text']\n",
46
- "Prepared dataset shape: (1986, 2)\n",
47
- "label\n",
48
- "1 996\n",
49
- "0 990\n",
50
- "Name: count, dtype: int64\n"
51
- ]
52
- },
53
- {
54
- "data": {
55
- "text/html": [
56
- "<div>\n",
57
- "<style scoped>\n",
58
- " .dataframe tbody tr th:only-of-type {\n",
59
- " vertical-align: middle;\n",
60
- " }\n",
61
- "\n",
62
- " .dataframe tbody tr th {\n",
63
- " vertical-align: top;\n",
64
- " }\n",
65
- "\n",
66
- " .dataframe thead th {\n",
67
- " text-align: right;\n",
68
- " }\n",
69
- "</style>\n",
70
- "<table border=\"1\" class=\"dataframe\">\n",
71
- " <thead>\n",
72
- " <tr style=\"text-align: right;\">\n",
73
- " <th></th>\n",
74
- " <th>text</th>\n",
75
- " <th>label</th>\n",
76
- " </tr>\n",
77
- " </thead>\n",
78
- " <tbody>\n",
79
- " <tr>\n",
80
- " <th>0</th>\n",
81
- " <td>हामीले पार्टी एकतापछि कि दुबै पार्टीको सिद्धान...</td>\n",
82
- " <td>0</td>\n",
83
- " </tr>\n",
84
- " <tr>\n",
85
- " <th>1</th>\n",
86
- " <td>एमाले प्रतिनिधिसभाको प्रत्यक्षतर्फ ८० समानुपात...</td>\n",
87
- " <td>0</td>\n",
88
- " </tr>\n",
89
- " <tr>\n",
90
- " <th>2</th>\n",
91
- " <td>नेकपा माओवादी केन्द्रका नेता रामनारायण विडारील...</td>\n",
92
- " <td>1</td>\n",
93
- " </tr>\n",
94
- " <tr>\n",
95
- " <th>3</th>\n",
96
- " <td>प्रदेश नं २ का मुख्यमन्त्रीको रूपमा संघीय समाज...</td>\n",
97
- " <td>1</td>\n",
98
- " </tr>\n",
99
- " <tr>\n",
100
- " <th>4</th>\n",
101
- " <td>बिहीबार एमालेका अध्यक्ष केपी शर्मा ओली र माओवा...</td>\n",
102
- " <td>0</td>\n",
103
- " </tr>\n",
104
- " </tbody>\n",
105
- "</table>\n",
106
- "</div>"
107
- ],
108
- "text/plain": [
109
- " text label\n",
110
- "0 हामीले पार्टी एकतापछि कि दुबै पार्टीको सिद्धान... 0\n",
111
- "1 एमाले प्रतिनिधिसभाको प्रत्यक्षतर्फ ८० समानुपात... 0\n",
112
- "2 नेकपा माओवादी केन्द्रका नेता रामनारायण विडारील... 1\n",
113
- "3 प्रदेश नं २ का मुख्यमन्त्रीको रूपमा संघीय समाज... 1\n",
114
- "4 बिहीबार एमालेका अध्यक्ष केपी शर्मा ओली र माओवा... 0"
115
- ]
116
- },
117
- "execution_count": 2,
118
- "metadata": {},
119
- "output_type": "execute_result"
120
- }
121
- ],
122
- "source": [
123
- "# Load Dataset and convert to binary classification format\n",
124
- "DATA_PATH = '../DATASET/new_data.csv'\n",
125
- "raw_df = pd.read_csv(DATA_PATH)\n",
126
- "print('Columns:', raw_df.columns.tolist())\n",
127
- "\n",
128
- "required_cols = ['human_text', 'ai_generated_text']\n",
129
- "missing = [c for c in required_cols if c not in raw_df.columns]\n",
130
- "if missing:\n",
131
- " raise ValueError(f'Missing required columns: {missing}')\n",
132
- "\n",
133
- "# Build unified training dataframe: text + label (0=Human, 1=AI)\n",
134
- "df_human = raw_df[['human_text']].dropna().rename(columns={'human_text': 'text'})\n",
135
- "df_human['label'] = 0\n",
136
- "\n",
137
- "df_ai = raw_df[['ai_generated_text']].dropna().rename(columns={'ai_generated_text': 'text'})\n",
138
- "df_ai['label'] = 1\n",
139
- "\n",
140
- "df = pd.concat([df_human, df_ai], ignore_index=True)\n",
141
- "df['text'] = df['text'].astype(str).str.strip()\n",
142
- "df = df[df['text'].str.len() > 10].drop_duplicates(subset=['text']).sample(frac=1, random_state=42).reset_index(drop=True)\n",
143
- "\n",
144
- "print('Prepared dataset shape:', df.shape)\n",
145
- "print(df['label'].value_counts())\n",
146
- "df.head()"
147
- ]
148
- },
149
- {
150
- "cell_type": "code",
151
- "execution_count": 3,
152
- "id": "f93d4c7a",
153
- "metadata": {
154
- "id": "f93d4c7a"
155
- },
156
- "outputs": [
157
- {
158
- "name": "stdout",
159
- "output_type": "stream",
160
- "text": [
161
- "Nulls in text: 0\n",
162
- "Nulls in label: 0\n",
163
- "Example text sample:\n",
164
- "हामीले पार्टी एकतापछि कि दुबै पार्टीको सिद्धान्त राख्ने कि राख्ने माओवाद र जबज दुबै नराख्ने भन्दा उहाँहरु मान्नु भएन । एमालेका साथीहरुले जवजको विषय उठाउन चाहनुभएन । सिद्धान्तको विषय नै नमिलेपछि पार्टी एकता संयोजन समितिको बैठक रोकियो कार्यदलका एक सदस्\n"
165
- ]
166
- }
167
- ],
168
- "source": [
169
- "# Quick sanity checks\n",
170
- "print('Nulls in text:', int(df['text'].isnull().sum()))\n",
171
- "print('Nulls in label:', int(df['label'].isnull().sum()))\n",
172
- "print('Example text sample:')\n",
173
- "print(df.loc[0, 'text'][:250])"
174
- ]
175
- },
176
- {
177
- "cell_type": "code",
178
- "execution_count": 4,
179
- "id": "ba4a933f",
180
- "metadata": {
181
- "colab": {
182
- "base_uri": "https://localhost:8080/",
183
- "height": 206
184
- },
185
- "id": "ba4a933f",
186
- "outputId": "9bf5f0a5-c547-43f1-b8f2-a580024d74a9"
187
- },
188
- "outputs": [
189
- {
190
- "name": "stdout",
191
- "output_type": "stream",
192
- "text": [
193
- "label\n",
194
- "AI 0.501511\n",
195
- "Human 0.498489\n",
196
- "Name: proportion, dtype: float64\n"
197
- ]
198
- },
199
- {
200
- "data": {
201
- "text/plain": [
202
- "label \n",
203
- "0 count 990.000000\n",
204
- " mean 455.551515\n",
205
- " std 56.825837\n",
206
- " min 299.000000\n",
207
- " 25% 418.000000\n",
208
- " 50% 458.000000\n",
209
- " 75% 494.000000\n",
210
- " max 629.000000\n",
211
- "1 count 996.000000\n",
212
- " mean 284.231928\n",
213
- " std 67.165254\n",
214
- " min 103.000000\n",
215
- " 25% 238.000000\n",
216
- " 50% 282.000000\n",
217
- " 75% 331.000000\n",
218
- " max 433.000000\n",
219
- "Name: text, dtype: float64"
220
- ]
221
- },
222
- "execution_count": 4,
223
- "metadata": {},
224
- "output_type": "execute_result"
225
- }
226
- ],
227
- "source": [
228
- "# Class balance\n",
229
- "print(df['label'].value_counts(normalize=True).rename({0: 'Human', 1: 'AI'}))\n",
230
- "df.groupby('label')['text'].apply(lambda s: s.str.len().describe())"
231
- ]
232
- },
233
- {
234
- "cell_type": "code",
235
- "execution_count": 5,
236
- "id": "d7b48175",
237
- "metadata": {
238
- "colab": {
239
- "base_uri": "https://localhost:8080/",
240
- "height": 206
241
- },
242
- "id": "d7b48175",
243
- "outputId": "08bc4562-874c-40c1-d554-1d809a6d0e31"
244
- },
245
- "outputs": [
246
- {
247
- "data": {
248
- "text/plain": [
249
- "<matplotlib.legend.Legend at 0x7fef748b5290>"
250
- ]
251
- },
252
- "execution_count": 5,
253
- "metadata": {},
254
- "output_type": "execute_result"
255
- },
256
- {
257
- "data": {
258
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAvwAAAGHCAYAAADMVYYQAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjgsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvwVt1zgAAAAlwSFlzAAAPYQAAD2EBqD+naQAARoNJREFUeJzt3X1cVGX+//H3gMMIAt7HTaJiIeVduVomVmgFu5aurtXWWq1Wa5bdaO5+7WtWDrsFZuVSa9nPttS2yG03c7Wvd6SJlbmp5epqmbspmkpkoaAoDMz1+6OYHAEdYGCY4+v5eMxDz3Wuuc7nzGdGPlxec47NGGMEAAAAwJJCAh0AAAAAgMZDwQ8AAABYGAU/AAAAYGEU/AAAAICFUfADAAAAFkbBDwAAAFgYBT8AAABgYRT8AAAAgIVR8AMAAAAWRsEPIOjZbDafHmvXrvXL8Q4cOCCn06ktW7b41H/t2rWy2Wz6+9//7pfj+1tpaamcTmeNr4/T6ZTNZtOhQ4fqNfbYsWO9ctCqVSt17dpVP//5zzVv3jyVlZVVe87gwYM1ePDgOh1nx44dcjqd2rNnT52ed+qx9uzZI5vNpqeffrpO45xJZmamFi9eXK296r3hr/cmANSkRaADAICG+uijj7y2//CHP+i9997TmjVrvNp79Ojhl+MdOHBAGRkZ6tq1qy6++GK/jBlIpaWlysjIkKQ6F9q+CA8P9+Ti+PHj2rdvn5YvX65x48bpmWee0YoVK9SpUydP/xdeeKHOx9ixY4cyMjI0ePBgde3a1efn1edY9ZGZmakbbrhBI0eO9Gr/yU9+oo8++shv700AqAkFP4Cgd9lll3ltd+zYUSEhIdXaERg15eLXv/61br/9dg0bNkw33HCDNmzY4NnXFMVvaWmpIiIiAl5oR0dH8z4F0OhY0gPgrFBeXq7HH39cF1xwgRwOhzp27Kjbb79d33zzjafPjBkzFBISoqVLl3o9d+zYsYqIiNC2bdu0du1aXXLJJZKk22+/3bNUxel0NjjGgoICjR8/Xp06dVJYWJgSExOVkZGhiooKT5+Tl5zMmjVLiYmJioyM1MCBA72K5iovvfSSunfvLofDoR49eignJ0djx471zILv2bNHHTt2lCRlZGR4zmfs2LFe43z99df61a9+pdatWysmJkZ33HGHjhw50qDzTU9P17hx4/TPf/5T69at87TXtKRnzpw5uuiiixQZGamoqChdcMEFevjhhyVJ8+fP14033ihJGjJkiOcc5s+f7xmvV69eWrdunVJSUhQREaE77rij1mNJktvt1hNPPKHOnTurZcuW6t+/v1avXu3V5+TX8WRVy6Cq2Gw2HTt2TAsWLPDEVnXM2pb0LFmyRAMHDlRERISioqKUlpZW7X+yqo6zfft2v+cGgLVQ8AOwPLfbrREjRmjGjBkaPXq0/u///k8zZsxQbm6uBg8erOPHj0uSHnroIQ0dOlRjxoxRfn6+JGnevHlasGCB/vSnP6l37976yU9+onnz5kmSHnnkEX300Uf66KOP9Jvf/KZBMRYUFOjSSy/VypUr9dhjj2n58uW68847lZWVpXHjxlXr//zzzys3N1fZ2dl6/fXXdezYMV177bVehd7cuXN11113qU+fPlq0aJEeeeQRZWRkeBWXcXFxWrFihSTpzjvv9JzPo48+6nW866+/Xt27d9dbb72l//3f/1VOTo4efPDBBp2zJP385z+XJK+C/1QLFy7UhAkTlJqaqrfffluLFy/Wgw8+qGPHjkmSrrvuOmVmZnpel6pzuO666zxjHDx4ULfeeqtGjx6tZcuWacKECaeNa/bs2VqxYoWys7P12muvKSQkREOHDq1WdPvio48+Unh4uK699lpPbKdbSpSTk6MRI0YoOjpab7zxhl5++WUVFRVp8ODB+uCDD6r1b6zcALAQAwAWM2bMGNOqVSvP9htvvGEkmbfeesur38aNG40k88ILL3jaDh06
ZDp16mQuvfRS88knn5iIiAhz66231vi8efPm+RTPe++9ZySZv/3tb7X2GT9+vImMjDT5+fle7U8//bSRZLZv326MMWb37t1Gkundu7epqKjw9Pv444+NJPPGG28YY4yprKw0sbGxZsCAAV7j5efnG7vdbrp06eJp++abb4wkM3369GpxTZ8+3UgyM2fO9GqfMGGCadmypXG73ac991NzcarPPvvMSDL33HOPpy01NdWkpqZ6tu+77z7Tpk2b0x7nb3/7m5Fk3nvvvWr7UlNTjSSzevXqGvedfKyq1zc+Pt4cP37c015cXGzatWtnrrnmGq9zO/l1rFL1mp2sVatWZsyYMdX6Vr03quKurKw08fHxpnfv3qaystLTr6SkxJxzzjkmJSWl2nHqmxsAZw9m+AFY3jvvvKM2bdpo+PDhqqio8DwuvvhixcbGes14t2/fXn/961/1ySefKCUlRZ07d9aLL77YJDEOGTJE8fHxXjEOHTpUkpSXl+fV/7rrrlNoaKhnu0+fPpLk+Z+JnTt3qqCgQL/85S+9nte5c2cNGjSozvFVzcSffLwTJ06osLCwzmOdzBhzxj6XXnqpDh8+rF/96lf6xz/+Ua8rBrVt21ZXXXWVz/1HjRqlli1berajoqI0fPhwrVu3TpWVlXU+vq927typAwcO6LbbblNIyI8/oiMjI3X99ddrw4YNKi0t9XpOY+UGgHVQ8AOwvK+//lqHDx9WWFiY7Ha716OgoKBaATlgwAD17NlTJ06c0D333KNWrVo1SYxLly6tFl/Pnj0lqVqM7du399p2OByS5Fme9O2330qSYmJiqh2rprYzOdPx6qvqF5T4+Pha+9x222165ZVXlJ+fr+uvv17nnHOOBgwYoNzcXJ+PExcXV6e4YmNja2wrLy/X0aNH6zRWXVTlraZ44+Pj5Xa7VVRU5NXeWLkBYB1cpQeA5XXo0EHt27f3rFU/VVRUlNf29OnTtW3bNvXr10+PPfaYhg0bpm7dujV6jH369NETTzxR4/7TFcQ1qSoCv/7662r7CgoK6h5gI1myZImkM18O9Pbbb9ftt9+uY8eOad26dZo+fbqGDRumL774Ql26dDnjcU7+Eq0vanqNCgoKFBYWpsjISElSy5Yta7yPQH3vWSD9mLeDBw9W23fgwAGFhISobdu29R4fwNmJGX4Aljds2DB9++23qqysVP/+/as9kpOTPX1zc3OVlZWlRx55RLm5uWrdurVuuukmlZeXe/o0xgzqsGHD9O9//1vnnXdejTHWteBPTk5WbGys3nzzTa/2vXv3av369V5tgZoRzs3N1Z///GelpKTo8ssv9+k5rVq10tChQzVt2jSVl5dr+/btkvx/DosWLdKJEyc82yUlJVq6dKmuuOIKz1Kqrl27qrCw0OuXqvLycq1cubLaeA6Hw6fYkpOTde655yonJ8drudOxY8f01ltvea7cAwB1wQw/AMu7+eab9frrr+vaa6/VxIkTdemll8put+urr77Se++9pxEjRugXv/iF50ouqampmj59ukJCQvTXv/5VV155paZMmaLs7GxJ0nnnnafw8HC9/vrruvDCCxUZGan4+PgzFuU1XTZTklJTU/X73/9eubm5SklJ0QMPPKDk5GSdOHFCe/bs0bJly/Tiiy963ZzqTEJCQpSRkaHx48frhhtu0B133KHDhw8rIyNDcXFxXuvDo6Ki1KVLF/3jH//Q1VdfrXbt2qlDhw51uoHV6bjdbs+5l5WVae/evVq+fLnefPNNXXjhhdV+KTnVuHHjFB4erkGDBikuLk4FBQXKyspS69atPZdI7dWrl6Tvr0wUFRWlli1bKjExsdpyF1+FhoYqLS1NkydPltvt1pNPPqni4mLPDcok6aabbtJjjz2mm2++Wf/zP/+jEydO6LnnnqtxjX/v3r21du1aLV26VHFxcYqKivL6RbNKSEiIZs6cqVtuuUXDhg3T+PHjVVZWpqeeekqHDx/WjBkz6nU+AM5y
gf7WMAD4W01XhnG5XObpp582F110kWnZsqWJjIw0F1xwgRk/frzZtWuXqaioMKmpqSYmJsYcPHjQ67lPPfWUkWTefvttT9sbb7xhLrjgAmO322u9wk2Vqiux1PaoukLLN998Yx544AGTmJho7Ha7adeunenXr5+ZNm2aOXr0qDHmx6vIPPXUU9WOU1Mcc+fONeeff74JCwsz3bt3N6+88ooZMWKE6du3r1e/d9991/Tt29c4HA4jyXNFmaorwXzzzTde/efNm2ckmd27d9d63sZ8n4uTzzU8PNx07tzZDB8+3LzyyiumrKys2nNOvXLOggULzJAhQ0xMTIwJCwsz8fHx5pe//KXZunWr1/Oys7NNYmKiCQ0N9bqKUmpqqunZs2eN8dV2lZ4nn3zSZGRkmE6dOpmwsDDTt29fs3LlymrPX7Zsmbn44otNeHi46datm5k9e3aNV+nZsmWLGTRokImIiDCSPMc89So9VRYvXmwGDBhgWrZsaVq1amWuvvpq8+GHH3r1aWhuAJw9bMb4cIkEAIAlHD58WN27d9fIkSM1d+7cQIcDAGgCLOkBAIsqKCjQE088oSFDhqh9+/bKz8/XH//4R5WUlGjixImBDg8A0EQo+AHAohwOh/bs2aMJEybou+++U0REhC677DK9+OKLnst9AgCsjyU9AAAAgIVxWU4AAADAwij4AQAAAAuj4AcAAAAszPJf2nW73Tpw4ICioqLqfGt1AAAAoLkyxqikpETx8fFeN1Q8leUL/gMHDighISHQYQAAAACNYt++fae9G7vlC/6oqChJ378Q0dHRjX48l8ulVatWKT09XXa7vdGPh6ZHjq2N/FofObY28mt95PhHxcXFSkhI8NS7tbF8wV+1jCc6OrrJCv6IiAhFR0ef9W9CqyLH1kZ+rY8cWxv5tT5yXN2Zlq3zpV0AAADAwij4AQAAAAuj4AcAAAAszPJr+AEAAFB3lZWVcrlcgQ6jGpfLpRYtWujEiROqrKwMdDiNKjQ0VC1atGjwpeUp+AEAAODl6NGj+uqrr2SMCXQo1RhjFBsbq3379p0V91iKiIhQXFycwsLC6j0GBT8AAAA8Kisr9dVXXykiIkIdO3ZsdkW12+3W0aNHFRkZedqbTQU7Y4zKy8v1zTffaPfu3UpKSqr3+VLwAwAAwMPlcskYo44dOyo8PDzQ4VTjdrtVXl6uli1bWrrgl6Tw8HDZ7Xbl5+d7zrk+rP0qAQAAoF6a28z+2cofv9RQ8AMAAAAWRsEPAAAAWBgFPwAAAGBhfGkXAACc0dRF23zqlzWqdyNHgkDx9T3gL3V9L40dO1aHDx/W4sWLvdrXrl2rIUOGqKioSG3atPFfgEGEGX4AAADAwij4AQAAcFZwOp26+OKLvdqys7PVtWtXz/bYsWM1cuRIZWZmKiYmRm3atFFGRoYqKir0P//zP2rXrp06deqkV155xWuchx56SN27d1dERIS6deumRx991OtOxVXH/stf/qKuXbuqdevWuvnmm1VSUtKYpyyJgh8AAADwsmbNGh04cEDr1q3TrFmz5HQ6NWzYMLVt21b//Oc/dffdd+vuu+/Wvn37PM+JiorS/PnztWPHDj377LN66aWX9Mc//tFr3P/+979avHix3nnnHb3zzjvKy8vTjBkzGv18KPgBAABgCe+8844iIyO9HkOHDq3zOO3atdNzzz2n5ORk3XHHHUpOTlZpaakefvhhJSUlaerUqQoLC9OHH37oec4jjzyilJQUde3aVcOHD9dvf/tbvfnmm17jut1uzZ8/X7169dIVV1yh2267TatXr27weZ9JQAv+rl27ymazVXvce++9kr6/pbDT6VR8fLzCw8M1ePBgbd++PZAhAwAAoJkaMmSItmzZ4vX485//XOdxevbs6XXDq5iYGPXu/eOXiENDQ9W+fXsVFhZ62v7+97/r8ssvV2xsrCIjI/Xoo49q7969XuN27dpVUVFRnu24uDivMRpLQAv+jRs36uDB
g55Hbm6uJOnGG2+UJM2cOVOzZs3S7NmztXHjRsXGxiotLa1J1joBAAAguLRq1Urnn3++1+Pcc8/17A8JCZExxus5J6+zr2K32722bTZbjW1ut1uStGHDBt18880aOnSo3nnnHX366aeaNm2aysvLzzhu1RiNKaCX5ezYsaPX9owZM3TeeecpNTVVxhhlZ2dr2rRpGjVqlCRpwYIFiomJUU5OjsaPHx+IkAEAABCkOnbsqIKCAhljZLPZJElbtmxp8LgffvihunTpomnTpnna8vPzGzyuvzSb6/CXl5frtdde0+TJk2Wz2fTll1+qoKBA6enpnj4Oh0Opqalav359rQV/WVmZysrKPNvFxcWSvv/trabf4Pyt6hhNcSwEBjm2NvJrfeS4fkLl2yxkoF9X8ttwLpdLxhi53W7v2edTZsYbW20z31Uz9FUxntx+atvJ47jdbl155ZX65ptv9OSTT+r666/XypUrtXz5ckVHR3v61TbO6dq6deumvXv3KicnR5dccomWLVumt99+2+v4VXGfGvPpzrVqnzFGLpdLoaGhXvt8fZ83m4J/8eLFOnz4sMaOHStJKigokPT9mqmTxcTEnPY3pqysLGVkZFRrX7VqlSIiIvwX8BlULU+CdZFjayO/1keO6+aS0DP3kaRly/Y0ahy+Ir/116JFC8XGxuro0aNeS1Ieurpzk8ZRNWlbm1OXeLtcLlVUVFR7Xmlpqaf/ueeeq6efflqzZs3S448/ruHDh+vee+/VggULvCaJTx2noqJC5eXlXm1ut1snTpxQcXGxhgwZonvuuUf333+/ysvLlZaWpt/97neaMWOG5zllZWWqrKz0GuPEiRNyu92nPdfy8nIdP35c69atU0VFRY3ndiY2c+pCpgD56U9/qrCwMC1dulSStH79eg0aNEgHDhxQXFycp9+4ceO0b98+rVixosZxaprhT0hI0KFDhxQdHd24J6Hv3yS5ublKS0urtk4L1kCOrY38Wh85rp+MpTt86jd9eI9GjuT0yG/DnThxQvv27VPXrl3VsmXLQIdTjTFGJSUlioqK8izLsbITJ05oz549SkhIqJaP4uJidejQQUeOHDltndssZvjz8/P17rvvatGiRZ622NhYSd/P9J9c8BcWFlab9T+Zw+GQw+Go1m6325v0g9/Ux0PTI8fWRn6tjxzXTaWP1/loLq8p+a2/yspK2Ww2hYSEeF2pprmoWv5SFaPVhYSEeL40fOp72tf3eLN4lebNm6dzzjlH1113nactMTFRsbGxXv8lV15erry8PKWkpAQiTAAAACDoBHyG3+12a968eRozZoxatPgxHJvNpkmTJikzM1NJSUlKSkpSZmamIiIiNHr06ABGDAAAAASPgBf87777rvbu3as77rij2r4pU6bo+PHjmjBhgoqKijRgwACtWrXK64YFAAAAAGoX8II/PT292g0QqthsNjmdTjmdzqYNCgAAALCIZrGGHwAAAEDjoOAHAAAALIyCHwAAALAwCn4AAADAwgL+pV0AAAAEgaUTm/Z4w59t2uNZGDP8AAAAsIz169crNDRUP/vZz7za9+zZI5vNpi1btgQmsACi4AcAAIBlvPLKK7r//vv1wQcfaO/evYEOp1mg4AcAAIAlHDt2TG+++abuueceDRs2TPPnzw90SM0CBT8AAAAs4a9//auSk5OVnJysW2+9VfPmzav1Bq9nEwp+AAAAWMLLL7+sW2+9VZL0s5/9TEePHtXq1asDHFXgUfADAAAg6O3cuVMff/yxbr75ZklSixYtdNNNN+mVV14JcGSBx2U5AQAAEPRefvllVVRU6Nxzz/W0GWNkt9tVVFQUwMgCjxl+AAAABLWKigq9+uqreuaZZ7RlyxbP41//+pe6dOmi119/PdAhBhQz/AAAAAhq77zzjoqKinTnnXeqdevWXvtuuOEGvfzyyxo2bFiAogs8Cn4AAACcWTO+8+3LL7+sa665plqxL0nXX3+9MjMz9d133wUg
suaBgh8AAABBbenSpbXu+8lPfuK5NOfZeolO1vADAAAAFkbBDwAAAFgYBT8AAABgYRT8AAAAgIVR8AMAAKCas/ULrs2NP/JAwQ8AAACP0NBQSVJ5eXmAI4EklZaWSpLsdnu9x+CynAAAAPBo0aKFIiIi9M0338hutyskpHnND7vdbpWXl+vEiRPNLjZ/MsaotLRUhYWFatOmjecXsfqg4AcAAICHzWZTXFycdu/erfz8/ECHU40xRsePH1d4eLhsNlugw2l0bdq0UWxsbIPGoOAHAACAl7CwMCUlJTXLZT0ul0vr1q3TlVde2aBlLsHAbrc3aGa/CgU/AABnsamLtgU6BDRTISEhatmyZaDDqCY0NFQVFRVq2bKl5Qt+f7HuwicAAAAAFPwAAACAlVHwAwAAABYW8IJ///79uvXWW9W+fXtFRETo4osv1ubNmz37jTFyOp2Kj49XeHi4Bg8erO3btwcwYgAAACB4BLTgLyoq0qBBg2S327V8+XLt2LFDzzzzjNq0aePpM3PmTM2aNUuzZ8/Wxo0bFRsbq7S0NJWUlAQucAAAACBIBPQqPU8++aQSEhI0b948T1vXrl09fzfGKDs7W9OmTdOoUaMkSQsWLFBMTIxycnI0fvz4pg4ZAAAACCoBLfiXLFmin/70p7rxxhuVl5enc889VxMmTNC4ceMkSbt371ZBQYHS09M9z3E4HEpNTdX69etrLPjLyspUVlbm2S4uLpb0/TVbXS5XI5+RPMdoimMhMMixtZFf6yPH3kLl9ut4gX5dya/1keMf+foa2IwxppFjqVXVtV0nT56sG2+8UR9//LEmTZqk//f//p9+/etfa/369Ro0aJD279+v+Ph4z/Puuusu5efna+XKldXGdDqdysjIqNaek5OjiIiIxjsZAAAAoAmVlpZq9OjROnLkiKKjo2vtF9AZfrfbrf79+yszM1OS1LdvX23fvl1z5szRr3/9a0+/U2+bbIyp9VbKU6dO1eTJkz3bxcXFSkhIUHp6+mlfCH9xuVzKzc1VWloaN4OwqKDK8fKHfOs39MnGjSOIBFV+US/k2FvG0h1+HW/68B5+Ha+uyK/1keMfVa1kOZOAFvxxcXHq0cP7H4YLL7xQb731liQpNjZWklRQUKC4uDhPn8LCQsXExNQ4psPhkMPhqNZut9ub9E3R1MdD0wuKHNsqfevX3M8jAIIiv2gQcvy9Sj9fv6O5vKbk1/rIse+ft4BepWfQoEHauXOnV9sXX3yhLl26SJISExMVGxur3Nxcz/7y8nLl5eUpJSWlSWMFAAAAglFAZ/gffPBBpaSkKDMzU7/85S/18ccfa+7cuZo7d66k75fyTJo0SZmZmUpKSlJSUpIyMzMVERGh0aNHBzJ0AAAAICgEtOC/5JJL9Pbbb2vq1Kn6/e9/r8TERGVnZ+uWW27x9JkyZYqOHz+uCRMmqKioSAMGDNCqVasUFRUVwMgBAACA4BDQgl+Shg0bpmHDhtW632azyel0yul0Nl1QAAAAgEUEdA0/AAAAgMZFwQ8AAABYGAU/AAAAYGEU/AAAAICFUfADAAAAFkbBDwAAAFhYwC/LCeAstnSib/2GP9u4cQBBZOqibT71yxrVu5EjARAsmOEHAAAALIyCHwAAALAwCn4AAADAwij4AQAAAAuj4AcAAAAsjIIfAAAAsDAKfgAAAMDCuA4/AABoctxPAGg6zPADAAAAFkbBDwAAAFgYBT8AAABgYazhB4CGWDrRt37Dn23cOAAAqAUz/AAAAICFUfADAAAAFkbBDwAAAFgYBT8AAABgYRT8AAAAgIVR8AMAAAAWRsEPAAAAWBjX4QcAoBmYumhboEMAYFHM8AMAAAAWRsEPAAAAWBgFPwAAAGBhAS34nU6nbDab1yM2Ntaz3xgjp9Op+Ph4hYeHa/Dgwdq+fXsAIwYAAACCS8Bn+Hv27KmDBw96Htu2/filpZkzZ2rWrFmaPXu2Nm7cqNjYWKWlpamkpCSAEQMA
AADBI+AFf4sWLRQbG+t5dOzYUdL3s/vZ2dmaNm2aRo0apV69emnBggUqLS1VTk5OgKMGAAAAgkPAL8u5a9cuxcfHy+FwaMCAAcrMzFS3bt20e/duFRQUKD093dPX4XAoNTVV69ev1/jx42scr6ysTGVlZZ7t4uJiSZLL5ZLL5Wrck/nhOCf/CesJqhybUN/6BepcmmF8dc5vMzwHnF5z/QyHyu3X8Xw9v+Z+3LrmqbnmF/5Djn/k62tgM8aYRo6lVsuXL1dpaam6d++ur7/+Wo8//rg+//xzbd++XTt37tSgQYO0f/9+xcfHe55z1113KT8/XytXrqxxTKfTqYyMjGrtOTk5ioiIaLRzAQAAAJpSaWmpRo8erSNHjig6OrrWfgEt+E917NgxnXfeeZoyZYouu+wyDRo0SAcOHFBcXJynz7hx47Rv3z6tWLGixjFqmuFPSEjQoUOHTvtC+IvL5VJubq7S0tJkt9sb/XhoekGV4+UP+dZv6JONG0dtmmF8dc5vMzwHnF5z/QxnLN3h1/GmD+9hieP6Ol6V5ppf+A85/lFxcbE6dOhwxoI/4Et6TtaqVSv17t1bu3bt0siRIyVJBQUFXgV/YWGhYmJiah3D4XDI4XBUa7fb7U36pmjq46HpBUWObZW+9QvUeTTj+HzObzM+B5xec/sMV/r5a3W+nltzP259c9Tc8gv/I8e+fz4C/qXdk5WVlemzzz5TXFycEhMTFRsbq9zcXM/+8vJy5eXlKSUlJYBRAgAAAMEjoDP8v/vd7zR8+HB17txZhYWFevzxx1VcXKwxY8bIZrNp0qRJyszMVFJSkpKSkpSZmamIiAiNHj06kGEDOBssf8j32XvgNKYu2nbmTgDQiAJa8H/11Vf61a9+pUOHDqljx4667LLLtGHDBnXp0kWSNGXKFB0/flwTJkxQUVGRBgwYoFWrVikqKiqQYQMAAABBI6AF/8KFC0+732azyel0yul0Nk1AAAAAgMU0qzX8AAAAAPyrWV2lB0Azt3Sib/2GP9u4cQAAAJ8xww8AAABYGAU/AAAAYGEU/AAAAICFsYYfAAA0W3W9j0Go3LokVMpYuqPGu/lmjertr9CAoMEMPwAAAGBhFPwAAACAhVHwAwAAABbGGn4AACyormvfAVgXM/wAAACAhVHwAwAAABZGwQ8AAABYGGv4ATR/Syf61m/4s40bBwAAQYgZfgAAAMDCKPgBAAAAC6PgBwAAACyMNfwAzi5n+j6ACZWU2iShAFbE9f+B5ocZfgAAAMDCKPgBAAAAC6PgBwAAACyMgh8AAACwML60C8D/fL1RllWOCwBAM8YMPwAAAGBhFPwAAACAhdWr4O/WrZu+/fbbau2HDx9Wt27dGhwUAAAAAP+o1xr+PXv2qLKyslp7WVmZ9u/f3+CgAABo7rjBFIBgUaeCf8mSJZ6/r1y5Uq1bt/ZsV1ZWavXq1eratavfggMAAADQMHUq+EeOHClJstlsGjNmjNc+u92url276plnnvFbcAAAAAAapk4Fv9vtliQlJiZq48aN6tChQ6MEBQAAAMA/6rWGf/fu3f6OQ1lZWXr44Yc1ceJEZWdnS5KMMcrIyNDcuXNVVFSkAQMG6Pnnn1fPnj39fnwAAGB9vn73ImtU70aOBGg69b7x1urVq7V69WoVFhZ6Zv6rvPLKK3Uaa+PGjZo7d6769Onj1T5z5kzNmjVL8+fPV/fu3fX4448rLS1NO3fuVFRUVH1DBwAAAM4a9bosZ0ZGhtLT07V69WodOnRIRUVFXo+6OHr0qG655Ra99NJLatu2rafdGKPs7GxNmzZNo0aNUq9evbRgwQKVlpYqJyenPmEDAAAAZ516zfC/+OKLmj9/vm677bYGB3Dvvffquuuu0zXXXKPHH3/c0757924VFBQoPT3d0+ZwOJSamqr169dr/PjxNY5XVlamsrIyz3ZxcbEkyeVyyeVyNTjeM6k6RlMcC4ERVDk2ob718/VcfB0viLl+OEeX
v881GN4vZwl/fYZD5T5zJzS5kB/yEtLA/ATFv/FnqaD6OdzIfH0N6lXwl5eXKyUlpT5P9bJw4UJ98skn2rhxY7V9BQUFkqSYmBiv9piYGOXn59c6ZlZWljIyMqq1r1q1ShEREQ2M2He5ublNdiwERnDkONW3bsuW+Xc8C8jV5ZLx44A+v8ZoKg39DF9i/d9/g1q/0L0Nev6yZXv8EwgaTXD8HG5cpaWlPvWrV8H/m9/8Rjk5OXr00Ufr83RJ0r59+zRx4kStWrVKLVu2rLWfzWbz2jbGVGs72dSpUzV58mTPdnFxsRISEpSenq7o6Oh6x+srl8ul3NxcpaWlyW63N/rx0PSCKsfLH/Kt39An/TteEHOZUOXqcqXpA9lt1W8wWG++vsZodP76DGcs3eHHqOAvIXKrX+heba7sLHf9Vi5LkqYP7+HHqOBPQfVzuJFVrWQ5k3oV/CdOnNDcuXP17rvvqk+fPtVe7FmzZp1xjM2bN6uwsFD9+vXztFVWVmrdunWaPXu2du7cKen7mf64uDhPn8LCwmqz/idzOBxyOBzV2u12e5O+KZr6eGh6QZFjXwtWX8/DnwVwc2Yku63SvwV/c3+vnIUa+hmubEAxicbnVkiDctTs/31HcPwcbmS+nn+9Cv6tW7fq4osvliT9+9//9tp3utn3k1199dXats370li33367LrjgAj300EPq1q2bYmNjlZubq759+0r6filRXl6ennySmTIAAADAF/Uq+N97770GHzgqKkq9evXyamvVqpXat2/vaZ80aZIyMzOVlJSkpKQkZWZmKiIiQqNHj27w8QEAAICzQb2vw98UpkyZouPHj2vChAmeG2+tWrWKa/ADAAAAPqpXwT9kyJDTLt1Zs2ZNvYJZu3at17bNZpPT6ZTT6azXeAAAAMDZrl4Ff9X6/Soul0tbtmzRv//9b40ZM8YfcQEAAADwg3oV/H/84x9rbHc6nTp69GiDAgIAAADgP369ptitt96qV155xZ9DAgAAAGgAvxb8H3300WlvogUAAACgadVrSc+oUaO8to0xOnjwoDZt2tSgu+8CAAAA8K96FfytW7f22g4JCVFycrJ+//vfKz093S+BAQAQCFMXbTtzJwAIIvUq+OfNm+fvOAAAAAA0ggbdeGvz5s367LPPZLPZ1KNHD/Xt29dfcQEAAADwg3oV/IWFhbr55pu1du1atWnTRsYYHTlyREOGDNHChQvVsWNHf8cJAAAAoB7qVfDff//9Ki4u1vbt23XhhRdKknbs2KExY8bogQce0BtvvOHXIAGcYunEQEcAAACCRL0K/hUrVujdd9/1FPuS1KNHDz3//PN8aRcAAABoRup1HX632y273V6t3W63y+12NzgoAAAAAP5Rr4L/qquu0sSJE3XgwAFP2/79+/Xggw/q6quv9ltwAAAAABqmXkt6Zs+erREjRqhr165KSEiQzWbT3r171bt3b7322mv+jhFAY+M7AQAAWFa9Cv6EhAR98sknys3N1eeffy5jjHr06KFrrrnG3/EBAAAAaIA6LelZs2aNevTooeLiYklSWlqa7r//fj3wwAO65JJL1LNnT73//vuNEigAAACAuqtTwZ+dna1x48YpOjq62r7WrVtr/PjxmjVrlt+CAwAAANAwdVrS869//UtPPvlkrfvT09P19NNPNzgoAMAZ+Pq9i+HPNm4cAIBmr04z/F9//XWNl+Os0qJFC33zzTcNDgoAAACAf9Sp4D/33HO1bdu2Wvdv3bpVcXFxDQ4KAAAAgH/UqeC/9tpr9dhjj+nEiRPV9h0/flzTp0/XsGHD/BYcAAAAgIap0xr+Rx55RIsWLVL37t113333KTk5WTabTZ999pmef/55VVZWatq0aY0VKxC8WG8N3gMBN3XR9/9DHSq3LgmVMpbuUGX97j8JAEGlTgV/TEyM1q9fr3vuuUdTp06VMUaSZLPZ9NOf/lQvvPCCYmJiGiVQAAAAAHVX5xtvdenSRcuWLVNR
UZH+85//yBijpKQktW3btjHiAwAAANAA9brTriS1bdtWl1xyiT9jAQAAAOBnLF4EAAAALIyCHwAAALAwCn4AAADAwij4AQAAAAuj4AcAAAAsLKAF/5w5c9SnTx9FR0crOjpaAwcO1PLlyz37jTFyOp2Kj49XeHi4Bg8erO3btwcwYgAAACC4BLTg79Spk2bMmKFNmzZp06ZNuuqqqzRixAhPUT9z5kzNmjVLs2fP1saNGxUbG6u0tDSVlJQEMmwAAAAgaAS04B8+fLiuvfZade/eXd27d9cTTzyhyMhIbdiwQcYYZWdna9q0aRo1apR69eqlBQsWqLS0VDk5OYEMGwAAAAga9b7xlr9VVlbqb3/7m44dO6aBAwdq9+7dKigoUHp6uqePw+FQamqq1q9fr/Hjx9c4TllZmcrKyjzbxcXFkiSXyyWXy9W4J/HDcU7+E9ZTrxybUF8H9+94qDPXD6+tK1Cvsb/fA/xb5BEqtyQp5JQ/YS3+yi8/x5svaq0f+foa2IwxppFjOa1t27Zp4MCBOnHihCIjI5WTk6Nrr71W69ev16BBg7R//37Fx8d7+t91113Kz8/XypUraxzP6XQqIyOjWntOTo4iIiIa7TwAAACAplRaWqrRo0fryJEjio6OrrVfwGf4k5OTtWXLFh0+fFhvvfWWxowZo7y8PM9+m83m1d8YU63tZFOnTtXkyZM928XFxUpISFB6evppXwh/cblcys3NVVpamux2e6MfD02vXjle/pBv/YY+6d/xUGcuE6pcXa40fSC7rbLpA/D3e8DX8c4CGUt3SPp+5rdf6F5truwsNxersxx/5Xf68B5+jAr+RK31o6qVLGcS8II/LCxM559/viSpf//+2rhxo5599lk99ND3P8wKCgoUFxfn6V9YWKiYmJhax3M4HHI4HNXa7XZ7k74pmvp4aHp1yrGvhaO/x0P9GMluqwxMwR+o99RZoPKU4s+tkGptsI6G5pef4c0ftZbv79Nm9y+dMUZlZWVKTExUbGyscnNzPfvKy8uVl5enlJSUAEYIAAAABI+AzvA//PDDGjp0qBISElRSUqKFCxdq7dq1WrFihWw2myZNmqTMzEwlJSUpKSlJmZmZioiI0OjRowMZNgCguVo6sdZdI7/6TpLktrVQYeeRTRQQrG7qom1n7JM1qncTRALULqAF/9dff63bbrtNBw8eVOvWrdWnTx+tWLFCaWlpkqQpU6bo+PHjmjBhgoqKijRgwACtWrVKUVFRgQwbAAAACBoBLfhffvnl0+632WxyOp1yOp1NExAAAABgMc1uDT8AAAAA/wn4VXoAAI3oNGvavQx/tnHjqE1zjw8ALIAZfgAAAMDCKPgBAAAAC6PgBwAAACyMNfwAgKB28nXQq661DwD4ETP8AAAAgIVR8AMAAAAWRsEPAAAAWBhr+AEAAE5x8ndDgGDHDD8AAABgYRT8AAAAgIVR8AMAAAAWxhp+AIDf+LruOWtU70aOBABQhRl+AAAAwMIo+AEAAAALo+AHAAAALIw1/EBzsnRioCMAmkTVWv+RX31Xa58Bie1+3DjNZ+N0YwAAmOEHAAAALI2CHwAAALAwCn4AAADAwij4AQAAAAuj4AcAAAAsjIIfAAAAsDAKfgAAAMDCuA4/UJPTXQ/fhEpKlZY/JP18VpOFBASTkV/NDHQIAIAfMMMPAAAAWBgFPwAAAGBhFPwAAACAhQV0DX9WVpYWLVqkzz//XOHh4UpJSdGTTz6p5ORkTx9jjDIyMjR37lwVFRVpwIABev7559WzZ88ARg4AjeR03x85y/xz93eBDgFoUlMXbfOpX9ao3o0cCawmoDP8eXl5uvfee7Vhwwbl5uaqoqJC6enpOnbsmKfPzJkzNWvWLM2ePVsbN25UbGys0tLSVFJSEsDIAQAAgOAQ0Bn+FStWeG3PmzdP55xzjjZv3qwrr7xSxhhlZ2dr2rRpGjVqlCRpwYIFiomJUU5OjsaPHx+IsAEAAICg0awu
[base64-encoded PNG data omitted: "Text Length Distribution" histogram, x-axis "Characters", y-axis "Count", series "Human" and "AI"]",
259
- "text/plain": [
260
- "<Figure size 900x400 with 1 Axes>"
261
- ]
262
- },
263
- "metadata": {},
264
- "output_type": "display_data"
265
- }
266
- ],
267
- "source": [
268
- "# Visualize text-length distributions by class\n",
269
- "df['text_len'] = df['text'].str.len()\n",
270
- "ax = df[df['label'] == 0]['text_len'].hist(bins=40, alpha=0.6, label='Human', figsize=(9, 4))\n",
271
- "df[df['label'] == 1]['text_len'].hist(bins=40, alpha=0.6, label='AI', ax=ax)\n",
272
- "ax.set_title('Text Length Distribution')\n",
273
- "ax.set_xlabel('Characters')\n",
274
- "ax.set_ylabel('Count')\n",
275
- "ax.legend()"
276
- ]
277
- },
278
- {
279
- "cell_type": "code",
280
- "execution_count": 6,
281
- "id": "59fe88ce",
282
- "metadata": {
283
- "id": "59fe88ce"
284
- },
285
- "outputs": [
286
- {
287
- "data": {
288
- "text/html": [
289
- "<div>\n",
290
- "<style scoped>\n",
291
- " .dataframe tbody tr th:only-of-type {\n",
292
- " vertical-align: middle;\n",
293
- " }\n",
294
- "\n",
295
- " .dataframe tbody tr th {\n",
296
- " vertical-align: top;\n",
297
- " }\n",
298
- "\n",
299
- " .dataframe thead th {\n",
300
- " text-align: right;\n",
301
- " }\n",
302
- "</style>\n",
303
- "<table border=\"1\" class=\"dataframe\">\n",
304
- " <thead>\n",
305
- " <tr style=\"text-align: right;\">\n",
306
- " <th></th>\n",
307
- " <th>text</th>\n",
308
- " <th>label</th>\n",
309
- " </tr>\n",
310
- " </thead>\n",
311
- " <tbody>\n",
312
- " <tr>\n",
313
- " <th>0</th>\n",
314
- " <td>हामीले पार्टी एकतापछि कि दुबै पार्टीको सिद्धान...</td>\n",
315
- " <td>0</td>\n",
316
- " </tr>\n",
317
- " <tr>\n",
318
- " <th>1</th>\n",
319
- " <td>एमाले प्रतिनिधिसभाको प्रत्यक्षतर्फ ८० समानुपात...</td>\n",
320
- " <td>0</td>\n",
321
- " </tr>\n",
322
- " <tr>\n",
323
- " <th>2</th>\n",
324
- " <td>नेकपा माओवादी केन्द्रका नेता रामनारायण विडारील...</td>\n",
325
- " <td>1</td>\n",
326
- " </tr>\n",
327
- " <tr>\n",
328
- " <th>3</th>\n",
329
- " <td>प्रदेश नं २ का मुख्यमन्त्रीको रूपमा संघीय समाज...</td>\n",
330
- " <td>1</td>\n",
331
- " </tr>\n",
332
- " <tr>\n",
333
- " <th>4</th>\n",
334
- " <td>बिहीबार एमालेका अध्यक्ष केपी शर्मा ओली र माओवा...</td>\n",
335
- " <td>0</td>\n",
336
- " </tr>\n",
337
- " </tbody>\n",
338
- "</table>\n",
339
- "</div>"
340
- ],
341
- "text/plain": [
342
- " text label\n",
343
- "0 हामीले पार्टी एकतापछि कि दुबै पार्टीको सिद्धान... 0\n",
344
- "1 एमाले प्रतिनिधिसभाको प्रत्यक्षतर्फ ८० समानुपात... 0\n",
345
- "2 नेकपा माओवादी केन्द्रका नेता रामनारायण विडारील... 1\n",
346
- "3 प्रदेश नं २ का मुख्यमन्त्रीको रूपमा संघीय समाज... 1\n",
347
- "4 बिहीबार एमालेका अध्यक्ष केपी शर्मा ओली र माओवा... 0"
348
- ]
349
- },
350
- "execution_count": 6,
351
- "metadata": {},
352
- "output_type": "execute_result"
353
- }
354
- ],
355
- "source": [
356
- "# Keep only columns needed for training\n",
357
- "df = df[['text', 'label']].copy()\n",
358
- "df.head()"
359
- ]
360
- },
361
- {
362
- "cell_type": "code",
363
- "execution_count": 7,
364
- "id": "434df9a2",
365
- "metadata": {
366
- "id": "434df9a2"
367
- },
368
- "outputs": [
369
- {
370
- "name": "stdout",
371
- "output_type": "stream",
372
- "text": [
373
- "Using model: distilbert-base-multilingual-cased\n"
374
- ]
375
- }
376
- ],
377
- "source": [
378
- "# Model/tokenizer config (smaller multilingual model for low-VRAM GPU)\n",
379
- "MODEL_NAME = 'distilbert-base-multilingual-cased'\n",
380
- "MAX_LEN = 96\n",
381
- "\n",
382
- "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\n",
383
- "print('Using model:', MODEL_NAME)"
384
- ]
385
- },
386
- {
387
- "cell_type": "code",
388
- "execution_count": 8,
389
- "id": "ef7d53f9",
390
- "metadata": {
391
- "id": "ef7d53f9"
392
- },
393
- "outputs": [],
394
- "source": [
395
- "class NepaliDataset(Dataset):\n",
396
- " def __init__(self, texts, labels):\n",
397
- " self.texts = texts\n",
398
- " self.labels = labels\n",
399
- "\n",
400
- " def __len__(self):\n",
401
- " return len(self.texts)\n",
402
- "\n",
403
- " def __getitem__(self, idx):\n",
404
- " return {\n",
405
- " 'text': self.texts[idx],\n",
406
- " 'label': int(self.labels[idx]),\n",
407
- " }"
408
- ]
409
- },
410
- {
411
- "cell_type": "code",
412
- "execution_count": 9,
413
- "id": "134a3fc1",
414
- "metadata": {
415
- "id": "134a3fc1"
416
- },
417
- "outputs": [
418
- {
419
- "name": "stdout",
420
- "output_type": "stream",
421
- "text": [
422
- "Train: 1588 | Val: 398\n"
423
- ]
424
- }
425
- ],
426
- "source": [
427
- "# Train/Validation Split\n",
428
- "train_texts, val_texts, train_labels, val_labels = train_test_split(\n",
429
- " df['text'].tolist(),\n",
430
- " df['label'].tolist(),\n",
431
- " test_size=0.2,\n",
432
- " random_state=42,\n",
433
- " stratify=df['label'].tolist(),\n",
434
- ")\n",
435
- "print(f'Train: {len(train_texts)} | Val: {len(val_texts)}')"
436
- ]
437
- },
438
- {
439
- "cell_type": "code",
440
- "execution_count": 10,
441
- "id": "dd226ed1",
442
- "metadata": {
443
- "id": "dd226ed1"
444
- },
445
- "outputs": [
446
- {
447
- "name": "stdout",
448
- "output_type": "stream",
449
- "text": [
450
- "Batch size: 2 | Max length: 96\n"
451
- ]
452
- }
453
- ],
454
- "source": [
- "train_dataset = NepaliDataset(train_texts, train_labels)\n",
- "val_dataset = NepaliDataset(val_texts, val_labels)\n",
- "\n",
- "def collate_batch(batch):\n",
- "    # Tokenize per batch so padding only extends to the longest text in that batch\n",
- "    texts = [item['text'] for item in batch]\n",
- "    labels = torch.tensor([item['label'] for item in batch], dtype=torch.long)\n",
- "    enc = tokenizer(\n",
- "        texts,\n",
- "        padding=True,\n",
- "        truncation=True,\n",
- "        max_length=MAX_LEN,\n",
- "        return_tensors='pt',\n",
- "    )\n",
- "    return {\n",
- "        'input_ids': enc['input_ids'],\n",
- "        'attention_mask': enc['attention_mask'],\n",
- "        'labels': labels,\n",
- "    }\n",
- "\n",
- "BATCH_SIZE = 2\n",
- "train_loader = DataLoader(\n",
- "    train_dataset,\n",
- "    batch_size=BATCH_SIZE,\n",
- "    shuffle=True,\n",
- "    collate_fn=collate_batch,\n",
- "    pin_memory=(torch.cuda.is_available()),\n",
- ")\n",
- "val_loader = DataLoader(\n",
- "    val_dataset,\n",
- "    batch_size=BATCH_SIZE,\n",
- "    shuffle=False,\n",
- "    collate_fn=collate_batch,\n",
- "    pin_memory=(torch.cuda.is_available()),\n",
- ")\n",
- "print('Batch size:', BATCH_SIZE, '| Max length:', MAX_LEN)"
490
- ]
491
- },
492
- {
493
- "cell_type": "code",
494
- "execution_count": 11,
495
- "id": "51320951",
496
- "metadata": {
497
- "id": "51320951"
498
- },
499
- "outputs": [],
500
- "source": [
- "# === Model ===\n",
- "# Note: despite its name, this class wraps whichever encoder MODEL_NAME selects.\n",
- "class IndicBERTClassifier(nn.Module):\n",
- "    def __init__(self, dropout=0.2):\n",
- "        super(IndicBERTClassifier, self).__init__()\n",
- "        self.bert = AutoModel.from_pretrained(MODEL_NAME)\n",
- "        # Gradient checkpointing trades extra compute for lower memory use\n",
- "        if hasattr(self.bert, 'gradient_checkpointing_enable'):\n",
- "            self.bert.gradient_checkpointing_enable()\n",
- "        self.dropout = nn.Dropout(dropout)\n",
- "        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)\n",
- "\n",
- "    def forward(self, input_ids, attention_mask):\n",
- "        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)\n",
- "        # Use the hidden state of the first token as the sequence representation\n",
- "        cls_output = outputs.last_hidden_state[:, 0, :]\n",
- "        cls_output = self.dropout(cls_output)\n",
- "        return self.classifier(cls_output)"
516
- ]
517
- },
518
- {
519
- "cell_type": "code",
520
- "execution_count": 12,
521
- "id": "944f918e",
522
- "metadata": {
523
- "id": "944f918e"
524
- },
525
- "outputs": [],
526
- "source": [
- "# Step 8: Create a custom Dataset class\n",
- "class NepaliTextDataset(Dataset):\n",
- "    def __init__(self, input_ids, attention_mask, labels):\n",
- "        self.input_ids = input_ids\n",
- "        self.attention_mask = attention_mask\n",
- "        self.labels = labels\n",
- "\n",
- "    def __len__(self):\n",
- "        return len(self.labels)\n",
- "\n",
- "    def __getitem__(self, idx):\n",
- "        return {\n",
- "            'input_ids': torch.tensor(self.input_ids[idx]),\n",
- "            'attention_mask': torch.tensor(self.attention_mask[idx]),\n",
- "            'labels': torch.tensor(self.labels[idx])\n",
- "        }"
543
- ]
544
- },
545
- {
546
- "cell_type": "code",
547
- "execution_count": 13,
548
- "id": "a9d426e1",
549
- "metadata": {
550
- "id": "a9d426e1"
551
- },
552
- "outputs": [
553
- {
554
- "name": "stderr",
555
- "output_type": "stream",
556
- "text": [
557
- "Loading weights: 100%|██████████| 100/100 [00:00<00:00, 11666.08it/s]\n",
558
- "\u001b[1mDistilBertModel LOAD REPORT\u001b[0m from: distilbert-base-multilingual-cased\n",
559
- "Key | Status | | \n",
560
- "------------------------+------------+--+-\n",
561
- "vocab_layer_norm.bias | UNEXPECTED | | \n",
562
- "vocab_transform.weight | UNEXPECTED | | \n",
563
- "vocab_layer_norm.weight | UNEXPECTED | | \n",
564
- "vocab_transform.bias | UNEXPECTED | | \n",
565
- "vocab_projector.bias | UNEXPECTED | | \n",
566
- "\n",
567
- "\u001b[3mNotes:\n",
568
- "- UNEXPECTED\u001b[3m\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\u001b[0m\n"
569
- ]
570
- }
571
- ],
572
- "source": [
573
- "\n",
574
- "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
575
- "model = IndicBERTClassifier().to(device)"
576
- ]
577
- },
578
- {
579
- "cell_type": "code",
580
- "execution_count": 14,
581
- "id": "2740c14a",
582
- "metadata": {
583
- "id": "2740c14a"
584
- },
585
- "outputs": [
586
- {
587
- "name": "stdout",
588
- "output_type": "stream",
589
- "text": [
590
- "Grad accumulation steps: 4\n"
591
- ]
592
- }
593
- ],
594
- "source": [
595
- "# === Optimizer, Scheduler & Loss ===\n",
596
- "optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)\n",
597
- "loss_fn = nn.CrossEntropyLoss()\n",
598
- "\n",
599
- "max_epochs = 6\n",
600
- "grad_accum_steps = 4 # effective batch = BATCH_SIZE * grad_accum_steps\n",
601
- "steps_per_epoch = math.ceil(len(train_loader) / grad_accum_steps)\n",
602
- "total_steps = steps_per_epoch * max_epochs\n",
603
- "warmup_steps = int(0.1 * total_steps)\n",
604
- "scheduler = get_linear_schedule_with_warmup(\n",
605
- " optimizer,\n",
606
- " num_warmup_steps=warmup_steps,\n",
607
- " num_training_steps=total_steps,\n",
608
- ")\n",
609
- "print('Grad accumulation steps:', grad_accum_steps)"
610
- ]
611
- },
612
- {
613
- "cell_type": "code",
614
- "execution_count": 15,
615
- "id": "14ce04bd",
616
- "metadata": {
617
- "id": "14ce04bd"
618
- },
619
- "outputs": [],
620
- "source": [
- "# === Training Loop ===\n",
- "def train(model, loader):\n",
- "    model.train()\n",
- "    total_loss = 0\n",
- "    for batch in loader:\n",
- "        input_ids = batch['input_ids'].to(device)\n",
- "        attention_mask = batch['attention_mask'].to(device)\n",
- "        labels = batch['labels'].to(device)\n",
- "\n",
- "        optimizer.zero_grad()\n",
- "        outputs = model(input_ids, attention_mask)\n",
- "        loss = loss_fn(outputs, labels)\n",
- "        loss.backward()\n",
- "        optimizer.step()\n",
- "        total_loss += loss.item()\n",
- "    return total_loss / len(loader)\n",
- "\n",
- "# === Evaluation ===\n",
- "def evaluate(model, loader):\n",
- "    model.eval()\n",
- "    preds, true = [], []\n",
- "    with torch.no_grad():\n",
- "        for batch in loader:\n",
- "            input_ids = batch['input_ids'].to(device)\n",
- "            attention_mask = batch['attention_mask'].to(device)\n",
- "            labels = batch['labels'].to(device)\n",
- "\n",
- "            outputs = model(input_ids, attention_mask)\n",
- "            pred_labels = torch.argmax(outputs, dim=1)\n",
- "            preds.extend(pred_labels.cpu().numpy())\n",
- "            true.extend(labels.cpu().numpy())\n",
- "\n",
- "    print(classification_report(true, preds, target_names=[\"Human\", \"AI\"]))\n"
654
- ]
655
- },
656
- {
657
- "cell_type": "code",
658
- "execution_count": null,
659
- "id": "d24e91b7",
660
- "metadata": {
661
- "colab": {
662
- "base_uri": "https://localhost:8080/"
663
- },
664
- "id": "d24e91b7",
665
- "outputId": "33ef8227-5c71-4c0d-88e7-b1a9e30b45f4"
666
- },
667
- "outputs": [
668
- {
669
- "name": "stdout",
670
- "output_type": "stream",
671
- "text": [
672
- "\n",
673
- "Epoch 1/6\n"
674
- ]
675
- },
676
- {
677
- "name": "stderr",
678
- "output_type": "stream",
679
- "text": [
680
- "/tmp/ipykernel_155548/4183901742.py:4: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.\n",
681
- " scaler = GradScaler(enabled=use_amp)\n",
682
- "/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
683
- " with autocast(enabled=use_amp):\n"
684
- ]
685
- },
686
- {
687
- "name": "stdout",
688
- "output_type": "stream",
689
- "text": [
690
- "Batch 0 | Loss: 0.8206\n",
691
- "Batch 50 | Loss: 0.8677\n",
692
- "Batch 100 | Loss: 0.8435\n",
693
- "Batch 150 | Loss: 0.6523\n",
694
- "Batch 200 | Loss: 0.7219\n",
695
- "Batch 250 | Loss: 0.5793\n",
696
- "Batch 300 | Loss: 0.6833\n",
697
- "Batch 350 | Loss: 0.5742\n",
698
- "Batch 400 | Loss: 0.4844\n",
699
- "Batch 450 | Loss: 0.5671\n",
700
- "Batch 500 | Loss: 0.5363\n",
701
- "Batch 550 | Loss: 0.5386\n",
702
- "Batch 600 | Loss: 0.5520\n",
703
- "Batch 650 | Loss: 0.7692\n",
704
- "Batch 700 | Loss: 0.4680\n",
705
- "Batch 750 | Loss: 0.6353\n",
706
- "Train | Loss: 0.6600 | Acc: 0.5913 | F1: 0.5895\n"
707
- ]
708
- },
709
- {
710
- "name": "stderr",
711
- "output_type": "stream",
712
- "text": [
713
- "/tmp/ipykernel_155548/4183901742.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
714
- " with autocast(enabled=use_amp):\n"
715
- ]
716
- },
717
- {
718
- "name": "stdout",
719
- "output_type": "stream",
720
- "text": [
721
- "Validation | Loss: 0.5192 | Acc: 0.8015 | F1: 0.7812\n",
722
- " precision recall f1-score support\n",
723
- "\n",
724
- " Human 0.75 0.90 0.82 198\n",
725
- " AI 0.88 0.70 0.78 200\n",
726
- "\n",
727
- " accuracy 0.80 398\n",
728
- " macro avg 0.81 0.80 0.80 398\n",
729
- "weighted avg 0.81 0.80 0.80 398\n",
730
- "\n",
731
- "Saved improved checkpoint: model_best.pth\n",
732
- "\n",
733
- "Epoch 2/6\n"
734
- ]
735
- },
736
- {
737
- "name": "stderr",
738
- "output_type": "stream",
739
- "text": [
740
- "/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
741
- " with autocast(enabled=use_amp):\n"
742
- ]
743
- },
744
- {
745
- "name": "stdout",
746
- "output_type": "stream",
747
- "text": [
748
- "Batch 0 | Loss: 0.6078\n",
749
- "Batch 50 | Loss: 1.1135\n",
750
- "Batch 100 | Loss: 0.3297\n",
751
- "Batch 150 | Loss: 0.8473\n",
752
- "Batch 200 | Loss: 0.9326\n",
753
- "Batch 250 | Loss: 0.5112\n",
754
- "Batch 300 | Loss: 0.1645\n",
755
- "Batch 350 | Loss: 0.2250\n",
756
- "Batch 400 | Loss: 0.7142\n",
757
- "Batch 450 | Loss: 0.3741\n",
758
- "Batch 500 | Loss: 0.3084\n",
759
- "Batch 550 | Loss: 0.1472\n",
760
- "Batch 600 | Loss: 0.0679\n",
761
- "Batch 650 | Loss: 0.1234\n",
762
- "Batch 700 | Loss: 1.1370\n",
763
- "Batch 750 | Loss: 0.8843\n",
764
- "Train | Loss: 0.4817 | Acc: 0.7720 | F1: 0.7665\n"
765
- ]
766
- },
767
- {
768
- "name": "stderr",
769
- "output_type": "stream",
770
- "text": [
771
- "/tmp/ipykernel_155548/4183901742.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
772
- " with autocast(enabled=use_amp):\n"
773
- ]
774
- },
775
- {
776
- "name": "stdout",
777
- "output_type": "stream",
778
- "text": [
779
- "Validation | Loss: 0.3708 | Acc: 0.8417 | F1: 0.8225\n",
780
- " precision recall f1-score support\n",
781
- "\n",
782
- " Human 0.78 0.95 0.86 198\n",
783
- " AI 0.94 0.73 0.82 200\n",
784
- "\n",
785
- " accuracy 0.84 398\n",
786
- " macro avg 0.86 0.84 0.84 398\n",
787
- "weighted avg 0.86 0.84 0.84 398\n",
788
- "\n",
789
- "Saved improved checkpoint: model_best.pth\n",
790
- "\n",
791
- "Epoch 3/6\n"
792
- ]
793
- },
794
- {
795
- "name": "stderr",
796
- "output_type": "stream",
797
- "text": [
798
- "/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
799
- " with autocast(enabled=use_amp):\n"
800
- ]
801
- },
802
- {
803
- "name": "stdout",
804
- "output_type": "stream",
805
- "text": [
806
- "Batch 0 | Loss: 0.0415\n",
807
- "Batch 50 | Loss: 0.0845\n",
808
- "Batch 100 | Loss: 0.0336\n",
809
- "Batch 150 | Loss: 0.6389\n",
810
- "Batch 200 | Loss: 1.6021\n",
811
- "Batch 250 | Loss: 0.0696\n",
812
- "Batch 300 | Loss: 0.5184\n",
813
- "Batch 350 | Loss: 0.0569\n",
814
- "Batch 400 | Loss: 0.8119\n",
815
- "Batch 450 | Loss: 1.5121\n",
816
- "Batch 500 | Loss: 0.0330\n",
817
- "Batch 550 | Loss: 0.0208\n",
818
- "Batch 600 | Loss: 1.1329\n",
819
- "Batch 650 | Loss: 0.7745\n",
820
- "Batch 700 | Loss: 0.0740\n",
821
- "Batch 750 | Loss: 1.4907\n",
822
- "Train | Loss: 0.3830 | Acc: 0.8495 | F1: 0.8488\n"
823
- ]
824
- },
825
- {
826
- "name": "stderr",
827
- "output_type": "stream",
828
- "text": [
829
- "/tmp/ipykernel_155548/4183901742.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
830
- " with autocast(enabled=use_amp):\n"
831
- ]
832
- },
833
- {
834
- "name": "stdout",
835
- "output_type": "stream",
836
- "text": [
837
- "Validation | Loss: 0.3527 | Acc: 0.8668 | F1: 0.8515\n",
838
- " precision recall f1-score support\n",
839
- "\n",
840
- " Human 0.80 0.97 0.88 198\n",
841
- " AI 0.97 0.76 0.85 200\n",
842
- "\n",
843
- " accuracy 0.87 398\n",
844
- " macro avg 0.88 0.87 0.87 398\n",
845
- "weighted avg 0.88 0.87 0.87 398\n",
846
- "\n",
847
- "Saved improved checkpoint: model_best.pth\n",
848
- "\n",
849
- "Epoch 4/6\n"
850
- ]
851
- },
852
- {
853
- "name": "stderr",
854
- "output_type": "stream",
855
- "text": [
856
- "/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
857
- " with autocast(enabled=use_amp):\n"
858
- ]
859
- },
860
- {
861
- "name": "stdout",
862
- "output_type": "stream",
863
- "text": [
864
- "Batch 0 | Loss: 1.2321\n",
865
- "Batch 50 | Loss: 0.0369\n",
866
- "Batch 100 | Loss: 0.0161\n",
867
- "Batch 150 | Loss: 0.2000\n",
868
- "Batch 200 | Loss: 0.0035\n",
869
- "Batch 250 | Loss: 2.3207\n",
870
- "Batch 300 | Loss: 0.0022\n",
871
- "Batch 350 | Loss: 2.2738\n",
872
- "Batch 400 | Loss: 0.0011\n",
873
- "Batch 450 | Loss: 0.0075\n",
874
- "Batch 500 | Loss: 2.4454\n",
875
- "Batch 550 | Loss: 0.3863\n",
876
- "Batch 600 | Loss: 0.0038\n",
877
- "Batch 650 | Loss: 0.0061\n",
878
- "Batch 700 | Loss: 0.0005\n",
879
- "Batch 750 | Loss: 0.0182\n",
880
- "Train | Loss: 0.4209 | Acc: 0.8923 | F1: 0.8903\n"
881
- ]
882
- },
883
- {
884
- "name": "stderr",
885
- "output_type": "stream",
886
- "text": [
887
- "/tmp/ipykernel_155548/4183901742.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
888
- " with autocast(enabled=use_amp):\n"
889
- ]
890
- },
891
- {
892
- "name": "stdout",
893
- "output_type": "stream",
894
- "text": [
895
- "Validation | Loss: 0.4601 | Acc: 0.8769 | F1: 0.8831\n",
896
- " precision recall f1-score support\n",
897
- "\n",
898
- " Human 0.92 0.83 0.87 198\n",
899
- " AI 0.84 0.93 0.88 200\n",
900
- "\n",
901
- " accuracy 0.88 398\n",
902
- " macro avg 0.88 0.88 0.88 398\n",
903
- "weighted avg 0.88 0.88 0.88 398\n",
904
- "\n",
905
- "Saved improved checkpoint: model_best.pth\n",
906
- "\n",
907
- "Epoch 5/6\n"
908
- ]
909
- },
910
- {
911
- "name": "stderr",
912
- "output_type": "stream",
913
- "text": [
914
- "/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
915
- " with autocast(enabled=use_amp):\n"
916
- ]
917
- },
918
- {
919
- "name": "stdout",
920
- "output_type": "stream",
921
- "text": [
922
- "Batch 0 | Loss: 0.0010\n",
923
- "Batch 50 | Loss: 0.0061\n",
924
- "Batch 100 | Loss: 0.0047\n",
925
- "Batch 150 | Loss: 0.0201\n",
926
- "Batch 200 | Loss: 0.0023\n",
927
- "Batch 250 | Loss: 0.0395\n",
928
- "Batch 300 | Loss: 0.0011\n",
929
- "Batch 350 | Loss: 0.0002\n",
930
- "Batch 400 | Loss: 3.2169\n",
931
- "Batch 450 | Loss: 4.4883\n",
932
- "Batch 500 | Loss: 0.0002\n",
933
- "Batch 550 | Loss: 0.0003\n",
934
- "Batch 600 | Loss: 0.0000\n",
935
- "Batch 650 | Loss: 0.0002\n",
936
- "Batch 700 | Loss: 0.0000\n",
937
- "Batch 750 | Loss: 4.6367\n",
938
- "Train | Loss: 0.5447 | Acc: 0.9011 | F1: 0.8990\n"
939
- ]
940
- },
941
- {
942
- "name": "stderr",
943
- "output_type": "stream",
944
- "text": [
945
- "/tmp/ipykernel_155548/4183901742.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
946
- " with autocast(enabled=use_amp):\n"
947
- ]
948
- },
949
- {
950
- "name": "stdout",
951
- "output_type": "stream",
952
- "text": [
953
- "Validation | Loss: 0.5331 | Acc: 0.9271 | F1: 0.9266\n",
954
- " precision recall f1-score support\n",
955
- "\n",
956
- " Human 0.92 0.94 0.93 198\n",
957
- " AI 0.94 0.92 0.93 200\n",
958
- "\n",
959
- " accuracy 0.93 398\n",
960
- " macro avg 0.93 0.93 0.93 398\n",
961
- "weighted avg 0.93 0.93 0.93 398\n",
962
- "\n",
963
- "Saved improved checkpoint: model_best.pth\n",
964
- "\n",
965
- "Epoch 6/6\n"
966
- ]
967
- },
968
- {
969
- "name": "stderr",
970
- "output_type": "stream",
971
- "text": [
972
- "/tmp/ipykernel_155548/4183901742.py:17: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
973
- " with autocast(enabled=use_amp):\n"
974
- ]
975
- },
976
- {
977
- "name": "stdout",
978
- "output_type": "stream",
979
- "text": [
980
- "Batch 0 | Loss: 0.0000\n"
981
- ]
982
- }
983
- ],
984
- "source": [
- "from torch.cuda.amp import autocast, GradScaler\n",
- "\n",
- "use_amp = device.type == 'cuda'\n",
- "scaler = GradScaler(enabled=use_amp)\n",
- "\n",
- "def train_one_epoch(model, loader):\n",
- "    model.train()\n",
- "    total_loss = 0.0\n",
- "    all_preds, all_true = [], []\n",
- "\n",
- "    optimizer.zero_grad(set_to_none=True)\n",
- "    for batch_idx, batch in enumerate(loader):\n",
- "        input_ids = batch['input_ids'].to(device, non_blocking=True)\n",
- "        attention_mask = batch['attention_mask'].to(device, non_blocking=True)\n",
- "        labels = batch['labels'].to(device, non_blocking=True)\n",
- "\n",
- "        with autocast(enabled=use_amp):\n",
- "            logits = model(input_ids, attention_mask=attention_mask)\n",
- "            loss = loss_fn(logits, labels) / grad_accum_steps\n",
- "\n",
- "        scaler.scale(loss).backward()\n",
- "\n",
- "        if (batch_idx + 1) % grad_accum_steps == 0 or (batch_idx + 1) == len(loader):\n",
- "            # Unscale first so the clipping threshold applies to the true gradient norms\n",
- "            scaler.unscale_(optimizer)\n",
- "            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n",
- "            scaler.step(optimizer)\n",
- "            scaler.update()\n",
- "            scheduler.step()\n",
- "            optimizer.zero_grad(set_to_none=True)\n",
- "\n",
- "        total_loss += loss.item() * grad_accum_steps\n",
- "        preds = torch.argmax(logits, dim=1)\n",
- "        all_preds.extend(preds.detach().cpu().numpy())\n",
- "        all_true.extend(labels.detach().cpu().numpy())\n",
- "\n",
- "        if batch_idx % 50 == 0:\n",
- "            print(f'Batch {batch_idx} | Loss: {(loss.item() * grad_accum_steps):.4f}')\n",
- "\n",
- "    avg_loss = total_loss / max(len(loader), 1)\n",
- "    train_acc = accuracy_score(all_true, all_preds)\n",
- "    train_f1 = f1_score(all_true, all_preds)\n",
- "    return avg_loss, train_acc, train_f1\n",
- "\n",
- "\n",
- "def evaluate(model, loader):\n",
- "    model.eval()\n",
- "    all_preds, all_true = [], []\n",
- "    total_loss = 0.0\n",
- "\n",
- "    with torch.no_grad():\n",
- "        for batch in loader:\n",
- "            input_ids = batch['input_ids'].to(device, non_blocking=True)\n",
- "            attention_mask = batch['attention_mask'].to(device, non_blocking=True)\n",
- "            labels = batch['labels'].to(device, non_blocking=True)\n",
- "\n",
- "            with autocast(enabled=use_amp):\n",
- "                logits = model(input_ids, attention_mask=attention_mask)\n",
- "                loss = loss_fn(logits, labels)\n",
- "\n",
- "            total_loss += loss.item()\n",
- "            preds = torch.argmax(logits, dim=1)\n",
- "            all_preds.extend(preds.cpu().numpy())\n",
- "            all_true.extend(labels.cpu().numpy())\n",
- "\n",
- "    val_loss = total_loss / max(len(loader), 1)\n",
- "    val_acc = accuracy_score(all_true, all_preds)\n",
- "    val_f1 = f1_score(all_true, all_preds)\n",
- "\n",
- "    print(f'Validation | Loss: {val_loss:.4f} | Acc: {val_acc:.4f} | F1: {val_f1:.4f}')\n",
- "    print(classification_report(all_true, all_preds, target_names=['Human', 'AI']))\n",
- "    return val_loss, val_acc, val_f1\n",
- "\n",
- "\n",
- "# Training with early stopping on validation F1\n",
- "patience = 2\n",
- "best_val_f1 = 0.0\n",
- "epochs_without_improve = 0\n",
- "\n",
- "for epoch in range(1, max_epochs + 1):\n",
- "    print(f'\\nEpoch {epoch}/{max_epochs}')\n",
- "    if device.type == 'cuda':\n",
- "        torch.cuda.empty_cache()\n",
- "\n",
- "    train_loss, train_acc, train_f1 = train_one_epoch(model, train_loader)\n",
- "    print(f'Train | Loss: {train_loss:.4f} | Acc: {train_acc:.4f} | F1: {train_f1:.4f}')\n",
- "\n",
- "    val_loss, val_acc, val_f1 = evaluate(model, val_loader)\n",
- "\n",
- "    if val_f1 > best_val_f1:\n",
- "        best_val_f1 = val_f1\n",
- "        epochs_without_improve = 0\n",
- "        torch.save(model.state_dict(), 'model_best.pth')\n",
- "        print('Saved improved checkpoint: model_best.pth')\n",
- "    else:\n",
- "        epochs_without_improve += 1\n",
- "        if epochs_without_improve >= patience:\n",
- "            print('Early stopping triggered.')\n",
- "            break\n",
- "\n",
- "print(f'Best validation F1: {best_val_f1:.4f}')"
1084
- ]
1085
- },
1086
- {
1087
- "cell_type": "code",
1088
- "execution_count": null,
1089
- "id": "wBIT-kPaswqy",
1090
- "metadata": {
1091
- "id": "wBIT-kPaswqy"
1092
- },
1093
- "outputs": [],
1094
- "source": [
1095
- "# Optional: save current in-memory weights as latest checkpoint\n",
1096
- "torch.save(model.state_dict(), 'model_latest.pth')\n",
1097
- "print('Saved: model_latest.pth')"
1098
- ]
1099
- },
1100
- {
1101
- "cell_type": "code",
1102
- "execution_count": null,
1103
- "id": "19b9652c",
1104
- "metadata": {
1105
- "colab": {
1106
- "base_uri": "https://localhost:8080/"
1107
- },
1108
- "id": "19b9652c",
1109
- "outputId": "e1b12835-b081-4d46-a909-c92cb3b6d230"
1110
- },
1111
- "outputs": [
1112
- {
1113
- "data": {
1114
- "text/plain": [
1115
- "('./nepali_xlmr_classifier/tokenizer_config.json',\n",
1116
- " './nepali_xlmr_classifier/special_tokens_map.json',\n",
1117
- " './nepali_xlmr_classifier/sentencepiece.bpe.model',\n",
1118
- " './nepali_xlmr_classifier/added_tokens.json',\n",
1119
- " './nepali_xlmr_classifier/tokenizer.json')"
1120
- ]
1121
- },
1122
- "execution_count": 41,
1123
- "metadata": {},
1124
- "output_type": "execute_result"
1125
- }
1126
- ],
1127
- "source": [
1128
- "tokenizer.save_pretrained(\"./nepali_xlmr_classifier\")"
1129
- ]
1130
- },
1131
- {
1132
- "cell_type": "code",
1133
- "execution_count": null,
1134
- "id": "eAnrw316iRw8",
1135
- "metadata": {
1136
- "colab": {
1137
- "base_uri": "https://localhost:8080/"
1138
- },
1139
- "id": "eAnrw316iRw8",
1140
- "outputId": "04885bb5-4f06-459b-a83c-40f5e00703fe"
1141
- },
1142
- "outputs": [
1143
- {
1144
- "name": "stdout",
1145
- "output_type": "stream",
1146
- "text": [
1147
- "0\n"
1148
- ]
1149
- }
1150
- ],
1151
- "source": [
1152
- "def predict(text):\n",
1153
- " model.eval()\n",
1154
- " inputs = tokenizer(\n",
1155
- " text,\n",
1156
- " return_tensors='pt',\n",
1157
- " truncation=True,\n",
1158
- " padding=True,\n",
1159
- " max_length=MAX_LEN,\n",
1160
- " )\n",
1161
- " inputs = {k: v.to(device) for k, v in inputs.items()}\n",
1162
- "\n",
1163
- " with torch.no_grad():\n",
1164
- " logits = model(inputs['input_ids'], inputs['attention_mask'])\n",
1165
- " probs = torch.softmax(logits, dim=1)\n",
1166
- " pred = torch.argmax(probs, dim=1).item()\n",
1167
- " confidence = probs[0, pred].item()\n",
1168
- "\n",
1169
- " label = 'AI' if pred == 1 else 'Human'\n",
1170
- " return label, confidence\n",
1171
- "\n",
1172
- "sample = 'अख्तियार दुरुपयोग अनुसन्धान आयोगले सिन्धुपाल्चोक–२ बाट प्रतिनिधिसभा सदस्य निर्वाचित सांसद तथा पूर्वमन्त्री बस्नेतसहित १६ जना र २ कम्पनी विरुद्ध ३ अर्ब २१ करोडभन्दा बढी बिगो कायम गरी बिहीबार विशेष अदालतमा भ्रष्टाचार मुद्दा दायर गरेको छ ।'\n",
1173
- "label, conf = predict(sample)\n",
1174
- "print(f'Prediction: {label} | Confidence: {conf:.4f}')"
1175
- ]
1176
- },
1177
- {
1178
- "cell_type": "code",
1179
- "execution_count": null,
1180
- "id": "lqGrqG51NiQV",
1181
- "metadata": {
1182
- "colab": {
1183
- "base_uri": "https://localhost:8080/"
1184
- },
1185
- "id": "lqGrqG51NiQV",
1186
- "outputId": "6bdae59b-2684-4bd0-f804-d16ebd8272db"
1187
- },
1188
- "outputs": [
1189
- {
1190
- "name": "stdout",
1191
- "output_type": "stream",
1192
- "text": [
1193
- "1\n",
1194
- "1\n",
1195
- "1\n",
1196
- "1\n",
1197
- "1\n",
1198
- "1\n",
1199
- "1\n",
1200
- "1\n",
1201
- "1\n",
1202
- "0\n"
1203
- ]
1204
- }
1205
- ],
1206
- "source": [
1207
- "print(predict(\"इन्टरनेटको सुरुवात सन् १९६९ मा अमेरिकी रक्षा मन्त्रालयले निर्माण गरेको ARPANET नामक प्रोजेक्टबाट भएको हो, जसको उद्देश्य आपसी संचारलाई सहज बनाउने थियो र जसले भविष्यमा इन्टरनेटको रूप लियो\"))\n",
1208
- "\n",
1209
- "print(predict(\"सुरुमा इन्टरनेट केही वैज्ञानिक तथा सरकारी संस्थाहरूमा सीमित रहेको भए पनि, समयक्रममा यसको पहुँच आम नागरिक, विद्यालय, र व्यवसायिक क्षेत्रमा विस्तार हुँदै गयो\"))\n",
1210
- "\n",
1211
- "print(predict(\"ARPANETले कम्प्युटरहरूलाई आपसमा जोड्ने सफल प्रयोग गरेपछि इन्टरनेटको सम्भावना प्रमाणित भयो, जसले गर्दा विश्वभरका अनुसन्धानकर्ताहरू यसप्रति आकर्षित हुन थाले\"))\n",
1212
- "\n",
1213
- "print(predict(\"सन् १९९० को दशकमा विश्वव्यापी रूपमा इन्टरनेट विस्तार हुन थालेपछि मानिसहरू सूचनाको आदान–प्रदान, इमेल, र वेबसाइटहरूको प्रयोगमार्फत डिजिटल संसारमा प्रवेश गर्न थाले।\"))\n",
1214
- "\n",
1215
- "print(predict(\"इन्टरनेटले शिक्षा, स्वास्थ्य, सञ्चार, मनोरञ्जन, तथा व्यापारजस्ता धेरै क्षेत्रहरूमा अभूतपूर्व परिवर्तन ल्याएको छ, जसले गर्दा मानव जीवन सरल, छरितो र प्रभावकारी बनेको छ।\"))\n",
1216
- "\n",
1217
- "print(predict(\"समयसँगै इन्टरनेट एक अत्यावश्यक सेवाको रूपमा विकास भएको छ, जसबिनाको आधुनिक जीवन लगभग असम्भवजस्तै लाग्ने अवस्था सिर्जना भएको छ।\"))\n",
1218
- "\n",
1219
- "print(predict(\"आजको युगमा इन्टरनेट केवल सूचना प्राप्तिको माध्यम मात्र नभई ज्ञानको भण्डार, रचनात्मकता प्रदर्शन गर्ने मंच, तथा रोजगार सृजनाको स्रोत पनि बनिसकेको छ।\"))\n",
1220
- "\n",
1221
- "print(predict(\"इन्टरनेटको प्रभाव त्यति गहिरो भएको छ कि विद्यालयका बालबालिकादेखि वृद्धसम्म यसको प्रयोगमा संलग्न छन्, जसले डिजिटल विभाजनको अवधारणा जन्माएको छ।\"))\n",
1222
- "\n",
1223
- "print(predict(\"इन्टरनेटले विश्वलाई एउटा सानो गाउँमा रूपान्तरण गरेको छ, जहाँ मानिसहरू हजारौं किलोमिटर टाढा भएर पनि एकअर्कासँग प्रत्यक्ष संवाद गर्न सक्छन्।\"))\n",
1224
- "\n",
1225
- "print(predict(\"संसदीय समितिले समन्वयकारी भूमिका निर्वाह गर्दै मनसुनजन्य विपद् जोखिम न्यूनीकरण, विपद् प्रतिकार्यका लागि तयारी गर्न तीन तहकै सरकारलाई निर्देशन दिएको छ।\"))\n"
1226
- ]
1227
- },
1228
- {
1229
- "cell_type": "code",
1230
- "execution_count": null,
1231
- "id": "X2ePCc5Disrt",
1232
- "metadata": {
1233
- "colab": {
1234
- "base_uri": "https://localhost:8080/",
1235
- "height": 35
1236
- },
1237
- "id": "X2ePCc5Disrt",
1238
- "outputId": "a4d27689-28cb-43c0-8333-67f2d3a6e097"
1239
- },
1240
- "outputs": [
1241
- {
1242
- "data": {
1243
- "application/vnd.google.colaboratory.intrinsic+json": {
1244
- "type": "string"
1245
- },
1246
- "text/plain": [
1247
- "'/content/classifier.zip'"
1248
- ]
1249
- },
1250
- "execution_count": 42,
1251
- "metadata": {},
1252
- "output_type": "execute_result"
1253
- }
1254
- ],
1255
- "source": [
1256
- "import shutil\n",
1257
- "\n",
1258
- "# Folder to archive and the destination zip path\n",
1259
- "folder_path = '/content/nepali_xlmr_classifier'\n",
1260
- "zip_path = '/content/classifier.zip'\n",
1261
- "\n",
1262
- "shutil.make_archive(zip_path.replace('.zip', ''), 'zip', folder_path)\n"
1263
- ]
1264
- },
1265
- {
1266
- "cell_type": "code",
1267
- "execution_count": null,
1268
- "id": "4BDzVg2gN7xi",
1269
- "metadata": {
1270
- "colab": {
1271
- "base_uri": "https://localhost:8080/",
1272
- "height": 17
1273
- },
1274
- "id": "4BDzVg2gN7xi",
1275
- "outputId": "ef31798e-24f5-45ad-900f-7528b32ae39f"
1276
- },
1277
- "outputs": [
1278
- {
1279
- "data": {
1280
- "application/javascript": "\n async function download(id, filename, size) {\n if (!google.colab.kernel.accessAllowed) {\n return;\n }\n const div = document.createElement('div');\n const label = document.createElement('label');\n label.textContent = `Downloading \"${filename}\": `;\n div.appendChild(label);\n const progress = document.createElement('progress');\n progress.max = size;\n div.appendChild(progress);\n document.body.appendChild(div);\n\n const buffers = [];\n let downloaded = 0;\n\n const channel = await google.colab.kernel.comms.open(id);\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n\n for await (const message of channel.messages) {\n // Send a message to notify the kernel that we're ready.\n channel.send({})\n if (message.buffers) {\n for (const buffer of message.buffers) {\n buffers.push(buffer);\n downloaded += buffer.byteLength;\n progress.value = downloaded;\n }\n }\n }\n const blob = new Blob(buffers, {type: 'application/binary'});\n const a = document.createElement('a');\n a.href = window.URL.createObjectURL(blob);\n a.download = filename;\n div.appendChild(a);\n a.click();\n div.remove();\n }\n ",
1281
- "text/plain": [
1282
- "<IPython.core.display.Javascript object>"
1283
- ]
1284
- },
1285
- "metadata": {},
1286
- "output_type": "display_data"
1287
- },
1288
- {
1289
- "data": {
1290
- "application/javascript": "download(\"download_33034c8f-76d5-48d0-b7cd-3d066ac8e32f\", \"classifier.zip\", 6596694)",
1291
- "text/plain": [
1292
- "<IPython.core.display.Javascript object>"
1293
- ]
1294
- },
1295
- "metadata": {},
1296
- "output_type": "display_data"
1297
- }
1298
- ],
1299
- "source": [
1300
- "from google.colab import files\n",
1301
- "\n",
1302
- "files.download(zip_path)\n"
1303
- ]
1304
- },
1305
- {
1306
- "cell_type": "code",
1307
- "execution_count": null,
1308
- "id": "2jJkcOlw_R1k",
1309
- "metadata": {
1310
- "id": "2jJkcOlw_R1k"
1311
- },
1312
- "outputs": [],
1313
- "source": [
1314
- "torch.save(model.state_dict(), \"final_model.pth\") # AFTER training with classification head\n"
1315
- ]
1316
- },
1317
- {
1318
- "cell_type": "code",
1319
- "execution_count": null,
1320
- "id": "xnHr1IDABebZ",
1321
- "metadata": {
1322
- "colab": {
1323
- "base_uri": "https://localhost:8080/"
1324
- },
1325
- "id": "xnHr1IDABebZ",
1326
- "outputId": "95761a2d-56fa-418c-de03-d66d1ae662ee"
1327
- },
1328
- "outputs": [
1329
- {
1330
- "name": "stdout",
1331
- "output_type": "stream",
1332
- "text": [
1333
- "The text is predicted to be: Human\n",
1334
- "1\n",
1335
- "0\n",
1336
- "1\n"
1337
- ]
1338
- }
1339
- ],
1340
- "source": [
1341
- "# Load the saved model and tokenizer for inference. This cell is not self-contained: torch, AutoTokenizer, and the IndicBERTClassifier class must already be imported or defined.\n",
1342
- "\n",
1343
- "# Define the device\n",
1344
- "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
1345
- "\n",
1346
- "# Instantiate the model\n",
1347
- "model = IndicBERTClassifier().to(device)\n",
1348
- "\n",
1349
- "# Load the saved state dictionary\n",
1350
- "# Make sure the path to your saved model file is correct\n",
1351
- "model_path = \"final_model.pth\" # Or \"model_95_acc.pth\" if you saved that one last\n",
1352
- "model.load_state_dict(torch.load(model_path, map_location=device))\n",
1353
- "\n",
1354
- "# Set the model to evaluation mode\n",
1355
- "model.eval()\n",
1356
- "\n",
1357
- "# Load the tokenizer\n",
1358
- "tokenizer_path = \"./nepali_xlmr_classifier\" # Make sure this path is correct\n",
1359
- "tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)\n",
1360
- "\n",
1361
- "# Now the model and tokenizer are loaded and ready to be used for predictions.\n",
1362
- "# You can use the existing `predict` function or write a new one.\n",
1363
- "\n",
1364
- "# Example of using the predict function with the loaded model and tokenizer\n",
1365
- "def predict(text):\n",
1366
- " model.eval() # Ensure model is in evaluation mode\n",
1367
- " inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)\n",
1368
- " inputs = {k: v.to(device) for k, v in inputs.items()}\n",
1369
- " with torch.no_grad():\n",
1370
- " outputs = model(**inputs)\n",
1371
- "\n",
1372
- " # Handle if output is tensor (some versions/models return logits directly)\n",
1373
- " logits = outputs if isinstance(outputs, torch.Tensor) else outputs.logits\n",
1374
- "\n",
1375
- " pred = torch.argmax(logits, dim=1).item()\n",
1376
- " return pred\n",
1377
- "\n",
1378
- "# Example usage with some text\n",
1379
- "text_to_predict = \"This is a test sentence.\" # Replace with your Nepali text\n",
1380
- "predicted_class = predict(text_to_predict)\n",
1381
- "\n",
1382
- "# Interpret the prediction (assuming 0 for Human, 1 for AI based on your previous code)\n",
1383
- "class_label = \"Human\" if predicted_class == 0 else \"AI\"\n",
1384
- "print(f\"The text is predicted to be: {class_label}\")\n",
1385
- "\n",
1386
- "# You can test with more examples as you did before\n",
1387
- "print(predict(\"यी सबै वाक्यहरू इन्टरनेटको विकास, प्रभाव, र चुनौतीहरूको गहिरो सन्दर्भ समेटेर तयार पारिएका छन्। यदि तिमीलाई चाहिएको खण्डमा विशेष विषय (जस्तै शिक्षा, साइबर सुरक्षा, ग्रामीण प्रभाव आदि) चाहिएको हो भने, म त्यही विषयमा केन्द्रित लामो वाक्यहरू पनि दिन सक्छु।\"))\n",
1388
- "print(predict(\"अख्तियार दुरुपयोग अनुसन्धान आयोगले सिन्धुपाल्चोक–२ बाट प्रतिनिधिसभा सदस्य निर्वाचित सांसद तथा पूर्वमन्त्री बस्नेतसहित १६ जना र २ कम्पनी विरुद्ध ३ अर्ब २१ करोडभन्दा बढी बिगो कायम गरी बिहीबार विशेष अदालतमा भ्रष्टाचार मुद्दा दायर गरेको छ । योसँगै बस्नेत सांसद पदबाट स्वतः निलम्बनमा परेका छन् ।\"))\n",
1389
- "print(predict(\"इन्टरनेटको सुरुवात सन् १९६९ मा अमेरिकी रक्षा मन्त्रालयले निर्माण गरेको ARPANET नामक प्रोजेक्टबाट भएको हो, जसको उद्देश्य आपसी संचारलाई सहज बनाउने थियो र जसले भविष्यमा इन्टरनेटको रूप लियो\"))\n"
1390
- ]
1391
- },
1392
- {
1393
- "cell_type": "code",
1394
- "execution_count": null,
1395
- "id": "gG8fnbqyDUpm",
1396
- "metadata": {
1397
- "id": "gG8fnbqyDUpm"
1398
- },
1399
- "outputs": [],
1400
- "source": []
1401
- }
1402
- ],
1403
- "metadata": {
1404
- "accelerator": "TPU",
1405
- "colab": {
1406
- "gpuType": "V28",
1407
- "provenance": []
1408
- },
1409
- "kernelspec": {
1410
- "display_name": "ml",
1411
- "language": "python",
1412
- "name": "python3"
1413
- },
1414
- "language_info": {
1415
- "codemirror_mode": {
1416
- "name": "ipython",
1417
- "version": 3
1418
- },
1419
- "file_extension": ".py",
1420
- "mimetype": "text/x-python",
1421
- "name": "python",
1422
- "nbconvert_exporter": "python",
1423
- "pygments_lexer": "ipython3",
1424
- "version": "3.11.14"
1425
- }
1426
- },
1427
- "nbformat": 4,
1428
- "nbformat_minor": 5
1429
- }
notebook/ai_vs_human_nepali/notebook/documentation.md DELETED
@@ -1,435 +0,0 @@
1
- # Nepali AI vs Human Notebook Documentation
2
-
3
- This folder contains a small notebook series for building an AI-vs-human text detector for Nepali text. The notebooks are not identical copies; they represent the evolution of the project from a lightweight scikit-learn baseline to a stronger hybrid model and a transformer-based experiment.
4
-
5
- ## Notebook Inventory
6
-
7
- The notebooks in this directory are:
8
-
9
- - [main.ipynb](main.ipynb)
10
- - [working model.ipynb](working%20model.ipynb)
11
- - [Nepali_Ai_vs_Human.ipynb](Nepali_Ai_vs_Human.ipynb)
12
- - [final_main.ipynb](final_main.ipynb)
13
-
14
- ## Shared Goal
15
-
16
- All notebooks solve the same binary classification task:
17
-
18
- - Class 0 = Human-written Nepali text
19
- - Class 1 = AI-generated Nepali text
20
-
21
- The notebooks differ in how they prepare the data, which features they extract, and which model family they train.
22
-
23
- ## Shared Data Sources
24
-
25
- Across the notebooks, the dataset is built from one or more CSV files under the notebook dataset folders. The common column pattern is:
26
-
27
- - human_text
28
- - ai_generated_text
29
-
30
- Some notebooks also use:
31
-
32
- - title
33
- - label
34
- - paragraph
35
-
36
- The data preparation usually performs some combination of:
37
-
38
- - dropping null rows
39
- - stripping whitespace
40
- - removing duplicates
41
- - converting two source columns into one text column plus one label column
42
- - balancing classes by sampling
43
- - splitting long texts into smaller chunks
44
-
45
- ## Notebook Relationship
46
-
47
- The notebooks form a progression:
48
-
49
- 1. main.ipynb is the first lightweight sklearn baseline.
50
- 2. working model.ipynb refines the baseline with better text chunking.
51
- 3. Nepali_Ai_vs_Human.ipynb switches to a transformer-style neural model.
52
- 4. final_main.ipynb is the most complete hybrid notebook and is the closest thing to a production workflow.
53
-
54
- ## main.ipynb
55
-
56
- ### Purpose
57
-
58
- This is the earliest baseline notebook. It focuses on a CPU-friendly approach using TF-IDF plus hand-crafted text features, then compares several classic machine learning models.
59
-
60
- ### Data Preparation
61
-
62
- The notebook loads several CSV files and concatenates them into one dataframe. The data is drawn from:
63
-
64
- - ../DATASET/data.csv
65
- - ../DATASET/new_data.csv
66
- - /mnt/linux-data/Work/aiapi/notebook/ai_vs_human_nepali/news_scrap_new2.fixed.csv
67
-
68
- The notebook creates separate cleaned columns for human text and AI text, then stacks them into a single training dataframe with labels.
69
-
70
- Important preprocessing steps:
71
-
72
- - remove URLs
73
- - keep only Nepali Unicode characters and whitespace
74
- - lowercase the text
75
- - remove consecutive repeated words
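The preprocessing steps listed above can be sketched as follows. This is a minimal sketch, not the notebook's own code: the function names, the URL pattern, and the reading of "Nepali Unicode characters" as the Devanagari block U+0900 to U+097F are all assumptions made for illustration.

```python
import re

# Assumed URL pattern; the notebook's actual regex is not shown in the documentation.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Assumption: "Nepali Unicode" means the Devanagari block (U+0900-U+097F),
# which also contains the danda and Devanagari digits.
NON_NEPALI_RE = re.compile(r"[^\u0900-\u097F\s]")


def remove_consecutive_repeats(text: str) -> str:
    """Drop a word when it is identical to the word immediately before it."""
    kept = []
    for word in text.split():
        if not kept or kept[-1] != word:
            kept.append(word)
    return " ".join(kept)


def clean_text(text: str) -> str:
    """Apply the four steps in the order the documentation lists them."""
    text = URL_RE.sub(" ", text)         # 1. remove URLs
    text = NON_NEPALI_RE.sub(" ", text)  # 2. keep only Devanagari and whitespace
    text = text.lower()                  # 3. lowercase (a no-op for Devanagari)
    return remove_consecutive_repeats(text)  # 4. collapse repeats and whitespace
```

Note that a character filter this strict also drops Latin text, ASCII digits, and joiner characters, so it suits a stylometric baseline better than a pipeline that needs the raw text preserved.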
76
-
77
- ### Feature Engineering
78
-
79
- The notebook combines two feature families:
80
-
81
- - Word-level TF-IDF over unigrams and bigrams
82
- - Dense, hand-crafted features based on text structure
83
-
84
- The hand-crafted features include:
85
-
86
- - burstiness statistics from sentence lengths
87
- - average word length
88
- - average sentence length
89
- - lexical diversity
90
- - punctuation ratio
91
- - repeated bigram ratio
92
- - Devanagari diacritic density
93
-
94
- The sparse TF-IDF matrix is concatenated with the dense feature matrix using horizontal stacking.
95
-
96
- ### Models Trained
97
-
98
- The notebook compares several standard classifiers:
99
-
100
- - LogisticRegressionCV
101
- - RidgeClassifierCV
102
- - MultinomialNB
103
- - BernoulliNB
104
- - RandomForestClassifier
105
- - GradientBoostingClassifier
106
- - LinearSVC
107
- - KNeighborsClassifier
108
-
109
- Dense conversion is applied only where needed, such as for LinearSVC and KNeighbors.
110
-
111
- ### Evaluation
112
-
113
- The notebook evaluates the models with:
114
-
115
- - validation accuracy
116
- - weighted F1 score
117
- - classification reports
118
- - confusion matrices
119
- - ROC curves
120
-
121
- The top models are selected by validation accuracy and re-used in later prediction cells.
122
-
123
- ### Prediction Demo
124
-
125
- The notebook includes several sample Nepali texts for inference. It prints per-model predictions and, where possible, confidence values.
126
-
127
- ### Saved Artifacts
128
-
129
- Each model is saved as a pickle file in a local saved_models directory.
130
-
131
- ### Known Issues
132
-
133
- - Several cells are duplicated, especially the dataset loading cells.
134
- - The vectorizer and the feature builder are not saved with the models, so full reloading is incomplete.
135
- - There are repeated prediction sections, which makes the notebook harder to maintain.
136
- - Some cells appear to be placeholders or empty.
137
-
138
- ## working model.ipynb
139
-
140
- ### Purpose
141
-
142
- This notebook is a refinement of main.ipynb. It keeps the same overall classifier strategy but improves how long Nepali articles are handled.
143
-
144
- ### Main Difference From main.ipynb
145
-
146
- The key improvement is sentence chunking:
147
-
148
- - long texts are split into smaller chunks
149
- - chunk boundaries prefer Nepali danda punctuation
150
- - each chunk is limited to a small number of sentences or words
151
-
152
- This makes the dataset more granular and helps the classifier train on smaller, more uniform samples.
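The chunking idea can be illustrated with a short sketch. It is written under stated assumptions: the function name `chunk_by_danda` and the limits `max_sentences` and `max_words` are hypothetical, and the notebook's own `split_into_sentence_chunks` may use different boundaries and thresholds.

```python
import re


def chunk_by_danda(text, max_sentences=3, max_words=60):
    """Group sentences into chunks, splitting after the danda (or ? and !)."""
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    sentences = [s.strip() for s in re.split(r"(?<=[।?!])\s*", text) if s.strip()]

    chunks, current, words = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when either limit would be exceeded.
        if current and (len(current) >= max_sentences or words + n > max_words):
            chunks.append(" ".join(current))
            current, words = [], 0
        current.append(sentence)
        words += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_words` is kept whole in this sketch rather than cut mid-sentence, which keeps chunk boundaries on punctuation at the cost of an occasional oversized chunk.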
153
-
154
- ### Preprocessing
155
-
156
- The notebook defines:
157
-
158
- - clean_text
159
- - remove_auto_repeating
160
- - split_into_sentence_chunks
161
- - expand_texts_to_chunks
162
-
163
- These functions preserve sentence punctuation for chunking, then normalize the cleaned chunks for downstream training.
164
-
165
- ### Feature Engineering and Models
166
-
167
- The rest of the pipeline is essentially the same as main.ipynb:
168
-
169
- - TF-IDF word n-grams
170
- - burstiness and stylometric features
171
- - concatenated sparse + dense representation
172
- - the same family of sklearn classifiers
173
-
174
- ### Evaluation and Inference
175
-
176
- The notebook follows the same model comparison, ROC plotting, confusion matrix plotting, and sample prediction pattern as the baseline notebook.
177
-
178
- ### Saved Artifacts
179
-
180
- Like main.ipynb, the fitted sklearn models are stored under saved_models as individual pickle files.
181
-
182
- ### Known Issues
183
-
184
- - The notebook has redundant cells and repeated code blocks.
185
- - It still does not serialize the vectorizer and feature transformer together with the model artifacts.
186
- - Some prediction logic is repeated more than once.
187
-
188
- ## Nepali_Ai_vs_Human.ipynb
189
-
190
- ### Purpose
191
-
192
- This notebook is the deep learning branch of the project. Instead of hand-crafted features plus classical classifiers, it uses a transformer encoder with a classification head.
193
-
194
- ### Data Preparation
195
-
196
- The notebook reads one CSV file and converts the two-column source format into a single text-label dataframe.
197
-
198
- Important preparation steps:
199
-
200
- - validate required columns
201
- - drop nulls
202
- - build a unified dataframe with text and label
203
- - filter short texts
204
- - drop duplicate text rows
205
- - shuffle the dataset
206
-
207
- The notebook keeps the raw text mostly intact rather than applying aggressive regex cleaning.
208
-
209
- ### Model Architecture
210
-
211
- The model pipeline is built around Hugging Face transformers and PyTorch:
212
-
213
- - tokenizer from a multilingual BERT-style model
214
- - AutoModel backbone
215
- - classification head with dropout
216
- - binary output layer
217
-
218
- The notebook defines a custom PyTorch module named IndicBERTClassifier.
219
-
220
- ### Training Setup
221
-
222
- The notebook uses:
223
-
224
- - train/validation split with stratification
225
- - DataLoader-based batching
226
- - AdamW optimizer
227
- - cross-entropy loss
228
- - linear warmup scheduler
229
- - gradient accumulation
230
- - mixed precision when CUDA is available
231
- - early stopping on validation F1
232
-
233
- This makes it more GPU-oriented than the sklearn notebooks.
234
-
235
- ### Evaluation
236
-
237
- Per-epoch evaluation includes:
238
-
239
- - accuracy
240
- - F1 score
241
- - classification report
242
-
243
- The notebook also saves improved checkpoints when validation F1 improves.
244
-
245
- ### Prediction Demo
246
-
247
- The notebook defines a predict function that:
248
-
249
- - tokenizes the input text
250
- - runs the transformer model
251
- - applies softmax
252
- - returns the predicted class and confidence
253
-
254
- Several sample Nepali sentences are passed through the predictor at the end of the notebook.
255
-
256
- ### Saved Artifacts
257
-
258
- The notebook saves:
259
-
260
- - model_best.pth
261
- - model_latest.pth
262
- - tokenizer files in ./nepali_xlmr_classifier
263
-
264
- There is also a Colab-oriented zip export section.
265
-
266
- ### Known Issues
267
-
268
- - The notebook mixes local notebook execution with Colab-specific code.
269
- - Some cells show CUDA or environment-related warnings.
270
- - The training flow is more complex and less polished than the final hybrid notebook.
271
- - Paths are hard-coded in a few places.
272
-
273
- ## final_main.ipynb
274
-
275
- ### Purpose
276
-
277
- This is the most complete notebook in the folder. It combines semantic embeddings from Sentence Transformers with stylometric features, then trains a linear model and an XGBoost model on the fused feature vector.
278
-
279
- ### Data Preparation
280
-
281
- The notebook reads the dataset from:
282
-
283
- - ../DATASET/Final_data/final_news345.csv
284
- - /mnt/linux-data/Work/aiapi/notebook/ai_vs_human_nepali/Final_data/final_news345.csv
285
-
286
- The notebook expects a label column with string values and maps them to binary classes.
287
-
288
- It also includes a preprocessing utility that can:
289
-
290
- - split very long Nepali texts into chunks
291
- - preserve danda-based sentence boundaries
292
- - filter out extremely short chunks
293
- - balance the dataset by sampling each class to the same count
294
-
295
- ### Visualization
296
-
297
- The notebook includes exploratory plots for:
298
-
299
- - class distribution
300
- - character count distribution
301
- - word count distribution
302
- - sentence count distribution
303
- - cleaned text length distribution
304
- - stylometric feature comparison plots
305
-
306
- This makes it the most documented and inspection-friendly notebook in the folder.
307
-
308
- ### Text Cleaning
309
-
310
- The notebook defines clean_nepali_text, which:
311
-
312
- - lowercases the text
313
- - normalizes Nepali and common Unicode punctuation
314
- - removes unwanted characters
315
- - collapses repeated whitespace
316
- - trims the result
317
-
318
- This cleaned text is used for both embeddings and stylometric extraction.
319
-
320
- ### Stylometric Features
321
-
322
- The notebook uses six hand-crafted features:
323
-
324
- - word_count
325
- - sentence_count
326
- - avg_word_length
327
- - avg_sentence_length
328
- - type_token_ratio
329
- - punctuation_ratio
330
-
331
- These features are extracted from the cleaned text and then standardized with StandardScaler.
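One plausible way to compute the six features is sketched below. The feature names come from the list above, but the exact definitions (whitespace tokenization, which characters count as punctuation, how sentences are split) are assumptions, since the documentation does not spell them out.

```python
import re

# Assumed punctuation set; the notebook may count a different set of characters.
PUNCTUATION = set("।,.;:!?-\"'()")


def stylometric_features(text):
    """Return the six stylometric features as a dict keyed by feature name."""
    words = text.split()  # whitespace tokenization (assumption)
    sentences = [s for s in re.split(r"[।.!?]+", text) if s.strip()]
    word_count = len(words)
    sentence_count = len(sentences)
    return {
        "word_count": word_count,
        "sentence_count": sentence_count,
        "avg_word_length": sum(len(w) for w in words) / word_count if word_count else 0.0,
        "avg_sentence_length": word_count / sentence_count if sentence_count else 0.0,
        "type_token_ratio": len(set(words)) / word_count if word_count else 0.0,
        "punctuation_ratio": sum(ch in PUNCTUATION for ch in text) / len(text) if text else 0.0,
    }
```

Every ratio falls back to 0.0 on empty input, so very short or empty chunks do not raise a division error before they are filtered out.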
332
-
333
- ### Semantic Embeddings
334
-
335
- The notebook uses the Sentence Transformers model:
336
-
337
- - sentence-transformers/paraphrase-multilingual-mpnet-base-v2
338
-
339
- This produces 768-dimensional multilingual sentence embeddings. The notebook loads the embedder on CPU to reduce CUDA memory pressure.
340
-
341
- ### Feature Fusion
342
-
343
- The final feature matrix is built by concatenating:
344
-
345
- - 768 embedding dimensions
346
- - 6 scaled stylometric dimensions
347
-
348
- So each sample becomes a 774-dimensional vector.
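The dimension arithmetic (768 + 6 = 774) can be checked with a small dependency-free sketch. It stands in for the notebook's actual code, which presumably uses NumPy and scikit-learn's StandardScaler. The helper names `standardize_columns` and `fuse` are hypothetical, and the embeddings here are dummy values rather than real model output.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Scale each column to zero mean and unit variance (population std).

    Constant columns are left at 0.0 by dividing by 1.0 instead of 0.
    """
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def fuse(embeddings, stylo_rows):
    """Append the scaled stylometric vector to each embedding vector."""
    scaled = standardize_columns(stylo_rows)
    return [list(emb) + sty for emb, sty in zip(embeddings, scaled)]


# Two dummy samples: 768-dim embeddings plus 6 raw stylometric values each.
embeddings = [[0.0] * 768, [1.0] * 768]
stylo = [[10, 2, 4.0, 5.0, 0.8, 0.1], [20, 4, 5.0, 5.0, 0.6, 0.3]]
fused = fuse(embeddings, stylo)
print(len(fused[0]))
```

In a real pipeline the scaling statistics would be fitted on the training split only and reused at inference time, which is why the saved bundle needs to include the fitted scaler.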
349
-
350
- ### Models Trained
351
-
352
- Two models are trained on the fused features:
353
-
354
- - Logistic Regression
355
- - XGBoost
356
-
357
- XGBoost is configured with class imbalance handling through scale_pos_weight.
358
-
359
- ### Evaluation
360
-
361
- The notebook evaluates both models using:
362
-
363
- - accuracy
364
- - precision
365
- - recall
366
- - F1 score
367
- - confusion matrices
368
- - ROC curves and AUC
369
-
370
- It also computes and visualizes XGBoost feature importance.
371
-
372
- ### Prediction Flow
373
-
374
- The prediction function follows this exact sequence:
375
-
376
- 1. clean the input
377
- 2. extract stylometric features
378
- 3. build the sentence embedding
379
- 4. scale the stylometric vector
380
- 5. concatenate the two feature blocks
381
- 6. predict with XGBoost
382
-
383
- The function returns a dictionary containing the label, numeric class id, and probability.
384
-
385
- ### Saved Artifacts
386
-
387
- The notebook saves a joblib bundle at:
388
-
389
- - ../models/ai_text_detector_model.pkl
390
-
391
- The saved artifact includes:
392
-
393
- - xgb_model
394
- - lr_model
395
- - scaler
396
- - embed_model name string
397
- - stylo_cols
398
- - label_map
399
-
400
- ### Known Issues
401
-
402
- - The XGBoost fit call passes the test set as eval_set. That is acceptable for monitoring, but if early stopping or tuning relies on it, the reported test metrics become optimistic; a separate validation split would keep the evaluation strict.
403
- - The embedding model name is saved, but the embedder itself is not serialized.
404
- - The notebook is the strongest production candidate, but it still lacks a separate load-and-predict helper for end users.
405
-
406
- ## Comparison Summary
407
-
408
- | Notebook | Main Approach | Strength | Weakness |
409
- |---|---|---|---|
410
- | main.ipynb | TF-IDF + stylometry + classic ML | Simple baseline, easy to inspect | Repetitive and not fully serializable |
411
- | working model.ipynb | TF-IDF + stylometry + chunking | Better handling of long text | Still mostly a baseline notebook |
412
- | Nepali_Ai_vs_Human.ipynb | Transformer classifier | Strong semantic modeling | Heavier, more environment-sensitive |
413
- | final_main.ipynb | Sentence embeddings + stylometry + XGBoost | Best balance of performance, clarity, and deployability | Uses a saved model name string instead of serializing the embedder |
414
-
415
- ## Recommended Reading Order
416
-
417
- If you want to understand the project evolution, read the notebooks in this order:
418
-
419
- 1. main.ipynb
420
- 2. working model.ipynb
421
- 3. Nepali_Ai_vs_Human.ipynb
422
- 4. final_main.ipynb
423
-
424
- If you only want the most useful notebook for reuse or deployment, start with final_main.ipynb.
425
-
426
- ## Practical Notes
427
-
428
- - Several notebooks contain duplicated or stale cells from experimentation.
429
- - Not every cell has been executed successfully.
430
- - Paths are sometimes hard-coded for the local workspace, so moving the folder may require path cleanup.
431
- - The project alternates between three styles of modeling: classical sklearn, transformer fine-tuning, and hybrid embedding-based classification.
432
-
433
- ## Suggested Next Step
434
-
435
- A useful next document would be an inference guide that explains how to load the saved model bundle from final_main.ipynb and run predictions on new Nepali text.
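
As a starting point for such an inference guide, here is a minimal sketch of a sanity check on the saved bundle. It assumes the bundle, once loaded with `joblib.load`, is a plain dict whose keys match the artifact names listed in the summary; the exact key names (in particular `embed_model`) and both helper functions are assumptions, not confirmed by the notebook.

```python
# Key names are assumed to match the artifact list in the summary.
EXPECTED_KEYS = (
    "xgb_model",
    "lr_model",
    "scaler",
    "embed_model",
    "stylo_cols",
    "label_map",
)


def missing_bundle_keys(bundle):
    """Return the expected keys that are absent from a loaded bundle dict."""
    return [key for key in EXPECTED_KEYS if key not in bundle]


def embedder_needs_reload(bundle):
    """True when the bundle stores only the embedding model's name string.

    In that case the embedder itself has to be re-created from the name
    before any text can be encoded.
    """
    return isinstance(bundle.get("embed_model"), str)
```

A loader could call `missing_bundle_keys` right after loading and fail early, then use `embedder_needs_reload` to decide whether the embedding model must be instantiated again from its saved name.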
 
notebook/ai_vs_human_nepali/notebook/final_main.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
notebook/ai_vs_human_nepali/notebook/main.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
notebook/ai_vs_human_nepali/notebook/working model.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
notebook/ai_vs_human_nepali/topic_scrapper.ipynb DELETED
@@ -1,455 +0,0 @@
1
- {
2
- "cells": [
3
- {
4
- "cell_type": "code",
5
- "execution_count": 15,
6
- "id": "4b53d4bc",
7
- "metadata": {},
8
- "outputs": [],
9
- "source": [
10
- "# # Groq Nepali Rewriter\n",
11
- "\n",
12
- "# This notebook loads the dataset, builds a Nepali rewrite prompt, tests one sample, and then saves a batch output CSV using the Groq API.\n",
13
- "\n",
14
- "# Requirements:\n",
15
- "# - `GROQ_API_KEY2` must be available in `.env`\n",
16
- "# - the input file must contain a supported text column (`Title`/`शीर्षक`, `Paragraph`/`विवरण`, or `paragraph`)"
17
- ]
18
- },
19
- {
20
- "cell_type": "code",
21
- "execution_count": 16,
22
- "id": "6c8dc1cb",
23
- "metadata": {},
24
- "outputs": [
25
- {
26
- "data": {
27
- "text/plain": [
28
- "True"
29
- ]
30
- },
31
- "execution_count": 16,
32
- "metadata": {},
33
- "output_type": "execute_result"
34
- }
35
- ],
36
- "source": [
37
- "import os\n",
38
- "import re\n",
39
- "import time\n",
40
- "from concurrent.futures import ThreadPoolExecutor, as_completed\n",
41
- "\n",
42
- "import pandas as pd\n",
43
- "from dotenv import load_dotenv\n",
44
- "from groq import Groq\n",
45
- "\n",
46
- "load_dotenv()"
47
- ]
48
- },
49
- {
50
- "cell_type": "code",
51
- "execution_count": 17,
52
- "id": "019adfa8",
53
- "metadata": {},
54
- "outputs": [],
55
- "source": [
56
- "api_key = os.getenv(\"GROQ_API_KEY2\")\n",
57
- "if not api_key:\n",
58
- "    raise ValueError(\"GROQ_API_KEY2 not found in .env or environment.\")\n",
59
- "\n",
60
- "client = Groq(api_key=api_key)\n",
61
- "MODEL_NAME = \"llama-3.3-70b-versatile\""
62
- ]
63
- },
64
- {
65
- "cell_type": "code",
66
- "execution_count": 18,
67
- "id": "4b4d2bbe",
68
- "metadata": {},
69
- "outputs": [],
70
- "source": [
71
- "data = pd.read_csv(\"DATASET/topics_1000.csv\")"
72
- ]
73
- },
74
- {
75
- "cell_type": "code",
76
- "execution_count": 19,
77
- "id": "c36cfbbf",
78
- "metadata": {},
79
- "outputs": [
80
- {
81
- "data": {
82
- "text/html": [
83
- "<div>\n",
84
- "<style scoped>\n",
85
- " .dataframe tbody tr th:only-of-type {\n",
86
- " vertical-align: middle;\n",
87
- " }\n",
88
- "\n",
89
- " .dataframe tbody tr th {\n",
90
- " vertical-align: top;\n",
91
- " }\n",
92
- "\n",
93
- " .dataframe thead th {\n",
94
- " text-align: right;\n",
95
- " }\n",
96
- "</style>\n",
97
- "<table border=\"1\" class=\"dataframe\">\n",
98
- " <thead>\n",
99
- " <tr style=\"text-align: right;\">\n",
100
- " <th></th>\n",
101
- " <th>id</th>\n",
102
- " <th>topic</th>\n",
103
- " </tr>\n",
104
- " </thead>\n",
105
- " <tbody>\n",
106
- " <tr>\n",
107
- " <th>0</th>\n",
108
- " <td>1</td>\n",
109
- " <td>नेपालमा कृत्रिम बुद्धिमत्ता विकासको वर्तमान अव...</td>\n",
110
- " </tr>\n",
111
- " <tr>\n",
112
- " <th>1</th>\n",
113
- " <td>2</td>\n",
114
- " <td>नेपालको शिक्षा प्रणालीमा डिजिटल प्रविधिको प्रभाव</td>\n",
115
- " </tr>\n",
116
- " <tr>\n",
117
- " <th>2</th>\n",
118
- " <td>3</td>\n",
119
- " <td>काठमाडौँ उपत्यकाको वायु प्रदूषण समस्या</td>\n",
120
- " </tr>\n",
121
- " <tr>\n",
122
- " <th>3</th>\n",
123
- " <td>4</td>\n",
124
- " <td>नेपालमा जलवायु परिवर्तनका असरहरू</td>\n",
125
- " </tr>\n",
126
- " <tr>\n",
127
- " <th>4</th>\n",
128
- " <td>5</td>\n",
129
- " <td>ग्रामीण क्षेत्रमा इन्टरनेट पहुँचको विस्तार</td>\n",
130
- " </tr>\n",
131
- " </tbody>\n",
132
- "</table>\n",
133
- "</div>"
134
- ],
135
- "text/plain": [
136
- " id topic\n",
137
- "0 1 नेपालमा कृत्रिम बुद्धिमत्ता विकासको वर्तमान अव...\n",
138
- "1 2 नेपालको शिक्षा प्रणालीमा डिजिटल प्रविधिको प्रभाव\n",
139
- "2 3 काठमाडौँ उपत्यकाको वायु प्रदूषण समस्या\n",
140
- "3 4 नेपालमा जलवायु परिवर्तनका असरहरू\n",
141
- "4 5 ग्रामीण क्षेत्रमा इन्टरनेट पहुँचको विस्तार"
142
- ]
143
- },
144
- "execution_count": 19,
145
- "metadata": {},
146
- "output_type": "execute_result"
147
- }
148
- ],
149
- "source": [
150
- "data.head()"
151
- ]
152
- },
153
- {
154
- "cell_type": "code",
155
- "execution_count": 20,
156
- "id": "b6e226b8",
157
- "metadata": {},
158
- "outputs": [],
159
- "source": [
160
- "import numpy as np\n",
161
- "def build_prompt(paragraph):\n",
162
- " style = [\n",
163
- " \"Use simple and clear language.\",\n",
164
- " \"Make it engaging and interesting to read.\",\n",
165
- " \"Use a conversational tone.\",\n",
166
- " \"Keep the original meaning intact.\",\n",
167
- " \"Avoid complex jargon and technical terms.\",\n",
168
- " \"Use short sentences and paragraphs.\",\n",
169
- " \"Add examples or anecdotes to illustrate points.\",\n",
170
- " \"Use active voice instead of passive voice.\",\n",
171
- " \"Include a call to action or a thought-provoking question at the end.\",\n",
172
- " ]\n",
173
- "    selected_style_random_single = np.random.choice(style, size=len(style), replace=False)  # Shuffle all style guidelines into a random order\n",
174
- " prompt = f\"\"\"\n",
175
- "    Write an essay on the following topic in pure Nepali, with no English:\n",
176
- " {paragraph}\n",
177
- " Rewrite the above paragraph in Nepali, following these style guidelines:\n",
178
- " {', '.join(selected_style_random_single)}\n",
179
- " \"\"\"\n",
180
- " return prompt.strip()"
181
- ]
182
- },
183
- {
184
- "cell_type": "code",
185
- "execution_count": 21,
186
- "id": "cf16922b",
187
- "metadata": {},
188
- "outputs": [
189
- {
190
- "name": "stdout",
191
- "output_type": "stream",
192
- "text": [
193
- "नेपालमा कृत्रिम बुद्धिमत्ता विकासको वर्तमान अवस्था\n",
194
- "\n",
195
- "कृत्रिम बुद्धिमत्ता विकास नेपालको लागि एक नयाँ युग हो । यो प्राविधिक क्षेत्र दिन-प्रतिदिन विकसित हुने क्रममा छ । नेपालमा कृत्रिम बुद्धिमत्ताले विभिन्न क्षेत्रमा परिवर्तन ल्याउने क्षमता राख्दछ । जस्तै: स्वास्थ्य सेवामा, शिक्षामा, वित्तीय सेवामा, तथा उत्पादन क्षेत्रमा ।\n",
196
- "\n",
197
- "नेपालमा कृत्रिम बुद्धिमत्ताको विकासले नयाँ अवस्था प्राप्त गरिरहेको छ । यो देशमा विभिन्न प्राविधिक कम्पनीहरुले कृत्रिम बुद्धिमत्ताको विकासमा लगनशील छन् । तसर्थ, यसले नेपालमा रोजगारीको अवसर पनि बढाउने छ । उदाहरणको लागि, कृत्रिम बुद्धिमत्ताले स्वास्थ्य सेवामा रोग निदान गर्ने, रोगको उपचार सुझाउने, तथा व्यक्तिको स्वास्थ्य जाँच गर्ने काम गर्नसक्ने छ ।\n",
198
- "\n",
199
- "कृत्रिम बुद्धिमत्ताको विकासले नेपालको अर्थतन्त्रमा पनि परिवर्तन ल्याउने छ । यसले व्यवसायिक क्षेत्रमा उत्पादनशीलता बढाउने, उत्पादन मुल्य कम गर्ने, तथा गुणस्तर मापन गर्ने काम गर्नसक्ने छ । उदाहरणको लागि, कृत्रिम बुद्धिमत्ताले वित्तीय सेवामा लेनदेनको निरीक्षण गर्ने, धोकाधोकाको मुल्यांकन गर्ने, तथा वित्तीय संस्थाहरुलाई सुझाव दिने काम गर्नसक्ने छ ।\n",
200
- "\n",
201
- "नेपालमा कृत्रिम बुद्धिमत्ता विकासको वर्तमान अवस्थाले देशलाई एक नयाँ दिशामा लम्बने क्षमता राख्दछ । तर, यसको विकासमा चुनौतिहरु पनि छन् । जस्तै: डाटा सुरक्षा, निजताको हनन, तथा श्रमिकहरुको प्रतिस्पर्धी क्षमता । तसर्थ, नेपालमा कृत्रिम बुद्धिमत्ताको विकासलाई प्रोत्साहित गर्नको लागि, हामीले यसको विकासमा लगनशील कम्पनीहरुलाई साथ दिनु पर्छ । हामीले पनि कृत्रिम बुद्धिमत्ताको विकासमा योगदान पुर्याउनुपर्छ ।\n",
202
- "\n",
203
- "आह, नेपालमा कृत्रिम बुद्धिमत्ता विकासको वर्तमान अवस्थाले देशलाई एक नयाँ दिशामा लम्बने क्षमता राख्दछ । तर, यसको विकासमा हामी के गरिरहेका छौ? हामीले कृत्रिम बुद्धिमत्ताको विकासमा योगदान पुर्याउने छौ कि? हामीले यसको विकासमा चुनौतिहरुलाई मात गर्ने छौ कि? यस प्रश्नको उत्तर हामीसँग छ । आउनうभ, हामी नेपालमा कृत्रिम बुद्धिमत्ताको विकासलाई प्रोत्साहित गरौं । आउनूभ, हामी देशलाई एक नयाँ दिशामा लम्बौं ।\n"
204
- ]
205
- }
206
- ],
207
- "source": [
208
- "build_prompt = build_prompt\n",
209
- "\n",
210
- "sample_title = str(data.iloc[0][\"topic\"])\n",
211
- "\n",
212
- "sample_response = client.chat.completions.create(\n",
213
- " model=MODEL_NAME,\n",
214
- " messages=[{\"role\": \"user\", \"content\": build_prompt(sample_title)}],\n",
215
- ")\n",
216
- "\n",
217
- "generated_text = sample_response.choices[0].message.content.strip()\n",
218
- "print(generated_text)"
219
- ]
220
- },
221
- {
222
- "cell_type": "code",
223
- "execution_count": null,
224
- "id": "c709f126",
225
- "metadata": {},
226
- "outputs": [],
227
- "source": [
228
- "def grok_step3_5_scraper(\n",
229
- " input_file,\n",
230
- " output_file=\"step3_5_grok_nepali.csv\",\n",
231
- " limit=100,\n",
232
- " model=MODEL_NAME,\n",
233
- " requests_per_second=2,\n",
234
- " max_workers=2,\n",
235
- " max_retries=3,\n",
236
- "):\n",
237
- " working_df = pd.read_csv(input_file)\n",
238
- " if limit is not None:\n",
239
- " working_df = working_df.head(limit)\n",
240
- "\n",
241
- " cols = set(working_df.columns)\n",
242
- " if \"Title\" in cols or \"शीर्षक\" in cols:\n",
243
- " title_col = \"Title\" if \"Title\" in cols else \"शीर्षक\"\n",
244
- " prompt_col = title_col\n",
245
- " if \"Paragraph\" in cols:\n",
246
- " human_col = \"Paragraph\"\n",
247
- " elif \"विवरण\" in cols:\n",
248
- " human_col = \"विवरण\"\n",
249
- " elif \"paragraph\" in cols:\n",
250
- " human_col = \"paragraph\"\n",
251
- " else:\n",
252
- " human_col = prompt_col\n",
253
- " elif \"paragraph\" in cols or \"Paragraph\" in cols or \"विवरण\" in cols:\n",
254
- " prompt_col = (\n",
255
- " \"paragraph\" if \"paragraph\" in cols\n",
256
- " else (\"Paragraph\" if \"Paragraph\" in cols else \"विवरण\")\n",
257
- " )\n",
258
- " human_col = prompt_col\n",
259
- " title_col = prompt_col\n",
260
- " else:\n",
261
- " raise ValueError(\n",
262
- " \"No supported text columns found. Expected one of: Title/शीर्षक with Paragraph/विवरण, or paragraph.\"\n",
263
- " )\n",
264
- "\n",
265
- " working_df = working_df.dropna(subset=[human_col]).copy()\n",
266
- "\n",
267
- " total_input_rows = len(working_df)\n",
268
- " already_done = 0\n",
269
- "\n",
270
- " if os.path.exists(output_file):\n",
271
- " try:\n",
272
- " existing_df = pd.read_csv(output_file)\n",
273
- " already_done = len(existing_df)\n",
274
- " except pd.errors.EmptyDataError:\n",
275
- " already_done = 0\n",
276
- "\n",
277
- " if already_done >= total_input_rows:\n",
278
- " print(\n",
279
- " f\"Nothing to do. {already_done} rows already exist in {output_file} (input rows: {total_input_rows}).\"\n",
280
- " )\n",
281
- " return\n",
282
- "\n",
283
- " if already_done > 0:\n",
284
- " working_df = working_df.iloc[already_done:].copy()\n",
285
- " print(\n",
286
- " f\"Resuming from row {already_done}. Processing remaining {len(working_df)} rows out of {total_input_rows}.\"\n",
287
- " )\n",
288
- " else:\n",
289
- " print(f\"Loaded {total_input_rows} rows from {input_file}\")\n",
290
- " print(\n",
291
- " f\"Using title column: {title_col} | prompt column: {prompt_col} | human column: {human_col}\"\n",
292
- " )\n",
293
- "\n",
294
- " results = []\n",
295
- "\n",
296
- " bad_markers = [\n",
297
- " \"error\",\n",
298
- " \"invalid\",\n",
299
- " \"not found\",\n",
300
- " \"decommissioned\",\n",
301
- " \"rate limit\",\n",
302
- " \"api key\",\n",
303
- " ]\n",
304
- "\n",
305
- " def is_valid_ai_text(text: str) -> bool:\n",
306
- " if not text:\n",
307
- " return False\n",
308
- " clean_text = text.strip()\n",
309
- " if len(clean_text) < 20:\n",
310
- " return False\n",
311
- " lower_text = clean_text.lower()\n",
312
- " return not any(marker in lower_text for marker in bad_markers)\n",
313
- "\n",
314
- " def extract_retry_wait_seconds(error_text: str) -> float:\n",
315
- " match = re.search(r\"try again in\\s*(\\d+)ms\", error_text, re.IGNORECASE)\n",
316
- " if match:\n",
317
- " return int(match.group(1)) / 1000.0 + 0.2\n",
318
- " return 1.5\n",
319
- "\n",
320
- " def process_one(idx, title_text, prompt_text, human_text):\n",
321
- " local_client = Groq(api_key=api_key)\n",
322
- "\n",
323
- " for attempt in range(max_retries + 1):\n",
324
- " try:\n",
325
- " completion = local_client.chat.completions.create(\n",
326
- " model=model,\n",
327
- " messages=[{\"role\": \"user\", \"content\": build_prompt(str(prompt_text))}],\n",
328
- " temperature=0.2,\n",
329
- " max_tokens=500,\n",
330
- " )\n",
331
- " ai_text = completion.choices[0].message.content.strip()\n",
332
- "\n",
333
- " if not is_valid_ai_text(ai_text):\n",
334
- " if attempt < max_retries:\n",
335
- " continue\n",
336
- " return {\n",
337
- " \"idx\": idx,\n",
338
- " \"ok\": False,\n",
339
- " \"reason\": \"invalid_or_error_text\",\n",
340
- " \"ai_text\": ai_text,\n",
341
- " }\n",
342
- "\n",
343
- " return {\n",
344
- " \"idx\": idx,\n",
345
- " \"ok\": True,\n",
346
- " \"title\": str(title_text),\n",
347
- " \"human_text\": str(human_text),\n",
348
- " \"ai_generated_text\": ai_text,\n",
349
- " }\n",
350
- " except Exception as error:\n",
351
- " error_text = str(error)\n",
352
- " is_rate_limited = (\n",
353
- " \"rate_limit_exceeded\" in error_text.lower()\n",
354
- " or \"rate limit reached\" in error_text.lower()\n",
355
- " or \"429\" in error_text\n",
356
- " )\n",
357
- "\n",
358
- " if is_rate_limited and attempt < max_retries:\n",
359
- " wait_seconds = extract_retry_wait_seconds(error_text)\n",
360
- " print(\n",
361
- " f\"Row {idx} rate-limited, retry {attempt + 1}/{max_retries} after {wait_seconds:.2f}s\"\n",
362
- " )\n",
363
- " time.sleep(wait_seconds)\n",
364
- " continue\n",
365
- "\n",
366
- " return {\n",
367
- " \"idx\": idx,\n",
368
- " \"ok\": False,\n",
369
- " \"reason\": error_text,\n",
370
- " \"ai_text\": \"\",\n",
371
- " }\n",
372
- "\n",
373
- " rows = list(working_df[[title_col, prompt_col, human_col]].itertuples(index=True, name=None))\n",
374
- " total = len(rows)\n",
375
- "\n",
376
- " for start in range(0, total, requests_per_second):\n",
377
- " window = rows[start : start + requests_per_second]\n",
378
- " tick_start = time.time()\n",
379
- "\n",
380
- " with ThreadPoolExecutor(max_workers=max_workers) as executor:\n",
381
- " futures = {\n",
382
- " executor.submit(process_one, idx, title_text, prompt_text, human_text): idx\n",
383
- " for idx, title_text, prompt_text, human_text in window\n",
384
- " }\n",
385
- "\n",
386
- " for future in as_completed(futures):\n",
387
- " out = future.result()\n",
388
- " if out[\"ok\"]:\n",
389
- " # Save as id + ai_gen only\n",
390
- " results.append({\n",
391
- " \"id\": out[\"idx\"],\n",
392
- " \"ai_gen\": out[\"ai_generated_text\"]\n",
393
- " })\n",
394
- " print(\n",
395
- " f\"Row {out['idx']}: generated {len(out['ai_generated_text'].split())} words\"\n",
396
- " )\n",
397
- " else:\n",
398
- " print(f\"Row {out['idx']} skipped: {out['reason']}\")\n",
399
- "\n",
400
- " if len(results) >= 10:\n",
401
- " pd.DataFrame(results)[[\"id\", \"ai_gen\"]].to_csv(\n",
402
- " output_file,\n",
403
- " index=False,\n",
404
- " mode=\"a\",\n",
405
- " header=not os.path.exists(output_file),\n",
406
- " )\n",
407
- " print(f\"Saved {len(results)} valid rows to {output_file}\")\n",
408
- " results = []\n",
409
- "\n",
410
- " elapsed = time.time() - tick_start\n",
411
- " if elapsed < 1:\n",
412
- " time.sleep(1 - elapsed)\n",
413
- "\n",
414
- " if results:\n",
415
- " pd.DataFrame(results)[[\"id\", \"ai_gen\"]].to_csv(\n",
416
- " output_file,\n",
417
- " index=False,\n",
418
- " mode=\"a\",\n",
419
- " header=not os.path.exists(output_file),\n",
420
- " )\n",
421
- "\n",
422
- " print(f\"Finished. Output saved to {output_file}\")"
423
- ]
424
- },
425
- {
426
- "cell_type": "code",
427
- "execution_count": null,
428
- "id": "357ccb81",
429
- "metadata": {},
430
- "outputs": [],
431
- "source": []
432
- }
433
- ],
434
- "metadata": {
435
- "kernelspec": {
436
- "display_name": "ml",
437
- "language": "python",
438
- "name": "python3"
439
- },
440
- "language_info": {
441
- "codemirror_mode": {
442
- "name": "ipython",
443
- "version": 3
444
- },
445
- "file_extension": ".py",
446
- "mimetype": "text/x-python",
447
- "name": "python",
448
- "nbconvert_exporter": "python",
449
- "pygments_lexer": "ipython3",
450
- "version": "3.11.14"
451
- }
452
- },
453
- "nbformat": 4,
454
- "nbformat_minor": 5
455
- }
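
A note on the resume logic in `grok_step3_5_scraper`: it resumes by counting the rows already in the output file and skipping that many input rows. Because rows that fail validation are never written, and `as_completed` yields results out of order, that count can drift out of step with the input. One alternative is to resume by comparing ids. This is a minimal sketch under that assumption; `pending_ids` is a hypothetical helper, not part of the notebook.

```python
def pending_ids(input_ids, done_ids):
    """Return the input ids not yet present in the output, in input order.

    Hypothetical helper: `input_ids` would be the input DataFrame's index and
    `done_ids` the `id` column read back from the existing output CSV.
    """
    done = set(done_ids)  # set lookup keeps the filter linear in the input size
    return [row_id for row_id in input_ids if row_id not in done]
```

Since the notebook writes the DataFrame index as `id`, the returned list could be used to select the remaining rows by label rather than by position, so previously skipped rows get retried instead of shifting the offset.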