---
language:
- en
tags:
- text-classification
- cybersecurity
- http-attack-detection
- intrusion-detection
- web-security
- tfidf
- xgboost
- lightgbm
- sklearn
- keras
license: mit
metrics:
- accuracy
- f1
---
# HTTP Attack Classification Models
A collection of machine learning models for detecting and classifying HTTP-based cyber attacks from raw request logs.
Each model takes a raw HTTP request string as input and classifies it into one of 9 attack categories.
---
## Task
- **Task**: Multi-class Text Classification
- **Domain**: Network Security / Intrusion Detection
- **Input**: Raw HTTP request string (method, path, headers, body)
- **Output**: One of 9 attack type labels
---
## Attack Types
| Class | Description | Common Indicators |
|-------|-------------|-------------------|
| `Vulnerability_Scan` | Automated scanning for known vulnerabilities | sqlmap, nikto, nmap user-agents; repeated probing patterns |
| `System_Cmd_Execution` | OS command injection attempts | `\|`, `;`, `&&`, `wget`, `curl`, `/bin/sh`, `boot.ini` |
| `HOST_Scan` | Network host discovery and port scanning | Minimal headers, bare `GET /`, nmap scripting engine |
| `Path_Disclosure` | Directory traversal and file path exposure | `../`, `..%2F`, `/etc/passwd`, `/etc/shadow`, `/proc/` |
| `SQL_Injection` | SQL injection in query parameters | `UNION SELECT`, `OR 1=1`, `--`, `'`, boolean-based blind patterns |
| `Cross_Site_Scripting` | XSS payload injection | `<script>`, `onerror=`, `javascript:`, `alert()`, `prompt()` |
| `Automatically_Searching_Infor` | Automated information gathering | Crawlers, `/_vti_pvt/`, `robots.txt`, `sitemap.xml` probing |
| `Leakage_Through_NW` | Sensitive file access via network | Access to config files, logs, backups (`.ico`, `.conf`, `.bak`) |
| `Directory_Indexing` | Browsing exposed directory listings | Trailing `/` on directory paths, source/workspace/src paths |
---
## Models
| File | Model | Feature Extraction | Test Accuracy | Notes |
|------|-------|--------------------|:-------------:|-------|
| `tdidf-svc.joblib` | TF-IDF + LinearSVC | word, default | 87.4% | Best generalization |
| `xgb_char.joblib` | TF-IDF + XGBoost | char, ngram(1,2), max_features=1024 | **88.5%** | Best local accuracy |
| `xgb_word.joblib` | TF-IDF + XGBoost | word, NLTK tokenizer | 86.7% | |
| `lgb_model.joblib` | TF-IDF + LightGBM | word, NLTK tokenizer | 86.5% | |
| `rf_nltk.joblib` | TF-IDF + RandomForest | word, NLTK, n_estimators=1000 | 84.8% | |
| `rf_gridsearch.joblib` | TF-IDF + RandomForest | word, GridSearchCV best | 83.6% | best: max_depth=None, n_estimators=150 |
| `rf_basic.joblib` | TF-IDF + RandomForest | word, default | 83.3% | |
| `catboost.joblib` | TF-IDF + CatBoost | word, NLTK tokenizer | 83.0% | |
| `multinomial_nb.joblib` | CountVectorizer + MultinomialNB | word, default | 67.5% | Baseline |
| `lstm_bidirectional.h5` | BiLSTM | Keras Tokenizer, maxlen=216 | 85.2% | Requires Keras/TF |
| `textcnn_model.h5` | TextCNN | Keras Tokenizer, maxlen=256 | 86.1% | Requires Keras/TF |
---
## Usage
### Preprocessing
```python
import urllib.parse

def preprocess(payload: str) -> str:
    # URL-decode the request: %XX escapes become characters, '+' becomes a space.
    return urllib.parse.unquote_plus(payload)
```
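To make the effect of this decoding step concrete, the sketch below applies the same `unquote_plus` call to a percent-encoded request line. The sample payload is made up for illustration and is not taken from the training data.

```python
import urllib.parse


def preprocess(payload: str) -> str:
    # Same decoding step as above: %XX escapes become characters, '+' becomes a space.
    return urllib.parse.unquote_plus(payload)


# Hypothetical percent-encoded request line, for illustration only.
encoded = "GET /search?q=%27+OR+1%3D1-- HTTP/1.1"
decoded = preprocess(encoded)
print(decoded)
# GET /search?q=' OR 1=1-- HTTP/1.1
```

Note that `unquote_plus` decodes a single layer only, so a double-encoded sequence such as `%253C` comes out as `%3C` rather than `<`.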
### sklearn-based models (joblib)
Applies to: `tdidf-svc.joblib`, `xgb_char.joblib`, `xgb_word.joblib`, `lgb_model.joblib`, `rf_*.joblib`, `catboost.joblib`, `multinomial_nb.joblib`
Each file is a scikit-learn Pipeline with the vectorizer and classifier bundled together, so raw text can be passed directly.
```python
import joblib
model = joblib.load("xgb_char.joblib")
payloads = [
    "GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1\r\n",
    "GET /search?q=' OR 1=1-- HTTP/1.1\r\nHost: example.com\r\n",
]
predictions = model.predict(payloads)
print(predictions)
# ['Path_Disclosure', 'SQL_Injection']
```
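For readers curious how a bundled pipeline of this kind can be put together, here is a minimal sketch. It is not the training code behind these files. The vectorizer settings follow the `xgb_char.joblib` row of the Models table, `LinearSVC` is swapped in for XGBoost so the example depends only on scikit-learn, and the four training strings are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up training set, for illustration only (not the real data).
train_texts = [
    "GET /../../etc/passwd HTTP/1.1",
    "GET /..%2F..%2Fetc/shadow HTTP/1.1",
    "GET /search?q=' OR 1=1-- HTTP/1.1",
    "GET /item?id=1 UNION SELECT null HTTP/1.1",
]
train_labels = ["Path_Disclosure", "Path_Disclosure", "SQL_Injection", "SQL_Injection"]

# Character n-grams (1,2) with a 1024-feature cap, as listed for xgb_char.joblib.
# ASSUMPTION: LinearSVC stands in for the XGBoost classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 2), max_features=1024)),
    ("clf", LinearSVC()),
])
pipeline.fit(train_texts, train_labels)

# Because the vectorizer lives inside the pipeline, predict() accepts raw strings.
predictions = pipeline.predict(["GET /../../../etc/hosts HTTP/1.1"])
print(predictions)
```

A pipeline built this way can be saved with `joblib.dump(pipeline, path)` and reloaded with `joblib.load(path)`, which is the loading pattern shown above.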
### Keras-based models (.h5)
```python
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
import joblib
model = load_model("lstm_bidirectional.h5") # or textcnn_model.h5
tokenizer = joblib.load("tokenizer.joblib") # must be saved separately during training
payloads = ["GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1"]
sequences = tokenizer.texts_to_sequences(payloads)
padded = pad_sequences(sequences, maxlen=216) # maxlen=256 for TextCNN
pred = model.predict(padded)
label_idx = np.argmax(pred, axis=1)
print(label_idx)
```
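The snippet above prints class indices rather than label names. The sketch below shows one way to map a softmax output back to names. The alphabetical label order is an assumption (it is what a scikit-learn `LabelEncoder` would produce) and should be checked against the encoder used in training. `decode_predictions` is a hypothetical helper, not part of this repository.

```python
from typing import List, Sequence

# ASSUMPTION: alphabetical label order, as a LabelEncoder would assign it.
# Verify against the encoder used during training before relying on this.
LABELS: List[str] = sorted([
    "Vulnerability_Scan", "System_Cmd_Execution", "HOST_Scan",
    "Path_Disclosure", "SQL_Injection", "Cross_Site_Scripting",
    "Automatically_Searching_Infor", "Leakage_Through_NW", "Directory_Indexing",
])


def decode_predictions(probs: Sequence[Sequence[float]],
                       labels: Sequence[str] = LABELS) -> List[str]:
    """Return the highest-probability label for each row of softmax output."""
    decoded = []
    for row in probs:
        if len(row) != len(labels):
            raise ValueError("expected one probability per label")
        best = max(range(len(row)), key=lambda i: row[i])
        decoded.append(labels[best])
    return decoded


# Made-up softmax output for one request, peaking at index 5.
example = [[0.01, 0.02, 0.01, 0.03, 0.01, 0.85, 0.03, 0.02, 0.02]]
print(decode_predictions(example))
# ['Path_Disclosure']
```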
---
## Evaluation
### Per-model summary
| Model | Accuracy | Macro F1 | Weakest Class (F1) |
|-------|:--------:|:--------:|--------------------|
| TF-IDF + XGBoost (char) | 88.5% | 0.92 | System_Cmd_Execution (0.84) |
| TF-IDF + LinearSVC | 87.4% | n/a | System_Cmd_Execution |
| TF-IDF + XGBoost (word) | 86.7% | 0.90 | System_Cmd_Execution (0.83) |
| TF-IDF + LightGBM | 86.5% | 0.91 | System_Cmd_Execution (0.83) |
| TextCNN | 86.1% | 0.89 | System_Cmd_Execution (0.79) |
| BiLSTM | 85.2% | 0.89 | System_Cmd_Execution (0.78) |
### Per-class observations
- **Easiest classes**: `Automatically_Searching_Infor` and `Leakage_Through_NW` achieve F1 ≥ 0.99 across all models. Highly distinctive tool signatures (nmap, crawlers) and file access patterns make them trivial to separate.
- **Hardest class**: `System_Cmd_Execution` consistently scores the lowest F1 (0.75–0.84) due to pattern overlap with `Vulnerability_Scan`. Both classes involve probing behavior with similar HTTP structure.
- **char-level XGBoost advantage**: Sub-word character n-grams capture attack-specific tokens like `../`, `<script>`, `UNION` more robustly than word tokenization, especially for obfuscated payloads.
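
As a reminder of how the Macro F1 column relates to per-class scores, the sketch below computes per-class F1 and their unweighted mean for a toy confusion between `System_Cmd_Execution` and `Vulnerability_Scan`. The ten labels are invented for the example and the helper names are hypothetical. The reported numbers come from the real test set, not from this code.

```python
from collections import Counter
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 per class, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data: two of four command-execution requests are mistaken for scans.
y_true = ["System_Cmd_Execution"] * 4 + ["Vulnerability_Scan"] * 6
y_pred = ["System_Cmd_Execution"] * 2 + ["Vulnerability_Scan"] * 8

scores = per_class_f1(y_true, y_pred)
print({label: round(value, 3) for label, value in scores.items()})
# {'System_Cmd_Execution': 0.667, 'Vulnerability_Scan': 0.857}
print(round(macro_f1(y_true, y_pred), 3))
# 0.762
```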
---
## Architecture Details
### BiLSTM
```
Embedding(22,883 vocab, dim=100, maxlen=216)
→ Bidirectional(LSTM(64)) → LSTM(32) → Dense(512) → Dense(9, softmax)
```
- EarlyStopping(monitor=val_accuracy, patience=3), triggered at epoch 20
- Saved: `lstm_bidirectional.h5` (28 MB)
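
A rough parameter count for this stack can be worked out by hand. The sketch below is an estimate only. It assumes standard LSTM layers with bias terms, an embedding table with exactly 22,883 rows, and a second LSTM that receives the concatenated 128-dimensional output of the bidirectional layer. None of these details are confirmed by the saved model.

```python
def lstm_params(input_dim: int, units: int) -> int:
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim: int, units: int) -> int:
    # Weight matrix plus bias vector.
    return input_dim * units + units


embedding = 22_883 * 100               # vocab size x embedding dim (assumed)
bilstm = 2 * lstm_params(100, 64)      # forward and backward LSTM(64)
lstm = lstm_params(2 * 64, 32)         # ASSUMPTION: concatenated 128-dim input
dense_hidden = dense_params(32, 512)
dense_out = dense_params(512, 9)

total = embedding + bilstm + lstm + dense_hidden + dense_out
print(f"{total:,}")
# 2,414,901
```

Under these assumptions the embedding table accounts for roughly 95% of the weights.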
### TextCNN
```
Embedding(20,000 vocab, dim=128, maxlen=256)
→ Conv1D(128, kernel=3) ─┐
→ Conv1D(128, kernel=4) ─┼→ GlobalMaxPool → Concat(384) → Dense(256) → Dropout(0.3) → Dense(9, softmax)
→ Conv1D(128, kernel=5) ─┘
Total params: 2.86M
```
- EarlyStopping(monitor=val_loss, patience=3), triggered at epoch 5
- Saved: `textcnn_model.h5` (33 MB)
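
The 2.86M total quoted in the diagram can be reproduced with simple arithmetic. The sketch below assumes that each Conv1D branch reads the 128-dimensional embedding directly, that all layers use bias terms, and that the embedding table has exactly 20,000 rows. The helper names are made up for this example.

```python
def conv1d_params(kernel: int, in_channels: int, filters: int) -> int:
    # One kernel x in_channels weight block per filter, plus one bias per filter.
    return kernel * in_channels * filters + filters


def dense_params(input_dim: int, units: int) -> int:
    # Weight matrix plus bias vector.
    return input_dim * units + units


embedding = 20_000 * 128                                    # vocab x dim (assumed)
convs = sum(conv1d_params(k, 128, 128) for k in (3, 4, 5))  # three parallel branches
dense_hidden = dense_params(3 * 128, 256)                   # Concat(384) -> Dense(256)
dense_out = dense_params(256, 9)

total = embedding + convs + dense_hidden + dense_out
print(f"{total:,}")
# 2,857,865
```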
---
## Key Findings
- **TF-IDF outperforms deep learning** on HTTP attack data: the best TF-IDF model (char-level XGBoost, 88.5%) beats the best neural model (TextCNN, 86.1%). Attack patterns rely on decisive keywords (`UNION SELECT`, `../`, `<script>`, `wget`). Bag-of-words representations capture these directly, while sequential models can be distracted by irrelevant header noise.
- **char-level features beat word-level**: Character n-grams handle URL encoding variations and partial token matches more effectively (e.g., `%3Cscript%3E` vs `<script>`).
- **Class imbalance effect**: `Vulnerability_Scan` dominates at 37.5%, so models tend to over-predict this class for ambiguous samples.
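
The point about character n-grams and URL encoding can be illustrated with a small comparison. The sketch below is a simplified stand-in for a character analyzer, not scikit-learn's exact tokenization: it does no lowercasing or TF-IDF weighting. It shows that `<script>` and `%3Cscript%3E` share no whitespace-delimited token but do share many 1- and 2-character n-grams.

```python
from typing import Set


def char_ngrams(text: str, n_min: int = 1, n_max: int = 2) -> Set[str]:
    """Set of all character n-grams of length n_min..n_max in text."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


plain = "<script>"
encoded = "%3Cscript%3E"

# Word-level view: each string is a single, distinct token.
shared_words = set(plain.split()) & set(encoded.split())
print(len(shared_words))
# 0

# Character-level view: the 'script' core yields overlapping n-grams.
shared_ngrams = char_ngrams(plain) & char_ngrams(encoded)
print(sorted(shared_ngrams))
# ['c', 'cr', 'i', 'ip', 'p', 'pt', 'r', 'ri', 's', 'sc', 't']
```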
---
## Environment
| Item | Value |
|------|-------|
| Python | 3.12 |
| scikit-learn | 1.x |
| XGBoost | 3.2.0 |
| LightGBM | 4.6.0 |
| CatBoost | 1.2.10 |
| TensorFlow / Keras | 2.x |
---
## License
MIT License