---
language:
  - en
tags:
  - text-classification
  - cybersecurity
  - http-attack-detection
  - intrusion-detection
  - web-security
  - tfidf
  - xgboost
  - lightgbm
  - sklearn
  - keras
license: mit
metrics:
  - accuracy
  - f1
---

# HTTP Attack Classification Models

A collection of machine learning models for detecting and classifying HTTP-based cyber attacks from raw request logs.  
Each model takes a raw HTTP request string as input and classifies it into one of 9 attack categories.

---

## Task

- **Task**: Multi-class Text Classification
- **Domain**: Network Security / Intrusion Detection
- **Input**: Raw HTTP request string (method, path, headers, body)
- **Output**: One of 9 attack type labels

---

## Attack Types

| Class | Description | Common Indicators |
|-------|-------------|-------------------|
| `Vulnerability_Scan` | Automated scanning for known vulnerabilities | sqlmap, nikto, nmap user-agents; repeated probing patterns |
| `System_Cmd_Execution` | OS command injection attempts | `\|`, `;`, `&&`, `wget`, `curl`, `/bin/sh`, `boot.ini` |
| `HOST_Scan` | Network host discovery and port scanning | Minimal headers, bare `GET /`, nmap scripting engine |
| `Path_Disclosure` | Directory traversal and file path exposure | `../`, `..%2F`, `/etc/passwd`, `/etc/shadow`, `/proc/` |
| `SQL_Injection` | SQL injection in query parameters | `UNION SELECT`, `OR 1=1`, `--`, `'`, boolean-based blind patterns |
| `Cross_Site_Scripting` | XSS payload injection | `<script>`, `onerror=`, `javascript:`, `alert()`, `prompt()` |
| `Automatically_Searching_Infor` | Automated information gathering | Crawlers, `/_vti_pvt/`, `robots.txt`, `sitemap.xml` probing |
| `Leakage_Through_NW` | Sensitive file access via network | Access to config files, logs, backups (`.ico`, `.conf`, `.bak`) |
| `Directory_Indexing` | Browsing exposed directory listings | Trailing `/` on directory paths, source/workspace/src paths |

---

## Models

| File | Model | Feature Extraction | Test Accuracy | Notes |
|------|-------|--------------------|:-------------:|-------|
| `tdidf-svc.joblib` | TF-IDF + LinearSVC | word, default | 87.4% | Best generalization |
| `xgb_char.joblib` | TF-IDF + XGBoost | char, ngram(1,2), max_features=1024 | **88.5%** | Best local accuracy |
| `xgb_word.joblib` | TF-IDF + XGBoost | word, NLTK tokenizer | 86.7% | |
| `lgb_model.joblib` | TF-IDF + LightGBM | word, NLTK tokenizer | 86.5% | |
| `rf_nltk.joblib` | TF-IDF + RandomForest | word, NLTK, n_estimators=1000 | 84.8% | |
| `rf_gridsearch.joblib` | TF-IDF + RandomForest | word, GridSearchCV best | 83.6% | best: max_depth=None, n_estimators=150 |
| `rf_basic.joblib` | TF-IDF + RandomForest | word, default | 83.3% | |
| `catboost.joblib` | TF-IDF + CatBoost | word, NLTK tokenizer | 83.0% | |
| `multinomial_nb.joblib` | CountVectorizer + MultinomialNB | word, default | 67.5% | Baseline |
| `lstm_bidirectional.h5` | BiLSTM | Keras Tokenizer, maxlen=216 | 85.2% | Requires Keras/TF |
| `textcnn_model.h5` | TextCNN | Keras Tokenizer, maxlen=256 | 86.1% | Requires Keras/TF |

---

## Usage

### Preprocessing

```python
import urllib.parse

def preprocess(payload: str) -> str:
    return urllib.parse.unquote_plus(payload)
```
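
As a quick illustration of what this step normalizes, the sketch below runs `preprocess` on two percent-encoded fragments. The sample payloads are made up for illustration and are not taken from the training data.

```python
import urllib.parse

def preprocess(payload: str) -> str:
    # Decode %XX escapes and turn '+' into a space (form-encoding convention).
    return urllib.parse.unquote_plus(payload)

# Illustrative payloads (hypothetical, not from the dataset).
encoded_xss = "GET /search?q=%3Cscript%3Ealert(1)%3C%2Fscript%3E HTTP/1.1"
encoded_traversal = "GET /..%2F..%2Fetc%2Fpasswd HTTP/1.1"

decoded_xss = preprocess(encoded_xss)
decoded_traversal = preprocess(encoded_traversal)

print(decoded_xss)        # GET /search?q=<script>alert(1)</script> HTTP/1.1
print(decoded_traversal)  # GET /../../etc/passwd HTTP/1.1
```

Note that `unquote_plus` decodes a single level only: a double-encoded `%253C` becomes `%3C`, not `<`.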

### sklearn-based models (joblib)

Applies to: `tdidf-svc.joblib`, `xgb_char.joblib`, `xgb_word.joblib`, `lgb_model.joblib`, `rf_*.joblib`, `catboost.joblib`, `multinomial_nb.joblib`

Each file is a scikit-learn `Pipeline` with the vectorizer and classifier bundled together, so raw text can be passed directly to `predict`.

```python
import joblib

model = joblib.load("xgb_char.joblib")

payloads = [
    "GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1\r\n",
    "GET /search?q=' OR 1=1-- HTTP/1.1\r\nHost: example.com\r\n",
]
predictions = model.predict(payloads)
print(predictions)
# Illustrative output, assuming the pipeline returns string labels:
# ['Path_Disclosure', 'SQL_Injection']
```

### Keras-based models (.h5)

```python
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
import joblib

model = load_model("lstm_bidirectional.h5")       # or textcnn_model.h5
tokenizer = joblib.load("tokenizer.joblib")        # must be saved separately during training

payloads = ["GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1"]
sequences = tokenizer.texts_to_sequences(payloads)
padded = pad_sequences(sequences, maxlen=216)      # maxlen=256 for TextCNN

pred = model.predict(padded)
label_idx = np.argmax(pred, axis=1)
print(label_idx)
```
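
The Keras snippet prints class indices rather than label names. The sketch below maps softmax rows back to names. It assumes the labels were encoded in sorted (code-point) order, as scikit-learn's `LabelEncoder` would do. This card does not document the actual encoding, so `CLASS_NAMES` must be replaced with the order used during training. `decode_predictions` is a hypothetical helper.

```python
from typing import List, Sequence

# ASSUMPTION: index i corresponds to the i-th label in sorted order.
# Replace this list with the label order actually used at training time.
CLASS_NAMES = sorted([
    "Vulnerability_Scan", "System_Cmd_Execution", "HOST_Scan",
    "Path_Disclosure", "SQL_Injection", "Cross_Site_Scripting",
    "Automatically_Searching_Infor", "Leakage_Through_NW", "Directory_Indexing",
])

def decode_predictions(probabilities: Sequence[Sequence[float]],
                       class_names: Sequence[str] = CLASS_NAMES) -> List[str]:
    """Return the name of the highest-scoring class for each row."""
    labels = []
    for row in probabilities:
        best = max(range(len(row)), key=lambda i: row[i])  # argmax over one row
        labels.append(class_names[best])
    return labels

# Made-up softmax output for one sample, peaking at index 5.
fake_probs = [[0.01, 0.01, 0.02, 0.01, 0.05, 0.80, 0.05, 0.03, 0.02]]
decoded = decode_predictions(fake_probs)
print(decoded)
```

The function also accepts the NumPy array returned by `model.predict(padded)`, since it only relies on `len` and indexing.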

---

## Evaluation

### Per-model summary

| Model | Accuracy | Macro F1 | Weakest Class (F1) |
|-------|:--------:|:--------:|--------------------|
| TF-IDF + XGBoost (char) | 88.5% | 0.92 | System_Cmd_Execution (0.84) |
| TF-IDF + LinearSVC | 87.4% | — | System_Cmd_Execution |
| TF-IDF + XGBoost (word) | 86.7% | 0.90 | System_Cmd_Execution (0.83) |
| TF-IDF + LightGBM | 86.5% | 0.91 | System_Cmd_Execution (0.83) |
| TextCNN | 86.1% | 0.89 | System_Cmd_Execution (0.79) |
| BiLSTM | 85.2% | 0.89 | System_Cmd_Execution (0.78) |

### Per-class observations

- **Easiest classes**: `Automatically_Searching_Infor` and `Leakage_Through_NW` achieve F1 ≥ 0.99 across all models. Distinctive crawler signatures and file-access patterns make them easy to separate.
- **Hardest class**: `System_Cmd_Execution` consistently scores the lowest F1 (0.75–0.84) due to pattern overlap with `Vulnerability_Scan`. Both classes involve probing behavior with similar HTTP structure.
- **char-level XGBoost advantage**: Sub-word character n-grams capture attack-specific tokens like `../`, `<script>`, `UNION` more robustly than word tokenization, especially for obfuscated payloads.
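
The character n-gram point can be illustrated without any ML library. The sketch below is a simplification, not the released vectorizer: `char_ngrams` mimics `ngram(1,2)` character features, and `word_tokens` uses the regex that scikit-learn documents as its default `token_pattern`. Lowercasing and whitespace normalization are omitted, and both helper names are hypothetical.

```python
import re
from typing import List, Set

def char_ngrams(text: str, n_min: int = 1, n_max: int = 2) -> Set[str]:
    """All character n-grams of length n_min..n_max."""
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams

def word_tokens(text: str) -> List[str]:
    """Word tokens of 2+ word characters; punctuation is discarded."""
    return re.findall(r"(?u)\b\w\w+\b", text)

plain = "GET /../../etc/passwd"
obfuscated = "GET /..%2F..%2Fetc/passwd"

shared_chars = char_ngrams(plain) & char_ngrams(obfuscated)

print(word_tokens(plain))       # ['GET', 'etc', 'passwd']
print(word_tokens(obfuscated))  # ['GET', '2F', '2Fetc', 'passwd']
print(".." in shared_chars)     # True
```

In this example the word tokens lose the traversal markers entirely, while bigrams such as `..` and `/.` survive in both the plain and the partially encoded variant.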

---

## Architecture Details

### BiLSTM

```
Embedding(22,883 vocab, dim=100, maxlen=216)
→ Bidirectional(LSTM(64)) → LSTM(32) → Dense(512) → Dense(9, softmax)
```

- EarlyStopping(monitor=val_accuracy, patience=3) — triggered at epoch 20
- Saved: `lstm_bidirectional.h5` (28 MB)

### TextCNN

```
Embedding(20,000 vocab, dim=128, maxlen=256)
→ Conv1D(128, kernel=3) ─┐
→ Conv1D(128, kernel=4) ──→ GlobalMaxPool → Concat(384) → Dense(256) → Dropout(0.3) → Dense(9, softmax)
→ Conv1D(128, kernel=5) ─┘
Total params: 2.86M
```

- EarlyStopping(monitor=val_loss, patience=3) — triggered at epoch 5
- Saved: `textcnn_model.h5` (33 MB)
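
As a sanity check on the stated parameter total, the arithmetic below recomputes it from the layer sizes in the diagram. It assumes every Conv1D and Dense layer has a bias term and that the convolutions read the 128-dimensional embedding directly. The helper names are illustrative.

```python
def embedding_params(vocab: int, dim: int) -> int:
    return vocab * dim

def conv1d_params(in_channels: int, filters: int, kernel: int) -> int:
    return kernel * in_channels * filters + filters  # weights + biases

def dense_params(inputs: int, outputs: int) -> int:
    return inputs * outputs + outputs  # weights + biases

embedding = embedding_params(20_000, 128)                   # 2,560,000
convs = sum(conv1d_params(128, 128, k) for k in (3, 4, 5))  # 196,992
hidden = dense_params(3 * 128, 256)                         # 98,560 (384 concatenated features)
output = dense_params(256, 9)                               # 2,313

total = embedding + convs + hidden + output
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")
```

Under these assumptions the total is 2,857,865, which matches the 2.86M figure. The embedding table accounts for roughly 90% of it.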

---

## Key Findings

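- **The strongest TF-IDF pipelines outperform deep learning** on HTTP attack data: the XGBoost, LinearSVC, and LightGBM pipelines reach 86.5–88.5% test accuracy, versus 86.1% for TextCNN and 85.2% for BiLSTM. Attack patterns rely on decisive keywords (`UNION SELECT`, `../`, `<script>`, `wget`). Bag-of-words representations capture these directly, while sequence models can be distracted by irrelevant header noise.
- **Char-level features beat word-level**: with XGBoost, character n-grams reach 88.5% versus 86.7% for word tokens. Character n-grams handle URL encoding variations and partial token matches more effectively (e.g., `%3Cscript%3E` vs `<script>`).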
- **Class imbalance effect**: `Vulnerability_Scan` dominates at 37.5% — models tend to over-predict this class for ambiguous samples.
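
To make the "decisive keywords" point concrete, the sketch below computes inverse document frequency with the smoothed formula scikit-learn documents as its default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`. The four-request corpus is invented, and whitespace splitting stands in for the real tokenizers. `smoothed_idf` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import Dict, List

def smoothed_idf(documents: List[str]) -> Dict[str, float]:
    """IDF per whitespace token: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))  # count each token once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Toy corpus (made up): three benign requests and one injection attempt.
corpus = [
    "GET /index.html HTTP/1.1",
    "GET /search?q=shoes HTTP/1.1",
    "GET /search?q=1 UNION SELECT password HTTP/1.1",
    "GET /about HTTP/1.1",
]

idf = smoothed_idf(corpus)
print(idf["GET"], idf["UNION"])
```

Tokens present in every request, such as `GET`, receive the minimum weight of 1.0, while a token seen in a single request, such as `UNION`, is weighted at about 1.92.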

---

## Environment

| Item | Value |
|------|-------|
| Python | 3.12 |
| scikit-learn | 1.x |
| XGBoost | 3.2.0 |
| LightGBM | 4.6.0 |
| CatBoost | 1.2.10 |
| TensorFlow / Keras | 2.x |

---

## License

MIT License