	---
	language:
	- en
	tags:
	- text-classification
	- cybersecurity
	- http-attack-detection
	- intrusion-detection
	- web-security
	- tfidf
	- xgboost
	- lightgbm
	- sklearn
	- keras
	license: mit
	metrics:
	- accuracy
	- f1
	---

	# HTTP Attack Classification Models

	A collection of machine learning models for detecting and classifying HTTP-based cyber attacks from raw request logs.
	Each model takes a raw HTTP request string as input and classifies it into one of 9 attack categories.

	---

	## Task

	- Task: Multi-class Text Classification
	- Domain: Network Security / Intrusion Detection
	- Input: Raw HTTP request string (method, path, headers, body)
	- Output: One of 9 attack type labels

	---

	## Attack Types

	| Class | Description | Common Indicators |
	|-------|-------------|-------------------|
	| `Vulnerability_Scan` | Automated scanning for known vulnerabilities | sqlmap, nikto, nmap user-agents; repeated probing patterns |
	| `System_Cmd_Execution` | OS command injection attempts | `\|`, `;`, `&&`, `wget`, `curl`, `/bin/sh`, `boot.ini` |
	| `HOST_Scan` | Network host discovery and port scanning | Minimal headers, bare `GET /`, nmap scripting engine |
	| `Path_Disclosure` | Directory traversal and file path exposure | `../`, `..%2F`, `/etc/passwd`, `/etc/shadow`, `/proc/` |
	| `SQL_Injection` | SQL injection in query parameters | `UNION SELECT`, `OR 1=1`, `--`, `'`, boolean-based blind patterns |
	| `Cross_Site_Scripting` | XSS payload injection | `<script>`, `onerror=`, `javascript:`, `alert()`, `prompt()` |
	| `Automatically_Searching_Infor` | Automated information gathering | Crawlers, `/_vti_pvt/`, `robots.txt`, `sitemap.xml` probing |
	| `Leakage_Through_NW` | Sensitive file access via network | Access to config files, logs, backups (`.ico`, `.conf`, `.bak`) |
	| `Directory_Indexing` | Browsing exposed directory listings | Trailing `/` on directory paths, source/workspace/src paths |

	---

	## Models

	| File | Model | Feature Extraction | Test Accuracy | Notes |
	|------|-------|--------------------|:-------------:|-------|
	| `tdidf-svc.joblib` | TF-IDF + LinearSVC | word, default | 87.4% | Best generalization |
	| `xgb_char.joblib` | TF-IDF + XGBoost | char, ngram(1,2), max_features=1024 | 88.5% | Best local accuracy |
	| `xgb_word.joblib` | TF-IDF + XGBoost | word, NLTK tokenizer | 86.7% | |
	| `lgb_model.joblib` | TF-IDF + LightGBM | word, NLTK tokenizer | 86.5% | |
	| `rf_nltk.joblib` | TF-IDF + RandomForest | word, NLTK, n_estimators=1000 | 84.8% | |
	| `rf_gridsearch.joblib` | TF-IDF + RandomForest | word, GridSearchCV best | 83.6% | best: max_depth=None, n_estimators=150 |
	| `rf_basic.joblib` | TF-IDF + RandomForest | word, default | 83.3% | |
	| `catboost.joblib` | TF-IDF + CatBoost | word, NLTK tokenizer | 83.0% | |
	| `multinomial_nb.joblib` | CountVectorizer + MultinomialNB | word, default | 67.5% | Baseline |
	| `lstm_bidirectional.h5` | BiLSTM | Keras Tokenizer, maxlen=216 | 85.2% | Requires Keras/TF |
	| `textcnn_model.h5` | TextCNN | Keras Tokenizer, maxlen=256 | 86.1% | Requires Keras/TF |

	---

	## Usage

	### Preprocessing

	```python
	import urllib.parse

	def preprocess(payload: str) -> str:
	    """URL-decode a raw request: percent-escapes are decoded and '+' becomes a space."""
	    return urllib.parse.unquote_plus(payload)
	```
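
As a quick illustration of what this step does, the sketch below runs the same single-pass decoding on a few made-up requests (the sample strings are illustrative, not taken from the dataset). Note that `unquote_plus` removes only one layer of encoding, so a double-encoded payload stays partly encoded.

```python
import urllib.parse


def preprocess(payload: str) -> str:
    # Same single-pass decoding as the function above.
    return urllib.parse.unquote_plus(payload)


# Percent-escapes are decoded and "+" becomes a space.
print(preprocess("GET /search?q=%3Cscript%3Ealert(1)%3C%2Fscript%3E"))
# GET /search?q=<script>alert(1)</script>
print(preprocess("GET /item?id=1+OR+1%3D1"))
# GET /item?id=1 OR 1=1

# Only one layer is removed: "%25" becomes "%", leaving "%2e%2e%2f" behind.
print(preprocess("GET /%252e%252e%252fetc%252fpasswd"))
# GET /%2e%2e%2fetc%2fpasswd
```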

	### sklearn-based models (joblib)

	Applies to: `tdidf-svc.joblib`, `xgb_char.joblib`, `xgb_word.joblib`, `lgb_model.joblib`, `rf_*.joblib`, `catboost.joblib`, `multinomial_nb.joblib`

	Each file is a scikit-learn Pipeline with the vectorizer and classifier bundled together — raw text can be passed directly.

	```python
	import joblib

	model = joblib.load("xgb_char.joblib")

	payloads = [
	    "GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1\r\n",
	    "GET /search?q=' OR 1=1-- HTTP/1.1\r\nHost: example.com\r\n",
	]
	predictions = model.predict(payloads)
	print(predictions)
	# ['Path_Disclosure', 'SQL_Injection']
	```
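
A minimal sketch of chaining the preprocessing step with a loaded pipeline, assuming the models were trained on decoded text as the Preprocessing section suggests. `classify` is a hypothetical helper name, not part of this repository; it only relies on the model exposing a scikit-learn style `predict`.

```python
import urllib.parse


def preprocess(payload: str) -> str:
    # Same decoding as in the Preprocessing section.
    return urllib.parse.unquote_plus(payload)


def classify(model, payloads):
    """Decode each payload, predict, and return (original payload, label) pairs.

    `model` is any object with a scikit-learn style predict(list_of_str) method,
    e.g. one of the joblib pipelines. The helper itself is hypothetical.
    """
    cleaned = [preprocess(p) for p in payloads]
    labels = model.predict(cleaned)
    # Pair labels with the untouched originals so results can be traced to log lines.
    return list(zip(payloads, labels))
```

For example, `classify(joblib.load("xgb_char.joblib"), payloads)` would return one `(payload, label)` pair per request.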

	### Keras-based models (.h5)

	```python
	import numpy as np
	from tensorflow.keras.models import load_model
	from tensorflow.keras.preprocessing.sequence import pad_sequences
	import joblib

	model = load_model("lstm_bidirectional.h5") # or textcnn_model.h5
	tokenizer = joblib.load("tokenizer.joblib") # must be saved separately during training

	payloads = ["GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1"]
	sequences = tokenizer.texts_to_sequences(payloads)
	padded = pad_sequences(sequences, maxlen=216) # maxlen=256 for TextCNN

	pred = model.predict(padded)
	label_idx = np.argmax(pred, axis=1)
	print(label_idx)
	```
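
`model.predict` returns one probability row per request, so the index still has to be mapped back to a class name. The sketch below assumes labels were encoded in alphabetical order (what scikit-learn's `LabelEncoder` would produce); the actual order used during training is not stated here, so `CLASS_NAMES` and `decode_predictions` are hypothetical and should be checked against the training code.

```python
# ASSUMPTION: alphabetical label order, as LabelEncoder would assign it.
# Verify against the encoder used in training before relying on this mapping.
CLASS_NAMES = [
    "Automatically_Searching_Infor",
    "Cross_Site_Scripting",
    "Directory_Indexing",
    "HOST_Scan",
    "Leakage_Through_NW",
    "Path_Disclosure",
    "SQL_Injection",
    "System_Cmd_Execution",
    "Vulnerability_Scan",
]


def decode_predictions(probabilities, class_names=CLASS_NAMES):
    """Map each softmax row to (class name, probability of that class)."""
    decoded = []
    for row in probabilities:
        scores = [float(x) for x in row]
        if len(scores) != len(class_names):
            raise ValueError("expected one score per class")
        # Index of the highest score, equivalent to np.argmax on a single row.
        best = max(range(len(scores)), key=scores.__getitem__)
        decoded.append((class_names[best], scores[best]))
    return decoded
```

With the snippet above, `decode_predictions(pred)` would take the place of the bare `np.argmax` call.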

	---

	## Evaluation

	### Per-model summary

	| Model | Accuracy | Macro F1 | Weakest Class (F1) |
	|-------|:--------:|:--------:|--------------------|
	| TF-IDF + XGBoost (char) | 88.5% | 0.92 | System_Cmd_Execution (0.84) |
	| TF-IDF + LinearSVC | 87.4% | n/a | System_Cmd_Execution |
	| TF-IDF + XGBoost (word) | 86.7% | 0.90 | System_Cmd_Execution (0.83) |
	| TF-IDF + LightGBM | 86.5% | 0.91 | System_Cmd_Execution (0.83) |
	| TextCNN | 86.1% | 0.89 | System_Cmd_Execution (0.79) |
	| BiLSTM | 85.2% | 0.89 | System_Cmd_Execution (0.78) |
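
Macro F1 in the table is the unweighted mean of per-class F1, so it can sit above accuracy when most errors are confusions between two large classes while the smaller classes are classified almost perfectly. A small sketch with toy labels (not model output) makes the arithmetic concrete:

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    # Every class counts equally, regardless of how many samples it has.
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy data: two large classes confused with each other, two small ones perfect.
y_true = ["scan"] * 4 + ["cmd"] * 4 + ["xss", "sqli"]
y_pred = ["scan", "scan", "scan", "cmd"] + ["cmd", "cmd", "cmd", "scan"] + ["xss", "sqli"]

print(accuracy(y_true, y_pred))  # 0.8
print(macro_f1(y_true, y_pred))  # 0.875
```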

	### Per-class observations

	- Easiest classes: `Automatically_Searching_Infor` and `Leakage_Through_NW` achieve F1 ≥ 0.99 across all models — highly distinctive tool signatures (nmap, crawlers) and file access patterns make them trivial to separate.
	- Hardest class: `System_Cmd_Execution` consistently scores the lowest F1 (0.75–0.84) due to pattern overlap with `Vulnerability_Scan`. Both classes involve probing behavior with similar HTTP structure.
	- char-level XGBoost advantage: Sub-word character n-grams capture attack-specific tokens like `../`, `<script>`, `UNION` more robustly than word tokenization, especially for obfuscated payloads.
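
To illustrate the last point, the sketch below compares word tokens with character 1-2 grams on made-up payloads. `word_tokens` mimics scikit-learn's default `token_pattern`, and `char_ngrams` approximates `analyzer="char"` with `ngram_range=(1, 2)`; both are simplified stand-ins written for this example, not the vectorizers stored in the `.joblib` files.

```python
import re


def word_tokens(text):
    # scikit-learn's default token_pattern: runs of 2+ word characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


def char_ngrams(text, n_min=1, n_max=2):
    # Lowercase, collapse whitespace runs, then slide windows of each size.
    text = re.sub(r"\s\s+", " ", text.lower())
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def jaccard(a, b):
    """Set overlap between two feature lists (0.0 = disjoint, 1.0 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


# Word tokenization drops the traversal punctuation entirely.
print(word_tokens("GET /../../etc/passwd"))  # ['get', 'etc', 'passwd']
# Character bigrams keep pieces of it, such as ".." and "./".
print(".." in char_ngrams("GET /../../etc/passwd"))  # True

# Comment-based obfuscation breaks the word features but not the char features.
plain, obfuscated = "union select", "un/**/ion sel/**/ect"
print(jaccard(word_tokens(plain), word_tokens(obfuscated)))  # 0.0
print(jaccard(char_ngrams(plain), char_ngrams(obfuscated)) > 0.5)  # True
```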

	---

	## Architecture Details

	### BiLSTM

	```
	Embedding(22,883 vocab, dim=100, maxlen=216)
	→ Bidirectional(LSTM(64)) → LSTM(32) → Dense(512) → Dense(9, softmax)
	```

	- EarlyStopping(monitor=val_accuracy, patience=3) — triggered at epoch 20
	- Saved: `lstm_bidirectional.h5` (28 MB)

	### TextCNN

	```
	Embedding(20,000 vocab, dim=128, maxlen=256)
	→ Conv1D(128, kernel=3) ─┐
	→ Conv1D(128, kernel=4) ──→ GlobalMaxPool → Concat(384) → Dense(256) → Dropout(0.3) → Dense(9, softmax)
	→ Conv1D(128, kernel=5) ─┘
	Total params: 2.86M
	```

	- EarlyStopping(monitor=val_loss, patience=3) — triggered at epoch 5
	- Saved: `textcnn_model.h5` (33 MB)
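
The stated total of 2.86M parameters can be reproduced from the layer shapes, assuming standard Keras layers with bias terms and no additional trainable layers. A sketch of the arithmetic:

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def conv1d_params(in_channels, filters, kernel_size):
    # kernel_size * in_channels weights per filter, plus one bias per filter.
    return kernel_size * in_channels * filters + filters


def dense_params(inputs, units):
    # Weight matrix plus one bias per unit.
    return inputs * units + units


VOCAB, DIM, FILTERS, NUM_CLASSES = 20_000, 128, 128, 9

counts = {
    "embedding": embedding_params(VOCAB, DIM),
    "conv_k3": conv1d_params(DIM, FILTERS, 3),
    "conv_k4": conv1d_params(DIM, FILTERS, 4),
    "conv_k5": conv1d_params(DIM, FILTERS, 5),
    # Three pooled branches of 128 features are concatenated into 384.
    "dense_256": dense_params(3 * FILTERS, 256),
    "output": dense_params(256, NUM_CLASSES),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>10}: {n:,}")
print(f"{'total':>10}: {total:,}")  # total: 2,857,865 (about 2.86M)
```

The embedding layer accounts for 2,560,000 of these parameters, so vocabulary size dominates the model size.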

	---

	## Key Findings

	- TF-IDF outperforms deep learning on HTTP attack data: attack patterns rely on decisive keywords (`UNION SELECT`, `../`, `<script>`, `wget`). Bag-of-words representations capture these directly, while sequential models can be distracted by irrelevant header noise.
	- char-level features beat word-level: Character n-grams handle URL encoding variations and partial token matches more effectively (e.g., `%3Cscript%3E` vs `<script>`).
	- Class imbalance effect: `Vulnerability_Scan` is the largest class at 37.5% of samples, and models tend to over-predict it for ambiguous requests.

	---

	## Environment

	| Item | Value |
	|------|-------|
	| Python | 3.12 |
	| scikit-learn | 1.x |
	| XGBoost | 3.2.0 |
	| LightGBM | 4.6.0 |
	| CatBoost | 1.2.10 |
	| TensorFlow / Keras | 2.x |

	---

	## License

	MIT License