---
language:
- en
tags:
- text-classification
- cybersecurity
- http-attack-detection
- intrusion-detection
- web-security
- tfidf
- xgboost
- lightgbm
- sklearn
- keras
license: mit
metrics:
- accuracy
- f1
---
# HTTP Attack Classification Models
A collection of machine learning models for detecting and classifying HTTP-based cyber attacks from raw request logs.
Each model takes a raw HTTP request string as input and classifies it into one of 9 attack categories.
---
## Task
- **Task**: Multi-class Text Classification
- **Domain**: Network Security / Intrusion Detection
- **Input**: Raw HTTP request string (method, path, headers, body)
- **Output**: One of 9 attack type labels
---
## Attack Types
| Class | Description | Common Indicators |
|-------|-------------|-------------------|
| `Vulnerability_Scan` | Automated scanning for known vulnerabilities | sqlmap, nikto, nmap user-agents; repeated probing patterns |
| `System_Cmd_Execution` | OS command injection attempts | `\|`, `;`, `&&`, `wget`, `curl`, `/bin/sh`, `boot.ini` |
| `HOST_Scan` | Network host discovery and port scanning | Minimal headers, bare `GET /`, nmap scripting engine |
| `Path_Disclosure` | Directory traversal and file path exposure | `../`, `..%2F`, `/etc/passwd`, `/etc/shadow`, `/proc/` |
| `SQL_Injection` | SQL injection in query parameters | `UNION SELECT`, `OR 1=1`, `--`, `'`, boolean-based blind patterns |
| `Cross_Site_Scripting` | XSS payload injection | `<script>`, `onerror=`, `javascript:`, `alert()`, `prompt()` |
| `Automatically_Searching_Infor` | Automated information gathering | Crawlers, `/_vti_pvt/`, `robots.txt`, `sitemap.xml` probing |
| `Leakage_Through_NW` | Sensitive file access via network | Access to config files, logs, backups (`.ico`, `.conf`, `.bak`) |
| `Directory_Indexing` | Browsing exposed directory listings | Trailing `/` on directory paths, source/workspace/src paths |
---
## Models
| File | Model | Feature Extraction | Test Accuracy | Notes |
|------|-------|--------------------|:-------------:|-------|
| `tdidf-svc.joblib` | TF-IDF + LinearSVC | word, default | 87.4% | Best generalization |
| `xgb_char.joblib` | TF-IDF + XGBoost | char, ngram(1,2), max_features=1024 | **88.5%** | Best local accuracy |
| `xgb_word.joblib` | TF-IDF + XGBoost | word, NLTK tokenizer | 86.7% | |
| `lgb_model.joblib` | TF-IDF + LightGBM | word, NLTK tokenizer | 86.5% | |
| `rf_nltk.joblib` | TF-IDF + RandomForest | word, NLTK, n_estimators=1000 | 84.8% | |
| `rf_gridsearch.joblib` | TF-IDF + RandomForest | word, GridSearchCV best | 83.6% | best: max_depth=None, n_estimators=150 |
| `rf_basic.joblib` | TF-IDF + RandomForest | word, default | 83.3% | |
| `catboost.joblib` | TF-IDF + CatBoost | word, NLTK tokenizer | 83.0% | |
| `multinomial_nb.joblib` | CountVectorizer + MultinomialNB | word, default | 67.5% | Baseline |
| `lstm_bidirectional.h5` | BiLSTM | Keras Tokenizer, maxlen=216 | 85.2% | Requires Keras/TF |
| `textcnn_model.h5` | TextCNN | Keras Tokenizer, maxlen=256 | 86.1% | Requires Keras/TF |
---
## Usage
### Preprocessing
```python
import urllib.parse


def preprocess(payload: str) -> str:
    # Decode percent-escapes (e.g. %2F -> /) and turn '+' into spaces
    return urllib.parse.unquote_plus(payload)
```
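As a quick illustration of why this decoding step matters, the sketch below runs the same `unquote_plus` call on two percent-encoded request lines. The sample requests are made up for illustration and are not taken from the training data.

```python
import urllib.parse


def preprocess(payload: str) -> str:
    """Decode percent-escapes and turn '+' into spaces."""
    return urllib.parse.unquote_plus(payload)


# Hypothetical encoded request lines (not from the dataset).
encoded_xss = "GET /search?q=%3Cscript%3Ealert(1)%3C%2Fscript%3E HTTP/1.1"
encoded_sqli = "GET /item?id=1+UNION+SELECT+1,2 HTTP/1.1"

decoded_xss = preprocess(encoded_xss)
decoded_sqli = preprocess(encoded_sqli)

print(decoded_xss)   # GET /search?q=<script>alert(1)</script> HTTP/1.1
print(decoded_sqli)  # GET /item?id=1 UNION SELECT 1,2 HTTP/1.1
```

After decoding, keyword-style indicators such as `<script>` and `UNION SELECT` appear in their literal form, which is what the word-level vectorizers match on.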
### sklearn-based models (joblib)
Applies to: `tdidf-svc.joblib`, `xgb_char.joblib`, `xgb_word.joblib`, `lgb_model.joblib`, `rf_*.joblib`, `catboost.joblib`, `multinomial_nb.joblib`
Each file is a scikit-learn Pipeline with the vectorizer and classifier bundled together, so raw text can be passed directly.
```python
import joblib
model = joblib.load("xgb_char.joblib")
payloads = [
    "GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1\r\n",
    "GET /search?q=' OR 1=1-- HTTP/1.1\r\nHost: example.com\r\n",
]
predictions = model.predict(payloads)
print(predictions)
# ['Path_Disclosure', 'SQL_Injection']
```
### Keras-based models (.h5)
```python
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
import joblib
model = load_model("lstm_bidirectional.h5") # or textcnn_model.h5
tokenizer = joblib.load("tokenizer.joblib") # must be saved separately during training
payloads = ["GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1"]
sequences = tokenizer.texts_to_sequences(payloads)
padded = pad_sequences(sequences, maxlen=216) # maxlen=256 for TextCNN
pred = model.predict(padded)
label_idx = np.argmax(pred, axis=1)
print(label_idx)
```
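The Keras snippet prints class indices rather than label names. A minimal sketch of mapping indices back to names follows. The label order used here (sorted, as scikit-learn's `LabelEncoder` would produce) is an assumption, so prefer the encoder or mapping saved with the actual training run where one is available. The probability row is made up.

```python
# ASSUMPTION: labels were integer-encoded in sorted order. The real mapping
# must come from the training run; this list is only illustrative.
LABELS = sorted([
    "Vulnerability_Scan", "System_Cmd_Execution", "HOST_Scan",
    "Path_Disclosure", "SQL_Injection", "Cross_Site_Scripting",
    "Automatically_Searching_Infor", "Leakage_Through_NW",
    "Directory_Indexing",
])


def decode_predictions(probabilities):
    """Return a (label, confidence) pair for each row of softmax outputs."""
    results = []
    for row in probabilities:
        row = list(row)
        best = max(range(len(row)), key=row.__getitem__)  # argmax
        results.append((LABELS[best], row[best]))
    return results


# Made-up softmax output for a single request (9 classes).
fake_pred = [[0.01, 0.02, 0.01, 0.03, 0.01, 0.85, 0.03, 0.02, 0.02]]
print(decode_predictions(fake_pred))  # [('Path_Disclosure', 0.85)]
```

The function accepts plain lists or the NumPy array returned by `model.predict`, since it only iterates over rows.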
---
## Evaluation
### Per-model summary
| Model | Accuracy | Macro F1 | Weakest Class (F1) |
|-------|:--------:|:--------:|--------------------|
| TF-IDF + XGBoost (char) | 88.5% | 0.92 | System_Cmd_Execution (0.84) |
| TF-IDF + LinearSVC | 87.4% | n/a | System_Cmd_Execution |
| TF-IDF + XGBoost (word) | 86.7% | 0.90 | System_Cmd_Execution (0.83) |
| TF-IDF + LightGBM | 86.5% | 0.91 | System_Cmd_Execution (0.83) |
| TextCNN | 86.1% | 0.89 | System_Cmd_Execution (0.79) |
| BiLSTM | 85.2% | 0.89 | System_Cmd_Execution (0.78) |
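
For reference, macro F1 is the unweighted mean of the per-class F1 scores, so a weak class such as `System_Cmd_Execution` lowers it regardless of how many samples that class has. The sketch below computes it from scratch on a made-up three-class example. The labels and predictions are illustrative only and are not results from these models.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: one System_Cmd_Execution request is mistaken for SQL_Injection.
y_true = ["SQL_Injection", "SQL_Injection", "Path_Disclosure",
          "Path_Disclosure", "System_Cmd_Execution", "System_Cmd_Execution"]
y_pred = ["SQL_Injection", "SQL_Injection", "Path_Disclosure",
          "Path_Disclosure", "System_Cmd_Execution", "SQL_Injection"]

score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.8222
```

Here the per-class F1 values are 0.8, 1.0, and 0.667, so a single misclassification affects two classes at once: the true class loses recall and the predicted class loses precision.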
### Per-class observations
- **Easiest classes**: `Automatically_Searching_Infor` and `Leakage_Through_NW` achieve F1 ≥ 0.99 across all models; highly distinctive tool signatures (nmap, crawlers) and file access patterns make them trivial to separate.
- **Hardest class**: `System_Cmd_Execution` consistently scores the lowest F1 (0.75–0.84) due to pattern overlap with `Vulnerability_Scan`. Both classes involve probing behavior with similar HTTP structure.
- **char-level XGBoost advantage**: Sub-word character n-grams capture attack-specific tokens like `../`, `<script>`, `UNION` more robustly than word tokenization, especially for obfuscated payloads.
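
To make the character n-gram point concrete, the sketch below compares whitespace word tokens with character 1- and 2-grams (the `ngram(1,2)` range listed for `xgb_char.joblib`) on a plain payload and a lightly obfuscated variant. The helper functions and sample payloads are illustrative assumptions; this is not the vectorizer used in training.

```python
def char_ngrams(text, n_min=1, n_max=2):
    """Set of all character n-grams of length n_min..n_max."""
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


def word_tokens(text):
    """Set of whitespace-separated tokens."""
    return set(text.split())


def jaccard(a, b):
    """Overlap between two feature sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b)


# Hypothetical payloads: the second inserts spaces inside the tags.
plain = "GET /a?q=<script>alert(1)</script>"
obfuscated = "GET /a?q=<script >alert(1)</script >"

word_sim = jaccard(word_tokens(plain), word_tokens(obfuscated))
char_sim = jaccard(char_ngrams(plain), char_ngrams(obfuscated))
print(f"word overlap: {word_sim:.2f}, char n-gram overlap: {char_sim:.2f}")
```

The inserted spaces split the payload into different word tokens, so only `GET` is shared, while almost all character n-grams survive the change.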
---
## Architecture Details
### BiLSTM
```
Embedding(22,883 vocab, dim=100, maxlen=216)
  → Bidirectional(LSTM(64)) → LSTM(32) → Dense(512) → Dense(9, softmax)
```
- EarlyStopping(monitor=val_accuracy, patience=3): triggered at epoch 20
- Saved: `lstm_bidirectional.h5` (28 MB)
### TextCNN
```
Embedding(20,000 vocab, dim=128, maxlen=256)
  → Conv1D(128, kernel=3) ─┐
  → Conv1D(128, kernel=4) ─┼→ GlobalMaxPool → Concat(384) → Dense(256) → Dropout(0.3) → Dense(9, softmax)
  → Conv1D(128, kernel=5) ─┘
Total params: 2.86M
```
- EarlyStopping(monitor=val_loss, patience=3): triggered at epoch 5
- Saved: `textcnn_model.h5` (33 MB)
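
As a sanity check, the stated TextCNN parameter total can be reproduced from the layer sizes in the diagram. The sketch assumes standard layers with bias terms and no other trainable layers. That assumption is not stated in the source, but it yields a total that rounds to the listed 2.86M.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    # Weights plus one bias per filter
    return kernel_size * in_channels * filters + filters


def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit
    return n_in * n_out + n_out


total = (
    embedding_params(20_000, 128)                          # 2,560,000
    + sum(conv1d_params(k, 128, 128) for k in (3, 4, 5))   # 196,992
    + dense_params(3 * 128, 256)                           # 98,560 (Concat(384) input)
    + dense_params(256, 9)                                 # 2,313
)
print(total)  # 2857865
```

The embedding layer accounts for roughly 90% of the parameters, so vocabulary size and embedding dimension dominate the model size.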
---
## Key Findings
- **The best TF-IDF models outperform deep learning** on HTTP attack data: char-level XGBoost (88.5%) and LinearSVC (87.4%) both beat TextCNN (86.1%) and BiLSTM (85.2%). Attack patterns rely on decisive keywords (`UNION SELECT`, `../`, `<script>`, `wget`). Bag-of-words representations capture these directly, while sequential models can be distracted by irrelevant header noise.
- **char-level features beat word-level**: Character n-grams handle URL encoding variations and partial token matches more effectively (e.g., `%3Cscript%3E` vs `<script>`).
- **Class imbalance effect**: `Vulnerability_Scan` dominates at 37.5%, so models tend to over-predict this class for ambiguous samples.
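
The first finding can be illustrated with a toy TF-IDF computation. The sketch below uses the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default, `idf = ln((1 + n) / (1 + df)) + 1`, on a made-up four-request corpus. It skips lowercasing and L2 normalization and is not the trained vectorizer.

```python
import math

# Hypothetical mini-corpus: three benign requests and one SQL injection.
corpus = [
    "GET /index.html HTTP/1.1 Host: example.com",
    "GET /about HTTP/1.1 Host: example.com",
    "GET /item?id=1 UNION SELECT password HTTP/1.1 Host: example.com",
    "GET /login HTTP/1.1 Host: example.com",
]


def smoothed_idf(term, docs):
    """Smoothed inverse document frequency over whitespace tokens."""
    n = len(docs)
    df = sum(term in doc.split() for doc in docs)  # documents containing term
    return math.log((1 + n) / (1 + df)) + 1


idf_get = smoothed_idf("GET", corpus)      # appears in every request
idf_union = smoothed_idf("UNION", corpus)  # appears only in the attack

print(f"GET: {idf_get:.3f}, UNION: {idf_union:.3f}")  # GET: 1.000, UNION: 1.916
```

Tokens shared by every request, such as `GET` or `Host:`, receive the minimum weight, while a rare attack keyword is weighted nearly twice as heavily, which is one way to read the claim that decisive keywords are captured directly.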
---
## Environment
| Item | Value |
|------|-------|
| Python | 3.12 |
| scikit-learn | 1.x |
| XGBoost | 3.2.0 |
| LightGBM | 4.6.0 |
| CatBoost | 1.2.10 |
| TensorFlow / Keras | 2.x |
---
## License
MIT License