| language: | |
| - en | |
| tags: | |
| - text-classification | |
| - cybersecurity | |
| - http-attack-detection | |
| - intrusion-detection | |
| - web-security | |
| - tfidf | |
| - xgboost | |
| - lightgbm | |
| - sklearn | |
| - keras | |
| license: mit | |
| metrics: | |
| - accuracy | |
| - f1 | |
| # HTTP Attack Classification Models | |
| A collection of machine learning models for detecting and classifying HTTP-based cyber attacks from raw request logs. | |
| Each model takes a raw HTTP request string as input and classifies it into one of 9 attack categories. | |
| --- | |
| ## Task | |
| - **Task**: Multi-class Text Classification | |
| - **Domain**: Network Security / Intrusion Detection | |
| - **Input**: Raw HTTP request string (method, path, headers, body) | |
| - **Output**: One of 9 attack type labels | |
| --- | |
| ## Attack Types | |
| | Class | Description | Common Indicators | | |
| |-------|-------------|-------------------| | |
| | `Vulnerability_Scan` | Automated scanning for known vulnerabilities | sqlmap, nikto, nmap user-agents; repeated probing patterns | | |
| | `System_Cmd_Execution` | OS command injection attempts | `\|`, `;`, `&&`, `wget`, `curl`, `/bin/sh`, `boot.ini` | | |
| | `HOST_Scan` | Network host discovery and port scanning | Minimal headers, bare `GET /`, nmap scripting engine | | |
| | `Path_Disclosure` | Directory traversal and file path exposure | `../`, `..%2F`, `/etc/passwd`, `/etc/shadow`, `/proc/` | | |
| | `SQL_Injection` | SQL injection in query parameters | `UNION SELECT`, `OR 1=1`, `--`, `'`, boolean-based blind patterns | | |
| | `Cross_Site_Scripting` | XSS payload injection | `<script>`, `onerror=`, `javascript:`, `alert()`, `prompt()` | | |
| | `Automatically_Searching_Infor` | Automated information gathering | Crawlers, `/_vti_pvt/`, `robots.txt`, `sitemap.xml` probing | | |
| | `Leakage_Through_NW` | Sensitive file access via network | Access to config files, logs, backups (`.ico`, `.conf`, `.bak`) | | |
| | `Directory_Indexing` | Browsing exposed directory listings | Trailing `/` on directory paths, source/workspace/src paths | | |
| --- | |
| ## Models | |
| | File | Model | Feature Extraction | Test Accuracy | Notes | | |
| |------|-------|--------------------|:-------------:|-------| | |
| | `tdidf-svc.joblib` | TF-IDF + LinearSVC | word, default | 87.4% | Best generalization | | |
| | `xgb_char.joblib` | TF-IDF + XGBoost | char, ngram(1,2), max_features=1024 | **88.5%** | Best local accuracy | | |
| | `xgb_word.joblib` | TF-IDF + XGBoost | word, NLTK tokenizer | 86.7% | | | |
| | `lgb_model.joblib` | TF-IDF + LightGBM | word, NLTK tokenizer | 86.5% | | | |
| | `rf_nltk.joblib` | TF-IDF + RandomForest | word, NLTK, n_estimators=1000 | 84.8% | | | |
| | `rf_gridsearch.joblib` | TF-IDF + RandomForest | word, GridSearchCV best | 83.6% | best: max_depth=None, n_estimators=150 | | |
| | `rf_basic.joblib` | TF-IDF + RandomForest | word, default | 83.3% | | | |
| | `catboost.joblib` | TF-IDF + CatBoost | word, NLTK tokenizer | 83.0% | | | |
| | `multinomial_nb.joblib` | CountVectorizer + MultinomialNB | word, default | 67.5% | Baseline | | |
| | `lstm_bidirectional.h5` | BiLSTM | Keras Tokenizer, maxlen=216 | 85.2% | Requires Keras/TF | | |
| | `textcnn_model.h5` | TextCNN | Keras Tokenizer, maxlen=256 | 86.1% | Requires Keras/TF | | |
| --- | |
| ## Usage | |
| ### Preprocessing | |
| ```python | |
| import urllib.parse | |
| def preprocess(payload: str) -> str: | |
|     # Decode percent-escapes and '+' so encoded payloads match their plain form. | |
|     return urllib.parse.unquote_plus(payload) | |
| ``` | |
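To see what this step changes, the sketch below applies the same `unquote_plus` call to a percent-encoded request line. The sample string is made up for illustration and is not taken from the training data.

```python
import urllib.parse


def preprocess(payload: str) -> str:
    # Same call as the preprocessing function above.
    return urllib.parse.unquote_plus(payload)


# Hypothetical encoded request line, for illustration only.
encoded = "GET /search?q=%3Cscript%3Ealert(1)%3C%2Fscript%3E&page=a+b HTTP/1.1"
decoded = preprocess(encoded)
print(decoded)

# Decoding is single-pass: a double-encoded '<' is only unwrapped one level.
double_encoded = preprocess("%253C")
print(double_encoded)
```

Because decoding happens only once, double-encoded payloads still reach the model partly encoded. Whether that matters depends on how the training data was prepared, which this card does not state.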
| ### sklearn-based models (joblib) | |
| Applies to: `tdidf-svc.joblib`, `xgb_char.joblib`, `xgb_word.joblib`, `lgb_model.joblib`, `rf_*.joblib`, `catboost.joblib`, `multinomial_nb.joblib` | |
| Each file is a scikit-learn Pipeline with the vectorizer and classifier bundled together, so raw text can be passed directly. | |
| ```python | |
| import joblib | |
| model = joblib.load("xgb_char.joblib") | |
| payloads = [ | |
|     "GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1\r\n", | |
|     "GET /search?q=' OR 1=1-- HTTP/1.1\r\nHost: example.com\r\n", | |
| ] | |
| predictions = model.predict(payloads) | |
| print(predictions) | |
| # ['Path_Disclosure', 'SQL_Injection'] | |
| ``` | |
| ### Keras-based models (.h5) | |
| ```python | |
| import numpy as np | |
| from tensorflow.keras.models import load_model | |
| from tensorflow.keras.preprocessing.sequence import pad_sequences | |
| import joblib | |
| model = load_model("lstm_bidirectional.h5") # or textcnn_model.h5 | |
| tokenizer = joblib.load("tokenizer.joblib") # must be saved separately during training | |
| payloads = ["GET /../../../../etc/passwd HTTP/1.1\r\nHost: 10.0.0.1"] | |
| sequences = tokenizer.texts_to_sequences(payloads) | |
| padded = pad_sequences(sequences, maxlen=216) # maxlen=256 for TextCNN | |
| pred = model.predict(padded) | |
| label_idx = np.argmax(pred, axis=1) | |
| print(label_idx) | |
| ``` | |
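The Keras snippet above prints class indices rather than label names. The sketch below shows one way to map indices back to labels. It assumes the labels were encoded in sorted order, which is what scikit-learn's `LabelEncoder` produces; this card does not state the actual encoding, so the `LABELS` order and the `decode_predictions` helper are hypothetical and should be checked against the training code.

```python
# ASSUMPTION: indices follow sorted label order (LabelEncoder behaviour).
# The real index-to-label mapping is not documented in this card.
LABELS = sorted([
    "Vulnerability_Scan",
    "System_Cmd_Execution",
    "HOST_Scan",
    "Path_Disclosure",
    "SQL_Injection",
    "Cross_Site_Scripting",
    "Automatically_Searching_Infor",
    "Leakage_Through_NW",
    "Directory_Indexing",
])


def decode_predictions(probabilities):
    """Return the highest-scoring label name for each row of scores."""
    decoded = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over the row
        decoded.append(LABELS[best])
    return decoded


# Hypothetical softmax output for a single request (9 scores).
example = [[0.01, 0.02, 0.01, 0.01, 0.01, 0.90, 0.02, 0.01, 0.01]]
print(decode_predictions(example))
```

If a fitted label encoder was saved during training, loading it and calling its `inverse_transform` on the indices is the safer route.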
| --- | |
| ## Evaluation | |
| ### Per-model summary | |
| | Model | Accuracy | Macro F1 | Weakest Class (F1) | | |
| |-------|:--------:|:--------:|--------------------| | |
| | TF-IDF + XGBoost (char) | 88.5% | 0.92 | System_Cmd_Execution (0.84) | | |
| | TF-IDF + LinearSVC | 87.4% | n/a | System_Cmd_Execution | | |
| | TF-IDF + XGBoost (word) | 86.7% | 0.90 | System_Cmd_Execution (0.83) | | |
| | TF-IDF + LightGBM | 86.5% | 0.91 | System_Cmd_Execution (0.83) | | |
| | TextCNN | 86.1% | 0.89 | System_Cmd_Execution (0.79) | | |
| | BiLSTM | 85.2% | 0.89 | System_Cmd_Execution (0.78) | | |
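Accuracy and macro F1 in the table measure different things: accuracy counts correct samples, while macro F1 averages per-class F1 with equal weight for every class. The sketch below computes both on a tiny invented label set (not the evaluation data) to show that macro F1 can be higher than accuracy when the errors fall mostly in a large class.

```python
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented toy labels, for illustration only.
y_true = ["Vulnerability_Scan"] * 4 + ["SQL_Injection", "Cross_Site_Scripting"]
y_pred = ["Vulnerability_Scan"] * 3 + ["SQL_Injection", "SQL_Injection", "Cross_Site_Scripting"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(round(acc, 4), round(f1, 4))
```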
| ### Per-class observations | |
| - **Easiest classes**: `Automatically_Searching_Infor` and `Leakage_Through_NW` achieve F1 ≥ 0.99 across all models. Highly distinctive tool signatures (nmap, crawlers) and file access patterns make them trivial to separate. | |
| - **Hardest class**: `System_Cmd_Execution` consistently scores the lowest F1 (0.75 to 0.84) due to pattern overlap with `Vulnerability_Scan`. Both classes involve probing behavior with similar HTTP structure. | |
| - **char-level XGBoost advantage**: Sub-word character n-grams capture attack-specific tokens like `../`, `<script>`, `UNION` more robustly than word tokenization, especially for obfuscated payloads. | |
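The difference between word and character features can be seen on a single traversal request. The sketch below is a plain-Python approximation, not the vectorizers used in these models: `word_tokens` uses the same regular expression as scikit-learn's default `token_pattern`, and `char_ngrams` mimics `analyzer="char"` with `ngram_range=(1, 2)` without the whitespace normalization the real vectorizer applies. Both helper names are illustrative.

```python
import re


def word_tokens(text):
    # scikit-learn's default token_pattern keeps runs of 2+ word characters,
    # so punctuation such as '../' never becomes a feature.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


def char_ngrams(text, n_min=1, n_max=2):
    # Every contiguous substring of length n_min..n_max, punctuation included.
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


request_line = "GET /../../etc/passwd"
words = word_tokens(request_line)
grams = set(char_ngrams(request_line))

print(words)
print(sorted(g for g in grams if "." in g))
```

The word tokens keep `etc` and `passwd` but lose the traversal sequence itself, while the character n-grams retain `..`, `./` and `/.` as separate features.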
| --- | |
| ## Architecture Details | |
| ### BiLSTM | |
| ``` | |
| Embedding(22,883 vocab, dim=100, maxlen=216) | |
| → Bidirectional(LSTM(64)) → LSTM(32) → Dense(512) → Dense(9, softmax) | |
| ``` | |
| - EarlyStopping(monitor=val_accuracy, patience=3), triggered at epoch 20 | |
| - Saved: `lstm_bidirectional.h5` (28 MB) | |
| ### TextCNN | |
| ``` | |
| Embedding(20,000 vocab, dim=128, maxlen=256) | |
| → Conv1D(128, kernel=3) ─┐ | |
| → Conv1D(128, kernel=4) ─┼→ GlobalMaxPool → Concat(384) → Dense(256) → Dropout(0.3) → Dense(9, softmax) | |
| → Conv1D(128, kernel=5) ─┘ | |
| Total params: 2.86M | |
| ``` | |
| - EarlyStopping(monitor=val_loss, patience=3), triggered at epoch 5 | |
| - Saved: `textcnn_model.h5` (33 MB) | |
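The stated total of 2.86M parameters can be cross-checked from the layer sizes in the diagram. The arithmetic below assumes every Conv1D and Dense layer uses a bias term (the Keras default), that each convolution reads the 128-dimensional embedding directly, and that no other trainable layers exist; those assumptions are not confirmed by this card.

```python
# Layer sizes taken from the TextCNN diagram above.
vocab_size, embed_dim = 20_000, 128
filters, kernel_sizes = 128, (3, 4, 5)
dense_units, num_classes = 256, 9

# Embedding: one vector per vocabulary entry.
embedding = vocab_size * embed_dim

# Conv1D: kernel_size * input_channels * filters weights, plus one bias per filter.
convs = sum(k * embed_dim * filters + filters for k in kernel_sizes)

# The three pooled branches are concatenated before the dense layers.
concat_width = filters * len(kernel_sizes)
dense = concat_width * dense_units + dense_units
output = dense_units * num_classes + num_classes

total = embedding + convs + dense + output
print(f"{total:,}")
```

Under these assumptions the count is 2,857,865, which rounds to the 2.86M reported above; the embedding table accounts for about 90% of it.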
| --- | |
| ## Key Findings | |
| - **The best TF-IDF models outperform deep learning** on HTTP attack data (88.5% and 87.4% vs. 86.1% and 85.2% test accuracy): attack patterns rely on decisive keywords (`UNION SELECT`, `../`, `<script>`, `wget`). Bag-of-words representations capture these directly, while sequential models can be distracted by irrelevant header noise. | |
| - **char-level features beat word-level** for XGBoost (88.5% vs. 86.7%): Character n-grams handle URL encoding variations and partial token matches more effectively (e.g., `%3Cscript%3E` vs `<script>`). | |
| - **Class imbalance effect**: `Vulnerability_Scan` dominates at 37.5%, so models tend to over-predict this class for ambiguous samples. | |
| --- | |
| ## Environment | |
| | Item | Value | | |
| |------|-------| | |
| | Python | 3.12 | | |
| | scikit-learn | 1.x | | |
| | XGBoost | 3.2.0 | | |
| | LightGBM | 4.6.0 | | |
| | CatBoost | 1.2.10 | | |
| | TensorFlow / Keras | 2.x | | |
| --- | |
| ## License | |
| MIT License | |