📊 LinkedIn Job Posting Engagement Analysis

Which LinkedIn job posting characteristics predict candidate engagement (views) — and how well can engagement be predicted or classified using only posting-level features?

Personal motivation: As someone in entrepreneurship, understanding which job posting features attract candidates is directly relevant to future hiring decisions.

📹 Presentation Video

📋 Dataset at a Glance

Property	Value
Source	LinkedIn Job Postings — arshkon/linkedin-job-postings (Kaggle)
Original size	123,850 rows × 49 columns
Working sample	30,000 rows · `random_state=42`
After join with companies	30,000 rows × 40 columns
After cleaning	29,572 rows × 51 columns (in df_model)
Train / Test split	23,657 / 5,915 (80/20, `random_state=42`)
Regression target	`log_views = log1p(views)` — log-transformed to handle right skew
Classification target	`high_engagement` — top 25% of training views (threshold from training only)

⚠️ Scope & Limitations

LinkedIn's algorithm, sponsored status, and company follower counts likely drive most of the variance in views, and none of them is observable in this dataset. Models use posting-level features only. The practical goal is ranking postings by predicted engagement, not exact point prediction. Results show associations, not causal relationships.

🗂️ Repository Files

File	Description
`notebook.ipynb`	Full pipeline: Cleaning → EDA → Features → Clustering → Regression → Classification → Bonus
`linkedin_regression_model.pkl`	Winning model: Random Forest (Tuned)
`linkedin_classification_model.pkl`	Winning model: Decision Tree
`regression_model_results.csv`	Full regression model comparison
`classification_model_results.csv`	Full classification model comparison

🧹 Data Cleaning Pipeline

Step 1 — Reproducible sampling
        123,850 rows → sample(n=30,000, random_state=42)
        Joined with companies.csv on company_id (left join, rows preserved)
        Result: 30,000 rows × 40 columns
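
A minimal sketch of the sampling and join in Step 1, using tiny stand-in frames rather than the real Kaggle files. The column names other than `company_id` and `views`, and the sample size of 6, are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frames; the project reads the Kaggle postings and companies CSVs instead.
postings = pd.DataFrame({
    "job_id": range(10),
    "company_id": [1.0, 2.0, 2.0, 3.0, np.nan, 1.0, 4.0, 2.0, 3.0, 1.0],
    "views": [5, 12, 8, 3, 40, 7, 1, 9, 2, 15],
})
companies = pd.DataFrame({
    "company_id": [1.0, 2.0, 3.0],
    "company_size": [5, 3, 7],  # hypothetical company-level column
})

# Reproducible sample (n=30_000 in the project; 6 here for illustration).
sample = postings.sample(n=6, random_state=42)

# A left join keeps every sampled posting, even when no company row matches.
# validate="many_to_one" raises if company_id is duplicated in companies,
# which would otherwise silently multiply posting rows.
joined = sample.merge(companies, on="company_id", how="left", validate="many_to_one")

print(len(sample), len(joined))
```

Checking that the row count is unchanged after the merge is a cheap guard against accidental row duplication.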

Step 2 — Duplicate & missing target removal
        Removed duplicate rows
        Dropped rows where views is NaN or negative
        Result: 29,572 usable rows

Step 3 — Date parsing
        listed_time, original_listed_time, expiry, closed_time → parsed to datetime
        Extracted: posting_year, posting_month, posting_dayofweek, posting_weekend

Step 4 — Missing value analysis & column dropping
        Threshold: >70% missing → drop
        Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%),
                 remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%)
        Protected columns: salary fields kept for feature engineering
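
One way the Step 4 rule could be expressed, as a sketch on toy data. The `PROTECTED` set and its column names are assumptions for illustration; the notebook's actual protected list may differ.

```python
import numpy as np
import pandas as pd

# Toy frame: one column is entirely missing, one salary column is 80% missing.
df = pd.DataFrame({
    "views": [5, 12, 3, 40, 7, 1, 9, 2, 15, 4],
    "closed_time": [np.nan] * 10,
    "min_salary": [np.nan] * 8 + [50_000, 60_000],
    "title": ["a"] * 10,
})

THRESHOLD = 0.70
PROTECTED = {"min_salary", "max_salary"}  # assumed names of the protected salary fields

# Share of missing values per column.
missing_share = df.isna().mean()

# Drop columns above the threshold unless they are explicitly protected.
to_drop = [
    col for col in df.columns
    if missing_share[col] > THRESHOLD and col not in PROTECTED
]
df_clean = df.drop(columns=to_drop)

print(to_drop)
```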

Step 5 — Leakage columns excluded
        expiry, applies → removed (post-publication outcomes)
        views → kept as target only, not as feature

Step 6 — Salary imputation strategy
        has_salary_info = 1 if salary present, else 0
        salary_midpoint computed from min/max salary where available
        Missing salary → imputed inside sklearn Pipeline on training data only

Step 7 — Log transformation of target
        Raw views: mean=14.9, std=98.8, max=9,949 — heavily right-skewed
        log_views = log1p(views) — compresses scale, improves regression fit
        Predictions converted back via expm1() for interpretation
        Outliers (IQR method): 4,074 outliers (13.8%) — kept, not removed
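
The target transform and the IQR outlier count from Step 7 can be sketched as follows. The ten view counts are made up for illustration; only `log1p`, `expm1`, and the 1.5 × IQR rule come from the text above.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed view counts.
views = pd.Series([0, 1, 2, 3, 4, 5, 6, 8, 12, 500], name="views")

# log1p handles zero views safely; expm1 is its exact inverse.
log_views = np.log1p(views)
recovered = np.expm1(log_views)

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = views.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (views < q1 - 1.5 * iqr) | (views > q3 + 1.5 * iqr)

# Outliers are counted but kept, matching the approach described above.
print(int(is_outlier.sum()), round(float(is_outlier.mean()), 3))
```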

🔍 EDA — 5 Questions + Correlation Heatmap

Note: EDA question numbers in the notebook differ from intuitive order. Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Presented here in order of impact.

Salary Transparency vs Views (Notebook Q2)

No salary info   ████████████░░░░░░░░░░░░░  ~12 avg views   (70.1% of postings)
Has salary info  ████████████████████████░  ~21 avg views   (29.9% of postings)
                                             +74.3% lift ✓

Only 8,562 of 29,572 postings (29.9%) disclose salary. Postings that do average 74.3% more views, which makes salary disclosure a high-leverage, low-cost recruiter action.

Description Length vs Views (Notebook Q3)

< 100 words    ██████░░░░░░░░░░░░░░  low    — signals incomplete posting
100–250 words  █████████░░░░░░░░░░░  medium
250–500 words  ████████████████████  PEAK ★ — sweet spot
500–750 words  ████████████████░░░░  high
> 1000 words   ███████░░░░░░░░░░░░░  drop-off — overwhelms candidates

Non-linear relationship confirmed. Sweet spot: 250–500 words. Motivated description_density — the #1 feature in the winning regression model.

Day of Week vs Views (Notebook Q4)

Monday    ████████████████████  39 avg views  ★ best day (n=1,837)
Tuesday   █████████████████░░░  (weekday)
Wednesday ████████████████░░░░  (weekday)
Thursday  ███████████████░░░░░  (weekday)
Friday    ███░░░░░░░░░░░░░░░░░   7 avg views  ✗ worst day (n=10,076)
Saturday  ████████████░░░░░░░░  (weekend — noisier, n=2,116 total)
Sunday    ████████████░░░░░░░░  (weekend — noisier)

Weekend average: 28 views vs Weekday average: 22 views
Note: Weekend sample is much smaller (2,116 total) — estimates are noisier.
Weekday postings averaged 21.8% LOWER views than weekend in this dataset.

Counterintuitive finding: Weekend postings showed higher average views than weekdays in this sample, BUT weekend volume is very small (2,116 obs) making these estimates unreliable. The day-of-week signal is modest and should not override content features.

Work Type vs Views (Notebook Q1)

Contract    ████████████████████  29.97 avg views  7.0 median
Internship  █████████████████░░░  25.71 avg views  5.0 median
Full-time   ████████░░░░░░░░░░░░  13.70 avg views  4.0 median
Other       ███████░░░░░░░░░░░░░  11.27 avg views  4.0 median
Part-time   ██████░░░░░░░░░░░░░░   9.59 avg views  4.0 median

Contract and Internship roles show the highest engagement. However, Full-time dominates volume (23,674 of 29,572 postings). Work type is a useful feature but should not be interpreted as causal.

Seniority Level vs Views (Notebook Q5)

Entry-level  ████████████████████  18 avg views  n=792
Senior-level ████████████░░░░░░░░  16 avg views  n=3,577
Other/Mid    ██████████░░░░░░░░░░  15 avg views  n=25,203

Entry vs Senior: +12.4% more views
Entry vs Other:  +18.9% more views

A plausible explanation is a supply-side effect: more candidates qualify for junior roles, so the pool is larger. The entry-level advantage is modest (+12.4% vs senior). is_entry_role likely carries predictive signal because it proxies for candidate pool size.

🔥 Feature Correlation with log(views+1)

Feature                      Corr    Direction   Note
─────────────────────────────────────────────────────────────────────
desc_salary_interaction      +0.18   ↑ views     strongest linear correlation
has_salary_info              +0.14   ↑ views     salary transparency
salary_log                   +0.12   ↑ views     salary level
description_density          +0.10   ↑ views     content quality
description_word_count       +0.08   ↑ views     description length
is_software_role             +0.08   ↑ views     tech role demand
is_data_role                 +0.07   ↑ views     data role demand
is_entry_role                +0.06   ↑ views     larger candidate pool
posting_weekend              -0.04   ↓ views     (small negative)
is_senior_role               -0.03   ↓ views     smaller candidate pool
─────────────────────────────────────────────────────────────────────
Internal correlations (structural):
salary_log ↔ salary_midpoint  +0.96  log transform of same variable
desc_wc ↔ desc_density        +0.55  density uses length in formula
is_software ↔ is_data         +0.35  often co-occur in job titles
is_senior ↔ is_entry          -0.28  mutually exclusive by construction
─────────────────────────────────────────────────────────────────────

Most features show weak linear correlation — no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting) which capture non-linear interactions and feature combinations.
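
For reference, a correlation ranking like the table above can be produced along these lines. This is a sketch on synthetic data: the effect sizes are invented, and only the idea of correlating each feature with `log1p(views)` is taken from the document.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic features; the real feature matrix comes from the notebook.
df = pd.DataFrame({
    "has_salary_info": rng.integers(0, 2, n),
    "description_word_count": rng.integers(50, 1000, n),
})

# Invented relationship: salary disclosure raises the expected view count.
df["views"] = rng.poisson(5 + 5 * df["has_salary_info"].to_numpy())
df["log_views"] = np.log1p(df["views"])

# Pearson correlation of each feature with the log target, strongest first.
corr = (
    df.drop(columns="views")
      .corr()["log_views"]
      .drop("log_views")
      .sort_values(ascending=False)
)
print(corr.round(2))
```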

⚙️ Feature Engineering — 30 Total Features (including 6 cluster dummies)

Note: The notebook engineers its base features before clustering, then adds 6 cluster dummy columns, for a total of 30 columns in the final feature matrix (X_train_fe shape: 23,657 × 30).

Group	Features
Text length	`title_length`, `title_word_count`, `description_length`, `description_word_count`
Text structure	`description_density`, `title_desc_ratio`
Salary	`salary_midpoint`, `salary_range`, `has_salary_info`, `salary_log`
Role keywords	`is_senior_role`, `is_entry_role`, `is_software_role`, `is_data_role`, `is_manager_role`, `is_sales_role`, `is_marketing_role`, `is_remote_text`
Interactions	`desc_salary_interaction`, `senior_salary`, `weekend_remote`, `title_desc_word_interaction`, `salary_density_interaction`, `salary_description_interaction`, `title_density_interaction`
Clustering	`cluster_0`, `cluster_1`, `cluster_2`, `cluster_3`, `cluster_4`, `cluster_5`
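
A hedged sketch of how a few of the listed features might be derived. The exact formulas are not given in this README, so the definitions of `description_density` (words per character), the role keyword lists, and `desc_salary_interaction` (word count × salary flag) are assumptions chosen for illustration, as are the `min_salary`/`max_salary` input names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "title": ["Senior Data Engineer", "Sales Intern"],
    "description": [
        "Build and maintain data pipelines for analytics.",
        "Support the sales team.",
    ],
    "min_salary": [100_000.0, np.nan],
    "max_salary": [140_000.0, np.nan],
})

# Text length features.
df["title_word_count"] = df["title"].str.split().str.len()
df["description_length"] = df["description"].str.len()
df["description_word_count"] = df["description"].str.split().str.len()

# Assumed definition: words per character (the notebook's formula may differ).
df["description_density"] = (
    df["description_word_count"] / df["description_length"].clip(lower=1)
)

# Salary features: midpoint where available, plus a presence flag.
df["salary_midpoint"] = df[["min_salary", "max_salary"]].mean(axis=1)
df["has_salary_info"] = df["salary_midpoint"].notna().astype(int)
df["salary_log"] = np.log1p(df["salary_midpoint"])

# Role keyword flags from the title (keyword lists are assumptions).
title_lower = df["title"].str.lower()
df["is_senior_role"] = title_lower.str.contains(
    r"\b(?:senior|sr|lead|principal)\b", regex=True
).astype(int)
df["is_data_role"] = title_lower.str.contains(r"\bdata\b", regex=True).astype(int)

# Assumed interaction: description length counts only when salary is disclosed.
df["desc_salary_interaction"] = df["description_word_count"] * df["has_salary_info"]
```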

Missing value strategy:

Columns with >70% missing → dropped (closed_time, skills_desc, med_salary, remote_allowed, applies, salary min/max, compensation fields)
Salary → has_salary_info flag + salary_midpoint computed where possible; remaining salary NaN imputed inside sklearn Pipeline on training data only
Remaining numeric → SimpleImputer(strategy="median") inside Pipeline
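
The point of imputing inside the Pipeline is that the medians are learned from training rows only and then reused on test rows. A minimal sketch, with a `LinearRegression` standing in for the actual estimators and toy arrays standing in for the feature matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data with missing values in the second column.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 50.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_test = np.array([[5.0, np.nan]])

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("regress", LinearRegression()),  # stand-in estimator for illustration
])

# fit() computes medians from X_train only.
model.fit(X_train, y_train)
learned_medians = model.named_steps["impute"].statistics_

# predict() fills the test NaN with the training median (30.0), not a test statistic.
pred = model.predict(X_test)
print(learned_medians, pred)
```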

🔵 Clustering — KMeans k=6

Clustering features used (12 total, leakage-checked): title_word_count, description_word_count, salary_log, description_density, has_salary_info, is_senior_role, is_entry_role, is_software_role, is_data_role, is_manager_role, is_sales_role, is_marketing_role

Methods used to select k:

Elbow method (inertia k=2–10) — inconclusive, no sharp elbow
K-Means silhouette scores on full training matrix
Cluster-size stability table (smallest/largest cluster per k)
Interactive K-Means widget (visualization aid only — uses sample)
Hierarchical clustering dendrogram (Ward linkage, 300 obs sample)
Agglomerative Clustering diagnostic comparison (k=2–10 on sample)

Chart 1 — Actual silhouette scores by k (full training matrix)

  k=2   ████████░░░░░░░░░░░░  0.198  smallest cluster: 6,830 (28.9%)
  k=3   █████████░░░░░░░░░░░  0.221  smallest cluster: 2,100 (8.9%)
  k=4   █████████████░░░░░░░  0.312  ← strong score BUT largest=72%
  k=5   ██████████░░░░░░░░░░  0.250  smallest: 526 (unstable)
  k=6   ████████████░░░░░░░░  0.290  ← SELECTED ★ smallest: 583 (2.5%)
  k=7   ████████████░░░░░░░░  0.286  singleton cluster appeared
  k=8   █████████████░░░░░░░  0.315  singleton cluster appeared
  k=9   █████████████░░░░░░░  0.314  singleton cluster appeared
  k=10  ██████████████░░░░░░  0.350  singleton cluster appeared

Why NOT k=10 (highest score): singleton cluster (1 observation)
Why NOT k=4 (strong score):   largest cluster = 72% of observations
Why k=6: no singletons, stable sizes, silhouette 0.290, interpretable profiles

Note: Elbow method was inconclusive (inertia 255,430 at k=2 → 98,508 at k=10,
no sharp elbow). Agglomerative diagnostic best at k=2 (score 0.467 on sample)
— too coarse. k=6 selected as practical compromise across all methods.
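
The selection logic above (silhouette score combined with cluster-size checks) could be sketched roughly as follows. The blob data, the `MIN_SIZE` and `MAX_SHARE` thresholds, and the range of k are illustrative assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with four well-separated groups.
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

rows = []
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    sizes = np.bincount(labels, minlength=k)
    rows.append({
        "k": k,
        "silhouette": float(silhouette_score(X_scaled, labels)),
        "smallest": int(sizes.min()),
        "largest_share": float(sizes.max() / sizes.sum()),
    })

# Hypothetical stability rules: no tiny cluster, no dominant cluster.
MIN_SIZE, MAX_SHARE = 10, 0.60
eligible = [r for r in rows if r["smallest"] >= MIN_SIZE and r["largest_share"] <= MAX_SHARE]

# Among stable candidates, prefer the highest silhouette score.
best = max(eligible, key=lambda r: r["silhouette"])
print(best)
```

Filtering on cluster sizes before comparing silhouette scores mirrors the reasoning used to reject k=4 and k=10 despite their higher scores.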

Chart 2 — Actual cluster sizes at k=6 (training set n=23,657)

  Cluster 0 — Manager-focused       ████████████  4,571  (19%)  is_manager_role=1.00
  Cluster 1 — General / Mixed       ████████████████████ 13,055 (55%)  no dominant role signal
  Cluster 2 — Salary-transparent    ████          1,940   (8%)  has_salary_info=1.00
  Cluster 3 — Data roles            ███           1,451   (6%)  is_data_role=1.00
  Cluster 4 — Software roles        █████         2,057   (9%)  is_software_role=1.00
  Cluster 5 — Entry / low salary    ██              583   (2%)  smallest cluster

Official final silhouette score: 0.290 (full training matrix)

Cluster labels one-hot encoded as 6 dummy features. Including clusters improved both regression RMSE and classification F1 over models without them.
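
A sketch of turning cluster labels into dummy columns without leaking test data: KMeans is fit on training rows and only applied to test rows. The helper name `add_cluster_dummies` and the random feature matrices are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
cols = ["f1", "f2", "f3"]
X_train = pd.DataFrame(rng.normal(size=(200, 3)), columns=cols)
X_test = pd.DataFrame(rng.normal(size=(50, 3)), columns=cols)

K = 6
# Fit on training data only; test rows are assigned to the learned centroids.
kmeans = KMeans(n_clusters=K, n_init=10, random_state=42).fit(X_train)

def add_cluster_dummies(X, model, k):
    """Append cluster_0..cluster_{k-1} indicator columns to X (hypothetical helper)."""
    labels = pd.Series(
        pd.Categorical(model.predict(X), categories=list(range(k))),
        index=X.index,
    )
    # A fixed category list guarantees all k columns exist, even if a cluster is empty.
    dummies = pd.get_dummies(labels, prefix="cluster", dtype=int)
    return pd.concat([X, dummies], axis=1)

X_train_fe = add_cluster_dummies(X_train, kmeans, K)
X_test_fe = add_cluster_dummies(X_test, kmeans, K)
print(X_train_fe.shape, X_test_fe.shape)
```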

📈 Regression — Predicting `log1p(views)`

Baseline

Mean Baseline (predict training mean for all observations):
  RMSE_log = 0.8708   R² = -0.0002   ← floor every model must beat
  MAE_views ≈ 10.64

Baseline Linear Regression (20 features, no clustering):
  RMSE_log = 0.8425   R² = 0.0639
  MAE_views ≈ 10.54
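
For context, a mean baseline and its metrics can be computed as in this sketch. The Poisson view counts are synthetic, so the printed numbers will not match the values reported above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(42)
views_train = rng.poisson(10, size=400)
views_test = rng.poisson(10, size=100)
y_train, y_test = np.log1p(views_train), np.log1p(views_test)

# The baseline ignores features, so placeholder matrices are enough.
baseline = DummyRegressor(strategy="mean").fit(np.zeros((400, 1)), y_train)
pred_log = baseline.predict(np.zeros((100, 1)))

# Error on the log scale, plus MAE after converting back to raw views.
rmse_log = np.sqrt(mean_squared_error(y_test, pred_log))
r2 = r2_score(y_test, pred_log)
mae_views = mean_absolute_error(views_test, np.expm1(pred_log))

print(round(rmse_log, 4), round(r2, 4), round(mae_views, 2))
```

A constant prediction cannot achieve a positive test R², which is why the reported baseline R² sits just below zero.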

Full model comparison (after feature engineering + clustering)

Model                        RMSE_log ↓    R² ↑
─────────────────────────────────────────────────────
Random Forest (Tuned)  ★     0.8347        0.0811
Random Forest (Ctrl)         0.8349        0.0807
Gradient Boosting            0.8370        0.0770
Linear Regression + Feat     0.8420        0.0640
RidgeCV                      0.8420        0.0640
Lasso Regression             0.8430        0.0640
PCA + Linear Regression      0.8440        0.0600
Mean Baseline                0.8708       -0.0002
─────────────────────────────────────────────────────
Winner: RandomizedSearchCV tuned RF
Improvement over manually controlled RF: 0.0002 RMSE_log (practically negligible)
3-fold CV mean RMSE_log: 0.8747 (±0.0125) — stable across folds
Overfitting lesson: unrestricted RF → train R²=0.854, test R²=0.003
Fixed by: max_depth, min_samples_split, min_samples_leaf, max_features constraints
Outlier robustness test: capping views at 99th pct → RMSE_log 0.8147, R²=0.0812
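
A sketch of constraining a Random Forest with the four parameters named above and tuning them with `RandomizedSearchCV`. The search grid, `n_estimators`, `n_iter`, and the synthetic data are assumptions; the notebook's actual search space is not listed in this README.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Noisy synthetic regression problem as a stand-in for the posting features.
X, y = make_regression(n_samples=400, n_features=8, noise=50.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative search space over the depth and leaf-size constraints.
param_distributions = {
    "max_depth": [4, 6, 8],
    "min_samples_split": [10, 20, 40],
    "min_samples_leaf": [5, 10, 20],
    "max_features": ["sqrt", 0.5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_train, y_train)

# Comparing train and test R² shows how much the constraints limit overfitting.
train_r2 = search.best_estimator_.score(X_train, y_train)
test_r2 = search.best_estimator_.score(X_test, y_test)
print(search.best_params_, round(train_r2, 3), round(test_r2, 3))
```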

Top feature importances (RF Tuned)

description_density          ████████████  #1 — content quality
description_length           ██████████░░  #2 — raw description size
description_word_count       ████████░░░░  #3 — word count
title-desc interaction       ████████░░░░  #4 — combined signal
is_software_role             ██████░░░░░░  #5 — tech role demand
is_data_role                 █████░░░░░░░  #6 — data role demand
salary_log / has_salary_info ████░░░░░░░░  #7+ — salary signals

Note: desc_salary_interaction ranked #2 in SHAP analysis but further down in Gini importance. Both agree on description quality and salary as top drivers.

Regression interpretation

R² = 0.081 → model explains ~8% of variance in log(views+1)

Why acceptable:
  ✓ Beats mean baseline (R²≈0) — real posting-level signal captured
  ✓ Social engagement inherently noisy — platform factors dominate
  ✓ ~92% of variance is unexplained, plausibly driven by unobserved sources (algorithm, followers, ads)
  ✓ Practical use = ranking postings, not forecasting exact counts

PCA + Linear: reduced to 15 components (96.3% variance preserved) — no improvement
Gradient Boosting marginally worse than RF — non-linear models help but modestly

🟠 Classification — High Engagement vs. Normal

Target: high_engagement = 1 if views ≥ 75th percentile of TRAINING views
Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1)
Feature matrix: X_clf uses 24 features (not the full 30 — see notebook cell 207)
Training: ~24,000 obs | Test: ~6,000 obs
Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance)
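
Deriving the label threshold from training rows only might look like this sketch. The synthetic view counts are an assumption; the 75th-percentile rule and the 80/20 split with `random_state=42` follow the description above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
df = pd.DataFrame({"views": rng.poisson(10, size=1000)})

train, test = train_test_split(df, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# The threshold is computed on training views only, then reused unchanged on test.
threshold = train["views"].quantile(0.75)
train["high_engagement"] = (train["views"] >= threshold).astype(int)
test["high_engagement"] = (test["views"] >= threshold).astype(int)

print(threshold, train["high_engagement"].mean(), test["high_engagement"].mean())
```

Because of ties at the threshold, the positive class share can be somewhat above 25%.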

Model comparison

Model                  F1 (C1)    Recall (C1)   Notes
──────────────────────────────────────────────────────────────
Decision Tree     ★    HIGHEST    HIGHEST       lowest FN count
Logistic Regr.         near-best  high          close to DT
Random Forest          moderate   lower         lowest FP count
Dummy Baseline         0.00       0.00          always predicts Class 0
──────────────────────────────────────────────────────────────
Winner: Decision Tree (max_depth=8, class_weight="balanced")
5-fold CV F1: 0.4424 ± 0.0152 — stable, no lucky split
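
A sketch of the winning configuration evaluated with 5-fold cross-validated F1, next to a dummy baseline. The data is synthetic with a roughly 75/25 class balance, so the scores are not comparable to the reported 0.4424.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced problem: about 75% class 0, 25% class 1.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.75, 0.25], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" upweights the minority class during training.
tree = DecisionTreeClassifier(max_depth=8, class_weight="balanced", random_state=42)
cv_f1 = cross_val_score(tree, X_train, y_train, cv=5, scoring="f1")

tree.fit(X_train, y_train)
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

f1_tree = f1_score(y_test, tree.predict(X_test))
# The dummy never predicts class 1, so its F1 for class 1 is zero.
f1_dummy = f1_score(y_test, dummy.predict(X_test), zero_division=0)

print(round(cv_f1.mean(), 4), round(cv_f1.std(), 4), round(f1_tree, 4), f1_dummy)
```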

Confusion matrix (all models — from notebook)

Decision Tree:     lowest FN (catches most high-engagement) — most false positives
Random Forest:     lowest FP (fewest false alarms) — misses most high-engagement
Logistic Regr.:    between the two — close to DT in F1

FN (missed high-engagement) = most costly error:
  Company fails to prioritize, promote, or learn from a valuable listing.
FP (false alarm) = also costly:
  Recruiters waste attention on postings that are not actually strong.
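
The FN and FP counts discussed above can be read from a confusion matrix as in this small sketch. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = high engagement, 0 = normal.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# With labels=[0, 1], ravel() returns counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = tp / (tp + fn)      # share of high-engagement postings caught
precision = tp / (tp + fp)   # share of flagged postings that are truly high

print(tn, fp, fn, tp, recall, precision)
```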

💡 Business Insights (from notebook cell 242)

Salary transparency is associated with higher engagement — 74.3% more views. Fewer than 30% of postings disclose salary today.
Description structure matters: density was the #1 feature in the regression model and #2 in the classification model. Sweet spot: 250–500 words.
Tech roles attract more engagement — software and data role flags carry signal beyond salary.
Work type is associated with engagement — contract roles lead, but full-time dominates volume.
Platform factors dominate — R²≈0.08 is expected. Model value is in ranking, not exact prediction.

🎁 Bonus Work

🚀 Interactive Dashboard

👉 Open the LinkedIn Job Engagement Dashboard

Tab	Description
🎯 Engagement Predictor	Real-time predicted views + High/Normal classification
📊 EDA Dashboard	All 5 EDA findings as interactive charts
ℹ️ About	Feature groups, model details, limitations

🧠 SHAP Explainability

SHAP mean |value| — RF Tuned regression (test observations)

description_density      ████████████  strongest ↑
desc_salary_interaction  ██████████░░  salary × description synergy ↑
salary_log               ████████░░░░  salary level ↑
has_salary_info          ██████░░░░░░  disclosed → more views ↑
posting_weekend          ██░░░░░░░░░░  weekend → fewer views ↓

Key finding: desc_salary_interaction ranks #2 in SHAP but lower in Gini —
confirms it captures genuine non-linear interaction beyond individual features.

📊 Feature Importance: Regression vs Classification

                        Regression RF    Classification DT
description_density     #1               #2
desc_salary_interaction varies           varies
salary_log              #7+              varies
is_entry_role           lower            rises in classification
is_data_role            #6               varies
─────────────────────────────────────────────────────────
Agreement: description quality + salary dominate both models
Divergence: seniority/role flags matter more for threshold-crossing
            (classification) than for predicting exact counts (regression)

🔬 Additional Bonus Items

Interactive K-Means Widget — explore different k values visually in notebook (cell 4.11)
Hierarchical Clustering Dendrogram — Ward linkage, 300 obs sample (cell 4.12)
Agglomerative Clustering Diagnostic — k=2–10 comparison (cell 4.13)
Outlier Robustness Test — views capped at 99th percentile: RMSE_log 0.8147 vs 0.8347 uncapped
3-fold CV for regression — mean RMSE_log 0.8747 ± 0.0125

🛠️ How to Use the Models

import pickle

import numpy as np

# Load the two pickled models shipped in this repository.
with open("linkedin_regression_model.pkl", "rb") as f:
    reg_model = pickle.load(f)
with open("linkedin_classification_model.pkl", "rb") as f:
    clf_model = pickle.load(f)

# X_test_fe (30 columns) and X_clf (24 columns) are not defined here;
# they are produced by running the notebook pipeline.

# Regression: predict log(views+1), then convert back to raw views
log_views_pred = reg_model.predict(X_test_fe)
views_pred = np.expm1(log_views_pred)

# Classification: predict high-engagement label (0 or 1)
label = clf_model.predict(X_clf)

Regression model expects 30-column X_test_fe (with cluster dummies). Classification model expects 24-column X_clf. Run the full pipeline in the notebook to produce compatible inputs.

Assignment 2 — Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)
