HelpScout Processing Pipeline

Extracts, cleans, and enriches customer support conversations from HelpScout. The module has two distinct responsibilities:

Data export (fetch_and_export.py) — fetches raw threads, cleans HTML, and exports CSVs for the Streamlit dashboard.
AI processing pipeline (main.py) — fetches the same conversations, runs them through a two-step agentic workflow (sentiment + topic extraction), and writes enriched records to Snowflake.

Folder Structure

process_helpscout/
│
├── main.py                          # Pipeline entry point (parallel processing)
├── data_fetcher.py                  # Fetches & aggregates conversations; deduplication check
├── fetch_and_export.py              # CSV export script (separate from the pipeline)
├── html_cleaner.py                  # HTML → clean plain text (shared by both workflows)
├── snowflake_conn.py                # Snowflake connection wrapper
│
├── agents/                          # LLM-based extraction agents
│   ├── README.md                    # Agent architecture docs (read this to extend)
│   ├── base_agent.py                # Abstract base class for all agents
│   ├── sentiment_analysis_agent.py  # Classifies sentiment polarity + emotions
│   └── topic_extraction_agent.py    # Assigns topic tags + billing flags
│
├── workflow/
│   └── conversation_processor.py    # LangGraph workflow: sentiment → topics → END
│
├── config_files/
│   ├── processing_config.json       # Agent models, batch settings, output table, sentiment categories
│   └── topics.json                  # HelpScout topic taxonomy (source of truth for topic extraction)
│
├── queries/
│   └── helpscout_conversations.sql  # SQL that fetches customer threads from Snowflake
│
├── sql/
│   └── create_features_table.sql    # DDL: run once before first pipeline execution
│
├── output/                          # Auto-created; holds CSV exports
│   ├── helpscout_threads.csv
│   └── helpscout_conversations.csv
│
└── visualization/                   # Streamlit dashboard (reads from CSV exports)
    ├── app.py
    ├── components/dashboard.py
    └── utils/data_processor.py

Data Flow

CSV Export (Dashboard)

Snowflake (STITCH.HELPSCOUT.CONVERSATION_THREADS)
        │  queries/helpscout_conversations.sql
        ▼
fetch_and_export.py
        │  process_threads()       — clean HTML, add word_count, date columns
        │  aggregate_conversations() — one row per conversation_id
        ▼
output/helpscout_threads.csv        (one row per message thread)
output/helpscout_conversations.csv  (one row per conversation)
        │
        ▼
visualization/app.py  →  Streamlit dashboard
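
The aggregation step amounts to a group-by on the conversation ID. The sketch below is a stdlib-only illustration of that idea, not the code in fetch_and_export.py: the helper name `aggregate_threads` and the input keys (`conversation_id`, `created_at`, `body`) are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_threads(threads):
    """Collapse thread rows into one summary row per conversation_id.

    `threads` is a list of dicts with hypothetical keys:
    conversation_id, created_at (datetime), body (cleaned plain text).
    """
    grouped = defaultdict(list)
    for thread in threads:
        grouped[thread["conversation_id"]].append(thread)

    conversations = []
    for conversation_id, rows in grouped.items():
        rows.sort(key=lambda row: row["created_at"])  # oldest message first
        first_at = rows[0]["created_at"]
        last_at = rows[-1]["created_at"]
        combined_text = "\n".join(row["body"] for row in rows)
        conversations.append({
            "conversation_id": conversation_id,
            "thread_count": len(rows),
            "first_message_at": first_at,
            "last_message_at": last_at,
            "duration_hours": (last_at - first_at).total_seconds() / 3600,
            "combined_text": combined_text,
            "word_count": len(combined_text.split()),
        })
    return conversations
```

A conversation with a single thread gets a duration of 0 hours, since its first and last timestamps coincide.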

AI Processing Pipeline

Snowflake (STITCH.HELPSCOUT.CONVERSATION_THREADS)
        │  Same SQL — customer threads only, Feb 17 2026+
        ▼
data_fetcher.fetch_conversations()
        │  Cleans HTML (html_cleaner.py)
        │  Aggregates to one row per conversation
        │  Checks HELPSCOUT_CONVERSATION_FEATURES for already-processed IDs
        ▼
main.py  —  splits into parallel batches
        │
        ├── Worker 1: ConversationProcessingWorkflow
        │       ├── Node 1: SentimentAnalysisAgent  →  polarity + emotions
        │       └── Node 2: TopicExtractionAgent    →  topics + billing flags
        │
        ├── Worker 2: ... (same)
        └── Worker N: ... (same)
        │
        ▼
SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES
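
The fan-out into workers can be sketched as below. This is an illustration under assumptions, not the logic in main.py: how the real pipeline applies `min_batch_size` and `max_batch_size` is not documented here, so the clamping rule is a guess, `process_batch` is a placeholder for the workflow, and a thread pool stands in for whatever parallel mechanism main.py actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, num_workers, min_batch_size=10, max_batch_size=50):
    """Split items into batches; the batch size is clamped to [min, max].

    The final batch may be smaller than min_batch_size.
    """
    if not items:
        return []
    # Aim for one batch per worker, then clamp to the configured bounds.
    target = -(-len(items) // num_workers)  # ceiling division
    size = max(min_batch_size, min(max_batch_size, target))
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_batch(batch):
    """Placeholder for running the sentiment and topic workflow on a batch."""
    return [{"conversation_id": cid, "status": "processed"} for cid in batch]


def run_parallel(conversation_ids, num_workers=4):
    """Process all batches concurrently and flatten the results."""
    batches = split_into_batches(conversation_ids, num_workers)
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in batch order, so output order matches input.
        for batch_result in pool.map(process_batch, batches):
            results.extend(batch_result)
    return results
```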

Setup

1. Environment variables

All credentials are read from the project root .env file.

| Key | Description |
| --- | --- |
| `SNOWFLAKE_USER` | Snowflake username |
| `SNOWFLAKE_PASSWORD` | Snowflake password |
| `SNOWFLAKE_ACCOUNT` | Snowflake account identifier |
| `SNOWFLAKE_ROLE` | Role with access to `STITCH`, `ESTUARY`, and `SOCIAL_MEDIA_DB` |
| `SNOWFLAKE_WAREHOUSE` | Compute warehouse |
| `OPENAI_API_KEY` | Required for the AI pipeline only |
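
A quick pre-flight check can catch a missing key before any connection is attempted. The helper below is a hypothetical sketch (the name `missing_env_keys` is not part of the module); it only inspects the keys listed in the table above.

```python
import os

REQUIRED_KEYS = [
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_ROLE",
    "SNOWFLAKE_WAREHOUSE",
]


def missing_env_keys(environ=None, need_openai=True):
    """Return the required keys that are unset or empty."""
    environ = os.environ if environ is None else environ
    keys = REQUIRED_KEYS + (["OPENAI_API_KEY"] if need_openai else [])
    return [key for key in keys if not environ.get(key)]
```

Call it after loading the .env file (for example with `load_dotenv()` from python-dotenv), and pass `need_openai=False` when only the CSV export is being run.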

2. Dependencies

All dependencies are in the project root requirements.txt:

snowflake-snowpark-python
beautifulsoup4
pandas, numpy
langchain-openai, langgraph
python-dotenv
streamlit, plotly (dashboard only)

3. Create the output table (once)

Before running the pipeline for the first time, execute the DDL in Snowflake:

-- Run this in your Snowflake worksheet or via the Snowflake CLI
-- File: sql/create_features_table.sql

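This creates SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES with a primary key on CONVERSATION_ID. Snowflake does not enforce primary key constraints on standard tables, so uniqueness depends on the pipeline's check for already-processed IDs. The pipeline always appends and never truncates the table.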

Usage

Run the AI processing pipeline

cd process_helpscout

# Process all new conversations (parallel, recommended)
python main.py

# Limit to 100 conversations — useful for a first test run
python main.py --limit 100

# Sequential mode — single process, easier to read logs when debugging
python main.py --sequential

# Use a custom config file
python main.py --config /path/to/my_config.json

On every run the pipeline:

Fetches all conversations (from Feb 17 2026 to today)
Queries the output table for already-processed CONVERSATION_IDs
Skips those — only new conversations are sent to the LLM
Appends results to the Snowflake output table
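
The skip step is a set-membership filter. A minimal sketch, assuming conversations are dicts with a `conversation_id` key and that the processed IDs have already been read from the output table (the function name is hypothetical):

```python
def select_new_conversations(conversations, processed_ids):
    """Keep only conversations whose ID is not already in the output table."""
    # Compare as strings, since CONVERSATION_ID is stored as VARCHAR.
    already_done = {str(cid) for cid in processed_ids}
    return [c for c in conversations if str(c["conversation_id"]) not in already_done]
```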

Run the CSV export (dashboard data)

cd process_helpscout
python fetch_and_export.py

Launch the Streamlit dashboard

cd process_helpscout
streamlit run visualization/app.py

Output Table

Table: SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES

| Column | Type | Description |
| --- | --- | --- |
| `CONVERSATION_ID` | VARCHAR | HelpScout conversation ID (primary key) |
| `CUSTOMER_EMAIL` | VARCHAR | Customer email address |
| `CUSTOMER_FIRST` | VARCHAR | Customer first name |
| `CUSTOMER_LAST` | VARCHAR | Customer last name |
| `CUSTOMER_HS_ID` | NUMBER | HelpScout internal customer ID |
| `THREAD_COUNT` | NUMBER | Number of customer message threads |
| `FIRST_MESSAGE_AT` | TIMESTAMP_TZ | When the first customer message was sent |
| `LAST_MESSAGE_AT` | TIMESTAMP_TZ | When the last customer message was sent |
| `DURATION_HOURS` | FLOAT | Hours between first and last message |
| `STATUS` | VARCHAR | Last known HelpScout status |
| `STATE` | VARCHAR | Last known HelpScout state |
| `SOURCE_TYPE` | VARCHAR | e.g. `email`, `chat` |
| `SOURCE_VIA` | VARCHAR | e.g. `api`, `mailbox` |
| `COMBINED_TEXT` | TEXT | Raw aggregated customer messages |
| `CONVERSATION_TEXT_USED` | TEXT | Formatted + truncated text sent to the LLM |
| `SENTIMENT_POLARITY` | VARCHAR | `very_positive` / `positive` / `neutral` / `negative` / `very_negative` |
| `EMOTIONS` | VARCHAR | Comma-separated emotion values (NULL if none valid) |
| `SENTIMENT_CONFIDENCE` | VARCHAR | `high` / `medium` / `low` |
| `SENTIMENT_NOTES` | TEXT | 1-2 sentence LLM explanation of the sentiment |
| `TOPICS` | VARCHAR | Comma-separated topic IDs (multi-label) |
| `IS_REFUND_REQUEST` | BOOLEAN | Customer explicitly asked for a refund |
| `IS_CANCELLATION` | BOOLEAN | Customer explicitly wants to cancel |
| `IS_MEMBERSHIP` | BOOLEAN | Customer wants to join/rejoin and purchase membership |
| `TOPIC_CONFIDENCE` | VARCHAR | `high` / `medium` / `low` |
| `TOPIC_NOTES` | TEXT | 1-2 sentence LLM explanation of topics |
| `SUMMARY` | TEXT | 2-3 sentence neutral summary of the conversation |
| `PROCESSING_ERRORS` | TEXT | Semicolon-separated errors (NULL on full success) |
| `PROCESSED_AT` | TIMESTAMP_NTZ | When this record was written by the pipeline |
| `WORKFLOW_VERSION` | VARCHAR | Pipeline version for auditability |
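
Because `EMOTIONS`, `TOPICS`, and `PROCESSING_ERRORS` are stored as delimited strings, downstream code usually needs to split them back into lists. A small helper along these lines would do it; the function name is hypothetical, and the exact delimiter spacing written by the pipeline is an assumption, so whitespace is stripped defensively.

```python
def split_delimited(value, separator=","):
    """Turn a delimited column value into a list; NULL (None) becomes []."""
    if value is None:
        return []
    return [part.strip() for part in value.split(separator) if part.strip()]
```

Use the default separator for `EMOTIONS` and `TOPICS`, and `separator=";"` for `PROCESSING_ERRORS`.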

Configuration

All pipeline settings live in config_files/processing_config.json.

Agent models

"agents": {
  "sentiment_analysis": {
    "model": "gpt-4o-mini",
    "temperature": 0.2,
    "max_retries": 3
  },
  "topic_extraction": {
    "model": "gpt-4o-mini",
    "temperature": 0.2,
    "max_retries": 3
  }
}

Switch any agent to gpt-4o for higher accuracy (at higher cost) by changing the "model" value.

Conversation length

"processing": {
  "max_conversation_chars": 3000,
  "min_batch_size": 10,
  "max_batch_size": 50
}

max_conversation_chars controls how many characters of conversation text are sent to the LLM. Increasing this improves context for long conversations but raises token costs. The workflow formats messages as [1] msg\n[2] msg… and truncates at this limit.
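
The formatting and truncation described above can be sketched in a few lines. This assumes a hard character cut at the limit; the real workflow may truncate differently (for example on a message boundary), and the function name is hypothetical.

```python
def format_conversation(messages, max_chars=3000):
    """Number each message as "[n] text", join with newlines, and truncate."""
    numbered = [f"[{i}] {text}" for i, text in enumerate(messages, start=1)]
    return "\n".join(numbered)[:max_chars]
```

Note that the `[n] ` prefixes and newlines count toward the limit, so slightly fewer than `max_conversation_chars` characters of customer text reach the LLM.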

Output destination

"output": {
  "database": "SOCIAL_MEDIA_DB",
  "schema": "ML_FEATURES",
  "table": "HELPSCOUT_CONVERSATION_FEATURES"
}

To write to a different table (e.g. a staging or test table), change these values, update the table name in sql/create_features_table.sql to match, and run that DDL once to create the new table.

Sentiment categories

The sentiment_polarity and emotions blocks in processing_config.json define the valid values for classification. Adding, removing, or renaming a category here is automatically reflected in both the LLM prompt and the output validation — no code changes required.
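
One way a single config can drive both the prompt and the validation is shown below. The config shape, the emotion values, and both function names are assumptions for illustration; the real blocks in processing_config.json may carry more fields.

```python
config = {  # hypothetical shape; the real blocks may be structured differently
    "sentiment_polarity": ["very_positive", "positive", "neutral", "negative", "very_negative"],
    "emotions": ["frustration", "gratitude", "confusion"],  # example values only
}


def build_prompt_section(config):
    """List the allowed values so the prompt always matches the config."""
    polarity = ", ".join(config["sentiment_polarity"])
    emotions = ", ".join(config["emotions"])
    return f"Allowed polarity values: {polarity}\nAllowed emotions: {emotions}"


def validate_output(raw, config):
    """Drop anything the LLM returned that the config does not list."""
    polarity = raw.get("sentiment_polarity")
    if polarity not in config["sentiment_polarity"]:
        polarity = None
    emotions = [e for e in raw.get("emotions", []) if e in config["emotions"]]
    # Mirror the EMOTIONS column: comma-separated, or None if none are valid.
    return {"sentiment_polarity": polarity, "emotions": ",".join(emotions) or None}
```

Since both functions read the same lists, renaming a category in the config changes what the model is told and what the validator accepts in one step.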

Topic taxonomy

Topic definitions live in config_files/topics.json. This file is the single source of truth: the TopicExtractionAgent builds its system prompt directly from it. To add a new topic:

Add an entry to the "topics" array with a unique id, label, and description.
If the topic has boolean sub-flags (like billing), add a "flags" key — then update topic_extraction_agent.py to extract those flags.
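Re-run the pipeline. The new topic is picked up immediately, but conversations already in the output table are skipped, so it is applied only to newly processed conversations.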
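
The sketch below shows how a prompt could be rendered from the taxonomy and how a duplicate `id` could be caught before a run. The topic entries, the shape of the `"flags"` value, and both function names are invented for the example; only the `topics` array and the `id`, `label`, `description`, and `flags` keys come from the description above.

```python
import json

# Example taxonomy; these entries are illustrative, not the real topics.json.
TOPICS_JSON = """
{
  "topics": [
    {"id": "billing", "label": "Billing", "description": "Charges, invoices, payment methods",
     "flags": ["is_refund_request", "is_cancellation", "is_membership"]},
    {"id": "login", "label": "Login", "description": "Trouble signing in or resetting a password"}
  ]
}
"""


def build_topic_prompt(taxonomy):
    """Render one prompt line per topic from the parsed taxonomy."""
    lines = ["Assign every topic ID that applies:"]
    for topic in taxonomy["topics"]:
        lines.append(f"- {topic['id']}: {topic['label']} ({topic['description']})")
    return "\n".join(lines)


def find_duplicate_ids(taxonomy):
    """Return topic IDs that appear more than once; IDs must be unique."""
    seen, duplicates = set(), set()
    for topic in taxonomy["topics"]:
        (duplicates if topic["id"] in seen else seen).add(topic["id"])
    return sorted(duplicates)


taxonomy = json.loads(TOPICS_JSON)
prompt = build_topic_prompt(taxonomy)
```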

SQL Query

File: queries/helpscout_conversations.sql

| Design decision | Detail |
| --- | --- |
| Date filter | `CREATED_AT >= '2026-02-17'`, through the current date |
| Team exclusion | Anti-join with `USORA_USERS WHERE access_level = 'team'`, so only customer messages reach the pipeline |
| Thread types | `TYPE IN ('customer', 'message')`, which excludes notes, forwarded threads, and system messages |
| JSON extraction | Snowflake semi-structured syntax: `COLUMN:field::VARCHAR` |

To change the date range, edit the WHERE ct.CREATED_AT >= '...' line in the SQL file.

HTML Cleaner

html_cleaner.py runs a four-stage pipeline on every message body:

| Stage | What it does |
| --- | --- |
| `_remove_quoted_sections()` | Removes `<blockquote>` tags and Gmail/Outlook/Yahoo quoted-reply CSS wrappers |
| `_remove_boilerplate()` | Removes `<table>`, `<img>`, `<script>`, `<style>` tags and footer/unsubscribe blocks |
| `_extract_text()` | Extracts plain text while preserving line breaks |
| `_clean_text()` | Strips invisible Unicode, collapses whitespace, removes `>` quote lines, cuts off at "On … wrote:" markers |

To add a new boilerplate pattern, append a string to footer_keywords inside _remove_boilerplate(), or add a CSS class fragment to _QUOTED_CLASS_PATTERNS at the top of the file.
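
For orientation, the final text-cleaning stage could look roughly like the sketch below. It is a stdlib-only approximation of what `_clean_text()` is described as doing, not the actual implementation: the regular expressions, the set of invisible characters, and the function name `clean_text` are all assumptions.

```python
import re

# Zero-width and BOM characters that often survive HTML-to-text conversion.
_INVISIBLE = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
# A quoted-reply header line such as "On Mon, Feb 17, 2026 ... wrote:".
_REPLY_MARKER = re.compile(r"^On\b.*\bwrote:\s*$", re.MULTILINE)


def clean_text(text):
    """Strip invisible characters, quoted lines, and trailing reply chains."""
    text = _INVISIBLE.sub("", text)
    marker = _REPLY_MARKER.search(text)
    if marker:
        text = text[:marker.start()]  # drop everything from the marker on
    kept = [line for line in text.splitlines() if not line.lstrip().startswith(">")]
    # Collapse runs of spaces/tabs within each line, then runs of blank lines.
    kept = [re.sub(r"[ \t]+", " ", line).strip() for line in kept]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```

Cutting at the reply marker before filtering `>` lines means a whole quoted thread is dropped in one step, even when the mail client did not prefix every quoted line.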

Extending the Pipeline

Add a third agentic step

Create agents/your_new_agent.py inheriting from BaseAgent (see agents/README.md).
Add a new node method _your_node() in workflow/conversation_processor.py.

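Add the node and its edges in _build_workflow(), replacing the existing topic_extraction → END edge:

graph.add_node("your_step", self._your_node)
# topic_extraction now flows into the new step instead of END
graph.add_edge("topic_extraction", "your_step")
graph.add_edge("your_step", END)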

Add the corresponding output fields to ConversationState.
Map new columns in main.py's column_map dict and add them to the DDL.
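
A node method generally takes the current state and returns only the fields it updates. The sketch below illustrates that pattern in plain Python so it runs without LangGraph installed. Everything in it is hypothetical: the `urgency` field, the `UrgencyAgent` class (a keyword rule standing in for a BaseAgent subclass that would call an LLM), and the subset of `ConversationState` fields shown.

```python
from typing import List, Optional, TypedDict


class ConversationState(TypedDict, total=False):
    """Hypothetical subset of the workflow state; real fields may differ."""
    conversation_text: str
    topics: List[str]
    urgency: Optional[str]  # new field written by the new step
    processing_errors: List[str]


class UrgencyAgent:
    """Stand-in for a BaseAgent subclass; a keyword rule replaces the LLM call."""

    def process(self, text: str) -> str:
        return "high" if "urgent" in text.lower() else "normal"


def urgency_node(state: ConversationState) -> dict:
    """Run the agent and return only the fields this step updates."""
    errors = list(state.get("processing_errors", []))
    try:
        urgency = UrgencyAgent().process(state["conversation_text"])
    except Exception as exc:  # record the failure, keep the workflow running
        urgency = None
        errors.append(f"urgency: {exc}")
    return {"urgency": urgency, "processing_errors": errors}
```

Recording failures in the state rather than raising keeps one bad conversation from aborting a batch, which matches the purpose of the `PROCESSING_ERRORS` column.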

Change the date range

Edit queries/helpscout_conversations.sql:

ct.CREATED_AT >= '2026-02-17 00:00:00'   -- ← change start date

Include team replies

Remove the anti-join in helpscout_conversations.sql and broaden the TYPE filter to include 'note' ('message' is already listed in the current filter). If team messages need different handling, update the HTML cleaning and aggregation as well.

Process a different HelpScout mailbox

Add a filter on the mailbox ID column if the source table exposes one; otherwise filter by source_via or status.

Automate daily runs

Schedule main.py with a cron job, Airflow DAG, or any task scheduler. Because the pipeline skips already-processed conversations, re-running it daily processes only new conversations — no manual bookkeeping needed.