# Insect Label Parser: Setup Instructions
This tool reads raw entomology collection label text and extracts structured
data (country, state, locality, date, collector, elevation, etc.) as JSON.
It runs entirely on your computer; no internet connection is required after
the one-time setup.
---
## Step 1: Which file do I need?
Copy one of these files from `output/gguf/` to your computer:
| File | Size | Use when |
|------|------|----------|
| `ento-label-parser-q4_k_m.gguf` | 3.2 GB | Your computer has **8 GB RAM** (most laptops) |
| `ento-label-parser-q5_k_m.gguf` | 3.4 GB | Your computer has **16 GB RAM or more** (slightly better quality) |
Not sure how much RAM you have?
- **Mac:** Apple menu → About This Mac → look for "Memory"
- **Windows:** Settings → System → About → look for "Installed RAM"
> **The Q4 file works well for this task.** Label parsing is a simple
> extraction job, so the quality difference between Q4 and Q5 is very small.
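
For readers comfortable with a terminal on Linux or macOS, the table above can be turned into a small check. This is only a sketch: `ram_gb` and `pick_model_file` are hypothetical helper names, and the 16 GB threshold is taken directly from the table.

```bash
# Report installed RAM in whole GB (rounded to the nearest GB).
ram_gb() {
  if [ -r /proc/meminfo ]; then
    # Linux: MemTotal is reported in kB.
    awk '/^MemTotal:/ { printf "%d\n", ($2 + 524288) / 1048576 }' /proc/meminfo
  else
    # macOS: hw.memsize is reported in bytes.
    sysctl -n hw.memsize | awk '{ printf "%d\n", ($1 + 536870912) / 1073741824 }'
  fi
}

# Print the file name suggested by the table for a given amount of RAM in GB.
pick_model_file() {
  if [ "$1" -ge 16 ]; then
    echo "ento-label-parser-q5_k_m.gguf"
  else
    echo "ento-label-parser-q4_k_m.gguf"
  fi
}
```

Running `pick_model_file "$(ram_gb)"` prints the suggested file name for the current machine.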
---
## Option A: LM Studio (recommended for most users; no terminal needed)
LM Studio is a free desktop app with a chat interface, similar to ChatGPT
but running fully on your own machine.
### Install
1. Go to **lmstudio.ai** and download the version for your operating system
(Mac, Windows, or Linux)
2. Install and open it
### Load the model
1. In LM Studio, click **My Models** in the left sidebar
2. Click **"Load model from file"** (or drag the `.gguf` file into the window)
3. Navigate to the `ento-label-parser-q4_k_m.gguf` file you copied in Step 1
4. Wait for the model to load (progress bar at the bottom)
### Configure the system prompt
This step tells the model what it is supposed to do.
1. Click the **Chat** icon in the left sidebar
2. Find the **System Prompt** box (usually at the top of the right panel)
3. Paste this text exactly:
```
Parse this insect collection label and return a JSON object with the extracted fields. Only include fields that are present in the label.
```
4. Set **Temperature** to `0` in the model settings panel (this makes
output deterministic, so the same label always gives the same result)
### Parse a label
Paste the raw label text into the chat box and press Enter. The model will
return a JSON object. Example:
**Input:**
```
U.S.A., Texas: Austin, Travis Co., 15.iv.2021, J. Doe, sweeping
```
**Output:**
```json
{
  "country": "USA",
  "state": "Texas",
  "county": "Travis",
  "verbatim_locality": "Austin",
  "verbatim_date": "15.iv.2021",
  "start_date_year": "2021",
  "start_date_month": "4",
  "start_date_day": "15",
  "verbatim_collectors": "J. Doe",
  "verbatim_method": "sweeping"
}
```
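
If the parsed date needs to go into a spreadsheet or database, the three `start_date_*` fields from the example output can be joined into a single ISO 8601 date. The sketch below assumes all three fields are present; `iso_date` is a hypothetical helper name, not part of this tool.

```bash
# Join year, month, and day into YYYY-MM-DD, zero-padding month and day.
# A single leading zero is stripped first so that values such as "08"
# are not misread as octal numbers by printf.
iso_date() {
  printf '%04d-%02d-%02d\n' "$1" "${2#0}" "${3#0}"
}
```

For the example label, `iso_date 2021 4 15` prints `2021-04-15`.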
---
## Option B: Ollama (for users comfortable with a terminal)
Ollama is a lightweight tool that runs models from the command line and also
exposes a local API for scripting.
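
As a rough illustration of the scripting route, the sketch below sends one label to the local API with `curl`. It assumes the Ollama server is running on its default address (`http://localhost:11434`), that the model has already been registered as `ento-label-parser` (see "Register the model" below), and that the `Modelfile` carries the system prompt. The helper names `json_escape`, `build_payload`, and `parse_label_api` are hypothetical.

```bash
# Address of the local Ollama server (default port assumed).
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Escape backslashes and double quotes so the label can sit inside a JSON
# string. Handles single-line labels only.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build the request body for the /api/generate endpoint.
# "stream": false asks for one complete reply instead of a token stream.
build_payload() {
  printf '{"model": "ento-label-parser", "prompt": "%s", "stream": false, "options": {"temperature": 0}}\n' "$(json_escape "$1")"
}

# Send the label and print the raw API reply. The reply is a JSON envelope
# whose "response" field holds the text produced by the model.
parse_label_api() {
  build_payload "$1" | curl -s "$OLLAMA_URL/api/generate" -d @-
}
```

Calling `parse_label_api "U.S.A., Texas: Austin, 15.iv.2021, J. Doe"` should print the envelope; the parsed label is the text inside its `response` field.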
### Requirement: Ollama version 0.20.7 or newer
Older versions do not support this model's architecture. Check your version:
```bash
ollama --version
```
If it shows a version older than 0.20.7, update from **ollama.com**.
### Install
Go to **ollama.com**, download, and install for your operating system.
### Register the model
Open a terminal, navigate to the project folder, and run:
```bash
ollama create ento-label-parser -f Modelfile
```
You only need to do this once.
### Parse a label
```bash
ollama run ento-label-parser "U.S.A., Texas: Austin, 15.iv.2021, J. Doe"
```
Or pipe a text file:
```bash
cat my_label.txt | ollama run ento-label-parser
```
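
The same stdin approach extends to a whole folder of labels. The following is a minimal sketch that assumes one label per `.txt` file; the function name `parse_labels_dir` and the `LABEL_PARSER` override variable are hypothetical conveniences, not part of Ollama.

```bash
# Parse every .txt file in a folder and write one .json file per label.
# Usage: parse_labels_dir INPUT_DIR OUTPUT_DIR
# LABEL_PARSER can be set to another command (for example, for a dry run);
# by default the registered model is called through "ollama run".
parse_labels_dir() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for f in "$in_dir"/*.txt; do
    [ -e "$f" ] || continue          # the folder has no .txt files
    name=$(basename "$f" .txt)
    ${LABEL_PARSER:-ollama run ento-label-parser} < "$f" > "$out_dir/$name.json"
  done
}
```

For example, `parse_labels_dir labels parsed` reads `labels/*.txt` and writes the results to `parsed/`. Each output file holds whatever the model printed, so it is worth spot-checking a few files before relying on them.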
---
## Troubleshooting
**The model is very slow.**
This is normal on a laptop without a dedicated GPU. The Q4 file typically
takes 5–30 seconds per label on a CPU. If you have an NVIDIA or AMD GPU
with 4+ GB of video memory, Ollama and LM Studio will use it automatically
and be much faster.
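
To see what that range means for a whole collection, a quick back-of-the-envelope estimate helps. The sketch below simply multiplies the number of labels by the seconds per label; `estimate_minutes` is a hypothetical helper name.

```bash
# Estimate the total run time in minutes, rounded up.
# Usage: estimate_minutes LABEL_COUNT SECONDS_PER_LABEL
estimate_minutes() {
  labels=$1
  secs=$2
  echo $(( (labels * secs + 59) / 60 ))
}
```

At the CPU speeds quoted above, 1,000 labels would take roughly `estimate_minutes 1000 5` (84 minutes) to `estimate_minutes 1000 30` (500 minutes), so a large batch is best left running unattended.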
**LM Studio says "not enough memory."**
Try the Q4 file if you were using Q5. If Q4 also fails, your computer may
have less than 8 GB of RAM available. Try closing other applications first.
**Ollama says "unknown model architecture: gemma4".**
Your Ollama version is too old. Update it from **ollama.com**.
**The output is not valid JSON.**
Occasionally the model will include a short thinking passage before the
JSON. Copy just the `{ ... }` portion of the output. If this happens
frequently, make sure Temperature is set to `0`.
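
When labels are parsed from a script, the same copy step can be automated. The sketch below keeps everything from the first `{` to the last `}` in the output and discards the rest; `extract_json` is a hypothetical helper name. It assumes the passage before the JSON contains no `{` of its own, and it does not check that the result is valid JSON.

```bash
# Read model output on stdin and print only the { ... } portion.
# Newlines are flattened to spaces first so sed can treat the output as a
# single line; this is harmless because JSON strings cannot contain raw
# newlines.
extract_json() {
  tr '\n' ' ' | sed 's/^[^{]*//; s/[^}]*$//'
}
```

For example, `ollama run ento-label-parser < my_label.txt | extract_json` prints just the JSON object on one line.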