# Insect Label Parser: Setup Instructions

This tool reads raw entomology collection label text and extracts structured data (country, state, locality, date, collector, elevation, etc.) as JSON. It runs entirely on your computer; no internet connection is required after the one-time setup.

---

## Step 1: Which file do I need?

Copy one of these files from `output/gguf/` to your computer:

| File | Size | Use when |
|------|------|----------|
| `ento-label-parser-q4_k_m.gguf` | 3.2 GB | Your computer has **8 GB RAM** (most laptops) |
| `ento-label-parser-q5_k_m.gguf` | 3.4 GB | Your computer has **16 GB RAM or more** (slightly better quality) |

Not sure how much RAM you have?

- **Mac:** Apple menu → About This Mac → look for "Memory"
- **Windows:** Settings → System → About → look for "Installed RAM"

> **The Q4 file works well for this task.** Label parsing is a simple
> extraction job, so the quality difference between Q4 and Q5 is very small.

---

## Option A: LM Studio (recommended for most users, no terminal needed)

LM Studio is a free desktop app with a chat interface, similar to ChatGPT but running fully on your own machine.

### Install

1. Go to **lmstudio.ai** and download the version for your operating system (Mac, Windows, or Linux)
2. Install and open it

### Load the model

1. In LM Studio, click **My Models** in the left sidebar
2. Click **"Load model from file"** (or drag the `.gguf` file into the window)
3. Navigate to the `ento-label-parser-q4_k_m.gguf` file you copied in Step 1
4. Wait for the model to load (progress bar at the bottom)

### Configure the system prompt

This step tells the model what it is supposed to do.

1. Click the **Chat** icon in the left sidebar
2. Find the **System Prompt** box (usually at the top of the right panel)
3. Paste this text exactly:

   ```
   Parse this insect collection label and return a JSON object with the extracted fields. Only include fields that are present in the label.
   ```

4. Set **Temperature** to `0` in the model settings panel (this makes output deterministic: the same label always gives the same result)

### Parse a label

Paste the raw label text into the chat box and press Enter. The model will return a JSON object.

Example:

**Input:**

```
U.S.A., Texas: Austin, Travis Co., 15.iv.2021, J. Doe, sweeping
```

**Output:**

```json
{
  "country": "USA",
  "state": "Texas",
  "county": "Travis",
  "verbatim_locality": "Austin",
  "verbatim_date": "15.iv.2021",
  "start_date_year": "2021",
  "start_date_month": "4",
  "start_date_day": "15",
  "verbatim_collectors": "J. Doe",
  "verbatim_method": "sweeping"
}
```

---

## Option B: Ollama (for users comfortable with a terminal)

Ollama is a lightweight tool that runs models from the command line and also exposes a local API for scripting.

### Requirement: Ollama version 0.20.7 or newer

Older versions do not support this model's architecture. Check your version:

```
ollama --version
```

If it shows a version older than 0.20.7, update from **ollama.com**.

### Install

Go to **ollama.com**, download, and install for your operating system.

### Register the model

Open a terminal, navigate to the project folder, and run:

```bash
ollama create ento-label-parser -f Modelfile
```

You only need to do this once.

### Parse a label

```bash
ollama run ento-label-parser "U.S.A., Texas: Austin, 15.iv.2021, J. Doe"
```

Or pipe a text file:

```bash
cat my_label.txt | ollama run ento-label-parser
```

---

## Troubleshooting

**The model is very slow.**
This is normal on a laptop without a dedicated GPU. The Q4 file typically takes 5–30 seconds per label on a CPU. If you have an NVIDIA or AMD GPU with 4+ GB of video memory, Ollama and LM Studio will use it automatically and be much faster.

**LM Studio says "not enough memory."**
Try the Q4 file if you were using Q5. If Q4 also fails, your computer may have less than 8 GB of RAM available; try closing other applications first.

**Ollama says "unknown model architecture: gemma4".**
Your Ollama version is too old. Update it from **ollama.com**.

**The output is not valid JSON.**
Occasionally the model will include a short thinking passage before the JSON. Copy just the `{ ... }` portion of the output. If this happens frequently, make sure Temperature is set to `0`.
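
If you are processing labels from a script rather than by hand, copying the `{ ... }` portion can be automated. The sketch below is a minimal Python example and is not part of this tool: the function name `extract_json` and the sample string are hypothetical, and it assumes the model's reply contains at least one well-formed JSON object somewhere after any thinking text.

```python
import json


def extract_json(raw: str) -> dict:
    """Return the first well-formed JSON object found in raw model output.

    Hypothetical helper: scans for each '{' and tries to decode a JSON
    object starting there, skipping any text (such as a thinking passage)
    that comes before or after it.
    """
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            # This brace did not open valid JSON; try the next one.
            start = raw.find("{", start + 1)
            continue
        return obj
    raise ValueError("no JSON object found in model output")


# Made-up sample output with a thinking passage in front of the JSON.
sample = (
    "Thinking: the date uses a Roman-numeral month.\n"
    '{"country": "USA", "state": "Texas", "verbatim_date": "15.iv.2021"}'
)
print(extract_json(sample)["state"])  # prints Texas
```

Trying each opening brace in turn, rather than slicing from the first `{` to the last `}`, keeps the helper working when the thinking passage itself contains stray braces.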