---
license: mit
tags:
- endpoints-template
- optimum
library_name: generic
---

# Optimized and Quantized [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) with a custom handler.py

This repository implements a `custom` handler for `question-answering` for 🤗 Inference Endpoints, using [🤗 Optimum](https://huggingface.co/docs/optimum/index) for accelerated inference. The code for the custom handler is in [handler.py](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/handler.py).

Below we also describe how we converted & optimized the model, based on the [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference) blog post. You can also check out the [notebook](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/optimize_model.ipynb).

### Expected Request Payload

```json
{
  "inputs": {
    "question": "As what is Philipp working?",
    "context": "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
  }
}
```

Below is an example of how to run a request using Python and `requests`.

## Run Request

```python
import requests as r

ENDPOINT_URL = ""
HF_TOKEN = ""


def predict(question: str = None, context: str = None):
    payload = {"inputs": {"question": question, "context": context}}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
)
```

Expected output:

```python
{
  'score': 0.4749588668346405,
  'start': 88,
  'end': 102,
  'answer': 'Technical Lead'
}
```

# Convert & Optimize model with Optimum

Steps:
1. [Convert model to ONNX](#1-convert-model-to-onnx)
2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
4. [Test Custom Handler Locally](#4-test-custom-handler-locally)
5. [Push to repository and create Inference Endpoint](#5-push-to-repository-and-create-inference-endpoint)

Helpful links:
* [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference)
* [Optimizing Transformers for GPUs with Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum-gpu)
* [Optimum Documentation](https://huggingface.co/docs/optimum/onnxruntime/modeling_ort)
* [Create Custom Handler Endpoints](https://link-to-docs)

## Setup & Installation

```python
%%writefile requirements.txt
optimum[onnxruntime]==1.4.0
mkl-include
mkl
```

```python
!pip install -r requirements.txt
```

## 0. Baseline Performance

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
```

Okay, let's test the performance (latency) with a sequence length of 128.

```python
context = "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question = "As what is Philipp working?"

payload = {"inputs": {"question": question, "context": context}}
```

```python
from time import perf_counter
import numpy as np

def measure_latency(pipe, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f}"

print(f"Vanilla model {measure_latency(qa, payload)}")
# Vanilla model Average latency (ms) - 64.15 +\- 2.44
```

## 1. Convert model to ONNX

```python
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
from pathlib import Path


model_id = "deepset/roberta-base-squad2"
onnx_path = Path(".")

# load vanilla transformers model and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```
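
As a quick sanity check (not part of the original notebook), the exported ONNX model can be dropped into a regular `transformers` pipeline. A minimal sketch, assuming the export above succeeded and reusing the `question` and `context` defined earlier:

```python
# sanity check: run the exported ONNX model through a transformers pipeline (sketch)
from transformers import pipeline

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
print(onnx_qa(question=question, context=context))
# should return the same answer ("Technical Lead") with a score close to the vanilla model
```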

## 2. Optimize & quantize model with Optimum

```python
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# Create the optimizer
optimizer = ORTOptimizer.from_pretrained(model)

# Define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=99)  # enable all optimizations

# Optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
```
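
The optimizer writes the graph-optimized model next to the original export. A small sketch (assuming the default file naming) to confirm the file is there before quantizing:

```python
# list the ONNX files written so far (sketch; file names assume the defaults used above)
import os

print([f for f in os.listdir(onnx_path) if f.endswith(".onnx")])
# e.g. ['model.onnx', 'model_optimized.onnx']
```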

```python
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)
```
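
Dynamic quantization converts the weights to int8, which also shrinks the model on disk. A rough size comparison, assuming the default file names from the steps above (`model.onnx` for the export, `model_optimized_quantized.onnx` for the quantized model):

```python
# compare file sizes of the exported vs. the optimized & quantized model (sketch; file names assumed)
import os

size_fp32 = os.path.getsize(onnx_path / "model.onnx") / (1024 * 1024)
size_int8 = os.path.getsize(onnx_path / "model_optimized_quantized.onnx") / (1024 * 1024)
print(f"Vanilla ONNX model: {size_fp32:.2f} MB")
print(f"Optimized & quantized ONNX model: {size_int8:.2f} MB")
```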

## 3. Create Custom Handler for Inference Endpoints

```python
%%writefile handler.py
from typing import Dict, Any
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline


class EndpointHandler():
    def __init__(self, path=""):
        # load the optimized and quantized model
        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # create inference pipeline
        self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)

    def __call__(self, data: Any) -> Dict[str, Any]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data (question and context) for the inference.
        Return:
            A :obj:`dict` containing the answer, its score, and the character span of the answer in the context.
        """
        inputs = data.get("inputs", data)
        # run the model
        prediction = self.pipeline(**inputs)
        # return prediction
        return prediction
```

## 4. Test Custom Handler Locally

```python
from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".")

# prepare sample payload
context = "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question = "As what is Philipp working?"

payload = {"inputs": {"question": question, "context": context}}

# test the handler
my_handler(payload)
```

```python
from time import perf_counter
import numpy as np

def measure_latency(handler, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = handler(payload)
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = handler(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f}"

print(f"Optimized & Quantized model {measure_latency(my_handler, payload)}")
# Optimized & Quantized model Average latency (ms) - 29.90 +\- 0.53
```

`Optimized & Quantized model Average latency (ms) - 29.90 +\- 0.53`
`Vanilla model Average latency (ms) - 64.15 +\- 2.44`
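
That works out to roughly a 2.1x latency improvement from optimization and dynamic quantization; a quick calculation based on the averages above:

```python
# rough speedup derived from the measured averages above
vanilla_ms = 64.15
optimized_ms = 29.90
print(f"Improvement through optimization & quantization: {vanilla_ms / optimized_ms:.2f}x")
# Improvement through optimization & quantization: 2.15x
```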

## 5. Push to repository and create Inference Endpoint

```python
# add all our new files
!git add *
# commit our files
!git commit -m "add custom handler"
# push the files to the hub
!git push
```
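
Alternatively, if the working directory is not a git clone of the repository, the files can be uploaded with the `huggingface_hub` library. A minimal sketch, assuming you are logged in (`huggingface-cli login`) and have write access to the target repository:

```python
# upload the current directory to the Hub (sketch; adjust repo_id to your repository)
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=".",
    repo_id="philschmid/roberta-base-squad2-optimized",
    repo_type="model",
    commit_message="add custom handler",
)
```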