---
license: mit
tags:
- endpoints-template
- optimum
library_name: generic
---

# Optimized and Quantized [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) with a custom handler.py

This repository implements a `custom` handler for `question-answering` for 🤗 Inference Endpoints, using [🤗 Optimum](https://huggingface.co/docs/optimum/index) for accelerated inference. The code for the custom handler is in [handler.py](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/handler.py).

Below we also describe how we converted & optimized the model, based on the [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference) blog post. You can also check out the [notebook](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/optimize_model.ipynb).

### Expected Request Payload

```json
{
  "inputs": {
    "question": "As what is Philipp working?",
    "context": "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
  }
}
```

Below is an example of how to run a request using Python and `requests`.

## Run Request

```python
import requests as r

ENDPOINT_URL = ""
HF_TOKEN = ""


def predict(question: str = None, context: str = None):
    payload = {"inputs": {"question": question, "context": context}}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
)
```

Expected output:

```python
{
  'score': 0.4749588668346405,
  'start': 88,
  'end': 102,
  'answer': 'Technical Lead'
}
```

# Convert & Optimize model with Optimum

Steps:
1. [Convert model to ONNX](#1-convert-model-to-onnx)
2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
4. [Test Custom Handler Locally](#4-test-custom-handler-locally)
5. [Push to repository and create Inference Endpoint](#5-push-to-repository-and-create-inference-endpoint)

Helpful links:
* [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference)
* [Optimizing Transformers for GPUs with Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum-gpu)
* [Optimum Documentation](https://huggingface.co/docs/optimum/onnxruntime/modeling_ort)
* [Create Custom Handler Endpoints](https://link-to-docs)

## Setup & Installation

```python
%%writefile requirements.txt
optimum[onnxruntime]==1.4.0
mkl-include
mkl
```

```python
!pip install -r requirements.txt
```

## 0. Baseline Performance

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
```

Okay, let's test the performance (latency) with a sequence length of 128.

```python
context = "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question = "As what is Philipp working?"

payload = {"inputs": {"question": question, "context": context}}
```

```python
from time import perf_counter
import numpy as np

def measure_latency(pipe, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f}"

print(f"Vanilla model {measure_latency(qa, payload)}")
# Vanilla model Average latency (ms) - 64.15 +\- 2.44
```

## 1. Convert model to ONNX

```python
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
from pathlib import Path


model_id = "deepset/roberta-base-squad2"
onnx_path = Path(".")

# load vanilla transformers model and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```
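
As a quick sanity check (not part of the original notebook), the exported ONNX model can be dropped into a regular `transformers` pipeline. A minimal sketch, assuming the export above succeeded and reusing the `question` and `context` defined earlier:

```python
# sanity check: run the exported ONNX model through a transformers pipeline (sketch)
from transformers import pipeline

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
print(onnx_qa(question=question, context=context))
# should return the same answer ("Technical Lead") with a score close to the vanilla model
```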

## 2. Optimize & quantize model with Optimum

```python
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# Create the optimizer
optimizer = ORTOptimizer.from_pretrained(model)

# Define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=99)  # enable all optimizations

# Optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
```
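
The optimizer writes the graph-optimized model next to the original export. A small sketch (assuming the default file naming) to confirm the file is there before quantizing:

```python
# list the ONNX files written so far (sketch; file names assume the defaults used above)
import os

print([f for f in os.listdir(onnx_path) if f.endswith(".onnx")])
# e.g. ['model.onnx', 'model_optimized.onnx']
```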

```python
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)
```
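
Dynamic quantization converts the weights to int8, which also shrinks the model on disk. A rough size comparison, assuming the default file names from the steps above (`model.onnx` for the export, `model_optimized_quantized.onnx` for the quantized model):

```python
# compare file sizes of the exported vs. the optimized & quantized model (sketch; file names assumed)
import os

size_fp32 = os.path.getsize(onnx_path / "model.onnx") / (1024 * 1024)
size_int8 = os.path.getsize(onnx_path / "model_optimized_quantized.onnx") / (1024 * 1024)
print(f"Vanilla ONNX model: {size_fp32:.2f} MB")
print(f"Optimized & quantized ONNX model: {size_int8:.2f} MB")
```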

## 3. Create Custom Handler for Inference Endpoints

```python
%%writefile handler.py
from typing import Dict, Any
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline


class EndpointHandler():
    def __init__(self, path=""):
        # load the optimized and quantized model
        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # create inference pipeline
        self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)

    def __call__(self, data: Any) -> Dict[str, Any]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data (question and context) for the inference.
        Return:
            A :obj:`dict` containing the answer, its score, and the character span of the answer in the context.
        """
        inputs = data.get("inputs", data)
        # run the model
        prediction = self.pipeline(**inputs)
        # return prediction
        return prediction
```

## 4. Test Custom Handler Locally

```python
from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".")

# prepare sample payload
context = "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question = "As what is Philipp working?"

payload = {"inputs": {"question": question, "context": context}}

# test the handler
my_handler(payload)
```

```python
from time import perf_counter
import numpy as np

def measure_latency(handler, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = handler(payload)
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = handler(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f}"

print(f"Optimized & Quantized model {measure_latency(my_handler, payload)}")
# Optimized & Quantized model Average latency (ms) - 29.90 +\- 0.53
```

`Optimized & Quantized model Average latency (ms) - 29.90 +\- 0.53`
`Vanilla model Average latency (ms) - 64.15 +\- 2.44`
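
That works out to roughly a 2.1x latency improvement from optimization and dynamic quantization; a quick calculation based on the averages above:

```python
# rough speedup derived from the measured averages above
vanilla_ms = 64.15
optimized_ms = 29.90
print(f"Improvement through optimization & quantization: {vanilla_ms / optimized_ms:.2f}x")
# Improvement through optimization & quantization: 2.15x
```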

## 5. Push to repository and create Inference Endpoint

```python
# add all our new files
!git add *
# commit our files
!git commit -m "add custom handler"
# push the files to the hub
!git push
```
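
Alternatively, if the working directory is not a git clone of the repository, the files can be uploaded with the `huggingface_hub` library. A minimal sketch, assuming you are logged in (`huggingface-cli login`) and have write access to the target repository:

```python
# upload the current directory to the Hub (sketch; adjust repo_id to your repository)
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=".",
    repo_id="philschmid/roberta-base-squad2-optimized",
    repo_type="model",
    commit_message="add custom handler",
)
```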