# PP-OCRv5_server_rec

## Overview

**PP-OCRv5_server_rec** is the server-side text recognition model of the PP-OCRv5 series, focused on efficient and accurate recognition and understanding of text elements in multi-language documents and natural scenes.

## Model Architecture

PP-OCRv5_server_rec is one of the PP-OCRv5_rec series, the latest generation of text recognition models developed by the PaddleOCR team. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.

## Usage

### Single input inference

The example below demonstrates how to recognize text with PP-OCRv5_server_rec using the [AutoModel](/docs/transformers/main/en/model_doc/auto#transformers.AutoModel) classes.

```python
import requests
from PIL import Image

from transformers import AutoImageProcessor, AutoModelForTextRecognition

model_path = "PaddlePaddle/PP-OCRv5_server_rec_safetensors"
model = AutoModelForTextRecognition.from_pretrained(model_path, device_map="auto")
image_processor = AutoImageProcessor.from_pretrained(model_path)

image = Image.open(requests.get("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_ocr_rec_001.png", stream=True).raw).convert("RGB")
inputs = image_processor(images=image, return_tensors="pt").to(model.device)
outputs = model(**inputs)

results = image_processor.post_process_text_recognition(outputs)

for result in results:
    print(result)
```

### Batched inference

Batched inference works the same way: pass a list of images to the image processor. Here is how you can do it with PP-OCRv5_server_rec using the [AutoModel](/docs/transformers/main/en/model_doc/auto#transformers.AutoModel) classes:

```python
import requests
from PIL import Image

from transformers import AutoImageProcessor, AutoModelForTextRecognition

model_path = "PaddlePaddle/PP-OCRv5_server_rec_safetensors"
model = AutoModelForTextRecognition.from_pretrained(model_path, device_map="auto")
image_processor = AutoImageProcessor.from_pretrained(model_path)

image = Image.open(requests.get("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_ocr_rec_001.png", stream=True).raw).convert("RGB")
inputs = image_processor(images=[image, image], return_tensors="pt").to(model.device)
outputs = model(**inputs)

results = image_processor.post_process_text_recognition(outputs)
for result in results:
    print(result)
```

## PPOCRV5ServerRecForTextRecognition[[transformers.PPOCRV5ServerRecForTextRecognition]]

#### transformers.PPOCRV5ServerRecForTextRecognition[[transformers.PPOCRV5ServerRecForTextRecognition]]

[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pp_ocrv5_server_rec/modeling_pp_ocrv5_server_rec.py#L358)

PPOCRV5ServerRec model for text recognition tasks.

This model inherits from [PreTrainedModel](/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.

**Parameters:**

config ([PPOCRV5ServerRecConfig](/docs/transformers/main/en/model_doc/pp_ocrv5_server_rec#transformers.PPOCRV5ServerRecConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

#### forward[[transformers.PPOCRV5ServerRecForTextRecognition.forward]]

[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pp_ocrv5_server_rec/modeling_pp_ocrv5_server_rec.py#L368)

The [PPOCRV5ServerRecForTextRecognition](/docs/transformers/main/en/model_doc/pp_ocrv5_server_rec#transformers.PPOCRV5ServerRecForTextRecognition) forward method overrides the `__call__` special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the `Module`
instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while
the latter silently ignores them.

**Parameters:**

- **pixel_values** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) -- The tensors corresponding to the input images. Pixel values can be obtained using [PPOCRV5ServerRecImageProcessor](/docs/transformers/main/en/model_doc/pp_ocrv5_server_rec#transformers.PPOCRV5ServerRecImageProcessor). See `PPOCRV5ServerRecImageProcessor.__call__()` for details.

**Returns:**

`BaseModelOutputWithNoAttention` or `tuple(torch.FloatTensor)`

A `BaseModelOutputWithNoAttention` or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration ([PPOCRV5ServerRecConfig](/docs/transformers/main/en/model_doc/pp_ocrv5_server_rec#transformers.PPOCRV5ServerRecConfig)) and inputs:

- **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`) -- Sequence of hidden states at the output of the last layer of the model.
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, num_channels, height, width)`. Hidden states of the model at the output of each layer plus the optional initial embedding outputs.

## PPOCRV5ServerRecConfig[[transformers.PPOCRV5ServerRecConfig]]

#### transformers.PPOCRV5ServerRecConfig[[transformers.PPOCRV5ServerRecConfig]]

[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pp_ocrv5_server_rec/configuration_pp_ocrv5_server_rec.py#L31)

This is the configuration class to store the configuration of a [PPOCRV5ServerRecModel](/docs/transformers/main/en/model_doc/pp_ocrv5_server_rec#transformers.PPOCRV5ServerRecModel). It is used to instantiate a PP-OCRv5_server_rec
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a configuration similar to that of [PaddlePaddle/PP-OCRv5_server_rec_safetensors](https://huggingface.co/PaddlePaddle/PP-OCRv5_server_rec_safetensors).

Configuration objects inherit from [PreTrainedConfig](/docs/transformers/main/en/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the
documentation from [PreTrainedConfig](/docs/transformers/main/en/main_classes/configuration#transformers.PreTrainedConfig) for more information.

**Parameters:**

hidden_act (`str`, *optional*, defaults to `"silu"`) : The non-linear activation function (function or string) in the decoder. For example, `"gelu"`, `"relu"`, `"silu"`, etc.

backbone_config (`Union[dict, ~configuration_utils.PreTrainedConfig]`, *optional*) : The configuration of the backbone model.

hidden_size (`int`, *optional*, defaults to `120`) : Dimension of the hidden representations.

mlp_ratio (`float`, *optional*, defaults to `2.0`) : Ratio of the MLP hidden dim to the embedding dim.

depth (`int`, *optional*, defaults to `2`) : Number of Transformer layers in the vision encoder.

head_out_channels (`int`, *optional*, defaults to 18385) : The number of output channels from the PPOCRV5ServerRecHead, responsible for final classification.

conv_kernel_size (`list`, *optional*) : The size of the convolutional kernel.

qkv_bias (`bool`, *optional*, defaults to `True`) : Whether to add a bias to the queries, keys and values.

num_attention_heads (`int`, *optional*, defaults to `8`) : Number of attention heads for each attention layer in the Transformer decoder.

attention_dropout (`Union[float, int]`, *optional*, defaults to `0.0`) : The dropout ratio for the attention probabilities.

layer_norm_eps (`float`, *optional*, defaults to `1e-06`) : The epsilon used by the layer normalization layers.
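To see how the documented defaults relate to one another, here is a small plain-Python sketch (illustrative arithmetic only, not the actual `transformers` configuration object):

```python
# Names mirror the documented defaults above; this is not the real config class.
defaults = {
    "hidden_size": 120,          # dimension of the hidden representations
    "mlp_ratio": 2.0,            # MLP hidden dim / embedding dim
    "num_attention_heads": 8,    # attention heads per Transformer layer
    "head_out_channels": 18385,  # output channels of the recognition head
}

# The MLP hidden dimension is the embedding dimension scaled by mlp_ratio.
mlp_hidden_dim = int(defaults["hidden_size"] * defaults["mlp_ratio"])  # 120 * 2.0 = 240

# Each attention head operates on hidden_size / num_attention_heads channels.
head_dim = defaults["hidden_size"] // defaults["num_attention_heads"]  # 120 // 8 = 15

print(mlp_hidden_dim, head_dim)
```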

## PPOCRV5ServerRecModel[[transformers.PPOCRV5ServerRecModel]]

#### transformers.PPOCRV5ServerRecModel[[transformers.PPOCRV5ServerRecModel]]

[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pp_ocrv5_server_rec/modeling_pp_ocrv5_server_rec.py#L321)

PPOCRV5ServerRec model, consisting of Backbone and Head networks.

This model inherits from [PreTrainedModel](/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.

**Parameters:**

config ([PPOCRV5ServerRecConfig](/docs/transformers/main/en/model_doc/pp_ocrv5_server_rec#transformers.PPOCRV5ServerRecConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

#### forward[[transformers.PPOCRV5ServerRecModel.forward]]

[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pp_ocrv5_server_rec/modeling_pp_ocrv5_server_rec.py#L329)

The [PPOCRV5ServerRecModel](/docs/transformers/main/en/model_doc/pp_ocrv5_server_rec#transformers.PPOCRV5ServerRecModel) forward method overrides the `__call__` special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the `Module`
instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while
the latter silently ignores them.

**Parameters:**

- **pixel_values** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) -- The tensors corresponding to the input images. Pixel values can be obtained using [PPOCRV5ServerRecImageProcessor](/docs/transformers/main/en/model_doc/pp_ocrv5_server_rec#transformers.PPOCRV5ServerRecImageProcessor). See `PPOCRV5ServerRecImageProcessor.__call__()` for details.

**Returns:**

`BaseModelOutputWithNoAttention` or `tuple(torch.FloatTensor)`

A `BaseModelOutputWithNoAttention` or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration ([PPOCRV5ServerRecConfig](/docs/transformers/main/en/model_doc/pp_ocrv5_server_rec#transformers.PPOCRV5ServerRecConfig)) and inputs:

- **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`) -- Sequence of hidden states at the output of the last layer of the model.
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, num_channels, height, width)`. Hidden states of the model at the output of each layer plus the optional initial embedding outputs.

## PPOCRV5ServerRecEncoderWithSVTR[[transformers.PPOCRV5ServerRecEncoderWithSVTR]]

#### transformers.PPOCRV5ServerRecEncoderWithSVTR[[transformers.PPOCRV5ServerRecEncoderWithSVTR]]

[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pp_ocrv5_server_rec/modeling_pp_ocrv5_server_rec.py#L259)

SVTR: Scene Text Recognition with a Single Visual Model
https://www.paddleocr.ai/v2.10.0/en/algorithm/text_recognition/algorithm_rec_svtr.html

## PPOCRV5MobileRecEncoderWithSVTR[[transformers.PPOCRV5MobileRecEncoderWithSVTR]]

#### transformers.PPOCRV5MobileRecEncoderWithSVTR[[transformers.PPOCRV5MobileRecEncoderWithSVTR]]

[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pp_ocrv5_mobile_rec/modeling_pp_ocrv5_mobile_rec.py#L258)

SVTR: Scene Text Recognition with a Single Visual Model
https://www.paddleocr.ai/v2.10.0/en/algorithm/text_recognition/algorithm_rec_svtr.html

## PPOCRV5ServerRecImageProcessor[[transformers.PPOCRV5ServerRecImageProcessor]]

#### transformers.PPOCRV5ServerRecImageProcessor[[transformers.PPOCRV5ServerRecImageProcessor]]

[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pp_ocrv5_server_rec/image_processing_pp_ocrv5_server_rec.py#L49)

Constructs a PPOCRV5ServerRec image processor.

#### get_target_size[[transformers.PPOCRV5ServerRecImageProcessor.get_target_size]]

[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pp_ocrv5_server_rec/image_processing_pp_ocrv5_server_rec.py#L116)

Calculates the target width and height from the widest image in the batch.

**Parameters:**

- **shape_list** (`list`) -- The shapes of the images in the batch.
- ****kwargs** ([ImagesKwargs](/docs/transformers/main/en/main_classes/processors#transformers.ImagesKwargs), *optional*) -- Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.
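The idea behind this method can be sketched as follows. This is a hypothetical re-implementation, assuming a fixed recognition input height of 48 pixels (a common PP-OCR recognition setting), not the library's actual code:

```python
def get_target_size(shape_list, target_height=48):
    """Hypothetical sketch: every image in the batch is scaled to a common
    height, and the target width is taken from the widest image after
    aspect-preserving scaling, so the whole batch can share one tensor shape."""
    # shape_list holds (height, width) pairs for each image in the batch.
    max_ratio = max(width / height for height, width in shape_list)
    target_width = int(round(target_height * max_ratio))
    return target_height, target_width

# Example: a 32x128 crop (aspect ratio 4.0) and a 48x96 crop (ratio 2.0).
# The wider ratio wins, so the batch target is 48 x 192.
print(get_target_size([(32, 128), (48, 96)]))  # -> (48, 192)
```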
#### post_process_text_recognition[[transformers.PPOCRV5ServerRecImageProcessor.post_process_text_recognition]]

[Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pp_ocrv5_server_rec/image_processing_pp_ocrv5_server_rec.py#L143)

Post-processes raw model logits to decode the recognized text and its confidence score.

**Parameters:**

predictions : Model outputs with `logits` attribute (probability maps of shape `(batch_size, height, vocab_size)`).

**Returns:**

A list of dictionaries, where each dictionary corresponds to an image in the batch.
Each dictionary contains:
- "text" (str): The decoded text string.
- "score" (float): The average confidence score of the characters in the decoded text.

