# PE Video (Perception Encoder Video)

## Overview

TODO

## Usage

### Basic usage

```py
TODO
```

## PeVideoVideoProcessor[[transformers.PeVideoVideoProcessor]]

#### transformers.PeVideoVideoProcessor[[transformers.PeVideoVideoProcessor]]

[Source](https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/models/pe_video/video_processing_pe_video.py#L24)

`__call__(videos, **kwargs)`

[Source](https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/video_processing_utils.py#L205)

## PeVideoProcessor[[transformers.PeVideoProcessor]]

#### transformers.PeVideoProcessor[[transformers.PeVideoProcessor]]

[Source](https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/models/pe_video/processing_pe_video.py#L4)

`__call__(images=None, text=None, videos=None, audio=None, **kwargs)`

[Source](https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/processing_utils.py#L605)

Per the type annotations, `videos` also accepts `transformers.video_utils.URL` and `transformers.video_utils.Path` values, singly or in (nested) lists, and `**kwargs` is typed as `Unpack[transformers.processing_utils.ProcessingKwargs]`.

Main method to prepare model inputs. This method forwards each modality argument to its own processor
along with `kwargs`. Refer to the docstring of each processor attribute for more information.
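The forwarding behavior described above can be pictured with a small toy. This is a minimal sketch, not the `PeVideoProcessor` implementation: `ToyMultimodalProcessor`, `toy_tokenizer`, `toy_video_processor`, and the output keys are hypothetical names chosen for illustration, and the toy passes all `kwargs` to every handler unchanged.

```python
from typing import Any, Callable


def toy_tokenizer(text, **kwargs) -> dict[str, Any]:
    # Hypothetical "tokenizer": encodes each word as its length.
    texts = [text] if isinstance(text, str) else list(text)
    return {"input_ids": [[len(word) for word in t.split()] for t in texts]}


def toy_video_processor(videos, **kwargs) -> dict[str, Any]:
    # Hypothetical "video processor": only reports the frame count per video.
    return {"num_frames": [len(video) for video in videos]}


class ToyMultimodalProcessor:
    """Illustrative stand-in that routes each modality to its own handler."""

    def __init__(self, handlers: dict[str, Callable[..., dict[str, Any]]]):
        self.handlers = handlers

    def __call__(self, text=None, videos=None, **kwargs) -> dict[str, Any]:
        outputs: dict[str, Any] = {}
        for name, value in (("text", text), ("videos", videos)):
            if value is None:
                continue  # skip modalities the caller did not pass
            outputs.update(self.handlers[name](value, **kwargs))
        return outputs


processor = ToyMultimodalProcessor(
    {"text": toy_tokenizer, "videos": toy_video_processor}
)
batch = processor(text="a cat jumps", videos=[[0, 1, 2]])
print(batch)
```

The merged dictionary plays the role that the returned `BatchFeature` plays in the real processor.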

**Parameters:**

images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`) : The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. Both channels-first and channels-last formats are supported.

text (`TextInput`, `PreTokenizedInput`, `list[TextInput]`, `list[PreTokenizedInput]`, *optional*) : The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).

videos (`np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`) : The video or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch tensor, or a nested list of 3D frames. Both channels-first and channels-last formats are supported.

audio (`np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`) : The audio or batch of audio to be prepared. Each audio can be a NumPy array or PyTorch tensor.

return_tensors (`str` or [TensorType](/docs/transformers/v5.1.0/en/internal/file_utils#transformers.TensorType), *optional*) : If set, will return tensors of a particular framework. Acceptable values are:

- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.

**Returns:**

[BatchFeature](/docs/transformers/v5.1.0/en/main_classes/feature_extractor#transformers.BatchFeature)

A [BatchFeature](/docs/transformers/v5.1.0/en/main_classes/feature_extractor#transformers.BatchFeature) object with processed inputs in a dict format.
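To make the "4D array, channels-first or channels-last" wording for `videos` concrete, the sketch below interprets a 4D shape tuple. It is an illustration only: `describe_video_shape` is a hypothetical helper, it assumes the first axis indexes frames, and it guesses the channel axis from a size of 1 or 3, which is not necessarily how the library infers the layout.

```python
def describe_video_shape(shape: tuple[int, ...]) -> dict[str, object]:
    """Interpret a 4D video shape (hypothetical heuristic, frames axis first)."""
    if len(shape) != 4:
        raise ValueError(f"expected a 4D shape, got {len(shape)} dimensions")
    num_frames, a, b, c = shape
    if a in (1, 3):
        # (frames, channels, height, width)
        layout, channels, height, width = "channels_first", a, b, c
    elif c in (1, 3):
        # (frames, height, width, channels)
        layout, height, width, channels = "channels_last", a, b, c
    else:
        raise ValueError(f"cannot locate a channel axis in shape {shape}")
    return {
        "layout": layout,
        "num_frames": num_frames,
        "channels": channels,
        "height": height,
        "width": width,
    }


first = describe_video_shape((16, 3, 224, 224))
last = describe_video_shape((16, 224, 224, 3))
print(first["layout"], last["layout"])
```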

## PeVideoEncoderConfig[[transformers.PeVideoEncoderConfig]]

#### transformers.PeVideoEncoderConfig[[transformers.PeVideoEncoderConfig]]

[Source](https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/models/pe_video/configuration_pe_video.py#L25)

This is the configuration class to store the configuration of a [PeVideoEncoder](/docs/transformers/v5.1.0/en/model_doc/pe_video#transformers.PeVideoEncoder). It is used to instantiate a
PeVideoEncoder model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a configuration similar to that of
[facebook/pe-av-large](https://huggingface.co/facebook/pe-av-large).

Configuration objects inherit from [PreTrainedConfig](/docs/transformers/v5.1.0/en/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the
documentation from [PreTrainedConfig](/docs/transformers/v5.1.0/en/main_classes/configuration#transformers.PreTrainedConfig) for more information.

```python
>>> from transformers import PeVideoEncoder, PeVideoEncoderConfig

>>> # Initializing a PeVideoEncoder style configuration
>>> configuration = PeVideoEncoderConfig()

>>> # Initializing a model from the pe-av-large style configuration
>>> model = PeVideoEncoder(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

**Parameters:**

vision_config (`Union[PreTrainedConfig, dict]`, *optional*) : Configuration for the vision backbone used to extract frame embeddings. If a dictionary is provided, it is used to instantiate a [TimmWrapperConfig](/docs/transformers/v5.1.0/en/model_doc/timm_wrapper#transformers.TimmWrapperConfig) with the PE default arguments.

hidden_size (`int`, *optional*, defaults to 1792) : Dimension of the hidden representations.

intermediate_size (`int`, *optional*, defaults to 4800) : Dimension of the feedforward layers in the Transformer blocks.

num_hidden_layers (`int`, *optional*, defaults to 6) : Number of Transformer encoder blocks.

num_attention_heads (`int`, *optional*, defaults to 14) : Number of attention heads used in each attention layer.

num_key_value_heads (`int`, *optional*) : Number of key and value heads for grouped-query attention. If unset, this defaults to `num_attention_heads`.

head_dim (`int`, *optional*, defaults to 128) : Dimension of each attention head for query, key, and value projections.

hidden_act (`str`, *optional*, defaults to `"silu"`) : The non-linear activation function (function or string) in the Transformer blocks.

max_position_embeddings (`int`, *optional*, defaults to 10000) : Maximum sequence length supported by the rotary position embeddings.

initializer_range (`float`, *optional*, defaults to 0.02) : Standard deviation of the truncated normal initializer for weight matrices.

rms_norm_eps (`float`, *optional*, defaults to 1e-05) : Epsilon used by the RMS normalization layers.

rope_parameters (`Union[RopeParameters, dict]`, *optional*, defaults to `{'rope_theta': 20000}`) : Parameters for the rotary position embeddings, such as the base `rope_theta`.

attention_bias (`bool`, *optional*, defaults to `False`) : Whether to use bias terms in the query, key, value, and output projections.

attention_dropout (`float`, *optional*, defaults to 0.0) : Dropout ratio applied to attention probabilities.
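As a quick check on how the attention defaults fit together, the sketch below recomputes the projection widths from the values listed above. The variable names are illustrative, and the inverse-frequency formula is the standard rotary embedding formulation, assumed here rather than taken from the `PeVideoEncoder` source.

```python
# Defaults copied from the parameter list above.
hidden_size = 1792
num_attention_heads = 14
head_dim = 128
num_key_value_heads = None  # unset, so it falls back to num_attention_heads
rope_theta = 20000

kv_heads = (
    num_key_value_heads if num_key_value_heads is not None else num_attention_heads
)

# Width of the query projection and of each key/value projection.
query_width = num_attention_heads * head_dim
key_value_width = kv_heads * head_dim

# Number of query heads that share one key/value head (1 means no grouping).
queries_per_kv_head = num_attention_heads // kv_heads

# Standard RoPE inverse frequencies, one per pair of head dimensions (assumed).
inv_freq = [rope_theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

print(query_width, key_value_width, queries_per_kv_head, len(inv_freq))
```

With the defaults, the query width of 14 × 128 matches `hidden_size`, and leaving `num_key_value_heads` unset gives one key/value head per query head.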

## PeVideoConfig[[transformers.PeVideoConfig]]

#### transformers.PeVideoConfig[[transformers.PeVideoConfig]]

[Source](https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/models/pe_video/configuration_pe_video.py#L142)

This is the configuration class to store the configuration of a [PeVideoModel](/docs/transformers/v5.1.0/en/model_doc/pe_video#transformers.PeVideoModel). It is used to instantiate a
PeVideoModel model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a configuration similar to that of
[facebook/pe-av-large](https://huggingface.co/facebook/pe-av-large).

Configuration objects inherit from [PreTrainedConfig](/docs/transformers/v5.1.0/en/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the
documentation from [PreTrainedConfig](/docs/transformers/v5.1.0/en/main_classes/configuration#transformers.PreTrainedConfig) for more information.

```python
>>> from transformers import PeVideoModel, PeVideoConfig

>>> # Initializing a PeVideoModel style configuration
>>> configuration = PeVideoConfig()

>>> # Initializing a model from the pe-av-large style configuration
>>> model = PeVideoModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

**Parameters:**

text_config (`dict` or `PreTrainedConfig`, *optional*) : Configuration for the text model component.

video_config (`dict` or `PreTrainedConfig`, *optional*) : Configuration for the video encoder component.
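Both sub-configurations accept either a dictionary or a configuration object. One common way to support that is to normalize the argument at construction time, as in the toy below. This is a sketch under assumptions, not the `PeVideoConfig` source: `ToySubConfig`, `as_sub_config`, and `ToyCompositeConfig` are hypothetical names, and the single `hidden_size` field is only a placeholder.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ToySubConfig:
    """Hypothetical sub-config standing in for a text or video config."""

    hidden_size: int = 1792


def as_sub_config(value: Union[dict, ToySubConfig, None]) -> ToySubConfig:
    if value is None:
        return ToySubConfig()  # fall back to the defaults
    if isinstance(value, dict):
        return ToySubConfig(**value)  # build the object from keyword arguments
    return value  # already a config object, use it as-is


class ToyCompositeConfig:
    """Hypothetical composite config holding two normalized sub-configs."""

    def __init__(self, text_config=None, video_config=None):
        self.text_config = as_sub_config(text_config)
        self.video_config = as_sub_config(video_config)


config = ToyCompositeConfig(video_config={"hidden_size": 1024})
print(config.text_config.hidden_size, config.video_config.hidden_size)
```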

## PeVideoModel[[transformers.PeVideoModel]]

#### transformers.PeVideoModel[[transformers.PeVideoModel]]

[Source](https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/models/pe_video/modeling_pe_video.py#L557)

`forward(input_ids: Tensor, pixel_values_videos: Tensor, attention_mask: torch.Tensor | None = None, padding_mask_videos: torch.Tensor | None = None, return_loss: bool | None = None, **kwargs)`

[Source](https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/models/pe_video/modeling_pe_video.py#L607)

## PeVideoEncoder[[transformers.PeVideoEncoder]]

#### transformers.PeVideoEncoder[[transformers.PeVideoEncoder]]

[Source](https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/models/pe_video/modeling_pe_video.py#L503)

The PeVideo Encoder model.

This model inherits from [PreTrainedModel](/docs/transformers/v5.1.0/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads,
etc.).

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.

`forward(pixel_values_videos: Tensor, padding_mask_videos: torch.Tensor | None = None, **kwargs)`

[Source](https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/models/pe_video/modeling_pe_video.py#L522)

**Parameters:**

config ([PeVideoEncoderConfig](/docs/transformers/v5.1.0/en/model_doc/pe_video#transformers.PeVideoEncoderConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/v5.1.0/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

