| | --- |
| | license: cc-by-nc-4.0 |
| | datasets: |
| | - oeg/CelebA_Sent2Vect_Sp |
| | language: |
| | - es |
| | tags: |
| | - CelebA |
| | - Spanish |
| | - celebFaces Attributes |
| | --- |
| | # Sent2vec trained with data from the descriptive text corpus of the CelebA dataset |
| |
|
| | ## Overview |
| |
|
| | - **Language**: Spanish |
| | - **Data**: [CelebA_Sent2Vect_Sp](https://huggingface.co/datasets/oeg/CelebA_Sent2Vect_Sp)
| | - **Architecture**: Sent2vec |
| | - **Paper**: [Information Processing and Management](https://doi.org/10.1016/j.ipm.2024.103667) |
| | |
| | ## Description |
| | |
| | Sent2vec can be used directly on English text: since most of these models were trained with English as the original language, it is enough to download
| | the library and enter the text to be encoded. This work, however, uses text in Spanish, so the model had to be trained from scratch
| | in this language. Training was carried out on the generated corpus ([in this repository](https://huggingface.co/datasets/oeg/CelebA_Sent2Vect_Sp))
| | with the following process:
| | - A corpus of Spanish sentences describing the characteristics of each face in the CelebA dataset was generated.
| | A total of 192,209 sentences are available for training.
| | - The text was pre-processed by removing accents. _Stopwords_ and connectors were retained as part of the sentence structure during training.
| | - The _Sent2vec_ and _FastText_ libraries were installed and the parameters configured. The parameters were fixed empirically after several
| | tests: 4,800 dimensions for the feature vectors, 5,000 epochs, 200 threads, 2 n-grams, and a learning rate of 0.05.
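The accent-removal step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' actual preprocessing script; the function name `strip_accents` and the sample sentence are assumptions. Note that this approach also maps `ñ` to `n`, which may or may not match what was done for the original corpus.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining diacritical marks, e.g. 'está' -> 'esta'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep everything else,
    # so stopwords, connectors, and punctuation are left untouched.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Hypothetical sample sentence, not taken from the corpus.
sentence = "La mujer tiene el cabello castaño y está sonriendo."
print(strip_accents(sentence))
```

Applying the same function to any caption before encoding it keeps the input consistent with the text the model saw during training.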
| |
|
| | With this configuration, training took 7 hours in total, with all CPUs running at maximum performance.
| | The result is a _bin_ file, which can be downloaded from this repository.
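For readers who want to reproduce a comparable training run, the reported parameters could be passed to the Sent2vec command-line tool roughly as shown here. This is a hedged sketch: the file names are hypothetical, the location of the `fasttext` binary built from the Sent2vec sources is assumed, and mapping "2 n-grams" to the `-wordNgrams` flag is an interpretation rather than something the authors state. The snippet only assembles and prints the command; it does not launch training.

```python
import shlex

# Hypothetical paths: neither name comes from the original repository.
corpus_path = "celeba_captions_es.txt"  # one pre-processed sentence per line
output_prefix = "sent2vec_celebA_es"    # prefix for the resulting model file

# Hyperparameters as reported in the description.
params = {
    "dim": 4800,      # dimensions of the feature vectors
    "epoch": 5000,    # training epochs
    "thread": 200,    # threads
    "wordNgrams": 2,  # assumed meaning of "2 n-grams"
    "lr": 0.05,       # learning rate
}

# "./fasttext" is the binary built from the Sent2vec sources (assumed location).
command = ["./fasttext", "sent2vec", "-input", corpus_path, "-output", output_prefix]
for name, value in params.items():
    command += [f"-{name}", str(value)]

# Print a shell-safe version of the command for inspection.
print(shlex.join(command))
```

The printed command can be reviewed and then run from the directory where the Sent2vec tool was compiled.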
| |
|
| | ## How to use |
| | Download the model file, **sent2vec_celebAEs-UNI.bin**, and load it in Python with the _sent2vec_ library as follows:
| | |
| | ```python |
| | import sent2vec
| |
| | # Path to the downloaded model file
| | model_path = "sent2vec_celebAEs-UNI.bin"
| | s2vmodel = sent2vec.Sent2vecModel()
| | s2vmodel.load_model(model_path)
| |
| | # Spanish caption describing a face
| | caption = """El hombre luce una sombra a las 5 en punto. Su cabello es de color negro. Tiene una nariz grande con cejas tupidas. El hombre se ve atractivo"""
| | # Encode the caption as a sentence embedding
| | vector = s2vmodel.embed_sentence(caption)
| | print(vector)
| | ``` |
| | ## Results |
| | The encoder returns a numeric vector with 4,800 dimensions.
| | |
| | ```python |
| | >>> print(vector)
| | [[0.1, 0.87, 0.51, ..., 0.7]]
| | >>> len(vector[0])
| | 4800
| | ``` |
| | |
| | For detailed information on using the trained model, see the [following link](https://github.com/eduar03yauri/DCGAN-text2face-forSpanish/blob/main/Data/encoder-models/Sent2vec_model_trained.md).
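One common way to use such vectors is to compare two captions by cosine similarity. The sketch below is illustrative and not part of the original work: the helper `cosine_similarity` is an assumed name, and the three-dimensional lists are toy stand-ins for real embeddings, which would be `vector[0]` as returned by the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings; with the model loaded these would be, e.g.,
# s2vmodel.embed_sentence(caption_a)[0] and s2vmodel.embed_sentence(caption_b)[0].
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0]

print(cosine_similarity(v1, v2))  # identical direction: close to 1
print(cosine_similarity(v1, v3))  # orthogonal vectors: 0
```

Values near 1 indicate captions whose embeddings point in nearly the same direction, while values near 0 indicate unrelated embeddings.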
| | |
| | ## Licensing information |
| | This model is available under the [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.es) license.
| | |
| | ## Citation information |
| | |
| | **Citing**: If you use the Sent2vec+CelebA model in your work, please cite the paper published in **[Information Processing and Management](https://doi.org/10.1016/j.ipm.2024.103667)**:
| | |
| | ```bib |
| | @article{YAURILOZANO2024103667, |
| | title = {Generative Adversarial Networks for text-to-face synthesis & generation: A quantitative–qualitative analysis of Natural Language Processing encoders for Spanish}, |
| | journal = {Information Processing & Management}, |
| | volume = {61}, |
| | number = {3}, |
| | pages = {103667}, |
| | year = {2024}, |
| | issn = {0306-4573}, |
| | doi = {https://doi.org/10.1016/j.ipm.2024.103667}, |
| | url = {https://www.sciencedirect.com/science/article/pii/S030645732400027X}, |
| | author = {Eduardo Yauri-Lozano and Manuel Castillo-Cara and Luis Orozco-Barbosa and Raúl García-Castro} |
| | } |
| | ``` |
| | |
| | ## Authors
| | - [Eduardo Yauri Lozano](https://github.com/eduar03yauri) |
| | - [Manuel Castillo-Cara](https://github.com/manwestc) |
| | - [Raúl García-Castro](https://github.com/rgcmme) |
| | |
| | [*Universidad Nacional de Ingeniería*](https://www.uni.edu.pe/), [*Ontology Engineering Group*](https://oeg.fi.upm.es/), [*Universidad Politécnica de Madrid.*](https://www.upm.es/internacional) |
| | |
| | ## Contributors |
| | See the full list of contributors [here](https://github.com/eduar03yauri/DCGAN-text2face-forSpanish). |
| | |
| | <kbd><img src="https://www.uni.edu.pe/images/logos/logo_uni_2016.png" alt="Universidad Nacional de Ingeniería" width="100"></kbd>
| | <kbd><img src="https://raw.githubusercontent.com/oeg-upm/TINTO/main/assets/logo-oeg.png" alt="Ontology Engineering Group" width="100"></kbd> |
| | <kbd><img src="https://raw.githubusercontent.com/oeg-upm/TINTO/main/assets/logo-upm.png" alt="Universidad Politécnica de Madrid" width="100"></kbd> |