close
The Wayback Machine - https://web.archive.org/web/20220524180407/https://github.com/PlanTL-GOB-ES/lm-spanish
Skip to content
main
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 

Spanish Language Models 💃🏻

A repository part of the MarIA project.

Corpora 📃

Corpora Number of documents Number of tokens Size (GB)
BNE 201,080,084 135,733,450,668 570GB

Models 🤖

Fine-tunned models 🧗🏼‍♀️🏇🏼🤽🏼‍♀️🏌🏼‍♂️🏄🏼‍♀️

Word embeddings 🔤

Word embeddings trained with FastText for 300d:

Datasets 🗂️

Evaluation

Dataset Metric RoBERTa-b RoBERTa-l BETO* mBERT BERTIN** Electricidad***
MLDoc F1 0.9664 0.9702 0.9714🔥 0.9617 0.9668 0.9565
CoNLL-NERC F1 0.8851🔥 0.8823 0.8759 0.8691 0.8835 0.7954
CAPITEL-NERC F1 0.8960 0.9051🔥 0.8772 0.8810 0.8856 0.8035
PAWS-X F1 0.9020 0.9150🔥 0.8930 0.9000 0.8965 0.9045
UD-POS F1 0.9907🔥 0.9904 0.9900 0.9886 0.9898 0.9818
CAPITEL-POS F1 0.9846 0.9856🔥 0.9836 0.9839 0.9847 0.9816
SQAC F1 0.7923 0.8202🔥 0.7923 0.7562 0.7678 0.7383
STS Combined 0.8533🔥 0.8411 0.8159 0.8164 0.7945 0.8063
XNLI Accuracy 0.8016 0.8263🔥 0.8130 0.7876 0.7890 0.7878

* A model based on BERT architecture.

** A model based on RoBERTa architecture.

*** A model based on Electra architecture.

Usage example ⚗️

For the RoBERTa-base

from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer, FillMaskPipeline
from pprint import pprint
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model.eval()
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = f"¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

For the RoBERTa-large

from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer, FillMaskPipeline
from pprint import pprint
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
model.eval()
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = f"¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

Other Spanish Language Models 👩‍👧‍👦

We are developing domain-specific language models:

Cite 📣

@article{gutierrezfandino2022,
	author = {Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquin Silveira-Ocampo and Casimiro Pio Carrino and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Aitor Gonzalez-Agirre and Marta Villegas},
	title = {MarIA: Spanish Language Models},
	journal = {Procesamiento del Lenguaje Natural},
	volume = {68},
	number = {0},
	year = {2022},
	issn = {1989-7553},
	url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405},
	pages = {39--60}
}

Contact 📧

📋 We are interested in (1) extending our corpora to make larger models (2) train/evaluate the model in other tasks.

For questions regarding this work, contact Asier Gutiérrez-Fandiño (plantl-gob-es@bsc.es)

About

Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages