---
license: mit
datasets:
- lc_quad
---

This repo contains a custom tokenizer for SPARQL queries. It is a SentencePieceBPE tokenizer trained on the `lc_quad` dataset. The example below compares its output with that of the default T5 tokenizer on the same query.

Original query:
```
SELECT ?answer WHERE { wd:Q825946 wdt:P371 ?X . ?X wdt:P2048 ?answer}
```

Result from the default T5 tokenizer (shown only for comparison):
```
['▁', 'SEL', 'ECT', '▁', '?', 'ans', 'wer', '▁W', 'HER', 'E', '▁', '{', '▁', 'w', 'd', ':', 'Q', '82', '59', '46', '▁',
'w', 'd', 't', ':', 'P', '37', '1', '▁', '?', 'X', '▁', '.', '▁', '?', 'X', '▁', 'w', 'd', 't', ':', 'P', '20', '48',
'▁', '?', 'ans', 'wer', '}']
```

Result from this tokenizer:
```
['▁SELECT', '▁?answer', '▁WHERE', '▁{', '▁wd:Q8', '259', '46', '▁wdt:P371', '▁?X', '▁.', '▁?X', '▁wdt:P2048', '▁?answer', '}']
```

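To make the comparison concrete, the two token lists above can be checked in plain Python. This is only a sketch and not part of the tokenizer: `detokenize` is a hypothetical helper, and it assumes the usual SentencePiece convention that `▁` marks a preceding space.

```python
# Token lists copied verbatim from the example outputs above.
t5_tokens = [
    '▁', 'SEL', 'ECT', '▁', '?', 'ans', 'wer', '▁W', 'HER', 'E', '▁', '{',
    '▁', 'w', 'd', ':', 'Q', '82', '59', '46', '▁', 'w', 'd', 't', ':', 'P',
    '37', '1', '▁', '?', 'X', '▁', '.', '▁', '?', 'X', '▁', 'w', 'd', 't',
    ':', 'P', '20', '48', '▁', '?', 'ans', 'wer', '}',
]
sparql_tokens = [
    '▁SELECT', '▁?answer', '▁WHERE', '▁{', '▁wd:Q8', '259', '46',
    '▁wdt:P371', '▁?X', '▁.', '▁?X', '▁wdt:P2048', '▁?answer', '}',
]

query = "SELECT ?answer WHERE { wd:Q825946 wdt:P371 ?X . ?X wdt:P2048 ?answer}"


def detokenize(tokens):
    """Hypothetical helper: join the pieces and turn '▁' back into spaces."""
    return "".join(tokens).replace("▁", " ").strip()


# Both segmentations spell out the same query text.
same_text = detokenize(t5_tokens) == detokenize(sparql_tokens) == query

# The SPARQL tokenizer needs far fewer tokens for this query.
print(same_text, len(t5_tokens), len(sparql_tokens))  # True 49 14
```

For this query the T5 tokenizer produces 49 tokens and this tokenizer produces 14, mostly because keywords, variables, and prefixed properties such as `wdt:P371` stay in one piece.
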
# How to use

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("InfAI/sparql-tokenizer")

# Split a query into tokens
tokenizer.tokenize("SELECT ?answer WHERE { wd:Q825946 wdt:P371 ?X . ?X wdt:P2048 ?answer}")
```

```
['▁SELECT', '▁?answer', '▁WHERE', '▁{', '▁wd:Q8', '259', '46', '▁wdt:P371', '▁?X', '▁.', '▁?X', '▁wdt:P2048', '▁?answer', '}']
```

```python
# Encode a query into model inputs (input IDs, token type IDs, attention mask)
tokenizer("SELECT ?answer WHERE { wd:Q825946 wdt:P371 ?X . ?X wdt:P2048 ?answer}")
```

```
{'input_ids': [441, 444, 431, 422, 606, 1388, 720, 1791, 456, 418, 456, 3657, 444, 185], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
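
The two outputs above can be lined up to see which ID each token received. The sketch below uses only the values printed above and assumes the IDs correspond one-to-one with the tokens, which their equal lengths suggest; the variable names are illustrative. With the tokenizer loaded, `tokenizer.convert_ids_to_tokens` should give the same pairing directly.

```python
# Tokens and input IDs copied verbatim from the outputs above.
tokens = [
    '▁SELECT', '▁?answer', '▁WHERE', '▁{', '▁wd:Q8', '259', '46',
    '▁wdt:P371', '▁?X', '▁.', '▁?X', '▁wdt:P2048', '▁?answer', '}',
]
input_ids = [441, 444, 431, 422, 606, 1388, 720, 1791, 456, 418, 456, 3657, 444, 185]

# Assumption: one ID per token, in the same order.
token_to_id = dict(zip(tokens, input_ids))

# Print each distinct token with its ID; repeated tokens appear once.
for token, token_id in token_to_id.items():
    print(f"{token_id:>5}  {token}")
```

Repeated tokens such as `▁?answer` and `▁?X` map to the same ID each time they occur, so the 14 tokens use 12 distinct IDs.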