Boris Orekhov

nevmenandr

https://nevmenandr.github.io/portfolio/

AI & ML interests

Natural Language Processing, Poetry Generation, Linguistics, Low-resource languages

Recent Activity

posted an update about 1 month ago

nevmenandr/char-based-lstm-russian-poetry-pasternak

🧠 LSTM Language Model Visualization: A Deep Dive into Char-RNN

📊 Model Architecture at a Glance

- Model Type: 5-layer LSTM
- Hidden Size: 512
- Vocabulary: 137 characters
- Sequence Length: 50
- Total Parameters: ~9.8 million
- Training: 50 epochs, 10,750 iterations
- Final Validation Loss: 1.1266
- The model learned to generate Pasternak-style poetry - pretty impressive for a char-rnn!
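The parameter total above can be sanity-checked with a little arithmetic. This is a rough sketch, not the author's training code: it assumes one-hot character input (so the first layer sees 137 inputs), one bias vector per gate, and a final linear layer projecting back to the vocabulary. The function name `lstm_param_count` is made up for illustration.

```python
def lstm_param_count(vocab_size, hidden_size, num_layers):
    """Count weights of a stacked LSTM plus an output projection.

    Assumptions (not confirmed by the post): one-hot input,
    a single bias per gate, and a linear softmax layer on top.
    """
    total = 0
    input_size = vocab_size  # one-hot characters feed the first layer
    for _ in range(num_layers):
        # 4 gates, each with weights over [input; hidden] plus a bias
        total += 4 * (hidden_size * (input_size + hidden_size) + hidden_size)
        input_size = hidden_size  # deeper layers read the layer below
    # output projection: hidden state -> scores over the vocabulary
    total += hidden_size * vocab_size + vocab_size
    return total


print(lstm_param_count(vocab_size=137, hidden_size=512, num_layers=5))  # 9798281
```

Under these assumptions the count lands at about 9.8 million, which matches the figure in the list. A framework that keeps two bias vectors per gate would add roughly ten thousand more parameters and still round to the same number.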

🎨 The Beautiful Mess

Check out this heatmap visualization - it's like a Persian carpet! 🏠✨

- Each gate has its own patterns:
  - Input Gate: Controls what new info enters the cell
  - Forget Gate: Decides what to discard
  - Cell Gate: Creates new candidate values
  - Output Gate: Determines what to output
- The weights show beautiful structured patterns - different gates learned distinct strategies for processing text.

https://huggingface.co/papers/2306.02771
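The four gates described in this post can be sketched as a single LSTM time step. This is a minimal pure-Python illustration of the standard LSTM equations, not the model's actual implementation. The names `lstm_step` and `init_params` and the tiny sizes in the demo are assumptions chosen for readability.

```python
import math
import random

GATES = ("input", "forget", "cell", "output")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_params(input_size, hidden_size, scale=0.1, seed=0):
    """Random weights and zero biases for each gate (hypothetical helper)."""
    rng = random.Random(seed)
    W = {g: [[rng.uniform(-scale, scale) for _ in range(input_size + hidden_size)]
             for _ in range(hidden_size)] for g in GATES}
    b = {g: [0.0] * hidden_size for g in GATES}
    return W, b


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: returns the new hidden state h and cell state c."""
    z = list(x) + list(h_prev)  # concatenated [x; h_prev]
    pre = {g: [a + bias for a, bias in zip(matvec(W[g], z), b[g])] for g in GATES}
    i = [sigmoid(v) for v in pre["input"]]     # what new info enters the cell
    f = [sigmoid(v) for v in pre["forget"]]    # what to discard from the old cell
    g = [math.tanh(v) for v in pre["cell"]]    # new candidate values
    o = [sigmoid(v) for v in pre["output"]]    # what part of the cell to expose
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


# Tiny demo: a 4-symbol one-hot input and 3 hidden units.
W, b = init_params(input_size=4, hidden_size=3)
h, c = lstm_step([0.0, 1.0, 0.0, 0.0], [0.0] * 3, [0.0] * 3, W, b)
print(len(h), len(c))
```

In a heatmap of such a layer, each of the four weight matrices in `W` would appear as its own block, which is one way to read the per-gate patterns mentioned above.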

posted an update about 2 months ago

🔥 New Russian Stylometry Dataset!

Russian Stylometric Dataset (RSD) — 322 texts from the 19th – early 20th centuries (16 million words), prepared for analysis in stylo (R) and machine learning (Python).

📚 What's inside?

Fiction, journalism, scientific texts, drama, poetry

Grouped by author, gender, age, genre, literary movements (Romanticism/Realism)

Character speech (Tolstoy, Gogol, Ostrovsky)

Generated texts (LSTM, GPT)

📊 Use cases: authorship attribution, clustering, classification, benchmarking methods.

🔓 Public domain + GPL-3.0 license.

👉 Learn more: https://github.com/nevmenandr/RSD

DOI: 10.5281/zenodo.20701309
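For the authorship attribution use case on the Python side, here is a minimal sketch of Burrows's Delta, a standard stylometric distance, using only the standard library. It is an illustration under stated assumptions, not part of RSD: the function names are invented, the toy strings are placeholders rather than dataset texts, and the z-scores use the population standard deviation.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase word tokens; \\w also matches Cyrillic letters in Python 3."""
    return re.findall(r"\w+", text.lower())


def relative_freqs(tokens, vocab):
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in vocab]


def burrows_delta(known, unknown, n_mfw=50):
    """Return {label: Delta} between each known text and the unknown text.

    known: dict mapping a label (e.g. an author) to a text.
    n_mfw: how many most frequent words of the known corpus to use.
    """
    tokens = {label: tokenize(text) for label, text in known.items()}
    corpus = Counter()
    for toks in tokens.values():
        corpus.update(toks)
    vocab = [w for w, _ in corpus.most_common(n_mfw)]
    freqs = {label: relative_freqs(toks, vocab) for label, toks in tokens.items()}

    n = len(freqs)
    means = [sum(f[j] for f in freqs.values()) / n for j in range(len(vocab))]
    stds = [math.sqrt(sum((f[j] - means[j]) ** 2 for f in freqs.values()) / n)
            for j in range(len(vocab))]
    keep = [j for j, s in enumerate(stds) if s > 0]  # drop constant words
    if not keep:
        raise ValueError("no word varies across the known texts")

    def zscores(f):
        return [(f[j] - means[j]) / stds[j] for j in keep]

    zu = zscores(relative_freqs(tokenize(unknown), vocab))
    return {label: sum(abs(a - b) for a, b in zip(zscores(f), zu)) / len(keep)
            for label, f in freqs.items()}


# Placeholder texts, purely hypothetical.
known = {"A": "the cat and the dog and the bird " * 20,
         "B": "a cat or a dog or a bird " * 20}
deltas = burrows_delta(known, "the fish and the frog and the cat " * 10)
print(min(deltas, key=deltas.get))
```

The label with the smallest Delta is the candidate author. With real texts from the dataset one would typically use a few hundred most frequent words rather than the handful present in this toy example.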

posted an update about 2 months ago

https://huggingface.co/nevmenandr/char-based-lstm-russian-poetry-
https://huggingface.co/nevmenandr/char-based-lstm-russian-poetry-mandelshtam
nevmenandr/char-based-lstm-russian-poetry-hexameter

Identifying the style by a qualified reader on a short fragment of generated poetry (2306.02771)

📜 RNN vs. Transformers: How an Old Architecture Better Perceives Poetic Style

In the era of Transformer dominance, we often forget that old RNNs (especially character-level LSTMs) remain irreplaceable for tasks where *individual style*, rhythm, and micro-patterns matter. These three models are clear proof of that.

🎯 Why does this matter today?

- **Stylistic analysis**: RNNs better capture meter, repetitions, and unexpected tonal shifts.
- **Teaching poetics**: generating "almost correct" but hallucinating lines helps explore the boundaries of style.
- **Nostalgia and replication**: a reminder that not everything is measured by BLEU and perplexity.

🖼️ Visualization

Attached is an infographic comparing the three models (architecture, style, generation sample).

> RNNs aren't dead. They're just writing poetry in silence.

posted an update about 2 years ago

nevmenandr/w2v-chess

import gensim
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the Word2Vec model trained on White's moves.
model = gensim.models.Word2Vec.load('white_moves.model')

# Keep only the tokens that start with '->'.
# gensim >= 4 exposes the vocabulary as key_to_index (gensim 3: model.wv.vocab).
moves = [k for k in model.wv.key_to_index if k.startswith('->')]

# Look up the vectors of the selected tokens, in the same order as the labels.
X = model.wv[moves]

# Project the vectors onto the first two principal components.
pca = PCA(n_components=2)
Y = pca.fit_transform(X)

# Scatter plot with one labelled point per move.
fig, ax = plt.subplots()
ax.plot(Y[:, 0], Y[:, 1], 'o')
ax.set_title('White moves')
for i, lb in enumerate(moves):
    ax.annotate(lb, xy=(Y[i, 0], Y[i, 1]))
plt.show()

biblically accurate angel

posted an update about 2 years ago

nevmenandr/incoming-students-ma-dh-hse-university
Dataset visualized by datawrapper

<div style="min-height:494px"><script type="text/javascript" defer src="https://datawrapper.dwcdn.net/q3waH/embed.js?v=2" charset="utf-8"></script><noscript><img src="https://datawrapper.dwcdn.net/q3waH/full.png" alt="" /></noscript></div>

Data from my talk in February: https://www.youtube.com/watch?v=ZfXqvIzl5fo. Slides: https://nevmenandr.github.io/slides/2024-02-02/slides.pdf

posted an update about 2 years ago

nevmenandr/w2v-russian-tolstoy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter magic: works only inside a notebook
%matplotlib inline

import seaborn as sns
sns.set_style("darkgrid")

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Word2Vec has to be imported by name for this call to resolve.
modelLNT2 = Word2Vec.load("cbow_300_10.model")

# skip some code... for full version see model's card
# (tsnescatterplot is not defined in this excerpt)

tsnescatterplot(modelLNT2, 'жизнь_S', [i[0] for i in modelLNT2.wv.most_similar(negative=["жизнь_S"])])

life by Tolstoy (w2v):

posted an update about 2 years ago

Playing with dhcloud/w2v-russian-19c-fiction-lemmas

import numpy as np
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

modell = Word2Vec.load("w2vlemmas.model")
keys = ['Шекспир', 'Пушкин', 'Гоголь', 'матрос', 'кот', 'роман']

# For every key word, collect its 30 nearest neighbours and their vectors.
embedding_clusters = []
word_clusters = []
for word in keys:
    embeddings = []
    words = []
    for similar_word, _ in modell.wv.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(modell.wv[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)

# Note: scikit-learn 1.5 renamed n_iter to max_iter; use max_iter on recent versions.
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)

# Flatten (n clusters, m words, k dims) to 2-D for t-SNE, then restore the cluster axis.
embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)

Novel is a different type of literature than Shakespeare and Pushkin
