NLP – Simon's blog

Resources to learn how to use LLMs (Large Language Models)

August 19, 2023September 15, 2023 Simon Leave a comment

Weights & Biases: Building LLM powered apps

This is a course by Weights & Biases that explains the basics of tokenization and how to build apps using LLM APIs: https://www.wandb.courses/courses/building-llm-powered-apps

Useful notebooks: https://github.com/wandb/edu/tree/971ef92ee35ecaaf6b5a4902d11804540af60879/llm-apps-course/notebooks

Example chatbot implementation that can answer questions about documentation in form of Markdown files: https://github.com/wandb/edu/tree/971ef92ee35ecaaf6b5a4902d11804540af60879/llm-apps-course/src

Coursera / deeplearning.ai – Generative AI with Large Language Models

This course costs 39 pounds but is worth the price. It covers many topics of the LLM lifecycle in three weeks of videos, labs and quizzes. Lots of papers and articles for deeper knowledge are also provided!

Topics that are covered are:

Transformer architectures (encoder-decoder, encoder only, decoder only) and their use cases
Pre-training for the different transformer variants
Fine-tuning (Parameter-efficient fine-tuning (PEFT), Low rank adaptation (LoRA), prompt tuning)
Distributed training
Quantization
Model evaluation
Reinforcement learning from human feedback (RLHF)
Program-aided language models (PAL)
and others

You can enroll here: https://www.coursera.org/learn/generative-ai-with-llms/

Lemmatize whole sentences with Python and nltk’s WordNetLemmatizer

June 29, 2018July 2, 2018 Simon Leave a comment

Lemmatization is the process of converting words (e.g. in a sentence) to their stemming while respecting their context. For example, the sentence “You are not better than me” would become “You be not good than me”. This is useful when dealing with NLP preprocessing, for example to train doc2vec models. The python module nltk.stem contains a class called WordNetLemmatizer. In order to use it, one must provide both the word and its part-of-speech tag (adjective, noun, verb, …) because lemmatization is highly dependent on context. Read More

NLP: Approaches for Sentence Embeddings (Overview)

May 23, 2018May 23, 2018 Simon Leave a comment

In 2013, Mikolov et. al published ‘Distributed Representations of Words and Phrases and their Compositionality‘, a paper about a new approach to represent words by dense vectors. This was an improvement over the alternative, representing words as one-hot vectors, as these dense vector embeddings encode some meaning of the words they represent. In other terms, words with similar meaning are be close to each other in the vector space of the embedding. For example, “blue” would be close to “red” but far from “cat”. A commonly used name for their approach is word2vec.