Resources to learn how to use LLMs (Large Language Models)

Weights & Biases: Building LLM powered apps

This is a course by Weights & Biases that explains the basics of tokenization and how to build apps using LLM APIs:

Useful notebooks:

Example chatbot implementation that can answer questions about documentation in form of Markdown files:

Coursera / – Generative AI with Large Language Models

This course costs 39 pounds but is worth the price. It covers many topics of the LLM lifecycle in three weeks of videos, labs and quizzes. Lots of papers and articles for deeper knowledge are also provided!

Topics that are covered are:

  • Transformer architectures (encoder-decoder, encoder only, decoder only) and their use cases
  • Pre-training for the different transformer variants
  • Fine-tuning (Parameter-efficient fine-tuning (PEFT), Low rank adaptation (LoRA), prompt tuning)
  • Distributed training
  • Quantization
  • Model evaluation
  • Reinforcement learning from human feedback (RLHF)
  • Program-aided language models (PAL)
  • and others

You can enroll here:

Avoid INTERNAL_SERVER_ERROR in MLFlow UI caused by timeouts

MLFlow can be very slow sometimes, especially if you are using the default storage method (plain folders and files in the file system) rather than a database backend. If you have more than just a few runs in an experiment, the web interface gets really slow. Load times of a few minutes can easily happen if you have 100 or more runs in an experiment.

MLFlow UI internally uses gunicorn as a webserver. Setting the timeout of gunicorn to a higher number can resolve the problem of seeing INTERNAL_SERVER_ERROR after the page loaded a minute or two. You can set a new timeout like this:

GUNICORN_CMD_ARGS="--timeout 600" mlflow ui -h -p 1234

This sets the timeout to 10 minutes (600 seconds) which should be enough time for most cases. However, depending on the number of runs you have, you might have to set it even higher. Of course this is very annoying and if you access the UI often, it really can block your work.

A better solution is probably to use a database as the storage backend (e.g. SQLite). The root problem that makes the UI so slow is that MLFlow needs to iterate through the experiment folder, go into each run folder, then go into each metrics, params, artifacts, etc. folders and then open text files for each item you have in them. I’ll publish a comparison between the two methods in the next days.

MLFlow + Optuna: Parallel hyper-parameter optimization and logging

Optuna is a Python library that allows to easily optimize hyper-parameters of machine learning models. MLFlow is a tool which can be used to keep track of experiments. In this post I want to show how to use them together: Use Optuna to find optimal hyper-parameters and MLFlow to keep track of each hyper-parameter candidate (Optuna trial).

I will create one MLFlow run for the overall Optuna study and one nested run for each trial. Trials will run in parallel. Using the default MLFlow fluent interface does not work properly when using multiple threads in parallel because you will see errors like this:

mlflow.exceptions.MlflowException: Changing param values is not allowed. Param with key=’x’ was already logged with value=’4.826018001260979′ for run ID=’664a3b7001b04fcdb132c351238a8cf4′. Attempted logging new value ‘4.799057323848487’.

This error is shown if you use the “standard mlflow approach”:

Read More

Python 3: Recursively print structured tree including hierarchy markers using depth-first search

Printing a tree in Python is easy if the parent-child relationship should not be visualized as well, i.e. just printing all nodes with an indentation that depends on the level within the tree.

To keep the code easy, let’s first define a simple tree structure by creating a Node class that holds a value x and can have an arbitrary number of child nodes:

class Node(object):
    def __init__(self, x, children=[]):
        self.x = x
        self.children = children

To print all nodes of a tree using depth-first search, only few lines are required:

def printTree(root, level=0):
    print("  " * level, root.x)
    for child in root.children:
        printTree(child, level + 1)

#tree = Node(..., children=[Node(...., ...), Node(...,....)] # See end of the article for a bigger structure that is used for the examples in this article.

However, the output can be hard to read. When the tree has more than a few levels, it is challenging to see the relationship between parent and child nodes. A definition of the following tree is given at the end of this article if you want to try it yourself. For now, just focus on the output:

Read More

Master’s thesis: Facial Landmark Detection and Shape Modeling using Neural Networks

I have written my master’s thesis as an exchange student at Carnegie Mellon University (CMU) in Pittsburgh, PA, USA. This was possible thanks to the CLICS exchange program that allows KIT students to visit partner universities. At CMU I was working with the MultiComp group which belongs to the Language Technologies Institute (LTI), part of the School of Computer Science.

My thesis aimed at improving the precition accuracy for the task of facial landmark detection. The English abstract:

Facial landmarks are distinctive points in human faces that are used for a variety of tasks such as facial expression analysis, lip reading or face recognition. The performance on these tasks depends heavily on the accuracy of the detected facial landmarks. It is challenging to accurately locate facial landmarks even on faces that are partially occluded by glasses, facial hair or other objects. In this work we introduce a new approach to tackle these challenges on unconstrained frontal and semi-frontal face images. The proposed solution is a new deep learning based algorithm that is built on the Stacked Hourglass Network which has proven to be effective for human pose estimation, a task similar to facial landmark detection. The algorithm processes face images by repeatedly down- and upsampling the image and thus analyzes it on multiple scales. The Stacked Hourglass Network is trained using Wing loss and regresses coordinates using a Differentiable Spatial To Numerical Transform. Our algorithm is able to outperform current state-of-the-art solutions on the
300-W and Menpo datasets in terms of the point-to-point normalized error. Additionally, a neural Point Distribution Model is employed as a shape model that refines the predictions made by the Stacked Hourglass Network. By adding the Point Distribution Model, the prediction error on the inner facial landmarks of the challenging test set of 300-W reduces
even more. The Point Distribution Model achieves the biggest improvements on the inner landmarks of faces with strong head poses while improving the predictions of landmarks on the outline is more challenging.

The whole thesis can be found as a PDF here. The code is available here. As this is research code, don’t expect proper documentation and very clean code. 😉

Advent of Code 2018 – 25 days of coding

On December 1st the 2018 edition of Advent of Code will start. For those who don’t know what Advent of Code is: It is a programming competition where the authors release one programming problem every day at midnight EST/UTC-5 (6.00 in Germany).

The difficulty of the problems varies every day and it’s mostly about developing algorithms based on detailed descriptions. If you’re interested in how problems look like, check AOC 2017. You can implement your solution in any language you prefer. You don’t submit your code, but only the response of your algorithm to an input that is given to you on the problem description (this input is different for every user, so you cannot just steal it from others).

Read More

Use inotifywait and rsync to automatically push code to a remote server without git (Tips for usage with PyCharm included)

I have written a little helper script that I use whenever I want to write code locally but run it remotely. This is for example useful when I cannot run the code locally because it needs one or more GPUs or is very computationally intensive.

One possibility would be to use git and push/pull each change manually. But this would obviously be too much effort for little changes (like typo fixes). Another alternative is to manually run rsync after each change. But as I am lazy, I want to run rsync automatically whenever any file in my project changes.

Read More

Why are precision, recall and F1 score equal when using micro averaging in a multi-class problem?

In a recent project I was wondering why I get the exact same value for precision, recall and the F1 score when using scikit-learn’s metrics. The project is about a simple classification problem where the input is mapped to exactly \(1\) of \(n\) classes. I was using micro averaging for the metric functions, which means the following according to sklearn’s documentation:

Calculate metrics globally by counting the total true positives, false negatives and false positives.

According to the documentation this behaviour is correct:

Note that for “micro”-averaging in a multiclass setting with all labels included will produce equal precision, recall and F, while “weighted” averaging may produce an F-score that is not between precision and recall.

After thinking about it a bit I figured out why this is the case. In this article, I will explain the reasons.

Read More

Lemmatize whole sentences with Python and nltk’s WordNetLemmatizer

Lemmatization is the process of converting words (e.g. in a sentence) to their stemming while respecting their context. For example, the sentence “You are not better than me” would become “You be not good than me”. This is useful when dealing with NLP preprocessing, for example to train doc2vec models. The python module nltk.stem contains a class called WordNetLemmatizer. In order to use it, one must provide both the word and its part-of-speech tag (adjective, noun, verb, …) because lemmatization is highly dependent on context. Read More

NLP: Approaches for Sentence Embeddings (Overview)

In 2013, Mikolov et. al published ‘Distributed Representations of Words and Phrases and their Compositionality‘, a paper about a new approach to represent words by dense vectors. This was an improvement over the alternative, representing words as one-hot vectors, as these dense vector embeddings encode some meaning of the words they represent. In other terms, words with similar meaning are be close to each other in the vector space of the embedding. For example, “blue” would be close to “red” but far from “cat”. A commonly used name for their approach is word2vec.

Read More