Avoid INTERNAL_SERVER_ERROR in MLFlow UI caused by timeouts

MLFlow can be very slow, especially if you use the default storage method (plain folders and files in the file system) rather than a database backend. With more than a few runs in an experiment, the web interface slows down noticeably, and load times of several minutes are easy to reach once an experiment holds 100 or more runs.

The MLFlow UI uses gunicorn as its web server internally. If you see INTERNAL_SERVER_ERROR after the page has been loading for a minute or two, raising the gunicorn timeout can resolve the problem. You can set a new timeout like this:

GUNICORN_CMD_ARGS="--timeout 600" mlflow ui -h 127.0.0.1 -p 1234

This sets the timeout to 10 minutes (600 seconds), which should be enough in most cases. Depending on how many runs you have, though, you might have to set it even higher. Of course, waiting this long is very annoying, and if you access the UI often, it can really block your work.
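If you find yourself adjusting the timeout repeatedly, a small shell function can make the value a parameter. This is only a sketch: the function name mlflow_ui_with_timeout is made up for illustration, and the host and port are simply the ones from the command above.

```shell
# Hypothetical helper: start the MLFlow UI with a configurable gunicorn timeout.
# $1: timeout in seconds (defaults to 600, the value used above)
mlflow_ui_with_timeout() {
    seconds="${1:-600}"
    # GUNICORN_CMD_ARGS is set only for this one mlflow invocation.
    GUNICORN_CMD_ARGS="--timeout ${seconds}" mlflow ui -h 127.0.0.1 -p 1234
}
```

Calling mlflow_ui_with_timeout 1800 would then start the UI with a 30 minute timeout, while calling it without an argument keeps the 10 minutes from above.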

A better solution is probably to use a database as the storage backend (e.g. SQLite). The root cause of the slow UI is that MLFlow has to iterate through the experiment folder, go into each run folder, then into each of the metrics, params, artifacts, etc. folders, and open a text file for each item stored in them. I’ll publish a comparison between the two methods in the next few days.
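To get a feeling for how many files are involved, you can count them per run. The function below is a rough sketch, not something MLFlow ships: the name count_run_files is made up, and it assumes the layout described above, where an experiment folder (for example mlruns/0) contains one subfolder per run.

```shell
# Hypothetical helper: print "<run folder> <number of files>" for every run
# in one experiment folder of the file-based store.
# $1: path to an experiment folder, e.g. mlruns/0 (assumed layout)
count_run_files() {
    exp_dir="$1"
    for run_dir in "$exp_dir"/*/; do
        # Skip the unexpanded pattern when the folder has no subfolders.
        [ -d "$run_dir" ] || continue
        # Count every regular file below the run folder (metrics, params, ...).
        n=$(find "$run_dir" -type f | wc -l | tr -d ' ')
        printf '%s %s\n' "$(basename "$run_dir")" "$n"
    done
}
```

Running count_run_files mlruns/0 prints one line per run. Multiply a typical count by the number of runs and you have a rough idea of how many small files have to be opened to render one experiment page. For the database route, mlflow ui accepts a --backend-store-uri option, so the command above would become something like mlflow ui --backend-store-uri sqlite:///mlflow.db -h 127.0.0.1 -p 1234, where mlflow.db is just a placeholder file name.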
