Calculating the drunkenness level of an IRC user based on mistakes in their typing
July 2, 2024
Here is a recent procrastination project of mine. One of the symptoms of alcohol intoxication is poor coordination, and it shows when people touch type. A touch typist usually settles on a specific typing speed: in normal conditions it's the fastest speed that doesn't produce mistakes. When alcohol creeps into the circulatory system, two things can happen: the person either keeps typing at their original speed and makes progressively more mistakes, or slows down. The latter is hard to analyse on IRC channels due to the static nature of the chat (the idea would be to measure the time between messages, but we can't tell whether a person is typing or thinking), but I recently had the pleasure of seeing the former case in action. In this article I will analyse the mistakes in the typing of a drunk user and try to come to some conclusions.
Data
First I copy the logs from irssi to a working directory and cut out most of them, leaving the interesting part. Messages in the log have the following format:
10:10 < user> message
Then I cut out the lines corresponding to our subject and the times they were posted:
grep '< subject>' '#channel.log' | cut -d ' ' -f 3- > raw.dat
grep '< subject>' '#channel.log' | cut -d ' ' -f 1 > times.dat
Next I copy the raw message data to fixed.dat and manually fix all mistakes. I tried using the Python autocorrect package, but the results were unsatisfactory. I avoided fixing mistakes that look intentional, e.g. "kewl" wouldn't be fixed to "cool". Anonymized data used in the analysis can be found here.
Basic analysis
The script used for the analysis can be found here. First we load the time data from times.dat and convert it to minutes. For each line in raw.dat and fixed.dat we find the number of single-character differences using difflib and divide that by the number of characters in the line. We then calculate a rolling average on that data, because the number of mistakes per line is highly variable. An averaging radius of 20 was found to work best. This data is then plotted and the resulting figure is shown below.
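For reference, a minimal sketch of that step (the opcode-based difference count and the exact loading code are my assumptions; the linked script may differ):

import difflib
import numpy as np

def char_errors(raw, fixed):
    # Count single-character differences between a raw and a corrected line
    ops = difflib.SequenceMatcher(None, raw, fixed).get_opcodes()
    return sum(max(i2 - i1, j2 - j1) for tag, i1, i2, j1, j2 in ops if tag != "equal")

raw_lines = open("raw.dat").read().splitlines()
fixed_lines = open("fixed.dat").read().splitlines()
# "HH:MM" -> minutes since midnight
minutes = [int(h) * 60 + int(m)
           for h, m in (t.split(":") for t in open("times.dat").read().splitlines())]

# Mistakes per character for every message
ratio = np.array([char_errors(r, f) / max(len(f), 1)
                  for r, f in zip(raw_lines, fixed_lines)])

# Rolling average with radius 20 to smooth the highly variable per-line data
radius = 20
level = np.array([ratio[max(i - radius, 0):i + radius + 1].mean()
                  for i in range(len(ratio))])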

This data can be thought of as a "drunkenness level", which depends on the blood alcohol level. Next we will assume that dependency to be proportional.
(Im)proper model
Say we want to extract the amount of consumed alcohol with respect to time. Ignoring the influence of ethanol metabolism products and assuming a constant rate of elimination (which might not be correct [Lewis_1986]), we will describe the relationship between the drunkenness d(t) and the consumption profile c(t) as the following convolution:

d(t) = (c ∗ r)(t) = ∫ c(τ) r(t − τ) dτ,

where r(t) is the response function to some delta-like alcohol consumption profile. Thus the actual consumption profile can be found by deconvolution:

c(t) = F⁻¹[ d̃(ω) / r̃(ω) ],

where the tilde signifies a Fourier transform.
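In code, the deconvolution boils down to a division in the frequency domain. A minimal sketch with numpy (it assumes d and r are sampled on the same uniform time grid; the small eps regularizer is my addition):

import numpy as np

def deconvolve_fft(d, r, eps=1e-12):
    # c̃(ω) = d̃(ω) / r̃(ω), with eps guarding against division by zero
    d_tilde = np.fft.rfft(d)
    r_tilde = np.fft.rfft(r, n=len(d))  # zero-pad the response to the signal length
    return np.fft.irfft(d_tilde / (r_tilde + eps), n=len(d))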
Now we need a proper distribution modelling the drunkenness over time in response to a single drinking event. The best way to get that would be to get our subject to drink a considerable amount of alcohol in one go and then keep typing stuff for a few hours – not realistically doable. The other way is to assume the drunkenness to be proportional to the blood alcohol level. It's still quite hard to find raw data on that. There is some data in the plots here [Jones_1984] and here [Mitchell_2014]. In the end I used this, writing down values from the plot. I also found an interesting blog, but I didn't use content from there.
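I can't reproduce the digitized values here, but a hypothetical response function with the same qualitative shape (quick absorption, then roughly constant-rate elimination) might look like the sketch below; the time constants are placeholders, not values taken from the cited plots:

import numpy as np

def response(t, t_rise=30.0, t_clear=300.0):
    # Response (arbitrary units) t minutes after a single drinking event:
    # linear rise over t_rise minutes, then linear elimination over t_clear minutes
    t = np.asarray(t, dtype=float)
    rising = np.clip(t / t_rise, 0.0, 1.0)
    falling = np.clip(1.0 - (t - t_rise) / t_clear, 0.0, 1.0)
    return np.where(t < t_rise, rising, falling) * (t >= 0)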
We will not implement the deconvolution ourselves, but use the implementation from scipy. For that we need to add a tail to our distribution: at some point our subject went to sleep and stopped sending gibberish messages. We will assume that their blood alcohol level decreased as in the model, and add the normalized tail starting from the maximum. Also, to reduce upcoming problems with the deconvolution, I tried fitting a spline over the data, again using scipy. The result is shown below.
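Roughly, that step might look like the following (the variable names are mine: minutes and level come from the basic analysis, r is the response function sampled on the same grid with the tail appended, and the spline smoothing factor is a guess):

import numpy as np
from scipy import signal
from scipy.interpolate import UnivariateSpline

# Smooth the noisy drunkenness curve with a spline before deconvolving
spline = UnivariateSpline(minutes, level, s=len(level) * np.var(level))
d = spline(minutes)

# scipy.signal.deconvolve divides the two sequences as polynomials;
# the quotient is the estimated consumption profile
consumption, remainder = signal.deconvolve(d, r)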

As you can see, that didn't work. Deconvolution seems to be very sensitive to noise. It turns out I'm not the only one with this problem.
This project was a partial failure. I'm pretty sure it's possible to extract the amount of alcohol consumed over time from this data. If you know how to solve that problem, please contact me and I'll edit this article :).
Possible solutions:
• Implement our own deconvolution, maybe filtering out high frequencies in the Fourier transform to avoid those spikes.
• Use this. I'm not sure what it does, but apparently it's epic.
• Use a discrete decomposition into finitely many response functions using simple fitting tools (see the sketch after this list).
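A rough sketch of that last idea, assuming a fixed grid of candidate drinking-event times and the hypothetical response() function from above (both are assumptions of mine):

import numpy as np
from scipy.optimize import nnls

def fit_events(t, d, event_times, response):
    # Each column is the response to a unit drink at one candidate event time;
    # nnls finds non-negative drink sizes whose summed responses best match d(t)
    A = np.column_stack([response(t - t0) for t0 in event_times])
    sizes, residual = nnls(A, d)
    return sizes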