# Monthly Archives: July 2012

## Keystroke Biometrics using Mathematica

A few weeks ago Paul-Jean Letourneau posted an article on Wolfram’s Blog about using Mathematica to collect and analyze keystroke metrics as a way to identify individuals. The article analyzes how you type: a little interactive widget measures the time intervals between individual keystrokes, collecting and visualizing the data while you repeatedly type the word “wolfram”.

Keystroke metrics of 50 trials typing the word “wolfram”

It is somewhat interesting at this point to analyze one’s own typing style. For example, there appears to be a bimodal distribution of the time intervals between keystrokes, with the sequence “r-a” taking me almost twice as long (~130ms) as most other sequences (~60-70ms). There is also a ‘learning’ effect visible in my 50 trials, where the speed improves noticeably after about 20 repetitions. However, there are occasional relapses into a much slower typing pattern throughout the rest of the trials.

However, what I thought was more interesting is the subsequent analysis the author did across a set of 42 such series he obtained from his colleagues (noting humorously that “it just so happens that Wolfram is a company full of data nerds”). He then proceeds to analyze and visualize that data in various ways.

Distribution Histogram of keystroke intervals

He observes the bimodal nature of the distribution, with peaks around 75ms and 150ms for different pairs of characters. Averaging over all those pair typing times also reveals a correlation indicating that when people type more slowly, they are more consistent.

(Negative) Correlation of pairwise typing speed and consistency

The analysis continues with the observation that each measurement can be seen as a point in a six-dimensional space (six pair-transitions in a word with seven characters). When a person types the same word 50 times, you get a cluster of 50 points in six-dimensional space. Different individuals will produce different clusters, so one can use the built-in function FindClusters to determine such clusters. However, since people type with a certain amount of inconsistency, one person’s typing will sometimes show up in another person’s cluster and vice versa. To measure how well the clusters distinguish individuals, one can implement various measures. The author implements the Rand index, a measure of the similarity between two data clusterings, which yields an accuracy score on a scale from 0 to 1 for the ability to distinguish between a pair of people. He then looks across all pairs of the 42 people. (There are 42·41/2 = 861 distinct pairs, but he chose to look at all 42·42 = 1764 pairs, since the FindClusters results depend on the order of the input data, so Rand[i,j] may differ from Rand[j,i].) This yields the following histogram of Rand quality scores:
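The Rand index itself is straightforward to compute: for every pair of data points, check whether the two clusterings agree on putting them in the same cluster or in different ones, and take the fraction of agreeing pairs. A minimal Python sketch with toy labels (not the article's data or code):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree:
    both put the pair in the same cluster, or both in different ones."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)

# Two people typing 3 trials each: the true grouping vs. a clustering
# that misassigned one trial (all labels are made up for illustration)
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]
print(rand_index(truth, truth))  # 1.0
print(rand_index(truth, found))  # 10 of 15 pairs agree, ~0.667
```

A score of 1.0 means the two clusterings are identical up to relabeling; scores near 0.5 mean the clustering does little better than a coin flip at separating the two typists.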

Histogram of Rand quality score for all pairs

This clearly shows that keystroke metrics for a single word are not sufficient to reliably distinguish between arbitrary pairs of people: the average quality score is only 0.67. On the other hand, about 400 (~23%) of those quality scores are a perfect 1.0, so for roughly a quarter of the pairs this one word alone would suffice to reliably tell the two typists apart. About half as many scores are 0.0, meaning the clusters overlap so much that no distinction is possible. The remaining scores fall mostly between 0.5 and 1.0, meaning you would merely guess right more often than wrong.

The author wraps up the post with this paragraph:

Using this fun little typing interface, I feel like I actually learned something about the way my colleagues and I type. The time to type two letters with the same finger on the same hand takes twice as long as with different fingers. The faster you type, the more your typing speed will fluctuate. The more your typing speed fluctuates, the harder it will be to distinguish you from another person based on your typing style. Of course we’ve really just scratched the surface of what’s possible and what would actually be necessary in order to build a keystroke-based authentication system. But we’ve uncovered some trends in typing behavior that would help in building such a system.

An interactive CDF widget embedded in the article allows you to collect and visualize the timing of your own typing. The source code as well as the test data are also shared, in case you want to explore the details of this interesting analysis further.


Posted on July 20, 2012 in Linguistic, Scientific

## London Tube Map and Graph Visualizations

The previous post on Tube Maps has quickly risen into the Top 3 posts in the view stats. Perhaps this is due to many people searching Google for images of the original London tube map in the context of the upcoming Olympic Games.

I recently reviewed some of the classes in Wolfram’s free Data Science course. If you are interested in Data Science, this is excellent material. And if you use Mathematica, you can download the underlying code and play with the materials.

It just so happens that in the notebook for the Graphs and Networks: Concepts and Applications class there is a graph object for the London subway.

Mathematica Graph object for the London subway

As previously demonstrated in our post on world country neighborhood relationships, Mathematica’s graph objects are fully integrated into the language and there are powerful visualization and analysis functions.

For example, this graph has 353 vertices (stations) and 409 edges (subway connections). This one line of code highlights all stations no more than 5 stops away from the Waterloo station:

```HighlightGraph[london,
NeighborhoodGraph[london, "Waterloo", 5]]```

Neighborhood Graph 5 around Waterloo
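NeighborhoodGraph is built in, but the underlying idea is simply a breadth-first search cut off after k hops. A minimal Python sketch on a made-up toy line (station names and adjacencies are illustrative, not real tube data):

```python
from collections import deque

def stations_within(graph, start, k):
    """Breadth-first search: all vertices at most k edges from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if dist[v] == k:        # don't expand beyond the k-hop frontier
            continue
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)

# Toy line of five stations: A - B - C - D - E
toy = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
       "D": ["C", "E"], "E": ["D"]}
print(stations_within(toy, "A", 2))  # the stations A, B and C
```

HighlightGraph then only has to restyle the returned vertex set within the full graph.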

Since HighlightGraph and NeighborhoodGraph are built-in functions, this takes just a single line of code. Growing the neighborhood radius step by step, the following expression

```Export["london.gif",
Table[HighlightGraph[london,
NeighborhoodGraph[london, "King's Cross St. Pancras", k]],
{k, 0, 20, 1}]]```

creates this animated GIF file:

Paths spreading out from the center

Shortest paths can easily be determined and visualized:

```HighlightGraph[london,
FindShortestPath[london, "Amersham", "Woolwich Arsenal"]]```

A shortest path example

There are many other graph functions such as:

```GraphDiameter[london]    (* 39 *)
GraphCenter[london]       (* "King's Cross St. Pancras" *)
GraphPeriphery[london]    (* {"Watford Junction", "Woodford"} *)```

In other words, the King’s Cross St. Pancras station is at the center of the network, its distance to the farthest stations (the graph radius) is 20, and the diameter of 39 is the longest shortest path in the network, running between Watford Junction and Woodford.
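All three quantities derive from shortest-path distances: a vertex’s eccentricity is its distance to the farthest other vertex, the diameter is the maximum eccentricity, the center collects the vertices of minimum eccentricity, and the periphery those of maximum eccentricity. A minimal pure-Python sketch on a made-up four-station path (not real tube data):

```python
from collections import deque

def distances_from(graph, start):
    """BFS shortest-path distances from start to every reachable vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def eccentricity(graph, v):
    return max(distances_from(graph, v).values())

# Toy path graph: A - B - C - D
toy = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
ecc = {v: eccentricity(toy, v) for v in toy}
diameter = max(ecc.values())                         # 3 (A to D)
radius = min(ecc.values())                           # 2
center = [v for v in toy if ecc[v] == radius]        # B and C
periphery = [v for v in toy if ecc[v] == diameter]   # A and D
```

Mathematica’s GraphDiameter, GraphCenter, and GraphPeriphery package exactly these definitions (with much faster internals).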

Let’s look at distances within the graph. The built-in function GraphDistanceMatrix calculates all pairwise distances between any two stations:

`mat = GraphDistanceMatrix[london]; MatrixPlot[mat]`

Graph Distance Matrix Plot

For the 353*353 = 124,609 pairs of stations, let’s plot a histogram of the pairwise distances:

`Histogram[Flatten[mat]]`

Graph Distance Histogram

The average distance between two stations in the London subway system is about 14.
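For readers without Mathematica, the same statistic is easy to compute by hand: run a breadth-first search from every vertex and average the resulting pairwise distances. A small Python sketch on a toy 5-station circle line (illustrative only, not tube data):

```python
from collections import deque
from statistics import mean

def distances_from(graph, start):
    """BFS shortest-path distances from start to every reachable vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

# Toy 5-cycle: A - B - C - D - E - A
toy = {"A": ["B", "E"], "B": ["A", "C"], "C": ["B", "D"],
       "D": ["C", "E"], "E": ["D", "A"]}

# Distance matrix flattened to a list, excluding the zero diagonal
pairwise = [d for v in toy
            for u, d in distances_from(toy, v).items() if u != v]
print(mean(pairwise))  # 1.5
```

Note that the article’s Flatten[mat] includes the 353 zero diagonal entries, which nudges the reported average slightly downward; on a 353-station network the effect is small.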

So far, very little coding has been required, since we have used built-in functions. Of course, the set of functions can easily be extended. One interesting aspect is the notion of centrality, i.e. how central a node is within the graph. This is expressed in the built-in function ClosenessCentrality:

```cc = ClosenessCentrality[london];
HighlightCentrality[g_, cc_] :=
HighlightGraph[g,
Table[Style[VertexList[g][[i]],
ColorData["TemperatureMap"][cc[[i]]/Max[cc]]],
{i, VertexCount[g]}]];
HighlightCentrality[london, cc]```

Color coded Centrality Map

Another interesting notion is that of BetweennessCentrality, which is a measure indicating how often a particular node lies on the shortest paths between all node-pairs. The following nifty little snippet of code identifies the 10 most traversed stations – along the shortest paths – of the London underground:

```HighlightGraph[london,
 First /@ SortBy[
    Transpose[{VertexList[london],
      BetweennessCentrality[london]}], Last][[-10 ;;]]]```

10 most traversed stations
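Betweenness centrality can be computed by brute force on small graphs: enumerate all shortest paths between every ordered pair of vertices and credit each interior vertex with its share. A rough Python sketch on a made-up star graph (production implementations use Brandes’ algorithm, which is far faster):

```python
from collections import deque
from itertools import permutations

def shortest_paths(graph, s, t):
    """All shortest paths from s to t, via BFS with parent tracking."""
    dist = {s: 0}
    parents = {s: []}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                parents[w] = [v]
                queue.append(w)
            elif dist[w] == dist[v] + 1:
                parents[w].append(v)   # another equally short route
    def build(v):
        if v == s:
            return [[s]]
        return [p + [v] for u in parents[v] for p in build(u)]
    return build(t) if t in dist else []

def betweenness(graph):
    """Credit each interior vertex with its share of shortest paths."""
    score = {v: 0.0 for v in graph}
    for s, t in permutations(graph, 2):
        paths = shortest_paths(graph, s, t)
        for path in paths:
            for v in path[1:-1]:       # endpoints don't count
                score[v] += 1 / len(paths)
    return score

# Toy star: hub H connected to A, B, C -- every A/B/C trip crosses H
toy = {"H": ["A", "B", "C"], "A": ["H"], "B": ["H"], "C": ["H"]}
bt = betweenness(toy)
print(max(bt, key=bt.get))  # H
```

Sorting the resulting scores and taking the top entries is exactly what the Mathematica one-liner above does with SortBy and Part.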

I have often felt that progress in computer science and in programming languages comes from raising the level of abstraction. It’s amazing how much analysis and visualization one can do in Mathematica with very little coding, thanks to the large number of powerful built-in functions. The reference documentation of these functions often has many useful examples (and is also available for free on the web).

When I graduated from college 20 years ago, we didn’t have such powerful language platforms. Implementing an algorithm for finding shortest paths is a good exercise for a college-level computer science course. And even when such pre-built functions exist, it may still be instructive to figure out how to implement such algorithms.

As a manager I have always encouraged my software engineers to spend a certain fraction of their time searching for built-in functions or otherwise pre-existing code to speed up project implementation. Bill Gates has been quoted as saying:

“There is only one trick in software: Use a piece of code that has already been written.”

Among software engineers, it is well known that productivity varies not just by small factors, but by orders of magnitude. A handful of talented and motivated engineers with the right tools can outperform staffs of hundreds at large companies. I believe the increasing levels of abstraction and computational power of platforms such as Mathematica further amplify this trend and the resulting disparity in productivity.


Posted on July 11, 2012 in Education, Recreational