visualign


Circos Data Visualization How-to Book

Earlier this year we looked at a powerful data visualization tool called Circos, developed by Martin Krzywinski from the British Columbia Genome Science Center. The previous post showed how this tool can be used to display complex connectivity pathways in the human neocortex, so-called connectograms.

Circos Book Cover

The Circos tool can be used interactively on the above website. In that mode you upload jobs as tabular data and configuration files and have some limited control over the rendering of the resulting charts. For full expressive power and flexibility, Circos can also be downloaded freely and run on your own computer, with extensive customization control over the resulting charts.

I have been asked to review a new book titled “Circos Data Visualization How-to”, published by Packt Publishing. Its main goal is to guide you through the above download and installation process and get you started with Circos charts and their modification. Here is a brief review of this book.

Although originally developed for visualizing genomic data, Circos has been applied to many other complex data visualization projects, including in the social sciences. One such study was done by Tom Schenk, who analyzed the relationships between college majors and the professions those graduates ended up in. It appears that this work inspired the author to write this book to help others use Circos.

I downloaded the book in Kindle format and read it on the Mac due to the color graphics and the much larger screen size. It’s well structured and around 70 pages in printed form. The book focuses first on the download and install part, then has a series of examples from first chart to more complex ones using customization such as colors, ribbons, heat maps or dynamic binding.

Flow Chart for creation of Circos charts

Circos is essentially a set of Perl modules combined with the GD graphics library.

The first part is on installing Circos, with a chapter each on Windows 7 and on Linux or Mac OS. Working on a Mac, I went the latter route. I ended up right in the weeds, and it took me about 4 hours to get everything installed and working. The description is derived from a Linux install and is generally somewhat terse. It assumes you have all prerequisite tools installed on your Mac, or at least that you are savvy enough to figure out what’s missing and where to get it. I had to dust off some of my Unix skills and go hunting via Google for solutions to a list of install problems:

directory permissions (I needed to wrap the exact instructions with sudo)
installing Xcode tools from Apple for my platform (make was not preinstalled)
understanding cause of error messages (Google searches, Google group on Circos)
locating and installing the GD graphics library (helpful installing-circos-on-os-x tips by Paulo Nuin)
version and location issues (many libraries are in ongoing development; some sources have moved)

Others may find this part a lot easier, but I would say there should be an extra chapter for the Mac with tips and explanations for some of these speed bumps. On the plus side, the Google group seems to be very active and I found frequent and recent answers by Circos author Martin Krzywinski.

The next part of the book is easy to understand. One creates a simple hair-to-eye color relationship diagram. Then configuration files are introduced to customize colors and chart appearance. All required data and configuration files are also contained in the companion download from the Packt Publishing book page.

Chart of relationship between hair and eye colors
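A chart like this starts from a simple table of counts for each hair and eye color combination. As a rough illustration, the sketch below builds such a table in Python from a handful of made-up observations and prints it as tab-separated text. The observations, the function names `build_table` and `to_tsv`, and the header label are all assumptions for illustration; the exact file layout that Circos or the book's example expects is not reproduced here.

```python
from collections import Counter

# Hypothetical observations, one (hair, eye) pair per person.
# These values are invented for illustration only.
observations = [
    ("black", "brown"), ("black", "brown"), ("brown", "brown"),
    ("brown", "blue"), ("blond", "blue"), ("blond", "blue"),
    ("red", "green"), ("brown", "green"),
]


def build_table(pairs):
    """Count how often each (row, column) combination occurs."""
    counts = Counter(pairs)
    rows = sorted({hair for hair, _ in pairs})
    cols = sorted({eye for _, eye in pairs})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


def to_tsv(rows, cols, table):
    """Render the count table as tab-separated text with a header row."""
    lines = ["\t".join(["hair"] + cols)]
    for label, values in zip(rows, table):
        lines.append("\t".join([label] + [str(v) for v in values]))
    return "\n".join(lines)


rows, cols, table = build_table(observations)
print(to_tsv(rows, cols, table))
```

Each cell of the table corresponds to one ribbon in a relationship diagram: the larger the count, the wider the ribbon between the two categories.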

The last part of the book goes into more advanced topics such as customizing labels, links and ribbons, formatting links with rules, reducing links through bundling, and adding data tracks as heat maps or histograms. This is the meat for those who intend to use Circos in more advanced ways. I did not spend a lot of time here, but found the examples to be useful.

Contributions by State and Political party during 2012 U.S. Presidential Elections

This section ends abruptly. One gets the feeling that there are other subtleties that could be explored and explained. A summary or outlook chapter would have been nice to wrap up the book and give perspective. For example, I would have liked to hear from the author how much time he spent with various features during the college major to professions project.

In summary: This book will get you going with Circos on your own machine. Installing can be a challenge on Mac, depending on how familiar you are with Unix and the open source tool stack. The examples for your first Circos charts are easy to follow and explain data and configuration files. The more advanced features are briefly touched upon, but require more experimentation and time to understand and appreciate.

Circos author Martin Krzywinski writes on his website: “To get your feet wet and hands dirty, download Circos and read the tutorials, or dive into a full course on Circos.” The How-to book by Tom Schenk helps with this process, but you still need to come prepared. If you are a Unix power user this should feel familiar. If you are a Mac user who rarely ever opens a Terminal then you might be better off just using Circos via the tableviewer web interface.

Lastly, I would recommend buying the electronic version of this book, as you can cut and paste the code and leverage the companion code and documents. A printed version of this book would be of very limited use.


Posted by visualign on December 6, 2012 in Education, Scientific

Tags: Book Review, circos, data visualization, data visualization tool

2012 Election Result Maps

15 Nov

The New York Times has covered the 2012 U.S. presidential election in great detail, including the much heralded FiveThirtyEight blog (named after the 538 electoral votes) by forecaster Nate Silver. His poll-aggregation model has produced very accurate forecasts, calling 99 of 100 state outcomes correctly across the 2008 and 2012 elections combined.

A popular visualization is the map of the 50 states in colors red (Republican) and blue (Democrat) plus green (Independent). Since most states allocate all their electoral votes to the candidate with the most votes in that state, this state map seems the most important.

2012 Election Result By State (Source: NYTimes.com)

This map hardly changed from 2008; only Indiana and North Carolina changed color. Hence the electoral vote result in 2012 (332 Dem – 206 Rep) is similar to that of 2008 (365 Dem – 173 Rep). The visual perception of this map, however, is that there is roughly the same amount of red and blue, with slightly more red than blue. This perception becomes even stronger when looking at the results by county.

2012 Election Results By County (Source: NYTimes.com)

Why is the outcome so strongly in favor of blue (Democrat) when it looks like the majority of the area is red? The answer lies in the very uneven population density of the 50 states. Although the two states are roughly the same size, California (slightly more blue) has a population density about 40x higher than that of Montana (mostly red). At the extreme end of this scale, the most densely populated state, New Jersey, has about 1000x as many people living per square mile as the least densely populated state, Alaska. Urban areas have a much higher density of voters than rural areas. The demographics are such that urban areas tend to vote more blue (Democrat), while rural areas tend to vote more red (Republican). The size of the colored area in the above chart would only be a good indicator if the population density were uniform. A great way to compensate visually for this difference can be seen in the third chart published by the NYTimes.

2012 Election Delta By County (Source: NYTimes.com)

Now the size of the colored circles is proportional to the number of surplus votes for that color in that county. The few blue circles around most major cities are larger and outweigh the many small red circles in rural areas – both visually and in the numerical total. The original map is interactive, giving tooltips when you hover over the circles. For example, in Los Angeles County alone there were about 1 million more blue (Democrat) votes than red (Republican).

2012 Election in Los Angeles County
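To make the idea of sizing circles by surplus votes concrete, here is a minimal sketch in Python. It assumes that "size" means the area of the circle, so the radius grows with the square root of the margin; the NYTimes map may use a different scaling. The county names, vote counts, and the `max_radius` parameter are hypothetical.

```python
import math

# Hypothetical counties: name -> (democrat_votes, republican_votes).
counties = {
    "Big City": (1_500_000, 500_000),
    "Rural A": (4_000, 14_000),
    "Rural B": (10_000, 12_500),
}


def circle_radius(surplus, max_surplus, max_radius=40.0):
    """Radius such that the circle's AREA is proportional to the surplus."""
    if surplus < 0 or max_surplus <= 0:
        raise ValueError("surplus must be >= 0 and max_surplus > 0")
    return max_radius * math.sqrt(surplus / max_surplus)


# Margin and winning color per county.
margins = {}
for name, (dem, rep) in counties.items():
    color = "blue" if dem > rep else "red"
    margins[name] = (color, abs(dem - rep))

largest = max(surplus for _, surplus in margins.values())
circles = {
    name: (color, circle_radius(surplus, largest))
    for name, (color, surplus) in margins.items()
}

# Total surplus per color: summing what the eye sums on the map.
totals = {"blue": 0, "red": 0}
for color, surplus in margins.values():
    totals[color] += surplus

for name, (color, radius) in circles.items():
    print(f"{name}: {color}, radius {radius:.1f}")
print(totals)
```

In this toy data set two of the three counties are red, yet the single large blue circle carries far more surplus votes than both red circles together, which is the effect the delta map makes visible.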

This optical summation leads to intuitively correct results for the popular votes. The difference in popular vote was about 3.5 million more blue (Democrat) votes or roughly 3%. We see more blue in this delta circle diagram.

Of course, the president is elected not by the popular vote but by the electoral votes of the states. So no matter how big the Democratic advantage in California may be, there won’t be more than the 55 electoral votes for California. This winner-take-all dynamic of electoral votes by state leads to the outsized influence of swing states, which are near the 50%-50% mark in the popular vote. A small lead in the popular vote can lead to a large gain in electoral votes. In extreme cases, a candidate can win the electoral vote and become president despite losing the popular vote (as happened in 2000, when George W. Bush won Florida very narrowly).
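The winner-take-all effect can be shown with a small sketch. The three states, their electoral votes, and the vote counts below are entirely hypothetical, and the code assumes every state awards all of its electoral votes to the candidate with the most votes, which holds for most but not all states.

```python
# Hypothetical states: name -> (electoral_votes, votes_for_A, votes_for_B).
states = {
    "State X": (10, 510_000, 490_000),   # narrow win for A
    "State Y": (10, 505_000, 495_000),   # narrow win for A
    "State Z": (9, 300_000, 700_000),    # landslide for B
}


def tally(states):
    """Return (popular, electoral) totals under winner-take-all rules."""
    popular = {"A": 0, "B": 0}
    electoral = {"A": 0, "B": 0}
    for electoral_votes, votes_a, votes_b in states.values():
        popular["A"] += votes_a
        popular["B"] += votes_b
        # All electoral votes go to the state winner (a tie goes to B here,
        # an arbitrary choice that does not occur in this data).
        winner = "A" if votes_a > votes_b else "B"
        electoral[winner] += electoral_votes
    return popular, electoral


popular, electoral = tally(states)
print("popular:", popular)
print("electoral:", electoral)
```

Candidate A wins two states by thin margins and collects 20 of the 29 electoral votes, while candidate B has the larger popular vote. The size of B's lead in State Z is irrelevant beyond the 9 electoral votes it secures.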

Another variation on this theme of visually combining votes and population density information comes from Chris Howard. (This was referenced in an article on theatlanticcities.com by Emily Badger on the spatial divide of urban vs. rural voting preferences, which has other election maps as well.) The idea is to use shades of blue and red on a county-level map, with darker shades indicating higher population density.

2012 Election by county with shading by population density (Source: Chris Howard)

A final visualization comes from Nate Silver’s blog post of November 8. While the percentages in this then-preliminary result may be slightly off (not all votes had been counted yet), the electoral vote counts remain valid.

2012 Election By State Cumulative (Source: Fivethirtyeight Blog)

It shows which swing state [electoral votes] put the blue ticket over the winning line (Colorado [9]) and which other swing states could have been lost without losing the presidency (Florida [29], Ohio [18], Virginia [13]). It also gives a crude, but somewhat telling indication of where you might want to live if you want to surround yourself with people with blue or red preferences.
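The arithmetic behind this reading of the chart can be checked in a few lines of Python. The sketch uses only the numbers quoted in this post (332 Democratic electoral votes, 538 in total, and the four states listed above); the variable names are mine, and the ordering of states by margin is taken from the chart rather than computed.

```python
TOTAL_ELECTORAL_VOTES = 538
MAJORITY = TOTAL_ELECTORAL_VOTES // 2 + 1  # votes needed to win

dem_total_2012 = 332

# Swing states that could have been lost, with their electoral votes.
expendable_states = {"Florida": 29, "Ohio": 18, "Virginia": 13}
tipping_state, tipping_votes = "Colorado", 9

# Democratic total if all expendable states had gone red.
without_expendable = dem_total_2012 - sum(expendable_states.values())

# Democratic total if the tipping state had been lost as well.
without_tipping = without_expendable - tipping_votes

print("needed to win:", MAJORITY)
print("without FL, OH, VA:", without_expendable)
print(f"also without {tipping_state}:", without_tipping)
```

Losing Florida, Ohio and Virginia would still have left the blue ticket at or above the majority line, while additionally losing Colorado would have dropped it below.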


Posted by visualign on November 15, 2012 in Socioeconomic

Superstorm Sandy – Visualizing Hurricanes

28 Oct

Time-lapse animation of Sandy Oct-28 from geostationary orbit, 1 frame per minute, 11 hours of daylight. Although “only” a category 1 hurricane, this superstorm has enormous size. Tropical storm force winds extend out over an area 900 miles in diameter.

Living in South Florida makes you alert to tropical storms during hurricane season from June to November. Exactly 7 years ago, at the end of October 2005, the eye of category 3 hurricane Wilma swept over our home in West Palm Beach – the most powerful natural weather event I have ever witnessed. We have avoided a direct hit since then, although we got a massive rain event from Isaac earlier this year. To be sure, the flooding associated with hurricanes is often worse than the wind damage. For example, when hurricane Katrina hit New Orleans in August 2005, most of the devastation came from flooding after the levees were breached. But the first question is always where the storms will make landfall and how strong they will be when they hit your area.

Tropical storms are being tracked and forecast in great detail, in particular by the National Hurricane Center of the National Weather Service. There are many great visualizations illustrating the path, windspeed, rainfall, extent of tropical storm force winds, etc. Due to the convenience for browsing, I have almost completely switched to following hurricane or weather updates from the iPad. (In this case I’m using the Hurr Tracker app from EZ Apps.)

Last week a new tropical storm emerged in the Caribbean and was named ‘Sandy’. A few days ago, with Sandy’s center over the Bahamas, the path looked like this:

Path of hurricane Sandy as of Oct-25 (Hurr Tracker iPad app)

Note the use of color for wind speed and the cone of uncertainty in the lower segment, as well as the rings around the center indicating the size of the area with storm-force winds.

We were naturally curious whether South Florida was likely to get hit, and another image gave us some relief:

5 Day tracking map for hurricane Sandy

Now a few days later, while we did get some strong northerly winds and pounding surf leading to beach erosion, Sandy was not a particularly disturbing event for South Florida. At the same time, however, Sandy is forecast to make landfall on the Jersey shore within about 24 hours during the night from Monday to Tuesday.

One interesting set of maps uses a color code to display the probability of an area experiencing winds of at least a certain speed, say tropical storm force winds (>= 39 mph). The following map was issued this afternoon and indicates in purple the very large area (mostly offshore) with a near 100% probability of tropical storm force winds.

Tropical storm force wind speed probabilities for hurricane Sandy as of Oct-28

This indicates how large Sandy is – an area the size of Texas with tropical storm force winds! Meteorologists are concerned for the Northeast because Sandy is converging with two other weather events, a storm from the West and cold air coming down from the North. This is expected to intensify the weather system, similar to the Perfect Storm of 1991. Because of the timing around Halloween, Sandy has also been called a ‘Frankenstorm’.

One of the most chilling pictures is this animated GIF from WeatherBELL. A story in the Atlantic earlier today writes this:

Dr. Ryan Maue, a meteorologist at WeatherBELL, put out this animated GIF of the storm’s approach yesterday. “This is unprecedented –absolutely stunning upper-level configuration pinwheeling #Sandy on-shore like ping-pong ball,” he tweeted. It shows how cold air to the north and west of the storm spin Sandy into the mid-atlantic coastline.

(Click the image if the animation doesn’t play in your browser.)

Animation of hurricane Sandy moving into the NorthEast (Source: WeatherBELL)

Understandably, this forecast of superstorm Sandy has the authorities worried. The full moon tomorrow exacerbates the tides, and New York City is expecting a storm surge of up to 11 ft. Cities across the Northeast are taking precautions as of this writing. For example, the New York City subway system is shutting down tonight, and several hundred thousand people in low-lying coastal areas are under mandatory evacuation orders. More than 5000 flights to the area on Monday have been cancelled. Take a look at the expected 5 day precipitation forecast in the Northeast. Some areas may get up to 10 inches of rain and/or snow!

5 day precipitation forecast with Sandy’s impact for the Northeast

The first priority is to use such visualizations to communicate the weather impact and allow people to take necessary precautions. One can use similar hurricane charts to visualize other uncertain events, such as the future outcomes of development projects. We will look at this in an upcoming post on this Blog.

Addendum 11/4/12: The NYTimes has provided some interactive graphics detailing the location and size of power outages caused by superstorm Sandy in the New York and New Jersey area. The New York City outages have been summarized in this chart, normalized to the percentage of all customers. As can be seen, the efforts to restore power over the first 6 days have been fairly successful, especially in Manhattan and Staten Island, less so in Westchester.

6 day tracking map of power outages caused by Sandy in New York City


Posted by visualign on October 28, 2012 in Recreational, Scientific

Tags: national hurricane center, nature, storm force winds

Trends in Health Habits across the United States

19 Oct

This week Scientific American published an interesting article about trends in health habits across the United States. The article includes both a large composite chart and a page with an interactive chart. Both are well done and a great example of using a visualization to help tell a story. I personally find the most useful part of the graphic to be the comparison column on the right with shades of color indicating degree of improvement (blue) or deterioration (red).

US health habits 1995 vs. 2010 (Source: Scientific American)

From the article:

Americans are imbibing alcohol and overeating more yet are smoking less (black lines in center graphs).

Some of the behaviors have patterns; others do not. Obesity is heaviest in the Southeast (2010 maps). Smoking is concentrated there as well. Excess drinking is high in the Northeast.

Comparing 2010 and 1995 figures provides the greatest insight into trends (maps, far right). Heavy drinking has worsened in 47 states, and obesity has expanded in every state. Tobacco use has declined in all states except Oklahoma and West Virginia. The “good” habit, exercise, is up in many places—even in the Southeast, where it has lagged.

A more detailed visual analysis is possible using the interactive version of these graphs on the related subpage Bad Health Habits are on the rise. Here one can compare up to three arbitrary states against top, median, and bottom performing states by health habit.

The following examples show tobacco use, exercise and obesity by state with line charts for the three arbitrarily selected states of Florida, California and Hawaii.

Tobacco Trend By State

Exercise Trend By State

Obesity Trend By State

Leading the exercise statistics are citizens in states offering attractive outdoor sports opportunities, like Oregon or Hawaii. Such correlation seems intuitive in both causal directions: People interested in exercise tend to move to those states with the most attractive outdoor sports. And people living in those states may end up exercising more due to the opportunity.

When looking at the average trend line, exercise seems to have leveled off after a bump in the early 2000s, whereas the decline in smoking over the last decade continues unabated.

15 years is half a generation. During that time, Americans have smoked less in almost every state and exercised more in many states, but obesity is sharply on the rise in every state! From a health policy perspective, the last seems to be the most alarming trend. Most people want the next generation to be better off than the previous one. This has to some extent been true with wealth, at least until the great recession of 2008. But these data show that at population levels, more wealth is not necessarily more health.


Posted by visualign on October 19, 2012 in Medical

Inequality and the World Economy

15 Oct

The last edition of The Economist featured a 25-page special report on “The new politics of capitalism and inequality” headlined “True Progressivism”. It is the most recommended and commented story on The Economist this week.

We have looked at various forms of economic inequality on this Blog before, as well as other manifestations (market share, capitalization, online attention) and various ways to measure and visualize inequality (Gini-index). Hence I was curious about any new trends and perhaps ways to visualize global economic inequality. That said, I don’t intend to enter the socio-political debate about the virtues of inequality and (re-)distribution policies.

In the segment titled “For richer, for poorer” The Economist explains:

The level of inequality differs widely around the world. Emerging economies are more unequal than rich ones. Scandinavian countries have the smallest income disparities, with a Gini coefficient for disposable income of around 0.25. At the other end of the spectrum the world’s most unequal, such as South Africa, register Ginis of around 0.6.

Many studies have found that economic inequality has been rising over the last 30 years in many industrial and developing nations around the world. One interesting phenomenon is that while the Gini index of many countries has increased, the Gini index of world inequality has fallen. This is shown in the following image from The Economist.

Global and national inequality levels (Source: The Economist)

This is somewhat non-intuitive. Of course the countries differ widely in terms of population size and level of economic development. At a minimum it means that a measure like the Gini index is not simply additive when aggregated over a collection of countries.
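A toy calculation shows how this can happen. The sketch below computes the Gini coefficient as the mean absolute difference between all pairs of incomes, divided by twice the mean income, and applies it to two hypothetical four-person countries at two points in time. All income values are invented for illustration and are not taken from The Economist's data.

```python
def gini(incomes):
    """Gini coefficient: mean absolute difference over twice the mean."""
    n = len(incomes)
    if n == 0:
        raise ValueError("need at least one income")
    mean = sum(incomes) / n
    if mean == 0:
        raise ValueError("mean income must be positive")
    total_diff = sum(abs(x - y) for x in incomes for y in incomes)
    return total_diff / (2 * n * n * mean)


# Hypothetical incomes. Period 1: each country is perfectly equal
# internally, but one is much poorer than the other.
poor_1, rich_1 = [1, 1, 1, 1], [10, 10, 10, 10]

# Period 2: the poor country has grown strongly and caught up in part,
# while incomes within both countries have spread out.
poor_2, rich_2 = [3, 5, 7, 9], [8, 10, 12, 14]

world_1 = poor_1 + rich_1
world_2 = poor_2 + rich_2

for label, data in [
    ("poor country, period 1", poor_1), ("poor country, period 2", poor_2),
    ("rich country, period 1", rich_1), ("rich country, period 2", rich_2),
    ("world, period 1", world_1), ("world, period 2", world_2),
]:
    print(f"{label}: {gini(data):.3f}")
```

Inequality rises inside both countries, yet the pooled "world" Gini falls, because the gap between the two countries narrows faster than the gaps within them widen. The world value depends on the differences between countries as well as within them, so it cannot be obtained by adding or averaging the national values.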

Another interesting chart displays a world map color-coded by each country’s change in inequality.

Changes in economic inequality over the last 30 years (Source: The Economist)

It’s a bit difficult to read this map without proper knowledge of the absolute levels of inequality, such as we displayed in the post on Inequality, Lorenz-Curves and Gini-Index. For example, a look at a country like Namibia in southern Africa indicates a trend (light-blue) towards less inequality. However, Namibia was for many years the country with the world’s largest Gini (1994: 0.7; 2004: 0.63; 2010: 0.58 according to iNamibia) and hence still has much larger inequality than most developed countries.

World Map of national Gini values (Source: Wikipedia)

So global Gini is declining, while in many large industrial countries Gini is rising. One region where Gini is declining as well is Latin America. Between 1980 and 2000 Latin America’s Gini grew, but in the last decade it has declined back to 1980 levels (~0.5), despite the strong economic growth throughout the region (Mexico, Brazil).

Gini of Latin America over the last 30 years (Source: The Economist)

Much of the coverage in The Economist tackles the policy debate and the questions of distribution vs. dynamism. On the one hand reducing Gini from very large inequality contributes to social stability and welfare. On the other hand, further reducing already low Gini diminishes incentives and thus potentially slows down economic growth.

In theory, inequality has an ambiguous relationship with prosperity. It can boost growth, because richer folk save and invest more and because people work harder in response to incentives. But big income gaps can also be inefficient, because they can bar talented poor people from access to education or feed resentment that results in growth-destroying populist policies.

In other words: Some inequality is desirable, too much of it is problematic. After growing over the last 30 years, economic inequality in the United States has perhaps reached a worrisome level as the pendulum has swung too far. How to find the optimal amount of inequality and how to get there seem like fascinating policy debates to have. This is certainly an example where data visualization can help with an otherwise dry subject.


Posted by visualign on October 15, 2012 in Socioeconomic

Tags: Gini, global economic inequality, inequality, world economy

Software continues to eat the world

20 Aug

One year ago Marc Andreessen, co-founder of Netscape and of the venture capital firm Andreessen Horowitz, wrote an essay for the Wall Street Journal titled “Why Software Is Eating The World”. It is interesting to reflect on this piece and some of its predictions, made at a time when Internet company LinkedIn had just gone public and Groupon was just filing for an IPO.

Andreessen’s observation was simply this: Software has become so powerful and computer infrastructure so cheap and ubiquitous that many industries are being disrupted by new business models enabled by that software. Examples listed were books (Amazon disrupting Borders), movie rental (Netflix disrupting Blockbuster), the music industry (Pandora, iTunes), animated movies (Pixar), photo-sharing services (disrupting Kodak), job recruiting (LinkedIn), telecommunication (Skype), video-gaming (Zynga) and others.

On the infrastructure side one can bolster this argument by pointing at the rapid development of new technologies such as cloud computing or big data analytics. Andreessen gave one example of the cost of running an Internet application in the cloud dropping by a factor of 100 in the last decade (from $150,000 / month in 2000 using Loudcloud to about $1,500 / month in 2011 using Amazon Web Services). Microsoft now has infrastructure with Windows Azure where procuring an instance of a modern server at one (or even multiple) data center(s) takes only minutes and costs you less than $1 per CPU hour.

Likewise, the number of Internet users has grown from some 50 million around 2000 to more than 2 billion with broadband access in 2011. This is certainly one aspect fueling the enormous growth of social media companies like Facebook and Twitter. To be sure, not every high-flying startup goes on to be as successful after its IPO. Three months after its IPO, Facebook trades at half its opening-day value. Groupon trades at less than 20% of its IPO value of some 9 months ago. But LinkedIn has sustained and even modestly grown its market capitalization. And Google and Apple both trade near or at their all-time highs, with Apple today at $621 billion becoming the most valuable company of all time (not inflation-adjusted).

The growing dominance and ubiquitous reach of software shows in other areas as well. Take automobiles. Software is increasingly being used for comfort and safety in modern cars. In fact, self-driving cars – once the realm of science fiction, like flying hover cars – are now technically feasible and knocking on the door of broad industrial adoption. After driving 300,000 miles in tests, Google is now deploying its fleet of self-driving cars for the benefit of its employees. Engineers even take self-driving cars to the racetrack, such as up Pikes Peak or at the Thunderhill raceway. Performance is now at the level of very good drivers, with the benefit of not having the human flaws (drinking, falling asleep, texting, showing off, etc.) which cause so many accidents. Expert drivers still outperform the computer-driven cars. (That said, even human experts sometimes make mistakes with terrible consequences, such as this crash on Pikes Peak this year.) The situation is similar to chess in the mid-nineties, when computers became so proficient that finally even the world champion was defeated.

In this post I want to look at some other areas specifically impacting my own life, such as digital photography. I am not a professional photographer, but over the years my wife and I have owned dozens of cameras and have followed the evolution of digital photography and its software for many years. Of course, there is an ongoing development towards chips with higher resolution and lenses with better optics and faster controls. But the major innovation comes from better software. Things like High Dynamic Range (HDR) to compensate for stark contrast in lighting, such as a portrait photo against a bright background. Or stitching multiple photos together into a panorama, with Microsoft’s Photosynth taking this to a new level by building 3D models from multiple shots of a scene.

One recent innovation comes in the form of the new Sony RX100 camera, which science writer David Pogue raved about in the New York Times as “the best pocket camera ever made”. My wife bought one a few weeks ago and we both have been learning all it can do ever since. Despite the many impressive features and specifications about lens, optics, chip, controls, etc. what I find most interesting is the software running on such a small device. The intelligent Automatic setting will decide most settings for your everyday use, while one can always direct priorities (aperture, shutter, program) or manually override most aspects. There are a great many menus and it is not trivial to get to use all capabilities of this camera, as it’s extremely feature-rich. Some examples of the more creative software come in modes such as ‘water color’ or ‘illustration’. The original image is processed right then and there to generate effects as if it was a painting or a drawing. Both original and processed photo are stored on the mini-SD card.

Flower close-up in ‘illustration’ mode

One interesting effect is to filter to just one of the main colors (Yellow, Red, Green, Blue). Many of these effects are shown on the display, with the aperture ring serving as a flexible multi-functional dial for more convenient handling with two hands. (Actually, the camera body is so small that it is a challenge to use all dials while holding the device; just like the BlackBerry keyboard made us write with two thumbs instead of ten fingers.) The point of such software features is not so much that they are radically new; you could have done the same with good photo editing software for many years. The point is that with the ease and integration of having them at your fingertips you are much more likely to use them.

Example of suppressing all colors except yellow
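An effect like this can be approximated in a few lines of code. The sketch below is my own illustration, not the camera's actual algorithm: it converts each pixel to hue, saturation and value with Python's standard `colorsys` module, keeps pixels whose hue is close to a target color, and replaces all others with a gray of similar brightness. The tiny four-pixel image, the function name `keep_hue`, and the tolerance and saturation thresholds are assumptions.

```python
import colorsys

YELLOW_HUE = 60 / 360  # hue of pure yellow on a 0..1 color wheel


def keep_hue(pixel, hue_center, tolerance=0.06, min_saturation=0.25):
    """Keep a pixel if its hue is near hue_center, else turn it gray."""
    r, g, b = pixel  # channel values in 0..255
    h, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    # Distance between hues on the circular color wheel.
    dist = abs(h - hue_center)
    dist = min(dist, 1 - dist)
    if s >= min_saturation and dist <= tolerance:
        return pixel
    # Standard luma weights give a gray of similar perceived brightness.
    gray = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (gray, gray, gray)


# Hypothetical 2x2 image: warm yellow, red, green, pure yellow.
image = [
    [(250, 220, 30), (200, 30, 40)],
    [(40, 160, 60), (255, 255, 0)],
]

filtered = [[keep_hue(p, YELLOW_HUE) for p in row] for row in image]
for row in filtered:
    print(row)
```

The two yellow pixels pass through unchanged while the red and green ones become gray. A real implementation would work on full images and probably use softer transitions at the edge of the hue range to avoid visible banding.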

The camera will allow registering of faces and detect those in images. You can set it up such that it will take a picture only when it detects a small/medium/large smile on the subject being photographed. One setting allows you to take a self-portrait, with the timer starting to count down as soon as the camera detects one (or two) faces in the picture! It is an eerie experience when the camera starts to “understand” what is happening in the image!

There is an automatic panorama stitching mode where you just hold the button and swipe the camera left-right or up-down while the camera takes multiple shots. It automatically stitches them into one composite, so no more uploading of the individual photos and stitching on the computer required.

Beach panorama stitched on the camera using swipe-&-shoot

I have been experimenting with panorama photos since 2005 (see my collection or my Panoramas from the Panamerican Peaks adventure). It’s always been somewhat tedious and results were often mixed (lens distortions, lighting changes sun vs. cloud or objects moving during the individual frames, not holding the camera level, skipping a part of the horizon, etc.) despite crafty post-processing on the computer with image software. I have read about special 360 degree lenses to take high-end panoramas, but who wants to go to those lengths just for the occasional panorama photo? From my experience, nothing moves the needle as much as the ease and integration of taking panoramas right in the camera as the RX100 does.

Or take the field of healthcare. Big Data, Mobility and Cloud Computing make possible entirely new business models. Let’s just look at mobility. The smartphone is evolving into a universal healthcare device for measuring, tracking and visualizing medical information. Since many people have their smartphone with them at almost all times, one can start tracking and analyzing personal medical data over time. And for almost any medical measurement, “there is an app for that”. One interesting example is this optical heart-rate monitor app Cardiio for the iPhone. (Cardio + IO ?)

Screenshots of Cardiio iPhone app to optically track heart rate

It is amazing that this app can track your heart rate just by analyzing the changes of light reflected from your face with its built-in camera. Not even a plug-in required!

Another system comes from Withings, this one designed to turn the iPhone into a blood pressure monitor. A velcro sleeve with battery mount and cable plugs into the iPhone and an app controls the inflation of the sleeve, the measurement and some simple statistics.

Blood pressure monitor system from Withings for iPhone

Again, it’s fairly simple to just put the sleeve around one upper arm and push the button on the iPhone app. The results are systolic and diastolic blood pressure readings and heart rate.

Sample blood pressure and pulse measurement

Like many other monitoring apps this one also keeps track of the readings and does some simple form of visual plotting and averaging.

Plot of several blood pressure readings

There is also a separate app which will allow you to upload your data and create a more comprehensive record of your own health over time. Withings provides a few other medical devices such as scales to add body weight and body fat readings. The company tagline is “smart and connected things”.

One final example is an award-winning contribution from an Australian student team called StethoCloud. This system is aimed at diagnosing pneumonia. It consists of an app for the iPhone, a simple stethoscope plug-in for the iPhone, and server-based back-end software that analyzes the measurements in the Windows Azure cloud according to standards defined by the World Health Organization. The winning team (in Microsoft’s 2012 Imagine Cup) built a prototype in only 2 weeks and with only minimal upfront investment.

StethoCloud system for iPhone to diagnose pneumonia

This last example perhaps best illustrates the opportunities of new software technologies to bring unprecedented advances to healthcare – and to many other fields and industries. I think Marc Andreessen was spot on with his observation that software is eating the world. It certainly is in my world.


Posted by visualign on August 20, 2012 in Industrial, Medical, Socioeconomic

Tags: digital photography, gadgets, healthcare, software, technology, trends

Olympic Medal Charts

15 Aug

The 2012 London Olympic Games ended this weekend with a colorful closing ceremony. Media coverage was unprecedented, with other forms of competition around who had the most social media presence or which website had the best online coverage of the games.

In this post I’m looking at the medal counts over the history of the Olympic Games (summer games only, 27 Olympiads over the last 116 years, no games in 1916, 1940, and 1944). In London, nearly 11,000 athletes from 205 countries competed for more than 900 medals in 302 events. The New York Times has an interactive chart of the medal counts on their London 2012 Results page:

Bubble size represents the number of medals won by the country, bubble position is roughly based on a world map and bubble color indicates the continent. Moving the slider to a different year changes the bubbles, which gives a dynamic grow or shrink effect.

Below this chart is a table listing all gold, silver, bronze winners for each sport in that year, grouped by type of sport such as Gymnastics, Rowing or Swimming. Selecting a bubble will filter this to entries where the respective country won a medal. This shows the domination of some sports by certain countries, such as Diving (8 events, China won 6 gold and 10 total medals) or Cycling – Track (10 events, Great Britain won 7 gold and 9 total medals). In two sports, domination by one country was 100%: Badminton (5 events, China won 5 gold and 8 total medals), Table Tennis (4 events, China won 4 gold and 6 total medals).

There is also a summary table ranking the countries by total medals. For 2012, the United States clearly won that competition, winning more gold medals (46) than all but 3 other countries (China, Russia, Britain) won total medals.

Top 10 countries for medal count in 2012

Of course countries vary greatly by population size. It is remarkable that a relatively small nation such as Jamaica (~2.7 million) won 12 medals (4, 4, 4), while India (~1.25 billion) won only 6 medals (0, 2, 4). In that sense, Jamaica is about 1000x more medal-decorated per population size than India! In another New York Times graphic there is an option to compare medal counts adjusted for population size, i.e. with the medal count normalized to a standard population size of, say, 100 million.
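To make the population adjustment concrete, here is a minimal Python sketch of the arithmetic, using only the rounded medal counts and population figures quoted above. The function name and the choice of 100 million as the standard population are illustrative assumptions; this is not the method used by the New York Times graphic.

```python
# Medal counts and rough population figures quoted in the text above.
medals = {"Jamaica": 12, "India": 6}
population = {"Jamaica": 2.7e6, "India": 1.25e9}

STANDARD_POPULATION = 100e6  # assumed standard size: 100 million people


def medals_per_standard_population(country):
    """Scale a country's medal count to a population of 100 million."""
    return medals[country] / population[country] * STANDARD_POPULATION


jamaica = medals_per_standard_population("Jamaica")
india = medals_per_standard_population("India")
ratio = jamaica / india

print(f"Jamaica: {jamaica:.1f} medals per 100 million")
print(f"India:   {india:.2f} medals per 100 million")
print(f"Ratio:   {ratio:.0f}x")
```

With these rounded inputs the ratio comes out a little above 900, which is the "about 1000x" order of magnitude mentioned above.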

Directed graph comparing medal performance adjusted for country size

Selecting any node in this graph will highlight countries with better, worse or comparable relative medal performance. (There are different ways to rank based on how different medals are weighted.)

The Guardian Data Blog has taken this a step further and written a piece called “alternative medals table“. This post not only discusses multiple factors like population, GDP, or number of athletes and how to deal with them statistically; it also provides all the data and many charts in a Google Docs spreadsheet. One article combines GDP adjustment with cartographical mapping across Europe:

Medals GDP Adjusted and mapped for Europe

If you want to do your own analysis, you can get the data in shared spreadsheets. To do a somewhat more historic analysis, I used a different source, namely Wolfram’s curated data source accessible from within Mathematica. Of course, once you have all that data, you can examine it in many different directions. Did you know that 14,853 Olympic medals have been awarded so far in 27 summer Olympiads? The average was 550 medals per Games, growing by about 29 medals from one Games to the next, with nearly 1000 awarded in each of 2008 and 2012.

A lot of attention was paid to who would win the most medals in London. China seemed in contention for the top spot, but in the end the United States won the most medals, as it did in the last 5 Olympiads. Only 7 countries have ever won the most medals at an Olympiad. Greece (1896), France (1900), the United Kingdom (1908), Sweden (1912), and Germany (1936) did so just once. The Soviet Union (which no longer exists) did it 8 times. And the United States did it 14 times. China, which has been participating only since 1984, has yet to win the most medals at any Olympiad.

Aside from the top rank, I was curious about the distribution of medals over all countries. Both nations and events have increased, as is shown in the following paired bar chart:

Number of participating nations and total medals per Summer Games

The number of nations grew steadily with only two exceptions, during the thirties and the seventies; presumably many nations could not afford to participate due to economic hardship. 1980 also saw the boycott of the Moscow Games by the United States and several other delegations over geopolitical disagreements. At just over 200, the number of nations seems to have stabilized.

The number of medals depends primarily on the number of events at each Olympiad. This year there were 302 events in 26 types of Sports. Total medal count isn’t necessarily exactly triple that since in some events there could be more than 1 Bronze (such as in Judo, Taekwondo, and Wrestling). Case in point, in 2012 there were 968 medals awarded, 62 more than 3 * 302 events.

What is the distribution of those medals over the participating nations? One measure would be the percentage of nations winning at least some medals. Another measure showing the degree of inequality in a distribution is the Gini index. Here I plotted the percentage of nations medaling and the Gini index of the medal distribution over all participating nations for every Olympiad:

Percentage and Gini-Index of medal distribution by nations

Up until 1932, 3 out of 4 nations won at least some medals. Then the percentage dropped, to levels around 40% and lower since the sixties. That means 6 of 10 nations go home without any medals. During the same time period the inequality grew from a Gini of about .65 to near .90. One exception was the third Games in 1904 in St. Louis: with only 13 nations competing, the United States dominated so many sports that the Gini reached an extreme .92. All of the last five Games resulted in a Gini of about .86, so this still very large amount of medal-winning inequality seems to have stabilized.
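For readers who want to reproduce the two measures, here is a minimal Python sketch. It uses one common rank-based formula for the Gini index; I am assuming, not asserting, that this matches the exact definition behind the chart above. The medal counts in `toy_medals` are made up for illustration and are not real Olympic data.

```python
def gini(values):
    """Gini index of non-negative values: 0 means perfectly equal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical medal counts for 10 participating nations (not real data).
toy_medals = [0, 0, 0, 0, 0, 0, 1, 2, 5, 12]

# Share of nations that won at least one medal.
share_medaling = sum(1 for m in toy_medals if m > 0) / len(toy_medals)
toy_gini = gini(toy_medals)

print(f"Nations medaling: {share_medaling:.0%}")
print(f"Gini index:       {toy_gini:.2f}")
```

In this toy example 40% of the nations win medals and the Gini index is 0.78, roughly the kind of concentration described above. Note that nations without medals are included as zeros, which is what "over all participating nations" implies.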

It would be interesting to extend this to the level of participating athletes. Of course we know which athlete ranks at the top as the most decorated Olympic athlete of all time: Michael Phelps with 22 medals.


Posted by visualign on August 15, 2012 in Recreational

Tags: DataBlog, inequality, interactive diagram, Mathematica, NYTimes, Olympics, sports

Keystroke Biometrics using Mathematica

20 Jul

A few weeks ago Paul-Jean Letourneau posted an article on Wolfram’s Blog about using Mathematica to collect and analyze keystroke metrics as a way to identify individuals. The article analyzes how you type, measuring the time intervals between your typing the individual characters using a little interactive widget, collecting and visualizing the data while you repeatedly type in the word “wolfram”.

Keystroke metrics of 50 trials typing the word “wolfram”

It is somewhat interesting at this point to analyze one’s own typing style. For example, there appears to be a bi-modal distribution of the time intervals between keystrokes, with the sequence “r-a” taking me almost twice as long (~130ms) as most other sequences (~60-70ms). There is also a ‘learning’ effect visible in my 50 trials, where the speed improves noticeably after about 20 repetitions or so. However, there are occasional relapses into a much slower typing pattern throughout the rest of the trials.

However, what I thought was more interesting is the subsequent analysis the author did across a set of 42 such series he obtained from his colleagues (noting humorously that “it just so happens that Wolfram is a company full of data nerds”). He then proceeds to analyze and visualize that data in various ways.

Distribution Histogram of keystroke intervals

He observes the bimodal nature of the distribution with peaks around 75ms and 150ms for different pairs of characters. In fact, averaging over all those pair typing times, a correlation is found indicating that when people type slower they are more consistent.

(Negative) Correlation of pairwise typing speed and consistency

The analysis continues with the observation that each measurement can be seen as a point in a six-dimensional space (six pair-transitions in a word with seven characters). When a person types this same word 50 times you get a cluster of 50 points in six-dimensional space. Different individuals will produce different clusters. So one can use the (built-in) function FindClusters to determine such clusters. However, since people have a certain amount of inconsistency in their typing, it is possible that sometimes one person’s typing will show up in another person’s cluster and vice versa. To measure how well the clusters distinguish individuals, one can implement various measures. The author implements the Rand index, a measure of the similarity between two data clusterings. This gives a numeric accuracy on a scale from 0 to 1 for the ability to distinguish between two people. There are 21*41=861 different pairs among 42 people, but the author chose to look at all 42*42=1764 ordered pairs, because the FindClusters results depend on the order of the input data, so Rand[i,j] may be different from Rand[j,i]. Looking across all of these pairs, you get the following histogram of Rand quality scores:

Histogram of Rand quality score for all pairs

This clearly shows that keystroke metrics for one word are not sufficient to reliably distinguish between arbitrary pairs of people. The average quality score is only 0.67. On the other hand, about 400 (~23%) of those quality scores are a perfect 1.0, so for about a quarter of the pairs this one word alone would suffice to reliably distinguish the two people typing. About half as many scores are 0.0, meaning that the clusters overlap so much that no distinction is possible. The remaining scores are distributed mostly between 0.5 and 1.0, meaning you would just guess right more often than wrong.
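As a rough illustration of the measure discussed above, here is a minimal Python sketch of the textbook Rand index: the fraction of item pairs on which two clusterings agree. This is my own sketch, not the author's Mathematica code, and his implementation may differ in detail. The label lists are hypothetical.

```python
from itertools import combinations
from math import comb


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree (0 to 1)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    # A pair counts as an agreement if both clusterings put the two items
    # together, or both clusterings keep them apart.
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical example: who really typed each of 6 trials (person 0 or 1)
# versus the cluster a clustering step assigned each trial to.
typed_by = [0, 0, 0, 1, 1, 1]
clustered = [0, 0, 1, 1, 1, 1]
score = rand_index(typed_by, clustered)
print(f"Rand index: {score:.3f}")

# Pair counts mentioned in the text: unordered versus ordered pairs.
print(comb(42, 2), 42 * 42)
```

One misassigned trial out of six already lowers the score to about 0.67, since 5 of the 15 trial pairs end up on the wrong side. The score does not depend on how the clusters are numbered, only on which trials are grouped together.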

The author wraps up the post with this paragraph:

Using this fun little typing interface, I feel like I actually learned something about the way my colleagues and I type. The time to type two letters with the same finger on the same hand takes twice as long as with different fingers. The faster you type, the more your typing speed will fluctuate. The more your typing speed fluctuates, the harder it will be to distinguish you from another person based on your typing style. Of course we’ve really just scratched the surface of what’s possible and what would actually be necessary in order to build a keystroke-based authentication system. But we’ve uncovered some trends in typing behavior that would help in building such a system.

An interactive CDF widget embedded in the article allows you to collect and visualize the timing of your own typing. Source code as well as the test data is also shared if you want to further explore the details of this interesting analysis.


Posted by visualign on July 20, 2012 in Linguistic, Scientific

Tags: biometrics, data clustering, histogram, keystroke metrics, Mathematica, Rand measure, timing analysis

London Tube Map and Graph Visualizations

11 Jul

The previous post on Tube Maps has quickly risen in the view stats into the Top 3 posts. Perhaps it’s due to many people searching Google for images of the original London tube map in the context of the upcoming Olympic Games.

I recently reviewed some of the classes in Wolfram’s free Data Science course. If you are interested in Data Science, this is excellent material. And if you are using Mathematica, you can download the underlying code and play with the materials.

It just so happens that in the notebook for the Graphs and Networks: Concepts and Applications class there is a graph object for the London subway.

Mathematica Graph object for the London subway

As previously demonstrated in our post on world country neighborhood relationships, Mathematica’s graph objects are fully integrated into the language and there are powerful visualization and analysis functions.

For example, this graph has 353 vertices (stations) and 409 edges (subway connections). This one line of code highlights all stations no more than 5 stations away from the Waterloo station:

HighlightGraph[london, 
  NeighborhoodGraph[london, "Waterloo", 5]]

Neighborhood Graph 5 around Waterloo

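Since HighlightGraph and NeighborhoodGraph are built-in functions, this can be done in one line of code. Similarly, exporting a table of such graphs for growing distances k around King’s Cross St. Pancras: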

Export["london.gif",
  Table[HighlightGraph[london, 
    NeighborhoodGraph[london, "King's Cross St. Pancras", k]],
   {k, 0, 20, 1}]]

creates this animated GIF file:

Paths spreading out from the center

Shortest paths can easily be determined and visualized:

HighlightGraph[london, 
  FindShortestPath[london, "Amersham", "Woolwich Arsenal"]]

A shortest path example

There are many other graph functions such as:

GraphDiameter[london]   (* 39 *)
GraphRadius[london]     (* 20 *)
GraphCenter[london]     (* "King's Cross St. Pancras" *)
GraphPeriphery[london]  (* {"Watford Junction", "Woodford"} *)

In other words, the King’s Cross St. Pancras station is at the center: every other station is at most 20 stops away from it (the radius). The 39 stops between Watford Junction and Woodford form the longest shortest path in the network (the diameter).
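To show what these four functions compute, here is a minimal Python sketch using breadth-first search on a tiny made-up network. The stations A to F and their connections are hypothetical and have nothing to do with the real London graph, and the sketch assumes a connected, unweighted graph. It is not how Mathematica implements these functions internally.

```python
from collections import deque

# Hypothetical toy network: a line A-B-C-D-E with a short branch C-F.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("C", "F")]


def build_adjacency(edges):
    """Undirected adjacency sets from a list of station pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def bfs_distances(adjacency, start):
    """Number of hops from start to every reachable station."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


adjacency = build_adjacency(toy_edges)

# Eccentricity: distance from a station to the station farthest from it.
eccentricity = {
    node: max(bfs_distances(adjacency, node).values()) for node in adjacency
}
radius = min(eccentricity.values())    # smallest eccentricity
diameter = max(eccentricity.values())  # longest shortest path
center = sorted(n for n, e in eccentricity.items() if e == radius)
periphery = sorted(n for n, e in eccentricity.items() if e == diameter)

print(radius, diameter, center, periphery)
```

In the toy network the center is station C with radius 2, and the diameter of 4 is reached between the two ends of the line, A and E.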

Let’s look at distances within the graph. The built-in function GraphDistanceMatrix calculates all pairwise distances between any two stations:

mat = GraphDistanceMatrix[london]; MatrixPlot[mat]

Graph Distance Matrix Plot

For the 353*353 = 124,609 pairs of stations, let’s plot a histogram of the pairwise distances:

Histogram[Flatten[mat]]

Graph Distance Histogram

The average distance between two stations in the London subway system is about 14.

So far, very little coding has been required as we have used built-in functions. Of course, the set of functions can be easily extended. One interesting aspect is the notion of centrality, or distance of a node from the center of the graph. This is expressed in the built-in function ClosenessCentrality:

cc = ClosenessCentrality[london];
HighlightCentrality[g_, cc_] := 
   HighlightGraph[g, 
    Table[Style[VertexList[g][[i]], 
      ColorData["TemperatureMap"][cc[[i]]/Max[cc]]], 
        {i, VertexCount[g]}]];
HighlightCentrality[london, cc]

Color coded Centrality Map

Another interesting notion is that of BetweennessCentrality, which is a measure indicating how often a particular node lies on the shortest paths between all node-pairs. The following nifty little snippet of code identifies the 10 most traversed stations – along the shortest paths – of the London underground:

HighlightGraph[london,
 First /@ SortBy[
 Thread[VertexList[london] -> BetweennessCentrality[london]],
 Last][[-10 ;;]]]

10 most traversed stations

I have often felt that progress in computer science and in languages comes from raising the level of abstraction. It’s amazing how much analysis and visualization one can do in Mathematica with very little coding due to the large number of powerful, built-in functions. The reference documentation of these functions often has many useful examples (and is also available for free on the web).

When I graduated from college 20 years ago we didn’t have such powerful language platforms. Implementing a good algorithm for finding shortest paths is a good exercise for a college-level computer science course. And even when such pre-built functions exist, it may still be instructive to figure out how to implement such algorithms.

As a manager I have always encouraged my software engineers to spend a certain fraction of their time searching for built-in functions or otherwise pre-existing code to speed up project implementation. Bill Gates is quoted as having said:

“There is only one trick in software: Use a piece of code that has already been written.”

With software engineers, it is well known that productivity often varies not just by small factors, but by orders of magnitude. A handful of talented and motivated engineers with the right tools can outperform staffs of hundreds at large companies. I believe the increasing levels of abstraction and computational power of platforms such as Mathematica further exacerbate this trend and the resulting inequality in productivity.


Posted by visualign on July 11, 2012 in Education, Recreational

Tags: graph objects, graph visualization, interactive diagram, London, london tube map, Mathematica, productivity inequality, tube map

Interactive Tournament Map

27 Jun

I hadn’t followed the UEFA 2012 European football championship (called soccer in the US) and wanted to catch up on where things stand. Enter the interactive tournament map on the official UEFA website:

Row selection highlights games at that stadium

When you first enter the map it animates the timeline from left to right by drawing the colored lines for each team. The tabular layout shows time in daily columns from left to right and teams in rows by 4 tournament groups. Today’s day column is always highlighted. Here are some of the interactive elements:

Hovering over any of the colored lines highlights the corresponding team’s games along its timeline.
Clicking on a particular day column header highlights the games played on that date.
Clicking on the stadium symbol at the right end highlights the games played at that stadium.
Clicking on any circle brings up a dialog with details for that game.
Clicking on a row header on the left brings up a dialog with details for that team.
Selecting the tournament stage at the bottom (quarter-, semi-, final) moves to the corresponding date interval.

Detail for team Spain

Spain is the reigning football world champion, so they are clearly one of the favorites of this tournament and will actually play their semi-final against Portugal later this evening.

The final will be played in the Olympic Stadium in Kyiv, capital of participating host country Ukraine.

Detail with game schedule for stadium

From these details you can click on the games and get to yet more detail (videos, comments, etc.) for that particular game.

When I first looked at the map, the amount of information displayed had me a bit confused. The colors are often difficult to tell apart, for example the three orange-red tones in Group B. The black background feels attractive, although I could do without the pattern overlay, which doesn’t add information and only distracts. Lastly, I could do without the colorful advertisements around the map. At first glance I thought the stadium symbols on the right were also just colored ads.

The interactive nature made the map grow on me. It’s intuitive and the tabular layout is easy to navigate. You may not have a screen wide enough to see the map in its entirety, but I suppose you wouldn’t want to see time down the vertical axis, would you?

Postscript 7/1/12: Sure enough, Spain beat Italy 4:0 in today’s final and became the European football champion 2012.


Posted by visualign on June 27, 2012 in Recreational

Tags: football, interactive diagram, interactive map, tabular layout, timeline chart, UEFA 2012
