RSS

Tag Archives: implementation

Implementation of TreeMap

Implementation of TreeMap

After posting on TreeMaps twice before (TreeMap of the Market and original post here) I wanted to better understand how they can be implemented.

In his book “Visualize This” – which we reviewed here – author Nathan Yau has a short chapter on TreeMaps, which he also published on his FlowingData Blog here. He is working with the statistical programming language R and uses a library which implements TreeMaps. While this allows for very easy creation of a TreeMap with just a few lines of code, from the perspective of how the TreeMap is constructed this is still a black box.

I searched for existing implementations of TreeMaps in Mathematica (which I am using for many visualization projects). Surprisingly I didn’t find any implementations, despite the 20 year history of both the Mathematica platform and the TreeMap concept. So I decided to learn by implementing a TreeMap algorithm myself.

Let’s recap: A TreeMap turns a tree of numeric values into a planar, space-filling map. A rectangular area is subdivided into smaller rectangles with sizes in relation to the values of the tree nodes. The color can be mapped based on either that same value or some other corresponding value.

One algorithm for TreeMaps is called slice-and-dice. It starts at the top-level and works recursively down to the leaf level of the tree. Suppose you have N values at any given level of the tree and a corresponding rectangle.
a) Sort the values in descending order.
b) Select the first k values (0<k<N) which sum to at least the split-ratio of the values total.
c) Split the rectangle into two parts according to split-ratio along its longer side (to avoid very narrow shapes).
d) Allocate the first k values to the split-off part, the remaining N-k values to the rest of the rectangle.
e) Repeat as long as you have sublists with more than one value (N>1) at current level.
f) For each node at current level, map its sub-tree onto the corresponding rectangle (until you reach leaf level).

As an example, consider the list of values {6,5,4,3,2,1}. Their sum is 21. If we have a split-ratio parameter of say 0.4, then we split the values into {6,5} and {4,3,2,1} since the ratio (6+5)/21 = 0.53 > 0.4, then continue with {6,5} in the first portion of the rectangle and with {4,3,2,1} in the other portion.

Let's look at the results of such an algorithm. Here I'm using a two-level tree with a branching factor of 6 and random values between 0 (dark) and 100 (bright). The animation is iterating through various split-ratios from 0.1 to 0.9:

Notice how the layout changes as a result of the split-ratio parameter. If it’s near 0 or 1, then we tend to get thinner stripes; when it’s closer to 0.5 we get more square shaped containers (i.e. lower aspect ratios).

The recursive algorithm becomes apparent when we use a tree with two levels. You can still recognize the containers from level 1 which are then sub-divided at level 2:

One of the fundamental tenets of this Blog is that interactive visualizations lead to better understanding of structure in the data or of the dynamic properties of a model. You can interact with this algorithm in the TreeMap model in Computable Document Format (CDF). Simply click on the graphic above and you get redirected to a site where you can interact with the model (requires one-time loading of the free CDF Browser Plug-In). You can change the shape of the outer rectangle, adjust the tree level and split-ratio and pick different color-schemes. The values are shown as Tooltips when you hover over the corresponding rectangle. You also have access to the Mathematica source code if you want to modify it further. Here is a TreeMap with three levels:

Of course a more complete implementation would allow to vary the color-controlling parameter, to filter the values and to re-arrange the dimensions as different levels of the tree. Perhaps someone can start with this Mathematica code and take it to the next level. The previous TreeMap post points to several tools and galleries with interactive applications so you can experiment with that.

Lastly, I wanted to point out a good article by the creator of TreeMaps, Ben Shneiderman. In this 2006 paper called “Discovering Business intelligence Using Treemap Visualizations” he cites various BI applications of TreeMaps. Several studies have shown that TreeMaps allow users to recognize certain patterns in the data (like best and worst performing sales reps or regions) faster than with other more traditional chart techniques. No wonder that TreeMaps are finding their way into more and more tools and Dashboard applications.

 
4 Comments

Posted by on November 9, 2011 in Industrial, Scientific

 

Tags: , , ,

Number of Neighbors for World Countries

Number of Neighbors for World Countries

One important geographical aspect in economy is whether a country is land-locked. Another aspect is the number of neighbors a given country shares a border with. If we sort all 239 world countries, 75 (31%, almost one third) of them are island countries such as Madagascar or Australia where this number is zero. On the opposite end are countries with the most border connections. Here are the top 6 countries in descending order: China (16), Russia (14), Brazil (10), Sudan, Germany, and Democratic Republic of Congo (9 each). All other countries have 8 or less neighbors. Here is a visual breakdown:

The histogram shows the high frequency of island states; the range from 1 to 5 neighbors is fairly common, with a steep drop off in the frequency of 6 or more neighbors. Here is a world map with the same color-code:

WorldMap color-coded by number of neighboring countries

Large countries tend to have more neighbors (Russia (14), China (16), Brazil (10)), but there are obvious exceptions to this tendency (Canada (1), United States (2)). The number of neighbors depends not just on the size of the country itself, but on it’s neighbors’ sizes as well; for example, a small country such as Austria (land area size world rank: 116th) has a rather high number of 8 neighbors because many of them in turn are relatively small (Switzerland, Liechtenstein, Slovenia, etc.).

The average number of neighbors is about 2.7 and there are 323 such border relationships. These can be visualized as graphs with countries as vertices and borders as edges. (Note that to simplify the graphs I excluded all 75 islands = disconnected vertices except Australia.) There are two main partitions of this graph following the land-border geography: One with Europe, Asia and Africa and one with the Americas.

Border-Connected Countries in Europe, Asia, Africa

With the graph layout changed from “Spring Embedding” to “Spring Electrical Embedding” one obtains this interesting variation of the same graph which looks like a sword fish:

The "EurAsiAfrica Sword-fish"

The other partition of the Americas can be visualized in a circular embedding layout:

Europe, Asia, Africa (left) and Americas (right)

It is also interesting to look at the numbers for lengths of pairwise borders between two countries:

  • Number: 323 border-pairs
  • Minimum: 0.34 [km]
  • Maximum: 8893 [km]
  • Mean: 789.6 [km]
  • Total: 255048 [km]
  • Most pairwise borders are between 100 – 1000 km long, but they can as short as 1/3 km (China – Macau) or almost 9000 km (Canada – United States).

    When we look at the entire border length for each country, we see familiar names on top of the ranking:
    China: 22147 [km], Russia: 20293 [km], Brazil: 16857 [km], India: 14103 [km], Kazakhstan: 12185 [km], United States: 12034 [km]. It seems likely that the first four, the so called “BRIC” countries, owe part of their economic strength to their geography: Size, length of borders and number of neighbors influence the number of local trading partners and routes to them. There are many more correlations one can analyze such as between border length / number of neighbors and GDP / length of road network etc. One thing seems likely when it comes to the economy of world countries: Size matters, and so does Geography!

    Epilog: This analysis was all performed using Wolfram’s Mathematica 8. The built-in curated CountryData provides access to more than 200 properties of the world countries, including things like Population, Area, GDP, etc. Some cleaning of the borders lengths data was required to deal with different spellings of the same country. (If you’re interested in the data or source-code, please contact me via email.) List manipulation and mathematical operations such as summation are very easy to do in the functional programming paradigm of Mathematica. Graphs are first-order data structures with numerous vertex and edge operators. Charting is also fairly powerful with BarCharts, ListPlots and more advanced graph charting options. Which other software provides all this flexibility in one integrated package?

     
    6 Comments

    Posted by on October 6, 2011 in Recreational, Socioeconomic

     

    Tags: , , , ,

    Visualizing Inequality

    Visualizing Inequality

    Measuring and visualizing inequality is often the starting point for further analysis of underlying causes. Only with such understanding can one systematically influence the degree of inequality or take advantage of it. In previous posts on this Blog we have already looked at some approaches, such as the Lorenz-Curve and Gini-Index or the Whale-Curve for Customer Profitability Analysis. Here I want to provide another visual method and look at various examples.

    Inequality is very common in economics. Competitors have different share of and capitalization in a market. Customers have different profitability for a company. Employees have different incomes across the industry. Countries have different GDP in the world economy. Households have different income and wealth in a population.

    The Gini Index is an aggregate measure for the degree of inequality of any given distribution. It ranges from 0.0 or perfect equality, i.e. every element contributes the same amount to 1.0 or the most extreme inequality, i.e. one element contributes everything and all other elements contribute nothing. (The previous post referenced above contains links to articles for the definition and calculation of the Gini index.)

    There are several ways to visualize inequality, including the Lorenz-Curve. Here we look at one form of pie-charts for some discrete distributions. As a first example, consider the distribution of market capitalization among the Top-20 technology companies (Source: Nasdaq, Date: 9/17/11):

    Market Cap of Top 20 Technology Companies on the Nasdaq

    Apple, the largest company by far, is bigger than the bottom 10 combined. The first four (20%) companies – Apple, Microsoft, IBM, Google – are almost half of the entire size and thus almost the size of the other 16 (80%) combined. The pie-chart gives an intuitive sense of the inequality. The Gini Index gives a precise mathematical measure; for this discrete distribution it is 0.47

    Another example is a look at the top PC shipments in the U.S. (Source: IDC, Date: Q2’11)

    U.S. PC Shipments in Q2'11

    There is a similar degree of inequality (Gini = 0.46). In fact, this degree of inequality (Gini index ~ 0.5) is not unusual for such distributions in mature industries with many established players. However, consider the tablet market, which is dominated by Apple’s iOS (Source: Strategy Analytics, Date: Q2’11)

    Worldwide Tablet OS shipments in Q2'11

    Apple’s iOS captures 61%, Android 30%, and the other 3 categories combined are under 10%. This is a much stronger degree of inequality with Gini = 0.74

    To pick an example from a different industry, here are the top 18 car brands sold in the U.S. (Source: Market Data Center at WSJ.COM; Date: Aug-2011):

    U.S. Total Car Sales in Aug-11

    When comparing different the Gini index values for these kinds of distributions it is important to realize the impact of the number of elements. More elements in the distribution (say Top-50 instead of Top-20) usually increases the Gini index. This is due to the impact of additional very small players. Suppose for example, instead of the Top-18 you left out the two companies with the smallest sales, namely Saab and Subaru, and plotted only the Top-16. Their combined sales are less than 0.4% of the total, so one wouldn’t expect to miss much. Yet you get a Gini index of 0.49 instead of 0.54. So with discrete distributions and a relatively small number elements one risks comparing apples to oranges when there are different number of elements.

    Consider as a last example a comparison of the above with two other distributions from my own personal experience – the list of base salaries of 30 employees reporting to me at one of my previous companies as well as the list of contributions to a recent personal charity fundraising campaign.

    Gini Index Comparison

    What’s interesting is that the salary distribution has by far the lowest amount of inequality. You wouldn’t believe that from the feelings of employees where many believe they are not getting their fair share and others are getting so much more… In fact, the skills and value contributions to the employer are probably far more unequal than the salaries! (Check out Paul Graham’s essays on “Great Hackers” for more on this topic!)
    And when it comes to donations, the amount people are willing to give to charitable causes differs immensely. We have seen this already in a previous post on Gini-Index with recent U.S. political donations showing an astounding inequality of Gini index = 0.89. I challenge you to find a distribution across so many elements (thousands) which has greater inequality. If you find one, please comment on this Blog or email me as I’d like to know about it.

     
    8 Comments

    Posted by on September 22, 2011 in Industrial, Scientific, Socioeconomic

     

    Tags: , ,

    Business Benefits of Software Release in Multiple Increments

    One of the main principles of Lean-Agile Software Development is to deliver fast and in small increments. Breaking a large system into multiple increments and delivering some of them early has many benefits to both the business and the customer such as: Earn returns for delivered value sooner. Obtain customer feedback sooner to clarify future features. Capture more market share due to early mover advantage. Reduce risk of obsolescence due to late delivery.

    While these benefits are somewhat intuitive, how can we better illustrate and quantify such benefits? From a financial, cash-flow perspective, there are three main business benefits of switching from one single release to multiple release increments:

    • Sooner Break-Even
    • Smaller Investment
    • Higher Net Return

    A visual model helps to reinforce and quantify them. In the following 4min demonstration I am using a simplified model to illustrate the above benefits:

    The above demonstration is based on the book “Lean-Agile Software Development – Achieving Enterprise Agility” by Alan Shalloway, Guy Beaver and James Trott. (See chapter 2: The Business Case for Agility) I believe that having such dynamic visualizations can help explain these benefits and thus make a more compelling business case for using Lean-Agile Software Development.

    Business Benefits of Two-Increment Release

    Click on the above graphic to interact with the dynamic model using the Wolfram CDF Player.

     
    1 Comment

    Posted by on August 20, 2011 in Industrial

     

    Tags: , , ,

    Customer Profitability

    Inequality is often at the root of structure and contrasts. Exposing inequality can often lead to insight. For example, take the well-known Pareto principle, which states that roughly 80% of the effects come from 20% of the causes (hence also referred to as the 80-20 rule).

    From the above Wikipedia page on the Pareto principle, chapter on business:

    The distribution shows up in several different aspects relevant to entrepreneurs and business managers. For example:
    80% of your profits come from 20% of your customers
    80% of your complaints come from 20% of your customers
    80% of your profits come from 20% of the time you spend
    80% of your sales come from 20% of your products
    80% of your sales are made by 20% of your sales staff
    Therefore, many businesses have an easy access to dramatic improvements in profitability by focusing on the most effective areas and eliminating, ignoring, automating, delegating or re-training the rest, as appropriate.

    Visualization can be a powerful instrument for such analysis. For customer profitability, a graphical representation of this inequality is often used as a starting point for analysis. A commonly used visualization is the so-called Whale-Curve. I created a short, 4 min video recording of a dynamic Whale-Curve Demonstration:

    In case you’re curious, the above demonstration uses an underlying model I created in Mathematica. You can dynamically interact with it yourself using the free CDF (Computable Document Format) Player:

    I have provided it as a contribution to the Wolfram Demonstration project, so you can download it, and even look at the source code if you are a Mathematica user.

    If you are interested in applying customer profitability analysis to your business, you may want to consider the company RapidBusinessModeling, which has an elaborate analysis approach starting with such Whale-Curves.

    The underlying notion of Inequality is a fundamental concept. We will look at it in other contexts in a later post.

     
    2 Comments

    Posted by on August 18, 2011 in Financial, Industrial, Scientific

     

    Tags: , , , ,

     
    %d bloggers like this: