Exploring GE14 with data science (Part 1): Unveiling the most influential ethnic & age groups

Can we predict voting behavior based on the attributes of voters? Propensity modeling is a statistical method used in the field of advertising & marketing to predict consumer purchase behavior.

For example, if I wanted to predict consumer A’s purchase behavior (whether or not they buy a product), I could find an existing consumer (consumer B) who 1) is a lot like consumer A (same ethnicity, same age, lives in the same area, etc.), and 2) whose purchase behavior is known to me. I could then predict consumer A’s behavior based on consumer B’s behavior.

This method becomes more accurate as I find more consumers who resemble consumer A and whose known purchase behavior agrees with consumer B’s (it also helps to have data on consumers who are the opposite of consumer A). At the end of the modeling phase, a marketer can determine which characteristic (i.e. feature) is associated with the highest likelihood of purchasing a product. By ranking all the features by importance (e.g. age followed by sex, etc.), a marketer can focus their efforts on the areas most likely to produce the desired results.
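To make the idea concrete, here is a minimal sketch of predicting one consumer's behavior from the most similar known consumers (a nearest-neighbour vote). The consumers, their two features (age and income) and the function name predict_purchase are all made up for illustration; this is not the model used later in this post.

```python
from collections import Counter
from math import dist

# Hypothetical consumers: (age, income in thousands) -> did they buy?
known_consumers = [
    ((25, 30), True),
    ((27, 32), True),
    ((45, 80), False),
    ((50, 75), False),
    ((30, 35), True),
]

def predict_purchase(features, known, k=3):
    """Predict by majority vote among the k most similar known consumers."""
    nearest = sorted(known, key=lambda item: dist(item[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

consumer_a = (26, 31)
prediction = predict_purchase(consumer_a, known_consumers)
print(prediction)
```

The more known consumers there are near consumer A, the less the vote depends on any single look-alike, which is the intuition behind the accuracy gain described above.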

In this post, I will apply this concept to the voting constituencies of Malaysia. Specifically, I will look into how similar and dissimilar the voting constituencies are in terms of registered voters’ ethnicity and age. With this model, I hope to answer the following questions:

  1. Does ethnicity matter more than age? Or does age matter more than ethnicity?

  2. Which age group matters the most in deciding a constituency's election outcome?

  3. Which ethnic group matters the most in deciding a constituency's election outcome?   

Modeling approach

The diagram below summarizes how the model will answer these questions. A lot of the heavyweight computation is done via machine learning techniques in Python. I used a dataset from GE14 containing the age and ethnicity breakdown of each parliamentary seat. The dataset can be downloaded here.


I split the dataset into two subsets. 90% of the data was used to train the model to identify which voter features were useful for predicting the election outcome in a particular constituency. To test the model’s accuracy, I applied it to the remaining 10% of the data: I fed the model the features of voters in those constituencies but hid the voting results, which the model had to predict. The model correctly predicted the outcome 91% of the time. If you are interested in knowing more, view the code here.
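The train-and-test procedure can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic, the helper names are my own, and the "model" is a simple nearest-neighbour stand-in rather than the model behind the 91% figure.

```python
import random
from math import dist

def train_test_split(rows, test_fraction=0.1, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict(train, features):
    """Stand-in model: copy the label of the most similar training row."""
    _, label = min(train, key=lambda row: dist(row[0], features))
    return label

def accuracy(train, test):
    """Share of held-out rows whose hidden label is predicted correctly."""
    hits = sum(predict(train, features) == label for features, label in test)
    return hits / len(test)

# Synthetic stand-in data: (share_1, share_2) -> hypothetical winner label.
rng = random.Random(42)
rows = []
for _ in range(200):
    x = rng.random()
    rows.append(((x, 1 - x), "A" if x > 0.5 else "B"))

train, test = train_test_split(rows)
score = accuracy(train, test)
print(len(train), len(test), score)
```

The key point is that the labels of the held-out 10% are never shown to the model during training, so the accuracy measures how well it generalizes to constituencies it has not seen.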

Question 1: Does ethnicity or age matter more?

The following figure shows the relative importance of all the features fed into the model. In particular, I fed in the share of each age group and each ethnic group as a feature for every parliamentary seat. The figure shows how much ‘weight’ each feature has in the model (the ‘weights’ of all features add up to 100).
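For readers wondering how weights can add up to 100, the sketch below rescales a set of raw importance scores into percentages and ranks them. The feature names and raw scores are hypothetical placeholders, not the values produced by the model in this post.

```python
def to_percent_weights(raw_scores):
    """Rescale non-negative importance scores so that they sum to 100."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive number")
    return {name: 100 * score / total for name, score in raw_scores.items()}

# Hypothetical raw scores, for illustration only.
raw = {"feature_a": 6, "feature_b": 3, "feature_c": 1}
weights = to_percent_weights(raw)

# Rank the features from most to least important.
ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
for name, weight in ranked:
    print(name, weight)
```

Because the weights are relative, a feature's weight only says how it compares with the other features in the same model, not how strong its effect is in absolute terms.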

[Figure: age and ethnicity features ranked together by weight]

Based on the figure, the share of Chinese voters in a constituency has the greatest impact (a 17% weight) on the election outcome for a parliamentary seat. The model clearly shows that ethnic composition (the proportions of Chinese, Indian and Malay voters) matters more than age when it comes to determining a constituency’s election outcome.

Question 2: Which age group matters the most?


Next, let’s focus on the age groups alone. The following figure shows the relative weight of each age group’s share of voters. The share of voters in the 51-60 age group, followed by the share of voters in the 21-30 age group, matters the most in determining the outcome for a constituency.

This is an interesting finding, because the two most important age groups are highly dissimilar (the youngest age group vs. one of the older age groups).

Keep in mind: age group composition is a zero-sum game, i.e. an increase in the share of one age group necessarily means a decrease in the share of at least one other age group. Thus, the next logical question to ask is: does an increase in the share of voters aged 51-60 correlate with a decrease in the share of another age group? We will investigate this question in the next post of this series.
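A small worked example shows why composition is zero-sum. The voter counts below are hypothetical, and the two middle age brackets are assumed for illustration; only the count for the 51-60 group is changed.

```python
def shares(counts):
    """Convert voter counts per age group into percentage shares."""
    total = sum(counts.values())
    return {group: 100 * n / total for group, n in counts.items()}

# Hypothetical voter counts for a single constituency.
before = {"21-30": 10000, "31-40": 12000, "41-50": 9000, "51-60": 9000}
after = {**before, "51-60": 19000}  # only the 51-60 count grows

share_before = shares(before)
share_after = shares(after)
for group in before:
    print(group, share_before[group], share_after[group])
```

No voters left the other three groups, yet each of their shares falls, because the shares must always add up to 100.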

Question 3: Which ethnic group matters the most?

The figure above showing the weights of all features suggests that the composition of Chinese voters has the largest relationship with the election outcome for a given constituency, followed by the proportions of voters who are Indian, Malay, and then of various Bumiputera groups.

Caution should be taken when interpreting this result. All constituencies have highly similar proportions of Malay (or Bumiputera Sabah/Sarawak for East Malaysia) voters, and the only distinguishing difference between each constituency tends to be the ethnic composition of the smaller ethnic groups.

Thus, the model will tend to rank the smaller ethnic groups higher than the larger ethnic groups, because the varying shares of these smaller groups often drive any difference in outcomes. In statistical language, the explained variance seems to come from the minority ethnic groups.
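One way a reader could check this reasoning against the real dataset is to measure how much each feature actually varies across constituencies. The sketch below does this with the standard deviation; the table, the group names and the numbers are made up purely for illustration.

```python
from statistics import pstdev

def rank_by_spread(table):
    """Sort feature names by how much their values vary across rows."""
    spread = {name: pstdev(values) for name, values in table.items()}
    return sorted(spread.items(), key=lambda item: item[1], reverse=True)

# Made-up percentage shares across five constituencies (illustration only).
toy_table = {
    "group_x": [62, 63, 61, 62, 62],
    "group_y": [5, 30, 12, 25, 8],
}

spread_ranking = rank_by_spread(toy_table)
for name, value in spread_ranking:
    print(name, round(value, 2))
```

A feature that is nearly constant across constituencies gives a model little to separate one outcome from another, whatever its real-world importance.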

The de-facto ‘most important’ ethnic composition appears to be Chinese simply because it is the largest minority ethnic group in Malaysia.

While the ranking of ethnic groups by importance does not paint a complete picture, ethnicity did play a sizable role in determining the election outcome in each constituency. When I ran the model using age composition alone, its accuracy was 82%, but with ethnic composition included, its accuracy jumped to 91%.
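This kind of comparison (scoring the same model with and without a group of features) can be sketched as below. Everything here is an assumption for illustration: the rows are toy data built so that the second column carries the signal, and the model is a nearest-neighbour stand-in scored with leave-one-out testing, not the model or the split used for the 82% and 91% figures.

```python
from math import dist

def loo_accuracy(rows, columns):
    """Leave-one-out accuracy of a 1-nearest-neighbour stand-in model
    that only sees the selected feature columns."""
    def pick(features):
        return [features[c] for c in columns]

    hits = 0
    for i, (features, label) in enumerate(rows):
        others = rows[:i] + rows[i + 1:]
        _, guess = min(others, key=lambda r: dist(pick(r[0]), pick(features)))
        hits += guess == label
    return hits / len(rows)

# Toy rows: (age-group share, ethnic-group share) -> hypothetical winner.
toy_rows = [
    ((0.30, 0.10), "A"), ((0.31, 0.80), "B"),
    ((0.50, 0.15), "A"), ((0.52, 0.85), "B"),
    ((0.70, 0.20), "A"), ((0.71, 0.90), "B"),
]

age_only = loo_accuracy(toy_rows, columns=[0])
age_and_ethnicity = loo_accuracy(toy_rows, columns=[0, 1])
print(age_only, age_and_ethnicity)
```

If adding a group of features raises the held-out accuracy, those features carry information about the outcome that the other features do not.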

While we can’t fully tease apart the relative importance of the different ethnic groups, we can conclude that ethnicity as a whole is related to election outcomes.


This analysis is based on registered voters rather than who actually showed up on election day. While I have no reason to doubt the data is reliable, we are unable to validate the age and ethnicities of voters who actually showed up and voted. Voting in an election is anonymous; there are no demographic markers attached to each vote.

Data source: Bersih 2.0, DAP Malaysia, Tindak Malaysia, and Sinar Project

Cameron Highlands: Under the microscope
