Publicly accessible data exist that describe the socio-economic characteristics of geographic locations. In Australia, where I live, the government, through the Australian Bureau of Statistics (ABS), regularly collects and publishes individual and household data on income, occupation, education, employment and housing at a local level. Examples of published data points include:

- Ratio of relatively high-income/low-income people
- Percentage of people classified as managers in their occupation
- Percentage of people with no formal education
- Percentage of unemployed people
- Percentage of properties with four or more bedrooms

Although these data points appear to focus on individual people, they reflect people’s access to material and social resources, and their ability to participate in society, in a particular geographic area. Ultimately, they indicate the socio-economic advantage and disadvantage of that area.

Is there a way to take these data points into account and derive a score that ranks geographic areas from most advantaged to least advantaged?

The goal of deriving a score can be formulated as a regression problem, where each data point or feature is used to predict a target variable (in this scenario, a numeric score). This requires the target variable to be available in some instances to train the predictive model.

However, since you don’t have a target variable to begin with, you need to approach this problem differently. For instance, under the assumption that each geographic area differs from a socio-economic standpoint, can we work out which data points help explain the most variation, and then derive a score from a numerical combination of those data points?

A technique called Principal Component Analysis (PCA) allows you to do just that. This article will show you how.

The ABS publishes data points indicating the socio-economic characteristics of geographic areas under the “Standardized Variable Ratio Data Cube” in the “Data Downloads” section of this web page [1]. These data points are published at the Statistical Area 1 (SA1) level, a digital boundary that divides Australia into areas with populations of approximately 200 to 800 people. This is a much more granular digital boundary than postcode (zip code) or state boundaries.

For the purposes of this article’s demonstration, we derive a socio-economic score based on 14 of the 44 published data points provided in Table 1 of the data source above (why this subset was chosen is explained later). These are:

- INC_LOW: Percentage of people living in households with a stated annual household equivalised income between AU$1 and AU$25,999
- INC_HIGH: Percentage of people with a stated annual household income above AU$91,000
- UNEMPLOYED_IER: Percentage of people aged 15 and over who are unemployed
- HIGHBED: Percentage of occupied private dwellings with four or more bedrooms
- HIGHMORTGAGE: Percentage of occupied private dwellings with mortgage payments of more than AU$2,800 per month
- LOWRENT: Percentage of occupied private dwellings paying rent of less than AU$250 per week
- OWNING: Percentage of occupied private dwellings owned without a mortgage
- MORTGAGE: Percentage of occupied private dwellings owned with a mortgage
- GROUP: Percentage of occupied private dwellings that are group-occupied (e.g. apartments or units)
- LONE: Percentage of occupied private dwellings occupied by one person
- OVERCROWD: Percentage of occupied private dwellings requiring one or more additional bedrooms (based on the Canadian National Occupancy Standard)
- NOCAR: Percentage of occupied private dwellings with no cars
- ONEPARENT: Percentage of single-parent households
- UNINCORP: Percentage of dwellings with at least one person who is a business owner

This section provides step-by-step Python code that uses PCA to derive a socio-economic score for each SA1 in Australia.

First, load the required Python packages and data.

## Load the required Python packages

### For dataframe operations

import numpy as np

import pandas as pd

### For PCA

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler

### For Visualization

import matplotlib.pyplot as plt

import seaborn as sns

### For Validation

from scipy.stats import pearsonr

## Load data

file1 = 'data/standardised_variables_seifa_2021.xlsx'

### Reading from Table 1, from row 5 onwards, for column A to AT

data1 = pd.read_excel(file1, sheet_name = 'Table 1', header = 5,

usecols = 'A:AT')

## Remove rows with missing value (113 out of 60k rows)

data1_dropna = data1.dropna()
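The narrative works with 14 features, while the read above spans columns A to AT. If the sheet holds all 44 published variables, one way to restrict the data to the 14-feature subset is to select columns by name. The sketch below is an assumption rather than the source's code: the `select_features` helper is hypothetical, the column names are assumed to match the ABS variable codes, and a small made-up frame stands in for the ABS file.

```python
import numpy as np
import pandas as pd

# Assumed ABS variable codes for the 14 features (verify against the sheet's headers).
IER_FEATURES = [
    "INC_LOW", "INC_HIGH", "UNEMPLOYED_IER", "HIGHBED", "HIGHMORTGAGE",
    "LOWRENT", "OWNING", "MORTGAGE", "GROUP", "LONE",
    "OVERCROWD", "NOCAR", "ONEPARENT", "UNINCORP",
]


def select_features(df, id_col="SA1_2021", features=IER_FEATURES):
    """Keep the identifier column plus the requested features, then drop incomplete rows."""
    missing = [c for c in features if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return df[[id_col] + list(features)].dropna()


# Toy stand-in for data1: 5 rows, the 14 features plus one unrelated column.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(5, 15)), columns=IER_FEATURES + ["OTHER"])
toy.insert(0, "SA1_2021", list(range(101, 106)))
toy.loc[2, "NOCAR"] = np.nan  # one row with a missing value

subset = select_features(toy)
print(subset.shape)
```

Selecting by name rather than by position fails loudly if a header is missing or spelled differently, which is easier to debug than a silently wrong column range.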

An important cleaning step before running PCA is to standardize each of the 14 data points (features) to a mean of 0 and a standard deviation of 1. This is mainly to ensure that the loadings PCA assigns to each feature (think of them as indicators of feature importance) are comparable across features. Otherwise, a feature that is not really important may receive more emphasis, or a higher loading, simply because of its scale, and vice versa.
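To see why scale matters, the sketch below runs PCA on two made-up correlated features, one of which is expressed on a scale 1,000 times larger. The data and variable names are illustrative assumptions, not taken from the ABS file.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 0.5 * a + rng.normal(size=500)
X = np.column_stack([a, 1000.0 * b])  # second feature on a much larger scale

# Without standardisation, PC1 is dominated by the large-scale feature.
raw_loadings = PCA(n_components=1).fit(X).components_[0]

# After standardisation, both features carry the same weight in magnitude.
X_std = StandardScaler().fit_transform(X)
std_loadings = PCA(n_components=1).fit(X_std).components_[0]

print(np.abs(raw_loadings).round(3))
print(np.abs(std_loadings).round(3))
```

On the unscaled data, almost all of the PC1 loading sits on the second feature purely because of its units. After standardization the two loadings are equal in magnitude (about 0.707 each), which is what you would expect for two correlated features of equal standing.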

Note that the ABS data source cited above already contains standardized features. That said, for a data source that is not standardized, the step looks like this:

## Standardise data for PCA

### Take all but the first column which is merely a location indicator

data_final = data1_dropna.iloc[:,1:]

### Perform standardisation of data

sc = StandardScaler()

sc.fit(data_final)

### Standardised data

data_final = sc.transform(data_final)

Using standardized data, you can perform PCA with just a few lines of code.

## Perform PCA

pca = PCA()

pca.fit_transform(data_final)

PCA aims to represent the underlying data in terms of principal components (PCs). The number of PCs provided in PCA is equal to the number of standardized features in the data. This example returns 14 PCs.

Each PC is a linear combination of all the standardized features, and the PCs differ from one another only in the loadings they assign to those features. For example, the image below shows the loadings assigned to the first and second PCs (PC1 and PC2), by feature.
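In scikit-learn, the loadings of a fitted PCA are stored in the `components_` attribute, with one row per PC and one column per feature. The sketch below inspects them on made-up standardized data (four toy features, an assumption for illustration only).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardised toy features

pca = PCA().fit(X)

# One row of loadings per PC, one column per feature.
loadings = pca.components_
print(loadings.shape)

# Each row has unit length, and different rows are orthogonal to each other.
norms = np.linalg.norm(loadings, axis=1)
gram = loadings @ loadings.T
```

Because every row has unit length, the loadings within a PC can be read as relative weights, and the orthogonality is what makes each PC capture variation the earlier PCs did not.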

Using 14 PCs, the code below visualizes how much variation each PC explains.

## Create visualization for variations explained by each PC

exp_var_pca = pca.explained_variance_ratio_

plt.bar(range(1, len(exp_var_pca) + 1), exp_var_pca, alpha = 0.7,

label = '% of Variation Explained',color = 'darkseagreen')

plt.ylabel('Explained Variation')

plt.xlabel('Principal Component')

plt.legend(loc = 'best')

plt.show()

As shown in the output visualization below, Principal Component 1 (PC1) accounts for the largest proportion of the variance in the original dataset, and each subsequent PC explains less variance. Specifically, PC1 explains circa 35% of the variation in the data.
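The same information can be read as numbers rather than from a chart. The following sketch prints the share of variance per PC and the running total, using made-up data in which one shared factor drives six toy features (an assumption chosen so that PC1 stands out).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Toy data: 6 features driven partly by one shared factor.
factor = rng.normal(size=(300, 1))
X = 0.8 * factor + rng.normal(size=(300, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA().fit(X)
ratio = pca.explained_variance_ratio_
cumulative = np.cumsum(ratio)

for i, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{i}: {r:.1%} of variance, {c:.1%} cumulative")
```

The ratios are sorted from largest to smallest and sum to 1 when all PCs are kept, so the cumulative column shows directly how much variation is given up by keeping only the first PC.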

In this article’s demo, PC1 is selected as the only PC to derive the socio-economic score for the following reasons:

- PC1 explains sufficiently large variation in the data on a relative basis.
- Choosing more PCs may explain (slightly) more variation, but makes the score harder to interpret in terms of the socio-economic advantage and disadvantage of a particular geographic area. For example, as shown in the image below, PC1 and PC2 may give conflicting accounts of how a particular feature (e.g. “INC_LOW”) influences the socio-economic variation of a geographic area.

## Show and compare loadings for PC1 and PC2

### Using df_plot dataframe per Image 1

sns.heatmap(df_plot, annot = False, fmt = ".1f", cmap = 'summer')

plt.show()
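The construction of `df_plot` is not shown in the source. One possible way to build such a frame from a fitted PCA is sketched below, on made-up data with placeholder feature names (both are assumptions). Note that `StandardScaler.transform` returns a NumPy array, so the column names need to be captured before standardizing.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

feature_names = ["F1", "F2", "F3", "F4"]  # placeholders for the real column names
rng = np.random.default_rng(3)
X = rng.normal(size=(100, len(feature_names)))
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA().fit(X)

# Rows = features, columns = loadings on PC1 and PC2.
df_plot = pd.DataFrame(
    pca.components_[:2].T,
    index=feature_names,
    columns=["PC1", "PC2"],
)
print(df_plot.round(2))
```

A frame shaped this way can be passed straight to `sns.heatmap`, with one row per feature and one column per PC.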

To obtain a score for each SA1, simply take the sum product of the standardized features and their PC1 loadings. This can be achieved by:

## Obtain raw score based on PC1

### Perform sum product of standardised feature and PC1 loading,

### then reverse the sign of the sum product to make output more interpretable

pca_data_transformed = -1.0 * pca.fit_transform(data_final)

### Convert to Pandas dataframe, and join raw score with SA1 column

pca1 = pd.DataFrame(pca_data_transformed[:,0], columns = ['Score_Raw'])

score_SA1 = pd.concat([data1_dropna['SA1_2021'].reset_index(drop = True), pca1]

, axis = 1)

### Inspect the raw score

score_SA1.head()

The higher the score, the more advantaged an SA1 is in terms of its access to socio-economic resources.
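The sum product can also be written out explicitly, which makes the role of the sign clearer. The sketch below uses three made-up standardized features driven by a hypothetical "advantage" factor. Instead of multiplying by a fixed -1.0 as the article does, it orients the score by checking its correlation with a feature whose direction is known; this is an alternative shown for illustration, not the article's method.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Toy features: a hypothetical "advantage" factor raises INC_HIGH and lowers INC_LOW.
factor = rng.normal(size=300)
df = pd.DataFrame({
    "INC_HIGH": factor + 0.5 * rng.normal(size=300),
    "INC_LOW": -factor + 0.5 * rng.normal(size=300),
    "NOCAR": -0.5 * factor + rng.normal(size=300),
})
X = ((df - df.mean()) / df.std(ddof=0)).to_numpy()

pca = PCA().fit(X)
pc1_loadings = pca.components_[0]

# Sum product of standardised features and PC1 loadings.
raw_score = X @ pc1_loadings

# The sign of a PC is arbitrary, so orient the score using a feature whose
# direction is known: here, higher INC_HIGH should mean a higher score.
if np.corrcoef(raw_score, df["INC_HIGH"])[0, 1] < 0:
    raw_score = -raw_score
```

Up to sign, `raw_score` matches the first column returned by `pca.transform(X)`, so either route gives the same ranking of areas.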

How do we know that the score we derived above was even remotely correct?

For context, the ABS actually publishes a socio-economic score called the Index of Economic Resources (IER), which is defined on the ABS website as:

*“The Index of Economic Resources (IER) focuses on the financial aspects of relative socio-economic advantage and disadvantage by summarizing variables related to income and housing. Education and occupation variables are excluded because they are not direct measures of economic resources. We also exclude assets such as savings and stocks, which are relevant but cannot be included because they are not collected in the census.”*

Without disclosing the detailed steps, the ABS stated in its technical paper that the IER was derived using the same features (14) and methodology (PCA, PC1 only) as performed above. This means that if we derived the scores correctly, they should be comparable to the IER scores published here (“Statistical Area Level 1, Indexes, SEIFA 2021.xlsx”, Table 4).

Since the published scores are standardized to a mean of 1,000 and a standard deviation of 100, we begin our validation by standardizing the raw scores to the same.

## Standardise raw scores

score_SA1['IER_recreated'] = (score_SA1['Score_Raw'] / score_SA1['Score_Raw'].std()) * 100 + 1000

For comparison, we read in the published IER scores by SA1.

## Read in ABS published IER scores

## similarly to how we read in the standardised portion of the features

file2 = 'data/Statistical Area Level 1, Indexes, SEIFA 2021.xlsx'

data2 = pd.read_excel(file2, sheet_name = 'Table 4', header = 5,

usecols = 'A:C')

data2.rename(columns = {'2021 Statistical Area Level 1 (SA1)': 'SA1_2021', 'Score': 'IER_2021'}, inplace = True)

col_select = ['SA1_2021', 'IER_2021']

data2 = data2[col_select]

ABS_IER_dropna = data2.dropna().reset_index(drop = True)

**Validation 1: PC1 loadings**

As shown in the image below, comparing the PC1 loadings derived above with the PC1 loadings published by the ABS shows that they differ by a constant of -45%. As this is merely a scaling difference, it does not affect the derived scores once they are standardized (to a mean of 1,000 and a standard deviation of 100).

(The values in the “Derived (A)” column should match the PC1 loadings shown in Image 1.)
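The claim that a constant scaling of the loadings leaves the standardized scores unchanged can be checked directly. In the sketch below, the features, the loadings and the 0.55 factor are all made up for illustration, and `to_index` is a hypothetical helper that rescales to a mean of 1,000 and a standard deviation of 100.

```python
import numpy as np


def to_index(raw, mean=1000.0, sd=100.0):
    """Rescale a raw score to the given mean and standard deviation."""
    raw = np.asarray(raw, dtype=float)
    return (raw - raw.mean()) / raw.std() * sd + mean


rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3))           # toy standardised features
loadings = np.array([0.6, -0.5, 0.4])   # made-up PC1 loadings

score_a = to_index(X @ loadings)
score_b = to_index(X @ (0.55 * loadings))  # same loadings times a positive constant

print(np.allclose(score_a, score_b))
```

The constant cancels when the raw score is divided by its own standard deviation. This holds for any positive constant; a negative one would reverse the ranking instead.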

**Validation 2: Distribution of scores**

The code below creates a histogram of both scores. Their shapes look almost identical.

## Check distribution of scores

score_SA1.hist(column = 'IER_recreated', bins = 100, color = 'darkseagreen')

plt.title('Distribution of recreated IER scores')

ABS_IER_dropna.hist(column = 'IER_2021', bins = 100, color = 'lightskyblue')

plt.title('Distribution of ABS IER scores')

plt.show()

**Validation 3: IER score by SA1**

As the ultimate validation, let’s compare the IER scores by SA1.

## Join the two scores by SA1 for comparison

IER_join = pd.merge(ABS_IER_dropna, score_SA1, how = 'left', on = 'SA1_2021')

## Plot scores on x-y axis.

## If scores are identical, it should show a straight line.

plt.scatter('IER_recreated', 'IER_2021', data = IER_join, color = 'darkseagreen')

plt.title('Comparison of recreated and ABS IER scores')

plt.xlabel('Recreated IER score')

plt.ylabel('ABS IER score')

plt.show()

A diagonal straight line, as shown in the output image below, indicates that the two scores are nearly identical.

In addition, the two scores have a correlation close to 1, which can be checked with the pearsonr function imported earlier.
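The correlation code itself is not included in the source. A minimal sketch of such a check, assuming a joined frame with the `IER_2021` and `IER_recreated` columns used above, could look like this; the toy data stands in for the real scores.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Toy stand-in for IER_join with the two score columns.
rng = np.random.default_rng(6)
published = rng.normal(1000, 100, size=200)
IER_join = pd.DataFrame({
    "IER_2021": published,
    "IER_recreated": published + rng.normal(0, 1, size=200),
})
IER_join.loc[0, "IER_recreated"] = np.nan  # e.g. an SA1 with no match in the left join

# pearsonr does not skip missing values, so drop unmatched rows first.
paired = IER_join[["IER_2021", "IER_recreated"]].dropna()
r, p_value = pearsonr(paired["IER_2021"], paired["IER_recreated"])
print(f"Pearson correlation: {r:.4f}")
```

Dropping incomplete pairs matters because a left join can leave missing values in `IER_recreated` for any SA1 that was removed earlier.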

The demonstration in this article effectively replicates how the ABS calibrates the IER, one of the four socio-economic indexes it publishes, which can be used to rank the socio-economic status of a geographic area.

Taking a step back, what we have essentially achieved is a reduction in the dimensionality of the data from 14 to 1, at the cost of losing some of the information the data conveys.

Dimensionality reduction techniques such as PCA also commonly help reduce high-dimensional spaces, such as text embeddings, to two or three (visualizable) principal components.
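As a rough illustration of that use case, the sketch below projects made-up 384-dimensional vectors onto two principal components. The random vectors are a stand-in for real text embeddings, and the dimension is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
embeddings = rng.normal(size=(50, 384))  # made-up stand-in for 50 text embeddings

# Keep only the first two principal components for plotting.
coords_2d = PCA(n_components=2).fit_transform(embeddings)
print(coords_2d.shape)
```

Each row of `coords_2d` can then be drawn as a point on a scatter plot.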
