A county like mine

Factors that underlie the similarity of counties in the United States

Plotly
Python
Visualization
Politics
Statistics

Principal Component Analysis (PCA) is a statistical technique used to simplify complex data by reducing the number of variables while retaining most of the original information. In other words, it helps to identify the most important patterns or relationships in large datasets by transforming them into a set of linearly uncorrelated variables, known as principal components.

PCA works by identifying the underlying structure of the data and extracting the directions of maximum variance, which are the principal components. Each principal component is a linear combination of the original variables, and they are orthogonal to each other, meaning that they are uncorrelated. The first principal component captures the most significant variation in the data, and subsequent components capture decreasing amounts of variation. By selecting the appropriate number of principal components, we can reduce the dimensionality of the dataset while retaining the essential information.
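
These two properties are easy to verify directly. Below is a minimal sketch on synthetic data (the variables rng, X, and scores here are illustrative, not part of the project code): the PC scores come out mutually uncorrelated, and the explained variance ratios are non-increasing.

Code
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # five correlated features

pca = PCA()
scores = pca.fit_transform(X)

# Orthogonality: the covariance matrix of the scores is (numerically) diagonal.
cov = np.cov(scores, rowvar=False)
assert np.allclose(cov, np.diag(np.diag(cov)), atol=1e-8)

# Each successive component explains no more variance than the one before it.
assert np.all(np.diff(pca.explained_variance_ratio_) <= 0)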

In this data science project, I used Principal Component Analysis (PCA) to visualize and explore the similarity of counties in the United States based on a set of demographic metrics spanning population, poverty, education, unemployment, crime, race, politics, job industry, health, and social capital. The variables in each group are listed in the corresponding section below.

The primary utility of using PCA in this context is to reduce the dimensionality of the dataset while retaining as much information as possible. By doing so, we can efficiently examine patterns and relationships among counties in a lower-dimensional space.

For each of these groups of variables, I applied PCA to transform the original high-dimensional data into a lower-dimensional space that still captures most of the variation present in the original dataset.

By plotting the first two principal components on a 2D scatter plot (or the first three in 3D), we can visualize the relationships among counties based on their transformed coordinates. This lets us identify clusters of similar counties, outliers, and other interesting patterns. In the code, I also highlighted a specific county (passed in as highlighted_area), making it easy to locate relative to the other counties in the reduced-dimensional space.

Additionally, I created a bar chart displaying the proportion of variance explained by each principal component, which helps to determine the optimal number of components to retain for further analysis or modeling.
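
A common way to act on that chart is to keep the smallest number of leading components whose cumulative explained variance clears a threshold. Here is a minimal sketch, assuming a fitted sklearn PCA object like the ones used below; the 0.90 threshold is an arbitrary choice for illustration, not a value used in this project.

Code
import numpy as np

def n_components_for(pca, threshold=0.90):
    # Smallest k such that the first k components together explain
    # at least `threshold` of the total variance.
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)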

The use of PCA in this project allows us to effectively summarize the complex, high-dimensional relationships between counties, making it easier to identify patterns and interpret the data.
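
Because each county ends up as a point in this reduced space, finding "a county like mine" amounts to a nearest-neighbor lookup on the transformed coordinates. Here is a minimal sketch, assuming a frame shaped like the third value that make_pca below returns (crime_df, race_df, and so on): a fips_code column followed by integer-named PC columns. The choice of the first two components and of five neighbors is illustrative.

Code
import numpy as np

def most_similar(pcs_df, fips, n=5, k=2):
    # Euclidean distance from every county to the target county,
    # measured over the first k principal components.
    target = pcs_df.loc[pcs_df['fips_code'] == fips, list(range(k))].to_numpy()
    dists = np.linalg.norm(pcs_df[list(range(k))].to_numpy() - target, axis=1)
    # The target county itself comes back first with distance 0.
    return pcs_df.assign(dist=dists).sort_values('dist').head(n + 1)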

Code
# Import
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import plotly.express as px
import plotly.graph_objects as go

Population

Code
# Population Estimates
pop = pd.read_excel("PopulationEstimates.xlsx", skiprows=4)

pop.columns = ['fips_code', 'state', 'area_name', 'rucc_2013', 'pop_1990', 'pop_2000', 'pop_2010', 'pop_2020', 'pop_2021']

pop = (pop
    .filter(['fips_code', 'state', 'area_name', 'rucc_2013','pop_2021'], axis=1)
    .query("state != 'Puerto Rico'"))

pol = pd.read_csv('politics.csv').rename({'fips':'fips_code'}, axis=1)

pop.head()
   fips_code state       area_name  rucc_2013     pop_2021
0          0    US   United States        NaN  331893745.0
1       1000    AL         Alabama        NaN    5039877.0
2       1001    AL  Autauga County        2.0      59095.0
3       1003    AL  Baldwin County        3.0     239294.0
4       1005    AL  Barbour County        6.0      24964.0
Code
def make_pca(new_df, highlighted_area, dim=2):
    """Merge new_df with the population table on fips_code, run PCA on the
    numeric columns, and return (scatter fig, variance fig, PC scores)."""
    # Merge
    df_raw = (pd
        .merge(pop, new_df, on='fips_code', how='left')
        .dropna()
        .query("state != 'Puerto Rico'"))

    # Normalize data
    scaler = StandardScaler()
    df = scaler.fit_transform(df_raw.drop(['fips_code', 'state', 'area_name'], axis=1))
    df_name = df_raw[['fips_code', 'state', 'area_name']].reset_index(drop=True)

    # PCA analysis
    pca = PCA()
    pca.fit(df)
    pcs = pca.transform(df)
    pcs_df = pd.DataFrame(pcs)
    df_final = pd.concat([df_name, pcs_df], axis=1)

    df_return = df_final.drop(['state', 'area_name'], axis=1)

    fig2 = pca_variance(pca)

    # Marks
    marker_colors = df_final['fips_code'].apply(lambda x: 'red' if x == highlighted_area else 'blue')
    marker_sizes = df_final['fips_code'].apply(lambda x: 15 if x == highlighted_area else 5)

    if dim == 2:
        # Chart PC distance 2D
        fig = px.scatter(df_final,
            x=df_final.columns[len(df_name.columns)],
            y=df_final.columns[len(df_name.columns) + 1],
            hover_data=['fips_code','state','area_name'],
            custom_data=['fips_code'],
            color=marker_colors,
            size=marker_sizes,
            color_discrete_map={'red': 'red', 'blue': 'blue'})

        fig.update_traces(marker=dict(symbol='circle', line=dict(width=1)))

        return fig, fig2, df_return

    elif dim == 3:
        fig = px.scatter_3d(df_final,
            x=df_final.columns[len(df_name.columns)],
            y=df_final.columns[len(df_name.columns) + 1],
            z=df_final.columns[len(df_name.columns) + 2],
            hover_data=['fips_code','state','area_name'],
            custom_data=['fips_code'],
            color=marker_colors,
            size=marker_sizes,
            color_discrete_map={'red': 'red', 'blue': 'blue'})

        fig.update_traces(marker=dict(symbol='circle', line=dict(width=1)))

        return fig, fig2, df_return

def pca_variance(pca):
    # Explained variance
    explained_variance_ratio = pca.explained_variance_ratio_
    num_components = len(explained_variance_ratio)
    components = list(range(1, num_components + 1))

    fig = go.Figure(data=[go.Bar(x=components, y=explained_variance_ratio)])

    fig.update_layout(
        title="Proportion of Variance Explained by Principal Components",
        xaxis_title="Principal Components",
        yaxis_title="Explained Variance Ratio",
        yaxis=dict(tickformat=".2%"),
    )

    return fig

Poverty

Poverty variables:

  • percent of population below poverty level overall in county 2020 (pct_povall_2020)
  • percent of population below poverty level for minors in county 2020 (pct_pov017_2020)
  • median household income in county 2020 (medhhinc_2020)
Code
# Poverty Estimates
pov = pd.read_excel("PovertyEstimates.xlsx", skiprows=4)

pov.columns = ['fips_code', 'stabr', 'area_name', 'rucc_2003', 'uic_2003', 'rucc_2013', 'uic_2013', 'povall_2020', 'ci90lb_all_2020', 'ci90ub_all_2020', 'pct_povall_2020', 'ci90lb_allp_2020', 'ci90ub_allp_2020', 'pov017_2020', 'ci90lb_017_2020', 'ci90ub_017_2020', 'pct_pov017_2020', 'ci90lb_017p_2020', 'ci90ub_017p_2020', 'pov517_2020', 'ci90lb_517_2020', 'ci90ub_517_2020', 'pct_pov517_2020', 'ci90lb_517p_2020', 'ci90ub_517p_2020', 'medhhinc_2020', 'ci90lb_inc_2020', 'ci90ub_inc_2020', 'pov04_2020', 'ci90lb_04_2020', 'ci90ub_04_2020', 'pct_pov04_2020', 'ci90lb_04p_2020', 'ci90ub_04p_2020']

pov_filtered_columns = ['fips_code','pct_povall_2020','pct_pov017_2020','medhhinc_2020']

pov = pov[pov_filtered_columns]

Education

Education variables:

  • percent of population with less than high school diploma in county 2017-2021 (pct_less_than_hs_2017_21)
  • percent of population with high school diploma in county 2017-2021 (pct_hs_diploma_2017_21)
  • percent of population with some college or associate degree in county 2017-2021 (pct_some_college_assoc_degree_2017_21)
  • percent of population with bachelor’s degree or higher in county 2017-2021 (pct_bachelors_plus_2017_21)
Code
# Education
edu = pd.read_excel("Education.xlsx", skiprows=3)

edu.columns = ['fips_code', 'state', 'area_name', 'rucc_2003', 'uic_2003', 'rucc_2013', 'uic_2013', 'less_than_hs_1970', 'hs_diploma_1970', 'some_college_1_3_yrs_1970', 'college_4_yrs_plus_1970', 'pct_less_than_hs_1970', 'pct_hs_diploma_1970', 'pct_some_college_1_3_yrs_1970', 'pct_college_4_yrs_plus_1970', 'less_than_hs_1980', 'hs_diploma_1980', 'some_college_1_3_yrs_1980', 'college_4_yrs_plus_1980', 'pct_less_than_hs_1980', 'pct_hs_diploma_1980', 'pct_some_college_1_3_yrs_1980', 'pct_college_4_yrs_plus_1980', 'less_than_hs_1990', 'hs_diploma_1990', 'some_college_assoc_degree_1990', 'bachelors_plus_1990', 'pct_less_than_hs_1990', 'pct_hs_diploma_1990', 'pct_some_college_assoc_degree_1990', 'pct_bachelors_plus_1990', 'less_than_hs_2000', 'hs_diploma_2000', 'some_college_assoc_degree_2000', 'bachelors_plus_2000', 'pct_less_than_hs_2000', 'pct_hs_diploma_2000', 'pct_some_college_assoc_degree_2000', 'pct_bachelors_plus_2000', 'less_than_hs_2008_12', 'hs_diploma_2008_12', 'some_college_assoc_degree_2008_12', 'bachelors_plus_2008_12', 'pct_less_than_hs_2008_12', 'pct_hs_diploma_2008_12', 'pct_some_college_assoc_degree_2008_12', 'pct_bachelors_plus_2008_12', 'less_than_hs_2017_21', 'hs_diploma_2017_21', 'some_college_assoc_degree_2017_21', 'bachelors_plus_2017_21', 'pct_less_than_hs_2017_21', 'pct_hs_diploma_2017_21', 'pct_some_college_assoc_degree_2017_21', 'pct_bachelors_plus_2017_21']

edu_filtered_columns = [col for col in edu.columns if ('2017' in col and 'pct' in col) or 'fips' in col.lower()]

edu = edu[edu_filtered_columns]

Unemployment

Unemployment variables:

  • Civilian labor force in county 2021 (civ_labor_force_2021)
  • Employed in county 2021 (employed_2021)
  • Unemployed in county 2021 (unemployed_2021)
  • Unemployment rate in county 2021 (unemp_rate_2021)
  • Gini coefficient in county, likely 2019 (Gini.Coefficient)
Code
# Unemployment
unemp = pd.read_excel("Unemployment.xlsx", skiprows=4)

unemp.columns = ['fips_code', 'state', 'area_name', 'rucc_2013', 'uic_2013', 'metro_2013', 'civ_labor_force_2000', 'employed_2000', 'unemployed_2000', 'unemp_rate_2000', 'civ_labor_force_2001', 'employed_2001', 'unemployed_2001', 'unemp_rate_2001', 'civ_labor_force_2002', 'employed_2002', 'unemployed_2002', 'unemp_rate_2002', 'civ_labor_force_2003', 'employed_2003', 'unemployed_2003', 'unemp_rate_2003', 'civ_labor_force_2004', 'employed_2004', 'unemployed_2004', 'unemp_rate_2004', 'civ_labor_force_2005', 'employed_2005', 'unemployed_2005', 'unemp_rate_2005', 'civ_labor_force_2006', 'employed_2006', 'unemployed_2006', 'unemp_rate_2006', 'civ_labor_force_2007', 'employed_2007', 'unemployed_2007', 'unemp_rate_2007', 'civ_labor_force_2008', 'employed_2008', 'unemployed_2008', 'unemp_rate_2008', 'civ_labor_force_2009', 'employed_2009', 'unemployed_2009', 'unemp_rate_2009', 'civ_labor_force_2010', 'employed_2010', 'unemployed_2010', 'unemp_rate_2010', 'civ_labor_force_2011', 'employed_2011', 'unemployed_2011', 'unemp_rate_2011', 'civ_labor_force_2012', 'employed_2012', 'unemployed_2012', 'unemp_rate_2012', 'civ_labor_force_2013', 'employed_2013', 'unemployed_2013', 'unemp_rate_2013', 'civ_labor_force_2014', 'employed_2014', 'unemployed_2014', 'unemp_rate_2014', 'civ_labor_force_2015', 'employed_2015', 'unemployed_2015', 'unemp_rate_2015', 'civ_labor_force_2016', 'employed_2016', 'unemployed_2016', 'unemp_rate_2016', 'civ_labor_force_2017', 'employed_2017', 'unemployed_2017', 'unemp_rate_2017', 'civ_labor_force_2018', 'employed_2018', 'unemployed_2018', 'unemp_rate_2018', 'civ_labor_force_2019', 'employed_2019', 'unemployed_2019', 'unemp_rate_2019', 'civ_labor_force_2020', 'employed_2020', 'unemployed_2020', 'unemp_rate_2020', 'civ_labor_force_2021', 'employed_2021', 'unemployed_2021', 'unemp_rate_2021', 'med_hh_income_2020', 'med_hh_income_pct_state_total_2020']

unemp_filtered_columns = [col for col in unemp.columns if '2021' in col or 'fips' in col.lower()]

unemp = unemp[unemp_filtered_columns]

unemp_cols = ['civ_labor_force_2021','employed_2021','unemployed_2021']

unemp2 = pd.merge(unemp, pop.filter(['fips_code', 'pop_2021']), how='left', on='fips_code')

unemp[unemp_cols] = unemp2[unemp_cols].div(unemp2['pop_2021'], axis=0) * 100

unemp = pd.merge(unemp, pol[['fips_code', 'Gini.Coefficient']], how='left', on='fips_code')

Crime

Crime Variables

  • Crime rate/100k people
  • Murders/100k people
  • Rapes/100k people
  • Robberies/100k people
  • Aggravated assaults/100k people
  • Burglaries/100k people
  • Larcenies/100k people
  • Motor vehicle thefts/100k people
  • Arsons/100k people
Code
# Crime
crime = pd.read_csv("crime_clean.csv").drop('Unnamed: 0', axis=1)

crime.columns = ['fips_code','crime_rate_per_100k','murder','rape','robbery','ag_assault','burglry','larceny','mv_theft','arson','population']

crime_cols = ['murder','rape','robbery','ag_assault','burglry','larceny','mv_theft','arson']

crime[crime_cols] = crime[crime_cols].div(crime['population'], axis=0)*100000

crime = crime.drop('population', axis=1)
a, b, crime_df = make_pca(crime, 48209, dim=2)
a
Code
b

Race

Race Variables

  • White (%)
  • Black (%)
  • American Indian/Alaska Native (%)
  • Asian (%)
  • Native Hawaiian/Pacific Islander (%)
  • Other race (%)
  • Two or more races (%)
  • Hispanic/Latino (%)
Code
# Race
race = pd.read_csv('race.csv')

race_cols = ['white', 'black','amerindian', 'asian', 'hawaiian_or_PI', 'other_race', 'two_or_more','hispanic']

race[race_cols] = race.filter(race_cols, axis=1).div(race['total_population'],axis=0)*100

race = race.drop(['county_name','total_population'], axis=1)
a, b, race_df = make_pca(race, 48209, dim=2)
a
Code
b

Politics

Political Variables

  • Republican Vote Share in 2016 Presidential Election (%)
  • Democratic Vote Share in 2016 Presidential Election (%)
  • Libertarian Vote Share in 2016 Presidential Election (%)
Code
# Politics
politics = ['fips_code', 'rep16_frac', 'dem16_frac', 'libert16_frac']
politics = pol[politics].fillna(0)

Job Industry

Job Industry Variables

  • Management, professional, and related occupations (%)
  • Service occupations (%)
  • Sales and office occupations (%)
  • Farming, fishing, and forestry occupations (%)
  • Construction, extraction, maintenance, and repair occupations (%)
  • Production, transportation, and material moving occupations (%)
Code
# Job Industry
jobs = ['fips_code',
    'Management.professional.and.related.occupations',
    'Service.occupations',
    'Sales.and.office.occupations',
    'Farming.fishing.and.forestry.occupations',
    'Construction.extraction.maintenance.and.repair.occupations',
    'Production.transportation.and.material.moving.occupations']

jobs = pol[jobs].fillna(0)

jobs.columns = ['fips_code', 'mgmt', 'service', 'sales', 'farming', 'construction', 'production']

jobs2 = pd.merge(jobs, pop.filter(['fips_code', 'pop_2021'], axis=1), how='left', on='fips_code')

jobs_cols = ['mgmt', 'service', 'sales', 'farming', 'construction', 'production']

jobs[jobs_cols] = ((jobs2[jobs_cols].div(jobs2['pop_2021'], axis=0))*100000).fillna(0)
a, b, jobs_df= make_pca(jobs, 48209, dim=2)
a
Code
b

Health

Health Variables

  • Poor physical health days (%)
  • Poor mental health days (%)
  • Low birthweight (%)
  • Teen births (%)
  • Adult smoking (%)
  • Adult obesity (%)
  • Diabetes (%)
  • Sexually transmitted infections (%)
Code
# Health
health = ['fips_code','Poor.physical.health.days', 'Poor.mental.health.days', 'Low.birthweight', 'Teen.births', 'Adult.smoking', 'Adult.obesity', 'Diabetes','Sexually.transmitted.infections']

health = pol[health].fillna(0)

health.columns = ['fips_code', 'poor_phys', 'poor_mental', 'low_birth_lbs', 'teen_births', 'smokers','obesity','diabetes', 'stds']

health2 = pd.merge(health, pop.filter(['fips_code', 'pop_2021'], axis=1), how='left', on='fips_code')

health_cols = ['poor_phys', 'poor_mental', 'low_birth_lbs', 'teen_births', 'smokers','obesity','diabetes', 'stds']

health[health_cols] = ((health2[health_cols].div(health2['pop_2021'], axis=0))*100000).fillna(0)
a, b, health_df = make_pca(health, 48209, dim=2)
a
Code
b

Social Capital

  • Percentage of births to unmarried mothers (%)
  • Percentage of women who are married (%)
  • Percentage of children in single-parent households (%)
  • Non-religious non-profit organizations per 1,000 people
  • Religious congregations per 1,000 people
  • Informal civic engagement sub-index
  • Presidential election voting rate 2012-2016
  • Mail-back census response rate
  • Confidence in institutions sub-index
  • Membership organizations per 1,000 people
  • Recreation and leisure establishments per 1,000 people
  • Associations per 1,000 people (Penn State method)
  • Non-religious and religious organizations per 1,000 people
  • Religious adherents per 1,000 people
  • Percentage of people who receive emotional support sometimes, rarely, or never
  • Share of middle-class itemizers deducting charitable contributions (%)
  • Ratio of 80th to 20th percentile household income*
Code
# Social Capital
soccap = pd.read_csv('soccap.csv')

soccap.columns = ['fips_code', 'pct_births_unmarried', 'pct_women_married', 'pct_children_single_parent',
    'non_religious_non_profit_orgs_per_1k', 'religious_congregations_per_1k', 'informal_civic_engagement_subidx',
    'presidential_election_voting_rate_12_16', 'mail_back_census_response_rate', 'confidence_in_institutions_subidx',
    'membership_orgs_per_1k', 'recreation_leisure_est_per_1k', 'associations_per_1k_penn_state_method',
    'non_religious_and_religious_orgs_per_1k', 'religious_adherents_per_1k',
    'pct_emotional_support_sometimes_rarely_or_never', 'charitable_contributions_share_of_agi_middle_class_itemizers',
    'share_of_middle_class_itemizers_deducting_charitable_contributions', 'ratio_80th_to_20th_pct_hh_income']
a, b, soccap_df = make_pca(soccap, 48209, dim=2)
a
Code
b

PCA with all the data

Code
df_raw = pop.drop(['rucc_2013'], axis = 1).pipe(
    pd.merge, pov, on='fips_code').pipe(
    pd.merge, edu, on='fips_code').pipe(
    pd.merge, unemp, on='fips_code').pipe(
    pd.merge, crime, on='fips_code').pipe(
    pd.merge, race, on='fips_code').pipe(
    pd.merge, politics, on='fips_code').pipe(
    pd.merge, jobs, on='fips_code').pipe(
    pd.merge, health, on='fips_code').pipe(
    pd.merge, soccap, on='fips_code')

for col in df_raw.drop(['fips_code','state','area_name'],axis=1).columns:
    df_raw[col] = df_raw[col].fillna(df_raw[col].mean())

highlighted_area = 48209

# Normalize data
scaler = StandardScaler()
df = scaler.fit_transform(df_raw.drop(['fips_code', 'state', 'area_name'], axis=1))
df_name = df_raw[['fips_code', 'state', 'area_name']].reset_index(drop=True)

# PCA analysis
pca = PCA()
pca.fit(df)
pcs = pca.transform(df)
pcs_df = pd.DataFrame(pcs)
df_final = pd.concat([df_name, pcs_df], axis=1)

fig2 = pca_variance(pca)

# Marks
marker_colors = df_final['fips_code'].apply(lambda x: 'red' if x == highlighted_area else 'blue')
marker_sizes = df_final['fips_code'].apply(lambda x: 15 if x == highlighted_area else 5)


# Chart PC distance 2D
fig = px.scatter(df_final,
    x=df_final.columns[len(df_name.columns)],
    y=df_final.columns[len(df_name.columns) + 1],
    hover_data=['fips_code','state','area_name'],
    custom_data=['fips_code'],
    color=marker_colors,
    size=marker_sizes,
    color_discrete_map={'red': 'red', 'blue': 'blue'})

fig.update_traces(marker=dict(symbol='circle', line=dict(width=1)))

fig.show()
Code
fig = px.scatter_3d(df_final,
    x=df_final.columns[len(df_name.columns)],
    y=df_final.columns[len(df_name.columns) + 1],
    z=df_final.columns[len(df_name.columns) + 2],
    hover_data=['fips_code','state','area_name'],
    custom_data=['fips_code'],
    color=marker_colors,
    size=marker_sizes,
    color_discrete_map={'red': 'red', 'blue': 'blue'})

fig.update_traces(marker=dict(symbol='circle', line=dict(width=1)))

fig.show()
Code
# Explained variance
explained_variance_ratio = pca.explained_variance_ratio_
num_components = len(explained_variance_ratio)
components = list(range(1, num_components + 1))

fig = go.Figure(data=[go.Bar(x=components, y=explained_variance_ratio)])

fig.update_layout(
    title="Proportion of Variance Explained by Principal Components",
    xaxis_title="Principal Components",
    yaxis_title="Explained Variance Ratio",
    yaxis=dict(tickformat=".2%"),
)

fig.show()
Code
loadings = pd.DataFrame(pca.components_.T, columns=[f'PC{i}' for i in range(1, num_components+1)], index=df_raw.columns[3:]) 

PC1 = loadings.reindex(loadings['PC1'].abs().sort_values(ascending=False).index)['PC1'].head()

PC2 = loadings.reindex(loadings['PC2'].abs().sort_values(ascending=False).index)['PC2'].head()

PC3 = loadings.reindex(loadings['PC3'].abs().sort_values(ascending=False).index)['PC3'].head()

PC4 = loadings.reindex(loadings['PC4'].abs().sort_values(ascending=False).index)['PC4'].head()

PC5 = loadings.reindex(loadings['PC5'].abs().sort_values(ascending=False).index)['PC5'].head()

PC6 = loadings.reindex(loadings['PC6'].abs().sort_values(ascending=False).index)['PC6'].head()

PC1

The first principal component seems to measure traditional markers of a community, including how White the population is, the density of both religious and non-religious organizations, low criminality, and the share of women who are married.

Code
PC1
white                                      0.184024
non_religious_and_religious_orgs_per_1k    0.183871
crime_rate_per_100k                       -0.180371
pct_women_married                          0.179891
non_religious_non_profit_orgs_per_1k       0.176724
Name: PC1, dtype: float64

PC2

The second principal component seems to measure how economically disadvantaged the population is: the poverty rates are highly positively correlated with it, while median household income is negatively correlated. Interestingly, religious congregations and teen births per 1k are also highly correlated with this component. It seems to capture some aspects of a traditional society as well, but the negative ones.

Code
PC2
pct_pov017_2020                   0.263468
pct_povall_2020                   0.248170
medhhinc_2020                    -0.244717
religious_congregations_per_1k    0.225318
pct_bachelors_plus_2017_21       -0.219958
Name: PC2, dtype: float64

PC3

The third principal component seems to measure how economically busy an area is. It is highly correlated with the civilian labor force, overall employment, and the shares of the population employed in management, production, and construction occupations.

Code
PC3
civ_labor_force_2021    0.292500
employed_2021           0.286014
mgmt                    0.273708
production              0.263062
construction            0.256393
Name: PC3, dtype: float64

PC4

The fourth principal component seems to measure civic engagement and social capital, with Republican vote share loading negatively on it.

Code
PC4
non_religious_non_profit_orgs_per_1k       0.296876
non_religious_and_religious_orgs_per_1k    0.272233
rep16_frac                                -0.237853
recreation_leisure_est_per_1k              0.234750
membership_orgs_per_1k                     0.227421
Name: PC4, dtype: float64

PC5

I am not exactly sure what this principal component is measuring.

Code
PC5
hispanic                                                        0.302891
two_or_more                                                     0.263537
other_race                                                      0.250530
charitable_contributions_share_of_agi_middle_class_itemizers   -0.214137
associations_per_1k_penn_state_method                          -0.207575
Name: PC5, dtype: float64

PC6

This component probably measures something related to race and crime.

Code
PC6
crime_rate_per_100k    0.253964
hispanic               0.253551
ag_assault             0.252535
rape                   0.248669
black                 -0.237223
Name: PC6, dtype: float64