[시각화 정리] 07: Counts plot | 완숙의 에그머니🍳

이 포스팅은 시각화 정리 시리즈 16 편 중 7 번째 글 입니다.

Part 1 - 01: 라이브러리
Part 2 - 02: matplotlib 알아보기
Part 3 - 03: 산점도
Part 4 - 04: Circling을 통한 버블 플롯
Part 5 - 05: 선형 회귀 선을 포함한 산점도
Part 6 - 06: Strip plot
Part 7 - This Post
Part 8 - 08: Marginal Histogram
Part 9 - 09: Correlation plot
Part 10 - 09: Marginal Boxplot
Part 11 - 10: Pair plot
Part 12 - 11: Diverging Bars
Part 13 - 12: Diverging lines with text
Part 14 - 13: Diverging Lollipop Chart with Markers
Part 15 - 14: Area chart
Part 16 - 15: Ordered Bar Chart

▼ 목록 보기

카테고리 변수에 대한 y 갯수를 그려주는 Counts plot에 대해 알아보자.
연습 kaggle notebook

카테고리 변수의 갯수 분포를 한눈에!

이전의 stripplot이 category 변수를 x, 조사하고 싶은 연속 변수를 y라 했을 때 분포를 본다면, 이번에는 x 변수에 대한 y의 개수를 파악한다. 이 문장에 내포되어 있는 뜻은, x, y가 모두 카테고리 변수라는 얘기로 볼 수 있다. 단, 이산적인 숫자이다. 이것을 코딩하기 위해서 필요한 것은, group과 그 그룹에 속하는 개수를 알아야 한다.

# Useful for:
# Draw a scatterplot where one variable is categorical.
# In this plot we calculate the size of overlapping points in each category and for each y.
# This way, the bigger the bubble the more concentration we have in that region.

# More info:
# https://seaborn.pydata.org/generated/seaborn.stripplot.html

# ----------------------------------------------------------------------------------------------------
# get the data
PATH = '/kaggle/input/the-50-plot-challenge/mpg_ggplot2.csv'
df = pd.read_csv(PATH)

# ----------------------------------------------------------------------------------------------------
# prepare the data for plotting

# we need to make a groupby by variables of interest
# cty, hwy에 따른 group을 묶고, 그 카테고리에 해당하는 개수를 갖고 있는다.
# 그리고 그 개수의 column 이름을 counts라 한다.
gb_df = df.groupby(["cty", "hwy"]).size().reset_index(name = "counts")

# sort the values
gb_df.sort_values(["cty", "hwy", "counts"], ascending = True, inplace = True)

# create a color for each group.
# there are several way os doing, you can also use this line:
# colors = [plt.cm.gist_earth(i/float(len(gb_df["cty"].unique()))) for i in range(len(gb_df["cty"].unique()))]
# colors 딕셔너리를 만든다. 각 cty 숫자에 대응하는 3개의 값(list)을 가지고 있는다.
# 이 값은 RGB 값에 대응된다.
colors = {i:np.random.random(3,) for i in sorted(list(gb_df["cty"].unique()))}

# ----------------------------------------------------------------------------------------------------
# instanciate the figure
fig = plt.figure(figsize = (10, 5))
ax = fig.add_subplot()

# ----------------------------------------------------------------------------------------------------
# iterate over each category and plot the data. This way, every group has it's own color and size.
for x in sorted(list(gb_df["cty"].unique())):

    # get x and y values for each group
    x_values = gb_df[gb_df["cty"] == x]["cty"]
    y_values = gb_df[gb_df["cty"] == x]["hwy"]

    # extract the size of each group to plot
    size = gb_df[gb_df["cty"] == x]["counts"]

    # extract the color for each group and covert it from rgb to hex
    # 0~1의 실수 범위 색상을 16진수 색상으로 바꿔준다.
    color = matplotlib.colors.rgb2hex(colors[x])

    # plot the data
    ax.scatter(x_values, y_values, s = size*10, c = color)

# ----------------------------------------------------------------------------------------------------
# prettify the plot

# set title
ax.set_title("Count plot");

다운로드 (4)

Reference

Plotting with Python: learn 80 plots STEP by STEP