이 포스팅은 시각화 정리 시리즈 16 편 중 4 번째 글 입니다.
목차
산점도 그래프의 응용인 버블 플롯을 그려보자.
연습 kaggle notebook
산점도에서 추가적으로 내가 원하는 그룹에 대해 원으로 그룹을 지어, 크기를 알 수 있다.
원래 bubble plot은 특정 집단의 boundary를 측정하는데 용이하다. 여기서는 산점도에서 추가적으로 circling을 통하여 그 boundary를 측정하는 방법을 사용해본다. 알고리즘으로는 컨벡스 헐을 사용한다.
# Useful for:
# Visualize the relationship between data.
# More info:
# https://en.wikipedia.org/wiki/Scatter_plot
# ----------------------------------------------------------------------------------------------------
# get the data
PATH = '/kaggle/input/the-50-plot-challenge/midwest_filter.csv'
df = pd.read_csv(PATH)
# ----------------------------------------------------------------------------------------------------
# instanciate the figure
fig = plt.figure(figsize = (12, 6))
ax = fig.add_subplot(1,1,1,)
# ----------------------------------------------------------------------------------------------------
# 각 그룹마다 다르게 묶은 뒤, 연속해서 plot한다. 이럴 경우 각 포인트가 그룹마다 다른 색으로 칠해진다.
for cat in sorted(list(df["category"].unique())):
# filter x and the y for each category
ar = df[df["category"] == cat]["area"]
pop = df[df["category"] == cat]["poptotal"]
# plot the data
ax.scatter(ar, pop, label = cat, s = 10)
# ----------------------------------------------------------------------------------------------------
# prettify the plot
# 맨 위 줄과 오른쪽 줄을 없애서 보기 편하게
ax.spines["top"].set_color("None")
ax.spines["right"].set_color("None")
# set a specific label for each axis
ax.set_xlabel("Area")
ax.set_ylabel("Population")
# change the lower limit of the plot, this will allow us to see the legend on the left
ax.set_xlim(-0.01)
ax.set_title("Scatter plot of population vs area.")
ax.legend(loc = "upper left", fontsize = 10)
추가적인 bubble plot
# Useful for:
# Visualize the relationship between data but also helps us encircle a specific group we might want to draw the attention to.
# More info:
# https://en.wikipedia.org/wiki/Scatter_plot
# ----------------------------------------------------------------------------------------------------
# get the data
PATH = '/kaggle/input/the-50-plot-challenge/midwest_filter.csv'
df = pd.read_csv(PATH)
# ----------------------------------------------------------------------------------------------------
# instanciate the figure
fig = plt.figure(figsize = (12, 6))
ax = fig.add_subplot(1,1,1,)
# ----------------------------------------------------------------------------------------------------
# prepare the data for plotting
size_total = df["poptotal"].sum()
# we want every group to have a different marker
markers = [".", ",", "o", "v", "^", "<", ">", "1", "2", "3", "4", "8", "s", "p", "P", "*", "h", "H", "+", "x", "X", "D", "d"]
# ----------------------------------------------------------------------------------------------------
# create an encircle
# based on this solution
# https://stackoverflow.com/questions/44575681/how-do-i-encircle-different-data-sets-in-scatter-plot
def encircle(x,y, ax = None, **kw):
'''
Takes an axes and the x and y and draws a polygon on the axes.
This code separates the differents clusters
'''
# get the axis if not passed
if not ax: ax=plt.gca()
# concatenate the x and y arrays
p = np.c_[x,y]
# to calculate the limits of the polygon
hull = ConvexHull(p)
# create a polygon from the hull vertices
poly = plt.Polygon(p[hull.vertices,:], **kw)
# add the patch to the axes
ax.add_patch(poly) # 내가 그린 그림을 위에 덧칠하게 해줌
# ----------------------------------------------------------------------------------------------------
# iterate over each category and plot the data. This way, every group has it's own color and marker.
# on the iteration we will calculate our hull/polygon for each group and connect specific groups
for cat, marker in zip(sorted(list(df["category"].unique())), markers): # 이런 스킬이 굉장히 중요해 보임
# filter x and the y for each category
ar = df[df["category"] == cat]["area"]
pop = df[df["category"] == cat]["poptotal"]
# this will allow us to set a specific size for each group.
# 해당 population 값에 따라 marker의 크기를 다르게 해주기 위함
size = pop/size_total
# plot the data
ax.scatter(ar, pop, label = cat, s = size*10000, marker = marker)
try:
# try to add a patch
encircle(ar, pop, ec = "k", alpha=0.1)
except:
# if we don't have enough poins to encircle just pass
pass
# ----------------------------------------------------------------------------------------------------
# prettify the plot
# eliminate 2/4 spines (lines that make the box/axes) to make it more pleasant
ax.spines["top"].set_color("None")
ax.spines["right"].set_color("None")
# set a specific label for each axis
ax.set_xlabel("Area")
ax.set_ylabel("Population")
# change the lower limit of the plot, this will allow us to see the legend on the left
ax.set_xlim(-0.01)
ax.set_title("Bubble plot with encircling")
ax.legend(loc = "upper left", fontsize = 10);