[시각화 정리] 04: Circling을 통한 버블 플롯

이 포스팅은 시각화 정리 시리즈 16 편 중 4 번째 글 입니다.

Part 1 - 01: 라이브러리
Part 2 - 02: matplotlib 알아보기
Part 3 - 03: 산점도
Part 4 - This Post
Part 5 - 05: 선형 회귀 선을 포함한 산점도
Part 6 - 06: Strip plot
Part 7 - 07: Counts plot
Part 8 - 08: Marginal Histogram
Part 9 - 09: Correlation plot
Part 10 - 09: Marginal Boxplot
Part 11 - 10: Pair plot
Part 12 - 11: Diverging Bars
Part 13 - 12: Diverging lines with text
Part 14 - 13: Diverging Lollipop Chart with Markers
Part 15 - 14: Area chart
Part 16 - 15: Ordered Bar Chart

▼ 목록 보기

추가적인 bubble plot
- Reference

▼ 내리기

산점도 그래프의 응용인 버블 플롯을 그려보자.
연습 kaggle notebook

산점도에서 추가적으로 내가 원하는 그룹에 대해 원으로 그룹을 지어, 크기를 알 수 있다.

원래 bubble plot은 특정 집단의 boundary를 측정하는데 용이하다. 여기서는 산점도에서 추가적으로 circling을 통하여 그 boundary를 측정하는 방법을 사용해본다. 알고리즘으로는 컨벡스 헐을 사용한다.

기본적인 bubble plot

# Useful for:
# Visualize the relationship between data.

# More info:
# https://en.wikipedia.org/wiki/Scatter_plot

# ----------------------------------------------------------------------------------------------------
# get the data
PATH = '/kaggle/input/the-50-plot-challenge/midwest_filter.csv'
df = pd.read_csv(PATH)

# ----------------------------------------------------------------------------------------------------
# instanciate the figure
fig = plt.figure(figsize = (12, 6))
ax = fig.add_subplot(1,1,1,)

# ----------------------------------------------------------------------------------------------------
# 각 그룹마다 다르게 묶은 뒤, 연속해서 plot한다. 이럴 경우 각 포인트가 그룹마다 다른 색으로 칠해진다.
for cat in sorted(list(df["category"].unique())):
    # filter x and the y for each category
    ar = df[df["category"] == cat]["area"]
    pop = df[df["category"] == cat]["poptotal"]

    # plot the data
    ax.scatter(ar, pop, label = cat, s = 10)

# ----------------------------------------------------------------------------------------------------
# prettify the plot

# 맨 위 줄과 오른쪽 줄을 없애서 보기 편하게
ax.spines["top"].set_color("None")
ax.spines["right"].set_color("None")

# set a specific label for each axis
ax.set_xlabel("Area")
ax.set_ylabel("Population")

# change the lower limit of the plot, this will allow us to see the legend on the left
ax.set_xlim(-0.01)
ax.set_title("Scatter plot of population vs area.")
ax.legend(loc = "upper left", fontsize = 10)

__results___42_0

추가적인 bubble plot

# Useful for:
# Visualize the relationship between data but also helps us encircle a specific group we might want to draw the attention to.

# More info:
# https://en.wikipedia.org/wiki/Scatter_plot

# ----------------------------------------------------------------------------------------------------
# get the data
PATH = '/kaggle/input/the-50-plot-challenge/midwest_filter.csv'
df = pd.read_csv(PATH)

# ----------------------------------------------------------------------------------------------------
# instanciate the figure
fig = plt.figure(figsize = (12, 6))
ax = fig.add_subplot(1,1,1,)

# ----------------------------------------------------------------------------------------------------
# prepare the data for plotting
size_total = df["poptotal"].sum()
# we want every group to have a different marker
markers = [".", ",", "o", "v", "^", "<", ">", "1", "2", "3", "4", "8", "s", "p", "P", "*", "h", "H", "+", "x", "X", "D", "d"]

# ----------------------------------------------------------------------------------------------------
# create an encircle
# based on this solution
# https://stackoverflow.com/questions/44575681/how-do-i-encircle-different-data-sets-in-scatter-plot
def encircle(x,y, ax = None, **kw):
    '''
    Takes an axes and the x and y and draws a polygon on the axes.
    This code separates the differents clusters
    '''
    # get the axis if not passed
    if not ax: ax=plt.gca()

    # concatenate the x and y arrays
    p = np.c_[x,y]

    # to calculate the limits of the polygon
    hull = ConvexHull(p)

    # create a polygon from the hull vertices
    poly = plt.Polygon(p[hull.vertices,:], **kw)

    # add the patch to the axes
    ax.add_patch(poly) # 내가 그린 그림을 위에 덧칠하게 해줌

# ----------------------------------------------------------------------------------------------------
# iterate over each category and plot the data. This way, every group has it's own color and marker.
# on the iteration we will calculate our hull/polygon for each group and connect specific groups
for cat, marker in zip(sorted(list(df["category"].unique())), markers):  # 이런 스킬이 굉장히 중요해 보임
    # filter x and the y for each category
    ar = df[df["category"] == cat]["area"]
    pop = df[df["category"] == cat]["poptotal"]

    # this will allow us to set a specific size for each group.
    # 해당 population 값에 따라 marker의 크기를 다르게 해주기 위함
    size = pop/size_total

    # plot the data
    ax.scatter(ar, pop, label = cat, s = size*10000, marker = marker)

    try:
        # try to add a patch
        encircle(ar, pop, ec = "k", alpha=0.1)
    except:
        # if we don't have enough poins to encircle just pass
        pass

# ----------------------------------------------------------------------------------------------------
# prettify the plot

# eliminate 2/4 spines (lines that make the box/axes) to make it more pleasant
ax.spines["top"].set_color("None")
ax.spines["right"].set_color("None")

# set a specific label for each axis
ax.set_xlabel("Area")
ax.set_ylabel("Population")

# change the lower limit of the plot, this will allow us to see the legend on the left
ax.set_xlim(-0.01)
ax.set_title("Bubble plot with encircling")
ax.legend(loc = "upper left", fontsize = 10);

__results___43_0

Reference

Plotting with Python: learn 80 plots STEP by STEP