EDA

Let's get to know our raw material through exploratory data analysis (EDA).

Check the data

# Shape (rows, columns) of each dataset
train_df.shape, test_df.shape

((200000, 202), (200000, 201))

train_df.head()

test_df.head()

Train contains:

ID_code (string);
target;
200 numerical variables, named from var_0 to var_199.

Test contains:

ID_code (string);
200 numerical variables, named from var_0 to var_199.
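The layout described above can also be checked programmatically. The following is a minimal sketch, not part of the original notebook: `check_layout` is a hypothetical helper, and `demo_train` / `demo_test` are tiny synthetic stand-ins (3 features instead of 200) used only so the example runs on its own.

```python
import numpy as np
import pandas as pd

def check_layout(df, n_features=200, has_target=True):
    """Hypothetical helper: True if df holds exactly ID_code,
    (optionally) target, and var_0 .. var_{n_features-1}."""
    expected = ["ID_code"] + (["target"] if has_target else [])
    expected += [f"var_{i}" for i in range(n_features)]
    # Compare as sets so the check does not depend on column order
    return len(df.columns) == len(expected) and set(df.columns) == set(expected)

# Synthetic stand-ins for train_df / test_df (assumption: 3 features, 5 rows)
rng = np.random.default_rng(0)
n_rows, n_features = 5, 3
feature_cols = [f"var_{i}" for i in range(n_features)]

demo_test = pd.DataFrame(rng.normal(size=(n_rows, n_features)), columns=feature_cols)
demo_test.insert(0, "ID_code", [f"id_{i}" for i in range(n_rows)])

demo_train = demo_test.copy()
demo_train.insert(1, "target", rng.integers(0, 2, size=n_rows))

print(check_layout(demo_train, n_features=n_features, has_target=True))
print(check_layout(demo_test, n_features=n_features, has_target=False))
```

On the real data, the equivalent calls would presumably be `check_layout(train_df)` and `check_layout(test_df, has_target=False)`.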

Missing Data

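import numpy as np
import pandas as pd

def missing_data(data):
    # Number of missing values in each column
    total = data.isnull().sum()
    # Share of missing values in each column, in percent
    percent = total / len(data) * 100
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    # Record each column's dtype next to its missing-value counts
    tt['Types'] = data.dtypes.astype(str)
    # Transpose so every original column becomes one column of the summary
    return tt.T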

%%time
missing_data(train_df)

CPU times: user 2.07 s, sys: 134 ms, total: 2.2 s
Wall time: 2.2 s

%%time
missing_data(test_df)

CPU times: user 2.2 s, sys: 132 ms, total: 2.33 s
Wall time: 2.33 s

This confirms that neither dataset contains missing values.
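When only a yes/no answer is needed, the same check can be collapsed into a single boolean. This is a small sketch with a hypothetical helper name (`has_missing`) and toy DataFrames rather than the competition data.

```python
import numpy as np
import pandas as pd

def has_missing(df):
    """Hypothetical helper: True if any cell in df is NaN/None."""
    # First .any() reduces each column, second .any() reduces across columns
    return bool(df.isnull().any().any())

# Toy data standing in for train_df / test_df
demo = pd.DataFrame({"var_0": [1.0, 2.0, 3.0], "var_1": [0.5, 0.1, 0.9]})
demo_with_gap = demo.copy()
demo_with_gap.loc[1, "var_1"] = np.nan  # inject a single missing value

print(has_missing(demo))           # False
print(has_missing(demo_with_gap))  # True
```

A per-column table like the one from `missing_data` is still more useful when missing values do exist, since it shows where they are.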

Describe

%%time
train_df.describe()

%%time
test_df.describe()

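Observations

The standard deviation is large for both train and test.
Summary statistics such as mean, std, and min are very close between train and test. In other words, the two sets appear to represent the same population.
The mean differs from feature to feature, and the means span a wide range.
Train and test are the same size.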
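The claim that train and test share nearly identical summary statistics can be quantified instead of judged by eye from two `describe()` tables. The sketch below is one possible way to do it: `compare_summaries` is a hypothetical helper, and the two DataFrames are synthetic samples drawn from the same assumed normal distributions, not the competition data.

```python
import numpy as np
import pandas as pd

def compare_summaries(train, test, stats=("mean", "std", "min", "max")):
    """Hypothetical helper: absolute train/test gap of each statistic,
    one row per numeric column present in both frames."""
    shared = train.select_dtypes(include="number").columns.intersection(test.columns)
    train_stats = train[shared].describe().loc[list(stats)]
    test_stats = test[shared].describe().loc[list(stats)]
    return (train_stats - test_stats).abs().T

# Synthetic stand-ins: same distribution per feature, different means across features
rng = np.random.default_rng(42)
cols = ["var_0", "var_1", "var_2"]
means = np.array([10.0, -5.0, 0.5])   # assumed values, for illustration only
scales = np.array([3.0, 4.0, 2.0])
demo_train = pd.DataFrame(rng.normal(means, scales, size=(10_000, 3)), columns=cols)
demo_test = pd.DataFrame(rng.normal(means, scales, size=(10_000, 3)), columns=cols)

gaps = compare_summaries(demo_train, demo_test)
print(gaps[["mean", "std"]])
# Spread of the per-feature means, to see how widely they range
print(demo_train[cols].mean().max() - demo_train[cols].mean().min())
```

Small gaps in `mean` and `std` relative to each feature's scale are consistent with both sets coming from the same distribution. `min` and `max` are noisier, so larger gaps there are less informative.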

Feature correlation

Reference

Kaggle Notebook

완숙의 에그머니🍳
