Data Preprocessing


claovy☘️ 2025. 3. 11. 23:47

1. Encoding

Label encoder: converts categorical data into integer codes

from sklearn.preprocessing import LabelEncoder

items = ['TV', '냉장고', '세탁기', '컴퓨터', '전기난로', '컴퓨터', 'TV', '믹서기', '컴퓨터']

encoder = LabelEncoder()

encoder.fit(items) # learns the unique labels, sorted in ascending order

encoded_items = encoder.transform(items)

encoded_items

[Output]
array([0, 1, 3, 5, 4, 5, 0, 2, 5])
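The mapping learned by fit can be inspected after the fact. The following is a minimal sketch that reuses the same items list; classes_ and inverse_transform are standard LabelEncoder members, while the variable names classes and decoded_items are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

items = ['TV', '냉장고', '세탁기', '컴퓨터', '전기난로', '컴퓨터', 'TV', '믹서기', '컴퓨터']

encoder = LabelEncoder()
encoded_items = encoder.fit_transform(items)  # fit and transform in one call

# classes_ holds the sorted unique labels; a label's position is its integer code
classes = list(encoder.classes_)
print(classes)

# inverse_transform maps the integer codes back to the original labels
decoded_items = list(encoder.inverse_transform(encoded_items))
print(decoded_items)
```

Note that the codes only reflect sort order, so a model may read 5 as "larger than" 0 even though the categories have no such ranking.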

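One-hot encoder: converts the data into a sparse array (an array in which only specific indices hold a value). Each category gets its own column, and each row has a 1 only in the column of its category.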

import numpy as np

from sklearn.preprocessing import OneHotEncoder

# reshape the item list into a 2-D array (one column), as OneHotEncoder expects

items = np.array(items).reshape(-1, 1)

# One-Hot encoding

encoder = OneHotEncoder()

encoder.fit(items) # learns the sorted unique categories -> transform then puts a 1 only at each row's category index (sparse matrix)

oh_items = encoder.transform(items)

print(oh_items)

[Output]
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 9 stored elements and shape (9, 6)>
  Coords	Values
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 3)	1.0
  (3, 5)	1.0
  (4, 4)	1.0
  (5, 5)	1.0
  (6, 0)	1.0
  (7, 2)	1.0
  (8, 5)	1.0

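Each entry in the output is a coordinate in the 2-D matrix together with the value stored there. Positions that are not listed are 0.

Coords is the position ([row][column]) / Values is the value actually stored at that position

e.g. (0, 0) 1.0 means the value at row 0, column 0 is 1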
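The coordinate listing can also be read programmatically. The sketch below assumes the same items as above and uses the sparse matrix's nonzero() method together with the encoder's categories_ attribute; the names rows, cols and labels are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

items = np.array(
    ['TV', '냉장고', '세탁기', '컴퓨터', '전기난로', '컴퓨터', 'TV', '믹서기', '컴퓨터']
).reshape(-1, 1)

encoder = OneHotEncoder()
oh_items = encoder.fit_transform(items)  # sparse output by default

# nonzero() returns the row and column indices of the stored values
rows, cols = oh_items.nonzero()
for r, c in zip(rows, cols):
    print((int(r), int(c)), oh_items[r, c])

# categories_[0][c] is the category that column c stands for
labels = [str(encoder.categories_[0][c]) for c in cols]
print(labels)
```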

print(oh_items.toarray())

[Output]
[[1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]]

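For easier computation and visualization, toarray() converts the sparse matrix into a dense one.

* Alternatively, instead of OneHotEncoder plus toarray(), pd.get_dummies can one-hot encode the data directly. It returns a DataFrame, which can then be converted to an ndarray if needed (for example with to_numpy()).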
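A minimal sketch of the pd.get_dummies route, assuming the items are first placed in a DataFrame column. The column name 'item' and the variable names are hypothetical; dtype=float is passed only so the values match the 1.0/0.0 output shown earlier.

```python
import pandas as pd

items = ['TV', '냉장고', '세탁기', '컴퓨터', '전기난로', '컴퓨터', 'TV', '믹서기', '컴퓨터']
df = pd.DataFrame({'item': items})  # 'item' is an illustrative column name

# get_dummies one-hot encodes the column and returns a dense DataFrame
dummies = pd.get_dummies(df['item'], dtype=float)
print(dummies.columns.tolist())

# to_numpy() turns the DataFrame into a plain ndarray
oh_array = dummies.to_numpy()
print(oh_array.shape)
```

Unlike OneHotEncoder, get_dummies does not remember the categories it saw, so it is mainly convenient for one-off exploration.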

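2. Feature scaling (normalization)

Standardization

Transforms each feature to mean 0 and standard deviation 1 => suits data that is approximately normally distributed
Less sensitive to outliers than min-max normalization
Suits algorithms such as linear regression and logistic regression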
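The transformation is z = (x - mean) / std per feature. As a hedged sketch, the snippet below computes it by hand with NumPy on a small made-up column and compares it with StandardScaler, which uses the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[20.0], [30.0], [40.0]])  # hypothetical single-feature data

mean = x.mean(axis=0)
std = x.std(axis=0)  # population standard deviation (ddof=0)
z_manual = (x - mean) / std

z_sklearn = StandardScaler().fit_transform(x)

print(z_manual.ravel())
print(np.allclose(z_manual, z_sklearn))
```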

Min-max normalization

Rescales values into the range 0 to 1
Sensitive to outliers (a single outlier can distort the scaled data)
Suits distance-based models such as SVM and KNN
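The formula is (value - min) / (max - min) per feature. The sketch below applies it by hand to a small made-up column, checks it against MinMaxScaler, and then adds one artificial outlier to illustrate the sensitivity mentioned above. All data values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[20.0], [30.0], [40.0]])  # hypothetical single-feature data

# manual min-max scaling, column by column
scaled_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
scaled_sklearn = MinMaxScaler().fit_transform(x)
print(scaled_manual.ravel())  # [0.  0.5 1. ]

# one extreme value stretches the range and squeezes the rest toward 0
x_outlier = np.array([[20.0], [30.0], [40.0], [1000.0]])
scaled_outlier = MinMaxScaler().fit_transform(x_outlier)
print(scaled_outlier.ravel())
```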

--- 🍀 The iris dataset is used below to practice StandardScaler and MinMaxScaler. ---

StandardScaler

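from sklearn.datasets import load_iris

from sklearn.preprocessing import StandardScaler

# assumption: iris_ds is the object returned by load_iris() (the original does not show how it was loaded)

iris_ds = load_iris()

standard_sc = StandardScaler()

standard_sc.fit(iris_ds.data) # learns the mean and standard deviation of each feature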

standard_sc.transform(iris_ds.data)

MinMaxScaler

from sklearn.preprocessing import MinMaxScaler

minmax_sc = MinMaxScaler()

# minmax_sc.fit_transform([[20], [30], [40]]) # (value - min) / (max - min)

minmax_sc.fit(iris_ds.data)

minmax_sc.transform(iris_ds.data)
