Session 9 - Preprocessing Data Scaling
📅 May 09, 2026
Sklearn Preprocessing
7 tools: what each does, why to use it, and when to use it
⚖ Scalers: putting numeric data on a common scale
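StandardScaler
What it does: Rescales each feature (column) to mean = 0 and standard deviation = 1. Also called Z-score normalization (standardization).
x_new = (x − mean) / std
Why use it: Many ML algorithms need features on the same scale; otherwise the feature with the largest values dominates.
When to use: When the data has few outliers, e.g. linear/logistic regression, SVM, neural networks, PCA.
⚠ With many outliers, RobustScaler is the better choice.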
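A minimal sketch of the formula above, assuming a small made-up array with two features on very different scales (the numbers are hypothetical, not from the session data). It checks that StandardScaler matches the hand-written (x − mean) / std computed per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0],
              [4.0, 4000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same result by hand: subtract the column mean, divide by the column std.
# StandardScaler uses the population std (ddof=0), which is NumPy's default.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))            # approximately 0 for every column
print(X_scaled.std(axis=0))             # approximately 1 for every column
print(np.allclose(X_scaled, X_manual))  # the two computations agree
```

After scaling, both columns hold the same values, so neither feature dominates because of its units.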
MinMaxScaler
What it does: Maps all values of a feature into a fixed range, [0, 1] by default. You can also set the range yourself (feature_range).
x_new = (x − min) / (max − min)
Why use it: When the data must sit in a bounded range, e.g. image pixels (0–255 → 0–1).
When to use: When a [0, 1] range is required, e.g. image data, neural networks, KNN.
⚠ A single outlier distorts the whole range; avoid it when the data has outliers.
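A small sketch of both points above, using made-up values: pixel-like numbers scaled to the default [0, 1] and to a custom range, and a hypothetical column where one outlier squeezes every other value toward 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical pixel intensities (one feature)
pixels = np.array([[0.0], [64.0], [128.0], [255.0]])

scaled_01 = MinMaxScaler().fit_transform(pixels)                         # default range [0, 1]
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(pixels)   # custom range [-1, 1]

# One outlier (1000) stretches max - min, so the other values end up near 0
with_outlier = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
squeezed = MinMaxScaler().fit_transform(with_outlier)

print(scaled_01.ravel())
print(scaled_pm1.ravel())
print(squeezed.ravel())
```

In the last array the four ordinary values all land below 0.01, which is the distortion the warning refers to.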
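RobustScaler
What it does: Uses the median instead of the mean and the IQR (interquartile range, Q3 − Q1) instead of the standard deviation. Outliers barely affect these statistics, though the outliers themselves stay in the data.
x_new = (x − median) / IQR
Why use it: With StandardScaler, outliers bias the mean and std. RobustScaler avoids that problem.
When to use: When the data has many outliers, e.g. medical, financial, and other real-world data.
✓ Usually the best choice for data with outliers.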
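A rough comparison on a made-up column with one extreme value. It reproduces RobustScaler by hand with the median and IQR, and shows that StandardScaler compresses the ordinary values much more because the outlier inflates the std.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature: five ordinary values and one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

robust = RobustScaler().fit_transform(x)
standard = StandardScaler().fit_transform(x)

# Manual version of the formula: (x - median) / (Q3 - Q1)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

# Spread of the five ordinary values under each scaler
spread_robust = robust[:5].max() - robust[:5].min()
spread_standard = standard[:5].max() - standard[:5].min()

print(np.allclose(robust, manual))
print(spread_robust, spread_standard)
```

The outlier is still present after RobustScaler (it becomes a very large scaled value); the scaler only stops it from deciding the scale of everything else.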
🔧 Transformers: changing the form of the data
Binarizer
What it does: You give a threshold; values above it become 1, values below or equal to it become 0. Continuous → binary.
x_new = 1 if x > threshold, else 0
Why use it: When only presence/absence matters, not the exact value. Example: did it rain or not.
When to use: When only a yes/no signal is needed, e.g. word presence, image thresholding.
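The rain example can be sketched as follows, assuming hypothetical daily rainfall values in millimetres. The second computation writes the same rule directly in NumPy to show that Binarizer is just a strict greater-than comparison.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical daily rainfall in millimetres (one feature, four days)
rain_mm = np.array([[0.0], [2.5], [0.0], [12.0]])

# threshold=0.0: any rainfall above 0 mm counts as "it rained"
rained = Binarizer(threshold=0.0).fit_transform(rain_mm)

# The same rule written directly with NumPy
rained_manual = (rain_mm > 0.0).astype(float)

print(rained.ravel())
print(np.array_equal(rained, rained_manual))
```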
Normalizer
What it does: Rescales each row (sample) to length 1. Unlike the other tools here, it works on rows, not columns.
x_new = x / ||x|| (L2 norm by default)
Why use it: When only direction matters, not magnitude. Useful for text similarity and document comparison.
When to use: When row direction is what matters, e.g. NLP, TF-IDF vectors, cosine similarity.
⚠ Remember: it normalizes rows, not columns.
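To illustrate why direction-only scaling helps with document comparison, here is a sketch with three hypothetical word-count vectors. Once every row has length 1, a plain dot product between rows is their cosine similarity.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical word-count vectors: the second document is the first one repeated 3 times
docs = np.array([[1.0, 2.0, 0.0],
                 [3.0, 6.0, 0.0],
                 [0.0, 1.0, 5.0]])

unit = Normalizer(norm="l2").fit_transform(docs)

# Every row now has Euclidean length 1
row_lengths = np.linalg.norm(unit, axis=1)

# Dot products between unit-length rows are cosine similarities
cosine = unit @ unit.T

print(row_lengths)
print(cosine.round(3))
```

Documents 1 and 2 differ only in length, so their similarity is 1; document 3 points in a different direction and scores much lower.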
🏷 Encoders: turning categories into numbers
OneHotEncoder
What it does: Turns a categorical column into binary columns: one new column per category, with 1 in the matching column and 0 in the rest.
red → [1,0,0] | green → [0,1,0] | blue → [0,0,1]
Why use it: The categories have no order (you cannot say red > green). With integer codes from LabelEncoder, the model can learn a false order.
When to use: For unordered categories in feature columns (X), e.g. color, country, city.
⚠ With many categories the number of columns explodes (the high-cardinality problem).
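LabelEncoder
What it does: Converts categorical labels to integers 0, 1, 2, 3, ...
cat → 0 | dog → 1 | fish → 2
Why use it: The target/output column (y) must be numeric. On feature columns, use integer codes only for ordinal (ordered) data; LabelEncoder assigns codes in sorted label order, so for ordinal features scikit-learn's OrdinalEncoder with an explicit category order is the safer tool.
When to use: For the target column (y), or for ordinal data such as small/medium/large.
⚠ Do not use it on nominal feature columns; the model learns a false order.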
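A minimal sketch of the two cases, with hypothetical labels. LabelEncoder is applied to a target column and reversed with inverse_transform. For the ordinal feature, OrdinalEncoder is swapped in (it is not one of the 7 tools in these notes) because it lets you state the order explicitly, whereas LabelEncoder sorts labels alphabetically.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Target column (y): LabelEncoder assigns codes in sorted label order
y = ["cat", "dog", "fish", "dog"]
le = LabelEncoder()
y_encoded = le.fit_transform(y)
y_decoded = le.inverse_transform(y_encoded)   # back to the original labels

# Ordinal feature: state the intended order explicitly
sizes = np.array([["small"], ["large"], ["medium"], ["small"]], dtype=object)
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes_encoded = ordinal.fit_transform(sizes)

# LabelEncoder on the same values sorts alphabetically: large=0, medium=1, small=2
alphabetical = LabelEncoder().fit_transform(sizes.ravel())

print(y_encoded, list(y_decoded))
print(sizes_encoded.ravel())
print(alphabetical)
```

The alphabetical codes put "small" above "large", which is exactly the false order the warning is about.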
⚡ Quick guide: which tool, when?
| Tool | When? | Use case |
|---|---|---|
| StandardScaler | Few outliers | Linear/Logistic Regression, SVM, Neural Net, PCA |
| MinMaxScaler | Range [0, 1] needed | Image data, Neural Network, KNN |
| RobustScaler | Many outliers | Medical / financial / real-world data |
| Binarizer | Yes/no only | Word presence, image thresholding |
| Normalizer | Row direction matters | NLP, TF-IDF, cosine similarity |
| OneHotEncoder | Categories, no order | Color, country, city: feature columns (X) |
| LabelEncoder | Target column or ordinal | y column, small/medium/large, Random Forest |
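💡 Tree-based models do not need scaling
Random Forest, XGBoost, Decision Tree: these do not need StandardScaler / MinMaxScaler, because they split on one feature at a time and only the ordering of values matters. They only need an encoder for categorical columns.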
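A rough check of this claim on synthetic data (the data, the split, and the model settings are all assumptions for illustration). A decision tree is trained once on raw features and once on standardized features; the predictions are expected to agree, apart from possible floating-point edge cases.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the second feature is about 1000 times larger than the first
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1.0, 1000.0])
y = (X[:, 0] + X[:, 1] / 1000.0 > 0).astype(int)

X_train, X_test = X[:150], X[150:]
y_train = y[:150]

# Tree trained on the raw features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred_raw = tree_raw.predict(X_test)

# Tree trained on standardized features (scaler fitted on the training rows only)
scaler = StandardScaler().fit(X_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_train), y_train)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

# Fraction of test rows where both trees predict the same class
agreement = np.mean(pred_raw == pred_scaled)
print(agreement)
```

Standardization keeps the order of values within each column, so the tree finds the same splits at rescaled thresholds.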
⚠ The most important rule: data leakage
Always call fit() on the training data only. On the test data call transform() only. Otherwise information from the test set leaks into preprocessing and the model effectively cheats.
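The rule can be sketched as follows, assuming synthetic data and an 80/20 split (both are illustrative choices). The scaler learns its mean and std from the training rows only and reuses them for the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = (X[:, 0] > 50.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit: learns mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # transform only: reuses the training mean/std

print(scaler.mean_)                 # equals the training column means
print(X_test_scaled.mean(axis=0))   # close to 0, but not exactly 0
```

The test set mean is not exactly 0 after scaling, and that is expected: the test rows were never used to compute the statistics.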
Jupyter Notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler,MinMaxScaler,RobustScaler,Binarizer,Normalizer,OneHotEncoder,LabelEncoder
np.random.seed(5)
a = np.random.normal(0, 2, 10000)   # normally distributed data: mean 0, std 2, 10000 samples
b = np.random.normal(5, 3, 10000)   # mean 5, std 3
c = np.random.normal(-5, 5, 10000)  # mean -5, std 5
df = pd.DataFrame({
'A' : a,
'B' : b,
'C' : c
})
df
| | A | B | C |
|---|---|---|---|
| 0 | 0.882455 | 1.466990 | -7.089363 |
| 1 | -0.661740 | 0.571914 | -3.105893 |
| 2 | 4.861542 | 7.828852 | -1.039068 |
| 3 | -0.504184 | 5.001904 | -9.755722 |
| 4 | 0.219220 | 1.731128 | -7.679528 |
| ... | ... | ... | ... |
| 9995 | -0.068972 | 3.739968 | -4.303864 |
| 9996 | -2.167260 | 0.681434 | -9.584910 |
| 9997 | 2.188712 | 9.860986 | -9.986860 |
| 9998 | 0.595445 | 5.982577 | -11.128039 |
| 9999 | 0.122246 | 9.105143 | -6.612080 |
10000 rows × 3 columns
df.plot.kde()
<AxesSubplot:ylabel='Density'>
scaler = StandardScaler()
scaled = scaler.fit_transform(df)
# Store the result under a new name so the raw df is not overwritten
df_std = pd.DataFrame(scaled, columns=['A','B','C'])
df_std.plot.kde()
<AxesSubplot:ylabel='Density'>
minmax = MinMaxScaler()
mm_scaled = minmax.fit_transform(df)
# Store the result under a new name so df is not overwritten
df_mm = pd.DataFrame(mm_scaled, columns=['A','B','C'])
df_mm.plot.kde()
<AxesSubplot:ylabel='Density'>
rscaler = RobustScaler()
robust = rscaler.fit_transform(df)
# Store the result under a new name so df is not overwritten
df_robust = pd.DataFrame(robust, columns=['A','B','C'])
df_robust.plot.kde()
<AxesSubplot:ylabel='Density'>
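binary = Binarizer(threshold=5)   # values > 5 become 1; values <= 5 become 0
arr = np.array([[30, 10, 22],
                [20,  2, 10],
                [33,  5,  5]])
binary1 = binary.transform(arr)   # Binarizer is stateless, so transform() works without fit()
binary1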
array([[1, 1, 1],
[1, 0, 1],
[1, 0, 0]])
X = [[5, 2, 3],
[2, 4, 10],
[6, 8, 6]]
# L2 norm (Euclidean length) of each row
row_norms = np.linalg.norm(X, axis=1)
len(X)
3
# Divide every row by its own norm (the manual equivalent of Normalizer)
np.array(X) / row_norms[:, None]
array([[0.81110711, 0.32444284, 0.48666426],
[0.18257419, 0.36514837, 0.91287093],
[0.51449576, 0.68599434, 0.51449576]])
normalizer = Normalizer()
data_tf = normalizer.fit_transform(X)
data_tf
array([[0.81110711, 0.32444284, 0.48666426],
[0.18257419, 0.36514837, 0.91287093],
[0.51449576, 0.68599434, 0.51449576]])
dff = pd.read_csv("/Users/shridharmankar/encoding.csv")
dff
| | TEAM | YEAR |
|---|---|---|
| 0 | A | 2000 |
| 1 | B | 2002 |
| 2 | C | 2003 |
| 3 | D | 2004 |
| 4 | A | 2005 |
| 5 | C | 2006 |
| 6 | B | 2007 |
| 7 | A | 2008 |
| 8 | D | 2009 |
le = LabelEncoder()
df1 = dff.copy()   # copy, so the original dff is not modified
df1['TEAM'] = le.fit_transform(df1['TEAM'])
df1
| | TEAM | YEAR |
|---|---|---|
| 0 | 0 | 2000 |
| 1 | 1 | 2002 |
| 2 | 2 | 2003 |
| 3 | 3 | 2004 |
| 4 | 0 | 2005 |
| 5 | 2 | 2006 |
| 6 | 1 | 2007 |
| 7 | 0 | 2008 |
| 8 | 3 | 2009 |
enc = OneHotEncoder()
enc_df1 = pd.DataFrame(enc.fit_transform(df1[['TEAM']]).toarray())
enc_df1
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | 0.0 | 1.0 | 0.0 | 0.0 |
| 7 | 1.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 1.0 |
abc = df1.join(enc_df1)
abc
| | TEAM | YEAR | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|---|
| 0 | 0 | 2000 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 2002 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 2 | 2003 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 3 | 2004 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0 | 2005 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 2 | 2006 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | 1 | 2007 | 0.0 | 1.0 | 0.0 | 0.0 |
| 7 | 0 | 2008 | 1.0 | 0.0 | 0.0 | 0.0 |
| 8 | 3 | 2009 | 0.0 | 0.0 | 0.0 | 1.0 |
final = abc.drop(['TEAM'], axis='columns')
final
| | YEAR | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| 0 | 2000 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2002 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 2003 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2004 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 2005 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 2006 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | 2007 | 0.0 | 1.0 | 0.0 | 0.0 |
| 7 | 2008 | 1.0 | 0.0 | 0.0 | 0.0 |
| 8 | 2009 | 0.0 | 0.0 | 0.0 | 1.0 |