Machine Learning Basics
Introduction
Machine learning enables computers to learn patterns from data rather than follow explicitly programmed rules. This guide covers ML fundamentals, supervised and unsupervised learning, scikit-learn workflows, neural networks with TensorFlow, model evaluation, and practical model deployment.
1. ML Environment Setup
# Create virtual environment
python -m venv ml-env
source ml-env/bin/activate # Windows: ml-env\Scripts\activate
# Install ML libraries
pip install numpy pandas matplotlib seaborn
pip install scikit-learn tensorflow keras
pip install jupyter notebook
# Verify installation
python -c "import sklearn; print(sklearn.__version__)"
python -c "import tensorflow as tf; print(tf.__version__)"
# GPU support (optional)
pip install "tensorflow[and-cuda]"
# Data science stack
pip install scipy statsmodels
pip install plotly
# Start Jupyter
jupyter notebook
2. Data Preprocessing
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Load data
df = pd.read_csv('data.csv')
# Explore data
print(df.head())
print(df.info())
print(df.describe())
print(df.isnull().sum())
# Handle missing values
df['age'] = df['age'].fillna(df['age'].median())
df['category'] = df['category'].fillna(df['category'].mode()[0])
df = df.dropna(subset=['critical_column'])
# Encode categorical variables
label_encoder = LabelEncoder()
df['category_encoded'] = label_encoder.fit_transform(df['category'])
# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['category'], prefix='cat')
# Split data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
# Feature scaling (fit the scaler on the training set only to avoid data leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_scaled = scaler.transform(X)  # scaled copy of all features, used later for clustering/PCA
# Feature engineering
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100],
labels=['child', 'young', 'middle', 'senior'])
df['interaction'] = df['feature1'] * df['feature2']
df['log_transform'] = np.log1p(df['skewed_feature'])
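These steps can also be bundled into a single scikit-learn Pipeline with a ColumnTransformer, so scaling and encoding are fit only on the training data and reapplied identically at prediction time. A minimal sketch, assuming hypothetical raw columns 'age', 'income', and 'category' are still present in X_train:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Column groups are placeholders; substitute your own feature names
numeric_features = ['age', 'income']
categorical_features = ['category']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

# Preprocessing is fit on the training folds only, so cross-validation stays leakage-free
clf = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))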
3. Supervised Learning - Classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Classification Report
print(classification_report(y_test, y_pred))
# Decision Tree
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
# Feature importance
importances = rf.feature_importances_
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': importances
}).sort_values('importance', ascending=False)
print(feature_importance)
# Support Vector Machine
svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
# Predict probabilities
y_proba = log_reg.predict_proba(X_test)
print(f"Probability predictions: {y_proba[:5]}")
4. Supervised Learning - Regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
# Model evaluation
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R² Score: {r2:.4f}")
# Ridge Regression (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
# Lasso Regression (L1 regularization)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
# Gradient Boosting
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
max_depth=3, random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
# Compare models
models = {
'Linear': lr,
'Ridge': ridge,
'Lasso': lasso,
'GradientBoosting': gb
}
for name, model in models.items():
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"{name}: RMSE={rmse:.4f}, R²={r2:.4f}")
5. Unsupervised Learning
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# K-Means Clustering (distance-based, so cluster on the scaled features)
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
# Add cluster labels to dataframe
df['cluster'] = clusters
# Cluster centers
print("Cluster Centers:")
print(kmeans.cluster_centers_)
# Elbow method to find optimal k
inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
import matplotlib.pyplot as plt
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
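The elbow plot can be ambiguous, so the silhouette score is a useful complementary check. A sketch over the same candidate values of k, using the scaled features:
from sklearn.metrics import silhouette_score

# Silhouette needs at least 2 clusters; higher is better (max 1.0)
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_scaled)
    print(f"k={k}: silhouette={silhouette_score(X_scaled, labels):.3f}")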
# DBSCAN (Density-based clustering)
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters_dbscan = dbscan.fit_predict(X_scaled)
# Principal Component Analysis (PCA)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.4f}")
# Visualize PCA
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.colorbar()
plt.show()
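Instead of fixing n_components, PCA also accepts a fraction between 0 and 1 and keeps just enough components to explain that share of the variance, which is a convenient way to choose the dimensionality:
# Keep enough components to explain 95% of the variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)
print(f"Components kept: {pca_95.n_components_}")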
6. Model Evaluation & Tuning
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve
# Cross-validation
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation scores: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
# Hyperparameter tuning with GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
best_model = grid_search.best_estimator_
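When the grid grows large, RandomizedSearchCV samples a fixed number of parameter combinations instead of trying them all, which is usually much cheaper. A sketch reusing the same parameter space (n_iter=10 is an arbitrary budget):
from sklearn.model_selection import RandomizedSearchCV

# Sample 10 random combinations instead of the full grid
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=10,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")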
# ROC Curve
y_proba = best_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_proba)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
# Learning curves
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
best_model, X, y, cv=5, n_jobs=-1,
train_sizes=np.linspace(0.1, 1.0, 10)
)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('Training set size')
plt.ylabel('Score')
plt.legend()
plt.show()
7. Neural Networks with TensorFlow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Simple Neural Network
model = keras.Sequential([
layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
layers.Dropout(0.3),
layers.Dense(32, activation='relu'),
layers.Dropout(0.3),
layers.Dense(1, activation='sigmoid') # Binary classification
])
model.compile(
optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy']
)
# Train model
history = model.fit(
X_train, y_train,
epochs=50,
batch_size=32,
validation_split=0.2,
verbose=1
)
# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")
# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training')
plt.plot(history.history['val_accuracy'], label='Validation')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training')
plt.plot(history.history['val_loss'], label='Validation')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
# Make predictions
predictions = model.predict(X_test)
y_pred = (predictions > 0.5).astype(int)
# Multi-class classification
model_multiclass = keras.Sequential([
layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
layers.BatchNormalization(),
layers.Dropout(0.4),
layers.Dense(64, activation='relu'),
layers.BatchNormalization(),
layers.Dropout(0.4),
layers.Dense(10, activation='softmax') # 10 classes
])
model_multiclass.compile(
optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
# Early stopping
early_stopping = keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=5,
restore_best_weights=True
)
# Model checkpoint
checkpoint = keras.callbacks.ModelCheckpoint(
'best_model.h5',
monitor='val_accuracy',
save_best_only=True
)
# Train with callbacks
history = model_multiclass.fit(
X_train, y_train,
epochs=100,
batch_size=32,
validation_split=0.2,
callbacks=[early_stopping, checkpoint],
verbose=1
)
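For the multi-class model, predict returns one probability per class; the predicted label is the argmax. A short sketch, assuming integer-encoded labels as expected by sparse_categorical_crossentropy:
# Each row of probs sums to 1 across the 10 classes
probs = model_multiclass.predict(X_test)
y_pred_classes = np.argmax(probs, axis=1)
print(f"First 5 predicted classes: {y_pred_classes[:5]}")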
8. Model Deployment
# Save scikit-learn model
import joblib
joblib.dump(best_model, 'model.pkl')
joblib.dump(scaler, 'scaler.pkl')
# Load model
loaded_model = joblib.load('model.pkl')
loaded_scaler = joblib.load('scaler.pkl')
# Make prediction
new_data = [[25, 50000, 3]] # age, income, years_experience
new_data_scaled = loaded_scaler.transform(new_data)
prediction = loaded_model.predict(new_data_scaled)
# Flask API for model serving
from flask import Flask, request, jsonify
import joblib
app = Flask(__name__)
model = joblib.load('model.pkl')
scaler = joblib.load('scaler.pkl')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = [[data['age'], data['income'], data['experience']]]
    features_scaled = scaler.transform(features)
    prediction = model.predict(features_scaled)
    probability = model.predict_proba(features_scaled)
    return jsonify({
        'prediction': int(prediction[0]),
        'probability': float(probability[0][1])
    })

if __name__ == '__main__':
    app.run(debug=True)
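Once the Flask app is running locally, the endpoint can be exercised with a small client script. A sketch assuming the default development address localhost:5000, the hypothetical age/income/experience fields above, and that the requests package is installed:
import requests

# Example request against the local development server
payload = {'age': 25, 'income': 50000, 'experience': 3}
response = requests.post('http://localhost:5000/predict', json=payload)
print(response.json())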
# Save TensorFlow model
model.save('tf_model.h5')
model.save('saved_model') # SavedModel format
# Load TensorFlow model
loaded_tf_model = keras.models.load_model('tf_model.h5')
# Convert to TensorFlow Lite (mobile)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
# ONNX export for cross-platform
import tf2onnx
spec = (tf.TensorSpec((None, X_train.shape[1]), tf.float32, name="input"),)
output_path = model.save("tf_model")
model_proto, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13)
with open("model.onnx", "wb") as f:
f.write(model_proto.SerializeToString())
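The exported ONNX file can then be run outside TensorFlow with onnxruntime. A sketch assuming onnxruntime is installed; the input name is looked up from the session rather than hard-coded, since the exporter may rename it:
import numpy as np
import onnxruntime as ort

# Run the exported model with ONNX Runtime
sess = ort.InferenceSession('model.onnx')
input_name = sess.get_inputs()[0].name
sample = np.asarray(X_test[:5], dtype=np.float32)
outputs = sess.run(None, {input_name: sample})
print(outputs[0])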
9. Best Practices
Machine Learning Best Practices:
- ✓ Always split data into train/validation/test sets
- ✓ Scale/normalize features before training
- ✓ Handle missing values appropriately
- ✓ Use cross-validation for model evaluation
- ✓ Monitor for overfitting (gap between train/val performance)
- ✓ Start with simple models before complex ones
- ✓ Use regularization to prevent overfitting
- ✓ Feature engineering is often more important than algorithms
- ✓ Document data preprocessing steps
- ✓ Version control datasets and models
- ✓ Monitor model performance in production
- ✓ Retrain models periodically with new data
- ✓ Understand your problem before choosing algorithms
- ✓ Always have a baseline to compare against (see the sketch after this list)
- ✓ Visualize data and results extensively
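A baseline takes only a couple of lines with scikit-learn's DummyClassifier. This sketch reuses the earlier train/test split and predicts the most frequent class, the floor any real model should beat comfortably:
from sklearn.dummy import DummyClassifier

# Predicts the most frequent class regardless of input features
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.4f}")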
Conclusion
Machine Learning empowers data-driven decision making. Master data preprocessing, understand supervised and unsupervised learning algorithms, evaluate models properly, and deploy responsibly. Start with scikit-learn for traditional ML, then explore deep learning with TensorFlow for complex problems.
💡 Pro Tip: Before diving into complex deep learning models, always establish a strong baseline with simple models like logistic regression or random forests. Often, well-engineered features with simpler models outperform complex neural networks, especially with limited data. Focus on understanding your data first, algorithms second.