SageMaker Workflow Process / A Machine Learning Pipeline by AWS

Amazon SageMaker is AWS's managed machine learning service; the sections below walk through the typical ML workflow it supports.

Data Collection & Integration

  • Prediction: The label / target — the value the model is trying to predict.
  • Good Data: Good data contains a signal about the phenomenon you’re trying to model.
  • Observation: A single data point, made up of the label and the features.
  • Dataset: A stack of many observations.
  • Data-points-to-features ratio: You need at least 10 times as many data points as features. So if you’ve got five features, your training data should contain at least 50 data points.
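The 10-to-1 rule of thumb above is easy to sanity-check in code. A minimal sketch (the helper name is hypothetical, not a SageMaker API):

```python
# Rule-of-thumb check: at least 10 observations per feature.
# `has_enough_data` is a hypothetical helper illustrating the ratio above.
def has_enough_data(n_observations: int, n_features: int, ratio: int = 10) -> bool:
    """Return True if the dataset meets the 10-observations-per-feature rule."""
    return n_observations >= ratio * n_features

# 5 features -> need at least 50 observations
print(has_enough_data(50, 5))  # True
print(has_enough_data(40, 5))  # False
```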

Data Preparation

Data Visualization & Analysis

  • Histograms: Histograms are effective visualizations for spotting outliers in data.
  • Imputation: Imputation makes a best guess as to what a value should actually be. In a regression problem, you can handle missing data — and even replace outlier values — by assigning a new value via imputation.
  • Scatter Plots: Visualize the relationship between the features and the labels. It’s important to understand whether there’s a strong correlation between features and labels.
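Imputation as described above can be sketched in a few lines. This is a minimal pure-Python mean-imputation example (an illustration, not a SageMaker or scikit-learn API; missing values are represented as `None`):

```python
# Mean imputation sketch: replace each missing value (None) with the
# mean of the observed values in the same column.
def impute_mean(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [22, None, 30, 26, None]
print(impute_mean(ages))  # [22, 26.0, 30, 26, 26.0]
```

In practice you would use a library implementation (e.g. scikit-learn's `SimpleImputer`), which also supports median or most-frequent strategies.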

Feature Selection & Engineering

Model Training

  • Randomize Data: Randomize the data during your split to help your model avoid bias. This is especially important with structured data, where the data may arrive in a specific order.
  • Underfitting: Low variance and high bias. These models are overly simple and they can’t really see the underlying patterns in the data.
  • Overfitting: High-variance and low bias. These models are overly complex, and while they can detect patterns in the training data, they’re not accurate outside of the training data.
  • Parameter:
    • Internal to the model; something the model can learn or estimate purely from the data.
    • Examples include the weights of a neural network or the coefficients in linear regression.
    • The model must have parameters to make predictions, and most often these aren’t set by humans.
  • Hyperparameters: Set by humans. Typically you can’t know the best value of a hyperparameter in advance, but you can use trial and error to get there.
    • An example is the learning rate for training a neural network.
  • Hyperparameter Tuning: One technique that can be used to combat underfitting and overfitting.
  • Categories of hyperparameters to tune:
    • Loss function
    • Regularization
    • Learning parameters (e.g. learning rate, number of epochs)
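The randomized split described above can be sketched in a few lines of pure Python (a hypothetical helper for illustration — in practice you'd use a library function such as scikit-learn's `train_test_split` with `shuffle=True`):

```python
import random

# Randomized train/test split sketch: shuffling before the split avoids
# order bias when the data arrives sorted (e.g. by date or by class).
def split_dataset(dataset, test_ratio=0.2, seed=42):
    data = list(dataset)
    random.Random(seed).shuffle(data)        # randomize to avoid order bias
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

observations = list(range(100))              # stand-in for real observations
train, test = split_dataset(observations)
print(len(train), len(test))  # 80 20
```

A fixed seed keeps the split reproducible across runs while still breaking any ordering in the incoming data.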

Model Evaluation



Author: Yuzu
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stated otherwise.