The complete course Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization is currently offered by DeepLearning.AI through the Coursera platform.
About this Course
In the second course of the Deep Learning Specialization, you will open the deep learning black box to understand the processes that drive performance and generate good results systematically.
By the end, you will learn the best practices to train and develop test sets and analyze bias/variance for building deep learning applications; be able to use standard neural network techniques such as initialization, L2 and dropout regularization, hyperparameter tuning, batch normalization, and gradient checking; implement and apply a variety of optimization algorithms, such as mini-batch gradient descent, Momentum, RMSprop and Adam, and check for their convergence; and implement a neural network in TensorFlow.
Instructors:
- Andrew Ng
- Kian Katanforoosh
- Younes Bensouda Mourri
Skills You Will Gain
- TensorFlow
- Deep Learning
- Mathematical Optimization
- Hyperparameter Tuning
- 98% train. 1% dev. 1% test
- 60% train. 20% dev. 20% test
- 33% train. 33% dev. 33% test
- Come from the same distribution
- Have the same number of examples
- Come from different distributions
- Be identical to each other (same (x,y) pairs)
- Get more test data
- Add regularization
- Get more training data
- Make the Neural Network deeper
- Increase the number of units in each hidden layer
- Get more training data
- Use a bigger neural network
- Increase the regularization parameter lambda
- Decrease the regularization parameter lambda
- The process of gradually decreasing the learning rate during training.
- A technique to avoid vanishing gradient by imposing a ceiling on the values of the weights.
- Gradual corruption of the weights in the neural network if it is trained on noisy data.
- A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.
- Weights are pushed toward becoming smaller (closer to 0)
- Doubling lambda should roughly result in doubling the weights
- Weights are pushed toward becoming bigger (further from 0)
- Gradient descent taking bigger steps with each iteration (proportional to lambda)
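As a rough illustration of the weight decay options above, the sketch below applies the usual L2-regularized gradient descent update, w := (1 - αλ/m)·w - α·grad, to a single weight. The helper name `l2_step` and every numeric value are hypothetical choices made for this example, not taken from the course material. The data gradient is set to zero so that only the shrinking effect of the regularization term is visible.

```python
def l2_step(w, grad, alpha, lam, m):
    """One gradient descent step with L2 regularization (illustrative helper).

    The regularization term (lam / m) * w is added to the gradient, which is
    equivalent to multiplying w by (1 - alpha * lam / m) on every iteration.
    """
    return (1 - alpha * lam / m) * w - alpha * grad


# Assumed toy setup: one weight, zero data gradient, m = 1 example.
w_small_lambda = 1.0
w_large_lambda = 1.0
for _ in range(100):
    w_small_lambda = l2_step(w_small_lambda, grad=0.0, alpha=0.1, lam=0.1, m=1)
    w_large_lambda = l2_step(w_large_lambda, grad=0.0, alpha=0.1, lam=0.2, m=1)

# Both weights shrink toward 0; the larger lambda shrinks its weight faster.
print(w_small_lambda, w_large_lambda)
```

In this toy run, increasing lambda pushes the weight closer to 0 rather than making it larger.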
- You apply dropout (randomly eliminating units) and do not keep the 1/keep_prob factor in the calculations used in training
- You do not apply dropout (do not randomly eliminate units), but keep the 1/keep_prob factor in the calculations used in training.
- You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training
- You apply dropout (randomly eliminating units) but keep the 1/keep_prob factor in the calculations used in training
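The options above concern inverted dropout at training time versus test time. A minimal sketch of the idea follows, assuming a plain Python list of activations; the function name `inverted_dropout`, the `training` flag, and the value keep_prob = 0.8 are assumptions made for illustration only.

```python
import random


def inverted_dropout(activations, keep_prob, training, rng=random):
    """Apply inverted dropout to a list of activations (illustrative helper)."""
    if not training:
        # Test time: no units are dropped and no 1/keep_prob factor is applied.
        return list(activations)
    out = []
    for a in activations:
        keep = rng.random() < keep_prob  # keep each unit with probability keep_prob
        # Scale the surviving units by 1/keep_prob so the expected value is unchanged.
        out.append(a / keep_prob if keep else 0.0)
    return out


rng = random.Random(0)  # fixed seed so the run is repeatable
a = [1.0] * 10000
train_out = inverted_dropout(a, keep_prob=0.8, training=True, rng=rng)
test_out = inverted_dropout(a, keep_prob=0.8, training=False)

train_mean = sum(train_out) / len(train_out)
print(train_mean)    # stays close to the original mean of 1.0
print(test_out[:3])  # unchanged at test time
```

Because the scaling happens during training, the test-time path needs neither the random mask nor the 1/keep_prob factor.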
- Increasing the regularization effect
- Reducing the regularization effect
- Causing the neural network to end up with a higher training set error
- Causing the neural network to end up with a lower training set error
- Vanishing gradient
- Gradient Checking
- Xavier initialization
- Dropout
- Data augmentation
- L2 regularization
- Exploding gradient
- It makes the parameter initialization faster
- It makes it easier to visualize the data
- It makes the cost function faster to optimize
- Normalization is another word for regularization--It helps to reduce variance
- a[3]{7}(8)
- a[8]{7}(3)
- a[3]{8}(7)
- a[8]{3}(7)
- Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
- You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).
- One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
- If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
- If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
- If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
- If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress. ---> Correct
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this (plot not reproduced here):
- If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.
- If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.
- Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.
- Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.
- v2=7.5, v2corrected=7.5
- v2=10, v2corrected=7.5
- v2=7.5, v2corrected=10
- v2=10, v2corrected=10
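The v2 and v2corrected options above refer to an exponentially weighted average with bias correction. The sketch below works through the arithmetic, assuming β = 0.5, v0 = 0, and two observations that are both 10; the question stem is not part of this text, so these inputs are assumptions chosen to be consistent with the listed options. The function name is hypothetical.

```python
def ewa_with_bias_correction(observations, beta):
    """Return (v_t, corrected v_t) for each step of an exponentially weighted average."""
    v = 0.0  # v0 = 0
    history = []
    for t, theta in enumerate(observations, start=1):
        v = beta * v + (1 - beta) * theta  # v_t = beta * v_{t-1} + (1 - beta) * theta_t
        v_corrected = v / (1 - beta ** t)  # bias correction divides by (1 - beta^t)
        history.append((v, v_corrected))
    return history


# Assumed inputs: theta_1 = theta_2 = 10, beta = 0.5.
history = ewa_with_bias_correction([10.0, 10.0], beta=0.5)
v2, v2_corrected = history[-1]
print(v2, v2_corrected)  # 7.5 10.0
```

Under these assumptions the uncorrected average lags behind the data, and the correction restores it to the observed value.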
- α = t^(-0.5) · α0
- α = (1 / (1 + 2t)) · α0
- α = (0.95^t) · α0
- α = (e^t) · α0
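One quick way to compare the four learning rate formulas above is to evaluate each one for a few values of t and check whether the learning rate actually shrinks. The sketch below does this, assuming an initial rate α0 = 0.1 and t = 1..5; both are arbitrary illustrative values.

```python
import math

alpha0 = 0.1  # assumed initial learning rate

# The four candidate schedules from the list above.
schedules = {
    "t^(-0.5) * alpha0": lambda t: alpha0 / math.sqrt(t),
    "1/(1+2t) * alpha0": lambda t: alpha0 / (1 + 2 * t),
    "0.95^t * alpha0": lambda t: alpha0 * 0.95 ** t,
    "e^t * alpha0": lambda t: alpha0 * math.exp(t),
}

trends = {}
for name, schedule in schedules.items():
    values = [schedule(t) for t in range(1, 6)]
    is_decaying = all(later < earlier for earlier, later in zip(values, values[1:]))
    trends[name] = "decreasing" if is_decaying else "not decreasing"
    print(name, trends[name])
```

Only the schedule based on e^t grows with t, so it does not behave as a learning rate decay.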
- Increasing β will create more oscillations within the red line.
- Increasing β will shift the red line slightly to the right.
- Decreasing β will create more oscillation within the red line.
- (1) is gradient descent with momentum (small β), (2) is gradient descent with momentum (small β), (3) is gradient descent
- (1) is gradient descent. (2) is gradient descent with momentum (large β). (3) is gradient descent with momentum (small β)
- (1) is gradient descent with momentum (small β). (2) is gradient descent. (3) is gradient descent with momentum (large β)
- (1) is gradient descent. (2) is gradient descent with momentum (small β). (3) is gradient descent with momentum (large β)
- Try mini-batch gradient descent
- Try better random initialization for the weights
- Try tuning the learning rate α
- Try using Adam
- Try initializing all the weights to zero
- Adam should be used with batch gradient computations, not with mini-batches.
- The learning rate hyperparameter α in Adam usually needs to be tuned.
- We usually use “default” values for the hyperparameters β1, β2 and ε in Adam (β1 = 0.9, β2 = 0.999, ε = 10^-8)
- Adam combines the advantages of RMSProp and momentum
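To make the Adam statements above more concrete, here is a minimal single-parameter sketch of the update, using the default β1, β2 and ε quoted in the list. The function name `adam_minimize`, the toy objective (w - 3)^2, the learning rate 0.1, and the step count are all assumptions for illustration; this is not the course's reference implementation.

```python
import math


def adam_minimize(grad_fn, w, alpha, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a scalar function with Adam, given its gradient (illustrative helper)."""
    m = 0.0  # momentum-style average of gradients
    v = 0.0  # RMSprop-style average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        w -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); alpha is the knob to tune.
w_final = adam_minimize(lambda w: 2 * (w - 3), w=0.0, alpha=0.1, steps=500)
print(w_final)  # ends near the minimizer w = 3
```

The first-moment term plays the role of momentum and the second-moment term plays the role of RMSprop, while α is left as the hyperparameter to tune.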
- True
- False
- True
- False
- The amount of computational power you can access
- The number of hyperparameters you have to tune
- Whether you use batch or mini-batch optimization
- The presence of local minima (and saddle points) in your neural network
- True
- False
- z^[l]
- To speed up convergence
- To have a more accurate normalization
- In case μ is too small
- To avoid division by zero
- They can be learned using Adam, Gradient descent with momentum, or RMSprop, not just with gradient descent.
- They set the mean and variance of the linear variable z^[l] of a given layer.
- The optimal values are γ = √(σ² + ε) and β = μ.
- β and γ are hyperparameters of the algorithm, which we tune via random sampling
- There is one global value of γ and one global value of β for each layer, and it applies to all the hidden units in that layer.
- Use the most recent mini-batch’s value of μ and σ to perform the needed normalizations.
- Skip the step where you normalize using μ and σ since a single test example cannot be normalized
- Perform the needed normalizations, use μ and σ^2 estimated using an exponentially weighted average across mini-batches seen during training.
- If you implemented Batch Norm on mini-batches of (say) 256 examples, then to evaluate on one test example, duplicate that example 256 times so that you’re working with a mini-batch the same size as during training.
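The test-time options above can be illustrated with a small sketch that keeps exponentially weighted averages of μ and σ^2 across mini-batches and then uses them to normalize a single example. The helper names, the averaging weight 0.9, the initial running values, and the toy mini-batches are assumptions for this example; real frameworks differ in details.

```python
import math


def batch_stats(batch):
    """Mean and (biased) variance of one mini-batch of scalar pre-activations."""
    mu = sum(batch) / len(batch)
    var = sum((x - mu) ** 2 for x in batch) / len(batch)
    return mu, var


def normalize(z, mu, var, gamma=1.0, beta=0.0, eps=1e-8):
    """Batch Norm transform; eps keeps the denominator away from zero."""
    return gamma * (z - mu) / math.sqrt(var + eps) + beta


avg_weight = 0.9  # assumed weight of the exponentially weighted average
running_mu, running_var = 0.0, 1.0  # assumed initial running estimates

# Training: update the running estimates once per mini-batch.
mini_batches = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
for batch in mini_batches:
    mu, var = batch_stats(batch)
    running_mu = avg_weight * running_mu + (1 - avg_weight) * mu
    running_var = avg_weight * running_var + (1 - avg_weight) * var

# Test time: a single example is normalized with the running estimates.
z_test = normalize(2.5, running_mu, running_var)
print(running_mu, running_var, z_test)
```

The `eps` term also shows why a small constant is added: `normalize(5.0, 5.0, 0.0)` still evaluates without a division by zero.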
- Deep learning programming frameworks require cloud-based machines to run.
- A programming framework allows you to code up deep learning algorithms with typically fewer lines of code than a lower-level language such as Python.
- Even if a project is currently open source, good governance of the project helps ensure that it remains open even in the long term, rather than becoming closed or modified to benefit only one company.