Implementation of Fundamental ML Concepts
From Zero to Hero: Building Machine Learning Models with Karpathy's NN Zero-to-Hero Course
This project showcases my journey through Andrej Karpathy’s renowned NN Zero-to-Hero course, where I progressively built machine learning models of increasing complexity. Each step of the course deepened my understanding of neural networks, guiding me from simple statistical models and fundamental concepts such as gradient descent up to the more recent Transformer architecture.
Progressive Model Development
The course began with fundamental machine learning principles and quickly advanced to constructing increasingly sophisticated models:
- Bigram Models: These simple statistical models were the starting point, capturing the probabilities of two-element sequences. Bigram models laid the foundation for a statistical understanding of predictive modeling (see the count-based sketch after this list).
- WaveNet-inspired Generative Models: Inspired by WaveNet: A Generative Model for Raw Audio, this step introduced autoregressive models that demonstrated how to generate high-fidelity sequences from raw data.
- Residual Networks: Leveraging the principles of Deep Residual Learning for Image Recognition, I implemented networks that mitigate vanishing gradients and allow deeper architectures (see the residual block sketch after this list).
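To make the bigram idea concrete, here is a minimal count-based sketch in PyTorch. It follows the spirit of the course rather than my exact repository code, and the file name and variable names are illustrative.

```python
import torch

# Read the training text (file name is illustrative).
text = open("input.txt", "r", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# Count how often each character follows each other character.
counts = torch.zeros((len(chars), len(chars)), dtype=torch.float32)
for ch1, ch2 in zip(text, text[1:]):
    counts[stoi[ch1], stoi[ch2]] += 1

# Normalise each row into next-character probabilities (add-one smoothing).
probs = (counts + 1) / (counts + 1).sum(dim=1, keepdim=True)

# Sample a short sequence, starting from a newline character.
g = torch.Generator().manual_seed(42)
idx = stoi["\n"]
out = []
for _ in range(200):
    idx = torch.multinomial(probs[idx], num_samples=1, generator=g).item()
    out.append(itos[idx])
print("".join(out))
```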
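Likewise, here is a minimal sketch of a residual (skip) connection, the idea from Deep Residual Learning that also reappears inside each transformer block; the module and dimensions are my own illustration, not the exact course code.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the skip connection gives gradients a direct path backwards."""
    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # add the input back onto the transformed output

x = torch.randn(4, 384)
print(ResidualBlock(384)(x).shape)  # torch.Size([4, 384])
```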
The Culmination: Transformer Models
The pinnacle of the course was the implementation of a basic transformer model, inspired by the revolutionary Attention Is All You Need paper. This section involved:
Basic Transformer module
- Self-Attention Mechanisms: Understanding how transformers focus on different parts of an input sequence to encode contextual relationships efficiently (a minimal attention-head sketch follows this list).
- Positional Encodings: Implementing positional embeddings to preserve the sequential nature of the data.
- Scalability: Appreciating how transformers serve as a scalable backbone for diverse applications, from natural language processing to computer vision.
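The sketch below illustrates a single causal self-attention head combined with learned positional embeddings, using the dimensions from the parameters table further down (embedding size 384, head size 64, block size 256, vocabulary size 65). It is a simplified illustration rather than the exact module from my repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, head_size, block_size = 384, 64, 256  # values from the parameters table

class CausalSelfAttentionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask so each position attends only to earlier positions.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                                 # x: (batch, time, n_embd)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * head_size**-0.5   # scaled dot-product scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)                      # attention weights over past tokens
        return wei @ v                                    # weighted sum of value vectors

# Learned positional embeddings added to token embeddings preserve order information.
tok_emb = nn.Embedding(65, n_embd)                        # character-level vocabulary of 65
pos_emb = nn.Embedding(block_size, n_embd)
idx = torch.randint(0, 65, (4, 32))                       # a toy batch of token indices
x = tok_emb(idx) + pos_emb(torch.arange(32))
print(CausalSelfAttentionHead()(x).shape)                 # torch.Size([4, 32, 64])
```

In the full model, six such heads run in parallel per block and their 64-dimensional outputs are concatenated back to the 384-dimensional embedding before the feed-forward sub-layer.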
How the implementation differs from that in the paper
- Tokenisation: Unlike the paper, individual characters were used as tokens instead of subword or word chunks (see the encode/decode sketch after this list).
- Dropout in the feed-forward layer: A dropout layer, introduced in the paper Improving Neural Networks by Preventing Co-adaptation of Feature Detectors (the dropout technique), was added to reduce overfitting (a sketch follows this list).
- Cross-attention: Because Attention Is All You Need targets translation, its encoder-decoder architecture additionally uses cross-attention; my decoder-only model has no encoder to attend to, so cross-attention is not used.
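For the character-level tokenisation, encoding and decoding reduce to two lookup tables, sketched below; the variable names mirror the stoi/itos mappings mentioned in the parameters table, while the file name is illustrative.

```python
# Build the character-level vocabulary and the stoi/itos lookup tables.
text = open("input.txt", "r", encoding="utf-8").read()
chars = sorted(set(text))            # 65 unique characters for Tiny Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s: str) -> list[int]:
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)

print(encode("hii there"))
print(decode(encode("hii there")))   # round-trips back to "hii there"
```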
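And a minimal sketch of the feed-forward sub-layer with the added dropout, using the hidden dimension (4 × 384) and rate (0.2) from the parameters table; it illustrates the idea rather than reproducing the repository code verbatim.

```python
import torch.nn as nn

n_embd, dropout = 384, 0.2  # values from the parameters table

class FeedForward(nn.Module):
    """Position-wise MLP: expand to 4x the embedding size, project back, then dropout."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # hidden dimension 1,536
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),             # randomly zeroes activations during training
        )

    def forward(self, x):
        return self.net(x)
```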
Final Deliverable
A Transformer-based model trained on the Tiny Shakespeare dataset to produce Shakespeare-like language.
Here is a sample of the text produced:
MARIANA:
Basta, desperate of my sword; can I do be voth
To dine in the voice? Or else it hath a drunken two
And dancient now in France was Delphon, Master
A lady's father, and when men must sell him.
DUKE VINCENTIO:
Now miserable complainings
But by whose conjuration can, that show
What fair with cunning which dreams them wear
These one anon.
Model Parameters:
| Component | Details |
|---|---|
| Tokenization | Type: Character-level |
| | Vocabulary Size: 65 characters |
| Embeddings | Token Embedding Dimension: 384 |
| | Positional Embedding Dimension: 384 |
| Model Architecture | Type: Transformer-based Causal Language Model (GPT-like) |
| | Number of Transformer Blocks: 6 |
| | Multi-Head Attention: |
| | • Number of Heads: 6 |
| | • Head Size: 64 |
| | Feedforward Network: |
| | • Hidden Dimension: 1,536 (4 × 384) |
| | Dropout Rate: 0.2 |
| | Layer Normalization: Pre-Norm applied before attention and feedforward layers |
| Parameters | Total Parameters: ~10.789 Million |
| | Breakdown: |
| | • Token Embeddings: 24,960 |
| | • Positional Embeddings: 98,304 |
| | • Transformer Blocks: 10,639,872 |
| | • Final LayerNorm: 768 |
| | • Language Modeling Head: 25,025 |
| Training Setup | Batch Size: 64 |
| | Context Length (Block Size): 256 tokens |
| | Maximum Iterations: 5,000 |
| | Evaluation Interval: Every 500 iterations |
| | Learning Rate: 3e-4 |
| | Optimizer: AdamW |
| | Loss Function: Cross-Entropy |
| | Training/Validation Split: 90% training, 10% validation |
| | Evaluation Batches: 200 per split |
| Device Configuration | Primary Device: CUDA (GPU) if available, otherwise CPU |
| | CUDA Details: Automatically detects and uses the first available CUDA device |
| Data Handling | Dataset: input.txt (e.g., "tinyshakespeare") |
| | Encoding: String to integer indices (stoi mapping) |
| | Decoding: Integer indices to string (itos mapping) |
| | Batch Generation: Random sampling of sequences with a maximum length of 256 tokens |
| Generation Features | Method: Sequential token generation using softmax probabilities and multinomial sampling |
| | Maximum New Tokens: Configurable (e.g., 500 or 10,000 tokens) |
| | Context Handling: Maintains a rolling window of the last 256 tokens to respect the context limit |
| Weight Initialization | Linear Layers: Normal distribution (mean=0.0, std=0.02), bias=0 |
| | Embedding Layers: Normal distribution (mean=0.0, std=0.02) |
| Regularization Techniques | Dropout: Applied to attention weights and feedforward networks (rate=0.2) |
| Progress Tracking | Tool: tqdm for real-time monitoring of training progress and loss metrics |
| Post-Training Outputs | Loss Plots: Saved as training_plot.png |
| | Model Saving: State dictionary saved as microGPT_model.pth |
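The training setup above maps onto a compact loop. The sketch below is a hedged reconstruction from the table's hyperparameters: it assumes a `model` object whose forward pass returns `(logits, loss)` with the loss computed by cross-entropy (as in the GPT-like module described earlier), and it reuses `encode` and `text` from the tokenisation sketch; these names are placeholders rather than the exact repository API.

```python
import torch
from tqdm import tqdm

batch_size, block_size = 64, 256
max_iters, eval_interval, learning_rate = 5000, 500, 3e-4
device = "cuda" if torch.cuda.is_available() else "cpu"

data = torch.tensor(encode(text), dtype=torch.long)      # from the tokenisation sketch
n = int(0.9 * len(data))                                  # 90/10 train/validation split
train_data, val_data = data[:n], data[n:]

def get_batch(split: str):
    """Sample a random batch of (context, next-token) training pairs."""
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
for step in tqdm(range(max_iters)):
    xb, yb = get_batch("train")
    logits, loss = model(xb, yb)             # model computes cross-entropy internally
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % eval_interval == 0:
        tqdm.write(f"step {step}: train loss {loss.item():.4f}")

torch.save(model.state_dict(), "microGPT_model.pth")      # matches the table's output file
```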
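Generation then proceeds as described under Generation Features: sample the next token from the softmax over the final position's logits with multinomial sampling, append it, and keep a rolling 256-token window as context. Again a sketch under the same placeholder model interface:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx: torch.Tensor, max_new_tokens: int, block_size: int = 256):
    """Autoregressively extend idx (shape: batch x time) by max_new_tokens tokens."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]            # rolling window: keep the last 256 tokens
        logits, _ = model(idx_cond)                # forward pass without targets
        logits = logits[:, -1, :]                  # logits for the final position only
        probs = F.softmax(logits, dim=-1)          # convert to a probability distribution
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, next_token), dim=1)  # append and continue
    return idx

# Start from token id 0 as the seed context and decode 500 new characters.
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(generate(model, context, max_new_tokens=500)[0].tolist()))
```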
Training Results with CUDA on an RTX 2070
Explore the Code
Dive into the technical implementation in the project repository: GitHub Repository: Machine Learning Learning