edit1: I'll merge without quantization later. Just testing things out and memory limits for now.
edit2: Merged the adapter both with and without quantization (careful: I'm not resetting GPU memory between the model declarations, so some of it might spill to RAM and slow things down).
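For reference, a minimal sketch of the plain (non-quantized) merging step with the peft library; the base model name and adapter path are assumptions, so adjust them to your setup. Merging into quantized weights needs extra care and is left out here:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

BASE = "meta-llama/Llama-2-7b-hf"  # assumption: swap in your base model
ADAPTER = "./lora-adapter"         # assumption: path to the trained LoRA adapter

# Load the base model in half precision (no quantization here).
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)

# Attach the LoRA adapter, then fold its low-rank updates into the base weights.
model = PeftModel.from_pretrained(base, ADAPTER)
merged = model.merge_and_unload()  # returns a plain transformers model

merged.save_pretrained("./merged-model")
```

Deleting intermediates between declarations (`del base` followed by `torch.cuda.empty_cache()`) helps avoid the memory buildup mentioned above.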
Rank (R): Measure of the "dimensionality" of the space a transformation (a weight matrix) can span.
The max rank is always min(#rows, #columns). Why? Because that is the largest possible number of linearly independent rows or columns a matrix can have.
A higher rank means more parameters to learn, more memory, and more compute spent on the concept/semantic space.
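A quick numpy illustration of the min(#rows, #columns) bound (the shapes and rank here are arbitrary):

```python
import numpy as np

W = np.random.randn(512, 64)         # a 512x64 weight matrix
print(np.linalg.matrix_rank(W))      # 64: capped at min(512, 64)

# A matrix built as a product of two thin factors has rank at most r,
# even though its 512x64 shape would allow a rank of up to 64.
r = 8
B, A = np.random.randn(512, r), np.random.randn(r, 64)
print(np.linalg.matrix_rank(B @ A))  # 8
```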
Low (Lo): One approximates a subset of these weight matrices (some of the layers) with lower-rank matrices (see SVD or other factorization methods). [A]
After that, a simple linear projection typically maps us back to the correct dimension, which must match the next layer's input dimension. [B]
The rest of the network has its weights frozen during the forward and backward passes.
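A minimal sketch of the low-rank approximation idea via truncated SVD, which gives the best rank-r approximation in the Frobenius norm (Eckart-Young); the matrix size and rank are arbitrary:

```python
import numpy as np

W = np.random.randn(512, 512)             # stand-in for a pre-trained weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)

r = 16                                    # target rank
W_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r]  # best rank-r approximation of W

# Storing the two thin factors instead of the full matrix:
# 512*512 = 262,144 params vs 16*(512 + 512) = 16,384 params.
print(np.linalg.norm(W - W_r) / np.linalg.norm(W))  # relative approximation error
```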
Adaptation (A): One adapts the pre-trained model to a more specific task/objective, capturing important relationships from this new environment compared to the initial general model training.
LoRA: One fine-tunes a large network not simply by freezing some weights and re-training on specific data, but by actually reducing the computational cost, expressing and training the weight updates as low-rank approximation matrices.
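Concretely, the frozen pre-trained weight W stays untouched and a low-rank update ΔW = B·A (scaled by alpha/r) is learned next to it, so only the thin B and A matrices are trained. A minimal sketch with the peft library (the model name and target module names are assumptions; adjust them to your architecture):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumption

config = LoraConfig(
    r=8,                                  # rank of the update matrices B and A
    lora_alpha=16,                        # scaling: updates are multiplied by alpha/r
    target_modules=["q_proj", "v_proj"],  # which matrices get adapters (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a tiny fraction of the model is trainable
```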
For an intuitive overview of rank and alpha parameters, check this.