Large Language Models. How much do you know about them? Take the LLM Knowledge Test to find out.
Question 1
Do you need to have a vector store for all your text-based LLM use cases?
A. Yes
B. No
Correct Answer: B
Explanation
A vector store holds vector representations (embeddings) of text such as words, sentences, or document chunks. Because these embeddings capture semantic meaning, they can be searched for passages related to a query, and the retrieved passages can be added to the prompt as extra context. However, not all text-based LLM use cases require a vector store. Tasks such as summarization, sentiment analysis, and translation usually work on the input text alone and do not need context augmentation.
Here is why:
- Summarization: This task involves condensing a larger body of text into a short summary. It does not require the context of other documents or sentences beyond the text being summarized.
- Sentiment Analysis: This task involves determining the sentiment (positive, negative, neutral) expressed in a piece of text. It is typically done based on the text itself without needing additional context.
- Translation: This task involves translating text from one language to another. The context is usually provided by the sentence itself and the broader document it is part of, rather than a separate vector store.
Question 2
Which technique helps mitigate bias in prompt-based learning?
A. Fine-tuning
B. Data augmentation
C. Prompt calibration
D. Gradient clipping
Correct Answer: C
Explanation
Prompt calibration involves adjusting prompts to minimize bias in the generated outputs. Fine-tuning modifies the model itself, while data augmentation expands the training data. Gradient clipping prevents exploding gradients during training.
Question 3
Which of the following is NOT a technique specifically used for aligning Large Language Models (LLMs) with human values and preferences?
A. RLHF
B. Direct Preference Optimization
C. Data Augmentation
Correct Answer: C
Explanation
Data Augmentation is a general machine learning technique that involves expanding the training data with variations or modifications of existing data. While it can indirectly impact LLM alignment by influencing the model’s learning patterns, it’s not specifically designed for human value alignment.
Incorrect Options:
A) Reinforcement Learning from Human Feedback (RLHF) uses human preference judgments to train a reward model, and the LLM is then optimized with reinforcement learning to produce outputs that score highly under that reward model.
B) Direct Preference Optimization (DPO) trains the LLM directly on pairs of preferred and rejected outputs, without fitting a separate reward model or running a reinforcement learning loop.
Question 4
In Reinforcement Learning from Human Feedback (RLHF), what describes “reward hacking”?
A. Optimizes for desired behavior
B. Exploits reward function
Correct Answer: B
Explanation
Reward hacking refers to a situation in RLHF where the agent discovers unintended loopholes or biases in the reward function to achieve high rewards without actually following the desired behavior. The agent essentially “games the system” to maximize its reward metric.
Why Option A is Incorrect:
While optimizing for the desired behavior is the intended outcome of RLHF, it doesn’t represent reward hacking. Option A describes a successful training process. In reward hacking, the agent deviates from the desired behavior and finds an unintended way to maximize the reward.
Question 5
When fine-tuning a GenAI model for a task (e.g., creative writing), which factor most significantly affects the model’s ability to adapt to the target task?
A. Size of fine-tuning dataset
B. Pre-trained model architecture
Correct Answer: B
Explanation
The architecture of the pre-trained model acts as the foundation for fine-tuning. A complex and versatile architecture like those used in large models (e.g., GPT-3) allows for greater adaptation to diverse tasks. The size of the fine-tuning dataset plays a role, but it’s secondary. A well-architected pre-trained model can learn from a relatively small dataset and generalize effectively to the target task.
Why A is Incorrect:
While the size of the fine-tuning dataset can enhance performance, it’s not the most crucial factor. Even a massive dataset cannot compensate for limitations in the pre-trained model’s architecture. A well-designed pre-trained model can extract relevant patterns from a smaller dataset and outperform a less sophisticated model with a larger dataset.
Question 6
What does the self-attention mechanism in transformer architecture allow the model to do?
A. Weigh word importance
B. Predict next word
C. Automatic summarization
Correct Answer: A
Explanation
The self-attention mechanism in transformers acts as a spotlight, illuminating the relative importance of words within a sentence.
In essence, self-attention allows transformers to dynamically adjust the focus based on the current word being processed. Words with higher similarity scores contribute more significantly, leading to a richer understanding of word importance and sentence structure. This empowers transformers for various NLP tasks that heavily rely on context-aware analysis.
Incorrect Options:
- Predict next word: While transformers can be used for language modeling (including next-word prediction), this isn’t the primary function of self-attention.
- Automatic summarization: While self-attention is a core component of summarization models, it’s not solely responsible for generating summaries.
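To make the "weighing word importance" idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes the queries, keys, and values are the raw token vectors themselves (real transformers apply learned projections first), and the three toy embeddings are hypothetical values chosen for illustration.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-d embeddings for the tokens "the", "cat", "sat".
x = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.9]]
outputs, weights = self_attention(x, x, x)
```

Each row of `weights` sums to 1 and shows how much one token attends to the others. In this toy input, "cat" places more weight on itself and on "sat" than on "the", because their vectors are more similar.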
Question 7
What is one advantage of using subword algorithms like BPE or WordPiece in Large Language Models (LLMs)?
A. Limit vocabulary size
B. Reduce amount of training data
C. Make computationally efficient
Correct Answer: A
Explanation
LLMs deal with massive amounts of text, leading to a very large vocabulary if you consider every single word. Subword algorithms like Byte Pair Encoding (BPE) and WordPiece break down words into smaller meaningful units (subwords) which are then used as the vocabulary. This significantly reduces the vocabulary size while still capturing the meaning of most words, making the model more efficient to train and use.
Incorrect Answer Explanations:
- Reduce amount of training data: Subword algorithms don’t directly reduce the amount of training data. The data size remains the same.
- Make computationally efficient: While limiting vocabulary size can improve computational efficiency, it’s not the primary purpose of subword algorithms. Their main advantage lies in effectively representing a large vocabulary with a smaller set of units.
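The merge step behind BPE can be sketched in a few lines. This is a simplified illustration, not a production tokenizer: the three-word corpus and its frequencies are hypothetical, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs, weighted by how often each word occurs.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: word -> frequency, each word starts as characters.
corpus = {"low": 5, "lower": 2, "lowest": 3}
words = {tuple(w): f for w, f in corpus.items()}
merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
```

After two merges the shared stem "low" has become a single vocabulary unit, and the rarer endings remain as smaller pieces. This is how a fixed number of merges caps the vocabulary size while still covering unseen words.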
Question 8
Compared to Softmax, how does Adaptive Softmax speed up large language models?
A. Sparse word reps
B. Zipf’s law exploit
C. Pre-trained embedding
Correct Answer: B
Explanation
Standard Softmax struggles with vast vocabularies because it must compute a score for every word in the vocabulary at every prediction step. Imagine a large language model predicting the next word in a sentence: the final layer multiplies the hidden state by a matrix with one row per vocabulary word, which adds up to billions of operations over a training run. Adaptive Softmax leverages Zipf’s law (a small number of words account for most occurrences, while most words are rare) to group words by frequency. Frequent words sit in a small head cluster that is evaluated at every step, while rare words are placed in tail clusters that are evaluated only when needed, often with smaller projections. This significantly reduces the cost of training large language models.
Incorrect Answer Explanations:
- A. Sparse word reps: While sparse representations can improve memory usage, they don’t directly address the computational bottleneck of Softmax in large vocabularies.
- C. Pre-trained embedding: Pre-trained embeddings enhance model performance but don’t address the core issue of Softmax’s computational complexity.
Question 9
Which configuration parameter for inference can be adjusted to either increase or decrease randomness within the model output layer?
A. Max new tokens
B. Top-k sampling
C. Temperature
Correct Answer: C
Explanation
During text generation, large language models (LLMs) rely on a softmax layer to assign probabilities to potential next words. Temperature acts as a key parameter influencing the randomness of these probability distributions.
- Lower Temperature: When set low, the softmax layer assigns significantly higher probabilities to the single word with the highest likelihood based on the current context.
- Higher Temperature: A higher temperature “softens” the probability distribution, making other, less likely words more competitive.
Why other options are incorrect:
- (A) Max new tokens: This parameter simply caps the number of tokens the LLM can generate in a single response. It affects length, not randomness.
- (B) Top-k sampling: This technique restricts sampling to the k most probable tokens for the next prediction. It limits which tokens can be chosen but does not reshape the probabilities the way temperature does.
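The effect of temperature can be sketched by dividing the logits by the temperature before applying softmax. The logits below are hypothetical scores for three candidate tokens.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale the logits by 1/temperature, then apply a stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores
cold = softmax_with_temperature(logits, 0.5)     # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)  # unchanged softmax
hot = softmax_with_temperature(logits, 2.0)      # flatter distribution
```

The top token's probability is highest in `cold` and lowest in `hot`, so sampling from `hot` picks the less likely tokens more often.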
Question 10
What transformer model uses masking & bi-directional context for masked token prediction?
A. Autoencoder
B. Autoregressive
C. Sequence-to-sequence
Correct Answer: A
Explanation
Autoencoding models (encoder-only transformers such as BERT) are pre-trained using masked language modeling. Tokens in the input sequence are randomly masked, and the pre-training objective is to predict the masked tokens, using the context on both sides of each mask, to reconstruct the original sentence.
Question 11
What technique allows you to scale model training across GPUs when the model doesn’t fit in the memory of a single chip?
A. DDP
B. FSDP
Correct Answer: B
Explanation
FSDP (Fully Sharded Data Parallel) is the technique that allows scaling model training across GPUs when the model is too big to fit in the memory of a single chip. FSDP distributes or shards the model parameters, gradients, and optimizer states across GPUs, enabling efficient training.
Incorrect Answers:
- A) DDP (Distributed Data Parallel) replicates the full model on every GPU and processes different batches of data in parallel, so it requires the model to fit in the memory of a single GPU.
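A rough back-of-the-envelope calculation shows why sharding matters. The sketch below assumes about 16 bytes of training state per parameter (a common estimate for mixed-precision Adam: half-precision weights and gradients plus full-precision weights and two optimizer moments). It ignores activations and the temporary buffers FSDP needs when it gathers a layer, so treat the numbers as illustrative only.

```python
# Assumption: ~16 bytes of weights, gradients, and optimizer state per parameter.
BYTES_PER_PARAM = 16

def per_gpu_state_gb(num_params, num_gpus, sharded):
    """Approximate per-GPU memory (GB) for model training state."""
    total_bytes = num_params * BYTES_PER_PARAM
    if sharded:
        # FSDP-style: each GPU holds only its shard of the state.
        total_bytes /= num_gpus
    # DDP-style (sharded=False): every GPU holds a full replica.
    return total_bytes / 1e9

ddp_gb = per_gpu_state_gb(7e9, num_gpus=8, sharded=False)
fsdp_gb = per_gpu_state_gb(7e9, num_gpus=8, sharded=True)
```

Under these assumptions a hypothetical 7-billion-parameter model needs roughly 112 GB of state per GPU with full replication, but about 14 GB per GPU when sharded across eight GPUs.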
Question 12
What is the purpose of quantization in training large language models?
A. Reduce memory usage
B. Improve model accuracy
C. Enhance model interpretability
Correct Answer: A
Explanation
Quantization helps reduce the memory required to store model weights by reducing their precision.
Incorrect Answers:
- B) Improve model accuracy: While quantization can have some impact on model accuracy, it’s primarily used to optimize memory usage and computational efficiency.
- C) Enhance model interpretability: Quantization focuses on memory and computational efficiency, not directly on model interpretability.
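One simple way to reduce precision is symmetric "absmax" quantization to 8-bit integers, sketched below. The weight values are hypothetical, and real libraries typically quantize per block or per channel rather than over a whole list at once.

```python
def quantize_int8(weights):
    # Map the largest absolute weight to 127 (assumes not all weights are zero).
    scale = max(abs(w) for w in weights) / 127
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Recover approximate floating-point values from the integers.
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # hypothetical 32-bit weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half the scale step per weight.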
Question 13
How can scaling laws be used to design compute-optimal models?
A. Optimizing model & data size
B. Improve model interpretability
C. Reduce training time
D. Enhance model scalability
Correct Answer: A
Explanation
Scaling laws provide valuable insights into the relationship between model size (number of parameters), dataset size, and the model’s performance (often measured as loss). This relationship can be mathematically expressed through power laws.
Here’s how scaling laws help design compute-optimal models:
- Understanding cost trade-offs: By analyzing scaling laws, you can estimate the impact of increasing model size or dataset size on performance and computational resources (training time, memory usage). This allows you to find a balance between model complexity and training cost.
- Targeted optimization: Scaling laws help predict the performance gain from increasing model size or data size. This helps you focus optimization efforts on the factors that will have the most significant impact on performance within your computational budget.
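As a worked illustration of balancing model and data size, the sketch below uses two widely quoted rules of thumb, both of which are assumptions here rather than claims from this quiz: training compute is roughly 6 FLOPs per parameter per token, and a compute-optimal run uses roughly 20 tokens per parameter.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumption: C is about 6 * N * D
TOKENS_PER_PARAM = 20      # assumption: D is about 20 * N at the optimum

def compute_optimal(compute_flops):
    """Split a FLOP budget into a parameter count N and a token count D."""
    # Substituting D = 20 * N into C = 6 * N * D gives N = sqrt(C / 120).
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

n_params, n_tokens = compute_optimal(1e21)
```

For a budget of 1e21 FLOPs this suggests a model of roughly 2.9 billion parameters trained on roughly 58 billion tokens. The exact constants vary between studies, so the value of the exercise is the shape of the trade-off rather than the specific numbers.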
Question 14
What is catastrophic forgetting in fine-tuning?
A. Other tasks perform worse
B. All tasks perform better
C. Pre-trained weights enhance
Correct Answer: A
Explanation
Catastrophic forgetting refers to the degradation of performance on tasks other than the one being fine-tuned, as the weights of the original model are modified.
Incorrect options:
- B) All tasks perform better: This is incorrect as catastrophic forgetting leads to a loss of performance on other tasks.
- C) Pre-trained weights enhance: This is incorrect as catastrophic forgetting occurs due to the modification of weights during fine-tuning.
Question 15
Parameter Efficient Fine-Tuning (PEFT) updates only a small subset of parameters and this helps prevent catastrophic forgetting.
A. True
B. False
Correct Answer: A
Explanation
Parameter Efficient Fine-Tuning (PEFT) is a method that updates only a small subset of parameters during the fine-tuning process. This approach is designed to be more memory efficient and to prevent catastrophic forgetting. Catastrophic forgetting is a phenomenon where a neural network forgets its previously learned information upon learning new information. By updating only a small subset of parameters, PEFT mitigates this issue, allowing the model to retain its previously learned knowledge while adapting to new tasks.
Explanation for the incorrect answer (False): If you chose False, you may have assumed that all parameters need to be updated during fine-tuning. In PEFT, however, only a small subset of parameters is updated while the rest stay frozen. Because most of the pre-trained weights are left untouched, the model keeps its general knowledge while adapting to specific tasks, and it does so without a significant increase in computational cost.
Question 16
In a Transformer model with group attention, how does the mechanism differ from standard self-attention when processing a sentence?
A. Replaces self-attention
B. Pre-defined word groups
C. Attention on specific word
Correct Answer: B
Explanation
Standard self-attention in a Transformer considers the relationships between individual words within a sentence. Group attention, on the other hand, introduces a new layer of attention that focuses on groups of words pre-defined based on specific criteria, such as syntactic or semantic groupings (e.g., noun phrases, verb phrases).
Question 17
During LLM training, which step is NOT directly involved in the process?
A. Feature engineering
B. Pre-training
C. Fine-tuning
D. RLHF
Correct Answer: A
Explanation
LLMs primarily rely on raw text data for training. Feature engineering, which involves manually extracting specific features from the data, is not a typical step in LLM training. Options (B), (C), and (D) are all common stages in the LLM training pipeline.
Question 18
Pre-training is a crucial step in LLM training. What is the main objective of pre-training?
A. To perform a specific task
B. General language understanding
Correct Answer: B
Explanation
Pre-training aims to equip the LLM with a foundational understanding of language by exposing it to a vast amount of text data. This allows the model to learn general representations of words, their relationships, and overall language structure.
Question 19
Which of the following sequences represents the most likely order of LLM training stages?
A. Pre-training
B. RLHF
C. Instruction Fine-tuning
- A -> C -> B
- B -> A -> C
- C -> A -> B
Correct Answer: A -> C -> B
Explanation:
LLM training follows a specific order:
- Pre-training (A): The LLM is exposed to a massive dataset to learn general language patterns and relationships between words.
- Instruction Fine-tuning (C): The pre-trained model is trained on labeled examples of instructions paired with desired responses. This teaches the model to follow instructions and tailors its knowledge to the desired tasks.
- RLHF (Reinforcement Learning from Human Feedback) (B): This optional stage further refines the model’s behavior by incorporating human feedback through a reward system. The LLM receives rewards for desirable outputs.
Question 20
Which technique uses gating functions to decide which model to use based on the input?
A. Ensemble Techniques
B. Mixture of Experts (MoE)
Correct Answer: B
Explanation:
Mixture of Experts (MoE) is a machine learning technique that uses multiple models, called “experts,” and a gating function that decides which expert to use based on the input. This allows MoE to model more complex patterns and adapt to different regions of the input space, making it more flexible than traditional ensemble techniques, which typically combine predictions from multiple models without such a gating function.
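A minimal sketch of the gating idea follows. The two experts and the gate weights are hypothetical and hand-picked for illustration; in a real MoE layer both are learned, and several experts are often combined rather than just the top one.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expert_double(x):
    return [2 * v for v in x]

def expert_negate(x):
    return [-v for v in x]

EXPERTS = [expert_double, expert_negate]
# Hypothetical gate: one weight row per expert, scored against the input.
GATE_WEIGHTS = [[1.0, 0.0], [-1.0, 0.0]]

def moe_forward(x):
    # The gate scores each expert for this input and turns scores into probabilities.
    gate_scores = [sum(w * v for w, v in zip(row, x)) for row in GATE_WEIGHTS]
    gate_probs = softmax(gate_scores)
    # Top-1 routing: only the most probable expert is evaluated.
    chosen = max(range(len(EXPERTS)), key=lambda i: gate_probs[i])
    return EXPERTS[chosen](x), chosen, gate_probs

out_a, expert_a, _ = moe_forward([3.0, 1.0])
out_b, expert_b, _ = moe_forward([-2.0, 1.0])
```

Inputs with a positive first component are routed to the first expert and the rest to the second, so different regions of the input space are handled by different models.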
Question 21
Which database would you use if you want to store multi-dimensional vectors and perform ANN search?
A. Vector Database
B. Traditional Database
Correct Answer: A
Explanation:
Traditional databases, like relational databases, are designed to store data in tables with rows and columns, which is not efficient for storing and searching multi-dimensional vectors. Vector databases specialize in handling high-dimensional vectors and are optimized for:
- Storage: Efficiently storing and compressing vectors.
- ANN Search (Approximate Nearest Neighbor Search): Finding data points similar to a query vector using specialized algorithms, even in high-dimensional spaces.
Thus, a vector database is the most suitable choice for storing multi-dimensional vectors and performing ANN searches.
Question 22
A technique that utilizes a smaller model to learn from a larger pre-trained model, improving efficiency, is called:
A. Gradient Clipping
B. Backpropagation
C. Knowledge Distillation
D. Batch Normalization
Correct Answer: C
Explanation:
Knowledge distillation is a technique that allows a smaller model (student) to learn from a larger, pre-trained model (teacher). It improves training efficiency by leveraging the knowledge already encoded in the teacher model. The process involves training a large, powerful model on a massive dataset, and then the smaller model learns from the teacher’s outputs or internal representations, guiding it to learn similar patterns and achieve good performance with less data and computational resources.
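The "learning from the teacher's outputs" step is often implemented as a loss that compares softened probability distributions. The sketch below computes the KL divergence between temperature-softened teacher and student outputs. The logits and the temperature of 2.0 are hypothetical, and practical recipes usually combine this term with a standard loss on the true labels.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions so small teacher probabilities still carry signal.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student): zero when the student matches the teacher exactly.
    return sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))

teacher = [4.0, 1.0, 0.5]        # hypothetical teacher logits
close_student = [3.5, 1.2, 0.4]  # student that roughly agrees
far_student = [0.5, 1.0, 4.0]    # student that disagrees

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

The loss is small for the student that agrees with the teacher and large for the one that does not, which is the signal used to pull the student toward the teacher's behavior.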
Question 23
Which method places ## at the start of tokens?
A. BPE
B. WordPiece
Correct Answer: B
Explanation:
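The WordPiece tokenization method places ## at the beginning of tokens that continue a word, so only the first piece of a word appears without the prefix (for example, “playing” may be split into “play” and “##ing”). This is a characteristic feature of WordPiece.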
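The ## convention is easiest to see in a greedy longest-match tokenizer. The sketch below follows that general approach with a tiny hypothetical vocabulary. It is an illustration, not the implementation of any particular library.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched: the whole word becomes the unknown token.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary for illustration.
vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
playing = wordpiece_tokenize("playing", vocab)
unplayable = wordpiece_tokenize("unplayable", vocab)
unknown = wordpiece_tokenize("xyz", vocab)
```

With this vocabulary, "playing" splits into "play" and "##ing", and "unplayable" into "un", "##play", and "##able". Only the first piece of each word lacks the prefix.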
Question 24
What does ‘Prompt leaking’ signify in the context of Large Language Models (LLMs)?
A. Extracting sensitive info
B. Hijacking model’s output
Correct Answer: A
Explanation:
‘Prompt leaking’ in the context of Large Language Models (LLMs) refers to an attack in which a user crafts inputs that make the model reveal its own prompt, such as hidden system instructions or confidential details embedded in them. Adversaries can exploit the leaked information to gain unauthorized insights into the application’s behavior or compromise its security. For example, if a chatbot is told to “repeat the text above” and responds with its confidential system prompt, this constitutes prompt leaking. Hijacking the model’s output (option B) describes prompt injection more generally rather than leaking.
Question 25
Which of the following vector indexing techniques relies on grouping similar vectors in a cluster for efficient retrieval?
A. Flat Indexing
B. Inverted File Index
C. Principal Component Analysis
Correct Answer: B
Explanation:
- Flat Indexing: Stores vectors without any specific organization. It can be used for similarity searches but is not efficient for large datasets.
- Inverted File Index: Borrowed from text retrieval, where an inverted index maps each word to the documents that contain it. For vector similarity search, the vectors are grouped into clusters (for example, with k-means), and each cluster keeps a list of its member vectors. A query is compared only against the vectors in the closest clusters, which makes retrieval efficient.
- Principal Component Analysis (PCA): Reduces the dimensionality of vectors while preserving important information. It can be used for dimensionality reduction before indexing but does not involve clustering similar vectors.
Question 26
For a small review dataset, if you want a 100% recall rate which vector index would you use? Speed is not a consideration here.
A. Flat Index
B. HNSW
C. Random Projection
Correct Answer: A
Explanation:
In a small dataset, a flat index allows for an exhaustive search, comparing the query vector to every stored review vector. Because nothing is skipped, the true nearest neighbors are always found, which gives 100% recall.
- HNSW: While usually accurate, it is an approximate method that navigates a layered graph of neighbors, so it does not guarantee finding the absolute closest neighbors in every case.
- Random Projection: The potential loss of information due to dimensionality reduction might compromise the goal of perfect accuracy.
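An exhaustive flat search is short enough to write out in full. This sketch assumes cosine similarity as the metric and uses hypothetical 2-d review embeddings.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flat_search(query, vectors, k):
    # Score every stored vector against the query; nothing is skipped.
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical review embeddings.
reviews = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3], [-0.5, 0.5]]
top = flat_search([1.0, 0.0], reviews, k=2)
```

The cost grows linearly with the number of stored vectors, which is acceptable for a small dataset where speed is not a concern.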
Question 27
In the Inverted File (IVF) index, which parameter would you tune to expand the number of clusters?
A. nprobe
B. nlist
Correct Answer: B
Explanation:
- nlist: Sets the number of clusters (inverted lists) that the vectors are partitioned into when the index is built. Increasing nlist creates more, smaller clusters.
- nprobe: Sets how many of the nearest clusters are searched for each query. It trades speed for recall at query time but does not affect the number of clusters.
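The roles of the two parameters can be sketched with a toy IVF index. The centroids here are fixed by hand (an assumption; in practice they come from clustering such as k-means), so nlist is simply the number of centroids. The probe count is named `nprobe` in this sketch, following FAISS naming.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    # One inverted list per centroid, so nlist == len(centroids).
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_distance(v, centroids[c]))
        lists[nearest].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe, k):
    # Rank clusters by centroid distance and search only the nprobe closest.
    order = sorted(range(len(centroids)),
                   key=lambda c: squared_distance(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    candidates.sort(key=lambda i: squared_distance(query, vectors[i]))
    return candidates[:k]

# Hypothetical data: two clusters plus one vector near the boundary.
vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [2.4, 2.4]]
centroids = [[0.0, 0.0], [5.0, 5.0]]  # nlist = 2
lists = build_ivf(vectors, centroids)

query = [2.6, 2.6]
narrow = ivf_search(query, vectors, centroids, lists, nprobe=1, k=1)
wide = ivf_search(query, vectors, centroids, lists, nprobe=2, k=1)
```

With one probe the search looks only in the cluster nearest to the query and misses the true nearest neighbor, which was assigned to the other cluster. Probing both clusters finds it, which illustrates the speed and recall trade-off at query time.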
Question 28
Which metric is NOT typically used for evaluating the quality of factual language summaries generated by an LLM?
A. ROUGE Score
B. BLEU Score
C. Perplexity
Correct Answer: C
Explanation:
Perplexity measures how well the model predicts the next word in a sequence. While relevant for some LLM tasks, it is not commonly used for evaluating factual language summaries.
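- ROUGE Score and BLEU Score are standard metrics for assessing summaries. Both measure n-gram overlap between the generated summary and a reference summary, so they capture content coverage rather than verifying facts directly.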
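For a sense of what ROUGE computes, here is a simplified sketch of ROUGE-1 recall: the share of reference words that also appear in the candidate summary. It assumes whitespace tokenization and lowercasing, with no stemming, and the sentences are hypothetical.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a reference word counts only as often as the candidate has it.
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
paraphrase = "the cat sat on a mat"
altered = "the cat sat on the hat"

score_paraphrase = rouge1_recall(paraphrase, reference)
score_altered = rouge1_recall(altered, reference)
```

Both candidates match five of the six reference words, so they receive the same score even though one changes the meaning. Overlap scores are therefore usually read alongside a separate check of factual content.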
Question 29
Which of the following indices represents a method that involves multiplying with another matrix to reduce the size of the original vector?
A. Random Projection Index
B. Flat Index
Correct Answer: A
Explanation:
The Random Projection Index is a dimensionality reduction technique that works by projecting the original high-dimensional data into a lower-dimensional space using a random matrix. This process involves multiplication with another matrix (the random matrix), effectively reducing the size of the original vector.
A Flat Index does not involve such a process.
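A random projection can be sketched as a single matrix-vector product. The sketch assumes a Gaussian random matrix scaled by the square root of the output dimension, which is one common choice. The dimensions, seed, and input vector are hypothetical.

```python
import random

def random_projection_matrix(in_dim, out_dim, seed=0):
    # Fixed seed so the same matrix is used for indexing and for queries.
    rng = random.Random(seed)
    scale = 1 / out_dim ** 0.5
    return [[rng.gauss(0, 1) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(vector, matrix):
    # Each output component is the dot product of one matrix row with the vector.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

matrix = random_projection_matrix(in_dim=8, out_dim=3)
original = [1.0, 0.5, -0.2, 0.0, 0.3, 0.9, -1.1, 0.4]
reduced = project(original, matrix)
```

The 8-dimensional vector becomes a 3-dimensional one. Stored vectors and queries must be projected with the same matrix so that their distances remain comparable.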
Question 30
What’s the correct post-filtering order in a vector database?
A. Meta-data filtering → Top-K
B. Top-K → Meta-data filtering
Correct Answer: B
Explanation:
In a vector database, post-filtering typically involves two steps: retrieving the top-K results from the vector index and then performing meta-data filtering. The correct sequence is to first retrieve the top-K results to narrow down the search space and then apply meta-data filtering to those results. Because the filter can discard some of them, the final result set may contain fewer than K items.
Question 31
What’s the correct pre-filtering order in a vector database?
A. Meta-data filtering → Top-K
B. Top-K → Meta-data filtering
Correct Answer: A
Explanation:
In the context of a vector database, pre-filtering typically involves two steps: performing meta-data filtering and then retrieving the top K results from the vector index. The correct sequence is to first perform meta-data filtering, which reduces the overall search space, and then execute the vector query on these filtered vectors to generate the top-K results.
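The difference between the two filtering orders can be sketched over a small in-memory collection. The items, the `lang` meta-data field, and the dot-product scoring are all hypothetical choices for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical items with a vector and one meta-data field.
ITEMS = [
    {"id": "a", "vector": [1.0, 0.0], "lang": "en"},
    {"id": "b", "vector": [0.9, 0.1], "lang": "fr"},
    {"id": "c", "vector": [0.8, 0.2], "lang": "fr"},
    {"id": "d", "vector": [0.1, 0.9], "lang": "en"},
]

def top_k(query, items, k):
    return sorted(items, key=lambda it: dot(query, it["vector"]), reverse=True)[:k]

def pre_filter_search(query, items, k, lang):
    # Meta-data filtering first, then top-K over the surviving vectors.
    allowed = [it for it in items if it["lang"] == lang]
    return [it["id"] for it in top_k(query, allowed, k)]

def post_filter_search(query, items, k, lang):
    # Top-K over everything first, then drop results that fail the filter.
    return [it["id"] for it in top_k(query, items, k) if it["lang"] == lang]

query = [1.0, 0.0]
pre = pre_filter_search(query, ITEMS, k=2, lang="en")
post = post_filter_search(query, ITEMS, k=2, lang="en")
```

Pre-filtering returns two English items, while post-filtering returns only one because the second-best overall match is filtered out afterwards. This is why post-filtering can return fewer than K results.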
Question 32
You can use an algorithm other than Proximal Policy Optimization to update the model weights during RLHF.
A. True
B. False
Correct Answer: A
Explanation:
For instance, other policy-gradient algorithms such as REINFORCE can be used in place of PPO, and value-based methods like Q-Learning have also been suggested. While Proximal Policy Optimization (PPO) is the most popular choice for Reinforcement Learning from Human Feedback (RLHF) due to its balance of complexity and performance, RLHF is an ongoing field of research. New techniques and algorithms continue to be developed, and this preference may change in the future.