TaoTian AI Agent Interview Experience

1. Introduction to the RAG Process; Understanding Encoding Models, Principles, Advantages, and Disadvantages; How to Evaluate Encoding Model Capabilities

The core process of RAG (Retrieval-Augmented Generation) consists of: Data Preparation (cleaning, chunking, vectorization, storage), Retrieval (similarity search), Augmentation (injecting context into the prompt), and Generation (LLM output).

Principles and Advantages/Disadvantages of Encoding Models

Encoding models are typically based on BERT/RoBERTa and other encoder-only architectures. They map text into high-dimensional vector space using a bidirectional attention mechanism.

  • Principle: Trained with contrastive learning so that semantically similar texts lie closer together in vector space (similarity typically measured by cosine similarity).
  • Advantages: High computational efficiency, suitable for large-scale semantic searches.
  • Disadvantages: Limited by context window length (typically 512–8,192 tokens); unable to capture very complex long-range logic.
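The similarity computation itself is straightforward; a minimal sketch with toy vectors standing in for real encoder outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real encoder outputs.
query = np.array([0.9, 0.1, 0.2])
similar_doc = np.array([0.8, 0.2, 0.1])
unrelated_doc = np.array([-0.1, 0.9, -0.3])
```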

Evaluation Metrics

  • MTEB (Massive Text Embedding Benchmark): Currently recognized as an authoritative leaderboard.
  • Retrieval Accuracy: \(\text{Hit Rate}@K\) (whether the top \(K\) results contain the correct answer) and MRR (Mean Reciprocal Rank).
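Both retrieval metrics are simple to compute; a minimal sketch:

```python
def hit_rate_at_k(ranked_ids, relevant_id, k):
    """1 if the relevant document appears in the top-k results, else 0."""
    return int(relevant_id in ranked_ids[:k])

def mrr(queries):
    """Mean Reciprocal Rank over (ranked_ids, relevant_id) pairs."""
    total = 0.0
    for ranked_ids, relevant_id in queries:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id == relevant_id:
                total += 1.0 / rank
                break
    return total / len(queries)

queries = [(["d3", "d1", "d7"], "d1"),  # relevant at rank 2 -> 1/2
           (["d5", "d2", "d9"], "d5")]  # relevant at rank 1 -> 1/1
```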

2. What are the Classifications of RAG; What Implementation Frameworks Exist for Multimodal RAG; How are Pseudo-Multimodal RAG and Multimodal RAG Implemented, and What are the Differences; What Type of Multimodal RAG Can CLIP Be Used For, and Why

Naive RAG (Simple Retrieval)

This is the most basic linear process: Index -> Retrieve -> Augment -> Generate.

  • Method: Chunk documents and store them in a vector database. When a user asks a question, convert the question into a vector to find the most similar Top-K segments, which are then fed to the LLM.
  • Drawbacks:
    • Inaccurate Retrieval: Vector matching may surface segments that are superficially (keyword-level) similar but semantically unrelated to the question.
    • Context Break: Chunking may cut off an important sentence, leading to incomplete information.
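The linear process above can be sketched end-to-end; toy 2-dimensional vectors stand in for a real embedding model:

```python
import numpy as np

def retrieve_top_k(query_vec, doc_vecs, docs, k=2):
    """Rank chunks by cosine similarity and keep the top k."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    order = np.argsort(d @ q)[::-1][:k]
    return [docs[i] for i in order]

docs = ["Paris is the capital of France.",
        "The Eiffel Tower is in Paris.",
        "Bananas are rich in potassium."]
doc_vecs = np.array([[0.9, 0.1], [0.8, 0.3], [0.0, 1.0]])  # toy embeddings

# Augment: inject the retrieved chunks into the prompt for the LLM.
context = retrieve_top_k(np.array([1.0, 0.0]), doc_vecs, docs)
prompt = ("Answer using only this context:\n" + "\n".join(context)
          + "\n\nQ: Where is the Eiffel Tower?")
```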

Advanced RAG (Preprocessing + Postprocessing)

To address the pain points of simple retrieval, a large number of optimization strategies are added at both ends of the retrieval process:

  • Preprocessing (Pre-Retrieval):
    • Query Rewriting: If the user’s question is vague, first let the LLM rewrite the question to be clearer or generate multiple variants (Multi-Query).
    • Hypothetical Answer (HyDE): First, let the LLM guess an answer, using this “hypothetical answer” to search, which is often more accurate than searching directly with the “question.”
  • Postprocessing (Post-Retrieval):
    • Re-ranking: After vector search retrieves 100 results, use a more precise model (Cross-Encoder) to select the 5 most relevant ones.
    • Context Compression: Simplify lengthy retrieved texts to retain only key information, preventing the LLM from “getting lost in the middle.”
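The retrieve-then-rerank step can be sketched as follows; `cross_encoder_score` here is a hypothetical stand-in (simple token overlap) for a real cross-encoder such as a BERT-based pair classifier:

```python
def cross_encoder_score(query: str, passage: str) -> float:
    """Hypothetical stand-in: a real cross-encoder jointly encodes the
    (query, passage) pair and outputs a relevance score. A toy
    token-overlap heuristic keeps this sketch runnable."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query, candidates, top_n=5):
    """Stage 2: re-score the fast vector-search candidates with the slower,
    more precise pairwise model and keep the best top_n."""
    return sorted(candidates,
                  key=lambda p: cross_encoder_score(query, p),
                  reverse=True)[:top_n]

candidates = ["the cat sat on the mat",          # from vector search
              "stock prices fell sharply today",
              "a cat chased the mouse"]
best = rerank("where is the cat", candidates, top_n=2)
```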

Modular RAG (Modular Composition)

This is the current cutting-edge form, which is no longer linear but atomic and plugin-based.

  • Core Logic: Decomposing RAG into different functional modules that can be flexibly combined based on task requirements.
  • Examples of New Modules:
    • Search Module: Not only searches the vector database but can also search Google or enterprise knowledge graphs.
    • Memory Module: Remembers the user’s previous conversational habits.
    • Rewrite/Route Module: Automatically decides whether to query the database or generate directly based on the question type.
  • Features: Supports iteration and loops. For example, if a search reveals insufficient information, it will automatically trigger the “re-search” module until enough information is gathered.

| Dimension | Naive RAG | Advanced RAG | Modular RAG |
| --- | --- | --- | --- |
| Process | Linear (straight-through) | Linear + pre/post enhancements | Modular, non-linear (loops, branches) |
| Core technology | Vector retrieval | Re-ranking, query transformation | Intelligent routing, multi-source fusion |
| Pain points addressed | None (the 0-to-1 baseline) | Retrieval noise and semantic alignment | Complex tasks and long workflows |

Multimodal RAG Implementation

  • Implementation Frameworks: LlamaIndex, LangChain (Multi-modal), Unstructured.io.
  • Pseudo-Multimodal RAG: Achieved through Captioning. Converts images/videos into textual descriptions and stores them in traditional text vector databases.
    • Difference: Pseudo-multimodal loses visual details; true multimodal matches directly in a unified vector space (Multimodal Embedding).
  • Role of CLIP: Serves as the foundation of true multimodal RAG. It achieves representation of images and text in the same feature space through image-text pairing training, suitable for “searching images with text” or “searching images with images.”

3. How to Evaluate RAG; What is the Most Important Aspect of the RAG Evaluation System

The core of the evaluation system is the RAG Triad:

  1. Faithfulness: Does the answer come from the retrieved context?
  2. Answer Relevance: Does the answer address the query?
  3. Context Precision: Does the retrieved information contain the correct answer?
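As an illustration of the faithfulness dimension, here is a toy lexical proxy; production evaluators (e.g. the RAGAS library) use an LLM judge rather than word overlap:

```python
def faithfulness_score(answer_sentences, context):
    """Toy lexical proxy: fraction of answer sentences whose every word
    also appears in the retrieved context. Real evaluators use an LLM
    judge instead of word overlap."""
    ctx_words = set(context.lower().split())
    supported = sum(
        all(word in ctx_words for word in sentence.lower().split())
        for sentence in answer_sentences
    )
    return supported / len(answer_sentences)
```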

4. What are the Pain Points of Traditional RAG; Introduce GraphRAG, What are the Challenges of GraphRAG; How Does GraphRAG Handle Incremental Scenarios

Pain Points of Traditional RAG

  • Poor Global Understanding: Difficult to answer macro questions like “What is the main idea of this document?”
  • Weak Long-range Associations: Unable to connect implicit entity relationships across documents and paragraphs.

GraphRAG

  • Challenges: The cost of building the graph is extremely high (LLM extraction of entity relationships is time-consuming); schema design is complex.
  • Incremental Scenarios: Achieved through Graph Consolidation. When new entities enter, use LLM for entity disambiguation, merging duplicate nodes and updating existing cluster summaries.
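The consolidation step can be sketched as follows; the `same_entity` check is a hypothetical stand-in for the LLM-based disambiguation call:

```python
def merge_entity(graph, new_name, new_desc, same_entity):
    """Incremental update: if the incoming entity matches an existing node
    (per the disambiguation check), merge descriptions instead of creating
    a duplicate node; otherwise add a fresh node."""
    for name in graph:
        if same_entity(name, new_name):
            graph[name].append(new_desc)
            return name
    graph[new_name] = [new_desc]
    return new_name

graph = {"OpenAI": ["AI research lab"]}
# Case-insensitive match standing in for an LLM disambiguation call.
merged = merge_entity(graph, "openai", "Maker of GPT-4",
                      same_entity=lambda a, b: a.lower() == b.lower())
```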

5. Introduce the Responsibilities of Fine-tuning; What is the Most Important Aspect of Fine-tuning Large Models

Responsibilities: Domain knowledge injection, format alignment (Instruction Following), style transfer.

Most Important Aspect: Data Quality. As the LIMA paper ("Less Is More for Alignment") argues, 1,000 high-quality examples far outweigh 100,000 noisy ones.

6. What are the Methods of Post-training; What are the Methods of Fine-tuning, How are They Done; LoRA Principles and Parameter Count

Post-training Methods

| Stage / Method | Core Objective | Core Approach | Pros and Cons | Applicable Scenarios |
| --- | --- | --- | --- | --- |
| 1. Supervised Fine-tuning (SFT) | Teach the model to converse and follow instructions | Supervised training on (Prompt, Response) data pairs | Pros: quick results; foundational for conversational ability. Cons: hard to address value/preference issues | Basic conversational ability, specific format alignment (e.g., JSON) |
| 2. Continued Pre-training (CPT) | Inject domain knowledge | Next Token Prediction on domain-specific raw text | Pros: significantly enhances industry knowledge. Cons: does not change interaction patterns; high compute cost | Vertical-domain expert models (healthcare, law, finance) |
| 3. Reinforcement Learning (RLHF - PPO) | Deep human preference alignment | Train a reward model (RM), then update the policy with PPO | Pros: high ceiling; makes the model "smarter" and safer. Cons: very complex; several models (policy, reference, reward, value) run simultaneously; training is unstable | Final refinement of top closed/open-source models (e.g., GPT-4, Llama 3) |
| 4. Direct Preference Optimization (DPO) | Efficient preference alignment | Drop the reward model; compute loss directly on (chosen, rejected) answer pairs | Pros: simple, stable, low memory; currently the mainstream choice. Cons: sensitive to data distribution | Most enterprise conversational models and fine-tuning projects |
| 5. Parameter-efficient Fine-tuning (LoRA / PEFT) | Low-cost task adaptation | Freeze original parameters; train only low-rank bypass matrices | Pros: tiny trainable parameter count (<1%); very low memory. Cons: ceiling slightly below full fine-tuning | Downstream adaptation under compute constraints; multi-task switching |
| 6. Hybrid Optimization (ORPO) | One-step alignment | Combine SFT and preference alignment without a reference model | Pros: very simple pipeline; computationally efficient. Cons: newer technique, less community experience | Lightweight projects prioritizing training efficiency |

Fine-tuning Methods

  • Full Fine-tuning: Full parameter fine-tuning, good results but extremely high computational cost.
  • PEFT (Parameter-Efficient Fine-tuning): Such as LoRA, P-Tuning, Adapter.

LoRA Principles

Two low-rank matrices \(A\) and \(B\) are attached in a bypass parallel to the pre-trained weights: \[ W_{new} = W_{base} + \Delta W = W_{base} + A \times B \] Parameter Count: extremely small, usually under 1% of the original model. For a \(d_{model} \times d_{model}\) weight matrix, the trainable parameters number \(2 \times r \times d_{model}\) (where \(r\) is the rank, with \(r \ll d_{model}\)).
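A minimal numpy sketch of the bypass and the parameter count (the sizes here are illustrative):

```python
import numpy as np

d_model, r = 2048, 4  # illustrative sizes

W_base = np.random.randn(d_model, d_model) * 0.02  # frozen pre-trained weight
A = np.random.randn(d_model, r) * 0.02             # trainable low-rank factor
B = np.zeros((r, d_model))                         # trainable; zero-init so delta-W starts at 0

def lora_forward(x):
    """y = x @ (W_base + A @ B): the frozen base path plus the low-rank bypass."""
    return x @ W_base + (x @ A) @ B

trainable = A.size + B.size  # 2 * r * d_model = 16384
total = W_base.size          # d_model**2 = 4194304, so ~0.39% is trainable
```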

7. Introduce DPO; What is the Difference Between DPO and PPO

PPO Principles

PPO (Proximal Policy Optimization) is the classic RLHF algorithm: it requires training a separate Reward Model and updates the policy via policy gradients. The process is complex, and training is notoriously unstable.

Why is PPO Training Unstable?

  • Credit Assignment Problem: An LLM generates a sentence containing dozens of tokens, but the reward is usually given as a total score only at the end. PPO finds it difficult to accurately determine which word led to a high or low score, and this “sparse feedback” causes gradient updates to oscillate.
  • Dual Distribution Drift: PPO involves two dynamic systems: one is the policy model changing, and the other is the value model chasing. If the value model estimates inaccurately, the advantage function provided will mislead the policy model, causing training to collapse instantly.
  • Non-stationarity and KL Penalty: To prevent the model from deviating, PPO must include a KL divergence constraint. However, this coefficient is very difficult to tune—if set too high, the model does not learn (does not update), and if set too low, the model outputs gibberish (Reward Hacking).

Core Essence of DPO and ORPO

DPO (Direct Preference Optimization)

  • Core Principle: Uses a mathematical reparameterization to turn the alignment problem, which originally required reinforcement learning, into a direct comparison problem. No reward model is needed: DPO performs maximum likelihood estimation directly on preference data, converting the RL objective into a simple binary cross-entropy loss. Simpler, more stable, and more efficient.
  • Why Stable: DPO derives a closed-form solution showing that the reward is implicitly expressed through the policy's own log-probability ratios against the reference model. During training, the model improves as long as the probability of "good answers" rises relative to "bad answers." There is no sampling-and-scoring loop; it is essentially contrastive learning.
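The per-pair DPO loss fits in a few lines (a sketch; inputs are summed sequence log-probabilities under the policy and the frozen reference model, and `beta` is the usual KL-strength hyperparameter):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair. Inputs are summed log-probabilities
    of the chosen/rejected responses under the policy and the frozen
    reference model; the loss is -log(sigmoid(beta * margin))."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The policy favors the chosen answer more than the reference does -> low loss.
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-15.0,
                ref_chosen=-12.0, ref_rejected=-13.0)  # margin = 0.4
```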

ORPO (Odds Ratio Preference Optimization)

  • Core Principle: Directly modifies the loss function of SFT. It believes that the model should not only learn “what to say” but also “what not to say.”
  • Core Logic:
    1. Weaken Negative Samples: It directly penalizes the probability of generating negative samples (Rejected) using the Odds Ratio statistical measure.
    2. Single-stage Alignment: It does not require the four models of PPO, nor does it need the reference model of DPO, relying solely on one model to simultaneously complete “knowledge learning” and “preference selection” during the SFT process.
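The odds-ratio term itself can be sketched as follows (illustrative only; in training it is added to the standard SFT negative log-likelihood on the chosen answer):

```python
import math

def odds(p):
    """Odds of generating a sequence with probability p."""
    return p / (1.0 - p)

def orpo_penalty(p_chosen, p_rejected):
    """ORPO's preference term: -log sigmoid(log odds-ratio). Added to the
    ordinary SFT loss on the chosen answer, so a single model learns both
    'what to say' and 'what not to say' in one stage."""
    log_or = math.log(odds(p_chosen) / odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))
```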

8. Introduce Some Agent Implementation Frameworks; What are the Differences Between These Frameworks; What Scenarios is LangGraph Suitable For; What are the Ways to Build Agents with LangGraph

  • Frameworks: AutoGPT (autonomous planning), CrewAI (multi-role collaboration), LangGraph (cyclic flow control).
  • Advantages of LangGraph: Supports stateful and cyclic execution. Traditional LangChain chains form a DAG (Directed Acyclic Graph), which struggles with complex logic that requires repeated iteration and correction.
  • Building Methods: 1. StateGraph (explicit state definition); 2. Nodes & Edges (logical nodes and edges).
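A hand-rolled sketch of the stateful, cyclic control flow that LangGraph's StateGraph formalizes (plain Python, not the LangGraph API): nodes read and write a shared state dict, and a conditional edge loops back to the search node until enough evidence is gathered.

```python
def search(state):
    """Node: gather one more piece of evidence into the shared state."""
    state["evidence"].append(f"doc-{len(state['evidence'])}")
    return state

def generate(state):
    """Node: produce the final answer from the accumulated evidence."""
    state["answer"] = f"answer based on {len(state['evidence'])} documents"
    return state

def route(state):
    """Conditional edge: loop back to search until 3 documents are found."""
    return "generate" if len(state["evidence"]) >= 3 else "search"

nodes = {"search": search, "generate": generate}
state, current = {"evidence": []}, "search"
while True:
    state = nodes[current](state)
    if current == "generate":
        break
    current = route(state)
```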

9. Scenario Question: A Customer Inputs a Screenshot of a Software or Web Interface, How to Help Users Understand the Function of Each Component of the Interface Through RAG (??? Honestly, I didn’t quite understand), Define Input and Output Yourself; How to Distinguish Similar Components Like Image Boxes and Video Boxes

Solution Definition

  • Input: Interface screenshot + target component location (Bounding Box) or coordinates.
  • Output: Function description and interaction logic of the component.
  • Implementation Process:
    1. Preprocessing: Use a multimodal model (e.g., GPT-4o or dedicated OCR + object detection model) to extract elements from the screenshot.
    2. Indexing: Store the features of the components (visual features + location information + associated text) in a vector database.
    3. Retrieval: When the user clicks or inputs a location, calculate the overlap of coordinates or feature similarity.
    4. Generation: Generate descriptions by combining the component’s documentation (Context).
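The coordinate matching in step 3 is typically an intersection-over-union (IoU) check; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, used to match
    the user's selected region against the indexed component bounding boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```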

Distinguishing Similar Components (Image Box vs Video Box)

  1. Multimodal Features: Video boxes typically show a play button (triangle icon) and a progress bar; image boxes do not.
  2. Code/Metadata: If extracted from the code layer, check the differences between <img> tags and <video> or <iframe> tags.
  3. Temporal Analysis: If it is dynamic input, the video box will have frame rate changes, while the image box remains static.
