An isolated environment where an AI agent runs code or executes actions with restricted access to the host system. Prevents a compromised or misbehaving agent from touching the filesystem, network, or processes outside its allowed scope. Cr
Validation layers that sit around an LLM to detect and block unsafe inputs or outputs: harmful content, PII leakage, off-topic requests, prompt injection attempts. Can be implemented as input filters, output classifiers, or both. The primar
An attack where malicious instructions are hidden in content the agent reads (a webpage, email, document) and hijack its behavior. For example, a webpage telling the agent to ignore its instructions and exfiltrate data. The SQL injection of
Connecting model outputs to verifiable external sources (search results, databases, real-time APIs) to reduce hallucination and keep answers accurate. RAG is one form of grounding; web search is another.
The mechanism by which LLMs invoke external functions or APIs (running code, searching the web, querying a database) by outputting structured JSON that the host application executes, feeding the result back to the model.
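The host-side half of this mechanism can be sketched in a few lines. This is a minimal illustration, not any vendor's API: the `get_weather` tool, the `TOOLS` registry, and the `{"name": ..., "arguments": ...}` JSON shape are all assumptions made for the example.

```python
import json

# Hypothetical tool: the name and signature are illustrative only.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}

def execute_tool_call(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and run the matching function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# Pretend the model produced this string instead of freeform text.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
tool_result = execute_tool_call(model_output)
```

In a real application the string in `tool_result` would be appended to the conversation so the model can use it in its next turn.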
The core execution cycle of an AI agent: observe the environment or task, reason about what to do next, call a tool or produce output, then observe the result and repeat. Agents run this loop until a stopping condition is met.
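The loop can be sketched as follows, assuming a hypothetical `policy` callable that stands in for the LLM and returns either a tool request or a final answer. The dictionary shapes and the `add_one` tool are invented for illustration; the step limit is one possible stopping condition.

```python
from typing import Callable

def run_agent(task: str, policy: Callable[[list], dict], tools: dict, max_steps: int = 5) -> str:
    """Observe, reason, act, and repeat until the policy returns a final answer."""
    observations = [task]
    for _ in range(max_steps):
        decision = policy(observations)          # reason about what to do next
        if decision["type"] == "final":          # stopping condition: answer ready
            return decision["content"]
        result = tools[decision["tool"]](decision["input"])  # call a tool
        observations.append(result)              # observe the result
    return "stopped: step limit reached"         # stopping condition: budget exhausted

# A scripted stand-in for the model: call one tool, then answer.
def scripted_policy(observations: list) -> dict:
    if len(observations) == 1:
        return {"type": "tool", "tool": "add_one", "input": 41}
    return {"type": "final", "content": f"answer: {observations[-1]}"}

answer = run_agent("What is 41 + 1?", scripted_policy, {"add_one": lambda x: str(x + 1)})
```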
An LLM given access to tools (web search, code execution, APIs) and the ability to reason over multi-step tasks autonomously: perceiving state, planning actions, executing them, and iterating until a goal is achieved.
Forcing the model to emit output in a specific format (JSON, XML, or a fixed schema) rather than freeform text. Critical for building reliable pipelines where downstream code needs to parse the model's response. Often paired with tool use.
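On the consuming side, downstream code typically validates the response before trusting it. A minimal sketch, assuming a hypothetical two-field schema (`sentiment`, `confidence`) and plain standard-library checks in place of a schema library:

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"sentiment": str, "confidence": float}

def parse_structured(raw: str) -> dict:
    """Parse model output as JSON and check it against the expected fields."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data

parsed = parse_structured('{"sentiment": "positive", "confidence": 0.92}')
```

A pipeline would usually catch the `ValueError` and retry the model call or fall back to a default.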
The maximum number of tokens a model can process in a single call, input and output combined. Claude 3.5 Sonnet has a 200K-token context window. Longer contexts enable more complex tasks but increase memory and compute costs.
The model's ability to learn new tasks purely from examples in its context window: no weight updates required. You show it examples, and it adapts. The core mechanism behind few-shot prompting and one of the most surprising emergent abiliti
Zero-shot: asking the model to do a task with no examples. Few-shot: providing 2 to 5 input/output examples in the prompt so the model pattern-matches the format. Few-shot is one of the most reliable and underused techniques in prompt engin
A prompting pattern that interleaves reasoning traces with tool actions: the model thinks out loud ("Thought: I need to search for X"), calls a tool ("Action: search(X)"), observes the result, and repeats. The blueprint behind most modern A
A prompting technique that instructs the model to reason step-by-step before giving a final answer. Dramatically improves performance on complex reasoning tasks. The basis of "thinking" models like Claude's extended thinking mode.
A hidden instruction block sent at the start of a conversation that sets the model's persona, rules, and behavior before any user message arrives. The primary mechanism operators use to customize LLM behavior for their product.
The broader discipline of deciding what information goes into the model's context window: not just the prompt, but retrieved documents, tool results, memory, conversation history, and how it is all structured and prioritized.
The practice of carefully crafting the text inputs to a model to elicit better, more reliable outputs: using techniques like few-shot examples, chain-of-thought, role instructions, and output formatting constraints.
The practice of using your own LLM or AI-powered tools internally as part of your own development workflow before shipping them to customers. AI labs use their frontier models to write code, generate evals, draft research, and run internal
The craft of building the infrastructure to run evaluations at scale: test runners, dataset pipelines, scoring logic, and result tracking. A well-built eval harness is what makes it possible to iterate on a model safely and quickly.
Systematic tests used to measure a model's capabilities, accuracy, or safety across specific tasks. Good evals are what separate rigorous AI development from vibes-based iteration. Everything from math benchmarks to red-teaming.
Training a smaller "student" model to mimic the outputs of a larger "teacher" model. The student learns not just the correct answers but the teacher's probability distributions, capturing nuanced knowledge. How many efficient small models a
An efficient fine-tuning technique that freezes the base model weights and adds small trainable "adapter" matrices. Trains in a fraction of the time and memory of full fine-tuning, while achieving comparable results. The dominant fine-tunin
When a model confidently generates factually incorrect or fabricated information. Happens because LLMs are trained to produce plausible-sounding text, not verified facts. RAG and grounding techniques help mitigate this.
A training technique where human raters rank model outputs, and those preferences train a reward model, which then guides the LLM via reinforcement learning to produce more helpful, harmless, and honest responses. Used by Claude, GPT-4, and
Continuing the training of a base model on a curated, domain-specific dataset to specialize its behavior. More efficient than training from scratch, but requires quality data and can cause "catastrophic forgetting" of prior knowledge.
The raw, unnormalized scores the model outputs for every possible next token before sampling. Logits are divided by the temperature, converted to probabilities via softmax, and optionally truncated by top-p before a token is sampled.
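A small sketch of the logits-to-probabilities step, using made-up logit values. Temperature is applied by dividing the logits before the softmax, which is the usual convention; the function name and inputs are illustrative.

```python
import math

def softmax(logits: list, temperature: float = 1.0) -> list:
    """Convert raw logits into probabilities, scaling by temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # made-up scores for three candidate tokens
probs = softmax(logits)                      # temperature 1.0: unchanged distribution
sharp = softmax(logits, temperature=0.5)     # lower temperature: more peaked
flat = softmax(logits, temperature=2.0)      # higher temperature: more uniform
```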
A sampling strategy that restricts the model to the smallest set of tokens whose cumulative probability reaches p. For example, top-p=0.9 keeps the most likely tokens that together cover 90% of the probability mass and discards the long tail holding the remaining 10%. Often used alongside temperature.
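The filtering step can be sketched like this, with a made-up four-token distribution. The function name is illustrative, and the sketch stops as soon as the cumulative probability reaches p, then renormalizes the survivors.

```python
def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest set of most-likely tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the surviving probabilities sum to 1 again.
    return {token: prob / total for token, prob in kept}

# Made-up next-token distribution.
nucleus = top_p_filter({"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}, p=0.9)
```

Here the first two tokens cover only 0.8 of the mass, so "an" is needed to reach 0.9, and "zebra" is dropped.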
A number (usually 0 to 2) that controls how random the model's outputs are. Low temperature (toward 0) means deterministic and repetitive. High temperature (toward 2) means creative and unpredictable. Most production use cases sit between 0
The abstract mathematical space where embeddings live. Concepts that are semantically similar cluster together; analogies appear as vector arithmetic (king minus man plus woman equals queen). The geometry of this space is what makes semanti
A database optimized for storing and querying embedding vectors using approximate nearest-neighbor search. Examples: Pinecone, Weaviate, pgvector. The storage layer that makes RAG possible at scale.
A pattern where relevant documents are fetched from a database at query time and injected into the model's context, giving it up-to-date or domain-specific knowledge without retraining. Your own private search engine combined with an LLM.
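A toy sketch of the fetch-then-inject pattern. Word overlap stands in for real embedding search here, purely to keep the example self-contained; the documents, function names, and prompt layout are all invented for illustration.

```python
def score(query: str, doc: str) -> int:
    """Toy relevance score: number of shared lowercase words (a stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list, k: int = 1) -> list:
    """Return the k documents that score highest against the query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list, k: int = 1) -> str:
    """Inject the retrieved documents into the prompt ahead of the question."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = ["The refund window is 30 days.", "Shipping takes 5 business days."]
prompt = build_prompt("How long is the refund window?", docs)
```

The resulting `prompt` string is what would be sent to the model, so the answer can draw on the retrieved text without any retraining.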
The process of converting text, images, or other data into a dense vector representation using an encoder model. "Embedding a document" means turning it into a fixed-size vector that captures its meaning.
A list of numbers (e.g. 1536 floats) that encodes the semantic meaning of text. Texts with similar meaning have vectors that are numerically close to each other. The backbone of semantic search and RAG.
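"Numerically close" is commonly measured with cosine similarity. A minimal sketch with hand-written three-dimensional vectors; real embeddings come from an encoder model and have hundreds or thousands of dimensions, so the values below are invented.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for embeddings of three texts.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

related = cosine_similarity(cat, kitten)     # high: similar meaning
unrelated = cosine_similarity(cat, invoice)  # low: different meaning
```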
The act of running a trained model to generate output, as opposed to training. When you send a message to Claude, that is inference. Optimizing inference (speed, cost, memory) is a major engineering challenge at scale.
Compressing a model by reducing the numerical precision of its weights: e.g. from 32-bit floats to 4-bit integers. Trades a small accuracy drop for dramatically lower memory and faster inference. Used to run large models on consumer hardwar
In AI, RAM (and GPU VRAM) is the primary bottleneck for running models. A 70B model in 16-bit precision needs roughly 140GB of VRAM for the weights alone (70 billion parameters at 2 bytes each). Quantization is the main technique to reduce this footprint.
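The arithmetic behind that figure is simple enough to check directly. This sketch counts weights only and uses decimal gigabytes; activations and the KV cache would add to the total, and the function name is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9

fp16_gb = weight_memory_gb(70e9, 16)  # 70B parameters at 2 bytes each: about 140 GB
int4_gb = weight_memory_gb(70e9, 4)   # same model quantized to 4 bits: about 35 GB
```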
A performance optimization that saves the intermediate attention computations (keys and values) for tokens already processed, so the model does not recompute them on each new token. Critical for fast inference in long contexts.
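A toy single-head sketch of the idea: each new token appends its key and value to the cache once, and attention for that token reads the cached entries instead of recomputing them. Using the token vector itself as query, key, and value is a simplification (real models apply learned projections), and the `KVCache` class is hypothetical.

```python
import math

class KVCache:
    """Stores the key and value vectors of tokens already processed."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key: list, value: list) -> None:
        self.keys.append(key)
        self.values.append(value)

def attend(query: list, cache: KVCache) -> list:
    """Attention output for one new token over all cached keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in cache.keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cache.values[0])
    return [sum(w * v[i] for w, v in zip(weights, cache.values)) / total for i in range(dim)]

cache = KVCache()
outputs = []
for token_vec in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    cache.append(token_vec, token_vec)        # computed once, reused on every later step
    outputs.append(attend(token_vec, cache))  # only the new token's query is processed
```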
A language model small enough to run on consumer hardware or at the edge: typically around 7B parameters or fewer. Models like Phi-3, Gemma 2B, and Mistral 7B trade raw capability for dramatically lower inference cost and the ability to run locally w
Models that can process and reason across multiple types of data: text, images, audio, video, code. Claude 3, GPT-4o, and Gemini are all multimodal. Requires encoding different modalities into a shared latent space the LLM can attend over.
An architecture where the model contains many specialized sub-networks ("experts") and a router that activates only a few of them per token. Used in Mixtral and, reportedly, GPT-4, among others: enables massive parameter counts at a fraction of the compute c
A model where every parameter is activated for every token: the standard architecture, as opposed to Mixture of Experts. GPT-2 and LLaMA are dense models. Simpler to train and reason about, but compute cost scales directly with par
The system that converts raw text into tokens: small chunks (words, subwords, or characters) the model actually processes. "Hello world" might be 2 tokens; a rare word might be split into 4. Token count directly impacts cost and context lim
The numerical values inside a model: billions of floating-point numbers that encode everything the model learned during training. A "70B model" has 70 billion parameters. More parameters does not always mean better performance.
The second major component inside every Transformer block, after the attention layer. While attention lets tokens communicate with each other, the FFN processes each token independently through two linear transformations with a non-linearit
The core operation in a Transformer that lets every token "look at" every other token and weigh how relevant each is. This is what gives LLMs their ability to understand long-range dependencies in text.
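One common formulation is scaled dot-product attention, sketched below in plain Python. The input vectors are made up, and the same matrix is passed as queries, keys, and values for brevity; real models derive each from learned projections.

```python
import math

def softmax(xs: list) -> list:
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q: list, K: list, V: list):
    """Every query scores every key; the weights then mix the value vectors."""
    d = len(K[0])
    weights = [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d) for k_row in K])
        for q_row in Q
    ]
    out = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return out, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three made-up token vectors
out, weights = self_attention(X, X, X)
```

Each row of `weights` sums to 1 and records how much one token attends to each of the others.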
The neural network architecture introduced in 2017 ("Attention Is All You Need") that powers virtually all modern LLMs. Uses self-attention to process tokens in parallel, capturing relationships across the entire context.