Learn to build cost-effective apps using Large Language Models
In Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications, Shreyas Subramanian, Principal Data Scientist at Amazon Web Services, delivers a practical guide for developers and data scientists who want to build and deploy cost-effective large language model (LLM)-based solutions. In the book, you'll find coverage of a wide range of key topics, including how to select a model, pre- and post-processing of data, prompt engineering, and instruction fine-tuning.
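As a flavor of the parameter-efficient fine-tuning (PEFT) methods the book covers in Chapter 2, the short sketch below attaches a LoRA adapter to an open model using Hugging Face's transformers and peft libraries; the model choice and hyperparameters here are illustrative assumptions, not an example taken from the book.

    # Minimal sketch: parameter-efficient instruction fine-tuning via LoRA.
    # Model name and hyperparameters are assumptions for illustration.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # assumed model
    lora_config = LoraConfig(
        r=8,                                  # rank of the low-rank update matrices
        lora_alpha=16,                        # scaling applied to the adapter output
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()  # typically well under 1% of all weights

Training then proceeds with any standard trainer while only the small adapter matrices receive gradient updates, which is the main cost lever behind PEFT.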
The author sheds light on techniques for optimizing inference, such as model quantization and pruning (a minimal quantization sketch follows the list below), as well as a range of affordable architectures for typical generative AI (GenAI) applications, including search systems, agent assists, and autonomous agents. You'll also find:
- Effective strategies to address the challenge of the high computational cost associated with LLMs
- Assistance with the complexities of building and deploying affordable generative AI apps, including tuning and inference techniques
- Selection criteria for choosing a model, with particular consideration given to compact, nimble, and domain-specific models
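To make the quantization techniques mentioned above concrete, here is a minimal sketch of loading a model with 8-bit weights through Hugging Face's transformers and bitsandbytes integration; the model name and settings are illustrative assumptions, not the book's own example.

    # Minimal sketch: loading an LLM with 8-bit weight quantization, which
    # roughly halves memory versus fp16. Names and settings are assumptions.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mistral-7B-v0.1"  # assumed example model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",  # spread layers across available devices
    )

    inputs = tokenizer("Why quantize a model?", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

The book's Chapter 3 ("Model Optimization Methods") and Chapter 4 ("Quantization for Powerful but Smaller Models") examine these accuracy-versus-cost trade-offs in detail.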
Perfect for developers and data scientists interested in deploying foundation models, as well as business leaders planning to scale out their use of GenAI, Large Language Model-Based Solutions will also benefit project leaders and managers, technical support staff, and administrators with an interest or stake in the subject.
By: Shreyas Subramanian (Amazon Web Services Inc.)
Imprint: John Wiley & Sons Inc
Country of Publication: United States
Dimensions: 234mm (height) × 185mm (width) × 15mm (spine)
Weight: 318g
ISBN: 9781394240722
ISBN 10: 1394240724
Series: Tech Today
Pages: 224
Publication Date: 07 May 2024
Audience: Professional and scholarly; Undergraduate
Format: Paperback
Publisher's Status: Active
Table of Contents

Introduction xix

Chapter 1: Introduction 1
  Overview of GenAI Applications and Large Language Models 1
  The Rise of Large Language Models 1
  Neural Networks, Transformers, and Beyond 2
  GenAI vs. LLMs: What's the Difference? 5
  The Three-Layer GenAI Application Stack 6
  The Infrastructure Layer 6
  The Model Layer 7
  The Application Layer 8
  Paths to Productionizing GenAI Applications 9
  Sample LLM-Powered Chat Application 11
  The Importance of Cost Optimization 12
  Cost Assessment of the Model Inference Component 12
  Cost Assessment of the Vector Database Component 19
  Benchmarking Setup and Results 20
  Other Factors to Consider 23
  Cost Assessment of the Large Language Model Component 24
  Summary 27

Chapter 2: Tuning Techniques for Cost Optimization 29
  Fine-Tuning and Customizability 29
  Basic Scaling Laws You Should Know 30
  Parameter-Efficient Fine-Tuning Methods 32
  Adapters Under the Hood 33
  Prompt Tuning 34
  Prefix Tuning 36
  P-tuning 39
  IA3 40
  Low-Rank Adaptation 44
  Cost and Performance Implications of PEFT Methods 46
  Summary 48

Chapter 3: Inference Techniques for Cost Optimization 49
  Introduction to Inference Techniques 49
  Prompt Engineering 50
  Impact of Prompt Engineering on Cost 50
  Estimating Costs for Other Models 52
  Clear and Direct Prompts 53
  Adding Qualifying Words for Brief Responses 53
  Breaking Down the Request 54
  Example of Using Claude for PII Removal 55
  Conclusion 59
  Providing Context 59
  Examples of Providing Context 60
  RAG and Long Context Models 60
  Recent Work Comparing RAG with Long Context Models 61
  Conclusion 62
  Context and Model Limitations 62
  Indicating a Desired Format 63
  Example of Formatted Extraction with Claude 63
  Trade-Off Between Verbosity and Clarity 66
  Caching with Vector Stores 66
  What Is a Vector Store? 66
  How to Implement Caching Using Vector Stores 66
  Conclusion 69
  Chains for Long Documents 69
  What Is Chaining? 69
  Implementing Chains 69
  Example Use Case 70
  Common Components 70
  Tools That Implement Chains 72
  Comparing Results 76
  Conclusion 76
  Summarization 77
  Summarization in the Context of Cost and Performance 77
  Efficiency in Data Processing 77
  Cost-Effective Storage 77
  Enhanced Downstream Applications 77
  Improved Cache Utilization 77
  Summarization as a Preprocessing Step 77
  Enhanced User Experience 77
  Conclusion 77
  Batch Prompting for Efficient Inference 78
  Batch Inference 78
  Experimental Results 80
  Using the accelerate Library 81
  Using the DeepSpeed Library 81
  Batch Prompting 82
  Example of Using Batch Prompting 83
  Model Optimization Methods 83
  Quantization 83
  Code Example 84
  Recent Advancements: GPTQ 85
  Parameter-Efficient Fine-Tuning Methods 85
  Recap of PEFT Methods 85
  Code Example 86
  Cost and Performance Implications 87
  Summary 88
  References 88

Chapter 4: Model Selection and Alternatives 89
  Introduction to Model Selection 89
  Motivating Example: The Tale of Two Models 89
  The Role of Compact and Nimble Models 90
  Examples of Successful Smaller Models 91
  Quantization for Powerful but Smaller Models 91
  Text Generation with Mistral 7B 93
  Zephyr 7B and Aligned Smaller Models 94
  CogVLM for Language-Vision Multimodality 95
  Prometheus for Fine-Grained Text Evaluation 96
  Orca 2 and Teaching Smaller Models to Reason 98
  Breaking Traditional Scaling Laws with Gemini and Phi 99
  Phi 1, 1.5, and 2 B Models 100
  Gemini Models 102
  Domain-Specific Models 104
  Step 1 - Training Your Own Tokenizer 105
  Step 2 - Training Your Own Domain-Specific Model 107
  More References for Fine-Tuning 114
  Evaluating Domain-Specific Models vs. Generic Models 115
  The Power of Prompting with General-Purpose Models 120
  Summary 122

Chapter 5: Infrastructure and Deployment Tuning Strategies 123
  Introduction to Tuning Strategies 123
  Hardware Utilization and Batch Tuning 124
  Memory Occupancy 126
  Strategies to Fit Larger Models in Memory 128
  KV Caching 130
  PagedAttention 131
  How Does PagedAttention Work? 131
  Comparisons, Limitations, and Cost Considerations 131
  AlphaServe 133
  How Does AlphaServe Work? 133
  Impact of Batching 134
  Cost and Performance Considerations 134
  S3: Scheduling Sequences with Speculation 134
  How Does S3 Work? 135
  Performance and Cost 135
  Streaming LLMs with Attention Sinks 136
  Fixed to Sliding Window Attention 137
  Extending the Context Length 137
  Working with Infinite Length Context 137
  How Does StreamingLLM Work? 138
  Performance and Results 139
  Cost Considerations 139
  Batch Size Tuning 140
  Frameworks for Deployment Configuration Testing 141
  Cloud-Native Inference Frameworks 142
  Deep Dive into Serving Stack Choices 142
  Batching Options 143
  Options in DJL Serving 144
  High-Level Guidance for Selecting Serving Parameters 146
  Automatically Finding Good Inference Configurations 146
  Creating a Generic Template 148
  Defining a HPO Space 149
  Searching the Space for Optimal Configurations 151
  Results of Inference HPO 153
  Inference Acceleration Tools 155
  TensorRT and GPU Acceleration Tools 156
  CPU Acceleration Tools 156
  Monitoring and Observability 157
  LLMOps and Monitoring 157
  Why Is Monitoring Important for LLMs? 159
  Monitoring and Updating Guardrails 160
  Summary 161

Conclusion 163
Index 181
SHREYAS SUBRAMANIAN, PhD, is a principal data scientist at AWS, one of the largest organizations building and providing large language models for enterprise use. He currently advises both internal Amazon teams and large enterprise customers on building, tuning, and deploying generative AI applications at scale. Shreyas runs machine learning-focused cost optimization workshops that help customers reduce the costs of their machine learning applications in the cloud, and he actively participates in cutting-edge research and development of advanced training, tuning, and deployment techniques for foundation models.