
Training Data for Machine Learning

Human Supervision from Annotation to Data Science

Paperback | English | 2023 | ISBN 9781492094524


Your training data has as much to do with the success of your data project as the algorithms themselves because most failures in AI systems relate to training data. But while training data is the foundation for successful AI and machine learning, there are few comprehensive resources to help you ace the process.

In this hands-on guide, author Anthony Sarkis, lead engineer for the Diffgram AI training data software, shows technical professionals, managers, and subject matter experts how to work with and scale training data, while illuminating the human side of supervising machines. Engineering leaders, data engineers, and data science professionals alike will gain a solid understanding of the concepts, tools, and processes they need to succeed with training data.

With this book, you'll learn how to:
- Work effectively with training data including schemas, raw data, and annotations
- Transform your work, team, or organization to be more AI/ML data-centric
- Clearly explain training data concepts to other staff, team members, and stakeholders
- Design, deploy, and ship training data for production-grade AI applications
- Recognize and correct new training-data-based failure modes such as data bias
- Confidently use automation to more effectively create training data
- Successfully maintain, operate, and improve training data systems of record
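To give a flavor of the core concepts the book covers, here is a hypothetical sketch (not code from the book) of a single annotation record combining a schema label, attributes, and a spatial type; all names and fields are illustrative assumptions:

```python
# Hypothetical sketch of one training data annotation record:
# a schema label ("car"), attributes, and a spatial type (bounding box).
annotation = {
    "label": "car",
    "attributes": {"color": "red", "occluded": False},
    "spatial_type": "box",
    "coordinates": {"x_min": 12, "y_min": 30, "x_max": 120, "y_max": 96},
    "media": {"file": "frame_0001.jpg"},
}

def box_area(a):
    """Area of a bounding-box annotation, in pixels."""
    c = a["coordinates"]
    return (c["x_max"] - c["x_min"]) * (c["y_max"] - c["y_min"])

print(box_area(annotation))  # (120-12) * (96-30) = 7128
```

Real annotation tools each define their own schema for records like this, but the separation of label, attributes, and spatial geometry is the common thread.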


Number of pages: 250
Main category: IT management / ICT




Who Should Read This Book?
For the Technical Professional and Engineer
For the Manager and Director
For the Subject Matter Expert and Data Annotation Specialist
For the Data Scientist
Why I Wrote This Book
How This Book Is Organized
The Basics and Getting Started
Concepts and Theories
Putting It All Together
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us

1. Training Data Introduction
Training Data Intents
What Can You Do With Training Data?
What Is Training Data Most Concerned With?
Training Data Opportunities
Business Transformation
Training Data Efficiency
Tooling Proficiency
Process Improvement Opportunities
Why Training Data Matters
ML Applications Are Becoming Mainstream
The Foundation of Successful AI
Training Data Is Here to Stay
Training Data Controls the ML Program
New Types of Users
Training Data in the Wild
What Makes Training Data Difficult?
The Art of Supervising Machines
A New Thing for Data Science
ML Program Ecosystem
Data-Centric Machine Learning
History of Development Affects Training Data Too
What Training Data Is Not
Generative AI
Human Alignment Is Human Supervision

2. Getting Up and Running
Getting Up and Running
Tasks Setup
Annotator Setup
Data Setup
Workflow Setup
Data Catalog Setup
Initial Usage
Tools Overview
Training Data for Machine Learning
Growing Selection of Tools
People, Process, and Data
Embedded Supervision
Human Computer Supervision
Separation of End Concerns
Many Personas
A Paradigm to Deliver Machine Learning Software
Installed Versus Software as a Service
Development System
Installation Options
Annotation Interfaces
Modeling Integration
Multi-User versus Single-User Systems
Hidden Assumptions
Open Source and Closed Source
Open Source Standards
Realizing the Need for Dedicated Tooling

3. Schema
Schema Deep Dive Introduction
Labels and Attributes—What Is It?
What Do We Care About?
Introduction to Labels
Attributes Introduction
Attribute Complexity Exceeds Spatial Complexity
Technical Overview
Spatial Representation—Where Is It?
Using Spatial Types to Prevent Social Bias
Trade-Offs with Types
Computer Vision Spatial Type Examples
Relationships, Sequences, Time Series: When Is It?
Sequences and Relationships
Guides and Instructions
Judgment Calls
Relation of Machine Learning Tasks to Training Data
Semantic Segmentation
Image Classification (Tags)
Object Detection
Pose Estimation
Relationship of Tasks to Training Data Types
General Concepts
Instance Concept Refresher
Upgrading Data Over Time
The Boundary Between Modeling and Training Data
Raw Data Concepts

4. Data Engineering
Who Wants the Data?
A Game of Telephone
Planning a Great System
Naive and Training Data–Centric Approaches
Raw Data Storage
By Reference or by Value
Off-the-Shelf Dedicated Training Data Tooling on Your Own Hardware
Data Storage: Where Does the Data Rest?
External Reference Connection
Raw Media (BLOB)–Type Specific
Formatting and Mapping
User-Defined Types (Compound Files)
Defining DataMaps
Ingest Wizards
Organizing Data and Useful Storage
Remote Storage
Data Access
Disambiguating Storage, Ingestion, Export, and Access
File-Based Exports
Streaming Data
Queries Introduction
Integrations with the Ecosystem
Access Control
Identity and Authorization
Example of Setting Permissions
Signed URLs
Personally Identifiable Information
Updating Data

5. Workflow
Glue Between Tech and People
Why Are Human Tasks Needed?
Partnering with Non-Software Users in New Ways
Getting Started with Human Tasks
Schemas’ Staying Power
User Roles
Gold Standard Training
Task Assignment Concepts
Do You Need to Customize the Interface?
How Long Will the Average Annotator Be Using It?
Tasks and Project Structure
Quality Assurance
Annotator Trust
Annotators Are Partners
Common Causes of Training Data Errors
Task Review Loops
Annotation Metrics Examples
Data Exploration
Using the Model to Debug the Humans
Distinctions Between a Dataset, Model, and Model Run
Getting Data to Models
Overview of Streaming
Data Organization
Pipelines and Processes
Direct Annotation
Business Process Integration
Depth of Labeling
Supervising Existing Data
Interactive Automations
Example: Semantic Segmentation Auto Bordering

6. Theories, Concepts, and Maintenance
A System Is Only as Useful as Its Schema
Who Supervises the Data Matters
Intentionally Chosen Data Is Best
Working with Historical Data
Training Data Is Like Code
Surface Assumptions Around Usage of Your Training Data
Human Supervision Is Different from Classic Datasets
General Concepts
Data Relevancy
Need for Both Qualitative and Quantitative Evaluations
Prioritization: What to Label
Transfer Learning’s Relation to Datasets (Fine-Tuning)
Per-Sample Judgment Calls
Ethical and Privacy Considerations
Bias Is Hard to Escape
Preventing Lost Metadata
Train/Val/Test Is the Cherry on Top
Sample Creation
Simple Schema for a Strawberry Picking System
Geometric Representations
Binary Classification
Let’s Manually Create Our First Set
Upgraded Classification
Where Is the Traffic Light?
Net Lift
Levels of System Maturity of Training Data Operations
Applied Versus Research Sets
Training Data Management
Completed Tasks
Maintaining Set Metadata
Task Management

7. AI Transformation and Use Cases
AI Transformation
Seeing Your Day-to-Day Work as Annotation
The Creative Revolution of Data-centric AI
You Can Create New Data
You Can Change What Data You Collect
You Can Change the Meaning of the Data
You Can Create!
Think Step Function Improvement for Major Projects
Build Your AI Data to Secure Your AI Present and Future
Appoint a Leader: The Director of AI Data
New Expectations People Have for the Future of AI
Sometimes Proposals and Corrections, Sometimes Replacement
Upstream Producers and Downstream Consumers
Spectrum of Training Data Team Engagement
Dedicated Producers and Other Teams
Organizing Producers from Other Teams
Use Case Discovery
Rubric for Good Use Cases
Evaluating a Use Case Against the Rubric
Conceptual Effects of Use Cases
The New “Crowd Sourcing”: Your Own Experts
Key Levers on Training Data ROI
What the Annotated Data Represents
Trade-Offs of Controlling Your Own Training Data
The Need for Hardware
Common Project Mistakes
Modern Training Data Tools
Think Learning Curve, Not Perfection
New Training and Knowledge Are Required
How Companies Produce and Consume Data
Trap to Avoid: Premature Optimization in Training Data
No Silver Bullets
Culture of Training Data
New Engineering Principles

8. Automation
Getting Started
Motivation: When to Use These Methods?
Check What Part of the Schema a Method Is Designed to Work On
What Do People Actually Use?
What Kind of Results Can I Expect?
Common Confusions
User Interface Optimizations
Nature of Automations
Setup Costs
How to Benchmark Well
How to Scope the Automation Relative to the Problem
Correction Time
Subject Matter Experts
Consider How the Automations Stack
Standard Pre-Labeling
Pre-Labeling a Portion of the Data Only
Interactive Annotation Automation
Creating Your Own
Technical Setup Notes
What Is a Watcher? (Observer Pattern)
How to Use a Watcher
Interactive Capturing of a Region of Interest
Interactive Drawing Box to Polygon Using GrabCut
Full Image Model Prediction Example
Example: Person Detection for Different Attribute
Quality Assurance Automation
Using the Model to Debug the Humans
Automated Checklist Example
Domain-Specific Reasonableness Checks
Data Discovery: What to Label
Human Exploration
Raw Data Exploration
Metadata Exploration
Adding Pre-Labeling-Based Metadata
Better Models Are Better than Better Augmentation
To Augment or Not to Augment
Simulation and Synthetic Data
Simulations Still Need Human Review
Media Specific
What Methods Work with Which Media?
Media-Specific Research
Domain Specific
Geometry-Based Labeling
Heuristics-Based Labeling

9. Case Studies and Stories
A Security Startup Adopts Training Data Tools
Quality Assurance at a Large-Scale Self-Driving Project
Big-Tech Challenges
Insurance Tech Startup Lessons
An Academic Approach to Training Data
Kaggle TSA Competition

About the Author
