Anthony Sarkis

Training Data for Machine Learning

Human Supervision from Annotation to Data Science

Paperback, English, 2023, 1st edition, ISBN 9781492094524

Price: €75.56

Summary

Your training data has as much to do with the success of your data project as the algorithms themselves because most failures in AI systems relate to training data. But while training data is the foundation for successful AI and machine learning, there are few comprehensive resources to help you ace the process.

In this hands-on guide, author Anthony Sarkis, lead engineer for the Diffgram AI training data software, shows technical professionals, managers, and subject matter experts how to work with and scale training data, while illuminating the human side of supervising machines. Engineering leaders, data engineers, and data science professionals alike will gain a solid understanding of the concepts, tools, and processes they need to succeed with training data.

With this book, you'll learn how to:
- Work effectively with training data, including schemas, raw data, and annotations
- Transform your work, team, or organization to be more AI/ML data-centric
- Clearly explain training data concepts to other staff, team members, and stakeholders
- Design, deploy, and ship training data for production-grade AI applications
- Recognize and correct new training-data-based failure modes such as data bias
- Confidently use automation to more effectively create training data
- Successfully maintain, operate, and improve training data systems of record

Specifications

ISBN-13: 9781492094524

Keywords: machine learning, data science

Language: English

Binding: paperback

Number of pages: 250

Publisher: O'Reilly

Edition: 1

Publication date: 30 November 2023

Main category: IT management / ICT

Table of Contents

Preface
Who Should Read This Book?
For the Technical Professional and Engineer
For the Manager and Director
For the Subject Matter Expert and Data Annotation Specialist
For the Data Scientist
Why I Wrote This Book
How This Book Is Organized
Themes
The Basics and Getting Started
Concepts and Theories
Putting It All Together
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments

1. Training Data Introduction
Training Data Intents
What Can You Do With Training Data?
What Is Training Data Most Concerned With?
Training Data Opportunities
Business Transformation
Training Data Efficiency
Tooling Proficiency
Process Improvement Opportunities
Why Training Data Matters
ML Applications Are Becoming Mainstream
The Foundation of Successful AI
Training Data Is Here to Stay
Training Data Controls the ML Program
New Types of Users
Training Data in the Wild
What Makes Training Data Difficult?
The Art of Supervising Machines
A New Thing for Data Science
ML Program Ecosystem
Data-Centric Machine Learning
Failures
History of Development Affects Training Data Too
What Training Data Is Not
Generative AI
Human Alignment Is Human Supervision
Summary

2. Getting Up and Running
Introduction
Getting Up and Running
Installation
Tasks Setup
Annotator Setup
Data Setup
Workflow Setup
Data Catalog Setup
Initial Usage
Optimization
Tools Overview
Training Data for Machine Learning
Growing Selection of Tools
People, Process, and Data
Embedded Supervision
Human Computer Supervision
Separation of End Concerns
Standards
Many Personas
A Paradigm to Deliver Machine Learning Software
Trade-Offs
Costs
Installed Versus Software as a Service
Development System
Scale
Installation Options
Annotation Interfaces
Modeling Integration
Multi-User versus Single-User Systems
Integrations
Scope
Hidden Assumptions
Security
Open Source and Closed Source
History
Open Source Standards
Realizing the Need for Dedicated Tooling
Summary

3. Schema
Schema Deep Dive Introduction
Labels and Attributes—What Is It?
What Do We Care About?
Introduction to Labels
Attributes Introduction
Attribute Complexity Exceeds Spatial Complexity
Technical Overview
Spatial Representation—Where Is It?
Using Spatial Types to Prevent Social Bias
Trade-Offs with Types
Computer Vision Spatial Type Examples
Relationships, Sequences, Time Series: When Is It?
Sequences and Relationships
When
Guides and Instructions
Judgment Calls
Relation of Machine Learning Tasks to Training Data
Semantic Segmentation
Image Classification (Tags)
Object Detection
Pose Estimation
Relationship of Tasks to Training Data Types
General Concepts
Instance Concept Refresher
Upgrading Data Over Time
The Boundary Between Modeling and Training Data
Raw Data Concepts
Summary

4. Data Engineering
Introduction
Who Wants the Data?
A Game of Telephone
Planning a Great System
Naive and Training Data–Centric Approaches
Raw Data Storage
By Reference or by Value
Off-the-Shelf Dedicated Training Data Tooling on Your Own Hardware
Data Storage: Where Does the Data Rest?
External Reference Connection
Raw Media (BLOB)–Type Specific
Formatting and Mapping
User-Defined Types (Compound Files)
Defining DataMaps
Ingest Wizards
Organizing Data and Useful Storage
Remote Storage
Versioning
Data Access
Disambiguating Storage, Ingestion, Export, and Access
File-Based Exports
Streaming Data
Queries Introduction
Integrations with the Ecosystem
Security
Access Control
Identity and Authorization
Example of Setting Permissions
Signed URLs
Personally Identifiable Information
Pre-Labeling
Updating Data
Summary

5. Workflow
Introduction
Glue Between Tech and People
Why Are Human Tasks Needed?
Partnering with Non-Software Users in New Ways
Getting Started with Human Tasks
Basics
Schemas’ Staying Power
User Roles
Training
Gold Standard Training
Task Assignment Concepts
Do You Need to Customize the Interface?
How Long Will the Average Annotator Be Using It?
Tasks and Project Structure
Quality Assurance
Annotator Trust
Annotators Are Partners
Common Causes of Training Data Errors
Task Review Loops
Analytics
Annotation Metrics Examples
Data Exploration
Models
Using the Model to Debug the Humans
Distinctions Between a Dataset, Model, and Model Run
Getting Data to Models
Dataflow
Overview of Streaming
Data Organization
Pipelines and Processes
Direct Annotation
Business Process Integration
Attributes
Depth of Labeling
Supervising Existing Data
Interactive Automations
Example: Semantic Segmentation Auto Bordering
Video
Summary

6. Theories, Concepts, and Maintenance
Introduction
Theories
A System Is Only as Useful as Its Schema
Who Supervises the Data Matters
Intentionally Chosen Data Is Best
Working with Historical Data
Training Data Is Like Code
Surface Assumptions Around Usage of Your Training Data
Human Supervision Is Different from Classic Datasets
General Concepts
Data Relevancy
Need for Both Qualitative and Quantitative Evaluations
Iterations
Prioritization: What to Label
Transfer Learning’s Relation to Datasets (Fine-Tuning)
Per-Sample Judgment Calls
Ethical and Privacy Considerations
Bias
Bias Is Hard to Escape
Metadata
Preventing Lost Metadata
Train/Val/Test Is the Cherry on Top
Sample Creation
Simple Schema for a Strawberry Picking System
Geometric Representations
Binary Classification
Let’s Manually Create Our First Set
Upgraded Classification
Where Is the Traffic Light?
Maintenance
Actions
Net Lift
Levels of System Maturity of Training Data Operations
Applied Versus Research Sets
Training Data Management
Quality
Completed Tasks
Freshness
Maintaining Set Metadata
Task Management
Summary

7. AI Transformation and Use Cases
Introduction
AI Transformation
Seeing Your Day-to-Day Work as Annotation
The Creative Revolution of Data-centric AI
You Can Create New Data
You Can Change What Data You Collect
You Can Change the Meaning of the Data
You Can Create!
Think Step Function Improvement for Major Projects
Build Your AI Data to Secure Your AI Present and Future
Appoint a Leader: The Director of AI Data
New Expectations People Have for the Future of AI
Sometimes Proposals and Corrections, Sometimes Replacement
Upstream Producers and Downstream Consumers
Spectrum of Training Data Team Engagement
Dedicated Producers and Other Teams
Organizing Producers from Other Teams
Use Case Discovery
Rubric for Good Use Cases
Evaluating a Use Case Against the Rubric
Conceptual Effects of Use Cases
The New “Crowd Sourcing”: Your Own Experts
Key Levers on Training Data ROI
What the Annotated Data Represents
Trade-Offs of Controlling Your Own Training Data
The Need for Hardware
Common Project Mistakes
Modern Training Data Tools
Think Learning Curve, Not Perfection
New Training and Knowledge Are Required
How Companies Produce and Consume Data
Trap to Avoid: Premature Optimization in Training Data
No Silver Bullets
Culture of Training Data
New Engineering Principles
Summary

8. Automation
Introduction
Getting Started
Motivation: When to Use These Methods?
Check What Part of the Schema a Method Is Designed to Work On
What Do People Actually Use?
What Kind of Results Can I Expect?
Common Confusions
User Interface Optimizations
Risks
Trade-Offs
Nature of Automations
Setup Costs
How to Benchmark Well
How to Scope the Automation Relative to the Problem
Correction Time
Subject Matter Experts
Consider How the Automations Stack
Pre-Labeling
Standard Pre-Labeling
Pre-Labeling a Portion of the Data Only
Interactive Annotation Automation
Creating Your Own
Technical Setup Notes
What Is a Watcher? (Observer Pattern)
How to Use a Watcher
Interactive Capturing of a Region of Interest
Interactive Drawing Box to Polygon Using GrabCut
Full Image Model Prediction Example
Example: Person Detection for Different Attribute
Quality Assurance Automation
Using the Model to Debug the Humans
Automated Checklist Example
Domain-Specific Reasonableness Checks
Data Discovery: What to Label
Human Exploration
Raw Data Exploration
Metadata Exploration
Adding Pre-Labeling-Based Metadata
Augmentation
Better Models Are Better than Better Augmentation
To Augment or Not to Augment
Simulation and Synthetic Data
Simulations Still Need Human Review
Media Specific
What Methods Work with Which Media?
Considerations
Media-Specific Research
Domain Specific
Geometry-Based Labeling
Heuristics-Based Labeling
Summary

9. Case Studies and Stories
Introduction
Industry
A Security Startup Adopts Training Data Tools
Quality Assurance at a Large-Scale Self-Driving Project
Big-Tech Challenges
Insurance Tech Startup Lessons
Stories
An Academic Approach to Training Data
Kaggle TSA Competition
Summary

Index
About the Author
