DP-100 Study Guide
Table of Contents
- DP-100 Designing and Implementing a Data Science Solution on Azure
- Design and prepare a machine learning solution (20–25%)
- Explore data, and run experiments (20–25%)
- Use automated machine learning to explore optimal models
- Use automated machine learning for tabular data
- Use automated machine learning for computer vision
- Use automated machine learning for natural language processing
- Select and understand training options, including preprocessing and algorithms
- Evaluate an automated machine learning run, including responsible AI guidelines
- Use notebooks for custom model training
- Use the terminal to configure a compute instance
- Access and wrangle data in notebooks
- Wrangle data interactively with attached Synapse Spark pools and serverless Spark compute
- Retrieve features from a feature store to train a model
- Track model training by using MLflow
- Evaluate a model, including responsible AI guidelines
- Automate hyperparameter tuning
- Train and deploy models (25–30%)
- Optimize language models for AI applications (25–30%)
- Prepare for model optimization
- Select and deploy a language model from the model catalog
- Compare language models using benchmarks
- Test a deployed language model in the playground
- Select an optimization approach
- Optimize through prompt engineering and prompt flow
- Optimize through Retrieval Augmented Generation (RAG)
- Optimize through fine-tuning
Design and prepare a machine learning solution (20–25%)
Design a machine learning solution
Identify the structure and format for datasets
- Understand Dataset Structure: A dataset consists of organized data, typically in tables, files, or documents. In Azure, datasets define the shape and type of data you’re working with—such as columns, data types, and relationships. Before building ML solutions, it’s essential to clarify what data elements are needed and how they relate.
- Common Formats: Datasets may be stored or transmitted in different formats, including CSV, JSON, Parquet, or database tables (e.g., Azure SQL Table). The format affects how you read, process, and transform the data in your Azure workflows.
- Specify Schema and Metadata: In Azure Data Factory and Synapse Analytics, you can import the schema (the blueprint of what data looks like) directly from the source or a file. Schema includes field names, types (string, integer, date, etc.), and sometimes relationships (1:N, N:N). Defining schema clearly helps with mapping and processing steps in data pipelines.
- Dataset Configuration in Azure: When creating a dataset, properties such as ‘name’, ‘type’, ‘schema’, and ‘typeProperties’ must be set. These configurations determine how Azure services interact with the data and ensure compatibility with downstream machine learning models.
- Staying Actionable: Always align dataset structure and format to the needs of the business goal or use case. Document your data model and formats so users and other systems understand and can use the data correctly, minimizing errors and confusion.
Example: Suppose you’re building an expense report solution. Your dataset includes a main table (ExpenseReports) with fields like EmployeeName, Department, ReportID, and a related table (LineItems) containing ItemDescription, Amount, Date, and ReportID. The data is stored in CSV files, which you import and link in Azure Data Factory. This clear tabular structure, with matching ReportIDs, allows easy data analysis.
Use Case: A beginner Azure Data engineer needs to create a pipeline that loads user activity logs (in JSON format) from Azure Blob Storage into Azure SQL. They define a dataset in Azure Data Factory specifying the file format (JSON), schema (fields like userId, action, timestamp), and mapping rules. This enables automated, structured data loading feeding into downstream analytics and ML models.
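To make the expense report example concrete, here is a minimal sketch (using pandas, with hypothetical file and column names taken from the example above) of loading the two CSV tables and joining them on the shared ReportID key to inspect the resulting schema:

```python
import pandas as pd

# Hypothetical file names mirroring the expense report example
reports = pd.read_csv("ExpenseReports.csv")   # EmployeeName, Department, ReportID
line_items = pd.read_csv("LineItems.csv")     # ItemDescription, Amount, Date, ReportID

# Join the two tables on the shared ReportID key (a 1:N relationship)
expenses = reports.merge(line_items, on="ReportID", how="inner")

# Quick schema check: column names and inferred data types
print(expenses.dtypes)
print(expenses.head())
```

Inspecting the inferred data types this way is a quick check that the schema you defined in the pipeline matches what actually arrives in the files.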
For more information see these links:
- Data modeling: Designing your data structure
- Datasets in Azure Data Factory and Azure Synapse Analytics
- DataSet and XmlDataDocument Synchronization
- Writing DataSet Schema Information as XSD
- Datasets
Determine the compute specifications for a machine learning workload
- Understand the workload requirements: Assess the size and type of your machine learning tasks. Training deep learning models, for example, typically requires more computational power (GPUs, more memory) compared to smaller models or batch inferencing. Start by identifying if your workload is CPU-bound or can benefit from GPU acceleration.
- Select appropriate Azure VM series: Azure offers different VM series for machine learning, such as the NC (NVIDIA GPU) and ND (high-end GPU, designed for deep learning) series. Choose a VM with enough GPU memory, vCPUs, and system RAM to fit your dataset and model size. For example, NC96ads_A100_v4 offers 4 powerful NVIDIA A100 GPUs, 96 vCPUs, and 880GB RAM for large-scale deep learning tasks.
- Ensure software and hardware compatibility: Match your selected VM’s GPU architecture to the compatible CUDA version needed by your ML framework (e.g., PyTorch, TensorFlow). For example, NVIDIA A100 GPUs in Azure require CUDA 11.0 or later. Always confirm that your framework and drivers are compatible to avoid setup issues.
Example: A new data analyst at an IT company wants to train an image classification model using Azure Machine Learning. The dataset is large, so training on a laptop would take days. By selecting an NC A100 v4 VM in Azure, which includes NVIDIA A100 GPUs and plenty of memory, the analyst speeds up the process significantly, completing training in hours rather than days.
Use Case: An IT team new to Azure wants to deploy a predictive maintenance model for server hardware. They use Azure Machine Learning to select an NCads_A100_v4 VM, ensuring enough GPU power and memory to efficiently train their deep learning model. They check the CUDA version compatibility before training, and use Azure CLI’s ‘az ml compute list-sizes’ command to compare options and select the right compute target.
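As a rough sketch of how you might compare compute options programmatically, the Azure Machine Learning Python SDK v2 can list the VM sizes available to a workspace. The subscription, resource group, and workspace names are placeholders, and the attribute names used in the loop (v_cpus, gpus, memory_gb) are assumptions to verify against your SDK version:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# Placeholder identifiers -- substitute your own values
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# List VM sizes available to the workspace and surface GPU-capable options
for size in ml_client.compute.list_sizes():
    if size.gpus:  # keep only sizes that include GPUs
        print(size.name, size.v_cpus, "vCPUs,", size.gpus, "GPUs,", size.memory_gb, "GB RAM")
```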
For more information see these links:
- Migration Guide for GPU Compute Workloads in Azure
- What are compute targets in Azure Machine Learning?
- ND-MI300X-v5 sizes series
- NC_A100_v4 sizes series
Select the development approach to train a model
- Understand Different Training Approaches: When training a machine learning model in Azure, you can choose from several approaches like code-first (using Python SDK), low-code/no-code tools (such as Azure Machine Learning Designer), or automated machine learning (AutoML) where the service chooses the best model and parameters for you.
- Assess Skill Level and Project Needs: For beginners or those new to coding, AutoML and Designer are recommended as they simplify the process. If you have some programming experience or want more control, using the Python SDK and custom scripts provides flexibility.
- Consider Model Complexity and Data Size: Simple models with smaller data can be trained using Designer or AutoML. For large datasets, complex models, or requiring distributed training (like deep learning), use code-based approaches or Azure’s scalable infrastructure.
- Integration with Azure Ecosystem: Choose tools and approaches that integrate easily with existing Azure data sources and deployment targets. For example, Designer easily connects to Azure datasets, while code-first solutions support advanced integration and automation in pipelines.
- Iterate and Evaluate: No matter the approach, it’s important to test the model with new or real-world data, adjust the algorithm if needed, and evaluate performance before deploying to production.
Example: Suppose an IT team needs to classify incoming customer support emails into categories (like Billing, Technical Issue, or Account Management). A beginner-friendly approach would be to use Azure Machine Learning’s AutoML to upload the labeled email data, select ‘Classification’ as the problem type, and let the service automatically test several models and pick the best one.
Use Case: A new IT data analyst uses Azure Machine Learning Designer to build a churn prediction model for a software-as-a-service (SaaS) product. Without writing code, they use drag-and-drop modules to import customer usage data, select features, train a model, and evaluate its performance. This helps the analyst quickly deliver a working solution and share insights with the team.
For more information see these links:
- Train a model with CNTK
- AI architecture design
- Train AI and ML models
- Train models with Azure Machine Learning
Create and manage resources in an Azure Machine Learning workspace
Create and manage a workspace
- A workspace in Azure Machine Learning is a centralized environment where you can manage datasets, experiments, models, and compute resources. Workspaces act as containers that help organize and secure your machine learning assets.
- To create a workspace, you can use the Azure portal, Azure Machine Learning Studio, or the Python SDK. You’ll need to choose a unique name, resource group, region, and configure network and security settings suitable for your organization’s needs.
- Managing a workspace involves updating settings, adding or removing users, configuring access permissions, and monitoring resources. You can also manage compute targets, data storage, and collaborate with your team within the workspace.
- You can delete a workspace if it is no longer needed; however, this will remove all resources associated with it. Always back up important experiments, models, and data before deleting a workspace.
- Workspaces can be shared among teams, allowing multiple users to work on related projects efficiently and securely, supporting collaboration and reproducibility in machine learning workflows.
Example: Imagine you are a data analyst at a retail company new to Azure. You want to analyze customer purchasing trends using machine learning. First, you create an Azure Machine Learning workspace named ‘RetailTrendsML’ using the Azure portal, select your company’s Azure subscription, and pick your closest region for data storage. Afterwards, you upload historical sales data and invite your coworker to collaborate on building and testing predictive models, all within this workspace.
Use Case: A small IT team at a healthcare provider wants to develop predictive models for patient appointment no-shows. By creating an Azure Machine Learning workspace, they centralize all data, code, experiments, and models, making it easier to share resources securely with team members and ensure compliance with data privacy standards.
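A minimal sketch of creating a workspace with the Python SDK v2, assuming the azure-ai-ml and azure-identity packages and placeholder subscription and resource group values; the workspace name mirrors the 'RetailTrendsML' example above:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace

# Client scoped to a subscription and resource group (no workspace exists yet)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
)

ws = Workspace(
    name="RetailTrendsML",
    location="eastus",  # pick the region closest to your data and users
    description="Workspace for analyzing customer purchasing trends",
)

created = ml_client.workspaces.begin_create(ws).result()
print(created.name, created.location)
```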
For more information see these links:
- Manage workspaces in Microsoft Playwright Testing Preview
- Create and work with workspaces
- Manage workspaces
- Manage Azure Machine Learning workspaces in the portal or with the Python SDK (v2)
- Create and manage a workspace in Azure API Management
Create and manage datastores
- Datastores in Azure Machine Learning are secure connections to your cloud-based storage accounts, such as Azure Blob, Azure Data Lake, and Azure File. Using datastores allows you to organize, manage, and access your data from machine learning experiments without writing extra code for authentication or file access.
- Creating a datastore can be done directly in Azure Machine Learning studio, via the VS Code extension, CLI, or Python SDK. You typically select your storage type, provide necessary authentication details, and register the datastore to your workspace so it’s available for experiments and pipelines.
- Managing datastores includes viewing their configuration, updating connection information, unregistering datastores you no longer need, and configuring access roles for data security. In the VS Code extension, you can perform these management tasks with simple right-click operations or command palette commands.
- Datastores support both credential-based (using storage keys or SAS tokens) and identity-based (using Azure managed identities) access. Choosing the best method for your scenario enhances both security and convenience.
- Datastores are essential for handling large datasets efficiently—by connecting your ML workspace to data that lives in Azure storage, you avoid duplicate uploads, keep costs low, and ensure your data is always up-to-date for your models.
Example: Imagine a new Azure Data professional working on a predictive analytics project for an online retail company. They upload customer purchase data into an Azure Blob Storage account. Instead of manually coding access for each experiment, they create a datastore in Azure Machine Learning studio that points to their Blob Storage, securely handling credentials. Now, every team member can use this data in their ML experiments by simply referencing the datastore, saving time and reducing errors.
Use Case: A beginner IT analyst wants to build a machine learning model to forecast demand for products. They store historical sales records in Azure Data Lake. By creating a datastore linked to their Data Lake in Azure Machine Learning, they easily use these records in their models and automate the data flow when retraining, without worrying about maintaining connection details or merging files manually.
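As an illustration, this SDK v2 sketch registers an Azure Blob container as a credential-based datastore; the datastore name, storage account, container, and key are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureBlobDatastore, AccountKeyConfiguration

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>")

# Credential-based datastore; for identity-based access you can omit credentials
blob_store = AzureBlobDatastore(
    name="retail_purchase_data",
    description="Customer purchase data for predictive analytics",
    account_name="<storage-account-name>",
    container_name="<container-name>",
    credentials=AccountKeyConfiguration(account_key="<account-key>"),
)

ml_client.create_or_update(blob_store)
```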
For more information see these links:
- Manage Azure Machine Learning resources with the VS Code extension (preview)
- How Azure Machine Learning works: resources and assets
- Datastores interface-Method Details
- Connect to data with the Azure Machine Learning studio
- How to create data registry
Create and manage compute targets
- A compute target in Azure Machine Learning is any local or cloud-based resource where you can train, test, or deploy machine learning models. Common types include local machines, compute instances, compute clusters, and serverless compute.
- To create a compute target, you can use the Azure Machine Learning studio interface, Python SDK, Azure CLI, or VS Code extension. For example, you can set up a managed cluster that automatically scales up or down as jobs are submitted, helping optimize both performance and cost.
- Managing compute targets involves monitoring resource usage, scaling resources as needed, and choosing the right VM size for your workload. For clusters, it’s best practice to set the minimum nodes to zero to avoid unnecessary charges when idle.
- Compute targets can be customized with parameters such as VM size, maximum nodes, and access configuration, ensuring the environment fits your specific data science needs. Cloud resources like AmlCompute are ideal for large datasets and intensive compute jobs.
- You can view, update, or delete compute targets through the Azure Machine Learning studio under the “Manage > Compute” section, making it easy to keep track of all your available compute options.
Example: Imagine you’re building a predictive model to identify customer churn for your company’s web service. You start by developing your code on a local machine, but when you need more processing power to train a larger model, you create an Azure Compute Cluster using the studio with just a few clicks. As you submit the job, Azure automatically provisions the necessary resources, trains your model, and releases them when done.
Use Case: A new data analyst at a tech company wants to scale machine learning experiments from a laptop to a cloud environment. They set up a managed compute cluster in Azure, specify the VM size, and configure auto-scaling. This enables them to run complex training jobs efficiently and only pay for compute resources when active, allowing for both cost control and faster results on large datasets.
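A minimal SDK v2 sketch of the cluster described in the use case, with a placeholder VM size and a minimum of zero nodes so it scales down to no cost when idle:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>")

cluster = AmlCompute(
    name="train-cluster",
    size="STANDARD_DS3_V2",          # swap in a GPU SKU (e.g., an NC-series size) for deep learning
    min_instances=0,                 # scale to zero when no jobs are running
    max_instances=4,
    idle_time_before_scale_down=120, # seconds before idle nodes are released
)

ml_client.begin_create_or_update(cluster).result()
```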
For more information see these links:
- What is the Azure Machine Learning SDK v1 for Python?
- Manage compute resources for model training and deployment in studio
- az ml computetarget create computeinstance
- What are compute targets in Azure Machine Learning?
- az ml computetarget create amlcompute
Set up Git integration for source control
- Git integration allows you to track, manage, and collaborate on changes to code and resources in your Azure Machine Learning workspace. By connecting your workspace to a Git repository—such as Azure DevOps or GitHub—you can version your work and roll back to previous states if needed.
- Setting up Git integration involves associating your Azure ML workspace with a Git repository. This is typically done using the ‘Set up code repository’ option found in the workspace’s settings or management hub. You need sufficient permissions, like Azure Contributor or higher, to complete the configuration.
- Once Git integration is enabled, every change you make to code, pipeline definitions, or datasets can be committed, pushed, and synchronized with your repository. This makes it easier for teams to collaborate, avoid conflicts, and maintain a history of all modifications.
- In addition to backup and recovery benefits, Git source control supports branching and pull requests—core workflows in collaborative development. You can create branches for experimenting or new features and merge finalized changes back into the shared code base after review.
- For organizations with compliance or auditing needs, Git provides a transparent and secure way to track who made changes and why, making governance of machine learning projects easier.
Example: Imagine a team of data scientists working on an Azure Machine Learning project to predict customer churn. By setting up Git integration in their workspace, each team member can make improvements to the training pipeline, commit their changes to the shared Git repository, and easily merge contributions while maintaining a history of all updates. If a new feature causes an unexpected error, the team can quickly revert to a previously working version from the repository.
Use Case: A new Azure Data engineer is tasked with developing a customer segmentation model in Azure Machine Learning and must collaborate with another teammate. By configuring Git integration with Azure DevOps in their workspace, both engineers can work on separate branches, share code reviews through pull requests, and ensure all changes are safely versioned. This streamlines teamwork and prevents accidental overwrites or loss of work.
For more information see these links:
- Source control in Synapse Studio
- Source control with Warehouse (preview)
- Source control in Azure Data Factory
- Dataverse Git integration setup
- Tutorial: Work with Git in Visual Studio
Create and manage assets in an Azure Machine Learning workspace
Create and manage data assets
- Data assets in Azure Machine Learning are reusable references to data stored in locations like local files, Azure Storage, or public URLs, making it easier to organize and access data for machine learning projects.
- Azure Machine Learning supports three types of data assets: File (uri_file) for single files, Folder (uri_folder) for collections of files, and Table (mltable) for tabular data with complex schemas and versioning capabilities.
- When you create a data asset, you provide a name, source path, and asset type, and you can add tags (metadata) to help classify, search, or secure your data assets. Once created, you can manage assets by updating tags, tracking lineage, and auditing usage.
- Data assets support versioning and immutability, enabling reproducibility and change tracking across different machine learning experiments or jobs. This makes it easier to debug, revert to previous data versions, and trace why and how data was used.
- You can consume data assets in jobs or interactive sessions either by mounting (a live connection to the data) or downloading (creating a local copy), so you can pick the most efficient option for your scenario.
Example: Suppose you have a CSV file containing sales data stored on Azure Blob Storage. You can create a File type data asset in Azure Machine Learning, giving it a friendly name like ‘monthly_sales_data’, specify its location in Blob Storage, and tag it as ‘medallion:silver’ to mark it as validated data. This makes it reusable for various machine learning experiments or reports.
Use Case: A new data analyst joins an IT company and uses Azure Machine Learning Studio to register a folder of image data as a Folder type data asset. By tagging it with ‘project:customer_churn’, they can quickly access and update image assets for their customer churn prediction model, share the consistent data source with teammates, and maintain version control without needing to remember long storage paths.
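The following sketch registers the 'monthly_sales_data' File asset from the example above with the SDK v2; the blob path is a placeholder:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>")

sales_data = Data(
    name="monthly_sales_data",
    version="1",
    type=AssetTypes.URI_FILE,  # single file; use URI_FOLDER or MLTABLE for the other asset types
    path="https://<storage-account>.blob.core.windows.net/<container>/monthly_sales.csv",
    description="Validated monthly sales extract",
    tags={"medallion": "silver"},
)

ml_client.data.create_or_update(sales_data)
```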
For more information see these links:
- Create and manage data assets
Create and manage environments
- Environments in Azure Machine Learning are isolated spaces where you can manage resources, settings, and assets for different stages of your project, such as development, testing, or production. Creating separate environments helps prevent accidental changes to important data and ensures each stage uses the right resources.
- You can create and manage environments using the Azure Developer Portal or command-line tools like Azure Developer CLI. These tools allow you to define key information, such as the environment’s name, associated project, type (e.g., dev, test, prod), and what resources should be included.
- Managing environments includes editing (e.g., renaming or updating tags), monitoring (checking status and resources), switching between environments (with commands or portal dropdowns), and deleting environments when they are no longer needed. Organizing environments effectively makes it easier to collaborate with others and keep track of project assets.
Example: Imagine you’re building a data analysis project for a retail company. You create a ‘dev’ environment where you safely experiment with code and try new ideas. When you’re ready, you promote your work to a ‘test’ environment for validation, and finally to a ‘prod’ environment where the results are used in real business decisions.
Use Case: A new IT analyst at an online retailer uses Azure Machine Learning environments to separate their experiments from production workloads. They build and test a customer segmentation model in a development environment, ensuring mistakes won’t affect real customer data. Once the model is ready, they deploy it to the production environment for use in personalized marketing.
For more information see these links:
- Manage environments in Azure Deployment Environments
- Manage environments
- Work with Azure Developer CLI environments
- Quickstart: Create and access an environment in Azure Deployment Environments
Share assets across workspaces by using registries
- Azure Machine Learning registries enable you to store and share assets—such as datasets, models, components, and environments—across multiple workspaces within your organization, improving collaboration and reusability.
- By publishing assets to a central registry, teams working in different workspaces (even in different regions or subscriptions) can access and reuse shared resources without needing to recreate or manually transfer them.
- You can share assets like preprocessed datasets, trained models, or pipeline components by registering them in a workspace and then publishing or promoting them to a registry using Azure CLI or Python SDK commands.
- Registries are especially useful for sharing common, non-sensitive data or reusable components needed across multiple data science or development projects, but are not recommended for sensitive data requiring fine-grained access controls.
- When assets are shared via registries, you retain version control and can easily manage updates or track the lineage of assets as they move between development, test, and production environments.
Example: A data science team creates a cleaned and preprocessed public dataset for customer sales analysis in their development workspace. They register this dataset in an Azure Machine Learning registry, so other teams—such as the marketing or finance teams working in separate workspaces—can access and use the same up-to-date dataset for their machine learning experiments without duplicating effort.
Use Case: A company building machine learning models to predict equipment failures creates a set of reusable components—including data processing scripts, environment specifications, and trained models—in a central registry. Teams responsible for different factories, each with their own workspace, can then use these shared assets from the registry to quickly develop and deploy solutions tailored to their local needs, streamlining their workflow and ensuring consistency across teams.
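As a sketch of publishing the cleaned sales dataset from the example above to a registry, you can point an MLClient at the registry instead of a workspace; the registry name and data path are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Client scoped to a registry rather than a single workspace
registry_client = MLClient(credential=DefaultAzureCredential(), registry_name="<registry-name>")

shared_dataset = Data(
    name="customer_sales_clean",
    version="1",
    type=AssetTypes.URI_FOLDER,
    path="<path-to-prepared-data>",
    description="Cleaned customer sales dataset shared across workspaces",
)

# Any workspace in the organization can now reference this asset from the registry
registry_client.data.create_or_update(shared_dataset)
```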
For more information see these links:
- Share data across workspaces with registries
- Machine Learning registries for MLOps
- Share models, components, and environments across workspaces with registries
Explore data, and run experiments (20–25%)
Use automated machine learning to explore optimal models
Use automated machine learning for tabular data
- Automated machine learning (AutoML) simplifies the process of building machine learning models for tabular data, such as spreadsheets or database tables, by automating tasks like selecting algorithms, tuning parameters, and validating models.
- With Azure Machine Learning, you can easily set up AutoML training for tabular data using either the Python SDK or the CLI. This reduces the need for extensive coding knowledge and allows beginners to quickly start experimenting with their data.
- AutoML helps users focus on preparing quality data and defining the goal (such as predicting sales or classifying customer responses) while Azure automates the model selection, training, and evaluation to find optimal solutions.
- You only need to provide your data in a tabular format and specify the target column (the value you want to predict). Azure Machine Learning handles the rest—including trying multiple algorithms and measuring performance automatically.
- AutoML can be integrated into machine learning pipelines, so you can automate workflows from data preparation to model deployment, improving productivity and ensuring repeatable, scalable results.
Example: Suppose you work for an IT company and have a CSV file containing customer support tickets with columns like ‘issue_type’, ‘response_time’, and ‘resolved’. Using Azure AutoML, you can predict whether a new ticket will be resolved quickly by uploading your data, setting ‘resolved’ as the target column, and letting AutoML try different models to find the best way to make accurate predictions.
Use Case: An IT team new to Azure Data wants to predict which incoming support requests are likely to be urgent based on historical ticket data. Using Azure AutoML, they upload the data, configure a classification task to predict ticket urgency, and let AutoML find the optimal model. The team gains valuable insights with minimal machine learning expertise, enabling faster and smarter ticket prioritization.
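A compact SDK v2 sketch of the support-ticket scenario above: the training data is assumed to be registered as an MLTable asset, and the compute, experiment, and asset names are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, Input, automl
from azure.ai.ml.constants import AssetTypes

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>")

classification_job = automl.classification(
    compute="cpu-cluster",
    experiment_name="ticket-resolution-automl",
    training_data=Input(type=AssetTypes.MLTABLE, path="azureml:support_tickets_train:1"),
    target_column_name="resolved",     # the value you want to predict
    primary_metric="accuracy",
    n_cross_validations=5,
)
classification_job.set_limits(timeout_minutes=60, max_trials=20)

returned_job = ml_client.jobs.create_or_update(classification_job)
print(returned_job.studio_url)  # monitor the run in Azure ML studio
```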
For more information see these links:
- Set up AutoML training for tabular data with the Azure Machine Learning CLI and Python SDK
- What is automated machine learning (AutoML)?
Use automated machine learning for computer vision
- Automated Machine Learning (AutoML) in Azure enables you to train computer vision models—such as image classification and object detection—without needing to write complex code or manually tune every parameter. AutoML takes care of the heavy lifting by exploring different algorithms, feature transformations, and hyperparameters to find the most effective model for your image data.
- You can use AutoML for several computer vision tasks, including single- or multi-label image classification (identifying what is in an image), object detection (locating objects within images), and instance segmentation (mapping specific pixels to objects). These tasks are accessible via the Azure Machine Learning Studio UI, Python SDK, or CLI, making them approachable for newcomers.
- To get started, you simply provide labeled images and choose the right task type. AutoML will handle data preprocessing, training, evaluation, and even model explainability via the Responsible AI (RAI) dashboard. When your best model is ready, you can deploy it for real-world use or export it in formats like ONNX for predictions in various environments.
- The workflow is: upload your image data to Azure, configure an AutoML run (selecting the vision task), monitor the experiment through Azure ML Studio, and deploy or export the best performing model. This saves valuable time and eliminates the need for deep data science or machine learning expertise.
- AutoML for computer vision supports industry best practices, including Responsible AI integrations, to help you debug, explain, and trust your model’s predictions. Beginners can review vision insights and understand how their models make predictions, enhancing model transparency and reliability.
Example: Suppose you work for an IT services company supporting a retailer, and you want to automatically detect when certain products are missing from store shelves. You collect photos of shelves and label which products are present or missing. Using Azure AutoML’s object detection task, you upload your images, configure the experiment, and let AutoML train and select the most accurate detection model—without needing advanced machine learning knowledge.
Use Case: A new IT analyst at a company uses Azure AutoML to quickly build a model that scans incoming security camera images and flags unauthorized personnel in restricted areas. They simply collect and label example images, configure the AutoML object detection job in the Azure ML studio, and deploy the resulting model as an automated alert system, all without deep expertise in computer vision or coding.
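The shelf-monitoring example above could be configured as an AutoML image object detection job roughly as follows; the labeled-image MLTable assets, GPU compute name, and experiment name are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, Input, automl
from azure.ai.ml.constants import AssetTypes

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>")

image_job = automl.image_object_detection(
    compute="gpu-cluster",
    experiment_name="shelf-product-detection",
    training_data=Input(type=AssetTypes.MLTABLE, path="azureml:shelf_images_train:1"),
    validation_data=Input(type=AssetTypes.MLTABLE, path="azureml:shelf_images_val:1"),
    target_column_name="label",
    primary_metric="mean_average_precision",
)
image_job.set_limits(timeout_minutes=120, max_trials=10)

ml_client.jobs.create_or_update(image_job)
```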
For more information see these links:
- Set up AutoML to train computer vision models
- Generate Responsible AI vision insights with YAML and Python (preview)
- Tutorial: Train an object detection model with AutoML and Python
- Make predictions with ONNX on computer vision models from AutoML
Use automated machine learning for natural language processing
- Automated machine learning (AutoML) in Azure helps simplify and accelerate building models for natural language processing (NLP) tasks by automatically selecting, training, and tuning algorithms.
- AutoML supports popular NLP tasks such as multi-class and multi-label text classification, and named entity recognition (NER)—making it easier to process and analyze unstructured text data (like emails, support tickets, or documents).
- Azure Machine Learning AutoML integrates the latest deep neural networks, such as BERT, and provides options for distributed training using multi-GPU compute clusters for faster and scalable model development.
- Users can easily connect labeled data for training, use data labeling tools, and operationalize their models at scale with built-in MLOps and ML Pipelines, all from the Azure Machine Learning Studio UI.
- Beginners can start using AutoML for NLP tasks with zero to minimal coding using the Azure Machine Learning Studio UI and built-in Python SDK, making powerful text models accessible to anyone.
Example: An IT support team receives hundreds of support tickets daily. Using Azure AutoML for NLP, they can automatically categorize incoming tickets into types like ‘Password Reset’, ‘Software Installation’, or ‘Network Issues’, saving time and streamlining ticket triage.
Use Case: A new Azure Data analyst in IT uses AutoML to build a text classification model that identifies and tags the urgency level in support emails (like ‘Immediate’, ‘High’, ‘Normal’). By uploading labeled historical tickets to Azure Machine Learning Studio and leveraging AutoML, the analyst quickly generates a production-ready model to improve prioritization—no advanced coding required.
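For the ticket-urgency scenario, an AutoML text classification job might be set up like this sketch; the MLTable assets holding labeled ticket text and the 'urgency' column name are assumptions:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, Input, automl
from azure.ai.ml.constants import AssetTypes

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>")

text_job = automl.text_classification(
    compute="gpu-cluster",  # NLP tasks train on GPU compute
    experiment_name="ticket-urgency-nlp",
    training_data=Input(type=AssetTypes.MLTABLE, path="azureml:ticket_text_train:1"),
    validation_data=Input(type=AssetTypes.MLTABLE, path="azureml:ticket_text_val:1"),
    target_column_name="urgency",
    primary_metric="accuracy",
)
text_job.set_limits(timeout_minutes=120)

ml_client.jobs.create_or_update(text_job)
```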
For more information see these links:
- What is automated machine learning (AutoML)?
- Set up AutoML to train a natural language processing model
- Data featurization in automated machine learning (AutoML)
Select and understand training options, including preprocessing and algorithms
- Data preprocessing is a crucial first step when training machine learning models. This process involves cleaning the data, handling missing values, converting data types, normalizing numeric values, and extracting features. Proper preprocessing ensures the algorithms can interpret the data effectively and produce accurate models.
- Algorithm selection determines how the model learns from your data. In automated machine learning (AutoML), Azure automatically tries multiple algorithms (such as decision trees, logistic regression, or neural networks) and selects the best one based on your data and the problem you want to solve, whether that’s classification, regression, or another task.
- With Azure Automated ML, you can review and customize the training process. Azure provides generated training code, which details all preprocessing steps, the selected algorithm, and the hyperparameters. Beginners can analyze this code to understand what happened during training, track model development over time, or even adjust parameters for further experiments.
- It’s important to choose the right training method for your skill level and needs. Azure offers low-code solutions like the Designer for users without extensive coding experience and code-first solutions like the Python SDK for those comfortable with programming. Both methods support model training, but offer different levels of control and flexibility.
- Automated machine learning not only saves time by automating data processing and algorithm selection but also enables users with little or no data science background to build, evaluate, and deploy models efficiently.
Example: Suppose you have a dataset containing customer information for an IT helpdesk, including ticket details, customer descriptions, and resolution times. Using Azure AutoML, you upload the dataset and let the system automatically handle text preprocessing (like converting descriptions into numerical features), select and train several algorithms (such as logistic regression or support vector machines), and evaluate which model predicts issue resolution time most accurately.
Use Case: An IT administrator, new to Azure Data, wants to predict which helpdesk tickets are likely to become high priority. By using Azure Automated ML, they can simply upload their existing ticket data, let the platform preprocess the data and try multiple algorithms, and then use the recommended trained model to automatically flag potentially urgent tickets, improving response times without needing to write code.
For more information see these links:
- View training code for an Automated ML model
- Train models with Azure Machine Learning
- Train Keras models at scale with Azure Machine Learning
- How to choose an ML.NET algorithm
- Train and evaluate a model
Evaluate an automated machine learning run, including responsible AI guidelines
- Review performance metrics from automated machine learning runs: After Azure Automated ML creates several models, you can use evaluation charts such as confusion matrices for classification or residuals histograms for regression to see which model performs best on your data. Metrics like accuracy or mean squared error help you compare and select the optimal model.
- Assess your model using Responsible AI guidelines: Azure Machine Learning provides a Responsible AI dashboard that helps you understand model fairness, interpretability, data distributions, and error patterns. This dashboard shows whether your model treats different groups fairly, explains which features influence predictions, and highlights common errors or blind spots.
- Share and act on Responsible AI insights: You can generate a PDF scorecard from the Responsible AI dashboard to share results with technical and nontechnical stakeholders (like team members or regulators). This summary helps build trust, guides deployment decisions, and documents your model’s performance and fairness for future review.
Example: Imagine you built an automated ML model in Azure to predict employee attrition from IT department data. After the run, you use evaluation metrics to select the best model. Then, you open the Responsible AI dashboard, noticing that the model uses ‘salary’, ‘years at company’, and ‘department’ as key features. You see some fairness concerns, as the model is slightly less accurate for junior employees. You share these findings using the scorecard and decide to review the model before deploying it.
Use Case: A new Azure Data analyst in an IT company builds a model to automate software license renewal predictions. After the automated ML run, the analyst uses the Responsible AI dashboard to check model accuracy, review which factors most influence predictions (like device type and renewal history), and ensure the model does not unfairly disadvantage newer devices. They then download and present the scorecard to their manager, deciding together if the model can be safely used.
For more information see these links:
- Evaluate automated machine learning experiment results
- Assess AI systems by using the Responsible AI dashboard
- Share Responsible AI insights using the Responsible AI scorecard (preview)
Use notebooks for custom model training
Use the terminal to configure a compute instance
- Accessing the Terminal: To configure a compute instance in Azure Machine Learning, first access the built-in terminal. You can open the terminal directly from the Azure ML studio (Notebooks > Open terminal), Jupyter, VS Code, or other development environments connected to your workspace.
- Configuring Compute Resources: In the terminal, you can manage files, install required libraries using pip or conda, adjust environment settings, and monitor resources. For example, you might use pip install to add a needed Python package for your model training.
- Managing Storage and System Resources: With terminal access, you can keep track of disk usage, clear unnecessary files to free up space (important if you run out of disk space, as the OS disk is limited), or check running processes using basic Linux commands like df, rm, or top.
- Networking and Security Considerations: If your compute instance does not have a public IP (recommended for security), you can still access it securely via terminal through the Azure ML workspace or using az ml compute connect-ssh if needed.
- Actionable Configuration: Use the terminal to set up environment variables, configure scripts, or automate jobs to prepare your compute instance for custom model training. This direct control helps ensure that dependencies and settings match your project needs.
Example: Imagine you need the scikit-learn library for a custom training script, but it’s not pre-installed. By opening the terminal in Azure ML, you run ‘pip install scikit-learn’ to instantly make it available on your compute instance—ready for use in your notebook.
Use Case: A data analyst new to Azure is tasked with preparing a compute instance for a machine learning project. She opens the terminal in Azure ML studio, checks the disk space with ‘df -h’, removes unused files to free up storage, and installs required Python packages using pip—all directly from the terminal.
For more information see these links:
- Use managed compute in a managed virtual network
- Create an Azure Machine Learning compute instance
- az ml computetarget create computeinstance
- Manage an Azure Machine Learning compute instance
- Access a compute instance terminal in your workspace
Access and wrangle data in notebooks
- Accessing Data: In Azure Data notebooks, you can easily load data from sources like OneLake, Lakehouse, or external files (such as CSV, Parquet) using Python libraries like pandas or PySpark. This allows you to directly read, inspect, and manipulate data for your project.
- Wrangling Data: Data wrangling involves cleaning, transforming, and organizing raw data into a usable format. With tools like Data Wrangler in Microsoft Fabric, you can visually explore, clean, and transform your dataset using a user-friendly interface or Python code, helping prepare your data for analysis and model training.
- Automation and Reusability: As you perform data wrangling steps in the notebook, Data Wrangler can automatically generate Python code for each operation. You can save and reuse these code snippets to automate future data preparation tasks, making your workflow more efficient and repeatable.
- Interactive Exploration: Notebooks let you visualize the data, generate summary statistics, and interactively make decisions about how to handle missing values, duplicates, or outliers using built-in visualization libraries and the Copilot AI assistant.
- Real-Time Data Editing: You can launch Data Wrangler from within a notebook to edit pandas or Spark DataFrames in real time. Changes are reflected instantly, and you can sample your dataset to work with subsets for faster processing and testing.
Example: A beginner using Azure Data in IT wants to analyze a customer feedback CSV file stored in OneLake. They open a notebook, use pandas to read the file, and then launch Data Wrangler to remove duplicate responses, fix missing ratings, and filter for recent comments. The cleaned data is now ready for further analysis or model training.
Use Case: An IT analyst new to Azure Data needs to prepare user log data for anomaly detection. They access the raw logs in the Lakehouse, use Data Wrangler in a notebook to remove irrelevant columns, handle missing login times, and create new columns (like ‘session duration’). The notebook generates reusable code so future log datasets can be wrangled automatically, speeding up the data prep process for ongoing model training.
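The code that Data Wrangler generates is ordinary pandas; a hand-written equivalent of the feedback-cleaning example above might look like this sketch (file and column names are hypothetical):

```python
import pandas as pd

feedback = pd.read_csv("customer_feedback.csv", parse_dates=["submitted_at"])

# Remove duplicate responses, fill missing ratings, and keep only recent comments
feedback = feedback.drop_duplicates(subset=["response_id"])
feedback["rating"] = feedback["rating"].fillna(feedback["rating"].median())
recent = feedback[feedback["submitted_at"] >= "2024-01-01"]

print(recent.describe())
```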
For more information see these links:
- Use Python experience on Notebook
- Tutorial Part 2: Explore and visualize data using Microsoft Fabric notebooks
- What is Data Science in Microsoft Fabric?
- Interactive Data Wrangling with Apache Spark in Azure Machine Learning
- How to accelerate data prep with Data Wrangler in Microsoft Fabric
Wrangle data interactively with attached Synapse Spark pools and serverless Spark compute
- Interactive data wrangling using Spark pools: In Azure Machine Learning notebooks, you can clean, reshape, and explore data interactively using Apache Spark. Both attached Synapse Spark pools and serverless Spark compute let you process large datasets simultaneously across many machines, making your data preparation faster and more scalable.
- Choosing compute options: Use a serverless Spark compute for quick, managed, on-demand access to distributed data processing—no need to set up or manage infrastructure. Alternatively, select an attached Synapse Spark pool if your organization already uses Azure Synapse Analytics and you need advanced configuration or integration with other Synapse resources.
- Secure data access with Azure storage: Before wrangling data, ensure proper permissions (like Contributor and Storage Blob Data Contributor roles) and credential management (using Azure Key Vault secrets for service principal authentication). You can read data from Azure Data Lake, Blob Storage, or Azure Machine Learning Datastores using secure URIs and Spark code within your notebook.
- Live transformations and previews: By running Spark code in your notebook, you can interactively preview, clean, and transform your data—for example, handling missing values or filtering rows. Tools like pyspark.pandas allow for familiar, pandas-like syntax while leveraging Spark’s distributed compute power.
- Seamless integration with ML training: Once your data is wrangled and ready, you can use it directly in the same notebook for custom model training, ensuring a smooth workflow from raw data to machine learning model creation.
Example: Suppose you have a large CSV of IT system logs stored in Azure Data Lake. You connect to serverless Spark compute in your Azure Machine Learning notebook, load the logs into a Spark DataFrame using pyspark.pandas, filter for errors, fill in missing values, and summarize the number of errors by day—all by running code interactively in the notebook.
Use Case: An IT team new to Azure wants to build a machine learning model to predict system failures. They use an Azure Machine Learning notebook with serverless Spark compute to clean and join several large logs and monitoring files, transforming and aggregating the data interactively until it’s ready for model training—without worrying about managing Spark infrastructure.
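As a sketch of the log-wrangling example, pyspark.pandas gives a pandas-like API over Spark; the ADLS Gen2 path and column names are placeholders, and the notebook's Spark session is assumed to already have access to the storage account:

```python
import pyspark.pandas as ps

# Placeholder path to the IT system logs in Azure Data Lake Storage Gen2
logs = ps.read_csv("abfss://<container>@<account>.dfs.core.windows.net/logs/system_logs.csv")

# Keep error records, fill missing component names, and count errors per day
errors = logs[logs["level"] == "ERROR"]
errors = errors.fillna({"component": "unknown"})
errors_per_day = errors.groupby("date")["message"].count()

print(errors_per_day.head(10))
```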
For more information see these links:
- Interactive Data Wrangling with Apache Spark in Azure Machine Learning
- Apache Spark in Azure Machine Learning
Retrieve features from a feature store to train a model
- Feature stores help organize and reuse important data points (features) used to train machine learning models, making workflows more efficient and reliable.
- To train a model, you first select features stored in the feature store, either through the Azure ML UI or programmatically using the SDK. Each feature is typically linked using a primary key, like a customer ID.
- When you create your training dataset, you use the feature retrieval specification to define which features to load. This ensures that both training and future inference consistently use the same data structure.
- After retrieving features and training the model, the model retains references to the features. During inference (making predictions), it automatically fetches updated or matching feature values from the feature store using primary keys.
- Using a feature store reduces duplication and inconsistency in feature engineering and allows teams to collaborate more easily by sharing standardized features across projects.
Example: Imagine you’re working for an online retailer and want to predict which customers are likely to make a purchase next month. Instead of manually collecting diverse data points like account age, purchase count, and customer segment from different places, you retrieve these features directly from the feature store using their customer IDs. This ensures your model is trained on reliable, up-to-date data and you can reuse the same set of features whenever you retrain or update your prediction model.
Use Case: A new data scientist on an Azure-based team wants to build a fraud detection model for online transactions. They use a notebook to discover existing feature sets (like transaction history and account behavior) in the organization’s feature store, retrieve the relevant features using the SDK, and create a training dataset. Their model is trained on these features and, when deployed, automatically looks up the same features when scoring new transactions, ensuring consistency between training and inference.
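A heavily hedged sketch of feature retrieval, based on the pattern in the feature store tutorials: it assumes the azureml-featurestore package, a Spark session (available on Azure ML Spark compute), and hypothetical feature set names and feature URIs. Treat the FeatureStoreClient, resolve_feature_uri, and get_offline_features calls as assumptions to confirm against the current documentation:

```python
from azure.identity import DefaultAzureCredential
from azureml.featurestore import FeatureStoreClient, get_offline_features

# Connect to an existing feature store (all names are placeholders)
featurestore = FeatureStoreClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    name="<feature-store-name>",
)

# Observation data: the customer IDs and timestamps you want features for
observation_df = spark.read.parquet("<path-to-observation-data>")

# Resolve the features to retrieve, then join them onto the observation data by key
features = featurestore.resolve_feature_uri([
    "accounts:1:account_age",
    "transactions:1:purchase_count_30d",
])
training_df = get_offline_features(
    features=features,
    observation_data=observation_df,
    timestamp_column="timestamp",
)
```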
For more information see these links:
- Use features to train models
- Tutorial 2: Experiment and train models by using features
- Concepts
Track model training by using MLflow
- MLflow allows you to automatically track all aspects of your model training process in notebooks, including which datasets, parameters, and code versions you used, as well as the results (metrics and models) you achieved. This makes it easy to review and reproduce your work later.
- With MLflow tracking, every time you train a model, you create a ‘run’ that captures all details about that specific experiment. You can organize multiple runs under ‘experiments’, allowing you to easily compare different attempts and quickly identify the best performing models.
- MLflow lets you log and visualize evaluation metrics (like accuracy or loss), parameters (like learning rate), and artifacts (such as trained models or plots) in one central location. This provides clear insights into what works best and supports collaboration with your team.
- In Azure environments like Databricks or Microsoft Fabric, MLflow integrates seamlessly, so you can track, manage, and even deploy models directly from your notebook workflow, streamlining the end-to-end machine learning lifecycle.
- MLflow applies quota limits on parameters, tags, and runs, so as you log experiments, be mindful and periodically clean up unused runs or adjust your logging strategy to stay within these limits.
Example: Suppose you are experimenting with different machine learning algorithms to predict customer churn. Using MLflow in your Azure notebook, you can track which model type (like logistic regression or decision tree), parameter values, and data version you used for each run. In the MLflow UI, you can easily compare their accuracy scores to decide which approach is most effective.
Use Case: An IT analyst working for an Azure-based company uses MLflow in Databricks notebooks to track and document their model training experiments for a fraud detection system. They run several experiments varying data preprocessing steps and algorithms, logging each run with MLflow. Later, when management requests a final report, the analyst quickly generates a summary from MLflow’s experiment tracking, showing the parameter choices, model performance, and artifacts for each tested approach.
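A small, self-contained MLflow tracking sketch using synthetic data in place of the churn dataset; in an Azure Machine Learning or Databricks notebook, MLflow is already configured to log runs to the workspace:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer churn data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("customer-churn")

with mlflow.start_run():
    params = {"C": 0.5, "max_iter": 200}
    model = LogisticRegression(**params).fit(X_train, y_train)

    mlflow.log_params(params)                                  # hyperparameters for this run
    accuracy = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("accuracy", accuracy)                    # evaluation metric
    mlflow.sklearn.log_model(model, "model")                   # trained model artifact
```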
For more information see these links:
- Track model development using MLflow
- Track model training with MLflow in jobs - Training
- Track model training in Jupyter notebooks with MLflow - Training
- Train and track machine learning models with MLflow in Microsoft Fabric - Training
- Track Azure Databricks machine learning experiments with MLflow and Azure Machine Learning
Evaluate a model, including responsible AI guidelines
- Model evaluation involves checking how well your AI model performs, including accuracy, how errors are distributed, and how fair its predictions are across different groups. Azure Machine Learning notebooks can use built-in tools to help with this process.
- Responsible AI guidelines mean making sure your model is ethical, fair, transparent, and meets legal and organizational requirements. Azure Machine Learning provides dashboards and scorecards to help review and document these aspects.
- The Responsible AI dashboard in Azure offers key tools: data analysis, fairness checks, error analysis, and model interpretability. These let you inspect whether your model treats data groups fairly, why it makes certain predictions, and where it might make mistakes.
- Using the Responsible AI scorecard (preview), you can easily generate a sharable PDF report from dashboard insights. This helps communicate model health, fairness, and compliance to non-technical stakeholders, building trust and supporting audit processes.
- As a beginner, focus on exploring your trained model with the Responsible AI dashboard and documenting your findings with the scorecard. Share these insights with product managers and risk officers before deployment to ensure multi-stakeholder alignment.
Example: Suppose you train a machine learning model to predict which customers are likely to buy IT support packages. By running the Responsible AI dashboard in Azure, you discover the model works well overall, but tends to perform less accurately for customers from small businesses compared to enterprise clients. Using fairness analysis and error breakdown in the dashboard, you identify this gap and plan improvements. You create a Responsible AI scorecard to share what you learned with your manager and compliance team.
Use Case: A new Azure Data specialist in an IT company trains a custom machine learning model for predicting server failure. Before deployment, they use the Responsible AI dashboard to assess fairness (ensuring the model doesn’t disadvantage certain server types or departments), interpret key features used in prediction, and generate a Responsible AI scorecard to share with IT managers and auditors. This helps the team make informed decisions and comply with company policies.
For more information see these links:
- Share Responsible AI insights using the Responsible AI scorecard (preview)
- Assess AI systems by using the Responsible AI dashboard
- Plan for AI adoption
- Use Responsible AI scorecard (preview) in Azure Machine Learning
Automate hyperparameter tuning
Select a sampling method
- Understand the role of sampling in hyperparameter tuning: Sampling methods determine how values for each hyperparameter are chosen during automated tuning. The right sampling method can help you efficiently explore the space and improve your model’s performance.
- Explore available sampling methods in Azure Machine Learning: The main methods are random sampling, grid sampling, and Bayesian sampling. Random sampling picks values at random, grid sampling tests every possible combination, and Bayesian sampling learns from previous experiments to select the next values.
- Choose a method based on your resources and objectives: Grid sampling is best when you can afford to exhaustively test every combination, random sampling is simple and works well for large search spaces, and Bayesian sampling is ideal when you want to maximize performance with fewer experiments.
- Don't confuse hyperparameter sampling with telemetry sampling: Application Insights offers adaptive, fixed-rate, and ingestion sampling to control the volume of logs and telemetry data collected while your experiments run. That is a monitoring cost-control feature, separate from the sampling methods that drive hyperparameter tuning.
- Actionable tip: Start with random or grid sampling for simple experiments. Move to Bayesian sampling as you gain experience and need better performance without increasing resource use.
Example: Suppose you want to fine-tune the learning rate and number of layers of a neural network using Azure Machine Learning. You define a range of possible values for each parameter. With grid sampling, Azure will run experiments for every combination (e.g., all learning rates with all layer counts). With random sampling, it will pick random combinations within these ranges. Bayesian sampling will learn from earlier results to choose better combinations for the next rounds.
Use Case: As a new Azure Data user automating hyperparameter tuning for a machine learning model in Azure Machine Learning, you can select grid sampling to exhaustively test setups when you have a small number of hyperparameters, or use Bayesian sampling to efficiently find the best configuration when you want to reduce time and resource usage.
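A sketch of how the sampling method is chosen in an SDK v2 sweep job: a command job's inputs are replaced with a search space, and sampling_algorithm is set to "random", "grid", or "bayesian". The training script, environment, and compute names are placeholders:

```python
from azure.ai.ml import command
from azure.ai.ml.sweep import Choice, Uniform

# A training command whose inputs become tunable hyperparameters
job = command(
    code="./src",
    command="python train.py --learning_rate ${{inputs.learning_rate}} --num_layers ${{inputs.num_layers}}",
    inputs={"learning_rate": 0.01, "num_layers": 2},
    environment="azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # placeholder environment
    compute="cpu-cluster",
)

# Override the fixed inputs with a search space, then pick the sampling method
job_for_sweep = job(
    learning_rate=Uniform(min_value=0.001, max_value=0.1),
    num_layers=Choice(values=[2, 3, 4]),
)
sweep_job = job_for_sweep.sweep(
    sampling_algorithm="bayesian",   # or "random" / "grid"
    primary_metric="accuracy",
    goal="Maximize",
)
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4)
# Submit with ml_client.jobs.create_or_update(sweep_job)
```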
For more information see these links:
- Sampling in Application Insights
- Hyperparameter tuning a model (v2)
- Log sampling in .NET
Define the search space
- The search space defines all the possible values that hyperparameters can take during automated hyperparameter tuning. It acts like a map of all the options Azure Machine Learning (ML) can try as it tests different model settings.
- In Azure ML, you specify a search space using objects such as SearchSpace, listing out the hyperparameters (like learning rate, batch size, etc.) and the range or set of values each can take. This can include numeric ranges, categories, or even nested options for more complex configurations.
- A well-defined search space balances thoroughness (exploring enough options) and efficiency (not making the process too slow). A search space that’s too narrow may miss the best model settings, while one that’s too wide may take too long to explore.
Example: Suppose you are training an image classification model in Azure. You want to tune the batch size and learning rate. You define a search space where batch size can be 16, 32, or 64, and learning rate can be any value between 0.001 and 0.1. The AutoML service will then try different combinations within this space to find the best settings.
Use Case: An IT professional new to Azure Data is configuring hyperparameter tuning for a machine learning model that detects network intrusions. By defining a search space for parameters like decision tree depth (from 3 to 10) and the number of estimators (choices: 50, 100, 200), the user lets Azure ML automatically experiment and identify the most effective model setup for accurate threat detection.
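As an illustration (SDK v2, using the azure.ai.ml.sweep expression classes), the batch-size and learning-rate space from the example above might be written as:
```python
from azure.ai.ml.sweep import Choice, Uniform

# Discrete choices for batch size, a continuous range for learning rate.
search_space = {
    "batch_size": Choice(values=[16, 32, 64]),
    "learning_rate": Uniform(min_value=0.001, max_value=0.1),
}

# These expressions are bound to a command job's inputs before calling .sweep(),
# e.g. job_for_sweep = base_job(**search_space)
```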
For more information see these links:
- ImageClassification.SearchSpace Property-Definition
- SearchSpace Class-Definition
- TextNer.SearchSpace Property-Definition
- ImageObjectDetection.SearchSpace Property-Definition
- SearchSpace<T> Class-Definition
Define the primary metric
- The primary metric is the key measurement that determines how well a machine learning model is performing during automated hyperparameter tuning.
- When setting up automated tuning in Azure Machine Learning, you must choose a primary metric that matches your model’s objective, such as accuracy for classification or root mean squared error for regression.
- The primary metric guides the optimization process by telling Azure which value to maximize or minimize as it tests different hyperparameter combinations.
- Using the correct primary metric ensures your model training aligns with your business goals, such as predicting outcomes more accurately or minimizing errors.
Example: Suppose you’re creating a model to predict whether IT support tickets will be resolved within 24 hours. You choose ‘accuracy’ as your primary metric, so Azure will try different settings to maximize the number of correctly predicted cases.
Use Case: A new Azure Data user in IT wants to optimize a demand forecasting model for internal hardware requests. By setting ‘normalized root mean squared error’ (nRMSE) as the primary metric, Azure automatically tunes hyperparameters to minimize prediction errors and improve resource planning.
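For instance, the primary metric for an Automated ML regression job is set when the job is defined; a hedged SDK v2 sketch, with the training data asset and compute names as placeholders:
```python
from azure.ai.ml import Input, automl
from azure.ai.ml.constants import AssetTypes

training_data = Input(type=AssetTypes.MLTABLE, path="azureml:hardware-demand-train@latest")  # placeholder asset

regression_job = automl.regression(
    compute="cpu-cluster",
    experiment_name="hardware-demand-forecast",
    training_data=training_data,
    target_column_name="requests",
    # Azure minimizes this error metric while it tests different configurations.
    primary_metric="normalized_root_mean_squared_error",
)
```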
For more information see these links:
- MachineLearningObjective.PrimaryMetric Property-Definition
- MachineLearningForecasting.PrimaryMetric Property-Definition
- Objective.PrimaryMetric Property-Definition
- IObjective.PrimaryMetric Property-Definition
- Metric Class-Attributes
Define early termination options
- Early termination options in automated hyperparameter tuning allow you to stop training unpromising models before they finish. This helps save time and computing resources.
- These options work by monitoring performance metrics (like accuracy or loss) as your model trains. If a model is not improving or is performing worse than others, the system can automatically stop it.
- There are different strategies for early termination, such as Bandit Policy, Median Stopping, and Truncation Selection. Each uses different rules to decide when to stop an experiment early.
- Early termination helps you focus resources on promising model configurations, speeding up the overall tuning process and reducing costs, which is especially beneficial when working within limited budgets on cloud platforms like Azure.
- In Azure Machine Learning, you can enable early termination policies directly in your automated machine learning or hyperparameter tuning runs, making it easy even for beginners to use.
Example: Imagine you are testing 100 different configurations of a machine learning model on Azure. With early termination options turned on, the system checks how each model is performing after a few training rounds. If it sees that some models are far behind others—say, their accuracy is much lower—it will automatically stop those and focus computing resources on the ones showing better potential. This way, you don’t spend money or time on models that won’t likely succeed.
Use Case: A new Azure Data engineer is trying to find the best parameters for a customer churn prediction model. By enabling early termination in Azure ML automated hyperparameter tuning, they ensure that poor-performing models are stopped early. This speeds up the process and controls Azure compute costs, helping them deliver results faster without going over budget.
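The three strategies mentioned above exist as policy classes in the SDK v2 sweep module; a minimal sketch of constructing them, where sweep_job stands for a sweep job built as in the sampling example earlier:
```python
from azure.ai.ml.sweep import BanditPolicy, MedianStoppingPolicy, TruncationSelectionPolicy

bandit = BanditPolicy(
    slack_factor=0.1,        # stop runs whose metric trails the best run by more than 10%
    evaluation_interval=2,   # check after every 2 metric reports
    delay_evaluation=5,      # give every run 5 reports before policing begins
)
median = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)
truncation = TruncationSelectionPolicy(truncation_percentage=20, evaluation_interval=2)

# Attach exactly one policy to a sweep job built as in the sampling example:
# sweep_job.early_termination = bandit
```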
For more information see these links:
- PREVIEW TERMS
- Microsoft Publisher Agreement version: 2.0 May 2020
- Microsoft Publisher Agreement 8.0 July 2024 update
- Windows Analytics Agreement
- Microsoft Publisher Agreement 8.0 October 2021 update
Train and deploy models (25–30%)
Run model training scripts
Consume data in a job
- Data ingestion is the process of bringing data from various sources into your job’s environment so it can be processed or analyzed. In Azure, you can ingest data using batch (large sets of data processed at intervals) or stream (real-time events) methods.
- Jobs in Azure (such as Azure Databricks Jobs or Azure Spring Apps Jobs) consume ingested data to run model training scripts or data processing tasks. The data must be in a supported format (like CSV, JSON, or Parquet) and mapped correctly to be usable.
- The method of data consumption depends on how the data was ingested—batch jobs may read from files stored in Azure Data Lake or Blob Storage, while streaming jobs may connect to real-time event sources such as Event Hubs or IoT streams.
- Once data is available, your job (such as a machine learning training script) can access and load the data using built-in libraries or APIs (for example, pandas in Python for CSV, or Spark for Parquet). This step is crucial before transformation, analysis, or modeling.
- Monitoring and validating data consumption ensures your job gets the correct data in the right format, reducing the risk of errors in downstream processes like training or analytics.
Example: Imagine you are training a machine learning model to predict equipment failures in a manufacturing plant. First, you use an Azure Databricks Job to ingest sensor data stored as Parquet files in Azure Data Lake. When the job runs, it reads the Parquet files, loads the data into a Spark dataframe, and then uses that data for model training.
Use Case: A data engineer new to Azure needs to train a sales forecasting model. They set up a job in Azure Databricks that automatically ingests weekly sales data in CSV format from Azure Blob Storage. The job script loads the CSV into a dataframe, checks for data consistency, and uses it to train a regression model—fully automating the repeatable process.
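In Azure Machine Learning specifically, a job usually receives data as a declared input; a minimal SDK v2 sketch in which the data asset, script, environment, and compute names are placeholders. Inside train.py, the script simply opens the path it receives (for example with pandas.read_csv) before training.
```python
from azure.ai.ml import Input, command
from azure.ai.ml.constants import AssetTypes

job = command(
    code="./src",
    # The input is mounted (or downloaded) and its local path is substituted into the command.
    command="python train.py --training_data ${{inputs.training_data}}",
    inputs={"training_data": Input(type=AssetTypes.URI_FILE, path="azureml:weekly-sales-csv@latest")},
    environment="azureml:sklearn-train-env@latest",
    compute="cpu-cluster",
)
```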
For more information see these links:
- Introduction to ingestion
- Job in Azure Spring Apps (Preview)
- Azure Synapse Data Explorer data ingestion overview (Preview)
- Implement data processing and analysis workflows with Jobs
- Azure Data Explorer data ingestion overview
Configure compute for a job run
- Choose the Right Compute Target: In Azure Machine Learning, you can run your training job either on your local machine or in the cloud using compute resources such as CPU or GPU clusters. Selecting the right compute target depends on your dataset size, the complexity of your model, and the available resources. For large-scale jobs, it’s often best to use a cloud-based compute cluster.
- Configure Compute Settings: When setting up a job, you need to specify the compute target along with other settings like virtual machine size, node count, and idle time before scaling down. This ensures your job runs efficiently with the right balance between speed and cost.
- Use ScriptRunConfig for Training Jobs: To submit a model training script, use the ScriptRunConfig class, where you define the source code location, the script to run, the compute target, and the environment (such as Python libraries and dependencies). This keeps your job configuration organized and reusable.
- Manage Dependencies and Environment: The environment setting allows you to specify the exact software environment (for example, required Python packages) your script needs to run, so jobs are reproducible regardless of compute target.
- Switch Compute with Minimal Code Changes: Azure ML lets you change the compute target (from local to remote cluster, for example) by simply updating your ScriptRunConfig settings. This flexibility makes it easy to scale experiments without altering your core training scripts.
Example: Suppose you are developing a machine learning model on your laptop. As your dataset grows, the training becomes slow. You configure a new Azure ML compute cluster (for example, a six-node CPU cluster) and, using ScriptRunConfig, set the compute_target to this cluster. Now, the same script runs much faster in the cloud, freeing up your local machine for other work.
Use Case: An IT analyst at a company is tasked with building a demand forecasting model using Azure ML. Initially, they test their script locally for quick iterations. When satisfied, they configure the job to use a cloud-based compute cluster with multiple CPUs to train the final model on the full dataset, ensuring efficient and timely completion.
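A hedged SDK v1 sketch of the ScriptRunConfig pattern the bullets describe; the cluster name, script, and environment file are placeholders:
```python
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # reads the config.json downloaded from the workspace

env = Environment.from_conda_specification(name="train-env", file_path="environment.yml")

src = ScriptRunConfig(
    source_directory="./src",
    script="train.py",
    compute_target="cpu-cluster",   # switch to "local" or another cluster without touching train.py
    environment=env,
)

run = Experiment(workspace=ws, name="demand-forecast").submit(src)
run.wait_for_completion(show_output=True)
```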
For more information see these links:
- RunConfiguration Class-Remarks
- Tutorial: Forecast demand with no-code automated machine learning in the Azure Machine Learning studio
- Configure and submit training jobs
Configure an environment for a job run
- Define the runtime language and version: Before running a model training job, you need to select the appropriate scripting language (such as Python or PowerShell) and its version to match your code requirements. This ensures that your scripts execute correctly and use the necessary language features.
- Manage dependencies and packages: Specify all required packages and libraries your script depends on (for example, TensorFlow for Python or Az PowerShell for PowerShell). In Azure, you can add these either from approved repositories (like PyPI or PSGallery), upload your own files, or reference workspace or catalog volumes.
- Configure environment variables and resources: Set environment variables (such as API keys or configuration flags) and adjust resources like CPU and memory. These settings control how the job runs and can be customized at both the job and execution level for flexibility and security.
- Save and publish environment changes: When modifying the environment—such as updating libraries or runtime settings—it’s important to save your progress and then publish updates to apply them. This ensures consistency and stability in the job run environment across executions.
- Handle retries, timeouts, and execution parameters: Set parameters like maximum retries, timeouts, and arguments for job runs to manage error handling and performance. These configurations help ensure that failed jobs can retry automatically and that long-running jobs have appropriate limits.
Example: A data analyst wants to train a machine learning model using a Python script in Azure Databricks. They select Python 3.10 as their runtime environment, add the scikit-learn and pandas packages from PyPI, set an environment variable for their Azure Storage access key, allocate 2 CPUs and 8GB of memory for the job, and specify a 60-minute timeout to prevent endless runs.
Use Case: A beginner working with Azure Data needs to automate nightly training of a fraud detection model. By configuring the job run environment to include the latest Python version, essential packages like PySpark for distributed data processing, setting secure environment variables for database credentials, and enabling job retries, the team ensures reliable, secure, and repeatable model training without manual intervention.
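In Azure Machine Learning, the equivalent step is registering an environment that captures the base image and package dependencies; a minimal SDK v2 sketch, with the base image and conda file treated as placeholders:
```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

env = Environment(
    name="nightly-fraud-train",
    description="Python and package dependencies for the nightly training job",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",  # placeholder base image
    conda_file="./environments/conda.yml",                       # lists the Python version and packages
)
ml_client.environments.create_or_update(env)
```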
For more information see these links:
- Runtime environment in Azure Automation
- Job in Azure Spring Apps (Preview)
- Create, configure, and use an environment in Fabric
- Configure the serverless environment
Track model training with MLflow in a job run
- MLflow allows you to automatically capture key details of your model training process, such as the parameters, metrics, code version, and artifacts. This tracking ensures that every training job in Azure is logged as a distinct MLflow Run.
- You can initiate tracking in your training script by starting an MLflow run (using mlflow.start_run()), logging metrics and parameters, and ending the run. This can be done either interactively in notebooks or as part of a scheduled job run in Azure.
- By tracking training runs in MLflow, you can easily compare different experiments, reproduce results, and share insights with teammates. The MLflow UI lets you visualize and filter runs, inspect models, and understand how changes to your scripts affect performance.
Example: Imagine you are training a machine learning model to predict customer churn in Azure Databricks. By adding MLflow tracking to your training script, each time the script runs as a job, MLflow logs details like the algorithm used, training data, chosen hyperparameters, and metrics such as accuracy. Later, you can open the MLflow UI to compare runs and see which training configuration worked best.
Use Case: A new data professional at a company uses Azure Machine Learning to train several models overnight to predict IT ticket resolution times. By tracking these job-based training runs with MLflow, they can review which models performed best, share results with their manager, and have a clear record of past experiments, making optimization and reporting easier.
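A minimal sketch of the pattern described above; the metric and parameter names are illustrative, and when the script runs as an Azure ML or Databricks job the MLflow tracking URI is already configured for you:
```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():                       # each job run becomes one MLflow run
    mlflow.log_param("n_estimators", 100)
    model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, artifact_path="model")   # saved as a run artifact
```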
For more information see these links:
- MLflow 3 traditional ML workflow
- Track model development using MLflow
- Track model training with MLflow in jobs - Training
- Track experiments and models with MLflow
- MLflow and Azure Machine Learning
Define parameters for a job
- Job parameters are key-value pairs that define configurable settings for a job. These allow you to make your job flexible by specifying details such as input data paths, model types, or resource allocation. By parameterizing jobs, you can reuse the same job definition with different configurations without changing the code.
- Parameters can be set to static default values or dynamic values, which can reference other job attributes or runtime information. After setting default parameters, you can choose to override them when launching a job manually or through an API. This flexibility supports scenarios where you need to run the same training script on different datasets or with different settings.
- Job parameters are passed down to job tasks such as notebooks, Python scripts, or SQL queries. This automatic ‘pushdown’ ensures that all components of your workflow can access the necessary values. If there is a conflict between a job parameter and a task parameter with the same key, the job parameter takes precedence.
- You can configure job parameters via the Azure Databricks workspace UI, REST API, CLI, or with Databricks Asset Bundles (YAML or JSON). This allows for both interactive and automated management of job configurations, suitable for ad-hoc experiments or production pipelines.
- Dynamic value references (e.g. {{job.parameters.dataset_path}}) make it easy to reference parameter values throughout tasks, enabling advanced patterns such as conditional logic, for-each loops over datasets, or chaining outputs between tasks.
Example: Suppose you’re running a machine learning training job in Azure Databricks. Instead of hard-coding the input data path, you define a job parameter called ‘dataset_path’ with a default value of ‘/mnt/data/train.csv’. When you want to train your model with a new dataset, you simply input a different value for ‘dataset_path’ when launching the job, without needing to update your code or notebook.
Use Case: A data analyst new to Azure wants to use Databricks jobs to automate training of a customer churn prediction model. By defining job parameters for ‘dataset_path’, ‘model_type’, and ‘output_path’, the analyst can quickly retrain the model on new data or experiment with different algorithms by editing the parameters through the workspace UI, making training runs repeatable and scalable.
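For a Python script task, the pushed-down job parameters typically arrive as named arguments, so the script itself stays generic; a small illustrative sketch (notebook tasks would instead read them with dbutils.widgets.get):
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dataset_path", default="/mnt/data/train.csv")   # overridden per job run
parser.add_argument("--model_type", default="logistic_regression")
parser.add_argument("--output_path", default="/mnt/models/churn")
args = parser.parse_args()

print(f"Training a {args.model_type} model on {args.dataset_path}, saving to {args.output_path}")
# ...training code would go here...
```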
Run a script as a job
- A job in Azure is a way to automate the execution of scripts, such as Python files, notebooks, or other tasks, according to a schedule or trigger. Instead of running scripts manually every time, jobs help you run them automatically at specific times or in response to certain events.
- You can set up jobs using several methods: the Azure Databricks user interface (UI), the REST API, command-line tools like Azure CLI, or programmatically with SDKs. This flexibility allows you to choose the tool that works best for your workflow and experience level.
- When you run a script as a job, you can monitor its status, view its output, and manage its lifecycle. Jobs can also be configured to run multiple tasks, like data cleaning followed by model training, and can automatically end when all tasks are completed.
- Scheduling jobs is helpful for routine tasks, such as nightly data processing or weekly reporting. Azure lets you specify the frequency and timing, making sure your operations run reliably without manual intervention.
- Jobs also support parallel execution and scaling, meaning you can run many scripts or tasks at once across different compute nodes using services like Azure Batch or Databricks, speeding up processing and efficiency.
Example: Suppose you want to retrain your machine learning model every night using the latest sales data stored in Azure. Instead of manually running your training script at midnight, you create a job in Azure Databricks that automatically runs your Python script at 12:00 AM every day. You can monitor job completion and see the model’s updated results the next morning.
Use Case: An IT data analyst new to Azure Data needs to generate a weekly report summarizing log file statistics. By configuring a job that runs a Python script every Sunday evening, they ensure the report is ready for Monday morning review without needing to intervene manually each week.
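Submitting a script as an Azure Machine Learning job from Python looks roughly like this (SDK v2); the environment and compute names are placeholders, and a nightly schedule would be attached separately:
```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

job = command(
    code="./src",
    command="python retrain.py",
    environment="azureml:sklearn-train-env@latest",
    compute="cpu-cluster",
    display_name="nightly-retrain",
)

submitted = ml_client.jobs.create_or_update(job)   # submits the job and returns it
ml_client.jobs.stream(submitted.name)              # follow the logs until it finishes
```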
For more information see these links:
- Azure Databricks for Python developers
- CLI example: Run a job and tasks with Azure Batch
- Jobs and tasks in Azure Batch
- Create and run PowerShell scripts from the Configuration Manager console
Use logs to troubleshoot job run errors
- Logs are detailed records of what happens during a job run, capturing errors, warnings, and execution steps. Reviewing logs is the first step to understanding why a job failed or did not behave as expected.
- In Azure, logs can be accessed directly through portals or consoles, such as the Azure DevOps pipeline run summary or the Kudu console for WebJobs. You can view log files for specific tasks, steps, or jobs, or download all logs for deeper analysis.
- Different types of logs provide different insights. Task logs focus on individual script steps, agent/worker logs show how the environment was configured and executed, and error logs summarize what failed. Knowing where to look helps you quickly diagnose and fix issues.
- Common troubleshooting techniques include searching logs for explicit error messages, checking for missing files, permission errors, incorrect CRON expressions, and investigating configuration details. Often, logs will point to the line, command, or component that caused the error.
- For actionable troubleshooting, start by locating the relevant logs for your failed job, search for error keywords, and review details around the time of failure. Use log clues to correct script errors, misconfigurations, or permission settings, then rerun your job to test the solution.
Example: Suppose an Azure Data engineer runs a machine learning model training script using Azure Machine Learning and the job fails with no output. By opening the job’s logs through the Azure portal, they find a ‘Permission denied’ error in the stderr log. The log indicates the script lacked permission to read the training data. Adjusting the data access settings and verifying permissions in the workspace solves the problem on the next run.
Use Case: An entry-level data analyst uses Azure DevOps to automate data pipeline runs. When a scheduled training job fails overnight, they check the pipeline run summary, view the failed step’s logs, and discover that a missing ‘run.sh’ file caused the error. The analyst consults the logs, restores the file, and reruns the pipeline successfully—learning how log inspection quickly highlights root causes and guides fixes.
For more information see these links:
- Review logs to diagnose pipeline issues
- How WebJobs run in Azure App Service
- Troubleshoot a slow or failing job on a HDInsight cluster
- Troubleshoot pipeline runs
- Troubleshooting the ParallelRunStep
Implement training pipelines
Create custom components
- Custom components are reusable pieces of code or functionality that you create to extend the capabilities of your Azure Data solutions—these might automate a specific process, add a visual element, or connect to external systems.
- In Azure training pipelines, custom components allow you to tailor your workflow to your organization’s specific data needs, enhancing flexibility and reducing repetitive manual tasks.
- You can build custom components in tools like Visual Studio or Power Apps. Once built and added to your project, they often automatically appear in the application’s toolbox for easy reuse and drag-and-drop functionality.
- Custom components must implement certain interfaces and follow compatibility guidelines (for example, IComponent in .NET or proper packaging in Power Apps) so they work correctly within the Azure ecosystem.
- Testing and debugging your custom component before deploying it is essential to ensure it integrates smoothly into the broader data pipeline and performs as expected.
Example: A data engineer creates a custom Power Apps component that visualizes real-time sales data as an interactive dashboard. Once added to the organization’s Dataverse environment, this dashboard can be easily reused and customized in multiple apps—letting teams analyze data without rebuilding the visualization each time.
Use Case: An Azure Data team needs to preprocess data coming from various CSV files before feeding it into a machine learning pipeline. The team develops a custom pipeline component to automate tasks like cleaning and formatting the data. With this reusable component, the team increases productivity and consistency across multiple training projects.
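For Azure ML training pipelines specifically, a custom component is usually described in a YAML spec and then loaded and registered from Python; a hedged SDK v2 sketch in which the component file is hypothetical:
```python
from azure.ai.ml import MLClient, load_component
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# prep_data.yml is a hypothetical component spec declaring inputs, outputs, code, and command.
prep_data = load_component(source="./components/prep_data.yml")

# Registering the component makes it reusable and versioned across pipelines in the workspace.
ml_client.components.create_or_update(prep_data)
```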
For more information see these links:
- Walkthrough: Automatically Populating the Toolbox with Custom Components
- Developing Custom Pipeline Components
- Build a Power Apps component - Training
- Add code components to a custom page for your model-driven app
Create a pipeline
- A pipeline in Azure Data (such as Azure DevOps or Azure Data Factory) is a set of automated steps that move and process data or code from one stage to another—for example, from development, to testing, to production.
- Creating a pipeline typically starts by selecting a source for your data or code (like a GitHub repository), and then defining the tasks or operations that should happen automatically (such as building, testing, or training a model).
- You can use predefined templates (like for Python, .NET, or Java projects) to help quickly set up your pipeline, or customize it using a YAML file—making it flexible for different tasks or data workflows.
- Once created, pipelines can be monitored and edited in Azure to ensure they are running correctly and updated as your needs change.
- Pipelines enable automation, reduce manual errors, and help ensure your data or code is processed the same way every time, supporting reliable delivery and operations.
Example: Suppose you have a Python machine learning project stored in GitHub. In Azure DevOps, you set up a new pipeline by pointing it to your repository, selecting the recommended Python package template, and clicking ‘Save and run’. Azure Pipelines creates a YAML file that defines steps like installing dependencies and running tests, and these steps will run automatically every time you update your code.
Use Case: A new Azure Data engineer sets up a pipeline to automate the process of training and deploying a machine learning model. Whenever they push new code to their GitHub repository, the Azure pipeline automatically installs dependencies, runs training scripts, tests the results, and, if successful, deploys the trained model to production—ensuring a repeatable and error-free workflow.
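For model training, the same idea can be expressed in the Azure ML SDK v2 with the dsl.pipeline decorator; a hedged sketch in which the component files and their input/output names are hypothetical:
```python
from azure.ai.ml import Input, MLClient, dsl, load_component
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

prep = load_component(source="./components/prep_data.yml")      # hypothetical component specs
train = load_component(source="./components/train_model.yml")

@dsl.pipeline(compute="cpu-cluster", description="prepare data, then train")
def training_pipeline(raw_data):
    prep_step = prep(input_data=raw_data)
    train_step = train(training_data=prep_step.outputs.prepared_data)
    return {"trained_model": train_step.outputs.model_output}

pipeline_job = training_pipeline(raw_data=Input(type="uri_folder", path="azureml:raw-sales@latest"))
ml_client.jobs.create_or_update(pipeline_job, experiment_name="sales-training")
```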
For more information see these links:
- Create your first pipeline
Pass data between steps in a pipeline
- Pipelines in Azure Data services are made up of multiple steps (called activities), each performing a specific task such as ingestion, transformation, or loading data. Passing data between these steps ensures that outputs from one activity become the inputs for the next, creating a smooth flow of information.
- Data can be passed between steps in several ways, including using shared datasets (like tables or files), pipeline variables, and output settings (e.g., using OutputFileDatasetConfig in Azure Machine Learning or staging areas in dataflows), allowing for flexible and reusable pipeline structures.
- Using dataflows inside a pipeline lets you transform data and store the results in destinations such as a lakehouse or Blob Storage. Subsequent pipeline steps can then pick up these results as source data for further activities, ensuring modular and manageable workflows.
- Orchestrating step dependencies is important: you can control when one step runs based on the success or failure of a previous step, ensuring data is always in the right state before it is used. In graphical pipeline designers, this is often shown by connecting outputs (like ‘Succeeded’) to the next activity.
Example: Suppose you build a pipeline that first ingests customer order data from an OData source using a dataflow, writes the cleaned data to a Lakehouse, and then in the next pipeline step, copies that data from the Lakehouse to Azure Blob Storage as a CSV file. Here, the output of the dataflow (the transformed order table) is passed as the input for the copy activity.
Use Case: A new Azure Data engineer wants to automate the daily process of collecting, transforming, and archiving sales data. They use a data pipeline where the output from a transformation step (removing duplicates and standardizing fields) in Dataflow Gen2 becomes the input for an export step that saves standardized files for business reporting.
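In the Azure ML Python SDK v1, the OutputFileDatasetConfig object mentioned above is the hand-off mechanism: the first step writes to it and the next step consumes it as an input. A hedged sketch, with scripts and cluster name as placeholders:
```python
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Intermediate data written by the first step and read by the second.
prepared = OutputFileDatasetConfig(destination=(ws.get_default_datastore(), "prepared/{run-id}"))

prep_step = PythonScriptStep(
    name="prepare",
    script_name="prep.py",
    source_directory="./src",
    arguments=["--output_dir", prepared],
    compute_target="cpu-cluster",
)
train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    source_directory="./src",
    arguments=["--input_dir", prepared.as_input()],
    compute_target="cpu-cluster",
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
```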
For more information see these links:
- Use a dataflow in a pipeline
- Quickstart: Move and transform data with dataflows and data pipelines
- Moving data into and between machine learning pipeline steps (Python)
- Quickstart: Transform data using mapping data flows
Run and schedule a pipeline
- Pipelines can be run manually or triggered automatically using schedules and events. In Azure Data Factory and Azure DevOps, scheduled triggers let you run pipelines at set times, such as hourly, daily, weekly, or monthly.
- To schedule a pipeline, you specify the recurrence and the start and end dates (if needed). For instance, you can configure a pipeline to run every night at midnight to process daily data, or every 15 minutes for more frequent tasks.
- Event-based triggers allow pipelines to run when something specific happens, such as code being pushed to a branch or a pull request being created. This is useful for automating testing and validation, as well as for continuous integration.
- You can combine scheduled triggers and event-based triggers. For example, use event-based triggers for immediate validation when changes are made, and scheduled triggers for regular batch processing or maintenance.
- Once the trigger is configured, you need to publish your pipeline and trigger settings to make them active. Remember that each pipeline run may have associated costs, so plan your schedules to avoid unnecessary runs, especially during testing.
Example: An Azure Data team sets up a pipeline in Azure Data Factory to import daily sales data from a database. They schedule the pipeline to run every day at 2:00 AM, using the ‘Daily’ recurrence option. The pipeline fetches the latest records, processes them, and stores the results in a data lake for reporting.
Use Case: A beginner-level IT professional new to Azure Data configures a pipeline in Azure DevOps to automate the nightly build and test of a business intelligence solution. They use a scheduled trigger to run the pipeline every night, ensuring that any updates made during the workday are validated and ready for use the next morning.
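A minimal SDK v2 sketch of a recurrence schedule (the 2:00 AM daily case from the example); the job definition, environment, and names are placeholders:
```python
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import JobSchedule, RecurrencePattern, RecurrenceTrigger
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

nightly_job = command(
    code="./src",
    command="python import_sales.py",
    environment="azureml:etl-env@latest",   # placeholder environment
    compute="cpu-cluster",
)

schedule = JobSchedule(
    name="daily-sales-import",
    trigger=RecurrenceTrigger(
        frequency="day",
        interval=1,
        schedule=RecurrencePattern(hours=2, minutes=0),   # 02:00 every day
    ),
    create_job=nightly_job,
)
ml_client.schedules.begin_create_or_update(schedule).result()
```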
For more information see these links:
- Configure schedules for pipelines
- Create a trigger that runs a pipeline on a schedule
- Specify events that trigger pipelines
Monitor and troubleshoot pipeline runs
- Monitor pipeline runs regularly using tools like Azure Synapse Studio or Azure DevOps. This allows you to check the status of each run, see which activities succeeded or failed, and quickly identify issues.
- Use diagnostic logs and error analysis to troubleshoot failed pipeline runs. View detailed logs for each pipeline task, use the ‘Find’ feature to search for error messages, and enable verbose logging to gather more technical details if the cause isn’t obvious.
- Set up notifications and task insights to stay informed about failures. Azure DevOps can send automatic alerts when a pipeline run fails and provide pop-up insights on common causes, helping you respond quickly and resolve issues efficiently.
- Learn common causes of pipeline failures such as resource permissions, time-outs, or code download errors. Most issues can be solved by reviewing logs, checking resource access, and following step-by-step troubleshooting guides provided within the platform.
- Apply filters and review historical runs to spot patterns in failures or delays. Use built-in filters in Synapse Studio to focus on pipelines that need attention and leverage Azure Monitor for long-term diagnostics.
Example: Imagine that an Azure Data Engineer schedules a nightly pipeline to move sales data from a cloud database to a reporting system. One morning, the report is missing new data. The engineer opens Synapse Studio, goes to the Monitor hub, and sees that last night’s pipeline run failed. Clicking on the run, they view the error log and find a permission issue with a linked resource. They update the resource permissions, rerun the pipeline, and the data is successfully delivered.
Use Case: A beginner data professional in IT uses Azure Synapse Studio to monitor daily ETL pipelines. When a pipeline transferring log data fails, they follow a simple process: check pipeline run status, review the error logs, learn that the issue is a missing file, and fix it by correcting the input data path. Their report is now complete and accurate, showing quick troubleshooting in action.
For more information see these links:
- Troubleshoot pipeline runs
- Use Synapse Studio to monitor your workspace pipeline runs
- Review logs to diagnose pipeline issues
Manage models
Define the signature in the MLmodel file
- The signature in the MLmodel file defines the expected input and output data types for an MLflow model, acting as a data contract between the model and any system that uses it. This ensures consistency and helps prevent data compatibility errors.
- Signatures can be column-based (for tabular data, e.g., pandas DataFrames) or tensor-based (for image or array data, e.g., numpy ndarrays). The signature specifies details like data shapes and types, which are crucial for proper model deployment.
- Including a signature in the MLmodel file enables Azure Machine Learning to automatically enforce input data type and shape checks at deployment. This improves reliability by catching mistakes early if your input data doesn’t match what the model expects.
- You can inspect or manually define the signature in the MLmodel file to control what your model will accept. When you use MLflow autologging, it attempts to infer the signature automatically, but manual specification is available for custom cases.
- In practice, having a well-defined signature simplifies no-code deployments in Azure, as the platform can automatically create a scoring script and environment, provided the inputs and outputs are clearly described in the MLmodel.
Example: Imagine you have a machine learning model that classifies whether an image contains a cat or not. In the MLmodel file, the signature might specify that the input should be a tensor with shape [-1, 224, 224, 3] (meaning batches of RGB images) and the output should be a tensor with shape [-1, 2] (representing probabilities for ‘cat’ and ‘not cat’). This setup ensures that batches of correctly sized images are sent to the model and helps avoid runtime errors due to unexpected data formats.
Use Case: In Azure Data projects, a data engineer deploys a batch image classification model using MLflow. By specifying a signature in the MLmodel file, the engineer ensures only properly shaped image batches are accepted by the model endpoint. This prevents errors during data processing and makes deployment via Azure Machine Learning Studio smoother, saving time and reducing troubleshooting for new users.
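A common way to get a signature into the MLmodel file is to infer it from sample data when logging the model; a minimal MLflow sketch with toy tabular data:
```python
import mlflow
import pandas as pd
from mlflow.models.signature import infer_signature
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"age": [25, 47, 52, 33], "tenure_months": [3, 24, 60, 12]})
y = [0, 1, 1, 0]
model = LogisticRegression().fit(X, y)

# Infer the column-based input/output schema from sample data and predictions.
signature = infer_signature(X, model.predict(X))

with mlflow.start_run():
    mlflow.sklearn.log_model(model, artifact_path="model", signature=signature)
```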
For more information see these links:
- Artifacts and models in MLflow
- Guidelines for deploying MLflow models
- Image processing with batch model deployments
- Deploy MLflow models in batch deployments in Azure Machine Learning
Package a feature retrieval specification with the model artifact
- A feature retrieval specification is a YAML file that lists all the features (data columns) your model needs, and must be included with your model artifact when using Azure Machine Learning feature stores. It tells the system exactly which features to retrieve for training and prediction.
- Packaging the feature retrieval specification with your model artifact ensures that feature lineage is tracked in Azure ML. This allows you to see which features were used to train a model, helping with auditing, troubleshooting, and reproducibility.
- At inference (prediction) time, the scoring code uses the feature retrieval specification to fetch the correct feature values from the online feature store. The scoring script looks for ‘feature_retrieval_spec.yaml’ in the model artifact root folder—using the wrong file name or path can cause errors.
- When training a model, you should either let the training pipeline handle this packaging step automatically, or copy the ‘feature_retrieval_spec.yaml’ yourself into the model output folder during the training job. This process is essential for the model to work correctly in production.
- If you use custom training logic, ensure the feature retrieval specification is passed and copied into the model artifact folder before registering the model in Azure ML. Otherwise, deployment and inference can fail due to missing or misaligned feature data.
Example: Suppose you are building a fraud detection model using Azure Machine Learning. You select features like ‘transaction amount’ and ‘account age’ from different feature sets in the feature store. After training the model, you package the ‘feature_retrieval_spec.yaml’ file (created during data preparation) together with your model (like ‘model.pkl’) in the model artifact’s root folder. This way, when you deploy the model, Azure ML knows what features to fetch for predictions.
Use Case: A new Azure Data engineer is tasked with developing and deploying a customer churn prediction model for a telecommunications company. By packaging the feature retrieval specification with the model artifact, the engineer ensures that all necessary features are correctly retrieved during both training and real-time inference, maintaining consistency between training data and live predictions and helping with auditing feature usage.
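When you handle this packaging yourself in a custom training script, the key step is simply copying the specification next to the model files before registering the artifact; a minimal sketch with hypothetical paths:
```python
import os
import shutil

model_dir = "./outputs/model"                # folder that will be registered as the model artifact
spec_path = "./feature_retrieval_spec.yaml"  # produced during data preparation

os.makedirs(model_dir, exist_ok=True)
# The scoring code expects this exact file name in the artifact's root folder.
shutil.copy(spec_path, os.path.join(model_dir, "feature_retrieval_spec.yaml"))
```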
For more information see these links:
- Feature retrieval specification and usage in training and inference
- Tutorial 2: Experiment and train models by using features
- Troubleshooting managed feature store
Register an MLflow model
- MLflow model registration is the process of saving a trained machine learning model into a centralized registry within Azure Machine Learning or Azure Databricks. This helps you manage and track different models and their versions easily.
- You can register models either directly from an experiment run or from model artifacts saved on your local filesystem. Registration preserves important details such as experiment lineage and metadata, which helps in auditing and troubleshooting.
- The MLflow Model Registry provides a user-friendly UI and APIs for tracking models, adding descriptions, managing model versions, and moving models through deployment stages such as Staging and Production.
- Once registered, models can be referenced by their name and version, loaded for predictions, or promoted to different stages, enabling streamlined collaboration and automation in IT projects.
- Registering a model in MLflow makes it discoverable and reusable across your workspace, ensuring standardization and reducing duplication.
Example: Suppose you train a machine learning model to predict server CPU usage in Azure Databricks. After training, you can register this model using MLflow by clicking ‘Register Model’ in the UI or by running a command such as mlflow.register_model(‘runs:/run_id/model’, ‘cpu-usage-predictor’). The model is now stored in the registry, along with its version history, making it easy for your team to track, update, and use.
Use Case: An IT company wants to automate resource allocation in Azure. A data analyst trains a model to forecast demand spikes for virtual machines. By registering the model in MLflow’s registry, both the analyst and operation teams can seamlessly collaborate: the model’s versions are tracked, descriptions document changes, and the latest approved version can be loaded automatically for integration into resource management scripts.
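The registration call itself is a one-liner once you know the run that logged the model; a minimal MLflow sketch with a placeholder run ID:
```python
import mlflow

run_id = "<run_id>"   # placeholder: the MLflow run that logged the artifact named "model"

# Creates the registered model (or a new version of it) in the workspace registry.
model_version = mlflow.register_model(f"runs:/{run_id}/model", "cpu-usage-predictor")
print(model_version.name, model_version.version)
```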
For more information see these links:
- Manage models registry in Azure Machine Learning with MLflow
- Log, load, and register MLflow models
- Workspace Model Registry example
Assess a model by using responsible AI principles
- Understand and apply Responsible AI principles: When assessing a model, check that it follows Microsoft’s six Responsible AI principles—fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. These principles help ensure that AI systems act ethically and are trustworthy.
- Evaluate fairness using tools: Use Azure Machine Learning’s Responsible AI dashboard to assess whether your model treats all user groups fairly and avoids unintended bias. Analyze model predictions for different groups and validate that no group is disadvantaged.
- Check transparency and interpretability: Use interpretability tools in Azure to explain how your model makes decisions. This allows you, your team, and stakeholders to understand what features the model uses and why certain predictions are made.
- Monitor reliability, privacy, and ongoing compliance: Set up continuous monitoring to catch model errors, performance drops, or data drift over time. Make sure your model protects user data and follows all privacy and security requirements.
- Establish clear accountability and governance: Assign responsibilities for monitoring, updating, and auditing the model. Documentation and regular audits help you maintain control and act quickly if issues are identified.
Example: Suppose you build an AI model in Azure that predicts which IT helpdesk tickets need urgent attention. By using the Responsible AI dashboard, you discover that the model is prioritizing tickets mostly from one department, so you analyze and update the training data and model to ensure fair treatment of all departments.
Use Case: An IT team new to Azure Data uses the Responsible AI dashboard to audit a model that assigns severity levels to incoming system alerts. They check for unfair bias (such as alert types or user groups being under-prioritized), use interpretability features to explain decisions to managers, and implement regular reviews and documentation to stay compliant with privacy and ethical standards.
For more information see these links:
- What is Responsible AI?
- Responsible AI considerations for intelligent application workloads
- Examine how Microsoft is committed to Responsible AI - Training
- Manage AI
- Assess AI systems by using the Responsible AI dashboard
Deploy a model
Configure settings for online deployment
- Choose the right deployment configuration: When deploying a model online, it’s essential to select settings that match your target environment. This includes defining which build configuration (like Release or Debug), .NET framework version, and deployment mode (framework-dependent or self-contained) your application will use. These settings ensure that your app runs smoothly on the intended platform.
- Set up environment-specific variables and connections: Use configuration files—often in JSON format—to automate environment-specific settings such as database connections, environment variables, and permissions. These files allow you to tailor your deployment for different stages (development, testing, production) without manual intervention.
- Customize user experiences and permissions: In tools like the ALM Accelerator for Power Platform, deployment user settings let you personalize UI features based on user expertise. For example, you can enable or disable advanced functionality for users who are new to application lifecycle management or grant access to only required features.
- Manage deployment slot settings: If you are deploying to Azure (such as Azure Functions), you can use deployment slots to separate environments (like staging and production). Some settings—like publishing endpoints, custom domain names, and diagnostics—can be made ‘sticky’ so they remain with a specific slot, ensuring consistent behavior when swapping slots.
- Validate and update settings throughout the deployment pipeline: Before deploying, make sure all required variables, credentials, and configurations are properly set in your deployment files and pipeline variables. Regularly reviewing these helps prevent deployment errors and ensures your app operates as expected in its target environment.
Example: Suppose you’re deploying a web application using Azure Web Apps. You create a publish profile in Visual Studio, specify the Release build configuration, set the target .NET framework version to .NET 6, and use a self-contained deployment to package all necessary runtime files. You then use a JSON configuration file to set environment-specific connection strings for your production database and configure user access based on group membership.
Use Case: A beginner on Azure Data wants to deploy a custom Power Platform solution for their IT department. They use the ALM Accelerator to automatically create deployment user profiles—giving advanced features only to experienced users. They configure JSON files to set secure environment variables and share the app with designated Microsoft Entra groups. This allows the solution to be securely deployed to production while providing a streamlined experience for novice team members.
For more information see these links:
- Configure deployment user settings
- Manage web deployment settings
- Deployment configuration guide
- Deployment configuration guide
- Azure Functions deployment slots
Deploy a model to an online endpoint
- An online endpoint in Azure Machine Learning lets you deploy your trained model as a web service, so others can send data and receive predictions in real time using HTTP requests.
- Deploying to an online endpoint involves selecting your model, preparing a scoring script (which tells Azure how to use your model to make predictions), picking compute resources (like CPU or GPU), and setting up authentication to control access.
- Online endpoints support quick, low-latency predictions, making them ideal for scenarios where users or systems need instant responses, such as detecting spam in emails or recommending products on a website.
- You can manage and monitor your online endpoint in Azure, including scaling it up or down based on demand, updating the model, or routing requests between different model versions.
- Deployment can be done easily using Azure Machine Learning Studio (a web interface), the Azure CLI, or the Python SDK, making it accessible for users with varying levels of technical expertise.
Example: Imagine you’ve trained a machine learning model that can classify whether support tickets are ‘urgent’ or ‘normal’. By deploying this model to an online endpoint, your company’s ticketing system can send each incoming ticket to the endpoint and instantly receive back its urgency classification, allowing customer service teams to prioritize responses automatically.
Use Case: A small IT consultancy uses Azure Machine Learning online endpoints to deploy a real-time predictive maintenance model. The model analyzes live data from customer equipment and triggers alerts if potential failures are detected. The online endpoint ensures that field engineers receive immediate notifications, helping reduce downtime and improve customer service.
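A hedged SDK v2 sketch of the endpoint-plus-deployment pattern; the endpoint name, registered model, and VM size are placeholders (an MLflow model needs no explicit scoring script):
```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment, ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

endpoint = ManagedOnlineEndpoint(name="ticket-urgency", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="ticket-urgency",
    model="azureml:ticket-urgency-model@latest",   # placeholder registered model
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Route all traffic to the new deployment.
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```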
For more information see these links:
- Online endpoint deployment for real-time inferencing
- Deploy models from HuggingFace hub to Azure Machine Learning online endpoints for real-time inference
- Deploy and score a machine learning model by using an online endpoint
Test an online deployed service
- Confirm Service Availability: After deploying your model as an online service (such as an Azure web service), it’s important to ensure that the service is running and accessible. This can be done by sending a simple test request, such as an HTTP GET or POST, and verifying that you receive a valid response.
- Test with Real or Sample Data: To ensure the model works as expected, send a request with actual or sample input data and check the returned output. This step verifies both connectivity and that the model inference operates as intended.
- Use Automated Tools: Tools such as Postman, soapUI, or even scripts using Python’s requests package can automate and repeat the testing process, making it easier to catch errors quickly and consistently. This is especially useful after each deployment or update.
- Monitor Logs and Metrics: Check audit logs and monitoring dashboards (like Azure Monitor or Service Health Dashboard) during and after testing. These provide details about errors, usage, and any abnormal activity, helping you troubleshoot or further improve reliability.
- Integration and End-to-End Testing: In complex scenarios, especially when your model is part of a larger application or microservices architecture, run integration or end-to-end tests to confirm that all components interact correctly with the deployed model service.
Example: After deploying a trained image classification model to Azure as a web service, you use Postman to send a POST request with a sample image. The service responds with a prediction label (e.g., ‘cat’ or ‘dog’). You check the response to make sure it matches your expectations, confirming that both the deployment and the model work as intended.
Use Case: An IT analyst new to Azure Data platforms wants to validate that their deployed customer sentiment analysis model is available 24/7 and responds correctly to incoming REST API calls. They use automated scripts to send sample text data throughout the day, relying on Azure Monitor to alert them if any requests fail or if there’s unexpected downtime.
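A small illustrative test script using Python's requests package; the scoring URI, key, and payload shape are placeholders that depend on your endpoint and model:
```python
import json

import requests

scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"   # from the endpoint's details page
api_key = "<endpoint-key>"                                                      # from the endpoint's keys

payload = {"input_data": {"columns": ["ticket_text"], "index": [0], "data": [["Server room AC has failed"]]}}
headers = {"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"}

response = requests.post(scoring_uri, headers=headers, data=json.dumps(payload))
response.raise_for_status()          # fails loudly if the service is down or rejects the call
print(response.json())               # e.g. a predicted label or score
```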
For more information see these links:
- Audit logging and monitoring overview
- Testing ASP.NET Core services and web apps
- Create a web service test
- Microsoft Dynamics 365 Monitoring Service
- Tools for Testing
Configure compute for a batch deployment
- Compute clusters are essential for batch deployments in Azure Machine Learning, providing the hardware resources that run your model or pipeline steps. Before deploying, you must ensure a suitable compute cluster exists in your workspace.
- You can create a new compute cluster (like ‘batch-cluster’) or use an existing one. The cluster’s configuration, such as size, type (CPU or GPU), and scaling limits, determines how many jobs can run in parallel and how quickly they’re processed.
- When configuring your batch deployment, specify the compute cluster in the deployment settings (such as in the deployment YAML or Python SDK). This makes sure your deployment knows where to run.
- Multiple batch deployments can share the same cluster, allowing you to efficiently use resources. You should choose cluster settings that balance cost and required performance for your workloads.
- After deployment, the compute’s behavior (like scaling up or down) will affect processing speed and cost; monitor and adjust these settings as needed via the Azure portal or CLI.
Example: Suppose you want to analyze thousands of customer images for a marketing campaign using a pre-trained machine learning model. You create an Azure Machine Learning compute cluster called ‘batch-cluster’ with 0 minimum and 5 maximum instances. Then, you define your batch deployment to use this cluster, letting Azure automatically scale resources depending on how many images need processing.
Use Case: An IT team at a company new to Azure Data wants to batch process large CSV files with customer transaction data to predict purchasing trends each month. They configure a compute cluster suitable for handling the files, deploy their ML pipeline for inference, and set it to scale automatically so jobs run efficiently and cost-effectively.
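A minimal SDK v2 sketch of creating the 'batch-cluster' from the example; the VM size is a placeholder:
```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

cluster = AmlCompute(
    name="batch-cluster",
    size="STANDARD_DS3_V2",              # placeholder VM size
    min_instances=0,                     # scale to zero when idle to save cost
    max_instances=5,                     # up to 5 nodes for parallel batch work
    idle_time_before_scale_down=120,     # seconds of idleness before releasing nodes
)
ml_client.compute.begin_create_or_update(cluster).result()
```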
For more information see these links:
- How to deploy pipelines with batch endpoints
- How to deploy a pipeline to perform batch scoring with preprocessing
- Deploy MLflow models in batch deployments in Azure Machine Learning
- Deploy models for scoring in batch endpoints
- Image processing with batch model deployments
Deploy a model to a batch endpoint
- Batch endpoints in Azure Machine Learning are designed for running machine learning models on large datasets in a non-real-time manner, meaning you can process thousands or millions of records at once instead of responding to individual requests.
- To deploy a model to a batch endpoint, you need a registered model in your Azure workspace, available compute resources (like a cluster), and optionally a scoring script and environment file—MLflow models simplify this by auto-generating the scoring script and environment.
- The deployment process involves creating an endpoint (with a unique name), then defining and deploying your model to that endpoint. Once deployed, you can submit batch inference jobs by providing input data, and Azure will handle scaling and orchestration across the compute cluster.
- Batch deployments are ideal for scenarios where you have large volumes of data, do not require immediate results, and can take advantage of running tasks in parallel to save time and resources.
- You can easily customize batch deployments by providing your own scoring script or environment definition, especially if your data requires special pre- or post-processing, or if your model has specific runtime needs not covered by default MLflow support.
Example: Imagine you have trained a machine learning model that predicts the risk of heart disease based on patient medical records. With thousands of new patient records generated each day, you want to predict risk scores for all of them overnight. You deploy your model to a batch endpoint in Azure Machine Learning and run batch inference on all new data files each night, saving results to a database for further review.
Use Case: A healthcare provider stores large volumes of patient test data in Azure. Each week, the IT team uses a batch endpoint to analyze all new patient records with a registered MLflow model, generating automated risk scores that assist doctors in flagging potential heart conditions without manual review of every individual record.
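A hedged SDK v2 sketch of creating the endpoint and attaching a deployment; the endpoint name, model, and cluster are placeholders:
```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import BatchDeployment, BatchEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

endpoint = BatchEndpoint(name="heart-risk-batch", description="Nightly heart-disease risk scoring")
ml_client.batch_endpoints.begin_create_or_update(endpoint).result()

deployment = BatchDeployment(
    name="default",
    endpoint_name="heart-risk-batch",
    model="azureml:heart-risk-model@latest",   # placeholder registered MLflow model
    compute="batch-cluster",
    instance_count=2,                          # nodes used per scoring job
    mini_batch_size=10,                        # files handed to each scoring call
)
ml_client.batch_deployments.begin_create_or_update(deployment).result()
```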
For more information see these links:
- Deploy MLflow models in batch deployments in Azure Machine Learning
- Deploy models for scoring in batch endpoints
Invoke the batch endpoint to start a batch scoring job
- What is batch scoring in Azure ML: Batch scoring is a way to process large amounts of data with a deployed machine learning model, producing predictions in parallel instead of evaluating data one row at a time. This is useful for tasks like running predictions on thousands of images or files at once.
- Invoking a batch endpoint: You use a command (or SDK/API) to start a batch scoring job. The job takes input data (such as files or folders in Azure Storage), processes them through your model, and creates output (predictions) in a specified storage location. The job runs asynchronously, meaning you can check on its progress while it completes in the background.
- Key parameters required: You must provide the batch endpoint name, resource group, workspace name, and specify your input data location (folder or file, stored in Azure or as a public URL). You can also set options like deployment name, output path, and mini-batch size to control how and where results are stored and how data is processed.
- How outputs are handled: The results from batch scoring are saved in cloud storage (like your workspace’s blob store) as files. You can customize where these files are stored using parameters, and preview or analyze them later using tools like Azure Storage Explorer.
- Parallel processing for efficiency: Batch endpoints split input data into smaller mini-batches and process multiple files in parallel, which saves time compared to handling each file one by one. Settings like mini-batch size help tune this parallelism according to your data distribution.
Example: Suppose you trained a machine learning model to recognize handwritten digits. You have thousands of scanned digit images stored in an Azure Storage folder. Using the Azure CLI, you run az ml batch-endpoint invoke with the batch endpoint name and the path to your image folder. Azure ML then automatically creates a job that processes all images in batches and saves the prediction results (which digits were detected) in the specified output folder.
Use Case: In IT, a data analyst new to Azure wants to quickly score a large batch of log files for anomaly detection. By uploading the logs to Azure Storage and invoking a batch endpoint, the analyst can use a deployed model to flag unusual patterns across all files efficiently, with results stored for further investigation.
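The same invocation can be made from Python instead of the CLI; a hedged SDK v2 sketch with a placeholder datastore path:
```python
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

job = ml_client.batch_endpoints.invoke(
    endpoint_name="heart-risk-batch",
    input=Input(
        type=AssetTypes.URI_FOLDER,
        path="azureml://datastores/workspaceblobstore/paths/new-records/",   # placeholder input folder
    ),
)

# The scoring job runs asynchronously; stream its logs or check it later in the studio.
ml_client.jobs.stream(job.name)
```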
For more information see these links:
- az ml batch-endpoint invoke
- Deploy models for scoring in batch endpoints
Optimize language models for AI applications (25–30%)
Prepare for model optimization
- Understand the model requirements: Before optimizing, identify the goal of your language model (e.g., improving accuracy, reducing latency). This will help you select the right model from the catalog and determine the best optimization strategy.
- Evaluate different language models: Use built-in benchmarks on Azure to compare models for your task. Look at metrics like accuracy, speed, and resource usage to choose the model that best fits your application’s needs.
- Test deployment in a controlled environment: Use the Azure AI Playground to test your selected model with real-world data and typical prompts. This helps identify issues and areas for improvement before scaling up.
- Select an optimization approach: Based on your findings, choose how to optimize—such as prompt engineering, retrieval augmented generation (RAG), fine-tuning with your own data, or automated hyperparameter tuning using built-in tools like FLAML or AutoML.
- Prepare data and infrastructure: Organize your data for the chosen optimization (e.g., cleaning, chunking for RAG, creating labeled training sets for fine-tuning) and ensure the necessary Azure compute resources and environments are configured for training and testing.
Example: Imagine you’re building a chatbot for IT support. First, you identify that high accuracy in interpreting support requests is important. You compare GPT-based and Llama-based models using Azure benchmarks, then test top candidates in the Azure AI Playground with sample queries. After seeing which model responds best, you decide to fine-tune it with historical IT support tickets for optimal performance.
Use Case: An IT company new to Azure Data wants to automate customer support using AI. They evaluate available language models on Azure, deploy their chosen model to a test environment, and use prompt engineering to improve response relevance—eventually fine-tuning the model with their own support data for high-quality, customized interactions.
For more information see these links:
- Study guide for Exam DP-100: Designing and Implementing a Data Science Solution on Azure
- Fine-tune models with Azure AI Foundry
- Hyperparameter tuning (preview)
- Train AI and ML models
- Optimize model training with Azure Machine Learning - Training
Select and deploy a language model from the model catalog
- Browse and select a language model from the Azure AI Foundry model catalog. The catalog contains models created by Azure, partners, and the community for tasks like text generation, summarization, and translation.
- Check model details, such as performance, supported languages, and required permissions, before selecting. Some models require specific roles or subscriptions to use, especially those from partners.
- Deploy the selected model to an endpoint using the Azure AI Foundry portal. You’ll specify deployment settings such as region and resource group, and the portal guides you step-by-step.
- Once deployed, your model is accessible via an API endpoint, making it easy to integrate with your applications or test in the Language playground within Azure AI Foundry.
- Regularly monitor and update your deployment as new versions become available or business requirements change. You can swap models or update deployment regions as needed.
Example: A new Azure Data professional logs into Azure AI Foundry, navigates to the model catalog, and selects a pre-built text summarization model from Azure. After reviewing its details, they deploy the model to an endpoint. They then use the API to summarize customer feedback collected from online forms, helping their IT team quickly understand common user issues.
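Once a deployment exists, calling it from an application typically amounts to an authenticated HTTP request. The sketch below is a hedged illustration only: the URL path, auth header, and payload shape depend on the deployment type, so copy the real values from the deployment's details page in Azure AI Foundry.

```python
# Minimal sketch: call a deployed model's REST endpoint with the requests library.
# The URL, header, and payload shape are assumptions; replace them with the
# values shown for your deployment in the Azure AI Foundry portal.
import requests

ENDPOINT_URL = "https://<your-deployment>.<region>.models.ai.azure.com/chat/completions"  # placeholder
API_KEY = "<api-key>"  # placeholder

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={"messages": [{"role": "user", "content": "Summarize this customer feedback: ..."}]},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```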
Use Case: An IT specialist at a company needs to automate extraction of key information from support tickets. By selecting and deploying a conversational language understanding model from Azure AI Foundry, the specialist sets up an API endpoint. Their team integrates this endpoint into their helpdesk system, allowing automatic ticket categorization and faster response times for common issues.
For more information see these links:
- Choose and deploy models from the model catalog in Azure AI Foundry portal - Training
- Deploy models as serverless API deployments (programming-language-cli)
- Deploy models as serverless API deployments (programming-language-bicep)
- Deploy an Azure AI Foundry custom translation model
- Quickstart: Conversational language understanding (azure-ai-foundry)
Compare language models using benchmarks
- Benchmarks provide standardized ways to compare language models based on key metrics like quality, performance, safety, and cost. Azure AI Foundry portal offers leaderboards that display these metrics for various models, helping users quickly identify models that best fit their needs.
- Quality benchmarks measure how accurately a model performs common tasks such as question answering, reasoning, coding, and math. Scores are calculated using widely recognized datasets, making it easier to evaluate model suitability for specific applications.
- Performance and cost benchmarks help users balance speed and expenses. Metrics like latency (how fast a model responds), throughput (how much data can be processed), and estimated costs let you weigh different models to make informed, actionable choices. Trade-off charts in the portal help visualize these comparisons and prioritize what’s important for your solution.
- Model leaderboards are updated regularly, ensuring users always have access to the latest models and their benchmark results. This ongoing refresh keeps your decision-making process current with industry advancements.
- Beginner-friendly model selection is built into the Azure AI Foundry portal, where you can filter and compare models by scenario (for example, document processing or conversational AI), then review detailed benchmark results before deploying or testing a model.
Example: A data analyst using Azure Data services wants to automate extracting information from support tickets. By browsing the model leaderboards in the Azure AI Foundry portal, they compare several language models. One performs best for accuracy but is slower and more costly, while another offers faster processing with slightly less accuracy. The analyst uses trade-off charts to choose a model that balances speed and quality for large-scale ticket analysis.
Use Case: An IT team in a company new to Azure Data needs a reliable language model for classifying customer feedback at scale. They use the Azure AI Foundry portal leaderboards to compare models across performance, accuracy, and cost benchmarks. By reviewing scenario-based leaderboards (such as for document classification) and detailed benchmarking results, they quickly identify and deploy a model that fits their workflow and budget.
For more information see these links:
- Model leaderboards in Azure AI Foundry portal (preview)
- Natural language processing technology
Test a deployed language model in the playground
- The playground in Azure AI Foundry or Databricks allows you to interactively test your deployed language model without needing to write any code. You can enter sample text (called a prompt or utterance) and see how the model responds, making it easy to evaluate accuracy and behavior.
- The playground provides options to view results in plain text or JSON, which helps you understand how your model is interpreting input and producing output. This is useful for quickly spotting errors or areas for improvement before integrating the model into your application.
- You can use the playground to compare responses from different model deployments or versions side by side. This makes it easier to choose the best-performing model for your needs or to measure the impact of new training data or fine-tuning.
- Commands like ‘Run’ in the playground simulate real-user interactions, so you can try typical queries your end users might submit and immediately see how the model processes them. This helps ensure the model meets your business requirements before going live.
- The playground environment also exposes configuration options and lets you experiment with advanced settings (like temperature or response length) to find the best model settings for your application scenario.
Example: Imagine you work for a company that receives many customer emails about product support. After training a conversational AI model to understand support requests, you deploy it and use the Azure AI Foundry Language Playground. You paste in a sample email like ‘I can’t log in to my account’ and click ‘Run’ to test if the model correctly identifies the intent (log-in help) and extracts key information.
Use Case: As a new Azure Data professional in IT, you can use the playground to validate a custom language model trained to classify support ticket categories. Before integrating the model into your helpdesk system, you test various real-world ticket descriptions in the playground to verify that the model assigns them to correct categories (e.g., ‘password reset’, ‘billing issue’). This helps prevent misclassifications that could delay ticket resolution.
For more information see these links:
- Quickstart: Conversational language understanding (azure-ai-foundry)
- Chat with LLMs and prototype generative AI apps using AI Playground
- Query your custom model
- Add and configure models to Azure AI Foundry Models (ai-foundry-portal)
Select an optimization approach
- Identify your optimization goal: Start by determining what you want to achieve, such as reducing costs, improving model accuracy, speeding up response times, or maximizing resource use. Common goals in IT and AI applications include optimizing for cost per click (CPC), cost per acquisition (CPA), or maintaining a set profit margin.
- Choose the right optimization strategy: Select from proven strategies such as optimizing to a predicted CPC/CPA goal or optimizing to a percentage margin of booked revenue. For instance, if you’re aiming for a specific CPA or CPC, set this as your performance goal so that the optimization algorithm can make informed decisions on how to bid or allocate resources.
- Utilize available tools and settings in Azure: Use built-in features like Performance Goals, intelligent fulfillment strategies, and cost monitoring tools. For example, setting up a fulfillment strategy in Azure Intelligent Order Management allows you to specify sources, inventory type, and simulations so that your optimization approach matches your business needs.
Example: A beginner working in Azure Data wants to ensure their advertising campaign gets the lowest cost per conversion. They set a $20 CPA goal using the Performance Goals section in their campaign line item. The Azure optimization algorithm will then automatically adjust bids to try to achieve conversions at or below $20, even if the revenue is booked on a CPM basis.
Use Case: An IT team new to Azure Data is launching an AI-powered chatbot for customer support. They want to optimize cloud resource usage so the chatbot performs well but stays within budget. They select a fulfillment strategy in Azure Intelligent Order Management that prioritizes the closest data center, uses real inventory tracking, and sets simulation runs to predict cost savings before going live. This approach helps them meet business goals and control expenses effectively.
For more information see these links:
- Optimization buying strategies
- Intelligent Fulfillment Optimization
- Prerequisite knowledge
- Cost Optimization design principles
Optimize through prompt engineering and prompt flow
Test prompts with manual evaluation
- Manual evaluation of test prompts involves reviewing AI-generated outputs by a human to assess the accuracy, relevance, and quality of responses against defined criteria.
- By creating and testing multiple versions of prompts, you can compare which version produces the most useful and reliable results for your business scenario.
- Evaluation datasets containing expected answers or facts are used to judge whether the AI’s responses meet the required standards, making manual review more structured.
- Tools like MLflow can help automate parts of the evaluation, but manual checking ensures outputs meet business expectations and compliance (especially for sensitive or unique scenarios).
- Manual evaluation allows you to identify edge cases or errors that automated scoring might miss, guiding the optimization of prompt design for Azure Data applications.
Example: A data analyst trying to summarize monthly sales figures asks a GenAI model to generate a summary using different prompt wordings. They manually check if the generated summaries mention all key numbers, trends, and business insights as expected.
Use Case: In Azure Data projects, new users manually evaluate prompt outputs from a chatbot that answers data governance questions. They verify if the answers match company policy and contain correct guidance before deploying the bot to internal teams.
For more information see these links:
- Evaluate and compare prompt versions
- Batch testing for prompts (preview)
- Transparency Note for auto-generate prompt variants in prompt flow
Define and track prompt variants
- Prompt variants are different versions of prompts or tool configurations that you create to explore how changes affect the AI model’s output. Using variants in Azure Machine Learning prompt flow allows you to fine-tune prompts for optimal results.
- Tracking prompt variants helps you compare performance, quality, and relevance of the responses from each version. By keeping a record of each variant and its settings, you avoid confusion and ensure systematic improvements.
- Variants in prompt flow also include tracking connection settings (e.g., temperature, model type) alongside prompt content, making it possible to experiment both with the wording and technical parameters for better results.
- You can use built-in features in Azure AI Foundry or MLflow to automatically log and manage the history of prompt and application versions, ensuring reproducibility and transparency in your workflow.
- Comparing results from multiple variants side-by-side—in the prompt flow UI or through experiment tracking—empowers data-driven decisions, helping you choose which variant generates the most relevant, high-quality output for your use case.
Example: Suppose you are building an AI-powered news summarization tool on Azure. You create four prompt variants: two ask for a summary and two ask for the main point of an article, each tested at two temperature settings (e.g., 1.0 and 0.7). By tracking the responses and performance of each, you identify which prompt and settings produce the clearest and most informative summaries for your users.
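As a rough sketch of how such variants could be recorded for comparison, the snippet below defines the four variants and logs each trial with MLflow. The prompt wording, the score_variant helper, and the metric name are illustrative assumptions, not the prompt flow variant feature itself.

```python
# Minimal sketch: track prompt variants and their settings as MLflow runs so
# results can be compared side by side. All values below are illustrative.
import mlflow

def score_variant(variant: dict) -> float:
    """Hypothetical helper: send the prompt to the deployed model, evaluate the
    response, and return a numeric quality score."""
    ...  # call the model and run your evaluation here
    return 0.0

variants = [
    {"id": "summary_t10",   "prompt": "Summarize this article: {article}",               "temperature": 1.0},
    {"id": "summary_t07",   "prompt": "Summarize this article: {article}",               "temperature": 0.7},
    {"id": "mainpoint_t10", "prompt": "What is the main point of this article? {article}", "temperature": 1.0},
    {"id": "mainpoint_t07", "prompt": "What is the main point of this article? {article}", "temperature": 0.7},
]

for variant in variants:
    with mlflow.start_run(run_name=variant["id"]):
        mlflow.log_param("prompt_template", variant["prompt"])
        mlflow.log_param("temperature", variant["temperature"])
        mlflow.log_metric("quality_score", score_variant(variant))
```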
Use Case: A new Azure Data engineer wants to deploy a customer support chatbot. They use prompt variants to test and compare different versions of their chatbot’s greeting and answer-generating prompts—such as ‘Hello, how can I assist you?’ versus ‘Welcome! What Azure Data question do you have today?’—as well as adjusting technical parameters. By tracking and comparing these variants using Azure Machine Learning prompt flow and MLflow versioning, the engineer chooses the most effective configuration, improving chat quality for users.
For more information see these links:
- Variants in prompt flow
- Transparency Note for auto-generate prompt variants in prompt flow
- Track prompt versions alongside application versions
- Tune prompts using variants in Azure AI Foundry portal
Create prompt templates
- Prompt templates are pre-defined structures that combine instructions for the AI model with placeholders for user input, enabling consistent and repeatable AI interactions.
- Using templates helps enforce best practices in prompt design, such as clear wording and proper formatting, so even beginners can produce reliable output.
- Prompt templates can include dynamic variables, allowing them to adapt in real time to incoming data or user input, and can be reused across different scenarios, reducing development time.
- You can create your own custom prompt templates in Azure Data tools (like Power Apps’ AI hub) or choose from a library of popular templates to kickstart your project.
- Templates can be integrated as ‘plugin functions’, allowing them to be called automatically by other AI flows or combined with other templates for more complex automations.
Example: Suppose you want to automatically summarize IT support tickets using Azure AI. You can use a prompt template that says: ‘Summarize the following support ticket: {ticket_text}’. The ‘{ticket_text}’ is a placeholder that will be filled with each incoming ticket during processing.
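A minimal Python sketch of this template pattern follows; the template text and ticket content are illustrative.

```python
# Minimal sketch: a reusable prompt template with a named placeholder that is
# filled in for each incoming ticket before the prompt is sent to the model.
SUMMARIZE_TICKET = "Summarize the following support ticket: {ticket_text}"

def build_prompt(ticket_text: str) -> str:
    """Fill the {ticket_text} placeholder with the ticket being processed."""
    return SUMMARIZE_TICKET.format(ticket_text=ticket_text)

print(build_prompt("User cannot connect to the VPN after the latest update."))
```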
Use Case: An IT team in an organization uses a prompt template in Power Automate to extract key issues and recommended actions from daily support request emails, standardizing responses and aiding quick triage.
For more information see these links:
- Semantic Kernel Components
- Get started with prompt library
- Create a prompt
- Build with Teams AI library
- Use the Prompt Coach template to build an agent
Define chaining logic with the prompt flow SDK
- Chaining logic in the prompt flow SDK allows you to link multiple prompts together in a sequence, where the output from one prompt serves as the input for the next. This makes it easier to manage complex AI workflows by breaking them down into smaller, manageable steps.
- By dividing your workflow into separate prompts or nodes, you can isolate specific tasks, making your logic cleaner and your application easier to develop, maintain, and test. Each node in the chain can represent a distinct function, such as data extraction, transformation, or decision-making.
- The orchestrator prompt (typically a ChatPrompt) oversees the flow by determining which child prompts to trigger based on the current context. This organization helps improve accuracy because each prompt handles a focused, well-defined task, reducing complexity and potential errors.
- Chaining enables the creation of multi-modal workflows. While the orchestrator must be a ChatPrompt, child prompts can process different types of inputs (text, images, structured data), supporting applications that require handling various media formats even if the underlying model doesn’t natively support them.
Example: Suppose you’re building a customer support chatbot for an IT service company using Azure AI Foundry. The chatbot first uses a prompt to understand the user’s problem. Next, it chains to another prompt that suggests troubleshooting steps based on the detected issue. Finally, a third prompt provides a summary response or escalates the issue if needed. Chaining these prompts together creates an efficient and organized support workflow.
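The chaining idea itself can be sketched in plain Python, independently of the prompt flow SDK. Here call_model is a hypothetical helper standing in for a request to your deployed model, and in a real flow each function would correspond to a node.

```python
# Minimal sketch of chaining: each step is a small function and the output of
# one step becomes the input of the next. call_model() is a hypothetical
# stand-in for sending a prompt to your deployed language model.
def call_model(prompt: str) -> str:
    ...  # e.g. POST the prompt to your endpoint and return the completion text
    return ""

def classify_issue(user_message: str) -> str:
    return call_model(f"Identify the IT issue described here: {user_message}")

def suggest_steps(issue: str) -> str:
    return call_model(f"List troubleshooting steps for this issue: {issue}")

def compose_reply(user_message: str, steps: str) -> str:
    return call_model(f"Write a reply to '{user_message}' using these steps: {steps}")

def support_chain(user_message: str) -> str:
    issue = classify_issue(user_message)      # step 1: understand the problem
    steps = suggest_steps(issue)              # step 2: propose a fix
    return compose_reply(user_message, steps) # step 3: compose the final answer
```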
Use Case: A beginner working in Azure Data can use chaining logic with the prompt flow SDK to automate ticket classification and response in an internal helpdesk system. Each prompt in the chain handles one subtask, such as extracting the issue type, picking appropriate responses, and logging the ticket into the company database, which provides clear modularity and easy updates.
For more information see these links:
- Prompt flow in Azure AI Foundry portal
- Chaining (preview)
- Prompt flow ecosystem
- Develop prompt flow
Use tracing to evaluate your flow
- Tracing allows you to record detailed information about how your flow runs, including which steps were executed, how long they took, and any errors that occurred. This makes it much easier to understand the flow’s behavior and pinpoint issues.
- By enabling tracing and integrating with monitoring tools like Application Insights, you can visualize data from flow executions. This includes telemetry such as run times, trigger events, and action details, which help you analyze performance and usage patterns.
- Tracing helps with troubleshooting and optimizing flows. For example, you can inspect the trace to see where bottlenecks or failures occur, use logs to diagnose root causes, and set up alerts for specific errors or performance thresholds.
Example: A data engineer creates an automated cloud flow in Power Automate to copy new Azure Data Explorer entries to a SharePoint list. By enabling tracing, the engineer reviews the flow run history and Application Insights telemetry to see when flows succeed or fail, which steps take the most time, and details of any errors. This actionable tracing data makes it easy to troubleshoot and enhance the automation.
Use Case: An Azure Data newbie sets up a scheduled flow to extract daily usage metrics from Azure Data Explorer and email a report to IT stakeholders. By using tracing, the engineer monitors each run in Application Insights and Dataverse, quickly spotting recurring errors in the report generation step, tracking overall flow duration, and improving scheduling and logic based on actual run data.
For more information see these links:
- Monitor your flows
- Azure Data Explorer connector for Microsoft Power Automate
- Monitor and troubleshoot automation processes
- Enable tracing and collect feedback for a flow deployment
- Tracing and logging
Optimize through Retrieval Augmented Generation (RAG)
Prepare data for RAG, including cleaning, chunking, and embedding
- Cleaning Data: The first step is transforming raw, unstructured data (like PDFs, Word Docs, or HTML pages) into clean, readable text. This includes removing headers, footers, special characters, duplicate content, and irrelevant sections using tools like Python libraries (e.g., PyPDF2 or BeautifulSoup). Clean data ensures better search results and reduces noise for the AI model.
- Chunking the Data: After cleaning, the text is broken into manageable pieces called ‘chunks.’ Chunking strategies can be based on sentences, paragraphs, or character count. Thoughtful chunking (with possible overlap) helps maintain context and allows the AI to retrieve relevant information that fits within its prompt size limit.
- Generating Embeddings: Each chunk is converted into a numerical vector (embedding) using AI models. These embeddings capture the semantic meaning of the text and allow for similarity comparisons. Embeddings are indexed and stored in a vector database, making it easy for the RAG system to quickly retrieve contextually relevant chunks when a user asks a question.
Example: Suppose you have a collection of IT support knowledge base articles in PDF format stored in Azure Blob Storage. First, you use a parsing library to convert each PDF into plain text, removing boilerplate headers and footers. You then split each article into paragraphs or sections, ensuring each chunk is not too large and overlaps with adjacent chunks for context. Lastly, you use an embedding model from Azure AI to generate vectors for each chunk and store them in Azure’s vector database, linking each chunk to its source file and section for easy reference during retrieval.
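A minimal sketch of the chunking step, assuming character-based chunks with overlap; the sizes are illustrative.

```python
# Minimal sketch: split cleaned text into overlapping character chunks so each
# chunk stays within the prompt size limit while preserving local context.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

cleaned = "..."  # text extracted from a PDF and stripped of headers/footers
chunks = chunk_text(cleaned)
# Each chunk would then be embedded (for example, with an Azure OpenAI embedding
# model) and stored in a vector index for retrieval.
```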
Use Case: An Azure Data Engineer building a RAG-powered chatbot for IT support on the company intranet first prepares all support documentation. They clean and chunk the text, create embeddings, and store them in Azure Databricks. When users ask questions, the chatbot quickly retrieves the most relevant answers based on semantic similarity, providing accurate solutions and referencing the original documents.
For more information see these links:
- Retrieval-augmented generation (RAG) provides LLM knowledge
- Build an unstructured data pipeline for RAG
- RAG generate embeddings phase
Configure a vector store
- Understand what a vector store is: A vector store, or vector database, is a specialized database designed to store and manage vector embeddings. These are numerical representations of data—such as text, images, or documents—in a high-dimensional space, making it possible to search for and compare similar items efficiently.
- Choose and configure the vector store: In Azure, you can use services like Azure Cosmos DB (with vector indexing enabled) or Azure Database for PostgreSQL (with the pgvector extension). During configuration, define which fields will store vector data and set vector index options for fast similarity search. For example, in Cosmos DB, you can choose from flat, quantized flat, or DiskANN indexing methods based on your scale and accuracy needs.
- Enable vector indexing and filtering: When storing data, ensure the relevant field is indexed as a vector property and specify the number of dimensions (e.g., 1536 for common embeddings). Set fields as filterable if you want to use them for queries. This allows you to efficiently run similarity searches combined with traditional queries, such as finding all documents similar to an input while filtering by tags or dates.
- Load and retrieve data: Once your vector store is configured, insert vectorized data (embeddings) alongside any traditional data you want to store. Use the provided APIs or SDKs to perform similarity searches—these queries find items in the database whose vectors are most similar to a given input, supporting intelligent retrieval for AI models.
- Integrate with Retrieval Augmented Generation (RAG): With the vector store enabled and searchable, you can now use it in RAG pipelines to augment AI-generated answers with highly relevant, context-specific information pulled from your enterprise data.
Example: Suppose you want to build a semantic search feature for your IT helpdesk documents on Azure. You start by creating a collection in Azure Cosmos DB, configure the ‘contentVector’ field to store 1536-dimensional embeddings, and set it as a vector index. You upload each helpdesk article as a document, including its vector. Now, when a user submits a support question, you generate its vector embedding and use a similarity search API to quickly find relevant knowledge base articles.
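To make the idea concrete, the toy sketch below shows what a vector store does conceptually: it keeps embeddings alongside documents and ranks them by cosine similarity against a query vector. In practice Cosmos DB vector indexing or pgvector performs this search at scale; the documents and vectors here are illustrative.

```python
# Toy sketch of vector search: store embeddings with documents and return the
# items most similar to a query embedding. Real workloads delegate this to a
# managed vector store; the vectors below are tiny illustrative examples.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

store = [
    {"id": "kb-001", "text": "How to reset a forgotten password", "vector": [0.1, 0.9, 0.0]},
    {"id": "kb-002", "text": "Requesting VPN access",             "vector": [0.8, 0.1, 0.2]},
]

def search(query_vector: list[float], top_k: int = 1) -> list[dict]:
    ranked = sorted(store, key=lambda d: cosine_similarity(query_vector, d["vector"]), reverse=True)
    return ranked[:top_k]

print(search([0.2, 0.8, 0.1]))  # returns the password-reset article
```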
Use Case: An IT analyst new to Azure Data wants to improve their company’s internal document search. By configuring a vector store in Azure Cosmos DB and integrating it with their existing search portal, users can search using natural language questions and receive the most semantically relevant results—even when their keywords don’t exactly match the documents’ wording.
For more information see these links:
- Vector search using Semantic Kernel Vector Store connectors (Preview) (programming-language-java)
- Vector search
- Vector stores in Azure Database for PostgreSQL
- Vector Search in Azure Cosmos DB for NoSQL
- Generative AI with Azure Database for PostgreSQL
Configure an Azure AI Search-based index store
- Understand the basics of Azure AI Search index store: An index in Azure AI Search organizes and stores searchable content, allowing for fast and efficient retrieval of information. Each index is like a database where you define what data can be searched, filtered, or retrieved.
- Define field attributes and types: When configuring an index, you decide the data types for each field (like Edm.String for text or Edm.Int32 for numbers) and set attributes such as searchable, retrievable, filterable, facetable, sortable, or stored. These attributes control how users can interact with your data during searches.
- Connect data sources and set up indexers: Azure AI Search can index data from sources like Azure Blob Storage or Azure SQL Database. By configuring an indexer, you automate the extraction of data, mapping source fields to your search index and keeping the index updated as underlying data changes.
- Use the Azure Portal or REST API for configuration: Beginners can use the Azure Portal’s wizards for point-and-click index creation and management, while more advanced users may prefer automating index creation and updates via REST APIs or SDKs.
- Optimize for Retrieval Augmented Generation (RAG): A well-configured index enables powerful search and filtering, helping systems retrieve relevant information efficiently for tasks like chatbots, document QA, or knowledge discovery in IT scenarios.
Example: Imagine an IT department stores technical documentation PDFs in Azure Blob Storage. By using Azure AI Search, they create an index to make all document titles and content searchable. Fields like ‘Title’ (Edm.String, searchable) and ‘UploadDate’ (Edm.DateTimeOffset, filterable, sortable) allow staff to quickly find documents by keywords or filter by upload date.
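A minimal sketch of creating such an index with the azure-search-documents Python library follows; the service endpoint, admin key, index name, and field set are placeholder assumptions.

```python
# Minimal sketch: define and create a search index with searchable and
# filterable fields. Endpoint, key, and index name are hypothetical placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchableField,
    SimpleField,
    SearchFieldDataType,
)

client = SearchIndexClient(
    endpoint="https://<search-service>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

index = SearchIndex(
    name="it-docs",
    fields=[
        SimpleField(name="Id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="Title", type=SearchFieldDataType.String),
        SearchableField(name="Content", type=SearchFieldDataType.String),
        SimpleField(name="UploadDate", type=SearchFieldDataType.DateTimeOffset,
                    filterable=True, sortable=True),
    ],
)

client.create_index(index)
```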
Use Case: A helpdesk chatbot uses Retrieval Augmented Generation to answer technical questions from employees. It relies on the Azure AI Search index to retrieve the most relevant knowledge base articles and documentation, based on user queries. When an employee asks, ‘How do I reset my company laptop?’, the system searches the indexed content and returns direct answers gleaned from IT documentation.
For more information see these links:
- Create an index in Azure AI Search
- Index data from Azure Blob Storage
- Manage an index in Azure AI Search (azure-portal)
- Index data from Azure SQL Database
Evaluate your RAG solution
- Assess the retrieval quality of your RAG solution by measuring how accurately it finds relevant data for each user query. Use metrics like precision (how much of the retrieved data is relevant) and recall (how much of the total relevant data is retrieved).
- Evaluate the response quality of the system, focusing on metrics such as groundedness (how well the LLM answer aligns with source data), relevancy (how pertinent the response is to the question), and safety (avoiding unintended harmful or biased outputs).
- Monitor system performance, including cost and latency. Track overall processing time from query to response and the computational resources used (like token consumption or index search speed), to ensure the application meets user experience and budget needs.
- Use evaluation tools and methods—such as Azure Machine Learning’s built-in metrics visualizations or the RAG Experiment Accelerator—to systematically test different configurations and strategies, comparing results to find the most effective solution.
Example: An IT department new to Azure Data builds a support chatbot using RAG. They test their chatbot by asking common troubleshooting questions and check if it retrieves the correct help articles (retrieval quality), provides accurate and easy-to-understand answers (response quality), and replies quickly without using excessive cloud resources (system performance). They regularly review these metrics in Azure Machine Learning to make improvements.
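A small sketch of the retrieval precision and recall metrics mentioned above, computed for a single test query; the document IDs are illustrative.

```python
# Minimal sketch: precision = relevant items among those retrieved,
# recall = retrieved items among all relevant ones, for one test query.
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"kb-12", "kb-07", "kb-31"}          # what the RAG index returned
relevant = {"kb-12", "kb-31", "kb-44", "kb-02"}  # ground-truth articles for the query
print(precision_recall(retrieved, relevant))     # (0.666..., 0.5)
```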
Use Case: A company sets up a RAG-powered internal knowledge base using Azure Machine Learning. By measuring retrieval precision and groundedness, they ensure employees get reliable responses to queries on cloud setup and security procedures. They use automated reports to iteratively improve their retrieval and LLM settings—helping staff solve technical issues faster and with more confidence, even with limited Azure experience.
For more information see these links:
- Assess performance: Metrics that matter
- Test and evaluate AI workloads on Azure
- Observability in generative AI
- Using experimentation to accelerate RAG development
- Retrieval Augmented Generation using Azure Machine Learning prompt flow (preview)
Optimize through fine-tuning
Prepare data for fine-tuning
- Gather relevant data: Collect task-specific input and output pairs that demonstrate the desired behavior you want from the fine-tuned model. Data should be accurate, clear, and representative of real-world scenarios.
- Format your data for Azure: Training and validation datasets must be structured in JSON Lines (JSONL) format, following Azure OpenAI’s conversational schema. Each line should represent one example in the required format, making it easy for the service to process.
- Ensure quality and quantity: Use hundreds or thousands of high-quality examples for best results. The minimum is 10 examples, but larger datasets generally produce better models. Make sure all examples are error-free and relevant to your task.
- Prepare for uploading: Encode data files in UTF-8 (with BOM) and ensure file size does not exceed 512 MB. For large files, consider using Azure Blob Storage for secure and reliable data upload.
Example: Suppose a team wants to fine-tune an Azure OpenAI model to help IT helpdesk staff answer frequently asked technical questions. They collect actual chat logs featuring user queries and helpdesk responses, then format these in JSONL as input-output pairs, ensuring each example follows the Azure OpenAI chat schema.
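A minimal sketch of writing training examples in the JSONL chat format; the conversations are illustrative, and utf-8-sig is used so the file carries the byte-order mark mentioned above.

```python
# Minimal sketch: write chat-style training examples, one JSON object per line.
# The conversation content is illustrative.
import json

examples = [
    {"messages": [
        {"role": "system", "content": "You are an IT helpdesk assistant."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open the self-service portal, choose 'Forgot password', and follow the prompts."},
    ]},
]

# utf-8-sig writes a byte-order mark at the start of the file.
with open("training_data.jsonl", "w", encoding="utf-8-sig") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```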
Use Case: An IT professional new to Azure Data can prepare customer service chat transcripts in JSONL format and upload them to Azure Blob Storage. This fine-tunes a language model to deliver tailored, accurate answers to common IT support requests, improving helpdesk efficiency and customer satisfaction.
For more information see these links:
- Azure OpenAI in Azure AI Foundry Models fine-tuning considerations
- Customize a model with fine-tuning (rest-api)
- Customize a model with fine-tuning (programming-language-studio)
Select an appropriate base model
- Understand your project requirements: Begin by identifying what you need your model to do (for example, text classification, summarization, or image recognition). A clear understanding of your goals will help you choose a base model that matches your needs.
- Use Azure AI Foundry model leaderboards: The Azure AI Foundry portal provides model leaderboards that rank models by quality, cost, and performance. These leaderboards make it easy to compare different base models based on the metrics that matter most to your project.
- Compare models using trade-off charts: Trade-off charts in Azure AI Foundry allow you to weigh the pros and cons between quality, cost, and performance. For example, you may prioritize a model with higher quality even if it costs more, or you may need the most cost-effective model for large-scale deployment.
- Analyze benchmark results: For each model, you can review detailed benchmark results, including aggregate scores and comparative charts. This helps you see how each model performs across different scenarios, so you can select the most suitable base model.
- Verify compatibility and versioning: Make sure the model you choose is compatible with your data and tools. In Azure, you can compare different model versions using version control features to ensure you select the best fit for your scenario.
Example: Suppose you want to create a chatbot for customer support using Azure Data. You access the Azure AI Foundry’s model leaderboards and see three top models ranked by quality, cost, and performance. By reviewing their leaderboard scores and using trade-off charts, you notice that the highest-quality model is slightly more expensive, but offers much lower latency and better results for your use case. This helps you make an informed choice for your chatbot project.
Use Case: An IT professional new to Azure Data is tasked with developing a document summarization tool for their company’s internal reports. They use the Azure AI Foundry portal to browse the model catalog, compare base models for text summarization using the leaderboards and benchmark data, and choose a model that balances high-quality output with reasonable cost. This forms the starting point for further fine-tuning with company-specific data.
For more information see these links:
- Compare and select models using the model leaderboard in Azure AI Foundry portal (preview)
- Machine learning model in Microsoft Fabric
- Windows Machine Learning
Run a fine-tuning job
- A fine-tuning job customizes a pre-built AI model (like GPT) with your own data to improve its accuracy for specific tasks in your organization. In Azure AI Foundry, this is started using the ‘Create a fine-tuned model’ tool.
- The workflow involves uploading your training and (optionally) validation data, selecting the base model and training method (such as supervised fine-tuning or preference optimization), and configuring any job parameters before starting the job.
- Once submitted, Azure queues and runs your job. You can monitor the status, review logs, and even pause the job to check intermediate results. Checkpoints are created at each training step, letting you roll back or deploy earlier versions if needed.
Example: Imagine you work for a company that processes thousands of IT support tickets every month. By running a fine-tuning job on Azure using real past support conversations as training data, you customize a language model to better understand and automatically categorize new incoming support requests. This reduces manual sorting and speeds up response times.
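A rough sketch of submitting such a job programmatically with the openai Python library against an Azure OpenAI resource; the endpoint, key, API version, and base model name are placeholders that must be replaced with values supported in your region.

```python
# Minimal sketch: upload training data and start a fine-tuning job.
# Endpoint, key, API version, and model name are hypothetical placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<api-key>",
    api_version="<api-version>",
)

# Upload the JSONL training data prepared earlier.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job on the chosen base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="<base-model-name>",
)
print(job.id, job.status)  # poll later with client.fine_tuning.jobs.retrieve(job.id)
```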
Use Case: A new Azure Data engineer in an organization needs an automated solution to route internal support queries accurately. By running a fine-tuning job on Azure AI Foundry with sample support tickets and their correct categories, the engineer creates a model that automatically classifies future tickets into the right department (e.g., Networking, Access, Database), improving workflow efficiency.
For more information see these links:
- Customize a model with fine-tuning (programming-language-studio)
Evaluate your fine-tuned model
- Use Evaluation Metrics: After fine-tuning your model, measure its performance using key metrics like accuracy, precision, recall, and F1 score. These indicate how well your model predicts outcomes and help identify areas for improvement.
- Compare Predictions to Actual Results: Evaluate your model by testing it on new, unseen data. Compare the predicted results to actual outcomes to check if your model is generalizing well and not just memorizing training data.
- Analyze Training and Validation Statistics: Review loss and accuracy values from your training and validation datasets. Loss should decrease and accuracy should increase as training progresses. Use monitoring tools in Azure to visualize these trends and spot any issues.
- Iterate and Improve: If your model does not perform well on validation data, adjust your training data, fine-tuning settings, or try different model architectures. High-quality data and correct parameters are crucial for a reliable solution.
Example: Suppose you have fine-tuned a text classification model to categorize IT support tickets. After training, you test the model on a separate set of tickets. If your model correctly predicts the ticket categories for most examples and has high precision and recall scores, it means your fine-tuning process was effective.
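A minimal sketch of computing these metrics on a held-out test set with scikit-learn; the ticket labels are illustrative.

```python
# Minimal sketch: score predictions from a fine-tuned classifier on held-out
# tickets. Labels and predictions below are illustrative.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["network", "access", "database", "access", "network"]
y_pred = ["network", "access", "access",   "access", "network"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={accuracy_score(y_true, y_pred):.2f} "
      f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```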
Use Case: An Azure Data engineer new to model fine-tuning trains a custom model to automatically detect and label critical data incidents in Azure logs. By evaluating the fine-tuned model’s performance with validation data and metrics, they ensure the system reliably identifies urgent issues for timely response.
For more information see these links:
- Model fine-tuning concepts
- Train and evaluate a model
- Evaluation metrics
- Customize a model with fine-tuning (programming-language-studio)