Take the LEAD with Datalab.

The Datalab LEAD Program: Learn, Experience, Advance, Develop.

Machine Learning/Artificial Intelligence Learning Paths

We present 20 topics in the data science learning path, providing learning objectives, related skills, subtopics, and references/resources for each. The goal is to give graduate students a structured and comprehensive program to acquire data science expertise, including hands-on experience with real-world open-source tools and libraries.

timeline
    A. General Data Science : A1. Introduction to Data Science and Machine Learning
        : A2. Data Analysis with Pandas
        : A3. Data Visualization
        : A4. Ethical Considerations in Data Science
    B. Statistics : B1. Descriptive Statistics
        : B2. Probability Distributions
        : B3. Inferential Statistics
        : B4. Bayesian Statistics
    C. Machine Learning : C1. Machine Learning with Scikit-Learn
        : C2. Unsupervised Learning
        : C3. Supervised Learning
        : C4. Ensemble Methods
    D. Deep Learning : D1. Neural Networks with PyTorch
        : D2. Transformers with HuggingFace
        : D3. GenAI 1 <br> (LLM, RAG)
        : D4. GenAI 2 <br> (Multimodal LLMs)
        : D5. Agents and MCP
    E. Continuous Integration / Continuous Deployment : E1. MLOps
        : E2. LLMOps
        : E3. AgentOps

A. General Data Science

A1. Introduction to Data Science and Machine Learning

Data Science is an interdisciplinary field focused on extracting knowledge and insights from data. Machine Learning (ML), a key component of Artificial Intelligence (AI), enables systems to learn from data to make decisions or predictions.
A2. Data Analysis with Pandas

Pandas is an open-source Python library used for data manipulation and analysis. It provides data structures, such as Series (1D) and DataFrames (2D), designed to handle tabular datasets efficiently.
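
As a minimal sketch of the two structures named above, the snippet below builds a Series and a DataFrame and runs a typical filter-and-aggregate step. The column names and values are made up for illustration, and the example assumes pandas is installed.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
scores = pd.Series([88, 92, 79], index=["ana", "ben", "cho"], name="score")

# A DataFrame is a two-dimensional table; each column behaves like a Series.
# The data below is hypothetical.
df = pd.DataFrame(
    {
        "student": ["ana", "ben", "cho", "dee"],
        "group": ["a", "b", "a", "b"],
        "score": [88, 92, 79, 85],
    }
)

# Typical manipulation: filter rows with a boolean mask, then aggregate by a column.
high = df[df["score"] >= 85]
mean_by_group = df.groupby("group")["score"].mean()

print(scores.ndim, df.ndim)  # number of dimensions of each structure
print(high)
print(mean_by_group)
```

Boolean masks and `groupby` cover a large share of everyday tabular analysis, which is why they are used here rather than more specialized methods.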
A3. Data Visualization with Matplotlib and Seaborn

Matplotlib is a Python library for generating visualizations such as histograms, scatter plots, bar charts, and pie charts. Seaborn is a visualization library built on top of Matplotlib. It offers a higher-level interface for statistical graphics, with more polished default styling.
A4. Ethical Considerations of Data Science

Ethics in data science encompasses the moral principles and guidelines that govern the collection, analysis, and use of data to ensure responsible and beneficial outcomes.

B. Statistics

B1. Descriptive Statistics

Descriptive statistics are brief summary coefficients, such as the mean, median, and standard deviation, that describe a given data set, which may represent either an entire population or a sample of it.
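
For a concrete illustration, the sketch below computes a few common summary coefficients with Python's standard `statistics` module. The sample values are invented for the example.

```python
import statistics

# Hypothetical sample of eight observations.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(sample)        # central tendency: 5
median = statistics.median(sample)    # middle value: 4.5
mode = statistics.mode(sample)        # most frequent value: 4
pop_sd = statistics.pstdev(sample)    # population standard deviation: 2.0
sample_sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)

print(mean, median, mode, pop_sd, round(sample_sd, 3))
```

The choice between `pstdev` and `stdev` mirrors the population versus sample distinction: the first treats the data as the whole population, the second as a sample drawn from it.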
B2. Probability Distributions

In probability theory and statistics, a probability distribution is a function that gives the probabilities of occurrence of possible events for an experiment.
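
One way to make this definition concrete is the binomial distribution, chosen here only as an example. The sketch below lists the probability of every possible outcome when counting heads in four fair coin flips; the helper name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Experiment: flip a fair coin 4 times and count heads.
n, p = 4, 0.5
pmf = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

for k, prob in pmf.items():
    print(f"P(heads = {k}) = {prob}")

# The probabilities of all possible outcomes sum to 1.
total = sum(pmf.values())
print(total)
```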
B3. Inferential Statistics

Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.
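
The sketch below illustrates both ideas, estimation and hypothesis testing, on a small invented sample. It uses a normal approximation from the standard library for simplicity; for a sample this small, a t-distribution would usually be the more appropriate choice, so treat the numbers as illustrative only. The null value `mu0` is an assumption made for the example.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample drawn from a larger population.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2, 5.5, 5.9, 5.3, 5.7]

n = len(sample)
mean = statistics.mean(sample)
std_error = statistics.stdev(sample) / math.sqrt(n)

# Estimate: approximate 95% confidence interval for the population mean.
z = NormalDist().inv_cdf(0.975)
ci = (mean - z * std_error, mean + z * std_error)

# Hypothesis test: two-sided z-test of H0 "population mean equals mu0".
mu0 = 5.0
z_stat = (mean - mu0) / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"mean = {mean:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}")
```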
B4. Bayesian Statistics

Bayesian statistics is a method of statistical inference that uses Bayes' Theorem to update the probability of a hypothesis as new evidence becomes available.
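
The updating step can be sketched with a small worked example. The scenario (a diagnostic test with a 1% prior, 95% sensitivity, and a 5% false positive rate) and the function name `bayes_update` are assumptions chosen for illustration.

```python
def bayes_update(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """Return P(hypothesis | positive evidence) using Bayes' Theorem.

    likelihood          = P(evidence | hypothesis)
    false_positive_rate = P(evidence | not hypothesis)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prior, 95% sensitivity, 5% false positive rate.
prior = 0.01
posterior_1 = bayes_update(prior, 0.95, 0.05)

# A second, independent positive result uses the first posterior as the new prior.
posterior_2 = bayes_update(posterior_1, 0.95, 0.05)

print(f"after one positive result:  {posterior_1:.3f}")
print(f"after two positive results: {posterior_2:.3f}")
```

Feeding each posterior back in as the next prior is what "updating as new evidence becomes available" means in practice.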

C. Machine Learning

C1. Machine Learning with Scikit-Learn

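Scikit-learn is a widely used open-source Python library for machine learning. It provides a consistent estimator interface (`fit`, `predict`, `transform`) for classification, regression, clustering, and preprocessing.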
C2. Unsupervised Learning

Unsupervised learning is a type of machine learning where algorithms learn from unlabeled data, identifying patterns and structures without specific guidance or desired outputs.
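
As an illustration, the sketch below implements a bare-bones k-means clustering loop in plain Python. K-means is only one of many unsupervised methods, and the points and starting centroids are invented for the example; note that no labels are supplied anywhere.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled 2D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])

print(centroids)
print(clusters)
```

In practice a library implementation such as scikit-learn's `KMeans` would be preferred; the hand-written loop is shown only to expose the mechanics.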
C3. Supervised Learning

Supervised learning is a type of machine learning where an algorithm learns to predict an output variable by being trained on a labeled dataset.
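
A minimal sketch of this idea is a k-nearest-neighbors classifier written in plain Python: it predicts the label of a new point from the labels of the closest training examples. The training data, labels, and function name `knn_predict` are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest labeled neighbors."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled dataset: each input point comes with a known output label.
train_X = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
train_y = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(train_X, train_y, (2.0, 2.0)))
print(knn_predict(train_X, train_y, (8.2, 8.7)))
```

The labels in `train_y` are what make this supervised: the algorithm is told the desired output for every training input.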
C4. Ensemble Learning

Ensemble learning in machine learning combines multiple individual models (base learners) to create a more accurate and robust predictive model than any single model alone.
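
The sketch below shows one simple ensemble technique, majority voting, in plain Python. The three base learners are hypothetical threshold rules, and the toy data was hand-picked so that their errors do not coincide; real ensembles gain accuracy only to the extent that their base learners make different mistakes.

```python
from collections import Counter

def threshold_on(feature_index):
    """Build a weak base learner that looks at a single feature."""
    def predict(x):
        return 1 if x[feature_index] > 0.5 else 0
    return predict

def majority_vote(models, x):
    """Combine base learners by returning the most common prediction."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

# Hypothetical toy data: (features, label).
data = [
    ((0.9, 0.8, 0.7), 1),
    ((0.1, 0.2, 0.3), 0),
    ((0.4, 0.8, 0.9), 1),
    ((0.7, 0.3, 0.6), 1),
    ((0.2, 0.1, 0.8), 0),
    ((0.6, 0.4, 0.2), 0),
]

models = [threshold_on(i) for i in range(3)]
single_accs = [accuracy(m, data) for m in models]
ensemble_acc = accuracy(lambda x: majority_vote(models, x), data)

print("base learners:", [round(a, 2) for a in single_accs])
print("ensemble:", ensemble_acc)
```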

D. Deep Learning

D1. Deep Learning in PyTorch

PyTorch is an open-source ML framework offering flexible deep learning development with Python integration. It features dynamic computation graphs and GPU acceleration for neural networks, computer vision, and NLP tasks.
D2. Transformers with HuggingFace

Hugging Face Transformers is a Python library and open-source framework used to access and utilize pre-trained machine learning models for tasks like natural language processing (NLP), computer vision, audio processing, and multi-modal applications.
D3. Generative AI 1 - LLM, RAG

Retrieval-augmented generation (RAG) is a technique that enables large language models (LLMs) to retrieve relevant information from external sources and incorporate it into their responses.
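
The retrieve-then-incorporate flow can be sketched with a toy example. This is a simplification under stated assumptions: it ranks documents by bag-of-words cosine similarity, whereas production systems typically use learned embeddings and a vector store, and it stops at building the prompt rather than calling a model. All function names and documents are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=1):
    """Retrieval step: rank documents by similarity to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents, top_k=1):
    """Augmentation step: place the retrieved text in the prompt sent to the LLM."""
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base.
documents = [
    "Pandas provides Series and DataFrame structures for tabular data.",
    "PyTorch features dynamic computation graphs and GPU acceleration.",
    "Seaborn is a visualization library built on top of Matplotlib.",
]

query = "Which library offers GPU acceleration?"
prompt = build_prompt(query, documents)
print(prompt)
```

The generation step, omitted here, would pass `prompt` to a language model so that its answer is grounded in the retrieved text.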
D4. Generative AI 2 - Multimodal LLMs

Multimodal LLMs are advanced AI systems that can process and generate content across multiple types of data, or modalities, such as text, images, audio, and video.

E. Continuous Integration / Continuous Deployment

E1. MLOps

MLOps (Machine Learning Operations) is a set of practices for managing machine learning models, making it easier to develop, deploy, and update them as business needs change.
E2. LLMOps

LLMOps (Large Language Model Operations) extends MLOps practices to handle the challenges of deploying large language models. It focuses on managing computational resources, prompt engineering, and monitoring model performance and ethics.
E3. AgentOps

AgentOps is the practice of deploying and managing autonomous agents that perform complex tasks independently. These agents work with APIs, use real-time data for decisions, and adapt dynamically, making them suitable for autonomous, high-stakes applications.


Created: 05/25/2025 (C. Lizárraga); Updated: 05/29/2025 (C. Lizárraga)

2025. University of Arizona DataLab, Data Science Institute