A1 Intro to Data Science and ML
Introduction to Data Science and Machine Learning
(Image credit: Jason Leung. Unsplash.com)
This module serves as your foundational entry point into the dynamic worlds of Data Science and Machine Learning. It's designed for self-paced, experiential learning, encouraging active engagement with Python and AI tools to progressively build your skills from foundational knowledge to analytical application.
1. General Learning Objectives
Upon successful completion of this module, you will be able to:
- Remember key terminology and foundational concepts of Data Science and Machine Learning. (Bloom's: Remember)
- Understand the typical workflow of a data science project and the roles of various components. (Bloom's: Understand)
- Understand the distinctions and relationships between Artificial Intelligence, Machine Learning, and Data Science. (Bloom's: Understand)
- Apply basic Python programming skills using essential libraries (Pandas, NumPy) for fundamental data loading, inspection, and manipulation tasks. (Bloom's: Apply)
- Apply core data visualization techniques using Matplotlib and Seaborn to explore datasets and communicate initial findings. (Bloom's: Apply)
- Apply the principles of a basic machine learning algorithm (e.g., k-Nearest Neighbors) to a simple classification problem using Scikit-learn. (Bloom's: Apply)
- Analyze simple datasets to identify appropriate questions, necessary preprocessing steps, and suitable introductory modeling approaches. (Bloom's: Analyze)
- Analyze the output and performance of a basic machine learning model to draw initial conclusions. (Bloom's: Analyze)
Throughout this module, you are encouraged to use open-source Large Language Models (LLMs) as a learning aid to clarify concepts, debug code, and explore topics more deeply, thereby enhancing your independent learning journey.
2. Topic Overview
Data Science is an interdisciplinary field that extracts knowledge and insights from structured and unstructured data using scientific methods, processes, algorithms, and systems. Machine Learning (ML), a core component of Artificial Intelligence (AI), provides systems with the ability to automatically learn and improve from experience without being explicitly programmed. This module introduces you to the foundational principles that underpin these fields, the typical lifecycle of a data science project (from data acquisition to model deployment and interpretation), and the ethical considerations involved. You'll discover how Python, with its rich ecosystem of libraries, has become the de facto language for data science, enabling practitioners to tackle complex problems across various domains like healthcare, finance, and technology. Understanding these fundamentals is the first crucial step toward becoming a proficient data scientist.
3. Open-Source Python Libraries
These libraries are the workhorses of Data Science; a short sketch after this list shows each one in action:
- NumPy (Numerical Python):
- Description: The fundamental package for numerical computation in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.
- Practical Applications: Performing mathematical and logical operations on arrays, Fourier transforms, routines for shape manipulation, linear algebra, and random number generation. It forms the basis for most other scientific computing and data analysis libraries in Python.
- Pandas:
- Description: A powerful and flexible open-source data analysis and manipulation tool, built on top of NumPy. It provides expressive data structures like DataFrame and Series, designed to make working with "relational" or "labeled" data both easy and intuitive.
- Practical Applications: Data cleaning, data transformation (reshaping, merging, joining), data loading from various file formats (CSV, Excel, SQL databases), time-series analysis, and exploratory data analysis.
- Matplotlib:
- Description: A comprehensive library for creating static, animated, and interactive visualizations in Python. It provides a MATLAB-like interface and an object-oriented API for embedding plots into applications.
- Practical Applications: Generating a wide variety of plots like line plots, scatter plots, histograms, bar charts, and more, for data exploration, presentation, and publication.
- Seaborn:
- Description: A Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- Practical Applications: Creating more sophisticated statistical plots with less code, such as heatmaps, violin plots, pair plots, and complex categorical plots. Excellent for highlighting relationships and distributions.
- Scikit-learn (sklearn):
- Description: A simple and efficient tool for data mining and data analysis. It features various classification, regression, clustering, dimensionality reduction, model selection, and preprocessing algorithms.
- Practical Applications: Implementing machine learning models, evaluating model performance, preparing data for modeling (e.g., feature scaling, encoding categorical variables).
- Jupyter Notebook / JupyterLab:
- Description: Web-based interactive computational environments that allow you to create and share documents containing live code, equations, visualizations, and narrative text.
- Practical Applications: Ideal for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. They support an iterative, exploratory approach to data science.
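As a minimal, illustrative sketch of how these libraries fit together (the tiny DataFrame below is invented for demonstration, not a real dataset):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# NumPy: efficient array math
arr = np.array([1.0, 2.0, 3.0, 4.0])
print(arr.mean(), arr.std())

# Pandas: labeled tabular data (values invented for demonstration)
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.1, 3.9, 6.2, 8.1]})
print(df.describe())

# Matplotlib + Seaborn: quick statistical visualization
sns.scatterplot(data=df, x="x", y="y")
plt.title("A tiny example plot")
plt.show()

# Scikit-learn: fit a simple model to the same data
model = LinearRegression().fit(df[["x"]], df["y"])
print(model.coef_, model.intercept_)
```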
4. Core Skills to Develop
Engaging with this module will help you develop these specific skills:
- Conceptual Fluency: Articulating the definitions and relationships between key Data Science and Machine Learning terms.
- Data Acclimation: Loading, inspecting, and performing initial descriptive analysis on diverse datasets using Pandas.
- Visual Exploration: Generating and interpreting basic statistical plots to identify patterns, distributions, and relationships in data.
- Introductory Modeling: Implementing and evaluating a simple classification model using Scikit-learn.
- AI-Augmented Learning: Effectively using LLMs to clarify concepts, troubleshoot Python code, and explore problem-solving alternatives.
5. Subtopics
This module is structured around five key subtopics:
- The Data Science Ecosystem: Defining Data Science, AI, ML, and the typical project lifecycle.
- Python Environment & Jupyter Mastery: Setting up your tools for effective data science work.
- Data Foundations with NumPy & Pandas: Core techniques for data manipulation and preparation.
- Visual Insights with Matplotlib & Seaborn: Fundamentals of exploratory data visualization.
- First Steps in Machine Learning with Scikit-learn: Introduction to predictive modeling.
6. Experiential Use Cases
For each use case, remember to leverage AI tools if you get stuck or want to dive deeper. For instance, you can ask an LLM: "Explain this Python error message: [paste error]", "Generate a simple Python script to load a CSV using Pandas," or "What are some common ways to handle missing data in a dataset?"
Subtopic 1: The Data Science Ecosystem
- Use Case 1.1: Defining the Domain
- Problem Definition: You are tasked with explaining Data Science, Machine Learning (ML), and Artificial Intelligence (AI) to a non-technical colleague and illustrating how they relate.
- Learning Goal: Create a concise written explanation or a simple diagram that defines AI, ML, and Data Science, highlighting their overlaps and distinctions. (Bloom's: Understand)
- Python Tools: N/A for direct coding. Use a text editor or diagramming tool.
- Use Case 1.2: Real-World ML Applications
- Problem Definition: Identify how machine learning impacts everyday life or specific industries.
- Learning Goal: Research and list three distinct real-world applications of machine learning. For each, briefly describe the problem it solves and the likely type of data used. (Bloom's: Remember, Understand)
- Python Tools: N/A for direct coding. Use web search and a text editor.
- Use Case 1.3: Mapping the Data Science Workflow
- Problem Definition: You need to understand the standard process flow of a data science project.
- Learning Goal: Outline the key stages of a common data science workflow (e.g., CRISP-DM). For each stage, describe its main objective and list one example activity. (Bloom's: Remember, Understand)
- Python Tools: N/A for direct coding. Use a text editor or presentation software.
Subtopic 2: Python Environment & Jupyter Mastery
- Use Case 2.1: Environment Setup
- Problem Definition: Ensure you have a functional Python environment with all necessary data science libraries.
- Learning Goal: Install Anaconda (or Miniconda), create a virtual environment, and install NumPy, Pandas, Matplotlib, Seaborn, and Scikit-learn. Verify installations within a Jupyter Notebook by importing each library. (Bloom's: Apply)
- Python Tools: Anaconda/Miniconda, command line/terminal, Jupyter Notebook.
- Use Case 2.2: Jupyter Notebook Navigation
- Problem Definition: Become proficient with the Jupyter Notebook interface for interactive coding and documentation.
- Learning Goal: Create a new Jupyter Notebook. Practice creating and running code cells (e.g., variable assignments, simple calculations), using Markdown cells for titles and formatted text (bold, italics, lists), and saving your notebook. (Bloom's: Apply)
- Python Tools: Jupyter Notebook.
- Use Case 2.3: AI for Code Comprehension
- Problem Definition: You encounter a Python code snippet online for data loading but don't fully understand its syntax or logic.
- Learning Goal: Use an LLM to explain a provided short Python script (e.g., a script that loads a CSV with Pandas and prints basic info). Document the explanation and re-run the script with understanding. (Bloom's: Understand, Apply)
- Python Tools: Jupyter Notebook, access to an LLM.
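Once the environment exists (for example, `conda create -n ds-env python` followed by `conda install numpy pandas matplotlib seaborn scikit-learn jupyter`; the environment name here is arbitrary), a minimal verification cell like the sketch below confirms each library imports and reports its version:

```python
# Run in a Jupyter Notebook cell to confirm the environment is ready.
import matplotlib
import numpy
import pandas
import seaborn
import sklearn

for lib in (numpy, pandas, matplotlib, seaborn, sklearn):
    print(f"{lib.__name__:12s} {lib.__version__}")
```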
Subtopic 3: Data Foundations with NumPy & Pandas
- For these use cases, you can find simple CSV datasets online (e.g., student performance, basic retail sales) or ask an LLM to generate sample CSV data for you to save and use. A combined sketch follows these three use cases.
- Use Case 3.1: Data Loading & Initial Inspection
- Problem Definition: You are given a dataset in a CSV file and need to understand its basic structure and content.
- Learning Goal: Load the CSV file into a Pandas DataFrame. Use functions like `.head()`, `.tail()`, `.info()`, `.shape`, and `.describe()` to gather initial insights about the data (number of rows/columns, data types, missing values, summary statistics). (Bloom's: Apply)
- Python Tools: Pandas, Jupyter Notebook.
- Use Case 3.2: Basic Data Cleaning
- Problem Definition: The loaded dataset contains some missing values that could affect future analysis.
- Learning Goal: Identify columns with missing values. Apply a simple strategy to handle them (e.g., filling numerical NaNs with the column mean or median, or dropping rows/columns if appropriate, justifying your choice). (Bloom's: Apply, Analyze)
- Python Tools: Pandas, NumPy, Jupyter Notebook.
- Use Case 3.3: Data Selection & Filtering
- Problem Definition: You need to extract specific subsets of your data for a targeted analysis (e.g., analyze data for a particular category or time period).
- Learning Goal: Select specific columns from the DataFrame. Filter rows based on one or more conditions (e.g., select all records where 'sales' > 1000 and 'region' == 'North'). (Bloom's: Apply)
- Python Tools: Pandas, Jupyter Notebook.
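Here is a minimal sketch tying Use Cases 3.1 through 3.3 together. The file name `sales.csv` and the `sales` and `region` columns are hypothetical; substitute your own dataset and column names:

```python
import pandas as pd

# Use Case 3.1: load and inspect (file and column names are hypothetical)
df = pd.read_csv("sales.csv")
print(df.head())      # first five rows
print(df.shape)       # (number of rows, number of columns)
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for numerical columns

# Use Case 3.2: handle missing values in a numerical column
print(df.isna().sum())  # count of missing values per column
df["sales"] = df["sales"].fillna(df["sales"].median())

# Use Case 3.3: select columns and filter rows on multiple conditions
subset = df[["sales", "region"]]
north_big = df[(df["sales"] > 1000) & (df["region"] == "North")]
print(north_big.head())
```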
Subtopic 4: Visual Insights with Matplotlib & Seaborn
- Use Case 4.1: Univariate Visualization
- Problem Definition: Understand the distribution of individual variables in your dataset.
- Learning Goal: Create a histogram for a numerical column (e.g., 'age' or 'price') and a bar chart for a categorical column (e.g., 'product_category') to visualize their distributions. Interpret what these plots tell you. (Bloom's: Apply, Analyze)
- Python Tools: Matplotlib, Seaborn, Pandas, Jupyter Notebook.
- Use Case 4.2: Bivariate Visualization
- Problem Definition: Explore potential relationships between pairs of variables.
- Learning Goal: Generate a scatter plot to visualize the relationship between two numerical variables (e.g., 'study_hours' vs. 'exam_score'). Create a box plot to compare the distribution of a numerical variable across different categories of a categorical variable (e.g., 'income' by 'education_level'). Discuss any observed patterns. (Bloom's: Apply, Analyze)
- Python Tools: Matplotlib, Seaborn, Pandas, Jupyter Notebook.
- Use Case 4.3: Customizing Visualizations for Clarity
- Problem Definition: Your basic plots are created, but they lack titles, proper labels, and could be more visually appealing for presentation.
- Learning Goal: Take one of the plots created earlier and add a descriptive title, clear x-axis and y-axis labels, and change the color or style to improve its readability and aesthetic quality. You can ask an LLM for suggestions on "how to make a Matplotlib plot more informative." (Bloom's: Apply)
- Python Tools: Matplotlib, Seaborn, Pandas, Jupyter Notebook.
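Here is a minimal sketch covering Use Cases 4.1 through 4.3; the file name and every column name below are placeholders for whatever dataset you chose:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("students.csv")  # hypothetical file

# Use Case 4.1: univariate plots
sns.histplot(data=df, x="age")                # distribution of a numerical column
plt.show()
sns.countplot(data=df, x="product_category")  # frequencies of a categorical column
plt.show()

# Use Case 4.2: bivariate plots
sns.scatterplot(data=df, x="study_hours", y="exam_score")
plt.show()
sns.boxplot(data=df, x="education_level", y="income")
plt.show()

# Use Case 4.3: add a title, axis labels, and a custom color for clarity
ax = sns.scatterplot(data=df, x="study_hours", y="exam_score", color="teal")
ax.set_title("Exam Score vs. Study Hours")
ax.set_xlabel("Study hours per week")
ax.set_ylabel("Exam score")
plt.show()
```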
Subtopic 5: First Steps in Machine Learning with Scikit-learn
- Use a simple, well-known dataset for this, like the Iris dataset, which can be loaded directly from Scikit-learn: `from sklearn.datasets import load_iris`. A combined sketch follows these use cases.
- Use Case 5.1: Understanding Features and Target
- Problem Definition: Before building a model, you must clearly define what you are trying to predict and what information will be used for prediction.
- Learning Goal: Load the Iris dataset. Identify and separate the features (X; e.g., sepal length, petal width) and the target variable (y; the species of iris). Explain why this is a classification problem. (Bloom's: Understand, Apply)
- Python Tools: Scikit-learn, Pandas (optional, for DataFrame conversion), Jupyter Notebook.
- Use Case 5.2: Training a Simple Classifier
- Problem Definition: You need to train a machine learning model to learn patterns from the data.
- Learning Goal: Split the data into training and testing sets. Train a k-Nearest Neighbors (k-NN) classifier on the training portion of the Iris dataset. (Bloom's: Apply)
- Python Tools: Scikit-learn (`train_test_split`, `KNeighborsClassifier`), Jupyter Notebook.
- Use Case 5.3: Making Predictions and Basic Evaluation
- Problem Definition: Once trained, you need to assess how well your model performs on unseen data.
- Learning Goal: Use the trained k-NN model to make predictions on the test set. Calculate the accuracy of the model and interpret what this score means in the context of the problem. (Bloom's: Apply, Analyze)
- Python Tools: Scikit-learn, Jupyter Notebook.
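Here is a minimal end-to-end sketch of Use Cases 5.1 through 5.3 on the Iris dataset; the `test_size`, `random_state`, and `n_neighbors` values are illustrative, not tuned:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Use Case 5.1: separate features (X) from the target (y)
iris = load_iris()
X, y = iris.data, iris.target  # 4 numerical features; 3 species classes

# Use Case 5.2: split the data, then train a k-NN classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Use Case 5.3: predict on unseen data and evaluate
y_pred = knn.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
```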
7. Assessment Quiz
This quiz helps you self-assess your understanding. Answers can be verified by reviewing module content or quick experimentation.
- Which of the following best describes Machine Learning?
a) The science of making computers perform tasks that require human intelligence.
b) A field of study that gives computers the ability to learn without being explicitly programmed.
c) The process of using computers to analyze large datasets only.
d) The development of software that can reason and solve complex problems like humans.
(Answer: b)
- Consider the following Python code using Pandas:
```python
import pandas as pd

data = {'colA': [1, None, 3, 4], 'colB': [5, 6, 7, 8]}
df = pd.DataFrame(data)
df['colA'].fillna(df['colA'].mean(), inplace=True)
# The mean of the non-missing values in colA is (1 + 3 + 4) / 3 = 2.66...
```
What will be the value of `df['colA'][1]` after this code runs?
a) None
b) 0
c) Approximately 2.67
d) 6
(Answer: c)
- Which Python library is primarily used for creating statistical visualizations like heatmaps and pair plots with concise syntax?
a) NumPy
b) Seaborn
c) Pandas
d) Scikit-learn
(Answer: b)
- In a typical classification problem, what is the role of the 'target variable'?
a) It's an input feature used by the model to learn.
b) It's the categorical label or class that the model aims to predict.
c) It's a numerical value the model tries to estimate.
d) It's a technique for reducing the number of features.
(Answer: b)
- What is the primary purpose of `train_test_split` in Scikit-learn?
a) To combine two different datasets into one.
b) To separate features from the target variable within a single dataset.
c) To divide a dataset into one part for training the model and another, unseen part for evaluating its performance.
d) To visualize the distribution of data.
(Answer: c)
- If you want to create a scatter plot in Python to visualize the relationship between 'Height' and 'Weight' columns in a Pandas DataFrame `df`, which line of code is most appropriate using Seaborn?
a) `sns.histplot(data=df, x='Height', y='Weight')`
b) `sns.boxplot(data=df, x='Height', y='Weight')`
c) `sns.scatterplot(data=df, x='Height', y='Weight')`
d) `df.plot(kind='scatter', x='Height', y='Weight')` (This is Pandas plotting, not Seaborn directly.)
(Answer: c)
- You have loaded a dataset into a Pandas DataFrame called `sales_df`. How would you display the first 10 rows of this DataFrame?
a) `sales_df.show(10)`
b) `sales_df.display_head(10)`
c) `sales_df.head(10)`
d) `sales_df.first(10)`
(Answer: c)
- When you encounter a Python error message that you don't understand while working in a Jupyter Notebook, how can an LLM assist you most effectively?
a) By automatically fixing the code in your notebook.
b) By explaining what the error message typically means, suggesting possible causes, and providing examples of how to fix similar errors.
c) By providing a link to the full Python documentation without context.
d) By advising you to restart your computer.
(Answer: b)
- What does the `.info()` method in Pandas primarily provide for a DataFrame?
a) A statistical summary of numerical columns (mean, std, min, max).
b) The first five rows of the DataFrame.
c) A concise summary of the DataFrame, including data types of columns and non-null counts.
d) The correlation matrix of numerical columns.
(Answer: c)
- Which of these tasks falls under the 'Data Cleaning/Preparation' stage of the data science workflow?
a) Defining business objectives.
b) Training a machine learning model.
c) Handling missing values and transforming variables.
d) Presenting results to stakeholders.
(Answer: c)
8. Bonus Challenge Problems
These challenges encourage you to synthesize your learning and explore concepts at a deeper analytical level.
- Mini-Project: Exploratory Data Analysis (EDA) on a Novel Dataset (Bloom's: Analyze)
- Problem: Find a small, interesting, and publicly available dataset (e.g., from Kaggle Datasets, UCI Machine Learning Repository - look for simpler ones). Perform a basic EDA. This should include:
- Loading the dataset.
- Inspecting its structure, data types, and identifying missing values.
- Formulating at least three initial questions about the data that you find interesting.
- Creating at least three distinct types of visualizations to help answer your questions or explore patterns.
- Writing a brief summary of your findings and any challenges encountered.
- Guidance: Document your steps and reasoning in a Jupyter Notebook. Use LLMs to help you find datasets or to get ideas for relevant questions and visualizations for the dataset you choose. For example, "Suggest interesting questions I can explore in a dataset about [dataset topic]."
- Comparing Classifiers: k-NN vs. Another (Bloom's: Analyze, Evaluate - introductory level)
- Problem: Using the Iris dataset (or another simple classification dataset you find), train and evaluate the k-Nearest Neighbors (k-NN) classifier as done in the module. Then, research and implement one other simple classification algorithm available in Scikit-learn (e.g., Logistic Regression or a Decision Tree).
- Train this new model on the same training data.
- Evaluate it on the same test data using accuracy.
- Briefly compare the results. Which performed better on this specific task?
- Use an LLM to help you understand the basic principles of the new classifier you chose ("Explain Logistic Regression in simple terms for a beginner").
- Guidance: Document your code and a short comparison of the models' performance and any observations in a Jupyter Notebook. This is not about finding the "best" model in an absolute sense, but about the process of applying and comparing.
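Here is a minimal comparison sketch using Logistic Regression as the second classifier; the hyperparameters are illustrative defaults rather than tuned values:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train both models on the same split so the comparison is fair
models = {
    "k-NN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "Logistic Regression": LogisticRegression(max_iter=200),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.2f}")
```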
9. References & Further Reading
(Focus on open-access or widely available resources)
- VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media.
- Open-access online version: https://jakevdp.github.io/PythonDataScienceHandbook/
- A comprehensive guide covering IPython, NumPy, Pandas, Matplotlib, and Scikit-learn. Excellent for practical learning.
- Grus, J. (2019). Data Science from Scratch: First Principles with Python (2nd ed.). O'Reilly Media.
- While not fully open-access, many concepts are foundational and widely discussed. Focuses on understanding by building from scratch.
- Scikit-learn User Guide.
- Open-access: https://scikit-learn.org/stable/user_guide.html
- The official documentation is extensive, with tutorials and examples for all its modules.
- Pandas Documentation: Getting Started & User Guide.
- Open-access: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
- Authoritative source for learning Pandas, from basic to advanced features.
- StatQuest with Josh Starmer.
- Open-access (YouTube): https://www.youtube.com/user/joshstarmer
- Provides clear and intuitive explanations of key statistics, machine learning, and data science concepts. Highly recommended for understanding the "why" behind the techniques.
Created: 05/25/2025 (C. Lizárraga); Updated: 05/25/2025 (C. Lizárraga)