{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(sec-data-science-lifecycle)=\n", "# Data Science Lifecycle\n", "\n", "Data science involves exploring and experimenting with data. A data science project consists of multiple steps, such as data preparation, modeling, and deploying to production environments. As shown in {numref}`fig-data-science-lifecycle`, CRISP-DM (Cross-industry standard process for data mining) was initially proposed as an industry-standard by Shearer {cite}`shearer2000CRISPDM` to describe the data science lifecycle.\n", "\n", "```{figure} ../img/ch-data-science/data-science-lifecycle.svg\n", "---\n", "width: 400px\n", "name: fig-data-science-lifecycle\n", "---\n", "Data Science Lifecycle\n", "```\n", "\n", "CRISP-DM consists of six parts:\n", "\n", "* Business Understanding\n", "* Data Understanding\n", "* Data Preparation\n", "* Modeling\n", "* Evaluation\n", "* Deployment\n", "\n", "## Business Understanding\n", "\n", "Understanding the business logic plays a vital role in the success of any project because the ultimate goal of data science is to serve the business. Before diving into specific data science modeling, data analysts should deeply understand the business. For example, in e-commerce data analysis, one must first comprehend the intrinsic logic of e-commerce, user demands, and company business goals to better serve users and the company in subsequent modeling and analysis.\n", "\n", "House price prediction is a typical data science scenario. We have chosen the California housing dataset for demonstration purposes. This dataset provides features such as house prices, neighborhood income, house age, number of rooms, number of bedrooms, and neighborhood population. In this business scenario, we need to clarify the objective: as house prices are influenced by multiple factors, real estate agents should negotiate between buyers and sellers to reach a reasonable price for the house.\n", "\n", "## Data Understanding\n", "\n", "After understanding the business logic, the next step is to understand the data. Data analysts need to collaborate closely with the business team to understand the available data and any third-party data relevant to the business. Specifically, data analysts need to delve into how the data is generated, its description, data types, and many other details. At this stage, exploratory data analysis (EDA) may also be conducted by examining the data distribution or visualizing the data to gain insights into its basic characteristics.\n", "\n", "Here, we explore the housing price dataset. It is worth noting that enterprise data is often more complex than open-source data and tightly integrated with the business." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | MedInc | \n", "HouseAge | \n", "AveRooms | \n", "AveBedrms | \n", "Population | \n", "AveOccup | \n", "Latitude | \n", "Longitude | \n", "MedHouseValue | \n", "
---|---|---|---|---|---|---|---|---|---|
0 | \n", "8.3252 | \n", "41.0 | \n", "6.984127 | \n", "1.023810 | \n", "322.0 | \n", "2.555556 | \n", "37.88 | \n", "-122.23 | \n", "4.526 | \n", "
1 | \n", "8.3014 | \n", "21.0 | \n", "6.238137 | \n", "0.971880 | \n", "2401.0 | \n", "2.109842 | \n", "37.86 | \n", "-122.22 | \n", "3.585 | \n", "
2 | \n", "7.2574 | \n", "52.0 | \n", "8.288136 | \n", "1.073446 | \n", "496.0 | \n", "2.802260 | \n", "37.85 | \n", "-122.24 | \n", "3.521 | \n", "
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression()