---
title: 'Data mining: PEC1'
author: "Author: Student name"
date: "March 2023"
output:
  html_document:
    highlight: default
    number_sections: yes
    theme: cosmo
    toc: yes
    toc_depth: 2
    includes:
      in_header: 75.584-PEC-header.html
  word_document: default
  pdf_document:
    highlight: zenburn
    toc: yes
---
# Exercises
## Exercise 1:
Propose a complete data mining project. Organize the response around the typical phases of the life cycle of a data mining project. *You do not have to perform the tasks of each phase.* For each phase, indicate its objective and the product to be obtained, using examples of what the tasks could be and how they could be carried out. If any characteristic makes the life cycle of a data mining project different from that of other projects, point it out.
> The aim is to build a model that predicts the likelihood of a fatal accident from the available data, and to study the factors that lead to fatal motor vehicle accidents.
1. Business Understanding: The goals of this phase are to specify the problem statement, pinpoint the organization's objectives, and determine how data mining can help achieve them. It should result in a clear understanding of the problem and of how the data can be used to address it. One possible objective here would be to identify the most frequent causes of fatal motor vehicle accidents, in order to support policies that reduce the number of fatalities.
2. Data Understanding: The goal of this phase is to gather and understand the data that will be used for the analysis. The result should be a well-defined dataset in which every relevant feature is fully described and documented. At this stage one might, for instance, examine the distribution of the FATALS attribute, which records the number of fatalities in a given accident.
3. Data Preparation: The goal of this phase is to clean and preprocess the data to make it ready for analysis. The result should be an organized dataset suitable for modeling. For instance, one might handle missing values, normalize the data, and perform feature selection at this stage.
4. Modeling: The goal of this phase is to build a prediction model that can accurately estimate the likelihood of a fatal accident. It should result in a model that can be applied to new data. For instance, a logistic regression model could estimate that likelihood from the available attributes (see the sketch after this list).
5. Evaluation: The goal of this phase is to assess the model's effectiveness and determine whether it achieves the desired business outcomes. It should result in a report that evaluates the model's performance, points out flaws or limitations, and suggests improvements. One might, for instance, compute the model's precision, recall, and F1-score and compare them to agreed benchmarks.
6. Deployment: The goal of this phase is to put the model into production so that it can score new data. The result should be a well-documented, easy-to-use model that can be integrated into the organization's existing systems. For instance, the model could be exposed as a RESTful API that other applications can call.
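As an illustration of phases 4 and 5, here is a minimal sketch in R (not executed, since this exercise does not require performing the tasks). Because every FARS record already describes a fatal crash, the binary target below, whether a crash involved more than one fatality, and the chosen predictors are assumptions made purely for illustration:
```{r, eval=FALSE}
# Sketch of phases 4-5 (modeling and evaluation), not a final design.
# Hypothetical binary target: whether a crash involved more than one fatality.
accidentData$MULTI_FATAL <- as.integer(accidentData$FATALS > 1)

# Hold out 30% of the rows for evaluation
set.seed(42)
train_idx <- sample(nrow(accidentData), size = floor(0.7 * nrow(accidentData)))
train <- accidentData[train_idx, ]
test  <- accidentData[-train_idx, ]

# Phase 4: fit a logistic regression on a few illustrative predictors
fatal_model <- glm(MULTI_FATAL ~ VE_TOTAL + PERSONS + HOUR,
                   data = train, family = binomial)

# Phase 5: precision, recall and F1 on the held-out rows
pred <- as.integer(predict(fatal_model, newdata = test, type = "response") > 0.5)
tp        <- sum(pred == 1 & test$MULTI_FATAL == 1)
precision <- tp / sum(pred == 1)
recall    <- tp / sum(test$MULTI_FATAL == 1)
f1        <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, F1 = f1)
```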
## Exercise 2:
Using the data set from the previous example, perform the tasks prior to generating a data mining model that are explained in the modules "The data mining process" and "Data preprocessing and feature management". You can use the previous example as a reference, but try to change the approach and analyze the data along the different dimensions it presents. Optionally, and positively valued, data from other years can be added to the study to make temporal comparisons (https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/), or other related facts can be added to the study, for example drug use in accidents (https://static.nhtsa.gov/nhtsa/downloads/FARS/2020/National/FARS2020NationalCSV.zip).
```{r}
# Load the FARS accident file (base read.csv; no extra packages needed)
path <- "C:/Users/Admins/Desktop/accident.CSV"
accidentData <- read.csv(path, row.names = NULL)
names(accidentData)
```
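Optionally, for the temporal comparison suggested in the prompt, several FARS years can be stacked into one data frame. A minimal sketch, assuming each year's accident file has been downloaded from the FARS site and saved under the hypothetical names below:
```{r, eval=FALSE}
# Hypothetical per-year files downloaded from
# https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/
years <- 2018:2020
dfs <- lapply(years, function(y) {
  d <- read.csv(sprintf("C:/Users/Admins/Desktop/accident_%d.CSV", y),
                row.names = NULL)
  d$YEAR <- y  # make the source year explicit on every row
  d
})
# Column sets can differ between FARS years, so keep only the shared columns
common <- Reduce(intersect, lapply(dfs, names))
accidentAll <- do.call(rbind, lapply(dfs, function(d) d[, common]))
table(accidentAll$YEAR)
```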
> Exploratory Data Analysis
```{r}
# Explore the data
library(psych)
describe(accidentData)
```
> Data Cleaning

Checking for missing and duplicated values:
```{r}
# Check for missing or null values
sum(is.na(accidentData))
# Check for duplicates
sum(duplicated(accidentData))
```
The results above show that the data contains no missing or duplicated values.
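The numeric attributes are also on quite different scales, so this is a natural point to normalize them, derive a new variable, and discretize an attribute. A minimal sketch, assuming the standard FARS HOUR column (hour of the crash, 0-23, with 99 coding an unknown hour):
```{r, eval=FALSE}
# Min-max normalization onto [0, 1] for two attributes on different scales
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
accidentData$VE_TOTAL_norm <- minmax(accidentData$VE_TOTAL)
accidentData$PERSONS_norm  <- minmax(accidentData$PERSONS)

# New variable: fatalities per person involved in the crash
accidentData$FATAL_RATE <- accidentData$FATALS / pmax(accidentData$PERSONS, 1)

# Discretize HOUR into periods of the day (99 = unknown becomes NA)
accidentData$DAY_PERIOD <- cut(
  ifelse(accidentData$HOUR == 99, NA, accidentData$HOUR),
  breaks = c(-1, 5, 11, 17, 23),
  labels = c("night", "morning", "afternoon", "evening")
)
table(accidentData$DAY_PERIOD, useNA = "ifany")
```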
> Univariate Analysis
```{r, message=FALSE, warning=FALSE}
library(ggplot2)
library(gridExtra)
# Histogram for VE_TOTAL (total vehicles involved in each accident)
p1 <- ggplot(accidentData, aes(x = VE_TOTAL)) +
  geom_histogram(fill = "cornflowerblue", color = "black", bins = 20) +
  ggtitle("Distribution of VE_TOTAL") +
  theme_bw()
# Histogram for PERSONS (total persons involved in each accident)
p2 <- ggplot(accidentData, aes(x = PERSONS)) +
  geom_histogram(fill = "darkorange", color = "black", bins = 20) +
  ggtitle("Distribution of PERSONS") +
  theme_bw()
# Arrange the plots side by side
grid.arrange(p1, p2, ncol = 2)
```
```{r}
# Density plots for VE_TOTAL and PERSONS (ggplot2 and gridExtra already loaded)
p1 <- ggplot(accidentData, aes(x = VE_TOTAL)) +
  geom_density(fill = "blue", color = "black") +
  ggtitle("Density Plot of VE_TOTAL") +
  theme(plot.title = element_text(size = 14),
        plot.background = element_rect(fill = "white"),
        panel.background = element_rect(fill = "white"))
p2 <- ggplot(accidentData, aes(x = PERSONS)) +
  geom_density(fill = "orange", color = "black") +
  ggtitle("Density Plot of PERSONS") +
  theme(plot.title = element_text(size = 14),
        plot.background = element_rect(fill = "white"),
        panel.background = element_rect(fill = "white"))
# Arrange plots side by side
grid.arrange(p1, p2, nrow = 1)
```
```{r}
# Accident counts by state (STATE is the numeric FARS state code),
# faceted by year so multi-year data splits into one panel per year
ggplot(accidentData, aes(x = STATE)) +
  geom_bar(fill = "cornflowerblue") +
  facet_wrap(~ YEAR, nrow = 3) +
  labs(title = "Accident counts by state") +
  theme_bw()
```
```{r}
# Total fatalities by state and weather condition
fatalities_by_state_weather <- aggregate(
  accidentData$FATALS,
  by = list(State = accidentData$STATE, Weather = accidentData$WEATHERNAME),
  FUN = sum
)
names(fatalities_by_state_weather)[3] <- "Fatalities"  # aggregate names the sum column "x"
```
```{r}
ggplot(fatalities_by_state_weather, aes(x = State, y = Fatalities, fill = Weather)) +
  geom_bar(stat = "identity") +
  labs(x = "State", y = "Number of Fatalities", fill = "Weather") +
  ggtitle("Number of Accident Fatalities by State and Weather")
```
Grading criteria: concept and weight in the final grade.

Exercise 1:

- The objective of the project is correctly defined, with sufficient precision, and can be solved with data mining techniques. 10%
- The phases of the life cycle are clearly described, the examples are illuminating, and the decisions made are justified and argued. 20%

Exercise 2:

- The database is loaded, its structure is visualized, and the basic facts of the data are explained. 5%
- It is checked whether there are empty attributes or attributes on different scales that need to be normalized; if so, measures are taken to address them. A new useful variable is constructed from the existing ones. Some attribute is discretized. 10%
- The data is analyzed visually and tangible conclusions are drawn. A coherent narrative with clear conclusions must be developed, and all results must be well contextualized; the conclusions should not be technical but "translated" into the context being studied. 30%
- Some other aspects of the data presented in the "Data preprocessing and feature management" module are covered in depth. 25%