Data Science Challenge

7 Jan

The Biggest Job Data Scientists Do Is Invisible


I recently looked into a publicly available SNCF customer satisfaction survey. My goal was to illustrate how Data Science can inform managerial decisions as simple as investing in basic amenities. The data came from several sources, and before I knew it, the initial clean-up was taking me longer than fitting the models and checking their assumptions. I seized this opportunity to show managers a fact they rarely suspect: how lengthy the task of data manipulation is.

I also thank SNCF for their Open Data initiative that allowed me to illustrate this case.

Importing and cleaning the data

The main challenge of this seemingly simple project was the sheer number of different files that had to be cleaned and combined before the analysis could even start. Data preparation is the unseen part of a Data Scientist’s job, the hidden part of the iceberg.

So, first we import the main file, which summarizes the train stations and the customer satisfaction at each of them, both for arriving and departing passengers.

[Figure: SNCF customer satisfaction survey, raw Excel file]

To spare you the boredom, I leave the conversion of this (heavily graphical and special-character-laden) Excel file beyond the scope of this article; we’ll skip to importing the cleaned-up csv version instead. I also shorten the column names to save myself some typing, and I convert the numeric columns to be treated as such (yes, with a for loop, because I’m old-school).

library(dplyr)   #data wrangling: %>%, group_by, summarise, filter
library(tidyr)   #pivot_wider
library(stringr) #str_detect
library(car)     #some(): print a random sample of rows

df1 <- read.csv("SNCF_Synthesis.csv", sep = ";", dec = ".", header = TRUE, stringsAsFactors = FALSE)

df1[1,2] <- "Type" #rename some of the header
df1[1,3] <- "Service"
df1[1,4] <- "UIC"
df1[1,6] <- "Works"

colnames(df1) <- df1[1,] #promote the first row to column names
df1 <- df1[-1,] #then drop the row that held them

#convert cols 7 to 24 to numeric
for (i.seq in seq(from = 7, to = (ncol(df1)-1))){
  df1[,i.seq] <- as.numeric(df1[,i.seq])
}

y <- c(1, 2, 3, 4, 6, 25) #the categorical columns
for (i.seq in y){
  df1[,i.seq] <- factor(df1[,i.seq])
}

Finally, I convert the rest to factor variables, and the first part of the input data is ready.
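(For the record, the loop-free equivalent of both conversions fits in two lines:)

#the same conversions, without loops
df1[7:24] <- lapply(df1[7:24], as.numeric)
df1[c(1, 2, 3, 4, 6, 25)] <- lapply(df1[c(1, 2, 3, 4, 6, 25)], factor)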

Next, I import the file containing the cleanliness ratings of the stations. I keep only the important columns, rename them, and convert them to factors or numbers as needed. I then drop the older observations, keeping only those that correspond to the time period of the satisfaction survey, i.e. March.

df_prop <- read.csv("proprete-en-gare.csv", sep = ";",
                  dec = ".", header = TRUE, stringsAsFactors = FALSE)
df_prop <- df_prop[,c(1,2,3,6)]
colnames(df_prop) <- c("Month", "UIC", "Gare", "Conformity")

df_prop[,1] <- factor(df_prop[,1])
df_prop[,2] <- factor(df_prop[,2])

y <- df_prop %>%
  group_by(Month) %>%
  summarise(meanconf = mean(Conformity))
tail(y)
## # A tibble: 6 x 2
##   Month   meanconf
##   <fct>      <dbl>
## 1 2019-03     94.5
## 2 2019-04     94.2
## 3 2019-05     94.2
## 4 2019-06     94.0
## 5 2019-07     93.7
## 6 2019-08     93.0
df_prop_mars <- df_prop %>%
  filter(Month == "2019-03")

some(df_prop_mars) #a random sample of rows, from the car package
##       Month      UIC                Gare Conformity
## 47  2019-03 87581009 Bordeaux Saint-Jean      93.99
## 59  2019-03 87394007            Chartres      95.05
## 67  2019-03 87342014               Arras      92.68
## 73  2019-03 87611004   Toulouse Matabiau      97.06
## 78  2019-03 87113001           Paris Est      95.90
## 83  2019-03 87313874              Amiens      93.85
## 85  2019-03 87345009               Douai      95.11
## 94  2019-03 87673004             Bayonne      98.23
## 108 2019-03 87474338             Morlaix      92.24
## 116 2019-03 87673400            Biarritz      94.58

This could have been straightforward, but the unique ID of the train station is not in the same format as in the other file, so we drop the first two digits. Also, there is apparently a duplicate that needs to be found and eliminated.

df_prop_mars <- df_prop_mars[,c(2,4)] #keep UIC and Conformity
df_prop_mars$UIC <- factor((as.numeric(as.character(df_prop_mars$UIC))-87000000)) #drop the leading 87

# the file contains a duplicate entry; let's find it
z <- df_prop_mars %>%
  group_by(UIC) %>%
  summarize(count = n())

y <- z[which(z$count > 1),1]
y
## # A tibble: 1 x 1
##   UIC
##   <fct>
## 1 271007

Next, we find the two rows that show information for this station twice, and delete one of them.

which(df_prop_mars$UIC == "271007")
## [1]  6 81
df_prop_mars <- df_prop_mars[-6,]
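(With dplyr, the same fix is a one-liner, assuming the duplicated rows carry interchangeable ratings:)

#alternative: keep only the first row for each UIC
df_prop_mars <- distinct(df_prop_mars, UIC, .keep_all = TRUE)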

As a next step, I import the data on the various amenities that SNCF proudly installs in its train stations: pianos, telephone charging stations, short-story dispensers and so forth.

df_fun <- read.csv("gares-pianos.csv", sep = ";",
                    dec = ".", header = TRUE, stringsAsFactors = FALSE)

I follow the same clean-up process as before and check that it contains no duplicates. Here is what we get at the end:

str(df_fun)
## 'data.frame':    109 obs. of  6 variables:
##  $ UIC                    : Factor w/ 109 levels "111849","113001",..: 94 20 7 31 29 8 68 79 37 41 ...
##  $ Gare                   : chr  "Aix-les-Bains le Revard" "Belfort - Montbeliard TGV" "Champagne-Ardenne TGV" "Paris Saint-Lazare" ...
##  $ Piano                  : int  1 1 1 1 0 0 0 0 1 0 ...
##  $ Power.Station          : int  0 3 0 0 0 2 1 1 2 2 ...
##  $ Baby.Foot              : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Distr.Histoires.Courtes: int  0 0 1 1 1 0 0 0 1 0 ...

Finally, and most importantly, we import the number of people present at each train station:
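The import follows the same pattern as the other files; here is a minimal sketch (the file name frequentation-gares.csv and the column positions below are assumptions):

#a sketch: file name and column positions are assumed
df_freq <- read.csv("frequentation-gares.csv", sep = ";",
                    dec = ".", header = TRUE, stringsAsFactors = FALSE)
df_freq <- df_freq[,c(1,2,3)] #assumed positions of UIC and the two counts
colnames(df_freq) <- c("UIC", "NumTrav", "NumNonTrav")
df_freq$UIC <- factor(df_freq$UIC)
str(df_freq)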

## 'data.frame':    3015 obs. of  3 variables:
##  $ UIC       : Factor w/ 3015 levels "28","9696","109306",..: 523 2623 758 752 2637 2074 1895 1663 362 1462 ...
##  $ NumTrav   : int  7686902 76441 99154 126145 104267 96 18504 556255 50010 16070 ...
##  $ NumNonTrav: int  7686902 76441 123942 157681 130333 96 18504 695319 50010 16070 ...

Pivoting the data frame

Next, we shorten the labels of the pivot column and pivot the data frame so that each station occupies a single row:

df2 <- df1
y <- colnames(df2)[7:24]
df2[,ncol(df2)] <- factor(df2[,ncol(df2)], levels = c("Entrants", "Sortants"), labels = c("Ent", "Sort"))
#shorter labels for the pivot column

df3 <- pivot_wider(df2, names_from = Ent_Sort,
            names_prefix = "", names_sep = "_", names_repair = "check_unique",
            values_from = y, values_fill = NULL, values_fn = NULL)
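A quick sanity check on the pivot: assuming every station was surveyed in both directions, each station should now occupy one row instead of two.

#the pivot should halve the row count
stopifnot(nrow(df3) == nrow(df2) / 2)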

colnames(df2)
##  [1] "Agence"    "Type"      "Service"   "UIC"       "Gare"      "Works"
##  [7] "Glob_Mars" "Glob_Sept" "P1_Mars"   "P1_Sept"   "P0_Mars"   "P0_Sept"
## [13] "P2_Mars"   "P2_Sept"   "P3_Mars"   "P3_Sept"   "P4_Mars"   "P4_Sept"
## [19] "P6_Mars"   "P6_Sept"   "P7_Mars"   "P7_Sept"   "P5_Mars"   "P5_Sept"
## [25] "Ent_Sort"
colnames(df3)
##  [1] "Agence"         "Type"           "Service"        "UIC"
##  [5] "Gare"           "Works"          "Glob_Mars_Ent"  "Glob_Mars_Sort"
##  [9] "Glob_Sept_Ent"  "Glob_Sept_Sort" "P1_Mars_Ent"    "P1_Mars_Sort"
## [13] "P1_Sept_Ent"    "P1_Sept_Sort"   "P0_Mars_Ent"    "P0_Mars_Sort"
## [17] "P0_Sept_Ent"    "P0_Sept_Sort"   "P2_Mars_Ent"    "P2_Mars_Sort"
## [21] "P2_Sept_Ent"    "P2_Sept_Sort"   "P3_Mars_Ent"    "P3_Mars_Sort"
## [25] "P3_Sept_Ent"    "P3_Sept_Sort"   "P4_Mars_Ent"    "P4_Mars_Sort"
## [29] "P4_Sept_Ent"    "P4_Sept_Sort"   "P6_Mars_Ent"    "P6_Mars_Sort"
## [33] "P6_Sept_Ent"    "P6_Sept_Sort"   "P7_Mars_Ent"    "P7_Mars_Sort"
## [37] "P7_Sept_Ent"    "P7_Sept_Sort"   "P5_Mars_Ent"    "P5_Mars_Sort"
## [41] "P5_Sept_Ent"    "P5_Sept_Sort"

We now merge in the amenities data and the number of people present in each station, and compute the total count of travelers and non-travelers:

df4 <- merge(df3, df_fun, by = "UIC", all.x = TRUE)
df4 <- merge(df4, df_freq, by = "UIC", all.x = TRUE)
df4$NumTotal <- df4$NumTrav+df4$NumNonTrav
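Left joins silently duplicate rows when the key is not unique on the right-hand side (exactly what the cleanliness file would have done before we removed its duplicate), so a cheap sanity check is worth its line:

#the left joins should preserve the row count of df3
stopifnot(nrow(df4) == nrow(df3))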

Next, I add the cleanliness data, using only the entries for March. I do some final clean-up and save the data for next time:

y <- colnames(df4)
z <- !(str_detect(y, "Sept")) #TRUE for the columns that do NOT contain "Sept"
df5 <- df4[,z] #keep only the March data

#remove ugly character in one of the factors
levels(df5[,2]) <- c("CERA", "Centre Ouest", "DGGP", "Est", "Grand Sud","Manche Nord","Nelle Aquitaine")

df6 <- merge(df5, df_prop_mars, by = "UIC", all.x=TRUE, all.y = FALSE)

colnames(df6)[33] <- "Clean"
colnames(df6)[5] <- "Station"
df6 <- df6[,!colnames(df6) %in% c("Gare.y")] #drop the duplicate station-name column from the merge

df_final <- df6
save.image(file = "SNCF.RData")
write.csv(df_final, "SNCF_all_data_mars.csv")

If you thought that would be all, the cleaning job is only beginning. The data frame is a bit too unwieldy to run a meaningful analysis on, and a whole block of data for arriving passengers is missing. To deal with this, I write a custom function that averages the satisfaction ratings of arriving and departing passengers, or takes whichever rating is available when the other one is missing. Incidentally, I’m pretty sure there must be a readily available function for this, but sometimes quickly drafting a fix is faster than looking for an existing solution.

#average two rating columns of df_final, row by row;
#if one of the two ratings is missing, take the other one
average_two_columns <- function(i, j) {
  vec1 <- numeric(nrow(df_final))
  for (i.seq in seq_len(nrow(df_final))) {
    x <- df_final[i.seq, i]
    y <- df_final[i.seq, j]
    vec1[i.seq] <- if (is.na(x) && is.na(y)) NA
      else if (is.na(x)) y
      else if (is.na(y)) x
      else (x + y) / 2
  }
  return(vec1)
}
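As it happens, base R’s rowMeans() with na.rm = TRUE comes close to that readily available function; the only difference is that it returns NaN instead of NA when both ratings are missing. The first call below could then be written as:

#near-equivalent one-liner: na.rm drops the missing side before averaging
df$P0 <- rowMeans(df_final[,c(11,12)], na.rm = TRUE)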

I run it on the rating columns and add the other variables to this new data frame:

df <- df_final[,c(1,6)] #keep UIC and Works
df$Works <- as.numeric(as.character((df$Works)))
df$P0 <- average_two_columns(11,12)
df$P1 <- average_two_columns(9,10)
df$P2 <- average_two_columns(13,14)
df$P3 <- average_two_columns(15,16)
df$P4 <- average_two_columns(17,18)
df$P5 <- average_two_columns(23,24)
df$P6 <- average_two_columns(19,20)
df$P7 <- average_two_columns(21,22)
df$Piano <- df_final[,25] #the amenities columns
df$PowStat <- df_final[,26]
df$BabyFoot <- df_final[,27]
df$Distr <- df_final[,28]
df$People <- df_final[,31] #total people count
df$Clean <- df_final$Clean
str(df)
## 'data.frame':    153 obs. of  16 variables:
##  $ UIC     : Factor w/ 152 levels "111849","113001",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Works   : num  2 2 4 4 0 0 0 0 0 0 ...
##  $ P0      : num  7.64 8.16 7.8 8.63 9.03 8.84 8.62 8.66 8.17 7.92 ...
##  $ P1      : num  7.67 8.33 8.11 8.77 9.18 8.89 8.92 8.74 8.61 8.06 ...
##  $ P2      : num  7.98 8.39 8.02 8.59 8.92 ...
##  $ P3      : num  8.2 8.17 7.63 8.23 8.94 8.81 8.48 8.48 7.67 7.54 ...
##  $ P4      : num  7.08 7.32 6.98 7.95 8.28 ...
##  $ P5      : num  6.91 7.57 6.73 8.16 8.14 6.81 7.67 7.38 8.36 7.29 ...
##  $ P6      : num  7.15 7.35 7.64 8 8.57 ...
##  $ P7      : num  7 7.3 6.32 7.89 7.98 ...
##  $ Piano   : int  1 1 0 1 NA NA 1 1 0 NA ...
##  $ PowStat : int  2 3 2 3 NA NA 1 0 2 NA ...
##  $ BabyFoot: int  0 0 0 0 NA NA 0 1 0 NA ...
##  $ Distr   : int  1 2 0 0 NA NA 0 1 0 NA ...
##  $ People  : int  11767096 85588270 2292404 19070756 1556062 538454 8436729 2181454 2274728 10323435 ...
##  $ Clean   : num  91.9 95.9 90.3 92.4 94.2 ...

Finally, I scale the continuous variables and collapse the amenity counts into just three levels: 0, 1 and NA. This way, the presence of charging stations in a station is represented simply as ‘Yes’, ‘No’ or ‘We don’t know’, while the actual number of charging stations (if any are installed) is ignored.

dat <- df %>%
  mutate(across(c(People, Clean), ~ as.vector(scale(.)))) #mutate_each_() and funs() are now deprecated

#any positive count becomes 1
dat$Piano[dat$Piano > 0] <- 1
dat$PowStat[dat$PowStat > 0] <- 1
dat$BabyFoot[dat$BabyFoot > 0] <- 1
dat$Distr[dat$Distr > 0] <- 1
dat$Works <- as.integer(as.character(dat$Works))
dat$Works[dat$Works > 0] <- 1
dat$Works <- factor(ifelse(is.na(dat$Works), "NA", dat$Works), levels = c("0", "1", "NA"))
dat$Piano <- factor(ifelse(is.na(dat$Piano), "NA", dat$Piano), levels = c("0", "1", "NA"))
dat$PowStat <- factor(ifelse(is.na(dat$PowStat), "NA", dat$PowStat), levels = c("0", "1", "NA"))
dat$BabyFoot <- factor(ifelse(is.na(dat$BabyFoot), "NA", dat$BabyFoot), levels = c("0", "1", "NA"))
dat$Distr <- factor(ifelse(is.na(dat$Distr), "NA", dat$Distr), levels = c("0", "1", "NA"))

str(dat)
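Incidentally, the same recoding repeated four times invites a small helper; a quick sketch, with a hypothetical to_three_levels function:

#hypothetical helper: collapse a count column into a 0/1/NA factor
to_three_levels <- function(x) {
  x[x > 0] <- 1
  factor(ifelse(is.na(x), "NA", x), levels = c("0", "1", "NA"))
}
for (col in c("Piano", "PowStat", "BabyFoot", "Distr")) {
  dat[[col]] <- to_three_levels(dat[[col]])
}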

Last but not least, I add the global satisfaction rating to the data frame and save both data frames.

df$Global <- rowMeans(cbind(df$P1, df$P2, df$P3, df$P4, df$P5)) #global satisfaction rating

dat$Global <- df$Global

str(dat)
## 'data.frame':    153 obs. of  17 variables:
##  $ UIC     : Factor w/ 152 levels "111849","113001",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Works   : Factor w/ 3 levels "0","1","NA": 2 2 2 2 1 1 1 1 1 1 ...
##  $ P0      : num  7.64 8.16 7.8 8.63 9.03 8.84 8.62 8.66 8.17 7.92 ...
##  $ P1      : num  7.67 8.33 8.11 8.77 9.18 8.89 8.92 8.74 8.61 8.06 ...
##  $ P2      : num  7.98 8.39 8.02 8.59 8.92 ...
##  $ P3      : num  8.2 8.17 7.63 8.23 8.94 8.81 8.48 8.48 7.67 7.54 ...
##  $ P4      : num  7.08 7.32 6.98 7.95 8.28 ...
##  $ P5      : num  6.91 7.57 6.73 8.16 8.14 6.81 7.67 7.38 8.36 7.29 ...
##  $ P6      : num  7.15 7.35 7.64 8 8.57 ...
##  $ P7      : num  7 7.3 6.32 7.89 7.98 ...
##  $ Piano   : Factor w/ 3 levels "0","1","NA": 2 2 1 2 3 3 2 2 1 3 ...
##  $ PowStat : Factor w/ 3 levels "0","1","NA": 2 2 2 2 3 3 2 1 2 3 ...
##  $ BabyFoot: Factor w/ 3 levels "0","1","NA": 1 1 1 1 3 3 1 2 1 3 ...
##  $ Distr   : Factor w/ 3 levels "0","1","NA": 2 2 1 1 3 3 1 2 1 3 ...
##  $ People  : num  -0.0571 1.3647 -0.2396 0.0836 -0.2537 ...
##  $ Clean   : num  -0.916 0.4811 -1.4416 -0.7396 -0.0929 ...
##  $ Global  : num  7.57 7.96 7.49 8.34 8.69 ...
write.csv(df, "SNCF_data_mars_analysis_1.csv")
write.csv(dat, "SNCF_data_mars_analysis_2.csv") #standardized and factorized

This post, which serves as a prequel to the SNCF customer satisfaction post, is not dedicated purely to managers, due to its technical nature. Nevertheless, I think it is important to illustrate the ideas behind data manipulation and to highlight the importance of data traceability. It also helps explain the uneven progress of any Data Science project.

A common mistake managers make is to expect findings early on. However, it is often towards the end of the project that the patterns start to emerge.

Fitting a model and choosing the best one based on its performance is relatively easy for a Data Scientist; at the very least, it is a common task. Every dataset, on the contrary, is unique, and treating it is often time-consuming. Worse still, we often discover that more data needs to be acquired.

With the emergence of tools like Tableau, the data manipulation task can sometimes be accelerated. However, Tableau doesn't replace conventional Data Science tools like R and Python: if no patterns emerge in Tableau, the job has to be started all over again, at the cost of the initial time spent.

Recommendations derived from data using modern predictive analytics methods are only as good as the quality of the data and the skill of the expert manipulating it. Traceability is of utmost importance, and the time spent on initial data clean-up should not be neglected.
