SNCF Satisfaction Survey

7 Jan

Pianos in the train stations don't improve satisfaction


Today I wanted to take advantage of the publicly available SNCF customer satisfaction survey to illustrate how Data Science can inform managerial decisions as simple as whether to invest in basic station amenities. The data is courtesy of SNCF.

Exploratory Data Analysis

As data cleaning and preparation is a whole separate topic, I leave it out of the scope of this article. For curious minds, there is a dedicated article about that hidden part of the iceberg which a Data Scientist's job represents.

Here the objective is to look at the data and draw meaningful conclusions. To begin with, let's have a glance at the scatterplot matrix generated with the ggpairs function of the GGally library.
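The call itself is a one-liner. A minimal sketch, assuming the cleaned data frame is named df and picking a handful of columns for legibility (the column selection here is illustrative, not the original call):

library(GGally)

# Scatterplot matrix of a few key variables
ggpairs(df, columns = c("Global", "People", "Clean", "Piano", "PowStat"))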

SNCF train station satisfaction scatterplot.

The curious thing here is that all satisfaction ratings P0 - P7 are highly correlated. This is a well-known satisfaction survey phenomenon: the client is really only capable of reporting their general satisfaction, and this general satisfaction 'halo' colors the responses to all the individual survey questions. Notice their beautiful normal distribution around general satisfaction.

It’s not straightforward to establish the relationship between survey answers and factor variables (like the presence of amenities) based on the plot alone, so we’ll get to this later.

The next thing to notice is that amenities like pianos, power stations, magazine vending machines and foosball tables are highly correlated with the number of people in the station. This obviously reflects the managerial choice to install them in big city stations rather than in small town stations.

Finally, notice that the distribution of the number of people looks very particular: all crammed to the left, with a long tail. The number of people and customer satisfaction are negatively correlated.

library(ggplot2)

# Histogram on the density scale, with a kernel density overlay
# and a dashed line at the mean
ggplot(df, aes(x=People/1000000)) +
  geom_histogram(aes(y=..density..), fill="lightskyblue2", color="skyblue4", bins = 8)+
  geom_vline(aes(xintercept=mean(People/1000000)), color="grey39",
             linetype="dashed")+
  geom_density()+
  labs(title="Number of people in the station", x="People (million)", y = "Density")+
  theme_classic()

Let us now have a look at the density function for the number of people at train stations, and at its cumulative distribution.
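A sketch of how the density panel can be drawn (the published figure may have used different styling):

# Kernel density estimate of station frequentation
plot(density(df$People/1000000), las = 1,
     main = "Density of People at the Station",
     xlab = "People (million)")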

Density distribution for the number of people at the station.

Cumulative distribution for the number of people at the station.

What these plots mean in simple words is that there is a very large number of train stations with few visitors, and only a handful of huge train stations, like Gare de Lyon in Paris. Even though this makes perfect sense, it is not a great distribution to work with, as ideally we would like a normal distribution for most common statistical methods to work.

# Empirical cumulative distribution of station frequentation
with(df, plot(sort(People/1000000),
              (1:length(People))/length(People),
              type="s", ylim=c(0,1), las = 1,
              xlab="Stations Ordered by People Presence (million)",
              ylab="Proportion of Stations with Lower Frequentation"))

For instance, the Normal Q-Q plot doesn't look like it should, and none of the transformations commonly used in marketing analytics (logarithmic, square, square root, etc.) were able to do the trick.
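A minimal sketch of these checks, with the log transform standing in for the other transformations:

# Q-Q plot of the raw variable against the normal distribution
qqnorm(df$People, main = "Normal Q-Q Plot: People")
qqline(df$People)

# The same check after a log transform; the other transformations are analogous
qqnorm(log(df$People), main = "Normal Q-Q Plot: log(People)")
qqline(log(df$People))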

Normal Q-Q plot for the number of people at the station.

By applying a Box-Cox transformation, with the best λ determined using the car package, we arrive at something nicer to look at, but still very far from a normal distribution.

# ---------------------------------- Box-Cox Transform -------------------------
library(car)

# Estimate the optimal lambda and apply the transformation
powerTransform(df$People)
lambda <- coef(powerTransform(df$People))
df$People_BC <- bcPower(df$People, lambda)

par(mfrow=c(1,2))
hist(df$People,
     xlab="Visitors", ylab="Count",
     main="Original Distribution")
hist(df$People_BC,
     xlab="Box-Cox Transform of Visitors", ylab="Count",
     main="Transformed Distribution")

To examine the correlations in a much more comprehensible way, let’s have a look at the correlation heat map:

# Column index vectors for the variable subsets used below
sel_vec <- c(2, 17, 11:16)
sat_vec <- c(4:8, 17, 11:16)

library(polycor)  # hetcor: correlations for mixed numeric/factor data
library(corrplot) # for the correlation plot, install if needed
Cor <- hetcor(df[,sel_vec], ML = FALSE, std.err = TRUE, use="complete.obs", bins=4, pd=TRUE)

corrplot.mixed(corr=Cor$correlations, upper = "ellipse", tl.pos ="lt")

As a rule of thumb, in social sciences a correlation is considered meaningful if its absolute value is above 0.3 and rather strong when it's over 0.5. From this plot it is abundantly clear that global satisfaction (Global) increases slightly with cleanliness, but decreases when there are power stations (PowStat) or magazine vending machines (Distr). On its own, that would be an unexpected finding. It is explained by the fact that the presence of these amenities is strongly correlated with large train stations and huge numbers of people. Apparently these stations are unpleasant, as we see a strong negative correlation between global satisfaction and the number of people present.

Let's have a look at the variable explanations and see if they make sense:

P1 = Information Satisfaction
(I dropped P0, "I can easily ask for information about my train", as it was collinear to P1.)

P2 = Movements inside the train station

P3 = Cleanliness and safety

P4 = Time spent in the station
(I also dropped P6, "I feel good here", and P7, "I like the ambiance", for the same reason, as they are part of P4.)

P5 = Services and shops

A curious observation is that the cleanliness rating as assessed by SNCF hardly relates to the "cleanliness and safety" perception of the visitors. It's hardly surprising that the Global rating is correlated with the individual satisfaction ratings, as it was built as their average, but it's nevertheless amusing to see how the individual ratings all depend on each other, because of the overall satisfaction halo that we discussed earlier. The number of train station visitors has a strong negative impact on each of the ratings, except on the satisfaction with services and shops.

Preparing the model

Let us now prepare the model.

Firstly, we need to deal with the missing data. It is completely pointless to take into account the stations where the satisfaction survey responses aren't present, so we'll simply remove these entries. Note that I continue working with both dataframes, one with factors (dat) and another one with actual values (df), in order to see on which one the model performs better.

# Keep only the stations that have a Global satisfaction response
keep <- !is.na(df$Global)
df <- df[keep,]
dat <- dat[keep,]

Next, I examine the rest of the missing values using the mice and VIM libraries, and complete the entries by generating imputed data using the 'cart' (classification and regression trees) method for df and 'pmm' (predictive mean matching) for dat (the code for only one of these manipulations is shown):

library(mice)
md.pattern(df)  # inspect the missing-data patterns
md.pattern(dat)

library(VIM)
# Visualize the proportion and pattern of missing values
mice_plot <- aggr(df, col=c('navyblue','yellow'),
                  numbers=TRUE, sortVars=TRUE,
                  labels=names(df), cex.axis=.7,
                  gap=3, ylab=c("Missing data","Pattern"))

# Impute with classification and regression trees; keep the first completed set
imputed_Data1 <- mice(df, m=5, maxit = 50, method = 'cart', seed = 500)

df <- complete(imputed_Data1,1)

Let’s check what we’ve got:

summary(df)
##       UIC          Works             P0              P1              P2
##  672006 :  2   Min.   :0.000   Min.   :5.700   Min.   :6.640   Min.   :6.930
##  111849 :  1   1st Qu.:0.000   1st Qu.:8.140   1st Qu.:8.290   1st Qu.:8.134
##  113001 :  1   Median :0.000   Median :8.330   Median :8.545   Median :8.420
##  118000 :  1   Mean   :1.016   Mean   :8.313   Mean   :8.479   Mean   :8.375
##  141002 :  1   3rd Qu.:2.000   3rd Qu.:8.570   3rd Qu.:8.740   3rd Qu.:8.690
##  142109 :  1   Max.   :4.000   Max.   :9.680   Max.   :9.680   Max.   :9.330
##  (Other):119   NA's   :2
##        P3              P4              P5              P6
##  Min.   :6.450   Min.   :5.660   Min.   :5.910   Min.   :6.060
##  1st Qu.:7.830   1st Qu.:7.305   1st Qu.:7.305   1st Qu.:7.475
##  Median :8.168   Median :7.630   Median :7.650   Median :7.812
##  Mean   :8.146   Mean   :7.588   Mean   :7.597   Mean   :7.782
##  3rd Qu.:8.475   3rd Qu.:7.945   3rd Qu.:7.947   3rd Qu.:8.158
##  Max.   :9.600   Max.   :8.680   Max.   :8.600   Max.   :8.840
##
##        P7            Piano           PowStat          BabyFoot
##  Min.   :5.270   Min.   :0.0000   Min.   :0.0000   Min.   :0.000
##  1st Qu.:7.058   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.000
##  Median :7.450   Median :1.0000   Median :0.0000   Median :0.000
##  Mean   :7.396   Mean   :0.7977   Mean   :0.8876   Mean   :0.191
##  3rd Qu.:7.765   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:0.000
##  Max.   :8.700   Max.   :3.0000   Max.   :7.0000   Max.   :1.000
##                  NA's   :37       NA's   :37       NA's   :37
##      Distr            People              Clean            Global
##  Min.   :0.0000   Min.   :   538454   Min.   : 82.37   Min.   :6.402
##  1st Qu.:0.0000   1st Qu.:  1899648   1st Qu.: 93.01   1st Qu.:7.834
##  Median :1.0000   Median :  3721942   Median : 95.05   Median :8.086
##  Mean   :0.6517   Mean   : 17384725   Mean   : 94.51   Mean   :8.037
##  3rd Qu.:1.0000   3rd Qu.: 10189076   3rd Qu.: 96.78   3rd Qu.:8.264
##  Max.   :4.0000   Max.   :527278558   Max.   :100.00   Max.   :8.898
##  NA's   :37                           NA's   :4

Note that there are no more 'NA' values in the df dataframe, but there still are some in the dat dataframe, as we encoded one of the factor levels as 'NA'.

Note also that the data in df is not scaled yet; an advantage of linear regression is that it works with unscaled variables, unlike many other methods.

Before proceeding to splitting the data into test and training subsets, let's make sure that it doesn't contain outliers.
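A quick visual screen might look like this (a sketch; boxplots flag points beyond 1.5 × IQR, which is exactly the kind of automatic highlighting discussed below):

# Boxplots as a first-pass outlier screen
par(mfrow = c(1, 2))
boxplot(df$Global, main = "Global satisfaction", las = 1)
boxplot(df$People/1000000, main = "People (million)", las = 1)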

Still, we can't blindly rely on the software flagging data points as outliers; we have to use common sense. In our case, for instance, exceptionally high or low satisfaction ratings are most likely real, and so are the huge visitor numbers of the largest stations. I personally don't see outliers in this dataset.

I then scale the continuous variables in df and split the data using the train-and-test regimen of the caret library, with 80% of the data used for training. I do this for both df and dat and check whether there are any problems with the partitions. As this is a straightforward, commonly repeated manipulation, I leave the original code out, but a sketch follows.
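A sketch of what this step might look like (the column selection and seed are assumptions, not the original code):

library(caret)

# Scale the continuous variables (column names assumed for illustration)
num_cols <- c("P1", "P2", "P3", "P4", "P5", "People", "Clean", "Global")
df[num_cols] <- scale(df[num_cols])

# 80/20 train-test split, stratified on the outcome
set.seed(500)
in_train <- createDataPartition(df$Global, p = 0.8, list = FALSE)
train <- df[in_train, ]
test  <- df[-in_train, ]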

Model selection and training

I explicitly specify all variables in the regression model:

model1 <- Global ~ Works + Piano + PowStat + BabyFoot + Distr + People + Clean

# fit linear regression model with scaled and factorized variables
dat_fit <- lm(model1, data = train_dat)
              Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)    0.03607     0.3066    0.1176    0.9066
Works1         0.1489      0.2227    0.6685    0.5055
Piano1         0.127       0.2701    0.4703    0.6393
PianoNA       -0.2117      0.3554   -0.5956    0.5529
PowStat1      -0.05189     0.2648   -0.196     0.8451
BabyFoot1      0.012       0.2827    0.04244   0.9662
Distr1        -0.24        0.2514   -0.9548    0.3422
People        -0.3225      0.09604  -3.358     0.001149
Clean          0.009469    0.1081    0.08758   0.9304

Fitting linear model: Global ~ Works + Piano + PowStat + BabyFoot + Distr + People + Clean
Observations: 100    Residual Std. Error: 0.9811    R²: 0.1444    Adjusted R²: 0.06923
# fit linear regression model with scaled and non-factor variables
df_fit <- lm(model1, data = train)

              Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)    0.1654      0.2126    0.778     0.4385
Works          0.03032     0.06244   0.4855    0.6284
Piano          0.1008      0.1669    0.6038    0.5474
PowStat       -0.08237     0.07653  -1.076     0.2846
BabyFoot      -0.03865     0.2143   -0.1804    0.8573
Distr         -0.2186      0.1228   -1.779     0.07843
People        -0.3256      0.159    -2.048     0.04337
Clean         -0.02493     0.09346  -0.2667    0.7903

Fitting linear model: Global ~ Works + Piano + PowStat + BabyFoot + Distr + People + Clean
Observations: 102    Residual Std. Error: 0.8458    R²: 0.1271    Adjusted R²: 0.06208

It is fairly obvious that the model with continuous variables performs better: its residual standard error is noticeably lower. We can therefore choose to continue working with df, leaving dat aside.

At first glance, the significant (or nearly significant) variables are People and Distr. We have already seen that all of these are correlated, so the number of visitors alone might explain the satisfaction just as well.

Let’s check.

df_simple <- lm(Global ~ People, data = train)
              Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)    0.05026     0.08349   0.602     0.5486
People        -0.3906      0.1299   -3.006     0.003344

Fitting linear model: Global ~ People
Observations: 102    Residual Std. Error: 0.8405    R²: 0.08289    Adjusted R²: 0.07372

This simple model performs remarkably well: its residual standard error is barely higher than that of the more complex model. Isn't that actually surprising?

Let's have a look at the two models side by side:

                        df_fit                        df_simple
Predictors    Estimates  CI             p      Estimates  CI             p
(Intercept)    0.17      -0.26 – 0.59   0.439   0.05      -0.12 – 0.22   0.549
Works          0.03      -0.09 – 0.15   0.628
Piano          0.10      -0.23 – 0.43   0.547
PowStat       -0.08      -0.23 – 0.07   0.285
BabyFoot      -0.04      -0.46 – 0.39   0.857
Distr         -0.22      -0.46 – 0.03   0.078
People        -0.33      -0.64 – -0.01  0.043  -0.39      -0.65 – -0.13  0.003
Clean         -0.02      -0.21 – 0.16   0.790
Observations   102                              102
R² / adj. R²   0.127 / 0.062                    0.083 / 0.074

Let's compare Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) for both models.
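These are presumably just the built-in functions:

# Information criteria for both fitted models
AIC(df_fit)
BIC(df_fit)
AIC(df_simple)
BIC(df_simple)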

## For df_fit:    AIC =  264.9658 , BIC =  288.5905
## For df_simple: AIC =  258.0035 , BIC =  265.8784

The smaller the values, the better the fit. Note that BIC punishes model complexity more severely, which is why it favors the simple model by a wider margin.

Based on the Root Mean Squared Error on the test set, I would still lean towards the slightly more complex model, even though the difference is not huge:

## For df_fit:    RMSE =  0.8119456
## For df_simple: RMSE =  0.8322459
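These values can be computed along these lines (a sketch; test is the held-out partition from the earlier split):

# RMSE of each model on the held-out test partition
rmse <- function(model, data) {
  sqrt(mean((data$Global - predict(model, newdata = data))^2))
}
rmse(df_fit, test)
rmse(df_simple, test)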
Let's see if we can build a better model with interactions.

# Start from all two-way interactions and prune with backward selection
full_model <- Global ~ (Works + Piano + PowStat + BabyFoot + Distr + People + Clean)^2
chosen_model_fit <- step(lm(full_model, data = train), direction = "backward")
chosen_model_formula <- formula(chosen_model_fit)
                   Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)        -0.63        0.3195   -1.972     0.0519
Works               0.3478      0.1253    2.776     0.006769
Piano               0.4558      0.2083    2.188     0.03143
PowStat             0.04832     0.1319    0.3663    0.7151
BabyFoot           -0.5436      0.3      -1.812     0.07356
Distr              -0.1302      0.1433   -0.9081    0.3664
People             -3.467       0.981    -3.534     0.0006652
Clean              -0.1231      0.1106   -1.112     0.2691
Works:PowStat      -0.1373      0.05398  -2.544     0.01275
Works:Distr        -0.2525      0.1159   -2.179     0.03207
Works:People        0.4061      0.1846    2.201     0.03048
Works:Clean         0.1419      0.06603   2.15      0.03443
Piano:People        1.63        0.7627    2.137     0.03543
PowStat:BabyFoot    0.393       0.2183    1.8       0.07537
PowStat:Distr       0.3186      0.1484    2.146     0.0347
BabyFoot:People    -1.082       0.7588   -1.426     0.1575
BabyFoot:Clean     -0.3835      0.2434   -1.576     0.1188

Fitting linear model: Global ~ Works + Piano + PowStat + BabyFoot + Distr + People + Clean + Works:PowStat + Works:Distr + Works:People + Works:Clean + Piano:People + PowStat:BabyFoot + PowStat:Distr + BabyFoot:People + BabyFoot:Clean
Observations: 102    Residual Std. Error: 0.7925    R²: 0.3071    Adjusted R²: 0.1766

Based on an ANOVA comparison of the two models, there is a significant difference between them.
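The table below can be reproduced along these lines (a sketch):

# Chi-squared comparison of the nested models
anova(df_fit, chosen_model_fit, test = "Chisq")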

Analysis of Variance Table

   Res.Df    RSS  Df  Sum of Sq  Pr(>Chi)
1      94  67.24  NA         NA        NA
2      85  53.38   9      13.86   0.00864

Checking assumptions of linear regression

Let us first have a look at the residuals of the three models:

# Histograms of the residuals, side by side
par(mfrow=c(1,3), mar=c(4, 4, 2, 2))
hist(df_fit$residuals,
     main = "df_fit model", xlab = "Residuals")
hist(df_simple$residuals,
     main = "df_simple model", xlab = "Residuals")
hist(chosen_model_fit$residuals,
     main = "chosen_model_fit model", xlab = "Residuals")

All three look approximately normally distributed, as they should if the fit is good. Let's have a look at the Q-Q plots for all three.
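A sketch of how they can be drawn with base graphics:

# Q-Q plots of the residuals for all three models
par(mfrow = c(1, 3))
qqnorm(residuals(df_simple), main = "df_simple")
qqline(residuals(df_simple))
qqnorm(residuals(df_fit), main = "df_fit")
qqline(residuals(df_fit))
qqnorm(residuals(chosen_model_fit), main = "chosen_model_fit")
qqline(residuals(chosen_model_fit))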

Model quality plots for the df_simple, df_fit and chosen_model_fit models.

We should also check the model for multicollinearity across explanatory variables and ensure that all values are low:

# Variance inflation factors (from the car package)
print(vif(df_fit))
##    Works    Piano  PowStat BabyFoot    Distr   People    Clean
## 1.293768 1.143916 1.230986 1.031874 1.188111 1.478836 1.080747

This check obviously won't make sense for chosen_model_fit, as interaction terms are correlated with their main effects and will show much higher VIF values.

Residuals plots for the df_simple, df_fit and chosen_model_fit models.

             df_simple   df_fit   chosen_model
Train RSQ    0.074       0.062    0.177
Test RSQ     0.182       0.179    0.111
Pred Error   0.684       0.699    0.761
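One way such out-of-sample numbers can be computed (a sketch; test is the held-out partition, and the exact metric definitions are assumptions):

# Test-set R-squared and prediction error for each model
test_metrics <- function(model, data) {
  pred <- predict(model, newdata = data)
  c(rsq = cor(data$Global, pred)^2,
    pred_error = sqrt(mean((data$Global - pred)^2)))
}
sapply(list(df_simple = df_simple, df_fit = df_fit,
            chosen_model = chosen_model_fit), test_metrics, data = test)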

To summarize our findings: customer satisfaction in SNCF train stations can be predicted reasonably well from the number of train station visitors alone (or, simply put, from train station size). The three models don't exhibit an extraordinary difference in prediction quality, and variables such as the presence or absence of a piano or a vending machine don't play an important role.

Consequently, if the objective behind installing these facilities was to improve customer satisfaction (and not, say, to improve brand image or attract more travelers), then the conclusion is that they are not serving that purpose, and the investment could be spared.

Incidentally, the cleanliness rating of the train station doesn't have a significant impact on customer satisfaction either, so investing in additional cleaning personnel is not something I would recommend.

This study of SNCF customer satisfaction shows explicitly that efforts to improve station amenities do not lead to improved satisfaction of travelers and other people visiting the train stations.

3 Comments
  • Jane Lovell, January 8

    Well explained! Do you reckon it was the purpose of this investment though?

    • Ekaterina Kinsht, January 8

      Thanks, Jane! This is a good question. Since I only looked through publicly available data and didn't perform this study as their consultant, we can only guess. However, I don't think it was a primary reason. My guess is that they rather do it for improving the brand image. Still, it would be good for them to know that there's no impact on satisfaction.

  • Christian Wichne, January 9

    Thanks for sharing this. Do you personally rely on AIC or BIC when selecting which model to use?