SNCF Satisfaction Survey

Pianos in train stations don't improve satisfaction
Today I wanted to take advantage of a publicly available SNCF customer satisfaction survey to illustrate how Data Science can inform managerial decisions as simple as investing in station amenities. The data is courtesy of SNCF.
Exploratory Data Analysis
As data cleaning and preparation is a whole separate topic, I leave it out of the scope of this article. For curious minds, there's a dedicated article on the hidden part of the iceberg that a Data Scientist's job represents.
Here the objective is to look at the data and draw meaningful conclusions. To begin with, let’s have a glance at the scatterplot matrix generated using the ggpairs function of the GGally library:
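For reference, a minimal sketch of the call behind such a matrix (the exact column selection is an assumption):

# scatterplot matrix of satisfaction ratings and station attributes
library(GGally)
ggpairs(df[, c("P1", "P2", "P3", "P4", "P5", "People", "Global")])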
SNCF train station satisfaction scatterplot.
The curious thing here is that all satisfaction ratings P0 - P7 are highly correlated. This is a well-known satisfaction survey phenomenon: respondents are largely only able to express their general satisfaction, and this general satisfaction ‘halo’ colors their responses to all the survey questions. Notice how beautifully the individual ratings are normally distributed around general satisfaction.
It’s not straightforward to establish the relationship between survey answers and factor variables (like the presence of amenities) based on the plot alone, so we’ll get to this later.
The next thing to notice is that amenities like pianos, power stations, magazine vending machines and foosball tables are highly correlated with the number of people in the station. This obviously reflects the managerial choice to install them in big city stations rather than in small town stations.
Finally, notice that the distribution of the number of people looks very particular: crammed to the left with a long right tail. The number of people and customer satisfaction are negatively correlated.
library(ggplot2)

# histogram of station frequentation with a density overlay
ggplot(df, aes(x = People/1000000)) +
  geom_histogram(aes(y = ..density..), fill = "lightskyblue2",
                 color = "skyblue4", bins = 8) +
  geom_vline(aes(xintercept = mean(People/1000000)),
             color = "grey39", linetype = "dashed") +
  geom_density() +
  labs(title = "Number of people in the station",
       x = "People (million)", y = "Density") +
  theme_classic()
Having seen the density of the number of people at train stations, let us now have a look at its cumulative probability distribution.
What these plots mean in simple words is that there’s a very large number of train stations with few visitors, and very few huge train stations, like Gare de Lyon in Paris. Even though this makes perfect sense, it is not a great distribution to work with, as we would ideally like a normal distribution for most common statistical methods to work.
# empirical cumulative distribution of station frequentation
with(df, plot(sort(People/1000000),
              (1:length(People))/length(People),
              type = "s", ylim = c(0, 1), las = 1,
              xlab = "Stations Ordered by People Presence (million)",
              ylab = "Proportion of Stations with Lower Frequentation"))
For instance, the Normal Q-Q plot doesn’t look like it should, and none of the transformations common in marketing analytics (logarithmic, square, square root, etc.) were able to do the trick.
Normal Q-Q plot for Amount of People at the station.
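The check itself is short; a sketch trying one of the transformations mentioned:

# Q-Q plots of the raw and log-transformed variable against the normal
par(mfrow = c(1, 2))
qqnorm(df$People, main = "People")
qqline(df$People)
qqnorm(log(df$People), main = "log(People)")
qqline(log(df$People))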
By applying a Box-Cox transformation with the best λ determined using the car package, we arrive at something nicer to look at, but still very far from a normal distribution.
# ---------------------------------- Box-Cox Transform -------------------------
library(car)

# estimate the optimal lambda and transform the variable
powerTransform(df$People)
lambda <- coef(powerTransform(df$People))
df$People_BC <- bcPower(df$People, lambda)

par(mfrow = c(1, 2))
hist(df$People,
     xlab = "Visitors", ylab = "Count",
     main = "Original Distribution")
hist(df$People_BC,
     xlab = "Box-Cox Transform of Visitors", ylab = "Count",
     main = "Transformed Distribution")
To examine the correlations in a much more comprehensible way, let’s have a look at the correlation heat map:
# columns used for the correlation analysis
sel_vec <- c(2, 17, 11:16)
sat_vec <- c(4:8, 17, 11:16)

library(polycor)
library(corrplot)  # for correlation plot, install if needed

# heterogeneous correlations between numeric and factor variables
Cor <- hetcor(df[, sel_vec], ML = FALSE, std.err = TRUE,
              use = "complete.obs", bins = 4, pd = TRUE)
corrplot.mixed(corr = Cor$correlations, upper = "ellipse", tl.pos = "lt")
As a rule of thumb, in social sciences a correlation is considered significant if its absolute value is above 0.3 and rather strong when it’s over 0.5. From this plot it is painfully clear that global satisfaction (Global) increases slightly with cleanliness, but decreases when there are power stations (PowStat) or magazine vending machines (Distr). That would be an unexpected finding, were it not explained by the fact that the presence of these amenities is strongly correlated with large train stations and their huge crowds. Apparently these stations are unpleasant, as we see a strong negative correlation between global satisfaction and the number of people present.
Let’s have a look at the variable descriptions and see if they make sense:
P1 = Information Satisfaction (I dropped P0, "I can easily ask for information about my train", as it was collinear with P1)
P2 = Movements inside the train station
P3 = Cleanliness and safety
P4 = Time spent in the station (I also dropped P6, "I feel good here", and P7, "I like the ambiance", for the same reason, as they are components of P4)
P5 = Services and shops
The curious observation is that the cleanliness rating as assessed by SNCF hardly relates to the “cleanliness and safety” perception of the visitors. It’s hardly surprising that the Global rating is correlated with the individual satisfaction ratings, as it was built as their average, but it’s nevertheless amusing to see how the individual ratings all depend on each other, because of the overall satisfaction halo that we discussed earlier. The number of train station visitors has a strong negative impact on each of the ratings, except satisfaction with services and shops.
Preparing the model
Let us now prepare the model.
Firstly, we need to deal with the missing data. It is completely pointless to take into account the stations where the satisfaction survey responses aren’t present, so we’ll simply remove these entries. Note that I continue working with both dataframes, one with factors (dat) and one with actual values (df), in order to see on which one the model performs better.
# keep only the stations that have a Global satisfaction rating
keep <- !is.na(df$Global)
df <- df[keep, ]
dat <- dat[keep, ]
Next, I examine the rest of the missing values using the mice and VIM libraries and complete the entries by generating imputed data, using the 'cart' (classification and regression trees) method for df and 'pmm' (predictive mean matching) for dat (the code for only one of these manipulations is shown):
library(mice)
md.pattern(df)
md.pattern(dat)

library(VIM)
# visualize the missing-data pattern
mice_plot <- aggr(df, col = c('navyblue', 'yellow'),
                  numbers = TRUE, sortVars = TRUE,
                  labels = names(df), cex.axis = .7,
                  gap = 3, ylab = c("Missing data", "Pattern"))

# impute with classification and regression trees
imputed_Data1 <- mice(df, m = 5, maxit = 50, method = 'cart', seed = 500)
df <- complete(imputed_Data1, 1)
Let’s check what we’ve got:
summary(df)
## UIC Works P0 P1 P2
## 672006 : 2 Min. :0.000 Min. :5.700 Min. :6.640 Min. :6.930
## 111849 : 1 1st Qu.:0.000 1st Qu.:8.140 1st Qu.:8.290 1st Qu.:8.134
## 113001 : 1 Median :0.000 Median :8.330 Median :8.545 Median :8.420
## 118000 : 1 Mean :1.016 Mean :8.313 Mean :8.479 Mean :8.375
## 141002 : 1 3rd Qu.:2.000 3rd Qu.:8.570 3rd Qu.:8.740 3rd Qu.:8.690
## 142109 : 1 Max. :4.000 Max. :9.680 Max. :9.680 Max. :9.330
## (Other):119 NA's :2
## P3 P4 P5 P6
## Min. :6.450 Min. :5.660 Min. :5.910 Min. :6.060
## 1st Qu.:7.830 1st Qu.:7.305 1st Qu.:7.305 1st Qu.:7.475
## Median :8.168 Median :7.630 Median :7.650 Median :7.812
## Mean :8.146 Mean :7.588 Mean :7.597 Mean :7.782
## 3rd Qu.:8.475 3rd Qu.:7.945 3rd Qu.:7.947 3rd Qu.:8.158
## Max. :9.600 Max. :8.680 Max. :8.600 Max. :8.840
##
## P7 Piano PowStat BabyFoot
## Min. :5.270 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:7.058 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :7.450 Median :1.0000 Median :0.0000 Median :0.000
## Mean :7.396 Mean :0.7977 Mean :0.8876 Mean :0.191
## 3rd Qu.:7.765 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:0.000
## Max. :8.700 Max. :3.0000 Max. :7.0000 Max. :1.000
## NA's :37 NA's :37 NA's :37
## Distr People Clean Global
## Min. :0.0000 Min. : 538454 Min. : 82.37 Min. :6.402
## 1st Qu.:0.0000 1st Qu.: 1899648 1st Qu.: 93.01 1st Qu.:7.834
## Median :1.0000 Median : 3721942 Median : 95.05 Median :8.086
## Mean :0.6517 Mean : 17384725 Mean : 94.51 Mean :8.037
## 3rd Qu.:1.0000 3rd Qu.: 10189076 3rd Qu.: 96.78 3rd Qu.:8.264
## Max. :4.0000 Max. :527278558 Max. :100.00 Max. :8.898
## NA's :37 NA's :4
Note that there are no more ‘NA’ values in the df dataframe, but there still are some in the dat dataframe, as we encoded one of the factor levels as ‘NA’.
Note also that the data in df is not scaled yet; one advantage of linear regression is that it works with unscaled variables, unlike many other methods.
Before proceeding to splitting the data into test and training subsets, let’s make sure that it doesn’t contain outliers.
We can’t rely on software to flag the outliers for us; we have to use common sense. In our case, for instance, exceptionally high or low satisfaction ratings are most likely real, and so is the high number of visitors at the largest stations. I personally don’t see outliers in this dataset.
I then scale the continuous variables in df and split the data using the train-and-test regimen of the caret library, with 80% of the data for training. I do this for both df and dat and check whether there are any problems with the partitions. As this is a straightforward, commonly repeated manipulation, I only give a sketch of it below.
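A minimal sketch, where the list of continuous columns and the seed are assumptions:

library(caret)

# scale the continuous variables (column selection is an assumption)
num_cols <- c("Works", "PowStat", "Distr", "People", "Clean")
df[num_cols] <- scale(df[num_cols])

# 80/20 split, stratified on the outcome
set.seed(500)
idx <- createDataPartition(df$Global, p = 0.8, list = FALSE)
train <- df[idx, ]
test <- df[-idx, ]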
Model selection and training
I explicitly specify all variables in the regression model:

model1 <- Global ~ Works + Piano + PowStat + BabyFoot + Distr + People + Clean

# fit linear regression model with scaled and factorized variables
dat_fit <- lm(model1, data = train_dat)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 0.03607 | 0.3066 | 0.1176 | 0.9066 |
| Works1 | 0.1489 | 0.2227 | 0.6685 | 0.5055 |
| Piano1 | 0.127 | 0.2701 | 0.4703 | 0.6393 |
| PianoNA | -0.2117 | 0.3554 | -0.5956 | 0.5529 |
| PowStat1 | -0.05189 | 0.2648 | -0.196 | 0.8451 |
| BabyFoot1 | 0.012 | 0.2827 | 0.04244 | 0.9662 |
| Distr1 | -0.24 | 0.2514 | -0.9548 | 0.3422 |
| People | -0.3225 | 0.09604 | -3.358 | 0.001149 |
| Clean | 0.009469 | 0.1081 | 0.08758 | 0.9304 |

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 100 | 0.9811 | 0.1444 | 0.06923 |
# fit linear regression model with scaled and non-factor variables
df_fit <- lm(Global ~ Works + Piano + PowStat + BabyFoot + Distr + People + Clean, data = train)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 0.1654 | 0.2126 | 0.778 | 0.4385 |
| Works | 0.03032 | 0.06244 | 0.4855 | 0.6284 |
| Piano | 0.1008 | 0.1669 | 0.6038 | 0.5474 |
| PowStat | -0.08237 | 0.07653 | -1.076 | 0.2846 |
| BabyFoot | -0.03865 | 0.2143 | -0.1804 | 0.8573 |
| Distr | -0.2186 | 0.1228 | -1.779 | 0.07843 |
| People | -0.3256 | 0.159 | -2.048 | 0.04337 |
| Clean | -0.02493 | 0.09346 | -0.2667 | 0.7903 |

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 102 | 0.8458 | 0.1271 | 0.06208 |
It is fairly obvious that the model with continuous variables performs better in terms of certainty: the residual standard error is lower and the p-values are smaller. We can therefore choose to continue working with df, leaving dat aside.
The significant variables, at first glance, are People and Distr. We’ve already seen that these are all correlated, so the number of visitors alone might explain the satisfaction just as well.
Let’s check.
df_simple <- lm(Global ~ People, data = train)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 0.05026 | 0.08349 | 0.602 | 0.5486 |
| People | -0.3906 | 0.1299 | -3.006 | 0.003344 |

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 102 | 0.8405 | 0.08289 | 0.07372 |
This simple model performs remarkably well: its residual standard error is barely higher than that of the more complex model. Isn’t that actually surprising?
Let’s have a look side-by-side (both models predict Global; df_fit on the left, df_simple on the right):

| Predictors | Estimates | CI | p | Estimates | CI | p |
|---|---|---|---|---|---|---|
| (Intercept) | 0.17 | -0.26 – 0.59 | 0.439 | 0.05 | -0.12 – 0.22 | 0.549 |
| Works | 0.03 | -0.09 – 0.15 | 0.628 | | | |
| Piano | 0.10 | -0.23 – 0.43 | 0.547 | | | |
| PowStat | -0.08 | -0.23 – 0.07 | 0.285 | | | |
| BabyFoot | -0.04 | -0.46 – 0.39 | 0.857 | | | |
| Distr | -0.22 | -0.46 – 0.03 | 0.078 | | | |
| People | -0.33 | -0.64 – -0.01 | 0.043 | -0.39 | -0.65 – -0.13 | 0.003 |
| Clean | -0.02 | -0.21 – 0.16 | 0.790 | | | |
| Observations | 102 | | | 102 | | |
| R2 / R2 adjusted | 0.127 / 0.062 | | | 0.083 / 0.074 | | |
Let’s compare Akaike’s Information Criterion and the Bayesian Information Criterion for both models:
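Both criteria come straight from base R; a sketch of the calls producing the output below:

# information criteria for both fitted models
cat("For df_fit: AIC =", AIC(df_fit), ", BIC =", BIC(df_fit), "\n")
cat("For df_simple: AIC =", AIC(df_simple), ", BIC =", BIC(df_simple), "\n")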
## For df_fit: AIC = 264.9658 , BIC = 288.5905
## For df_simple: AIC = 258.0035 , BIC = 265.8784
The smaller the values, the better the fit. However, BIC penalizes model complexity more severely.
Based on Root Mean Squared Error I would still lean towards the slightly more complex model, even though the difference is not huge:
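A sketch of the computation, assuming in-sample RMSE derived from the fitted residuals:

# in-sample RMSE for both models
rmse <- function(fit) sqrt(mean(residuals(fit)^2))
cat("For df_fit: RMSE =", rmse(df_fit), "\n")
cat("For df_simple: RMSE =", rmse(df_simple), "\n")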
## For df_fit: RMSE = 0.8119456
## For df_simple: RMSE = 0.8322459
# let's see if we can build a better model with interactions
full_model <- Global ~ (Works + Piano + PowStat + BabyFoot + Distr + People + Clean)^2

# backward stepwise selection on the full interaction model
chosen_model_fit <- step(lm(full_model, data = train), direction = "backward")
chosen_model_formula <- formula(chosen_model_fit)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -0.63 | 0.3195 | -1.972 | 0.0519 |
| Works | 0.3478 | 0.1253 | 2.776 | 0.006769 |
| Piano | 0.4558 | 0.2083 | 2.188 | 0.03143 |
| PowStat | 0.04832 | 0.1319 | 0.3663 | 0.7151 |
| BabyFoot | -0.5436 | 0.3 | -1.812 | 0.07356 |
| Distr | -0.1302 | 0.1433 | -0.9081 | 0.3664 |
| People | -3.467 | 0.981 | -3.534 | 0.0006652 |
| Clean | -0.1231 | 0.1106 | -1.112 | 0.2691 |
| Works:PowStat | -0.1373 | 0.05398 | -2.544 | 0.01275 |
| Works:Distr | -0.2525 | 0.1159 | -2.179 | 0.03207 |
| Works:People | 0.4061 | 0.1846 | 2.201 | 0.03048 |
| Works:Clean | 0.1419 | 0.06603 | 2.15 | 0.03443 |
| Piano:People | 1.63 | 0.7627 | 2.137 | 0.03543 |
| PowStat:BabyFoot | 0.393 | 0.2183 | 1.8 | 0.07537 |
| PowStat:Distr | 0.3186 | 0.1484 | 2.146 | 0.0347 |
| BabyFoot:People | -1.082 | 0.7588 | -1.426 | 0.1575 |
| BabyFoot:Clean | -0.3835 | 0.2434 | -1.576 | 0.1188 |

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 102 | 0.7925 | 0.3071 | 0.1766 |
Based on an anova comparison of the two models, there is a significant difference between them:
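The comparison itself is a one-liner on the nested fits:

# chi-squared comparison of the nested models
anova(df_fit, chosen_model_fit, test = "Chisq")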
| Res.Df | RSS | Df | Sum of Sq | Pr(>Chi) |
|---|---|---|---|---|
| 94 | 67.24 | NA | NA | NA |
| 85 | 53.38 | 9 | 13.86 | 0.00864 |
Checking assumptions of linear regression
Let us first have a look at residuals for three models:
# residual histograms for the three models side by side
par(mfrow = c(1, 3), mar = c(4, 4, 2, 2))
hist(df_fit$residuals,
     main = "df_fit model", xlab = "Residuals")
hist(df_simple$residuals,
     main = "df_simple model", xlab = "Residuals")
hist(chosen_model_fit$residuals,
     main = "chosen_model_fit model", xlab = "Residuals")
All three are roughly normally distributed, as they should be if the fit is good. Let’s have a look at Q-Q plots for all three:
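A sketch of the corresponding checks:

# Q-Q plots of the residuals for the three models
par(mfrow = c(1, 3))
qqnorm(df_fit$residuals, main = "df_fit model")
qqline(df_fit$residuals)
qqnorm(df_simple$residuals, main = "df_simple model")
qqline(df_simple$residuals)
qqnorm(chosen_model_fit$residuals, main = "chosen_model_fit model")
qqline(chosen_model_fit$residuals)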
We should also check the model for multicollinearity across explanatory variables and ensure that all values are low:
print(vif(df_fit))
## Works Piano PowStat BabyFoot Distr People Clean
## 1.293768 1.143916 1.230986 1.031874 1.188111 1.478836 1.080747
This check obviously won’t make sense for chosen_model_fit, as interaction terms are necessarily correlated with their main effects and will have much higher VIF values.
| | df_simple | df_fit | chosen_model |
|---|---|---|---|
| Train RSQ | 0.074 | 0.062 | 0.177 |
| Test RSQ | 0.182 | 0.179 | 0.111 |
| Pred Error | 0.684 | 0.699 | 0.761 |
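For reference, a sketch of how such a comparison might be computed, assuming 'Test RSQ' is the squared correlation between predictions and actuals and 'Pred Error' is the out-of-sample RMSE (both definitions are assumptions):

# out-of-sample metrics for the three models (metric definitions assumed)
test_metrics <- function(fit) {
  pred <- predict(fit, newdata = test)
  c(TestRSQ = cor(pred, test$Global)^2,
    PredError = sqrt(mean((pred - test$Global)^2)))
}
sapply(list(df_simple = df_simple, df_fit = df_fit,
            chosen_model = chosen_model_fit), test_metrics)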
To summarize our findings: customer satisfaction in SNCF train stations can be easily and quite accurately predicted based on the number of train station visitors (or, simply put, train station size). The three models don’t exhibit an extraordinary difference in prediction quality, and variables such as the presence or absence of a piano or a vending machine don’t play an important role.
Consequently, if the objective behind installing these facilities was improving customer satisfaction (and not, say, improving brand image or attracting more travelers), then the conclusion is that they are not serving their purpose, and this is an investment that could be spared.
Incidentally, the cleanliness rating of a train station doesn’t have a significant impact on customer satisfaction either, so investing in additional personnel dedicated to cleaning is not something I would recommend.
This study of SNCF customer satisfaction shows quite clearly that efforts to improve station amenities do not lead to improved satisfaction of travelers and other station visitors.
3 Comments
Well explained! Do you reckon it was the purpose of this investment though?
Thanks, Jane! This is a good question. Since I only looked through publicly available data and didn't perform this study as their consultant, we can only guess. However, I don't think it was a primary reason. My guess is that they rather do it for improving the brand image. Still, it would be good for them to know that there's no impact on satisfaction.
Thanks for sharing this. Do you personally rely on AIC or BIC when selecting which model to use?