
An outlier is a data point that is extreme, lying far from or differing markedly from the rest of the data points. Outliers typically arise from erroneous entries caused by human or system error, an erratic production process, malfunctioning machinery, and so on. The presence of outliers can indicate irregularities in the system or process. In large datasets, outliers are expected and quite common. A box plot is a popular tool for finding outliers in a dataset.
A box plot is a five-point plot showing the first quartile (Q1), the second quartile or median (Q2), the third quartile (Q3), the upper whisker limit (Q3 + 1.5 × IQR) and the lower whisker limit (Q1 - 1.5 × IQR), where IQR (interquartile range) is Q3 - Q1. Any data point outside the whisker limits is treated as an outlier.
Boxplot without outlier
Boxplot with outlier
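The whisker limits described above can be computed directly from the quartiles. The following is a minimal sketch using only the Python standard library; the function names (whisker_limits, find_outliers) and the sample data are made up for illustration, and the "inclusive" quartile method is an assumption, since different tools compute quartiles slightly differently.

```python
from statistics import quantiles


def whisker_limits(values, k=1.5):
    """Return (lower, upper) whisker limits: Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    # The "inclusive" method is an assumed choice of quartile convention.
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the whisker limits."""
    lower, upper = whisker_limits(values, k)
    return [v for v in values if v < lower or v > upper]


# Illustrative data: one value (95) sits far from the rest.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

print(whisker_limits(data))  # (9.375, 24.375)
print(find_outliers(data))   # [95]
```

Here Q1 = 15 and Q3 = 18.75, so IQR = 3.75 and the limits are 15 - 5.625 = 9.375 and 18.75 + 5.625 = 24.375. Only the value 95 falls outside them.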
If outliers make up a significant share of a dataset, either overall or in a particular feature, we should treat them before model building, as most statistical measures (the mean and standard deviation, for example) are sensitive to these extreme values. For building regression models and forecasts, outlier treatment is a must. For classification, outliers generally have less impact, as those models are built on the principle of similarity or distance.
Outlier treatment means capping the whole dataset within the whisker limits (the capping limits may be set differently according to business needs). It is done by replacing values above the upper whisker limit with the upper whisker limit, and values below the lower whisker limit with the lower whisker limit.
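The capping step can be sketched as follows, again with only the Python standard library. The function name cap_outliers, the multiplier parameter k and the sample data are hypothetical, and the "inclusive" quartile method is an assumption; in practice the limits might come from business rules rather than the 1.5 × IQR rule.

```python
from statistics import quantiles


def cap_outliers(values, k=1.5):
    """Replace values beyond the whisker limits with the limits themselves."""
    # Q1 and Q3 from the assumed "inclusive" quartile convention.
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Clamp each value into the range [lower, upper].
    return [min(max(v, lower), upper) for v in values]


# Illustrative data: the extreme value 95 exceeds the upper whisker limit.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

print(cap_outliers(data))
# [12, 14, 15, 15, 16, 17, 18, 19, 20, 24.375]
```

Values already inside the limits are left unchanged; only the extreme value 95 is pulled back to the upper whisker limit of 24.375.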

Image courtesy: https://help.ezbiocloud.net, https://justinsighting.com
