Hypothesis testing is standard practice in the corporate world for evaluating a proposal or the workability of a design. Whether the task is changing the user interface of a mobile application or checking a model used to diagnose a patient for psychotherapy, the cheapest and most accessible non-human decision maker is the flip of a coin. With all the confounding variables associated with real-world problems, though, relying on a coin flip can make things go for a toss, literally.

Clients and stakeholders may or may not understand the intricacies involved in the model. They don’t care about the type of activation function used or the optimisation technique followed. It always comes down to one question: how does the model perform in the worst-case scenario? This is where the Confidence Interval (CI) estimate comes into the picture.

A CI has two components: a range and a probability. The range gives the lower and upper limits on the skill that can be expected of the model. The probability, or confidence level, expresses how reliably intervals built this way capture the model's true skill.

The half-width of the CI is often referred to as the margin of error, and the CI may be used to depict the uncertainty of an estimate graphically through the use of error bars.

### For Classification Accuracy In Machine Learning

A machine learning algorithm is well understood by the data scientists and engineers who develop it, but when the product needs to be pitched, the only thing that counts is its performance. So, a metric to gauge the performance of a model is necessary.

Classification accuracy is used to assess the efficacy of a classification algorithm. Reporting the classification accuracy of the model alone, however, is not best practice.

`Classification Accuracy = correct predictions/ total predictions`

It is common to use classification accuracy or classification error (the complement of accuracy, i.e. 1 - accuracy) to describe the skill of a classification predictive model. For example, a model that makes correct predictions of the class outcome variable 75% of the time has a classification accuracy of 75%, calculated as:

`accuracy = total correct predictions / total predictions made * 100`
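
As a minimal sketch of the formula above, the snippet below computes accuracy and error from a list of labels. The function name `classification_accuracy` and the sample labels are hypothetical, chosen only so that the result matches the 75% example.

```python
def classification_accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one prediction is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 6 of the 8 predictions are correct.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = classification_accuracy(y_true, y_pred)
error = 1 - accuracy  # classification error is the complement of accuracy
print(f"accuracy = {accuracy * 100:.0f}%, error = {error * 100:.0f}%")
```

The function returns a fraction between 0 and 1; multiplying by 100 gives the percentage form used in the formula.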

Classification accuracy or classification error is a proportion or a ratio. It describes the proportion of correct or incorrect predictions made by the model. Each prediction is a binary decision that could be correct or incorrect. Technically, this is called a Bernoulli trial, named for Jacob Bernoulli. The number of correct predictions across n independent Bernoulli trials follows a specific distribution called a binomial distribution.

For a reasonably large sample, the distribution of the proportion (i.e. the classification accuracy or error) is well approximated by a Gaussian distribution, and we can use that assumption to easily calculate the confidence interval.

In the case of classification error, the radius of the interval can be calculated as:

`interval = z * sqrt( (error * (1 - error)) / n)`

In the case of classification accuracy, the radius of the interval can be calculated as:

`interval = z * sqrt( (accuracy * (1 - accuracy)) / n)`

Where interval is the radius of the confidence interval, error and accuracy are classification error and classification accuracy respectively, n is the size of the sample, sqrt is the square root function, and z is a critical value from the Gaussian distribution. Technically, this is called the Binomial proportion confidence interval.
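
The interval formula above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: the function name `proportion_confidence_interval` is hypothetical, the critical value `z` is looked up with the standard library's `statistics.NormalDist`, and the accuracy of 0.75 on a sample of 100 is made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(proportion, n, confidence=0.95):
    """Gaussian-approximation interval for an accuracy or error proportion.

    Returns (lower, upper, radius), with the bounds clipped to [0, 1].
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    radius = z * sqrt(proportion * (1 - proportion) / n)
    return max(0.0, proportion - radius), min(1.0, proportion + radius), radius


# Hypothetical result: 75% accuracy measured on 100 held-out examples.
lower, upper, radius = proportion_confidence_interval(0.75, 100)
print(f"accuracy 0.75, 95% CI: [{lower:.3f}, {upper:.3f}], radius {radius:.3f}")
```

Because the radius shrinks with the square root of `n`, quadrupling the sample size halves the width of the interval for the same accuracy.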

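A code snippet that estimates the mean and standard deviation of a regression model's predictions from its validation error, in two variants:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumes base_model and error_model are unfitted scikit-learn style regressors
# and that X_train, y_train and X_test are already loaded.

# Variant 1: one constant standard deviation, taken from the validation RMSE
# split the data into train and validation sets
X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.5)
base_model.fit(X1, y1)
base_prediction = base_model.predict(X2)
error = mean_squared_error(y2, base_prediction) ** 0.5
mean = base_model.predict(X_test)
st_dev = error

# Variant 2: a second model learns how the squared error varies with the input
X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.5)
base_model.fit(X1, y1)
base_prediction = base_model.predict(X2)
validation_error = (base_prediction - y2) ** 2
error_model.fit(X2, validation_error)
mean = base_model.predict(X_test)
# the predicted squared error is a variance; clip at zero, then take the root
st_dev = np.sqrt(np.clip(error_model.predict(X_test), 0, None))
```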


### Common Misconceptions About Confidence Intervals

A 95% confidence interval does not mean that for a given realised interval there is a 95% probability that the population parameter lies within the interval. The 95% probability relates to the reliability of the estimation procedure, not to a specific calculated interval.

A confidence interval is not a definitive range of plausible values for the sample parameter, though it may be understood as an estimate of plausible values for the population parameter.

A particular confidence interval of 95% calculated from an experiment does not mean that there is a 95% probability of a sample parameter from a repeat of the experiment falling within this interval. So, it is essential to remember that:

- 95% confidence is confidence that in the long-run 95% of the CIs will include the population mean. It is a confidence in the algorithm and not a statement about a single CI.
- In frequentist terms, the CI either contains the population mean or it does not.
- There is no relationship between a sample’s variance and its mean. Therefore we cannot infer that a single narrow CI is more accurate. In this context, “accuracy” refers to the long-run coverage of the population mean. Look at the visualisation above and note how much the widths of the CIs vary. A CI can be narrow and still lie far from the true mean.
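
The long-run reading in the points above can be checked with a small simulation. The sketch below repeats a hypothetical experiment many times (a model with a true accuracy of 0.75, evaluated on 100 examples each time), builds a 95% interval from each repeat with the Gaussian approximation, and counts how often the interval contains the true value. All names and parameter values are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_coverage(true_accuracy=0.75, n=100, experiments=2000,
                      confidence=0.95, seed=0):
    """Return (coverage, narrowest_radius, widest_radius) over many repeats."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    radii = []
    for _ in range(experiments):
        # Each prediction is a Bernoulli trial that is correct with
        # probability true_accuracy.
        correct = sum(rng.random() < true_accuracy for _ in range(n))
        accuracy = correct / n
        radius = z * sqrt(accuracy * (1 - accuracy) / n)
        radii.append(radius)
        # Any single interval either contains the true value or it does not.
        if accuracy - radius <= true_accuracy <= accuracy + radius:
            hits += 1
    return hits / experiments, min(radii), max(radii)


coverage, narrowest, widest = simulate_coverage()
print(f"fraction of intervals containing the true accuracy: {coverage:.3f}")
print(f"interval radius ranged from {narrowest:.3f} to {widest:.3f}")
```

The fraction should land close to the nominal 95%, while the radius differs from repeat to repeat, which is why the width of one interval says little about whether that particular interval is a hit.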

### Conclusion

A confidence interval is different from a tolerance interval, which describes the bounds of data sampled from the distribution. A CI provides bounds on a population parameter, such as a mean or standard deviation, and is a way to deal with the uncertainty inherent in results derived from data that are themselves only a randomly selected subset of a population.

It is said that preferring confidence intervals and estimation to hypothesis testing will lead to fewer statistical misinterpretations. However, confidence intervals can be unintuitive and are sometimes as misunderstood as p-values and null hypothesis significance testing. Moreover, CIs are often used to perform hypothesis tests and are therefore prone to the same misuses as p-values.

Real-world data is noisy, inconsistent and non-linear. So, a single “significant” CI can be mighty useful for drawing conclusions that would otherwise be cumbersome to reach.