How Much Training Data is Required for Machine Learning?
Last Updated on May 23, 2019
The amount of data you need depends both on the complexity of your problem and on the complexity of your chosen algorithm.
This is a fact, but it does not help you if you are at the pointy end of a machine learning project.
A common question I get asked is:
How much data do I need?
I cannot answer this question directly for you, or for anyone. But I can give you a handful of ways of thinking about it.
In this post, I lay out a suite of methods that you can use to think about how much training data you need to apply machine learning to your problem.
My hope is that one or more of these methods will help you understand the difficulty of the question and how it is tightly coupled with the heart of the induction problem that you are trying to solve.
Let's dive into it.
Note: Do you have your own heuristic methods for deciding how much data is required for machine learning? Please share them in the comments.
Why Are You Asking This Question?
It is important to know why you are asking about the required size of the training dataset.
The answer may influence your next step.
For example:
- Do you have too much data? Consider developing some learning curves to find out just how big a representative sample is (see below). Or, consider using a big data framework in order to use all available data.
- Do you have too little data? Consider confirming that you indeed have too little data. Consider collecting more data, or using data augmentation methods to artificially increase your sample size.
- Have you not collected data yet? Consider collecting some data and evaluating whether it is enough. Or, if it is for a study or data collection is expensive, consider talking to a domain expert and a statistician.
More generally, you may have more pedestrian questions such as:
- How many records should I export from the database?
- How many samples are required to achieve a desired level of performance?
- How large must the training set be to achieve a sufficient estimate of model performance?
- How much data is required to demonstrate that one model is better than another?
- Should I use a train/test split or k-fold cross-validation?
It may be these latter questions that the suggestions in this post seek to address.
In practice, I answer this question myself using learning curves (see below), using resampling methods on small datasets (e.g. k-fold cross-validation and the bootstrap), and by adding confidence intervals to final results.
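As a concrete illustration of that last idea, here is a minimal sketch using scikit-learn: repeated k-fold cross-validation on a small dataset, with a rough confidence interval around the mean score. The synthetic dataset and random forest model are stand-in assumptions; your own data and model would go in their place.

```python
# Estimate model skill on a small dataset with repeated k-fold
# cross-validation, then report a rough interval around the mean.
# Dataset and model are stand-ins, not recommendations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
model = RandomForestClassifier(random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)

# Rough 95% interval, assuming the scores are approximately normal.
print(f"Accuracy: {scores.mean():.3f} +/- {1.96 * scores.std():.3f}")
```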
What is your reason for asking about the number of samples required for machine learning?
Please let me know in the comments.
So, how much data do you need?
1. It Depends; No One Can Tell You
No one can tell you how much data you need for your predictive modeling problem.
It is unknowable: an intractable problem that you must find answers to through empirical investigation.
The amount of data required for machine learning depends on many factors, such as:
- The complexity of the problem, nominally the unknown underlying function that best relates your input variables to the output variable.
- The complexity of the learning algorithm, nominally the algorithm used to inductively learn the unknown underlying mapping function from specific examples.
This is our starting point.
And "it depends" is the answer that most practitioners will give you the first time you ask.
2. Reason by Analogy
A lot of people have worked on a lot of practical machine learning problems before you.
Some of them have published their results.
Perhaps you can look at studies on problems similar to yours as an estimate for the amount of data that may be required.
Similarly, it is common to perform studies on how algorithm performance scales with dataset size. Perhaps such studies can inform you how much data you require to use a specific algorithm.
Perhaps you can average over multiple studies.
Search for papers on Google, Google Scholar, and arXiv.
3. Use Domain Expertise
You need a sample of data from your problem that is representative of the problem you are trying to solve.
In general, the examples must be independent and identically distributed.
Remember, in machine learning we are learning a function to map input data to output data. The mapping function learned will only be as good as the data you provide it from which to learn.
This means that there needs to be enough data to reasonably capture the relationships that may exist both between input features and between input features and output features.
Use your domain knowledge, or find a domain expert, and reason about the domain and the scale of data that may be required to reasonably capture the useful complexity in the problem.
4. Use a Statistical Heuristic
There are statistical heuristic methods available that allow you to calculate a suitable sample size.
Most of the heuristics I have seen have been for classification problems as a function of the number of classes, input features, or model parameters. Some heuristics seem rigorous, others seem completely ad hoc.
Here are some examples you may consider:
- Factor of the number of classes: There must be x independent examples for each class, where x could be tens, hundreds, or thousands (e.g. 5, 50, 500, 5000).
- Factor of the number of input features: There must be x% more examples than there are input features, where x could be tens (e.g. 10).
- Factor of the number of model parameters: There must be x independent examples for each parameter in the model, where x could be tens (e.g. 10).
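As a quick illustration, here is a toy sketch that turns these heuristics into numbers. The scaling factors (50 per class, 10% per feature, 10 per parameter) are placeholder assumptions for the purpose of the example, not recommendations.

```python
# A toy calculator for the ad hoc heuristics above.
# All factors here are illustrative assumptions.
def heuristic_sample_sizes(n_classes, n_features, n_params,
                           per_class=50, per_feature_pct=10, per_param=10):
    return {
        "per class": n_classes * per_class,
        "per feature": int(n_features * (1 + per_feature_pct / 100)),
        "per parameter": n_params * per_param,
    }

# e.g. a hypothetical 3-class problem, 20 input features, 100-parameter model
print(heuristic_sample_sizes(n_classes=3, n_features=20, n_params=100))
# {'per class': 150, 'per feature': 22, 'per parameter': 1000}
```

Note how wildly the three estimates disagree, which rather proves the point of the next sentence.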
They all look like ad hoc scaling factors to me.
Have you used any of these heuristics?
How did it go? Let me know in the comments.
In theoretical work on this topic (not my area of expertise!), a classifier (e.g. k-nearest neighbors) is often contrasted against the optimal Bayesian decision rule, and the difficulty is characterized in the context of the curse of dimensionality; that is, there is an exponential increase in the difficulty of the problem as the number of input features is increased.
For example:
- Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners, 1991
- Dimensionality and sample size considerations in pattern recognition practice, 1982
Findings suggest avoiding local methods (like k-nearest neighbors) for sparse samples from high-dimensional problems (e.g. few samples and many input features).
For a kinder discussion of this topic, see:
- Section 2.5 Local Methods in High Dimensions, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2008.
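To see why local methods struggle, here is a small sketch (my own illustration, not taken from the references above) of distance concentration: as the number of dimensions grows, the nearest and farthest neighbors of a point become almost equally far away, so "nearest" stops meaning much.

```python
# A small demo of distance concentration in high dimensions.
import numpy as np

rng = np.random.default_rng(1)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                       # 500 points in [0, 1]^d
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances to point 0
    # As d grows, this ratio approaches 1: no point is clearly "near".
    print(f"d={d:4d}  min/max distance ratio: {dists.min() / dists.max():.3f}")
```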
5. Nonlinear Algorithms Need More Data
The more powerful machine learning algorithms are often referred to as nonlinear algorithms.
By definition, they are able to learn complex nonlinear relationships between input and output features. You may very well be using these types of algorithms or intend to use them.
These algorithms are often more flexible and even nonparametric (they can figure out how many parameters are required to model your problem, in addition to the values of those parameters). They are also high-variance, meaning predictions vary based on the specific data used to train them. This added flexibility and power comes at the cost of requiring more training data, often a lot more data.
In fact, some nonlinear algorithms, like deep learning methods, can continue to improve in skill as you give them more data.
If a linear algorithm achieves good performance with hundreds of examples per class, you may need thousands of examples per class for a nonlinear algorithm, like random forest or an artificial neural network.
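To make this concrete, here is a hedged sketch comparing a linear and a nonlinear model as the training set grows, on a synthetic nonlinear problem. The dataset (make_moons), the sizes, and the two models are illustrative assumptions, not a definitive benchmark.

```python
# Compare a linear and a nonlinear model at increasing training set
# sizes on a synthetic nonlinear problem. All choices are illustrative.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=5000, noise=0.3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1000, random_state=1)

for n in (50, 200, 1000, 4000):
    linear = LogisticRegression().fit(X_train[:n], y_train[:n])
    forest = RandomForestClassifier(random_state=1).fit(X_train[:n], y_train[:n])
    print(f"n={n:5d}  linear={linear.score(X_test, y_test):.3f}  "
          f"forest={forest.score(X_test, y_test):.3f}")
```

Typically the linear model plateaus early while the nonlinear model keeps improving as it gets more examples, though the exact numbers depend on your problem.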
6. Evaluate Dataset Size vs Model Skill
It is common when developing a new machine learning algorithm to demonstrate and even explain the performance of the algorithm in response to the amount of data or problem complexity.
These studies may or may not be performed and published by the author of the algorithm, and may or may not exist for the algorithms or problem types that you are working with.
I would suggest performing your own study with your available data and a single well-performing algorithm, such as random forest.
Design a study that evaluates model skill versus the size of the training dataset.
Plotting the result as a line plot with training dataset size on the x-axis and model skill on the y-axis will give you an idea of how the size of the data affects the skill of the model on your specific problem.
This graph is called a learning curve.
From this graph, you may be able to project the amount of data that is required to develop a good model, or perhaps how little data you actually need before hitting an inflection point of diminishing returns.
I highly recommend this approach in general in order to develop robust models in the context of a well-rounded understanding of the problem.
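A minimal sketch of such a study, assuming scikit-learn's learning_curve helper with a stand-in dataset and model, might look like this:

```python
# Plot cross-validated model skill against training dataset size.
# The dataset and model are stand-ins for your own problem.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(random_state=1), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

plt.plot(sizes, test_scores.mean(axis=1), marker="o")
plt.xlabel("Training dataset size")
plt.ylabel("Cross-validated accuracy")
plt.title("Learning curve")
plt.show()
```

If the curve is still climbing at your full dataset size, more data will likely help; if it has flattened, you have probably hit the point of diminishing returns.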
7. Naive Guesstimate
You need lots of data when applying machine learning algorithms.
Often, you need more data than you would reasonably require in classical statistics.
I often answer the question of how much data is required with the flippant response:
Get and use as much data as you can.
If pressed, and with zero knowledge of the specifics of your problem, I would say something naive like:
- You need thousands of examples.
- No fewer than hundreds.
- Ideally, tens or hundreds of thousands for "average" modeling problems.
- Millions or tens-of-millions for "difficult" problems like those tackled by deep learning.
Again, this is just more ad hoc guesstimating, but it's a starting point if you need one. So get started!
8. Get More Data (No Matter What!?)
Big data is often discussed along with machine learning, but you may not require big data to fit your predictive model.
Some problems require big data, all the data you have. For example, simple statistical machine translation:
- The Unreasonable Effectiveness of Data (and Peter Norvig's talk)
If you are performing traditional predictive modeling, then there will likely be a point of diminishing returns in the training set size, and you should study your problem and your chosen model(s) to see where that point is.
Keep in mind that machine learning is a process of induction. The model can only capture what it has seen. If your training data does not include edge cases, they will very likely not be supported by the model.
Don't Procrastinate; Get Started
Now, stop getting ready to model your problem, and model it.
Do not let the problem of the training set size stop you from getting started on your predictive modeling problem.
In many cases, I see this question as a reason to procrastinate.
Get all the data you can, use what you have, and see how effective models are on your problem.
Learn something, then take action to better understand what you have with further analysis, extend the data you have with augmentation, or gather more data from your domain.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
There is a lot of discussion around this question on Q&A sites like Quora, StackOverflow, and CrossValidated. Below are a few choice examples that may help.
- How large a training set is needed?
- Training set size for neural networks considering curse of dimensionality
- How to decrease training set size?
- Does an increase in training set size help in increasing the accuracy perpetually or is there a saturation point?
- How to choose the training, cross-validation, and test set sizes for small sample-size data?
- How few training examples is too few when training a neural network?
- What is the recommended minimum training dataset size to train a deep neural network?
I expect that there are some great statistical studies on this question; here are a few I could find.
- Sample size planning for classification models, 1991
- Dimensionality and sample size considerations in pattern recognition practice, 1982
- Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners, 1991
- Predicting Sample Size Required for Classification Performance, 2012
Other related articles.
- How much training data do you need?
- Do We Need More Training Data?
- The Unreasonable Effectiveness of Data (and Peter Norvig's talk)
If you know of more, please let me know in the comments below.
Summary
In this post, you discovered a suite of ways to think and reason about the problem of answering the common question:
How much training data do I need for machine learning?
Did any of these methods help?
Let me know in the comments below.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Except, of course, the question of how much data you specifically need.