A typical and useful assumption for statistical inference is that the data are independent and identically distributed (IID). Under this assumption, we can take a random subset of patients and predict their likelihood of diabetes with an ordinary train-test split, no problem. In practice, however, there are datasets where this assumption doesn't hold, and a typical train-test split can introduce data leakage. When the distribution of the variable of interest is not random, the data are said to be autocorrelated, and this has implications for machine learning models.
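As a point of reference, this is the ordinary random split that works fine for IID data. The patient features and labels here are purely synthetic stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for patient features and diabetes labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Standard random split -- appropriate when samples are IID
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```

The rest of this post is about when this familiar recipe quietly breaks down.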
We can find spatial autocorrelation on many datasets with a geospatial component. Consider the maps below:
If the data were IID, it would look like the map on the right. But in real life, we get maps like the one on the left, where patterns are easy to observe. The first law of geography states that near things are more related to each other than distant things. Attributes usually aren't randomly distributed across a location; an area is more likely to be very similar to its neighbors. In the example above, the population level of a single area is likely to be similar to that of an adjacent area, as opposed to a distant one.
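This tendency can be quantified. A common statistic for it is global Moran's I, which is positive when neighboring areas carry similar values, near zero for random (IID-like) patterns, and negative when neighbors are dissimilar. Below is a minimal sketch on a toy "map" of six areas in a line, with a hand-built adjacency matrix standing in for real spatial weights:

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I for values x and spatial weight matrix w."""
    z = x - x.mean()
    return len(x) / w.sum() * (w * np.outer(z, z)).sum() / (z ** 2).sum()

# Toy example: six areas along a line; adjacent areas are neighbors
w = np.zeros((6, 6))
for i in range(5):
    w[i, i + 1] = w[i + 1, i] = 1

clustered = np.array([1.0, 1, 1, 5, 5, 5])  # similar values sit together
dispersed = np.array([1.0, 5, 1, 5, 1, 5])  # values alternate

print(morans_i(clustered, w))  # positive: spatially autocorrelated
print(morans_i(dispersed, w))  # negative: neighbors are dissimilar
```

For real datasets, libraries such as PySAL compute this (and proper spatial weights) for you; the point here is just the intuition behind the statistic.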
When do we need spatial cross-validation?
When data is autocorrelated, we need to be extra wary of overfitting. If we use random samples for train-test splits or cross-validation, we violate the IID assumption because the samples are not statistically independent. Area A could be in the training set while Area Z, only a kilometer away and sharing very similar features, lands in the validation set. The model would produce a more accurate prediction for Area Z simply because it saw a very similar example during training. To fix this, grouping the data by area prevents the model from peeking at data it shouldn't be seeing. Here's what spatial cross-validation looks like:
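The grouping idea can be sketched with scikit-learn's `GroupKFold`: pass each sample's area label as its group, and no area ever appears on both sides of a split. The area labels here are hypothetical:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical area labels: each sample belongs to one of five areas
areas = np.array(["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"])
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=areas):
    # No area appears in both the training and validation sets
    assert set(areas[train_idx]).isdisjoint(areas[val_idx])
```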
A good question to ask here: do we always want to prevent overfitting? Intuitively, yes. But as with most machine learning techniques, it depends. If it fits your use case, overfitting may even be beneficial!
Let’s say we had a randomly sampled national survey on wealth. We have wealth values of a distributed set of households across the country, and we’d like to infer the wealth levels for unsurveyed areas to get complete wealth data for the entire country. Here, the goal is only to fill in spatial gaps. Training with the data of the nearest areas would certainly help fill in the gaps more accurately!
It’s a different story if we were trying to build a generalizable model — say, one that we would apply to another country altogether. [2] In this case, exploiting the spatial autocorrelation property during training will likely inflate the accuracy of a potentially poor model. This is especially concerning if we use this seemingly good model on an area where there is no ground truth for verification.
Spatial cross-validation implementation in scikit-learn
To address this, we’d have to split areas between training and testing. If this were a normal train-test split, we could easily filter a few areas out for our test data. In other cases, however, we would want to utilize all of the available data by using cross-validation. Unfortunately, scikit-learn’s most familiar CV functions split the data randomly or by target variable, not by chosen columns. A workaround can be implemented, taking into consideration that our dataset includes geocoded elements.
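One way to sketch such a workaround: derive spatial blocks by clustering the coordinate columns, then feed those blocks to `GroupKFold` via the `groups` argument of `cross_val_score`. The coordinates, features, and target below are synthetic, and `Ridge` is just a placeholder estimator:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic geocoded dataset: coordinate columns plus features and target
rng = np.random.default_rng(1)
n = 200
coords = rng.uniform(0, 100, size=(n, 2))  # stand-in for lon/lat columns
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Cluster coordinates into spatial blocks, then use blocks as CV groups
blocks = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)
scores = cross_val_score(Ridge(), X, y, groups=blocks, cv=GroupKFold(n_splits=5))
print(scores)  # one score per held-out spatial block
```

Each fold now holds out a whole spatial block, so the estimate reflects performance on areas the model has never seen nearby examples of. The number of clusters is a judgment call: too few gives noisy fold scores, too many recreates the leakage we set out to avoid.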