Step-by-Step Guide for Building a Species Distribution Model

To build a Species Distribution Model (SDM), we use a small dataset from Kaggle (a few kilobytes in size). The “Bird Sightings Dataset” is a suitable choice: it records bird species along with the location, date, and time of each sighting, plus a description of each bird.

Step 1: Load Necessary Libraries

Python

import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt

Step 2: Load and inspect the dataset

Python

data = pd.read_csv('birdsoftheworld-unprocessed.csv')
print(data.columns)

Output:

Index(['species', 'location', 'time', 'description of bird', 'sex',
       'feather color', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9',
       'Unnamed: 10', 'Unnamed: 11'],
      dtype='object')
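The column listing includes several 'Unnamed: N' columns, which pandas typically creates when a CSV has trailing delimiters or blank header cells. The following is a minimal sketch of one way to drop columns that are entirely empty and count the remaining missing values. It uses a small hypothetical DataFrame (the values are invented) in place of the real file, and it assumes the unnamed columns contain no data, which should be checked on the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV; all values are invented for illustration.
toy = pd.DataFrame({
    "species": ["robin", "wren", "robin"],
    "location": ["park", None, "forest"],
    "Unnamed: 6": [np.nan, np.nan, np.nan],
    "Unnamed: 7": [np.nan, np.nan, np.nan],
})

# Drop every column in which all values are missing.
cleaned = toy.dropna(axis=1, how="all")

# Count the missing values that remain in each column.
missing_counts = cleaned.isna().sum()

print(list(cleaned.columns))
print(missing_counts.to_dict())
```

If an unnamed column turns out to hold real values, it is safer to inspect it before dropping it.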

Step 3: Data Preprocessing

We’ll use the ‘location’, ‘sex’, and ‘feather color’ columns as features and ‘species’ as the label. All three features are categorical, so we fill missing values with the most frequent category and then one-hot encode them.

Python

# Select relevant columns
features = data[['location', 'sex', 'feather color']]
labels = data['species']

# Handle missing values and encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), ['location', 'sex', 'feather color'])
    ])

# Scale the one-hot features (with_mean=False skips centering so the sparse matrix stays sparse)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler(with_mean=False))
])

features_processed = pipeline.fit_transform(features)

# Encode labels
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)

Step 4: Model Training

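Train a One-Class SVM on the processed features. The model is fit on all samples without using the species labels, so it learns which feature combinations are typical of the dataset as a whole.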

Python

model = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.1)
model.fit(features_processed)

Output:

OneClassSVM(gamma=0.1, nu=0.1)
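In presence-only species distribution modeling, a common alternative is to fit one One-Class SVM per species, using only that species' records. The sketch below illustrates the idea on synthetic two-dimensional features; the species names and data are hypothetical, and this is not the approach the tutorial code above takes.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic features for two hypothetical species living in different conditions.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
species = np.array(["species_a"] * 50 + ["species_b"] * 50)

# Fit one model per species, using only that species' samples.
models = {}
for name in np.unique(species):
    per_species_model = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1)
    per_species_model.fit(X[species == name])
    models[name] = per_species_model

# Score a point near the centre of species_a under both models.
query = np.array([[0.0, 0.0]])
scores = {name: float(m.decision_function(query)[0]) for name, m in models.items()}
print(scores)
```

A higher decision score means the query point looks more like the records that model was trained on, so comparing scores across the per-species models gives a rough ranking of which species the conditions suit.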

Step 5: Model Evaluation

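Evaluate the model using the Area Under the ROC Curve (AUC). A One-Class SVM produces a single decision score per sample rather than one score per species, so the AUC computed here should be read as a rough diagnostic, not as a true multi-class evaluation.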

Python

labels_binarized = label_binarize(labels_encoded, classes=range(len(label_encoder.classes_)))

# Predict the species distribution
predictions = model.decision_function(features_processed)

# Reshape predictions to match the shape of labels_binarized
predictions_reshaped = predictions.reshape(-1, 1)

auc_score = roc_auc_score(labels_binarized, predictions_reshaped, average='macro', multi_class='ovr')
print(f'Area under the ROC curve: {auc_score:.4f}')

Output:

Area under the ROC curve: 0.0038
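An AUC far below 0.5 suggests that the decision scores are not aligned with the labels they are being compared against. One way to get a more interpretable number is to evaluate a single target species as a binary problem: fit the One-Class SVM on some of that species' records, then check whether held-out records of the same species score higher than records of other species. The sketch below does this on synthetic data; the arrays and the train/evaluation split are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic features: a hypothetical target species and a group of other species.
target = rng.normal(0.0, 0.5, size=(60, 2))
others = rng.normal(4.0, 0.5, size=(60, 2))

# Fit on the first 40 target records only (presence-only training).
target_model = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1).fit(target[:40])

# Evaluate on held-out target records (label 1) and the other species (label 0).
X_eval = np.vstack([target[40:], others])
y_eval = np.concatenate([np.ones(20), np.zeros(60)])
scores_eval = target_model.decision_function(X_eval)

auc = roc_auc_score(y_eval, scores_eval)
print(f"AUC for the target species: {auc:.3f}")
```

Repeating this for each species and averaging the results would give a macro-averaged score that matches what a one-vs-rest AUC is meant to measure.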

Step 6: Prediction and Mapping

Since we don’t have geographic coordinates, we will visualize the predictions using a simple scatter plot.

Python

# Predict whether each sample is an inlier (1) or an outlier (-1)
predictions = model.predict(features_processed)

plt.figure(figsize=(10, 6))
plt.scatter(range(len(predictions)), predictions, c=predictions, cmap='coolwarm', alpha=0.5)
plt.title('Bird Species Distribution Predictions')
plt.xlabel('Sample Index')
plt.ylabel('Prediction')
plt.show()

Output:

[Figure: “Species Distribution Modeling” scatter plot of Prediction against Sample Index]

The scatter plot shows the model’s binary prediction for each sample. Because predict() returns only -1 or 1, the points always fall on two horizontal lines; this separation reflects the output format, not the model’s confidence. For graded scores, use decision_function() instead. Without geographic coordinates, the plot cannot show spatial distribution patterns, so it serves mainly as a check on how many samples fall into each group.

The prediction values are binary: -1 marks samples that the One-Class SVM treats as outliers, and 1 marks samples it treats as inliers.
Each point on the x-axis corresponds to one sample in the dataset.
The plot shows two groups of points: one at y = -1 (blue) and one at y = 1 (red).
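Since the scatter plot is indexed by sample and not by place, a per-location summary can be more informative. The sketch below computes the fraction of samples predicted as inliers at each location. The predictions and location names are hypothetical placeholders for the arrays produced earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical model output and matching location labels.
toy_predictions = np.array([1, 1, -1, 1, -1, -1])
toy_locations = ["park", "park", "park", "forest", "forest", "forest"]

# Fraction of samples per location that the model marks as inliers (+1).
summary = (
    pd.DataFrame({"location": toy_locations, "inlier": toy_predictions == 1})
    .groupby("location")["inlier"]
    .mean()
)
print(summary)
```

Locations with a low inlier fraction are those where the recorded sightings look unusual relative to the rest of the data.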

Species Distribution Modeling in Scikit-Learn

Species Distribution Modeling (SDM) is a crucial tool in conservation biology, ecology, and related fields. It involves predicting the geographic distribution of species based on environmental variables and species occurrence data. This article explores how to implement SDM using Scikit-Learn, a popular machine learning library in Python.
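When occurrence records do carry coordinates, the fitted model can be evaluated over a grid to produce a suitability surface. The following is a minimal sketch under stated assumptions: the occurrence points are synthetic (longitude, latitude) pairs, coordinates are the only features, and the gamma and nu values are arbitrary. A real SDM would normally add environmental variables such as temperature or rainfall at each location.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Synthetic occurrence records as (longitude, latitude) pairs.
occurrences = rng.normal(loc=[10.0, 50.0], scale=0.3, size=(80, 2))

# Fit a presence-only model on the occurrence coordinates.
sdm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(occurrences)

# Build a regular grid covering the study area.
lon = np.linspace(8.0, 12.0, 41)
lat = np.linspace(48.0, 52.0, 41)
lon_grid, lat_grid = np.meshgrid(lon, lat)
grid_points = np.column_stack([lon_grid.ravel(), lat_grid.ravel()])

# Higher decision scores indicate conditions closer to the training records.
suitability = sdm.decision_function(grid_points).reshape(lon_grid.shape)

# Locate the grid cell with the highest score.
best_row, best_col = np.unravel_index(np.argmax(suitability), suitability.shape)
best_cell = (float(lon[best_col]), float(lat[best_row]))
print(best_cell)
```

The suitability array can be passed to a plotting function such as matplotlib's contourf, together with lon_grid and lat_grid, to draw the map.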

Table of Contents

Introduction to Species Distribution Modeling
Why Use Scikit-Learn for SDM?
Step-by-Step Guide for Building a Species Distribution Model

Step 1: Load Necessary Libraries
Step 2: Load and inspect the dataset
Step 3: Data Preprocessing
Step 4: Model Training
Step 5: Model Evaluation
Step 6: Prediction and Mapping
