Obtaining the data
Note
The following tutorial is intended to be run entirely with Python 3.
Installing WSL
To run the following tutorial on Windows 10/11, you need a Linux environment. The easiest way to achieve this is by using the Windows Subsystem for Linux (WSL). To install it, follow the steps below:
Go to Control Panel
Change View By to “Category”
Go to Programs
Go to Turn Windows features on or off
Check Windows Subsystem For Linux and click Ok
Restart your computer
Go to Microsoft Store
Search for “Ubuntu”
Click on the first search result
Click on Get or Install
Open Ubuntu
Wait for installation and choose username and password
Done!
Creating working directory
Running the following commands will create an Agri-Plast directory on C:/ and /home/user/ directories on Windows or Linux, respectively:
On Windows:
import subprocess
subprocess.run('wsl bash -c "mkdir /mnt/c/Agri-Plast"', shell=True, check=True)
On Linux/Mac:
import subprocess
subprocess.run('mkdir ~/Agri-Plast', shell=True, check=True)
Downloading metadata file
In this tutorial, we will be firstly focusing on the following dataset: https://dmportal.biodata.pt/dataset.xhtml?persistentId=doi:10.34636/DMPortal/AWYIXC
From this link we should take note of the doi:
doi:10.34636/DMPortal/AWYIXC
And replace on the the doi variable on the following command:
On Windows:
import subprocess
doi = "doi:10.34636/DMPortal/AWYIXC"
subprocess.run(f'wsl bash -c "curl -L \\"https://dmportal.biodata.pt/api/datasets/:persistentId/?persistentId={doi}\\" -o /mnt/c/Agri-Plast/dataset.metadata"', shell=True, check=True)
On Linux/Mac:
import subprocess
doi = "doi:10.34636/DMPortal/AWYIXC"
subprocess.run(f'curl -L "https://dmportal.biodata.pt/api/datasets/:persistentId/?persistentId={doi}" -o ~/Agri-Plast/dataset.metadata', shell=True, check=True)
Obtaining File IDs
To obtain each file IDs from each dataset:
Open dataset.metadata
Search for:
"dataFile":{"id":Take note of the numbers following that search term. Next to it you have
"filename":followed by the name of the file.
Downloading dataset files
In the case of the dataset presented here, the IDs of the two files are 1228 and 1229.
To download each file, replace id by the desired one on the following command:
On Windows:
id = "1229"
import subprocess
subprocess.run(f'wsl bash -c "curl -L \\"https://dmportal.biodata.pt/api/access/datafile/{id}?format=original\\" -o /mnt/c/Agri-Plast/file_{id}.csv"', shell=True, check=True)
On Linux/Mac:
id = "1229"
import subprocess
subprocess.run(f'curl -L "https://dmportal.biodata.pt/api/access/datafile/{id}?format=original" -o ~/Agri-Plast/file_{id}.csv', shell=True, check=True)
Inspecting the file
import pandas as pd
dataset = pd.read_csv("C:\\Agri-Plast\\file_1229.csv") # Change to "~/Agri-Plast/file_1229.csv" on Linux/Mac
print(dataset)
Output:
Date Time Temperature (°C) Humidity (%HR)
0 27/06/2024 11:27:00 31.6 92.8
1 27/06/2024 11:42:00 36.6 43.7
2 27/06/2024 11:57:00 38.1 41.4
3 27/06/2024 12:12:00 38.4 39.7
4 27/06/2024 12:27:00 38.8 39.5
... ... ... ... ...
7995 18/09/2024 18:12:00 26.2 60.5
7996 18/09/2024 18:27:00 26.0 63.2
7997 18/09/2024 18:42:00 25.8 65.7
7998 18/09/2024 18:57:00 25.5 66.5
7999 18/09/2024 19:12:00 25.3 67.3
[8000 rows x 4 columns]
A simple linear regression
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
dataset = pd.read_csv("C:\\Agri-Plast\\file_1229.csv") # Change to "~/Agri-Plast/file_1229.csv" on Linux/Mac
dataset_clean = dataset.dropna(subset=["Temperature (°C)", "Humidity (%HR)"])
x = dataset_clean["Temperature (°C)"]
y = dataset_clean["Humidity (%HR)"]
slope, intercept, r_value, p_value, std_err = linregress(x, y)
regression_line = slope * x + intercept
plt.scatter(dataset_clean["Temperature (°C)"], dataset_clean["Humidity (%HR)"], alpha=0.6)
plt.plot(x, regression_line, color="red")
plt.xlabel("Temperature (°C)")
plt.ylabel("Humidity (%HR)")
plt.title("Temperature vs Humidity")
plt.show()
print("y ="+str(slope)+"*x"+" + " + str(intercept))
Output:
y =-2.471288951824853*x + 128.68938924684124