Dataset Management with Ultralytics HUB-SDK

Welcome to the Ultralytics HUB-SDK Dataset Management documentation! 👋

Efficient dataset management is crucial in machine learning. Whether you're a seasoned data scientist or a beginner, knowing how to handle dataset operations can streamline your workflow. This page covers the basics of performing operations on datasets using the Ultralytics HUB-SDK in Python. The examples provided illustrate how to get, create, update, delete, and list datasets, and also how to get a URL for dataset access and upload datasets.

Let's dive in! 🚀

Get a Dataset by ID

To fetch a specific dataset rapidly using its unique ID, use the code snippet below. This allows you to access essential information, including its data.

from hub_sdk import HUBClient

credentials = {"api_key": "<YOUR-API-KEY>"}
client = HUBClient(credentials)

# Fetch a dataset by ID
dataset = client.dataset("<Dataset ID>")  # Replace with your actual Dataset ID
print(dataset.data)  # This prints the dataset information

For more details on the Datasets class and its methods, see the Reference for hub_sdk/modules/datasets.py.

Create a Dataset

To create a fresh dataset, define a friendly name for your dataset and use the create_dataset method as shown below:

from hub_sdk import HUBClient

credentials = {"api_key": "<YOUR-API-KEY>"}
client = HUBClient(credentials)

# Define your dataset properties
data = {"meta": {"name": "My Dataset"}}  # Replace 'My Dataset' with your desired dataset name

# Create the dataset
dataset = client.dataset()
dataset.create_dataset(data)
print("Dataset created successfully!")

See the create_dataset method in the API reference for further information.

Update a Dataset

As projects evolve, you may need to modify your dataset's metadata. This is as simple as running the following code with the new details:

from hub_sdk import HUBClient

credentials = {"api_key": "<YOUR-API-KEY>"}
client = HUBClient(credentials)

# Obtain the dataset
dataset = client.dataset("<Dataset ID>")  # Insert the correct Dataset ID

# Update the dataset's metadata
dataset.update({"meta": {"name": "Updated Name"}})  # Modify 'Updated Name' as required
print("Dataset updated with new information.")

The update method provides more details on updating datasets.

Delete a Dataset

To remove a dataset, whether to declutter your workspace or because it's no longer needed, you can permanently delete it by invoking the delete method:

from hub_sdk import HUBClient

credentials = {"api_key": "<YOUR-API-KEY>"}
client = HUBClient(credentials)

# Select the dataset by its ID
dataset = client.dataset("<Dataset ID>")  # Ensure the Dataset ID is specified

# Delete the dataset
dataset.delete()
print("Dataset has been deleted.")

For more on deletion options, including hard deletes, see the delete method documentation.

List Datasets

To browse through your datasets, list all your datasets with pagination. This is helpful when dealing with a large number of datasets.

from hub_sdk import HUBClient

credentials = {"api_key": "<YOUR-API-KEY>"}
client = HUBClient(credentials)

# Retrieve the first page of datasets
datasets = client.dataset_list(page_size=10)
print("Current dataset:", datasets.results)  # Show the datasets on the current page

# Move to the next page and show results
datasets.next()
print("Next page result:", datasets.results)

# Go back to the previous page
datasets.previous()
print("Previous page result:", datasets.results)

The DatasetList class provides more details on listing and paginating datasets.

Get URL from Storage

This function fetches a URL for dataset storage access, making it easy to download dataset files or artifacts stored remotely.

from hub_sdk import HUBClient

credentials = {"api_key": "<YOUR-API-KEY>"}
client = HUBClient(credentials)

# Define the dataset ID for which you want a download link
dataset = client.dataset("<Dataset ID>")  # Replace Dataset ID with the actual dataset ID

# Retrieve the URL for downloading dataset contents
url = dataset.get_download_link()
print("Download URL:", url)

The get_download_link method documentation provides additional details.

Upload Dataset

Uploading your dataset is straightforward. Set your dataset's ID and the file path, then use the upload_dataset function:

from hub_sdk import HUBClient

credentials = {"api_key": "<YOUR-API-KEY>"}
client = HUBClient(credentials)

# Select the dataset
dataset = client.dataset("<Dataset ID>")  # Substitute with the real dataset ID

# Upload the dataset file
dataset.upload_dataset(file="<Dataset File>")  # Specify the correct file path
print("Dataset has been uploaded.")

The upload_dataset method provides further details on uploading datasets. You can also learn about the related DatasetUpload class.

Remember to double-check your Dataset IDs and file paths to ensure everything runs smoothly.

If you encounter any issues or have questions, our support team is here to help. 🤝

Happy data wrangling, and may your models be accurate and insightful! 🌟

📅 Created 1 year ago ✏️ Updated 1 month ago