Hugging Face datasets

Download

The Download API retrieves the files of a Hugging Face dataset repository and writes them to a directory on your local filesystem, so you can fetch datasets programmatically instead of downloading them by hand.


Code

use Partitech\PhpMistral\Clients\HuggingFace\HuggingFaceDatasetClient;

$apiKey = getenv('HF_TOKEN');  // Hugging Face API token

$client = new HuggingFaceDatasetClient(apiKey: (string) $apiKey);

try {
    // Download dataset files from the 'google/civil_comments' repository
    $dest = $client->downloadDatasetFiles(
        'google/civil_comments',              // Repository name (user/repo format)
        revision: 'main',                      // (Optional) Branch or tag (default: main)
        destination: '/tmp/downloaded_datasets/civil_comments'  // Local target directory
    );

    print_r($dest);  // Output destination path

} catch (\Throwable $e) {
    echo $e->getMessage();  // Handle errors (e.g., invalid repo, network issues)
}

Result

/tmp/downloaded_datasets/civil_comments

After the call returns, the destination directory contains the repository's files. For 'google/civil_comments' the layout looks like this:

tree /tmp/downloaded_datasets/civil_comments

/tmp/downloaded_datasets/civil_comments
├── data
│   ├── test-00000-of-00001.parquet
│   ├── train-00000-of-00002.parquet
│   ├── train-00001-of-00002.parquet
│   └── validation-00000-of-00001.parquet
└── README.md
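
Once the files are on disk, a script usually needs to locate the Parquet shards before processing them. The following is a minimal sketch that uses only PHP's standard library; the helper name listParquetFiles is hypothetical and not part of PhpMistral, and the path is the destination used in the example above.

```php
<?php
declare(strict_types=1);

/**
 * Return the sorted paths of all .parquet files found under $root.
 * Hypothetical helper for illustration; not part of the PhpMistral client.
 */
function listParquetFiles(string $root): array
{
    if (!is_dir($root)) {
        return [];  // Nothing downloaded yet, or wrong path
    }

    $files = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        // Keep regular files whose extension is "parquet" (case-insensitive)
        if ($file->isFile() && strtolower($file->getExtension()) === 'parquet') {
            $files[] = $file->getPathname();
        }
    }

    sort($files);  // Stable order so shards are processed predictably

    return $files;
}

// Print one shard path per line for the directory used above
foreach (listParquetFiles('/tmp/downloaded_datasets/civil_comments') as $path) {
    echo $path, PHP_EOL;
}
```

Returning an empty array for a missing directory keeps the helper safe to call before the download has run; a stricter pipeline might prefer to throw instead.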

Use Cases

  • Local dataset processing: Download datasets for offline analysis or model training.
  • Pipeline integration: Automatically retrieve datasets as part of a data processing pipeline.
  • Dataset backups: Keep local copies of specific dataset versions (branches or tags).
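
The backup use case can be sketched as a loop that downloads several revisions into separate directories. This is an illustration under stated assumptions: revisionDestination and backupRevisions are hypothetical helpers, the '/backups' directory and the 'v1.0' tag are made-up examples, and the download step is passed in as a callable so the sketch does not depend on any client method beyond the downloadDatasetFiles call shown earlier.

```php
<?php
declare(strict_types=1);

/**
 * Build a per-revision target directory, e.g.
 * /backups/google__civil_comments/main (hypothetical naming scheme).
 */
function revisionDestination(string $baseDir, string $repo, string $revision): string
{
    // Replace slashes so repo and revision names map to a single directory level each
    $safe = static fn (string $part): string => str_replace('/', '__', $part);

    return rtrim($baseDir, '/') . '/' . $safe($repo) . '/' . $safe($revision);
}

/**
 * Call $download once per revision and report where each copy was stored.
 * A revision that fails maps to null instead of aborting the whole run.
 */
function backupRevisions(callable $download, string $repo, array $revisions, string $baseDir): array
{
    $results = [];

    foreach ($revisions as $revision) {
        $destination = revisionDestination($baseDir, $repo, $revision);
        try {
            $download($repo, $revision, $destination);
            $results[$revision] = $destination;
        } catch (\Throwable $e) {
            error_log("Backup of {$repo}@{$revision} failed: " . $e->getMessage());
            $results[$revision] = null;
        }
    }

    return $results;
}

// Usage with the client from the example above ('v1.0' is a hypothetical tag):
//
// $download = fn (string $repo, string $revision, string $destination) =>
//     $client->downloadDatasetFiles($repo, revision: $revision, destination: $destination);
//
// print_r(backupRevisions($download, 'google/civil_comments', ['main', 'v1.0'], '/backups'));
```

Passing the downloader as a callable also makes the loop easy to exercise in tests with a stub, without touching the network.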

Common Pitfalls