Hugging Face datasets

Commit

The Commit API pushes local dataset files to a Hugging Face repository using Git and Git LFS. Git operations are run programmatically through the PHP package czproject/git-php.


Prerequisites

  1. Git: Installed and configured globally.
  2. Git LFS: Installed and configured (git lfs install).
  3. Hugging Face Token: An API token with write access to the target repository.
  4. czproject/git-php: Installed via Composer:
    composer require czproject/git-php
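
Because a missing prerequisite usually only surfaces as a failed push, it can help to check the environment before calling the client. The following is a minimal sketch, not part of the library: the helper names commandSucceeds and checkPrerequisites are hypothetical, and it assumes the token is exposed through the HF_TOKEN environment variable as in the example further down.

```php
<?php
// Hypothetical pre-flight check; these helpers are NOT part of the library.

/**
 * Returns true when the given shell command exits with status 0.
 */
function commandSucceeds(string $command): bool
{
    $output = [];
    $exitCode = 1;
    // Redirect stderr so a missing binary does not pollute the console.
    exec($command . ' 2>&1', $output, $exitCode);

    return $exitCode === 0;
}

/**
 * Collects human-readable problems; an empty array means every check passed.
 */
function checkPrerequisites(): array
{
    $problems = [];

    if (!commandSucceeds('git --version')) {
        $problems[] = 'git is not installed or not on the PATH';
    }
    if (!commandSucceeds('git lfs version')) {
        $problems[] = 'git-lfs is not installed (install it, then run "git lfs install")';
    }
    // Assumption: the token is read from HF_TOKEN, as in the Code section.
    if ((string) getenv('HF_TOKEN') === '') {
        $problems[] = 'the HF_TOKEN environment variable is empty';
    }

    return $problems;
}

foreach (checkPrerequisites() as $problem) {
    echo 'Missing prerequisite: ', $problem, PHP_EOL;
}
```

The check only confirms that the tools respond; it does not verify that the token actually has write access to the target repository.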
    

Code

$apiKey = getenv('HF_TOKEN');    // Hugging Face API token with write access
$hfUser = getenv('HF_USER');     // Your Hugging Face username or organization

$client = new HuggingFaceDatasetClient(apiKey: (string) $apiKey);

// Resolve the dataset folder once so that listing and committing use the same path
$dir = realpath('mon_dataset');
if ($dir === false) {
    exit('Dataset directory not found' . PHP_EOL);
}

// Get the list of files from the local dataset folder
$files = $client->listFiles($dir);

try {
    // Commit and push the files to the Hugging Face dataset repository
    $commit = $client->commit(
        repository: $hfUser . '/test2',        // Target repository (user/repo)
        dir: $dir,                             // Local directory containing dataset files
        files: $files,                         // Files to commit (list of relative paths)
        summary: 'commit title',               // Commit summary (short description)
        commitMessage: 'commit message',       // Full commit message
        branch: 'main'                         // Branch to commit to (default: main)
    );

    print_r($commit);  // Display commit details

} catch (\Throwable $e) {
    // Report the failure without dumping the whole exception object
    echo 'Commit failed: ' . $e->getMessage() . PHP_EOL;
}
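
The files argument is described as a list of paths relative to dir. How listFiles builds that list is not shown here, so the following is only a sketch of one way to produce such a list with PHP's SPL iterators. The function listRelativeFiles is a hypothetical helper, not the client's implementation, and skipping the .git directory is an assumption about what should be committed.

```php
<?php
/**
 * Hypothetical helper: returns every file under $dir as a sorted list of
 * paths relative to $dir, using forward slashes. Not the library's listFiles().
 */
function listRelativeFiles(string $dir): array
{
    $root = realpath($dir);
    if ($root === false || !is_dir($root)) {
        throw new InvalidArgumentException("Not a directory: {$dir}");
    }

    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    $files = [];
    foreach ($iterator as $file) {
        if (!$file->isFile()) {
            continue;
        }

        // Strip the root prefix and its trailing separator to get a relative path.
        $relative = substr($file->getPathname(), strlen($root) + 1);
        $relative = str_replace(DIRECTORY_SEPARATOR, '/', $relative);

        // Assumption: Git metadata should never be part of the commit.
        if (str_starts_with($relative, '.git/')) {
            continue;
        }

        $files[] = $relative;
    }

    sort($files);

    return $files;
}
```

Files such as .gitattributes are kept, since only paths inside the .git/ directory are filtered out.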

Result

Array
(
    [repository] => USER/test2
    [branch] => main
    [commit_message] => commit message
    [files] => Array
        (
            [0] => .gitattributes
            [1] => data/validation-00000-of-00001.parquet
            [2] => data/test-00000-of-00001.parquet
            [3] => data/train-00001-of-00002.parquet
            [4] => data/train-00000-of-00002.parquet
            [5] => README.md
        )
)
  • repository: The Hugging Face repository where the dataset was pushed.
  • branch: The branch where the commit was made.
  • commit_message: The message associated with the commit.
  • files: List of files included in the commit.
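
As an illustration of working with the returned array, the sketch below counts the Parquet shards per split from the files entry. The helper countShardsBySplit is hypothetical, and it assumes shard names follow the data/<split>-NNNNN-of-NNNNN.parquet pattern seen in the sample output above; files with other names are ignored.

```php
<?php
/**
 * Hypothetical helper: counts Parquet shards per split, assuming file names
 * of the form data/<split>-NNNNN-of-NNNNN.parquet.
 */
function countShardsBySplit(array $files): array
{
    $counts = [];
    foreach ($files as $path) {
        if (preg_match('#^data/([A-Za-z0-9_]+)-\d{5}-of-\d{5}\.parquet$#', $path, $m) === 1) {
            $counts[$m[1]] = ($counts[$m[1]] ?? 0) + 1;
        }
    }

    return $counts;
}

// Sample data copied from the result shown above.
$commit = [
    'repository' => 'USER/test2',
    'branch' => 'main',
    'commit_message' => 'commit message',
    'files' => [
        '.gitattributes',
        'data/validation-00000-of-00001.parquet',
        'data/test-00000-of-00001.parquet',
        'data/train-00001-of-00002.parquet',
        'data/train-00000-of-00002.parquet',
        'README.md',
    ],
];

foreach (countShardsBySplit($commit['files']) as $split => $count) {
    echo $split, ': ', $count, PHP_EOL;
}
```

A check like this can be used in a pipeline to confirm that every expected split was included before announcing a new dataset version.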

Use Cases

  • Dataset versioning: Manage different versions of datasets directly from your PHP applications.
  • Automated data pipelines: Integrate dataset pushes into your CI/CD workflows.
  • Collaborative datasets: Easily share and update datasets on Hugging Face.

Common Pitfalls