Local Cache Volumes

Running an AI training application with HwameiStor is straightforward. The following example deploys an Nginx application backed by a local cache volume.

In a real production environment, you would replace Nginx with the corresponding training Pod. This example is simplified for demonstration and focuses on how to load a dataset.

Before you begin, make sure Dragonfly is installed in your cluster and the related configuration is complete.

Install Dragonfly

  1. Configure /etc/hosts according to your cluster.

    $ vi /etc/hosts
    host1-IP hostName1
    host2-IP hostName2
    host3-IP hostName3
  2. Before installing the Dragonfly components, make sure a default StorageClass is configured, because it is required when creating the storage volumes.

    kubectl patch storageclass hwameistor-storage-lvm-hdd -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
  3. Install Dragonfly with helm.

    helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
    helm install --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly --version 1.1.63
  4. Configure dragonfly-dfdaemon.

    kubectl -n dragonfly-system get ds
    kubectl -n dragonfly-system edit ds dragonfly-dfdaemon
    ...
    spec:
      spec:
        containers:
        - image: docker.io/dragonflyoss/dfdaemon:v2.1.45
          ...
          securityContext:
            capabilities:
              add:
              - SYS_ADMIN
            privileged: true
          volumeMounts:
          ...
          - mountPath: /var/run
            name: host-run
          - mountPath: /mnt
            mountPropagation: Bidirectional
            name: host-mnt
          ...
        volumes:
        ...
        - hostPath:
            path: /var/run
            type: DirectoryOrCreate
          name: host-run
        - hostPath:
            path: /mnt
            type: DirectoryOrCreate
          name: host-mnt
        ...
  5. Install the dfget client command-line tool. Run on every node:

    wget https://github.com/dragonflyoss/Dragonfly2/releases/download/v2.1.44/dfget-2.1.44-linux-amd64.rpm
    rpm -ivh dfget-2.1.44-linux-amd64.rpm
  6. To avoid later problems, unset the default StorageClass configured above.

    kubectl patch storageclass hwameistor-storage-lvm-hdd -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
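The per-node dfget installation in step 5 can be scripted. A dry-run sketch, assuming the node names and passwordless ssh access in your environment (remove the leading echo to actually execute):

```shell
# Dry-run sketch: install dfget (step 5) on every node over ssh.
# NODES and passwordless ssh are assumptions about your environment;
# remove `echo` to execute for real.
NODES="hostName1 hostName2 hostName3"
VER=2.1.44
for n in $NODES; do
  echo ssh "$n" "wget https://github.com/dragonflyoss/Dragonfly2/releases/download/v${VER}/dfget-${VER}-linux-amd64.rpm && rpm -ivh dfget-${VER}-linux-amd64.rpm"
done
```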

Check Dragonfly

$ kubectl -n dragonfly-system get pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
dragonfly-dfdaemon-d2fzp 1/1 Running 0 19h 200.200.169.158 hwameistor-test-1 <none> <none>
dragonfly-dfdaemon-p7smf 1/1 Running 0 19h 200.200.29.171 hwameistor-test-3 <none> <none>
dragonfly-dfdaemon-tcwkr 1/1 Running 0 19h 200.200.39.71 hwameistor-test-2 <none> <none>
dragonfly-manager-5479bf9bc9-tp4g5 1/1 Running 1 (19h ago) 19h 200.200.29.174 hwameistor-test-3 <none> <none>
dragonfly-manager-5479bf9bc9-wpbr6 1/1 Running 0 19h 200.200.39.92 hwameistor-test-2 <none> <none>
dragonfly-manager-5479bf9bc9-zvrdj 1/1 Running 0 19h 200.200.169.142 hwameistor-test-1 <none> <none>
dragonfly-mysql-0 1/1 Running 0 19h 200.200.29.178 hwameistor-test-3 <none> <none>
dragonfly-redis-master-0 1/1 Running 0 19h 200.200.169.137 hwameistor-test-1 <none> <none>
dragonfly-redis-replicas-0 1/1 Running 0 19h 200.200.39.72 hwameistor-test-2 <none> <none>
dragonfly-redis-replicas-1 1/1 Running 0 19h 200.200.29.130 hwameistor-test-3 <none> <none>
dragonfly-redis-replicas-2 1/1 Running 0 19h 200.200.169.134 hwameistor-test-1 <none> <none>
dragonfly-scheduler-0 1/1 Running 0 19h 200.200.169.190 hwameistor-test-1 <none> <none>
dragonfly-scheduler-1 1/1 Running 0 19h 200.200.39.76 hwameistor-test-2 <none> <none>
dragonfly-scheduler-2 1/1 Running 0 19h 200.200.29.163 hwameistor-test-3 <none> <none>
dragonfly-seed-peer-0 1/1 Running 1 (19h ago) 19h 200.200.169.138 hwameistor-test-1 <none> <none>
dragonfly-seed-peer-1 1/1 Running 0 19h 200.200.39.80 hwameistor-test-2 <none> <none>
dragonfly-seed-peer-2 1/1 Running 0 19h 200.200.29.151 hwameistor-test-3 <none> <none>
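All pods should be Running. As a sketch, the STATUS column can also be checked programmatically; here the awk filter is fed a small inline sample rather than live cluster output:

```shell
# Count pods whose STATUS column is not "Running" (inline sample data).
sample="dragonfly-dfdaemon-d2fzp   1/1   Running   0   19h
dragonfly-manager-0        0/1   Pending   0   19h"
not_running=$(echo "$sample" | awk '$3 != "Running" { n++ } END { print n+0 }')
echo "$not_running"   # -> 1
```

Against a live cluster, kubectl can do the filtering itself, e.g. kubectl -n dragonfly-system get pod --field-selector=status.phase!=Running.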

View the DataSet

Take minio as an example:

apiVersion: datastore.io/v1alpha1
kind: DataSet
metadata:
  name: dataset-test
spec:
  refresh: true
  type: minio
  minio:
    endpoint: Your service ip address:9000
    bucket: BucketName/Dir # defined by the directory level where your dataset resides
    secretKey: minioadmin
    accessKey: minioadmin
    region: ap-southeast-2

Create the DataSet

kubectl apply -f dataset.yaml

Confirm that the cache volume has been created successfully.

$ kubectl get lv dataset-test
NAME POOL REPLICAS CAPACITY USED STATE PUBLISHED AGE
dataset-test LocalStorage_PoolHDD 3 1073741824 906514432 Ready 20d

$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
dataset-test 1Gi ROX Retain Bound default/hwameistor-dataset 20d

The size of the PV is determined by the size of your dataset; you can also configure it manually.
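If you size the PV manually, one approach is to round the USED bytes reported by kubectl get lv up to whole GiB; a minimal shell sketch (the helper name to_gi is made up for illustration):

```shell
# Round a byte count up to whole GiB to pick a PVC request size.
GIB=1073741824
to_gi() { echo $(( ($1 + GIB - 1) / GIB )); }
to_gi 906514432   # the USED value above -> 1, i.e. request 1Gi
```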

Create a PVC to Bind the PV

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hwameistor-dataset
  namespace: default
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi # dataset size
  volumeMode: Filesystem
  volumeName: dataset-test

Confirm that the PVC has been created successfully.

$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
hwameistor-dataset Bound dataset-test 1Gi ROX 20d

Create a StatefulSet

kubectl apply -f sts-nginx-AI.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx-dataload
  namespace: default
spec:
  serviceName: nginx-dataload
  replicas: 1
  selector:
    matchLabels:
      app: nginx-dataload
  template:
    metadata:
      labels:
        app: nginx-dataload
    spec:
      hostNetwork: true
      hostPID: true
      hostIPC: true
      containers:
      - name: nginx
        image: docker.io/library/nginx:latest
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        env:
        - name: DATASET_NAME
          value: dataset-test
        volumeMounts:
        - name: data
          mountPath: /data
        ports:
        - containerPort: 80
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: hwameistor-dataset
Info

Set claimName to the name of the PVC bound to the dataset, and set the environment variable DATASET_NAME to the name of the dataset.

View the Nginx Pod

$ kubectl get pod
NAME READY STATUS RESTARTS AGE
nginx-dataload-0 1/1 Running 0 3m58s

$ kubectl logs nginx-dataload-0 hwameistor-dataloader
Created custom resource
Custom resource deleted, exiting
DataLoad execution time: 1m20.24310857s

According to the log, loading the data took 1m20.24310857s.
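When comparing runs, the Go-style duration in the log can be converted to plain seconds; a small awk sketch, assuming the duration contains at most minutes and seconds:

```shell
# Convert a Go-style duration such as "1m20.24310857s" to seconds.
dur="1m20.24310857s"
secs=$(echo "$dur" | awk -F'[ms]' '{ if (NF == 3) print $1 * 60 + $2; else print $1 }')
echo "$secs"
```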

[Optional] Scale Nginx out to a 3-Node Cluster

HwameiStor cache volumes support StatefulSet horizontal scaling. Each Pod of the StatefulSet attaches and mounts a HwameiStor cache volume bound to the same dataset.

$ kubectl scale sts/nginx-dataload --replicas=3

$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE
nginx-dataload-0 1/1 Running 0 41m
nginx-dataload-1 1/1 Running 0 37m
nginx-dataload-2 1/1 Running 0 35m


$ kubectl logs nginx-dataload-1 hwameistor-dataloader
Created custom resource
Custom resource deleted, exiting
DataLoad execution time: 3.24310857s

$ kubectl logs nginx-dataload-2 hwameistor-dataloader
Created custom resource
Custom resource deleted, exiting
DataLoad execution time: 2.598923144s

According to the logs, the second and third data loads took only 3.24310857s and 2.598923144s, a large improvement over the first load.
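Using the durations from the logs above (80.24 s for the first load, then 3.24 s and 2.60 s from the cache), the rough speedup can be computed:

```shell
# Speedup of the cached loads relative to the first load (durations in
# seconds taken from the logs: 1m20.24310857s, 3.24310857s, 2.598923144s).
speedup=$(awk 'BEGIN { printf "%.1fx and %.1fx", 80.24310857 / 3.24310857, 80.24310857 / 2.598923144 }')
echo "$speedup"   # -> 24.7x and 30.9x
```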