# GKE, Airflow, Cloud Composer, and Persistent Volumes

Google Cloud Composer is a managed¹ version of Airflow that lets you schedule Docker images using KubernetesPodOperators. This is nice, except that there's curiously little clear documentation (or Stack Overflow discussion) on how to schedule a pod, mount a volume, and actually use that volume, which makes it annoying if you want to share data across pods (i.e. pod A does some work, writes to a volume, then pod B gets scheduled and mounts the same volume pod A used). I'll go through that process here.

## 1 Compute Engine

Create a disk in Google Compute Engine with the desired size. Note that you should not use a shared disk, as Kubernetes won't let you mount a volume that's already in use. You also cannot use the boot disk of a node pool.
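For instance, creating the disk from the command line might look like this (the disk name, size, and zone here are just placeholders; pick a zone that matches your cluster):

```shell
# Create a standalone persistent disk for the pods to share.
# Name, size, and zone are assumptions -- adjust to your setup.
gcloud compute disks create persist-pods-disk \
    --size=500GB \
    --zone=us-central1-a
```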

## 2 Persistent Volume Claim

Create a persistent volume and a persistent volume claim for the disk (https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/preexisting-pd), and run `kubectl apply -f existing-pd.yaml`. Note that you need both the volume (to register a fixed volume with Kubernetes) and the volume claim (to allow the pods to use the volume).

You'll see the disk in the cloud console (Kubernetes Engine->Storage).

For example, if I had a GCE disk named persist-pods-disk, the config would look like:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: persist-pods-disk-volume
spec:
  storageClassName: ""
  capacity:
    storage: 500G
  accessModes:
    - ReadWriteOnce  # a non-shared GCE disk can only be mounted read-write by one node
  gcePersistentDisk:
    pdName: persist-pods-disk
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persist-pods-disk-claim
spec:
  # It's necessary to specify "" as the storageClassName
  # so that the default storage class won't be used, see
  # https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1
  storageClassName: ""
  volumeName: persist-pods-disk-volume
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500G
```
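After running kubectl apply, you can sanity-check that the claim actually bound to the volume (the names follow the yaml above):

```shell
# Both should report STATUS "Bound" once the claim matches the volume.
kubectl get pv persist-pods-disk-volume
kubectl get pvc persist-pods-disk-claim
```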


## 3 The DAG

Update your DAG with a volume and a volume mount (https://airflow.apache.org/kubernetes.html).

For example, if I wanted to mount the disk into the pods at /files, using the configuration given above:

```python
from airflow.contrib.kubernetes.volume import Volume
from airflow.contrib.kubernetes.volume_mount import VolumeMount
from airflow.contrib.operators import kubernetes_pod_operator

volume_mount = VolumeMount('persist-disk',
                           mount_path='/files',
                           sub_path=None,
                           read_only=False)

volume_config = {
    'persistentVolumeClaim': {
        'claimName': 'persist-pods-disk-claim'  # the PersistentVolumeClaim name from the Kube yaml
    }
}
volume = Volume(name='persist-disk', configs=volume_config)  # this name is the literal volume name in the pod yaml

# ... other stuff

operator = kubernetes_pod_operator.KubernetesPodOperator(
    task_id=name,  # task_id is required by every Airflow operator
    name=name,
    namespace='default',
    image=image,
    image_pull_policy='Always',
    retries=retries,
    arguments=arguments,
    affinity={
        'nodeAffinity': {
            'requiredDuringSchedulingIgnoredDuringExecution': {
                'nodeSelectorTerms': [{
                    'matchExpressions': [{
                        # assuming here that you're pinning pods to a node pool
                        'key': 'cloud.google.com/gke-nodepool',
                        'operator': 'In',
                        'values': [
                            affinity,
                        ],
                    }]
                }]
            }
        }
    },
    volumes=[volume],
    volume_mounts=[volume_mount],
)
```
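Once the DAG has run, one way to confirm that pods are actually attaching the disk is to describe the claim; the output's events and "Mounted By" field show which pods have used it:

```shell
# Inspect the claim to see which pods have mounted it.
kubectl describe pvc persist-pods-disk-claim
```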


## Footnotes:

¹ I say it's managed, but it really just deploys a Kube cluster and still expects you to fiddle with the Airflow internals yourself.

Posted: 2018-09-30
Filed Under: GCP