When you switch from PaaS to IaaS, you suddenly have a series of new responsibilities. There is a lot to learn and also a lot to mimic from existing PaaS providers. We would like to share our experience after successfully migrating four large Drupal sites from Pantheon to the Azure cloud. The overview is in chronological order, detailing how we proceeded with the implementation.

Infrastructure Basics

We worked with an excellent infrastructure team who provisioned the following architecture for us:

  • App Services: For running the Drupal/PHP applications.
  • Managed MySQL Servers: One for each environment.
  • Azure Blob Storage: For storing public and private Drupal files, mounted to the App Service container via NFS.
  • Azure FrontDoor: As a CDN and Web Application Firewall (WAF) for improved performance and security (shared across all projects and environments for the sake of cost efficiency).
  • Azure Container Registry (ACR): For managing Docker images. Pushing a tag to ACR triggers a deployment to the App Service.

Forging the Right Docker Image

After you select a base image (Alpine-based, Ubuntu-based, or one with an already configured web stack, depending on your needs), you will likely need to make modifications on top of it. The most important lesson we learned is to keep the images as small as possible, but that alone is not enough:

  • Do not sync unneeded parts of the application to the image (Git repo metadata, theme compilation files, READMEs, CHANGELOGs, and so on). You can consider keeping the source code outside the image, if that’s feasible in your infrastructure setup.
  • Get rid of all the cached installation files (apt, npm, apk caches). For example: apk add --no-cache [package name].
  • Construct the layers in a way that they can be cached effectively.
  • When installing additional software packages of any kind, always pin version numbers to an exact value. You don’t want surprises just before a deployment (see the sketch after this list).
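
To make these points concrete, here is a trimmed-down sketch of such a Dockerfile; the base image, package names, and version numbers are placeholders, not our actual stack:

# Pin the base image to an exact, known-good release.
FROM alpine:3.19

# Install the web stack with pinned versions and without keeping the apk cache.
# The package names and versions are placeholders for illustration.
RUN apk add --no-cache \
    nginx=1.24.0-r16 \
    php82=8.2.17-r0

# Copy only the application code; Git metadata, READMEs, theme build
# artifacts and similar files are excluded via .dockerignore.
COPY . /app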

Security from the Bottom to the Top

When you have a custom Docker image, you need to worry about SSL weaknesses, zero-day exploits in PHP, buffer overflows in Nginx and segmentation faults in ImageMagick.

A naive approach would be to rely on the maintainer of the base Docker image to provide updates and let the base image worry about them. It does not usually work like that: typically, only new base OS releases are reflected in the derivative image. For all the other incoming updates, you end up relying on the package manager:

# Apply security and other updates.
RUN apk -U upgrade

But how will you know when it’s time to rebuild your Docker image? You can decide to push new images daily, weekly, or on demand when there’s a vulnerability. Due to the number of packages involved, this calls for automation. With Trivy, you can scan a Docker image for vulnerabilities and other misconfigurations, which gives you a signal that suits your way of managing the project. This way, we could be confident that we did not lag behind on any critical security updates.
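
As a minimal sketch, a scheduled CI job could run something like the following and fail when serious issues are found; the image name is a placeholder:

# Scan the image; a non-zero exit code fails the pipeline when
# HIGH or CRITICAL vulnerabilities are found.
trivy image --severity HIGH,CRITICAL --exit-code 1 myregistry.azurecr.io/mysite:latest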

This is just one part of securing the stack. On a higher level, Drupal core and module upgrades need to happen; on an even higher level, there’s the WAF. And finally, for the host system, we can rely on Azure to keep it secure.

Tooling

Remote Drush and SSH

Getting a one-time login link, showing Drupal watchdog logs, checking status reports - all these can be done comfortably from the command line, if you have the tooling for it. On Pantheon, we enjoy working with terminus, which provides remote Drush execution and much more. To some extent, we strived to replicate that functionality. Azure provides a generic tool as a foundation, [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/). Building on top of it, remote Drush execution is not that hard to replicate.

az webapp create-remote-connection

It provides a standard SSH tunnel, which you can use for command execution. We turned it into a custom DDEV command.

#!/bin/bash

# Azure App Service SSH connection details:
# RESOURCE_GROUP and WEBAPP_NAME come from DDEV environment variables.

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
source "$SCRIPT_DIR/az-env-select"

# Function to output debug messages
debug() {
    if [[ $DEBUG == 1 ]]; then
        echo "DEBUG: $1"
    fi
}

# Function to establish SSH tunnel
establish_tunnel() {
    debug "Establishing SSH tunnel..."
    az webapp create-remote-connection --resource-group $RESOURCE_GROUP -n $WEBAPP_NAME --port $SSH_PORT &
    TUNNEL_PID=$!
    wait_for_port
    # Remove any stale host key for the forwarded local port.
    ssh-keygen -f "/home/.ssh-agent/known_hosts" -R "[localhost]:$SSH_PORT"
}

# Function to wait until the forwarded SSH port accepts connections
wait_for_port() {
    debug "Waiting for port $SSH_PORT to be open..."
    while ! nc -z localhost $SSH_PORT; do
        sleep 1
    done
    debug "Port $SSH_PORT is open."
}

# Function to close the SSH tunnel
close_tunnel() {
    debug "Closing SSH tunnel..."
    kill $TUNNEL_PID
}

establish_tunnel

debug "Executing drush command..."
echo "cd /app && ./vendor/bin/drush --uri="'$PUBLIC_URL'" $@" | sshpass -p "$SSH_PASSWORD" ssh $SSH_USER@localhost -p $SSH_PORT

close_tunnel

Logs

By default, if you do not use a special logger service, all logs (access logs, PHP error logs, and so on) end up on the standard output and standard error of the container. It is a matter of taste whether evaluating logs is faster from the Azure portal or from the command line, but we also created a DDEV command for this purpose, where the essence again comes from the Azure CLI:

az webapp log download --resource-group="$RESOURCE_GROUP" --name="$WEBAPP_NAME" --log-file=/tmp/webapp.zip

Once you have the logs locally, you can feed them into any log parser, or just use grep to locate something quickly.
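
For example, a quick hunt for PHP errors in the downloaded archive could look like this; the paths and the search pattern are just examples:

# Unpack the downloaded log archive and search it for PHP errors.
unzip -o /tmp/webapp.zip -d /tmp/webapp-logs
grep -ri "php fatal" /tmp/webapp-logs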

File Synchronization

Any web application should have at least three instances at the same time: development, testing, and live, so that all the stakeholders can work without too much friction.

In a containerized environment, the public (and private) filesystem of Drupal needs to live in an external managed location like Azure Blob Storage. For the three environments, you end up with three independent volumes, buckets and shares.

When the content editors do their work, new files accumulate quickly, and you need to be able to sync the environments, database and files together. This can be a long process if you end up with gigabytes of data, but here Azure provides azcopy, which is really efficient, as it is multi-threaded and supports all kinds of micro-optimizations for various scenarios.

As you can guess, we have custom DDEV commands around this. The key piece is to make the tool available within DDEV:

#!/bin/bash

# Install azcopy into ~/.local/bin if it is not present yet.
if [[ ! -f ~/.local/bin/azcopy ]]; then
  wget https://aka.ms/downloadazcopy-v10-linux
  tar -xvf downloadazcopy-v10-linux
  rm downloadazcopy-v10-linux
  if [[ ! -d ~/.local/bin/ ]]; then
    mkdir -p ~/.local/bin/
  fi
  mv ./azcopy_linux_amd64_*/azcopy ~/.local/bin/
  chmod +x ~/.local/bin/azcopy
  rm -rf ./azcopy_linux_amd64_*
fi

The rest depends on the exact infrastructure: one command syncs all the files to the local DDEV instance, another copies data between environments. In the latter case, the data itself does not travel through your local computer, only the metadata, even if the shares live in different subscriptions.
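
As a sketch, a copy between the file shares of two environments could look like the following; the storage accounts, container names and SAS tokens are placeholders:

# Copy the Drupal public files from the test share to the dev share.
# The data flows between the two storage accounts directly, not through this machine.
azcopy copy \
  "https://teststorage.blob.core.windows.net/public-files?<SAS_TOKEN>" \
  "https://devstorage.blob.core.windows.net/public-files?<SAS_TOKEN>" \
  --recursive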

An Azure-specific anomaly we experienced is that filenames on shared folders/volumes are case-insensitive, which can cause data loss. Imagine you have filenames like:

  • thumb.jpg
  • thumb.JPG

One of your images would be lost after copying to such a filesystem. Think ahead and avoid tedious manual restoration: if you can, choose case-sensitive storage alongside your container; if that is not possible, you can use a script to detect collisions and rename the affected files ahead of the migration, both in the filesystem and in the managed files table in the database.
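
A minimal detection sketch, run from the Drupal files directory before the migration, could be as simple as listing the paths that collide after lowercasing; renaming the files and updating the managed files table still has to follow:

# List lowercased paths that occur more than once, i.e. would collide
# on a case-insensitive filesystem.
find . -type f | tr '[:upper:]' '[:lower:]' | sort | uniq -d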

CDN and WAF

One of the four sites received frequent attacks, which slowed down the site and even caused downtime. We specifically asked for a web application firewall in this case, and the infrastructure team provisioned FrontDoor as a CDN in conjunction with a WAF.

For anonymous visitors, FrontDoor can serve fully cached pages. All we need to do is configure rules to avoid caching when it’s undesired (e.g. for logged-in visitors).

With the proper ruleset, we can ask FrontDoor to turn off caching when the Cookie header matches the (NO_CACHE|S+ESS[a-z0-9]+|PHPSESSID|SimpleSAML[A-Za-z]+)= regex. It was easy to follow the Pantheon docs to find the right solution. Your application might need other cache rules as well. Typically, all you need to do is issue the proper HTTP header and FrontDoor will respect it.

[Figure: Caching rule for FrontDoor]
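
To verify that FrontDoor respects the headers your application emits, a quick check from the outside might look like this; the domain is a placeholder, and the exact cache status header FrontDoor returns may vary:

# Inspect the response headers for cache-related information.
curl -sI https://www.example.com/ | grep -iE "cache-control|x-cache"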

We cannot disclose the firewall ruleset for security reasons. It took weeks of fine-tuning until content authors had no issues at all entering content on the site. What can be said publicly is: don’t forget about rate limiting your traffic. If someone sends too many requests too fast, that’s not something you want. You can deliberately calibrate a rate that does not block content authoring, but does block bad actors.

Azure recently introduced JavaScript challenges for WAF rules. They can be used effectively together with rate limiting: you may define a lower rate limit, as Azure can gracefully try to determine whether the visitor is malicious or not.

Useful FrontDoor Log Queries

The WAF can be very effective in protecting your web application, but even if you fine-tune carefully, chances are that content authors will find a way to be blocked on legitimate content.

There’s a way to query FrontDoor logs with an SQL-like query language called Kusto.

If your site is under attack, you can use:

AzureDiagnostics
| where Category == "FrontDoorWebApplicationFirewallLog"
| where action_s == "Block"
| project TimeGenerated, requestUri_s, clientIP_s
| summarize count() by clientIP_s
| order by count_ desc

This shows the IPs that have violated the rules most frequently; after checking that they are not, for instance, legitimate search engines, those IPs can typically be blocked completely with a custom rule. In some cases this is not powerful enough, and you might need to temporarily configure FrontDoor to log all requests, not just blocked ones, and drop the action_s condition from the query.

If an attacker is smart enough not to violate any of the rules, you cannot catch them with the above query as-is. For very large sites with frequent attacks, blocking the offenders can even be automated with a Logic App.

If the content authors complain that something is blocked but should not be:

AzureDiagnostics
| where Category == "FrontDoorWebApplicationFirewallLog"
| where action_s == "Block"
| where not (details_msg_s contains "bot")
| where not(ruleName_s contains "BotManager")
| where (requestUri_s contains "/edit" or requestUri_s contains "/node" or requestUri_s contains "/admin/")
| where not(requestUri_s contains "node_modules")
| project TimeGenerated, host_s, clientIP_s, requestUri_s, details_msg_s, ruleName_s, action_s, policy_s, details_matches_s

This might be useful, with some tweaking, to locate the offending request. Of course, if you do not set a custom error message for the WAF block scenario, you can use the special ID, the transaction ID, to locate the problematic requests.

With the flexibility of this query language, you can easily adapt the rules to the actual traffic you have.

We learned that using a JS challenge instead of blocking traffic outright is useful when something is merely suspicious, but not obviously malicious. One particular example is to conditionally challenge traffic this way from countries that are not the target audience of the website.

Releases and Deploy Strategy

As we use Travis CI on other projects to orchestrate deployments, we followed the same path here as well, and it worked well. For security reasons, the only thing that Travis does is push to the ACR.

After the new container is loaded into the App Service, it executes a series of Drush commands to update the database, apply configuration changes, and so on. This is what we had already tried elsewhere, and it worked here as well. However, we learned during the first few deployments that the process is more fragile when you push a Docker image. Some tips:

  • Freeze the base Docker image version using its digest (hash).
  • Always deploy the same Docker image to production as the one you tested on non-production first. Just copy the image itself; do not rebuild. Rebuilding is a fragile process, as you typically need to copy various assets from different network locations. The build could fail or, worse, a bogus Docker image could be generated accidentally.
  • Provide a way to roll back quickly. Let’s say you need to push the latest tag to trigger the deployment. Push another tag as well, for example with the Unix timestamp, so when the next deployment goes wrong for whatever reason, you will be able to quickly go back in time. It’s best to automate this process (see the sketch after this list).
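
A sketch of such a promotion and tagging step could look like the following; the registry and image names are our own placeholders:

#!/bin/bash
set -euo pipefail

REGISTRY="myregistry.azurecr.io"
IMAGE="mysite"
STAMP="$(date +%s)"

# Promote the image that was already tested on non-production:
# pull it, retag it and push it. No rebuild happens here.
docker pull "$REGISTRY/$IMAGE:test"
docker tag "$REGISTRY/$IMAGE:test" "$REGISTRY/$IMAGE:latest"
docker tag "$REGISTRY/$IMAGE:test" "$REGISTRY/$IMAGE:$STAMP"

# The timestamped tag makes it possible to roll back quickly later.
docker push "$REGISTRY/$IMAGE:$STAMP"
# Pushing the latest tag triggers the deployment to the App Service.
docker push "$REGISTRY/$IMAGE:latest"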

Backups

Any infrastructure team will consider backups as something to work on, but from the perspective of a Drupal developer, the requirements can differ. Azure, AWS and other cloud providers offer perfect snapshot-based backups of the managed MySQL instance that can be restored within the cloud easily, but it is hard or complicated to copy them to your local instance (you can always set up a temporary MySQL instance and do a mysqldump from it as a last resort).

However, in addition to the snapshots, you need textual SQL backups that ddev import-db can digest. Inside Azure, either a VM on the side can orchestrate the backup (invoking `mysqldump` and storing the result in Blob Storage), or a Logic App can do it on a recurring schedule.
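
For the VM-based variant, a sketch of such a recurring job could be the following; the hostname, user, database name and destination URL are placeholders, and the credentials would come from a secret store in practice:

#!/bin/bash
set -euo pipefail

TIMESTAMP="$(date +%Y-%m-%d-%H%M)"
DUMP_FILE="/tmp/drupal-$TIMESTAMP.sql.gz"

# Dump the managed MySQL database into a compressed, textual SQL file
# that ddev import-db can digest later.
mysqldump \
  --host="mysite-mysql.mysql.database.azure.com" \
  --user="backup_user" \
  --password="$MYSQL_PASSWORD" \
  --single-transaction \
  drupal | gzip > "$DUMP_FILE"

# Upload the dump to Blob Storage; the SAS token in the URL is a placeholder.
azcopy copy "$DUMP_FILE" \
  "https://backupstorage.blob.core.windows.net/db-backups/?<SAS_TOKEN>"

rm "$DUMP_FILE"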

There’s a module for this, Backup and Migrate, but it does not scale, so it is only useful for small sites. Otherwise, various PHP limits will prevent you from completing the backup.

Takeaways

As these were complex projects, it was the right choice to use custom Docker images. But as we previously wrote, a higher-level managed stack might be more suitable for your Drupal app. Fewer responsibilities can save costs, but it comes at the price of less freedom, of course. Owning almost all the layers below Drupal is challenging, so make sure you have the capacity for that.

Áron Novák

Mariano D'Agostino