Troubleshooting Docker

Abstract

Shows some troubleshooting techniques to use when you work with Sitecore and Docker.

This topic has advice for troubleshooting problems with Docker Desktop for Windows and Docker-based Sitecore development. The Docker tools and resources topic links to a number of resources and community sites where you can also find troubleshooting discussions.

  • Check the logs: The logs are the first place to look. Depending on the issue, check the logs of a container or the engine logs. For accessing container logs, see the Sitecore Docker cheat sheet. You can also view logs in Docker Desktop (the Dashboard) and with the tools linked above.

    For Sitecore CM and CD images, not all built-in Sitecore log files stream by default. You can add them to the LogMonitor config (c:\LogMonitor\LogMonitorConfig.json), but this might result in excessive output. It is helpful to bind mount the Sitecore log folder, as seen in the Docker Examples repository:

    cm:
     [...]
     volumes:
       - ${LOCAL_DATA_PATH}\cm:C:\inetpub\wwwroot\App_Data\logs

    The Docker engine (daemon) logs are at C:\Users\%USERNAME%\AppData\Local\Docker.
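
    If you choose the LogMonitor route instead, Sitecore's Windows images use Microsoft's LogMonitor tool, and additional log files are added as file sources in LogMonitorConfig.json. A sketch of such a source entry (the filter pattern is an assumption; adjust it to your log file names, and keep the existing sources in the file):

    ```json
    {
      "LogConfig": {
        "sources": [
          {
            "type": "File",
            "directory": "c:\\inetpub\\wwwroot\\App_Data\\logs",
            "filter": "log.*.txt",
            "includeSubdirectories": false
          }
        ]
      }
    }
    ```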

  • Restart Docker Desktop: Restarting Docker Desktop often resolves an issue. You can restart it from the Docker item (the whale icon) in the Windows system tray.

  • Clean mapped volume data: If your containers use mapped volumes for persistent storage, your issue can stem from stale data in these folders. The default Sitecore configuration enables this for the mssql and solr services. Make sure your instance is down (that is, docker-compose down), then delete the files in the mounted folders manually or with a clean script (see the clean.ps1 examples in the Docker Examples repository).

  • Prune Docker resources: If you have not done so recently, clean up unused Docker resources. At a minimum, this is a good daily habit to get into to free up disk space:

    docker system prune

    See the Sitecore Docker cheat sheet for more details.
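
    To see how much space a prune can reclaim before running it, you can check Docker's disk usage first. These are standard Docker CLI commands; note that the --volumes flag also deletes unused volumes, including any persistent data in them:

    ```powershell
    # Summarize disk usage by images, containers, local volumes, and build cache
    docker system df

    # Remove stopped containers, dangling images, and unused networks
    docker system prune

    # Additionally remove unused volumes (destroys their persistent data!)
    docker system prune --volumes
    ```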

  • Restart PC: A system reboot can solve some problems.

  • Upgrade to the latest Docker Desktop for Windows: Docker continuously releases new versions of Docker Desktop for Windows with bug fixes and improvements. You can check for updates with the Docker item (whale icon) in the Windows system tray.

  • Reset Docker Desktop to factory defaults: This resets all options of Docker Desktop to their initial state, the same as when Docker Desktop was first installed. You can do this from the Troubleshoot option of the Docker item (whale icon) in the Windows system tray.

Sitecore recommends 32GB of memory for developer workstations when you work with Sitecore Containers, with 16GB as the minimum. If you are encountering errors or performance problems due to insufficient system memory, you can attempt to reduce memory usage of your environment with these techniques:

  • Run an XM1 or XP0 topology instead of XP1.

  • Run an XM0 topology, that is, a modified XM1 topology (for development only). You do this by setting the redis and cd services to scale: 0 and the cm service to standalone mode via the Sitecore_AppSettings_role:define environment variable:

    redis:
      [...]
      scale: 0
    cd:
      [...]
      scale: 0
    cm:
      [...]
      environment:
        Sitecore_AppSettings_role:define: Standalone

  • If you run a Windows 10 version that allows running process isolation with 1909 containers, switch to the 1909-based Sitecore containers and process isolation via the SITECORE_VERSION and ISOLATION environment variables:

    SITECORE_VERSION=10.0.0-1909
    ISOLATION=process

  • Set memory limits for individual containers. Docker uses 1GB by default, but you can reduce that for certain services. However, do not do it for the mssql or solr services.

  • Disable services for containers you do not need. For example, in an XP1 topology, disable xdbautomation, xdbautomationrpt, and xdbautomationworker if you do not use the Marketing Automation Engine, or disable cortexprocessing, cortexreporting, and cortexprocessingworker if you do not use Cortex. You do this by setting the services to scale: 0 and removing any depends_on conditions.
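
    Following the same pattern, a sketch of disabling the Marketing Automation services in a docker-compose.override.yml (remember to remove any depends_on conditions that reference these services elsewhere in the file):

    ```yaml
    xdbautomation:
      [...]
      scale: 0
    xdbautomationrpt:
      [...]
      scale: 0
    xdbautomationworker:
      [...]
      scale: 0
    ```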

You can try excluding the Docker data directory (%ProgramData%\docker) from antivirus software scanning. See https://docs.docker.com/engine/security/antivirus/ for more information.

Problems with container network connectivity can manifest in a number of ways, for example, as a connection error during a NuGet restore operation.

From https://github.com/docker/for-win/issues/2760#issuecomment-430889666:

This often happens when there are multiple networking adapters (Ethernet, Wi-Fi) present on the host. You must set the priority of these adapters properly in order for the Windows networking stack to correctly choose gateway routes. You can fix this by setting your primary internet-connected networking adapter to have the lowest InterfaceMetric value. First, list the adapters and their current metrics:

Get-NetIPInterface -AddressFamily IPv4 | Sort-Object -Property InterfaceMetric -Descending

Use this command to make the change (this example assumes primary adapter InterfaceAlias is 'Wi-Fi'):

Set-NetIPInterface -InterfaceAlias 'Wi-Fi' -InterfaceMetric 3

If your host's primary network adapter is bridged because you have an External virtual switch setup in Hyper-V, set the external virtual switch to have the lowest InterfaceMetric value.

You can verify your routing tables by using this command (the last line should show the primary adapter's gateway address along with its ifMetric value):

Get-NetRoute -AddressFamily IPv4

You might see an error like the following when you attempt to start your Sitecore environment:

ERROR: for myproject_traefik_1  Cannot start service traefik: failed to create endpoint myproject_traefik_1 on network myproject_default: failed during hnsCallRawResponse: hnsCall failed in Win32: The process cannot access the file because it is being used by another process. (0x20)

This error indicates that you need to stop IIS or some other process that uses a required port. For a complete list of required ports, see Run your first Sitecore instance.
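
To find which process is holding a port, you can query it with standard Windows PowerShell cmdlets (port 443 is used here as an example; check each port from the required list):

```powershell
# Find the process that is listening on a given local port (example: 443)
Get-NetTCPConnection -LocalPort 443 -State Listen |
    Select-Object -ExpandProperty OwningProcess -Unique |
    ForEach-Object { Get-Process -Id $_ }

# If IIS is the owner, stop it before starting your Sitecore environment
iisreset /stop
```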

Your SQL_SA_PASSWORD value must meet the strong password requirements that SQL Server specifies. A password that does not meet these requirements causes errors such as:

  • Unhealthy container state in your environment (XConnect, CM, others), and startup errors such as:

    ERROR: for traefik  Container <id> is unhealthy.
    ERROR: Encountered errors while bringing up the project.

  • Errors in the XConnect container logs:

    Microsoft.Azure.SqlDatabase.ElasticScale.ShardManagement.ShardManagementException: Store Error: Login failed for user 'sa'.. The error occurred while attempting to perform the underlying storage operation during 'Microsoft.Azure.SqlDatabase.ElasticScale.ShardManagement.StoreException: Error occurred while performing store operation. See the inner SqlException for details. ---> System.Data.SqlClient.SqlException: Login failed for user 'sa'.

  • Errors in the SQL Server container logs:

    VERBOSE: Changing SA login credentials
    Msg 15118, Level 16, State 1, Server 96FAC1ED734A, Line 1
    Password validation failed. The password does not meet the operating system policy requirements because it is not complex enough.

    Change the SQL password in SQL_SA_PASSWORD to fit the default SQL Server policy. After changing the password in the .env file, clear the mounted SQL data folder after running docker-compose down. You can manually delete its contents, or use a clean script (see the clean.ps1 example in the Docker Examples repository).

You usually see this error after you use docker-compose up. Certain services might fail to run, and instead report a Created status when you check with docker ps -a.

To resolve this issue, try increasing the memory limit for problematic containers in your Docker Compose file. Docker defaults to 1GB, but that can be too little for some services. For example, the containers for mssql and solr services might require 2GB:

mssql:
  [...]
  mem_limit: 2GB
solr:
  [...]
  mem_limit: 2GB

This can occur when persistent SQL data storage is enabled (via a mounted volume in your Docker Compose configuration) and you changed the Sitecore admin password (the SITECORE_ADMIN_PASSWORD variable) in your .env file. The password is set when the database files are initially created, so it is outdated if your instance was run (that is, with docker-compose up) before the password change.

The default Sitecore configuration has this enabled, with the volume mounted to the mssql-data folder:

mssql:
  [...]
  volumes:
    - type: bind
      source: .\mssql-data
      target: c:\data

To resolve, ensure your instance is down (that is, docker-compose down), and delete the files in the mounted folder. You can delete these manually, or use a clean script (see the clean.ps1 example in the Docker Examples repository).
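
A minimal sketch of such a clean step, assuming the volume is mounted to a mssql-data folder next to your Docker Compose file as shown above (adjust the path to your own volume mapping):

```powershell
# Run only after the instance is down (docker-compose down),
# or SQL Server may still hold locks on the files.
$dataFolder = ".\mssql-data"
if (Test-Path $dataFolder) {
    Get-ChildItem -Path $dataFolder -Recurse | Remove-Item -Recurse -Force
}
```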

You might see the following error when you start containers or building images:

hcsshim::PrepareLayer - failed failed in win32 : Incorrect function. (0x1)

This is often caused by incompatible drivers from tools such as Box, Dropbox, or OneDrive. See the discussion for this Docker Desktop for Windows issue on GitHub for potential workarounds and solutions.
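
To check whether such a driver is loaded, you can list the active filesystem minifilter drivers from an elevated prompt and look for entries belonging to file-sync tools:

```powershell
# Lists loaded filesystem minifilter drivers (requires an elevated prompt)
fltmc filters
```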

You might see the following error when you start containers or build images:

failed to shutdown container: container 45917373d49ed4130f7c7ac16f19f59379c1c98d0c429cc806a6f292d6792286 encountered an error during hcsshim::System::Shutdown: failure in a Windows system call: The connection with the virtual machine or container was closed. (0xc037010a): subsequent terminate failed container 45917373d49ed4130f7c7ac16f19f59379c1c98d0c429cc806a6f292d6792286 encountered an error during hcsshim::System::waitBackground: failure in a Windows system call: The connection with the virtual machine or container was closed. (0xc037010a)

Restarting Docker Desktop resolves this issue most of the time, but you might also need to reboot your machine.

Follow the progress for this Docker Desktop for Windows issue on GitHub.

Certain firewall configurations prevent containers that use process isolation from communicating with each other. Symptoms include unhealthy containers and network communication errors between containers when you inspect the logs.

It can be difficult to find the specific firewall conflict. As a workaround, you can set the individual affected containers to default isolation (for example in your docker-compose.override.yml):

solr:
  [...]
  isolation: default