There is very little reason for a build node to need to pull 200 images in 6 hours, and here is why:
When a machine issues a ``docker build`` command, the program reads the relevant Dockerfile to check for any base images that need to be pulled (named in its "FROM" instructions).
These base images are identified based on the image repository, image name, and image tag. The first thing docker does is it checks its local registry and tries to find a match for the base image the docker build is requesting. If a matching image is located in the local registry, it uses that one in lieu of downloading the image.
This is significant - if your organization only uses a few dozen base images from DockerHub, those images will only be downloaded by each build node _once_, then never again.
Many docker users erroneously believe that if their Dockerfile requests a "latest" tagged image, docker build will always download the newest version of the image.
However, the "latest" tag is literally just a tag, it doesn't have any special functionality built in. If the docker build command finds an image tagged "latest" in the local registry, it stops there.
The only way to get docker build to always use the "actual latest" version of the base image is to add the "--pull" parameter to the docker build command. This arg will tell docker build to check the repository remote to see if the SHA hash of the image tagged "latest" has changed, and if so, re-download and use it. In the absolute worst case, this means each build node will pull 1 copy of each base image when the base image is updated. So unless you use 200 different base images that all have updates deployed to Dockerhub each and every day, you are fine.
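To make the lookup order concrete, here is a toy model of it. It is not Docker's code: `resolve_base_image`, the dict-based `local` and `remote` stores, and the digests are all made up for illustration.

```python
def resolve_base_image(ref, local, remote, pull=False):
    """Toy model of how `docker build` picks a base image.

    local/remote map an image reference (e.g. "ubuntu:latest") to a digest.
    Returns (digest_used, downloaded).
    """
    if ref in local and not pull:
        # A local match wins, even for "latest": no network access at all.
        return local[ref], False
    remote_digest = remote[ref]          # stands in for a manifest request
    if local.get(ref) == remote_digest:
        return remote_digest, False      # --pull, but nothing changed upstream
    local[ref] = remote_digest           # stands in for downloading the layers
    return remote_digest, True


local = {"ubuntu:latest": "sha256:old"}
remote = {"ubuntu:latest": "sha256:new"}

print(resolve_base_image("ubuntu:latest", local, remote))             # stale local copy is used
print(resolve_base_image("ubuntu:latest", local, remote, pull=True))  # changed upstream, re-downloaded
print(resolve_base_image("ubuntu:latest", local, remote, pull=True))  # already current, no download
```

Note that in this model the `pull=True` path always contacts the remote, even when nothing is downloaded.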
> Docker defines pull rate limits as the number of manifest requests to Docker Hub.
> For example, if you already have the image, the Docker Engine client will issue a manifest request, realize it has all of the referenced layers based on the returned manifest, and stop. ... <excluded> ... So an image pull is actually one or two manifest requests,
This still implies that even if you are appropriately re-using layers on your machine, with a free plan you can do at most 200 builds per 6 hours (since docker still needs to verify it has the image)?
This change also seems to imply that build steps which previously did not handle/require authentication against Docker Hub (because they only pulled public images and pushed elsewhere) will now be required to auth against Docker Hub in order to double the number of pulls/checks/builds they are allowed?
This is an excellent point. Trying to find out if docker build --pull without an accompanying blob download will trigger the rate limiter.
If it does, then this will definitely be a reason to riot. It will effectively mean that anyone who wants to do more than 200 builds every 6 hours using the "right" way will have to get a docker pro subscription.
Yes, but Docker achieved their goal of making it annoying as hell to not use DockerHub. You can run your own private repo just fine, but what you want is a transparent proxy (like apt-cacher) that will let you pretend you’re using DH while actually pulling from either the cache or your private repos. All the pieces are there with private repos and “pullthrough” proxies; they’re just not well integrated, seemingly on purpose.
RedHat’s patches to Docker make this possible but Docker has refused to upstream it.
Agreed. Third party package repositories have been a weak point in our CI, and we put all of them behind a self-hosted proxy that we can manage in our own HA fashion. Turns out we get faster pulls from it, as well as being a good internet citizen.
I admittedly have only used Docker very little, but how exactly does someone manage to build images once every 108 seconds continuously for 6 hours? That sounds extreme.
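For reference, the 108-second figure is just the limit spread evenly over the window. A quick check follows; the node counts are invented for illustration, and the idea that all nodes share one limit (for example, by sitting behind one NAT address) is an assumption, not something stated in the thread.

```python
WINDOW_SECONDS = 6 * 60 * 60   # the 6-hour window from the thread
LIMIT = 200                    # requests allowed per window

# Seconds between requests if they were spread perfectly evenly.
print(WINDOW_SECONDS / LIMIT)


def builds_per_node(nodes, limit=LIMIT):
    """Builds each node gets per window if every build costs one
    manifest request and all nodes draw from the same limit."""
    return limit // nodes


for nodes in (1, 10, 50):
    print(nodes, builds_per_node(nodes))
```

With a shared limit, the per-node budget shrinks quickly as the fleet grows, which is how a modest CI setup could reach the cap without any single machine building every 108 seconds.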
> The first thing docker does is it checks its local registry and tries to find a match for the base image the docker build is requesting. If a matching image is located in the local registry, it uses that one in lieu of downloading the image.
While I agree that this is the way it's supposed to work, I have unfortunately worked at companies with "stateless" build/CI servers that download the Docker image each build.
> While I agree that this is the way it's supposed to work, I have unfortunately worked at companies with "stateless" build/CI servers that download the Docker image each build.
Couldn't they remain stateless but be redirected through a caching proxy? Memoization is not contrary to statelessness.
Is that going to actually help with the manifest-based rate limits? It sounds like it only caches the layers, the manifest metadata for a tag is not cached.
> When a pull is attempted with a tag, the Registry checks the remote to ensure if it has the latest version of the requested content. Otherwise, it fetches and caches the latest content.
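Read literally, the quoted behavior means a pull-through cache saves layer downloads but still asks the remote about every tag pull. A toy model of that reading is below; `PullThroughCache` and its counters are hypothetical and do not reflect the real registry implementation.

```python
class PullThroughCache:
    """Toy pull-through cache: blobs are cached, tags are re-checked upstream."""

    def __init__(self, upstream):
        self.upstream = upstream      # tag -> (digest, list of blob ids)
        self.blobs = set()            # blob ids already stored locally
        self.manifest_requests = 0    # what would count against a rate limit
        self.blob_downloads = 0

    def pull(self, tag):
        self.manifest_requests += 1   # every tag pull asks upstream for the manifest
        digest, blob_ids = self.upstream[tag]
        for blob in blob_ids:
            if blob not in self.blobs:
                self.blobs.add(blob)  # only missing blobs are fetched
                self.blob_downloads += 1
        return digest


cache = PullThroughCache({"node:14": ("sha256:abc", ["layer1", "layer2"])})
for _ in range(5):
    cache.pull("node:14")
print(cache.manifest_requests, cache.blob_downloads)
```

Under this assumption, five pulls of the same tag cost five manifest requests but only two blob downloads, so bandwidth drops while the manifest count does not.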
Can you tell me more? How expensive are we talking?
Working for the same sized companies for a while has apparently dulled my senses. At a certain size, the capital that matters is the political capital it takes to get a vendor agreement in place to begin with. The monthly costs of the system are something you only feel through pushback on how big the repo gets, or the rate of traffic (experiencing the latter now with a browser testing SaaS)
> This is significant - if your organization only uses a few dozen base images from DockerHub, those images will only be downloaded by each build node _once_, then never again.
You're assuming that the set of build nodes is relatively static.
Plenty of architectures set up autoscaling for the underlying nodes, that terminate servers that aren't being used and relatively soon enough (tens of minutes, hours) spin up new servers to replace them as needed.
Rarely do the machine images used to spin up new servers include the base images of the containers that will be spun up to replace them. Much more often, the base machine image is a base OS image, and container images are downloaded on-the-fly as needed. Essentially, the engineering cost of making image-launching more efficient was externalized onto an external provider willing to pay the price.
If you’re using docker for production distribution of images, you should be paying for it. That’s exactly the behavior that creates the need for a limit.
CI/CD systems on AWS, Azure, GCP and others might be running on Kubernetes containers (using kaniko, podman, etc) or using Docker-in-Docker, and there isn't a widely supported or in-cloud-platform tool for sharing cached layers.
And as pointed out below, even if you are intelligently caching layers, manifest requests count as a pull. As far as I know, no caching proxies exist for Docker that support limiting manifest pulls.
Surprised to see Docker-in-Docker mentioned so deeply down here. It’s an extremely valid way of doing things, and non-trivial to implement a caching layer for.
Isn't Docker-in-Docker actually using the host's Docker daemon? I am mounting the docker socket in all my Docker-in-Docker containers, thus all the build tasks running on the same host can share the caches.
I guess one could have docker containers that actually run docker, but I don't see a reason to do that...
I was wondering how Docker-in-Docker works, but I couldn't find it dockermented anywhere. If it's using the host's Docker daemon, why do you need to mount the docker socket?
> If it's using the host's Docker daemon, why do you need to mount the docker socket?
There are 2 components for docker: the daemon and the tool used to send commands to the daemon. In order for said tool to be able to send commands to the daemon, it needs a way to communicate with the daemon. Mounting the socket in the container is the easiest method.
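As a rough sketch of what "sending commands to the daemon" looks like, the snippet below speaks HTTP over the Unix socket using only the Python standard library. `/var/run/docker.sock` is the usual default path and `/_ping` is the Engine API health endpoint; `UnixHTTPConnection` and `ping_daemon` are names invented here, and the code assumes a platform with Unix domain sockets.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=2):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        try:
            sock.connect(self.socket_path)
        except OSError:
            sock.close()
            raise
        self.sock = sock


def ping_daemon(socket_path="/var/run/docker.sock"):
    """Return the daemon's reply to GET /_ping, or None if unreachable."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/_ping")
        return conn.getresponse().read().decode()
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()


print(ping_daemon())
```

Inside a container this only works if the host's socket has been mounted at that path, which is exactly why the mount is needed: the CLI in the container is just a client, and the socket is its route to the host daemon.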
I have a "tooling" image that consists of a set of scripts (python code) to do various ops-related things. One of those things is to build new images when required. I have a script that, given a git commit, will detect the images that need to be built and build them. Having my tooling code in a container makes it easier to deploy and use new versions of the tooling code. I don't need anything on the host apart from docker itself. No build scripts, no python.
As I said, I could be running the docker daemon inside the container, but that breaks one of my rules related to containers: containers are not virtual machines, they should only run 1 process and the output of that process should be std out.
At the end he describes mounting the socket. The tooling image which has all the dependencies needed to build will also have the docker cli installed, which is what I'm assuming you are doing.
Docker-in-Docker (DinD) doesn't piggy back on the host's Docker daemon, but instead runs a stripped-down Docker daemon inside of the container. The major downside is that I/O is quite slow, since you're going through two virtualization layers (the DinD one, plus the host Docker daemon).
There is, effectively, no "virtualization" layer here.
There are some things that if needed can cause overhead... such as the bridge networking (really shouldn't be a bottleneck for majority of people), and the CoW filesystem... which docker won't be (or shouldn't be) running on top of since, for example, overlayfs on top of overlayfs is not supported.
There is also nothing stripped down about the daemon inside of the container.
Sure, I was speaking off the cuff based on my experience from a few years ago. Maybe I messed up and somehow had the DinD daemon not use a volume mount, and that's what caused it to build images slowly?
Usually docker outside of docker is used, no? If the image is cached on the host, it would be available to any container having access to the docker daemon socket as well since it's the same daemon.
> This is significant - if your organization only uses a few dozen base images from DockerHub, those images will only be downloaded by each build node _once_, then never again.
Only if your build nodes have unlimited storage. If the build nodes are spun up on demand or have housecleaning tasks to prevent Tragedy of the Commons disk exhaustion, this is not true.
On the other hand, this is what caching proxies/registries are for.
I'm not sure if this is 100% true any more. I've found that when enabling the DOCKER_BUILDKIT=1 env var that docker will sometimes eagerly re-fetch stale images. I think your argument is generally still true, but was happy to see that some progress is being made on dealing with stale `latest`.
> This is significant - if your organization only uses a few dozen base images from DockerHub, those images will only be downloaded by each build node _once_, then never again.
Unless you’re using something like AWS CodeBuild that spins up a Linux/Windows container for your build environment, executes bash commands in a yaml file, and then terminates it when it is done. Nothing is stored locally after the build is finished.
I’m sure there are other similar services. Wouldn’t Azure Devops using hosted builds do basically the same thing? I haven’t used it since they changed the name from Visual Studio Team Services.
What is a solution for the scenarios you have described? Amazon has ECR but it doesn’t support signing and doesn’t function as a proxy so you would miss upstream changes unless someone pushed them. Anything self hosted that supplies that functionality?
Do you really need all your private images to be derived directly from the upstream? Don't you start every image with:
FROM foo
RUN apt-get upgrade etc
?
Then why not have a set of base images, derived directly from upstream, that get built every so often, and have your private images be derived from those? This will not only relieve the stress on DockerHub and save you from having to pay the $5/month, but also give your security people a hook to run their tests and make your private images build faster, since all the system updates won't happen every time you change the code.
If using Alpine, looks like docker-registry is the needed package and /usr/bin/docker-registry serve /etc/docker-registry/config.yml is the command line. Next to last link has information on the config file.
> When a pull is attempted with a tag, the Registry checks the remote to ensure if it has the latest version of the requested content. Otherwise, it fetches and caches the latest content.
If that causes a manifest pull, it counts as a pull and will be rate limited. Yikes! This could lead to wildly nondeterministic behavior.
The difference is a HEAD request or conditional GET: the server will not send a file if it matches the time and/or tag of the version you have, so it replies with a few bytes rather than (potentially) dozens or hundreds of megabytes. Same with all CDNs.
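A minimal sketch of that revalidation idea, using an ETag-style check. The `conditional_get` function and the fake 5 MB resource are invented for illustration and are not tied to any particular registry or CDN.

```python
def conditional_get(resource, if_none_match=None):
    """Toy server: resource is (etag, body). Returns (status, etag, body)."""
    etag, body = resource
    if if_none_match == etag:
        return 304, etag, b""        # Not Modified: headers only, no body
    return 200, etag, body


resource = ("v1", b"x" * 5_000_000)   # pretend this is a 5 MB image layer

status, etag, body = conditional_get(resource)
print(status, len(body))              # first request transfers the full body

status, etag, body = conditional_get(resource, if_none_match=etag)
print(status, len(body))              # revalidation transfers no body
```

The second exchange is cheap in bytes but is still one request, which is why the question of what the rate limiter counts matters.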