One of my colleagues uses Docker Desktop on macOS. One day, it reported that all of the 200 GiB of reserved space was full.
The docker daemon did not think so:
$ docker system df
TYPE            TOTAL   ACTIVE  SIZE      RECLAIMABLE
Images          68      18      36GB      26.9GB (74%)
Containers      20      20      111.7MB   0B (0%)
Local Volumes   56      12      7.947GB   4.992GB (62%)
Build Cache     365     0       0B        0B
That is nowhere near 200 GiB.
On Linux we can simply inspect /var/lib/docker, but Docker on macOS runs inside a virtual machine (VM). We got access to that VM with the help of justincormack/nsenter1:
$ docker run -it --rm --privileged --pid=host justincormack/nsenter1
# cd /var/lib/docker/
# du -h -d 1
44.5G ./overlay2
95.6M ./image
7.7G ./volumes
114.4G ./containers
336.0K ./network
166.7G ./var/lib/docker
So the containers directory is very full, although docker said containers should only take up 111 MiB :)
# cd /var/lib/docker/containers/
# du -h -d 1
56.0K ./e239e861c2a9097f79d3c0c0f98fce7b20ab899674ea7772356d0355d9f688f4
44.0K ./962a1795b5440e9b9900a4743d86f9c18452675a43c379a88aaac2b16b5bd275
[...]
114G ./2f9867b08a714fcfd83b85d5cf7f883be05394970e7e0058747b399f06bd269d
[...]
# cd /var/lib/docker/containers/2f9867b08a714fcfd83b85d5cf7f883be05394970e7e0058747b399f06bd269d
# ls -lah
total 114G
[...]
-rw-r----- 1 root root 114.3G Oct 11 07:01 2f9867b08a714fcfd83b85d5cf7f883be05394970e7e0058747b399f06bd269d-json.log
[...]
And there we have it: A log file of 114 GiB.
But which container is that?
$ docker inspect 2f9867b08a714fcfd83b85d5cf7f883be05394970e7e0058747b399f06bd269d | jq '.[].Config.Image'
"gitlab-docker.company.com/customer/customer-wordpress-docker/mysql:latest"
This particular MySQL container seems to be very badly configured.
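Two things help here (a sketch, assuming the default json-file logging driver; the size limits are example values). First, free the space: dockerd still holds the log file open, so it has to be truncated in place rather than deleted. Second, cap the log size when the container is recreated:

# inside the VM: empty the log file without deleting it
# truncate -s 0 2f9867b08a714fcfd83b85d5cf7f883be05394970e7e0058747b399f06bd269d-json.log

$ docker run -d \
    --log-opt max-size=10m \
    --log-opt max-file=3 \
    gitlab-docker.company.com/customer/customer-wordpress-docker/mysql:latest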
At work we have a small project in which we get CSS and JavaScript from an external agency. Now and then updates come in and have to be integrated into the existing code base. Sometimes things break, and all I have is minified JavaScript.
I wanted to be able to properly see the changes in the minified JavaScript files when using git diff, and it turns out you can configure Git to do that.
First, install the JavaScript beautifier js-beautify via pip:
$ pip3 install jsbeautifier
Now we define a diff driver named minjs, which tells Git to prettify files with js-beautify:
$ git config --global diff.minjs.textconv js-beautify
If you have enough disk space, enable caching of the beautified files:
# takes extra space, but makes it faster:
$ git config --global diff.minjs.cachetextconv true
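After both commands, ~/.gitconfig contains a section like this, which you could also add by hand:

[diff "minjs"]
    textconv = js-beautify
    cachetextconv = true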
Finally, create a file called .gitattributes in your project root directory that tells Git to use the minjs diff configuration for file names ending in .min.js:
*.min.js diff=minjs
git diff now diffs minified JavaScript files in a readable way.
git show does not, unless you use the --ext-diff option.
When running a shell in a docker container, you only see a random hash as the hostname:
$ docker exec -it project_backend_1 bash
root@112adda3eb64:/#
Now imagine having a dozen terminals open, and then you run ./vendor/bin/phpunit in container 71f68dcd5379. The first thing that the PHPUnit bootstrap script does is empty the database and then run all migrations and seeds.
Unfortunately, you intended to run that command in 112adda3eb64, your local development container. Let's just say that 71f68dcd5379 was not the local dev one, but on a server in a data center, and the data thrown away were kind of important.
To prevent such mistakes in the future, the shell shall clearly show which environment you are in - local development, testing, staging or production.
This environment is available in our Laravel .env file, but it's not so easy to access in the terminal. So the first step is to add the current environment in the docker-compose.yml file:
---
version: "3"
services:
  backend:
    image: docker-hub.example.org/project/backend-dev:latest
    environment:
      - APP_ENV=local
Now we can access this variable in our shell via $APP_ENV.
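A quick check in a running container shows that the variable arrives:

$ docker exec -it project_backend_1 bash
root@112adda3eb64:/# echo "$APP_ENV"
local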
The bash prompt $PS1 is set in two places in the Ubuntu 16.04 images that we used: /etc/bash.bashrc and root's own /root/.bashrc.
Both files define $PS1, so we have to load our bash-coloring file in both of them:
FROM ubuntu:xenial
ADD bash.colorprompt /etc/bash.colorprompt
RUN echo '. /etc/bash.colorprompt' >> /etc/bash.bashrc \
    && echo '. /etc/bash.colorprompt' >> /root/.bashrc
Now the only thing left is to write that file that sets the prompt:
# color the prompt according to the $APP_ENV variable
case "$APP_ENV" in
    production)
        PS1='\e[41m\n=== $APP_ENV ===\e[m\n\u@\h:\w\$ '
        ;;
    testing)
        PS1='\e[43m$APP_ENV\e[m \u@\h:\w\$ '
        ;;
    local)
        PS1='\e[42m$APP_ENV\e[m \u@\h:\w\$ '
        ;;
esac
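One optional refinement (not part of our original file): bash only calculates the prompt width correctly when non-printing escape sequences are wrapped in \[ and \]; without them, long command lines may wrap oddly. The production branch would then read:

    production)
        # \[...\] marks the color codes as zero-width for bash
        PS1='\[\e[41m\]\n=== $APP_ENV ===\[\e[m\]\n\u@\h:\w\$ '
        ;;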
The obvious question is why PHPUnit was available on that system in the first place.
Our CI server runs unit/integration tests on every deployment, no matter which environment is being deployed to.
While this is in general a good idea, running the tests on the deployment to every environment is something we later stopped doing.
It turned out to be hard to make sure that every single configuration variable is overridden in phpunit.xml. And if you can't be sure of this, your tests suddenly use some obscure production service that you forgot to stub out.
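For illustration, this is roughly what such an override looks like in phpunit.xml (the variable names here are examples, not our complete list):

<php>
    <!-- force the testing environment, regardless of what .env says -->
    <env name="APP_ENV" value="testing"/>
    <!-- stub out external services explicitly -->
    <env name="MAIL_DRIVER" value="log"/>
</php>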
At work I was recently building a TYPO3 backend module to control a static HTML export script. I wanted the module to look native and sifted through the backend to find the UI elements I needed - which was cumbersome.
Thanks to the helpful people in the TYPO3 chat I was directed to the Styleguide extension.
It provides a list of all backend UI elements available in TYPO3 and was very helpful. It is important to install the git version, because the TER version was outdated.
Instead of writing our own search, we managed to integrate REST API data into TYPO3's native indexed_search results. This gives us a mix of website content and REST data in one result list.
A TYPO3 v7.6 site at work consists of a normal page tree with content that is searchable with indexed_search.
A separate management interface is used by editors to administrate some domain-specific data outside of TYPO3. Those data are available via a REST API, which is utilized by one of our TYPO3 extensions to display data on the website.
Those externally managed data should now be searchable on the TYPO3 website.
I pondered a long time how to tackle this task. There were two approaches:
1. Integrate the API data into the indexed_search index, so that the native website search finds them.
2. Write a separate search plugin that queries the API and renders its own result list.
The second option looked easier at first because it does not require one to dig into indexed_search. But after thinking long enough I found that I would be replicating all the basic features needed for search: listing data, paging, and those tabs as well.
The customer would then also demand that we'd have an overview page showing the first 3 results from each of the types, with a "view all" button.
In the end I decided to use option #1 because it would feel most integrated and would mean less code.
First of all, I have to recommend Indexed Search & Crawler - The Missing Manual, because it explains many things and helps with the basic setup.
You may create crawler configurations and indexed_search configurations in the TYPO3 page tree. Both are similar, yet different. How do they work together?
IS\CrawlerHook::crawler_execute_type4() gets a URL list via crawler_lib::getUrlsForPageRow().
Note that the crawler only processes entries that were in the queue when it started. Queue items added during a crawl run are not processed immediately, but in a later run.
This means that it may take 6 or 7 crawler runs until it gets to your page with the indexing and crawler configuration. It's better to use the backend module Info -> Site crawler to enqueue your custom URLs during development, or have a minimal page tree with one page :)
Crawler configuration records are URL generators.
Without special configuration, they return the URL for a page ID. Pretty dull.
The crawler manual shows that they can be used for more, and gives a language configuration as an example: &L=[1-3|5|7]. For each page ID this will generate 5 URLs, one for each of the listed languages 1, 2, 3, 5 and 7.
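For a hypothetical page with ID 10, the expansion looks like this:

/index.php?id=10&L=1
/index.php?id=10&L=2
/index.php?id=10&L=3
/index.php?id=10&L=5
/index.php?id=10&L=7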
Apart from those value ranges, you may specify a _TABLE configuration:
&myparam=[_TABLE:tt_myext_items;_PID:15, _WHERE: and hidden = 0]
This is where we need to step in: We may handle those [FOO] values and expand them ourselves with a hook:
$GLOBALS['TYPO3_CONF_VARS']['SC_OPTIONS']['crawler/class.tx_crawler_lib.php']
['expandParameters'][] = \Vnd\Ext\UrlGenerator::class . '->expandParameters';
The hook gets called for every bracketed URL parameter value. $params['currentValue'] contains the value without brackets.
The code in the hook method only has to expand the value to a list of IDs and set that into $params['paramArray'][$key]:
<?php
namespace Vnd\Ext;

/**
 * @see TYPO3\CMS\IndexedSearch\Example\CrawlerHook
 */
class UrlGenerator
{
    /**
     * Add GET parameters to crawler page.
     *
     * This method is registered as hook for crawler/class.tx_crawler_lib.php
     * and is called when crawler configuration "Configuration" fields
     * are expanded (`&L=[1-3]&bar=[FOO]`).
     *
     * @param array  $params Keys:
     *                       - pObj
     *                       - paramArray
     *                       - currentKey
     *                       - currentValue
     *                       - pid
     * @param object $pObj   Crawler lib instance
     *
     * @return void
     */
    public function expandParameters(&$params, $pObj)
    {
        if ($params['currentValue'] === 'FOO') {
            // replace this with your own ID generation code
            $params['paramArray'][$params['currentKey']] = [11, 23, 42];
        }
    }
}
Now when the crawler processes page id 1 and finds a matching configuration record that contains the following configuration:
&tx_myparam=[FOO]
our hook will be called and expand that config to three IDs:
/index.php?id=1&tx_myparam=11
/index.php?id=1&tx_myparam=23
/index.php?id=1&tx_myparam=42
The crawler will then put those three URLs into the queue and index them in the next run.
The page and the plugin that show the API data must be cacheable; the data are not indexed otherwise. Also make sure you set the page title for indexing.
Enable cHash generation in the crawler configuration.
When a visitor uses the website search and indexed_search generates a search result set, it checks whether each result's page ID is still available. Deactivated and deleted pages will thus not show up in the results. This does not work for API results for the obvious reason that they are not TYPO3 pages.
Deleted TYPO3 database records that were integrated into the search with an indexed_search configuration only get removed from the index on the next crawler run. Until then, they are still findable:
In fact, if a record is removed its indexing entry will also be removed upon next indexing - simply because the "set_id" is used to finally clear out old entries after a re-index!
This works as follows: Each indexing run gets a new set_id, and after re-indexing, all entries of the same indexing configuration that still carry an old set_id are cleared out.
This also works for API data. The indexing configuration "pagetree" processes the API page ID, which in turn creates the API detail URLs through the crawler configuration. After re-indexing the data, the old search index entries get deleted.
The only thing to remember is not to use a "Crawler Queue" scheduler task, because then the phash records will have no index configuration ID, and thus will not be deleted on the next run.
The "reset all index data" SQL script in invaluable during development:
TRUNCATE TABLE index_debug;
TRUNCATE TABLE index_fulltext;
TRUNCATE TABLE index_grlist;
TRUNCATE TABLE index_phash;
TRUNCATE TABLE index_rel;
TRUNCATE TABLE index_section;
TRUNCATE TABLE index_stat_search;
TRUNCATE TABLE index_stat_word;
TRUNCATE TABLE index_words;
TRUNCATE TABLE tx_crawler_process;
TRUNCATE TABLE tx_crawler_queue;
UPDATE index_config SET timer_next_indexing = 0;
Warming the page cache after a production deployment took up to two minutes for certain TYPO3 pages. We got that down to mere seconds by not throwing away scaled and cropped images.
🇩🇪 A German translation of this article is available at Mogic: Docker: Schnelleres Cache-Warming für TYPO3
At work we use docker for our TYPO3 projects. Deploying changes to the live system only requires us to push into the main extension's master branch, and Jenkins will do the rest: Build the web server image with all the PHP code, pull that onto the production server, start up the new container, clear the cache and stop the old container.
Because potentially any code could have changed during a deployment, we need to clear all of the TYPO3 caches. Apart from the database cache tables, all files in typo3temp/ are pruned during deployment.
Our TYPO3 projects have a responsive layout - they can be viewed in any resolution and will look good. Different resolutions and screen aspect ratios often need different image sizes and ratios - and those images need to be generated automatically.
To make sure that the important part of a picture is kept regardless of the targeted width-height-ratio, we utilize the focuspoint extension. Editors select the important part of the picture within the TYPO3 backend, and this part will be kept during image cropping.
Mix that with different image resolutions for normal and high-density displays, and we're up to 6 images that need to be generated for a single image on the website (2 aspect ratios + 2 resolutions each).
When clearing typo3temp/, all those cropped and scaled images are thrown away and need to be regenerated. Calling pages with many images needed up to two minutes until they had all their images regenerated, which was just too much.
Our goal was to keep the processed files. Their file names are a hash of the image processing configuration options, so they are stable over time. Cleaning caches has no effect on them.
Information about generated files is stored in the database as well, in the table sys_file_processedfile. Since the database is kept during deployments, the contents of this table are also stable.
When re-using a production database dump on the test or dev system, TYPO3 notices if files have an entry in the processed files table but are missing on disk, and recreates them.
focuspoint saved the cropped files into typo3temp/focuscrop, which was thrown away on deployments, so we made a patch to make that path configurable.
With that in place we created a new folder in the site's document root, processed. It was "mounted" into TYPO3 with a new file storage record (uid: 2) that has its base path set to processed/ (path type "relative").
The focuspoint extension was configured to store its generated files into processed/focuspoint. The file storage fileadmin (auto-generated) was configured to store its "manipulated and temporary images" into 2:_processed_.
With those two changes, all generated images now land in the processed directory. We configured our docker container to mount the processed folder from the host, so that it would keep its data when new CMS containers are deployed.
typo3cms:
  image: docker.example.com/project-typo3cms:latest
  volumes_from:
    - storage
  volumes:
    - ./semitemp/processed:/var/www/site/htdocs/processed
Fetching a page with over 200 images directly after deployment with empty caches now takes mere seconds instead of minutes. Mission accomplished.
Requests to the Telegram messaging API from a docker container at work took 5 seconds:
$ time curl --silent api.telegram.org --output /dev/null
real 0m5.577s
Doing an IPv4-only request was quick:
$ time curl -4 --silent api.telegram.org --output /dev/null
real 0m0.090s
Inspecting the network traffic with wireshark showed that two DNS requests are made: one for the IPv4 address, one for the IPv6 address.
The IPv4 address is resolved immediately, but the IPv6 request is cancelled after 5 seconds.
A thread on askubuntu.com gave me the hint what to do: disable parallel DNS requests, so that the IPv4 and IPv6 lookups are made one after another instead of simultaneously. This is done by adding one line to the container's /etc/resolv.conf:
options single-request
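Since /etc/resolv.conf inside a docker container is managed by the docker daemon, and manual edits are lost when the container is recreated, the option is better set at container start. A sketch (assuming a docker version that supports DNS options; myimage is a placeholder):

$ docker run --dns-opt single-request myimage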
At work I needed to test a locally installed IMAP server. Knowing that cURL speaks not only HTTP but a dozen other protocols as well - including IMAP and SMTP - I decided to give it a try.
debian-administration.org has a nice article about using curl for IMAP, which is where I got the commands from.
Trying to list the IMAP account's folders gave me an error:
$ curl -v imap://localhost --user "user11@example.org:user11"
* Rebuilt URL to: imap://localhost/
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 143 (#0)
< * OK [CAPABILITY IMAP4rev1 LITERAL+ SASL-IR LOGIN-REFERRALS ID ENABLE IDLE STARTTLS LOGINDISABLED] Dovecot ready.
> A001 CAPABILITY
< * CAPABILITY IMAP4rev1 LITERAL+ SASL-IR LOGIN-REFERRALS ID ENABLE IDLE STARTTLS LOGINDISABLED
< A001 OK Pre-login capabilities listed, post-login capabilities have more.
* No known authentication mechanisms supported!
* Closing connection 0
curl: (67) Login denied
Looking at the capability line we see that login is disabled until the STARTTLS command is issued. curl does not do that, though - we need to force it by using the --ssl option:
$ curl -v -k --ssl imap://localhost --user "user11@example.org:user11"
* Rebuilt URL to: imap://localhost/
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 143 (#0)
< * OK [CAPABILITY IMAP4rev1 LITERAL+ SASL-IR LOGIN-REFERRALS ID ENABLE
IDLE STARTTLS LOGINDISABLED] Dovecot ready.
> A001 CAPABILITY
< * CAPABILITY IMAP4rev1 LITERAL+ SASL-IR LOGIN-REFERRALS ID ENABLE IDLE
STARTTLS LOGINDISABLED
< A001 OK Pre-login capabilities listed, post-login capabilities have more.
> A002 STARTTLS
< A002 OK Begin TLS negotiation now.
I spent the last couple of days at work integrating REST API data into the search result list of TYPO3's indexed_search extension. Yesterday I wanted to run a last test on my development machine to see if everything worked as it should and if API data would be indexed correctly. It did not work.
Why didn't it work?
After several hours I found out that the crawler extension did indeed process the page with my special crawling configuration, but stopped in the middle.
Why did it stop?
The crawler caught an exception and stopped processing. Unfortunately, it did not tell anyone about that. The exception was "HTTP/1.1 404 Not Found", from the API connector.
Why did the API connector throw an exception?
Our crawler hook thought it was running on the live (production) system and queried the production API. The new API methods had not yet been deployed to the production API system, and it returned a 404.
Why did the crawler think it was running on production?
The docker container is supposed to have an environment variable TYPO3_CONTEXT=Development, which tells the TYPO3 instance to use the development configuration. That variable was not set.
Why was the environment variable not set?
To make the crawler process run correctly (write access to temporary directories + files), it must be run as the same user that the nginx web server runs under, www-data. I switched to the www-data user as I always do:
$ su - www-data -s /bin/bash
The - resets all environment variables. TYPO3_CONTEXT was thus not set anymore.
After 6 hours, I removed that minus and everything worked as it should.
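The difference is easy to demonstrate (a quick sketch; su without the dash keeps most of the caller's environment):

# with the dash: login shell, environment is reset
$ su - www-data -s /bin/bash
$ echo "$TYPO3_CONTEXT"

# without the dash: the variable survives
$ su www-data -s /bin/bash
$ echo "$TYPO3_CONTEXT"
Development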
During development of the TYPO3-based Wohnglück project, fellow developers experienced the dreaded "white pages" when accessing a certain page in the CMS. This happened on production, test and local dev environments - but only now and then, and it was not reproducible.
Our log server only showed:
nginx stdout | 2017/02/02 11:26:55 [error] 24#24: *40 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 42.0.23.0, server: _, request: "GET /some/path/ HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.0-fpm.sock:", host: "some.host", referrer: "https://some.host/other/path/"
I was totally certain that this must be PHP crashing. We were running PHP version 7.0.8ubuntu*, while the most recent one was 7.1.15. Upgrading would surely make the crash go away.
Another colleague did not want to go down that route and suggested collecting more error information first, which we did by raising the FPM log_level setting from error to warning. Then we waited until it happened again a day later. The log was more verbose now:
php7.0-fpm stderr | WARNING: [pool www] child 27, script '/var/www/site/htdocs/index.php' (request: "GET /index.php") execution timed out (152.506891 sec), terminating
php7.0-fpm stderr | WARNING: [pool www] child 27 exited on signal 15 (SIGTERM) after 240.011067 seconds from start
nginx stdout | 24#24: *40 recv() failed (104: Connection reset by peer) while reading response header from upstream, ...
So PHP did not crash - it was killed by php-fpm because the request took too long to run!
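Both the verbosity and the timeout are plain FPM configuration; a sketch with assumed values matching the log output above (our actual values may have differed):

; php-fpm.conf: log warnings, not only errors
log_level = warning

; pool.d/www.conf: kill a worker whose request runs longer than this -
; the source of the SIGTERM above
request_terminate_timeout = 150s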
The log also contained error messages about TYPO3 calling graphicsmagick but failing with strange errors like "invalid JPEG header data" and "no information read", which I could not make sense of before.
But now it actually did make sense: The page contained a whopping 219 images, which got lazy-loaded by the browser, but all had to be scaled and cropped during page generation. TYPO3's whole cache is cleared during our automated docker deployment, and the first persons accessing the page experienced that problem.
Multiple people accessing that page at the same time also explained the gm command errors: two PHP processes tried to generate the same files at once, and one read the partly generated data of the other.
The solution to this problem is to not throw away the scaled images during deployment.