
Routine TarMk Maintenance

Background on TarMk
  • The TarMk ("Micro Kernel") segmentstore is the database that holds all AEM content.

    • In AEM terms, the words on the page are content, and binaries such as images are blobs.
    • Content is stored in a TarMk segmentstore on each AEM machine.

      • The Author segmentstore contains all content, both published and not yet published, as well as prior versions of content.
      • Binaries (images, media files) on the Author are stored on NAS.
      • The Publisher segmentstore(s) contain currently published content. Publisher segmentstores are a subset of the Author segmentstore, and therefore smaller.
  • TarMk segment files are literally tar (tape archive) files. You can use standard tar commands to list their content (although it is mostly unreadable).

  • TarMk is an "append-only" system, meaning that changes to content are made by writing the new version as additional data and marking the old version as obsolete.

    • Some number of old versions are retained to provide history and the ability to revert to prior versions of the content.
    • Versions beyond the retention limit are garbage, which is removed during compaction.
  • TarMk requires periodic compaction (similar to defragmenting a hard disk, or garbage collection in a JVM).

    • There is an automatic online compaction process, but it is not very effective.
    • Offline compaction works well, but requires the AEM service to be down.
    • We do an offline compaction during our scheduled maintenance period, Saturdays from 22:00-23:00.
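
Because the segment files are ordinary tar archives, their entries can be listed without extracting anything. The sketch below is a minimal, read-only helper. The function name and the example path in the comment are illustrative assumptions, not part of our tooling.

```shell
# List the first few entries of one TarMk tar file without extracting it.
# The archive path is supplied by the caller; nothing is modified.
list_segment_archive() {
    archive="$1"
    limit="${2:-10}"          # how many entries to show (default 10)
    tar -tf "$archive" | head -n "$limit"
}

# Example call (hypothetical path, adjust to the instance you are on):
# list_segment_archive /path/to/crx-quickstart/repository/segmentstore/data00000a.tar 5
```

Expect the entry names to be opaque identifiers rather than page paths. This is only useful for confirming that a file is a readable tar archive.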

Author Maintenance

  • Author maintenance must be done during our scheduled Saturday 22:00-23:00 maintenance period.
  • Send a message to the #aem Slack channel prior to maintenance (preferably 30m prior). This allows users who are working to finish their work before you take the system down under them.
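
A quick guard against starting at the wrong time can be sketched as follows. It assumes the server's local clock is the one the maintenance period is defined in. The function name is hypothetical.

```shell
# Return success only for Saturday, 22:00-22:59 local time.
# day:  ISO weekday as printed by `date +%u` (1 = Monday ... 6 = Saturday)
# hour: two-digit hour as printed by `date +%H`
in_maintenance_window() {
    day="${1:-$(date +%u)}"
    hour="${2:-$(date +%H)}"
    [ "$day" = "6" ] && [ "$hour" = "22" ]
}

if in_maintenance_window; then
    echo "Inside the Saturday 22:00-23:00 window."
else
    echo "Outside the maintenance window."
fi
```

The optional arguments exist only so the check can be exercised with fixed values. Called without arguments it uses the current time.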

1. Put the machine under maintenance

  1. Log on to Alert Manager.
  2. Click "New Silence". This maintenance requires two separate silences, so you will do this twice.
  3. Fill in the boxes for each silence as follows:

    • First Silence
    Box Name   Value
    Start      (accept default)
    Duration   (accept default)
    End        (accept default)
    Matcher    app="AEM Author" (click plus)
    Creator    (your pid)
    Comment    "Routine TarMk maintenance"
    • Second Silence
    Box Name   Value
    Start      (accept default)
    Duration   (accept default)
    End        (accept default)
    Matcher    instance="cmsa-prod-01.db.vt.edu:4500" (plus)
    Creator    (your pid)
    Comment    "Routine TarMk maintenance"
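
If you prefer the command line, the same two silences could in principle be created with amtool, the Alertmanager CLI. This is an untested sketch, not our documented procedure. It assumes amtool is installed and already configured to reach our Alert Manager, and it leaves the duration unset, so amtool's default applies, which may differ from the web form's default. The helper only prints the commands so you can review them first.

```shell
# Print (do not run) one amtool command per silence.
# creator: your pid, used as the silence author.
print_author_silences() {
    creator="$1"
    comment="Routine TarMk maintenance"
    for matcher in 'app="AEM Author"' 'instance="cmsa-prod-01.db.vt.edu:4500"'; do
        printf "amtool silence add '%s' --author='%s' --comment='%s'\n" \
            "$matcher" "$creator" "$comment"
    done
}

print_author_silences "your-pid"
```

Copy each printed line into a shell once you have checked the matcher and author.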

2. Shut down the Author

Log in to cmsa-prod-01.db.vt.edu and sudo to the cms user.

$ aem-sequencer stop
This command takes up to 30 seconds to complete.

3. Ensure the Author process has stopped

NOTE: It is crucial to ensure the AEM Java process has exited before you continue. Running the TarMk compaction process on open/in-use files will corrupt them. (The oak-run utility does not seem to detect in-use files well.)

$ pgrep -a java

There should be NO JAVA PROCESSES running. If any are listed, try the stop command again, or kill the pid of the java process (do not kill -9 if you can avoid it). Do not proceed with compaction if the AEM Java process is running.
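
If the process is slow to exit, a small wait loop avoids re-running pgrep by hand. The sketch below waits on a single pid (the one reported by pgrep -a java). It assumes you are running as the user that owns the process, since kill -0 cannot see processes it has no permission to signal. The function name and default timeout are illustrative.

```shell
# Wait up to `timeout` seconds for process `pid` to exit.
# Returns 0 once the process is gone, 1 if it is still alive at the deadline.
# `kill -0` sends no signal; it only tests whether the pid can be signalled.
wait_for_pid_exit() {
    pid="$1"
    timeout="${2:-300}"
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example (PID is whatever `pgrep -a java` printed):
# wait_for_pid_exit "$PID" 300 || echo "java still running, do NOT compact"
```

Whatever this returns, run pgrep -a java once more before compacting. The loop watches one pid only and says nothing about other java processes.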

4. Back up the segmentstore

$ backup_segmentstore_to_nas
2020-08-13T14:04:42 [INFO] Begin backup_segmentstore_to_nas
2020-08-13T14:04:42 [INFO] Complete. Run time: 00:04:17

5. Remove old segmentstore backups

The backup_segmentstore_to_nas process writes to a NAS volume on /apps/mounts/cmsbinary, which is typically tight on space, so we need to keep a minimal number of backups. (Older backups are less valuable to us anyway).

$ df -h /apps/mounts/cmsbinary
Filesystem                                 Size  Used Avail Use% Mounted on
skipper.cc.vt.edu:/volcmsbinary/cmsbinary  2.0T  2.0T   90G  96% /apps/mounts/cmsbinary
$ cd /apps/mounts/cmsbinary/prod_segmentstore/
$ ls -lh cmsa-prod-01
total 137G
-rw-rw-r-- 1 cms cms 86G Aug  2 22:29 crx-quickstart-2020-08-02.tar
-rw-rw-r-- 1 cms cms 51G Aug  8 22:12 crx-quickstart-2020-08-08.tar

Remove the backup from 2 weeks prior.
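
To see which files are candidates for removal without deleting anything, a helper along these lines can be used. It is a sketch that assumes the backups keep the crx-quickstart-YYYY-MM-DD.tar naming shown above and that keeping the two newest is acceptable. Both the function name and the default are assumptions.

```shell
# Print every crx-quickstart-*.tar in `dir` except the newest `keep` files.
# ISO dates in the file names sort chronologically, so a plain sort is enough.
# This only prints candidates; it never deletes anything.
list_old_backups() {
    dir="$1"
    keep="${2:-2}"
    total=$(ls "$dir"/crx-quickstart-*.tar 2>/dev/null | wc -l)
    drop=$((total - keep))
    [ "$drop" -gt 0 ] || return 0
    ls "$dir"/crx-quickstart-*.tar | sort | head -n "$drop"
}

# Example:
# list_old_backups /apps/mounts/cmsbinary/prod_segmentstore/cmsa-prod-01 2
```

Review the printed list, then remove the files by hand.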

6. Compact TarMk and Restart the Service

$ aem-compact-tar -c && aem-sequencer start
2020-08-13T12:02:08 [INFO] Starting AEM publish
2020-08-13T12:02:08 [INFO] Starting AEM in Runmode: publish,crx3,crx3tar,nosamplecontent, Port: 4503 ...
2020-08-13T12:02:08 [INFO] CQ_JARFILE: /apps/cms/var/data/aem64-publish/crx-quickstart/app/cq-quickstart-6.4.0-load22b-standalone-quickstart.jar
2020-08-13T12:02:08 [INFO] Start command: /apps/local/jdk1.8/bin/java -Djava.io.tmpdir=/apps/cms/var/tmp -server -Xms8G -Xmx8G  -XX:MaxMetaspaceSize=512M -Djava.awt.headless=true -Djava.net.preferIPv6Addresses=true -Dsling.run.modes=publish,crx3,crx3tar,nosamplecontent -jar /apps/cms/var/data/aem64-publish/crx-quickstart/app/cq-quickstart-6.4.0-load22b-standalone-quickstart.jar start -c /apps/cms/var/data/aem64-publish/crx-quickstart -i launchpad -p 4503 -Dsling.properties=/apps/cms/var/data/aem64-publish/crx-quickstart/conf/sling.properties
2020-08-13T12:02:08 [INFO] Started [24158]
2020-08-13T12:02:09 [INFO] AEM publish started: looking for 'Startup finished' message
2020-08-13T12:02:49 [INFO] Syntax checking: /apps/local/httpd/bin/httpd -DGZIP -DCCACHE -DPUBLISH -DAUX -DWEBDAV -DFORENSIC -f /apps/cms/etc/httpd/httpd.conf.RP -t
2020-08-13T12:02:49 [OK] Syntax OK
2020-08-13T12:02:49 [INFO] start
2020-08-13T12:02:52 [OK] Service listening on port 443
2020-08-13T12:02:52 [OK] Service listening on port 4501

Note: Currently, aem-compact-tar produces no visible output.
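
Since aem-compact-tar is silent, one way to confirm it reclaimed anything is to compare the segmentstore's disk usage before and after. The sketch below assumes you know the segmentstore directory for the instance. SEGMENTSTORE is a placeholder variable and the function names are hypothetical.

```shell
# Print the size of a directory in kilobytes (first field of `du -sk`).
dir_size_kb() {
    du -sk "$1" | awk '{print $1}'
}

# Print how many kilobytes were reclaimed between two measurements.
reclaimed_kb() {
    echo $(( $1 - $2 ))
}

# Example, with SEGMENTSTORE set to the instance's segmentstore directory:
# before=$(dir_size_kb "$SEGMENTSTORE")
# aem-compact-tar -c
# after=$(dir_size_kb "$SEGMENTSTORE")
# echo "Reclaimed $(reclaimed_kb "$before" "$after") KB"
```

A result near zero does not necessarily mean failure, as there may simply have been little garbage to remove.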

7. Restart org.apache.sling.event (Optional from AEM 6.5)

(Since AEM 6.5, we have not noticed any issue when this bundle is not restarted, so this step can be skipped during routine maintenance on the author machine.)

Apache Sling Event Support (org.apache.sling.event) must be stopped and then started.

  1. Go to /system/console/bundles
  2. In the top gray bar is a search bar. Type org.apache.sling.event into this box and press enter or "Apply Filter"
  3. Note there are two matches! We are looking for "Apache Sling Event Support"
  4. Click the square "stop" VCR button, and wait a few seconds for it to turn into a triangle (indicating the service is stopped). The status will change from "Active" to "Resolved".
  5. Then click the triangle "play" VCR button, and wait a few seconds for it to again become a square "stop" button. The status will again be "Active".
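
The same stop/start can usually be done without the browser, because the Felix web console accepts a POST with an action field on the bundle's URL. The sketch below only prints the curl commands. It is untested against our instances. The base URL is a placeholder you must supply, and AEM_ADMIN_USER is a hypothetical variable holding a console user name (curl will prompt for the password).

```shell
# Print (do not run) the curl command for one bundle action.
# base_url: scheme, host and port of the AEM instance (placeholder here).
# action:   "stop" or "start"
bundle_action_cmd() {
    base_url="$1"
    action="$2"
    printf 'curl -u "$AEM_ADMIN_USER" -F action=%s %s/system/console/bundles/org.apache.sling.event\n' \
        "$action" "$base_url"
}

bundle_action_cmd "http://localhost:PORT" stop
bundle_action_cmd "http://localhost:PORT" start
```

Using the exact symbolic name org.apache.sling.event in the URL targets "Apache Sling Event Support" only, which sidesteps the two-matches problem in the search box.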

Publisher Maintenance

Warning

One AEM Publisher instance must be up at all times!

The publishers are a critical part of the AEM infrastructure, and must be available to serve content without interruption. Most of our content is served by the cache layer (cmsw- or Dispatcher), but it is not possible to serve all content this way.

Because we have redundant (n=2) publishers, we technically can do the maintenance at any time the remaining publisher is able to meet demand. We typically perform the maintenance on publishers one at a time during the same maintenance period when we do author maintenance.
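
Before stopping one publisher, it is worth confirming that the other one is answering. A minimal sketch follows. The host names are the two publishers named in this document. The health-check URL (scheme, port, path) is site-specific and left to the caller, and both function names are hypothetical.

```shell
# Map one production publisher to its peer.
peer_publisher() {
    case "$1" in
        cmsa-prod-02*) echo "cmsa-prod-03.db.vt.edu" ;;
        cmsa-prod-03*) echo "cmsa-prod-02.db.vt.edu" ;;
        *) echo "unknown publisher: $1" >&2; return 1 ;;
    esac
}

# Succeed only if the given URL answers with a non-error HTTP status
# within 10 seconds. The URL is supplied by the caller.
peer_is_up() {
    curl -fsS -o /dev/null --max-time 10 "$1"
}

# Example (URL is a placeholder):
# peer=$(peer_publisher "$(hostname -f)") && peer_is_up "https://$peer/" \
#     || echo "peer not confirmed up, do NOT stop this publisher"
```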

1. Put the machine under maintenance

  1. Log on to Alert Manager
  2. Add a "New Silence"
  3. Fill in the boxes as follows:

    Box Name   Value
    Start      (accept default)
    Duration   (accept default)
    End        (accept default)
    Matchers   instance="cmsa-prod-02.db.vt.edu:4501" (plus)
               or
               instance="cmsa-prod-03.db.vt.edu:4501" (plus)
    Creator    (your pid)
    Comment    "Routine TarMk maintenance"

Note: the port for publishers is 4501, while the author port is 4500.

2. Shut down the Publisher

$ aem-sequencer stop

Output will look something like this:

2020-08-13T12:01:29 [INFO] Stopping AEM in Runmode: publish,crx3,crx3tar,nosamplecontent, Port: 4503 ...
2020-08-13T12:01:29 [INFO] CQ_JARFILE: /apps/cms/var/data/aem64-publish/crx-quickstart/app/cq-quickstart-6.4.0-load22b-standalone-quickstart.jar
13.08.2020 12:01:30.901 *INFO * [main] Setting sling.home=/apps/cms/var/data/aem64-publish/crx-quickstart (command line)
13.08.2020 12:01:31.178 *INFO * [main] Sent 'stop' to /127.0.0.1:36853: OK
2020-08-13T12:01:31 [INFO] Waiting 300s for process 29655 to terminate...
2020-08-13T12:01:49 [INFO] Process 29655 ended in 00:00:18 with code 0
2020-08-13T12:01:49 [INFO] AEM Stopped

2020-08-13T12:01:57 [INFO] Syntax checking: /apps/local/httpd/bin/httpd -DGZIP -DCCACHE -DPUBLISH -DAUX -DWEBDAV -DFORENSIC -f /apps/cms/etc/httpd/httpd.conf.RP -t
2020-08-13T12:01:58 [OK] Syntax OK
2020-08-13T12:01:58 [INFO] graceful-stop

3. Ensure the Publisher process has stopped

NOTE: It is crucial to ensure the AEM Java process has exited before you continue. Running the TarMk compaction process on open/in-use files will corrupt them. (The oak-run utility does not seem to detect in-use files well.)

$ pgrep -a java

There should be NO JAVA PROCESSES running. If any are listed, try the stop command again, or kill the pid of the java process (do not kill -9 the java process: we have corrupted the TarMk database on a publisher this way). Do not proceed with compaction if the AEM Java process is running.

4. Back up the segmentstore

$ backup_segmentstore_to_nas
2020-08-13T14:04:42 [INFO] Begin backup_segmentstore_to_nas
2020-08-13T14:04:42 [INFO] Complete. Run time: 00:04:17

5. Compact TarMk and Restart the Service

$ aem-compact-tar -c && aem-sequencer start
2020-08-13T12:02:08 [INFO] Starting AEM publish
2020-08-13T12:02:08 [INFO] Starting AEM in Runmode: publish,crx3,crx3tar,nosamplecontent, Port: 4503 ...
2020-08-13T12:02:08 [INFO] CQ_JARFILE: /apps/cms/var/data/aem64-publish/crx-quickstart/app/cq-quickstart-6.4.0-load22b-standalone-quickstart.jar
2020-08-13T12:02:08 [INFO] Start command: /apps/local/jdk1.8/bin/java -Djava.io.tmpdir=/apps/cms/var/tmp -server -Xms8G -Xmx8G  -XX:MaxMetaspaceSize=512M -Djava.awt.headless=true -Djava.net.preferIPv6Addresses=true -Dsling.run.modes=publish,crx3,crx3tar,nosamplecontent -jar /apps/cms/var/data/aem64-publish/crx-quickstart/app/cq-quickstart-6.4.0-load22b-standalone-quickstart.jar start -c /apps/cms/var/data/aem64-publish/crx-quickstart -i launchpad -p 4503 -Dsling.properties=/apps/cms/var/data/aem64-publish/crx-quickstart/conf/sling.properties
2020-08-13T12:02:08 [INFO] Started [24158]
2020-08-13T12:02:09 [INFO] AEM publish started: looking for 'Startup finished' message
2020-08-13T12:02:49 [INFO] Syntax checking: /apps/local/httpd/bin/httpd -DGZIP -DCCACHE -DPUBLISH -DAUX -DWEBDAV -DFORENSIC -f /apps/cms/etc/httpd/httpd.conf.RP -t
2020-08-13T12:02:49 [OK] Syntax OK
2020-08-13T12:02:49 [INFO] start
2020-08-13T12:02:52 [OK] Service listening on port 443
2020-08-13T12:02:52 [OK] Service listening on port 4501

Note: Currently, aem-compact-tar produces no visible output.
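
The publisher steps can be strung together so that a failure at any point stops the sequence before compaction runs. The helper below is a sketch. It assumes each of our commands exits non-zero on failure, which has not been verified for aem-sequencer, backup_segmentstore_to_nas, or aem-compact-tar. The function name is hypothetical.

```shell
# Run each argument as a shell command, in order.
# Stop at the first command that exits non-zero and return 1.
run_steps() {
    for step in "$@"; do
        printf '[step] %s\n' "$step"
        if ! sh -c "$step"; then
            printf '[abort] step failed: %s\n' "$step" >&2
            return 1
        fi
    done
}

# Example. `! pgrep -a java` succeeds only when no java process is found.
# run_steps 'aem-sequencer stop' \
#           '! pgrep -a java' \
#           'backup_segmentstore_to_nas' \
#           'aem-compact-tar -c' \
#           'aem-sequencer start'
```

If the sequence aborts, the publisher is left stopped. Investigate and start it by hand rather than re-running the whole list.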