Routine TarMk Maintenance
Background on TarMk
- The TarMk ("Micro Kernel") segmentstore is the database that holds all AEM content.
  - In AEM terms, the words on the page are content, and binaries such as images are blobs.
- Content is stored in a TarMk segmentstore on each AEM machine.
  - The Author segmentstore contains all content, both published and not yet published, as well as prior versions of content.
  - Binaries (images, media files) on the Author are stored on NAS.
  - The Publisher segmentstore(s) contain currently published content. Publisher segmentstores are a subset of the Author segmentstore, and therefore smaller.
- TarMk files are literally tar (tape archive) files. You can use standard tar commands to see the content (although it is mostly unreadable).
- TarMk is an "append-only" system, meaning that changes to content are made by writing a new blob of data and marking the old version as obsolete.
  - Some number of old versions are retained to provide history and the ability to revert to prior versions of the content.
  - Versions beyond the retention limit are garbage, which is removed during compaction.
- TarMk requires periodic compaction (similar to defragmenting a hard disk, or garbage collection in a JVM).
  - There is an automatic online compaction process, but it is not very effective.
  - Offline compaction works well, but requires the AEM service to be down.
  - We do an offline compaction during our scheduled maintenance period, Saturdays from 22:00-23:00.
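As a small illustration of the "standard tar commands" point above, the sketch below builds a throwaway archive and lists its members. The scratch directory and file names are made up for the demo; the same `tar -tf` listing should work on a copy of a segmentstore tar file, but do not experiment on a live segmentstore.

```shell
# Build a throwaway archive so the listing command can be tried safely.
# (workdir, sample.txt, and demo.tar are placeholder names for this demo.)
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/sample.txt"
tar -cf "$workdir/demo.tar" -C "$workdir" sample.txt

# -t lists the members of an archive without extracting them.
tar -tf "$workdir/demo.tar"
```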
Author Maintenance
- Author maintenance must be done during our scheduled Saturday 22:00-23:00 maintenance period.
- Send a message to the #aem Slack channel prior to maintenance (preferably 30 minutes prior). This allows users who are working to finish their work before you take the system down under them.
1. Put the machine under maintenance
- Log on to Alert Manager
- Note: This maintenance requires two silences
- Add a "New Silence"
- Create two separate silences and fill in the boxes as follows:

First Silence

| Box Name | Value |
| --- | --- |
| Start | (accept default) |
| Duration | (accept default) |
| End | (accept default) |
| Matcher | app="AEM Author" (click plus) |
| Creator | (your pid) |
| Comment | "Routine TarMk maintenance" |

Second Silence

| Box Name | Value |
| --- | --- |
| Start | (accept default) |
| Duration | (accept default) |
| End | (accept default) |
| Matcher | instance="cmsa-prod-01.db.vt.edu:4500" (click plus) |
| Creator | (your pid) |
| Comment | "Routine TarMk maintenance" |
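The procedure above uses the Alert Manager web UI. If the `amtool` command-line client happens to be available, the same two silences could probably be scripted. The sketch below is an assumption, not part of our documented process: it only prints the commands rather than running them, `mypid` is a placeholder, and a real invocation would also need the Alertmanager address (via `--alertmanager.url` or an amtool config file), which is not given here.

```shell
# Print (do not run) an amtool command that would create one silence.
# $1 = matcher, $2 = author (your pid). Hypothetical helper for illustration.
silence_cmd() {
    printf "amtool silence add '%s' --author='%s' --comment='%s'\n" \
        "$1" "$2" "Routine TarMk maintenance"
}

# The two Author silences, using the matchers from the tables above.
silence_cmd 'app="AEM Author"' "mypid"
silence_cmd 'instance="cmsa-prod-01.db.vt.edu:4500"' "mypid"
```

Printing first lets you review the matchers before anything is sent to Alert Manager.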
2. Shut down the Author
Log in to cmsa-prod-01.db.vt.edu and sudo to the cms user.
3. Ensure the Author process has stopped
NOTE: It is crucial to ensure the AEM Java process has completed and exited before you continue on. Running the TarMK compaction process on open/in-use files will wreck them. (The oak-run utility does not seem to detect in-use files well).
There should be NO JAVA PROCESSES running. If one is still running, try the stop command again, or kill the pid of the Java process (do not kill -9 if you can avoid it). Do not proceed with compaction if the AEM Java process is running.
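One way to make this check explicit is a small guard like the sketch below. It assumes `pgrep` (from procps) is installed and that the AEM process shows up under the name `java`; `process_running` is a helper name invented for this example.

```shell
# Return 0 (true) if any process with the exact name $1 is running.
process_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

if process_running java; then
    echo "Java is still running: do NOT compact" >&2
else
    echo "No Java processes found: safe to continue"
fi
```

Run it as the cms user on the machine being maintained, and only continue once it reports that no Java processes were found.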
4. Backup the segmentstore
$ backup_segmentstore_to_nas
2020-08-13T14:04:42 [INFO] Begin backup_segmentstore_to_nas
2020-08-13T14:04:42 [INFO] Complete. Run time: 00:04:17
5. Remove old segmentstore backups
The backup_segmentstore_to_nas process writes to a NAS volume on /apps/mounts/cmsbinary, which is typically tight on space, so we need to keep a minimal number of backups. (Older backups are less valuable to us anyway).
[15:30:46] cms@cmsa-prod-03:~$ df -h /apps/mounts/cmsbinary
Filesystem Size Used Avail Use% Mounted on
skipper.cc.vt.edu:/volcmsbinary/cmsbinary 2.0T 2.0T 90G 96% /apps/mounts/cmsbinary
$ cd /apps/mounts/cmsbinary/prod_segmentstore/
$ ls -lh cmsa-prod-01
total 137G
-rw-rw-r-- 1 cms cms 86G Aug 2 22:29 crx-quickstart-2020-08-02.tar
-rw-rw-r-- 1 cms cms 51G Aug 8 22:12 crx-quickstart-2020-08-08.tar
Remove the backup from 2 weeks prior.
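To decide which files are old, the date-stamped names can be sorted, since `crx-quickstart-YYYY-MM-DD.tar` sorts chronologically. The sketch below only lists removal candidates and deletes nothing. Keeping the newest two backups is an assumption based on the listing above, and `old_backups` is a hypothetical helper; the demo runs against empty placeholder files in a scratch directory, not the NAS.

```shell
# Print every backup in directory $1 except the newest $2, oldest first.
old_backups() {
    ls -1 "$1"/crx-quickstart-*.tar 2>/dev/null | sort |
        awk -v keep="$2" '{ line[NR] = $0 }
            END { for (i = 1; i <= NR - keep; i++) print line[i] }'
}

# Demo against empty placeholder files (the 2020-07-25 name is made up).
demo=$(mktemp -d)
touch "$demo/crx-quickstart-2020-07-25.tar" \
      "$demo/crx-quickstart-2020-08-02.tar" \
      "$demo/crx-quickstart-2020-08-08.tar"

old_backups "$demo" 2    # lists only the 2020-07-25 file
```

Review the printed list by hand before removing anything, and re-check free space with `df -h /apps/mounts/cmsbinary` afterwards.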
6. Compact TarMk and Restart the Service
$ aem-compact-tar -c && aem-sequencer start
2020-08-13T12:02:08 [INFO] Starting AEM publish
2020-08-13T12:02:08 [INFO] Starting AEM in Runmode: publish,crx3,crx3tar,nosamplecontent, Port: 4503 ...
2020-08-13T12:02:08 [INFO] CQ_JARFILE: /apps/cms/var/data/aem64-publish/crx-quickstart/app/cq-quickstart-6.4.0-load22b-standalone-quickstart.jar
2020-08-13T12:02:08 [INFO] Start command: /apps/local/jdk1.8/bin/java -Djava.io.tmpdir=/apps/cms/var/tmp -server -Xms8G -Xmx8G -XX:MaxMetaspaceSize=512M -Djava.awt.headless=true -Djava.net.preferIPv6Addresses=true -Dsling.run.modes=publish,crx3,crx3tar,nosamplecontent -jar /apps/cms/var/data/aem64-publish/crx-quickstart/app/cq-quickstart-6.4.0-load22b-standalone-quickstart.jar start -c /apps/cms/var/data/aem64-publish/crx-quickstart -i launchpad -p 4503 -Dsling.properties=/apps/cms/var/data/aem64-publish/crx-quickstart/conf/sling.properties
2020-08-13T12:02:08 [INFO] Started [24158]
2020-08-13T12:02:09 [INFO] AEM publish started: looking for 'Startup finished' message
2020-08-13T12:02:49 [INFO] Syntax checking: /apps/local/httpd/bin/httpd -DGZIP -DCCACHE -DPUBLISH -DAUX -DWEBDAV -DFORENSIC -f /apps/cms/etc/httpd/httpd.conf.RP -t
2020-08-13T12:02:49 [OK] Syntax OK
2020-08-13T12:02:49 [INFO] start
2020-08-13T12:02:52 [OK] Service listening on port 443
2020-08-13T12:02:52 [OK] Service listening on port 4501
Note: Currently, aem-compact-tar produces no visible output.
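Because the two commands are joined with `&&`, the service is started only if compaction exits successfully. The sketch below shows that control flow with stand-in functions (`compact_ok`, `compact_fail`, and `start_service` are placeholders, not the real tools), so it can be tried on any machine.

```shell
# Stand-ins for the real commands, so the control flow can be tried anywhere.
compact_ok()    { return 0; }   # pretend compaction succeeded
compact_fail()  { return 1; }   # pretend compaction failed
start_service() { echo "service started"; }

# Success: the right-hand side runs.
compact_ok && start_service

# Failure: the start is skipped, and the fallback message is printed instead.
compact_fail && start_service || echo "compaction failed: service left stopped"
```

If the service does not come up after the real command, check whether compaction failed before trying to start AEM by hand.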
7. Restart org.apache.sling.event (Optional from AEM 6.5)
Starting with AEM 6.5, we have not noticed any issues when this Sling bundle is not restarted, so this step can be skipped during maintenance on the Author machine. If you do perform it:
Apache Sling Event Support (org.apache.sling.event) must be stopped and then started.
- Go to /system/console/bundles
- In the top gray bar is a search bar. Type org.apache.sling.event into this box and press enter or "Apply Filter".
- Note there are two matches! We are looking for "Apache Sling Event Support".
- Click the square "stop" VCR button, and wait a few seconds for it to turn into a triangle (indicating the service is stopped). The status will change from "Active" to "Resolved".
- Then click the triangle "play" VCR button, and wait a few seconds for it to again become a square "stop" button. The status will again be "Active".
Publisher Maintenance
Warning
One AEM Publisher instance must be up at all times!
The publishers are a critical part of the AEM infrastructure, and must be available to serve content without interruption. Most of our content is served by the cache layer (cmsw- or Dispatcher), but it is not possible to serve all content this way.
Because we have redundant (n=2) publishers, we technically can do the maintenance at any time the remaining publisher is able to meet demand. We typically perform the maintenance on publishers one at a time during the same maintenance period when we do author maintenance.
1. Put the machine under maintenance
- Log on to Alert Manager
- Add a "New Silence"
- Fill in the boxes as follows:

| Box Name | Value |
| --- | --- |
| Start | (accept default) |
| Duration | (accept default) |
| End | (accept default) |
| Matchers | instance="cmsa-prod-02.db.vt.edu:4501" (click plus) or instance="cmsa-prod-03.db.vt.edu:4501" (click plus) |
| Creator | (your pid) |
| Comment | "Routine TarMk maintenance" |

Note: the port for publishers is 4501, while the author port is 4500.
2. Shut down the Publisher
Output will look something like this:
2020-08-13T12:01:29 [INFO] Stopping AEM in Runmode: publish,crx3,crx3tar,nosamplecontent, Port: 4503 ...
2020-08-13T12:01:29 [INFO] CQ_JARFILE: /apps/cms/var/data/aem64-publish/crx-quickstart/app/cq-quickstart-6.4.0-load22b-standalone-quickstart.jar
13.08.2020 12:01:30.901 *INFO * [main] Setting sling.home=/apps/cms/var/data/aem64-publish/crx-quickstart (command line)
13.08.2020 12:01:31.178 *INFO * [main] Sent 'stop' to /127.0.0.1:36853: OK
2020-08-13T12:01:31 [INFO] Waiting 300s for process 29655 to terminate...
2020-08-13T12:01:49 [INFO] Process 29655 ended in 00:00:18 with code 0
2020-08-13T12:01:49 [INFO] AEM Stopped
2020-08-13T12:01:57 [INFO] Syntax checking: /apps/local/httpd/bin/httpd -DGZIP -DCCACHE -DPUBLISH -DAUX -DWEBDAV -DFORENSIC -f /apps/cms/etc/httpd/httpd.conf.RP -t
2020-08-13T12:01:58 [OK] Syntax OK
2020-08-13T12:01:58 [INFO] graceful-stop
3. Ensure the Publisher process has stopped
NOTE: It is crucial to ensure the AEM Java process has completed and exited before you continue on. Running the TarMK compaction process on open/in-use files will wreck them. (The oak-run utility does not seem to detect in-use files well).
There should be NO JAVA PROCESSES running. If one is still running, try the stop command again, or kill the pid of the Java process (do not kill -9 the Java process; we have corrupted the TarMK database on a publisher this way). Do not proceed with compaction if the AEM Java process is running.
4. Backup the segmentstore
$ backup_segmentstore_to_nas
2020-08-13T14:04:42 [INFO] Begin backup_segmentstore_to_nas
2020-08-13T14:04:42 [INFO] Complete. Run time: 00:04:17
5. Compact TarMk and Restart the Service
$ aem-compact-tar -c && aem-sequencer start
2020-08-13T12:02:08 [INFO] Starting AEM publish
2020-08-13T12:02:08 [INFO] Starting AEM in Runmode: publish,crx3,crx3tar,nosamplecontent, Port: 4503 ...
2020-08-13T12:02:08 [INFO] CQ_JARFILE: /apps/cms/var/data/aem64-publish/crx-quickstart/app/cq-quickstart-6.4.0-load22b-standalone-quickstart.jar
2020-08-13T12:02:08 [INFO] Start command: /apps/local/jdk1.8/bin/java -Djava.io.tmpdir=/apps/cms/var/tmp -server -Xms8G -Xmx8G -XX:MaxMetaspaceSize=512M -Djava.awt.headless=true -Djava.net.preferIPv6Addresses=true -Dsling.run.modes=publish,crx3,crx3tar,nosamplecontent -jar /apps/cms/var/data/aem64-publish/crx-quickstart/app/cq-quickstart-6.4.0-load22b-standalone-quickstart.jar start -c /apps/cms/var/data/aem64-publish/crx-quickstart -i launchpad -p 4503 -Dsling.properties=/apps/cms/var/data/aem64-publish/crx-quickstart/conf/sling.properties
2020-08-13T12:02:08 [INFO] Started [24158]
2020-08-13T12:02:09 [INFO] AEM publish started: looking for 'Startup finished' message
2020-08-13T12:02:49 [INFO] Syntax checking: /apps/local/httpd/bin/httpd -DGZIP -DCCACHE -DPUBLISH -DAUX -DWEBDAV -DFORENSIC -f /apps/cms/etc/httpd/httpd.conf.RP -t
2020-08-13T12:02:49 [OK] Syntax OK
2020-08-13T12:02:49 [INFO] start
2020-08-13T12:02:52 [OK] Service listening on port 443
2020-08-13T12:02:52 [OK] Service listening on port 4501
Note: Currently, aem-compact-tar produces no visible output.