Writing MultiNode tests

LAVA supports running a single test across multiple devices (of any type), combining those devices into a group. Devices within this MultiNode group can communicate with each other using the MultiNode API.

The test definitions used in MultiNode tests typically do not have to differ much from single-node tests, unless the tests need to support communication between devices in the same group. In fact, the recommended way to develop MultiNode tests is to start simple and build up complexity one step at a time. That’s what the examples here will show.

Note

When viewing MultiNode log files, the original YAML submitted to start the job is available via the MultiNode Definition link. Internally, LAVA parses and splits up that MultiNode definition into multiple sub-definitions, one per node in the test. Each node will then see a separate logical test job (and therefore a separate log file) based on these sub-definitions. Each sub-definition can be viewed via the Definition link of the corresponding job. Submitting the definition of one node of a MultiNode job as a separate job is unlikely to be useful, because of the links between the jobs in the group.

Writing a MultiNode job file

Our first example is the simplest possible MultiNode test job - the same job runs on two devices of the same type, without using any of the synchronization calls.

Defining MultiNode roles

Starting with an already-working simple single-device test job, the first changes to make are in device selection:

  • Remove the device_type declaration in the job; that only works for single devices.

  • Add configuration for the MultiNode protocol to tell LAVA how to select multiple devices for your test.

The MultiNode protocol defines the new concept of roles. This example snippet creates a group of two qemu devices, one in the foo role and one in the bar role.

protocols:
  lava-multinode:
    roles:
      foo:
        device_type: qemu
        context:
          arch: amd64
        count: 1
      bar:
        device_type: qemu
        context:
          arch: amd64
        count: 1
    timeout:
      minutes: 6

Note

The role is an arbitrary label - you may use whatever descriptive names you like for the different roles in your test, so long as they are unique.

Using the job context in MultiNode

The job context can be included in the MultiNode role, and the same variables will then be used for all devices within the specified role. The example above shows the syntax: each role sets arch: amd64 in its context block.

The role names defined here will be used later in the test job to determine which tests are run on which devices, and also inside the test shell definition to determine how the devices communicate with each other. After just these changes, your test job will be enough to run a simple MultiNode test in LAVA. It will pick several devices for the test, then run exactly the same set of actions on each device independently.

Using MultiNode roles

The next thing to do is to modify the test job to use the roles that you have defined. This first example runs the same actions on both of the roles. Each action in the test job should now include the role field, listing one or more of the role labels defined in the protocols block.

Here we deploy the same software to the foo and bar machines by specifying each role in a list:

actions:
- deploy:
    role:
    - foo
    - bar
    timeout:
      minutes: 5
    to: tmpfs
    images:
        rootfs:
          image_arg: -drive format=raw,file={rootfs}
          url: http://files.lavasoftware.org/components/lava/standard/debian/stretch/amd64/2/stretch.img.gz
          sha256sum: b5cdb3b9e65fec2d3654a05dcdf507281f408b624535b33375170d1e852b982c
          compression: gz

We also use the same boot actions for all the devices:

- boot:
    timeout:
      minutes: 1
    role:
    - foo
    - bar
    method: qemu
    media: tmpfs
    auto_login:
      login_prompt: "debian login:"
      username: root
    prompts:
    - "root@debian:"

Running tests in MultiNode

By default, tests in MultiNode jobs will be run independently. If that is sufficient, the test action is very similar to that for a single-node job:

- test:
    role:
    - foo
    - bar
    timeout:
      minutes: 10
    definitions:
    - repository: http://git.linaro.org/lava-team/lava-functional-tests.git
      from: git
      path: lava-test-shell/multi-node/multinode01.yaml
      name: multinode-basic
    - repository: http://git.linaro.org/lava-team/lava-functional-tests.git
      from: git
      path: lava-test-shell/smoke-tests-basic.yaml
      name: smoke-tests

That’s your first MultiNode test job complete. It’s quite simple to follow, but it hasn’t really done much yet. To see this in action, you could try the complete example test job yourself: first-multinode-job.yaml

Running different tests on different devices

As well as simply running the same tasks on similar devices, MultiNode can also run different tests on the different devices in the test. To configure this, use the role support to allocate different deploy, boot and test actions to different roles.

This second example will use two beaglebone-black devices and one cubietruck device. These devices need different files to deploy and different commands to boot, and will most likely take different lengths of time to boot all the way to a login prompt. If you want to run this example test job yourself, you will need at least one cubietruck device and at least two beaglebone-black devices.

The example includes details of how to deploy to devices using U-Boot, but don’t worry about those details. The important elements from a MultiNode perspective are the uses of role here.

Allocating different device types to a group

This is a simple change from our first example, defining the two roles of server and client:

protocols:
  lava-multinode:
    roles:
      server:
        device_type: cubietruck
        count: 1
      client:
        device_type: beaglebone-black
        count: 2
    timeout:
      minutes: 6

Splitting deployment actions between roles

Now we’re using different files in the deployment for each role. To support that, we define two separate deploy action blocks, one for the server machines and one for the client machines.

actions:
- deploy:
    role:
    - server
    timeout:
      minutes: 10
    to: tftp
    kernel:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/vmlinuz-4.9.0-4-armmp
      sha256sum: b6043cc5a07e2cead3f7f098018e7706ea7840eece2a456ba5fcfaddaf98a21e
      type: zimage
    ramdisk:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/initrd.img-4.9.0-4-armmp
      sha256sum: 4cc25f499ae74e72b5d74c9c5e65e143de8c2e3b019f5d1781abbf519479b843
      compression: gz
    modules:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/modules.tar.gz
      sha256sum: 10e6930e9282dd44905cfd3f3a2d5a5058a1d400374afb2619412554e1067d58
      compression: gz
    nfsrootfs:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/stretch-armhf-nfs.tar.gz
      sha256sum: 46d18f339ac973359e8ac507e5258b620709add94cf5e09a858d936ace38f698
      compression: gz
    dtb:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/dtbs/sun7i-a20-cubietruck.dtb
      sha256sum: b727c17dce4a67a865cbcd53447f7297e35ffbd18f0a99c7e9f8a10026237cea

- deploy:
    role:
    - client
    timeout:
      minutes: 10
    to: tftp
    kernel:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/vmlinuz-4.9.0-4-armmp
      sha256sum: b6043cc5a07e2cead3f7f098018e7706ea7840eece2a456ba5fcfaddaf98a21e
      type: zimage
    ramdisk:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/initrd.img-4.9.0-4-armmp
      sha256sum: 4cc25f499ae74e72b5d74c9c5e65e143de8c2e3b019f5d1781abbf519479b843
      compression: gz
    modules:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/modules.tar.gz
      sha256sum: 10e6930e9282dd44905cfd3f3a2d5a5058a1d400374afb2619412554e1067d58
      compression: gz
    nfsrootfs:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/stretch-armhf-nfs.tar.gz
      sha256sum: 46d18f339ac973359e8ac507e5258b620709add94cf5e09a858d936ace38f698
      compression: gz
    dtb:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/dtbs/am335x-boneblack.dtb
      sha256sum: c4c461712bf52af7d020e78678e20fc946f1d9b9552ef26fd07ae85c5373ece9

(Potentially) Splitting boot actions

To cover different boot commands we could now have two different boot action blocks. But in this case our devices behave in the same way in terms of bootup, so we can just use a single boot block and list both client and server.

- boot:
    role:
    - client
    - server
    method: u-boot
    commands: nfs
    auto_login:
      login_prompt: 'login:'
      username: root
    prompts:
    - 'root@stretch:'
    timeout:
      minutes: 5
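
If the two device types did need different boot handling, the single block above could be split per role, in the same way as the deploy blocks. A minimal sketch, reusing the values from the example above; the longer server timeout is a hypothetical difference, chosen only to show why a split might be wanted:

```yaml
# Sketch only: one boot block per role instead of a shared block.
- boot:
    role:
    - server
    method: u-boot
    commands: nfs
    auto_login:
      login_prompt: 'login:'
      username: root
    prompts:
    - 'root@stretch:'
    timeout:
      minutes: 8  # hypothetical: allow the server longer to boot

- boot:
    role:
    - client
    method: u-boot
    commands: nfs
    auto_login:
      login_prompt: 'login:'
      username: root
    prompts:
    - 'root@stretch:'
    timeout:
      minutes: 5
```

Each device only runs the blocks that list its role, so the server and the clients can diverge here without affecting each other.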

Using MultiNode commands to synchronize devices

A very common requirement in a MultiNode test is that a device (or devices) within the MultiNode group must wait until another device in the group reaches a particular stage. This can be used to ensure that a device running a server has had time to complete the boot and start the server before the device running the client tries to make a connection to the server, for example. The only way to be sure that the server is ready for client connections is to make every client in the group wait until the server confirms that it is ready.

Continuing with the same cubietruck and beaglebone-black example, let’s look at synchronizing devices within a MultiNode group.

Controlling synchronization from the test shell

Synchronization is done using the MultiNode API, specifically the lava-send and lava-wait calls.

Continuing our example, we have two different versions of the test action block. In the version for the server role, the machine will do some work (in this case, install and start the Apache web server) and then tell the clients that the server is ready using lava-send:

- test:
    role:
    - server
    timeout:
      minutes: 5
    definitions:
    - repository:
        metadata:
          format: Lava-Test Test Definition 1.0
          name: apache-server
          description: "server installation"
          os:
          - debian
        run:
          steps:
          - apt -q update
          - apt -q -y install apache2
          - lava-test-case dpkg --shell dpkg -s apache2
          - lava-send server_installed
      from: inline
      name: apache-server
      path: inline/apache-server.yaml

Note

It is recommended to use inline definitions for the calls to the synchronization helpers. This makes it much easier to debug when a synchronization call times out and will allow the flow of the MultiNode job to be summarized in the UI.

The test definition specified for the client role causes the client devices to wait until the test definition specified for the server role uses lava-send to signal that the server is ready.

- test:
    role:
    - client
    timeout:
      minutes: 5
    definitions:
    - repository:
        metadata:
          format: Lava-Test Test Definition 1.0
          name: client-wait
          description: "client waiting for server"
          os:
          - debian
        run:
          steps:
          - lava-test-case client --shell uname -a
          - lava-wait server_installed
      from: inline
      name: client-wait
      path: inline/client-wait.yaml

This means that each device using the role client will wait until any one device in the group sends a signal with the messageID of server_installed. The assumption here is that the group only has one device with the role server.

The second MultiNode example is now complete. To run this yourself, you can see the complete example test job: second-multinode-job.yaml. Remember, you’ll need specific hardware devices for it to work out of the box, but you can easily replace them with devices you have available.

Controlling synchronization from the dispatcher

The MultiNode protocol also provides support for using the MultiNode API outside of the test shell definition; any action block can access the protocol from within specific actions. This makes it possible to even block deployment or boot on one group of machines until others are fully up and running, for example. There is a lot of flexibility here to allow for a massive range of possible test scenarios.

See Writing jobs using the MultiNode protocol for more information on how to call the MultiNode API outside the test shell.

Using the MultiNode API - further features

As demonstrated earlier, tests can use lava-wait to cause a device to wait on a single message from any other device in the MultiNode group. It is also possible to wait for all other devices in the MultiNode group to send a signal - use lava-wait-all instead.

Each message sent using the MultiNode API uses a messageID, which is a string that must be unique within the group. It is recommended to make these strings descriptive to help track job progress and debug problems. Be careful to use underscores instead of spaces in the name. The messageID will be included in the log files of the test.
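
As a sketch of lava-wait-all, the server in the earlier example could wait until every client has reported in before continuing. The messageID clients_ready is a hypothetical name chosen for this illustration, and the optional role argument to lava-wait-all is assumed to behave as described in the MultiNode API reference (wait only for devices with that role).

```yaml
# Steps for the client role: announce that this client is ready.
run:
  steps:
  - lava-send clients_ready
---
# Steps for the server role: block until every device in the
# client role has sent clients_ready.
run:
  steps:
  - lava-wait-all clients_ready client
```

Without the role argument, lava-wait-all waits for the message from all other devices in the group.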

Warning

When using lava-wait and lava-wait-all, the device will wait until the expected messageID is received. If that messageID does not arrive, the job will simply time out when the default timeout expires. See Timeouts.

Using MultiNode commands to pass data between devices

lava-send can be used to send additional data to other devices, beyond just messageID. Data is sent as key-value pairs following the messageID. A device can send data at any time, and that data will be broadcast to all devices in the MultiNode group. The data can be received by any device in the group waiting on the target messageID using lava-wait or lava-wait-all.

Note

The message data is stored in a cache file which will be overwritten when the next synchronization call is made. Ensure that your scripts make use of (or copy aside) any MultiNode cache data before calling any other MultiNode API helpers that may clear the cache.

For example, if a device activates a network interface and wants to make data about that network connection available to other devices in the group, the device can send the IP address using lava-send:

run:
   steps:
      - lava-send ipv4 ip=$(./get_ip.sh)

The contents of get_ip.sh are operating system specific.

On the receiving device, the test definition would include a call to lava-wait or lava-wait-all with the same messageID:

run:
   steps:
      - lava-wait ipv4
      - ipdata=$(cat /tmp/lava_multi_node_cache.txt | cut -d = -f 2)
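
Where more than one key-value pair is sent, selecting the value by key is safer than relying on field position. The following sketch assumes the cache file holds one key=value entry per line after lava-wait; the messageID network, the keys ip and hostname, and the copy at /tmp/network.cache are names made up for this example.

```yaml
# Sending device: one message carrying two key-value pairs.
run:
  steps:
  - lava-send network ip=$(./get_ip.sh) hostname=$(hostname)
---
# Receiving device: copy the cache aside before any other
# MultiNode call overwrites it, then pick out values by key.
run:
  steps:
  - lava-wait network
  - cp /tmp/lava_multi_node_cache.txt /tmp/network.cache
  - peer_ip=$(grep '^ip=' /tmp/network.cache | cut -d = -f 2-)
  - peer_name=$(grep '^hostname=' /tmp/network.cache | cut -d = -f 2-)
  - echo "peer is $peer_name at $peer_ip"
  - lava-test-case ping-peer --shell ping -c 3 "$peer_ip"
```

Working from the copy means later synchronization calls cannot change the values part-way through the test.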

Note

Although multiple key-value pairs can be sent as a single message, the API is not intended for large amounts of data. There is a message size limit of 4KiB, including protocol overhead. Use other transfer methods like ssh or wget if you need to send larger amounts of data between devices.
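
One way to stay within the message size limit is to send only a location and fetch the data itself with another tool. This sketch assumes the sending device already serves the file over HTTP (for instance from the Apache server installed earlier) and that the cache file holds key=value entries; the messageID results_ready, the key url and the file name results.tar.gz are hypothetical.

```yaml
# Sending device: publish where the large file can be downloaded.
run:
  steps:
  - lava-send results_ready url=http://$(./get_ip.sh)/results.tar.gz
---
# Receiving device: read the URL from the cache, then fetch the
# file with wget rather than through the MultiNode API.
run:
  steps:
  - lava-wait results_ready
  - url=$(grep '^url=' /tmp/lava_multi_node_cache.txt | cut -d = -f 2-)
  - lava-test-case fetch-results --shell wget "$url"
```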

Helper tools in LAVA

LAVA provides some helper routines for common data transfer tasks and more can be added where appropriate. The main MultiNode API calls are intended to work on all POSIX systems.

Other MultiNode calls

It is also possible for devices to retrieve data about the group itself, including the role or name of the current device as well as the names and roles of other devices in the group. See MultiNode API for more information.
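
For instance, a test definition can record which device and role it is running as. This sketch assumes the lava-self, lava-role and lava-group helpers print the device name, the role of the current device and the list of devices with their roles, as described in the MultiNode API reference.

```yaml
run:
  steps:
  # Log the identity of this device within the group.
  - echo "Running on $(lava-self) with role $(lava-role)"
  # List every device in the group together with its role.
  - lava-group
```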

Writing jobs using the MultiNode protocol

The MultiNode protocol defines the MultiNode group and also allows actions within the job pipeline to make calls using the MultiNode API outside of a test definition.

The MultiNode protocol allows data to be shared between actions, including data generated in a test shell definition for one role being made available for use by a different role in its deploy or boot action.

The MultiNode protocol can underpin the use of other tools without necessarily needing a dedicated protocol class to be written for those tools. Using the MultiNode protocol is an extension of using the existing MultiNode API calls within a test definition. The use of the protocol is an advanced use of LAVA and relies on the test writer carefully planning how the job will work. See Delaying the start of a job using Multinode for an example of how to use this.

Writing jobs using MultiNode and LXC

Some devices need an LXC to do the deployment operations, boot operations or test shell actions. These devices can still be part of a MultiNode group, as long as some thought is given to constructing the test job submission.

In this section, the example will concentrate on adding MultiNode to a device which needs LXC support in order to allow the device to run tests inside a Secondary Connection.

This test job will combine the ideas of namespace and role. Make sure you are clear on which parts of the test job will happen in which namespace and in which role.

For a secondary connection MultiNode test job using a device which uses an LXC, the job needs to deploy and boot the LXC, deploy and boot the host device, and deploy and boot the SSH guest(s). There will also be test actions, but those can be considered later.

The host device and LXC will have existing test jobs which use namespaces. The secondary connection(s) will need to operate in an additional namespace.

The host device and LXC will need to exist in the same role; the secondary connection(s) will need to be in a separate role so that the MultiNode protocol can be used to communicate from the host to the guest.

In the example, the secondary connections use the guest role and the guest namespace. The device uses the host role and the device namespace. The LXC uses the host role and the probe namespace.

Roles and namespaces can have the same label but are not the same thing. If you do use the same string for a namespace and a role, make sure that the namespace and the role apply to the same actions. All actions must have both a namespace and a role.
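
The pairing of roles and namespaces in this example can be summarized as a skeleton of the actions list. This is only an outline of the job described in this section, with every other key omitted, so it is not a submittable job:

```yaml
actions:
- deploy:              # LXC
    role: [host]
    namespace: probe
- boot:                # LXC
    role: [host]
    namespace: probe
- deploy:              # device
    role: [host]
    namespace: device
- deploy:              # SSH guests
    role: [guest]
    namespace: guest
- boot:                # device
    role: [host]
    namespace: device
- boot:                # SSH guests
    role: [guest]
    namespace: guest
```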

Download or view the complete example: examples/test-jobs/bbb-lxc-ssh-guest.yaml

The protocols block needs to combine the MultiNode requirements and the LXC requirements. However, the MultiNode requirements take priority because the LXC block is only accessed after the test job submission has been split into the list of sub-jobs contained within the MultiNode group. Therefore, the LXC protocol information has to declare the role and put the rest of the data within that role:

protocols:
  lava-lxc:
    host:
      name: lxc-ssh-test
      template: debian
      distribution: debian
      release: stretch

The rest of the protocols block now defines the MultiNode roles and data required for the secondary connections - in this case 3 connections will be attempted. The full protocols block looks like:

protocols:
  lava-lxc:
    host:
      name: lxc-ssh-test
      template: debian
      distribution: debian
      release: stretch
  lava-multinode:
    # expect_role is used by the dispatcher and is part of delay_start
    # host_role is used by the scheduler, unrelated to delay_start.
    roles:
      host:
        device_type: beaglebone-black
        count: 1
        timeout:
          minutes: 10
      guest:
        # protocol API call to make during protocol setup
        request: lava-start
        # set the role for which this role will wait
        expect_role: host
        timeout:
          minutes: 15
        # no device_type, just a connection
        connection: ssh
        count: 3
        # each ssh connection will attempt to connect to the device of role 'host'
        host_role: host

The actions need a little bit of consideration. Typically, the LXC will need to be deployed and booted first. This is so that tools inside the LXC can be installed and configured, ready to be used in the deploy stage of the device.

Just as with the protocols block, the MultiNode requirements take priority over the LXC, so the LXC deploy and boot actions must declare a role. To be useful with the device, the role must match the role assigned to the device. In this example, that role is labeled host and uses the probe namespace.

actions:
- deploy:
    role:
    - host
    namespace: probe
    timeout:
      minutes: 5
    to: lxc
    packages:
    - usbutils
    - procps
    - lsb-release
    - util-linux

- boot:
    role:
    - host
    namespace: probe
    prompts:
    - 'root@(.*):/#'
    timeout:
      minutes: 5
    method: lxc

The deploy and boot actions for the device and for the secondary connection guests remain the same as for a secondary connection test job without an LXC. Note how the device uses the host role and the device namespace and the guests use the guest role and guest namespace.

The change compared to a test job for a device which needs an LXC is the use of authorize. Whichever role is operating as the host must specify how to authorize connections from other roles using the authorize: key in the deployment. This allows the relevant Action to deploy the necessary support, for example /root/.ssh/authorized_keys.

- deploy:
    role:
    - host
    namespace: device
    timeout:
      minutes: 4
    to: tftp
    # authorize for ssh adds the ssh public key to authorized_keys
    authorize: ssh

    kernel:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/vmlinuz-4.9.0-4-armmp
      sha256sum: b6043cc5a07e2cead3f7f098018e7706ea7840eece2a456ba5fcfaddaf98a21e
      type: zimage
    ramdisk:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/initrd.img-4.9.0-4-armmp
      sha256sum: 4cc25f499ae74e72b5d74c9c5e65e143de8c2e3b019f5d1781abbf519479b843
      compression: gz
    modules:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/modules.tar.gz
      sha256sum: 10e6930e9282dd44905cfd3f3a2d5a5058a1d400374afb2619412554e1067d58
      compression: gz
    nfsrootfs:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/stretch-armhf-nfs.tar.gz
      sha256sum: 46d18f339ac973359e8ac507e5258b620709add94cf5e09a858d936ace38f698
      compression: gz
    dtb:
      url: https://files.lavasoftware.org/components/lava/standard/debian/stretch/armhf/3/dtbs/am335x-boneblack.dtb
      sha256sum: c4c461712bf52af7d020e78678e20fc946f1d9b9552ef26fd07ae85c5373ece9

- deploy:
    role:
    - guest
    namespace: guest
    timeout:  # timeout for the ssh connection attempt
      seconds: 30
    to: ssh
    connection: ssh
    protocols:
      lava-multinode:
      - action: prepare-scp-overlay
        request: lava-wait
        # messageID matches hostID
        messageID: ipv4
        message:
          # the key of the message matches value of the host_key
          # the value of the message gets substituted
          ipaddr: $ipaddr
      timeout:  # delay_start timeout
        minutes: 5

- boot:
    role:
    - host
    namespace: device
    timeout:
      minutes: 15
    method: u-boot
    commands: nfs
    auto_login:
      login_prompt: 'login:'
      username: root
    prompts:
    - 'root@stretch:'
    parameters:
      shutdown-message: "reboot: Restarting system"

- boot:
    role:
    - guest
    namespace: guest
    timeout:
      minutes: 3
    prompts:
    - 'root@stretch:'
    parameters:
      hostID: ipv4  # messageID
      host_key: ipaddr  # message key
    method: ssh
    connection: ssh

See also

Secondary Connection for more information on secondary connections and authorization.

Adding test actions

Test actions can be placed anywhere after the corresponding deploy and boot actions. As the LXC is deployed and booted first, the LXC can run a test shell at any of these points: before deploying the device, before booting the device, before the test shell action on the device which starts the secondary connection guests, or at any later point.

The test actions for the secondary connections retain the original order - the host test action needs to run first to set up the service and send the message declaring the IP address.
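
A minimal sketch of such a host test action follows, using an inline definition. The messageID ipv4 and the key ipaddr match the hostID and host_key parameters of the guest boot action above. The lava-echo-ipv4 helper, the interface name eth0 and the openssh-server package are assumptions about the host image, and the final lava_start message is an assumption about what the guests' lava-start request waits for; check Delaying the start of a job using Multinode before relying on it. The definition name declare-ipaddr is hypothetical.

```yaml
- test:
    role:
    - host
    namespace: device
    timeout:
      minutes: 10
    definitions:
    - repository:
        metadata:
          format: Lava-Test Test Definition 1.0
          name: declare-ipaddr
          description: "host declares its IP address to the guests"
          os:
          - debian
        run:
          steps:
          # Set up the service the guests will connect to (assumed package).
          - apt -q update
          - apt -q -y install openssh-server
          # Assumed helper and interface name; adjust for the host image.
          - lava-send ipv4 ipaddr=$(lava-echo-ipv4 eth0)
          # Assumed signal that releases guests waiting on lava-start.
          - lava-send lava_start
      from: inline
      name: declare-ipaddr
      path: inline/declare-ipaddr.yaml
```

The guest test actions would then follow, using the guest role and the guest namespace.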

Remember that the LXC is associated with the device (in this example, the host role).