
Best practices for dealing with executions that do not stop



  • Currently I have two Otter agents that have executions that have been running for days. I am not sure what is wrong with them, but I want to properly terminate the executions and make sure the agents are in a good state. What is the best practice for dealing with this? I am planning the following approach (pseudo code):

    foreach host with long running execution {
          delete the execution using native Otter API
          #Is this service restart needed?
          restart the INEDOAGENTSVC running on the host
    }
    

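    In concrete terms, I was picturing something like this (rough, untested PowerShell sketch; the Otter host name, API key, server names, and execution IDs are all placeholders):

    $apiKey    = 'xxxxx'     # placeholder Otter API key
    $otterHost = 'otter'     # placeholder Otter server host name

    # Placeholder list of servers with long-running executions and the execution IDs to remove
    $stuck = @(
        @{ Server = 'server01'; ExecutionId = 1234 },
        @{ Server = 'server02'; ExecutionId = 5678 }
    )

    foreach ($item in $stuck) {
        # Delete the execution record using the native API
        Invoke-RestMethod -Uri "http://$otterHost/api/json/Executions_DeleteExecution?API_Key=$apiKey&Execution_Id=$($item.ExecutionId)" | Out-Null

        # Is this service restart needed?
        $svc = Get-Service -ComputerName $item.Server -Name 'INEDOAGENTSVC'
        $svc.Stop()
        $svc.WaitForStatus([System.ServiceProcess.ServiceControllerStatus]::Stopped)
        $svc.Start()
    }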



  • First and foremost, don't do that :)

    The only reliable way to get rid of a stuck (non-cancellable) execution is to stop/start the Otter service; this will immediately halt all executions (on stop) and mark them as failed (on start). But we need to figure out why these executions get into a stuck state.
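
    (For reference, that is just a plain Windows service restart from an elevated PowerShell prompt on the Otter server; the service name below should be the default, but double-check what yours is called:)

    # Stops and restarts the Otter service; on start, any stuck executions are marked as failed
    Restart-Service -Name 'INEDOOTTERSVC'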

    The first thing to look at is... what version of the Inedo Agent is installed? The latest version is 35; it must be updated manually. We will eventually have an update process you can run, but it's not a big priority since the Inedo Agent rarely changes (#35 is the first update this year, and it fixes some deadlock issues we've seen). If the agent version is #35, then the problem is not the same deadlock issue we identified.

    The next thing is the plan itself. What operation is it stuck on? Does cancelling the execution have any effect?

    Obviously a Sleep 999999 would yield this behavior, but that operation supports cancellation. However, if you did something like a custom operation that called Thread.Sleep, it could not be cancelled and the execution would remain stuck.



  • The agent version is 35.0.0.2

    The next thing is the plan itself. What operation is it stuck on? Does cancelling the execution have any effect?

    This may sound dumb, but I am not sure how to cancel an execution from the UI. I am able to call the "/api/json/Executions_DeleteExecution?API_Key=$apiKey&Execution_Id=$id" native API method, but I don't think that is what you are talking about.

    Looking at the executing plan's log output, the last entry is an "Ensure Asset" of a file during the collection phase. It is sitting at "Comparing configuration..." but has not output the expected "INFO: Configuration matches template." It is iterating through an array of asset names and is stuck on the 5th item of 6 total items.

    Here is the plan:

    ##AH:UseTextMode
    # Loop
    foreach $assetName in @TeamCityPluginAssetNames
    {
        # General
        with retry = 3
        {
            Ensure-Asset
            (
                Name: $assetName,
                Type: File,
                Directory: c:\TeamCity\.BuildServer\plugins
            );
        }
    }
    

    Ideally, I would like to create a way to automatically remediate the long-running execution scenario, perhaps a nightly Otter job that checks all server executions and, when a long-running one is found, restarts the INEDOAGENTSVC on the respective server.



  • Hmm, you might not actually be able to cancel Otter jobs from the UI (or API); we're working on more BuildMaster/Otter unification, so this will come over soon. That said, I doubt this execution would be cancellable anyway.

    I don't think this is an agent problem.

    Why is there a retry on this block? Has it been throwing errors before? What types of errors?

    What type of raft are you using? How big are these team city plug-ins?



  • Why is there a retry on this block? Has it been throwing errors before? What types of errors?

    I have had times in the past when files could be locked by a process for a period of time. This is just my general pattern when ensuring files.

    What type of raft are you using? How big are these team city plug-ins?

    I am using a Git raft. The biggest plugin is 2.9 MB.



  • Oh, I forgot to add that these two executions are hanging on completely different plans/configurations. The other server is executing this plan/template:

    ##AH:UseTextMode
    ##AH:Description Installs 1 to n PowerShell modules
    template InstallPowerShellModules<@PsModuleList, $FailureEmailAddresses>
    {
        # Loop
        foreach $moduleName in @PsModuleList
        {
            # Try/Catch
            try
            {
                PSEnsure
                (
                    Key: Ensure $moduleName PowerShell Module,
                    Value: 0,
                    CollectScript: Test-LatestPowerShellModuleInstalled,
                    ConfigureScript: Install-PowerShellModule,
                    UseExitCode: True,
                    Debug: False,
                    Verbose: False,
                    CollectScriptParams: %(
                           ModuleName: $moduleName
                        ),
                    ConfigureScriptParams: %(
                           ModuleName: $moduleName
                        )
                );
            }
            catch
            {
                Send-Email
                (
                    To: $FailureEmailAddresses,
                    Subject: PowerShell module failed to install: $moduleName
                );
            }
        }
    }
    

    It is hung on the third-to-last module in the array of module names, during the collection phase, at "DEBUG: Collecting configuration...".



  • Thanks for the additional info; so, this should most definitely never happen. But in general, to resolve a stuck execution...

    1. Restart the main (Otter/BuildMaster) service

    2. If an execution gets "stuck" again on the same server, restart the agent service

    There are so many things that can cause a "stuck" execution, and sometimes it's just easier to restart like this.

    But I'd definitely like to work together to identify the potential causes for it. The first step is isolating where it's occurring and under what conditions:

    • Does it require a service or agent restart?
    • Is it always on the same servers, or does it vary?
    • Is it always on the same operation?
    • Is the last message always "Collecting configuration..."?
    • Are any error messages logged in Admin > Errors?

    Once we have some clue on that, we can explore the code to find more places to log. We've found so many interesting causes over the years of building the BuildMaster Legacy (v4) execution engine, ranging from unexpected network traffic on agents to bad RAM (!).

    Fortunately, with this new execution engine, it will be much easier to isolate where the problems are.

    That said... from here, it'd be best to open a ticket for this, since this is very specific advice and it's helpful to send lots of logs over email.



  • Thanks Steve, restarting the Otter service cleared out those long-running executions. So that means it was 'hanging' in the Otter service? This issue is very intermittent and not common (I have seen it maybe 6 or 7 times in the last several months); it was probably generally cleared up by Otter server upgrades, since those restart the service anyway.

    So what do you think of this algorithm to remediate stuck executions:

    In a Job plan run every 3 hours
        determine if any host has a long running execution (more than an hour)
        if long running executions exist
            restart Otter service


  • If restarting fixed it, then the "hanging" was in the service; most likely, the agent never reported back that the "job" (a very small unit of work, not an Otter job) was complete. Once we add cancellation to the UI, you could confirm that to be the case...

    Note that, as soon as the Otter service stops, the job will stop running, so the service won't be able to start itself back up. You will therefore need an external process to handle the restart, like this:

    # Build the restart commands as a string; the backtick-escaped `$svc is expanded in the
    # spawned process rather than in this script
    $script =
    "
    sleep 10

    `$svc = (Get-Service -ComputerName 'inedoappsv1' -Name 'INEDOBMSVC')
    Try { `$svc.Stop() }
    Catch { }
    `$svc.WaitForStatus([System.ServiceProcess.ServiceControllerStatus]::Stopped)

    `$svc.Start()
    "

    # Run the restart in a separate powershell.exe process so it keeps going after the
    # BuildMaster service stops and takes this execution down with it
    Start-Process -FilePath 'powershell.exe' -WorkingDirectory 'C:\Projects\BuildMaster' -ArgumentList @('-NoProfile', '-ExecutionPolicy', 'unrestricted', '-Command', $script)
    Write-Output 'BuildMaster will be deployed in 10 seconds'
    

    These are the relevant parts of a PS script we use to have BuildMaster deploy to itself; it seems to work pretty well.

