ConfD crashed after loading initial DB

Hi,

We recently upgraded from version 7.8.1 to 8.0.16.
At times (seemingly at random), ConfD crashes soon after loading the initial .xml files in phase0:

<INFO> 6-Feb-2025::17:01:59.577 localhost confd[2805]: - CDB load: processing file: /var/confd/cdb/0001_arcos_init.xml
<INFO> 6-Feb-2025::17:01:59.938 localhost confd[2805]: - ConfD phase0 started
<INFO> 6-Feb-2025::17:02:24.009 localhost confd[2805]: - Stopping to listen for Internal IPC on 127.0.0.1:4565
<CRIT> 6-Feb-2025::17:02:24.498 localhost confd[2805]: - Internal error: Supervision terminated

A large stack trace follows, but it seems specific to confd internal processes.

Since ConfD boots fine most of the time, I don’t suspect the backend daemons or the init .xml file. All backend applications connect to ConfD only after phase0, to register validation and data callpoints.

Has anyone encountered a similar issue, or is there a recommended way to debug this further?

=ERROR REPORT==== 6-Feb-2025::17:01:57.680055 ===
confd_rcmd,1609,
           {noproc,{gen_server,call,[capi_server,get_info,infinity]}},
           [{gen_server,call,3,[{file,"gen_server.erl"},{line,234}]},
            {confd_rcmd,fmt_c_points,0,[{file,"confd_rcmd.erl"},{line,1574}]},
            {confd_rcmd,send_status,2,[{file,"confd_rcmd.erl"},{line,1237}]},
            {confd_rcmd,handle_tcp_data,1,
                        [{file,"confd_rcmd.erl"},{line,841}]},
            {proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,234}]}]}
 =ERROR REPORT==== 6-Feb-2025::17:02:19.853951 ===
 ** Generic server cdb_db terminating
 ** Last message in was {status,50}
 ** When Server state == {state,
                          {config,"/var/confd/cdb",
                           ["/var/confd/cdb"],
                           ramdisk,
                           [{file_save_log_fun,#Fun<cdb_db.2.51010220>},
                            {progressf,#Fun<cdb_db.3.51010220>}],
                           running,true,sync,false},
                          0,init,
                          {cdb_init_sess,init,true,undefined,4,<0.146.0>,
                           undefined,false,[],
                           {[],[],[],[]},
                           [],
                           ["/var/confd/cdb"],
                           {tts_cursor,#Ref<0.3183527211.2701262849.99366>},
                           undefined,false,undefined},
                          3,undefined,undefined,normal,undefined,noreply,
                          {0,0,0},
                          [],
                          {xds_ramdisk,
                           {xds_ram,
                            {otts,#Ref<0.3183527211.2701262849.99511>,0,
                             #Ref<0.3183527211.2701131777.99512>},
                            140694351143280,0,[],[],[],[],140694351143280,
                            undefined,undefined,undefined},
                           read,ram_and_wal,disabled,undefined,undefined,
                           "/var/confd/cdb/A.cdb",raw,0,
                           {compact_after,50,50},
                           undefined,4,#Fun<cdb_db.2.51010220>,0,
                           {xds_wal,"/var/confd/cdb/A.cdb",
                            {file,
                             {file_descriptor,raw_file_io_delayed,
                              #{buffer => #Ref<0.3183527211.2701262849.99518>,
                                delay_size => 65536,owner => <0.137.0>,
                                pid => <0.151.0>}}},
                            raw,[],-1,none,-1},
                           []},
                          [],undefined,undefined,
                          {subs,[],[],0,[],undefined},
                          {subs,[],[],0,[],undefined},
                          notab,[],undefined,undefined,undefined,undefined}
 ** Reason for termination ==
 ** {{timeout,{gen_server,call,
                          [confd_cfg_server,{get,[dbDir,cdb,confdConfig]},50]}},
     [{gen_server,call,3,[{file,"gen_server.erl"},{line,234}]},
      {confd_cfg_server,do_get,2,[{file,"confd_cfg_server.erl"},{line,141}]},
      {cdb_config,get_db_dir,2,[{file,"cdb_config.erl"},{line,31}]},
      {cdb_config,cdb_conf_file,1,[{file,"cdb_config.erl"},{line,55}]},
      {cdb_db,stat,2,[{file,"cdb_db.erl"},{line,5365}]},
      {cdb_db,handle_call,3,[{file,"cdb_db.erl"},{line,1948}]},
      {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,677}]},
      {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,706}]},
      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
 ** Client <0.270.0> is dead

 =CRASH REPORT==== 6-Feb-2025::17:02:20.126601 ===
   crasher:
     initial call: cdb_db:init/1
     pid: <0.137.0>
     registered_name: cdb_db
     exception exit: {timeout,
                         {gen_server,call,
                             [confd_cfg_server,
                              {get,[dbDir,cdb,confdConfig]},
                              50]}}
       in function  gen_server:call/3 (gen_server.erl, line 234)
       in call from confd_cfg_server:do_get/2 (confd_cfg_server.erl, line 141)
       in call from cdb_config:get_db_dir/2 (cdb_config.erl, line 31)
       in call from cdb_config:cdb_conf_file/1 (cdb_config.erl, line 55)
       in call from cdb_db:stat/2 (cdb_db.erl, line 5365)
       in call from cdb_db:handle_call/3 (cdb_db.erl, line 1948)
       in call from gen_server:try_handle_call/4 (gen_server.erl, line 677)
       in call from gen_server:handle_msg/6 (gen_server.erl, line 706)
     ancestors: [cdb_sup,<0.126.0>]
     message_queue_len: 1
     messages: [{#Ref<0.3183527211.2701131777.100949>,
                    {ok,<<"/var/confd/cdb">>}}]
     links: [<0.138.0>,<0.146.0>,<0.128.0>]
     dictionary: [{config_cache,false}]
     trap_exit: true
     status: running
     heap_size: 6772
     stack_size: 27
     reductions: 1897662
   neighbours:
     neighbour:
       pid: <0.138.0>
       registered_name: cdb_subid_alloc
       initial call: cdb_subid:'-start/0-fun-0-'/0
       current_function: {cdb_subid,subid_allocator,1}
       ancestors: [cdb_db,cdb_sup,<0.126.0>]
       message_queue_len: 0
       links: [<0.137.0>]
       trap_exit: false
       status: waiting
       heap_size: 233
       stack_size: 6
       reductions: 19
       current_stacktrace: [{cdb_subid,subid_allocator,1,
                              [{file,"cdb_subid.erl"},{line,40}]},
                   {proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,234}]}]
 =SUPERVISOR REPORT==== 6-Feb-2025::17:02:20.978715 ===
     supervisor: {local,cdb_sup}
     errorContext: child_terminated
     reason: {timeout,{gen_server,call,
                                  [confd_cfg_server,
                                   {get,[dbDir,cdb,confdConfig]},
                                   50]}}
     offender: [{pid,<0.137.0>},
                {id,cdb_db},
                {mfargs,{cdb_db,start_link,[]}},
                {restart_type,permanent},
                {shutdown,3000},
                {child_type,worker}]
 =SUPERVISOR REPORT==== 6-Feb-2025::17:02:21.014888 ===
     supervisor: {local,cdb_sup}
     errorContext: shutdown
     reason: reached_max_restart_intensity
     offender: [{pid,<0.137.0>},
                {id,cdb_db},
                {mfargs,{cdb_db,start_link,[]}},
                {restart_type,permanent},
                {shutdown,3000},
                {child_type,worker}]
 =ERROR REPORT==== 6-Feb-2025::17:02:21.138051 ===
 cdb_capi:1298: handle_client_data/3 failed: exit: {noproc,
                                                    {gen_server,call,
                                                     [cdb_db,get_init_sess,
                                                      infinity]}}
 [{gen_server,call,3,[{file,"gen_server.erl"},{line,234}]},
  {cdb_capi,do_get_phase,2,[{file,"cdb_capi.erl"},{line,4115}]},
  {cdb_capi,handle_setup,3,[{file,"cdb_capi.erl"},{line,1349}]},
  {cdb_capi,handle_client_data,3,[{file,"cdb_capi.erl"},{line,1267}]},
  {cdb_capi,handle_info,2,[{file,"cdb_capi.erl"},{line,542}]},
  {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,653}]},
  {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,727}]},
  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]

 =INFO REPORT==== 6-Feb-2025::17:02:22.096885 ===
     application: cdb
     exited: shutdown
     type: permanent
 =ERROR REPORT==== 6-Feb-2025::17:02:22.620281 ===
 confd_ia:574: Server capi_server, which registered 3, seems to be down?!

 [{confd_ia,'-handle_connection/7-fun-0-',0,[{file,"confd_ia.erl"},{line,575}]},
  {confd_ia,handle_connection,7,[{file,"confd_ia.erl"},{line,575}]},
  {confd_ia,acceptor,5,[{file,"confd_ia.erl"},{line,527}]},
  {proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,234}]}]


 "Internal error: Supervision terminated\n"
 =ERROR REPORT==== 6-Feb-2025::17:02:25.547474 ===
 init:boot_msg: "Internal error: Supervision terminated\n"
[IFMGR] INTERNAL ERROR: confd_internal.c(2979): Failed to decode data
[MACSEC] INTERNAL ERROR: confd_internal.c(2979): Failed to decode data

@cohult @waitai
Thanks.

Hi Khan,

Are you saying that you are done upgrading and are solidly on 8.0, but then sometimes (when? how frequently?) there is this crash? It looks like you are accessing some data from the configuration when it crashes. Could it be that access to a particular piece of data, which you can identify, corresponds in timing and frequency to the crashes?

The other logs could be helpful for finding the command that corresponds to the timing of the crash.

Scott

Thanks @sbarvick for your response,

I’m done upgrading to 8.0.16. After starting ConfD, it crashes during phase0. At this point, none of the backend daemons try to access CDB; they only connect to CDB after phase1.

I also enabled library debug logs, but I didn’t find any API call failing with an internal error. The crash happens randomly, and when ConfD boots up fine, everything works without issues.

Do you suggest enabling any other logs to pinpoint the root cause?

I see this log as well:

2025-02-08 18:32:42Z spyder[998]:VERBOSE: >>confd[1357]: "Internal error: Component terminated (application_controller) ({application_terminated,cdb,shutdown})\n"

Hi @sbarvick
Could this be caused by checking “confd --status”?

Here is the sequence I follow to start ConfD in phases (a rough script sketch of it follows below):

  1. Start ConfD in phase0:
    confd --start-phase0 --foreground --smp 2

  2. Spawn a script that checks confd --status to see whether ConfD is in phase0.

  3. If ConfD is in phase0, start the backend daemons, which connect to ConfD and register oper connections for the validation callback registrations.

  4. Spawn a script that checks, based on confd --status, whether any validation points are still unregistered.

  5. If all callbacks are registered, start ConfD in phase1: confd --start-phase1

  6. Spawn a script that checks whether ConfD is in phase1, using the output of confd --status.

  7. If ConfD is in phase1, start ConfD in phase2: confd --start-phase2

I see this crash after step 3.
I have double-confirmed that there are no CDB interactions from the backend daemons at this point.
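
For clarity, the whole flow boils down to roughly the following sketch (the daemon names and the strings grepped out of confd --status are placeholders, not our exact scripts):

  #!/bin/sh
  # Rough sketch of the phased startup described in steps 1-7 above.
  # Daemon paths and grep patterns are placeholders.

  # Step 1: start ConfD in phase0 (backgrounded here for illustration;
  # in practice it runs under a process supervisor)
  confd --start-phase0 --foreground --smp 2 &

  # Step 2: poll "confd --status" until it reports phase0
  until confd --status 2>/dev/null | grep -q "phase0"; do
      sleep 1
  done

  # Step 3: start the backend daemons; they open oper connections and
  # register their validation/data callpoints
  /opt/backend/ifmgr_daemon &
  /opt/backend/macsec_daemon &

  # Step 4: poll "confd --status" until no validation points are unregistered
  while confd --status 2>/dev/null | grep -q "not registered"; do
      sleep 1
  done

  # Steps 5-7: advance through the remaining start phases, polling --status
  confd --start-phase1
  until confd --status 2>/dev/null | grep -q "phase1"; do sleep 1; done
  confd --start-phase2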

<INFO> 8-Feb-2025::19:50:24.555 localhost confd[1355]: - Starting to listen for Internal IPC on 127.0.0.1:4565
<INFO> 8-Feb-2025::19:50:26.508 localhost confd[1355]: - CDB load: processing file: /var/confd/cdb/0001_init.xml
<INFO> 8-Feb-2025::19:50:26.857 localhost confd[1355]: - ConfD phase0 started
<INFO> 8-Feb-2025::19:50:44.091 localhost confd[1355]: - Stopping to listen for Internal IPC on 127.0.0.1:4565
<CRIT> 8-Feb-2025::19:50:44.837 localhost confd[1355]: - Internal error: Supervision terminated

It seems possible that using confd --status to check for the phase transitions is the issue.

We recommend using the confd --wait-* options in the patterns mentioned in:

and

Please check these out and see if this will work better for you.
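
To illustrate, a minimal sketch of the wait-based pattern (the daemon names are placeholders; see the confd(1) man page for the exact --wait-* semantics):

  #!/bin/sh
  # Sketch of a wait-based phased startup, assuming the documented
  # confd --wait-phase0 and --wait-started flags.

  confd --start-phase0 --foreground --smp 2 &

  # Block until phase0 is fully initialized, instead of parsing --status output
  confd --wait-phase0

  # Start the backend daemons so they can register their callpoints
  # (verifying that all validation points are registered before phase1
  # is a separate concern, not covered by --wait-*)
  /opt/backend/ifmgr_daemon &
  /opt/backend/macsec_daemon &

  confd --start-phase1
  confd --wait-started 1      # block until start phase1 has been reached

  confd --start-phase2
  confd --wait-started 2      # block until ConfD is fully started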

Best,
Scott

Thanks, @sbarvick.

We never encountered this crash with ConfD 7.8.1, so I’m curious about what has changed. Since the issue is intermittent, I’m a bit skeptical about applying fixes without identifying the root cause.

Also, one important reason for using the confd --status output in our implementation was to check whether any validation points had not been registered by the backend daemons before transitioning to phase1.
Is there any option other than the output of confd --status for checking this during phase0?

Is there any additional logging I can enable to help trace the source of this crash? Any recommendations on debugging this further?

The crash seems to be happening in the code that is collecting the results of the status command. My guess is that something isn’t yet stable during the phase transition process, and you are now more likely to catch it given the other things that have changed between 7.8 and 8.0. You could turn on all of the logging, including the developer log at trace level, to possibly see if there is some indication of one of the processes that now has different timing.
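
For reference, a minimal confd.conf fragment for turning the developer log up to trace (element names assumed from the standard /confdConfig/logs settings; the log file path is only an example) would go under <confdConfig>:

  <logs>
    <developerLog>
      <enabled>true</enabled>
      <file>
        <name>/var/confd/log/devel.log</name>
        <enabled>true</enabled>
      </file>
    </developerLog>
    <developerLogLevel>trace</developerLogLevel>
  </logs>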