Hello,
We are seeing a slow but steady RSS leak in confd.smp on our products under sustained SNMP polling load. The anonymous RSS (RssAnon) keeps climbing while the load is active, and does not recover after the load stops.
We can reproduce this in the lab using a load script that sends SNMP GET requests from multiple pollers for an hour or longer.
We use:
- ConfD 8.0.6
- Linux kernel 5.4.47
- processors: powerpc (32-bit), x86_64, arm64
- glibc 2.25
- gcc >= 6.3
With the setup above, we observe the growth of RssAnon and VmRSS on powerpc (32-bit) and on arm64, but not on x86_64.
Using the same setup but with ConfD 7.2.6 (same kernel, glibc, gcc), the growth does NOT occur.
To reproduce in the lab, we run the SNMP GET load for an hour or longer, then stop the load and leave the system idle for a while.
The following plot shows confd.smp RSS over time, during and after the load. You can see the steady climb, and that not all of the memory is reclaimed after the load stops.
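For anyone who wants to collect comparable numbers, here is a minimal sketch of how VmRSS and RssAnon can be read from /proc/<pid>/status on Linux. This is not our actual tooling; the function names parse_status_kb and sample_pid are made up for illustration.

```python
import re

# Matches lines such as "VmRSS:     123456 kB" in /proc/<pid>/status.
STATUS_LINE = re.compile(r"^(\w+):\s+(\d+)\s+kB$")


def parse_status_kb(status_text, fields=("VmRSS", "RssAnon")):
    """Return {field: value in kB} for the requested fields of a status dump."""
    values = {}
    for line in status_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match and match.group(1) in fields:
            values[match.group(1)] = int(match.group(2))
    return values


def sample_pid(pid):
    """Read one sample from a live process (hypothetical helper, Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_status_kb(handle.read())
```

Calling sample_pid with the confd.smp PID at a fixed interval and logging the result with a timestamp is enough to draw an RSS-over-time curve.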
We would appreciate any pointers from anyone who has seen a similar pattern.
Best Regards,
Ricardo
Can you provide a standalone reproduction that demonstrates the issue?
We can try it in later releases to see if it still behaves the same way.
Hi @fnchooft,
Thanks for picking this up. Some follow-up details on our reproduction
setup, and an update: we have now reproduced the same leak on ConfD
8.0.11 with no other changes to the setup.
ConfD versions reproduced on
- 8.0.6 (original report).
- 8.0.11 – same growth shape, same architecture sensitivity.
Architecture matrix (identical workload everywhere)
- arm64: leak reproduces consistently.
- powerpc: leak reproduces consistently.
- x86_64: no growth. confd.smp RSS plateaus within minutes and stays flat.
Reproduction image
To rule out anything in our product stack, we stripped our integration
image down to “raw ConfD only”:
- Removed every local patch we apply on top of the upstream installer (14
quilt patches in our case), so the daemon is the unmodified upstream
BEAM binary (confd.smp).
- Removed our product YANG / .fxs / init XML / confd_dyncfg.fxs, so
none of our application schemas are loaded.
- The only schema in the loadPath is the upstream
examples.confd/snmpa/1-simple/ example (simple.fxs and
TAIL-F-TEST-MIB.bin), built from the example’s stock sources with
confdc at image build time.
confd.conf (essentials)
Minimal config – just what the daemon needs plus the SNMP agent. No
NETCONF, no CLI, no HA, no rollback, no runtimeReconfiguration, and
candidate datastore disabled:
<confdConfig xmlns="http://tail-f.com/ns/confd_cfg/1.0">
  <aaa>
    <sshServerKeyDir>/etc/ssh</sshServerKeyDir>
  </aaa>
  <cdb>
    <enabled>true</enabled>
    <dbDir>/var/confd/cdb</dbDir>
    <initPath>
      <dir>/etc/confd/init-xml</dir>
    </initPath>
  </cdb>
  <datastores>
    <candidate><enabled>false</enabled></candidate>
  </datastores>
  <stateDir>/var/confd/state</stateDir>
  <snmpAgent>
    <enabled>true</enabled>
    <ip>0.0.0.0</ip>
    <port>161</port>
    <mibs><fromLoadPath>true</fromLoadPath></mibs>
    <snmpEngine><snmpEngineID>80:00:61:81:05:01</snmpEngineID></snmpEngine>
    <system>
      <sysDescr>Tail-f ConfD agent</sysDescr>
      <sysObjectID>1.3.6.1.4.1.24961</sysObjectID>
    </system>
  </snmpAgent>
  <loadPath>
    <dir>/opt/confd/etc/confd</dir>
    <dir>/opt/confd/etc/confd/snmp</dir>
    <dir>/etc/confd</dir>
  </loadPath>
  <snmpLog>
    <enabled>true</enabled>
    <file>
      <enabled>true</enabled>
      <name>@CONFD_LOG_DIR@/confd_snmp.log</name>
    </file>
    <syslog>
      <enabled>true</enabled>
      <facility>local4</facility>
    </syslog>
  </snmpLog>
  <!-- logs: confdLog, developerLog, auditLog, netconfLog -->
</confdConfig>
The example’s simple_init.xml, community_init.xml and vacm_init.xml
are shipped under /etc/confd/init-xml/ so the CDB picks them up on
first boot (two hosts, two servers, two services under
TAIL-F-TEST-MIB).
Load
A small Python harness drives net-snmp’s snmpget in parallel. The exact
parameters for the runs in the graph below:
- Per-PDU invocation (one snmpget subprocess per PDU, Zabbix-style):
  snmpget -v 2c -c public -t 5 -r 0 -On <host> <oid>
  That is SNMPv2c, community public, single OID per PDU, 5 s timeout,
  no retries.
- 15 parallel worker threads, fed by a queue.
- Back-to-back scheduling: the scheduler enqueues one full pass
  through the OID pool, blocks until all workers have drained the queue,
  then enqueues the next pass, so the daemon sees sustained
  GetRequest traffic with no idle gap and no Zabbix-style poll-interval
  pacing (this corresponds to POLLING_INTERVAL_SEC=0 in our harness).
- OID pool: walked once at startup against the populated subtree of
  TAIL-F-TEST-MIB, then held constant. Concretely:
  - the 5 scalars (numberOfServers.0, numberOfHosts.0,
    maxNumberOfServers.0, maxNumberOfHosts.0, extraDescr.0),
  - the columns of hostTable, serverTable, serviceTable for every
    row created by simple_init.xml (~20 OIDs total for the canned
    two-host / two-server / two-service data).
- Sampling: every 30 s the harness also captures confd.smp VmRSS,
  MemAvailable, slab, and the top-3 mappings from
  /proc/<pid>/smaps, plus per-PDU latency percentiles (p50/p95/p99)
  and the running get_ok / get_err / get_timeouts counters. That
  is the data feeding the graph below.
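To make the load pattern easier to replicate, here is a minimal sketch of the worker/queue scheduling described above. It is not our actual harness: the names snmpget, run_passes, and the get parameter are illustrative, and classifying a failure as a timeout by looking for "Timeout" in stderr is an assumption about net-snmp's output.

```python
import queue
import subprocess
import threading
from collections import Counter


def snmpget(host, oid, community="public", timeout_s=5):
    """Run one snmpget subprocess for a single OID; return 'ok', 'err' or 'timeout'."""
    cmd = ["snmpget", "-v", "2c", "-c", community, "-t", str(timeout_s),
           "-r", "0", "-On", host, oid]
    try:
        # The outer timeout is only a safety net above snmpget's own -t value.
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s + 2)
    except subprocess.TimeoutExpired:
        return "timeout"
    if done.returncode == 0:
        return "ok"
    # Assumption: net-snmp reports an unanswered request with "Timeout" on stderr.
    return "timeout" if "Timeout" in done.stderr else "err"


def run_passes(host, oids, passes, workers=15, get=snmpget):
    """Send `passes` back-to-back passes over `oids` using `workers` threads."""
    counters = Counter()
    lock = threading.Lock()
    work = queue.Queue()

    def worker():
        while True:
            oid = work.get()
            try:
                if oid is None:  # sentinel: shut this worker down
                    return
                try:
                    result = get(host, oid)
                except Exception:  # e.g. snmpget binary not installed
                    result = "err"
                with lock:
                    counters[result] += 1
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for _ in range(passes):
        for oid in oids:
            work.put(oid)
        work.join()  # block until the pass is drained, then enqueue the next one
    for _ in threads:
        work.put(None)
    for thread in threads:
        thread.join()
    return counters
```

A call such as run_passes("<host>", oid_pool, passes=1000) approximates the no-idle-gap schedule; the get parameter exists only so the scheduling can be exercised without a live agent.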
What we see
confd.smp RSS grows monotonically – no plateau in our longest runs.
- The growth is mirrored in /proc/<pid>/smaps: the top-3 mappings
(BEAM carriers + process heap) all grow together.
- System MemAvailable decreases at the matching rate.
- Stopping the SNMP load freezes RSS but the memory is never reclaimed.
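In case it helps with comparing results, here is a minimal sketch of how the largest mappings can be extracted from /proc/<pid>/smaps. It is an illustration rather than our exact sampler; top_mappings is a made-up name, and unnamed mappings are labelled "[anon]" by this sketch, not by the kernel.

```python
import re

# Mapping header: "start-end perms offset dev inode [name]".
HEADER = re.compile(r"^([0-9a-f]+-[0-9a-f]+)\s+\S+\s+\S+\s+\S+\s+\S+\s*(.*)$")
# Per-mapping resident size line: "Rss:    512 kB".
RSS = re.compile(r"^Rss:\s+(\d+)\s+kB$")


def top_mappings(smaps_text, count=3):
    """Return the `count` largest mappings as (rss_kb, address_range, name)."""
    mappings = []
    current = None  # (address range, name) of the mapping being parsed
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = (header.group(1), header.group(2).strip() or "[anon]")
            continue
        rss = RSS.match(line)
        if rss and current:
            mappings.append((int(rss.group(1)), *current))
    return sorted(mappings, reverse=True)[:count]
```

Feeding it the contents of /proc/<pid>/smaps at each sampling interval shows whether the same address ranges keep growing or new mappings keep appearing.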
Happy to share more info if necessary.
Thanks,
Ittalo
Good Morning,
I have tested this scenario on an arm64 machine:
Kernel version: 6.12.87+rpt-rpi-v8
Full kernel info: Linux raspberrypi 6.12.87+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.12.87-1+rpt1~bookworm (2026-05-12) aarch64 GNU/Linux
Architecture: aarch64
glibc version: glibc 2.36
GCC version: gcc (Debian 12.2.0-14+deb12u1) 12.2.0
OS release: Debian GNU/Linux 12 (bookworm)
I could not reproduce the behavior. I tested versions 8.0.21, 8.4 and 8.7.
None of these versions shows unbounded RSS growth with that kernel and glibc version.
Kind regards,
Fabian
Thanks for getting back to me, @fnchooft!
Just a few questions.
- Have you reproduced the issue with version 8.0.11 or earlier? Could you share the results, please?
- Are there any fixes or improvements documented in the release notes for versions between 8.0.11 and 8.0.21?
- Have you observed any ConfD configurations or usage patterns that make the leak more likely to happen? Something you recommend not using?
- Is there a chance we can get the 8.0.21 version temporarily for a test?
Thanks in advance,
Best regards,
Ittalo
Good afternoon, I will try to obtain 8.0.11 and re-run the tests on my target.
This might take a few days.
Kind regards,
Fabian
Good morning,
Thank you for running the tests. Were you able to reproduce the issue with 8.0.11 or earlier?
Thanks,
Ittalo