Netdata: Real-time performance monitoring, done right! https://www.netdata.cloud



Netdata is distributed, real-time performance and health monitoring for systems and applications. It is a highly-optimized monitoring agent you install on all your systems and containers.

Netdata provides unparalleled insights, in real-time, of everything happening on the systems it's running on (including web servers, databases, applications), using highly interactive web dashboards.

A highly-efficient database stores long-term historical metrics for days, weeks, or months, all at 1-second granularity. Run this long-term storage autonomously, or integrate Netdata with your existing monitoring toolchains (Prometheus, Graphite, OpenTSDB, Kafka, Grafana, and more).

Netdata is fast and efficient, designed to permanently run on all systems (physical and virtual servers, containers, IoT devices), without disrupting their core function.

Netdata is free, open-source software. It currently runs on Linux, FreeBSD, and macOS, along with systems derived from them, and in containerized environments such as Kubernetes and Docker.

Netdata is not hosted by the CNCF but is the fourth most starred open-source project in the Cloud Native Computing Foundation (CNCF) landscape.


People get addicted to Netdata. Once you use it on your systems, there is no going back! You've been warned...

image

Tweet about Netdata!

Contents

  1. What does it look like? - Take a quick tour through the dashboard
  2. Our userbase - Enterprises we help monitor and the size of our user base
  3. Quickstart - How to try it now on your systems
  4. Why Netdata - Why people love Netdata and how it compares with other solutions
  5. News - The latest news about Netdata
  6. How Netdata works - A high-level diagram of how Netdata works
  7. Infographic - Everything about Netdata in a single graphic
  8. Features - How you'll use Netdata on your systems
  9. Visualization - Learn about visual anomaly detection
  10. What Netdata monitors - See which apps/services Netdata auto-detects
  11. Documentation - Read the documentation
  12. Community - Discuss Netdata with others and get support
  13. License - Check Netdata's licensing
  14. Is it any good? - Yes.
  15. Is it awesome? - Yes.

What does it look like?

The following animated GIF shows the top part of a typical Netdata dashboard.

The Netdata dashboard in action

A typical Netdata dashboard, in 1:1 timing. Charts can be panned by dragging them, zoomed in/out with SHIFT + mouse wheel, an area can be selected for zoom-in with SHIFT + mouse selection. Netdata is highly interactive, real-time, and optimized to get the work done!

Want to try Netdata before you install? See our live demo.

User base

Netdata is used by hundreds of thousands of users all over the world. Check our GitHub watchers list. You will find people working for Amazon, Atos, Baidu, Cisco Systems, Citrix, Deutsche Telekom, DigitalOcean, Elastic, EPAM Systems, Ericsson, Google, Groupon, Hortonworks, HP, Huawei, IBM, Microsoft, NewRelic, Nvidia, Red Hat, SAP, Selectel, TicketMaster, Vimeo, and many more!

Docker pulls

We provide Docker images for the most common architectures. These are statistics reported by Docker Hub:

netdata/netdata (official) firehol/netdata (deprecated) titpetric/netdata (donated)

Registry

When you install multiple Netdata agents, they are integrated into one distributed application via a Netdata registry. The registry is a web browser feature, and it allows us to count the number of unique users and unique Netdata servers installed. The following information comes from the global public Netdata registry we run:
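You can also run your own registry instead of the public one. A minimal sketch in netdata.conf (the URL is a placeholder): enable the registry on one agent, and point the `registry to announce` option of all your other agents (with `enabled = no`) at the same URL.

[registry]
    enabled = yes
    registry to announce = http://registry.example.com:19999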

User Base Monitored Servers Sessions Served

In the last 24 hours:
New Users Today New Machines Today Sessions Today

Quickstart

To install Netdata from source on any Linux system (physical, virtual, container, IoT, edge) and keep it up to date with our nightly releases automatically, run the following:

# make sure you run `bash` for your shell
bash

# install Netdata directly from GitHub source
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

Starting with v1.12, Netdata collects anonymous usage information by default and sends it to Google Analytics. Read about the information collected, and learn how to opt out, on our anonymous statistics page.

The usage statistics are vital for us, as we use them to discover bugs and prioritize new features. We thank you for actively contributing to Netdata's future.

To learn more about the pros and cons of using nightly vs. stable releases, see our notice about the two options.

The above command will:

  • Install any required packages on your system (it will ask you to confirm before doing so)
  • Compile it, install it, and start it.

More installation methods and additional options can be found at the installation page.

To try Netdata in a Docker container, run this:

docker run -d --name=netdata \
  -p 19999:19999 \
  -v netdatalib:/var/lib/netdata \
  -v netdatacache:/var/cache/netdata \
  -v /etc/passwd:/host/etc/passwd:ro \
  -v /etc/group:/host/etc/group:ro \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /etc/os-release:/host/etc/os-release:ro \
  --restart unless-stopped \
  --cap-add SYS_PTRACE \
  --security-opt apparmor=unconfined \
  netdata/netdata

For more information about running Netdata in Docker, check the docker installation page.

image


Why Netdata

Netdata takes quite a different approach to monitoring.

Netdata is a monitoring agent you install on all your systems. It is:

  • A metrics collector for system and application metrics (including web servers, databases, containers, and much more),
  • A long-term metrics database that stores recent metrics in memory and "spills" historical metrics to disk for efficient long-term storage,
  • A super fast, interactive, and modern metrics visualizer optimized for anomaly detection,
  • And an alarms notification engine for detecting performance and availability issues.

All of the above is packaged together in a very flexible, extremely modular, distributed application.

This is how Netdata compares to other monitoring solutions:

| Netdata | Others (open-source and commercial) |
| :-- | :-- |
| High-resolution metrics (1s granularity) | Low-resolution metrics (10s granularity at best) |
| Monitors everything (thousands of metrics per node) | Monitor just a few metrics |
| UI is super fast, optimized for anomaly detection | UI is good for just an abstract view |
| Long-term, autonomous storage at one-second granularity | Centralized metrics in an expensive data lake at 10s granularity |
| Meaningful presentation, to help you understand the metrics | You have to know the metrics before you start |
| Install and get results immediately | Long preparation is required to get any useful results |
| Use it for troubleshooting performance problems | Use them to get statistics of past performance |
| Kills the console for tracing performance issues | The console is always required for troubleshooting |
| Requires zero dedicated resources | Require large dedicated resources |

Netdata is open-source, free, super fast, very easy to use, completely open, extremely efficient, flexible, and easy to integrate.

It has been designed by system administrators, DevOps engineers, and developers not just to visualize metrics, but also to troubleshoot complex performance problems.

News

May 11, 2020 - Netdata v1.22.0 released!

Release v1.22.0 marks the official launch of our rearchitected Netdata Cloud! This Agent release contains both backend and interface changes necessary to connect your distributed nodes to this dramatically improved experience.

Netdata Cloud builds on top of our open source monitoring Agent to give you real-time visibility for your entire infrastructure. Once you've connected your Agents to Cloud, you can view key metrics, insightful charts, and active alarms from all your nodes in a single web interface. When an anomaly strikes, seamlessly navigate to any node to troubleshoot and discover the root cause with the familiar Netdata dashboard.

Animated GIF of Netdata Cloud

Sign in to Cloud and read our Get started with Cloud guide for details on updating your nodes, claiming them, and navigating the new Cloud.

While Netdata Cloud offers a centralized method of monitoring your Agents, your metrics data is not stored or centralized in any way. Metrics data remains with your nodes and is only streamed to your browser through Cloud.

In addition, Cloud only expands on the functionality of the wildly popular free and open source Agent. We will never make any of our open source Agent features Cloud-exclusive, and we will actively continue to develop the Agent so that we can integrate new features with Netdata Cloud.

We added a new collector called whoisquery that helps you monitor a domain name's expiration date. You can track as many domains as you'd like, and set custom warning and critical thresholds for each. For more information on setup and configuration, see the Whois domain expiry monitoring documentation.

We added a new connector to our experimental exporting engine: Prometheus remote write. You can use this connector to send Netdata metrics to your choice of more than 20 external storage providers for long-term archiving and further analysis.

Our new documentation experience is now available at Netdata Learn! We encourage you to try it out and give us feedback or ask questions in our GitHub issues. Learn features documentation for both the Agent and Cloud in separate-but-connected vaults, which streamlines the experience of learning about both products.

While Learn only features documentation for now, we plan on releasing more types of educational content serving the Agent's open-source community of developers, sysadmins, and DevOps folks. We'll have more to announce soon, but in the meantime, we hope you enjoy what we believe is a smoother (and prettier) docs experience.

As part of the ongoing work to polish our eBPF collector tech preview, we've now proven the collector's performance is very good, and have vastly expanded the number of operating system versions the collector works on. Learn how to enable it in our documentation. We've also extensively stress-tested the eBPF collector and found that it's impressively fast given the depth of metrics it collects! Read up on our benchmarking analysis on GitHub.


See more news and previous releases at our blog or our releases page.

How it works

Netdata is a highly efficient, highly modular, metrics management engine. Its lockless design makes it ideal for concurrent operations on the metrics.

image

This is how it works:

| Function | Description | Documentation |
| :-: | :-- | :-: |
| Collect | Multiple independent data collection workers collect metrics from their sources, using the optimal protocol for each application, and push the metrics to the database. Each data collection worker has lockless write access to the metrics it collects. | collectors |
| Store | Metrics are first stored in RAM in a custom database engine that then "spills" historical metrics to disk for efficient long-term storage. | database |
| Check | A lockless independent watchdog evaluates health checks on the collected metrics, triggers alarms, maintains a health transaction log, and dispatches alarm notifications. | health |
| Stream | A lockless independent worker streams metrics, in full detail and in real time, to remote Netdata servers, as soon as they are collected. | streaming |
| Archive | A lockless independent worker down-samples the metrics and pushes them to backend time-series databases. | exporting |
| Query | Multiple independent workers are attached to the internal web server, servicing API requests, including data queries. | web/api |

The result is a highly efficient, low-latency system, supporting multiple readers and one writer on each metric.
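The Store step above is handled by the database engine. As a minimal sketch (the sizes are illustrative placeholders, not recommendations), it is configured in netdata.conf:

[global]
    memory mode = dbengine
    # MiB of RAM used to cache the most recent metrics
    page cache size = 32
    # MiB of disk used for long-term metric storage
    dbengine disk space = 256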

Infographic

This is a high-level overview of the Netdata feature set and architecture. Click it to interact with it (it has direct links to our documentation).

image

Features

finger-video

This is what you should expect from Netdata:

General

  • 1s granularity - The highest possible resolution for all metrics.
  • Unlimited metrics - Netdata collects all the available metrics—the more, the better.
  • 1% CPU utilization of a single core - It's unbelievably optimized.
  • A few MB of RAM - The highly-efficient database engine stores per-second metrics in RAM and then "spills" historical metrics to disk for long-term storage.
  • Minimal disk I/O - While running, Netdata only writes historical metrics and reads error and access logs.
  • Zero configuration - Netdata auto-detects everything, and can collect up to 10,000 metrics per server out of the box.
  • Zero maintenance - You just run it. Netdata does the rest.
  • Zero dependencies - Netdata runs a custom web server for its static web files and its web API (though its plugins may require additional libraries, depending on the applications monitored).
  • Scales to infinity - You can install it on all your servers, containers, VMs, and IoT devices. Metrics are not centralized by default, so there is no limit.
  • Several operating modes - Autonomous host monitoring (the default), headless data collector, forwarding proxy, store and forward proxy, central multi-host monitoring, in all possible configurations. Each node may have different metrics retention policies and run with or without health monitoring.
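As a sketch of the headless collector mode (the IP address and API key are placeholders), a child's stream.conf points at a parent:

[stream]
    enabled = yes
    destination = 203.0.113.10:19999
    api key = 11111111-2222-3333-4444-555555555555

On the parent, a stream.conf section named after the same API key accepts the connection:

[11111111-2222-3333-4444-555555555555]
    enabled = yes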

Health Monitoring & Alarms

Integrations

  • Time-series databases - Netdata can archive its metrics to Graphite, OpenTSDB, Prometheus, AWS Kinesis, MongoDB, JSON document DBs, in the same or lower resolution (lower: to prevent it from congesting these servers due to the amount of data collected). Netdata also supports Prometheus remote write API, which allows storing metrics to Elasticsearch, Gnocchi, InfluxDB, Kafka, PostgreSQL/TimescaleDB, Splunk, VictoriaMetrics and a lot of other storage providers.
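Archiving is configured in the [backend] section of netdata.conf. A minimal sketch for Graphite (the destination is a placeholder), sending 10-second averages:

[backend]
    enabled = yes
    type = graphite
    destination = localhost:2003
    data source = average
    update every = 10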

Visualization

  • Stunning interactive dashboards - Our dashboard is mouse-, touchpad-, and touch-screen friendly in 2 themes: slate (dark) and white.
  • Amazingly fast visualization - Even on low-end hardware, the dashboard responds to all queries in less than 1 ms per metric.
  • Visual anomaly detection - Our UI/UX emphasizes the relationships between charts so you can better detect anomalies visually.
  • Embeddable - Charts can be embedded on your web pages, wikis and blogs. You can even use Atlassian's Confluence as a monitoring dashboard.
  • Customizable - You can build custom dashboards using simple HTML. No JavaScript needed!
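As a sketch of such a custom dashboard (the host and chart names are placeholders): include dashboard.js from any Netdata agent, and every div with a data-netdata attribute becomes a live chart, with no JavaScript of your own:

<!DOCTYPE html>
<html>
<head>
    <script type="text/javascript" src="http://my.netdata.host:19999/dashboard.js"></script>
</head>
<body>
    <!-- dashboard.js renders a live chart into this div -->
    <div data-netdata="system.cpu"
         data-chart-library="dygraph"
         data-after="-300"
         data-height="200px"></div>
</body>
</html>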

Positive and negative values

To improve clarity on charts, Netdata dashboards present positive values for metrics representing read, input, inbound, received and negative values for metrics representing write, output, outbound, sent.

Screenshot showing positive and negative values

Netdata charts showing the bandwidth and packets of a network interface. received is positive and sent is negative.

Autoscaled y-axis

Netdata charts automatically zoom vertically, to visualize the variation of each metric within the visible time-frame.

Animated GIF showing the auto-scaling Y axis

A zero-based stacked chart automatically switches to an auto-scaled area chart when a single dimension is selected.

Charts are synchronized

Charts on Netdata dashboards are synchronized to each other. There is no master chart. Any chart can be panned or zoomed at any time, and all other charts will follow.

Animated GIF of the standard Netdata dashboard being manipulated and synchronizing charts

Charts are panned by dragging them with the mouse. Charts can be zoomed in/out with SHIFT + mouse wheel while the mouse pointer is over a chart.

Highlighted time-frame

To improve visual anomaly detection across charts, the user can highlight a time-frame (by pressing Alt + mouse selection) on all charts.

An animated GIF of highlighting a specific timeframe

A highlighted time-frame can be given by pressing Alt + mouse selection on any chart. Netdata will highlight the same range on all charts.

What Netdata monitors

Netdata can collect metrics from 200+ popular services and applications, on top of dozens of system-related metrics, such as CPU, memory, disks, filesystems, networking, and more. We call these collectors, and they're managed by plugins, which support a variety of programming languages, including Go and Python.

Popular collectors include Nginx, Apache, MySQL, statsd, cgroups (containers, Docker, Kubernetes, LXC, and more), Traefik, web server access.log files, and much more.

See the full list of supported collectors.

Netdata's data collection is extensible, which means you can monitor anything you can get a metric for. You can even write a collector for your custom application using our plugin API.
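A collector is any program that prints the plugins.d protocol on its standard output. As an illustrative sketch (the chart, dimension, and values are hypothetical), a collector defines its chart once and then repeats a BEGIN/SET/END cycle on every update:

CHART myapp.requests '' 'Requests served' 'requests/s' myapp myapp.requests line 100000 1
DIMENSION served '' incremental 1 1
BEGIN myapp.requests
SET served = 1234
END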

Documentation

The Netdata documentation is at https://docs.netdata.cloud, but you can also find each page inside of Netdata's repository itself in Markdown (.md) files. You can find all our documentation by navigating the repository.

Here is a quick list of notable documents:

| Directory | Description |
| :-- | :-- |
| installer | Instructions to install Netdata on your systems. |
| docker | Instructions to install Netdata using Docker. |
| daemon | Information about the Netdata daemon and its configuration. |
| collectors | Information about data collection plugins. |
| health | How Netdata's health monitoring works, how to create your own alarms, and how to configure alarm notification methods. |
| streaming | How to build hierarchies of Netdata servers by streaming metrics between them. |
| exporting | Long-term archiving of metrics to industry-standard time-series databases, like Prometheus, Graphite, and OpenTSDB. |
| web/api | Learn how to query the Netdata API and the queries it supports. |
| web/api/badges | Learn how to generate badges (SVG images) from live data. |
| web/gui/custom | Learn how to create custom Netdata dashboards. |
| web/gui/confluence | Learn how to create Netdata dashboards on Atlassian's Confluence. |

You can also check all the other directories. Most of them have plenty of documentation.

Community

We welcome contributions. Feel free to join the team!

To report bugs or get help, use GitHub's issues.

You can also find Netdata on:

License

Netdata is GPLv3+.

Netdata re-distributes other open-source tools and libraries. Please check the third party licenses.

Is it any good?

Yes.

When people first hear about a new product, they frequently ask if it is any good. A Hacker News user remarked:

Note to self: Starting immediately, all raganwald projects will have a “Is it any good?” section in the readme, and the answer shall be “yes.”

So, we follow the tradition...

Is it awesome?

These people seem to like it.

Comments

  • Add install type info to `-W buildinfo` output.

    Jan 19, 2022

    Summary

    By reading it from the .install-type file and presenting it properly.

    Test Plan

    Tested locally.

    area/daemon 
  • [Bug]: mongodb collector - replicaset states metrics missing

    Jan 19, 2022

    Bug description

    Hello, we use the mongodb collector (python plugin) for monitoring three servers that are running a replicaset. Most metrics display fine, but the replicaset states (e.g. PRIMARY, SECONDARY, RECOVERING) are not displayed on the Netdata GUI and API.

    image

    We tried to run the plugin in debug mode with this command: /usr/libexec/netdata/plugins.d/python.d.plugin -ppython3 mongodb debug trace

    BEGIN mongodb_local.read_operations 1000000
    SET 'query' = 2
    SET 'getmore' = 0
    END
    
    BEGIN mongodb_local.write_operations 1000000
    SET 'insert' = 0
    SET 'update' = 0
    SET 'delete' = 0
    END
    
    BEGIN mongodb_local.active_clients 1000000
    SET 'activeClients_readers' = 0
    SET 'activeClients_writers' = 0
    END
    
    BEGIN mongodb_local.wiredtiger_read 1000000
    SET 'wiredTigerRead_available' = 127
    SET 'wiredTigerRead_out' = 1
    END
    
    BEGIN mongodb_local.wiredtiger_write 1000000
    SET 'wiredTigerWrite_available' = 128
    SET 'wiredTigerWrite_out' = 0
    END
    
    BEGIN mongodb_local.cursors 1000000
    SET 'cursor_total' = 0
    SET 'noTimeout' = 0
    SET 'timedOut' = 0
    END
    
    BEGIN mongodb_local.connections 1000000
    SET 'connections_available' = 51135
    SET 'connections_current' = 65
    END
    
    BEGIN mongodb_local.memory 1000000
    SET 'virtual' = 1773
    SET 'resident' = 112
    END
    
    BEGIN mongodb_local.page_faults 1000000
    SET 'page_faults' = 176
    END
    
    BEGIN mongodb_local.queued_requests 1000000
    SET 'currentQueue_readers' = 0
    SET 'currentQueue_writers' = 0
    END
    
    BEGIN mongodb_local.record_moves 1000000
    SET 'moves' = 0
    END
    
    BEGIN mongodb_local.wiredtiger_cache 1000000
    SET 'wiredTiger_percent_clean' = 1
    SET 'wiredTiger_percent_dirty' = 0
    END
    
    BEGIN mongodb_local.wiredtiger_pages_evicted 1000000
    SET 'unmodified' = 963
    SET 'modified' = 0
    END
    
    BEGIN mongodb_local.asserts 1000000
    SET 'msg' = 0
    SET 'warning' = 0
    SET 'regular' = 0
    SET 'user' = 25
    END
    
    BEGIN mongodb_local.locks_collection 1000000
    SET 'Collection_W' = 1
    SET 'Collection_r' = 186274
    SET 'Collection_w' = 1
    END
    
    BEGIN mongodb_local.locks_database 1000000
    SET 'Database_W' = 8
    SET 'Database_r' = 459468
    SET 'Database_w' = 3
    END
    
    BEGIN mongodb_local.locks_global 1000000
    SET 'Global_W' = 3
    SET 'Global_r' = 640645
    SET 'Global_w' = 12
    END
    
    BEGIN mongodb_local.locks_oplog 1000000
    SET 'oplog_r' = 273230
    SET 'oplog_w' = 1
    END
    
    BEGIN mongodb_local.tcmalloc_generic 1000000
    SET 'current_allocated_bytes' = 110084120
    SET 'heap_size' = 126111744
    END
    
    BEGIN mongodb_local.tcmalloc_metrics 1000000
    SET 'central_cache_free_bytes' = 537160
    SET 'current_total_thread_cache_bytes' = 2819808
    SET 'pageheap_free_bytes' = 3796992
    SET 'pageheap_unmapped_bytes' = 5722112
    SET 'thread_cache_free_bytes' = 2819808
    SET 'transfer_cache_free_bytes' = 3151552
    END
    
    BEGIN mongodb_local.command_total_rate 1000000
    SET 'count_total' = 0
    SET 'createIndexes_total' = 0
    SET 'delete_total' = 0
    SET 'findAndModify_total' = 0
    SET 'insert_total' = 0
    END
    
    BEGIN mongodb_local.command_failed_rate 1000000
    SET 'count_failed' = 0
    SET 'createIndexes_failed' = 0
    SET 'delete_failed' = 0
    SET 'findAndModify_failed' = 0
    SET 'insert_failed' = 0
    END
    
    BEGIN mongodb_local.heartbeat_delay 1000000
    SET '172.16.4.162:27017_heartbeat_lag' = 578
    SET '172.16.4.163:27017_heartbeat_lag' = 669
    END
    
    BEGIN mongodb_local.optimedate_delay 1000000
    SET '172.16.4.161:27017_optimedate' = 14631728369
    SET '172.16.4.162:27017_optimedate' = 369
    SET '172.16.4.163:27017_optimedate' = 369
    END
    
    BEGIN mongodb_local.172.16.4.161:27017_state 1000000
    SET '172.16.4.161:27017_state_1' = 0
    SET '172.16.4.161:27017_state_8' = 0
    SET '172.16.4.161:27017_state_2' = 0
    SET '172.16.4.161:27017_state_3' = 3
    SET '172.16.4.161:27017_state_5' = 0
    SET '172.16.4.161:27017_state_4' = 0
    SET '172.16.4.161:27017_state_7' = 0
    SET '172.16.4.161:27017_state_6' = 0
    SET '172.16.4.161:27017_state_9' = 0
    SET '172.16.4.161:27017_state_10' = 0
    SET '172.16.4.161:27017_state_0' = 0
    END
    
    BEGIN mongodb_local.172.16.4.162:27017_state 1000000
    SET '172.16.4.162:27017_state_1' = 1
    SET '172.16.4.162:27017_state_8' = 0
    SET '172.16.4.162:27017_state_2' = 0
    SET '172.16.4.162:27017_state_3' = 0
    SET '172.16.4.162:27017_state_5' = 0
    SET '172.16.4.162:27017_state_4' = 0
    SET '172.16.4.162:27017_state_7' = 0
    SET '172.16.4.162:27017_state_6' = 0
    SET '172.16.4.162:27017_state_9' = 0
    SET '172.16.4.162:27017_state_10' = 0
    SET '172.16.4.162:27017_state_0' = 0
    END
    
    BEGIN mongodb_local.172.16.4.163:27017_state 1000000
    SET '172.16.4.163:27017_state_1' = 0
    SET '172.16.4.163:27017_state_8' = 0
    SET '172.16.4.163:27017_state_2' = 2
    SET '172.16.4.163:27017_state_3' = 0
    SET '172.16.4.163:27017_state_5' = 0
    SET '172.16.4.163:27017_state_4' = 0
    SET '172.16.4.163:27017_state_7' = 0
    SET '172.16.4.163:27017_state_6' = 0
    SET '172.16.4.163:27017_state_9' = 0
    SET '172.16.4.163:27017_state_10' = 0
    SET '172.16.4.163:27017_state_0' = 0
    END
    
    BEGIN netdata.runtime_mongodb_local 1000000
    SET run_time = 1
    END
    

    We can clearly see the states of each node are reported correctly at the end of the output. However, none of these metrics are displayed on Netdata GUI / API.

    Versions:

    • Netdata v1.32.1
    • Python 3.6.8
    • pymongo 3.6.1

    Any help on how to troubleshoot this would be appreciated.

    Expected behavior

    replicaset state metrics should be available on GUI and API according to documentation: https://learn.netdata.cloud/docs/agent/collectors/python.d.plugin/mongodb

    Steps to reproduce

    We are using default settings here. The only thing we changed while troubleshooting this issue is

    [plugin:python.d]
    	# update every = 1
    	command options = -ppython3
    

    on /etc/netdata/netdata.conf

    Installation method

    other

    System info

    Linux sn-a001.dc1.server.mila.quebec 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Mon Jan 18 17:05:51 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
    /etc/centos-release:CentOS Linux release 8.3.2011
    /etc/os-release:NAME="CentOS Linux"
    /etc/os-release:VERSION="8"
    /etc/os-release:ID="centos"
    /etc/os-release:ID_LIKE="rhel fedora"
    /etc/os-release:VERSION_ID="8"
    /etc/os-release:PLATFORM_ID="platform:el8"
    /etc/os-release:PRETTY_NAME="CentOS Linux 8"
    /etc/os-release:ANSI_COLOR="0;31"
    /etc/os-release:CPE_NAME="cpe:/o:centos:centos:8"
    /etc/os-release:CENTOS_MANTISBT_PROJECT="CentOS-8"
    /etc/os-release:CENTOS_MANTISBT_PROJECT_VERSION="8"
    /etc/redhat-release:CentOS Linux release 8.3.2011
    /etc/system-release:CentOS Linux release 8.3.2011
    

    Netdata build info

    Version: netdata v1.32.1
    Configure options:  '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--datadir=/usr/share' '--includedir=/usr/include' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-bundled-libJudy' '--with-bundled-lws' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection' 'LDFLAGS=-Wl,-z,relro  -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld' 'CXXFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection' 'PKG_CONFIG_PATH=:/usr/lib/pkgconfig:/usr/share/pkgconfig'
    Features:
        dbengine:                   YES
        Native HTTPS:               YES
        Netdata Cloud:              YES 
        ACLK Next Generation:       YES
        ACLK-NG New Cloud Protocol: YES
        ACLK Legacy:                YES
        TLS Host Verification:      YES
        Machine Learning:           YES
    Libraries:
        protobuf:                YES (system)
        jemalloc:                NO
        JSON-C:                  YES
        libcap:                  NO
        libcrypto:               YES
        libm:                    YES
        LWS:                     YES static v3.2.2
        mosquitto:               YES
        tcalloc:                 NO
        zlib:                    YES
    Plugins:
        apps:                    YES
        cgroup Network Tracking: YES
        CUPS:                    YES
        EBPF:                    YES
        IPMI:                    YES
        NFACCT:                  NO
        perf:                    YES
        slabinfo:                YES
        Xen:                     NO
        Xen VBD Error Tracking:  NO
    Exporters:
        AWS Kinesis:             NO
        GCP PubSub:              NO
        MongoDB:                 NO
        Prometheus Remote Write: YES
    

    Additional info

    We use yum to install the package

    ~# yum list netdata
    Last metadata expiration check: 1:59:42 ago on Wed 19 Jan 2022 03:27:24 PM EST.
    Installed Packages
    netdata.x86_64                                           1.32.1-1.el8                                           @netdata
    
    bug area/external/python needs triage 
  • Libqueue

    Jan 20, 2022

    Summary

    Queue is a thread-safe library for handling queued items as independent objects. It was implemented to meet a requirement of the replication module, supports multiple producers and consumers, and can also be used by other modules.

    The library includes the functions below:

        Queue* initqueue(int max)
        void freequeue(Queue* q)
        void enqueue(void* q, void* item)
        void* dequeue(void* q)
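
    A minimal usage sketch against the signatures above (the producer/consumer pair is hypothetical, and that dequeue blocks on an empty queue is an assumption implied by the multi-producer/consumer design):

        #include <stdio.h>
        #include <pthread.h>

        typedef struct Queue Queue;            /* opaque type provided by the library */
        Queue* initqueue(int max);
        void freequeue(Queue* q);
        void enqueue(void* q, void* item);
        void* dequeue(void* q);

        static Queue* q;
        static int items[4] = {1, 2, 3, 4};

        static void* producer(void* arg) {
            (void)arg;
            /* thread-safe: producers need no external locking */
            for (int i = 0; i < 4; i++)
                enqueue(q, &items[i]);
            return NULL;
        }

        static void* consumer(void* arg) {
            (void)arg;
            for (int i = 0; i < 4; i++)
                printf("dequeued %d\n", *(int*)dequeue(q));
            return NULL;
        }

        int main(void) {
            q = initqueue(16);                 /* 'max' assumed to be the queue capacity */
            pthread_t p, c;
            pthread_create(&p, NULL, producer, NULL);
            pthread_create(&c, NULL, consumer, NULL);
            pthread_join(p, NULL);
            pthread_join(c, NULL);
            freequeue(q);
            return 0;
        }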

    Test Plan

    The file attached can be compiled and run by gcc -o main main.c -lpthread & ./main commands (Change file extension as .c)

    Additional Information

    main.txt

    area/docs area/build 
  • minor - remove ACLK_NEWARCH_DEVMODE

    Jan 20, 2022

    Summary

    Removes ACLK_NEWARCH_DEVMODE to reduce the number of ifdefs. It is not needed anymore (it was used in the initial new-arch protocol development stages).

    Test Plan

    The code should not change any behavior when ACLK_NEWARCH_DEVMODE is false.

    Additional Information
    ACLK 
  • [Bug]: Stream compression - Compressor buffer overflow causes a stream corruption (limit 16834 bytes)

    Jan 20, 2022

    Bug description

    A compressor buffer overflow occurs when a message under compression exceeds 16834 bytes. When the compressor's buffer is full, the compressor function reports an error and simply skips transmitting the data. The effect of this untransmitted data on the parent depends on the importance of the information that was lost.

    1. In the case of the bug in the production environment, the parent's error.log reported the following error message: STREAM_RECEIVER[gke-production-main-xxxx-xxxx, [0.0.0.0]:0000] : requested a BEGIN on chart 'k8s_kubelet.kubelet_pods_log_filesystem_used_bytes', which does not exist on host 'gke-production-main-xxxx-xxxx'. Disabling it. (errno 22, Invalid argument). This means that the parent received a BEGIN command for an 'unknown' chart, causing a parser error that results in a reconnection. The problem here is that the chart definition was probably never streamed to the parent, because the k8s_kubelet.kubelet_pods_log_filesystem_used_bytes chart and its dimensions (12) seemed to exceed the 16kB limit.

    2. Trying to reproduce the same behavior with a go.d example plugin and one chart with 1000 dimensions, the compressor buffer overflow occurred and the stream corruption was detected through continuous reporting of the following error message:

    Compression error - data discarded
    Message size above limit:
    

    Credits to @stelfrag and @MrZammler for reporting and helping to identify this issue.

    Expected behavior

    Definitely don't corrupt the stream between parent <-> child. Possible solutions include:

    1. Maintain the stream between parent <-> child and downgrade to protocol version 4.
    2. Increase the compressor buffer size to make stream compression more robust.
    3. Split the messages into smaller blocks that fit the compressor's buffer.

    Steps to reproduce

    1. Set-up a simple parent <-> child connection with the master branch.
    2. Enable stream compression in the stream.conf file for both agents.
    [stream]
    enable compression = yes
    

    In the child Netdata agent,

    1. cd into /etc/netdata and run sudo ./edit-config go.d.conf.
    2. Enable example go-plugin
    #  dockerhub: yes
    #  elasticsearch: yes
      example: yes
    #  filecheck: yes
    #  fluentd: yes
    
    1. Create a chart with many dimensions in sudo ./edit-config go.d/example.conf
    jobs:
      - name: stress
        charts:
          num: 2
          dimensions: 300
    
    
    1. Restart both agents
    2. Look in the child error.log for the message,
    Compression error - data discarded
    Message size above limit:
    
    1. And child <-> parent stream should be corrupted.

    Installation method

    from source

    System info

    Linux server2 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
    /etc/lsb-release:DISTRIB_ID=Ubuntu
    /etc/lsb-release:DISTRIB_RELEASE=20.04
    /etc/lsb-release:DISTRIB_CODENAME=focal
    /etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
    /etc/os-release:NAME="Ubuntu"
    /etc/os-release:VERSION="20.04.3 LTS (Focal Fossa)"
    /etc/os-release:ID=ubuntu
    /etc/os-release:ID_LIKE=debian
    /etc/os-release:PRETTY_NAME="Ubuntu 20.04.3 LTS"
    /etc/os-release:VERSION_ID="20.04"
    /etc/os-release:VERSION_CODENAME=focal
    /etc/os-release:UBUNTU_CODENAME=focal
    

    Netdata build info

    Version: netdata v1.32.1-114
    Features:
        dbengine:                   YES
        Native HTTPS:               YES
        Netdata Cloud:              YES
        ACLK Next Generation:       YES
        ACLK-NG New Cloud Protocol: YES
        ACLK Legacy:                NO
        TLS Host Verification:      YES
        Machine Learning:           YES
        Stream Compression:         YES
    Libraries:
        protobuf:                NO
        jemalloc:                NO
        JSON-C:                  YES
        libcap:                  NO
        libcrypto:               YES
        libm:                    YES
        tcalloc:                 NO
        zlib:                    YES
    Plugins:
        apps:                    YES
        cgroup Network Tracking: YES
        CUPS:                    NO
        EBPF:                    YES
        IPMI:                    NO
        NFACCT:                  NO
        perf:                    YES
        slabinfo:                YES
        Xen:                     NO
        Xen VBD Error Tracking:  NO
    Exporters:
        AWS Kinesis:             NO
        GCP PubSub:              NO
        MongoDB:                 NO
        Prometheus Remote Write: NO
    

    Additional info

    No response

    bug area/streaming 
  • Create a removed alert event if chart goes obsolete

    Jan 21, 2022

    Summary

    This PR aims to solve the case of how we can inform the cloud, with the new architecture, of a change in an alert when its chart becomes obsolete.

    Right now, if the chart of a raised alert goes away, the agent "hides" it from API calls etc. But it does not produce an internal health event log entry for it; rather, it keeps the last known one.

    Since communication with the cloud via the new architecture happens with alert events, the last known state has been sent to the cloud, and so the consistency of what we show on the agent vs. on the cloud is broken. By creating a REMOVED event and sending it to the cloud, we try to keep the two consistent.

    Test Plan

    Re-creating such a scenario can be done as follows:

    1. Use a usb stick. Create a large enough file to fill up its space (fallocate -l XXG filename is a good and fast way to do it).
    2. Connect the agent to a whitelisted space in staging. Right now it won't work in production for full testing (see Points to check 4).
    3. Wait for the agent to report a critical alert for it for space usage.
    4. Unmount the usb stick while the agent is running.

    What to observe:

    Without this PR, the cloud will continue to report a critical alert, while the agent will not. In api/v1/alarm_log the last known event for it will be the CRITICAL event.

    With this PR, the cloud should clear the alert when the usb stick is unmounted. The api/v1/alarm_log should also have a REMOVED event for it.

    Re-mounting the usb stick should then also produce a CRITICAL event again, and the cloud should display it.

    The scenario has been tested on the staging environment with cloud backend.

    Additional Information

    Points to check:

    1. Whether we need to check more chart statuses besides RRDSET_FLAG_OBSOLETE. I believe we are safe even if the chart then moves to other statuses via rrdcalc_isrunnable, but I assume RRDSET_FLAG_OBSOLETE is the first status a chart gets when it goes away.
    2. Whether we need a notification, either by the cloud or the agent for these events.
    3. We don't need to edit what is sent in the snapshot event. The snapshot event will contain the REMOVED event.
    4. This needs also some patch on the backend cloud side (currently in progress), since it does not store events with last status being REMOVED.
    5. If multiple alerts are processed as REMOVED on the same host in the same loop, sql_queue_removed_alerts_to_aclk will be called multiple times. This should not pose a problem; if it does, we can guard it.

    We don't want to break health for this. If someone feels there's a better approach please advise!

    area/health 
  • Redis python module + minor fixes

    Jul 12, 2016

    1. Nginx is shown as nginx: local in dashboard while using python or bash module.
    2. NetSocketService was renamed to SocketService, which can now use unix sockets as well as TCP/IP sockets
    3. changed and tested new python shebang (yes it works)
    4. fixed issue with wrong data parsing in exim.chart.py
    5. changed whitelisting method in ExecutableService. It is very probable that whitelisting is not needed, but I am not sure.
    6. Added redis.chart.py

    I have tested this and it works.

    After merging this I need to take a break from rewriting modules in python. There are only 3 modules left, but I don't have any data to create opensips.chart.py or nut.chart.py (so I cannot code the parsers). I also need to do some more research to create ap.chart.py, since using iw isn't the best method.

  • new prometheus format

    Jul 8, 2017

    Based on the recent discussion in #1497 with @brian-brazil, this PR changes the format in which netdata sends metrics to prometheus.

    One of the key differences between netdata and traditional time-series solutions is that netdata organises metrics into hosts, each having collections of metrics called charts.

    charts

    Each chart has several properties (common to all its metrics):

    chart_id - it serves 3 purposes: it defines the chart application (e.g. mysql), the application instance (e.g. mysql_local or mysql_db2), and the chart type (mysql_local.io, mysql_db2.io). However, there is another format: disk_ops.sda (it should be disk_sda.ops). There is issue #807 to normalize these better, but until then, this is how netdata works today.

    chart_name - a more human friendly name for chart_id.

    context - this is the same as the above, with the application instance removed. So it is mysql.io or disk.ops. Alarm templates use this.

    family is the submenu of the dashboard. Unfortunately, this is again used differently in several cases. For example, disks and network interfaces use the disk or network interface name as the family. But mysql uses it just to group multiple charts together, and postgres uses both approaches (it groups charts, and provides different sections for different databases).

    units is the units for all the metrics attached to the chart.

    dimensions

    Then each chart contains metrics called dimensions. All the dimensions of a chart have the same units of measurement and should be contextually in the same category (i.e. the metrics for disk bandwidth are read and write, and they are both on the same chart).


    So, there are hosts (multiple netdata instances), each has its own charts, each with its own dimensions (metrics).

    The new prometheus format

    The old format netdata used for prometheus was: CHART_DIMENSION{instance="HOST"}

    The new format depends on the data source requested. netdata supports the following data sources:

    • as collected or raw, to send the raw values collected
    • average, to send averages
    • sum or volume to send sums

    The default is the one defined in netdata.conf: [backend].data source = average (changing netdata.conf changes the format for prometheus too). However, prometheus may directly ask for a specific data source by appending &source=SOURCE to the URL (SOURCE being one of the options above).

    When the data source is as collected or raw, the format of the metrics is:

    CONTEXT_DIMENSION{chart="CHART",family="FAMILY",instance="HOSTNAME"}
    

    In all other cases (average, sum, volume), it is:

    CONTEXT{chart="CHART",family="FAMILY",dimension="DIMENSION",instance="HOSTNAME"}
    

    The above format fixes #1519
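
    As a concrete illustration (hypothetical host and values; prometheus metric names cannot contain dots, so the context's dot is assumed to become an underscore, with the netdata_ prefix netdata uses for its own metrics, as in netdata_host_tags below):

        netdata_system_cpu{chart="system.cpu",family="cpu",dimension="user",instance="myhost"} 25.0 1499541463610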

    time range

    When the data source is average, sum or volume, netdata has to decide the time-range it will calculate the average or the sum.

    The first time a prometheus server hits netdata, netdata will respond with the time frame defined in [backend].update every. But for all queries after the first, netdata remembers the last time it was accessed and responds with the time range since the last time prometheus asked for metrics.

    Each netdata server can respond to multiple prometheus servers. It remembers the last time it was accessed, for each prometheus IP requesting metrics. If the IP is not good enough to distinguish prometheus servers, each prometheus may append &server=PROMETHEUS_NAME to the URL. Then netdata will remember the last time it was accessed for each PROMETHEUS_NAME given.

    instance="HOSTNAME"

    instance="HOSTNAME" is sent only if netdata is called with format=prometheus_all_hosts. If netdata is called with format=prometheus, the instance is not added to the metrics.

    host tags

    Host tags are configured in netdata.conf, like this:

    [backend]
        host tags = tag1="value1",tag2="value2",...
    

    Netdata includes this line at the top of the response:

    netdata_host_tags{tag1="value1",tag2="value2"} 1 1499541463610
    

    The tags are not processed by netdata. Anything set at the host tags config option is just copied. netdata propagates host tags to masters and proxies when streaming metrics.

    If the netdata response includes multiple hosts, netdata_host_tags also includes `instance="HOSTNAME"`.

  • netdata package maintainers

    Jul 5, 2016

    This issue has been converted to a wiki page

    For the latest info check it here: https://github.com/firehol/netdata/wiki/netdata-package-maintainers


    I think it would be useful to prepare a wiki page with information about the maintainers of netdata for the Linux distributions, automation systems, containers, etc.

    Let's see who is who:


    Official Linux Distributions

    | Linux Distribution | Netdata Version | Maintainer | Related URL |
    | :-: | :-: | :-: | :-- |
    | Arch Linux | Release | @svenstaro | netdata @ Arch Linux |
    | Arch Linux AUR | Git | @sanskritfritz | netdata @ AUR |
    | Gentoo Linux | Release + Git | @candrews | netdata @ gentoo |
    | Debian | Release | @lhw @FedericoCeratto | netdata @ debian |
    | Slackware | Release | @willysr | netdata @ slackbuilds |
    | Ubuntu | | | |
    | Red Hat / Fedora / Centos | | | |
    | SuSe / openSuSe | | | |


    FreeBSD

    | System | Initial PR | Core Developer | Package Maintainer |
    | :-: | :-: | :-: | :-: |
    | FreeBSD | #1321 | @vlvkobal | @mmokhi |


    MacOS

    | System | URL | Core Developer | Package Maintainer |
    | :-: | :-: | :-: | :-: |
    | MacOS Homebrew Formula | link | @vlvkobal | @rickard-von-essen |


    Unofficial Linux Packages

    | Linux Distribution | Netdata Version | Maintainer | Related URL |
    | :-: | :-: | :-: | :-- |
    | Ubuntu | Release | @gslin | netdata @ gslin ppa https://github.com/firehol/netdata/issues/69#issuecomment-217458543 |


    Embedded Linux

    | Embedded Linux | Netdata Version | Maintainer | Related URL |
    | :-: | :-: | :-: | :-- |
    | ASUSTOR NAS | ? | William Lin | https://www.asustor.com/apps/app_detail?id=532 |
    | OpenWRT | Release | @nitroshift | openwrt package |
    | ReadyNAS | Release | @NAStools | https://github.com/nastools/netdata |
    | QNAP | Release | QNAP_Stephane | https://forum.qnap.com/viewtopic.php?t=121518 |
    | DietPi | Release | @Fourdee | https://github.com/Fourdee/DietPi |


    Linux Containers

    | Containers | Netdata Version | Maintainer | Related URL |
    | :-: | :-: | :-: | :-- |
    | Docker | Git | @titpetric | https://github.com/titpetric/netdata |


    Automation Systems

    | Automation Systems | Netdata Version | Maintainer | Related URL |
    | :-: | :-: | :-: | :-- |
    | Ansible | git | @jffz | https://galaxy.ansible.com/jffz/netdata/ |
    | Chef | ? | @sergiopena | https://github.com/sergiopena/netdata-cookbook |


    If you know other maintainers of distributions that should be mentioned, please help me complete the list...

    cc: @mcnewton @philwhineray @alonbl @simonnagl @paulfantom

    area/packaging area/docs 
  • python.d enhancements

    Jul 14, 2016

    @paulfantom I am writing here a TODO list for python.d based on my findings.

    • [x] DOCUMENTATION in wiki.

    • [x] log flood protection - it will require 2 parameters: logs_per_interval = 200 and log_interval = 3600. So, every hour (this_hour = int(now / log_interval)) it should reset the counter and allow up to logs_per_interval log entries until the next hour (a sketch of this logic appears after this list).

      This is how netdata does it: https://github.com/firehol/netdata/blob/d7b083430de1d39d0196b82035162b4483c08a3c/src/log.c#L33-L107

    • [x] support ipv6 for SocketService (currently redis and squid)

    • [x] netdata passes the environment variable NETDATA_HOST_PREFIX. cpufreq should use this to prefix sys_dir automatically. This variable is used when netdata runs in a container. The system directories /proc, /sys of the host should be exposed with this prefix.

    • [ ] the URLService should somehow support proxy configuration.

    • [ ] the URLService should support Connection: keep-alive.

    • [x] The service that runs external commands should be more descriptive. Example: running the exim plugin when exim is not installed:

      python.d ERROR: exim_local exim [Errno 2] No such file or directory
      python.d ERROR: exim_local exim [Errno 2] No such file or directory
      python.d ERROR: exim: is misbehaving. Reason:'NoneType' object has no attribute '__getitem__'
      
    • [x] This message should be a debug log: No unix socket specified. Trying TCP/IP socket.

    • [x] This message could state where it tried to connect: [Errno 111] Connection refused

    • [x] This message could state the hostname it tried to resolve: [Errno -9] Address family for hostname not supported

    • [x] This should state the job name, not the name:

      python.d ERROR: redis/local: check() function reports failure.
      
    • [x] This should state what the problem is:

      # ./plugins.d/python.d.plugin debug cpufreq 1
      INFO: Using python v2
      python.d INFO: reading configuration file: /etc/netdata/python.d.conf
      python.d INFO: MODULES_DIR='/root/netdata/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['cpufreq']
      python.d DEBUG: cpufreq: loading module configuration: '/etc/netdata/python.d/cpufreq.conf'
      python.d DEBUG: cpufreq: reading configuration
      python.d DEBUG: cpufreq: job added
      python.d INFO: Disabled cpufreq/None
      python.d ERROR: cpufreq/None: check() function reports failure.
      python.d FATAL: no more jobs
      DISABLE
      
    • [x] ~~There should be a configuration entry in python.d.conf to set the PATH to be searched for commands. By default everything in /usr/sbin/ is not found.~~ Added #695 to do this at the netdata daemon for all its plugins.

    • [x] The default retries in the code, for all modules, is 5 or 10. I suggest making them 60 for all modules. There are many services that cannot be restarted within 5 seconds.

      Made it in #695

    • [x] When a service reports failure to collect data (during update()), there should be a log entry stating the reason for the failure.

    • [x] Handling of incremental dimensions in LogService

    • [x] Better autodetection of disk count in hddtemp.chart.py

    • [ ] Move logging mechanism to utilize logging module.
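
    A minimal sketch in C of the flood-protection logic from the first item above (names are hypothetical; the linked log.c shows netdata's real implementation):

        #include <time.h>

        static const time_t log_interval = 3600;        /* window length in seconds */
        static const unsigned logs_per_interval = 200;  /* max entries per window */
        static time_t this_window = 0;
        static unsigned counter = 0;

        /* returns non-zero if the caller may emit a log entry now */
        int log_allowed(void) {
            time_t now = time(NULL);
            if (now / log_interval != this_window) {
                this_window = now / log_interval;       /* new window: reset counter */
                counter = 0;
            }
            return counter++ < logs_per_interval;
        }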

    more to come...

    area/external/python 
  • Prometheus Support

    Jan 2, 2017

    Hey guys,

    I recently started using prometheus and I enjoy the simplicity. I want to begin to understand what it would take to implement prometheus support within Netdata. I think this is a great idea because it allows the distributed fashion of netdata to exist while having persistence at prometheus. Centralized graphing (not monitoring) can now happen with grafana. Netdata is a treasure trove of metrics already - making this a worthwhile project.

    Prometheus expects a rest end point to exist which publishes a metric, labels, and values. It will poll this endpoint at a desired time frame and ingest the metrics during that poll.

    To get the ball rolling, how are you currently serving HTTP in Netdata? Is this an embedded sockets server in C?

  • what our users say about netdata?

    Apr 2, 2016

    In this thread we collect interesting (or funny, or just plain) posts, blogs, reviews, articles, etc - about netdata.

    1. don't start discussions on this post
    2. if you want to post, post the link to the original post and a screenshot!
    help wanted 