Tutorial: how to use the monitoring tools

This tutorial introduces the new monitoring tools used by the Tech Team and explains how to read the graphs. To keep it simple, we focus only on the most important metrics.

 

What is the purpose of the monitoring tools?

  • The tools are used on a daily basis by the Tech Team to monitor the entire network
  • Managers of hosted sites can monitor the server hosting their sites
  • Members of hosted sites can identify issues with the sites they are using and see what’s going on
  • Visitors experiencing slowdowns or downtime can check whether the problem is with the server or just their own internet connection
  • The tools serve as an inventory of the IT infrastructure supplied by the NGNM Coop and increase transparency

 

The difference between server monitoring and site monitoring

Server monitoring provides real-time data about the current health of our servers. Is the server down? Are the processors overloaded? Is the disk full?

Site monitoring tracks page load times to measure how long it takes visitors to load a page. This helps the team identify problems with specific sites and improve the user experience by speeding up load times.

Monitoring both the servers AND the websites allows the team to figure out if a problem is coming from the server itself, or if it is a hosted website that is slowing down due to a bug on the website side.

 

Server-side Monitoring

Server monitoring is the process of analyzing real-time data about the hardware status of each server to see how each component is performing under load. To put it simply, server monitoring is the systematic tracking, measuring, or observing of processes and operations on a server. Its purpose is to use the collected data to draw conclusions about the health and condition of the server and ensure it reaches optimal performance.

The primary objective of server monitoring is always to protect the server from possible failure that would interrupt service availability.

 

CPU Use Percentage

This is the most important indicator of a server’s health. The CPU is the brain of the server: if it is overloaded, the server crashes and all websites go offline. When CPU usage spikes, everything slows down and can eventually crash. Heavy CPU usage can also lead to memory problems and overall server performance issues.

The graphs show the percentage of CPU in use. The higher the percentage, the more heavily loaded the server is. The line on the chart is color-coded: green means all is good, orange means we’re approaching a dangerous level, and red means the server is overloaded.


CPU Load Levels

near 0% use = Websites are down; a serious problem that the Tech Team needs to address.
10% to 50% use = Optimal performance.
50% CPU use = Server slightly slowed down.
60% CPU use = Noticeable slowdown in load times.
70% CPU use = Server significantly slowed down. The Tech Team is alerted.
80% CPU use = Extreme slowdown; an urgent issue that must be addressed quickly.
90% CPU use or more = Server crashed and completely unresponsive.

The Tech Team aims to keep this level under the 50% target at all times.
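
To make the levels above easier to apply, here is a minimal Python sketch that maps a CPU percentage to the matching description. It is only an illustration, not the code behind the dashboards. It assumes each listed value is the lower bound of a band, uses a hypothetical 1% cutoff for "near 0%" (the list gives no exact number), and treats everything between that cutoff and 50% as optimal.

```python
# Lower bound of each band (percent) with the description from the list above.
# Assumption: each listed value marks the start of a band that runs to the next one.
CPU_LEVELS = [
    (90.0, "crashed and unresponsive"),
    (80.0, "extreme slowdown, urgent"),
    (70.0, "significantly slowed down, Tech Team alerted"),
    (60.0, "noticeable slowdown"),
    (50.0, "slightly slowed down"),
]

# Hypothetical cutoff for "near 0%"; the list does not give an exact number.
NEAR_ZERO_PERCENT = 1.0


def describe_cpu_use(percent: float) -> str:
    """Map a CPU use percentage to the matching level."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("CPU use must be between 0 and 100")
    if percent < NEAR_ZERO_PERCENT:
        return "near zero, websites are probably down"
    # Bands are sorted from highest to lowest, so the first match wins.
    for lower_bound, description in CPU_LEVELS:
        if percent >= lower_bound:
            return description
    return "optimal"


for sample in (0.2, 35, 50, 72, 100):
    print(f"{sample}% -> {describe_cpu_use(sample)}")
```

Anything that comes back as something other than "optimal" is above the 50% target or suspiciously close to zero.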

Here is how it looks when a server is overloaded

The red arrows point to when the server’s processors were overloaded at 100% use, which means all websites were down. You can see the team trying to fix the issue: the CPU load goes down for a moment, then back up again, until a final fix brought the load back under the 50% target the next day.

 

A server crash can also look like this: all websites are down and CPU use is near zero because visitors can’t reach the websites.

Average Server Load

Load average, also called average system load, is another important metric. On a Linux server it reflects how many tasks are running or waiting in the queue, averaged over the last 1, 5, and 15 minutes. Big spikes in this graph are a symptom of an overloaded server and usually coincide with a big jump in CPU Use Percentage.

This graph shows an abnormal rise in Average Server Load, meaning that too many processes are competing for the server. The spikes line up with those in the previous chart: when CPU Use goes up, Average Server Load also goes up, confirming a technical problem.
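
If you want to see the same metric on a machine of your own, the sketch below reads the load average with Python’s standard library and divides it by the number of CPU cores. The "more than 1.0 per core" limit is a common rule of thumb and an assumption here; this tutorial does not state which threshold the dashboards use.

```python
import os


def load_per_core(load_average: float, cpu_count: int) -> float:
    """Divide a load average by the number of CPU cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def is_overloaded(load_average: float, cpu_count: int, limit: float = 1.0) -> bool:
    """Return True when the load per core is above the limit.

    The default limit of 1.0 per core is a rule of thumb,
    not a value taken from this tutorial.
    """
    return load_per_core(load_average, cpu_count) > limit


if __name__ == "__main__" and hasattr(os, "getloadavg"):
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1 min load per core: {load_per_core(one_min, cores):.2f}")
    print("overloaded" if is_overloaded(one_min, cores) else "ok")
```

Dividing by the core count matters because a load of 4 is heavy for a single-core server but comfortable for one with eight cores.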

 

Memory Usage (RAM)

These graphs are harder to interpret because high memory usage is not necessarily evidence of a technical problem with the server (the memory could be used for caching). Combined with the other graphs, though, they help the team better understand what’s going on with the server. Generally speaking, as long as the memory is not 100% full, there should be no problems.
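
The caching point can be illustrated with a short Python sketch. On Linux, /proc/meminfo reports both MemTotal and MemAvailable, and MemAvailable already counts reclaimable cache as usable memory. The sample values below are made up, and this is not necessarily how our dashboards compute memory usage.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_in_real_use_percent(meminfo: dict) -> float:
    """Percentage of RAM that is not available to new work.

    MemAvailable treats reclaimable cache as free, so memory that is
    only used for caching does not raise this number.
    """
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


# Made-up sample values for illustration (not real measurements).
SAMPLE = """MemTotal:       16000000 kB
MemFree:         1000000 kB
MemAvailable:    8000000 kB
Cached:          6500000 kB
"""

info = parse_meminfo(SAMPLE)
print(f"{memory_in_real_use_percent(info):.1f}% of RAM is in real use")
```

In this sample only about 6% of the memory is completely free, yet half of it is still available because most of the rest is cache.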

 

MySQL Database CPU and Memory Usage

The MySQL Database is where all of the data is stored: forum posts, members’ accounts, published articles, shop products, etc. The purpose of monitoring the database’s memory and CPU usage is to better understand what exactly is causing a server overload.

The database is one of the most important parts of a server and is known to be resource-intensive so it often contributes to server load issues.

The two previous graphs already showed that Server 2 was overloaded. The next step is to understand what is causing the problem. This graph shows a huge rise in MySQL CPU Time, and the spikes occur at the same dates and times as the CPU load issues in the other graphs. This points to the MySQL database consuming too many resources as the cause of the server overload.
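
The "do the spikes line up?" reasoning can be sketched in a few lines of Python. The function below counts how many spikes in one series happen at the same timestamp as a spike in another. All sample values and the MySQL threshold are made up for illustration; only the 70% CPU alert level comes from this tutorial.

```python
def spike_times(series: dict, threshold: float) -> set:
    """Return the timestamps whose value is above the threshold."""
    return {timestamp for timestamp, value in series.items() if value > threshold}


def shared_spike_ratio(a: dict, a_threshold: float, b: dict, b_threshold: float) -> float:
    """Fraction of the spikes in series `a` that coincide with a spike in `b`."""
    a_spikes = spike_times(a, a_threshold)
    if not a_spikes:
        return 0.0
    b_spikes = spike_times(b, b_threshold)
    return len(a_spikes & b_spikes) / len(a_spikes)


# Made-up sample data: percentages keyed by time of day.
cpu_use = {"10:00": 35, "10:05": 96, "10:10": 98, "10:15": 40}
mysql_cpu = {"10:00": 20, "10:05": 88, "10:10": 91, "10:15": 25}

# 70 is the CPU level at which the Tech Team is alerted; the MySQL
# threshold of 70 is a hypothetical value chosen for this example.
ratio = shared_spike_ratio(cpu_use, 70, mysql_cpu, 70)
print(f"{ratio:.0%} of CPU spikes coincide with a MySQL spike")
```

A high ratio does not prove the database is the cause, but it tells the team where to look first.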

 

Disk Usage

These graphs work just like checking the disk use of your own computer: they show how much free space is left on the hard disk drives of each server. Monitoring disk usage is another key feature of a good server monitoring tool. With it, the team can quickly see how much disk space is left and mitigate the risk of downtime.
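
Here is a small Python sketch of the same check, using the standard library’s shutil.disk_usage. It simply reports the numbers for one path; it is an illustration and not the tool used by the Tech Team.

```python
import shutil


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Share of the disk that is in use, as a percentage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def disk_report(path: str = "/") -> str:
    """Build a one-line summary of disk usage for the given path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    free_gib = usage.free / 1024**3
    used = percent_used(usage.total, usage.used)
    return f"{path}: {used:.1f}% used, {free_gib:.1f} GiB free"


print(disk_report("/"))
```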

 

 

Websites Monitoring

Website monitoring enables the Tech Team to identify problems and determine whether a website or application is slow or down before the problem affects end users or members of the communities we host. It checks that uptime, performance, and functionality are as expected by tracking load times, server response times, and page element performance, which are then analyzed to further optimize the website.

The monitoring tools gather real-world data on page load speed to detect issues related to latency, intermittent downtimes, slow queries, network hop issues, and other potential problems. By having multiple monitoring probes in different geographic locations, our monitoring service can determine if a website is available across continents over the Internet and if it is slower in countries farther away from the server’s location.

Each website is tested every 60 seconds to make sure it is accessible and loading fast. The tests are conducted from several locations around the world to make sure the site loads fine in all countries.

 

Hosted Sites Status

This is the easiest way to check if a site is online or offline. The milliseconds (ms) value is the page load speed of each site.
Green square = website is online and loading fast
Yellow square = website is online but slowed down
Red square = website is currently offline due to a technical issue

 

Downtime Detector Heat Map

This chart represents the historical response times for each site. A red color means the site was slow to load at this specific time. You can get more information by moving the mouse pointer over the chart.

A website is considered down if it takes more than 10 seconds to respond.
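
For the curious, a single check could look roughly like the Python sketch below: it loads a page once, measures how long that takes, and applies the 10-second rule. This is an assumption about how such a check can be written, not the code of our monitoring service. The `opener` and `clock` parameters are hypothetical additions that make the function easy to try without network access.

```python
import time
import urllib.request

DOWN_AFTER_SECONDS = 10.0  # from this tutorial: slower than this counts as down


def check_site(url, opener=urllib.request.urlopen, clock=time.monotonic):
    """Fetch a page once and return (is_up, elapsed_seconds)."""
    start = clock()
    try:
        # The timeout applies to each network operation, so the total
        # elapsed time is checked again below.
        with opener(url, timeout=DOWN_AFTER_SECONDS) as response:
            response.read()
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False, clock() - start
    elapsed = clock() - start
    return elapsed <= DOWN_AFTER_SECONDS, elapsed
```

Calling `check_site("https://example.org")` once every 60 seconds from several locations would mirror the schedule described earlier.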

 

Page Load Speed

This chart shows the page load speed (in seconds or milliseconds) for each site over time. The dotted red line represents the point where a site is significantly slowed down. Anything over this line is problematic.

A website is considered down if it takes more than 10 seconds to respond. You can move your mouse pointer over the chart to see which website was down.

Error Rate

Errors happen, but they are more likely when the server’s CPU is under heavy load. The error rate is the number of failed requests relative to the total number of requests. Errors include requests that timed out. This is one of the most important performance metrics for a monitoring tool to track.
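
As a worked example, the calculation can be written as the short Python function below. The request counts are made up for illustration.

```python
def error_rate_percent(failed_requests: int, timed_out_requests: int, total_requests: int) -> float:
    """Error rate as a percentage; timeouts count as errors."""
    if total_requests <= 0:
        return 0.0  # no traffic, so nothing to report
    errors = failed_requests + timed_out_requests
    if errors > total_requests:
        raise ValueError("errors cannot exceed total requests")
    return 100.0 * errors / total_requests


# Made-up numbers: 9 failed requests and 3 timeouts out of 480 requests.
print(f"{error_rate_percent(9, 3, 480):.1f}%")
```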

 

Uptime

Uptime is the percentage of time that a system is fully operational. Most web hosts target an uptime level of over 99%.
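
To put these percentages in perspective, the Python sketch below computes uptime from a series of checks and shows how much downtime a given target leaves room for. It assumes one check every 60 seconds, as described above, which gives 1440 checks per day; the number of failed checks is made up.

```python
def uptime_percent(successful_checks: int, total_checks: int) -> float:
    """Share of checks that succeeded, as a percentage."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100.0 * successful_checks / total_checks


def allowed_downtime_minutes(target_percent: float, days: int = 30) -> float:
    """Minutes of downtime that still meet the uptime target over the period."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - target_percent) / 100.0


# Made-up example: 3 failed checks out of the 1440 checks in one day.
print(f"uptime: {uptime_percent(1437, 1440):.2f}%")
print(f"99% over 30 days allows {allowed_downtime_minutes(99.0):.0f} minutes of downtime")
```

A 99% target over 30 days leaves room for 432 minutes, or a little over 7 hours, of downtime.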

 

 

Individual Website Monitoring

The website monitoring section also lets you check a specific website. This is a very useful tool for both hosted-site admins and their members. For example, members of the forums can use it to view the current health of the website and its server and confirm whether there is a technical problem.

‘Response latency by phase’ is the time it takes to load the website. We try to target 300ms or less, but anything under 750ms shouldn’t be an issue.
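
The thresholds mentioned in this tutorial can be combined into a simple rating, sketched in Python below. The category names are hypothetical labels chosen for this example; the 300 ms, 750 ms, and 10-second values come from the text.

```python
TARGET_MS = 300        # what we try to achieve
ACCEPTABLE_MS = 750    # anything under this shouldn't be an issue
DOWN_MS = 10_000       # a site slower than 10 seconds is considered down


def rate_latency(latency_ms: float) -> str:
    """Rate a page load time given in milliseconds."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms > DOWN_MS:
        return "down"
    if latency_ms <= TARGET_MS:
        return "on target"
    if latency_ms < ACCEPTABLE_MS:
        return "acceptable"
    return "slow"


for sample in (250, 500, 900, 12_000):
    print(f"{sample} ms -> {rate_latency(sample)}")
```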

‘Response latency by probe’ shows the page load time measured from each of the monitoring locations around the world.

‘Error rate’ shows how often a monitoring check fails. A rising error rate is usually a bad sign.

This graph shows that the Anarcho-Punk.net load time (aka Response Latency) is usually under 750ms, except in Sydney. This is because the server is located in North America, which is very far from Sydney, Australia. When the site is down, the probes in every location will show it.
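
The way to read the per-probe results can be sketched in Python as follows. The probe names other than Sydney and all latency values are made up, and the wording of the summaries is only an illustration.

```python
SLOW_MS = 750  # from this tutorial: under 750 ms shouldn't be an issue


def summarize_probes(latency_by_probe: dict) -> str:
    """Summarize one round of checks.

    `latency_by_probe` maps a probe location to its latency in ms,
    or to None when the check from that location failed.
    """
    if not latency_by_probe:
        raise ValueError("at least one probe result is required")
    failed = sorted(name for name, ms in latency_by_probe.items() if ms is None)
    if len(failed) == len(latency_by_probe):
        return "down for everyone"
    if failed:
        return "unreachable from: " + ", ".join(failed)
    slow = sorted(name for name, ms in latency_by_probe.items() if ms >= SLOW_MS)
    if slow:
        return "up, but slow from: " + ", ".join(slow)
    return "up and fast everywhere"


# Made-up latencies in milliseconds.
print(summarize_probes({"New York": 210, "Paris": 480, "Sydney": 910}))
print(summarize_probes({"New York": None, "Paris": None, "Sydney": None}))
```

A single slow probe far from the server is expected; every probe failing at once means the site really is down.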

 

 

Real-world use case example

Let’s say you are a member of the Anarcho-Punk.net forum and the site is down when you try to log in.

First, let’s check Anarcho-Punk.net’s status in the Sites Monitoring tool by clicking the monitoring button and going to this page.

In the “Response latency by probe” section, we can see the site stopped responding around 20:00, which confirms the site is down for everyone.

 

Now let’s check if it is a server-side problem or just this website. For this, we will use the “Current Status Of Hosted Sites” section in the dashboard.

Uh oh, it looks like not only Anarcho-Punk.net is down, but also all of the other forums.

When multiple sites are down, this is usually a sign of a server-side problem.

Now let’s check which server this site is hosted on. By looking at this article, you can find out that the forums are hosted on Server 3.

Please refer to the first section of this article (Server-side monitoring) to understand how you can check the health of Server 3.
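
The reasoning in this walkthrough can be summarized as a small Python sketch: if other sites on the same server are down too, suspect the server; otherwise suspect the site. The site names ending in ".example" and the site-to-server mapping are hypothetical; only the fact that the forums are hosted on Server 3 comes from this tutorial.

```python
def diagnose(site: str, is_up_by_site: dict, server_by_site: dict) -> str:
    """Give a first hint about where a problem is likely to be."""
    if is_up_by_site[site]:
        return f"{site} is up"
    server = server_by_site[site]
    # Other sites hosted on the same server that are also down.
    neighbours_down = [
        other
        for other, other_server in server_by_site.items()
        if other != site
        and other_server == server
        and not is_up_by_site.get(other, True)
    ]
    if neighbours_down:
        return f"likely a server-side problem on {server}"
    return f"likely a problem with {site} only"


# Hypothetical mapping and statuses for illustration.
server_by_site = {
    "anarcho-punk.net": "Server 3",
    "forum-a.example": "Server 3",
    "blog-b.example": "Server 2",
}
is_up_by_site = {
    "anarcho-punk.net": False,
    "forum-a.example": False,
    "blog-b.example": True,
}

print(diagnose("anarcho-punk.net", is_up_by_site, server_by_site))
```

This is only a first hint: the server graphs described earlier are still needed to confirm what is actually wrong.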

 

 

The role of the Tech Team

When a server is overloaded or a site is detected as down, an e-mail is automatically sent to members of the Tech Team to alert them about the issue. Because team members live on different continents and in different time zones, we are able to offer near 24/7 monitoring.

When an error is detected, monitoring services send out alerts via email, SMS, or phone, with diagnostic information such as a network traceroute, a capture of the web page’s HTML, a screenshot of the page, and even a video of the website failing. These diagnostics allow network administrators and webmasters to correct issues faster.

Technical issues with the servers happen almost daily, so the team is always working in the background and scheduling interventions on the servers. We usually respond very quickly, but since we are volunteers, we are sometimes busy with our respective jobs.

You are always welcome to contact the Tech Team to report technical issues or ask questions about ongoing issues.

 

 

Why speed is so important

Google has indicated that site speed is one of the signals used by its algorithm to rank pages. In addition, a slow page speed means that search engines can crawl fewer pages using their allocated crawl budget, and this could negatively affect your indexation.

Page speed is also important to user experience. Pages with a longer load time tend to have higher bounce rates and lower average time on page. Longer load times have also been shown to negatively affect traffic.

The Tech Team has done a lot of work to improve page speed, including caching, compression, and other optimizations. For more information, please see these articles:

Server update: page speed, caching, security, and optimization

Memcached added to 3 of our servers