A disk in the mdadm raid failed on one of the servers in the data center. This is a typical situation that I encounter regularly. I filed a request with technical support to replace the disk, indicating which disk had to be changed. The data center replaced the working disk and left the failed one in place. What follows is the story of how I solved the problem.
- Tell an instructive story about the problems you can run into when renting servers in a data center.
- Show an example of how to proceed when a drive fails in an mdadm raid.
- Explain in simple terms the difference between a software and a hardware raid.
The first time you run into the clumsiness of technical support staff, you fall into a stupor and wonder how this is even possible. Now I take such situations calmly and plan for the worst. The other day the wrong drive was replaced in a server of mine with RAID1: instead of the failed disk, the working one was removed and replaced with a blank one. Fortunately, everything ended well, but first things first.
I can’t say that I have a huge amount of experience with server rentals, but I do have some. I regularly maintain 10-15 servers located in different data centers, both Russian and European. My first negative experience was in Europe, and it left me very surprised and confused. Like many others, I was influenced by the liberal propaganda that everything is bad here, while Europe is a model of reliability, stability and service. I was badly mistaken. Now I give preference to our data centers. In my opinion and experience, our support and service are generally better than over there, price aside. In Europe similar services are cheaper, because the scale of services there is many times larger.
Here are some examples of support blunders I have encountered.
- Ordering a private network from the hoster leaseweb.com took the entire service down for several hours. There was a large project at the hoster. It grew gradually, from scratch, buying server after server. When there were a lot of servers, it was decided to join them into a single local network. The hoster has such a service, called a private network. Since the servers were scattered widely across racks, the hoster said they all had to be moved closer to each other. We agreed on a time for moving the servers and on everything else. The hoster gave us all the network settings in advance. After the hoster had moved all the servers and reported success in the ticket, the chaos started. On some servers, the specified network settings did not bring the server online. Some servers could not see each other. A long correspondence with technical support followed, in which they suggested enabling dhcp, disabling it, and a lot of other useless actions. In the end it turned out that the network settings had simply been mixed up and the servers had been moved to the wrong places. It was a nightmare. On top of that, all communication was in English. Since then I never order such services on a running system. If I have to join servers, I configure a vpn over the existing network connections. And in general I keep contact with tech support to a minimum. If global changes are needed, I do a smooth transition to a redundant system.
- Shortly before the New Year, on December 31 at 16:00, a fairly large project went down. As it turned out, the hoster was doing some work in the rack and by mistake cut the power to our server, which was the load balancer and entry point for all requests. As a result, the whole site and service were down for visitors. We were lucky that in about 2 hours they found the problem and reported in the ticket that everything was OK. In response to the initial request, though, they had said that they would look into it, but all the engineers were already celebrating, so they could not promise anything.
- And finally, the classic: the wrong disk was replaced in the raid. Instead of the faulty disk, they removed the working one. By some miracle, the raid did not fall apart. Everything hung on the spot, the working disk was put back, and the server was rebooted.
There were many smaller incidents, and there is no point in describing them. No, I will describe one, though. I was installing my own server in a data center. I decided to go to the machine room and supervise the installation. If you have the opportunity, I strongly recommend doing the same. The local technician attached the rails incorrectly, and the server began to fall during the installation. I caught it, and so saved both it and the servers of other clients. In the end, I helped with the installation. He simply could not have done it on his own. I can’t imagine what would have happened if I hadn’t gone there. To the credit of the management: I wrote a complaint where I described the case in detail and asked for a free month of rent, and it was granted. I advise everyone to do the same. Management is often unaware of what is really happening. It is necessary to give feedback.
You now have a rough idea of how much I trust the support of data centers and hosting providers 🙂 So, the next emergency happened. Let me dwell on this one in detail, since it happened yesterday and the memories are fresh.
Replacing a disk in an mdadm raid
We are talking about cheap dedicated servers from selectel. I use them in many places and generally recommend them. They are ordinary desktop systems for a modest amount of money. I will give my opinion about these servers, and compare them with full-fledged servers, at the end, in a separate section.
The Debian system was installed on the server from the Selectel standard template. Here are the features of the disk subsystem of these servers and the template.
- 2 ssd disks combined into an mdadm raid
- /boot partition on /dev/md0, 1G in size
- root / on /dev/md1, with lvm on top of the whole array
In general, this is a good and reliable layout, which will be confirmed below. The server ran proxmox, with mdadm monitoring configured. Disk monitoring was not set up. At some point I received a notification from zabbix that the mdadm array had degraded. The server was still running. This is a normal situation. I logged in to the server console to check everything and looked at the state of the raid.
# cat /proc/mdstat
I made sure that one disk fell out of the array. I saw the following in the system log.
I tried to view the information about the dropped disk.
# smartctl -i /dev/sda
There was no information; the utility reported an error accessing the disk. I could only see the model and serial number of the working disk.
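Since the ticket has to identify a disk by serial number, it can help to collect the serial of every disk that still responds before writing to support. The following is a minimal sketch that was not used in the incident described here. `disk_serial` is a hypothetical helper name, and the sketch assumes that `smartctl -i` prints a `Serial Number:` line, as it does for typical ATA and NVMe drives.

```shell
# Print the serial number found in `smartctl -i` output read from stdin.
# disk_serial is a hypothetical helper name, not a standard tool.
disk_serial() {
    awk '/^Serial Number:/ {
        sub(/^Serial Number:[ \t]*/, "")   # drop the field label
        sub(/[ \t]+$/, "")                 # drop trailing whitespace
        print
    }'
}

# Example (needs root and smartmontools):
#   for d in /dev/sd?; do
#       echo "$d: $(smartctl -i "$d" | disk_serial)"
#   done
```

A disk that does not respond simply yields an empty serial, which is itself a useful hint about which device is the failed one.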
I did not look into what exactly had happened to the disk. If I see a problem with a disk, I replace it right away. I warned the customer that there was a problem with the disk and that a replacement had to be planned. Since the hardware is desktop-grade, the “server” has to be shut down for that. We agreed on a time after 22:00. I am already asleep at that hour, so I wrote a ticket to support in which I specified the time and the serial number of the disk that had to stay in place. I emphasized this and explained that the failed disk does not respond, so I cannot see its serial number. I wrote everything in great detail so as to leave no room for misunderstanding or double interpretation. I have plenty of practice at this by now, but it still did not help.
I calmly agreed to this operation because backups are made often and are guaranteed to work. Backup monitoring is set up, and regular semi-manual test restores are performed. The agreement was that after the replacement the hoster waits for the login prompt, and the customer checks that the site is working. That is exactly how it went: the server booted, the virtual machines came up, the site was running. With that, the work was considered complete.
In the morning I got up and saw that the entire system log was full of disk errors. The working disk was no longer in the system; instead there were the faulty disk and a new one. Just in case, I immediately started a rebuild of the array, and it even seemed to go without errors. The reboot had temporarily brought the failed disk back to life. In principle, I could have stopped there, replaced the failed disk and calmed down. But the point is that this failed disk had been out of operation for almost a day, so the data on it was stale. That did not suit us. We would then have had to somehow merge this data with data from the backups, and for a database that is not a trivial procedure. I called the customer and we decided to roll back to the working disk that had been pulled out the night before.
I created a ticket and asked them to return the working disk to its place (fortunately, it had been kept) and to add a second, completely blank one. The hoster did everything promptly and apologized. In the end, they sent me a screenshot of the server screen.
And with that they stepped aside. They suggested that I deal with the boot problem myself by booting into rescue mode. This mode is available from the server control panel even if the server has no ipmi console. As I understand it, some kind of live cd is booted over the network for recovery. I booted into it and made sure that the data was in place, but could not work out the cause of the error. Maybe I could have if I had dug longer, but it is very inconvenient to do this without seeing the real server console. I asked them to connect a kvm over ip to the server so that I could get to the console. Tech support did it quickly and without unnecessary questions.
By the way, I know of cases when selectel tech support fixed the boot themselves and returned mdadm to a working state. I have seen such correspondence in my clients’ tickets from before they contacted me. But I did not insist on such a solution, because I was afraid it would make things worse. Besides, it was Sunday morning, and there might not have been specialists around who could do it. Plus, I don’t think they would be more competent than me. I wouldn’t go to work in a data center for their salary.
Once I connected to the server console, restoring the boot was a matter of technique.
You are in emergency mode
I have many examples of how I restored the boot of broken linux distributions.
- Kernel panic not syncing: VFS: Unable to mount root fs
- Booting from Hard Disk error, Entering rescue mode
- In the same way I fixed a broken boot when moving virtual machines from one hypervisor to another: repairing the linux server boot
- In this article about backup and linux server migration I also cover fixing the bootloader
In this situation with mdadm I was sure that everything would work out, because the array itself with the system is alive, the data is available. We just have to figure out why the system does not boot. Let me remind you that the boot error was the following.
You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, "systemctl default" to try again to boot into default mode. Give root password for maintenance (or type Control-D to continue):
The next step is to enter the root password and you will be in the system console. The first thing I checked was the state of the mdadm array.
# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda3
      467716096 blocks super 1.2 [2/2] [U_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

md0 : inactive raid1 sda2(S)
      999424 blocks super 1.2 [2/2] [U_]
The md0 array, where /boot is located, is in the inactive state. This is actually why the server was not booting. Apparently, when the failed disk was connected, mdadm deactivated the array to prevent data corruption. It is not clear why this happened only to the array with /boot, but that is how it was. Because the array was stopped, it was not possible to boot from it. I stopped the array and started it again.
# mdadm -S /dev/md0
# mdadm -R /dev/md0
After that, the array left the inactive state and became available for further work. I rebooted the server and made sure that it booted normally. The server was effectively back in working order, just with a degraded mdadm array missing one disk.
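The two states that mattered here, an inactive array and a degraded one, can be spotted in /proc/mdstat mechanically. Below is a rough sketch of such a check. It assumes the usual mdstat layout, in which the array name (md0, md1 and so on) starts a line and a missing member shows up as `_` in the `[U_]` status field. The function name `md_problem_arrays` is made up for this example.

```shell
# List md arrays that are inactive or degraded.
# md_problem_arrays is a hypothetical helper name.
#   $1 - path to an mdstat file (optional, defaults to /proc/mdstat)
md_problem_arrays() {
    awk '
        # Array header line, e.g. "md0 : inactive sda2[0](S)"
        /^md[0-9]+ :/ {
            name = $1
            if ($3 == "inactive") print name, "inactive"
        }
        # Status field such as [U_] or [_U]: an underscore is a missing member
        /\[[U_]*_[U_]*\]/ {
            if (name != "") print name, "degraded"
        }
    ' "${1:-/proc/mdstat}"
}

# Example: md_problem_arrays            # reads /proc/mdstat
```

Empty output means that no array matched either condition. Arrays with non-numeric names would need a wider pattern than the one assumed here.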
If that does not help in your case, I have a few more tips on how to fix the boot. First, check the /etc/fstab file and see which partitions are mounted and how. Here is my example of this file.
/dev/mapper/vg0-root / ext4 errors=remount-ro 0 1
UUID=789184ea-50e4-4788-98f4-b500928d35c8 /boot ext3 defaults 0 2
/dev/mapper/vg0-swap_1 none swap sw 0 0
You need to make sure that the lvm volumes listed there, /dev/mapper/vg0-root and /dev/mapper/vg0-swap_1, actually exist. Use this command to check:
# lvs
LV     VG  Attr       LSize   Pool Origin Data% Meta% Move Log Cpy%Sync Convert
root   vg0 -wi-ao---- 441.28g
swap_1 vg0 -wi-ao----  <4.77g
I cover this command, working with lvm, and disks in general in detail in a separate article: drive configuration in debian. If there is nothing wrong with the lvm volumes, check /boot. I have it mounted by uuid. You can see the uuids of all partitions with this command.
# blkid
/dev/sda1: PARTUUID="5668dd38-a5e2-495e-856f-af0547a9d907"
/dev/sda2: UUID="3f8c654b-5c1d-cb5c-3b13-6bd5925c995f" UUID_SUB="8bf2478f-ec17-a055-1f70-d20dd13a19b3" LABEL="Jellicent:0" TYPE="linux_raid_member" PARTUUID="7e3210cc-f267-4372-85e2-1dae7731a6bb"
/dev/sda3: UUID="c123309f-fc71-7b99-2fee-9cd567bd6f9d" UUID_SUB="e0697294-88dc-d6f5-b61c-bdf6091bfceb" LABEL="Jellicent:1" TYPE="linux_raid_member" PARTUUID="df1e1100-a01a-46da-8fd3-81ed4c010c11"
/dev/sdb1: PARTUUID="5668dd38-a5e2-495e-856f-af0547a9d907"
/dev/sdb2: UUID="3f8c654b-5c1d-cb5c-3b13-6bd5925c995f" UUID_SUB="a8431f0f-6d98-3ca5-1dc7-da62082a4a8c" LABEL="Jellicent:0" TYPE="linux_raid_member" PARTUUID="7e3210cc-f267-4372-85e2-1dae7731a6bb"
/dev/sdb3: UUID="c123309f-fc71-7b99-2fee-9cd567bd6f9d" UUID_SUB="ea65601c-8a17-c654-735a-1e5c892e6584" LABEL="Jellicent:1" TYPE="linux_raid_member" PARTUUID="df1e1100-a01a-46da-8fd3-81ed4c010c11"
/dev/md0: UUID="789184ea-50e4-4788-98f4-b500928d35c8" TYPE="ext3"
/dev/md1: UUID="O9Qq20-35Uk-n1Lx-d993-xnkr-9jCi-l0dNVy" TYPE="LVM2_member"
/dev/mapper/vg0-root: UUID="8eccb650-dd7e-4643-8898-9dd5befea121" TYPE="ext4"
/dev/mapper/vg0-swap_1: UUID="186dfb1a-7c7e-4750-804c-cc507a38f514" TYPE="swap"
As you can see, the uuid of my /boot partition exactly matches the one in fstab. If the uuid has changed for some reason (for example, the array was taken apart and a new one was built), edit fstab.
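The comparison of fstab against blkid can also be done by a small script instead of by eye. This is only a sketch under a few assumptions: fstab entries use the unquoted `UUID=...` form, the output of `blkid` has been saved to a file first, and `fstab_missing_uuids` is a hypothetical helper name.

```shell
# Print every UUID referenced in an fstab file that is absent from saved
# `blkid` output. fstab_missing_uuids is a hypothetical helper name.
#   $1 - path to fstab, $2 - file containing the output of `blkid`
fstab_missing_uuids() {
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1 }' "$1" |
    while read -r uuid; do
        # The leading space keeps PARTUUID="..." from matching.
        grep -qF -- " UUID=\"$uuid\"" "$2" || echo "$uuid"
    done
}

# Example (blkid usually needs root):
#   blkid > /tmp/blkid.out
#   fstab_missing_uuids /etc/fstab /tmp/blkid.out
```

Any uuid it prints is one that fstab expects but no block device currently carries, which is exactly the case where fstab has to be edited.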
I did all the further steps over ssh. I copied the partition table from the working disk sda to the blank sdb.
# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checked the partition tables and made sure that they are identical.
# fdisk -l | grep /dev
Copied the BIOS boot partition from the working disk to the new one.
# dd if=/dev/sda1 of=/dev/sdb1 bs=512
Then I added sdb2 and sdb3 partitions to the raid array.
# mdadm --add /dev/md0 /dev/sdb2
# mdadm --add /dev/md1 /dev/sdb3
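Because the `sfdisk` and `dd` steps above overwrite the target disk, and device names can change after a disk swap, a cheap safety check before running them on your own server is to confirm that the target is not already a member of a running array. The sketch below assumes the standard mdstat member notation such as `sdb2[1]`. `disk_in_md` is a hypothetical helper name.

```shell
# Succeed (exit 0) if the given disk, or any of its partitions, is listed
# as an md member in mdstat. disk_in_md is a hypothetical helper name.
#   $1 - disk name without /dev/, e.g. sdb
#   $2 - path to an mdstat file (optional, defaults to /proc/mdstat)
disk_in_md() {
    grep -Eq "(^|[[:space:]])${1}(p?[0-9]+)?\[[0-9]+\]" "${2:-/proc/mdstat}"
}

# Example: refuse to touch a disk that is still part of an array.
#   if disk_in_md sdb; then echo "sdb is in use by md, aborting"; fi
```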
I waited for the rebuild to finish and made sure it completed successfully. Then I checked the state of the array.
# cat /proc/mdstat
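Waiting for the rebuild does not have to mean rereading /proc/mdstat by hand. mdadm has a `--wait` option (for example `mdadm --wait /dev/md0 /dev/md1`) that blocks until resync or recovery on the given arrays has finished. As an alternative, here is a small polling sketch. It assumes that an ongoing rebuild appears in mdstat as a `recovery =` or `resync =` progress line, and `md_rebuild_running` is a hypothetical helper name.

```shell
# Succeed (exit 0) while mdstat shows a resync, recovery or reshape,
# including a queued one reported as "resync=DELAYED".
# md_rebuild_running is a hypothetical helper name.
#   $1 - path to an mdstat file (optional, defaults to /proc/mdstat)
md_rebuild_running() {
    grep -Eq '(resync|recovery|reshape) *=' "${1:-/proc/mdstat}"
}

# Example: poll once a minute until the rebuild is over.
#   while md_rebuild_running; do sleep 60; done
#   cat /proc/mdstat
```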
Finally, I installed the bootloader on both disks.
# dpkg-reconfigure grub-pc
After that I rebooted and made sure that everything worked fine. It is a good idea to switch the boot disk from the first to the second and make sure that the system boots normally from the second one as well. I did not do that, since the downtime was already long enough as it was. The main thing is that the array is in place; fixing the boot, if it comes to that, is a matter of technique.
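`dpkg-reconfigure grub-pc` is interactive. On a BIOS system with grub-pc, the same result is usually achieved by calling `grub-install` for each disk and then `update-grub`. The sketch below assumes exactly that setup and is not suitable for UEFI as written. `run` and `install_bootloader` are hypothetical helper names, and the commands are only printed unless DRY_RUN=0 is set.

```shell
# Print a command, or execute it when DRY_RUN=0 (printing is the default).
run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; else echo "+ $*"; fi
}

# Install GRUB on every disk given, then regenerate the GRUB configuration.
# Stops at the first grub-install that fails.
install_bootloader() {
    for disk in "$@"; do
        run grub-install "$disk" || return 1
    done
    run update-grub
}

# Example (as root): DRY_RUN=0 install_bootloader /dev/sda /dev/sdb
```

The dry-run default is a deliberate choice: it lets you read the exact commands before anything is written to a boot sector.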
That’s all for replacing the disk in the mdadm array. After accessing the server console, it took me about 10 minutes to get the server back up and running.
What’s the difference between software and hardware raid
Let me now explain the fundamental difference between a software raid (mdadm) and a hardware controller, for those who do not fully understand it. If a disk had failed on a hardware raid controller installed in a full-fledged server, the replacement would go as follows:
- The raid controller reports that there is a problem with the disk and takes it out of operation. With a software raid, the system may hang when a drive has problems, before the drive is marked as faulty and access to it is stopped.
- I leave a ticket with support asking to replace the failed disk. I can see information about it in the raid controller management panel.
- The technician can see the failed disk, because its indicator will most likely be blinking red. This is no guarantee that the technician will do everything right, but there is less chance of a mistake. I have encountered a situation where, even so, the wrong disk was replaced.
- When a new drive appears, the raid controller automatically starts rebuilding the array.
If your server already has a spare disk installed in case a disk in the raid array fails, it is even easier:
- When a drive fails, the controller marks it as failed, puts the spare drive into operation and starts the rebuild.
- You get an alert that a drive has failed and leave a ticket with support to replace the failed drive.
And that’s all. In both cases you have no downtime at all. This is the fundamental difference between mdadm and a hardware raid controller. A full-fledged server with a controller and permanent ipmi access to the console costs on average 3 times more than a server with software raid on desktop hardware of similar performance. This holds as long as you need only one CPU and 64G of memory, which is the ceiling for desktop configurations. Beyond that, decide for yourself what is best for you. If a downtime of a few hours to replace a disk or other components is acceptable, you can safely use desktop hardware. Mdadm provides a guarantee of data integrity comparable to that of a hardware controller. It is only a question of downtime and performance. Timely backups add confidence that you will survive a hardware failure.
With a hardware raid on hdd drives, you can get a very significant speed boost from the controller cache. For ssd disks I did not really notice a difference. But this is all by eye; I made no measurements or comparisons. We should also understand that desktop hardware is generally less reliable. For example, at the same selectel, on cheap servers I have run into overheating or very high disk temperatures, hovering around 55-65 degrees. For anything below 60, support brushes you off, saying that this is an acceptable temperature according to the disk documentation. That is true, but we understand that a disk constantly running at 59 degrees is more likely to fail.
Here is another example of the difference in hardware. If a memory module fails in a normal server, the server will simply mark it as failed and take it out of operation. You will see information about this in the management console: ilo, idrac, etc. On desktop hardware the server will just keep hanging, and you will spend a long time figuring out what the problem is, because you have no access to the hardware to run a server test. And if you order one from tech support, there is a nonzero probability that it will get worse: the server will be dropped, the disk cables will be mixed up, etc. In general, this is always a risk. It is easier to move from such hardware to another server right away.
I hope my article was interesting. For those who have never worked with data centers, it will be useful to know what to expect from them. I miss the times when all the servers that I administered were in a server room that nobody else had access to, where I could get to them and check them at any time. It is not like that now. Your servers are no longer yours. They can be broken or dropped, or the data center support staff can mix something up.
Now there is a big trend to move to the clouds. I look at these clouds and do not understand how to interact with them normally. The declared performance is not guaranteed, the load floats during the day. It can go down at any moment and you won’t understand what the problem is at all. Your virtuals may be mistaken
In general, my experience with clouds is negative. I have tried them for websites several times and moved out every time. There is no guaranteed response time, and that is now a ranking factor. For a very fast site there is only one option left, your own hardware, and only for those who can afford it. It depends on the required reliability and allowable downtime.
I bring up the clouds because the trend is to abandon hardware servers and move everything to the clouds. On the one hand, this should be convenient. At the very least, the problems described in this article will not occur. On the other hand, plenty of other problems are added. I am still sitting on hardware of varying quality and cost. What about you?