lags/server freeze

Post by **hurrenson** » Fri Nov 10, 2017 21:53

Hi!

Just wanted to let you know that the server has some issues today: lags/freezes for >10 seconds. No, it's not me.

My guess would be that either the server process freezes or other tasks demand full CPU. It comes at irregular intervals as far as I can tell. The first time I witnessed it was 5h or so ago.

Post by **adminless** » Fri Nov 10, 2017 23:36

http://travaux.ovh.net/?do=details&id=2 ... f983357e33

hey hi there

Yes, I noticed that too at the start of the evening (cue the usual "admin fucked up" quote, hehe, but this time it wasn't me, the admin, who fucked up big time). You may not have noticed, but yesterday morning the servers were down as well, from about 8 to 13, even this one for the web services, which is in a different datacenter. It has nothing to do with me or even with the servers themselves: I checked some time after I saw the drop, and the server was functioning just as usual (i.e. fine).

OK, so as you can see in the link posted, it simply turned out that the company that hosts the servers had some big-time issues yesterday. And I really mean it: in almost a decade of working with them, this is the first time I have witnessed an incident like this myself. Apparently one datacenter (this one) had some buggy network devices that lost their configuration (and therefore the connection), and the other (the one with the actual game servers) blacked out, and I really mean it, it completely lost electricity, which is really bad and can blow up virtually everything. Man, I don't even want to imagine being there. In fact we were very lucky: this server was running again just a couple of hours later and the game server by lunch time, with no further "major" issues other than some drops (apparently another one right now as we speak). Specifically, it says that a team of 30 techs is going to be working on this the whole night, so I can imagine them switching balancers and routers like crazy and bringing up services, apparently with thousands of servers still down. I get goosebumps just thinking about it.

That really speaks very well of them (and of my own infrastructure, which also handled this by itself, i.e. adminless). These guys run half a million servers (only behind Google and Amazon, which are not general-purpose hosting companies), and this is such a large-scale event that I'm sure they will get it sorted out soon. On a positive note, I'm also sure they will use the occasion to do the usual updates, reviews and upgrades of the infrastructure.

I'll keep you updated, thanks for the report, take care

Post by **adminless** » Sun Nov 12, 2017 18:19

Well, apparently I have no real records of this happening yesterday, but it looks like it happened again at the start of the evening, so this time I've sent a record to the tech team. I hope they sort it out soon. If this doesn't really improve over the next days (let's say this week), I'll start considering moving it, at least temporarily, though understand that this is an extremely difficult decision that cannot be taken lightly (and I don't mean in terms of work/cost, well, that too, but mostly technically), so we'll see.

In my opinion (this time I was in fact checking this very issue at that moment), it looked like some kind of IP/network balancing swap or malfunction due to a temporary maintenance job, since I know they're still actively working on the Thursday incident. But the truth, of course, is that I cannot know for sure what's really happening there.

I'll keep this updated, sorry for the inconvenience

Post by **adminless** » Mon Nov 13, 2017 14:59

OK, OK, there's no need to panic. After this was failing almost constantly yesterday evening, I think I have finally diagnosed it. It looks like, following the datacenter incident, this host/network segment is either broken (hardware, cooling, misconfiguration) or simply overloaded/unbalanced, running on backup/spare parts due to the extraordinary emergency situation. However, after some tests it seems that luckily not all hope is lost: the damage from this incident only affects the UnFreeZe1 host (now the former one, 137.74.199.130), and so far the UnFreeZe2 host (now the former one, 51.255.170.73) seems unaffected.

After spending several hours considering the possible solutions, I finally decided to at least start by swapping the hosts, so now the main server runs on the former UnFreeZe2 host and the secondary server, UnFreeZe2, runs on the former UnFreeZe1 host. Of course, until it gets fully loaded I can't mark this as 100% fixed, but I believe this should do it for now. This will effectively render UnFreeZe2 broken, but one of the main reasons for having a secondary server was precisely to serve as a backup host should something like this ever happen, and now it's more evident than ever why that was a very good decision. If it doesn't work, I'll once again consider more drastic measures.

This mostly affects people who connect using the direct host IPs (137.74.199.130, 51.255.170.73), not people using the DNS names (server1.unfreeze.ga, server2.unfreeze.ga). I swapped the DNS names this morning too; however, DNS changes take some time to propagate, so if you use the DNS names it may still take some time for this change to work.
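For anyone scripting their connection, here is a minimal sketch of why the DNS names follow a host swap while hard-coded IPs do not: the name is looked up again each time, so it points at whatever host it is currently assigned to. The `resolve_server` and `has_propagated` helpers and the injectable `resolver` parameter are illustrative assumptions, not part of any game client.

```python
import socket


def resolve_server(name, resolver=socket.gethostbyname):
    """Return the current IPv4 address for a server name.

    `resolver` is injectable so the lookup can be swapped out (for
    example in tests); by default it performs a real DNS lookup.
    """
    return resolver(name)


def has_propagated(name, expected_ip, resolver=socket.gethostbyname):
    """True once `name` already resolves to `expected_ip`.

    While a DNS change is still propagating this may return False,
    because cached records can still point at the old host.
    """
    try:
        return resolve_server(name, resolver) == expected_ip
    except OSError:
        # Lookup failed (no network, unknown name): treat as not propagated.
        return False
```

Calling something like `has_propagated("server1.unfreeze.ga", new_ip)` with the address you expect after the swap would tell you whether your own resolver has picked up the change yet.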

In addition, to work around this issue, prevent further ones and strengthen the infrastructure, I acquired a third IP, 217.182.45.102, and set up some "fail-safe" balancing infrastructure there to make these swaps (and future ones) transparent for you, so please take note of it. I also reused the former server.unfreeze.ga DNS name for this purpose.

And that's it for now. As always, check it out and let me know if you find something broken (still possible, I just finished it this morning) or if you have problems getting it to work.
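The general idea behind a fail-safe front address can be sketched roughly as follows: try the preferred host first and fall back to the other one if it does not answer. This is only an illustration under assumptions; `pick_backend`, the `is_alive` check and the priority order are hypothetical and say nothing about how the setup on 217.182.45.102 is actually implemented.

```python
def pick_backend(backends, is_alive):
    """Return the first backend in priority order that passes the health check.

    backends: list of host addresses, most preferred first.
    is_alive: callable taking an address and returning True/False
              (hypothetical; a real check might query the game port).
    Returns None when no backend is healthy.
    """
    for host in backends:
        if is_alive(host):
            return host
    return None


# Example with a stubbed health check: the first host is treated as down.
hosts = ["137.74.199.130", "51.255.170.73"]
down = {"137.74.199.130"}
chosen = pick_backend(hosts, lambda h: h not in down)
print(chosen)  # prints 51.255.170.73
```

Keeping the health check as a plain callable means the selection logic stays the same whatever probe is used.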

hope this fixes it for now

Post by **adminless** » Thu Nov 16, 2017 14:07

lol, lucky I didn't make it in time to write this post yesterday, because I was going to report that this got "fixed", but unfortunately it's still not.

In short, over the last days I've been in contact with the tech team at the server company. Yesterday, after some back and forth, I received a really awkward message from them reporting that the IT staff at the DC had checked the host for issues, that the tests conducted in the hours before and after came back fine with no problems found, and that most likely all of this was just an ("unknown") problem caused by the electrical failure (something I already knew, by the way, and that doesn't really say much, like for example what exactly failed). Anyway, I don't really care about the words; I care about this getting fixed. What matters is that, coincidentally, after the server was as broken as possible on Tuesday (luckily I made the right decision to swap the servers in time, otherwise you all would have been without a server for about a week already and counting), following that message there were no more problems at all yesterday. Unfortunately, just one day later (today), I can confirm that the server is still damaged (it is failing again). So we'll see; I believe I'll have to keep dealing with this. It is important for two reasons: first, there really is a need for a secondary server, at least at times, and second, and most importantly, to be able to swap again in case the current host has problems in the future (it's always possible, so better to be prepared for that, as I was).

Finally, as mentioned, the main impact of the incident can be considered addressed. After some days with a full server, I think it's safe to say that the main server is working as usual and has no more issues related to this. It was definitely the right decision at the right time (not too late, but not too soon either). Long story short, the secondary is still affected by technical issues that I hope can be sorted out soon; even though it's still running (I'm not going to shut it down again), it remains broken.

I also want to say that I'm very surprised at how quickly most people (almost everybody) caught on to this. The switch was really smooth: literally the next day almost everybody was fragging as usual on the "new" (swapped) host. I had anticipated many more complications and problems for people, so good job there, guys.

As always, when I get something new about this, I'll write it here. See you, and sorry for the inconvenience.

Post by **adminless** » Fri Nov 17, 2017 20:18

Well, finally we have some progress on this. Today, in talks with the tech team, we identified an unexpected malfunction in the server's storage hub (it got damaged in the electrical incident) that caused it to momentarily stop responding from time to time, which blocked the server in the meantime. The repair is planned for tonight at 22:00, so expect some possible service interruption, and let's hope it finally fixes this nasty problem. I will keep monitoring it until Monday or so; if no more problems related to this issue arise over, let's say, the weekend, then we can say that this was the problem with this host. If not, I'll let you know.

Note that this issue only concerns UnFreeZe2; the current UnFreeZe1 server is not affected by any of this and has been functioning "normally" (as usual, without incidents) since last Monday.

Post by **adminless** » Wed Nov 22, 2017 0:01

hello

I'm back with yet another update, this time to report that unfortunately the storage hub suspected of causing the random server hangs was definitely not the main factor at play. Sure, the storage hub was damaged (or so it was reported), but its impact on the server was on the order of milliseconds to tens of milliseconds (and even then only over a couple of minutes, i.e. definitely relevant for a database but totally negligible for our application), not on the order of tens of seconds like the problem at hand.
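To make the difference in scale concrete: a stall of tens of milliseconds and a freeze of tens of seconds are about three orders of magnitude apart. Below is a small sketch that picks out the long gaps from a list of timestamps, for example the times at which a server answered a periodic status query. The function name, the 10-second threshold (taken from the original report of freezes over 10 seconds) and the sample data are illustrative assumptions, not real measurements.

```python
def find_freezes(timestamps, threshold=10.0):
    """Return (start, duration) pairs for gaps longer than `threshold` seconds.

    timestamps: sorted list of times (in seconds) at which the server responded.
    """
    freezes = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > threshold:
            freezes.append((prev, gap))
    return freezes


# Made-up sample: replies about once a second, one reply 30 ms late,
# then a 14-second freeze.
sample = [0.0, 1.0, 2.03, 3.0, 17.0, 18.0]
print(find_freezes(sample))  # prints [(3.0, 14.0)]
```

The 30 ms delay never comes close to the threshold, which is the point: a storage hiccup of that size cannot explain freezes lasting tens of seconds.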

Anyway, the good news is that after this continued to fail over the weekend and I continued to work on it, I finally tracked it down to defective memory management in the server firmware, and I found and deployed methods and improvements in the server architecture to manage this issue and stabilize the server. As a consequence, the server hasn't significantly failed over the last 5 days (whereas last week it was a rare day when it wasn't basically inoperative). The bad news is that this obviously has its limitations (it is not a perfect fix), so it's still possible (and even expected) that it will keep hanging occasionally.

After almost ten days of communication with the tech team of the server company, and without going into details that aren't meaningful here, the relevant thing is that they failed to effectively solve this issue and weren't of much help (if any at all; most, if not all, of the progress came from my side). With this I'm not trying to blame anyone: it is my own duty to ensure that my servers work under any circumstances, failures like this one included, and therefore to be prepared for them, as I effectively was (i.e. dual redundant servers). I'm just stating a fact and giving information. So it looks like a waste of time (for both sides) to continue this way, and I will not pursue any more tech support contacts unless the issue escalates even further (i.e. the server doesn't even boot, or really bad things like that). From here on I'll handle this myself.

In this scenario, this morning I was about to decide to finally move this server to another host immediately (all of this is about UnFreeZe2, by the way, since UnFreeZe1 was already moved). However, I reconsidered it more closely over the day and felt that this is not really what you need me to be working on at the moment. It also has a high chance of getting fixed in the short/medium term (I know for sure that at some point the server company is going to take good care of this issue, i.e. quality department), and it is already almost completely fixed, so spending so many resources wouldn't even be justified. So I decided to leave this as is for now (that is, partially fixed, while of course still doing the normal checks, maintenance etc. as I always do) and to start drawing up clear action guidelines and a fast-acting 24/48-hour response plan for a future crisis scenario (already started and almost finished as we speak), and then put it in service in the short/medium term.

And that's all for now. Sorry for the inconvenience, and thanks for your patience and understanding during this issue. See you.

Post by **adminless** » Fri Jan 12, 2018 16:36

As a final update on this issue: among other things, this last month I've been working on it and have finally tracked it down. They are obviously never going to openly admit something like this, for obvious reasons, and that's why the logical thing is to erroneously assume a server defect. But now, after around two months, I'm pretty confident that the real problem here is an infrastructure overload/imbalance and that there's absolutely nothing wrong with that particular server.

Anyway, what matters is that after this was working on-again, off-again for some time, this last Christmas, once I realized this and changed strategy, I applied some load balancing and scheduling of my own on the server, and so far it hasn't failed at all in the last 10 days. So I believe that unless it fails again (which I don't see happening in the near future), it's fair to consider this finally solved for good.

There's a catch though: as said, this is no perfect solution. The server will go down daily for a 10-minute maintenance at around 7 in the morning CET and, more importantly, it is still expected to saturate on a daily basis. But, and here is the important caveat, only at very late hours, let's say after 2 or 3 in the morning. While that is not perfect at all, I believe that as long as an American community doesn't build up here (which I don't see happening any time soon), it is perfectly fine for the actual use we make of it, so no more measures will be taken for now.
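If you run a script or bot against the server and want it to skip the daily maintenance, a rough sketch of the check could look like this. The exact start time is an assumption (the maintenance is only described as "around 7" CET), and `in_maintenance` and the two constants are hypothetical names for illustration.

```python
from datetime import datetime, time, timedelta

# Assumed values: maintenance starts "around 7" CET and lasts about 10 minutes.
MAINT_START = time(7, 0)
MAINT_LENGTH = timedelta(minutes=10)


def in_maintenance(now_cet):
    """True if `now_cet` (a naive datetime already expressed in CET) is in the window."""
    start = datetime.combine(now_cet.date(), MAINT_START)
    return start <= now_cet < start + MAINT_LENGTH
```

Since the real start is only approximate, padding the window by a few minutes on each side would be the safer choice in practice.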

Once again, note that UnFreeZe1 was moved just two days after this incident broke out and was thus completely unaffected all this time. Being a complete clone (even at the hardware/BIOS level), it is likely that it may end up experiencing similar problems in the future; nevertheless, the same measures (as well as further improvements that came out of this) have already been deployed on UnFreeZe1 too, so this should already be covered.

Also note that this has nothing to do with the usual real-life degradation, misconfiguration or day-to-day problems like latency, sync and so on. This basically saturates: it either works or it doesn't work at all, but when it works it doesn't degrade; it always works as well as possible.

Anyway, that's all here. In addition, in case this breaks even further in the future, I could deploy further 48/72-hour measures (i.e. basically escalate the server), but so far this seems pretty consolidated already and more than good enough.

Post by **adminless** » Fri May 11, 2018 7:16

Hey guys, awesome news here. At last I can report that the previously mentioned re-scheduling/re-design of the infrastructure did its job: the server hasn't failed at all, not even once, in about a week, not even late at night as initially expected (and as was initially happening). So this has finally been addressed 100% and the issue can be considered closed for good. For safety, and as a peace-of-mind measure adminless style, the daily scheduled automatic server maintenance will be left in place, since it doesn't significantly impact the service and it definitely minimizes the risk of future issues. I also optimized and improved it a bit, so now it should take only a few minutes, less than 10.

In real life it is just plain stupid to pretend that things always work exceptionally well and never break or have any problems at all; that's plain delusional. What's important is how well prepared we are (well, the infrastructure, in this case) to confront and react to the hardships that will come, because, as you can see, they definitely will. So getting the full infrastructure back from this has been a very important milestone for us: had that not been the case, had I underestimated the threat or just made a "bad" decision, this would all have been over already. (Though there's no such thing as a "bad" decision; the only bad decision is the one you don't make. Once the results and consequences are known, it's very easy, and everybody is quick, to call a decision good or bad, when beforehand decisions are just decisions: they can be better or worse, more or less reasonable, but they don't have such a "quality".) Luckily, this should prove the current infrastructure to be very solid and resilient for the long term.

so that's it for now, one thing less to worry about

forum.fpsclassico.com
