Logging a server's working state with a script (Tool-Person Training, Part 3)

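Ever since the server was set up, it has rebooted for no apparent reason every now and then, and so far I have not found the cause.

One thing I can say for sure is that the power supply should not be the problem: the TDU socket the server is plugged into is rated for 4000 W and the server's power supply is 2000 W, so there is plenty of headroom.

So the most likely explanation is that the machine goes down because it overheats.

On the one hand, I can read the system log through syslog or journalctl -xe, but at most that tells me whether a reboot was unexpected. It does not tell me why it happened.

On the other hand, I can check the system temperature with sensors, but I can't sit and watch it 24 hours a day.

First, the shell script needs to create the folders and the txt log file: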

# one folder per day, one subfolder per hour, one file per run
data_fold=$(date +%Y-%m-%d)
hour_fold=$(date +%Y-%m-%d_%H)
currentdate=$(date +%Y-%m-%d_%H:%M:%S)
mkdir -p "./user_and_temp/$data_fold/$hour_fold"
txt_path="./user_and_temp/$data_fold/$hour_fold/cpu_$currentdate.txt"
touch "$txt_path"
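The folder and file creation can also be wrapped in a small function. The sketch below is only one way to do it: `make_log_path` and its arguments are names I made up for illustration, and it takes a single timestamp so that the day, hour and file name cannot disagree if the script happens to run right on an hour boundary.

```shell
# Build the log path from ONE timestamp and create the folder and the file.
# make_log_path is a hypothetical helper name, not part of the original script.
#   $1 = base directory
#   $2 = optional timestamp in the form YYYY-MM-DD_HH:MM:SS (defaults to now)
make_log_path() {
    base_dir=$1
    now=${2:-$(date +%Y-%m-%d_%H:%M:%S)}
    day=${now%%_*}      # everything before the first "_"  -> YYYY-MM-DD
    hour=${now%%:*}     # everything before the first ":"  -> YYYY-MM-DD_HH
    dir="$base_dir/$day/$hour"
    mkdir -p "$dir" || return 1
    path="$dir/cpu_$now.txt"
    touch "$path" || return 1
    printf '%s\n' "$path"
}
```

It could then be used as `txt_path=$(make_log_path ./user_and_temp)`.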

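Next, the script reads the CPU temperatures and writes them into the txt file:

# per-core temperatures: skip the chip name, adapter and package lines,
# then keep the number that sits between "+" and "°" on each line
cpu0=$(sensors coretemp-isa-0000 | tail -n +4 | tr -s " " | awk -F'[+°]' '{print $2}')
cpu1=$(sensors coretemp-isa-0001 | tail -n +4 | tr -s " " | awk -F'[+°]' '{print $2}')
# temperatures reported by the kernel, in millidegrees Celsius
temp_zone0=$(cat /sys/class/thermal/thermal_zone0/temp)
temp_zone1=$(cat /sys/class/thermal/thermal_zone1/temp)
echo "cpu temperature" >> "$txt_path"
max0=0.0
for i in $cpu0; do
    # echo $i >> cpu_temp.txt
    # [ -gt ] only compares integers, so the decimal part is dropped first
    if [ "${i%.*}" -gt "${max0%.*}" ]; then
        max0=$i
    fi
done
echo "max0" >> "$txt_path"
echo "$max0" >> "$txt_path"
echo " " >> "$txt_path"
max1=0.0
for j in $cpu1; do
    # echo $j >> cpu_temp.txt
    if [ "${j%.*}" -gt "${max1%.*}" ]; then
        max1=$j
    fi
done
echo "max1" >> "$txt_path"
echo "$max1" >> "$txt_path"
echo " " >> "$txt_path"
echo "package id 0:" >> "$txt_path"
echo "$temp_zone0" >> "$txt_path"
echo "package id 1:" >> "$txt_path"
echo "$temp_zone1" >> "$txt_path"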

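Here cpu0 and cpu1 hold the per-core temperatures reported by sensors, from which the maximum for each CPU is taken, while temp_zone0 and temp_zone1 hold the temperatures recorded in the system's temp files.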
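As a rough alternative, the maximum can be picked out in a single awk pass, which also avoids the integer-only comparison of `[ -gt ]`. This is a sketch that assumes the usual coretemp layout of `sensors` output (lines such as `Core 0:  +42.0°C  (high = ...)`); `max_core_temp` and `millideg_to_c` are illustrative names of my own. The second helper relies on the fact that the `thermal_zone*/temp` files report millidegrees Celsius.

```shell
# Print the highest per-core temperature found in `sensors` output on stdin.
# Only lines starting with "Core <n>:" are considered.
max_core_temp() {
    awk '/^Core [0-9]+:/ {
             t = $3                  # third field, e.g. "+42.0°C"
             sub(/^\+/, "", t)       # drop the leading plus sign
             t += 0                  # numeric conversion ignores the "°C" tail
             if (!seen || t > max) { max = t; seen = 1 }
         }
         END { if (seen) printf "%.1f\n", max }'
}

# Convert a millidegree reading (as found in thermal_zone*/temp) to degrees.
millideg_to_c() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}
```

Typical use would be `sensors coretemp-isa-0000 | max_core_temp` and `millideg_to_c "$(cat /sys/class/thermal/thermal_zone0/temp)"`.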

(To be honest, I have never figured out why sensors only detects 30 cores when our server has 112.)

echo "online users" >> "$txt_path"
w | while read line
do
    echo $line >> "$txt_path"
done

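If the output of w is recorded directly, it ends up as one long run of text with no line breaks, whereas this code records it line by line, which is much more convenient to read.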
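The "one long run of text" effect comes from expanding a captured output without quotes: the shell splits it on whitespace and `echo` joins the pieces with single spaces. A minimal sketch of a line-preserving helper follows; `append_lines` is a made-up name. As far as I can tell, plain `w >> "$txt_path"` would keep the line breaks as well.

```shell
# Append stdin to the file given as $1, one line at a time.
# IFS= and -r keep leading spaces and backslashes exactly as they were read.
append_lines() {
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$1"
    done
}

# For comparison: an unquoted expansion collapses newlines into spaces.
#   out=$(w); echo $out     -> everything on one line
#   out=$(w); echo "$out"   -> line breaks preserved
```

Usage: `w | append_lines "$txt_path"`.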

While we are at it, let's also record how much memory each user is using:

echo "memory usage" >> "$txt_path"
./find_the_bi*ch.sh | tail -n +3 | while read line
do
    echo $line >> "$txt_path"
done

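find_the_bi*ch is a piece of code I found online that totals the memory usage of each user. Reference: "Viewing the memory usage of processes and users on a Linux system", Linux 中國.

However, ps only shows user names up to 8 characters. If a user name is longer than that, it prints the first seven characters and replaces the eighth with +, which does not suit my needs.

My original plan was to find any user whose memory usage exceeds some threshold and kill that user's processes directly by user name. With truncated user names that obviously does not work.

The modified script does meet my needs, though. The modified find_the_bi*ch.sh is as follows: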

#!/bin/bash
stats=""
echo "%   user"
echo "============"
# collect the data; the long custom header widens the user column,
# so long user names are printed in full
for user in `ps -o ruser=userforlongname -e -o %mem | awk '{print $1}' | sort -u`
do
  stats="$stats\n`ps -o ruser=userforlongname -e -o %mem | egrep ^$user | awk 'BEGIN{total=0}; \
    {total += $2};END{print total,$1}'`"
done
# sort data numerically (largest first)
echo -e $stats | grep -v ^$ | sort -rn | head
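The per-user totals can also be computed in one pass instead of calling ps once per user. The following is only a sketch under a few assumptions: `sum_mem_by_user` is a name I made up, and the suggested ps invocation relies on procps-ng accepting a column width (`user:32`) and `--no-headers`.

```shell
# Sum the %MEM column per user and print "<total> <user>", largest first.
# Input on stdin: one "<user> <percent>" pair per line.
sum_mem_by_user() {
    awk 'NF >= 2 { total[$1] += $2 }
         END { for (u in total) printf "%.1f %s\n", total[u], u }' | sort -rn
}

# Possible use on a live system (assumes procps-ng ps):
#   ps -eo user:32,%mem --no-headers | sum_mem_by_user | head
```

Because the grouping is done on the whole user name, a user called `bob` is not mixed up with `bobby`, which a prefix match such as `egrep ^$user` would do.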

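Next, record the number of processes per user:

echo "parallel processes" >> "$txt_path"
# count processes per user (skipping the ps header line), largest count first
ps aux | awk 'NR > 1 {count[$1]++} END {for (u in count) print count[u], u}' | sort -rn | while read line
do
    echo $line >> "$txt_path"
done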

Finally, record the GPU usage:

echo "gpu usage" >> "$txt_path"
# each pair of lines in the nvidia-smi table describes one of the four GPUs
nvidia-smi | awk 'NR==9,NR==10' >> "$txt_path"
nvidia-smi | awk 'NR==13,NR==14' >> "$txt_path"
nvidia-smi | awk 'NR==17,NR==18' >> "$txt_path"
nvidia-smi | awk 'NR==21,NR==22' >> "$txt_path"
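Picking fixed line numbers out of the nvidia-smi table depends on the table layout staying the same. A possibly sturdier variation is to ask nvidia-smi for CSV with `--query-gpu` and format that instead. This is a sketch, not what the original script does; `format_gpu_rows` and `log_gpu_usage` are hypothetical names, and the set of fields queried is my own choice.

```shell
# Turn CSV rows into one compact line per GPU.
# Expected columns: index, utilization %, memory used MiB, memory total MiB, temperature C
format_gpu_rows() {
    awk -F', *' 'NF == 5 {
        printf "gpu%s util=%s%% mem=%s/%sMiB temp=%sC\n", $1, $2, $3, $4, $5
    }'
}

# Query the driver if nvidia-smi is installed; otherwise just note its absence.
log_gpu_usage() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi \
            --query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu \
            --format=csv,noheader,nounits | format_gpu_rows
    else
        echo "nvidia-smi not found"
    fi
}
```

In the logging script this would be called as `log_gpu_usage >> "$txt_path"`.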

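The code above is the version that does not kill any user processes. If you need to kill the processes of users whose memory usage exceeds a certain threshold, replace the code that reads the output of find_the_bi*ch.sh with the following version: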

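pick_mem=30    # threshold, in percent of total memory
./find_the_bi*ch.sh | tail -n +3 | while read line
do
    echo $line >> "$txt_path"
    # skip messages such as "Binary file ... matches"
    if [ "$(echo $line | awk '{print $1}')" != "Binary" ]; then
        first_char=$(echo $line | awk '{print $1}')
        # [ -gt ] only compares integers, so the decimal part is dropped
        if [ "${first_char%.*}" -gt "$pick_mem" ]; then
            user_name=$(echo $line | awk '{print $2}')
            if [ "$user_name" != "root" ]; then
                echo "$user_name"
                killall -u "$user_name"
            fi
        fi
    fi
done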
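Since killing every process of a user is drastic, it may be worth trying the selection logic in a dry run first. The sketch below separates "who is over the threshold" from "what to do about it"; `users_over_threshold`, `kill_heavy_users` and the `DRY_RUN` variable are all names I introduced for illustration.

```shell
# Print the users whose total %MEM is above the threshold in $1, skipping root.
# Input on stdin: "<percent> <user>" lines; non-numeric lines count as 0.
users_over_threshold() {
    awk -v limit="$1" '$1 + 0 > limit + 0 && $2 != "" && $2 != "root" { print $2 }'
}

# With DRY_RUN=1 (the default) only print what would be done;
# with DRY_RUN=0 actually run killall for each selected user.
kill_heavy_users() {
    users_over_threshold "$1" | while IFS= read -r user; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: killall -u $user"
        else
            killall -u "$user"
        fi
    done
}
```

For example, `./find_the_bi*ch.sh | tail -n +3 | kill_heavy_users 30` would only list the candidates until `DRY_RUN=0` is set. The comparison is done in awk, so fractional percentages are handled without truncation.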

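If you need to limit the number of processes per user, edit the limits.conf file and set a process cap for the user:

vim /etc/security/limits.conf

For example,

chauncey_wang hard nproc 32
# @student hard nproc 32
# @faculty hard nproc 64

limits the user chauncey_wang to at most 32 processes. (The commented @student and @faculty lines show the same kind of rule for a group.)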

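Before that, though, you also need to make sure the following line is present in the /etc/pam.d/login file:

session required /lib/security/pam_limits.so

You can check with

cat /etc/pam.d/login

References: "Limiting the number of user processes on Linux"; "Making limits.conf take effect, and the pam_limits.so file".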
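Instead of reading the whole file by eye, the check can be scripted. This is a small sketch assuming the line looks like the one quoted above (possibly with a different path to the module); `has_pam_limits` is a name I made up.

```shell
# Succeed if the PAM file given as $1 contains an active (uncommented)
# "session required ... pam_limits.so" line.
has_pam_limits() {
    grep -Eq '^[[:space:]]*session[[:space:]]+required[[:space:]]+[^#]*pam_limits\.so' "$1"
}
```

For example, `has_pam_limits /etc/pam.d/login && echo "pam_limits is enabled"`. After logging in again as the affected user, running `ulimit -u` in bash shows the process limit that is actually in effect.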

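I also tried using bypy to back the log files up straight to a cloud drive, but for some reason the backup was extremely slow, so in the end I gave up on that plan.

Finally, I have put the code on GitHub. Anyone who needs it is welcome to take it.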

ubuntu_log

ubuntu_log_kill_processes
