xargs is Slow

2019 May 29
# filepaths.txt is a file with thousands of lines
cat filepaths.txt | xargs -n 1 basename

The command above takes a couple of seconds to finish. A file with a few thousand lines is not usually considered a large input, so why is xargs slow here?

After reading a Stack Overflow post, it turns out that xargs in the command above runs basename thousands of times, once per line, and the cost of starting all those processes is what hurts performance.

Can it be faster?

According to man xargs,

xargs reads items from the standard input … delimited by blanks … or newlines and executes the command … followed by items read from standard input. The command line for command is built up until it reaches a system-defined limit (unless the -n and -L options are used). … In general, there will be many fewer invocations of command than there were items in the input.
This will normally have significant performance benefits.

This means xargs can pass a batch of “items” to the command in one invocation. Unfortunately, the -n 1 option in the command above forces xargs to pass only one “item” at a time. To make it fast, drop -n 1 and use basename's -a option, which lets basename handle multiple arguments at once.

time cat filepaths.txt | xargs -n 1 basename > /dev/null 

real    0m2.409s
user    0m0.044s
sys     0m0.332s
time cat filepaths.txt | xargs basename -a > /dev/null 

real    0m0.004s
user    0m0.000s
sys     0m0.000s

About 600 times faster in this run (2.409s vs. 0.004s).
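To see the batching directly, the sketch below counts how many times xargs starts its command, with and without -n 1. The stand-in command sh -c 'echo run' is an assumption for illustration (it is not from the original measurement); it prints one line per invocation, so the line count equals the number of processes started.

```shell
# Each invocation of the stand-in command prints exactly one line,
# so counting lines counts the processes xargs started.
per_item=$(seq 1 1000 | xargs -n 1 sh -c 'echo run' sh | wc -l | tr -d '[:space:]')
batched=$(seq 1 1000 | xargs sh -c 'echo run' sh | wc -l | tr -d '[:space:]')

echo "with -n 1:    $per_item invocations"
echo "without -n 1: $batched invocation(s)"
```

With -n 1 there is one process per input line; without it, 1000 short items normally fit into a single command line.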

--show-limits
cat /dev/null | xargs --show-limits --no-run-if-empty

Your environment variables take up 2027 bytes
POSIX upper limit on argument length (this system): 2093077
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2091050
Size of command buffer we are actually using: 131072
Maximum parallelism (--max-procs must be no greater): 2147483647

It shows that xargs can feed a lot of bytes to the command at once: up to 2091050 bytes on this system, although the command buffer actually in use is 131072 bytes.
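As a rough companion to the limits above, the sketch below reads the system-wide argument limit with getconf and then uses the -s option to shrink the command-line size on purpose. The value 200 is an arbitrary choice for illustration; the exact number of resulting invocations depends on the xargs implementation.

```shell
# System limit on the combined size of arguments and environment, in bytes.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX: $arg_max"

# Cap each command line at 200 bytes with -s. xargs then has to split the
# 1000 items across several echo invocations, each printing one line.
small_batches=$(seq 1 1000 | xargs -s 200 echo | wc -l | tr -d '[:space:]')
echo "echo invocations with -s 200: $small_batches"
```

The smaller the command-line budget, the more often xargs has to start the command, which is the same effect -n 1 has in its most extreme form.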

-P

Some commands can usefully be executed in parallel too; see the -P option.
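A minimal sketch of -P, assuming four made-up file paths and up to four parallel workers. The order in which parallel workers write their output is not guaranteed, so the result is sorted before it is used.

```shell
# Run up to 4 basename processes at once, one path per process.
# The paths are hypothetical and do not need to exist for basename.
names=$(printf '%s\n' /a/one.txt /b/two.txt /c/three.txt /d/four.txt \
  | xargs -n 1 -P 4 basename \
  | sort)
echo "$names"
```

Parallelism only pays off when each invocation does real work; for something as cheap as basename, batching the arguments is the bigger win.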

"#!/usr/bin/env ruby" vs. "#!/usr/bin/ruby"

2016 Dec 22

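The first line of an executable script file contains a shebang, such as #!/bin/sh, which says which program should run the script. Each kind of executable file starts with a corresponding magic number. For executable scripts, the magic number is the two bytes 0x23 0x21, which is #! in ASCII.

Using env in the shebang makes a script more portable. For example, #!/usr/bin/env ruby is more portable than #!/usr/bin/ruby, because ruby is not necessarily installed in /usr/bin/ on a given system; it may be in /usr/local/bin/ or some other directory. With a shebang of the form #!/usr/bin/env ruby, the script can run as long as ruby can be found in the PATH of the user running it.

When rbenv manages Ruby versions, ruby is usually installed under the user's home directory, and rbenv sets the user's PATH so that the required ruby version is found. In that case, #!/usr/bin/env ruby ensures the script finds the right ruby when it runs.

On the vast majority of systems env is installed in /usr/bin/, so the portability of /usr/bin/env itself is not a concern.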
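The PATH lookup can be tried without Ruby installed. The sketch below uses a made-up interpreter named myinterp (hypothetical, created only for this demo) in a temporary directory, and a script whose shebang names it through env. It assumes the temporary directory allows executing files.

```shell
# Hypothetical interpreter "myinterp" placed in a private bin directory.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/bin"
cat > "$tmpdir/bin/myinterp" <<'EOF'
#!/bin/sh
# Stand-in interpreter: report which script it was asked to run.
echo "myinterp running ${1##*/}"
EOF

# A script whose shebang names the interpreter only by command name.
cat > "$tmpdir/demo" <<'EOF'
#!/usr/bin/env myinterp
EOF
chmod +x "$tmpdir/bin/myinterp" "$tmpdir/demo"

# env searches PATH for "myinterp", so prepending the directory is enough.
out=$(PATH="$tmpdir/bin:$PATH" "$tmpdir/demo")
echo "$out"
rm -rf "$tmpdir"
```

This is the same mechanism rbenv relies on: whichever directory comes first in PATH decides which interpreter the script gets.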

References

Shebang (Unix)

man env

env – set environment and execute command, or print environment

The env utility executes another utility after modifying the environment as specified on the command line.

A Failed Attempt at systemd

2016 Dec 13

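OneFeed's push notifications are implemented with Sneakers. In deployment, the Ruby environment is managed by rbenv, and environment variables (API tokens and the like) are managed by rbenv-vars. To avoid “polluting” the root user and to protect the stability of the system, I wanted to create a dedicated user to run the service.

foreman can export a Sneakers project as an Upstart or systemd service. Since Ubuntu 16.04 replaced Upstart with systemd, I tried starting the service with systemd.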

sudo foreman export systemd /etc/systemd/user -a notification-service -u hong

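Because I wanted to start the service as a non-root user, I added the -u option to specify the user when exporting; /etc/systemd/user is the directory that holds systemd user units. The exported .service files contain the line User=hong.

I then tried to start the service,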

su - hong
systemctl --user start notification-service.target

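but it kept failing with the error Failed to connect to bus: No such file or directory. Googling showed that su was the problem. Without su, logging in directly as user hong and then running systemctl made the error go away. However, ps showed no related ruby process, and journalctl -r revealed the following error,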

Failed at step GROUP spawning /bin/bash: Operation not permitted

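This may be a bug in systemctl --user. One workaround is to delete the User=hong line, but then systemctl tries to start the service with the root user's ruby. The root user's ruby does not have the gems the service depends on, so startup fails. Starting the service as root also defeats the original purpose.

At this point the systemd attempt ended in failure, after costing a day or two in total. So dedicated DevOps people really are necessary 🤔.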
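A common cause of the Failed to connect to bus error after su (an assumption here; the original post does not confirm it) is that the su shell lacks the XDG_RUNTIME_DIR variable that a normal login session provides. The sketch below is a quick check for that; the function name check_user_runtime_dir is made up for illustration.

```shell
# Report whether this shell has the per-user runtime directory that
# systemctl --user relies on to reach the user's systemd instance.
check_user_runtime_dir() {
  if [ -n "${XDG_RUNTIME_DIR:-}" ] && [ -d "$XDG_RUNTIME_DIR" ]; then
    echo "XDG_RUNTIME_DIR is set: $XDG_RUNTIME_DIR"
  else
    echo "XDG_RUNTIME_DIR is missing"
  fi
}

check_user_runtime_dir
```

Running it once in a direct login shell and once inside su - hong would show whether the two environments differ in this respect.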

Book Recommendation: The Linux Programming Interface

2016 Nov 5

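The Linux Programming Interface, 💯. On Amazon, 98% of the ratings are five stars and 2% are four stars. Even if you do not do Linux development, reading it through is very helpful for understanding Linux.

Amazon link | Douban link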

The Linux Programming Interface

Don't Put Important Files in /tmp!

2016 Oct 13

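Never put important files in /tmp, not even temporarily, because /tmp is emptied whenever the system reboots.

Once, while testing the OneFeed deployment on an Ubuntu virtual machine, I wanted to record every command involved so that deploying to production later would be easier. Because several Linux users were involved (one without sudo privileges for running the Rails app, another with sudo privileges for installing software), I put the file in /tmp. When the deployment was almost done, I needed to configure VirtualBox Host-only Networks so that the host could reach the app deployed in the VM, and after configuring that the VM had to be rebooted... 😱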

The cleaning of /tmp is done by the upstart script /etc/init/mounted-tmp.conf. The script is run by upstart everytime /tmp is mounted. Practically that means at every boot.

The script does roughly the following: if a file in /tmp is older than $TMPTIME days it will be deleted.

The default value of $TMPTIME is 0, which means every file and directory in /tmp gets deleted. $TMPTIME is an environment variable defined in /etc/default/rcS.
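To check how a given machine is configured, a small helper along these lines could read the setting. This is a sketch: the function name tmptime_setting is made up, and /etc/default/rcS only exists on systems that use the mechanism quoted above.

```shell
# Print the TMPTIME line from an rcS-style defaults file, or explain
# why it could not be found. The path argument is optional.
tmptime_setting() {
  rcs_file=${1:-/etc/default/rcS}
  if [ -r "$rcs_file" ]; then
    grep -E '^TMPTIME=' "$rcs_file" || echo "TMPTIME not set in $rcs_file"
  else
    echo "$rcs_file not found"
  fi
}

tmptime_setting
```

A line such as TMPTIME=0 means everything in /tmp is removed at boot, so notes that must survive a reboot belong somewhere else.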

An Introduction to Linux Shared Libraries

2016 Oct 12

A brief introduction to shared libraries on Linux that I wrote a while ago.