Missing Newline Characters When Concatenating Text Files with cat

2019 Jan 4

cat is often used to concatenate text files into a single file. In most cases it works fine, as below.

$ echo line 1 > file1.txt
$ echo line 2 > file2.txt
$ cat file{1,2}.txt
line 1
line 2

However, if some of the files to be concatenated don’t end with a newline character, cat may not produce the file you expect.

# -n, let echo not add the trailing newline character
$ echo -n line 1 > file1.txt
$ echo line 2 > file2.txt
$ cat file{1,2}.txt
line 1line 2

Note that in the above example, file1.txt doesn’t end with a newline, so when the two files are concatenated there is no newline between them. This may not be what you expect. For example, suppose we have multiple large text files in which every line is a user ID, and we want to concatenate them into one file to feed to a processing program in one go. If some of the files don’t end with a newline, cat may produce invalid user IDs like user-id-foouser-id-bar. With a large input, such broken IDs are unlikely to be spotted by eye.

If the newlines between files matter in your case, awk is safer. awk 1 prints every input line followed by a newline, so a missing final newline is added.

# -n, let echo not add the trailing newline character
$ echo -n line 1 > file1.txt
$ echo line 2 > file2.txt
$ awk 1 file{1,2}.txt
line 1
line 2

See this SO answer.

Also, it’s a good idea to configure your text editor to always show non-printable characters such as the newline. Or use cat -e, which shows non-printing characters and prints a $ at the end of each line.

$ cat -e file1.txt | tail -1
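To check a batch of files before concatenating them, a small shell helper can flag the ones that lack a trailing newline. The following is only a sketch: missing_final_newline is a hypothetical helper name, the file names are the ones from the example above, and the check relies on command substitution stripping trailing newlines.

```shell
# Hypothetical helper: succeeds if the file's last byte is not a newline.
# $(...) strips trailing newlines, so the result is non-empty only when
# the last byte is something else. Empty files count as fine.
missing_final_newline() {
  [ -n "$(tail -c 1 "$1")" ]
}

workdir=$(mktemp -d)
printf 'line 1' > "$workdir/file1.txt"      # no trailing newline
printf 'line 2\n' > "$workdir/file2.txt"    # ends with a newline

flagged=""
for f in "$workdir"/file*.txt; do
  if missing_final_newline "$f"; then
    flagged="$flagged ${f##*/}"
  fi
done
echo "missing trailing newline:$flagged"
```

Only file1.txt should be reported here; files flagged this way are the ones that cat would glue to the next file.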

CLI pitfall

grep Command Examples

2019 Jan 1

Stop after first match
Print only filename if match
Find unmatched files
Show line number of matched lines
Don’t output filename when grep multiple files
Search in “binary” files
Search in directories
Ignore case when search
The pattern to search begins with - (hyphen)
Use pattern file
Print only count of matching lines

First, grep --help lists most of grep’s options, and is the go-to reference for most grep questions.

Like most CLI tools, grep lets you combine options. For example, -io is the same as -i -o, and -A3 is the same as -A 3. Also, the options can appear anywhere in the command.

$ grep hello a.txt -i --color

Stop after first match

$ grep -m 1 search-word file

-m, --max-count=NUM stop after NUM matches

Print only the 1000th match. (If there are fewer than 1000 matches, this prints the last one.)

$ grep -m1000 search-word file | tail -n1

Print only filename if match

$ grep -l search-word *.txt

-l, --files-with-matches print only names of FILEs containing matches

It’s useful when you grep lots of files and only care about the names of the files that match.

Find unmatched files

-L, --files-without-match print only names of FILEs containing no match

-L is the opposite of the -l option. It outputs the files that don’t contain the search word.

$ grep -L search-word *.txt

Show line number of matched lines

$ grep -n search-word file

-n, --line-number print line number with output lines

Don’t output filename when grep multiple files

When grepping multiple files, the filename is included in the output by default. For example:

$ grep hello *.txt
a.txt:hello
b.txt:hello

Use -h to not output filenames.

$ grep -h hello *.txt
hello
hello

-h, --no-filename suppress the file name prefix on output

Search in “binary” files

Sometimes a text file contains a few non-printable characters, which makes grep treat it as a “binary” file. grep doesn’t print the matching lines of a “binary” file.

$ printf "hello\000" > test.txt
$ grep hello test.txt 
Binary file test.txt matches

Use -a to let grep know the file should be seen as a “text” file.

$ grep -a hello test.txt 
hello

-a, --text equivalent to --binary-files=text

Search in directories

-r, --recursive like --directories=recurse

-R, --dereference-recursive likewise, but follow all symlinks

Without a directory argument, grep searches the current working directory by default.

$ grep -R hello
b.md:hello
a.txt:hello

Specify directories.

$ grep -R hello tmp/ tmp2/
tmp/b.md:hello
tmp/a.txt:hello
tmp2/b.md:hello
tmp2/a.txt:hello

--include=FILE_PATTERN search only files that match FILE_PATTERN

Use --include to tell grep the pattern of the filenames you’re interested in.

$ grep -R hello --include="*.md"
b.md:hello

Ignore case when search

-i, --ignore-case ignore case distinctions

$ grep -i Hello a.txt 
hello
HELLO

The pattern to search begins with - (hyphen)

$ grep -- -hello a.txt
-hello

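The -- marks the end of the options, so whatever follows is treated as the pattern even if it starts with a hyphen. For example, to find out what the -L option does: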

$ grep --help | grep -- -L
  -L, --files-without-match  print only names of FILEs containing no match

Use pattern file

-f FILE, --file=FILE Obtain patterns from FILE, one per line. If this option is used multiple times or is combined with the -e (--regexp) option, search for all patterns given. The empty file contains zero patterns, and therefore matches nothing.

$ cat test.txt
111
222
333

$ cat patterns.txt
111
333

$ grep -f patterns.txt test.txt
111
333

NOTE: Do not put an empty line in the pattern file. An empty line is an empty pattern, and an empty pattern matches every line, so the pattern file would match everything. It’s an easy mistake to leave empty lines at the end of the pattern file.
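The effect of a stray empty line is easy to reproduce. The sketch below reuses the test.txt and patterns.txt contents from above inside a temporary directory, with one empty line added to the pattern file; patterns.clean.txt is a made-up name for the cleaned copy.

```shell
workdir=$(mktemp -d)
printf '111\n222\n333\n' > "$workdir/test.txt"
printf '111\n\n333\n' > "$workdir/patterns.txt"   # note the empty second line

# The empty pattern matches every line, so 222 is printed as well.
with_empty=$(grep -f "$workdir/patterns.txt" "$workdir/test.txt")
echo "$with_empty"

# Drop empty lines from the pattern file before using it.
grep -v '^$' "$workdir/patterns.txt" > "$workdir/patterns.clean.txt"
cleaned=$(grep -f "$workdir/patterns.clean.txt" "$workdir/test.txt")
echo "$cleaned"
```

Filtering the pattern file with grep -v '^$' first is one simple safeguard when the file is generated or edited by hand.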

Print only count of matching lines

Use -c or --count to print only the count of matching lines. For example, the command below counts the lines containing the <OrderLine> tag in each file under the current directory.

$ grep "<OrderLine>" -c -R .

The output looks like this:

./order-1.xml:3
./order-2.xml:9
./order-3.xml:1

To sort the output by count, sort numerically on the second field.

$ grep "<OrderLine>" -c -R . | sort -t : -k 2 -n

./order-3.xml:1
./order-1.xml:3
./order-2.xml:9

CLI CLI examples

Daily Dev Log: Find Lines in One File but Not in Another

2018 Dec 12

We can use comm to find lines that are in one file but not in another. (comm expects both files to be sorted.)

# find lines only in file-a
comm -23 file-a file-b

From comm --help,

-1 suppress column 1 (lines unique to FILE1)

-2 suppress column 2 (lines unique to FILE2)

-3 suppress column 3 (lines that appear in both files)

So, to find lines that exist in both file-a and file-b:

comm -12 file-a file-b
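Since comm compares sorted input, unsorted files need to be sorted first. The sketch below uses made-up file contents in a temporary directory and writes sorted copies (the .sorted names are just for illustration) before calling comm.

```shell
workdir=$(mktemp -d)
printf 'c\na\nb\n' > "$workdir/file-a"   # unsorted on purpose
printf 'b\nd\n' > "$workdir/file-b"

# comm expects sorted input, so sort each file first.
sort "$workdir/file-a" > "$workdir/file-a.sorted"
sort "$workdir/file-b" > "$workdir/file-b.sorted"

# lines only in file-a
only_in_a=$(comm -23 "$workdir/file-a.sorted" "$workdir/file-b.sorted")
# lines in both files
in_both=$(comm -12 "$workdir/file-a.sorted" "$workdir/file-b.sorted")

echo "only in file-a: $(echo $only_in_a)"
echo "in both: $in_both"
```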

Google keywords: “linux command two file not contain” hit link

CLI dev log

Inject a Method Interceptor in Guice

2018 Sep 4

I recently made the mistake of creating an object (a MethodInterceptor) with plain new in a Guice configuration, which left the object’s @Inject-annotated fields, such as Logger, set to null.

public class FooInterceptor implements MethodInterceptor {
  @Inject
  private Logger logger;

  @Override
  public Object invoke(MethodInvocation invocation) throws Throwable {
    // throws NullPointerException if logger was not injected
    logger.info("start to invoke");
    return invocation.proceed();
  }
}

public class BarModule extends AbstractModule {
  @Override
  protected void configure() {
    bindInterceptor(Matchers.subclassesOf(OrderApi.class),
      Matchers.any(),
      // the interceptor is created with plain new, so Guice never injects its fields
      new FooInterceptor());
  }
}

Solution

Guice wiki clearly states that requestInjection() should be used for injection of a “method interceptor”.

How do I inject a method interceptor?

In order to inject dependencies in an AOP MethodInterceptor, use requestInjection() alongside the standard bindInterceptor() call.

public class NotOnWeekendsModule extends AbstractModule {
  protected void configure() {
    MethodInterceptor interceptor = new WeekendBlocker();
    // for injection of a "method interceptor"
    requestInjection(interceptor);
    bindInterceptor(any(), annotatedWith(NotOnWeekends.class), interceptor);
  }
}

Some Thoughts

Once you decide to use a dependency injection framework, like Guice here, please DO NOT create Java objects with plain new any more. Otherwise, it is easy to end up with objects whose dependencies are not set up correctly.

Secondly, use constructor injection for mandatory dependencies, like Logger here. With constructor injection it’s impossible to forget a dependency, even when the object is constructed with plain new. (However, injecting too many dependencies via the constructor makes the constructor look a bit ugly.)

java guice

Browsers Ignore Change in Hosts File

2018 Aug 2

I modified C:\Windows\System32\drivers\etc\hosts for a local test. However, the browser did not respect the change in the hosts file. Eventually I found that this was due to the proxy settings on my machine. The change in hosts took effect once I unchecked all proxy configuration in Control Panel -> Internet Options -> Connections -> LAN settings.

If connections in your network are forced through a proxy (for example, in a corporate network), you can try to make the proxy bypass some domains by adding them to LAN settings -> Proxy server -> Advanced -> Exceptions.

dev tips

OneFeed is Serving HTTPS

2018 Jul 20

OneFeed is now served over HTTPS. Its certificate is issued by Let’s Encrypt, “a free, automated, and open Certificate Authority”. With the ACME (Automatic Certificate Management Environment) protocol and an ACME client, requesting and renewing the site certificate is done automatically.

The ACME protocol defines several challenges that a client can use to prove that it (the host running the client) controls the domain. The protocol also defines how to request, renew, and revoke certificates. Because the interaction between the ACME server (the CA) and the client (your site) is clearly defined, the whole process can be automated. Certbot is a recommended ACME client.

To set up HTTPS on this site, I used this great post as a reference. The basic steps are:

use a certbot docker image to get certificate from Let’s Encrypt for the first time.
update configuration of web server using the certificate.
set up a cron job to auto-renew the certificate.

Use Docker Images to Get Certificate

Certbot is in active development. Using the certbot docker image (the latest image by default) means we don’t have to bother with updating certbot ourselves. A separate nginx docker image serves as a basic web server to fulfill the ACME challenges, so the production web server’s configuration stays untouched while the certificate is requested. (Since OneFeed already runs in docker containers, using docker/docker-compose was an easy decision.) The containers used in this step are discarded as soon as the certificate has been fetched for the first time.

Update Configuration of the Production Web Server

Just google how to set up HTTPS on your web server. For OneFeed, HTTPS is set up on an nginx server.

Set Up a Cron Job to Renew and Reload/Restart Web Server

The reference post uses docker kill --signal=HUP production-nginx-container to signal the nginx process in the nginx container to reload. However, OneFeed doesn’t use a plain nginx container but passenger-docker, so it uses docker-compose restart to pick up the renewed certificate instead.

0 23 * * * docker run --rm --name certbot-renew \
-v /CERTBOT_VOLUME/etc/letsencrypt:/etc/letsencrypt \
-v /CERTBOT_VOLUME/var/lib/letsencrypt:/var/lib/letsencrypt \
-v /CERTBOT_VOLUME/data/letsencrypt:/data/letsencrypt \
-v /CERTBOT_VOLUME/var/log/letsencrypt:/var/log/letsencrypt \
certbot/certbot renew --webroot -w /data/letsencrypt --quiet \
&& cd YOUR_DOCKER_COMPOSE_WORKING_DIR \
&& docker-compose restart

https onefeed

X-Forwarded-For, Forwarded, X-Real-IP and Nginx

2018 May 22

X-Forwarded-For

The HTTP header X-Forwarded-For can be used to get the IP address of the real client, especially in a network with proxies and load balancers.

The X-Forwarded-For (XFF) header is a de-facto standard header for identifying the originating IP address of a client connecting to a web server through an HTTP proxy or a load balancer. When traffic is intercepted between clients and servers, server access logs contain the IP address of the proxy or load balancer only. To see the original IP address of the client, the X-Forwarded-For request header is used.

The syntax is,

X-Forwarded-For: <client>, <proxy1>, <proxy2>
X-Forwarded-For: 203.0.113.195, 70.41.3.18, 150.172.238.178

When an HTTP request passes through a proxy, the proxy appends the address of the peer it received the request from to the X-Forwarded-For header (if it supports this header).
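As a rough illustration of the syntax, the sketch below splits the sample header value from above on commas. The variable names are made up, and the leftmost entry should be treated as a claim rather than a fact, since a client can send its own X-Forwarded-For header.

```shell
# Sample header value from the syntax example above.
xff='203.0.113.195, 70.41.3.18, 150.172.238.178'

# leftmost entry: the address claimed for the original client
client=$(printf '%s\n' "$xff" | cut -d , -f 1 | tr -d ' ')

# rightmost entry: the address appended most recently
rightmost=$(printf '%s\n' "$xff" | awk -F ', *' '{ print $NF }')

echo "client=$client rightmost=$rightmost"
```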

Forwarded

However, since X- headers are not recommended anymore,

Custom proprietary headers can be added using the ‘X-‘ prefix, but this convention was deprecated in June 2012.

a standardized and enhanced header, Forwarded, is introduced.

# the original request is from 192.0.2.60, and passed through proxy 203.0.113.43
Forwarded: for=192.0.2.60; proto=http; by=203.0.113.43

# a client can also append an obfuscated identifier such as "secret" here; the server can
# then use it to validate the integrity of the client.
Forwarded: for=23.45.67.89;secret=egah2CGj55fSJFs, for=10.1.2.3

X-Real-IP

Another somewhat related header is X-Real-IP, which contains a single IP. You may find it, for example, in the Nginx docs (ngx_http_proxy_module doc, ngx_http_realip_module doc).

nginx http

Installing tmux on Windows

2018 Apr 11

Git BASH on Windows provides most of the common Linux command-line tools, such as grep and sed, but it does not include tmux. Git for Windows does, however, have package management support:

Git for Windows is based on MSYS2 which bundles Arch Linux’ Pacman tool for dependency management.

With pacman, Git for Windows can install additional command-line tools such as tmux. In Git BASH, however, pacman is not available by default:

This is intended. We do not ship pacman with Git for Windows. If you are interested in a fully fledged package manager maintained environment you have to give the Git for Windows SDK a try.

To get pacman, install the Git for Windows SDK. Once it is installed, open Git SDK (a terminal emulator, just like Git Bash) and search for tmux:

$ pacman -Ss tmux

This finds two packages:

msys/tmux 2.6-1
	A terminal multiplexer
msys/tmux-git 2.5.94.g73b9328c-1 
	A terminal multiplexer

$ pacman -S msys/tmux-git

The installation may fail with the following error:

$ pacman -S msys/tmux
warning: database file for 'git-for-windows-mingw32' does not exist
error: failed to prepare transaction (could not find database)

Open /etc/pacman.conf and comment out the following lines:

#[git-for-windows]
#Server = https://wingit.blob.core.windows.net/x86-64

#[git-for-windows-mingw32]
#Server = https://wingit.blob.core.windows.net/i686

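After the installation, tmux can be used on Windows (in Git SDK).

For pacman usage, see the Git for Windows wiki.

Environment: Windows 10

(If some programs, such as ssh, start reporting errors, try upgrading all packages with pacman -Syu.)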

tmux CLI

Removing Elements from the List Returned by Arrays.asList Throws an Exception

2018 Feb 5

Calling remove()/removeAll() on the List returned by Arrays.asList(...) throws UnsupportedOperationException. According to the javadoc of Arrays.asList(...), the List implementation it returns is backed by an array.

Returns a fixed-size list backed by the specified array. (Changes to the returned list “write through” to the array.)

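This List implementation is a private static class of Arrays, and nearly all of its methods simply delegate to an internal array member E[]. Arrays don’t support removal, so remove() throws an exception.

In fact, it’s best to avoid removing elements from any array-backed List implementation. ArrayList does support remove(), but its implementation shifts the elements of the internal array by copying them, which hurts performance.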

java