Sep 24, 2010
How I query Apache logs from the commandline

When I build services or write online I want to get feedback. Did anybody use them? Read my latest post? When my site has just a little traffic, Google Analytics summarizes too much. I want to be able to get my hands dirty with the data, to be able to drill down to individual user sessions to see how people interact with my site. How many real users did I have yesterday? Did somebody link to my latest blog post? How many people clicked on that link on Hacker News? Did any of them stick around and browse to other pages?

After several attempts at hard-coded scripts to answer such questions, I came up with a little collection of scripts that can be composed using pipes. Here's an example session on my commandline:

How many uniques did I get yesterday?

$ cat_logs access.log | dump_field IP | sort | uniq | wc -l

Focus on real human beings.

$ cat_logs access.log | skip_crawlers | dump_field IP | sort | uniq

Wow, 15 IPs? Did they stay long?

$ cat_logs access.log | skip_crawlers | dump_field IP | sort | freq
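freq is not a standard Unix tool, and its source isn't shown here. A minimal stand-in, assuming it simply counts how often each input line occurs and lists the most frequent first, could look like this (the function body is a guess, not the actual script):

```shell
# Hypothetical stand-in for freq: count occurrences of each input line,
# most frequent first. Sorts its own input, so pre-sorting is optional.
freq() {
  sort | uniq -c | sort -rn
}

# Made-up IPs for illustration.
printf '%s\n' 203.0.113.5 198.51.100.7 203.0.113.5 203.0.113.5 | freq
```

Each output line is a count followed by the value, so an IP with a count above 1 is a visitor who requested more than one page.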

Hmm, so 4 users browsed several pages. Where are they coming from?

$ cat_logs access.log | skip_crawlers | dump_field REFERER

Ah, they're all coming from http://news.ycombinator.com/item?id=1702108. So people actually clicked on that comment of mine, even though there were no votes or responses. Interesting.

This one person viewed 10 pages. What did they see?

$ cat_logs access.log | filter_field IP xxx.xx.xxx.xxx

So they visited twice yesterday, once in the morning and once late at night. And clicked through to different sites each time.

You get the idea. It's just a bunch of shell scripts that read and write YAML. And I can string them together using shell pipes rather than writing one-off scripts like I used to.
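To make "read and write YAML" concrete, here is a rough sketch of the idea, not the actual scripts. It assumes Apache's Combined Log Format as input and one YAML document per request as output. The field names IP and REFERER come from the session above; REQUEST and AGENT, the record layout, and the function names to_records and get_field are all assumptions made for illustration.

```shell
# Hypothetical: turn Combined Log Format lines into small YAML records.
# Splitting on double quotes puts the request in $2, the referer in $4
# and the user-agent in $6; the IP is the first word before the first quote.
to_records() {
  awk -F'"' '{
    split($1, head, " ")
    print "---"
    print "IP: " head[1]
    print "REQUEST: " $2
    print "REFERER: " $4
    print "AGENT: " $6
  }'
}

# Hypothetical: print the value of one field from every record.
get_field() {
  awk -v key="$1: " 'index($0, key) == 1 { print substr($0, length(key) + 1) }'
}

# Made-up log line for illustration.
printf '%s\n' '203.0.113.5 - - [23/Sep/2010:09:15:02 -0700] "GET /blog/post HTTP/1.1" 200 5120 "http://news.ycombinator.com/item?id=1702108" "Mozilla/5.0 (X11; Linux)"' |
  to_records | get_field REFERER
```

Because every stage between the parser and the final field dump speaks the same record format, filters can be added or removed in the middle of a pipe without touching the other stages.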

Once I put these scripts together I found that most queries had a similar shape: parse Apache log files using cat_logs, filter out bot user-agents with skip_crawlers, filter or group by certain fields for a few stages, leave YAML using dump_field, and finish with non-YAML summarization: de-duping (uniq), counting (wc -l), or a frequency distribution (freq).
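For readers without the scripts at hand, the same parse, filter, extract, summarize shape can be approximated with standard tools alone. This is only a sketch under assumptions: it works directly on Combined Log Format lines rather than YAML, the bot pattern is a crude guess and not the list skip_crawlers uses, and sample_log, drop_bots, ip_of and referer_of are names invented for this example.

```shell
# Made-up sample traffic in Apache's Combined Log Format.
sample_log() {
  cat <<'EOF'
203.0.113.5 - - [23/Sep/2010:09:15:02 -0700] "GET /blog/post HTTP/1.1" 200 5120 "http://news.ycombinator.com/item?id=1702108" "Mozilla/5.0 (X11; Linux)"
203.0.113.5 - - [23/Sep/2010:09:16:40 -0700] "GET /projects HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux)"
198.51.100.7 - - [23/Sep/2010:22:41:13 -0700] "GET /blog/post HTTP/1.1" 200 5120 "http://news.ycombinator.com/item?id=1702108" "Mozilla/5.0 (Macintosh)"
192.0.2.9 - - [23/Sep/2010:03:02:55 -0700] "GET /robots.txt HTTP/1.1" 200 68 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF
}

# Filter stage: drop requests whose user-agent (6th quote-delimited field)
# looks like a crawler. The pattern is an assumption, far from exhaustive.
drop_bots() {
  awk -F'"' 'tolower($6) !~ /bot|crawler|spider/'
}

# Extract stages: leave the structured format behind, one value per line.
ip_of()      { awk '{ print $1 }'; }
referer_of() { awk -F'"' '{ print $4 }'; }

# Summarize stages: count unique human IPs, then tally referers.
sample_log | drop_bots | ip_of | sort | uniq | wc -l
sample_log | drop_bots | referer_of | sort | uniq -c
```

Swapping the last two stages of a pipe changes the question being asked while the parsing and filtering stay the same, which is the point of composing small scripts instead of writing one-off ones.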

Try it out and tell me what you think: Yam.

Comments gratefully appreciated. Please send them to me by any method of your choice and I'll include them here.
