Tag Archives: Perl

UTF-8 Round Trip for Perl, MySQL and node.js

Doing things the un*x way means keeping a supply of “go-to” tools for the various tasks that spring up during development. For me, that’s a lot of bash, Python and Perl on the dev machine and, recently, node.js on the server.

While acquiring multilingual (Unicode) data for a project, I had to make sure I kept the correct UTF-8 encoding all along the processing pipeline. One screw-up and it’s garbage out. Here are some tips:

Continue reading

Playing Perl: Counting Occurrences

Ever have a list of phrases and wonder which individual words appear the most? Me too! Here’s a handy Perl command that will get the job done:

perl -F'\t' -lane '$w{$_}++ for split " ", $F[0]; END { print "$_\t$w{$_}" for sort { $w{$b} <=> $w{$a} } keys %w }' \
  < INPUT_FILE > OUTPUT_FILE

This assumes:

  • each line of the input is a TAB-separated list of fields
  • the first field is the one containing our keyword phrase

Good Luck!