Lecture 8: Everyday Tools for Manipulating Data

(Regular Expressions and Pivot Tables)

Introduction

Manipulating and reformatting text is often a key component (and limiting factor) of many analyses. This lecture focuses on adding two widely used tools to your informatics arsenal: regular expressions and pivot tables. Regular expressions, or pattern matching, allow for complex reformatting of text, while pivot tables (or cross tabulation) allow for rapid generation of summary statistics. This lecture won’t make you an expert in either, but hopefully once you know what can be done, you can begin to use them to address your own data crunching needs. This lecture grew out of thinking about the tools that I use every day and which ones I am likely to continue using during the third and fourth years of medical school.

What are Regular Expressions?

Regular expressions are a “language” for searching for character patterns within text. Depending upon the program, matching characters (or lines containing the matching characters) can be printed out or modified. You have already used a simple relative of regular expressions whenever you have used an asterisk as a wildcard to represent any characters in a text search.


PROGRAMS USING REGULAR EXPRESSIONS

A slew of programs and web sites use regular expressions. Unix programs include grep (Unix file line search tool), ex (Unix line editor), vi (Unix visual editor), emacs (text editor), and sed (Unix stream editor; like grep, but it can also modify text). Many languages, such as awk (pattern scanning and processing language) and Perl (general programming language), have regular expressions at their core.

(Deeper than you would ever want to go into RE http://www.lib.uchicago.edu/keith/tcl-course/topics/regexp.html )

(Book: Mastering Regular Expressions. J Friedl http://www.oreilly.com/catalog/regex/ )

I. Regular Expression Overview (Extended Unix)

(Basic and Extended Regular Expressions)

(manual page: http://www.math.grin.edu/~rebelsky/References/ManPages/regexp.html )

Normal Characters: a-z A-Z 0-9 = ; :

abc => would match abc and nothing else

evan => would match evan in a text line and nothing else

Special characters: * . [ ] ? + { } ( ) | ^ $ / \ -

The asterisk (*) and plus (+)

* matches zero or more occurrences of the preceding regular expression

+ matches one or more occurrences of preceding regular expression

ab*c matches: ac abc abbbbbbc

ab+c matches: abc abbbbbbc

? the Question Mark

Matches zero or one occurrences of preceding regular expression

ab?c matches: ac or abc

{} Curly Brackets

Can be used to specifically designate number of matching characters

a{4,10} matches: aaaa aaaaaaaa up to aaaaaaaaaa but not aaa

t{4,} matches 4 or more ts
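
The quantifiers above (*, +, ? and {}) can be tried out quickly in any language with a regular expression library. Here is a minimal sketch using Python's re module as a stand-in for the Unix tools; the helper name matches_whole and the test strings are made up for illustration, and Python's syntax is close to, but not identical to, extended Unix regular expressions.

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# * : zero or more b's between a and c
star = [s for s in ["ac", "abc", "abbbbbbc", "adc"] if matches_whole(r"ab*c", s)]

# + : one or more b's, so "ac" no longer matches
plus = [s for s in ["ac", "abc", "abbbbbbc"] if matches_whole(r"ab+c", s)]

# ? : zero or one b, so "abbc" does not match
optional = [s for s in ["ac", "abc", "abbc"] if matches_whole(r"ab?c", s)]

# {4,10} : between 4 and 10 a's; {4,} : 4 or more t's
bounded = [n for n in range(1, 13) if matches_whole(r"a{4,10}", "a" * n)]
open_ended = [n for n in range(1, 8) if matches_whole(r"t{4,}", "t" * n)]

print(star)        # ['ac', 'abc', 'abbbbbbc']
print(plus)        # ['abc', 'abbbbbbc']
print(optional)    # ['ac', 'abc']
print(bounded)     # [4, 5, 6, 7, 8, 9, 10]
print(open_ended)  # [4, 5, 6, 7]
```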

\ the backslash

Hint: if a pattern doesn’t work as expected add backslashes to the problem area

1. Used to designate that a special character should be treated as normal.

a\*b matches a*b

end\. matches end.

4 \+ 3 matches 4 + 3

2. Used to designate a normal character as special

\t => Tab

\n => new line

\ followed by a space => a literal space (such as in a file name on the Unix command line)
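
Both uses of the backslash can be checked with a short sketch. This again assumes Python's re module rather than a Unix tool, and the sample strings are invented. The last lines use re.escape, a Python convenience that adds the backslashes for you, which is handy when "add backslashes to the problem area" gets tedious.

```python
import re

# 1. A backslash makes a special character ordinary.
literal_star = re.search(r"a\*b", "x a*b y").group()
literal_dot = re.search(r"end\.", "The end.").group()
literal_plus = re.search(r"4 \+ 3", "4 + 3 = 7").group()

# Without the backslash, the period matches any character.
unescaped = re.search(r"end.", "endless").group()

# 2. A backslash makes an ordinary character special.
fields = re.split(r"\t", "chr7\t1000\t2000")      # split on tabs
lines = re.split(r"\n", "line one\nline two")     # split on new lines

# re.escape builds a pattern that matches the text literally.
escaped = re.escape("4 + 3")
matches_literally = re.fullmatch(escaped, "4 + 3") is not None

print(literal_star, literal_dot, literal_plus)  # a*b end. 4 + 3
print(unescaped)                                # endl
print(fields)                                   # ['chr7', '1000', '2000']
print(lines)                                    # ['line one', 'line two']
print(matches_literally)                        # True
```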

The circumflex (^) and the dollar sign ($)

Forces match to the beginning (^) or to the end ($) of a line

^Dear matches: Dear (only at beginning of line)

\.$ matches: a period (only at end of a line)
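
A quick way to see the anchors at work is to filter a handful of lines. This is a small sketch assuming Python's re module; the sample lines are made up.

```python
import re

sample_lines = ["Dear John,", "My Dear friend", "The end.", "end. Or not"]

# ^ pins the match to the start of the line.
starts_with_dear = [line for line in sample_lines if re.search(r"^Dear", line)]

# $ pins the match to the end of the line; \. is a literal period.
ends_with_period = [line for line in sample_lines if re.search(r"\.$", line)]

print(starts_with_dear)  # ['Dear John,']
print(ends_with_period)  # ['The end.']
```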

The period (.)

Matches any single character

a.c matches: abc aec a:c a=c

a..c matches: abbc axvc a=Zc
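
Each period stands for exactly one character, no more and no less. A minimal sketch (Python's re module assumed, candidate strings invented) that makes the counting explicit:

```python
import re

# a.c needs exactly one character between a and c.
one_dot = [s for s in ["abc", "aec", "a:c", "a=c", "ac", "abbc"]
           if re.fullmatch(r"a.c", s)]

# a..c needs exactly two characters between a and c, and must end in c.
two_dots = [s for s in ["abbc", "axvc", "a=Zc", "a=Zx", "abc"]
            if re.fullmatch(r"a..c", s)]

print(one_dot)   # ['abc', 'aec', 'a:c', 'a=c']
print(two_dots)  # ['abbc', 'axvc', 'a=Zc']
```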

[] The Square brackets

Defines a set or class of characters

*, $ lose their special meaning (ordinary characters)

^ at the start means NOT (match anything not in this class) (sometimes ! is used instead)

- (the dash) is used to create a range of characters

EXAMPLES:

[aeiouAEIOU] match any vowel

[a-z] match any lower case letter

[^a-z] match any character that is NOT a lowercase letter

[0-9] match any digit
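
The character classes above can be explored by pulling every match out of a sample string. This sketch assumes Python's re module; the two sample strings are made up.

```python
import re

text = "Chromosome 22 has 49 Mb"

vowels = re.findall(r"[aeiouAEIOU]", text)      # any single vowel
lower_runs = re.findall(r"[a-z]+", text)        # runs of lowercase letters
not_lower = re.findall(r"[^a-z]", text)         # anything that is NOT a-z
numbers = re.findall(r"[0-9]+", text)           # runs of digits

# Inside brackets, * and $ are ordinary characters.
symbols = re.findall(r"[*$]", "cost: $5 * 2")

print(vowels)      # ['o', 'o', 'o', 'e', 'a']
print(lower_runs)  # ['hromosome', 'has', 'b']
print(not_lower)   # ['C', ' ', '2', '2', ' ', ' ', '4', '9', ' ', 'M']
print(numbers)     # ['22', '49']
print(symbols)     # ['$', '*']
```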

| the ‘pipe’

This regular expression OR that regular expression.

May need to use \| depending upon the specific program.

ab|c matches ab OR c

dumb|stupid matches dumb OR stupid
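
One thing worth checking for yourself is how far the pipe reaches: ab|c means "ab OR c", not "a followed by b or c". A small sketch, assuming Python's re module and invented sample text:

```python
import re

either_word = re.findall(r"dumb|stupid", "a dumb idea and a stupid plan")
ab_or_c = re.findall(r"ab|c", "abc cab")

# The pipe splits the whole expression, so "ac" is not matched by ab|c ...
ac_without_group = re.fullmatch(r"ab|c", "ac") is not None
# ... unless parentheses limit the choice to b or c.
ac_with_group = re.fullmatch(r"a(b|c)", "ac") is not None

print(either_word)       # ['dumb', 'stupid']
print(ab_or_c)           # ['ab', 'c', 'c', 'ab']
print(ac_without_group)  # False
print(ac_with_group)     # True
```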

() Parentheses

1. Can be used to group regular expressions for use with *, +, ?, and |

May need to use \( and \) depending upon the specific program.

a(b|c|d)ef matches abef acef adef

2. Can be used to return the results of the match within the parentheses. Useful for find and replace or programming.
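
Both jobs of the parentheses, grouping and returning what was matched, can be sketched in a few lines. This assumes Python's re module, where \1 and \2 in the replacement refer to the first and second parenthesized groups; the name being rearranged is just an example.

```python
import re

# 1. Grouping: the choice b|c|d sits between a and ef.
grouped = [s for s in ["abef", "acef", "adef", "aef", "abcef"]
           if re.fullmatch(r"a(b|c|d)ef", s)]

# Grouping also lets a quantifier apply to a whole unit.
three_cats = re.fullmatch(r"(CAT){3}", "CATCATCAT") is not None

# 2. Capturing: reuse the matched pieces in a find and replace.
swapped = re.sub(r"([A-Za-z]+), ([A-Za-z]+)", r"\2 \1", "Darwin, Charles")

# The captured pieces can also be read directly.
match = re.search(r"([A-Za-z]+), ([A-Za-z]+)", "Darwin, Charles")
last_name, first_name = match.group(1), match.group(2)

print(grouped)                 # ['abef', 'acef', 'adef']
print(three_cats)              # True
print(swapped)                 # Charles Darwin
print(last_name, first_name)   # Darwin Charles
```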

Rule: regular expressions are normally GREEDY

Example: The end of the world is at hand.

T.*nd matches: ‘The end of the world is at hand’ NOT ‘The end’

.* matches the entire line
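
Greediness is easy to demonstrate. The sketch below assumes Python's re module. Note that the non-greedy form *? used for comparison is a Perl-style extension available in Python and Perl, and is not part of the extended Unix syntax that grep -E uses.

```python
import re

line = "The end of the world is at hand."

# Greedy: .* grabs as much as it can, then backs up to the LAST "nd".
greedy = re.search(r"T.*nd", line).group()

# Non-greedy (Perl-style *?): stops at the FIRST "nd".
non_greedy = re.search(r"T.*?nd", line).group()

# .* on its own swallows the whole line.
whole_line = re.search(r".*", line).group()

print(greedy)              # The end of the world is at hand
print(non_greedy)          # The end
print(whole_line == line)  # True
```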

II. Using Regular Expressions – Text Editor

A text editor is useful for manipulating text rapidly. You can create a large batch of Unix commands very quickly or reshape data into tab-delimited columns.

Example: /people0/rac5/build31/fasta

WHERE TO FIND FREE TEXT EDITORS WITH REGULAR EXPRESSIONS?

Windows

TextPad (http://www.textpad.com/)

TIPS FOR TEXTPAD

3) Useful for getting Text Ready for Excel

Mac

BBEdit Lite 4.6 (http://www.barebones.com/products/bbedit_lite.html)

Unix

vi, vim, emacs, xemacs

III. Using Regular Expressions in Unix – grep

grep [options] regexp [file[s]]

Common Options

-i ignore case

-E extended regular expressions (or egrep)

-c report only a count of the number of lines containing matches, not the matches themselves

-v invert the search, displaying only lines that do not match

-n display the line number along with the line on which a match was found

-H display file name along with line

-r recurse directories

-l list filenames, but not lines, in which matches were found

EXAMPLES

grep -E '(CAT){5,20}' chr22 | more

grep -E '(cat){5,20}' chr22 | more

grep -E '^(CAGGG){2,20}' chr22 Ychrom | more

grep -E '^[AT]+$' chr22 | more

grep -r -n -E 'AC00203[0-9]+' /people2/jab/TESTS
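
To see what the options do without access to the sequence files, here is a rough Python imitation of grep. It is only a sketch: the function name grep and its keyword arguments are made up to mirror -i, -v and -n, the short sequences are invented, and the real grep has many more behaviors than this.

```python
import re

def grep(pattern, lines, ignore_case=False, invert=False, number=False):
    """Return the lines that match (or, with invert, that do not match)."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    selected = []
    for line_number, line in enumerate(lines, start=1):
        found = regex.search(line) is not None
        if found != invert:                      # invert flips the test, like -v
            selected.append((line_number, line) if number else line)
    return selected

# Hypothetical stand-ins for lines of a sequence file.
sequences = ["CATCATCATCATCATGG", "catcatcatcatcat", "ATTATA",
             "CAGGGCAGGGTT", "ACGT"]

cat_repeats = grep(r"(CAT){5,20}", sequences)
cat_any_case = grep(r"(CAT){5,20}", sequences, ignore_case=True)   # like -i
only_a_and_t = grep(r"^[AT]+$", sequences, number=True)            # like -n
not_a_and_t = grep(r"^[AT]+$", sequences, invert=True)             # like -v
match_count = len(grep(r"^(CAGGG){2,20}", sequences))              # like -c

print(cat_repeats)    # ['CATCATCATCATCATGG']
print(cat_any_case)   # ['CATCATCATCATCATGG', 'catcatcatcatcat']
print(only_a_and_t)   # [(3, 'ATTATA')]
print(len(not_a_and_t), match_count)  # 4 1
```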

USEFUL PROGRAM TO USE WITH GREP AND ANY OTHER PROGRAM

find path [expression]

This program finds files based on all sorts of characteristics. (See man find for more information.) It can be used for descending a directory structure and executing a program on specific files. In combination with grep, it gives you a pseudo-database.

find /people2/jab/* -name '*.out'

find /people2/jab/* -name '*.out' -exec grep -H LINE {} \; | more
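
The find plus grep combination can be sketched in Python as "walk the directory tree, pick files by name, search each one". The function name find_and_grep is made up, and the demo builds a throwaway directory with invented file names so that it runs anywhere; it is an illustration of the idea, not a replacement for the real commands.

```python
import fnmatch
import os
import re
import tempfile

def find_and_grep(root, name_glob, pattern):
    """Search every file under root whose name matches name_glob."""
    regex = re.compile(pattern)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):        # descend directories
        for filename in sorted(fnmatch.filter(filenames, name_glob)):
            path = os.path.join(dirpath, filename)
            with open(path, errors="replace") as handle:
                for line in handle:
                    if regex.search(line):
                        hits.append((path, line.rstrip("\n")))  # like grep -H
    return hits

# Demo on a temporary directory (hypothetical files and contents).
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "sub", "run1.out"), "w") as handle:
        handle.write("LINE 1 of output\nsomething else\n")
    with open(os.path.join(root, "notes.txt"), "w") as handle:
        handle.write("LINE in a file that is not *.out\n")
    hits = find_and_grep(root, "*.out", r"LINE")

for path, line in hits:
    print(f"{os.path.basename(path)}:{line}")   # run1.out:LINE 1 of output
```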

IV. Pivot Tables (cross tabulation)

A pivot table is a simple way to group data on the basis of categories.
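
Under the hood, a pivot table is just counting (or summing) records that share the same category values. Here is a minimal sketch in Python using only the standard library. The records are invented for illustration and are not taken from the duplication spreadsheet; the column meanings (chromosome and duplication type) are assumptions.

```python
from collections import Counter

# Hypothetical records: (chromosome, duplication type)
records = [
    ("chr7", "intra"),
    ("chr7", "inter"),
    ("chr7", "intra"),
    ("chr22", "inter"),
    ("chr22", "inter"),
]

# Count how many records fall into each (row, column) combination.
counts = Counter(records)
row_labels = sorted({row for row, _ in records})
column_labels = sorted({column for _, column in records})

# Build the cross tabulation; missing combinations count as 0.
table = {row: {column: counts[(row, column)] for column in column_labels}
         for row in row_labels}

# Print it as tab-delimited text, ready to paste into a spreadsheet.
print("\t".join(["chrom"] + column_labels))
for row in row_labels:
    print("\t".join([row] + [str(table[row][column]) for column in column_labels]))
```

Excel does the same grouping when you drag one field to the row area and another to the column area.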

AN EXAMPLE OF PIVOTING GENOMIC DUPLICATIONS

/people2/jab/GENE508/chr7_pairwise_dups.xls

PUTTING IT ALL TOGETHER: BIBLICAL STUDIES

(Combining grep, textpad, Excel)

Compare the size of chromosome 22 (chr22) and the Bible. People have been analyzing the Bible for a significant amount of time. How long will we be analyzing the genome?
How many verses (lines) in this version of the Bible (/people2/jab/GENE508/kjv.rawtxt) contain the word ‘smite’? How about ‘love’?
Save all of the verses containing ‘smite’ to a separate file.
Use regular expression find/replace in TextPad to get the data into Excel (add tabs (\t) to turn the data into columns). Here is a nice regular expression: ^([0-9]?[A-Za-z]+)([0-9]+):([0-9]+)
Build pivot table(s) to address questions such as: how many verses contain ‘smite’ in each book of the Bible?
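
The whole exercise can also be sketched in Python to check your TextPad and Excel results. This assumes each line starts with a reference such as Exo3:20 or 1Sam17:46 (book, chapter, colon, verse), which is what the regular expression above expects; the actual layout of the file may differ, and the sample lines below are placeholders, not real verse text.

```python
import re
from collections import Counter

# The regular expression from the exercise: book, chapter, verse.
VERSE_REF = re.compile(r"^([0-9]?[A-Za-z]+)([0-9]+):([0-9]+)")

# Placeholder lines standing in for the saved 'smite' verses.
smite_lines = [
    "Exo3:20 (verse text)",
    "Exo12:23 (verse text)",
    "1Sam17:46 (verse text)",
]

# Find and replace: put tabs between book, chapter and verse.
tabbed = [VERSE_REF.sub(r"\1\t\2\t\3", line) for line in smite_lines]

# Pivot: count verses per book (group 1 of the match).
verses_per_book = Counter(VERSE_REF.match(line).group(1) for line in smite_lines)

print(tabbed[0].split("\t"))   # ['Exo', '3', '20 (verse text)']
print(dict(verses_per_book))   # {'Exo': 2, '1Sam': 1}
```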

V. APPENDIX

OTHER UNIX REGULAR EXPRESSION PROGRAMS

1) sed –stream editor

Intro to Substitutions (http://muddcs.cs.hmc.edu/tech_docs/qref/sed.html)

FAQ (http://www.landfield.com/faqs/editor-faq/sed/)

You can do complicated multiple find and replaces, but I personally don’t use it that much because I can’t figure out how to deal with tabs (traditional sed doesn’t support \t for a tab). It basically combines grep with the power to do simple regular expression find and replaces. I use Perl instead, where substitution expressions have richer regular expressions.

Example: sed 's/Evan/Devil/' filename would do a simple find and replace, changing the first Evan on each line to Devil
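
For comparison, the same substitution can be sketched in Python. The function name sed_substitute and its global_replace argument are made up for illustration; the default mimics sed's behavior of changing only the first match on each line, and global_replace=True mimics adding the g flag (s/Evan/Devil/g).

```python
import re

def sed_substitute(pattern, replacement, lines, global_replace=False):
    """Apply a find and replace to each line, like sed 's/pattern/replacement/'."""
    count = 0 if global_replace else 1   # in re.sub, count=0 means replace all
    return [re.sub(pattern, replacement, line, count=count) for line in lines]

sample = ["Evan met Evan", "no match here"]

first_only = sed_substitute(r"Evan", "Devil", sample)
every_match = sed_substitute(r"Evan", "Devil", sample, global_replace=True)

print(first_only)   # ['Devil met Evan', 'no match here']
print(every_match)  # ['Devil met Devil', 'no match here']
```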

2) tr –translate or delete characters

It is useful for changing a large number of characters to other characters.

Example: tr atcg tagc < filename would complement the sequence (a→t, t→a, etc.)

Example: tr -s ' ' '\t' < filename would replace runs of spaces with a single tab
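
Both tr examples have close Python equivalents, sketched here with invented function names and sample input. str.maketrans builds the same character-for-character mapping that tr uses, and a regular expression handles the squeezing of repeated spaces.

```python
import re

# Character-for-character mapping: a->t, t->a, c->g, g->c
COMPLEMENT = str.maketrans("atcg", "tagc")

def complement(sequence):
    """Complement a lowercase DNA sequence, like tr atcg tagc."""
    return sequence.translate(COMPLEMENT)

def spaces_to_tab(line):
    """Replace each run of spaces with a single tab, like tr -s ' ' '\\t'."""
    return re.sub(r" +", "\t", line)

print(complement("aattcg"))                     # ttaagc
print(spaces_to_tab("a   b c").split("\t"))     # ['a', 'b', 'c']
```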

OTHER UNIX PROGRAMS WORTH KNOWING ABOUT

3) head – see the beginning n lines of a file

4) tail – see the last n lines of a file

5) wc – line, word and character count of files

6) cat – concatenates a list of files into one text stream

7) cmp – checks to see whether two files are the same or different

8) diff – compares two files and identifies the differences

9) join – merge two files based on data fields

10) sort – sort a file based on given criteria

11) paste – merges corresponding lines of files side by side, separated by tabs