Lecture 8:  Everyday Tools for Manipulating Data

 

(Regular Expressions and Pivot Tables)

 

Introduction

 

Manipulating and reformatting text is often a key component (and a limiting factor) of an analysis.  This lecture focuses on adding two widely used tools to your informatics arsenal:  regular expressions and pivot tables.  Regular expressions, or pattern matching, allow for complex reformatting of text, while pivot tables (or cross tabulation) allow for rapid generation of summary statistics.  This lecture won’t make you an expert in either, but hopefully once you know what can be done, you can begin to use them to address your own data crunching needs.  This lecture grew out of thinking about the tools that I use every day and which ones I am likely to continue using during the third and fourth years of medical school.

 

What are Regular Expressions?

 

Regular expressions are a “language” for searching for character patterns within text.  Depending upon the program, the matching characters (or the lines containing them) can be printed out or modified.  You have all used a simple form of pattern matching whenever you typed an asterisk to stand for any characters in a text search.

 

[jab1] 

PROGRAMS USING REGULAR EXPRESSIONS

A slew of programs and web sites use regular expressions.  Unix programs that use them include grep (unix file line search tool), ex (unix line editor), vi (unix visual editor), emacs (text editor), and sed (unix stream editor; like grep, but it can also modify the text).  Many languages, such as awk (pattern scanning and processing language) and Perl (general programming language), have regular expressions at their core.

 

(Deeper than you would ever want to go into RE http://www.lib.uchicago.edu/keith/tcl-course/topics/regexp.html )

(Book: Mastering Regular Expressions. J Friedl http://www.oreilly.com/catalog/regex/ )

 

 

I. Regular Expression Overview (Extended Unix)

 

(Basic and Extended Regular Expressions)

(manual page:  http://www.math.grin.edu/~rebelsky/References/ManPages/regexp.html )

 

Normal Characters: a-z A-Z 0-9 = ; :

 

abc   => would match abc and nothing else

evan  => would match evan in a text line and nothing else

 

 

Special characters: * . [ ] ? + { } ( ) | ^ $ / \ -

 

 

The asterisk (*) and plus (+)

* matches zero or more occurrences of the preceding regular expression

+ matches one or more occurrences of preceding regular expression

ab*c matches:   ac  abc    abbbbbbc

ab+c matches:         abc    abbbbbbc
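
A quick way to check these two operators is to feed grep a few test strings.  This is only a sketch: it assumes a grep that accepts -E, the helper name and sample strings are made up for illustration, and ^ and $ are added so that the whole line has to match.

```shell
# Hypothetical helper: print the sample strings, one per line.
samples() {
    printf '%s\n' ac abc abbbbbbc adc
}

# * allows zero or more b's, so ac, abc and abbbbbbc match.
samples | grep -E '^ab*c$'

# + needs at least one b, so ac drops out.
samples | grep -E '^ab+c$'
```

In both cases adc is rejected, because d is not a b.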

 

?  the Question Mark

Matches zero or one occurrences of preceding regular expression

ab?c  matches:   ac  or abc       

 

{} Curly Brackets

            Can be used to specifically designate number of matching characters

            a{4,10} matches: aaaa  aaaaaaaa up to aaaaaaaaaa but not aaa

            t{4,}  matches 4 or more ts
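
The question mark and the curly brackets can be tried out the same way.  A minimal sketch, assuming grep -E; the test strings are made up, and the ^ and $ anchors force the count to apply to the whole line.

```shell
# ? makes the b optional: ac and abc match, abbc does not.
printf '%s\n' ac abc abbc | grep -E '^ab?c$'

# {4,10} needs 4 to 10 a's: the 3-letter and 11-letter lines fail.
printf '%s\n' aaa aaaa aaaaaaaaaa aaaaaaaaaaa | grep -E '^a{4,10}$'

# {4,} means 4 or more t's, with no upper limit.
printf '%s\n' ttt tttt tttttttt | grep -E '^t{4,}$'
```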

 

\ the backslash

Hint: if a pattern doesn’t work as expected, try adding backslashes to the problem area

1. Used to designate that a special character should be treated as normal.

a\*b matches a*b

end\. matches  end.

4 \+ 3 matches 4 + 3

2. Used to designate that a normal character should be treated as special.

\t => Tab

\n => new line

\   => space (such as in a file name)
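
The effect of the backslash is easiest to see by running the same pattern with and without it.  A small sketch, assuming grep -E; the sample lines are made up, and the patterns are in single quotes so the shell passes the backslash through to grep untouched.

```shell
# Unescaped: * is a repeat operator, so ab and aab match but the literal a*b does not.
printf '%s\n' 'a*b' ab aab | grep -E '^a*b$'

# Escaped: \* is a plain asterisk, so only a*b matches.
printf '%s\n' 'a*b' ab aab | grep -E '^a\*b$'

# Escaped period and plus sign are matched literally too.
printf '%s\n' 'The end.' 'The ends' | grep -E 'end\.'
printf '%s\n' '4 + 3' '4 3' | grep -E '4 \+ 3'
```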

 

The circumflex (^) and the dollar sign ($)

Forces match to the beginning (^) or to the end ($) of a line

^Dear  matches: Dear     (only at beginning of line)

\.$ matches: a period  (only at end of a line)
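
Here is a small sketch of both anchors at work, assuming grep -E.  The helper name and the four sample lines are invented for the example.

```shell
# Hypothetical helper: four made-up lines of text.
letter() {
    printf '%s\n' 'Dear Evan,' 'My Dear friend' 'See you soon.' 'Version 2.0 is out'
}

# ^Dear: only the line that starts with Dear.
letter | grep -E '^Dear'
# prints: Dear Evan,

# \.$: only the line that ends in a period (2.0 is mid-line, so it does not count).
letter | grep -E '\.$'
# prints: See you soon.
```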

 

The period (.)

Matches any single character

a.c   matches: abc aec a:c a=c

a..c  matches: abbc axvc a=Zc
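
Each period stands for exactly one character, no more and no fewer.  A short sketch, assuming grep -E, with made-up test strings:

```shell
# One period: exactly one character between a and c.
printf '%s\n' abc a:c ac abbc | grep -E '^a.c$'
# prints abc and a:c (ac has nothing in between, abbc has two characters)

# Two periods: exactly two characters between a and c.
printf '%s\n' abbc a=Zc abc a=Zx | grep -E '^a..c$'
# prints abbc and a=Zc (a=Zx does not end in c)
```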

 

 [] The Square brackets

Defines a set or class of characters

*, $      lose their special meaning (become ordinary characters)

^          at the start means NOT (match anything not in this class) (sometimes ! is used instead)

-          (the dash) is used to create a range of characters

EXAMPLES:

[aeiouAEIOU]  match any vowel

[a-z]  match any lower case letter

[^a-z] match any character that is NOT a lowercase letter

[0-9]  match any digit
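
The classes above can be tried on a handful of words.  This is a sketch under a few assumptions: grep -E is available, the helper name and words are made up, and LC_ALL=C is set because in some locales a range like [a-z] can take in more than the 26 lowercase ASCII letters.

```shell
# Use plain ASCII ordering so that [a-z] means exactly a through z.
export LC_ALL=C

# Hypothetical helper: four made-up words.
words() {
    printf '%s\n' sky apple 2024 Tree
}

# Lines containing at least one vowel: apple, Tree.
words | grep -E '[aeiouAEIOU]'

# Lines made up only of digits: 2024.
words | grep -E '^[0-9]+$'

# Lines containing at least one character that is NOT a lowercase letter: 2024, Tree.
words | grep -E '[^a-z]'
```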

 

 

 

|  the ‘pipe’

This regular expression OR that regular expression.

May need to use \| depending upon the specific program.

ab|c matches ab OR c

dumb|stupid matches dumb OR stupid
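
A short sketch of the pipe with grep -E, where it needs no backslash.  The sample lines are made up, and the pattern must be quoted so the shell does not treat | as its own pipe.

```shell
# Either word is enough for a line to match.
printf '%s\n' 'a dumb idea' 'a stupid idea' 'a clever idea' | grep -E 'dumb|stupid'
# prints the first two lines

# ab|c means "ab" OR "c", so the lines a and b on their own do not match.
printf '%s\n' ab c a b | grep -E 'ab|c'
# prints ab and c
```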

 

() Parentheses

1. Can be used to group regular expressions for use with *, +, ? and {}

May need to use \(  and \) depending upon the specific program.

a(b|c|d)ef  matches abef  acef adef

2. Can be used to return the results of the match within the parentheses.  Useful for find and replace or programming.
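
Both uses of parentheses can be sketched in a few lines.  The grouping example assumes grep -E.  The find-and-replace example uses sed with basic regular expressions, which is one of the programs where the parentheses are written \( and \); the name being swapped is made up.

```shell
# 1. Grouping: one of b, c or d between a and ef.
printf '%s\n' abef acef adef aeef | grep -E '^a(b|c|d)ef$'
# prints abef, acef and adef

# 2. Returning the match: \1 and \2 hold whatever the first and second
#    pair of parentheses matched, so "Last, First" becomes "First Last".
printf '%s\n' 'Smith, Jane' | sed 's/^\([A-Za-z]*\), \([A-Za-z]*\)$/\2 \1/'
# prints: Jane Smith
```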

 

 

 

Rule:  regular expressions are normally GREEDY

            Example:  The end of the world is at hand.

            T.*nd matches:   ‘The end of the world is at hand’  NOT     ‘The end’

            .* matches the entire line
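
Greediness can be made visible with sed, where & in the replacement stands for whatever the pattern matched.  This is a sketch; the square brackets are just markers added around the match, and the second pattern is one common workaround rather than the only one.

```shell
line='The end of the world is at hand.'

# Greedy: .* runs as far as it can, so the match ends at the nd in "hand".
printf '%s\n' "$line" | sed 's/T.*nd/[&]/'
# prints: [The end of the world is at hand].

# Workaround: [^n]* cannot cross an n, so the match stops at the first nd.
printf '%s\n' "$line" | sed 's/T[^n]*nd/[&]/'
# prints: [The end] of the world is at hand.
```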

 

 

II. Using Regular Expressions – Text Editor

 

A text editor is useful for manipulating text rapidly.  With find and replace you can create a large batch of Unix commands very quickly or manipulate data into tab-delimited columns.

 

Example: /people0/rac5/build31/fasta

 

WHERE TO FIND FREE TEXT EDITORS WITH REGULAR EXPRESSIONS?

Windows

TextPad (http://www.textpad.com/)

TIPS FOR TEXTPAD

3) Useful for getting Text Ready for Excel

Mac

BBEdit Lite 4.6 (http://www.barebones.com/products/bbedit_lite.html)

Unix

vi, vim, emacs, xemacs 

 

III. Using Regular Expressions in Unix – grep

 

grep [options] regexp [file[s]] 

Common Options

-i ignore case

-E extended regular expressions (or egrep)

-c report only a count of the number of lines containing matches, not the matches themselves

-v invert the search, displaying only lines that do not match

-n display the line number along with the line on which a match was found

-H display file name along with line

-r recurse directories

-l list filenames, but not lines, in which matches were found

 

 

EXAMPLES

grep -E '(CAT){5,20}' chr22 | more

grep -E '(cat){5,20}' chr22 | more

grep -E '^(CAGGG){2,20}' chr22 Ychrom | more

grep -E '^[AT]+$' chr22 | more

grep -r -n -E 'AC00203[0-9]+' /people2/jab/TESTS
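
To see what the options do without a chromosome file at hand, the same kinds of patterns can be run on a tiny scratch file.  This is a sketch: the four sequence lines are made up (they are not real chr22 data), and it assumes mktemp is available to create the scratch file.

```shell
# Scratch file with four made-up sequence lines.
seqfile=$(mktemp)
printf '%s\n' CATCATCATCATCAT catcatcatcatcat ATTATA GGCATCATGG > "$seqfile"

# -c: count the lines with 5 to 20 CAT repeats (1 line).
grep -c -E '(CAT){5,20}' "$seqfile"

# -i: ignore case, so the lowercase line counts too (2 lines).
grep -c -i -E '(CAT){5,20}' "$seqfile"

# -n: show the line number with each match (prints 3:ATTATA).
grep -n -E '^[AT]+$' "$seqfile"

# -v: show the lines that do NOT contain CAT (the lowercase line and ATTATA).
grep -v -E 'CAT' "$seqfile"

rm -f "$seqfile"
```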

 

USEFUL PROGRAM TO USE WITH GREP AND ANY OTHER PROGRAM

find path [expression]

This program finds files based on all sorts of characteristics.  (See man find for more information.)  It can be used for descending a directory structure and executing a program on specific files.  In combination with grep, it gives you a pseudo-database.

 

find /people2/jab/* -name '*.out'

find /people2/jab/* -name '*.out' -exec grep -H LINE {} \; | more
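
The find and grep combination can be tried safely on a throwaway directory.  In this sketch the directory tree and file contents are made up to stand in for a real project directory, and mktemp -d is assumed to be available.

```shell
# Scratch directory tree (all names and contents are made up).
top=$(mktemp -d)
mkdir -p "$top/run1" "$top/run2"
printf 'LINE 1 ok\nother\n' > "$top/run1/a.out"
printf 'nothing here\n'     > "$top/run2/b.out"
printf 'LINE in a log\n'    > "$top/run2/c.log"

# List every .out file.  The quotes keep the shell from expanding *.out itself.
find "$top" -name '*.out' | sort

# Run grep on each .out file; -H puts the file name in front of every hit.
find "$top" -name '*.out' -exec grep -H LINE {} \;

rm -rf "$top"
```

Only a.out is reported by the second command: b.out has no matching line, and c.log is never searched because its name does not end in .out.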

 

IV. Pivot Tables (cross tabulation)

 

A pivot table is a simple way to group data on the basis of categories.
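
The same grouping idea can be sketched on the Unix command line with awk.  Everything here is illustrative: the helper name, the three tab-delimited columns and the numbers are made up, and a spreadsheet pivot table does much more than this.

```shell
# Made-up rows: chromosome <tab> duplication type <tab> length.
rows() {
    printf 'chr7\tinter\t10\n'
    printf 'chr7\tintra\t25\n'
    printf 'chr7\tintra\t5\n'
    printf 'chr22\tinter\t40\n'
}

# Group by column 2: count the rows and sum column 3 within each group.
rows | awk -F'\t' '
    { n[$2]++; total[$2] += $3 }
    END { for (k in n) printf "%s\t%d\t%d\n", k, n[k], total[k] }
' | sort
```

The output has one line per category (inter: 2 rows totalling 50, intra: 2 rows totalling 30), which is what a pivot table with a count and a sum would show.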

 

AN EXAMPLE OF PIVOTING GENOMIC DUPLICATIONS

 

/people2/jab/GENE508/chr7_pairwise_dups.xls

 

PUTTING IT ALL TOGETHER: BIBLICAL STUDIES

 (Combining grep, textpad, Excel)

 

 

 

V. APPENDIX

 

OTHER UNIX REGULAR EXPRESSION PROGRAMS

1) sed - stream editor

Intro to Substitutions (http://muddcs.cs.hmc.edu/tech_docs/qref/sed.html)

FAQ (http://www.landfield.com/faqs/editor-faq/sed/)

You can do complicated multiple find and replaces, but I personally don’t use it that much because I can’t figure out how to deal with tabs (some versions of sed don’t support \t for a tab).  It basically combines grep with the power to do simple regular expression find and replaces.  I use Perl instead, where substitution expressions have richer regular expressions.

Example: sed 's/Evan/Devil/' filename would do a simple find and replace
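
A short sketch of sed substitutions, including one possible workaround for tabs.  The sample text is made up.  The workaround assumes nothing about sed itself: printf produces a real tab character, and the shell pastes it into the sed command through a variable.

```shell
# Without g, only the first Evan on each line is replaced.
printf '%s\n' 'Evan met Evan' | sed 's/Evan/Devil/'
# prints: Devil met Evan

# With g, every occurrence on the line is replaced.
printf '%s\n' 'Evan met Evan' | sed 's/Evan/Devil/g'
# prints: Devil met Devil

# Tab workaround: keep a real tab in a shell variable and use double
# quotes so the shell substitutes it before sed sees the command.
tab=$(printf '\t')
printf '%s\n' 'chr7  100   200' | sed "s/  */$tab/g"
# prints chr7, 100 and 200 separated by single tabs
```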

2) tr - translate or delete characters

It is useful for changing a large number of characters to other characters.

Example: tr 'atcg' 'tagc' < filename would complement the sequence (a→t, t→a, etc.)

Example: tr -s ' ' '\t' < filename would replace runs of spaces with a single tab
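
Both tr examples can be checked on a line of made-up input instead of a file.  Note that tr takes plain lists of characters, not regular expressions, so no slashes are needed around them.

```shell
# Complement a lowercase sequence: a<->t and c<->g, one character at a time.
printf '%s\n' atcggcta | tr 'atcg' 'tagc'
# prints: tagccgat

# Turn each space into a tab, then -s squeezes each run of tabs down to one.
printf '%s\n' 'chr7   100  200' | tr -s ' ' '\t'
# prints chr7, 100 and 200 separated by single tabs
```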

 

OTHER UNIX PROGRAMS WORTH KNOWING ABOUT

3) head see the first n lines of a file

4) tail see the last n lines of a file

5) wc line, word and character count of files

6) cat concatenates a list of files into one text stream

7) cmp checks to see whether two files are the same or different

8) diff compares two files and identifies the differences

9) join merge two files based on data fields

10) sort sort a file based on given criteria

11) paste merges the corresponding lines of files side by side

 

 

 

 


 [jab1] Example: find.cw2ru.edu