Introduction
Manipulating and reformatting text is often a key component (and a
limiting factor) of many analyses. This lecture focuses on adding two
widely used tools to your informatics arsenal: regular expressions and
pivot tables. Regular expressions, or pattern matching, allow for complex
reformatting of text, while pivot tables (or cross tabulation) allow for
rapid generation of summary statistics. This lecture won’t make you an
expert in either, but hopefully once you know what can be done, you can
begin to use them to address your own data-crunching needs. This lecture
grew out of thinking about the tools that I use every day and which ones I
am likely to continue using during years 3 and 4 of medical school.
Regular expressions are a “language” for searching for character patterns
within text. Depending upon the program, the matching characters (or the
lines containing them) can be printed out or modified. You have already
used a simple form of pattern matching whenever you have typed an asterisk
to stand for any characters in a text search.
A slew of programs and web sites use regular expressions. Unix programs
that use them include grep (Unix file line search tool), ex (Unix line
editor), vi (Unix visual editor), emacs (text editor) and sed (Unix stream
editor; like grep, but it can also modify text). Many languages, such as
awk (pattern scanning and processing language) and Perl (general
programming language), have regular expressions at their core.
(Deeper than you would ever want to go into RE: http://www.lib.uchicago.edu/keith/tcl-course/topics/regexp.html)
(Book: Mastering Regular Expressions, J. Friedl: http://www.oreilly.com/catalog/regex/)
(Manual page, Basic and Extended Regular Expressions: http://www.math.grin.edu/~rebelsky/References/ManPages/regexp.html)
Normal Characters: a-z A-Z 0-9 = ; :
abc => would match abc and nothing else
evan => would match evan in a text line and nothing else
The asterisk (*) and plus (+)
* matches zero or more occurrences of the preceding regular expression
+ matches one or more occurrences of the preceding regular expression
ab*c matches: ac abc abbbbbbc
ab+c matches: abc abbbbbbc
? The Question Mark
Matches zero or one occurrence of the preceding regular expression
ab?c matches: ac or abc
{} Curly Brackets
Can be used to specify the number of matching characters
a{4,10} matches: aaaa, aaaaaaaa, up to aaaaaaaaaa, but not aaa
t{4,} matches 4 or more ts
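The four repetition operators above (*, +, ? and {}) are easy to try out in code. The sketch below uses Python's standard re module purely for illustration (the lecture's own tools are grep and Perl, whose extended syntax is very similar for these operators). The helper name matches is hypothetical, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

candidates = ["ac", "abc", "abbbbbbc"]

# * : zero or more b's, so all three strings fit
star = [s for s in candidates if matches(r"ab*c", s)]
# + : one or more b's, so "ac" drops out
plus = [s for s in candidates if matches(r"ab+c", s)]
# ? : zero or one b, so the long run of b's drops out
optional = [s for s in candidates if matches(r"ab?c", s)]
# {4,10} : between 4 and 10 a's; report the lengths that fit
braces = [n for n in (3, 4, 10, 11) if matches(r"a{4,10}", "a" * n)]

print(star)
print(plus)
print(optional)
print(braces)
```

Changing the numbers inside the curly brackets and rerunning is a quick way to get a feel for how the limits behave.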
\ The Backslash
Hint: if a pattern doesn’t work as expected, add backslashes to the
problem area
1. Used to designate that a special character should be treated as normal.
a\*b matches a*b
end\. matches end.
4 \+ 3 matches 4 + 3
2. Used to designate that a normal character should be treated as special.
\t => tab
\n => new line
\ followed by a space => a literal space (such as in a file name)
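Both uses of the backslash can be seen in a few lines. This is a minimal sketch in Python's re module (chosen for illustration; the variable names and sample strings are made up). The last line uses re.escape, a Python convenience that adds the backslashes for you, which is the programmatic version of the hint above.

```python
import re

# 1. A backslash turns a special character into an ordinary one.
literal_star = re.search(r"a\*b", "a*b") is not None
literal_plus = re.search(r"4 \+ 3", "4 + 3 = 7") is not None

# Without the backslash the period still means "any character".
unescaped_dot = re.search(r"end.", "endX") is not None
escaped_dot = re.search(r"end\.", "endX") is not None

# 2. A backslash turns an ordinary letter into a special one: \t is a tab.
fields = re.split(r"\t", "gene\tchr7\t1500")

# re.escape backslashes every special character in a literal string.
safe_pattern = re.escape("4 + 3")
found_literal = re.search(safe_pattern, "4 + 3 = 7") is not None

print(literal_star, literal_plus, unescaped_dot, escaped_dot)
print(fields)
print(found_literal)
```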
The circumflex (^) and the dollar sign ($)
Force the match to the beginning (^) or to the end ($) of a line
^Dear matches: Dear (only at the beginning of a line)
\.$ matches: a period (only at the end of a line)
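A short sketch of the two anchors, again using Python's re module for illustration. The sample lines are invented for the demonstration.

```python
import re

# Made-up lines for illustration.
lines = ["Dear Evan,", "My Dear friend", "The end.", "end. Not yet"]

# ^Dear only matches when Dear is the first thing on the line.
starts_with_dear = [line for line in lines if re.search(r"^Dear", line)]

# \.$ only matches a period that is the last thing on the line.
ends_with_period = [line for line in lines if re.search(r"\.$", line)]

print(starts_with_dear)
print(ends_with_period)
```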
The period (.)
Matches any single character
a.c matches: abc aec a:c a=c
a..c matches: abbc axvc a=Zc
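The period can be checked the same way. In this sketch (Python's re module, with a made-up candidate list) each period stands for exactly one character, so the number of periods fixes the length of the match.

```python
import re

candidates = ["abc", "aec", "a:c", "ac", "abbc", "a=Zc", "a=Zx"]

# a.c : an a, exactly one character of any kind, then a c
one_any = [s for s in candidates if re.fullmatch(r"a.c", s)]

# a..c : an a, exactly two characters of any kind, then a c
# ("a=Zx" is rejected because it does not end in c)
two_any = [s for s in candidates if re.fullmatch(r"a..c", s)]

print(one_any)
print(two_any)
```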
[] The Square Brackets
Define a set or class of characters
*, $ lose their special meaning (they become ordinary characters)
^ at the start means NOT (match anything not in this class) (sometimes !
is used instead)
- (the dash) is used to create a range of characters
EXAMPLES:
[aeiouAEIOU] match any vowel
[a-z] match any lowercase letter
[^a-z] match any character that is NOT a lowercase letter
[0-9] match any digit
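The character classes listed above can be tried on a sample string. This is a sketch using re.findall from Python's re module; the sample text is invented.

```python
import re

text = "Chr22 has 3 CAT repeats"  # made-up sample text

vowels = re.findall(r"[aeiouAEIOU]", text)
lowercase_runs = re.findall(r"[a-z]+", text)   # + groups neighbours together
digit_runs = re.findall(r"[0-9]+", text)
not_lowercase = re.findall(r"[^a-z]", "ab3-Cd")

# Inside the brackets * and $ are ordinary characters.
ordinary = re.findall(r"[*$]", "cost: $5 * 2")

print(vowels)
print(lowercase_runs)
print(digit_runs)
print(not_lowercase)
print(ordinary)
```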
| The ‘pipe’
This regular expression OR that regular expression.
May need to use \| depending upon the specific program.
ab|c matches ab OR c
dumb|stupid matches dumb OR stupid
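One thing worth seeing in code is how far the pipe reaches: ab|c means "ab" or "c", not "a followed by b or c". A minimal sketch in Python's re module, with made-up sample strings:

```python
import re

# The pipe splits the whole expression into alternatives.
words = re.findall(r"dumb|stupid", "a dumb idea and a stupid plan")

# ab|c accepts "ab" or "c", but not "ac".
ab_or_c = [s for s in ["ab", "c", "ac", "b"] if re.fullmatch(r"ab|c", s)]

print(words)
print(ab_or_c)
```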
() Parentheses
1. Can be used to group regular expressions for use with *, +, ? and {}.
May need to use \( and \) depending upon the specific program.
a(b|c|d)ef matches abef acef adef
2. Can be used to return the results of the match within the parentheses.
Useful for find and replace or programming.
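Both jobs of the parentheses, grouping and returning what was matched, can be sketched briefly. The example uses Python's re module, where \1 and \2 in the replacement text refer back to the first and second pair of parentheses (Perl and sed use a similar convention). The sample strings are invented.

```python
import re

# 1. Grouping: the alternatives apply only inside the parentheses.
grouped = [s for s in ["abef", "acef", "adef", "aef", "abcef"]
           if re.fullmatch(r"a(b|c|d)ef", s)]

# Grouping also lets a repeat count apply to a whole unit.
repeat = re.search(r"(CAT){2,}", "GGCATCATCATGG").group(0)

# 2. Capturing: reuse what each pair of parentheses matched.
name = "Doe, Jane"  # made-up input
captured = re.search(r"([A-Za-z]+), ([A-Za-z]+)", name).groups()
swapped = re.sub(r"([A-Za-z]+), ([A-Za-z]+)", r"\2 \1", name)

print(grouped)
print(repeat)
print(captured)
print(swapped)
```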
Rule: regular expressions are normally GREEDY
Example: The end of the world is at hand.
T.*nd matches: ‘The end of the world is at hand’ NOT ‘The end’
.* matches the entire line
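The greedy rule is easiest to believe once you see it. The sketch below uses Python's re module for illustration. The second pattern uses the non-greedy form .*? for contrast; that form is an extension found in Perl-style engines such as Perl and Python, and should not be assumed to work in grep -E.

```python
import re

line = "The end of the world is at hand."

# Greedy: .* grabs as much as it can, so the match runs to the LAST "nd".
greedy = re.search(r"T.*nd", line).group(0)

# Non-greedy (Perl-style extension): .*? stops at the FIRST "nd".
lazy = re.search(r"T.*?nd", line).group(0)

print(greedy)
print(lazy)
```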
A text editor is useful for manipulating text rapidly. It can be used to
create a large batch of Unix commands very quickly or to manipulate data
into tab-delimited columns.
Example: /people0/rac5/build31/fasta
TextPad (http://www.textpad.com/)
Useful for getting text ready for Excel
BBEdit Lite 4.6 (http://www.barebones.com/products/bbedit_lite.html)
vi, vim, emacs, xemacs
grep [options] regexp [file[s]]
-i  ignore case
-E  extended regular expressions (or egrep)
-c  report only a count of the number of lines containing matches, not the matches themselves
-v  invert the search, displaying only lines that do not match
-n  display the line number along with the line on which a match was found
-H  display the file name along with the line
-r  recurse directories
-l  list filenames, but not lines, in which matches were found
EXAMPLES
grep -E '(CAT){5,20}' chr22 | more
grep -E '(cat){5,20}' chr22 | more
grep -E '^(CAGGG){2,20}' chr22 Ychrom | more
grep -E '^[AT]+$' chr22 | more
grep -r -n -E 'AC00203[0-9]+' /people2/jab/TESTS
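To see what the options are doing, it can help to write a toy version of grep. The sketch below is not how grep is implemented; it is a rough Python imitation of -n, -i, -v and -c using the re module. The function name grep_lines and the sequence lines are hypothetical and are not taken from chr22.

```python
import re

def grep_lines(pattern, lines, ignore_case=False, invert=False):
    """Return (line_number, line) pairs, roughly like grep -n -E.

    ignore_case imitates -i and invert imitates -v.
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    hits = []
    for number, line in enumerate(lines, start=1):
        found = regex.search(line) is not None
        if found != invert:  # keep matches, or non-matches when inverted
            hits.append((number, line))
    return hits

# Made-up sequence lines for illustration only.
sequence = [
    "GGCATCATCATCATCATGG",
    "ggcatcatcatcatcatgg",
    "ATTATATTA",
    "CAGGGCAGGGTT",
]

print(grep_lines(r"(CAT){5,20}", sequence))                    # -n -E
print(grep_lines(r"(CAT){5,20}", sequence, ignore_case=True))  # adds -i
print(grep_lines(r"^[AT]+$", sequence))                        # only A and T
print(len(grep_lines(r"^(CAGGG){2,20}", sequence)))            # like -c
print(grep_lines(r"CAT", sequence, invert=True))               # like -v
```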
USEFUL PROGRAM TO USE WITH GREP AND ANY OTHER PROGRAM
find path [expression]
This program finds files based on all sorts of characteristics. (See man
find for more information.) It can be used for descending a directory
structure and executing a program on specific files. In combination with
grep you can have a pseudo-database.
find /people2/jab/* -name '*.out'
find /people2/jab/* -name '*.out' -exec grep -H LINE {} \; | more
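The find-plus-grep combination amounts to "walk the directory tree, pick files by name, search each one". A minimal sketch of that idea in Python follows, using pathlib and re from the standard library. The function name find_and_grep, the file names and the file contents are all hypothetical; the demo builds its own throwaway directory rather than touching /people2/jab.

```python
import re
import tempfile
from pathlib import Path

def find_and_grep(root, name_glob, pattern):
    """Yield (path, line) for each matching line in files under root."""
    # Roughly: find root -name name_glob -exec grep -H pattern {} \;
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob(name_glob)):
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            for line in handle:
                if regex.search(line):
                    yield path, line.rstrip("\n")

# Demo on made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run1"
    run_dir.mkdir()
    (run_dir / "align.out").write_text("HEADER\nLINE 1 repeat\nother\n")
    (run_dir / "notes.txt").write_text("LINE in a file that is skipped\n")
    for path, line in find_and_grep(tmp, "*.out", r"LINE"):
        print(f"{path.name}: {line}")
```

Only the .out file is searched, which is the same filtering that -name '*.out' provides before grep ever runs.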
A pivot table is a simple way to group data on the basis of categories.
AN EXAMPLE OF PIVOTING GENOMIC DUPLICATIONS
/people2/jab/GENE508/chr7_pairwise_dups.xls
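What a pivot table does can be sketched in a few lines of code: pick the category columns, then count and sum the rows that share each combination. The example below uses only Python's standard library. The rows, column meanings and numbers are entirely hypothetical and are not taken from chr7_pairwise_dups.xls; in Excel the same grouping is done by dragging fields into the pivot table layout.

```python
from collections import defaultdict

# Hypothetical rows: (chromosome of the second copy, type, length in bp).
rows = [
    ("chr7", "intra", 12000),
    ("chr7", "intra", 8000),
    ("chr2", "inter", 5000),
    ("chr2", "inter", 15000),
    ("chr9", "inter", 3000),
]

def pivot(rows):
    """Group rows by (chromosome, type); report count and total length."""
    table = defaultdict(lambda: {"count": 0, "total_bp": 0})
    for chrom, kind, length in rows:
        cell = table[(chrom, kind)]
        cell["count"] += 1
        cell["total_bp"] += length
    return dict(table)

# Print one tab-delimited summary line per category.
for (chrom, kind), cell in sorted(pivot(rows).items()):
    mean_bp = cell["total_bp"] / cell["count"]
    print(chrom, kind, cell["count"], cell["total_bp"], mean_bp, sep="\t")
```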
PUTTING IT ALL TOGETHER: BIBLICAL STUDIES
(Combining grep, TextPad, Excel)
OTHER UNIX REGULAR EXPRESSION PROGRAMS
1) sed – stream editor
Intro to Substitutions (http://muddcs.cs.hmc.edu/tech_docs/qref/sed.html)
FAQ (http://www.landfield.com/faqs/editor-faq/sed/)
You can do complicated multiple find and replaces, but I personally don’t
use it that much because I can’t figure out how to deal with tabs (the sed
I use doesn’t support \t for a tab). It basically combines grep with the
power to do simple regular expression find and replaces. I use Perl
instead, where substitution expressions have richer regular expressions.
Example: sed 's/Evan/Devil/' filename would do a simple find and replace
(the first Evan on each line becomes Devil)
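For comparison, here is a sketch of the same substitution in a language with richer regular expressions. It uses Python's re.sub rather than Perl (an assumption made for illustration). The count=1 argument imitates sed's default of replacing only the first match on each line, and the last line shows that \t for a tab is available. The sample lines are invented.

```python
import re

# Made-up input lines.
lines = ["Evan wrote this. Evan also tested it.", "No match here"]

# Like sed 's/Evan/Devil/': only the first occurrence on each line.
first_only = [re.sub(r"Evan", "Devil", line, count=1) for line in lines]

# Like sed 's/Evan/Devil/g': every occurrence on each line.
every = [re.sub(r"Evan", "Devil", line) for line in lines]

# Tabs are easy here: turn each run of spaces into a single tab.
tabbed = re.sub(r" +", "\t", "chr7  1500   dup")

print(first_only)
print(every)
print(repr(tabbed))
```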
2) tr – translate or delete characters
It is useful for changing a large number of characters to other characters.
Example: tr 'atcg' 'tagc' < filename would complement the sequence
(a→t, t→a, etc.)
Example: tr -s ' ' '\t' < filename would replace runs of spaces with a
single tab
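The character-for-character mapping that tr performs has a close analogue in Python's str.translate, sketched below. The sequence is made up, and the second half only roughly imitates tr -s by using a regular expression to squeeze each run of spaces into one tab.

```python
import re

# Map each base to its complement, one character at a time (like tr).
complement_table = str.maketrans("atcg", "tagc")
complement = "gattaca".translate(complement_table)

# Rough imitation of tr -s ' ' '\t': each run of spaces becomes one tab.
squeezed = re.sub(r" +", "\t", "chr7    1500  dup")

print(complement)
print(repr(squeezed))
```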
OTHER UNIX PROGRAMS WORTH KNOWING ABOUT
3) head – see the first n lines of a file
4) tail – see the last n lines of a file
5) wc – line, word and character count of files
6) cat – concatenates a list of files into one text stream
7) cmp – checks whether two files are the same or different
8) diff – compares two files and identifies the differences
9) join – merges two files based on a common data field
10) sort – sorts a file based on given criteria
11) paste – merges corresponding lines of files side by side