{"id":267,"date":"2007-10-08T00:00:00","date_gmt":"2007-10-08T00:00:00","guid":{"rendered":"http:\/\/www.strongd.net\/?p=267"},"modified":"2007-10-08T00:00:00","modified_gmt":"2007-10-08T00:00:00","slug":"PERL5 Regular Expression Description","status":"publish","type":"post","link":"https:\/\/www.strongd.net\/?p=267","title":{"rendered":"PERL5 Regular Expression Description"},"content":{"rendered":"<p><P>Why is Perl so useful for sysadmin and WWW and text hacking? It has a lot of nice little features that make it easy to do nearly anything you want to text. A lot of perl programs look like a weird synergy of C and shell and sed and awk. For example: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; #!\/usr\/bin\/perl<BR>&nbsp;&nbsp;&nbsp; # manpath -- <A href=\"mailto:tchrist@perl.com\">tchrist@perl.com<\/A><BR>&nbsp;&nbsp;&nbsp; foreach $bindir (split(\/:\/, $ENV{PATH})) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ($mandir = $bindir) =~ s\/[^\\\/]+$\/man\/;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; next if $mandir =~ \/^\\.\/ || $mandir eq '';<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (-d $mandir &amp;&amp; ! $seen{$mandir}++ ) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ($dev,$ino) = stat($mandir);<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (! $seen{$dev,$ino}++) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; push(@manpath,$mandir);<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp; print join(&quot;:&quot;, @manpath), &quot;\\n&quot;;<\/P><br \/>\n<P>Can anyone see what that does? I&#8217;d like to think it&#8217;s not too hard, even devoid of commentary. It does have some naughty bits, like using the side effects of assignment operators as expressions and double-plus postfix autoincrement. 
C programmers don&#8217;t have a problem with it, but a lot of others do. That&#8217;s why Guido banned such things in Python (a rather nice language in many ways), and why I don&#8217;t advocate using them to non-C programmers, whom it generally confuses whether it be done in C or in Perl or C++ or any such language. <BR>By far the most bizarre thing is that dread punctuation lying within funny slashes. Often folks call Perl unreadable because they don&#8217;t grok regexps, which all true perl wizards &#8212; and acolytes &#8212; adore. The slashes and their patterns govern matching and splitting and substituting, and here is where a lot of the Perl magic resides: its unmatched \ud83d\ude42 regular expressions. Certainly the above code could be rewritten in tcl or python or nearly anything else. It could even be rewritten in more legible perl. \ud83d\ude42 <\/P><br \/>\n<P>So what&#8217;s so special about perl&#8217;s regexps? Quite a bit, actually, although the real magic isn&#8217;t demonstrated very well in the manpath program. Once you&#8217;ve read the perlre(1) and the perlop(1) man pages, there&#8217;s still a lot to talk about. So permit me, if you would, to now explain Far More Than Everything You Ever Wanted to Know about Perl Regular Expressions&#8230; \ud83d\ude42 <\/P><br \/>\n<P>Perl starts with POSIX regexps of the &#8220;modern&#8221; variety, that is, egrep style not grep style. Here&#8217;s the simple case of matching a name: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; \/Th?om(as)? (Ch|K)rist(ia|e)ns{1,2}s[eo]n\/<\/P><br \/>\n<P>This avoids a lot of backslashes. I believe many languages also support such regular expressions. <BR>Now, Perl&#8217;s regexps &#8220;aren&#8217;t&#8221; &#8212; that is, they aren&#8217;t &#8220;regular&#8221; because backreferences per sed and grep are also supported, which renders the language no longer strictly regular and so forbids &#8220;pure&#8221; DFA implementations. <\/P><br \/>\n<P>But this is exceedingly useful. 
Backreferences let you refer back to match part of what you just had. Consider lines like these: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; 1. This is a fine kettle of fish.<BR>&nbsp;&nbsp;&nbsp; 2. The best card is Island Fish Jasconius.<BR>&nbsp;&nbsp;&nbsp; 3. Is isolation unpleasant?<BR>&nbsp;&nbsp;&nbsp; 4. That&#8217;s his isn&#8217;t it?<BR>&nbsp;&nbsp;&nbsp; 5. Is is outlawed?<\/P><br \/>\n<P>If you&#8217;d like to pick up duplicate &#8220;is&#8221; strings there, you could use the pattern <BR>&nbsp;&nbsp;&nbsp; \/(is) \\1\/&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # matches 1,4<\/P><br \/>\n<P>As written, that will match sentences 1 and 4. The others fail due to mixed case. You can&#8217;t fix it just by saying <BR>&nbsp;&nbsp;&nbsp; \/([Ii]s) \\1\/&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # still matches 1,4<\/P><br \/>\n<P>because the \\1 refers back to the real match, not the potential match. So what do we do? Well, POSIX specifies a REG_ICASE flag you can pass into your matcher to help support &#8220;grep -i&#8221; etc. To get perl to do this, affix an i flag after the match: <BR>&nbsp;&nbsp;&nbsp; \/(is) \\1\/i&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # matches 1,2,3,4,5<\/P><br \/>\n<P>And now all 5 of those sentences match. 
If you only wanted them to match legit words, you might use the \\b notation for word boundaries, making it <BR>&nbsp;&nbsp;&nbsp; \/\\b(is) \\1\/i&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # matches 2,3,5<BR>&nbsp;&nbsp;&nbsp; \/(is) \\1\\b\/i&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # matches 1,5<BR>&nbsp;&nbsp;&nbsp; \/\\b(is) \\1\\b\/i&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # matches 5<\/P><br \/>\n<P>This means you will see Perl code like <BR>&nbsp;&nbsp;&nbsp; if ( $variable =~ \/\\b(is) \\1\\b\/i ) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print &quot;gotta match&quot;;<BR>&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>One might argue that this &#8220;should&#8221; be written more like <BR>&nbsp;&nbsp;&nbsp; if ( rematch(variable, '\\b(is) \\1\\b', 'i') ) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print &quot;gotta match&quot;;<BR>&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>but that&#8217;s not how Perl works. I suspect that other languages could make it work that way. <BR>If you&#8217;d like to know where you matched, you might want to use these: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $MATCH&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; full match<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $PREMATCH&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; before the match<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $POSTMATCH&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; after the match<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $LAST_PAREN_MATCH&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; useful for alternatives<\/P><br \/>\n<P>The most common case, though, is just to use $1, $2, etc., which hold the text matched by the first, second, etc. parenthesized subexpressions. 
<BR>Another nice thing is that Perl supports the notions from C&#8217;s ctype.h include file: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; C function&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Perl regexp<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; isalnum&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \\w<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; isspace&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \\s<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; isdigit&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \\d<\/P><br \/>\n<P>That means that you don&#8217;t have to hard-code [A-Z] and have it break when someone has some interesting locale settings. For example, under charset=ISO-8859-1, something like &#8220;fa\u00e7ade&#8221; properly matches \/^\\w+$\/, because the c-cedille is considered an alphanum. In theory, LC_NUMERIC settings should also take, but I&#8217;ve never tried. <BR>This quickly leads to a pattern that detects duplicate words in sentences: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; \/\\b(\\w+)(\\s+\\1)+\\b\/i<\/P><br \/>\n<P>In fact, that one matches multiple duplicates as well. If if if you read in your input data a paragraph at a time, it will catch dups crossing line boundaries as as well. For example, using some convenient command line flags, here&#8217;s a <BR>&nbsp;&nbsp;&nbsp; perl -00 -ne 'if ( \/\\b(\\w+)(\\s+\\1)+\\b\/i ) { print &quot;dup $1 at $.\\n&quot; }'<\/P><br \/>\n<P>which when used on this article says: <BR>&nbsp;&nbsp;&nbsp; dup Is at 10<BR>&nbsp;&nbsp;&nbsp; dup If at 33<\/P><br \/>\n<P>The $. variable ($NR in English mode) is the record number. I set it to read paragraph records, so paragraphs 10 and 33 of this posting contain duplicate words. <BR>Actually, we can do something a bit nicer: we can find multiple duplicates in the same paragraph. The \/g flag causes a match to store a bit of state and start up where it last left off. 
This gives us: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; #!\/usr\/bin\/perl -00 -n<BR>&nbsp;&nbsp;&nbsp; while ( \/\\b(\\w+)(\\s+\\1)+\\b\/gi ) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print &quot;dup $1 at paragraph $.\\n&quot;;<BR>&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>This now yields: <BR>&nbsp;&nbsp;&nbsp; dup Is at paragraph 10<BR>&nbsp;&nbsp;&nbsp; dup if at paragraph 33<BR>&nbsp;&nbsp;&nbsp; dup as at paragraph 33<\/P><br \/>\n<P>Of course, we&#8217;re getting a bit hard to read here. So let&#8217;s use the \/x flag to permit embedded white space and comments in our pattern &#8212; you&#8217;ll want 5.002 for this (the white space worked in 5.000, but the comments were added later :-). For legibility, instead of slashes for the match, I&#8217;ll embrace the real m() function, since \/foo\/ and m(foo) and m{foo} are all equivalent. <BR>&nbsp;&nbsp;&nbsp; #!\/usr\/bin\/perl -n<BR>&nbsp;&nbsp;&nbsp; require 5.002;<BR>&nbsp;&nbsp;&nbsp; use English;<BR>&nbsp;&nbsp;&nbsp; $RS = '';<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; while (<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m{&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # m{foo} is like \/foo\/, but helps vi&#8217;s % key<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \\b&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # first find a word boundary<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (\\w+)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # followed by the biggest word we can find<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # which we&#8217;ll save in the \\1 
buffer<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \\s+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # now have some white space following it<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \\1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # and the word itself<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; )+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # repeat the space+word combo ad libitum<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \\b&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # make sure there&#8217;s a boundary at the end too<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }xgi&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # \/x for space\/comment-expanded patterns<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # \/g for global matching<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # \/i for case-insensitive matching<BR>&nbsp;&nbsp;&nbsp; )<BR>&nbsp;&nbsp;&nbsp; {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print &#8220;dup $1 at paragraph $NR\\n&#8221;;<BR>&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>While it&#8217;s true that someone who doesn&#8217;t know regular expressions won&#8217;t be able to read this at first glance, this is not a problem. 
So even though we can build up rather complex patterns, we can format and comment them nicely, preserving understandability. I wonder why no one else has done this in their regexp libraries? <BR>I actually wrote a sublegible version of this many years ago. It runs even on ancient versions of Perl. I&#8217;d probably do that a bit differently these days &#8212; my coding style has certainly matured. It violates several of my own current style guidelines. <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; #!\/usr\/bin\/perl<BR>&nbsp;&nbsp;&nbsp; undef $\/; $* = 1;<BR>&nbsp;&nbsp;&nbsp; while ( $ARGV = shift ) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (!open ARGV) { warn &quot;$ARGV: $!\\n&quot;; next; }<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $_ = &lt;ARGV&gt;;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; s\/\\b(\\s?)(([A-Za-z]\\w*)(\\s+\\3)+\\b)\/$1\\200$2\\200\/gi || next;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; split(\/\\n\/);<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $NR = 0;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @hits = ();<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for (@_) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $NR++;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; push(@hits, sprintf(&quot;%5d %s&quot;, $NR, $_)) if \/\\200\/;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $_ = join(&quot;\\n&quot;,@hits);<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; s\/\\200([^\\200]+)\\200\/[* $1 *]\/g;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print &quot;$ARGV:\\n$_\\n&quot;;<BR>&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>Here&#8217;s what that will output when run on this article up to this current point: <BR>&nbsp;&nbsp; 51&nbsp;&nbsp;&nbsp;&nbsp; 5. 
[* Is is *] outlawed?<BR>&nbsp; 124 In fact, that one matches multiple duplicates as well.&nbsp; [* If<BR>&nbsp; 125 if if *] you read in your input data a paragraph at a time, it will<BR>&nbsp; 126 catch dups crossing line boundaries [* as as *] well.&nbsp; For example, using<\/P><br \/>\n<P>Which is pretty neat. <BR>Speaking of ctype.h macros, Perl borrows the vi notation of case translation via \\u, \\l, \\U, and \\L. So you could say <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; $variable = &quot;fa\u00e7ade ni\u00f1o co\u00f6perate moli\u00e8re ren\u00e9e na\u00efve h\u00e6mo tsch\u00fc\u00df&quot;;<BR>and then do a<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; $variable =~ s\/(\\w+)\/\\U$1\/g;<\/P><br \/>\n<P>and it would come out <BR>&nbsp;&nbsp;&nbsp; FA\u00c7ADE NI\u00d1O CO\u00d6PERATE MOLI\u00c8RE REN\u00c9E NA\u00cfVE H\u00c6MO TSCH\u00dc\u00df<\/P><br \/>\n<P>Oh well. My clib doesn&#8217;t know to turn \u00df -&gt; SS. That&#8217;s a harder issue. <BR>This is much better than writing things like <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; $variable =~ tr[a-z][A-Z];<\/P><br \/>\n<P>because that would give you: <BR>&nbsp;&nbsp;&nbsp; FA\u00e7ADE NI\u00f1O CO\u00f6PERATE MOLI\u00e8RE REN\u00e9E NA\u00efVE H\u00e6MO TSCH\u00fc\u00df<\/P><br \/>\n<P>which isn&#8217;t right at all. <BR>Actually, perl can beat vi and do this: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; $variable =~ s\/(\\w+)\/\\u\\L$1\/g;<\/P><br \/>\n<P>Yielding: <BR>&nbsp;&nbsp;&nbsp; Fa\u00e7ade Ni\u00f1o Co\u00f6perate Moli\u00e8re Ren\u00e9e Na\u00efve H\u00e6mo Tsch\u00fc\u00df<\/P><br \/>\n<P>which is somewhat interesting. <BR>Speaking of substitutes, we can use a \/e flag on the substitute to get the RHS to evaluate to code instead of just a string. 
Consider: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; s\/(\\d+)\/8 * $1\/ge;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # multiply all numbers by 8<BR>&nbsp;&nbsp;&nbsp; s\/(\\d+)\/sprintf(&quot;%x&quot;, $1)\/ge;&nbsp;&nbsp; # convert them to hex<\/P><br \/>\n<P>This is nice when renumbering paragraphs. I often write <BR>&nbsp;&nbsp;&nbsp; s\/^(\\d+)\/1 + $1\/e<\/P><br \/>\n<P>or from within vi, just <BR>&nbsp;&nbsp;&nbsp; %!perl -pe 's\/^(\\d+)\/1 + $1\/e'<\/P><br \/>\n<P>Here&#8217;s a more elaborate example of this. If you wanted to expand %d or %s or whatnot, you might just do <BR>&nbsp;&nbsp;&nbsp; s\/%(.)\/$percent{$1}\/g;<\/P><br \/>\n<P>given a %percent definition like this: <BR>&nbsp;&nbsp;&nbsp; %percent = (<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'd'&nbsp;&nbsp;&nbsp;&nbsp; =&gt; 'digit',<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 's'&nbsp;&nbsp;&nbsp;&nbsp; =&gt; 'string',<BR>&nbsp;&nbsp;&nbsp; );<\/P><br \/>\n<P>But in fact, that&#8217;s not quite enough. You might well want to call a function, like <BR>&nbsp;&nbsp;&nbsp; s\/%(.)\/unpercent($1)\/ge;<\/P><br \/>\n<P>(assuming you have an unpercent() function defined.) <BR>You can even use \/ee for a double-eval, but that seems going overboard in most cases. It is, however, nice for converting embedded variables like $foo or whatever in text into their values. This way a sentence with $HOME and $TERM in it, assuming these were valid variables, might become a sentence with \/home\/tchrist and xterm in it. Just do this: <\/P><br \/>\n<P>&nbsp;&nbsp; s\/(\\$\\w+)\/$1\/eeg;<\/P><br \/>\n<P>Ok, what more can we do with perl patterns? split takes a pattern. Imagine that you have a record stored in plain text as blank line separated paragraphs with FIELD: VALUE pairs on each line. 
<BR>&nbsp;&nbsp;&nbsp; field: value here<BR>&nbsp;&nbsp;&nbsp; somefield: some value here<BR>&nbsp;&nbsp;&nbsp; morefield: other value here<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; field: second record&#8217;s value here<BR>&nbsp;&nbsp;&nbsp; somefield: some value here<BR>&nbsp;&nbsp;&nbsp; morefield: other value here<BR>&nbsp;&nbsp;&nbsp; newfield: other funny stuff<\/P><br \/>\n<P>You could process that this way. We&#8217;ll put it into key value pairs in a hash, just as though it had been initialized as <BR>&nbsp;&nbsp;&nbsp; %hash = (<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'field'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; =&gt; 'value here',<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'somefield'&nbsp;&nbsp;&nbsp;&nbsp; =&gt; 'some value here',<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'morefield'&nbsp;&nbsp;&nbsp;&nbsp; =&gt; 'other value here',<BR>&nbsp;&nbsp;&nbsp; );<\/P><br \/>\n<P>I&#8217;ll use a few command line switches for short cuts: <BR>&nbsp;&nbsp;&nbsp; #!\/usr\/bin\/perl -00n<BR>&nbsp;&nbsp;&nbsp; # the match at the start of the record yields an empty leading field; discard it<BR>&nbsp;&nbsp;&nbsp; (undef, %hash) = split( \/^([^:]+):\\s*\/m );<BR>&nbsp;&nbsp;&nbsp; if ( $hash{&quot;somefield&quot;} =~ \/here\/) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print &quot;record $. has here in somefield\\n&quot;;<BR>&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>The \/m flag governs whether ^ can match internally. I believe this is the POSIX value REG_NEWLINE. Normally perl does not have ^ match anywhere but the beginning of the string. 
<BR>Or you could eschew shortcuts and write: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; #!\/usr\/bin\/perl<BR>&nbsp;&nbsp;&nbsp; use English;<BR>&nbsp;&nbsp;&nbsp; $RS = '';<BR>&nbsp;&nbsp;&nbsp; while ( $line = &lt;ARGV&gt; ) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (undef, %hash) = split(\/^([^:]+):\\s*\/m, $line);&nbsp; # drop the empty leading field<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( $hash{&quot;somefield&quot;} =~ \/here\/) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print &quot;record $NR has here in somefield\\n&quot;;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>Actually, in the current version of perl, you can use the getline() object method on the predefined ARGV file handle object: <BR>&nbsp;&nbsp;&nbsp; #!\/usr\/bin\/perl<BR>&nbsp;&nbsp;&nbsp; use English;<BR>&nbsp;&nbsp;&nbsp; $RS = '';<BR>&nbsp;&nbsp;&nbsp; while ( $line = ARGV-&gt;getline() ) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (undef, %hash) = split(\/^([^:]+):\\s*\/m, $line);&nbsp; # drop the empty leading field<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( $hash{&quot;somefield&quot;} =~ \/here\/) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print &quot;record $NR has here in somefield\\n&quot;;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>This can be especially convenient for handling mail messages. 
<BR>Here, for example, is a bare-bones mail-sorting program: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; #!\/usr\/bin\/perl -00<BR>&nbsp;&nbsp;&nbsp; while (&lt;&gt;) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( \/^From \/ ) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ($id) = \/^Message-ID:\\s*(.*)\/mi;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $sub{$id} = \/^Subject:\\s*(Re:\\s*)*(.*)\/mi<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ? uc($2)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; : $id;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $msg{$id} .= $_;<BR>&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp; print @msg{ sort { $sub{$a} cmp $sub{$b} } keys %msg};<\/P><br \/>\n<P>Now, I still haven&#8217;t mentioned a couple of features which are to my mind critical in any analysis of the strengths of Perl&#8217;s pattern matching. These are stingy matching and lookaheads. <BR>Stingy matching solves the problem of greedy matching. A greedy match picks up everything, as in: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $line = &quot;The food is under the bar in the barn.&quot;;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( $line =~ \/foo(.*)bar\/ ) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print &quot;got &lt;$1&gt;\\n&quot;;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>That prints out <BR>&nbsp;&nbsp;&nbsp; got &lt;d is under the bar in the &gt;<\/P><br \/>\n<P>Which is often not what you want. Instead, we can add an extra ? after a repetition operator to render it stingy instead of greedy. 
<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( $line =~ \/foo(.*?)bar\/ ) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print &quot;got &lt;$1&gt;\\n&quot;;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>That prints out <BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; got &lt;d is under the &gt;<\/P><br \/>\n<P>which is often more what folks want. It turns out that having both stingy and greedy repetition operators in no way compromises a regexp engine&#8217;s regularity, nor is it particularly hard to implement. This comes up in matching quoted things. You can do tricks like using [^:] or [^&quot;] or [^&quot;'] for the simple cases, but negating multicharacter strings is hard. You can just use stingy matching instead. <BR>Or you could just use lookaheads. <\/P><br \/>\n<P>This is the other important aspect of perl matching I wanted to mention. Lookaheads are 0-width assertions that state that what follows must match or must not match a particular thing. These are phrased as either (?=pattern) for the assertion or (?!pattern) for the negation. <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; \/\\bfoo(?!bar)\\w+\/<\/P><br \/>\n<P>That will match &#8220;foostuff&#8221; but not &#8220;foo&#8221; or &#8220;foobar&#8221;, because I said there must be some alphanums after the word foo, but these may not begin with bar. <BR>Why would you need this? Oh, there are lots of times. Imagine splitting on newlines that are not followed by a space or a tab: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; @list_of_results = split ( \/\\n(?![\\t ])\/, $data );<\/P><br \/>\n<P>Let&#8217;s put this all together and look at a couple of examples. Both have to do with HTML munging, the current rage. First, let&#8217;s solve the problem of detecting URLs in plaintext and highlighting them properly. A problem is if the URL has trailing punctuation, like <A href=\"ftp:\/\/host\/path.file\">ftp:\/\/host\/path.file<\/A>. Is that last dot supposed to be in the URL? 
We can probably just assume that a trailing dot doesn&#8217;t count, but even so, most scanners seem to get this wrong. Here&#8217;s a different approach: <BR>&nbsp; #!\/usr\/bin\/perl<BR>&nbsp; # urlify -- <A href=\"mailto:tchrist@perl.com\">tchrist@perl.com<\/A><BR>&nbsp; require 5.002;&nbsp; # well, or 5.000 if you see below<\/P><br \/>\n<P>&nbsp; $urls = '(' . join ('|', qw{<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; http<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; telnet<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; gopher<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; file<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wais<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ftp<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } )<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; . 
')';<\/P><br \/>\n<P>&nbsp; $ltrs = '\\w';<BR>&nbsp; $gunk = '\/#~:.?+=&amp;%@!\\-';<BR>&nbsp; $punc = '.:?\\-';<BR>&nbsp; $any&nbsp; = &quot;${ltrs}${gunk}${punc}&quot;;<\/P><br \/>\n<P>&nbsp; while (&lt;&gt;) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ## use this if early-ish perl5 (pre 5.002)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ##&nbsp; s{\\b(${urls}:[$any]+?)(?=[$punc]*[^$any]|\\Z)}{&lt;A HREF=&quot;$1&quot;&gt;$1&lt;\/A&gt;}goi;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ## otherwise use this -- it just has 5.002ish comments<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; s{<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \\b&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # start at word boundary<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # begin $1&nbsp; {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $urls&nbsp;&nbsp;&nbsp;&nbsp; :&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # need resource and a colon<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [$any] +?&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # followed by one or more<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #&nbsp; of any valid character, but<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #&nbsp; be conservative and take 
only<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #&nbsp; what you need to&#8230;.<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; )&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # end&nbsp;&nbsp; $1&nbsp; }<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (?=&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # look-ahead non-consumptive assertion<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [$punc]*&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # either 0 or more punctuation<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [^$any]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #&nbsp;&nbsp; followed by a non-url char<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # or else<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #&nbsp;&nbsp; then end of the string<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; )<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }{&lt;A HREF=&quot;$1&quot;&gt;$1&lt;\/A&gt;}igox;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print;<BR>&nbsp; }<\/P><br \/>\n<P>Pretty nifty, eh? \ud83d\ude42 <BR>Here&#8217;s another HTML thing: we have an html document, and we want to remove all of its embedded markup text. 
This requires three steps: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; 1) Strip &lt;!-- html comments --&gt;<BR>&nbsp;&nbsp;&nbsp; 2) Strip &lt;TAGS&gt;<BR>&nbsp;&nbsp;&nbsp; 3) Convert &amp;entities; into what they should be.<\/P><br \/>\n<P>This is complicated by the horrible specs on how html comments work: they can have embedded tags in them. So you have to be way more careful. But it still only takes three substitutions. \ud83d\ude42 I&#8217;ll use the \/s flag to make sure that my &#8220;.&#8221; can stretch to match a newline as well (normally it doesn&#8217;t). <BR>&nbsp;&nbsp;&nbsp; #!\/usr\/bin\/perl -p0777<BR>&nbsp;&nbsp;&nbsp; #<BR>&nbsp;&nbsp;&nbsp; #########################################################<BR>&nbsp;&nbsp;&nbsp; # striphtml (&quot;striff tummel&quot;)<BR>&nbsp;&nbsp;&nbsp; # <A href=\"mailto:tchrist@perl.com\">tchrist@perl.com<\/A><BR>&nbsp;&nbsp;&nbsp; # version 1.0: Thu 01 Feb 1996 1:53:31pm MST<BR>&nbsp;&nbsp;&nbsp; # version 1.1: Sat Feb&nbsp; 3 06:23:50 MST 1996<BR>&nbsp;&nbsp;&nbsp; #&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (fix up comments in annoying places)<BR>&nbsp;&nbsp;&nbsp; #########################################################<BR>&nbsp;&nbsp;&nbsp; #<BR>&nbsp;&nbsp;&nbsp; # how to strip out html comments and tags and transform<BR>&nbsp;&nbsp;&nbsp; # entities in just three -- count 'em three -- substitutions;<BR>&nbsp;&nbsp;&nbsp; # sed and awk eat your heart out.&nbsp; \ud83d\ude42<BR>&nbsp;&nbsp;&nbsp; #<BR>&nbsp;&nbsp;&nbsp; # as always, translations from this nacr\u00e9 rendition into<BR>&nbsp;&nbsp;&nbsp; # more characteristically marine, herpetoid, titillative,<BR>&nbsp;&nbsp;&nbsp; # or indonesian idioms are welcome for the furthering of<BR>&nbsp;&nbsp;&nbsp; # comparative cyberlinguistic studies.<BR>&nbsp;&nbsp;&nbsp; #<BR>&nbsp;&nbsp;&nbsp; #########################################################<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; require 
5.001;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # for nifty embedded regexp comments<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; #########################################################<BR>&nbsp;&nbsp;&nbsp; # first we&#8217;ll shoot all the &lt;!-- comments --&gt;<BR>&nbsp;&nbsp;&nbsp; #########################################################<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; s{ &lt;!&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # comments begin with a `&lt;!'<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # followed by 0 or more comments;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (.*?)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # this is actually to eat up comments in non<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # random places<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # not supposed to have any white space here<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # just a quick start;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; --&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # each comment starts with a `--'<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .*?&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # and includes all text up to and including<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
--&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # the *next* occurrence of `--'<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \\s*&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # and may have trailing white space<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #&nbsp;&nbsp; (albeit not leading white space XXX)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; )+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # repetire ad libitum&nbsp; XXX should be * not +<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (.*?)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # trailing non comment text<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # up to a `&gt;'<BR>&nbsp;&nbsp;&nbsp; }{<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ($1 || $3) { # this silliness for embedded comments in tags<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &quot;&lt;!$1 $3&gt;&quot;;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp; }gesx;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # mutate into nada, nothing, and niente<BR>&nbsp;&nbsp;&nbsp; \f<BR>&nbsp;&nbsp;&nbsp; #########################################################<BR>&nbsp;&nbsp;&nbsp; # next we&#8217;ll remove all the &lt;tags&gt;<BR>&nbsp;&nbsp;&nbsp; #########################################################<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; s{ &lt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # opening angle bracket<\/P><br 
\/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (?:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # Non-backreffing grouping paren<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [^&gt;'&quot;] *&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # 0 or more things that are neither &gt; nor ' nor &quot;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #&nbsp;&nbsp;&nbsp; or else<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &quot;.*?&quot;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # a section between double quotes (stingy match)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #&nbsp;&nbsp;&nbsp; or else<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; '.*?'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # a section between single quotes (stingy match)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ) +&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # repetire ad libitum<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #&nbsp; hm.... are null tags &lt;&gt; legal? 
XXX<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # closing angle bracket<BR>&nbsp;&nbsp;&nbsp; }{}gsx;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # mutate into nada, nothing, and niente<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; #########################################################<BR>&nbsp;&nbsp;&nbsp; # finally we&#8217;ll translate all &amp;valid; HTML 2.0 entities<BR>&nbsp;&nbsp;&nbsp; #########################################################<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; s{ (<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &amp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # an entity starts with an ampersand<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \\x23\\d+&nbsp;&nbsp;&nbsp; # and is either a pound (# == hex 23) and numbers<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #&nbsp;&nbsp; or else<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \\w+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # has alphanumunders up to a semi<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; )<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;?&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # a semi terminates AS DOES ANYTHING ELSE (XXX)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; )<BR>&nbsp;&nbsp;&nbsp; } {<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $entity{$2}&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # if it&#8217;s a known entity use that<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
||&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #&nbsp;&nbsp; but otherwise<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # leave what we&#8217;d found; NO WARNINGS (XXX)<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; }gex;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # execute replacement -- that&#8217;s code not a string<BR>&nbsp;&nbsp;&nbsp; \f<BR>&nbsp;&nbsp;&nbsp; #########################################################<BR>&nbsp;&nbsp;&nbsp; # but wait! load up the %entity mappings enwrapped in<BR>&nbsp;&nbsp;&nbsp; # a BEGIN that the last might be first, and only execute<BR>&nbsp;&nbsp;&nbsp; # once, since we&#8217;re in a -p &#8220;loop&#8221;; awk is kinda nice after all.<BR>&nbsp;&nbsp;&nbsp; #########################################################<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; BEGIN {<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; %entity = (<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; lt&nbsp;&nbsp;&nbsp;&nbsp; =&gt; '&lt;',&nbsp;&nbsp;&nbsp;&nbsp; #a less-than<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; gt&nbsp;&nbsp;&nbsp;&nbsp; =&gt; '&gt;',&nbsp;&nbsp;&nbsp;&nbsp; #a greater-than<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; amp&nbsp;&nbsp;&nbsp; =&gt; '&amp;',&nbsp;&nbsp;&nbsp;&nbsp; #a nampersand<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; quot&nbsp;&nbsp; =&gt; '&quot;',&nbsp;&nbsp;&nbsp;&nbsp; #a (vertical) double-quote<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nbsp&nbsp;&nbsp; =&gt; chr 160, #no-break space<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; iexcl&nbsp; =&gt; chr 161, #inverted exclamation 
mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; cent&nbsp;&nbsp; =&gt; chr 162, #cent sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pound&nbsp; =&gt; chr 163, #pound sterling sign CURRENCY NOT WEIGHT<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; curren =&gt; chr 164, #general currency sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; yen&nbsp;&nbsp;&nbsp; =&gt; chr 165, #yen sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; brvbar =&gt; chr 166, #broken (vertical) bar<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sect&nbsp;&nbsp; =&gt; chr 167, #section sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; uml&nbsp;&nbsp;&nbsp; =&gt; chr 168, #umlaut (dieresis)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; copy&nbsp;&nbsp; =&gt; chr 169, #copyright sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ordf&nbsp;&nbsp; =&gt; chr 170, #ordinal indicator, feminine<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; laquo&nbsp; =&gt; chr 171, #angle quotation mark, left<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; not&nbsp;&nbsp;&nbsp; =&gt; chr 172, #not sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; shy&nbsp;&nbsp;&nbsp; =&gt; chr 173, #soft hyphen<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; reg&nbsp;&nbsp;&nbsp; =&gt; chr 174, #registered sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; macr&nbsp;&nbsp; =&gt; chr 175, #macron<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; deg&nbsp;&nbsp;&nbsp; =&gt; chr 176, #degree sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; plusmn =&gt; chr 177, #plus-or-minus sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
sup2&nbsp;&nbsp; =&gt; chr 178, #superscript two<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sup3&nbsp;&nbsp; =&gt; chr 179, #superscript three<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; acute&nbsp; =&gt; chr 180, #acute accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; micro&nbsp; =&gt; chr 181, #micro sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; para&nbsp;&nbsp; =&gt; chr 182, #pilcrow (paragraph sign)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; middot =&gt; chr 183, #middle dot<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; cedil&nbsp; =&gt; chr 184, #cedilla<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sup1&nbsp;&nbsp; =&gt; chr 185, #superscript one<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ordm&nbsp;&nbsp; =&gt; chr 186, #ordinal indicator, masculine<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; raquo&nbsp; =&gt; chr 187, #angle quotation mark, right<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frac14 =&gt; chr 188, #fraction one-quarter<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frac12 =&gt; chr 189, #fraction one-half<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frac34 =&gt; chr 190, #fraction three-quarters<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; iquest =&gt; chr 191, #inverted question mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Agrave =&gt; chr 192, #capital A, grave accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Aacute =&gt; chr 193, #capital A, acute accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Acirc&nbsp; =&gt; chr 194, #capital A, circumflex accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
Atilde =&gt; chr 195, #capital A, tilde<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Auml&nbsp;&nbsp; =&gt; chr 196, #capital A, dieresis or umlaut mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Aring&nbsp; =&gt; chr 197, #capital A, ring<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; AElig&nbsp; =&gt; chr 198, #capital AE diphthong (ligature)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ccedil =&gt; chr 199, #capital C, cedilla<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Egrave =&gt; chr 200, #capital E, grave accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Eacute =&gt; chr 201, #capital E, acute accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ecirc&nbsp; =&gt; chr 202, #capital E, circumflex accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Euml&nbsp;&nbsp; =&gt; chr 203, #capital E, dieresis or umlaut mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Igrave =&gt; chr 204, #capital I, grave accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Iacute =&gt; chr 205, #capital I, acute accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Icirc&nbsp; =&gt; chr 206, #capital I, circumflex accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Iuml&nbsp;&nbsp; =&gt; chr 207, #capital I, dieresis or umlaut mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ETH&nbsp;&nbsp;&nbsp; =&gt; chr 208, #capital Eth, Icelandic<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ntilde =&gt; chr 209, #capital N, tilde<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ograve =&gt; chr 210, #capital O, grave accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Oacute =&gt; chr 211, #capital O, acute 
accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ocirc&nbsp; =&gt; chr 212, #capital O, circumflex accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Otilde =&gt; chr 213, #capital O, tilde<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ouml&nbsp;&nbsp; =&gt; chr 214, #capital O, dieresis or umlaut mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; times&nbsp; =&gt; chr 215, #multiply sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Oslash =&gt; chr 216, #capital O, slash<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ugrave =&gt; chr 217, #capital U, grave accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Uacute =&gt; chr 218, #capital U, acute accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ucirc&nbsp; =&gt; chr 219, #capital U, circumflex accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Uuml&nbsp;&nbsp; =&gt; chr 220, #capital U, dieresis or umlaut mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Yacute =&gt; chr 221, #capital Y, acute accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; THORN&nbsp; =&gt; chr 222, #capital THORN, Icelandic<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; szlig&nbsp; =&gt; chr 223, #small sharp s, German (sz ligature)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; agrave =&gt; chr 224, #small a, grave accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; aacute =&gt; chr 225, #small a, acute accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; acirc&nbsp; =&gt; chr 226, #small a, circumflex accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; atilde =&gt; chr 227, #small a, 
tilde<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; auml&nbsp;&nbsp; =&gt; chr 228, #small a, dieresis or umlaut mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; aring&nbsp; =&gt; chr 229, #small a, ring<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; aelig&nbsp; =&gt; chr 230, #small ae diphthong (ligature)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ccedil =&gt; chr 231, #small c, cedilla<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; egrave =&gt; chr 232, #small e, grave accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; eacute =&gt; chr 233, #small e, acute accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ecirc&nbsp; =&gt; chr 234, #small e, circumflex accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; euml&nbsp;&nbsp; =&gt; chr 235, #small e, dieresis or umlaut mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; igrave =&gt; chr 236, #small i, grave accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; iacute =&gt; chr 237, #small i, acute accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; icirc&nbsp; =&gt; chr 238, #small i, circumflex accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; iuml&nbsp;&nbsp; =&gt; chr 239, #small i, dieresis or umlaut mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; eth&nbsp;&nbsp;&nbsp; =&gt; chr 240, #small eth, Icelandic<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ntilde =&gt; chr 241, #small n, tilde<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ograve =&gt; chr 242, #small o, grave accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; oacute =&gt; chr 243, #small o, acute 
accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ocirc&nbsp; =&gt; chr 244, #small o, circumflex accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; otilde =&gt; chr 245, #small o, tilde<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ouml&nbsp;&nbsp; =&gt; chr 246, #small o, dieresis or umlaut mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; divide =&gt; chr 247, #divide sign<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; oslash =&gt; chr 248, #small o, slash<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ugrave =&gt; chr 249, #small u, grave accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; uacute =&gt; chr 250, #small u, acute accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ucirc&nbsp; =&gt; chr 251, #small u, circumflex accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; uuml&nbsp;&nbsp; =&gt; chr 252, #small u, dieresis or umlaut mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; yacute =&gt; chr 253, #small y, acute accent<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; thorn&nbsp; =&gt; chr 254, #small thorn, Icelandic<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; yuml&nbsp;&nbsp; =&gt; chr 255, #small y, dieresis or umlaut mark<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; );<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ####################################################<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # now fill in all the numbers to match themselves<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ####################################################<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for $chr ( 0 .. 255 ) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $entity{ '#' . 
$chr } = chr $chr;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp; \f<BR>&nbsp;&nbsp;&nbsp; #########################################################<BR>&nbsp;&nbsp;&nbsp; # premature finish lest someone clip my signature<BR>&nbsp;&nbsp;&nbsp; #########################################################<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; # NOW FOR SOME SAMPLE DATA -- Switch ARGV to DATA above<BR>&nbsp;&nbsp;&nbsp; # to test<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; __END__<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;title&gt;Tom Christiansen&#8217;s Mox.Perl.COM Home Page&lt;\/title&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;!-- begin header --&gt;<BR>&lt;A HREF=&quot;http:\/\/perl-ora.songline.com\/universal\/header.map&quot;&gt;&lt;IMG SRC=&quot;http:\/\/perl-ora.songline.com\/graphics\/header-nav.gif&quot; HEIGHT=&quot;18&quot; WIDTH=&quot;515&quot; ALT=&quot;Nav bar&quot; BORDER=&quot;0&quot; usemap=&quot;#header-nav&quot;&gt;&lt;\/A&gt;<\/P><br \/>\n<P>&lt;map name=&quot;header-nav&quot;&gt;<BR>&lt;area shape=&quot;rect&quot; alt=&quot;Perl.com&quot; coords=&quot;5,1,103,17&quot; href=&quot;http:\/\/www.perl.com\/index.html&quot;&gt;<BR>&lt;area shape=&quot;rect&quot; alt=&quot;CPAN&quot; coords=&quot;114,1,171,17&quot; href=&quot;http:\/\/www.perl.com\/CPAN\/CPAN.html&quot;&gt;<BR>&lt;area shape=&quot;rect&quot; alt=&quot;Perl Language&quot; coords=&quot;178,0,248,16&quot; href=&quot;http:\/\/language.perl.com\/&quot;&gt;<BR>&lt;area shape=&quot;rect&quot; alt=&quot;Perl Reference&quot; coords=&quot;254,0,328,16&quot; href=&quot;<A 
href=\"http:\/\/reference.perl.com\/\">http:\/\/reference.perl.com\/<\/A>&quot;&gt;<BR>&lt;area shape=&quot;rect&quot; alt=&quot;Perl Conference&quot; coords=&quot;334,0,414,17&quot; href=&quot;http:\/\/perl-conf.songline.com&quot;&gt;<BR>&lt;area shape=&quot;rect&quot; alt=&quot;Programming Republic of Perl&quot; coords=&quot;422,0,510,17&quot; href=&quot;http:\/\/republic.perl.com&quot;&gt;<BR>&lt;\/map&gt;<\/P><br \/>\n<P>&lt;!-- end header --&gt;<BR>&lt;BODY BGCOLOR=#ffffff TEXT=#000000&gt;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;!--<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;BODY BGCOLOR=&quot;#000000&quot; TEXT=&quot;#FFFFFF&quot;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LINK=&quot;#FFFF00&quot; VLINK=&quot;#22AA22&quot; ALINK=&quot;#0077FF&quot;&gt;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; !--&gt;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;A NAME=TOP&gt;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;CENTER&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;h3&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;A HREF=&quot;#PERL&quot;&gt;perl&lt;\/a&gt; \/<BR>&nbsp;&nbsp;&nbsp; &lt;A HREF=&quot;#MAGIC&quot;&gt;magic&lt;\/a&gt; \/<BR>&nbsp;&nbsp;&nbsp; &lt;A HREF=&quot;#USENIX&quot;&gt;usenix&lt;\/a&gt; \/<BR>&nbsp;&nbsp;&nbsp; &lt;A HREF=&quot;#BOULDER&quot;&gt;boulder&lt;\/a&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;\/h3&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;BR&gt;<BR>&nbsp;&nbsp;&nbsp; The word of the day is &lt;i&gt;nidificate&lt;\/i&gt;.<BR>&nbsp;&nbsp;&nbsp; &lt;\/CENTER&gt;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; Testing: &amp;#69; &amp;#202; &amp;Auml;<BR>&nbsp;&nbsp;&nbsp; &lt;\/a&gt;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;HR NOSHADE SIZE=3&gt;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;A NAME=PERL&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;CENTER&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;h1&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;IMG SRC=&quot;\/deckmaster\/gifs\/camel.gif&quot; 
ALT=&quot;&quot;&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;font size=7&gt;<BR>&nbsp;&nbsp;&nbsp; Perl<BR>&nbsp;&nbsp;&nbsp; &lt;\/font&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;IMG SRC=&quot;\/deckmaster\/gifs\/camel.gif&quot; ALT=&quot;&quot;&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;\/h1&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;\/a&gt;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; DOCTYPE START1<BR>&nbsp;&nbsp;&nbsp; &lt;!DOCTYPE&nbsp; HTML PUBLIC &quot;-\/\/IETF\/\/DTD HTML 2.0\/\/EN&quot;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -- This is an annoying comment &gt; --<BR>&nbsp;&nbsp;&nbsp; &gt;<BR>&nbsp;&nbsp;&nbsp; END1<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; DOCTYPE START2<BR>&nbsp;&nbsp;&nbsp; &lt;!DOCTYPE&nbsp; HTML PUBLIC &quot;-\/\/IETF\/\/DTD HTML 2.0\/\/EN&quot;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -- This is an annoying comment&nbsp; --<BR>&nbsp;&nbsp;&nbsp; &gt;<BR>&nbsp;&nbsp;&nbsp; END2<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;I&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;BLOCKQUOTE&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;DL&gt;&lt;DT&gt;A ship then new they built for him<BR>&nbsp;&nbsp;&nbsp; &lt;DD&gt;of mithril and of elven glass...<BR>&nbsp;&nbsp;&nbsp; &lt;\/DL&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;\/I&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;\/BLOCKQUOTE&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;\/CENTER&gt;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;HR size=3 noshade&gt;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;BLOCKQUOTE&gt;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Wow!&nbsp; I really can&#8217;t believe that anyone has read this far<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; in this very long news posting about irregular expressions. 
\ud83d\ude42<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Is anyone really still with me?&nbsp; If so, make my day and<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; drop me a piece of email.<BR>&nbsp;&nbsp;&nbsp; &lt;\/BLOCKQUOTE&gt;<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;UL&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;LI&gt;<BR>&nbsp;&nbsp;&nbsp; &lt;A HREF=&quot;\/CPAN\/README.html&quot;&gt;CPAN<BR>&nbsp;&nbsp;&nbsp; (Comprehensive Perl Archive Network)&lt;\/a&gt; sites are replicated around the world; please ch<BR>oose<BR>&nbsp;&nbsp;&nbsp; from &lt;A HREF=&quot;\/CPAN\/CPAN.html&quot;&gt;one near you&lt;\/a&gt;.<BR>&nbsp;&nbsp;&nbsp; The &lt;A HREF=&quot;\/CPAN\/modules\/01modules.index.html&quot;&gt;CPAN index&lt;\/a<BR>&gt;<BR>&nbsp;&nbsp;&nbsp; to the &lt;A HREF=&quot;\/CPAN\/modules\/00modlist.long.html&quot;&gt;full module<BR>s file&lt;\/a&gt;<BR>&nbsp;&nbsp;&nbsp; are also good places to look.<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; &lt;LI&gt;&lt;IMG SRC=&quot;\/deckmaster\/gifs\/new.gif&quot; WIDTH=26 HEIGHT=13 ALT=&quot;NEW&quot;&gt;<BR>&nbsp;&nbsp;&nbsp; Here&#8217;s a table of perl and CGI-related books and publications, in either<BR>&nbsp;&nbsp;&nbsp; &lt;A HREF=&quot;\/info\/books.html&quot;&gt;&lt;SMALL&gt;HTML&lt;\/SMALL&gt; 3.0 table format&lt;\/a&gt;<BR>&nbsp;&nbsp;&nbsp; or else in<BR>&nbsp;&nbsp;&nbsp; &lt;A HREF=&quot;\/info\/books.txt&quot;&gt;pre-formatted&lt;\/a&gt; for old browsers.<\/P><br \/>\n<P>What&#8217;s missing from Perl&#8217;s regular expressions? Anything? Well, yes. The first is that they should be first-class objects. There are some really embarrassing optimization hacks to get around not having compiled regexps directly accessible. The \/o flag I used above is just one of them. (I&#8217;m *not* talking about the study() function, which is a neat thing to turbo-ize your matching.) 
A much more egregious hack involving closures is demonstrated here using the match_any function, which itself returns a function to do the work: <BR>&nbsp;&nbsp;&nbsp; $f = match_any('^begin', 'end$', 'middle');<BR>&nbsp;&nbsp;&nbsp; while (&lt;&gt;) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print if &amp;$f();<BR>&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; sub match_any {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; die &quot;usage: match_any pats&quot; unless @_;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; my $code = &lt;&lt;EOCODE;<BR>&nbsp;&nbsp;&nbsp; sub {<BR>&nbsp;&nbsp;&nbsp; EOCODE<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $code .= &lt;&lt;EOCODE if @_ &gt; 5;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; study;<BR>&nbsp;&nbsp;&nbsp; EOCODE<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for $pat (@_) {<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $code .= &lt;&lt;EOCODE;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return 1 if \/$pat\/;<BR>&nbsp;&nbsp;&nbsp; EOCODE<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $code .= &quot;}\\n&quot;;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print &quot;CODE: $code\\n&quot;;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; my $func = eval $code;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; die &quot;bad pattern: $@&quot; if $@;<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return $func;<BR>&nbsp;&nbsp;&nbsp; }<\/P><br \/>\n<P>That&#8217;s the kind of thing I just despise writing: the only thing worse would be not being able to do it at all. \ud83d\ude41 1st-class compiled regexps would surely help a great deal here. <BR>Sometimes people expect backreferences to be forward references, as in the pattern \/\\1\\s(\\w+)\/, which just isn&#8217;t the way it works. A related issue is that while lookaheads work, they are not lookbehinds, which can confuse people. 
This means \/\\n(?=\\s)\/ is ok, but you cannot use this for lookbehind: \/(?!foo)bar\/ will not find an occurrence of &#8220;bar&#8221; that is preceded by something which is not &#8220;foo&#8221;. That&#8217;s because the (?!foo) is just saying that the next thing cannot be &#8220;foo&#8221;, and it&#8217;s not; it&#8217;s a &#8220;bar&#8221;, so &#8220;foobar&#8221; will match. <\/P><br \/>\n<P>There isn&#8217;t really much support for user-defined character classes. You see a bit of that in the urlify program above. On the other hand, this might be the clearest way of writing it. <\/P><br \/>\n<P>Another thing that would be nice to have is the ability to somehow specify a recursive match with nesting. That way you could pull out matching parens or braces or begin\/end blocks etc. I don&#8217;t know what a good syntax for this might be. Maybe (?{...) for the opening one and (?}...) for the closing one, as in: <\/P><br \/>\n<P>&nbsp;&nbsp;&nbsp; \/\\b(?{begin)\\b.*\\b(?}end)\\b\/i<\/P><br \/>\n<P>Finally, while it&#8217;s cool that perl&#8217;s patterns are 8-bit clean, will match strings even with null bytes in them, and have support for alternate 8-bit character sets, it would certainly make the world happy if there were full Unicode support. <\/P><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Why is Perl so useful for sysadmin and WWW and text hacking? It has a lot of nice little features that make it easy to do nearly anything you want to text. A lot of perl programs look like a weird synergy of C and shell and sed and awk. 
For example: &nbsp;&nbsp;&nbsp; #!\/usr\/bin\/perl&nbsp;&nbsp;&nbsp; # &hellip; <a href=\"https:\/\/www.strongd.net\/?p=267\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">PERL5 Regular Expression Description<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-267","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/www.strongd.net\/index.php?rest_route=\/wp\/v2\/posts\/267","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.strongd.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.strongd.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.strongd.net\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.strongd.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=267"}],"version-history":[{"count":0,"href":"https:\/\/www.strongd.net\/index.php?rest_route=\/wp\/v2\/posts\/267\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.strongd.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=267"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.strongd.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=267"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.strongd.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=267"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}