{"id":1607,"date":"2015-07-19T22:45:30","date_gmt":"2015-07-20T05:45:30","guid":{"rendered":"http:\/\/zackmdavis.net\/blog\/?p=1607"},"modified":"2015-07-19T22:45:30","modified_gmt":"2015-07-20T05:45:30","slug":"dollar","status":"publish","type":"post","link":"http:\/\/zackmdavis.net\/blog\/2015\/07\/dollar\/","title":{"rendered":"$"},"content":{"rendered":"<p>I used to think of <code>$<\/code> in regular expressions as matching the end of the string. I was wrong! It actually might do something more subtle than that, depending on what regex engine you're using. In my <a href=\"http:\/\/zackmdavis.net\/blog\/2014\/11\/native-tongue\/\">native<\/a> Python's <a href=\"https:\/\/docs.python.org\/3\/library\/re.html\"><code>re<\/code> module, <code>$<\/code><\/a><\/p>\n<blockquote><p>\n[m]atches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.\n<\/p><\/blockquote>\n<p>Note! The end of the string, <em>or just before the newline at the end of the string<\/em>.<\/p>\n<pre><code>In [2]: my_regex = re.compile(&quot;foo$&quot;)\r\n\r\nIn [3]: my_regex.match(&quot;foo&quot;)\r\nOut[3]: &lt;_sre.SRE_Match object; span=(0, 3), match=&#39;foo&#39;&gt;\r\n\r\nIn [4]: my_regex.match(&quot;foo\\n&quot;)\r\nOut[4]: &lt;_sre.SRE_Match object; span=(0, 3), match=&#39;foo&#39;&gt;<\/code><\/pre>\n<p>I guess I can see the motivation\u2014we often want <a href=\"https:\/\/docs.python.org\/3\/library\/stdtypes.html#str.splitlines\">to use<\/a> the newline character as a terminator of lines (by definition) or files (by <a href=\"http:\/\/unix.stackexchange.com\/a\/18789\">sacred tradition<\/a>), without wanting to think of <code>\\n<\/code> as really part of the content of interest\u2014but the disjunctive behavior of <code>$<\/code> can be a source of treacherous bugs in the fingers of misinformed programmers!<\/p>\n<p><!--more--><\/p>\n<p>It happened to me while I was doing speculative pre-development of the 
speculative pre-prototype for my speculative <a href=\"https:\/\/github.com\/zackmdavis\/Glitteral\">Glitteral programming language<\/a>, specifically in the lexical analyzer\u2014the part of a compiler that recognizes strings of source code as representing <em>tokens<\/em> that mean something in the language's grammar: this is a language keyword, that's an integer literal, this is an identifier, and so forth. My makeshift lexical analyzer (inspired by, but diverging significantly from, the more sophisticated thing that textbook said to do\u2014I was in a hurry) involved deciding if a segment of source code could represent a particular token (or prefix thereof) by checking if it matched the regular expression defining each token class (or a regex describing prefixes of that token class). I had prudently (but not prudently enough, as you see) anchored each of my token class regexes with <code>$<\/code>, so that, for example, the source fragment <code>fore<\/code> could not be erroneously recognized as the language keyword <code>for<\/code>. But that just left me with a bug in which a newline immediately following a token would be recognized as part of the token: for example, you could end up lexing the string <code>3\\n<\/code> as an integer literal, even though the integer literal was supposed to be just <code>3<\/code>. After <a href=\"https:\/\/github.com\/zackmdavis\/Glitteral\/commit\/abfe411b\">my first crude fix<\/a> later proved to be inadequate, I ended up <a href=\"https:\/\/github.com\/zackmdavis\/Glitteral\/commit\/832285a8\">fixing it by augmenting the <code>$<\/code>s with the negative lookahead assertion <code>(?!\\n)<\/code> immediately thereafter<\/a>, in effect saying, &quot;match the end of the string or just before the newline at the end of the string, <em>but<\/em> not just before a newline,&quot; the negative lookahead assertion canceling out the interpretation of <code>$<\/code> that I didn't want. 
And then later I <a href=\"https:\/\/github.com\/zackmdavis\/Glitteral\/commit\/b5961d66\">replaced all those <code>$(?!\\n)<\/code>s with <code>\\Z<\/code>s<\/a> (which <em>actually<\/em> match the end of the string, like I wanted in the first place), after it was brought to my attention that <code>\\Z<\/code> was a thing.<\/p>\n<p>But I'm not the only one who was confused! (Note: the previous sentence should be read in a tone of terror and despair at the tightness of the cruel grip of ignorance on our fragile world, <em>not<\/em> relief that other people are as dumb as me.) The famous Django web application framework recently released patch versions <a href=\"https:\/\/docs.djangoproject.com\/en\/1.8\/releases\/1.8.3\/\">1.8.3<\/a>, <a href=\"https:\/\/docs.djangoproject.com\/en\/1.8\/releases\/1.7.9\/\">1.7.9<\/a>, and <a href=\"https:\/\/docs.djangoproject.com\/en\/1.8\/releases\/1.4.21\/\">1.4.21<\/a> due to security concerns, one of which was <a href=\"https:\/\/www.djangoproject.com\/weblog\/2015\/jul\/08\/security-releases\/#s-header-injection-possibility-since-validators-accept-newlines-in-input\">validators failing to guard against possible header injection vulnerabilities<\/a> owing to <a href=\"https:\/\/github.com\/django\/django\/commit\/574dd5e0\">the use of <code>$<\/code> instead of <code>\\Z<\/code> in regular expressions<\/a>.<\/p>\n<p>Everything I have said about <code>$<\/code> in regexes concerns the Python world. Apparently <a href=\"http:\/\/perldoc.perl.org\/perlre.html#Regular-Expressions\">Perl is the same way<\/a> (maybe we got it from them?). But other regex engines don't have the &quot;or just before the newline&quot; semantic flourish in their interpretation of <code>$<\/code>. 
<a href=\"http:\/\/docs.oracle.com\/javase\/7\/docs\/api\/java\/util\/regex\/Pattern.html\">In Java, for example<\/a> (and therefore, more importantly from my point of view, Clojure),<\/p>\n<blockquote><p>\n[b]y default, the regular expressions <code>^<\/code> and <code>$<\/code> ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence.\n<\/p><\/blockquote>\n<pre><code>user=&gt; (re-matches #&quot;foo$&quot; &quot;foo&quot;)\r\n&quot;foo&quot;\r\nuser=&gt; (re-matches #&quot;foo$&quot; &quot;foo\\n&quot;)\r\nnil<\/code><\/pre>\n<p><a href=\"http:\/\/ruby-doc.org\/core-2.2.0\/Regexp.html#class-Regexp-label-Anchors\">Whereas in Ruby<\/a>, <code>$<\/code> is explicitly the end of <em>line<\/em> anchor (like in Python's <code>MULTILINE<\/code> mode), <code>\\Z<\/code> matches the end of the string or just before the newline if the string ends with a newline (like Python's default), and <code>\\z<\/code> is for the end of the string!<\/p>\n<p>I guess the moral is that if you want to write a kind of regular expression that you're not already intimately familiar with, you should think carefully and read the owner's manual of your chosen regex engine. What you find there may surprise you!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I used to think of $ in regular expressions as matching the end of the string. I was wrong! It actually might do something more subtle than that, depending on what regex engine you're using. 
In my native Python's re &hellip; <a href=\"http:\/\/zackmdavis.net\/blog\/2015\/07\/dollar\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[20],"tags":[70,21,64],"_links":{"self":[{"href":"http:\/\/zackmdavis.net\/blog\/wp-json\/wp\/v2\/posts\/1607"}],"collection":[{"href":"http:\/\/zackmdavis.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/zackmdavis.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/zackmdavis.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/zackmdavis.net\/blog\/wp-json\/wp\/v2\/comments?post=1607"}],"version-history":[{"count":4,"href":"http:\/\/zackmdavis.net\/blog\/wp-json\/wp\/v2\/posts\/1607\/revisions"}],"predecessor-version":[{"id":1611,"href":"http:\/\/zackmdavis.net\/blog\/wp-json\/wp\/v2\/posts\/1607\/revisions\/1611"}],"wp:attachment":[{"href":"http:\/\/zackmdavis.net\/blog\/wp-json\/wp\/v2\/media?parent=1607"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/zackmdavis.net\/blog\/wp-json\/wp\/v2\/categories?post=1607"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/zackmdavis.net\/blog\/wp-json\/wp\/v2\/tags?post=1607"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}