Conversation

@koyae commented May 15, 2016

I made some fairly minor alterations in the interest of making pgsanity easier to use. The changes address #5 for both piped and file-based input, plus an edit that allows pgsanity to deal with psql directives that might show up in incoming data. If the latter is not desired (at least as the default behavior), I can either move those changes to a separate branch and submit another pull request, or we could add a flag that enables the ignore-psql-directives behavior.

bom_present = check_for_bom(nose)
sql_string = sql_string[len(nose):] if bom_present else sql_string
success, msg = check_string(sql_string.decode("utf-8"))
# ^ The above call to decode() is safe for both ASCII and UTF-8 data.

@markdrago (Owner)

This is a bit nit-picky, but I think I'd rather see all of this BOM stuff move into check_for_bom(), which probably means that check_for_bom() would become remove_bom_if_exists(). It would take the full sql string and return the full sql string (with the BOM removed if it exists). I think that encapsulates the BOM handling a bit more.
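
Roughly what I'm picturing - just a sketch, and it assumes the SQL still arrives as a UTF-8 byte string at that point:

    import codecs

    def remove_bom_if_exists(sql_string):
        # Take the full SQL byte string; strip a leading UTF-8 BOM if present,
        # otherwise hand the string back untouched.
        if sql_string.startswith(codecs.BOM_UTF8):
            return sql_string[len(codecs.BOM_UTF8):]
        return sql_string

The caller then becomes check_string(remove_bom_if_exists(sql_string).decode("utf-8")) and all of the BOM handling stays in one place.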

@koyae (Author) commented May 17, 2016

I had just noticed all of the other check_something() functions hanging out in this file, and at the time I figured I'd go with the flow! Anyway, I'm happy to rename the function and wrap that logic up all the way.

@koyae (Author)

I've tried to neaten this up as per your suggestions: ca34532.

@markdrago (Owner)

Thanks for sending this PR over - I'm hopeful that we'll get it merged in soon. Would you be able to add a few tests for your changes as well? If you make the changes to the BOM code that I suggested, it should be pretty easy to write a few unit tests for it (strip when a BOM is present, leave the string alone when it's not). The sqlprep changes should have tests too - that code is complex enough that the tests really come in handy when making changes.
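
Something along these lines is all I have in mind - the test names are hypothetical, and it assumes the remove_bom_if_exists() rename plus whichever module the helper ends up in:

    import codecs
    import unittest

    from pgsanity import sqlprep  # or wherever remove_bom_if_exists() ends up

    class TestRemoveBom(unittest.TestCase):
        def test_bom_stripped_when_present(self):
            sql = codecs.BOM_UTF8 + b"select 1;"
            self.assertEqual(b"select 1;", sqlprep.remove_bom_if_exists(sql))

        def test_string_unchanged_when_no_bom(self):
            sql = b"select 1;"
            self.assertEqual(sql, sqlprep.remove_bom_if_exists(sql))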

Thanks!

@markdrago (Owner)

These changes look good. Do you plan on adding the flag to enable psql stripping and supporting the double-backslash to go back to sql?

@koyae (Author) commented May 18, 2016

Once #14 looks like it's down for the count, we can circle around to the remaining transitions, which deal with psql directives.
[image: pgsanity state diagram]
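
For context, the psql-directive transitions boil down to recognizing backslash meta-commands (e.g. \copy, \set) so they can be skipped rather than handed to ecpg. A stripped-down, standalone sketch rather than the state machine itself:

    import re

    # psql meta-commands start with a backslash at the beginning of a line,
    # e.g. "\copy", "\set", "\i somefile.sql". (The double-backslash handling
    # mentioned above isn't covered by this sketch.)
    PSQL_DIRECTIVE = re.compile(r"^\s*\\[A-Za-z?!]")

    def strip_psql_directives(lines):
        # Yield only the lines that should reach ecpg.
        for line in lines:
            if not PSQL_DIRECTIVE.match(line):
                yield line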

@koyae (Author) commented May 30, 2016

Old tests are now passing as of 46573e8. The following commit (0dd9de2) passes the same tests, plus the ones it adds.

@koyae (Author) commented May 30, 2016

On performance: I just did a quick benchmark, and it looks like the earlier commit 92d4bfe runs in about 56 seconds on a 56MB file, whereas the latest commit (0dd9de2) runs in about 187 seconds because of the character-by-character method. As I've said, things can be optimized from here.

While we're on the subject, I'll note that on files of around that same ~50MB size, Python will bomb out due to memory limitations if there's less than 300MB or so available in RAM. Just based on where I saw crashes (on an extremely memory-impoverished machine), the main culprits seem to be the string slice used to remove the BOM and the call to decode(). On better-provisioned machines, I found no such issues.
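
As an aside (not something in the branch yet), both of those culprits could be collapsed into a single step, since Python's "utf-8-sig" codec drops a leading BOM while decoding (sql_bytes here standing in for the raw input):

    # Decodes UTF-8 input and silently strips a leading BOM if present,
    # avoiding the separate slice (and the extra copy it makes).
    sql_text = sql_bytes.decode("utf-8-sig")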

I don't have much experience profiling Python code for memory consumption, and I'm guessing most folks who use pgsanity will have access to decent hardware if they're doing bulk SQL operations anyway. However, if you've got specific goals in mind for the resource footprint, do let me know.

@markdrago (Owner)

Overall this looks pretty good to me. I haven't read through the main processing/transitions stuff in excruciating detail yet, but I like the overall direction. Do you think you'll be able to speed it up?

root added 2 commits June 21, 2016 01:23
…kens on the entire input string. This should prevent memory from getting exhausted so easily.
@koyae (Author) commented Jun 21, 2016

Work and life in general have been a little nuts lately, but a belated "yep" on planning to do optimization on this.

Similar to the way you'd been doing it originally, we now hop between tokens instead of going character-by-character the way the earlier proof-of-concept did. In 2f158f0, the new processing algorithm works and runs pretty swiftly, but the approach of finding all tokens ahead of time is too memory-intensive and easily wakes the OOM killer on large files.
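
The fix I have in mind is to make that pass lazy - yield one match at a time instead of materializing the whole token list up front. A minimal sketch, with a placeholder pattern rather than the real token set:

    import re

    # Placeholder token pattern: statement terminators, quotes, comment markers.
    TOKEN = re.compile(r";|'|--|/\*|\*/|\n")

    def iter_tokens(sql_text):
        # re.finditer() walks the string lazily, so only one match object is
        # alive at a time -- unlike re.findall(), which builds the entire
        # token list before any processing can start.
        for match in TOKEN.finditer(sql_text):
            yield match.group(0), match.start()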

I should be able to do another stint soon to address that.

@koyae (Author) commented Jun 22, 2016

I've gotten the memory-issues under control a bit better now, I believe.

Speed-wise, on the same 60MB file I mentioned earlier, processing now completes in just a hair less than 19 seconds, which I think is quite respectable.

When I tried a larger ~160MB file, it appeared to hang indefinitely, which I discovered was due to chardet.detect() taking its sweet time. I killed it after 18 minutes or so. Although there is a chance the file was just unusual (encoding was either latin1 or windows-1252), it probably wouldn't hurt to:

  • create an option for users to explicitly provide an encoding to use when reading the file(s)
  • create an option controlling how many characters of the file are used to auto-detect the encoding when one isn't given explicitly, defaulting to some reasonable number (a rough sketch of both follows)
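
A rough sketch of both options together - the helper name and defaults are placeholders, not anything that exists in pgsanity yet:

    import chardet

    def pick_encoding(path, explicit=None, sample_size=1024 * 1024):
        # Option 1: the user told us the encoding outright, so skip detection.
        if explicit:
            return explicit
        # Option 2: only hand chardet the first sample_size bytes rather than
        # the whole file; sample_size would come from the proposed option.
        with open(path, "rb") as f:
            sample = f.read(sample_size)
        return chardet.detect(sample)["encoding"] or "utf-8"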

Along with those items, adding the optional psql-escaping plus a few tests to go with that should be up next, which should about wrap this up unless you can think of anything else!

@markdrago (Owner)

@koyae, what do you want to do here for next steps? I'd rather not add any options to let the user control/limit the chardet detection. I would rather just pick 1024 characters or something like that and leave it at that.

I'm also curious to know what the performance difference is between the current stable version of pgsanity and the one with your changes. I believe your approach to parsing is better from a code-sanity perspective, and it handles some situations that the old code doesn't, but I don't want to regress on the speed in any significant way. If the new code is slightly slower (10, even 20 percent) then I think that's acceptable, but if it doubles the runtime then I think we'll need to figure something else out.

Again - thank you for all of the work you have put in to this so far. Please let me know how you'd like to move forward.

@koyae (Author) commented Aug 13, 2016

Hey again! Thanks for your patience on all this. Let me try to answer some of your questions.

Avoiding the need for optional command-line flags:

That's a toughie; obviously, pgsanity can't work properly if it can't decode files correctly, but it's also nice to have a tool that works point-and-shoot without extra nuances to learn. In light of the performance-test results discussed further down, the prospects of having the tool be both fast and easy to use actually look optimistic. To eke out just a touch of flexibility in case some users need it, I'd suggest the following:

  • We keep pgsanity's invocation as pgsanity <file1> [<file2>...] and refrain from adding additional run-flags.
  • We allow pgsanity to pick up the encoding from the filename, for the rare cases where a user wants to force a specific encoding or pgsanity guesses it wrong (rough sketch below).
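
A rough sketch of the filename pickup - the helper name and the exact naming convention are placeholders:

    import codecs

    def encoding_from_filename(filename):
        # "your_large_file.cp1250.sql" -> "cp1250"; returns None when the
        # second-to-last extension isn't a codec Python recognizes.
        parts = filename.rsplit(".", 2)
        if len(parts) < 3:
            return None
        candidate = parts[1]
        try:
            codecs.lookup(candidate)
        except LookupError:
            return None
        return candidate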

If the user is working with some unusual charset, and pgsanity flags a syntax error because a character parsed incorrectly (due to a wrongly guessed encoding), the secondary path would look like this:

  1. Pgsanity returns false, and also tells the user it may have chosen the wrong encoding:

    Please note: pgsanity may have incorrectly determined the encoding for your file, which can result in false detection of syntax errors.

    If you know the encoding used to create your file, you can rename it to include the codec which was used e.g. "your_large_file.cp1250.sql"

    Alternatively, you can open the file in the program used to create it, and export using one of the supported encoding types. Standard encoding types can be found at: https://docs.python.org/2.7/library/codecs.html#standard-encodings

  2. The user renames/re-saves the file as instructed.

  3. Pgsanity now takes another pass using the encoding given by the filename.

  4. Hopefully it works the second time around.

Performance considerations:

On a decent machine, encoding scans really don't impact performance that much, and I wasn't able to reproduce the freeze I had gotten earlier from scanning a big file. With the files I tested, there was only about a 1-second difference between scanning the whole file and scanning just part of it. That's all good news.

I did about 5 to 10 runs per file to see if there was much variance, but found little on the testing environment described below.

When I checked for I/O wait-time, I saw it was virtually 0, so I've omitted any stats about it.

Test-machine: AWS m4.large

Although this machine has two cores, that currently doesn't affect our speed, since pgsanity does all processing before handing anything over to ecpg. When ecpg spins up to 100% (on one core), pgsanity's CPU usage drops from 100% down to around 4%, meaning the second core only gives us a boost of around 4%.

Specs:
OS: 4.4.15-25.57.amzn1 GNU/Linux x86_64
CPU 1: Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU 2: Xeon(R) CPU E5-2676 v3 @ 2.40GHz
RAM: 8 GB
HD: 25 GB SSD
Python: v2.7.10

Test 1: 302,950 KB (trsq_not_uploading_used_out_utm_coords.sql)

Build 3fab645:

scanning 1024 to 10,000 bytes (runs were identical):

  runtime: 125 seconds
      88 seconds (pgsanity)
      37 seconds (ecpg)
  RAM footprint: 
      1.8 GB physical (pgsanity)
      409 MB physical (ecpg)

scanning 1,000,000 to 302,950,000 bytes (whole file):

  runtime: between 126 and 127 seconds
      90 seconds (pgsanity)
      37 seconds (ecpg)
  RAM footprint: 2.3 GB
      1.8 to 1.9 GB physical (pgsanity)
      409 MB physical (ecpg)

Build 00b3977:

scanning 0 bytes:

  runtime: 107 seconds
      70 seconds (pgsanity)
      37 seconds (ecpg)
  RAM footprint: 1.19 GB
      783 MB physical (pgsanity)
      409 MB physical (ecpg)

Test 2: 167,881 KB (07_tr_trs_trsq_trsqq_2.sql)

Build 3fab645:

scanning 1024 bytes:

  runtime: 73 seconds
      52 seconds (pgsanity)
      21 seconds (ecpg)
  RAM footprint: 1.23 GB
      1 GB (pgsanity)
      228 MB (ecpg)

scanning 171,909,990 bytes (whole file):

  runtime: 74 seconds
      53 seconds (pgsanity)
      21 seconds (ecpg)
  RAM footprint: 1.23 GB
      1 GB (pgsanity)
      228 MB (ecpg)

Build 00b3977:

scanning 0 bytes:

  runtime: 60 seconds
      38 seconds (pgsanity)
      22 seconds (ecpg)
  RAM footprint: 0.58 GB
      351 MB (pgsanity)
      228 MB (ecpg)

@koyae (Author) commented Aug 17, 2016

More benchmark info (using same system as described above):

scan = 00b3977
1024 = 3fab645 (1024 bytes)
full = 3fab645 (entire file)

01_db_ha_plss_ma_wr_dc_wy.sql (66MB)
    1024: 29 seconds
    full: 30 seconds
    scan: 21 seconds 


01_tr.sql (25MB)
    1024: 11 seconds
    full: 11 seconds
    scan: 8 seconds


01_trsq_trsqq.sql (101MB)
    1024: 44 seconds
    full: 46 seconds
    scan: 31 seconds


01_trs.sql (86MB)
    1024: 37 seconds
    full: 38 seconds
    scan: 26 seconds

01_xy.sql (84MB)
    1024: 33 seconds
    full: 35 seconds
    scan: 26 seconds


02_db_ha_plss_ma_wr_dc_wy_tr_trs.sql (139MB)
    1024: 62 seconds
    full: 63 seconds
    scan: 46 seconds


02_trsq_trsqq.sql (85MB)
    1024: 37 seconds
    full: 38 seconds
    scan: 26 seconds


02_xy.sql (210MB)
    1024: 72 seconds
    full: 75 seconds
    scan: 66 seconds


03_trsq_trsqq.sql (154MB)
    1024: 67 seconds
    full: 69 seconds
    scan: 49 seconds


03_xy_db_ha_plss_ma_wr_dc_wy_tr_trs.sql (124MB)
    1024: 54 seconds
    full: 56 seconds
    scan: 39 seconds


04_db_ha_plss_ma_wr_dc_wy.sql (111MB)
    1024: 50 seconds
    full: 51 seconds
    scan: 36 seconds


04_trsqq.sql (56MB)
    1024: 24 seconds
    full: 25 seconds
    scan: 17 seconds


04_trsq.sql (295MB)
    1024: 127 seconds
    full: 132 seconds
    scan: 94 seconds


04_tr_trs.sql (229MB)
    1024: 100 seconds
    full: 103 seconds
    scan: 74 seconds


04_xy.sql (121MB)
    1024: 52 seconds
    full: 54 seconds
    scan: 39 seconds


05_db_ha_plss_ma_wr_dc_wy.sql (140MB)
    1024: 61 seconds
    full: 64 seconds
    scan: 44 seconds


05_xy.sql (190MB)
    1024: 86 seconds
    full: 87 seconds
    scan: 63 seconds


06_db_ha_plss_ma_wr_dc_wy.sql (118MB)
    1024: 51 seconds
    full: 53 seconds
    scan: 37 seconds


06_invalidplss_tr_trs_trsq_trsqq.sql (275MB)
    1024: 118 seconds
    full: 123 seconds
    scan: 90 seconds


06_xy.sql (172MB)
    1024: 54 seconds
    full: 56 seconds
    scan: 55 seconds


07_db_ha_plss_ma_wr_dc_wy.sql (151MB)
    1024: 67 seconds
    full: 68 seconds
    scan: 48 seconds


07_tr_trs_trsq_trsqq_2.sql (164MB)
    1024: 71 seconds
    full: 74 seconds
    scan: 52 seconds


07_tr_trs_trsq_trsqq_3.sql (127MB)
    1024: 56 seconds
    full: 57 seconds
    scan: 40 seconds


07_tr_trs_trsq_trsqq.sql (202MB)
    1024: 89 seconds
    full: 91 seconds
    scan: 62 seconds


07_xy.sql (132MB)
    1024: 57 seconds
    full: 60 seconds
    scan: 40 seconds


09.sql (59MB)
    1024: 25 seconds
    full: 26 seconds
    scan: 17 seconds


10.sql (48MB)
    1024: 18 seconds
    full: 19 seconds
    scan: 14 seconds


35.sql (68MB)
    1024: 28 seconds
    full: 29 seconds
    scan: 20 seconds

As I was running these tests, I found a few files (00a.sql, 00b.sql, 00.sql, and 08.sql – not shown here) wherein the _full_ scan either crashed out or took a very long time to complete. I'm going to see if there's anything I can glean from looking at them more carefully. These were mostly smaller files, so I believe the cause is content-related. If that's true, then length is more or less irrelevant, which means scanning only 1024 bytes may merely _decrease_ the chances of encountering a poison sequence under detect().

If my theory holds any water, there's a chance the chardet maintainers may wish to patch their code. I think the balanced approach – at least for now – is probably creating a UniversalDetector, as they recommend for our type of situation in their docs:

If you’re dealing with a large amount of text... Create a UniversalDetector object, then call its feed method repeatedly with each block of text. If the detector reaches a minimum threshold of confidence, it will set detector.done to True.
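
Translated into code, that would look something like the following (a sketch; the chunk size is arbitrary):

    from chardet.universaldetector import UniversalDetector

    def detect_encoding_incrementally(path, chunk_size=64 * 1024):
        # Feed the detector block by block and stop as soon as it is
        # confident, instead of handing detect() the entire file at once.
        detector = UniversalDetector()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                detector.feed(chunk)
                if detector.done:
                    break
        detector.close()
        return detector.result["encoding"]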

@markdrago (Owner)

@koyae, I'm not exactly sure how to proceed with this. You've put a ton of work into this and there are parts of it that are really good and that I think we could merge, but I'm nervous about the size of the change and the potential for disruption. I don't have a solid understanding of the performance impact and I'm not sure about the chardet issues.

I think the best approach going forward would be to break this pull request up into smaller pieces that are easier to review, test, and reason about. If you agree with that approach, I think we should close out this pull request, and if you create new, smaller PRs with some of these changes broken out, we'll get them merged. What do you think?

@koyae (Author) commented Oct 5, 2016

To summarize the performance-numbers posted above:

  • The whole-file character-detection version (3fab645) of pgsanity runs slower than your last stable build (00b3977) by a factor of about 1.425 (median difference).
  • The first-1024-bytes-only detection version (3fab645) of pgsanity runs slower than your last stable build (00b3977) by a factor of about 1.38 (median difference).

In other words, passing 1024 characters to determine encoding is faster than passing the whole file by a margin of between 3% and 4%. The _rest_ of the difference is due to the tokenization which was added since your last commit.


Sidenote: I've now tested chardet.detect() independently of pgsanity, and I'm quite certain chardet is not responsible for certain files dragging on the way they did. I checked the behavior under the combinations:

  • chardet 2.0.1 + Python 2.7.12
  • chardet 2.3.0 + Python 2.7.10

Sidenote 2: I did a comparison, and whoever said that using a StringIO object is slower than just catting text to some massive string-variable is completely full of it; the use of StringIO in pgsanity.py's prepare_sql() is perfectly appropriate and performed _MUCH_ better than doing something like someString += nextPiece a zillion times.
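
A toy reproduction of the kind of comparison in question (not the actual test that was run):

    import io
    import timeit

    PIECES = [u"select %d;\n" % i for i in range(200000)]

    def build_with_stringio():
        buf = io.StringIO()
        for piece in PIECES:
            buf.write(piece)
        return buf.getvalue()

    def build_with_concat():
        out = u""
        for piece in PIECES:
            # Worst case this re-copies the growing string on every pass;
            # CPython sometimes optimizes the append in place, but that
            # behavior isn't guaranteed.
            out += piece
        return out

    print(timeit.timeit(build_with_stringio, number=5))
    print(timeit.timeit(build_with_concat, number=5))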

@koyae (Author) commented Oct 5, 2016

I'd be fine breaking up the various changes as you suggest.

Roughly, I'd organize it:

  • BOM-handling + related tests
  • Parsing + related tests
  • Skipping psql-commands + related tests
  • Performance enhancements (e.g. statement-at-once checking, which would net us an average 50% improvement in runtime, leaving pgsanity faster than it was at 00b3977, not to mention reducing the memory footprint)
  • Handling for alternate encodings

Does that grouping sound alright?
