Conversation

@koyae commented May 15, 2016

I made some fairly minor alterations in the interest of making pgsanity easier to use. The changes address #5 for both piped and file-based input, plus an edit that allows pgsanity to deal with psql directives that might show up in incoming data. If the latter is not desired (at least as the default behavior), I can either move those changes to a separate branch and submit another pull request, or we could add a flag that enables the ignore-psql-directives behavior.

bom_present = check_for_bom(nose)
sql_string = sql_string[len(nose):] if bom_present else sql_string
success, msg = check_string(sql_string.decode("utf-8"))
# ^ The above call to decode() is safe for both ASCII and UTF-8 data.

@markdrago (Owner)

This is a bit nit-picky, but I think I'd rather see all of this BOM stuff move into check_for_bom(), which probably means that check_for_bom() would become remove_bom_if_exists(). It would take the full sql string and return the full sql string (with the BOM removed if it exists). I think that encapsulates the BOM handling a bit more.
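
Roughly what I'm picturing - just a sketch, and it assumes the SQL still arrives as a UTF-8 byte string at that point:

    import codecs

    def remove_bom_if_exists(sql_string):
        # Take the full SQL byte string; strip a leading UTF-8 BOM if present,
        # otherwise hand the string back untouched.
        if sql_string.startswith(codecs.BOM_UTF8):
            return sql_string[len(codecs.BOM_UTF8):]
        return sql_string

The caller then becomes check_string(remove_bom_if_exists(sql_string).decode("utf-8")) and all of the BOM handling stays in one place.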

@koyae (Author) commented May 17, 2016

I had just noticed all of the other check_something() functions hanging out in this file, and at the time I figured I'd go with the flow! Anyway, I'm happy to rename the function and wrap that logic up all the way.

@koyae (Author)

I've tried to neaten this up as per your suggestions: ca34532.

@markdrago (Owner)

Thanks for sending this PR over - I'm hopeful that we'll get it merged in soon. Would you be able to add a few tests for your changes as well? If you make the changes to the BOM code that I suggested, it should be pretty easy to write a few unit tests for it (strip when a BOM is present, leave the string alone when it's not). The sqlprep changes should have tests too - that code is complex enough that the tests really come in handy when making changes.
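
Something along these lines is all I have in mind - the test names are hypothetical, and it assumes the remove_bom_if_exists() rename plus whichever module the helper ends up in:

    import codecs
    import unittest

    from pgsanity import sqlprep  # or wherever remove_bom_if_exists() ends up

    class TestRemoveBom(unittest.TestCase):
        def test_bom_stripped_when_present(self):
            sql = codecs.BOM_UTF8 + b"select 1;"
            self.assertEqual(b"select 1;", sqlprep.remove_bom_if_exists(sql))

        def test_string_unchanged_when_no_bom(self):
            sql = b"select 1;"
            self.assertEqual(sql, sqlprep.remove_bom_if_exists(sql))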

Thanks!

@markdrago (Owner)

These changes look good. Do you plan on adding the flag to enable psql stripping and supporting the double-backslash to go back to sql?

@koyae (Author) commented May 18, 2016

Once #14 looks like it's down for the count, we can circle around to the remaining transitions, which deal with psql directives.
[image: pgsanity state diagram]
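
For context, the psql-directive transitions boil down to recognizing backslash meta-commands (e.g. \copy, \set) so they can be skipped rather than handed to ecpg. A stripped-down, standalone sketch rather than the state machine itself:

    import re

    # psql meta-commands start with a backslash at the beginning of a line,
    # e.g. "\copy", "\set", "\i somefile.sql". (The double-backslash handling
    # mentioned above isn't covered by this sketch.)
    PSQL_DIRECTIVE = re.compile(r"^\s*\\[A-Za-z?!]")

    def strip_psql_directives(lines):
        # Yield only the lines that should reach ecpg.
        for line in lines:
            if not PSQL_DIRECTIVE.match(line):
                yield line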

@koyae (Author) commented May 30, 2016

Old tests are now passing as of 46573e8. The following commit (0dd9de2) passes the same tests, plus the ones it adds.

@koyae (Author) commented May 30, 2016

On performance: I just did a quick benchmark, and it looks like the earlier commit 92d4bfe runs in about 56 seconds on a 56MB file, whereas the latest commit (0dd9de2) runs in about 187 seconds because of the character-by-character method. As I've said, things can be optimized from here.

While we're on the subject, I'll note that on files of around that same ~50MB size, Python will bomb out due to memory limitations if there's less than 300MB or so available in RAM. Just based on where I saw crashes (on an extremely memory-impoverished machine), the main culprits seem to be the string slice used to remove the BOM and the call to decode(). On better-provisioned machines, I found no such issues.
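
As an aside (not something in the branch yet), both of those culprits could be collapsed into a single step, since Python's "utf-8-sig" codec drops a leading BOM while decoding (sql_bytes here standing in for the raw input):

    # Decodes UTF-8 input and silently strips a leading BOM if present,
    # avoiding the separate slice (and the extra copy it makes).
    sql_text = sql_bytes.decode("utf-8-sig")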

I don't have much experience profiling Python code for memory consumption, and I'm guessing most folks who use pgsanity will have access to decent hardware if they're doing bulk SQL operations anyway. However, if you've got specific goals in mind for the resource footprint, do let me know.

@markdrago (Owner)

Overall this looks pretty good to me. I haven't read through the main processing/transitions stuff in excruciating detail yet, but I like the overall direction. Do you think you'll be able to speed it up?

root added 2 commits June 21, 2016 01:23
…kens on the entire input string. This should prevent memory from getting exhausted so easily.
@koyae (Author) commented Jun 21, 2016

Work and life in general have been a little nuts lately, but a belated "yep" on planning to do optimization on this.

Similar to the way you'd been doing it originally, we now hop between tokens instead of going character-by-character the way the earlier proof-of-concept did. In 2f158f0, the new processing algorithm works and runs pretty swiftly, but the approach of finding all tokens ahead of time is too memory-intensive and easily wakes the OOM killer on large files.
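
The fix I have in mind is to make that pass lazy - yield one match at a time instead of materializing the whole token list up front. A minimal sketch, with a placeholder pattern rather than the real token set:

    import re

    # Placeholder token pattern: statement terminators, quotes, comment markers.
    TOKEN = re.compile(r";|'|--|/\*|\*/|\n")

    def iter_tokens(sql_text):
        # re.finditer() walks the string lazily, so only one match object is
        # alive at a time -- unlike re.findall(), which builds the entire
        # token list before any processing can start.
        for match in TOKEN.finditer(sql_text):
            yield match.group(0), match.start()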

I should be able to do another stint soon to address that.

@koyae (Author) commented Jun 22, 2016

I've gotten the memory-issues under control a bit better now, I believe.

Speed-wise, on the same 60MB file I mentioned earlier, processing now completes in just a hair less than 19 seconds, which I think is quite respectable.

When I tried a larger ~160MB file, it appeared to hang indefinitely, which I discovered was due to chardet.detect() taking its sweet time. I killed it after 18 minutes or so. Although there is a chance the file was just unusual (encoding was either latin1 or windows-1252), it probably wouldn't hurt to:

  • create an option for users to explicitly provide an encoding to use when reading the file(s)
  • create an option controlling how many characters of the file are used to auto-detect the encoding when one isn't given explicitly, defaulting to some reasonable number (a rough sketch of both follows)
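
A rough sketch of both options together - the helper name and defaults are placeholders, not anything that exists in pgsanity yet:

    import chardet

    def pick_encoding(path, explicit=None, sample_size=1024 * 1024):
        # Option 1: the user told us the encoding outright, so skip detection.
        if explicit:
            return explicit
        # Option 2: only hand chardet the first sample_size bytes rather than
        # the whole file; sample_size would come from the proposed option.
        with open(path, "rb") as f:
            sample = f.read(sample_size)
        return chardet.detect(sample)["encoding"] or "utf-8"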

Along with those items, adding the optional psql-escaping plus a few tests to go with that should be up next, which should about wrap this up unless you can think of anything else!

@markdrago (Owner)

@koyae, what do you want to do here for next steps? I'd rather not add any options to let the user control/limit the chardet detection. I would rather just pick 1024 characters or something like that and leave it at that.

I'm also curious to know what the performance difference is between the current stable version of pgsanity and the one with your changes. I believe your approach to parsing is better from a code-sanity perspective, and it handles some situations that the old code doesn't, but I don't want to regress on the speed in any significant way. If the new code is slightly slower (10, even 20 percent) then I think that's acceptable, but if it doubles the runtime then I think we'll need to figure something else out.

Again - thank you for all of the work you have put in to this so far. Please let me know how you'd like to move forward.

@koyae (Author) commented Aug 13, 2016

Hey again! Thanks for your patience on all this. Let me try to answer some of your questions.

Avoiding the need for optional command-line flags:

That's a toughie; obviously, pgsanity can't work properly if it can't decode files correctly, but it's also nice to have a tool that works point-and-shoot without extra nuances to learn. In light of the performance-test results discussed further down, the prospects of having the tool be both fast and easy to use actually look optimistic. To eke out just a touch of flexibility in case some users need it, I'd suggest the following:

  • We keep pgsanity's invocation as pgsanity <file1> [<file2>...] and refrain from adding additional run-flags.
  • We allow pgsanity to pick up the encoding from the filename, for the rare cases where a user wants to force a specific encoding or pgsanity guesses it wrong (rough sketch below).
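
A rough sketch of the filename pickup - the helper name and the exact naming convention are placeholders:

    import codecs

    def encoding_from_filename(filename):
        # "your_large_file.cp1250.sql" -> "cp1250"; returns None when the
        # second-to-last extension isn't a codec Python recognizes.
        parts = filename.rsplit(".", 2)
        if len(parts) < 3:
            return None
        candidate = parts[1]
        try:
            codecs.lookup(candidate)
        except LookupError:
            return None
        return candidate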

If the user is working with some unusual charset, and pgsanity flags a syntax error because a character parsed incorrectly (due to a wrongly guessed encoding), the secondary path would look like this:

  1. Pgsanity returns false, and also tells the user it may have chosen the wrong encoding:

    Please note: pgsanity may have incorrectly determined the encoding for your file, which can result in false detection of syntax errors.

    If you know the encoding used to create your file, you can rename it to include the codec which was used e.g. "your_large_file.cp1250.sql"

    Alternatively, you can open the file in the program used to create it, and export using one of the supported encoding types. Standard encoding types can be found at: https://docs.python.org/2.7/library/codecs.html#standard-encodings

  2. The user renames/re-saves the file as instructed.

  3. Pgsanity now takes another pass using the encoding given by the filename.

  4. Hopefully it works the second time around.

Performance considerations:

On a decent machine, encoding scans really don't impact performance that much, and I wasn't able to reproduce the freeze I had gotten earlier from scanning a big file. With the files I tested, there was only about a 1-second difference between scanning the whole file and scanning just part of it. That's all good news.

I did about 5 to 10 runs per file to see if there was much variance, but found little on the testing environment described below.

When I checked for I/O wait-time, I saw it was virtually 0, so I've omitted any stats about it.

Test-machine: AWS m4.large

Although this machine has two cores, that currently doesn't affect our speed, since pgsanity does all processing before handing anything over to ecpg. When ecpg spins up to 100% (on one core), pgsanity's CPU usage drops from 100% down to around 4%, meaning the second core only gives us a boost of around 4%.

Specs:
OS: 4.4.15-25.57.amzn1 GNU/Linux x86_64
CPU 1: Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU 2: Xeon(R) CPU E5-2676 v3 @ 2.40GHz
RAM: 8 GB
HD: 25 GB SSD
Python: v2.7.10

Test 1: 302,950 KB (trsq_not_uploading_used_out_utm_coords.sql)

Build 3fab645:

scanning 1024 to 10,000 bytes (runs were identical):

  runtime: 125 seconds
      88 seconds (pgsanity)
      37 seconds (ecpg)
  RAM footprint: 
      1.8 GB physical (pgsanity)
      409 MB physical (ecpg)

scanning 1,000,000 to 302,950,000 bytes (whole file):

  runtime: between 126 and 127 seconds
      90 seconds (pgsanity)
      37 seconds (ecpg)
  RAM footprint: 2.3 GB
      1.8 to 1.9 GB physical (pgsanity)
      409 MB physical (ecpg)

Build 00b3977:

scanning 0 bytes:

  runtime: 107 seconds
      70 seconds (pgsanity)
      37 seconds (ecpg)
  RAM footprint: 1.19 GB
      783 MB physical (pgsanity)
      409 MB physical (ecpg)

Test 2: 167,881 KB (07_tr_trs_trsq_trsqq_2.sql)

Build 3fab645:

scanning 1024 bytes:

  runtime: 73 seconds
      52 seconds (pgsanity)
      21 seconds (ecpg)
  RAM footprint: 1.23 GB
      1 GB (pgsanity)
      228 MB (ecpg)

scanning 171,909,990 bytes (whole file):

  runtime: 74 seconds
      53 seconds (pgsanity)
      21 seconds (ecpg)
  RAM footprint: 1.23 GB
      1 GB (pgsanity)
      228 MB (ecpg)

Build 00b3977:

scanning 0 bytes:

  runtime: 60 seconds
      38 seconds (pgsanity)
      22 seconds (ecpg)
  RAM footprint: 0.58 GB
      351 MB (pgsanity)
      228 MB (ecpg)

@koyae (Author) commented Aug 17, 2016

More benchmark info (using same system as described above):

scan = 00b3977
1024 = 3fab645 (1024 bytes)
full = 3fab645 (entire file)

01_db_ha_plss_ma_wr_dc_wy.sql (66MB)
    1024: 29 seconds
    full: 30 seconds
    scan: 21 seconds 


01_tr.sql (25MB)
    1024: 11 seconds
    full: 11 seconds
    scan: 8 seconds


01_trsq_trsqq.sql (101MB)
    1024: 44 seconds
    full: 46 seconds
    scan: 31 seconds


01_trs.sql (86MB)
    1024: 37 seconds
    full: 38 seconds
    scan: 26 seconds

01_xy.sql (84MB)
    1024: 33 seconds
    full: 35 seconds
    scan: 26 seconds


02_db_ha_plss_ma_wr_dc_wy_tr_trs.sql (139MB)
    1024: 62 seconds
    full: 63 seconds
    scan: 46 seconds


02_trsq_trsqq.sql (85MB)
    1024: 37 seconds
    full: 38 seconds
    scan: 26 seconds


02_xy.sql (210MB)
    1024: 72 seconds
    full: 75 seconds
    scan: 66 seconds


03_trsq_trsqq.sql (154MB)
    1024: 67 seconds
    full: 69 seconds
    scan: 49 seconds


03_xy_db_ha_plss_ma_wr_dc_wy_tr_trs.sql (124MB)
    1024: 54 seconds
    full: 56 seconds
    scan: 39 seconds


04_db_ha_plss_ma_wr_dc_wy.sql (111MB)
    1024: 50 seconds
    full: 51 seconds
    scan: 36 seconds


04_trsqq.sql (56MB)
    1024: 24 seconds
    full: 25 seconds
    scan: 17 seconds


04_trsq.sql (295MB)
    1024: 127 seconds
    full: 132 seconds
    scan: 94 seconds


04_tr_trs.sql (229MB)
    1024: 100 seconds
    full: 103 seconds
    scan: 74 seconds


04_xy.sql (121MB)
    1024: 52 seconds
    full: 54 seconds
    scan: 39 seconds


05_db_ha_plss_ma_wr_dc_wy.sql (140MB)
    1024: 61 seconds
    full: 64 seconds
    scan: 44 seconds


05_xy.sql (190MB)
    1024: 86 seconds
    full: 87 seconds
    scan: 63 seconds


06_db_ha_plss_ma_wr_dc_wy.sql (118MB)
    1024: 51 seconds
    full: 53 seconds
    scan: 37 seconds


06_invalidplss_tr_trs_trsq_trsqq.sql (275MB)
    1024: 118 seconds
    full: 123 seconds
    scan: 90 seconds


06_xy.sql (172MB)
    1024: 54 seconds
    full: 56 seconds
    scan: 55 seconds


07_db_ha_plss_ma_wr_dc_wy.sql (151MB)
    1024: 67 seconds
    full: 68 seconds
    scan: 48 seconds


07_tr_trs_trsq_trsqq_2.sql (164MB)
    1024: 71 seconds
    full: 74 seconds
    scan: 52 seconds


07_tr_trs_trsq_trsqq_3.sql (127MB)
    1024: 56 seconds
    full: 57 seconds
    scan: 40 seconds


07_tr_trs_trsq_trsqq.sql (202MB)
    1024: 89 seconds
    full: 91 seconds
    scan: 62 seconds


07_xy.sql (132MB)
    1024: 57 seconds
    full: 60 seconds
    scan: 40 seconds


09.sql (59MB)
    1024: 25 seconds
    full: 26 seconds
    scan: 17 seconds


10.sql (48MB)
    1024: 18 seconds
    full: 19 seconds
    scan: 14 seconds


35.sql (68MB)
    1024: 28 seconds
    full: 29 seconds
    scan: 20 seconds

As I was running these tests, I found a few files (00a.sql, 00b.sql, 00.sql, and 08.sql – not shown here) wherein the _full_ scan either crashed out or took a very long time to complete. I'm going to see if there's anything I can glean from looking at them more carefully. These were mostly smaller files, so I believe the cause is content-related. If that's true, then length is more or less irrelevant, which means scanning only 1024 bytes may merely _decrease_ the chances of encountering a poison sequence under detect().

If my theory holds any water, there's a chance the chardet maintainers may wish to patch their code. I think the balanced approach – at least for now – is probably creating a UniversalDetector, as they recommend for our type of situation in their docs:

If you’re dealing with a large amount of text... Create a UniversalDetector object, then call its feed method repeatedly with each block of text. If the detector reaches a minimum threshold of confidence, it will set detector.done to True.
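
Translated into code, that would look something like the following (a sketch; the chunk size is arbitrary):

    from chardet.universaldetector import UniversalDetector

    def detect_encoding_incrementally(path, chunk_size=64 * 1024):
        # Feed the detector block by block and stop as soon as it is
        # confident, instead of handing detect() the entire file at once.
        detector = UniversalDetector()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                detector.feed(chunk)
                if detector.done:
                    break
        detector.close()
        return detector.result["encoding"]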

@markdrago (Owner)

@koyae, I'm not exactly sure how to proceed with this. You've put a ton of work into this and there are parts of it that are really good and that I think we could merge, but I'm nervous about the size of the change and the potential for disruption. I don't have a solid understanding of the performance impact and I'm not sure about the chardet issues.

I think the best approach going forward would be to break this pull request up into smaller pieces that are easier to review, test, and reason about. If you agree with that approach, I think we should close out this pull request, and if you create new, smaller PRs with some of these changes broken out, we'll get them merged. What do you think?

@koyae (Author) commented Oct 5, 2016

To summarize the performance-numbers posted above:

  • The whole-file character-detection version (3fab645) of pgsanity runs slower than your last stable build (00b3977) by a factor of about 1.425 (median difference).
  • The first-1024-bytes-only detection version (3fab645) of pgsanity runs slower than your last stable build (00b3977) by a factor of about 1.38 (median difference).

In other words, passing 1024 characters to determine encoding is faster than passing the whole file by a margin of between 3% and 4%. The _rest_ of the difference is due to the tokenization which was added since your last commit.


Sidenote: I've now tested chardet.detect() independently of pgsanity, and I'm quite certain chardet is not responsible for certain files dragging on the way they did. I checked the behavior under the combinations:

  • chardet 2.0.1 + Python 2.7.12
  • chardet 2.3.0 + Python 2.7.10

Sidenote 2: I did a comparison, and whoever said that using a StringIO object is slower than just catting text to some massive string-variable is completely full of it; the use of StringIO in pgsanity.py's prepare_sql() is perfectly appropriate and performed _MUCH_ better than doing something like someString += nextPiece a zillion times.
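
A toy reproduction of the kind of comparison in question (not the actual test that was run):

    import io
    import timeit

    PIECES = [u"select %d;\n" % i for i in range(200000)]

    def build_with_stringio():
        buf = io.StringIO()
        for piece in PIECES:
            buf.write(piece)
        return buf.getvalue()

    def build_with_concat():
        out = u""
        for piece in PIECES:
            # Worst case this re-copies the growing string on every pass;
            # CPython sometimes optimizes the append in place, but that
            # behavior isn't guaranteed.
            out += piece
        return out

    print(timeit.timeit(build_with_stringio, number=5))
    print(timeit.timeit(build_with_concat, number=5))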

@koyae (Author) commented Oct 5, 2016

I'd be fine breaking up the various changes as you suggest.

Roughly, I'd organize it:

  • BOM-handling + related tests
  • Parsing + related tests
  • Skipping psql-commands + related tests
  • Performance enhancements (e.g. statement-at-once checking, which would net us an average 50% improvement in runtime, leaving pgsanity faster than it was at 00b3977, not to mention reducing the memory footprint)
  • Handling for alternate encodings

Does that grouping sound alright?
