parse charset and decode text #11

Closed
wants to merge 4 commits into from
16 changes: 15 additions & 1 deletion README.rst
@@ -31,7 +31,7 @@ To parse a string containing a ``data:`` uri, use ``parse()``:

.. code-block:: python

- >>> parsed = datauri.parse('data:text/plain,A%20brief%20note')
+ >>> parsed = datauri.parse('data:text/plain;charset=UTF-8,A%20brief%20note')

This returns a parse result:

@@ -43,6 +43,10 @@ This returns a parse result:
b'A brief note'
>>> parsed.uri
'data:text/plain,A%20brief%20note'
>>> parsed.charset
'UTF-8'
>>> parsed.text
'A brief note'

This is a simple container class with a few attributes:

@@ -54,6 +58,12 @@ This is a simple container class with a few attributes:
* The ``data`` attribute is a byte string (``bytes``) with the decoded
data. URL encoding and base64 is handled transparently.

* The ``charset`` attribute is the charset exactly as given in the URI,
if one is specified; ``None`` otherwise.

* The ``text`` attribute is ``data`` decoded using ``charset``. If
``charset`` is not specified, ``ascii`` is used. For media types
other than ``text/*``, ``text`` is ``None``.

* For convenience, the ``uri`` attribute contains the input uri.

Parsed URIs compare equal if their media type and data are the same.
@@ -116,6 +126,10 @@ Please use Github issues to report problems or propose improvements.
Version history
===============

* 1.0.1

Added ``charset`` and ``text`` properties.

* 1.0.0

Initial release.
30 changes: 30 additions & 0 deletions datauri/datauri.py
@@ -22,6 +22,21 @@ class DataURIError(ValueError):
pass


# https://github.com/bottlepy/bottle/commit/fa7733e075da0d790d809aa3d2f53071897e6f76
class CachedProperty(object):
Collaborator:

nit: cached_property (or an alias cached_property = CachedProperty) makes usage look more like @property, for which this is a drop-in replacement anyway

Author:

Fixed 👍

    def __init__(self, func):
        self.func = func

    def __get__(self, obj, cls):
        if obj is None:
            return self
        value = obj.__dict__[self.func.__name__] = self.func(obj)
        return value


cached_property = CachedProperty
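
A minimal sketch of why this descriptor caches, assuming standard Python attribute lookup: ``CachedProperty`` defines only ``__get__`` (a non-data descriptor), so the value it stores in the instance ``__dict__`` under the function's name shadows the descriptor on later lookups. The ``Example`` class and its ``calls`` counter are hypothetical and exist only for illustration.

```python
class CachedProperty(object):
    def __init__(self, func):
        self.func = func

    def __get__(self, obj, cls):
        if obj is None:
            # Accessed on the class itself: return the descriptor.
            return self
        # Compute once and store under the same name in the instance dict.
        value = obj.__dict__[self.func.__name__] = self.func(obj)
        return value


cached_property = CachedProperty


class Example:
    """Hypothetical class used only to demonstrate the caching."""
    calls = 0

    @cached_property
    def answer(self):
        Example.calls += 1
        return 42


example = Example()
first = example.answer   # runs the function and stores the result
second = example.answer  # served from example.__dict__, function not called
```

On Python 3.8 and later, the standard library's ``functools.cached_property`` offers similar behaviour.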


class ParsedDataURI:
"""
Container for parsed data URIs.
@@ -33,6 +48,21 @@ def __init__(self, media_type, data, uri):
self.data = data
self.uri = uri

    @cached_property
    def charset(self):
        prefix = 'charset='
        chunks = self.media_type.split(';')
        for chunk in chunks:
            if chunk.startswith(prefix):
                return chunk[len(prefix):]
        return None

    @cached_property
    def text(self):
        if not self.media_type.startswith('text/'):
Collaborator:

intuitively i'd say utf-8 is a saner and more useful default. or does the spec explicitly forbid that?

Author:

Yes, it does:

If one is not specified, the media type of the data URI is assumed to be text/plain;charset=US-ASCII

Collaborator:

ok cool ascii it is then

            return None
        return self.data.decode(self.charset or 'ascii')
Collaborator:

this will raise LookupError in case the encoding is a weird name like foo. should we error out or return None in that case?

Author:

I think something should be raised, so that library users can decide how to handle such cases themselves.
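
As a rough sketch of what that leaves to the caller: decoding can fail with ``LookupError`` (unknown codec name) or ``UnicodeDecodeError`` (bytes that do not fit the charset). The helpers ``decode_text`` and ``decode_or_none`` below are hypothetical stand-ins, not part of this library; ``decode_text`` only mirrors the ``decode(charset or 'ascii')`` call from the diff.

```python
def decode_text(data, charset=None):
    """Hypothetical stand-in for the ``text`` property's decoding step."""
    return data.decode(charset or 'ascii')


def decode_or_none(data, charset=None):
    """Hypothetical caller-side policy: treat undecodable input as missing."""
    try:
        return decode_text(data, charset)
    except LookupError:
        # The charset name is not a codec Python knows, e.g. 'foo'.
        return None
    except UnicodeDecodeError:
        # The bytes are not valid in the given (or default ASCII) charset.
        return None


plain = decode_or_none(b'A brief note')
unknown_codec = decode_or_none(b'A brief note', 'foo')
non_ascii = decode_or_none('олег'.encode('utf-8'))
utf8 = decode_or_none('олег'.encode('utf-8'), 'UTF-8')
```

A caller who prefers a lenient fallback could instead retry with another charset in the ``except`` branches; that choice is exactly what raising leaves open.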


def __repr__(self):
raw = self.data
if len(raw) > 20:
2 changes: 1 addition & 1 deletion setup.py
@@ -8,7 +8,7 @@
name='datauri',
description="implementation of the data uri scheme defined in rfc2397",
long_description=long_description,
- version='1.0.0',
+ version='1.0.1',
author="EclecticIQ",
author_email="info@eclecticiq.com",
packages=['datauri'],
11 changes: 11 additions & 0 deletions tests/test_datauri.py
@@ -68,6 +68,17 @@ def test_discover():
assert actual == expected


@pytest.mark.parametrize('data, charset, text', [
    ('data:text/plain;charset=UTF-8;base64,0L7Qu9C10LM=', 'UTF-8', 'олег'),
    ('data:text/plain;base64,YW55IGNhcm5hbCBwbGVhc3Vy', None, 'any carnal pleasur'),
    ('data:image/png;base64,YW55IGNhcm5hbCBwbGVhc3Vy', None, None),
])
def test_text_decoding(data, charset, text):
    parsed = datauri.parse(data)
    assert parsed.charset == charset
    assert parsed.text == text
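
The expected strings in these cases can be checked independently of the library. The sketch below uses only the standard ``base64`` module to decode the two text payloads; the ``CASES`` list is a hypothetical restatement of the test parameters, with ``ascii`` written out where the test passes no charset.

```python
import base64

# (base64 payload, charset used for decoding, expected text)
CASES = [
    ('0L7Qu9C10LM=', 'utf-8', 'олег'),
    ('YW55IGNhcm5hbCBwbGVhc3Vy', 'ascii', 'any carnal pleasur'),
]

# Decode base64 to bytes, then bytes to text with the stated charset.
decoded = [
    base64.b64decode(payload).decode(charset)
    for payload, charset, _expected in CASES
]
```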


def test_container_equality():
a = datauri.parse(SAMPLE_URL_ENCODED)
b = datauri.parse(SAMPLE_URL_ENCODED)
3 changes: 3 additions & 0 deletions tox.ini
@@ -6,3 +6,6 @@ deps=-rrequirements-test.txt
commands=
pytest --cov {envsitepackagesdir}/datauri {posargs} tests/
flake8 datauri/

[flake8]
max-line-length = 90
Collaborator:

this seems unrelated but ok

Author:

Yes, it's not related to the issue, but with 79 chars I can't even include the link to the source of CachedProperty without splitting it. That's silly.