diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md deleted file mode 100644 index e01f8a0b..00000000 --- a/CONTRIBUTING.md +++ /dev/null @@ -1,43 +0,0 @@ -# Contributing - -The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards). - -## Getting Started - -Recommended way to get started is to create and activate a project virtual environment. -To install package and development dependencies into active environment: - -``` -$ make install -``` - -## Testing - -To run tests with linting and coverage: - -``` -$ make test -``` - -For linting `pylama` configured in `pylama.ini` is used. On this stage it's already -installed into your environment and could be used separately with more fine-grained control -as described in documentation - https://pylama.readthedocs.io/en/latest/. - -For example to sort results by error type: - -``` -$ pylama --sort -``` - -For testing `tox` configured in `tox.ini` is used. -It's already installed into your environment and could be used separately with more fine-grained control as described in documentation - https://testrun.org/tox/latest/. - -For example to check subset of tests against Python 2 environment with increased verbosity. -All positional arguments and options after `--` will be passed to `py.test`: - -``` -tox -e py27 -- -v tests/ -``` - -Under the hood `tox` uses `pytest` configured in `pytest.ini`, `coverage` -and `mock` packages. This packages are available only in tox envionments. diff --git a/Makefile b/Makefile index 2900e8c7..a1c932e7 100644 --- a/Makefile +++ b/Makefile @@ -8,7 +8,7 @@ VERSION := $(shell head -n 1 $(PACKAGE)/VERSION) all: list install: - pip install --upgrade -e .[develop] + pip install --upgrade -e .[ods,develop] list: @grep '^\.PHONY' Makefile | cut -d' ' -f2- | tr ' ' '\n' diff --git a/README.md b/README.md index 9b774393..9d3a6c67 100644 --- a/README.md +++ b/README.md @@ -3,99 +3,763 @@ [![Travis](https://img.shields.io/travis/frictionlessdata/tabulator-py/master.svg)](https://travis-ci.org/frictionlessdata/tabulator-py) [![Coveralls](http://img.shields.io/coveralls/frictionlessdata/tabulator-py.svg?branch=master)](https://coveralls.io/r/frictionlessdata/tabulator-py?branch=master) [![PyPi](https://img.shields.io/pypi/v/tabulator.svg)](https://pypi.python.org/pypi/tabulator) -[![SemVer](https://img.shields.io/badge/versions-SemVer-brightgreen.svg)](http://semver.org/) [![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat) -Consistent interface for stream reading and writing tabular data (csv/xls/json/etc). +A library for reading and writing tabular data (csv/xls/json/etc). -> Release `v0.10` contains changes in `exceptions` module introduced in NOT backward-compatibility manner. +> Version v1.0 includes deprecated API removal and provisional API changes. Please read a [migration guide](#v10). ## Features -- supports various formats: csv/tsv/xls/xlsx/json/ndjson/ods/gsheet/native/etc -- reads data from variables, filesystem or Internet +- supports various formats: csv/tsv/xls/xlsx/json/ndjson/ods/gsheet/inline/sql/etc +- reads data from local, remote, stream or text sources - streams data instead of using a lot of memory - processes data via simple user processors - saves data using the same interface +- custom loaders, parsers and writers -## Getting Started +## Getting started ### Installation -To get started: +The package use semantic versioning. It means that major versions could include breaking changes. 
It's highly recommended to specify a `tabulator` version range if you use a `setup.py` or `requirements.txt` file, e.g. `tabulator<2.0`.
 
 ```
-$ pip install tabulator
+$ pip install tabulator # v0.15
+$ pip install tabulator --pre # v1.0-alpha
 ```
 
-### Example
+### Examples
 
-Open tabular stream from csv source:
+It's pretty simple to start with `tabulator`:
 
 ```python
 from tabulator import Stream
 
 with Stream('path.csv', headers=1) as stream:
-    print(stream.headers) # will print headers from 1 row
+    stream.headers # [header1, header2, ..]
     for row in stream:
-        print(row) # will print row values list
+        row # [value1, value2, ..]
 ```
 
+There is an [examples](https://github.com/frictionlessdata/tabulator-py/tree/master/examples) directory containing other code listings.
+
+## Documentation
+
+The whole public API of this package is described here and follows semantic versioning rules. Everything outside of this readme is a private API and could be changed without any notification in any new version.
+
 ### Stream
 
-`Stream` takes the `source` argument:
+The `Stream` class represents a tabular stream. It takes the `source` argument in the form of a source string or object:
 
 ```
 <scheme>://path/to/file.<format>
 ```
 
-and uses corresponding `Loader` and `Parser` to open and start to iterate over the tabular stream. Also user can pass `scheme` and `format` explicitly as constructor arguments. User can force Tabulator to use encoding of choice to open the table passing `encoding` argument.
+and uses the corresponding `Loader` and `Parser` to open and start iterating over the tabular stream. The user can also pass `scheme` and `format` explicitly as constructor arguments. There are a lot of other options as well, described in the sections below.
 
-In this example we use context manager to call `stream.open()` on enter and `stream.close()` when we exit:
-- stream can be iterated like file-like object returning row by row
-- stream can be used for manual iterating with `iter(keyed/extended)` function
-- stream can be read into memory using `read(keyed/extended)` function with row count `limit`
-- headers can be accessed via `headers` property
-- rows sample can be accessed via `sample` property
-- stream pointer can be set to start via `reset` method
-- stream could be saved to filesystem using `save` method
-
-Below the more expanded example is presented:
+Let's create a simple stream object to read a csv file:
 
 ```python
 from tabulator import Stream
 
-def skip_even_rows(extended_rows):
-    for number, headers, row in extended_rows:
-        if number % 2:
-            yield (number, headers, row)
+stream = Stream('data.csv')
+```
+
+This action just instantiates a stream instance; there are no actual IO interactions or source validity checks yet. We need to open the stream object:
 
-stream = Stream('http://example.com/source.xls',
-                headers=1, encoding='utf-8', sample_size=1000,
-                post_parse=[skip_even_rows], sheet=1)
+```python
 stream.open()
-print(stream.sample) # will print sample
-print(stream.headers) # will print headers list
-print(stream.read(limit=10)) # will print 10 rows
+```
+
+This call will validate the data source, open the underlying stream and read the data sample (if it's not disabled). All possible exceptions are raised on the `stream.open()` call, not on the constructor call.
+
+After the work with the stream is done, it could be closed:
+
+```python
+stream.close()
+```
+
+The `Stream` class supports the Python context manager interface, so the calls above could be written using the `with` syntax. It's the common and recommended way to use a `tabulator` stream:
+
+```python
+with Stream('data.csv') as stream:
+    # use stream
+```
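+
+If the scheme or format can't be inferred from the source, they can be passed explicitly. A minimal sketch, assuming a csv payload stored under a non-csv extension (the file name is this example's assumption):
+
+```python
+from tabulator import Stream
+
+# Assumed: data.txt actually contains csv data
+with Stream('data.txt', format='csv') as stream:
+    stream.read()
+```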
+
+Now we could iterate over rows in our tabular data source. It's important to understand that `tabulator` works with the underlying stream and does not load it to memory (just one row at a time), so the `stream.iter()` interface is the most effective way to use the stream:
+
+```python
+for row in stream.iter():
+    row # [value1, value2, ..]
+```
+
+But if you need all the data in one call, you could use the `stream.read()` function instead of the `stream.iter()` function. Note that if you just run it after the code snippet above, the `stream.read()` call will return an empty list. That's another important consequence of the streaming nature of `tabulator` - the `Stream` instance just iterates over an underlying stream, and the underlying stream has an internal pointer (as, for example, a file-like object has). So after we've iterated over all rows in the first listing, the pointer is set to the end of the stream:
+
+```python
+stream.read() # []
+```
+
+The recommended way is to iterate (or read) over a stream just once (and save the data to memory if needed). But there is a possibility to reset the stream pointer. For some sources it will not be cheap (another HTTP request for a remote source), but if you work with a local file as a source, for example, it's just an inexpensive `file.seek()` call:
+
+```python
 stream.reset()
-for keyed_row in stream.iter(keyed=True):
-    print keyed_row # will print row dict
-for extended_row in stream.iter(extended=True):
-    print extended_row # will print (number, headers, row)
+stream.read() # [[value1, value2, ..], ..]
+```
+
+The `Stream` class supports saving the tabular data stream to the filesystem. Let's reset the stream again (don't forget about the pointer) and save it to disk:
+
+```python
 stream.reset()
-stream.save('target.csv')
-stream.close()
+stream.save('data-copy.csv')
+```
+
+The full session will look like this:
+
+```python
+from tabulator import Stream
+
+with Stream('data.csv') as stream:
+    for row in stream.iter():
+        row # [value1, value2, ..]
+    stream.reset()
+    stream.read() # [[value1, value2, ..], ..]
+    stream.reset()
+    stream.save('data-copy.csv')
+```
+
+This is just a pretty basic `Stream` introduction. Please read the full documentation below - the `Stream` arguments are covered in more detail in the following sections, and a combined sketch follows the argument list below. There are many other goodies like headers extraction, keyed output, post parse processors and many more!
+
+#### Stream(source, headers=None, scheme=None, format=None, encoding=None, sample_size=100, allow_html=False, skip_rows=[], post_parse=[], custom_loaders={}, custom_parsers={}, custom_writers={}, \*\*options)
+
+Create a stream class instance.
+
+- **source (any)** - stream source in a form based on the `scheme` argument
+- **headers (list/int)** - headers list or the source row number containing the headers. If a number is given, for a plain source the headers row and all rows before it will be removed, and for a keyed source no rows will be removed.
+- **scheme (str)** - source scheme with `file` as default. In most cases the scheme will be inferred from the source. See the list of supported schemes below.
+- **format (str)** - source format with `None` (detect) as default. In most cases the format will be inferred from the source. See the list of supported formats below.
+- **encoding (str)** - source encoding with `None` (detect) as default.
+- **sample_size (int)** - row count for `stream.sample`. Set it to `0` to prevent any parsing activities before the actual `stream.iter()` call. In this case headers will not be extracted from the source.
+- **allow_html (bool)** - a flag to allow html
+- **force_strings (bool)** - if `True` all output will be converted to strings
+- **force_parse (bool)** - if `True`, on a row parsing error the stream will return an empty row instead of raising an exception
+- **skip_rows (int/str[])** - list of rows to skip by row number or row comment. Example: `skip_rows=[1, 2, '#', '//']` - rows 1, 2 and all rows starting with `#` and `//` will be skipped.
+- **post_parse (generator[])** - post parse processors (hooks). The signature to follow is `processor(extended_rows) -> yield (row_number, headers, row)`, which should yield one extended row per yield instruction.
+- **custom_loaders (dict)** - custom loaders keyed by scheme. See the section below.
+- **custom_parsers (dict)** - custom parsers keyed by format. See the section below.
+- **custom_writers (dict)** - custom writers keyed by format. See the section below.
+- **options (dict)** - loader/parser options. See the scheme/format sections.
+- returns **(Stream)** - Stream class instance
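+
+Putting a few of these arguments together - a hedged sketch where the file name, encoding and skip values are illustrative assumptions, not defaults:
+
+```python
+from tabulator import Stream
+
+# Assumed: a local csv with a header row, '#'-commented lines and latin1 encoding
+with Stream('data.csv', headers=1, encoding='latin1',
+            sample_size=10, skip_rows=[2, '#']) as stream:
+    stream.headers # headers taken from row 1
+    stream.read()  # rows, with row 2 and '#'-commented rows skipped
+```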
+
+#### stream.closed
+
+- returns **(bool)** - `True` if the underlying stream is closed
+
+#### stream.open()
+
+Open the stream by opening the underlying stream.
+
+#### stream.close()
+
+Close the stream by closing the underlying stream.
+
+#### stream.reset()
+
+Reset the stream pointer to the first row.
+
+#### stream.headers
+
+- returns **(str[])** - data headers
+
+#### stream.sample
+
+- returns **(list)** - data sample
+
+#### stream.iter(keyed=False, extended=False)
+
+Iterate over stream rows.
+
+- **keyed (bool)** - yield keyed rows
+- **extended (bool)** - yield extended rows
+- returns **(any[]/any{})** - yields row/keyed row/extended row
+
+#### stream.read(keyed=False, extended=False, limit=None)
+
+Read stream rows, optionally up to a count limit.
+
+- **keyed (bool)** - return keyed rows
+- **extended (bool)** - return extended rows
+- **limit (int)** - row count limit
+- returns **(list)** - rows/keyed rows/extended rows
+
+#### stream.save(target, format=None, encoding=None, \*\*options)
+
+Save the stream to the filesystem.
+
+- **target (str)** - stream target
+- **format (str)** - saving format. See supported formats
+- **encoding (str)** - saving encoding
+- **options (dict)** - writer options
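+
+As a quick illustration of the `limit` and `keyed` flags together (the two-row inline source is an assumed example):
+
+```python
+from tabulator import Stream
+
+with Stream([['name', 'age'], ['Alex', 21], ['John', 33]], headers=1) as stream:
+    stream.read(keyed=True, limit=1) # [{'name': 'Alex', 'age': 21}]
+```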
+
+### Headers
+
+By default `Stream` considers all data source rows as values:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]]) as stream:
+    stream.headers # None
+    stream.read() # [['name', 'age'], ['Alex', 21]]
+```
+
+To alter this behaviour, the `headers` argument is supported by the `Stream` constructor. This argument could be an integer - the row number (starting from 1) containing the headers:
+
+```python
+# Integer
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+    stream.headers # ['name', 'age']
+    stream.read() # [['Alex', 21]]
+```
+
+Or it could be a list of strings - user-defined headers:
+
+```python
+with Stream([['Alex', 21]], headers=['name', 'age']) as stream:
+    stream.headers # ['name', 'age']
+    stream.read() # [['Alex', 21]]
+```
+
+If `headers` is a row number and the data source is not keyed, this row and all rows before it will be removed from the data stream (see the first example).
+
+### Schemes
+
+Below is a list of all supported schemes.
+
+#### file
+
+It's the default scheme. The source should be a file in the local filesystem.
+
+```python
+stream = Stream('data.csv')
+```
+
+#### http/https/ftp/ftps
+
+The source should be a file available on the web via one of these protocols.
+
+```python
+stream = Stream('http://example.com/data.csv')
+```
+
+#### stream
+
+The source should be a file-like python object which supports the corresponding protocol.
+
+```python
+stream = Stream(open('data.csv'))
+```
+
+#### text
+
+The source should be a string containing tabular data. In this case `format` has to be passed explicitly because it's not possible to infer it from a source string.
+
+```python
+stream = Stream('text://name,age\nJohn, 21\n', format='csv')
+```
+
+### Formats
+
+Below is a list of all supported formats. Formats supporting the `read` operation can be opened by `Stream.open()`, and formats supporting the `write` operation can be used in `Stream.save()`.
+
+#### csv
+
+The source should be parsable by the csv parser.
+
+```python
+stream = Stream('data.csv', delimiter=',')
+```
+
+Operations:
+- read
+- write
+
+Options:
+- delimiter
+- doublequote
+- escapechar
+- quotechar
+- quoting
+- skipinitialspace
+- lineterminator
+
+See the options reference in the [Python documentation](https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters).
+
+#### gsheet
+
+The source should be a link to a publicly available Google Spreadsheet.
+
+```python
+stream = Stream('https://docs.google.com/spreadsheets/d/<key>?usp=sharing')
+stream = Stream('https://docs.google.com/spreadsheets/d/<key>/edit#gid=<gid>')
+```
+
+#### inline
+
+The source should be a list of lists or a list of dicts.
+
+```python
+stream = Stream([['name', 'age'], ['John', 21], ['Alex', 33]])
+stream = Stream([{'name': 'John', 'age': 21}, {'name': 'Alex', 'age': 33}])
+```
+
+Operations:
+- read
+
+#### json
+
+The source should be a valid JSON document containing an array of arrays or an array of objects (see the `inline` format example).
+
+```python
+stream = Stream('data.json', property='key1.key2')
+```
+
+Operations:
+- read
+
+Options:
+- property - path to the tabular data property, separated by dots. For example, having a data structure like `{"response": {"data": [...]}}`, you should set `property` to `response.data`.
+
+#### ndjson
+
+The source should be parsable by the ndjson parser.
+
+```python
+stream = Stream('data.ndjson')
+```
+
+Operations:
+- read
+
+#### ods
+
+> This format is not included in the package by default. To use it, please install `tabulator` with the `ods` extra: `$ pip install tabulator[ods]`
+
+The source should be a valid Open Office document.
+
+```python
+stream = Stream('data.ods', sheet=1)
+```
+
+Operations:
+- read
+
+Options:
+- sheet - sheet number starting from 1
+
+#### sql
+
+The source should be a valid database URL supported by `sqlalchemy`.
+
+```python
+stream = Stream('postgresql://name:pass@host:5432/database', table='data')
+```
+
+Operations:
+- read
+
+Options:
+- table - database table name to read data from (REQUIRED)
+- order_by - SQL expression to order rows, e.g. `name desc`
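+
+As a concrete usage sketch, assuming a local SQLite file `data.db` containing a `people` table (both names are this example's assumptions; `sqlite` is among the supported `sqlalchemy` schemes):
+
+```python
+from tabulator import Stream
+
+with Stream('sqlite:///data.db', table='people', order_by='id') as stream:
+    stream.read() # rows from the people table, ordered by id
+```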
+
+#### tsv
+
+The source should be parsable by the tsv parser.
+
+```python
+stream = Stream('data.tsv')
+```
+
+Operations:
+- read
+
+#### xls/xlsx
+
+The source should be a valid Excel document.
+
+```python
+stream = Stream('data.xls', sheet=1)
+```
+
+Operations:
+- read
+
+Options:
+- sheet - sheet number starting from 1
+
+### Encoding
+
+The `Stream` constructor accepts an `encoding` argument to ensure the needed encoding will be used. Any encoding name supported by Python can be used as a value:
+
+```python
+with Stream(source, encoding='latin1') as stream:
+    stream.read()
+```
+
+By default the encoding will be detected automatically.
+
+### Sample size
+
+By default `Stream` will read some data in advance on the `stream.open()` call. This data is provided as `stream.sample`. The size of this sample could be set in rows using the `sample_size` argument of the stream constructor:
+
+```python
+with Stream(two_rows_source, sample_size=1) as stream:
+    stream.sample # only first row
+    stream.read() # first and second rows
+```
+
+The data sample could be really useful if you want to implement some initial data checks without moving the stream pointer, as `stream.iter/read` do. But if you don't want any interactions with the actual source before the first `stream.iter/read` call, just disable data sampling with `sample_size=0`.
+
+### Allow html
+
+By default `Stream` will raise `exceptions.FormatError` on the `stream.open()` call if html contents are detected. Html is not a tabular format, and providing a link to a csv file inside an html page (e.g. a GitHub page) is a common mistake.
+
+But sometimes this default behaviour is not what is needed, for example when you write a custom parser which should support html contents. In this case the `allow_html` option for `Stream` could be used:
+
+```python
+with Stream(source_with_html, allow_html=True) as stream:
+    stream.read() # no exception on open
+```
+
+### Force strings
+
+Because `tabulator` supports not only sources with a string data representation like `csv` but also sources supporting richer data types like `json` or `inline`, there is a `Stream` option `force_strings` to stringify all data values on reading.
+
+Here is how a stream works without forcing strings:
+
+```python
+import datetime
+
+with Stream([['string', 1, datetime.time(17, 00)]]) as stream:
+    stream.read() # [['string', 1, datetime.time(17, 00)]]
+```
+
+The same data source using the `force_strings` option:
+
+```python
+with Stream([['string', 1, datetime.time(17, 00)]], force_strings=True) as stream:
+    stream.read() # [['string', '1', '17:00:00']]
+```
+
+For all temporal values the stream will use ISO format. But if your data source doesn't support temporal values (for instance the `json` format), `Stream` just returns them as is, without converting to ISO format.
+
+### Force parse
+
+Some data sources could be partially malformed for a parser. For example an `inline` source could have good rows (lists or dicts) and bad rows (for example strings). By default `stream.iter/read` will raise `exceptions.SourceError` on the first bad row:
+
+```python
+with Stream([[1], 'bad', [3]]) as stream:
+    stream.read() # raise exceptions.SourceError
+```
+
+With the `force_parse` option for the `Stream` constructor this default behaviour could be changed.
+If it's set to `True`, non-parsable rows will be returned as empty rows:
+
+```python
+with Stream([[1], 'bad', [3]], force_parse=True) as stream:
+    stream.read() # [[1], [], [3]]
+```
+
+### Skip rows
+
+It's a very common situation when your tabular data contains some rows you want to skip. They could be blank rows or commented rows. The `Stream` constructor accepts the `skip_rows` argument to make it possible. The value of this argument should be a list of integers and strings, where:
+- an integer is a row number starting from 1
+- a string is the leading row characters indicating that the row is a comment
+
+Let's skip the first row, the second row, and rows commented with the '#' symbol:
+
+```python
+source = [['John', 1], ['Alex', 2], ['#Sam', 3], ['Mike', 4]]
+with Stream(source, skip_rows=[1, 2, '#']) as stream:
+    stream.read() # [['Mike', 4]]
+```
+
+### Post parse
+
+Skipping rows is a very basic ETL (extract-transform-load) feature. For more advanced data transformations there are post parse processors:
+
+```python
+def skip_odd_rows(extended_rows):
+    for row_number, headers, row in extended_rows:
+        if not row_number % 2:
+            yield (row_number, headers, row)
+
+def multiply_on_two(extended_rows):
+    for row_number, headers, row in extended_rows:
+        yield (row_number, headers, list(map(lambda value: value * 2, row)))
+
+with Stream([[1], [2], [3], [4]], post_parse=[skip_odd_rows, multiply_on_two]) as stream:
+    stream.read() # [[4], [8]]
+```
+
+A post parse processor gets an extended rows iterator (`(row_number, headers, row)` tuples) and must yield updated extended rows back. This interface is very powerful because every processor has full control over the iteration process and can skip rows, catch exceptions, etc.
+
+Processors will be applied to the source from left to right. For example, in the listing above the `multiply_on_two` processor gets rows from the `skip_odd_rows` processor.
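+
+Because processors drive the iteration themselves, they can also guard against bad values. A minimal sketch, assuming numeric data where unparsable cells should become `None` (the replacement policy is this example's assumption, not library behaviour):
+
+```python
+def none_on_bad_cells(extended_rows):
+    for row_number, headers, row in extended_rows:
+        cleaned = []
+        for value in row:
+            try:
+                cleaned.append(float(value))
+            except (TypeError, ValueError):
+                cleaned.append(None)  # assumed policy: keep the row, null the cell
+        yield (row_number, headers, cleaned)
+
+# usage: Stream(source, post_parse=[none_on_bad_cells])
+```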
+
+### Custom loaders
+
+> It's a provisional API. If you use it as part of another program, please pin a concrete `tabulator` version in your requirements file.
+
+To create a custom loader, the `Loader` interface should be implemented and passed to the `Stream` constructor as the `custom_loaders={'scheme': CustomLoader}` argument.
+
+For example, let's implement a custom loader:
+
+```python
+from tabulator import Loader
+
+class CustomLoader(Loader):
+    options = []
+    def load(self, source, mode='t', encoding=None, allow_zip=False):
+        # load logic
+
+with Stream(source, custom_loaders={'custom': CustomLoader}) as stream:
+    stream.read()
+```
+
+There are more examples in the internal `tabulator.loaders` module.
+
+#### Loader(\*\*options)
+
+- **options (dict)** - loader options
+- returns **(Loader)** - `Loader` class instance
+
+#### Loader.options
+
+List of supported options.
+
+#### loader.load(source, mode='t', encoding=None, allow_zip=False)
+
+- **source (str)** - table source
+- **mode (str)** - stream mode: 't' (text) or 'b' (bytes)
+- **allow_zip (bool)** - if `False`, will raise on the zip format
+- **encoding (str)** - encoding of source
+- returns **(file-like)** - file-like object of bytes or chars, based on the mode argument
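+
+For a slightly fuller sketch, here is a hypothetical loader for an assumed `env` scheme which resolves the file path from an environment variable; the scheme name and resolution logic are illustrative only, not part of the package:
+
+```python
+import io
+import os
+from tabulator import Loader
+
+class EnvLoader(Loader):
+    # Hypothetical: 'env://DATA' loads the path stored in $DATA
+    options = []
+
+    def load(self, source, mode='t', encoding=None, allow_zip=False):
+        path = os.environ[source.replace('env://', '')]
+        if mode == 'b':
+            return io.open(path, 'rb')
+        return io.open(path, encoding=encoding)
+
+# usage: Stream('env://DATA', custom_loaders={'env': EnvLoader})
+```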
+
+### Custom parsers
+
+> It's a provisional API. If you use it as part of another program, please pin a concrete `tabulator` version in your requirements file.
+
+To create a custom parser, the `Parser` interface should be implemented and passed to the `Stream` constructor as the `custom_parsers={'format': CustomParser}` argument.
+
+For example, let's implement a custom parser:
+
+```python
+from tabulator import Parser
+
+class CustomParser(Parser):
+    def __init__(self, loader):
+        self.__loader = loader
+    @property
+    def closed(self):
+        return False
+    def open(self, source, encoding=None, force_parse=False):
+        # open logic
+    def close(self):
+        # close logic
+    def reset(self):
+        raise NotImplementedError()
+    @property
+    def extended_rows(self):
+        # extended rows logic
+
+with Stream(source, custom_parsers={'custom': CustomParser}) as stream:
+    stream.read()
+```
+
+There are more examples in the internal `tabulator.parsers` module.
+
+#### Parser(loader, \*\*options)
+
+Create a parser class instance.
+
+- **loader (Loader)** - loader instance
+- **options (dict)** - parser options
+- returns **(Parser)** - `Parser` class instance
+
+#### Parser.options
+
+List of supported options.
+
+#### parser.closed
+
+- returns **(bool)** - `True` if the parser is closed
+
+#### parser.open(source, encoding=None, force_parse=False)
+
+Open the underlying stream. The parser gets a byte or text stream from the loader and starts emitting items from this stream.
+
+- **source (str)** - table source
+- **encoding (str)** - encoding of source
+- **force_parse (bool)** - if `True` the parser must yield `(row_number, None, [])` for rows with parsing errors instead of stopping the iteration by raising an exception
+
+#### parser.close()
+
+Close the underlying stream.
+
+#### parser.reset()
+
+Reset items and the underlying stream. After a reset call, iteration over the items will start from scratch.
+
+#### parser.extended_rows
+
+- returns **(iterator)** - extended rows iterator
+
+### Custom writers
+
+> It's a provisional API. If you use it as part of another program, please pin a concrete `tabulator` version in your requirements file.
+
+To create a custom writer, the `Writer` interface should be implemented and passed to the `Stream` constructor as the `custom_writers={'format': CustomWriter}` argument.
+
+For example, let's implement a custom writer:
+
+```python
+from tabulator import Writer
+
+class CustomWriter(Writer):
+    options = []
+    def save(self, source, target, headers=None, encoding=None):
+        # save logic
+
+with Stream(source, custom_writers={'custom': CustomWriter}) as stream:
+    stream.save(target)
+```
+
+There are more examples in the internal `tabulator.writers` module.
+
+#### Writer(\*\*options)
+
+Create a writer class instance.
+
+- **options (dict)** - writer options
+- returns **(Writer)** - `Writer` class instance
+
+#### Writer.options
+
+List of supported options.
+
+#### writer.save(source, target, headers=None, encoding=None)
+
+Save source data to the target.
+
+- **source (str)** - data source
+- **target (str)** - save target
+- **headers (str[])** - optional headers
+- **encoding (str)** - encoding of source
+
+### Keyed and extended rows
+
+The stream methods `stream.iter/read()` accept `keyed` and `extended` flags to vary the data structure of the output rows.
+
+By default a stream returns every row as a list:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+    stream.read() # [['Alex', 21]]
+```
+
+With `keyed=True` a stream returns every row as a dict:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+    stream.read(keyed=True) # [{'name': 'Alex', 'age': 21}]
+```
+
+And with `extended=True` a stream returns every row as a tuple containing the row number starting from 1, the headers as a list and the row as a list:
+
+```python
+with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
+    stream.read(extended=True) # [(1, ['name', 'age'], ['Alex', 21])]
+```
+
+### Validate
+
+For cases where you don't need to open the source but want to know whether it is supported by `tabulator` or not, you could use the `validate` function. It also lets you know what exactly is not supported by raising the corresponding exception class:
+
+```python
+from tabulator import validate, exceptions
+
+try:
+    tabular = validate('data.csv')
+except exceptions.TabulatorException:
+    tabular = False
+```
+
+#### validate(source, scheme=None, format=None)
+
+Validate whether this source has a supported scheme and format.
+
+- **source (any)** - data source
+- **scheme (str)** - data scheme
+- **format (str)** - data format
+- raises **(exceptions.SchemeError)** - if the scheme is not supported
+- raises **(exceptions.FormatError)** - if the format is not supported
+- returns **(bool)** - `True` if the scheme/format is supported
+
+### Exceptions
+
+#### exceptions.TabulatorException
+
+Base class for all `tabulator` exceptions.
+
+#### exceptions.SourceError
+
+This class of exceptions covers all source errors, like a bad data structure for JSON.
+
+#### exceptions.SchemeError
+
+This exception will be raised if you provide an unsupported source scheme like `bad://source.csv`.
+
+#### exceptions.FormatError
+
+This exception will be raised if you provide an unsupported source format like `http://source.bad`.
+
+#### exceptions.EncodingError
+
+All errors related to encoding problems.
+
+#### exceptions.OptionsError
+
+All errors related to options not supported by a Loader/Parser/Writer.
+
+#### exceptions.IOError
+
+All underlying input-output errors.
+
+#### exceptions.HTTPError
+
+All underlying HTTP errors.
+
+#### exceptions.ResetError
+
+All errors caused by stream reset problems.
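+
+A short, assumed example of how this hierarchy can be used - handle a specific class first and fall back to the base class (the handling strategy is just an illustration):
+
+```python
+from tabulator import Stream, exceptions
+
+try:
+    with Stream('http://example.com/data.csv') as stream:
+        rows = stream.read()
+except exceptions.HTTPError:
+    rows = []  # assumed handling: treat network failures as no data
+except exceptions.TabulatorException:
+    raise  # any other tabulator-level problem
+```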
 
 ### CLI
 
-> It's a provisional API excluded from SemVer. If you use it as a part of other program please pin concrete `goodtables` version to your requirements file.
+> It's a provisional API. If you use it as part of another program, please pin a concrete `tabulator` version in your requirements file.
 
 The library ships with a simple CLI to read tabular data:
 
 ```bash
-$ tabulator
+$ tabulator data/table.csv
+id, name
+1, english
+2, 中国人
+```
+
+#### $ tabulator
+
+```bash
 Usage: cli.py [OPTIONS] SOURCE
 
 Options:
@@ -107,47 +771,66 @@ Options:
   --help  Show this message and exit.
 ```
 
-Shell usage example:
+## Contributing
+
+The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).
+
+The recommended way to get started is to create and activate a project virtual environment.
+To install the package and the development dependencies into the active environment:
+
+```
+$ make install
+```
+
+To run tests with linting and coverage:
 
 ```bash
-$ tabulator data/table.csv
-id, name
-1, english
-2, 中国人
+$ make test
 ```
 
-## API Reference
+For linting, `pylama` (configured in `pylama.ini`) is used. At this stage it's already
+installed into your environment and can be used separately with more fine-grained control,
+as described in the documentation - https://pylama.readthedocs.io/en/latest/.
 
-### Snapshot
+For example, to sort results by error type:
 
+```bash
+$ pylama --sort
 ```
-Stream(source,
-       headers=None,
-       scheme=None,
-       format=None,
-       encoding=None,
-       sample_size=None,
-       post_parse=None,
-       **options)
-    closed/open/close/reset
-    headers -> list
-    sample -> rows
-    iter(keyed/extended=False) -> (generator) (keyed/extended)row[]
-    read(keyed/extended=False, limit=None) -> (keyed/extended)row[]
-    save(target, format=None, encoding=None, **options)
-exceptions
-~cli
+
+For testing, `tox` (configured in `tox.ini`) is used.
+It's already installed into your environment and can be used separately with more fine-grained control, as described in the documentation - https://testrun.org/tox/latest/.
+
+For example, to check a subset of tests against a Python 2 environment with increased verbosity
+(all positional arguments and options after `--` will be passed to `py.test`):
+
+```bash
+tox -e py27 -- -v tests/
 ```
 
-### Detailed
+Under the hood `tox` uses `pytest` (configured in `pytest.ini`) and the `coverage`
+and `mock` packages. These packages are available only in the tox environments.
 
-- [Docstrings](https://github.com/frictionlessdata/tabulator-py/tree/master/tabulator)
-- [Changelog](https://github.com/frictionlessdata/tabulator-py/commits/master)
+## Changelog
 
-## Contributing
-
+Here only breaking and the most important changes are described.
The full changelog could be found in nicely formatted [commit history](https://github.com/frictionlessdata/tabulator-py/commits/master). + +### v1.0 [WIP] + +New API added: +- published `Loader/Parser/Writer` API +- added `Stream` argument `force_strings` +- added `Stream` argument `force_parse` +- added `Stream` argument `custom_writers` + +Deprecated API removal: +- removed `topen` and `Table` - use `Stream` instead +- removed `Stream` arguments `loader/parser_options` - use `**options` instead -Please read the contribution guideline: +Provisional API changed: +- updated `Loader/Parser/Writer` API - please use an updated version -[How to Contribute](CONTRIBUTING.md) +### [v0.15](https://github.com/frictionlessdata/tabulator-py/tree/v0.15.0) -Thanks! +Provisional API added: +- unofficial support for `Stream` arguments `custom_loaders/parsers` diff --git a/examples/stream.py b/examples/stream.py index 6f84fdf8..21209f5b 100644 --- a/examples/stream.py +++ b/examples/stream.py @@ -82,7 +82,7 @@ print(row) -print('\nUsage of native lists:') +print('\nUsage of inline lists:') source = [['id', 'name'], ['1', 'english'], ('2', '中国人')] with Stream(source, headers='row1') as stream: print(stream.headers) @@ -90,7 +90,7 @@ print(row) -print('\nUsage of native lists (keyed):') +print('\nUsage of inline lists (keyed):') source = [{'id': '1', 'name': 'english'}, {'id': '2', 'name': '中国人'}] with Stream(source) as stream: print(stream.headers) diff --git a/scripts/testsuite.sh b/scripts/testsuite.sh deleted file mode 100644 index 184a62b5..00000000 --- a/scripts/testsuite.sh +++ /dev/null @@ -1,15 +0,0 @@ -# Trigger testsuite build on Travis -# TRAVIS_TOKEN should be set in Travis settings or in travis.yml - -body='{ -"request": { - "branch":"master" -}}' - -curl -s -X POST \ - -H "Content-Type: application/json" \ - -H "Accept: application/json" \ - -H "Travis-API-Version: 3" \ - -H "Authorization: token $TRAVIS_TOKEN" \ - -d "$body" \ - https://api.travis-ci.org/repo/frictionlessdata%2Ftestsuite-py/requests diff --git a/setup.py b/setup.py index 55d19bda..6ecad498 100644 --- a/setup.py +++ b/setup.py @@ -20,18 +20,21 @@ def read(*paths): # Prepare PACKAGE = 'tabulator' INSTALL_REQUIRES = [ - 'six>=1.9,<2.0a', - 'xlrd>=1.0,<2.0a', - 'ijson>=2.0,<3.0a', - 'cchardet>=1.0,<2.0a', - 'openpyxl>=2.4,<3.0a', - 'requests>=2.8,<3.0a', - 'linear-tsv>=1.0,<2.0a', - 'unicodecsv>=0.14,<1.0a', - 'jsonlines>=1.1,<2.0a', - 'click>=6.0,<7.0a', - 'ezodf>=0.3,<1.0a', - 'lxml>=3.0,<4.0a', # required by ezodf + 'six>=1.9,<2.0', + 'click>=6.0,<7.0', + 'requests>=2.8,<3.0', + 'cchardet>=1.0,<2.0', + 'unicodecsv>=0.14,<1.0', + 'ijson>=2.0,<3.0', + 'jsonlines>=1.1,<2.0', + 'sqlalchemy>=1.1,<2.0', + 'linear-tsv>=1.0,<2.0', + 'xlrd>=1.0,<2.0', + 'openpyxl>=2.4,<3.0', +] +INSTALL_FORMAT_ODS_REQUIRES = [ + 'ezodf>=0.3,<1.0', + 'lxml>=3.0,<4.0', ] TESTS_REQUIRE = [ 'pylama', @@ -50,7 +53,10 @@ def read(*paths): include_package_data=True, install_requires=INSTALL_REQUIRES, tests_require=TESTS_REQUIRE, - extras_require={'develop': TESTS_REQUIRE}, + extras_require={ + 'ods': INSTALL_FORMAT_ODS_REQUIRES, + 'develop': TESTS_REQUIRE, + }, entry_points={ 'console_scripts': [ 'tabulator = tabulator.cli:cli', diff --git a/tabulator/VERSION b/tabulator/VERSION index e815b861..62581c75 100644 --- a/tabulator/VERSION +++ b/tabulator/VERSION @@ -1 +1 @@ -0.15.1 +1.0.0-alpha1 diff --git a/tabulator/__init__.py b/tabulator/__init__.py index af7ccafe..9bd0e0c2 100644 --- a/tabulator/__init__.py +++ b/tabulator/__init__.py @@ -3,22 +3,21 @@ from 
__future__ import division from __future__ import print_function from __future__ import unicode_literals -import io -import os -# General +# Module API from .stream import Stream +from .loader import Loader +from .parser import Parser +from .writer import Writer +from .validate import validate from . import exceptions -# Deprecated - -from .topen import topen -from .stream import Stream as Table - # Version +import io +import os __version__ = io.open( os.path.join(os.path.dirname(__file__), 'VERSION'), encoding='utf-8').read().strip() diff --git a/tabulator/cli.py b/tabulator/cli.py index 3572f626..124956f6 100644 --- a/tabulator/cli.py +++ b/tabulator/cli.py @@ -20,6 +20,8 @@ @click.option('--limit', type=click.INT) @click.version_option(tabulator.__version__, message='%(version)s') def cli(source, limit, **options): + """https://github.com/frictionlessdata/tabulator-py#cli + """ options = {key: value for key, value in options.items() if value is not None} with tabulator.Stream(source, **options) as stream: cast = str diff --git a/tabulator/config.py b/tabulator/config.py index b48d97cf..e614ded3 100644 --- a/tabulator/config.py +++ b/tabulator/config.py @@ -12,31 +12,39 @@ BYTES_SAMPLE_SIZE = 1000 ENCODING_CONFIDENCE = 0.5 CSV_SAMPLE_LINES = 100 +# http://docs.sqlalchemy.org/en/latest/dialects/index.html +SQL_SCHEMES = ['firebird', 'mssql', 'mysql', 'oracle', 'postgresql', 'sqlite', 'sybase'] + +# Loaders + LOADERS = { - 'file': 'tabulator.loaders.file.FileLoader', - 'ftp': 'tabulator.loaders.web.WebLoader', - 'ftps': 'tabulator.loaders.web.WebLoader', - 'gsheet': None, - 'http': 'tabulator.loaders.web.WebLoader', - 'https': 'tabulator.loaders.web.WebLoader', - 'native': None, + 'file': 'tabulator.loaders.local.LocalLoader', + 'http': 'tabulator.loaders.remote.RemoteLoader', + 'https': 'tabulator.loaders.remote.RemoteLoader', + 'ftp': 'tabulator.loaders.remote.RemoteLoader', + 'ftps': 'tabulator.loaders.remote.RemoteLoader', 'stream': 'tabulator.loaders.stream.StreamLoader', 'text': 'tabulator.loaders.text.TextLoader', } +# Parsers + PARSERS = { 'csv': 'tabulator.parsers.csv.CSVParser', 'gsheet': 'tabulator.parsers.gsheet.GsheetParser', + 'inline': 'tabulator.parsers.inline.InlineParser', 'json': 'tabulator.parsers.json.JSONParser', 'jsonl': 'tabulator.parsers.ndjson.NDJSONParser', 'ndjson': 'tabulator.parsers.ndjson.NDJSONParser', - 'native': 'tabulator.parsers.native.NativeParser', - 'tsv': 'tabulator.parsers.tsv.TSVParser', - 'xls': 'tabulator.parsers.excel.ExcelParser', - 'xlsx': 'tabulator.parsers.excelx.ExcelxParser', 'ods': 'tabulator.parsers.ods.ODSParser', + 'sql': 'tabulator.parsers.sql.SQLParser', + 'tsv': 'tabulator.parsers.tsv.TSVParser', + 'xls': 'tabulator.parsers.xls.XLSParser', + 'xlsx': 'tabulator.parsers.xlsx.XLSXParser', } +# Writers + WRITERS = { 'csv': 'tabulator.writers.csv.CSVWriter', } diff --git a/tabulator/exceptions.py b/tabulator/exceptions.py index 24a1c76e..28eca5b5 100644 --- a/tabulator/exceptions.py +++ b/tabulator/exceptions.py @@ -8,54 +8,54 @@ # Module API class TabulatorException(Exception): - """Base Tabulator exception. + """https://github.com/frictionlessdata/tabulator-py#exceptions """ pass class SourceError(TabulatorException): - """Stream error. + """https://github.com/frictionlessdata/tabulator-py#exceptions """ pass class SchemeError(TabulatorException): - """Scheme error. + """https://github.com/frictionlessdata/tabulator-py#exceptions """ pass class FormatError(TabulatorException): - """Format error. 
+ """https://github.com/frictionlessdata/tabulator-py#exceptions """ pass class EncodingError(TabulatorException): - """Encoding error. + """https://github.com/frictionlessdata/tabulator-py#exceptions """ pass class OptionsError(TabulatorException): - """Options error. + """https://github.com/frictionlessdata/tabulator-py#exceptions """ pass class IOError(TabulatorException): - """IO error. + """https://github.com/frictionlessdata/tabulator-py#exceptions """ pass class HTTPError(TabulatorException): - """HTTP error. + """https://github.com/frictionlessdata/tabulator-py#exceptions """ pass class ResetError(TabulatorException): - """Reset error. + """https://github.com/frictionlessdata/tabulator-py#exceptions """ pass diff --git a/tabulator/helpers.py b/tabulator/helpers.py index 03b9fef3..ae734907 100644 --- a/tabulator/helpers.py +++ b/tabulator/helpers.py @@ -17,49 +17,39 @@ # Module API -def detect_scheme(source): - """Detect scheme by source. +def detect_scheme_and_format(source): + """Detect scheme and format based on source and return as a tuple. Scheme is a minimum 2 letters before `://` (will be lower cased). For example `http` from `http://example.com/table.csv` """ + + # Scheme: stream if hasattr(source, 'read'): - scheme = 'stream' - elif isinstance(source, six.string_types): - if 'docs.google.com/spreadsheets' in source: - if 'export' not in source: - return 'gsheet' - match = re.search(r'^([a-zA-Z]{2,}):\/{2}', source) - if not match: - return None - scheme = match.group(1).lower() - else: - scheme = 'native' - return scheme + return ('stream', None) + # Format: inline + if not isinstance(source, six.string_types): + return (None, 'inline') -def detect_format(source): - """Detect format by source. + # Format: gsheet + if 'docs.google.com/spreadsheets' in source: + if 'export' not in source: + return (None, 'gsheet') - For example `csv` from `http://example.com/table.csv` + # Format: sql + for sql_scheme in config.SQL_SCHEMES: + if source.startswith('%s://' % sql_scheme): + return (None, 'sql') - """ - if hasattr(source, 'read'): - format = '' - elif isinstance(source, six.string_types): - if 'docs.google.com/spreadsheets' in source: - if 'export' not in source: - return 'gsheet' - parsed_source = urlparse(source) - path = parsed_source.path or parsed_source.netloc - format = os.path.splitext(path)[1] - if not format: - return None - format = format[1:].lower() - else: - format = 'native' - return format + # General + parsed = urlparse(source) + scheme = parsed.scheme.lower() + if len(scheme) < 2: + scheme = config.DEFAULT_SCHEME + format = os.path.splitext(parsed.path or parsed.netloc)[1][1:].lower() or None + return (scheme, format) def detect_encoding(sample, encoding=None): @@ -162,3 +152,12 @@ def extract_options(options, names): result[name] = value del options[name] return result + + +def stringify_value(value): + """Convert any value to string. 
+ """ + isoformat = getattr(value, 'isoformat', None) + if isoformat is not None: + value = isoformat() + return str(value) diff --git a/tabulator/loader.py b/tabulator/loader.py new file mode 100644 index 00000000..2787eb1b --- /dev/null +++ b/tabulator/loader.py @@ -0,0 +1,29 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +from six import add_metaclass +from abc import ABCMeta, abstractmethod + + +# Module API + +@add_metaclass(ABCMeta) +class Loader(object): + + # Public + + options = [] + + def __init__(self, **options): + """https://github.com/frictionlessdata/tabulator-py#custom-loaders + """ + pass + + @abstractmethod + def load(self, source, mode='t', encoding=None, allow_zip=False): + """https://github.com/frictionlessdata/tabulator-py#custom-loaders + """ + pass diff --git a/tabulator/loaders/api.py b/tabulator/loaders/api.py deleted file mode 100644 index 21920225..00000000 --- a/tabulator/loaders/api.py +++ /dev/null @@ -1,45 +0,0 @@ -# -*- coding: utf-8 -*- -from __future__ import division -from __future__ import print_function -from __future__ import absolute_import -from __future__ import unicode_literals - -from six import add_metaclass -from abc import ABCMeta, abstractmethod - - -# Module API - -@add_metaclass(ABCMeta) -class Loader(object): - """Loader representation. - - Args: - options(dict): loader options - - """ - - # Public - - @property - # @abstractmethod - def options(self): - """list: list of available options - """ - pass - - @abstractmethod - def load(self, source, encoding, mode, allow_zip=False): - """Return byte/text stream file-like object. - - Args: - source (str): table source - encoding (str): encoding of source - mode(str): text stream mode: 't' or 'b' - allow_zip(bool): if false will raise on zip format - - Returns: - file-like: file-like object of byte/text stream - - """ - pass diff --git a/tabulator/loaders/file.py b/tabulator/loaders/local.py similarity index 90% rename from tabulator/loaders/file.py rename to tabulator/loaders/local.py index 05fb035e..b114ac0d 100644 --- a/tabulator/loaders/file.py +++ b/tabulator/loaders/local.py @@ -5,15 +5,15 @@ from __future__ import unicode_literals import io +from ..loader import Loader from .. import exceptions from .. import helpers from .. import config -from . import api # Module API -class FileLoader(api.Loader): +class LocalLoader(Loader): """Loader to load source from filesystem. """ @@ -21,7 +21,7 @@ class FileLoader(api.Loader): options = [] - def load(self, source, encoding, mode, allow_zip=False): + def load(self, source, mode='t', encoding=None, allow_zip=False): # Prepare source scheme = 'file://' diff --git a/tabulator/loaders/web.py b/tabulator/loaders/remote.py similarity index 95% rename from tabulator/loaders/web.py rename to tabulator/loaders/remote.py index 1502a4a2..fcd26583 100644 --- a/tabulator/loaders/web.py +++ b/tabulator/loaders/remote.py @@ -8,15 +8,15 @@ import six from six.moves.urllib.error import URLError from six.moves.urllib.request import Request, urlopen +from ..loader import Loader from .. import exceptions from .. import helpers from .. import config -from . import api # Module API -class WebLoader(api.Loader): +class RemoteLoader(Loader): """Loader to load source from the web. 
""" @@ -24,7 +24,7 @@ class WebLoader(api.Loader): options = [] - def load(self, source, encoding, mode, allow_zip=False): + def load(self, source, mode='t', encoding=None, allow_zip=False): # Requote uri source = helpers.requote_uri(source) diff --git a/tabulator/loaders/stream.py b/tabulator/loaders/stream.py index 25b2792d..893144e1 100644 --- a/tabulator/loaders/stream.py +++ b/tabulator/loaders/stream.py @@ -5,15 +5,15 @@ from __future__ import unicode_literals import io +from ..loader import Loader from .. import exceptions from .. import helpers from .. import config -from . import api # Module API -class StreamLoader(api.Loader): +class StreamLoader(Loader): """Loader to load source from file-like byte stream. """ @@ -21,7 +21,7 @@ class StreamLoader(api.Loader): options = [] - def load(self, source, encoding, mode, allow_zip=False): + def load(self, source, mode='t', encoding=None, allow_zip=False): # Raise if in text mode if hasattr(source, 'encoding'): diff --git a/tabulator/loaders/text.py b/tabulator/loaders/text.py index 449b10e1..292bb9dc 100644 --- a/tabulator/loaders/text.py +++ b/tabulator/loaders/text.py @@ -5,13 +5,13 @@ from __future__ import unicode_literals import io +from ..loader import Loader from .. import config -from . import api # Module API -class TextLoader(api.Loader): +class TextLoader(Loader): """Loader to load source from text. """ @@ -19,7 +19,7 @@ class TextLoader(api.Loader): options = [] - def load(self, source, encoding, mode, allow_zip=False): + def load(self, source, mode='t', encoding=None, allow_zip=False): # Prepare source scheme = 'text://' diff --git a/tabulator/parser.py b/tabulator/parser.py new file mode 100644 index 00000000..38f94c25 --- /dev/null +++ b/tabulator/parser.py @@ -0,0 +1,55 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +from six import add_metaclass +from abc import ABCMeta, abstractmethod + + +# Module API + +@add_metaclass(ABCMeta) +class Parser(object): + + # Public + + options = [] + + def __init__(self, loader, **options): + """https://github.com/frictionlessdata/tabulator-py#custom-parsers + """ + pass + + @property + @abstractmethod + def closed(self): + """https://github.com/frictionlessdata/tabulator-py#custom-parsers + """ + pass # pragma: no cover + + @abstractmethod + def open(self, source, encoding=None, force_parse=False): + """https://github.com/frictionlessdata/tabulator-py#custom-parsers + """ + pass # pragma: no cover + + @abstractmethod + def close(self): + """https://github.com/frictionlessdata/tabulator-py#custom-parsers + """ + pass # pragma: no cover + + @abstractmethod + def reset(self): + """https://github.com/frictionlessdata/tabulator-py#custom-parsers + """ + pass # pragma: no cover + + @property + @abstractmethod + def extended_rows(self): + """https://github.com/frictionlessdata/tabulator-py#custom-parsers + """ + pass # pragma: no cover diff --git a/tabulator/parsers/api.py b/tabulator/parsers/api.py deleted file mode 100644 index 182b328e..00000000 --- a/tabulator/parsers/api.py +++ /dev/null @@ -1,80 +0,0 @@ -# -*- coding: utf-8 -*- -from __future__ import division -from __future__ import print_function -from __future__ import absolute_import -from __future__ import unicode_literals - -from six import add_metaclass -from abc import ABCMeta, abstractmethod - - -# Module API - -@add_metaclass(ABCMeta) -class Parser(object): - """Parser representation. 
- - Args: - options(dict): parser options - - """ - - # Public - - @property - # @abstractmethod - def options(self): - """list: list of available options - """ - pass - - @property - @abstractmethod - def closed(self): - """Return if underlaynig stream is closed. - """ - pass # pragma: no cover - - @abstractmethod - def open(self, source, encoding, loader): - """Open underlaying stream. - - Parser gets byte or text stream from loader - to start emit items from this stream. - - Args: - source (str): table source - encoding (str): encoding of source - loader (Loader): loader instance - - """ - pass # pragma: no cover - - @abstractmethod - def close(self): - """Close underlaying stream. - """ - pass # pragma: no cover - - @abstractmethod - def reset(self): - """Reset items and underlaying stream. - - After reset call iterations over items will - start from scratch. - - """ - pass # pragma: no cover - - @property - @abstractmethod - def extended_rows(self): - """iterator: Extended rows. - - Extended rows iterator from parsed underlaying stream. - - Yields: - tuple: extended row (number, headers, values) - - """ - pass # pragma: no cover diff --git a/tabulator/parsers/csv.py b/tabulator/parsers/csv.py index 294ec246..935e81c4 100644 --- a/tabulator/parsers/csv.py +++ b/tabulator/parsers/csv.py @@ -8,14 +8,14 @@ import six from itertools import chain from codecs import iterencode +from ..parser import Parser from .. import helpers from .. import config -from . import api # Module API -class CSVParser(api.Parser): +class CSVParser(Parser): """Parser to parse CSV data format. """ @@ -31,7 +31,7 @@ class CSVParser(api.Parser): 'lineterminator' ] - def __init__(self, **options): + def __init__(self, loader, **options): # Make bytes if six.PY2: @@ -40,19 +40,20 @@ def __init__(self, **options): options[key] = str(value) # Set attributes + self.__loader = loader self.__options = options + self.__force_parse = None self.__extended_rows = None - self.__loader = None self.__chars = None @property def closed(self): return self.__chars is None or self.__chars.closed - def open(self, source, encoding, loader): + def open(self, source, encoding=None, force_parse=False): self.close() - self.__loader = loader - self.__chars = loader.load(source, encoding, mode='t') + self.__force_parse = force_parse + self.__chars = self.__loader.load(source, encoding=encoding) self.reset() def close(self): @@ -77,19 +78,19 @@ def __iter_extended_rows(self): bytes = iterencode(self.__chars, 'utf-8') sample, dialect = self.__prepare_dialect(bytes) items = csv.reader(chain(sample, bytes), dialect=dialect) - for number, item in enumerate(items, start=1): + for row_number, item in enumerate(items, start=1): values = [] for value in item: value = value.decode('utf-8') values.append(value) - yield (number, None, list(values)) + yield (row_number, None, list(values)) # For PY3 use chars else: sample, dialect = self.__prepare_dialect(self.__chars) items = csv.reader(chain(sample, self.__chars), dialect=dialect) - for number, item in enumerate(items, start=1): - yield (number, None, list(item)) + for row_number, item in enumerate(items, start=1): + yield (row_number, None, list(item)) def __prepare_dialect(self, stream): diff --git a/tabulator/parsers/gsheet.py b/tabulator/parsers/gsheet.py index b1bbba8b..dc27c0f6 100644 --- a/tabulator/parsers/gsheet.py +++ b/tabulator/parsers/gsheet.py @@ -6,12 +6,12 @@ import re from ..stream import Stream -from . 
import api +from ..parser import Parser # Module API -class GsheetParser(api.Parser): +class GsheetParser(Parser): """Parser to parse Google Spreadsheets. """ @@ -19,15 +19,18 @@ class GsheetParser(api.Parser): options = [] - def __init__(self): + def __init__(self, loader): + self.__loader = loader + self.__force_parse = None self.__stream = None @property def closed(self): return self.__stream is None or self.__stream.closed - def open(self, source, encoding, loader): + def open(self, source, encoding=None, force_parse=False): self.close() + self.__force_parse = force_parse url = 'https://docs.google.com/spreadsheets/d/%s/export?format=csv&id=%s' match = re.search(r'.*/d/(?P[^/]+)/.*?(?:gid=(?P\d+))?$', source) key, gid = '', '' @@ -37,7 +40,8 @@ def open(self, source, encoding, loader): url = url % (key, key) if gid: url = '%s&gid=%s' % (url, gid) - self.__stream = Stream(url, format='csv', encoding=encoding).open() + self.__stream = Stream( + url, format='csv', encoding=encoding, force_parse=self.__force_parse).open() self.__extended_rows = self.__stream.iter(extended=True) def close(self): @@ -46,7 +50,7 @@ def close(self): def reset(self): self.__stream.reset() - self.__extended_rows = self.__iter_extended_rows() + self.__extended_rows = self.__stream.iter(extended=True) @property def extended_rows(self): diff --git a/tabulator/parsers/native.py b/tabulator/parsers/inline.py similarity index 64% rename from tabulator/parsers/native.py rename to tabulator/parsers/inline.py index a8bb31df..936a7716 100644 --- a/tabulator/parsers/native.py +++ b/tabulator/parsers/inline.py @@ -5,21 +5,23 @@ from __future__ import unicode_literals import six +from ..parser import Parser from .. import exceptions -from . import api # Module API -class NativeParser(api.Parser): - """Parser to provide support for python native lists. +class InlineParser(Parser): + """Parser to provide support for python inline lists. 
""" # Public options = [] - def __init__(self): + def __init__(self, loader): + self.__loader = loader + self.__force_parse = None self.__extended_rows = None self.__source = None @@ -27,11 +29,12 @@ def __init__(self): def closed(self): return True - def open(self, source, encoding, loader): + def open(self, source, encoding=None, force_parse=False): if hasattr(source, '__next__' if six.PY3 else 'next'): message = 'Only callable returning an iterator is supported' raise exceptions.SourceError(message) self.close() + self.__force_parse = force_parse self.__source = source self.reset() @@ -51,16 +54,18 @@ def __iter_extended_rows(self): items = self.__source if not hasattr(items, '__iter__'): items = items() - for number, item in enumerate(items, start=1): + for row_number, item in enumerate(items, start=1): if isinstance(item, (tuple, list)): - yield (number, None, list(item)) + yield (row_number, None, list(item)) elif isinstance(item, dict): keys = [] values = [] for key in sorted(item.keys()): keys.append(key) values.append(item[key]) - yield (number, list(keys), list(values)) + yield (row_number, list(keys), list(values)) else: - message = 'Native item has to be tuple, list or dict' - raise exceptions.SourceError(message) + if not self.__force_parse: + message = 'Inline data item has to be tuple, list or dict' + raise exceptions.SourceError(message) + yield (row_number, None, []) diff --git a/tabulator/parsers/json.py b/tabulator/parsers/json.py index b61bca93..38d06642 100644 --- a/tabulator/parsers/json.py +++ b/tabulator/parsers/json.py @@ -5,25 +5,27 @@ from __future__ import unicode_literals import ijson +from ..parser import Parser from .. import exceptions from .. import helpers -from . import api # Module API -class JSONParser(api.Parser): +class JSONParser(Parser): """Parser to parse JSON data format. 
""" # Public options = [ - 'prefix', + 'property', ] - def __init__(self, prefix=None): - self.__prefix = prefix + def __init__(self, loader, property=None): + self.__loader = loader + self.__property = property + self.__force_parse = None self.__extended_rows = None self.__chars = None @@ -31,10 +33,10 @@ def __init__(self, prefix=None): def closed(self): return self.__chars is None or self.__chars.closed - def open(self, source, encoding, loader): + def open(self, source, encoding=None, force_parse=False): self.close() - self.__loader = loader - self.__chars = loader.load(source, encoding, mode='t') + self.__force_parse = force_parse + self.__chars = self.__loader.load(source, encoding=encoding) self.reset() def close(self): @@ -53,19 +55,21 @@ def extended_rows(self): def __iter_extended_rows(self): path = 'item' - if self.__prefix is not None: - path = '%s.item' % self.__prefix + if self.__property is not None: + path = '%s.item' % self.__property items = ijson.items(self.__chars, path) - for number, item in enumerate(items, start=1): + for row_number, item in enumerate(items, start=1): if isinstance(item, (tuple, list)): - yield (number, None, list(item)) + yield (row_number, None, list(item)) elif isinstance(item, dict): keys = [] values = [] for key in sorted(item.keys()): keys.append(key) values.append(item[key]) - yield (number, list(keys), list(values)) + yield (row_number, list(keys), list(values)) else: - message = 'JSON item has to be list or dict' - raise exceptions.SourceError(message) + if not self.__force_parse: + message = 'JSON item has to be list or dict' + raise exceptions.SourceError(message) + yield (row_number, None, []) diff --git a/tabulator/parsers/ndjson.py b/tabulator/parsers/ndjson.py index d5c4dfdb..0201e583 100644 --- a/tabulator/parsers/ndjson.py +++ b/tabulator/parsers/ndjson.py @@ -5,15 +5,14 @@ from __future__ import unicode_literals import jsonlines - +from ..parser import Parser from .. import exceptions from .. import helpers -from . import api # Module API -class NDJSONParser(api.Parser): +class NDJSONParser(Parser): """Parser to parse NDJSON data format. 
See: http://specs.okfnlabs.org/ndjson/ @@ -23,20 +22,20 @@ class NDJSONParser(api.Parser): options = [] - def __init__(self, **options): - self.__options = options + def __init__(self, loader): + self.__loader = loader + self.__force_parse = None self.__extended_rows = None - self.__loader = None self.__chars = None @property def closed(self): return self.__chars is None or self.__chars.closed - def open(self, source, encoding, loader): + def open(self, source, encoding=None, force_parse=False): self.close() - self.__loader = loader - self.__chars = loader.load(source, encoding, mode='t') + self.__force_parse = force_parse + self.__chars = self.__loader.load(source, encoding=encoding) self.reset() def close(self): @@ -55,13 +54,13 @@ def extended_rows(self): def __iter_extended_rows(self): rows = jsonlines.Reader(self.__chars) - for number, row in enumerate(rows, start=1): + for row_number, row in enumerate(rows, start=1): if isinstance(row, (tuple, list)): - yield number, None, list(row) + yield row_number, None, list(row) elif isinstance(row, dict): keys, values = zip(*sorted(row.items())) - yield number, list(keys), list(values) + yield (row_number, list(keys), list(values)) else: - raise exceptions.SourceError( - "JSON item has to be list or dict" - ) + if not self.__force_parse: + raise exceptions.SourceError('JSON item has to be list or dict') + yield (row_number, None, []) diff --git a/tabulator/parsers/ods.py b/tabulator/parsers/ods.py index 0016d4da..1ae63649 100644 --- a/tabulator/parsers/ods.py +++ b/tabulator/parsers/ods.py @@ -6,14 +6,13 @@ import ezodf from six import BytesIO - +from ..parser import Parser from .. import helpers -from . import api # Module API -class ODSParser(api.Parser): +class ODSParser(Parser): """Parser to parse ODF Spreadsheets. Args: @@ -28,22 +27,24 @@ class ODSParser(api.Parser): 'sheet', ] - def __init__(self, sheet=1): + def __init__(self, loader, sheet=1): + self.__loader = loader self.__index = sheet - 1 if isinstance(sheet, int) else sheet - self.__loader = None + self.__force_parse = None + self.__extended_rows = None self.__bytes = None self.__book = None self.__sheet = None - self.__extended_rows = None @property def closed(self): return self.__bytes is None or self.__bytes.closed - def open(self, source, encoding, loader): + def open(self, source, encoding=None, force_parse=False): self.close() - self.__loader = loader - self.__bytes = loader.load(source, encoding, mode='b', allow_zip=True) + self.__force_parse = force_parse + self.__bytes = self.__loader.load( + source, mode='b', encoding=encoding, allow_zip=True) self.__book = ezodf.opendoc(BytesIO(self.__bytes.read())) self.__sheet = self.__book.sheets[self.__index] self.reset() @@ -63,5 +64,5 @@ def extended_rows(self): # Private def __iter_extended_rows(self): - for number, row in enumerate(self.__sheet.rows(), start=1): - yield number, None, [cell.value for cell in row] + for row_number, row in enumerate(self.__sheet.rows(), start=1): + yield row_number, None, [cell.value for cell in row] diff --git a/tabulator/parsers/sql.py b/tabulator/parsers/sql.py new file mode 100644 index 00000000..315f7926 --- /dev/null +++ b/tabulator/parsers/sql.py @@ -0,0 +1,70 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +from sqlalchemy import create_engine, sql +from ..parser import Parser +from .. 
import exceptions + + +# Module API + +class SQLParser(Parser): + """Parser to get data from SQL database. + """ + + # Public + + options = [ + 'table', + 'order_by', + ] + + def __init__(self, loader, table=None, order_by=None): + + # Ensure table + if table is None: + raise exceptions.OptionsError('Format `sql` requires `table` option.') + + # Set attributes + self.__loader = loader + self.__table = table + self.__order_by = order_by + self.__force_parse = None + self.__engine = None + self.__extended_rows = None + + @property + def closed(self): + return self.__engine is None + + def open(self, source, encoding=None, force_parse=False): + self.close() + self.__force_parse = force_parse + self.__engine = create_engine(source) + self.__engine.update_execution_options(stream_results=True) + self.reset() + + def close(self): + if not self.closed: + self.__engine.dispose() + self.__engine = None + + def reset(self): + self.__extended_rows = self.__iter_extended_rows() + + @property + def extended_rows(self): + return self.__extended_rows + + # Private + + def __iter_extended_rows(self): + table = sql.table(self.__table) + order = sql.text(self.__order_by) if self.__order_by else None + query = sql.select(['*']).select_from(table).order_by(order) + result = self.__engine.execute(query) + for row_number, row in enumerate(iter(result), start=1): + yield (row_number, row.keys(), list(row)) diff --git a/tabulator/parsers/tsv.py b/tabulator/parsers/tsv.py index 2db51dbc..afff049a 100644 --- a/tabulator/parsers/tsv.py +++ b/tabulator/parsers/tsv.py @@ -5,13 +5,13 @@ from __future__ import unicode_literals import tsv +from ..parser import Parser from .. import helpers -from . import api # Module API -class TSVParser(api.Parser): +class TSVParser(Parser): """Parser to parse linear TSV data format. See: http://dataprotocols.org/linear-tsv/ @@ -22,19 +22,20 @@ class TSVParser(api.Parser): options = [] - def __init__(self): + def __init__(self, loader): + self.__loader = loader + self.__force_parse = None self.__extended_rows = None - self.__loader = None self.__chars = None @property def closed(self): return self.__chars is None or self.__chars.closed - def open(self, source, encoding, loader): + def open(self, source, encoding=None, force_parse=False): self.close() - self.__loader = loader - self.__chars = loader.load(source, encoding, mode='t') + self.__force_parse = force_parse + self.__chars = self.__loader.load(source, encoding=encoding) self.reset() def close(self): @@ -53,5 +54,5 @@ def extended_rows(self): def __iter_extended_rows(self): items = tsv.un(self.__chars) - for number, item in enumerate(items, start=1): - yield (number, None, list(item)) + for row_number, item in enumerate(items, start=1): + yield (row_number, None, list(item)) diff --git a/tabulator/parsers/excel.py b/tabulator/parsers/xls.py similarity index 69% rename from tabulator/parsers/excel.py rename to tabulator/parsers/xls.py index 6b067a9e..d54a0292 100644 --- a/tabulator/parsers/excel.py +++ b/tabulator/parsers/xls.py @@ -5,13 +5,13 @@ from __future__ import unicode_literals import xlrd +from ..parser import Parser from .. import helpers -from . import api # Module API -class ExcelParser(api.Parser): +class XLSParser(Parser): """Parser to parse Excel data format. 
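The new `SQLParser` above streams rows from any SQLAlchemy-supported database. A minimal usage sketch, assuming a SQLite file with a `data` table (mirroring the test fixture later in this diff):

```python
from tabulator import Stream

# `table` is required for the sql format; `order_by` is optional
with Stream('sqlite:///database.db', table='data', order_by='id') as stream:
    print(stream.read())  # [[1, 'english'], [2, '中国人']]
```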
""" @@ -21,19 +21,21 @@ class ExcelParser(api.Parser): 'sheet', ] - def __init__(self, sheet=1): - self.__index = sheet-1 - self.__bytes = None + def __init__(self, loader, sheet=1): + self.__loader = loader + self.__index = sheet - 1 + self.__force_parse = None self.__extended_rows = None + self.__bytes = None @property def closed(self): return self.__bytes is None or self.__bytes.closed - def open(self, source, encoding, loader): + def open(self, source, encoding=None, force_parse=False): self.close() - self.__loader = loader - self.__bytes = loader.load(source, encoding, mode='b') + self.__force_parse = force_parse + self.__bytes = self.__loader.load(source, mode='b', encoding=encoding) self.__book = xlrd.open_workbook( file_contents=self.__bytes.read(), encoding_override=encoding) @@ -55,5 +57,5 @@ def extended_rows(self): # Private def __iter_extended_rows(self): - for number in range(1, self.__sheet.nrows+1): - yield (number, None, list(self.__sheet.row_values(number - 1))) + for row_number in range(1, self.__sheet.nrows+1): + yield (row_number, None, list(self.__sheet.row_values(row_number - 1))) diff --git a/tabulator/parsers/excelx.py b/tabulator/parsers/xlsx.py similarity index 73% rename from tabulator/parsers/excelx.py rename to tabulator/parsers/xlsx.py index 966001fc..3e0f8cec 100644 --- a/tabulator/parsers/excelx.py +++ b/tabulator/parsers/xlsx.py @@ -7,13 +7,13 @@ import shutil import openpyxl from tempfile import TemporaryFile +from ..parser import Parser from .. import helpers -from . import api # Module API -class ExcelxParser(api.Parser): +class XLSXParser(Parser): """Parser to parse Excel modern `xlsx` data format. """ @@ -23,18 +23,22 @@ class ExcelxParser(api.Parser): 'sheet', ] - def __init__(self, sheet=1): - self.__index = sheet-1 - self.__bytes = None + def __init__(self, loader, sheet=1): + self.__loader = loader + self.__index = sheet - 1 + self.__force_parse = None self.__extended_rows = None + self.__bytes = None @property def closed(self): return self.__bytes is None or self.__bytes.closed - def open(self, source, encoding, loader): + def open(self, source, encoding=None, force_parse=False): self.close() - self.__bytes = loader.load(source, encoding, mode='b', allow_zip=True) + self.__force_parse = force_parse + self.__bytes = self.__loader.load( + source, mode='b', encoding=encoding, allow_zip=True) # For remote stream we need local copy (will be deleted on close by Python) # https://docs.python.org/3.5/library/tempfile.html#tempfile.TemporaryFile if hasattr(self.__bytes, 'url'): @@ -62,5 +66,5 @@ def extended_rows(self): # Private def __iter_extended_rows(self): - for number, row in enumerate(self.__sheet.iter_rows(), start=1): - yield (number, None, list(cell.value for cell in row)) + for row_number, row in enumerate(self.__sheet.iter_rows(), start=1): + yield (row_number, None, list(cell.value for cell in row)) diff --git a/tabulator/stream.py b/tabulator/stream.py index 2dbdb62e..629bc9a5 100644 --- a/tabulator/stream.py +++ b/tabulator/stream.py @@ -5,7 +5,6 @@ from __future__ import unicode_literals import six -import warnings from copy import copy from itertools import chain from . import exceptions @@ -16,73 +15,6 @@ # Module API class Stream(object): - """Tabular stream. - - Args: - source (str): stream source - headers (list/str): - headers list or pointer: - - list of headers for setting by user - - row number to extract headers from this row - For plain source headers row and all rows - before will be removed. 
For keyed source no rows - will be removed. - scheme (str): - scheme of source: - - file (default) - - ftp - - ftps - - gsheet - - http - - https - - native - - stream - - text - format (str): - format of source: - - None (detect) - - csv - options: - - delimiter - - doublequote - - escapechar - - quotechar - - quoting - - skipinitialspace - - gsheet - - json - options: - - prefix - - native - - tsv - - xls - options: - - sheet - - xlsx - options: - - sheet - encoding (str): - encoding of source: - - None (detect) - - utf-8 - - - sample_size (int): rows count for table.sample. Set to "0" to prevent - any parsing activities before actual table.iter call. In this case - headers will not be extracted from the source. - post_parse (generator[]): post parse processors (hooks). Signature - to follow is "processor(extended_rows)" which should yield - one extended row (number, headers, row) per yield instruction. - skip_rows (int/str[]): list of rows to skip by: - - row number (add integers to the list) - - row comment (add strings to the list) - Example: skip_rows=[1, 2, '#', '//'] - rows 1, 2 and - all rows started with '#' and '//' will be skipped. - allow_html (bool): if True it will allow html contents - custom_loaders (dict): unofficial custom loaders keyed by scheme - custom_parsers (dict): unofficial custom parsers keyed by format - options (dict): see in the scheme/format section - - """ # Public @@ -93,25 +25,17 @@ def __init__(self, format=None, encoding=None, sample_size=100, - post_parse=[], - skip_rows=[], allow_html=False, + force_strings=False, + force_parse=False, + skip_rows=[], + post_parse=[], custom_loaders={}, custom_parsers={}, - # DEPRECATED [v0.8-v1) - loader_options={}, - parser_options={}, + custom_writers={}, **options): - - # DEPRECATED [v0.8-v1) - if loader_options: - options.update(loader_options) - message = 'Use kwargs instead of "loader_options"' - warnings.warn(message, UserWarning) - if parser_options is None: - options.update(parser_options) - message = 'Use kwargs instead of "parser_options"' - warnings.warn(message, UserWarning) + """https://github.com/frictionlessdata/tabulator-py#stream + """ # Set headers self.__headers = None @@ -135,71 +59,72 @@ def __init__(self, self.__scheme = scheme self.__format = format self.__encoding = encoding - self.__post_parse = copy(post_parse) self.__sample_size = sample_size self.__allow_html = allow_html + self.__force_strings = force_strings + self.__force_parse = force_parse + self.__post_parse = copy(post_parse) self.__custom_loaders = copy(custom_loaders) self.__custom_parsers = copy(custom_parsers) + self.__custom_writers = copy(custom_writers) self.__options = options self.__sample_extended_rows = [] self.__loader = None self.__parser = None - self.__number = 0 + self.__row_number = 0 def __enter__(self): - """Enter context manager by opening table. + """https://github.com/frictionlessdata/tabulator-py#stream """ if self.closed: self.open() return self def __exit__(self, type, value, traceback): - """Exit context manager by closing table. + """https://github.com/frictionlessdata/tabulator-py#stream """ if not self.closed: self.close() def __iter__(self): - """Return rows iterator. + """https://github.com/frictionlessdata/tabulator-py#stream """ return self.iter() @property def closed(self): - """Return true if table is closed. + """https://github.com/frictionlessdata/tabulator-py#stream """ return not self.__parser or self.__parser.closed def open(self): - """Open table to iterate over it. 
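Among the new constructor flags, `force_strings` casts every cell to a string during iteration (see `iter()` below, which maps `helpers.stringify_value` over each row). A small sketch:

```python
from tabulator import Stream

# Numeric cells come back as strings when force_strings is enabled
with Stream([['id', 'name'], [1, 'english']], force_strings=True) as stream:
    print(stream.read())  # [['id', 'name'], ['1', 'english']]
```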
+ """https://github.com/frictionlessdata/tabulator-py#stream """ - # Prepare variables - scheme = self.__scheme - format = self.__format + # Get scheme and format + detected_scheme, detected_format = helpers.detect_scheme_and_format(self.__source) + scheme = self.__scheme or detected_scheme + format = self.__format or detected_format + + # Get options options = copy(self.__options) # Initiate loader self.__loader = None - if scheme is None: - scheme = helpers.detect_scheme(self.__source) - if not scheme: - scheme = config.DEFAULT_SCHEME - loader_class = self.__custom_loaders.get(scheme) - if loader_class is None: - if scheme not in config.LOADERS: - message = 'Scheme "%s" is not supported' % scheme - raise exceptions.SchemeError(message) - loader_path = config.LOADERS[scheme] - if loader_path: - loader_class = helpers.import_attribute(loader_path) - if loader_class is not None: - loader_options = helpers.extract_options(options, loader_class.options) - self.__loader = loader_class(**loader_options) + if scheme is not None: + loader_class = self.__custom_loaders.get(scheme) + if loader_class is None: + if scheme not in config.LOADERS: + message = 'Scheme "%s" is not supported' % scheme + raise exceptions.SchemeError(message) + loader_path = config.LOADERS[scheme] + if loader_path: + loader_class = helpers.import_attribute(loader_path) + if loader_class is not None: + loader_options = helpers.extract_options(options, loader_class.options) + self.__loader = loader_class(**loader_options) # Initiate parser - if format is None: - format = helpers.detect_format(self.__source) parser_class = self.__custom_parsers.get(format) if parser_class is None: if format not in config.PARSERS: @@ -207,7 +132,7 @@ def open(self): raise exceptions.FormatError(message) parser_class = helpers.import_attribute(config.PARSERS[format]) parser_options = helpers.extract_options(options, parser_class.options) - self.__parser = parser_class(**parser_options) + self.__parser = parser_class(self.__loader, **parser_options) # Bad options if options: @@ -216,7 +141,8 @@ def open(self): raise exceptions.OptionsError(message) # Open and setup - self.__parser.open(self.__source, self.__encoding, self.__loader) + self.__parser.open( + self.__source, encoding=self.__encoding, force_parse=self.__force_parse) self.__extract_sample() self.__extract_headers() if not self.__allow_html: @@ -225,72 +151,57 @@ def open(self): return self def close(self): - """Close table by closing underlaying stream. + """https://github.com/frictionlessdata/tabulator-py#stream """ self.__parser.close() def reset(self): - """Reset table pointer to the first row. + """https://github.com/frictionlessdata/tabulator-py#stream """ - if self.__number > self.__sample_size: + if self.__row_number > self.__sample_size: self.__parser.reset() self.__extract_sample() self.__extract_headers() - self.__number = 0 + self.__row_number = 0 @property def headers(self): - """None/list: table headers + """https://github.com/frictionlessdata/tabulator-py#stream """ return self.__headers @property def sample(self): - """list[]: sample of rows + """https://github.com/frictionlessdata/tabulator-py#stream """ sample = [] iterator = iter(self.__sample_extended_rows) iterator = self.__apply_processors(iterator) - for number, headers, row in iterator: + for row_number, headers, row in iterator: sample.append(row) return sample def iter(self, keyed=False, extended=False): - """Return rows iterator. 
- - Args: - keyed (bool): yield keyed rows - extended (bool): yield extended rows - - Yields: - mixed[]/mixed{}: row/keyed row/extended row - + """https://github.com/frictionlessdata/tabulator-py#stream """ iterator = chain( self.__sample_extended_rows, self.__parser.extended_rows) iterator = self.__apply_processors(iterator) - for number, headers, row in iterator: - if number > self.__number: - self.__number = number + for row_number, headers, row in iterator: + if self.__force_strings: + row = list(map(helpers.stringify_value, row)) + if row_number > self.__row_number: + self.__row_number = row_number if extended: - yield (number, headers, row) + yield (row_number, headers, row) elif keyed: yield dict(zip(headers, row)) else: yield row def read(self, keyed=False, extended=False, limit=None): - """Return table rows with count limit. - - Args: - keyed (bool): return keyed rows - extended (bool): return extended rows - limit (int): rows count limit - - Returns: - list: rows/keyed rows/extended rows - + """https://github.com/frictionlessdata/tabulator-py#stream """ result = [] rows = self.iter(keyed=keyed, extended=extended) @@ -301,63 +212,25 @@ def read(self, keyed=False, extended=False, limit=None): return result def save(self, target, format=None, encoding=None, **options): - """Save stream to filesystem. - - Args: - target (str): stream target - format (str): - saving format: - - None (detect) - - csv - options: - - delimiter - encoding (str): - saving encoding: - - utf-8 (default) - - - + """https://github.com/frictionlessdata/tabulator-py#stream """ if encoding is None: encoding = config.DEFAULT_ENCODING if format is None: - format = helpers.detect_format(target) - if format not in config.WRITERS: - message = 'Format "%s" is not supported' % format - raise exceptions.FormatError(message) - extended_rows = self.iter(extended=True) - writer_class = helpers.import_attribute(config.WRITERS[format]) + _, format = helpers.detect_scheme_and_format(target) + writer_class = self.__custom_writers.get(format) + if writer_class is None: + if format not in config.WRITERS: + message = 'Format "%s" is not supported' % format + raise exceptions.FormatError(message) + writer_class = helpers.import_attribute(config.WRITERS[format]) writer_options = helpers.extract_options(options, writer_class.options) if options: message = 'Not supported options "%s" for format "%s"' message = message % (', '.join(options), format) raise exceptions.OptionsError(message) writer = writer_class(**writer_options) - writer.write(target, encoding, extended_rows) - - @staticmethod - def test(source, scheme=None, format=None): - """Test if this source has supported scheme and format. 
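`save()` now routes rows through the new writer interface (`write(source, target, headers, encoding)`) and supports `custom_writers`. A usage sketch with a hypothetical target path:

```python
from tabulator import Stream

with Stream('data/table.csv', headers=1) as stream:
    stream.save('copy.csv', encoding='utf-8')  # format detected from target
```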
- - Args: - source (str): stream source - scheme (str): stream scheme - format (str): stream format - - Returns: - bool: True if source source has supported scheme and format - - """ - if scheme is None: - scheme = helpers.detect_scheme(source) - if not scheme: - scheme = config.DEFAULT_SCHEME - if scheme not in config.LOADERS: - return False - if format is None: - format = helpers.detect_format(source) - if format not in config.PARSERS: - return False - return True + writer.write(self.iter(), target, headers=self.headers, encoding=encoding) # Private @@ -368,8 +241,8 @@ def __extract_sample(self): if self.__sample_size: for _ in range(self.__sample_size): try: - number, headers, row = next(self.__parser.extended_rows) - self.__sample_extended_rows.append((number, headers, row)) + row_number, headers, row = next(self.__parser.extended_rows) + self.__sample_extended_rows.append((row_number, headers, row)) except StopIteration: break @@ -382,8 +255,8 @@ def __extract_headers(self): message = 'Headers row (%s) can\'t be more than sample_size (%s)' message = message % (self.__headers_row, self.__sample_size) raise exceptions.OptionsError(message) - for number, headers, row in self.__sample_extended_rows: - if number == self.__headers_row: + for row_number, headers, row in self.__sample_extended_rows: + if row_number == self.__headers_row: if headers is not None: self.__headers = headers keyed_source = True @@ -396,7 +269,7 @@ def __detect_html(self): # Detect html content text = '' - for number, headers, row in self.__sample_extended_rows: + for row_number, headers, row in self.__sample_extended_rows: for value in row: if isinstance(value, six.string_types): text += value @@ -409,17 +282,17 @@ def __apply_processors(self, iterator): # Builtin processor def builtin_processor(extended_rows): - for number, headers, row in extended_rows: + for row_number, headers, row in extended_rows: # Set headers headers = self.__headers # Skip row by numbers - if number in self.__skip_rows_by_numbers: + if row_number in self.__skip_rows_by_numbers: continue # Skip row by comments match = lambda comment: row[0].startswith(comment) if list(filter(match, self.__skip_rows_by_comments)): continue - yield (number, headers, row) + yield (row_number, headers, row) # Apply processors to iterator processors = [builtin_processor] + self.__post_parse diff --git a/tabulator/topen.py b/tabulator/topen.py deleted file mode 100644 index a2a17cd0..00000000 --- a/tabulator/topen.py +++ /dev/null @@ -1,135 +0,0 @@ -# -*- coding: utf-8 -*- -from __future__ import division -from __future__ import print_function -from __future__ import absolute_import -from __future__ import unicode_literals - -import six -import warnings -from .stream import Stream - - -# Module API - -def topen(source, - headers=None, - scheme=None, - format=None, - encoding=None, - post_parse=[], - sample_size=100, - loader_options={}, - parser_options={}, - # DEPRECATED [v0.5-v1) - loader_class=None, - parser_class=None, - with_headers=False, - extract_headers=False): # pragma: no cover - """Open stream from source. - - Args: - source (str): stream source - headers (list/str): - headers list or pointer: - - list of headers for setting by user - - row pointer like `row3` to extract headers. - For plain source headers row and all rows - before will be removed. For keyed source no rows - will be removed. 
- scheme (str): - scheme of source: - - file (default) - - stream - - text - - http - - https - - ftp - - ftps - - native - format (str): - format of source: - - None (detect) - - csv - - tsv - - json - - xls - - xlsx - - native - encoding (str): - encoding of source: - - None (detect) - - utf-8 - - - post_parse (generator[]): post parse processors (hooks). Signature - to follow is "processor(extended_rows)" which should yield - one extended row (number, headers, row) per yield instruction. - sample_size (int): rows count for stream.sample. Set to "0" to prevent - any parsing activities before actual stream.iter call. In this case - headers will not be extracted from the source. - loader_options (dict): - loader options: - - `constructor`: constructor returning `loaders.API` instance - - `encoding`: encoding of source - - - parser_options (dict): - parser options: - - `constructor`: constructor returning `parsers.API` instance - - - - Returns: - stream (Stream): opened stream instance - - """ - - # DEPRECATED [v0.5-v1) - if loader_options is None: - loader_options = {} - if parser_options is None: - parser_options = {} - if loader_class is not None: - message = 'Argument "loaders_class" is deprecated [v0.5-v1)' - warnings.warn(message, UserWarning) - loader_options['constructor'] = loader_class - if parser_class is not None: - message = 'Argument "parser_class" is deprecated [v0.5-v1)' - warnings.warn(message, UserWarning) - parser_options['constructor'] = parser_class - if with_headers: - message = 'Argument "with_headers" is deprecated [v0.5-v1)' - warnings.warn(message, UserWarning) - headers = 'row1' - if extract_headers: - message = 'Argument "extract_headers" is deprecated [v0.5-v1)' - warnings.warn(message, UserWarning) - headers = 'row1' - - # DEPRECATED [v0.6-v1) - message = 'Function "topen" is deprecated [v0.6-v1)' - warnings.warn(message, UserWarning) - if 'constructor' in loader_options: - message = 'Argument "constructor" is deprecated [v0.6-v1)' - warnings.warn(message, UserWarning) - loader_options.pop('constructor', None) - if 'constructor' in parser_options: - message = 'Argument "constructor" is deprecated [v0.6-v1)' - warnings.warn(message, UserWarning) - parser_options.pop('constructor', None) - if isinstance(headers, six.string_types): - message = 'Headers like "row1" is deprecated [v0.6-v1)' - warnings.warn(message, UserWarning) - headers = int(headers.replace('row', '')) - - # Initiate and open stream - stream = Stream( - source, - headers=headers, - scheme=scheme, - format=format, - encoding=encoding, - post_parse=post_parse, - sample_size=sample_size, - loader_options=loader_options, - parser_options=parser_options) - stream.open() - - return stream diff --git a/tabulator/validate.py b/tabulator/validate.py new file mode 100644 index 00000000..2e20918c --- /dev/null +++ b/tabulator/validate.py @@ -0,0 +1,29 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +from . import config +from . 
import helpers + + +# Module API + +def validate(source, scheme=None, format=None): + """https://github.com/frictionlessdata/tabulator-py#validate + """ + + # Get scheme and format + detected_scheme, detected_format = helpers.detect_scheme_and_format(source) + scheme = scheme or detected_scheme + format = format or detected_format + + # Validate scheme and format + if scheme is not None: + if scheme not in config.LOADERS: + return False + if format not in config.PARSERS: + return False + + return True diff --git a/tabulator/writers/api.py b/tabulator/writer.py similarity index 58% rename from tabulator/writers/api.py rename to tabulator/writer.py index 6d9eeaa1..f637f824 100644 --- a/tabulator/writers/api.py +++ b/tabulator/writer.py @@ -12,24 +12,18 @@ @add_metaclass(ABCMeta) class Writer(object): - """Writer representation. - - Args: - options(dict): writer options - - """ # Public - @property - # @abstractmethod - def options(self): - """list: list of available options + options = [] + + def __init__(self, **options): + """https://github.com/frictionlessdata/tabulator-py#custom-writers """ pass @abstractmethod - def write(self, target, encoding, extended_rows): - """Write tabular data to target. + def write(self, source, target, headers=None, encoding=None): + """https://github.com/frictionlessdata/tabulator-py#custom-writers """ pass diff --git a/tabulator/writers/csv.py b/tabulator/writers/csv.py index 2571362c..272b0cb3 100644 --- a/tabulator/writers/csv.py +++ b/tabulator/writers/csv.py @@ -7,13 +7,13 @@ import io import six import unicodecsv +from ..writer import Writer from .. import helpers -from . import api # Module API -class CSVWriter(api.Writer): +class CSVWriter(Writer): """CSV writer. """ @@ -34,13 +34,11 @@ def __init__(self, **options): # Set attributes self.__options = options - def write(self, target, encoding, extended_rows): + def write(self, source, target, headers=None, encoding=None): helpers.ensure_dir(target) with io.open(target, 'wb') as file: - writer = unicodecsv.writer( - file, encoding=encoding, **self.__options) - iterator = enumerate(extended_rows, start=1) - for count, (_, headers, row) in iterator: - if count == 1 and headers: - writer.writerow(headers) + writer = unicodecsv.writer(file, encoding=encoding, **self.__options) + if headers: + writer.writerow(headers) + for row in source: writer.writerow(row) diff --git a/tests/conftest.py b/tests/conftest.py new file mode 100644 index 00000000..29d4af7c --- /dev/null +++ b/tests/conftest.py @@ -0,0 +1,21 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +import pytest +import sqlite3 + + +# Fixtures + +@pytest.fixture +def database_url(tmpdir): + path = str(tmpdir.join('database.db')) + conn = sqlite3.connect(path) + conn.execute('CREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT)') + conn.execute('INSERT INTO data VALUES (1, "english"), (2, "中国人")') + conn.commit() + yield 'sqlite:///%s' % path + conn.close() diff --git a/tests/loaders/__init__.py b/tests/formats/__init__.py similarity index 100% rename from tests/loaders/__init__.py rename to tests/formats/__init__.py diff --git a/tests/formats/test_csv.py b/tests/formats/test_csv.py new file mode 100644 index 00000000..681c5d50 --- /dev/null +++ b/tests/formats/test_csv.py @@ -0,0 +1,146 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import 
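`validate()` replaces the removed `Stream.test()` and simply reports whether a source's scheme and format are supported. A sketch, assuming the function is re-exported from the package root:

```python
from tabulator import validate

print(validate('data/table.csv'))          # True (file scheme, csv format)
print(validate('unsupported://data.bad'))  # False (scheme is not supported)
```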
absolute_import +from __future__ import unicode_literals + +import io +from mock import Mock +from tabulator import Stream +from tabulator.parsers.csv import CSVParser +BASE_URL = 'https://raw.githubusercontent.com/okfn/tabulator-py/master/%s' + + +# Stream + +def test_stream_local_csv(): + with Stream('data/table.csv') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] + + +def test_stream_local_csv_with_bom(): + with Stream('data/special/bom.csv') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] + + +def test_stream_local_csv_with_bom_with_encoding(): + with Stream('data/special/bom.csv', encoding='utf-8') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] + + +def test_stream_csv_excel(): + source = 'value1,value2\nvalue3,value4' + with Stream(source, scheme='text', format='csv') as stream: + assert stream.read() == [['value1', 'value2'], ['value3', 'value4']] + + +def test_stream_csv_excel_tab(): + source = 'value1\tvalue2\nvalue3\tvalue4' + with Stream(source, scheme='text', format='csv', delimiter='\t') as stream: + assert stream.read() == [['value1', 'value2'], ['value3', 'value4']] + + +def test_stream_csv_unix(): + source = '"value1","value2"\n"value3","value4"' + with Stream(source, scheme='text', format='csv') as stream: + assert stream.read() == [['value1', 'value2'], ['value3', 'value4']] + + +def test_stream_csv_escaping(): + with Stream('data/special/escaping.csv', escapechar='\\') as stream: + assert stream.read() == [ + ['ID', 'Test'], + ['1', 'Test line 1'], + ['2', 'Test " line 2'], + ['3', 'Test " line 3'], + ] + + +def test_stream_csv_doublequote(): + with Stream('data/special/doublequote.csv') as stream: + for row in stream: + assert len(row) == 17 + + +def test_stream_stream_csv(): + source = io.open('data/table.csv', mode='rb') + with Stream(source, format='csv') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] + + +def test_stream_text_csv(): + source = 'text://id,name\n1,english\n2,中国人\n' + with Stream(source, format='csv') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] + + +def test_stream_remote_csv(): + with Stream(BASE_URL % 'data/table.csv') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] + + +def test_stream_remote_csv_non_ascii_url(): + with Stream('http://data.defra.gov.uk/ops/government_procurement_card/over_£500_GPC_apr_2013.csv') as stream: + assert stream.sample[0] == [ + 'Entity', + 'Transaction Posting Date', + 'Merchant Name', + 'Amount', + 'Description'] + + +def test_stream_csv_delimiter(): + source = '"value1";"value2"\n"value3";"value4"' + with Stream(source, scheme='text', format='csv', delimiter=';') as stream: + assert stream.read() == [['value1', 'value2'], ['value3', 'value4']] + + +def test_stream_csv_escapechar(): + source = 'value1%,value2\nvalue3%,value4' + with Stream(source, scheme='text', format='csv', escapechar='%') as stream: + assert stream.read() == [['value1,value2'], ['value3,value4']] + + +def test_stream_csv_quotechar(): + source = '%value1,value2%\n%value3,value4%' + with Stream(source, scheme='text', format='csv', quotechar='%') as stream: + assert stream.read() == [['value1,value2'], ['value3,value4']] + + +def 
test_stream_csv_skipinitialspace(): + source = 'value1, value2\nvalue3, value4' + with Stream(source, scheme='text', format='csv', skipinitialspace=True) as stream: + assert stream.read() == [['value1', 'value2'], ['value3', 'value4']] + + +# Parser + +def test_parser_csv(): + + source = 'data/table.csv' + encoding = None + loader = Mock() + loader.load = Mock(return_value=io.open(source, encoding='utf-8')) + parser = CSVParser(loader) + + assert parser.closed + parser.open(source, encoding=encoding) + assert not parser.closed + + assert list(parser.extended_rows) == [ + (1, None, ['id', 'name']), + (2, None, ['1', 'english']), + (3, None, ['2', '中国人'])] + + assert len(list(parser.extended_rows)) == 0 + parser.reset() + assert len(list(parser.extended_rows)) == 3 + + parser.close() + assert parser.closed diff --git a/tests/formats/test_gsheet.py b/tests/formats/test_gsheet.py new file mode 100644 index 00000000..e774f9a9 --- /dev/null +++ b/tests/formats/test_gsheet.py @@ -0,0 +1,28 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +import pytest +from tabulator import Stream, exceptions + + +# Stream + +def test_stream_gsheet(): + source = 'https://docs.google.com/spreadsheets/d/1mHIWnDvW9cALRMq9OdNfRwjAthCUFUOACPp0Lkyl7b4/edit?usp=sharing' + with Stream(source) as stream: + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] + + +def test_stream_gsheet_with_gid(): + source = 'https://docs.google.com/spreadsheets/d/1mHIWnDvW9cALRMq9OdNfRwjAthCUFUOACPp0Lkyl7b4/edit#gid=960698813' + with Stream(source) as stream: + assert stream.read() == [['id', 'name'], ['2', '中国人'], ['3', 'german']] + + +def test_stream_gsheet_bad_url(): + stream = Stream('https://docs.google.com/spreadsheets/d/bad') + with pytest.raises(exceptions.HTTPError) as excinfo: + stream.open() diff --git a/tests/formats/test_inline.py b/tests/formats/test_inline.py new file mode 100644 index 00000000..a844b832 --- /dev/null +++ b/tests/formats/test_inline.py @@ -0,0 +1,45 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +import pytest +from tabulator import Stream, exceptions + + +# Stream + +def test_stream_inline(): + source = [['id', 'name'], ['1', 'english'], ['2', '中国人']] + with Stream(source) as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] + + +def test_stream_inline_iterator(): + def generator(): + yield ['id', 'name'] + yield ['1', 'english'] + yield ['2', '中国人'] + with pytest.raises(exceptions.SourceError) as excinfo: + iterator = generator() + Stream(iterator).open() + assert 'callable' in str(excinfo.value) + + +def test_stream_inline_generator(): + def generator(): + yield ['id', 'name'] + yield ['1', 'english'] + yield ['2', '中国人'] + with Stream(generator) as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] + + +def test_stream_inline_keyed(): + source = [{'id': '1', 'name': 'english'}, {'id': '2', 'name': '中国人'}] + with Stream(source, format='inline') as stream: + 
assert stream.headers is None + assert stream.read() == [['1', 'english'], ['2', '中国人']] diff --git a/tests/formats/test_json.py b/tests/formats/test_json.py new file mode 100644 index 00000000..2c844c57 --- /dev/null +++ b/tests/formats/test_json.py @@ -0,0 +1,77 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +import io +from mock import Mock +from tabulator import Stream, exceptions +from tabulator.parsers.json import JSONParser +BASE_URL = 'https://raw.githubusercontent.com/okfn/tabulator-py/master/%s' + + +# Stream + +def test_stream_local_json_dicts(): + with Stream('data/table-dicts.json') as stream: + assert stream.headers is None + assert stream.read() == [[1, 'english'], [2, '中国人']] + + +def test_stream_local_json_lists(): + with Stream('data/table-lists.json') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], [1, 'english'], [2, '中国人']] + + +def test_stream_text_json_dicts(): + source = '[{"id": 1, "name": "english" }, {"id": 2, "name": "中国人" }]' + with Stream(source, scheme='text', format='json') as stream: + assert stream.headers is None + assert stream.read() == [[1, 'english'], [2, '中国人']] + + +def test_stream_text_json_lists(): + source = '[["id", "name"], [1, "english"], [2, "中国人"]]' + with Stream(source, scheme='text', format='json') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], [1, 'english'], [2, '中国人']] + + +def test_stream_remote_json_dicts(): + with Stream(BASE_URL % 'data/table-dicts.json') as stream: + assert stream.headers is None + assert stream.read() == [[1, 'english'], [2, '中国人']] + + +def test_stream_remote_json_lists(): + with Stream(BASE_URL % 'data/table-lists.json') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], [1, 'english'], [2, '中国人']] + + +# Parser + +def test_parser_json(): + + source = 'data/table-dicts.json' + encoding = None + loader = Mock() + loader.load = Mock(return_value=io.open(source, 'rb')) + parser = JSONParser(loader) + + assert parser.closed + parser.open(source, encoding=encoding) + assert not parser.closed + + assert list(parser.extended_rows) == [ + (1, ['id', 'name'], [1, 'english']), + (2, ['id', 'name'], [2, '中国人'])] + + assert len(list(parser.extended_rows)) == 0 + parser.reset() + assert len(list(parser.extended_rows)) == 2 + + parser.close() + assert parser.closed diff --git a/tests/parsers/test_ndjson.py b/tests/formats/test_ndjson.py similarity index 83% rename from tests/parsers/test_ndjson.py rename to tests/formats/test_ndjson.py index f82d37d0..29532de6 100644 --- a/tests/parsers/test_ndjson.py +++ b/tests/formats/test_ndjson.py @@ -6,25 +6,34 @@ import io import pytest -from six import StringIO from mock import Mock - +from six import StringIO from tabulator import exceptions, Stream from tabulator.parsers.ndjson import NDJSONParser -# Tests +# Stream + +def test_stream_ndjson(): + with Stream('data/table.ndjson', headers=1) as stream: + assert stream.headers == ['id', 'name'] + assert stream.read(keyed=True) == [ + {'id': 1, 'name': 'english'}, + {'id': 2, 'name': '中国人'}] + + +# Parser -def test_ndjson_parser(): +def test_parser_ndjson(): source = 'data/table.ndjson' encoding = None loader = Mock() loader.load = Mock(return_value=io.open(source, encoding='utf-8')) - parser = NDJSONParser() + parser = NDJSONParser(loader) assert parser.closed is True - parser.open(source, 
encoding, loader) + parser.open(source, encoding=encoding) assert parser.closed is False assert list(parser.extended_rows) == [ @@ -40,23 +49,15 @@ def test_ndjson_parser(): assert parser.closed -def test_stream_ndjson(): - with Stream('data/table.ndjson', headers=1) as stream: - assert stream.headers == ['id', 'name'] - assert stream.read(keyed=True) == [ - {'id': 1, 'name': 'english'}, - {'id': 2, 'name': '中国人'}] - - -def test_ndjson_list(): +def test_parser_ndjson_list(): stream = StringIO( '[1, 2, 3]\n' '[4, 5, 6]\n' ) - parser = NDJSONParser() loader = Mock(load=Mock(return_value=stream)) - parser.open(None, None, loader) + parser = NDJSONParser(loader) + parser.open(None) assert list(parser.extended_rows) == [ (1, None, [1, 2, 3]), @@ -64,15 +65,15 @@ def test_ndjson_list(): ] -def test_ndjson_scalar(): +def test_parser_ndjson_scalar(): stream = StringIO( '1\n' '2\n' ) - parser = NDJSONParser() loader = Mock(load=Mock(return_value=stream)) - parser.open(None, None, loader) + parser = NDJSONParser(loader) + parser.open(None) with pytest.raises(exceptions.SourceError): list(parser.extended_rows) diff --git a/tests/parsers/test_ods.py b/tests/formats/test_ods.py similarity index 70% rename from tests/parsers/test_ods.py rename to tests/formats/test_ods.py index c5bf8906..bda8659a 100644 --- a/tests/parsers/test_ods.py +++ b/tests/formats/test_ods.py @@ -5,23 +5,42 @@ from __future__ import unicode_literals import io +import pytest from mock import Mock -from tabulator import Stream +from tabulator import Stream, exceptions from tabulator.parsers.ods import ODSParser +BASE_URL = 'https://raw.githubusercontent.com/okfn/tabulator-py/master/%s' -# Tests +# Stream -def test_excelx_parser(): +def test_stream_ods(): + with Stream('data/table.ods', headers=1) as stream: + assert stream.headers == ['id', 'name'] + assert stream.read(keyed=True) == [ + {'id': 1.0, 'name': 'english'}, + {'id': 2.0, 'name': '中国人'}, + ] + + +def test_stream_ods_remote(): + source = BASE_URL % 'data/table.ods' + with Stream(source) as stream: + assert stream.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']] + + +# Parser + +def test_parser_ods(): source = 'data/table.ods' encoding = None loader = Mock() loader.load = Mock(return_value=io.open(source, 'rb')) - parser = ODSParser() + parser = ODSParser(loader) assert parser.closed - parser.open(source, encoding, loader) + parser.open(source, encoding=encoding) assert not parser.closed assert list(parser.extended_rows) == [ @@ -36,12 +55,3 @@ def test_excelx_parser(): parser.close() assert parser.closed - - -def test_stream_ods(): - with Stream('data/table.ods', headers=1) as stream: - assert stream.headers == ['id', 'name'] - assert stream.read(keyed=True) == [ - {'id': 1.0, 'name': 'english'}, - {'id': 2.0, 'name': '中国人'}, - ] diff --git a/tests/formats/test_sql.py b/tests/formats/test_sql.py new file mode 100644 index 00000000..55813274 --- /dev/null +++ b/tests/formats/test_sql.py @@ -0,0 +1,37 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +import pytest +from tabulator import Stream, exceptions + + +# Stream + +def test_stream_format_sql(database_url): + with Stream(database_url, table='data') as stream: + assert stream.read() == [[1, 'english'], [2, '中国人']] + + +def test_stream_format_sql_order_by(database_url): + with Stream(database_url, table='data', order_by='id') as stream: + assert stream.read() == [[1, 'english'], 
[2, '中国人']] + + +def test_stream_format_sql_order_by_desc(database_url): + with Stream(database_url, table='data', order_by='id desc') as stream: + assert stream.read() == [[2, '中国人'], [1, 'english']] + + +def test_stream_format_sql_table_is_required_error(database_url): + with pytest.raises(exceptions.OptionsError) as excinfo: + Stream(database_url).open() + assert 'table' in str(excinfo.value) + + +def test_stream_format_sql_headers(database_url): + with Stream(database_url, table='data', headers=1) as stream: + assert stream.headers == ['id', 'name'] + assert stream.read() == [[1, 'english'], [2, '中国人']] diff --git a/tests/parsers/test_tsv.py b/tests/formats/test_tsv.py similarity index 88% rename from tests/parsers/test_tsv.py rename to tests/formats/test_tsv.py index f0c3d353..df92e9a4 100644 --- a/tests/parsers/test_tsv.py +++ b/tests/formats/test_tsv.py @@ -9,18 +9,18 @@ from tabulator.parsers.tsv import TSVParser -# Tests +# Parser -def test_tsv_parser(): +def test_parser_tsv(): source = 'data/table.tsv' encoding = None loader = Mock() loader.load = Mock(return_value=io.open(source)) - parser = TSVParser() + parser = TSVParser(loader) assert parser.closed - parser.open(source, encoding, loader) + parser.open(source, encoding=encoding) assert not parser.closed assert list(parser.extended_rows) == [ diff --git a/tests/formats/test_xls.py b/tests/formats/test_xls.py new file mode 100644 index 00000000..c3e979e5 --- /dev/null +++ b/tests/formats/test_xls.py @@ -0,0 +1,59 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +import io +from mock import Mock +from tabulator import parsers +from tabulator import Stream, exceptions +from tabulator.parsers.xls import XLSParser +BASE_URL = 'https://raw.githubusercontent.com/okfn/tabulator-py/master/%s' + + +# Stream + +def test_stream_local_xls(): + with Stream('data/table.xls') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']] + + +def test_stream_remote_xls(): + with Stream(BASE_URL % 'data/table.xls') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']] + + +def test_stream_xls_sheet(): + source = 'data/special/sheet2.xls' + with Stream(source, sheet=2) as stream: + assert stream.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']] + + +# Parser + +def test_parser_xls(): + + source = 'data/table.xls' + encoding = None + loader = Mock() + loader.load = Mock(return_value=io.open(source, 'rb')) + parser = XLSParser(loader) + + assert parser.closed + parser.open(source, encoding=encoding) + assert not parser.closed + + assert list(parser.extended_rows) == [ + (1, None, ['id', 'name']), + (2, None, [1.0, 'english']), + (3, None, [2.0, '中国人'])] + + assert len(list(parser.extended_rows)) == 0 + parser.reset() + assert len(list(parser.extended_rows)) == 3 + + parser.close() + assert parser.closed diff --git a/tests/formats/test_xlsx.py b/tests/formats/test_xlsx.py new file mode 100644 index 00000000..00262076 --- /dev/null +++ b/tests/formats/test_xlsx.py @@ -0,0 +1,59 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +import io +from mock import Mock +from tabulator import Stream, exceptions +from tabulator.parsers.xlsx import XLSXParser +BASE_URL = 
'https://raw.githubusercontent.com/okfn/tabulator-py/master/%s' + + +# Stream + +def test_stream_xlsx_remote(): + source = BASE_URL % 'data/table.xlsx' + with Stream(source) as stream: + assert stream.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']] + + +def test_stream_stream_xlsx(): + source = io.open('data/table.xlsx', mode='rb') + with Stream(source, format='xlsx') as stream: + assert stream.headers is None + assert stream.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']] + + +def test_stream_xlsx_sheet(): + source = 'data/special/sheet2.xlsx' + with Stream(source, sheet=2) as stream: + assert stream.read() == [['id', 'name'], [1, 'english'], [2, '中国人']] + + +# Parser + +def test_parser_xlsx(): + + source = 'data/table.xlsx' + encoding = None + loader = Mock() + loader.load = Mock(return_value=io.open(source, 'rb')) + parser = XLSXParser(loader) + + assert parser.closed + parser.open(source, encoding=encoding) + assert not parser.closed + + assert list(parser.extended_rows) == [ + (1, None, ['id', 'name']), + (2, None, [1.0, 'english']), + (3, None, [2.0, '中国人'])] + + assert len(list(parser.extended_rows)) == 0 + parser.reset() + assert len(list(parser.extended_rows)) == 3 + + parser.close() + assert parser.closed diff --git a/tests/loaders/test_file.py b/tests/loaders/test_file.py deleted file mode 100644 index fc4ba15d..00000000 --- a/tests/loaders/test_file.py +++ /dev/null @@ -1,22 +0,0 @@ -# -*- coding: utf-8 -*- -from __future__ import division -from __future__ import print_function -from __future__ import absolute_import -from __future__ import unicode_literals - -from tabulator.loaders.file import FileLoader - - -# Tests - -def test_load_t(): - loader = FileLoader() - chars = loader.load('data/table.csv', 'utf-8', mode='t') - assert chars.read() == 'id,name\n1,english\n2,中国人\n' - - -def test_load_b(): - spec = '中国人'.encode('utf-8') - loader = FileLoader() - chars = loader.load('data/table.csv', 'utf-8', mode='b') - assert chars.read() == b'id,name\n1,english\n2,' + spec + b'\n' diff --git a/tests/loaders/test_web.py b/tests/loaders/test_web.py deleted file mode 100644 index 1dfdc884..00000000 --- a/tests/loaders/test_web.py +++ /dev/null @@ -1,27 +0,0 @@ -# -*- coding: utf-8 -*- -from __future__ import division -from __future__ import print_function -from __future__ import absolute_import -from __future__ import unicode_literals - -from tabulator.loaders.web import WebLoader - - -# Constants - -SOURCE = 'https://raw.githubusercontent.com/frictionlessdata/tabulator-py/master/data/table.csv' - - -# Tests - -def test_load_t(): - loader = WebLoader() - chars = loader.load(SOURCE, 'utf-8', mode='t') - assert chars.read() == 'id,name\n1,english\n2,中国人\n' - - -def test_load_b(): - spec = '中国人'.encode('utf-8') - loader = WebLoader() - chars = loader.load(SOURCE, 'utf-8', mode='b') - assert chars.read() == b'id,name\n1,english\n2,' + spec + b'\n' diff --git a/tests/parsers/test_csv.py b/tests/parsers/test_csv.py deleted file mode 100644 index 7461a197..00000000 --- a/tests/parsers/test_csv.py +++ /dev/null @@ -1,36 +0,0 @@ -# -*- coding: utf-8 -*- -from __future__ import division -from __future__ import print_function -from __future__ import absolute_import -from __future__ import unicode_literals - -import io -from mock import Mock -from tabulator.parsers.csv import CSVParser - - -# Tests - -def test_csv_parser(): - - source = 'data/table.csv' - encoding = None - loader = Mock() - loader.load = Mock(return_value=io.open(source, encoding='utf-8')) - parser = 
CSVParser() - - assert parser.closed - parser.open(source, encoding, loader) - assert not parser.closed - - assert list(parser.extended_rows) == [ - (1, None, ['id', 'name']), - (2, None, ['1', 'english']), - (3, None, ['2', '中国人'])] - - assert len(list(parser.extended_rows)) == 0 - parser.reset() - assert len(list(parser.extended_rows)) == 3 - - parser.close() - assert parser.closed diff --git a/tests/parsers/test_excel.py b/tests/parsers/test_excel.py deleted file mode 100644 index 468377e7..00000000 --- a/tests/parsers/test_excel.py +++ /dev/null @@ -1,37 +0,0 @@ -# -*- coding: utf-8 -*- -from __future__ import division -from __future__ import print_function -from __future__ import absolute_import -from __future__ import unicode_literals - -import io -from mock import Mock -from tabulator import parsers -from tabulator.parsers.excel import ExcelParser - - -# Tests - -def test_excel_parser(): - - source = 'data/table.xls' - encoding = None - loader = Mock() - loader.load = Mock(return_value=io.open(source, 'rb')) - parser = ExcelParser() - - assert parser.closed - parser.open(source, encoding, loader) - assert not parser.closed - - assert list(parser.extended_rows) == [ - (1, None, ['id', 'name']), - (2, None, [1.0, 'english']), - (3, None, [2.0, '中国人'])] - - assert len(list(parser.extended_rows)) == 0 - parser.reset() - assert len(list(parser.extended_rows)) == 3 - - parser.close() - assert parser.closed diff --git a/tests/parsers/test_excelx.py b/tests/parsers/test_excelx.py deleted file mode 100644 index e5e21b4b..00000000 --- a/tests/parsers/test_excelx.py +++ /dev/null @@ -1,36 +0,0 @@ -# -*- coding: utf-8 -*- -from __future__ import division -from __future__ import print_function -from __future__ import absolute_import -from __future__ import unicode_literals - -import io -from mock import Mock -from tabulator.parsers.excelx import ExcelxParser - - -# Tests - -def test_excelx_parser(): - - source = 'data/table.xlsx' - encoding = None - loader = Mock() - loader.load = Mock(return_value=io.open(source, 'rb')) - parser = ExcelxParser() - - assert parser.closed - parser.open(source, encoding, loader) - assert not parser.closed - - assert list(parser.extended_rows) == [ - (1, None, ['id', 'name']), - (2, None, [1.0, 'english']), - (3, None, [2.0, '中国人'])] - - assert len(list(parser.extended_rows)) == 0 - parser.reset() - assert len(list(parser.extended_rows)) == 3 - - parser.close() - assert parser.closed diff --git a/tests/parsers/test_json.py b/tests/parsers/test_json.py deleted file mode 100644 index 0f322881..00000000 --- a/tests/parsers/test_json.py +++ /dev/null @@ -1,35 +0,0 @@ -# -*- coding: utf-8 -*- -from __future__ import division -from __future__ import print_function -from __future__ import absolute_import -from __future__ import unicode_literals - -import io -from mock import Mock -from tabulator.parsers.json import JSONParser - - -# Tests - -def test_json_parser(): - - source = 'data/table-dicts.json' - encoding = None - loader = Mock() - loader.load = Mock(return_value=io.open(source, 'rb')) - parser = JSONParser() - - assert parser.closed - parser.open(source, encoding, loader) - assert not parser.closed - - assert list(parser.extended_rows) == [ - (1, ['id', 'name'], [1, 'english']), - (2, ['id', 'name'], [2, '中国人'])] - - assert len(list(parser.extended_rows)) == 0 - parser.reset() - assert len(list(parser.extended_rows)) == 2 - - parser.close() - assert parser.closed diff --git a/tests/parsers/__init__.py b/tests/schemes/__init__.py similarity index 100% rename from 
tests/parsers/__init__.py rename to tests/schemes/__init__.py diff --git a/tests/schemes/test_local.py b/tests/schemes/test_local.py new file mode 100644 index 00000000..fadc515e --- /dev/null +++ b/tests/schemes/test_local.py @@ -0,0 +1,30 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +from tabulator import Stream +from tabulator.loaders.local import LocalLoader + + +# Stream + +def test_stream_file(): + with Stream('data/table.csv') as stream: + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] + + +# Loader + +def test_loader_local_t(): + loader = LocalLoader() + chars = loader.load('data/table.csv', encoding='utf-8') + assert chars.read() == 'id,name\n1,english\n2,中国人\n' + + +def test_loader_local_b(): + spec = '中国人'.encode('utf-8') + loader = LocalLoader() + chars = loader.load('data/table.csv', mode='b', encoding='utf-8') + assert chars.read() == b'id,name\n1,english\n2,' + spec + b'\n' diff --git a/tests/schemes/test_remote.py b/tests/schemes/test_remote.py new file mode 100644 index 00000000..7087426a --- /dev/null +++ b/tests/schemes/test_remote.py @@ -0,0 +1,31 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +from tabulator import Stream +from tabulator.loaders.remote import RemoteLoader +BASE_URL = 'https://raw.githubusercontent.com/okfn/tabulator-py/master/%s' + + +# Stream + +def test_stream_https(): + with Stream(BASE_URL % 'data/table.csv') as stream: + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] + + +# Loader + +def test_loader_remote_t(): + loader = RemoteLoader() + chars = loader.load(BASE_URL % 'data/table.csv', encoding='utf-8') + assert chars.read() == 'id,name\n1,english\n2,中国人\n' + + +def test_loader_remote_b(): + spec = '中国人'.encode('utf-8') + loader = RemoteLoader() + chars = loader.load(BASE_URL % 'data/table.csv', mode='b', encoding='utf-8') + assert chars.read() == b'id,name\n1,english\n2,' + spec + b'\n' diff --git a/tests/schemes/test_stream.py b/tests/schemes/test_stream.py new file mode 100644 index 00000000..c1b41665 --- /dev/null +++ b/tests/schemes/test_stream.py @@ -0,0 +1,16 @@ +# -*- coding: utf-8 -*- +from __future__ import division +from __future__ import print_function +from __future__ import absolute_import +from __future__ import unicode_literals + +import io +from tabulator import Stream + + +# Stream + +def test_stream_stream(): + source = io.open('data/table.csv', mode='rb') + with Stream(source, format='csv') as stream: + assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']] diff --git a/tests/loaders/test_text.py b/tests/schemes/test_text.py similarity index 53% rename from tests/loaders/test_text.py rename to tests/schemes/test_text.py index fe892a27..a6d82c65 100644 --- a/tests/loaders/test_text.py +++ b/tests/schemes/test_text.py @@ -4,23 +4,27 @@ from __future__ import absolute_import from __future__ import unicode_literals +from tabulator import Stream from tabulator.loaders.text import TextLoader -# Constants +# Stream -SOURCE = 'id,name\n1,english\n2,中国人\n' +def test_stream_text(): + source = 'text://value1,value2\nvalue3,value4' + with Stream(source, format='csv') as stream: + assert stream.read() == [['value1', 'value2'], ['value3', 'value4']] -# Tests +# Loader def test_load_t(): loader = 
-    chars = loader.load(SOURCE, 'utf-8', mode='t')
+    chars = loader.load('id,name\n1,english\n2,中国人\n', encoding='utf-8')
     assert chars.read() == 'id,name\n1,english\n2,中国人\n'


 def test_load_b():
     spec = '中国人'.encode('utf-8')
     loader = TextLoader()
-    chars = loader.load(SOURCE, 'utf-8', mode='b')
+    chars = loader.load('id,name\n1,english\n2,中国人\n', mode='b', encoding='utf-8')
     assert chars.read() == b'id,name\n1,english\n2,' + spec + b'\n'
diff --git a/tests/test_cli.py b/tests/test_cli.py
index 682c6300..3dd92df2 100644
--- a/tests/test_cli.py
+++ b/tests/test_cli.py
@@ -8,6 +8,8 @@
 from tabulator.cli import cli


+# Tests
+
 def test_cli():
     runner = CliRunner()
     result = runner.invoke(cli, ['data/table.csv'])
diff --git a/tests/test_helpers.py b/tests/test_helpers.py
index 300d4122..ad18ec76 100644
--- a/tests/test_helpers.py
+++ b/tests/test_helpers.py
@@ -11,33 +11,28 @@
 # Tests

-def test_detect_scheme():
-    assert helpers.detect_scheme('text://path') == 'text'
-    assert helpers.detect_scheme('stream://path') == 'stream'
-    assert helpers.detect_scheme('file://path') == 'file'
-    assert helpers.detect_scheme('ftp://path') == 'ftp'
-    assert helpers.detect_scheme('ftps://path') == 'ftps'
-    assert helpers.detect_scheme('http://path') == 'http'
-    assert helpers.detect_scheme('https://path') == 'https'
-    assert helpers.detect_scheme('xxx://path') == 'xxx'
-    assert helpers.detect_scheme('xx://path') == 'xx'
-    assert helpers.detect_scheme('XXX://path') == 'xxx'
-    assert helpers.detect_scheme('XX://path') == 'xx'
-    assert helpers.detect_scheme('c://path') == None
-    assert helpers.detect_scheme('c:\\path') == None
-    assert helpers.detect_scheme('c:\path') == None
-    assert helpers.detect_scheme('http:/path') == None
-    assert helpers.detect_scheme('http//path') == None
-    assert helpers.detect_scheme('path') == None
-
-
-def test_detect_format():
-    assert helpers.detect_format('path.CsV') == 'csv'
-
-
-def test_detect_format_works_with_urls_with_query_and_fragment_components():
-    url = 'http://someplace.com/foo/path.csv?foo=bar#baz'
-    assert helpers.detect_format(url) == 'csv'
+@pytest.mark.parametrize('source, scheme, format', [
+    ('text://path', 'text', None),
+    ('stream://path', 'stream', None),
+    ('file://path', 'file', None),
+    ('ftp://path', 'ftp', None),
+    ('ftps://path', 'ftps', None),
+    ('http://path', 'http', None),
+    ('https://path', 'https', None),
+    ('xxx://path', 'xxx', None),
+    ('xx://path', 'xx', None),
+    ('XXX://path', 'xxx', None),
+    ('XX://path', 'xx', None),
+    ('c://path', 'file', None),
+    ('c:\\path', 'file', None),
+    ('c:\path', 'file', None),
+    ('http//path', 'file', None),
+    ('path', 'file', None),
+    ('path.CsV', 'file', 'csv'),
+    ('http://someplace.com/foo/path.csv?foo=bar#baz', 'http', 'csv'),
+])
+def test_detect_scheme_and_format(source, scheme, format):
+    assert helpers.detect_scheme_and_format(source) == (scheme, format)


 def test_detect_encoding():
diff --git a/tests/test_init.py b/tests/test_init.py
deleted file mode 100644
index 171f10df..00000000
--- a/tests/test_init.py
+++ /dev/null
@@ -1,15 +0,0 @@
-# -*- coding: utf-8 -*-
-from __future__ import division
-from __future__ import print_function
-from __future__ import absolute_import
-from __future__ import unicode_literals
-
-import tabulator
-
-
-# Tests
-
-def test_public_api():
-    assert isinstance(tabulator.Stream, type)
-    assert isinstance(tabulator.exceptions, object)
-    assert len(tabulator.__version__.split('.')) == 3
diff --git a/tests/test_stream.py b/tests/test_stream.py
index db943ac5..21c8fbf5 100644
--- a/tests/test_stream.py
+++ b/tests/test_stream.py
@@ -5,92 +5,103 @@
 from __future__ import unicode_literals

 import io
+import ast
+import six
 import pytest
+import datetime
+from sqlalchemy import create_engine
 from tabulator import Stream, exceptions
-from tabulator.loaders.file import FileLoader
+from tabulator.loaders.local import LocalLoader
 from tabulator.parsers.csv import CSVParser
+from tabulator.writers.csv import CSVWriter


-# Constants
+# Headers

-BASE_URL = 'https://raw.githubusercontent.com/okfn/tabulator-py/master/%s'
+def test_stream_headers():
+    with Stream('data/table.csv', headers=1) as stream:
+        assert stream.headers == ['id', 'name']
+        assert list(stream.iter(keyed=True)) == [
+            {'id': '1', 'name': 'english'},
+            {'id': '2', 'name': '中国人'}]

-# Tests [format:csv]
-
-def test_stream_csv_excel():
-    source = 'value1,value2\nvalue3,value4'
-    with Stream(source, scheme='text', format='csv') as stream:
-        assert stream.read() == [['value1', 'value2'], ['value3', 'value4']]
-
-
-def test_stream_csv_excel_tab():
-    source = 'value1\tvalue2\nvalue3\tvalue4'
-    with Stream(source, scheme='text', format='csv', delimiter='\t') as stream:
-        assert stream.read() == [['value1', 'value2'], ['value3', 'value4']]
+def test_stream_headers_user_set():
+    source = [['1', 'english'], ['2', '中国人']]
+    with Stream(source, headers=['id', 'name']) as stream:
+        assert stream.headers == ['id', 'name']
+        assert list(stream.iter(keyed=True)) == [
+            {'id': '1', 'name': 'english'},
+            {'id': '2', 'name': '中国人'}]

-def test_stream_csv_unix():
-    source = '"value1","value2"\n"value3","value4"'
-    with Stream(source, scheme='text', format='csv') as stream:
-        assert stream.read() == [['value1', 'value2'], ['value3', 'value4']]
+def test_stream_headers_stream_context_manager():
+    source = io.open('data/table.csv', mode='rb')
+    with Stream(source, headers=1, format='csv') as stream:
+        assert stream.headers == ['id', 'name']
+        assert stream.read(extended=True) == [
+            (2, ['id', 'name'], ['1', 'english']),
+            (3, ['id', 'name'], ['2', '中国人'])]

-def test_stream_csv_escaping():
-    with Stream('data/special/escaping.csv', escapechar='\\') as stream:
-        assert stream.read() == [
-            ['ID', 'Test'],
-            ['1', 'Test line 1'],
-            ['2', 'Test " line 2'],
-            ['3', 'Test " line 3'],
-        ]
+def test_stream_headers_inline():
+    source = [[], ['id', 'name'], ['1', 'english'], ['2', '中国人']]
+    with Stream(source, headers=2) as stream:
+        assert stream.headers == ['id', 'name']
+        assert stream.read(extended=True) == [
+            (3, ['id', 'name'], ['1', 'english']),
+            (4, ['id', 'name'], ['2', '中国人'])]

-# Tests [format:ods]
+def test_stream_headers_json_keyed():
+    # Get table
+    source = ('text://['
+        '{"id": 1, "name": "english"},'
+        '{"id": 2, "name": "中国人"}]')
+    with Stream(source, headers=1, format='json') as stream:
+        assert stream.headers == ['id', 'name']
+        assert list(stream.iter(keyed=True)) == [
+            {'id': 1, 'name': 'english'},
+            {'id': 2, 'name': '中国人'}]

-def test_stream_ods_remote():
-    source = BASE_URL % 'data/table.ods'
-    with Stream(source) as stream:
-        assert stream.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']]
+def test_stream_headers_inline_keyed():
+    source = [{'id': '1', 'name': 'english'}, {'id': '2', 'name': '中国人'}]
+    with Stream(source, headers=1) as stream:
+        assert stream.headers == ['id', 'name']
+        assert list(stream.iter(keyed=True)) == [
+            {'id': '1', 'name': 'english'},
+            {'id': '2', 'name': '中国人'}]

-# Tests [format:xlsx]
-def test_stream_xlsx_remote():
-    source = BASE_URL % 'data/table.xlsx'
-    with Stream(source) as stream:
-        assert stream.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']]
+def test_stream_headers_inline_keyed_headers_is_none():
+    source = [{'id': '1', 'name': 'english'}, {'id': '2', 'name': '中国人'}]
+    with Stream(source, headers=None) as stream:
+        assert stream.headers == None
+        assert list(stream.iter(extended=True)) == [
+            (1, None, ['1', 'english']),
+            (2, None, ['2', '中国人'])]


-# Tests [format:gsheet]
+# Encoding

-def test_stream_gsheet():
-    source = 'https://docs.google.com/spreadsheets/d/1mHIWnDvW9cALRMq9OdNfRwjAthCUFUOACPp0Lkyl7b4/edit?usp=sharing'
-    with Stream(source) as stream:
+def test_stream_encoding():
+    with Stream('data/table.csv', encoding='utf-8') as stream:
         assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]


-def test_stream_gsheet_with_gid():
-    source = 'https://docs.google.com/spreadsheets/d/1mHIWnDvW9cALRMq9OdNfRwjAthCUFUOACPp0Lkyl7b4/edit#gid=960698813'
-    with Stream(source) as stream:
-        assert stream.read() == [['id', 'name'], ['2', '中国人'], ['3', 'german']]
+# Sample size

+def test_stream_sample():
+    source = [['id', 'name'], ['1', 'english'], ['2', '中国人']]
+    with Stream(source, headers=1) as stream:
+        assert stream.headers == ['id', 'name']
+        assert stream.sample == [['1', 'english'], ['2', '中国人']]

-def test_stream_gsheet_bad_url():
-    stream = Stream('https://docs.google.com/spreadsheets/d/bad')
-    with pytest.raises(exceptions.HTTPError) as excinfo:
-        stream.open()
+# Allow html

-def test_stream_csv_doublequote():
-    with Stream('data/special/doublequote.csv') as stream:
-        for row in stream:
-            assert len(row) == 17
-
-
-# Tests [allow html]
-
-def test_html_content():
+def test_stream_html_content():
     # Link to html file containing information about csv file
     source = 'https://github.com/frictionlessdata/tabulator-py/blob/master/data/table.csv'
     with pytest.raises(exceptions.FormatError) as excinfo:
@@ -98,14 +109,49 @@ def test_html_content():
     assert 'HTML' in str(excinfo.value)


-def test_html_content_with_allow_html():
+def test_stream_html_content_with_allow_html():
     # Link to html file containing information about csv file
     source = 'https://github.com/frictionlessdata/tabulator-py/blob/master/data/table.csv'
     with Stream(source, allow_html=True) as stream:
         assert stream


-# Tests [skip rows]
+# Force strings
+
+def test_stream_force_strings():
+    temp = datetime.datetime(2000, 1, 1, 17)
+    date = datetime.date(2000, 1, 1)
+    time = datetime.time(17, 00)
+    source = [['John', 21, 1.5, temp, date, time]]
+    with Stream(source, force_strings=True) as stream:
+        assert stream.read() == [
+            ['John', '21', '1.5', '2000-01-01T17:00:00', '2000-01-01', '17:00:00']
+        ]
+
+
+# Force parse
+
+def test_stream_force_parse_inline():
+    source = [['John', 21], 'bad-row', ['Alex', 33]]
+    with Stream(source, force_parse=True) as stream:
+        assert stream.read(extended=True) == [
+            (1, None, ['John', 21]),
+            (2, None, []),
+            (3, None, ['Alex', 33]),
+        ]
+
+
+def test_stream_force_parse_json():
+    source = '[["John", 21], "bad-row", ["Alex", 33]]'
+    with Stream(source, scheme='text', format='json', force_parse=True) as stream:
+        assert stream.read(extended=True) == [
+            (1, None, ['John', 21]),
+            (2, None, []),
+            (3, None, ['Alex', 33]),
+        ]
+
+
+# Skip rows

 def test_stream_skip_rows():
@@ -120,12 +166,81 @@ def test_stream_skip_rows_with_headers():
     assert stream.read() == [['2', '中国人']]


-# Tests [custom loaders]
-
-
-def test_custom_loaders():
+# Post parse
+
+def test_stream_post_parse_headers():
+
+    # Processors
+    def extract_headers(extended_rows):
+        headers = None
+        for row_number, _, row in extended_rows:
+            if row_number == 1:
+                headers = row
+                continue
+            yield (row_number, headers, row)
+
+    # Stream
+    source = [['id', 'name'], ['1', 'english'], ['2', '中国人']]
+    with Stream(source, post_parse=[extract_headers]) as stream:
+        assert stream.headers == None
+        assert stream.read(extended=True) == [
+            (2, ['id', 'name'], ['1', 'english']),
+            (3, ['id', 'name'], ['2', '中国人'])]
+
+
+def test_stream_post_parse_chain():
+
+    # Processors
+    def skip_commented_rows(extended_rows):
+        for row_number, headers, row in extended_rows:
+            if (row and hasattr(row[0], 'startswith') and
+                    row[0].startswith('#')):
+                continue
+            yield (row_number, headers, row)
+    def skip_blank_rows(extended_rows):
+        for row_number, headers, row in extended_rows:
+            if not row:
+                continue
+            yield (row_number, headers, row)
+    def cast_rows(extended_rows):
+        for row_number, headers, row in extended_rows:
+            crow = []
+            for value in row:
+                try:
+                    if isinstance(value, six.string_types):
+                        value = ast.literal_eval(value)
+                except Exception:
+                    pass
+                crow.append(value)
+            yield (row_number, headers, crow)
+
+    # Stream
+    source = [['id', 'name'], ['#1', 'english'], [], ['2', '中国人']]
+    post_parse = [skip_commented_rows, skip_blank_rows, cast_rows]
+    with Stream(source, headers=1, post_parse=post_parse) as stream:
+        assert stream.headers == ['id', 'name']
+        assert stream.read() == [[2, '中国人']]
+
+
+def test_stream_post_parse_sample():
+
+    # Processors
+    def only_first_row(extended_rows):
+        for row_number, header, row in extended_rows:
+            if row_number == 1:
+                yield (row_number, header, row)
+
+    # Stream
+    with Stream('data/table.csv', post_parse=[only_first_row]) as stream:
+        assert stream.sample == [['id', 'name']]
+
+
+# Custom loaders
+
+
+def test_stream_custom_loaders():
     source = 'custom://data/table.csv'
-    class CustomLoader(FileLoader):
+    class CustomLoader(LocalLoader):
         def load(self, source, *args, **kwargs):
             return super(CustomLoader, self).load(
                 source.replace('custom://', ''), *args, **kwargs)
@@ -133,10 +248,10 @@ def load(self, source, *args, **kwargs):
     assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]


-# Tests [custom parsers]
+# Custom parsers

-def test_custom_parsers():
+def test_stream_custom_parsers():
     source = 'data/table.custom'
     class CustomParser(CSVParser):
         def open(self, source, *args, **kwargs):
@@ -146,51 +261,31 @@ def open(self, source, *args, **kwargs):
     assert stream.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]


-# Tests [options]
-
-def test_stream_csv_delimiter():
-    source = '"value1";"value2"\n"value3";"value4"'
-    with Stream(source, scheme='text', format='csv', delimiter=';') as stream:
-        assert stream.read() == [['value1', 'value2'], ['value3', 'value4']]
-
-
-def test_stream_csv_escapechar():
-    source = 'value1%,value2\nvalue3%,value4'
-    with Stream(source, scheme='text', format='csv', escapechar='%') as stream:
-        assert stream.read() == [['value1,value2'], ['value3,value4']]
-
-
-def test_stream_csv_quotechar():
-    source = '%value1,value2%\n%value3,value4%'
-    with Stream(source, scheme='text', format='csv', quotechar='%') as stream:
-        assert stream.read() == [['value1,value2'], ['value3,value4']]
-
-
-def test_stream_csv_quotechar():
-    source = 'value1, value2\nvalue3, value4'
-    with Stream(source, scheme='text', format='csv', skipinitialspace=True) as stream:
-        assert stream.read() == [['value1', 'value2'], ['value3', 'value4']]
-
+# Custom writers

-def test_stream_excel_sheet():
-    source = 'data/special/sheet2.xls'
-    with Stream(source, sheet=2) as stream:
-        assert stream.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']]
+def test_stream_save_custom_writers(tmpdir):
+    source = 'data/table.csv'
+    target = str(tmpdir.join('table.csv'))
+    class CustomWriter(CSVWriter): pass
+    with Stream(source, headers=1, custom_writers={'csv': CustomWriter}) as stream:
+        stream.save(target)
+    with Stream(target, headers=1) as stream:
+        assert stream.headers == ['id', 'name']
+        assert stream.read(extended=True) == [
+            (2, ['id', 'name'], ['1', 'english']),
+            (3, ['id', 'name'], ['2', '中国人'])]


-def test_stream_excelx_sheet():
-    source = 'data/special/sheet2.xlsx'
-    with Stream(source, sheet=2) as stream:
-        assert stream.read() == [['id', 'name'], [1, 'english'], [2, '中国人']]
+# Loader/parser options

-def test_stream_json_prefix():
+def test_stream_json_property():
     source = '{"root": [["value1", "value2"], ["value3", "value4"]]}'
-    with Stream(source, scheme='text', format='json', prefix='root') as stream:
+    with Stream(source, scheme='text', format='json', property='root') as stream:
         assert stream.read() == [['value1', 'value2'], ['value3', 'value4']]


-# Tests [errors]
+# Open errors

 def test_stream_source_error_data():
     stream = Stream('[1,2]', scheme='text', format='json')
@@ -244,41 +339,77 @@ def test_stream_http_error():
     stream.open()


-# Tests [Table.test]
-
-def test_stream_test_schemes():
-    # Supported
-    assert Stream.test('path.csv')
-    assert Stream.test('file://path.csv')
-    assert Stream.test('http://example.com/path.csv')
-    assert Stream.test('https://example.com/path.csv')
-    assert Stream.test('ftp://example.com/path.csv')
-    assert Stream.test('ftps://example.com/path.csv')
-    assert Stream.test('path.csv', scheme='file')
-    # Not supported
-    assert not Stream.test('ssh://example.com/path.csv')
-    assert not Stream.test('bad://example.com/path.csv')
-
-def test_stream_test_formats():
-    # Supported
-    assert Stream.test('path.csv')
-    assert Stream.test('path.json')
-    assert Stream.test('path.jsonl')
-    assert Stream.test('path.ndjson')
-    assert Stream.test('path.tsv')
-    assert Stream.test('path.xls')
-    assert Stream.test('path.ods')
-    assert Stream.test('path.no-format', format='csv')
-    # Not supported
-    assert not Stream.test('path.txt')
-    assert not Stream.test('path.bad')
-
-def test_stream_test_special():
-    # Gsheet
-    assert Stream.test('https://docs.google.com/spreadsheets/d/id', format='csv')
-    # File-like
-    assert Stream.test(io.open('data/table.csv', encoding='utf-8'), format='csv')
-    # Text
-    assert Stream.test('text://name,value\n1,2', format='csv')
-    # Native
-    assert Stream.test([{'name': 'value'}])
+# Reset
+
+def test_stream_reset():
+    with Stream('data/table.csv', headers=1) as stream:
+        headers1 = stream.headers
+        contents1 = stream.read()
+        stream.reset()
+        headers2 = stream.headers
+        contents2 = stream.read()
+        assert headers1 == ['id', 'name']
+        assert contents1 == [['1', 'english'], ['2', '中国人']]
+        assert headers1 == headers2
+        assert contents1 == contents2
+
+
+def test_stream_reset_and_sample_size():
+    with Stream('data/special/long.csv', headers=1, sample_size=3) as stream:
+        # Before reset
+        assert stream.read(extended=True) == [
+            (2, ['id', 'name'], ['1', 'a']),
+            (3, ['id', 'name'], ['2', 'b']),
+            (4, ['id', 'name'], ['3', 'c']),
+            (5, ['id', 'name'], ['4', 'd']),
+            (6, ['id', 'name'], ['5', 'e']),
+            (7, ['id', 'name'], ['6', 'f'])]
+        assert stream.sample == [['1', 'a'], ['2', 'b']]
+        assert stream.read() == []
+        # Reset stream
+        stream.reset()
+        # After reset
+        assert stream.read(extended=True, limit=3) == [
+            (2, ['id', 'name'], ['1', 'a']),
+            (3, ['id', 'name'], ['2', 'b']),
+            (4, ['id', 'name'], ['3', 'c'])]
+        assert stream.sample == [['1', 'a'], ['2', 'b']]
+        assert stream.read(extended=True) == [
+            (5, ['id', 'name'], ['4', 'd']),
+            (6, ['id', 'name'], ['5', 'e']),
+            (7, ['id', 'name'], ['6', 'f'])]
+
+
+def test_stream_reset_generator():
+    def generator():
+        yield [1]
+        yield [2]
+    with Stream(generator, sample_size=0) as stream:
+        # Before reset
+        assert stream.read() == [[1], [2]]
+        # Reset stream
+        stream.reset()
+        # After reset
+        assert stream.read() == [[1], [2]]
+
+# Save
+
+def test_stream_save_csv(tmpdir):
+    source = 'data/table.csv'
+    target = str(tmpdir.join('table.csv'))
+    with Stream(source, headers=1) as stream:
+        stream.save(target)
+    with Stream(target, headers=1) as stream:
+        assert stream.headers == ['id', 'name']
+        assert stream.read(extended=True) == [
+            (2, ['id', 'name'], ['1', 'english']),
+            (3, ['id', 'name'], ['2', '中国人'])]
+
+
+def test_stream_save_xls(tmpdir):
+    source = 'data/table.csv'
+    target = str(tmpdir.join('table.xls'))
+    with Stream(source, headers=1) as stream:
+        with pytest.raises(exceptions.FormatError) as excinfo:
+            stream.save(target)
+        assert 'xls' in str(excinfo.value)
diff --git a/tests/test_topen.py b/tests/test_topen.py
deleted file mode 100644
index 10634148..00000000
--- a/tests/test_topen.py
+++ /dev/null
@@ -1,577 +0,0 @@
-# -*- coding: utf-8 -*-
-from __future__ import division
-from __future__ import print_function
-from __future__ import absolute_import
-from __future__ import unicode_literals
-
-import io
-import ast
-import sys
-import six
-import pytest
-from tabulator import topen, exceptions
-from tabulator.parsers.csv import CSVParser
-
-
-# Constants
-
-BASE_URL = 'https://raw.githubusercontent.com/okfn/tabulator-py/master/%s'
-
-
-# Tests [loaders/parsers]
-
-def test_file_csv():
-
-    # Get table
-    table = topen('data/table.csv')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-
-
-def test_file_csv_parser_options():
-
-    # Get table
-    table = topen('data/table.csv',
-        parser_options={'constructor': CSVParser})
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-
-def test_file_csv_with_bom():
-
-    # Get table
-    table = topen('data/special/bom.csv', encoding='utf-8')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-
-    # Get table
-    table = topen('data/special/bom.csv')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-
-
-# DEPRECATED [v0.5-v1)
-def test_file_csv_parser_class():
-
-    # Get table
-    table = topen('data/table.csv', parser_class=CSVParser)
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-
-
-def test_file_json_dicts():
-
-    # Get table
-    table = topen('data/table-dicts.json')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [[1, 'english'], [2, '中国人']]
-
-
-def test_file_json_lists():
-
-    # Get table
-    table = topen('data/table-lists.json')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], [1, 'english'], [2, '中国人']]
-
-
-def test_file_xls():
-
-    # Get table
-    table = topen('data/table.xls')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']]
-
-
-def test_stream_csv():
-
-    # Get table
-    source = io.open('data/table.csv', mode='rb')
-    table = topen(source, format='csv')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-
-
-def test_stream_xlsx():
-
-    # Get table
-    source = io.open('data/table.xlsx', mode='rb')
-    table = topen(source, format='xlsx')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']]
-
-
-def test_text_csv():
-
-    # Get table
-    source = 'text://id,name\n1,english\n2,中国人\n'
-    table = topen(source, format='csv')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-
-
-def test_text_json_dicts():
-
-    # Get table
-    source = '[{"id": 1, "name": "english" }, {"id": 2, "name": "中国人" }]'
-    table = topen(source, scheme='text', format='json')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [[1, 'english'], [2, '中国人']]
-
-
-def test_text_json_lists():
-
-    # Get table
-    source = '[["id", "name"], [1, "english"], [2, "中国人"]]'
-    table = topen(source, scheme='text', format='json')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], [1, 'english'], [2, '中国人']]
-
-
-def test_web_csv():
-
-    # Get table
-    table = topen(BASE_URL % 'data/table.csv')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-
-
-def test_web_csv_non_ascii_url():
-
-    # Get table
-    table = topen('http://data.defra.gov.uk/ops/government_procurement_card/over_£500_GPC_apr_2013.csv')
-
-    # Make assertions
-    assert table.sample[0] == [
-        'Entity',
-        'Transaction Posting Date',
-        'Merchant Name',
-        'Amount',
-        'Description']
-
-
-def test_web_json_dicts():
-
-    # Get table
-    table = topen(BASE_URL % 'data/table-dicts.json')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [[1, 'english'], [2, '中国人']]
-
-
-def test_web_json_lists():
-
-    # Get table
-    table = topen(BASE_URL % 'data/table-lists.json')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], [1, 'english'], [2, '中国人']]
-
-
-def test_web_excel():
-
-    # Get table
-    table = topen(BASE_URL % 'data/table.xls')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], [1.0, 'english'], [2.0, '中国人']]
-
-
-def test_native():
-
-    # Get table
-    source = [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-    table = topen(source)
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-
-
-def test_native_iterator():
-
-    # Get table
-    source = iter([['id', 'name'], ['1', 'english'], ['2', '中国人']])
-    table = topen(source)
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-
-
-def test_native_iterator():
-
-    # Get table
-    def generator():
-        yield ['id', 'name']
-        yield ['1', 'english']
-        yield ['2', '中国人']
-    with pytest.raises(exceptions.SourceError) as excinfo:
-        iterator = generator()
-        topen(iterator)
-    assert 'callable' in str(excinfo.value)
-
-
-def test_native_generator():
-
-    # Get table
-    def generator():
-        yield ['id', 'name']
-        yield ['1', 'english']
-        yield ['2', '中国人']
-    table = topen(generator)
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-
-
-def test_native_keyed():
-
-    # Get table
-    source = [{'id': '1', 'name': 'english'}, {'id': '2', 'name': '中国人'}]
-    table = topen(source, scheme='native', format='native')
-
-    # Make assertions
-    assert table.headers is None
-    assert table.read() == [['1', 'english'], ['2', '中国人']]
-
-
-# Tests [headers]
-
-def test_headers():
-
-    # Get table
-    table = topen('data/table.csv', headers=1)
-
-    # Make assertions
-    assert table.headers == ['id', 'name']
-    assert list(table.iter(keyed=True)) == [
-        {'id': '1', 'name': 'english'},
-        {'id': '2', 'name': '中国人'}]
-
-
-# DEPRECATED [v0.6-v1)
-def test_headers_str():
-
-    # Get table
-    table = topen('data/table.csv', headers='row1')
-
-    # Make assertions
-    assert table.headers == ['id', 'name']
-    assert list(table.iter(keyed=True)) == [
-        {'id': '1', 'name': 'english'},
-        {'id': '2', 'name': '中国人'}]
-
-def test_headers_user_set():
-
-    # Get table
-    source = [['1', 'english'], ['2', '中国人']]
-    table = topen(source, headers=['id', 'name'])
-
-    # Make assertions
-    assert table.headers == ['id', 'name']
-    assert list(table.iter(keyed=True)) == [
-        {'id': '1', 'name': 'english'},
-        {'id': '2', 'name': '中国人'}]
-
-
-# DEPRECATED [v0.5-v1)
-def test_headers_with_headers_argument():
-
-    # Get table
-    table = topen('data/table.csv', with_headers=True)
-
-    # Make assertions
-    assert table.headers == ['id', 'name']
-    assert list(table.iter(keyed=True)) == [
-        {'id': '1', 'name': 'english'},
-        {'id': '2', 'name': '中国人'}]
-
-
-def test_headers_stream_context_manager():
-
-    # Get source
-    source = io.open('data/table.csv', mode='rb')
-
-    # Make assertions
-    with topen(source, headers='row1', format='csv') as table:
-        assert table.headers == ['id', 'name']
-        assert table.read(extended=True) == [
-            (2, ['id', 'name'], ['1', 'english']),
-            (3, ['id', 'name'], ['2', '中国人'])]
-
-
-def test_headers_native():
-
-    # Get table
-    source = [[], ['id', 'name'], ['1', 'english'], ['2', '中国人']]
-    table = topen(source, headers='row2')
-
-    # Make assertions
-    assert table.headers == ['id', 'name']
-    assert table.read(extended=True) == [
-        (3, ['id', 'name'], ['1', 'english']),
-        (4, ['id', 'name'], ['2', '中国人'])]
-
-
-def test_headers_json_keyed():
-
-    # Get table
-    source = ('text://['
-        '{"id": 1, "name": "english"},'
-        '{"id": 2, "name": "中国人"}]')
-    table = topen(source, headers='row1', format='json')
-
-    # Make assertions
-    assert table.headers == ['id', 'name']
-    assert list(table.iter(keyed=True)) == [
-        {'id': 1, 'name': 'english'},
-        {'id': 2, 'name': '中国人'}]
-
-
-def test_headers_native_keyed():
-
-    # Get table
-    source = [{'id': '1', 'name': 'english'}, {'id': '2', 'name': '中国人'}]
-    table = topen(source, headers='row1')
-
-    # Make assertions
-    assert table.headers == ['id', 'name']
-    assert list(table.iter(keyed=True)) == [
-        {'id': '1', 'name': 'english'},
-        {'id': '2', 'name': '中国人'}]
-
-
-def test_headers_native_keyed_headers_is_none():
-
-    # Get table
-    source = [{'id': '1', 'name': 'english'}, {'id': '2', 'name': '中国人'}]
-    table = topen(source, headers=None)
-
-    # Make assertions
-    assert table.headers == None
-    assert list(table.iter(extended=True)) == [
-        (1, None, ['1', 'english']),
-        (2, None, ['2', '中国人'])]
-
-
-# Tests [sample]
-
-
-def test_sample():
-
-    # Get table
-    source = [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-    table = topen(source, headers='row1')
-
-    # Make assertions
-    assert table.headers == ['id', 'name']
-    assert table.sample == [['1', 'english'], ['2', '中国人']]
-
-
-# Tests [reset]
-
-def test_reset():
-
-    # Get results
-    with topen('data/table.csv', headers='row1') as table:
-        headers1 = table.headers
-        contents1 = table.read()
-        table.reset()
-        headers2 = table.headers
-        contents2 = table.read()
-
-    # Make assertions
-    assert headers1 == ['id', 'name']
-    assert contents1 == [['1', 'english'], ['2', '中国人']]
-    assert headers1 == headers2
-    assert contents1 == contents2
-
-
-def test_reset_and_sample_size():
-
-    # Get table
-    table = topen('data/special/long.csv', headers=1, sample_size=3)
-
-    # Make assertions
-    assert table.read(extended=True) == [
-        (2, ['id', 'name'], ['1', 'a']),
-        (3, ['id', 'name'], ['2', 'b']),
-        (4, ['id', 'name'], ['3', 'c']),
-        (5, ['id', 'name'], ['4', 'd']),
-        (6, ['id', 'name'], ['5', 'e']),
-        (7, ['id', 'name'], ['6', 'f'])]
-    assert table.sample == [['1', 'a'], ['2', 'b']]
-    assert table.read() == []
-
-    # Reset table
-    table.reset()
-
-    # Make assertions
-    assert table.read(extended=True, limit=3) == [
-        (2, ['id', 'name'], ['1', 'a']),
-        (3, ['id', 'name'], ['2', 'b']),
-        (4, ['id', 'name'], ['3', 'c'])]
-    assert table.sample == [['1', 'a'], ['2', 'b']]
-    assert table.read(extended=True) == [
-        (5, ['id', 'name'], ['4', 'd']),
-        (6, ['id', 'name'], ['5', 'e']),
-        (7, ['id', 'name'], ['6', 'f'])]
-
-
-def test_reset_generator():
-
-    # Generator
-    def generator():
-        yield [1]
-        yield [2]
-
-    # Get table
-    table = topen(generator, sample_size=0)
-
-    # Make assertions
-    assert table.read() == [[1], [2]]
-
-    # Reset
-    table.reset()
-
-    # Make assertions
-    assert table.read() == [[1], [2]]
-
-
-# Tests [processors]
-
-def test_processors_headers():
-
-    # Processors
-    def extract_headers(extended_rows):
-        headers = None
-        for number, _, row in extended_rows:
-            if number == 1:
-                headers = row
-                continue
-            yield (number, headers, row)
-
-    # Get table
-    source = [['id', 'name'], ['1', 'english'], ['2', '中国人']]
-    table = topen(source, post_parse=[extract_headers])
-
-    # Make assertions
-    assert table.headers == None
-    assert table.read(extended=True) == [
-        (2, ['id', 'name'], ['1', 'english']),
-        (3, ['id', 'name'], ['2', '中国人'])]
-
-
-def test_processors_chain():
-
-    # Processors
-    def skip_commented_rows(extended_rows):
-        for number, headers, row in extended_rows:
-            if (row and hasattr(row[0], 'startswith') and
-                    row[0].startswith('#')):
-                continue
-            yield (number, headers, row)
-    def skip_blank_rows(extended_rows):
-        for number, headers, row in extended_rows:
-            if not row:
-                continue
-            yield (number, headers, row)
-    def cast_rows(extended_rows):
-        for number, headers, row in extended_rows:
-            crow = []
-            for value in row:
-                try:
-                    if isinstance(value, six.string_types):
-                        value = ast.literal_eval(value)
-                except Exception:
-                    pass
-                crow.append(value)
-            yield (number, headers, crow)
-
-    # Get table
-    source = [['id', 'name'], ['#1', 'english'], [], ['2', '中国人']]
-    table = topen(source, headers='row1', post_parse=[
-        skip_commented_rows,
-        skip_blank_rows,
-        cast_rows])
-
-    # Make assertions
-    assert table.headers == ['id', 'name']
-
-
-def test_processors_sample():
-
-    # Processors
-    def only_first_row(extended_rows):
-        for number, header, row in extended_rows:
-            if number == 1:
-                yield (number, header, row)
-
-    # Get table
-    table = topen('data/table.csv', post_parse=[only_first_row])
-
-    # Make assertions
-    assert table.sample == [['id', 'name']]
-
-
-# Tests [save]
-
-def test_save_csv(tmpdir):
-
-    # Save table
-    path = str(tmpdir.join('table.csv'))
-    table = topen('data/table.csv', headers=1)
-    table.save(path)
-
-    # Open saved table
-    table = topen(path, headers=1)
-
-    # Make assertions
-    assert table.headers == ['id', 'name']
-    assert table.read(extended=True) == [
-        (2, ['id', 'name'], ['1', 'english']),
-        (3, ['id', 'name'], ['2', '中国人'])]
-
-
-def test_save_xls(tmpdir):
-
-    # Save table
-    path = str(tmpdir.join('table.xls'))
-    table = topen('data/table.csv', headers=1)
-
-    # Assert raises
-    with pytest.raises(exceptions.FormatError) as excinfo:
-        table.save(path)
-    assert 'xls' in str(excinfo.value)
diff --git a/tests/test_validate.py b/tests/test_validate.py
new file mode 100644
index 00000000..4a558556
--- /dev/null
+++ b/tests/test_validate.py
@@ -0,0 +1,50 @@
+# -*- coding: utf-8 -*-
+from __future__ import division
+from __future__ import print_function
+from __future__ import absolute_import
+from __future__ import unicode_literals
+
+import io
+from tabulator import validate
+
+
+# Tests
+
+def test_validate_test_schemes():
+    # Supported
+    assert validate('path.csv')
+    assert validate('file://path.csv')
+    assert validate('http://example.com/path.csv')
+    assert validate('https://example.com/path.csv')
+    assert validate('ftp://example.com/path.csv')
+    assert validate('ftps://example.com/path.csv')
+    assert validate('path.csv', scheme='file')
+    # Not supported
+    assert not validate('ssh://example.com/path.csv')
+    assert not validate('bad://example.com/path.csv')
+
+
+def test_validate_test_formats():
+    # Supported
+    assert validate('path.csv')
+    assert validate('path.json')
+    assert validate('path.jsonl')
+    assert validate('path.ndjson')
+    assert validate('path.tsv')
+    assert validate('path.xls')
+    assert validate('path.ods')
+    assert validate('path.no-format', format='csv')
+    # Not supported
+    assert not validate('path.txt')
+    assert not validate('path.bad')
+
+
+def test_validate_test_special():
+    # Gsheet
+    assert validate('https://docs.google.com/spreadsheets/d/id', format='csv')
+    # File-like
+    assert validate(io.open('data/table.csv', encoding='utf-8'), format='csv')
+    # Text
+    assert validate('text://name,value\n1,2', format='csv')
+    # Inline
+    assert validate([{'name': 'value'}])
diff --git a/tox.ini b/tox.ini
index c11596f9..9bdccd69 100644
--- a/tox.ini
+++ b/tox.ini
@@ -8,6 +8,8 @@ envlist=
     py35

 [testenv]
+extras=
+    ods
 deps=
     mock
     pytest