Commit

Merge branch 'release/0.2.0'
fgmacedo committed Oct 3, 2013
2 parents b56d801 + 89b7186 commit ebdf77f
Showing 33 changed files with 1,032 additions and 966 deletions.
6 changes: 4 additions & 2 deletions .coveragerc
@@ -1,10 +1,12 @@
[run]
include=*raspador*
omit=tasks.py
omit=tasks.py,*ordereddict*

[report]
exclude_lines =

raise NotImplementedError

if __name__ == '__main__':
if __name__ == '__main__':

except ImportError:
1 change: 1 addition & 0 deletions .gitignore
@@ -5,6 +5,7 @@ MANIFEST

# Virtualenvs
env*
.tox

# Distribute
build
1 change: 0 additions & 1 deletion .travis.yml
@@ -6,7 +6,6 @@ python:
- "3.3"
- "pypy"
install:
- "pip install -r requirements_dev.txt --use-mirrors"
# For Python 2.6 support
- "pip install ordereddict --use-mirrors"
- "pip install coveralls"
122 changes: 66 additions & 56 deletions README.rst
@@ -15,90 +15,99 @@ raspador
:target: https://crate.io/packages/raspador/


Biblioteca para extração de dados em documentos semi-estruturados.
Library to extract data from semi-structured text documents.

A definição dos extratores é feita através de classes como modelos, de forma
semelhante ao ORM do Django. Cada extrator procura por um padrão especificado
por expressão regular, e a conversão para tipos primitidos é feita
automaticamente a partir dos grupos capturados.
It's best suited for data processing in files that have no formal
structure and are in plain text (or are easy to convert). Structured formats
like XML, CSV and HTML are not a good use case for raspador, and they have
excellent alternatives for data extraction, like lxml_, html5lib_,
BeautifulSoup_, and PyQuery_.

The extractors are defined through classes as models, something similar to the
Django ORM. Each field searches for a pattern specified by the regular
expression, and captured groups are converted automatically to primitives.
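
To illustrate the idea of a field that pairs a regular expression with a type conversion, here is a minimal sketch. It does not use raspador's own classes: ``SketchField`` and its ``convert`` argument are hypothetical names chosen for this example, and the real fields may work differently.

```python
import re


class SketchField(object):
    """Hypothetical field: a regex plus a converter for the captured group."""

    def __init__(self, pattern, convert=str):
        self.regex = re.compile(pattern)
        self.convert = convert

    def parse(self, line):
        match = self.regex.search(line)
        if match is None:
            return None  # pattern not present on this line
        # Convert the first captured group to the requested primitive type.
        return self.convert(match.group(1))


size = SketchField(r'SIZE:(\d+)', convert=int)
kind = SketchField(r'TYPE:(\S+)')

line = 'PART:/dev/sda1 SIZE:2048 TYPE:ext4'
print(size.parse(line))  # the captured digits, converted to an int
print(kind.parse(line))  # the captured text, kept as a string
```

Keeping the pattern and the converter together on one object is what lets a model-style class declare each field in a single line.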

O analisador é implementado como um gerador, onde cada item encontrado pode ser
consumido antes do final da análise, caracterizando uma pipeline.
The parser is implemented as a generator, so each item found can be consumed
before the analysis finishes, forming a pipeline.

The analysis is forward-only, which makes it fast and means that any
iterator that yields strings can be analyzed, including infinite streams.
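
The generator-based, forward-only approach can be sketched with the standard library alone. This is an illustration of the technique under stated assumptions, not raspador's implementation: ``sketch_parse`` and ``endless_lines`` are hypothetical helpers invented for this example.

```python
import itertools
import re

PART_RE = re.compile(r'PART:(\S+)')


def sketch_parse(lines):
    """Yield one dict per matching line, reading the input only once."""
    for line in lines:
        match = PART_RE.search(line)
        if match:
            yield {'PART': match.group(1)}


def endless_lines():
    """Hypothetical infinite stream of log lines."""
    for n in itertools.count(1):
        yield 'PART:/dev/sda%d TYPE:ext4' % n


# Items are produced lazily, so a consumer can stop at any point even
# though the input never ends.
first_three = list(itertools.islice(sketch_parse(endless_lines()), 3))
print(first_three)
```

Because nothing is buffered and no line is revisited, memory use stays constant regardless of how long the input is.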

A análise é foward-only, o que o torna extremamente rápido, e deste modo
qualquer iterador que retorne uma string pode ser analisado, incluindo streams
infinitos.
.. _lxml: http://lxml.de
.. _html5lib: https://github.com/html5lib/html5lib-python
.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _PyQuery: https://github.com/gawel/pyquery/


Com uma base sólida e enxuta, é fácil construir seus próprios extratores.
Install
=======

Além da utilidade da ferramenta, o raspador é um exemplo prático e simples da
utilização de conceitos e recursos como iteradores, geradores, meta-programação
e property-descriptors.
raspador works on CPython 2.6+, CPython 3.2+ and PyPy. To install it, use::

pip install raspador

Compatibilidade e dependências
===============================
or easy install::

O raspador é compatível com Python 2.6, 2.7, 3.2, 3.3 e pypy.
easy_install raspador

Desenvolvimento realizado em Python 2.7.5 e Python 3.2.3.

Não há dependências externas.
From source
-----------

.. note:: Python 2.6
Download and install from source::

Em Python 2.6, a biblioteca `ordereddict
<https://pypi.python.org/pypi/ordereddict/>`_ é necessária.
git clone https://github.com/fgmacedo/raspador.git
cd raspador
python setup.py install

Você pode instalar com pip::

pip install ordereddict
Dependencies
------------

Testes
======
There are no external dependencies.

Os testes dependem de algumas bibliotecas externas:

.. code-block:: text
.. note:: Python 2.6

coverage==3.6
nose==1.3.0
flake8==2.0
invoke==0.5.0
With Python 2.6, you must install `ordereddict
<https://pypi.python.org/pypi/ordereddict/>`_.

You can install it with pip::

Você pode executar os testes com ``nosetests``:
pip install ordereddict

.. code-block:: bash
Tests
======

$ nosetests
To automate tests with all supported Python versions at once, we use `tox
<http://tox.readthedocs.org/en/latest/>`_.

E adicionalmente, verificar a compatibilidade com o PEP8:
Run all tests with:

.. code-block:: bash
$ flake8 raspador testes
$ tox
Ou por conveniência, executar os dois em sequência com invoke:
Tests depend on several third-party libraries, but tox installs these in
each Python version's virtualenv:

.. code-block:: bash
.. code-block:: text
$ invoke test
nose==1.3.0
coverage==3.6
flake8==2.0
Exemplos
Examples
========

Extrator de dados em logs
-------------------------
Extract data from logs
----------------------

.. code-block:: python
from __future__ import print_function
import json
from raspador import Analizador, CampoString
from raspador import Parser, StringField
out = """
PART:/dev/sda1 UUID:423k34-3423lk423-sdfsd-43 TYPE:ext4
@@ -107,22 +116,23 @@ Extrator de dados em logs
"""
class AnalizadorDeLog(Analizador):
inicio = r'^PART.*'
fim = r'^PART.*'
PART = CampoString(r'PART:([^\s]+)')
UUID = CampoString(r'UUID:([^\s]+)')
TYPE = CampoString(r'TYPE:([^\s]+)')
class LogParser(Parser):
begin = r'^PART.*'
end = r'^PART.*'
PART = StringField(r'PART:([^\s]+)')
UUID = StringField(r'UUID:([^\s]+)')
TYPE = StringField(r'TYPE:([^\s]+)')
a = AnalizadorDeLog()
a = LogParser()
# res é um gerador
res = a.analizar(linha for linha in out.splitlines())
# res is a generator
res = a.parse(iter(out.splitlines()))
print (json.dumps(list(res), indent=2))
out_as_json = json.dumps(list(res), indent=2)
print (out_as_json)
# Saída:
# Output:
"""
[
{
22 changes: 11 additions & 11 deletions docs/source/index.rst
@@ -1,17 +1,17 @@
Documentação do raspador
========================
.. _topics-index:

================================
Raspador |version| documentation
================================

Conteúdo:

.. toctree::
:maxdepth: 2

raspador
.. toctree::
:hidden:

intro/overview
intro/install
intro/tutorial

Índices e tabelas
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
Looking for specific information? Try the :ref:`genindex` or :ref:`modindex`.
28 changes: 28 additions & 0 deletions docs/source/intro/install.rst
@@ -0,0 +1,28 @@

*******
Install
*******


Package managers
================

You can install using pip or easy_install.

PIP::

pip install raspador

Easy install::

easy_install raspador


From source
===========

Download and install from source::

git clone https://github.com/fgmacedo/raspador.git
cd raspador
python setup.py install
13 changes: 7 additions & 6 deletions docs/source/raspador.rst
@@ -1,30 +1,31 @@

========
raspador
========

The raspador module provides a generic framework for extracting data from
semi-structured text files.


Analizador
Parser
----------

.. automodule:: raspador.analizador
.. automodule:: raspador.parser
:members:


Fields
------

.. automodule:: raspador.campos
.. automodule:: raspador.fields
:members:
:undoc-members:


Coleções
--------
Item
----

.. automodule:: raspador.colecoes
.. automodule:: raspador.item
:members:
:undoc-members:

9 changes: 5 additions & 4 deletions raspador/__init__.py
@@ -1,9 +1,10 @@
# flake8: noqa

from .analizador import Analizador, Dicionario
from .campos import CampoBase, CampoString, CampoNumerico, \
CampoInteiro, CampoData, CampoDataHora, CampoBooleano
from .parser import Parser
from .item import Dictionary
from .fields import BaseField, StringField, FloatField, BRFloatField, \
IntegerField, DateField, DateTimeField, BooleanField

from .decoradores import ProxyDeCampo, ProxyConcatenaAteRE
from .decorators import FieldProxy, UnionUntilRegexProxy

from .cache import Cache