HBase Stargate (REST API) client wrapper for Python.
Read the official documentation of the Stargate.
Deprecation warning!
This package is no longer supported. Either maintain your own fork or switch to an alternative.
starbase is (at the moment) a client implementation of the Apache HBase REST API (Stargate).
Beware that the REST API is slow (this library is not to blame!). If you can operate on HBase directly, you are better off doing so.
You need to have Hadoop, HBase, Thrift and Stargate running. If you want to make it easy for yourself, read my instructions on installing Cloudera manager (free) on Ubuntu 12.04 LTS here or there.
Once you have everything installed and running (by default Stargate runs on 127.0.0.1:8000), you should be able to run src/starbase/client/test.py without problems (UnitTest).
- 2.6.8 and up
- 2.7
- 3.3
The project is still in development, so not all features of the API are available.
- Connect to Stargate.
- Show software version.
- Show cluster version.
- Show cluster status.
- List tables.
- Retrieve table schema.
- Retrieve table meta data.
- Get a list of tables' column families.
- Create a table.
- Delete a table.
- Alter table schema.
- Insert (PUT) data into a single row (single or multiple columns).
- Update (POST) data of a single row (single or multiple columns).
- Select (GET) a single row from table, optionally with selected columns only.
- Delete (DELETE) a single row by id.
- Batch insert (PUT).
- Batch update (POST).
- Basic HTTP auth works. You can provide a login and a password to the connection.
- Retrieve all rows in a table (table scanning).
- Table scanning.
- Syntax globbing.
Install latest stable version from PyPI.
$ pip install starbase
Or install the latest stable version from GitHub.
$ pip install -e git+https://github.com/barseghyanartur/starbase@stable#egg=starbase
Working with the API starts with creating a connection instance.
from starbase import Connection
Defaults to 127.0.0.1:8000. Specify the host and port arguments when creating a connection instance if your settings are different.
c = Connection()
With customisations, it would look similar to the following.
c = Connection(host='192.168.88.22', port=8001)
Assuming that there are two existing tables named table1 and table2, the following would be printed out.
c.tables()
Output.
['table1', 'table2']
Whenever you need to operate with a table (also, if you need to create one), you need to have a table instance created.
Create a table instance (note that no table is created at this step).
t = c.table('table3')
Assuming that no table named table3 exists in the database yet, create it with the column families column1, column2 and column3 (this is the point where the table is actually created). Column families (columns, for short) are declared in the table schema.
t.create('column1', 'column2', 'column3')
Output.
201
t.exists()
Output.
True
t.columns()
Output.
['column1', 'column2', 'column3']
Add the given columns (column4, column5, column6, column7).
t.add_columns('column4', 'column5', 'column6', 'column7')
Output.
200
Drop the given columns (column6, column7).
t.drop_columns('column6', 'column7')
Output.
201
t.drop()
Output.
200
HBase is a key/value store. In HBase, columns (also called column families) are part of the declared table schema and have to be defined when a table is created. Columns have qualifiers, which are not declared in the table schema. The number of column qualifiers is not limited.
Within a single row, a value is mapped by a column family and a qualifier (in terms of the key/value store concept). A value may be anything castable to a string (JSON objects, data structures, XML, etc).
In the example below, key11, key12, key21, etc. are the qualifiers, while column1, column2 and column3 are column families.
Column families must be composed of printable characters. Qualifiers can be made of any arbitrary bytes.
Table rows are identified by row keys: unique identifiers (UID, or so-called primary key). In the example below, my-key-1 is the row key (UID).
To recap everything said above, HBase maps (row key, column family, column qualifier and timestamp) to a value.
t.insert(
    'my-key-1',
    {
        'column1': {'key11': 'value 11', 'key12': 'value 12',
                    'key13': 'value 13'},
        'column2': {'key21': 'value 21', 'key22': 'value 22'},
        'column3': {'key31': 'value 31', 'key32': 'value 32'}
    }
)
Output.
200
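The mapping described above, (row key, column family, column qualifier, timestamp) to a value, can be pictured with a plain dictionary. This is only a minimal sketch of the data model: the cells dictionary and the put and get_latest helpers are hypothetical names invented for illustration and are not part of starbase or HBase.

```python
import time

# Hypothetical in-memory stand-in for an HBase table: every cell version is
# addressed by (row key, column family, qualifier, timestamp).
cells = {}


def put(row_key, family, qualifier, value, timestamp=None):
    """Store one cell version; the timestamp defaults to the current time in ms."""
    if timestamp is None:
        timestamp = int(time.time() * 1000)
    cells[(row_key, family, qualifier, timestamp)] = str(value)


def get_latest(row_key, family, qualifier):
    """Return the value with the highest timestamp for a cell, or None."""
    versions = [
        (ts, value)
        for (r, f, q, ts), value in cells.items()
        if (r, f, q) == (row_key, family, qualifier)
    ]
    return max(versions)[1] if versions else None


put('my-key-1', 'column1', 'key11', 'value 11', timestamp=1)
put('my-key-1', 'column1', 'key11', 'value 11 (newer)', timestamp=2)
latest = get_latest('my-key-1', 'column1', 'key11')
```

Both versions of the cell stay in the dictionary; reading the cell picks the one with the newest timestamp.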
Note that you may also use the native way of naming the columns and cells (qualifiers). The result of the following is the same as the result of the previous example.
t.insert(
    'my-key-1',
    {
        'column1:key11': 'value 11', 'column1:key12': 'value 12',
        'column1:key13': 'value 13',
        'column2:key21': 'value 21', 'column2:key22': 'value 22',
        'column3:key31': 'value 31', 'column3:key32': 'value 32'
    }
)
Output.
200
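The two insert forms above carry the same information. As a rough sketch of how the nested form relates to the native "family:qualifier" form, the helper below flattens one into the other. The to_native function is a hypothetical name used for illustration only; it is not a starbase function and does not show how the library converts data internally.

```python
def to_native(nested):
    """Flatten {family: {qualifier: value}} into {'family:qualifier': value}."""
    flat = {}
    for family, qualifiers in nested.items():
        for qualifier, value in qualifiers.items():
            flat['%s:%s' % (family, qualifier)] = value
    return flat


nested = {
    'column1': {'key11': 'value 11', 'key12': 'value 12'},
    'column2': {'key21': 'value 21'},
}
flat = to_native(nested)
```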
t.update(
'my-key-1',
{'column4': {'key41': 'value 41', 'key42': 'value 42'}}
)
Output.
200
Remove the data of a row cell (qualifier). In the example below, my-key-1 is the table row UID, column4 is the column family and key41 is the qualifier. Note that only the qualifier data (for the given row) is removed. All other qualifiers of the column column4 remain untouched.
t.remove('my-key-1', 'column4', 'key41')
Output.
200
Remove the data of a row column (column family). Note that in this case the entire column data (the data of all qualifiers for the given row) is removed.
t.remove('my-key-1', 'column4')
Output.
200
Remove the data of an entire row. Note that in this case the entire row data, along with all columns and qualifiers for the given row, is removed.
t.remove('my-key-1')
Output.
200
Fetch the data of a single row with all columns and qualifiers.
t.fetch('my-key-1')
Output.
{
    'column1': {'key11': 'value 11', 'key12': 'value 12', 'key13': 'value 13'},
    'column2': {'key21': 'value 21', 'key22': 'value 22'},
    'column3': {'key31': 'value 31', 'key32': 'value 32'}
}
Fetch the data of a single row with selected columns (limit to the column1 and column2 columns and all their qualifiers).
t.fetch('my-key-1', ['column1', 'column2'])
Output.
{
'column1': {'key11': 'value 11', 'key12': 'value 12', 'key13': 'value 13'},
'column2': {'key21': 'value 21', 'key22': 'value 22'},
}
Narrow the result set even more (limit to qualifiers key11 and key13 of column column1 and qualifier key32 of column column3).
t.fetch('my-key-1', {'column1': ['key11', 'key13'], 'column3': ['key32']})
Output.
{
'column1': {'key11': 'value 11', 'key13': 'value 13'},
'column3': {'key32': 'value 32'}
}
Note that you may also use the native way of naming the columns and cells (qualifiers). The example below does exactly the same as the example above.
t.fetch('my-key-1', ['column1:key11', 'column1:key13', 'column3:key32'])
Output.
{
'column1': {'key11': 'value 11', 'key13': 'value 13'},
'column3': {'key32': 'value 32'}
}
If you set the perfect_dict argument to False, you'll get the native data structure.
t.fetch(
'my-key-1',
['column1:key11', 'column1:key13', 'column3:key32'],
perfect_dict=False
)
Output.
{
'column1:key11': 'value 11',
'column1:key13': 'value 13',
'column3:key32': 'value 32'
}
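The native structure shown above and the nested structure returned by default hold the same cells. A minimal sketch of the relation between the two, assuming every native key has the form "family:qualifier": the to_nested helper is a hypothetical name for illustration and is not the code starbase runs when perfect_dict is True.

```python
def to_nested(flat):
    """Group {'family:qualifier': value} into {family: {qualifier: value}}."""
    nested = {}
    for name, value in flat.items():
        # Split on the first colon only, so a qualifier may itself contain ':'.
        family, _, qualifier = name.partition(':')
        nested.setdefault(family, {})[qualifier] = value
    return nested


native = {
    'column1:key11': 'value 11',
    'column1:key13': 'value 13',
    'column3:key32': 'value 32',
}
nested = to_nested(native)
```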
Batch operations (insert and update) work similarly to the normal insert and update, but are done in a batch. You are advised to operate in batches as much as possible.
In the example below, we will insert 5000 records in a batch.
data = {
'column1': {'key11': 'value 11', 'key12': 'value 12', 'key13': 'value 13'},
'column2': {'key21': 'value 21', 'key22': 'value 22'},
}
b = t.batch()
if b:
    for i in range(0, 5000):
        b.insert('my-key-%s' % i, data)
    b.commit(finalize=True)
Output.
{'method': 'PUT', 'response': [200], 'url': 'table3/bXkta2V5LTA='}
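The last path segment of the url in the output above is a Base64 string: decoding bXkta2V5LTA= gives my-key-0, the first row key of the batch. The snippet below reproduces that encoding with the standard library. The encode_row_key helper is a hypothetical name used for illustration and is not a starbase function.

```python
import base64


def encode_row_key(row_key):
    """Return the Base64 text form of a row key given as a str."""
    return base64.b64encode(row_key.encode('utf-8')).decode('ascii')


encoded = encode_row_key('my-key-0')
decoded = base64.b64decode('bXkta2V5LTA=').decode('utf-8')
```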
In the example below, we will update 5000 records in a batch.
data = {
'column3': {'key31': 'value 31', 'key32': 'value 32'},
}
b = t.batch()
if b:
    for i in range(0, 5000):
        b.update('my-key-%s' % i, data)
    b.commit(finalize=True)
Output.
{'method': 'POST', 'response': [200], 'url': 'table3/bXkta2V5LTA='}
Note: The table batch method accepts an optional size argument (int). If set, an auto-commit is fired each time the stack is full.
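The idea of a size-limited stack with auto-commit can be sketched as follows. BufferedBatch is a hypothetical class written for illustration; it only records what would be sent and is not the starbase Batch implementation, whose internals are not described here.

```python
class BufferedBatch(object):
    """Hypothetical illustration of a batch that commits when its stack is full."""

    def __init__(self, size=None):
        self.size = size    # maximum number of stacked rows, None means unlimited
        self.stack = []     # rows waiting to be sent
        self.commits = []   # each entry is the list of rows sent in one commit

    def insert(self, row_key, data):
        self.stack.append((row_key, data))
        # Auto-commit as soon as the stack reaches the configured size.
        if self.size is not None and len(self.stack) >= self.size:
            self.commit()

    def commit(self):
        if self.stack:
            self.commits.append(self.stack)
            self.stack = []


b = BufferedBatch(size=2)
for i in range(5):
    b.insert('my-key-%s' % i, {'column1': {'key11': 'value 11'}})
b.commit()  # flush the rows that did not fill a whole stack
commit_sizes = [len(rows) for rows in b.commits]
```

With a size of 2 and five rows, two automatic commits of two rows each are followed by a final manual commit of the remaining row.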
Table scanning is in development, so the scanning API is likely to change. The result set returned is a generator.
t.fetch_all_rows()
Output.
<generator object results at 0x28e9190>
rf = '{"type": "RowFilter", "op": "EQUAL", "comparator": {"type": "RegexStringComparator", "value": "^row_1.+"}}'
t.fetch_all_rows(with_row_id=True, filter_string=rf)
Output.
<generator object results at 0x28e9190>
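The filter string above is a JSON document, so it can also be built from a dictionary instead of being typed by hand, which avoids quoting mistakes. A minimal sketch: row_regex_filter is a hypothetical helper name, and only the field names that already appear in the example above are used.

```python
import json


def row_regex_filter(pattern):
    """Build a RowFilter JSON string that matches row keys against a regex."""
    return json.dumps({
        'type': 'RowFilter',
        'op': 'EQUAL',
        'comparator': {'type': 'RegexStringComparator', 'value': pattern},
    })


rf = row_regex_filter('^row_1.+')
```

The resulting string can be passed as the filter_string argument in the same way as the hand-written one.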
By default, before the fetch, insert, update and remove methods (table row operations) are executed, it is checked whether the table exists. That is safe, but comes at the cost of an extra (though light) HTTP request. If you are absolutely sure you want to avoid those checks, you can disable them. The check for each type of row operation can be disabled by setting the corresponding property of the table instance to False: check_if_exists_on_row_fetch, check_if_exists_on_row_insert, check_if_exists_on_row_remove and check_if_exists_on_row_update.
t.check_if_exists_on_row_fetch = False
t.fetch('row1')
It's also possible to disable them all at once by calling the disable_row_operation_if_exists_checks method of the table instance.
t.disable_row_operation_if_exists_checks()
t.remove('row1')
The same goes for table scanner operations. Setting the check_if_exists_on_scanner_operations property of a table instance to False skips the checks for scanner operations.
t.check_if_exists_on_scanner_operations = False
t.fetch_all_rows(flat=True)
Methods that accept the fail_silently argument are listed per class below.
starbase.client.connection.Connection
- cluster_version
- cluster_status
- drop_table
- tables
- table_exists
- version
starbase.client.table.Table
- add_columns
- batch
- create
- drop
- drop_columns
- exists
- insert
- fetch
- fetch_all_rows
- regions
- remove
- schema
- update
starbase.client.table.Batch
- commit
- insert
- update
Class starbase.client.table.Batch accepts fail_silently as a constructor argument.
print(connection.version)
Output.
{u'JVM': u'Sun Microsystems Inc. 1.6.0_43-20.14-b01',
u'Jersey': u'1.8',
u'OS': u'Linux 3.5.0-30-generic amd64',
u'REST': u'0.0.2',
u'Server': u'jetty/6.1.26'}
print(connection.cluster_version)
Output.
u'0.94.7'
print(connection.cluster_status)
Output.
{u'DeadNodes': [],
u'LiveNodes': [{u'Region': [{u'currentCompactedKVs': 0,
...
u'regions': 3,
u'requests': 0}
print(table.schema())
Output.
{u'ColumnSchema': [{u'BLOCKCACHE': u'true',
u'BLOCKSIZE': u'65536',
...
u'IS_ROOT': u'false',
u'name': u'messages'}
print(table.regions())
By default, the number of retries for a failed request is zero, which means that a failed request is not repeated. It is possible to retry a failed request (for instance, in case of timeouts).
To do that, the starbase.client.connection.Connection class accepts two additional arguments:
- retries (int)
- retry_delay (int)
c = Connection(
    retries=3,  # Retry 3 times
    retry_delay=5  # Wait for 5 seconds between retries
)
Beware! Retries can cause performance issues (lower responsiveness) in your application. At the moment, all failed requests, such as the deletion of a non-existing column, row or table, are handled in the same way and would all cause a retry. This will likely change in the future (smarter detection of failures that are worth retrying).
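The general meaning of the retries and retry_delay settings can be illustrated with a small generic wrapper. This is a sketch under assumptions, not the starbase implementation: call_with_retries and flaky_request are hypothetical names, and a failure is modelled here as a raised exception, which may differ from how the library detects a failed request.

```python
import time


def call_with_retries(request, retries=0, retry_delay=0):
    """Call request(); on failure repeat it up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return request()
        except Exception:
            if attempt >= retries:
                raise  # no retries left, pass the failure on to the caller
            attempt += 1
            time.sleep(retry_delay)  # wait before the next attempt


calls = []


def flaky_request():
    """Simulated request that fails twice and then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise IOError('simulated timeout')
    return 200


status = call_with_retries(flaky_request, retries=3, retry_delay=0)
```

With retries set to zero the wrapper makes exactly one attempt, which mirrors the default described above. Every retry adds up to retry_delay seconds of waiting, which is where the responsiveness cost comes from.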
GPL 2.0/LGPL 2.1
For any issues contact me at the e-mail given in the Author section.
Artur Barseghyan <artur.barseghyan@gmail.com>