impyla
Python client for HiveServer2 implementations (e.g., Impala, Hive) for distributed query engines.
For higher-level Impala functionality, including a Pandas-like interface over distributed data sets, see the Ibis project.
Features
- HiveServer2 compliant; works with Impala and Hive, including nested data
- Fully DB API 2.0 (PEP 249)-compliant Python client (similar to sqlite or MySQL clients) supporting Python 2.6+ and Python 3.3+
- Works with Kerberos, LDAP, SSL
- SQLAlchemy connector
- Converter to pandas DataFrame, allowing easy integration into the Python data stack (including scikit-learn and matplotlib); but see the Ibis project for a richer experience
Dependencies
Required:
- Python 2.6+ or 3.3+
- six, bit_array
- thrift (on Python 2.x) or thriftpy (on Python 3.x)
For Hive and/or Kerberos support:
pip install thrift_sasl
pip install sasl
Optional:
- pandas for conversion to DataFrame objects; but see the Ibis project instead
- sqlalchemy for the SQLAlchemy engine
- pytest for running tests; unittest2 for testing on Python 2.6
Installation
Install the latest release (0.13.1) with pip:
pip install impyla
For the latest (dev) version, install directly from the repo:
pip install git+https://github.com/cloudera/impyla.git
or clone the repo:
git clone https://github.com/cloudera/impyla.git
cd impyla
python setup.py install
Running the tests
impyla uses the pytest toolchain, and depends on the following environment variables:
export IMPYLA_TEST_HOST=your.impalad.com
export IMPYLA_TEST_PORT=21050
export IMPYLA_TEST_AUTH_MECH=NOSASL
To run the maximal set of tests, run:
cd path/to/impyla
py.test --connect impyla
Leave out the --connect option to skip tests for DB API compliance.
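As a rough illustration, the environment variables above could be gathered into connection settings like this. This is a minimal sketch, not the test suite's actual code: the helper name impyla_test_settings and the fallback defaults (21050, NOSASL) are assumptions made for the example.

```python
import os


def impyla_test_settings(environ=os.environ):
    """Collect the IMPYLA_TEST_* variables into a dict of connection settings.

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    return {
        # Required: raises KeyError if the host is not set.
        "host": environ["IMPYLA_TEST_HOST"],
        # Environment values are strings, so the port is converted to int.
        "port": int(environ.get("IMPYLA_TEST_PORT", "21050")),
        "auth_mechanism": environ.get("IMPYLA_TEST_AUTH_MECH", "NOSASL"),
    }


if __name__ == "__main__":
    settings = impyla_test_settings({
        "IMPYLA_TEST_HOST": "your.impalad.com",
        "IMPYLA_TEST_PORT": "21050",
        "IMPYLA_TEST_AUTH_MECH": "NOSASL",
    })
    print(settings)
```

With a reachable server, the resulting dict could be passed on as keyword arguments, e.g. connect(**settings) using connect from impala.dbapi.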
Usage
Impyla implements the Python DB API v2.0 (PEP 249) database interface (refer to it for API details):
from impala.dbapi import connect
conn = connect(host='my.host.com', port=21050)
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print(cursor.description)  # prints the result set's schema
results = cursor.fetchall()
The Cursor object also exposes the iterator interface, which is buffered
(controlled by cursor.arraysize):
cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:
    process(row)
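To make the buffering idea concrete, here is a minimal sketch of batch-wise iteration written against the generic PEP 249 cursor interface (arraysize and fetchmany). It is not impyla's internal implementation. The helper iter_rows is hypothetical, and sqlite3 stands in for a HiveServer2 connection so the example runs without a cluster.

```python
import sqlite3


def iter_rows(cursor, batch_size=None):
    """Yield rows one at a time while fetching them in batches.

    Hypothetical helper: works with any PEP 249 cursor.
    """
    if batch_size is not None:
        cursor.arraysize = batch_size
    while True:
        # fetchmany returns an empty sequence once the result set is exhausted.
        batch = cursor.fetchmany(cursor.arraysize)
        if not batch:
            break
        for row in batch:
            yield row


# sqlite3 is used here only as a stand-in DB API 2.0 driver.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
cursor.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(i, "row%d" % i) for i in range(5)],
)
cursor.execute("SELECT id, name FROM mytable ORDER BY id")

# Five rows are fetched in batches of two, but consumed one by one.
rows = list(iter_rows(cursor, batch_size=2))
print(rows)
```

A larger batch size means fewer round trips per row at the cost of more memory held on the client at once.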
You can also get back a pandas DataFrame object:
from impala.util import as_pandas
df = as_pandas(cursor)
# carry df through scikit-learn, for example
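For intuition about what such a conversion involves, the sketch below turns a DB API result set into a column-oriented dict using only cursor.description and fetchall. It does not reproduce how as_pandas is implemented. The helper cursor_to_columns is hypothetical, and sqlite3 again stands in for an Impala connection.

```python
import sqlite3


def cursor_to_columns(cursor):
    """Return {column_name: [values, ...]} for the cursor's result set.

    Hypothetical helper: relies only on the PEP 249 interface.
    """
    # In PEP 249, the first item of each description entry is the column name.
    names = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    columns = {}
    for index, name in enumerate(names):
        columns[name] = [row[index] for row in rows]
    return columns


# sqlite3 is used here only as a stand-in DB API 2.0 driver.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO mytable VALUES (?, ?)", [(1, "a"), (2, "b")])
cur.execute("SELECT id, name FROM mytable ORDER BY id")

columns = cursor_to_columns(cur)
print(columns)
```

A dict of equal-length lists like this is one of the inputs pandas.DataFrame accepts, so pandas.DataFrame(columns) would produce a frame with one column per key.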