Data Engine Design
Source: Notion | Last edited: 2022-12-20 | ID: ada4259d-203...
Provide researchers with high-quality, broad-coverage data that can be downloaded directly, eliminating the cost and time of repeated data collection and cleaning during model testing and feature development. Standardized column names, timestamp formats, and similar conventions also cut the time spent learning and processing new data.
Data Collection Engine (DCE)
- Give users a convenient way to download all the data they may need. The data is stored in the database under our agreed conventions and periodically pushed to S3, so the Research Team can download it directly.
- Establish a standard process for adding data, so new data sources can be added easily.
- (Optional) In the future this could become a web app, so data can be downloaded without logging into AWS, further simplifying the workflow and making access control easier.
- The downloadable data should include:
    - Price and volume (OHLCV) data, both futures and spot, from the exchanges we care about
    - Enigma's historical PnL data
    - Sentiment data: Google Trends, the subset of SanAPI data currently in use, etc.
    - …
Jira Reference Link
- The Data Collection Engine (DCE) will provide:
    - The initial version of DCE will store all the following data in S3, and each researcher needs an AWS account to access the data:
1. Metadata of DynamoDB tables, such as table name, description, column names, data range, and a few rows of sample data. The metadata explains the data's structure and usage, e.g. the Google Trends metadata is stored in eonlabs-data-engineering-service/google-trends/metadata.md.
2. The actual data objects are saved in CSV format in S3. Data should generally be categorized by partitionKey (pk); if a file is too large, chunk it into smaller pieces by sortKey and store them in their designated folder. How the data is chunked should be described in its metadata.json file, e.g. the Google Trends data may be chunked into eonlabs-data-engineering-service/google-trends/{datetime}/bitcoin_monthly.csv, ethereum_hourly.csv.
3. The following DynamoDB data will need to be converted to S3 objects:
    - GoogleTrends (s3 path - data-service/google-trends). It can be saved based on partition key, e.g. bitcoin_monthly.csv, crypto_weekly.csv, ethereum_hourly.csv
    - Klines (s3 path - data-service/ohlcv). It can be saved based on partition key, with the name changed:
        - future data: binance_usds_future-BTC_USDT-1m.csv (pk=binance_usds_future-BTC/USDT-1m)
        - spot data: binance-BTC_USDT-1m.csv (pk=binance-BTC/USDT-1m), binance-BTC_USDT-5m.csv (pk=binance-BTC/USDT-5m), binance-ETH_USDT-15m.csv (pk=binance-ETH/USDT-15m), etc.
    - SanAPI (s3 path - data-service/santiment). It can be saved based on partition key, with the name changed.
    - Model-Backtest-KPI (s3 path - data-service/enigma-pnl). It can be saved based on partition key, with the name changed, e.g. BTC_12h.csv (partitionKey=BTC_48_15m), BTC_1h.csv (partitionKey=BTC_4_15m), DOGE_12h.csv (partitionKey=DOGE_48_15m)
    - Binance tick-by-tick and snapshot orderbook data (s3 path - data-service/orderbook). We can use the same file structure as the original data.
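The partition-key-to-file-name convention above (the `/` in the trading pair is replaced so each key maps to one flat CSV object) can be sketched as a small helper. The function name and default prefix here are illustrative, not part of the actual service:

```python
def s3_key_for_pk(pk: str, prefix: str = "data-service/ohlcv") -> str:
    """Map a DynamoDB partition key to its flat CSV object key.

    Follows the convention above: '/' in the partition key is replaced
    with '_' so the key becomes a single file name under the prefix.
    (Helper name and default prefix are illustrative.)
    """
    return f"{prefix}/{pk.replace('/', '_')}.csv"


# e.g. pk=binance-BTC/USDT-1m -> data-service/ohlcv/binance-BTC_USDT-1m.csv
```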
- A future version of DCE will provide the following extra functionality:
1. Web app access. Data users will no longer need AWS accounts to access S3 buckets. Instead, a data service web app will be created to serve the data (possibly combined with el-admin):
- The webapp should display available s3 objects.
- The webapp should display metadata (data description) from s3 metadata.json objects.
- The webapp should allow users to download s3 objects.
    - The webapp should allow admins to control user access.
- The webapp should allow users to upload data to s3 buckets.
- Infrastructure: the following stories may be created:
    - Create S3 Bucket, Folders, and Metadata.csv of DynamoDB Tables.
    - Create a new Serverless repo for DCE.
    - Create a Lambda to save DynamoDB table items to S3 in the DCE Serverless repo.
    - Create an IAM policy to grant user access to the designated S3 folders (data engineering service policy).
    - Fetch both tick-by-tick and snapshot orderbook data from Binance, upload it to S3, and create metadata or a supporting doc to explain the data.
    - Create a Data Engineering Service README page in Notion to introduce the steps to fetch data.
    - Migrate existing data-related Lambdas from the eonlabs-serverless repo to the DCE serverless repo.
    - The DCE may evolve (e.g. adding a table, or adding data into existing tables) via the web app: dynamic forms would let users define a table's metadata, upload data (a JSON file) to S3, and trigger the s3_to_dynamo Lambda data converter. This functionality is feasible in a future version of DCE.
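As a rough sketch of the "save DynamoDB table items to S3" story above: the Lambda could scan the table, serialize the items to CSV, and upload the result. The table name, bucket, object key, and column list below are placeholders (not the real resource names), and the boto3 calls assume the execution role grants `dynamodb:Scan` and `s3:PutObject`:

```python
import csv
import io


def items_to_csv(items, columns):
    """Serialize DynamoDB items (a list of dicts) into one CSV string.

    Attributes not listed in `columns` are silently dropped.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buf.getvalue()


def handler(event, context):
    """Scan a whole table and upload it as a single CSV object.

    Resource names below are illustrative placeholders.
    """
    import boto3  # imported here so the pure helper above has no AWS dependency

    table = boto3.resource("dynamodb").Table("GoogleTrends")
    resp = table.scan()
    items = resp["Items"]
    while "LastEvaluatedKey" in resp:  # scan() pages at 1 MB; follow the cursor
        resp = table.scan(ExclusiveStartKey=resp["LastEvaluatedKey"])
        items.extend(resp["Items"])

    boto3.client("s3").put_object(
        Bucket="eonlabs-data-engineering-service",        # placeholder bucket
        Key="data-service/google-trends/export.csv",      # placeholder key
        Body=items_to_csv(items, columns=["pk", "sk", "value"]).encode("utf-8"),
    )
```

A real version would likely chunk the output by sortKey per the convention above rather than writing one object.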
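For the orderbook story, depth snapshots are available from Binance's public spot REST endpoint `GET /api/v3/depth` (tick-by-tick trades would come from the websocket streams instead). A minimal URL builder, with the default `limit` chosen only for illustration:

```python
def depth_snapshot_url(symbol: str, limit: int = 1000) -> str:
    """Build the public Binance REST URL for an orderbook depth snapshot.

    /api/v3/depth is Binance's documented spot endpoint; this helper is
    just a sketch of where a snapshot-fetch job might start.
    """
    return f"https://api.binance.com/api/v3/depth?symbol={symbol}&limit={limit}"
```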
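For the future s3_to_dynamo trigger described above, the converter Lambda would begin by extracting bucket/key pairs from AWS's standard S3 event notification payload before reading the uploaded JSON and writing items to DynamoDB. A minimal sketch of that first step:

```python
def parse_s3_event(event):
    """Extract (bucket, key) pairs from an S3 ObjectCreated notification.

    `event` follows AWS's standard S3 notification format: a "Records"
    list where each record carries s3.bucket.name and s3.object.key.
    """
    return [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event.get("Records", [])
    ]
```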