The OFRecord Data Format

2021-10-19 11:13:07

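Deep learning applications need complex, multi-stage data preprocessing pipelines, and loading the data is the first stage. OneFlow can load data in several formats; OFRecord is its native one.

The OFRecord format is modeled on TensorFlow's TFRecord, so users familiar with TFRecord should pick up OFRecord quickly.

This article covers:

• the data types used by OFRecord

• how to convert data into OFRecord objects and serialize them

• the OFRecord file format

This background is useful when working through Loading and Preparing OFRecord Datasets.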

OFRecord data types

Internally, OneFlow uses Protocol Buffers to describe the serialization format of OFRecord. The .proto file is oneflow/core/record/record.proto, and it is defined as follows:

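syntax = "proto2";
package oneflow;

message BytesList {
  repeated bytes value = 1;
}

message FloatList {
  repeated float value = 1 [packed = true];
}

message DoubleList {
  repeated double value = 1 [packed = true];
}

message Int32List {
  repeated int32 value = 1 [packed = true];
}

message Int64List {
  repeated int64 value = 1 [packed = true];
}

message Feature {
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    DoubleList double_list = 3;
    Int32List int32_list = 4;
    Int64List int64_list = 5;
  }
}

message OFRecord {
  map<string, Feature> feature = 1;
}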

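The key types are:

• OFRecord: an OFRecord instance holds all the data that needs to be serialized. It consists of any number of string->Feature key-value pairs;

• Feature: a Feature holds exactly one of BytesList, FloatList, DoubleList, Int32List, or Int64List;

• For OFRecord, Feature, and the ***List types, Protocol Buffers generates interfaces of the same names, so the corresponding objects can be constructed from Python.

Converting data to Feature

Data can be converted to Feature by calling ofrecord.***List and ofrecord.Feature. For convenience, the interfaces generated by Protocol Buffers are wrapped in a few simple helpers: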

import six
import oneflow.core.record.record_pb2 as ofrecord

# Each helper accepts a scalar or a list/tuple and returns a Feature.

def int32_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(int32_list=ofrecord.Int32List(value=value))

def int64_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(int64_list=ofrecord.Int64List(value=value))

def float_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(float_list=ofrecord.FloatList(value=value))

def double_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(double_list=ofrecord.DoubleList(value=value))

def bytes_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    # On Python 3, str items must be encoded to bytes first.
    if not six.PY2:
        if isinstance(value[0], str):
            value = [x.encode() for x in value]
    return ofrecord.Feature(bytes_list=ofrecord.BytesList(value=value))
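Every wrapper above begins by normalizing its argument into a list, and bytes_feature additionally encodes strings. As a minimal sketch of just that normalization step, runnable without OneFlow, consider the two helpers below. The names as_list and as_bytes_list are hypothetical and not part of OneFlow; unlike bytes_feature, which only inspects the first element, this sketch checks every element.

```python
def as_list(value):
    """Wrap a scalar in a list; turn a list or tuple into a plain list."""
    if not isinstance(value, (list, tuple)):
        return [value]
    return list(value)


def as_bytes_list(value):
    """Like as_list, but also encode any str item to UTF-8 bytes."""
    return [x.encode() if isinstance(x, str) else x for x in as_list(value)]


print(as_list(7))
print(as_list((1, 2, 3)))
print(as_bytes_list("cat"))
```

Normalizing up front is what lets callers pass either a single label or a whole pixel list to the same helper.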

Creating an OFRecord object and serializing it

The following example creates an OFRecord object with two features and serializes it by calling its SerializeToString method.

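import random

observations = 28 * 28
f = open("./dataset/part-0", "wb")

for loop in range(0, 3):
    image = [random.random() for x in range(0, observations)]
    label = [random.randint(0, 9)]

    # Two string->Feature pairs; the key names here are illustrative.
    topack = {
        "images": float_feature(image),
        "labels": int64_feature(label),
    }
    ofrecord_features = ofrecord.OFRecord(feature=topack)
    serilizedbytes = ofrecord_features.SerializeToString()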

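The example above suggests the following steps for serializing data:

• Convert the data to be serialized into Feature objects by calling ofrecord.Feature and ofrecord.***List;

• Put the resulting Feature objects into a Python dictionary as string->Feature key-value pairs;

• Call ofrecord.OFRecord to create the OFRecord object;

• Call the SerializeToString method of the OFRecord object to obtain the serialized result.

The serialized result can then be stored in a file in the OFRecord format.

OFRecord files

Serializing OFRecord objects and storing them in the layout OneFlow prescribes produces an OFRecord file.

A single OFRecord file can hold multiple OFRecord objects. OFRecord files can be fed to the OneFlow data pipeline; see Loading and Preparing OFRecord Datasets for details.

OneFlow stores each OFRecord object in the following layout: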

uint64 length

byte data[length]

That is, the first 8 bytes hold the length of the data, followed by the serialized data itself.

import struct

length = ofrecord_features.ByteSize()
f.write(struct.pack("q", length))  # 8-byte length header
f.write(serilizedbytes)            # serialized OFRecord
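To make the layout concrete without depending on OneFlow, here is a minimal sketch that frames arbitrary byte strings in the same way. The function name frame_record is hypothetical, the payloads merely stand in for real SerializeToString() output, and the "q" format (an 8-byte integer in native byte order) is carried over from the snippet above.

```python
import struct


def frame_record(payload: bytes) -> bytes:
    """Prefix a serialized record with its length as an 8-byte integer."""
    # "q" mirrors the snippet above: 8 bytes, native byte order.
    return struct.pack("q", len(payload)) + payload


# Stand-in payloads; a real file would hold serialized OFRecord objects.
payloads = [b"first record", b"", b"\x00\x01\x02"]
blob = b"".join(frame_record(p) for p in payloads)

# Every record costs 8 header bytes plus the size of its payload.
print(len(blob) == sum(8 + len(p) for p in payloads))
```

Because each record carries its own length, records of different sizes can be concatenated in one file and still be separated again when reading.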

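The complete code below shows how to generate an OFRecord file and how to read its data back by hand with the OFRecord interface generated by protobuf.

In practice, OneFlow provides interfaces such as flow.data.decode_ofrecord that make it easier to extract the contents of OFRecord files (datasets). See Loading and Preparing OFRecord Datasets for details.

Writing OFRecord objects to a file

The following script simulates three samples, each a 28*28 image with a corresponding label. The three samples are converted to OFRecord objects and written to a file in the layout OneFlow prescribes.

Code: ofrecord_to_string.py

Reading data from an OFRecord file

The following script reads the OFRecord file generated in the previous example, deserializes each record into an OFRecord object by calling the FromString method, and finally prints the data:

Code: ofrecord_from_string.py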
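As a rough sketch of the reading side, the generator below walks a file in the length-prefixed layout described earlier and yields each record. The names iter_records and parse are hypothetical; with OneFlow installed one would presumably pass ofrecord.OFRecord.FromString as parse, whereas the demo uses plain bytes so that it runs without OneFlow.

```python
import os
import struct
import tempfile

HEADER = struct.Struct("q")  # 8-byte length, native byte order, as in the writer


def iter_records(path, parse=lambda raw: raw):
    """Yield parse(raw) for every length-prefixed record stored in path."""
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if not header:
                return  # clean end of file
            if len(header) < HEADER.size:
                raise ValueError("truncated length header")
            (length,) = HEADER.unpack(header)
            if length < 0:
                raise ValueError("negative record length")
            raw = f.read(length)
            if len(raw) < length:
                raise ValueError("truncated record body")
            yield parse(raw)


# Demo: write three stand-in payloads, then read them back.
payloads = [b"sample-0", b"sample-1", b"sample-2"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "part-0")
    with open(path, "wb") as f:
        for p in payloads:
            f.write(HEADER.pack(len(p)) + p)
    restored = list(iter_records(path))

print(restored == payloads)
```

Reading the header first and then exactly that many bytes keeps memory use proportional to one record rather than to the whole file.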
