woc

To Use

python-woc is the Python interface to the World of Code (WoC) data. It supersedes the oscar.py project and is hundreds of times faster than invoking the lookup scripts via subprocess.

What mappings and objects are supported?

Note that python-woc does not support all data types in WoC. It has built-in readers for:

  • Tokyo Cabinet hash databases (woc.tch files)
  • Stacked Binary files (.bin files)

Gzipped files (.s/.gz, e.g. PYthruMaps/c2bPtaPkgOPY.0.gz) are not supported yet, because there is currently little benefit to manipulating them natively in Python. Instead, refer to the WoC tutorial: decompress them into a pipe and process them with command-line utilities.

Mappings below are supported by both woc.get_values and woc.objects:

['A2P', 'A2a', 'A2b', 'A2c', 'A2f', 'A2fb', 'P2A', 'P2a', 'P2c', 'P2p', 'a2A', 'a2P', 'a2b', 'a2c', 'a2f', 'a2p', 'b2P', 'b2c', 'b2f', 'b2fa', 'b2tac', 'bb2cf', 'c2P', 'c2b', 'c2cc', 'c2dat', 'c2f', 'c2fbb', 'c2h', 'c2p', 'c2pc', 'c2r', 'c2ta', 'f2a', 'f2b', 'f2c', 'obb2cf', 'p2P', 'p2a', 'p2c']

And objects:

['commit', 'tree', 'blob', 'tag']

If you are still unsure what the characters in the mapping names mean, check out the WoC Tutorial.
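As a rough mnemonic, each letter in a mapping name denotes an entity. The legend below is our reading of the tutorial's conventions (a = author, A = aliased author, b = blob, c = commit, f = file, p = project, P = deforked project), not an official python-woc API:

```python
# Hypothetical helper illustrating how mapping names are composed.
# The legend is our reading of the WoC tutorial, not part of python-woc;
# multi-letter suffixes (e.g. the 'tac' in b2tac) bundle several fields
# and are not handled by this simple letter-by-letter lookup.
LEGEND = {
    "a": "author",
    "A": "aliased author",
    "b": "blob",
    "c": "commit",
    "f": "file",
    "p": "project",
    "P": "deforked project",
}

def expand_map_name(name: str) -> str:
    """Turn a mapping name like 'c2p' into 'commit -> project'."""
    left, _, right = name.partition("2")
    return f"{LEGEND.get(left, left)} -> {LEGEND.get(right, right)}"

print(expand_map_name("c2p"))   # commit -> project
print(expand_map_name("a2A"))   # author -> aliased author
```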

Requirements

  • Linux with a GNU toolchain (only tested on x86_64, Ubuntu / CentOS)

  • Python 3.8 or later

Install python-woc

From PyPI

The latest version of python-woc is available on PyPI and can be installed using pip:

pip3 install python-woc

From Source

You can also install python-woc from source. First, clone the repository:

git clone https://github.com/ssc-oscar/python-woc.git
cd python-woc

We use poetry as the package manager. Set it up with:

python3 -m pip install poetry
python3 -m poetry install

[!TIP] On some UTK servers, installing poetry yields the following error: urllib3 v2 only supports OpenSSL 1.1.1+. A workaround is to run python3 -m pip install 'urllib3<2.0' before installing poetry.

Generate Profiles

[!NOTE] If you are on UTK/PKU WoC servers, you can skip this step. Profiles are already generated and available at /home/wocprofile.json or /etc/wocprofile.json.

One of the major improvements in python-woc is profiles. A profile tells the driver which versions of which maps are available, decoupling the driver from the folder structure of the data. This lets the driver work with multiple versions of WoC, on a different machine, or even in the cloud.

Profiles are generated with the woc.detect script. The script takes a list of directories, scans them for matching filenames, and generates a profile:

python3 woc.detect /path/to/woc/1 /path/to/woc/2 ... > wocprofile.json

By default, python-woc looks for the profile at wocprofile.json, ~/.wocprofile.json, /home/wocprofile.json, and /etc/wocprofile.json.
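The lookup order can be sketched as a simple first-match search (illustrative only, not python-woc's actual loader):

```python
import os

# Candidate locations, in the order python-woc checks them.
PROFILE_PATHS = [
    "wocprofile.json",
    os.path.expanduser("~/.wocprofile.json"),
    "/home/wocprofile.json",
    "/etc/wocprofile.json",
]

def find_profile(paths=PROFILE_PATHS):
    """Return the first existing profile path, or None if none exist."""
    for path in paths:
        if os.path.exists(path):
            return path
    return None
```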

Use CLI

python-woc's CLI is a drop-in replacement for the getValues and showCnt perl scripts. We expect existing scripts to work just as well with the following aliases:

alias getValues='python3 -m woc.get_values'
alias showCnt='python3 -m woc.show_content'

The usage is the same as the original scripts, and the output should be identical:

# echo some_key | python3 -m woc.get_values some_map
> echo e4af89166a17785c1d741b8b1d5775f3223f510f | showCnt commit 3
tree f1b66dcca490b5c4455af319bc961a34f69c72c2
parent c19ff598808b181f1ab2383ff0214520cb3ec659
author Audris Mockus <audris@utk.edu> 1410029988 -0400
committer Audris Mockus <audris@utk.edu> 1410029988 -0400

News for Sep 5

You may find more examples in the lookup repository. If you find any incompatibilities, please submit an issue report.

Use Python API

The Python API is designed to eliminate the overhead of invoking the perl scripts via subprocess. It is also more native to Python and provides a more intuitive interface.

With a wocprofile.json, you can create a WocMapsLocal object and access the maps in the file system:

>>> from woc.local import WocMapsLocal
>>> woc = WocMapsLocal()  # or use only the version R: woc = WocMapsLocal(version="R")
>>> woc.maps
{'p2c', 'a2b', 'c2ta', 'a2c', 'c2h', 'b2tac', 'a2p', 'a2f', 'c2pc', 'c2dat', 'b2c', 'P2p', 'P2c', 'c2b', 'f2b', 'b2f', 'c2p', 'P2A', 'b2fa', 'c2f', 'p2P', 'f2a', 'p2a', 'c2cc', 'f2c', 'c2r', 'b2P'}

To query the maps, you can use the get_values method:

>>> woc.get_values("b2fa", "05fe634ca4c8386349ac519f899145c75fff4169")
('1410029988', 'Audris Mockus <audris@utk.edu>', 'e4af89166a17785c1d741b8b1d5775f3223f510f')
>>> woc.get_values("c2b", "e4af89166a17785c1d741b8b1d5775f3223f510f")
['05fe634ca4c8386349ac519f899145c75fff4169']
>>> woc.get_values("b2tac", "05fe634ca4c8386349ac519f899145c75fff4169")
[('1410029988', 'Audris Mockus <audris@utk.edu>', 'e4af89166a17785c1d741b8b1d5775f3223f510f')]

Use show_content to get the content of a blob, a commit, or a tree:

>>> woc.show_content("tree", "f1b66dcca490b5c4455af319bc961a34f69c72c2")
[('100644', 'README.md', '05fe634ca4c8386349ac519f899145c75fff4169'), ('100644', 'course.pdf', 'dfcd0359bfb5140b096f69d5fad3c7066f101389')]
>>> woc.show_content("commit", "e4af89166a17785c1d741b8b1d5775f3223f510f")
('f1b66dcca490b5c4455af319bc961a34f69c72c2', ('c19ff598808b181f1ab2383ff0214520cb3ec659',), ('Audris Mockus <audris@utk.edu>', '1410029988', '-0400'), ('Audris Mockus <audris@utk.edu>', '1410029988', '-0400'), 'News for Sep 5')
>>> woc.show_content("blob", "05fe634ca4c8386349ac519f899145c75fff4169")
'# Syllabus for "Fundamentals of Digital Archeology"\n\n## News\n\n* ...'

Note that the function yields different types for different maps. Please refer to the documentation for details.
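For instance, the commit tuple shown above unpacks into tree, parents, author, committer, and message, and the author timestamp converts with the standard datetime module (the sample values are copied verbatim from the output above):

```python
from datetime import datetime, timezone

# Result of woc.show_content("commit", "e4af89166a17785c1d741b8b1d5775f3223f510f"),
# copied from the example above and unpacked into named parts.
tree, parents, author, committer, message = (
    'f1b66dcca490b5c4455af319bc961a34f69c72c2',
    ('c19ff598808b181f1ab2383ff0214520cb3ec659',),
    ('Audris Mockus <audris@utk.edu>', '1410029988', '-0400'),
    ('Audris Mockus <audris@utk.edu>', '1410029988', '-0400'),
    'News for Sep 5',
)

name, timestamp, tz_offset = author
when = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
print(name)               # Audris Mockus <audris@utk.edu>
print(when.isoformat())   # 2014-09-06T18:59:48+00:00
```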

Sometimes you may want to know the exact size of WoC; count makes that quick and easy:

>>> woc.count("blob")  # count the number of blobs
17334020520
>>> woc.count("A2P")  # count the number of unique authors
44613280

👉🏻 More examples can be found in the guide.

Use Python Objects API

The objects API provides a more intuitive way to access the WoC data. Note that although it looks much the same, the objects API is not a drop-in replacement for oscar.py: many of the methods have changed signatures and were refactored to be more consistent, intuitive, and performant. Query results are cached, so you can access the same object multiple times without additional overhead.

Call init_woc_objects to initialize the objects API with a WoC instance:

from woc.local import WocMapsLocal
from woc.objects import init_woc_objects
woc = WocMapsLocal()
init_woc_objects(woc)

To get the tree of a commit:

>>> from woc.objects import Commit
>>> c1 = Commit("91f4da4c173e41ffbf0d9ecbe2f07f3a3296933c")
>>> c1.tree
Tree(836f04d5b374033b1608269e2f3aaabae263a0db)
>>> c1.projects[0].url
'https://github.com/woc-hack/thebridge'

For more, check woc.objects in the documentation.

Remote Access

>>> from woc.remote import WocMapsRemote
>>> woc = WocMapsRemote(base_url="https://woc.osslab-pku.org/api", api_key="woc-api-key")
>>> woc.get_values("b2fa", "05fe634ca4c8386349ac519f899145c75fff4169")
('1410029988', 'Audris Mockus <audris@utk.edu>', 'e4af89166a17785c1d741b8b1d5775f3223f510f')

👉🏻 Read the guide for more details.

Guide (Local)

[!NOTE] This guide is for users who have access to the UTK / PKU WoC servers. If you don't have access, apply here or send an email to Audris Mockus.

Task 1: Install the python package

python-woc is available on PyPI. For most users, the following command is enough:

python3 -m pip install -U python-woc
Requirement already satisfied: python-woc in /export/data/play/rzhe/miniforge3/lib/python3.10/site-packages (0.2.2)
Requirement already satisfied: chardet<6.0.0,>=5.2.0 in /export/data/play/rzhe/miniforge3/lib/python3.10/site-packages (from python-woc) (5.2.0)
Requirement already satisfied: python-lzf<0.3.0,>=0.2.4 in /export/data/play/rzhe/miniforge3/lib/python3.10/site-packages (from python-woc) (0.2.4)
Requirement already satisfied: tqdm<5.0.0,>=4.65.0 in /export/data/play/rzhe/miniforge3/lib/python3.10/site-packages (from python-woc) (4.66.4)


Task 2: Basic operations

Let's start by initializing a local WoC client (if you want to query a specific version of the maps, pass a version parameter to the constructor):

from woc.local import WocMapsLocal

woc = WocMapsLocal()

# # or specify a version
# woc = WocMapsLocal(version='V')
# woc = WocMapsLocal(version=['V','V3'])

What maps are available?

from pprint import pprint

pprint([(m.name, m.version) for m in woc.maps])
[('c2fbb', 'V'),
 ('obb2cf', 'V'),
 ('bb2cf', 'V'),
 ('a2f', 'V'),
 ('a2f', 'T'),
 ('b2A', 'U'),
 ('b2a', 'U'),
 ('A2f', 'V'),
 ('P2a', 'V'),
 ('b2P', 'V'),
 ('b2f', 'V'),
 ('a2P', 'V'),
 ('a2P', 'T'),
 ('b2fa', 'V'),
 ('b2tac', 'V'),
 ('c2p', 'V3'),
 ('c2p', 'V'),
 ('c2pc', 'U'),
 ('c2cc', 'V'),
 ('c2rhp', 'U'),
 ('p2a', 'V'),
 ('ob2b', 'U'),
 ('A2a', 'V'),
 ('A2a', 'U'),
 ('A2a', 'T'),
 ('A2a', 'S'),
 ('a2A', 'V'),
 ('a2A', 'T'),
 ('a2A', 'S'),
 ('c2dat', 'V'),
 ('c2dat', 'U'),
 ('a2c', 'V'),
 ('a2fb', 'T'),
 ('a2fb', 'S'),
 ('P2c', 'V'),
 ('P2c', 'U'),
 ('c2r', 'T'),
 ('c2r', 'S'),
 ('P2p', 'V'),
 ('P2p', 'U'),
 ('P2p', 'T'),
 ('P2p', 'S'),
 ('P2p', 'R'),
 ('c2h', 'T'),
 ('c2h', 'S'),
 ('c2P', 'V'),
 ('c2P', 'U'),
 ('p2P', 'V'),
 ('p2P', 'U'),
 ('p2P', 'T'),
 ('p2P', 'S'),
 ('p2P', 'R'),
 ('A2c', 'V'),
 ('A2c', 'U'),
 ('A2P', 'V'),
 ...]

Task 3: Determine the author of the parent commit for commit 009d7b6da9c4419fe96ffd1fffb2ee61fa61532a

It's time for some hands-on tasks. The python client supports three types of API:

  1. get_values API: similar to getValues perl API, straightforward and simply queries the database;
  2. show_content API: similar to showCnt perl API, returns the content of the object;
  3. objects API: more intuitive, caches results, but adds some overhead.

Let's start with the get_values API.

# 1. get_values API

woc.get_values('c2ta', 
               woc.get_values('c2pc', '009d7b6da9c4419fe96ffd1fffb2ee61fa61532a')[0])
['1092637858', 'Maxim Konovalov <maxim@FreeBSD.org>']

Commits are also stored as objects in WoC. Check what is in the object with show_content:

# 2. show_content API

woc.show_content('commit', '009d7b6da9c4419fe96ffd1fffb2ee61fa61532a')
('464ac950171f673d1e45e2134ac9a52eca422132',
 ('dddff9a89ddd7098a1625cafd3c9d1aa87474cc7',),
 ('Warner Losh <imp@FreeBSD.org>', '1092638038', '+0000'),
 ('Warner Losh <imp@FreeBSD.org>', '1092638038', '+0000'),
 "Don't need to declare cbb module.  don't know why I never saw\nduplicate messages..\n")

For users who prefer a fully object-oriented interface, the objects API is for you.

# 3. objects API
from woc.objects import *
init_woc_objects(woc)
Commit('009d7b6da9c4419fe96ffd1fffb2ee61fa61532a').parents[0].author
Author(Maxim Konovalov <maxim@FreeBSD.org>)

Task 4: Find out who first committed "Hello World", and when

Let's dive deeper into a more real-world scenario. Git stores files as blobs, indexed by the SHA-1 hash of the content (with a prefix indicating the type and length of the object):

from hashlib import sha1

def git_hash_object(data, type_='blob'):
    """Compute the Git object ID for a given type and data.
    """
    s = f'{type_} {len(data)}\0'.encode() + data
    return sha1(s).hexdigest()

git_hash_object(b'Hello, World!\n')
'8ab686eafeb1f44702738c8b0f24f2567c36da6d'
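As a sanity check, the function reproduces Git's well-known empty-blob ID (git_hash_object is redefined below so the snippet runs standalone):

```python
from hashlib import sha1

def git_hash_object(data, type_='blob'):
    """Compute the Git object ID for a given type and data."""
    s = f'{type_} {len(data)}\0'.encode() + data
    return sha1(s).hexdigest()

# The empty blob's ID is a well-known Git constant, which makes a handy self-test.
assert git_hash_object(b'') == 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'
assert git_hash_object(b'Hello, World!\n') == '8ab686eafeb1f44702738c8b0f24f2567c36da6d'
```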

The map b2fa tells us the original creator of the blob:

# 1. get_values API
from datetime import datetime

t, a, c = woc.get_values('b2fa', git_hash_object(b'Hello, World!\n'))
print('Time:', datetime.fromtimestamp(int(t)))
print('Author:', a)
print('Project:', woc.get_values('c2p', c)[0])
Time: 1999-12-31 19:25:29
Author: Roberto Cadena Vega <robblack00_7@hotmail.com>
Project: robblack007_scripts-beagleboard
# 2. objects API

t, a, c = Blob('8ab686eafeb1f44702738c8b0f24f2567c36da6d').first_author
print('Time:', t)
print('Author:', a)
print('Project:', c.projects[0].url)
Time: 1999-12-31 19:25:29
Author: Roberto Cadena Vega <robblack00_7@hotmail.com>
Project: https://github.com/robblack007/scripts-beagleboard

Task 5: Find the aliases of the author

WoC has algorithms to detect aliases of authors and forks of projects. "A" represents unique author IDs. Here we find the aliases of the author "Roberto Cadena Vega <robblack00_7@hotmail.com>":

# 1. get_values API

woc.get_values('A2a', woc.get_values('a2A', 'Roberto Cadena Vega <robblack00_7@hotmail.com>')[0])
['Roberto Cadena Vega <robblack00_7@hotmail.com>',
 'robblack007 <robblack00_7@hotmail.com>']

The above is also available as an attribute of the Author object:

# 2. objects API

Author('Roberto Cadena Vega <robblack00_7@hotmail.com>').aliases
[Author(Roberto Cadena Vega <robblack00_7@hotmail.com>),
 Author(robblack007 <robblack00_7@hotmail.com>)]

Task 6: List the files in 'team-combinatorics_shuwashuwa-server' and save them to a directory

A tree represents the directory structure of the project at a certain point in time. We can list all the blobs in it:

list(Project('team-combinatorics_shuwashuwa-server').head.tree.traverse())
[(File(.github/workflows/build-test.yml),
  Blob(00a33c83a095ff8b4e0a48864234d6c04fbefb69)),
 (File(.github/workflows/upload-artifact.yml),
  Blob(867dec12e7466a31ad3bfdbb67a00b6d6383ae0c)),
 (File(.gitignore), Blob(91cea399a919aaca5bd1c1be0d8a9f309095f875)),
 (File(LICENSE), Blob(f288702d2fa16d3cdf0035b15a9fcbc552cd88e7)),
 (File(README.md), Blob(b98c3d928866e6558793d2aacbca4037480fea5e)),
 (File(doc/Java开发手册(嵩山版).pdf), Blob(4d4c5cc9a1a3d42b92c3fc169f5215a2e60ad86f)),
 (File(doc/docker.md), Blob(cf4dcb8b40210fbef88d483d26bae16c176bd1d3)),
 (File(doc/github-actions.md), Blob(6b30f9626e9d34750e2b6faf1fada2bd90d1b79d)),
 (File(doc/todo.md), Blob(e54518e2043803c59bcec43a1ed8cba145b7c0ef)),
 (File(doc/webhook.md), Blob(f83a89dd5efcf76017804e793cf7712e0844257e)),
 (File(helpful_tools/database/csv/activity_info.csv),
  Blob(081b5902a8b6fe393c700ebaa0bfe265ccea4dea)),
 (File(helpful_tools/database/csv/activity_time_slot.csv),
  Blob(90d5c2ecac32368d62ea76d65f466a5f1f50eaef)),
...]

WoC has every code blob stored in the database. We can save the project to a local directory (binary or missing blobs are fetched on-demand):

Project('team-combinatorics_shuwashuwa-server')\
    .save('local_repo')
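Under the hood, traversal is a depth-first walk that joins path components. Here is a minimal sketch over an in-memory stand-in for a tree (the nested-dict structure is illustrative only; python-woc walks real tree objects):

```python
# A depth-first walk in the spirit of Tree.traverse(), sketched over a
# nested dict: directories are dicts, blobs are (shortened) hex strings.
def traverse(tree, prefix=""):
    for name, entry in sorted(tree.items()):
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(entry, dict):      # subtree: recurse into it
            yield from traverse(entry, path)
        else:                            # blob: yield (path, sha)
            yield path, entry

repo = {
    "README.md": "b98c3d9",
    "doc": {"docker.md": "cf4dcb8", "todo.md": "e54518e"},
}
print(list(traverse(repo)))
# [('README.md', 'b98c3d9'), ('doc/docker.md', 'cf4dcb8'), ('doc/todo.md', 'e54518e')]
```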

Guide (Remote)

Task 1: Install the python package

Starting from 0.3.0, python-woc supports the HTTP API! You are no longer limited by access to the PKU or UTK servers. First, let us install or upgrade the python package:

python3 -m pip install -U python-woc
Looking in indexes: https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
Requirement already satisfied: python-woc in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (0.3.0)
Requirement already satisfied: chardet<6.0.0,>=5.2.0 in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from python-woc) (5.2.0)
Requirement already satisfied: httpx<0.29.0,>=0.28.1 in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from python-woc) (0.28.1)
Requirement already satisfied: python-lzf<0.3.0,>=0.2.4 in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from python-woc) (0.2.4)
Requirement already satisfied: rapidgzip<0.15.0,>=0.14.3 in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from python-woc) (0.14.3)
Requirement already satisfied: tqdm<5.0.0,>=4.65.0 in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from python-woc) (4.66.4)
Requirement already satisfied: anyio in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from httpx<0.29.0,>=0.28.1->python-woc) (4.5.2)
Requirement already satisfied: certifi in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from httpx<0.29.0,>=0.28.1->python-woc) (2025.1.31)
Requirement already satisfied: httpcore==1.* in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from httpx<0.29.0,>=0.28.1->python-woc) (1.0.7)
Requirement already satisfied: idna in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from httpx<0.29.0,>=0.28.1->python-woc) (3.10)
Requirement already satisfied: h11<0.15,>=0.13 in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from httpcore==1.*->httpx<0.29.0,>=0.28.1->python-woc) (0.14.0)
Requirement already satisfied: sniffio>=1.1 in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from anyio->httpx<0.29.0,>=0.28.1->python-woc) (1.3.1)
Requirement already satisfied: exceptiongroup>=1.0.2 in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from anyio->httpx<0.29.0,>=0.28.1->python-woc) (1.2.1)
Requirement already satisfied: typing-extensions>=4.1 in /home/hrz/mambaforge/envs/woc/lib/python3.8/site-packages (from anyio->httpx<0.29.0,>=0.28.1->python-woc) (4.12.2)

Task 2 (Optional): Generate an API key

This step is optional, but we recommend it. By default, the HTTP API limits the number of requests per minute to prevent abuse. To raise the limit, generate an API key on the World of Code website.

Currently the website is at https://woc.osslab-pku.org/ (after the domain transfer is settled, it will probably move to https://worldofcode.org/).

The API key is a string like woc-XXXXXXXXXXXXXX-YYYYYYYYYYYYYY, and you pass it to the client with the api_key argument:

# sync client
woc = WocMapsRemote(
    base_url="https://woc.osslab-pku.org/api/",
    api_key="woc-XXXXXXXXXXXXXX-YYYYYYYYYYYYYY"
)
# async client
woca = WocMapsRemoteAsync(
    base_url="https://woc.osslab-pku.org/api/",
    api_key="woc-XXXXXXXXXXXXXX-YYYYYYYYYYYYYY"
)

Task 3: Use the python package

The sync remote client feels the same as the local client, and most APIs will just work:

from woc.remote import WocMapsRemote

woc = WocMapsRemote(
    base_url="https://woc.osslab-pku.org/api/",  # <- may be different for you
)
[(m.name, m.version) for m in woc.maps]
[('c2fbb', 'V'),
 ('obb2cf', 'V'),
 ('bb2cf', 'V'),
 ('a2f', 'V'),
 ('a2f', 'T'),
 ('b2A', 'U'),
 ('b2a', 'U'),
 ('A2f', 'V'),
 ('P2a', 'V'),
 ('b2P', 'V'),
 ('b2f', 'V'),
 ('a2P', 'V'),
 ('a2P', 'T'),
 ('b2fa', 'V'),
 ('b2tac', 'V'),
 ('c2p', 'V3'),
 ('c2p', 'V'),
 ('c2pc', 'U'),
 ('c2cc', 'V'),
 ('c2rhp', 'U'),
 ('p2a', 'V'),
 ('ob2b', 'U'),
 ('A2a', 'V'),
 ('A2a', 'U'),
 ('A2a', 'T'),
 ('A2a', 'S'),
 ('a2A', 'V0'),
 ('a2A', 'V3'),
 ('a2A', 'V'),
 ('a2A', 'T'),
 ('a2A', 'S'),
 ('c2dat', 'V'),
 ('c2dat', 'U'),
 ('a2c', 'V'),
 ('a2fb', 'T'),
 ('a2fb', 'S'),
 ('P2c', 'V'),
 ('P2c', 'U'),
 ('c2r', 'T'),
 ('c2r', 'S'),
 ('P2p', 'V'),
 ('P2p', 'U'),
 ('P2p', 'T'),
 ('P2p', 'S'),
 ('P2p', 'R'),
 ('c2h', 'T'),
 ('c2h', 'S'),
 ('c2P', 'V'),
 ('c2P', 'U'),
 ('p2P', 'V'),
 ('p2P', 'U'),
 ('p2P', 'T'),
 ('p2P', 'S'),
 ('p2P', 'R'),
 ...]
# get_values API

woc.get_values('c2ta', 
               woc.get_values('c2pc', '009d7b6da9c4419fe96ffd1fffb2ee61fa61532a')[0])
['1092637858', 'Maxim Konovalov <maxim@FreeBSD.org>']
# show_content API

woc.show_content('commit', '009d7b6da9c4419fe96ffd1fffb2ee61fa61532a')
['464ac950171f673d1e45e2134ac9a52eca422132',
 ['dddff9a89ddd7098a1625cafd3c9d1aa87474cc7'],
 ['Warner Losh <imp@FreeBSD.org>', '1092638038', '+0000'],
 ['Warner Losh <imp@FreeBSD.org>', '1092638038', '+0000'],
 "Don't need to declare cbb module.  don't know why I never saw\nduplicate messages..\n"]

The only exception is the all_keys API, which is not supported by the remote client (we have not found a way to paginate it):

# Objects API

from woc.objects import *
init_woc_objects(woc)

Commit('009d7b6da9c4419fe96ffd1fffb2ee61fa61532a').parents[0].author
Author(Maxim Konovalov <maxim@FreeBSD.org>)
woc.all_keys('c2p')
---------------------------------------------------------------------------

NotImplementedError                       Traceback (most recent call last)

Cell In[6], line 1
----> 1 woc.all_keys('c2p')


File ~/mambaforge/envs/woc/lib/python3.8/site-packages/woc/remote.py:333, in WocMapsRemote.all_keys(self, map_name)
    332 def all_keys(self, map_name: str) -> Generator[bytes, None, None]:
--> 333     return self._asyncio_run(super().all_keys(map_name))


File ~/mambaforge/envs/woc/lib/python3.8/site-packages/woc/remote.py:292, in WocMapsRemote._asyncio_run(self, coro, timeout)
    285 def _asyncio_run(self, coro: Awaitable, timeout=30):
    286     """
    287     Runs the coroutine in an event loop running on a background thread, and blocks the current thread until it returns a result. This plays well with gevent, since it can yield on the Future result call.
    288 
    289     :param coro: A coroutine, typically an async method
    290     :param timeout: How many seconds we should wait for a result before raising an error
    291     """
--> 292     return asyncio.run_coroutine_threadsafe(coro, self._loop).result(timeout=timeout)


File ~/mambaforge/envs/woc/lib/python3.8/concurrent/futures/_base.py:444, in Future.result(self, timeout)
    442     raise CancelledError()
    443 elif self._state == FINISHED:
--> 444     return self.__get_result()
    445 else:
    446     raise TimeoutError()


File ~/mambaforge/envs/woc/lib/python3.8/concurrent/futures/_base.py:389, in Future.__get_result(self)
    387 if self._exception:
    388     try:
--> 389         raise self._exception
    390     finally:
    391         # Break a reference cycle with the exception in self._exception
    392         self = None


File ~/mambaforge/envs/woc/lib/python3.8/site-packages/woc/remote.py:261, in WocMapsRemoteAsync.all_keys(self, map_name)
    260 async def all_keys(self, map_name: str) -> Generator[bytes, None, None]:
--> 261     raise NotImplementedError(
    262         "all_keys is not implemented in WoC HTTP API. "
    263         "If you feel it is necessary, please create a feature request at "
    264         "https://github.com/ssc-oscar/python-woc/issues/new"
    265     )


NotImplementedError: all_keys is not implemented in WoC HTTP API. If you feel it is necessary, please create a feature request at https://github.com/ssc-oscar/python-woc/issues/new

Task 4: Batching

Git objects are typically small, and sending dozens of tiny queries one at a time is inefficient. The remote client supports batching via show_content_many and get_values_many, which send 10 queries per request. For the impatient, pass progress=True to display a progress bar. The return value is a tuple of two dictionaries: ({ results }, { errors }).
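The batching logic amounts to chunking the key list into groups of 10; a sketch of the idea (not the client's actual implementation):

```python
# Chunk a key list into batches, the way a batched client might group
# queries into requests. Illustrative sketch, not python-woc's code.
def batched(keys, size=10):
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

batches = list(batched([f"key{i}" for i in range(25)]))
print([len(b) for b in batches])  # [10, 10, 5]
```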

woc.show_content_many('commit', woc.get_values('a2c', 'Audris Mockus <audris@utk.edu>')[:50], progress=True)
100%|██████████| 5/5 [00:00<00:00,  5.45it/s]
({'001ec7302de3b07f32669a1f1faed74585c8a8dc': ['d0074dfdf50faf1a679a293d1833af74513d5b38',
   ['13710ca2439f85eff9922169a4588da64b3f1fce'],
   ['Audris Mockus <audris@utk.edu>', '1514659483', '-0500'],
   ['Audris Mockus <audris@utk.edu>', '1514659483', '-0500'],
   'work on diff performance\n'],
  '0037d5c34c2787f2a0b619c5d2a1f76254ac974c': ['ac14b680a6c58f50221b8da7cfa307528b5b971a',
   ['87ec9a9e6fda18cdcb8bd78a0e909afd0e40d329',
    '5fee5205917fa803036a17aba185a0a8af17d1fa'],
   ['Audris Mockus <audris@utk.edu>', '1629765554', '-0400'],
   ['GitHub <noreply@github.com>', '1629765554', '-0400'],
   'Merging 43\n\n'],
  '003f2b790d6fb83924649d90867f3d1545ea0e36': ['3eda5f06cba2c0051367de6ebcf1daf9c3a9cdc6',
   ['8750edd4576f6a0b592a36d777f09d272c42097b',
    'a4ac9e07db0a27268584d2912ddf2cceaf3dc3d2'],
   ['Audris Mockus <audris@utk.edu>', '1512787832', '-0500'],
   ['GitHub <noreply@github.com>', '1512787832', '-0500'],
   'Merge pull request #1 from dylanrainwater/master\n\nCreate drainwa1.md'],
  '00448b8ca4198a41b64618e4f0f9726d206fce69': ['ba20c6b9a6cafb42dcfd01f71a01140639e4a5ea',
   ['37ed4871d1578a9f02169a2bef5e612d465c3c4f'],
   ['Audris Mockus <audris@utk.edu>', '1571281741', '-0400'],
   ['Audris Mockus <audris@utk.edu>', '1571281741', '-0400'],
   'autodeducing types\n'],
  '005905285ff1b9e33babbacfce09f39484e9428b': ['64a63070bb2b9a01707535e8509c80c2e4674e3f',
   [],
   ['Audris Mockus <audris@utk.edu>', '1581825750', '-0500'],
   ['Audris Mockus <audris@utk.edu>', '1581825750', '-0500'],
   'examples on how to do api embedding via doc2vec\n'],
  '00616e51919581ebd84fcb14f26282e963e8a6cd': ['5a573140144efb4f0f987b61e8ff23de1ddef80a',
   ['c5bd344bdd75c3b707ad9d821026bc2ba382c0bb'],
   ['Audris Mockus <audris@utk.edu>', '1516821061', '-0500'],
   ['Audris Mockus <audris@utk.edu>', '1516821061', '-0500'],
   'make sure integers are packed\n'],
  '0064fd6ca213b387347f10de2457ed94d0cf798a': ['21f231a310838a4bef9f8db26e06baabf553b9d7',
   ['2d6fa88a0f45806283b4b0ab987d59bebfd3b9d8'],
   ['Audris Mockus <audris@utk.edu>', '1631804415', '-0400'],
   ['GitHub <noreply@github.com>', '1631804415', '-0400'],
   'Update README.md'],
  '006d7e83e272d1715b2aca1a43e91b9141227532': ['b1423403b051f56afe79662b0222ddca7789e88b',
   ['cb83bea9dcdfb0364796f5f54403767cedf90c5a'],
   ['Audris Mockus <audris@utk.edu>', '1473861546', '+0000'],
   ['Audris Mockus <audris@utk.edu>', '1473861546', '+0000'],
   'README.md edited online with Bitbucket'],
  '007c4f4867e5a27971b056ddcd9b7abc7221f231': ['651a93de99df7e87f98d290292353d4af211c05c',
   ['aa35f02cb9eef14bfcdc354f21bd6e1499519f3e'],
   ['Audris Mockus <audris@utk.edu>', '1550241863', '-0500'],
   ['Audris Mockus <audris@utk.edu>', '1550241863', '-0500'],
   'Current status\n'],
  ...},
 {})
woc.get_values_many('c2b', woc.get_values('P2c', 'user2589_minicms')[:50], progress=True)
100%|██████████| 5/5 [00:00<00:00,  5.56it/s]
({'05cf84081b63cda822ee407e688269b494a642de': ['03d1977aecf31666578422805c60cf61562ceea1',
   '1619cce13ffcc3eaaddb1f714072914625f576f6',
   '1838ded6411e5fbfd9d0168de007de3e78e94d94',
   '3993b20337e33a36c9125d139f1f53a279a4c128',
   '3dd682cbf7fd0c482d31f0e74e9ed05e4853cd9f',
   '44a07afd30f499cdba30847094a1e92f13e1320e',
   '6ad59f8158da5afca559c5b3d422af2b1a17eb81',
   '6b0af6a378d95ac9a11297fe83baca147c7af4c2',
   '70ce71f3cd86c10b11b778e502ca9364b2262d8f',
   '83d5c112f8584c5a7f2db377e5dda2216586f2ca',
   '9599aebf9b3cae84678ef0703e6217d47030b0ff',
   'dae40a15a0f5eaef5259d66defe3544166da59bd',
   'e8b638f1548c5b74e2bb4b74d3aaf8da93e24aa1'],
  '086a622f0e24feb7853c520f965f04c7fc7e4861': ['773e50f6785dafd0acbe050e7dd16a8179297652',
   'ad4d743e78a5ceb39942675cf968c1dcaa935557',
   'e87238845f6b48d16807cb4d56a58bd17ab41931'],
  '0ad5b0b392ed22cef866de5ae8504462183b0316': ['29f38813ea514935bce39d8b24c31e486d033340'],
  '104b8284ba6435a3c07eb5ce82f15cb0f956eda3': ['c11edc429d433037a18d346dad544731809f6898'],
  '1837bfa6553a9f272c5dcc1f6259ba17357cf8ed': ['1b92e22ab92e429615d8fdf84353ece7233f2487',
   '870c07dbfaccf8faa87a32efb63d6ae67b37c539',
   '8f0f90c29f067e20ebaf0c53c02c66b89a31e5c3',
   '973a78a1fe9e69d4d3b25c92b3889f7e91142439',
   'a43bb729c565ea9ce17a26d23d68b88030a84aa5',
   'c8da2827d7ed589656075db8c083f5e5ba6d81d9',
   'dca8d68784e46f66ec548c5dab7a0bfbdeaaf5a9'],
  '19ddf6dafb6014c954253bd022778051213ccd9a': ['56aecae2b6137a3d62bdde0c36ea755d48643dc5'],
  '1d3038eab8cac1e8a9df187d411fbc0e4a317270': ['028b4844ea03bfb07bad74efe0aa800464835f1c',
   'dfadce9d6f708fc79711f7e10453ad5584b925e0'],
  '1e971a073f40d74a1e72e07c682e1cba0bae159b': ['1e0eaec8f6164cb5e15031fee8702a05dec6a1cf',
   '2060f551336795224535caa172703b6c0e660510',
   '2bdf5d686c6cd488b706be5c99c3bb1e166cf2f6',
   '7e2a34e2ec9bfdccfa01fff7762592d9458866eb',
   'c006bef767d08b41633b380058a171b7786b71ab',
   'e0ac96cefe3d230553931c54a79fa164a8fa11da',
   'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'],
  '1eda863abed481df83c680a6c31fad05719b166b': ['9300f6dfdbff157aa7a28a42331c334b36302c9d'],
  '27f6af62ff6facfb21a7fe33cddfc115f93cb75f': ['51f9f7da85518d034176fe3a1d5d9eacf0bbaed7',
   'f25b48beb98cfb011373517f23883dd4dbaab589'],
  '2881cf0080f947beadbb7c240707de1b40af2747': ['3bbb2f13dcfbc1c4352c940f8d3c22c2789c621d',
   'c3bfa5467227e7188626e001652b85db57950a36'],
  '2c02c1f9b1a959c5228bf8cfad1a09fd5489b381': ['2cfbd298f18a75d1f0f51c2f6a1f2fcdf41a9559',
   '6365ac91afdde2db36d7c8e7119c5a4cc04d9a2a',
   '6481ac0fbbd735752710018df1ddae0cc926d5c5',
   '64fd4dcec034e966cfc240008d93cc300f878def',
   'e6edff07e43858eb6c1ac618d7956d1dff8f4be8',
   'f796acbf50001003f398f53be27d947acbaa76bf'],
  '3303b05caf2f9b51fc6323820fe9e04780c40e48': ['9be0d9635e048fb5239e1b893adebe6b94cf1942',
   '9eaec93fea6c0f22d6138dd33fa6750d1a9556a9',
   'a932689dd969fc2d6a3c16664021cca7f7e8967a',
   'e048cddd988918af374a4a253017c4d8133c29c0'],
  '335aeff4c90d4d31562a24b2648ed529ef664664': ['070cdd1f69532a37dfc434153f6e887376a68f68'],
  '3b0cbf364870cd35d9e41630387d97393fea2fa5': ['e263e29d9fcd7168498f290560f49de58b69fb57'],
  '3beec34c51d9ae5d60c7eb976bb03a95db235514': ['df8a84af3a9db52a5ee2bf8afd0a120e7e7aecc4'],
  '3c57cd4791ac46ccb73cc22a50d9a4c77e5cd0a3': ['5f41f1fa0952ca9adcfc88d89617e102454a8447'],
  '3c59a5aca8ee3e201977558fe9f1ea5489d2b1b3': ['5c6802269104f3f5a8a831ed70b2170eb94cc46c'],
  '3f68ba216c938e93aff1dc45b241511a0fa94e51': ['028b4844ea03bfb07bad74efe0aa800464835f1c',
   'dfadce9d6f708fc79711f7e10453ad5584b925e0'],
  '4dffda766eba4f4edc31eb0b7691cc75d7775de0': ['409c292ddf48ceedc69cc96c59325dbf6226e287'],
  '58898751944b69ffc04d148b0917473e2d5d5db8': ['1f3e4f03f555d640d72ac4e89d2c8c97bc6255f1',
   '56aecae2b6137a3d62bdde0c36ea755d48643dc5',
   '5f41f1fa0952ca9adcfc88d89617e102454a8447',
   '773e50f6785dafd0acbe050e7dd16a8179297652',
   '958efc14c7d74d732eac137af7e554795dbfe6fc',
   '96bc275bee57ddbe38acbd46776d907bc10f279f',
   '9825f4f761657f2a8cc1352f2a5cd50a442fb624',
   '9eaec93fea6c0f22d6138dd33fa6750d1a9556a9',
   'a400c114785031e934d3a323c247c697f944ba04',
   'a932689dd969fc2d6a3c16664021cca7f7e8967a',
   'ad4d743e78a5ceb39942675cf968c1dcaa935557',
   'e043781b07bbf336f1b9bb7a2c5c1cd60c00c046',
   'e048cddd988918af374a4a253017c4d8133c29c0',
   'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'],
  '5b0afd26ed90f8b3352f4ac8a53da9f23597c42d': ['4ab32146cbd1c0ba24564b28cbda8b70bc571dda',
   'daae152bed904558cd06cd5ed8da995d818d1eb5',
   'e043781b07bbf336f1b9bb7a2c5c1cd60c00c046'],
  '63ff96cbb38687e68cb4fdd7e208aa70f66ba252': ['df10ebd07b711886fa9a7f9b4569ca5778e187d7'],
  '66acf0a046a02b48e0b32052a17f1e240c2d7356': ['a7227f22a261aec4824b4657d381ab49bce35005',
   'd05d461b48a8a5b5a9d1ea62b3815e089f3eb79b',
   'd1d952ee766d616eae5bfbd040c684007a424364',
   'fcff510d9cc6217b45c1aca343bba71bb6a2577b'],
  '67cfcb7bf8c28c280603d7ba7a7831c5ee1ea040': ['0f05e24cc408cda9c573d3e76774e499d338b88d'],
  ...},
 {'43981b68b7a24544d4bc4f3094be7a12c9f0afe0': 'Key 43981b68b7a24544d4bc4f3094be7a12c9f0afe0 not found in /da8_data/basemaps/c2bFullV.3woc.tch'})

Task 5: Go Async

The remote client also supports an async API, which is useful when you run multiple requests in parallel. The APIs mirror the sync ones, with await in front.

from woc.remote import WocMapsRemoteAsync, WocMapsRemote

woca = WocMapsRemoteAsync(
    base_url="https://woc.osslab-pku.org/api/",
)
[(m.name, m.version) for m in await woca.get_maps()]
[('c2fbb', 'V'),
 ('obb2cf', 'V'),
 ('bb2cf', 'V'),
 ('a2f', 'V'),
 ('a2f', 'T'),
 ('b2A', 'U'),
 ('b2a', 'U'),
 ('A2f', 'V'),
 ('P2a', 'V'),
 ('b2P', 'V'),
 ('b2f', 'V'),
 ('a2P', 'V'),
 ('a2P', 'T'),
 ('b2fa', 'V'),
 ('b2tac', 'V'),
 ('c2p', 'V3'),
 ('c2p', 'V'),
 ('c2pc', 'U'),
 ('c2cc', 'V'),
 ('c2rhp', 'U'),
 ('p2a', 'V'),
 ('ob2b', 'U'),
 ('A2a', 'V'),
 ('A2a', 'U'),
 ('A2a', 'T'),
 ('A2a', 'S'),
 ('a2A', 'V0'),
 ('a2A', 'V3'),
 ('a2A', 'V'),
 ('a2A', 'T'),
 ('a2A', 'S'),
 ('c2dat', 'V'),
 ('c2dat', 'U'),
 ('a2c', 'V'),
 ('a2fb', 'T'),
 ('a2fb', 'S'),
 ('P2c', 'V'),
 ('P2c', 'U'),
 ('c2r', 'T'),
 ('c2r', 'S'),
 ('P2p', 'V'),
 ('P2p', 'U'),
 ('P2p', 'T'),
 ('P2p', 'S'),
 ('P2p', 'R'),
 ...]
# 1. get_values API

# woc.get_values('c2ta', 
#                woc.get_values('c2pc', '009d7b6da9c4419fe96ffd1fffb2ee61fa61532a')[0])

await woca.get_values('c2ta', 
               (await woca.get_values('c2pc', '009d7b6da9c4419fe96ffd1fffb2ee61fa61532a'))[0])
['1092637858', 'Maxim Konovalov <maxim@FreeBSD.org>']
async for i in woca.iter_values("P2c", "user2589_minicms"):
    print(i)
['05cf84081b63cda822ee407e688269b494a642de', '086a622f0e24feb7853c520f965f04c7fc7e4861', '0ad5b0b392ed22cef866de5ae8504462183b0316', '104b8284ba6435a3c07eb5ce82f15cb0f956eda3', '1837bfa6553a9f272c5dcc1f6259ba17357cf8ed', '19ddf6dafb6014c954253bd022778051213ccd9a', '1d3038eab8cac1e8a9df187d411fbc0e4a317270', '1e971a073f40d74a1e72e07c682e1cba0bae159b', '1eda863abed481df83c680a6c31fad05719b166b', '27f6af62ff6facfb21a7fe33cddfc115f93cb75f', '2881cf0080f947beadbb7c240707de1b40af2747', '2c02c1f9b1a959c5228bf8cfad1a09fd5489b381', '3303b05caf2f9b51fc6323820fe9e04780c40e48', '335aeff4c90d4d31562a24b2648ed529ef664664', '3b0cbf364870cd35d9e41630387d97393fea2fa5', '3beec34c51d9ae5d60c7eb976bb03a95db235514', '3c57cd4791ac46ccb73cc22a50d9a4c77e5cd0a3', '3c59a5aca8ee3e201977558fe9f1ea5489d2b1b3', '3f68ba216c938e93aff1dc45b241511a0fa94e51', '43981b68b7a24544d4bc4f3094be7a12c9f0afe0', '4dffda766eba4f4edc31eb0b7691cc75d7775de0', '58898751944b69ffc04d148b0917473e2d5d5db8', '5b0afd26ed90f8b3352f4ac8a53da9f23597c42d', '63ff96cbb38687e68cb4fdd7e208aa70f66ba252', '66acf0a046a02b48e0b32052a17f1e240c2d7356', '67cfcb7bf8c28c280603d7ba7a7831c5ee1ea040', '6d031dc38204b9bb0ddf75f6f76ea28bc7e4d054', '710d45b6cfffb5d630b117572cf27fb1848fb5af', '78bd544cdb869b5b1f7280f6df2e856d1dbb8775', '7da8be92d11b33e913408438fa174dc524f1d9d9', '81494a6510c11dad3aabe14625b1644561f0af5f', '827f3a14d1edd48d36effbdb7cc5f44221df3a7a', '85787429380cb20b6a935e52c50f85f455790617', '8caff1690253f8a9596c9918819f24c9f79140ce', '8fb99ec51bccc6ea4828c6ea08cd0976b53e6edc', '995e6a997e3f841235487dac9c50f903d855aaa2', '9bd02434b834979bb69d0b752a403228f2e385e8', '9be4c4ec851fafc295c9e098a2e4741a8140f7be', 'a443e1e76c39c7b1ad6f38967a75df667b9fed57', 'aa779a8f11473abc461c61db917e2d3c3d2c2e5d', 'ab124ab4baa42cd9f554b7bb038e19d4e3647957', 'b35161cffea14767e66e75bece19b368815522d4', 'b66c430180f96fb1e7e821ec637f88e63c6c5aae', 'b782ea2936b97d0bfb3f2ed089b49b1a72414182', 'ba3659e841cb145050f4a36edb760be41e639d68', 
'bb3701be55292fef1b0daa815199ff0886be540d', 'bde80c730ef2dc34e8c34291aae3174b78dd8cb0', 'd009acbf3b4e663fc101e4086e5cd06eb7e5e418', 'd04e0f15916c843aa1e4b8aeabb758999d22e390', 'd11431c3ef74770ac570a82b2fd9b19a690a4adc', 'd64156ef9a753c105958c34ae2518ad46d3dc6bb', 'd8b38d3798d277d1e15474f26aa8e6ae33ba2d67', 'd9a72b48e7bb3406022207039c3b9c1e22ea8955', 'de8985354c2af17bbc1263d8b3ae5e0f2330b540', 'dff7aa6c388b07f7fc9a171c1659d60df17c379f', 'dff95810cd0b99a6027cbb3f725b0116f6aa9f33', 'e25e7e06bcc63c7aca7d6ffa6f54fbc9f00b5da6', 'e38126dbca6572912013621d2aa9e6f7c50f36bc', 'e40c989e1e583ea5d32824c28231d594b10fce2b', 'e766916b4ab3ccd430ff3b0e55bcf2cae0772f91', 'e99e506f6214aec5fd0d52bf00f36c1df00de9be', 'eb081cc38858eab997921f9bc2bfb57596d5bdf8', 'f0d02fc17be5fdd57b969880fc0ea0f6fa96ba95', 'f2a7fcdc51450ab03cb364415f14e634fa69b62c', 'f2ac3a79ebc17c7f10814f5f13d3a17d8fc990c3', 'f5220ef868579d41c7fba0a66e9697d61626a4a7', 'fa695566bac8584045c4e95209aeb7c9e4adfe49', 'fe60de5ca98daae4a056c96544ed218aab28b0d2', 'fec2d855fb0086ab37e1a557a1e3531e187cfa0a']
await woca.show_content("tree", "f1b66dcca490b5c4455af319bc961a34f69c72c2")
[['100644', 'README.md', '05fe634ca4c8386349ac519f899145c75fff4169'],
 ['100644', 'course.pdf', 'dfcd0359bfb5140b096f69d5fad3c7066f101389']]

To Contribute

How to commit

We follow the standard "fork-and-pull" Git workflow. All development is done on feature branches, which are then merged to master via pull requests. Pull requests are unit tested and linted automatically.

To generate release notes, we use conventional commits, a convention for commit messages. In a nutshell, it means commit messages should be prefixed with one of:

  • fix: in case the change fixes a problem without changing any interfaces. Example commit message: fix: missing clickhouse-driver dependency (closes #123).
  • feat: the change implements a new feature, without affecting existing interfaces. Example: feat: implement author timeline.
  • other prefixes, e.g. chore:, refactor:, docs:, test:, ci:, etc.
    • these will not be included in release notes and will not trigger a new release without new features or fixes, unless they contain breaking changes (see below).

In case of breaking changes, the commit message should include an exclamation mark before the colon, or contain BREAKING CHANGE in the footer, e.g.:

`feat!: drop support for deprecated parameters`

Commit hooks will reject commits that do not follow the convention, so you have no choice but to follow the rules 😈

Setup development environment

Setup Poetry

To make sure everyone is on the same page, we use poetry to manage dependencies and virtual environments. If you don't have it installed yet, please follow the installation guide.

After installing poetry, create a virtual environment and install all dependencies:

poetry shell       # activate the virtual environment
poetry install     # install all dependencies

The poetry install command builds python-woc from source as well.

Install pre-commit hooks

Pre-commit hooks ensure that all code is formatted, linted, and tested before being pushed to GitHub. This "fail fast, fail early" approach saves time and effort for all of us.

# install linter and unit tests to pre-commit hooks
pre-commit install 
# install the conventional commits checker
pre-commit install --hook-type commit-msg  

About Cython

(Notes from the original maintainer, @moo-ack)

The reason to use Cython was primarily Python 3 support. WoC data is stored in tokyocabinet (woc.tch) files, which are not natively supported by Python. The libtokyocabinet binding, python-tokyocabinet, is a C extension supporting Python 2 only, and its lack of development activity suggests that updating it for Python 3 was hardly ever considered. So, our options for interfacing with libtokyocabinet were:

  • cffi (C Foreign Function Interface): perhaps the simplest option, but it does not support conditional definitions (#IFDEF), which are actively used in tokyocabinet headers
  • C extension, adapting the existing python-tokyocabinet code for Python 3. It is rather hard to support and even harder to debug; a single attempt was scrapped after it produced a silently failing extension.
  • SWIG (and its successor, CLIF), a Google project to generate C/C++ library bindings for pretty much any language. Since SWIG provides a 1:1 interface to the library methods, Python clients would have to be aware of the C structures returned by libtokyocabinet, which was too inconvenient.
  • Cython, a weird mix of Python and C. It allows writing Python interfaces, while simultaneously using C functions. This makes it the ideal option for our purposes, providing a Python woc.tch file handler working with libtokyocabinet C structures under the hood.

Cython came out as the clear winner in this comparison, also helping to speed up some utility functions along the way (e.g. fnvhash, a pure Python version of which was previously used).

Compile changes to Cython code

Cython code is not interpreted; it needs to be compiled to C and then to a shared object file. If you made any changes to .pyx or .pxd files, you need to recompile:

python3 setup.py

Cython requires a functioning GNU toolchain. Sometimes ld complains that it cannot find -lbz2; in that case, install the bzip2-devel package on CentOS or libbz2-dev on Ubuntu.

Lint

We use ruff as the one and only linter and formatter. Being written in Rust, it is blazingly fast, perfect for commit hooks and CI. The pre-commit hooks already include the following:

ruff format        # format all Python code
ruff check         # lint all Python code
ruff check --fix   # fix all linting issues

But ruff's fix feature is very cautious: sometimes it refuses to perform "unsafe" changes and yields an error when you commit. In this case, we recommend installing the ruff VSCode integration, double-checking the suggested changes, and applying them manually.

Test

Run tests

python-woc uses pytest to run unit tests and pytest-cov to check coverage. To run the tests:

pytest              # run all unit tests
pytest -k test_name # run a specific test
pytest --cov        # run all tests and check coverage

Add a new test case

Test cases are located in the tests/ directory. To add a new test case, create a new file named test_*.py, like the following:

# tests/test_my_new_feature.py
def test_my_new_feature():
    assert 1 == 1

Add a new fixture

Some may claim that the binary fixtures are good enough for testing, but we prefer to incorporate the generation scripts into the test suite. To add a new fixture, add a line to tests/fixtures/create_fixtures.py:

cp.copy_content("tree", "51968a7a4e67fd2696ffd5ccc041560a4d804f5d")

Run create_fixtures.py on a server with WoC datasets, generate a profile at ./wocprofile.json, and run the following at the project root:

PYTHONPATH=. python3 tests/fixtures/create_fixtures.py

Manage dependencies

Managing dependencies is not hard with poetry. To add a new dependency, run:

poetry add package_name  # add a new dependency

Sometimes it is easier to edit the manifest file directly, e.g. to add a new dependency to a specific group. In that case, you need to update the lockfile manually, or poetry gets angry on install:

nano pyproject.toml      # add a new dependency manually
poetry lock --no-update  # update the lock file

Poetry is heavy to install and set up. To make manual installation easier, we keep a requirements.txt file. You will need to update it after modifying dependencies:

poetry check --lock
poetry export -f requirements.txt --with build --output requirements.txt

Test GitHub actions locally

It's always a good idea to test your code before committing, to avoid fixups polluting the commit history. You can run the GitHub actions locally with act to see if everything works as expected:

act -j 'test' -s CODECOV_TOKEN  # run unit tests
act -j 'build-and-publish' --artifact-server-path build  # run the wheel builder
act -j 'docs' -s GITHUB_TOKEN  # run the documentation generator

Build wheels

The easiest way to build manylinux wheels is to run the GitHub action locally with act (note that write permission to the Docker socket is required). You will get exactly the same wheels as the CI produces:

act -j 'build-wheel' --artifact-server-path build
cd build/1/wheels/
# Somehow artifacts are gzipped, and we need to unzip them
for f in *.gz__; do mv "$f" "${f%__}"; gzip -d "${f%__}"; done
# move them to dist/
mv *.whl ../../../dist/

Note that even though poetry build does produce manylinux wheels, their compatibility level is not guaranteed. To ensure the wheels are compatible with CentOS 7, we fix the level to manylinux2014.

Bump version

You don't have to, and please do not, change the version number manually. Use poetry to bump it:

poetry version patch  # or minor, or major, or pre-release

For the full usage, please refer to the poetry documentation.

Publish a new version to GitHub and PyPI

Super easy:

git tag vX.Y.Z
git push --tags

The GitHub action will automatically build and publish the wheels to PyPI. You can check the status of the build in the Actions tab.

The pipeline also drafts GitHub releases. Navigate to the Releases tab to edit the release notes and publish the release.

Add new mappings to python-woc

woc.get_values

We don't hard-code how to encode and decode each of the mappings. Instead, we follow the practice of the original World of Code Perl driver and define the following datatypes:

{
  "h": "hex",
  "s": "str",
  "cs": "[compressed]str",
  "sh": "str_hex",
  "hhwww": "hex_hex_url",
  "r": "hex_berint",
  "cs3": "[compressed]str_str_str"
}
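
For illustration, the simpler dtypes can be decoded in plain Python. This is a hedged sketch with hypothetical helper names; the real decoders live in woc/local.pyx and also handle the compressed ("cs", "cs3") and composite formats:

```python
# Hypothetical decoders for three of the dtype tags above; the real
# implementations live in woc/local.pyx and differ in detail.

def decode_hex(value: bytes) -> str:
    # "h": binary SHA1 -> 40-character hex string
    return value.hex()

def decode_str(value: bytes) -> str:
    # "s": raw bytes -> text
    return value.decode("utf-8")

def decode_ber_int(value: bytes) -> int:
    # the "berint" part of "r": BER-compressed integer (as in Perl's
    # pack "w") -- 7 bits per byte, most significant group first,
    # high bit set on every byte except the last
    n = 0
    for b in value:
        n = (n << 7) | (b & 0x7F)
    return n
```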

woc.detect should be able to recognize new mappings if they follow the current naming scheme. To get get_values working, you may add a new line to woc/detect.py and regenerate the profile, or modify the following field in wocprofile.json:

"dtypes": [
  "h",  // Input dtype
  "cs3"  // Output dtype
]

woc.show_content

woc.show_content is a bit tricky, and we have to implement the encoders and decoders separately. To add another git object, please refer to existing implementations in woc/local.pyx.

woc.objects

The object API is implemented in pure Python, in woc/objects.py. A new object class needs to be a subclass of one of the following:

  • _GitObject: A hash-indexed Git object, e.g. commit, tree, blob
  • _NamedObject: A named object indexed by its fnv hash, e.g. author, project
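
A _NamedObject is keyed by a hash of its name rather than a git SHA. As an illustration only (python-woc's fnvhash utility may use a different variant or width), a 32-bit FNV-1a hash can be written as:

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash; treat this as a sketch -- the exact variant
    and parameters used by python-woc may differ."""
    h = 0x811C9DC5                             # FNV-1a offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF      # FNV prime, mod 2**32
    return h
```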

World of Code Tutorial

[!NOTE] This tutorial is obsolete and will be removed soon. Please refer to the tutorial in woc-hack for the most up-to-date information.

List of relevant directories

da0 Server

<relationship>.{0-31}.tch files in /data/basemaps/:

(.s) signifies that there are either .s or .gz versions of these files in the gz/ subfolder, which can be opened with the Python gzip module or Unix zcat. da0 is the only server with these .s/.gz files.

Keys for identifying letters:

  • a = Author
  • b = Blob
  • c = Commit
  • cc = Child Commit
  • f = File
  • h = Head Commit
  • p = Project
  • pc = Parent Commit
  • ta = Time Author
  • trp = Torvald Path

List of relationships:

* a2c (.s)      * a2f           * a2ft          * a2L (.s only)     * a2p (.s)      * a2trp0 (.s)
* b2c (.s)      * b2f (.s)
* c2b (.s)      * c2cc          * c2f (.s)      
* c2h           * c2pc          * c2p (.s)      * c2ta (.s)
* f2b (.s)      * f2c (.s)      
* p2a (.s)      * p2c (.s)

/data/play/$LANGthruMaps/ on da0:

These thruMaps directories contain mappings of repositories with modules that were utilized at a given UNIX timestamp under a specific commit. The mappings are in c2bPtaPkgO{$LANG}.{0-31}.gz files.
Format: commit;repo_name;timestamp;author;blob;module1;module2;...
Each thruMaps directory has a different language ($LANG) that contains modules relevant to that language.


da3 Server

.tch files in /fast/:

da3 contains the same files located on da0, except for b2f, c2cc, f2b, and f2c. This folder can be used for faster reading, hence the directory name. In the context of oscar.py, the dictionary values listed in the PATHS dictionary can be changed from /da0_data/basemaps/... to /fast/... when referencing oscar.py in another program.


OSCAR functions from oscar.py

Note: "/" after a function name denotes the version of that function that returns a Generator object

These are corresponding functions in oscar.py that open the woc.tch files listed above for a given entity:

  1. Author('...') - initialized with a combination of name and email
    • .commit_shas/commits
    • .project_names
    • .torvald - returns the torvald path of an Author, i.e., who did this Author work with that also worked with Linus Torvalds
  2. Blob('...') - initialized with SHA of blob
    • .commit_shas/commits - commits removing this blob are not included
  3. Commit('...') - initialized with SHA of commit
    • .blob_shas/blobs
    • .child_shas/children
    • .changed_file_names/files_changed
    • .parent_shas/parents
    • .project_names/projects
  4. Commit_info('...') - initialized like Commit()
    • .head
    • .time_author
  5. File('...') - initialized with a path, starting from a commit root tree
    • .commit_shas/commits
  6. Project('...') - initialized with project name/URI
    • .author_names
    • .commit_shas/commits

The non-Generator version of these functions will return a tuple of items which can then be iterated:

for commit in Author(author_name).commit_shas:
    print(commit)

Examples of doing certain tasks

  • Get a list of commits and repositories that imported Tensorflow for .py files:
    On da0: UNIX> zcat /data/play/PYthruMaps/c2bPtaPkgOPY.0.gz | grep tensorflow
    Output:
0000331084e1a567dbbaae4cc12935b610cd341a;abdella-mohamed_BreastCancer;1553266304;abdella <abdella.mohamed-idris-mohamed@de.sii.group>;0dd695391117e784d968c111f010cff802c0e6d1;sns;keras.models;np;random;tensorflow;os;pd;sklearn.metrics;plt;keras.layers;yaml
00034db68f89d3d2061b763deb7f9e5f81fef27;lucaskjaero_chinese-character-recognizer;1497547797;Lucas Kjaero <lucas@lucaskjaero.com>;0629a6caa45ded5f4a2774ff7a72738460b399d4;tensorflow;preprocessing;sklearn
000045f6a3601be885b0b028011440dd5a5b89f2;yjernite_DeepCRF;1451682395;yacine <yacine.jernite@nyu.edu>;4aac89ae85b261dba185d5ee35d12f6939fc2e44;nn_defs;utils;tensorflow
000069240776f2b94acb9420e042f5043ec869d0;tickleliu_tf_learn;1530460653;tickleliu <tickleliu@163.com>;493f0fc310765d62b03390ddd4a7a8be96c7d48c;np;tf;tensorflow
.....
  • Get a list of commits made by a specific author:
    On da0: UNIX> zcat /data/basemaps/gz/a2cFullP0.s | grep '"Albert Krawczyk" <pro-logic@optusnet.com.au>'
    Output:
"Albert Krawczyk" <pro-logic@optusnet.com.au>;17abdbdc90195016442a6a8dd8e38dea825292ae
"Albert Krawczyk" <pro-logic@optusnet.com.au>;9cdc918bfba1010de15d0c968af8ee37c9c300ff
"Albert Krawczyk" <pro-logic@optusnet.com.au>;d9fc680a69198300d34bc7e31bbafe36e7185c76
  • Do the same thing above using oscar.py:
UNIX> python
>>> from oscar import Author
>>> Author('"Albert Krawczyk" <pro-logic@optusnet.com.au>').commit_shas
('17abdbdc90195016442a6a8dd8e38dea825292ae', '9cdc918bfba1010de15d0c968af8ee37c9c300ff', 'd9fc680a69198300d34bc7e31bbafe36e7185c76')
  • Get the URL of a project's repository using the oscar.py Project(...).toURL() function:
UNIX> python
>>> from oscar import Project
>>> Project('notcake_gcad').toURL()
'https://github.com/notcake/gcad'

Examples of implementing applications -- Simple vs. Complex

Finding 1st-time imports for AI modules (Simple)

Given the data available, this is a fairly simple task. An application that detects the first time a repo adopted an AI module gives you a better idea of when the module was first used, and also when it started to gain popularity.

A good example of this lies in popmods.py. In this application, we read all 32 c2bPtaPkgO$LANG.{0-31}.gz files for a given language and look for the earliest import times of a given module. The program then creates a .first file, with each line formatted as repo_name;UNIX_timestamp.

Usage: UNIX> python popmods.py language_file_extension module_name

Before anything else (and this can be applied to many other programs), you want to know what your input looks like ahead of time and know how you are going to parse it. Since each line of the file has this format:
commit;repo_name;timestamp;author;blob;module1;module2;...
We can use the string.split() method to turn this string into a list of words, split by a semicolon (;).
By turning this line into a list, and giving it a variable name, entry = ['commit', 'repo_name', 'timestamp', ...], we can then grab the pieces of information we need with repo, time = entry[1], entry[2].
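
For example, parsing one made-up line in that format (the values below are illustrative, not real WoC data):

```python
# A made-up line in the c2bPtaPkgO format described above:
# commit;repo_name;timestamp;author;blob;module1;module2;...
line = ("2881cf0080f947beadbb7c240707de1b40af2747;user2589_minicms;1553266304;"
        "alice <alice@example.com>;0dd695391117e784d968c111f010cff802c0e6d1;"
        "np;tensorflow")

entry = line.split(";")           # split the record on semicolons
repo, time = entry[1], entry[2]   # grab the fields we need
modules = entry[5:]               # everything after the blob is a module
```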

An important idea to keep in mind is that we only want to count unique timestamps once. This is because we want to account for repositories that forked off of another repository with the exact same import timestamps. An easy way to do this is to keep a running list of the times we have come across, and if we have already seen a timestamp before, simply skip that line of the file:

...
if time in times:
    continue
else:
    times.append(time)
...

We also want to find the earliest timestamp for a repository importing a given module. Again, this is fairly simple:

...
if repo not in dict.keys() or time < dict[repo]:
    for word in entry[5:]:
        if module in word:
            dict[repo] = time
            break
...
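
Putting the two fragments together, a minimal version of the scan loop might look like the following (variable names are hypothetical; popmods.py itself differs in detail, and note that timestamps are compared as strings here, as in the fragments above):

```python
def earliest_imports(lines, module):
    """Map repo -> earliest timestamp at which `module` appears."""
    earliest = {}
    seen_times = set()
    for line in lines:
        entry = line.rstrip("\n").split(";")
        repo, time = entry[1], entry[2]
        if time in seen_times:          # skip forks sharing a timestamp
            continue
        seen_times.add(time)
        if repo not in earliest or time < earliest[repo]:
            # modules start at field 5: commit;repo;time;author;blob;mod...
            if any(module in word for word in entry[5:]):
                earliest[repo] = time
    return earliest
```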

Implementing the application

Now that we have the .first files put together, we can take this one step further and graph a module's first-time usage over time on a line graph, or even compare multiple modules to see how they stack up against each other. modtrends.py accomplishes this by:

  • reading 1 or more .first files
  • converting each timestamp for each repository into a datetime date
  • "rounding" those dates by year and month
  • putting those dates in a dictionary with dict["year-month"] += 1
  • graphing the dates and frequencies using matplotlib.
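
The date "rounding" step can be sketched as follows (a hedged sketch with made-up timestamps; modtrends.py differs in detail and uses local time rather than UTC):

```python
from collections import defaultdict
from datetime import datetime, timezone

counts = defaultdict(int)
# hypothetical timestamps, as read from a .first file
for ts in (1497547797, 1530460653, 1553266304):
    # "round" the timestamp down to a year-month bucket
    key = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
    counts[key] += 1
# counts now maps "year-month" strings to frequencies, ready for matplotlib
```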

If you want to compare first-time usage over time for Tensorflow and Keras for the .ipynb language .first files you created, run: UNIX> python3.6 modtrends.py tensorflow.first keras.first
The final graph looks something like this:
Tensorflow vs Keras


Detecting percentage language use and changes over time (Complex)

An application to calculate this would be useful for seeing how different authors changed languages over a range of years, based on the commits they have made to different files.
In order to accomplish this task, we will modify an existing program from the swsc/lookup repo (a2fBinSorted.perl) and create a new program (a2L.py) that will get language counts per year per author.

Part 1 -- Modifying a2fBinSorted.perl

For the first part, we look at what a2fBinSorted.perl currently does: it takes one of the 32 a2cFullP{0-31}.s files through STDIN, opens the 32 c2fFullO.{0-31}.tch files for reading, and writes a corresponding a2fFullP.{0-31}.tch file based on the a2c file number, with each line of the form author_id;file1;file2;file3...

Example usage: UNIX> zcat /da0_data/basemaps/gz/a2cFullP0.s | ./a2fBinSorted.perl 0

We can modify this program so that it also writes the earliest commit dates made by that author for those files, which will become useful for a2L.py later on. To accomplish this, we have the program additionally read from the c2taFullP.{0-31}.tch files so we can get the time of each commit made by a given author:

my %c2ta;
for my $s (0..($sections-1)){
    tie %{$c2ta{$s}}, "TokyoCabinet::HDB", "/fast/c2taFullP.$s.tch", TokyoCabinet::HDB::OREADER |
    TokyoCabinet::HDB::ONOLCK,
       16777213, -1, -1, TokyoCabinet::TDB::TLARGE, 100000
    or die "cant open fast/c2taFullP.$s.tch\n";
}

We will also ensure the files to be written have the relationship a2ft as opposed to a2f:

my %a2ft;
tie %a2ft, "TokyoCabinet::HDB", "/data/play/dkennard/a2ftFullP.$part.tch", TokyoCabinet::HDB::OWRITER | 
     TokyoCabinet::HDB::OCREAT,
    16777213, -1, -1, TokyoCabinet::TDB::TLARGE, 100000
    or die "cant open /data/play/dkennard/a2ftFullP.$part.tch\n";

Another important part of the file we want to change is inside the output function:

sub output {
    my $a = $_[0];
    my %fs;
    for my $c (@cs){
        my $sec =  segB ($c, $sections);
        if (defined $c2f{$sec}{$c} and defined $c2ta{$sec}{$c}){
            my @fs = split(/\;/, safeDecomp ($c2f{$sec}{$c}, $a), -1);
            my ($time, $au) = split(/\;/, $c2ta{$sec}{$c}, -1);  #add this for grabbing the time
            for my $f (@fs){
                if (defined $time and (!defined $fs{$f} or $time < $fs{$f})){ #modify condition to grab earliest time
                    $fs{$f} = $time;
                }
            }
        }
    }
    $a2ft{$a} = safeComp (join ';', %fs); #changed
}

Now when we run the new program, it should write individual a2ftFullP.{0-31}.tch files with the format:
author_id;file1;file1_timestamp;file2;file2_timestamp;...

We can then create a new PATHS dictionary entry in oscar.py, as well as another function under the Author class to read our newly created .tch files:

In PATHS dictionary:
...
'author_file_times': ('/data/play/dkennard/a2ftFullP.{key}.tch', 5)
...

In class Author(_Base):
...
@cached_property
def file_times(self):
    data = decomp(self.read_tch('author_file_times'))
    return tuple(file for file in (data and data.split(";")))
...

Part 2 -- Creating a2L.py

Our next task involves creating a2LFullP{0-31}.s files utilizing the new .tch files we have just created. We want each line of these files to contain the author name, each year, and the language counts for each year. A possible format could look something like this:
"tim.bentley@gmail.com" <>;year2015;2;py;31;js;30;year2016;1;py;29;year2017;8;c;2;doc;1;py;386;html;6;sh;1;js;3;other;3;build;1
where the number after each year represents the number of languages used that year, followed by pairs of languages and the number of files written in that language that year. As an example, in 2015, Tim Bentley made initial commits to files in 2 languages: 31 in Python and 30 in JavaScript.
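
A parser for this line format can be sketched as follows (hypothetical function; a2L.py itself may organize the data differently):

```python
# Hypothetical parser for the a2L line format described above.
def parse_a2l(line):
    fields = line.rstrip("\n").split(";")
    author, rest = fields[0], fields[1:]
    years = {}
    i = 0
    while i < len(rest):
        year = rest[i]                  # e.g. "year2015"
        n_langs = int(rest[i + 1])      # number of language/count pairs
        langs = {}
        for j in range(n_langs):
            lang = rest[i + 2 + 2 * j]
            count = int(rest[i + 3 + 2 * j])
            langs[lang] = count
        years[year] = langs
        i += 2 + 2 * n_langs            # jump past this year's block
    return author, years
```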

There are a number of things that have to happen to get to this point, so let's break it down:

  • Iterating Author().file_times and grouping timestamps into year

We start by reading in an a2cFullP{0-31}.s file to get a list of authors, which we hold as a tuple in memory, and start building our dictionary:

a2L[author] = {}
file_times = Author(author).file_times
for j in range(0,len(file_times),2):
    try:
        year = str(datetime.fromtimestamp(float(file_times[j+1]))).split(" ")[0].split("-")[0]
    #have to skip years either in the 20th century or somewhere far in the future
    except ValueError:
        continue
    #in case the last file listed doesn't have a time
    except IndexError:
        break
    year = specifier + year #specifier is the string 'year'
    if year not in a2L[author]:
        a2L[author][year] = []
    a2L[author][year].append(file_times[j])

The datetime.fromtimestamp() function returns a datetime whose string form is year-month-day hour:min:sec, which we split by a space to get the year-month-day half of the string, and then split again to get the year.
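
A slightly more robust equivalent avoids the double split by using the datetime attribute directly (a sketch; note it uses UTC, whereas the snippet above uses local time):

```python
from datetime import datetime, timezone

def year_of(ts):
    # equivalent to str(datetime.fromtimestamp(...)).split(" ")[0].split("-")[0]
    return str(datetime.fromtimestamp(float(ts), tz=timezone.utc).year)
```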

  • Detecting the language of a file based on file extension
for year, files in a2L[author].items():
    build_list = []
    for file in files:
        la = "other"
        if re.search("\.(js|iced|liticed|iced.md|coffee|litcoffee|coffee.md|ts|cs|ls|es6|es|jsx|sjs|co|eg|json|json.ls|json5)$",file):
            la = "js"
        elif re.search("\.(py|py3|pyx|pyo|pyw|pyc|whl|ipynb)$",file):
            la = "py"
        elif re.search("(\.[Cch]|\.cpp|\.hh|\.cc|\.hpp|\.cxx)$",file):
            la = "c"
    .......

The simplest way to check for a language based on a file extension is the re module for regular expressions. If a given file matches a certain expression, like \.py$, then that file was written in Python. la = "other" if no matches were found in any of the searches. We then keep track of these languages, putting each language in a list with build_list.append(la), and count how many times each language occurred when we looped through the files with build_list.count(lang). The final format for an author in the a2L dictionary is a2L[author][year][lang] = lang_count.
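
A self-contained version of that dispatch might look like the following (patterns abridged from the snippet above; the full program checks many more extensions):

```python
import re

# (language, extension pattern) pairs, abridged from the snippet above
EXT_PATTERNS = [
    ("js", r"\.(js|ts|jsx|coffee|json)$"),
    ("py", r"\.(py|py3|pyx|pyo|pyw|pyc|whl|ipynb)$"),
    ("c", r"(\.[Cch]|\.cpp|\.hh|\.cc|\.hpp|\.cxx)$"),
]

def detect_lang(path):
    # return the first language whose pattern matches, or "other"
    for lang, pattern in EXT_PATTERNS:
        if re.search(pattern, path):
            return lang
    return "other"
```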

  • Writing each author's information into the file

See a2L.py for how information is written into each file.

Usage: UNIX> python a2L.py 2 for writing a2LFullP2.s

Implementing the application

Now that we have our a2L files, we can run some interesting statistics on how significantly language usage changes over time for different authors. The program langtrend.py runs the chi-squared contingency test (via the stats.chi2_contingency() function from the scipy module) for authors from an a2LFullP{0-31}.s file on STDIN and calculates a p-value for each pair of years, for each language, for each author.
This p-value is the chance that you would find another person (say, out of 1000 people) with an equally extreme change in language use, whether an increase or a decrease. For example, if a given author edited 300 different Python files in 2006 but then edited 500 different Java files in 2007, the chance that you would see such an extreme change in another author is very low. In fact, if this p-value is less than 0.001, the change in language use between a pair of years is considered "significant".

In order for this p-value to be a more accurate approximation, we need a larger sample size of language counts. When reading the a2LFullP{0-31}.s files, you may want to rule out people who don't meet certain criteria:

  • the author has at least 5 consecutive years of commits for files
  • the author has edited at least 100 different files for all of their years of commits

If an author does not meet these criteria, we do not consider them for the chi-squared test, simply because their results would be "uninteresting" and not worth investigating further.

Here's one of the authors from the program's output:

----------------------------------
Ben Niemann <pink@odahoda.de>
{ '2015': {'doc': 3, 'markup': 2, 'obj': 1, 'other': 67, 'py': 127, 'sh': 1},
    '2016': {'doc': 1, 'other': 23, 'py': 163},
    '2017': {'build': 36, 'c': 116, 'lsp': 1, 'other': 81, 'py': 160},
    '2018': { 'build': 12,
        'c': 134,
        'lsp': 2,
        'markup': 2,
        'other': 133,
        'py': 182},
    '2019': { 'build': 13,
        'c': 30,
        'doc': 8,
        'html': 10,
        'js': 1,
        'lsp': 2,
        'markup': 16,
        'other': 67,
        'py': 134}}
    pfactors for obj language
        2015--2016 pfactor == 0.9711606775110577  no change
    pfactors for doc language
        2015--2016 pfactor == 0.6669499228133753  no change
        2016--2017 pfactor == 0.7027338745275937  no change
        2018--2019 pfactor == 0.0009971248193242038  rise/drop
    pfactors for markup language
        2015--2016 pfactor == 0.5104066960256399  no change
        2017--2018 pfactor == 0.5532258789014389  no change
        2018--2019 pfactor == 1.756929555308731e-05  rise/drop
    pfactors for py language
        2015--2016 pfactor == 1.0629725495084215e-07  rise/drop
        2016--2017 pfactor == 1.2847558344252341e-25  rise/drop
        2017--2018 pfactor == 0.7125543569718793  no change
        2018--2019 pfactor == 0.026914075872778477  no change
    pfactors for sh language
        2015--2016 pfactor == 0.9711606775110577  no change
    pfactors for other language
        2015--2016 pfactor == 1.7143130378377696e-06  rise/drop
        2016--2017 pfactor == 0.020874234589765908  no change
        2017--2018 pfactor == 0.008365948846657284  no change
        2018--2019 pfactor == 0.1813919210757513  no change
    pfactors for c language
        2016--2017 pfactor == 2.770649054044977e-16  rise/drop
        2017--2018 pfactor == 0.9002187643203734  no change
        2018--2019 pfactor == 1.1559110387953382e-08  rise/drop
    pfactors for lsp language
        2016--2017 pfactor == 0.7027338745275937  no change
        2017--2018 pfactor == 0.8855759560371912  no change
        2018--2019 pfactor == 0.9944669523033288  no change
    pfactors for build language
        2016--2017 pfactor == 4.431916568235125e-05  rise/drop
        2017--2018 pfactor == 5.8273175348446296e-05  rise/drop
        2018--2019 pfactor == 0.1955154860787908  no change
    pfactors for html language
        2018--2019 pfactor == 0.0001652525618661536  rise/drop
    pfactors for js language
        2018--2019 pfactor == 0.7989681687355706  no change
----------------------------------

Although it is currently not implemented, one could take this one step further and visually represent an author's language changes on a graph, which would be simpler to interpret than a long list of pfactors such as the one shown above.


Useful Python imports for applications

subprocess

Similar to the C function system(), this module allows you to run UNIX processes, and also to capture their input, output, and errors, all from within a Python script. Documentation can be found here: https://docs.python.org/3/library/subprocess.html

re

Good for evaluating regular expressions in files and extracting lines that have certain words or patterns. Documentation can be found here: https://docs.python.org/3/library/re.html

matplotlib

Popular graphing library for Python. Documentation can be found here: https://matplotlib.org/index.html

World of Code DataFormat

Git objects

Sequential access:

Read an index line from /data/All.blobs/{commit,blob,tag,tree}_{key}.idx to get the offset and length. Note that the key, a 1..127 uint, is the 7 least significant bits of the first byte of the sha, not the 7 msb. So, to get the key, use sha[0] & 0x7f, NOT a bit shift.
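
In Python, the key computation looks like this (using the first sha from the .idx example below):

```python
# Derive the key for an object from the first byte of its sha.
sha = "80b4ca99f8605903d8ac6bd921ebedfdfecdd660"
first_byte = bytes.fromhex(sha)[0]   # 0x80
key = first_byte & 0x7F              # 7 least significant bits -> 0
```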

Content of .idx files for commits, trees and tags:

id, offset, compressed_length, object sha

e.g.:

0;0;267;80b4ca99f8605903d8ac6bd921ebedfdfecdd660
1;267;185;0017b852ce7b49225c5a797b3d4221d363c0acdd
2;452;167;0054bab7302b386ddf2350a3fb2db08d59e125e1
3;619;235;8028315640bac6eae17297270d4ee1892abf6add
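
Each line splits into the fields named above, for example (offsets are from the sample lines, not real data):

```python
# Parsing the second line of the example above.
line = "1;267;185;0017b852ce7b49225c5a797b3d4221d363c0acdd"
rec_id, offset, length, sha = line.split(";")
offset, length = int(offset), int(length)
# next step: read `length` bytes at `offset` from the matching .bin file
```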

Sometimes blob records have this format, too. See aef80de037d7a39ccd5f7f70d9e7dffc0a67aec8 for an example.

Blobs:

id, offset, compressed_length, full_length, blob sha, ??sha, ??int

e.g.

0;0;461;647;00b31262da21c4f57d5b207372b6ded0bb332911;c88e5561832d1fe25a5e19cf15dc7de2fd81aae5;365420358
1;461;2836;7145;00ad7956ac3c0227c0abf2e59b3270c54837bf46;c83d8bfb7c8aef24c8c2efd0abf4d90c7e0cc421;366044154
2;3297;1170;2524;00b9870d283215cbe9eeca5433f211b702a749a1;00549bb056793128f1f35b1ada0a375466a69905;366711281

You can also obtain the index line number for an arbitrary object from /fast1/All.sha1/sha1.{commit,tree,tag}_{key}.tch or /fast/All.sha1/sha1.blob_{key}.tch, but it doesn't look very useful for either sequential or random access.

Accessing an arbitrary git object:

Full content is available for commits, trees and tags in /fast1/All.sha1c/{tree,commit}_{0..127}.tch

For blobs, /data/All.sha1o/sha1.blob_{1..127}:

  1. Get the index line number from /fast1?/All.sha1/sha1.{commit,blob,tag,tree}_{key}.tch
  2. Read the compressed data at the given offset/length from /data/All.blobs/{commit,blob,tag,tree}_{key}.bin
  3. Uncompress with LZF. Note that unlike the vanilla LZF used by python-lzf, Perl's Compress::LZF also prepends a variable-length header with the uncompressed chunk size.

Mappings

Please see https://bitbucket.org/swsc/lookup/src/master/README.md for the full list and some extra descriptions.

Mappings are stored in /data/basemaps/:

Auth2Cmt.tch  # works
    Author to commits
    db[email] = bytestring of bin shas
    keys are non-normalized authors e.g. 'gsadaram <gsadaram@cisco.com>'
    9.35M keys
b2cFullF.{0..15}woc.tch
    Blob to commits where this blob was added or removed
    db[blob_sha] = commit_bin_shas
c2bFullF.{0..15}woc.tch  # bug #1
    Commit to blobs
    db[commit_sha] = blob_bin_shas
    Looks to be incomplete, see docs/Bugreport1 for details
Prj2CmtG.{0..7}woc.tch  # works
    Project to Commits
    Project is a user_repo string, e.g. user2589_minicms
Cmt2PrjG.{0..7}woc.tch  # works
    Commit to projects
    data is lzf compressed, semicolon separated list of projects
Cmt2Chld.tch  # works
    Commit to children
    db[commit_sha] = children_commit_bin_shas
f2cFullF.{0..7}.{tch,lst}  # bug #4
    File to commits where this file has been changed
    File is a full path, usually terminated with a newline
        e.g.: 'public_html/images/cms/flags/my.gif\n'
    There are 181M filenames as of Apr 2018 just in 0woc.tch
t2pt0-127.{0..7}.{tch,lst}  # works
    Tree to parent trees
    db[tree_bin_sha] = parent_tree_bin_shas
b2pt.00-15.{0..7}woc.tch  # bug #2 - deprecated
    Blob to parent trees
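To look a key up in a sharded map like b2cFullF.{0..15}woc.tch, you first need to know which shard it lives in. The sketch below mirrors the shard-selection logic in oscar.py (the helper names `fnvhash` and `shard` are ours): binary sha keys use the first byte masked to the shard-bit width, while string keys (authors, projects) are hashed with 32-bit FNV-1a first.

```python
def fnvhash(data: bytes) -> int:
    """32-bit FNV-1a hash, used for string keys (authors, projects)."""
    hval = 0x811C9DC5
    for byte in data:
        hval ^= byte
        hval = (hval * 0x01000193) & 0xFFFFFFFF
    return hval

def shard(key: bytes, nbits: int, is_sha: bool) -> int:
    """Pick the shard (0 .. 2**nbits - 1) a key falls into.

    nbits=4 for 16-shard maps like b2cFullF.{0..15}woc.tch,
    nbits=3 for 8-shard maps like Prj2CmtG.{0..7}woc.tch.
    """
    mask = 2**nbits - 1
    if is_sha:  # binary sha keys: first byte, masked
        return key[0] & mask
    return fnvhash(key) & mask  # string keys: FNV-1a, masked
```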

Python files:

pyfiles.{0..7}.gz  # each line is a path to .py file (not just a filename)
pyfilesC.{0..7}.gz  # hashes (trees or commits?)
    check 721141f28f0a15354e283eae26be43c2b81e6e52
pyfilesCU.{0..7}.gz  # hashes (trees or commits?)
    check 0000000fcd1c59eac9dd76e7d75229065733de3b
pyfilesP.{0..7}.gz  # projects (user_repo)
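These .gz files are plain line-oriented text. Although shell pipelines (zcat and friends) are the recommended way to process them at scale, Python's stdlib gzip module can stream one for quick inspection without extracting it; a minimal sketch (the helper name and path are ours):

```python
import gzip

def iter_gz_lines(path: str):
    """Stream lines from a gzipped list file, e.g. pyfiles.0.gz."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield line.rstrip("\n")

# for path in iter_gz_lines("/data/basemaps/pyfiles.0.gz"):
#     ...
```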

There is also a bunch of mappings in /fast1/All.sha1c/*woc.tch, which look to be outdated:

b2cFullE.{0..15}woc.tch  # blob to commits bin sha
c2fFull.{0..7}woc.tch  # commit to filename?
Cmt2Prj.{0..7}woc.tch
commit_parent_{0..127}woc.tch
Auth2Cmt.tch
author_class.tch
author_commit.tch
class2commit.tch
commit_atime.tch
commit_child.tch
commit_class.tch
f2b.tch
NAMESPACE.tch
package.json.{1..127}woc.tch
setup.py.{1..127}woc.tch

Update queue

Logs of processed projects are available in da0_data/gitnub/list* and da0_data/gitnub/new*. By checking dates on the log files, you can find when a project was updated.

```python
#!/usr/bin/env python3

# SPDX-License-Identifier: GPL-3.0-or-later
# @authors: Runzhi He <rzhe@pku.edu.cn>
# @date: 2024-01-17

"""
# To Use
.. include:: ../README.md
   :start-line: 4
   :end-before: ## Contributing
# Guide (Local)
.. include:: ../docs/guide.md
# Guide (Remote)
.. include:: ../docs/guide_remote.md
# To Contribute
.. include:: ../docs/contributing.md
# World of Code Tutorial
.. include:: ../docs/tutorial.md
# World of Code DataFormat
.. include:: ../docs/DataFormat.md

"""  # noqa: D205

__all__ = ["local", "tch", "detect", "objects", "remote"]

import importlib.metadata

__version__ = importlib.metadata.version("python-woc")
```