Common SQLite examples¶
This section lists recepies to frequently asked questions on “how-to’s” with NVIDIA Nsight Systems data exported in SQLite format.
General notes¶
The data exported into SQLite database don’t contain any indexes by default and are as close to the contents of the report as possible. Therefore, for better readability and speed of execution, queries below will alter tables.
Common helper commands¶
When utilizing sqlite3 commandline tool, it’s helpful to have data printed as named columns, this can be done with:
.mode column
.headers on
Default column width is determined by the data in the first row of results. If this doesn’t work out well, you can specify widths manually:
.width 10 20 50
Obtaining sample report¶
CLI interface of Nsight Systems was used to profile radixSortThrust
CUDA sample, then the resulting .qdrep file was exported using the
nsys-exporter tool.
nsys profile --trace=cuda,osrt radixSortThrust
nsys-exporter --export-sqlite report1.qdrep
Serialized process and thread identifiers¶
NVIDIA Nsight Systems stores identifiers where events originated in serialized
form. For events that have globalTid or globalPid fields exported,
use the following code to extract numeric TID and PID:
SELECT globalTid / 0x1000000 % 0x1000000 AS PID, globalTid % 0x1000000 AS TID FROM TABLE_NAME;
NOTE: globalTid field includes both TID and PID values, while globalPid
only the PID value.
Correlate CUDA kernel launches with CUDA API kernel launches¶
ALTER TABLE CUPTI_ACTIVITY_KIND_RUNTIME ADD COLUMN name TEXT;
ALTER TABLE CUPTI_ACTIVITY_KIND_RUNTIME ADD COLUMN kernelName TEXT;
UPDATE CUPTI_ACTIVITY_KIND_RUNTIME SET kernelName =
(SELECT value FROM StringIds
JOIN CUPTI_ACTIVITY_KIND_KERNEL AS cuda_gpu
ON cuda_gpu.shortName = StringIds.id
AND CUPTI_ACTIVITY_KIND_RUNTIME.correlationId = cuda_gpu.correlationId);
UPDATE CUPTI_ACTIVITY_KIND_RUNTIME SET name =
(SELECT value FROM StringIds WHERE nameId = StringIds.id);
-- Select 10 longest CUDA API ranges that resulted in kernel execution.
SELECT name, kernelName, start, end FROM CUPTI_ACTIVITY_KIND_RUNTIME
WHERE kernelName IS NOT NULL ORDER BY end - start LIMIT 10;
Results:
name kernelName start end
---------------------- ----------------------- ---------- ----------
cudaLaunchKernel_v7000 RadixSortScanBinsKernel 658863435 658868490
cudaLaunchKernel_v7000 RadixSortScanBinsKernel 609755015 609760075
cudaLaunchKernel_v7000 RadixSortScanBinsKernel 632683286 632688349
cudaLaunchKernel_v7000 RadixSortScanBinsKernel 606495356 606500439
cudaLaunchKernel_v7000 RadixSortScanBinsKernel 603114486 603119586
cudaLaunchKernel_v7000 RadixSortScanBinsKernel 802729785 802734906
cudaLaunchKernel_v7000 RadixSortScanBinsKernel 593381170 593386294
cudaLaunchKernel_v7000 RadixSortScanBinsKernel 658759955 658765090
cudaLaunchKernel_v7000 RadixSortScanBinsKernel 681549917 681555059
cudaLaunchKernel_v7000 RadixSortScanBinsKernel 717812527 717817671
Remove ranges overlapping with overhead¶
-- Use the this query to count CUDA API ranges overlapping with the overhead ones.
-- Replace "SELECT COUNT(*)" with "DELETE" to remove such ranges.
SELECT COUNT(*) FROM CUPTI_ACTIVITY_KIND_RUNTIME WHERE id IN
(
SELECT cuda.id
FROM PROFILER_OVERHEAD as overhead
INNER JOIN CUPTI_ACTIVITY_KIND_RUNTIME as cuda ON
(cuda.start BETWEEN overhead.start and overhead.end)
OR (cuda.end BETWEEN overhead.start and overhead.end)
OR (cuda.start < overhead.start AND cuda.end > overhead.end)
);
Results:
COUNT(*)
----------
1095
Backtraces for OSRT ranges¶
-- Adding text columns makes results of the query below more human-readable.
-- These steps are completely optional.
ALTER TABLE OSRT_API ADD COLUMN name TEXT;
UPDATE OSRT_API SET name = (SELECT value FROM StringIds WHERE OSRT_API.nameId = StringIds.id);
ALTER TABLE OSRT_CALLCHAINS ADD COLUMN symbolName TEXT;
UPDATE OSRT_CALLCHAINS SET symbolName = (SELECT value FROM StringIds WHERE symbol = StringIds.id);
ALTER TABLE OSRT_CALLCHAINS ADD COLUMN moduleName TEXT;
UPDATE OSRT_CALLCHAINS SET moduleName = (SELECT value FROM StringIds WHERE module = StringIds.id);
-- Print backtrace of the longest OSRT range
SELECT globalTid / 0x1000000 % 0x1000000 AS PID, globalTid % 0x1000000 AS TID,
OSRT_API.id, start, end, name, callchainId, stackDepth, symbolName, moduleName
FROM OSRT_API LEFT JOIN OSRT_CALLCHAINS ON callchainId == OSRT_CALLCHAINS.id
WHERE OSRT_API.id IN (SELECT id FROM OSRT_API ORDER BY end - start DESC LIMIT 1)
ORDER BY stackDepth LIMIT 10;
Results:
PID TID id start end name callchainId stackDepth symbolName moduleName
---------- ---------- ---------- ---------- ---------- ---------------------- ----------- ---------- ------------------------------ ----------------------------------------
19163 19176 676 360897690 860966851 pthread_cond_timedwait 88 0 pthread_cond_timedwait@GLIBC_2 /lib/x86_64-linux-gnu/libpthread-2.27.so
19163 19176 676 360897690 860966851 pthread_cond_timedwait 88 1 0x7fbc983b7227 /usr/lib/x86_64-linux-gnu/libcuda.so.418
19163 19176 676 360897690 860966851 pthread_cond_timedwait 88 2 0x7fbc9835d5c7 /usr/lib/x86_64-linux-gnu/libcuda.so.418
19163 19176 676 360897690 860966851 pthread_cond_timedwait 88 3 0x7fbc983b64a8 /usr/lib/x86_64-linux-gnu/libcuda.so.418
19163 19176 676 360897690 860966851 pthread_cond_timedwait 88 4 start_thread /lib/x86_64-linux-gnu/libpthread-2.27.so
19163 19176 676 360897690 860966851 pthread_cond_timedwait 88 5 __clone /lib/x86_64-linux-gnu/libc-2.27.so
Profiled processes output streams¶
ALTER TABLE ProcessStreams ADD COLUMN filename TEXT;
UPDATE ProcessStreams SET filename = (SELECT value FROM StringIds WHERE ProcessStreams.filenameId = StringIds.id);
ALTER TABLE ProcessStreams ADD COLUMN content TEXT;
UPDATE ProcessStreams SET content = (SELECT value FROM StringIds WHERE ProcessStreams.contentId = StringIds.id);
-- Select all collected stdout and stderr streams.
select globalPid / 0x1000000 % 0x1000000 AS PID, filename, content from ProcessStreams;
Results:
PID filename content
---------- ------------------------------------------------------- --------------------------------------------------------------------------------------------------------------------
19163 /tmp/nvidia/nsight_systems/streams/pid_19163_stdout.log /home/user_name/NVIDIA_CUDA-10.1_Samples/6_Advanced/radixSortThrust/radixSortThrust Starting...
GPU Device 0: "Quadro P2000" with compute capability 6.1
Sorting 1048576 32-bit unsigned int keys and values
radixSortThrust, Throughput = 401.0872 MElements/s, Time = 0.00261 s, Size = 1048576 elements
Test passed
19163 /tmp/nvidia/nsight_systems/streams/pid_19163_stderr.log
Thread summary¶
Please note, that Nsight Systems applies additional logic during sampling events processing to work around lost events. This means that the results of the below query might differ slightly from the ones shown in “Analysis summary” tab.
-- Thread summary calculated using CPU cycles (when available)
SELECT
globalTid / 0x1000000 % 0x1000000 AS PID,
globalTid % 0x1000000 AS TID,
ROUND(100.0 * SUM(cpuCycles) /
(
SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS
GROUP BY globalTid / 0x1000000000000 % 0x100
),
2
) as CPU_utilization,
(SELECT value FROM StringIds WHERE id =
(
SELECT nameId FROM ThreadNames
WHERE ThreadNames.globalTid = COMPOSITE_EVENTS.globalTid
)
) as thread_name
FROM COMPOSITE_EVENTS
GROUP BY globalTid
ORDER BY CPU_utilization DESC
LIMIT 10;
Results:
PID TID CPU_utilization thread_name
---------- ---------- --------------- ---------------
19163 19163 98.4 radixSortThrust
19163 19168 1.35 CUPTI worker th
19163 19166 0.25 [NS]
Thread running time may be calculated using scheduling data, when PMU counter data was not collected.
CREATE INDEX sched_start ON SCHED_EVENTS (start);
-- Intermediate table
CREATE TABLE CPU_USAGE AS
SELECT
first.globalTid as globalTid,
(SELECT nameId FROM ThreadNames WHERE ThreadNames.globalTid = first.globalTid) as nameId,
sum(second.start - first.start) as total_duration,
count() as ranges_count
FROM SCHED_EVENTS as first
LEFT JOIN SCHED_EVENTS as second
ON second.id =
(
SELECT id
FROM SCHED_EVENTS
WHERE start > first.start AND globalTid = first.globalTid
ORDER BY start ASC
LIMIT 1
)
WHERE first.isSchedIn != 0
GROUP BY first.globalTid
ORDER BY total_duration DESC;
-- Thread summary calculated using scheduled time
SELECT
globalTid / 0x1000000 % 0x1000000 AS PID,
globalTid % 0x1000000 AS TID,
(SELECT value FROM StringIds where nameId == id) as thread_name,
ROUND(100.0 * total_duration / (SELECT SUM(total_duration) FROM CPU_USAGE), 2) as CPU_utilization
FROM CPU_USAGE
ORDER BY CPU_utilization DESC;
Results:
PID TID thread_name CPU_utilization
---------- ---------- --------------- ---------------
19163 19163 radixSortThrust 93.74
19163 19169 radixSortThrust 3.22
19163 19168 CUPTI worker th 2.46
19163 19166 [NS] 0.44
19163 19172 radixSortThrust 0.07
19163 19167 [NS Comms] 0.05
19163 19176 radixSortThrust 0.02
19163 19170 radixSortThrust 0.0
Function table¶
These examples demonstrate how to calculate Flat and BottomUp (for top level only) views statistics.
ALTER TABLE SAMPLING_CALLCHAINS ADD COLUMN symbolName TEXT;
UPDATE SAMPLING_CALLCHAINS SET symbolName = (SELECT value FROM StringIds WHERE symbol = StringIds.id);
ALTER TABLE SAMPLING_CALLCHAINS ADD COLUMN moduleName TEXT;
UPDATE SAMPLING_CALLCHAINS SET moduleName = (SELECT value FROM StringIds WHERE module = StringIds.id);
-- Flat view
SELECT symbolName, moduleName, ROUND(100.0 * sum(cpuCycles) /
(SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS), 2) AS flatTimePercentage
FROM SAMPLING_CALLCHAINS
LEFT JOIN COMPOSITE_EVENTS ON SAMPLING_CALLCHAINS.id == COMPOSITE_EVENTS.id
GROUP BY symbol, module
ORDER BY flatTimePercentage DESC
LIMIT 5;
-- BottomUp view (top level only)
SELECT symbolName, moduleName, ROUND(100.0 * sum(cpuCycles) /
(SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS), 2) AS selfTimePercentage
FROM SAMPLING_CALLCHAINS
LEFT JOIN COMPOSITE_EVENTS ON SAMPLING_CALLCHAINS.id == COMPOSITE_EVENTS.id
WHERE stackDepth == 0
GROUP BY symbol, module
ORDER BY selfTimePercentage DESC
LIMIT 5;
Results:
symbolName moduleName flatTimePercentage
----------- ----------- ------------------
[Max depth] [Max depth] 99.92
thrust::zip /home/user_ 24.17
thrust::zip /home/user_ 24.17
thrust::det /home/user_ 24.17
thrust::det /home/user_ 24.17
symbolName moduleName selfTimePercentage
-------------- ------------------------------------------- ------------------
0x7fbc984982b6 /usr/lib/x86_64-linux-gnu/libcuda.so.418.39 5.29
0x7fbc982d0010 /usr/lib/x86_64-linux-gnu/libcuda.so.418.39 2.81
thrust::iterat /home/user_name/NVIDIA_CUDA-10.1_Samples/6_ 2.23
thrust::iterat /home/user_name/NVIDIA_CUDA-10.1_Samples/6_ 1.55
void thrust::i /home/user_name/NVIDIA_CUDA-10.1_Samples/6_ 1.55
DX12 API frame duration histogram¶
The example demonstrates how to calculate DX12 CPU frames durartion and construct a histogram out of it.
CREATE INDEX DX12_API_ENDTS ON DX12_API (end);
CREATE TEMP VIEW DX12_API_FPS AS SELECT end AS start,
(SELECT end FROM DX12_API
WHERE end > outer.end AND nameId == (SELECT id FROM StringIds
WHERE value == "IDXGISwapChain::Present")
ORDER BY end ASC LIMIT 1) AS end
FROM DX12_API AS outer
WHERE nameId == (SELECT id FROM StringIds WHERE value == "IDXGISwapChain::Present")
ORDER BY end;
-- Number of frames with a duration of [X, X + 1) milliseconds
SELECT
CAST((end - start) / 1000000.0 AS INT) AS duration_ms,
count(*)
FROM DX12_API_FPS
WHERE end IS NOT NULL
GROUP BY duration_ms
ORDER BY duration_ms;
Results:
duration_ms count(*)
----------- ----------
3 1
4 2
5 7
6 153
7 19
8 116
9 16
10 8
11 2
12 2
13 1
14 4
16 3
17 2
18 1
GPU context switch events enumeration¶
GPU context duration is between first BEGIN and a matching END event.
SELECT (CASE tag WHEN 8 THEN "BEGIN" WHEN 7 THEN "END" END) AS tag,
globalPid / 0x1000000 % 0x1000000 AS PID,
vmId, seqNo, contextId, timestamp, gpuId FROM FECS_EVENTS
WHERE tag in (7, 8) ORDER BY seqNo LIMIT 10;
Results:
tag PID vmId seqNo contextId timestamp gpuId
---------- ---------- ---------- ---------- ---------- ---------- ----------
BEGIN 23371 0 0 1048578 56759171 0
BEGIN 23371 0 1 1048578 56927765 0
BEGIN 23371 0 3 1048578 63799379 0
END 23371 0 4 1048578 63918806 0
BEGIN 19397 0 5 1048577 64014692 0
BEGIN 19397 0 6 1048577 64250369 0
BEGIN 19397 0 8 1048577 1918310004 0
END 19397 0 9 1048577 1918521098 0
BEGIN 19397 0 10 1048577 2024164744 0
BEGIN 19397 0 11 1048577 2024358650 0
Rename CUDA kernels with NVTX¶
The example demonstrates how to map innermost NVTX push-pop range to a matching CUDA kernel run.
ALTER TABLE CUPTI_ACTIVITY_KIND_KERNEL ADD COLUMN nvtxRange TEXT;
CREATE INDEX nvtx_start ON NVTX_EVENTS (start);
UPDATE CUPTI_ACTIVITY_KIND_KERNEL SET nvtxRange = (
SELECT NVTX_EVENTS.text
FROM NVTX_EVENTS JOIN CUPTI_ACTIVITY_KIND_RUNTIME ON
NVTX_EVENTS.eventType == 59 AND
NVTX_EVENTS.globalTid == CUPTI_ACTIVITY_KIND_RUNTIME.globalTid AND
NVTX_EVENTS.start <= CUPTI_ACTIVITY_KIND_RUNTIME.start AND
NVTX_EVENTS.end >= CUPTI_ACTIVITY_KIND_RUNTIME.end
WHERE
CUPTI_ACTIVITY_KIND_KERNEL.correlationId == CUPTI_ACTIVITY_KIND_RUNTIME.correlationId
ORDER BY NVTX_EVENTS.start DESC LIMIT 1
);
SELECT start, end, globalPid, StringIds.value as shortName, nvtxRange
FROM CUPTI_ACTIVITY_KIND_KERNEL JOIN StringIds ON shortName == id
ORDER BY start LIMIT 6;
Results:
start end globalPid shortName nvtxRange
---------- ---------- ----------------- ------------- ----------
526545376 526676256 72057700439031808 MatrixMulCUDA
526899648 527030368 72057700439031808 MatrixMulCUDA Add
527031648 527162272 72057700439031808 MatrixMulCUDA Add
527163584 527294176 72057700439031808 MatrixMulCUDA My Kernel
527296160 527426592 72057700439031808 MatrixMulCUDA My Range
527428096 527558656 72057700439031808 MatrixMulCUDA